# **Hybrid Search**
**BM25** is a ranking function widely used in information retrieval. Like an efficient librarian, it works well across large collections of documents. Its effectiveness rests on:

- **Term frequency**: how often the search terms appear in each document.
- **Document length normalization**: scores are adjusted so that short and long documents both get a fair chance in the results.
- **Bias-free information retrieval**: suited to large datasets where unbiased results are critical.

**About LanceDB (VectorDB)** - LanceDB extends search beyond keyword matching. It stores embedding vectors and retrieves documents by vector similarity, so results reflect the intended meaning of a query rather than only its exact words.
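
To make the BM25 ingredients above concrete, here is a minimal sketch of one common Okapi BM25 variant in plain Python. The function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions chosen for illustration; the `rank_bm25` package used later in this notebook may differ in details such as the IDF formula.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` (illustrative BM25 variant)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed inverse document frequency; the "+ 1" keeps it non-negative.
            df = doc_freq[term]
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            # Length normalization: longer-than-average documents are penalized.
            norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    "pregnant women need foods rich in iron",
    "calcium builds strong bones and teeth",
    "iron iron iron supplements prevent anaemia in women",
]
scores = bm25_scores("iron women", docs)
print(scores)
```

The second document shares no term with the query, so it scores zero. The third document repeats "iron" and therefore outranks the first, although its slightly greater length dampens the gain.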

**Hybrid Search Approach** - The hybrid search system combines BM25's keyword-focused precision with LanceDB's semantic matching. Together they return results that cover both exact term matches and related meanings, which suits complex and varied datasets.

## Installing all the dependencies

In [None]:
!pip -q install langchain langchain-openai huggingface_hub lancedb openai tiktoken rank_bm25 pypdf

In [None]:
# pass openai api key
import os
os.environ["OPENAI_API_KEY"] = "sk-your api key"

### Open-Source Models
To use an open-source model instead of OpenAI, see https://github.com/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB

You can also compare the results of a single retriever with those of the ensemble retriever.

## Hybrid Search

**BM25 Retriever** - Sparse retriever

**Embeddings** - Dense retriever (LanceDB)

`Hybrid search = Sparse + Dense retriever`
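
As a rough illustration of the dense side, the sketch below ranks documents by cosine similarity between a query vector and document vectors. The three-dimensional vectors, the document labels, and the helper names `cosine_similarity` and `dense_top_k` are made up for this example. Real embeddings have many more dimensions, and the distance metric actually used depends on how the vector store is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dense_top_k(query_vec, doc_vecs, k=2):
    """Return the k document labels whose vectors are most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (hypothetical values, for illustration only).
toy_vectors = {
    "iron-rich foods": [0.9, 0.1, 0.0],
    "calcium for bones": [0.1, 0.9, 0.1],
    "mosquito nets": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.0]

top = dense_top_k(query_vec, toy_vectors, k=2)
print(top)
```

Unlike a sparse retriever, nothing here depends on shared words: a document ranks highly whenever its vector points in a similar direction to the query vector.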

## Load the data

In [None]:
# download the pdf
!wget https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf

--2023-12-08 10:04:09--  https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf
Resolving pdf.usaid.gov (pdf.usaid.gov)... 184.29.214.144, 2600:1406:5e00:180::1923, 2600:1406:5e00:1a4::1923
Connecting to pdf.usaid.gov (pdf.usaid.gov)|184.29.214.144|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6419525 (6.1M) [application/pdf]
Saving to: ‘PA00TBCT.pdf.1’


2023-12-08 10:04:10 (16.9 MB/s) - ‘PA00TBCT.pdf.1’ saved [6419525/6419525]



In [None]:
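# Load a single PDF and split it into page-level documents
from langchain.document_loaders import PyPDFLoader  # needed here because this cell runs before the imports cell

loader = PyPDFLoader("/content/PA00TBCT.pdf")
pages = loader.load_and_split()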

## Importing all the libraries

In [None]:
from langchain.vectorstores import LanceDB
import lancedb
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

## Initialize Embeddings

In [None]:
# Initialize embeddings
embedding = OpenAIEmbeddings()

## Initialize the BM25

In [None]:
# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(pages)
bm25_retriever.k =  2  # Retrieve top 2 results

print("type of bm25",type(bm25_retriever))

## Initialize the database

In [None]:
# Connect to a local LanceDB database
db = lancedb.connect('/tmp/lancedb')
# Create the table with one seed row so that its schema (vector, text, id) is defined
table = db.create_table("pandas_docs", data=[
    {"vector": embedding.embed_query("Hello World"), "text": "Hello World", "id": "1"}
], mode="overwrite")


## Instantiate the retriever

In [None]:
# Initialize LanceDB retriever
docsearch = LanceDB.from_documents(pages, embedding, connection=table)
retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, retriever_lancedb],
                                       weights=[0.2, 0.8])
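
LangChain's documentation describes `EnsembleRetriever` as combining the rankings of its retrievers with Reciprocal Rank Fusion. The following is a minimal sketch of a weighted variant of that idea, not the library's own code. The function name `weighted_rrf`, the constant `c=60`, and the page ids are assumptions for illustration.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids; each list contributes weight / (rank + c)."""
    fused = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical top-2 results from each retriever.
sparse_hits = ["page_12", "page_40"]   # BM25
dense_hits = ["page_40", "page_7"]     # vector search

fused = weighted_rrf([sparse_hits, dense_hits], weights=[0.2, 0.8])
print(fused)
```

With weights of 0.2 and 0.8, a document found by both retrievers rises to the top, and the dense retriever's remaining hit outranks the sparse-only hit.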

## Query

In [None]:
# Example customer query
query = "what nutrition needed for pregnant women ?"


# Retrieve relevant documents/products
docs = ensemble_retriever.get_relevant_documents(query)

# Extract and print only the page content from each document
# for doc in docs:
#     print(doc.page_content)

docs

type of bm25 <class 'langchain.retrievers.bm25.BM25Retriever'>


[Document(page_content='Food and Nutrition Handbook for Extension Workers34guidelines for selecting energy-giving foods, body-building foods \nand protective foods. Pregnant women especially need foods rich in \niron and vitamin A in addition to the balanced diet. Iron needs are \nhighly increased partly due to the need to build reserves for child \nup to six months after birth before initiating complementary food \nintake.\n• Pregnant women need to take foods rich in calcium, e.g., milk and mukene (silver fish) partly to take care of the increased requirement \nfor building the foetus skeletal structure.\n• Pregnant women have higher needs for nutrients generally and \nshould take snacks in between meals.\nIn addition, pregnant women should be educated to strictly observe the \nfollowing:\n1. Take the required amounts of iron and folic acid supplements to \nprevent anaemia.\n2. Sleep under an insecticide-treated mosquito net.\n3. Visit the nearest health facility at least four (4) tim

## Ask questions over the retrieved documents

In [None]:
llm = ChatOpenAI()  # reads the OPENAI_API_KEY environment variable set earlier

qa = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=ensemble_retriever)


query = "what nutrition needed for pregnant women  "
qa.run(query)


"Pregnant women need to consume a balanced diet that includes a variety of foods from the main food groups. They should also have one additional meal in addition to the three meals they receive daily. Some specific nutrients that are important for pregnant women include:\n\n1. Iron: Pregnant women should consume foods rich in iron, such as lean meats, poultry, fish, beans, and leafy green vegetables. They may also need to take iron supplements as prescribed by a healthcare professional.\n\n2. Folate: Pregnant women should take folate tablets daily in addition to consuming foods rich in folate, such as dark leafy greens, citrus fruits, beans, and fortified grains.\n\n3. Calcium: Foods rich in calcium, such as milk, yogurt, cheese, and fortified plant-based milk alternatives, are important for the development of the baby's skeletal structure.\n\n4. Vitamin A: Pregnant women should consume foods rich in vitamin A, such as carrots, sweet potatoes, spinach, and mangoes. They may also need t

In [None]:
query = "which food needed for building strong bones and teeth ? which Vitamin & minerals important for this? "
qa.run(query)


"The foods that are important for building strong bones and teeth are:\n\n1. Calcium: Milk and milk products, fish with bones (such as sardines), dark green vegetables (such as broccoli and kale).\n\n2. Vitamin D: Sun exposure is the primary source, but it can also be found in vitamin D-fortified milk, eggs, fish-liver oils, and fatty fish like herring, mackerel, and salmon.\n\n3. Vitamin K: Vegetables such as spinach, lettuce, cauliflower, and cabbage, broccoli, fish, liver, meat, and eggs.\n\n4. Magnesium: Legumes, whole-grain cereals, nuts, and dark-green vegetables, as well as meat, seafood, and dairy products.\n\n5. Fluoride: Seafood, tea, coffee, soybeans, and sodium fluoride (which is often added to the water supply).\n\n6. Iron: Red meat, organ meats, whole-wheat products, shellfish, nuts, and dried fruit.\n\nIt's important to note that these nutrients work together and are best obtained through a balanced diet."

## Bonus
### Full-text search (FTS) is another useful feature: it retrieves every record in which at least one query word matches

**Use case**: E-Commerce Product Search

**Context**: Customers searching for products on an e-commerce website.

**Application**: When a customer types a query (like "fitness t-shirt"), the system uses the ensemble retriever to find the most relevant products from the product descriptions. The BM25 component captures keyword-based matches, while the dense vector retriever (LanceDB) captures the semantic context of the query.

In [None]:
!pip install tantivy==0.20.1

Read more about FTS: https://lancedb.github.io/lancedb/fts/
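
The "any one word matches" behaviour can be pictured with a tiny inverted index. This is only a conceptual sketch in plain Python, not how LanceDB or tantivy implement full-text search. The helper names `build_inverted_index` and `search_any` and the whitespace tokenizer are assumptions.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each lowercase token to the set of row indices that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return index


def search_any(index, rows, query):
    """Return the rows that contain at least one of the query words."""
    matches = set()
    for token in query.lower().split():
        matches |= index.get(token, set())
    return [rows[i] for i in sorted(matches)]


rows = ["Frodo was a happy puppy", "There are several kittens playing"]
index = build_inverted_index(rows)

hits = search_any(index, rows, "sleepy puppy")
print(hits)
```

Only one of the two query words occurs in the data, yet the row containing it is still returned, which is the matching behaviour described above.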

In [None]:
# Example of FTS: use it when you want to search the full text of a column
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table(
    "my_tableasd",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)

# Build a full-text index on the "text" column (uses the tantivy package installed above)
table.create_fts_index("text")

# Search the index with a plain string and keep only the "text" column
results = table.search("puppy").limit(10).select(["text"]).to_list()

texts = [item["text"] for item in results]
print(texts)