# **Hybrid Search**
**BM25** is a ranking function used in information retrieval. Like an efficient librarian, it scores how well each document in a large collection matches the keywords of a query. Its effectiveness rests on:

- **Term frequency**: how often the search terms appear in each document.
- **Document length normalization**: scores are adjusted for length, so both short and long documents get a fair chance in the search results.
- **Bias-free information retrieval**: well suited to large datasets where unbiased results are critical.

**About LanceDB (VectorDB)** - LanceDB extends our search capabilities beyond keyword matching. As a vector database, it stores embeddings and retrieves by vector similarity, which adds a layer of contextual understanding: results follow the semantics of the query and align with its intended meaning.
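The term-frequency and length-normalization ideas behind BM25 can be sketched in plain Python. This is a minimal illustration of the common Okapi BM25 formula with a smoothed IDF, not the exact implementation inside `rank_bm25`; the function name `bm25_scores`, the whitespace tokenizer, the sample documents, and the defaults `k1=1.5`, `b=0.75` are assumptions made for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    scores = []
    for toks in tokenized:
        term_counts = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue
            # Smoothed inverse document frequency (stays positive).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer documents are penalized for the same tf.
            norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + norm)
        scores.append(score)
    return scores


docs = [
    "iron rich foods",
    "iron is mentioned once in this much longer passage about food",
    "calcium builds strong bones",
]
scores = bm25_scores("iron", docs)
print(scores)
```

Both of the first two documents contain "iron" exactly once, so the shorter one scores higher; the third contains no query term and scores zero.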

**Hybrid Search Approach** - Our hybrid search system combines BM25's keyword precision with LanceDB's semantic understanding. Together they return results that match both the exact terms and the intended meaning of a query, which suits complex and varied datasets.

## Installing all the dependencies

In [10]:
!pip -q install langchain huggingface_hub  langchain_openai lancedb openai  tiktoken  rank_bm25 pypdf

In [2]:
# pass openai api key
import os

os.environ["OPENAI_API_KEY"] = "sk-"

### Open-Source Models
For an example built on an open-source model, see https://github.com/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB

You can also compare the results of a single retriever with those of the ensemble retriever.

## Hybrid Search

**BM25 Retriever** - Sparse retriever

**Embeddings** - Dense retriever (LanceDB)

`Hybrid search = Sparse + Dense retriever`
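The dense half of this equation ranks documents by how close their embedding vectors are to the query vector. A minimal sketch follows, assuming cosine similarity as the metric (the metric a vector store actually uses depends on its configuration). The three-dimensional vectors are hypothetical stand-ins for real embeddings, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, hand-written for illustration only.
doc_vectors = {
    "iron-rich foods for pregnancy": [0.9, 0.1, 0.0],
    "calcium for bones and teeth": [0.1, 0.9, 0.1],
    "safe food storage": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for an embedded query

# Rank documents from most to least similar to the query.
ranked = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranked)
```

Unlike BM25, this ranking does not need the query and the document to share any words, only nearby vectors.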

## Load the data

In [3]:
# download the pdf
!wget https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf

--2024-01-23 21:19:49--  https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf
Resolving pdf.usaid.gov (pdf.usaid.gov)... 23.1.203.79, 2600:1408:7:1b8::1923, 2600:1408:7:1b4::1923
Connecting to pdf.usaid.gov (pdf.usaid.gov)|23.1.203.79|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6419525 (6.1M) [application/pdf]
Saving to: ‘PA00TBCT.pdf’


2024-01-23 21:19:49 (26.3 MB/s) - ‘PA00TBCT.pdf’ saved [6419525/6419525]



In [6]:
# load single pdf
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/PA00TBCT.pdf")
pages = loader.load_and_split()

## Importing all the libraries

In [11]:
from langchain.vectorstores import LanceDB
import lancedb
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.schema import Document
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA

## Initialize Embeddings

In [12]:
# Initialize embeddings
embedding = OpenAIEmbeddings()



## Initialize the BM25

In [13]:
# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(pages)
bm25_retriever.k = 2  # Retrieve top 2 results

print("type of bm25", type(bm25_retriever))

type of bm25 <class 'langchain_community.retrievers.bm25.BM25Retriever'>


## Initialize the database

In [14]:
db = lancedb.connect("/tmp/lancedb")
table = db.create_table(
    "pandas_docs",
    data=[
        {
            "vector": embedding.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)
# docsearch = LanceDB.from_texts(doc_list, embedding, connection=table)
# retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

## Instantiate the retriever

In [15]:
# Initialize LanceDB retriever
docsearch = LanceDB.from_documents(pages, embedding, connection=table)
retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever_lancedb], weights=[0.2, 0.8]
)
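To give an intuition for what the weights do: to my understanding, `EnsembleRetriever` merges the ranked lists from its retrievers with a weighted form of Reciprocal Rank Fusion. The sketch below illustrates that technique in plain Python and is not the library's code; the function name `weighted_rrf`, the document ids, and the constant `c=60` (a value commonly used for RRF) are assumptions for the example.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of ids into one (hypothetical helper).

    Each id earns weight / (rank + c) from every list it appears in,
    with rank starting at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical top-2 results from each retriever.
bm25_ranking = ["doc_a", "doc_b"]
dense_ranking = ["doc_c", "doc_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.2, 0.8])
print(fused)
```

With weights of 0.2 and 0.8, a document found by both retrievers (`doc_a`) comes first, and the dense retriever's top hit outranks a document that only BM25 returned.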

## Query

In [16]:
# Example query
query = "what nutrition needed for pregnant women ?"


# Retrieve the relevant documents with the ensemble retriever
docs = ensemble_retriever.get_relevant_documents(query)

# Extract and print only the page content from each document
# for doc in docs:
#     print(doc.page_content)

docs

[Document(page_content='Food and Nutrition Handbook for Extension Workers34guidelines for selecting energy-giving foods, body-building foods \nand protective foods. Pregnant women especially need foods rich in \niron and vitamin A in addition to the balanced diet. Iron needs are \nhighly increased partly due to the need to build reserves for child \nup to six months after birth before initiating complementary food \nintake.\n• Pregnant women need to take foods rich in calcium, e.g., milk and mukene (silver fish) partly to take care of the increased requirement \nfor building the foetus skeletal structure.\n• Pregnant women have higher needs for nutrients generally and \nshould take snacks in between meals.\nIn addition, pregnant women should be educated to strictly observe the \nfollowing:\n1. Take the required amounts of iron and folic acid supplements to \nprevent anaemia.\n2. Sleep under an insecticide-treated mosquito net.\n3. Visit the nearest health facility at least four (4) tim

## Ask questions on this retriever doc

In [18]:
llm = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"])

qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=ensemble_retriever
)


query = "what nutrition is needed for pregnant women  "
qa.run(query)

'Pregnant women need a balanced diet that includes a variety of foods from all food groups. They should consume foods rich in iron, calcium, and vitamin A. It is also important for pregnant women to take iron and folate supplements to prevent anemia. Additionally, pregnant women should have regular snacks in between meals to meet their increased nutrient needs. They should avoid excessive workloads, limit intake of alcohol and cigarettes, and strictly follow the advice of healthcare professionals regarding medication. Pregnant women should also practice good hygiene and avoid negative cultural practices that may impact their health.'

In [19]:
query = "which food needed for building strong bones and teeth ? which Vitamin & minerals importat for this? "
qa.run(query)

"For building strong bones and teeth, the following foods are important:\n\n1. Calcium: Dairy products such as milk, cheese, and yogurt are excellent sources of calcium. Other sources include fish with edible bones (such as canned salmon or sardines), tofu, almonds, and leafy green vegetables like kale and broccoli.\n\n2. Vitamin D: Vitamin D helps the body absorb calcium and is crucial for bone health. Food sources of vitamin D include fatty fish like salmon and mackerel, egg yolks, and fortified foods like milk and orange juice. The body can also produce vitamin D when the skin is exposed to sunlight.\n\n3. Phosphorus: Phosphorus works together with calcium to build strong bones and teeth. Good sources of phosphorus include dairy products, meat, fish, poultry, whole grains, and nuts.\n\n4. Vitamin K: Vitamin K plays a role in bone metabolism and helps with the formation of osteocalcin, a protein that is necessary for bone strength. Good sources of vitamin K include leafy green vegeta

## Bonus
### Full-text search (FTS) is another useful feature: it returns every record in which at least one query word matches

**Use case**: E-Commerce Product Search

**Context**: Customers searching for products on an e-commerce website.

**Application**: When a customer types a query (like "fitness t-shirt"), the system uses the ensemble retriever to find the most relevant products from the product descriptions. The BM25 component helps capture keyword-based matches, while the dense vector retriever (LanceDB) understands the semantic context of the query.

In [20]:
!pip install tantivy==0.20.1



Read more about FTS: https://lancedb.github.io/lancedb/fts/

In [21]:
# Example of FTS: use it when you want to search the whole text
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table(
    "my_tableasd",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
)


table.create_fts_index("text")


x = table.search("puppy").limit(10).select(["text"]).to_list()


texts = [item["text"] for item in x]
print(texts)

['Frodo was a happy puppy']
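Conceptually, a full-text index maps each word to the records that contain it, so a query can return every record that shares at least one word with it. The toy inverted index below sketches that idea only; it is not how LanceDB or Tantivy is implemented (real engines also handle tokenization, stemming, and relevance scoring), and the helper names `build_inverted_index` and `search_any` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each lowercase token to the set of row ids that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(texts):
        for token in text.lower().split():
            index[token].add(row_id)
    return index


def search_any(index, texts, query):
    """Return every text that contains at least one word of the query."""
    matches = set()
    for token in query.lower().split():
        matches |= index.get(token, set())
    return [texts[i] for i in sorted(matches)]


texts = ["Frodo was a happy puppy", "There are several kittens playing"]
index = build_inverted_index(texts)

single = search_any(index, texts, "puppy")
either = search_any(index, texts, "happy kittens")
print(single)
print(either)
```

The first query matches only the puppy sentence, while the second matches both rows because each row contains one of the two query words.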
