# **Hybrid Search**
**BM25** is a ranking function widely used in information retrieval. Like an efficient librarian, it scores every document against the query terms, which makes it well suited to large document collections. Its effectiveness rests on:

- **Term frequency:** how often the search terms appear in each document.
- **Document length normalization:** short and long documents get a fair chance in the search results.
- **Unbiased retrieval:** well suited to large datasets where results must not be skewed by document length or by heavily repeated terms.

**About LanceDB (VectorDB)** - LanceDB extends our search capabilities beyond keyword matching. It adds a layer of contextual understanding by interpreting the semantics of a search query, so the results align with the intended meaning.
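
To make the BM25 factors above concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration only, not the code used by `rank_bm25` or LangChain: the function name `bm25_scores`, the whitespace tokenizer, the toy documents, and the parameter values `k1=1.5` and `b=0.75` are all assumptions.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a simple Okapi BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF: rarer terms contribute more.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # k1 saturates term frequency; b controls document length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Toy documents (hypothetical, not taken from the PDF used later).
docs = [
    "pregnant women need iron and folate",
    "calcium builds strong bones and teeth",
    "breastfeeding mothers need extra fluids",
]
scores = bm25_scores("iron for pregnant women", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Only the first document shares any term with the query, so it is the only one with a non-zero score.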

**Hybrid Search Approach** - Our hybrid search system combines BM25's keyword precision with LanceDB's semantic understanding. Together they return results that match both the exact terms and the intended meaning of a query, which suits complex and varied datasets.

## Installing all the dependencies

In [5]:
!pip -q install langchain huggingface_hub langchain_community langchain_openai lancedb openai tiktoken rank_bm25 pypdf

In [2]:
# Set your OpenAI API key (replace the placeholder with your own key)
import os

os.environ["OPENAI_API_KEY"] = "sk-proj-...."

### Open-Source Models
To use open-source models instead of OpenAI, see: https://github.com/lancedb/vectordb-recipes/blob/main/tutorials/chatbot_using_Llama2_&_lanceDB

You can also compare the results of a single retriever with those of the ensemble retriever.

## Hybrid Search

**BM25 Retriever** - Sparse retriever

**Embeddings** - Dense retriever (LanceDB)

`Hybrid search = Sparse + Dense retriever`
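
As a rough illustration of the dense side, the sketch below ranks a few texts by cosine similarity between vectors. The three-dimensional "embeddings" are made-up toy values, and cosine similarity is used only for illustration; real embeddings come from an embedding model, and the distance metric LanceDB applies depends on how the table and search are configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; a real model produces hundreds of dimensions.
toy_vectors = {
    "iron-rich foods for pregnancy": [0.9, 0.1, 0.0],
    "calcium for bones and teeth": [0.1, 0.9, 0.1],
    "fluids for breastfeeding": [0.2, 0.1, 0.9],
}
query_vector = [0.8, 0.2, 0.1]  # stands in for the embedded query

# Rank texts from most to least similar to the query vector.
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
print(ranked[0])
```

Unlike BM25, this ranking does not depend on shared words at all: a text can rank first purely because its vector points in a similar direction to the query vector.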

## Load the data

In [3]:
# download the pdf
!wget https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf

--2024-11-24 07:35:55--  https://pdf.usaid.gov/pdf_docs/PA00TBCT.pdf
Resolving pdf.usaid.gov (pdf.usaid.gov)... 96.17.46.187, 2600:1408:7:1b8::1923, 2600:1408:7:1b4::1923
Connecting to pdf.usaid.gov (pdf.usaid.gov)|96.17.46.187|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6419525 (6.1M) [application/pdf]
Saving to: ‘PA00TBCT.pdf’


2024-11-24 07:35:55 (52.6 MB/s) - ‘PA00TBCT.pdf’ saved [6419525/6419525]



In [6]:
# Load a single PDF and split it into chunks (one Document per chunk)
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("/content/PA00TBCT.pdf")
pages = loader.load_and_split()

## Importing all the libraries

In [7]:
import lancedb
from langchain.chains import RetrievalQA
from langchain.retrievers import EnsembleRetriever
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import LanceDB
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

## Initialize Embeddings

In [8]:
# Initialize embeddings
embedding = OpenAIEmbeddings()


## Initialize the BM25

In [17]:
# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(pages)
bm25_retriever.k = 2  # Retrieve top 2 results

print("type of bm25", type(bm25_retriever))

type of bm25 <class 'langchain_community.retrievers.bm25.BM25Retriever'>


## Initialize the database

In [None]:
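# Connect to a local LanceDB database (the directory is created on first use)
db = lancedb.connect("/tmp/lancedb")
# Create a small example table with a single embedded row;
# mode="overwrite" replaces any existing table with the same name
table = db.create_table(
    "pandas_docs",
    data=[
        {
            "vector": embedding.embed_query("Hello World"),
            "text": "Hello World",
            "id": "1",
        }
    ],
    mode="overwrite",
)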
# docsearch = LanceDB.from_texts(doc_list, embedding, connection=table)
# retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

## Instantiate the retriever

In [24]:
# Initialize LanceDB retriever
docsearch = LanceDB.from_documents(pages, embedding, connection=db)
retriever_lancedb = docsearch.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever_lancedb], weights=[0.2, 0.8]
)
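
To the best of my understanding, `EnsembleRetriever` merges the two result lists with a weighted form of Reciprocal Rank Fusion (RRF): each retriever adds `weight / (rank + c)` to a document's score, and documents are then sorted by the total. The sketch below illustrates that idea under stated assumptions; it is not LangChain's code. The function name `weighted_rrf`, the constant `c=60`, and the page IDs are hypothetical.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs with weighted reciprocal rank fusion."""
    fused = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # A higher rank (smaller number) and a larger weight both raise the score.
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical top-2 results from each retriever.
bm25_ranking = ["page_46", "page_12"]
dense_ranking = ["page_30", "page_46"]

fused_ranking = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.2, 0.8])
print(fused_ranking)  # ['page_46', 'page_30', 'page_12']
```

A document returned by both retrievers (`page_46` here) accumulates score from each list, so it can outrank a document that only one retriever placed first. With weights of 0.2 and 0.8, the dense retriever's ordering dominates among documents found by a single retriever.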

## Query

In [25]:
# Example query
query = "what nutrition needed for pregnant women ?"


# Retrieve relevant documents (invoke replaces the deprecated get_relevant_documents)
docs = ensemble_retriever.invoke(query)

# Extract and print only the page content from each document
# for doc in docs:
#     print(doc.page_content)

docs

[Document(metadata={'page': 46, 'source': '/content/PA00TBCT.pdf'}, page_content='Food and Nutrition Handbook for Extension Workers\n35\nNutrition\tfor\tbreastfeeding\tmothers\nNutritional requirements during breastfeeding are higher than during \npregnancy because the mother has to produce enough milk to sustain a \nbaby (bigger than the foetus) for the first six months and beyond. Breast-\nfeeding women need to eat a wide variety of foods.\nNutrition guidelines for pregnant women as well apply here but a \nlactating mother needs to eat much more; that is to say one extra meal \n(five meals in total).\nBreastfeeding mothers should also take a lot of fluids to cater for the \nhigh amounts of water used to make breast milk. They should avoid \nself-medication, smoking and alcohol to prevent intoxicating the baby.\nBreastfeeding mothers should avoid stress and have enough rest.\nKEY MESSAGES \n• Ensure that a pregnant mother has a balanced diet, with a vari-\nety of foods from the food g

## Ask questions over the retrieved documents

In [26]:
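llm = ChatOpenAI(openai_api_key=os.environ["OPENAI_API_KEY"])

# Build a QA chain that "stuffs" all retrieved chunks into a single prompt
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=ensemble_retriever
)


query = "what nutrition is needed for pregnant women  "
# invoke replaces the deprecated run; the answer text is under the "result" key
qa.invoke({"query": query})["result"]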


'Pregnant women need to consume a balanced diet with a variety of foods from the main food groups daily. They should include foods rich in iron, calcium, and vitamin A. Additionally, pregnant women should take iron and folate tablets daily, get adequate rest, avoid stress, and have regular antenatal care visits.'

In [27]:
query = "which food needed for building strong bones and teeth ? which Vitamin & minerals importat for this? "
qa.invoke({"query": query})["result"]

'Foods that are needed for building strong bones and teeth include sources of calcium, magnesium, vitamin D, and fluoride. Calcium and vitamin D are essential for bone health, while magnesium plays a role in bone structure. Fluoride is important for tooth formation and preventing tooth decay. Sources of these nutrients include:\n\n- Calcium: milk and dairy products, fish eaten with bones, dark green vegetables.\n- Magnesium: legumes, whole-grain cereals, nuts, and dark-green vegetables.\n- Vitamin D: sun exposure, Vitamin D-fortified milk, eggs, fatty fish.\n- Fluoride: seafood, tea, coffee, soybeans, iodized salt.\n\nThese nutrients play crucial roles in building and maintaining strong bones and teeth.'

## Bonus
### Full-text search (FTS) is another useful feature: it returns every row in which at least one query word matches

**Use case**: E-Commerce Product Search

**Context**: Customers searching for products on an e-commerce website.

**Application**: When a customer types a query (like "fitness t-shirt"), the system uses the ensemble retriever to find the most relevant products from the product descriptions. The BM25 component captures keyword-based matches, while the dense vector retriever (LanceDB) captures the semantic context of the query.

In [28]:
!pip install tantivy==0.20.1

Collecting tantivy==0.20.1
  Downloading tantivy-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.3 kB)
Downloading tantivy-0.20.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.1 MB)
Installing collected packages: tantivy
Successfully installed tantivy-0.20.1


Read more about FTS: https://lancedb.github.io/lancedb/fts/

In [29]:
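# Example of FTS: search across the full text of a column
import lancedb

uri = "data/sample-lancedb"
db = lancedb.connect(uri)

table = db.create_table(
    "my_tableasd",
    data=[
        {"vector": [3.1, 4.1], "text": "Frodo was a happy puppy"},
        {"vector": [5.9, 26.5], "text": "There are several kittens playing"},
    ],
    mode="overwrite",  # replace the table if it already exists so the cell can be re-run
)

# Build a full-text search index on the "text" column
table.create_fts_index("text")

# With no embedding function configured, a plain string query is treated as a
# full-text query; keep only the "text" column in the results
results = table.search("puppy").limit(10).select(["text"]).to_list()

texts = [item["text"] for item in results]
print(texts)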

['Frodo was a happy puppy']
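
Conceptually, a full-text index maps each token to the rows that contain it, and a query returns every row matching at least one of its tokens. The pure-Python sketch below illustrates only that idea. It is a simplification and not how LanceDB or tantivy implement FTS (real engines also handle tokenization rules and relevance ranking); the helper names `build_inverted_index` and `search_any` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(texts):
    """Map each lowercase token to the set of row IDs that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(texts):
        for token in text.lower().split():
            index[token].add(row_id)
    return index


def search_any(index, query):
    """Return IDs of rows containing at least one query token, in ascending order."""
    hits = set()
    for token in query.lower().split():
        hits |= index.get(token, set())
    return sorted(hits)


texts = ["Frodo was a happy puppy", "There are several kittens playing"]
index = build_inverted_index(texts)

print([texts[i] for i in search_any(index, "puppy")])
print([texts[i] for i in search_any(index, "happy kittens")])
```

The first query matches only the puppy row, while the second matches both rows because each contains one of the two query words.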
