## Combining BM25 and Semantic Search

#### In this notebook we will compare the context relevance between normal retriever (Semantic search) and Hybrid Search (Which is the combination of Keyword based search and semantic search) . We will use Uptrain for finding relevance score .

In [1]:
from langchain.embeddings import  OpenAIEmbeddings  
from langchain.vectorstores import FAISS 
from langchain.retrievers import BM25Retriever ,EnsembleRetriever
from langchain.chat_models import ChatOpenAI 
from langchain.text_splitter import RecursiveCharacterTextSplitter  
from langchain.document_loaders import WebBaseLoader
from langchain.chains import RetrievalQA

In [2]:
import  os
os.environ['OPENAI_API_KEY'] = ""

In [3]:
# Load data 
url = "https://medium.com/@singhjoban495/linear-regression-in-details-c0e27c4f24aa"
loader  = WebBaseLoader(url)
docs = loader.load()

In [4]:
# Split data into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000 ,chunk_overlap = 100)
texts = text_splitter.split_documents(docs)

In [5]:
len(texts)

6

In [6]:
# Create vectorstore using FAISS
embeddings = OpenAIEmbeddings() 
faiss_vectorstore = FAISS.from_documents(texts , embeddings)

  warn_deprecated(


In [14]:
# Retriever FAISS
faiss_retriever  = faiss_vectorstore.as_retriever(search_kwargs={'k':3})

In [15]:
# create sparse embedding , create a comibation of keyword based and semantcic search based retriever using EnsembleRetriever     Best Match25
bm25 =  BM25Retriever.from_documents(texts)
bm25.k=2
faiss_retriever2 = faiss_vectorstore.as_retriever(search_kwargs={'k':2})
ensemble_retriever = EnsembleRetriever(retrievers=[bm25 , faiss_retriever2] , weights = [0.5,0.5])

##### Define a function to calculate context relevance using Uptrain

In [11]:
from uptrain import EvalLLM, Evals

def relevance_score(query , context):
    data = [{
        "question": query,
        "context": context
    }]
    
    OPENAI_API_KEY = ""
    
    eval_llm = EvalLLM(openai_api_key=OPENAI_API_KEY)
    
    res = eval_llm.evaluate(
        data = data,
        checks = [Evals.CONTEXT_RELEVANCE]
    )
    return res




#####  Retrieving and evaluating Semantic search based retriever

In [16]:
query  = "What is Linear regression ?"

In [33]:
%%time
ocs = faiss_retriever.invoke(query)
context = ""
for i in docs:
    context+=i.page_content

CPU times: user 29.8 ms, sys: 171 µs, total: 29.9 ms
Wall time: 5.52 s


In [19]:
# Evaluation
score = relevance_score(query , context)

100%|█████████████████████████████████████████████| 1/1 [00:19<00:00, 19.01s/it]
  with ThreadPoolExecutor(max_workers=1) as executor:
[32m2024-06-26 12:44:43.560[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m376[0m - [1mLocal server not running, start the server to log data and visualize in the dashboard![0m


In [37]:
score[0]['score_context_relevance']

0.5

#####  Retrieving and evaluating Hybrid based retriever /EnsembleRetriever

In [34]:
%%time
docs2 = ensemble_retriever.invoke(query)
context2 = ""
for i in docs2:
    context2+=i.page_content

CPU times: user 15.1 ms, sys: 0 ns, total: 15.1 ms
Wall time: 425 ms


In [26]:
# Evaluation
score2 = relevance_score(query , context2)

100%|█████████████████████████████████████████████| 1/1 [00:17<00:00, 17.01s/it]
  with ThreadPoolExecutor(max_workers=1) as executor:
[32m2024-06-26 12:47:39.149[0m | [1mINFO    [0m | [36muptrain.framework.evalllm[0m:[36mevaluate[0m:[36m376[0m - [1mLocal server not running, start the server to log data and visualize in the dashboard![0m


In [35]:
score2[0]['score_context_relevance']

1.0

In [36]:
score2

[{'question': 'What is Linear regression ?',
  'context': 'between the independent variables (x-axis) and dependent variable (y-axis).Types of Linear RegressionSimple Linear Regression: If there is only a single independent variable. Ex: In our dataset, cgpa (independent variable) and we have to predict package(dependent variable).Multiple Linear Regression: If there is more one independent variable is present in the dataset, then we can apply multiple linear regression. eg: cgpa,studytime are two independent variables or even more can exists.Regression is defined as the line or curve that passes through all the data points on the target-predictor graph with the shortest vertical distance between the data points and regression line.Completely linear dataThis is completely linear data, but in real world data,it si difficult to find completely linear data,we have a sort of linear data . You can see the image below.Y=mx+b is the equation of the line.Y-Dependent variablem=slopex=independen