## Combining BM25 and Semantic Search

In this notebook we compare the context relevance of a standard retriever (semantic search) against hybrid search, which combines keyword-based search with semantic search. We use UpTrain to compute the relevance scores.

1. Semantic Search: This is a search technique that uses embeddings to understand the context and meaning of the query and documents. It goes beyond simple keyword matching by capturing the semantic meaning of the text.
2. Hybrid Search: This combines both keyword-based search (like BM25) and semantic search. The idea is to leverage the strengths of both methods to improve the relevance of search results.
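
To make the keyword side concrete, here is a minimal, self-contained sketch of BM25 scoring on a toy corpus. It uses a common textbook form of the formula. The whitespace tokenizer, the `k1` and `b` defaults, the `bm25_scores` helper, and the toy corpus are all assumptions made for illustration; `BM25Retriever` may tokenize and weight terms differently.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: the number of documents each term appears in
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed IDF (the "+ 1" inside the log keeps it non-negative)
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: long documents are penalized via b
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy corpus, not taken from the scraped article
corpus = [
    "linear regression fits a line to data",
    "gradient descent minimizes a loss function",
    "the weather is sunny today",
]
scores = bm25_scores("linear regression", corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Only the first document shares terms with the query, so it is the only one with a non-zero score. A paraphrased query with no words in common with that document would score zero for it, which is the gap semantic search is meant to fill.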

### Importing the dependencies

In [4]:
# Importing OpenAIEmbeddings for creating embeddings
from langchain.embeddings import OpenAIEmbeddings  

# Importing FAISS for creating a vector store
from langchain.vectorstores import FAISS  

# Importing BM25Retriever and EnsembleRetriever for retrieval tasks
from langchain.retrievers import BM25Retriever, EnsembleRetriever  

# Importing ChatOpenAI for chat models
from langchain.chat_models import ChatOpenAI  

# Importing RecursiveCharacterTextSplitter for splitting text into chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter  

# Importing WebBaseLoader for loading documents from the web
from langchain.document_loaders import WebBaseLoader  

# Importing StrOutputParser for parsing output
from langchain_core.output_parsers import StrOutputParser  

# Importing RunnablePassthrough to pass the input through the chain unchanged
from langchain_core.runnables import RunnablePassthrough  

# Importing hub for accessing LangChain hub
from langchain import hub  



### Setting up API key for the LLM

In [2]:
import os
os.environ['OPENAI_API_KEY'] = ""

### Scraping and Loading the data

In [7]:
def load_and_chunk_data(url):
    """
    Load data from a given URL and split it into chunks.

    Args:
        url (str): The URL of the document to load.

    Returns:
        list: A list of Document chunks.
    """
    # Loading data from the url
    loader = WebBaseLoader(url)
    docs = loader.load()

    # Splitting the data into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    texts = text_splitter.split_documents(docs)
    
    return texts

# Example usage
url = "https://medium.com/@singhjoban495/linear-regression-in-detailzs-c0e27c4f24aa"
texts = load_and_chunk_data(url)

print("Total number of chunks:", len(texts))

Total number of chunks: 6
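
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified sliding-window splitter. This is only a sketch of the overlap idea: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this hypothetical `chunk_with_overlap` helper does not attempt.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

# Small numbers so the overlap is easy to see
sample = "abcdefghijklmnopqrstuvwxy"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.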


### Creation of FAISS vector store (Retriever)

A vector store is a specialized database designed to store and manage high-dimensional vectors, which are often used in machine learning and information retrieval tasks. These vectors typically represent data points in a continuous vector space, such as word embeddings, image features, or other types of embeddings generated by neural networks. Vector stores are optimized for efficient similarity search, allowing for quick retrieval of vectors that are close to a given query vector based on a chosen distance metric (e.g., Euclidean distance, cosine similarity).

1. FAISS (Facebook AI Similarity Search): Developed by Facebook AI, FAISS is a library for efficient similarity search and clustering of dense vectors. It provides various indexing methods and supports both CPU and GPU for fast computation.
2. Pinecone: A managed vector database service that provides scalable and efficient similarity search. It abstracts away the complexity of managing vector data and offers easy-to-use APIs.
3. Weaviate: An open-source vector search engine that supports various data types and integrates with machine learning models. It provides a flexible schema and powerful search capabilities.
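
Conceptually, the similarity search a vector store performs can be sketched as a brute-force scan that ranks stored vectors by their similarity to the query vector. The three-dimensional `toy_vectors` and the `top_k` helper below are hypothetical; real embeddings have hundreds or thousands of dimensions, and libraries such as FAISS use index structures to avoid scanning every vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=2):
    """Return the names of the k stored vectors most similar to query_vec."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings, invented for illustration
toy_vectors = {
    "regression": [0.9, 0.1, 0.0],
    "classification": [0.7, 0.3, 0.1],
    "cooking": [0.0, 0.2, 0.9],
}
nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```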

In [47]:
# Create vectorstore using FAISS
embeddings = OpenAIEmbeddings()
faiss_vectorstore = FAISS.from_documents(texts, embeddings)

In [55]:
# Create a sparse, keyword-based retriever with BM25 (Best Match 25)
bm25 = BM25Retriever.from_documents(texts)
bm25.k = 2
# Create a dense, semantic retriever from the FAISS vector store
faiss_retriever2 = faiss_vectorstore.as_retriever(search_kwargs={'k': 3})
# Combine keyword-based and semantic retrieval with equal weights
ensemble_retriever = EnsembleRetriever(retrievers=[bm25, faiss_retriever2], weights=[0.5, 0.5])
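
One common way to merge ranked lists from several retrievers is weighted Reciprocal Rank Fusion (RRF), which is the approach LangChain describes for `EnsembleRetriever`. The sketch below is a simplified illustration of the idea rather than the library's code: the `weighted_rrf` helper, the constant `c=60`, and the chunk ids are assumptions.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document ids into one list, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents (smaller rank) contribute more
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: 2 keyword hits and 3 semantic hits
keyword_hits = ["chunk_2", "chunk_0"]
semantic_hits = ["chunk_0", "chunk_3", "chunk_2"]
fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.5, 0.5])
print(fused)
```

Documents returned by both retrievers accumulate score from each list, so they tend to rise above documents found by only one of them. Changing the weights shifts the balance between keyword and semantic evidence.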

##### Retrieving with the hybrid (ensemble) retriever

In [56]:
query = "What is Linear regression ?"

In [57]:
%%time
docs = ensemble_retriever.invoke(query)

CPU times: user 27.6 ms, sys: 4.58 ms, total: 32.1 ms
Wall time: 523 ms


In [59]:
docs

[Document(page_content='between the independent variables (x-axis) and dependent variable (y-axis).Types of Linear RegressionSimple Linear Regression: If there is only a single independent variable. Ex: In our dataset, cgpa (independent variable) and we have to predict package(dependent variable).Multiple Linear Regression: If there is more one independent variable is present in the dataset, then we can apply multiple linear regression. eg: cgpa,studytime are two independent variables or even more can exists.Regression is defined as the line or curve that passes through all the data points on the target-predictor graph with the shortest vertical distance between the data points and regression line.Completely linear dataThis is completely linear data, but in real world data,it si difficult to find completely linear data,we have a sort of linear data . You can see the image below.Y=mx+b is the equation of the line.Y-Dependent variablem=slopex=independent variable(data point)b=interceptMa

### Chaining everything together

In [62]:
# Pull the prompt from the hub
prompt = hub.pull("rlm/rag-prompt")

# Initialize the language model
llm = ChatOpenAI()

def format_docs(docs):
    """
    Format the retrieved documents into a single string with each document's content separated by two newlines.

    Args:
        docs (list): List of Document objects.

    Returns:
        str: Formatted string containing the content of all documents.
    """
    return "\n\n".join(doc.page_content for doc in docs)

# Create the RAG chain by combining the ensemble retriever, prompt, language model, and output parser
rag_chain = (
    {"context": ensemble_retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
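
The data flow through the chain can be traced with plain Python functions. This is only a conceptual sketch: `fake_retriever`, `fake_prompt`, `fake_llm`, and `join_chunks` are hypothetical stand-ins that make no API calls, and the prompt wording is invented rather than taken from `rlm/rag-prompt`.

```python
def join_chunks(chunks):
    """Join retrieved chunk texts with blank lines, as format_docs does."""
    return "\n\n".join(chunks)

def fake_retriever(question):
    # Stand-in for the ensemble retriever: returns canned chunk texts
    return ["Linear regression fits a line.", "It predicts continuous values."]

def fake_prompt(inputs):
    # Stand-in for the hub prompt: fills context and question into a template
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"

def fake_llm(prompt_text):
    # Stand-in for the chat model: reports the prompt size instead of answering
    return f"(model answer based on {len(prompt_text)} characters of prompt)"

def run_chain(question):
    # The same question feeds both keys: once through retrieval, once unchanged
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return fake_llm(fake_prompt(inputs))

question = "What is linear regression?"
prompt_text = fake_prompt(
    {"context": join_chunks(fake_retriever(question)), "question": question}
)
answer = run_chain(question)
print(prompt_text)
print(answer)
```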



In [63]:
rag_chain.invoke(query)

'Linear regression is a supervised machine learning algorithm used for predicting continuous values by showing the linear relationship between independent and dependent variables. There are two types: simple linear regression with one independent variable and multiple linear regression with more than one independent variable. The main goal of linear regression is to find the best-fit line that minimizes the distance between predicted and actual values.'