1. Data Processing:

    - Load and preprocess documents using tools like LangChain's document loaders.

    - Generate embeddings and store them in your chosen vector database.

2. Indexing for Keyword Search:

    - Index the same documents in a keyword-based search engine like Elasticsearch.

3. Develop Retrieval Logic:

    - Implement functions to perform both semantic and keyword searches.

    - Merge and rank the results, optionally with a reranking model such as Cohere's reranker.

4. Integrate with LLM:

    - Use the combined context to prompt the LLM and generate responses.

5. Build the User Interface:

    - Develop a user-friendly interface for query input and response display.

6. Testing and Evaluation:

    - Test the system with various queries to evaluate performance.

    - Monitor metrics like response accuracy, latency, and user satisfaction.
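
The "merge and rank" part of step 3 can be sketched without any search backend. The example below is a minimal illustration, not the method any particular library is guaranteed to use: it fuses two ranked lists with weighted Reciprocal Rank Fusion (RRF), one common way to combine keyword and semantic results. The function name `weighted_rrf`, the document IDs, and the constant `c=60` are assumptions chosen for the example.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse ranked lists of document IDs with weighted Reciprocal Rank Fusion.

    Each document scores weight / (c + rank) in every list it appears in;
    the scores are summed and documents are returned best first.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("Provide exactly one weight per ranked list.")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword search and a semantic search.
keyword_hits = ["doc_b", "doc_a", "doc_d"]
semantic_hits = ["doc_a", "doc_c", "doc_b"]

fused = weighted_rrf([keyword_hits, semantic_hits], weights=[0.3, 0.7])
print(fused)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Documents found by both searches rise to the top, and the weights control how much each search contributes. A dedicated reranking model could then reorder the fused list.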

In [6]:
#  Step 1: Import Required Libraries
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.vectorstores import FAISS
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
import os

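#  Step 2: Read Your Google API Key from the Environment
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
if not GOOGLE_API_KEY:
    raise RuntimeError("Set the GOOGLE_API_KEY environment variable before running this cell.")

#  Step 3: Load and Split the PDF Document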
pdf_loader = PyPDFLoader("Regression.pdf")
pdf_pages = pdf_loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(pdf_pages)

#  Step 4: Extract Text Content from Documents
texts = [doc.page_content for doc in split_documents]

#  Step 5: Initialize Google Generative AI Embeddings
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/embedding-001",
    google_api_key=GOOGLE_API_KEY
)

#  Step 6: Create and Save FAISS Vector Store
vectorstore = FAISS.from_texts(texts, embedding=embeddings)
vectorstore.save_local("faiss_index")

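#  Step 7: Load the FAISS Vector Store
#  allow_dangerous_deserialization=True lets FAISS unpickle the saved index;
#  enable it only for index files you created yourself, as is the case here.
vectorstore = FAISS.load_local("faiss_index", embeddings, allow_dangerous_deserialization=True)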

#  Step 8: Set Up Dense Retriever from FAISS
dense_retriever = vectorstore.as_retriever()

#  Step 9: Set Up BM25 Retriever (Sparse)
bm25_retriever = BM25Retriever.from_documents(split_documents)
bm25_retriever.k = 5  # Number of top documents to retrieve

#  Step 10: Combine Dense and Sparse Retrievers into a Hybrid Retriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, dense_retriever],
    weights=[0.3, 0.7]
)

#  Step 11: Initialize the Gemini Language Model
llm = ChatGoogleGenerativeAI(
    model="gemini-2.0-flash",
    temperature=0.7,
    google_api_key=GOOGLE_API_KEY
)

#  Step 12: Create a Prompt Template for the QA System
prompt = ChatPromptTemplate.from_template(
    "Use the following context to answer the question:\n\n{context}\n\nQuestion: {input}"
)

#  Step 13: Create a Chain to Combine Retrieved Documents
combine_docs_chain = create_stuff_documents_chain(llm, prompt)

#  Step 14: Create the Retrieval Chain Using the Hybrid Retriever and Combine Docs Chain
retrieval_chain = create_retrieval_chain(hybrid_retriever, combine_docs_chain)

#  Step 15: Ask a Question and Retrieve the Answer
question = "What is the linear regression?"
response = retrieval_chain.invoke({"input": question})

#  Step 16: Print the Answer
print(response["input"])
print(response["answer"])


What is the linear regression?
Linear regression is a type of supervised machine-learning algorithm that learns from labeled datasets and maps the data points with most optimized linear functions which can be used for prediction on new datasets. It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
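
To go beyond a single query, as the testing and evaluation step suggests, a small harness can record latency and a rough accuracy signal per query. The sketch below is a minimal illustration under stated assumptions: `evaluate`, `fake_answer_fn`, and the keyword-based check are hypothetical names and a deliberately crude proxy for accuracy, and the stand-in answer function lets the code run without an API key.

```python
import time
from statistics import mean


def evaluate(answer_fn, test_cases):
    """Run each query, time it, and check that expected keywords appear in the answer."""
    results = []
    for query, expected_keywords in test_cases:
        start = time.perf_counter()
        answer = answer_fn(query)
        latency = time.perf_counter() - start
        found = all(kw.lower() in answer.lower() for kw in expected_keywords)
        results.append({"query": query, "latency_s": latency, "keywords_found": found})
    return results


def fake_answer_fn(query):
    # Hypothetical stand-in for the real chain so the sketch runs offline.
    return "Linear regression fits a straight line to labeled data."


# Hypothetical test set: (query, keywords a good answer should mention).
test_cases = [
    ("What is linear regression?", ["linear", "line"]),
    ("What is gradient descent?", ["gradient"]),
]

results = evaluate(fake_answer_fn, test_cases)
accuracy = mean(1.0 if r["keywords_found"] else 0.0 for r in results)
avg_latency = mean(r["latency_s"] for r in results)
print(f"keyword accuracy: {accuracy:.2f}, mean latency: {avg_latency:.4f}s")
```

To evaluate the chain built above, the stand-in could be replaced with `lambda q: retrieval_chain.invoke({"input": q})["answer"]`. Keyword matching only approximates response accuracy, so human review or a stronger metric is still advisable.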
