## Document QnA Solution (Open Source)

Text Processing: The code extracts text from a PDF and splits it into smaller, overlapping chunks so that each piece is short enough to embed and to fit into the model's prompt.

Embedding Generation: It generates semantic embeddings for each text chunk using a Hugging Face model, capturing the meaning of the text.

Indexing: FAISS is used to create an index structure for fast similarity search based on the embeddings.

Question Answering Model: It connects to a pre-trained, instruction-tuned LLM (Mistral-7B-Instruct) through a Hugging Face endpoint and wraps it in a LangChain question answering chain.

Query Execution: The user enters a question, which triggers a similarity search over the indexed text chunks. The best-matching chunks are passed to the question answering chain, which generates an answer from them.

In [1]:
import os

from PyPDF2 import PdfReader
from langchain.chains.question_answering import load_qa_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [2]:
pdf_path = input("Enter the PDF path: ")

Enter the PDF path:  indian_budget.pdf


In [3]:
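# Read every page and concatenate the extracted text.
pdf_reader = PdfReader(pdf_path)
raw_text = ''
for page in pdf_reader.pages:
    # Fall back to '' so a page with no extractable text cannot break the concatenation.
    raw_text += page.extract_text() or ''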

In [4]:
raw_text

"GOVERNMENT OF INDIA\nINTERIM BUDGET 2024-2025\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2024 \nCONTENTS  \n \nPART – A \n Page No.  \nIntroduction  1 \nInclusive Development and Growth  2 \nSocial Justice   3  \nExemplary  Track Record of Governance,  \nDevelopment and Performance (GDP)  7 \nEconomic Management  8 \nGlobal Context  9 \nVision for ‘Viksit Bharat’  10 \nStrategy for  ‘Amrit Kaal’  11 \nInfrastructure Development  17 \nAmrit Kaal as Kartavya Kaal  22 \nRevised Estimates 2023 -24 23 \nBudget Estimates 2024 -25 23 \nPART – B \nDirect taxes  25 \nIndirect Taxes   26 \nEconomy – Then and Now  28 \n  \n  1 \n Interim Budget 2024 -2025  \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1, 2024  \nHon’ble Speaker,  \n I present the Interim Budget for 2024 -25.  \nIntroduction  \n1. The Indian  economy  has witnessed profound positive \ntransformation in the last ten years. The people of India are \nlooking ahead to the future with hop

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=800, chunk_overlap=200)
texts = text_splitter.split_text(raw_text)
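
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunking. It is a simplified fixed-window version, not the actual `RecursiveCharacterTextSplitter` algorithm (which prefers to break on separators such as paragraphs and newlines), and the function name `sliding_window_chunks` is made up for illustration.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: 2000 characters of repeating digits.
sample = "".join(str(i % 10) for i in range(2000))
chunks = sliding_window_chunks(sample)
print(len(chunks), [len(c) for c in chunks])
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.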

In [6]:
len(texts)

61

In [7]:
# Initialize HuggingFace embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"},
)

In [8]:
# FAISS will internally create an index structure optimized for fast similarity search based on the provided embeddings.

document_search = FAISS.from_texts(texts, embeddings)
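
Conceptually, a similarity search compares the query embedding with every stored embedding and returns the closest ones. The sketch below shows that idea as an exhaustive Euclidean (L2) search over toy 2-D vectors in plain Python. It assumes a flat, brute-force index, which to my knowledge is what the LangChain FAISS wrapper builds by default. The helper names are hypothetical, and real FAISS is far faster and operates on the 384-dimensional vectors produced by `all-MiniLM-L6-v2`.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(index_vectors: List[Sequence[float]], query: Sequence[float], k: int = 4) -> List[Tuple[int, float]]:
    """Return (position, distance) for the k stored vectors closest to the query."""
    scored = [(i, l2_distance(v, query)) for i, v in enumerate(index_vectors)]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]


# Toy "embeddings": made-up 2-D vectors standing in for chunk embeddings.
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
hits = flat_search(toy_vectors, [1.0, 0.0], k=2)
print(hits)
```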

In [9]:
document_search

<langchain_community.vectorstores.faiss.FAISS at 0x1f3a14943e0>

In [10]:
from langchain_community.llms import HuggingFaceEndpoint

In [11]:
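# Never hard-code an API token in a notebook; prompt for it (or set it outside the notebook) instead.
from getpass import getpass
os.environ["HUGGINGFACEHUB_API_TOKEN"] = getpass("Enter your Hugging Face API token: ")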

In [12]:
llm = HuggingFaceEndpoint(repo_id='mistralai/Mistral-7B-Instruct-v0.2', temperature=0.1)


Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to C:\Users\prabh\.cache\huggingface\token
Login successful


In [13]:
chain = load_qa_chain(llm=llm, chain_type="stuff")

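Chain types accepted by `load_qa_chain`:

stuff: Inserts all retrieved chunks into a single prompt; simplest option, but the combined text must fit within the model's context window.

map_reduce: Runs the model on each chunk separately, then combines the partial answers; handles more text at the cost of more model calls.

refine: Processes chunks one at a time, updating a running answer with each new chunk.

map_rerank: Answers from each chunk separately, scores each answer, and returns the highest-scoring one.

Note that the following are model families, not chain types:

BERT: Powerful, context-aware (bidirectional) model; computationally expensive.

RoBERTa: Similar to BERT with improved training strategies; computationally expensive.

DistilBERT: Lighter, distilled version of BERT (fewer parameters) with comparable performance; sacrifices some accuracy.

ALBERT: Lightweight BERT variant; efficient but slightly less accurate.

GPT: Generates text; unidirectional, so each token only sees the context that precedes it.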
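
The "stuff" chain type used above can be pictured as plain string assembly: the retrieved chunks are concatenated and placed into one prompt along with the question. The sketch below illustrates that idea only. The template wording and the name `build_stuff_prompt` are assumptions for illustration, and LangChain's built-in prompt differs.

```python
from typing import List

# Hypothetical template; the prompt LangChain actually sends is worded differently.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_stuff_prompt(chunks: List[str], question: str) -> str:
    """Join all chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


prompt = build_stuff_prompt(
    ["Chunk one text.", "Chunk two text."],
    "What is in chunk two?",
)
print(prompt)
```

Because every chunk lands in the same prompt, this approach only works while the retrieved text plus the question fits within the model's context window.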

In [15]:
# query = 
# "What was The FDI inflow during 2014-23?"
# "Vision for Amrit Kaal?"

# query = 
# "How to write good commit message?"
# "What is a fork?"
# "How to resolve merge conflicts?"
# "How to rewrite history with Git reset?" 

query = input("Enter your question: ")
docs = document_search.similarity_search(query)
result = chain.run(input_documents=docs, question=query).strip()
print(result)

Enter your question:  What was The FDI inflow during 2014-23?


The FDI inflow during 2014-23 was USD 596 billion.
