## Document QnA Solution using LangChain

The text from the document is combined into a single string and split into smaller, overlapping chunks, so that each chunk fits within the model's input limit and can be retrieved on its own.

The code then generates an embedding for each chunk using OpenAI's embedding model. Embeddings are numerical vector representations of text that capture its semantic meaning, so passages with similar meaning end up with similar vectors.
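To make "similar meaning, similar vectors" concrete, here is a minimal sketch that compares toy vectors with cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are invented for illustration; real embeddings are returned by the OpenAI API and have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (not real model output).
budget = [0.9, 0.1, 0.0]
finance = [0.8, 0.2, 0.1]
git = [0.0, 0.1, 0.9]

# Related topics point in nearly the same direction, unrelated ones do not.
print(cosine_similarity(budget, finance) > cosine_similarity(budget, git))  # prints True
```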

Using the generated embeddings, the code creates a search index using the FAISS library. This index organizes the embeddings in a manner that enables efficient similarity searches, facilitating quick retrieval of relevant information.
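Conceptually, the simplest kind of similarity index compares the query vector with every stored vector and returns the closest ones. The sketch below shows that idea in plain Python. `FlatIndex` is a hypothetical class written for this example, not the FAISS API; FAISS performs the same kind of search with optimized routines and also offers approximate index types.

```python
import math

class FlatIndex:
    """Hypothetical exhaustive index: stores vectors and scans all of them per query."""

    def __init__(self):
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload):
        self.vectors.append(vector)
        self.payloads.append(payload)

    def search(self, query, k=2):
        """Return the k payloads whose vectors are closest to query (Euclidean distance)."""
        scored = [
            (math.dist(query, vec), payload)
            for vec, payload in zip(self.vectors, self.payloads)
        ]
        scored.sort(key=lambda pair: pair[0])
        return [payload for _, payload in scored[:k]]

# Toy two-dimensional vectors standing in for chunk embeddings.
index = FlatIndex()
index.add([1.0, 0.0], "chunk about taxes")
index.add([0.9, 0.1], "chunk about FDI")
index.add([0.0, 1.0], "chunk about railways")

nearest = index.search([1.0, 0.05], k=2)
print(nearest)
```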

Next, the code builds a question-answering chain with LangChain's `load_qa_chain`. The chain wraps a pre-trained OpenAI language model and prompts it to answer a question using only the context it is given.

For each user question, the code retrieves the most similar chunks from the search index and passes them, together with the question, to the chain. The answer is therefore grounded in the text of the document (either the local document or a document from an online source) rather than in the model's general knowledge alone.

In [21]:
from PyPDF2 import PdfReader
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.indexes import VectorstoreIndexCreator

In [22]:
import os
from getpass import getpass
# Read the key at runtime; never hardcode a secret in the notebook.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

### QnA with PDF Document present on the PC

In [23]:
# Provide the path to the PDF file.
pdf_path = input("Enter the PDF path")
pdfreader = PdfReader(pdf_path)

Enter the PDF path budget_speech.pdf


In [24]:
# Read the text from every page of the PDF
raw_text = ''
for page in pdfreader.pages:
    content = page.extract_text()
    # extract_text() can return an empty value for pages without extractable text
    if content:
        raw_text += content

In [25]:
raw_text

"GOVERNMENT OF INDIA\nINTERIM BUDGET 2024-2025\nSPEECH\nOF\nNIRMALA SITHARAMAN\nMINISTER OF FINANCE\nFebruary 1,  2024 \nCONTENTS  \n \nPART – A \n Page No.  \nIntroduction  1 \nInclusive Development and Growth  2 \nSocial Justice   3  \nExemplary  Track Record of Governance,  \nDevelopment and Performance (GDP)  7 \nEconomic Management  8 \nGlobal Context  9 \nVision for ‘Viksit Bharat’  10 \nStrategy for  ‘Amrit Kaal’  11 \nInfrastructure Development  17 \nAmrit Kaal as Kartavya Kaal  22 \nRevised Estimates 2023 -24 23 \nBudget Estimates 2024 -25 23 \nPART – B \nDirect taxes  25 \nIndirect Taxes   26 \nEconomy – Then and Now  28 \n  \n  1 \n Interim Budget 2024 -2025  \nSpeech of  \nNirmala Sitharaman  \nMinister of Finance  \nFebruary 1, 2024  \nHon’ble Speaker,  \n I present the Interim Budget for 2024 -25.  \nIntroduction  \n1. The Indian  economy  has witnessed profound positive \ntransformation in the last ten years. The people of India are \nlooking ahead to the future with hop

In [26]:
# Split the text with CharacterTextSplitter so that each chunk stays well within the model's token limit
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 800,
    chunk_overlap  = 200,
    length_function = len,
)
texts = text_splitter.split_text(raw_text)

In [27]:
len(texts)

61
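The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified splitter. This is a sketch of the general idea (pack separator-delimited pieces into a chunk, then carry a tail of the previous chunk into the next one), not LangChain's actual implementation; `split_with_overlap` and the sample text are made up for the example.

```python
def split_with_overlap(text, separator="\n", chunk_size=25, chunk_overlap=12):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size
    characters, carrying trailing pieces (up to chunk_overlap characters) forward."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []

    def length(parts):
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop pieces from the front until the carried-over tail fits the
            # overlap budget and leaves room for the new piece.
            while current and (length(current) > chunk_overlap
                               or length(current + [piece]) > chunk_size):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks

sample = "\n".join(["alpha line", "beta line", "gamma line", "delta line", "epsilon line"])
chunks = split_with_overlap(sample)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk repeats the last line of the previous one, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.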

In [28]:
# Create the OpenAI embeddings client (vectors are requested when the texts are indexed)
embeddings = OpenAIEmbeddings()

In [29]:
# FAISS will internally create an index structure optimized for fast similarity search based on the provided embeddings.

document_search = FAISS.from_texts(texts, embeddings)

In [30]:
document_search

<langchain_community.vectorstores.faiss.FAISS at 0x2bfa257bcd0>

In [31]:
from langchain_openai import OpenAI

chain = load_qa_chain(OpenAI(), chain_type="stuff")

Available chain types for load_qa_chain:

stuff: Places all retrieved chunks into a single prompt; simplest and needs only one LLM call, but is limited by the model's context window.

map_reduce: Runs the question against each chunk separately, then combines the partial answers in a final call; scales to many chunks at the cost of more LLM calls.

refine: Goes through the chunks one at a time, updating a running answer with each; can be more thorough, but the calls are sequential and slower.

map_rerank: Answers the question from each chunk separately with a confidence score and returns the highest-scoring answer.
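With `chain_type="stuff"`, as in the cell above, every retrieved chunk is placed into one prompt that is sent to the model in a single call. The sketch below shows that assembly step only. `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for the results of a similarity search.
retrieved = [
    "First retrieved chunk of the document.",
    "Second retrieved chunk of the document.",
]
prompt = build_stuff_prompt(retrieved, "What does the document say?")
print(prompt)
```

Because everything goes into one prompt, its length grows with the number and size of the retrieved chunks, which is one reason the chunk size chosen during splitting matters.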

In [32]:
# query = 
# "What was The FDI inflow during 2014-23?"
# "Vision for Amrit Kaal?"

query = input("Enter your question: ")
docs = document_search.similarity_search(query)
result = chain.run(input_documents=docs, question=query).strip()
print(result)

Enter your question:  What was The FDI inflow during 2014-23?


The FDI inflow during 2014-23 was USD 596 billion.


### QnA with PDF Document present on the internet

In [33]:
from langchain_community.document_loaders import OnlinePDFLoader

In [34]:
# Using the GitHub Training Manual - https://githubtraining.github.io/training-manual/legacy-manual.pdf

pdf_link = input("Enter the link of the PDF document")
loader = OnlinePDFLoader(pdf_link)

Enter the link of the PDF document https://githubtraining.github.io/training-manual/legacy-manual.pdf


In [35]:
data = loader.load()

data



In [36]:
# Create the OpenAI embeddings client (vectors are requested when the document is indexed)
embeddings = OpenAIEmbeddings()

In [37]:
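# Load, split, embed and index the document in one step, using the embeddings client created above
index = VectorstoreIndexCreator(embedding=embeddings).from_loaders([loader])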
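The index creator bundles the steps that the first section performed by hand: splitting, embedding, and storing the chunks for retrieval. The toy class below sketches that bundling under loud assumptions: `ToyIndex`, `embed` and `similarity` are hypothetical, word counts stand in for a real embedding model, the sample text is invented, and the sketch stops at retrieval instead of passing the chunk to an LLM.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class ToyIndex:
    """Hypothetical one-step index: split documents, embed the chunks, retrieve by similarity."""

    def __init__(self, documents, chunk_size=8):
        self.chunks = []
        for doc in documents:
            words = doc.split()
            # Split each document into runs of chunk_size words.
            for start in range(0, len(words), chunk_size):
                self.chunks.append(" ".join(words[start:start + chunk_size]))
        self.vectors = [embed(chunk) for chunk in self.chunks]

    def most_relevant(self, question):
        """Return the stored chunk most similar to the question."""
        q = embed(question)
        scores = [similarity(q, vec) for vec in self.vectors]
        return self.chunks[scores.index(max(scores))]

docs = [
    "A fork is a personal copy of a repository. "
    "A merge conflict happens when two branches change the same lines."
]
toy = ToyIndex(docs)
print(toy.most_relevant("what is a fork"))
```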

In [39]:
# query = 
# "How to write good commit message?"
# "What is a fork?"
# "How to resolve merge conflicts?"
# "How to rewrite history with Git reset?" 

query = input("Enter your question: ")
result = index.query(query).strip()
print(result)

Enter your question:  How to rewrite history with Git reset?


To rewrite history with Git reset, you can use the "git reset" command with different options such as "--soft", "--mixed", or "--hard" to move the HEAD pointer to a different commit and change the state of the staging area and working directory accordingly. This allows you to undo or modify previous commits and make changes to your project's history. However, it is important to use this command carefully as it can be destructive and potentially cause loss of work.
