#### Set the OpenAI API Key

In [1]:
import os
from constants import openai_key

os.environ["OPENAI_API_KEY"] = openai_key
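
The cell above imports the key from a local `constants` module. As a minimal sketch of an alternative, assuming the key has already been exported in the shell, the check below fails early when the variable is missing. `require_api_key` is a hypothetical helper, not part of any library.

```python
import os


def require_api_key(name="OPENAI_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or blank."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return value
```

Calling `require_api_key()` at the top of the notebook would surface a missing key before any API request is made.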

#### PDF Query Using LangChain

In [2]:
!pip install langchain
!pip install openai
!pip install PyPDF2    
!pip install faiss-cpu
!pip install tiktoken


In [3]:
from PyPDF2 import PdfReader  # reads text from PDF files
from langchain.embeddings.openai import OpenAIEmbeddings  # turns text into embedding vectors
from langchain.text_splitter import CharacterTextSplitter  # splits text on a separator character
from langchain.vectorstores import FAISS  # vector store used for similarity search

In [4]:
# Path to the PDF file to query.
pdf_reader = PdfReader("data/budget_speech.pdf")

In [5]:
# Read the text of every page in the PDF into one string.
raw_text = ''
for page in pdf_reader.pages:
    content = page.extract_text()
    if content:  # skip pages with no extractable text
        raw_text += content
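
The extraction loop above can also be written as a small function. This is a sketch under one stated assumption: pages are joined with a newline (the original loop uses no separator) so that the last line of one page does not fuse with the first line of the next. `collect_text` and `FakePage` are hypothetical names; `FakePage` only stands in for a real page object so the example runs without a PDF.

```python
def collect_text(pages, joiner="\n"):
    """Concatenate the text of every page that yields any text."""
    parts = []
    for page in pages:
        content = page.extract_text()
        if content:  # skip pages with no extractable text (e.g. scanned images)
            parts.append(content)
    return joiner.join(parts)


class FakePage:
    """Stand-in for a PDF page object, used only for this demonstration."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


demo = collect_text([FakePage("Page one."), FakePage(""), FakePage("Page three.")])
```

With the reader from the earlier cell, the call would be `collect_text(pdf_reader.pages)`.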

In [6]:
# Split the text on newlines into overlapping chunks of at most ~800 characters,
# so that each chunk stays well within the model's token limit.
text_splitter = CharacterTextSplitter(
    separator='\n',
    chunk_size=800,
    chunk_overlap=200,
    length_function=len,
)

texts = text_splitter.split_text(raw_text)
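
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of greedy line-based chunking. It is an illustration only and does not reproduce LangChain's implementation; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, separator="\n", chunk_size=800, chunk_overlap=200):
    """Greedy separator-based chunking: a simplified illustration, not LangChain's code."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Carry over trailing pieces whose joined length fits the overlap budget.
            kept = []
            for prev in reversed(current):
                if len(separator.join([prev] + kept)) > chunk_overlap:
                    break
                kept.insert(0, prev)
            # Drop carried pieces if they would push the next chunk over the limit.
            while kept and len(separator.join(kept + [piece])) > chunk_size:
                kept.pop(0)
            current = kept
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


demo_chunks = split_with_overlap("aaaa\nbbbb\ncccc\ndddd", chunk_size=9, chunk_overlap=4)
```

In this toy run each chunk repeats the last line of the previous one, which is the effect the overlap is meant to have: a sentence cut at a chunk boundary still appears whole in one of the chunks. A single piece longer than `chunk_size` is kept whole in this sketch.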

In [7]:
len(texts)

160

In [8]:
# Create the OpenAI embeddings client used to vectorize each chunk
embeddings = OpenAIEmbeddings()


In [9]:
# Embed every chunk in `texts` and index the resulting vectors in a FAISS store

document_search = FAISS.from_texts(texts, embeddings)
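
Conceptually, the store answers a query by embedding it and returning the chunks whose vectors lie closest to it. The sketch below shows that idea with a brute-force Euclidean search over toy 2-D vectors. It assumes Euclidean distance as the metric and is not how FAISS is implemented internally; `nearest` and `toy_vectors` are hypothetical names.

```python
import math


def nearest(query_vec, vectors, k=2):
    """Return the indices of the k vectors closest to query_vec (Euclidean distance)."""
    distances = [(math.dist(query_vec, v), i) for i, v in enumerate(vectors)]
    distances.sort()  # smallest distance first
    return [i for _, i in distances[:k]]


# Toy "embeddings"; real embedding vectors have far more dimensions.
toy_vectors = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
closest = nearest((0.9, 0.1), toy_vectors, k=2)
```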

In [10]:
document_search

<langchain_community.vectorstores.faiss.FAISS at 0x7f473df230b0>

In [11]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [12]:
chain = load_qa_chain(OpenAI(), chain_type='stuff')

Migration guides for each chain type:
stuff: https://python.langchain.com/docs/versions/migrating_chains/stuff_docs_chain
map_reduce: https://python.langchain.com/docs/versions/migrating_chains/map_reduce_chain
refine: https://python.langchain.com/docs/versions/migrating_chains/refine_chain
map_rerank: https://python.langchain.com/docs/versions/migrating_chains/map_rerank_docs_chain

See also guides on retrieval and question-answering here: https://python.langchain.com/docs/how_to/#qa-with-rag
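
The `stuff` chain type places all retrieved chunks into a single prompt together with the question. A minimal sketch of that idea follows; the prompt wording is illustrative and is not LangChain's actual template, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(chunks, question):
    """Combine all retrieved chunks and the question into one prompt string."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


demo_prompt = build_stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the document say?",
)
```

Because everything goes into one prompt, this approach only works while the combined chunks fit in the model's context window, which is one reason the text was split into small chunks earlier.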


In [13]:
query = "How much the agriculture target will be"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)


' I cannot accurately determine the agriculture target based on the given context. The text mentions various initiatives and policies related to agriculture, but does not specify a specific target or goal.'

In [14]:
query = "Productivity and resilience in Agriculture"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

" \n\nThe focus of the government's agriculture research setup is on increasing productivity and developing climate resilient varieties. They will also provide funding in challenge mode and involve domain experts to oversee the research. In addition, they will release new high-yielding and climate-resilient varieties of crops for cultivation by farmers and promote natural farming. The government also plans to strengthen production, storage, and marketing for pulses and oilseeds, as well as develop large scale clusters for vegetable production. They will also implement a Digital Public Infrastructure for Agriculture in partnership with states. "

In [15]:
query = "how much for agriculture and allied sector"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)


' The provision is for ` 1.52 lakh crore.'

In [16]:
query = "Vikas bhi Virasat bhi"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)

' Vikas bhi Virasat bhi is a phrase that means "development and heritage together." It is mentioned in the context as a concept that will be showcased in the growth trajectory of the industrial node at Gaya, Bihar, as well as in the development of Vishnupad Temple Corridor and Mahabodhi Temple Corridor. It highlights the importance of preserving cultural heritage and promoting economic development simultaneously. '

In [17]:
query = "Employment and Investment"
docs = document_search.similarity_search(query)
chain.run(input_documents=docs, question=query)


' The government plans to implement schemes for employment linked incentives, including providing one-month salary to first-time employees and supporting employees and employers. They also plan to promote private investment in infrastructure and encourage states to invest in public infrastructure. Additionally, there are plans to launch a Phase IV of the Pradhan Mantri Gram Sadak Yojana to provide all-weather connectivity to rural areas, and to create a Critical Mineral Mission to promote domestic production and acquisition of critical mineral assets. '
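
The query cells above repeat the same three lines. A minimal sketch of a helper that wraps them is shown below; `ask` is a hypothetical name, and it assumes the store exposes `similarity_search(query, k=...)` and the chain exposes `run(input_documents=..., question=...)`, as used in the cells above.

```python
def ask(store, qa_chain, question, k=4):
    """Retrieve the k most similar chunks and pass them to the QA chain."""
    docs = store.similarity_search(question, k=k)
    answer = qa_chain.run(input_documents=docs, question=question)
    # Answers in the cells above carry leading spaces and newlines; trim them.
    return answer.strip()
```

With the objects built earlier, a call would look like `ask(document_search, chain, "how much for agriculture and allied sector")`.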