## PDF Query with LangChain

In [1]:
# Run this cell first to install dependencies
%pip install -q langchain-openai langchain-community langchain-classic langchain-text-splitters pypdf python-dotenv faiss-cpu

Note: you may need to restart the kernel to use updated packages.





In [10]:
import os
from pathlib import Path
from dotenv import load_dotenv
from pypdf import PdfReader

from langchain_core.documents import Document
from langchain_openai import OpenAI
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
from langchain_classic.chains import create_retrieval_chain
from langchain_classic.chains.combine_documents import create_stuff_documents_chain


def load_pdf_documents(pdf_path: str):
    """Load a PDF file into LangChain Document objects using pypdf."""
    reader = PdfReader(pdf_path)
    docs = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        docs.append(Document(page_content=text, metadata={"source": pdf_path, "page": i + 1}))
    return docs


In [3]:
# Load API keys from .env (create .env in project root with OPENAI_API_KEY=...)
load_dotenv()
load_dotenv(Path.cwd().parent / ".env")  # if running from 1.5-PDF_QUERY folder
assert os.getenv("OPENAI_API_KEY"), "Set OPENAI_API_KEY in .env or environment"

In [11]:
# Path to PDF (place Budget_Speech.pdf in this folder or set path)
pdf_path = Path("Budget_Speech.pdf")
if not pdf_path.exists():
    pdf_path = Path.cwd().parent / "Budget_Speech.pdf"
if not pdf_path.exists():
    raise FileNotFoundError("Put Budget_Speech.pdf in 1.5-PDF_QUERY or project root.")

docs = load_pdf_documents(str(pdf_path))
print(f"Loaded {len(docs)} pages from {pdf_path.name}")

# Split into chunks for retrieval
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
print(f"Split into {len(splits)} chunks")

# Embeddings and vector store (FAISS)
embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_documents(splits, embeddings)
# k must be passed through search_kwargs; a bare k=4 is not a retriever option
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Prompt and document chain
prompt = ChatPromptTemplate.from_template(
    "Answer based only on this context:\n\n<context>\n{context}\n</context>\n\nQuestion: {input}"
)
llm = OpenAI(temperature=0)
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)

Loaded 65 pages from Budget_Speech.pdf
Split into 159 chunks
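
The splitter settings above (`chunk_size=1000`, `chunk_overlap=200`) mean that neighbouring chunks share some text, so a sentence cut at a chunk boundary still appears whole in one of them. The sketch below illustrates only the overlap idea with a fixed-width sliding window. It is a simplified stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses (the real splitter prefers to break on separators such as paragraphs and newlines), and `chunk_text` and `sample` are hypothetical names introduced here.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Cut text into fixed-width windows that overlap by chunk_overlap characters.

    Simplified illustration only: unlike RecursiveCharacterTextSplitter,
    this ignores paragraph and sentence boundaries.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Synthetic 2500-character string, used only to show the window arithmetic
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample)
print([len(c) for c in chunks])            # [1000, 1000, 900]
print(chunks[0][-200:] == chunks[1][:200])  # True: last 200 chars are repeated
```

With a step of 800 characters per window, a larger overlap produces more chunks (and more embedding calls) for the same document.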


In [5]:
# Query the PDF
question = "What are the main highlights or key points of the budget?"
result = retrieval_chain.invoke({"input": question})
print("Answer:", result["answer"])
print("\n--- Sources (retrieved chunks) ---")
for i, doc in enumerate(result["context"][:2], 1):
    print(f"[{i}] {doc.page_content[:200]}...")

Answer: 

1. Increased investment in health infrastructure
2. Launch of PM AtmaNirbhar Swasth Bharat Yojana with an outlay of about Rs. 64,180 crores over 6 years
3. Focus on strengthening preventive, curative, and wellbeing aspects of healthcare
4. Introduction of new centrally sponsored scheme for health systems
5. Emphasis on inclusive development for aspirational India
6. Reinvigorating human capital through various initiatives
7. Focus on innovation and R&D
8. Emphasis on minimum government and maximum governance
9. Fiscal deficit target of 3% of GSDP by 2023-24
10. Disclosure of extra budgetary resources and discontinuation of NSSF loan to FCI for food subsidy.

--- Sources (retrieved chunks) ---
[1] 5 
 
 
 
iii. Inclusive Development for Aspirational India 
iv. Reinvigorating Human Capital  
v. Innovation and R&D 
vi. Minimum Government and Maximum Governance 
 
1. Health and Wellbeing 
 
28. Ev...
[2] CONTENTS 
PART-A 
  Page No. 
 Introduction 1 
 Health and Wellbeing 5 
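
The source chunks printed above are the ones whose embedding vectors lie closest to the embedding of the question. As a rough sketch of that selection step, the code below ranks toy 2-D vectors by Euclidean distance and keeps the nearest `k`. This assumes the default distance used by the LangChain FAISS wrapper is Euclidean (L2), which is my understanding but is not stated in this notebook. `top_k`, `toy_vectors`, and `toy_query` are hypothetical names, and real embeddings have far more than two dimensions.

```python
import math


def top_k(query_vec, chunk_vecs, k=4):
    """Return the indices of the k vectors nearest to query_vec (Euclidean distance)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Toy 2-D "embeddings"; purely illustrative values, not real model output
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [0.5, 0.5]]
toy_query = [1.0, 0.0]

nearest = top_k(toy_query, toy_vectors, k=2)
print(nearest)  # [1, 2]
```

This brute-force scan does the same job conceptually as a flat index. The second source above is the table of contents, which shows that a nearest-neighbour match is not always the most informative passage.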
 

In [6]:
# Try your own question (run the cell above first)
your_question = "What is the total budget allocation?"  # change this
response = retrieval_chain.invoke({"input": your_question})
print(response["answer"])



Answer: The total budget allocation for Health and Wellbeing is Rs. 2,23,846 crores in BE 2021-22, which is an increase of 137% from the previous year's BE of Rs. 94,452 crores. The details of the budget allocation for other sectors are not mentioned in this context.
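
The answer above covers only Health and Wellbeing, most likely because the model sees nothing but the retrieved chunks. A "stuff" documents chain, as I understand it, joins the retrieved chunk texts and substitutes them into the `{context}` slot of the prompt. The sketch below reproduces that step with plain string formatting and the same template text used earlier. `build_prompt` and `placeholder_chunks` are hypothetical names, the chunk texts are placeholders rather than real PDF content, and the `"\n\n"` separator is assumed to match the chain's default.

```python
TEMPLATE = (
    "Answer based only on this context:\n\n"
    "<context>\n{context}\n</context>\n\n"
    "Question: {input}"
)


def build_prompt(chunks, question):
    """Join chunk texts with blank lines and fill the prompt template."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, input=question)


# Placeholder strings standing in for retrieved chunk text
placeholder_chunks = ["chunk A text", "chunk B text"]
prompt_text = build_prompt(placeholder_chunks, "What is the total budget allocation?")
print(prompt_text)
```

Printing the filled prompt like this is a quick way to check what the model was actually given when an answer looks incomplete.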
