# PDF-based Question-Answering system [RAG]

I built a PDF-based Question-Answering system using Python, LangChain, Chroma, and Groq LLM. The system can read PDF documents, split them into manageable text chunks, convert the text into embeddings, and store them in a vector database for fast semantic search. When a question is asked, the system retrieves the most relevant chunks and uses a language model to generate accurate answers. This approach allows interactive, AI-powered exploration of documents, making it easy to get insights directly from PDFs.

**1. What is RAG?**

- RAG (Retrieval-Augmented Generation) is a method in AI that combines retrieval-based search with generative AI.
- Instead of relying only on a language model’s memory, RAG retrieves relevant information from external sources (like documents, databases, or the web) before generating an answer.
- This improves accuracy, reduces hallucinations, and allows handling large knowledge bases.
- In simple words: “RAG is when an AI first looks up relevant information and then uses that to answer questions.”

**2. Why RAG is used**

RAG is used to:
- Handle Large Knowledge
  - LLMs have a limited context window (a token limit per request). RAG lets the model use information that is neither in its training data nor small enough to paste into a single prompt.
- Increase Accuracy
  - The model can base answers on real data retrieved from documents or databases.
- Reduce Hallucination
  - Generative models sometimes make up facts. Retrieval helps anchor the answer in real information.
- Enable Domain-Specific Q&A
  - Useful in enterprise settings, research, or PDFs where the content is specific.

**3. Functions / Components of RAG**

**RAG typically involves three main components:**

**1. Retriever**

**Purpose**: Finds relevant documents or information from a knowledge base.
- How it works:
  - Converts your query into a vector (embedding).
  - Searches a database of document vectors (like FAISS, Chroma, or Pinecone) to find the closest matches.
- Example: If you ask, “What is RAG in AI?”, the retriever fetches documents explaining RAG.

**2. Knowledge Base / Document Store**

**Purpose**: Stores all the information the retriever can search through.
- Types:
  - Text files, PDFs, or web data.
  - Embedding-based vector databases for fast similarity search.
- Example Tools: ChromaDB, Pinecone, FAISS.

**3. Generator**

**Purpose**: Uses the retrieved documents to generate a final, coherent response.
- How it works:
  - Takes the retrieved documents as context.
  - Generates natural language answers using models like GPT, LLaMA, or T5.
- Key: Ensures the answer is informative and contextually relevant, not just based on memorized knowledge.
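
The retriever's "closest matches" step can be sketched in plain Python. This is a minimal illustration, not how Chroma or FAISS are implemented: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names introduced only for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, store, k=2):
    """Return the k (score, text) pairs whose vectors are closest to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical 3-dimensional "embeddings", hand-written for illustration only.
toy_store = [
    ("RAG combines retrieval with generation.", [0.9, 0.1, 0.0]),
    ("Chroma is a vector database.", [0.2, 0.9, 0.1]),
    ("Bananas are rich in potassium.", [0.0, 0.1, 0.9]),
]

query_vector = [0.8, 0.3, 0.0]  # pretend embedding of "What is RAG in AI?"
for score, text in top_k(query_vector, toy_store, k=2):
    print(f"{score:.3f}  {text}")
```

A real vector database typically uses an approximate nearest-neighbour index instead of scoring every stored vector, but the ranking idea is the same.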

**4. How RAG Works (Step by Step)**

- Input Question
  - User asks: “What is Python?”
- Retrieval Step
  - The retriever searches the vector database for relevant chunks from PDFs, docs, or knowledge bases.
- Augmentation Step
  - Retrieved documents are passed as context to the LLM.
- Generation Step
  - LLM generates an answer using both the context and its language knowledge.
- Output Answer
  - Returns the answer along with optional source documents.
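
The five steps above can be sketched as a toy pipeline. Everything here is a simplified assumption: `retrieve` ranks chunks by shared words rather than by embeddings, and `generate` is a placeholder that does not call any model. The function names are hypothetical and are not part of LangChain.

```python
def retrieve(question, chunks, k=2):
    """Rank chunks by how many question words they share (a stand-in for vector search)."""
    question_words = set(question.lower().replace("?", "").split())

    def overlap(chunk):
        return len(question_words & set(chunk.lower().replace(".", "").split()))

    return sorted(chunks, key=overlap, reverse=True)[:k]


def augment(question, context_chunks):
    """Combine the retrieved chunks and the question into one prompt."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"


def generate(prompt):
    """Placeholder for the LLM call; a real system would send the prompt to a model."""
    return f"[model answer based on a prompt of {len(prompt)} characters]"


def answer(question, chunks):
    """Run retrieval, augmentation, and generation, and keep the sources."""
    sources = retrieve(question, chunks)
    prompt = augment(question, sources)
    return {"result": generate(prompt), "source_documents": sources}


chunks = [
    "Python is an interpreted language.",
    "Tuples are immutable sequences.",
    "The del statement removes items from a list.",
]
output = answer("What is Python?", chunks)
print(output["result"])
print(output["source_documents"])
```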

**Workflow Diagram (simplified):**

[User Question] --> [Retriever] --> [Relevant Docs] --> [LLM] --> [Answer]

          ┌──────────────────────────┐
          │        User Query        │
          └─────────────┬────────────┘
                        │
                        ▼
              ┌───────────────────┐
              │   Retriever (DB)  │
              │  • Vector Store   │
              │  • Search Index   │
              └─────────┬─────────┘
                        │  (Top relevant docs)
                        ▼
              ┌───────────────────┐
              │   Augmentation    │
              │  (Combine Query + │
              │  Retrieved Docs)  │
              └─────────┬─────────┘
                        │
                        ▼
              ┌───────────────────┐
              │   Generator (LLM) │
              │  • GPT / LLaMA    │
              │  • Produces text  │
              └─────────┬─────────┘
                        │
                        ▼
          ┌──────────────────────────┐
          │       Final Answer       │
          └──────────────────────────┘


# Chroma
- Chroma is a vector database: it stores text as vectors (lists of numbers that represent meaning) and searches them by similarity.
- It lets AI find information based on meaning, not just exact words.
- Works well for Q&A: you give it a question, it finds the most relevant text from documents, and then the AI uses that to answer.
- In this project, Chroma stores the PDF chunks so the model can quickly retrieve relevant content when you ask something.

Simple analogy:
**Think of Chroma like a smart library. Instead of looking for exact book titles, it finds the books that best match your question.**

In [None]:
!pip install chromadb==0.5.5 langchain-chroma==0.1.2 langchain==0.2.11 langchain-community==0.2.10 langchain-text-splitters==0.2.2 langchain-groq==0.1.6 transformers==4.43.2 sentence-transformers==3.0.1 unstructured==0.15.0 unstructured[pdf]==0.15.0

'''What it does:
Installs the Python packages needed for this workflow.
Key libraries:
LangChain: Framework for building applications with LLMs (Large Language Models). Provides document loaders, chains, retrievers, embeddings, etc.
LangChain-Chroma: Integration of Chroma vector database with LangChain.
LangChain-Groq: Integration with Groq LLM API.
Transformers: For Hugging Face models (used for embeddings or LLMs).
Sentence-Transformers: Used to create embeddings of text.
Unstructured: Reads unstructured documents like PDFs, DOCX, etc.
Note: poppler-utils is not a pip package. It is a system utility needed for reading PDFs and is installed separately (see the next cell).
Purpose: Prepare the environment for loading PDFs, splitting text, embedding it, and querying using an LLM.'''

Collecting chromadb==0.5.5
  Using cached chromadb-0.5.5-py3-none-any.whl.metadata (6.8 kB)
Collecting langchain-chroma==0.1.2
  Using cached langchain_chroma-0.1.2-py3-none-any.whl.metadata (1.3 kB)
Collecting langchain==0.2.11
  Using cached langchain-0.2.11-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-community==0.2.10
  Using cached langchain_community-0.2.10-py3-none-any.whl.metadata (2.7 kB)
Collecting langchain-text-splitters==0.2.2
  Using cached langchain_text_splitters-0.2.2-py3-none-any.whl.metadata (2.1 kB)
Collecting langchain-groq==0.1.6
  Using cached langchain_groq-0.1.6-py3-none-any.whl.metadata (2.8 kB)
Collecting transformers==4.43.2
  Using cached transformers-4.43.2-py3-none-any.whl.metadata (43 kB)
Collecting sentence-transformers==3.0.1
  Using cached sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Collecting unstructured==0.15.0
  Using cached unstructured-0.15.0-py3-none-any.whl.metadata (29 kB)
Collecting chroma-hnswlib==0.7.6 (from ch

  error: subprocess-exited-with-error
  
  × Building wheel for chroma-hnswlib (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [5 lines of output]
      running bdist_wheel
      running build
      running build_ext
      building 'hnswlib' extension
      error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for chroma-hnswlib
error: failed-wheel-build-for-install

× Failed to build installable wheels for some pyproject.toml based projects
╰─> chroma-hnswlib
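
The log above shows the `chroma-hnswlib` wheel failing to build because the Microsoft C++ Build Tools are missing, so some of the requested packages may not have been installed. A small check like the following, a sketch using only the standard library, reports which distributions are present. The `REQUIRED` list is copied from the pip command above; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata

# Distribution names taken from the pip command above.
REQUIRED = [
    "chromadb",
    "langchain-chroma",
    "langchain",
    "langchain-community",
    "langchain-text-splitters",
    "langchain-groq",
    "transformers",
    "sentence-transformers",
    "unstructured",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:28} {version or 'NOT INSTALLED'}")
```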


In [None]:
!apt-get install poppler-utils

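# Poppler is a set of command-line tools for working with PDFs (for example, converting PDF pages to text or images).
# apt-get only exists on Debian/Ubuntu systems (such as Google Colab); on Windows, Poppler has to be installed separately.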

'apt-get' is not recognized as an internal or external command,
operable program or batch file.


In [None]:
import os
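# Import from the packages that now host these classes; the old langchain.* paths are deprecated.
from langchain_community.document_loaders import UnstructuredFileLoader
from langchain_text_splitters import CharacterTextSplitter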
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma   # ✅ Correct import
from langchain.chains import RetrievalQA
from langchain_groq import ChatGroq

'''What it does:
os: Interact with the file system (check if files exist, paths, etc.).
UnstructuredFileLoader: Loads content from PDF files or other document types.
CharacterTextSplitter: Splits long text into smaller chunks for better LLM processing.
HuggingFaceEmbeddings: Converts text chunks into numerical vectors.
Chroma: Stores embeddings in a vector database for retrieval.
RetrievalQA: Combines a retriever (vector DB) and LLM to answer questions.
ChatGroq: Uses Groq LLM API to generate answers.'''


In [None]:
os.environ['GROQ_API_KEY'] = "gsk....................Ckg"
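'''Sets your Groq API key in the environment so ChatGroq can authenticate requests.
Groq is the LLM provider that generates the responses (it hosts models such as Llama).
Replace the placeholder above with your own key, and avoid committing a real key to a shared notebook.'''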
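
An alternative to writing the key into the notebook is to set `GROQ_API_KEY` in the shell before starting Jupyter and only verify it here. The sketch below shows that check; `require_env` is a hypothetical helper, not part of `langchain_groq`.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; set it before starting the notebook.")
    return value


try:
    api_key = require_env("GROQ_API_KEY")
    print("GROQ_API_KEY found, length:", len(api_key))
except RuntimeError as error:
    print(error)
```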

In [None]:
import requests
file_path = r"C:\Users\ASUS\Downloads\Python_small.pdf"
r'''This cell does two things:
import requests → Loads the requests library, which is normally used to download files or get data from the internet. It is not used again in this notebook.
file_path = r"C:\Users\ASUS\Downloads\Python_small.pdf" → Saves the location of your PDF file on your computer in a variable called file_path.
(This note is a raw string, r, so the backslashes in the Windows path are not read as escape sequences.)'''

In [None]:
# Read the PDF content directly from the file system
with open(file_path, "rb") as f:
    pdf_content = f.read()
    
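'''What it does:
Opens the PDF in binary mode and reads its raw bytes into pdf_content.
This confirms the file is readable; pdf_content itself is not used later.
The actual parsing happens in UnstructuredFileLoader (next cell), which converts the PDF into a list of Document objects that LangChain can process.
Each Document contains .page_content and .metadata.'''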

In [58]:
loader = UnstructuredFileLoader(r"C:\Users\ASUS\Downloads\Python_small.pdf")

In [None]:
import os

file_path = r"C:\Users\ASUS\Downloads\Python_small.pdf"  # same file the loader reads
print(os.path.exists(file_path))  # ✅ should be True
'''Simple check to ensure your PDF exists on disk.
Returns True if the file exists, False otherwise.'''

True


In [None]:
documents = loader.load()
documents
'''This reads the file and stores its content in documents (a list of Document objects) so you can work with it in Python.'''





In [None]:
text_splitter = CharacterTextSplitter(chunk_size=1000,
                                      chunk_overlap=100)
texts = text_splitter.split_documents(documents)
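'''Why we split text:
LLMs can only handle a limited number of tokens at a time.
Splitting the text keeps each chunk small enough to fit in the LLM input limit.
chunk_size=1000: Target maximum of 1000 characters per chunk. CharacterTextSplitter only splits on a separator (a blank line by default), so a passage with no separator inside it can exceed this size, which is what the warnings below report.
chunk_overlap=100: Up to 100 characters are shared between neighbouring chunks to preserve context across chunk boundaries.'''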

Created a chunk of size 1042, which is longer than the specified 1000
Created a chunk of size 1270, which is longer than the specified 1000
Created a chunk of size 1055, which is longer than the specified 1000
Created a chunk of size 1872, which is longer than the specified 1000
Created a chunk of size 1249, which is longer than the specified 1000
Created a chunk of size 1628, which is longer than the specified 1000
Created a chunk of size 1643, which is longer than the specified 1000
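
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window splitter. It is a deliberate simplification and not how `CharacterTextSplitter` works: that class splits on a separator first, which is why the warnings above report chunks longer than 1000 characters. `split_with_overlap` is a hypothetical function written for this illustration.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the previous
    one, so neighbouring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# prints: abcdefghij, hijklmnopq, opqrstuvwx, vwxyz (one per line)
```

Each chunk begins with the last three characters of the previous one, which is the continuity that the overlap is meant to provide.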


In [62]:
type(texts)

list

In [63]:
len(texts)  # Number of chunks produced by the splitter

373

In [64]:
texts[2]

Document(metadata={'source': 'C:\\Users\\ASUS\\Downloads\\Python_small.pdf'}, page_content='19 19 19 20 21 22 22 24 29\n\n5 Data Structures\n\n5.1 More on Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 The del statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3 Tuples and Sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4 5.5 Dictionaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6 Looping Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.7 More on Conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.8 Comparing Sequences and Other Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .\n\n31 31 35 

In [None]:
embeddings = HuggingFaceEmbeddings()
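'''Converts each text chunk into a vector representation.
Vectors capture semantic meaning so similar texts are close in vector space.
With no arguments, HuggingFaceEmbeddings uses its default model, sentence-transformers/all-mpnet-base-v2.'''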

  embeddings = HuggingFaceEmbeddings()


In [66]:
persist_dir = "doc_db"

In [67]:
vector_db = Chroma.from_documents(documents=texts,
                                 embedding=embeddings,
                                 persist_directory=persist_dir)

In [None]:
retriever = vector_db.as_retriever()
'''What it does:
Chroma.from_documents (previous cell): Embeds the text chunks and saves them to a Chroma DB in persist_dir.
as_retriever(): Wraps the vector store in a retriever object that finds the chunks most similar to a query (4 by default).
Purpose: Enables semantic search, that is, finding relevant document chunks for your question.'''

In [None]:
# LLM from groq
llm = ChatGroq(
    model = "llama-3.1-8b-instant",
    temperature = 0)
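'''Connects to Groq’s hosted LLM API using the llama-3.1-8b-instant model.
temperature=0: Makes the output as deterministic as possible (low creativity). It does not by itself guarantee factual answers; the grounding comes from the retrieved context.'''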
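
The effect of temperature can be illustrated with the usual softmax formulation, in which token scores are divided by the temperature before being turned into probabilities. This is a sketch under that assumption with made-up scores. It does not describe Groq's internal sampling, and a temperature of exactly 0 is commonly treated as "always pick the top token" rather than computed this way.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature shrinks, almost all probability moves to the highest-scoring token, which is why low temperatures give more repeatable output.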

In [None]:
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                 chain_type="stuff",
                                 retriever=retriever,
                                 return_source_documents=True)
'''RetrievalQA combines:
Retriever (vector DB search)
LLM (answer generation)
chain_type="stuff": Concatenates all retrieved documents and sends them to LLM.
return_source_documents=True: Keeps track of which text chunks were used in the answer.'''
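
Because `chain_type="stuff"` places every retrieved chunk into a single prompt, the prompt grows with the number of chunks. The sketch below shows that growth with placeholder text. The template wording is an assumption made for illustration and is not LangChain's actual prompt; `stuff_prompt` is a hypothetical function.

```python
def stuff_prompt(question, chunks, separator="\n\n"):
    """Build one prompt by placing every retrieved chunk ahead of the question."""
    context = separator.join(chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Four placeholder chunks of 1000 characters each, mirroring chunk_size=1000.
fake_chunks = [letter * 1000 for letter in "ABCD"]
for k in range(1, 5):
    prompt = stuff_prompt("What is python?", fake_chunks[:k])
    print(f"k={k}: prompt length = {len(prompt)} characters")
```

This approach is simple and works well while the combined chunks fit in the model's context window; with many or very long chunks, the prompt can exceed that limit.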

In [None]:
query = "What is python?"
result = qa_chain.invoke({"query": query})
result["result"]
'''What happens:
The retriever searches the Chroma DB for chunks relevant to “What is python?”
Relevant chunks are sent to Groq LLM.
LLM generates a concise answer.
result contains:
result["result"]: The answer.
result["source_documents"]: Original chunks used for reference.'''

'Python is an interpreted programming language that can be used for various purposes such as program development, desk calculations, and more. It is known for its simplicity, readability, and high-level data types, which allow for compact and expressive code. Python is also easy to use and offers a lot of structure and support for large programs, making it a popular choice among developers and programmers.'

In [79]:
print(result)

{'query': 'What is python?', 'result': 'Python is an interpreted programming language that can be used for various purposes such as program development, desk calculations, and more. It is known for its simplicity, readability, and high-level data types, which allow for compact and expressive code. Python is also easy to use and offers a lot of structure and support for large programs, making it a popular choice among developers and programmers.', 'source_documents': [Document(metadata={'source': 'C:\\Users\\ASUS\\Downloads\\Python_small.pdf'}, page_content='Python is an interpreted language, which can save you considerable time during program development because no compilation and linking is necessary. The interpreter can be used interactively, which makes it easy to experiment with features of the language, to write throw-away programs, or to test functions during bottom-up program development. It is also a handy desk calculator.\n\nPython enables programs to be written compactly and 
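
The printed dictionary is hard to read, so a short helper can list the answer and its sources separately. The sketch below runs on a hypothetical result that has the same shape as the one above. `FakeDocument` is a stand-in exposing only `.page_content` and `.metadata`, and `example.pdf` is a made-up source name.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in with the two attributes the notebook's Document objects expose."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result, preview_chars=60):
    """Return one line per source chunk: its source path and the start of its text."""
    lines = []
    for index, doc in enumerate(result["source_documents"], start=1):
        source = doc.metadata.get("source", "unknown")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"[{index}] {source}: {preview}")
    return lines


# Hypothetical result shaped like the dictionary printed above.
fake_result = {
    "query": "What is python?",
    "result": "Python is an interpreted programming language.",
    "source_documents": [
        FakeDocument(
            "Python is an interpreted language.\n\nIt is also a handy desk calculator.",
            {"source": "example.pdf"},
        ),
    ],
}

print(fake_result["result"])
for line in summarize_sources(fake_result):
    print(line)
```

With the real chain, the same function could be called on the `result` returned by `qa_chain.invoke`, assuming `return_source_documents=True` as configured above.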

In [80]:
query = "what are python functions?"
result = qa_chain.invoke({"query": query})
result["result"]

'In Python, a function is a series of statements that returns some value to a caller. It can also be passed zero or more arguments which may be used in the execution of the body. Functions are a fundamental concept in programming and are used to organize and reuse code.\n\nA function typically has the following characteristics:\n\n1. It has a name, which is used to call the function.\n2. It takes zero or more arguments, which are passed to the function when it is called.\n3. It has a body, which is the series of statements that are executed when the function is called.\n4. It returns a value, which is the result of the function\'s execution.\n\nFunctions can be used to perform a variety of tasks, such as:\n\n* Calculating a value based on input arguments\n* Performing a specific operation on a set of data\n* Returning a value based on a set of conditions\n* Organizing and reusing code\n\nFunctions can be defined using the `def` keyword, followed by the function name and a list of argum

**Overall Workflow**

- Load PDF → UnstructuredFileLoader
- Split text → CharacterTextSplitter
- Embed chunks → HuggingFaceEmbeddings
- Store in vector DB → Chroma
- Retrieve relevant chunks for query → retriever
- Generate answer using LLM → ChatGroq via RetrievalQA

**Purpose**:
The result is a PDF-based Q&A system: you ask questions about the content of your PDF, and it retrieves the relevant chunks and answers using a language model.

**Libraries Used and Their Roles**

| Library | Role |
| --- | --- |
| langchain | Orchestrates LLM workflows, chains, retrievers |
| langchain-community | Provides embeddings & vector store integrations |
| langchain-groq | Access to Groq LLMs |
| chromadb | Vector database for storing embeddings |
| unstructured | Extract text from PDFs & other documents |
| sentence-transformers | Generate embeddings for text |
| transformers | Supports Hugging Face models |
| poppler-utils | Needed to read PDFs on Linux |