# Task 1: Chat with PDF Using RAG Pipeline<br>
**Overview** <br>
The goal is to implement a Retrieval-Augmented Generation (RAG) pipeline that allows users to
interact with semi-structured data in multiple PDF files. The system should extract, chunk,
embed, and store the data for efficient retrieval. It will answer user queries and perform
comparisons accurately, using the selected LLM to generate responses.<br>
**Functional Requirements** <br>
**1. Data Ingestion**
• Input: PDF files containing semi-structured data.
• Process:
o Extract text and relevant structured information from PDF files.
o Segment data into logical chunks for better granularity.
o Convert chunks into vector embeddings using a pre-trained embedding model.
o Store embeddings in a vector database for efficient similarity-based retrieval. <br>
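
The segmentation step above can be sketched as a sliding character window. This is a minimal illustration, not the notebook's own method: the function name `chunk_with_overlap` and the `overlap` parameter are assumptions introduced here, on the premise that letting neighbouring chunks share a few characters reduces the chance of cutting a value in half at a chunk boundary.

```python
def chunk_with_overlap(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Tiny sizes so the overlap is visible
print(chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With `overlap=0` this reduces to plain fixed-size splitting.
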
**2. Query Handling**
• Input: User's natural language question.
• Process:
o Convert the user's query into vector embeddings using the same embedding
model.
o Perform a similarity search in the vector database to retrieve the most relevant
chunks.
o Pass the retrieved chunks to the LLM along with a prompt or agentic context to
generate a detailed response.<br>
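
To make the similarity search step concrete, here is a small pure-Python sketch of top-k retrieval by cosine similarity. The vectors are hand-written toy values rather than real embeddings, and the helper names `cosine_similarity` and `top_k` are hypothetical; a vector database performs the same ranking at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the indices of the k chunk vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings" for three chunks and one query
toy_chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
toy_query_vec = [0.9, 0.1]
print(top_k(toy_query_vec, toy_chunk_vecs, k=2))  # [0, 2]
```

The returned indices are then used to look up the chunk texts that go into the LLM prompt.
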
**3. Comparison Queries**
• Input: User's query asking for a comparison.
• Process:
o Identify and extract the relevant terms or fields to compare across multiple PDF
files.
o Retrieve the corresponding chunks from the vector database.
o Process and aggregate data for comparison.
o Generate a structured response (e.g., tabular or bullet-point format).<br>
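
The aggregation and structured-output steps above could look roughly like the sketch below. It assumes the field values have already been extracted per file into a dictionary; the function name `comparison_table`, the file names, and the values are placeholders for illustration only, not data from any real PDF.

```python
def comparison_table(field_values, fields):
    """Render {file name: {field: value}} as a Markdown table, one row per field.

    Fields missing from a file are shown as 'n/a'.
    """
    header = "| Field | " + " | ".join(field_values) + " |"
    divider = "|" + " --- |" * (len(field_values) + 1)
    rows = [header, divider]
    for field in fields:
        cells = [str(values.get(field, "n/a")) for values in field_values.values()]
        rows.append(f"| {field} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Placeholder values standing in for fields pulled from retrieved chunks
extracted = {
    "report_a.pdf": {"revenue": "10", "headcount": "3"},
    "report_b.pdf": {"revenue": "12"},
}
print(comparison_table(extracted, fields=["revenue", "headcount"]))
```

Rendering missing fields as `n/a` keeps the table rectangular even when one PDF lacks a value, which makes gaps in the source data visible rather than silently dropped.
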
**4. Response Generation**
• Input: Relevant information retrieved from the vector database and the user query.
• Process:
o Use the LLM with retrieval-augmented prompts to produce responses with exact
values and context.
o Ensure factuality by incorporating retrieved data directly into the response.
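
One way to approach the factuality requirement is to number the retrieved passages in the prompt and tell the model to answer only from them. The sketch below is an assumption about how such a prompt might be assembled; `build_grounded_prompt` and the instruction wording are hypothetical and would need tuning for a real model.

```python
def build_grounded_prompt(query, retrieved_chunks):
    """Assemble a prompt that restricts the model to numbered context passages."""
    numbered = [f"[{i}] {chunk.strip()}" for i, chunk in enumerate(retrieved_chunks, start=1)]
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the numbered context passages below. "
        "Quote exact values as they appear, cite passage numbers, and reply "
        "'Not found in the provided documents.' if the context does not contain the answer.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


# Placeholder chunks and question, for illustration only
print(build_grounded_prompt("What is the value of X?", ["X is listed in table 1.", "Y is unrelated."]))
```

Giving the model an explicit fallback sentence makes it less likely to fill gaps with general knowledge when retrieval misses the relevant passage.
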

In [4]:
pip install PyMuPDF transformers sentence-transformers faiss-cpu openai==0.28

Collecting PyMuPDF
  Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading pymupdf-1.25.1-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: PyMuPDF, faiss-cpu
Successfully installed PyMuPDF-1.25.1 faiss-cpu-1.9.0.post1


In [6]:
import fitz  # PyMuPDF for PDF extraction
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer
import openai

# Step 1: Extract Text from PDF
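def extract_text_from_pdf(pdf_path):
    # Open the PDF in a context manager so the file handle is always released
    with fitz.open(pdf_path) as doc:
        # Concatenate the plain text of every page, in page order
        return "".join(page.get_text() for page in doc)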

# Step 2: Text Chunking
def chunk_text(text, chunk_size=500):
    # Split text into smaller chunks for processing
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return chunks

# Step 3: Embed Text using Sentence-Transformers (or OpenAI)
def embed_text(chunks, model_name="all-MiniLM-L6-v2"):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks, convert_to_tensor=True)
    return embeddings

# Step 4: Store Embeddings in FAISS (for efficient retrieval)
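def create_faiss_index(embeddings):
    # FAISS expects a C-contiguous float32 array of shape (n_chunks, dim)
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    dim = embeddings.shape[1]  # Dimension of the embeddings
    index = faiss.IndexFlatL2(dim)
    # Normalize to unit length so L2 distance ranks results like cosine similarity;
    # query vectors should be normalized the same way before searching
    faiss.normalize_L2(embeddings)
    index.add(embeddings)  # Add the embeddings to the index
    return index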

# Step 5: Perform Retrieval and Response Generation
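def retrieve_relevant_chunks(query, index, chunks, model_name="all-MiniLM-L6-v2"):
    # Embed the query as a float32 NumPy array, which is what FAISS expects
    model = SentenceTransformer(model_name)
    query_embedding = model.encode([query], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(query_embedding)  # Match the normalization applied to the index

    # Search in FAISS
    D, I = index.search(query_embedding, k=5)  # Retrieve top 5 relevant chunks
    # FAISS pads the result with -1 when fewer than k chunks exist, so skip those
    relevant_chunks = [chunks[i] for i in I[0] if i != -1]
    return relevant_chunks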

def generate_response(query, relevant_chunks):
    context = "\n".join(relevant_chunks)
    # Use OpenAI or any LLM model to generate a response based on the context.
    # Never hard-code an API key in a notebook: openai<1.0 reads it from the
    # OPENAI_API_KEY environment variable, which must be set before `import openai` runs.
    # openai.ChatCompletion.create is the openai<1.0 (e.g. 0.28) interface for chat models like gpt-4o-mini
    response = openai.ChatCompletion.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Answer the following question based on the context:\n\n{context}\n\nQuestion: {query}"}
        ],
        max_tokens=300
    )
    return response.choices[0].message.content.strip()  # Access the message content


# End-to-end pipeline: extract, chunk, embed, index, retrieve, generate
def rag_pipeline(pdf_path, query):
    all_chunks = []

    # Extract text from the PDF and chunk it
    text = extract_text_from_pdf(pdf_path)
    chunks = chunk_text(text)
    all_chunks.extend(chunks)

    # Embed the chunks and store in FAISS index
    embeddings = embed_text(all_chunks)
    # Move the tensor to CPU before converting, so this also works when the model runs on a GPU
    index = create_faiss_index(embeddings.cpu().numpy())

    # Retrieve relevant chunks for the query
    relevant_chunks = retrieve_relevant_chunks(query, index, all_chunks)

    # Generate and return the response
    response = generate_response(query, relevant_chunks)
    return response

# Entry point: get the query from the user
query = input("Please enter your query: ")

# Path of the PDF file to process (replace with your actual PDF path)
pdf_path = "/content/test1.pdf"

# Run the RAG pipeline and get the response
response = rag_pipeline(pdf_path, query)
print("Response: ", response)



Please enter your query: from page 2 get exact unempoyment information based on type of degree input
Response:  It appears that the content you provided doesn’t explicitly include any data or information regarding unemployment rates based on different types of degrees. If this information was contained in a specific table or section not included in your message, please share that portion directly so I can assist you better.

If you want to know about typical unemployment rates by degree type, I can provide general knowledge in that area. Generally, higher education levels tend to correlate with lower unemployment rates. Would you like to know more about that or provide specific data for analysis?


In [3]:
pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.57.4
    Uninstalling openai-1.57.4:
      Successfully uninstalled openai-1.57.4
Successfully installed openai-0.28.0
