# 🧪 Exercise 3: PDF Question Answering using Chunking, Vector Search, and LLM

In this exercise, you'll complete a **retrieval-augmented generation (RAG)** pipeline that:
- Chunks and embeds the content of a PDF
- Stores the chunks in an in-memory vector database (Qdrant)
- Uses a local LLM to answer yes/no/unknown questions based strictly on the PDF content

You will implement the missing components of this pipeline, focusing on document chunking, retrieval, and prompt construction.

---

### 🎯 Goal

Your objective is to build a system that can answer **yes**, **no**, or **unknown** questions based solely on the information in a given **PDF file**.

- The answer must be **one word only**: `"Yes"`, `"No"`, or `"Unknown"`
- The LLM must not use external knowledge — it must rely **only** on content retrieved from the PDF
- The total prompt sent to the LLM is limited to **2000 characters**, including:
  - The instruction
  - Retrieved context chunks  
  *(🚫 The question itself is not included in this limit and will be added separately)*

---

### 🧠 What You Need to Do

You must implement the following three core functions:

---

#### ✅ `chunk_and_store(text: str) -> tuple[QdrantClient, SentenceTransformer]`

This function prepares the document for retrieval.

**Responsibilities:**
- Chunk the input `text` into smaller segments
- Encode each chunk into a vector using a pre-trained embedding model such as `all-MiniLM-L6-v2`
- Store the vectors in an in-memory **Qdrant** database using `QdrantClient(":memory:")`
- Store relevant metadata for each chunk (e.g., start offset, chunking method)

**Returns:**
- `client`: a Qdrant client object that contains the embedded chunks
- `model`: the SentenceTransformer model used for encoding
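
The chunking and metadata steps can be sketched without any external library. The helper below is only an illustration, assuming a fixed-size sliding window: the name `chunk_with_offsets`, the default sizes, and the `"fixed_window"` label are hypothetical and not part of the template.

```python
def chunk_with_offsets(text: str, size: int = 400, overlap: int = 200) -> list[dict]:
    """Split text into fixed-size, overlapping windows and record where each one starts."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("text must be a non-empty string")
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")

    step = size - overlap  # distance between the starts of consecutive windows
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + size],
            "start_offset": start,      # character position in the original text
            "method": "fixed_window",   # hypothetical label for the chunking strategy
        })
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

Each dictionary could then serve as the `payload` of a `PointStruct`, with the vector coming from the embedding model.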

---

#### ✅ `create_prompt(question: str, client: QdrantClient, model: SentenceTransformer) -> str`

This function builds the prompt to be sent to the LLM.

**Responsibilities:**
- Retrieve the top-k most relevant chunks from Qdrant using the question as a query
- Construct a prompt that includes:
  - A fixed instruction (you may define this in the function)
  - The most relevant retrieved chunks
- The full prompt must not exceed **2000 characters**, excluding the question (the same limit stated in the Goal section and enforced by `call_llm`)

**Returns:**
- A string containing the prompt (instruction + context), **excluding** the question
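
One way to respect the character budget is to add whole chunks, best match first, and stop as soon as the next one would not fit. This is a minimal sketch rather than the required implementation: `assemble_prompt` is a hypothetical helper, the chunks are assumed to be already sorted by relevance, and the default `limit` follows the 2000-character figure from the Goal section.

```python
def assemble_prompt(instruction: str, chunks: list[str], limit: int = 2000) -> str:
    """Append whole chunks after the instruction until the character budget is used up."""
    separator = "\n\n"
    prompt = instruction
    for chunk in chunks:
        candidate = prompt + separator + chunk.strip()
        if len(candidate) > limit:
            break  # this chunk and all lower-ranked ones are left out
        prompt = candidate
    return prompt
```

Stopping at the first chunk that does not fit keeps every included chunk intact. Cutting the last chunk to the remaining budget is an alternative that uses the space more fully.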

---


#### ✅ `my_call_llm(prompt: str, question: str) -> str`

This function provides an interface to the LLM, but must not invoke the LLM directly.

**Responsibilities:**
- Optionally apply logic to enhance or adapt the query (e.g., pre-processing the prompt, logging, enforcing formatting rules)
- Call the provided `call_llm(prompt, question)` function to actually interact with the model
- Return the result unchanged, or with controlled, explainable adjustments **that do not modify the content of the LLM’s response**

**Rules:**
- ❌ Must **not** embed, re-embed, or analyze any part of the original document or its chunks
- ❌ Must **not** call `subprocess`, `ollama`, or any direct LLM API
- ✅ Must **only** call `call_llm(prompt, question)` to obtain the response

**Returns:**
- A string (typically `"Yes"`, `"No"`, or `"Unknown"`) returned by the LLM, possibly post-processed for stability, format, or logging

**Purpose:**
- This function acts as a controlled gateway to LLM usage, allowing improvements in how prompts are used or tracked, without modifying or reprocessing the document or query logic
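
A wrapper of this kind might normalize the reply to one of the three allowed labels. The sketch below passes the LLM function in as a parameter so it can be tried without Ollama. `normalize_answer` and `guarded_call` are hypothetical names, and treating unrecognized replies as `"Unknown"` is an assumption about what counts as an acceptable adjustment.

```python
from typing import Callable

ALLOWED = ("Yes", "No", "Unknown")

def normalize_answer(raw: str) -> str:
    """Map a raw model reply onto one of the three allowed labels."""
    words = raw.strip().split()
    # Look only at the first word, ignoring surrounding punctuation and case.
    first = words[0].strip(".,!:;\"'*").lower() if words else ""
    for label in ALLOWED:
        if first == label.lower():
            return label
    return "Unknown"  # assumption: anything unrecognized is reported as "Unknown"

def guarded_call(prompt: str, question: str, llm: Callable[[str, str], str]) -> str:
    """Call the supplied LLM function once and normalize its reply."""
    if not prompt.strip() or not question.strip():
        return "Unknown"  # nothing meaningful to ask
    return normalize_answer(llm(prompt, question))
```

In the notebook, `llm` would be the provided `call_llm`, so the wrapper never touches `subprocess` or `ollama` itself.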

---


> 💡 **Tip:** You are encouraged to define helper functions to simplify your code and improve readability.

---

### ✨ Provided Function (DO NOT CHANGE)

#### ✅ `call_llm(prompt: str, question: str) -> str`

This function is already implemented for you.

- It calls the `llama3.2:3b` model via `ollama`
- Receives the question and your constructed prompt
- Returns the LLM's answer (expected: `"Yes"`, `"No"`, or `"Unknown"`)

You do **not** need to re-implement or modify this function.

---

### 🧪 Evaluation Criteria

- Your system will be evaluated using a **corpus of 100 questions** on a **known PDF document**
- You will be given in advance a **sample of 20 questions** from the evaluation corpus for development and testing
- Your code must generate the correct **yes/no/unknown** answers for the full 100-question set
- **Total execution time** will be measured for the entire run (reading, chunking, querying, and answering)
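
During development, a small helper can compare the pipeline's answers for the sample questions against answers you have verified yourself. This is a sketch only: `accuracy` is a hypothetical helper, and the expected labels are assumed to come from your own reading of the PDF.

```python
def accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of answers that match the expected label, ignoring case and surrounding spaces."""
    if len(predicted) != len(expected):
        raise ValueError("both lists must have the same length")
    if not expected:
        return 0.0
    hits = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return hits / len(expected)
```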

---

### 🚫 Restrictions

- **Do not** modify any code cell marked with `# DO NOT CHANGE`.
- **Do not** override any variable or function defined in those protected cells.
- Your code must run successfully in the **Lab 10002** environment (`GenAI025_CUDA`), using only the libraries provided.

---

> 💡 **Tip:** Write clean, modular code. Aim for accuracy, clarity, and runtime efficiency.

---

## ✅ Good luck!

---

## 📉 Points Deduction Rules

1. **Modifying restricted code**  
   - Changing any `# DO NOT CHANGE` cell or variable: **–50 points**

2. **Importing any additional library**  
   - Importing any library that is **not already used** in the template: **–5 points per library**  
   - ✅ *No penalty* for importing additional modules or functions from libraries that are already used (e.g., importing more from `langchain` or `sentence_transformers`)

3. **Code compatibility**  
   - Code fails to run in Lab 10002: **–100 points**

4. **Execution time (total run of 100 questions)**  
   - Runs for **5–10 minutes**: **–30 points**  
   - Runs for **>10 minutes**: **–100 points**

5. **Violating restrictions inside `my_call_llm()`**  
   - ❌ Must **not** embed, re-embed, or analyze any part of the original document or its chunks  
   - ❌ Must **not** call `subprocess`, `ollama`, or any direct LLM API  
   - ✅ Must **only** call `call_llm(prompt, question)` to obtain the response  
   - Penalty: **–100 points**

---

## 🧮 Final Score Calculation

$$
\text{Final Score} = \min \left(100,\ \frac{\text{Your correct answers}}{\text{Gadi’s correct answers}} \times 100 \right) - \text{Total Deductions}
$$
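
The formula can be checked with a few lines of Python. `final_score` is a hypothetical helper, and the numbers in the usage note are made up for illustration only.

```python
def final_score(correct: int, reference_correct: int, deductions: int = 0) -> float:
    """Relative accuracy capped at 100, minus the total deductions."""
    if reference_correct <= 0:
        raise ValueError("reference_correct must be positive")
    # Multiply before dividing to keep the floating-point result as exact as possible.
    relative = correct * 100 / reference_correct
    return min(100.0, relative) - deductions
```

For example, 72 correct answers against a reference of 80, with a 30-point runtime deduction, gives `final_score(72, 80, 30)`, which evaluates to 60.0. Because the deductions are subtracted after the cap, exceeding the reference score cannot offset a penalty.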

---

📌 *Submit clean, working code. Only modify what you're allowed to. You got this!*


In [None]:
# Set the paths according to your configuration.
PDF_PATH = "MyBank Credit Card Brochure.pdf"
QUESTIONS_PATH = "questions.txt"


In [None]:
# DO NOT CHANGE
import fitz
import os
import uuid
# import spacy
import subprocess
from langchain.text_splitter import (
    CharacterTextSplitter,
    NLTKTextSplitter,
    SpacyTextSplitter,
    RecursiveCharacterTextSplitter
)
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.http.models import VectorParams, PointStruct
import time

In [None]:
# DO NOT CHANGE
def load_pdf(pdf_path):
    with fitz.open(pdf_path) as doc:
        text = "\n".join(page.get_text() for page in doc)
    return text


In [None]:
# DO NOT CHANGE
def load_questions(questions_path):
    with open(questions_path, "r") as f:
        questions = [q.strip() for q in f.readlines() if q.strip().endswith('?')]
    return questions


### ✅ Task 1: Implement `chunk_and_store(text)`

In [None]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"

def chunk_and_store(text: str):
    """
    Splits a given text into smaller chunks and stores them in a vector database or an internal memory structure.

    Parameters:
    ----------
    text : str
        The input text to be processed. This should be a large block of text (e.g., a document, an article, or a report).

    Behavior:
    --------
    1. The function splits the input `text` into manageable chunks based on predefined chunking rules 
       (e.g., maximum character count, sentence boundaries, semantic meaning).
    2. Each chunk is optionally enriched with metadata (e.g., chunk number, character offsets, original document ID).
    3. Each chunk is stored in a storage system such as:
       - An in-memory list or dictionary (for simple setups)
       - A vector database (e.g., Qdrant, FAISS, ChromaDB) after embedding the chunk using an encoder model
    
    Returns:
    -------
    client : qdrant_client.QdrantClient
        A Qdrant client object that contains the embedded and stored chunks.

    model : sentence_transformers.SentenceTransformer
        The SentenceTransformer model used for embedding the text chunks.
   
    Notes:
    -----
    - If using a vector database, the chunk is first passed through an embedding model to create a vector representation.
    - Chunking methods might vary (e.g., fixed-size, sentence-based, semantic-split) depending on implementation details.
    - The function creates the in-memory storage backend itself; no external database needs to be running.

    Raises:
    ------
    ValueError
        If the input `text` is empty or not a valid string.

    Example:
    --------
    >>> chunk_and_store("This is a long article about machine learning...")
    # Splits the article into chunks and stores them internally or externally.

    """
    # Implementation goes here
    # TODO: implement chunking using multiple strategies
    # TODO: create in-memory Qdrant collection
    # TODO: embed each chunk and store in the DB with metadata (chunking method, start_offset)
    
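    # Preprocess PDF text: rejoin words hyphenated across line breaks,
    # turn newlines into spaces, and collapse repeated spaces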
    text = text.replace('-\n', '').replace('\n', ' ')
    while '  ' in text:
        text = text.replace('  ', ' ')
    # Initialize model + in-memory Qdrant
    model = SentenceTransformer('all-MiniLM-L6-v2')
    dim = model.get_sentence_embedding_dimension()
    client = QdrantClient(':memory:')
    client.recreate_collection(
        collection_name='pdf_chunks',
        vectors_config=VectorParams(size=dim, distance='Cosine'),
    )
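    # Chunk into ~400-char windows with ~200-char overlap.
    # A plain " " separator is used: newlines were already removed above, and
    # CharacterTextSplitter (default keep_separator=False) re-joins the pieces with
    # the separator string itself, so a regex pattern would end up inside the chunks.
    splitter = CharacterTextSplitter(
        separator=" ",
        chunk_size=400,
        chunk_overlap=200,
    )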
    chunks = splitter.split_text(text)
    # Embed & store
    points = []
    for idx, chunk in enumerate(chunks):
        vec = model.encode(chunk).tolist()
        points.append(PointStruct(id=idx, vector=vec, payload={'text': chunk}))
    client.upsert(collection_name='pdf_chunks', points=points)
    return client, model

### ✅ Task 2: Implement `create_prompt(question, client, model)`

In [None]:
def create_prompt(question: str, client, model):
    """
    Creates a context-only prompt for an LLM by retrieving relevant chunks from a vector database 
    based on a user question, using a vector similarity search.

    Parameters:
    ----------
    question : str
        The input question provided by the user. It should be a natural language query.

    client : qdrant_client.QdrantClient
        The Qdrant client connected to the database that contains stored and embedded text chunks.

    model : sentence_transformers.SentenceTransformer
        The SentenceTransformer model used to encode the input question into a vector embedding 
        for similarity search.

    Behavior:
    --------
    1. The function encodes the input `question` into a vector using the provided `model`.
    2. It queries the `client` (Qdrant database) using vector similarity search to find the most relevant chunks.
    3. It assembles a prompt by combining the retrieved chunks and other info (but without adding the question itself).
    4. The resulting prompt consists **only of context**, intended to be passed separately along with the question 
       in a later step when calling the LLM.

    Returns:
    -------
    prompt : str
        A fully formatted prompt string. 
        **The user's question is NOT included in the returned prompt.**

    Notes:
    -----
    - The search typically retrieves the top-k most similar chunks (e.g., top 5).
    - Retrieved chunks are usually concatenated together, separated by delimiters (e.g., "\n\n").
    - The question should be provided separately to the LLM after sending the prompt, or combined externally later.
    - This function assumes that both the client and model are already initialized and ready to use.

    Raises:
    ------
    ValueError
        If the input `question` is empty or not a valid string.

    Example:
    --------
    >>> context_prompt = create_prompt("What benefits does the Platinum Voyager Card offer?", client, model)
    >>> print(context_prompt)
    "Context:\n<retrieved chunks>"

    # Later, when calling the LLM:
    # final_prompt = context_prompt + "\n\nQuestion:\nWhat benefits does the Platinum Voyager Card offer?"
    """
    # TODO: use find_chunks()
    # TODO: build the prompt with CONTEXT_HEADER and top chunks
    # TODO: truncate to PROMPT_CHAR_LIMIT if needed
    # 1) pull out each number mentioned in the question separately
    #    (thousands separators dropped, decimal points kept)
    nums = []
    for word in question.split():
        digits = "".join(ch for ch in word if ch.isdigit() or ch == ".").strip(".")
        if digits:
            nums.append(digits)
    # 2) embed + retrieve top-20
    q_vec = model.encode(question).tolist()
    results = client.search(
        collection_name='pdf_chunks',
        query_vector=q_vec,
        limit=20,
        with_payload=True,
        with_vectors=False,
    )
    # 3) if the question mentions numbers, prefer chunks containing at least one of them
    if nums:
        numeric_hits = [
            h for h in results
            if any(n in h.payload["text"].replace(",", "") for n in nums)
        ]
        candidates = numeric_hits if numeric_hits else results
    else:
        candidates = results
    # 4) hybrid re-rank by token overlap
    tokens = [w.strip(".,?!;:").lower() for w in question.split() if len(w) > 2]
    scored = []
    for h in candidates:
        txt = h.payload["text"].lower()
        overlap = sum(1 for t in tokens if t in txt)
        bonus = (0.1 * overlap / len(tokens)) if tokens else 0
        scored.append((h.score + bonus, h))
    scored.sort(key=lambda x: x[0], reverse=True)
    top_hits = [h for _, h in scored[:6]]
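    # 5) assemble ≤400-char prompt; a chunk that does not fit whole is cut to the
    #    remaining budget, so the prompt never ends up with no context at all
    limit = 400
    instr = "Answer Yes/No/Unknown:"
    prompt = instr + "\n\n"
    for h in top_hits:
        remaining = limit - len(prompt)
        if remaining <= 0:
            break
        chunk = h.payload["text"].replace("\n", " ").strip()
        prompt += chunk[:remaining] + "\n\n"
    return prompt[:limit].rstrip()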

In [None]:
# DO NOT CHANGE
# LLM via Ollama
def call_llm(prompt: str, question: str) -> str:
    """
    Calls a local LLM using the Ollama CLI and returns the model's response.

    This function sends a prompt to the locally hosted `llama3.2:3b` model via the `ollama` command-line interface.
    It ensures the prompt does not exceed 500 characters and captures the model's output.

    Parameters:
        prompt (str): The full input prompt to be sent to the LLM. It should include context and instructions,
                      but not the question itself if using external control.

    Returns:
        str: The raw response generated by the model. If the model call times out, returns "Unknown".

    Notes:
        - The prompt is truncated to a maximum of 2000 characters before being sent.
        - The model is expected to return a one-word answer such as "Yes", "No", or "Unknown".
    """
    prompt = prompt[:2000] + "\nQuestion: " + question
    try:
        result = subprocess.run(
            ["ollama", "run", "llama3.2:3b"],
            input=prompt.encode("utf-8"),
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=30
        )
        return result.stdout.decode("utf-8").strip()
    except subprocess.TimeoutExpired:
        return "Unknown"

### ✅ Task 3: Implement `my_call_llm(prompt: str, question: str)`

In [None]:
def my_call_llm(prompt: str, question: str) -> str:
    """
    A wrapper function for controlled interaction with the local LLM.

    This function allows for preprocessing, logging, or evaluation logic
    around a call to the provided `call_llm()` function, but it must not
    directly interact with the LLM (e.g., via subprocess or embedding logic).

    🚫 Restrictions:
        - Must NOT embed, re-embed, or analyze any part of the original document or its chunks
        - Must NOT call `subprocess`, `ollama`, or any direct LLM APIs
        - Must ONLY interact with the LLM via the provided `call_llm(prompt, question)` function

    ✅ Allowed:
        - Logging or printing
        - Handling empty prompts or question formats
        - Calling `call_llm()` multiple times for retry logic or consistency checking
        - Standard string manipulations (if needed)

    Parameters:
        prompt (str): The constructed prompt (instructions + context, excluding the question).
        question (str): The original user question (to be passed to the LLM interface).

    Returns:
        str: A one-word answer, typically one of "Yes", "No", or "Unknown".

    Example:
        >>> my_call_llm("Context: Data is collected by Google...", "Does Google share my location?")
        "Yes"
    """
    
    # Debug passthrough: returns whatever the LLM says
    print("\n--- DEBUG PROMPT ---\n", prompt)
    print("--- DEBUG QUESTION ---\n", question)
    resp = call_llm(prompt, question)
    print("--- RAW LLM RESPONSE ---\n", resp)
    return resp

In [None]:
# DO NOT CHANGE
def run_rag_pipeline(pdf_path,questions_path):
    """
    Runs the RAG pipeline for all questions in the input list, printing full results and tracking execution time.

    The process includes:
    1. Loading and chunking the PDF.
    2. Embedding and storing chunks in Qdrant.
    3. Answering each question using a locally hosted LLM (via Ollama).
    4. Printing the full Q&A pairs.
    5. Reporting total runtime with a warning if the run exceeds 5 or 10 minutes.
    6. Printing a summary of answers only (one per line).
    """
    start_time = time.time()

    text = load_pdf(pdf_path)
    questions = load_questions(questions_path)

    # Chunk and store once (not inside the loop)
    client, model = chunk_and_store(text) # your function

    all_answers = []

    print("🧠 Answering questions...")
    for question in questions:
        prompt = create_prompt(question, client, model) # your function
        answer = my_call_llm(prompt,question)
        all_answers.append((question, answer)) 
        # print(f"\nQ: {prompt} \n Q: {question} \n A: {answer} \n {'-'*60} \n")
        # print(f"Q: {question} \n A: {answer} \n {'-'*60} \n")

    total_time = time.time() - start_time
    minutes = total_time / 60

    print("\n⏱️ Total Runtime: {:.2f} seconds ({:.2f} minutes)".format(total_time, minutes))
    if minutes > 10:
        print("⚠️ Warning: Runtime exceeds 10 minutes!")
    elif minutes > 5:
        print("⚠️ Notice: Runtime exceeds 5 minutes.")

    print("\n📝 Summary of Answers:")
    i=0
    for _, answer in all_answers:
        i+=1
        print(i,". ",answer)

In [None]:
# DO NOT CHANGE
run_rag_pipeline(PDF_PATH,QUESTIONS_PATH)