In [38]:
# Imports & Setup
import os
import numpy as np
from typing import List, Tuple

import PyPDF2 #read pdfs
#text processing
import nltk
from nltk.tokenize import sent_tokenize

from sentence_transformers import SentenceTransformer # embedding model
import faiss # vector search - facebook ai similarity search
from transformers import pipeline

# sentence tokenizer
nltk.download('punkt')
nltk.download('punkt_tab')

EMBED_MODEL = "sentence-transformers/all-MiniLM-L6-v2" #small, fast
embedder = SentenceTransformer(EMBED_MODEL)

[nltk_data] Downloading package punkt to /Users/lydiaavin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/lydiaavin/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [7]:
# Load PDF as Text - just extracts raw text from pdf
def load_text_from_pdf(pdf_path: str) -> str:
    text_chunks = []
    with open(pdf_path, 'rb') as f:
        reader = PyPDF2.PdfReader(f)
        for page in reader.pages:
            page_text = page.extract_text() or ""  # fall back to "" when a page yields no text
            text_chunks.append(page_text)
    return "\n".join(text_chunks)

In [8]:
#Chunk Text into Passages
def chunk_text(text: str, chunk_size_words: int = 300, overlap_words: int = 50) -> List[str]:
    sentences = sent_tokenize(text)
    chunks = []
    current_words = []

    for s in sentences:
        words = s.split() #sentence to words
        if len(current_words) + len(words) <= chunk_size_words:
            current_words.extend(words)
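        else:
            if current_words:  # skip the empty chunk produced when the first sentence is already too long
                chunks.append(" ".join(current_words))
            # carry the last few words forward so context spanning a chunk boundary is not lost
            overlap = current_words[-overlap_words:] if overlap_words > 0 else []
            current_words = overlap + words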

    if current_words:
        chunks.append(" ".join(current_words))

    return [c.strip() for c in chunks if c.strip()]

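# Why chunking?
Embedding and QA models accept only a limited amount of text per input,
so the full PDF text cannot be processed in one pass.
Instead, we split (chunk) the text into passages of roughly 300 words each,
and carry the last 50 words of each chunk into the next one to preserve context across boundaries.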
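
To make the overlap concrete, here is a minimal, dependency-free sketch of the same sentence-based chunking on a toy text. The names `split_sentences` and `chunk_words` are illustrative, and the regex splitter is only a naive stand-in for `sent_tokenize`, so it will not match NLTK on real text.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    # Naive stand-in for nltk's sent_tokenize (assumption for this sketch):
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def chunk_words(text: str, chunk_size_words: int, overlap_words: int) -> List[str]:
    chunks: List[str] = []
    current: List[str] = []
    for sentence in split_sentences(text):
        words = sentence.split()
        if len(current) + len(words) <= chunk_size_words:
            current.extend(words)
        else:
            if current:
                chunks.append(" ".join(current))
            # Seed the next chunk with the tail of the previous one.
            tail = current[-overlap_words:] if overlap_words > 0 else []
            current = tail + words
    if current:
        chunks.append(" ".join(current))
    return chunks


toy_text = "Alpha beta gamma delta. Epsilon zeta eta theta. Iota kappa lambda mu."
toy_chunks = chunk_words(toy_text, chunk_size_words=8, overlap_words=2)
for chunk in toy_chunks:
    print(chunk)
```

The second chunk starts with the last two words of the first, which is the overlap at work. Tiny sizes (8 words, 2 overlap) are used here only so the effect is visible; the notebook uses 300 and 50.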

In [39]:
# Build FAISS Index with Embeddings - numerical representations of text
def build_faiss_index(chunks: List[str]) -> Tuple[faiss.IndexFlatL2, np.ndarray]:
    embeds = embedder.encode(chunks, convert_to_numpy=True, show_progress_bar=True) #generates embeddings
    dim = embeds.shape[1]
    index = faiss.IndexFlatL2(dim) #initialize index
    index.add(embeds.astype('float32'))
    return index, embeds

To answer questions, we first need to *retrieve* the most relevant chunks.
For retrieval, we:
   1. Convert each chunk into a vector (embedding) using SentenceTransformer.
   2. Store these vectors inside a FAISS index for fast similarity search.
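
As a rough illustration of what a flat L2 index does, the sketch below runs a brute-force nearest-neighbor search in plain Python. It is not FAISS's implementation; `squared_l2` and `flat_l2_search` are hypothetical helper names, and the 2-dimensional vectors are made up (the real embeddings here have 384 dimensions).

```python
from typing import List, Tuple

Vector = List[float]


def squared_l2(a: Vector, b: Vector) -> float:
    # Squared Euclidean distance; smaller means more similar.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(stored: List[Vector], query: Vector, k: int) -> List[Tuple[int, float]]:
    # Compare the query against every stored vector and keep the k closest.
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(stored)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Made-up unit-length vectors standing in for chunk embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
toy_query = [0.8, 0.6]
toy_hits = flat_l2_search(toy_vectors, toy_query, k=2)
print(toy_hits)
```

The returned values are distances, not similarity scores, so the best match has the smallest number. For unit-length vectors the squared L2 distance equals 2 - 2·cos(a, b), so ranking by L2 distance gives the same order as ranking by cosine similarity.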

In [30]:
# Retrieval & Context Assembly
def retrieve_top_k(question: str, chunks: List[str], index: faiss.IndexFlatL2, k: int = 4) -> List[Tuple[int, float]]:
    q_emb = embedder.encode([question], convert_to_numpy=True).astype('float32') #question to embedding
    D, I = index.search(q_emb, k)  # D: squared L2 distances (lower = closer), I: indices of the matching chunks
    results = []
    for idx, dist in zip(I[0], D[0]):
        if idx == -1:
            continue
        results.append((int(idx), float(dist)))
    return results

def assemble_context(retrieved: List[Tuple[int, float]], chunks: List[str], max_concat_chars: int = 1000) -> str:
    pieces = []
    total_len = 0
    for idx, _ in retrieved:
        part = chunks[idx].strip()
        if not part:
            continue
        if total_len + len(part) > max_concat_chars and pieces:
            break
        pieces.append(part)
        total_len += len(part)
    return "\n\n".join(pieces)

At query time, we:
   1. Encode the user question into an embedding.
   2. Search the FAISS index for the most similar chunks (semantic search).
   3. Combine those chunks into a "context" to feed into a QA model.
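
One detail of step 3 is worth checking: how the character budget interacts with chunk size. The sketch below reuses the budget logic from `assemble_context` on made-up chunks of 1800 characters each (roughly what a 300-word chunk might occupy; the exact length is an assumption). `assemble_with_budget` is a hypothetical name for this illustration.

```python
from typing import List


def assemble_with_budget(ranked_chunks: List[str], max_chars: int) -> str:
    pieces: List[str] = []
    total = 0
    for part in ranked_chunks:
        # The first chunk is always kept; stop once the next one would exceed the budget.
        if total + len(part) > max_chars and pieces:
            break
        pieces.append(part)
        total += len(part)
    return "\n\n".join(pieces)


# Made-up ranked chunks, 1800 characters each.
toy_ranked = ["a" * 1800, "b" * 1800, "c" * 1800]
small_budget = assemble_with_budget(toy_ranked, max_chars=1000)
large_budget = assemble_with_budget(toy_ranked, max_chars=4000)
print(len(small_budget), len(large_budget))
```

With a 1000-character budget only the top-ranked chunk survives, which is consistent with the context lengths of 1835 and 1837 characters reported in this notebook's run. A budget of 4000 characters lets two of the toy chunks through.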

In [31]:
#QA Pipeline
qa_model_name = "deepset/roberta-base-squad2"  #for extractive question answering
qa = pipeline("question-answering", model=qa_model_name, tokenizer=qa_model_name)

def answer_question(question, chunks, index, top_k=3):
    retrieved = retrieve_top_k(question, chunks, index, k=top_k) #retrieve top k relevant chunks
    if not retrieved:
        return {
            "answer": "No relevant text found.",
            "score": 0.0,
            "retrieved": [],
            "context": ""   
        }
    
    context = assemble_context(retrieved, chunks).strip()
    if not context:
        return {
            "answer": "No context available.",
            "score": 0.0,
            "retrieved": retrieved,
            "context": ""  
        }

    print("Retrieved indices:", [i for i, _ in retrieved])
    print("Context length:", len(context))
    
    result = qa(question=question, context=context)
    return {
        "answer": result.get("answer"),
        "score": float(result.get("score", 0.0)),
        "retrieved": retrieved,
        "context": context   
    }

Device set to use mps:0


In [37]:
pdf_path = "../data/Machine_learning.pdf" 
text = load_text_from_pdf(pdf_path)

chunks = chunk_text(text, chunk_size_words=300, overlap_words=50)

index, embeddings = build_faiss_index(chunks)

print("PDF length (chars):", len(text))
print("Number of chunks:", len(chunks))
print("Index type:", type(index))
print("Embeddings shape:", embeddings.shape)


questions = [
    "What is machine learning?",
    "What is overfitting in machine learning?",
]

for q in questions:
    r = answer_question(q, chunks, index, top_k=4)
    print("\nQ:", q)
    print("A:", r['answer'])
    print("Score:", r['score'])
    print("Retrieved indices:", [idx for idx, _ in r['retrieved']])
    print("Context preview:", r['context'][:300], "...")


PDF length (chars): 140515
Number of chunks: 72
Index type: <class 'faiss.swigfaiss_avx2.IndexFlatL2'>
Embeddings shape: (72, 384)
Retrieved indices: [0, 8, 19, 7]
Context length: 1835

Q: What is machine learning?
A: a field of study in artificial intelligence
Score: 0.3026069402694702
Retrieved indices: [0, 8, 19, 7]
Context preview: Machine learning Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions . [1] Within a subdiscipline in machin ...
Retrieved indices: [33, 28, 34, 9]
Context length: 1837

Q: What is overfitting in machine learning?
A: Settling on a bad, overly complex theory
Score: 0.07107354700565338
Retrieved indices: [33, 28, 34, 9]
Context preview: or Interpretable AI, or Explainable Machine Learning (XML), is artificial intelligence (AI) in which humans can understand the decisi

# Notes

This notebook demonstrates a **simple retrieval-based extractive QA pipeline**, the retrieve-then-read pattern that Retrieval-Augmented Generation (RAG) builds on, using local embeddings and Hugging Face models:

1. **PDF ingestion** → Load PDF into raw text.
2. **Chunking** → Split text into smaller passages (~300 words) with overlap to retain context.
3. **Embedding + FAISS** → Convert chunks into embeddings and store in a FAISS index for fast similarity search.
4. **Question answering** → Given a question, retrieve top-k chunks, assemble context, and pass it to a HuggingFace QA model.
5. **Answer + context** → Return the extracted answer, confidence score, retrieved chunk indices, and context snippet.

---

## Learnings

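- **Chunking matters:** Chunks of ~300 words with a 50-word overlap keep each passage small enough to embed and search while preserving context that spans chunk boundaries.
- **Embeddings:** SentenceTransformers maps each chunk to a 384-dimensional vector (with `all-MiniLM-L6-v2`), so semantic similarity can be measured as distance between vectors.
- **FAISS index:** `IndexFlatL2` performs exact nearest-neighbor search over all stored vectors, which is fast at this scale (72 chunks).
- **Context assembly:** Retrieved chunks are concatenated in rank order until `max_concat_chars` is reached. With the default of 1000 characters and ~300-word chunks, only the top-ranked chunk fits (context lengths of 1835 and 1837 characters in the run above), so the limit must be raised for the QA model to see more than one chunk.
- **QA model:** Extractive QA models (like `roberta-base-squad2`) select an answer span from the context, so answer quality depends on retrieval quality. In the run above, the two answers scored 0.30 and 0.07.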
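
Because the QA score varies this much between questions, it can help to flag weak answers instead of returning them silently. The following is a minimal sketch that works on dictionaries shaped like the output of `answer_question`. The helper name `flag_low_confidence` and the 0.2 threshold are assumptions for illustration, not tuned values; the scores are rounded from the run above.

```python
from typing import Any, Dict


def flag_low_confidence(result: Dict[str, Any], min_score: float = 0.2) -> Dict[str, Any]:
    # Copy the result and mark it when the model's score falls below the
    # (arbitrary, illustrative) threshold.
    flagged = dict(result)
    flagged["low_confidence"] = float(result.get("score", 0.0)) < min_score
    return flagged


# Answers and rounded scores taken from the notebook's run.
toy_results = [
    {"answer": "a field of study in artificial intelligence", "score": 0.3026},
    {"answer": "Settling on a bad, overly complex theory", "score": 0.0711},
]
toy_flags = [flag_low_confidence(r)["low_confidence"] for r in toy_results]
print(toy_flags)
```

A flagged answer could then be shown with a warning or trigger a retry with more context. Any real threshold would need to be chosen by inspecting scores on known questions.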

---

```mermaid
flowchart TD
    A[Load PDF] --> B[Chunk Text]
    B --> C[Embed Chunks with SentenceTransformer]
    C --> D[Build FAISS Index]
    E[User Question] --> F[Embed Question]
    F --> G[Retrieve Top-K Chunks from FAISS]
    G --> H[Assemble Context]
    H --> I[Pass to QA Model]
    I --> J[Return Answer + Score + Context]
```
