<a href="https://colab.research.google.com/github/ma2070-spec/Outamation_AI_Externship/blob/main/colab/project5_mortgage_rag_llamaindex_final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Note for reviewers
This notebook reads `GOOGLE_API_KEY` from Colab user secrets, then the environment, and falls back to a masked prompt.
To run: File → Save a copy in Drive, then run cells top-to-bottom.
The submitted notebook is view-only, and its outputs were cleared.


In [None]:
# Install/refresh everything needed
%pip -q install -U pip setuptools wheel
%pip -q install -U \
  "llama-index>=0.11" \
  "llama-index-llms-gemini>=0.2" \
  "llama-index-retrievers-bm25>=0.2" \
  "llama-index-embeddings-huggingface>=0.2" \
  sentence-transformers pymupdf jedi


In [None]:
# Load GOOGLE_API_KEY (Colab secrets -> fallback to masked prompt)
import os
def _load_secret(key_name="GOOGLE_API_KEY"):
    try:
        from google.colab import userdata
        v = userdata.get(key_name)
        if v: return v
    except Exception:
        pass
    v = os.environ.get(key_name)
    if v: return v
    from getpass import getpass
    return getpass(f"Enter your {key_name} (hidden): ")
os.environ["GOOGLE_API_KEY"] = _load_secret(); del _load_secret

# Imports (with compatibility fallbacks)
try:
    from llama_index.llms.gemini import Gemini
except Exception as e:
    raise ImportError("Gemini import failed. Did you run the install + restart?") from e

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
try:
    from llama_index.core.node_parser import SentenceSplitter
except ImportError:
    # Fallback if SentenceSplitter is unavailable: TokenTextSplitter accepts the same
    # chunk_size/chunk_overlap arguments but splits on tokens rather than sentences
    from llama_index.core.node_parser import TokenTextSplitter as SentenceSplitter

print("Imports OK; key loaded:", bool(os.environ.get("GOOGLE_API_KEY")))



In [None]:
# --- STEP 3: Configure dirs, knobs, and LlamaIndex settings ---
import os
from pathlib import Path

# Dirs
DATA_DIR = Path("./data"); DATA_DIR.mkdir(parents=True, exist_ok=True)
STORAGE_DIR = Path("./storage"); STORAGE_DIR.mkdir(parents=True, exist_ok=True)

# Retrieval knobs
CHUNK_SIZE = 512
CHUNK_OVERLAP = 32
SIMILARITY_TOP_K = 5
BM25_TOP_K = 5

# Embeddings + rerank knobs
EMBED_MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
USE_RERANK = True
RERANK_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"
RERANK_TOP_N = 5

# LLM + embedding + chunker (handle model name differences)
api_key = os.environ["GOOGLE_API_KEY"]
try:
    Settings.llm = Gemini(api_key=api_key, model="models/gemini-1.5-flash")
except Exception:
    # Some versions expect the short name
    Settings.llm = Gemini(api_key=api_key, model="gemini-1.5-flash")

Settings.embed_model = HuggingFaceEmbedding(model_name=EMBED_MODEL_NAME)
Settings.node_parser = SentenceSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)

print("Configured ✅")
print("Data dir:", DATA_DIR.resolve())


In [None]:
# Core imports
from pathlib import Path
import os
import pandas as pd
from typing import List

# LlamaIndex pieces
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex, get_response_synthesizer
from llama_index.core.retrievers import BaseRetriever, VectorIndexRetriever
from llama_index.core.schema import NodeWithScore, QueryBundle

# Splitter name sometimes differs across versions
try:
    from llama_index.core.node_parser import SentenceSplitter
except ImportError:
    from llama_index.core.node_parser import TokenTextSplitter as SentenceSplitter

# Reranker import
try:
    from llama_index.core.postprocessor import SentenceTransformerRerank
except ImportError:
    SentenceTransformerRerank = None  # we'll handle None below

# LLM + embeddings + BM25
from llama_index.llms.gemini import Gemini
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.retrievers.bm25 import BM25Retriever

print("Imports OK ✅")



In [None]:
# If you're in Colab, this opens a file picker
try:
    from google.colab import files  # type: ignore
except ImportError:
    print("Not running in Colab; place PDFs into ./data manually.")
else:
    uploaded = files.upload()
    for name, content in uploaded.items():
        (DATA_DIR / name).write_bytes(content)
    print("Saved:", sorted(p.name for p in DATA_DIR.glob("*")))


In [None]:
# --- STEP 7 (updated): Build vector index + BM25 using same chunks ---

# 1) Load docs
docs = SimpleDirectoryReader(str(DATA_DIR)).load_data()
print(f"Loaded {len(docs)} documents")

# 2) Chunk once so both retrievers share the same nodes (and node IDs).
#    The RRF fusion below keys on node_id; chunking separately for each
#    retriever would give the same text different IDs, so hits never merge.
parser = SentenceSplitter(chunk_size=CHUNK_SIZE, chunk_overlap=CHUNK_OVERLAP)
nodes = parser.get_nodes_from_documents(docs)
if not nodes:
    raise ValueError("No text found in ./data; upload at least one PDF first.")
print(f"Created {len(nodes)} nodes")

# 3) Vector index (dense), built from the shared nodes
vector_index = VectorStoreIndex(nodes, show_progress=True)

# 4) BM25 retriever (sparse), built from the same nodes
bm25 = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=BM25_TOP_K)

# 5) Dense retriever
# (Using as_retriever is more version-proof than VectorIndexRetriever)
vector_retriever = vector_index.as_retriever(similarity_top_k=SIMILARITY_TOP_K)

print("Indexes ready: vector (dense) + BM25 (sparse)")



In [None]:
class HybridRRFRetriever(BaseRetriever):
    """Fuse several retrievers' rankings with reciprocal rank fusion (RRF)."""

    def __init__(self, retrievers, top_k=5, rrf_k=60):
        self.retrievers = retrievers
        self.top_k = top_k
        self.rrf_k = rrf_k
        super().__init__()  # initialize base-class state (e.g. callback manager)

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        scores = {}    # node_id -> fused RRF score
        node_map = {}  # node_id -> node
        for retr in self.retrievers:
            results = retr.retrieve(query_bundle)
            for rank, res in enumerate(results, 1):
                nid = res.node.node_id
                node_map[nid] = res.node
                # Each retriever contributes 1 / (rrf_k + rank) for this node
                scores[nid] = scores.get(nid, 0.0) + 1.0 / (self.rrf_k + rank)
        fused = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:self.top_k]
        return [NodeWithScore(node=node_map[nid], score=sc) for nid, sc in fused]

hybrid_retriever = HybridRRFRetriever(
    retrievers=[bm25, vector_retriever],
    top_k=max(SIMILARITY_TOP_K, BM25_TOP_K)
)
print("Hybrid retriever ready.")


In [None]:
reranker = None
if USE_RERANK and SentenceTransformerRerank is not None:
    reranker = SentenceTransformerRerank(model=RERANK_MODEL, top_n=RERANK_TOP_N)
    print("Reranker loaded:", RERANK_MODEL)
else:
    print("Reranker disabled or package unavailable.")


In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine
response_synthesizer = get_response_synthesizer(response_mode="compact")

query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    response_synthesizer=response_synthesizer,
    node_postprocessors=[reranker] if reranker else None,
)
print("Query engine ready.")


In [None]:
def ask(q: str, show_sources: bool = True):
    """Run a query, print the answer, and optionally show a table of sources."""
    print(f"Q: {q}\n")
    resp = query_engine.query(q)
    print("A:", str(resp), "\n")
    if show_sources and getattr(resp, "source_nodes", None):
        rows = []
        for i, sn in enumerate(resp.source_nodes, 1):
            m = sn.node.metadata or {}
            rows.append({
                "rank": i,
                # score may be None, so guard before rounding
                "score": round(sn.score, 4) if sn.score is not None else None,
                "file": m.get("file_name", m.get("filename", "")),
                "page": m.get("page_label", m.get("page", "")),
                "preview": (sn.node.get_content() or "")[:240].replace("\n", " ")
            })
        if rows:
            display(pd.DataFrame(rows))
    return resp

# Try:
# ask("What is the borrower's base interest rate and any discount points? Cite the page.")
# ask("Explain escrow requirements and any conditions for waivers with citations.")


In [None]:
eval_questions = [
    "Summarize borrower obligations for taxes and insurance with citations.",
    "How is private mortgage insurance handled? Include any thresholds and cite.",
    "What late fee policy applies and on which page is it described?"
]

def run_eval(questions):
    """Query each question and tabulate answer length and cited sources."""
    recs = []
    for q in questions:
        resp = query_engine.query(q)
        cites = []
        for sn in resp.source_nodes:
            m = sn.node.metadata or {}
            name = m.get("file_name", "")
            page = m.get("page_label", m.get("page", ""))
            # Skip nodes without file metadata instead of emitting a bare " p."
            if name:
                cites.append(f"{name} p.{page}" if page else name)
        recs.append({
            "question": q,
            "answer_len": len(str(resp)),
            "num_citations": len(cites),
            "citations": "; ".join(cites)
        })
    return pd.DataFrame(recs)

# run_eval(eval_questions)


In [None]:
vector_index.storage_context.persist(persist_dir=str(STORAGE_DIR))
print("Persisted vector index to", STORAGE_DIR.resolve())


In [None]:
ask("What is the borrower's base interest rate and any discount points? Cite the page.")

In [None]:
ask("Explain escrow requirements and any conditions for waivers with citations.")

In [None]:
ask("List all fees affecting APR and where they appear in the document.")

In [None]:
ask("What late fee policy applies and on which page is it described?")

In [None]:
df = run_eval(eval_questions)
df



# Project 5 – Mortgage RAG Pipeline (LlamaIndex)

**Goal:** Build a working Retrieval-Augmented Generation (RAG) assistant for mortgage PDFs that returns grounded answers with citations.

## What this notebook does
- Ingests mortgage PDFs in `./data`
- Chunks text and builds hybrid retrieval (BM25 + dense embeddings)
- (Optional) re-ranks top candidates
- Answers questions with an LLM (Gemini is the one configured in this notebook) and shows citations
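
To make the chunking step concrete, here is a minimal sketch of a sliding window with overlap. It is a simplification under stated assumptions: it splits on whitespace "tokens", whereas the notebook's `SentenceSplitter` counts model tokens and prefers sentence boundaries. The helper name `chunk_with_overlap` is hypothetical.

```python
from typing import List


def chunk_with_overlap(tokens: List[str], chunk_size: int = 512, overlap: int = 32) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens.

    Hypothetical helper for illustration only; this is not how
    LlamaIndex's SentenceSplitter is implemented.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


# Toy input: 1,000 whitespace "tokens" with the notebook's settings.
toy_tokens = [f"w{i}" for i in range(1000)]
toy_chunks = chunk_with_overlap(toy_tokens, chunk_size=512, overlap=32)
print(len(toy_chunks), [len(c) for c in toy_chunks])
```

With these settings each window advances by 480 tokens, so the last 32 tokens of one chunk reappear at the start of the next. That repetition is what keeps a rate or fee sentence intact when it straddles a chunk boundary.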

## How to run
1) Install dependencies (Step 1) and then Restart session (Step 2).  
2) Load your GOOGLE_API_KEY (Step 3).  
3) Run imports and config (Steps 4–5).  
4) Upload 1+ mortgage PDFs into `./data` (Step 6).  
5) Build the vector index and BM25 (Step 7).  
6) Create the hybrid retriever and (optional) reranker (Steps 8–9).  
7) Use `ask("your question")` to query and see citations (Steps 10–11).  
8) Run a quick evaluation table (Step 12).  
9) (Optional) Persist the index (Step 13).

## Settings we tuned
- Chunk size: 512, overlap: 32  
- similarity_top_k: 5, BM25 top_k: 5  
- Hybrid retrieval via RRF, optional cross-encoder re-rank
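
Reciprocal rank fusion (RRF) can be sketched in a few lines of plain Python. Each retriever contributes `1 / (k + rank)` for every ID it returns, and the contributions are summed per ID. The constant `k = 60` mirrors the `rrf_k` default of the notebook's `HybridRRFRetriever`; the function name `rrf_fuse` and the chunk IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_k: int = 5) -> List[Tuple[str, float]]:
    """Fuse ranked ID lists with reciprocal rank fusion (ranks start at 1)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by ID so output is deterministic
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_k]


# Hypothetical rankings from a sparse and a dense retriever.
bm25_ranking = ["chunk_a", "chunk_b", "chunk_c"]
dense_ranking = ["chunk_b", "chunk_d", "chunk_a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
for item_id, score in fused:
    print(f"{item_id}: {score:.5f}")
```

Here `chunk_b` comes out on top because both retrievers rank it highly, followed by `chunk_a`. The fusion only rewards agreement when both retrievers report the same ID for the same chunk.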

## Security
- API key is loaded via Colab secrets or a masked prompt (not printed).  
- Use sanitized/sample documents if originals are sensitive.

## Links to earlier projects (optional)
- P1: Text PDFs + regex extraction  
- P2: Preprocessing (binarize, deskew, denoise)  
- P3: OCR comparison (Tesseract vs PaddleOCR)  
- P4: First RAG build with chunking/embeddings/retrieval tuning

## Submission checklist
- Share this Colab as Viewer and paste the link into the Project 5 form (#1).  
- Create a short Google Doc reflection (about 100 words) and paste the link into the form (#2).  
- Paste your comparison insights (150–200 words) into the form (#3).  
- Build a brief presentation summarizing pipeline, tuning, example Q->A with citations, and lessons learned.
  

## Reflection (about 100 words)
Across Projects 1–4, I built a mortgage document pipeline from raw PDFs to a working RAG assistant. I started with text parsing and regex field extraction, then handled scanned files with OCR (Tesseract and PaddleOCR) and preprocessing to improve legibility. Next, I indexed documents in LlamaIndex, added chunking and Hugging Face embeddings, and tuned retrieval (top_k, overlap, hybrid BM25 plus dense). Finally, I connected an LLM (Gemini or Mistral) and enforced source-grounded answers. The result is a reproducible notebook that ingests PDFs, retrieves the right spans, and cites them. Along the way, I strengthened my skills in Python, PDF and OCR tooling, evaluation, and prompt and retrieval design.

## Comparison Insights (150–200 words)
I compared several retrieval setups and two LLMs to see what most improved grounded answers. Dense-only retrieval with large 1,000-token chunks often returned broad context and produced vague summaries. Reducing chunks to about 400–600 tokens with 20–50 overlap increased the chance that a single chunk contained the full rate or fee detail needed for precise citations. Adding a hybrid retriever (BM25 plus dense) consistently surfaced term-specific pages (for example, escrow waivers or FHA MIP) that dense-only sometimes missed. A lightweight cross-encoder re-ranker on the top 10 candidates further improved multi-part questions by pushing the most relevant spans to the top. Gemini tended to produce more structured, citation-friendly answers out of the box, while Mistral responded faster and was cheaper but needed stricter prompts to reference sources consistently. Overall, retrieval quality (chunking, hybrid retrieval, and re-ranking) mattered more than the model choice; once context was strong, both LLMs performed well. My final settings use ~512-token chunks, overlap 32, similarity_top_k 5, hybrid on, and citation-first prompts.
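
The "citation-first prompts" mentioned above are not shown in this notebook, so the following is only a sketch of what such a prompt could look like. The template wording and the helper `build_citation_prompt` are assumptions, and the file name and chunk text are made-up placeholders.

```python
from typing import Dict, List

# Assumed wording; the notebook does not include the actual prompt text.
CITATION_PROMPT = (
    "Answer the question using ONLY the numbered sources below.\n"
    "Cite every claim as [file p.page]. If the sources do not contain "
    "the answer, say so.\n\n"
    "Sources:\n{sources}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_citation_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Format retrieved chunks as labelled sources ahead of the question."""
    lines = []
    for i, chunk in enumerate(chunks, start=1):
        label = f"{chunk['file']} p.{chunk['page']}"
        lines.append(f"({i}) [{label}] {chunk['text']}")
    return CITATION_PROMPT.format(sources="\n".join(lines), question=question)


# Hypothetical retrieved chunk, for illustration only.
example_chunks = [
    {"file": "sample_note.pdf", "page": "3", "text": "A late charge applies after 15 days."},
]
prompt = build_citation_prompt("What late fee policy applies?", example_chunks)
print(prompt)
```

Putting the file and page label next to each source gives the model a citation string it can copy directly. LlamaIndex supports custom QA prompt templates, but how to wire one in depends on the installed version.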


# Troubleshooting & Tips

## Quick diagnostics
- After installs, always do **Runtime → Restart session**.
- Verify key & imports:
  ```python
  import os
  print("GOOGLE_API_KEY set:", bool(os.environ.get("GOOGLE_API_KEY")))
  from llama_index.llms.gemini import Gemini
  from llama_index.embeddings.huggingface import HuggingFaceEmbedding
  ```


In [None]:
%pip -q install -U "jedi>=0.18.2"


In [None]:
%pip -q install -U pip setuptools wheel
%pip -q install -U "llama-index>=0.11" "llama-index-llms-gemini>=0.2" \
  "llama-index-retrievers-bm25>=0.2" "llama-index-embeddings-huggingface>=0.2" \
  sentence-transformers pymupdf jedi
