# Local OCR-Enabled Document Q&A System (Mistral RAG)

This project implements an end-to-end **Retrieval-Augmented Generation (RAG)** pipeline for semi-structured documents such as resumes, mortgage documents, and payslips.

The system supports:
- OCR for scanned PDFs
- Page-level document classification
- Semantic chunking and FAISS-based retrieval
- Local LLM inference using **Mistral** via `llama.cpp`
- Source-grounded answers with confidence estimation

**Goal:** Reduce manual document review time while improving consistency and traceability of extracted information.


## System Architecture

This pipeline follows a modular, left-to-right architecture:

PDF Upload
→ OCR & Text Extraction (PyMuPDF + Tesseract)
→ Page-Level Document Classification
→ Logical Document Grouping
→ Semantic Chunking (Sliding Window)
→ Embedding Generation (Local)
→ FAISS Vector Index
→ Top-K Retrieval
→ Mistral LLM Inference
→ Answer + Citations + Confidence


Each stage is isolated to improve debuggability and system reliability.
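
As a rough illustration of what stage isolation buys, the sketch below chains stages as plain functions and reports which stage failed. The `run_pipeline` helper and the toy stages are hypothetical placeholders, not functions from this notebook.

```python
from typing import Any, Callable, List, Tuple

Stage = Tuple[str, Callable[[Any], Any]]

def run_pipeline(stages: List[Stage], payload: Any) -> Any:
    """Run stages left to right and name the stage that failed."""
    for name, fn in stages:
        try:
            payload = fn(payload)
        except Exception as exc:
            raise RuntimeError(f"stage '{name}' failed: {exc}") from exc
    return payload

# Toy stand-ins for the real stages (hypothetical).
stages: List[Stage] = [
    ("extract", lambda pdf: pdf["pages"]),
    ("classify", lambda pages: [(p, "Payslip" if "net pay" in p.lower() else "Other") for p in pages]),
    ("group", lambda typed: {t: [p for p, pt in typed if pt == t] for _, t in typed}),
]

result = run_pipeline(stages, {"pages": ["Net Pay: 1000", "Hello"]})
```

Because each stage only sees the previous stage's output, a failure can be traced to one named step instead of the whole pipeline.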


## Setup and Installation
First, install the required packages.

In [None]:
# Install required packages
!pip install -q gradio
!pip install -q gradio_pdf
!pip install -q pypdf PyPDF2 pymupdf
!pip install -q faiss-cpu
!pip install pandas

# Install LlamaIndex packages for enhanced document processing
!pip install -q llama-index
!pip install -q llama-index-readers-file
!pip install -q llama-index-embeddings-huggingface
!pip install -q llama-index-vector-stores-faiss
!pip install -q sentence-transformers

!pip install llama-cpp-python --upgrade --force-reinstall \
    --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu124

!pip install -q pillow pytesseract


## 🔧 Core Imports and Configuration

In [None]:
# Load the Local LLM model
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",  # << correct filename
    local_dir="/content"
)

print("Model path:", model_path)

import os
print("Size (GB):", os.path.getsize(model_path) / (1024*1024*1024))

In [None]:
import gradio as gr
from gradio_pdf import PDF
import fitz  # PyMuPDF
from PyPDF2 import PdfReader
import numpy as np
import faiss
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass
import json
from datetime import datetime
import hashlib

import io
from PIL import Image
import pytesseract

from llama_index.core import Document as LI_Document, VectorStoreIndex, StorageContext
from llama_index.core.schema import TextNode
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.vector_stores import MetadataFilters, MetadataFilter, FilterOperator
from llama_cpp import Llama
import os


if not os.path.exists(model_path):
    # Fallback download; the file name must match the one passed to hf_hub_download above
    !wget -c "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf" -O {model_path}

    print(f"Model downloaded to {model_path}")

# Verify file exists and check size
if os.path.exists(model_path):
    print(f"Model file exists. Size: {os.path.getsize(model_path) / (1024 * 1024):.2f} MB")
else:
    print("Model file not found!")

try:
    llm = Llama(
        model_path=model_path,
        n_ctx=4096,
        n_threads=8,
        n_gpu_layers=40,  # use full GPU layers for max speed
        n_batch=128,
        use_mmap=True,
        verbose=False,
        use_mlock=False,
    )
    print("Model loaded successfully!")

except Exception as e:
    print(f"Error loading LLM model: {e}")

In [None]:
# Load embedding model
from sentence_transformers import SentenceTransformer
import torch

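# Optional GGUF copy of the embedding model. The SentenceTransformer loader below
# fetches its own weights from the Hub and does not read this file.
EMBED_MODEL_PATH = "/content/nomic-embed-text-v1.5.Q8_0.gguf"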

if not os.path.exists(EMBED_MODEL_PATH):
    !wget -q https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-gguf/resolve/main/nomic-embed-text-v1.5.Q8_0.gguf -O {EMBED_MODEL_PATH}
    print(f"Downloaded embedding model to {EMBED_MODEL_PATH}")

try:
    embed_llm = SentenceTransformer(
        "nomic-ai/nomic-embed-text-v1.5",
        trust_remote_code=True,
        device="cuda" if torch.cuda.is_available() else "cpu"
    )
    print("Embedding model loaded successfully!")
except Exception as e:
    print(f"Error loading embed model: {e}")


## Mistral and Embedding Helpers

In [None]:
def mistral_generate(prompt: str, max_tokens: int = 512) -> str:
    """Wrapper to generate text from Mistral."""
    out = llm(
        prompt,
        max_tokens=max_tokens,
        temperature=0.2,
        top_p=0.9,
        stop=["</s>"]
    )
    return out["choices"][0]["text"].strip()

def get_embedding(text: str) -> np.ndarray:
    """Get a float32 embedding vector for a given text using the embedding model."""
    # embed_llm is a SentenceTransformer, so use encode();
    # create_embedding() belongs to the llama.cpp API and does not exist here.
    vec = embed_llm.encode(text, convert_to_numpy=True)
    return np.asarray(vec, dtype="float32")

## Data Structures for Enhanced Document Management

In [None]:
@dataclass
class PageInfo:
    """Stores information about a single page"""
    page_num: int
    text: str
    doc_type: Optional[str] = None
    page_in_doc: int = 0

@dataclass
class LogicalDocument:
    """Represents a logical document within a PDF"""
    doc_id: str
    doc_type: str
    page_start: int
    page_end: int
    text: str
    chunks: Optional[List["ChunkMetadata"]] = None  # filled in by process_all_documents

@dataclass
class ChunkMetadata:
    """Rich metadata for each chunk"""
    chunk_id: str
    doc_id: str
    doc_type: str
    chunk_index: int
    page_start: int
    page_end: int
    text: str
    embedding: Optional[np.ndarray] = None
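
One caveat with dataclass fields that hold lists: `dataclasses` rejects a shared mutable default, and a `None` default forces callers to check before appending. A minimal sketch of the `field(default_factory=list)` alternative follows. `DocSketch` is a hypothetical stand-in for `LogicalDocument`, not a replacement for it.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DocSketch:
    """Hypothetical, trimmed-down stand-in for LogicalDocument."""
    doc_id: str
    # Each instance gets its own fresh list instead of sharing one default.
    chunks: List[str] = field(default_factory=list)

a = DocSketch("doc_0")
b = DocSketch("doc_1")
a.chunks.append("first chunk")
```

Appending to `a.chunks` leaves `b.chunks` empty, which is the property a plain `= []` default would not give.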

## Document Intelligence Functions
These functions handle document classification and boundary detection:

In [None]:
def classify_document_type(text: str) -> str:
    """
    Rule-based document classifier.
    Checks keyword signals in priority order: payslip, mortgage, invoice, resume.
    """
    if not text or len(text.strip()) < 50:
        return "Other"

    t = text.lower().replace(" ", "")

    # STRONG PAYSLIP SIGNALS
    if (
        "netpay" in t
        or "grosspay" in t
        or "basic" in t
        or "deduction" in t
        or "employeeid" in t
        or "paydate" in t
    ):
        return "Payslip"

    # STRONG MORTGAGE/LOAN SIGNALS
    if (
        "loanestimate" in t
        or "closingdisclosure" in t
        or "interestrate" in t
        or "borrower" in t
        or "loanamount" in t
        or "principal" in t
    ):
        return "Mortgage Document"

    # STRONG INVOICE SIGNALS
    if (
        "invoice" in t
        or "subtotal" in t
        or "totaldue" in t
        or "billto" in t
    ):
        return "Invoice"

    # RESUME SIGNALS
    if (
        "experience" in t
        or "education" in t
        or "skills" in t
        or "workexperience" in t
        or "professionalexperience" in t
        or "summary" in t
        or "projects" in t
        or "linkedin.com" in t
    ):
        return "Resume"

    return "Other"
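
The rule cascade above can also be written as an ordered keyword table, which makes the precedence between document types explicit. This is only a sketch: `RULES` and `classify_by_rules` are hypothetical names, and the keywords are a subset of the ones used above.

```python
from typing import List, Tuple

# Ordered from highest to lowest priority; the first matching label wins.
RULES: List[Tuple[str, Tuple[str, ...]]] = [
    ("Payslip", ("netpay", "grosspay", "employeeid", "paydate")),
    ("Mortgage Document", ("loanestimate", "closingdisclosure", "loanamount")),
    ("Invoice", ("invoice", "subtotal", "totaldue", "billto")),
    ("Resume", ("workexperience", "education", "skills", "linkedin.com")),
]

def classify_by_rules(text: str) -> str:
    """Return the first label whose keywords appear in the whitespace-stripped text."""
    if not text or len(text.strip()) < 50:
        return "Other"
    t = text.lower().replace(" ", "")
    for label, keywords in RULES:
        if any(k in t for k in keywords):
            return label
    return "Other"

sample = "ACME Corp   Employee ID: 1042   Pay Date: 2024-01-31   Net Pay: 3,200.00"
label = classify_by_rules(sample)
```

Keeping the rules in data makes it easier to add a document type or reorder priorities without touching the control flow.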


## Advanced PDF Processing Pipeline
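The cells below extract text page by page, fall back to OCR for pages without a text layer, and group consecutive pages into logical documents: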

In [None]:
# helper
def cheap_same_doc_heuristic(prev_text: str, curr_text: str) -> bool:
    prev = prev_text.lower().replace(" ", "")
    curr = curr_text.lower().replace(" ", "")

    key_markers = [
        "employee",
        "netpay",
        "grosspay",
        "basic",
        "deduction",
        "paydate",
        "empid"
    ]

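    matches = sum(1 for k in key_markers if k in prev and k in curr)
    # The original cell had no return statement, so callers always received None.
    # Assumption: two or more shared markers mean the pages belong to the same document.
    return matches >= 2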


In [None]:
def detect_document_boundary(prev_text, curr_text, current_doc_type):
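    """
    Returns True if both pages belong to the SAME document (no boundary),
    and False if a boundary lies between them.
    current_doc_type is accepted for interface compatibility but is not used.
    """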

    if not prev_text.strip() or not curr_text.strip():
        return True

    prev_type = classify_document_type(prev_text)
    curr_type = classify_document_type(curr_text)

    # If the document type changes, force a split
    if prev_type != curr_type:
        return False

    return True
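
Because a boundary is declared only when the classified type changes, the grouping step amounts to run-length grouping of per-page labels. A minimal sketch with `itertools.groupby` follows. `group_pages` is a hypothetical helper, and it shares the limitation of the rule above: two back-to-back documents of the same type are merged.

```python
from itertools import groupby
from typing import List, Tuple

def group_pages(page_types: List[str]) -> List[Tuple[str, int, int]]:
    """Return (doc_type, page_start, page_end) for each run of identical labels."""
    groups = []
    for doc_type, run in groupby(enumerate(page_types), key=lambda item: item[1]):
        indices = [i for i, _ in run]
        groups.append((doc_type, indices[0], indices[-1]))
    return groups

groups = group_pages(["Payslip", "Payslip", "Resume", "Invoice", "Invoice"])
```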


In [None]:
def extract_and_analyze_pdf(pdf_file) -> Tuple[List[PageInfo], List[LogicalDocument]]:
    """
    Extract text from PDF and group pages into logical documents.
    """
    print("📖 Starting PDF extraction and analysis...")

    if isinstance(pdf_file, dict) and "content" in pdf_file:
        doc = fitz.open(stream=pdf_file["content"], filetype="pdf")
    elif hasattr(pdf_file, "read"):
        doc = fitz.open(stream=pdf_file.read(), filetype="pdf")
    else:
        doc = fitz.open(pdf_file)

    pages_info: List[PageInfo] = []
    for i, page in enumerate(doc):
        text = page.get_text()

        if not text.strip():
            print(f"  Page {i}: No text found, attempting OCR...")
            try:
                # Render above the 72 dpi default so Tesseract has enough detail
                pix = page.get_pixmap(dpi=300)
                img = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(img)
                print(f"  Page {i}: OCR extracted {len(text)} characters")
            except Exception as e:
                print(f"  Page {i}: OCR failed - {e}")
                text = ""

        pages_info.append(PageInfo(page_num=i, text=text))

    doc.close()

    if not pages_info:
        raise ValueError("No text could be extracted from PDF")

    print(f"✅ Extracted {len(pages_info)} pages")

    logical_docs: List[LogicalDocument] = []
    current_doc_pages: List[PageInfo] = []
    current_doc_type = None
    doc_counter = 0

    print("🧠 Analyzing document structure...")

    for i, page_info in enumerate(pages_info):
        if i == 0:
            # Always classify first page ONCE
            current_doc_type = classify_document_type(page_info.text)
            page_info.doc_type = current_doc_type
            page_info.page_in_doc = 0
            current_doc_pages = [page_info]
            print(f"  Page {i}: New document detected - {current_doc_type}")
            continue

        prev_text = pages_info[i - 1].text

        # Decision order: cheap keyword heuristic → document-type change check
        if cheap_same_doc_heuristic(prev_text, page_info.text):
            is_same = True
        else:
            is_same = detect_document_boundary(prev_text, page_info.text, current_doc_type)

        if is_same:
            page_info.doc_type = current_doc_type
            page_info.page_in_doc = len(current_doc_pages)
            current_doc_pages.append(page_info)
        else:
            # Finalize previous document
            logical_docs.append(
                LogicalDocument(
                    doc_id=f"doc_{doc_counter}",
                    doc_type=current_doc_type,
                    page_start=current_doc_pages[0].page_num,
                    page_end=current_doc_pages[-1].page_num,
                    text="\n\n".join(p.text for p in current_doc_pages)
                )
            )
            doc_counter += 1

            # Start new document
            current_doc_type = classify_document_type(page_info.text)
            page_info.doc_type = current_doc_type
            page_info.page_in_doc = 0
            current_doc_pages = [page_info]
            print(f"  Page {i}: New document detected - {current_doc_type}")

    # Final document
    if current_doc_pages:
        logical_docs.append(
            LogicalDocument(
                doc_id=f"doc_{doc_counter}",
                doc_type=current_doc_type,
                page_start=current_doc_pages[0].page_num,
                page_end=current_doc_pages[-1].page_num,
                text="\n\n".join(p.text for p in current_doc_pages)
            )
        )

    print(f"✅ Identified {len(logical_docs)} logical documents")
    for ld in logical_docs:
        print(f"   - {ld.doc_type}: Pages {ld.page_start}-{ld.page_end}")

    return pages_info, logical_docs

## Intelligent Chunking with Metadata Preservation
We'll provide two chunking approaches - our custom implementation and LlamaIndex's built-in capabilities:

In [None]:
from llama_index.core import Document as LI_Document
from llama_index.core.node_parser import SentenceSplitter

def chunk_document_with_metadata(
    logical_doc: LogicalDocument,
    chunk_size: int = 300,
    overlap: int = 50
) -> List[ChunkMetadata]:
    """
    Blank-line-aware chunking for financial documents.
    Splits the text into blank-line-delimited sections, then applies a
    word-level sliding window to any section longer than chunk_size.
    """
    text = logical_doc.text.strip()

    sections = []
    buffer = []

    for line in text.splitlines():
        line = line.strip()
        if not line:
            if buffer:
                sections.append(" ".join(buffer))
                buffer = []
        else:
            buffer.append(line)

    if buffer:
        sections.append(" ".join(buffer))

    chunks_metadata: List[ChunkMetadata] = []
    chunk_id = 0

    for section in sections:
        words = section.split()

        if len(words) <= chunk_size:
            chunks_metadata.append(
                ChunkMetadata(
                    chunk_id=f"{logical_doc.doc_id}_chunk_{chunk_id}",
                    doc_id=logical_doc.doc_id,
                    doc_type=logical_doc.doc_type,
                    chunk_index=chunk_id,
                    page_start=logical_doc.page_start,
                    page_end=logical_doc.page_end,
                    text=section,
                )
            )
            chunk_id += 1
            continue

        stride = chunk_size - overlap
        for start in range(0, len(words), stride):
            chunk_text = " ".join(words[start:start + chunk_size])

            chunks_metadata.append(
                ChunkMetadata(
                    chunk_id=f"{logical_doc.doc_id}_chunk_{chunk_id}",
                    doc_id=logical_doc.doc_id,
                    doc_type=logical_doc.doc_type,
                    chunk_index=chunk_id,
                    page_start=logical_doc.page_start,
                    page_end=logical_doc.page_end,
                    text=chunk_text,
                )
            )
            chunk_id += 1

            if start + chunk_size >= len(words):
                break

    return chunks_metadata

def chunk_with_llama_index(
    logical_doc: LogicalDocument,
    chunk_size: int = 400,
    chunk_overlap: int = 80
) -> List[ChunkMetadata]:
    """
    Sentence-based chunking using LlamaIndex.
    Better semantic coherence, slightly slower.
    """
    li_doc = LI_Document(
        text=logical_doc.text,
        metadata={
            "doc_id": logical_doc.doc_id,
            "doc_type": logical_doc.doc_type,
            "page_start": logical_doc.page_start,
            "page_end": logical_doc.page_end,
        },
    )

    splitter = SentenceSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        paragraph_separator="\n\n",
    )

    nodes = splitter.get_nodes_from_documents([li_doc])

    chunks: List[ChunkMetadata] = []
    for i, node in enumerate(nodes):
        chunks.append(
            ChunkMetadata(
                chunk_id=f"{logical_doc.doc_id}_chunk_{i}",
                doc_id=logical_doc.doc_id,
                doc_type=logical_doc.doc_type,
                chunk_index=i,
                page_start=node.metadata.get("page_start", logical_doc.page_start),
                page_end=node.metadata.get("page_end", logical_doc.page_end),
                text=node.text,
            )
        )

    return chunks
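
To make the sliding-window arithmetic concrete: with `chunk_size=300` and `overlap=50` the stride is 250 words, so a 700-word section yields windows starting at words 0, 250, and 500. The hypothetical `window_bounds` helper below mirrors the loop in `chunk_document_with_metadata` and adds a guard for the case where the overlap is not smaller than the chunk size.

```python
from typing import List, Tuple

def window_bounds(n_words: int, chunk_size: int = 300, overlap: int = 50) -> List[Tuple[int, int]]:
    """Return (start, end) word offsets for each sliding window."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    stride = chunk_size - overlap
    bounds = []
    for start in range(0, n_words, stride):
        end = min(start + chunk_size, n_words)
        bounds.append((start, end))
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= n_words:
            break
    return bounds

bounds = window_bounds(700)
```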


In [None]:
# Chunk every logical document with the selected strategy

def process_all_documents(
    logical_docs,
    use_llama_index: bool = True
):
    """
    Chunk all logical documents and attach metadata.
    Returns a flat list of ChunkMetadata.
    """
    all_chunks = []

    for logical_doc in logical_docs:
        if use_llama_index:
            chunks = chunk_with_llama_index(logical_doc)
        else:
            chunks = chunk_document_with_metadata(logical_doc)

        logical_doc.chunks = chunks
        all_chunks.extend(chunks)

        print(f"📄 {logical_doc.doc_type}: Created {len(chunks)} chunks")

    return all_chunks


## Query Routing and Intelligent Retrieval
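
Local models often wrap the requested JSON in extra prose, so the router has to cut the object out of the raw completion before parsing it. The sketch below shows one defensive way to do that. `extract_json_object` is a hypothetical helper, and the sample completion is made up for illustration.

```python
import json
from typing import Any, Dict, Optional

def extract_json_object(raw: str) -> Optional[Dict[str, Any]]:
    """Return the outermost {...} span parsed as a dict, or None if that fails."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None

raw_completion = 'Sure! Here is the result: {"type": "Invoice", "confidence": 0.87} Hope that helps.'
routed = extract_json_object(raw_completion)
```

Returning `None` instead of raising lets the caller fall back to an unrouted search when the model output cannot be parsed.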

In [None]:
def predict_query_document_type(query: str) -> Tuple[str, float]:
    """
    Predict which document type is most likely to contain the answer using Mistral.
    Returns (type, confidence).
    """
    prompt = f"""
You are a query routing assistant.

Analyze this query and predict which document type would most likely contain the answer.

Query: "{query}"

Choose exactly ONE of:
- Resume
- Payslip
- Mortgage Document
- Invoice
- Other

Respond in JSON ONLY, like:
{{"type": "Invoice", "confidence": 0.87}}
"""
    try:
        raw = mistral_generate(prompt, max_tokens=128)
        # Try to find JSON inside the response (in case of extra text)
        start = raw.find("{")
        end = raw.rfind("}") + 1  # 0 when no closing brace is found
        if start != -1 and end > start:
            raw_json = raw[start:end]
        else:
            raw_json = raw

        result = json.loads(raw_json)
        return result.get("type", "Other"), float(result.get("confidence", 0.5))
    except Exception as e:
        print(f"Query routing error: {e}")
        return "Other", 0.0

class IntelligentRetriever:
    """
    Advanced retrieval system with metadata filtering and query routing.
    Uses local embedding model + FAISS.
    """

    def __init__(self):
        self.index = None
        self.chunks_metadata: List[ChunkMetadata] = []
        self.doc_type_indices = {}
        self.total_queries = 0
        self.cache_hits = 0  # Placeholder if you add caching later

    def build_indices(self, chunks_metadata: List[ChunkMetadata]):
        print("🔨 Building vector indices...")
        self.chunks_metadata = chunks_metadata

        # Compute embeddings
        texts = [
            # Prefix each chunk with its own type and page range
            f"Type: {chunk.doc_type}.\n"
            f"Pages: {chunk.page_start}-{chunk.page_end}.\n"
            f"Content: {chunk.text[:600]}"
            for chunk in chunks_metadata
        ]

        embeddings = []
        for t in texts:
            embeddings.append(get_embedding(t))
        embeddings = np.vstack(embeddings).astype("float32")

        for i, chunk in enumerate(chunks_metadata):
            chunk.embedding = embeddings[i]

        dim = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dim)
        self.index.add(embeddings)

        doc_types = set(chunk.doc_type for chunk in chunks_metadata)
        for doc_type in doc_types:
            type_indices = [i for i, chunk in enumerate(chunks_metadata)
                            if chunk.doc_type == doc_type]
            if type_indices:
                type_embeddings = embeddings[type_indices]
                type_index = faiss.IndexFlatL2(dim)
                type_index.add(type_embeddings)
                self.doc_type_indices[doc_type] = {
                    "index": type_index,
                    "mapping": type_indices
                }

        print(f"✅ Indexed {len(chunks_metadata)} chunks across {len(doc_types)} document types")

    def retrieve(self, query: str, k: int = 4,
                 filter_doc_type: Optional[str] = None,
                 auto_route: bool = True) -> List[Tuple[ChunkMetadata, float]]:
        self.total_queries += 1

        query_vec = get_embedding(query)
        query_embedding = query_vec.reshape(1, -1).astype("float32")

        # Decide which index to search: explicit filter, routed type, or the global index
        route_type = None
        if filter_doc_type and filter_doc_type in self.doc_type_indices:
            route_type = filter_doc_type
        elif auto_route:
            predicted_type, confidence = predict_query_document_type(query)
            print(f"🎯 Query routed to: {predicted_type} (confidence: {confidence:.2f})")
            if confidence > 0.7 and predicted_type in self.doc_type_indices:
                route_type = predicted_type

        if route_type is not None:
            type_data = self.doc_type_indices[route_type]
            D, I = type_data["index"].search(query_embedding, k)
            # FAISS pads results with -1 when the index holds fewer than k vectors
            hits = [(type_data["mapping"][i], float(d)) for i, d in zip(I[0], D[0]) if i >= 0]
        else:
            D, I = self.index.search(query_embedding, k)
            hits = [(int(i), float(d)) for i, d in zip(I[0], D[0]) if i >= 0]

        # Rescale L2 distances to 0-1 scores relative to the farthest hit
        max_dist = max((d for _, d in hits), default=0.0) or 1.0
        results = [(self.chunks_metadata[i], (max_dist - d) / max_dist) for i, d in hits]

        return results
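
The relevance score returned by `retrieve` rescales the L2 distances within a single result set, so it is relative to the other hits and not an absolute similarity. The sketch below reproduces that arithmetic in plain Python. `distances_to_scores` is a hypothetical helper written for illustration.

```python
from typing import List, Sequence

def distances_to_scores(distances: Sequence[float]) -> List[float]:
    """Map L2 distances to scores in [0, 1], where the farthest hit scores 0."""
    if not distances:
        return []
    max_dist = max(distances)
    if max_dist <= 0:
        # All hits are exact matches; avoid dividing by zero.
        return [1.0 for _ in distances]
    return [(max_dist - d) / max_dist for d in distances]

scores = distances_to_scores([0.2, 0.5, 1.0])
```

One consequence of this scheme is that the farthest hit always scores 0, so a result set with a single nonzero-distance hit reports 0% relevance regardless of how close that hit is.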

## Enhanced Answer Generation with Source Attribution

In [None]:
def generate_answer_with_sources(query: str,
                                 retrieved_chunks: List[Tuple[ChunkMetadata, float]]) -> Dict:
    if not retrieved_chunks:
        return {
            "answer": "I couldn't find relevant information to answer your question.",
            "sources": [],
            "confidence": 0.0
        }

    context_parts = []
    sources = []

    MAX_CHARS = 500
    MAX_CHUNKS = 3
    for chunk_meta, score in retrieved_chunks[:MAX_CHUNKS]:
        context_parts.append(
            f"[From {chunk_meta.doc_type}, Pages {chunk_meta.page_start}-{chunk_meta.page_end}]"
            )
        context_parts.append(chunk_meta.text[:MAX_CHARS])
        context_parts.append("")

        sources.append({
            "doc_type": chunk_meta.doc_type,
            "pages": f"{chunk_meta.page_start}-{chunk_meta.page_end}",
            "relevance": f"{score:.2%}",
            "preview": chunk_meta.text[:100] + "..."
        })

    context = "\n".join(context_parts)

    prompt = f"""
You are a helpful AI assistant. Use the provided context to answer the question.
Be specific and cite which document type and pages support your answer.

Context:
{context}

Question: {query}

Instructions:
1. Answer based ONLY on the provided context.
2. Mention which document type(s) contain the information.
3. Be concise but complete.
4. If the context doesn't contain enough information, say so.

Answer:
"""
    try:
        answer = mistral_generate(prompt, max_tokens=512).strip()
        # Average only over the chunks that were actually placed in the prompt
        used = retrieved_chunks[:MAX_CHUNKS]
        avg_score = sum(s for _, s in used) / len(used)
        return {
            "answer": answer,
            "sources": sources,
            "confidence": avg_score,
            "chunks_used": len(used)
        }
    except Exception as e:
        print(f"Answer generation error: {e}")
        return {
            "answer": f"Error generating answer: {str(e)}",
            "sources": sources,
            "confidence": 0.0
        }
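
The context budget (at most three chunks of 500 characters each) helps keep the prompt inside the 4096-token window configured for the model. A minimal sketch of that truncation step follows, using a hypothetical `build_context` helper and made-up chunk tuples.

```python
from typing import List, Tuple

def build_context(chunks: List[Tuple[str, str, str]],
                  max_chunks: int = 3, max_chars: int = 500) -> str:
    """chunks holds (doc_type, pages, text); emit a header plus truncated text per chunk."""
    parts = []
    for doc_type, pages, text in chunks[:max_chunks]:
        parts.append(f"[From {doc_type}, Pages {pages}]")
        parts.append(text[:max_chars])
        parts.append("")  # blank line between chunks
    return "\n".join(parts)

context = build_context([
    ("Payslip", "0-0", "Net Pay: 3,200.00 " * 100),
    ("Resume", "1-2", "Skills: Python, SQL"),
])
```

Truncating by characters is a coarse proxy for tokens; counting tokens with the model's tokenizer would be more precise but is outside the scope of this sketch.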


## Enhanced Document Store

In [None]:
class EnhancedDocumentStore:
    def __init__(self):
        self.pages_info: List[PageInfo] = []
        self.logical_docs: List[LogicalDocument] = []
        self.chunks_metadata: List[ChunkMetadata] = []
        self.retriever = IntelligentRetriever()
        self.is_ready = False
        self.processing_stats = {}
        self.filename = None

    def process_pdf(self, pdf_file, filename: str = "document.pdf"):
        self.filename = filename
        self.is_ready = False
        start_time = datetime.now()

        try:
            self.pages_info, self.logical_docs = extract_and_analyze_pdf(pdf_file)
            self.chunks_metadata = process_all_documents(self.logical_docs)
            self.retriever.build_indices(self.chunks_metadata)

            process_time = (datetime.now() - start_time).total_seconds()
            self.processing_stats = {
                "filename": filename,
                "total_pages": len(self.pages_info),
                "documents_found": len(self.logical_docs),
                "total_chunks": len(self.chunks_metadata),
                "document_types": list(set(doc.doc_type for doc in self.logical_docs)),
                "processing_time": f"{process_time:.1f}s"
            }

            self.is_ready = True
            return True, self.processing_stats
        except Exception as e:
            return False, {"error": str(e)}

    def query(self, question: str, filter_type: Optional[str] = None,
              auto_route: bool = True, k: int = 4) -> Dict:
        if not self.is_ready:
            return {
                "answer": "Please upload and process a PDF first.",
                "sources": [],
                "confidence": 0.0
            }

        # k is capped at 3 because larger contexts overflowed the 4096-token window
        retrieved = self.retriever.retrieve(
            question, k=min(k, 3),
            filter_doc_type=filter_type,
            auto_route=auto_route and filter_type is None
        )

        result = generate_answer_with_sources(question, retrieved)
        result["filter_used"] = filter_type or ("auto" if auto_route else "none")
        return result

    def get_document_structure(self) -> List[Dict]:
        if not self.logical_docs:
            return []

        structure = []
        for doc in self.logical_docs:
            structure.append({
                "id": doc.doc_id,
                "type": doc.doc_type,
                "pages": f"{doc.page_start + 1}-{doc.page_end + 1}",
                "chunks": len(doc.chunks) if doc.chunks else 0,
                "preview": doc.text[:200] + "..." if len(doc.text) > 200 else doc.text
            })
        return structure

## Gradio Interface with Enhanced Features
The interface below wires the upload, processing, and chat handlers to the document store:

In [None]:
doc_store = EnhancedDocumentStore()

def process_pdf_handler(pdf_file):
    if pdf_file is None:
        # Return three values to match the success path and the caller's unpacking
        return "⚠️ Please upload a PDF file", None, gr.update(choices=["All"])

    success, stats = doc_store.process_pdf(
        pdf_file,
        filename=getattr(pdf_file, "name", "document.pdf")
    )

    if success:
        status_msg = f"""
✅ **Successfully Processed**
- File: {stats['filename']}
- Pages: {stats['total_pages']}
- Docs Found: {stats['documents_found']}
- Chunks: {stats['total_chunks']}
"""

        structure = doc_store.get_document_structure()
        structure_display = "\n".join([
            f"• {doc['type']} (Pages {doc['pages']})"
            for doc in structure
        ])

        doc_types = ["All"] + stats["document_types"]

        return status_msg, structure_display, gr.update(choices=doc_types, value="All")

    # Return three values to match the success path and the caller's unpacking
    return "❌ Processing failed", None, gr.update(choices=["All"])

def chat_handler(message, history, doc_filter, auto_route, num_chunks):
    if history is None:
        history = []

    if not doc_store.is_ready:
        history.append(
            {"role": "assistant", "content": "📚 Please upload and process a PDF document first."}
        )
        return history

    filter_type = None if doc_filter == "All" else doc_filter
    result = doc_store.query(
        message,
        filter_type=filter_type,
        auto_route=auto_route and filter_type is None,
        k=num_chunks
    )

    response = result["answer"] + "\n\n"
    if result["sources"]:
        response += "📍 Sources:\n"
        for src in result["sources"]:
            response += f"- {src['doc_type']} (Pages {src['pages']})\n"

    response += f"\nConfidence: {result['confidence']:.1%}"

    history.append({"role": "user", "content": message})
    history.append({"role": "assistant", "content": response})

    return history

def create_interface():
    with gr.Blocks(title="Enhanced Document Q&A (Mistral RAG)") as demo:
        gr.Markdown("""
# 🚀 Enhanced Document Q&A System (Mistral RAG)
### Intelligent Multi-Document Analysis with Local RAG Pipeline
""")

        with gr.Row():
            with gr.Column(scale=2):
                pdf_input = gr.File(
                    label="📄 Upload PDF",
                    #interactive=True,
                    #height=600,
                    file_types=[".pdf"],
                    type="filepath"
                )

                with gr.Row():
                    process_btn = gr.Button("🔄 Process Document", variant="primary", scale=2)
                    clear_all_btn = gr.Button("🗑️ Clear All", variant="secondary", scale=1)

            with gr.Column(scale=1):
                gr.Markdown("### 📊 Document Info")
                status_output = gr.Markdown(value="⏳ Waiting for PDF upload...")
                structure_output = gr.Markdown(value="", label="Document Structure")

                gr.Markdown("### ⚙️ Settings")
                doc_filter = gr.Dropdown(
                    choices=["All"],
                    value="All",
                    label="🏷️ Document Type Filter",
                    info="Filter search to specific document type"
                )
                auto_route = gr.Checkbox(
                    value=True,
                    label="🎯 Auto-Route Queries",
                    info="Automatically detect relevant document type"
                )
                num_chunks = gr.Slider(
                    minimum=1,
                    maximum=10,
                    value=4,
                    step=1,
                    label="📊 Chunks to Retrieve"
                )

            with gr.Column(scale=2):
                gr.Markdown("### 💬 Ask Questions")
                chatbot = gr.Chatbot(
                    label="Conversation",
                    height=500,
                    #elem_id="chatbot",
                    #show_label=False
                )

                with gr.Row():
                    msg_input = gr.Textbox(
                        label="Ask a question",
                        placeholder="e.g., What are the payment terms? What is the total amount?",
                        scale=4,
                        show_label=False
                    )
                    send_btn = gr.Button("📤 Send", scale=1, variant="primary")

                with gr.Row():
                    clear_chat_btn = gr.Button("🗑️ Clear Chat", size="sm", scale=1)
                    example_btn1 = gr.Button("📝 What's the summary?", size="sm", scale=1)
                    example_btn2 = gr.Button("💰 Find amounts", size="sm", scale=1)

        with gr.Row():
            status_bar = gr.Markdown(
                value="**Status:** Ready | **Documents:** 0 | **Chunks:** 0 | **Cache Hits:** 0/0",
                elem_id="status_bar"
            )

        def update_status_bar():
            if doc_store.is_ready:
                stats = doc_store.processing_stats
                cache_rate = 0
                if doc_store.retriever.total_queries > 0:
                    cache_rate = (doc_store.retriever.cache_hits / doc_store.retriever.total_queries) * 100
                return f"**Status:** ✅ Ready | **Documents:** {stats.get('documents_found', 0)} | **Chunks:** {stats.get('total_chunks', 0)} | **Cache Rate:** {cache_rate:.0f}%"
            return "**Status:** Ready | **Documents:** 0 | **Chunks:** 0 | **Cache Hits:** 0/0"

        def clear_all():
            doc_store.pages_info = []
            doc_store.logical_docs = []
            doc_store.chunks_metadata = []
            doc_store.retriever = IntelligentRetriever()
            doc_store.is_ready = False
            return (
                None,                              # pdf_input
                "⏳ Waiting for PDF upload...",     # status_output
                "",                                # structure_output
                gr.update(choices=["All"], value="All"),  # doc_filter
                [],                                # chatbot
                "",                                # msg_input
                update_status_bar()                # status_bar
            )

        def process_pdf_with_status(pdf_file):
            status, structure, filter_update = process_pdf_handler(pdf_file)
            status_bar_text = update_status_bar()
            return status, structure, filter_update, status_bar_text

        def chat_with_status(message, history, doc_filter, auto_route, num_chunks):
            new_history = chat_handler(message, history, doc_filter, auto_route, num_chunks)
            status_bar_text = update_status_bar()
            return new_history, status_bar_text

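        # Example-button handlers. Current widget values arrive as event inputs;
        # reading `component.value` would only return the initial default.
        def ask_summary(history, doc_filter, auto_route, num_chunks):
            return chat_with_status(
                "Can you provide a summary of the main points in this document?",
                history, doc_filter, auto_route, num_chunks
            )

        def ask_amounts(history, doc_filter, auto_route, num_chunks):
            return chat_with_status(
                "What are all the monetary amounts or financial figures mentioned?",
                history, doc_filter, auto_route, num_chunks
            )

        # Wire the example buttons, which were previously created but never connected.
        example_inputs = [chatbot, doc_filter, auto_route, num_chunks]
        example_btn1.click(fn=ask_summary, inputs=example_inputs, outputs=[chatbot, status_bar])
        example_btn2.click(fn=ask_amounts, inputs=example_inputs, outputs=[chatbot, status_bar])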

        process_btn.click(
          fn=process_pdf_with_status,
          inputs=[pdf_input],
          outputs=[
            status_output,
            structure_output,
            doc_filter,
            status_bar,
          ]
        )

        clear_all_btn.click(
            fn=clear_all,
            outputs=[pdf_input, status_output, structure_output, doc_filter,
                     chatbot, msg_input, status_bar]
        )

        msg_input.submit(
            fn=chat_with_status,
            inputs=[msg_input, chatbot, doc_filter, auto_route, num_chunks],
            outputs=[chatbot, status_bar]
        ).then(lambda: "", outputs=[msg_input])

        send_btn.click(
            fn=chat_with_status,
            inputs=[msg_input, chatbot, doc_filter, auto_route, num_chunks],
            outputs=[chatbot, status_bar]
        ).then(lambda: "", outputs=[msg_input])

        clear_chat_btn.click(lambda: [], outputs=[chatbot])

    return demo

In [None]:
# Manually process PDF for metrics
# Replace with real file path

PDF_PATH = "Blob File Sample.pdf"

success, stats = doc_store.process_pdf(PDF_PATH)

print("Processed:", success)
print("Stats:", stats)
print("FAISS index exists:", doc_store.retriever.index is not None)


📖 Starting PDF extraction and analysis...
✅ Extracted 4 pages
🧠 Analyzing document structure...
  Page 0: New document detected - Resume
  Page 1: New document detected - Mortgage Document
  Page 2: New document detected - Payslip
✅ Identified 3 logical documents
   - Resume: Pages 0-0
   - Mortgage Document: Pages 1-1
   - Payslip: Pages 2-3
📄 Resume: Created 1 chunks
📄 Mortgage Document: Created 3 chunks
📄 Payslip: Created 1 chunks
🔨 Building vector indices...
Processed: False
Stats: {'error': "name 'embed_llm' is not defined"}
FAISS index exists: False


## 📊 RAG Performance Metrics

To evaluate retrieval quality and answer reliability, we measure:

- **Recall@K** – whether relevant chunks are retrieved
- **Precision@K** – how many retrieved chunks are actually relevant
- **MRR (Mean Reciprocal Rank)** – how early the correct chunk appears
- **End-to-End Accuracy** – correctness of final LLM answers
- **Average Latency** – real-world responsiveness

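Metrics are computed on a small test set of three questions. A retrieved chunk or a final answer counts as relevant when it contains at least one of the question's gold keywords, so the scores are a keyword-match proxy rather than human-judged relevance.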
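
As a rough illustration of how the three retrieval metrics relate, the sketch below computes them from per-query lists of relevance flags (one boolean per retrieved chunk, in rank order). The function names and the toy flags are assumptions made for this example; they are not results from this pipeline.

```python
# Illustrative only: toy relevance flags, not output from this notebook.
from statistics import mean

def recall_hit_at_k(flags, k):
    """1.0 if any of the top-k retrieved chunks is relevant, else 0.0."""
    return 1.0 if any(flags[:k]) else 0.0

def precision_at_k(flags, k):
    """Fraction of the top-k slots filled by relevant chunks."""
    return sum(flags[:k]) / k

def reciprocal_rank(flags, k):
    """1 / rank of the first relevant chunk in the top k, or 0.0 if none."""
    for rank, is_relevant in enumerate(flags[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# One list per query; True means the chunk at that rank matched a gold keyword.
toy_runs = [
    [True, False, True],    # first hit at rank 1
    [False, True, False],   # first hit at rank 2
    [False, False, False],  # no hit
]
k = 3
recall = mean(recall_hit_at_k(r, k) for r in toy_runs)
precision = mean(precision_at_k(r, k) for r in toy_runs)
mrr = mean(reciprocal_rank(r, k) for r in toy_runs)
print(f"Recall@{k}={recall:.2f} Precision@{k}={precision:.2f} MRR@{k}={mrr:.2f}")
```

Note that "recall" here is a per-query hit rate (was anything relevant retrieved at all), which is also how the evaluation cell in this notebook counts it.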


In [None]:
# RAG PERFORMANCE METRICS CELL
import time
import numpy as np

k = 6

# Small evaluation set
test_set = [
    {
        "question": "What is this document about?",
        "gold_keywords": ["agreement", "summary", "contract", "report"]
    },
    {
        "question": "What are the key financial or monetary amounts mentioned?",
        "gold_keywords": ["$", "amount", "payment", "total"]
    },
    {
        "question": "Who is the document intended for?",
        "gold_keywords": ["client", "customer", "borrower", "recipient"]
    }
]

# Helper functions
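def contains_keyword(text, keywords):
    """Case-insensitive check: does `text` contain any of `keywords`?"""
    text = text.lower()
    # `kw` avoids reusing the name of the module-level `k` (top-K size).
    return any(kw.lower() in text for kw in keywords)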

def run_retrieval(question, k=4):
    if doc_store.retriever.index is None:
        raise RuntimeError("FAISS index not initialized. Build index before retrieval.")

    return doc_store.retriever.retrieve(
        question,
        k=k,
        auto_route=True
    )

# Recall@K
def compute_recall_at_k(test_set, k=k):
    hits = 0
    for ex in test_set:
        retrieved = run_retrieval(ex["question"], k)
        found = any(
            contains_keyword(chunk.text, ex["gold_keywords"])
            for chunk, _ in retrieved
        )
        if found:
            hits += 1
    return hits / len(test_set)

# MRR (Mean Reciprocal Rank)
def compute_mrr(test_set, k=k):
    scores = []
    for ex in test_set:
        retrieved = run_retrieval(ex["question"], k)
        rank = 0
        for i, (chunk, _) in enumerate(retrieved):
            if contains_keyword(chunk.text, ex["gold_keywords"]):
                rank = i + 1
                break
        scores.append(1.0 / rank if rank > 0 else 0.0)
    return np.mean(scores)

# End-to-End Accuracy
def compute_end_to_end_accuracy(test_set):
    correct = 0
    for ex in test_set:
        result = doc_store.query(ex["question"], auto_route=False)
        if contains_keyword(result["answer"], ex["gold_keywords"]):
            correct += 1
    return correct / len(test_set)

# Precision@K
def compute_precision_at_k(test_set, k=k):
    precisions = []

    for ex in test_set:
        retrieved = run_retrieval(ex["question"], k)

        if len(retrieved) == 0:
            precisions.append(0.0)
            continue

        relevant = sum(
            contains_keyword(chunk.text, ex["gold_keywords"])
            for chunk, _ in retrieved
        )

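        # Divides by k, so precision is capped below 1.0 whenever fewer
        # than k chunks are returned (e.g. with a very small index).
        precisions.append(relevant / k)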

    return np.mean(precisions)


# Latency Measurement
def compute_avg_latency(test_set):
    times = []
    for ex in test_set:
        # perf_counter() is monotonic and better suited to timing than time.time()
        start = time.perf_counter()
        _ = doc_store.query(ex["question"])
        times.append(time.perf_counter() - start)
    return np.mean(times)

# Retrieved Chunks + Final Answer (FOR SLIDES / DEMO)
def log_retrieved_chunks_and_answer(question, k=3):
    retrieved = run_retrieval(question, k)

    print("\n🔎 Retrieved Chunks (Slide Capture)")
    for i, (chunk, score) in enumerate(retrieved, start=1):
        preview = chunk.text[:250].replace("\n", " ").strip()
        print(f"Chunk {i} | Similarity: {score:.2f}")
        print(preview)
        print("-" * 50)

    print("\n🧠 Mistral Final Answer")
    result = doc_store.query(question, auto_route=False)
    print(result["answer"])

# Run Evaluation
recall_k = compute_recall_at_k(test_set, k)
mrr_k = compute_mrr(test_set, k)
e2e_acc = compute_end_to_end_accuracy(test_set)
precision_k = compute_precision_at_k(test_set, k)
avg_latency = compute_avg_latency(test_set)

print("📊 RAG Evaluation Results")
print(f"Recall@{k}:        {recall_k:.2f}")
print(f"Precision@{k}:     {precision_k:.2f}")
print(f"MRR@{k}:           {mrr_k:.2f}")
print(f"End-to-End Acc:    {e2e_acc:.2f}")
print(f"Avg Latency (sec): {avg_latency:.2f}")


RuntimeError: FAISS index not initialized. Build index before retrieval.

In [None]:
# For Query1 on Slides
log_retrieved_chunks_and_answer(
    "What is this document about?",
    k=3
)

decode: cannot decode batches with this context (calling encode() instead)
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print:        load time =     378.42 ms
llama_perf_context_print: prompt eval time =       5.26 ms /     8 tokens (    0.66 ms per token,  1520.05 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =       7.76 ms /     9 tokens
llama_perf_context_print:    graphs reused =          0
decode: cannot decode batches with this context (calling encode() instead)
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print:        load time =     378.42 ms
llama_perf_context_print: prompt eval time =       4.05 ms /     8 tokens (    0.51 ms per token,  1977.26 tokens per second)
llama_perf_context_print:        eval time =  

Query routing error: Extra data: line 3 column 1 (char 38)
🎯 Query routed to: Other (confidence: 0.00)

🔎 Retrieved Chunks (Slide Capture)
Chunk 1 | Similarity: 0.04
Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan. Fee Details and Summary Applicants: Application No: Date Prepared: Loan Program: Prepared By: THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fe
--------------------------------------------------
Chunk 2 | Similarity: 0.01
frm (09/2015) FEES WORKSHEET John Q. Smith / Mary A. Smith samplesmith 10/05/2015 30 YEAR FIXED -Purchase XYZ Lender $ 380,000 4.250 % 360 / 360 mths 475,000.00 1,121.53 4,520.00 380,000.00 Cash Deposit 5,000.00 needed to close 95,641.53 1,869.37 39.
--------------------------------------------------
Chunk 3 | Similarity: 0.00
00 Lender's Title Insurance Borrower $ 650.00 Title - Courier Fee Settlement Agent Borrower $ 50.00 Electronic Document Delivery FeeSettlement Agent Borrower $ 50.00 Pest

In [None]:
# For Query2 on Slides
log_retrieved_chunks_and_answer(
    "What are the key financial or monetary amounts mentioned?",
    k=3
)

decode: cannot decode batches with this context (calling encode() instead)
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print:        load time =     378.42 ms
llama_perf_context_print: prompt eval time =       4.31 ms /    12 tokens (    0.36 ms per token,  2785.52 tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =       5.79 ms /    13 tokens
llama_perf_context_print:    graphs reused =          0
decode: cannot decode batches with this context (calling encode() instead)
init: embeddings required but some input tokens were not marked as outputs -> overriding
llama_perf_context_print:        load time =     378.42 ms
llama_perf_context_print: prompt eval time =       4.44 ms /    12 tokens (    0.37 ms per token,  2705.75 tokens per second)
llama_perf_context_print:        eval time =  

Query routing error: Extra data: line 2 column 1 (char 39)
🎯 Query routed to: Other (confidence: 0.00)

🔎 Retrieved Chunks (Slide Capture)
Chunk 1 | Similarity: 0.11
Your actual rate, payment, and cost could be higher. Get an official Loan Estimate before choosing a loan. Fee Details and Summary Applicants: Application No: Date Prepared: Loan Program: Prepared By: THIS IS NOT A GOOD FAITH ESTIMATE (GFE). This "Fe
--------------------------------------------------
Chunk 2 | Similarity: 0.01
frm (09/2015) FEES WORKSHEET John Q. Smith / Mary A. Smith samplesmith 10/05/2015 30 YEAR FIXED -Purchase XYZ Lender $ 380,000 4.250 % 360 / 360 mths 475,000.00 1,121.53 4,520.00 380,000.00 Cash Deposit 5,000.00 needed to close 95,641.53 1,869.37 39.
--------------------------------------------------
Chunk 3 | Similarity: 0.00
Payslip Pay Date : 2025/07/17 Working Days : 26 Employee Name : James Bond Employee ID : 007 Earnings Amount Deductions Amount Basic Pay 8000 Tax 800 Allowance 500 Overti
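
The `Query routing error: Extra data` messages in the outputs above are what `json.loads` raises when valid JSON is followed by more text, which would be consistent with the model appending extra output after its JSON reply. The router code is not shown in this section, so the following is only a minimal sketch of one possible mitigation: parse the first JSON object with `json.JSONDecoder.raw_decode` and ignore whatever follows. The function name `parse_first_json_object`, the field names, the fallback values, and the sample reply are all assumptions for illustration.

```python
import json

def parse_first_json_object(raw_reply, fallback=None):
    """Return the first JSON object found in raw_reply, ignoring trailing text."""
    start = raw_reply.find("{")
    if start == -1:
        return fallback
    try:
        # raw_decode stops at the end of the first JSON value instead of
        # failing on whatever comes after it.
        obj, _end = json.JSONDecoder().raw_decode(raw_reply, start)
    except json.JSONDecodeError:
        return fallback
    return obj if isinstance(obj, dict) else fallback

# Hypothetical model reply: a JSON object followed by extra text.
reply = '{"doc_type": "Payslip", "confidence": 0.9}\n\nExplanation: mentions pay.'
fallback = {"doc_type": "Other", "confidence": 0.0}

try:
    json.loads(reply)
except json.JSONDecodeError as err:
    print("json.loads failed:", err.msg)

print(parse_first_json_object(reply, fallback))
```

If the reply contains no parseable object at all, the helper returns the fallback, so the caller can still route to a default document type.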

## Current Limitations & Next Steps

**Known limitations**
- OCR accuracy degrades on low-resolution or handwritten documents
- Table extraction may lose row alignment
- Retrieval precision can drop when several chunks contain near-identical clauses

**Next steps**
- Add a cross-encoder reranker
- Add table-aware parsing for financial line items
- Add multilingual OCR support
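
To make the reranker item concrete, here is a minimal sketch of a second-stage reranking pass: the retriever's top candidates are rescored against the query and re-sorted. The `rerank` function and the `overlap_score` scorer are hypothetical. The toy scorer only counts shared words; a real cross-encoder (for example the `CrossEncoder` class in `sentence-transformers`, whose `predict` method scores query/passage pairs) would take its place. The candidate snippets are made up for the example.

```python
from typing import Callable, List, Sequence, Tuple

def rerank(
    query: str,
    candidates: Sequence[str],
    score_pairs: Callable[[List[Tuple[str, str]]], List[float]],
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Rescore (query, candidate) pairs and return the top_n by new score."""
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

def overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    scores = []
    for query, text in pairs:
        query_words = set(query.lower().split())
        text_words = set(text.lower().split())
        scores.append(len(query_words & text_words) / max(len(query_words), 1))
    return scores

# Toy candidates standing in for chunks returned by the first-stage retriever.
candidates = [
    "Lender's title insurance and courier fee",
    "Basic pay 8000 tax 800 allowance 500",
    "Employee name and pay date on the payslip",
]
for text, score in rerank("basic pay and tax", candidates, overlap_score, top_n=2):
    print(f"{score:.2f}  {text}")
```

Keeping the scorer as a plain callable means the retrieval code does not need to change when the toy scorer is swapped for a real model.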


## 🖥️ Interactive Demo (Optional)

The following Gradio interface allows users to upload PDFs, inspect document structure, and interact with the RAG pipeline in real time.


In [None]:
demo = create_interface()
demo.launch(share=True, debug=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://a7b523fcce2a0a2ed0.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


📖 Starting PDF extraction and analysis...
✅ Extracted 4 pages
🧠 Analyzing document structure...
  Page 0: New document detected - Resume
  Page 1: New document detected - Mortgage Document
  Page 2: New document detected - Payslip
✅ Identified 3 logical documents
   - Resume: Pages 0-0
   - Mortgage Document: Pages 1-1
   - Payslip: Pages 2-3
📄 Resume: Created 1 chunks


init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding


📄 Mortgage Document: Created 3 chunks
📄 Payslip: Created 1 chunks
🔨 Building vector indices...


init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding


✅ Indexed 5 chunks across 3 document types


init: embeddings required but some input tokens were not marked as outputs -> overriding


Query routing error: Extra data: line 3 column 1 (char 40)
🎯 Query routed to: Other (confidence: 0.00)


init: embeddings required but some input tokens were not marked as outputs -> overriding


Query routing error: Extra data: line 3 column 1 (char 39)
🎯 Query routed to: Other (confidence: 0.00)


init: embeddings required but some input tokens were not marked as outputs -> overriding


🎯 Query routed to: Bank Statement (confidence: 0.95)
