# 🚀 Multi-Document RAG System with Advanced Retrieval

## Project Overview
This notebook implements a **production-ready Retrieval-Augmented Generation (RAG)** system capable of:
- Ingesting **multiple PDF documents** into a unified knowledge base
- Answering questions using **hybrid retrieval** (vector + keyword search)
- Providing **cited, verifiable answers** with source attribution
- **Comparing information** across multiple documents

## Architecture Summary
```
User Query → Query Classification → Query Expansion (Multi-Query)
     ↓
HyDE Generation → Hybrid Retrieval (Vector + BM25)
     ↓
RRF Fusion → Cross-Encoder Re-ranking → LLM Generation
     ↓
Answer Verification → Final Response with Citations
```

## Key Technologies
| Component | Technology |
|-----------|------------|
| LLM | Llama 3.3 70B (via Groq) |
| Embeddings | BAAI/bge-large-en-v1.5 |
| Re-ranker | BAAI/bge-reranker-v2-m3 |
| Vector DB | ChromaDB |
| Keyword Search | BM25 |
| UI | Gradio |

## Requirements
- **Groq API Key** (free at console.groq.com)
- **Python 3.10+**
- **GPU recommended** but not required

In [1]:
# ==========================================
# CELL 1: DEPENDENCY INSTALLATION
# ==========================================
import os

print("🔥 Cleaning up the environment")

# 1. Uninstall EVERYTHING to ensure no "ghost" versions remain
!pip uninstall -y -q numpy pandas scipy scikit-learn langchain langchain-community langchain-core langchain-groq


print("📦 Installing the Dependencies")

# CORE MATH LIBRARIES
!pip install -q numpy==1.26.4
!pip install -q pandas==2.2.2
!pip install -q scipy==1.13.1

# LANGCHAIN 0.2 ECOSYSTEM
# We strictly pin these to the 0.2 series to avoid the breaking 0.3 update
!pip install -q langchain-core==0.2.40
!pip install -q langchain-community==0.2.16
!pip install -q langchain==0.2.16
!pip install -q langchain-groq==0.1.9
!pip install -q langchain-text-splitters==0.2.4

# VECTOR DATABASE & EMBEDDINGS
!pip install -q chromadb==0.5.5
!pip install -q sentence-transformers==3.0.1
!pip install -q pypdf==4.3.1
!pip install -q rank-bm25==0.2.2


# USER INTERFACE
!pip install -q gradio

print("CRITICAL: Go to 'Runtime' > 'Restart session' NOW.")
print("After restarting, run Cell 2.")

🔥 Cleaning up the environment
📦 Installing the Dependencies
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
dask-cudf-cu12 25.10.0 requires pandas<2.4.0dev0,>=2.0, which is not installed.
access 1.1.10.post3 requires pandas>=2.1.0, which is not installed.
access 1.1.10.post3 requires scipy>=1.14.1, which is not installed.
pandas-gbq 0.30.0 requires pandas>=1.1.4, which is not installed.
geemap 0.35.3 requires pandas, which is not installed.
yellowbrick 1.5 requires scikit-learn>=1.0.0, which is not installed.
yellowbrick 1.5 requires scipy>=1.0.0, which is not installed.
tensorflow-decision-forests 1.12.0 requir

This cell imports all required libraries and sets up the compute device (GPU if available, else CPU).


In [1]:
import os
import sys
import json
import torch
import numpy as np
from typing import List, Dict, Tuple, Optional
from collections import defaultdict
from dataclasses import dataclass
import hashlib
import gradio as gr
from datetime import datetime

# Core Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.retrievers import BM25Retriever
from langchain_groq import ChatGroq
from langchain_core.documents import Document

# Advanced Models
from sentence_transformers import SentenceTransformer, CrossEncoder

# Setup Device
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"System ready. Running on: {device.upper()}")

System ready. Running on: CUDA


## Core Data Structures

### QueryProfile Dataclass

**Purpose**: Encapsulates the result of query classification to guide retrieval strategy.

| Field | Type | Description | Example Values |
|-------|------|-------------|----------------|
| `query_type` | str | Category of question | `"factoid"`, `"summary"`, `"comparison"`, `"extraction"`, `"reasoning"` |
| `intent` | str | Same as query_type (for extensibility) | Same as above |
| `needs_multi_docs` | bool | Does query span multiple documents? | `True` for comparison queries |
| `requires_comparison` | bool | Is this a compare/contrast question? | `True` if "compare", "difference" in query |
| `answer_style` | str | How to format the answer | `"direct"`, `"bullets"`, `"steps"` |
| `k` | int | Number of chunks to retrieve | 5-12 (auto-tuned based on query type) |

### Query Type → Retrieval Strategy Mapping:
```
factoid    → k=6,  style=direct  (simple fact lookup)
summary    → k=10, style=bullets (overview questions)
comparison → k=12, style=bullets (cross-document comparison)
extraction → k=8,  style=direct  (extract specific info)
reasoning  → k=10, style=steps   (explain how/why)
```

In [2]:
@dataclass
class QueryProfile:
    query_type: str
    intent: str
    needs_multi_docs: bool
    requires_comparison: bool
    answer_style: str
    k: int


### QueryCache Class

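**Purpose**: A bounded cache that avoids redundant LLM calls for repeated queries.

#### How It Works:
1. **Key Generation**: MD5 hash of the exact query string (case- and whitespace-sensitive)
2. **Storage**: Dictionary mapping hash → response
3. **Eviction**: FIFO (First-In-First-Out): the oldest inserted entry is dropped when a new query arrives and `max_size` is reached; `get()` does not refresh an entry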

#### Methods:
| Method | Input | Output | Description |
|--------|-------|--------|-------------|
| `get(query)` | Query string | Response or `None` | Check if query is cached |
| `set(query, response)` | Query + Response | None | Store result, evict oldest if full |


In [3]:
class QueryCache:
    """Simple cache for repeated queries"""
    def __init__(self, max_size=100):
        self.cache = {}
        self.max_size = max_size

    def get(self, query: str) -> Optional[str]:
        key = hashlib.md5(query.encode()).hexdigest()
        return self.cache.get(key)

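    def set(self, query: str, response: str):
        key = hashlib.md5(query.encode()).hexdigest()
        # Evict the oldest entry (FIFO) only when a new key would overflow the cache;
        # re-setting an existing key just overwrites it
        if key not in self.cache and len(self.cache) >= self.max_size:
            self.cache.pop(next(iter(self.cache)))
        self.cache[key] = response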
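
`QueryCache` evicts strictly in insertion order: `get()` never refreshes an entry, so a frequently repeated question is dropped as soon as it becomes the oldest key. If least-recently-used behavior is preferred, the sketch below shows one way to get it with `collections.OrderedDict`; `LRUQueryCache` is a hypothetical name, not part of this notebook.

```python
from collections import OrderedDict
import hashlib
from typing import Any, Optional


class LRUQueryCache:
    """Hypothetical LRU alternative to QueryCache: evicts the least recently *used* entry."""

    def __init__(self, max_size: int = 100):
        self.cache: "OrderedDict[str, Any]" = OrderedDict()
        self.max_size = max_size

    @staticmethod
    def _key(query: str) -> str:
        return hashlib.md5(query.encode()).hexdigest()

    def get(self, query: str) -> Optional[Any]:
        key = self._key(query)
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def set(self, query: str, response: Any) -> None:
        key = self._key(query)
        if key in self.cache:
            self.cache.move_to_end(key)
        elif len(self.cache) >= self.max_size:
            self.cache.popitem(last=False)  # drop the least recently used entry
        self.cache[key] = response


cache = LRUQueryCache(max_size=2)
cache.set("q1", "a1")
cache.set("q2", "a2")
cache.get("q1")        # q1 becomes the most recently used entry
cache.set("q3", "a3")  # evicts q2, not q1
print(cache.get("q1"), cache.get("q2"))  # a1 None
```

With the FIFO `QueryCache`, the same sequence would evict `q1` instead, because the `get("q1")` call does not change its position.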


### SemanticChunker Class

**Purpose**: Split documents into semantically coherent chunks (vs. arbitrary character-based splits).

#### Why Semantic Chunking?
| Traditional Chunking | Semantic Chunking |
|---------------------|-------------------|
| Splits at a fixed character count | Splits where adjacent-sentence similarity drops |
| May cut mid-sentence/concept | Keeps sentences intact |
| Can dilute retrieval relevance | Tends to improve retrieval relevance |

#### Algorithm:
```
1. Split text into sentences (by ". ")
2. Encode each sentence with SentenceTransformer
3. For each consecutive sentence pair:
   - Compute cosine similarity
   - If similarity > threshold AND current size + sentence length < max:
     → Add to current chunk
   - Else:
     → Save chunk, start new one
4. Return list of semantic chunks
```

#### Parameters:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `model_name` | `all-MiniLM-L6-v2` | Sentence embedding model |
| `max_chunk_size` | 1000 | Maximum characters per chunk |
| `similarity_threshold` | 0.5 | Cosine similarity threshold for grouping |

In [4]:
class SemanticChunker:
    """Advanced semantic chunking using sentence embeddings"""
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name, device=device)

    def chunk_document(self, text: str, max_chunk_size=1000, similarity_threshold=0.5):
        """Split text into semantically coherent chunks"""
        sentences = text.replace('\n', ' ').split('. ')
        # Re-attach the period removed by split() without doubling an existing end mark
        sentences = [s.strip() if s.strip().endswith(('.', '!', '?')) else s.strip() + '.'
                     for s in sentences if s.strip()]

        if not sentences:
            # Blank page (e.g. a scanned image): return no chunks rather than an empty one
            return []

        # Normalized embeddings make the dot product equal to cosine similarity
        embeddings = self.model.encode(sentences, normalize_embeddings=True)
        chunks = []
        current_chunk = [sentences[0]]
        current_size = len(sentences[0])

        for i in range(1, len(sentences)):
            similarity = float(np.dot(embeddings[i-1], embeddings[i]))
            sentence_len = len(sentences[i])

            if similarity > similarity_threshold and current_size + sentence_len < max_chunk_size:
                current_chunk.append(sentences[i])
                current_size += sentence_len
            else:
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentences[i]]
                current_size = sentence_len

        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks
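
To see the grouping rule without downloading a model, the sketch below reruns the same greedy loop on hand-made 2-D vectors standing in for sentence embeddings. `group_sentences` is a hypothetical helper; it normalizes the vectors so that the dot product is the cosine similarity that `chunk_document` compares against `similarity_threshold`.

```python
import numpy as np


def group_sentences(sentences, embeddings, max_chunk_size=1000, similarity_threshold=0.5):
    """Hypothetical standalone version of the greedy adjacent-sentence grouping."""
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors: dot = cosine
    chunks, current, size = [], [sentences[0]], len(sentences[0])
    for i in range(1, len(sentences)):
        sim = float(emb[i - 1] @ emb[i])
        if sim > similarity_threshold and size + len(sentences[i]) < max_chunk_size:
            current.append(sentences[i])  # same topic and still under the size cap
            size += len(sentences[i])
        else:
            chunks.append(' '.join(current))  # close the chunk, start a new one
            current, size = [sentences[i]], len(sentences[i])
    chunks.append(' '.join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "GPUs run CUDA."]
# Toy "embeddings": the two cat sentences point almost the same way, the GPU one is orthogonal
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(group_sentences(sentences, embeddings))
# ['Cats purr. Cats nap a lot.', 'GPUs run CUDA.']
```

Lowering `max_chunk_size` to 20 splits the two cat sentences as well (10 + 15 characters is not under 20), showing that the size cap can override a high similarity.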


### ReciprocalRankFusion (RRF) Class

**Purpose**: Combine multiple ranked retrieval lists into a single fused ranking.

#### The Problem RRF Solves:
When using multiple retrievers (vector search, keyword search, etc.), each returns a ranked list. How do we combine them?

#### RRF Formula:
```
score(doc) = Σ 1 / (k + rank_i + 1)
```
Where:
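- `k` = 60 (smoothing constant; 60 is the value used in the original RRF paper and a common default)
- `rank_i` = position of the document in retrieval list i (0-indexed); a list that does not contain the document contributes nothing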
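
A worked example of the formula on plain document IDs instead of `Document` objects (the `rrf` helper is hypothetical): A appears at ranks 0 and 1, so it scores 1/61 + 1/62 ≈ 0.03252 and edges out C, which appears at ranks 2 and 0 (1/63 + 1/61 ≈ 0.03227).

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Hypothetical helper: score(d) = sum over lists of 1 / (k + rank + 1), rank 0-indexed."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["A", "B", "C"]  # ranking from vector search
bm25_hits = ["C", "A", "D"]    # ranking from BM25
fused = rrf([vector_hits, bm25_hits])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
# order: A, C, B, D
```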


In [5]:
class ReciprocalRankFusion:
    """RRF for combining multiple retrieval results"""
    @staticmethod
    def fuse(retrieval_results: List[List[Document]], k=60) -> List[Document]:
        doc_scores = defaultdict(float)
        doc_map = {}

        for docs in retrieval_results:
            for rank, doc in enumerate(docs):
                doc_id = doc.metadata.get('chunk_id') or f"{doc.metadata.get('pdf_id', 'unknown')}::{hashlib.md5(doc.page_content.encode()).hexdigest()}"
                doc_scores[doc_id] += 1 / (k + rank + 1)
                doc_map[doc_id] = doc

        sorted_docs = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
        return [doc_map[doc_id] for doc_id, _ in sorted_docs]


## EnhancedRAGv3 - Complete RAG Engine

This is the **core class** that orchestrates the entire RAG pipeline. All document ingestion, retrieval, and generation flows through this class.

---

### Class Architecture

```
EnhancedRAGv3
├── Storage Layer
│   ├── vector_db (ChromaDB)      # Semantic search index
│   ├── bm25_retriever            # Keyword search index
│   ├── documents (List)          # All document chunks
│   └── pdf_metadata (Dict)       # PDF tracking {name: {path, pages, chunks, pdf_id}}
│
├── Model Layer (loaded on demand by load_models())
│   ├── embedding_model           # BAAI/bge-large-en-v1.5 (~1.2GB)
│   ├── cross_encoder             # BAAI/bge-reranker-v2-m3 (~568M params, ~2.3GB)
│   ├── semantic_chunker          # all-MiniLM-L6-v2 (~90MB)
│   └── query_model               # all-MiniLM-L6-v2 (~90MB)
│
├── LLM Layer
│   └── llm (ChatGroq)            # Llama 3.3 70B via Groq API
│
└── Utility Layer
    ├── cache (QueryCache)        # Response caching (max 100 queries)
    └── api_key                   # Groq API key
```

---

### Method Reference

| Method | Purpose | Key Details |
|--------|---------|-------------|
| `__init__(api_key)` | Initialize system | Sets up LLM, all other models lazy-loaded |
| `load_models()` | Load ML models | BGE embeddings → CrossEncoder → Chunker → Query model |
| `ingest_pdf(path)` | Process PDF | Extract → Chunk → Index in ChromaDB + BM25 |
| `chat(query)` | Answer questions | Full pipeline: classify → expand → retrieve → rerank → generate |
| `summarize_document()` | Summarize all docs | Map-reduce: batch summaries → final synthesis |

---

### 1. Initialization & Model Loading

**`__init__(api_key)`** - Sets up the system with Groq API key. Models are NOT loaded yet (lazy loading for faster startup).

**`load_models()`** - Loads all ML models with progress tracking:

| Progress | Model | Size | Purpose |
|----------|-------|------|---------|
| 10% → 40% | BAAI/bge-large-en-v1.5 | ~1.2GB | Document & query embeddings (1024-dim, normalized) |
| 40% → 60% | BAAI/bge-reranker-v2-m3 | ~2.3GB | Cross-encoder re-ranking |
| 60% → 80% | all-MiniLM-L6-v2 | ~90MB | Semantic chunking |
| 80% → 100% | all-MiniLM-L6-v2 | ~90MB | Query processing |

---

### 2. Document Ingestion Pipeline

**`ingest_pdf(pdf_path, use_semantic_chunking=True)`**

```
PDF File
    ↓
┌─────────────────────┐
│ 1. Duplicate Check  │  Skip if pdf_name already in pdf_metadata
└─────────────────────┘
    ↓
┌─────────────────────┐
│ 2. PyPDFLoader      │  Extract text from each page
└─────────────────────┘
    ↓
┌─────────────────────┐
│ 3. Chunking         │  SemanticChunker (default) or RecursiveCharacterTextSplitter
└─────────────────────┘
    ↓
┌─────────────────────┐
│ 4. Add Metadata     │  {page, source, pdf_name, pdf_id, chunk_id}
└─────────────────────┘
    ↓
┌─────────────────────┐
│ 5. Rebuild Indexes  │  ChromaDB (vector) + BM25 (keyword) with ALL docs
└─────────────────────┘
```

**Chunk Metadata Schema:**
```python
{
    "page": 0,                    # 0-indexed page number
    "source": "/path/to/doc.pdf", # Full file path
    "pdf_name": "doc.pdf",        # Filename only
    "pdf_id": "a1b2c3d4",         # First 8 hex chars of MD5(file path)
    "chunk_id": "a1b2c3d4-42"     # Unique chunk identifier
}
```
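
These IDs follow directly from `ingest_pdf`: `pdf_id` is the first 8 hex characters of the MD5 of the file *path* (so the same file uploaded from a different path gets a new ID), and the chunk counter starts at `len(self.documents)` and is incremented before use, so numbering continues across PDFs. A sketch of that scheme with a hypothetical `make_chunk_ids` helper and a made-up path:

```python
import hashlib
from typing import List, Tuple


def make_chunk_ids(pdf_path: str, existing_chunks: int, n_chunks: int) -> Tuple[str, List[str]]:
    """Hypothetical helper reproducing the pdf_id / chunk_id naming used in ingest_pdf."""
    pdf_id = hashlib.md5(pdf_path.encode()).hexdigest()[:8]
    # Counter starts at the number of chunks already stored and is incremented before use
    chunk_ids = [f"{pdf_id}-{existing_chunks + i}" for i in range(1, n_chunks + 1)]
    return pdf_id, chunk_ids


# A second PDF ingested after a first one produced 40 chunks (path is illustrative)
pdf_id, chunk_ids = make_chunk_ids("/content/paper_b.pdf", existing_chunks=40, n_chunks=3)
print([cid.split("-")[1] for cid in chunk_ids])  # ['41', '42', '43']
```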

---

### 3. Query Classification

**`_classify_query(query) → QueryProfile`**

Determines the retrieval strategy: the LLM is asked for a JSON profile, and a keyword heuristic is used as a fallback if the call fails or the reply is not valid JSON. The heuristic maps queries as follows:

| Query Type | Trigger Keywords (heuristic) | k | Answer Style |
|------------|------------------------------|---|--------------|
| `factoid` | "what is", "who is", "when", "define", "list" | 6 | direct |
| `summary` | "summarize", "overview", "key points", "conclusion" | 10 | bullets |
| `comparison` | "compare", "difference", "versus", "vs", "between", "across" | 12 | bullets |
| `extraction` | (default) | 8 | direct |
| `reasoning` | "explain", "how does", "process", "steps", "methodology" | 10 | steps |

**Returns:** `QueryProfile(query_type, intent, needs_multi_docs, requires_comparison, answer_style, k)`
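
The LLM path passes `response.content` straight to `json.loads`, so a reply that wraps the JSON in a markdown fence or adds a sentence before it raises and silently falls back to the heuristic. A hedged sketch of a more tolerant parser (the helper name `extract_json_object` is hypothetical) scans for the first `{` that starts a valid JSON object using the standard library's `json.JSONDecoder.raw_decode`:

```python
import json


def extract_json_object(text: str) -> dict:
    """Hypothetical helper: return the first JSON object in an LLM reply, ignoring surrounding prose."""
    decoder = json.JSONDecoder()
    start = text.find('{')
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(text, start)  # parse one JSON value starting at `start`
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # this brace did not start valid JSON; try the next one
        start = text.find('{', start + 1)
    raise ValueError("no JSON object found")


reply = 'Here is the routing:\n{"query_type": "comparison", "k": 12}\nDone.'
profile = extract_json_object(reply)
print(profile["query_type"], profile["k"])  # comparison 12
```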

---

### 4. Query Enhancement Techniques

**`_generate_hyde_document(query) → str`** - HyDE (Hypothetical Document Embeddings)

```
Query: "What is attention?"
         ↓ LLM generates
HyDE Doc: "The attention mechanism is a neural network component
          that allows models to focus on relevant parts..."
         ↓
Embedded for retrieval (often lands closer to real passages than the short query)
```

**`_expand_query(query) → List[str]`** - Multi-Query Expansion

```
Original: "What are the benefits of transformers?"
         ↓ LLM generates 3 variants
[
  "What are the benefits of transformers?",       # Original
  "What advantages do transformer models offer?", # Variant 1
  "Why are transformers better than RNNs?",       # Variant 2
  "What makes transformer architecture effective?" # Variant 3
]
         ↓
All used for retrieval → RRF fuses results
```
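
The expansion step has to turn free-form LLM text into clean query strings. A sketch of that parsing (`parse_query_variants` is a hypothetical helper): it strips list markers such as `1.`, `2)` or `-`, skips preamble lines ending in a colon and echoes of the original question, and keeps at most three variants.

```python
import re
from typing import List


def parse_query_variants(original: str, llm_output: str, n: int = 3) -> List[str]:
    """Hypothetical helper: turn a numbered/bulleted LLM reply into [original, variant_1, ...]."""
    variants = []
    for line in llm_output.splitlines():
        # Remove a leading "1." / "2)" / "-" / "*" list marker, if present
        line = re.sub(r'^\s*(?:\d+[.)]|[-*])\s*', '', line).strip()
        # Skip blanks, preamble such as "Here are three alternatives:", and repeats of the original
        if not line or line.endswith(':') or line.lower() == original.lower():
            continue
        variants.append(line)
    return [original] + variants[:n]


reply = ("Here are three alternatives:\n"
         "1. What advantages do transformer models offer?\n"
         "2) Why are transformers better than RNNs?\n"
         "- What makes transformer architecture effective?")
print(parse_query_variants("What are the benefits of transformers?", reply))
```

Matching `\d+[.)]` instead of stripping every leading digit with a character-set `lstrip` keeps queries that legitimately start with a number, such as "3D printing...", intact.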

---

### 5. Hybrid Retrieval Pipeline

**`_retrieve_with_rrf(query, k, fetch_factor=2) → List[Document]`**

```
Query
  │
  ├─────────────────────────────────────┐
  ↓                                     ↓
┌────────────────────┐      ┌────────────────────┐
│ Vector Search (MMR)│      │ BM25 Search        │
│                    │      │                    │
│ • Semantic match   │      │ • Exact keywords   │
│ • lambda=0.6       │      │ • Term frequency   │
│   (relevance+      │      │                    │
│    diversity)      │      │                    │
└────────────────────┘      └────────────────────┘
  │                                     │
  └─────────────────┬───────────────────┘
                    ↓
           ┌────────────────┐
           │ RRF Fusion     │  score = Σ 1/(60 + rank + 1)
           └────────────────┘
                    ↓
            Fused ranked list
```

**Why Hybrid?**
- Vector: Understands synonyms, semantic similarity
- BM25: Exact term matching, handles rare words
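- Combined: RRF merges the two lists by rank position only, so the incompatible vector-similarity and BM25 score scales never need calibrating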
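
The vector side uses Maximal Marginal Relevance with `lambda_mult=0.6`. A simplified re-implementation of the standard MMR criterion (a hypothetical `mmr` helper, not LangChain's own code) shows what that weight does: each pick maximizes `lambda_mult * sim(query, doc) - (1 - lambda_mult) * max sim(doc, already_selected)`. In the real retriever this runs over the `fetch_k` candidates returned by an initial similarity search.

```python
import numpy as np


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.6):
    """Hypothetical MMR sketch: pick k doc indices balancing query relevance against redundancy."""
    q = np.asarray(query_vec, dtype=float)
    d = np.asarray(doc_vecs, dtype=float)
    q = q / np.linalg.norm(q)
    d = d / np.linalg.norm(d, axis=1, keepdims=True)
    relevance = d @ q  # cosine similarity of each doc to the query
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(k, len(d)):
        best_idx, best_score = -1, -np.inf
        for i in range(len(d)):
            if i in selected:
                continue
            redundancy = max(float(d[i] @ d[j]) for j in selected)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


query = [1.0, 1.0]
docs = [[1.0, 0.8], [1.0, 0.75], [0.5, 1.0]]  # doc 1 is a near-duplicate of doc 0
print(mmr(query, docs, k=2, lambda_mult=0.6))  # [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

At `lambda_mult=0.6` the less redundant document 2 is picked second; at `1.0` the ranking is pure relevance and the near-duplicate wins.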

---

### 6. Re-ranking & PDF Diversity

**`_rerank_documents(query, documents, top_k) → List[(Document, score)]`**

Uses **CrossEncoder** for neural re-ranking:
- Bi-encoder (initial): Fast but less accurate (query/doc encoded separately)
- Cross-encoder (re-rank): Slower but accurate (query+doc processed together)

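**Comparison Query Boost:** For comparison queries, each document's cross-encoder score is multiplied by `1 + 0.1 × n`, where `n` is the number of distinct comparison keywords it contains (e.g. "compared to", "in contrast", "whereas", "unlike", "however"), i.e. +10% per keyword found.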
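
The boost arithmetic in isolation, as a sketch with a hypothetical `boost_scores` helper and the same keyword list as `_rerank_documents`: the first passage contains two distinct keywords ("unlike", "however"), so 0.50 × 1.2 = 0.60 overtakes the unboosted 0.52.

```python
from typing import List

# Same keyword list as _rerank_documents
COMPARISON_KEYWORDS = ['compared to', 'in contrast', 'difference', 'whereas', 'unlike', 'while', 'however']


def boost_scores(texts: List[str], scores: List[float], keywords=COMPARISON_KEYWORDS) -> List[float]:
    """Hypothetical helper: multiply each score by 1 + 0.1 * (distinct keywords present)."""
    boosted = []
    for text, score in zip(texts, scores):
        hits = sum(1 for kw in keywords if kw in text.lower())  # counts keywords, not occurrences
        boosted.append(score * (1 + 0.1 * hits))
    return boosted


texts = ["Unlike RNNs, transformers attend globally; however, cost grows quadratically.",
         "Transformers use self-attention."]
print(boost_scores(texts, [0.50, 0.52]))
```

A multiplicative boost only raises a score that is positive. This holds if the cross-encoder returns sigmoid-activated scores in (0, 1); with raw logits, a negative score multiplied by 1.2 would move down instead. Common words such as "while" and "however" also count as comparison cues, so some non-comparative passages get boosted too.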

---

**`_ensure_pdf_diversity(query, documents, target_docs=2) → List[Document]`**

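For multi-document queries, ensures the candidate pool covers at least `target_docs` distinct PDFs (default 2), adding up to `per_pdf` (default 3) chunks from each missing PDF: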

```
Problem: Query about "both papers" returns only Paper A chunks
         ↓
Solution: Detect missing PDFs → filtered vector search → add their chunks
         ↓
Result: [chunk_A1, chunk_A2, chunk_A3, chunk_B1, chunk_B2]
```
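
The selection logic inside `_ensure_pdf_diversity` can be sketched on plain IDs (`pdfs_to_backfill` is a hypothetical name). With the default `target_docs=2`, a pool drawn entirely from one PDF gets exactly one other PDF backfilled, not every missing one; each returned ID would then get an MMR search filtered by `{"pdf_id": ...}`.

```python
from typing import List


def pdfs_to_backfill(retrieved_pdf_ids: List[str], loaded_pdf_ids: List[str],
                     target_docs: int = 2) -> List[str]:
    """Hypothetical helper: which loaded PDFs need a filtered search to reach target_docs PDFs."""
    seen = {pid for pid in retrieved_pdf_ids if pid}
    if len(seen) >= target_docs:
        return []  # already diverse enough
    missing = [pid for pid in loaded_pdf_ids if pid not in seen]  # keeps load order
    return missing[:max(0, target_docs - len(seen))]


loaded = ["a1", "b2", "c3"]
print(pdfs_to_backfill(["a1", "a1", "a1"], loaded))                 # ['b2']
print(pdfs_to_backfill(["a1", "a1"], loaded, target_docs=3))        # ['b2', 'c3']
```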

---

### 7. Main Chat Pipeline

**`chat(query, use_hyde=True, use_multi_query=True) → (answer, citations, metadata)`**

```
┌─────────────────────────────────────────────────────────────────┐
│ 1. CACHE CHECK         │ Return immediately if query cached     │
├─────────────────────────────────────────────────────────────────┤
│ 2. CLASSIFY QUERY      │ → QueryProfile (type, k, style)        │
├─────────────────────────────────────────────────────────────────┤
│ 3. EXPAND QUERY        │ Generate 3 alternative phrasings       │
├─────────────────────────────────────────────────────────────────┤
│ 4. GENERATE HyDE       │ Create hypothetical answer document    │
├─────────────────────────────────────────────────────────────────┤
│ 5. RETRIEVE            │ For EACH query variant:                │
│                        │   • Vector search (MMR)                │
│                        │   • BM25 search                        │
│                        │   • RRF fusion                         │
├─────────────────────────────────────────────────────────────────┤
│ 6. GLOBAL RRF          │ Fuse results from all query variants   │
├─────────────────────────────────────────────────────────────────┤
│ 7. PDF DIVERSITY       │ Ensure chunks from all loaded PDFs     │
├─────────────────────────────────────────────────────────────────┤
│ 8. RERANK              │ CrossEncoder neural scoring → top k    │
├─────────────────────────────────────────────────────────────────┤
│ 9. BUILD CONTEXT       │ Format: "[Source 1]: chunk content..." │
├─────────────────────────────────────────────────────────────────┤
│ 10. LLM GENERATE       │ Answer with inline [Source X] citations│
│                        │ (Different prompts for comparison)     │
├─────────────────────────────────────────────────────────────────┤
│ 11. VERIFY (complex)   │ Self-check: direct? structured? If not │
│                        │ → regenerate improved answer           │
├─────────────────────────────────────────────────────────────────┤
│ 12. CACHE & RETURN     │ Store result, return (answer, cites,   │
│                        │ metadata)                              │
└─────────────────────────────────────────────────────────────────┘
```
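
Steps 9 and 10 hinge on a numbering convention the model must follow. The `chat()` implementation is not shown here, so the helpers below (`build_context`, `cited_sources`) are hypothetical: one formats chunks as `[Source N]: ...`, the other recovers which in-range source numbers an answer actually cites, which is what mapping citations back to chunk metadata needs.

```python
import re
from typing import List


def build_context(chunks: List[str]) -> str:
    """Hypothetical helper: number retrieved chunks as [Source 1], [Source 2], ..."""
    return "\n\n".join(f"[Source {i}]: {text}" for i, text in enumerate(chunks, 1))


def cited_sources(answer: str, n_sources: int) -> List[int]:
    """Hypothetical helper: sorted source numbers cited in the answer, ignoring out-of-range ones."""
    nums = {int(m) for m in re.findall(r"\[Source (\d+)\]", answer)}
    return sorted(n for n in nums if 1 <= n <= n_sources)


context = build_context(["alpha", "beta", "gamma"])
answer = "Attention is global [Source 2]. RNNs are sequential [Source 1][Source 7]."
print(cited_sources(answer, n_sources=3))  # [1, 2]
```

The regex only matches the exact `[Source N]` form; a combined marker such as `[Source 1, 2]` would need a looser pattern.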

---

### 8. Document Summarization

**`summarize_document(max_chunks=None) → (summary, metadata)`**

Uses **Map-Reduce** pattern:

```
MAP PHASE:
  Chunks [1-10]  → LLM → 3-5 bullet summary
  Chunks [11-20] → LLM → 3-5 bullet summary
  ...
  Chunks [n-m]   → LLM → 3-5 bullet summary

REDUCE PHASE:
  All batch summaries → LLM → Final structured summary:
    • Overview (2-3 sentences)
    • Main Topics (bullets)
    • Important Details (3-5 points)
    • Conclusion
```
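
A minimal map-reduce sketch under stated assumptions: batches of 10 chunks (matching the diagram's `[1-10]`, `[11-20]` grouping), a generic `llm` callable in place of the notebook's Groq client, and paraphrased prompts. `map_reduce_summary` is a hypothetical name, not the notebook's `summarize_document`.

```python
from typing import Callable, List


def map_reduce_summary(chunks: List[str], llm: Callable[[str], str], batch_size: int = 10) -> str:
    """Hypothetical sketch: summarize each batch (map), then synthesize the batch summaries (reduce)."""
    batch_summaries = []
    for start in range(0, len(chunks), batch_size):
        batch = "\n\n".join(chunks[start:start + batch_size])
        batch_summaries.append(llm(f"Summarize in 3-5 bullet points:\n\n{batch}"))  # MAP
    combined = "\n\n".join(batch_summaries)
    return llm("Write a structured summary with Overview, Main Topics, "
               f"Important Details and Conclusion sections:\n\n{combined}")  # REDUCE


calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for the real LLM: records each prompt and returns a placeholder."""
    calls.append(prompt)
    return f"summary {len(calls)}"


summary = map_reduce_summary([f"chunk {i}" for i in range(25)], fake_llm)
print(len(calls), summary)  # 4 summary 4  (three map calls for 10 + 10 + 5 chunks, one reduce)
```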

---

### Key Parameters Reference

| Parameter | Location | Default | Description |
|-----------|----------|---------|-------------|
| `k` | QueryProfile | 5-12 | Chunks to retrieve (auto-tuned by query type) |
| `fetch_factor` | _retrieve_with_rrf | 2 | Multiplier for initial retrieval pool |
| `lambda_mult` | MMR search | 0.6 | Diversity vs relevance (0=diverse, 1=relevant) |
| `similarity_threshold` | SemanticChunker | 0.5 | Cosine sim for chunk boundaries |
| `max_chunk_size` | SemanticChunker | 1000 | Max characters per chunk |
| `chunk_size` | TextSplitter | 800 | Fallback chunker size |
| `chunk_overlap` | TextSplitter | 150 | Character overlap between chunks |
| `max_size` | QueryCache | 100 | Maximum cached queries |


In [6]:
class EnhancedRAGv3:
    def __init__(self, api_key: str):
        self.vector_db = None
        self.bm25_retriever = None
        self.documents = []
        self.pdf_metadata = {}  # Track multiple PDFs
        self.cache = QueryCache()
        self.api_key = api_key
        self.is_initialized = False

        # Initialize LLM
        self.llm = ChatGroq(
            temperature=0,
            model_name="llama-3.3-70b-versatile",
            groq_api_key=api_key
        )

        # Models (loaded on demand)
        self.embedding_model = None
        self.cross_encoder = None
        self.semantic_chunker = None
        self.query_model = None

    def load_models(self, progress=gr.Progress()):
        """Load all models with progress tracking"""
        if self.is_initialized:
            return "Models already loaded."

        progress(0.1, desc="Loading BGE embeddings...")
        self.embedding_model = HuggingFaceEmbeddings(
            model_name="BAAI/bge-large-en-v1.5",
            model_kwargs={'device': device, 'trust_remote_code': True},
            encode_kwargs={'normalize_embeddings': True}
        )

        progress(0.4, desc="Loading Re-ranker...")
        self.cross_encoder = CrossEncoder('BAAI/bge-reranker-v2-m3', device=device)

        progress(0.6, desc="Loading Semantic Chunker...")
        self.semantic_chunker = SemanticChunker()

        progress(0.8, desc="Loading Query Model...")
        self.query_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2', device=device)

        progress(1.0, desc="Complete")
        self.is_initialized = True
        return "All models loaded successfully."

    def _classify_query(self, query: str) -> QueryProfile:
        classification_prompt = f"""You route user questions for a RAG system.
Return ONLY compact JSON with keys:
- query_type: one of [factoid, summary, comparison, extraction, reasoning]
- needs_multi_docs: true/false (set true when the query likely spans multiple documents or asks for differences)
- requires_comparison: true/false
- answer_style: one of [direct, bullets, steps]
- k: integer between 5 and 12 indicating how many chunks to retrieve

Question: {query}

JSON:"""

        def heuristic_profile() -> QueryProfile:
            ql = query.lower()
            requires_comparison = any(word in ql for word in ['compare', 'difference', 'versus', 'vs', 'between', 'across'])
            needs_multi = requires_comparison or any(word in ql for word in ['both', 'each document', 'all documents', 'across'])
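            # Check comparison first: "What is the difference between X and Y?" contains
            # "what is" and would otherwise be routed as a factoid with k=6
            if requires_comparison:
                qt = 'comparison'
                k_val = 12
                style = 'bullets'
            elif any(pattern in ql for pattern in ['what is', 'who is', 'when ', 'define', 'list ']):
                qt = 'factoid'
                k_val = 6
                style = 'direct'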
            elif any(word in ql for word in ['summarize', 'overview', 'key points', 'conclusion']):
                qt = 'summary'
                k_val = 10
                style = 'bullets'
            elif any(word in ql for word in ['explain', 'how does', 'process', 'steps', 'methodology']):
                qt = 'reasoning'
                k_val = 10
                style = 'steps'
            else:
                qt = 'extraction'
                k_val = 8
                style = 'direct'
            return QueryProfile(
                query_type=qt,
                intent=qt,
                needs_multi_docs=needs_multi,
                requires_comparison=requires_comparison,
                answer_style=style,
                k=k_val
            )

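        try:
            response = self.llm.invoke(classification_prompt)
            raw = response.content.strip()
            # Models often wrap JSON in markdown fences or add prose; parse only the {...} span
            start, end = raw.find('{'), raw.rfind('}')
            data = json.loads(raw[start:end + 1] if start != -1 and end > start else raw)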
            qt = str(data.get('query_type', 'extraction')).lower()
            needs_multi = bool(data.get('needs_multi_docs', False))
            requires_comparison = bool(data.get('requires_comparison', False))
            style = str(data.get('answer_style', 'direct')).lower()
            k_val = int(data.get('k', 8))
            k_val = max(5, min(k_val, 12))
            if qt not in ['factoid', 'summary', 'comparison', 'extraction', 'reasoning']:
                qt = 'extraction'
            if style not in ['direct', 'bullets', 'steps']:
                style = 'direct'
            return QueryProfile(
                query_type=qt,
                intent=qt,
                needs_multi_docs=needs_multi or requires_comparison,
                requires_comparison=requires_comparison or qt == 'comparison',
                answer_style=style,
                k=k_val
            )
        except Exception:
            return heuristic_profile()

    def _generate_hyde_document(self, query: str) -> str:
        hyde_prompt = f"""Generate a detailed, factual paragraph that would answer this question:

Question: {query}

Write a comprehensive answer (2-3 sentences) as if from an expert document:"""
        try:
            response = self.llm.invoke(hyde_prompt)
            return response.content
        except Exception:
            return query

    def _expand_query(self, query: str) -> List[str]:
        expansion_prompt = f"""Generate 3 different versions of this question to retrieve relevant documents:

Original Question: {query}

Generate 3 alternative phrasings (one per line):"""
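        try:
            response = self.llm.invoke(expansion_prompt)
            queries = response.content.strip().split('\n')
            # Strip list numbering/bullets, then skip preamble lines such as "Here are 3 phrasings:"
            queries = [q.strip().lstrip('1234567890.-) ') for q in queries if q.strip()]
            queries = [q for q in queries if q and not q.endswith(':')]
            return [query] + queries[:3]
        except Exception:
            return [query]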

    def _adaptive_retrieve(self, query: str, query_type: str) -> int:
        k_map = {'factoid': 5, 'medium': 8, 'complex': 12}
        return k_map.get(query_type, 8)

    def ingest_pdf(self, pdf_path: str, use_semantic_chunking=True, progress=gr.Progress()):
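        """Ingest PDF with progress tracking - supports multiple PDFs"""
        if not self.is_initialized:
            # The embedding model and semantic chunker are None until load_models() runs
            return "Error: Models are not loaded. Run load_models() first."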
        progress(0.1, desc=f"Loading PDF: {os.path.basename(pdf_path)}...")

        # Check if already loaded
        pdf_name = os.path.basename(pdf_path)
        if pdf_name in self.pdf_metadata:
            return f"Notice: Document '{pdf_name}' is already loaded."

        loader = PyPDFLoader(pdf_path)
        docs = loader.load()

        progress(0.3, desc=f"Loaded {len(docs)} pages. Chunking...")

        # Add unique PDF identifier to metadata
        pdf_id = hashlib.md5(pdf_path.encode()).hexdigest()[:8]
        chunk_counter = len(self.documents)

        if use_semantic_chunking:
            splits = []
            for i, doc in enumerate(docs):
                progress(0.3 + (0.3 * i / len(docs)), desc=f"Semantic chunking page {i+1}/{len(docs)}...")
                semantic_chunks = self.semantic_chunker.chunk_document(doc.page_content)
                for chunk in semantic_chunks:
                    chunk_counter += 1
                    splits.append(Document(
                        page_content=chunk,
                        metadata={
                            'page': doc.metadata.get('page', 0),
                            'source': pdf_path,
                            'pdf_name': pdf_name,
                            'pdf_id': pdf_id,
                            'chunk_id': f"{pdf_id}-{chunk_counter}"
                        }
                    ))
        else:
            text_splitter = RecursiveCharacterTextSplitter(
                chunk_size=800,
                chunk_overlap=150,
                separators=["\n\n", "\n", ". ", " ", ""],
                length_function=len
            )
            splits = text_splitter.split_documents(docs)
            # Add PDF metadata
            for split in splits:
                chunk_counter += 1
                split.metadata['pdf_name'] = pdf_name
                split.metadata['pdf_id'] = pdf_id
                split.metadata['chunk_id'] = f"{pdf_id}-{chunk_counter}"

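        if not splits:
            return f"Notice: No extractable text found in '{pdf_name}'."

        # Add to existing documents
        self.documents.extend(splits)

        # Track PDF metadata
        total_pages = max([doc.metadata.get('page', 0) for doc in docs]) + 1
        self.pdf_metadata[pdf_name] = {
            'path': pdf_path,
            'pages': total_pages,
            'chunks': len(splits),
            'pdf_id': pdf_id,
            'added': datetime.now().strftime("%Y-%m-%d %H:%M")
        }

        progress(0.7, desc=f"Rebuilding Vector Index ({len(self.documents)} total chunks)...")

        # The default in-memory Chroma client is shared within the process, so re-creating
        # "rag_gradio_v3" would append to the old collection and duplicate earlier chunks.
        # Drop it first, then rebuild from all documents.
        if self.vector_db is not None:
            self.vector_db.delete_collection()

        # Rebuild vector DB with all documents
        self.vector_db = Chroma.from_documents(
            documents=self.documents,
            embedding=self.embedding_model,
            collection_name="rag_gradio_v3"
        )

        progress(0.9, desc="Rebuilding Keyword Index...")
        self.bm25_retriever = BM25Retriever.from_documents(self.documents)

        # Answers cached before this PDF was added may now be incomplete
        self.cache = QueryCache()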

        progress(1.0, desc="Complete")

        return f"""**Document Added Successfully**

**File:** {pdf_name}
**Pages:** {total_pages}
**Chunks:** {len(splits)}

**Total Collection:**
- {len(self.pdf_metadata)} document(s)
- {len(self.documents)} total chunks

Ready to answer questions."""

    def get_loaded_pdfs(self) -> str:
        """Return formatted list of loaded PDFs"""
        if not self.pdf_metadata:
            return "No documents loaded yet."

        output = "## Loaded Documents\n\n"
        for idx, (name, info) in enumerate(self.pdf_metadata.items(), 1):
            output += f"**{idx}. {name}**\n"
            output += f"   - Pages: {info['pages']} | Chunks: {info['chunks']}\n"
            output += f"   - Added: {info['added']}\n\n"

        output += f"**Total:** {len(self.pdf_metadata)} document(s), {len(self.documents)} chunks"
        return output

    def clear_all_documents(self):
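        """Clear all loaded documents"""
        if self.vector_db is not None:
            # Also drop the Chroma collection; otherwise its chunks persist in process memory
            self.vector_db.delete_collection()
        self.documents = []
        self.pdf_metadata = {}
        self.vector_db = None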
        self.bm25_retriever = None
        self.cache = QueryCache()  # Clear cache too
        return "All documents cleared."

    def _retrieve_with_rrf(self, query: str, k: int = 5, fetch_factor: int = 2) -> List[Document]:
        fetch_k = max(k * fetch_factor, k)
        vector_docs = self.vector_db.as_retriever(
            search_type="mmr",
            search_kwargs={"k": fetch_k, "fetch_k": fetch_k * 2, "lambda_mult": 0.6}
        ).invoke(query)
        self.bm25_retriever.k = fetch_k
        keyword_docs = self.bm25_retriever.invoke(query)
        fused_docs = ReciprocalRankFusion.fuse([vector_docs, keyword_docs])
        return fused_docs[:fetch_k]

    def _rerank_documents(self, query: str, documents: List[Document], top_k: int = 5, force_comparison: bool = False) -> List[Tuple[Document, float]]:
        if not documents:
            return []

        # For comparison queries, boost documents that likely contain comparative info
        is_comparison = force_comparison or any(word in query.lower() for word in ['compare', 'difference', 'differ', 'versus', 'vs'])

        pairs = [[query, doc.page_content] for doc in documents]
        scores = self.cross_encoder.predict(pairs)

        # Boost scores for docs that contain comparison keywords
        if is_comparison:
            comparison_keywords = ['compared to', 'in contrast', 'difference', 'whereas', 'unlike', 'while', 'however']
            for i, doc in enumerate(documents):
                content_lower = doc.page_content.lower()
                keyword_count = sum(1 for kw in comparison_keywords if kw in content_lower)
                if keyword_count > 0:
                    scores[i] *= (1 + 0.1 * keyword_count)  # Boost by 10% per keyword

        scored_docs = list(zip(documents, scores))
        scored_docs.sort(key=lambda x: x[1], reverse=True)
        return scored_docs[:top_k]

    def _dedupe_documents(self, documents: List[Document]) -> List[Document]:
        deduped = []
        seen = set()
        for doc in documents:
            key = doc.metadata.get('chunk_id') or f"{doc.metadata.get('pdf_id', 'unknown')}::{hashlib.md5(doc.page_content.encode()).hexdigest()}"
            if key in seen:
                continue
            seen.add(key)
            deduped.append(doc)
        return deduped

    def _ensure_pdf_diversity(self, query: str, documents: List[Document], target_docs: int = 2, per_pdf: int = 3) -> List[Document]:
        if not documents or not self.pdf_metadata:
            return documents

        seen_ids = set(doc.metadata.get('pdf_id') for doc in documents if doc.metadata.get('pdf_id'))
        if len(seen_ids) >= target_docs:
            return documents

        missing_ids = [info['pdf_id'] for info in self.pdf_metadata.values() if info['pdf_id'] not in seen_ids]
        extra_docs = []
        for pdf_id in missing_ids[:max(0, target_docs - len(seen_ids))]:
            filtered_docs = self.vector_db.as_retriever(
                search_type="mmr",
                search_kwargs={
                    "k": per_pdf,
                    "fetch_k": per_pdf * 2,
                    "lambda_mult": 0.6,
                    "filter": {"pdf_id": pdf_id}
                }
            ).invoke(query)
            extra_docs.extend(filtered_docs)

        combined = documents + extra_docs
        return self._dedupe_documents(combined)

    def _create_citation_card(self, idx: int, doc: Document, score: float) -> str:
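        """Create a formatted citation card"""
        from html import escape  # stdlib; keeps '<' or '&' in PDF text from breaking the card's HTML

        page = doc.metadata.get('page', 'Unknown')
        pdf_name = escape(str(doc.metadata.get('pdf_name', 'Unknown Document')))

        # Get snippet (first 200 chars), HTML-escaped for the gr.HTML output
        content = doc.page_content
        snippet = escape(content[:200] + "..." if len(content) > 200 else content)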

        # Relevance label based on score
        if score > 0.7:
            relevance = "High"
        elif score > 0.5:
            relevance = "Medium"
        else:
            relevance = "Low"

        card = f"""
<details style="margin-bottom: 8px; padding: 8px; border: 1px solid #ddd; border-radius: 4px;">
<summary style="cursor: pointer;"><b>[{idx}]</b> {pdf_name} — Page {page} | Relevance: {relevance} ({score:.2f})</summary>
<blockquote style="margin-top: 8px; padding: 8px; background: #f9f9f9; border-left: 3px solid #ccc;">{snippet}</blockquote>
</details>
"""
        return card

    def chat(self, query: str, use_hyde: bool = True, use_multi_query: bool = True, progress=gr.Progress()):
        """Enhanced chat with better answers and citations"""
        if not self.vector_db:
            return "Please upload at least one document first.", "", ""

        # Check cache
        cached_response = self.cache.get(query)
        if cached_response:
            return f"*Retrieved from cache*\n\n{cached_response}", "", "Cached result"

        progress(0.1, desc="Classifying query...")
        profile = self._classify_query(query)
        k = profile.k

        base_queries = [query]

        if use_multi_query:
            progress(0.22, desc="Expanding query variants...")
            expanded_queries = self._expand_query(query)
            base_queries.extend(expanded_queries[:2])

        if use_hyde:
            progress(0.32, desc="Generating HyDE document...")
            hyde_doc = self._generate_hyde_document(query)
            base_queries.append(hyde_doc)

        progress(0.45, desc="Retrieving candidates (MMR + BM25)...")
        retrieval_results = []
        for bq in base_queries:
            retrieval_results.append(self._retrieve_with_rrf(bq, k=k, fetch_factor=2))

        fused_docs = ReciprocalRankFusion.fuse(retrieval_results)
        fused_docs = self._dedupe_documents(fused_docs)[:max(k * 3, k)]

        if profile.needs_multi_docs and len(self.pdf_metadata) > 1:
            fused_docs = self._ensure_pdf_diversity(
                query,
                fused_docs,
                target_docs=min(3, len(self.pdf_metadata)),
                per_pdf=max(2, k // 3)
            )

        progress(0.7, desc="Re-ranking with CrossEncoder...")
        reranked_docs = self._rerank_documents(query, fused_docs, top_k=max(5, k), force_comparison=profile.requires_comparison)

        progress(0.8, desc="Building context...")

        # Build context with inline citations
        context_parts = []
        citation_cards = []

        for idx, (doc, score) in enumerate(reranked_docs, 1):
            page = doc.metadata.get('page', 'Unknown')
            pdf_name = doc.metadata.get('pdf_name', 'Unknown')

            # Add to context, labelled with document and page so the LLM can attribute each claim
            context_parts.append(f"[Source {idx}] ({pdf_name}, page {page}): {doc.page_content}\n")

            # Create citation card
            citation_cards.append(self._create_citation_card(idx, doc, score))

        context_str = "\n".join(context_parts)

        # Enhanced prompt for better answers
        is_comparison = profile.requires_comparison
        style_hint = ""
        if profile.answer_style == 'bullets':
            style_hint = "Use concise bullet points."
        elif profile.answer_style == 'steps':
            style_hint = "Use numbered steps when explaining processes."

        style_instruction = style_hint or "Keep structure aligned to the question type."

        if is_comparison:
            prompt = f"""You are an expert AI assistant analyzing academic/technical documents. Answer this COMPARISON question with precision and structure.

## QUERY TYPE: COMPARISON

## CRITICAL INSTRUCTIONS:
1. **Start with a direct comparison statement** - Don't give background first
2. **Use a structured format:**
   - Brief 1-2 sentence overview of what's being compared
   - Bullet points listing specific differences
   - Each bullet should be concrete and factual
3. **Be specific with numbers, names, and technical details** from the sources
4. **Cite sources** [Source X] after each factual claim
5. **If sources lack comparison info**, explicitly state: "The provided sources do not contain direct comparison information on [aspect]. Based on what's available: [answer what you can]"

## CONTEXT FROM DOCUMENTS:
{context_str}

## COMPARISON QUESTION:
{query}

## STRUCTURED COMPARISON ANSWER:
"""
        else:
            prompt = f"""You are an expert AI assistant analyzing academic/technical documents. Your goal is to provide accurate, well-structured, and comprehensive answers.

## QUERY TYPE: {profile.query_type.upper()}

## INSTRUCTIONS:
1. **Answer the question directly in the first sentence** - Don't start with background
2. **Use inline citations** [Source X] immediately after each claim or fact
3. **Structure your answer clearly:**
   - For factoid queries: Direct answer (2-3 sentences) with supporting details
   - For complex queries: Organized explanation with bullet points or numbered lists
   - For "explain" queries: Start with simple definition, then elaborate
4. **Be comprehensive but concise** - NO repetition or filler words
5. **Use specific facts**: numbers, names, technical terms from sources
6. **If information is insufficient**, state: "The sources provided do not fully address [aspect]. Based on available information: [what you can answer]"
7. {style_instruction}

## CONTEXT FROM DOCUMENTS:
{context_str}

## QUESTION:
{query}

## YOUR ANSWER:
"""

        progress(0.9, desc="Generating enhanced answer...")
        try:
            response = self.llm.invoke(prompt)
            answer = response.content

            # Add verification step for complex queries and comparisons
            if profile.query_type in ['summary', 'comparison', 'reasoning'] or is_comparison:
                verify_prompt = f"""Review the answer below to a {profile.query_type} query and check it against the evaluation criteria.

Question: {query}

Sources:
{context_str}

Answer: {answer}

**Evaluation Criteria:**
1. **Directness**: Does it answer the question in the first sentence?
2. **Structure**: Is it well-organized with bullet points for complex info?
3. **Specificity**: Does it use concrete facts/numbers from sources?
4. **Completeness**: Does it address all parts of the question?
5. **No fluff**: Is it concise without repetition?

If the answer has issues, provide an IMPROVED VERSION following this format:
- Start with direct answer
- Use bullet points for lists/comparisons
- Include specific facts with citations
- Be concise

If it's already good, respond with only: "VERIFIED"

Your response:"""

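                verify_response = self.llm.invoke(verify_prompt)
                verdict = verify_response.content.strip()
                # A short reply containing VERIFIED keeps the original answer; a long reply is a
                # rewrite even if it happens to mention the word "verified"
                is_verified = "VERIFIED" in verdict.upper() and len(verdict.split()) <= 10
                if not is_verified:
                    improved = verdict
                    if "IMPROVED VERSION" in improved.upper() or "Here" in improved[:50]:
                        # Skip preamble lines (e.g. "Here is the improved version:") until the answer starts
                        preamble_prefixes = ('**', 'if', 'your', 'the answer', 'here', 'improved version')
                        answer_lines = []
                        started = False
                        for line in improved.split('\n'):
                            stripped = line.strip()
                            if started or (stripped and not stripped.lower().startswith(preamble_prefixes)):
                                started = True
                                answer_lines.append(line)
                        if answer_lines:
                            answer = '\n'.join(answer_lines)
                    else:
                        answer = improved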

            self.cache.set(query, answer)

            # Format citations
            citations_html = "\n".join(citation_cards)

            # Metadata
            metadata = f"""**Query Type:** {profile.query_type.title()} | **Multi-Doc:** {"Yes" if profile.needs_multi_docs else "No"} | **Sources Used:** {len(reranked_docs)} | **Documents Searched:** {len(self.pdf_metadata)}"""

            progress(1.0, desc="Complete")
            return answer, citations_html, metadata

        except Exception as e:
            return f"Error: {str(e)}", "", ""

    def summarize_document(self, max_chunks: int = None, progress=gr.Progress()):
        """Generate document summary"""
        if not self.documents:
            return "No document loaded.", ""

        chunks_to_process = self.documents[:max_chunks] if max_chunks else self.documents
        total_chunks = len(chunks_to_process)

        progress(0.1, desc=f"Processing {total_chunks} chunks...")

        chunk_summaries = []
        batch_size = 10

        for i in range(0, total_chunks, batch_size):
            batch = chunks_to_process[i:i+batch_size]
            batch_text = "\n\n---\n\n".join([doc.page_content for doc in batch])

            progress(0.1 + (0.6 * i / total_chunks), desc=f"Summarizing chunks {i+1}-{min(i+batch_size, total_chunks)}...")

            map_prompt = f"""Summarize the key points from this document section in 3-5 bullet points:

{batch_text}

Key Points:"""

            try:
                response = self.llm.invoke(map_prompt)
                chunk_summaries.append(response.content)
            except Exception:
                # Skip a failed batch (e.g. an API error) instead of aborting the whole summary
                continue

        progress(0.8, desc="Synthesizing final summary...")

        combined_summaries = "\n\n".join(chunk_summaries)

        reduce_prompt = f"""You are summarizing documents. Below are summaries of different sections.

Create a comprehensive, well-structured summary that includes:

1. **Overview**: What are these documents about? (2-3 sentences)
2. **Main Topics**: Key themes and subjects covered (bullet points)
3. **Important Details**: Critical information, findings, or arguments (3-5 points)
4. **Conclusion**: Overall takeaway or significance

Section Summaries:
{combined_summaries}

## COMPREHENSIVE SUMMARY:"""

        try:
            final_response = self.llm.invoke(reduce_prompt)
            summary = final_response.content

            # Build metadata
            metadata = f"""## Summary Statistics

**Documents Analyzed:** {len(self.pdf_metadata)}
**Total Chunks:** {total_chunks}
**Total Pages:** {sum(info['pages'] for info in self.pdf_metadata.values())}

### Documents Included:
"""
            for name, info in self.pdf_metadata.items():
                metadata += f"- **{name}** ({info['pages']} pages)\n"

            progress(1.0, desc="Complete")
            return summary, metadata

        except Exception as e:
            return f"Error: {str(e)}", ""
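
Both `_retrieve_with_rrf` and `chat` delegate list merging to `ReciprocalRankFusion.fuse`, which is defined in an earlier cell and not shown here. As a rough sketch of what reciprocal rank fusion does, assuming the common formulation where each item scores 1/(k + rank) per list with k = 60, and identifying documents by a string ID (both assumptions; the notebook's class may use a different constant or key), a stdlib-only version looks like this. `rrf_fuse` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists with reciprocal rank fusion (hypothetical helper).

    Each item earns 1 / (k + rank) from every list it appears in (rank is 1-based);
    items are returned by descending total score, ties broken by first-seen order.
    """
    scores: Dict[str, float] = defaultdict(float)
    first_seen: Dict[str, int] = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
            first_seen.setdefault(item, len(first_seen))
    return sorted(scores, key=lambda item: (-scores[item], first_seen[item]))


# Hypothetical chunk IDs returned by the MMR (vector) and BM25 retrievers
vector_hits = ["c3", "c1", "c7"]
bm25_hits = ["c1", "c9", "c3"]
print(rrf_fuse([vector_hits, bm25_hits]))  # ['c1', 'c3', 'c9', 'c7']
```

Chunks found by both retrievers (`c1`, `c3`) outrank chunks found by only one, which is what makes the hybrid setup tolerant of either retriever missing a relevant passage. Because only ranks are used, the incompatible raw scores of cosine similarity and BM25 never need to be normalized.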

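For comparison queries, `_rerank_documents` multiplies each cross-encoder score by `1 + 0.1 * n`, where `n` is the number of comparison phrases found in the chunk, and `_create_citation_card` then buckets the (possibly boosted) score into High, Medium, or Low. The sketch below isolates that arithmetic; the keyword list and thresholds are copied from the code above, while the helper names and sample scores are hypothetical. Assuming the reranker's scores lie in [0, 1], a boosted score can still exceed 1.0.

```python
from typing import List

# Copied from _rerank_documents above
COMPARISON_KEYWORDS = ['compared to', 'in contrast', 'difference', 'whereas', 'unlike', 'while', 'however']


def boost_score(score: float, text: str, keywords: List[str] = COMPARISON_KEYWORDS) -> float:
    """Multiplicative 10%-per-keyword boost, as applied to comparison queries."""
    content_lower = text.lower()
    keyword_count = sum(1 for kw in keywords if kw in content_lower)
    return score * (1 + 0.1 * keyword_count)


def relevance_label(score: float) -> str:
    """Same thresholds as _create_citation_card."""
    if score > 0.7:
        return "High"
    if score > 0.5:
        return "Medium"
    return "Low"


# Hypothetical chunk containing two comparison phrases ("unlike", "whereas")
chunk = "Unlike RNNs, the Transformer processes tokens in parallel, whereas RNNs are sequential."
raw = 0.65
boosted = boost_score(raw, chunk)
print(relevance_label(raw), relevance_label(boosted), round(boosted, 3))  # Medium High 0.78
```

Because matching is plain substring search, words such as "meanwhile" also count toward the boost (they contain "while"), so the label shown in a citation card can reflect keyword hits as well as cross-encoder relevance.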

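`summarize_document` is a map-reduce summary: the map step sends chunks to the LLM in batches of 10, and the reduce step merges the per-batch bullet lists into one summary. The hypothetical helper below reproduces only the batching and progress arithmetic (no LLM calls), which can help estimate how many Groq requests a document set will trigger.

```python
from typing import List, Tuple


def plan_batches(total_chunks: int, batch_size: int = 10) -> List[Tuple[int, int, float]]:
    """Return (first_chunk, last_chunk, progress) for each map-step batch.

    Chunk numbers are 1-based and inclusive, matching the progress message in
    summarize_document; progress uses the same 0.1 + 0.6 * i / total formula.
    """
    plan = []
    for i in range(0, total_chunks, batch_size):
        last = min(i + batch_size, total_chunks)
        plan.append((i + 1, last, round(0.1 + 0.6 * i / total_chunks, 3)))
    return plan


batches = plan_batches(25)
print(batches)  # [(1, 10, 0.1), (11, 20, 0.34), (21, 25, 0.58)]
print(len(batches) + 1, "LLM calls including the reduce step")  # 4 LLM calls including the reduce step
```

With 10 chunks per call, a 1,000-chunk library means about 100 map calls before the reduce step, so rate limits rather than model latency are likely to dominate summary time for large collections.
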
## Gradio Web Interface

The `create_interface()` function creates a 4-tab web UI:

| Tab | Purpose | Key Actions |
|-----|---------|-------------|
| **Setup** | Initialize system | Enter API key → Load models → Upload PDFs |
| **Chat** | Q&A interface | Ask questions with HyDE/Multi-Query options |
| **Summarize** | Document summary | Generate map-reduce summary of all docs |
| **Help** | Documentation | Usage guide and example questions |
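
Rather than a true module-level global, `create_interface()` keeps the RAG system in a local variable that the nested handlers share as a closure: `initialize_system` rebinds it with `nonlocal`, and the other handlers only read it and return one value per Gradio output component. The Gradio-free sketch below shows the same pattern with a stand-in class; `FakeRAG` and `make_handlers` are hypothetical names.

```python
from typing import Callable, Dict, Optional, Tuple


class FakeRAG:
    """Stand-in for EnhancedRAGv3 so the sketch runs without models or an API key."""

    def __init__(self, api_key: str):
        self.api_key = api_key

    def chat(self, query: str) -> Tuple[str, str, str]:
        return f"answer to: {query}", "", "metadata"


def make_handlers() -> Dict[str, Callable]:
    rag_system: Optional[FakeRAG] = None  # shared by the nested handlers below

    def initialize_system(api_key: str) -> Tuple[str, str]:
        nonlocal rag_system  # required because this handler rebinds the shared variable
        if not api_key:
            return "Please enter your Groq API key.", ""
        rag_system = FakeRAG(api_key)
        return "System initialized.", ""

    def ask_question(query: str) -> Tuple[str, str, str]:
        # Reading the shared variable needs no nonlocal declaration
        if rag_system is None:
            return "Please initialize the system first.", "", ""
        if not query.strip():
            return "Please enter a question.", "", ""
        return rag_system.chat(query)

    return {"init": initialize_system, "ask": ask_question}


handlers = make_handlers()
print(handlers["ask"]("What is RRF?")[0])  # Please initialize the system first.
handlers["init"]("dummy-key")
print(handlers["ask"]("What is RRF?")[0])  # answer to: What is RRF?
```

One consequence of this design is that every visitor to the public share link talks to the same system and the same document library; per-session isolation would need something like `gr.State` instead of a closure variable.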


In [7]:
def create_interface():
    """Create enhanced Gradio interface"""

    # RAG instance shared by the nested handlers below via closure (one per app, shared by all users)
    rag_system = None

    def initialize_system(api_key):
        nonlocal rag_system
        if not api_key:
            return "Please enter your Groq API key.", ""
        try:
            rag_system = EnhancedRAGv3(api_key)
            status = rag_system.load_models()
            return status, ""
        except Exception as e:
            return f"Error: {str(e)}", ""

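    def upload_and_process(file, use_semantic):
        if rag_system is None or not rag_system.is_initialized:
            return "Please initialize the system first.", ""
        if file is None:
            return "Please select a PDF file first.", rag_system.get_loaded_pdfs()
        try:
            # Depending on the Gradio version, gr.File passes a path string or a tempfile-like object with .name
            file_path = getattr(file, "name", file)
            status = rag_system.ingest_pdf(file_path, use_semantic_chunking=use_semantic)
            loaded_pdfs = rag_system.get_loaded_pdfs()
            return status, loaded_pdfs
        except Exception as e:
            return f"Error: {str(e)}", ""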

    def ask_question(query, use_hyde, use_multi_query, progress=gr.Progress()):
        if rag_system is None:
            return "Please initialize the system first.", "", ""
        if not query or not query.strip():
            return "Please enter a question.", "", ""
        try:
            # Gradio tracks the gr.Progress() parameter of the event handler itself, so pass it through to chat()
            answer, citations, metadata = rag_system.chat(
                query, use_hyde=use_hyde, use_multi_query=use_multi_query, progress=progress
            )
            return answer, citations, metadata
        except Exception as e:
            return f"Error: {str(e)}", "", ""

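    def summarize_doc(progress=gr.Progress()):
        if rag_system is None:
            return "Please initialize the system first.", ""
        try:
            # Pass the handler's tracked progress object through so the progress bar updates
            summary, metadata = rag_system.summarize_document(progress=progress)
            return summary, metadata
        except Exception as e:
            return f"Error: {str(e)}", ""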

    def clear_docs():
        if rag_system is None:
            return "No system initialized", ""
        status = rag_system.clear_all_documents()
        return status, ""

    def get_pdf_list():
        if rag_system is None:
            return "No system initialized"
        return rag_system.get_loaded_pdfs()

    # Create Gradio Blocks interface
    with gr.Blocks(
        title="Multi-PDF RAG System",
        theme=gr.themes.Base(),
        css="""
            .gradio-container { max-width: 1200px; margin: auto; }
            h1 { font-weight: 600; }
            .prose { font-size: 14px; }
        """
    ) as app:
        gr.Markdown("""
        # Multi-Document RAG System
        **Advanced Document Q&A with Multiple PDF Support** — Powered by Llama 3.3 70B

        Multi-document support | Enhanced citations | Verification system
        """)

        with gr.Tab("Setup"):
            gr.Markdown("### Step 1: Initialize System")
            with gr.Row():
                api_key_input = gr.Textbox(
                    label="Groq API Key",
                    type="password",
                    placeholder="Enter your Groq API key",
                    scale=3
                )
                init_btn = gr.Button("Initialize", variant="primary", scale=1)
            init_status = gr.Textbox(label="Status", interactive=False)

            gr.Markdown("### Step 2: Upload Documents")
            gr.Markdown("*Multiple PDFs supported — each document will be added to the knowledge base.*")

            with gr.Row():
                with gr.Column(scale=2):
                    file_input = gr.File(label="Select PDF", file_types=[".pdf"])
                    semantic_check = gr.Checkbox(label="Use Semantic Chunking (Recommended)", value=True)
                    with gr.Row():
                        upload_btn = gr.Button("Add Document", variant="primary", scale=2)
                        clear_btn = gr.Button("Clear All", variant="stop", scale=1)
                    upload_status = gr.Markdown()

                with gr.Column(scale=1):
                    gr.Markdown("#### Document Library")
                    loaded_pdfs_display = gr.Markdown("No documents loaded yet.")
                    refresh_btn = gr.Button("Refresh", size="sm")

            init_btn.click(initialize_system, inputs=[api_key_input], outputs=[init_status, loaded_pdfs_display])
            upload_btn.click(upload_and_process, inputs=[file_input, semantic_check], outputs=[upload_status, loaded_pdfs_display])
            clear_btn.click(clear_docs, outputs=[upload_status, loaded_pdfs_display])
            refresh_btn.click(get_pdf_list, outputs=[loaded_pdfs_display])

        with gr.Tab("Chat"):
            gr.Markdown("### Ask Questions About Your Documents")

            with gr.Row():
                with gr.Column(scale=3):
                    query_input = gr.Textbox(
                        label="Your Question",
                        placeholder="What are the key conclusions of these papers?",
                        lines=3
                    )
                    with gr.Row():
                        hyde_check = gr.Checkbox(label="HyDE (Better Retrieval)", value=True)
                        multi_query_check = gr.Checkbox(label="Multi-Query (More Comprehensive)", value=True)
                    ask_btn = gr.Button("Submit", variant="primary", size="lg")

                with gr.Column(scale=1):
                    gr.Markdown("""
                    #### Tips
                    - Be specific in your questions
                    - HyDE improves retrieval quality
                    - Multi-Query finds more context
                    - Questions search across all loaded documents
                    """)

            metadata_output = gr.Markdown(label="Query Info")
            answer_output = gr.Markdown(label="Answer")

            gr.Markdown("#### Sources & Citations")
            sources_output = gr.HTML(label="Sources")

            ask_btn.click(
                ask_question,
                inputs=[query_input, hyde_check, multi_query_check],
                outputs=[answer_output, sources_output, metadata_output]
            )

            gr.Examples(
                examples=[
                    "What are the main findings?",
                    "Explain the methodology in detail",
                    "What are the key conclusions?",
                    "Compare the approaches discussed",
                    "What are the limitations mentioned?",
                    "Summarize the most important contributions"
                ],
                inputs=query_input
            )

        with gr.Tab("Summarize"):
            gr.Markdown("### Generate Comprehensive Summary")
            gr.Markdown("Analyzes all loaded documents and creates a unified summary.")

            summarize_btn = gr.Button("Generate Summary", variant="primary", size="lg")
            summary_metadata = gr.Markdown()
            summary_output = gr.Markdown()

            summarize_btn.click(summarize_doc, outputs=[summary_output, summary_metadata])

        with gr.Tab("Help"):
            gr.Markdown("""
            ## How to Use This System

            ### Quick Start
            1. **Setup Tab**: Enter Groq API key and click "Initialize"
            2. **Setup Tab**: Upload PDF(s) and click "Add Document" (repeat for multiple files)
            3. **Chat Tab**: Ask questions about your documents
            4. **Summarize Tab**: Get unified summary of all documents

            ---

            ## Example PDFs for Testing

            Download these PDFs to test the system:

            ### Academic Research Papers
            1. **Attention is All You Need** (Transformer Paper)
               - URL: `https://arxiv.org/pdf/1706.03762.pdf`
               - Great for: Testing technical Q&A

            2. **BERT: Pre-training of Deep Bidirectional Transformers**
               - URL: `https://arxiv.org/pdf/1810.04805.pdf`
               - Great for: Multi-paper comparison

            3. **GPT-3 Paper** (Language Models are Few-Shot Learners)
               - URL: `https://arxiv.org/pdf/2005.14165.pdf`
               - Great for: Complex methodology questions

            ### Business & Finance
            4. **Tesla Q3 2024 Earnings Report**
               - URL: `https://digitalassets.tesla.com/tesla-contents/image/upload/IR/TSLA-Q3-2024-Update`
               - Great for: Financial analysis questions

            5. **World Bank Annual Report**
               - URL: `https://thedocs.worldbank.org/en/doc/9a8210d538854d29883cf3a19e66a3e2-0350012021/original/WBG-Annual-Report-2021-EN.pdf`
               - Great for: Economic data queries

            ### Technical Documentation
            6. **Python Documentation (any topic)**
               - URL: `https://docs.python.org/3/download.html` (archived PDFs of the official docs)
               - Great for: Technical Q&A

            ### Medical/Scientific
            7. **WHO COVID-19 Reports**
               - URL: `https://www.who.int/publications` (download any PDF)
               - Great for: Health information extraction

            ---

            ## Real-Life Use Case Questions

            ### Research Paper Analysis (Attention Paper)
            ```
            1. "What is the main innovation proposed in this paper?"
            2. "Explain the self-attention mechanism in detail"
            3. "How does the Transformer compare to RNN-based models?"
            4. "What are the computational complexity advantages?"
            5. "What datasets were used for evaluation?"
            6. "What are the key results on machine translation tasks?"
            7. "Describe the encoder-decoder architecture"
            8. "What are the limitations mentioned by the authors?"
            ```

            ### Business/Finance Analysis (Earnings Reports)
            ```
            1. "What was the total revenue this quarter?"
            2. "How did the company perform compared to last quarter?"
            3. "What are the main growth drivers mentioned?"
            4. "Summarize the key financial metrics"
            5. "What guidance did management provide?"
            6. "What are the main risks discussed?"
            7. "How much cash does the company have?"
            8. "What investments are being made in R&D?"
            ```

            ### Multi-Paper Comparison (Upload BERT + GPT-3)
            ```
            1. "Compare the pre-training objectives of these models"
            2. "What are the key differences in architecture?"
            3. "Which model performs better on which tasks?"
            4. "How do the training datasets differ?"
            5. "What are the main innovations in each paper?"
            6. "Compare the computational requirements"
            ```

            ### Policy/Report Analysis (World Bank Report)
            ```
            1. "What are the main economic trends discussed?"
            2. "Which regions showed the strongest growth?"
            3. "What interventions are recommended?"
            4. "Summarize the poverty reduction initiatives"
            5. "What are the key challenges identified?"
            6. "What metrics are used to measure success?"
            ```

            ### Medical Literature (WHO Reports)
            ```
            1. "What are the recommended treatment protocols?"
            2. "What evidence supports these guidelines?"
            3. "What are the risk factors mentioned?"
            4. "Summarize the epidemiological data"
            5. "What prevention measures are suggested?"
            ```

            ### Contract/Legal Document Analysis
            ```
            1. "What are the key obligations of each party?"
            2. "What are the termination conditions?"
            3. "Summarize the payment terms"
            4. "What warranties are provided?"
            5. "What are the liability limitations?"
            ```

            ---

            ## Complete Example Workflow

            **Scenario:** Analyzing the "Attention is All You Need" paper

            **Step 1:** Upload the PDF
            - Download: `https://arxiv.org/pdf/1706.03762.pdf`
            - Upload in Setup tab

            **Step 2:** Start with broad question:
            ```
            "What is this paper about and what problem does it solve?"
            ```

            **Step 3:** Dive into specifics:
            ```
            "Explain how multi-head attention works"
            "What are the advantages over recurrent models?"
            "How does positional encoding work?"
            ```

            **Step 4:** Compare (if you upload GPT-3 paper too):
            ```
            "How does the Transformer architecture used in GPT-3 differ from the original?"
            ```

            **Step 5:** Get comprehensive summary:
            - Go to Summarize tab
            - Click "Generate Summary"

            ---

            ### Key Features

            #### Multi-Document Support
            - Upload multiple PDFs to build a knowledge base
            - Questions search across all loaded documents
            - Each source is clearly attributed with document name and page

            #### Enhanced Citations
            - Expandable citation cards with snippets
            - Relevance scores (High, Medium, Low)
            - Shows exact page numbers and document names

            #### Better Answer Quality
            - Advanced query classification (factoid/medium/complex)
            - Verification system for complex queries
            - Structured, comprehensive responses
            - Context-aware answer length

            #### Advanced Retrieval
            - **HyDE**: Generates hypothetical documents for better retrieval
            - **Multi-Query**: Expands questions for comprehensive coverage
            - **RRF Fusion**: Combines vector + keyword search
            - **Cross-Encoder Re-ranking**: Improves relevance scoring

            ### Best Practices

            **For Best Results:**
            - Use semantic chunking (default)
            - Keep HyDE and Multi-Query enabled for important questions
            - Ask specific, well-formed questions
            - Review citations to verify answers

            **Example Questions:**
            - "What is the transformer architecture?" (Factoid)
            - "Explain how self-attention works in detail" (Complex)
            - "Compare BERT and GPT approaches" (Complex)
            - "What are the key innovations in this paper?" (Medium)

            ### Performance
            - **Simple queries**: ~5-8 seconds
            - **Complex queries**: ~10-15 seconds (includes verification)
            - **Full summary**: ~30-90 seconds (depends on document count)

            ### Technical Details
            - **Embeddings**: BAAI/bge-large-en-v1.5
            - **Re-ranker**: BAAI/bge-reranker-v2-m3
            - **LLM**: Llama 3.3 70B Versatile via Groq
            - **Retrieval**: Hybrid (Vector + BM25) with RRF fusion

            ---

            **Need Help?** Make sure to initialize the system and upload at least one PDF before asking questions!
            """)

    return app


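### Launch Configuration:
```python
app.launch(share=True, debug=True, server_name="0.0.0.0", server_port=7860)
```
- `share=True` → creates a temporary public `gradio.live` URL
- `debug=True` → keeps the cell running so errors and logs stream into the notebook
- `server_name="0.0.0.0"` → listens on all network interfaces instead of only localhost
- `server_port=7860` → access locally at `http://localhost:7860` (on Colab, use the share link instead)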

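If port 7860 is already taken (for example by an earlier run of the launch cell that is still serving), an explicit `server_port` will typically make `launch()` raise an error rather than silently pick another port. A minimal stdlib check before launching, assuming you are free to fall back to a nearby port, could look like this; `find_free_port` is a hypothetical helper, not a Gradio API.

```python
import socket


def find_free_port(start: int = 7860, attempts: int = 20, host: str = "0.0.0.0") -> int:
    """Return the first port in [start, start + attempts) that can be bound on host."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port in use (or not permitted); try the next one
            return port  # the socket closes on leaving the with-block, freeing the port
    raise OSError(f"No free port in range {start}-{start + attempts - 1}")


port = find_free_port()
print(f"Launching on port {port}")
# app.launch(share=True, debug=True, server_name="0.0.0.0", server_port=port)
```

The check is best-effort: another process could claim the port between the probe and `launch()`, so it reduces rather than eliminates bind failures.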

In [None]:
# Create and launch the interface
app = create_interface()

# Launch
app.launch(
    share=True,
    debug=True,
    server_name="0.0.0.0",
    server_port=7860
)



Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://f02d585ac558d5f5c6.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)



ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientStartEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event ClientCreateCollectionEvent: capture() takes 1 positional argument but 3 were given
ERROR:chromadb.telemetry.product.posthog:Failed to send telemetry event CollectionQueryEvent: capture() takes 1 positional argument but 3 were given
