# RAG-Based PDF Question Answering with Multi-Model Evaluation

This notebook demonstrates a complete **Retrieval-Augmented Generation (RAG)** system that:
- Answers questions based on PDF documents
- Compares three LLMs: **Mistral AI**, **Qwen3**, and **Llama** (the demo cells run DistilGPT2 as a lightweight stand-in)
- Evaluates responses with 10+ metrics, including latency, BERTScore, and hallucination detection
- Generates comparison visualizations

---

## Key Features:
- Multi-model comparison (Mistral, Qwen3, Llama)
- 10+ evaluation metrics
- Multi-trial testing (3 trials per model)
- Comparison bar charts and summary tables
- PDF-grounded responses (the prompt restricts answers to the retrieved context to reduce hallucination)

---

## Step 1: Install Dependencies

Install all required packages for the RAG system.

In [None]:
%%capture
!pip install langchain==0.1.0 langchain-community==0.0.13
!pip install pypdf2==3.0.1 pymupdf==1.23.8
!pip install sentence-transformers==2.2.2 faiss-cpu==1.7.4
!pip install transformers==4.36.2 torch==2.1.2 accelerate==0.25.0
!pip install bert-score==0.3.13 rouge-score==0.1.2 nltk==3.8.1 sacrebleu==2.3.1
!pip install matplotlib==3.8.2 seaborn==0.13.0 plotly==5.18.0
!pip install scikit-learn==1.3.2 pandas==2.0.3 numpy==1.24.3
!pip install reportlab  # For creating sample PDFs

print("All packages installed successfully!")

## Step 2: Download NLTK Data

In [None]:
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
print("NLTK data downloaded!")

## Step 3: Import Libraries and Setup

In [None]:
import os
import time
import warnings
warnings.filterwarnings('ignore')

# Create directories
os.makedirs('./data/pdfs', exist_ok=True)
os.makedirs('./data/vector_db', exist_ok=True)
os.makedirs('./results', exist_ok=True)

print("Directories created!")
print("Structure:")
print("   ./data/pdfs/ - Place your PDF files here")
print("   ./data/vector_db/ - Vector database storage")
print("   ./results/ - Evaluation results")

## Step 4: Create Sample PDF Documents

We'll create 3 sample PDFs about AI topics. **Replace these with your own PDFs on Kaggle!**

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer

def create_sample_pdf(filename, title, content_list):
    """Create a sample PDF file"""
    doc = SimpleDocTemplate(filename, pagesize=letter)
    story = []
    styles = getSampleStyleSheet()
    
    # Title
    story.append(Paragraph(title, styles['Heading1']))
    story.append(Spacer(1, 12))
    
    # Content
    for para in content_list:
        story.append(Paragraph(para, styles['BodyText']))
        story.append(Spacer(1, 12))
    
    doc.build(story)
    print(f"Created: {filename}")

# Sample content about Machine Learning
ml_content = [
    "Machine learning is a subset of artificial intelligence that enables systems to learn from data.",
    "There are three main types: supervised learning, unsupervised learning, and reinforcement learning.",
    "Deep learning uses neural networks with multiple layers for complex pattern recognition.",
    "Common applications include image recognition, natural language processing, and recommendation systems."
]

# Sample content about NLP
nlp_content = [
    "Natural Language Processing enables computers to understand and generate human language.",
    "Transformers revolutionized NLP with models like BERT, GPT, and T5.",
    "Key tasks include sentiment analysis, machine translation, and question answering.",
    "Word embeddings represent words as dense vectors in continuous space."
]

# Sample content about RAG
rag_content = [
    "Retrieval-Augmented Generation combines retrieval and generation for better factual accuracy.",
    "RAG systems use vector databases like FAISS for efficient similarity search.",
    "Benefits include reduced hallucination and ability to use up-to-date information.",
    "Evaluation metrics include BLEU, ROUGE, BERTScore, and hallucination detection."
]

# Create PDFs
create_sample_pdf('./data/pdfs/machine_learning.pdf', 'Machine Learning Overview', ml_content)
create_sample_pdf('./data/pdfs/nlp_overview.pdf', 'Natural Language Processing', nlp_content)
create_sample_pdf('./data/pdfs/rag_systems.pdf', 'RAG Systems', rag_content)

print("\nAll sample PDFs created!")

## Step 5: Define Core RAG Components

### 5.1 PDF Processor

In [None]:
import fitz  # PyMuPDF
from pathlib import Path
from typing import List
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class PDFProcessor:
    """Process PDFs and split into chunks"""
    
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
        )
    
    def load_pdf(self, pdf_path: str) -> str:
        """Load PDF and extract text"""
        text = ""
        doc = fitz.open(pdf_path)
        for page in doc:
            text += page.get_text()
        doc.close()
        return text
    
    def load_directory(self, directory: str) -> List[Document]:
        """Load all PDFs from directory"""
        pdf_files = list(Path(directory).glob("*.pdf"))
        all_documents = []
        
        for pdf_path in pdf_files:
            text = self.load_pdf(str(pdf_path))
            documents = self.text_splitter.create_documents(
                [text],
                metadatas=[{"source": pdf_path.name}]
            )
            all_documents.extend(documents)
        
        print(f"Loaded {len(all_documents)} chunks from {len(pdf_files)} PDFs")
        return all_documents

print("PDFProcessor defined!")
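
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping chunking. It is a simplification: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph breaks, whereas the hypothetical `sliding_window_chunks` helper below cuts fixed character windows. The sample string and sizes are illustrative only.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Note that each sample PDF in this notebook is far shorter than the default `chunk_size=1000`, so each one yields a single chunk.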

### 5.2 Vector Store

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

class VectorStore:
    """Manage vector embeddings and similarity search"""
    
    def __init__(self, model_name='sentence-transformers/all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {model_name}...")
        self.embedding_model = SentenceTransformer(model_name)
        self.index = None
        self.documents = []
        print("Embedding model loaded!")
    
    def build_index(self, documents: List[Document]):
        """Build FAISS index"""
        self.documents = documents
        texts = [doc.page_content for doc in documents]
        
        print("Creating embeddings...")
        embeddings = self.embedding_model.encode(texts, show_progress_bar=True)
        embeddings = np.array(embeddings).astype('float32')
        
        # Build FAISS index
        dimension = embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(embeddings)
        
        print(f"Built FAISS index with {len(documents)} documents")
    
    def search(self, query: str, k: int = 5) -> List[tuple]:
        """Search for similar documents"""
        # FAISS pads results with index -1 when k exceeds the number of
        # indexed vectors, so cap k and skip any padding entries.
        k = min(k, len(self.documents))
        query_embedding = self.embedding_model.encode([query]).astype('float32')
        distances, indices = self.index.search(query_embedding, k)
        
        results = []
        for idx, distance in zip(indices[0], distances[0]):
            if idx < 0:
                continue
            results.append((self.documents[idx], float(distance)))
        return results
    
    def get_context(self, query: str, k: int = 5) -> str:
        """Get concatenated context from top-k documents"""
        results = self.search(query, k)
        context = "\n\n".join([doc.page_content for doc, _ in results])
        return context

print("VectorStore defined!")
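
As a rough mental model of what `faiss.IndexFlatL2` does, the sketch below runs an exhaustive nearest-neighbour search in plain Python. A flat L2 index compares the query against every stored vector and ranks by squared Euclidean distance. The helper names and the 2-D toy vectors are hypothetical, and this is an illustration of the idea rather than FAISS's implementation.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(
    vectors: List[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Return up to k (index, squared distance) pairs, nearest first."""
    k = min(k, len(vectors))  # never ask for more neighbours than exist
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Hypothetical 2-D "embeddings"; real all-MiniLM-L6-v2 vectors have 384 dimensions.
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
hits = brute_force_search(toy_vectors, query=[0.9, 0.1], k=5)
print([index for index, _ in hits])  # [1, 0, 2]
```

Asking for `k=5` with only three vectors returns three hits here. The same situation arises in this notebook, where the three short sample PDFs produce fewer chunks than the default `k=5`.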

### 5.3 RAG System

In [None]:
class RAGSystem:
    """Complete RAG system"""
    
    def __init__(self, pdf_directory: str):
        self.pdf_processor = PDFProcessor()
        self.vector_store = VectorStore()
        
        # Load and index documents
        print("\nLoading PDFs...")
        documents = self.pdf_processor.load_directory(pdf_directory)
        
        print("\nBuilding vector index...")
        self.vector_store.build_index(documents)
    
    def create_prompt(self, query: str, context: str) -> str:
        """Create prompt with context"""
        prompt = f"""You are a helpful assistant that answers questions based ONLY on the provided context.
If the answer cannot be found in the context, say "I cannot answer based on the provided documents."

Context:
{context}

Question: {query}

Answer (based only on the context):"""
        return prompt
    
    def query(self, question: str, top_k: int = 5) -> tuple:
        """Query the RAG system"""
        context = self.vector_store.get_context(question, k=top_k)
        prompt = self.create_prompt(question, context)
        return prompt, context

print("RAGSystem defined!")

## Step 6: Initialize RAG System

In [None]:
# Initialize RAG system with our PDFs
rag_system = RAGSystem('./data/pdfs')

print("\n" + "="*60)
print("RAG SYSTEM READY!")
print("="*60)

## Step 7: Test RAG Retrieval

In [None]:
# Test query
test_question = "What is machine learning?"

prompt, context = rag_system.query(test_question)

print("Query:", test_question)
print("\nRetrieved Context:")
print("="*60)
print(context)
print("="*60)

## Step 8: Define LLM Models (Simplified for Kaggle)

**Note**: Loading the full models requires significant GPU memory. For the Kaggle demo, this notebook loads the lightweight `distilgpt2` model instead.
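
To put a rough number on that memory requirement, the sketch below estimates the space needed for model weights alone at different precisions. The 7-billion parameter count is an assumption chosen to match the 7B checkpoints named in Step 15, and the hypothetical `weight_memory_gb` helper ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumed parameter count: 7 billion (illustrative, matching a "7B" checkpoint).
for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"7B parameters at {label}: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the weights take about 28 GB in float32, 14 GB in float16, and 3.5 GB at 4 bits, which is why quantization matters for running 7B models on a single GPU.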

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class SimpleLLMModel:
    """Simplified LLM model for demo"""
    
    def __init__(self, model_name: str):
        self.model_name = model_name
        self.model = None
        self.tokenizer = None
    
    def load(self, use_small_model=True):
        """Load model - using smaller model for Kaggle"""
        if use_small_model:
            # Use a smaller model for Kaggle demo
            model_id = "distilgpt2"  # Lightweight model
            print(f"Loading lightweight model: {model_id}...")
        else:
            model_id = self.model_name
            print(f"Loading {self.model_name}...")
        
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
        )
        
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
        
        if torch.cuda.is_available():
            self.model = self.model.to('cuda')
        
        print("Model loaded!")
    
    def generate(self, prompt: str, max_tokens: int = 150) -> tuple:
        """Generate response"""
        if self.model is None:
            self.load()
        
        start_time = time.time()
        
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        if torch.cuda.is_available():
            inputs = {k: v.to('cuda') for k, v in inputs.items()}
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=0.7,
                do_sample=True,
                pad_token_id=self.tokenizer.pad_token_id,
            )
        
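        # Decode only the newly generated tokens. Slicing the decoded text by
        # len(prompt) breaks when the prompt was truncated or re-tokenized.
        prompt_length = inputs['input_ids'].shape[1]
        response = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()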
        
        latency = time.time() - start_time
        
        return response, latency

print("SimpleLLMModel defined!")
print("\nNote: Using lightweight model (DistilGPT2) for Kaggle demo.")
print("   For production, use full models: Mistral, Qwen3, or Llama.")

## Step 9: Define Evaluation Metrics

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from bert_score import score as bert_score
import nltk

class EvaluationMetrics:
    """Comprehensive evaluation metrics"""
    
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    
    def cosine_similarity(self, text1: str, text2: str, embedding_model) -> float:
        """Compute cosine similarity"""
        embeddings = embedding_model.encode([text1, text2])
        similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
        return float(similarity)
    
    def bleu_score(self, reference: str, candidate: str) -> float:
        """Compute BLEU score"""
        ref_tokens = nltk.word_tokenize(reference.lower())
        cand_tokens = nltk.word_tokenize(candidate.lower())
        smoothing = SmoothingFunction().method1
        return sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smoothing)
    
    def meteor_score_calc(self, reference: str, candidate: str) -> float:
        """Compute METEOR score"""
        ref_tokens = nltk.word_tokenize(reference.lower())
        cand_tokens = nltk.word_tokenize(candidate.lower())
        return meteor_score([ref_tokens], cand_tokens)
    
    def rouge_scores(self, reference: str, candidate: str) -> dict:
        """Compute ROUGE scores"""
        scores = self.rouge_scorer.score(reference, candidate)
        return {
            'rouge1': scores['rouge1'].fmeasure,
            'rouge2': scores['rouge2'].fmeasure,
            'rougeL': scores['rougeL'].fmeasure,
        }
    
    def bertscore_calc(self, reference: str, candidate: str) -> dict:
        """Compute BERTScore"""
        P, R, F1 = bert_score([candidate], [reference], lang='en', verbose=False)
        return {
            'precision': float(P[0]),
            'recall': float(R[0]),
            'f1': float(F1[0])
        }
    
    def hallucination_score(self, context: str, response: str, embedding_model) -> float:
        """Heuristic hallucination proxy: 1 - cosine similarity of response and context (near 0 = well grounded, higher = less grounded)"""
        if "cannot answer" in response.lower():
            return 0.0
        similarity = self.cosine_similarity(response, context, embedding_model)
        return 1.0 - similarity
    
    def evaluate_all(self, query: str, response: str, reference: str, context: str, 
                     latency: float, embedding_model) -> dict:
        """Compute all metrics"""
        metrics = {'latency': latency}
        
        # Similarity metrics
        metrics['cosine_similarity'] = self.cosine_similarity(reference, response, embedding_model)
        
        # NLP metrics
        metrics['bleu'] = self.bleu_score(reference, response)
        metrics['meteor'] = self.meteor_score_calc(reference, response)
        
        # ROUGE
        rouge = self.rouge_scores(reference, response)
        metrics.update(rouge)
        
        # BERTScore
        bertscore = self.bertscore_calc(reference, response)
        metrics['bertscore_f1'] = bertscore['f1']
        
        # Hallucination and quality
        metrics['hallucination'] = self.hallucination_score(context, response, embedding_model)
        metrics['completeness'] = rouge['rougeL']  # ROUGE-L F-measure, used as a proxy for completeness
        metrics['irrelevance'] = 1.0 - self.cosine_similarity(query, response, embedding_model)
        
        return metrics

print("EvaluationMetrics defined!")
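
The `completeness` value is taken from ROUGE-L, and `rouge_scores` reports its `.fmeasure`. To show how precision, recall, and F-measure differ, here is a minimal sketch that computes ROUGE-L from the longest common subsequence (LCS). It assumes whitespace tokenization and no stemming, so its numbers will not exactly match the `rouge_score` package; the helper names and example sentences are hypothetical.

```python
from typing import Dict, List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> Dict[str, float]:
    """ROUGE-L precision, recall, and F-measure on lowercased whitespace tokens."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(cand) if cand else 0.0
    recall = lcs / len(ref) if ref else 0.0
    total = precision + recall
    fmeasure = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "fmeasure": fmeasure}


scores = rouge_l("machine learning learns from data", "machine learning learns")
print(scores)
```

The candidate covers three of the five reference tokens, giving precision 1.0, recall 0.6, and an F-measure of about 0.75. If completeness should measure how much of the reference is covered, recall is the more direct signal; the F-measure also rewards brevity.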

## Step 10: Run Demo Evaluation

Let's test the system with a sample question.

In [None]:
# Initialize components
evaluator = EvaluationMetrics()
model = SimpleLLMModel("demo-model")
model.load(use_small_model=True)

# Test question
question = "What is machine learning?"
reference = "Machine learning is a subset of artificial intelligence that enables systems to learn from data."

# Get context and generate response
prompt, context = rag_system.query(question)

print("\n" + "="*60)
print("DEMO EVALUATION")
print("="*60)
print(f"\nQuestion: {question}")
print(f"\nReference: {reference}")

# Generate response
print("\nGenerating response...")
response, latency = model.generate(prompt, max_tokens=100)

print(f"\nResponse: {response}")
print(f"Latency: {latency:.2f}s")

# Evaluate
print("\nComputing metrics...")
metrics = evaluator.evaluate_all(
    query=question,
    response=response,
    reference=reference,
    context=context,
    latency=latency,
    embedding_model=rag_system.vector_store.embedding_model
)

print("\n" + "="*60)
print("EVALUATION RESULTS")
print("="*60)
for metric, value in metrics.items():
    print(f"{metric:20s}: {value:.4f}")
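
When reading the `hallucination` and `irrelevance` values, it helps to remember that both are `1 - cosine similarity` between embeddings. The sketch below reproduces that arithmetic on hypothetical 2-D vectors. Clamping the result to the range 0 to 1 is an addition for illustration; the `EvaluationMetrics` class does not clamp, so its score can exceed 1 when the similarity is negative.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hallucination_proxy(response_vec: Sequence[float], context_vec: Sequence[float]) -> float:
    """1 - cosine similarity, clamped to [0, 1] (the clamp is an illustrative addition)."""
    return min(1.0, max(0.0, 1.0 - cosine(response_vec, context_vec)))


# Hypothetical embeddings: same direction, orthogonal, and opposite.
print(hallucination_proxy([1.0, 0.0], [2.0, 0.0]))   # 0.0
print(hallucination_proxy([1.0, 0.0], [0.0, 3.0]))   # 1.0
print(hallucination_proxy([1.0, 0.0], [-1.0, 0.0]))  # 1.0
```

A low score only says that the response is semantically close to the retrieved context. It does not verify individual facts, so treat it as a coarse signal rather than a hallucination detector.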

## Step 11: Visualization Functions

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

sns.set_style("whitegrid")

def plot_metrics_comparison(results_dict: dict, title: str = "Model Comparison"):
    """Plot bar chart comparing models"""
    df = pd.DataFrame(results_dict).T
    
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    axes = axes.flatten()
    
    metrics = ['latency', 'cosine_similarity', 'bertscore_f1', 'completeness',
               'hallucination', 'irrelevance', 'meteor', 'bleu']
    
    for idx, metric in enumerate(metrics):
        if metric in df.columns:
            ax = axes[idx]
            df[metric].plot(kind='bar', ax=ax, color=['#FF6B6B', '#4ECDC4', '#45B7D1'])
            ax.set_title(metric.replace('_', ' ').title(), fontweight='bold')
            ax.set_ylabel('Score')
            ax.tick_params(axis='x', rotation=45)
            ax.grid(True, alpha=0.3)
    
    plt.suptitle(title, fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.savefig('./results/comparison.png', dpi=300, bbox_inches='tight')
    plt.show()

def create_summary_table(results_dict: dict):
    """Create summary table"""
    df = pd.DataFrame(results_dict).T
    print("\n" + "="*80)
    print("SUMMARY TABLE")
    print("="*80)
    print(df.round(4).to_string())
    
    # Save to CSV
    df.to_csv('./results/summary.csv')
    print("\nSaved to: ./results/summary.csv")

print("Visualization functions defined!")

## Step 12: Create Visualizations

In [None]:
# Example results (replace with actual multi-model evaluation)
example_results = {
    'Model_Demo': metrics
}

# Plot comparison
plot_metrics_comparison(example_results, "RAG System Evaluation - Demo")

# Create summary table
create_summary_table(example_results)

## Step 13: Multi-Trial Evaluation (Optional)

Run multiple trials to test consistency.

In [None]:
def run_multi_trial_evaluation(rag_system, model, evaluator, question, reference, num_trials=3):
    """Run multiple trials and aggregate results"""
    all_metrics = []
    
    print(f"\nRunning {num_trials} trials...\n")
    
    for trial in range(1, num_trials + 1):
        print(f"Trial {trial}/{num_trials}...", end=" ")
        
        # Get context and generate
        prompt, context = rag_system.query(question)
        response, latency = model.generate(prompt, max_tokens=100)
        
        # Evaluate
        metrics = evaluator.evaluate_all(
            query=question,
            response=response,
            reference=reference,
            context=context,
            latency=latency,
            embedding_model=rag_system.vector_store.embedding_model
        )
        
        all_metrics.append(metrics)
        print(f"Latency: {latency:.2f}s")
    
    # Aggregate
    aggregated = {}
    for key in all_metrics[0].keys():
        values = [m[key] for m in all_metrics]
        aggregated[key] = {
            'mean': np.mean(values),
            'std': np.std(values),
            'min': np.min(values),
            'max': np.max(values)
        }
    
    return aggregated

# Run multi-trial evaluation
multi_trial_results = run_multi_trial_evaluation(
    rag_system, model, evaluator, question, reference, num_trials=3
)

print("\n" + "="*80)
print("MULTI-TRIAL AGGREGATED RESULTS")
print("="*80)
for metric, stats in multi_trial_results.items():
    print(f"{metric:20s}: mean={stats['mean']:.4f}, std={stats['std']:.4f}, min={stats['min']:.4f}, max={stats['max']:.4f}")
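
One detail worth knowing when reading the `std` column: `np.std` defaults to the population standard deviation (dividing by n), which understates the spread for a small sample such as three trials. The sketch below contrasts the two definitions using the standard library. The `summarize` helper and the latency values are hypothetical.

```python
import statistics
from typing import Dict, List


def summarize(values: List[float]) -> Dict[str, float]:
    """Aggregate one metric across trials; needs at least two values for sample_std."""
    return {
        "mean": statistics.mean(values),
        "population_std": statistics.pstdev(values),  # divides by n, like np.std's default
        "sample_std": statistics.stdev(values),       # divides by n - 1
        "min": min(values),
        "max": max(values),
    }


latencies = [1.0, 2.0, 3.0]  # hypothetical latencies in seconds from three trials
summary = summarize(latencies)
print(summary)
```

For these three values the population standard deviation is about 0.816 while the sample standard deviation is 1.0. Passing `ddof=1` to `np.std` gives the sample version if that is what you want to report.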

## Step 14: Upload Your Own PDFs (Kaggle Datasets)

To use your own PDFs on Kaggle:

1. **Upload PDFs as a Kaggle Dataset**:
   - Go to kaggle.com/datasets
   - Click "New Dataset"
   - Upload your PDF files
   - Make it public or private

2. **Add Dataset to Notebook**:
   - Click "Add Data" in the right panel
   - Search for your dataset
   - Click "Add"

3. **Update the path**:
   ```python
   # Replace this path with your dataset path
   pdf_directory = '/kaggle/input/your-dataset-name/'
   rag_system = RAGSystem(pdf_directory)
   ```
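
Before pointing `RAGSystem` at a new dataset path, it can help to confirm that the directory exists and actually contains PDFs. The sketch below is one way to do that. The `list_pdfs` helper is hypothetical, and the demo runs on a temporary directory with empty placeholder files rather than real PDFs.

```python
import tempfile
from pathlib import Path
from typing import List


def list_pdfs(directory: str) -> List[Path]:
    """Return the PDF files directly inside `directory`, matching .pdf case-insensitively."""
    root = Path(directory)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {directory}")
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demo on a throwaway directory with empty placeholder files (not real PDFs).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.pdf", "b.PDF", "notes.txt"):
        (Path(tmp) / name).touch()
    found_names = [p.name for p in list_pdfs(tmp)]

print(found_names)  # ['a.pdf', 'b.PDF']
```

Note that `PDFProcessor.load_directory` uses `glob("*.pdf")`, which only looks at the top level of the directory and, on a case-sensitive filesystem, would not pick up a file named `b.PDF`. If your dataset nests files in subfolders or uses uppercase extensions, pass the right subfolder or rename the files.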

## Step 15: Full Model Evaluation (For GPU Kaggle)

If you have GPU enabled on Kaggle, uncomment and run this cell to load full models.

In [None]:
# # Uncomment to load full models (requires a GPU and the bitsandbytes package, which Step 1 does not install)
# from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
# import torch

# def load_full_model(model_name):
#     """Load full model with quantization"""
#     quantization_config = BitsAndBytesConfig(
#         load_in_4bit=True,
#         bnb_4bit_compute_dtype=torch.float16,
#     )
    
#     model = AutoModelForCausalLM.from_pretrained(
#         model_name,
#         quantization_config=quantization_config,
#         device_map="auto",
#         trust_remote_code=True
#     )
    
#     tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
#     return model, tokenizer

# # Load Mistral, Qwen, or Llama (note: the Qwen line below uses a Qwen2 checkpoint, not Qwen3)
# # model, tokenizer = load_full_model('mistralai/Mistral-7B-Instruct-v0.2')
# # model, tokenizer = load_full_model('Qwen/Qwen2-7B-Instruct')
# # model, tokenizer = load_full_model('meta-llama/Llama-2-7b-chat-hf')

## Summary

### What We Built:
1. **PDF Processing** - Load and chunk documents
2. **Vector Database** - FAISS-based similarity search
3. **RAG System** - Context-aware question answering
4. **Evaluation Metrics** - 10+ metrics including:
   - Latency
   - Cosine Similarity
   - BERTScore F1
   - Completeness
   - Hallucination Detection
   - Irrelevance Detection
   - METEOR, BLEU, ROUGE
5. **Visualizations** - Bar charts and summary tables

### Next Steps:
1. Upload your own PDFs to Kaggle datasets
2. Update test questions
3. Enable GPU for full model evaluation
4. Compare Mistral, Qwen3, and Llama models
5. Analyze results and optimize

### Resources:
- GitHub: https://github.com/isratjahan829/LLM_Task
- Documentation: See README.md in the repo

---

**Congratulations! You have a working RAG evaluation system on Kaggle!**