# Complete RAG System with Multi-Model Evaluation

## Pure Python Implementation - No Shell Commands

### Features:
- PDF-based Question Answering
- 3 LLM Models: Mistral, Qwen2, Llama
- 10+ Evaluation Metrics
- Multi-Trial Testing
- Comparison Visualizations

## Step 1: Install Required Packages
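
The install cell below calls `pip` for every pinned package on each run. A minimal sketch, assuming you would rather check first which pins are already satisfied: `installed_version` and `needs_install` are hypothetical helpers (not part of this notebook) built only on the standard-library `importlib.metadata` module.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def needs_install(requirement: str) -> bool:
    """Hypothetical helper: True if a 'name==version' pin is not satisfied.

    Unpinned requirements (no '==') only need the distribution to be present.
    """
    name, _, pinned = requirement.partition('==')
    current = installed_version(name)
    if current is None:
        return True
    return bool(pinned) and current != pinned
```

One possible use is to filter the list before the loop, for example `packages = [p for p in packages if needs_install(p)]`.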

In [None]:
import sys
import subprocess

# Install packages
packages = [
    'langchain==0.1.0',
    'langchain-community==0.0.13',
    'pypdf2==3.0.1',
    'pymupdf==1.23.8',
    'sentence-transformers==2.2.2',
    'faiss-cpu==1.7.4',
    'transformers==4.36.2',
    'torch==2.1.2',
    'accelerate==0.25.0',
    'bert-score==0.3.13',
    'rouge-score==0.1.2',
    'nltk==3.8.1',
    'matplotlib==3.8.2',
    'seaborn==0.13.0',
    'scikit-learn==1.3.2',
    'pandas==2.0.3',
    'numpy==1.24.3',
    'reportlab',
    'bitsandbytes',  # required by BitsAndBytesConfig for the 4-bit loading path in Step 9
]

for package in packages:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q', package])

print("All packages installed!")

In [None]:
# Download NLTK data
import nltk
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('omw-1.4', quiet=True)
print("NLTK data downloaded!")

## Step 2: Setup Directories and Imports

In [None]:
import os
import time
import warnings
from pathlib import Path
from typing import List, Dict, Optional, Tuple
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

warnings.filterwarnings('ignore')

# Create directories
Path('./data/pdfs').mkdir(parents=True, exist_ok=True)
Path('./data/vector_db').mkdir(parents=True, exist_ok=True)
Path('./results').mkdir(parents=True, exist_ok=True)

print("Directories created!")
print("  ./data/pdfs/ - PDF files")
print("  ./data/vector_db/ - Vector database")
print("  ./results/ - Results and charts")

## Step 3: Create Sample PDFs

In [None]:
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer

def create_pdf(filename: str, title: str, paragraphs: List[str]):
    """Create a PDF document"""
    doc = SimpleDocTemplate(filename, pagesize=letter)
    story = []
    styles = getSampleStyleSheet()
    
    # Title
    title_style = ParagraphStyle(
        'Title',
        parent=styles['Heading1'],
        fontSize=24,
        spaceAfter=30
    )
    story.append(Paragraph(title, title_style))
    story.append(Spacer(1, 12))
    
    # Content
    for para in paragraphs:
        story.append(Paragraph(para, styles['BodyText']))
        story.append(Spacer(1, 12))
    
    doc.build(story)
    return filename

# Create sample PDFs
pdfs = [
    {
        'filename': './data/pdfs/machine_learning.pdf',
        'title': 'Machine Learning Overview',
        'content': [
            "Machine learning is a subset of artificial intelligence that enables computer systems to learn and improve from experience without being explicitly programmed. It focuses on developing algorithms that can access data and use it to learn for themselves.",
            "There are three main types of machine learning: supervised learning, where models are trained on labeled data; unsupervised learning, where models find patterns in unlabeled data; and reinforcement learning, where agents learn to make decisions by interacting with an environment.",
            "Deep learning is a specialized subset of machine learning that uses artificial neural networks with multiple layers. These networks can automatically learn hierarchical representations of data, making them particularly effective for complex tasks like image recognition and natural language processing.",
            "Common applications of machine learning include recommendation systems, fraud detection, image and speech recognition, predictive analytics, and autonomous vehicles. The field continues to evolve rapidly with new algorithms and techniques being developed regularly."
        ]
    },
    {
        'filename': './data/pdfs/natural_language_processing.pdf',
        'title': 'Natural Language Processing',
        'content': [
            "Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language in a meaningful and useful way. It combines computational linguistics with machine learning and deep learning models.",
            "Key NLP tasks include sentiment analysis, machine translation, named entity recognition, question answering, text summarization, and language generation. These tasks require understanding both the syntax and semantics of language.",
            "Transformer architecture, introduced in 2017, revolutionized NLP by using self-attention mechanisms. Models like BERT, GPT, and T5 have achieved remarkable results across various NLP benchmarks and real-world applications.",
            "Modern NLP systems use word embeddings and contextualized representations to capture semantic meaning. Pre-trained language models can be fine-tuned for specific tasks with relatively small amounts of task-specific data."
        ]
    },
    {
        'filename': './data/pdfs/rag_systems.pdf',
        'title': 'Retrieval-Augmented Generation Systems',
        'content': [
            "Retrieval-Augmented Generation (RAG) is an advanced architecture that combines information retrieval with text generation. It retrieves relevant documents from a knowledge base and uses them to generate more accurate and factual responses.",
            "RAG systems consist of three main components: a retriever that finds relevant documents, a vector database for efficient similarity search, and a generator that produces responses based on retrieved context. This architecture reduces hallucination and improves factual accuracy.",
            "Vector databases like FAISS, Pinecone, or Weaviate enable fast similarity search over large document collections. Documents are converted into dense vector representations using embedding models like sentence-transformers.",
            "Evaluation of RAG systems requires multiple metrics including semantic similarity measures like BERTScore, factuality metrics like hallucination detection, and traditional NLP metrics like BLEU and ROUGE. Multi-dimensional evaluation ensures comprehensive assessment of system performance."
        ]
    }
]

for pdf_data in pdfs:
    create_pdf(pdf_data['filename'], pdf_data['title'], pdf_data['content'])
    print(f"Created: {pdf_data['filename']}")

print("\nAll PDFs created!")

## Step 4: PDF Processor Class
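
The processor below is configured with `chunk_size=1000` and `chunk_overlap=200`. To make those two parameters concrete, here is a simplified sketch using fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs, lines, and spaces, so its chunks vary in length); `sliding_chunks` is a hypothetical helper for illustration only.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> List[str]:
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters
chunks = sliding_chunks(sample, chunk_size=12, chunk_overlap=4)
print([len(c) for c in chunks])          # window lengths
print(chunks[0][-4:] == chunks[1][:4])   # consecutive windows share 4 characters
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of indexing some text twice.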

In [None]:
import fitz  # PyMuPDF
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class PDFProcessor:
    """Process PDFs and chunk documents"""
    
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=len,
        )
    
    def extract_text_from_pdf(self, pdf_path: str) -> str:
        """Extract text from PDF using PyMuPDF"""
        text = ""
        doc = fitz.open(pdf_path)
        for page in doc:
            text += page.get_text()
        doc.close()
        return text
    
    def load_pdf(self, pdf_path: str) -> List[Document]:
        """Load single PDF and create chunks"""
        text = self.extract_text_from_pdf(pdf_path)
        filename = os.path.basename(pdf_path)
        
        documents = self.text_splitter.create_documents(
            [text],
            metadatas=[{"source": filename, "path": pdf_path}]
        )
        return documents
    
    def load_directory(self, directory: str) -> List[Document]:
        """Load all PDFs from directory"""
        pdf_files = list(Path(directory).glob("*.pdf"))
        all_documents = []
        
        print(f"Loading {len(pdf_files)} PDF files...")
        for pdf_path in pdf_files:
            documents = self.load_pdf(str(pdf_path))
            all_documents.extend(documents)
            print(f"  {pdf_path.name}: {len(documents)} chunks")
        
        print(f"\nTotal: {len(all_documents)} chunks from {len(pdf_files)} PDFs")
        return all_documents

print("PDFProcessor class defined!")

## Step 5: Vector Store Class
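
The class below stores embeddings in a `faiss.IndexFlatL2`, which performs exact (brute-force) search and reports squared Euclidean distances, so smaller values mean closer matches. The following pure-Python sketch shows the same idea on toy 2-D vectors; `squared_l2` and `brute_force_search` are hypothetical names, and the vectors are made up for illustration.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query: Sequence[float],
                       vectors: List[Sequence[float]],
                       k: int = 5) -> List[Tuple[int, float]]:
    """Return up to k (index, squared distance) pairs, closest first."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # toy unit-length "embeddings"
hits = brute_force_search([1.0, 0.0], vectors, k=5)
print(hits)
```

Two details are worth noting. Only three vectors exist, so asking for `k=5` yields three hits here, whereas FAISS pads the missing slots with index `-1`. And for unit-length vectors the squared distance equals `2 - 2 * cosine_similarity`, so ranking by L2 distance and ranking by cosine similarity agree in that case.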

In [None]:
from sentence_transformers import SentenceTransformer
import faiss

class VectorStore:
    """Vector database using FAISS"""
    
    def __init__(self, model_name: str = 'sentence-transformers/all-MiniLM-L6-v2'):
        print(f"Loading embedding model: {model_name}")
        self.embedding_model = SentenceTransformer(model_name)
        self.index = None
        self.documents = []
        self.embeddings = None
        print("Embedding model loaded!")
    
    def create_embeddings(self, documents: List[Document]) -> np.ndarray:
        """Create embeddings for documents"""
        texts = [doc.page_content for doc in documents]
        print(f"Creating embeddings for {len(texts)} documents...")
        embeddings = self.embedding_model.encode(texts, show_progress_bar=True)
        return np.array(embeddings).astype('float32')
    
    def build_index(self, documents: List[Document]):
        """Build FAISS index"""
        self.documents = documents
        self.embeddings = self.create_embeddings(documents)
        
        # Create FAISS index
        dimension = self.embeddings.shape[1]
        self.index = faiss.IndexFlatL2(dimension)
        self.index.add(self.embeddings)
        
        print(f"FAISS index built with {len(documents)} documents")
    
    def search(self, query: str, k: int = 5) -> List[Tuple[Document, float]]:
        """Search for similar documents; returns (document, squared L2 distance), lower is closer"""
        query_embedding = self.embedding_model.encode([query]).astype('float32')
        distances, indices = self.index.search(query_embedding, k)
        
        results = []
        for idx, distance in zip(indices[0], distances[0]):
            # FAISS pads with -1 when fewer than k vectors are indexed; skip those slots
            if 0 <= idx < len(self.documents):
                results.append((self.documents[idx], float(distance)))
        return results
    
    def get_context(self, query: str, k: int = 5) -> str:
        """Get concatenated context"""
        results = self.search(query, k)
        context = "\n\n".join([doc.page_content for doc, _ in results])
        return context
    
    def compute_similarity(self, text1: str, text2: str) -> float:
        """Compute cosine similarity"""
        embeddings = self.embedding_model.encode([text1, text2])
        norm1 = np.linalg.norm(embeddings[0])
        norm2 = np.linalg.norm(embeddings[1])
        
        if norm1 == 0 or norm2 == 0:
            return 0.0
        
        similarity = np.dot(embeddings[0], embeddings[1]) / (norm1 * norm2)
        return float(similarity)

print("VectorStore class defined!")

## Step 6: RAG System Class

In [None]:
class RAGSystem:
    """Complete RAG System"""
    
    def __init__(self, pdf_directory: str):
        print("\n" + "="*60)
        print("INITIALIZING RAG SYSTEM")
        print("="*60)
        
        # Initialize components
        self.pdf_processor = PDFProcessor(chunk_size=1000, chunk_overlap=200)
        self.vector_store = VectorStore()
        
        # Load and index documents
        print("\nLoading PDFs...")
        documents = self.pdf_processor.load_directory(pdf_directory)
        
        print("\nBuilding vector index...")
        self.vector_store.build_index(documents)
        
        print("\n" + "="*60)
        print("RAG SYSTEM READY!")
        print("="*60)
    
    def create_prompt(self, query: str, context: str) -> str:
        """Create prompt with context"""
        prompt = f"""You are a helpful assistant that answers questions based ONLY on the provided context from PDF documents.
If the answer cannot be found in the context, say "I cannot answer this question based on the provided documents."

Context from PDFs:
{context}

Question: {query}

Answer (based only on the context above):"""
        return prompt
    
    def query(self, question: str, top_k: int = 5) -> Tuple[str, str]:
        """Query the system"""
        context = self.vector_store.get_context(question, k=top_k)
        prompt = self.create_prompt(question, context)
        return prompt, context
    
    def get_retrieved_docs(self, question: str, top_k: int = 5) -> List[Tuple[Document, float]]:
        """Get retrieved documents with their squared L2 distances (lower is closer)"""
        return self.vector_store.search(question, k=top_k)

print("RAGSystem class defined!")

## Step 7: Initialize RAG System

In [None]:
# Initialize RAG system
rag_system = RAGSystem('./data/pdfs')

## Step 8: Test Retrieval

In [None]:
# Test query
test_question = "What is machine learning and what are its main types?"

print("Test Query:", test_question)
print("\n" + "="*60)

# Get prompt and context
prompt, context = rag_system.query(test_question)

print("Retrieved Context:")
print("="*60)
print(context)
print("="*60)
print(f"\nContext length: {len(context)} characters")

## Step 9: LLM Model Class
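
For decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so the answer has to be separated from the prompt afterwards. The sketch below uses a toy whitespace tokenizer (`toy_encode` and `toy_decode` are hypothetical stand-ins, not a real tokenizer) to show why cutting the decoded string at `len(prompt)` is fragile whenever decoding does not reproduce the prompt character for character, and why slicing by token count is safer.

```python
from typing import List


def toy_encode(text: str) -> List[str]:
    """Toy tokenizer: split on whitespace (collapses repeated spaces and newlines)."""
    return text.split()


def toy_decode(tokens: List[str]) -> str:
    """Toy detokenizer: join tokens with single spaces."""
    return " ".join(tokens)


prompt = "Question:   What is RAG?\n\nAnswer:"
prompt_tokens = toy_encode(prompt)
# Stand-in for model output: prompt tokens followed by the generated tokens
output_tokens = prompt_tokens + ["Retrieval", "plus", "generation."]

# Fragile: the decoded prompt is shorter than the original prompt string,
# so slicing by len(prompt) cuts into the answer
by_chars = toy_decode(output_tokens)[len(prompt):].strip()

# Safer: drop exactly as many tokens as the prompt contained
by_tokens = toy_decode(output_tokens[len(prompt_tokens):])

print(repr(by_chars))
print(repr(by_tokens))
```

With a real Hugging Face tokenizer the equivalent token-based slice is `outputs[0][inputs["input_ids"].shape[1]:]`.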

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

class LLMModel:
    """LLM Model with quantization support"""
    
    def __init__(self, model_name: str, display_name: str = None):
        self.model_name = model_name
        self.display_name = display_name or model_name
        self.model = None
        self.tokenizer = None
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
    
    def load(self, use_quantization: bool = True):
        """Load model with optional quantization"""
        print(f"\nLoading {self.display_name}...")
        print(f"Device: {self.device}")
        
        try:
            if use_quantization and torch.cuda.is_available():
                # 4-bit quantization for GPU
                quantization_config = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_compute_dtype=torch.float16,
                    bnb_4bit_use_double_quant=True,
                    bnb_4bit_quant_type="nf4"
                )
                
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    quantization_config=quantization_config,
                    device_map="auto",
                    trust_remote_code=True
                )
            else:
                # Standard loading
                self.model = AutoModelForCausalLM.from_pretrained(
                    self.model_name,
                    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
                    device_map="auto" if torch.cuda.is_available() else None,
                    trust_remote_code=True,
                    low_cpu_mem_usage=True
                )
                
                # On GPU, device_map="auto" has already placed the weights,
                # so no additional .to(self.device) call is needed here.
            
            self.tokenizer = AutoTokenizer.from_pretrained(
                self.model_name,
                trust_remote_code=True
            )
            
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
            
            print(f"{self.display_name} loaded successfully!")
            return True
            
        except Exception as e:
            print(f"Error loading {self.display_name}: {str(e)}")
            return False
    
    def generate(self, prompt: str, max_tokens: int = 512, temperature: float = 0.7) -> Tuple[str, float]:
        """Generate response and measure latency"""
        if self.model is None:
            raise ValueError("Model not loaded. Call load() first.")
        
        start_time = time.time()
        
        # Tokenize
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=2048
        )
        
        if self.device == 'cuda':
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
        
        # Generate
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_tokens,
                temperature=temperature,
                do_sample=True,
                top_p=0.9,
                pad_token_id=self.tokenizer.pad_token_id,
                eos_token_id=self.tokenizer.eos_token_id,
            )
        
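        # Decode only the newly generated tokens; slicing the decoded text by
        # len(prompt) breaks when decoding does not reproduce the prompt exactly
        prompt_length = inputs["input_ids"].shape[1]
        response = self.tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True).strip()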
        
        latency = time.time() - start_time
        return response, latency
    
    def unload(self):
        """Unload model to free memory"""
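        import gc
        if self.model is not None:
            # Rebind to None instead of `del`, so later `self.model is None` checks still work
            self.model = None
            self.tokenizer = None
            gc.collect()
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
            print(f"{self.display_name} unloaded")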

print("LLMModel class defined!")

## Step 10: Evaluation Metrics Class
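
The `completeness` metric in the class below is ROUGE-L recall: the length of the longest common subsequence (LCS) of reference and candidate tokens, divided by the number of reference tokens. Here is a minimal worked sketch, assuming plain lowercase whitespace tokenization. The `rouge_score` package tokenizes differently and applies stemming, so its numbers can differ; `lcs_length` and `rouge_l_recall` are hypothetical helper names.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(reference: str, candidate: str) -> float:
    """Fraction of reference tokens covered, in order, by the candidate."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, cand) / len(ref)


reference = "machine learning is a subset of artificial intelligence"
candidate = "machine learning is part of artificial intelligence"
recall = rouge_l_recall(reference, candidate)
print(recall)  # 6 of the 8 reference tokens are matched in order -> 0.75
```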

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer
from bert_score import score as bert_score

class EvaluationMetrics:
    """Comprehensive evaluation metrics"""
    
    def __init__(self):
        self.rouge_scorer = rouge_scorer.RougeScorer(
            ['rouge1', 'rouge2', 'rougeL'],
            use_stemmer=True
        )
    
    def cosine_similarity(self, text1: str, text2: str, embedding_model) -> float:
        """Compute cosine similarity"""
        embeddings = embedding_model.encode([text1, text2])
        similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
        return float(similarity)
    
    def bleu_score(self, reference: str, candidate: str) -> float:
        """Compute BLEU score"""
        import nltk
        ref_tokens = nltk.word_tokenize(reference.lower())
        cand_tokens = nltk.word_tokenize(candidate.lower())
        smoothing = SmoothingFunction().method1
        return float(sentence_bleu([ref_tokens], cand_tokens, smoothing_function=smoothing))
    
    def meteor_score(self, reference: str, candidate: str) -> float:
        """Compute METEOR score"""
        import nltk
        ref_tokens = nltk.word_tokenize(reference.lower())
        cand_tokens = nltk.word_tokenize(candidate.lower())
        return float(meteor_score([ref_tokens], cand_tokens))
    
    def rouge_scores(self, reference: str, candidate: str) -> Dict[str, float]:
        """Compute ROUGE scores"""
        scores = self.rouge_scorer.score(reference, candidate)
        return {
            'rouge1': float(scores['rouge1'].fmeasure),
            'rouge2': float(scores['rouge2'].fmeasure),
            'rougeL': float(scores['rougeL'].fmeasure),
        }
    
    def bertscore(self, reference: str, candidate: str) -> Dict[str, float]:
        """Compute BERTScore"""
        P, R, F1 = bert_score([candidate], [reference], lang='en', verbose=False)
        return {
            'precision': float(P[0]),
            'recall': float(R[0]),
            'f1': float(F1[0])
        }
    
    def completeness(self, reference: str, candidate: str) -> float:
        """Measure completeness (coverage of reference)"""
        scores = self.rouge_scorer.score(reference, candidate)
        return float(scores['rougeL'].recall)
    
    def hallucination_score(self, context: str, response: str, embedding_model) -> float:
        """Heuristic hallucination proxy: 1 - cosine similarity between response and context (0 = closely grounded, higher = less grounded)"""
        if "cannot answer" in response.lower() or "don't know" in response.lower():
            return 0.0
        similarity = self.cosine_similarity(response, context, embedding_model)
        return float(1.0 - similarity)
    
    def irrelevance_score(self, query: str, response: str, embedding_model) -> float:
        """Measure irrelevance (0=relevant, 1=irrelevant)"""
        similarity = self.cosine_similarity(query, response, embedding_model)
        return float(1.0 - similarity)
    
    def evaluate_all(self, query: str, response: str, reference: str, 
                     context: str, latency: float, embedding_model) -> Dict[str, float]:
        """Compute all metrics"""
        metrics = {'latency': latency}
        
        # Similarity
        metrics['cosine_similarity'] = self.cosine_similarity(reference, response, embedding_model)
        
        # NLP metrics
        metrics['bleu'] = self.bleu_score(reference, response)
        metrics['meteor'] = self.meteor_score(reference, response)
        
        # ROUGE
        rouge = self.rouge_scores(reference, response)
        metrics.update(rouge)
        
        # BERTScore
        bertscore = self.bertscore(reference, response)
        metrics['bertscore_f1'] = bertscore['f1']
        metrics['bertscore_precision'] = bertscore['precision']
        metrics['bertscore_recall'] = bertscore['recall']
        
        # Quality metrics
        metrics['completeness'] = self.completeness(reference, response)
        metrics['hallucination'] = self.hallucination_score(context, response, embedding_model)
        metrics['irrelevance'] = self.irrelevance_score(query, response, embedding_model)
        
        return metrics

print("EvaluationMetrics class defined!")

## Step 11: Multi-Trial Evaluator
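
The evaluator below summarizes each metric with `np.mean`, `np.std`, `np.min`, and `np.max`. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which understates the spread when there are only a few trials. The sketch below reproduces the aggregation with the standard library and adds the sample standard deviation for comparison; the `aggregate` function and the latency values are illustrative assumptions.

```python
import statistics
from typing import Dict, List


def aggregate(values: List[float]) -> Dict[str, float]:
    """Summarize one metric across trials."""
    return {
        'mean': statistics.fmean(values),
        'std': statistics.pstdev(values),  # population std, matches np.std default (ddof=0)
        'sample_std': statistics.stdev(values) if len(values) > 1 else 0.0,  # ddof=1
        'min': min(values),
        'max': max(values),
    }


latencies = [2.0, 4.0, 6.0]  # made-up latencies from three trials
summary = aggregate(latencies)
print(summary)
```

For these three values the population figure is about 1.63 while the sample figure is 2.0, which is a noticeable gap at `num_trials=3`.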

In [None]:
class TrialEvaluator:
    """Run multiple trials and aggregate results"""
    
    def __init__(self, num_trials: int = 3):
        self.num_trials = num_trials
        self.evaluator = EvaluationMetrics()
    
    def run_trials(self, model: LLMModel, rag_system: RAGSystem, 
                   question: str, reference: str) -> Dict:
        """Run multiple trials for a model"""
        print(f"\nRunning {self.num_trials} trials for {model.display_name}...")
        
        all_trials = []
        
        for trial in range(1, self.num_trials + 1):
            print(f"  Trial {trial}/{self.num_trials}...", end=" ")
            
            # Get context and generate
            prompt, context = rag_system.query(question)
            response, latency = model.generate(prompt)
            
            # Evaluate
            metrics = self.evaluator.evaluate_all(
                query=question,
                response=response,
                reference=reference,
                context=context,
                latency=latency,
                embedding_model=rag_system.vector_store.embedding_model
            )
            
            all_trials.append({
                'trial': trial,
                'response': response,
                'metrics': metrics
            })
            
            print(f"Latency: {latency:.2f}s")
        
        # Aggregate
        aggregated = self.aggregate_metrics(all_trials)
        
        return {
            'model': model.display_name,
            'trials': all_trials,
            'aggregated': aggregated
        }
    
    def aggregate_metrics(self, trials: List[Dict]) -> Dict:
        """Aggregate metrics across trials"""
        all_metrics = [t['metrics'] for t in trials]
        aggregated = {}
        
        for key in all_metrics[0].keys():
            values = [m[key] for m in all_metrics]
            aggregated[key] = {
                'mean': float(np.mean(values)),
                'std': float(np.std(values)),
                'min': float(np.min(values)),
                'max': float(np.max(values))
            }
        
        return aggregated

print("TrialEvaluator class defined!")

## Step 12: Visualization Functions

In [None]:
def plot_metrics_comparison(results: List[Dict], save_path: str = './results'):
    """Create comprehensive comparison charts"""
    
    # Prepare data
    metrics_to_plot = [
        'latency', 'cosine_similarity', 'bertscore_f1', 'completeness',
        'hallucination', 'irrelevance', 'meteor', 'bleu'
    ]
    
    metric_labels = {
        'latency': 'Latency (s)',
        'cosine_similarity': 'Cosine Similarity',
        'bertscore_f1': 'BERTScore F1',
        'completeness': 'Completeness',
        'hallucination': 'Hallucination',
        'irrelevance': 'Irrelevance',
        'meteor': 'METEOR',
        'bleu': 'BLEU'
    }
    
    # Create figure
    fig, axes = plt.subplots(2, 4, figsize=(20, 10))
    axes = axes.flatten()
    
    colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A', '#98D8C8']
    
    for idx, metric in enumerate(metrics_to_plot):
        ax = axes[idx]
        
        model_names = []
        values = []
        stds = []
        
        for result in results:
            model_names.append(result['model'])
            agg = result['aggregated'][metric]
            values.append(agg['mean'])
            stds.append(agg['std'])
        
        bars = ax.bar(model_names, values, yerr=stds, 
                     color=colors[:len(model_names)], alpha=0.8, capsize=5)
        
        # Add value labels
        for bar in bars:
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height,
                   f'{height:.3f}',
                   ha='center', va='bottom', fontweight='bold')
        
        ax.set_title(metric_labels[metric], fontsize=12, fontweight='bold')
        ax.set_ylabel('Score', fontsize=10)
        ax.tick_params(axis='x', rotation=45)
        ax.grid(True, alpha=0.3, axis='y')
    
    plt.suptitle('Multi-Model RAG System Performance Comparison', 
                fontsize=16, fontweight='bold')
    plt.tight_layout()
    
    save_file = os.path.join(save_path, 'comprehensive_comparison.png')
    plt.savefig(save_file, dpi=300, bbox_inches='tight')
    print(f"\nChart saved: {save_file}")
    plt.show()

def create_summary_table(results: List[Dict], save_path: str = './results'):
    """Create and save summary table"""
    
    data = []
    for result in results:
        agg = result['aggregated']
        row = {
            'Model': result['model'],
            'Latency (s)': f"{agg['latency']['mean']:.2f} ± {agg['latency']['std']:.2f}",
            'Cosine Sim': f"{agg['cosine_similarity']['mean']:.3f}",
            'BERTScore F1': f"{agg['bertscore_f1']['mean']:.3f}",
            'Completeness': f"{agg['completeness']['mean']:.3f}",
            'Hallucination': f"{agg['hallucination']['mean']:.3f}",
            'Irrelevance': f"{agg['irrelevance']['mean']:.3f}",
            'METEOR': f"{agg['meteor']['mean']:.3f}",
            'BLEU': f"{agg['bleu']['mean']:.3f}"
        }
        data.append(row)
    
    df = pd.DataFrame(data)
    
    print("\n" + "="*100)
    print("SUMMARY TABLE")
    print("="*100)
    print(df.to_string(index=False))
    print("="*100)
    
    # Save CSV
    csv_file = os.path.join(save_path, 'summary_table.csv')
    df.to_csv(csv_file, index=False)
    print(f"\nTable saved: {csv_file}")
    
    return df

print("Visualization functions defined!")

## Step 13: Define Models to Test

In [None]:
# Define models (you can modify these)
MODEL_CONFIGS = [
    {
        'name': 'mistralai/Mistral-7B-Instruct-v0.2',
        'display_name': 'Mistral-7B'
    },
    {
        'name': 'Qwen/Qwen2-7B-Instruct',
        'display_name': 'Qwen2-7B'
    },
    {
        'name': 'meta-llama/Llama-2-7b-chat-hf',
        'display_name': 'Llama-2-7B'
    }
]

# Test questions
TEST_CASES = [
    {
        'question': 'What is machine learning and what are its main types?',
        'reference': 'Machine learning is a subset of artificial intelligence that enables systems to learn from data. The three main types are supervised learning (trained on labeled data), unsupervised learning (finds patterns in unlabeled data), and reinforcement learning (learns through interaction with environment).'
    },
    {
        'question': 'What is Retrieval-Augmented Generation and what are its benefits?',
        'reference': 'Retrieval-Augmented Generation (RAG) is an architecture that combines information retrieval with text generation to produce more accurate and factual responses. Key benefits include reduced hallucination, improved factual accuracy, and ability to use up-to-date information from documents.'
    }
]

print("Model configurations and test cases defined!")
print(f"\nModels to test: {len(MODEL_CONFIGS)}")
print(f"Test questions: {len(TEST_CASES)}")

## Step 14: Run Complete Evaluation

**Note**: The full run is slow and resource-heavy: every model is loaded once per test case and generates `num_trials` responses each time. For a quick demo, you can:
- Test with 1 model first
- Reduce number of trials
- Use smaller/lighter models
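
To see how quickly the workload grows, the sketch below enumerates the generations a run would perform, in the same loop order as the function in the next cell (test cases, then models, then trials). `plan_runs` is a hypothetical helper and the model and question names are placeholders.

```python
from itertools import product
from typing import Dict, List


def plan_runs(model_names: List[str], questions: List[str], num_trials: int) -> List[Dict]:
    """List every (question, model, trial) generation a run would perform."""
    return [
        {'question': q, 'model': m, 'trial': t}
        for q, m, t in product(questions, model_names, range(1, num_trials + 1))
    ]


models = ['model-a', 'model-b', 'model-c']   # placeholder names
questions = ['q1', 'q2']                      # placeholder questions

full = plan_runs(models, questions, num_trials=3)
smoke = plan_runs(models[:1], questions[:1], num_trials=1)
model_loads = len(models) * len(questions)    # one load per (test case, model) pair

print(len(full), len(smoke), model_loads)
```

Three models, two questions, and three trials already mean 18 generations and 6 model loads, which is why starting with a single model, question, and trial is a sensible smoke test.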

In [None]:
def run_complete_evaluation(rag_system, model_configs, test_cases, num_trials=3):
    """Run complete evaluation pipeline"""
    
    print("\n" + "="*80)
    print("STARTING COMPLETE EVALUATION")
    print("="*80)
    print(f"Models: {len(model_configs)}")
    print(f"Test Cases: {len(test_cases)}")
    print(f"Trials per model: {num_trials}")
    print("="*80)
    
    trial_evaluator = TrialEvaluator(num_trials=num_trials)
    all_results = []
    
    # For each test case
    for case_idx, test_case in enumerate(test_cases, 1):
        print(f"\n{'='*80}")
        print(f"TEST CASE {case_idx}/{len(test_cases)}")
        print(f"{'='*80}")
        print(f"Question: {test_case['question']}")
        print(f"Reference: {test_case['reference']}")
        
        # For each model
        for model_config in model_configs:
            print(f"\n{'-'*80}")
            print(f"Testing: {model_config['display_name']}")
            print(f"{'-'*80}")
            
            # Initialize model
            model = LLMModel(
                model_name=model_config['name'],
                display_name=model_config['display_name']
            )
            
            # Load model
            if model.load(use_quantization=True):
                # Run trials
                result = trial_evaluator.run_trials(
                    model=model,
                    rag_system=rag_system,
                    question=test_case['question'],
                    reference=test_case['reference']
                )
                
                all_results.append(result)
                
                # Print summary
                agg = result['aggregated']
                print(f"\nSummary for {model_config['display_name']}:")
                print(f"  Latency: {agg['latency']['mean']:.2f}s (±{agg['latency']['std']:.2f})")
                print(f"  Cosine Sim: {agg['cosine_similarity']['mean']:.3f}")
                print(f"  BERTScore F1: {agg['bertscore_f1']['mean']:.3f}")
                print(f"  Hallucination: {agg['hallucination']['mean']:.3f}")
                
                # Unload model
                model.unload()
            else:
                print(f"Skipping {model_config['display_name']} due to load error")
    
    return all_results

# Run evaluation (you can modify parameters)
print("Starting evaluation...")
print("This will take significant time with multiple models!")
print("Tip: Start with 1 model for testing")

# Uncomment to run full evaluation:
# results = run_complete_evaluation(
#     rag_system=rag_system,
#     model_configs=MODEL_CONFIGS,
#     test_cases=TEST_CASES[:1],  # Use first test case only
#     num_trials=3
# )

print("\nEvaluation function ready!")
print("\nUncomment the code above to run evaluation")

## Step 15: Generate Visualizations

After running evaluation, generate charts and tables.
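
`run_complete_evaluation` appends one result per (test case, model) pair, so with more than one test case the same model name appears several times in `results`, and bars with the same name are drawn at the same x position in the chart. A minimal sketch, assuming you want one value per model: average each model's per-case means before plotting. `mean_by_model` is a hypothetical helper and the scores are made up.

```python
from collections import defaultdict
from statistics import fmean
from typing import Dict, List


def mean_by_model(results: List[Dict], metric: str) -> Dict[str, float]:
    """Average a metric's per-test-case means for each model."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result['model']].append(result['aggregated'][metric]['mean'])
    return {model: fmean(values) for model, values in grouped.items()}


# Made-up results in the same shape that run_trials returns (only the keys used here)
results_example = [
    {'model': 'model-a', 'aggregated': {'bleu': {'mean': 0.20}}},
    {'model': 'model-b', 'aggregated': {'bleu': {'mean': 0.10}}},
    {'model': 'model-a', 'aggregated': {'bleu': {'mean': 0.40}}},
]
per_model = mean_by_model(results_example, 'bleu')
print(per_model)
```

This averages means only; it discards the per-case standard deviations, so treat it as a plotting convenience rather than a statistical summary.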

In [None]:
# Uncomment after running evaluation:
# if 'results' in locals() and results:
#     print("\n" + "="*80)
#     print("GENERATING VISUALIZATIONS")
#     print("="*80)
#     
#     # Create charts
#     plot_metrics_comparison(results)
#     
#     # Create summary table
#     summary_df = create_summary_table(results)
#     
#     print("\n" + "="*80)
#     print("EVALUATION COMPLETE!")
#     print("="*80)
#     print(f"Results saved in: ./results/")
# else:
#     print("Run evaluation first to generate visualizations")

print("Visualization code ready!")
print("\nUncomment after running evaluation to generate charts")

## Quick Demo (Single Model)

Test with a single lightweight model for quick results.

In [None]:
def quick_demo(rag_system, test_question, reference_answer):
    """Quick demo with lightweight model"""
    
    print("\n" + "="*80)
    print("QUICK DEMO")
    print("="*80)
    
    # Use a lightweight model for demo
    demo_model = LLMModel('gpt2', 'GPT2-Demo')
    
    if demo_model.load(use_quantization=False):
        # Get context
        prompt, context = rag_system.query(test_question)
        
        print(f"\nQuestion: {test_question}")
        print(f"\nRetrieved Context ({len(context)} chars)")
        
        # Generate
        print("\nGenerating response...")
        response, latency = demo_model.generate(prompt, max_tokens=100)
        
        print(f"\nResponse: {response}")
        print(f"\nLatency: {latency:.2f}s")
        
        # Evaluate
        evaluator = EvaluationMetrics()
        metrics = evaluator.evaluate_all(
            query=test_question,
            response=response,
            reference=reference_answer,
            context=context,
            latency=latency,
            embedding_model=rag_system.vector_store.embedding_model
        )
        
        print("\nMetrics:")
        for metric, value in list(metrics.items())[:8]:
            print(f"  {metric:20s}: {value:.4f}")
        
        demo_model.unload()
        
        print("\n" + "="*80)
        print("Demo complete!")
        print("="*80)

# Run quick demo
quick_demo(
    rag_system,
    "What is machine learning?",
    "Machine learning is a subset of AI that enables systems to learn from data."
)

## Summary

### What We Built:

1. **PDF Processing** - Load and chunk documents
2. **Vector Database** - FAISS-based similarity search
3. **RAG System** - Context-aware question answering
4. **LLM Integration** - Support for Mistral, Qwen2, Llama
5. **Comprehensive Metrics**:
   - Latency
   - Cosine Similarity
   - BERTScore F1
   - Completeness
   - Hallucination Detection
   - Irrelevance Detection
   - METEOR, BLEU, ROUGE
6. **Multi-Trial Testing** - Statistical aggregation
7. **Visualizations** - Bar charts and tables

### How to Use:

1. **Run all cells above**
2. **Uncomment evaluation code** in Step 14
3. **Choose models** to test (start with 1 for demo)
4. **Run evaluation** (may take 10-30 min per model)
5. **Generate visualizations** in Step 15
6. **Download results** from `./results/` folder

### Tips:

- **Hardware**: Use a GPU for faster inference; 4-bit quantization is applied only when CUDA is available
- **Testing**: Start with 1 model and 1 trial
- **PDFs**: Replace sample PDFs with your own
- **Models**: Adjust MODEL_CONFIGS for different models
- **Trials**: Increase num_trials for better statistics

### Resources:

- **GitHub**: https://github.com/isratjahan829/LLM_Task
- **Documentation**: See README.md

---

**Your RAG evaluation system is ready to use!**