[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/oviya-raja/ist-402/blob/main/learning-path/W03/02-assignments/W3_RAG_Assignment.ipynb)

---


# RAG Assignment: 
## Building an Intelligent Q&A System with FAISS and Mistral


### 🏢 Business Context

**Company:** GreenTech Marketplace

**Company Description:**
GreenTech Marketplace is an e-commerce platform specializing in sustainable technology products. The company focuses on eco-friendly technology solutions, renewable energy products, and smart home devices that help customers reduce their environmental footprint.

**Company Details:**
- **Type:** E-commerce platform
- **Specialization:** Sustainable technology products
- **Mission:** Providing eco-friendly technology solutions for modern living

**Shipping & Policies:**
- **Standard Shipping:** 5-7 business days (free over $75)
- **Express Shipping:** 2-3 business days (additional fee)
- **Return Policy:** 30-day return policy for unopened items in original packaging
- **Refund Processing:** 5-7 business days after receipt
- **Warranty:** 1-3 years depending on product category

**Product Categories:**
- **Solar Panels** - Renewable energy solutions for homes and businesses
- **Energy-Efficient Appliances** - Eco-friendly home appliances
- **Smart Home Devices** - Home automation and IoT solutions
- **Eco-Friendly Accessories** - Sustainable lifestyle products

**What I'm Building:**
A production-ready RAG system that combines semantic search (FAISS) with a large language model (Mistral) to give accurate, context-aware answers to GreenTech Marketplace customer-service questions.

**IST402 - AI Agents & RAG Systems**  
**Student**: Oviya Raja

---

## 📋 Assignment Overview

### 🎯 Goal
Design and implement a complete **Retrieval-Augmented Generation (RAG) system** using Mistral-7B-Instruct, FAISS vector database, and custom business data. Evaluate multiple AI models and analyze system performance.

**Submission:** Submit the link to your completed notebook.

---

<details>
<summary><b>🎯 Objectives</b> (Click to expand)</summary>

### 1. Create Assistant System Prompt
- Design a system prompt with a specific role 
- Choose a business context to use throughout
- Use `mistralai/Mistral-7B-Instruct-v0.3`

### 2. Generate Business Database Content
- Use Mistral to generate **10-15 Q&A pairs** for your business
- Cover different aspects of the business
- **Add clear comments** showing the database pairs

### 3. Implement FAISS Vector Database
- Convert Q&A database to embeddings using sentence transformers
- Store embeddings in FAISS index for efficient similarity search
- **Comment the implementation process**

### 4. Create Test Questions
- Generate **5+ answerable** questions (can be answered from database)
- Generate **5+ unanswerable** questions (require information not in database)
- Use Mistral to generate both types

### 5. Implement and Test RAG System
- Build complete RAG pipeline: `Query → Embed → Search → Retrieve → Augment → LLM → Answer`
- Run both question types through the pipeline
- **Comment differences** between answerable vs. unanswerable results

### 6. Model Experimentation and Ranking
- Test **4 required models + 2 of your choice** (6 total):
  - `consciousAI/question-answering-generative-t5-v1-base-s-q-c`
  - `deepset/roberta-base-squad2`
  - `google-bert/bert-large-cased-whole-word-masking-finetuned-squad`
  - `gasolsun/DynamicRAG-8B`
  - + 2 additional models of your choice
- **Rank models** by 5 metrics: **Accuracy**, **Confidence Handling**, **Quality**, **Speed**, **Robustness**
- Test with both answerable and unanswerable questions
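
One way to turn five per-metric scores into a single ranking is to rank the models on each metric separately and then average those ranks. The sketch below is only an illustration of that idea: the model names, the metric keys, and every score are made-up placeholders, not measured results, and `rank_models` is a hypothetical helper rather than part of the assignment code.

```python
from statistics import mean

# Placeholder scores in [0, 1]; higher is better. These numbers are invented
# purely to illustrate the ranking mechanics, not measured results.
scores = {
    "model_a": {"accuracy": 0.80, "confidence": 0.60, "quality": 0.70, "speed": 0.90, "robustness": 0.50},
    "model_b": {"accuracy": 0.70, "confidence": 0.90, "quality": 0.80, "speed": 0.40, "robustness": 0.80},
    "model_c": {"accuracy": 0.60, "confidence": 0.50, "quality": 0.60, "speed": 0.70, "robustness": 0.60},
}

def rank_models(scores):
    """Rank models by their mean per-metric rank (1 = best)."""
    metrics = next(iter(scores.values())).keys()
    per_model_ranks = {name: [] for name in scores}
    for metric in metrics:
        # Order models from best to worst on this metric
        ordered = sorted(scores, key=lambda name: scores[name][metric], reverse=True)
        for position, name in enumerate(ordered, start=1):
            per_model_ranks[name].append(position)
    mean_rank = {name: mean(ranks) for name, ranks in per_model_ranks.items()}
    # Lowest mean rank first
    return sorted(mean_rank.items(), key=lambda item: item[1])

for name, avg in rank_models(scores):
    print(f"{name}: mean rank {avg:.1f}")
```

Averaging ranks keeps a metric with a wide numeric range (such as latency) from dominating the others. This sketch does not handle tied scores specially; ties are broken by dictionary order.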

</details>

<details>
<summary><b>🔧 Technologies & Tools</b> (Click to expand)</summary>

| Technology | Purpose | Version/Model |
|------------|---------|---------------|
| **Mistral-7B-Instruct** | LLM for generating Q&A pairs and system prompts | `mistralai/Mistral-7B-Instruct-v0.3` |
| **FAISS** | Vector database for efficient similarity search | `faiss-cpu` |
| **Embeddings Generation** | Text embeddings for semantic search | `all-MiniLM-L6-v2` |
| **Inference Engine** | Host the LLM | Hugging Face Transformers |

**Installation:**
```python
!pip install transformers torch sentence-transformers faiss-cpu langchain
```

</details>

<details>
<summary><b>🏗️ System Architecture</b> (Click to expand)</summary>

**RAG System Architecture Overview:**

![RAG System Architecture](diagrams/rag-pipeline-architecture.png)

*Figure: Two-phase RAG system showing Pre-Processing (Objectives 1-3) and RAG Pipeline (Objectives 4-5)*

**Phase 1: Pre-Processing (Objectives 1-3)**
- **Objective 1**: System Prompt - Define agentic role & business context
- **Objective 2**: Generate Q&A - 10-15 pairs using Mistral-7B-Instruct
- **Objective 3**: FAISS Index - Create embeddings & vector store
- **Output**: Vector Database ready for retrieval

**Phase 2: RAG Pipeline (Objectives 4-5)**
- **Objective 4**: Test Questions - 5+ answerable and 5+ unanswerable user queries
- **Objective 5**: RAG Pipeline Steps
  - Query → Embed → Search FAISS → Retrieve Context → Mistral LLM → Generated Answer

**Dependency Chain:**
```
0 (Setup) → 1 (System Prompt) → 2 (Q&A DB) → 3 (FAISS) → 4 (Test Questions) → 5 (RAG Pipeline) → 6 (Evaluation)
```
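
The retrieval and augmentation steps can be sketched without any model downloads. The example below is a toy stand-in, not the notebook's implementation: bag-of-words counts replace the sentence-transformer embeddings, a plain Python list with brute-force cosine similarity replaces the FAISS index, and the `0.3` similarity threshold for flagging unanswerable questions is an assumed value. The function names (`embed`, `retrieve`, `build_prompt`) are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny stand-in knowledge base; the answers reuse policies from the business context.
qa_pairs = [
    ("How long does standard shipping take?", "Standard shipping takes 5-7 business days."),
    ("What is the return policy?", "Unopened items can be returned within 30 days."),
    ("How long is the warranty?", "Warranties run 1-3 years depending on product category."),
]

def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a sentence-transformer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Stand-in for the FAISS index: one (vector, question, answer) entry per Q&A pair
index = [(embed(q), q, a) for q, a in qa_pairs]

def retrieve(query, k=1):
    """Return the k most similar (score, question, answer) entries."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, vec), q, a) for vec, q, a in index]
    return sorted(scored, key=lambda item: item[0], reverse=True)[:k]

def build_prompt(query, threshold=0.3):
    """Augment the query with retrieved context, or return None if nothing is relevant."""
    hits = [hit for hit in retrieve(query) if hit[0] >= threshold]
    if not hits:
        return None  # nothing relevant retrieved: treat the question as unanswerable
    context = "\n".join(f"Q: {q}\nA: {a}" for _, q, a in hits)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("What is your return policy?"))   # prompt containing the return-policy pair
print(build_prompt("Do you sell electric cars?"))    # None: no overlap with the knowledge base
```

In the full pipeline the returned prompt would be passed to the LLM; a `None` result is one simple way to route unanswerable questions to a fallback response instead of letting the model guess.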

</details>

<details>
<summary><b>✅ Deliverables Checklist</b> (Click to expand)</summary>

- ✅ **Business context** & role definition
- ✅ **Generated Q&A database** (10-15 pairs, clearly commented)
- ✅ **Working FAISS vector database** with embeddings
- ✅ **Test questions** (answerable vs. unanswerable, 5+ each)
- ✅ **RAG pipeline** implementation
- ✅ **Model rankings** with performance analysis (6 models)
- ✅ **Reflection** on strengths, weaknesses, and real-world applications

</details>

<details>
<summary><b>⏱️ Time Estimates</b> (Click to expand)</summary>

</details>

<br>

<details>
<summary><b>📊 Evaluation Criteria</b> (Click to expand)</summary>

- **Creativity** in business context and agentic role design
- **Technical Implementation** of RAG pipeline with FAISS
- **Quality Analysis** of different Q&A models
- **Clear Documentation** with meaningful comments
- **Critical Thinking** in model comparison and limitations analysis

</details>

---

**Ready to begin?** Start with **Objective 0: Setup & Prerequisites** to configure your environment.



## Objective 0: Setup & Prerequisites

### 🎯 Goal
I will set up my environment with all required packages, configure authentication, and verify system capabilities before starting the RAG system implementation.

<details>
<summary><b>Prerequisites Checklist</b> (Click to expand)</summary>

**Knowledge Prerequisites:**

| Requirement | Level | Description |
|-------------|-------|-------------|
| **Python Programming** | Basic | Variables, functions, data structures, imports |
| **Jupyter Notebooks** | Basic | Running cells, markdown formatting, code execution |
| **Machine Learning Concepts** | High-level | Neural networks, transformers, embeddings, model inference |
| **Evaluation Metrics** | Basic | Understanding of accuracy, confidence scores |

**Technical Prerequisites:**

| Item | Required | How to Get |
|------|----------|------------|
| **Hugging Face Account** | ✅ Yes | Sign up at [huggingface.co](https://huggingface.co) |
| **Hugging Face Token** | ✅ Yes | Get from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens) |
| **Google Colab Account** | ⚠️ Recommended | Free account at [colab.research.google.com](https://colab.research.google.com) |
| **GPU Access** | ⚠️ Optional | Colab provides free GPU, or local GPU setup |
| **Python 3.8+** | ✅ Yes | Pre-installed in Colab, or install locally |

</details>

<br>

<details>
<summary><b>📦 Required Packages</b> (Click to expand)</summary>

**Core Packages:**

| Package | Purpose | Version |
|---------|---------|---------|
| `transformers` | Load and use Hugging Face models (Mistral, QA models) | Latest |
| `torch` | Deep learning framework for model inference | Latest |
| `sentence-transformers` | Generate embeddings for semantic search | Latest |
| `faiss-cpu` | Vector similarity search library | Latest |
| `huggingface_hub` | Hugging Face authentication and model access | Latest |
| `numpy` | Numerical operations | Latest |
| `pandas` | DataFrames for Q&A database | Latest |

**Evaluation & Utilities:**

| Package | Purpose | Version |
|---------|---------|---------|
| `bert-score` | BERTScore for accuracy evaluation | Latest |
| `python-dotenv` | Load environment variables from .env files | Latest |

**Installation Command:**
```python
!pip install transformers torch sentence-transformers faiss-cpu huggingface_hub numpy pandas bert-score python-dotenv
```

**Note:** All packages will be installed automatically by the setup code cell.

</details>

<br>

<details>
<summary><b>🖥️ Environment Setup</b> (Click to expand)</summary>

**Option 1: Google Colab (Recommended)**

**Advantages:**
- ✅ Free GPU access (T4 GPU)
- ✅ No local installation needed
- ✅ Pre-configured environment
- ✅ Easy sharing and collaboration

**Setup Steps:**
1. Go to [colab.research.google.com](https://colab.research.google.com)
2. Create a new notebook
3. Enable GPU: **Runtime → Change runtime type → GPU → Save**
4. Upload or create the notebook file
5. Run the setup cell

**Option 2: Local Jupyter Notebook**

**Requirements:**
- Python 3.8 or higher
- Jupyter Notebook installed
- GPU optional (CPU works but slower)

**Setup Steps:**
1. Install Python 3.8+
2. Install Jupyter: `pip install jupyter`
3. Install required packages (see above)
4. Launch: `jupyter notebook`
5. Open the notebook file

**GPU Setup (Optional but Recommended):**

| Platform | GPU Type | Setup |
|----------|----------|-------|
| **Colab** | T4 (Free) | Runtime → Change runtime type → GPU |
| **Local** | NVIDIA GPU | Install CUDA toolkit, PyTorch with CUDA |
| **CPU Only** | N/A | Works but 2-3x slower |

</details>

<br>

<details>
<summary><b>🔑 Hugging Face Authentication</b> (Click to expand)</summary>

**Step 1: Get Your Token**

1. Go to [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
2. Click **"New token"**
3. Name it (e.g., "RAG Assignment")
4. Select **"Read"** access (sufficient for this assignment)
5. Click **"Generate token"**
6. **Copy the token immediately** (you won't see it again!)

**Step 2: Store Your Token**

**In Google Colab (Secrets):**
1. Click the 🔑 icon in the left sidebar
2. Add secret: `HUGGINGFACE_HUB_TOKEN` = `your_token_here`
3. Enable notebook access for the secret

**Then read it in code:**
```python
from google.colab import userdata

# userdata only reads secrets; they are created through the Secrets panel
hf_token = userdata.get('HUGGINGFACE_HUB_TOKEN')
```

**Locally (Environment Variable):**
```bash
export HUGGINGFACE_HUB_TOKEN=your_token_here
```

**Or create `.env` file:**
```
HUGGINGFACE_HUB_TOKEN=your_token_here
```

**Step 3: Verify Authentication**

The setup code will automatically:
- Try Colab userdata first
- Try environment variables
- Prompt for manual input if needed
- Authenticate with Hugging Face Hub
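
The lookup order above amounts to trying a list of sources until one returns a non-empty token. The sketch below illustrates that pattern only; `resolve_token` and `colab_secret` are hypothetical names, and the manual-prompt fallback is left out so the example runs non-interactively.

```python
import os

def resolve_token(sources):
    """Return (source_name, token) from the first source that yields a non-empty value."""
    for name, lookup in sources:
        try:
            token = lookup()
        except Exception:
            token = None  # a failing source just falls through to the next one
        if token:
            return name, token
    return None, None

def colab_secret():
    """Read the token from Colab Secrets; raises ImportError outside Colab."""
    from google.colab import userdata  # only importable inside Colab
    return userdata.get("HUGGINGFACE_HUB_TOKEN")

# Ordered from most to least specific source
sources = [
    ("colab", colab_secret),
    ("environment", lambda: os.getenv("HUGGINGFACE_HUB_TOKEN")),
]

name, token = resolve_token(sources)
print(f"Token source: {name or 'none found'}")
```

Keeping the sources in a list makes the precedence explicit and lets a new source be added without touching the lookup loop.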

</details>

<br>

<details>
<summary><b>✅ Setup Verification</b> (Click to expand)</summary>

After running the setup cell, verify these are set:

**Notebook Variables:**
- ✅ `IN_COLAB` - True if running in Colab
- ✅ `HAS_GPU` - True if GPU is available
- ✅ `hf_token` - Your Hugging Face token

**Verification Code:**
```python
# Check setup
print("🔍 Setup Verification:")
print(f"   IN_COLAB: {IN_COLAB}")
print(f"   HAS_GPU: {HAS_GPU}")
print(f"   hf_token: {'✅ Set' if hf_token else '❌ Not set'}")

if HAS_GPU:
    import torch
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
```

**Expected Output:**
```
🔍 Setup Verification:
   IN_COLAB: True
   HAS_GPU: True
   hf_token: ✅ Set
   GPU: Tesla T4
```

</details>

<br>

<details>
<summary><b>⚙️ Setup Functions</b> (Click to expand)</summary>

The setup code provides modular functions following SOLID principles:

**Available Functions:**

| Function | Purpose | Returns |
|----------|---------|---------|
| `check_environment()` | Detect Colab and GPU availability | `(is_colab, has_gpu)` |
| `get_hf_token()` | Retrieve Hugging Face token from various sources | `str` (token) |
| `install_packages()` | Install required packages if missing | None |
| `import_libraries()` | Import all required libraries with error handling | `bool` (success) |
| `authenticate_hf(token)` | Authenticate with Hugging Face Hub | `bool` (success) |

**Design Principles:**
- **KISS** (Keep It Simple, Stupid) - Each function has a single responsibility
- **DRY** (Don't Repeat Yourself) - Reusable functions
- **SOLID** - Modular, extensible design

</details>

<br>

<details>
<summary><b>🚀 Quick Start Guide</b> (Click to expand)</summary>

**Step-by-Step Setup:**

1. **Open Environment**
   - Colab: Create new notebook
   - Local: Launch Jupyter Notebook

2. **Enable GPU (Recommended)**
   - Colab: Runtime → Change runtime type → GPU
   - Local: Ensure CUDA is installed

3. **Run Setup Cell**
   - Execute the "Prerequisites & Setup" code cell
   - Wait for packages to install (~2-3 minutes first time)

4. **Authenticate**
   - Enter Hugging Face token when prompted
   - Or set it in Colab secrets/environment variable

5. **Verify**
   - Check console output for ✅ marks
   - Verify `hf_token` is set
   - Confirm GPU is detected (if available)

**Expected Setup Time:**
- First run: 2-3 minutes (package installation)
- Subsequent runs: <30 seconds (packages cached)

</details>

<br>

<details>
<summary><b>⚠️ Troubleshooting</b> (Click to expand)</summary>

**Common Issues:**

| Issue | Solution |
|-------|----------|
| **Token not found** | Check Colab secrets or environment variables. Re-enter token manually. |
| **GPU not detected** | In Colab: Runtime → Change runtime type → GPU. Local: Install CUDA toolkit. |
| **Package installation fails** | Restart runtime/kernel and try again. Check internet connection. |
| **Import errors** | Run `pip install --upgrade <package>` for the failing package. |
| **Out of memory** | Use CPU instead of GPU, or restart runtime to clear memory. |

**Getting Help:**
- Check error messages carefully - they usually indicate the issue
- Verify all prerequisites are met
- Ensure internet connection is stable
- Restart runtime/kernel if issues persist

</details>

<br>

<details>
<summary><b>📊 System Requirements</b> (Click to expand)</summary>

**Minimum Requirements (CPU):**
- Python 3.8+
- 8GB RAM
- 20GB free disk space (for model downloads)
- Stable internet connection

**Recommended (GPU):**
- Python 3.8+
- 16GB+ RAM
- NVIDIA GPU with 8GB+ VRAM
- 30GB+ free disk space
- Fast internet connection

**Colab Specifications:**
- **Free Tier:** T4 GPU (16GB VRAM), 12GB RAM
- **Pro Tier:** V100/A100 GPU, more RAM
- **Storage:** 15GB free space

**Model Download Sizes:**
- Mistral-7B: ~14GB (first download)
- QA Models: ~500MB - 3GB each
- Embedding Model: ~90MB
- **Total:** ~20-25GB for all models
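
The ~14GB figure for Mistral-7B is consistent with storing the weights at 16-bit precision. The back-of-the-envelope calculation below shows the arithmetic; the parameter count of roughly 7.2 billion is an approximate, assumed figure, and the result covers the weights only (not activations, the KV cache, or tokenizer files).

```python
# Assumption: Mistral-7B has roughly 7.2 billion parameters (approximate figure).
PARAMS = 7.2e9

# Bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1, "4-bit": 0.5}

def weight_size_gb(num_params, dtype):
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: ~{weight_size_gb(PARAMS, dtype):.1f} GB")
```

Under these assumptions the 16-bit weights come to about 14.4 GB, which leaves little headroom on a 16GB GPU; lower-precision loading is one common way to reduce that footprint.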

</details>

---

**Next Step:** After setup is complete, proceed to **Objective 1: Design System Prompts** to load the Mistral model.



In [24]:
# ============================================================================
# Prerequisites & Setup - Centralized Environment Configuration
# ============================================================================
# This cell contains a centralized EnvironmentConfig class that handles ALL
# Colab vs local environment differences. No if statements needed elsewhere!
# Run this cell FIRST before any other cells

import sys
import os
from typing import Optional, Tuple

# ============================================================================
# EnvironmentConfig Class - Single Source of Truth
# ============================================================================
class EnvironmentConfig:
    """
    Centralized environment configuration that handles ALL Colab vs local differences.
    All environment-specific logic is encapsulated here - no if statements needed elsewhere!
    
    Usage:
        env = EnvironmentConfig()  # Auto-detects environment
        device = env.device  # Returns "cuda" or "cpu" automatically
        token = env.get_token()  # Works in both Colab and local
    """
    
    def __init__(self):
        """Initialize and detect environment automatically."""
        self._is_colab = self._detect_colab()
        self._has_gpu = self._detect_gpu()
        self._python_version = sys.version.split()[0]
        self._hf_token = None
        self._libraries_imported = False
        
        # Print environment info
        self._print_environment_info()
    
    def _detect_colab(self) -> bool:
        """Detect if running in Google Colab."""
        try:
            import google.colab
            return True
        except ImportError:
            return False
    
    def _detect_gpu(self) -> bool:
        """Detect GPU availability."""
        try:
            import torch
            return torch.cuda.is_available()
        except ImportError:
            return False
    
    def _print_environment_info(self):
        """Print environment detection results."""
        print("🔍 Checking environment...")
        print(f"   Python version: {self._python_version}")
        
        if self._is_colab:
            print("   ✅ Running in Google Colab")
        else:
            print("   ✅ Running in local environment")
        
        if self._has_gpu:
            try:
                import torch
                print(f"   ✅ GPU Available: {torch.cuda.get_device_name(0)}")
                print(f"   ✅ CUDA Version: {torch.version.cuda}")
            except Exception:
                # GPU details are informational only; ignore lookup failures
                pass
        else:
            print("   ⚠️  GPU NOT detected (using CPU)")
            if self._is_colab:
                print("   💡 TIP: Runtime → Change runtime type → Select GPU → Save")
    
    # ========================================================================
    # Properties - No if statements needed when using these!
    # ========================================================================
    
    @property
    def is_colab(self) -> bool:
        """Check if running in Google Colab."""
        return self._is_colab
    
    @property
    def is_local(self) -> bool:
        """Check if running locally."""
        return not self._is_colab
    
    @property
    def has_gpu(self) -> bool:
        """Check if GPU is available."""
        return self._has_gpu
    
    @property
    def device(self) -> str:
        """Get device string ('cuda' or 'cpu') - no if statement needed!"""
        return "cuda" if self._has_gpu else "cpu"
    
    @property
    def device_id(self) -> int:
        """Get device ID (0 for GPU, -1 for CPU) - no if statement needed!"""
        return 0 if self._has_gpu else -1
    
    @property
    def hf_token(self) -> Optional[str]:
        """Get Hugging Face token."""
        return self._hf_token
    
    @property
    def python_version(self) -> str:
        """Get Python version."""
        return self._python_version
    
    # ========================================================================
    # Token Management - Works in both Colab and local
    # ========================================================================
    
    def _format_token_preview(self, token: str) -> str:
        """
        Format token for safe display (DRY - eliminates duplication).
        
        Args:
            token: Hugging Face token string
            
        Returns:
            Formatted preview string (e.g., "hf_abc123...xyz9")
        """
        if len(token) <= 14:
            return "****"
        return f"{token[:10]}...{token[-4:]}"
    
    def get_token(self) -> Optional[str]:
        """
        Get Hugging Face token from appropriate source (Colab or local).
        Works automatically in both environments - no if statements needed!
        """
        # Try Colab Secrets first (only available inside Colab)
        if self._is_colab:
            try:
                from google.colab import userdata
                token = userdata.get('HUGGINGFACE_HUB_TOKEN')
                if token:
                    print("✅ Hugging Face token loaded from Colab userdata!")
                    print(f"   Token preview: {self._format_token_preview(token)}")
                    self._hf_token = token
                    return token
            except Exception:
                # A missing secret or denied notebook access raises Colab-specific
                # errors, so fall through to the environment-variable lookup
                pass
        
        # Try environment variable (works in both Colab and local)
        try:
            from dotenv import load_dotenv
            load_dotenv()
        except ImportError:
            pass
        
        token = os.getenv("HUGGINGFACE_HUB_TOKEN")
        if token:
            print("✅ Hugging Face token loaded from environment!")
            print(f"   Token preview: {self._format_token_preview(token)}")
            self._hf_token = token
            return token
        
        # No token found
        print("❌ Hugging Face token not found!")
        print("   Get your token from: https://huggingface.co/settings/tokens")
        
        if self._is_colab:
            print("\n   In Colab: add a secret named HUGGINGFACE_HUB_TOKEN (🔑 icon in the left sidebar)")
        else:
            print("\n   Locally: export HUGGINGFACE_HUB_TOKEN=your_token")
            print("   Or create .env file: HUGGINGFACE_HUB_TOKEN=your_token")
        
        print("\n⚠️  Some models may require authentication!")
        return None
    
    def authenticate_hf(self, token: Optional[str] = None) -> bool:
        """
        Authenticate with Hugging Face.
        Uses stored token if none provided.
        """
        token = token or self._hf_token
        
        if not token:
            print("⚠️  No token provided, skipping authentication")
            return False
        
        try:
            from huggingface_hub import login
            login(token=token)
            print("✅ Authenticated with Hugging Face")
            return True
        except Exception as e:
            print(f"⚠️  Authentication failed: {e}")
            return False
    
    # ========================================================================
    # Package Management - Works in both environments
    # ========================================================================
    
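    def install_packages(self):
        """Install required packages if not already installed."""
        import importlib.util
        import subprocess
        
        # Map pip package name -> importable module name (they differ for some packages,
        # e.g. faiss-cpu is imported as faiss and python-dotenv as dotenv)
        packages = {
            "transformers": "transformers",
            "torch": "torch",
            "sentence-transformers": "sentence_transformers",
            "python-dotenv": "dotenv",
            "faiss-cpu": "faiss",
            "huggingface_hub": "huggingface_hub",
            "numpy": "numpy",
            "pandas": "pandas",
            "bert-score": "bert_score",
            "accelerate": "accelerate",
        }
        
        print("📦 Installing required packages...")
        for package, module_name in packages.items():
            # find_spec checks availability without importing the package
            if importlib.util.find_spec(module_name) is not None:
                print(f"   ✅ {package} already installed")
                continue
            print(f"   ⏳ Installing {package}...")
            # Use the current interpreter's pip so the package lands in this kernel's environment
            result = subprocess.run([sys.executable, "-m", "pip", "install", "-q", package])
            if result.returncode == 0:
                print(f"   ✅ {package} installed")
            else:
                print(f"   ❌ {package} failed to install")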
    
    # ========================================================================
    # Library Imports - Centralized and reusable
    # ========================================================================
    
    def import_libraries(self) -> bool:
        """
        Import all required libraries with error handling.
        Returns True if successful, False otherwise.
        """
        try:
            from transformers import (
                pipeline, 
                AutoModelForCausalLM, 
                AutoTokenizer, 
                logging as transformers_logging
            )
            from sentence_transformers import SentenceTransformer
            import torch
            import numpy as np
            import faiss
            
            transformers_logging.set_verbosity_error()
            
            # Store imported modules for reuse
            self.pipeline = pipeline
            self.AutoModelForCausalLM = AutoModelForCausalLM
            self.AutoTokenizer = AutoTokenizer
            self.SentenceTransformer = SentenceTransformer
            self.torch = torch
            self.np = np
            self.faiss = faiss
            
            self._libraries_imported = True
            print("✅ All required libraries imported successfully!")
            return True
        except ImportError as e:
            print(f"❌ Import error: {e}")
            print("   Run: pip install transformers torch sentence-transformers faiss-cpu")
            return False
        except RuntimeError as e:
            if "register_fake" in str(e) or "torch.library" in str(e):
                print("❌ Dependency version mismatch!")
                print("   Fix: pip install --upgrade torch torchvision")
                print("   Then restart kernel and run this cell again.")
            else:
                # Report unrecognized runtime errors instead of failing silently
                print(f"❌ Runtime error during import: {e}")
            return False
    
    @property
    def libraries_ready(self) -> bool:
        """Check if libraries are imported and ready."""
        return self._libraries_imported
    
    # ========================================================================
    # Utility Methods - Environment-agnostic helpers
    # ========================================================================
    
    def get_device_info(self) -> dict:
        """Get device information dictionary."""
        info = {
            "device": self.device,
            "device_id": self.device_id,
            "has_gpu": self._has_gpu,
            "is_colab": self._is_colab
        }
        
        if self._has_gpu and self._libraries_imported:
            try:
                info["gpu_name"] = self.torch.cuda.get_device_name(0)
                info["cuda_version"] = self.torch.version.cuda
            except Exception:
                # GPU details are optional; leave them out if the lookup fails
                pass
        
        return info
    
    def print_summary(self):
        """Print configuration summary."""
        print("=" * 80)
        print("✅ PREREQUISITES & SETUP COMPLETED!")
        print("=" * 80)
        print()
        print("📌 Environment Configuration:")
        print(f"   - Environment: {'Google Colab' if self._is_colab else 'Local'}")
        print(f"   - Python: {self._python_version}")
        print(f"   - Device: {self.device.upper()} ({'GPU' if self._has_gpu else 'CPU'})")
        print(f"   - HF Token: {'✅ Set' if self._hf_token else '❌ Not set'}")
        print(f"   - Libraries: {'✅ Ready' if self._libraries_imported else '❌ Not imported'}")
        print()
        print("💡 Usage in other cells:")
        print("   - env.device  # Returns 'cuda' or 'cpu'")
        print("   - env.is_colab  # Returns True/False")
        print("   - env.has_gpu  # Returns True/False")
        print("   - env.hf_token  # Returns token or None")
        print("=" * 80)


# ============================================================================
# Objective Names - Standardized naming guide
# ============================================================================
class ObjectiveNames:
    """
    Standardized objective naming guide.
    Use these constants to ensure consistent naming across the notebook.
    
    Usage:
        with env.timer.objective(ObjectiveNames.OBJECTIVE_1):
            # Your code
            pass
    """
    OBJECTIVE_0 = "Objective 0"
    OBJECTIVE_1 = "Objective 1"
    OBJECTIVE_2 = "Objective 2"
    OBJECTIVE_3 = "Objective 3"
    OBJECTIVE_4 = "Objective 4"
    OBJECTIVE_5 = "Objective 5"
    
    @classmethod
    def get_number(cls, objective_name: str) -> Optional[int]:
        """
        Extract objective number from name.
        
        Args:
            objective_name: Objective name (e.g., "Objective 1")
            
        Returns:
            Objective number (e.g., 1) or None if invalid
        """
        try:
            # Extract number from "Objective X" format
            parts = objective_name.split()
            if len(parts) >= 2 and parts[0].lower() == "objective":
                return int(parts[1])
        except (ValueError, IndexError):
            pass
        return None
    
    @classmethod
    def format_name(cls, number: int) -> str:
        """
        Format objective number into standardized name.
        
        Args:
            number: Objective number (e.g., 1)
            
        Returns:
            Formatted name (e.g., "Objective 1")
        """
        return f"Objective {number}"


# ============================================================================
# ObjectiveTimingManager - Track execution time for each objective
# ============================================================================
class ObjectiveTimingManager:
    """
    Track and store execution time for each objective.
    Compares first-time vs subsequent runs.
    
    Usage:
        with env.timer.objective("Objective 1"):
            # Your code here
            pass
        
        # View timing history
        env.timer.print_summary()
        env.timer.get_stats("Objective 1")
    """
    
    def __init__(self, storage_file: str = "objective_timings.csv"):
        """Initialize timing manager with persistent CSV storage."""
        import csv
        import time
        import pandas as pd
        from datetime import datetime
        from pathlib import Path
        
        self.csv = csv
        self.pd = pd
        self.time = time
        self.datetime = datetime
        self.Path = Path
        
        self.storage_file = Path(storage_file)
        self.timings = self._load_timings()
        self.current_objective = None
        self.start_time = None
    
    def _load_timings(self) -> dict:
        """Load timing history from CSV file."""
        if not self.storage_file.exists():
            return {}
        
        try:
            df = self.pd.read_csv(self.storage_file)
            
            # Convert CSV back to internal dict structure
            timings = {}
            for _, row in df.iterrows():
                obj_name = row['objective_name']
                if obj_name not in timings:
                    timings[obj_name] = {
                        "first_run": None,
                        "runs": [],
                        "total_runs": 0,
                        "average_time": None,
                        "min_time": None,
                        "max_time": None
                    }
                
                # Add this run
                timings[obj_name]["runs"].append({
                    "time": row['time'],
                    "timestamp": row['timestamp'],
                    "run_number": row['run_number']
                })
                
                # CRITICAL: Set first_run from CSV's first_run column for THIS objective only
                # This ensures each objective compares against its own first_run, not another objective's
                if timings[obj_name]["first_run"] is None:
                    # Use the first_run value from CSV (same for all rows of same objective)
                    first_run_value = row.get('first_run')
                    if self.pd.notna(first_run_value):
                        timings[obj_name]["first_run"] = float(first_run_value)
                        timings[obj_name]["first_run_timestamp"] = row.get('first_run_timestamp', row['timestamp'])
            
            # Recalculate statistics
            for obj_name in timings:
                data = timings[obj_name]
                data["total_runs"] = len(data["runs"])
                if data["runs"]:
                    times = [r["time"] for r in data["runs"]]
                    data["average_time"] = sum(times) / len(times)
                    data["min_time"] = min(times)
                    data["max_time"] = max(times)
            
            return timings
        except Exception as e:
            print(f"⚠️  Could not load timings: {e}")
            return {}
    
    def _save_timings(self):
        """Save timing history to CSV file."""
        try:
            self.storage_file.parent.mkdir(parents=True, exist_ok=True)
            
            # Convert internal dict structure to CSV rows
            rows = []
            for obj_name, data in self.timings.items():
                for run in data["runs"]:
                    rows.append({
                        'objective_name': obj_name,
                        'run_number': run['run_number'],
                        'time': run['time'],
                        'timestamp': run['timestamp'],
                        'first_run': data.get('first_run'),
                        'first_run_timestamp': data.get('first_run_timestamp', '')
                    })
            
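            # Always write the file, even when there are no rows, so that
            # cleared objectives do not reappear on the next load
            columns = ['objective_name', 'run_number', 'time', 'timestamp',
                       'first_run', 'first_run_timestamp']
            df = self.pd.DataFrame(rows, columns=columns)
            if rows:
                # Sort by objective name and run number
                df = df.sort_values(['objective_name', 'run_number'])
            df.to_csv(self.storage_file, index=False)
        except Exception as e:
            print(f"⚠️  Could not save timings: {e}")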
    
    def objective(self, objective_name: str):
        """
        Context manager for timing an objective.
        
        Usage:
            with env.timer.objective("Objective 1"):
                # Your code
                pass
        """
        return ObjectiveTimer(self, objective_name)
    
    def record_time(self, objective_name: str, elapsed_time: float):
        """Record execution time for an objective."""
        if objective_name not in self.timings:
            self.timings[objective_name] = {
                "first_run": None,
                "runs": [],
                "total_runs": 0,
                "average_time": None,
                "min_time": None,
                "max_time": None
            }
        
        timing_data = self.timings[objective_name]
        # Capture the timestamp once so run #1 and first_run_timestamp agree
        timestamp = self.datetime.now().isoformat()
        timing_data["runs"].append({
            "time": elapsed_time,
            "timestamp": timestamp,
            "run_number": timing_data["total_runs"] + 1
        })
        
        # Store first run separately
        if timing_data["first_run"] is None:
            timing_data["first_run"] = elapsed_time
            timing_data["first_run_timestamp"] = timestamp
        
        # Update statistics
        timing_data["total_runs"] = len(timing_data["runs"])
        times = [r["time"] for r in timing_data["runs"]]
        timing_data["average_time"] = sum(times) / len(times)
        timing_data["min_time"] = min(times)
        timing_data["max_time"] = max(times)
        
        # Save to file
        self._save_timings()
    
    def get_stats(self, objective_name: str) -> Optional[dict]:
        """
        Get statistics for a specific objective.
        CRITICAL: Only returns stats for the specified objective_name - ensures comparison is within objective.
        """
        if objective_name not in self.timings:
            return None
        
        # Get data for THIS specific objective only (filters by objective_name)
        data = self.timings[objective_name]
        
        # Nothing to report if this objective has no recorded runs
        if not data.get("runs"):
            return None
        
        return {
            "first_run": data.get("first_run"),  # This is THIS objective's first_run only
            "first_run_timestamp": data.get("first_run_timestamp"),
            "first_run_number": 1,  # Always 1 for first run
            "last_run": data["runs"][-1]["time"] if data["runs"] else None,
            "average_time": data.get("average_time"),
            "min_time": data.get("min_time"),
            "max_time": data.get("max_time"),
            "total_runs": data.get("total_runs", 0),
            "improvement": self._calculate_improvement(data)
        }
    
    def get_first_run_info(self, objective_name: str) -> Optional[dict]:
        """
        Get first run information for an objective (easier access).
        
        Args:
            objective_name: Name of objective (e.g., "Objective 1")
            
        Returns:
            Dict with first_run info or None if not found:
            {
                "time": float,  # First run time in seconds
                "formatted_time": str,  # Human-readable time
                "timestamp": str,  # ISO timestamp
                "run_number": int,  # Always 1
                "objective_number": Optional[int]  # Extracted number (e.g., 1)
            }
        """
        stats = self.get_stats(objective_name)
        if not stats or stats.get("first_run") is None:
            return None
        
        # Extract objective number (use ObjectiveNames directly - it's in same cell)
        obj_number = ObjectiveNames.get_number(objective_name)
        
        return {
            "time": stats["first_run"],
            "formatted_time": self._format_time(stats["first_run"]),
            "timestamp": stats.get("first_run_timestamp", "Unknown"),
            "run_number": 1,
            "objective_number": obj_number
        }
    
    def _calculate_improvement(self, data: dict) -> Optional[float]:
        """Calculate improvement percentage from first to average."""
        if data.get("first_run") and data.get("average_time"):
            first = data["first_run"]
            avg = data["average_time"]
            if first > 0:
                improvement = ((first - avg) / first) * 100
                return improvement
        return None
    
    def print_summary(self):
        """Print summary of all objective timings."""
        if not self.timings:
            print("📊 No timing data recorded yet.")
            print("   Use: with env.timer.objective('Objective 1'): ...")
            return
        
        print("=" * 80)
        print("📊 OBJECTIVE TIMING SUMMARY")
        print("=" * 80)
        print()
        
        for obj_name in sorted(self.timings.keys()):
            stats = self.get_stats(obj_name)
            if not stats:
                continue
            
            print(f"🎯 {obj_name}:")
            print(f"   First Run:  {self._format_time(stats['first_run'])}")
            if stats['total_runs'] > 1:
                print(f"   Last Run:   {self._format_time(stats['last_run'])}")
                print(f"   Average:    {self._format_time(stats['average_time'])}")
                print(f"   Min:        {self._format_time(stats['min_time'])}")
                print(f"   Max:        {self._format_time(stats['max_time'])}")
                print(f"   Total Runs: {stats['total_runs']}")
                if stats['improvement']:
                    sign = "↓" if stats['improvement'] > 0 else "↑"
                    print(f"   Improvement: {sign}{abs(stats['improvement']):.1f}% vs first run")
            print()
        
        print("=" * 80)
    
    def _format_time(self, seconds: Optional[float]) -> str:
        """Format time in human-readable format."""
        if seconds is None:
            return "N/A"
        
        if seconds < 60:
            return f"{seconds:.2f}s"
        elif seconds < 3600:
            minutes = int(seconds // 60)
            secs = seconds % 60
            return f"{minutes}m {secs:.2f}s"
        else:
            hours = int(seconds // 3600)
            minutes = int((seconds % 3600) // 60)
            secs = seconds % 60
            return f"{hours}h {minutes}m {secs:.2f}s"
    
    def compare_first_vs_subsequent(self, objective_name: str):
        """Compare first run vs subsequent runs for an objective."""
        if objective_name not in self.timings:
            print(f"❌ No data for {objective_name}")
            return
        
        data = self.timings[objective_name]
        if data["total_runs"] < 2:
            print(f"⚠️  Need at least 2 runs to compare. Current: {data['total_runs']}")
            return
        
        first = data.get("first_run")
        if first is None or not isinstance(first, (int, float)):
            print(f"❌ No first run data for {objective_name}")
            return
        
        # Type assertion: first is guaranteed to be float here
        first_float: float = float(first)
        
        # Extract subsequent run times, ensuring they are floats
        subsequent_times: list[float] = [float(r["time"]) for r in data["runs"][1:] if r.get("time") is not None]
        if not subsequent_times:
            print("⚠️  No subsequent runs to compare")
            return
        
        # Calculate average - both values are guaranteed to be float
        total_time: float = sum(subsequent_times)
        count: int = len(subsequent_times)
        avg_subsequent: float = total_time / count
        
        print(f"📊 {objective_name} - First vs Subsequent Runs:")
        print(f"   First Run:     {self._format_time(first_float)}")
        print(f"   Avg Subsequent: {self._format_time(avg_subsequent)}")
        
        if first_float > 0:
            # Both first_float and avg_subsequent are guaranteed to be float here
            improvement: float = ((first_float - avg_subsequent) / first_float) * 100
            print(f"   Improvement:  {improvement:+.1f}%")
        
        print()
    
    def clear_objective(self, objective_name: str) -> bool:
        """
        Clear timing data for a specific objective.
        
        Args:
            objective_name: Name of objective to clear
            
        Returns:
            True if cleared, False if objective not found
        """
        if objective_name not in self.timings:
            print(f"⚠️  No data found for '{objective_name}'")
            return False
        
        del self.timings[objective_name]
        self._save_timings()
        print(f"✅ Cleared timing data for '{objective_name}'")
        return True
    
    def clear_all(self, confirm: bool = False) -> bool:
        """
        Clear all timing data and delete CSV file.
        
        Args:
            confirm: If True, clears without asking. If False, prints warning but doesn't clear.
                    Use clear_all(confirm=True) to actually clear.
            
        Returns:
            True if cleared, False if not confirmed
        """
        if not confirm:
            print("⚠️  WARNING: This will delete ALL timing data!")
            print(f"   File: {self.storage_file}")
            print(f"   Current objectives: {list(self.timings.keys())}")
            print()
            print("   To confirm, use: env.timer.clear_all(confirm=True)")
            return False
        
        # Clear in-memory data
        self.timings = {}
        
        # Delete CSV file
        if self.storage_file.exists():
            try:
                self.storage_file.unlink()
                print(f"✅ Deleted timing file: {self.storage_file}")
            except Exception as e:
                print(f"⚠️  Could not delete file: {e}")
        
        print("✅ All timing data cleared!")
        return True
    
    def reset_objective(self, objective_name: str) -> bool:
        """
        Reset timing data for a specific objective (keeps file, removes objective).
        Same as clear_objective() - provided for clarity.
        
        Args:
            objective_name: Name of objective to reset
            
        Returns:
            True if reset, False if objective not found
        """
        return self.clear_objective(objective_name)


class ObjectiveTimer:
    """
    Context manager for timing a single objective execution.
    
    Usage:
        # Using standardized name (recommended):
        with env.timer.objective(ObjectiveNames.OBJECTIVE_1):
            # Your code
            pass
        
        # Or using string directly:
        with env.timer.objective("Objective 1"):
            # Your code
            pass
    """
    
    def __init__(self, manager: ObjectiveTimingManager, objective_name: str):
        self.manager = manager
        self.objective_name = objective_name
        self.start_time = None
        # Extract objective number for better display
        self.objective_number = self._extract_number(objective_name)
    
    def _extract_number(self, name: str) -> Optional[int]:
        """Extract objective number from name."""
        try:
            parts = name.split()
            if len(parts) >= 2 and parts[0].lower() == "objective":
                return int(parts[1])
        except (ValueError, IndexError):
            pass
        return None
    
    def __enter__(self):
        self.start_time = self.manager.time.time()
        # Show objective number and run count
        stats = self.manager.get_stats(self.objective_name)
        run_number = (stats['total_runs'] + 1) if stats else 1
        print(f"⏱️  Starting: {self.objective_name} (Run #{run_number})")
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        # Ensure start_time is set (should always be set by __enter__)
        if self.start_time is None:
            print(f"⚠️  Warning: start_time was not set for {self.objective_name}")
            return False
        
        # Calculate elapsed time - both values are guaranteed to be float now
        current_time: float = self.manager.time.time()
        start_time: float = self.start_time
        elapsed: float = current_time - start_time
        
        self.manager.record_time(self.objective_name, elapsed)
        
        # Get stats for THIS specific objective only (filters by objective_name)
        stats = self.manager.get_stats(self.objective_name)
        is_first = stats and stats["total_runs"] == 1
        
        # Get first run info for easier display
        first_run_info = self.manager.get_first_run_info(self.objective_name)
        
        print(f"✅ Completed: {self.objective_name}")
        print(f"   Time: {self.manager._format_time(elapsed)}")
        
        if is_first:
            print("   📝 First run recorded (Run #1)")
        else:
            # CRITICAL: Compare only with THIS objective's first run (not other objectives)
            if stats is None or first_run_info is None:
                print(f"   ⚠️  No stats available for {self.objective_name}")
            else:
                # Use first_run_info for cleaner display
                first_float: float = float(first_run_info["time"])
                diff: float = elapsed - first_float
                pct: float = ((elapsed - first_float) / first_float) * 100 if first_float > 0 else 0.0
                # Explicit sign for the time delta; its magnitude is formatted separately below
                sign = "+" if diff > 0 else ("-" if diff < 0 else "")
                
                # Enhanced display with run number and first run info
                run_number = stats["total_runs"]
                first_run_display = first_run_info["formatted_time"]
                
                # Show comparison - explicitly shows it's comparing within the same objective
                print(f"   📊 Run #{run_number} vs First Run (Run #1, {first_run_display}): {sign}{self.manager._format_time(abs(diff))} ({pct:+.1f}%)")
                
                # Show when first run happened (if available)
                if first_run_info.get("timestamp") and first_run_info["timestamp"] != "Unknown":
                    try:
                        from datetime import datetime
                        first_dt = datetime.fromisoformat(first_run_info["timestamp"].replace('Z', '+00:00'))
                        print(f"   üìÖ First run: {first_dt.strftime('%Y-%m-%d %H:%M:%S')}")
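                    except (ValueError, TypeError, AttributeError):
                        # Timestamp missing or malformed (e.g. NaN loaded from the CSV): skip the date line
                        pass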
        
        print()
        return False  # Don't suppress exceptions

# ============================================================================
# Global Environment Instance - Use this everywhere!
# ============================================================================

print("=" * 80)
print("OBJECTIVE 0: PREREQUISITES & SETUP")
print("=" * 80)
print()

# ============================================================================
# OPTIONAL: Clear timing data to start fresh
# ============================================================================
# Uncomment the line below if you want to reset all timing data:
# RESET_TIMINGS = True
# Otherwise, timing data will accumulate (recommended to track history)
RESET_TIMINGS = False

# Start timing Objective 0
import time
objective0_start = time.time()

# Create global environment configuration instance
env = EnvironmentConfig()
print()

# Install packages
env.install_packages()
print()

# Import libraries
if env.import_libraries():
    # Get and authenticate token
    env.get_token()
    if env.hf_token:
        env.authenticate_hf()
    print()
    env.print_summary()
    
    # Initialize timing manager and attach to env
    timer = ObjectiveTimingManager(storage_file="data/objective_timings.csv")
    env.timer = timer
    
    # Optional: Clear all timing data if RESET_TIMINGS is True
    if RESET_TIMINGS:
        env.timer.clear_all(confirm=True)
        print()
    
    # Make env and classes globally available
    globals()['env'] = env
    globals()['EnvironmentConfig'] = EnvironmentConfig
    globals()['ObjectiveTimingManager'] = ObjectiveTimingManager
    globals()['ObjectiveNames'] = ObjectiveNames  # For standardized objective naming
    
    # Backward compatibility - set old global variables
    globals()['IN_COLAB'] = env.is_colab
    globals()['HAS_GPU'] = env.has_gpu
    globals()['hf_token'] = env.hf_token
    
    # Record Objective 0 execution time
    objective0_elapsed = time.time() - objective0_start
    env.timer.record_time("Objective 0", objective0_elapsed)
    
    print("💡 Timing system ready! Use: with env.timer.objective('Objective 1'): ...")
    print()
    print(f"✅ Objective 0 completed in {env.timer._format_time(objective0_elapsed)}")
    print()
else:
    print("❌ Setup incomplete. Please fix errors above.")

OBJECTIVE 0: PREREQUISITES & SETUP

🔍 Checking environment...
   Python version: 3.12.10
   ✅ Running in local environment
   ⚠️  GPU NOT detected (using CPU)

📦 Installing required packages...
   ✅ transformers already installed
   ✅ torch already installed
   ✅ sentence-transformers already installed
   ⏳ Installing python-dotenv...
   ✅ python-dotenv installed
   ⏳ Installing faiss-cpu...
   ✅ faiss-cpu installed
   ✅ huggingface_hub already installed
   ✅ numpy already installed
   ✅ pandas already installed
   ✅ bert-score already installed
   ✅ accelerate already installed

✅ All required libraries imported successfully!
✅ Hugging Face token loaded from environment!
   Token preview: hf_ThdSIol...ustv
✅ Authenticated with Hugging Face

✅ PREREQUISITES & SETUP COMPLETED!

📌 Environment Configuration:
   - Environment: Local
   - Python: 3.12.10
   - Device: CPU (CPU)
   - HF Token: ✅ Set
   - Libraries: ✅ Ready

💡 Usage in othe

## Objective 1: Design System Prompts

### 🎯 Goal
Create a system prompt that defines Mistral's role as a customer service assistant for an e-commerce business context, enabling consistent, context-aware responses aligned with business requirements.

<details>
<summary><b>📥 Prerequisites</b> (Click to expand)</summary>

| Item | Source | Required | Description |
|------|--------|----------|-------------|
| `hf_token` | Setup cell (Objective 0) | ✅ Yes | Hugging Face API token for model access |
| GPU (recommended) | Colab settings | ⚠️ Optional | Faster model loading (~1-2 min vs 3-5 min on CPU) |
| Python packages | Setup cell | ✅ Yes | `transformers`, `torch`, `huggingface_hub` |

**Note:** If running locally without GPU, model loading will be slower but fully functional.

</details>

<br>


<details>
<summary><b>📋 System Prompt Components</b> (Click to expand)</summary>

The system prompt includes:

| Component | Content |
|-----------|---------|
| **Role Definition** | Customer service assistant for e-commerce platform |
| **Business Context** | Company name, product categories, policies |
| **Knowledge Base** | Products, shipping, returns, customer service hours, contact methods |
| **Communication Guidelines** | Tone (warm, professional), style (clear, concise) |
| **Limitation Handling** | Instructions for unknown information ("I don't have that information") |
| **Response Format** | Structure expectations for consistent outputs |

</details>
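
The components in the table above can be assembled mechanically. The following is a minimal sketch, not the notebook's implementation: `build_system_prompt` and the sample `components` dictionary are hypothetical names introduced here for illustration, and the policy text is taken from the business context at the top of this notebook.

```python
from typing import Dict


def build_system_prompt(components: Dict[str, str]) -> str:
    """Join named prompt components into one system prompt, in insertion order.

    Hypothetical helper for illustration only; the notebook builds its
    prompt with a single f-string instead.
    """
    sections = []
    for title, body in components.items():
        body = body.strip()
        if body:  # skip empty components rather than emitting a bare heading
            sections.append(f"## {title.upper()}\n{body}")
    return "\n\n".join(sections)


# Sample components (assumed wording, based on the business context)
components = {
    "Your Role": "You are a customer service assistant for GreenTech Marketplace.",
    "Knowledge Base": "Standard shipping: 5-7 business days (free over $75).",
    "Handling Limitations": "If the answer is not in the knowledge base, say so and offer contact details.",
    "Response Format": "",  # left empty on purpose: it is omitted from the output
}

prompt = build_system_prompt(components)
print(prompt)
```

Keeping each component as a separate entry makes it easier to vary one part of the prompt (for example, the limitation handling) while holding the rest fixed.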


<br>

<details>
<summary><b>💡 Tips</b> (Click to expand)</summary>

- **First Run**: Be patient - model download is large (~14GB)
- **GPU Recommended**: Significantly faster inference (2-3x speedup)
- **Prompt Tuning**: Experiment with different prompt structures to optimize responses
- **Caching**: The model is cached globally, so later loads in the same kernel session reuse it instead of reloading from disk

</details>

---

**Next Step:** Proceed to Objective 2 to generate the Q&A database using the system prompt.


In [26]:
# ============================================================================
# OBJECTIVE 1: DESIGN SYSTEM PROMPTS FOR LLM-BASED CUSTOMER SERVICE
# ============================================================================
#
# LEARNING OBJECTIVES DEMONSTRATED:
#   1. System Prompt Engineering - Crafting prompts that shape LLM behavior
#   2. Modular Design - SOLID, KISS, DRY principles in practice
#
# THEORETICAL BACKGROUND:
#   System prompts serve as the "constitution" for LLM behavior, establishing:
#   - Role Identity: Who the model should act as
#   - Knowledge Boundaries: What the model knows and doesn't know
#   - Behavioral Constraints: Response style, escalation rules
#   - Domain Context: Business-specific information
#
# ============================================================================

import os
from typing import Optional, Tuple

# ============================================================================
# InferenceEngine Class - Model Loading, Generation & Verification
# ============================================================================
class InferenceEngine:
    """
    Handles all model inference operations: loading, generation, and verification.
    Reusable across all objectives - follows Single Responsibility Principle.
    Uses env from Objective 0 - no duplicate code!
    
    Usage:
        engine = InferenceEngine(env)
        tokenizer, model = engine.load_model("mistralai/Mistral-7B-Instruct-v0.3")
        response = engine.generate_response(tokenizer, model, prompt)
    """
    
    def __init__(self, env):
        """
        Initialize with environment config from Objective 0.
        
        Args:
            env: EnvironmentConfig instance from Objective 0
        """
        self.env = env
        
        # Use libraries from env (no duplicate imports!)
        self.torch = env.torch
        self.AutoTokenizer = env.AutoTokenizer
        self.AutoModelForCausalLM = env.AutoModelForCausalLM
        
        # Model cache (keyed by model name)
        self._model_cache = {}
        self._tokenizer_cache = {}
    
    def load_model(
        self,
        model_name: str,
        force_reload: bool = False,
        use_cache: bool = True
    ) -> Tuple:
        """
        Load model with caching support.
        Uses env from Objective 0 - automatically handles GPU/CPU!
        
        Args:
            model_name: Hugging Face model identifier
            force_reload: Force reload even if cached
            use_cache: Use global cache (for sharing across objectives)
            
        Returns:
            Tuple of (tokenizer, model)
        """
        # Check instance cache first (fastest - instant return)
        if not force_reload and model_name in self._model_cache:
            print(f"⚡ Using cached model from instance cache: {model_name}")
            return self._tokenizer_cache[model_name], self._model_cache[model_name]
        
        # Check global cache (for sharing across objectives)
        if use_cache and not force_reload:
            global_key_tokenizer = f"{model_name}_tokenizer"
            global_key_model = f"{model_name}_model"
            
            if global_key_tokenizer in globals() and global_key_model in globals():
                tokenizer = globals()[global_key_tokenizer]
                model = globals()[global_key_model]
                
                # Store in instance cache for faster access next time
                self._tokenizer_cache[model_name] = tokenizer
                self._model_cache[model_name] = model
                
                print(f"⚡ Using cached model from global cache: {model_name}")
                return tokenizer, model
        
        print(f"Loading {model_name}...")
        
        try:
            # Use env libraries - works in both Colab and local!
            tokenizer = self.AutoTokenizer.from_pretrained(model_name)
            model = self.AutoModelForCausalLM.from_pretrained(
                model_name,
                torch_dtype=self.torch.float16 if self.env.has_gpu else self.torch.float32,
                device_map="auto" if self.env.has_gpu else None,
                low_cpu_mem_usage=True
            )
            
            # Without a GPU, device_map is None, so move the model to env.device explicitly
            if not self.env.has_gpu:
                model = model.to(self.env.device)
            
            # Store in caches
            self._tokenizer_cache[model_name] = tokenizer
            self._model_cache[model_name] = model
            
            # Store in globals for sharing across objectives
            if use_cache:
                globals()[f"{model_name}_tokenizer"] = tokenizer
                globals()[f"{model_name}_model"] = model
            
            device_display = 'GPU' if self.env.has_gpu else 'CPU'
            print(f"✅ Model loaded on {device_display}")
            
            return tokenizer, model
            
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise RuntimeError(f"Failed to load model {model_name}: {e}") from e
    
    def generate_response(
        self,
        tokenizer,
        model,
        formatted_prompt: str,
        max_new_tokens: int = 200,
        temperature: float = 0.7,
        top_p: float = 0.9
    ) -> str:
        """
        Generate model response.
        Uses env from Objective 0 to decide whether inputs must be moved to the GPU.
        
        Args:
            tokenizer: Model tokenizer
            model: Loaded model instance
            formatted_prompt: Formatted prompt string
            max_new_tokens: Max tokens to generate
            temperature: Sampling temperature; must be > 0 because do_sample=True (lower = more deterministic, higher = more varied)
            top_p: Nucleus sampling threshold
            
        Returns:
            str: Generated response
            
        Raises:
            RuntimeError: If generation fails
        """
        try:
            inputs = tokenizer(formatted_prompt, return_tensors="pt")
            
            # Move input tensors to the model's device when running on GPU
            if self.env.has_gpu:
                inputs = {k: v.to(model.device) for k, v in inputs.items()}
            
            with self.torch.no_grad():
                outputs = model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens,
                    temperature=temperature,
                    do_sample=True,
                    top_p=top_p,
                    pad_token_id=tokenizer.eos_token_id
                )
            
            # Decode only new tokens
            input_length = inputs['input_ids'].shape[1]
            generated_tokens = outputs[0][input_length:]
            response = tokenizer.decode(generated_tokens, skip_special_tokens=True).strip()
            
            return response
            
        except self.torch.cuda.OutOfMemoryError as e:
            raise RuntimeError("GPU out of memory. Try reducing max_new_tokens or use CPU.") from e
        except Exception as e:
            raise RuntimeError(f"Generation failed: {e}") from e
    
    def verify_model(self, tokenizer, model) -> bool:
        """
        Verify model is loaded and ready for inference.
        
        Args:
            tokenizer: Model tokenizer to verify
            model: Model instance to verify
            
        Returns:
            bool: True if model is ready, False otherwise
        """
        errors = []
        
        if not tokenizer:
            errors.append("❌ Tokenizer not loaded")
        if not model:
            errors.append("❌ Model not loaded")
        
        if errors:
            print("❌ Model verification failed:", "\n".join(errors))
            return False
        
        device_display = 'GPU' if self.env.has_gpu else 'CPU'
        print(f"✅ Model verified - Loaded on {device_display}")
        return True
    
    def get_device_info(self) -> dict:
        """Get device information using env."""
        return self.env.get_device_info()
    
    def clear_cache(self, model_name: Optional[str] = None):
        """
        Clear this instance's model cache.
        Entries stored in globals() by load_model() are not removed.
        
        Args:
            model_name: Specific model to clear, or None to clear all
        """
        if model_name:
            self._model_cache.pop(model_name, None)
            self._tokenizer_cache.pop(model_name, None)
        else:
            self._model_cache.clear()
            self._tokenizer_cache.clear()


# ============================================================================
# SystemPromptEngineer Class - Centralized Objective 1 Logic
# ============================================================================
class SystemPromptEngineer:
    """
    System prompt engineering for Objective 1.
    Focuses ONLY on prompt creation, formatting, and file I/O.
    Uses env from Objective 0 - no duplicate code, no if statements needed!
    
    Usage:
        engineer = SystemPromptEngineer(env)
        prompt = engineer.create_system_prompt()
        formatted = engineer.format_prompt(prompt, question)
    """
    
    # Configuration constants
    MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
    MAX_NEW_TOKENS = 200
    TEMPERATURE = 0.7
    TOP_P = 0.9
    OUTPUT_DIR = "data/system_prompt_engineering"
    
    def __init__(self, env):
        """
        Initialize with environment config.
        
        Args:
            env: EnvironmentConfig instance from Objective 0
        """
        self.env = env
        self.system_prompt = None
    
    def create_system_prompt(
        self,
        business_name: str = "GreenTech Marketplace",
        business_type: str = "e-commerce",
        support_email: str = "support@greentechmarketplace.com",
        support_phone: str = "1-800-GREEN-TECH"
    ) -> str:
        """
        Create optimized customer service system prompt.
        
        Prompt Engineering Best Practices:
        1. Role Definition: Clear, specific identity
        2. Task Boundaries: Explicit scope
        3. Knowledge Base: Comprehensive domain info
        4. Behavioral Rules: Tone and style guidelines
        5. Edge Case Handling: Unknown information handling
        6. Output Format: Response structure expectations
        
        Args:
            business_name: Company name
            business_type: Type of business
            support_email: Support email
            support_phone: Support phone
            
        Returns:
            str: Formatted system prompt
        """
        prompt = f"""You are a friendly and knowledgeable customer service assistant for {business_name}, a leading {business_type} platform specializing in sustainable technology products.

## YOUR ROLE
You are the first point of contact for customers. Your expertise includes product information, orders, shipping, returns, and general inquiries. You embody the company's commitment to sustainability and excellent service.

## KNOWLEDGE BASE

**Products:**
- Solar panels, energy-efficient appliances, smart home devices, eco-friendly accessories
- Warranty: 1-3 years depending on product category

**Shipping:**
- Standard: 5-7 business days (free over $75)
- Express: 2-3 business days (additional fee)

**Returns & Refunds:**
- 30-day return policy for unopened items in original packaging
- Refunds processed within 5-7 business days after receipt

**Customer Service Hours:**
- Monday-Friday: 9 AM - 6 PM EST
- Saturday: 10 AM - 4 PM EST
- Closed Sunday

**Contact Methods:**
- Email: {support_email}
- Phone: {support_phone}
- Live chat: Available during business hours in EST 

## COMMUNICATION GUIDELINES

**Tone:** Warm, professional, solution-oriented
**Style:** Clear, concise, helpful

**Always:**
- Greet customers warmly
- Acknowledge their concerns before providing solutions
- Provide specific, actionable information
- Include relevant timeframes and next steps
- Thank them for choosing {business_name}

**Never:**
- Make promises you cannot keep
- Provide information not in your knowledge base
- Use technical jargon without explanation

## HANDLING LIMITATIONS

If you don't know the answer:
1. Acknowledge the question honestly
2. Explain that the information is not in your current knowledge base
3. Offer to connect them with a specialist or provide contact information
4. Suggest alternative resources if available

## RESPONSE FORMAT

Keep responses concise but complete. Structure longer responses with clear sections. Always end with an offer to help further."""
        
        self.system_prompt = prompt
        return prompt
    
    def format_prompt(self, system_prompt: str, user_input: str) -> str:
        """
        Format for Mistral Instruct template.
        
        Template: <s>[INST] {system} {user} [/INST]
        
        Args:
            system_prompt: System prompt
            user_input: User question
            
        Returns:
            str: Formatted prompt
        """
        return f"<s>[INST] {system_prompt}\n\nCustomer Question: {user_input} [/INST]"
    
    def save_system_prompt(self, filename: str = "system_prompt.txt") -> str:
        """Save system prompt to file."""
        if not self.system_prompt:
            raise ValueError("No system prompt created. Call create_system_prompt() first.")
        
        os.makedirs(self.OUTPUT_DIR, exist_ok=True)
        filepath = os.path.join(self.OUTPUT_DIR, filename)
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(self.system_prompt)
        print(f"✅ Saved: {filepath}")
        return filepath
    
    def save_response(self, response: str, question: str, filename: str = "test_response.txt") -> str:
        """Save generated response to file."""
        os.makedirs(self.OUTPUT_DIR, exist_ok=True)
        filepath = os.path.join(self.OUTPUT_DIR, filename)
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(f"Question: {question}\n\n")
            f.write(f"Response:\n{response}")
        print(f"✅ Saved: {filepath}")
        return filepath
    
    def verify_prompt(self) -> bool:
        """
        Verify prompt engineering components only.
        Follows SRP - only verifies SystemPromptEngineer responsibilities.
        
        Returns:
            bool: True if prompt engineering is complete
        """
        errors = []
        
        # Verify system prompt
        if not self.system_prompt:
            errors.append("❌ System prompt not created")
        elif len(self.system_prompt) < 100:
            errors.append(f"❌ System prompt too short ({len(self.system_prompt)} chars)")
        
        # Check files exist
        prompt_file = os.path.join(self.OUTPUT_DIR, "system_prompt.txt")
        response_file = os.path.join(self.OUTPUT_DIR, "test_response.txt")
        
        if not os.path.exists(prompt_file):
            errors.append("❌ system_prompt.txt not found")
        if not os.path.exists(response_file):
            errors.append("❌ test_response.txt not found")
        
        # Print results
        if errors:
            # Print the header on its own line, then one error per line
            print("❌ Prompt verification failed:\n" + "\n".join(errors))
            return False
        
        print(f"✅ Prompt verified - Length: {len(self.system_prompt)} chars")
        return True
    



# ============================================================================
# EXECUTION - Uses env from Objective 0, wrapped with timing
# ============================================================================

# Verify env is available from Objective 0
if 'env' not in globals():
    raise RuntimeError("❌ 'env' not found! Please run Objective 0 (Prerequisites & Setup) first.")

# The class provides capabilities; the code below orchestrates the Objective 1
# workflow, which keeps prompt engineering separate from model operations.

with env.timer.objective("Objective 1"):
    print("Objective 1: Creating System Prompt\n")
    
    # Reuse InferenceEngine from globals if available (performance optimization!)
    if 'inference_engine' in globals() and isinstance(globals()['inference_engine'], InferenceEngine):
        inference_engine = globals()['inference_engine']
        print("♻️  Reusing existing InferenceEngine (model cache preserved)\n")
    else:
        # Create InferenceEngine (can be shared across objectives for efficiency)
        inference_engine = InferenceEngine(env)
    
    # Create SystemPromptEngineer instance (prompt engineering only)
    system_prompt_engineer = SystemPromptEngineer(env)
    
    # Authenticate using env from Objective 0
    if env.hf_token:
        env.authenticate_hf()
    
    # Load model using InferenceEngine (model operation)
    tokenizer, model = inference_engine.load_model(system_prompt_engineer.MODEL_NAME)
    
    # Create system prompt (prompt engineering operation)
    system_prompt = system_prompt_engineer.create_system_prompt()
    globals()['system_prompt'] = system_prompt
    print(f"✅ System prompt created ({len(system_prompt)} chars)\n")
    
    # Test with sample question
    test_question = "What are your store hours and how can I contact customer support?"
    formatted_prompt = system_prompt_engineer.format_prompt(system_prompt, test_question)
    
    # Generate response using InferenceEngine (model operation)
    generated_response = inference_engine.generate_response(
        tokenizer, model, formatted_prompt,
        max_new_tokens=system_prompt_engineer.MAX_NEW_TOKENS,
        temperature=system_prompt_engineer.TEMPERATURE,
        top_p=system_prompt_engineer.TOP_P
    )
    
    print("Sample Response:")
    print("-" * 50)
    print(generated_response)
    print("-" * 50)
    
    # Save files (prompt engineering operation)
    system_prompt_engineer.save_system_prompt()
    system_prompt_engineer.save_response(generated_response, test_question)
    
    # Verify both components separately (SRP: each verifies its own responsibility)
    inference_engine.verify_model(tokenizer, model)
    system_prompt_engineer.verify_prompt()
    
    print(f"\n✅ Objective 1 complete - Model on {'GPU' if env.has_gpu else 'CPU'}, Prompt: {len(system_prompt)} chars")
    
    # Store in globals for other objectives
    globals()['InferenceEngine'] = InferenceEngine
    globals()['SystemPromptEngineer'] = SystemPromptEngineer
    globals()['system_prompt'] = system_prompt
    globals()['inference_engine'] = inference_engine  # Reusable across objectives!


⏱️  Starting: Objective 1 (Run #2)
Objective 1: Creating System Prompt

✅ Authenticated with Hugging Face
⚡ Using cached model from global cache: mistralai/Mistral-7B-Instruct-v0.3
✅ System prompt created (1995 chars)

Sample Response:
--------------------------------------------------
Hello!

I'm delighted to assist you at GreenTech Marketplace. Our customer service hours are as follows:

- Monday-Friday: 9 AM - 6 PM EST
- Saturday: 10 AM - 4 PM EST
- Closed Sunday

You can reach us through various methods during these hours:

1. Email: support@greentechmarketplace.com
2. Phone: 1-800-GREEN-TECH
3. Live chat: Available during our business hours

If you have any questions outside of our business hours, feel free to leave an email, and we'll get back to you as soon as we're open again. Thank you for choosing GreenTech Marketplace! If you have more questions or need further assistance, please don't hesitate to ask.
--------------------------------------------------
✅ Saved: d

## Objective 2: Generate Custom Q&A Databases for E-commerce Customer Service

### 🎯 Goal
Generate 21 Q&A pairs (18 answerable + 3 unanswerable) covering e-commerce customer service topics to create a domain-specific knowledge base for the RAG system.

<details>
<summary><b>📥 Prerequisites</b> (Click to expand)</summary>

| Item | Source | Required | Description |
|------|--------|----------|-------------|
| `mistral_model` | Objective 1 | ✅ Yes | Mistral model for generating Q&A pairs |
| `mistral_tokenizer` | Objective 1 | ✅ Yes | Tokenizer for text processing and encoding |
| `system_prompt` | Objective 1 | ✅ Yes | System prompt defining business context and role |
| `hf_token` | Objective 0 | ✅ Yes | Hugging Face API token for model access |

**Note:** Objective 1 must be completed first to provide the model and system prompt.

</details>

<br>

<details>
<summary><b>🔧 Core Concepts</b> (Click to expand)</summary>

| Concept | Description |
|--------|-------------|
| **Q&A Generation** | Using LLMs to create domain-specific question-answer pairs that serve as the knowledge base for RAG systems |
| **Zero-Shot Prompting** | Providing instructions without examples - the model generates Q&A pairs based solely on the system prompt and task description |
| **Few-Shot Prompting** | Providing examples in the prompt to guide the model's output format and style - improves consistency and quality |
| **System Prompt Reuse** | Leveraging the system prompt from Objective 1 to ensure Q&A pairs align with the business context and customer service role |
| **Answerable Questions** | Questions that can be answered from the business knowledge base (products, policies, procedures) |
| **Unanswerable Questions** | Questions outside the knowledge base scope (competitor info, personal advice, future events) |
| **JSON Output Parsing** | Requesting a JSON array from the model and extracting it from the response, so structured data is recovered reliably even when the model adds extra text |
| **DataFrame Structure** | Converting Q&A pairs to pandas DataFrame for easy filtering, querying, and analysis |

**Why This Matters:**
A well-structured Q&A database is the foundation of the RAG system. It provides the knowledge that will be retrieved and used to generate accurate, context-aware responses. Using the system prompt from Objective 1 ensures consistency with the customer service role and business context.

</details>
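
The difference between the zero-shot and few-shot rows in the table above can be sketched as two prompt builders. This is a minimal illustration rather than the notebook's generation code: the helper names `build_zero_shot_prompt` and `build_few_shot_prompt` and the sample pair are hypothetical.

```python
import json
from typing import Dict, List


def build_zero_shot_prompt(topic: str, num_pairs: int) -> str:
    """Instructions only: the model sees no sample output."""
    return (
        f"Generate exactly {num_pairs} customer service Q&A pairs about: {topic}.\n"
        'Return a JSON array of objects with "question" and "answer" fields.'
    )


def build_few_shot_prompt(topic: str, num_pairs: int, examples: List[Dict[str, str]]) -> str:
    """Same instructions plus worked examples that pin down format and style."""
    return (
        build_zero_shot_prompt(topic, num_pairs)
        + "\n\nEXAMPLES:\n"
        + json.dumps(examples, indent=2)
    )


# Hypothetical sample pair, based on the shipping policy in the business context
sample = [{
    "question": "How long does standard shipping take?",
    "answer": "Standard shipping takes 5-7 business days and is free on orders over $75.",
}]

zero_shot = build_zero_shot_prompt("shipping", 3)
few_shot = build_few_shot_prompt("shipping", 3, sample)
print(few_shot)
```

The few-shot prompt is longer and costs more tokens per call, which is the trade-off against the more consistent output format it tends to produce.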

<br>

<details>
<summary><b>📤 Outputs</b> (Click to expand)</summary>

| Variable | Type | Description |
|----------|------|-------------|
| `qa_database` | `List[Dict]` | List of 21 Q&A dictionaries with keys: `category`, `answerable`, `question`, `answer` |
| `qa_df` | `pd.DataFrame` | DataFrame with Q&A pairs plus computed columns: `question_length`, `answer_length`, `word_count` |

**Files Created:**
| File | Location | Description |
|------|----------|-------------|
| `qa_database.csv` | `data/qa_database/` | All 21 Q&A pairs with answerable flag, ready for RAG system |

**Key Functions Created:**
- `generate_answerable()`: Generates answerable Q&A pairs for one category, with retries
- `generate_unanswerable()`: Generates out-of-scope Q&A pairs for one type, with retries
- `generate_full_qa_database()`: Generates complete 21-pair database
- `qa_to_dataframe()`: Converts Q&A list to DataFrame with metrics
- `display_qa_database()`: Prints the Q&A pairs with category and answerable flag
- `save_qa_to_csv()`: Saves Q&A database to CSV file
- `verify_objective2()`: Checks counts, structure, and the saved CSV file

</details>

<br>

<details>
<summary><b>📋 Q&A Database Structure</b> (Click to expand)</summary>

**Answerable Categories (18 pairs):**

| Category | Count | Topics Covered |
|----------|-------|---------------|
| **products** | 3 | Types of products, specifications, features |
| **shipping** | 3 | Delivery times, shipping costs, free shipping threshold, tracking |
| **returns** | 3 | Return policy, refund process, conditions, 30-day policy |
| **customer_service** | 3 | Business hours (Mon-Fri 9AM-6PM, Sat 10AM-4PM), contact email and phone |
| **warranty** | 3 | Warranty duration (1-3 years), coverage, claims process |
| **orders** | 3 | Order status, tracking, modifications, cancellations |

**Unanswerable Types (3 pairs):**

| Type | Count | Description |
|------|-------|-------------|
| **competitor** | 1 | Questions about competitor pricing, products, or comparisons |
| **personal_advice** | 1 | Questions asking for personal recommendations or opinions |
| **future_events** | 1 | Questions about future sales, unreleased products, or predictions |

**Data Structure:**
```python
{
    'category': 'shipping',
    'answerable': True,
    'question': 'What is your shipping policy?',
    'answer': 'We offer standard shipping (5-7 business days, free over $75)...'
}
```

</details>
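
Records in the structure above are easy to audit against the category tables with the standard library alone. The sketch below is illustrative: `summarize_qa` is a hypothetical helper, and the three records are made-up samples in the same shape.

```python
from collections import Counter
from typing import Dict, List, Tuple


def summarize_qa(records: List[Dict]) -> Tuple[Counter, int, int]:
    """Return per-category counts plus answerable and unanswerable totals."""
    per_category = Counter(r["category"] for r in records)
    answerable = sum(1 for r in records if r["answerable"])
    return per_category, answerable, len(records) - answerable


# Hypothetical records in the same shape as the data structure shown above
records = [
    {"category": "shipping", "answerable": True,
     "question": "What is your shipping policy?",
     "answer": "We offer standard shipping (5-7 business days, free over $75)."},
    {"category": "shipping", "answerable": True,
     "question": "How fast is express shipping?",
     "answer": "Express shipping takes 2-3 business days for an additional fee."},
    {"category": "competitor", "answerable": False,
     "question": "Are your prices lower than other stores?",
     "answer": "I'm sorry, competitor pricing is outside my knowledge base."},
]

per_category, answerable, unanswerable = summarize_qa(records)
print(dict(per_category), answerable, unanswerable)
```

Comparing these counts with the expected 3 pairs per category and 1 per unanswerable type shows quickly whether a generation run fell short.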


<br>

<details>
<summary><b>📚 Learning Objectives Demonstrated</b> (Click to expand)</summary>

1. **Zero-Shot Prompting**: Using instructions without examples to guide LLM output, leveraging the model's pre-training and context from system prompts
2. **System Prompt Reuse**: Integrating previously created system prompts to maintain consistency across objectives and ensure domain alignment
3. **LLM-Based Content Generation**: Using language models to create structured domain-specific content
4. **Data Structure Design**: Designing flexible data formats (list + DataFrame) for different access patterns
5. **Parsing LLM Output**: Reliable extraction of structured data from free-form LLM responses by locating and decoding the JSON array in the output
6. **Knowledge Base Construction**: Building domain-specific knowledge bases for RAG systems

</details>

<br>

<details>
<summary><b>💡 Tips</b> (Click to expand)</summary>

- **System Prompt Integration**: The system prompt from Objective 1 is automatically embedded in generation prompts - ensure Objective 1 completed successfully
- **Zero-Shot vs Few-Shot**: The task instructions carry most of the guidance; the prompts also embed three sample pairs to fix the JSON format, which moves them toward few-shot prompting
- **Generation Time**: Generation takes 2-3 minutes (one LLM call per category or type, plus retries when parsing fails) - be patient
- **Output Format**: The prompts request a JSON array, and the parser extracts the text between the first `[` and the last `]`, so stray text around the JSON is tolerated
- **Answerable Flag**: Essential for testing system's ability to decline unanswerable questions
- **DataFrame Benefits**: Use `qa_df[qa_df['answerable'] == True]` for easy filtering
- **Category Coverage**: Each category has 3 pairs to ensure sufficient coverage
- **Unanswerable Testing**: The 3 unanswerable pairs test different types of out-of-scope questions
- **Prompt Context**: The first 900 characters of the system prompt (`CONTEXT_CHARS`) are included to provide business context without exceeding token limits

</details>
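
A fixed character slice of the system prompt can end mid-line. One possible refinement, sketched here under the assumption that the prompt is line-oriented text, trims at the last line break inside the limit. `truncate_context` is a hypothetical helper and is not part of the notebook's code.

```python
def truncate_context(text: str, max_chars: int) -> str:
    """Trim text to at most max_chars, cutting at the last line break if one exists."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_newline = cut.rfind("\n")
    # Fall back to a hard cut when the excerpt contains no usable line break
    return cut[:last_newline] if last_newline > 0 else cut


# Hypothetical three-line excerpt standing in for the real system prompt
prompt_text = "## SHIPPING\n- Standard: 5-7 business days\n- Express: 2-3 business days"
excerpt = truncate_context(prompt_text, 50)
print(excerpt)
```

With a 50-character limit the third line would be cut partway through, so the helper drops it entirely and the excerpt ends on a complete policy line.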

---

**Next Step:** Proceed to Objective 3 to build the FAISS vector database from the Q&A pairs.



# ============================================================================
# OBJECTIVE 2: GENERATE CUSTOM Q&A DATABASE
# ============================================================================
#
# NOTE: Requires Objective 1 (mistral_model, mistral_tokenizer, system_prompt)
# ============================================================================

import os
import re
import json
from typing import List, Dict
import pandas as pd
import torch

# ----------------------------------------------------------------------------
# CONFIG
# ----------------------------------------------------------------------------

TEMPERATURE = 0.7
TOP_P = 0.9
CONTEXT_CHARS = 900           # Smaller excerpt increases relevance
DELIMITER = "|||"             # Legacy setting; unused by the JSON-based parser below

OUTPUT_DIR = "data/qa_database"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# ----------------------------------------------------------------------------
# VALIDATION
# ----------------------------------------------------------------------------

def validate_prerequisites():
    required = ["mistral_model", "mistral_tokenizer", "system_prompt"]
    missing = [r for r in required if r not in globals()]
    if missing:
        raise RuntimeError(
            f"❌ Objective 2 requires Objective 1 first. Missing: {missing}"
        )
    print("   ✅ Prerequisites validated")


# ----------------------------------------------------------------------------
# CATEGORY DEFINITIONS
# ----------------------------------------------------------------------------

QA_CATEGORIES = [
    ("products", "types of products, solar panels, smart devices, eco-friendly items"),
    ("shipping", "delivery times, shipping cost, free shipping threshold, tracking"),
    ("returns", "return policy, refund window, 30-day policy, conditions"),
    ("customer_service", "hours (Mon–Fri 9–6, Sat 10–4), email, phone support"),
    ("warranty", "coverage periods 1–3 years, claims process"),
    ("orders", "order status, modifying or cancelling orders, tracking numbers"),
]

UNANSWERABLE_TYPES = [
    ("competitor", "questions about competitor prices or product comparisons"),
    ("personal_advice", "questions asking for personal recommendations or opinions"),
    ("future_events", "questions about upcoming sales or unreleased products"),
]

PAIRS_PER_CATEGORY = 3
UNANSWERABLE_PER_TYPE = 1

ANSWERABLE_TOTAL = len(QA_CATEGORIES) * PAIRS_PER_CATEGORY
UNANSWERABLE_TOTAL = len(UNANSWERABLE_TYPES) * UNANSWERABLE_PER_TYPE
TOTAL_PAIRS = ANSWERABLE_TOTAL + UNANSWERABLE_TOTAL


# ----------------------------------------------------------------------------
# PROMPT TEMPLATES 
# ----------------------------------------------------------------------------

ANSWERABLE_PROMPT = """
You are generating REALISTIC customer service Q&A pairs for the GreenTech Marketplace e-commerce support assistant.

Generate EXACTLY {num_pairs} Q&A pairs about the topic below.

TOPIC FOCUS:
{description}

BUSINESS CONTEXT (from system prompt):
{context}

CRITICAL: You MUST output valid JSON only. No other text before or after.

OUTPUT FORMAT (JSON array):
[
  {{"question": "What is your shipping policy?", "answer": "We offer standard shipping (5-7 business days) for free on orders over $75. Express shipping (2-3 business days) is available for an additional fee. All orders are shipped with tracking numbers."}},
  {{"question": "Can I return a product if I'm not satisfied?", "answer": "Yes, you can return any unopened item in its original packaging within 30 days of delivery for a full refund. Simply contact our support team to initiate the return process."}},
  {{"question": "What are your customer service hours?", "answer": "Our customer service team is available Monday through Friday from 9 AM to 6 PM EST, and on Saturdays from 10 AM to 4 PM EST. You can reach us via email at support@greentechmarketplace.com or by phone."}}
]

CONTENT RULES:
- Questions MUST sound like real customers asking natural questions.
- Answers MUST be 2–3 sentences with concrete details (times, numbers, policies, contact info).
- Stay entirely within GreenTech Marketplace policies from the business context.
- DO NOT hallucinate unsupported information.

OUTPUT: Return a valid JSON array with EXACTLY {num_pairs} objects, each with "question" and "answer" fields.
"""

UNANSWERABLE_PROMPT = """
Generate EXACTLY {num_pairs} UNANSWERABLE customer Q&A pairs.

TOPIC TYPE OUT OF SCOPE:
{description}

CRITICAL: You MUST output valid JSON only. No other text before or after.

OUTPUT FORMAT (JSON array):
[
  {{"question": "What are your competitor's prices?", "answer": "I'm sorry, but I'm unable to provide information about competitor pricing as it's outside our knowledge base. However, I'd be happy to help you with questions about our own products, shipping options, or return policies."}},
  {{"question": "Can you recommend the best restaurant in New York?", "answer": "I apologize, but I cannot provide personal recommendations or advice about restaurants as that's beyond GreenTech Marketplace's scope. I can assist you with questions about our products, shipping, returns, or order tracking."}},
  {{"question": "When will you release new products next month?", "answer": "I'm unable to provide information about upcoming product releases or future events as that information isn't available in our knowledge base. However, I can help you with questions about our current product catalog, shipping options, or warranty information."}}
]

REFUSAL RULES:
- Question MUST be outside GreenTech Marketplace's knowledge base (competitor info, personal advice, future events, etc.)
- Answer MUST politely decline and explain you cannot provide that information
- Answer MUST offer what you *can* help with (shipping, returns, products, warranty, orders, etc.)
- Answer MUST be 2 sentences

OUTPUT: Return a valid JSON array with EXACTLY {num_pairs} objects, each with "question" and "answer" fields.
"""


# ----------------------------------------------------------------------------
# GENERATION FUNCTION
# ----------------------------------------------------------------------------

def mistral_generate(prompt: str, max_tokens: int = 600) -> str:
    tokenizer = mistral_tokenizer
    model = mistral_model

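    # Mistral's Hugging Face tokenizer adds the BOS token (<s>) by default,
    # so the template omits it to avoid a doubled BOS.
    formatted = f"[INST] {prompt} [/INST]"

    # Move inputs to whichever device holds the model (GPU or CPU)
    inputs = tokenizer(formatted, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}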

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=TEMPERATURE,
            top_p=TOP_P,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )

    # Extract generated tokens (skip prompt)
    inp_len = inputs["input_ids"].shape[1]
    gen_tokens = outputs[0][inp_len:]
    text = tokenizer.decode(gen_tokens, skip_special_tokens=True).strip()
    return text


# ----------------------------------------------------------------------------
# PARSER (ROBUST - IMPROVED)
# ----------------------------------------------------------------------------

def parse_qa_lines(text: str, answerable: bool, debug: bool = False) -> List[Dict]:
    """
    Parse Q&A pairs from model output.
    Expected format: JSON array with objects containing "question" and "answer" fields.
    """
    qa_list = []
    
    if debug:
        print(f"      [DEBUG] Raw model output ({len(text)} chars):")
        print(f"      {repr(text[:300])}...")
    
    # Try to extract JSON from the text (might have extra text before/after)
    text_clean = text.strip()
    
    # Find JSON array in the text (handle cases where model adds extra text)
    json_start = text_clean.find('[')
    json_end = text_clean.rfind(']') + 1
    
    if json_start == -1 or json_end == 0:
        if debug:
            print(f"      [DEBUG] No JSON array found in output")
        return qa_list
    
    json_text = text_clean[json_start:json_end]
    
    try:
        parsed_data = json.loads(json_text)
        
        if not isinstance(parsed_data, list):
            if debug:
                print(f"      [DEBUG] JSON is not an array, got: {type(parsed_data)}")
            return qa_list
        
        for item in parsed_data:
            if not isinstance(item, dict):
                continue
            q, a = item.get("question"), item.get("answer")
            # Skip entries whose fields are missing or are not strings
            if not isinstance(q, str) or not isinstance(a, str):
                continue
            q, a = q.strip(), a.strip()
            
            # Validate content
            if len(q) > 3 and len(a) > 10:
                qa_list.append({
                    "question": q,
                    "answer": a,
                    "answerable": bool(answerable)
                })
        
        if debug:
            print(f"      [DEBUG] Parsed {len(qa_list)} pairs from JSON")
            if len(qa_list) > 0:
                print(f"      [DEBUG] First parsed pair:")
                print(f"        Q: {qa_list[0]['question'][:80]}...")
                print(f"        A: {qa_list[0]['answer'][:80]}...")
    
    except json.JSONDecodeError as e:
        if debug:
            print(f"      [DEBUG] JSON decode error: {e}")
            print(f"      [DEBUG] Attempted to parse: {json_text[:200]}...")
    
    return qa_list


# ----------------------------------------------------------------------------
# HIGH-LEVEL GENERATORS (WITH RETRY LOGIC)
# ----------------------------------------------------------------------------

def generate_answerable(category: str, description: str, n: int, max_retries: int = 3) -> List[Dict]:
    """Generate answerable Q&A pairs with retry logic if parsing fails."""
    best: List[Dict] = []
    for attempt in range(max_retries):
        prompt = ANSWERABLE_PROMPT.format(
            num_pairs=n,
            description=description,
            context=system_prompt[:CONTEXT_CHARS]
        )
        raw = mistral_generate(prompt, max_tokens=800)
        parsed = parse_qa_lines(raw, True, debug=(attempt == max_retries - 1))
        
        if len(parsed) >= n:
            return parsed[:n]
        if parsed:
            print(f"      ⚠️  Got {len(parsed)}/{n} pairs (attempt {attempt + 1})")
            # Keep the largest partial result in case later attempts do worse
            if len(parsed) > len(best):
                best = parsed
    
    # Return the best partial result, or an empty list if every attempt failed
    if not best:
        print(f"      ❌ Failed to generate pairs after {max_retries} attempts")
    return best


def generate_unanswerable(category: str, description: str, n: int, max_retries: int = 3) -> List[Dict]:
    """Generate unanswerable Q&A pairs with retry logic if parsing fails."""
    best: List[Dict] = []
    for attempt in range(max_retries):
        prompt = UNANSWERABLE_PROMPT.format(
            num_pairs=n,
            description=description
        )
        raw = mistral_generate(prompt, max_tokens=600)
        parsed = parse_qa_lines(raw, False, debug=(attempt == max_retries - 1))
        
        if len(parsed) >= n:
            return parsed[:n]
        if parsed:
            print(f"      ⚠️  Got {len(parsed)}/{n} pairs (attempt {attempt + 1})")
            # Keep the largest partial result in case later attempts do worse
            if len(parsed) > len(best):
                best = parsed
    
    # Return the best partial result, or an empty list if every attempt failed
    if not best:
        print(f"      ❌ Failed to generate pairs after {max_retries} attempts")
    return best


# ----------------------------------------------------------------------------
# FULL DATABASE GENERATOR
# ----------------------------------------------------------------------------

def generate_full_qa_database():
    db = []

    print("\n📗 Generating answerable Q&A...")
    for cat, desc in QA_CATEGORIES:
        print(f"   → {cat}...")
        pairs = generate_answerable(cat, desc, PAIRS_PER_CATEGORY)
        for p in pairs:
            p["category"] = cat
        db.extend(pairs)

    print("\n📕 Generating unanswerable Q&A...")
    for cat, desc in UNANSWERABLE_TYPES:
        print(f"   → {cat}...")
        pairs = generate_unanswerable(cat, desc, UNANSWERABLE_PER_TYPE)
        for p in pairs:
            p["category"] = cat
        db.extend(pairs)

    # Report the counts actually produced, which can fall short of the targets
    answerable_n = sum(1 for p in db if p["answerable"])
    print(f"\n🎉 Generated {len(db)} of {TOTAL_PAIRS} target pairs "
          f"({answerable_n} answerable, {len(db) - answerable_n} unanswerable)")

    return db


# ----------------------------------------------------------------------------
# CONVERSION / SAVE
# ----------------------------------------------------------------------------

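def qa_to_dataframe(qa_list: List[Dict]) -> pd.DataFrame:
    """Convert Q&A dicts to a DataFrame with length and word-count metrics."""
    # Declaring the columns keeps the metric steps working on an empty list
    df = pd.DataFrame(qa_list, columns=["question", "answer", "answerable", "category"])
    df["question_length"] = df["question"].str.len()
    df["answer_length"] = df["answer"].str.len()
    df["word_count"] = df["answer"].str.split().str.len()
    return df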


def save_qa_to_csv(qa_list, filename="qa_database.csv"):
    df = pd.DataFrame(qa_list)
    path = os.path.join(OUTPUT_DIR, filename)
    df.to_csv(path, index=False)
    print(f"   💾 Saved to {path}")
    return path


# ----------------------------------------------------------------------------
# VERIFICATION FUNCTION
# ----------------------------------------------------------------------------

def verify_objective2():
    """
    Verify that Objective 2 completed successfully.
    Checks all variables, counts, structure, and files.
    """
    print("="*70)
    print("🔍 OBJECTIVE 2 VERIFICATION")
    print("="*70)
    
    errors = []
    warnings = []
    
    # Check if variables exist
    if 'qa_database' not in globals():
        errors.append("❌ qa_database not found")
    if 'qa_df' not in globals():
        errors.append("❌ qa_df not found")
    
    if errors:
        print("\n".join(errors))
        print("="*70)
        return False
    
    # Check count
    actual_count = len(qa_database)
    expected_count = TOTAL_PAIRS
    
    if actual_count != expected_count:
        errors.append(f"❌ Wrong count: Expected {expected_count}, got {actual_count}")
    else:
        print(f"✅ Count correct: {actual_count} pairs")
    
    # Check structure
    required_keys = ["question", "answer", "answerable", "category"]
    for i, pair in enumerate(qa_database):
        for key in required_keys:
            if key not in pair:
                errors.append(f"❌ Pair {i+1} missing key: {key}")
            elif pair[key] is None:
                errors.append(f"❌ Pair {i+1} has None value for {key}")
            elif isinstance(pair[key], str) and len(pair[key].strip()) == 0:
                errors.append(f"❌ Pair {i+1} has empty {key}")
            # Note: For boolean fields like "answerable", False is a valid value, not empty
    
    if not errors:
        print(f"✅ Structure correct: All pairs have required keys")
    
    # Check distribution
    answerable_count = sum(1 for p in qa_database if p.get("answerable") == True)
    unanswerable_count = sum(1 for p in qa_database if p.get("answerable") == False)
    
    if answerable_count != ANSWERABLE_TOTAL:
        warnings.append(f"⚠️  Answerable count: Expected {ANSWERABLE_TOTAL}, got {answerable_count}")
    else:
        print(f"✅ Answerable pairs: {answerable_count}")
    
    if unanswerable_count != UNANSWERABLE_TOTAL:
        warnings.append(f"⚠️  Unanswerable count: Expected {UNANSWERABLE_TOTAL}, got {unanswerable_count}")
    else:
        print(f"✅ Unanswerable pairs: {unanswerable_count}")
    
    # Check DataFrame
    if len(qa_df) != actual_count:
        errors.append(f"❌ DataFrame count mismatch: {len(qa_df)} != {actual_count}")
    else:
        print(f"✅ DataFrame correct: {len(qa_df)} rows")
    
    # Check file exists
    csv_path = os.path.join(OUTPUT_DIR, "qa_database.csv")
    if not os.path.exists(csv_path):
        errors.append(f"❌ CSV file not found: {csv_path}")
    else:
        print(f"✅ CSV file exists: {csv_path}")
        # Verify CSV content
        try:
            df_check = pd.read_csv(csv_path)
            if len(df_check) != actual_count:
                warnings.append(f"⚠️  CSV row count: {len(df_check)} != {actual_count}")
        except Exception as e:
            warnings.append(f"⚠️  Could not verify CSV: {e}")
    
    # Print results
    if errors:
        print("\n❌ VERIFICATION FAILED:")
        print("\n".join(errors))
        if warnings:
            print("\n⚠️  WARNINGS:")
            print("\n".join(warnings))
        print("="*70)
        return False
    else:
        print("\n✅ Objective 2 Complete - All checks passed!")
        if warnings:
            print("\n⚠️  WARNINGS:")
            print("\n".join(warnings))
        print(f"   • Total Q&A pairs: {actual_count}")
        print(f"   • Answerable: {answerable_count}")
        print(f"   • Unanswerable: {unanswerable_count}")
        print(f"   • DataFrame: {len(qa_df)} rows × {len(qa_df.columns)} columns")
        print(f"   • CSV file: {csv_path}")
        print("="*70)
        return True


# ----------------------------------------------------------------------------
# DISPLAY FUNCTION
# ----------------------------------------------------------------------------

def display_qa_database(qa_list: List[Dict], max_display: int = None):
    """Display Q&A pairs in a readable format."""
    print("\n" + "="*70)
    print("📋 GENERATED Q&A DATABASE")
    print("="*70)
    
    if max_display:
        display_list = qa_list[:max_display]
        print(f"\nShowing first {len(display_list)} of {len(qa_list)} pairs:\n")
    else:
        display_list = qa_list
        print(f"\nAll {len(display_list)} pairs:\n")
    
    for i, pair in enumerate(display_list, 1):
        answerable_str = "✅ Answerable" if pair.get("answerable") else "❌ Unanswerable"
        category = pair.get("category", "unknown")
        print(f"\n[{i}] {answerable_str} | Category: {category}")
        print(f"    Q: {pair.get('question', 'N/A')}")
        print(f"    A: {pair.get('answer', 'N/A')[:150]}{'...' if len(pair.get('answer', '')) > 150 else ''}")
    
    if max_display and len(qa_list) > max_display:
        print(f"\n... and {len(qa_list) - max_display} more pairs")
    
    print("\n" + "="*70)


# ----------------------------------------------------------------------------
# EXECUTION ENTRYPOINT
# ----------------------------------------------------------------------------

print("="*70)
print("OBJECTIVE 2 (REVISED): GENERATE CUSTOM Q&A DATABASE")
print("="*70)

validate_prerequisites()

print("\n🤖 Generating Q&A with strict Mistral prompts...")
qa_database = generate_full_qa_database()
qa_df = qa_to_dataframe(qa_database)

save_qa_to_csv(qa_database)

# Display the generated Q&A pairs
display_qa_database(qa_database)

# Verify the results
print("\n")
verify_objective2()

print("\nDone! Q&A database is ready.")
print("="*70)


OBJECTIVE 2 (REVISED): GENERATE CUSTOM Q&A DATABASE
   ✅ Prerequisites validated

🤖 Generating Q&A with strict Mistral prompts...

📗 Generating answerable Q&A...
   → products...
   → shipping...
   → returns...
   → customer_service...
   → warranty...
   → orders...

📕 Generating unanswerable Q&A...
   → competitor...
   → personal_advice...
   → future_events...

🎉 Generated 21 total pairs (18 answerable, 3 unanswerable)
   💾 Saved to data/qa_database/qa_database.csv

📋 GENERATED Q&A DATABASE

All 21 pairs:


[1] ✅ Answerable | Category: products
    Q: What types of products do you offer on your platform?
    A: We specialize in sustainable technology products, offering solar panels, energy-efficient appliances, smart home devices, and eco-friendly accessories...

[2] ✅ Answerable | Category: products
    Q: How long are the warranties on your products?
    A: Our warranties vary depending on the product category, ranging from 1 to 3 years.

## Objective 3: Implement Vector Databases Using FAISS

### 🎯 Goal
Build a FAISS vector database from the Q&A pairs to enable fast semantic search, converting text to embeddings and creating an index for efficient similarity retrieval in the RAG system.

<details>
<summary><b>üì• Prerequisites</b> (Click to expand)</summary>

| Item | Source | Required | Description |
|------|--------|----------|-------------|
| `qa_database` | Objective 2 | ✅ Yes | List of 21 Q&A pairs to convert to embeddings |
| `system_prompt` | Objective 1 | ✅ Yes | System prompt (used for validation) |
| Python packages | Setup cell | ✅ Yes | `faiss-cpu`, `sentence-transformers`, `numpy`, `pandas` |

**Note:** Objective 2 must be completed first to provide the Q&A database.

</details>

<br>

<details>
<summary><b>üîß Core Concepts</b> (Click to expand)</summary>

| Concept | Description |
|--------|-------------|
| **Embeddings** | Numerical vector representations of text that capture semantic meaning. Similar texts have similar embeddings (close in vector space) |
| **Semantic Search** | Finding relevant documents by meaning rather than exact keyword matching. Enables finding "shipping" when querying "delivery" |
| **FAISS (Facebook AI Similarity Search)** | Library for efficient similarity search in high-dimensional vector spaces. Searches millions of vectors in milliseconds |
| **IndexFlatL2** | FAISS index type using L2 (Euclidean) distance for exact similarity search. Ideal for small-medium datasets (<100k vectors) |
| **Sentence Transformers** | Pre-trained models that convert text to dense vector embeddings optimized for semantic similarity |
| **Top-K Retrieval** | Retrieving the k most similar documents (e.g., top-3) based on embedding similarity scores |

**Why This Matters:**
Vector databases enable semantic search - finding relevant information even when exact keywords don't match. This is essential for RAG systems where user questions need to retrieve the most semantically similar context from the knowledge base.

</details>
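
To make the embedding and top-k ideas above concrete, here is a minimal sketch of exact nearest-neighbour search in plain Python. It is conceptually what an exhaustive index does: compare the query against every stored vector and keep the k closest. The three-dimensional toy vectors and the helper names (`squared_l2`, `top_k_l2`) are hypothetical and chosen only for illustration; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_l2(query: Sequence[float],
             vectors: List[Sequence[float]],
             k: int = 3) -> List[Tuple[int, float]]:
    """Return up to k (index, distance) pairs, closest first."""
    scored = [(i, squared_l2(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])  # lower distance = more similar
    return scored[:k]


# Hypothetical toy "embeddings": the first two are deliberately close together
toy_vectors = [
    [0.9, 0.1, 0.0],   # stands in for a text about shipping
    [0.8, 0.2, 0.1],   # stands in for a text about delivery
    [0.0, 0.1, 0.9],   # stands in for a text about warranty
]
toy_query = [0.9, 0.12, 0.02]

nearest = top_k_l2(toy_query, toy_vectors, k=2)
for idx, dist in nearest:
    print(f"vector {idx}: squared L2 distance = {dist:.4f}")
```

The two "shipping-like" vectors rank ahead of the "warranty-like" one, which is the behaviour semantic search relies on: texts with related meaning sit close together in vector space even when they share no keywords.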

<br>

<details>
<summary><b>üìä Design Choices</b> (Click to expand)</summary>

| Choice | Selected | Rationale |
|--------|----------|-----------|
| **Embedding Model** | `all-MiniLM-L6-v2` | 384 dimensions, fast inference, good quality for small-medium datasets |
| **FAISS Index Type** | IndexFlatL2 | Exact search with L2 distance, ideal for 21 Q&A pairs, no approximation needed |
| **Embedding Strategy** | Question + Answer combined | Richer semantic representation by including both question and answer text |
| **Top-K Retrieval** | 3 documents | Balances context richness with prompt length, sufficient for most queries |
| **Distance Metric** | L2 (Euclidean) | Standard similarity measure, lower distance = more similar vectors |
| **Data Type** | float32 | FAISS requirement, balances precision and memory usage |

**Why This Approach:**
- **all-MiniLM-L6-v2**: Lightweight, fast, sufficient quality for 21 pairs. Alternative models (e.g., all-mpnet-base-v2) offer higher accuracy but slower inference
- **IndexFlatL2**: Exact search ensures highest quality results. For larger datasets (>100k), an approximate index such as IndexIVFFlat or IndexHNSWFlat would be better for speed
- **Combined Q&A**: Including both question and answer in embeddings captures full semantic context, improving retrieval accuracy
- **Top-3**: Provides enough context for LLM while keeping prompts manageable

</details>
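
Two of the design choices above are easy to show in isolation: combining question and answer into one text before embedding, and turning an L2 distance into a similarity score. The sketch below assumes the `1 / (1 + distance)` conversion used later in this notebook's retrieval code; the helper names and the sample pair are illustrative, not part of the generated Q&A database.

```python
from typing import Dict


def combine_qa_text(pair: Dict[str, str]) -> str:
    """Join question and answer so the embedding sees both."""
    return f"{pair['question']} {pair['answer']}"


def distance_to_similarity(distance: float) -> float:
    """Map a non-negative distance to a score in (0, 1]; 0 distance gives 1.0."""
    return 1.0 / (1.0 + distance)


# Illustrative pair based on the shipping policy in the business context
sample_pairs = [
    {
        "question": "How long does standard shipping take?",
        "answer": "Standard shipping takes 5-7 business days.",
    }
]
texts = [combine_qa_text(p) for p in sample_pairs]
print(texts[0])

for d in (0.0, 0.5, 1.0, 3.0):
    print(f"distance {d:.1f} -> similarity {distance_to_similarity(d):.3f}")
```

The conversion is monotonic, so it preserves the ranking that the index returns. It is a display convenience rather than a calibrated probability, so a printed "63.8%" should be read as a relative score only.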

<br>

<details>
<summary><b>üì§ Outputs</b> (Click to expand)</summary>

| Variable | Type | Description |
|----------|------|-------------|
| `embedding_model` | `SentenceTransformer` | Loaded sentence-transformers model (all-MiniLM-L6-v2) |
| `qa_embeddings` | `np.ndarray` | Embedding vectors for all Q&A pairs, shape (21, 384) |
| `faiss_index` | `faiss.IndexFlatL2` | FAISS index with 21 vectors indexed, ready for similarity search |
| `embed_query()` | `function` | Function to convert query text to embedding vector |

**Files Created:**
| File | Location | Description |
|------|----------|-------------|
| `qa_embeddings.npy` | `data/vector_database/` | NumPy array of all Q&A embeddings (21 × 384) |
| `qa_index.faiss` | `data/vector_database/` | Serialized FAISS index for persistence |
| `retrieval_test_results.csv` | `data/vector_database/` | Test query results with similarity scores |

**Key Functions Created:**
- `load_embedding_model()`: Loads sentence-transformers model
- `generate_embeddings()`: Converts text list to embedding array
- `create_faiss_index()`: Builds FAISS IndexFlatL2 from embeddings
- `search_index()`: Finds top-k similar vectors by L2 distance
- `retrieve_context()`: Complete RAG retrieval - query to relevant Q&A pairs
- `embed_query()`: Converts query text to embedding vector
- `format_context_for_llm()`: Formats retrieved Q&A pairs as context string

</details>

<br>

<details>
<summary><b>üìã FAISS Index Details</b> (Click to expand)</summary>

**Index Type: IndexFlatL2**

| Property | Value | Description |
|----------|-------|-------------|
| **Distance Metric** | L2 (Euclidean) | Lower distance = more similar vectors |
| **Search Type** | Exact | No approximation, highest quality results |
| **Vectors Indexed** | 21 | One per Q&A pair |
| **Embedding Dimension** | 384 | Matches all-MiniLM-L6-v2 output |
| **Search Speed** | Milliseconds | Fast for small-medium datasets |
| **Scalability** | <100k vectors | For larger datasets, use IndexIVFFlat or IndexHNSWFlat |

**Why IndexFlatL2:**
- **Exact Search**: No approximation means highest quality results
- **Simple**: Easy to implement and understand
- **Sufficient**: Perfect for 21 Q&A pairs
- **Fast Enough**: Millisecond search times for our dataset size

**Alternative Index Types (for reference):**
- **IndexIVFFlat**: Approximate search, faster for large datasets (>100k)
- **IndexHNSWFlat**: Graph-based, very fast approximate search for very large datasets
- **IndexFlatIP**: Inner product instead of L2 distance; equivalent to cosine similarity only when the vectors are L2-normalized first

</details>
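
The relationship between L2 distance and inner product mentioned above can be checked numerically. For unit-length vectors, squared L2 distance equals `2 - 2 * cosine`, so ranking by smallest L2 distance and ranking by largest inner product give the same order. The sketch below uses plain Python and made-up vectors; the helper names are hypothetical.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up vectors, normalized to unit length before comparison
a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cosine = dot(a, b)       # inner product of unit vectors = cosine similarity
d2 = squared_l2(a, b)    # squared Euclidean distance

print(f"cosine similarity    : {cosine:.4f}")
print(f"squared L2 distance  : {d2:.4f}")
print(f"2 - 2 * cosine       : {2.0 - 2.0 * cosine:.4f}")
```

In practice this means a switch to IndexFlatIP only behaves like cosine similarity if both the stored vectors and the query are normalized beforehand (FAISS offers `faiss.normalize_L2` for in-place normalization). Note also that IndexFlatL2 reports squared distances, which does not affect ranking.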

<br>

<details>
<summary><b>üìö Learning Objectives Demonstrated</b> (Click to expand)</summary>

1. **Text Embeddings**: Converting text to numerical vectors that capture semantic meaning
2. **Semantic Search**: Finding relevant documents by meaning rather than keyword matching
3. **Vector Databases**: Using FAISS for efficient similarity search in high-dimensional spaces
4. **Index Design**: Choosing appropriate index types (IndexFlatL2) for dataset size
5. **RAG Retrieval**: Implementing the retrieval component of RAG systems

</details>

<br>

<details>
<summary><b>üí° Tips</b> (Click to expand)</summary>

- **Embedding Model**: all-MiniLM-L6-v2 is fast and sufficient for 21 pairs. For larger datasets, consider all-mpnet-base-v2 for better quality
- **Combined Q&A**: Including both question and answer in embeddings improves retrieval accuracy
- **FAISS Index Type**: IndexFlatL2 is perfect for small datasets. For >100k vectors, use IndexIVFFlat for speed
- **Top-K Selection**: Top-3 provides good context balance. Adjust based on your use case
- **Float32 Requirement**: FAISS requires float32 dtype - `generate_embeddings()` converts its output with `.astype('float32')` before indexing
- **Index Persistence**: Saved index can be loaded later without regenerating embeddings
- **Search Speed**: FAISS searches are extremely fast (milliseconds) even for larger datasets

</details>

---

**Next Step:** Proceed to Objective 4 to build the complete RAG pipeline using the FAISS index.



In [49]:
# ============================================================================
# OBJECTIVE 3: IMPLEMENT VECTOR DATABASES USING FAISS
# ============================================================================
#
# LEARNING OBJECTIVES DEMONSTRATED:
#   1. Text Embeddings - Converting text to numerical vectors that capture semantic meaning
#   2. Semantic Search - Finding relevant documents by meaning rather than keyword matching
#   3. Vector Databases - Using FAISS for efficient similarity search in high-dimensional spaces
#   4. Index Design - Choosing appropriate index types (IndexFlatL2) for dataset size
#   5. RAG Retrieval - Implementing the retrieval component of RAG systems
#
# PREREQUISITES: Run Objective 1 and Objective 2 first
#   - system_prompt (from Objective 1) - for validation
#   - qa_database (from Objective 2) - 21 Q&A pairs to convert to embeddings
#
# ============================================================================

# ============================================================================
# SECTION 1: IMPORTS & VALIDATION
# ============================================================================

import os
from typing import List, Dict, Tuple

try:
    import numpy as np
    import pandas as pd
    import faiss
    from sentence_transformers import SentenceTransformer
except ImportError as e:
    raise ImportError(f"Missing: {e}. Run: pip install faiss-cpu sentence-transformers numpy pandas")


def validate_prerequisites():
    """Ensure Objective 1 and 2 were run first."""
    required = ['system_prompt', 'qa_database']
    missing = [r for r in required if r not in globals()]
    if missing:
        raise RuntimeError(f"Missing: {missing}. Run Objective 1 and 2 first.")
    print("✅ Prerequisites validated")
    print(f"   • System prompt: {len(globals()['system_prompt'])} chars")
    print(f"   • Q&A database: {len(globals()['qa_database'])} pairs")
    return True


# ============================================================================
# SECTION 2: CONFIGURATION
# ============================================================================

# Output directory for FAISS index and embeddings
OUTPUT_DIR = "data/vector_database"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Embedding model - lightweight and efficient
EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"

# Retrieval settings
TOP_K = 3  # Number of similar documents to retrieve

# Expected Q&A count (from Objective 2)
EXPECTED_QA_COUNT = 21  # Total Q&A pairs: 18 answerable + 3 unanswerable
EMBEDDING_DIM = 384  # Dimension for all-MiniLM-L6-v2 model


# ============================================================================
# SECTION 3: EMBEDDING FUNCTIONS
# ============================================================================

def load_embedding_model(model_name: str = EMBEDDING_MODEL) -> SentenceTransformer:
    """
    Load sentence transformer model for generating embeddings.
    
    Model: all-MiniLM-L6-v2
    - Dimensions: 384
    - Speed: Fast (good for real-time applications)
    - Quality: Good semantic understanding
    - Memory: Low (suitable for CPU-based environments)
    
    Returns:
        SentenceTransformer model ready for encoding
    """
    print(f"   Loading embedding model: {model_name}")
    model = SentenceTransformer(model_name)
    print(f"   ✅ Model loaded (embedding dim: {model.get_sentence_embedding_dimension()})")
    return model


def generate_embeddings(texts: List[str], model: SentenceTransformer) -> np.ndarray:
    """
    Generate embeddings for a list of texts.
    
    Converts text to 384-dimensional float32 vectors using sentence-transformers.
    These embeddings capture semantic meaning, enabling similarity search.
    
    Args:
        texts: List of strings to embed
        model: SentenceTransformer model
    
    Returns:
        numpy array of shape (n_texts, 384) with float32 dtype
    
    Raises:
        ValueError: If texts is empty
    """
    # Reject empty input before calling the model
    if not texts:
        raise ValueError("Texts list cannot be empty")
    
    # Encode all texts - model handles batching internally
    embeddings = model.encode(texts, show_progress_bar=True, convert_to_numpy=True)
    
    # FAISS requires float32 dtype
    return embeddings.astype('float32')


# ============================================================================
# SECTION 4: FAISS INDEX FUNCTIONS
# ============================================================================

def create_faiss_index(embeddings: np.ndarray) -> faiss.IndexFlatL2:
    """
    Create FAISS index from embeddings.
    
    Uses IndexFlatL2 for exact search with L2 (Euclidean) distance.
    Ideal for small-medium datasets (<100k vectors).
    
    Args:
        embeddings: numpy array of shape (n_vectors, embedding_dim)
    
    Returns:
        FAISS IndexFlatL2 index with all vectors indexed
    """
    # Step 1: Validate the input array, then read the embedding dimension
    if embeddings.size == 0:
        raise ValueError("Embeddings array cannot be empty")
    
    if len(embeddings.shape) != 2:
        raise ValueError(f"Embeddings must be 2D array, got shape {embeddings.shape}")
    
    if embeddings.dtype != np.float32:
        raise TypeError(f"Embeddings must be float32, got {embeddings.dtype}")
    
    dimension = embeddings.shape[1]
    
    # Step 2: Create FAISS index
    index = faiss.IndexFlatL2(dimension)
    
    # Step 3: Add vectors to index
    index.add(embeddings)
    
    # Step 4: Return the populated index
    return index


def search_index(
    query_embedding: np.ndarray,
    index: faiss.IndexFlatL2,
    top_k: int = TOP_K
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Search FAISS index for most similar vectors.
    
    Args:
        query_embedding: Query vector (1, embedding_dim)
        index: FAISS index
        top_k: Number of results to return
    
    Returns:
        distances: Distance scores (lower = more similar for L2)
        indices: Indices of matched documents
    """
    if len(query_embedding.shape) == 1:
        query_embedding = query_embedding.reshape(1, -1)
    
    # Validate dimension match
    if query_embedding.shape[1] != index.d:
        raise ValueError(f"Query embedding dimension {query_embedding.shape[1]} doesn't match index dimension {index.d}")
    
    
    distances, indices = index.search(query_embedding, top_k)
    return distances[0], indices[0]


def retrieve_context(
    query: str,
    model: SentenceTransformer,
    index: faiss.IndexFlatL2,
    qa_database: List[Dict],
    top_k: int = TOP_K
) -> List[Dict]:
    """
    Retrieve most relevant Q&A pairs for a query.
    
    This is the core RAG retrieval function:
    1. Convert query to embedding
    2. Search FAISS index for similar embeddings
    3. Return corresponding Q&A pairs as context
    
    Args:
        query: User's question
        model: Embedding model
        index: FAISS index
        qa_database: Original Q&A database
        top_k: Number of results to return
    
    Returns:
        List of relevant Q&A pairs with distance scores
    """
    if not query or not query.strip():
        raise ValueError("Query cannot be empty")
    
    if not qa_database:
        raise ValueError("Q&A database cannot be empty")
    
    # Generate query embedding
    query_embedding = model.encode([query], convert_to_numpy=True).astype('float32')
    
    # Search index
    distances, indices = search_index(query_embedding, index, top_k)
    
    # Get corresponding Q&A pairs
    results = []
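    for dist, idx in zip(distances, indices):
        # FAISS pads missing results with index -1 (e.g. when top_k > index.ntotal),
        # so reject negative indices as well as out-of-range ones
        if 0 <= idx < len(qa_database):
            qa = qa_database[idx].copy()
            qa['distance'] = float(dist)
            # IndexFlatL2 reports squared L2 distances; map them to a (0, 1] score
            qa['similarity_score'] = 1 / (1 + float(dist))
            results.append(qa)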
    
    return results


def format_context_for_llm(retrieved_qa: List[Dict]) -> str:
    """
    Format retrieved Q&A pairs as context for LLM.
    
    This context will be injected into the prompt for RAG.
    """
    if not retrieved_qa:
        return "No relevant information found in knowledge base."
    
    context_parts = []
    for i, qa in enumerate(retrieved_qa, 1):
        context_parts.append(f"[Context {i}]")
        context_parts.append(f"Q: {qa['question']}")
        context_parts.append(f"A: {qa['answer']}")
        context_parts.append("")
    
    return "\n".join(context_parts)


# ============================================================================
# SECTION 5: DISPLAY & STORAGE FUNCTIONS
# ============================================================================

def display_retrieval_results(query: str, results: List[Dict]):
    """Display retrieval results in a formatted way."""
    print("\n" + "="*70)
    print(f"🔍 RETRIEVAL RESULTS FOR: \"{query}\"")
    print("="*70)
    
    for i, qa in enumerate(results, 1):
        similarity = qa.get('similarity_score', 0) * 100
        answerable = '📗' if qa.get('answerable', True) else '📕'
        
        print(f"\n{answerable} Result {i} (Similarity: {similarity:.1f}%)")
        print(f"   Category: {qa.get('category', 'N/A')}")
        print(f"   Question: {qa.get('question', 'N/A')}")
        print(f"   Answer: {qa.get('answer', 'N/A')[:100]}...")
    
    print("\n" + "="*70)


def save_embeddings(embeddings: np.ndarray, filename: str = "qa_embeddings.npy"):
    """Save embeddings to numpy file."""
    filepath = os.path.join(OUTPUT_DIR, filename)
    np.save(filepath, embeddings)
    print(f"✅ Embeddings saved to: {filepath}")
    return filepath


def save_faiss_index(index: faiss.IndexFlatL2, filename: str = "qa_index.faiss"):
    """Save FAISS index to file."""
    filepath = os.path.join(OUTPUT_DIR, filename)
    faiss.write_index(index, filepath)
    print(f"✅ FAISS index saved to: {filepath}")
    return filepath


def save_retrieval_test_results(test_results: List[Dict], filename: str = "retrieval_test_results.csv"):
    """Save retrieval test results to CSV."""
    filepath = os.path.join(OUTPUT_DIR, filename)
    
    # Flatten results for CSV
    rows = []
    for result in test_results:
        for i, retrieved in enumerate(result['retrieved'], 1):
            rows.append({
                'query': result['query'],
                'rank': i,
                'category': retrieved.get('category', ''),
                'answerable': retrieved.get('answerable', True),
                'question': retrieved.get('question', ''),
                'answer': retrieved.get('answer', '')[:200],
                'similarity_score': retrieved.get('similarity_score', 0)
            })
    
    df = pd.DataFrame(rows)
    df.to_csv(filepath, index=False)
    print(f"✅ Retrieval test results saved to: {filepath}")
    return filepath


# ============================================================================
# SECTION 6: VERIFICATION FUNCTION
# ============================================================================

def verify_objective3():
    """
    Verify that Objective 3 completed successfully.
    Checks all variables, embeddings, FAISS index, and files.
    """
    print("="*70)
    print("🔍 OBJECTIVE 3 VERIFICATION")
    print("="*70)
    
    errors = []
    
    # Check if variables exist
    if 'embedding_model' not in globals():
        errors.append("❌ embedding_model not found")
    if 'qa_embeddings' not in globals():
        errors.append("❌ qa_embeddings not found")
    if 'faiss_index' not in globals():
        errors.append("❌ faiss_index not found")
    if 'embed_query' not in globals():
        errors.append("❌ embed_query function not found")
    
    if errors:
        print("\n".join(errors))
        print("="*70)
        return False
    
    # Check embeddings shape
    if qa_embeddings.shape[0] != EXPECTED_QA_COUNT:
        errors.append(f"❌ Expected {EXPECTED_QA_COUNT} embeddings, got {qa_embeddings.shape[0]}")
    if qa_embeddings.shape[1] != EMBEDDING_DIM:
        errors.append(f"❌ Expected {EMBEDDING_DIM} dimensions, got {qa_embeddings.shape[1]}")
    if qa_embeddings.dtype != 'float32':
        errors.append(f"❌ Expected float32 dtype, got {qa_embeddings.dtype}")
    
    # Check FAISS index
    if faiss_index.ntotal != EXPECTED_QA_COUNT:
        errors.append(f"❌ Expected {EXPECTED_QA_COUNT} vectors in index, got {faiss_index.ntotal}")
    if faiss_index.d != EMBEDDING_DIM:
        errors.append(f"❌ Expected {EMBEDDING_DIM} dimensions in index, got {faiss_index.d}")
    
    # Check files exist (paths are built from OUTPUT_DIR so they follow the configuration)
    if not os.path.exists(os.path.join(OUTPUT_DIR, "qa_embeddings.npy")):
        errors.append("❌ qa_embeddings.npy not found")
    if not os.path.exists(os.path.join(OUTPUT_DIR, "qa_index.faiss")):
        errors.append("❌ qa_index.faiss not found")
    
    # Test embed_query function
    try:
        test_embedding = embed_query("test query")
        if test_embedding.shape != (EMBEDDING_DIM,):
            errors.append(f"❌ embed_query() returned wrong shape: {test_embedding.shape}")
    except Exception as e:
        errors.append(f"❌ embed_query() test failed: {e}")
    
    # Print results
    if errors:
        print("\n❌ VERIFICATION FAILED:")
        print("\n".join(errors))
        print("="*70)
        return False
    else:
        print("\n✅ Objective 3 Complete - All variables and files verified")
        print(f"   • Embedding Model: {embedding_model.get_sentence_embedding_dimension()} dimensions")
        print(f"   • Embeddings: {qa_embeddings.shape[0]} vectors × {qa_embeddings.shape[1]} dimensions")
        print(f"   • FAISS Index: {faiss_index.ntotal} vectors indexed")
        print(f"   • embed_query(): Ready")
        print(f"   • Files: Saved to data/vector_database/")
        print("="*70)
        return True


# ============================================================================
# SECTION 7: EXECUTION
# ============================================================================

print("="*70)
print("   OBJECTIVE 3: IMPLEMENT VECTOR DATABASE USING FAISS")
print("   Building Semantic Search for RAG System")
print("="*70)

# --- Step 1: Validate Prerequisites ---
print("\n🔍 STEP 1: Validate Prerequisites")
print("-"*70)
validate_prerequisites()

qa_database = globals()['qa_database']

# --- Step 2: Load Embedding Model ---
print("\n🤖 STEP 2: Load Embedding Model")
print("-"*70)

embedding_model = load_embedding_model()
globals()['embedding_model'] = embedding_model

# --- Step 3: Generate Embeddings for Q&A Database ---
print("\n📊 STEP 3: Generate Embeddings for Q&A Database")
print("-"*70)

# Combine question and answer for richer embeddings
qa_texts = [f"{qa['question']} {qa['answer']}" for qa in qa_database]
print(f"   Generating embeddings for {len(qa_texts)} Q&A pairs...")

qa_embeddings = generate_embeddings(qa_texts, embedding_model)
globals()['qa_embeddings'] = qa_embeddings

print(f"   ✅ Embeddings shape: {qa_embeddings.shape}")

# --- Step 4: Create FAISS Index ---
print("\n🗄️  STEP 4: Create FAISS Index")
print("-"*70)

# Create FAISS index and add embeddings
faiss_index = create_faiss_index(qa_embeddings)
globals()['faiss_index'] = faiss_index

# Create embed_query function for RAG pipeline (used in Objective 4)
def embed_query(query: str) -> np.ndarray:
    """Convert query text to embedding vector for FAISS search."""
    return embedding_model.encode([query], convert_to_numpy=True).astype('float32')[0]

globals()['embed_query'] = embed_query

print(f"   ✅ FAISS index created")
print(f"   • Index type: IndexFlatL2 (exact search)")
print(f"   • Vectors indexed: {faiss_index.ntotal}")
print(f"   • Embedding dimension: {qa_embeddings.shape[1]}")

# --- Step 5: Test Retrieval ---
print("\n🧪 STEP 5: Test Retrieval with Sample Queries")
print("-"*70)

test_queries = [
    "How long does shipping take?",
    "Can I return a product?",
    "What are your business hours?",
    "Do you price match with competitors?",  # Unanswerable
]

test_results = []

for query in test_queries:
    print(f"\n   Testing: \"{query}\"")
    
    retrieved = retrieve_context(
        query=query,
        model=embedding_model,
        index=faiss_index,
        qa_database=qa_database,
        top_k=TOP_K
    )
    
    test_results.append({
        'query': query,
        'retrieved': retrieved
    })
    
    display_retrieval_results(query, retrieved)

# --- Step 6: Show Formatted Context for LLM ---
print("\n📝 STEP 6: Example - Formatted Context for LLM")
print("-"*70)

sample_query = "What is your return policy?"
sample_retrieved = retrieve_context(sample_query, embedding_model, faiss_index, qa_database, TOP_K)
formatted_context = format_context_for_llm(sample_retrieved)

print(f"Query: \"{sample_query}\"\n")
print("Formatted Context for LLM:")
print("-"*70)
print(formatted_context)
print("-"*70)

# --- Step 7: Save All Artifacts ---
print("\n💾 STEP 7: Save Files to data/vector_database/")
print("-"*70)

save_embeddings(qa_embeddings)
save_faiss_index(faiss_index)
save_retrieval_test_results(test_results)

# --- Step 8: Verify Objective 3 ---
print("\n✅ STEP 8: Verify Objective 3")
print("-"*70)
verify_objective3()

# --- Summary ---
print("\n" + "="*70)
print("✅ OBJECTIVE 3 COMPLETE")
print("="*70)
print(f"""
Key Concepts Demonstrated:
  1. Embeddings - Text to vector conversion using sentence-transformers
  2. FAISS Index - Efficient similarity search with IndexFlatL2
  3. RAG Retrieval - Finding relevant context for user queries

📦 FILES SAVED (for submission):
  • {OUTPUT_DIR}/qa_embeddings.npy - Embedding vectors
  • {OUTPUT_DIR}/qa_index.faiss - FAISS index
  • {OUTPUT_DIR}/retrieval_test_results.csv - Test results

📦 GLOBAL VARIABLES:
  • embedding_model: SentenceTransformer model
  • qa_embeddings: numpy array ({qa_embeddings.shape})
  • faiss_index: FAISS IndexFlatL2 ({faiss_index.ntotal} vectors)

🔜 READY FOR OBJECTIVE 4: RAG Pipeline Integration
""")
print("="*70)

   OBJECTIVE 3: IMPLEMENT VECTOR DATABASE USING FAISS
   Building Semantic Search for RAG System

🔍 STEP 1: Validate Prerequisites
----------------------------------------------------------------------
✅ Prerequisites validated
   • System prompt: 1995 chars
   • Q&A database: 21 pairs

🤖 STEP 2: Load Embedding Model
----------------------------------------------------------------------
   Loading embedding model: sentence-transformers/all-MiniLM-L6-v2
   ✅ Model loaded (embedding dim: 384)

📊 STEP 3: Generate Embeddings for Q&A Database
----------------------------------------------------------------------
   Generating embeddings for 21 Q&A pairs...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

   ✅ Embeddings shape: (21, 384)

🗄️  STEP 4: Create FAISS Index
----------------------------------------------------------------------
   ✅ FAISS index created
   • Index type: IndexFlatL2 (exact search)
   • Vectors indexed: 21
   • Embedding dimension: 384

🧪 STEP 5: Test Retrieval with Sample Queries
----------------------------------------------------------------------

   Testing: "How long does shipping take?"

🔍 RETRIEVAL RESULTS FOR: "How long does shipping take?"

📗 Result 1 (Similarity: 63.8%)
   Category: shipping
   Question: How long will it take for my order to arrive?
   Answer: Delivery times for our standard shipping option are 5-7 business days, and express shipping takes 2-...

📗 Result 2 (Similarity: 59.2%)
   Category: customer_service
   Question: What are the shipping options available for my order?
   Answer: We offer standard shipping (5-7 business days) for free on orders over $75. Express shipping (2-3 bu...

📗 Result 3 (Similar

## Objective 4: RAG Pipeline Integration

### 🎯 Goal

Build a complete Retrieval-Augmented Generation (RAG) pipeline that combines semantic search (FAISS) with the Mistral LLM to answer questions using the custom knowledge base. The pipeline will retrieve relevant context from the Q&A database and generate accurate, context-aware responses.

<details>
<summary><b>üì• Prerequisites</b> (Click to expand)</summary>

| Item | Source | Required | Description |
|------|--------|----------|-------------|
| `mistral_model`, `mistral_tokenizer` | Objective 1 | ✅ Yes | Mistral model for answer generation |
| `system_prompt` | Objective 1 | ✅ Yes | System prompt for context-aware responses |
| `qa_database` | Objective 2 | ✅ Yes | Q&A pairs as knowledge base |
| `embedding_model` | Objective 3 | ✅ Yes | SentenceTransformer for query embeddings |
| `faiss_index` | Objective 3 | ✅ Yes | FAISS index for semantic search |
| `embed_query()` | Objective 3 | ✅ Yes | Function to convert text to embeddings |
| GPU (recommended) | System | ⚠️ Optional | Faster inference for Mistral model |

**Note:** This objective requires all previous objectives (1-3) to be completed successfully.

</details>

<br>

<details>
<summary><b>üîß Core Concepts</b> (Click to expand)</summary>

| Concept | Description |
|--------|-------------|
| **RAG Pipeline** | Complete flow: Query → Embed → Retrieve → Augment → Generate |
| **Semantic Retrieval** | Using FAISS to find most relevant Q&A pairs based on embedding similarity |
| **Context Augmentation** | Formatting retrieved documents as context for the LLM |
| **Prompt Engineering** | Combining system prompt, retrieved context, and user question |
| **Response Generation** | Using Mistral to generate answers based on retrieved context |

**Why RAG:**
RAG pairs the precision of retrieval (finding the specific knowledge-base entries that match the question) with the fluency of generation (natural-language responses). Grounding the prompt in retrieved entries reduces the risk of hallucinated answers while keeping a conversational tone.

</details>
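
The flow in the table above (Query → Embed → Retrieve → Augment → Generate) can be sketched without any model dependencies. This is a toy illustration only: `KB`, `toy_embed`, `retrieve`, `augment`, `toy_generate`, and `toy_rag` are hypothetical stand-ins for `qa_database`, `embed_query()`, `search_faiss()`, `format_context()` + `build_prompt()`, and `generate_response()`, and the bag-of-words embedding is an assumption chosen for brevity, not what the notebook uses.

```python
import math
from collections import Counter

# Hypothetical two-entry knowledge base (not the notebook's qa_database).
KB = [
    {"question": "What is your return policy?",
     "answer": "30-day return policy for unopened items in original packaging."},
    {"question": "How long does shipping take?",
     "answer": "Standard shipping takes 5-7 business days."},
]

def toy_embed(text):
    """Stand-in for embed_query(): a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, top_k=1):
    """Stand-in for search_faiss(): rank entries by similarity to the query."""
    q = toy_embed(query)
    ranked = sorted(KB, key=lambda qa: cosine(q, toy_embed(qa["question"])), reverse=True)
    return ranked[:top_k]

def augment(query, docs):
    """Stand-in for format_context() + build_prompt()."""
    context = "\n".join(f"Q: {d['question']}\nA: {d['answer']}" for d in docs)
    return f"{context}\n\nCUSTOMER QUESTION: {query}\nASSISTANT RESPONSE:"

def toy_generate(prompt):
    """Stand-in for generate_response(): echo the first retrieved answer."""
    for line in prompt.splitlines():
        if line.startswith("A: "):
            return line[3:]
    return "No relevant information found."

def toy_rag(query):
    """Chain the four stages in pipeline order."""
    return toy_generate(augment(query, retrieve(query)))

answer = toy_rag("How long will shipping take?")
print(answer)
```

The real pipeline replaces each stand-in with a learned component, but the order of the stages and the data passed between them stay the same.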

<br>

<details>
<summary><b>📦 Outputs</b> (Click to expand)</summary>

**Functions and Classes Created:**
- `rag_query(query, top_k=None, verbose=True)` - Main pipeline function (query → `RAGResult`)
- `search_faiss(query_embedding, top_k=None)` - FAISS similarity search (`top_k` defaults to `RAG_CONFIG["top_k"]`, which is 3)
- `format_context(retrieved_qa)` - Format Q&A pairs as context
- `build_prompt(query, context)` - Combine system prompt + context + question
- `generate_response(augmented_prompt)` - Generate answer using Mistral
- `RAGResult` - Dataclass for pipeline results

**Files Saved:**
- `data/rag_pipeline/rag_test_results.csv` - Test query results
- `data/rag_pipeline/pipeline_config.txt` - Pipeline configuration

**Exported Globals:**
- `rag_query()` - Available for Objective 5 evaluation
- `search_faiss()` - Available for Objective 5 evaluation
- `format_context()` - Available for Objective 5 evaluation

</details>

<br>

<details>
<summary><b>📚 Learning Objectives Demonstrated</b> (Click to expand)</summary>

1. **RAG Pipeline Design**: Building end-to-end retrieval-augmented generation systems
2. **Component Integration**: Combining multiple objectives into a unified pipeline
3. **Semantic Search Application**: Using FAISS for real-world question answering
4. **Context Management**: Formatting and managing retrieved context for LLM generation
5. **Prompt Engineering**: Designing effective prompts that combine system instructions, context, and queries
6. **Modular Architecture**: Creating reusable, testable pipeline components

</details>

<br>

<details>
<summary><b>💡 Tips</b> (Click to expand)</summary>

**Best Practices:**
- Test with both answerable and unanswerable questions
- Monitor retrieved context quality; if irrelevant entries appear, lower `top_k` or raise `similarity_threshold`
- Keep context formatting consistent for reliable LLM processing
- Run `verify_objective4()` to confirm that all components work

**Common Issues:**
- **Irrelevant context**: Adjust `top_k` or `similarity_threshold`, or check embedding quality
- **Hallucinated answers**: Ensure context is properly formatted and included in the prompt
- **Slow generation**: Use a GPU or reduce `max_new_tokens`

**Performance Tips:**
- Cache embeddings for frequently asked questions
- Batch process multiple queries if evaluating many questions
- Monitor token usage to stay within model limits

</details>
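
When tuning `similarity_threshold`, it helps to see what the score `1 / (1 + distance)` means in terms of cosine similarity. The sketch below rests on two assumptions that this section does not confirm: that the index is a FAISS `IndexFlatL2` (which reports squared L2 distances) and that the embeddings are unit-normalized, so that squared distance equals `2 - 2 * cosine`. The helper names are hypothetical.

```python
def similarity_from_distance(dist):
    """Same conversion the pipeline uses: 1 / (1 + distance)."""
    return 1.0 / (1.0 + dist)

def squared_l2_from_cosine(cos_sim):
    """For unit-length vectors, ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 2.0 - 2.0 * cos_sim

def cosine_cutoff(threshold):
    """Cosine similarity at which the score equals `threshold` (unit vectors assumed)."""
    max_dist = 1.0 / threshold - 1.0
    return 1.0 - max_dist / 2.0

# How a few cosine values map onto the pipeline's similarity score.
for cos_sim in (1.0, 0.5, 0.0):
    score = similarity_from_distance(squared_l2_from_cosine(cos_sim))
    print(f"cosine={cos_sim:.1f} -> score={score:.3f}")

# Which cosine values survive a given threshold.
for threshold in (0.3, 0.5):
    print(f"threshold={threshold} keeps cosine >= {cosine_cutoff(threshold):.3f}")
```

Under these assumptions a threshold of 0.3 admits any match with cosine similarity above roughly -0.17, so it filters very little, while a threshold of 0.5 corresponds to a cosine of 0.5. If the embeddings are not normalized, the mapping differs and the threshold should be calibrated empirically.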



In [50]:
# ============================================================================
# OBJECTIVE 4: BUILD COMPLETE RAG PIPELINE
# ============================================================================
#
# PIPELINE COMPONENTS:
#   1. Query Processing - Convert user question to embedding
#   2. Retrieval - Use FAISS to find top-k most similar Q&A pairs
#   3. Augmentation - Combine user question with retrieved context
#   4. Generation - Use Mistral to generate answer from augmented context
#
# WHY THIS ARCHITECTURE:
#   - RAG combines retrieval (accurate, up-to-date info) with generation (natural responses)
#   - Grounds answers in knowledge base, reducing hallucinations
#   - Allows easy updates to knowledge base without retraining models
#
# 100% REUSE FROM PREVIOUS OBJECTIVES:
#   - system_prompt, mistral_tokenizer, mistral_model (Objective 1)
#   - qa_database (Objective 2)
#   - embedding_model, faiss_index, embed_query (Objective 3)
#
# PREREQUISITES: Run Objectives 1, 2, and 3 first
#
# ============================================================================


# ============================================================================
# SECTION 1: IMPORTS & VALIDATION
# ============================================================================

import os
from typing import List, Dict, Optional
from dataclasses import dataclass

try:
    import numpy as np
    import pandas as pd
    import torch
except ImportError as e:
    raise ImportError(f"Missing: {e}. Run: pip install numpy pandas torch")


def validate_prerequisites():
    """Ensure Objectives 1, 2, and 3 were run first."""
    required = {
        'Objective 1': ['system_prompt', 'mistral_tokenizer', 'mistral_model'],
        'Objective 2': ['qa_database'],
        'Objective 3': ['embedding_model', 'faiss_index', 'embed_query']
    }
    
    all_missing = []
    for objective, items in required.items():
        missing = [item for item in items if item not in globals()]
        if missing:
            all_missing.append(f"{objective}: {missing}")
    
    if all_missing:
        raise RuntimeError(f"Missing prerequisites:\n" + "\n".join(all_missing))
    
    print("✅ All prerequisites validated")
    print(f"   • System prompt: {len(globals()['system_prompt'])} chars")
    print(f"   • Q&A database: {len(globals()['qa_database'])} pairs")
    print(f"   • FAISS index: {globals()['faiss_index'].ntotal} vectors")
    print("   • embed_query(): Ready (from Objective 3)")
    return True


# ============================================================================
# SECTION 2: CONFIGURATION
# ============================================================================

# Output directory
OUTPUT_DIR = "data/rag_pipeline"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# ============================================================================
# RAG CONFIGURATION PARAMETERS EXPLAINED
# ============================================================================
#
# top_k: Number of documents to retrieve from FAISS
#   - Higher value (5-10): More context, better coverage, but may include noise
#   - Lower value (1-3): More focused, less noise, but may miss relevant info
#   - Default 3: Good balance for small knowledge bases (21 Q&A pairs)
#
# max_new_tokens: Maximum tokens the model can generate in response
#   - Higher value (500+): Longer, more detailed responses
#   - Lower value (100-200): Concise responses, faster generation
#   - Default 300: Allows comprehensive answers without being verbose
#
# temperature: Controls randomness in sampling (typically 0.0 - 1.0; values above 1.0 are allowed)
#   - Near 0.0: Almost deterministic (factual tasks); for fully greedy decoding,
#     pass do_sample=False to generate() instead of setting temperature to 0
#   - 0.5-0.7: Balanced creativity and coherence (recommended for QA)
#   - 1.0+: More random/creative (creative writing, brainstorming)
#   - Default 0.7: Allows natural variation while staying on-topic
#
# similarity_threshold: Minimum similarity score to include a document (0.0 - 1.0)
#   - Higher value (0.5+): Only very relevant documents, may return few/none
#   - Lower value (0.1-0.3): More documents included, may have lower relevance
#   - Default 0.3: Filters out clearly irrelevant results while being inclusive
#
# ============================================================================

RAG_CONFIG = {
    "top_k": 3,                    # Number of documents to retrieve
    "max_new_tokens": 300,         # Max tokens for generation
    "temperature": 0.7,            # Generation temperature
    "similarity_threshold": 0.3,   # Minimum similarity score (0-1)
}


# ============================================================================
# SECTION 3: RAG RESULT DATA CLASS
# ============================================================================

@dataclass
class RAGResult:
    """Container for RAG pipeline results."""
    query: str
    response: str
    retrieved_context: List[Dict]
    success: bool
    error_message: Optional[str] = None

# Export RAGResult globally
globals()['RAGResult'] = RAGResult


# ============================================================================
# SECTION 4: RAG PIPELINE CORE FUNCTIONS
# ============================================================================
#
# These functions form the complete RAG pipeline:
#   1. search_faiss()       - Search FAISS for similar Q&A pairs
#   2. format_context()     - Format retrieved Q&A as context string
#   3. build_prompt()       - Build augmented prompt
#   4. generate_response()  - Generate response with Mistral
#
# NOTE: embed_query() is REUSED from Objective 3 (not redefined here)
#
# ============================================================================

def search_faiss(query_embedding: np.ndarray, top_k: Optional[int] = None) -> List[Dict]:
    """
    STEP 2: Search FAISS index for similar Q&A pairs.
    
    Reuses: faiss_index from Objective 3, qa_database from Objective 2
    
    Args:
        query_embedding: Query vector from embed_query()
        top_k: Number of results to retrieve
    
    Returns:
        List of Q&A dicts with similarity scores
    """
    if top_k is None:
        top_k = RAG_CONFIG["top_k"]
    
    faiss_index = globals()['faiss_index']
    qa_database = globals()['qa_database']
    
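    # FAISS expects a 2-D, C-contiguous float32 array of shape (n_queries, dim)
    query_embedding = np.ascontiguousarray(query_embedding, dtype=np.float32)
    if query_embedding.ndim == 1:
        query_embedding = query_embedding.reshape(1, -1)
    
    # Search FAISS index (IndexFlatL2 returns squared L2 distances)
    distances, indices = faiss_index.search(query_embedding, top_k)
    
    # Get Q&A pairs with similarity scores
    results = []
    for dist, idx in zip(distances[0], indices[0]):
        # FAISS pads with idx == -1 when fewer than top_k results exist
        if 0 <= idx < len(qa_database):
            similarity = 1 / (1 + float(dist))  # Map distance [0, inf) to similarity (0, 1]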
            if similarity >= RAG_CONFIG["similarity_threshold"]:
                qa = qa_database[idx].copy()
                qa['similarity_score'] = similarity
                qa['distance'] = float(dist)
                results.append(qa)
    
    return results


def format_context(retrieved_qa: List[Dict]) -> str:
    """
    STEP 3: Format retrieved Q&A pairs as context string.
    
    Args:
        retrieved_qa: List of Q&A dicts from search_faiss()
    
    Returns:
        Formatted context string for prompt
    """
    if not retrieved_qa:
        return "No relevant information found in knowledge base."
    
    context_parts = ["RELEVANT INFORMATION FROM KNOWLEDGE BASE:", "-" * 40]
    
    for i, qa in enumerate(retrieved_qa, 1):
        similarity_pct = qa.get('similarity_score', 0) * 100
        context_parts.append(f"\n[Source {i}] (Relevance: {similarity_pct:.0f}%)")
        context_parts.append(f"Q: {qa['question']}")
        context_parts.append(f"A: {qa['answer']}")
    
    context_parts.append("-" * 40)
    return "\n".join(context_parts)


def build_prompt(query: str, context: str) -> str:
    """
    STEP 4: Build augmented prompt combining system prompt, context, and query.
    
    Reuses: system_prompt from Objective 1
    
    Args:
        query: User's question
        context: Formatted context from format_context()
    
    Returns:
        Complete augmented prompt
    """
    system_prompt = globals()['system_prompt']
    
    augmented_prompt = f"""{system_prompt}

{context}

INSTRUCTIONS:
- Answer using ONLY the information provided above
- If information is not available, politely say so
- Be helpful, accurate, and concise

CUSTOMER QUESTION: {query}

ASSISTANT RESPONSE:"""
    
    return augmented_prompt


def generate_response(augmented_prompt: str) -> str:
    """
    STEP 5: Generate response with Mistral model.
    
    Reuses: mistral_tokenizer, mistral_model from Objective 1
    
    Args:
        augmented_prompt: Complete prompt from build_prompt()
    
    Returns:
        Generated response string
    """
    tokenizer = globals()['mistral_tokenizer']
    model = globals()['mistral_model']
    
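    # Format for Mistral Instruct. The tokenizer prepends the BOS token (<s>)
    # by default, so it is not written into the string (avoids a double BOS).
    formatted = f"[INST] {augmented_prompt} [/INST]"
    
    # Tokenize and move tensors to the model's device (CPU or GPU)
    inputs = tokenizer(formatted, return_tensors="pt", truncation=True, max_length=4096)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}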
    
    # Generate
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=RAG_CONFIG["max_new_tokens"],
            temperature=RAG_CONFIG["temperature"],
            do_sample=True,
            top_p=0.9,
            pad_token_id=tokenizer.eos_token_id
        )
    
    # Decode only new tokens
    input_length = inputs['input_ids'].shape[1]
    response = tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True).strip()
    
    return response


# ============================================================================
# SECTION 5: COMPLETE RAG PIPELINE FUNCTION
# ============================================================================

def rag_query(query: str, top_k: Optional[int] = None, verbose: bool = True) -> RAGResult:
    """
    Complete RAG Pipeline: Query → Retrieve → Augment → Generate
    
    This is the main entry point for the RAG system.
    Orchestrates all pipeline functions in sequence.
    
    Args:
        query: User's question
        top_k: Number of documents to retrieve (default: from config)
        verbose: Print step-by-step progress
    
    Returns:
        RAGResult with response and metadata
    
    Example:
        result = rag_query("What is your return policy?")
        print(result.response)
    """
    if top_k is None:
        top_k = RAG_CONFIG["top_k"]
    
    # Get embed_query from Objective 3
    embed_query_func = globals()['embed_query']
    
    try:
        # ============================================================
        # STEP 1: QUERY PROCESSING - Convert to embedding
        # ============================================================
        if verbose:
            print(f"   Step 1: embed_query() - Converting query to embedding...")
        query_embedding = embed_query_func(query)
        
        # Ensure proper shape
        if len(query_embedding.shape) == 1:
            query_embedding = query_embedding.reshape(1, -1)
        
        # ============================================================
        # STEP 2: RETRIEVAL - Search FAISS for similar Q&A
        # ============================================================
        if verbose:
            print(f"   Step 2: search_faiss() - Searching for similar Q&A pairs...")
        retrieved_qa = search_faiss(query_embedding, top_k)
        if verbose:
            print(f"           → Found {len(retrieved_qa)} relevant documents")
        
        # ============================================================
        # STEP 3: AUGMENTATION (Part 1) - Format context
        # ============================================================
        if verbose:
            print(f"   Step 3: format_context() - Formatting retrieved context...")
        context = format_context(retrieved_qa)
        
        # ============================================================
        # STEP 4: AUGMENTATION (Part 2) - Build prompt
        # ============================================================
        if verbose:
            print(f"   Step 4: build_prompt() - Building augmented prompt...")
        augmented_prompt = build_prompt(query, context)
        
        # ============================================================
        # STEP 5: GENERATION - Generate response with Mistral
        # ============================================================
        if verbose:
            print(f"   Step 5: generate_response() - Generating response with Mistral...")
        response = generate_response(augmented_prompt)
        if verbose:
            print(f"           → Response generated: {len(response)} chars")
        
        return RAGResult(
            query=query,
            response=response,
            retrieved_context=retrieved_qa,
            success=True
        )
        
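    except Exception as e:
        # Return a failed RAGResult instead of raising, so batch test runs continue
        if verbose:
            print(f"   Pipeline error: {e}")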
        return RAGResult(
            query=query,
            response="I apologize, but I encountered an error processing your request.",
            retrieved_context=[],
            success=False,
            error_message=str(e)
        )


# ============================================================================
# SECTION 6: DISPLAY & STORAGE FUNCTIONS
# ============================================================================

def display_rag_result(result: RAGResult):
    """Display RAG result in formatted way."""
    print("\n" + "="*70)
    print("🤖 RAG PIPELINE RESULT")
    print("="*70)
    
    print("\n📥 USER QUERY:")
    print(f"   {result.query}")
    
    print(f"\n📚 RETRIEVED CONTEXT ({len(result.retrieved_context)} sources):")
    for i, ctx in enumerate(result.retrieved_context, 1):
        similarity = ctx.get('similarity_score', 0) * 100
        answerable = '📗' if ctx.get('answerable', True) else '📕'
        print(f"   {answerable} [{i}] {ctx.get('category', 'N/A')} (Similarity: {similarity:.0f}%)")
        print(f"       Q: {ctx['question'][:60]}...")
    
    print("\n📤 GENERATED RESPONSE:")
    print("-"*70)
    print(result.response)
    print("-"*70)
    
    if not result.success:
        print(f"\n⚠️  Error: {result.error_message}")
    
    print("="*70)


def save_rag_results(results: List[RAGResult], filename: str = "rag_test_results.csv"):
    """Save RAG test results to CSV."""
    filepath = os.path.join(OUTPUT_DIR, filename)
    
    rows = []
    for result in results:
        rows.append({
            'query': result.query,
            'response': result.response[:500],
            'num_sources': len(result.retrieved_context),
            'top_source_similarity': result.retrieved_context[0]['similarity_score'] if result.retrieved_context else 0,
            'success': result.success,
            'error': result.error_message
        })
    
    df = pd.DataFrame(rows)
    df.to_csv(filepath, index=False)
    print(f"✅ RAG results saved to: {filepath}")
    return filepath


def save_pipeline_config(filename: str = "pipeline_config.txt"):
    """Save pipeline configuration."""
    filepath = os.path.join(OUTPUT_DIR, filename)
    
    config_text = f"""RAG PIPELINE CONFIGURATION
==========================

Retrieval Settings:
- Top-K: {RAG_CONFIG['top_k']}
- Similarity Threshold: {RAG_CONFIG['similarity_threshold']}

Generation Settings:
- Max New Tokens: {RAG_CONFIG['max_new_tokens']}
- Temperature: {RAG_CONFIG['temperature']}

Components:
- Embedding Model: sentence-transformers/all-MiniLM-L6-v2
- Vector Store: FAISS IndexFlatL2
- LLM: Mistral-7B-Instruct-v0.3
- Q&A Database: {len(globals().get('qa_database', []))} pairs
"""
    
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(config_text)
    
    print(f"✅ Pipeline config saved to: {filepath}")
    return filepath


# ============================================================================
# SECTION 7: VERIFICATION FUNCTION
# ============================================================================

def verify_objective4():
    """
    Verify that Objective 4 completed successfully.
    Checks all functions, files, and test results.
    """
    print("="*70)
    print("🔍 OBJECTIVE 4 VERIFICATION")
    print("="*70)
    
    errors = []
    
    # Check required functions exist
    required_functions = ['rag_query', 'search_faiss', 'format_context', 
                          'build_prompt', 'generate_response', 'embed_query']
    for func_name in required_functions:
        if func_name not in globals() or not callable(globals()[func_name]):
            errors.append(f"❌ Function '{func_name}' not found")
    
    # Check RAGResult dataclass
    if 'RAGResult' not in globals():
        errors.append("❌ RAGResult dataclass not found")
    
    # Check files exist (paths derived from OUTPUT_DIR rather than hard-coded)
    for filename in ("rag_test_results.csv", "pipeline_config.txt"):
        if not os.path.exists(os.path.join(OUTPUT_DIR, filename)):
            errors.append(f"❌ {filename} not found")
    
    # Check test results (5 answerable + 5 unanswerable queries)
    if 'rag_results' not in globals():
        errors.append("❌ rag_results not found")
    elif len(globals()['rag_results']) != 10:
        errors.append(f"❌ Expected 10 test results, got {len(globals()['rag_results'])}")
    
    # Test rag_query function
    try:
        test_result = rag_query("test query", verbose=False)
        if not isinstance(test_result, RAGResult):
            errors.append("❌ rag_query() does not return RAGResult")
    except Exception as e:
        errors.append(f"❌ rag_query() test failed: {e}")
    
    # Print results
    if errors:
        print("\n❌ VERIFICATION FAILED:")
        print("\n".join(errors))
        print("="*70)
        return False
    else:
        print("\n✅ Objective 4 Complete - All components verified")
        print(f"   • Pipeline Functions: {len(required_functions)} verified")
        print("   • RAGResult: Defined")
        print(f"   • Test Results: {len(globals()['rag_results'])} queries")
        print(f"   • Files: Saved to {OUTPUT_DIR}/")
        print("="*70)
        return True


# ============================================================================
# SECTION 8: EXECUTION
# ============================================================================

print("="*70)
print("   OBJECTIVE 4: BUILD COMPLETE RAG PIPELINE")
print("="*70)

# --- Step 1: Validate Prerequisites ---
print("\n🔍 STEP 1: Validate Prerequisites")
print("-"*70)
validate_prerequisites()

# --- Step 2: Show Pipeline Architecture ---
print("\n📐 STEP 2: RAG Pipeline Architecture")
print("-"*70)
print("""
┌─────────────────────────────────────────────────────────────────┐
│                      RAG PIPELINE FLOW                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  User Query                                                     │
│      │                                                          │
│      ▼                                                          │
│  ┌─────────────────┐                                            │
│  │ 1. embed_query()│  Convert query to embedding (from Obj 3)   │
│  └────────┬────────┘                                            │
│           ▼                                                     │
│  ┌──────────────────┐                                           │
│  │ 2. search_faiss()│  Find similar Q&A pairs                   │
│  └────────┬─────────┘                                           │
│           ▼                                                     │
│  ┌────────────────────┐                                         │
│  │ 3. format_context()│  Format as context string               │
│  └────────┬───────────┘                                         │
│           ▼                                                     │
│  ┌──────────────────┐                                           │
│  │ 4. build_prompt()│  Combine system + context + query         │
│  └────────┬─────────┘                                           │
│           ▼                                                     │
│  ┌───────────────────────┐                                      │
│  │ 5. generate_response()│  Generate with Mistral               │
│  └────────┬──────────────┘                                      │
│           ▼                                                     │
│      Response                                                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
""")

# --- Step 3: Test RAG Pipeline with Answerable Questions ---
print("\n🧪 STEP 3: Test with ANSWERABLE Questions")
print("-"*70)
print("   These questions CAN be answered from our knowledge base")

answerable_questions = [
    "What is your return policy?",
    "How long does shipping take?",
    "What are your customer service hours?",
    "Do you offer warranty on products?",
    "How can I track my order?",
]

answerable_results = []

for query in answerable_questions:
    print(f"\n{'='*70}")
    print(f"📗 ANSWERABLE: \"{query}\"")
    print('='*70)
    
    result = rag_query(query, verbose=True)
    answerable_results.append(result)
    display_rag_result(result)

# --- Step 4: Test RAG Pipeline with Unanswerable Questions ---
print("\n🧪 STEP 4: Test with UNANSWERABLE Questions")
print("-"*70)
print("   These questions CANNOT be answered from our knowledge base")
print("   Testing system limitations and graceful handling")

unanswerable_questions = [
    "How do your prices compare to Amazon?",
    "Should I buy solar panels for my house?",
    "Will you have a Black Friday sale this year?",
    "What is the CEO's email address?",
    "Can you recommend a good restaurant nearby?",
]

unanswerable_results = []

for query in unanswerable_questions:
    print(f"\n{'='*70}")
    print(f"📕 UNANSWERABLE: \"{query}\"")
    print('='*70)
    
    result = rag_query(query, verbose=True)
    unanswerable_results.append(result)
    display_rag_result(result)

# --- Step 5: Save Results ---
print("\n💾 STEP 5: Save Results")
print("-"*70)

all_results = answerable_results + unanswerable_results
save_rag_results(all_results)
save_pipeline_config()

# Store globally for reuse
globals()['rag_query'] = rag_query
globals()['rag_results'] = all_results

# Export core functions globally
globals()['search_faiss'] = search_faiss
globals()['format_context'] = format_context
globals()['build_prompt'] = build_prompt
globals()['generate_response'] = generate_response
globals()['verify_objective4'] = verify_objective4

# --- Step 6: Verify Objective 4 ---
print("\n✅ STEP 6: Verify Objective 4")
print("-"*70)
verify_objective4()

# --- Summary ---
print("\n" + "="*70)
print("✅ OBJECTIVE 4 COMPLETE")
print("="*70)

answerable_success = sum(1 for r in answerable_results if r.success)
unanswerable_success = sum(1 for r in unanswerable_results if r.success)

print(f"""
RAG Pipeline Components:
  1. Query Processing: embed_query() - Convert to embedding (from Objective 3)
  2. Retrieval: search_faiss() - FAISS similarity search (top-{RAG_CONFIG['top_k']})
  3. Augmentation: format_context() + build_prompt()
  4. Generation: generate_response() - Mistral-7B response

Test Results:
  📗 Answerable Questions: {answerable_success}/{len(answerable_results)} successful
  📕 Unanswerable Questions: {unanswerable_success}/{len(unanswerable_results)} successful

📦 FILES SAVED:
  • {OUTPUT_DIR}/rag_test_results.csv
  • {OUTPUT_DIR}/pipeline_config.txt

📦 GLOBAL FUNCTIONS:
  • rag_query(query) - Complete RAG pipeline
  • search_faiss(), format_context(), build_prompt(), generate_response()
  • embed_query() - Reused from Objective 3
  • verify_objective4() - Verification function

🔜 READY FOR OBJECTIVE 5: Model Evaluation & Ranking
""")
print("="*70)


   OBJECTIVE 4: BUILD COMPLETE RAG PIPELINE

🔍 STEP 1: Validate Prerequisites
----------------------------------------------------------------------
✅ All prerequisites validated
   • System prompt: 1995 chars
   • Q&A database: 21 pairs
   • FAISS index: 21 vectors
   • embed_query(): Ready (from Objective 3)

📐 STEP 2: RAG Pipeline Architecture
----------------------------------------------------------------------

┌─────────────────────────────────────────────────────────────────┐
│                      RAG PIPELINE FLOW                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  User Query

## Objective 5: Model Evaluation & Ranking

### 🎯 Goal

Evaluate 6 question-answering models using the RAG pipeline, compare their performance using **5 evaluation metrics**, and rank models to identify the best performer based on weighted scoring.

<details>
<summary><b>📥 Prerequisites</b> (Click to expand)</summary>

| Item | Source | Required | Description |
|------|--------|----------|-------------|
| `qa_database` | Objective 2 | ✅ Yes | Ground truth answers for comparison |
| `embed_query()` | Objective 3 | ✅ Yes | For semantic similarity calculation |
| `rag_query()` | Objective 4 | ✅ Yes | Complete RAG pipeline for dynamic context |
| `search_faiss()`, `format_context()` | Objective 4 | ✅ Yes | RAG pipeline components |

**Note:** Requires Objectives 1-4 completed. Achieves **100% component reuse**.

</details>

<br>

<details>
<summary><b>📊 The 5 Evaluation Metrics</b> (Click to expand)</summary>

### 1. Accuracy (BERTScore F1) - Weight: 25%
**What it measures:** Semantic similarity between model answer and ground truth using BERTScore

**Why BERTScore instead of token F1:**
- Token F1 fails on paraphrases: "30 day return" vs "30-day refund" = low score
- BERTScore understands semantics: same meaning = high score
- Better for RAG evaluation where answers may be paraphrased
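
The paraphrase failure described above can be reproduced with a small token-overlap F1. This is a minimal sketch that assumes lowercased, whitespace-split tokens; `token_f1` is a hypothetical helper written for illustration, not a function from the notebook or from the BERTScore library.

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-overlap F1 on lowercased, whitespace-split tokens."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

paraphrase_score = token_f1("30 day return", "30-day refund")
exact_score = token_f1("30 day return", "30 day return")
print(paraphrase_score, exact_score)
```

With this tokenization the two phrasings share no tokens at all ("30" and "day" versus "30-day"), so the paraphrase gets no credit even though the meaning is close. That gap is what an embedding-based metric is meant to close.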

**Implementation:**
- Uses **BERTScore** with **DeBERTa-large-mnli** model
- Calculates semantic F1 score (0-1 range)
- Batch processing for efficiency
- Rescaled with baseline for better calibration

**Only computed for:** Answerable questions (unanswerable have no expected answer)

---

### 2. Confidence (Calibration) - Weight: 20%
**What it measures:** Whether model appropriately indicates uncertainty (calibration)

**Scoring Logic:**

**For Answerable Questions:**
- Empty answer (< 3 chars) → **Score: 0.2** ⚠️
- Non-empty answer → **Score: raw_confidence** (0-1)

**For Unanswerable Questions:**
- Empty answer (abstains) → **Score: 1.0** ✅ (correct behavior)
- Low confidence (< 0.3) → **Score: 0.9** ✅
- Medium confidence (0.3-0.5) → **Score: 0.6** ⚠️
- High confidence (> 0.5) → **Score: 0.2** ❌ (overconfident)

**Computed for:** ALL questions (both answerable and unanswerable)
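
The scoring rules above translate into a short function. This is a sketch under assumptions: `confidence_score` is a hypothetical name, `raw_confidence` is taken to be a model-reported score in the 0-1 range, an "empty" answer is one shorter than 3 characters in both branches, and a confidence of exactly 0.5 is treated as medium.

```python
def confidence_score(answer, raw_confidence, answerable):
    """Calibration score following the rules listed above (boundary handling is assumed)."""
    is_empty = len(answer.strip()) < 3
    if answerable:
        # Answerable: penalize abstaining, otherwise pass the model's confidence through.
        return 0.2 if is_empty else raw_confidence
    # Unanswerable: abstaining is the correct behavior.
    if is_empty:
        return 1.0
    if raw_confidence < 0.3:
        return 0.9
    if raw_confidence <= 0.5:
        return 0.6
    return 0.2  # confident answer to a question that has no answer

print(confidence_score("30 days", 0.85, answerable=True))
print(confidence_score("", 0.0, answerable=False))
print(confidence_score("Probably yes", 0.9, answerable=False))
```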

---

### 3. Quality (Semantic Similarity) - Weight: 25%
**What it measures:** Semantic meaning similarity using embeddings

**Formula:**
```
Quality = cosine_similarity(embed_query(answer), embed_query(expected))
```

**Calculation:**
- Uses `embed_query()` from Objective 3 (component reuse!)
- Embed both answer and expected answer
- Calculate cosine similarity between embeddings
- Range: typically 0-1 for these sentence embeddings (1 = identical meaning); cosine similarity can be negative in general

**Only computed for:** Answerable questions

**Key Insight:** Reuses `embed_query()` from Objective 3 - same function used in RAG retrieval!
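
The cosine step can be sketched in plain Python. The vectors below are hypothetical placeholders for what `embed_query()` would return, and clamping the result into the 0-1 range is an assumption made here to match the stated range, not something this section confirms.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def quality_score(answer_vec, expected_vec):
    """Cosine similarity clamped to [0, 1] (the clamp is an assumption)."""
    return max(0.0, min(1.0, cosine_similarity(answer_vec, expected_vec)))

# Hypothetical embeddings standing in for embed_query(answer) and embed_query(expected).
same = quality_score([0.2, 0.4, 0.4], [0.2, 0.4, 0.4])
unrelated = quality_score([1.0, 0.0], [0.0, 1.0])
opposite = quality_score([1.0, 0.0], [-1.0, 0.0])
print(same, unrelated, opposite)
```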

---

### 4. Speed - Weight: 15%
**What it measures:** Response time performance

**Formula:**
```
Speed = max(0, 1 - (response_time_ms / 2000))
```

**Calculation:**
- Measure time from query to answer (milliseconds)
- Normalize against 2000ms threshold
- 0ms = 1.0 (perfect), 2000ms+ = 0.0 (too slow)
- Faster responses = higher score
- Range: 0-1 (clamped)

**Computed for:** ALL questions
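
A minimal sketch of the speed metric, including one way to measure the response time. `speed_score` and `timed_call` are hypothetical helper names, and the use of `time.perf_counter()` is an assumption about how timing could be done.

```python
import time

def speed_score(response_time_ms, threshold_ms=2000.0):
    """Linear score: 0 ms -> 1.0, threshold_ms or slower -> 0.0."""
    return max(0.0, min(1.0, 1.0 - response_time_ms / threshold_ms))

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Example: time a trivial function and score it.
result, elapsed_ms = timed_call(sum, [1, 2, 3])
print(result, speed_score(elapsed_ms))
```

A 500 ms response scores 0.75 under this formula, and anything at or beyond 2000 ms scores 0.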

---

### 5. Robustness - Weight: 15%
**What it measures:** Edge case handling and error recovery

**Scoring Logic:**

**For Answerable Questions:**
- **Error during inference** → Score: 0.0 ❌
- **Empty answer** (< 3 chars) → Score: 0.3 ⚠️
- **Long answer** (> 50 words) → Score: 0.7 ⚠️ (possible verbosity)
- **Normal answer** (non-empty, up to 50 words) → Score: 1.0 ✅

**For Unanswerable Questions:**
- **Error during inference** → Score: 0.0 ❌
- **Empty answer** (abstains) → Score: 1.0 ✅ (correct behavior)
- **Long answer** (> 30 words) → Score: 0.2 ❌ (hallucinating)
- **Short answer** (non-empty, up to 30 words) → Score: 0.6 ⚠️

**Computed for:** ALL questions
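
The robustness rules reduce to a short decision function. The sketch below is a simplified illustration with a hypothetical name (`robustness_score`); emptiness is measured in characters and length in whitespace-separated words, as in the rules above.

```python
def robustness_score(answer: str, is_answerable: bool, had_error: bool) -> float:
    """Score error handling and answer length for one response."""
    if had_error:
        return 0.0
    is_empty = len(answer.strip()) < 3
    word_count = len(answer.split())
    if is_answerable:
        if is_empty:
            return 0.3
        return 0.7 if word_count > 50 else 1.0
    # Unanswerable: abstaining is ideal, long answers suggest hallucination.
    if is_empty:
        return 1.0
    return 0.2 if word_count > 30 else 0.6


print(robustness_score("30-day return policy", True, False))
print(robustness_score("word " * 60, True, False))
print(robustness_score("", False, False))
print(robustness_score("word " * 40, False, False))
```

Checking `had_error` first means a crashed inference always scores 0, even on unanswerable questions where its empty answer would otherwise look like a correct abstention.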

</details>

<br>

<details>
<summary><b>🏆 Ranking Calculation</b> (Click to expand)</summary>

### Step 1: Calculate Per-Question Metrics
For each question, compute all 5 metrics:
- Accuracy (F1) - if answerable
- Confidence - always
- Quality (Semantic) - if answerable
- Speed - always
- Robustness - always

### Step 2: Aggregate Per Model
For each model, average each metric across all questions:
```
avg_accuracy = mean(accuracy_scores for answerable questions)
avg_confidence = mean(confidence_scores for all questions)
avg_quality = mean(quality_scores for answerable questions)
avg_speed = mean(speed_scores for all questions)
avg_robustness = mean(robustness_scores for all questions)
```
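
The reason for averaging Accuracy and Quality over answerable questions only is easiest to see on a toy example. The records below are made-up values for a hypothetical `model_a`, and the sketch uses plain Python where the notebook uses pandas.

```python
from statistics import mean

# Per-question scores for one hypothetical model (illustrative values only).
records = [
    {"model": "model_a", "is_answerable": True,  "accuracy": 0.8, "confidence": 0.9, "speed": 0.7},
    {"model": "model_a", "is_answerable": True,  "accuracy": 0.6, "confidence": 0.7, "speed": 0.9},
    {"model": "model_a", "is_answerable": False, "accuracy": 0.0, "confidence": 1.0, "speed": 0.8},
]


def aggregate(rows, model):
    """Average each metric for one model over the applicable questions."""
    model_rows = [r for r in rows if r["model"] == model]
    answerable = [r for r in model_rows if r["is_answerable"]]
    return {
        # Unanswerable rows carry a placeholder 0.0, so they are excluded here.
        "avg_accuracy": mean([r["accuracy"] for r in answerable]) if answerable else 0.0,
        "avg_confidence": mean([r["confidence"] for r in model_rows]),
        "avg_speed": mean([r["speed"] for r in model_rows]),
    }


summary = aggregate(records, "model_a")
print(summary)
```

Including the unanswerable row would pull the accuracy average from about 0.70 down to about 0.47, penalizing the model for a question that has no expected answer.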

### Step 3: Calculate Final Score
Weighted combination of all 5 metrics:

```
Final Score = (Accuracy × 0.25) +
              (Confidence × 0.20) +
              (Quality × 0.25) +
              (Speed × 0.15) +
              (Robustness × 0.15)
```

**Weight Distribution:**
- **Accuracy + Quality = 50%** (correctness and meaning)
- **Confidence = 20%** (uncertainty handling)
- **Speed + Robustness = 30%** (performance and reliability)

### Step 4: Rank Models
Sort all models by Final Score (descending):
- Rank 1 = Highest Final Score (Best Model)
- Rank 2 = Second highest
- ... and so on

### Example Calculation:

**Model: RoBERTa-SQuAD2**
- Accuracy: 0.847
- Confidence: 0.756
- Quality: 0.891
- Speed: 0.823
- Robustness: 0.950

**Final Score:**
```
(0.847 × 0.25) + (0.756 × 0.20) + (0.891 × 0.25) + (0.823 × 0.15) + (0.950 × 0.15)
= 0.212 + 0.151 + 0.223 + 0.123 + 0.143
= 0.852
```
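
The same arithmetic can be checked in a few lines. This sketch reuses the example values above; the second entry, `placeholder-model`, is a hypothetical model with made-up scores, included only to show the ranking step.

```python
# Weights as documented above; they should sum to 1.0.
METRIC_WEIGHTS = {
    "accuracy": 0.25,
    "confidence": 0.20,
    "quality": 0.25,
    "speed": 0.15,
    "robustness": 0.15,
}


def final_score(metrics: dict) -> float:
    """Weighted sum of the five per-model averages."""
    return sum(metrics[name] * weight for name, weight in METRIC_WEIGHTS.items())


# Values from the RoBERTa-SQuAD2 example calculation.
example_metrics = {
    "accuracy": 0.847, "confidence": 0.756, "quality": 0.891,
    "speed": 0.823, "robustness": 0.950,
}
# Hypothetical second model, used only to demonstrate ranking.
placeholder_metrics = {name: 0.5 for name in METRIC_WEIGHTS}

scores = {
    "RoBERTa-SQuAD2": final_score(example_metrics),
    "placeholder-model": final_score(placeholder_metrics),
}
# Rank 1 is the highest final score.
ranking = sorted(scores, key=scores.get, reverse=True)
for rank, model in enumerate(ranking, start=1):
    print(rank, model, round(scores[model], 3))
```

Rounded to three decimals, the example model's score matches the 0.852 computed by hand.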

</details>

<br>

<details>
<summary><b>🤖 Models Evaluated</b> (Click to expand)</summary>

| Rank | Model Name | Model ID | Type | Size | Params |
|------|-----------|----------|------|------|--------|
| - | T5-QA-Generative | consciousAI/question-answering-generative-t5-v1-base-s-q-c | text2text-generation | Base | 220M |
| - | RoBERTa-SQuAD2 | deepset/roberta-base-squad2 | question-answering | Base | 125M |
| - | BERT-Large-SQuAD | google-bert/bert-large-cased-whole-word-masking-finetuned-squad | question-answering | Large | 340M |
| - | DistilBERT-SQuAD | distilbert-base-uncased-distilled-squad | question-answering | Small | 66M |
| - | BERT-Tiny-SQuAD | mrm8488/bert-tiny-finetuned-squadv2 | question-answering | Tiny | 4M |
| - | MiniLM-SQuAD | deepset/minilm-uncased-squad2 | question-answering | Small | 33M |

**Note:** Rankings determined by Final Score after evaluation.

</details>

<br>

<details>
<summary><b>📋 Test Questions</b> (Click to expand)</summary>

**Answerable (5 questions):**
1. "What is your return policy?" → Expected: "30-day return policy with full refund"
2. "How long does shipping take?" → Expected: "Standard shipping takes 5-7 business days"
3. "What are your customer service hours?" → Expected: "Monday to Friday 9am to 5pm"
4. "Do you offer warranty on products?" → Expected: "1 year warranty on all electronics"
5. "How can I track my order?" → Expected: "Use tracking number on our website"

**Unanswerable (3 questions):**
1. "What is the CEO's phone number?" → Not in knowledge base
2. "Will you have a Black Friday sale?" → Not in knowledge base
3. "How do prices compare to Amazon?" → Not in knowledge base

**Why this split:**
- Answerable questions test: Accuracy, Quality
- Unanswerable questions test: Confidence Handling, Robustness
- All questions test: Speed

</details>

<br>

<details>
<summary><b>🔄 Evaluation Process</b> (Click to expand)</summary>

**3-Stage Caching System:**
1. **Contexts Cache** (`contexts.csv`) - RAG contexts from `rag_query()` (Obj 4)
2. **Embeddings Cache** (`embeddings.csv`) - Embeddings computed via `embed_query()` (Obj 3)
3. **Responses Cache** (`raw_responses.csv`) - Model inference results (6 models × 8 questions)
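
Each cache stage follows the same load-or-compute pattern. The sketch below shows that pattern in isolation, under stated assumptions: `load_or_compute` and `expensive_step` are hypothetical names, and the example uses the standard `csv` module where the notebook uses pandas.

```python
import csv
import os
import tempfile


def load_or_compute(path, compute_rows, force_refresh=False):
    """Return cached rows from `path`, or compute, save, and return them."""
    if not force_refresh and os.path.exists(path):
        with open(path, newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    rows = compute_rows()  # assumed to return a non-empty list of dicts
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    return rows


calls = []


def expensive_step():
    """Stand-in for retrieval, embedding, or model inference."""
    calls.append(1)
    return [{"question": "What is your return policy?", "context": "stub context"}]


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "contexts.csv")
    first = load_or_compute(cache_path, expensive_step)    # computes and saves
    second = load_or_compute(cache_path, expensive_step)   # served from cache
    third = load_or_compute(cache_path, expensive_step, force_refresh=True)

print(len(calls))
```

One caveat of CSV caches is that values lose their types on the round trip: every field comes back as a string with the `csv` module, and pandas reads empty fields back as `NaN`, so cached data may need normalizing before it is scored.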

**For Each Model:**
1. Load model from HuggingFace
2. For each test question:
   - Get RAG context using `rag_query()` (reuses Obj 4, cached)
   - Run model inference with retrieved context
   - Measure response time
   - Store raw response and confidence

**Metric Calculation (Batch Processing):**
- Uses **ScorePack** class for unified scoring
- **Accuracy:** BERTScore F1 (DeBERTa-large-mnli) - batch processed
- **Quality:** Cosine similarity using `embed_query()` (reuses Obj 3)
- **Confidence:** Calibration scoring based on answerability
- **Speed:** Normalized latency (2000ms threshold)
- **Robustness:** Error handling and answer length checks

**After All Models:**
- Aggregate metrics per model (average across questions)
- Calculate Final Scores (weighted combination)
- Rank models by Final Score
- Save to `model_rankings.csv`, `model_responses.csv`, `evaluation_summary.txt`

</details>

<br>

<details>
<summary><b>📤 Outputs</b> (Click to expand)</summary>

**Files Created:**
- `model_rankings.csv` - Final rankings with all metrics
- `model_responses.csv` - Detailed per-question results
- `evaluation_summary.txt` - Text summary with winner
- `contexts.csv` - Cached RAG contexts (cache)
- `embeddings.csv` - Cached embeddings (cache)
- `raw_responses.csv` - Cached model responses (cache)

**Global Variables:**
- `rankings_df` - DataFrame with model rankings (sorted by Final_Score)
- `detailed_df` - DataFrame with per-question detailed results

**CSV Columns (model_rankings.csv):**
| Rank | Model | Accuracy | Quality | Confidence | Speed | Robustness | Final_Score |
|------|-------|----------|---------|------------|-------|------------|-------------|

**Helper Functions:**
- `run_evaluation(force_refresh=False)` - Full evaluation pipeline
- `recalculate_metrics_only()` - Recalculate metrics from cache (~30 sec)
- `clean()` - Delete all cache and output files
- `show_cache_status()` - Display cache file status

</details>

<br>

<details>
<summary><b>🔗 Component Reuse</b> (Click to expand)</summary>

**100% Reuse from Previous Objectives:**

| Component | From | Used For |
|-----------|------|----------|
| `rag_query()` | Obj 4 | Dynamic context retrieval per question (cached) |
| `embed_query()` | Obj 3 | **Semantic similarity metric** (Quality) + embeddings cache |
| `search_faiss()` | Obj 4 | Via `rag_query()` |
| `format_context()` | Obj 4 | Via `rag_query()` |
| `qa_database` | Obj 2 | Ground truth answers |
| `faiss_index` | Obj 3 | Via `search_faiss()` |

**Key Insights:**
1. **`embed_query()` dual use:** Reused for BOTH RAG retrieval (Obj 4) AND semantic similarity (Obj 5)
2. **ScorePack class:** Unified scoring system with batch processing and embedding caching
3. **3-stage caching:** Contexts, embeddings, and responses cached for fast re-runs
4. **Batch BERTScore:** Efficient batch processing using DeBERTa-large-mnli model

This demonstrates true modular design with zero code duplication and efficient caching.

</details>

<br>

<details>
<summary><b>✅ Verification</b> (Click to expand)</summary>

**Verification:**
- ✅ `rankings_df` exists with 6 models
- ✅ `model_rankings.csv` file created
- ✅ `model_responses.csv` with detailed per-question results
- ✅ All 5 metrics present (Accuracy, Quality, Confidence, Speed, Robustness)
- ✅ Final_Score calculated using weighted combination
- ✅ Models ranked by Final_Score (descending)

**Performance:**
- **First run:** ~10-15 min (downloads models, runs inference, computes metrics)
- **Subsequent runs:** ~30 sec (loads from cache, recalculates metrics only)
- **Recalculate only:** Use `recalculate_metrics_only()` to update metrics from cached responses

</details>

<br>

---
**Next Step:** Proceed to Objective 6 for system analysis and recommendations.


In [52]:
# ============================================================================
# OBJECTIVE 5: MODEL EVALUATION (ScorePack)
# ============================================================================
#
# WHAT THIS DOES:
#   - Evaluates 6 QA models on our RAG pipeline
#   - Uses 5 metrics: Accuracy, Quality, Confidence, Speed, Robustness
#   - Ranks models by weighted final score
#
# REUSES FROM PREVIOUS OBJECTIVES:
#   - Objective 1: System prompt (SYSTEM_PROMPT)
#   - Objective 2: Q&A database (qa_database)
#   - Objective 3: FAISS index + embed_query() + search_faiss()
#   - Objective 4: rag_query() + format_context()
#
# WHY SCOREPACK:
#   - Uses BERTScore (semantic F1) instead of token F1 for better RAG evaluation
#   - Token F1 fails on paraphrases: "30 day return" vs "30-day refund" = low score
#   - BERTScore understands semantics: same meaning = high score
#
# CACHING:
#   - First run: ~10-15 min (downloads models, runs inference)
#   - Subsequent runs: ~30 sec (loads cached responses, recalculates metrics)
#
# USAGE:
#   rankings_df, detailed_df = run_evaluation()           # Full run
#   rankings_df, detailed_df = recalculate_metrics_only() # Recalc metrics only
#   clean()                                               # Delete cache
#
# ============================================================================

import os
import time
import json
import numpy as np
import pandas as pd
from typing import Dict, Tuple, List, Optional, Callable
import torch
from transformers import pipeline

# ============================================================================
# VERIFY DEPENDENCIES FROM OBJECTIVES 1-4
# ============================================================================
def verify_dependencies():
    """Check that Objectives 1-4 have been run."""
    missing = []
    
    # Objective 1: System prompt
    if 'SYSTEM_PROMPT' not in globals():
        missing.append("SYSTEM_PROMPT (Objective 1)")
    
    # Objective 2: Q&A database
    if 'qa_database' not in globals():
        missing.append("qa_database (Objective 2)")
    
    # Objective 3: Embeddings
    if 'embed_query' not in globals():
        missing.append("embed_query() (Objective 3)")
    if 'search_faiss' not in globals():
        missing.append("search_faiss() (Objective 3)")
    if 'faiss_index' not in globals():
        missing.append("faiss_index (Objective 3)")
    
    # Objective 4: RAG pipeline
    if 'rag_query' not in globals():
        missing.append("rag_query() (Objective 4)")
    if 'format_context' not in globals():
        missing.append("format_context() (Objective 4)")
    
    if missing:
        print("❌ MISSING DEPENDENCIES - Run previous objectives first:")
        for m in missing:
            print(f"   • {m}")
        return False
    
    print("✅ All dependencies from Objectives 1-4 verified")
    return True

# ============================================================================
# SCOREPACK: UNIFIED SCORING CLASS
# ============================================================================
# Why inline? Self-contained notebook, no separate file needed.
# Why a class? Groups 5 metrics, caches embeddings, batch processing.
#
# METRICS:
#   accuracy   - BERTScore F1 (semantic similarity via DeBERTa)
#   quality    - Cosine similarity (same embedder as FAISS from Obj 3)
#   confidence - Calibration (rewards correct confidence levels)
#   speed      - Normalized latency (0ms=1.0, 2000ms=0.0)
#   robustness - Error handling (penalizes crashes, empty, verbose)
# ============================================================================

try:
    from bert_score import score as bert_score
    _HAS_BERT = True
    print("✅ BERTScore available")
except ImportError:
    _HAS_BERT = False
    print("⚠️ BERTScore not installed - run: pip install bert-score")

class ScorePack:
    """
    Unified scoring for RAG model evaluation.
    
    Usage:
        scorer = ScorePack(embeddings=cached_embs, embed_fn=embed_query)
        scores = scorer.score_all(pred, ref, conf, latency, is_ans, error)
    """
    
    def __init__(self, embeddings: Dict[str, np.ndarray] = None, 
                 embed_fn: Callable[[str], np.ndarray] = None):
        self.embeddings = embeddings or {}
        self.embed_fn = embed_fn  # Reuses embed_query from Objective 3
        self.device = 0 if torch.cuda.is_available() else "cpu"
    
    def _get_emb(self, text: str) -> Optional[np.ndarray]:
        """Get embedding from cache or compute via embed_query (Obj 3)."""
        if not text or len(text.strip()) < 2:
            return None
        if text in self.embeddings:
            return self.embeddings[text]
        if self.embed_fn:
            emb = self.embed_fn(text)
            self.embeddings[text] = emb
            return emb
        return None
    
    # --- ACCURACY: BERTScore (semantic F1) ---
    def accuracy(self, pred: str, ref: str) -> float:
        """BERTScore F1 using DeBERTa-large-mnli."""
        if not _HAS_BERT or not pred or not ref:
            return 0.0
        P, R, F1 = bert_score(
            [pred], [ref], lang="en",
            model_type="microsoft/deberta-large-mnli",
            rescale_with_baseline=True,
            device=self.device, verbose=False
        )
        return float(F1[0])
    
    def accuracy_batch(self, preds: List[str], refs: List[str]) -> List[float]:
        """Batch BERTScore - faster than loop."""
        if not _HAS_BERT:
            return [0.0] * len(preds)
        valid_idx = [i for i, (p, r) in enumerate(zip(preds, refs))
                     if p and p.strip() and r and r.strip()]
        if not valid_idx:
            return [0.0] * len(preds)
        P, R, F1 = bert_score(
            [preds[i] for i in valid_idx],
            [refs[i] for i in valid_idx],
            lang="en", model_type="microsoft/deberta-large-mnli",
            rescale_with_baseline=True,
            device=self.device, verbose=False
        )
        result = [0.0] * len(preds)
        for j, i in enumerate(valid_idx):
            result[i] = float(F1[j])
        return result
    
    # --- QUALITY: Embedding similarity (reuses Objective 3 embedder) ---
    def quality(self, pred: str, ref: str) -> float:
        """Cosine similarity using same embedder as FAISS (Obj 3)."""
        if not pred or not ref:
            return 0.0
        emb_p, emb_r = self._get_emb(pred), self._get_emb(ref)
        if emb_p is None or emb_r is None:
            return 0.0
        return max(0.0, float(np.dot(emb_p, emb_r) / 
                              (np.linalg.norm(emb_p) * np.linalg.norm(emb_r))))
    
    # --- CONFIDENCE: Calibration ---
    def confidence(self, raw_conf: float, answer: str, is_answerable: bool) -> float:
        """Rewards well-calibrated confidence."""
        is_empty = not answer or len(answer.strip()) < 3
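        # Default to 0.5 only when confidence is missing (None/NaN); 0.0 is a valid score
        conf = 0.5 if raw_conf is None or np.isnan(raw_conf) else float(raw_conf)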
        if is_answerable:
            return 0.2 if is_empty else conf
        return 1.0 if is_empty else (0.9 if conf < 0.3 else (0.6 if conf < 0.5 else 0.2))
    
    # --- SPEED: Normalized latency ---
    @staticmethod
    def speed(latency_ms: float) -> float:
        """0ms=1.0, 2000ms+=0.0"""
        return max(0.0, min(1.0, 1 - (latency_ms / 2000)))
    
    # --- ROBUSTNESS: Error handling ---
    @staticmethod
    def robustness(answer: str, is_answerable: bool, had_error: bool) -> float:
        """Penalizes errors, empty answers, verbosity."""
        if had_error:
            return 0.0
        is_empty = not answer or len(answer.strip()) < 3
        length = len(answer.split()) if answer else 0
        if is_answerable:
            return 0.3 if is_empty else (0.7 if length > 50 else 1.0)
        return 1.0 if is_empty else (0.2 if length > 30 else 0.6)
    
    # --- BATCH SCORING ---
    def score_all_batch(self, preds: List[str], refs: List[str],
                        confs: List[float], latencies: List[float],
                        answerables: List[bool], errors: List[bool]) -> List[Dict[str, float]]:
        """Batch score all 5 metrics."""
        batch_preds = [p if a else "" for p, a in zip(preds, answerables)]
        batch_refs = [r if a else "" for r, a in zip(refs, answerables)]
        accuracies = self.accuracy_batch(batch_preds, batch_refs)
        
        results = []
        for i in range(len(preds)):
            results.append({
                'accuracy': accuracies[i] if answerables[i] else 0.0,
                'quality': self.quality(preds[i], refs[i]) if answerables[i] else 0.0,
                'confidence': self.confidence(confs[i], preds[i], answerables[i]),
                'speed': self.speed(latencies[i]),
                'robustness': self.robustness(preds[i], answerables[i], errors[i])
            })
        return results

print("✅ ScorePack loaded")

# ============================================================================
# CONFIGURATION
# ============================================================================
OUTPUT_DIR = "data/model_evaluation"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Cache files
CONTEXTS_FILE = os.path.join(OUTPUT_DIR, "contexts.csv")
EMBEDDINGS_FILE = os.path.join(OUTPUT_DIR, "embeddings.csv")
RAW_RESPONSES_FILE = os.path.join(OUTPUT_DIR, "raw_responses.csv")

# Output files
RANKINGS_FILE = os.path.join(OUTPUT_DIR, "model_rankings.csv")
DETAILED_FILE = os.path.join(OUTPUT_DIR, "model_responses.csv")
SUMMARY_FILE = os.path.join(OUTPUT_DIR, "evaluation_summary.txt")

print(f"📁 Output: {OUTPUT_DIR}/")

DEVICE = 0 if torch.cuda.is_available() else -1
print(f"🖥️  Device: {'GPU' if DEVICE == 0 else 'CPU'}")

# ============================================================================
# MODELS TO EVALUATE (6 models: 5 extractive + 1 generative)
# ============================================================================
MODELS_CONFIG = [
    ("T5-QA-Generative", "consciousAI/question-answering-generative-t5-v1-base-s-q-c", "text2text-generation"),
    ("RoBERTa-SQuAD2", "deepset/roberta-base-squad2", "question-answering"),
    ("BERT-Large-SQuAD", "google-bert/bert-large-cased-whole-word-masking-finetuned-squad", "question-answering"),
    ("DistilBERT-SQuAD", "distilbert-base-uncased-distilled-squad", "question-answering"),
    ("BERT-Tiny-SQuAD", "mrm8488/bert-tiny-finetuned-squadv2", "question-answering"),
    ("MiniLM-SQuAD", "deepset/minilm-uncased-squad2", "question-answering"),
]

# ============================================================================
# TEST QUESTIONS (5 answerable + 3 unanswerable)
# ============================================================================
TEST_QUESTIONS = [
    # Answerable - should find in knowledge base (Objective 2)
    ("What is your return policy?", "30-day return policy with full refund", True),
    ("How long does shipping take?", "Standard shipping takes 5-7 business days", True),
    ("What are your customer service hours?", "Monday to Friday 9am to 5pm", True),
    ("Do you offer warranty on products?", "1 year warranty on all electronics", True),
    ("How can I track my order?", "Use tracking number on our website", True),
    # Unanswerable - not in knowledge base
    ("What is the CEO's phone number?", "", False),
    ("Will you have a Black Friday sale?", "", False),
    ("How do prices compare to Amazon?", "", False),
]

# ============================================================================
# METRIC WEIGHTS (total = 100%)
# ============================================================================
METRIC_WEIGHTS = {
    'accuracy': 0.25,    # Semantic correctness (BERTScore)
    'quality': 0.25,     # Embedding similarity (reuses Obj 3)
    'confidence': 0.20,  # Calibration
    'speed': 0.15,       # Latency
    'robustness': 0.15   # Error handling
}

MODEL_SIZES = {
    'BERT-Large-SQuAD': ('Large', '340M'),
    'RoBERTa-SQuAD2': ('Base', '125M'),
    'T5-QA-Generative': ('Base', '220M'),
    'DistilBERT-SQuAD': ('Small', '66M'),
    'MiniLM-SQuAD': ('Small', '33M'),
    'BERT-Tiny-SQuAD': ('Tiny', '4M'),
}

# ============================================================================
# CACHE FUNCTIONS (3-stage caching)
# ============================================================================
def show_cache_status():
    """Display cache status."""
    print("\n" + "="*60)
    print("📁 CACHE STATUS")
    print("="*60)
    for name, path in [('contexts.csv', CONTEXTS_FILE), 
                       ('embeddings.csv', EMBEDDINGS_FILE),
                       ('raw_responses.csv', RAW_RESPONSES_FILE)]:
        print(f"   {'✅' if os.path.exists(path) else '❌'} {name}")


def load_or_fetch_contexts(force_refresh: bool = False) -> pd.DataFrame:
    """Fetch contexts using rag_query from Objective 4."""
    if not force_refresh and os.path.exists(CONTEXTS_FILE):
        print("\n📂 Loading contexts from cache...")
        return pd.read_csv(CONTEXTS_FILE)
    
    print("\n📥 Fetching contexts via rag_query (Objective 4)...")
    
    data = []
    for i, (q, expected, is_ans) in enumerate(TEST_QUESTIONS):
        # REUSE: rag_query from Objective 4
        result = rag_query(q, verbose=False)
        context = " ".join([f"{c.get('question','')} {c.get('answer','')}" 
                           for c in result.retrieved_context]) if result.success else ""
        data.append({
            'question': q, 
            'expected': expected, 
            'is_answerable': is_ans,
            'context': context or "No relevant information found.",
            'num_sources': len(result.retrieved_context) if result.success else 0
        })
        print(f"   [{i+1}/{len(TEST_QUESTIONS)}] ✅ {q[:40]}...")
    
    df = pd.DataFrame(data)
    df.to_csv(CONTEXTS_FILE, index=False)
    return df


def load_or_compute_embeddings(contexts_df: pd.DataFrame, force_refresh: bool = False) -> Dict[str, np.ndarray]:
    """Compute embeddings using embed_query from Objective 3."""
    if not force_refresh and os.path.exists(EMBEDDINGS_FILE):
        print("\n📂 Loading embeddings from cache...")
        df = pd.read_csv(EMBEDDINGS_FILE)
        return {row['text']: np.array(json.loads(row['embedding'])) for _, row in df.iterrows()}
    
    print("\n📥 Computing embeddings via embed_query (Objective 3)...")
    
    texts = list(set([row['expected'] for _, row in contexts_df.iterrows() 
                      if row['is_answerable'] and row['expected']]))
    
    embeddings = {}
    data = []
    for i, text in enumerate(texts):
        # REUSE: embed_query from Objective 3
        emb = embed_query(text)
        embeddings[text] = emb
        data.append({'text': text, 'embedding': json.dumps(emb.tolist())})
        print(f"   [{i+1}/{len(texts)}] ✅ {text[:40]}...")
    
    pd.DataFrame(data).to_csv(EMBEDDINGS_FILE, index=False)
    return embeddings


def load_or_collect_responses(contexts_df: pd.DataFrame, force_refresh: bool = False) -> pd.DataFrame:
    """Run all 6 models on test questions."""
    if not force_refresh and os.path.exists(RAW_RESPONSES_FILE):
        print("\n📂 Loading responses from cache...")
        return pd.read_csv(RAW_RESPONSES_FILE)
    
    print("\n🤖 Running model inference (6 models × 8 questions)...")
    all_responses = []
    
    for idx, (name, model_id, task_type) in enumerate(MODELS_CONFIG):
        print(f"\n[{idx+1}/{len(MODELS_CONFIG)}] 📊 {name}")
        
        try:
            pipe = pipeline(task_type, model=model_id, device=DEVICE,
                          torch_dtype=torch.float16 if DEVICE == 0 else torch.float32)
        except Exception as e:
            print(f"   ❌ Failed to load: {str(e)[:50]}")
            for _, row in contexts_df.iterrows():
                all_responses.append({'model': name, 'question': row['question'],
                                     'answer': '', 'raw_confidence': 0.0,
                                     'response_time_ms': 0.0, 'had_error': True})
            continue
        
        for _, row in contexts_df.iterrows():
            t0 = time.time()
            try:
                if task_type == "text2text-generation":
                    out = pipe(f"question: {row['question']} context: {row['context']}", max_length=50)
                    answer, raw_conf = out[0]['generated_text'].strip(), 0.7
                else:
                    out = pipe(question=row['question'], context=row['context'], max_answer_len=50)
                    answer, raw_conf = out['answer'].strip(), out.get('score', 0.5)
                had_error = False
            except Exception:  # do not swallow KeyboardInterrupt/SystemExit
                answer, raw_conf, had_error = '', 0.0, True
            
            all_responses.append({
                'model': name, 'question': row['question'], 'answer': answer,
                'raw_confidence': raw_conf, 'response_time_ms': (time.time() - t0) * 1000,
                'had_error': had_error
            })
            print(f"   ✅ {row['question'][:30]}... → {answer[:25] if answer else '(empty)'}...")
        
        del pipe
        if DEVICE == 0: torch.cuda.empty_cache()
    
    df = pd.DataFrame(all_responses)
    df.to_csv(RAW_RESPONSES_FILE, index=False)
    return df


# ============================================================================
# METRIC CALCULATION
# ============================================================================
def calculate_metrics(contexts_df: pd.DataFrame, responses_df: pd.DataFrame,
                      embeddings: Dict[str, np.ndarray]) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Calculate all 5 metrics using ScorePack."""
    print("\n📊 Calculating metrics (ScorePack + BERTScore)...")
    
    # REUSE: embed_query from Objective 3
    scorer = ScorePack(embeddings=embeddings, embed_fn=embed_query)
    
    context_lookup = {row['question']: {'expected': row['expected'], 'is_answerable': row['is_answerable']}
                      for _, row in contexts_df.iterrows()}
    
    # Prepare batch data
    preds, refs, confs, latencies, answerables, errors = [], [], [], [], [], []
    for _, resp in responses_df.iterrows():
        ctx = context_lookup[resp['question']]
        # Empty strings come back from the CSV cache as NaN (a truthy float),
        # so normalize missing values to "" before scoring.
        answer, expected = resp['answer'], ctx['expected']
        preds.append("" if pd.isna(answer) else str(answer))
        refs.append("" if pd.isna(expected) else str(expected))
        confs.append(resp['raw_confidence'])
        latencies.append(resp['response_time_ms'])
        answerables.append(ctx['is_answerable'])
        errors.append(resp['had_error'])
    
    # Batch scoring
    print("   🔄 Running BERTScore (DeBERTa-large-mnli)...")
    all_scores = scorer.score_all_batch(preds, refs, confs, latencies, answerables, errors)
    print(f"   ✅ Scored {len(all_scores)} responses")
    
    # Build detailed results
    detailed_results = []
    for i, (_, resp) in enumerate(responses_df.iterrows()):
        ctx = context_lookup[resp['question']]
        scores = all_scores[i]
        detailed_results.append({
            'model': resp['model'],
            'question': resp['question'],
            'answer': resp['answer'],
            'expected': ctx['expected'],
            'is_answerable': ctx['is_answerable'],
            'raw_confidence': resp['raw_confidence'],
            'response_time_ms': resp['response_time_ms'],
            **{k: round(v, 3) for k, v in scores.items()}
        })
    detailed_df = pd.DataFrame(detailed_results)
    
    # Aggregate by model
    print("   🔄 Aggregating by model...")
    model_results = []
    for name, _, _ in MODELS_CONFIG:
        model_data = detailed_df[detailed_df['model'] == name]
        answerable = model_data[model_data['is_answerable'] == True]
        
        metrics = {
            'Accuracy': answerable['accuracy'].mean() if len(answerable) > 0 else 0,
            'Quality': answerable['quality'].mean() if len(answerable) > 0 else 0,
            'Confidence': model_data['confidence'].mean(),
            'Speed': model_data['speed'].mean(),
            'Robustness': model_data['robustness'].mean()
        }
        
        final = sum(metrics[m.capitalize()] * w for m, w in METRIC_WEIGHTS.items())
        model_results.append({'Model': name, **{k: round(v, 3) for k, v in metrics.items()},
                             'Final_Score': round(final, 3)})
        print(f"   ✅ {name}: {final:.3f}")
    
    rankings_df = pd.DataFrame(model_results)
    rankings_df = rankings_df.sort_values('Final_Score', ascending=False).reset_index(drop=True)
    rankings_df.insert(0, 'Rank', range(1, len(rankings_df) + 1))
    
    return rankings_df, detailed_df


# ============================================================================
# OUTPUT
# ============================================================================
def save_outputs(rankings_df: pd.DataFrame, detailed_df: pd.DataFrame):
    """Save results to CSV and text summary."""
    rankings_df.to_csv(RANKINGS_FILE, index=False)
    detailed_df.to_csv(DETAILED_FILE, index=False)
    
    # Explicit UTF-8: the summary contains non-ASCII symbols
    with open(SUMMARY_FILE, 'w', encoding='utf-8') as f:
        f.write("="*60 + "\n")
        f.write("OBJECTIVE 5: MODEL EVALUATION SUMMARY\n")
        f.write("="*60 + "\n\n")
        f.write(f"Date: {time.strftime('%Y-%m-%d %H:%M:%S')}\n")
        f.write(f"Scoring: ScorePack (BERTScore DeBERTa-large-mnli)\n\n")
        f.write("REUSES FROM PREVIOUS OBJECTIVES:\n")
        f.write("  • Objective 3: embed_query() for quality metric\n")
        f.write("  • Objective 4: rag_query() for context retrieval\n\n")
        f.write("METRIC WEIGHTS:\n")
        for m, w in METRIC_WEIGHTS.items():
            f.write(f"  {m}: {w*100:.0f}%\n")
        f.write("\n" + "="*60 + "\nRANKINGS:\n" + "="*60 + "\n\n")
        f.write(rankings_df.to_string(index=False))
        f.write(f"\n\n🥇 WINNER: {rankings_df.iloc[0]['Model']} ({rankings_df.iloc[0]['Final_Score']})\n")
    
    print(f"\n✅ Saved: {RANKINGS_FILE}")
    print(f"✅ Saved: {DETAILED_FILE}")
    print(f"✅ Saved: {SUMMARY_FILE}")


def print_analysis(rankings_df: pd.DataFrame):
    """Print winner summary and model insights."""
    winner = rankings_df.iloc[0]
    
    print("\n" + "="*60)
    print("🏆 WINNER SUMMARY")
    print("="*60)
    print(f"""
    🥇 BEST MODEL: {winner['Model']}
    
    Final Score: {winner['Final_Score']:.3f}
    
    SCORES:
    ├── Accuracy (BERTScore): {winner['Accuracy']:.3f}
    ├── Quality (Embedding):  {winner['Quality']:.3f}
    ├── Confidence:           {winner['Confidence']:.3f}
    ├── Speed:                {winner['Speed']:.3f}
    └── Robustness:           {winner['Robustness']:.3f}
    """)
    
    print("="*60)
    print("🔍 MODEL INSIGHTS: SIZE vs PERFORMANCE")
    print("="*60)
    print(f"\n   {'Model':<20} {'Size':<8} {'Params':<10} {'Score':<8}")
    print("   " + "─"*46)
    for _, row in rankings_df.iterrows():
        size_info = MODEL_SIZES.get(row['Model'], ('?', '?'))
        print(f"   {row['Model']:<20} {size_info[0]:<8} {size_info[1]:<10} {row['Final_Score']:.3f}")
    
    print(f"\n   RECOMMENDATIONS:")
    print(f"   • Speed-critical:    {rankings_df.loc[rankings_df['Speed'].idxmax(), 'Model']}")
    print(f"   • Accuracy-critical: {rankings_df.loc[rankings_df['Accuracy'].idxmax(), 'Model']}")
    print(f"   • Best overall:      {rankings_df.iloc[0]['Model']}")


# ============================================================================
# MAIN ENTRY POINTS
# ============================================================================
def run_evaluation(force_refresh: bool = False) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Run full evaluation pipeline."""
    print("\n" + "="*60)
    print("🚀 OBJECTIVE 5: MODEL EVALUATION")
    print("="*60)
    print("   Reusing: embed_query (Obj 3), rag_query (Obj 4)")
    
    start_time = time.time()
    
    cache_complete = all(os.path.exists(f) for f in [CONTEXTS_FILE, EMBEDDINGS_FILE, RAW_RESPONSES_FILE])
    
    if cache_complete and not force_refresh:
        print("\n💡 Loading from cache (use force_refresh=True to rebuild)...")
        contexts_df = pd.read_csv(CONTEXTS_FILE)
        responses_df = pd.read_csv(RAW_RESPONSES_FILE)
        emb_df = pd.read_csv(EMBEDDINGS_FILE)
        embeddings = {row['text']: np.array(json.loads(row['embedding'])) for _, row in emb_df.iterrows()}
    else:
        contexts_df = load_or_fetch_contexts(force_refresh)
        embeddings = load_or_compute_embeddings(contexts_df, force_refresh)
        responses_df = load_or_collect_responses(contexts_df, force_refresh)
    
    rankings_df, detailed_df = calculate_metrics(contexts_df, responses_df, embeddings)
    save_outputs(rankings_df, detailed_df)
    
    print(f"\n⏱️  Total time: {time.time() - start_time:.1f}s")
    return rankings_df, detailed_df


def recalculate_metrics_only() -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Recalculate metrics from cached responses (~30 sec)."""
    print("\n📊 Recalculating metrics from cache...")
    contexts_df = pd.read_csv(CONTEXTS_FILE)
    responses_df = pd.read_csv(RAW_RESPONSES_FILE)
    emb_df = pd.read_csv(EMBEDDINGS_FILE)
    embeddings = {row['text']: np.array(json.loads(row['embedding'])) for _, row in emb_df.iterrows()}
    rankings_df, detailed_df = calculate_metrics(contexts_df, responses_df, embeddings)
    save_outputs(rankings_df, detailed_df)
    return rankings_df, detailed_df


def clean():
    """Delete all cache files."""
    print("\n🗑️  Cleaning cache...")
    for f in [CONTEXTS_FILE, EMBEDDINGS_FILE, RAW_RESPONSES_FILE, RANKINGS_FILE, DETAILED_FILE, SUMMARY_FILE]:
        if os.path.exists(f):
            os.remove(f)
            print(f"   Deleted: {os.path.basename(f)}")


# ============================================================================
# RUN EVALUATION
# ============================================================================
# Verify Objectives 1-4 are available
if verify_dependencies():
    show_cache_status()
    rankings_df, detailed_df = run_evaluation()
    
    print("\n" + "="*60)
    print("🏆 FINAL RANKINGS")
    print("="*60)
    print(rankings_df.to_string(index=False))
    print(f"\n🥇 WINNER: {rankings_df.iloc[0]['Model']} ({rankings_df.iloc[0]['Final_Score']})")
    
    print_analysis(rankings_df)
else:
    print("\n⚠️ Please run Objectives 1-5 first, then re-run this cell.")

✅ BERTScore available
✅ ScorePack loaded
📁 Output: data/model_evaluation/
🖥️  Device: CPU
✅ All dependencies from Objectives 1-4 verified

📁 CACHE STATUS
   ✅ contexts.csv
   ✅ embeddings.csv
   ✅ raw_responses.csv

🚀 OBJECTIVE 5: MODEL EVALUATION
   Reusing: embed_query (Obj 3), rag_query (Obj 4)

💡 Loading from cache (use force_refresh=True to rebuild)...

📊 Calculating metrics (ScorePack + BERTScore)...
   🔄 Running BERTScore (DeBERTa-large-mnli)...
   ✅ Scored 48 responses
   🔄 Aggregating by model...
   ✅ T5-QA-Generative: 0.430
   ✅ RoBERTa-SQuAD2: 0.568
   ✅ BERT-Large-SQuAD: 0.562
   ✅ DistilBERT-SQuAD: 0.571
   ✅ BERT-Tiny-SQuAD: 0.465
   ✅ MiniLM-SQuAD: 0.598

✅ Saved: data/model_evaluation/model_rankings.csv
✅ Saved: data/model_evaluation/model_responses.csv
✅ Saved: data/model_evaluation/evaluation_summary.txt

⏱️  Total time: 1.5s

🏆 FINAL RANKINGS
 Rank            Model  Accuracy  Quality  Confidence  Speed 

## Objective 6: System Analysis & Reflection

### 🎯 Goal

**Part 1 (Code):** Generate and display comprehensive analysis data from Objectives 1-5.  
**Part 2 (Manual):** My reflection on system strengths, weaknesses, real-world applications, and critical insights based on the generated results.

### 📋 Structure

This objective has **two parts**:

1. **Code Cell (Below):** Runs analysis and displays key metrics, model rankings, and performance data
2. **Markdown Cell (Next):** You manually write your reflection and critical analysis based on the results

**Why this approach?**
- Demonstrates you understand the results (not just running code)
- Shows critical thinking and analysis skills
- More authentic for assignment submission
- Allows you to interpret data and provide insights

<details>
<summary><b>📥 Prerequisites</b> (Click to expand)</summary>

| Item | Source | Required | Description |
|------|--------|----------|-------------|
| `rankings_df` | Objective 5 | ✅ Yes | Model rankings with all metrics |
| `detailed_df` | Objective 5 | ✅ Yes | Per-question detailed results |
| `qa_database` | Objective 2 | ✅ Yes | Knowledge base for coverage analysis |
| `faiss_index` | Objective 3 | ✅ Yes | Retrieval system analysis |
| `rag_query()` | Objective 4 | ✅ Yes | RAG pipeline for analysis |

**Note:** Requires Objectives 1-5 completed. Achieves **100% component reuse**.

</details>

<br>

<details>
<summary><b>📊 Analysis Framework</b> (Click to expand)</summary>

### 1. System Performance Analysis
**What it analyzes:** Overall system metrics from Objective 5 results

**Metrics Computed:**
- **Best Model Performance** - Winner from rankings with all 5 metrics
- **Answerable Accuracy** - Average accuracy on answerable questions across models
- **Unanswerable Detection** - How well models handle out-of-scope questions
- **Average Response Time** - Mean latency across all models and questions
- **Confidence Calibration** - Correlation between confidence and correctness

**Source:** `rankings_df` and `detailed_df` from Objective 5
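
As a rough illustration of the confidence calibration metric, the sketch below computes a Pearson correlation between per-response confidence and accuracy. The values are made-up toy numbers and `pearson` is a hypothetical helper, not a function from this notebook; it only assumes that confidence and accuracy are available as two numeric lists of equal length.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant input
    return cov / math.sqrt(var_x * var_y)

# Toy values (hypothetical, NOT taken from the evaluation results)
confidence = [0.1, 0.4, 0.6, 0.9]
accuracy = [0.2, 0.3, 0.7, 0.8]

calibration = pearson(confidence, accuracy)
print(f"Confidence/accuracy correlation: {calibration:.3f}")
```

With the real data, the same call could take `detailed_df['confidence'].tolist()` and `detailed_df['accuracy'].tolist()`. A value near 1 would suggest confidence tracks correctness; a value near 0 would suggest it carries little signal.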

---

### 2. Knowledge Base Analysis
**What it analyzes:** Coverage and quality of the Q&A database

**Metrics Computed:**
- **Total Q&A Pairs** - Size of knowledge base
- **Category Coverage** - Number of distinct categories
- **Answerable vs Unanswerable** - Distribution in database
- **Average Answer Length** - Typical response size

**Source:** `qa_database` from Objective 2
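
The knowledge base metrics above could be gathered in one pass, roughly as sketched here. The `category` key matches what the analysis cell reads; the `answer` and `answerable` keys, the sample entries, and the `knowledge_base_stats` helper are assumptions made for illustration and may not match the real `qa_database` schema.

```python
from collections import Counter

# Hypothetical sample; 'answer' and 'answerable' are assumed key names
sample_qa = [
    {"question": "How long does standard shipping take?",
     "answer": "5-7 business days", "category": "shipping", "answerable": True},
    {"question": "What is the return window?",
     "answer": "30 days for unopened items", "category": "returns", "answerable": True},
    {"question": "Do you ship to Mars?",
     "answer": "", "category": "shipping", "answerable": False},
]

def knowledge_base_stats(qa_pairs):
    """Summarize size, category coverage, and answer length of a Q&A list."""
    categories = Counter(qa.get("category", "unknown") for qa in qa_pairs)
    answerable = [qa for qa in qa_pairs if qa.get("answerable", True)]
    # Answer length in whitespace-separated words, answerable entries only
    lengths = [len(qa.get("answer", "").split()) for qa in answerable]
    return {
        "total": len(qa_pairs),
        "categories": dict(categories),
        "answerable": len(answerable),
        "unanswerable": len(qa_pairs) - len(answerable),
        "avg_answer_words": sum(lengths) / len(lengths) if lengths else 0.0,
    }

stats = knowledge_base_stats(sample_qa)
print(stats)
```

Word counts are used here as a simple proxy for answer length; character or token counts would work the same way.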

---

### 3. Retrieval System Analysis
**What it analyzes:** FAISS vector database performance

**Metrics Computed:**
- **Index Size** - Number of vectors indexed
- **Embedding Dimension** - Vector dimensionality (384 for all-MiniLM-L6-v2)
- **Index Type** - FAISS index configuration (IndexFlatL2)

**Source:** `faiss_index` from Objective 3
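
`IndexFlatL2` performs exact nearest-neighbor search by comparing the query against every stored vector, and it reports squared L2 distances. The pure-Python sketch below mimics that behavior on toy 3-dimensional vectors to show what the index computes. `flat_l2_search` is a hypothetical helper written for this illustration, not part of the FAISS API.

```python
def flat_l2_search(vectors, query, k=2):
    """Return the k nearest (index, squared L2 distance) pairs by exhaustive scan."""
    distances = [
        (i, sum((a - b) ** 2 for a, b in zip(vec, query)))
        for i, vec in enumerate(vectors)
    ]
    distances.sort(key=lambda pair: pair[1])  # smallest distance first
    return distances[:k]

# Toy vectors; the real index holds 384-dimensional embeddings
vectors = [
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]

hits = flat_l2_search(vectors, query=[0.9, 0.0, 0.0], k=2)
for idx, dist in hits:
    print(f"vector {idx}: squared L2 distance {dist:.2f}")
```

Because every vector is scanned, search cost grows linearly with index size, which is why this index type suits a small knowledge base.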

---

### 4. System Limitations
**What it identifies:** What the system can and cannot handle

**Categories:**
- **Answerable Questions** - What the system handles well
- **Unanswerable Questions** - What the system should decline
- **Failure Modes** - Common error patterns (incomplete answers, false confidence, execution errors)
- **Edge Cases** - Ambiguous queries, typos, multi-topic questions

**Source:** Analysis of `detailed_df` from Objective 5

---

### 5. Real-World Applications
**What it evaluates:** Suitable deployment scenarios

**Use Cases:**
- Customer service chatbot (24/7 support)
- Internal FAQ system (employee self-service)
- Knowledge base assistant (documentation search)
- Help desk automation (ticket deflection)

**Business Value:**
- Reduced response time
- Cost savings
- Consistency
- Scalability

---

### 6. Scalability Analysis
**What it projects:** Performance at different scales

**Scales Analyzed:**
- **Current:** 21 Q&A pairs
- **Small Scale:** 1,000 pairs
- **Medium Scale:** 10,000 pairs
- **Large Scale:** 100,000+ pairs

**Bottlenecks Identified:**
- Embedding generation time
- FAISS search latency
- Model inference time
- Memory requirements
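
To make the memory bottleneck concrete, here is a back-of-the-envelope estimate for the scales listed above. It assumes a flat index that stores each vector as raw 32-bit floats at 384 dimensions and ignores any per-index overhead; `flat_index_bytes` is a hypothetical helper for this calculation.

```python
BYTES_PER_FLOAT32 = 4

def flat_index_bytes(num_vectors, dim=384):
    """Raw vector storage for a flat float32 index (overhead not included)."""
    return num_vectors * dim * BYTES_PER_FLOAT32

scales = [("Current", 21), ("Small", 1_000), ("Medium", 10_000), ("Large", 100_000)]
for label, n in scales:
    size_mib = flat_index_bytes(n) / (1024 ** 2)
    print(f"{label:<8} {n:>7,} vectors -> {size_mib:8.2f} MiB")
```

Under these assumptions even 100,000 vectors need roughly 146 MiB, so raw vector storage is unlikely to be the first bottleneck; model inference time is a more probable limit.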

---

### 7. Deployment Considerations
**What it evaluates:** Production deployment requirements

**Infrastructure:**
- GPU vs CPU trade-offs
- Model hosting options
- Vector database scaling

**Costs:**
- Model hosting costs
- API call expenses
- Infrastructure requirements

**Maintenance:**
- Knowledge base updates
- Model versioning
- Performance monitoring

</details>

<br>

<details>
<summary><b>🔗 Component Reuse</b> (Click to expand)</summary>

**100% Reuse from Previous Objectives:**

| Component | From | Used For |
|-----------|------|----------|
| `rankings_df` | Obj 5 | Model performance metrics and rankings |
| `detailed_df` | Obj 5 | Per-question analysis, failure mode identification |
| `qa_database` | Obj 2 | Knowledge base coverage and quality analysis |
| `faiss_index` | Obj 3 | Retrieval system metrics |
| `rag_query()` | Obj 4 | Pipeline performance analysis |

**Key Insight:** Objective 6 synthesizes all previous objectives into actionable insights and recommendations.

</details>

<br>

<details>
<summary><b>📤 Outputs</b> (Click to expand)</summary>

**Files Created:**
- `data/system_analysis/analysis_summary.csv` - Quantitative metrics summary

**Global Variables:**
- `summary_df` - DataFrame holding the key metrics written to the summary CSV

</details>

<br>

<details>
<summary><b>✅ Verification</b> (Click to expand)</summary>

**Verification:**
- ✅ `rankings_df` and `detailed_df` from Objective 5 exist
- ✅ `qa_database` from Objective 2 exists
- ✅ `faiss_index` from Objective 3 exists
- ✅ `analysis_summary.csv` with quantitative metrics

**Performance:**
- **Analysis time:** ~5-10 seconds (reads data from previous objectives)

</details>

<br>

---
**Next Step:** Review the generated analysis data, then write your reflection in the markdown cell below.


# ============================================================================
# OBJECTIVE 6: SYSTEM ANALYSIS & REFLECTION (SIMPLIFIED)
# ============================================================================
#
# PURPOSE: Analyze results from Objective 5 and provide insights for reflection
#
# PREREQUISITES: Run Objectives 1-5 first
#   - qa_database (Objective 2)
#   - faiss_index (Objective 3)
#   - rankings_df, detailed_df (Objective 5)
#
# ============================================================================

import os
import pandas as pd

# Output directory
OUTPUT_DIR = "data/system_analysis"
os.makedirs(OUTPUT_DIR, exist_ok=True)


# ============================================================================
# SECTION 1: VERIFY DEPENDENCIES
# ============================================================================

def verify_dependencies():
    """Check that Objectives 1-5 have been run."""
    missing = []
    
    if 'qa_database' not in globals():
        missing.append("qa_database (Objective 2)")
    if 'faiss_index' not in globals():
        missing.append("faiss_index (Objective 3)")
    if 'rankings_df' not in globals():
        missing.append("rankings_df (Objective 5)")
    if 'detailed_df' not in globals():
        missing.append("detailed_df (Objective 5)")
    
    if missing:
        print("❌ MISSING DEPENDENCIES:")
        for m in missing:
            print(f"   • {m}")
        return False
    
    print("✅ All dependencies verified")
    return True


# ============================================================================
# SECTION 2: RUN ANALYSIS
# ============================================================================

print("=" * 60)
print("📊 OBJECTIVE 6: SYSTEM ANALYSIS & REFLECTION")
print("=" * 60)

if not verify_dependencies():
    print("\n⚠️ Run Objectives 1-5 first, then re-run this cell.")
else:
    # Get data from previous objectives
    qa_database = globals()['qa_database']
    faiss_index = globals()['faiss_index']
    rankings_df = globals()['rankings_df']
    detailed_df = globals()['detailed_df']
    
    # ------------------------------------------------------------------
    # 1. MODEL RANKINGS
    # ------------------------------------------------------------------
    print("\n" + "-" * 60)
    print("🏆 MODEL RANKINGS")
    print("-" * 60)
    
    display_cols = ['Rank', 'Model', 'Accuracy', 'Quality', 'Speed', 'Final_Score']
    available_cols = [c for c in display_cols if c in rankings_df.columns]
    print(rankings_df[available_cols].to_string(index=False))
    
    best = rankings_df.iloc[0]
    print(f"\n🥇 Best Model: {best['Model']} (Score: {best['Final_Score']:.3f})")
    
    # Fastest model
    # Note: this picks the single quickest response row in detailed_df,
    # not the model with the lowest mean response time.
    fastest_idx = detailed_df['response_time_ms'].idxmin()
    fastest = detailed_df.loc[fastest_idx]
    print(f"⚡ Fastest Model: {fastest['model']} ({fastest['response_time_ms']/1000:.3f}s)")
    
    # ------------------------------------------------------------------
    # 2. PERFORMANCE SUMMARY
    # ------------------------------------------------------------------
    print("\n" + "-" * 60)
    print("📈 PERFORMANCE SUMMARY")
    print("-" * 60)
    
    print(f"  • Avg Accuracy:   {rankings_df['Accuracy'].mean():.1%}")
    print(f"  • Avg Quality:    {rankings_df['Quality'].mean():.1%}")
    print(f"  • Avg Confidence: {rankings_df['Confidence'].mean():.1%}")
    print(f"  • Avg Speed:      {rankings_df['Speed'].mean():.1%}")
    print(f"  • Avg Robustness: {rankings_df['Robustness'].mean():.1%}")
    print(f"  • Avg Response:   {detailed_df['response_time_ms'].mean()/1000:.3f}s")
    
    # ------------------------------------------------------------------
    # 3. ANSWERABLE vs UNANSWERABLE
    # ------------------------------------------------------------------
    print("\n" + "-" * 60)
    print("📊 ANSWERABLE vs UNANSWERABLE PERFORMANCE")
    print("-" * 60)
    
    ans_df = detailed_df[detailed_df['is_answerable'] == True]
    unans_df = detailed_df[detailed_df['is_answerable'] == False]
    
    print(f"\n  Answerable Questions ({len(ans_df)} samples):")
    print(f"    • Avg Accuracy:   {ans_df['accuracy'].mean():.1%}")
    print(f"    • Avg Quality:    {ans_df['quality'].mean():.1%}")
    print(f"    • Avg Confidence: {ans_df['confidence'].mean():.1%}")
    
    print(f"\n  Unanswerable Questions ({len(unans_df)} samples):")
    print(f"    • Avg Robustness: {unans_df['robustness'].mean():.1%}")
    print(f"    • Avg Confidence: {unans_df['confidence'].mean():.1%}")
    
    # False confidence detection
    if 'raw_confidence' in unans_df.columns:
        high_conf = (unans_df['raw_confidence'] > 0.7).sum()
        print(f"    • False Confidence (>0.7): {high_conf}/{len(unans_df)}")
    
    # ------------------------------------------------------------------
    # 4. KNOWLEDGE BASE STATS
    # ------------------------------------------------------------------
    print("\n" + "-" * 60)
    print("📚 KNOWLEDGE BASE")
    print("-" * 60)
    
    total_pairs = len(qa_database)
    categories = {}
    for qa in qa_database:
        cat = qa.get('category', 'unknown')
        categories[cat] = categories.get(cat, 0) + 1
    
    print(f"  • Total Q&A Pairs: {total_pairs}")
    print(f"  • Categories: {len(categories)}")
    print(f"  • FAISS Vectors: {faiss_index.ntotal}")
    print(f"  • Embedding Dim: {faiss_index.d}")
    
    # ------------------------------------------------------------------
    # 5. KEY INSIGHTS
    # ------------------------------------------------------------------
    print("\n" + "-" * 60)
    print("💡 KEY INSIGHTS")
    print("-" * 60)
    
    # Accuracy insight
    avg_acc = rankings_df['Accuracy'].mean()
    if avg_acc < 0.3:
        print(f"  ⚠️ Low accuracy ({avg_acc:.1%}) - models struggle with matching")
    elif avg_acc < 0.6:
        print(f"  📊 Moderate accuracy ({avg_acc:.1%}) - room for improvement")
    else:
        print(f"  ✅ Good accuracy ({avg_acc:.1%}) - models perform well")
    
    # Unanswerable detection insight
    unans_detection = (unans_df['robustness'] > 0.5).mean() if len(unans_df) > 0 else 0
    if unans_detection < 0.5:
        print(f"  ⚠️ Poor unanswerable detection ({unans_detection:.1%}) - may hallucinate")
    else:
        print(f"  ✅ Good unanswerable detection ({unans_detection:.1%})")
    
    # Speed insight
    avg_time = detailed_df['response_time_ms'].mean() / 1000
    if avg_time < 1.0:
        print(f"  ✅ Fast response time ({avg_time:.2f}s)")
    else:
        print(f"  ⚠️ Slow response time ({avg_time:.2f}s) - consider optimization")
    
    # Model spread insight
    score_range = rankings_df['Final_Score'].max() - rankings_df['Final_Score'].min()
    print(f"  📊 Model score range: {score_range:.3f}")
    
    # ------------------------------------------------------------------
    # 6. SAVE SUMMARY CSV
    # ------------------------------------------------------------------
    summary = {
        'Metric': ['Best Model', 'Best Score', 'Avg Accuracy', 'Avg Quality', 
                   'Avg Response Time', 'Total Q&A Pairs', 'FAISS Vectors'],
        'Value': [best['Model'], f"{best['Final_Score']:.3f}", 
                  f"{avg_acc:.1%}", f"{rankings_df['Quality'].mean():.1%}",
                  f"{avg_time:.3f}s", total_pairs, faiss_index.ntotal]
    }
    summary_df = pd.DataFrame(summary)
    summary_df.to_csv(f"{OUTPUT_DIR}/analysis_summary.csv", index=False)
    print(f"\n💾 Saved: {OUTPUT_DIR}/analysis_summary.csv")
    
    # ------------------------------------------------------------------
    # 7. REFLECTION PROMPTS
    # ------------------------------------------------------------------
    print("\n" + "=" * 60)
    print("📝 REFLECTION PROMPTS")
    print("=" * 60)
    print("""
Write your reflection addressing these questions:

1. STRENGTHS: What does this RAG system do well?
   
2. WEAKNESSES: What are the main limitations?
   
3. MODEL COMPARISON: Why did the best model outperform others?
   
4. REAL-WORLD USE: Where could this system be deployed?
   
5. IMPROVEMENTS: What would you change to make it better?
""")
    
    print("=" * 60)
    print("✅ OBJECTIVE 6 COMPLETE")
    print("=" * 60)


📊 OBJECTIVE 6: SYSTEM ANALYSIS & REFLECTION
✅ All dependencies verified

------------------------------------------------------------
🏆 MODEL RANKINGS
------------------------------------------------------------
 Rank            Model  Accuracy  Quality  Speed  Final_Score
    1     MiniLM-SQuAD     0.328    0.609  0.988        0.598
    2 DistilBERT-SQuAD     0.297    0.550  0.988        0.571
    3   RoBERTa-SQuAD2     0.366    0.477  0.976        0.568
    4 BERT-Large-SQuAD     0.333    0.546  0.927        0.562
    5  BERT-Tiny-SQuAD     0.018    0.459  0.998        0.465
    6 T5-QA-Generative     0.178    0.438  0.311        0.430

🥇 Best Model: MiniLM-SQuAD (Score: 0.598)
⚡ Fastest Model: BERT-Tiny-SQuAD (0.003s)

------------------------------------------------------------
📈 PERFORMANCE SUMMARY
------------------------------------------------------------
  • Avg Accuracy:   25.3%
  • Avg Quality:    51.3%
  • Avg Confidence: 41.8%
  • Avg Speed:      86

## 📝 System Reflection

### 1. STRENGTHS: What does this RAG system do well?

- **Excellent at detecting unanswerable questions (100%):** The system correctly identifies when it doesn't have enough information and avoids making up answers
- **Very fast responses (0.27s average):** Users get near-instant answers
- **Good robustness (85%):** Handles edge cases without crashing
- **Lightweight:** Works with only 21 Q&A pairs and small models (MiniLM at 33M parameters)

---

### 2. WEAKNESSES: What are the main limitations?

- **Low accuracy (25.3%):** Models struggle to match questions to correct answers
- **Small knowledge base:** Only 21 Q&A pairs limits what the system can answer
- **Underconfident on answerable questions (21.8%):** System doesn't trust itself when it should
- **No reasoning ability:** Can only retrieve, not think through complex questions

---

### 3. MODEL COMPARISON: Why did the best model outperform others?

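**MiniLM-SQuAD won (Score: 0.598)** because:
- Highest quality score (0.609) with competitive accuracy (0.328, behind only RoBERTa-SQuAD2 at 0.366 and BERT-Large-SQuAD at 0.333)
- Nearly as fast as the tiny model (Speed 0.988 vs 0.998 for BERT-Tiny-SQuAD)
- Trained on SQuAD2, which includes unanswerable questions

**T5-QA-Generative ranked last (Score: 0.430):** its Speed score (0.311) was far below the extractive models (0.927 or higher), and its accuracy (0.178) beat only BERT-Tiny-SQuAD (0.018). A generative model adds latency that simple extractive QA does not need.

**Key learning:** Smaller, specialized models can outperform larger ones on a narrow task: MiniLM-SQuAD scored higher than BERT-Large-SQuAD (0.598 vs 0.562).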

---

### 4. REAL-WORLD USE: Where could this system be deployed?

- **Internal company FAQ bot:** Answering employee HR/IT questions
- **Customer support helper:** Suggesting answers to agents (not fully automated)
- **Documentation search:** Finding relevant help articles quickly

*Would need human oversight due to 25% accuracy.*

---

### 5. IMPROVEMENTS: What would you change to make it better?

1. **Expand knowledge base:** Add more Q&A pairs (100+) for better coverage
2. **Hybrid search:** Combine keyword + semantic search for better retrieval
3. **Fine-tune embeddings:** Train on domain-specific data
4. **Add confidence calibration:** So the system trusts itself appropriately
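
One possible way to realize the hybrid search idea is reciprocal rank fusion (RRF), which merges a keyword ranking and a semantic ranking by summing `1 / (k + rank)` for each document. This is a sketch of a technique the notebook does not currently implement; the document IDs are hypothetical, `reciprocal_rank_fusion` is an illustrative helper, and `k=60` is only a commonly used default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from a keyword search and a semantic (FAISS) search
keyword_hits = ["shipping", "returns", "warranty"]
semantic_hits = ["shipping", "solar", "returns"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents found by both retrievers rise to the top, while documents found by only one still appear lower in the list. RRF works on ranks alone, so the keyword and semantic scores never need to be put on a common scale.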

---

**Overall:** 
This RAG system demonstrates that the core pipeline works: retrieval, augmentation, and generation. The main limitation is the small knowledge base, which could be expanded for production use.

![RAG System Architecture](diagrams/rag-pipeline-architecture.png)

---