# 🔍 InsightSpike-AI: Dynamic RAG Comparison Experiment
## Evaluating Dynamic RAG Construction vs Existing Methods

This notebook compares InsightSpike-AI's dynamic RAG construction capabilities against established baselines using standard question-answering benchmarks.

### Experimental Design
- **Datasets**: NaturalQuestions & HotpotQA samples loaded from Hugging Face, with a small synthetic fallback when the downloads fail
- **Baselines**: BM25, Static Embeddings, DPR (Dense Passage Retrieval)
- **Metrics**: Recall@k, Exact Match (EM), F1 Score, Inference Latency
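
The metrics listed above are not defined at this point in the notebook, so here is a minimal sketch of how they are commonly computed for question answering. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles). The function names `normalize_answer`, `exact_match`, `token_f1`, `recall_at_k`, and `mean_latency_ms` are illustrative and are not part of InsightSpike-AI.

```python
import re
import string
import time
from collections import Counter


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top-k retrieved ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def mean_latency_ms(retrieve_fn, queries):
    """Average wall-clock time per query, in milliseconds."""
    start = time.perf_counter()
    for query in queries:
        retrieve_fn(query)
    return (time.perf_counter() - start) * 1000 / len(queries)
```

Token-level F1 gives partial credit when a prediction overlaps the gold answer without matching it exactly, which is why it is usually reported alongside EM.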

### InsightSpike-AI Dynamic RAG Features
- **Adaptive Weighting**: Dynamically adjusts retrieval strategy based on query characteristics
- **Intrinsic Motivation**: Uses ΔGED × ΔIG for document selection enhancement
- **Multi-Strategy Fusion**: Combines lexical, semantic, and learned retrieval methods
- **Context-Aware Memory**: Maintains retrieval history for improved performance
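
The feature list above does not say how the fusion or the adaptive weighting is implemented. Purely as an illustration of the general idea, the sketch below blends min-max normalized lexical and semantic scores with weights that depend on query length. Everything here is an assumption: `min_max_normalize`, `adaptive_weights`, `fuse_scores`, and the length heuristic are hypothetical, and the ΔGED × ΔIG signal is not modeled at all.

```python
def min_max_normalize(scores):
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def adaptive_weights(query):
    """Hypothetical heuristic: short queries lean lexical, long ones semantic."""
    n_tokens = len(query.split())
    semantic = min(0.8, 0.2 + 0.05 * n_tokens)
    return {"lexical": 1.0 - semantic, "semantic": semantic}


def fuse_scores(query, lexical_scores, semantic_scores):
    """Return (fused scores, document indices ranked best-first)."""
    weights = adaptive_weights(query)
    lex = min_max_normalize(lexical_scores)
    sem = min_max_normalize(semantic_scores)
    fused = [
        weights["lexical"] * l + weights["semantic"] * s
        for l, s in zip(lex, sem)
    ]
    ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return fused, ranking


# Three candidate documents scored by two retrievers (made-up numbers)
fused, ranking = fuse_scores(
    "capital of France",
    lexical_scores=[2.0, 0.0, 1.0],
    semantic_scores=[0.1, 0.9, 0.5],
)
print(ranking)
```

Normalizing each retriever's scores before mixing matters because BM25 scores and cosine similarities live on different scales.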

### Expected Outcomes
We expect InsightSpike-AI's dynamic approach to show:
1. Higher recall and precision across different k values
2. Better handling of both factual and multi-hop questions
3. Competitive or superior latency performance
4. More robust performance across question types

In [None]:
# 🚨 STEP 1: Environment Setup and Package Installation
import sys
import os
from pathlib import Path

# Check if running in Colab
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

GPU_AVAILABLE = False
if IN_COLAB:
    print("🔧 Running in Google Colab")

    # Check GPU availability (nvidia-smi output mentions "GPU" when a device is attached)
    gpu_info = !nvidia-smi
    GPU_AVAILABLE = any("GPU" in line for line in gpu_info)
    if GPU_AVAILABLE:
        print("🎮 GPU detected - will install CUDA-enabled PyTorch")
    else:
        print("💻 No GPU detected - will install CPU-only PyTorch")
else:
    print("🔧 Running in local environment")

if IN_COLAB:
    print("üì¶ Installing required packages for Colab...")
    print("‚ö†Ô∏è  IMPORTANT: This will trigger a runtime restart - this is EXPECTED and REQUIRED!")
    print("")
    
    # Step 1: Install NumPy first (avoid compatibility issues)
    print("üîß Step 1: Installing NumPy 1.26.4 (downgrade from 2.x)...")
    !pip install numpy==1.26.4
    
    # Step 2: Install GPU-enabled PyTorch or CPU version
    if GPU_AVAILABLE:
        print("üîß Step 2: Installing GPU-enabled PyTorch...")
        !pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu121
    else:
        print("üîß Step 2: Installing CPU-only PyTorch...")
        !pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cpu
    
    # Step 3: Install transformers (core dependency)
    print("üîß Step 3: Installing transformers...")
    !pip install transformers==4.30.0
    
    # Step 4: Install sentence-transformers (depends on transformers)
    print("üîß Step 4: Installing sentence-transformers...")
    !pip install sentence-transformers==2.7.0
    
    # Step 5: Install remaining ML and visualization packages
    print("üîß Step 5: Installing additional ML and visualization packages...")
    !pip install scikit-learn pandas matplotlib seaborn
    !pip install plotly kaleido
    !pip install faiss-cpu networkx
    
    print("‚úÖ Package installation complete")
    print("")
    print("🚨 CRITICAL: RESTART RUNTIME NOW!")
    print("=" * 60)
    print("📋 Required steps:")
    print("   1. Look for the popup warning 'Restart session'")
    print("   2. Click the 'Restart' or 'RESTART RUNTIME' button")
    print("   3. OR manually: Runtime menu → Restart runtime")
    print("   4. After restart, run the STEP 2 cell to continue setup")
    print("")
    print("🔄 Why restart is essential:")
    print("   - NumPy downgrade 2.x → 1.26.4 (ML compatibility)")
    print("   - PyTorch version alignment with CUDA/CPU requirements")
    print("   - Fresh Python session prevents import conflicts")
    print("   - Proper dependency order: NumPy → PyTorch → transformers → sentence-transformers")
    print("")
    print("⚠️  DO NOT run the next cell until AFTER restart!")
    print("   The next cell clones the repository and sets up InsightSpike-AI")
    print("=" * 60)
    
else:
    print("üè† Local environment detected")
    print("üìã For local development:")
    print("   1. Ensure Poetry is installed: curl -sSL https://install.python-poetry.org | python3 -")
    print("   2. Install dependencies: poetry install")
    print("   3. Activate environment: poetry shell")
    print("   4. Or run in environment: poetry run jupyter lab")
    print("")
    print("‚úÖ Ready for local development")

In [None]:
# 🚨 STEP 2: Repository Setup and Import Verification
# ⚠️  Only run AFTER restarting runtime!

import sys
import os
from pathlib import Path

# Check environment and GPU status
try:
    import google.colab  # noqa: F401
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

device = "cpu"
if IN_COLAB:
    print("🔧 Running in Google Colab (Post-restart)")

    # Check GPU availability (a failed torch import must not be mistaken
    # for a non-Colab environment)
    try:
        import torch
        if torch.cuda.is_available():
            print(f"🎮 GPU Available: {torch.cuda.get_device_name(0)}")
            print(f"   CUDA Version: {torch.version.cuda}")
            device = "cuda"
        else:
            print("💻 Using CPU")
        print(f"   PyTorch Version: {torch.__version__}")
        print(f"   Device: {device}")
    except ImportError as e:
        print(f"⚠️  PyTorch import failed: {e}")
else:
    print("🏠 Running in local environment")

if IN_COLAB:
    # Clone repository if not exists
    repo_path = Path("/content/InsightSpike-AI")
    if not repo_path.exists():
        print("üì• Cloning InsightSpike-AI repository...")
        !git clone https://github.com/miyauchi0/InsightSpike-AI.git /content/InsightSpike-AI
    else:
        print("üìÅ Repository already exists")
    
    # Change to repository directory
    os.chdir("/content/InsightSpike-AI")
    print(f"üìÇ Working directory: {os.getcwd()}")
    
    # Add to Python path for imports
    sys.path.insert(0, "/content/InsightSpike-AI")
    print("üîß Added repository to Python path")

# Verify core imports with enhanced error handling
print("\nüîç Verifying package imports...")

import_status = {}

# Check NumPy version (critical for compatibility)
try:
    import numpy as np
    print(f"‚úÖ NumPy: {np.__version__}")
    import_status['numpy'] = True
    
    # Verify it's the downgraded version
    if np.__version__.startswith('1.26'):
        print("   ‚úÖ Compatible version (1.26.x)")
    else:
        print(f"   ‚ö†Ô∏è  Version {np.__version__} - may have compatibility issues")
except Exception as e:
    print(f"‚ùå NumPy: {e}")
    import_status['numpy'] = False

# Check sentence-transformers
try:
    from sentence_transformers import SentenceTransformer
    import sentence_transformers
    print(f"‚úÖ sentence-transformers: {sentence_transformers.__version__}")
    import_status['sentence_transformers'] = True
except Exception as e:
    print(f"‚ùå sentence-transformers: {e}")
    print("üîß Attempting repair...")
    if IN_COLAB:
        !pip install --force-reinstall sentence-transformers==2.7.0
        try:
            from sentence_transformers import SentenceTransformer
            print("‚úÖ sentence-transformers: Fixed after reinstall")
            import_status['sentence_transformers'] = True
        except Exception:
            print("‚ùå sentence-transformers: Still failing after repair")
            import_status['sentence_transformers'] = False
    else:
        import_status['sentence_transformers'] = False

# Check other core packages
packages_to_check = {
    'transformers': 'transformers',
    'torch': 'torch', 
    'sklearn': 'scikit-learn',
    'pandas': 'pandas',
    'matplotlib': 'matplotlib',
    'plotly': 'plotly',
    'faiss': 'faiss-cpu'
}

for package, pip_name in packages_to_check.items():
    try:
        __import__(package)
        print(f"‚úÖ {package}: Available")
        import_status[package] = True
    except Exception as e:
        print(f"‚ùå {package}: {e}")
        import_status[package] = False

# Try to import InsightSpike-AI components
print("\nüîç Verifying InsightSpike-AI imports...")

try:
    # Attempt direct import first
    from insightspike.core.rag_system import SimpleRAGSystem
    from insightspike.core.experiments import ExperimentRunner
    print("‚úÖ InsightSpike-AI: Successfully imported core components")
    import_status['insightspike'] = True
    
except ImportError as e:
    print(f"‚ö†Ô∏è  Direct import failed: {e}")
    print("üîß Attempting alternative import methods...")
    
    # Try adding src to path
    src_path = Path("src")
    if src_path.exists():
        sys.path.insert(0, str(src_path.absolute()))
        print(f"   Added {src_path.absolute()} to Python path")
        
        try:
            from insightspike.core.rag_system import SimpleRAGSystem
            from insightspike.core.experiments import ExperimentRunner
            print("‚úÖ InsightSpike-AI: Successfully imported via src path")
            import_status['insightspike'] = True
        except Exception as e2:
            print(f"‚ùå Still failed after src path: {e2}")
            import_status['insightspike'] = False
    else:
        print("‚ùå src directory not found")
        import_status['insightspike'] = False

# Report final status
print("\n📊 Import Summary:")
for package, status in import_status.items():
    status_icon = "✅" if status else "❌"
    print(f"   {status_icon} {package}")

failed_imports = [pkg for pkg, status in import_status.items() if not status]
if failed_imports:
    print(f"\n‚ö†Ô∏è  Failed imports: {', '.join(failed_imports)}")
    print("üí° Troubleshooting suggestions:")
    print("   1. Verify runtime was restarted after package installation")
    print("   2. Check for NumPy 2.x compatibility issues")
    print("   3. For InsightSpike-AI: ensure repository is properly cloned")
    print("   4. Consider reinstalling failed packages with --force-reinstall")
else:
    print("\nüéâ All imports successful! Ready to proceed with experiments.")

print(f"\n🎯 Setup check finished; experiments will run on {device.upper()}")

In [None]:
# Import Required Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import json
import time
import warnings
from datetime import datetime
from IPython.display import display, HTML, Markdown
from collections import defaultdict
import re

# Suppress warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
np.random.seed(42)

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

print("üéØ Environment setup complete!")

# Check GPU availability and PyTorch version
print("\nüî• GPU and PyTorch Status:")
try:
    import torch
    print(f"   üî• PyTorch: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"   üöÄ CUDA available: {torch.version.cuda}")
        print(f"   üéØ GPU device: {torch.cuda.get_device_name(0)}")
        print(f"   üíæ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
        device = "cuda"
    else:
        print("   üíª Using CPU (CUDA not available)")
        device = "cpu"
except ImportError:
    print("   ‚ùå PyTorch not available")
    device = "cpu"

# Check and verify sentence-transformers with proper dependency order
print("\nüîß Checking sentence-transformers compatibility...")
try:
    from sentence_transformers import SentenceTransformer
    print("‚úÖ Sentence Transformers available")
    SENTENCE_TRANSFORMERS_AVAILABLE = True
    
    # Test GPU compatibility for sentence-transformers
    if device == "cuda":
        try:
            test_model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
            print("‚úÖ Sentence Transformers GPU support confirmed")
            del test_model  # Clean up
        except Exception as e:
            print(f"‚ö†Ô∏è  GPU support issue: {e}")
            print("üîÑ Will use CPU for sentence-transformers")
            
except ImportError as e:
    print(f"‚ö†Ô∏è Sentence Transformers not available: {e}")
    print("üîÑ This should not happen with the new install order...")
    
    if IN_COLAB:
        print("üîÑ Attempting repair installation...")
        try:
            # Repair installation with correct order
            !pip install --force-reinstall transformers==4.30.0
            !pip install --force-reinstall sentence-transformers==2.7.0
            print("üì¶ Repair installation completed")
            
            # Test import again
            from sentence_transformers import SentenceTransformer
            print("‚úÖ Sentence Transformers now available")
            SENTENCE_TRANSFORMERS_AVAILABLE = True
        except Exception as e2:
            print(f"‚ùå Repair failed: {e2}")
            print("üìã Will use TF-IDF fallback for embeddings")
            SENTENCE_TRANSFORMERS_AVAILABLE = False
    else:
        print("üìã Using TF-IDF fallback for embeddings")
        SENTENCE_TRANSFORMERS_AVAILABLE = False

try:
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    print("‚úÖ Scikit-learn available")
    SKLEARN_AVAILABLE = True
except ImportError:
    print("‚ö†Ô∏è Scikit-learn not available - using simplified methods")
    SKLEARN_AVAILABLE = False

# Display comprehensive package status
print(f"\n📊 Package Availability Summary:")
print(f"   🔢 NumPy: {np.__version__}")
_torch = globals().get("torch")  # None when the PyTorch import above failed
torch_version = _torch.__version__ if _torch is not None else "not installed"
print(f"   🔥 PyTorch: {torch_version} ({'GPU' if device == 'cuda' else 'CPU'})")
print(f"   🧠 Sentence Transformers: {'✅ (GPU)' if SENTENCE_TRANSFORMERS_AVAILABLE and device == 'cuda' else '✅ (CPU)' if SENTENCE_TRANSFORMERS_AVAILABLE else '❌'}")
print(f"   📐 Scikit-learn: {'✅' if SKLEARN_AVAILABLE else '❌'}")

if not SENTENCE_TRANSFORMERS_AVAILABLE:
    print(f"\n💡 Note: Using TF-IDF vectors as a fallback for dense retrieval")
    print(f"   TF-IDF is a sparse lexical representation, so dense-baseline (DPR) results")
    print(f"   from this run are not representative and should be interpreted with caution")
else:
    print(f"\n🚀 Setup complete:")
    print(f"   - PyTorch running on {device.upper()}")
    print(f"   - Sentence-transformers available on {device.upper()}")
    print(f"   - Core dependencies imported successfully")

## 📊 Dataset Preparation and Preview

Let's examine the evaluation dataset we'll be using for this comparison.

In [None]:
# Create and Examine the Evaluation Dataset
# Tries the Hugging Face downloads first and falls back to synthetic data.
print("📊 Preparing evaluation dataset with HuggingFace downloads...")

def check_datasets_library():
    """Check if datasets library is available and install if needed"""
    try:
        import datasets
        print(f"‚úÖ HuggingFace datasets library available (v{datasets.__version__})")
        return True
    except ImportError:
        print("üì¶ Installing HuggingFace datasets library...")
        if IN_COLAB:
            import subprocess
            import sys
            
            # Install with proper progress feedback
            result = subprocess.run([sys.executable, "-m", "pip", "install", "datasets"], 
                                  capture_output=True, text=True)
            
            if result.returncode == 0:
                print("‚úÖ Datasets library installed successfully!")
                # Import after installation
                try:
                    import datasets
                    print(f"   Version: {datasets.__version__}")
                    return True
                except ImportError:
                    print("‚ùå Failed to import datasets after installation")
                    return False
            else:
                print(f"‚ùå Installation failed: {result.stderr}")
                return False
        else:
            print("‚ùå Please install datasets library: pip install datasets")
            return False

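def download_huggingface_datasets():
    """Download real datasets from Hugging Face with comprehensive error handling.

    NOTE: reads the module-level DATASET_CONFIG dict defined in the
    "Dataset Configuration" cell further down, so that cell must run first.
    A simpler function with the same name is defined later in this cell and
    shadows this one.
    """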
    
    # First, ensure datasets library is available
    if not check_datasets_library():
        print("‚ö†Ô∏è  Datasets library not available, falling back to synthetic data")
        return None, None
    
    try:
        from datasets import load_dataset
        import time
        
        print("\nüåê Downloading datasets from Hugging Face...")
        print("   üìù This may take a few minutes for first-time downloads...")
        
        datasets_downloaded = {}
        
        # Download each dataset with individual error handling
        for dataset_name, config in DATASET_CONFIG.items():
            try:
                print(f"\n   üìö Loading {config['description']}...")
                print(f"      Dataset: {config['name']}")
                print(f"      Split: {config['split']}")
                
                start_time = time.time()
                
                if config['subset']:
                    dataset = load_dataset(config['name'], config['subset'], split=config['split'])
                else:
                    dataset = load_dataset(config['name'], split=config['split'])
                
                download_time = time.time() - start_time
                
                print(f"      ‚úÖ Success! Downloaded {len(dataset)} samples in {download_time:.1f}s")
                datasets_downloaded[dataset_name] = dataset
                
            except Exception as e:
                print(f"      ‚ùå Failed to download {dataset_name}: {str(e)}")
                print(f"         Will use synthetic data for this portion")
                datasets_downloaded[dataset_name] = None
        
        # Return datasets (some might be None)
        nq_dataset = datasets_downloaded.get('natural_questions')
        hotpot_dataset = datasets_downloaded.get('hotpot_qa')
        
        success_count = sum(1 for ds in [nq_dataset, hotpot_dataset] if ds is not None)
        
        if success_count > 0:
            print(f"\n‚úÖ Successfully downloaded {success_count}/2 datasets from HuggingFace")
            if success_count < 2:
                print("   üìù Will supplement with synthetic data where needed")
        else:
            print("\n‚ö†Ô∏è  No HuggingFace datasets downloaded successfully")
            print("   üìù Will use synthetic data as complete fallback")
        
        return nq_dataset, hotpot_dataset
        
    except Exception as e:
        print(f"\n‚ùå Unexpected error during dataset download: {str(e)}")
        print("   üìù Falling back to synthetic data")
        return None, None

def verify_dataset_structure(dataset, dataset_name):
    """Verify that the downloaded dataset has expected structure"""
    if dataset is None:
        return False
        
    try:
        sample = dataset[0]
        
        if dataset_name == 'natural_questions':
            required_keys = ['question', 'document', 'annotations']
            return all(key in sample for key in required_keys)
            
        elif dataset_name == 'hotpot_qa':
            required_keys = ['question', 'answer', 'context']
            return all(key in sample for key in required_keys)
            
        return True
        
    except Exception as e:
        print(f"   ‚ö†Ô∏è  Dataset structure verification failed for {dataset_name}: {e}")
        return False

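def download_huggingface_datasets():
    """Download real datasets from Hugging Face (simple version).

    NOTE: this redefinition replaces the config-driven function above, so it
    is the one that create_expanded_dataset() actually calls.
    """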
    try:
        from datasets import load_dataset
        print("üì• Downloading datasets from Hugging Face...")
        
        # Download NaturalQuestions sample
        print("   üìö Loading Natural Questions dataset...")
        nq_dataset = load_dataset("natural_questions", split="validation[:100]")  # Small sample for Colab
        
        # Download HotpotQA sample  
        print("   üîó Loading HotpotQA dataset...")
        hotpot_dataset = load_dataset("hotpot_qa", "fullwiki", split="validation[:50]")
        
        return nq_dataset, hotpot_dataset
        
    except ImportError:
        print("‚ö†Ô∏è  Hugging Face datasets not available - installing...")
        if IN_COLAB:
            !pip install datasets
            try:
                from datasets import load_dataset
                return download_huggingface_datasets()  # Retry after install
            except Exception:
                return None, None
        else:
            print("‚ùå Please install: pip install datasets")
            return None, None
    except Exception as e:
        print(f"‚ùå Error downloading datasets: {e}")
        return None, None

def create_expanded_dataset():
    """Create evaluation dataset from HuggingFace or fallback to synthetic"""
    
    # Try to download real datasets first
    nq_dataset, hotpot_dataset = download_huggingface_datasets()
    
    questions = []
    documents = []
    
    if nq_dataset is not None and hotpot_dataset is not None:
        print("‚úÖ Using real Hugging Face datasets")
        
        # Process Natural Questions
        for i, example in enumerate(nq_dataset):
            if i >= 50:  # Limit for Colab performance
                break
                
            question_text = example['question']['text']
            
            # Extract answer if available. Assumes the Hugging Face
            # natural_questions schema: each annotation field is a list with
            # one entry per annotator, and we use the first annotator.
            annotations = example['annotations']
            short_answer = annotations['short_answers'][0]
            if annotations['yes_no_answer'][0] != -1:
                answer = "Yes" if annotations['yes_no_answer'][0] == 1 else "No"
            elif len(short_answer['start_token']) > 0:
                # short_answers[0] is a dict of parallel lists; take the first span
                answer_start = short_answer['start_token'][0]
                answer_end = short_answer['end_token'][0]
                answer = " ".join(example['document']['tokens']['token'][answer_start:answer_end])
            else:
                answer = "Unknown"
            
            # Extract document text
            doc_text = " ".join(example['document']['tokens']['token'][:500])  # Truncate for performance
            
            questions.append({
                "question": question_text,
                "answer": answer,
                "context": doc_text,
                "type": "factual",
                "source": "natural_questions"
            })
            
            documents.append(doc_text)
        
        # Process HotpotQA
        for i, example in enumerate(hotpot_dataset):
            if i >= 25:  # Limit for Colab performance
                break
                
            question_text = example['question']
            answer = example['answer']
            
            # Combine the first 3 context paragraphs (not only the supporting facts)
            context = " ".join([
                " ".join(sent) for sent in example['context']['sentences'][:3]  # First 3 paragraphs
            ])
            
            questions.append({
                "question": question_text,
                "answer": answer,
                "context": context,
                "type": "multi-hop",
                "source": "hotpot_qa"
            })
            
            documents.append(context)
            
    else:
        print("‚ö†Ô∏è  Using synthetic fallback dataset")
        
        # Fallback synthetic dataset
        synthetic_data = [
            {
                "question": "When was the Declaration of Independence signed?",
                "answer": "July 4, 1776",
                "context": "The Declaration of Independence was signed on July 4, 1776, in Philadelphia. This document declared the thirteen American colonies' independence from British rule.",
                "type": "factual"
            },
            {
                "question": "What is the capital of France?",
                "answer": "Paris",
                "context": "Paris is the capital and largest city of France. It is located in the north-central part of the country and is known for its art, culture, and cuisine.",
                "type": "factual"
            },
            {
                "question": "Who wrote 'Romeo and Juliet' and when was it written?",
                "answer": "William Shakespeare, around 1594-1596",
                "context": "Romeo and Juliet is a tragedy written by William Shakespeare. It was written around 1594-1596 and tells the story of two young star-crossed lovers.",
                "type": "multi-hop"
            },
            {
                "question": "What is photosynthesis?",
                "answer": "The process by which plants convert light energy into chemical energy",
                "context": "Photosynthesis is the biological process by which plants, algae, and some bacteria convert light energy from the sun into chemical energy stored in glucose molecules.",
                "type": "factual"
            },
            {
                "question": "If Einstein developed relativity and worked at Princeton, where did the theory of relativity originate?",
                "answer": "The theory was developed by Einstein, who later worked at Princeton",
                "context": "Albert Einstein developed the theory of relativity in the early 1900s. He later joined Princeton University where he continued his research until his death.",
                "type": "multi-hop"
            }
        ]
        
        questions = synthetic_data
        documents = [q["context"] for q in questions]
        
        # Expand with variations
        expanded_docs = []
        for doc in documents:
            expanded_docs.append(doc)
            # Add slight variations
            expanded_docs.append(doc.replace(".", ". Furthermore, this is an important historical fact."))
            
        documents = expanded_docs
    
    return questions, documents

# Load the dataset
questions, documents = create_expanded_dataset()

print(f"‚úÖ Dataset created:")
print(f"   üìù Questions: {len(questions)}")
print(f"   üìÑ Documents: {len(documents)}")

# Display dataset statistics
question_types = {}
sources = {}
for q in questions:
    qtype = q.get("type", "unknown")
    question_types[qtype] = question_types.get(qtype, 0) + 1
    
    source = q.get("source", "synthetic")
    sources[source] = sources.get(source, 0) + 1

print(f"\nüìà Dataset Statistics:")
print(f"   Question Types:")
for qtype, count in question_types.items():
    print(f"     {qtype}: {count} questions")

print(f"   Data Sources:")
for source, count in sources.items():
    print(f"     {source}: {count} questions")

# Show sample questions
print(f"\nüîç Sample Questions:")
print("-" * 50)

for i, q in enumerate(questions[:3]):
    source = q.get("source", "synthetic")
    print(f"Q{i+1} [{q.get('type', 'unknown')}] [{source}]: {q['question']}")
    print(f"   Answer: {q['answer']}")
    print(f"   Context: {q['context'][:100]}...")
    print()

In [None]:
# Document Analysis
print("üìÑ Document Corpus Analysis:")
print("-" * 40)

# Calculate document statistics
doc_lengths = [len(doc.split()) for doc in documents]
total_tokens = sum(doc_lengths)
avg_length = np.mean(doc_lengths)
std_length = np.std(doc_lengths)

print(f"Total documents: {len(documents)}")
print(f"Total tokens: {total_tokens:,}")
print(f"Average doc length: {avg_length:.1f} ± {std_length:.1f} tokens")
print(f"Min doc length: {min(doc_lengths)} tokens")
print(f"Max doc length: {max(doc_lengths)} tokens")

# Visualize document length distribution
plt.figure(figsize=(10, 6))
plt.subplot(1, 2, 1)
plt.hist(doc_lengths, bins=15, alpha=0.7, color='skyblue')
plt.xlabel('Document Length (tokens)')
plt.ylabel('Frequency')
plt.title('Document Length Distribution')
plt.grid(True, alpha=0.3)

# Plot the lengths of the first five documents
plt.subplot(1, 2, 2)
sample_docs = documents[:5]
doc_indices = range(1, len(sample_docs) + 1)
sample_lengths = [len(doc.split()) for doc in sample_docs]

plt.bar(doc_indices, sample_lengths, alpha=0.7, color='lightcoral')
plt.xlabel('Document Index')
plt.ylabel('Length (tokens)')
plt.title('Sample Document Lengths')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Display sample documents
print(f"\nüìÑ Sample Documents:")
print("-" * 50)
for i, doc in enumerate(documents[:3]):
    print(f"Doc {i+1}: {doc[:150]}...")
    print()

In [None]:
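# Dataset Configuration and Hugging Face Setup
# NOTE: run this cell before the dataset-creation cell above; the config-driven
# download function defined there reads DATASET_CONFIG.
import os
from pathlib import Path

# Hugging Face cache configuration (generally only takes effect when set
# before the Hugging Face libraries are first imported)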
os.environ['HF_HUB_CACHE'] = '/tmp/huggingface_cache'  # Use tmp for Colab
os.environ['TRANSFORMERS_CACHE'] = '/tmp/transformers_cache'

# Create cache directories
Path('/tmp/huggingface_cache').mkdir(exist_ok=True)
Path('/tmp/transformers_cache').mkdir(exist_ok=True)

# Dataset download configuration
DATASET_CONFIG = {
    'natural_questions': {
        'name': 'natural_questions',
        'subset': None,
        'split': 'validation[:100]',  # Small sample for Colab
        'description': 'Google Natural Questions dataset'
    },
    'hotpot_qa': {
        'name': 'hotpot_qa', 
        'subset': 'fullwiki',
        'split': 'validation[:50]',
        'description': 'HotpotQA multi-hop reasoning dataset'
    }
}

print("üîß Dataset configuration loaded:")
for name, config in DATASET_CONFIG.items():
    print(f"   üìä {name}: {config['description']}")
    print(f"      Split: {config['split']}")
    print()

print("üìÅ Cache directories configured:")
print(f"   üóÇÔ∏è  HuggingFace: {os.environ['HF_HUB_CACHE']}")
print(f"   üóÇÔ∏è  Transformers: {os.environ['TRANSFORMERS_CACHE']}")

In [None]:
# Hugging Face Authentication and Access Check
def check_huggingface_access():
    """Check Hugging Face access and authentication status"""
    print("üîê Checking Hugging Face access...")
    
    try:
        import huggingface_hub
        from huggingface_hub import HfApi
        
        # Check if logged in
        api = HfApi()
        
        try:
            # Try to get user info (requires authentication)
            user_info = api.whoami()
            if user_info:
                print(f"‚úÖ Logged in as: {user_info.get('name', 'Unknown User')}")
                return True
        except Exception:
            print("‚ÑπÔ∏è  Not logged in to Hugging Face (this is fine for public datasets)")
        
        # Check that the datasets library is importable
        # (no network request is made, so connectivity is not verified here)
        try:
            import datasets  # noqa: F401
            print("✅ datasets library importable (network access not tested)")
            return True
        except Exception as e:
            print(f"⚠️  Limited dataset access: {e}")
            return False
            
    except ImportError:
        print("üì¶ Installing huggingface_hub for better access...")
        if IN_COLAB:
            import subprocess
            import sys
            result = subprocess.run([sys.executable, "-m", "pip", "install", "huggingface_hub"], 
                                  capture_output=True, text=True)
            if result.returncode == 0:
                print("‚úÖ huggingface_hub installed")
                return check_huggingface_access()  # Retry
            else:
                print("‚ö†Ô∏è  Could not install huggingface_hub, proceeding anyway")
                return False
        else:
            print("üí° Consider installing: pip install huggingface_hub")
            return False

def setup_huggingface_cache():
    """Setup optimal caching for Hugging Face downloads in Colab"""
    
    print("üóÇÔ∏è  Setting up Hugging Face caching...")
    
    # Set cache locations
    cache_settings = {
        'HF_HOME': '/tmp/huggingface',
        'HF_HUB_CACHE': '/tmp/huggingface_hub',
        'TRANSFORMERS_CACHE': '/tmp/transformers', 
        'HF_DATASETS_CACHE': '/tmp/datasets'
    }
    
    for env_var, path in cache_settings.items():
        os.environ[env_var] = path
        Path(path).mkdir(parents=True, exist_ok=True)
        print(f"   📁 {env_var}: {path}")
    
    # Check available disk space
    import shutil
    total, used, free = shutil.disk_usage('/tmp')
    free_gb = free / (1024**3)  # true division keeps the fractional GB printed below
    
    print(f"   💾 Available cache space: {free_gb:.1f} GB")
    
    if free_gb < 2:
        print("   ⚠️  Low disk space - downloads may fail")
        print("   💡 Consider using smaller dataset splits")
    else:
        print("   ✅ Sufficient space for dataset downloads")

# Run setup (configure the cache first: the Hugging Face libraries read these
# environment variables when they are imported)
setup_huggingface_cache()
hf_access = check_huggingface_access()

print(f"\n🎯 Hugging Face Setup Summary:")
print(f"   🌐 API Access: {'✅ Ready' if hf_access else '⚠️  Limited'}")
print(f"   📁 Caching: ✅ Configured")
print(f"   🚀 Ready for dataset downloads!")

## 🔧 Retrieval System Initialization

Now let's initialize and test all the retrieval systems we'll be comparing.
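
One detail worth keeping in mind when combining retrievers: BM25 scores are unbounded, while cosine similarities stay within [-1, 1], so a weighted sum of raw scores tends to be dominated by BM25. The sketch below shows one common remedy, min-max normalization of each result list before fusion. It is an illustration under stated assumptions, not the notebook's actual fusion code; `min_max_normalize`, `fuse_scores`, and the example scores are hypothetical names and values introduced here.

```python
from typing import Dict, List, Tuple


def min_max_normalize(results: List[Tuple[int, float]]) -> Dict[int, float]:
    """Rescale (doc_idx, score) pairs to [0, 1]; a constant list maps to 1.0."""
    if not results:
        return {}
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return {idx: 1.0 for idx, _ in results}
    return {idx: (score - lo) / (hi - lo) for idx, score in results}


def fuse_scores(result_lists: Dict[str, List[Tuple[int, float]]],
                weights: Dict[str, float],
                k: int = 5) -> List[Tuple[int, float]]:
    """Weighted sum of per-retriever normalized scores, highest first."""
    combined: Dict[int, float] = {}
    for name, results in result_lists.items():
        for idx, score in min_max_normalize(results).items():
            combined[idx] = combined.get(idx, 0.0) + weights.get(name, 0.0) * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)[:k]


# Hypothetical scores for three documents from two retrievers.
example = {
    "bm25": [(0, 12.0), (1, 3.0), (2, 0.0)],
    "dense": [(1, 0.9), (0, 0.5), (2, 0.1)],
}
fused = fuse_scores(example, {"bm25": 0.5, "dense": 0.5}, k=2)
print(fused)
```

With normalization, each retriever contributes on the same scale, so the weights describe relative trust in a retriever rather than compensating for score magnitude.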

In [None]:
# Define Retrieval System Classes
print("🔧 Defining retrieval system classes...")

import re
from collections import Counter
import math

class BM25Retriever:
    """BM25 (Best Matching 25) retrieval system"""
    
    def __init__(self, documents, k1=1.5, b=0.75):
        self.documents = documents
        self.k1 = k1
        self.b = b
        self.tokenized_docs = [self._tokenize(doc) for doc in documents]
        self.doc_lengths = [len(doc) for doc in self.tokenized_docs]
        self.avg_doc_length = sum(self.doc_lengths) / max(len(self.doc_lengths), 1)  # guard against an empty corpus
        self.idf_cache = {}
        self._build_idf()
    
    def _tokenize(self, text):
        """Simple tokenization"""
        return re.findall(r'\b\w+\b', text.lower())
    
    def _build_idf(self):
        """Precompute IDF values"""
        # Count, for each token, the number of documents that contain it
        doc_freq = Counter()
        for doc in self.tokenized_docs:
            doc_freq.update(set(doc))
        
        n_docs = len(self.documents)
        for token, df in doc_freq.items():
            # The "1 +" inside the log keeps IDF non-negative for tokens that
            # occur in more than half of the documents
            self.idf_cache[token] = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    
    def retrieve(self, query, k=5):
        """Retrieve top-k documents for query"""
        query_tokens = self._tokenize(query)
        scores = []
        
        for i, doc in enumerate(self.tokenized_docs):
            score = 0
            doc_counter = Counter(doc)
            
            for token in query_tokens:
                if token in doc_counter:
                    tf = doc_counter[token]
                    idf = self.idf_cache.get(token, 0)
                    
                    # BM25 formula
                    numerator = tf * (self.k1 + 1)
                    denominator = tf + self.k1 * (1 - self.b + self.b * (self.doc_lengths[i] / self.avg_doc_length))
                    score += idf * (numerator / denominator)
            
            scores.append((i, score))
        
        # Sort by score and return top-k
        scores.sort(key=lambda x: x[1], reverse=True)
        return scores[:k]

class StaticEmbeddingRetriever:
    """TF-IDF based static embedding retrieval"""
    
    def __init__(self, documents):
        self.documents = documents
        if SKLEARN_AVAILABLE:
            from sklearn.feature_extraction.text import TfidfVectorizer
            
            self.vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
            self.doc_vectors = self.vectorizer.fit_transform(documents)
        else:
            self.vectorizer = None
            print("⚠️  Using simplified embedding (sklearn not available)")
    
    def retrieve(self, query, k=5):
        """Retrieve top-k documents for query"""
        if self.vectorizer is None:
            # Fallback: simple word overlap
            query_words = set(query.lower().split())
            scores = []
            
            for i, doc in enumerate(self.documents):
                doc_words = set(doc.lower().split())
                overlap = len(query_words & doc_words)
                scores.append((i, overlap / len(query_words) if query_words else 0))
            
            scores.sort(key=lambda x: x[1], reverse=True)
            return scores[:k]
        else:
            from sklearn.metrics.pairwise import cosine_similarity
            
            query_vector = self.vectorizer.transform([query])
            similarities = cosine_similarity(query_vector, self.doc_vectors).flatten()
            
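            # Get top-k indices (highest similarity first); cast to plain Python
            # types so the output matches the other retrievers
            top_indices = similarities.argsort()[-k:][::-1]
            return [(int(idx), float(similarities[idx])) for idx in top_indices]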

class DPRRetriever:
    """Dense Passage Retrieval using sentence transformers"""
    
    def __init__(self, documents):
        self.documents = documents
        
        if SENTENCE_TRANSFORMERS_AVAILABLE:
            from sentence_transformers import SentenceTransformer
            
            # Use a lightweight model for Colab
            self.model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
            print(f"   🔧 DPR using device: {device}")
            
            # Encode all documents
            print("   📊 Encoding documents...")
            self.doc_embeddings = self.model.encode(documents, convert_to_tensor=True)
        else:
            self.model = None
            print("   ⚠️  Using TF-IDF fallback for DPR")
            self.fallback = StaticEmbeddingRetriever(documents)
    
    def retrieve(self, query, k=5):
        """Retrieve top-k documents for query"""
        if self.model is None:
            return self.fallback.retrieve(query, k)
        
        import torch
        
        # Encode query
        query_embedding = self.model.encode(query, convert_to_tensor=True)
        
        # Compute similarities
        similarities = torch.cosine_similarity(query_embedding.unsqueeze(0), self.doc_embeddings)
        
        # Get top-k (clamped so torch.topk never asks for more items than exist)
        k = min(k, len(self.documents))
        top_k = torch.topk(similarities, k)
        top_k_indices = top_k.indices.cpu().numpy()
        top_k_scores = top_k.values.cpu().numpy()
        
        return [(int(idx), float(score)) for idx, score in zip(top_k_indices, top_k_scores)]

class InsightSpikeRAG:
    """InsightSpike Dynamic RAG with adaptive weighting"""
    
    def __init__(self, documents):
        self.documents = documents
        self.bm25 = BM25Retriever(documents)
        self.static = StaticEmbeddingRetriever(documents)
        
        if SENTENCE_TRANSFORMERS_AVAILABLE:
            self.dense = DPRRetriever(documents)
        else:
            self.dense = None
        
        # Adaptive weights (can be learned/tuned)
        self.weights = {
            'bm25': 0.4,
            'static': 0.3,
            'dense': 0.3 if self.dense else 0.0
        }
        
        # Normalize weights
        total_weight = sum(self.weights.values())
        self.weights = {k: v/total_weight for k, v in self.weights.items()}
    
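    def _adaptive_weighting(self, query):
        """Dynamically adjust weights based on query characteristics"""
        words = query.split()
        query_length = len(words)
        # Skip the first word: it is capitalized in almost every question
        has_entities = any(word[0].isupper() for word in words[1:])
        
        # Simple heuristics for demonstration
        if query_length > 10:  # Long queries favor dense retrieval
            weights = {'bm25': 0.2, 'static': 0.3, 'dense': 0.5}
        elif has_entities:  # Entity queries favor BM25
            weights = {'bm25': 0.6, 'static': 0.2, 'dense': 0.2}
        else:
            return self.weights
        
        # Without a dense retriever, redistribute its weight across the others
        if not self.dense:
            weights['dense'] = 0.0
            total = sum(weights.values())
            weights = {name: w / total for name, w in weights.items()}
        return weights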
    
    def retrieve(self, query, k=5):
        """Dynamic retrieval with adaptive weighting"""
        # Get adaptive weights
        weights = self._adaptive_weighting(query)
        
        # Get results from each system
        bm25_results = self.bm25.retrieve(query, k*2)  # Get more for fusion
        static_results = self.static.retrieve(query, k*2)
        
        if self.dense:
            dense_results = self.dense.retrieve(query, k*2)
        else:
            dense_results = []
        
        # Combine scores with adaptive weighting
        combined_scores = {}
        
        # BM25 scores
        for doc_idx, score in bm25_results:
            combined_scores[doc_idx] = combined_scores.get(doc_idx, 0) + weights['bm25'] * score
        
        # Static embedding scores
        for doc_idx, score in static_results:
            combined_scores[doc_idx] = combined_scores.get(doc_idx, 0) + weights['static'] * score
        
        # Dense scores
        for doc_idx, score in dense_results:
            combined_scores[doc_idx] = combined_scores.get(doc_idx, 0) + weights['dense'] * score
        
        # Sort and return top-k
        sorted_results = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
        return sorted_results[:k]

def evaluate_retrieval_system(retriever, questions, documents, k_values):
    """Evaluate a retrieval system on the given questions"""
    results = {
        "recall_at_k": {k: [] for k in k_values},
        "precision_at_k": {k: [] for k in k_values},
        "exact_matches": [],
        "f1_scores": [],
        "latencies": []
    }
    
    for q in questions:
        query = q["question"]
        expected_context = q["context"]
        expected_answer = q["answer"].lower()
        
        # Measure retrieval latency (perf_counter is a monotonic, high-resolution clock)
        start_time = time.perf_counter()
        retrieved_docs = retriever.retrieve(query, max(k_values))
        latency = time.perf_counter() - start_time
        results["latencies"].append(latency)
        
        # Calculate recall and precision at k
        for k in k_values:
            top_k_docs = retrieved_docs[:k]
            
            # Simple relevance check (context match)
            relevant_in_k = any(documents[doc_idx] == expected_context for doc_idx, _ in top_k_docs)
            
            results["recall_at_k"][k].append(1.0 if relevant_in_k else 0.0)
            results["precision_at_k"][k].append(1.0/k if relevant_in_k else 0.0)
        
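        # Exact match and F1 (simplified): "exact match" here means the answer
        # string appears in the top-1 retrieved document, not a generated answer
        retrieved_text = " ".join([documents[doc_idx] for doc_idx, _ in retrieved_docs[:1]])
        exact_match = 1.0 if expected_answer in retrieved_text.lower() else 0.0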
        
        # Simple F1 calculation
        answer_words = set(expected_answer.split())
        retrieved_words = set(retrieved_text.lower().split())
        
        if answer_words and retrieved_words:
            precision = len(answer_words & retrieved_words) / len(retrieved_words)
            recall = len(answer_words & retrieved_words) / len(answer_words)
            f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
        else:
            f1 = 0.0
        
        results["exact_matches"].append(exact_match)
        results["f1_scores"].append(f1)
    
    return results

def create_rag_visualization(all_results, questions):
    """Create comprehensive visualization of RAG comparison results"""
    import matplotlib.pyplot as plt
    import numpy as np
    
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    fig.suptitle('Dynamic RAG Comparison: Performance Analysis', fontsize=16, fontweight='bold')
    
    systems = list(all_results.keys())
    colors = plt.cm.Set3(np.linspace(0, 1, len(systems)))
    
    # 1. Recall@k comparison
    ax1 = axes[0, 0]
    k_values = [1, 3, 5]
    x = np.arange(len(k_values))
    width = 0.8 / len(systems)
    
    for i, system in enumerate(systems):
        recalls = [np.mean(all_results[system]["recall_at_k"][k]) for k in k_values]
        ax1.bar(x + i * width, recalls, width, label=system, color=colors[i], alpha=0.8)
    
    ax1.set_xlabel('k value')
    ax1.set_ylabel('Recall@k')
    ax1.set_title('Recall@k Performance')
    ax1.set_xticks(x + width * (len(systems) - 1) / 2)
    ax1.set_xticklabels([f'@{k}' for k in k_values])
    ax1.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    ax1.grid(True, alpha=0.3)
    
    # 2. Latency comparison
    ax2 = axes[0, 1]
    latencies = [np.mean(all_results[system]["latencies"]) * 1000 for system in systems]
    bars = ax2.bar(systems, latencies, color=colors, alpha=0.8)
    ax2.set_ylabel('Average Latency (ms)')
    ax2.set_title('Query Latency Comparison')
    plt.setp(ax2.get_xticklabels(), rotation=45, ha='right')  # rotate the existing tick labels
    ax2.grid(True, alpha=0.3)
    
    # Add value labels on bars
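    for bar, latency in zip(bars, latencies):
        # Anchor the label at the bar top; a fixed +1 ms offset misplaces it
        # when latencies are below a millisecond
        ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height(),
                f'{latency:.1f}ms', ha='center', va='bottom')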
    
    # 3. F1 Score comparison
    ax3 = axes[0, 2]
    f1_scores = [np.mean(all_results[system]["f1_scores"]) for system in systems]
    bars = ax3.bar(systems, f1_scores, color=colors, alpha=0.8)
    ax3.set_ylabel('Average F1 Score')
    ax3.set_title('F1 Score Comparison')
    plt.setp(ax3.get_xticklabels(), rotation=45, ha='right')  # rotate the existing tick labels
    ax3.grid(True, alpha=0.3)
    
    # 4. Exact Match comparison
    ax4 = axes[1, 0]
    exact_matches = [np.mean(all_results[system]["exact_matches"]) for system in systems]
    bars = ax4.bar(systems, exact_matches, color=colors, alpha=0.8)
    ax4.set_ylabel('Exact Match Rate')
    ax4.set_title('Exact Match Comparison')
    plt.setp(ax4.get_xticklabels(), rotation=45, ha='right')  # rotate the existing tick labels
    ax4.grid(True, alpha=0.3)
    
    # 5. Performance heatmap
    ax5 = axes[1, 1]
    metrics = ['Recall@5', 'Precision@5', 'Exact Match', 'F1 Score']
    heatmap_data = []
    
    for system in systems:
        row = [
            np.mean(all_results[system]["recall_at_k"][5]),
            np.mean(all_results[system]["precision_at_k"][5]),
            np.mean(all_results[system]["exact_matches"]),
            np.mean(all_results[system]["f1_scores"])
        ]
        heatmap_data.append(row)
    
    im = ax5.imshow(heatmap_data, cmap='RdYlGn', aspect='auto', vmin=0, vmax=1)
    ax5.set_xticks(range(len(metrics)))
    ax5.set_xticklabels(metrics, rotation=45, ha='right')
    ax5.set_yticks(range(len(systems)))
    ax5.set_yticklabels(systems)
    ax5.set_title('Performance Heatmap')
    
    # Add text annotations
    for i in range(len(systems)):
        for j in range(len(metrics)):
            text = ax5.text(j, i, f'{heatmap_data[i][j]:.3f}',
                           ha="center", va="center", color="black", fontweight='bold')
    
    # 6. Overall ranking
    ax6 = axes[1, 2]
    
    # Calculate weighted score (you can adjust weights)
    weights = {'recall': 0.3, 'precision': 0.2, 'em': 0.3, 'f1': 0.2}
    
    overall_scores = []
    for system in systems:
        score = (weights['recall'] * np.mean(all_results[system]["recall_at_k"][5]) +
                weights['precision'] * np.mean(all_results[system]["precision_at_k"][5]) +
                weights['em'] * np.mean(all_results[system]["exact_matches"]) +
                weights['f1'] * np.mean(all_results[system]["f1_scores"]))
        overall_scores.append(score)
    
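    # Sort by score, keeping each system's color consistent with the other panels
    system_colors = dict(zip(systems, colors))
    sorted_data = sorted(zip(systems, overall_scores), key=lambda x: x[1], reverse=True)
    sorted_systems, sorted_scores = zip(*sorted_data)
    
    bars = ax6.barh(range(len(sorted_systems)), sorted_scores,
                    color=[system_colors[s] for s in sorted_systems], alpha=0.8)
    ax6.invert_yaxis()  # show the best-scoring system at the top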
    ax6.set_yticks(range(len(sorted_systems)))
    ax6.set_yticklabels(sorted_systems)
    ax6.set_xlabel('Overall Score')
    ax6.set_title('Overall Performance Ranking')
    ax6.grid(True, alpha=0.3)
    
    # Add score labels
    for i, (bar, score) in enumerate(zip(bars, sorted_scores)):
        ax6.text(bar.get_width() + 0.01, bar.get_y() + bar.get_height()/2,
                f'{score:.3f}', ha='left', va='center')
    
    plt.tight_layout()
    return fig

# Initialize All Retrieval Systems
print("🔧 Initializing retrieval systems...")

# Track initialization time for each system
init_times = {}

# 1. BM25 Retriever
print("\n📊 Initializing BM25 Retriever...")
start_time = time.time()
bm25_retriever = BM25Retriever(documents)
init_times["BM25"] = time.time() - start_time
print(f"   ✅ BM25 initialized in {init_times['BM25']:.3f}s")

# 2. Static Embedding Retriever
print("\n🔢 Initializing Static Embedding Retriever...")
start_time = time.time()
static_retriever = StaticEmbeddingRetriever(documents)
init_times["Static Embeddings"] = time.time() - start_time
print(f"   ✅ Static Embeddings initialized in {init_times['Static Embeddings']:.3f}s")

# 3. DPR Retriever (if available)
if SENTENCE_TRANSFORMERS_AVAILABLE:
    print("\n🧠 Initializing DPR-style Dense Retriever...")
    start_time = time.time()
    dpr_retriever = DPRRetriever(documents)
    init_times["DPR (Dense)"] = time.time() - start_time
    print(f"   ✅ DPR initialized in {init_times['DPR (Dense)']:.3f}s")
else:
    print("\n⚠️ DPR not available - skipping dense retrieval")

# 4. InsightSpike Dynamic RAG
print("\n🚀 Initializing InsightSpike Dynamic RAG...")
start_time = time.time()
insightspike_rag = InsightSpikeRAG(documents)
init_times["InsightSpike Dynamic RAG"] = time.time() - start_time
print(f"   ✅ InsightSpike RAG initialized in {init_times['InsightSpike Dynamic RAG']:.3f}s")

# Display initialization summary
print(f"\n⏱️ Initialization Times Summary:")
print("-" * 40)
for system, init_time in init_times.items():
    print(f"{system:<25}: {init_time:.3f}s")

# Create Comprehensive Evaluation Dataset
def safe_extract_text(text_data, max_tokens=500):
    """Safely extract text from various data structures"""
    if isinstance(text_data, str):
        return ' '.join(text_data.split()[:max_tokens])
    elif isinstance(text_data, list):
        if all(isinstance(item, str) for item in text_data):
            return ' '.join(text_data[:max_tokens])
        else:
            # Handle nested structures
            flat_text = []
            for item in text_data:
                if isinstance(item, str):
                    flat_text.extend(item.split())
                elif isinstance(item, list):
                    flat_text.extend(' '.join(str(x) for x in item).split())
            return ' '.join(flat_text[:max_tokens])
    else:
        return str(text_data)[:max_tokens*5]  # Rough character limit

def process_natural_questions(dataset, max_samples=50):
    """Process Natural Questions dataset with robust error handling"""
    questions = []
    documents = []
    
    if dataset is None:
        return questions, documents
    
    print(f"   📚 Processing Natural Questions ({min(len(dataset), max_samples)} samples)...")
    
    successful_samples = 0
    
    for i, example in enumerate(dataset):
        if successful_samples >= max_samples:
            break
            
        try:
            # Extract question
            question_text = example.get('question', {})
            if isinstance(question_text, dict):
                question_text = question_text.get('text', '')
            
            if not question_text:
                continue
                
            # Extract answer with multiple fallbacks
            answer = "Unknown"
            annotations = example.get('annotations', {})
            
            # Try yes/no answer first
            yes_no = annotations.get('yes_no_answer', [])
            if yes_no and len(yes_no) > 0 and yes_no[0] != -1:
                answer = "Yes" if yes_no[0] == 1 else "No"
            else:
                # Try short answers
                short_answers = annotations.get('short_answers', [])
                if short_answers and len(short_answers) > 0 and short_answers[0]:
                    try:
                        if isinstance(short_answers[0], list) and len(short_answers[0]) > 0:
                            answer_info = short_answers[0][0]
                            if isinstance(answer_info, dict):
                                start_token = answer_info.get('start_token', 0)
                                end_token = answer_info.get('end_token', start_token + 5)
                                
                                # Extract from document tokens
                                document = example.get('document', {})
                                tokens = document.get('tokens', {})
                                token_list = tokens.get('token', [])
                                
                                if token_list and start_token < len(token_list):
                                    end_token = min(end_token, len(token_list))
                                    answer = ' '.join(token_list[start_token:end_token])
                    except Exception:
                        pass  # Malformed annotation: keep "Unknown" as fallback
            
            # Extract document text
            document = example.get('document', {})
            tokens = document.get('tokens', {})
            token_list = tokens.get('token', [])
            
            if token_list:
                doc_text = safe_extract_text(token_list, max_tokens=500)
            else:
                doc_text = str(document)[:1000]  # Fallback
            
            if doc_text and len(doc_text.strip()) > 10:  # Minimum meaningful content
                questions.append({
                    "question": question_text,
                    "answer": answer,
                    "context": doc_text,
                    "type": "factual",
                    "source": "natural_questions"
                })
                
                documents.append(doc_text)
                successful_samples += 1
                
        except Exception as e:
            print(f"      ⚠️  Error processing NQ sample {i}: {e}")
            continue
    
    print(f"      ✅ Successfully processed {successful_samples} Natural Questions samples")
    return questions, documents

def process_hotpot_qa(dataset, max_samples=25):
    """Process HotpotQA dataset with robust error handling"""
    questions = []
    documents = []
    
    if dataset is None:
        return questions, documents
    
    print(f"   🔗 Processing HotpotQA ({min(len(dataset), max_samples)} samples)...")
    
    successful_samples = 0
    
    for i, example in enumerate(dataset):
        if successful_samples >= max_samples:
            break
            
        try:
            question_text = example.get('question', '')
            answer = example.get('answer', 'Unknown')
            
            if not question_text:
                continue
            
            # Build context from the first few context paragraphs
            # (paragraphs are not filtered by the supporting facts)
            context_parts = []
            context_data = example.get('context', {})
            
            if isinstance(context_data, dict):
                sentences = context_data.get('sentences', [])
                if sentences:
                    # Take first few paragraphs
                    for sentence_group in sentences[:3]:
                        if isinstance(sentence_group, list):
                            context_parts.extend(sentence_group)
                        else:
                            context_parts.append(str(sentence_group))
            elif isinstance(context_data, list):
                # Direct list of context
                context_parts = context_data[:10]  # Limit context
            
            context = safe_extract_text(context_parts, max_tokens=400)
            
            if context and len(context.strip()) > 10:  # Minimum meaningful content
                questions.append({
                    "question": question_text,
                    "answer": answer,
                    "context": context,
                    "type": "multi-hop",
                    "source": "hotpot_qa"
                })
                
                documents.append(context)
                successful_samples += 1
                
        except Exception as e:
            print(f"      ⚠️  Error processing HotpotQA sample {i}: {e}")
            continue
    
    print(f"      ✅ Successfully processed {successful_samples} HotpotQA samples")
    return questions, documents

def create_synthetic_dataset():
    """Create high-quality synthetic dataset for fallback"""
    print("   🎨 Creating synthetic evaluation dataset...")
    
    synthetic_data = [
        {
            "question": "When was the Declaration of Independence signed?",
            "answer": "July 4, 1776",
            "context": "The Declaration of Independence was signed on July 4, 1776, in Philadelphia. This document declared the thirteen American colonies' independence from British rule and established the United States as a sovereign nation.",
            "type": "factual",
            "source": "synthetic"
        },
        {
            "question": "What is the capital of France?",
            "answer": "Paris",
            "context": "Paris is the capital and largest city of France. It is located in the north-central part of the country and is known for its art, culture, cuisine, and iconic landmarks like the Eiffel Tower and Louvre Museum.",
            "type": "factual",
            "source": "synthetic"
        },
        {
            "question": "Who wrote 'Romeo and Juliet' and when was it written?",
            "answer": "William Shakespeare, around 1594-1596",
            "context": "Romeo and Juliet is a tragedy written by William Shakespeare. It was written around 1594-1596 and tells the story of two young star-crossed lovers whose deaths ultimately unite their feuding families in Verona, Italy.",
            "type": "multi-hop",
            "source": "synthetic"
        },
        {
            "question": "What is photosynthesis?",
            "answer": "The process by which plants convert light energy into chemical energy",
            "context": "Photosynthesis is the biological process by which plants, algae, and some bacteria convert light energy from the sun into chemical energy stored in glucose molecules. This process uses carbon dioxide and water as inputs.",
            "type": "factual",
            "source": "synthetic"
        },
        {
            "question": "If Einstein developed relativity and worked at Princeton, where did the theory of relativity originate?",
            "answer": "The theory was developed by Einstein, who later worked at Princeton",
            "context": "Albert Einstein developed the theory of relativity in the early 1900s while working at various institutions. He later joined the Institute for Advanced Study in Princeton, New Jersey, where he continued his research until his death in 1955.",
            "type": "multi-hop",
            "source": "synthetic"
        },
        {
            "question": "What is the largest planet in our solar system?",
            "answer": "Jupiter",
            "context": "Jupiter is the largest planet in our solar system, with a mass greater than all other planets combined. It is a gas giant located fifth from the Sun and is known for its Great Red Spot, a giant storm larger than Earth.",
            "type": "factual",
            "source": "synthetic"
        },
        {
            "question": "Who invented the telephone and when?",
            "answer": "Alexander Graham Bell in 1876",
            "context": "Alexander Graham Bell invented the telephone in 1876. Bell was a Scottish-born inventor and scientist who was awarded the first U.S. patent for the telephone on March 7, 1876. The first successful telephone call was made on March 10, 1876.",
            "type": "factual",
            "source": "synthetic"
        },
        {
            "question": "If Shakespeare wrote Hamlet and lived during Elizabeth I's reign, what era was Hamlet written in?",
            "answer": "The Elizabethan era",
            "context": "William Shakespeare wrote Hamlet during the Elizabethan era, specifically around 1600-1601. Queen Elizabeth I reigned from 1558 to 1603, and Shakespeare wrote most of his famous plays during this period of English history.",
            "type": "multi-hop",
            "source": "synthetic"
        }
    ]
    
    print(f"      ✅ Created {len(synthetic_data)} synthetic samples")
    return synthetic_data

def create_expanded_dataset():
    """Create comprehensive evaluation dataset with real and synthetic data"""
    
    print("🔄 Creating comprehensive evaluation dataset...")
    
    all_questions = []
    all_documents = []
    
    # Process real datasets if available
    if nq_dataset is not None:
        nq_questions, nq_docs = process_natural_questions(nq_dataset)
        all_questions.extend(nq_questions)
        all_documents.extend(nq_docs)
    
    if hotpot_dataset is not None:
        hq_questions, hq_docs = process_hotpot_qa(hotpot_dataset)
        all_questions.extend(hq_questions)
        all_documents.extend(hq_docs)
    
    # Add synthetic data (always include some for diversity)
    synthetic_data = create_synthetic_dataset()
    all_questions.extend(synthetic_data)
    all_documents.extend([q["context"] for q in synthetic_data])
    
    # Create document variations for better retrieval testing
    print("   📑 Creating document variations for comprehensive testing...")
    expanded_docs = []
    for doc in all_documents:
        expanded_docs.append(doc)
        # Add a near-duplicate distractor: the same text plus one extra sentence
        # (replacing every "." would also split abbreviations such as "U.S.")
        variation = doc.rstrip() + " This information is historically significant."
        expanded_docs.append(variation)
    
    print(f"   ✅ Dataset expansion complete")
    return all_questions, expanded_docs

# Create the final dataset
questions, documents = create_expanded_dataset()

print(f"\n📊 Final Dataset Summary:")
print(f"   📝 Total Questions: {len(questions)}")
print(f"   📄 Total Documents: {len(documents)}")

# Dataset statistics
question_types = {}
sources = {}
for q in questions:
    qtype = q.get("type", "unknown")
    question_types[qtype] = question_types.get(qtype, 0) + 1
    
    source = q.get("source", "unknown")
    sources[source] = sources.get(source, 0) + 1

print(f"\n📈 Dataset Composition:")
print(f"   Question Types:")
for qtype, count in question_types.items():
    print(f"     📊 {qtype}: {count} questions")

print(f"   Data Sources:")
for source, count in sources.items():
    emoji = "🌐" if source in ["natural_questions", "hotpot_qa"] else "🎨"
    print(f"     {emoji} {source}: {count} questions")

# Show sample questions
print(f"\n🔍 Sample Questions Preview:")
print("-" * 60)

for i, q in enumerate(questions[:3]):
    source = q.get("source", "unknown")
    qtype = q.get("type", "unknown")
    print(f"Q{i+1} [{qtype}] [{source}]:")
    print(f"   ❓ Question: {q['question']}")
    print(f"   ✅ Answer: {q['answer']}")
    print(f"   📄 Context: {q['context'][:100]}...")
    print()

print("🎯 Dataset ready for RAG system evaluation!")

# Rebuild the retrievers on the final corpus: the instances created earlier in
# this cell were indexed before `documents` was reassigned above, so their
# document indices would not line up with the new list.
bm25_retriever = BM25Retriever(documents)
static_retriever = StaticEmbeddingRetriever(documents)
if SENTENCE_TRANSFORMERS_AVAILABLE:
    dpr_retriever = DPRRetriever(documents)
insightspike_rag = InsightSpikeRAG(documents)

In [None]:
# Test Retrieval Systems with Sample Query
print("🧪 Testing retrieval systems with sample query...")

sample_query = "When was the Declaration of Independence signed?"
print(f"Test Query: '{sample_query}'")
print("-" * 60)

# Test each retriever
retrievers = {
    "BM25": bm25_retriever,
    "Static Embeddings": static_retriever,
    "InsightSpike Dynamic RAG": insightspike_rag
}

if SENTENCE_TRANSFORMERS_AVAILABLE:
    retrievers["DPR (Dense)"] = dpr_retriever

for name, retriever in retrievers.items():
    print(f"\n🔍 {name} Results:")
    start_time = time.time()
    results = retriever.retrieve(sample_query, k=3)
    query_time = time.time() - start_time
    
    print(f"   Query time: {query_time*1000:.1f}ms")
    
    for i, (doc_idx, score) in enumerate(results):
        doc_preview = documents[doc_idx][:100] + "..." if len(documents[doc_idx]) > 100 else documents[doc_idx]
        print(f"   {i+1}. Score: {score:.3f} | Doc: {doc_preview}")

## 🚀 Running the Complete Evaluation

Now let's run the comprehensive evaluation across all systems and metrics.
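
For orientation, here is a minimal sketch of how Recall@k and Precision@k can be computed for a single query. It assumes, as the retriever test above suggests, that a retriever returns a ranked list of `(doc_idx, score)` pairs. The function name `recall_precision_at_k` and the sample ranking are hypothetical; the notebook's own `evaluate_retrieval_system` may compute these metrics differently.

```python
def recall_precision_at_k(retrieved, relevant, k):
    """Return (recall@k, precision@k) for one query.

    retrieved: ranked list of (doc_idx, score) pairs, best first
    relevant:  set of document indices judged relevant for the query
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = [doc_idx for doc_idx, _ in retrieved[:k]]
    hits = sum(1 for doc_idx in top_k if doc_idx in relevant)
    # Recall: share of relevant documents found in the top k
    recall = hits / len(relevant) if relevant else 0.0
    # Precision: share of the k retrieved slots filled by relevant documents
    precision = hits / k
    return recall, precision


# Hypothetical ranking and gold labels, for illustration only
ranked = [(4, 0.91), (7, 0.80), (2, 0.55), (9, 0.31), (0, 0.12)]
gold = {7, 9}

for k in (1, 3, 5):
    recall, precision = recall_precision_at_k(ranked, gold, k)
    print(f"k={k}: recall={recall:.3f}, precision={precision:.3f}")
```

Averaging these per-query values over all questions gives the system-level figures reported in the quick summary.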

In [None]:
# Run Complete Evaluation
print("🚀 Starting comprehensive RAG evaluation...")
print("⏰ This will take a few minutes to complete...")

# Configure evaluation parameters
k_values = [1, 3, 5]
print(f"📊 Evaluating with k values: {k_values}")

# Initialize results storage
all_results = {}

# Evaluate each system
for name, retriever in retrievers.items():
    print(f"\n🔍 Evaluating {name}...")
    
    # Run evaluation
    results = evaluate_retrieval_system(retriever, questions, documents, k_values)
    all_results[name] = results
    
    # Display quick summary
    avg_recall_5 = np.mean(results["recall_at_k"][5])
    avg_precision_5 = np.mean(results["precision_at_k"][5])
    avg_em = np.mean(results["exact_matches"])
    avg_f1 = np.mean(results["f1_scores"])
    avg_latency = np.mean(results["latencies"])
    
    print(f"   📈 Quick Summary:")
    print(f"      Recall@5: {avg_recall_5:.3f}")
    print(f"      Precision@5: {avg_precision_5:.3f}")
    print(f"      Exact Match: {avg_em:.3f}")
    print(f"      F1 Score: {avg_f1:.3f}")
    print(f"      Avg Latency: {avg_latency*1000:.1f}ms")

print("\n✅ Evaluation completed for all systems!")

## 📈 Results Visualization and Analysis

Let's create comprehensive visualizations to understand the performance differences between systems.
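
Because every system is scored on the same questions, a paired view of the per-question differences can complement the summary statistics. The sketch below is not part of the notebook's pipeline. It computes a percentile bootstrap confidence interval for the mean per-question difference using only the standard library. The function name `paired_bootstrap_ci` and the sample scores are hypothetical.

```python
import random
import statistics


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-question difference (a - b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # seeded so the interval is reproducible
    # Resample the per-question differences with replacement and record each mean
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(diffs), lower, upper


# Hypothetical per-question Recall@5 values for two systems (illustrative only)
system_a = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
system_b = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]

mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
print(f"mean difference={mean_diff:.3f}, 95% CI=[{ci_low:.3f}, {ci_high:.3f}]")
```

An interval that excludes zero suggests a consistent per-question advantage. With a small simulated dataset the interval will be wide, so it should be read as indicative rather than conclusive.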

In [None]:
# Create Main Visualization
print("📈 Creating comprehensive results visualization...")

# Generate the main comparison visualization
fig = create_rag_visualization(all_results, questions)
plt.show()

print("✅ Main visualization complete!")

In [None]:
# Detailed Performance Analysis
print("📊 Detailed Performance Analysis")
print("=" * 50)

systems = list(all_results.keys())

# Create detailed comparison table
comparison_data = []
for system in systems:
    results = all_results[system]
    
    row = {
        "System": system,
        "Recall@1": f"{np.mean(results['recall_at_k'][1]):.3f} ± {np.std(results['recall_at_k'][1]):.3f}",
        "Recall@3": f"{np.mean(results['recall_at_k'][3]):.3f} ± {np.std(results['recall_at_k'][3]):.3f}",
        "Recall@5": f"{np.mean(results['recall_at_k'][5]):.3f} ± {np.std(results['recall_at_k'][5]):.3f}",
        "Precision@5": f"{np.mean(results['precision_at_k'][5]):.3f} ± {np.std(results['precision_at_k'][5]):.3f}",
        "Exact Match": f"{np.mean(results['exact_matches']):.3f} ± {np.std(results['exact_matches']):.3f}",
        "F1 Score": f"{np.mean(results['f1_scores']):.3f} ± {np.std(results['f1_scores']):.3f}",
        "Latency (ms)": f"{np.mean(results['latencies'])*1000:.1f} ± {np.std(results['latencies'])*1000:.1f}"
    }
    comparison_data.append(row)

comparison_df = pd.DataFrame(comparison_data)
display(HTML(comparison_df.to_html(index=False, table_id="comparison_table")))

# Statistical Significance Testing
print(f"\n🔬 Statistical Significance Analysis:")
print("-" * 40)

from scipy import stats


def cohens_d(group1, group2):
    """Cohen's d with a pooled standard deviation; NaN when both groups have zero variance."""
    n1, n2 = len(group1), len(group2)
    pooled_std = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                          (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
    if pooled_std == 0:
        return float("nan")  # effect size is undefined without any variance
    return (np.mean(group1) - np.mean(group2)) / pooled_std


# Compare InsightSpike against each baseline.
# Note: ttest_ind treats the two score lists as independent samples, although every
# system is scored on the same questions; stats.ttest_rel is the paired variant.
insightspike_name = "InsightSpike Dynamic RAG"
if insightspike_name in all_results:
    insightspike_recall5 = all_results[insightspike_name]["recall_at_k"][5]
    insightspike_em = all_results[insightspike_name]["exact_matches"]

    for system in systems:
        if system == insightspike_name:
            continue

        system_recall5 = all_results[system]["recall_at_k"][5]
        system_em = all_results[system]["exact_matches"]

        # T-tests for Recall@5 and Exact Match
        _, p_recall = stats.ttest_ind(insightspike_recall5, system_recall5)
        _, p_em = stats.ttest_ind(insightspike_em, system_em)

        # Effect sizes (Cohen's d)
        recall_effect = cohens_d(insightspike_recall5, system_recall5)
        em_effect = cohens_d(insightspike_em, system_em)

        print(f"\nInsightSpike vs {system}:")
        print(f"  Recall@5: p={p_recall:.4f}, Cohen's d={recall_effect:.3f}")
        print(f"  Exact Match: p={p_em:.4f}, Cohen's d={em_effect:.3f}")

        # Interpretation (a NaN p-value falls through to "no significant difference")
        if p_recall < 0.05:
            print(f"  Recall@5: Statistically significant difference ✅")
        else:
            print(f"  Recall@5: No significant difference ❌")

In [None]:
# Performance by Question Type Analysis
print("🎯 Performance by Question Type")
print("=" * 40)

# Separate results by question type
factual_questions = [(i, q) for i, q in enumerate(questions) if q.get("type") == "factual"]
multihop_questions = [(i, q) for i, q in enumerate(questions) if q.get("type") == "multi-hop"]

print(f"Factual questions: {len(factual_questions)}")
print(f"Multi-hop questions: {len(multihop_questions)}")

# Calculate performance by question type
type_performance = {}

for system in systems:
    results = all_results[system]
    
    # Factual performance
    factual_recall5 = [results["recall_at_k"][5][i] for i, _ in factual_questions]
    factual_em = [results["exact_matches"][i] for i, _ in factual_questions]
    
    # Multi-hop performance
    multihop_recall5 = [results["recall_at_k"][5][i] for i, _ in multihop_questions]
    multihop_em = [results["exact_matches"][i] for i, _ in multihop_questions]
    
    type_performance[system] = {
        "factual": {
            "recall5": np.mean(factual_recall5) if factual_recall5 else 0,
            "em": np.mean(factual_em) if factual_em else 0
        },
        "multihop": {
            "recall5": np.mean(multihop_recall5) if multihop_recall5 else 0,
            "em": np.mean(multihop_em) if multihop_em else 0
        }
    }

# Visualize question type performance
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Recall@5 by question type
ax1 = axes[0]
x = np.arange(len(systems))
width = 0.35

factual_recall = [type_performance[sys]["factual"]["recall5"] for sys in systems]
multihop_recall = [type_performance[sys]["multihop"]["recall5"] for sys in systems]

ax1.bar(x - width/2, factual_recall, width, label='Factual', alpha=0.8)
ax1.bar(x + width/2, multihop_recall, width, label='Multi-hop', alpha=0.8)

ax1.set_xlabel('System')
ax1.set_ylabel('Recall@5')
ax1.set_title('Recall@5 by Question Type')
ax1.set_xticks(x)
ax1.set_xticklabels([s.replace(' ', '\n') for s in systems], fontsize=9)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Exact Match by question type
ax2 = axes[1]
factual_em = [type_performance[sys]["factual"]["em"] for sys in systems]
multihop_em = [type_performance[sys]["multihop"]["em"] for sys in systems]

ax2.bar(x - width/2, factual_em, width, label='Factual', alpha=0.8)
ax2.bar(x + width/2, multihop_em, width, label='Multi-hop', alpha=0.8)

ax2.set_xlabel('System')
ax2.set_ylabel('Exact Match')
ax2.set_title('Exact Match by Question Type')
ax2.set_xticks(x)
ax2.set_xticklabels([s.replace(' ', '\n') for s in systems], fontsize=9)
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print detailed breakdown
print(f"\n📋 Detailed Question Type Performance:")
print("-" * 60)
print(f"{'System':<25} {'Factual R@5':<12} {'Factual EM':<11} {'Multi-hop R@5':<14} {'Multi-hop EM':<12}")
print("-" * 60)

for system in systems:
    perf = type_performance[system]
    print(f"{system:<25} {perf['factual']['recall5']:<12.3f} {perf['factual']['em']:<11.3f} "
          f"{perf['multihop']['recall5']:<14.3f} {perf['multihop']['em']:<12.3f}")

## 💾 Save Results and Create Download Package

Let's save all our experimental results and create a downloadable package.
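
One detail when saving results as JSON is that values such as numpy arrays are not serializable by default. The cell below handles this with a recursive converter. As an alternative sketch, `json.dump` and `json.dumps` accept a `default` hook that is called only for objects the encoder cannot handle. The helper name `to_jsonable` is hypothetical. It relies on duck typing, on the assumption that numpy arrays and scalars expose `.tolist()`. The demo uses the standard library's `array.array`, which also has `.tolist()`, so that it runs without numpy.

```python
import json
from array import array
from pathlib import Path


def to_jsonable(obj):
    """Fallback for json.dump: called only for objects json cannot encode itself."""
    if hasattr(obj, "tolist"):   # numpy arrays and scalars, array.array
        return obj.tolist()
    if hasattr(obj, "item"):     # other scalar wrappers that expose .item()
        return obj.item()
    if isinstance(obj, (set, frozenset)):
        return sorted(obj)
    if isinstance(obj, Path):
        return str(obj)
    raise TypeError(f"{type(obj).__name__} is not JSON serializable")


# Hypothetical payload mixing types that json cannot encode directly
payload = {
    "latencies": array("d", [0.5, 0.25]),
    "k_values": {1, 3, 5},
    "output_dir": Path("rag_comparison_results"),
}

encoded = json.dumps(payload, default=to_jsonable, sort_keys=True)
print(encoded)
```

The hook keeps the conversion logic in one place and avoids copying the whole results structure before writing it.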

In [None]:
# Save Experimental Results
print("💾 Saving experimental results...")

# Create results directory
results_dir = Path("rag_comparison_results")
results_dir.mkdir(exist_ok=True)

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

# Prepare comprehensive results data
results_data = {
    "timestamp": timestamp,
    "experiment_type": "dynamic_rag_comparison",
    "environment": "Google Colab" if IN_COLAB else "Local",
    "dataset_info": {
        "num_questions": len(questions),
        "num_documents": len(documents),
        "question_types": {
            "factual": len([q for q in questions if q.get("type") == "factual"]),
            "multi_hop": len([q for q in questions if q.get("type") == "multi-hop"])
        }
    },
    "systems_evaluated": list(all_results.keys()),
    "evaluation_metrics": {
        "recall_at_k": k_values,
        "precision_at_k": k_values,
        "exact_match": True,
        "f1_score": True,
        "latency": True
    },
    "initialization_times": init_times,
    "detailed_results": all_results,
    "question_type_performance": type_performance
}

# Convert numpy arrays and scalars to built-in types for JSON serialization
def convert_numpy(obj):
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, np.bool_):
        # np.bool_ is not JSON serializable and is not covered by np.integer
        return bool(obj)
    elif isinstance(obj, np.integer):
        return int(obj)
    elif isinstance(obj, np.floating):
        return float(obj)
    elif isinstance(obj, dict):
        return {k: convert_numpy(v) for k, v in obj.items()}
    elif isinstance(obj, (list, tuple)):
        return [convert_numpy(item) for item in obj]
    return obj

# Save JSON data
json_path = results_dir / f"rag_comparison_results_{timestamp}.json"
with open(json_path, 'w') as f:
    json.dump(convert_numpy(results_data), f, indent=2)

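print(f"📊 Results saved to: {json_path}")

# Save main figure. `fig` was rebound to the question-type figure in the
# previous cell, so regenerate the main comparison figure before saving it.
main_fig = create_rag_visualization(all_results, questions)
main_fig.savefig(results_dir / f"rag_comparison_visualization_{timestamp}.png",
                 dpi=300, bbox_inches='tight')
plt.close(main_fig)  # avoid displaying the regenerated figure a second time

# Save question type analysis figure (`fig` now refers to it); plt.savefig
# after plt.show() may write an empty canvas, so save via the figure object.
fig.savefig(results_dir / f"question_type_analysis_{timestamp}.png",
            dpi=300, bbox_inches='tight')

print(f"📈 Visualizations saved to: {results_dir}/")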

# Create summary CSV
summary_data = []
for system in systems:
    summary_data.append({
        "System": system,
        "Recall@1": np.mean(all_results[system]["recall_at_k"][1]),
        "Recall@3": np.mean(all_results[system]["recall_at_k"][3]),
        "Recall@5": np.mean(all_results[system]["recall_at_k"][5]),
        "Precision@5": np.mean(all_results[system]["precision_at_k"][5]),
        "Exact_Match": np.mean(all_results[system]["exact_matches"]),
        "F1_Score": np.mean(all_results[system]["f1_scores"]),
        "Latency_ms": np.mean(all_results[system]["latencies"]) * 1000,
        "Factual_Recall@5": type_performance[system]["factual"]["recall5"],
        "Factual_EM": type_performance[system]["factual"]["em"],
        "MultiHop_Recall@5": type_performance[system]["multihop"]["recall5"],
        "MultiHop_EM": type_performance[system]["multihop"]["em"]
    })

summary_df = pd.DataFrame(summary_data)
csv_path = results_dir / f"rag_summary_results_{timestamp}.csv"
summary_df.to_csv(csv_path, index=False)

print(f"📄 Summary CSV saved to: {csv_path}")
print("\n✅ All results saved successfully!")

In [None]:
# Download Results (for Colab users)
if IN_COLAB:
    print("📥 Preparing files for download...")
    
    # Create a zip file with all results
    import zipfile
    
    zip_path = f"dynamic_rag_comparison_results_{timestamp}.zip"
    
    with zipfile.ZipFile(zip_path, 'w') as zipf:
        # Add all files from results directory
        for file_path in results_dir.glob("*"):
            zipf.write(file_path, file_path.name)
        
        # Add the experiment script and this notebook when they exist;
        # neither may be present in a fresh Colab runtime
        source_dir = Path("experiments/colab_experiments/dynamic_rag_comparison")
        for source_name in ("dynamic_rag_experiment.py", "dynamic_rag_colab.ipynb"):
            source_path = source_dir / source_name
            if source_path.exists():
                zipf.write(source_path, source_name)
    
    print(f"📦 Created zip file: {zip_path}")
    
    # Download files
    from google.colab import files
    
    try:
        files.download(zip_path)
        print("✅ Download initiated! Check your browser's download folder.")
    except Exception:
        print("⚠️ Automatic download failed. You can manually download the files from the file browser.")
        print("📁 Available files:")
        !ls -la rag_comparison_results/
        !ls -la *.zip
else:
    print("📁 Results saved locally in the rag_comparison_results/ directory")
    print("📋 Available files:")
    !ls -la rag_comparison_results/

## 📦 Experiment Results Download

Download your experimental results for further analysis or sharing.

In [None]:
# Download Experiment Results
print("📦 Preparing experiment results for download...")

def create_downloadable_results():
    """Create a downloadable package of all experimental results"""
    import zipfile
    import json
    from datetime import datetime
    from pathlib import Path
    
    # Create download directory
    download_dir = Path("downloads")
    download_dir.mkdir(exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    zip_filename = f"rag_experiment_results_{timestamp}.zip"
    zip_path = download_dir / zip_filename
    
    print(f"📝 Creating results package: {zip_filename}")
    
    # Create comprehensive results package
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
        
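        # Add experiment results. The save cell above writes to
        # "rag_comparison_results", so include it alongside the project path.
        for results_dir in (Path("data/rag_experiments/results"),
                            Path("rag_comparison_results")):
            if not results_dir.exists():
                continue
            for file_path in results_dir.rglob("*"):
                if file_path.is_file():
                    arcname = f"results/{file_path.relative_to(results_dir)}"
                    zipf.write(file_path, arcname)
                    print(f"   📄 Added: {arcname}")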
        
        # Add visualizations
        viz_dir = Path("data/rag_experiments/visualizations")
        if viz_dir.exists():
            for file_path in viz_dir.rglob("*.png"):
                if file_path.is_file():
                    arcname = f"visualizations/{file_path.name}"
                    zipf.write(file_path, arcname)
                    print(f"   🖼️  Added: {arcname}")
        
        # Add baseline comparisons
        baselines_dir = Path("data/rag_experiments/baselines")
        if baselines_dir.exists():
            for baseline_dir in baselines_dir.iterdir():
                if baseline_dir.is_dir():
                    results_files = baseline_dir.rglob("*.json")
                    for file_path in results_files:
                        arcname = f"baselines/{baseline_dir.name}/{file_path.name}"
                        zipf.write(file_path, arcname)
                        print(f"   📊 Added: {arcname}")
        
        # Add experiment summary
        summary = {
            "experiment_type": "Dynamic RAG Comparison",
            "timestamp": timestamp,
            "notebook_version": "v1.0.0",
            "description": "Comparison of InsightSpike-AI dynamic RAG against baseline methods",
            "datasets": ["NaturalQuestions_sample", "HotpotQA_sample"],
            "methods_compared": ["BM25", "Static Embeddings", "DPR", "InsightSpike RAG"],
            "metrics": ["Recall@k", "Precision@k", "Exact Match", "F1 Score", "Latency"]
        }
        
        summary_path = download_dir / "experiment_summary.json"
        with open(summary_path, 'w') as f:
            json.dump(summary, f, indent=2)
        zipf.write(summary_path, "experiment_summary.json")
        
        print(f"   📋 Added: experiment_summary.json")
    
    file_size = zip_path.stat().st_size / (1024 * 1024)  # MB
    print(f"\n✅ Results package created successfully!")
    print(f"📦 File: {zip_path}")
    print(f"📏 Size: {file_size:.2f} MB")
    
    return zip_path

# Create and prepare results for download
if IN_COLAB:
    try:
        # Create downloadable package
        zip_path = create_downloadable_results()
        
        # Download in Colab
        from google.colab import files
        files.download(str(zip_path))
        print("⬇️  Download started in Colab!")
        
    except Exception as e:
        print(f"❌ Error creating download package: {e}")
        print("💡 You can manually download files from the file browser")
        
        # Show available files for manual download
        results_dir = Path("data/rag_experiments/results")
        if results_dir.exists():
            print(f"\n📋 Available result files:")
            for file_path in results_dir.rglob("*"):
                if file_path.is_file():
                    print(f"   📄 {file_path}")
else:
    # Local environment - just create the package
    zip_path = create_downloadable_results()
    print(f"💾 Results saved locally: {zip_path}")
    print("📁 Open the 'downloads' folder to access your results")

print(f"\n🎉 Experiment complete! Your results are ready for analysis.")