# UMinFramework: Augmented LLM Benchmarking Demo

This notebook demonstrates the complete UMinFramework workflow for comparing baseline Large Language Models (LLMs) against **AugmentedLLMs** that integrate:

- 🔍 **Prompt Refinement**: Clarifies ambiguous prompts using fine-tuned models
- 🎯 **Uncertainty Quantification**: Real-time confidence monitoring during generation  
- 🔄 **Continuous Chain-of-Thought Backtracking**: Automatic reasoning injection when uncertainty is high
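
To make the uncertainty-triggered backtracking idea concrete, here is a minimal sketch. It is not the framework's implementation. It assumes uncertainty is measured as the normalized Shannon entropy of the next-token distribution and that backtracking fires when that value exceeds a threshold. The names `normalized_entropy` and `should_backtrack` are hypothetical, and the actual `UncertaintyHead` may use a different measure.

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a probability distribution, scaled to [0, 1]."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Dividing by log(vocabulary size) maps a uniform distribution to 1.0
    return entropy / math.log(len(probs))

def should_backtrack(probs, threshold=0.8):
    """Hypothetical trigger (an assumption): fire when uncertainty exceeds the threshold."""
    return normalized_entropy(probs) > threshold

confident = [0.97, 0.01, 0.01, 0.01]  # one token dominates
uncertain = [0.25, 0.25, 0.25, 0.25]  # model cannot choose between tokens
print(should_backtrack(confident), should_backtrack(uncertain))
```

Under this assumption, a lower threshold triggers backtracking more often, which is the trade-off that the `uncertainty_threshold` setting used later in the notebook would control.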

## What You'll Learn
- How to set up the UMinFramework environment
- How to download and prepare coding benchmark datasets (HumanEval, MBPP)
- How to run comparative benchmarks between baseline and augmented LLMs
- How to analyze results using pass@k metrics and detailed performance statistics

## Prerequisites
- Google Colab with GPU runtime (recommended)
- Basic understanding of Python and machine learning concepts


# 🚀 Environment Setup

First, let's set up our Colab environment with GPU acceleration and install all required dependencies.

In [None]:
# Check GPU availability
import torch
import sys

print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU device: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    print("⚠️ No GPU detected. Consider switching to GPU runtime for better performance.")
    print("Go to Runtime > Change runtime type > Hardware accelerator > GPU")

In [None]:
# Clone the UMinFramework repository
!git clone https://github.com/johanjohnthomas/UMinFramework.git
%cd UMinFramework

# List the project structure
!ls -la

In [None]:
# Install required dependencies
print("Installing Python dependencies...")
!pip install -r requirements.txt

# Install additional dependencies that might be needed
!pip install datasets evaluate accelerate

print("\n✅ Dependencies installed successfully!")

# Verify installation
try:
    import transformers
    import datasets
    print(f"Transformers version: {transformers.__version__}")
    print(f"Datasets version: {datasets.__version__}")
    print("\n🎉 All core dependencies are available!")
except ImportError as e:
    print(f"❌ Import error: {e}")

# 📊 Data Setup and Download

Let's download and prepare the benchmark datasets using the provided automation scripts.

In [None]:
# Download benchmark datasets
print("Downloading HumanEval, MBPP, and AskCQ datasets...")
!python scripts/download_datasets.py

print("\nDataset download completed!")
print("\nAvailable datasets:")
!ls -la data/

In [None]:
# Inspect the downloaded datasets
import json

def inspect_dataset(filename, dataset_name):
    """Display sample data from a dataset file."""
    try:
        with open(f"data/{filename}", 'r') as f:
            # For JSONL files, read first line
            if filename.endswith('.jsonl'):
                sample = json.loads(f.readline().strip())
            else:
                sample = json.load(f)
                if isinstance(sample, list) and len(sample) > 0:
                    sample = sample[0]
        
        print(f"\n📋 {dataset_name} Sample:")
        print("="*50)
        for key, value in sample.items():
            if isinstance(value, str) and len(value) > 100:
                print(f"{key}: {value[:100]}...")
            else:
                print(f"{key}: {value}")
        print("="*50)
        
    except Exception as e:
        print(f"❌ Error reading {filename}: {e}")

# Inspect each dataset
inspect_dataset("humaneval.jsonl", "HumanEval")
inspect_dataset("mbpp.jsonl", "MBPP")
inspect_dataset("askcq.jsonl", "AskCQ")

# 🧪 UMinFramework Components Demo

Let's explore the core components of the UMinFramework before running the full benchmark.
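
As background for the code execution sandbox, the sketch below shows one common way to run generated code with a time limit: a separate Python process with a timeout. This illustrates the general technique only. It is an assumption that `SafeCodeExecutor` works along similar lines, and `run_with_timeout` is a hypothetical helper, not part of the framework.

```python
import subprocess
import sys

def run_with_timeout(code, test_code="", timeout=5.0):
    """Run code plus optional tests in a separate Python process."""
    program = f"{code}\n{test_code}\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed once the time limit is exceeded
        return {"success": False, "output": "", "error": "timed out"}
    return {
        "success": proc.returncode == 0,  # a failed assert exits non-zero
        "output": proc.stdout,
        "error": proc.stderr,
    }

result = run_with_timeout(
    "def add(a, b):\n    return a + b",
    "assert add(2, 3) == 5",
)
print(result["success"])
```

Process isolation and a timeout guard against crashes and infinite loops, but on their own they do not restrict file system or network access. A real sandbox needs additional restrictions.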

In [None]:
# Import UMinFramework components
import sys
sys.path.append('src')

try:
    from umin_framework import (
        AugmentedLLM, 
        AugmentedLLMConfig,
        UncertaintyHead,
        GenerationLoop,
        SafeCodeExecutor
    )
    print("✅ UMinFramework components imported successfully!")
    
    # Display available components
    print("\n📦 Available Components:")
    print("- AugmentedLLM: Main wrapper class")
    print("- AugmentedLLMConfig: Configuration management")
    print("- UncertaintyHead: Token-level uncertainty quantification")
    print("- GenerationLoop: Uncertainty-aware generation with backtracking")
    print("- SafeCodeExecutor: Secure code execution sandbox")
    
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Please ensure the UMinFramework is properly installed.")

In [None]:
# Demo: Safe Code Execution
print("🔒 Safe Code Executor Demo")
print("="*40)

executor = SafeCodeExecutor(timeout=5.0)

# Test 1: Safe code
safe_code = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)

result = fibonacci(7)
print(f"Fibonacci(7) = {result}")
"""

test_code = "assert fibonacci(7) == 13"

print("Testing safe code:")
result = executor.execute(safe_code, test_code)
print(f"Success: {result.success}")
print(f"Output: {result.output}")
print(f"Execution time: {result.execution_time:.3f}s")

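# Test 2: Dangerous code (should be blocked)
print("\nTesting dangerous code:")
dangerous_code = "import os; os.system('rm -rf /')"
result = executor.execute(dangerous_code)
print(f"Success: {result.success}")
print(f"Error: {result.error}")
# Report the actual outcome instead of assuming the sandbox blocked the call
if result.success:
    print("\n❌ Dangerous code was NOT blocked!")
else:
    print("\n✅ Dangerous code successfully blocked!")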

In [None]:
# Demo: Basic AugmentedLLM Usage
print("🤖 AugmentedLLM Basic Demo")
print("="*40)

# Create configuration (without prompt refinement for this demo)
config = AugmentedLLMConfig(
    generation_model="mistralai/Mistral-7B-Instruct-v0.2",  # 7B-parameter model; expects a GPU runtime
    enable_prompt_refinement=False,  # Disable for this demo
    enable_uncertainty_monitoring=True,
    enable_backtracking=True,
    uncertainty_threshold=0.8,
    max_length=50,
    temperature=0.7
)

print(f"Configuration:")
print(f"- Model: {config.generation_model}")
print(f"- Uncertainty threshold: {config.uncertainty_threshold}")
print(f"- Backtracking enabled: {config.enable_backtracking}")
print(f"- Max length: {config.max_length}")

try:
    # Initialize AugmentedLLM
    print("\nInitializing AugmentedLLM...")
    augmented_llm = AugmentedLLM(config=config)
    print("✅ AugmentedLLM initialized successfully!")
    
    # Test generation
    test_prompt = "def add_two_numbers(a, b):"
    print(f"\nTest prompt: {test_prompt}")
    print("Generating...")
    
    result = augmented_llm.generate(test_prompt, return_metadata=True)
    
    print(f"\n📝 Generated code:")
    print(result['text'])
    print(f"\n📊 Metadata:")
    print(f"- Tokens generated: {result['generated_tokens']}")
    print(f"- Average uncertainty: {result['avg_uncertainty']:.3f}")
    print(f"- Backtrack events: {result['backtrack_events']}")
    
except Exception as e:
    print(f"❌ Error: {e}")
    print("This might be due to limited Colab resources, such as insufficient GPU memory.")

# 🔧 Prompt Refiner Training (Optional)

The UMinFramework includes a prompt refinement component that clarifies ambiguous prompts. The next cell checks for a trained refiner under `models/prompt_refiner`; if none is found, you can train one with the optional cell after it.

In [None]:
# Check if prompt refiner model exists
from pathlib import Path

refiner_path = Path("models/prompt_refiner")

if refiner_path.exists() and (refiner_path / "config.json").exists():
    print("✅ Prompt refiner model found!")
    print(f"Location: {refiner_path}")
    
    # List model files
    print("\nModel files:")
    for file in refiner_path.iterdir():
        if file.is_file():
            print(f"- {file.name}")
    
    prompt_refiner_available = True
else:
    print("❌ Prompt refiner model not found.")
    print("\nTo train the prompt refiner (takes 10-15 minutes):")
    print("Uncomment and run the training cell below.")
    prompt_refiner_available = False

print(f"\nPrompt refiner available: {prompt_refiner_available}")

In [None]:
# Uncomment to train the prompt refiner (optional - takes time)
# ⚠️ This will take 10-15 minutes and requires significant GPU memory

# print("🔧 Training Prompt Refiner...")
# print("This may take 10-15 minutes with GPU acceleration.")

# !python scripts/finetune_prompt_refiner.py \
#     --model_name google/flan-t5-small \
#     --output_dir models/prompt_refiner \
#     --num_train_epochs 3 \
#     --per_device_train_batch_size 8 \
#     --save_strategy epoch \
#     --logging_steps 10

# print("✅ Prompt refiner training completed!")

print("⚠️ Prompt refiner training skipped for this demo.")
print("The benchmark will run without prompt refinement.")
print("Uncomment the code above to train the prompt refiner.")

# 📈 Running the Benchmark

Now let's run the core benchmarking suite to compare baseline and augmented LLMs on coding tasks.

In [None]:
# Run baseline-only benchmark first (faster)
print("🏃‍♂️ Running Baseline-Only Benchmark")
print("="*50)
print("This will evaluate a baseline LLM on the coding datasets.")
print("Estimated time: 2-5 minutes\n")

!python scripts/run_benchmark.py \
    --baseline-model mistralai/Mistral-7B-Instruct-v0.2 \
    --data-path data \
    --output-dir results/colab_baseline \
    --no-augmented \
    --max-length 150 \
    --temperature 0.2 \
    --timeout 20 \
    --verbose

print("\n✅ Baseline benchmark completed!")

In [None]:
# Run augmented benchmark (if prompt refiner is available)
if prompt_refiner_available:
    print("🚀 Running Augmented LLM Benchmark")
    print("="*50)
    print("This will compare baseline vs AugmentedLLM with all features enabled.")
    print("Estimated time: 5-10 minutes\n")
    
    !python scripts/run_benchmark.py \
        --baseline-model mistralai/Mistral-7B-Instruct-v0.2 \
        --augmented-model mistralai/Mistral-7B-Instruct-v0.2 \
        --prompt-refiner-model models/prompt_refiner \
        --data-path data \
        --output-dir results/colab_augmented \
        --uncertainty-threshold 0.7 \
        --backtrack-window 3 \
        --max-length 150 \
        --temperature 0.2 \
        --timeout 20 \
        --verbose
    
    print("\n✅ Augmented benchmark completed!")
else:
    print("⚠️ No prompt refiner available")
    print("Running augmented benchmark without prompt refinement...")
    
    !python scripts/run_benchmark.py \
        --baseline-model mistralai/Mistral-7B-Instruct-v0.2 \
        --augmented-model mistralai/Mistral-7B-Instruct-v0.2 \
        --data-path data \
        --output-dir results/colab_augmented \
        --uncertainty-threshold 0.7 \
        --backtrack-window 3 \
        --max-length 150 \
        --temperature 0.2 \
        --timeout 20 \
        --verbose
    
    print("\n✅ Augmented benchmark (without prompt refinement) completed!")

# 📊 Results Analysis

Let's analyze the benchmark results and visualize the performance differences.
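
For reference when reading the pass@k numbers, the sketch below implements the unbiased pass@k estimator introduced with the HumanEval benchmark: with `n` samples per problem of which `c` pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether `run_benchmark.py` computes pass@k exactly this way is an assumption, and `pass_at_k` is an illustrative helper rather than a framework function.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, so the estimate is 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples for one problem, 3 of them pass
for k in (1, 5, 10):
    print(f"pass@{k}: {pass_at_k(n=10, c=3, k=k):.3f}")
```

With a single sample per problem (`n = 1`), pass@1 is either 0 or 1 for each problem, so its average over problems equals the plain pass rate.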

In [None]:
# Load and display benchmark results
import json
import pandas as pd
from pathlib import Path

def load_benchmark_results(results_dir):
    """Load benchmark results from JSON file."""
    results_file = Path(results_dir) / "benchmark_results.json"
    
    if not results_file.exists():
        print(f"❌ Results file not found: {results_file}")
        return None
    
    with open(results_file, 'r') as f:
        data = json.load(f)
    
    return data

def display_results_summary(data, title):
    """Display a summary of benchmark results."""
    print(f"\n{title}")
    print("="*len(title))
    
    if not data:
        print("No data available")
        return
    
    # Display metadata
    meta = data.get('metadata', {})
    print(f"📊 Evaluation Summary:")
    print(f"   Model: {meta.get('baseline_model', 'Unknown')}")
    print(f"   Total problems: {meta.get('total_results', 0)}")
    print(f"   Max length: {meta.get('max_length', 'N/A')}")
    print(f"   Temperature: {meta.get('temperature', 'N/A')}")
    
    # Display statistics
    stats = data.get('statistics', {})
    
    if 'baseline' in stats and stats['baseline']:
        baseline = stats['baseline']
        print(f"\n🤖 Baseline Results:")
        print(f"   Pass rate: {baseline.get('pass_rate', 0):.1%}")
        print(f"   Avg generation time: {baseline.get('avg_generation_time', 0):.2f}s")
        print(f"   Avg tokens: {baseline.get('avg_tokens', 0):.1f}")
    
    if 'augmented' in stats and stats['augmented']:
        augmented = stats['augmented']
        print(f"\n🚀 Augmented Results:")
        print(f"   Pass rate: {augmented.get('pass_rate', 0):.1%}")
        print(f"   Avg generation time: {augmented.get('avg_generation_time', 0):.2f}s")
        print(f"   Avg tokens: {augmented.get('avg_tokens', 0):.1f}")
        print(f"   Avg uncertainty: {augmented.get('avg_uncertainty', 0):.3f}")
        print(f"   Avg backtracks: {augmented.get('avg_backtrack_events', 0):.1f}")
    
    # Display pass@k metrics
    if 'pass_at_k' in data:
        pass_k = data['pass_at_k']
        
        # Use .get() so a missing 'overall' entry does not raise KeyError
        if (pass_k.get('baseline') or {}).get('overall'):
            print(f"\n📈 Baseline Pass@k:")
            for k, score in pass_k['baseline']['overall'].items():
                print(f"   Pass@{k}: {score:.3f}")
        
        if (pass_k.get('augmented') or {}).get('overall'):
            print(f"\n🎯 Augmented Pass@k:")
            for k, score in pass_k['augmented']['overall'].items():
                print(f"   Pass@{k}: {score:.3f}")

# Load baseline results
baseline_data = load_benchmark_results("results/colab_baseline")
display_results_summary(baseline_data, "📊 BASELINE RESULTS")

# Load augmented results (if available)
augmented_data = load_benchmark_results("results/colab_augmented")
if augmented_data:
    display_results_summary(augmented_data, "🚀 AUGMENTED RESULTS")

In [None]:
# Detailed analysis with visualizations
import matplotlib.pyplot as plt
import numpy as np

def create_comparison_chart(baseline_data, augmented_data):
    """Create comparison charts for baseline vs augmented results."""
    
    if not baseline_data or not augmented_data:
        print("❌ Cannot create comparison chart - missing data")
        return
    
    # Extract statistics
    baseline_stats = baseline_data.get('statistics', {}).get('baseline', {})
    augmented_stats = augmented_data.get('statistics', {}).get('augmented', {})
    
    if not baseline_stats or not augmented_stats:
        print("❌ Cannot create comparison chart - missing statistics")
        return
    
    # Create subplots
    fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(12, 8))
    fig.suptitle('Baseline vs Augmented LLM Performance Comparison', fontsize=16, fontweight='bold')
    
    # 1. Pass Rate Comparison
    models = ['Baseline', 'Augmented']
    pass_rates = [
        baseline_stats.get('pass_rate', 0) * 100,
        augmented_stats.get('pass_rate', 0) * 100
    ]
    
    bars1 = ax1.bar(models, pass_rates, color=['#ff7f7f', '#7fbf7f'])
    ax1.set_ylabel('Pass Rate (%)')
    ax1.set_title('Success Rate Comparison')
    ax1.set_ylim(0, max(pass_rates) * 1.2 if max(pass_rates) > 0 else 100)
    
    # Add value labels on bars
    for bar, rate in zip(bars1, pass_rates):
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2., height + 1,
                f'{rate:.1f}%', ha='center', va='bottom')
    
    # 2. Generation Time Comparison
    gen_times = [
        baseline_stats.get('avg_generation_time', 0),
        augmented_stats.get('avg_generation_time', 0)
    ]
    
    bars2 = ax2.bar(models, gen_times, color=['#ff7f7f', '#7fbf7f'])
    ax2.set_ylabel('Time (seconds)')
    ax2.set_title('Average Generation Time')
    
    for bar, time in zip(bars2, gen_times):
        height = bar.get_height()
        ax2.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                f'{time:.2f}s', ha='center', va='bottom')
    
    # 3. Token Count Comparison
    token_counts = [
        baseline_stats.get('avg_tokens', 0),
        augmented_stats.get('avg_tokens', 0)
    ]
    
    bars3 = ax3.bar(models, token_counts, color=['#ff7f7f', '#7fbf7f'])
    ax3.set_ylabel('Token Count')
    ax3.set_title('Average Tokens Generated')
    
    for bar, tokens in zip(bars3, token_counts):
        height = bar.get_height()
        ax3.text(bar.get_x() + bar.get_width()/2., height + 0.5,
                f'{tokens:.0f}', ha='center', va='bottom')
    
    # 4. Pass@k Comparison (if available)
    baseline_pass_k = baseline_data.get('pass_at_k', {}).get('baseline', {}).get('overall', {})
    augmented_pass_k = augmented_data.get('pass_at_k', {}).get('augmented', {}).get('overall', {})
    
    if baseline_pass_k and augmented_pass_k:
        k_values = list(baseline_pass_k.keys())
        baseline_scores = [baseline_pass_k[k] for k in k_values]
        augmented_scores = [augmented_pass_k.get(k, 0) for k in k_values]
        
        x = np.arange(len(k_values))
        width = 0.35
        
        ax4.bar(x - width/2, baseline_scores, width, label='Baseline', color='#ff7f7f')
        ax4.bar(x + width/2, augmented_scores, width, label='Augmented', color='#7fbf7f')
        
        ax4.set_xlabel('k')
        ax4.set_ylabel('Pass@k Score')
        ax4.set_title('Pass@k Metrics Comparison')
        ax4.set_xticks(x)
        ax4.set_xticklabels([f'k={k}' for k in k_values])
        ax4.legend()
        ax4.set_ylim(0, 1.1)
    else:
        ax4.text(0.5, 0.5, 'Pass@k data\nnot available', 
                ha='center', va='center', transform=ax4.transAxes)
        ax4.set_title('Pass@k Metrics Comparison')
    
    plt.tight_layout()
    plt.show()
    
    # Print improvement summary
    print("\n🎯 PERFORMANCE IMPROVEMENT SUMMARY")
    print("="*50)
    
    baseline_pass = baseline_stats.get('pass_rate', 0)
    augmented_pass = augmented_stats.get('pass_rate', 0)
    
    if baseline_pass > 0:
        # Relative change versus the baseline pass rate, not percentage points
        improvement = (augmented_pass - baseline_pass) / baseline_pass * 100
        print(f"Pass Rate Change (relative): {improvement:+.1f}%")
    else:
        print("Pass Rate Change (relative): Cannot calculate (baseline = 0)")
    
    baseline_time = baseline_stats.get('avg_generation_time', 0)
    augmented_time = augmented_stats.get('avg_generation_time', 0)
    
    if baseline_time > 0:
        time_change = (augmented_time - baseline_time) / baseline_time * 100
        print(f"Generation Time Change: {time_change:+.1f}%")
    
    print(f"Average Uncertainty Score: {augmented_stats.get('avg_uncertainty', 0):.3f}")
    print(f"Average Backtrack Events: {augmented_stats.get('avg_backtrack_events', 0):.1f}")

# Create comparison visualization
if baseline_data and augmented_data:
    create_comparison_chart(baseline_data, augmented_data)
else:
    print("⚠️ Cannot create comparison chart - missing benchmark data")
    print("Make sure both baseline and augmented benchmarks completed successfully.")

In [None]:
# Create detailed results table
def create_results_table(baseline_data, augmented_data):
    """Create a detailed comparison table."""
    
    # Load CSV files if available
    baseline_csv = Path("results/colab_baseline/benchmark_results.csv")
    augmented_csv = Path("results/colab_augmented/benchmark_results.csv")
    
    try:
        if baseline_csv.exists():
            baseline_df = pd.read_csv(baseline_csv)
            print("📋 BASELINE DETAILED RESULTS")
            print("="*40)
            print(baseline_df[['problem_id', 'dataset', 'baseline_passed', 
                             'baseline_generation_time', 'baseline_tokens_generated']].to_string(index=False))
        
        if augmented_csv.exists():
            augmented_df = pd.read_csv(augmented_csv)
            print("\n🚀 AUGMENTED DETAILED RESULTS")
            print("="*40)
            print(augmented_df[['problem_id', 'dataset', 'augmented_passed', 
                              'augmented_generation_time', 'augmented_tokens_generated',
                              'augmented_uncertainty_score', 'augmented_backtrack_events']].to_string(index=False))
            
            # Summary statistics
            print("\n📊 AUGMENTED MODEL INSIGHTS")
            print("="*40)
            
            problems_with_backtracks = augmented_df[augmented_df['augmented_backtrack_events'] > 0]
            print(f"Problems with backtracking: {len(problems_with_backtracks)} / {len(augmented_df)}")
            
            if len(problems_with_backtracks) > 0:
                avg_backtracks = problems_with_backtracks['augmented_backtrack_events'].mean()
                print(f"Average backtracks per problem (when > 0): {avg_backtracks:.1f}")
                
                # Count problems that passed and had at least one backtrack event
                # (this alone does not show that backtracking caused the pass)
                passed_with_backtracks = problems_with_backtracks[
                    problems_with_backtracks['augmented_passed'] == True
                ]
                print(f"Problems that passed with backtracking: {len(passed_with_backtracks)}")
    
    except Exception as e:
        print(f"❌ Error loading CSV files: {e}")
        print("Detailed tables not available.")

# Display detailed results
create_results_table(baseline_data, augmented_data)

# 🎉 Conclusions and Next Steps

Congratulations! You've successfully run the UMinFramework benchmark and compared baseline vs augmented LLM performance.

In [None]:
# Summary and next steps
print("🎯 EXPERIMENT SUMMARY")
print("="*50)

print("What we accomplished:")
print("✅ Set up the UMinFramework environment")
print("✅ Downloaded and prepared coding benchmark datasets")
print("✅ Demonstrated safe code execution sandbox")
print("✅ Ran baseline LLM benchmark")

if augmented_data:
    print("✅ Ran AugmentedLLM benchmark with uncertainty monitoring and backtracking")
    print("✅ Compared performance using pass@k metrics")
    print("✅ Analyzed uncertainty quantification and backtracking behavior")
else:
    print("⚠️ AugmentedLLM benchmark not completed")

print("\n🔬 KEY INSIGHTS")
print("="*30)

if baseline_data and augmented_data:
    baseline_stats = baseline_data.get('statistics', {}).get('baseline', {})
    augmented_stats = augmented_data.get('statistics', {}).get('augmented', {})
    
    if baseline_stats and augmented_stats:
        baseline_pass = baseline_stats.get('pass_rate', 0)
        augmented_pass = augmented_stats.get('pass_rate', 0)
        
        print(f"• Baseline model pass rate: {baseline_pass:.1%}")
        print(f"• Augmented model pass rate: {augmented_pass:.1%}")
        
        if augmented_pass > baseline_pass:
            print(f"• 🎉 AugmentedLLM showed improvement!")
        elif augmented_pass < baseline_pass:
            print(f"• 📝 AugmentedLLM scored below the baseline (may need tuning)")
        else:
            print(f"• 📊 Similar performance between models")
        
        avg_uncertainty = augmented_stats.get('avg_uncertainty', 0)
        avg_backtracks = augmented_stats.get('avg_backtrack_events', 0)
        
        print(f"• Average uncertainty score: {avg_uncertainty:.3f}")
        print(f"• Average backtrack events: {avg_backtracks:.1f}")
        
        if avg_backtracks > 0:
            print(f"• 🔄 Backtracking mechanism activated during generation")
        else:
            print(f"• 📊 No backtracking triggered (uncertainty below threshold)")

print("\n🚀 NEXT STEPS")
print("="*20)
print("1. 🔧 Experiment with different uncertainty thresholds")
print("2. 🎯 Try larger, more capable models (e.g., meta-llama/Meta-Llama-3.1-8B-Instruct)")
print("3. 📚 Train the prompt refiner for better prompt clarification")
print("4. 📊 Run on larger datasets for more statistically significant results")
print("5. 🔬 Analyze specific cases where backtracking helped or hindered")

print("\n💡 RESEARCH IDEAS")
print("="*25)
print("• Compare different uncertainty quantification methods")
print("• Experiment with various Chain-of-Thought prompt templates")
print("• Test on different domains (math, reasoning, creative writing)")
print("• Investigate the relationship between model size and uncertainty calibration")
print("• Study the impact of different backtracking window sizes")

print("\n📄 REPRODUCIBILITY")
print("="*25)
print("All results have been saved to the results/ directory:")
print("• JSON files with full details and metadata")
print("• CSV files for easy analysis in spreadsheet tools")
print("• Pass@k metrics in the JSON output")
print("• Detailed logs for debugging and analysis")

print("\n🎓 Thank you for exploring the UMinFramework!")
print("   Visit the repository for more examples and documentation.")

# 💾 Export Results

Download your results for further analysis or sharing.

In [None]:
# Prepare results for download
import zipfile
from datetime import datetime

# Create a zip file with all results
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
zip_filename = f"UMinFramework_Results_{timestamp}.zip"

with zipfile.ZipFile(zip_filename, 'w') as zipf:
    # Add baseline results if available
    baseline_dir = Path("results/colab_baseline")
    if baseline_dir.exists():
        for file in baseline_dir.glob("*"):
            if file.is_file():
                zipf.write(file, f"baseline/{file.name}")
    
    # Add augmented results if available
    augmented_dir = Path("results/colab_augmented")
    if augmented_dir.exists():
        for file in augmented_dir.glob("*"):
            if file.is_file():
                zipf.write(file, f"augmented/{file.name}")
    
    # Add this notebook (if you want to save the executed version)
    # Note: In Colab, you might want to download the notebook separately

print(f"📦 Results packaged in: {zip_filename}")
print(f"📁 File size: {Path(zip_filename).stat().st_size / 1024:.1f} KB")

# In Colab, trigger a browser download of the archive
try:
    from google.colab import files
    files.download(zip_filename)
    print("⬇️ Download started!")
except ImportError:
    # google.colab is only importable inside a Colab runtime
    print(f"💡 Not running in Colab; the archive is saved locally as {zip_filename}")