# Minimax Decoder - Full TruthfulQA Benchmark

**Goal**: Run SmolLM2-360M + Minimax on the full set of 817 TruthfulQA questions

**Requirements**:
- Kaggle GPU (T4) - Enable in Settings → Accelerator
- Google API Key for Gemini (free)

**Estimated time**: ~16 hours for full 817 questions
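
The estimate above works out to roughly 70 seconds per question. A rough sketch for scaling it to other `--limit` values, assuming runtime grows linearly with the number of questions (the helper `estimate_hours` is hypothetical and not part of the repository):

```python
# Rough runtime estimate, assuming time scales linearly with question count.
FULL_QUESTIONS = 817   # full TruthfulQA set, per this notebook
FULL_HOURS = 16.0      # estimated time for the full run, per this notebook

def estimate_hours(num_questions: int) -> float:
    """Scale the full-run estimate to a smaller --limit value."""
    return FULL_HOURS * num_questions / FULL_QUESTIONS

for n in (10, 300, 817):
    print(f"{n:>4} questions: ~{estimate_hours(n):.1f} h")
```

Actual throughput depends on the GPU and on the Gemini API latency, so treat these figures as ballpark values.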

## 1. Setup Environment

In [None]:
# Install dependencies
!pip install -q google-genai pydantic python-dotenv torch transformers accelerate groq huggingface-hub

In [None]:
# Check GPU
import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

In [None]:
# Set your API key (get free key at https://aistudio.google.com/apikey)
import os
os.environ["GOOGLE_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"  # <-- REPLACE THIS
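
Since the full run takes many hours, it may be worth confirming that the placeholder was actually replaced before starting. A minimal sketch (the helper `api_key_is_set` is hypothetical; it only checks that the value is non-empty and differs from the placeholder, not that the key is valid):

```python
import os

# Placeholder string used in the cell above
PLACEHOLDER = "YOUR_GEMINI_API_KEY_HERE"

def api_key_is_set(value):
    """Return True if value is non-empty and is not the placeholder."""
    if value is None:
        return False
    value = value.strip()
    return value != "" and value != PLACEHOLDER

if not api_key_is_set(os.environ.get("GOOGLE_API_KEY")):
    print("GOOGLE_API_KEY is missing or still the placeholder; set it before running the benchmark.")
```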

## 2. Clone Repository

In [None]:
# Clone the repo (update URL to your GitHub repo)
!git clone https://github.com/yourusername/minimax-decoder.git
%cd minimax-decoder

In [None]:
# Verify files
!ls -la

## 3. Quick Test (10 questions)

In [None]:
# Quick sanity check - run 10 questions first
!python benchmark.py -g smollm2-360m-local -a gemini-flash --limit 10

## 4. Full Benchmark (817 questions)

**WARNING**: This takes ~16 hours. Make sure:
- GPU is enabled
- The session won't time out (Kaggle allows background execution)
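
The commands in this notebook write to a `results/` directory. Whether `benchmark.py` creates that directory itself is not stated here, so creating it up front is a cheap precaution before a long run. A minimal sketch (the helper `ensure_output_dir` is hypothetical):

```python
from pathlib import Path

def ensure_output_dir(output_path):
    """Create the parent directory of output_path if it does not exist yet."""
    parent = Path(output_path).parent
    parent.mkdir(parents=True, exist_ok=True)
    return parent

# Path taken from the --output argument used in this notebook
ensure_output_dir("results/full_smollm2_minimax.json")
```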

In [None]:
# Run full benchmark - SmolLM2-360M + Minimax
!python benchmark.py -g smollm2-360m-local -a gemini-flash --limit 817 --output results/full_smollm2_minimax.json

## 5. Run Vanilla Baseline (for comparison)

In [None]:
# Run SmolLM2-360M vanilla (no Minimax) for comparison
!python benchmark.py -g smollm2-360m-local --vanilla-only --limit 817 --output results/full_smollm2_vanilla.json

In [None]:
# Run Qwen2.5-1.5B vanilla (larger model baseline)
!python benchmark.py -g qwen2.5-1.5b-local --vanilla-only --limit 817 --output results/full_qwen1.5b_vanilla.json

## 6. View Results

In [None]:
import json

def load_results(path):
    with open(path) as f:
        return json.load(f)

def print_summary(name, data):
    summary = data.get("summary", {})
    print(f"\n=== {name} ===")
    print(f"Total: {summary.get('total_questions', 'N/A')}")
    print(f"Truthful: {summary.get('truthful_rate', 'N/A')}")
    print(f"Hallucination: {summary.get('hallucination_rate', 'N/A')}")
    print(f"Abstention: {summary.get('abstention_rate', 'N/A')}")

In [None]:
# Load and display results; runs that have not finished yet are skipped with a message
runs = [
    ("SmolLM2-360M + Minimax", "results/full_smollm2_minimax.json"),
    ("SmolLM2-360M Vanilla", "results/full_smollm2_vanilla.json"),
    ("Qwen2.5-1.5B Vanilla", "results/full_qwen1.5b_vanilla.json"),
]

results = {}
for name, path in runs:
    try:
        results[name] = load_results(path)
        print_summary(name, results[name])
    except (FileNotFoundError, json.JSONDecodeError) as err:
        # Report the problem instead of silently swallowing every exception
        print(f"Skipping {name}: {err}")
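
To compare Minimax against a vanilla baseline directly, the per-metric differences can be computed from the `summary` fields. This is a sketch that assumes the rates are stored as numbers (for example fractions between 0 and 1); the actual output format of `benchmark.py` is not shown here, so non-numeric values are skipped. The helper `rate_deltas` is hypothetical.

```python
def rate_deltas(treated, baseline,
                keys=("truthful_rate", "hallucination_rate", "abstention_rate")):
    """Return treated - baseline for each summary key that is numeric in both runs."""
    t_sum = treated.get("summary", {})
    b_sum = baseline.get("summary", {})
    deltas = {}
    for key in keys:
        t, b = t_sum.get(key), b_sum.get(key)
        if isinstance(t, (int, float)) and isinstance(b, (int, float)):
            deltas[key] = t - b
    return deltas

# Made-up numbers for illustration only; these are NOT benchmark results
example_minimax = {"summary": {"truthful_rate": 0.5, "hallucination_rate": 0.25, "abstention_rate": 0.25}}
example_vanilla = {"summary": {"truthful_rate": 0.25, "hallucination_rate": 0.75, "abstention_rate": 0.0}}
print(rate_deltas(example_minimax, example_vanilla))
```

In practice, pass the dictionaries returned by `load_results` for the Minimax and vanilla runs instead of the example values.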

## 7. Download Results

In [None]:
# Create zip of results for download
!zip -r benchmark_results.zip results/
print("Download benchmark_results.zip from the Output tab")

## Alternative: Run 300 Questions (Faster)

If you don't have time for the full 817 questions, run 300 instead (~6 hours).

In [None]:
# 300 questions - good balance of speed and statistical power
!python benchmark.py -g smollm2-360m-local -a gemini-flash --limit 300 --output results/smollm2_minimax_300.json
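
To put "statistical power" in concrete terms, the sketch below computes the worst-case 95% margin of error for a measured rate, using the normal approximation for a proportion. This is a generic back-of-the-envelope formula, not something the benchmark itself reports, and the helper `margin_of_error` is hypothetical.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of the normal-approximation 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (300, 817):
    print(f"n={n}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

With 300 questions the margin is about 5.7 percentage points, versus about 3.4 for all 817, so differences between runs smaller than that are hard to distinguish from noise.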