# Minimax Decoder - HaluEval Benchmark

**Goal**: Run SmolLM2-360M + Minimax on HaluEval QA

**The full dataset (10,000 samples) is included; the benchmark randomly samples 500 of them.**

## 1. Setup

In [None]:
# Install dependencies
!pip install -q google-genai pydantic python-dotenv torch transformers accelerate groq huggingface-hub

In [None]:
# Check GPU
import torch
print(f"GPU: {torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'CPU only'}")

In [None]:
# Set API key
import os
os.environ["GOOGLE_API_KEY"] = "YOUR_GEMINI_API_KEY_HERE"  # <-- REPLACE
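Hardcoding a key in a notebook makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, the helper below reuses a key already present in the environment and only prompts when the variable is unset or still holds the placeholder. The function name `ensure_api_key` is hypothetical and is not part of the repo.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="GOOGLE_API_KEY",
                   placeholder="YOUR_GEMINI_API_KEY_HERE"):
    """Return the API key, prompting only if it is missing or a placeholder.

    Hypothetical helper: the names and behavior are illustrative assumptions.
    """
    key = os.environ.get(var_name, "")
    if not key or key == placeholder:
        # getpass hides the typed value so it does not end up in cell output
        key = getpass(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` in place of the assignment above leaves `GOOGLE_API_KEY` set for the later `benchmark.py` commands, which inherit the notebook's environment.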

## 2. Clone Repo

In [None]:
# Clone repo (HaluEval data is already included!)
!git clone https://github.com/yourusername/minimax-decoder.git
%cd minimax-decoder

In [None]:
# Verify HaluEval data exists (full 10,000 samples)
# Note: wc -l counts physical lines, so a header row (if present) is included in the total
!wc -l data/HaluEval_QA.csv
!head -2 data/HaluEval_QA.csv
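`wc -l` can overcount when a quoted CSV field contains a line break, since it counts physical lines rather than records. A rough cross-check, assuming the file has a header row and standard CSV quoting, is to count records with Python's `csv` module. The helper name `count_csv_records` is an illustration, not something the repo provides.

```python
import csv


def count_csv_records(lines):
    """Return (header, number_of_data_records) for an iterable of CSV lines.

    Assumes the first record is a header. Works with an open file object
    or any iterable of strings.
    """
    reader = csv.reader(lines)
    header = next(reader, None)
    # Each item from the reader is one logical record, even if it spans lines
    return header, sum(1 for _ in reader)


# Usage sketch (path taken from the cell above):
# with open("data/HaluEval_QA.csv", newline="", encoding="utf-8") as f:
#     header, n = count_csv_records(f)
#     print(header, n)
```

If the record count differs from the `wc -l` figure by more than the header line, some fields probably contain embedded newlines.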

## 3. Quick Test (10 random questions)

In [None]:
# Test with 10 random questions
!python benchmark.py -g smollm2-360m-local -a gemini-flash --data data/HaluEval_QA.csv --sample 10

## 4. Full HaluEval Benchmark (500 random samples)

**Estimated time: ~2-3 hours**

In [None]:
# Run 500 random samples with SmolLM2 + Minimax
# --sample 500 randomly selects 500 rows from the full 10,000-sample dataset
# --seed 42 fixes the random selection so the run is reproducible
!python benchmark.py -g smollm2-360m-local -a gemini-flash \
    --data data/HaluEval_QA.csv \
    --sample 500 \
    --seed 42 \
    --output results/halueval_smollm2_minimax.json
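The internals of `benchmark.py` are not shown in this notebook, so the following is only a sketch of how a `--sample`/`--seed` pair could select rows reproducibly. It assumes a dedicated `random.Random` instance seeded with the given value; the function name `sample_indices` is hypothetical.

```python
import random


def sample_indices(n_total, n_sample, seed):
    """Pick n_sample distinct row indices from range(n_total), reproducibly.

    Illustrative assumption only: the real script may sample differently.
    """
    rng = random.Random(seed)  # local generator, does not touch global state
    return sorted(rng.sample(range(n_total), n_sample))


first = sample_indices(10_000, 500, seed=42)
second = sample_indices(10_000, 500, seed=42)
print(first == second)  # same seed, same selection
```

Under this scheme, two runs that share `--sample` and `--seed` evaluate the same questions, which is what makes a comparison between runs meaningful.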

In [None]:
# Vanilla baseline (no Minimax): same --sample and --seed, so the same 500 samples
!python benchmark.py -g smollm2-360m-local --vanilla-only \
    --data data/HaluEval_QA.csv \
    --sample 500 \
    --seed 42 \
    --output results/halueval_smollm2_vanilla.json

## 5. Results

In [None]:
import json

def show_results(path, name):
    """Print the headline rates stored in a benchmark results JSON file."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        # The run has not produced its output file yet
        print(f"{name}: Not found yet")
        return

    m = data.get("metrics", {})
    print(f"\n=== {name} ===")
    print(f"Questions: {m.get('total_questions', 'N/A')}")
    # Rates are stored as fractions in [0, 1]; display them as percentages
    if "minimax" in m:
        print(f"Truthful: {m['minimax']['truthful_rate']*100:.1f}%")
        print(f"Hallucination: {m['minimax']['hallucination_rate']*100:.1f}%")
        print(f"Abstention: {m['minimax']['abstention_rate']*100:.1f}%")
    if "vanilla" in m:
        print(f"Vanilla Truthful: {m['vanilla']['truthful_rate']*100:.1f}%")
        print(f"Vanilla Hallucination: {m['vanilla']['hallucination_rate']*100:.1f}%")

show_results("results/halueval_smollm2_minimax.json", "HaluEval - SmolLM2 + Minimax")
show_results("results/halueval_smollm2_vanilla.json", "HaluEval - SmolLM2 Vanilla")
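To put the two runs side by side, a small helper can report the change in a rate in percentage points. This is a sketch that assumes the JSON layout read by `show_results` (a `metrics` object with `minimax` and `vanilla` entries holding fractional rates); `load_metrics` and `rate_delta` are hypothetical names.

```python
import json


def load_metrics(path):
    """Return the "metrics" object from a results file (empty dict if absent)."""
    with open(path) as f:
        return json.load(f).get("metrics", {})


def rate_delta(minimax, vanilla, key="hallucination_rate"):
    """Minimax rate minus vanilla rate, in percentage points."""
    return (minimax[key] - vanilla[key]) * 100


# Usage sketch, once both runs have finished:
# mm = load_metrics("results/halueval_smollm2_minimax.json")["minimax"]
# va = load_metrics("results/halueval_smollm2_vanilla.json")["vanilla"]
# print(f"Hallucination change: {rate_delta(mm, va):+.1f} pp")
```

A negative value for `hallucination_rate` means Minimax hallucinated less often than the vanilla baseline on the same samples.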

## 6. Download Results

In [None]:
!zip -r halueval_results.zip results/
print("Download halueval_results.zip from the Output tab")