<a href="https://colab.research.google.com/github/peremartra/llama-glu-expansion-pruning/blob/main/notebooks/02_Evaluate_3B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# GLU Pruning Research - Llama-3.2-3B Evaluation
## 02 - Comprehensive Benchmark Suite Evaluation

### Exploring GLU Expansion Ratios in Llama-3.2 Models
by [Pere Martra](https://github.com/peremartra)

[![Paper](https://img.shields.io/badge/OSF-Paper-blue?logo=osf&logoColor=white)](https://doi.org/10.31219/osf.io/qgxea)
[![GitHub](https://img.shields.io/badge/⭐_Star-OptiPFair-orange?logo=github&logoColor=white)](https://github.com/peremartra/optipfair)
[![PyPI](https://img.shields.io/pypi/v/optipfair?logo=python&logoColor=white&label=v)](https://pypi.org/project/optipfair/)

**Repository:** [github.com/peremartra/llama-glu-expansion-pruning](https://github.com/peremartra/llama-glu-expansion-pruning)

---

**Colab Environment:** GPU L4 (or T4)

**Models to Evaluate:**
* Llama-3.2-3B (base) - Baseline
* Llama-3.2-3B-pruned-10% (% expansion)
* Llama-3.2-3B-pruned-20% (220% expansion)
* Llama-3.2-3B-pruned-30% (% expansion)
* Llama-3.2-3B-pruned-40% (140% expansion) ⭐ Star model
* Llama-3.2-3B-pruned-50% (% expansion)
* Llama-3.2-3B-pruned-60% (60% expansion)


**Benchmarks (10 listed here; the run below reports 13 tasks in `BENCHMARKS_BASE`):**
* WikiText-2 Perplexity (0-shot)
* BoolQ (0-shot)
* Lambada-OpenAI (0-shot)
* MMLU (5-shot)
* ARC-Challenge (0-shot)
* HellaSwag (0-shot)
* WinoGrande (0-shot)
* PIQA (0-shot)
* TruthfulQA MC1/MC2 (0-shot)
* GSM8K (5-shot CoT)
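
The WikiText-2 task reports word perplexity, byte perplexity, and bits per byte. The last two carry the same information, since bits per byte is the base-2 logarithm of byte perplexity. A minimal sketch of that conversion, using byte perplexities that appear in this notebook's logs (the function name is illustrative and not part of lm-eval):

```python
import math


def bits_per_byte_from_byte_perplexity(byte_perplexity: float) -> float:
    """Convert byte-level perplexity to bits per byte (log base 2)."""
    return math.log2(byte_perplexity)


# Byte perplexities logged in this notebook for the baseline
# and for the 50% pruned model.
logged = {"baseline": 1.5163, "pruned_50pct": 2.2411}

for name, byte_ppl in logged.items():
    bpb = bits_per_byte_from_byte_perplexity(byte_ppl)
    print(f"{name}: {bpb:.4f} bits/byte")
```

The printed values should agree with the `bits_per_byte,none` entries in the logs up to rounding, which is a quick sanity check when comparing pruning levels.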

**Estimated Runtime:** ~4-5 hours total

---

## 📋 Notebook Objective

This notebook evaluates the Llama-3.2-3B model family at six pruning levels (10%, 20%, 30%, 40%, 50%, 60%), plus the unpruned baseline, to determine:

1. **Performance degradation patterns** across different pruning intensities
2. **Optimal expansion ratio** for GLU-MLP layers (hypothesis: 140%)
3. **Task-specific resilience** to pruning (knowledge vs. algorithmic tasks)
4. **Which models merit uploading to the Hugging Face Hub** for Phase 2

### Key Features:
- ✅ **Checkpoint/Resume Support:** Survives Colab disconnections
- ✅ **On-the-fly Pruning:** No need to pre-create models
- ✅ **Robust Error Handling:** Continues if individual benchmarks fail
- ✅ **Progress Tracking:** Live updates and detailed logging
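
The checkpoint/resume and error-handling behaviour listed above can be sketched roughly as follows. This is not the code in `utils.py`: `run_with_checkpoint`, `fake_evaluate`, and the JSON layout are hypothetical names chosen for illustration, and the sketch only assumes that results are written to disk after every task.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(tasks, evaluate, checkpoint_path):
    """Run each task once, saving results to JSON after every task."""
    path = Path(checkpoint_path)
    # Resume: load results from a previous session if the file exists.
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"results": {}, "failed": {}}

    for task in tasks:
        if task in state["results"]:
            continue  # already completed in an earlier session
        try:
            state["results"][task] = evaluate(task)
            state["failed"].pop(task, None)
        except Exception as exc:  # keep going if one benchmark fails
            state["failed"][task] = str(exc)
        # Persist after every task so a disconnect loses at most one task.
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(state, indent=2))
    return state


# Demo with a stand-in evaluator that fails once on "boolq".
demo_path = Path(tempfile.mkdtemp()) / "checkpoint.json"
calls = []


def fake_evaluate(task):
    calls.append(task)
    if task == "boolq" and calls.count("boolq") == 1:
        raise RuntimeError("simulated failure")
    return {"score": len(task)}


first = run_with_checkpoint(["wikitext", "boolq"], fake_evaluate, demo_path)
second = run_with_checkpoint(["wikitext", "boolq"], fake_evaluate, demo_path)
print(first["failed"], sorted(second["results"]))
```

On the second call only the failed task is retried; the completed one is read back from the checkpoint file, which is the property that makes a Colab disconnection cheap.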

### Results will answer:
- Does 40% pruning (140% expansion) truly outperform other levels?
- Which benchmarks are most sensitive to pruning?
- Should we upload non-star models to HF, or only the 40% version?

---

**Note:** This evaluation uses the MAW (Maximum Absolute Weight) neuron selection method, validated in Notebook 00 as the optimal approach for GLU architectures.
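
As a rough illustration of the idea behind MAW, the sketch below scores each GLU neuron by the largest absolute weight in its `gate_proj` and `up_proj` rows and keeps the highest-scoring neurons. It is a simplified toy in plain Python under that assumption; OptiPFair's actual scoring may differ in detail, and the function names here are hypothetical.

```python
def maw_scores(gate_rows, up_rows):
    """Score neuron i as max|gate_rows[i]| + max|up_rows[i]| (assumed MAW form)."""
    return [
        max(abs(w) for w in g) + max(abs(w) for w in u)
        for g, u in zip(gate_rows, up_rows)
    ]


def neurons_to_keep(gate_rows, up_rows, pruning_pct):
    """Return the sorted indices of the neurons that survive pruning."""
    scores = maw_scores(gate_rows, up_rows)
    n_keep = len(scores) - (len(scores) * pruning_pct) // 100
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy layer: 4 neurons, hidden size 3 (one row per neuron).
gate = [[0.1, -0.9, 0.2], [0.05, 0.02, -0.01], [0.4, 0.3, -0.5], [0.0, 0.1, 0.0]]
up = [[0.3, 0.1, -0.2], [0.01, -0.03, 0.02], [-0.6, 0.2, 0.1], [0.2, 0.0, -0.1]]

kept = neurons_to_keep(gate, up, pruning_pct=50)
print(kept)
```

In a real GLU block the surviving indices would be applied consistently to the rows of `gate_proj` and `up_proj` and to the matching columns of `down_proj`, so that the three matrices stay aligned.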

---

# 1. Setup & Installation

In [1]:
# Install required libraries
!pip install -q optipfair
!pip install -q lm-eval
!pip install -q langdetect

In [2]:
# Mount Google Drive for checkpoint persistence
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Download utils.py from GitHub repository
!wget -q https://raw.githubusercontent.com/peremartra/llama-glu-expansion-pruning/main/utils.py

# Verify download
import os
if os.path.exists('utils.py'):
    print("✅ utils.py downloaded successfully")
else:
    print("❌ Failed to download utils.py")

✅ utils.py downloaded successfully


In [4]:
# Import core libraries and utilities
import torch
import json
import pandas as pd
from datetime import datetime
from pathlib import Path

# Import our utility functions
from utils import (
    EXPERIMENT_CONFIG,
    BENCHMARKS_BASE,
    load_or_create_model,
    run_robust_evaluation,
    clear_gpu_cache,
    get_model_stats,
    format_results_table
)

print("✅ All imports successful")
print(f"📱 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

✅ All imports successful
📱 Device: GPU
   GPU: NVIDIA L4
   Memory: 23.8 GB


# 2. Configuration & Planning

This section filters the experiment configuration for 3B models and displays the evaluation plan.

In [5]:
# Filter configuration for 3B models only
models_3b = [
    config for config in EXPERIMENT_CONFIG
    if "3B" in config["base_model"] and "1B" not in config["base_model"]
]

CHECKPOINT_DIR = "/content/drive/MyDrive/glu_pruning/checkpoints/3b"
RESULTS_DIR = "/content/drive/MyDrive/glu_pruning/results"

# Define checkpoint paths for each model
checkpoint_paths = {
    "baseline": f"{CHECKPOINT_DIR}/llama_3.2_3b_baseline.json",
    "10pct": f"{CHECKPOINT_DIR}/llama_3.2_3b_pruned_10pct.json",
    "20pct": f"{CHECKPOINT_DIR}/llama_3.2_3b_pruned_20pct.json",
    "30pct": f"{CHECKPOINT_DIR}/llama_3.2_3b_pruned_30pct.json",
    "40pct": f"{CHECKPOINT_DIR}/llama_3.2_3b_pruned_40pct.json",
    "50pct": f"{CHECKPOINT_DIR}/llama_3.2_3b_pruned_50pct.json",
    "60pct": f"{CHECKPOINT_DIR}/llama_3.2_3b_pruned_60pct.json",
}

BASE_MODEL_ID = "meta-llama/Llama-3.2-3B"

In [6]:

print(f"\n{'='*70}")
print("📊 EVALUATION PLAN: Llama-3.2-3B Family")
print(f"{'='*70}\n")

print(f"Total models to evaluate: {len(models_3b) + 1}")  # +1 for base model
print(f"Benchmarks per model: {len(BENCHMARKS_BASE)}")
print(f"Total evaluations: {(len(models_3b) + 1) * len(BENCHMARKS_BASE)}")
print(f"Estimated runtime: ~4-5 hours\n")

# Display models table
print("Models to evaluate:")
print("-" * 70)
print(f"{'Model':<30} {'Pruning':<10} {'Star':<6}")
print("-" * 70)
print(f"{'Llama-3.2 (baseline)':<30} {'0%':<10} {'N/A':<6}")
for config in models_3b:
    model_name = config['hf_repo_id'].split('/')[-1]
    pruning = f"{config['pruning_pct']}%"
    star = "⭐ Yes" if config['is_star'] else "No"
    print(f"{model_name:<30} {pruning:<10} {star:<6}")
print("-" * 70)

# Display benchmarks
print("\nBenchmarks to run:")
print("-" * 70)
for i, task in enumerate(BENCHMARKS_BASE, 1):
    task_name = task['name']
    fewshot = f"{task['num_fewshot']}-shot"
    print(f"{i:2d}. {task_name:<25} {fewshot}")
print("-" * 70)

print("\n⚙️  Configuration:")
print(f"   - Neuron selection method: MAW (Maximum Absolute Weight)")
print(f"   - Checkpointing: Enabled (per-task granularity)")
print(f"   - Model creation: On-the-fly pruning (no pre-creation needed)")
print(f"   - Error handling: Skip failed tasks and continue\n")


📊 EVALUATION PLAN: Llama-3.2-3B Family

Total models to evaluate: 7
Benchmarks per model: 13
Total evaluations: 91
Estimated runtime: ~4-5 hours

Models to evaluate:
----------------------------------------------------------------------
Model                          Pruning    Star  
----------------------------------------------------------------------
Llama-3.2 (baseline)           0%         N/A   
Llama-3.2-3B-pruned-10pct      10%        ⭐ Yes 
Llama-3.2-3B-pruned-20pct      20%        No    
Llama-3.2-3B-pruned-30pct      30%        No    
Llama-3.2-3B-pruned-40pct      40%        No    
Llama-3.2-3B-pruned-50pct      50%        No    
Llama-3.2-3B-pruned-60pct      60%        No    
----------------------------------------------------------------------

Benchmarks to run:
----------------------------------------------------------------------
 1. wikitext                  0-shot
 2. boolq                     0-shot
 3. lambada_openai            0-shot
 4. mmlu                  

In [7]:

# Create directories if they don't exist
Path(CHECKPOINT_DIR).mkdir(parents=True, exist_ok=True)
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)

print(f"✅ Checkpoint directory: {CHECKPOINT_DIR}")
print(f"✅ Results directory: {RESULTS_DIR}")


print("\nCheckpoint files:")
for key, path in checkpoint_paths.items():
    exists = "✅ Exists" if Path(path).exists() else "🆕 New"
    print(f"   {key:<10}: {exists}")

✅ Checkpoint directory: /content/drive/MyDrive/glu_pruning/checkpoints/3b
✅ Results directory: /content/drive/MyDrive/glu_pruning/results

Checkpoint files:
   baseline  : ✅ Exists
   10pct     : ✅ Exists
   20pct     : ✅ Exists
   30pct     : ✅ Exists
   40pct     : ✅ Exists
   50pct     : 🆕 New
   60pct     : ✅ Exists


# 3. Baseline Evaluation

Evaluate the original Llama-3.2-3B model to establish performance baseline.

In [8]:
print(f"\n{'='*70}")
print("📊 PHASE 1: BASELINE EVALUATION")
print(f"{'='*70}\n")

# Load base model
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"Loading base model: {BASE_MODEL_ID}...")
base_model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL_ID,
    #dtype=torch.float16, #T4
    dtype=torch.bfloat16, #A100 #L4 <- Used for experiments
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("✅ Model loaded successfully")

# Display model statistics
base_stats = get_model_stats(base_model)
print(f"\n📈 Model Statistics:")
print(f"   Parameters: {base_stats['total_parameters']:,}")
print(f"   Size: {base_stats['size_gb']:.2f} GB")

# Run evaluation with checkpointing
baseline_results = run_robust_evaluation(
    model=base_model,
    tokenizer=tokenizer,
    tasks=BENCHMARKS_BASE,
    checkpoint_path=checkpoint_paths["baseline"],
    model_name="Llama-3.2-3B-baseline"
)

print(f"\n{'='*70}")
print("✅ BASELINE EVALUATION COMPLETED")
print(f"{'='*70}\n")

# Display results summary
print("Results Preview:")
print(format_results_table(baseline_results))

# Clear memory
del base_model
clear_gpu_cache()

print("\n🧹 Memory cleared, ready for pruned models")


📊 PHASE 1: BASELINE EVALUATION

Loading base model: meta-llama/Llama-3.2-3B...



✅ Model loaded successfully

📈 Model Statistics:
   Parameters: 3,212,749,824
   Size: 5.98 GB
📂 Found existing checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_baseline.json
✅ Loaded checkpoint. Completed: 13/13 tasks
   Pending: []
🎉 All tasks already completed!

✅ BASELINE EVALUATION COMPLETED

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext               9.2628               1.5163             0.6006      NaN      NaN        NaN             NaN           NaN                      NaN                      

# 4. Pruned Models Evaluation Loop

Evaluate the six pruned variants (10%, 20%, 30%, 40%, 50%, 60%) using on-the-fly pruning with OptiPFair.
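
The parameter reductions reported in the logs (for example 6.58% at 10% pruning) are smaller than the pruning percentage because only MLP neurons are removed, while attention and embedding weights stay intact. The sketch below reproduces the logged parameter counts. The hidden and intermediate sizes are assumptions about the Llama-3.2-3B configuration that are not stated in this notebook, the layer count is taken from the `28/28` pruning progress line, and rounding down the number of removed neurons is also an assumption.

```python
# Assumed Llama-3.2-3B dimensions (not stated in this notebook).
HIDDEN_SIZE = 3072
INTERMEDIATE_SIZE = 8192
NUM_LAYERS = 28  # matches the "28/28" pruning progress line
BASE_PARAMS = 3_212_749_824  # from the baseline log


def pruned_param_count(pruning_pct: int) -> int:
    """Estimate total parameters after pruning pruning_pct% of MLP neurons."""
    # Assumption: the number of removed neurons is rounded down.
    removed_per_layer = INTERMEDIATE_SIZE * pruning_pct // 100
    # Each neuron owns one row in gate_proj, one row in up_proj,
    # and one column in down_proj (no biases assumed).
    params_per_neuron = 3 * HIDDEN_SIZE
    return BASE_PARAMS - removed_per_layer * params_per_neuron * NUM_LAYERS


for pct in (10, 20, 30, 40, 50):
    total = pruned_param_count(pct)
    reduction = 100 * (BASE_PARAMS - total) / BASE_PARAMS
    print(f"{pct}%: {total:,} params ({reduction:.2f}% reduction)")
```

If the assumed dimensions are right, each line should match the `Pruned params` and `Reduction` values printed by the evaluation loop.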

In [9]:
print(f"\n{'='*70}")
print("📊 PHASE 2: PRUNED MODELS EVALUATION")
print(f"{'='*70}\n")

# Store all results for final comparison
all_results = {
    "baseline": baseline_results
}

# Evaluate each pruned model
for i, config in enumerate(models_3b, 1):
    model_name = config['hf_repo_id'].split('/')[-1]
    pruning_pct = config['pruning_pct']
    is_star = config['is_star']

    print(f"\n{'─'*70}")
    print(f"🔄 EVALUATING MODEL {i}/{len(models_3b)}: {model_name}{pruning_pct} ")
    print(f"   Pruning: {pruning_pct}% |  Star: {'⭐' if is_star else 'No'}")
    print(f"{'─'*70}\n")

    try:
        # Load or create model using utility function
        model, tokenizer, stats = load_or_create_model(config, device="auto")

        # Display model statistics
        print(f"\n📈 Model Statistics:")
        print(f"   Parameters: {stats['total_parameters']:,}")
        print(f"   Size: {stats['size_gb']:.2f} GB")
        if 'pruning_stats' in stats:
            print(f"   Reduction: {stats['pruning_stats']['percentage_reduction']:.2f}%")
        print(f"   Source: {stats['source']}\n")

        # Determine checkpoint key
        checkpoint_key = f"{pruning_pct}pct"

        # Run evaluation with checkpointing
        results = run_robust_evaluation(
            model=model,
            tokenizer=tokenizer,
            tasks=BENCHMARKS_BASE,
            checkpoint_path=checkpoint_paths[checkpoint_key],
            model_name=model_name
        )

        # Store results
        all_results[checkpoint_key] = results

        print(f"\n✅ {model_name}{pruning_pct} evaluation completed")
        print("\nResults Preview:")
        print(format_results_table(results))

        # Clear memory before next model
        del model
        clear_gpu_cache()

    except Exception as e:
        print(f"\n❌ ERROR evaluating {model_name}: {str(e)}")
        print("   Continuing with next model...\n")
        clear_gpu_cache()
        continue

print(f"\n{'='*70}")
print("✅ ALL PRUNED MODELS EVALUATED")
print(f"{'='*70}\n")


📊 PHASE 2: PRUNED MODELS EVALUATION


──────────────────────────────────────────────────────────────────────
🔄 EVALUATING MODEL 1/6: Llama-3.2-3B-pruned-10pct10 
   Pruning: 10% |  Star: ⭐
──────────────────────────────────────────────────────────────────────


Loading model: oopere/Llama-3.2-3B-pruned-10pct
  Base: meta-llama/Llama-3.2-3B
  Pruning: 10%
  Star model: ⭐ Yes

📥 Attempting to load from HF Hub: oopere/Llama-3.2-3B-pruned-10pct
⚠️  HF Hub load failed: oopere/Llama-3.2-3B-pruned-10pct is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `hf auth login` or by passing `token=<your_token>`
   Falling back to on-the-fly pruning...
🔧 Creating model via on-the-fly pruning...


`torch_dtype` is deprecated! Use `dtype` instead!


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✂️  Pruning with MAW method (10%)...


Pruning layers: 100%|██████████| 28/28 [00:15<00:00,  1.81it/s]


✅ Model created
   Original params: 3,212,749,824
   Pruned params: 3,001,408,512
   Reduction: 6.58%

📈 Model Statistics:
   Parameters: 3,001,408,512
   Size: 5.59 GB
   Reduction: 6.58%
   Source: on_the_fly_pruning

📂 Found existing checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_pruned_10pct.json
✅ Loaded checkpoint. Completed: 13/13 tasks
   Pending: []
🎉 All tasks already completed!

✅ Llama-3.2-3B-pruned-10pct10 evaluation completed

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext              11.87

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✂️  Pruning with MAW method (20%)...


Pruning layers: 100%|██████████| 28/28 [00:13<00:00,  2.11it/s]


✅ Model created
   Original params: 3,212,749,824
   Pruned params: 2,790,067,200
   Reduction: 13.16%

📈 Model Statistics:
   Parameters: 2,790,067,200
   Size: 5.20 GB
   Reduction: 13.16%
   Source: on_the_fly_pruning

📂 Found existing checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_pruned_20pct.json
✅ Loaded checkpoint. Completed: 13/13 tasks
   Pending: []
🎉 All tasks already completed!

✅ Llama-3.2-3B-pruned-20pct20 evaluation completed

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext              15.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✂️  Pruning with MAW method (30%)...


Pruning layers: 100%|██████████| 28/28 [00:11<00:00,  2.41it/s]


✅ Model created
   Original params: 3,212,749,824
   Pruned params: 2,578,725,888
   Reduction: 19.73%

📈 Model Statistics:
   Parameters: 2,578,725,888
   Size: 4.80 GB
   Reduction: 19.73%
   Source: on_the_fly_pruning

📂 Found existing checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_pruned_30pct.json
✅ Loaded checkpoint. Completed: 13/13 tasks
   Pending: []
🎉 All tasks already completed!

✅ Llama-3.2-3B-pruned-30pct30 evaluation completed

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext              23.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✂️  Pruning with MAW method (40%)...


Pruning layers: 100%|██████████| 28/28 [00:10<00:00,  2.74it/s]


✅ Model created
   Original params: 3,212,749,824
   Pruned params: 2,367,384,576
   Reduction: 26.31%

📈 Model Statistics:
   Parameters: 2,367,384,576
   Size: 4.41 GB
   Reduction: 26.31%
   Source: on_the_fly_pruning

📂 Found existing checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_pruned_40pct.json
✅ Loaded checkpoint. Completed: 13/13 tasks
   Pending: []
🎉 All tasks already completed!

✅ Llama-3.2-3B-pruned-40pct40 evaluation completed

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext              42.

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

✂️  Pruning with MAW method (50%)...


Pruning layers: 100%|██████████| 28/28 [00:08<00:00,  3.34it/s]


✅ Model created
   Original params: 3,212,749,824
   Pruned params: 2,155,785,216
   Reduction: 32.90%

📈 Model Statistics:
   Parameters: 2,155,785,216
   Size: 4.02 GB
   Reduction: 32.90%
   Source: on_the_fly_pruning

🆕 Creating new checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_pruned_50pct.json

🚀 Starting evaluation: 13 tasks remaining


[1/13] Evaluating: wikitext
──────────────────────────────────────────────────────────────────────





Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['wikitext'] (full dataset)
Few-shot config: {'wikitext': 0}






✅ wikitext completed and saved to checkpoint
   Results: {'word_perplexity,none': '74.8280', 'byte_perplexity,none': '2.2411', 'bits_per_byte,none': '1.1642'}

[2/13] Evaluating: boolq
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['boolq'] (full dataset)
Few-shot config: {'boolq': 0}





Running loglikelihood requests: 100%|██████████| 6540/6540 [02:03<00:00, 53.01it/s]


✅ boolq completed and saved to checkpoint
   Results: {'accuracy': '0.5119', 'acc_norm': 'N/A'}

[3/13] Evaluating: lambada_openai
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['lambada_openai'] (full dataset)
Few-shot config: {'lambada_openai': 0}



Running loglikelihood requests: 100%|██████████| 5153/5153 [03:08<00:00, 27.35it/s]


bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:08<00:00, 11.19it/s]


✅ lambada_openai completed and saved to checkpoint
   Results: {'perplexity': '240.72', 'word_perplexity': '0.00', 'bits_per_byte': '0.0000'}

[4/13] Evaluating: mmlu
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['mmlu'] (full dataset)
Few-shot config: {'mmlu': 5}




Generating test split:   0%|          | 0/198 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_psychology/test-00000-of-000(…):   0%|          | 0.00/92.8k [00:00<?, ?B/s]

high_school_psychology/validation-00000-(…):   0%|          | 0.00/15.2k [00:00<?, ?B/s]

high_school_psychology/dev-00000-of-0000(…):   0%|          | 0.00/5.18k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/545 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/60 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

us_foreign_policy/test-00000-of-00001.pa(…):   0%|          | 0.00/19.5k [00:00<?, ?B/s]

us_foreign_policy/validation-00000-of-00(…):   0%|          | 0.00/5.27k [00:00<?, ?B/s]

us_foreign_policy/dev-00000-of-00001.par(…):   0%|          | 0.00/4.22k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

econometrics/test-00000-of-00001.parquet:   0%|          | 0.00/24.5k [00:00<?, ?B/s]

econometrics/validation-00000-of-00001.p(…):   0%|          | 0.00/7.02k [00:00<?, ?B/s]

econometrics/dev-00000-of-00001.parquet:   0%|          | 0.00/4.54k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/114 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/12 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

public_relations/test-00000-of-00001.par(…):   0%|          | 0.00/20.6k [00:00<?, ?B/s]

public_relations/validation-00000-of-000(…):   0%|          | 0.00/6.45k [00:00<?, ?B/s]

public_relations/dev-00000-of-00001.parq(…):   0%|          | 0.00/4.43k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/110 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/12 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_government_and_politics/test(…):   0%|          | 0.00/40.2k [00:00<?, ?B/s]

high_school_government_and_politics/vali(…):   0%|          | 0.00/8.27k [00:00<?, ?B/s]

high_school_government_and_politics/dev-(…):   0%|          | 0.00/4.47k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/193 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/21 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

medical_genetics/test-00000-of-00001.par(…):   0%|          | 0.00/16.4k [00:00<?, ?B/s]

medical_genetics/validation-00000-of-000(…):   0%|          | 0.00/5.63k [00:00<?, ?B/s]

medical_genetics/dev-00000-of-00001.parq(…):   0%|          | 0.00/3.77k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

professional_medicine/test-00000-of-0000(…):   0%|          | 0.00/125k [00:00<?, ?B/s]

professional_medicine/validation-00000-o(…):   0%|          | 0.00/19.9k [00:00<?, ?B/s]

professional_medicine/dev-00000-of-00001(…):   0%|          | 0.00/8.45k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/272 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/31 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

global_facts/test-00000-of-00001.parquet:   0%|          | 0.00/11.5k [00:00<?, ?B/s]

global_facts/validation-00000-of-00001.p(…):   0%|          | 0.00/4.19k [00:00<?, ?B/s]

global_facts/dev-00000-of-00001.parquet:   0%|          | 0.00/3.58k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_medicine/test-00000-of-00001.par(…):   0%|          | 0.00/42.5k [00:00<?, ?B/s]

college_medicine/validation-00000-of-000(…):   0%|          | 0.00/8.99k [00:00<?, ?B/s]

college_medicine/dev-00000-of-00001.parq(…):   0%|          | 0.00/4.84k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/173 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

nutrition/test-00000-of-00001.parquet:   0%|          | 0.00/55.0k [00:00<?, ?B/s]

nutrition/validation-00000-of-00001.parq(…):   0%|          | 0.00/9.02k [00:00<?, ?B/s]

nutrition/dev-00000-of-00001.parquet:   0%|          | 0.00/4.99k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/306 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/33 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

virology/test-00000-of-00001.parquet:   0%|          | 0.00/27.3k [00:00<?, ?B/s]

virology/validation-00000-of-00001.parqu(…):   0%|          | 0.00/7.05k [00:00<?, ?B/s]

virology/dev-00000-of-00001.parquet:   0%|          | 0.00/3.87k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/166 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/18 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

professional_accounting/test-00000-of-00(…):   0%|          | 0.00/69.5k [00:00<?, ?B/s]

professional_accounting/validation-00000(…):   0%|          | 0.00/12.9k [00:00<?, ?B/s]

professional_accounting/dev-00000-of-000(…):   0%|          | 0.00/4.89k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/282 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/31 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

management/test-00000-of-00001.parquet:   0%|          | 0.00/14.7k [00:00<?, ?B/s]

management/validation-00000-of-00001.par(…):   0%|          | 0.00/4.50k [00:00<?, ?B/s]

management/dev-00000-of-00001.parquet:   0%|          | 0.00/3.61k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/103 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

clinical_knowledge/test-00000-of-00001.p(…):   0%|          | 0.00/40.5k [00:00<?, ?B/s]

clinical_knowledge/validation-00000-of-0(…):   0%|          | 0.00/7.48k [00:00<?, ?B/s]

clinical_knowledge/dev-00000-of-00001.pa(…):   0%|          | 0.00/3.67k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/265 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

miscellaneous/test-00000-of-00001.parque(…):   0%|          | 0.00/98.6k [00:00<?, ?B/s]

miscellaneous/validation-00000-of-00001.(…):   0%|          | 0.00/13.2k [00:00<?, ?B/s]

miscellaneous/dev-00000-of-00001.parquet:   0%|          | 0.00/3.37k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/783 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/86 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

human_aging/test-00000-of-00001.parquet:   0%|          | 0.00/31.2k [00:00<?, ?B/s]

human_aging/validation-00000-of-00001.pa(…):   0%|          | 0.00/6.28k [00:00<?, ?B/s]

human_aging/dev-00000-of-00001.parquet:   0%|          | 0.00/3.67k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/223 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

marketing/test-00000-of-00001.parquet:   0%|          | 0.00/37.3k [00:00<?, ?B/s]

marketing/validation-00000-of-00001.parq(…):   0%|          | 0.00/8.21k [00:00<?, ?B/s]

marketing/dev-00000-of-00001.parquet:   0%|          | 0.00/4.28k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/234 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/25 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

business_ethics/test-00000-of-00001.parq(…):   0%|          | 0.00/21.6k [00:00<?, ?B/s]

business_ethics/validation-00000-of-0000(…):   0%|          | 0.00/5.09k [00:00<?, ?B/s]

business_ethics/dev-00000-of-00001.parqu(…):   0%|          | 0.00/4.96k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

machine_learning/test-00000-of-00001.par(…):   0%|          | 0.00/19.7k [00:00<?, ?B/s]

machine_learning/validation-00000-of-000(…):   0%|          | 0.00/6.17k [00:00<?, ?B/s]

machine_learning/dev-00000-of-00001.parq(…):   0%|          | 0.00/5.25k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/112 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_physics/test-00000-of-00001.(…):   0%|          | 0.00/33.0k [00:00<?, ?B/s]

high_school_physics/validation-00000-of-(…):   0%|          | 0.00/7.96k [00:00<?, ?B/s]

high_school_physics/dev-00000-of-00001.p(…):   0%|          | 0.00/4.57k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/151 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/17 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_computer_science/test-00000-(…):   0%|          | 0.00/27.3k [00:00<?, ?B/s]

high_school_computer_science/validation-(…):   0%|          | 0.00/5.28k [00:00<?, ?B/s]

high_school_computer_science/dev-00000-o(…):   0%|          | 0.00/6.54k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

astronomy/test-00000-of-00001.parquet:   0%|          | 0.00/28.3k [00:00<?, ?B/s]

astronomy/validation-00000-of-00001.parq(…):   0%|          | 0.00/6.05k [00:00<?, ?B/s]

astronomy/dev-00000-of-00001.parquet:   0%|          | 0.00/4.94k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/152 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_chemistry/test-00000-of-00001.pa(…):   0%|          | 0.00/17.9k [00:00<?, ?B/s]

college_chemistry/validation-00000-of-00(…):   0%|          | 0.00/4.87k [00:00<?, ?B/s]

college_chemistry/dev-00000-of-00001.par(…):   0%|          | 0.00/4.04k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_biology/test-00000-of-00001.parq(…):   0%|          | 0.00/31.8k [00:00<?, ?B/s]

college_biology/validation-00000-of-0000(…):   0%|          | 0.00/6.90k [00:00<?, ?B/s]

college_biology/dev-00000-of-00001.parqu(…):   0%|          | 0.00/4.27k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/144 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

anatomy/test-00000-of-00001.parquet:   0%|          | 0.00/20.1k [00:00<?, ?B/s]

anatomy/validation-00000-of-00001.parque(…):   0%|          | 0.00/5.28k [00:00<?, ?B/s]

anatomy/dev-00000-of-00001.parquet:   0%|          | 0.00/3.50k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/135 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/14 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_physics/test-00000-of-00001.parq(…):   0%|          | 0.00/18.6k [00:00<?, ?B/s]

college_physics/validation-00000-of-0000(…):   0%|          | 0.00/6.39k [00:00<?, ?B/s]

college_physics/dev-00000-of-00001.parqu(…):   0%|          | 0.00/4.51k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/102 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_mathematics/test-00000-of-00001.(…):   0%|          | 0.00/16.6k [00:00<?, ?B/s]

college_mathematics/validation-00000-of-(…):   0%|          | 0.00/5.00k [00:00<?, ?B/s]

college_mathematics/dev-00000-of-00001.p(…):   0%|          | 0.00/5.16k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_statistics/test-00000-of-000(…):   0%|          | 0.00/58.0k [00:00<?, ?B/s]

high_school_statistics/validation-00000-(…):   0%|          | 0.00/10.9k [00:00<?, ?B/s]

high_school_statistics/dev-00000-of-0000(…):   0%|          | 0.00/6.07k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/216 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

conceptual_physics/test-00000-of-00001.p(…):   0%|          | 0.00/25.0k [00:00<?, ?B/s]

conceptual_physics/validation-00000-of-0(…):   0%|          | 0.00/5.98k [00:00<?, ?B/s]

conceptual_physics/dev-00000-of-00001.pa(…):   0%|          | 0.00/3.96k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/235 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/26 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

computer_security/test-00000-of-00001.pa(…):   0%|          | 0.00/19.1k [00:00<?, ?B/s]

computer_security/validation-00000-of-00(…):   0%|          | 0.00/6.67k [00:00<?, ?B/s]

computer_security/dev-00000-of-00001.par(…):   0%|          | 0.00/4.33k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

abstract_algebra/test-00000-of-00001.par(…):   0%|          | 0.00/9.96k [00:00<?, ?B/s]

abstract_algebra/validation-00000-of-000(…):   0%|          | 0.00/3.73k [00:00<?, ?B/s]

abstract_algebra/dev-00000-of-00001.parq(…):   0%|          | 0.00/3.45k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

electrical_engineering/test-00000-of-000(…):   0%|          | 0.00/17.6k [00:00<?, ?B/s]

electrical_engineering/validation-00000-(…):   0%|          | 0.00/5.08k [00:00<?, ?B/s]

electrical_engineering/dev-00000-of-0000(…):   0%|          | 0.00/4.08k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/145 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_chemistry/test-00000-of-0000(…):   0%|          | 0.00/33.3k [00:00<?, ?B/s]

high_school_chemistry/validation-00000-o(…):   0%|          | 0.00/8.31k [00:00<?, ?B/s]

high_school_chemistry/dev-00000-of-00001(…):   0%|          | 0.00/4.16k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/203 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_computer_science/test-00000-of-0(…):   0%|          | 0.00/28.1k [00:00<?, ?B/s]

college_computer_science/validation-0000(…):   0%|          | 0.00/6.25k [00:00<?, ?B/s]

college_computer_science/dev-00000-of-00(…):   0%|          | 0.00/6.81k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_biology/test-00000-of-00001.(…):   0%|          | 0.00/62.7k [00:00<?, ?B/s]

high_school_biology/validation-00000-of-(…):   0%|          | 0.00/10.6k [00:00<?, ?B/s]

high_school_biology/dev-00000-of-00001.p(…):   0%|          | 0.00/4.94k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/310 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/32 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

elementary_mathematics/test-00000-of-000(…):   0%|          | 0.00/41.1k [00:00<?, ?B/s]

elementary_mathematics/validation-00000-(…):   0%|          | 0.00/9.38k [00:00<?, ?B/s]

elementary_mathematics/dev-00000-of-0000(…):   0%|          | 0.00/4.55k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/378 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/41 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_mathematics/test-00000-of-00(…):   0%|          | 0.00/33.7k [00:00<?, ?B/s]

high_school_mathematics/validation-00000(…):   0%|          | 0.00/6.99k [00:00<?, ?B/s]

high_school_mathematics/dev-00000-of-000(…):   0%|          | 0.00/4.50k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/270 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]


✅ mmlu completed and saved to checkpoint
   Results: {'accuracy': '0.2555', 'acc_norm': 'N/A'}

[5/13] Evaluating: arc_challenge
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['arc_challenge'] (full dataset)
Few-shot config: {'arc_challenge': 0}



100%|██████████| 1172/1172 [00:01<00:00, 965.88it/s]
Running loglikelihood requests: 100%|██████████| 4687/4687 [02:37<00:00, 29.79it/s]


✅ arc_challenge completed and saved to checkpoint
   Results: {'accuracy': '0.2278', 'acc_norm': '0.2381'}

[6/13] Evaluating: hellaswag
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['hellaswag'] (full dataset)
Few-shot config: {'hellaswag': 0}



100%|██████████| 10042/10042 [00:04<00:00, 2275.55it/s]
Running loglikelihood requests: 100%|██████████| 40168/40168 [24:34<00:00, 27.24it/s]


✅ hellaswag completed and saved to checkpoint
   Results: {'accuracy': '0.3043', 'acc_norm': '0.3399'}

[7/13] Evaluating: winogrande
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['winogrande'] (full dataset)
Few-shot config: {'winogrande': 0}



100%|██████████| 1267/1267 [00:00<00:00, 93022.39it/s]
Running loglikelihood requests: 100%|██████████| 2534/2534 [01:32<00:00, 27.26it/s]


✅ winogrande completed and saved to checkpoint
   Results: {'accuracy': '0.4886', 'acc_norm': 'N/A'}

[8/13] Evaluating: piqa
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['piqa'] (full dataset)
Few-shot config: {'piqa': 0}



100%|██████████| 1838/1838 [00:01<00:00, 969.46it/s]
Running loglikelihood requests: 100%|██████████| 3676/3676 [02:13<00:00, 27.45it/s]


✅ piqa completed and saved to checkpoint
   Results: {'accuracy': '0.6088', 'acc_norm': '0.6045'}

[9/13] Evaluating: truthfulqa_mc1
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['truthfulqa_mc1'] (full dataset)
Few-shot config: {'truthfulqa_mc1': 0}



100%|██████████| 817/817 [00:01<00:00, 687.77it/s]
Running loglikelihood requests: 100%|██████████| 4114/4114 [02:34<00:00, 26.69it/s]


✅ truthfulqa_mc1 completed and saved to checkpoint
   Results: {'accuracy': '0.2472', 'acc_norm': 'N/A'}

[10/13] Evaluating: truthfulqa_mc2
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['truthfulqa_mc2'] (full dataset)
Few-shot config: {'truthfulqa_mc2': 0}



100%|██████████| 817/817 [00:01<00:00, 419.67it/s]
Running loglikelihood requests: 100%|██████████| 5882/5882 [03:40<00:00, 26.68it/s]


✅ truthfulqa_mc2 completed and saved to checkpoint
   Results: {'accuracy': '0.4391', 'acc_norm': 'N/A'}

[11/13] Evaluating: gsm8k
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['gsm8k'] (full dataset)
Few-shot config: {'gsm8k': 5}



100%|██████████| 1319/1319 [00:06<00:00, 204.16it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|██████████| 1319/1319 [2:11:19<00:00,  5.97s/it]


✅ gsm8k completed and saved to checkpoint
   Results: {'exact_match,strict-match': '0.0068', 'exact_match_stderr,strict-match': '0.0023', 'exact_match,flexible-extract': '0.0136', 'exact_match_stderr,flexible-extract': '0.0032'}

[12/13] Evaluating: ifeval
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['ifeval'] (full dataset)
Few-shot config: {'ifeval': 0}



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Downloaded punkt_tab on rank 0


100%|██████████| 541/541 [00:00<00:00, 90990.39it/s]
Running generate_until requests: 100%|██████████| 541/541 [5:55:46<00:00, 39.46s/it]


✅ ifeval completed and saved to checkpoint
   Results: {'prompt_level_strict_acc,none': '0.1627', 'prompt_level_strict_acc_stderr,none': '0.0159', 'inst_level_strict_acc,none': '0.2698', 'prompt_level_loose_acc,none': '0.1664', 'prompt_level_loose_acc_stderr,none': '0.0160', 'inst_level_loose_acc,none': '0.2806'}

[13/13] Evaluating: leaderboard_musr
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['leaderboard_musr'] (full dataset)
Few-shot config: {'leaderboard_musr': 0}



100%|██████████| 250/250 [00:00<00:00, 2086.08it/s]
100%|██████████| 256/256 [00:00<00:00, 2011.64it/s]
100%|██████████| 250/250 [00:00<00:00, 2026.36it/s]
Running loglikelihood requests: 100%|██████████| 2198/2198 [03:40<00:00,  9.95it/s]


✅ leaderboard_musr completed and saved to checkpoint
   Results: {'acc_norm,none': '0.3545', 'acc_norm_stderr,none': '0.0169'}

🎉 ALL TASKS COMPLETED!


✅ Llama-3.2-3B-pruned-50pct50 evaluation completed

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext              74.8280               2.2411             1.1642      NaN      NaN        NaN             NaN           NaN                      NaN                             NaN                          NaN                                 NaN                          NaN                

✂️  Pruning with MAW method (60%)...


Pruning layers: 100%|██████████| 28/28 [00:06<00:00,  4.08it/s]


✅ Model created
   Original params: 3,212,749,824
   Pruned params: 1,944,443,904
   Reduction: 39.48%

📈 Model Statistics:
   Parameters: 1,944,443,904
   Size: 3.62 GB
   Reduction: 39.48%
   Source: on_the_fly_pruning

🆕 Creating new checkpoint: /content/drive/MyDrive/glu_pruning/checkpoints/3b/llama_3.2_3b_pruned_60pct.json

🚀 Starting evaluation: 13 tasks remaining


[1/13] Evaluating: wikitext
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['wikitext'] (full dataset)
Few-shot config: {'wikitext': 0}




✅ wikitext completed and saved to checkpoint
   Results: {'word_perplexity,none': '162.4732', 'byte_perplexity,none': '2.5908', 'bits_per_byte,none': '1.3734'}

[2/13] Evaluating: boolq
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['boolq'] (full dataset)
Few-shot config: {'boolq': 0}



100%|██████████| 3270/3270 [00:01<00:00, 1775.38it/s]
Running loglikelihood requests: 100%|██████████| 6540/6540 [02:07<00:00, 51.10it/s]


✅ boolq completed and saved to checkpoint
   Results: {'accuracy': '0.5034', 'acc_norm': 'N/A'}

[3/13] Evaluating: lambada_openai
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['lambada_openai'] (full dataset)
Few-shot config: {'lambada_openai': 0}



100%|██████████| 5153/5153 [00:09<00:00, 541.22it/s]
Running loglikelihood requests: 100%|██████████| 5153/5153 [03:14<00:00, 26.49it/s]


bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:08<00:00, 12.39it/s]


✅ lambada_openai completed and saved to checkpoint
   Results: {'perplexity': '5960.46', 'word_perplexity': '0.00', 'bits_per_byte': '0.0000'}

[4/13] Evaluating: mmlu
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['mmlu'] (full dataset)
Few-shot config: {'mmlu': 5}



100%|██████████| 270/270 [00:02<00:00, 123.89it/s]
100%|██████████| 378/378 [00:02<00:00, 127.00it/s]
100%|██████████| 310/310 [00:02<00:00, 104.23it/s]
100%|██████████| 100/100 [00:00<00:00, 124.40it/s]
100%|██████████| 203/203 [00:01<00:00, 126.00it/s]
100%|██████████| 145/145 [00:01<00:00, 125.80it/s]
100%|██████████| 100/100 [00:00<00:00, 126.31it/s]
100%|██████████| 100/100 [00:00<00:00, 127.62it/s]
100%|██████████| 235/235 [00:01<00:00, 126.30it/s]
100%|██████████| 216/216 [00:01<00:00, 128.08it/s]
100%|██████████| 100/100 [00:00<00:00, 128.26it/s]
100%|██████████| 102/102 [00:00<00:00, 127.61it/s]
100%|██████████| 135/135 [00:01<00:00, 128.74it/s]
100%|██████████| 144/144 [00:01<00:00, 128.44it/s]
100%|██████████| 100/100 [00:00<00:00, 127.25it/s]
100%|██████████| 152/152 [00:01<00:00, 129.31it/s]
100%|██████████| 100/100 [00:00<00:00, 129.23it/s]
100%|██████████| 151/151 [00:01<00:00, 127.72it/s]
100%|██████████| 112/112 [00:00<00:00, 127.95it/s]
100%|██████████| 100/100 [00:00

✅ mmlu completed and saved to checkpoint
   Results: {'accuracy': '0.2589', 'acc_norm': 'N/A'}

[5/13] Evaluating: arc_challenge
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['arc_challenge'] (full dataset)
Few-shot config: {'arc_challenge': 0}



100%|██████████| 1172/1172 [00:01<00:00, 1020.05it/s]
Running loglikelihood requests: 100%|██████████| 4687/4687 [02:39<00:00, 29.39it/s]


✅ arc_challenge completed and saved to checkpoint
   Results: {'accuracy': '0.1971', 'acc_norm': '0.2150'}

[6/13] Evaluating: hellaswag
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['hellaswag'] (full dataset)
Few-shot config: {'hellaswag': 0}



100%|██████████| 10042/10042 [00:04<00:00, 2274.21it/s]
Running loglikelihood requests: 100%|██████████| 40168/40168 [24:38<00:00, 27.17it/s]


✅ hellaswag completed and saved to checkpoint
   Results: {'accuracy': '0.2781', 'acc_norm': '0.2959'}

[7/13] Evaluating: winogrande
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['winogrande'] (full dataset)
Few-shot config: {'winogrande': 0}



100%|██████████| 1267/1267 [00:00<00:00, 83353.20it/s]
Running loglikelihood requests: 100%|██████████| 2534/2534 [01:32<00:00, 27.41it/s]


✅ winogrande completed and saved to checkpoint
   Results: {'accuracy': '0.4815', 'acc_norm': 'N/A'}

[8/13] Evaluating: piqa
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['piqa'] (full dataset)
Few-shot config: {'piqa': 0}



100%|██████████| 1838/1838 [00:01<00:00, 984.29it/s]
Running loglikelihood requests: 100%|██████████| 3676/3676 [02:12<00:00, 27.67it/s]


✅ piqa completed and saved to checkpoint
   Results: {'accuracy': '0.5696', 'acc_norm': '0.5539'}

[9/13] Evaluating: truthfulqa_mc1
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['truthfulqa_mc1'] (full dataset)
Few-shot config: {'truthfulqa_mc1': 0}



100%|██████████| 817/817 [00:01<00:00, 704.86it/s]
Running loglikelihood requests: 100%|██████████| 4114/4114 [02:33<00:00, 26.72it/s]


✅ truthfulqa_mc1 completed and saved to checkpoint
   Results: {'accuracy': '0.2387', 'acc_norm': 'N/A'}

[10/13] Evaluating: truthfulqa_mc2
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['truthfulqa_mc2'] (full dataset)
Few-shot config: {'truthfulqa_mc2': 0}



100%|██████████| 817/817 [00:01<00:00, 697.74it/s]
Running loglikelihood requests: 100%|██████████| 5882/5882 [03:42<00:00, 26.39it/s]


✅ truthfulqa_mc2 completed and saved to checkpoint
   Results: {'accuracy': '0.4574', 'acc_norm': 'N/A'}

[11/13] Evaluating: gsm8k
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['gsm8k'] (full dataset)
Few-shot config: {'gsm8k': 5}



100%|██████████| 1319/1319 [00:06<00:00, 209.45it/s]
Running generate_until requests: 100%|██████████| 1319/1319 [1:56:39<00:00,  5.31s/it]


✅ gsm8k completed and saved to checkpoint
   Results: {'exact_match,strict-match': '0.0106', 'exact_match_stderr,strict-match': '0.0028', 'exact_match,flexible-extract': '0.0227', 'exact_match_stderr,flexible-extract': '0.0041'}

[12/13] Evaluating: ifeval
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['ifeval'] (full dataset)
Few-shot config: {'ifeval': 0}



100%|██████████| 541/541 [00:00<00:00, 102263.21it/s]
Running generate_until requests: 100%|██████████| 541/541 [6:15:11<00:00, 41.61s/it]


✅ ifeval completed and saved to checkpoint
   Results: {'prompt_level_strict_acc,none': '0.1331', 'prompt_level_strict_acc_stderr,none': '0.0146', 'inst_level_strict_acc,none': '0.2398', 'prompt_level_loose_acc,none': '0.1386', 'prompt_level_loose_acc_stderr,none': '0.0149', 'inst_level_loose_acc,none': '0.2458'}

[13/13] Evaluating: leaderboard_musr
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['leaderboard_musr'] (full dataset)
Few-shot config: {'leaderboard_musr': 0}



100%|██████████| 250/250 [00:00<00:00, 2030.24it/s]
100%|██████████| 256/256 [00:00<00:00, 1989.31it/s]
100%|██████████| 250/250 [00:00<00:00, 2010.34it/s]
Running loglikelihood requests: 100%|██████████| 2198/2198 [03:54<00:00,  9.37it/s]


✅ leaderboard_musr completed and saved to checkpoint
   Results: {'acc_norm,none': '0.3598', 'acc_norm_stderr,none': '0.0172'}

🎉 ALL TASKS COMPLETED!


✅ Llama-3.2-3B-pruned-60pct60 evaluation completed

Results Preview:
            task word_perplexity,none byte_perplexity,none bits_per_byte,none accuracy acc_norm perplexity word_perplexity bits_per_byte exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none acc_norm,none acc_norm_stderr,none
        wikitext             162.4732               2.5908             1.3734      NaN      NaN        NaN             NaN           NaN                      NaN                             NaN                          NaN                                 NaN                          NaN                

# 5. Results Consolidation & Export

Consolidate all evaluation results and export to CSV for analysis.

In [10]:
# Import necessary libraries for this cell
import os
import json
import glob
import re
import pandas as pd
from datetime import datetime

print(f"\n{'='*70}")
print("📊 CONSOLIDATING RESULTS (DYNAMIC FILE-BASED)")
print(f"{'='*70}\n")

# --- Directory Setup ---
# Ensure CHECKPOINT_DIR is defined (it should be defined in a previous cell)
# This is where the individual JSON results are.
# Example: CHECKPOINT_DIR = "/content/drive/MyDrive/glu_pruning/checkpoints/3b"
if 'CHECKPOINT_DIR' not in globals():
    print("⚠️ Warning: CHECKPOINT_DIR not set. Using default './checkpoints/3b'")
    CHECKPOINT_DIR = "./checkpoints/3b"

# Ensure RESULTS_DIR is defined (it should be defined in a previous cell)
# This is where the consolidated CSV will be saved.
# Example: RESULTS_DIR = "/content/drive/MyDrive/glu_pruning/results"
if 'RESULTS_DIR' not in globals():
    print("⚠️ Warning: RESULTS_DIR not set. Using default './results'")
    RESULTS_DIR = "./results"
# --- End Directory Setup ---


# Prepare data for DataFrame
consolidated_data = []

# --- Dynamic Loading ---
# 1. Find all individual 3B model result files in CHECKPOINT_DIR
json_files = glob.glob(f"{CHECKPOINT_DIR}/llama_3.2_3b_*.json")

# 2. Exclude any aggregate/summary files
json_files = [
    f for f in json_files
    if "results" not in os.path.basename(f) and "complete" not in os.path.basename(f)
]

print(f"Searching for results in: {CHECKPOINT_DIR}")
print(f"Found {len(json_files)} individual result files to process:")

# 3. Process each model's result file
for json_path in sorted(json_files):
    print(f"  -> Processing: {os.path.basename(json_path)}")
    try:
        with open(json_path, 'r') as f:
            data = json.load(f)
    except Exception as e:
        print(f"    ⚠️ Warning: Could not read or parse file. Error: {e}")
        continue

    # Extract metadata and results
    metadata = data.get("metadata", {})
    model_name_from_file = metadata.get("model_name", "Unknown Model")

    results = data.get("results", {})
    if not results:
        print(f"    ⚠️ Warning: No 'results' found in file. Skipping.")
        continue

    # --- Dynamically derive info from metadata ---
    pruning_pct = 0
    is_star = False

    # Parse model name to get pruning percentage and display name
    if "baseline" in model_name_from_file:
        display_name = "Llama-3.2-3B"
        pruning_pct = 0
    else:
        # Use regex to find pruning percentage
        match = re.search(r'pruned-(\d+)pct', model_name_from_file)
        if match:
            pruning_pct = int(match.group(1))
            display_name = f"Llama-3.2-3B-pruned-{pruning_pct}%"
        else:
            display_name = model_name_from_file # Fallback

    # Star logic: the notebook header marks the 40% pruned model (140% expansion) as the star model
    if pruning_pct == 40:
        is_star = True
    # --- End dynamic info derivation ---

    # Process each task for this model
    for task_name, metrics in results.items():
        row = {
            "model": display_name,
            "pruning_pct": pruning_pct,
            "is_star": is_star,
            "task": task_name,
        }

        # Add all metrics from this task
        for metric_name, value in metrics.items():
            # Convert string values to float where possible
            try:
                row[metric_name] = float(value)
            except (ValueError, TypeError):
                row[metric_name] = value

        consolidated_data.append(row)

# --- End Dynamic Loading ---


📊 CONSOLIDATING RESULTS (DYNAMIC FILE-BASED)

Searching for results in: /content/drive/MyDrive/glu_pruning/checkpoints/3b
Found 1 individual result files to process:
  -> Processing: llama_3.2_3b_pruned_60pct.json
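
The consolidation loop above can be sketched as two small pure functions. The checkpoint layout used here (a `metadata.model_name` string plus a `results` mapping from task to metric dictionaries) is inferred from the keys the cell reads. The model-name format in `sample` is an assumption, the metric values are copied from the 60% run logged above, and the function names are illustrative rather than part of the project's `utils.py`.

```python
import re

def parse_pruning_pct(model_name):
    """Return the pruning percentage encoded in a model name, or 0 if none is found."""
    match = re.search(r"pruned-(\d+)pct", model_name)
    return int(match.group(1)) if match else 0

def flatten_checkpoint(data):
    """Turn one checkpoint dict into a list of flat rows, one per task."""
    model_name = data.get("metadata", {}).get("model_name", "Unknown Model")
    pruning_pct = parse_pruning_pct(model_name)
    rows = []
    for task, metrics in data.get("results", {}).items():
        row = {"model": model_name, "pruning_pct": pruning_pct, "task": task}
        for metric, value in metrics.items():
            # Keep non-numeric placeholders such as 'N/A' as strings
            try:
                row[metric] = float(value)
            except (ValueError, TypeError):
                row[metric] = value
        rows.append(row)
    return rows

sample = {
    "metadata": {"model_name": "Llama-3.2-3B-pruned-60pct"},  # assumed name format
    "results": {
        "boolq": {"accuracy": "0.5034", "acc_norm": "N/A"},
        "piqa": {"accuracy": "0.5696", "acc_norm": "0.5539"},
    },
}
rows = flatten_checkpoint(sample)
print(rows[0])
```

Keeping the parsing separate from the file I/O makes it possible to check the name pattern and the string-to-float conversion without access to the Drive checkpoints.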


In [11]:
# Create DataFrame
df = pd.DataFrame(consolidated_data)

# Sort by pruning_pct and then task to ensure consistent order
if not df.empty:
    df = df.sort_values(by=["pruning_pct", "task"]).reset_index(drop=True)

print(f"\n✅ Consolidated {len(df)} result rows")
if not df.empty:
    print(f"   Models: {df['model'].nunique()}")
    print(f"   Tasks: {df['task'].nunique()}")
    print(f"   Metrics per task: {len(df.columns) - 4}")  # Exclude metadata columns
else:
    print("   No data consolidated.")

# Display summary
print("\nDataFrame Preview:")
print(df.head(10))

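# Save to CSV (create RESULTS_DIR first; to_csv does not create missing directories)
os.makedirs(RESULTS_DIR, exist_ok=True)
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
csv_path = f"{RESULTS_DIR}/llama_3b_results_{timestamp}.csv"
df.to_csv(csv_path, index=False)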

print(f"\n💾 Results saved to: {csv_path}")

# Also save a "latest" version for easy access
latest_path = f"{RESULTS_DIR}/llama_3b_results_latest.csv"
df.to_csv(latest_path, index=False)
print(f"💾 Latest results: {latest_path}")

print(f"\n{'='*70}")
print("✅ EVALUATION COMPLETE - ALL RESULTS SAVED")
print(f"{'='*70}\n")


✅ Consolidated 13 result rows
   Models: 1
   Tasks: 13
   Metrics per task: 20

DataFrame Preview:
                     model  pruning_pct  is_star              task  \
0  Llama-3.2-3B-pruned-60%           60    False     arc_challenge   
1  Llama-3.2-3B-pruned-60%           60    False             boolq   
2  Llama-3.2-3B-pruned-60%           60    False             gsm8k   
3  Llama-3.2-3B-pruned-60%           60    False         hellaswag   
4  Llama-3.2-3B-pruned-60%           60    False            ifeval   
5  Llama-3.2-3B-pruned-60%           60    False    lambada_openai   
6  Llama-3.2-3B-pruned-60%           60    False  leaderboard_musr   
7  Llama-3.2-3B-pruned-60%           60    False              mmlu   
8  Llama-3.2-3B-pruned-60%           60    False              piqa   
9  Llama-3.2-3B-pruned-60%           60    False    truthfulqa_mc1   

   word_perplexity,none  byte_perplexity,none  bits_per_byte,none  accuracy  \
0                   NaN                   NaN    

OSError: Cannot save file into a non-existent directory: '/content/drive/MyDrive/glu_pruning/results'
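
The `OSError` above is raised because pandas does not create missing parent directories when writing a CSV. A minimal sketch of one way to guard against this, using only the standard library: `ensure_parent_dir` is a hypothetical helper (not part of the notebook's `utils.py`), and the temporary directory stands in for the Drive path.

```python
import tempfile
from pathlib import Path

def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return it as a Path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

# Demo in a temporary directory that stands in for RESULTS_DIR
with tempfile.TemporaryDirectory() as tmp:
    csv_path = ensure_parent_dir(Path(tmp) / "results" / "llama_3b_results_latest.csv")
    csv_path.write_text("model,task,accuracy\n")
    created = csv_path.exists()
print(created)
```

In the export cell, the same helper could wrap the target path, for example `df.to_csv(ensure_parent_dir(csv_path), index=False)`.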

# 6. Quick Analysis & Visualization

Generate quick insights to decide which models merit uploading to HuggingFace Hub.

In [None]:
print(f"\n{'='*70}")
print("📈 QUICK ANALYSIS: Performance vs. Pruning Level")
print(f"{'='*70}\n")

# This cell assumes the 'df' DataFrame was created in the previous cell.

# Calculate average performance degradation per model
# Focus on key metrics: accuracy for classification, perplexity for generation

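summary_metrics = []
# Defined up front so later cells can detect a missing baseline instead of raising NameError
baseline_acc = None
baseline_ppl = None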

# --- Dynamic Analysis ---
# Group by the model-level info we created in the previous cell
# This replaces the hardcoded 'model_info' dictionary
try:
    grouped = df.groupby(['model', 'pruning_pct', 'is_star'])
except (NameError, KeyError):
    print("❌ Error: 'df' DataFrame not found or is missing required columns.")
    print("   Please ensure the previous consolidation cell was run successfully.")
    # Fall back to an empty list so the rest of the cell still runs
    # (grouping an empty DataFrame by these columns would raise KeyError again)
    grouped = []

print(f"Analyzing {len(grouped)} unique models found in the DataFrame...")

for (model_name, pruning, is_star_bool), model_df in grouped:
    # 'model_df' is now the DataFrame for this specific model

    # Extract key metrics
    # 'accuracy' is reported by boolq, mmlu, etc.; .get() tolerates a missing column
    accuracies = model_df.get('accuracy', pd.Series(dtype=float)).dropna()
    # 'perplexity' is reported by lambada_openai
    # Note: wikitext 'word_perplexity,none' is not used here for simplicity
    perplexities = model_df.get('perplexity', pd.Series(dtype=float)).dropna()

    summary = {
        "model": model_name,
        "pruning": pruning,
        "star": "⭐" if is_star_bool else "",
        "avg_accuracy": accuracies.mean() if len(accuracies) > 0 else None,
        "avg_perplexity": perplexities.mean() if len(perplexities) > 0 else None,
        "num_tasks": len(model_df),
    }

    summary_metrics.append(summary)

# --- End Dynamic Analysis ---

if not summary_metrics:
    print("\nNo summary metrics to display. Skipping analysis.")
else:
    summary_df = pd.DataFrame(summary_metrics)
    # Sort by pruning percentage to ensure a logical order
    summary_df = summary_df.sort_values(by="pruning").reset_index(drop=True)

    print("\nPerformance Summary:")
    print("-" * 90)
    print(summary_df.to_string(index=False, float_format="%.4f"))
    print("-" * 90)

    # Calculate degradation vs baseline
    # Find baseline row (pruning == 0)
    baseline_row = summary_df.loc[summary_df['pruning'] == 0]

    if baseline_row.empty:
        print("\n⚠️ Baseline model (pruning=0) not found. Cannot calculate degradation.")
    else:
        baseline_acc = baseline_row['avg_accuracy'].values[0]
        baseline_ppl = baseline_row['avg_perplexity'].values[0]

        print(f"\nDegradation vs. Baseline (Acc: {baseline_acc:.4f}, PPL: {baseline_ppl:.2f}):")
        print("-" * 90)

        for _, row in summary_df.iterrows():
            if row['pruning'] == 0:
                continue

            # Missing averages are stored as NaN in the DataFrame, so test with pd.notna
            acc_delta_str = "N/A"
            if pd.notna(row['avg_accuracy']) and pd.notna(baseline_acc) and baseline_acc != 0:
                acc_delta = ((row['avg_accuracy'] - baseline_acc) / baseline_acc * 100)
                acc_delta_str = f"{acc_delta:+.2f}%"

            ppl_delta_str = "N/A"
            if pd.notna(row['avg_perplexity']) and pd.notna(baseline_ppl) and baseline_ppl != 0:
                ppl_delta = ((row['avg_perplexity'] - baseline_ppl) / baseline_ppl * 100)
                ppl_delta_str = f"{ppl_delta:+.2f}%"

            print(f"{row['model']:<35} {row['star']:>2}")
            print(f"   Accuracy:   {acc_delta_str}")
            print(f"   Perplexity: {ppl_delta_str}")
            print()

        print("-" * 90)
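
The degradation figures in the cell above are plain relative changes against the baseline average. A minimal standalone sketch of that calculation follows; the helper name `pct_change` and the input values are hypothetical. Missing averages in a pandas column typically surface as `NaN` rather than `None`, so the sketch checks for both.

```python
import math

def pct_change(value, baseline):
    """Relative change of value vs. baseline in percent; None when not computable."""
    for x in (value, baseline):
        if x is None or (isinstance(x, float) and math.isnan(x)):
            return None
    if baseline == 0:
        return None
    return (value - baseline) / baseline * 100

# Hypothetical averages, for illustration only
print(pct_change(0.40, 0.50))          # accuracy drop relative to the baseline
print(pct_change(float("nan"), 0.50))  # missing average
```

A negative result means the metric fell relative to the baseline. For perplexity, where lower is better, a positive result is the degradation.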

In [None]:
# Visualization: Performance across pruning levels
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Baseline reference values (None when no baseline row is present)
baseline_rows = summary_df[summary_df['pruning'] == 0]
baseline_acc = baseline_rows['avg_accuracy'].iloc[0] if not baseline_rows.empty else None
baseline_ppl = baseline_rows['avg_perplexity'].iloc[0] if not baseline_rows.empty else None

# Star model row, if one was evaluated
star_rows = summary_df[summary_df['star'] == '⭐']
star_idx = star_rows.index[0] if not star_rows.empty else None

# Accuracy plot
axes[0].plot(summary_df['pruning'], summary_df['avg_accuracy'], marker='o', linewidth=2, markersize=8)
if pd.notna(baseline_acc):
    axes[0].axhline(y=baseline_acc, color='r', linestyle='--', alpha=0.5, label='Baseline')
    axes[0].legend()
axes[0].set_xlabel('Pruning Level (%)', fontsize=12)
axes[0].set_ylabel('Average Accuracy', fontsize=12)
axes[0].set_title('Accuracy vs. Pruning Level', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Highlight star model
if star_idx is not None:
    axes[0].plot(summary_df.loc[star_idx, 'pruning'], summary_df.loc[star_idx, 'avg_accuracy'],
                 marker='*', markersize=20, color='gold', markeredgecolor='black', markeredgewidth=1.5)

# Perplexity plot
axes[1].plot(summary_df['pruning'], summary_df['avg_perplexity'], marker='o', linewidth=2, markersize=8, color='orange')
if pd.notna(baseline_ppl):
    axes[1].axhline(y=baseline_ppl, color='r', linestyle='--', alpha=0.5, label='Baseline')
    axes[1].legend()
axes[1].set_xlabel('Pruning Level (%)', fontsize=12)
axes[1].set_ylabel('Average Perplexity', fontsize=12)
axes[1].set_title('Perplexity vs. Pruning Level', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

# Highlight star model
if star_idx is not None:
    axes[1].plot(summary_df.loc[star_idx, 'pruning'], summary_df.loc[star_idx, 'avg_perplexity'],
                 marker='*', markersize=20, color='gold', markeredgecolor='black', markeredgewidth=1.5)

plt.tight_layout()
os.makedirs(RESULTS_DIR, exist_ok=True)  # savefig does not create missing directories
plt.savefig(f"{RESULTS_DIR}/llama_3b_performance_analysis.png", dpi=300, bbox_inches='tight')
plt.show()

print(f"\n📊 Visualization saved to: {RESULTS_DIR}/llama_3b_performance_analysis.png")

# 7. Decision Matrix: Which Models to Upload?

Based on the evaluation results, determine which models should be uploaded to HuggingFace Hub for Phase 2.

In [None]:
import numpy as np # Need numpy for nan checks
import pandas as pd # Ensure pandas is imported

print(f"\n{'='*70}")
print("🎯 DECISION MATRIX: Models for HuggingFace Hub Upload")
print(f"{'='*70}\n")

print("Evaluation Criteria:")
print("  1. Performance degradation (avg_accuracy) < 15% vs baseline")
print("  2. Outperforms or matches baseline in at least 3 tasks (Primary Metric)")
print("  3. Accuracy degradation (avg_accuracy) < 50% vs baseline")
print("  4. Sufficient parameter reduction to justify storage\n")

# --- Setup: Load BENCHMARKS_BASE and Baseline Scores (as before) ---
print("Building primary metric map from BENCHMARKS_BASE...")
TASK_PRIMARY_METRICS = {}
if 'BENCHMARKS_BASE' not in globals():
    print("="*70)
    print("❌ Error: BENCHMARKS_BASE variable not found in global scope.")
    print("   Please ensure you have run the setup cell that imports from utils.py:")
    print("   >>> from utils import BENCHMARKS_BASE")
    print("="*70)
    raise NameError("BENCHMARKS_BASE is not defined.")
else:
    print(f"✅ Found BENCHMARKS_BASE with {len(BENCHMARKS_BASE)} tasks.")

for task_spec in BENCHMARKS_BASE:
    task_name = task_spec["name"]
    if task_name == 'wikitext':
        TASK_PRIMARY_METRICS[task_name] = ('word_perplexity,none', False) # Lower is better
    elif task_name == 'lambada_openai':
        TASK_PRIMARY_METRICS[task_name] = ('perplexity', False) # Lower is better
    elif task_name == 'gsm8k':
        TASK_PRIMARY_METRICS[task_name] = ('exact_match,strict-match', True) # Higher is better
    else:
        TASK_PRIMARY_METRICS[task_name] = ('accuracy', True) # Higher is better
print(f"Built primary metric map for {len(TASK_PRIMARY_METRICS)} tasks.")

try:
    df_baseline = df[df['pruning_pct'] == 0].set_index('task')
    baseline_scores = {}
    for task, (metric, _) in TASK_PRIMARY_METRICS.items():
        if task in df_baseline.index and metric in df_baseline.columns:
            score = df_baseline.loc[task, metric]
            if not pd.isna(score):
                baseline_scores[task] = score
    print(f"Captured {len(baseline_scores)} valid scores from baseline model.")
except Exception as e:
    print(f"❌ Error: Could not get baseline scores from 'df'. {e}")
    baseline_scores = {}
# --- End Setup ---


# Decision logic
decisions = []
if 'summary_df' not in globals():
     print("❌ Error: 'summary_df' not found. Please run the previous analysis cell.")
else:
    for _, row in summary_df.iterrows():
        if row['pruning'] == 0:
            continue  # Skip baseline

        decision = {
            "model": row['model'],
            "pruning": row['pruning'],
            "star": row['star'],
        }

        acc_degradation = abs((row['avg_accuracy'] - baseline_acc) / baseline_acc * 100) if baseline_acc and not pd.isna(row['avg_accuracy']) else 999
        ppl_degradation = abs((row['avg_perplexity'] - baseline_ppl) / baseline_ppl * 100) if baseline_ppl and not pd.isna(row['avg_perplexity']) else 999

        # --- Criterion 2: count tasks where the pruned model matches or beats the baseline ---
        outperform_count = 0
        outperforming_tasks = []  # Names of the tasks won or tied
        model_tasks_df = df[df['model'] == row['model']].set_index('task')

        for task, (metric, higher_is_better) in TASK_PRIMARY_METRICS.items():
            if task not in baseline_scores or task not in model_tasks_df.index:
                continue

            model_score = model_tasks_df.loc[task, metric]
            if pd.isna(model_score):
                continue

            baseline_score = baseline_scores[task]

            if higher_is_better:
                if model_score >= baseline_score:
                    outperform_count += 1
                    outperforming_tasks.append(task)
            else:  # Lower is better
                if model_score <= baseline_score:
                    outperform_count += 1
                    outperforming_tasks.append(task)

        decision['outperform_count'] = outperform_count
        decision['outperforming_tasks'] = outperforming_tasks
        tasks_str = ", ".join(outperforming_tasks) if outperforming_tasks else "None"
        # --- End Criterion 2 Check ---


        # --- Decision logic (reasons list the tasks won or tied) ---
        is_star = row['star'] == '⭐'
        crit_1_low_degrad = (acc_degradation < 15)
        crit_2_tasks = (outperform_count >= 3)
        crit_3_not_catastrophic = (acc_degradation < 50)

        if is_star:
            decision['upload'] = True
            decision['reason'] = f"Star model. Won/Tied: {tasks_str}"
        elif crit_1_low_degrad and crit_2_tasks:
            decision['upload'] = True
            decision['reason'] = f"Low degradation (acc: {acc_degradation:.1f}%) AND Won/Tied: {tasks_str}"
        elif crit_3_not_catastrophic and crit_2_tasks:
            decision['upload'] = True
            decision['reason'] = f"Acceptable degradation (acc: {acc_degradation:.1f}%) AND Won/Tied: {tasks_str}"
        else:
            decision['upload'] = False
            reason_parts = []
            if not crit_3_not_catastrophic:
                reason_parts.append(f"High acc degradation ({acc_degradation:.1f}%)")
            if not crit_2_tasks:
                reason_parts.append(f"Only {outperform_count} tasks won ({tasks_str})")

            if not reason_parts:
                 reason_parts.append(f"Degradation (acc: {acc_degradation:.1f}%) or task count ({outperform_count}) too low")

            decision['reason'] = " AND ".join(reason_parts)

        decisions.append(decision)

# Display decision table
print("\nUpload Decisions (Updated Logic with Task Details):")
print("-" * 140)
print(f"{'Model':<35} {'Pruning':<10} {'Star':<6} {'Tasks Won/Tied':<16} {'Upload?':<10} {'Reason'}")
print("-" * 140)

for dec in decisions:
    upload_status = "✅ YES" if dec['upload'] else "❌ NO"
    pruning_str = f"{dec['pruning']}%"  # Attach the % sign before padding so columns stay aligned
    print(f"{dec['model']:<35} {pruning_str:<10} {dec['star']:<6} {dec.get('outperform_count', 'N/A'):<16} {upload_status:<10} {dec['reason']}")

print("-" * 140)

# Summary
models_to_upload = sum(1 for d in decisions if d['upload'])
print(f"\n📦 Total models to upload to HF Hub: {models_to_upload}/{len(decisions)}")
print(f"\n✅ PHASE 3 COMPLETE - Ready for Phase 2 (Model Factory)")
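
The upload rule in the cell above can be condensed into a single pure function, which makes the thresholds easier to test in isolation. This is a sketch under the thresholds printed in the criteria list (15%, 50%, at least 3 tasks); the function name `decide_upload`, the shortened reason strings, and the inputs are hypothetical.

```python
def decide_upload(is_star, acc_degradation, outperform_count):
    """Return (upload, reason) following the thresholds printed in the cell above."""
    wins_enough = outperform_count >= 3
    if is_star:
        return True, "Star model"
    if acc_degradation < 15 and wins_enough:
        return True, "Low degradation"
    if acc_degradation < 50 and wins_enough:
        return True, "Acceptable degradation"
    return False, "High degradation or too few tasks won"

# Hypothetical inputs, not measured results
print(decide_upload(False, 12.0, 4))
print(decide_upload(False, 30.0, 1))
```

Because the 50% threshold subsumes the 15% one, the upload outcome for a non-star model depends only on whether degradation stays below 50% and at least three tasks are won or tied. The 15% threshold changes the reported reason, not the decision.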

In [None]:
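import torch  # Used for the hardware metadata below
from pathlib import Path  # Used for the file-size report below

print(f"\n{'='*70}")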
print("💾 SAVING COMPLETE RESULTS FOR RESEARCH SHARING")
print(f"{'='*70}\n")

# --- Dynamic Model Info Setup ---
# Build a lookup for expansion rates from EXPERIMENT_CONFIG (if it exists)
exp_rate_map = {}
if 'EXPERIMENT_CONFIG' in globals():
    try:
        # --- ROBUST PARSING ---
        # Only add if all keys are present
        for cfg in EXPERIMENT_CONFIG:
            if (cfg.get("base_model", "") == "meta-llama/Llama-3.2-3B"
                and 'pruning_pct' in cfg
                and 'expansion_rate' in cfg):
                 exp_rate_map[cfg['pruning_pct']] = cfg['expansion_rate']
    except Exception as e:
        print(f"⚠️ Warning: Could not parse EXPERIMENT_CONFIG. {e}")
exp_rate_map[0] = 300 # Baseline expansion rate

print(f"Loaded {len(exp_rate_map)} expansion rates from config (Baseline included).")

# Build 'models_evaluated' section dynamically from the 'df'
models_evaluated = {}
if 'df' not in globals():
    print("❌ Error: 'df' DataFrame not found. Cannot build results.")
    raise NameError("'df' is not defined. Please run consolidation cell.")

# Group by model to rebuild the nested result structure
grouped_by_model = df.groupby(['model', 'pruning_pct', 'is_star'])

for (model_name, pruning_pct, is_star_bool), model_df in grouped_by_model:
    model_key = f"pruned_{int(pruning_pct)}pct" if pruning_pct > 0 else "baseline"

    # Rebuild the nested results dict
    model_results_dict = {}
    for _, row in model_df.iterrows():
        task_name = row['task']
        metrics = row.drop(['model', 'pruning_pct', 'is_star', 'task'])

        task_metrics = {}
        for col, val in metrics.dropna().items():
            # Convert numpy types to native Python types
            if isinstance(val, (np.integer, np.int64)):
                task_metrics[col] = int(val)
            elif isinstance(val, (np.floating, np.float64)):
                task_metrics[col] = float(val)
            else:
                task_metrics[col] = val
        model_results_dict[task_name] = task_metrics

    # Assemble the entry for this model
    models_evaluated[model_key] = {
        "name": model_name,
        "pruning_pct": int(pruning_pct), # Cast to native int
        "expansion_rate": exp_rate_map.get(int(pruning_pct), None), # Get from map
        "is_star": bool(is_star_bool), # Cast numpy.bool_ to native bool so json.dump can serialize it
        "hf_repo": (
            "meta-llama/Llama-3.2-3B" if pruning_pct == 0
            else f"peremartra/Llama-3.2-3B-pruned-{int(pruning_pct)}pct"
        ),
        "results": model_results_dict
    }
print(f"Dynamically built {len(models_evaluated)} model entries from 'df'.")
# --- End Dynamic Model Setup ---


# Consolidate all data into a comprehensive JSON
# Baseline summary row; empty when no baseline checkpoint was consolidated
baseline_summary = summary_df.loc[summary_df['pruning'] == 0]

complete_results = {
    "experiment_metadata": {
        "timestamp": datetime.now().isoformat(),
        "notebook": "02_Evaluate_3B.ipynb",
        "model_family": "Llama-3.2-3B",
        "pruning_method": "MAW (Maximum Absolute Weight)",
        #"library_versions": library_versions,
        "hardware": {
            "device": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU",
            "gpu_memory_gb": torch.cuda.get_device_properties(0).total_memory / 1e9 if torch.cuda.is_available() else None
        }
    },

    "benchmarks": [
        {"name": task["name"], "num_fewshot": task["num_fewshot"]}
        for task in BENCHMARKS_BASE
    ],

    "models_evaluated": models_evaluated, # Use dynamic model data

    "summary_statistics": {
        "baseline": {
            "avg_accuracy": float(baseline_summary['avg_accuracy'].values[0]) if not baseline_summary.empty else None,
            "avg_perplexity": float(baseline_summary['avg_perplexity'].values[0]) if not baseline_summary.empty else None,
        },
        "pruned_models": [
            {
                "pruning_pct": int(row['pruning']),
                "is_star": row['star'] == '⭐', # This is native bool, so it's fine
                "avg_accuracy": float(row['avg_accuracy']) if pd.notna(row['avg_accuracy']) else None,
                "avg_perplexity": float(row['avg_perplexity']) if pd.notna(row['avg_perplexity']) else None,
                "accuracy_degradation_pct": float(((row['avg_accuracy'] - baseline_acc) / baseline_acc * 100)) if baseline_acc and pd.notna(row['avg_accuracy']) else None,
                "perplexity_degradation_pct": float(((row['avg_perplexity'] - baseline_ppl) / baseline_ppl * 100)) if baseline_ppl and pd.notna(row['avg_perplexity']) else None
            }
            for _, row in summary_df.iterrows() if row['pruning'] > 0
        ]
    },

    "upload_decisions": decisions, # Use dynamic decisions

    "citation": {
        "paper": "Exploring GLU Expansion Ratios: Structured Pruning in Llama-3.2 Models",
        "author": "Pere Martra",
        "doi": "https://doi.org/10.31219/osf.io/qgxea",
        "github": "https://github.com/peremartra/llama-glu-expansion-pruning",
        "note": "Results are freely available for research purposes. Please cite the paper if you use this data."
    }
}

# --- Save to JSON ---
try:
    # Ensure RESULTS_DIR is defined
    if 'RESULTS_DIR' not in globals():
        print("❌ Error: RESULTS_DIR not defined. Defaulting to './results'")
        RESULTS_DIR = "./results"

    # Create RESULTS_DIR if it doesn't exist
    os.makedirs(RESULTS_DIR, exist_ok=True)

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    json_path = f"{RESULTS_DIR}/llama_3b_complete_results_{timestamp}.json"
    with open(json_path, 'w') as f:
        json.dump(complete_results, f, indent=2, ensure_ascii=False)

    print(f"✅ Complete results saved to:")
    print(f"   {json_path}")

    # Also save a "latest" version
    latest_json = f"{RESULTS_DIR}/llama_3b_complete_results_latest.json"
    with open(latest_json, 'w') as f:
        json.dump(complete_results, f, indent=2, ensure_ascii=False)

    print(f"✅ Latest version:")
    print(f"   {latest_json}")

    # Display file size
    file_size_kb = Path(json_path).stat().st_size / 1024
    print(f"\n📊 File size: {file_size_kb:.1f} KB")

except Exception as e:
    print(f"❌ Error saving JSON files: {e}")
    print(f"   Please ensure RESULTS_DIR is defined and writeable: {RESULTS_DIR}")


print(f"\n📦 Models included: {len(complete_results['models_evaluated'])}")
print(f"📋 Benchmarks per model: {len(BENCHMARKS_BASE)}")
print(f"🔬 Total result entries: {len(df)}")

print(f"\n{'='*70}")
print("✅ COMPLETE RESULTS SAVED - Ready for research sharing")
print(f"{'='*70}\n")
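
The explicit `int(...)`, `float(...)` and `bool(...)` casts in the cell above exist because `json.dump` rejects some NumPy scalar types, such as `np.int64` and `np.bool_`. An alternative is a `default` hook that unwraps anything exposing `.item()`, which NumPy scalars do. The sketch below is illustrative only: `to_native` is a hypothetical helper, and `ScalarLike` is a stand-in class so the example runs without NumPy installed.

```python
import json

def to_native(obj):
    """json.dumps fallback: unwrap scalar-like objects that expose .item()."""
    if hasattr(obj, "item"):
        return obj.item()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")

class ScalarLike:
    """Stand-in for a NumPy scalar so the sketch runs without NumPy installed."""
    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value

payload = {"pruning_pct": ScalarLike(60), "is_star": ScalarLike(False)}
encoded = json.dumps(payload, default=to_native)
print(encoded)
```

The hook is only called for objects the encoder cannot handle on its own, so native Python values pass through unchanged.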

---

## 🎓 Key Takeaways

This notebook evaluated the Llama-3.2-3B model family across the 13 benchmark tasks run above to determine:

1. **Optimal pruning level** for GLU-MLP layers
2. **Performance-efficiency trade-offs** at different expansion ratios
3. **Which models justify upload** to HuggingFace Hub


---

**Powered by OptiPFair** - Structured Pruning for GLU Architectures

If this research helps your work:
- ⭐ Star [the repo](https://github.com/peremartra/optipfair)
- 📖 Read the [documentation](https://peremartra.github.io/optipfair/)
- 🐛 Report issues or suggest features

---