<a href="https://colab.research.google.com/github/peremartra/fairness-pruning/blob/main/notebooks/02_Evaluate_1B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fairness Pruning Research - Base Model Evaluation
## 02 - Comprehensive Benchmark Suite for Unpruned Models

### Establishing Performance Baselines for Bias Mitigation Research
by [Pere Martra](https://github.com/peremartra)

[![GitHub](https://img.shields.io/badge/⭐_Star-OptiPFair-orange?logo=github&logoColor=white)](https://github.com/peremartra/optipfair)
[![PyPI](https://img.shields.io/pypi/v/optipfair?logo=python&logoColor=white&label=v)](https://pypi.org/project/optipfair/)

**Repository:** [github.com/peremartra/fairness-pruning](https://github.com/peremartra/fairness-pruning)

---

**Colab Environment:** GPU L4 or A100

**Models to Evaluate:**
* Llama-3.2-1B (base)
* Llama-3.2-3B (base)
* Additional models defined in `EXPERIMENT_CONFIG`

**Benchmarks (15 total):**
* English: MMLU, HellaSwag, BoolQ, ARC-Challenge, WinoGrande, PIQA, TruthfulQA, GSM8K, IFEval, MUSR
* Spanish: Belebele, XCOPA, MMLU-ES
* Language Modeling: WikiText, Lambada-OpenAI

**Estimated Runtime:** ~3-4 hours (varies by number of models)

---

## 📋 Objective

Establish **performance baselines** for the Fairness Pruning project by evaluating unpruned base models.

**Purpose:**
1. Measure baseline performance before bias mitigation interventions
2. Create reference metrics for future pruned model comparisons
3. Validate benchmark configurations across different architectures
4. Capture cross-lingual performance (English + Spanish)

**Features:**
- ✅ Checkpoint/Resume Support (survives Colab disconnections)
- ✅ Multi-Model Support (generic, not 1B-specific)
- ✅ Robust Error Handling (continues on task failures)
- ✅ Automated Path Management (no manual configuration needed)

**Note:** This notebook evaluates ONLY base models (no pruning applied). For bias mitigation experiments with pruned models, see subsequent notebooks.
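
The checkpoint/resume and error-handling features listed above can be pictured with a minimal sketch. The real logic lives in `run_robust_evaluation` inside `utils.py`, which is not shown in this notebook, so every name here (`load_checkpoint`, `save_checkpoint`, `evaluate_with_resume`, the `status` field, the stand-in `fake_task`) is hypothetical and only illustrates the general pattern: save after each task, skip finished tasks on restart, and record failures without aborting.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return previously saved per-task results, or an empty dict if none exist."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_checkpoint(path, state):
    """Write to a temp file first so an interrupted write cannot corrupt the checkpoint."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, path)

def evaluate_with_resume(task_names, run_task, path):
    """Run each task once; skip finished tasks and record failures without stopping."""
    state = load_checkpoint(path)
    for name in task_names:
        if state.get(name, {}).get("status") == "done":
            continue  # already finished in an earlier session
        try:
            state[name] = {"status": "done", "metrics": run_task(name)}
        except Exception as exc:
            state[name] = {"status": "failed", "error": str(exc)}
        save_checkpoint(path, state)  # persist after every task
    return state

# Demo with a stand-in scorer (hypothetical; no model is involved).
calls = []

def fake_task(name):
    calls.append(name)
    if name == "broken_task":
        raise RuntimeError("simulated failure")
    return {"score": len(name)}

demo_path = os.path.join(tempfile.mkdtemp(), "demo_checkpoint.json")
first = evaluate_with_resume(["wikitext", "broken_task"], fake_task, demo_path)
second = evaluate_with_resume(["wikitext", "broken_task", "mmlu"], fake_task, demo_path)
print(calls)
```

In this sketch the second call simulates a reconnected session: `wikitext` is skipped because it is already marked as done, while the failed task is retried. Whether the project's own implementation retries failed tasks is not stated in this notebook.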

---

# 1. Setup & Installation

In [1]:
# Install required libraries
!pip install -q optipfair
!pip install -q lm-eval
!pip install -q langdetect

In [2]:
# Mount Google Drive for checkpoint persistence
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Download utils.py from GitHub repository
!wget -q https://raw.githubusercontent.com/peremartra/fairness-pruning/main/utils.py

# Verify download
import os
if os.path.exists('utils.py'):
    print("✅ utils.py downloaded successfully")
else:
    print("❌ Failed to download utils.py")

✅ utils.py downloaded successfully


In [4]:
# Import core libraries and utilities
import torch
import json
import pandas as pd
from datetime import datetime
from pathlib import Path

# Import our utility functions
from utils import (
    EXPERIMENT_CONFIG,
    BENCHMARKS_BASE,
    load_or_create_model,
    run_robust_evaluation,
    clear_gpu_cache,
    get_model_stats,
    format_results_table
)

print("✅ All imports successful")
print(f"📱 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

✅ All imports successful
📱 Device: GPU
   GPU: NVIDIA L4
   Memory: 23.8 GB


# 1.1 Helper Functions

Utility functions for automatic checkpoint path generation and model size detection.

In [5]:
import re
import os

def get_model_size(model_name: str) -> str:
    """Extract model size identifier from HuggingFace model name.

    Examples:
        "meta-llama/Llama-3.2-1B" → "1b"
        "meta-llama/Llama-3.2-3B-Instruct" → "3b_instruct"
        "BSC-LT/salamandra-2b" → "2b"
    """
    match = re.search(r'(\d+\.?\d*)[Bb]', model_name)
    if not match:
        return "unknown"

    size = match.group(1).replace('.', '_') + "b"
    if "instruct" in model_name.lower():
        size += "_instruct"

    return size.lower()

def get_checkpoint_path(model_name: str, base_dir: str) -> str:
    """Generate checkpoint path with size-based subdirectory.

    Args:
        model_name: Full HuggingFace model identifier
        base_dir: Base directory for checkpoints

    Returns:
        Full path to checkpoint file
    """
    model_size = get_model_size(model_name)
    safe_name = model_name.replace('/', '_').replace('-', '_').lower()
    checkpoint_dir = os.path.join(base_dir, model_size)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return os.path.join(checkpoint_dir, f"{safe_name}.json")

# Test with EXPERIMENT_CONFIG
print("Testing helper functions with EXPERIMENT_CONFIG:")
print("-" * 70)
for cfg in EXPERIMENT_CONFIG:
    model_id = cfg['base_model']
    size = get_model_size(model_id)
    print(f"{model_id:<50} → {size}")
print("-" * 70)

Testing helper functions with EXPERIMENT_CONFIG:
----------------------------------------------------------------------
meta-llama/Llama-3.2-1B                            → 1b
meta-llama/Llama-3.2-3B                            → 3b
meta-llama/Llama-3.2-3B                            → 3b
----------------------------------------------------------------------
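
The pattern in `get_model_size` looks for a number immediately followed by `B` or `b`, which is why the version number `3.2` in the Llama names is skipped and only `1B` or `3B` is picked up. The short check below reuses the same regular expression to show what it matches. The two `example-org` names are hypothetical and are included only to illustrate a fractional size and a case where an earlier token such as `8bit` would be matched first.

```python
import re

# Same pattern used by get_model_size above.
SIZE_PATTERN = re.compile(r'(\d+\.?\d*)[Bb]')

def matched_text(model_name):
    """Return the substring the pattern matches, or None when nothing matches."""
    match = SIZE_PATTERN.search(model_name)
    return match.group(0) if match else None

names = [
    "meta-llama/Llama-3.2-1B",   # from EXPERIMENT_CONFIG
    "BSC-LT/salamandra-2b",      # from the docstring
    "example-org/tiny-0.5B",     # hypothetical name with a fractional size
    "example-org/demo-8bit-7B",  # hypothetical name that trips the pattern
    "gpt2",                      # no size token at all
]
for name in names:
    print(f"{name:<28} -> {matched_text(name)}")
```

Because `re.search` returns the leftmost match, a name containing an earlier digit-plus-`b` token would be assigned the wrong size directory. None of the models evaluated here are affected.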


# 2. Configuration & Evaluation Plan

This section prepares the evaluation for all models defined in `EXPERIMENT_CONFIG`.
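
The planning cell reads only three fields: `base_model` from each `EXPERIMENT_CONFIG` entry, and `name` and `num_fewshot` from each `BENCHMARKS_BASE` entry. The sketch below uses small stand-in lists with that assumed shape (the real definitions live in `utils.py` and may carry more fields) to show how `dict.fromkeys` removes repeated model IDs while keeping their first-seen order.

```python
# Hypothetical stand-ins for the objects imported from utils.py.
EXAMPLE_CONFIG = [
    {"base_model": "meta-llama/Llama-3.2-1B"},
    {"base_model": "meta-llama/Llama-3.2-3B"},
    {"base_model": "meta-llama/Llama-3.2-3B"},  # repeated entry
]
EXAMPLE_BENCHMARKS = [
    {"name": "wikitext", "num_fewshot": 0},
    {"name": "gsm8k", "num_fewshot": 5},
]

# dict keys are unique and keep insertion order, so this de-duplicates
# without reordering (a plain set() would not guarantee the order).
unique = list(dict.fromkeys(cfg["base_model"] for cfg in EXAMPLE_CONFIG))
total_runs = len(unique) * len(EXAMPLE_BENCHMARKS)

print(unique)
print(f"Total evaluations: {total_runs}")
```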

In [6]:
# Directory setup
CHECKPOINT_BASE_DIR = "/content/drive/MyDrive/fair_pruning/checkpoints"
RESULTS_DIR = "/content/drive/MyDrive/fair_pruning/results"
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)

# De-duplicate models from EXPERIMENT_CONFIG
unique_models = list(dict.fromkeys([cfg["base_model"] for cfg in EXPERIMENT_CONFIG]))

print(f"{'='*70}")
print("📊 EVALUATION PLAN: Base Model Benchmarking")
print(f"{'='*70}\n")
print(f"Models to evaluate: {len(unique_models)}")
print(f"Benchmarks per model: {len(BENCHMARKS_BASE)}")
print(f"Total evaluations: {len(unique_models) * len(BENCHMARKS_BASE)}")
print(f"Estimated time: ~{len(unique_models) * 1.5:.1f} hours\n")

# Display models with checkpoint status
print("Models to evaluate:")
print("-" * 70)
print(f"{'Model ID':<50} {'Size':<10} {'Status'}")
print("-" * 70)
for model_id in unique_models:
    size = get_model_size(model_id)
    cp_path = get_checkpoint_path(model_id, CHECKPOINT_BASE_DIR)
    exists = "✅ Exists" if Path(cp_path).exists() else "🆕 New"
    print(f"{model_id:<50} {size:<10} {exists}")
print("-" * 70)

# Display benchmarks
print("\nBenchmarks:")
print("-" * 70)
for i, task in enumerate(BENCHMARKS_BASE, 1):
    fewshot_str = f"{task['num_fewshot']}-shot"
    print(f"{i:2d}. {task['name']:<30} {fewshot_str}")
print("-" * 70)

print(f"\n⚙️  Configuration:")
print(f"   - Checkpointing: Enabled (per-task granularity)")
print(f"   - Auto-resume: Yes (survives disconnections)")
print(f"   - Error handling: Skip failed tasks, continue evaluation")
print(f"   - Device: {'GPU' if torch.cuda.is_available() else 'CPU'}\n")

📊 EVALUATION PLAN: Base Model Benchmarking

Models to evaluate: 2
Benchmarks per model: 12
Total evaluations: 24
Estimated time: ~3.0 hours

Models to evaluate:
----------------------------------------------------------------------
Model ID                                           Size       Status
----------------------------------------------------------------------
meta-llama/Llama-3.2-1B                            1b         🆕 New
meta-llama/Llama-3.2-3B                            3b         🆕 New
----------------------------------------------------------------------

Benchmarks:
----------------------------------------------------------------------
 1. wikitext                       0-shot
 2. lambada_openai                 0-shot
 3. ifeval                         0-shot
 4. gsm8k                          5-shot
 5. mmlu                           5-shot
 6. arc_challenge                  0-shot
 7. hellaswag                      0-shot
 8. truthfulqa_mc2                 0-shot
 

# 3. Base Model Evaluation

Evaluates each base model across all benchmarks with checkpoint/resume support.

**Process:**
1. Load model directly from HuggingFace Hub (no pruning applied)
2. Calculate model statistics (parameters, size)
3. Run evaluation with checkpoint system (saves progress after each task)
4. Clear GPU memory before next model
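
As a rough cross-check of step 2, the weight footprint can be estimated from the parameter count alone. The sketch below is an assumption about how such a figure can be derived, not the code of `get_model_stats` (which may, for example, also count buffers). It takes the 1B model's parameter count from the run output (1,235,814,400) and 2 bytes per `bfloat16` parameter. `weight_size` and `BYTES_PER_PARAM` are hypothetical names.

```python
# Bytes used by one parameter for a few common dtypes.
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}

def weight_size(num_params, dtype="bfloat16"):
    """Estimate the raw weight footprint in bytes, GiB (2**30) and GB (10**9)."""
    total_bytes = num_params * BYTES_PER_PARAM[dtype]
    return {
        "bytes": total_bytes,
        "gib": total_bytes / 1024**3,
        "gb": total_bytes / 1e9,
    }

size = weight_size(1_235_814_400)
print(f"{size['gib']:.2f} GiB, {size['gb']:.2f} GB")
```

Under these assumptions the binary figure agrees with the "Size: 2.30 GB" line printed for the 1B model, and the decimal figure agrees with the 2.47G `model.safetensors` download.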

In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"\n{'='*70}")
print("🚀 STARTING EVALUATION")
print(f"{'='*70}\n")

all_model_results = {}

for i, model_id in enumerate(unique_models, 1):
    print(f"\n{'='*70}")
    print(f"📊 MODEL {i}/{len(unique_models)}: {model_id}")
    print(f"{'='*70}\n")

    try:
        # 1. Load model from HuggingFace Hub (NO pruning)
        print(f"Loading from HuggingFace Hub...")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,  # bfloat16 runs on A100 and L4; switch to float16 on T4
            device_map="auto"
        )

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        print("✅ Model loaded successfully\n")

        # 2. Display model statistics
        stats = get_model_stats(model)
        print(f"📈 Model Statistics:")
        print(f"   Parameters: {stats['total_parameters']:,}")
        print(f"   Size: {stats['size_gb']:.2f} GB\n")

        # 3. Generate checkpoint path automatically
        checkpoint_path = get_checkpoint_path(model_id, CHECKPOINT_BASE_DIR)
        print(f"📁 Checkpoint: {checkpoint_path}\n")

        # 4. Run evaluation with checkpoint/resume support
        results = run_robust_evaluation(
            model=model,
            tokenizer=tokenizer,
            tasks=BENCHMARKS_BASE,
            checkpoint_path=checkpoint_path,
            model_name=model_id
        )

        all_model_results[model_id] = results

        print(f"\n✅ Completed: {model_id}")
        print("\nResults Preview:")
        print(format_results_table(results))

        # 5. Cleanup memory before next model
        del model, tokenizer
        clear_gpu_cache()

    except Exception as e:
        print(f"\n❌ ERROR evaluating {model_id}: {str(e)}")

        # Check for common issues
        if "401" in str(e) or "403" in str(e):
            print("   → Authentication required. Run: huggingface-cli login")
        elif "CUDA out of memory" in str(e):
            print("   → GPU OOM. Try reducing batch size or using smaller model")

        print("   → Continuing with next model...\n")

        # Cleanup and continue
        if 'model' in locals():
            del model
        if 'tokenizer' in locals():
            del tokenizer
        clear_gpu_cache()
        continue

print(f"\n{'='*70}")
print(f"✅ EVALUATION COMPLETE: {len(all_model_results)}/{len(unique_models)} models")
print(f"{'='*70}\n")


🚀 STARTING EVALUATION


📊 MODEL 1/2: meta-llama/Llama-3.2-1B

Loading from HuggingFace Hub...


`torch_dtype` is deprecated! Use `dtype` instead!


✅ Model loaded successfully

📈 Model Statistics:
   Parameters: 1,235,814,400
   Size: 2.30 GB

📁 Checkpoint: /content/drive/MyDrive/fair_pruning/checkpoints/1b/meta_llama_llama_3.2_1b.json

🆕 Creating new checkpoint: /content/drive/MyDrive/fair_pruning/checkpoints/1b/meta_llama_llama_3.2_1b.json

🚀 Starting evaluation: 12 tasks remaining


[1/12] Evaluating: wikitext
──────────────────────────────────────────────────────────────────────





Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['wikitext'] (full dataset)
Few-shot config: {'wikitext': 0}





✅ wikitext completed and saved to checkpoint
   Results: {'word_perplexity,none': '11.9853', 'byte_perplexity,none': '1.5912', 'bits_per_byte,none': '0.6701'}

[2/12] Evaluating: lambada_openai
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['lambada_openai'] (full dataset)
Few-shot config: {'lambada_openai': 0}



100%|██████████| 5153/5153 [00:09<00:00, 547.90it/s]
Running loglikelihood requests: 100%|██████████| 5153/5153 [01:53<00:00, 45.41it/s]


bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:08<00:00, 11.73it/s]


✅ lambada_openai completed and saved to checkpoint
   Results: {'perplexity': '5.43', 'word_perplexity': '0.00', 'bits_per_byte': '0.0000'}

[3/12] Evaluating: ifeval
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['ifeval'] (full dataset)
Few-shot config: {'ifeval': 0}



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


Downloaded punkt_tab on rank 0


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Running generate_until requests: 100%|██████████| 541/541 [1:31:51<00:00, 10.19s/it]


✅ ifeval completed and saved to checkpoint
   Results: {'prompt_level_strict_acc,none': '0.0998', 'prompt_level_strict_acc_stderr,none': '0.0129', 'inst_level_strict_acc,none': '0.1475', 'prompt_level_loose_acc,none': '0.1128', 'prompt_level_loose_acc_stderr,none': '0.0136', 'inst_level_loose_acc,none': '0.1607'}

[4/12] Evaluating: gsm8k
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['gsm8k'] (full dataset)
Few-shot config: {'gsm8k': 5}



Running generate_until requests: 100%|██████████| 1319/1319 [53:41<00:00,  2.44s/it]


✅ gsm8k completed and saved to checkpoint
   Results: {'exact_match,strict-match': '0.0553', 'exact_match_stderr,strict-match': '0.0063', 'exact_match,flexible-extract': '0.0584', 'exact_match_stderr,flexible-extract': '0.0065'}

[5/12] Evaluating: mmlu
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['mmlu'] (full dataset)
Few-shot config: {'mmlu': 5}




Generating validation split:   0%|          | 0/23 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

global_facts/test-00000-of-00001.parquet:   0%|          | 0.00/11.5k [00:00<?, ?B/s]

global_facts/validation-00000-of-00001.p(…):   0%|          | 0.00/4.19k [00:00<?, ?B/s]

global_facts/dev-00000-of-00001.parquet:   0%|          | 0.00/3.58k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_medicine/test-00000-of-00001.par(…):   0%|          | 0.00/42.5k [00:00<?, ?B/s]

college_medicine/validation-00000-of-000(…):   0%|          | 0.00/8.99k [00:00<?, ?B/s]

college_medicine/dev-00000-of-00001.parq(…):   0%|          | 0.00/4.84k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/173 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

clinical_knowledge/test-00000-of-00001.p(…):   0%|          | 0.00/40.5k [00:00<?, ?B/s]

clinical_knowledge/validation-00000-of-0(…):   0%|          | 0.00/7.48k [00:00<?, ?B/s]

clinical_knowledge/dev-00000-of-00001.pa(…):   0%|          | 0.00/3.67k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/265 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

business_ethics/test-00000-of-00001.parq(…):   0%|          | 0.00/21.6k [00:00<?, ?B/s]

business_ethics/validation-00000-of-0000(…):   0%|          | 0.00/5.09k [00:00<?, ?B/s]

business_ethics/dev-00000-of-00001.parqu(…):   0%|          | 0.00/4.96k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

machine_learning/test-00000-of-00001.par(…):   0%|          | 0.00/19.7k [00:00<?, ?B/s]

machine_learning/validation-00000-of-000(…):   0%|          | 0.00/6.17k [00:00<?, ?B/s]

machine_learning/dev-00000-of-00001.parq(…):   0%|          | 0.00/5.25k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/112 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_statistics/test-00000-of-000(…):   0%|          | 0.00/58.0k [00:00<?, ?B/s]

high_school_statistics/validation-00000-(…):   0%|          | 0.00/10.9k [00:00<?, ?B/s]

high_school_statistics/dev-00000-of-0000(…):   0%|          | 0.00/6.07k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/216 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/23 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_physics/test-00000-of-00001.(…):   0%|          | 0.00/33.0k [00:00<?, ?B/s]

high_school_physics/validation-00000-of-(…):   0%|          | 0.00/7.96k [00:00<?, ?B/s]

high_school_physics/dev-00000-of-00001.p(…):   0%|          | 0.00/4.57k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/151 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/17 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_mathematics/test-00000-of-00(…):   0%|          | 0.00/33.7k [00:00<?, ?B/s]

high_school_mathematics/validation-00000(…):   0%|          | 0.00/6.99k [00:00<?, ?B/s]

high_school_mathematics/dev-00000-of-000(…):   0%|          | 0.00/4.50k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/270 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_computer_science/test-00000-(…):   0%|          | 0.00/27.3k [00:00<?, ?B/s]

high_school_computer_science/validation-(…):   0%|          | 0.00/5.28k [00:00<?, ?B/s]

high_school_computer_science/dev-00000-o(…):   0%|          | 0.00/6.54k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/9 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_chemistry/test-00000-of-0000(…):   0%|          | 0.00/33.3k [00:00<?, ?B/s]

high_school_chemistry/validation-00000-o(…):   0%|          | 0.00/8.31k [00:00<?, ?B/s]

high_school_chemistry/dev-00000-of-00001(…):   0%|          | 0.00/4.16k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/203 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

high_school_biology/test-00000-of-00001.(…):   0%|          | 0.00/62.7k [00:00<?, ?B/s]

high_school_biology/validation-00000-of-(…):   0%|          | 0.00/10.6k [00:00<?, ?B/s]

high_school_biology/dev-00000-of-00001.p(…):   0%|          | 0.00/4.94k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/310 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/32 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

elementary_mathematics/test-00000-of-000(…):   0%|          | 0.00/41.1k [00:00<?, ?B/s]

elementary_mathematics/validation-00000-(…):   0%|          | 0.00/9.38k [00:00<?, ?B/s]

elementary_mathematics/dev-00000-of-0000(…):   0%|          | 0.00/4.55k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/378 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/41 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

electrical_engineering/test-00000-of-000(…):   0%|          | 0.00/17.6k [00:00<?, ?B/s]

electrical_engineering/validation-00000-(…):   0%|          | 0.00/5.08k [00:00<?, ?B/s]

electrical_engineering/dev-00000-of-0000(…):   0%|          | 0.00/4.08k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/145 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

conceptual_physics/test-00000-of-00001.p(…):   0%|          | 0.00/25.0k [00:00<?, ?B/s]

conceptual_physics/validation-00000-of-0(…):   0%|          | 0.00/5.98k [00:00<?, ?B/s]

conceptual_physics/dev-00000-of-00001.pa(…):   0%|          | 0.00/3.96k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/235 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/26 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

computer_security/test-00000-of-00001.pa(…):   0%|          | 0.00/19.1k [00:00<?, ?B/s]

computer_security/validation-00000-of-00(…):   0%|          | 0.00/6.67k [00:00<?, ?B/s]

computer_security/dev-00000-of-00001.par(…):   0%|          | 0.00/4.33k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_physics/test-00000-of-00001.parq(…):   0%|          | 0.00/18.6k [00:00<?, ?B/s]

college_physics/validation-00000-of-0000(…):   0%|          | 0.00/6.39k [00:00<?, ?B/s]

college_physics/dev-00000-of-00001.parqu(…):   0%|          | 0.00/4.51k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/102 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_mathematics/test-00000-of-00001.(…):   0%|          | 0.00/16.6k [00:00<?, ?B/s]

college_mathematics/validation-00000-of-(…):   0%|          | 0.00/5.00k [00:00<?, ?B/s]

college_mathematics/dev-00000-of-00001.p(…):   0%|          | 0.00/5.16k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_computer_science/test-00000-of-0(…):   0%|          | 0.00/28.1k [00:00<?, ?B/s]

college_computer_science/validation-0000(…):   0%|          | 0.00/6.25k [00:00<?, ?B/s]

college_computer_science/dev-00000-of-00(…):   0%|          | 0.00/6.81k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_chemistry/test-00000-of-00001.pa(…):   0%|          | 0.00/17.9k [00:00<?, ?B/s]

college_chemistry/validation-00000-of-00(…):   0%|          | 0.00/4.87k [00:00<?, ?B/s]

college_chemistry/dev-00000-of-00001.par(…):   0%|          | 0.00/4.04k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/8 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

college_biology/test-00000-of-00001.parq(…):   0%|          | 0.00/31.8k [00:00<?, ?B/s]

college_biology/validation-00000-of-0000(…):   0%|          | 0.00/6.90k [00:00<?, ?B/s]

college_biology/dev-00000-of-00001.parqu(…):   0%|          | 0.00/4.27k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/144 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

astronomy/test-00000-of-00001.parquet:   0%|          | 0.00/28.3k [00:00<?, ?B/s]

astronomy/validation-00000-of-00001.parq(…):   0%|          | 0.00/6.05k [00:00<?, ?B/s]

astronomy/dev-00000-of-00001.parquet:   0%|          | 0.00/4.94k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/152 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

anatomy/test-00000-of-00001.parquet:   0%|          | 0.00/20.1k [00:00<?, ?B/s]

anatomy/validation-00000-of-00001.parque(…):   0%|          | 0.00/5.28k [00:00<?, ?B/s]

anatomy/dev-00000-of-00001.parquet:   0%|          | 0.00/3.50k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/135 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/14 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

abstract_algebra/test-00000-of-00001.par(…):   0%|          | 0.00/9.96k [00:00<?, ?B/s]

abstract_algebra/validation-00000-of-000(…):   0%|          | 0.00/3.73k [00:00<?, ?B/s]

abstract_algebra/dev-00000-of-00001.parq(…):   0%|          | 0.00/3.45k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

100%|██████████| 100/100 [00:00<00:00, 121.20it/s]
100%|██████████| 135/135 [00:01<00:00, 126.50it/s]
100%|██████████| 152/152 [00:01<00:00, 127.84it/s]
100%|██████████| 144/144 [00:01<00:00, 127.48it/s]
100%|██████████| 100/100 [00:00<00:00, 125.81it/s]
100%|██████████| 100/100 [00:00<00:00, 127.78it/s]
100%|██████████| 100/100 [00:00<00:00, 128.01it/s]
100%|██████████| 102/102 [00:00<00:00, 127.45it/s]
100%|██████████| 100/100 [00:00<00:00, 126.25it/s]
100%|██████████| 235/235 [00:01<00:00, 126.95it/s]
100%|██████████| 145/145 [00:01<00:00, 127.65it/s]
100%|██████████| 378/378 [00:02<00:00, 128.37it/s]
100%|██████████| 310/310 [00:02<00:00, 127.65it/s]
100%|██████████| 203/203 [00:01<00:00, 129.35it/s]
100%|██████████| 100/100 [00:00<00:00, 128.80it/s]
100%|██████████| 270/270 [00:02<00:00, 128.90it/s]
100%|██████████| 151/151 [00:01<00:00, 128.12it/s]
100%|██████████| 216/216 [00:01<00:00, 127.22it/s]
100%|██████████| 112/112 [00:00<00:00, 127.27it/s]
100%|██████████| 100/100 [00:00

✅ mmlu completed and saved to checkpoint
   Results: {'accuracy': '0.3143', 'acc_norm': 'N/A'}

[6/12] Evaluating: arc_challenge
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['arc_challenge'] (full dataset)
Few-shot config: {'arc_challenge': 0}



README.md: 0.00B [00:00, ?B/s]

ARC-Challenge/train-00000-of-00001.parqu(…):   0%|          | 0.00/190k [00:00<?, ?B/s]

ARC-Challenge/test-00000-of-00001.parque(…):   0%|          | 0.00/204k [00:00<?, ?B/s]

ARC-Challenge/validation-00000-of-00001.(…):   0%|          | 0.00/55.7k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1119 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1172 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/299 [00:00<?, ? examples/s]

100%|██████████| 1172/1172 [00:01<00:00, 1000.67it/s]
Running loglikelihood requests: 100%|██████████| 4687/4687 [01:34<00:00, 49.39it/s]


✅ arc_challenge completed and saved to checkpoint
   Results: {'accuracy': '0.3148', 'acc_norm': '0.3720'}

[7/12] Evaluating: hellaswag
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['hellaswag'] (full dataset)
Few-shot config: {'hellaswag': 0}



README.md: 0.00B [00:00, ?B/s]

data/train-00000-of-00001.parquet:   0%|          | 0.00/24.4M [00:00<?, ?B/s]

data/test-00000-of-00001.parquet:   0%|          | 0.00/6.11M [00:00<?, ?B/s]

data/validation-00000-of-00001.parquet:   0%|          | 0.00/6.32M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/39905 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/10003 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10042 [00:00<?, ? examples/s]

Map:   0%|          | 0/39905 [00:00<?, ? examples/s]

Map:   0%|          | 0/10042 [00:00<?, ? examples/s]

100%|██████████| 10042/10042 [00:04<00:00, 2264.02it/s]
Running loglikelihood requests: 100%|██████████| 40168/40168 [14:52<00:00, 44.99it/s]


✅ hellaswag completed and saved to checkpoint
   Results: {'accuracy': '0.4810', 'acc_norm': '0.6419'}

[8/12] Evaluating: truthfulqa_mc2
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['truthfulqa_mc2'] (full dataset)
Few-shot config: {'truthfulqa_mc2': 0}



README.md: 0.00B [00:00, ?B/s]

multiple_choice/validation-00000-of-0000(…):   0%|          | 0.00/271k [00:00<?, ?B/s]

Generating validation split:   0%|          | 0/817 [00:00<?, ? examples/s]

100%|██████████| 817/817 [00:01<00:00, 671.97it/s]
Running loglikelihood requests: 100%|██████████| 5882/5882 [02:12<00:00, 44.41it/s]


✅ truthfulqa_mc2 completed and saved to checkpoint
   Results: {'accuracy': '0.3854', 'acc_norm': 'N/A'}

[9/12] Evaluating: global_mmlu_es
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['global_mmlu_es'] (full dataset)
Few-shot config: {'global_mmlu_es': 5}



README.md: 0.00B [00:00, ?B/s]

es/test-00000-of-00001.parquet:   0%|          | 0.00/142k [00:00<?, ?B/s]

es/dev-00000-of-00001.parquet:   0%|          | 0.00/81.6k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/400 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/215 [00:00<?, ? examples/s]

100%|██████████| 58/58 [00:00<00:00, 166.75it/s]
100%|██████████| 102/102 [00:00<00:00, 168.79it/s]
100%|██████████| 36/36 [00:00<00:00, 170.13it/s]
100%|██████████| 56/56 [00:00<00:00, 169.97it/s]
100%|██████████| 46/46 [00:00<00:00, 169.35it/s]
100%|██████████| 102/102 [00:00<00:00, 170.08it/s]
Running loglikelihood requests: 100%|██████████| 1600/1600 [00:22<00:00, 72.43it/s] 


✅ global_mmlu_es completed and saved to checkpoint
   Results: {'accuracy': '0.3550', 'acc_norm': 'N/A'}

[10/12] Evaluating: arc_es
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['arc_es'] (full dataset)
Few-shot config: {'arc_es': 0}



README.md: 0.00B [00:00, ?B/s]

data/es/train.jsonl:   0%|          | 0.00/507k [00:00<?, ?B/s]

data/es/val.jsonl:   0%|          | 0.00/141k [00:00<?, ?B/s]

data/es/test.jsonl:   0%|          | 0.00/542k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1118 [00:00<?, ? examples/s]

Generating val split:   0%|          | 0/297 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1170 [00:00<?, ? examples/s]

Map:   0%|          | 0/1118 [00:00<?, ? examples/s]

Map:   0%|          | 0/1170 [00:00<?, ? examples/s]

100%|██████████| 1170/1170 [00:00<00:00, 93332.62it/s]
Running loglikelihood requests: 100%|██████████| 4679/4679 [01:42<00:00, 45.83it/s]


✅ arc_es completed and saved to checkpoint
   Results: {'accuracy': '0.2564', 'acc_norm': '0.3000'}

[11/12] Evaluating: hellaswag_es
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['hellaswag_es'] (full dataset)
Few-shot config: {'hellaswag_es': 0}



README.md: 0.00B [00:00, ?B/s]

data/es/val.jsonl:   0%|          | 0.00/13.4M [00:00<?, ?B/s]

Generating val split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/9374 [00:00<?, ? examples/s]

100%|██████████| 9374/9374 [00:00<00:00, 86124.74it/s]
Running loglikelihood requests: 100%|██████████| 37496/37496 [14:04<00:00, 44.39it/s]


✅ hellaswag_es completed and saved to checkpoint
   Results: {'accuracy': '0.3731', 'acc_norm': '0.4731'}

[12/12] Evaluating: belebele_spa_Latn
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-1B'
Tasks: ['belebele_spa_Latn'] (full dataset)
Few-shot config: {'belebele_spa_Latn': 0}



README.md: 0.00B [00:00, ?B/s]

spa_Latn.jsonl: 0.00B [00:00, ?B/s]

Generating test split:   0%|          | 0/900 [00:00<?, ? examples/s]

100%|██████████| 900/900 [00:01<00:00, 882.72it/s]
Running loglikelihood requests: 100%|██████████| 3600/3600 [00:21<00:00, 164.99it/s]


✅ belebele_spa_Latn completed and saved to checkpoint
   Results: {'accuracy': '0.3233', 'acc_norm': '0.3233'}

🎉 ALL TASKS COMPLETED!


✅ Completed: meta-llama/Llama-3.2-1B

Results Preview:
             task word_perplexity,none byte_perplexity,none bits_per_byte,none perplexity word_perplexity bits_per_byte prompt_level_strict_acc,none prompt_level_strict_acc_stderr,none inst_level_strict_acc,none prompt_level_loose_acc,none prompt_level_loose_acc_stderr,none inst_level_loose_acc,none exact_match,strict-match exact_match_stderr,strict-match exact_match,flexible-extract exact_match_stderr,flexible-extract accuracy acc_norm
         wikitext              11.9853               1.5912             0.6701        NaN             NaN           NaN                          NaN                                 NaN                        NaN                         NaN                                NaN                       NaN                      NaN                             NaN          

config.json:   0%|          | 0.00/844 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/185 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/301 [00:00<?, ?B/s]



✅ Model loaded successfully

📈 Model Statistics:
   Parameters: 3,212,749,824
   Size: 5.98 GB

📁 Checkpoint: /content/drive/MyDrive/fair_pruning/checkpoints/3b/meta_llama_llama_3.2_3b.json

🆕 Creating new checkpoint: /content/drive/MyDrive/fair_pruning/checkpoints/3b/meta_llama_llama_3.2_3b.json

🚀 Starting evaluation: 12 tasks remaining


[1/12] Evaluating: wikitext
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['wikitext'] (full dataset)
Few-shot config: {'wikitext': 0}



100%|██████████| 62/62 [00:00<00:00, 520.50it/s]
100%|██████████| 62/62 [00:00<00:00, 72.45it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  5.08it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  1.06it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  2.24it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:01<00:00,  1.56s/it]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  2.72it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  2.29it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:02<00:00,  2.41s/it]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  2.37it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:02<00:00,  2.69s/it]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  3.40it/s]
Running loglikelihood requests: 100%|██████████| 1/1 [00:00<00:00,  1.78it/s]
Running loglikelihood requests: 100%|████████

✅ wikitext completed and saved to checkpoint
   Results: {'word_perplexity,none': '9.5372', 'byte_perplexity,none': '1.5246', 'bits_per_byte,none': '0.6084'}

[2/12] Evaluating: lambada_openai
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['lambada_openai'] (full dataset)
Few-shot config: {'lambada_openai': 0}



100%|██████████| 5153/5153 [00:09<00:00, 539.30it/s]
Running loglikelihood requests: 100%|██████████| 5153/5153 [03:20<00:00, 25.64it/s]


bootstrapping for stddev: perplexity


100%|██████████| 100/100 [00:08<00:00, 12.18it/s]


✅ lambada_openai completed and saved to checkpoint
   Results: {'perplexity': '3.88', 'word_perplexity': '0.00', 'bits_per_byte': '0.0000'}

[3/12] Evaluating: ifeval
──────────────────────────────────────────────────────────────────────

Starting lm-eval on model 'meta-llama/Llama-3.2-3B'
Tasks: ['ifeval'] (full dataset)
Few-shot config: {'ifeval': 0}



100%|██████████| 541/541 [00:00<00:00, 73379.64it/s]
Running generate_until requests:  98%|█████████▊| 530/541 [2:35:09<04:18, 23.46s/it]

KeyboardInterrupt: 

# 4. Results Consolidation

Load checkpoint files and consolidate into a single DataFrame for analysis.
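
As a rough illustration of the input this step expects, the sketch below writes a tiny checkpoint to a temporary directory and flattens it into one row per task. The layout (a `metadata.model_name` field plus a `results` mapping of task to metrics) is inferred from the keys the consolidation cell reads. The file name, the helper names `to_number` and `checkpoint_to_rows`, and the choice to map non-numeric values such as `'N/A'` to `None` are assumptions made for the example. The metric values are copied from the Llama-3.2-1B log output above.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical checkpoint mirroring the keys the consolidation cell reads:
# "metadata" -> "model_name" and "results" -> {task: {metric: value}}.
example_checkpoint = {
    "metadata": {"model_name": "meta-llama/Llama-3.2-1B"},
    "results": {
        "arc_challenge": {"accuracy": "0.3148", "acc_norm": "0.3720"},
        "mmlu": {"accuracy": "0.3143", "acc_norm": "N/A"},
    },
}


def to_number(value):
    """Return value as a float when possible, otherwise None (e.g. 'N/A')."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def checkpoint_to_rows(checkpoint):
    """Flatten one checkpoint dict into one row per task."""
    model = checkpoint.get("metadata", {}).get("model_name", "Unknown")
    rows = []
    for task, metrics in checkpoint.get("results", {}).items():
        row = {"model": model, "task": task}
        row.update({name: to_number(v) for name, v in metrics.items()})
        rows.append(row)
    return rows


# Round-trip through a real file so the sketch exercises the JSON loading path.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example_checkpoint.json"  # assumed file name
    path.write_text(json.dumps(example_checkpoint))
    rows = checkpoint_to_rows(json.loads(path.read_text()))

for row in rows:
    print(row)
```

Unlike this sketch, the cell below keeps non-numeric values as strings, which leaves the affected columns with mixed types.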

In [None]:
import glob
import json
import os

import pandas as pd

print(f"{'='*70}")
print("📊 CONSOLIDATING RESULTS")
print(f"{'='*70}\n")

# Find all checkpoint files recursively
checkpoint_files = glob.glob(f"{CHECKPOINT_BASE_DIR}/**/*.json", recursive=True)
print(f"Found {len(checkpoint_files)} checkpoint files\n")

consolidated_data = []

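# Use a dedicated loop variable so it does not collide with the `json_path`
# export path that the final file-listing cell checks for.
for checkpoint_path in sorted(checkpoint_files):
    print(f"  → Processing: {os.path.basename(checkpoint_path)}")

    try:
        with open(checkpoint_path, 'r') as f:
            data = json.load(f)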

        # Extract metadata
        metadata = data.get("metadata", {})
        model_name = metadata.get("model_name", "Unknown")
        model_size = get_model_size(model_name)

        # Extract results for each task
        results = data.get("results", {})
        if not results:
            print(f"    ⚠️ No results found, skipping")
            continue

        # Process each task
        for task_name, metrics in results.items():
            row = {
                "model": model_name,
                "model_size": model_size,
                "task": task_name
            }

            # Add all metrics
            for metric_name, value in metrics.items():
                try:
                    row[metric_name] = float(value)
                except (ValueError, TypeError):
                    row[metric_name] = value

            consolidated_data.append(row)

    except Exception as e:
        print(f"    ⚠️ Error processing file: {e}")
        continue

# Create DataFrame
df = pd.DataFrame(consolidated_data)

if not df.empty:
    df = df.sort_values(by=["model", "task"]).reset_index(drop=True)
    print(f"\n✅ Consolidated {len(df)} result rows")
    print(f"   Models: {df['model'].nunique()}")
    print(f"   Tasks: {df['task'].nunique()}")
    print(f"   Metric columns: {len(df.columns) - 3}")  # Exclude model, model_size, task

    print("\nDataFrame Preview:")
    print(df.head(15).to_string())
else:
    print("\n⚠️ No data consolidated")

In [None]:
if not df.empty:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Save detailed results CSV
    csv_path = f"{RESULTS_DIR}/base_models_results_{timestamp}.csv"
    df.to_csv(csv_path, index=False)
    print(f"\n💾 Results saved:")
    print(f"   {csv_path}")

    # Save latest version
    latest_csv = f"{RESULTS_DIR}/base_models_results_latest.csv"
    df.to_csv(latest_csv, index=False)
    print(f"   {latest_csv}")

    # Save JSON format
    json_path = f"{RESULTS_DIR}/base_models_results_{timestamp}.json"
    df.to_json(json_path, orient='records', indent=2)
    print(f"   {json_path}")

    print(f"\n✅ All results exported successfully")

# 5. Summary Analysis

Aggregate per-model statistics: mean `accuracy` and mean `perplexity` over the tasks that report them, plus task counts.
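
A per-model average can hide task-level differences, so a side-by-side table is often a useful companion to the summary. The sketch below pivots a toy long-format frame, shaped like the consolidated results, into one row per task and one column per model. The model names and accuracy values are placeholders invented for the example, not results from this notebook.

```python
import pandas as pd

# Hypothetical long-format results: one row per (model, task).
toy_df = pd.DataFrame([
    {"model": "model-a", "task": "arc_challenge", "accuracy": 0.30},
    {"model": "model-a", "task": "hellaswag", "accuracy": 0.50},
    {"model": "model-b", "task": "arc_challenge", "accuracy": 0.40},
    {"model": "model-b", "task": "hellaswag", "accuracy": 0.70},
])

# Rows become tasks, columns become models, cells hold accuracy.
per_task = toy_df.pivot_table(index="task", columns="model", values="accuracy")

print(per_task.round(4).to_string())
```

With the real data, `df` from the consolidation step would take the place of `toy_df`. Note that `pivot_table` averages duplicate (task, model) rows, which should not occur if each checkpoint stores one entry per task.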

In [None]:
if not df.empty:
    print(f"{'='*70}")
    print("📈 SUMMARY STATISTICS")
    print(f"{'='*70}\n")

    summary = []
    for model_name, model_df in df.groupby('model'):
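        # Calculate aggregated metrics. A metric column is absent when no loaded
        # checkpoint reported it, so fall back to an empty Series instead of
        # raising KeyError; non-numeric entries such as 'N/A' are dropped.
        acc = pd.to_numeric(model_df.get('accuracy', pd.Series(dtype=float)), errors='coerce').dropna()
        ppl = pd.to_numeric(model_df.get('perplexity', pd.Series(dtype=float)), errors='coerce').dropna()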

        # Get model metadata (use first row since all rows have same metadata)
        model_size = model_df['model_size'].iloc[0] if 'model_size' in model_df.columns else get_model_size(model_name)

        summary.append({
            "model": model_name,
            "model_size": model_size,
            "avg_accuracy": acc.mean() if len(acc) > 0 else None,
            "avg_perplexity": ppl.mean() if len(ppl) > 0 else None,
            "tasks_completed": len(model_df),
            "tasks_with_accuracy": len(acc),
            "tasks_with_perplexity": len(ppl)
        })

    summary_df = pd.DataFrame(summary)
    summary_df = summary_df.sort_values("model").reset_index(drop=True)

    print(summary_df.to_string(index=False, float_format="%.4f"))

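    # Save summary; reuse the export timestamp if the export cell ran,
    # otherwise create one so this cell also works on its own
    if 'timestamp' not in globals():
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    summary_csv = f"{RESULTS_DIR}/base_models_summary_{timestamp}.csv"
    summary_df.to_csv(summary_csv, index=False)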

    print(f"\n💾 Summary saved: {summary_csv}")
    print(f"\n{'='*70}")

# 6. Evaluation Complete

## Summary

Baseline performance metrics established for the Fairness Pruning project.

**Generated Files:**
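- `base_models_results_latest.csv` - Full evaluation results (overwritten on each run)
- `base_models_results_YYYYMMDD_HHMMSS.csv` - Timestamped copy of the full results
- `base_models_results_YYYYMMDD_HHMMSS.json` - Same results exported as JSON records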
- `base_models_summary_YYYYMMDD_HHMMSS.csv` - Summary metrics
- Individual checkpoint JSONs per model (in subdirectories by size)

**Next Steps:**
1. Use these baselines as reference for bias mitigation experiments
2. Identify high-variance tasks that may be sensitive to interventions
3. Proceed to bias detection and pruning notebooks
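
One way the baselines might serve as a reference in later notebooks is sketched below: merge baseline and pruned accuracies per task, compute absolute and relative change, and flag tasks that drop by more than a tolerance. Everything here is illustrative. The task names, the accuracy values, and the 5% threshold `MAX_DROP_PCT` are assumptions, and no pruned results exist in this notebook.

```python
import pandas as pd

# Hypothetical numbers for illustration only; not results from this notebook.
baseline = pd.DataFrame({
    "task": ["task_a", "task_b"],
    "accuracy": [0.50, 0.40],
})
pruned = pd.DataFrame({
    "task": ["task_a", "task_b"],
    "accuracy": [0.45, 0.40],
})

# Align the two runs on task name; suffixes keep both accuracy columns.
comparison = baseline.merge(pruned, on="task", suffixes=("_base", "_pruned"))
comparison["abs_change"] = comparison["accuracy_pruned"] - comparison["accuracy_base"]
comparison["rel_change_pct"] = 100 * comparison["abs_change"] / comparison["accuracy_base"]

# Assumed tolerance: flag tasks losing more than 5% of baseline accuracy.
MAX_DROP_PCT = 5.0
comparison["flagged"] = comparison["rel_change_pct"] < -MAX_DROP_PCT

print(comparison.round(4).to_string(index=False))
```

In practice the baseline frame could be loaded from `base_models_results_latest.csv`, filtered to one model, before merging with the pruned model's results.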

---

**Powered by OptiPFair** - Activation-Guided MLP Width Pruning for Bias Mitigation

If this research helps your work:
- ⭐ Star [the repo](https://github.com/peremartra/optipfair)
- 📖 Read the [documentation](https://peremartra.github.io/optipfair/)
- 🐛 Report issues or suggest features

---

In [None]:
print(f"{'='*70}")
print("📁 GENERATED FILES")
print(f"{'='*70}\n")

print("Results:")
if 'csv_path' in locals() and os.path.exists(csv_path):
    print(f"  ✅ {csv_path}")
if 'latest_csv' in locals() and os.path.exists(latest_csv):
    print(f"  ✅ {latest_csv}")
if 'json_path' in locals() and os.path.exists(json_path):
    print(f"  ✅ {json_path}")
if 'summary_csv' in locals() and os.path.exists(summary_csv):
    print(f"  ✅ {summary_csv}")

print("\nCheckpoints:")
if 'checkpoint_files' in locals():
    for f in sorted(checkpoint_files)[:10]:  # Show first 10
        print(f"  ✅ {f}")
    if len(checkpoint_files) > 10:
        print(f"  ... and {len(checkpoint_files) - 10} more")

print(f"\n{'='*70}")
print("✅ EVALUATION COMPLETE")
print(f"{'='*70}")