<a href="https://colab.research.google.com/github/peremartra/fairness-pruning/blob/main/notebooks/02_Evaluate_Base_Capabilities.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fairness Pruning Research - Base Model Evaluation
## 02 - Comprehensive Benchmark Suite for Unpruned Models

### Establishing Performance Baselines for Bias Mitigation Research
by [Pere Martra](https://github.com/peremartra)

[![GitHub](https://img.shields.io/badge/⭐_Star-OptiPFair-orange?logo=github&logoColor=white)](https://github.com/peremartra/optipfair)
[![PyPI](https://img.shields.io/pypi/v/optipfair?logo=python&logoColor=white&label=v)](https://pypi.org/project/optipfair/)

**Repository:** [github.com/peremartra/fairness-pruning](https://github.com/peremartra/fairness-pruning)

---

**Colab Environment:** GPU L4 or A100

**Models to Evaluate:**
* Llama-3.2-1B (base)
* Llama-3.2-3B (base)
* Additional models defined in `EXPERIMENT_CONFIG`

**Benchmarks (15 total):**
* English: MMLU, HellaSwag, BoolQ, ARC-Challenge, WinoGrande, PIQA, TruthfulQA, GSM8K, IFEval, MUSR
* Spanish: Belebele, XCOPA, MMLU-ES
* Language Modeling: WikiText, Lambada-OpenAI

**Estimated Runtime:** ~3-4 hours (varies by number of models)

---

## 📋 Objective

Establish **performance baselines** for the Fairness Pruning project by evaluating unpruned base models.

**Purpose:**
1. Measure baseline performance before bias mitigation interventions
2. Create reference metrics for future pruned model comparisons
3. Validate benchmark configurations across different architectures
4. Capture cross-lingual performance (English + Spanish)

**Features:**
- ✅ Checkpoint/Resume Support (survives Colab disconnections)
- ✅ Multi-Model Support (generic, not 1B-specific)
- ✅ Robust Error Handling (continues on task failures)
- ✅ Automated Path Management (no manual configuration needed)

**Note:** This notebook evaluates ONLY base models (no pruning applied). For bias mitigation experiments with pruned models, see subsequent notebooks.

---

# 1. Setup & Installation

In [1]:
# Install required libraries
!pip install -q optipfair
!pip install -q lm-eval
!pip install -q langdetect

In [2]:
# Mount Google Drive for checkpoint persistence
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Download utils.py from GitHub repository
!wget -q https://raw.githubusercontent.com/peremartra/fairness-pruning/main/utils.py

# Verify download
import os
if os.path.exists('utils.py'):
    print("✅ utils.py downloaded successfully")
else:
    print("❌ Failed to download utils.py")

✅ utils.py downloaded successfully


In [4]:
!wget -q https://raw.githubusercontent.com/peremartra/fairness-pruning/main/veritas_qa_ca.yaml
!wget -q https://raw.githubusercontent.com/peremartra/fairness-pruning/main/veritas_qa_es.yaml

In [5]:
import os
import shutil
import lm_eval

print(f"{'='*50}")
print("📦 INSTALLING VERITAS QA (ES/CA)")
print(f"{'='*50}")

lib_path = os.path.dirname(lm_eval.__file__)
target_dir = os.path.join(lib_path, "tasks", "veritas_qa")

os.makedirs(target_dir, exist_ok=True)
print(f"📍 TASKS DIRECTORY: {target_dir}")

for lang in ["es", "ca"]:
    filename = f"veritas_qa_{lang}.yaml"

    if os.path.exists(filename):
        dst = os.path.join(target_dir, filename)
        shutil.move(filename, dst)
        print(f"   ✅ {filename} -> Installation OK")
    else:
        print(f"   ❌ Error: {filename} not downloaded.")

print("\n🚀 OK!")

📦 INSTALLING VERITAS QA (ES/CA)
📍 TASKS DIRECTORY: /usr/local/lib/python3.12/dist-packages/lm_eval/tasks/veritas_qa
   ✅ veritas_qa_es.yaml -> Installation OK
   ✅ veritas_qa_ca.yaml -> Installation OK

🚀 OK!


In [6]:
# Import core libraries and utilities
import torch
import json
import pandas as pd
from datetime import datetime
from pathlib import Path
import logging

# Import our utility functions
from utils import (
    EXPERIMENT_CONFIG,
    BENCHMARKS_BASE,
    load_or_create_model,
    run_robust_evaluation,
    clear_gpu_cache,
    get_model_stats,
    format_results_table
)

logging.getLogger("lm_eval").setLevel(logging.INFO)

print("✅ All imports successful")
print(f"📱 Device: {'GPU' if torch.cuda.is_available() else 'CPU'}")
if torch.cuda.is_available():
    print(f"   GPU: {torch.cuda.get_device_name(0)}")
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

✅ All imports successful
📱 Device: GPU
   GPU: NVIDIA L4
   Memory: 23.8 GB


# 1. Helper Functions

Utility functions for automatic checkpoint path generation and model size detection.

In [7]:
import re
import os

def get_model_size(model_name: str) -> str:
    """Extract model size identifier from HuggingFace model name.

    Examples:
        "meta-llama/Llama-3.2-1B" → "1b"
        "meta-llama/Llama-3.2-3B-Instruct" → "3b_instruct"
        "BSC-LT/salamandra-2b" → "2b"
    """
    match = re.search(r'(\d+\.?\d*)[Bb]', model_name)
    if not match:
        return "unknown"

    size = match.group(1).replace('.', '_') + "b"
    if "instruct" in model_name.lower():
        size += "_instruct"

    return size.lower()

def get_checkpoint_path(model_name: str, base_dir: str) -> str:
    """Generate checkpoint path with size-based subdirectory.

    Args:
        model_name: Full HuggingFace model identifier
        base_dir: Base directory for checkpoints

    Returns:
        Full path to checkpoint file
    """
    model_size = get_model_size(model_name)
    safe_name = model_name.replace('/', '_').replace('-', '_').lower()
    checkpoint_dir = os.path.join(base_dir, model_size)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return os.path.join(checkpoint_dir, f"{safe_name}.json")

# Test with EXPERIMENT_CONFIG
print("Testing helper functions with EXPERIMENT_CONFIG:")
print("-" * 70)
for cfg in EXPERIMENT_CONFIG:
    model_id = cfg['base_model']
    size = get_model_size(model_id)
    print(f"{model_id:<50} → {size}")
print("-" * 70)

Testing helper functions with EXPERIMENT_CONFIG:
----------------------------------------------------------------------
BSC-LT/salamandra-2b                               → 2b
meta-llama/Llama-3.2-1B                            → 1b
meta-llama/Llama-3.2-3B                            → 3b
----------------------------------------------------------------------
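
As a quick sanity check, the snippet below restates the size regex on its own and runs it on a few names. Only the first three identifiers appear in this notebook; the `example-org/...` names are hypothetical inputs chosen to exercise a decimal size and the no-match fallback.

```python
import re

def get_model_size(model_name: str) -> str:
    """Same logic as the helper above, restated so this snippet runs standalone."""
    match = re.search(r'(\d+\.?\d*)[Bb]', model_name)
    if not match:
        return "unknown"
    size = match.group(1).replace('.', '_') + "b"
    if "instruct" in model_name.lower():
        size += "_instruct"
    return size.lower()

examples = [
    "meta-llama/Llama-3.2-1B",
    "meta-llama/Llama-3.2-3B-Instruct",
    "BSC-LT/salamandra-2b",
    "example-org/demo-0.5B",   # hypothetical: decimal size
    "example-org/demo-tiny",   # hypothetical: no size in the name
]
sizes = {name: get_model_size(name) for name in examples}
for name, size in sizes.items():
    print(f"{name:<40} {size}")
```

One limitation worth keeping in mind: the regex returns the first digit run that is directly followed by `b` or `B`, so a name containing something like `v2beta` before the real size would be reported as `2b`.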


# 2. Configuration & Evaluation Plan

This section prepares the evaluation for all models defined in `EXPERIMENT_CONFIG`.

In [8]:
# Directory setup
CHECKPOINT_BASE_DIR = "/content/drive/MyDrive/fair_pruning/checkpoints"
RESULTS_DIR = "/content/drive/MyDrive/fair_pruning/results"
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)

# De-duplicate models from EXPERIMENT_CONFIG
unique_models = list(dict.fromkeys([cfg["base_model"] for cfg in EXPERIMENT_CONFIG]))

print(f"{'='*70}")
print("📊 EVALUATION PLAN: Base Model Benchmarking")
print(f"{'='*70}\n")
print(f"Models to evaluate: {len(unique_models)}")
print(f"Benchmarks per model: {len(BENCHMARKS_BASE)}")
print(f"Total evaluations: {len(unique_models) * len(BENCHMARKS_BASE)}")
print(f"Estimated time: ~{len(unique_models) * 1.5:.1f} hours\n")

# Display models with checkpoint status
print("Models to evaluate:")
print("-" * 70)
print(f"{'Model ID':<50} {'Size':<10} {'Status'}")
print("-" * 70)
for model_id in unique_models:
    size = get_model_size(model_id)
    cp_path = get_checkpoint_path(model_id, CHECKPOINT_BASE_DIR)
    exists = "✅ Exists" if Path(cp_path).exists() else "🆕 New"
    print(f"{model_id:<50} {size:<10} {exists}")
print("-" * 70)

# Display benchmarks
print("\nBenchmarks:")
print("-" * 70)
for i, task in enumerate(BENCHMARKS_BASE, 1):
    fewshot_str = f"{task['num_fewshot']}-shot"
    print(f"{i:2d}. {task['name']:<30} {fewshot_str}")
print("-" * 70)

print(f"\n⚙️  Configuration:")
print(f"   - Checkpointing: Enabled (per-task granularity)")
print(f"   - Auto-resume: Yes (survives disconnections)")
print(f"   - Error handling: Skip failed tasks, continue evaluation")
print(f"   - Device: {'GPU' if torch.cuda.is_available() else 'CPU'}\n")

📊 EVALUATION PLAN: Base Model Benchmarking

Models to evaluate: 3
Benchmarks per model: 14
Total evaluations: 42
Estimated time: ~4.5 hours

Models to evaluate:
----------------------------------------------------------------------
Model ID                                           Size       Status
----------------------------------------------------------------------
BSC-LT/salamandra-2b                               2b         ✅ Exists
meta-llama/Llama-3.2-1B                            1b         ✅ Exists
meta-llama/Llama-3.2-3B                            3b         ✅ Exists
----------------------------------------------------------------------

Benchmarks:
----------------------------------------------------------------------
 1. wikitext                       0-shot
 2. lambada_openai                 0-shot
 3. ifeval                         0-shot
 4. gsm8k                          5-shot
 5. mmlu                           5-shot
 6. arc_challenge                  0-shot

# 3. Base Model Evaluation

Evaluates each base model across all benchmarks with checkpoint/resume support.

**Process:**
1. Load model directly from HuggingFace Hub (no pruning applied)
2. Calculate model statistics (parameters, size)
3. Run evaluation with checkpoint system (saves progress after each task)
4. Clear GPU memory before next model
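
The internals of `run_robust_evaluation` live in `utils.py` and are not shown in this notebook. As a rough mental model only, per-task checkpointing with resume can be sketched as below. The function name `run_with_checkpoint`, the `evaluate_task` callback, and the `failed` key are assumptions made for illustration; only the `metadata` and `results` keys are taken from what the consolidation cell later reads.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List

def run_with_checkpoint(
    tasks: List[dict],
    checkpoint_path: str,
    model_name: str,
    evaluate_task: Callable[[dict], dict],  # hypothetical per-task evaluator
) -> Dict[str, dict]:
    """Evaluate tasks one by one, saving after each so a rerun can resume."""
    state = {"metadata": {"model_name": model_name}, "results": {}, "failed": {}}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state.update(json.load(f))

    for task in tasks:
        name = task["name"]
        if name in state["results"]:
            continue  # already finished in an earlier session
        try:
            state["results"][name] = evaluate_task(task)
            state["failed"].pop(name, None)
        except Exception as exc:  # record the failure and move on
            state["failed"][name] = str(exc)
        # Write to a temporary file, then rename, so an interruption
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_path, checkpoint_path)

    return state["results"]

# Usage with a stand-in evaluator that returns a placeholder metric.
calls = []
def fake_eval(task: dict) -> dict:
    calls.append(task["name"])
    return {"accuracy": 0.5}  # placeholder value, not a real score

path = os.path.join(tempfile.mkdtemp(), "demo.json")
tasks = [{"name": "task_a", "num_fewshot": 0}, {"name": "task_b", "num_fewshot": 5}]
first = run_with_checkpoint(tasks, path, "demo/model", fake_eval)
second = run_with_checkpoint(tasks, path, "demo/model", fake_eval)  # resumes, runs nothing
print(calls)
```

In this sketch the second call finds both tasks in the checkpoint and skips them, which is the behaviour that lets a Colab session pick up where a disconnected one stopped.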

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f"\n{'='*70}")
print("🚀 STARTING EVALUATION")
print(f"{'='*70}\n")

all_model_results = {}

for i, model_id in enumerate(unique_models, 1):
    print(f"\n{'='*70}")
    print(f"📊 MODEL {i}/{len(unique_models)}: {model_id}")
    print(f"{'='*70}\n")

    try:
        # 1. Load model from HuggingFace Hub (NO pruning)
        print(f"Loading from HuggingFace Hub...")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,  # bfloat16 runs on A100 and L4; switch to float16 on T4
            device_map="auto"
        )

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        print("✅ Model loaded successfully\n")

        # 2. Display model statistics
        stats = get_model_stats(model)
        print(f"📈 Model Statistics:")
        print(f"   Parameters: {stats['total_parameters']:,}")
        print(f"   Size: {stats['size_gb']:.2f} GB\n")

        # 3. Generate checkpoint path automatically
        checkpoint_path = get_checkpoint_path(model_id, CHECKPOINT_BASE_DIR)
        print(f"📁 Checkpoint: {checkpoint_path}\n")

        # 4. Run evaluation with checkpoint/resume support
        results = run_robust_evaluation(
            model=model,
            tokenizer=tokenizer,
            tasks=BENCHMARKS_BASE,
            checkpoint_path=checkpoint_path,
            model_name=model_id,
        )

        all_model_results[model_id] = results

        print(f"\n✅ Completed: {model_id}")
        print("\nResults Preview:")
        print(format_results_table(results))

        # 5. Cleanup memory before next model
        del model, tokenizer
        clear_gpu_cache()

    except Exception as e:
        print(f"\n❌ ERROR evaluating {model_id}: {str(e)}")

        # Check for common issues
        if "401" in str(e) or "403" in str(e):
            print("   → Authentication required. Run: huggingface-cli login")
        elif "CUDA out of memory" in str(e):
            print("   → GPU OOM. Try reducing batch size or using a smaller model")

        print("   → Continuing with next model...\n")

        # Cleanup and continue
        if 'model' in locals():
            del model
        if 'tokenizer' in locals():
            del tokenizer
        clear_gpu_cache()
        continue

print(f"\n{'='*70}")
print(f"✅ EVALUATION COMPLETE: {len(all_model_results)}/{len(unique_models)} models")
print(f"{'='*70}\n")

# 4. Results Consolidation

Load checkpoint files and consolidate into a single DataFrame for analysis.

In [10]:
import glob
import json
import pandas as pd
import os

# --- 1. Helper to flatten nested metrics (e.g. MMLU subcategories) ---
def flatten_metrics(metrics, prefix=''):
    flat = {}
    for k, v in metrics.items():
        if isinstance(v, dict):
            flat.update(flatten_metrics(v, prefix=f"{prefix}{k}_"))
        else:
            flat[f"{prefix}{k}"] = v
    return flat

# --- 2. Configuration ---
print(f"{'='*70}")
print(" CONSOLIDATING RESULTS (FILTERED)")
print(f"{'='*70}\n")

# Find all JSON files recursively
checkpoint_files = glob.glob(f"{CHECKPOINT_BASE_DIR}/**/*.json", recursive=True)
print(f"Total JSON files found: {len(checkpoint_files)}")

consolidated_data = []

for json_path in sorted(checkpoint_files):
    # Key filter: ignore the per-task files in the temporary lm_evals folder
    if "lm_evals" in json_path:
        continue  # skip this file silently

    print(f" Processing: {os.path.basename(json_path)}")

    try:
        with open(json_path, 'r') as f:
            data = json.load(f)

        # Extract metadata
        metadata = data.get("metadata", {})
        model_name = metadata.get("model_name", "Unknown")

        # Determine the model size (simple substring fallback)
        if "1b" in model_name.lower(): model_size = "1b"
        elif "2b" in model_name.lower(): model_size = "2b"
        elif "3b" in model_name.lower(): model_size = "3b"
        else: model_size = "unknown"

        results = data.get("results", {})

        if not results:
            print("   -> No results, skipping.")
            continue

        # Process each task
        for task_name, metrics in results.items():
            row = {
                "model": model_name,
                "model_size": model_size,
                "task": task_name
            }
            # Flatten nested metrics so every value gets its own column
            row.update(flatten_metrics(metrics))
            consolidated_data.append(row)

    except Exception as e:
        print(f"   -> Error reading {json_path}: {e}")

# --- 3. Build the final DataFrame ---
if consolidated_data:
    df = pd.DataFrame(consolidated_data)
    # Sort by model, then task
    df = df.sort_values(by=["model", "task"]).reset_index(drop=True)

    print(f"\n✅ SUCCESS: consolidated {len(df)} rows.")
    print(f"Unique models: {df['model'].unique()}")

    # Show a preview
    display(df.head())

    # Save a timestamped CSV in RESULTS_DIR
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    df.to_csv(f"{RESULTS_DIR}/base_models_results_{timestamp}.csv", index=False)
else:
    df = pd.DataFrame()  # keep df defined so the cells below do not raise NameError
    print("\n⚠️ No valid data found after filtering.")

 CONSOLIDATING RESULTS (FILTERED)

Total JSON files found: 10
 Processing: meta_llama_llama_3.2_1b.json
 Processing: bsc_lt_salamandra_2b.json
 Processing: meta_llama_llama_3.2_3b.json

✅ SUCCESS: consolidated 42 rows.
Unique models: ['BSC-LT/salamandra-2b' 'meta-llama/Llama-3.2-1B'
 'meta-llama/Llama-3.2-3B']


Unnamed: 0,model,model_size,task,"word_perplexity,none","byte_perplexity,none","bits_per_byte,none",perplexity,word_perplexity,bits_per_byte,"prompt_level_strict_acc,none",...,subcategories_elementary_mathematics,subcategories_high_school_biology,subcategories_high_school_chemistry,subcategories_high_school_computer_science,subcategories_high_school_mathematics,subcategories_high_school_physics,subcategories_high_school_statistics,subcategories_machine_learning,subcategories_business,subcategories_medical
0,BSC-LT/salamandra-2b,2b,arc_challenge,,,,,,,,...,,,,,,,,,,
1,BSC-LT/salamandra-2b,2b,arc_es,,,,,,,,...,,,,,,,,,,
2,BSC-LT/salamandra-2b,2b,belebele_spa_Latn,,,,,,,,...,,,,,,,,,,
3,BSC-LT/salamandra-2b,2b,global_mmlu_es,,,,,,,,...,,,,,,,,,0.3276,0.25
4,BSC-LT/salamandra-2b,2b,gsm8k,,,,,,,,...,,,,,,,,,,
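
To make the flattening step concrete, here is a small standalone run of the same recursive helper on a nested result. The input dictionary is invented for illustration; it is only shaped like a task that reports per-subcategory scores, which is where column names such as `subcategories_business` in the preview come from.

```python
def flatten_metrics(metrics, prefix=''):
    """Recursively flatten nested dicts, joining key levels with underscores."""
    flat = {}
    for k, v in metrics.items():
        if isinstance(v, dict):
            flat.update(flatten_metrics(v, prefix=f"{prefix}{k}_"))
        else:
            flat[f"{prefix}{k}"] = v
    return flat

# Placeholder values, not results from this run.
nested = {
    "accuracy": 0.30,
    "subcategories": {"business": 0.33, "medical": 0.25},
}
flat = flatten_metrics(nested)
print(flat)
# {'accuracy': 0.3, 'subcategories_business': 0.33, 'subcategories_medical': 0.25}
```

Because each task contributes a different set of keys, most cells in the consolidated DataFrame are empty (`NaN`) for any given row, as the preview shows.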


In [11]:
if not df.empty:
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Save detailed results CSV
    csv_path = f"{RESULTS_DIR}/base_models_results_{timestamp}.csv"
    df.to_csv(csv_path, index=False)
    print(f"\n💾 Results saved:")
    print(f"   {csv_path}")

    # Save latest version
    latest_csv = f"{RESULTS_DIR}/base_models_results_latest.csv"
    df.to_csv(latest_csv, index=False)
    print(f"   {latest_csv}")

    # Save JSON format
    json_path = f"{RESULTS_DIR}/base_models_results_{timestamp}.json"
    df.to_json(json_path, orient='records', indent=2)
    print(f"   {json_path}")

    print(f"\n✅ All results exported successfully")


💾 Results saved:
   /content/drive/MyDrive/fair_pruning/results/base_models_results_20251206_150732.csv
   /content/drive/MyDrive/fair_pruning/results/base_models_results_latest.csv
   /content/drive/MyDrive/fair_pruning/results/base_models_results_20251206_150732.json

✅ All results exported successfully


# 5. Summary Analysis

Generate summary statistics comparing models.
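
The summary averages only the rows that actually report a given metric. A minimal sketch of that pattern on a toy frame with invented values shows how `pd.to_numeric(..., errors="coerce")` turns non-numeric entries into `NaN`, so they can be dropped before averaging. The column and model names here are placeholders.

```python
import pandas as pd

# Toy data with deliberately mixed types in the metric column.
toy = pd.DataFrame({
    "model": ["m1", "m1", "m1", "m2", "m2"],
    "task": ["t1", "t2", "t3", "t1", "t2"],
    "accuracy": [0.5, "n/a", None, "0.25", 0.75],
})

toy_summary = {}
for model_name, model_df in toy.groupby("model"):
    # Strings that parse as numbers are converted; everything else becomes NaN.
    acc = pd.to_numeric(model_df["accuracy"], errors="coerce").dropna()
    toy_summary[model_name] = {
        "avg_accuracy": acc.mean() if len(acc) > 0 else None,
        "tasks_with_accuracy": len(acc),
    }

print(toy_summary)
```

Note that an average taken this way mixes tasks with different scales and chance levels, so it is best read as a coarse indicator rather than a single quality score.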

In [12]:
# --- Cell 5: Summary Analysis ---
if not df.empty:
    print(f"{'='*70}")
    print("📈 SUMMARY STATISTICS")
    print(f"{'='*70}\n")

    summary = []
    # Group by model
    for model_name, model_df in df.groupby('model'):

        # Convert to numeric explicitly; errors='coerce' turns any
        # non-numeric text into NaN
        acc_series = pd.to_numeric(model_df['accuracy'], errors='coerce')
        ppl_series = pd.to_numeric(model_df['perplexity'], errors='coerce')

        # Now dropna() is safe
        acc = acc_series.dropna()
        ppl = ppl_series.dropna()
        # -----------------------------------------------------------

        # Get metadata (robust to a missing 'model_size' column)
        if 'model_size' in model_df.columns:
            model_size = model_df['model_size'].iloc[0]
        else:
            # Simple fallback if the column does not exist
            model_size = "unknown"
            if "1b" in model_name.lower(): model_size = "1b"
            elif "2b" in model_name.lower(): model_size = "2b"
            elif "3b" in model_name.lower(): model_size = "3b"

        summary.append({
            "model": model_name,
            "model_size": model_size,
            "avg_accuracy": acc.mean() if len(acc) > 0 else None,
            "avg_perplexity": ppl.mean() if len(ppl) > 0 else None,
            "tasks_completed": len(model_df),
            "tasks_with_accuracy": len(acc),
            "tasks_with_perplexity": len(ppl)
        })

    summary_df = pd.DataFrame(summary)

    if not summary_df.empty:
        summary_df = summary_df.sort_values("model").reset_index(drop=True)
        # Clean formatting for the table
        print(summary_df.to_string(index=False, float_format="%.4f"))

        # Save summary
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        summary_csv = f"{RESULTS_DIR}/base_models_summary_{timestamp}.csv"
        summary_df.to_csv(summary_csv, index=False)
        print(f"\n💾 Summary saved: {summary_csv}")
    else:
        print("⚠️ Could not generate the summary (insufficient data).")

    print(f"\n{'='*70}")
else:
    print("DataFrame is empty. No statistics to compute.")

📈 SUMMARY STATISTICS

                  model model_size  avg_accuracy  avg_perplexity  tasks_completed  tasks_with_accuracy  tasks_with_perplexity
   BSC-LT/salamandra-2b         2b        0.2919          7.2700               14                   10                      1
meta-llama/Llama-3.2-1B         1b        0.3136          5.4300               14                   10                      1
meta-llama/Llama-3.2-3B         3b        0.4161          3.8800               14                   10                      1

💾 Summary saved: /content/drive/MyDrive/fair_pruning/results/base_models_summary_20251206_150732.csv



# 6. Evaluation Complete

## Summary

Baseline performance metrics established for the Fairness Pruning project.

**Generated Files:**
- `base_models_results_latest.csv` - Full evaluation results
- `base_models_results_YYYYMMDD_HHMMSS.json` - Structured export
- `base_models_summary_YYYYMMDD_HHMMSS.csv` - Summary metrics
- Individual checkpoint JSONs per model (in subdirectories by size)

**Next Steps:**
1. Use these baselines as reference for bias mitigation experiments
2. Identify high-variance tasks that may be sensitive to interventions
3. Proceed to bias detection and pruning notebooks
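
One possible reading of step 2 is to rank tasks by how much their scores spread across models. The sketch below uses only the standard library and invented scores. The model names, task names, and numbers are placeholders rather than results from this run, and standard deviation across models is just one of several reasonable spread measures.

```python
from collections import defaultdict
from statistics import pstdev

# (model, task, accuracy) rows with placeholder values for illustration only.
rows = [
    ("model_a", "task_x", 0.40), ("model_b", "task_x", 0.42), ("model_c", "task_x", 0.41),
    ("model_a", "task_y", 0.20), ("model_b", "task_y", 0.50), ("model_c", "task_y", 0.80),
]

scores_by_task = defaultdict(list)
for _model, task, acc in rows:
    scores_by_task[task].append(acc)

# Population standard deviation of each task's scores across models, largest first.
spread = sorted(
    ((task, pstdev(vals)) for task, vals in scores_by_task.items()),
    key=lambda item: item[1],
    reverse=True,
)
for task, sd in spread:
    print(f"{task}: std={sd:.3f}")
```

With real data, the same grouping could be fed from the `task` and `accuracy` columns of the consolidated results, provided that only comparable metrics are pooled together.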

---

**Powered by OptiPFair** - Activation-Guided MLP Width Pruning for Bias Mitigation

If this research helps your work:
- ⭐ Star [the repo](https://github.com/peremartra/optipfair)
- 📖 Read the [documentation](https://peremartra.github.io/optipfair/)
- 🐛 Report issues or suggest features

---

In [13]:
print(f"{'='*70}")
print("📁 GENERATED FILES")
print(f"{'='*70}\n")

print("Results:")
if 'csv_path' in locals() and os.path.exists(csv_path):
    print(f"  ✅ {csv_path}")
if 'latest_csv' in locals() and os.path.exists(latest_csv):
    print(f"  ✅ {latest_csv}")
if 'json_path' in locals() and os.path.exists(json_path):
    print(f"  ✅ {json_path}")
if 'summary_csv' in locals() and os.path.exists(summary_csv):
    print(f"  ✅ {summary_csv}")

print("\nCheckpoints:")
if 'checkpoint_files' in locals():
    for f in sorted(checkpoint_files)[:10]:  # Show first 10
        print(f"  ✅ {f}")
    if len(checkpoint_files) > 10:
        print(f"  ... and {len(checkpoint_files) - 10} more")

print(f"\n{'='*70}")
print("✅ EVALUATION COMPLETE")
print(f"{'='*70}")

📁 GENERATED FILES

Results:
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_results_20251206_150732.csv
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_results_latest.csv
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_results_20251206_150732.json
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_summary_20251206_150732.csv

Checkpoints:
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints/1b/meta_llama_llama_3.2_1b.json
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints/2b/bsc_lt_salamandra_2b.json
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints/3b/meta_llama_llama_3.2_3b.json
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints/results/lm_evals/bsc_lt_salamandra_2b_veritas_qa_ca.json
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints/results/lm_evals/bsc_lt_salamandra_2b_veritas_qa_es.json
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints/results/lm_evals/meta_llama_llama_3.2_1b_truthfulqa_mc2.json
  ✅ /cont