<a href="https://colab.research.google.com/github/peremartra/fairness-pruning/blob/main/notebooks/02_Evaluate_MBBQ.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fairness Pruning Research - MBBQ (EsBBQ) Evaluation
## 02 - Multilingual BBQ Benchmark for Spanish Bias Detection

### Establishing Bias Performance using EsBBQ (MBBQ) for Spanish Bias Mitigation Research
by [Pere Martra](https://github.com/peremartra)

[![GitHub](https://img.shields.io/badge/⭐_Star-OptiPFair-orange?logo=github&logoColor=white)](https://github.com/peremartra/optipfair)
[![PyPI](https://img.shields.io/pypi/v/optipfair?logo=python&logoColor=white&label=v)](https://pypi.org/project/optipfair/)

**Repository:** [github.com/peremartra/fairness-pruning](https://github.com/peremartra/fairness-pruning)

---

**Colab Environment:** GPU L4 or A100

**Models to Evaluate:**
* Llama-3.2-1B (base)
* Llama-3.2-3B (base)
* Salamandra-2B (base)

**EsBBQ Categories:**
* Age, Disability Status, Gender, LGBTQIA+, Nationality
* Physical Appearance, Race/Ethnicity, Religion, SES, Spanish Region

---

**Note:** This notebook evaluates ONLY base models (no pruning applied) on Spanish bias benchmarks. For English BBQ evaluation, see `02_Evaluate_BBQ.ipynb`.

---

# 1. Setup & Installation

In [1]:
# Install required libraries
!pip install -q optipfair
!pip install -q lm-eval
!pip install -q langdetect

In [2]:
# Mount Google Drive for checkpoint persistence
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Download utils.py from GitHub repository
!wget -q https://raw.githubusercontent.com/peremartra/fairness-pruning/main/utils.py

# Verify download
import os
if os.path.exists('utils.py'):
    print("✅ utils.py downloaded successfully")
else:
    print("❌ Failed to download utils.py")

✅ utils.py downloaded successfully


In [4]:
# Download EsBBQ task YAML files from GitHub repository
import os
import shutil
os.makedirs('custom_tasks/esbbq', exist_ok=True)

# Check if we are in the repo and can use local clean files
local_source = '../custom_tasks/esbbq'
files_to_copy = [
    'esbbq_age.yaml',
    'esbbq_disabilitystatus.yaml',
    'esbbq_gender.yaml',
    'esbbq_lgbtqia.yaml',
    'esbbq_nationality.yaml',
    'esbbq_physicalappearance.yaml',
    'esbbq_raceethnicity.yaml',
    'esbbq_religion.yaml',
    'esbbq_ses.yaml',
    'esbbq_spanishregion.yaml'
]

used_local = False
if os.path.exists(local_source):
    print("📂 Found local custom_tasks in ../custom_tasks. Copying to notebook dir...")
    copied = 0
    for fname in files_to_copy:
        src = os.path.join(local_source, fname)
        dst = os.path.join('custom_tasks/esbbq', fname)
        if os.path.exists(src):
            shutil.copy(src, dst)
            copied += 1
    # Fall back to the GitHub download unless every expected file was copied
    used_local = copied == len(files_to_copy)

if not used_local:
    print("🌐 Local files not found. Downloading from GitHub...")
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_age.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_disabilitystatus.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_gender.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_lgbtqia.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_nationality.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_physicalappearance.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_raceethnicity.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_religion.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_ses.yaml
    !wget -q -P custom_tasks/esbbq https://raw.githubusercontent.com/peremartra/fairness-pruning/main/custom_tasks/esbbq/esbbq_spanishregion.yaml

# Verify downloads (or copies)
yaml_count = len([f for f in os.listdir('custom_tasks/esbbq') if f.endswith('.yaml')])
if used_local:
    print(f"✅ Copied {yaml_count} EsBBQ YAML files from local repo.")
else:
    print(f"✅ Downloaded {yaml_count} EsBBQ YAML files from GitHub.")


🌐 Local files not found. Downloading from GitHub...
✅ Downloaded 10 EsBBQ YAML files from GitHub.
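
The ten `wget` lines in the cell above differ only in the file name, so the URL list can be derived from the category names. A minimal sketch, assuming the same repository layout; `RAW_BASE`, `CATEGORIES`, and `build_task_urls` are illustrative names, not part of the notebook or of `utils.py`:

```python
# Hypothetical helper: derive the raw GitHub URL for each EsBBQ task file.
RAW_BASE = (
    "https://raw.githubusercontent.com/peremartra/fairness-pruning/"
    "main/custom_tasks/esbbq"
)

CATEGORIES = [
    "age", "disabilitystatus", "gender", "lgbtqia", "nationality",
    "physicalappearance", "raceethnicity", "religion", "ses", "spanishregion",
]


def build_task_urls(categories, base=RAW_BASE):
    """Return {file_name: url} for every category, in input order."""
    return {f"esbbq_{c}.yaml": f"{base}/esbbq_{c}.yaml" for c in categories}


task_urls = build_task_urls(CATEGORIES)
for name, url in task_urls.items():
    print(name, "<-", url)
```

In a notebook cell, each URL could then be fetched inside the loop with `!wget -q -P custom_tasks/esbbq {url}`, since IPython expands `{url}` in shell commands.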


In [5]:
# Simple EsBBQ installation (the YAML files in the repo are already clean)
import os
import shutil
import lm_eval

print(f"{'='*70}")
print("📦 INSTALLING EsBBQ TASKS")
print(f"{'='*70}")

# 1. Locate directories
lib_path = os.path.dirname(lm_eval.__file__)
target_dir = os.path.join(lib_path, "tasks", "esbbq")

# 2. Clean and recreate the target directory
if os.path.exists(target_dir):
    shutil.rmtree(target_dir)
os.makedirs(target_dir, exist_ok=True)
print(f"📍 TASKS DIRECTORY: {target_dir}\n")

# 3. Copy all YAML files (they are already clean)
yaml_files = [f for f in os.listdir('custom_tasks/esbbq') if f.endswith('.yaml')]

for yaml_file in yaml_files:
    src = os.path.join('custom_tasks/esbbq', yaml_file)
    dst = os.path.join(target_dir, yaml_file)
    shutil.copy(src, dst)
    print(f"   ✅ {yaml_file}")

print(f"\n🚀 OK! {len(yaml_files)} EsBBQ tasks installed.")

📦 INSTALLING EsBBQ TASKS
📍 TASKS DIRECTORY: /usr/local/lib/python3.12/dist-packages/lm_eval/tasks/esbbq

   ✅ esbbq_ses.yaml
   ✅ esbbq_gender.yaml
   ✅ esbbq_physicalappearance.yaml
   ✅ esbbq_disabilitystatus.yaml
   ✅ esbbq_raceethnicity.yaml
   ✅ esbbq_nationality.yaml
   ✅ esbbq_spanishregion.yaml
   ✅ esbbq_lgbtqia.yaml
   ✅ esbbq_age.yaml
   ✅ esbbq_religion.yaml

🚀 OK! 10 EsBBQ tasks installed.


# 2. Helper Functions

Utility functions for automatic checkpoint path generation and model size detection.

In [6]:
import re
import os

def get_model_size(model_name: str) -> str:
    """Extract model size identifier from HuggingFace model name.

    Examples:
        "meta-llama/Llama-3.2-1B" → "1b"
        "meta-llama/Llama-3.2-3B-Instruct" → "3b_instruct"
        "BSC-LT/salamandra-2b" → "2b"
    """
    match = re.search(r'(\d+\.?\d*)[Bb]', model_name)
    if not match:
        return "unknown"

    size = match.group(1).replace('.', '_') + "b"
    if "instruct" in model_name.lower():
        size += "_instruct"

    return size.lower()

def get_checkpoint_path(model_name: str, base_dir: str) -> str:
    """Generate checkpoint path with size-based subdirectory.

    Args:
        model_name: Full HuggingFace model identifier
        base_dir: Base directory for checkpoints

    Returns:
        Full path to checkpoint file
    """
    model_size = get_model_size(model_name)
    safe_name = model_name.replace('/', '_').replace('-', '_').lower()
    checkpoint_dir = os.path.join(base_dir, model_size)
    os.makedirs(checkpoint_dir, exist_ok=True)
    return os.path.join(checkpoint_dir, f"{safe_name}.json")

# Test with example models
print("Testing helper functions:")
print("-" * 70)
test_models = [
    "BSC-LT/salamandra-2b",
    "meta-llama/Llama-3.2-1B",
    "meta-llama/Llama-3.2-3B"
]
for model_id in test_models:
    size = get_model_size(model_id)
    print(f"{model_id:<50} → {size}")
print("-" * 70)

Testing helper functions:
----------------------------------------------------------------------
BSC-LT/salamandra-2b                               → 2b
meta-llama/Llama-3.2-1B                            → 1b
meta-llama/Llama-3.2-3B                            → 3b
----------------------------------------------------------------------
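
To make the resulting directory layout concrete, the sketch below repeats the two helpers in condensed form and points them at a temporary directory. The temporary directory is an assumption for illustration; the notebook itself writes to a Google Drive folder.

```python
import os
import re
import tempfile


def get_model_size(model_name: str) -> str:
    # Condensed copy of the helper defined in the cell above
    match = re.search(r"(\d+\.?\d*)[Bb]", model_name)
    if not match:
        return "unknown"
    size = match.group(1).replace(".", "_") + "b"
    if "instruct" in model_name.lower():
        size += "_instruct"
    return size.lower()


def get_checkpoint_path(model_name: str, base_dir: str) -> str:
    # One subdirectory per model size, one JSON file per model
    safe_name = model_name.replace("/", "_").replace("-", "_").lower()
    checkpoint_dir = os.path.join(base_dir, get_model_size(model_name))
    os.makedirs(checkpoint_dir, exist_ok=True)
    return os.path.join(checkpoint_dir, f"{safe_name}.json")


with tempfile.TemporaryDirectory() as base_dir:
    path = get_checkpoint_path("meta-llama/Llama-3.2-1B", base_dir)
    relative = os.path.relpath(path, base_dir)
    print(relative)  # 1b/meta_llama_llama_3.2_1b.json on POSIX systems
```

Only `/` and `-` are replaced in the file name, so the dot in `3.2` survives into the checkpoint name.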


In [7]:
import shutil
import os

# Delete the Hugging Face Datasets cache (this is where the stale, broken data lives)
paths_to_clean = [
    "/root/.cache/huggingface/datasets",
    "/root/.cache/huggingface/modules",
    os.path.expanduser("~/.cache/huggingface/datasets")
]

print("🧹 Starting deep clean of the Datasets cache...")

for p in paths_to_clean:
    if os.path.exists(p):
        try:
            shutil.rmtree(p)
            print(f"✅ Deleted: {p}")
        except Exception as e:
            print(f"⚠️ Could not delete {p}: {e}")
    else:
        print(f"ℹ️ Did not exist: {p}")

print("\n🚀 Cache cleared. The next run will download fresh, correct data.")

🧹 Starting deep clean of the Datasets cache...
ℹ️ Did not exist: /root/.cache/huggingface/datasets
ℹ️ Did not exist: /root/.cache/huggingface/modules
ℹ️ Did not exist: /root/.cache/huggingface/datasets

🚀 Cache cleared. The next run will download fresh, correct data.


# 3. Configuration & Evaluation Plan (MBBQ/EsBBQ)

Configure paths, select EsBBQ tasks, and list the models we will evaluate.

In [8]:
import torch
import json
import pandas as pd
from datetime import datetime
from pathlib import Path
import logging
import os
from transformers import AutoModelForCausalLM, AutoTokenizer
from IPython.display import display
from utils import (
    EXPERIMENT_CONFIG,
    run_robust_evaluation,
    load_or_create_model,
    clear_gpu_cache,
    format_results_table,
    get_model_stats,
)

# Paths (Drive recommended in Colab)
CHECKPOINT_BASE_DIR = "/content/drive/MyDrive/fair_pruning/checkpoints_mbbq"
RESULTS_DIR = "/content/drive/MyDrive/fair_pruning/results"
Path(CHECKPOINT_BASE_DIR).mkdir(parents=True, exist_ok=True)
Path(RESULTS_DIR).mkdir(parents=True, exist_ok=True)

# EsBBQ (MBBQ) task list: one lm-eval task per bias category, all evaluated 0-shot
MBBQ_TASKS = [
    {"name": "esbbq_age", "num_fewshot": 0},
    {"name": "esbbq_disabilitystatus", "num_fewshot": 0},
    {"name": "esbbq_gender", "num_fewshot": 0},
    {"name": "esbbq_lgbtqia", "num_fewshot": 0},
    {"name": "esbbq_nationality", "num_fewshot": 0},
    {"name": "esbbq_physicalappearance", "num_fewshot": 0},
    {"name": "esbbq_raceethnicity", "num_fewshot": 0},
    {"name": "esbbq_religion", "num_fewshot": 0},
    {"name": "esbbq_ses", "num_fewshot": 0},
    {"name": "esbbq_spanishregion", "num_fewshot": 0}
]

# De-duplicate models from EXPERIMENT_CONFIG
unique_models = list(dict.fromkeys([cfg["base_model"] for cfg in EXPERIMENT_CONFIG]))

logging.getLogger("lm_eval").setLevel(logging.INFO)

print(f"{'='*70}")
print("📊 EVALUATION PLAN: MBBQ (EsBBQ) - Spanish Bias Benchmark")
print(f"{'='*70}")
print(f"Checkpoints: {CHECKPOINT_BASE_DIR}")
print(f"Results: {RESULTS_DIR}")
print(f"Tasks: {[task['name'] for task in MBBQ_TASKS]}")
print("Limit per dataset: 1000")  # matches limit=1000 passed to model_evaluation
print("Models:")
for m in unique_models:
    print(f" - {m}")
print(f"{'='*70}\n")

📊 EVALUATION PLAN: MBBQ (EsBBQ) - Spanish Bias Benchmark
Checkpoints: /content/drive/MyDrive/fair_pruning/checkpoints_mbbq
Results: /content/drive/MyDrive/fair_pruning/results
Tasks: ['esbbq_age', 'esbbq_disabilitystatus', 'esbbq_gender', 'esbbq_lgbtqia', 'esbbq_nationality', 'esbbq_physicalappearance', 'esbbq_raceethnicity', 'esbbq_religion', 'esbbq_ses', 'esbbq_spanishregion']
Limit per dataset: 1000
Models:
 - BSC-LT/salamandra-2b
 - meta-llama/Llama-3.2-1B
 - meta-llama/Llama-3.2-3B
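
`unique_models` is built with `dict.fromkeys`, which drops duplicates while keeping first-seen order. A small sketch of that idiom follows. The `example_config` list is a hypothetical stand-in for `EXPERIMENT_CONFIG`: only the `base_model` key is known from the notebook, and the `pruning_pct` field is invented for illustration.

```python
# Hypothetical stand-in for EXPERIMENT_CONFIG (contents of utils.py are not shown here)
example_config = [
    {"base_model": "BSC-LT/salamandra-2b", "pruning_pct": 10},
    {"base_model": "BSC-LT/salamandra-2b", "pruning_pct": 20},
    {"base_model": "meta-llama/Llama-3.2-1B", "pruning_pct": 10},
    {"base_model": "meta-llama/Llama-3.2-3B", "pruning_pct": 10},
    {"base_model": "meta-llama/Llama-3.2-1B", "pruning_pct": 20},
]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7)
unique = list(dict.fromkeys(cfg["base_model"] for cfg in example_config))
print(unique)

# A set holds the same members but does not preserve the original order
as_set = {cfg["base_model"] for cfg in example_config}
print(len(as_set) == len(unique))
```

Keeping the order stable means the `MODEL 1/3`, `MODEL 2/3` progress labels refer to the same models on every run.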



# 4. Run MBBQ Evaluation

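Evaluate each base model on the EsBBQ tasks. After a model finishes, the cell writes a checkpoint JSON and saves the raw lm-eval results. As written, it re-evaluates every model on each run rather than resuming from an existing checkpoint.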

In [None]:
print(f"\n{'='*70}")
print("🚀 STARTING MBBQ (EsBBQ) EVALUATION (Hugging Face Native Loading)")
print(f"{'='*70}\n")

all_model_results = {}

# Simple pre-check that the datasets library imports cleanly
try:
    from datasets import load_dataset
    print("🔍 Checking that the datasets library is healthy...")
    # No dataset is loaded here (and trust_remote_code=True is not used, which
    # avoids the Belebele warning); the check only confirms the import works.
    print("✅ datasets library is operational.")
except Exception as e_ds:
    print(f"⚠️ Pre-check warning (non-blocking): {e_ds}")


for idx, model_id in enumerate(unique_models, 1):
    print(f"\n{'='*70}")
    print(f"📊 MODEL {idx}/{len(unique_models)}: {model_id}")
    print(f"{'='*70}\n")
    try:
        # Generate checkpoint path
        checkpoint_path = get_checkpoint_path(model_id, CHECKPOINT_BASE_DIR)

        # 1. Define directories BEFORE using them (fixes an earlier NameError)
        raw_results_dir = os.path.join(
            os.path.dirname(checkpoint_path),
            "results", "lm_evals"
        )
        os.makedirs(raw_results_dir, exist_ok=True)

        # 2. Load model
        print("📥 Loading directly from Hugging Face Hub...")
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )

        tokenizer = AutoTokenizer.from_pretrained(model_id)
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token

        print("✅ Model loaded successfully (Native HF)")

        # Show statistics
        stats = get_model_stats(model)
        print(f"📈 Params: {stats['total_parameters']:,} | Size: {stats['size_gb']:.2f} GB")
        print(f"📁 Checkpoint: {checkpoint_path}\n")

        # --- 🕵️ DEBUGGING TRACES START ---
        print(f"\n{'='*20} 🕵️ DEBUGGING TRACES: {model_id} {'='*20}")

        # Check object identity
        print(f"🆔 Object Memory ID: {id(model)}")

        # Check the vocabulary (confirms the tokenizer matches the model)
        test_word = "Inteligencia Artificial"
        encoded_ids = tokenizer.encode(test_word)
        print(f"🔤 Tokenizer Check ('{test_word}'): {encoded_ids}")
        print(f"📏 Vocab Size: {tokenizer.vocab_size}")
        if "salamandra" in model_id.lower() and tokenizer.vocab_size < 200000:
            print("⚠️ WARNING: Salamandra should have a vocab > 200k. Check the tokenizer.")

        # Check whether previous results already exist
        # (raw_results_dir is defined at this point)
        test_file = os.path.join(raw_results_dir, f"{model_id.replace('/', '_')}_esbbq_age.json")
        if os.path.exists(test_file):
            import time
            mod_time = time.ctime(os.path.getmtime(test_file))
            print(f"📁 Previous file detected: {test_file}")
            print(f"   🕒 Date: {mod_time} (should be updated when the run finishes)")
        else:
            print(f"📁 New file. It will be created in: {raw_results_dir}")

        print(f"{'='*60}\n")
        # --- 🕵️ DEBUGGING TRACES END ---

        # 3. Run evaluation
        from utils import model_evaluation

        # The message reports the same limit that is passed to model_evaluation
        print("📊 Running evaluation with limit=1000 per task...")
        print(f"💾 Raw results will be saved to: {raw_results_dir}\n")

        results = model_evaluation(
            model_obj=model,
            tokenizer=tokenizer,
            tasks=MBBQ_TASKS,
            limit=1000,
            save_raw_results=True,
            raw_results_dir=raw_results_dir
        )

        # Save results to checkpoint format
        checkpoint_data = {
            "metadata": {
                "model_name": model_id,
                "started_at": datetime.now().isoformat(),
                "completed": True,
                "completed_at": datetime.now().isoformat()
            },
            "results": results,
            "pending_tasks": [],
            "failed_tasks": []
        }

        with open(checkpoint_path, 'w') as f:
            json.dump(checkpoint_data, f, indent=2)

        all_model_results[model_id] = results
        print(f"\n✅ Completed: {model_id}")
        print("Results Preview (first few tasks):")
        preview_results = {k: v for i, (k, v) in enumerate(results.items()) if i < 3}
        print(format_results_table(preview_results))

    except Exception as e:
        print(f"\n❌ ERROR evaluating {model_id}: {e}")
        import traceback
        traceback.print_exc()
        if 'model' in locals(): del model
        if 'tokenizer' in locals(): del tokenizer
        clear_gpu_cache()
        continue

    # Memory cleanup
    del model
    del tokenizer
    clear_gpu_cache()

print(f"\n{'='*70}")
print(f"✅ MBBQ EVALUATION COMPLETE: {len(all_model_results)}/{len(unique_models)} models")
print(f"{'='*70}\n")
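
The loop above writes a checkpoint with `metadata.completed` set to `True` after each model, but it does not read that file back before loading the model. A minimal sketch of a resume check, assuming the checkpoint schema written by the cell above; `is_completed` is a hypothetical helper, not part of `utils.py`, and the accuracy value in the demo file is a placeholder.

```python
import json
import os
import tempfile


def is_completed(checkpoint_path: str) -> bool:
    """Return True if a readable checkpoint marks the model as finished."""
    if not os.path.exists(checkpoint_path):
        return False
    try:
        with open(checkpoint_path, "r") as f:
            data = json.load(f)
    except (json.JSONDecodeError, OSError):
        # A truncated or unreadable file is treated as "not done"
        return False
    metadata = data.get("metadata", {})
    return bool(metadata.get("completed")) and bool(data.get("results"))


# Demonstration with a throwaway checkpoint file
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.json")
    missing = is_completed(demo_path)  # no file yet
    with open(demo_path, "w") as f:
        json.dump(
            {"metadata": {"completed": True},
             "results": {"esbbq_age": {"acc": 0.5}}},  # placeholder value
            f,
        )
    done = is_completed(demo_path)
    print(missing, done)
```

Inside the evaluation loop, a guard such as `if is_completed(checkpoint_path): continue` placed before the model is loaded would skip models that already have results.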

# 5. Consolidate MBBQ Results

Load checkpoint files, flatten metrics, and export combined MBBQ results in CSV/JSON format.

In [10]:
import glob

def flatten_metrics(metrics, prefix=''):
    """Recursively flatten nested metric dictionaries."""
    flat = {}
    for k, v in metrics.items():
        if isinstance(v, dict):
            flat.update(flatten_metrics(v, prefix=f"{prefix}{k}_"))
        else:
            flat[f"{prefix}{k}"] = v
    return flat

print(f"{'='*70}")
print(" CONSOLIDATING MBBQ RESULTS")
print(f"{'='*70}\n")

checkpoint_files = glob.glob(f"{CHECKPOINT_BASE_DIR}/**/*.json", recursive=True)
print(f"Total checkpoint JSONs found: {len(checkpoint_files)}")

consolidated_data = []

for json_path in sorted(checkpoint_files):
    # Skip raw lm-eval dumps (only process model checkpoints)
    if "lm_evals" in json_path:
        continue

    try:
        with open(json_path, "r") as f:
            data = json.load(f)

        metadata = data.get("metadata", {})
        model_name = metadata.get("model_name", "Unknown")

        # Extract model size
        model_size = get_model_size(model_name)

        results = data.get("results", {})
        if not results:
            continue

        for task_name, metrics in results.items():
            row = {
                "model": model_name,
                "model_size": model_size,
                "task": task_name,
            }
            row.update(flatten_metrics(metrics))
            consolidated_data.append(row)

    except Exception as e:
        print(f"   ⚠️ Error reading {json_path}: {e}")

if consolidated_data:
    df = pd.DataFrame(consolidated_data)
    df = df.sort_values(by=["model", "task"]).reset_index(drop=True)

    print(f"\n📊 Consolidated {len(df)} task results from {df['model'].nunique()} models")
    display(df.head(10))

    # Save timestamped and latest versions
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_path = f"{RESULTS_DIR}/base_models_mbbq_results_{timestamp}.csv"
    latest_csv = f"{RESULTS_DIR}/base_models_mbbq_results_latest.csv"
    json_path = f"{RESULTS_DIR}/base_models_mbbq_results_{timestamp}.json"

    df.to_csv(csv_path, index=False)
    df.to_csv(latest_csv, index=False)
    df.to_json(json_path, orient="records", indent=2)

    print("\n💾 MBBQ results saved:")
    print(f"   {csv_path}")
    print(f"   {latest_csv}")
    print(f"   {json_path}")
else:
    print("⚠️ No MBBQ results found to consolidate.")

 CONSOLIDATING MBBQ RESULTS

Total checkpoint JSONs found: 33

📊 Consolidated 30 task results from 3 models


|   | model | model_size | task | alias | acc | acc_stderr | acc_norm | acc_norm_stderr |
|---|---|---|---|---|---|---|---|---|
| 0 | BSC-LT/salamandra-2b | 2b | esbbq_age | esbbq_age | 0.384 | 0.0154 | 0.383 | 0.0154 |
| 1 | BSC-LT/salamandra-2b | 2b | esbbq_disabilitystatus | esbbq_disabilitystatus | 0.345 | 0.015 | 0.335 | 0.0149 |
| 2 | BSC-LT/salamandra-2b | 2b | esbbq_gender | esbbq_gender | 0.35 | 0.0151 | 0.354 | 0.0151 |
| 3 | BSC-LT/salamandra-2b | 2b | esbbq_lgbtqia | esbbq_lgbtqia | 0.346 | 0.0151 | 0.364 | 0.0152 |
| 4 | BSC-LT/salamandra-2b | 2b | esbbq_nationality | esbbq_nationality | 0.3611 | 0.0214 | 0.3552 | 0.0213 |
| 5 | BSC-LT/salamandra-2b | 2b | esbbq_physicalappearance | esbbq_physicalappearance | 0.343 | 0.015 | 0.344 | 0.015 |
| 6 | BSC-LT/salamandra-2b | 2b | esbbq_raceethnicity | esbbq_raceethnicity | 0.359 | 0.0152 | 0.353 | 0.0151 |
| 7 | BSC-LT/salamandra-2b | 2b | esbbq_religion | esbbq_religion | 0.3565 | 0.0188 | 0.3519 | 0.0188 |
| 8 | BSC-LT/salamandra-2b | 2b | esbbq_ses | esbbq_ses | 0.354 | 0.0151 | 0.344 | 0.015 |
| 9 | BSC-LT/salamandra-2b | 2b | esbbq_spanishregion | esbbq_spanishregion | 0.3502 | 0.0152 | 0.3502 | 0.0152 |



💾 MBBQ results saved:
   /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_results_20251221_125440.csv
   /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_results_latest.csv
   /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_results_20251221_125440.json
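
To show what `flatten_metrics` does to a nested entry, here is a small worked example. The function body repeats the one in the cell above. The nested `by_context` block and its values are invented placeholders: the consolidated table suggests the stored metrics are already flat.

```python
def flatten_metrics(metrics, prefix=""):
    """Recursively flatten nested metric dictionaries (same logic as above)."""
    flat = {}
    for k, v in metrics.items():
        if isinstance(v, dict):
            # Nested dict: recurse and join the key path with underscores
            flat.update(flatten_metrics(v, prefix=f"{prefix}{k}_"))
        else:
            flat[f"{prefix}{k}"] = v
    return flat


# Invented nested input, for illustration only
nested = {
    "acc": 0.384,
    "acc_stderr": 0.0154,
    "by_context": {"ambig": {"acc": 0.4}, "disambig": {"acc": 0.3}},
}

flat = flatten_metrics(nested)
for key, value in flat.items():
    print(f"{key}: {value}")
```

Each nested key path becomes a single column name, such as `by_context_ambig_acc`, so every task maps to one flat DataFrame row.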


# 6. Summary Analysis

Quick per-model accuracy summary for MBBQ and file inventory.

In [11]:
# Generate summary statistics
summary_df = None
if 'df' in locals() and not df.empty:
    summaries = []
    for model_name, model_df in df.groupby('model'):
        acc_series = pd.to_numeric(model_df.get('acc', model_df.get('accuracy')), errors='coerce').dropna()
        summaries.append({
            "model": model_name,
            "model_size": model_df.get('model_size', pd.Series(['unknown'])).iloc[0],
            "avg_accuracy": acc_series.mean() if len(acc_series) else None,
            "tasks_completed": len(model_df),
            "categories_evaluated": len(model_df['task'].unique())
        })

    summary_df = pd.DataFrame(summaries)
    summary_df = summary_df.sort_values("model").reset_index(drop=True)

    print(f"{'='*70}")
    print("📈 MBBQ SUMMARY")
    print(f"{'='*70}")
    print(summary_df.to_string(index=False, float_format="%.4f"))

    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    summary_csv = f"{RESULTS_DIR}/base_models_mbbq_summary_{timestamp}.csv"
    summary_df.to_csv(summary_csv, index=False)
    print(f"\n💾 Summary saved: {summary_csv}\n")
else:
    print("No consolidated data available for summary.")

# List all generated files
print(f"{'='*70}")
print("📁 GENERATED FILES")
print(f"{'='*70}")

if 'csv_path' in locals() and os.path.exists(csv_path):
    print(f"  ✅ {csv_path}")
if 'latest_csv' in locals() and os.path.exists(latest_csv):
    print(f"  ✅ {latest_csv}")
if 'json_path' in locals() and os.path.exists(json_path):
    print(f"  ✅ {json_path}")
if 'summary_csv' in locals() and os.path.exists(summary_csv):
    print(f"  ✅ {summary_csv}")

print("\nCheckpoints (first 10):")
if 'checkpoint_files' in locals():
    model_checkpoints = [f for f in sorted(checkpoint_files) if "lm_evals" not in f]
    for f in model_checkpoints[:10]:
        print(f"  ✅ {f}")
    if len(model_checkpoints) > 10:
        print(f"  ... and {len(model_checkpoints) - 10} more")

print("\nRaw LM-Eval dumps (first 10):")
if 'checkpoint_files' in locals():
    raw_dumps = [f for f in sorted(checkpoint_files) if "lm_evals" in f]
    for f in raw_dumps[:10]:
        print(f"  ✅ {f}")
    if len(raw_dumps) > 10:
        print(f"  ... and {len(raw_dumps) - 10} more")

print(f"{'='*70}")

📈 MBBQ SUMMARY
                  model model_size  avg_accuracy  tasks_completed  categories_evaluated
   BSC-LT/salamandra-2b         2b        0.3549               10                    10
meta-llama/Llama-3.2-1B         1b        0.3686               10                    10
meta-llama/Llama-3.2-3B         3b        0.4929               10                    10

💾 Summary saved: /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_summary_20251221_125441.csv

📁 GENERATED FILES
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_results_20251221_125440.csv
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_results_latest.csv
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_results_20251221_125440.json
  ✅ /content/drive/MyDrive/fair_pruning/results/base_models_mbbq_summary_20251221_125441.csv

Checkpoints (first 10):
  ✅ /content/drive/MyDrive/fair_pruning/checkpoints_mbbq/1b/meta_llama_llama_3.2_1b.json
  ✅ /con
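
The `avg_accuracy` column is the unweighted mean of the per-task `acc` values. As a dependency-free cross-check, the sketch below recomputes it for Salamandra-2B from the ten rounded values shown in the consolidated table. The row layout and variable names are illustrative only.

```python
from collections import defaultdict
from statistics import mean

# (model, task, acc) rows copied from the consolidated results table
rows = [
    ("BSC-LT/salamandra-2b", "esbbq_age", 0.384),
    ("BSC-LT/salamandra-2b", "esbbq_disabilitystatus", 0.345),
    ("BSC-LT/salamandra-2b", "esbbq_gender", 0.35),
    ("BSC-LT/salamandra-2b", "esbbq_lgbtqia", 0.346),
    ("BSC-LT/salamandra-2b", "esbbq_nationality", 0.3611),
    ("BSC-LT/salamandra-2b", "esbbq_physicalappearance", 0.343),
    ("BSC-LT/salamandra-2b", "esbbq_raceethnicity", 0.359),
    ("BSC-LT/salamandra-2b", "esbbq_religion", 0.3565),
    ("BSC-LT/salamandra-2b", "esbbq_ses", 0.354),
    ("BSC-LT/salamandra-2b", "esbbq_spanishregion", 0.3502),
]

# Group accuracies by model, then average each group
acc_by_model = defaultdict(list)
for model, _task, acc in rows:
    acc_by_model[model].append(acc)

summary = {
    model: {"avg_accuracy": round(mean(accs), 4), "tasks_completed": len(accs)}
    for model, accs in acc_by_model.items()
}
print(summary)
```

Because every task counts equally, categories evaluated on fewer examples weigh as much as the larger ones in this average.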

# 7. Bias Analysis by Category

Analyze bias metrics across EsBBQ categories for each model.

In [12]:
import pandas as pd
import numpy as np
from datetime import datetime

if 'df' in locals() and not df.empty:
    print(f"{'='*70}")
    print("📊 BIAS ANALYSIS BY CATEGORY")
    print(f"{'='*70}\n")

    # --- CRITICAL FIX ---
    # 1. Unify names: use 'acc' (BBQ) if it exists, otherwise fall back to 'accuracy'.
    # 2. Convert to NUMBERS: force the conversion from text to float.
    target_col = 'acc' if 'acc' in df.columns else 'accuracy'

    # Create/overwrite the 'accuracy' column, making sure it is numeric.
    # errors='coerce' turns any non-numeric value into NaN so the code does not break.
    df['accuracy'] = pd.to_numeric(df[target_col], errors='coerce')
    # ---------------------------

    # Extract category from task name (e.g., esbbq_age -> age)
    df['category'] = df['task'].astype(str).str.replace('esbbq_', '')

    # Pivot table: models as rows, categories as columns
    # Now that 'accuracy' is float, aggfunc='mean' works correctly
    if 'accuracy' in df.columns:
        pivot_df = df.pivot_table(
            index='model',
            columns='category',
            values='accuracy',
            aggfunc='mean'
        )

        print("\nAccuracy by Category:")
        display(pivot_df)

        # Save category breakdown
        if 'RESULTS_DIR' in locals():
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            category_csv = f"{RESULTS_DIR}/mbbq_category_breakdown_{timestamp}.csv"
            pivot_df.to_csv(category_csv)
            print(f"\n💾 Category breakdown saved: {category_csv}")
    else:
        print("⚠️ Column 'accuracy' not found. Available columns:", df.columns.tolist())

    # Show bias score distribution if available
    bias_cols = [col for col in df.columns if 'bias' in col.lower()]
    if bias_cols:
        print("\n📊 Available bias metrics:")
        for col in bias_cols:
            print(f"   - {col}")
else:
    print("No data available for category analysis.")

📊 BIAS ANALYSIS BY CATEGORY


Accuracy by Category:


| model | age | disabilitystatus | gender | lgbtqia | nationality | physicalappearance | raceethnicity | religion | ses | spanishregion |
|---|---|---|---|---|---|---|---|---|---|---|
| BSC-LT/salamandra-2b | 0.384 | 0.345 | 0.35 | 0.346 | 0.3611 | 0.343 | 0.359 | 0.3565 | 0.354 | 0.3502 |
| meta-llama/Llama-3.2-1B | 0.347 | 0.334 | 0.371 | 0.381 | 0.369 | 0.371 | 0.377 | 0.3565 | 0.379 | 0.4008 |
| meta-llama/Llama-3.2-3B | 0.474 | 0.45 | 0.433 | 0.461 | 0.4921 | 0.591 | 0.468 | 0.4537 | 0.543 | 0.5628 |



💾 Category breakdown saved: /content/drive/MyDrive/fair_pruning/results/mbbq_category_breakdown_20251221_125441.csv
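
One quick way to read the category table is to look at each model's best and worst category and the gap between them. The sketch below does this with plain Python on the values shown above. `category_spread` is a hypothetical helper, and the accuracy gap is only a rough indicator of unevenness across categories, not a bias score.

```python
CATEGORIES = [
    "age", "disabilitystatus", "gender", "lgbtqia", "nationality",
    "physicalappearance", "raceethnicity", "religion", "ses", "spanishregion",
]

# Accuracy per category, copied from the table above (same column order)
ACCURACY = {
    "BSC-LT/salamandra-2b": [0.384, 0.345, 0.35, 0.346, 0.3611,
                             0.343, 0.359, 0.3565, 0.354, 0.3502],
    "meta-llama/Llama-3.2-1B": [0.347, 0.334, 0.371, 0.381, 0.369,
                                0.371, 0.377, 0.3565, 0.379, 0.4008],
    "meta-llama/Llama-3.2-3B": [0.474, 0.45, 0.433, 0.461, 0.4921,
                                0.591, 0.468, 0.4537, 0.543, 0.5628],
}


def category_spread(values, categories):
    """Return (best_category, worst_category, best_minus_worst)."""
    pairs = dict(zip(categories, values))
    best = max(pairs, key=pairs.get)
    worst = min(pairs, key=pairs.get)
    return best, worst, round(pairs[best] - pairs[worst], 4)


spreads = {m: category_spread(v, CATEGORIES) for m, v in ACCURACY.items()}
for model, (best, worst, spread) in spreads.items():
    print(f"{model:<26} best={best:<20} worst={worst:<18} spread={spread:.4f}")
```

On these numbers the 3B model has both the highest accuracies and the widest gap between categories.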
