# Professional Benchmarking Workflow for External Hugging Face Models

This notebook provides a professional, robust, and extensible workflow for benchmarking external language models from the Hugging Face Hub using the `karpathy/nanochat` evaluation framework.

### Key Features:
1.  **Centralized Configuration**: Manage all models and evaluation parameters in one place, the `main.py` script.
2.  **Robust JSON Reporting**: Evaluation results are saved as structured JSON files for reliable aggregation.
3.  **Resumable Sessions**: Skips previously evaluated models, allowing you to resume interrupted benchmark runs.
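4.  **Modern Model Support**: Loads models with `trust_remote_code=True`, which models that ship custom modeling code require. This flag executes code from the model repository, so only benchmark repositories you trust.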
5.  **Data Visualization**: Automatically generates a summary table and a bar chart for easy comparison of model performance.
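
The reporting and resume features rest on one convention: each model writes a single JSON file whose name is the Hub model ID with `/` replaced by `__`, and a model is skipped when that file already exists. The block below is a minimal sketch of that convention. The helper names `result_path` and `already_evaluated`, the model ID, and the score are hypothetical and only serve the illustration.

```python
import json
import os
import tempfile


def result_path(results_dir: str, model_path: str) -> str:
    """Map a Hub model ID to its JSON result file (slashes become '__')."""
    return os.path.join(results_dir, model_path.replace("/", "__") + ".json")


def already_evaluated(results_dir: str, model_path: str) -> bool:
    """A model counts as done once its result file exists."""
    return os.path.exists(result_path(results_dir, model_path))


# Demo in a throwaway directory; the model ID and score are made up.
demo_dir = tempfile.mkdtemp()
demo_model = "example-org/example-model"
before = already_evaluated(demo_dir, demo_model)

record = {"model_path": demo_model, "eval_type": "chat", "metrics": {"GSM8K": 0.5}}
with open(result_path(demo_dir, demo_model), "w") as f:
    json.dump(record, f, indent=2)

after = already_evaluated(demo_dir, demo_model)
print(before, after)
```

Because the file name depends only on the model ID, deleting a model's JSON file is enough to force a re-run for that model.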

## 1. Environment Setup

We will clone the repository, install all necessary dependencies, and download the evaluation datasets.

In [1]:
# Clone the repository and navigate into it
!git clone https://github.com/karpathy/nanochat.git
%cd nanochat

# Set the base directory for artifacts and create a results directory
import os
os.environ['NANOCHAT_BASE_DIR'] = '/content/nanochat_data'
results_dir = '/content/nanochat_data/results'
!mkdir -p $NANOCHAT_BASE_DIR
!mkdir -p {results_dir}

# Install uv package manager and add to PATH
!curl -LsSf https://astral.sh/uv/install.sh | sh
os.environ['PATH'] = f"/root/.local/bin:{os.environ['PATH']}"

# Install dependencies
#!uv venv
!uv sync

# Install necessary libraries for evaluation
!bash -c "pip install pyarrow transformers accelerate pandas matplotlib seaborn"
!bash -c "pip install -e ."

# Download the eval_bundle for CORE metric evaluation
!if [ ! -d "$NANOCHAT_BASE_DIR/eval_bundle" ]; then \
    curl -L -o eval_bundle.zip 'https://karpathy-public.s3.us-west-2.amazonaws.com/eval_bundle.zip' && \
    unzip -q eval_bundle.zip && \
    rm eval_bundle.zip && \
    mv eval_bundle $NANOCHAT_BASE_DIR; \
fi

print('✨ Environment and data setup complete.')

Cloning into 'nanochat'...

## 2. Create Professional Evaluation Script

This script, `evaluate_hf_model.py`, is the core of our workflow. It's designed to be flexible, accepting command-line arguments for tasks, batch size, and problem limits. It also supports loading modern models with `trust_remote_code=True` and saves structured JSON results.
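
One piece of the script is easy to miss: the nanochat `Engine` expects nanochat-style config field names, so the script translates the Hugging Face config into them. The following is a small standalone sketch of that mapping, using a fake config object so that it runs without downloading a model. The function name `to_nanochat_config` and the fake values are hypothetical; the field names and fallbacks mirror the script.

```python
from types import SimpleNamespace


def to_nanochat_config(hf_config, default_seq_len: int = 2048) -> SimpleNamespace:
    """Translate Hugging Face config attribute names into nanochat-style names."""
    n_head = hf_config.num_attention_heads
    return SimpleNamespace(
        n_layer=hf_config.num_hidden_layers,
        n_head=n_head,
        # Models without grouped-query attention have no num_key_value_heads.
        n_kv_head=getattr(hf_config, "num_key_value_heads", n_head),
        n_embd=hf_config.hidden_size,
        sequence_len=getattr(hf_config, "max_position_embeddings", default_seq_len),
        vocab_size=hf_config.vocab_size,
    )


# A stand-in for model.config with made-up sizes.
fake_hf_config = SimpleNamespace(
    num_hidden_layers=2, num_attention_heads=4, hidden_size=64, vocab_size=1000
)
cfg = to_nanochat_config(fake_hf_config)
print(cfg)
```

The two `getattr` fallbacks cover configs that omit `num_key_value_heads` or `max_position_embeddings`; any other missing attribute raises `AttributeError`, which surfaces an incompatible architecture early.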

In [2]:
script_content = """
import sys
import argparse
import json
import os
import torch
from types import SimpleNamespace
from nanochat.common import compute_init, compute_cleanup, print0, get_base_dir
from scripts.base_eval import evaluate_model as evaluate_core
from scripts.chat_eval import run_chat_eval
from nanochat.engine import Engine
from nanochat.report import get_report
from nanochat.tokenizer import HuggingFaceTokenizer as BaseHuggingFaceTokenizer

# Create a custom Tokenizer class that inherits from nanochat's wrapper
# and adds the missing 'render_for_completion' method needed for evaluation.
class HuggingFaceTokenizer(BaseHuggingFaceTokenizer):
    def render_for_completion(self, conversation):
        # This method prepares a conversation to prime the model for a completion.
        # It tokenizes up to the point where the assistant would start speaking.
        
        # 1. Get special token IDs and validate they exist
        tokens = {
            "bos": self.encode_special("<|bos|>"),
            "user_start": self.encode_special("<|user_start|>"),
            "user_end": self.encode_special("<|user_end|>"),
            "assistant_start": self.encode_special("<|assistant_start|>"),
        }
        for name, token_id in tokens.items():
            if token_id is None:
                raise ValueError(f"Special token '{name}' not found in the tokenizer's vocabulary.")

        # 2. In eval tasks, the user message is the first one.
        user_message_content = conversation['messages'][0]['content']
        
        # 3. Tokenize the user message content
        user_message_ids = self.tokenizer.encode(user_message_content, add_special_tokens=False).ids

        # 4. Construct the final token sequence in the nanochat format:
        # <|bos|><|user_start|>...user message...<|user_end|><|assistant_start|>
        ids = [tokens["bos"], tokens["user_start"]] + user_message_ids + [tokens["user_end"], tokens["assistant_start"]]
        
        return ids

class ModelWrapper:
    def __init__(self, model, config, max_seq_len=None):
        self.model = model
        self.config = config  # nanochat-style config (n_layer, n_head, ...) needed by the Engine
        self.max_seq_len = max_seq_len
    def __call__(self, input_ids):
        outputs = self.model(input_ids)
        return outputs.logits
    def get_device(self):
        return self.model.device

def load_hf_model(hf_path: str, device):
    print0(f'Loading model and tokenizer from: {hf_path}')
    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained(hf_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

    if tokenizer.pad_token is None:
        if tokenizer.eos_token is not None:
            tokenizer.pad_token = tokenizer.eos_token
        else:
            tokenizer.add_special_tokens({'pad_token': '[PAD]'}) 

    special_tokens_to_add = [
        "<|bos|>", "<|user_start|>", "<|user_end|>", 
        "<|assistant_start|>", "<|assistant_end|>"
    ]
    existing_tokens = set(tokenizer.get_vocab().keys())
    new_tokens = [token for token in special_tokens_to_add if token not in existing_tokens]
    
    if new_tokens:
        tokenizer.add_special_tokens({'additional_special_tokens': new_tokens})
        model.resize_token_embeddings(len(tokenizer))

    # Build a nanochat-compatible config object from the HF model's config.
    # This shim allows the nanochat Engine to work with external models.
    hf_config = model.config
    nanochat_config = SimpleNamespace(
        n_layer=hf_config.num_hidden_layers,
        n_head=hf_config.num_attention_heads,
        n_kv_head=getattr(hf_config, 'num_key_value_heads', hf_config.num_attention_heads),
        n_embd=hf_config.hidden_size,
        sequence_len=getattr(hf_config, 'max_position_embeddings', 2048),
        vocab_size=hf_config.vocab_size
    )

    model.eval()
    # Pass the new config object to the wrapper
    wrapped_model = ModelWrapper(model, config=nanochat_config, max_seq_len=getattr(tokenizer, 'model_max_length', 1024))
    
    temp_tok_dir = '/tmp/hf_tokenizer'
    tokenizer.save_pretrained(temp_tok_dir)
    final_tokenizer = HuggingFaceTokenizer.from_directory(temp_tok_dir)
    
    return wrapped_model, final_tokenizer

def main():
    parser = argparse.ArgumentParser(description='Evaluate external Hugging Face models.')
    parser.add_argument('model_path', type=str)
    parser.add_argument('--eval_type', type=str, required=True, choices=['core', 'chat'])
    parser.add_argument('--tasks', type=str, nargs='+', default=None)
    parser.add_argument('--batch_size', type=int, default=4)
    parser.add_argument('--max_problems', type=int, default=256)
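    args = parser.parse_args()
    # Chat evaluation iterates over args.tasks, so fail early if none were given.
    if args.eval_type == 'chat' and not args.tasks:
        parser.error('--tasks is required when --eval_type is chat')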

    ddp, ddp_rank, ddp_local_rank, ddp_world_size, device = compute_init()
    autocast_ctx = torch.amp.autocast(device_type='cuda', dtype=torch.bfloat16)
    results_dir = os.path.join(get_base_dir(), 'results')
    os.makedirs(results_dir, exist_ok=True)  # make sure the output directory exists
    model_results = {'model_path': args.model_path, 'eval_type': args.eval_type, 'metrics': {}}

    try:
        model, tokenizer = load_hf_model(args.model_path, device)
        engine = Engine(model, tokenizer)
    except Exception as e:
        print0(f'ERROR: Failed to load model {args.model_path}: {e}')
        compute_cleanup()
        return

    report = get_report()
    
    if args.eval_type == 'core':
        with autocast_ctx:
            eval_results = evaluate_core(model, tokenizer, device)
        if ddp_rank == 0:
            model_results['metrics'] = eval_results['centered_results']
            model_results['metrics']['CORE metric'] = eval_results['core_metric']
            report.log(section=f'External Model CORE Eval: {args.model_path}', data=[
                {'Model': args.model_path, 'CORE metric': eval_results['core_metric']},
                eval_results['centered_results'],
            ])
    elif args.eval_type == 'chat':
        tasks_to_run = args.tasks
        for task in tasks_to_run:
            print0(f'-- Evaluating task: {task}')
            with autocast_ctx:
                accuracy = run_chat_eval(task, model, tokenizer, engine, batch_size=args.batch_size, max_problems=args.max_problems)
            model_results['metrics'][task] = accuracy
            print0(f'  {task} accuracy: {accuracy:.4f}')
        if ddp_rank == 0:
            report.log(section=f'External Model CHAT Eval: {args.model_path}', data=[
                {'Model': args.model_path},
                model_results['metrics'],
            ])

    if ddp_rank == 0:
        json_filename = args.model_path.replace('/', '__') + '.json'
        json_path = os.path.join(results_dir, json_filename)
        with open(json_path, 'w') as f:
            json.dump(model_results, f, indent=2)
        print0(f'Results for {args.model_path} saved to {json_path} and logged to report.')

    compute_cleanup()

if __name__ == '__main__':
    main()
"""

with open('evaluate_hf_model.py', 'w') as f:
    f.write(script_content)

print('✨ `evaluate_hf_model.py` script created.')

✨ `evaluate_hf_model.py` script created.


## 3. Define Benchmark Configuration and Run the Benchmark

Define the entire benchmark run in the `main.py` script. Add models to the `models` list and customize their evaluation parameters. Each entry can override the default chat tasks, batch size, or problem limit with its own `tasks`, `batch_size`, or `max_problems` key.

This loop executes the evaluation for each model defined in the configuration. It dynamically constructs the command with the correct parameters and skips any models that have already been evaluated.
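
The command construction can also be sketched with an argument list instead of a shell string, which avoids quoting issues if a model path or task name ever contains shell-special characters. This is an alternative sketch, not the code used in the cell below: `build_command` and `DEFAULTS` are hypothetical names, and the model ID is made up. The flags match the arguments that `evaluate_hf_model.py` defines.

```python
import sys

# Defaults mirroring the benchmark configuration (values taken from the notebook).
DEFAULTS = {"chat_tasks": ["GSM8K", "HumanEval"], "batch_size": 4, "max_problems": 256}


def build_command(model_config: dict, defaults: dict = DEFAULTS) -> list:
    """Build the argv list for one model, applying per-model overrides."""
    cmd = [
        sys.executable, "evaluate_hf_model.py", model_config["path"],
        "--eval_type", model_config["type"],
        "--batch_size", str(model_config.get("batch_size", defaults["batch_size"])),
        "--max_problems", str(model_config.get("max_problems", defaults["max_problems"])),
    ]
    # Only chat evaluations take a task list.
    if model_config["type"] == "chat":
        cmd += ["--tasks", *model_config.get("tasks", defaults["chat_tasks"])]
    return cmd


cmd = build_command({"path": "example-org/example-model", "type": "chat", "tasks": ["GSM8K"]})
print(cmd)
```

A list built this way can be passed directly to `subprocess.run(cmd)` without `shell=True`.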

In [None]:
# Reset the report to start fresh
!bash -c "python -m nanochat.report reset"

script_content = """
import subprocess
import os

BENCHMARK_CONFIG = {
    #"default_chat_tasks": ['MMLU', 'ARC-Challenge', 'GSM8K', 'HumanEval'],
    "default_chat_tasks": ['GSM8K', 'HumanEval'],
    "default_batch_size": 4,
    "default_max_problems": 256, 
    "models": [
        {
            'path': 'Qwen/Qwen3-4B-Instruct-2507', 
            'type': 'chat',
            # This model will use the default chat tasks
        },
        {
            'path': 'Qwen/Qwen3-4B-Thinking-2507', 
            'type': 'chat',
            #'tasks': ['MMLU', 'GSM8K'] # Override: only run these two tasks
        },
        {
            'path': 'LiquidAI/LFM2-1.2B-RAG', 
            'type': 'chat'
        },
        # Add more models here for a comprehensive benchmark
        # e.g., {'path': 'google/gemma-2-9b-it', 'type': 'chat'},
    ]
}

print('✨ Benchmark configuration loaded.')

# Loop through the models and run evaluations
for model_config in BENCHMARK_CONFIG['models']:
    model_path = model_config['path']
    json_filename = model_path.replace('/', '__') + '.json'
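    # Use the same base directory as evaluate_hf_model.py (NANOCHAT_BASE_DIR) so the skip check finds its output
    results_root = os.path.join(os.environ.get('NANOCHAT_BASE_DIR', '/content/nanochat_data'), 'results')
    json_path = os.path.join(results_root, json_filename)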

    if os.path.exists(json_path):
        print(f"\\n{'='*80}")
        print(f"Skipping already evaluated model: {model_path}")
        print(f"{'='*80}")
        continue

    eval_type = model_config['type']
    batch_size = model_config.get('batch_size', BENCHMARK_CONFIG['default_batch_size'])
    max_problems = model_config.get('max_problems', BENCHMARK_CONFIG['default_max_problems'])
    
    command = f'python evaluate_hf_model.py "{model_path}" --eval_type="{eval_type}" --batch_size={batch_size} --max_problems={max_problems}'

    if eval_type == 'chat':
        tasks = model_config.get('tasks', BENCHMARK_CONFIG['default_chat_tasks'])
        tasks_str = ' '.join(tasks)
        command += f' --tasks {tasks_str}'
    
    print(f"\\n{'='*80}")
    print(f"Evaluating model: {model_path} (type: {eval_type})")
    print(f"{'='*80}")
    subprocess.run(command, shell=True)
"""

with open('main.py', 'w') as f:
    f.write(script_content)

print('✨ `main.py` script created.')

#!bash -c "./.venv/bin/python -c 'import transformers'"
!bash -c "python main.py"

Reset report and wrote header to /content/nanochat_data/report/header.md
✨ `main.py` script created.
✨ Benchmark configuration loaded.

Evaluating model: Qwen/Qwen3-4B-Instruct-2507 (type: chat)
2025-10-16 11:26:52,568 - datasets - INFO - JAX version 0.7.1 available.
2025-10-16 11:26:53,057 - nanochat.common - INFO - Distributed world size: 1
Loading model and tokenizer from: Qwen/Qwen3-4B-Instruct-2507

## 4. Aggregate Results & Visualize

Here, we gather all the structured JSON results, compile them into a pandas DataFrame for easy analysis, generate the final nanochat report, and create a visual summary chart to compare model performance.
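
Models may report different sets of metrics, so the summary table needs one column per metric seen anywhere, with gaps where a model has no score. pandas handles this automatically when building the DataFrame; the dependency-free sketch below shows the same idea with plain lists. The function name `to_table`, the model IDs, and the scores are made up for illustration.

```python
def to_table(records):
    """Return (header, rows); metrics a model lacks are filled with None."""
    # Union of all metric names, sorted for a stable column order.
    columns = sorted({name for rec in records for name in rec["metrics"]})
    header = ["Model"] + columns
    rows = [
        [rec["model_path"]] + [rec["metrics"].get(col) for col in columns]
        for rec in records
    ]
    return header, rows


records = [
    {"model_path": "example-org/model-a", "metrics": {"GSM8K": 0.50, "HumanEval": 0.25}},
    {"model_path": "example-org/model-b", "metrics": {"GSM8K": 0.75}},
]
header, rows = to_table(records)
print(header)
for row in rows:
    print(row)
```

In the pandas version the missing entries appear as `NaN`, which the styled table renders as `N/A`.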

In [None]:
import json
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Markdown

# Generate the full detailed report from all logged sections
!bash -c "python -m nanochat.report generate"
with open('report.md', 'r') as f:
    report_content = f.read()
print("--- Full Detailed Report ---")
display(Markdown(report_content))

# --- Create and Display the Comparison Summary Table from JSON files ---
all_results = []
for filename in os.listdir(results_dir):
    if filename.endswith('.json'):
        with open(os.path.join(results_dir, filename), 'r') as f:
            data = json.load(f)
            row = {'Model': data['model_path']}
            row.update(data['metrics'])
            all_results.append(row)

if all_results:
    df = pd.DataFrame(all_results).set_index('Model')
    df = df.sort_index(axis=1) # Sort columns alphabetically for consistent order
    
    print("\n\n--- External Model Comparison Summary ---")
    display(df.style.format('{:.4f}', na_rep='N/A').background_gradient(cmap='viridis', axis=0))

    # --- Visualize the results ---
    if not df.empty:
        # If a CORE metric is present, plot it next to the mean of all other
        # columns; otherwise plot every metric column as-is.
        plot_df = df.copy()
        chat_cols = [col for col in df.columns if col != 'CORE metric']
        if chat_cols and 'CORE metric' in df.columns:
            plot_df['Chat Average'] = df[chat_cols].mean(axis=1)
            plot_cols = ['CORE metric', 'Chat Average']
        else:
            plot_cols = df.columns.tolist()

        plot_df[plot_cols].plot(kind='bar', figsize=(12, 7), rot=45, width=0.8)
        plt.title('External Model Performance Comparison', fontsize=16)
        plt.ylabel('Score (Higher is better)', fontsize=12)
        plt.xlabel('')
        plt.tight_layout()
        plt.grid(axis='y', linestyle='--', alpha=0.7)
        plt.show()
else:
    print("\nNo evaluation results found to generate a summary.")