# MMLU Pro Physics Evaluation with Gemini 2.5 Flash

This notebook evaluates the Gemini 2.5 Flash model on the MMLU Pro Physics task using the lm-evaluation-harness.

## Prerequisites

1. Install lm-eval: `pip install -e .` (from the root of this repository)
2. Set your Gemini API key as an environment variable:
   - Linux/Mac: `export OPENAI_API_KEY=YOUR_GEMINI_API_KEY`
   - Windows (CMD): `set OPENAI_API_KEY=YOUR_GEMINI_API_KEY`
   - Windows (PowerShell): `$env:OPENAI_API_KEY="YOUR_GEMINI_API_KEY"`

Get your Gemini API key from: https://aistudio.google.com/apikey

## 1. Setup and Imports

In [None]:
# Import the main evaluation function
from lm_eval import simple_evaluate

# Import utilities for displaying results
import json
import os
from pprint import pprint

# Optional: For better result visualization
try:
    import pandas as pd
    PANDAS_AVAILABLE = True
except ImportError:
    print("Pandas not available. Install with: pip install pandas")
    PANDAS_AVAILABLE = False

print("Imports successful!")

## 2. Configuration

**Important Configuration Notes:**

1. **Base URL:** Must include the full endpoint path `/chat/completions`, not just the base path
   - Correct: `https://generativelanguage.googleapis.com/v1beta/openai/chat/completions`
   - Incorrect: `https://generativelanguage.googleapis.com/v1beta/openai/`

2. **Tokenizer Configuration:**
   - Gemini's OpenAI compatibility layer doesn't support the `/tokenizer_info` endpoint
   - Use `tokenizer_backend="huggingface"` with `tokenizer="Xenova/gpt-4"`
   - Do NOT use `tokenizer_backend="tiktoken"` (tiktoken doesn't recognize Gemini models)
   - Do NOT use `tokenizer_backend="remote"` (endpoint not supported)

Alternative tokenizers you can try:
- `"Xenova/gpt-4o"` - For newer models
- `"gpt2"` - For simpler tokenization

**Note:** The tokenizer is only used for calculating token counts locally and doesn't affect the actual API requests to Gemini.
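
The base URL rule in note 1 can be checked before any request is sent. The following is a minimal sketch: `ensure_chat_completions_url` is a hypothetical helper, not part of lm-evaluation-harness, and it assumes only what the note states, namely that the URL must end in `/chat/completions`.

```python
def ensure_chat_completions_url(base_url: str) -> str:
    """Return base_url with the /chat/completions path appended if missing.

    Hypothetical helper for illustration; not part of lm-eval.
    """
    suffix = "/chat/completions"
    trimmed = base_url.rstrip("/")
    if trimmed.endswith(suffix):
        return trimmed
    return trimmed + suffix


incomplete = "https://generativelanguage.googleapis.com/v1beta/openai/"
print(ensure_chat_completions_url(incomplete))
```

A URL that already carries the full path is returned unchanged apart from any trailing slash, so the helper can be applied to either form.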

In [None]:
# Verify API key is set
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError(
        "OPENAI_API_KEY environment variable not set. "
        "Please set it to your Gemini API key from https://aistudio.google.com/apikey"
    )
print(f"API key found: {api_key[:10]}...")

# Gemini API configuration
model_args = {
    "model": "gemini-2.5-pro",
    "base_url": "https://generativelanguage.googleapis.com/v1beta/openai/chat/completions",
    "tokenizer_backend": "huggingface",  # Use HuggingFace tokenizer
    "tokenizer": "Xenova/gpt-4",  # Use GPT-4 compatible tokenizer for Gemini
    "max_length": 32768,  # Maximum context length in tokens (32K); raise if prompts are truncated
    "max_gen_toks": 30000,     # Output generation limit
    "reasoning_effort": "low"  # Options: "low", "medium", "high"
}

# System instruction to ensure correct answer format
system_instruction = """CRITICAL FORMATTING RULE: You must end your final answer with exactly this format: "the answer is (X)" where X is a single capital letter A-J.

     DO NOT use any other format for your final answer. DO NOT use LaTeX formatting, boxed notation, or mathematical symbols.

     Example of CORRECT format: "the answer is (I)"
     Example of INCORRECT format: "The final answer is $\\boxed{I}$"

     This formatting rule overrides all other stylistic preferences."""

# Evaluation configuration
eval_config = {
    "tasks": ["mmlu_pro_physics"],   # Can also use ["mmlu_pro"] for all subjects
    "num_fewshot": 0,                # Number of few-shot examples (0 = zero-shot)
    "batch_size": 1,                 # Must be 1 for chat completions
    "apply_chat_template": True,     # Important for chat models
    "system_instruction": system_instruction,  # Add system instruction
    "log_samples": True,             # Save individual predictions for analysis
    "limit": None,                   # Use None for full dataset, or set to int for testing (e.g., 10)
}

print("\nConfiguration:")
print(f"  Model: {model_args['model']}")
print(f"  Base URL: {model_args['base_url']}")
print(f"  Tokenizer Backend: {model_args['tokenizer_backend']}")
print(f"  Tokenizer: {model_args['tokenizer']}")
print(f"  Tasks: {eval_config['tasks']}")
print(f"  Few-shot examples: {eval_config['num_fewshot']}")
print(f"  System instruction: {system_instruction[:100]}...")
print(f"  Limit: {eval_config['limit'] or 'Full dataset'}")

API key found: AIzaSyCMb_...

Configuration:
  Model: gemini-2.5-pro
  Base URL: https://generativelanguage.googleapis.com/v1beta/openai/chat/completions
  Tokenizer Backend: huggingface
  Tokenizer: Xenova/gpt-4
  Tasks: ['mmlu_pro_physics']
  Few-shot examples: 0
  System instruction: CRITICAL FORMATTING RULE: You must end your final answer with exactly this format: "the answer is (X...
  Limit: Full dataset


## 3. Quick Test (Optional)

Run a quick test on a small subset (5 examples in the cell below) to verify everything is working before running the full evaluation.
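
The debugging cell in this section checks whether the model's response contains the required "the answer is (X)" phrase. As a minimal standalone sketch of that check, the snippet below reuses the same regular expression and flag as the debugging cell; `extract_answer` is a hypothetical helper name, and the harness's own filter may differ from this pattern.

```python
import re

# Same pattern and flag as the debugging cell in this section
ANSWER_PATTERN = re.compile(r"answer is \(?([ABCDEFGHIJ])\)?", re.IGNORECASE)


def extract_answer(response: str):
    """Return the answer letter from a response, or None if the phrase is missing."""
    match = ANSWER_PATTERN.search(response)
    return match.group(1).upper() if match else None


examples = [
    "Tidal heating follows from the resonance, so the answer is (I)",
    "The final answer is $\\boxed{I}$",
]
for text in examples:
    print(repr(text), "->", extract_answer(text))
```

The boxed LaTeX form yields `None`, which is the failure the system instruction is meant to prevent. Because the pattern is applied with `re.IGNORECASE`, a phrase such as "the answer is a bit unclear" would also match and return `A`, so a match here is a sanity check rather than proof of a well-formed answer.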

In [3]:
# Quick test on a small subset of examples
test_results = simple_evaluate(
    model="gemini-chat",  # Must be registered in your lm-eval install; stock builds list local-chat-completions instead
    model_args=model_args,
    system_instruction=eval_config["system_instruction"], 
    tasks=eval_config["tasks"],
    num_fewshot=eval_config["num_fewshot"],
    batch_size=eval_config["batch_size"],
    apply_chat_template=eval_config["apply_chat_template"],
    log_samples=True,
    limit=5,  # Only 5 examples for testing
    verbosity="INFO",
)

print("\nTest Results:")
print("\nAvailable top-level keys:", list(test_results.keys()))

# Check what's in the results
if "results" in test_results:
    print("\nAvailable tasks:", list(test_results["results"].keys()))

    if "mmlu_pro_physics" in test_results["results"]:
        print("\nMetrics for mmlu_pro_physics:")
        for key, value in test_results["results"]["mmlu_pro_physics"].items():
            print(f"  {key}: {value}")

# DETAILED DEBUGGING OF EXACT MATCH
if "samples" in test_results and "mmlu_pro_physics" in test_results["samples"]:
    samples = test_results["samples"]["mmlu_pro_physics"]
    if samples:
        print("\n" + "="*80)
        print("DEBUGGING EXACT MATCH FOR FIRST SAMPLE")
        print("="*80)

        sample = samples[0]

        # Show the raw response from the model
        print("\n1. RAW MODEL RESPONSE:")
        raw_response = sample.get("resps", [[""]])[0][0] if sample.get("resps") else ""
        print(f"   '{raw_response}'")

        # Show what was filtered/extracted
        print("\n2. FILTERED/EXTRACTED ANSWER:")
        filtered = sample.get("filtered_resps", [""])[0] if sample.get("filtered_resps") else ""
        print(f"   '{filtered}'")

        # Show the target answer
        print("\n3. TARGET ANSWER:")
        target = sample.get("target", "")
        print(f"   '{target}'")

        # Show the exact match result
        print("\n4. EXACT MATCH RESULT:")
        exact_match = sample.get("exact_match", None)
        print(f"   {exact_match}")

        # Show why it might be failing
        print("\n5. COMPARISON:")
        print(f"   Filtered == Target: {filtered == target}")
        print(f"   Filtered (lower) == Target (lower): {str(filtered).lower() == str(target).lower()}")

        # Check if the regex pattern matched
        import re
        regex_pattern = r'answer is \(?([ABCDEFGHIJ])\)?'
        match = re.search(regex_pattern, raw_response, re.IGNORECASE)
        print("\n6. REGEX MATCH CHECK:")
        print(f"   Pattern: {regex_pattern}")
        print(f"   Found match: {match is not None}")
        if match:
            print(f"   Extracted letter: '{match.group(1)}'")
        else:
            print("   No match found in response!")
            print("   This is likely why exact_match is failing.")


2025-10-31:08:51:56 INFO     [evaluator:202] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2025-10-31:08:51:56 INFO     [evaluator:227] Initializing gemini-chat model, with arguments: {'model': 'gemini-2.5-pro', 'base_url': 'https://generativelanguage.googleapis.com/v1beta/openai/chat/completions', 'tokenizer_backend': 'huggingface', 'tokenizer': 'Xenova/gpt-4', 'max_length': 32768, 'max_gen_toks': 30000, 'reasoning_effort': 'low'}


ValueError: Attempted to load model 'gemini-chat', but no model for this name found! Supported model names: local-completions, local-chat-completions, openai-completions, openai-chat-completions, anthropic-completions, anthropic-chat, anthropic-chat-completions, dummy, gguf, ggml, hf-auto, hf, huggingface, hf-audiolm-qwen, steered, hf-multimodal, watsonx_llm, mamba_ssm, nemo_lm, neuronx, ipex, openvino, sglang, sglang-generate, textsynth, vllm, vllm-vlm

In [6]:
# pprint was already imported in the setup cell (from pprint import pprint)
pprint(test_results['samples']['mmlu_pro_physics'][:2])  # Print first 2 samples for inspection

[{'arguments': [(JsonChatStr(prompt='[{"role": "system", "content": "CRITICAL FORMATTING RULE: You must end your final answer with exactly this format: \\"the answer is (X)\\" where X is a single capital letter A-J.\\n\\n     DO NOT use any other format for your final answer. DO NOT use LaTeX formatting, boxed notation, or mathematical symbols.\\n\\n     Example of CORRECT format: \\"the answer is (I)\\"\\n     Example of INCORRECT format: \\"The final answer is $\\\\boxed{I}$\\"\\n\\n     This formatting rule overrides all other stylistic preferences.\\n\\nThe following are multiple choice questions (with answers) about physics. Think step by step and then finish your answer with \\"the answer is (X)\\" where X is the correct letter choice.\\n", "type": "text"}, {"role": "user", "content": "Question:\\nWhat is the significance of the 1:2:4 resonance in the Jupiter\'s moons system?\\nOptions:\\nA. It causes the three moons to always be in a straight line.\\nB. It causes Io to rotate on

## 4. Run Full Evaluation

This will evaluate the model on the complete MMLU Pro Physics dataset. This may take some time depending on the dataset size and API rate limits.

In [None]:
print("Starting evaluation...\n")
print("This may take several minutes depending on the dataset size and API rate limits.")
print("="*80)

# Run the evaluation
results = simple_evaluate(
    model="gemini-chat",  # Must be registered in your lm-eval install; stock builds list local-chat-completions instead
    model_args=model_args,
    system_instruction=eval_config["system_instruction"],  # Same formatting rule as the quick test
    tasks=eval_config["tasks"],
    num_fewshot=eval_config["num_fewshot"],
    batch_size=eval_config["batch_size"],
    apply_chat_template=eval_config["apply_chat_template"],
    log_samples=eval_config["log_samples"],
    limit=eval_config["limit"],
    verbosity="INFO",
)

print("\n" + "="*80)
print("Evaluation complete!")

## 5. Display Results

In [None]:
# Extract task results
task_name = "mmlu_pro_physics"
task_results = results["results"][task_name]

print("\n" + "="*80)
print(f"MMLU Pro Physics - {model_args['model']} Results")
print("="*80)

# Display all metrics
print("\nMetrics:")
for metric_name, metric_value in task_results.items():
    if isinstance(metric_value, (int, float)):
        if "acc" in metric_name or "exact_match" in metric_name:
            print(f"  {metric_name}: {metric_value:.2%}")
        else:
            print(f"  {metric_name}: {metric_value:.4f}")
    else:
        print(f"  {metric_name}: {metric_value}")

# Display configuration info
print("\nConfiguration:")
print(f"  Model: {model_args['model']}")
print(f"  Few-shot examples: {eval_config['num_fewshot']}")
print(f"  Samples evaluated: {len(results.get('samples', {}).get(task_name, []))}")

## 6. Sample Predictions Analysis

In [None]:
# Get sample predictions if available
samples = results.get("samples", {}).get(task_name, [])

if samples:
    print(f"\nTotal samples: {len(samples)}")
    
    # Count correct vs incorrect
    correct = sum(1 for s in samples if s.get("exact_match") == 1.0)
    incorrect = len(samples) - correct
    
    print(f"Correct: {correct} ({correct/len(samples):.2%})")
    print(f"Incorrect: {incorrect} ({incorrect/len(samples):.2%})")
    
    # Show a few examples
    print("\n" + "="*80)
    print("Sample Predictions (first 3):")
    print("="*80)
    
    for i, sample in enumerate(samples[:3]):
        print(f"\n--- Sample {i+1} ---")
        print(f"Question (truncated): {str(sample.get('doc', {}).get('question', 'N/A'))[:200]}...")
        print(f"Predicted: {sample['resps'][0][0] if sample.get('resps') else 'N/A'}")
        print(f"Target: {sample.get('target', 'N/A')}")
        print(f"Correct: {sample.get('exact_match') == 1.0}")
else:
    print("\nNo sample data available. Set log_samples=True to see individual predictions.")
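
Beyond the correct/incorrect totals, it can help to see which targets are being missed and what was extracted instead. The sketch below is one way to tally that; `summarize_samples` is a hypothetical helper, and it assumes each sample is a dict with the `exact_match`, `target` and `filtered_resps` keys read elsewhere in this notebook. The records are made up for illustration and are not evaluation results.

```python
from collections import Counter


def summarize_samples(samples):
    """Count correct answers and tally (target, extracted answer) pairs for misses.

    Assumes each sample is a dict with "exact_match", "target" and
    "filtered_resps" keys, as read in the debugging cell of this notebook.
    """
    correct = 0
    misses = Counter()
    for sample in samples:
        if sample.get("exact_match") == 1.0:
            correct += 1
            continue
        filtered = sample.get("filtered_resps") or ["<none>"]
        misses[(str(sample.get("target", "N/A")), str(filtered[0]))] += 1
    return correct, misses


# Made-up records for illustration only; "[invalid]" is a placeholder
# standing in for a response where no answer letter was extracted
toy_samples = [
    {"exact_match": 1.0, "target": "I", "filtered_resps": ["I"]},
    {"exact_match": 0.0, "target": "B", "filtered_resps": ["[invalid]"]},
    {"exact_match": 0.0, "target": "B", "filtered_resps": ["C"]},
]
correct, misses = summarize_samples(toy_samples)
print(f"Correct: {correct} of {len(toy_samples)}")
for (target, predicted), count in misses.most_common():
    print(f"  target={target} predicted={predicted}: {count}")
```

Misses where nothing was extracted point to a formatting problem in the response, while misses with a valid letter point to a wrong answer.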

## 7. Detailed Analysis (Optional)

In [None]:
# Create a DataFrame for easier analysis (if pandas is available)
if PANDAS_AVAILABLE and samples:
    # Extract relevant fields
    data = []
    for i, sample in enumerate(samples):
        data.append({
            "sample_id": i,
            "correct": sample.get("exact_match") == 1.0,
            "predicted": sample.get("resps", [["N/A"]])[0][0] if sample.get("resps") else "N/A",
            "target": sample.get("target", "N/A"),
        })
    
    df = pd.DataFrame(data)
    
    print("\nDataFrame created with sample predictions.")
    print(f"Shape: {df.shape}")
    print("\nFirst few rows:")
    display(df.head())
    
    print("\nAccuracy summary:")
    display(df['correct'].value_counts())
    
elif not PANDAS_AVAILABLE:
    print("Install pandas for detailed analysis: pip install pandas")
else:
    print("No samples available for analysis.")

## 8. Save Results to File

In [None]:
# Save results to JSON file
output_file = "gemini_mmlu_pro_physics_results.json"

with open(output_file, "w") as f:
    json.dump(results, f, indent=2, default=str)  # str() fallback for values json cannot encode

print(f"\nResults saved to: {output_file}")
print(f"File size: {os.path.getsize(output_file) / 1024:.2f} KB")
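
The results dictionary may contain values that the `json` module cannot encode natively, in which case a plain `json.dump` raises `TypeError`. The sketch below shows the standard-library `default=` fallback on a made-up structure; `to_json_text` is a hypothetical helper, and converting with `str()` is one possible choice rather than the harness's own serialization.

```python
import json
from pathlib import Path


def to_json_text(obj) -> str:
    """Serialize obj to JSON, converting unsupported values with str()."""
    return json.dumps(obj, indent=2, default=str)


# Made-up structure for illustration; Path stands in for any value that
# json cannot encode natively
toy_results = {"results": {"task": {"exact_match": 0.5}}, "output_path": Path("out")}
text = to_json_text(toy_results)
print(json.loads(text)["output_path"])
```

Values converted this way come back as strings when the file is reloaded, so the fallback suits archiving and inspection better than exact round-tripping.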

## Summary

This notebook demonstrates how to:
1. Configure and use Gemini models with lm-evaluation-harness
2. Run evaluations programmatically using `simple_evaluate()`
3. Analyze and visualize results
4. Save results for further analysis

### Key Points:
- Set `OPENAI_API_KEY` environment variable to your Gemini API key
- Use a chat-completions model type with Gemini's OpenAI-compatible endpoint: `gemini-chat` where your install registers it, otherwise `local-chat-completions`
- Always set `apply_chat_template=True` for chat models
- Batch size must be 1 for chat completions
- Use `limit=10` for quick testing before running full evaluation

### Next Steps:
- Try other MMLU Pro subjects: `mmlu_pro_chemistry`, `mmlu_pro_math`, etc.
- Run all MMLU Pro tasks: `tasks=["mmlu_pro"]`
- Experiment with different models: `gemini-2.5-pro`, `gemini-2.0-flash`
- Adjust few-shot examples: `num_fewshot=0` (zero-shot) or `num_fewshot=10`