# AIME 2025 Evaluation (Ollama)

This notebook mirrors the MathArena-style evaluation flow for **AIME 2025**, using the helper functions from the `llm_wc` codebase.

**Pipeline**
1. Load the AIME 2025 dataset (MathArena format).
2. Configure the Ollama/OpenAI-compatible client.
3. Run the model on the selected problems.
4. Parse boxed answers (falling back to the last integer when no `\boxed{}` is found) and compute accuracy.


## Step 0 - Configuration
Adjust these settings for your model, dataset path, and evaluation size.


In [56]:
%run _dev_setup.py

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
🔁 Autoreload is ON (IPython detected).
✅ Using llm_wc from: /home/iamsikun/research/llm-wc/src/llm_wc




In [57]:
import os
import json
from pathlib import Path
from pprint import pprint
from typing import Any


In [66]:
from llm_wc.aime_2025 import (
    AIME_2025_PROMPT,
    evaluate_model_on_aime_2025,
    normalize_aime_answer,
)
from llm_wc.client import ClientConfig, build_client
from llm_wc.core import EvalResult
from llm_wc.core.eval import compute_accuracy
from llm_wc.matharena import extract_answer, load_matharena_competition_problems


In [61]:
OLLAMA_URL = "http://localhost:11434/v1"

# Ollama ignores the key, but the client expects it
os.environ.setdefault('OPENAI_API_KEY', 'ollama')

# Optional limits for quick smoke tests
LIMIT_PROBLEMS = 5  # set None for full evaluation

STRICT_PARSING = False

# Where to save raw predictions
OUTPUT_DIR = Path('eval_results/ollama_aime_2025')
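
`OUTPUT_DIR` is defined above but the notebook does not show how predictions are written to it. The following is a minimal sketch, assuming one JSON object per line (JSONL); `save_predictions` and the file name `predictions.jsonl` are hypothetical and not part of `llm_wc`.

```python
import json
from pathlib import Path


def save_predictions(
    records: list[dict],
    output_dir: Path,
    filename: str = "predictions.jsonl",  # hypothetical file name
) -> Path:
    """Write one JSON object per line and return the path of the written file."""
    # Create the directory (and any parents) if it does not exist yet.
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / filename
    with out_path.open("w", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps LaTeX and non-ASCII text readable on disk.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

A call such as `save_predictions([{"problem_idx": 1, "pred": "70"}], OUTPUT_DIR)` would then create the directory on first use.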


## Step 1 - Load the dataset
The loader accepts either a local MathArena-style layout or a Hugging Face dataset name (used here).


In [62]:
problems: list[dict] = load_matharena_competition_problems(
    dataset_path="MathArena/aime_2025",
    problem_ids=None,  # e.g. [1, 2, 3] or None for all
    final_answer=True,
)
len(problems)


30

In [63]:
problem: dict = problems[0]

print(f"Problem id: {problem['problem_idx']}")
print(f"Problem type(s): {problem['problem_type']}")
print(f"Problem text: {problem['problem']}")
print(f"Answer: {problem['answer']}")


Problem id: 1
Problem type(s): ['Number Theory']
Problem text: Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$
Answer: 70
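
As a sanity check on the gold answer for this sample problem, the condition can be brute-forced. Since $9b+7 = 9(b+7) - 56$, any valid base satisfies $b+7 \mid 56$, so $b \le 49$ and a search bound of 100 is more than enough. The helper `from_base` is a hypothetical name used only for this illustration.

```python
def from_base(digits: list[int], base: int) -> int:
    """Interpret a list of digits (most significant first) in the given base."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


# Bases b > 9 for which 17_b divides 97_b; b + 7 must divide 56, so b <= 49.
valid_bases = [
    b for b in range(10, 100)
    if from_base([9, 7], b) % from_base([1, 7], b) == 0
]
print(valid_bases, sum(valid_bases))  # [21, 49] 70
```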


## Step 2 - Build LLM client

In [65]:
MODEL_NAME = "gemma3:27b"
client_cfg = ClientConfig(
    provider="openai",
    model=MODEL_NAME,
    api_base=OLLAMA_URL,
    api_key="ollama",
)
client = build_client(client_cfg)
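
What `build_client` does internally is not shown in this notebook. For orientation only, the sketch below builds (but does not send) the kind of request an OpenAI-compatible client would issue against Ollama's `/v1/chat/completions` endpoint. It uses only the standard library, and `build_chat_request` is a hypothetical helper, not part of `llm_wc`.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str,
    model: str,
    prompt: str,
    api_key: str = "ollama",  # Ollama ignores the key; any non-empty string works
    **params,
) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **params,  # e.g. temperature, max_tokens, seed
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


req = build_chat_request(
    "http://localhost:11434/v1", "gemma3:27b", "What is 2 + 2?", temperature=0.0
)
print(req.full_url)  # http://localhost:11434/v1/chat/completions
```

Sending the request with `urllib.request.urlopen(req)` would require a running Ollama server with the model pulled, so that step is left out here.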


## Step 3 - Format the prompt

In [67]:
instruction: str = AIME_2025_PROMPT

print("Instruction:\n", instruction)

Instruction:
 Please reason step by step, and put your final answer within \boxed{}.
The answer is an integer between 0 and 999 inclusive.


In [68]:
problem_text: str = problem["problem"]
prompt = instruction + "\n\n" + problem_text

print("Instruction + problem text:\n", prompt)

Instruction + problem text:
 Please reason step by step, and put your final answer within \boxed{}.
The answer is an integer between 0 and 999 inclusive.

Find the sum of all integer bases $b>9$ for which $17_b$ is a divisor of $97_b.$


## Step 4 - Query the model
Before running the full loop, we call the model on the single problem selected above and parse its boxed answer, with an optional fallback to the last integer in the response.


In [77]:
params: dict[str, Any] = {
    "temperature": 0.0,
    "max_tokens": 32768,
    "top_p": 0.95,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "logprobs": True,
    "top_logprobs": 10,
    "seed": 3,
}

response = client.chat(
    [{"role": "user", "content": prompt}],
    **params,
)


In [None]:
token_usage = response.usage
if token_usage:
    print(f"Total tokens: {token_usage.total_tokens}")
    print(f"  Prompt tokens: {token_usage.prompt_tokens}")
    print(f"  Completion tokens: {token_usage.completion_tokens}")

print("\nModel's answer:")
print(response.content)

print("\nModel's reasoning:")
print(response.reasoning or "No reasoning provided")


Total tokens: 1024
  Prompt tokens: 72
  Completion tokens: 952

Model's answer:
Let $17_b$ and $97_b$ be numbers in base $b$. We have $17_b = 1 \cdot b^1 + 7 \cdot b^0 = b+7$ and $97_b = 9 \cdot b^1 + 7 \cdot b^0 = 9b+7$.
We are given that $17_b$ is a divisor of $97_b$, which means that $b+7$ divides $9b+7$.
We can write $9b+7 = 9(b+7) - 63 + 7 = 9(b+7) - 56$.
Since $b+7$ divides $9(b+7)$, we must have $b+7$ divides $9b+7 - 9(b+7) = -56$.
Thus, $b+7$ must be a divisor of 56.
Since $b>9$, we have $b+7 > 16$.
The divisors of 56 are 1, 2, 4, 7, 8, 14, 28, 56.
We need to find the divisors of 56 that are greater than 16. These are 28 and 56.
If $b+7 = 28$, then $b = 28-7 = 21$.
If $b+7 = 56$, then $b = 56-7 = 49$.
We need to check that the digits used in the numbers $17_b$ and $97_b$ are less than $b$.
For $b=21$, the digits are 1, 7, 9, so $9 < 21$, which is true.
For $b=49$, the digits are 1, 7, 9, so $9 < 49$, which is true.
Thus, the possible values of $b$ are 21 and 49.
The sum of the

In [84]:
parsed_ans = extract_answer(response.content, strict_parsing=STRICT_PARSING)
pred = parsed_ans.answer

print(f"Parsed answer: {pred}")


Parsed answer: 70
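
The parsing behaviour described in this notebook (prefer `\boxed{}`, otherwise fall back to the last integer unless strict parsing is on) can be sketched roughly as follows. This is not the `extract_answer` implementation from `llm_wc.matharena`; `extract_boxed_or_last_int` is a hypothetical, simplified stand-in that handles only non-nested braces.

```python
import re
from typing import Optional


def extract_boxed_or_last_int(text: str, strict: bool = False) -> Optional[str]:
    """Return the content of the last \\boxed{...}, else the last integer, else None."""
    # Simplification: only matches boxed content without nested braces.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", text)
    if boxed:
        return boxed[-1].strip()
    if strict:
        # Strict mode: no boxed answer means no prediction.
        return None
    # Fallback: take the last run of digits in the response.
    integers = re.findall(r"\d+", text)
    return integers[-1] if integers else None


print(extract_boxed_or_last_int(r"So the sum is $\boxed{70}$."))        # 70
print(extract_boxed_or_last_int("The sum is 21 + 49 = 70"))             # 70
print(extract_boxed_or_last_int("The sum is 21 + 49 = 70", strict=True))  # None
```

The fallback is convenient for models that forget the `\boxed{}` wrapper, but it can pick up an unrelated number, which is why a strict mode is worth having.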


In [85]:
EvalResult(
    benchmark="aime_2025",
    question_id=int(problem.get("problem_idx", -1)),
    original_id=str(problem.get("problem_idx", "")),
    question=problem.get("problem"),
    choices={},
    answer=str(problem.get("answer", "")),
    pred=pred,
    prompt_type="default",
    model_outputs=response.content,
    category=problem.get("problem_type"),
    metadata={
        "image": problem.get("image"),
        "warning": parsed_ans.warning.name,
    },
)




## Step 5 - Run the evaluation loop
Run the model on the selected problems, then compare the parsed predictions to the gold answers.


In [None]:
results = evaluate_model_on_aime_2025(
    llm_client=client,
    problem_ids=None,
    limit=LIMIT_PROBLEMS,
    dataset_path="MathArena/aime_2025",
    instruction=AIME_2025_PROMPT,
    strict_parsing=STRICT_PARSING,
    request_params=params,
    show_progress=False,
)
summary = compute_accuracy(results, normalizer=normalize_aime_answer)
summary
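
The scoring idea behind this step can be sketched without the library: normalize both prediction and gold answer to an integer in the AIME range, then count exact matches. The functions below are hypothetical stand-ins; the actual behaviour of `normalize_aime_answer` and `compute_accuracy` in `llm_wc` may differ.

```python
from typing import Optional


def normalize_int_answer(raw: object) -> Optional[int]:
    """Map '070', ' 70 ', or 70 to 70; return None if not an integer in 0..999."""
    try:
        value = int(str(raw).strip())
    except ValueError:
        return None
    return value if 0 <= value <= 999 else None


def accuracy(pairs: list[tuple[object, object]]) -> float:
    """Fraction of (prediction, gold) pairs whose normalized values match."""
    if not pairs:
        return 0.0
    correct = 0
    for pred, gold in pairs:
        normalized_pred = normalize_int_answer(pred)
        # An unparseable prediction never counts as correct.
        if normalized_pred is not None and normalized_pred == normalize_int_answer(gold):
            correct += 1
    return correct / len(pairs)


print(accuracy([("070", "70"), ("71", "70")]))  # 0.5
```

Normalizing before comparing avoids counting `"070"` and `"70"` as a mismatch, while a missing prediction (`None`) is simply scored as wrong.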


---
## Notes
- Set `LIMIT_PROBLEMS = None` to evaluate the full 30-problem AIME 2025 set.
- Use `STRICT_PARSING = True` to require a `\boxed{}` answer with no fallback to the last integer.
- The dataset loader supports a local MathArena-style layout or the Hugging Face dataset name.
