# MMLU-Pro Evaluation (Ollama)

This notebook follows the MMLU-Pro repo's evaluation logic, but **in-notebook**: we copy the key functions, load the dataset, run an Ollama model, and compute accuracy.

**Pipeline**
1. Load the MMLU-Pro dataset.
2. Initialize the Ollama (OpenAI-compatible) client.
3. Run the model on the dataset (optionally limited for a smoke test).
4. Compare predictions to ground-truth answers and compute accuracy.


## Step 0 - Configuration

In [1]:
%run _dev_setup.py

🔁 Autoreload is ON (IPython detected).
✅ Using llm_wc from: /home/iamsikun/research/llm-wc/src/llm_wc


In [2]:
import os
import json
import random
import re
import math
from pathlib import Path
from pprint import pprint
from typing import Any

In [None]:
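# Wildcard imports bring in the helpers used below
# (e.g. Question, ClientConfig, build_client, format_example, compute_accuracy).
from llm_wc.core import *
from llm_wc.mmlu_pro import *
from llm_wc.client import *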

In [4]:
# Use the OpenAI SDK to call Ollama's /v1 endpoint
os.environ.setdefault('OPENAI_API_KEY', 'ollama')

# Where to save raw predictions
OUTPUT_DIR = Path('eval_results/ollama_mmlu_pro')
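
As a rough illustration of what "saving raw predictions" under a directory like `OUTPUT_DIR` could look like, here is a minimal sketch that writes one JSON object per line. The helper name `save_predictions`, the file name `predictions.jsonl`, and the record fields are hypothetical; none of them come from `llm_wc`.

```python
import json
from pathlib import Path


def save_predictions(
    records: list[dict], out_dir: Path, name: str = "predictions.jsonl"
) -> Path:
    """Write records as JSON Lines and return the path of the file written."""
    # Create the output directory (and any missing parents) on first use.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / name
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            # One JSON object per line keeps the file easy to append to and diff.
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

A call such as `save_predictions([{"id": 0, "pred": "I"}], OUTPUT_DIR)` would then create `eval_results/ollama_mmlu_pro/predictions.jsonl`.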

## Step 1 - Load MMLU-Pro dataset and preprocess it
The loading and preprocessing helpers used here are taken directly from the `evaluate_from_api.py` and `compute_accuracy.py` scripts in the TIGER-AI-Lab/MMLU-Pro repo, with no logic changes.
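
The internals of `load_mmlu_pro_benchmark` are not shown in this notebook. As a hedged sketch of the kind of preprocessing involved, the snippet below drops `"N/A"` placeholder options and groups rows by category. It assumes each row is a dict with `options` and `category` fields (as in the COT example printed further down); the sample rows are made up.

```python
from collections import defaultdict


def preprocess(rows: list[dict]) -> dict[str, list[dict]]:
    """Drop placeholder options and group rows by category."""
    by_category: dict[str, list[dict]] = defaultdict(list)
    for row in rows:
        # Copy the row so the caller's data is not mutated.
        cleaned = dict(row)
        cleaned["options"] = [opt for opt in row["options"] if opt != "N/A"]
        by_category[cleaned["category"]].append(cleaned)
    return dict(by_category)


# Tiny made-up rows, only to show the shape of the result.
rows = [
    {"question": "q1", "options": ["x", "y", "N/A"], "category": "math"},
    {"question": "q2", "options": ["x", "N/A", "N/A"], "category": "law"},
    {"question": "q3", "options": ["x", "y", "z"], "category": "math"},
]
grouped = preprocess(rows)
print({category: len(items) for category, items in grouped.items()})
```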


In [5]:
bench_ds = load_mmlu_pro_benchmark()

In [6]:
print("Total number of questions: ", len(bench_ds.questions))
for category in bench_ds.categories:
    questions = bench_ds.get_questions_by_category(category)
    print(f"  {category}: {len(questions)} questions")

Total number of questions:  12032
  math: 1351 questions
  health: 818 questions
  physics: 1299 questions
  business: 789 questions
  biology: 717 questions
  chemistry: 1132 questions
  computer science: 410 questions
  economics: 844 questions
  engineering: 969 questions
  philosophy: 499 questions
  other: 924 questions
  history: 381 questions
  psychology: 798 questions
  law: 1101 questions


In [7]:
print("Sample question: ")
pprint(bench_ds.questions[0])

Sample question: 
Question(id=0,
         original_id='70',
         question='Typical advertising regulatory bodies suggest, for example '
                  'that adverts must not: encourage _________, cause '
                  'unnecessary ________ or _____, and must not cause _______ '
                  'offence.',
         choices={'A': 'Safe practices, Fear, Jealousy, Trivial',
                  'B': 'Unsafe practices, Distress, Joy, Trivial',
                  'C': 'Safe practices, Wants, Jealousy, Trivial',
                  'D': 'Safe practices, Distress, Fear, Trivial',
                  'E': 'Unsafe practices, Wants, Jealousy, Serious',
                  'F': 'Safe practices, Distress, Jealousy, Serious',
                  'G': 'Safe practices, Wants, Fear, Serious',
                  'H': 'Unsafe practices, Wants, Fear, Trivial',
                  'I': 'Unsafe practices, Distress, Fear, Serious'},
         answer='I',
         category='business',
         subcategory=None,


In [8]:
# The benchmark metadata stores chain-of-thought (COT) examples per category.
# They have the same fields as the raw test rows, plus a `cot_content` answer,
# and are used as few-shot examples in the prompt.
print("Sample COT Example: ")
pprint(bench_ds.metadata["cot_examples_by_category"]["business"][0])

Sample COT Example: 
{'answer': 'F',
 'answer_index': 5,
 'category': 'business',
 'cot_content': "A: Let's think step by step. We refer to Wikipedia articles "
                'on business ethics for help. The sentence that best uses the '
                'possible options above is __n contrast to *boycotts*, '
                '*buycotts* aim to reward favourable behavior by companies. '
                'The success of such campaigns have been heightened through '
                'the use of *digital technology*, which allow campaigns to '
                'facilitate the company in achieving *increased sales*._ The '
                'answer is (F).',
 'options': ['Boycotts, Buyalls, Blockchain technology, Increased Sales',
             'Buycotts, Boycotts, Digital technology, Decreased Sales',
             'Boycotts, Buycotts, Digital technology, Decreased Sales',
             'Buycotts, Boycotts, Blockchain technology, Charitable donations',
             'Boycotts, Buyalls, Blockchai

## Step 2 - Build the Ollama client and request helper
This mirrors the repo's `call_api` pattern but uses the OpenAI SDK configured for Ollama.
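
For orientation, here is a sketch of the HTTP request that an OpenAI-compatible client ends up sending. It uses only the standard library and builds the request without sending it. The base URL below is an assumption (Ollama's default local address, which may differ from `OLLAMA_URL`), and `build_chat_request` is a hypothetical helper, not part of `llm_wc` or the OpenAI SDK.

```python
import json
import urllib.request

# Assumption: Ollama's default local address; adjust if the server runs elsewhere.
ASSUMED_OLLAMA_BASE = "http://localhost:11434/v1"


def build_chat_request(
    base_url: str, model: str, messages: list[dict], **params
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {"model": model, "messages": messages, **params}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # A placeholder key, mirroring the 'ollama' value used in this notebook.
            "Authorization": "Bearer ollama",
        },
        method="POST",
    )


req = build_chat_request(
    ASSUMED_OLLAMA_BASE,
    model="gpt-oss:latest",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.0,
)
print(req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) requires a running Ollama server, so that step is left out here.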


In [11]:
client_cfg = ClientConfig(
    provider="openai", 
    model="gpt-oss:latest",
    api_base=OLLAMA_URL,
    api_key='ollama'
)

In [12]:
client = build_client(client_cfg)

## Step 3 - Format the prompt
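The cells in this step call `format_example` from `llm_wc`, whose source is not shown here. The sketch below is a hypothetical stand-in that reproduces the `Question:` / `Options:` layout visible in the printed prompts. The `Answer:` line and the handling of a leading `A: ` in `cot_content` are assumptions.

```python
CHOICE_LETTERS = "ABCDEFGHIJ"


def format_example_sketch(
    question: str, options: list[str], cot_content: str = ""
) -> str:
    """Render one question block, with a worked answer when one is given."""
    text = f"Question: {question}\nOptions: "
    for letter, option in zip(CHOICE_LETTERS, options):
        text += f"{letter}. {option}\n"
    if cot_content:
        # Few-shot example: drop a leading "A: " and include the worked answer.
        text += "Answer: " + cot_content.removeprefix("A: ") + "\n\n"
    else:
        # Target question: leave the answer open for the model to complete.
        text += "Answer: Let's think step by step."
    return text


demo = format_example_sketch("2 + 2 = ?", ["3", "4"])
print(demo)
```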

In [13]:
subject: str = "business"

question: Question = bench_ds.get_questions_by_category(subject)[0]
cot_examples: list[dict] = bench_ds.metadata["cot_examples_by_category"][subject]
question_text: str = question.question
options: dict[str, str] = question.choices  # maps choice letter -> option text
instruction: str = (
    f"The following are multiple choice questions (with answers) about {subject}. Think step by"
    ' step and then output the answer in the format of "The answer is (X)" at the end.\n\n'
)

print("Task intro:\n")
print(instruction)

Task intro:

The following are multiple choice questions (with answers) about business. Think step by step and then output the answer in the format of "The answer is (X)" at the end.




In [14]:
# Add COT examples to the prompt
for cot_example in cot_examples:
    instruction += format_example(
        cot_example["question"], cot_example["options"], cot_example["cot_content"]
    )

print("Task intro + COT examples:\n")
print(instruction)

Task intro + COT examples:

The following are multiple choice questions (with answers) about business. Think step by step and then output the answer in the format of "The answer is (X)" at the end.

Question: In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .
Options: A. Boycotts, Buyalls, Blockchain technology, Increased Sales
B. Buycotts, Boycotts, Digital technology, Decreased Sales
C. Boycotts, Buycotts, Digital technology, Decreased Sales
D. Buycotts, Boycotts, Blockchain technology, Charitable donations
E. Boycotts, Buyalls, Blockchain technology, Charitable donations
F. Boycotts, Buycotts, Digital technology, Increased Sales
G. Buycotts, Boycotts, Digital technology, Increased Sales
H. Boycotts, Buycotts, Physical technology, Increased Sales
I. Buycotts, Buyalls, Blockchain technology, Charitable

In [15]:
# Format the question and options
input_text: str = format_example(question_text, options)
print("Task intro + COT examples + question:\n")
print(instruction + input_text)

Task intro + COT examples + question:

The following are multiple choice questions (with answers) about business. Think step by step and then output the answer in the format of "The answer is (X)" at the end.

Question: In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .
Options: A. Boycotts, Buyalls, Blockchain technology, Increased Sales
B. Buycotts, Boycotts, Digital technology, Decreased Sales
C. Boycotts, Buycotts, Digital technology, Decreased Sales
D. Buycotts, Boycotts, Blockchain technology, Charitable donations
E. Boycotts, Buyalls, Blockchain technology, Charitable donations
F. Boycotts, Buycotts, Digital technology, Increased Sales
G. Buycotts, Boycotts, Digital technology, Increased Sales
H. Boycotts, Buycotts, Physical technology, Increased Sales
I. Buycotts, Buyalls, Blockchain technology,

In [16]:
# integrated prompt builder 
final_prompt = prompt_mmlu_pro(
    bench=bench_ds,
    question=question,
    prompt_type="cot"
)
print(final_prompt)

The following are multiple choice questions (with answers) about business. Think step by step and then output the answer in the format of "The answer is (X)" at the end.

Question: In contrast to _______, _______ aim to reward favourable behaviour by companies. The success of such campaigns have been heightened through the use of ___________, which allow campaigns to facilitate the company in achieving _________ .
Options: A. Boycotts, Buyalls, Blockchain technology, Increased Sales
B. Buycotts, Boycotts, Digital technology, Decreased Sales
C. Boycotts, Buycotts, Digital technology, Decreased Sales
D. Buycotts, Boycotts, Blockchain technology, Charitable donations
E. Boycotts, Buyalls, Blockchain technology, Charitable donations
F. Boycotts, Buycotts, Digital technology, Increased Sales
G. Buycotts, Boycotts, Digital technology, Increased Sales
H. Boycotts, Buycotts, Physical technology, Increased Sales
I. Buycotts, Buyalls, Blockchain technology, Charitable donations
J. Boycotts, Buyc

## Step 3b - Query the model on a single question

In [17]:
model: str = "gpt-oss:latest"
params: dict[str, Any] = {
    "temperature": 0.0,
    "max_tokens": 32768,
    "top_p": 0.95,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "logprobs": True, 
    "top_logprobs": 10, 
    "seed": 3, 
}

message_text = [{"role": "user", "content": final_prompt}]

response = client.chat(message_text, **params)

In [18]:
token_usage = response.usage
print(f"Total tokens: {token_usage.total_tokens}")
print(f"  Prompt tokens: {token_usage.prompt_tokens}")
print(f"  Completion tokens: {token_usage.completion_tokens}")

print("\nModel's answer:")
print(response.content)
print("\nModel's reasoning:")
print(response.reasoning)

Total tokens: 1520
  Prompt tokens: 1089
  Completion tokens: 431

Model's answer:
The answer is (I)

Model's reasoning:
We need to answer the last question: "Typical advertising regulatory bodies suggest, for example that adverts must not: encourage _________, cause unnecessary ________ or _____, and must not cause _______ offence."

We need to fill blanks: encourage _________, cause unnecessary ________ or _____, and must not cause ____ offence.

We need to choose from options A-I. Let's analyze.

Advertising regulatory bodies (e.g., ASA in UK, FTC in US) guidelines: adverts must not encourage unsafe practices, cause unnecessary distress or fear, and must not cause serious offence. Or maybe "encourage unsafe practices, cause unnecessary distress or fear, and must not cause serious offence." Let's check each option.

Option A: Safe practices, Fear, Jealousy, Trivial. That would be "encourage safe practices" - that seems wrong; they should not encourage unsafe practices. So A wrong.

O

## Step 4 - Run the evaluation loop
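This is the minimal end-to-end flow: for each subject, build the prompts, run the model, and collect the predictions. Setting `limit_per_subject=3` keeps the run to a quick smoke test.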


In [None]:
subjects: list[str] = ["business", "math"]
params: dict[str, Any] = {
    "temperature": 0.0,
    "max_tokens": 32768,
    "top_p": 1.0,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "seed": 3, 
}

results: list[EvalResult] = evaluate_model_on_mmlu_pro(
    llm_client=client,
    subjects=subjects,
    limit_per_subject=3, 
    prompt_type="cot",
    request_params=params,
    show_progress=False, 
)

## Step 5 - Compute accuracy
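We reuse the repo's extraction strategy (parsing the final `The answer is (X)` letter from each response) and compare it with the ground-truth answer to compute accuracy over the collected results.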


In [27]:
acc_dict: dict[str, float | int] = compute_accuracy(results)

print(f"Accuracy: {acc_dict['accuracy']:.2%}")


Accuracy: 83.33%
