# GPQA Evaluation (Ollama)

This notebook mirrors the closed-book evaluation flow of the official **GPQA** repo, implemented with the shared helpers in `llm_wc`.

**Pipeline**
1. Load GPQA questions (main or diamond) and shuffle answer choices deterministically.
2. Build prompts (zero-shot, chain-of-thought, or 5-shot).
3. Query the model and parse the multiple-choice answer.
4. Run the evaluation loop and compute accuracy.


## Step 0 - Configuration
Adjust these settings for your model, dataset path, and prompt type.


In [1]:
%run _dev_setup.py


🔁 Autoreload is ON (IPython detected).
✅ Using llm_wc from: /home/iamsikun/research/llm-wc/src/llm_wc


In [2]:
import os
import json
from pathlib import Path
from pprint import pprint
from typing import Any


In [3]:
from llm_wc.gpqa import (
    GPQA_CHOICES,
    GPQA_DATASET,
    GPQA_DEFAULT_COT_EXAMPLES,
    GPQA_DEFAULT_PROMPT_TYPE,
    GPQA_DEFAULT_SPLIT,
    load_gpqa_benchmark,
    prompt_gpqa,
    evaluate_model_on_gpqa,
)
from llm_wc.client import ClientConfig, build_client
from llm_wc.core import compute_accuracy
from llm_wc.core.mcqa import extract_choice_answer


In [4]:
OLLAMA_URL = 'http://localhost:11434/v1'
MODEL_NAME = 'gpt-oss:latest'

# Ollama ignores the key, but the client expects it
os.environ.setdefault('OPENAI_API_KEY', 'ollama')

# GPQA config
DATASET_PATH = GPQA_DATASET  # or local CSV directory
SUBSET = 'main'  # 'diamond' for GPQA-Diamond
SPLIT = GPQA_DEFAULT_SPLIT
PROMPT_TYPE = GPQA_DEFAULT_PROMPT_TYPE  # zero_shot | chain_of_thought | 5_shot
COT_EXAMPLES_PATH = GPQA_DEFAULT_COT_EXAMPLES

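SEED = 12345

# Cap on the number of questions to evaluate; None is assumed to mean "no limit"
LIMIT_QUESTIONS = None

# Hugging Face auth for the gated dataset (see "Gated dataset access" below)
HF_TOKEN_ENV = 'HF_TOKEN'  # name of the env var holding the token (assumed default)
HF_TOKEN = None  # set to a token string for an explicit override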

# Where to save raw predictions
OUTPUT_DIR = Path('eval_results/ollama_gpqa')


### Gated dataset access
GPQA is gated on Hugging Face. You can either authenticate via `huggingface-cli login` or export a token in an env var (recommended).

- Set `HF_TOKEN_ENV` to the env var name that stores your token, or
- Assign `HF_TOKEN` directly for an explicit override.

These values are passed to `load_gpqa_benchmark()` and ultimately to `datasets.load_dataset()`.
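
The precedence described above (an explicit `HF_TOKEN` wins over the variable named by `HF_TOKEN_ENV`) can be sketched as follows. This is a minimal illustration, not the `llm_wc` implementation: `resolve_hf_token` is a hypothetical helper, and the real loader may resolve the token differently.

```python
import os


def resolve_hf_token(hf_token=None, hf_token_env=None):
    """Hypothetical helper: pick the token to hand to the dataset loader.

    An explicit token takes priority; otherwise the named environment
    variable is consulted. Returns None when neither yields a value, in
    which case a cached `huggingface-cli login` session would be needed.
    """
    if hf_token:
        return hf_token
    if hf_token_env:
        # Treat an unset or empty variable the same way: no token.
        return os.environ.get(hf_token_env) or None
    return None
```

Returning `None` rather than raising keeps the `huggingface-cli login` route usable when no token is configured in the notebook.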


## Step 1 - Load the dataset
The loader supports local CSVs (e.g., `gpqa_main.csv` / `gpqa_diamond.csv`) or the Hugging Face dataset name.


In [5]:
dataset = load_gpqa_benchmark(
    dataset_path=DATASET_PATH,
    subset=SUBSET,
    split=SPLIT,
    seed=SEED,
    max_examples=LIMIT_QUESTIONS,
    hf_token=HF_TOKEN,
    hf_token_env=HF_TOKEN_ENV,
)
len(dataset.questions)


448

In [6]:
example = dataset.questions[0]
print(f'Question id: {example.id}')
print(f'Question: {example.question}')
print('Choices:')
for letter, choice in example.choices.items():
    print(f'  {letter}: {choice}')
print(f'Correct letter: {example.answer}')


Question id: 0
Question: A large gene has dozens of exons, of which the central ones code for folded triple helical repeats that connect the cytoskeleton with sarcolemma and extracellular space. Each exon usually codes for one folded triple alpha helix. The most common mutations of the gene are central exon deletions that create out-of-frame peptides and progressive degenerative organ waste. A solution is to deliver a Morpholino that recognizes the 5' end of the out-of-frame exon in pre-mRNA. The molecule prevents binding of the spliceosome and creates exon skipping and in-frame joining. Several missing exons are well tolerated by an organism. Which structure below is not involved in the proposed therapy?
Choices:
  0: polyA tail
  1: lariat
  2: antisense
  3: R-loops
Correct choice index: 3
Correct letter: D
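
To make the "deterministic shuffle" in Step 1 concrete, here is a minimal sketch of one way to do it. It is an assumption-laden illustration: `shuffle_choices` and the way the per-question seed is derived are hypothetical, and the order produced here will not necessarily match what `load_gpqa_benchmark` or the official GPQA code produces.

```python
import random

LETTERS = "ABCD"


def shuffle_choices(correct, incorrect, seed, question_index):
    """Hypothetical helper: shuffle one question's options reproducibly.

    The RNG is derived from (seed, question_index), so the same inputs
    always give the same order regardless of evaluation order.
    Assumes the correct option's text is distinct from the distractors.
    """
    rng = random.Random(seed * 1_000_003 + question_index)
    options = [correct] + list(incorrect)
    rng.shuffle(options)
    choices = dict(zip(LETTERS, options))
    answer = LETTERS[options.index(correct)]
    return choices, answer


choices, answer = shuffle_choices(
    "R-loops", ["polyA tail", "lariat", "antisense"], seed=12345, question_index=0
)
print(choices, answer)
```

Seeding per question, rather than once for the whole dataset, means that limiting or reordering the questions does not change any individual question's option order.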


## Step 2 - Build LLM client


In [7]:
client_cfg = ClientConfig(
    provider="openai",
    model=MODEL_NAME,
    api_base=OLLAMA_URL,
    api_key="ollama",
)
client = build_client(client_cfg)


## Step 3 - Build the prompt
Use `PROMPT_TYPE` to swap between the official GPQA prompt styles.


In [9]:
prompt = prompt_gpqa(
    dataset,
    example,
    prompt_type=PROMPT_TYPE,
    cot_examples_path=COT_EXAMPLES_PATH,
)
print(prompt[:1200])


FileNotFoundError: [Errno 2] No such file or directory: 'resources/gpqa/chain_of_thought_examples.json'
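
For orientation, a zero-shot prompt needs no examples file at all. The sketch below assembles one from a question and its lettered choices. `build_zero_shot_prompt` is a hypothetical stand-in for `prompt_gpqa`, and the wording is only loosely modeled on the GPQA zero-shot style; the text `llm_wc` actually emits may differ.

```python
def build_zero_shot_prompt(question, choices):
    """Hypothetical helper: format a question and lettered choices as one prompt.

    `choices` maps letters to option text, e.g. {"A": "...", "B": "..."}.
    """
    lines = [
        f"What is the correct answer to this question: {question}",
        "",
        "Choices:",
    ]
    lines += [f"({letter}) {text}" for letter, text in choices.items()]
    lines += [
        "",
        'Format your response as follows: "The correct answer is (insert answer here)"',
    ]
    return "\n".join(lines)


demo = build_zero_shot_prompt(
    "Which structure is not involved in the proposed therapy?",
    {"A": "polyA tail", "B": "lariat", "C": "antisense", "D": "R-loops"},
)
print(demo)
```

Asking for a fixed response format in the prompt is what makes the answer-letter parsing in the next step tractable.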

## Step 4 - Query the model
We request a model completion for a single GPQA prompt, then parse the answer letter.


In [None]:
params: dict[str, Any] = {
    'temperature': 0.0,
    'max_tokens': 32768,
    'top_p': 0.95,
    'frequency_penalty': 0.0,
    'presence_penalty': 0.0,
    'logprobs': True,
    'top_logprobs': 10,
    'seed': 3,
}

message_text = [{
    'role': 'user',
    'content': prompt
}]

completion = client.chat.completions.create(
    model=MODEL_NAME,
    messages=message_text,
    **params,
)


In [None]:
token_usage = completion.usage
print(f'Total tokens: {token_usage.total_tokens}')
print(f'  Prompt tokens: {token_usage.prompt_tokens}')
print(f'  Completion tokens: {token_usage.completion_tokens}')

message = completion.choices[0].message
print('Model answer:')
print(message.content)
print('Model reasoning:')
# `reasoning` is a non-standard field that only some backends return
print(getattr(message, 'reasoning', None))


In [None]:
parsed = extract_choice_answer(message.content, choices=GPQA_CHOICES)
print(f'Parsed answer: {parsed.choice}')
print(f'Matched pattern: {parsed.matched_pattern}')
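
The internals of `extract_choice_answer` are not shown in this notebook. As a rough picture of what answer-letter parsing involves, here is a minimal regex-based sketch. `extract_letter` and its two patterns are assumptions made for illustration, not the patterns `llm_wc` uses.

```python
import re


def extract_letter(text, letters="ABCD"):
    """Hypothetical helper: return the last answer letter stated in `text`.

    Patterns are tried in priority order; within a pattern the last match
    wins, so a model that revises its answer is scored on the final one.
    Returns None when nothing matches.
    """
    patterns = [
        rf"answer is \(?([{letters}])\b",       # "The correct answer is (D)"
        rf"[Aa]nswer:\s*\(?([{letters}])\b",    # "Answer: B"
    ]
    for pattern in patterns:
        matches = re.findall(pattern, text)
        if matches:
            return matches[-1]
    return None


print(extract_letter("Some reasoning... The correct answer is (D)"))
```

The trailing `\b` keeps words such as "Definitely" from being read as the letter D. Responses that match no pattern are usually counted as incorrect rather than dropped, so reporting how many went unparsed alongside accuracy is worthwhile.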


## Step 5 - Run the evaluation loop
Now run the full evaluation (optionally capped with `LIMIT_QUESTIONS`) and score accuracy.


In [None]:
client_cfg = ClientConfig(
    provider="openai",
    model=MODEL_NAME,
    api_base=OLLAMA_URL,
    api_key="ollama",
)
client = build_client(client_cfg)

results = evaluate_model_on_gpqa(
    llm_client=client,
    dataset_path=DATASET_PATH,
    subset=SUBSET,
    split=SPLIT,
    seed=SEED,
    max_examples=LIMIT_QUESTIONS,
    question_ids=None,
    prompt_type=PROMPT_TYPE,
    cot_examples_path=COT_EXAMPLES_PATH,
    hf_token=HF_TOKEN,
    hf_token_env=HF_TOKEN_ENV,
    request_params=params,
    show_progress=False,
)
len(results)
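
Conceptually, the evaluation loop repeats Steps 3 and 4 for every question and records whether the parsed letter matches the gold letter. The sketch below shows that shape with a stub in place of the model. The names `run_eval`, `stub_model` and `extract_parenthesized`, and the record fields, are all hypothetical; `evaluate_model_on_gpqa` may structure its results differently.

```python
import re


def run_eval(questions, ask_model, extract):
    """Hypothetical loop: query the model per question and score the parsed letter."""
    results = []
    for q in questions:
        raw = ask_model(q["prompt"])
        predicted = extract(raw)
        results.append({
            "id": q["id"],
            "answer": q["answer"],
            "predicted": predicted,
            "correct": predicted == q["answer"],
            "raw": raw,
        })
    return results


def stub_model(prompt):
    # Stands in for a real chat-completion call; always answers D.
    return "The correct answer is (D)"


def extract_parenthesized(text):
    # Simplistic parser: take the last "(X)" with X in A-D.
    found = re.findall(r"\(([ABCD])\)", text)
    return found[-1] if found else None


questions = [
    {"id": 0, "prompt": "question 0 ...", "answer": "D"},
    {"id": 1, "prompt": "question 1 ...", "answer": "A"},
]
results_demo = run_eval(questions, stub_model, extract_parenthesized)
print([r["correct"] for r in results_demo])
```

Keeping the raw response in each record makes it possible to re-parse later with improved patterns without querying the model again.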


In [None]:
summary = compute_accuracy(results)
summary
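
The return shape of `compute_accuracy` is not shown here. A minimal summary over per-question records could look like the sketch below, which assumes each record is a dict with hypothetical `correct` and `predicted` fields.

```python
def summarize(records):
    """Hypothetical summary: accuracy plus the number of unparsed responses."""
    total = len(records)
    correct = sum(1 for r in records if r["correct"])
    unparsed = sum(1 for r in records if r["predicted"] is None)
    return {
        "total": total,
        "correct": correct,
        "unparsed": unparsed,
        # Guard against an empty run instead of dividing by zero.
        "accuracy": correct / total if total else 0.0,
    }


demo_records = [
    {"predicted": "D", "correct": True},
    {"predicted": "B", "correct": False},
    {"predicted": None, "correct": False},
    {"predicted": "A", "correct": True},
]
print(summarize(demo_records))
```

With four answer options, uniform guessing yields about 25% accuracy, which is a useful floor when reading the summary.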


## Step 6 - Save results
Persist raw predictions and the summary stats for later analysis.


In [None]:
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
(OUTPUT_DIR / 'results.json').write_text(json.dumps(results, indent=2))
(OUTPUT_DIR / 'summary.json').write_text(json.dumps(summary, indent=2))
OUTPUT_DIR
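
`json.dumps` raises `TypeError` on objects it cannot serialize, which matters if the results turn out to be dataclass instances rather than plain dicts (their actual type is not shown in this notebook). The sketch below is one way to handle that case. `to_jsonable`, `save_json` and `DemoResult` are hypothetical names, and the demo writes to a temporary directory rather than `OUTPUT_DIR`.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, is_dataclass
from pathlib import Path


def to_jsonable(obj):
    """Fallback for json.dumps: convert dataclass instances and paths."""
    if is_dataclass(obj) and not isinstance(obj, type):
        return asdict(obj)
    if isinstance(obj, Path):
        return str(obj)
    raise TypeError(f"Cannot serialize {type(obj).__name__}")


def save_json(path, payload):
    """Write `payload` as indented JSON, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, default=to_jsonable))


@dataclass
class DemoResult:
    # Stand-in record type, used only for this illustration.
    id: int
    predicted: str
    correct: bool


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "nested" / "results.json"
    save_json(out_path, [DemoResult(0, "D", True)])
    loaded = json.loads(out_path.read_text())

print(loaded)
```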


---
## Notes
- Use `SUBSET = 'diamond'` to evaluate the GPQA Diamond subset.
- `PROMPT_TYPE` supports `zero_shot`, `chain_of_thought`, and `5_shot`, matching the official GPQA closed-book baselines.
- The choice order is shuffled with `SEED` to match the GPQA evaluation setup.
- The dataset loader supports either Hugging Face (`idavidrein/gpqa`) or local CSVs.
