# Using Ollama to evaluate our instruction responses

The most practical way to assess the quality of our instruction-following model is to have a larger, more capable model judge its responses.

I'm keeping this code in its own notebook so that I _don't_ have to import `torch` and everything else. I'm going to assume a `model_output.json` file was already generated by `chat.ipynb`.

In [20]:
import json
from typing import TypedDict
import psutil
import urllib.request
from urllib.error import URLError
import tqdm

In [21]:
# this is almost the same as the one defined in chat.ipynb, but it has model_output as well.
class InstructionExampleResponse(TypedDict):
    instruction: str      # A description of the task to be performed
    input: str            # Optional parameter for the task
    output: str           # The expected result of performing the task
    model_output: str     # The actual result of performing the task
    assessment: str|None  # The assessment provided by the judge LLM, if any
    score: int|None       # The score assigned by the judge LLM, if any


In [22]:
def check_if_running(process_name: str) -> bool:
    # proc.info["name"] can be None when psutil is denied access to a process
    return any(
        process_name in (proc.info["name"] or "")
        for proc in psutil.process_iter(["name"])
    )

ollama_running = check_if_running("ollama")
if not ollama_running:
    raise RuntimeError(
        "Ollama is not running. Launch ollama before proceeding."
    )

print("Ollama is running.")

Ollama is running.


In [23]:
def query_model(
        prompt: str,
        model:str="qwen3:4b",
        url:str="http://localhost:11434/api/chat",
) -> str:
    data = {
        "model": model,
        "messages": [
            {"role": "user", "content": prompt}
        ],
        "options": {
            "seed": 123,
            "temperature": 0,
            "num_ctx": 2048
        }
    }

    payload = json.dumps(data).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST"
    )
    request.add_header("Content-Type", "application/json")

    response_data = ""
    try:
        with urllib.request.urlopen(request, timeout=60) as response:
            while True:
                line = response.readline().decode("utf-8")
                if not line:
                    break
                response_json = json.loads(line)
                response_data += response_json["message"]["content"]
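    except (URLError, TimeoutError) as error:
        # URLError covers connection failures as well as timeouts; a timeout
        # while reading the stream raises TimeoutError instead.
        print(f"Request failed ({error}):")
        print("---")
        print(prompt)
        print("---")
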
    return response_data
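
`query_model` reads the reply as a stream of JSON lines. As a minimal sketch, assuming the `/api/chat` endpoint accepts `"stream": false` as described in Ollama's API documentation, the request can ask for a single JSON object instead. `build_chat_payload` and `extract_content` are hypothetical helper names, not part of this notebook.

```python
import json


def build_chat_payload(prompt: str, model: str = "qwen3:4b") -> bytes:
    """Encode a single-turn chat request that asks for one non-streamed reply."""
    data = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: with stream disabled the server returns one JSON object
        # rather than one object per line.
        "stream": False,
        "options": {"seed": 123, "temperature": 0, "num_ctx": 2048},
    }
    return json.dumps(data).encode("utf-8")


def extract_content(body: bytes) -> str:
    """Pull the assistant's text out of a non-streamed response body."""
    return json.loads(body.decode("utf-8"))["message"]["content"]
```

With these helpers the read loop would reduce to a single `extract_content(response.read())` call. The trade-off is that a timeout loses the whole reply rather than keeping whatever was streamed so far.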

In [24]:
def response_to_score(response: str) -> int:
    # strip first so a trailing newline doesn't leave an empty last line
    lines = response.strip().split("\n")
    last_line = lines[-1].strip()
    # despite my careful prompting, the model will often reply with "Score: 20"
    last_word = last_line.split(" ")[-1]
    try:
        return int(last_word)
    except ValueError:
        # unparseable responses are scored 0, which drags the average down
        print("Invalid response\n")
        print(response)
        return 0

def judge_example(example: InstructionExampleResponse) -> InstructionExampleResponse:
    preface = ("Below you'll find the output from a new LLM. Please "
        "evaluate the quality of the \"model_output\" compared to the reference "
        "\"output\". Explain the rationale for your judgment, then end with a new line "
        "containing only an integer score out of 100. The final line should contain only "
        "a single integer. Factually incorrect responses must be scored below 50, with higher scores indicating that the answer is close to the truth. \n"
        "Ungrammatical responses must be scored below 50, with lower scores indicating nonsense.\n"
        "Other than that, please judge based on how close the response is to being complete, accurate, and relevant to the prompt.\n"
        "If the model_output is functionally identical to the reference output, the score should be 100.\n"
        "Please note that the last line of your response must be a single integer with no other characters.\n"
        "Example:\n\n"
        "The model's response incorrectly identifies the capital of Poland as Portland, but it is otherwise grammatically correct.\n"
        "While Portland is a city, it is not in Poland and it is not a capital city.\n25\n\n"
        "Remember that your score will be parsed with the assumption that the final line contains only a single integer value.\n"
        "A valid score looks like:\n85\n\n"
        "An invalid score looks like:\n**100**\n"
        "or:\nScore: 100 out of 100\n"
        'or: \\box{100}\n\n'
        "Result to evaluate:\n")
    prompt = preface + json.dumps(example)
    response = query_model(prompt)
    score = response_to_score(response)
    result = example.copy()
    result['assessment'] = response
    result['score'] = score
    return result


In [25]:
with open('small_training_output.json', 'r') as f:
    examples = json.load(f)

examples[1]
# judge_example(examples[45])

{'instruction': 'What type of cloud is typically associated with thunderstorms?',
 'input': '',
 'output': 'The type of cloud typically associated with thunderstorms is cumulonimbus.',
 'model_output': 'A thunderstorm is a thunderstorm that produces a significant wind gust.'}

In [26]:
def score_examples(examples: list[InstructionExampleResponse], out_path:str):
    for example in tqdm.tqdm(examples, desc="Scoring examples"):
        result = judge_example(example)
        with open(out_path, 'a') as f:
            f.write(json.dumps(result))
            f.write("\n")
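
Because `score_examples` appends one line per example, an interrupted run can in principle pick up where it left off. The sketch below assumes results are always written in the same order as `examples`, and that the output file contains only results from the same input. `count_graded` and `pending_examples` are hypothetical helpers.

```python
import os


def count_graded(out_path: str) -> int:
    """Count the results already written to out_path (0 if it doesn't exist)."""
    if not os.path.exists(out_path):
        return 0
    with open(out_path, "r") as f:
        return sum(1 for line in f if line.strip())


def pending_examples(examples: list, out_path: str) -> list:
    """Return the examples that have not been graded yet."""
    return examples[count_graded(out_path):]
```

Calling `score_examples(pending_examples(examples, out_path), out_path)` would then skip the work that is already on disk.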

In [27]:
score_examples(examples, 'graded_small_training.lsv')

Scoring examples:  86%|████████▋ | 95/110 [39:40<1:37:36, 390.44s/it]

Invalid response

<think>
Okay, let's evaluate the model's response. The question is about the boiling point of chlorine in Celsius. The reference output says -34°C, but the model output says 37.5°C. 

First, I need to check the correct boiling point of chlorine. From what I remember, chlorine is a diatomic molecule (Cl₂) and its boiling point is indeed around -34°C. The model's answer of 37.5°C is way off. That's probably a mistake. 

Wait, maybe I should verify this. Let me think. Chlorine is a green gas at room temperature. Its boiling point is below zero, so -34°C makes sense. 37.5°C is actually the boiling point of water, which is way higher. So the model's answer is incorrect. 

The response is factually wrong, so according to the instructions, factually incorrect responses must be scored below 50. The model's answer is not only wrong but also unrelated to the correct value. The reference is correct, so the model's answer is a clear error. 

The response is grammatically correct,

Scoring examples: 100%|██████████| 110/110 [42:16<00:00, 23.06s/it]  


In [28]:
def average_score(results_file: str) -> float:
    # the file holds one JSON object per line (JSON Lines), not tab-separated values
    with open(results_file, 'r') as f:
        results = [json.loads(line) for line in f if line.strip()]
    scores = [x['score'] for x in results]
    return sum(scores) / len(scores)
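
Since `response_to_score` returns 0 for a response it cannot parse, a plain mean mixes real grades with parse failures. As a rough sketch, a summary that reports the zeros separately makes that visible. `summarize_scores` is a hypothetical helper, and it treats every 0 as a likely parse failure, which is an assumption: the judge could legitimately award 0.

```python
from statistics import mean, median


def summarize_scores(scores: list[int]) -> dict[str, float]:
    """Summarize judge scores, reporting zero scores separately."""
    nonzero = [s for s in scores if s != 0]
    return {
        "mean": mean(scores),
        "median": median(scores),
        # Assumption: a score of 0 usually marks an unparseable response.
        "zero_count": len(scores) - len(nonzero),
        "mean_excluding_zeros": mean(nonzero) if nonzero else 0.0,
    }
```

`statistics.mean` raises `StatisticsError` on an empty list, so this expects at least one graded example.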

In [31]:
def scroll_results(results_file: str):
    with open(results_file, 'r') as f:
        lines = f.readlines()
    results: list[InstructionExampleResponse] = [json.loads(x) for x in lines]
    for result in results:
        print(f"Instruction: {result['instruction']}")
        if result['input']:
            print(f"Input: {result['input']}")
        print(f"Output: {result['model_output']}")
        print(f"Score: {result['score']}")
        print("-------------")

In [32]:
scroll_results('graded_small_training.lsv')

Instruction: Rewrite the sentence using a simile.
Input: The car is very fast.
Output: The car is very slow.
Score: 25
-------------
Instruction: What type of cloud is typically associated with thunderstorms?
Output: A thunderstorm is a thunderstorm that produces a significant wind gust.
Score: 25
-------------
Instruction: Name the author of 'Pride and Prejudice'.
Output: The author is J.K. Rowling.
Score: 25
-------------
Instruction: What is the periodic symbol for chlorine?
Output: The periodic symbol for chlorine is C.
Score: 25
-------------
Instruction: Correct the punctuation in the sentence.
Input: Its time to go home.
Output: The punctuation in the sentence should be 'their time to go home' and not 'their time'.
Score: 25
-------------
Instruction: Rewrite the sentence.
Input: The lecture was delivered in a clear manner.
Output: The lecture was delivered in a loud manner.
Score: 25
-------------
Instruction: Generate a humorous anecdote.
Output: I was reading the dictionary a

In [33]:
scroll_results('graded_355m_alpaca.lsv')

Instruction: Rewrite the sentence using a simile.
Input: The car is very fast.
Output: The car is like a cheetah sprinting through the jungle.
Score: 90
-------------
Instruction: What type of cloud is typically associated with thunderstorms?
Output: A typical cloud typically associated with thunderstorms is a layer of smoke and dust.
Score: 25
-------------
Instruction: Name the author of 'Pride and Prejudice'.
Output: The author of 'Pride and Prejudice' is William Shakespeare.
Score: 25
-------------
Instruction: What is the periodic symbol for chlorine?
Output: The periodic symbol for chlorine is H2S2.
Score: 25
-------------
Instruction: Correct the punctuation in the sentence.
Input: Its time to go home.
Output: It's time to go home.
Score: 100
-------------
Instruction: Rewrite the sentence.
Input: The lecture was delivered in a clear manner.
Output: The lecture was delivered in a clear manner.
Score: 50
-------------
Instruction: Generate a humorous anecdote.
Output: Once a man 