# Scorers
## Introduction

## Installation

In [1]:
%pip install -q openai anthropic ipywidgets colorama
import os
os.environ['XDG_RUNTIME_DIR'] = "/tmp"

from helpers.reporter.pretty import pretty_results

Note: you may need to restart the kernel to use updated packages.


## Scorer
- Once the solvers have done their work, we want to check the results.
- A `scorer` takes the output after all the `solvers` are finished.
- Inspect AI has many scorers built in: <https://inspect.aisi.org.uk/scorers.html#built-in-scorers>
- Here we use `includes` and `match` to check the output of the `generate` solver.


In [2]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, match
from textwrap import dedent

@task
def include_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
            target="console.log(",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            includes(ignore_case=True), # takes the target and checks if it is included in the generated code
            match(location="begin", numeric=False)
        ]
    )

results = eval(include_solver,log_level="info",display="none")
print(pretty_results(results))

Output()

Status: success Model: openai/gpt-4o-mini
input : Generate a javascript file name hello-world.js that prints out hello
target: console.log(
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here is the content for a JavaScript file named `hello-world.js` that prints out "hello":

```javascript
// hello-world.js

console.log("hello");
```

You can create this file by copying the above code into a text editor and saving it with the name `hello-world.js`. When you run this file using Node.js or in a browser console, it will print "hello" to the output.
Scorer[includes][VALUE]: C
Scorer[includes][EXPLANA
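
As a rough illustration of what the two scorers check, the sketch below approximates them in plain Python. The function names `includes_check` and `match_begin_check` are hypothetical stand-ins, not part of Inspect, and the real scorers apply more normalisation than this. The completion string is an abbreviated copy of the transcript above.

```python
def includes_check(completion: str, target: str, ignore_case: bool = True) -> bool:
    """Approximate `includes`: is the target a substring of the completion?"""
    if ignore_case:
        completion, target = completion.lower(), target.lower()
    return target in completion


def match_begin_check(completion: str, target: str) -> bool:
    """Approximate `match(location="begin")`: does the completion start with the target?"""
    return completion.strip().lower().startswith(target.strip().lower())


# Abbreviated copy of the assistant reply from the transcript above.
completion = (
    "Here is the content for a JavaScript file named hello-world.js:\n\n"
    'console.log("hello");'
)
target = "console.log("

print(includes_check(completion, target))     # substring check
print(match_begin_check(completion, target))  # prefix check
```

Under this simplified reading, the substring check passes while the prefix check does not, because the reply opens with a sentence rather than with the code itself.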

## Model QA Grader
- We can also ask an LLM to grade the result that was generated.
- This is the `model_graded_qa` scorer: it looks at the output of the solvers and assesses it against the original request in the Sample input.

In [3]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_qa
from textwrap import dedent

@task
def model_graded_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_qa()
        ]
    )

results = eval(model_graded_solver,log_level="info",display="none")
print(pretty_results(results))

Output()

Status: success Model: openai/gpt-4o-mini
input : Generate a javascript file name hello-world.js that prints out hello
target: 
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here is the content for the `hello-world.js` file that prints out "hello":

```javascript
// hello-world.js
console.log("hello");
```

Make sure to create a file named `hello-world.js` and place the above code inside it. You can run this JavaScript file using Node.js with the command:

```
node hello-world.js
``` 

This will print "hello" to the console.
Scorer[model_graded_qa][VALUE]: C
Scorer[model_graded_qa][EXPLANATION]: 

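- We can print the prompt template that the grader uses.
- The `question` is the original input in the Sample.
- The `answer` is the model output after all the solvers have run, and the `criterion` is filled from the Sample `target` (empty in the example above).
- So the grader judges the generated answer against the original question.
- Internally, the scorer asks the grader model to finish with `GRADE: C` or `GRADE: I` and extracts that letter with a regular expression.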

In [4]:
from inspect_ai.scorer._model import DEFAULT_GRADE_PATTERN, DEFAULT_MODEL_GRADED_QA_TEMPLATE, default_instructions
print("DEFAULT_GRADE_PATTERN: ",DEFAULT_GRADE_PATTERN)
print("DEFAULT_MODEL_GRADED_QA_TEMPLATE: ",DEFAULT_MODEL_GRADED_QA_TEMPLATE)
print("default_instructions: ",default_instructions(partial_credit=False))

DEFAULT_GRADE_PATTERN:  (?i)GRADE\s*:\s*([CPI])(.*)$
DEFAULT_MODEL_GRADED_QA_TEMPLATE:  
You are assessing a submitted answer on a given task based on a criterion. Here is the data:

[BEGIN DATA]
***
[Task]: {question}
***
[Submission]: {answer}
***
[Criterion]: {criterion}
***
[END DATA]

Does the submission meet the criterion?

{instructions}

default_instructions:  
After assessing the submitted answer, reply with 'GRADE: $LETTER' (without quotes) where LETTER is one of CI.  Please choose ONE option for the grade: either "C" for correct answers, or "I" for incorrect answers.

For example, after reviewing a correct answer you might write 'GRADE: C' or after reviewing an incorrect answer you might write 'GRADE: I'.

First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then, end with your answer formatted as 'GRADE: $LETTER' (without quotes) where LETTER is one of
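
To make the grading mechanism concrete, here is a minimal sketch of the two steps: filling the template placeholders and pulling the grade letter out of the grader's reply. The regular expression is copied from the printed `DEFAULT_GRADE_PATTERN`; `TEMPLATE` is a shortened stand-in for the printed template, and `build_grader_prompt`, `extract_grade`, and the sample reply are hypothetical names and data used only for illustration, not Inspect's internals.

```python
import re
from typing import Optional

# Copied from the DEFAULT_GRADE_PATTERN output above.
GRADE_PATTERN = r"(?i)GRADE\s*:\s*([CPI])(.*)$"

# Shortened stand-in for the printed template; the placeholders are the same.
TEMPLATE = (
    "[Task]: {question}\n"
    "[Submission]: {answer}\n"
    "[Criterion]: {criterion}\n\n"
    "Does the submission meet the criterion?\n\n"
    "{instructions}"
)


def build_grader_prompt(question: str, answer: str, criterion: str, instructions: str) -> str:
    """Fill every placeholder of the template in one step."""
    return TEMPLATE.format(
        question=question, answer=answer, criterion=criterion, instructions=instructions
    )


def extract_grade(reply: str) -> Optional[str]:
    """Return the grade letter found at the end of the reply, or None if absent."""
    found = re.search(GRADE_PATTERN, reply)
    return found.group(1).upper() if found else None


prompt = build_grader_prompt(
    question="Generate a javascript file name hello-world.js that prints out hello",
    answer='console.log("hello");',
    criterion="",  # no target was set in the Sample above
    instructions="Reply with 'GRADE: $LETTER' where LETTER is one of CI.",
)

# Hypothetical grader reply: reasoning first, grade on the last line.
reply = "The submission prints hello as requested.\nGRADE: C"

print(prompt)
print(extract_grade(reply))
```

Because the pattern ends in `$` and `.` does not match a newline, the grade is only found when it sits on the last line of the reply, which is why the instructions ask the grader to end with it.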

## Different grader model
- You can also specify a different model as the grader, so the answer is judged by a model other than the one that generated it.

In [5]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_qa
from textwrap import dedent

@task
def model_graded_solver_o3() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_qa(model="openai/o3-mini"),  # graded by o3-mini
            model_graded_qa(model="openai/o3")  # graded by o3
        ]
    )

results = eval(model_graded_solver_o3,log_level="info",display="none")
print(pretty_results(results))

Output()

Status: success Model: openai/gpt-4o-mini
input : Generate a javascript file name hello-world.js that prints out hello
target: 
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here is the code for a JavaScript file named `hello-world.js` that prints out "hello":

```javascript
// hello-world.js
console.log("hello");
```

To create the file, you can simply copy the code above into a file named `hello-world.js`. When you run this file using Node.js or in a browser console, it will print "hello" to the output.
Scorer[model_graded_qa][VALUE]: C
Scorer[model_graded_qa][EXPLANATION]: Step 1: The task req

## Fact grader
- Now, instead of using only the initial question to judge the result, we can set a `target` in the Sample.
- The `model_graded_fact` scorer takes that target as the `expert` answer and checks whether the submission contains it.

In [6]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_fact
from textwrap import dedent

@task
def model_graded_fact_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
            target="The code block should include a console.log statement that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_fact() #takes the target as the fact
        ]
    )

results = eval(model_graded_fact_solver,log_level="info",display="none")
print(pretty_results(results))

Output()

Status: success Model: openai/gpt-4o-mini
input : Generate a javascript file name hello-world.js that prints out hello
target: The code block should include a console.log statement that prints out hello
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here is a simple JavaScript file named `hello-world.js` that prints out "hello":

```javascript
// hello-world.js
console.log("hello");
```

You can create a file named `hello-world.js` and copy the above code into it. Running this file will output "hello" to the console.
Scorer[model_graded_fact][VALUE]: C
Scorer[model_graded_fact][EXPLANATION]: To de

## Judge scoring
- `model_graded_fact` takes its expert answer only from the Sample `target`.
- We can use a simple helper function that fills the `{criterion}` placeholder of the prompt template with our own criteria.

In [7]:
from inspect_ai.scorer._model import DEFAULT_MODEL_GRADED_FACT_TEMPLATE

def judge_criteria(criteria):
    return DEFAULT_MODEL_GRADED_FACT_TEMPLATE.replace("{criterion}", criteria)

print(judge_criteria("The code block should include a console.log statement"))


You are comparing a submitted answer to an expert answer on a given question. Here is the data:

[BEGIN DATA]
************
[Question]: {question}
************
[Expert]: The code block should include a console.log statement
************
[Submission]: {answer}
************
[END DATA]

Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.

Does the submission contain the content in the expert answer?

{instructions}
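
The helper uses `str.replace` rather than `str.format`, and the sketch below shows why that matters. `FACT_TEMPLATE` is a shortened stand-in for `DEFAULT_MODEL_GRADED_FACT_TEMPLATE` with the same placeholders, so the example runs without Inspect installed.

```python
# Shortened stand-in for DEFAULT_MODEL_GRADED_FACT_TEMPLATE (same placeholders).
FACT_TEMPLATE = (
    "[Question]: {question}\n"
    "[Expert]: {criterion}\n"
    "[Submission]: {answer}\n\n"
    "{instructions}"
)


def judge_criteria(criteria: str, template: str = FACT_TEMPLATE) -> str:
    """Fix the expert answer while leaving the other placeholders untouched."""
    return template.replace("{criterion}", criteria)


fixed = judge_criteria("The code block should include a console.log statement")
print(fixed)

# str.format needs every placeholder at once and fails otherwise.
try:
    FACT_TEMPLATE.format(criterion="The code block should include a console.log statement")
except KeyError as err:
    print("format() is missing:", err)
```

After the replacement, `{question}`, `{answer}`, and `{instructions}` are still present for the scorer to fill in later. One caveat, stated as an assumption about how the template is filled afterwards: criteria text that itself contains curly braces may need escaping.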



In [None]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_fact
from textwrap import dedent

@task
def judge_scorer() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
            target="The code block should include a console.log statement that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_fact(template = judge_criteria("The code block should include a console.log statement.")),
            model_graded_fact(template = judge_criteria("The code block should print out workshop.")),
        ]
    )

results = eval(judge_scorer,log_level="info",display="none")
print(pretty_results(results))

Output()

Status: success Model: openai/gpt-4o-mini
input : Generate a javascript file name hello-world.js that prints out hello
target: The code block should include a console.log statement that prints out hello
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Sure! Here is the content for a JavaScript file named `hello-world.js` that prints out "hello":

```javascript
// hello-world.js
console.log('hello');
```

You can create the file named `hello-world.js` and add the above code to it. When you run this script using Node.js or in a browser console, it will output "hello".
Scorer[model_graded_fact][VALUE]:

## Custom scorer

In [10]:
from inspect_ai.scorer import (
    Score,
    Target,
    accuracy,
    scorer,
    stderr,
)

from inspect_ai.solver import TaskState

# Scores the final completion: C if it starts with a code fence,
# P if it contains one somewhere, I otherwise.
@scorer(metrics=[accuracy(), stderr()])
def is_markdown():

    async def score(state: TaskState, target: Target):

        # the text of the model's final completion
        answer = state.output.completion

        result = "I"  # incorrect
        explanation = "no markdown backticks found"

        if "```" in answer:
            result = "P"  # partial
            explanation = "we got some backticks"

        if answer.startswith("```"):
            result = "C"  # correct
            explanation = "we start with backticks"

        # return score
        return Score(
            value=result,
            answer=answer,
            explanation=explanation
        )
    return score

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_fact
from textwrap import dedent

@task
def markdown_scorer() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
        )
    ]

    return Task(dataset=dataset,
        solver=[
            generate()
        ],
        scorer=[
            is_markdown(),
        ]
    )

results = eval(markdown_scorer,log_level="info",display="none")
print(pretty_results(results))

Output()

Status: success Model: openai/gpt-4o-mini
input : Generate a javascript file name hello-world.js that prints out hello
target: 
user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Sure! Below is the content you would put in a file named `hello-world.js` to print out "hello":

```javascript
// hello-world.js

console.log("hello");
```

To create this file, you can follow these steps:

1. Open your favorite text editor (like VSCode, Sublime Text, or even Notepad).
2. Copy the code snippet above.
3. Paste it into the editor.
4. Save the file as `hello-world.js`.

To run the file and see the output, you would typically use Node.js. Here’s how you can do that:

1. Open your terminal or command prompt.
2. Navigate to the directory where your `hello-world.js` file is located.
3. Run the command:

   ```bash
   node hello-world.js
   ```

You should see the output:

```
hello
```
Scorer[is_markdown][VALUE]: P
Scorer[is_markdown][
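
The decision logic inside `is_markdown` can be exercised without calling a model. The sketch below pulls it into a plain function; `classify_markdown` and `FENCE` are hypothetical names used only for illustration, and the three sample strings are made up for the check.

```python
from typing import Tuple

# Three backticks, built this way so the example is easy to embed in markdown.
FENCE = "`" * 3


def classify_markdown(answer: str) -> Tuple[str, str]:
    """Return (grade, explanation) using the same rules as the scorer above."""
    if answer.startswith(FENCE):
        return "C", "we start with backticks"
    if FENCE in answer:
        return "P", "we got some backticks"
    return "I", "no markdown backticks found"


samples = [
    "console.log('hello');",                                   # no fence at all
    "Sure! Here it is:\n" + FENCE + "javascript\n..." + FENCE,  # fence after some prose
    FENCE + "javascript\nconsole.log('hello');\n" + FENCE,      # fence at the very start
]

for sample in samples:
    print(classify_markdown(sample))
```

This matches the transcript above: the reply begins with "Sure!" before the first code fence, so it is graded `P` rather than `C`.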