# Inspect - an Eval Framework
## Introduction
- As we saw in the lesson on simple testing, changing the prompt slightly and sending it to the same LLM can give different results.
- Now imagine swapping one model for another: how do you make sure your code keeps working?
- Much as software engineers talk about Test-Driven Development, the LLM community talks about *Evals*.
- Evals can be seen as a test suite that checks the results across multiple prompts and LLMs.
- We've seen how this works by hand; luckily, frameworks exist that provide helpers, so we don't have to do it all ourselves.

The framework we will show here is `Inspect AI` - <https://inspect.ai-safety-institute.org.uk/>. It also comes as a VS Code extension, which is enabled in this workshop.
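To make the test-suite analogy concrete, here is a minimal sketch of an eval loop in plain Python. The two `fake_model_*` functions and the `run_evals` helper are hypothetical stand-ins (no real LLM is called), and the pass criterion is a plain substring check. The rest of this lesson covers the same ground with Inspect's datasets, solvers and scorers.

```python
from typing import Callable

# Hypothetical stand-ins for real LLM calls: each maps a prompt to a reply.
def fake_model_a(prompt: str) -> str:
    return 'console.log("hello");'

def fake_model_b(prompt: str) -> str:
    return 'print("hello")'  # wrong language on purpose

MODELS: dict[str, Callable[[str], str]] = {
    "fake/model-a": fake_model_a,
    "fake/model-b": fake_model_b,
}

# Each case pairs a prompt with a substring the reply must contain.
CASES = [
    ("Generate a javascript file name hello-world.js that prints out hello", "console.log("),
]

def run_evals() -> dict[str, float]:
    """Return the fraction of passing cases for every model."""
    scores = {}
    for name, model in MODELS.items():
        passed = sum(expected in model(prompt) for prompt, expected in CASES)
        scores[name] = passed / len(CASES)
    return scores

print(run_evals())
```

Running the same cases against every model is what lets you notice when a model swap breaks an expectation.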

## Installation

Note: we are pinning `inspect_ai` because of a recent breaking change.

In [2]:
%pip install -q openai anthropic ipywidgets colorama
import os
os.environ['XDG_RUNTIME_DIR'] = "/tmp"

from helpers.reporter.pretty import pretty_results




## Your first solver

In [2]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate

@task
def simple_generate() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello"
        )
    ]

    return Task(dataset=dataset,
        solver=[
            generate(model = "openai/gpt-4o-mini",)
        ],
    )

results = eval(simple_generate, max_steps=10,log_level="info",display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T12-52-13+00-00_simple-generate_N6PWPz8u4tEVxZnwikVr79.eval
input : Generate a javascript file name hello-world.js that prints out hello
target: 
user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Certainly! Below is the content for a JavaScript file named `hello-world.js` that prints out "hello".

```javascript
// hello-world.js

console.log("hello");
```

You can create a file named `hello-world.js` and copy the above code into it. When you run this file with Node.js (or in the browser's console), it will output "hello". 

To run it using Node.js, you would navigate to the directory where the file is located in your terminal and run the following command:

```bash
node hello-world.js
```
********************************************************************************


## Set default model

In [None]:
# Set default model to gpt-4o-mini
os.environ['INSPECT_EVAL_MODEL'] = "openai/gpt-4o-mini"

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate

@task
def simple_generate() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello"
        )
    ]

    return Task(dataset=dataset,
        solver=[
            generate()
        ],
    )

results = eval(simple_generate,log_level="info",display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T12-53-05+00-00_simple-generate_HREVmyWUxcPtsR3fkJAk97.eval
input : Generate a javascript file name hello-world.js that prints out hello
target: 
user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here's a simple `hello-world.js` file that prints "hello" to the console. You can create this file in your text editor and run it using Node.js or directly in a browser console.

```javascript
// hello-world.js
console.log("hello");
```

To execute this file:

1. Save the above code in a file named `hello-world.js`.
2. Open your terminal (command prompt) and navigate to the directory where the file is located.
3. Run the following command:

```bash
node hello-world.js
```

This should print "hello" in the console. If you run this in a web browser console, simply copy the `console.log("hello");` line and paste it there to see

In [9]:
# Set default model to gpt-4o-mini
os.environ['INSPECT_EVAL_MODEL'] = "openai/gpt-4o-mini"

from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import generate

@task
def simple_generate() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello"
        )
    ]


    return Task(dataset=dataset,
        solver=[
            generate()
        ],
    )

models = [
    "openai/gpt-4o-mini",
    "openai/gpt-4o",
]
results = eval(simple_generate,log_level="info", model = models, display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
Status: success Model: openai/gpt-4o
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T13-05-19+00-00_simple-generate_HxSyMQZoMT2iEPEt3Tzjd7.eval
input : Generate a javascript file name hello-world.js that prints out hello
target: 
user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here is a simple JavaScript code snippet that will print "Hello" to the console. You can save this code in a file named `hello-world.js`.

```javascript
// hello-world.js
console.log("Hello");
```

To run this script, you can use Node.js. Simply follow these steps:

1. Make sure you have Node.js installed on your machine.
2. Create a new file named `hello-world.js`.
3. Copy and paste the above code into `hello-world.js`.
4. Open your terminal or command prompt.
5. Navigate to the directory where you saved the file.
6. Run the command: 

```bash
node hello-world.js
```

You should see the 
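The cells in this section pick the model in two ways: through the `INSPECT_EVAL_MODEL` environment variable, and through the `model` argument of `eval`, which also accepts a list. The sketch below illustrates that precedence in plain Python. `resolve_models` is a hypothetical helper written for this lesson, not Inspect's actual implementation.

```python
import os

def resolve_models(model=None, env=os.environ):
    """Hypothetical helper: an explicit argument wins, else fall back to the env var."""
    if model is None:
        model = env.get("INSPECT_EVAL_MODEL")
    if model is None:
        raise ValueError("no model given and INSPECT_EVAL_MODEL is not set")
    # Normalise to a list so one model and several models are handled alike.
    return [model] if isinstance(model, str) else list(model)

fake_env = {"INSPECT_EVAL_MODEL": "openai/gpt-4o-mini"}

# No explicit model: the environment variable supplies the default.
print(resolve_models(env=fake_env))

# Explicit list: every listed model is evaluated.
print(resolve_models(["openai/gpt-4o-mini", "openai/gpt-4o"], env=fake_env))
```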

## Adding a system prompt

In [None]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from textwrap import dedent

@task
def system_message_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello"
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    Make sure to output the code and not any additional text or comments.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
    )

results = eval(system_message_solver,log_level="info",display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T10-52-41+00-00_system-message-solver_Sps8AtzSdJA8xvyXeuqhYv.eval
input : Generate a javascript file name hello-world.js that prints out hello
target: 
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    Make sure to output the code and not any additional text or comments.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > ```javascript
console.log("hello");
```
********************************************************************************
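Note that the system message in the output above still carries the four-space indentation, even though the prompt went through `dedent`. `textwrap.dedent` only removes whitespace that is common to all lines, and the first line of `SYSTEM_PROMPT` starts right after the opening quotes with no indentation. The sketch below uses a shortened prompt to show the effect and one possible way around it.

```python
from textwrap import dedent

# The first line starts at column 0, so the common indentation is empty
# and dedent() leaves the following lines indented.
prompt_a = """You are a code generation assistant.
    Make sure to output the code only.
    """

# Starting the text on the line after the opening quotes (note the backslash)
# gives every line the same indentation, which dedent() can then strip.
prompt_b = """\
    You are a code generation assistant.
    Make sure to output the code only.
    """

print(repr(dedent(prompt_a)))
print(repr(dedent(prompt_b)))
```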


## Scorer


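A scorer checks the model output against the sample's `target`. Inspect ships with several built-in scorers: <https://inspect.aisi.org.uk/scorers.html#built-in-scorers>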

In [None]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, match
from textwrap import dedent

@task
def include_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
            target="console.log(",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            includes(ignore_case=True),  # passes if the target appears anywhere in the generated output
            match(location="begin", numeric=False),  # passes only if the output begins with the target
        ]
    )

results = eval(include_solver,log_level="info",display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T11-43-49+00-00_include-solver_4RrBUeN388sL9nzDmxwNTn.eval
input : Generate a javascript file name hello-world.js that prints out hello
target: console.log(
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Sure! Below is the content for a JavaScript file named `hello-world.js` that prints out "hello".

```javascript
// hello-world.js
console.log("hello");
```

You can create a file named `hello-world.js` and copy the above code into it. When you run the file using Node.js or a browser console, it
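The two scorers in this cell ask different questions about the same output. The sketch below mimics them with plain string operations. Both functions are simplified, hypothetical stand-ins (the real scorers do more normalisation), and `output` is a shortened version of the reply shown above.

```python
def includes_sketch(output: str, target: str, ignore_case: bool = True) -> bool:
    """Simplified stand-in: pass when the target appears anywhere in the output."""
    if ignore_case:
        output, target = output.lower(), target.lower()
    return target in output

def match_begin_sketch(output: str, target: str) -> bool:
    """Simplified stand-in: pass only when the output starts with the target."""
    return output.strip().startswith(target.strip())

# Shortened version of the assistant reply from the run above.
output = 'Sure! Below is the content for a JavaScript file.\n\nconsole.log("hello");'
target = "console.log("

print(includes_sketch(output, target))     # the target is somewhere in the reply
print(match_begin_sketch(output, target))  # the reply starts with "Sure!", not the target
```

Under these assumptions a chatty reply passes the substring check but fails the begins-with check, which is why the choice of scorer has to fit the output format you ask for in the prompt.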

## Grader

In [None]:
# These names live in a private module (leading underscore), so they may move between releases.
from inspect_ai.scorer._model import DEFAULT_GRADE_PATTERN, DEFAULT_MODEL_GRADED_FACT_TEMPLATE, default_instructions
print("DEFAULT_GRADE_PATTERN: ",DEFAULT_GRADE_PATTERN)
print("default_instructions: ",default_instructions(partial_credit=False))
print("DEFAULT_MODEL_GRADED_FACT_TEMPLATE: ",DEFAULT_MODEL_GRADED_FACT_TEMPLATE)

DEFAULT_GRADE_PATTERN:  (?i)GRADE\s*:\s*([CPI])(.*)$
default_instructions:  
After assessing the submitted answer, reply with 'GRADE: $LETTER' (without quotes) where LETTER is one of CI.  Please choose ONE option for the grade: either "C" for correct answers, or "I" for incorrect answers.

For example, after reviewing a correct answer you might write 'GRADE: C' or after reviewing an incorrect answer you might write 'GRADE: I'.

First, write out in a step by step manner your reasoning about the criterion to be sure that your conclusion is correct. Avoid simply stating the correct answers at the outset. Then, end with your answer formatted as 'GRADE: $LETTER' (without quotes) where LETTER is one of CI.

DEFAULT_MODEL_GRADED_FACT_TEMPLATE:  
You are comparing a submitted answer to an expert answer on a given question. Here is the data:

[BEGIN DATA]
************
[Question]: {question}
************
[Expert]: {criterion}
************
[Submission]: {answer}
************
[END DATA]

Compare t
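The grade pattern printed above is an ordinary regular expression, so it can be tried out on sample grader replies. In the sketch below the pattern is copied from that output; `extract_grade` is a hypothetical helper, and using `re.search` is an assumption about how such a pattern would typically be applied.

```python
import re

# Copied from the DEFAULT_GRADE_PATTERN output above.
GRADE_PATTERN = r"(?i)GRADE\s*:\s*([CPI])(.*)$"

def extract_grade(reply: str):
    """Hypothetical helper: return 'C', 'P' or 'I', or None if no grade is found."""
    found = re.search(GRADE_PATTERN, reply)
    return found.group(1).upper() if found else None

print(extract_grade("The code prints hello as required.\nGRADE: C"))
print(extract_grade("grade : i"))          # (?i) makes the match case-insensitive
print(extract_grade("No verdict given."))  # no grade line at all
```

This also shows why the default instructions ask the grader to end its reply with `GRADE: $LETTER`: the pattern is anchored at the end of the text.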

In [None]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_qa
from textwrap import dedent

@task
def model_graded_fact_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_qa()  # a grader model judges the answer against the sample's target (none is set in this sample)
        ]
    )

results = eval(model_graded_fact_solver,log_level="info",display="none")
print(pretty_results(results))
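A model-graded scorer works by filling a prompt template and sending it to a grader model. The sketch below fills only the part of the template that is visible in the Grader output earlier in this lesson (the remainder is truncated there and is left out here). `build_grader_prompt` is a hypothetical helper, and the criterion text is an example value.

```python
# Only the visible part of the printed template; the rest is elided in the output.
TEMPLATE_HEAD = """\
[BEGIN DATA]
************
[Question]: {question}
************
[Expert]: {criterion}
************
[Submission]: {answer}
************
[END DATA]
"""

def build_grader_prompt(question: str, criterion: str, answer: str) -> str:
    """Hypothetical helper: substitute the three placeholders."""
    return TEMPLATE_HEAD.format(question=question, criterion=criterion, answer=answer)

prompt = build_grader_prompt(
    question="Generate a javascript file name hello-world.js that prints out hello",
    criterion="The code block should include a console.log statement that prints out hello",
    answer='console.log("hello");',
)
print(prompt)
```

With an empty target, as in the sample above, the `{criterion}` slot is empty too, so the grader has little to compare against.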


## Different grader model

In [None]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_qa
from textwrap import dedent

@task
def model_graded_fact_solver_o3() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_qa(model="openai/o3-mini"),  # grade with o3-mini instead of the model being evaluated
            model_graded_qa(model="openai/o3"),  # grade the same answer again with o3
        ]
    )

results = eval(model_graded_fact_solver_o3,log_level="info",display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T11-46-04+00-00_model-graded-fact-solver-o3_hZjDfceuDtoYrMSv83ZH6h.eval
input : Generate a javascript file name hello-world.js that prints out hello
target: 
system     > You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.

user       > Generate a javascript file name hello-world.js that prints out hello
assistant  > Here is the content for the `hello-world.js` file that prints out "hello":

```javascript
// hello-world.js
console.log('hello');
```

Simply create a file named `hello-world.js` and paste the above code into it. When you run this file using Node.js, it will print "hello" to the console.
Score

## Fact grader

In [None]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_fact
from textwrap import dedent

@task
def model_graded_fact_solver() -> Task:

    dataset=[
        Sample(
            input="Generate a javascript file name hello-world.js that prints out hello",
            target="The code block should include a console.log statement that prints out hello",
        )
    ]

    SYSTEM_PROMPT = """You are a code generation assistant. Your task is to generate code based on the input provided.
    You should ensure that the generated code meets the requirements specified in the input.
    You should also ensure that the generated code is syntactically correct and can be executed without errors.
    """
    
    return Task(dataset=dataset,
        solver=[
            system_message(dedent(SYSTEM_PROMPT)),
            generate()
        ],
        scorer=[
            model_graded_fact() #takes the target as the fact
        ]
    )

results = eval(model_graded_fact_solver,log_level="info",display="none")
print(pretty_results(results))

## Basic picture

In [3]:
from inspect_ai import Task, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.model import ChatMessageUser, ContentImage, ContentText
from inspect_ai.solver import system_message, generate
from inspect_ai.scorer import includes, model_graded_fact

@task
def picture_basic():
    dataset = [
        Sample(input = [
            ChatMessageUser(content = [
                ContentImage(image="data/screenshot.png"),
                ContentText(text="What is this a picture of?")
            ])
        ], target="The picture represents the homepage of Uber.")
    ]

    task = Task(
        dataset=dataset,
        solver=generate(),
        scorer=model_graded_fact(),
    )
    return task


results = eval(picture_basic,log_level="info",display="none")
print(pretty_results(results))

Status: success Model: openai/gpt-4o-mini
/workspaces/workshop-testing/lessons/04-testing/logs/2025-05-07T14-21-49+00-00_picture-basic_mS3RevET4bxFuzvckrmZKt.eval
input : [ChatMessageUser(id='JH5Lx8YjrzXQX6z88uJekT', content=[ContentText(type='text', text='(Image)', refusal=None), ContentText(type='text', text='What is this a picture of?', refusal=None)], source=None, internal=None, role='user', tool_call_id=None)]
target: The picture represents the homepage of Uber.

user       +> What is this a picture of?
assistant  > This is a screenshot of the Uber website, showcasing their ride-hailing service. It features sections for entering pickup and drop-off locations, options to select the date and time, a map on the right side, and a notification about cookie use.
Scorer[model_graded_fact][VALUE]: C
Scorer[model_graded_fact][EXPLANATION]: To determine whether the submitted answer contains the content in the expert answer, I will analyze both responses carefully.

1. 
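In the cell above the image is referenced by file path. Vision APIs commonly receive local images as base64 data URIs, and the sketch below shows that encoding step with the standard library only. `to_data_uri` is a hypothetical helper, and whether you need to do this yourself is an assumption: here the framework was given the path directly.

```python
import base64
import mimetypes

def to_data_uri(filename: str, data: bytes) -> str:
    """Hypothetical helper: wrap raw bytes in a base64 data URI."""
    mime, _ = mimetypes.guess_type(filename)
    mime = mime or "application/octet-stream"
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

# Tiny stand-in payload: the 8-byte PNG signature, not a full image.
payload = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri("data/screenshot.png", payload)
print(uri)
```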