# Introduction

This notebook shows how to create, run, and export test suites in Checklist. If you are not already familiar with creating tests in Checklist, consider reading the MFT Examples notebook first.

## Setup
First, let's import the libraries and load the model.

In [1]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import pandas as pd
import random
import json
import checklist
from checklist.editor import Editor
from checklist.expect import Expect
from checklist.pred_wrapper import PredictorWrapper
from checklist.test_types import MFT
from checklist.test_suite import TestSuite
from torch.nn import functional as F
from typing import List
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Initialize random seed
# Remove this code to experiment with random samples
random.seed(123)
torch.manual_seed(456)

<torch._C.Generator at 0x7faa7904b110>

In [3]:
# Load pretrained model tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Load pretrained model (weights)
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)
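# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'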
model.eval()
model.to(device)
"Model loaded"

'Model loaded'

# Creating a Test Suite

Checklist can run multiple tests in a test suite. Tests can be grouped by capability and results can be explored in a visual table.

We will create a test suite called 'Same Token Prediction' with three MFTs (Minimum Functionality Tests). Each MFT tests whether the token substituted into the prompt template also appears in the generated text.

For example, if we prompt the model with "The **dog** is running in the zoo" and the model responds with "The **dog** looks very happy", then it passes the test because the same animal appears in the model's response.

## Creating the MFTs
### MFT 1: Same animal appears in response
This MFT uses an `{animal}` placeholder in the template. The expectation function checks that the same animal appears in the prediction.

In [4]:
editor = Editor()
animal_prompts = editor.template("The {animal} is running in the zoo", animal=["dog", "cat", "giraffe", "aardvark"], meta=True)
animal_prompts.data

['The dog is running in the zoo',
 'The cat is running in the zoo',
 'The giraffe is running in the zoo',
 'The aardvark is running in the zoo']

In [5]:
def contains_same_animal(x, pred, conf, label=None, meta=None):
    return meta['animal'] in pred

In [6]:
same_animal_expect_fn = Expect.single(contains_same_animal)
same_animal_test = MFT(**animal_prompts, name='Same animal in response', description='The response contains the same animal mentioned in the prompt.', expect=same_animal_expect_fn)
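
Note that the substring check in `contains_same_animal` is case-sensitive and also matches inside longer words (for example, `cat` matches `category`). The sketch below calls the expectation function directly on hand-written completions (hypothetical, not model output) to show both behaviours. `contains_whole_word` is an illustrative stricter variant and is not part of Checklist.

```python
import re

def contains_same_animal(x, pred, conf, label=None, meta=None):
    # Same check as in the notebook: plain substring match
    return meta['animal'] in pred

def contains_whole_word(x, pred, conf, label=None, meta=None):
    # Hypothetical stricter variant: case-insensitive, whole-word match only
    pattern = r'\b' + re.escape(meta['animal']) + r'\b'
    return re.search(pattern, pred, flags=re.IGNORECASE) is not None

prompt = "The cat is running in the zoo"
meta = {'animal': 'cat'}

# Hand-written completions, for illustration only
print(contains_same_animal(prompt, " and the cat looks happy", 1, meta=meta))  # True
print(contains_same_animal(prompt, " in a new category", 1, meta=meta))        # True (substring match)
print(contains_whole_word(prompt, " in a new category", 1, meta=meta))         # False
print(contains_whole_word(prompt, " and the Cat looks happy", 1, meta=meta))   # True
```

Either function can be passed to `Expect.single()` because both keep the `(x, pred, conf, label, meta)` signature used above.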

### MFT 2: Same country appears in response
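This MFT uses a `{country}` placeholder in the template. No list of countries is passed, so the Editor fills the placeholder from its built-in lexicon, and `nsamples=10` limits the result to 10 sampled prompts. The expectation function checks that the same country appears in the prediction.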

In [7]:
country_prompts = editor.template("Earlier today, scientists from {country} discovered", meta=True, nsamples=10)
country_prompts

MunchWithAdd({'meta': [{'country': 'Tajikistan'}, {'country': 'Bolivia'}, {'country': 'Japan'}, {'country': 'Kiribati'}, {'country': 'Kyrgyzstan'}, {'country': 'Namibia'}, {'country': 'Malaysia'}, {'country': 'Honduras'}, {'country': 'Ukraine'}, {'country': 'Angola'}], 'data': ['Earlier today, scientists from Tajikistan discovered', 'Earlier today, scientists from Bolivia discovered', 'Earlier today, scientists from Japan discovered', 'Earlier today, scientists from Kiribati discovered', 'Earlier today, scientists from Kyrgyzstan discovered', 'Earlier today, scientists from Namibia discovered', 'Earlier today, scientists from Malaysia discovered', 'Earlier today, scientists from Honduras discovered', 'Earlier today, scientists from Ukraine discovered', 'Earlier today, scientists from Angola discovered']})

In [8]:
def contains_same_country(x, pred, conf, label=None, meta=None):
    return meta['country'] in pred

In [9]:
same_country_expect_fn = Expect.single(contains_same_country)
same_country_test = MFT(**country_prompts, name='Same country in response', description='The response contains the same country mentioned in the prompt.', expect=same_country_expect_fn)

### MFT 3: Same person appears in response
This MFT uses a `{first_name}` placeholder in the template. The expectation function checks that the same first name appears in the prediction.

In [10]:
person_prompts = editor.template("{first_name} is my neighbor.", meta=True, nsamples=10)
person_prompts

MunchWithAdd({'meta': [{'first_name': 'Marie'}, {'first_name': 'Ben'}, {'first_name': 'Jill'}, {'first_name': 'Jill'}, {'first_name': 'Andrew'}, {'first_name': 'Victoria'}, {'first_name': 'Philip'}, {'first_name': 'Charlie'}, {'first_name': 'Cynthia'}, {'first_name': 'Lawrence'}], 'data': ['Marie is my neighbor.', 'Ben is my neighbor.', 'Jill is my neighbor.', 'Jill is my neighbor.', 'Andrew is my neighbor.', 'Victoria is my neighbor.', 'Philip is my neighbor.', 'Charlie is my neighbor.', 'Cynthia is my neighbor.', 'Lawrence is my neighbor.']})

In [11]:
def contains_same_person(x, pred, conf, label=None, meta=None):
    return meta['first_name'] in pred

In [12]:
same_person_expect_fn = Expect.single(contains_same_person)
same_person_test = MFT(**person_prompts, name='Same person in response', description='The response contains the same person\'s first name mentioned in the prompt.', expect=same_person_expect_fn)

## Adding the tests to the suite
The `TestSuite()` constructor creates an empty test suite. Tests can be added one by one using `suite.add(test)`. The optional `capability` parameter can be used to label and group tests that test similar capabilities.

In [13]:
suite = TestSuite()

In [14]:
suite.add(same_animal_test, capability="Same Token Prediction")
suite.add(same_country_test, capability="Same Token Prediction")
suite.add(same_person_test, capability="Same Token Prediction")

## Generating the predictions
Now we define the function that Checklist will use to generate predictions from the model. Checklist expects predictions in the form `([predictions], [scores])`, so we will wrap the `generate_sentences()` function with `PredictorWrapper.wrap_predict()`, which automatically creates the tuple `([predictions], [1, 1, ...])`.

In [15]:
def generate_sentence(prompt: str) -> str:
    token_tensor = tokenizer.encode(prompt, return_tensors='pt').to(device) # return_tensors = "pt" returns a PyTorch tensor
    out = model.generate(
        token_tensor,
        do_sample=True,
        min_length=10,
        max_length=50,
        num_beams=1,
        temperature=1.0,
        no_repeat_ngram_size=2,
        early_stopping=False,
        output_scores=True,
        return_dict_in_generate=True)
    text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return text[len(prompt):]

In [16]:
def generate_sentences(prompts: List[str]) -> List[str]:
    sentences = []
    for prompt in prompts:
        sentences.append(generate_sentence(prompt))
    return sentences

In [17]:
wrapped_generator = PredictorWrapper.wrap_predict(generate_sentences)
wrapped_generator(["Hello, nice to meet you.", "Goodbye, see you later."])

([' Now, let us begin to talk…\n\nI heard you guys were coming.\n: I thought you were from there, right?\n (Slight sigh, looks back at Vixen) Yeah,',
  ' Goodbye, goodbye, dear." But when I heard that I felt almost lost.\n\nMy wife gave me the same answer of "Thank you."\n.'],
 array([1., 1.]))
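
As the output above shows, the wrapped function returns the predictions together with an array of ones as placeholder confidences. The sketch below mimics that return shape in plain Python to make it concrete. `wrap_with_unit_confidence` and `echo_generator` are hypothetical names for illustration. This is not Checklist's implementation, and it returns a list of floats rather than a NumPy array.

```python
from typing import Callable, List, Tuple

def wrap_with_unit_confidence(
    predict_fn: Callable[[List[str]], List[str]]
) -> Callable[[List[str]], Tuple[List[str], List[float]]]:
    # Hypothetical stand-in for the wrapper: pair each prediction with confidence 1.0
    def wrapped(prompts: List[str]) -> Tuple[List[str], List[float]]:
        preds = predict_fn(prompts)
        return preds, [1.0] * len(preds)
    return wrapped

def echo_generator(prompts: List[str]) -> List[str]:
    # Stand-in for generate_sentences() so the sketch runs without a model
    return [" (completion for: %s)" % p for p in prompts]

wrapped = wrap_with_unit_confidence(echo_generator)
preds, confs = wrapped(["Hello", "Goodbye"])
print(preds)
print(confs)
```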

## Running the suite
We can now run the suite and view the results.

In [18]:
suite.run(wrapped_generator, overwrite=True)

Running Same animal in response
Predicting 4 examples
Running Same country in response
Predicting 10 examples
Running Same person in response
Predicting 10 examples


In [19]:
def format_example(x, pred, conf, label=None, meta=None): 
    return 'Prompt:      %s\nCompletion:      %s' % (x, pred) 

In [20]:
suite.summary(format_example_fn = format_example)

Same Token Prediction

Same animal in response
Test cases:      4
Fails (rate):    4 (100.0%)

Example fails:
Prompt:      The dog is running in the zoo
Completion:      , she says, and a friend of her walks in on them, shouting things like, "I was scared, can you please sit down"?

He's also apparently had a seizure and an eye scratch,
----
Prompt:      The cat is running in the zoo
Completion:      , but I'll only bring back a one year old. There's one guy at the park that is crazy. He was doing some nice things but it's definitely going to end up being one of those stories that
----
Prompt:      The aardvark is running in the zoo
Completion:      . I got to see it in person.

Lately, my daughter, who was visiting from college now, said she's been fascinated with the penguin-shaped shell on the side of
----


Same country in response
Test cases:      10
Fails (rate):    10 (100.0%)

Example fails:
Prompt:      Earlier today, scientists from Angola discovered
Completion:       that bl

In [21]:
suite.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'Same animal in respo…

# Using files with test suites

Some models cannot be run directly on the same machine that is running the Checklist test suite. For instance, a model might need to run in a specially configured lab environment. In this case, Checklist does not have to receive the predictions from the model directly. The predictions can be saved to a file, then the test suite can check the predictions from the file.

## Exporting a test suite to a file
First, let's create a file that contains all the prompts that we will send to the model.

### Accessing test suite data internally
Tests are stored in `suite.tests`, which is a dictionary mapping the test name to the test.

In [22]:
for key in suite.tests.keys():
    print(key)

Same animal in response
Same country in response
Same person in response


We can access the data and metadata of an individual test like this:

In [23]:
suite.tests['Same animal in response'].data

['The dog is running in the zoo',
 'The cat is running in the zoo',
 'The giraffe is running in the zoo',
 'The aardvark is running in the zoo']

In [24]:
suite.tests['Same animal in response'].meta

[{'animal': 'dog'},
 {'animal': 'cat'},
 {'animal': 'giraffe'},
 {'animal': 'aardvark'}]
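
The two outputs above suggest that `data` and `meta` are parallel lists, so the prompt at a given index can be matched with the placeholder values that produced it. A small sketch under that assumption, with the printed values copied into literals so that it runs without Checklist:

```python
# Values copied from the outputs of the two cells above
data = ['The dog is running in the zoo',
        'The cat is running in the zoo',
        'The giraffe is running in the zoo',
        'The aardvark is running in the zoo']
meta = [{'animal': 'dog'},
        {'animal': 'cat'},
        {'animal': 'giraffe'},
        {'animal': 'aardvark'}]

# Pair each prompt with its metadata by position
pairs = list(zip(data, meta))
for prompt, m in pairs:
    print('%-10s -> %s' % (m['animal'], prompt))
```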

### Exporting to JSON file with to_raw_file()
TestSuite's `to_raw_file()` function exports a test suite to a file. The `format_fn` parameter allows us to control how each example in the suite is printed to the file. We can use `format_fn` to print the examples in a JSON format.

In [25]:
def suite_to_json_file(suite, filename):
    class Counter:
        def __init__(self):
            self.count = 0
        def get_count(self):
            self.count += 1
            return self.count
    
    counter = Counter()
    total_tests = 0
    for t in suite.tests.values():
        total_tests += len(t.data)
        
    def json_format_fn(x):
        example_id = counter.get_count()
        json_str = ""
        if example_id == 1:
            json_str = '{"examples": ['
        json_str += json.dumps({'content': x, 'id': example_id}) + ","
        if example_id == total_tests:
            # remove trailing comma
            json_str = json_str[:-1]
            json_str += "]}"
        return json_str
    
    suite.to_raw_file(filename, format_fn = json_format_fn)

In [26]:
suite_to_json_file(suite, 'same_token_suite.json')

In [27]:
cat 'same_token_suite.json'

{"examples": [{"content": "The dog is running in the zoo", "id": 1},
{"content": "The cat is running in the zoo", "id": 2},
{"content": "The giraffe is running in the zoo", "id": 3},
{"content": "The aardvark is running in the zoo", "id": 4},
{"content": "Earlier today, scientists from Tajikistan discovered", "id": 5},
{"content": "Earlier today, scientists from Bolivia discovered", "id": 6},
{"content": "Earlier today, scientists from Japan discovered", "id": 7},
{"content": "Earlier today, scientists from Kiribati discovered", "id": 8},
{"content": "Earlier today, scientists from Kyrgyzstan discovered", "id": 9},
{"content": "Earlier today, scientists from Namibia discovered", "id": 10},
{"content": "Earlier today, scientists from Malaysia discovered", "id": 11},
{"content": "Earlier today, scientists from Honduras discovered", "id": 12},
{"content": "Earlier today, scientists from Ukraine discovered", "id": 13},
{"content": "Earlier today, scientists from Angola discove
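
If the prompts are available as plain lists (for instance from each test's `.data`, as shown earlier), the same `{"examples": [...]}` layout can also be produced with a single `json.dump()` call, which avoids handling brackets and commas by hand. This is a minimal sketch under that assumption and not a replacement for `to_raw_file()`. `prompts_to_json` is a hypothetical helper, and the prompt lists are shortened copies of the data shown above.

```python
import json
import os
import tempfile

def prompts_to_json(prompt_lists, filename):
    # Hypothetical helper: number the prompts consecutively across all tests
    examples = []
    for prompts in prompt_lists:
        for prompt in prompts:
            examples.append({'content': prompt, 'id': len(examples) + 1})
    with open(filename, 'w') as f:
        json.dump({'examples': examples}, f)
    return examples

animal_data = ['The dog is running in the zoo', 'The cat is running in the zoo']
country_data = ['Earlier today, scientists from Japan discovered']

# Write to a temporary directory so the sketch does not overwrite the real file
path = os.path.join(tempfile.mkdtemp(), 'sketch_suite.json')
prompts_to_json([animal_data, country_data], path)

with open(path) as f:
    loaded = json.load(f)
print(loaded['examples'][-1])
```

Unlike the file printed above, which holds one example per line, `json.dump()` without an `indent` argument writes the whole object on a single line. `json.load()` reads both layouts the same way.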

## Importing the test suite JSON
The JSON file we created can be imported back into a Python object by using `json.load()`.

In [28]:
import json
# The with-statement closes the file automatically
with open('same_token_suite.json', 'r') as f:
    suite_dict = json.load(f)
suite_dict['examples'][0:3]

[{'content': 'The dog is running in the zoo', 'id': 1},
 {'content': 'The cat is running in the zoo', 'id': 2},
 {'content': 'The giraffe is running in the zoo', 'id': 3}]

## Generating predictions from the loaded data
Our data has been loaded into a variable named `suite_dict`. Now we can read each example from `suite_dict` and generate a prediction for it. Each prediction is written as one JSON object per line to another file named `same_token_suite_predictions.json`, which Checklist will read to evaluate the results.

In [29]:
with open('same_token_suite_predictions.json', 'w') as f:
    for example in suite_dict['examples']:
        prediction = generate_sentence(example['content'])
        # json.dumps() escapes quotes and newlines, so no manual escaping is needed
        f.write(json.dumps({'prediction': prediction, 'id': example['id']}) + '\n')

In [30]:
cat 'same_token_suite_predictions.json'

{"prediction": " to study the animals in a cage where it is kept. After taking the test and asking the other dog about its health, he takes the blood samples and looks at them again and again.\n\nIf the results", "id": 1}
{"prediction": ".\n\nThat's a small number (at the moment), but it's not the only cat to look out for the blind. Other animals also get attention from other animals. It's the reason we get the", "id": 2}
{"prediction": " with six other giraffes. Credit: Dr Rolf Lecka/Flickr (CC BY 2.0)\n\nBut as the animal swells out of the pen and into adulthood, it will", "id": 3}
{"prediction": ", as he runs out during a lecture and gets caught up in it as well. He also finds out about the future of his mother.\n\nNotes", "id": 4}
{"prediction": " a new method of extracting plutonium. The team showed that a series of two hydrogen-containing platinum atoms could be used as a nucleite to produce a plutonium-like molecule which would then be added to the", "id": 5}
{"prediction":

## Reading test results from file
TestSuite has a `run_from_file()` function that reads the predictions line by line from a file. The `format_fn` parameter is used to parse each line of the file. Our format function, `read_json_prediction()`, converts the JSON object into a tuple of `(predicted text, confidence)`. We don't care about confidence values here, so we will just set confidence to 1 for every prediction.

In [31]:
def read_json_prediction(x):
    test_output = json.loads(x)
    return (test_output['prediction'], 1)
suite.run_from_file('same_token_suite_predictions.json', format_fn = read_json_prediction, overwrite=True)
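
Because each line of the predictions file is a standalone JSON object, `json.loads()` restores the quotes and newlines that `json.dumps()` escaped when the line was written. The sketch below round-trips one hand-written prediction (hypothetical, not model output) through the same line format and the same parsing function.

```python
import json

def read_json_prediction(x):
    # Same parsing function as in the notebook: returns (predicted text, confidence)
    test_output = json.loads(x)
    return (test_output['prediction'], 1)

# Hand-written prediction containing quotes and newlines
original = ' He said, "hello".\n\nNotes'
line = json.dumps({'prediction': original, 'id': 4})

# The serialized form has no raw newline, so it occupies exactly one line of the file
print('\n' in line)            # False

pred, conf = read_json_prediction(line)
print(pred == original, conf)  # True 1
```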

In [32]:
suite.summary(format_example_fn = format_example)

Same Token Prediction

Same animal in response
Test cases:      4
Fails (rate):    1 (25.0%)

Example fails:
Prompt:      The aardvark is running in the zoo
Completion:      , as he runs out during a lecture and gets caught up in it as well. He also finds out about the future of his mother.

Notes
----


Same country in response
Test cases:      10
Fails (rate):    10 (100.0%)

Example fails:
Prompt:      Earlier today, scientists from Namibia discovered
Completion:       that a comet has struck a young target that was too small and potentially too far out to escape the comet to carry it into the surrounding clouds. The researchers theorised that the object might be made of material that
----
Prompt:      Earlier today, scientists from Ukraine discovered
Completion:       a new species — the "Haparovian" species.

"This is such a big deal. It would be a significant discovery even if it wasn't," said Alexander Bozhirov,
----
Prompt:      Earlier today, scientists from Malaysia discovere

In [33]:
suite.visual_summary_table()

Please wait as we prepare the table data...


SuiteSummarizer(stats={'npassed': 0, 'nfailed': 0, 'nfiltered': 0}, test_infos=[{'name': 'Same animal in respo…