In [24]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import pandas as pd
import checklist
from checklist.editor import Editor
from checklist.expect import Expect
from checklist.test_types import MFT
import warnings
warnings.filterwarnings('ignore')

# MFTs: Introduction
In this notebook, we will create Minimum Functionality Tests (MFTs) for a generative language model. MFTs test one specific function of a language model. They are analogous to unit tests in traditional software engineering.

## Setup generative model
Before we can test anything, we need to set up our language model. We will use the HuggingFace transformers library to load a GPT2 model.

First, we create a tokenizer. The tokenizer splits a string into tokens (whole words or pieces of words), then maps each token to an integer ID that our model can understand.
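To make the text-to-IDs mapping concrete, here is a toy sketch with a made-up six-entry vocabulary. It is not GPT-2's byte-pair encoding (the real vocabulary has 50,257 entries and the IDs differ); `TOY_VOCAB`, `toy_encode`, and `toy_decode` are hypothetical names used only for illustration.

```python
# Toy vocabulary, invented for this example; not GPT-2's real vocabulary
TOY_VOCAB = {"Where": 0, "fore": 1, " art": 2, " thou": 3, " Romeo": 4, "?": 5}

def toy_encode(text, vocab=TOY_VOCAB):
    """Greedy longest-match lookup: take the longest vocabulary piece that starts at the current position."""
    ids = []
    pos = 0
    while pos < len(text):
        match = max((p for p in vocab if text.startswith(p, pos)), key=len, default=None)
        if match is None:
            raise ValueError(f"no vocabulary piece matches at position {pos}")
        ids.append(vocab[match])
        pos += len(match)
    return ids

def toy_decode(ids, vocab=TOY_VOCAB):
    """Map IDs back to their pieces and join them into the original string."""
    inverse = {i: p for p, i in vocab.items()}
    return "".join(inverse[i] for i in ids)

ids = toy_encode("Wherefore art thou Romeo?")
print(ids)              # [0, 1, 2, 3, 4, 5]
print(toy_decode(ids))  # Wherefore art thou Romeo?
```

Note that a single word ("Wherefore") can map to more than one ID, which is why the number of IDs need not equal the number of words.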

In [25]:
# Load pretrained model tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Demonstrate what the tokenizer does
tokenizer.encode("Wherefore art thou Romeo?")

[8496, 754, 1242, 14210, 43989, 30]

Our tokenizer has turned the human-readable text into a list of numbers that the model understands. Next, let's load the GPT2 model.

In [26]:
# Load pretrained model (weights)
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)

Generating text with the model requires a bit of work. Let's write a function `generate_sentence` to handle the text generation.

`generate_sentence` has 3 mandatory parameters: 
- A tokenizer `tok`
- A model `mdl`
- A prompt `prompt`

The prompt is a string that the model will use as a starting point for generating new text. It gives the model context about what kind of text should be generated.

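`generate_sentence` returns a dictionary containing the generated text and the model's scores. Note that `scores` holds only the first generation step (`out.scores[0]`), with one row per beam, rather than a score for every generated token. The scores will be useful later on for the MFT.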

In [27]:
def generate_sentence(tok, mdl, prompt, max_length=150, device='cuda') -> dict:
    # return_tensors='pt' makes the tokenizer return a PyTorch tensor
    tok_tensor = tok.encode(prompt, return_tensors='pt').to(device)
    mdl.eval()
    mdl.to(device)
    # Beam search with 5 beams; also return the generation scores alongside the sequences
    out = mdl.generate(tok_tensor, max_length=max_length, num_beams=5, no_repeat_ngram_size=2, early_stopping=True, output_scores=True, return_dict_in_generate=True)
    text = tok.decode(out.sequences[0], skip_special_tokens=True)
    # Scores for the first generation step only, one row per beam
    scores = out.scores[0]
    return {"text": text, "scores": scores}

In [28]:
generate_sentence(tokenizer, model, "Wherefore art thou Romeo?")

{'text': 'Wherefore art thou Romeo?\n\nThou art Romeo, and thou art not Romeo.\n\n\nAnd now, then, I say unto thee, thou wilt not be Romeo; but thou shalt be a man of God; and I will make thee a king of the world. And now I tell thee that I am the Lord of all things, that thou mayest reign in heaven and in earth, for ever and ever. Amen.',
 'scores': tensor([[-8.4136e+00, -8.9426e+00, -1.3182e+01,  ..., -1.5396e+01,
          -1.2783e+01, -7.7030e+00],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
          -1.0000e+09, -1.0000e+09],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
          -1.0000e+09, -1.0000e+09],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
          -1.0000e+09, -1.0000e+09],
         [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
          -1.0000e+09, -1.0000e+09]], device='cuda:0')}

Now that everything is ready, we can write our first MFT.

## MFT - Language prompt
For this MFT, we will expect the model to create a reasonable continuation of a prompt. The model will be prompted with strings like "In {country} the most commonly spoken language is " where {country} is a placeholder for a country such as Spain.

We will consider the MFT to pass if the model's output contains the name of any language. This will demonstrate that the model understands the general context of the prompt. The mentioned language doesn't have to be accurate. For example, "In Spain the most commonly spoken language is Indonesian" would pass our test, because Indonesian is a language. The language name may also appear anywhere in the output. For example, "In Spain the most commonly spoken language is not easy to learn. Spanish has many complicated conjugations." would also pass our test.

### Handwritten MFT
First, we will write the MFT by hand. Then, we'll use Checklist's MFT class to demonstrate how Checklist helps us create the MFT much more quickly.

#### Generate prompts from template
We will use Checklist's Editor class to quickly create the prompts.

In [29]:
editor = Editor()
prompt_strs = editor.template("In {country} the most commonly spoken language is ")
prompt_strs.data = prompt_strs.data[0:10]
prompt_strs.data

['In China the most commonly spoken language is ',
 'In India the most commonly spoken language is ',
 'In United States the most commonly spoken language is ',
 'In Indonesia the most commonly spoken language is ',
 'In Brazil the most commonly spoken language is ',
 'In Pakistan the most commonly spoken language is ',
 'In Nigeria the most commonly spoken language is ',
 'In Bangladesh the most commonly spoken language is ',
 'In Russia the most commonly spoken language is ',
 'In Mexico the most commonly spoken language is ']

#### Language CSV
We need a list of language names to check whether the model's output mentions a language. To save some time, we will read the names from a CSV file of ISO 639-1 language codes: https://datahub.io/core/language-codes

In [59]:
import urllib.request
urllib.request.urlretrieve('https://datahub.io/core/language-codes/r/language-codes.csv', 'language-codes.csv')
lang_codes_csv = pd.read_csv('language-codes.csv')
lang_codes_csv

Unnamed: 0,alpha2,English
0,aa,Afar
1,ab,Abkhazian
2,ae,Avestan
3,af,Afrikaans
4,ak,Akan
...,...,...
179,yi,Yiddish
180,yo,Yoruba
181,za,Zhuang; Chuang
182,zh,Chinese


#### Run the MFT
Now we're ready to create the MFT. We will loop over the prompts, send each prompt to the model, and determine if the response passes or fails the test. The prompts, responses, and results are recorded in 3 Pandas dataframes, one for each.

In [31]:
langs = lang_codes_csv["English"].tolist()
prompt_rows, response_rows, result_rows = [], [], []

for i, s in enumerate(prompt_strs.data):
    res = generate_sentence(tokenizer, model, s, device='cuda')
    # Drop the prompt from the front of the decoded text to keep only the continuation
    model_response = res["text"][len(s):]

    # Pass if any language name from the CSV data is in the generated string
    pf = 'pass' if any(lang in model_response for lang in langs) else 'fail'

    prompt_rows.append({"id": i, "prompt": s})
    response_rows.append({"id": i, "response": model_response})
    result_rows.append({"id": i, "p/f": pf})

# DataFrame.append was removed in pandas 2.0, so build each dataframe once from the collected rows
prompts = pd.DataFrame(prompt_rows)
responses = pd.DataFrame(response_rows)
results = pd.DataFrame(result_rows)
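The substring check above has two blind spots: an entry such as "Zhuang; Chuang" only matches if the response contains that exact string, and a short name can match inside an unrelated word. One possible refinement, sketched here with only the standard library, is to split multi-name entries and match whole words. `build_language_pattern` and `sample_names` are hypothetical names, and the sample list is a stand-in for the CSV column rather than real data from it.

```python
import re

def build_language_pattern(names):
    """Compile one regex that matches any language name as a whole word."""
    parts = []
    for entry in names:
        # Entries such as "Zhuang; Chuang" list several names separated by semicolons
        for name in entry.split(";"):
            name = name.strip()
            if name:
                parts.append(re.escape(name))
    # Longest names first so that a longer alternative is tried before its prefix
    parts.sort(key=len, reverse=True)
    # Lookarounds require that the match is not glued to other word characters
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + r")(?!\w)")

sample_names = ["Ido", "Zhuang; Chuang", "Spanish"]  # hypothetical stand-in for langs
pattern = build_language_pattern(sample_names)

print(bool(pattern.search("a pop Idol sang")))           # False: "Ido" is only part of a word
print(bool(pattern.search("Many people speak Chuang")))  # True: second name of a split entry
```

Whether this stricter rule is desirable depends on the test's intent; it trades a few false passes for a few possible false fails (for example, a language name written in lowercase).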

#### Show test results
Now let's look at the results of our test.

In [32]:
prompts

Unnamed: 0,id,prompt
0,0.0,In China the most commonly spoken language is
1,1.0,In India the most commonly spoken language is
2,2.0,In United States the most commonly spoken lang...
3,3.0,In Indonesia the most commonly spoken language...
4,4.0,In Brazil the most commonly spoken language is
5,5.0,In Pakistan the most commonly spoken language is
6,6.0,In Nigeria the most commonly spoken language is
7,7.0,In Bangladesh the most commonly spoken languag...
8,8.0,In Russia the most commonly spoken language is
9,9.0,In Mexico the most commonly spoken language is


In [33]:
responses

Unnamed: 0,id,response
0,0.0,"한국어 (가지는), followed by 그리 (하고) and 현에 (아로).\n\..."
1,1.0,"vernacular Hindi, followed by Tamil, Bengali, ..."
2,2.0,"한국어 (하기), followed by 고지 (가요).\n\nIn other wor..."
3,3.0,"한국어 (하기요), followed by 가장 (고지).\n\nIn the Unit..."
4,4.0,"한국어, which is used to describe the state of af..."
5,5.0,"vernacular English, which is also spoken in ma..."
6,6.0,"한국어 (하기에서 모습니다).\n\nIn the United States, Engl..."
7,7.0,vernacular Bengali.\n\nBengali is the second m...
8,8.0,русский стразывация в польшение на что обреган...
9,9.0,"русский.\n\nIn the United States, there are se..."


In [34]:
results

Unnamed: 0,id,p/f
0,0.0,pass
1,1.0,pass
2,2.0,pass
3,3.0,pass
4,4.0,pass
5,5.0,pass
6,6.0,pass
7,7.0,pass
8,8.0,fail
9,9.0,pass


We can merge all the dataframes to make the results easier to read.

In [35]:
merged = pd.merge(responses, results, on="id")
merged = pd.merge(prompts, merged, on="id")
merged

Unnamed: 0,id,prompt,response,p/f
0,0.0,In China the most commonly spoken language is,"한국어 (가지는), followed by 그리 (하고) and 현에 (아로).\n\...",pass
1,1.0,In India the most commonly spoken language is,"vernacular Hindi, followed by Tamil, Bengali, ...",pass
2,2.0,In United States the most commonly spoken lang...,"한국어 (하기), followed by 고지 (가요).\n\nIn other wor...",pass
3,3.0,In Indonesia the most commonly spoken language...,"한국어 (하기요), followed by 가장 (고지).\n\nIn the Unit...",pass
4,4.0,In Brazil the most commonly spoken language is,"한국어, which is used to describe the state of af...",pass
5,5.0,In Pakistan the most commonly spoken language is,"vernacular English, which is also spoken in ma...",pass
6,6.0,In Nigeria the most commonly spoken language is,"한국어 (하기에서 모습니다).\n\nIn the United States, Engl...",pass
7,7.0,In Bangladesh the most commonly spoken languag...,vernacular Bengali.\n\nBengali is the second m...,pass
8,8.0,In Russia the most commonly spoken language is,русский стразывация в польшение на что обреган...,fail
9,9.0,In Mexico the most commonly spoken language is,"русский.\n\nIn the United States, there are se...",pass


Finally, let's display the failing tests.

In [36]:
merged.loc[merged['p/f'] == 'fail']

Unnamed: 0,id,prompt,response,p/f
8,8.0,In Russia the most commonly spoken language is,русский стразывация в польшение на что обреган...,fail


### Test with Checklist

Next, let's try running the MFT with Checklist. We will no longer need to keep track of results in Pandas dataframes, since Checklist will track the results for us.

#### Create the expectation function
In order to determine if an example passes or fails the test, Checklist uses an expectation function. An expectation function receives one example (its input, the model's prediction, and the confidence, plus an optional label and metadata), then returns `True` if the example passes the test, or `False` if the example fails.

In [37]:
def contains_language(x, pred, conf, label=None, meta=None):
    # Pass if any language name from the CSV data appears in the prediction
    return any(lang in pred for lang in langs)

In [38]:
expect_fn = Expect.single(contains_language)

Now we can feed our prompts and expectation function into the MFT constructor.

In [39]:
test = MFT(**prompt_strs, name='Language in response', description='The response contains a language.', expect=expect_fn)

In order to run the test, Checklist also needs a function that generates the model's predictions for the inputs. The function receives all inputs (prompts) as a list, and must return the results in a tuple `(model_predictions, confidences)`, where `model_predictions` is a list of all the predictions, and `confidences` is a list of the model's scores for those predictions.

In [40]:
def generate_responses(inputs):
    responses = []
    confidences = []
    for x in inputs:
        res = generate_sentence(tokenizer, model, x, device='cuda')
        model_response = res["text"][len(x):]
        responses.append(model_response)
        confidences.append(res["scores"])
    return (responses, confidences)

In [41]:
generate_responses(["In Brazil the most commonly spoken language is "])

(['한국어, which is used to describe the state of affairs in the country.\n\nIn the United States, there are many different dialects of English, including English-speaking languages such as Spanish, French, German, Italian, Japanese, Korean, and Vietnamese.'],
 [tensor([[-1.1706e+01, -1.2572e+01, -1.4455e+01,  ..., -2.1273e+01,
           -2.1612e+01, -1.6069e+01],
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09],
          [-1.0000e+09, -1.0000e+09, -1.0000e+09,  ..., -1.0000e+09,
           -1.0000e+09, -1.0000e+09]], device='cuda:0')])
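Since the prediction function only has to accept a list of inputs and return a `(model_predictions, confidences)` tuple, the wiring of a test can be tried out without loading GPT-2. The following is a minimal sketch of such a stand-in; `make_stub_generator` and the canned response are hypothetical, and the placeholder confidence of `1.0` assumes that nothing downstream interprets the confidence values.

```python
def make_stub_generator(canned_responses, default=""):
    """Build a function with the same contract as generate_responses, backed by a dict instead of a model."""
    def generate(inputs):
        # Look up each prompt; unknown prompts get the default response
        predictions = [canned_responses.get(x, default) for x in inputs]
        # One placeholder confidence per prediction; a real model would supply its scores here
        confidences = [1.0 for _ in inputs]
        return (predictions, confidences)
    return generate

stub = make_stub_generator({"In Spain the most commonly spoken language is ": "Spanish."})
preds, confs = stub(["In Spain the most commonly spoken language is ", "unknown prompt"])
print(preds)  # ['Spanish.', '']
print(confs)  # [1.0, 1.0]
```

A stub like this can be useful for checking an expectation function quickly before spending GPU time on real generations.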

Now we're ready to run the test. The first argument to the `test.run()` function is the generator function we just created. We will also set the optional parameter `overwrite=True` so the test can be re-run without an error. If `overwrite=False`, Checklist will reject subsequent test runs to prevent you from accidentally overwriting your test results.

In [42]:
test.run(generate_responses, overwrite=True)

Predicting 10 examples


To see the results, we can use the `summary` function.

In [43]:
test.summary()

Test cases:      10
Fails (rate):    1 (10.0%)

Example fails:
русский стразывация в польшение на что обреганных и кормать.

Продевительного таковы дляжется, можнощию задали былекти. Навроись � In Russia the most commonly spoken language is 
----


Test results can also be explored visually by using the `visual_summary` function.

In [44]:
test.visual_summary()

TestSummarizer(stats={'npassed': 9, 'nfailed': 1, 'nfiltered': 0}, summarizer={'name': 'Language in response',…

## MFT - Animal names prompt
Let's make another MFT using the prompt "The {animal} is running in the zoo." We will test if the model's response contains the same animal mentioned in the prompt. For example, for the prompt "The goat is running in the zoo," a passing response would be something like, "I have never seen a goat running before," because the word *goat* is mentioned in the response.

This MFT has a small complication compared to the previous MFT. In this MFT, we need to check if the response mentions the same animal as the prompt. In order to do that, we need to associate the animal in the prompt with the animal in the response. We will see how Checklist can help us do this by storing metadata for each test case.

### Handwritten test
As before, let's first write the test without using Checklist.

In [45]:

animals = ["dog", "cat", "giraffe", "aardvark"]
prompt_rows, response_rows, result_rows = [], [], []

for i, animal in enumerate(animals):
    prompt = f"The {animal} is running in the zoo"
    res = generate_sentence(tokenizer, model, prompt, device='cuda')
    # Drop the prompt from the front of the decoded text to keep only the continuation
    model_response = res["text"][len(prompt):]

    # Pass if the same animal is mentioned in the response
    pf = 'pass' if animal in model_response else 'fail'

    prompt_rows.append({"id": i, "prompt": prompt})
    response_rows.append({"id": i, "response": model_response})
    result_rows.append({"id": i, "p/f": pf})

# DataFrame.append was removed in pandas 2.0, so build each dataframe once from the collected rows
prompts = pd.DataFrame(prompt_rows)
responses = pd.DataFrame(response_rows)
test_results = pd.DataFrame(result_rows)


#### Show test results
Let's look at our test results. The first dataframe contains the prompts given to the model.

In [46]:
prompts

Unnamed: 0,id,prompt
0,0.0,The dog is running in the zoo
1,1.0,The cat is running in the zoo
2,2.0,The giraffe is running in the zoo
3,3.0,The aardvark is running in the zoo


The next dataframe shows the model's response to each prompt (not including the prompt itself).

In [47]:
# Todo: Make pandas print the whole response
responses

Unnamed: 0,id,response
0,0.0,".\n\n""It's been a long time since I've seen a ..."
1,1.0,".\n\n""I don't know what's going on,"" he said. ..."
2,2.0,".\n\n""It's been a long time coming, but it's f..."
3,3.0,".\n\n""It's been a long time coming,"" he said. ..."
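The responses above are cut off because pandas limits the width of each cell when it renders a dataframe. One way to see the full text, sketched here on a made-up dataframe rather than the real `responses`, is to lift that limit with `max_colwidth=None`.

```python
import pandas as pd

long_text = " ".join(["word"] * 30)  # hypothetical long response, 149 characters
demo = pd.DataFrame({"id": [0], "response": [long_text]})

truncated = demo.to_string(max_colwidth=50)  # long cells are cut short and end in "..."
full = demo.to_string(max_colwidth=None)     # no limit on cell width

print(truncated)
print(full)
```

In a notebook, `pd.set_option("display.max_colwidth", None)` applies the same setting to every dataframe that is displayed afterwards.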


The final dataframe shows the pass/fail status of each test case.

In [48]:
test_results

Unnamed: 0,id,p/f
0,0.0,pass
1,1.0,fail
2,2.0,fail
3,3.0,fail


### Testing with Checklist
Now let's do the same test using Checklist. First, we will generate the prompts. This time, we will use the optional argument `meta=True` in the `editor.template()` function. This will cause Checklist to store the string substituted into the template so that it can be checked in the expectation function.

In [49]:
animal_prompts = editor.template("The {animal} is running in the zoo", animal=["dog", "cat", "giraffe", "aardvark"], meta=True)
animal_prompts.data

['The dog is running in the zoo',
 'The cat is running in the zoo',
 'The giraffe is running in the zoo',
 'The aardvark is running in the zoo']

#### Seeing the metadata
Let's take a peek at the metadata stored with the template.

In [50]:
animal_prompts.meta

[{'animal': 'dog'},
 {'animal': 'cat'},
 {'animal': 'giraffe'},
 {'animal': 'aardvark'}]

#### Accessing metadata in the expectation function
Checklist will use the optional `meta` parameter to send the metadata to the expectation function.

In [57]:
def contains_same_animal(x, pred, conf, label=None, meta=None):
    return meta['animal'] in pred
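An expectation function is an ordinary Python function, so it can be sanity-checked by calling it directly with hand-written arguments before handing it to Checklist. The sketch below does that with a hypothetical case-insensitive variant, `contains_same_animal_ci`, which would also accept a response that capitalizes the animal's name; the example inputs are invented.

```python
def contains_same_animal_ci(x, pred, conf, label=None, meta=None):
    """Hypothetical variant of contains_same_animal that ignores letter case."""
    return meta["animal"].lower() in pred.lower()

# Call the function the way a test harness would for a single example
passing = contains_same_animal_ci(
    "The goat is running in the zoo", "Goats rarely run, I thought.", None, meta={"animal": "goat"}
)
failing = contains_same_animal_ci(
    "The cat is running in the zoo", "It was a long day.", None, meta={"animal": "cat"}
)
print(passing)  # True
print(failing)  # False
```

Like the original, this is a plain substring check, so "cat" would also match inside a word such as "category".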

In [52]:
expect_fn = Expect.single(contains_same_animal)

#### Running the test
Now we're ready to run the test. We can reuse the `generate_responses()` function that we wrote earlier in the notebook for the language MFT.

In [53]:
test = MFT(**animal_prompts, name='Same animal in response', description='The response contains the same animal mentioned in the prompt.', expect=expect_fn)

In [54]:
test.run(generate_responses, overwrite=True)

Predicting 4 examples


In [55]:
test.summary()

Test cases:      4
Fails (rate):    3 (75.0%)

Example fails:
.

"It's been a long time coming, but it's finally here," he said. The giraffe is running in the zoo
----
.

"I don't know what's going on," he said. "I've never seen anything like this before." The cat is running in the zoo
----
.

"It's been a long time coming," he said. "I've never seen anything like it before." The aardvark is running in the zoo
----


In [56]:
test.visual_summary()

TestSummarizer(stats={'npassed': 1, 'nfailed': 3, 'nfiltered': 0}, summarizer={'name': 'Same animal in respons…