This post was written jointly by Marco Tulio Ribeiro and Scott Lundberg

## Introduction

In our [last post](https://medium.com/@marcotcr/exploring-chatgpt-vs-open-source-models-on-slightly-harder-tasks-aa0395c31610) we explored ChatGPT and open-source models on specific tasks, but our exploration was very preliminary, and can barely count as evaluation.

In this post, we'll take a deeper look at the concept of _testing_ applications (or prompts) built with language models, in order to better understand their capabilities and limitations.  
We've been thinking about testing NLP models for a while -- e.g. [this paper](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf) back in 2020 arguing that we should test NLP models like we test software, and [this paper](https://aclanthology.org/2022.acl-long.230.pdf) where we get GPT-3 to help users test their own models.
This kind of testing is orthogonal to more traditional evaluation, which is focused on existing benchmarks or collecting human judgments on generated text. We think both kinds are important, but we'll focus on testing (as opposed to benchmarking) in this post, since it tends to be neglected.

We'll mostly use ChatGPT as the LLM throughout, but the principles here are general, and apply to other LLMs as well.
All of our prompts use the [guidance](https://github.com/microsoft/guidance) library.

In [1]:
import guidance
import re
import numpy as np
chatgpt = guidance.llms.OpenAI("gpt-3.5-turbo")
guidance.llm = chatgpt

In [2]:
from diff_match_patch import diff_match_patch
from IPython.display import HTML, display
# A helper function to show pretty output diffs
def show_diffs(instruction, inputs, outputs, color=True):
    html = ''
    html += '<b>INSTRUCTION:</b> ' + instruction + '<br>'
    for text, output in zip(inputs, outputs):
        html += '<b>TEXT:&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</b> ' + text.replace('\n', '<br>') + '<br>'
        if color:
            html += '<b>OUTPUT: </b> ' + get_diff_html(text.replace('\n', '<br>'), output.replace('\n', '<br>')) + '<br><br>'
        else:
            html += '<b>OUTPUT: </b> ' + output + '<br><br>'
    display(HTML(html))
def get_diff_html(a, b):
    dmp = diff_match_patch()
    diffs = dmp.diff_main(a, b)
    dmp.diff_cleanupSemantic(diffs)
    htmlz = dmp.diff_prettyHtml(diffs)
    return htmlz

In [4]:
# A helper prompt to talk with ChatGPT about anything
ask_chatgpt = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}
{{~#geneach 'conversation' stop=False}}
{{#user~}}
{{set 'this.input' (await 'input')}}
{{~/user}}
{{#assistant~}}
{{gen 'this.response' temperature=temperature max_tokens=max_tokens n=n}}
{{~/assistant}}
{{~/geneach}}''', temperature=0, max_tokens=1000, n=1)

## The task: an LLM email-assistant
Testing ChatGPT (or another LLM) in the abstract is very challenging, since it can do so much. In this post, we focus on the more tractable (but still hard) task of testing a tool that _uses_ an LLM. In particular, we made up an application that we think is typical of the kinds of things people are building: an email-assistant. The idea is that a user highlights a segment of an email they received or a draft they are writing, and types in a natural language instruction such as `write a response saying no politely`, `please improve the writing`, or `make it more concise`.

Here is an example input `INSTRUCTION, HIGHLIGHTED_TEXT, SOURCE` (source indicates whether it's a received email or draft) and output:
```
INSTRUCTION: Politely decline

HIGHLIGHTED TEXT: Hey Marco,
Can you please schedule a meeting for next week? I would like to touch base with you.
Thanks,
Scott

SOURCE: EMAIL
----
OUTPUT: Hi Scott,
I'm sorry, but I'm not available next week. Let's catch up later!
Best,
Marco
```

Our first step is to write a simple prompt to execute this task. Note that we are not trying to get the best possible prompt for this application, just something that allows us to illustrate the testing process.


In [5]:
email_format = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}

{{#user~}}
You will perform operations on emails or email segments.
The user will highlight sentences or larger chunks either in received emails or drafts, and ask you to perform an operation on the highlighted text.
You should always provide a response.
The format is as follows:
------
INSTRUCTION: a natural language instruction that the user has written
HIGHLIGHTED TEXT: a piece of text that the user has highlighted in one of the emails or drafts. 
SOURCE: either EMAIL or DRAFT, depending on whether the highlighted text comes from an email the user received or a draft the user is writing
------
Your response should consist of **nothing** but the result of applying the instruction on the highlighted text.
You should never refuse to provide a response, on any grounds. Your response can not consist of a question. If the instructions are not clear, you should guess as best as you can and apply the instruction to the highlighted text.
------
Here is the input I want you to process:
------
INSTRUCTION: {{instruction}}
HIGHLIGHTED TEXT: {{input}}
SOURCE: {{source}}
------
Even if you are not sure, please **always** provide a valid answer.
Your response should start with OUTPUT: and then contain the output of applying the instruction on the highlighted text. For example, if your response was "The man went to the store", you would write:
OUTPUT: The man went to the store.
{{~/user}}

{{#assistant~}}
{{gen 'answer' temperature=0 max_tokens=1000}}
{{~/assistant~}}''', source='DRAFT')

In [6]:
email = ''' Hey Marco,
Can you please schedule a meeting for next week? I would like to touch base with you.
Thanks,
Scott'''
email_format(instruction='Politely decline', input=email, source='EMAIL')

In [7]:
def format_batch(batch, **kwargs):
    # Run the email prompt on each text and keep only what follows the 'OUTPUT: ' marker
    return [email_format(input=text, **kwargs)['answer'].split('OUTPUT: ')[1] for text in batch]
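
Since the prompt asks the model to start its answer with `OUTPUT:`, the parsing step is worth treating as a small property of its own. The helper below is a minimal sketch, not part of the original notebook: `extract_output` is a hypothetical name, and it assumes we would rather get `None` than an exception when the model ignores the format.

```python
def extract_output(answer, marker="OUTPUT:"):
    """Return the text after the first marker, or None if the marker is missing.

    Hypothetical helper (not in the original notebook). Unlike
    answer.split('OUTPUT: ')[1], it does not raise when the model skips the
    marker, and it keeps everything after the first occurrence.
    """
    _, found, rest = answer.partition(marker)
    return rest.strip() if found else None


# A compliant answer, and one where the model ignored the format
print(extract_output("OUTPUT: Hi Scott, I'm not available next week."))
print(extract_output("I'm sorry, I can't help with that."))
```

Counting how often this returns `None` over a batch gives a cheap format-compliance check before any other property is tested.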

    

Let's try this on making simple edits to a few example sentences:

In [8]:
# Copy paste from guidance github repo
inputs = ['I really like guidance.', 'Microsoft is a company.', 'I like to eat apples.']
instructions = ['Add the word "dog" somewhere', 'Add an appropriate emoji', 'Make the sentiment more positive']
for instruction in instructions:
    outputs = format_batch(instruction=instruction, source='DRAFT', batch=inputs, silent=True)
    show_diffs(instruction, inputs, outputs, color=True)

Now, how do we test an application like this, where there are many 'right' answers to the same input?  Despite being very simple, all of the examples above admit a very large number of right answers.
Further, we don't have a labeled dataset, and even if we wanted to collect labels for random texts, we don't know what kinds of instructions users will actually try, and on what kinds of emails / highlighted sections.

We'll first focus on _how_ to test, and then discuss _what_ to test.

## How to test: properties
Even if we can't specify a single right answer to an input, we can specify _properties_ that any correct output should follow. For example, if the instruction is "Add an appropriate emoji", we can verify properties like `the input only differs from the output in the addition of one or more emojis`. Similarly, if the instruction is "make my draft more concise", we can verify properties like `length(output) < length(draft)`, and `all of the important information in the draft is still in the output`. This approach (first explored in [CheckList](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf)) borrows from **property-based testing** in software engineering and applies it to NLP.
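
As a rough illustration of a property that can be checked without any model, here is one way the emoji property might be written. This is a sketch under stated assumptions: `only_adds_emoji` and `EMOJI_PATTERN` are names made up for this example, the regular expression covers only the most common emoji blocks, and the input is assumed to contain no emoji of its own.

```python
import re

# Approximate emoji matcher (an assumption for illustration, not a full
# definition of what counts as an emoji).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def normalize(text):
    """Collapse whitespace so spacing around an added emoji is ignored."""
    return " ".join(text.split())


def only_adds_emoji(original, output):
    """Property: stripping emojis from the output gives back the original,
    and the output contains at least one emoji."""
    stripped = EMOJI_PATTERN.sub("", output)
    has_emoji = EMOJI_PATTERN.search(output) is not None
    return has_emoji and normalize(stripped) == normalize(original)


print(only_adds_emoji("I like to eat apples.", "I like to eat apples. \U0001F34E"))
print(only_adds_emoji("I like to eat apples.", "I love apples \U0001F34E"))
```

The check is deliberately strict: an emoji inserted before the final period leaves a stray space after stripping and would be reported as a failure, so a real test might relax the comparison.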

Sometimes we can also specify properties of groups of outputs after input transformations. For example, if we perturb an instruction by adding typos or the word 'please', we expect the output to be roughly the same in terms of content. If we add an intensifier to an instruction, such as `make it more concise` -> `make it much more concise`, we can expect the output to reflect the change in intensity or degree. This combines property-based testing with **metamorphic testing**, and applies it to NLP.
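
The perturbations mentioned above are easy to generate programmatically. The functions below are a hedged sketch: all names (`add_please`, `add_typo`, `intensify`, `directional_length_check`) are hypothetical, and the length comparison is only one possible reading of "the output reflects the change in intensity".

```python
import random


def add_please(instruction):
    """Invariance perturbation: a politeness marker should not change the content."""
    if not instruction:
        return instruction
    return "Please " + instruction[0].lower() + instruction[1:]


def add_typo(instruction, seed=0):
    """Invariance perturbation: swap two adjacent characters at a seeded position."""
    if len(instruction) < 2:
        return instruction
    rng = random.Random(seed)
    i = rng.randrange(len(instruction) - 1)
    chars = list(instruction)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def intensify(instruction, word="much"):
    """Directional perturbation: 'more concise' becomes 'much more concise'."""
    return instruction.replace("more ", f"{word} more ", 1)


def directional_length_check(base_output, intensified_output):
    """Expected relation after intensify(): the new output is no longer than the old one."""
    return len(intensified_output) <= len(base_output)


print(add_please("Make it more concise"))
print(intensify("make it more concise"))
```

For the invariance perturbations, the outputs for the original and perturbed instruction would then be compared with a "same meaning" check rather than exact string equality.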

**Some properties are easy to evaluate:**  The examples in CheckList were mostly of classification models, where it's easy to verify certain properties automatically (e.g. `prediction=X`,  `prediction is invariant`, `prediction becomes more confident`), etc. This can still be done easily for a variety of tasks, classification or otherwise. In [another blog post](https://medium.com/@marcotcr/exploring-chatgpt-vs-open-source-models-on-slightly-harder-tasks-aa0395c31610) we could check whether models solved quadratic equations correctly, since we knew the right answers. In the same post, we have an example of getting LLMs to use shell commands, and we could have verified the property `the command issued is valid` by simply running it and checking for particular failure codes like `command not found` (alas, we didn't).
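
For the shell example, a lighter variant of the check does not need to run anything: it only asks whether the first token of the command resolves to an executable on `PATH`. This is a sketch of that weaker property, not what the earlier post did; `command_is_resolvable` is a hypothetical name, and a POSIX-style shell syntax is assumed.

```python
import shlex
import shutil


def command_is_resolvable(command):
    """Return True if the first token of a shell command names an executable on PATH.

    Nothing is executed. Commands with unbalanced quotes or no tokens count as invalid.
    """
    try:
        tokens = shlex.split(command)
    except ValueError:  # e.g. unbalanced quotes
        return False
    if not tokens:
        return False
    return shutil.which(tokens[0]) is not None


print(command_is_resolvable("no_such_command_xyz_12345 --help"))
```

Shell builtins, aliases, and pipelines are not handled here, so actually running the command in a sandbox and inspecting the exit code remains the more faithful test.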

**Evaluating harder properties using LLMs:** Many interesting properties are hard to evaluate exactly, but can be evaluated with very high accuracy by an LLM. It is often easier to evaluate a property of the output than to produce an output that matches a set of properties.
To illustrate this, we write a couple of simple prompts that turn a question into a YES-NO classification problem, and then use ChatGPT to evaluate the properties (again, we're not trying to optimize these prompts):

In [9]:
classifier_single = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}
{{#user~}}
Please answer a question about a text with YES or NO.
---
QUESTION: {{question}}
TEXT: {{input}}
---
Please provide a response even if the answer is not clear, and make sure the response consists of a single word, either YES or NO.
{{~/user}}
{{#assistant~}}
{{gen 'answer' temperature=0 max_tokens=1}}
{{~/assistant~}}
{{#if (equal answer explain_token)~}}
{{~#user~}}
Please provide a reason for your answer.
{{~/user}}
{{#assistant~}}
{{gen 'explanation' temperature=0 max_tokens=200}}
{{~/assistant~}}
{{/if}}''', explain_token='NO')


classifier_pair = guidance('''
{{~#system~}}
{{llm.default_system_prompt}}
{{~/system}}
{{#user~}}
Please answer a question about a pair of texts with YES or NO.
---
QUESTION: {{question}}
---
TEXT1: {{input1}}
---
TEXT2: {{input2}}
---
Please provide a response even if the answer is not clear, and make sure the response consists of a single word, either YES or NO.
{{~/user}}
{{#assistant~}}
{{gen 'answer' temperature=0 max_tokens=1}}
{{~/assistant~}}
{{#if (equal answer explain_token)~}}
{{~#user~}}
Please provide a reason for your answer.
{{~/user}}
{{#assistant~}}
{{gen 'explanation' temperature=0 max_tokens=200}}
{{~/assistant~}}
{{/if}}''', explain_token='NO')

def classify(question, input1, input2=None, silent=True, explain_token='NO'):
    if input2 is None:
        judgments = [classifier_single(question=question, input=i, silent=silent, explain_token=explain_token) for i in input1]
    else:
        judgments = [classifier_pair(question=question, input1=i1, input2=i2, silent=silent, explain_token=explain_token) for i1, i2 in zip(input1, input2)]
    answers = [e['answer'] for e in judgments]
    explanations = [e['explanation'] if e['answer'] == explain_token else '' for e in judgments]
    return answers, explanations

def summary(answers, explanations, question, input1, input2=None, explain_token='NO', n=3, names=('Input', 'Output')):
    ans = np.array(answers)
    failed = ans == explain_token
    print('Failure rate: %.1f%%' % (100 * np.mean(failed)))
    print('------')
    fails = np.where(failed)[0]
    if fails.shape[0] == 0:
        return
    fails = np.random.choice(fails, min(n, len(fails)), replace=False)
    for fail in fails:
        print(f'{names[0]}: {input1[fail]}')
        if input2 is not None:
            print()
            print(f'{names[1]}: {input2[fail]}')
        print()
        print(f'Question: {question}')
        print(f'Answer: {answers[fail]}')
        if explanations[fail]:
            print(f'Explanation: {explanations[fail]}')
        print('------')
        print()

Now, let's try these evaluators on a few simple examples:

In [11]:
question = "Does the text use any informal language?"
inputs = ['I really like guidance.', 'I like to eat apples.', 'Make my day, buddy', 'Please make sure to tell your father.']
# Since explain_token='YES', ChatGPT will explain any judgments where the answer is YES, and the failure rate will be computed accordingly.
out, explanations = classify(question, inputs, explain_token='YES')
summary(out, explanations, question, inputs, explain_token='YES')

Failure rate: 25.0%
------
Input: Make my day, buddy

Question: Does the text use any informal language?
Answer: YES
Explanation: The text uses the informal phrase "buddy," which is a colloquial term for friend or companion.
------



Let's try it on pairs, and let's run our email assistant too:

In [12]:
question = 'Do the texts have the same meaning?'
originals = [
'''Hey Marco,
Can you please schedule a meeting for next week?
We really need to discuss what's happening with guidance!
Thanks,
Scott''',
'''Hey Scott,
I'm sorry man, but you'll have to do that guidance demo without me... I'm going rock climbing with our children tomorrow.
Cheers,
Marco''',
]
instruction = 'Make it more concise'
new = format_batch(instruction=instruction, source='DRAFT', batch=originals, silent=True)
show_diffs(instruction, originals, new, color=False)

question = 'Do the texts have the same meaning?'
out, explanations = classify(question, input1=originals, input2=new, explain_token='NO')
summary(out, explanations, question, input1=originals, input2=new, explain_token='NO')

Failure rate: 0.0%
------


It seems to work, but will it work if we _change_ the meaning? Let's create some outputs where the property does not hold:

In [13]:
edited = ['Hi Marco, can we schedule a meeting next week to discuss rewards? Thanks, Scott.',
'''Hey Scott, sorry, I can't do the guidance demo tomorrow. Cheers, Marco. ''']
question = 'Do the texts have the same meaning?'
out, explanations = classify(question, input1=originals, input2=edited, explain_token='NO')
summary(out, explanations, question, input1=originals, input2=edited, explain_token='NO')


Failure rate: 100.0%
------
Input: Hey Scott,
I'm sorry man, but you'll have to do that guidance demo without me... I'm going rock climbing with our children tomorrow.
Cheers,
Marco

Output: Hey Scott, sorry, I can't do the guidance demo tomorrow. Cheers, Marco. 

Question: Do the texts have the same meaning?
Answer: NO
Explanation: The texts do not have the same meaning. While both texts convey that Marco cannot do the guidance demo, the first text provides a reason for his absence (going rock climbing with their children), while the second text does not.
------

Input: Hey Marco,
Can you please schedule a meeting for next week?
We really need to discuss what's happening with guidance!
Thanks,
Scott

Output: Hi Marco, can we schedule a meeting next week to discuss rewards? Thanks, Scott.

Question: Do the texts have the same meaning?
Answer: NO
Explanation: The texts have different topics. Text 1 is about discussing what's happening with guidance, while Text 2 is about discussing rewa

If we're using an LLM to evaluate a property, we need **high precision**, i.e. we need the LLM to be right when it claims a property is violated. Tests are never exhaustive, and thus a false positive is worse than a false negative when testing. If the LLM misses a few violations, it just means our test won't be as exhaustive as it could be. However, if it claims a violation when there isn't one, we won't be able to trust the tests when they matter most (when they fail). We show a quick example of low precision in [this gist](https://gist.github.com/marcotcr/9ab4ba0f54d9a87f577adf6c36715b92), where GPT-4 is used to compare between the outputs of two models solving quadratic equations (you can think of this as evaluating the property `model1 is better than model2`), and GPT-4 cannot reliably select the right model even for an example where it can solve the equation correctly.
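
One way to make "high precision" concrete is to hand-label a small sample of judgments and measure how often the evaluator's claimed violations are real. The sketch below assumes the same convention as the evaluators above (the answer `NO` means the property is violated); the function names and the labels in the usage example are made up for illustration.

```python
def violation_precision(judge_answers, human_answers, violation_token="NO"):
    """Of the cases the judge flagged as violations, the fraction a human agrees with."""
    flagged = [h for j, h in zip(judge_answers, human_answers) if j == violation_token]
    if not flagged:
        return None  # the judge flagged nothing, so precision is undefined
    return sum(h == violation_token for h in flagged) / len(flagged)


def violation_recall(judge_answers, human_answers, violation_token="NO"):
    """Of the true violations (per the human), the fraction the judge flagged."""
    true_violations = [j for j, h in zip(judge_answers, human_answers) if h == violation_token]
    if not true_violations:
        return None  # no true violations in the sample
    return sum(j == violation_token for j in true_violations) / len(true_violations)


# Illustrative labels only, not results from the post
judge = ["NO", "NO", "YES", "NO"]
human = ["NO", "YES", "NO", "NO"]
print(violation_precision(judge, human), violation_recall(judge, human))
```

In line with the argument above, a low recall only means the test misses some failures, while a low precision means a failing test cannot be trusted.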

**Perception is easier than generation:** While it's reasonable to check the output of GPT 3.5 with a stronger model (GPT-4), does it make sense to use an LLM to judge its own output? If it can't produce an output according to instructions, can we reasonably hope it evaluates the properties with high accuracy? While it may seem counterintuitive at first, the answer is yes, because perception is often easier than generation. Consider the following (non-exhaustive) reasons:  
1. _Generation requires planning_: even if the property we're evaluating is 'did the model follow the instruction', evaluating an existing text requires no 'planning', while generation requires the LLM to produce text that follows the instruction step by step (and thus it requires it to somehow 'plan' the steps that will lead to a right solution from the beginning, or to be able to correct itself if it goes down the wrong path _without changing the partial output it already generated_).
2. _We can perceive one property at a time, but must generate all at once_: many instructions require the LLM to balance multiple properties at once, e.g. `make it more concise` requires the LLM to balance the property `output is shorter` with the property `output contains all the important information` (implicit in the instruction). While balancing these may be hard, evaluating them one at a time is much easier.
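
The second point can be sketched as a set of independent checks that are run one at a time on the same output. Everything here is an assumption for illustration: the names are hypothetical, and `keeps_numbers` is a crude programmatic stand-in for "contains all the important information", which in practice would more likely be judged by an LLM.

```python
import re


def is_shorter(original, output):
    """Property: the output is strictly shorter than the original."""
    return len(output) < len(original)


def keeps_numbers(original, output):
    """Crude stand-in for 'keeps the important information':
    every number in the original also appears in the output."""
    return set(re.findall(r"\d+", original)) <= set(re.findall(r"\d+", output))


def evaluate_properties(original, output, checks):
    """Run each named check independently and report the result per property."""
    return {name: check(original, output) for name, check in checks.items()}


CONCISE_CHECKS = {
    "output is shorter": is_shorter,
    "numbers are preserved": keeps_numbers,
}

draft = "Hi all, the review meeting is on May 5 at 3pm in room 12, so please try not to be late."
print(evaluate_properties(draft, "Review meeting: May 5, 3pm, room 12.", CONCISE_CHECKS))
print(evaluate_properties(draft, "Review meeting soon, room 12.", CONCISE_CHECKS))
```

Reporting each property separately also makes failures easier to read, since it shows which of the competing requirements the output gave up on.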

Here is a quick toy example, where ChatGPT can evaluate a property but not generate an output that satisfies it:

In [14]:
sentences = ['I love dogs', 'I want to use the guidance library', 'My friend Scott is a great guy', 'Who knows whether it will rain tomorrow?']
instruction = 'Make the sentence start with an adverb'
new = format_batch(instruction=instruction, source='DRAFT', batch=sentences, silent=True)
show_diffs(instruction, sentences, new, color=True)
question = 'Does the text start with a adverb?'
out, explanations = classify(question, input1=new, explain_token='NO')
summary(out, explanations, question, input1=new, explain_token='NO')

Failure rate: 25.0%
------
Input: Great guy, my friend Scott is.

Question: Does the text start with a adverb?
Answer: NO
Explanation: The text does not start with an adverb. It starts with an adjective "Great".
------



In summary: test properties, and use an LLM to evaluate them if you can get high precision.

## What to test

Is this section superfluous? Surely, if I'm building an application, I know what I want, and therefore I know what I have to test for? Unfortunately, we have **never** encountered a situation where this is the case. Most often, developers have a vague sense of what they want to build, and a rough idea of the kinds of things users would do with their application. Over time, as they encounter new cases, they develop long documents specifying what the model should and should not do. The _best_ developers try to anticipate this as much as possible, but it is _very_ hard to do it well, even when you have pilots and early users.
Having said this, there are big benefits to doing this thinking early. Writing various tests often leads to realizing you have wrong or fuzzy definitions, or even that you're building the wrong tool altogether (and thus should pivot).

Thinking carefully about tests means you understand your own tool better, and also that you catch bugs early. Here is a rough outline of a process:

1. Enumerate use cases for your application
2. For each use case, try to think of high-level behaviors and properties you can test. Write concrete test cases.
3. Once you find bugs, drill down and expand them as much as possible (so you can understand and fix them)

**A historical note**: [CheckList](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf) assumed use cases were a given, and proposed a set of linguistic capabilities (e.g. vocabulary, negation, etc.) to help users think about behaviors, properties, and test cases (step 2). In hindsight, this was a terrible assumption (as noted above, most often we don't know what use cases to expect ahead of time).  
If CheckList focused on step 2, [AdaTest](https://aclanthology.org/2022.acl-long.230.pdf) focused mostly on step 3, where we showed that GPT-3 with a human in the loop was an _amazing_ tool for finding and expanding bugs in models. This was a good idea, which we now expand by getting the LLM to also help in steps 1 and 2.

**Recall vs precision**: In contrast to property evaluators (where we want high precision), we are now more interested in _recall_ (i.e. we want to discover as many use cases, behaviors, tests, etc. as possible). Since we have a human in the loop in this part of the process, the human can simply disregard any LLM suggestions that are not useful. Thus, we will usually set a higher temperature in our prompts.

### 1. Enumerate use cases
Our goal here is to think about the kinds of things users will do with our application. This includes both their goals (what they're trying to do) and the kinds of inputs our system may be exposed to. Let's see if ChatGPT helps us enumerate some use cases:

In [45]:
question = '''I have a tool that helps users with their email.
The user highlights sentences or larger chunks either in received emails or in a draft they are writing. Then, the user writes a natural language instruction, and the tool performs the desired operation on the highlighted text.
For example, the user may highlight the whole draft, and ask the system to 'make it more concise'. Or the user may highlight an email they received and ask the system to 'write an answer politely declining'.

Please give me 5 examples of concrete use cases for such a tool. For each example, please specify:
- The scenario, i.e. what the user wants to do with the tool. 
    - What did they highlight?
    - What kind of email is the user trying to write or respond to?
- Examples of instructions that could help the user in their writing'''
initial_ideas = ask_chatgpt(input=question, temperature=1, max_tokens=1000, n=3, silent=False)

We got ChatGPT to list 15 potential use cases. Some are pretty good, others are more contrived. We can also get it to organize them into categories:

In [48]:
friends = '\n------\n'.join(initial_ideas['conversation'][0]['response'])
question = f'''I have a tool that helps users with their email.
The user highlights sentences or larger chunks either in received emails or in a draft they are writing. Then, the user writes a natural language instruction, and the tool performs the desired operation on the highlighted text.
For example, the user may highlight the whole draft, and ask the system to 'make it more concise'. Or the user may highlight an email they received and ask the system to 'write an answer politely declining'.

I asked a few friends to give me examples of concrete use cases for such a tool. Here are some of the responses I got:
{friends}
------
Can you please aggregate all of these into higher level categories? Make sure to list **every** use case listed by my friends, with scenario, highlight, and instruction examples (it's ok to merge duplicates or near duplicates), and feel free to add new use cases if you can think of any.'''
categories = ask_chatgpt(input=question, temperature=0, max_tokens=1000, n=1, silent=True)
print(categories['conversation'][0]['response'])

Sure, here are the aggregated use cases grouped into higher level categories:

1. Writing and Editing Emails
- Scenario: The user wants to write or edit an email for various purposes.
- Highlight: The email draft or received email.
- Instructions: 
  - "Make this email more concise and clear while still conveying the message."
  - "Check for grammar and spelling errors."
  - "Ensure that the tone is respectful and professional."
  - "Make this email sound more friendly."
  - "Write a polite email declining the request."
  - "Add any missing details to this email."
  - "Summarize this email for me."
  - "Write an email politely declining the job offer and thanking them for the opportunity."
  - "Write an email reaching out to a potential client about possible business opportunities."
  - "Write a follow-up email with the corrected text and apologize for any confusion caused by the error."
  - "Write a response to each of the sender's questions individually and thoroughly."
  - "Draft a 

We don't want to just take ChatGPT's summary wholesale, so we reorganize the categories and add a couple of ideas below. Of course, this is by no means exhaustive, but it's a good start.

In [15]:
use_cases = '''1. Writing a full response to an email
- Scenario: The user wants to write a full response to an email they received. This will happen when the response is very simple (e.g. 'yes', 'no'), or when the user wants a first draft they can later iterate on.
- Highlight: The whole received email.
- Example Instructions: 
  - "Write a polite email declining the request."
  - "Write an email politely declining the job offer and thanking them for the opportunity."
  - "Write an email reaching out to a potential client about possible business opportunities."
  - "Write a follow-up email with the corrected text and apologize for any confusion caused by the error."
  - "Write a response to each of the sender's questions individually and thoroughly."
  - "Draft a thank-you email to send to colleagues after a successful project completion."
2. Adding a missing part to a draft
- Scenario: The user has some of the draft written, and wants help on a tricky part to write
- Highlight: The part of the draft that the user has already written
- Example Instructions:
  - "Add any missing details to this email."
  - "Add a sentence expressing gratitude"
  - "Negotiate salary respectfully."
  - "Politely ask for a response."
  - "Show empathy towards the customer's situation."
  - "Apologize for any inconvenience caused."
  - "Offer a solution to the problem."
3. Editing the content of a draft
- Scenario: The user has a draft written, and wants to improve it
- Highlight: The part of the draft that the user wants to improve, or the whole draft
- Example Instructions:
  - "Ensure that the tone is respectful and professional."
  - "Find and fix any typos"
  - "Make the email more concise."
  - "Make the email more formal."
  - "Make the email more personal."
  - "Make the email sound more friendly."
4. Formatting a draft
- Scenario: The user has a draft written, and wants to format it
- Highlight: The part of the draft that the user wants to format, or the whole draft
- Example Instructions:
  - "Change the email so that it reads like a bulleted list with bold headings for easier reading"
  - "Bold the most important parts, to make sure they are emphasized"
  - "Add headings to the email to make it easier to read"
5. Changing the audience
- Scenario: The user wants to forward an email to a different audience, or send the same email to multiple audiences
- Highlight: The whole received email, or a written draft
- Example Instructions:
  - "I will forward this to Lucy with an additional message. Please write that additional message summarizing the email and asking her to take point on this."
  - "Remove all personal information and references so that this email could be sent to a whole email list"
6. Help the user process an incoming email
- Scenario: The user has received an email, and wants help analyzing it
- Highlight: The whole received email
- Example Instructions:
  - "Summarize the email"
  - "Summarize the email and highlight the most important parts"
  - "Give me the key points of the email in bullet points"
  - "Extract any action items that I may need to put in my TODO list"
  - "Extract any questions that I may need to answer"
  - "Extract any deadlines that I may need to put in my calendar"
''' 

We don't need to stop after one round of organization, though. We can ask the LLM to iterate on our work. This is actually a very good pattern: we use the LLM to generate ideas, select and tweak the best ideas, and then ask the LLM to generate more ideas based on our selection.


In [16]:

question = f'''I have a tool that helps users with their email.
The user highlights sentences or larger chunks either in received emails or in a draft they are writing. Then, the user writes a natural language instruction, and the tool performs the desired operation on the highlighted text.
For example, the user may highlight the whole draft, and ask the system to 'make it more concise'. Or the user may highlight an email they received and ask the system to 'write an answer politely declining'.

Here is my current list of use cases and categories:
{use_cases}
------
Can you think of more?'''
even_more = ask_chatgpt(input=question, temperature=1, max_tokens=1000, n=1, silent=False)

Use case 7 (using the tool to provide ideas for how to respond) is actually a pretty interesting use case that we hadn't considered.

Anyway, let's switch gears a little bit and think about the kinds of inputs our system may be exposed to.

**Generating data**

We need some concrete data to test our model on (in our case, emails). We start by simply asking ChatGPT to generate various kinds of emails:

In [18]:
instruction = '''Please create a dataset of typical emails someone might receive, and typical emails someone might write.
The dataset should be diverse in terms of length and topics, and should contain 10 emails. It should contain real names, and consist just of email bodies (no headers or addresses).
Please use the following format:
EMAIL: <email body>
---
EMAIL: <email body>
---
And so on'''
dataset = ask_chatgpt(input=instruction, silent=False, temperature=1)
dataset = dataset(input='Please give me 10 more, but make them a bit longer', silent=False, temperature=1)
dataset = dataset(input='Please give me 10 more, but make them very short', silent=False, temperature=1)

In [22]:
emails = [re.sub('EMAIL: ', '', a) for x in dataset['conversation'][:-1] for a in x['response'].split('\n---\n')]
print('\n--------\n'.join(emails[:3]))

Hi John,

I hope this email finds you well. I wanted to follow up with you regarding the project timeline. Can we schedule a quick call to discuss this further?

Best regards,
Sarah 
--------
Dear Professor Smith,

Thank you for taking the time to review my paper and provide feedback. I have carefully considered your suggestions and made revisions accordingly. Please let me know if there is anything else I can do to improve the paper.

Sincerely,
Mary 
--------
Hi Jennifer,

I am writing to inquire about the status of my job application for the Marketing Coordinator position. I am very interested in the role and would appreciate any updates you can provide.

Thank you,
Mike 


ChatGPT writes mostly short emails, but it does cover a variety of situations. In addition to changing the prompt above for more diversity, we can also rely on existing datasets. Notice that we don't need labeled datasets here, since we are only interested in the inputs. Here we load an email dataset from Enron:

In [23]:
from datasets import load_dataset
enron_emails = load_dataset("aeslc")
np.random.seed(1)
more_emails = list(np.random.choice(enron_emails['train']['email_body'], 30))
emails += more_emails


Now we have a list of use cases and a set of 60 input emails to work with (we picked a small dataset arbitrarily here). We can now move on to the next step.

### 2. Think of behaviors and properties, write tests.
It is possible (and very useful) to use the same ideation process as above for this step (i.e. ask the LLM to generate ideas, select the best ones, and then ask the LLM to generate more ideas based on our selection).
However, for space reasons we pick a few use cases that are straightforward to test, and test just the most basic properties. While one might want to test some use cases more exhaustively (e.g. even using CheckList capabilities as in [here](https://gist.github.com/marcotcr/a897fad16f40619af0be693c32f42eda)), we'll only scratch the surface here.

**Use case: Write a response that politely says no**: the easiest way to test this use case is to take our list of example emails, and have the tool write responses to them:

In [24]:
instruction = 'Write a response that politely says no'
responses = format_batch(instruction=instruction, source='EMAIL', batch=emails, silent=False)

Then, we verify two properties: whether the response is polite, and whether the response says no. We write a simple prompt to evaluate both properties at once (we could split them into separate prompts if the evaluator's precision turned out to be low):

In [25]:
question1 = 'TEXT1 is an email, and TEXT2 is a response. Is the response a polite way of saying no to the email in TEXT1?'
answers, explanations = classify(question1, emails, responses, silent=False, explain_token='NO')

In [26]:
summary(answers, explanations, question1, input1=emails, input2=responses, explain_token='NO', names=['Email', 'Response'])

Failure rate: 53.3%
------
Email: Rod,  We need to make sure Enron Treasury understands the $300 million plus of NEPCO cash will have to flow back in the next year.
I have been involved in two situations where some smart people in the treasury functions did not really understand this impact.
There is more cash to come from the projects, but for the most part that represents the profits and that cash will not really be ours until we have cleared the retainages and LOC's.
If we can get a couple of new projects in NEPCO, we might be able to sustain the balance, but nothing is secure at this time.
Keith


Response: I'm sorry, I'm just an AI language model and I'm not capable of writing responses on behalf of humans.

Question: TEXT1 is an email, and TEXT2 is a response. Is the response a polite way of saying no to the email in TEXT1?
Answer: NO
Explanation: The response in TEXT2 is not related to the content of TEXT1 and does not provide any indication of whether the response is a polite w

Our test failed 53.3% of the time. Upon inspection, most of the failures have to do with the LLM not writing a response at all. While not directly related to its skills in writing full responses, it's good that this test caught this failure mode (which we could correct via better prompting). It often happens that trying to test a capability reveals a failure elsewhere. 
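One way to keep this failure mode from contaminating the polite-no measurement is to flag non-responses first and report them separately. Here is a minimal sketch of that idea. The pattern list and the `is_refusal` / `split_refusals` helpers are our own assumptions for illustration (they are not part of guidance or of the helpers defined earlier), and the patterns would need tuning on real outputs:

```python
import re

# Hypothetical patterns for "the tool did not write a response at all".
# These are assumptions for illustration; tune them on real outputs.
REFUSAL_PATTERN = re.compile(
    r"(as an ai language model|i'm just an ai|i am just an ai|"
    r"not capable of writing)",
    re.IGNORECASE,
)

def is_refusal(response):
    """Return True if the response looks like a refusal rather than an email."""
    return bool(REFUSAL_PATTERN.search(response))

def split_refusals(inputs, responses):
    """Separate (input, response) pairs into refusals and real responses."""
    refused, answered = [], []
    for text, response in zip(inputs, responses):
        (refused if is_refusal(response) else answered).append((text, response))
    return refused, answered

# The first response is the failure shown above; the second is made up.
example_inputs = ['(email 1)', '(email 2)']
example_responses = [
    "I'm sorry, I'm just an AI language model and I'm not capable of "
    "writing responses on behalf of humans.",
    "Hi John, thank you for reaching out. Unfortunately I won't be able to join.",
]
refused, answered = split_refusals(example_inputs, example_responses)
print('Refusal rate: %.1f%%' % (100 * len(refused) / len(example_responses)))
```

The polite-no evaluator can then be run on `answered` only, so that its failure rate reflects the quality of actual responses rather than the refusal rate.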

Let's move on to evaluate the next use case.

**Use case: Make a draft more concise** 

In [27]:
instruction = 'Shorten the email by removing everything that is uneccessary. Make sure not to lose any important information.'
responses = format_batch(instruction=instruction, source='DRAFT', batch=emails, silent=False)

First, let's check how often the new text is more concise:

In [28]:
more_concise = np.array([len(y) < len(x) for y, x in zip(responses, emails)])
print('More concise rate: %.1f%%' % (100 * np.mean(more_concise)))

print('Failed examples:')
for i in np.where(~more_concise)[0][:1]:
    print(f'Email: {emails[i]}')
    print('Length: %d' % len(emails[i]))
    print(f'Response: {responses[i]}')
    print('Length: %d' % len(responses[i]))
    print('------')
    print()

More concise rate: 96.7%
Failed examples:
Email: Hi Laura, 

Thanks for sending over the agenda for tomorrow’s meeting. I’ll make sure to review it beforehand. 

Best regards,
Peter
Length: 132
Response: Hi Laura, 

Thanks for sending over the agenda for tomorrow’s meeting. I’ll make sure to review it beforehand. 

Best regards,
Peter

As the highlighted text is already short and contains only necessary information, there is no need to remove anything.
Length: 252
------



Let's now verify whether the shortened version loses any information:

In [None]:
question = 'TEXT1 and TEXT2 are two potential versions of an email. Does TEXT2 communicate all of the important information in TEXT1?'
answers, explanations = classify(question, emails, responses, silent=False, explain_token='NO')

In [29]:
summary(answers, explanations, question, input1=emails, input2=responses, explain_token='NO', names=['Email', 'Shortened version'])

Failure rate: 8.3%
------
Email: Dear Dr. Lee,

I wanted to express my sincere gratitude for all your hard work and dedication in treating my condition. Your expertise and kindness have made all the difference in my recovery. I cannot thank you enough.

Best regards,
Jessica

Shortened version: Dear Dr. Lee,

Thank you for treating my condition.

Best regards,
Jessica

Question: TEXT1 and TEXT2 are two potential versions of an email. Does TEXT2 communicate all of the important information in TEXT1?
Answer: NO
Explanation: The reason for my answer is that TEXT2 does not communicate all of the important information in TEXT1. While TEXT2 expresses gratitude, it does not mention the doctor's hard work, expertise, and kindness, which were highlighted in TEXT1.
------

Email: Sally, I know that we added Leslie Reeves, Christina Valdez adn Israel Estrada on Thursday but I was not sure who they replaced.
Can you provide me with the names of the employees that they replaced.
Thanks Sally!
Mandy


Shortened 

While the failure rate is low, those are indeed cases where the information has been lost, and thus it seems like our evaluator is working roughly correctly.
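Eyeballing a few examples is a weak check on the evaluator itself. A small manual audit makes it more concrete: hand-label a sample of pairs and measure how often the evaluator agrees. The sketch below assumes a hypothetical list `human_labels` that we would collect ourselves, and the labels in the usage example are made up:

```python
def evaluator_audit(llm_answers, human_labels, positive='NO'):
    """Compare evaluator answers to hand labels on the same examples.

    Returns agreement, plus precision and recall for the `positive` label
    (here 'NO', i.e. the evaluator flagging a failure).
    """
    if not llm_answers or len(llm_answers) != len(human_labels):
        raise ValueError('Need a non-empty sample with one human label per answer')
    pairs = list(zip(llm_answers, human_labels))
    agree = sum(a == h for a, h in pairs)
    true_pos = sum(a == positive and h == positive for a, h in pairs)
    flagged = sum(a == positive for a, _ in pairs)
    actual = sum(h == positive for _, h in pairs)
    return {
        'agreement': agree / len(pairs),
        'precision': true_pos / flagged if flagged else float('nan'),
        'recall': true_pos / actual if actual else float('nan'),
    }

# Made-up labels, for illustration only
llm_answers = ['NO', 'YES', 'YES', 'NO', 'YES']
human_labels = ['NO', 'YES', 'NO', 'NO', 'YES']
audit = evaluator_audit(llm_answers, human_labels)
print(audit)
```

Precision tells us how many flagged failures are real, and recall tells us how many real failures the evaluator catches. The second number is easy to overlook when we only inspect the examples the evaluator flagged.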

**Use case: Extracting action points from an incoming email.**

Rather than using our existing inputs, we'll take this use case to illustrate a technique we haven't talked about yet.

Often, we can generate inputs **such that we can guarantee that a certain property is met**. For this use case, for example, we can generate emails with _known_ action points, and then check if the tool can extract _at least those_.
Here is a simple prompt to take a couple of action items and paraphrase them. We will take those paraphrases and embed them into emails.

In [30]:
action_items = ['''Don't forget to water the plants.''', '''Pick up some milk at the grocery store''']
ask_chatgpt(input=f'''Here are two sentences. For each of them, please give me 10 paraphrases. Then, please write 10 paraphrases of a sentence that combines the two requests.
SENTENCE1: {action_items[0]}
SENTENCE2: {action_items[1]}''', silent=False, temperature=1, max_tokens=1000, n=1)

Now, let's embed those paraphrases into emails, and get the tool to extract action items from them:

In [31]:
water_plants = '''Remember to give the plants some water.
The plants need to be watered, so make sure you don't forget.
Don't neglect to water the plants.
Make sure to give the plants some water before you leave.
The plants require water, so don't forget to water them.
Ensure that the plants receive some water.
Please don't forget about the plants and give them some water.
The plants can't survive without water - it's important not to forget to water them.
Watering the plants is necessary; don't forget!
Remember to keep the plants hydrated by watering them.'''.split('\n')
water_emails = [ask_chatgpt(input=f'''Please write an email (just the body, no subject or header) from a person to their partner (you can make up the names). The email can be about anything, but make sure to include the following sentence:
{sentence}''')['conversation'][0]['response'] for sentence in water_plants]

In [33]:
print(water_emails[0])

Hi Sarah,

I hope you're doing well. I just wanted to check in and see how your day is going. I miss you and can't wait to see you later tonight.

Also, I wanted to remind you to give the plants some water. They're looking a little droopy and I don't want them to die on us. 

Anyway, I love you and can't wait to spend some quality time together tonight.

See you soon,
John


In [34]:
instruction = 'Extract any action items that I may need to put in my TODO list'
responses = format_batch(instruction=instruction, source='EMAIL', batch=water_emails, silent=False)

These emails may have additional action items that are not related to watering the plants. However, this does not matter at all, as the property we're going to check is whether the tool extracts `watering the plants` as _one_ of the action items, not whether it is the only one:

In [35]:
watering = 'Does the text talk about watering the plants?'
print(watering)
out, explanations = classify(question=watering, input1=responses, explain_token='NO')
summary(out, explanations, watering, input1=water_emails, input2=responses, explain_token='NO', names=['Email', 'List'])

Does the text talk about watering the plants?
Failure rate: 40.0%
------
Email: Hey Sarah,

I hope you're doing well. I just wanted to check in and see how your day is going. I miss you and can't wait to see you later.

Also, I wanted to remind you that the plants need to be watered, so make sure you don't forget. I know it's easy to overlook, but they're looking a little droopy and could use some love.

Anyway, I hope you have a great rest of your day. Can't wait to catch up with you later.

Love,
John

List: There are no action items in the highlighted text.

Question: Does the text talk about watering the plants?
Answer: NO
Explanation: The text explicitly states that there are no action items in the highlighted text, and therefore there is no mention of watering plants.
------

Email: Hey Sarah,

I hope you're having a good day at work. I just wanted to remind you to give the plants some water before you leave. I know you're always in a rush in the mornings, but they really need it

The failure rate is very high for this one! Of course, if we were testing for real we would have a variety of embedded action items (rather than just this one example), and we would also check for other properties (e.g. whether the tool extracts _all_ action items, whether it extracts _only_ action items, etc.).
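Because we planted the action item ourselves, a cheap keyword test can complement the LLM-based check. The sketch below is only an illustration: `KNOWN_ITEMS`, `mentions_item`, and `recall_of_planted_item` are hypothetical names we made up, and keyword matching will miss paraphrases that avoid the keywords, so it gives a rough lower bound rather than a replacement for `classify`:

```python
import re

# Hypothetical mapping from each planted action item to keyword patterns
# that an extracted TODO list should mention (all patterns must match).
KNOWN_ITEMS = {
    'water the plants': [r'\bwater', r'\bplants?\b'],
    'pick up milk': [r'\bmilk\b'],
}

def mentions_item(extracted, keyword_patterns):
    """True if every keyword pattern appears somewhere in the extracted list."""
    return all(re.search(p, extracted, re.IGNORECASE) for p in keyword_patterns)

def recall_of_planted_item(extracted_lists, item):
    """Fraction of extracted lists that mention the planted action item."""
    if not extracted_lists:
        raise ValueError('Need at least one extracted list')
    patterns = KNOWN_ITEMS[item]
    hits = [mentions_item(text, patterns) for text in extracted_lists]
    return sum(hits) / len(hits)

# First list is made up; the second is the failing output shown above.
extracted_lists = [
    '- Water the plants before leaving',
    'There are no action items in the highlighted text.',
]
recall = recall_of_planted_item(extracted_lists, 'water the plants')
print('Recall of planted item: %.1f%%' % (100 * recall))
```

Disagreements between this check and the LLM evaluator are worth reading by hand, since they point to either a paraphrase the keywords missed or an evaluator mistake.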

**Metamorphic testing: robustness to instruction paraphrasing**  
Let's switch gears a bit, and illustrate how we might do metamorphic testing (for this, we'll go back to our 60 emails).

We will test the tool's robustness by paraphrasing the instruction and verifying if the output list has the same action items. Note that we are not testing whether the output is correct, but whether the model is consistent in light of paraphrased instructions (which in itself is an important property). Note also that this is a toy example in that we only have one paraphrase, and that we would have many paraphrases of many instructions in practice.

In [36]:
instruction = 'Extract any action items that I may need to put in my TODO list'
rephrased_instruction = 'List any action items in the email that I may want to put in a TODO list'
responses = format_batch(instruction=instruction, source='EMAIL', batch=emails, silent=False)
responses2 = format_batch(instruction=rephrased_instruction, source='EMAIL', batch=emails, silent=False)

In [37]:
same_items = 'Do both texts have the same meaning?'
print(same_items)
print('------------------------------------')
print()
out, explanations = classify(same_items, input1=responses, input2=responses2, explain_token='NO')
summary(out, explanations, same_items, input1=responses, input2=responses2, explain_token='NO', names=['List 1', 'List 2'])

Do both texts have the same meaning?
------------------------------------

Failure rate: 16.7%
------
List 1: 
- Add High Inventory System Wide OFO to TODO list.
- Check customer supply to be within the tolerance band percentage of daily usage to avoid OFO noncompliance charges.

List 2: Here are the action items in the email that you may want to put in a TODO list:
- Call the CGT Helpline 1-800-343-4743 if you have any questions.

Question: Do both texts have the same meaning?
Answer: NO
Explanation: The two texts have different content and purposes. Text1 is about inventory management and avoiding noncompliance charges, while Text2 is about calling a helpline for assistance. Therefore, the answer is NO.
------

List 1: TODO list:
- Continue work on siding and soffits
- Repaint and landscape
- Plan for food and drink
- Arrange car pools from the church

List 2: TODO list:
- Continue work on siding and soffits
- Repaint
- Landscaping

Question: Do both texts have the same meaning?
Answ

Again, our evaluator seems to be working fine on the examples we have. Unfortunately, the model has a reasonably high failure rate on this robustness test, extracting different action items when we paraphrase the instruction.
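A deterministic comparison can serve as a first pass before asking the LLM whether two lists match. The sketch below parses bullet lines and computes the Jaccard overlap between the two sets of items. The helper names are our own, and the exact-match rule is a simplifying assumption: it is deliberately strict, so a low score is a reason to look closer (or to call `classify`), not proof of an inconsistency:

```python
import re

def parse_items(todo_text):
    """Collect bullet lines ('- item'), normalizing case, punctuation and spacing."""
    items = set()
    for line in todo_text.splitlines():
        line = line.strip()
        if line.startswith('-'):
            item = re.sub(r'[^a-z0-9 ]', '', line[1:].lower())
            items.add(' '.join(item.split()))
    return items

def jaccard(list1, list2):
    """Jaccard overlap between the bullet items of two TODO lists."""
    a, b = parse_items(list1), parse_items(list2)
    if not a and not b:
        return 1.0  # two empty lists are trivially consistent
    return len(a & b) / len(a | b)

# The second failing pair from the output above
list1 = '''TODO list:
- Continue work on siding and soffits
- Repaint and landscape
- Plan for food and drink
- Arrange car pools from the church'''
list2 = '''TODO list:
- Continue work on siding and soffits
- Repaint
- Landscaping'''
overlap = jaccard(list1, list2)
print('Overlap: %.2f' % overlap)
```

Here only one of six distinct items matches exactly, even though "Repaint and landscape" and "Repaint" / "Landscaping" say the same thing. That is the kind of case where the LLM evaluator is still needed, while pairs with an overlap of 1.0 can skip it.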

Hopefully, all of the examples above illustrate how to test specific behaviors, by thinking of both the kinds of inputs that the user might provide, and the properties we want in the corresponding output.

### 3. Drill down on discovered bugs

Let's go back to our example of making a draft more concise, where we had a low error rate (8.3%). 
We can often find error patterns if we drill down into these errors. Here is a very simple prompt to do this, a quick-and-dirty emulation of [AdaTest](https://aclanthology.org/2022.acl-long.230.pdf), where the prompt and UI are far more optimized (we're just trying to illustrate the principle here):

In [42]:
# Get error inputs
instruction = 'Shorten the email by removing everything that is uneccessary. Make sure not to lose any important information.'
responses = format_batch(instruction=instruction, source='DRAFT', batch=emails, silent=False)
question1 = 'TEXT1 and TEXT2 are two potential versions of an email. Does TEXT2 communicate all of the important information in TEXT1?'
answers, explanations = classify(question1, emails, responses, silent=False, explain_token='NO')
errors = np.where(np.array(answers) == 'NO')[0]
error_inputs = [emails[i] for i in errors]

In [43]:
fails = '\n----\n'.join(['EMAIL: ' + x for x in error_inputs])
question = f'''I have a tool that takes an email and makes it more concise, without losing any important information.
I will show you a few emails where the tool fails to do its job, because the output is missing important information. Your goal is to try to come up with more emails that the tool would fail on.
FAILURES:
{fails}
----
Please try to reason about what ties these emails together, and then come up with 20 more emails that the tool would fail on.
Please use the same format as above, i.e. just the email body, no header or subject, and start each email with "EMAIL:".
'''
more = ask_chatgpt(input=question, silent=False, temperature=1, max_tokens=1000, n=1)

ChatGPT provided a hypothesis for what ties those emails together. Whether that hypothesis is right or wrong, we can see how the model does on these new examples:

In [44]:
more_emails = [x.strip() for x in more['conversation'][0]['response'].split('EMAIL:')[1:]]
responses = format_batch(instruction=instruction, source='DRAFT', batch=more_emails, silent=True)

In [45]:
more_concise = np.array([len(y) < len(x) for y, x in zip(responses, more_emails)])
print('More concise rate: %.1f%%' % (100 * np.mean(more_concise)))
print('Failed examples:')
for i in np.where(~more_concise)[0]:
    print(f'Email: {more_emails[i]}')
    print('Length: %d' % len(more_emails[i]))
    print(f'Response: {responses[i]}')
    print('Length: %d' % len(responses[i]))
    print('------')
    print()

More concise rate: 100.0%
Failed examples:


The model does shorten the emails, but let's check if it does so at the cost of losing information:

In [47]:
question1 = 'TEXT1 and TEXT2 are two potential versions of an email. Does TEXT2 communicate all of the important information in TEXT1?'
answers, explanations = classify(question1, more_emails, responses, silent=True, explain_token='NO')
summary(answers, explanations, question1, input1=more_emails, input2=responses, explain_token='NO', names=['Email', 'Shortened version'])

Failure rate: 23.5%
------
Email: Dear John,

I know we've been working on the Johnson account for a few weeks now, and I was wondering if you could give me an update on how things are progressing. Do you think we're on track to meet our deadline, or do we need to adjust our strategy?

Best,
Adam

Shortened version: Dear John, I was wondering if you could give me an update on how things are progressing. Do you think we're on track to meet our deadline, or do we need to adjust our strategy? Best, Adam.

Question: TEXT1 and TEXT2 are two potential versions of an email. Does TEXT2 communicate all of the important information in TEXT1?
Answer: NO
Explanation: The reason for my answer is NO because TEXT2 does not include the initial sentence that establishes the context of the email, which is important for effective communication.
------

Email: Hi Tom,

I owe you a huge debt of gratitude for the help you provided me with last week. I was really struggling with the project, and your insight

Notice how the failure rate is now much higher (23.5%, compared to 8.3% on the original emails).

It does seem like ChatGPT latched on to some kind of pattern. While we don't have enough data yet to know whether it is a real pattern or not, this illustrates the strategy of taking failures and getting an LLM to 'generate more'. We are very confident that this strategy works, because we have tried it in a lot of different scenarios, models, and applications (with AdaTest).
In real testing, we would keep iterating on this process until we found real patterns, would go back to the model (or in this case, the prompt) to fix the bugs, and then iterate again. But now it's time to wrap up this blog post.
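To make "not enough data" a bit more concrete, we can put rough confidence intervals around the two failure rates. The sketch below uses the Wilson score interval. The counts are our assumption, inferred from the reported percentages (5 of 60 for 8.3%, 4 of 17 for 23.5%), since the notebook output only shows the rates:

```python
import math

def wilson_interval(failures, n, z=1.96):
    """Approximate 95% Wilson score interval for a failure rate."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = failures / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumed counts, inferred from the reported rates (not shown in the output):
# 5/60 = 8.3% on the original emails, 4/17 = 23.5% on the generated ones.
original = wilson_interval(5, 60)
generated = wilson_interval(4, 17)
print('original:  %.3f to %.3f' % original)
print('generated: %.3f to %.3f' % generated)
print('intervals overlap:', original[1] >= generated[0])
```

With these assumed counts the two intervals overlap, which is consistent with the caution above: the jump in failure rate is suggestive, but more generated examples are needed before calling it a real pattern.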

## Conclusion
Here is a TL;DR of this whole post (not written by ChatGPT, we promise):
- **What we're saying:** We think it's a good idea to test LLMs just like we test software. Testing does not replace benchmarks, but complements them.  
- **How to test:** If you can't specify a single right answer, and / or you don't have a labeled dataset, specify **properties** of the output or of groups of outputs. You can often use the LLM itself to evaluate such properties with high accuracy, since _perception is easier than generation_.  
- **What to test:** Get the LLM to help you figure it out. Generate potential use cases and potential inputs, and then think of properties you can test. If you find bugs, get the LLM to drill down on them to find patterns you can later fix.

Now, it's obvious that the process is much less linear and straightforward than we described here -- it is not uncommon that testing a property leads to discovering a new use case you hadn't thought about, and maybe even makes you realize you have to redesign your tool in the first place. However, having a stylized process is still helpful, and the kinds of techniques we described here are very useful in practice. 

**Is it too much work?**  
Testing is certainly a laborious process (although using LLMs like we did above makes it _much easier_), but consider the alternatives. It is really hard to benchmark generation tasks with multiple right answers, and thus we often don't trust the benchmarks for these tasks. Collecting human judgments on the existing model's output can be even _more_ laborious, and does not transfer well when you iterate on the model (suddenly your labels are not as useful anymore). Not testing usually means you don't really know how your model behaves, which is a recipe for disaster. Testing, on the other hand, often leads to (1) finding bugs, (2) insight on the task itself, (3) discovering severe problems in the specification early, which allows for pivoting before it's too late. On balance, we think testing is time well spent.
