You can load [CheckList](https://github.com/marcotcr/checklist) suites as datasets and run additional models on them for comparison.  
This notebook has examples for running the checklists in the [CheckList paper](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf).

In [1]:
import nlp
from nlp import checklist as cl

# Sentiment Analysis

In [2]:
path = '../datasets/sentiment_checklist'

In [3]:
suite = cl.CheckListSuite(path)

`CheckListSuite` is a class with additional helper functions for running the CheckList.  
The data itself is in `suite.dataset`:

In [4]:
suite.dataset

Dataset(features: {'tweet': Value(dtype='string', id=None), 'test_name': Value(dtype='string', id=None), 'test_case': Value(dtype='int32', id=None), 'example_idx': Value(dtype='int32', id=None)}, num_rows: 87470)

### Running the CheckList

To run the CheckList, you have to add predictions to `suite.dataset`, following the row-by-row processing examples [here](https://huggingface.co/nlp/processing.html#processing-data-row-by-row).  
Here, we will load the predictions used in the CheckList paper, available [here](https://github.com/marcotcr/checklist/raw/master/release_data.tar.gz).  
Some CheckLists also require confidence scores. We can check what the right format is by looking at the suite description:

In [5]:
print(suite.dataset.info.description)

A CheckList for three-way sentiment analysis (negative, neutral, positive).
Predictions: should be integers, where:
  - 0: negative
  - 1: neutral
  - 2: positive
Confidences: should be list(float) of length 3, with prediction probabilities
for negative, neutral and positive (respectively)

Test names for Table 1 in the paper:
['neutral words in context', 'Sentiment-laden words in context', 'change neutral words with BERT', 'add positive phrases', 'add negative phrases', 'add random urls and handles', 'typos', 'change locations', 'change names', 'used to, but now', 'simple negations: not negative', 'simple negations: not neutral is still neutral', 'simple negations: I thought x was negative, but it was not (should be neutral or positive)', 'Hard: Negation of positive with neutral stuff in the middle (should be negative)', 'my opinion is what matters', 'Q & A: yes', 'Q & A: no']

Use with nlp.checklist.CheckListSuite
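
As a quick sanity check before adding predictions to the dataset, the format described above can be validated with a few lines of plain Python. This is only a sketch: the helper name `check_sentiment_row` is hypothetical and not part of `nlp` or CheckList, and requiring the confidences to sum to 1 is an assumption, since the description only says they are prediction probabilities.

```python
import math

# Label ids as listed in the suite description
LABELS = {0: 'negative', 1: 'neutral', 2: 'positive'}

def check_sentiment_row(pred, conf, tol=1e-6):
    """Return True if (pred, conf) matches the format in the suite description.

    Hypothetical helper for illustration; not part of nlp or CheckList.
    """
    # The prediction must be one of the label ids 0, 1, 2
    if pred not in LABELS:
        return False
    # Confidences: exactly three values, each a probability in [0, 1]
    if len(conf) != 3 or any(not 0.0 <= c <= 1.0 for c in conf):
        return False
    # Assumption: the three probabilities should sum to (approximately) 1
    return math.isclose(sum(conf), 1.0, abs_tol=tol)

print(check_sentiment_row(2, [0.1, 0.2, 0.7]))  # well-formed row
print(check_sentiment_row(3, [0.1, 0.2, 0.7]))  # label id out of range
print(check_sentiment_row(1, [0.5, 0.5]))       # wrong number of confidences
```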



We load the predictions and confidences and add them to separate keys in `suite.dataset`:

In [6]:
import os
prediction_path = '/home/marcotcr/work/checklist/release_data/sentiment/predictions/'
models = ['microsoft', 'google', 'amazon', 'bert', 'roberta']
for model in models:
    print('Loading %s predictions' % model)
    preds = open(os.path.join(prediction_path, model)).read().splitlines()
    confs = [list(map(float, (x.split()[1:]))) for x in preds]
    preds = [int(x.split()[0]) for x in preds]
    conf_key = '%s_conf' % model
    suite.dataset = suite.dataset.map(lambda _, idx: {model: preds[idx], conf_key: confs[idx]}, with_indices=True)

Loading microsoft predictions
Loading google predictions
Loading amazon predictions
Loading bert predictions
Loading roberta predictions
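
The parsing in the loop above implies one example per line, with the predicted label first and the confidence scores after it, separated by whitespace. A minimal sketch of that parsing on made-up lines (the sample values are illustrative and not taken from the release data; `parse_prediction_line` is a hypothetical helper):

```python
def parse_prediction_line(line):
    """Split 'label conf_0 conf_1 ...' into (int label, list of float confidences)."""
    fields = line.split()
    return int(fields[0]), [float(x) for x in fields[1:]]

# Made-up lines in the assumed format: label, then one probability per class
sample_lines = ['2 0.05 0.15 0.80', '0 0.90 0.07 0.03']
parsed = [parse_prediction_line(line) for line in sample_lines]
for label, confs in parsed:
    print(label, confs)
```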


After loading the predictions, we call `suite.compute` to compute test results for each model.  
Its arguments are the prediction key and the confidence key, i.e. the names of the dataset columns that hold a model's predictions and confidences.  

In [7]:
for model in models:
    print('Running tests for %s' % model)
    conf_key = '%s_conf' % model
    suite.compute(model, conf_key)

Running tests for microsoft
Running tests for google
Running tests for amazon
Running tests for bert
Running tests for roberta

The result for every test example is saved under the key `fail`, a dictionary that maps each model name to whether that model failed the example. For example, here is a test example where we expect the prediction to be neutral:

In [8]:
example = suite.dataset[800]
print(example['tweet'])
labels = ['Negative', 'Neutral', 'Positive']
for model in example['fail']:
    print('%-12s Prediction: %-10s Failed test: %s' % (model, labels[example[model]], 'Yes' if example['fail'][model] else 'No' ))

This was an Australian customer service.
amazon       Prediction: Neutral    Failed test: No
bert         Prediction: Positive   Failed test: Yes
google       Prediction: Neutral    Failed test: No
microsoft    Prediction: Neutral    Failed test: No
roberta      Prediction: Negative   Failed test: Yes


The failure rates for each model are saved in `suite.fail_rate`, indexed by test name.  
There are many tests in this suite, but let's say we want to replicate Table 1 in the [CheckList paper](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf):

In [9]:
checklist_table1 =  [
'neutral words in context',
'Sentiment-laden words in context',
'change neutral words with BERT',
'add positive phrases',
'add negative phrases',
'add random urls and handles',
'typos',
'change locations',
'change names',
'used to, but now',
'simple negations: not negative',
'simple negations: not neutral is still neutral',
'simple negations: I thought x was negative, but it was not (should be neutral or positive)',
'Hard: Negation of positive with neutral stuff in the middle (should be negative)',
'my opinion is what matters',
'Q & A: yes',
'Q & A: no',
]

In [10]:
print (' '.join([x[:5] for x in models]))
for t in checklist_table1:
    r = ' '.join(['%5.1f' % (suite.fail_rate[m][t]) for m in models])
    print('%s %s' % (r, t))


micro googl amazo bert rober
  0.0   7.6   4.8  94.6  81.8 neutral words in context
  4.0  15.0   2.8   0.0   0.2 Sentiment-laden words in context
  9.4  16.2  12.4  10.2  10.2 change neutral words with BERT
 12.6  12.4   1.4   0.2  10.2 add positive phrases
  0.8  34.6   5.0   0.0  13.2 add negative phrases
  9.6  13.4  24.8  11.4   7.4 add random urls and handles
  5.6  10.2  10.4   5.2   3.8 typos
  7.0  20.8  14.8   7.6   6.4 change locations
  2.4  15.1   9.1   6.6   2.4 change names
 41.0  36.6  42.2  18.8  11.0 used to, but now
 18.8  54.2  29.4  13.2   2.6 simple negations: not negative
 40.4  39.6  74.2  98.4  95.4 simple negations: not neutral is still neutral
100.0  90.4 100.0  84.8   7.2 simple negations: I thought x was negative, but it was not (should be neutral or positive)
 98.4 100.0 100.0  74.0  30.2 Hard: Negation of positive with neutral stuff in the middle (should be negative)
 45.4  62.4  68.0  38.8  30.0 my opinion is what matters
  9.0  57.6  20.8   3.6   3.0 Q 
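
To make the numbers above concrete, here is a rough sketch of how a per-test failure rate could be derived from per-example `fail` flags. It assumes that a test case counts as failed when any of its examples fails, and that the rate is the percentage of failed test cases. This is an illustration in plain Python, not the actual `suite.compute` implementation, and the rows are made up.

```python
from collections import defaultdict

def failure_rates(rows, model):
    """Percentage of failed test cases per test name for one model (illustrative)."""
    # (test_name, test_case) -> True if any example of that test case failed
    case_failed = defaultdict(bool)
    for row in rows:
        key = (row['test_name'], row['test_case'])
        case_failed[key] = case_failed[key] or bool(row['fail'][model])
    # Count test cases and failed test cases per test name
    totals, fails = defaultdict(int), defaultdict(int)
    for (test_name, _), failed in case_failed.items():
        totals[test_name] += 1
        fails[test_name] += failed
    return {name: 100.0 * fails[name] / totals[name] for name in totals}

# Made-up rows using the same keys as suite.dataset after suite.compute
rows = [
    {'test_name': 'typos', 'test_case': 0, 'fail': {'bert': 0}},
    {'test_name': 'typos', 'test_case': 0, 'fail': {'bert': 1}},
    {'test_name': 'typos', 'test_case': 1, 'fail': {'bert': 0}},
    {'test_name': 'change names', 'test_case': 0, 'fail': {'bert': 0}},
]
rates = failure_rates(rows, 'bert')
for name, rate in rates.items():
    print('%5.1f %s' % (rate, name))
```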

Now let's say we want to compare the default sentiment-analysis pipeline from `transformers` to these models.  

In [11]:
from transformers import pipeline
model = pipeline("sentiment-analysis", device=0)

This test suite assumes the labels are [negative, neutral, positive], while the pipeline only predicts binary sentiment, so we have to convert its output to the three-way format:

In [12]:
import numpy as np
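def pred_and_conf(data):
    # Convert binary pipeline scores into 3-way [negative, neutral, positive] confidences.
    # Scores in the (0.33, 0.67) band around 0.5 are predicted as neutral.
    preds = model(data)
    # pr: probability of the POSITIVE class for each input
    pr = np.array([x['score'] if x['label'] == 'POSITIVE' else 1 - x['score'] for x in preds])
    pp = np.zeros((pr.shape[0], 3))
    margin_neutral = 1/3.  # total width of the neutral band
    mn = margin_neutral / 2.  # half-width on each side of 0.5
    # Outside the band, keep the binary probabilities; neutral stays at 0
    neg = pr < 0.5 - mn
    pp[neg, 0] = 1 - pr[neg]
    pp[neg, 2] = pr[neg]
    pos = pr > 0.5 + mn
    pp[pos, 0] = 1 - pr[pos]
    pp[pos, 2] = pr[pos]
    # Inside the band, neutral confidence falls linearly from 1 (at 0.5) to 0.5 (at the edges);
    # the remaining mass goes to whichever side of 0.5 the score is on
    neutral_pos = (pr >= 0.5) * (pr < 0.5 + mn)
    pp[neutral_pos, 1] = 1 - (1 / margin_neutral) * np.abs(pr[neutral_pos] - 0.5)
    pp[neutral_pos, 2] = 1 - pp[neutral_pos, 1]
    neutral_neg = (pr < 0.5) * (pr > 0.5 - mn)
    pp[neutral_neg, 1] = 1 - (1 / margin_neutral) * np.abs(pr[neutral_neg] - 0.5)
    pp[neutral_neg, 0] = 1 - pp[neutral_neg, 1]
    # Predicted label is the class with the highest confidence
    preds = np.argmax(pp, axis=1)
    return preds, pp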
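
To see what this conversion does to individual scores, here is a scalar sketch of the same mapping in plain Python. The function name `three_way` is hypothetical, and unlike the vectorized version above this sketch treats scores exactly on the band edges as neutral.

```python
def three_way(p_pos, margin_neutral=1 / 3):
    """Map P(positive) from a binary classifier to (label, [neg, neu, pos])."""
    half = margin_neutral / 2
    if p_pos < 0.5 - half or p_pos > 0.5 + half:
        # Confident region: keep the binary probabilities, neutral gets 0
        probs = [1 - p_pos, 0.0, p_pos]
    else:
        # Neutral band: neutral mass shrinks linearly from 1.0 at p_pos = 0.5
        # to 0.5 at the band edges; the rest goes to the nearer class
        neutral = 1 - (1 / margin_neutral) * abs(p_pos - 0.5)
        if p_pos >= 0.5:
            probs = [0.0, neutral, 1 - neutral]
        else:
            probs = [1 - neutral, neutral, 0.0]
    # Index of the largest probability (first one wins on ties)
    label = max(range(3), key=lambda i: probs[i])
    return label, probs

for p in (0.1, 0.4, 0.5, 0.6, 0.9):
    label, probs = three_way(p)
    print(p, label, [round(x, 2) for x in probs])
```

Scores of 0.4 and 0.6 both land in the neutral band, which is why a binary model can still produce the neutral predictions this suite expects.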

Add predictions to `suite.dataset`:

In [13]:
def add_pipeline(x):
    preds, confs = pred_and_conf(x['tweet'])
    return {'hf_pipeline': preds, 'hf_pipeline_conf': confs}
suite.dataset = suite.dataset.map(add_pipeline , batched=True)


In [14]:
suite.compute('hf_pipeline', 'hf_pipeline_conf')


In [15]:
models.append('hf_pipeline')

In [16]:
print (' '.join([x[:5] for x in models]))
for t in checklist_table1:
    r = ' '.join(['%5.1f' % (suite.fail_rate[m][t]) for m in models])
    print('%s %s' % (r, t))

micro googl amazo bert rober hf_pi
  0.0   7.6   4.8  94.6  81.8  95.8 neutral words in context
  4.0  15.0   2.8   0.0   0.2   0.8 Sentiment-laden words in context
  9.4  16.2  12.4  10.2  10.2   9.8 change neutral words with BERT
 12.6  12.4   1.4   0.2  10.2   0.0 add positive phrases
  0.8  34.6   5.0   0.0  13.2   6.8 add negative phrases
  9.6  13.4  24.8  11.4   7.4  15.4 add random urls and handles
  5.6  10.2  10.4   5.2   3.8   6.6 typos
  7.0  20.8  14.8   7.6   6.4  10.0 change locations
  2.4  15.1   9.1   6.6   2.4   5.1 change names
 41.0  36.6  42.2  18.8  11.0  32.6 used to, but now
 18.8  54.2  29.4  13.2   2.6  12.8 simple negations: not negative
 40.4  39.6  74.2  98.4  95.4  97.4 simple negations: not neutral is still neutral
100.0  90.4 100.0  84.8   7.2 100.0 simple negations: I thought x was negative, but it was not (should be neutral or positive)
 98.4 100.0 100.0  74.0  30.2  86.8 Hard: Negation of positive with neutral stuff in the middle (should be negative)

### Using marcotcr/checklist

You can access the underlying suite object from [`marcotcr/checklist`](https://github.com/marcotcr/checklist) through `suite.suite` (e.g. for visualizations), with the caveat that it does not support model comparison: it only keeps the state of the last model we called `compute` on.  
For example:

In [17]:
type(suite.suite)

checklist.test_suite.TestSuite

In [18]:
suite.suite.tests['Sentiment-laden words in context'].summary()

Test cases:      8658
Test cases run:  500
Fails (rate):    4 (0.8%)

Example fails:
0.1 0.0 0.9 This food was average.
----
0.0 0.0 1.0 The flight is average.
----
0.0 0.0 1.0 That was a weird food.
----


In [19]:
suite.suite.summary() # will only display results for the last model we called 'compute' on

Vocabulary

single positive words
Test cases:      34
Fails (rate):    0 (0.0%)


single negative words
Test cases:      35
Fails (rate):    1 (2.9%)

Example fails:
0.3 0.0 0.7 average
----


single neutral words
Test cases:      13
Fails (rate):    13 (100.0%)

Example fails:
0.0 0.0 1.0 see
----
1.0 0.0 0.0 commercial
----
0.0 0.0 1.0 private
----


Sentiment-laden words in context
Test cases:      8658
Test cases run:  500
Fails (rate):    4 (0.8%)

Example fails:
0.0 0.0 1.0 That was a weird food.
----
0.1 0.0 0.9 This was an average staff.
----
0.1 0.0 0.9 This food was average.
----


neutral words in context
Test cases:      1716
Test cases run:  500
Fails (rate):    479 (95.8%)

Example fails:
0.0 0.0 1.0 The crew was Israeli.
----
1.0 0.0 0.0 The staff was private.
----
0.0 0.0 1.0 That is an Australian service.
----


intensifiers
Test cases:      2000
Test cases run:  500
After filtering: 496 (99.2%)
Fails (rate):    8 (1.6%)

Example fails:
0.9 0.0 0.1 This is a creepy ser

In [20]:
suite.suite.visual_summary_table() # displays a visualization of the whole table

Please wait as we prepare the table data...



### Slicing tests

The suite has a few additional functions to help with slicing. For example, to see example instances from a certain test:

In [21]:
simple = suite.get_test('Sentiment-laden words in context')


In [22]:
simple['tweet'][:10]

['I valued the flight.',
 'That is a sad customer service.',
 'We like the flight.',
 'This was a nice crew.',
 'We abhor that flight.',
 'This staff is difficult.',
 'I valued that aircraft.',
 'This was a fantastic flight.',
 'I hate the food.',
 'I despised the food.']

We can then use this as we would any `nlp` dataset.
For example, let's keep only the examples where Google fails and the Hugging Face pipeline does not:

In [23]:
google_fails_hf_doesnt = simple.filter(lambda x:x['fail']['hf_pipeline'] == 0 and x['fail']['google'] == 1)


In [24]:
for x in np.random.choice(google_fails_hf_doesnt.shape[0], 5):
    x = google_fails_hf_doesnt[int(x)]
    print('%-30s HF:%-8s Google:%-8s' % (x['tweet'], labels[x['hf_pipeline']], labels[x['google']]))

This seat was lame.            HF:Negative Google:Neutral 
I dread that service.          HF:Negative Google:Neutral 
That was a hard pilot.         HF:Negative Google:Neutral 
The seat is hard.              HF:Negative Google:Neutral 
That is a sad airline.         HF:Negative Google:Neutral 
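
The filter predicate used above generalizes to any pair of models. A small sketch on plain dictionaries shaped like rows of `suite.dataset` after `suite.compute` (the helper `fails_only_for` and the rows are made up for illustration):

```python
def fails_only_for(row, failing_model, passing_model):
    """True if failing_model fails on this row while passing_model passes."""
    fail = row['fail']
    return bool(fail[failing_model]) and not fail[passing_model]

# Made-up rows with the same 'fail' structure as the dataset rows
rows = [
    {'tweet': 'example A', 'fail': {'google': 1, 'hf_pipeline': 0}},
    {'tweet': 'example B', 'fail': {'google': 1, 'hf_pipeline': 1}},
    {'tweet': 'example C', 'fail': {'google': 0, 'hf_pipeline': 0}},
]
selected = [r['tweet'] for r in rows if fails_only_for(r, 'google', 'hf_pipeline')]
print(selected)
```

The same function could be passed to the dataset's `filter` inside a lambda, in place of the inline condition.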


Perturbation tests combine multiple examples, so looking at a single row in the dataset would not give us a good picture.  
Instead, we want to aggregate each test case into a single row that holds all of its examples (they go into the `data` key):

In [25]:
perturbation = suite.get_test('change locations', aggregate_testcases=True)


In [26]:
any_fails = lambda x, key: any([y['fail'][key] for y in x['data']])
google_fails_hf_doesnt = perturbation.filter(lambda x: not any_fails(x, 'hf_pipeline') and any_fails(x, 'google'))


This is an INV test, so a failure is a change in prediction. Some examples where HF maintains the invariance after a location change, while Google changes its prediction:

In [27]:
for x in np.random.choice(google_fails_hf_doesnt.shape[0], 5):
    x = google_fails_hf_doesnt[int(x)]
    orig = x['data'][0]
    fail = [y for y in x['data'] if y['fail']['google']][0]
    print('HF:%-8s Google:%-8s | %s' % (labels[orig['hf_pipeline']], labels[orig['google']], orig['tweet']))
    print('HF:%-8s Google:%-8s | %s' % (labels[fail['hf_pipeline']], labels[fail['google']], fail['tweet']))
    print()
    print()

HF:Negative Google:Negative | @united you better hold my flight to Tucson #5237, just landed in Houston after an hour delay for some minor computer problem
HF:Negative Google:Neutral  | @united you better hold my flight to Santa Maria #5237, just landed in Houston after an hour delay for some minor computer problem


HF:Negative Google:Neutral  | @USAirways JUST LANDED flight 545. Any chance of making flight 5530 Phoenix to AUS
HF:Negative Google:Positive | @USAirways JUST LANDED flight 545. Any chance of making flight 5530 Newport Beach to AUS


HF:Positive Google:Neutral  | @SouthwestAir thanks for the ride to Chicago. #kmdw #b738 http://t.co/6cpYPGFnD6
HF:Positive Google:Positive | @SouthwestAir thanks for the ride to Palm Beach Gardens. #kmdw #b738 http://t.co/6cpYPGFnD6


HF:Negative Google:Neutral  | @SouthwestAir could u put one here in Baltimore? http://t.co/vLCI2KV1IP
HF:Negative Google:Positive | @SouthwestAir could u put one here in Bell Gardens? http://t.co/vLCI2KV1IP


HF:
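
The aggregation and the invariance check used above can be sketched in plain Python. This is a simplified illustration, not the CheckList implementation: it assumes that the first example of each test case is the unperturbed original (as in `x['data'][0]` above) and that a failure is any change in the predicted label. The helper names and rows are made up.

```python
from collections import defaultdict

def group_by_test_case(rows):
    """Collect rows sharing a test_case id, preserving their original order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row['test_case']].append(row)
    return dict(groups)

def inv_fails(examples, key):
    """Simplified INV check: any perturbed prediction differs from the original's."""
    original = examples[0][key]  # assumption: first example is the unperturbed one
    return any(ex[key] != original for ex in examples[1:])

# Made-up rows: test case 0 keeps its prediction, test case 1 changes it
rows = [
    {'test_case': 0, 'tweet': 'flight to A', 'google': 0},
    {'test_case': 0, 'tweet': 'flight to B', 'google': 0},
    {'test_case': 1, 'tweet': 'ride to C', 'google': 1},
    {'test_case': 1, 'tweet': 'ride to D', 'google': 2},
]
groups = group_by_test_case(rows)
results = {case: inv_fails(examples, 'google') for case, examples in groups.items()}
print(results)
```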

# Quora Question Pairs (QQP)

The process for QQP and SQuAD is the same as for sentiment, but we will run through them for completeness.

In [28]:
path = '../datasets/qqp_checklist'

In [29]:
suite = cl.CheckListSuite(path)

In [30]:
print(suite.dataset.description)

A CheckList for Quora Question Pair.
Predictions: should be integers, where:
  - 0: non-duplicate
  - 1: duplicate
Confidences: should be list(float) of length 2, with prediction probabilities
for non-duplicate and duplicate(respectively)

Test names for Table 2 in the paper:
['Modifier: adj',  'How can I become more {synonym}?', 'Replace synonyms in real pairs', 'How can I become more X = How can I become less antonym(X)', 'add one typo', '(q, paraphrase(q))',  'Change same name in both questions',  'Change first and last name in one of the questions', 'Keep entitites, fill in with gibberish', 'Is person X != Did person use to be X',   'Is it {ok, dangerous, ...} to {smoke, rest, ...} after != before', "What was person's life before becoming X != What was person's life after becoming X", 'How can I become a X person != How can I become a person who is not X', 'How can I become a X person == How can I become a person who is not antonym(X)', 'Simple coref: he and she', 'Simple coref: hi

Loading predictions and confidences:

In [31]:
models = ['bert', 'roberta']
prediction_path = '/home/marcotcr/work/checklist/release_data/qqp/predictions/'
for model in models:
    preds = list(map(float, open(os.path.join(prediction_path, model)).read().splitlines()))
    confs = [[1 - x, x] for x in preds]
    preds = [int(x >= 0.5) for x in preds]
    conf_key = '%s_conf' % model
    suite.dataset = suite.dataset.map(lambda _, idx: {model: preds[idx], conf_key: confs[idx]}, with_indices=True)
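
The loop above reads one float per line and treats it as the probability that the pair is a duplicate. The conversion to the prediction and confidence format can be sketched for a single score as follows (the helper name `duplicate_score_to_row` and the scores are made up for illustration):

```python
def duplicate_score_to_row(p_duplicate, threshold=0.5):
    """Turn P(duplicate) into (prediction, [P(non-duplicate), P(duplicate)])."""
    prediction = int(p_duplicate >= threshold)
    return prediction, [1 - p_duplicate, p_duplicate]

for score in (0.25, 0.5, 0.75):
    print(score, duplicate_score_to_row(score))
```

A score of exactly 0.5 is labelled as duplicate, matching the `>=` comparison in the loop.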

Running the tests:

In [32]:
for model in models:
    conf_key = '%s_conf' % model
    suite.compute(model, conf_key)


In [33]:
checklist_table2 = ['Modifier: adj',  'How can I become more {synonym}?', 'Replace synonyms in real pairs', 'How can I become more X = How can I become less antonym(X)', 'add one typo', '(q, paraphrase(q))',  'Change same name in both questions',  'Change first and last name in one of the questions', 'Keep entitites, fill in with gibberish', 'Is person X != Did person use to be X',   'Is it {ok, dangerous, ...} to {smoke, rest, ...} after != before', "What was person's life before becoming X != What was person's life after becoming X", 'How can I become a X person != How can I become a person who is not X', 'How can I become a X person == How can I become a person who is not antonym(X)', 'Simple coref: he and she', 'Simple coref: his and her',  'Order does not matter for comparison', 'Order does not matter for symmetric relations', 'Order does matter for asymmetric relations',  'traditional SRL: active / passive swap with people', 'traditional SRL: wrong active / passive swap with people', 'Symmetry: f(a, b) = f(b, a)', 'Testing implications']

In [34]:
print(' '.join(models))
for t in checklist_table2:
    r = ' '.join(['%5.1f' % (suite.fail_rate[m][t]) for m in models])
    print('%s %s' % (r, t))

bert roberta
 78.4  78.0 Modifier: adj
 22.8  39.2 How can I become more {synonym}?
 13.1  12.7 Replace synonyms in real pairs
 69.4 100.0 How can I become more X = How can I become less antonym(X)
 18.2  12.0 add one typo
 69.0  25.0 (q, paraphrase(q))
 11.8   9.4 Change same name in both questions
 35.1  30.1 Change first and last name in one of the questions
 30.0  32.8 Keep entitites, fill in with gibberish
 61.8  96.8 Is person X != Did person use to be X
 98.0  34.4 Is it {ok, dangerous, ...} to {smoke, rest, ...} after != before
100.0   0.0 What was person's life before becoming X != What was person's life after becoming X
 18.6   0.0 How can I become a X person != How can I become a person who is not X
 81.6  88.6 How can I become a X person == How can I become a person who is not antonym(X)
 79.0  96.6 Simple coref: he and she
 99.6 100.0 Simple coref: his and her
 99.6 100.0 Order does not matter for comparison
 81.8 100.0 Order does not matter for symmetric relations
 71.4 1

Summary with examples (for RoBERTa only, since it was the last model passed to `compute`):

In [35]:
suite.suite.summary()

Vocabulary

Modifier: adj
Test cases:      1000
Test cases run:  500
Fails (rate):    390 (78.0%)

Example fails:
0.8 ('Is Kayla Bennett an economist?', 'Is Kayla Bennett an average economist?')
----
0.8 ('Is Laura Morales an educator?', 'Is Laura Morales an acomplished educator?')
----
0.9 ('Is Kimberly Hill a historian?', 'Is Kimberly Hill an accredited historian?')
----


different adjectives
Test cases:      954
Test cases run:  500
Fails (rate):    0 (0.0%)


Different animals
Test cases:      928
Test cases run:  500
Fails (rate):    0 (0.0%)


Irrelevant modifiers - animals
Test cases:      1000
Test cases run:  500
Fails (rate):    0 (0.0%)


Irrelevant modifiers - people
Test cases:      987
Test cases run:  500
Fails (rate):    0 (0.0%)


Irrelevant preamble with different examples.
Test cases:      938
Test cases run:  500
Fails (rate):    498 (99.6%)

Example fails:
0.0 ('My pet fish eats rice. Is it normal for animals to eat rice?', 'My pet rat eats rice. Is it normal for 

1.0 ('If Benjamin and Natalie were married, would her family be happy?', "If Benjamin and Natalie were married, would Benjamin's family be happy?")
----
1.0 ('If Anthony and Sophia were married, would his family be happy?', "If Anthony and Sophia were married, would Sophia's family be happy?")
----
1.0 ('If Jackson and Leah were married, would her family be happy?', "If Jackson and Leah were married, would Jackson's family be happy?")
----




SRL

Who do X think - Who is the ... according to X
Test cases:      1000
Test cases run:  500
Fails (rate):    2 (0.4%)

Example fails:
0.5 ('Who do people think is the greatest person in the world?', 'Who is the greatest person in the world according to people?')
----
0.5 ('Who do readers think is the premier magician in the world?', 'Who is the premier magician in the world according to readers?')
----


Order does not matter for comparison
Test cases:      990
Test cases run:  500
Fails (rate):    500 (100.0%)

Example fails:
0.0 ('Are beans 

# SQuAD

In [46]:
path = '../datasets/squad_checklist'

In [47]:
suite = cl.CheckListSuite(path)

In [48]:
print(suite.dataset.info.description)

A CheckList for SQuAD.
Predictions: each prediction is a string, containing the answer
Confidences: not necessary for this checklist

Test names for Table 3 in the paper:
['A is COMP than B. Who is more / less COMP?', 'Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?', 'size, shape, age, color', 'Profession vs nationality', 'Animal vs Vehicle v2', 'A is COMP than B. Who is antonym(COMP)? B', 'A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.', 'Question typo', 'Add random sentence to context', 'There was a change in profession', 'Understanding before / after -> first / last.', 'Negation in context, may or may not be in question', 'Negation in question only.', 'M/F failure rates should be similar for different professions', 'Basic coref, he / she', 'Basic coref, his / her', 'Former / Latter', 'Agent / object distinction', 'Agent / object distinction with 3 agents']

Use with nlp.checklist.CheckListSuit

Loading predictions:

In [49]:
prediction_path = '/home/marcotcr/work/checklist/release_data/squad/predictions/'
bert_preds = open(os.path.join(prediction_path, 'bert')).read().splitlines()

In [40]:
suite.dataset = suite.dataset.map(lambda _, idx: {'bert': bert_preds[idx]}, with_indices=True)

These tests don't require confidence scores, so we pass only the prediction key to `compute`:

In [41]:
suite.compute('bert')


In [42]:
checklist_table3 = [
    'A is COMP than B. Who is more / less COMP?',
    'Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?',
    'size, shape, age, color',
    'Profession vs nationality',
    'Animal vs Vehicle v2',
    'A is COMP than B. Who is antonym(COMP)? B',
    'A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.',
    'Question typo',
    'Add random sentence to context',
    'There was a change in profession',
    'Understanding before / after -> first / last.',
    'Negation in context, may or may not be in question',
    'Negation in question only.',
    'M/F failure rates should be similar for different professions',
    'Basic coref, he / she',
    'Basic coref, his / her',
    'Former / Latter',
    'Agent / object distinction',
    'Agent / object distinction with 3 agents'
]

In [43]:
for t in checklist_table3:
    print('%5.1f %s' % (suite.fail_rate['bert'][t], t))

 20.0 A is COMP than B. Who is more / less COMP?
 91.3 Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
 82.4 size, shape, age, color
 49.4 Profession vs nationality
 26.2 Animal vs Vehicle v2
 67.3 A is COMP than B. Who is antonym(COMP)? B
100.0 A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.
 11.6 Question typo
  9.8 Add random sentence to context
 41.5 There was a change in profession
 82.9 Understanding before / after -> first / last.
 67.5 Negation in context, may or may not be in question
100.0 Negation in question only.
 46.2 M/F failure rates should be similar for different professions
100.0 Basic coref, he / she
 91.8 Basic coref, his / her
100.0 Former / Latter
 60.8 Agent / object distinction
 95.7 Agent / object distinction with 3 agents


In [44]:
suite.suite.summary()

Vocabulary

A is COMP than B. Who is more / less COMP?
Test cases:      494
Fails (rate):    99 (20.0%)

Example fails:
C: Dylan is tougher than Sarah.
Q: Who is less tough?
A: Sarah
P: Dylan


----
C: Adam is smaller than Amy.
Q: Who is less small?
A: Amy
P: Adam


----
C: Kelly is older than Isabella.
Q: Who is less old?
A: Isabella
P: Kelly


----


Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
Test cases:      497
Fails (rate):    454 (91.3%)

Example fails:
C: Jose is really particular about the project. Jessica is particular about the project.
Q: Who is most particular about the project?
A: Jose
P: Jessica

C: Jessica is particular about the project. Jose is really particular about the project.
Q: Who is most particular about the project?
A: Jose
P: Jessica

C: Jessica is particular about the project. Jose is really particular about the project.
Q: Who is least particular about the project?
A: Jessica
P: Jose


----
C: Ethan is pleased about the proje