In [60]:
%load_ext autoreload
%autoreload 2

In [None]:
from nlp import checklist as cl

# Sentiment Analysis

In [23]:
suite_path = '/home/marcotcr/work/checklist/release_data/sentiment/sentiment_suite.pkl'

We have to tell `nlp` how to turn each example in a CheckList suite into a dictionary of features, which we do in the second argument.  
I plan to store this function in the pickle files for my own test suites, which would make the argument optional.

In [240]:
suite = cl.CheckListSuite(suite_path, lambda x: {'tweet': x })
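
As a rough illustration of what that second argument does, here is a minimal sketch in plain Python with no `nlp` dependency. The helper name `to_features` is hypothetical; the only thing taken from the cell above is the `{'tweet': x}` shape.

```python
# Hypothetical stand-in for the conversion function passed to CheckListSuite.
def to_features(example):
    """Wrap a raw CheckList example (a plain string here) in a feature dict."""
    return {'tweet': example}

raw_examples = ['I valued the flight.', 'I hate the food.']
rows = [to_features(x) for x in raw_examples]
print(rows[0])
```

Each raw example becomes one row keyed by feature name, which is the shape `suite.dataset` is assumed to work with.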

Loading predictions from other models (which I have saved).  
I'm assuming that people would want to add predictions to `suite.dataset`, similar to the [row-by-row processing examples](https://huggingface.co/nlp/processing.html#processing-data-row-by-row) in the `nlp` documentation.

In [50]:
models = ['microsoft', 'google', 'amazon', 'bert', 'roberta']
for model in models:
    # Each line holds the predicted label followed by the confidence scores
    path = '/home/marcotcr/work/checklist/release_data/sentiment/predictions/%s' % model
    with open(path) as f:
        lines = f.read().splitlines()
    confs = [list(map(float, x.split()[1:])) for x in lines]
    preds = [int(x.split()[0]) for x in lines]
    conf_key = '%s_conf' % model
    suite.dataset = suite.dataset.map(lambda _, idx: {model: preds[idx], conf_key: confs[idx]}, with_indices=True)
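
The loop above implies that each line of a prediction file holds the predicted label followed by the confidence scores (presumably one per class). A minimal sketch of that parsing on made-up lines; the helper name `parse_prediction_line` and the sample values are illustrative, not taken from the actual files:

```python
def parse_prediction_line(line):
    """Split 'label conf_0 conf_1 ...' into (int label, list of float confidences)."""
    fields = line.split()
    return int(fields[0]), [float(v) for v in fields[1:]]

# Invented sample lines in the assumed format
sample_lines = ['2 0.1 0.2 0.7', '0 0.8 0.1 0.1']
parsed = [parse_prediction_line(line) for line in sample_lines]
labels = [label for label, _ in parsed]
confidences = [conf for _, conf in parsed]
print(labels, confidences)
```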

After loading the predictions, we call `suite.compute` to compute test results for each model.  
The second argument is optional, but many tests depend on having a confidence score to check for monotonicity, etc.

In [111]:
for model in models:
    conf_key = '%s_conf' % model
    suite.compute(model, conf_key)

There are more tests in this suite than I had in the paper, but let's pretend I want to replicate Table 1 of the [checklist paper](https://homes.cs.washington.edu/~marcotcr/acl20_checklist.pdf).  
To do that, I would look at `suite.fail_rate` (I can rename this to `suite.results`, or anything else that reads better).

In [242]:
checklist_table1 =  [
'neutral words in context',
'Sentiment-laden words in context',
'change neutral words with BERT',
'add positive phrases',
'add negative phrases',
'add random urls and handles',
'typos',
'change locations',
'change names',
'used to, but now',
'simple negations: not negative',
'simple negations: not neutral is still neutral',
'simple negations: I thought x was negative, but it was not (should be neutral or positive)',
'Hard: Negation of positive with neutral stuff in the middle (should be negative)',
'my opinion is what matters',
'Q & A: yes',
'Q & A: no',
]

In [243]:
print(' '.join([x[:5] for x in models]))
for t in checklist_table1:
    r = ' '.join(['%5.1f' % (suite.fail_rate[m][t]) for m in models])
    print('%s %s' % (r, t))

micro googl amazo bert rober hf_pi


KeyError: 'microsoft'

Suppose I want to compare the sentiment pipeline in `transformers` to these models.  
The pipeline only outputs a POSITIVE or NEGATIVE label with a score, while this test suite assumes the labels are [negative, neutral, positive], so we have to convert:

In [134]:
from transformers import pipeline
model = pipeline("sentiment-analysis", device=0)

In [139]:
import numpy as np
def pred_and_conf(data):
    # Convert the binary pipeline output into 3-class probabilities over
    # [negative, neutral, positive]; P(positive) in [1/3, 2/3] is predicted as neutral
    preds = model(data)
    # pr = P(positive) for each example
    pr = np.array([x['score'] if x['label'] == 'POSITIVE' else 1 - x['score'] for x in preds])
    pp = np.zeros((pr.shape[0], 3))
    margin_neutral = 1/3.
    mn = margin_neutral / 2.
    # Outside the margin, keep the binary probabilities and give neutral no mass
    neg = pr < 0.5 - mn
    pp[neg, 0] = 1 - pr[neg]
    pp[neg, 2] = pr[neg]
    pos = pr > 0.5 + mn
    pp[pos, 0] = 1 - pr[pos]
    pp[pos, 2] = pr[pos]
    # Inside the margin, neutral gets 1 at pr == 0.5 and 0.5 at the margin edges;
    # the remainder goes to the nearer class. The edges are inclusive so that
    # a score exactly on the edge does not leave an all-zero row.
    neutral_pos = (pr >= 0.5) & (pr <= 0.5 + mn)
    pp[neutral_pos, 1] = 1 - (1 / margin_neutral) * np.abs(pr[neutral_pos] - 0.5)
    pp[neutral_pos, 2] = 1 - pp[neutral_pos, 1]
    neutral_neg = (pr < 0.5) & (pr >= 0.5 - mn)
    pp[neutral_neg, 1] = 1 - (1 / margin_neutral) * np.abs(pr[neutral_neg] - 0.5)
    pp[neutral_neg, 0] = 1 - pp[neutral_neg, 1]
    preds = np.argmax(pp, axis=1)
    return preds, pp
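
To make the conversion easier to check by hand, here is a scalar restatement of the same mapping in plain Python, without `transformers` or `numpy`. The names `three_class_probs` and `predicted_label` are hypothetical, and the sketch assumes that scores exactly on the margin edge count as neutral.

```python
def three_class_probs(p_pos, margin_neutral=1 / 3):
    """Map P(positive) to [P(negative), P(neutral), P(positive)]."""
    half = margin_neutral / 2
    dist = abs(p_pos - 0.5)
    if dist > half:
        # Confident prediction: keep the binary probabilities, no neutral mass
        return [1 - p_pos, 0.0, p_pos]
    # Inside the margin: neutral mass is 1 at the center and 0.5 at the edges
    p_neutral = 1 - dist / margin_neutral
    if p_pos >= 0.5:
        return [0.0, p_neutral, 1 - p_neutral]
    return [1 - p_neutral, p_neutral, 0.0]

def predicted_label(probs):
    """Index of the largest probability (0=negative, 1=neutral, 2=positive)."""
    return max(range(len(probs)), key=lambda i: probs[i])

for p in (0.1, 0.4, 0.5, 0.6, 0.9):
    print(p, predicted_label(three_class_probs(p)))
```

Scores of 0.4, 0.5 and 0.6 all land in the neutral band, while 0.1 and 0.9 keep their binary label.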

In [141]:
def add_pipeline(x):
    preds, confs = pred_and_conf(x['tweet'])
    return {'hf_pipeline': preds, 'hf_pipeline_conf': confs}
suite.dataset = suite.dataset.map(add_pipeline, batched=True)

In [142]:
suite.compute('hf_pipeline', 'hf_pipeline_conf')

In [143]:
models.append('hf_pipeline')

In [201]:
print(' '.join([x[:5] for x in models]))
for t in checklist_table1:
    r = ' '.join(['%5.1f' % (suite.fail_rate[m][t]) for m in models])
    print('%s %s' % (r, t))

micro googl amazo bert rober hf_pi
  0.0   7.6   4.8  94.6  81.8  95.8 neutral words in context
  4.0  15.0   2.8   0.0   0.2   0.8 Sentiment-laden words in context
  9.4  16.2  12.4  10.2  10.2   9.8 change neutral words with BERT
 12.6  12.4   1.4   0.2  10.2   0.0 add positive phrases
  0.8  34.6   5.0   0.0  13.2   6.8 add negative phrases
  9.6  13.4  24.8  11.4   7.4  15.4 add random urls and handles
  5.6  10.2  10.4   5.2   3.8   6.6 typos
  7.0  20.8  14.8   7.6   6.4  10.0 change locations
  2.4  15.1   9.1   6.6   2.4   5.1 change names
 41.0  36.6  42.2  18.8  11.0  32.6 used to, but now
 18.8  54.2  29.4  13.2   2.6  12.8 simple negations: not negative
 40.4  39.6  74.2  98.4  95.4  97.4 simple negations: not neutral is still neutral
100.0  90.4 100.0  84.8   7.2 100.0 simple negations: I thought x was negative, but it was not (should be neutral or positive)
 98.4 100.0 100.0  74.0  30.2  86.8 Hard: Negation of positive with neutral stuff in the middle (should be negative)

### Using marcotcr/checklist

Users can access my package's object if they want to use it, with the caveat that it doesn't really allow for model comparison (it only keeps the state of the last model we called `compute` on).  
For example:

In [249]:
type(suite.suite)

checklist.test_suite.TestSuite

In [251]:
suite.suite.tests['Sentiment-laden words in context'].summary()

Exception: No results. Run run() first

In [None]:
suite.summary() # calls suite.suite.summary

### Some sugar on nlp.checklist

Getting the examples for a particular test:

In [212]:
simple = suite.get_test('Sentiment-laden words in context')

In [213]:
simple['tweet'][:10]

['I valued the flight.',
 'That is a sad customer service.',
 'We like the flight.',
 'This was a nice crew.',
 'We abhor that flight.',
 'This staff is difficult.',
 'I valued that aircraft.',
 'This was a fantastic flight.',
 'I hate the food.',
 'I despised the food.']

Filtering for examples where Google fails and the HF pipeline does not:

In [217]:
google_fails_hf_doesnt = simple.filter(lambda x:x['fail']['hf_pipeline'] == 0 and x['fail']['google'] == 1)
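
The predicate passed to `filter` only needs each row's `fail` entry, which (judging from the lambda above) maps a model name to 0 or 1. A sketch of the same selection on hand-written rows; the `fail` values below are invented for illustration and `fails_only_for` is a hypothetical helper:

```python
# Invented rows mimicking the assumed layout of a test's examples
rows = [
    {'tweet': 'That is a lame service.', 'fail': {'hf_pipeline': 0, 'google': 1}},
    {'tweet': 'We like the flight.', 'fail': {'hf_pipeline': 0, 'google': 0}},
    {'tweet': 'I hate the food.', 'fail': {'hf_pipeline': 1, 'google': 1}},
]

def fails_only_for(row, failing, passing):
    """True when `failing` gets the row wrong and `passing` gets it right."""
    return row['fail'][failing] == 1 and row['fail'][passing] == 0

selected = [r['tweet'] for r in rows if fails_only_for(r, 'google', 'hf_pipeline')]
print(selected)
```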

In [228]:
mapz = ['negative', 'neutral', 'positive']
for x in np.random.choice(google_fails_hf_doesnt.shape[0], 5):
    x = google_fails_hf_doesnt[int(x)]
    print('HF:%-8s GOOGLE:%-8s %s' % (mapz[x['hf_pipeline']], mapz[x['google']], x['tweet']))

HF:negative GOOGLE:neutral  That is a lame service.
HF:negative GOOGLE:positive That is a difficult pilot.
HF:negative GOOGLE:neutral  The pilot is creepy.
HF:positive GOOGLE:neutral  We like this company.
HF:negative GOOGLE:neutral  The seat is hard.


Perturbation tests combine multiple examples, so looking at a single row in the dataset would not give us a good picture.  
Instead, we want to aggregate each testcase into a row of examples (data goes into the `data` key):

In [229]:
perturbation = suite.get_test('change locations', aggregate_testcases=True)

In [236]:
any_fails = lambda x, key: any([y['fail'][key] for y in x['data']])
google_fails_hf_doesnt = perturbation.filter(lambda x: not any_fails(x, 'hf_pipeline') and any_fails(x, 'google'))

In [239]:
mapz = ['negative', 'neutral', 'positive']
for x in np.random.choice(google_fails_hf_doesnt.shape[0], 5):
    x = google_fails_hf_doesnt[int(x)]
    orig = x['data'][0]
    fail = [y for y in x['data'] if y['fail']['google']][0]
    print('HF:%-8s GOOGLE:%-8s %s' % (mapz[orig['hf_pipeline']], mapz[orig['google']], orig['tweet']))
    print('HF:%-8s GOOGLE:%-8s %s' % (mapz[fail['hf_pipeline']], mapz[fail['google']], fail['tweet']))
    print()

HF:negative GOOGLE:positive @SouthwestAir if only you could control the weather in Las Vegas 😉
HF:negative GOOGLE:neutral  @SouthwestAir if only you could control the weather in Farmington Hills 😉

HF:positive GOOGLE:positive @united stay warm - I will be passing through Chicago next week
HF:positive GOOGLE:neutral  @united stay warm - I will be passing through Manchester next week

HF:negative GOOGLE:neutral  @AmericanAir you really need some customer service training for your unhappy EEs in the morning in Chicago. Gate K20 at 430 chking her schd
HF:negative GOOGLE:negative @AmericanAir you really need some customer service training for your unhappy EEs in the morning in Plainfield. Gate K20 at 430 chking her schd

HF:negative GOOGLE:neutral  @JetBlue Are there really no flights from the Bay Area to Chicago anymore? Lame. So lame.
HF:negative GOOGLE:negative @JetBlue Are there really no flights from the Bay Area to Urbana anymore? Lame. So lame.

HF:positive GOOGLE:neutral  @USAirways

# SQuAD

In [2]:
suite_path = '/home/marcotcr/work/checklist/release_data/squad/squad_suite.pkl'

In [3]:
suite = cl.CheckListSuite(suite_path, lambda x: {'context': x[0], 'question': x[1]})

In [8]:
bert_preds = open('/home/marcotcr/work/checklist/release_data/squad/predictions/bert').read().splitlines()

In [13]:
suite.dataset = suite.dataset.map(lambda _, idx: {'bert': bert_preds[idx]}, with_indices=True)

In [16]:
suite.compute('bert')

In [19]:
checklist_table3 = [
    'A is COMP than B. Who is more / less COMP?',
    'Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?',
    'size, shape, age, color',
    'Profession vs nationality',
    'Animal vs Vehicle v2',
    'A is COMP than B. Who is antonym(COMP)? B',
    'A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.',
    'Question typo',
    'Add random sentence to context',
    'There was a change in profession',
    'Understanding before / after -> first / last.',
    'Negation in context, may or may not be in question',
    'Negation in question only.',
    'M/F failure rates should be similar for different professions',
    'Basic coref, he / she',
    'Basic coref, his / her',
    'Former / Latter',
    'Agent / object distinction',
    'Agent / object distinction with 3 agents'
]

In [22]:
for t in checklist_table3:
    print('%.1f %s' % (suite.fail_rate['bert'][t], t))

20.0 A is COMP than B. Who is more / less COMP?
91.3 Intensifiers (very, super, extremely) and reducers (somewhat, kinda, etc)?
82.4 size, shape, age, color
49.4 Profession vs nationality
26.2 Animal vs Vehicle v2
67.3 A is COMP than B. Who is antonym(COMP)? B
100.0 A is more X than B. Who is more antonym(X)? B. Who is less X? B. Who is more X? A. Who is less antonym(X)? A.
11.6 Question typo
9.8 Add random sentence to context
41.5 There was a change in profession
82.9 Understanding before / after -> first / last.
67.5 Negation in context, may or may not be in question
100.0 Negation in question only.
46.2 M/F failure rates should be similar for different professions
100.0 Basic coref, he / she
91.8 Basic coref, his / her
100.0 Former / Latter
60.8 Agent / object distinction
95.7 Agent / object distinction with 3 agents


# TODO

- Put some data into `suite.dataset.info`, so people know what the predictions and confidences should look like (e.g. [0, 1, 2] for sentiment, or string for SQuAD)
- Warn people that CheckList suites contain pickled functions, which may not be safe
- Write some documentation for `nlp.checklist`
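
On the pickle warning: a small, harmless sketch of why loading an untrusted pickle is risky. The class name `RunsOnLoad` is made up; the point is only that `pickle.loads` calls whatever callable the file names, here the builtin `len`.

```python
import pickle

class RunsOnLoad:
    def __reduce__(self):
        # Tells pickle to rebuild this object by calling len('abc') at load time
        return (len, ('abc',))

payload = pickle.dumps(RunsOnLoad())
loaded = pickle.loads(payload)  # runs len('abc') instead of restoring an instance
print(loaded)
```

A malicious file could name any importable callable in the same way, so suites should only be loaded from sources people trust.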