# Checklist experiments

## <a href=#bookmark1>Minimum Functionality Test</a>

## <a href=#bookmark2>Invariance Test</a>

## <a href=#bookmark3>Directional Expectation Test</a>

## <a href=#bookmark4>Conclusion</a>

In [None]:
import torch
from transformers import DistilBertTokenizer
from training import DistillBERTClass
import spacy
import numpy as np
import pandas as pd

from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'


import nltk
nltk.download('omw-1.4')

In [None]:
tokenizer = DistilBertTokenizer.from_pretrained(
    'distilbert-base-uncased',
    do_lower_case=True
)

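# torch.load unpickles the whole model object, so DistillBERTClass must be importable.
# Load the model onto the same device that the inputs are moved to below.
model = torch.load("./model/trained_model.pt", map_location=device)
model.eval()  # inference mode: disables dropout
MAX_LEN = 160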

In [3]:
# Use the trained model to predict the positive ("disaster") class probability for a single sentence.

def sentence_prediction(sentence: str):
    # Collapse runs of whitespace before tokenizing.
    tweet = " ".join(str(sentence).split())
    inputs = tokenizer.encode_plus(
            tweet,
            None,
            add_special_tokens=True,
            padding=True
        )

    # Add a batch dimension and move the tensors to `device`.
    ids = torch.tensor(inputs["input_ids"], dtype=torch.long).unsqueeze(0).to(device)
    mask = torch.tensor(inputs["attention_mask"], dtype=torch.long).unsqueeze(0).to(device)

    # Inference only, so gradients are not needed.
    with torch.no_grad():
        outputs = model(ids=ids, mask=mask)

    # Sigmoid maps the logit to a probability; return the score of the single input sentence.
    outputs = torch.sigmoid(outputs).cpu().numpy()
    return outputs[0][0]


def predict_proba(inputs):
    p1 = np.array([sentence_prediction(x) for x in inputs]).reshape(-1, 1)
    p0 = 1 - p1
    return np.hstack((p0, p1))

## Example of prediction using the model.


In [4]:
# a disaster tweet

sentence_prediction("South Japan is experiencing some earthquakes right now!")

0.99887425

In [5]:
# not a disaster tweet

sentence_prediction("This movie is a complete disaster!")

0.43723223

In [6]:
# Required checklist imports
import checklist
from checklist.editor import Editor
from checklist.test_types import MFT, INV, DIR
from checklist.pred_wrapper import PredictorWrapper

editor = Editor()

# Wrap the prediction function as required by the CheckList package.
wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)
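
For intuition, the wrapper can be thought of as turning a probability function into one that returns predicted labels together with the probabilities. The sketch below mimics that behaviour in plain Python; `fake_predict_proba` and `wrap_softmax_sketch` are hypothetical stand-ins written for illustration, not CheckList code and not this notebook's model.

```python
from typing import Callable, List, Sequence, Tuple


def fake_predict_proba(inputs: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for predict_proba: one [p0, p1] row per sentence."""
    rows = []
    for text in inputs:
        p1 = 0.9 if "earthquake" in text.lower() else 0.2  # toy scoring rule
        rows.append([1.0 - p1, p1])
    return rows


def wrap_softmax_sketch(
    proba_fn: Callable[[Sequence[str]], List[List[float]]]
) -> Callable[[Sequence[str]], Tuple[List[int], List[List[float]]]]:
    """Return a function that yields (predicted labels, probabilities)."""
    def pred_and_conf(inputs):
        confs = proba_fn(inputs)
        # The predicted label is the index of the largest probability in each row.
        preds = [max(range(len(row)), key=row.__getitem__) for row in confs]
        return preds, confs
    return pred_and_conf


wrapped = wrap_softmax_sketch(fake_predict_proba)
preds, confs = wrapped(["An earthquake hit the coast.", "Nice weather today."])
print(preds)
```

In the notebook itself, `predict_proba` plays the role of `fake_predict_proba`, and its two columns are the negative-class and positive-class probabilities.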

# Minimum Functionality Test <a name='bookmark1' />

## 1)

We use "{city} is under attack." as a template. Every sentence generated from it is a "disaster tweet", so this tests the model's basic ability to predict the positive class.

CheckList fills the template with city names from its built-in lexicon, and the model is tested against the generated sentences.
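
As a rough illustration of what template filling involves, here is a minimal plain-Python sketch. The `LEXICON` dictionary and the `fill_template` helper are hypothetical and far simpler than CheckList's `Editor`; they only show the idea of substituting lexicon entries into a placeholder and dropping duplicates.

```python
import random

# Hypothetical mini-lexicon; CheckList ships its own, much larger, list of cities.
LEXICON = {"city": ["Houston", "Atlanta", "Miami", "Saint Paul"]}


def fill_template(template, lexicon, nsamples, seed=0):
    """Fill each {placeholder} with a lexicon entry and drop duplicate sentences."""
    rng = random.Random(seed)
    seen = []
    # Cap the attempts so the loop ends even if nsamples exceeds the distinct outputs.
    for _ in range(nsamples * 20):
        values = {key: rng.choice(options) for key, options in lexicon.items()}
        sentence = template.format(**values)
        if sentence not in seen:
            seen.append(sentence)
        if len(seen) == nsamples:
            break
    return seen


sentences = fill_template("{city} is under attack.", LEXICON, nsamples=3)
labels = [1] * len(sentences)  # every generated sentence is a disaster tweet
for sentence in sentences:
    print(sentence)
```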

In [7]:
samples_1 = editor.template(
    '{city} is under attack.',
    remove_duplicates=True,
    nsamples=150,
    labels=1)


test_1 = MFT(samples_1.data, labels=samples_1.labels, name='Different_cities_countries',
             capability='NER', description='Different city names')

A few of the generated examples:

In [8]:
print(test_1.data[0])
print(test_1.data[1])
print(test_1.data[2])

Saint Paul is under attack.
Houston is under attack.
Atlanta is under attack.


In [9]:
test_1.run(wrapped_pp)
test_1.summary()

Predicting 150 examples
Test cases:      150
Fails (rate):    0 (0.0%)


In [10]:
test_1.visual_summary()

TestSummarizer(stats={'npassed': 150, 'nfailed': 0, 'nfiltered': 0}, summarizer={'name': 'Different_cities_cou…

100% success rate here.

## 2)

We use "{country} starts talks to open a nuclear plant." as a template.

These sentences belong to the negative class, as they do not describe a real disaster.
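
An MFT case fails when the predicted label differs from the expected one. The sketch below shows how a failure count and rate could be computed for negative-class sentences; `mft_fail_rate` is a hypothetical helper, the probabilities are made up for illustration, and a 0.5 decision threshold is assumed.

```python
def mft_fail_rate(positive_probs, expected_label, threshold=0.5):
    """Count the cases whose predicted label differs from the expected label."""
    predicted = [1 if p > threshold else 0 for p in positive_probs]
    fails = sum(1 for label in predicted if label != expected_label)
    return fails, fails / len(predicted)


# Hypothetical positive-class probabilities for four negative-class sentences.
probs = [0.8, 0.3, 0.7, 0.1]
fails, rate = mft_fail_rate(probs, expected_label=0)
print(f"Fails (rate):    {fails} ({rate:.1%})")
```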

In [48]:
samples = editor.template(
    '{country} starts talks to open a nuclear plant.',
    remove_duplicates=True,
    nsamples=100, labels=0)

test = MFT(samples.data, labels=samples.labels, name='Different_countries',
           capability='NER', description='Different country names')

A few of the generated examples:

In [49]:
print(test.data[0])
print(test.data[1])
print(test.data[2])

Tanzania starts talks to open a nuclear plant.
Czech Republic starts talks to open a nuclear plant.
Senegal starts talks to open a nuclear plant.


In [50]:
test.run(wrapped_pp)
test.summary()

Predicting 100 examples
Test cases:      100
Fails (rate):    93 (93.0%)

Example fails:
0.8 Tanzania starts talks to open a nuclear plant.
----
1.0 Oman starts talks to open a nuclear plant.
----
0.7 Italy starts talks to open a nuclear plant.
----


In [None]:
test.visual_summary()

<img src="mft-summary.png" width="800" height="400">

93% fail rate.

### The likely reason for the high failure rate is that the model wrongly associates the word *nuclear* with some kind of disaster.

### An interesting thing to note here is the model's bias. The highest errors are observed for under-developed or developing countries: in this run they occur for Tanzania, Senegal, Cyprus, etc., and not for developed Western countries.

# Invariance Test <a name='bookmark2' />

Checking robustness to typos.

The prediction should be the same if we introduce some typos to the sentence.

We use the same data as in the first MFT example, but with **typos** introduced.
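
The typos visible in the output further down ("Palu", "atatck", "attakc") are swaps of neighbouring characters. Below is a minimal plain-Python sketch of such a perturbation and of the invariance check; `swap_adjacent`, `invariance_fails` and `toy_label` are hypothetical helpers for illustration, not CheckList's implementation and not the notebook's model.

```python
import random


def swap_adjacent(text, rng):
    """Swap one randomly chosen pair of neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def invariance_fails(pairs, predict_label):
    """Return the (original, perturbed) pairs whose predicted label changed."""
    return [(a, b) for a, b in pairs if predict_label(a) != predict_label(b)]


def toy_label(text):
    # Hypothetical classifier: fires only on the exact word "attack".
    return int("attack" in text)


rng = random.Random(0)
originals = ["Miami is under attack.", "Houston is under attack."]
pairs = [(s, swap_adjacent(s, rng)) for s in originals]
failed = invariance_fails(pairs, toy_label)
print(f"{len(failed)} of {len(pairs)} pairs changed label")
```

A classifier that keys on exact word forms, like `toy_label`, fails whenever the swap lands inside the trigger word, which is the pattern seen in the example fails.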

In [19]:
nlp = spacy.load("en_core_web_sm")
pdata = list(nlp.pipe(samples_1.data))

  

In [20]:
from checklist.perturb import Perturb

def test_invariant(data: list, method: callable, wrapped_predict: callable):
    t = Perturb.perturb(data, method)
    print("First sample before and after perturbation:")
    print("\n".join(t.data[0]))
    print("\nSummary:")
    test = INV(**t)
    test.run(wrapped_predict)
    test.summary()

In [21]:
test_invariant(samples_1.data, Perturb.add_typos, wrapped_pp)

First sample before and after perturbation:
Saint Paul is under attack.
Saint Palu is under attack.

Summary:
Predicting 300 examples
Test cases:      150
Fails (rate):    38 (25.3%)

Example fails:
1.0 Fort Wayne is under attack.
0.1 Fort Wayne is under atatck.

----
1.0 North Las Vegas is under attack.
0.1 North Las Vegas is under attakc.

----
1.0 Miami is under attack.
0.1 Miami is under atatck.

----


### There is a 25% fail rate when introducing typos to the sentence.

### This is not surprising: the model was not trained on these particular typos, and robustness to typos is hard to achieve.

# Directional Expectation Test <a name='bookmark3' />

A Directional Expectation test (DIR) is just like an INV test in that we apply a perturbation to existing inputs. However, instead of expecting invariance, we expect the model to behave in a specified way.

Here we just append the phrase "A fire then broke out." to the end of each sentence.

After this perturbation, we expect that the probability of the positive class does NOT go down, since this phrase indicates some kind of 'disaster'.

We don't have to generate new data here; we can reuse existing data and add the phrase at the end.
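
The expectation described above can be sketched in plain Python: a case passes unless the positive-class probability drops by more than a tolerance. `passes_monotonic_increasing` is a hypothetical helper and the probabilities are made up. CheckList's own expectation also filters out some cases, which this sketch does not model.

```python
def passes_monotonic_increasing(orig_prob, new_prob, tolerance=0.1):
    """Pass unless the probability dropped by more than `tolerance`."""
    return new_prob - orig_prob >= -tolerance


# Hypothetical positive-class probabilities before and after appending the phrase.
cases = [(0.40, 0.95), (0.70, 0.65), (0.80, 0.50)]
results = [passes_monotonic_increasing(before, after) for before, after in cases]
print(results)
```

The second case passes even though the probability falls slightly, because the drop stays within the 0.1 tolerance; only the third case counts as a failure.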

In [31]:
def add_fire_phrase(x: str):
    phrases = ['A fire then broke out.']
    return ['%s %s' % (x, p) for p in phrases]

In [32]:
test_data = list(pd.read_csv("data/test.csv")['text'][:100])

In [33]:
test_data[:3]

['Just happened a terrible car crash',
 'Heard about #earthquake is different cities, stay safe everyone.',
 'there is a forest fire at spot pond, geese are fleeing across the street, I cannot save them all']

In [34]:
from checklist.expect import Expect

In [35]:
# The positive-class probability should not decrease by more than the tolerance.
monotonic_increasing = Expect.monotonic(label=1, increasing=True, tolerance=0.1)

In [36]:
t = Perturb.perturb(test_data, add_fire_phrase)
test = DIR(**t, expect=monotonic_increasing)

In [37]:
test.run(wrapped_pp)
test.summary()

Predicting 200 examples
Test cases:      100
After filtering: 46 (46.0%)
Fails (rate):    0 (0.0%)


### 100% success rate for DIR on the 46 test cases that remain after filtering.

## Conclusion <a name='bookmark4' />

- 93% failure rate for the second MFT test. More training data is needed for the model to learn the nuance around "nuclear".

- 25% failure rate when typos are introduced. This is a difficult problem to solve, as there are many ways to misspell a word.

- The model performs well on the DIR test: no failures among the 46 cases that remain after filtering.