## Introduction
This notebook shows how to augment text data with the checklist package, and how to use the augmented data to test a model's behaviour with respect to linguistic properties of the text.

In [11]:
import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb

In [12]:
editor = Editor()

In [13]:
data = ["John is a very smart person, he lives in Ireland.",
       "Mark stewart was born and raised in Chicago",
       "luke smith has 3 sisters",
       'mary is not a nurse',
       'Julianne is an engineer',
       'My brother andrew used to be a lawyer']

In [14]:
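# Custom perturbation: for each profession found in the text, return copies of the
# text with that profession replaced by each of the other professions.
# With meta=True, also return the (original, replacement) pairs.
import re
def change_professions(x, meta=False, *args, **kwargs):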
    professions = ['doctor',  'engineer', 'lawyer', 'nurse']
    ret = []
    ret_meta = []
    for p in professions:
        if re.search(r'\b%s\b'%p, x):
            ret.extend([re.sub(r'\b%s\b'%p, p2, x) for p2 in professions
                       if p != p2])
#             if meta:
            ret_meta.extend([(p, p2) for p2 in professions if p!=p2])
    if meta:
        return ret, ret_meta
    return ret

In [15]:
change_professions(data[3], meta=True)

(['mary is not a doctor', 'mary is not a engineer', 'mary is not a lawyer'],
 [('nurse', 'doctor'), ('nurse', 'engineer'), ('nurse', 'lawyer')])
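The output above contains "a engineer", because the substitution leaves the article untouched. A minimal sketch of a post-processing step that repairs the article is shown here. `fix_article` is a hypothetical helper, not part of checklist, and it assumes a first-letter vowel check is good enough (it would get words like "hour" or "university" wrong).

```python
import re

# Rough heuristic (assumption): a noun starting with a vowel letter takes "an".
VOWEL_START = re.compile(r"^[aeiou]", re.IGNORECASE)

def fix_article(sentence, noun):
    """Make the indefinite article directly before `noun` agree with it."""
    article = "an" if VOWEL_START.match(noun) else "a"
    # Match "a" or "an" followed by whitespace and the noun as a whole word.
    pattern = r"\b(a|an)\s+(%s)\b" % re.escape(noun)
    return re.sub(pattern, r"%s \2" % article, sentence)

print(fix_article("mary is not a engineer", "engineer"))
print(fix_article("Julianne is an doctor", "doctor"))
```

Because `change_professions` already returns the replacement word in its metadata, the same pairs could be used to decide which noun to pass in.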

In [16]:
# Perturb.perturb is the main entry point for augmentation: it applies a perturbation
# function (here our custom one) to every example. Parameters:
# keep_original --> keep the original sentence as the first item of each group.
# nsamples --> number of examples to sample for perturbation.
ret = Perturb.perturb(data, change_professions, keep_original=True)

In [17]:
ret.data

[['mary is not a nurse',
  'mary is not a doctor',
  'mary is not a engineer',
  'mary is not a lawyer'],
 ['Julianne is an engineer',
  'Julianne is an doctor',
  'Julianne is an lawyer',
  'Julianne is an nurse'],
 ['My brother andrew used to be a lawyer',
  'My brother andrew used to be a doctor',
  'My brother andrew used to be a engineer',
  'My brother andrew used to be a nurse']]
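Only three of the six input sentences appear in the result, because sentences for which the perturbation function returns nothing are dropped. The grouping behaviour can be sketched in plain Python as follows. This is an illustration of the observed output under that assumption, not checklist's actual implementation; `perturb_sketch` and `red_to_blue` are hypothetical names.

```python
def perturb_sketch(data, perturb_fn, keep_original=True):
    """Group each input with its perturbed variants; skip inputs that yield none."""
    groups = []
    for text in data:
        variants = perturb_fn(text)
        if not variants:
            continue  # nothing to perturb in this example
        prefix = [text] if keep_original else []
        groups.append(prefix + list(variants))
    return groups

def red_to_blue(text):
    """Toy perturbation (hypothetical): replace the word 'red' with 'blue'."""
    words = text.split()
    if "red" not in words:
        return []
    return [" ".join("blue" if w == "red" else w for w in words)]

print(perturb_sketch(["a red car", "a green car"], red_to_blue))
```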

In [18]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [19]:
pdata =  list(nlp.pipe(data))

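### Built-in functions of Perturb
The library ships built-in functions for common perturbations such as changing punctuation or adding spelling mistakes. Several of them take a spaCy Doc (an item of pdata) rather than a plain string; add_typos and the contraction functions are called with plain strings below.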

In [20]:
# Example: strip the trailing punctuation from a spaCy Doc
pdata[0], Perturb.strip_punctuation(pdata[0])

(John is a very smart person, he lives in Ireland.,
 'John is a very smart person, he lives in Ireland')

In [21]:
# Pass the data and a built-in function to Perturb.perturb to augment it.
# Perturb.punctuation adds or strips the final punctuation mark.
ret = Perturb.perturb(pdata, Perturb.punctuation)
ret.data[:4]

[['John is a very smart person, he lives in Ireland.',
  'John is a very smart person, he lives in Ireland'],
 ['Mark stewart was born and raised in Chicago',
  'Mark stewart was born and raised in Chicago.'],
 ['luke smith has 3 sisters', 'luke smith has 3 sisters.'],
 ['mary is not a nurse', 'mary is not a nurse.']]

In [22]:
# Similarly, add_typos introduces a spelling mistake into a plain string
data[0], Perturb.add_typos(data[0])

('John is a very smart person, he lives in Ireland.',
 'John is a very smart person, he lives in rIeland.')
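The typo shown above ("Ireland" becoming "rIeland") is a swap of two adjacent characters. A minimal sketch of that kind of perturbation is given here, assuming a single swap at a uniformly random position; `swap_adjacent` is a hypothetical name, and the library may choose positions differently.

```python
import random

def swap_adjacent(text, rng=None):
    """Return `text` with one randomly chosen pair of adjacent characters swapped."""
    rng = rng or random.Random()
    if len(text) < 2:
        return text  # nothing to swap
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Passing a seeded generator makes the perturbation reproducible.
print(swap_adjacent("Ireland", random.Random(0)))
```

Note that if the two chosen characters are identical, the output equals the input.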

In [23]:
# Perturb.contract contracts phrases ("is not" -> "isn't");
# Perturb.expand_contractions does the reverse.
print(data[3], Perturb.contract(data[3]))
print(data[3], Perturb.expand_contractions('mary is not a nurse mary isn\'t a nurse'))

mary is not a nurse mary isn't a nurse
mary is not a nurse mary is not a nurse mary is not a nurse


In [24]:
# Named-entity support: change_names replaces person names
# (nsamples=1 picks one example, n=3 produces three replacements for it)
ret = Perturb.perturb(pdata[2:3], Perturb.change_names, nsamples=1, n=3)
ret.data

[['luke smith has 3 sisters',
  'Michael Martin has 3 sisters',
  'Christopher Wilson has 3 sisters',
  'Matthew Wilson has 3 sisters']]

In [25]:
ret = Perturb.perturb(pdata, Perturb.change_names, nsamples=1, n=3, last_only=True, meta=True)
ret.data[0][1:], ret.meta[0][1:]

(['luke Ward has 3 sisters',
  'luke Clark has 3 sisters',
  'luke Stewart has 3 sisters'],
 [('smith', 'Ward'), ('smith', 'Clark'), ('smith', 'Stewart')])

In [26]:
# Named-entity support for locations: change_location replaces a city or country
ret = Perturb.perturb(pdata, Perturb.change_location, nsamples=1, n=3, meta=True)
ret.data[0], ret.meta[0]

(['John is a very smart person, he lives in Ireland.',
  'John is a very smart person, he lives in Russian Federation.',
  'John is a very smart person, he lives in Poland.',
  'John is a very smart person, he lives in Uzbekistan.'],
 [None,
  ('Ireland', 'Russian Federation'),
  ('Ireland', 'Poland'),
  ('Ireland', 'Uzbekistan')])

In [27]:
#function to change numbers in the input data
ret = Perturb.perturb(pdata, Perturb.change_number, nsamples=1, n=3, meta=True)
ret.data[0], ret.meta[0]

(['luke smith has 3 sisters',
  'luke smith has 2 sisters',
  'luke smith has 4 sisters',
  'luke smith has 2 sisters'],
 [None, ('3', '2'), ('3', '4'), ('3', '2')])

In [28]:
# function to remove the negation
for t in ['This is not good', 'He didn\'t play the guitar', 'He doesn\'t play anything', 'She wasn\'t sad']:
    print(t)
    print(Perturb.remove_negation(nlp(t)))
    print()

This is not good
This is good

He didn't play the guitar
He played the guitar

He doesn't play anything
He plays anything

She wasn't sad
She was sad



## Behavioural Testing

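Next, we check the model's behaviour with respect to linguistic properties of the input.
One such check is the invariance test (INV).
INV means that after a label-preserving perturbation, such as swapping one named entity for another or adding a typo, the model should make the same prediction as on the original input.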
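The idea behind an invariance test can be sketched without the library. The sketch below assumes the simplest possible criterion (a test case fails if any perturbed variant gets a different label than the original) and ignores any tolerance on probability changes that checklist's INV may apply. `invariance_failures` and `toy_label` are hypothetical names.

```python
def invariance_failures(groups, predict_label):
    """Return (original, changed_variants) for every group whose label is not stable.

    Each group is a list: the original text followed by its perturbed variants.
    """
    failures = []
    for group in groups:
        original, *perturbed = group
        expected = predict_label(original)
        changed = [p for p in perturbed if predict_label(p) != expected]
        if changed:
            failures.append((original, changed))
    return failures

def toy_label(text):
    """Toy classifier (hypothetical): positive (1) only if the exact word 'good' appears."""
    return 1 if "good" in text.split() else 0

groups = [["good film", "godo film"], ["bad film", "bda film"]]
print(invariance_failures(groups, toy_label))
```

The toy classifier is brittle on purpose: a typo in "good" flips its label, so the first group fails while the second stays stable.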

In [29]:
from checklist.test_types import MFT, INV, DIR

In [30]:
from pattern.en import sentiment

We use the sentiment function from the pattern library as the model, and write predict_proba to convert its polarity score (which lies in [-1, 1]) into probabilities for the negative and positive classes.

In [31]:
import numpy as np
def predict_proba(inputs):
    p1 = np.array([(sentiment(x)[0]+1)/2. for x in inputs]).reshape(-1, 1)
#     print(p1)
    p0 = 1-p1
#     print(p0)
    return np.hstack((p0, p1))

In [32]:
predict_proba(['good', 'bad'])

array([[0.15, 0.85],
       [0.85, 0.15]])

In [33]:
from checklist.pred_wrapper import PredictorWrapper
wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)

In [34]:
wrapped_pp(['bad'])

(array([0]), array([[0.85, 0.15]]))
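As the output above shows, the wrapped function returns a pair: the predicted labels and the class probabilities. A minimal sketch of that contract, using plain lists instead of NumPy arrays, is shown here. `wrap_sketch`, `toy_predict_proba`, and the `TOY_POLARITY` lookup are hypothetical stand-ins; the polarity values are chosen only so that the probabilities match the 0.85/0.15 split shown earlier.

```python
# Hypothetical stand-in for a sentiment model: fixed polarity scores in [-1, 1].
TOY_POLARITY = {"good": 0.7, "bad": -0.7}

def toy_predict_proba(inputs):
    """Map each polarity to [P(negative), P(positive)]; unknown text is neutral."""
    rows = []
    for text in inputs:
        p1 = (TOY_POLARITY.get(text, 0.0) + 1) / 2  # rescale [-1, 1] to [0, 1]
        rows.append([1 - p1, p1])
    return rows

def wrap_sketch(proba_fn):
    """Wrap a probability function so it returns (labels, probabilities)."""
    def pred_and_conf(inputs):
        probs = proba_fn(inputs)
        # The label is the index of the largest probability in each row.
        labels = [max(range(len(row)), key=row.__getitem__) for row in probs]
        return labels, probs
    return pred_and_conf

labels, probs = wrap_sketch(toy_predict_proba)(["good", "bad"])
print(labels, probs)
```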

In [35]:
wrapped_pp

<function checklist.pred_wrapper.PredictorWrapper.wrap_softmax.<locals>.pred_and_conf(inputs)>

In [36]:
dataset = ['This was a very nice movie directed by John Smith.',
           'Mary Keen was brilliant.', 
          'I hated everything about this.',
          'This movie was very bad.',
          'I really liked this movie.',
          'just bad.',
          'amazing.',
          ]
pdataset = list(nlp.pipe(dataset))

In [55]:
t = Perturb.perturb(dataset, Perturb.add_typos, nsamples=4)
print('\n'.join(t.data[0][:3]))
print('...')


This was a very nice movie directed by John Smith.
This was a very niec movie directed by John Smith.
...


In [56]:
# Instantiate the invariance test from the perturbed data
test = INV(**t)
# Run it with the wrapped prediction function
test.run(wrapped_pp)
test.summary()

Predicting 8 examples
Test cases:      4
Fails (rate):    2 (50.0%)

Example fails:
0.9 Mary Keen was brilliant.
0.5 Mary Keen was birlliant.

----
0.0 This movie was very bad.
0.6 This movie was very bda.

----


### Results summary
First we augmented the data by adding typos.
Then we passed the perturbed data to an INV test; the summary shows that two of the four test cases fail.
In the first failing example, "Mary Keen was brilliant.", the sentiment score drops from 0.9 to 0.5 once "brilliant" is misspelled as "birlliant".

The following cells show the perturbed pairs and their probabilities.

In [64]:
t.data[2]

['Mary Keen was brilliant.', 'Mary Keen was birlliant.']

In [62]:
wrapped_pp(t.data[2])

(array([1, 0]),
 array([[0.05, 0.95],
        [0.5 , 0.5 ]]))

In [58]:
t.data[0], len(t.data)

(['This was a very nice movie directed by John Smith.',
  'This was a very niec movie directed by John Smith.'],
 4)