# 3C-jkk-counterfactuals

In this notebook, we will explore counterfactual explanations. A counterfactual explanation of a prediction describes the smallest shange to the feature values that changes the prediction to a predefined output. Counterfactual explanations can be model-specific of model-agnostic. We will focus on model-agnostic methods.

Note that the documentation linked above may refer to a newer version of scikit-learn.

In [1]:
%config IPCompleter.use_jedi = False

from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import HTML
from tqdm.auto import tqdm
from joblib import load
import sqlite3 as sql
import pandas as pd
import numpy as np
import re

seed = 101

Load the entire dataset.

In [2]:
with sql.connect('../data/toxic.db') as conn:
    df = pd.read_sql_query('''select * from toxic''', conn)
df.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,10.0,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,10.0,0.0,2.0,0.5,0
2,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,10.0,0.0,1.0,0.1,0
3,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,10.0,0.0,2.0,0.6,0
4,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,10.0,-1.0,1.0,0.2,0


Split into two seperate dataframes: df_train and df_test.

In [3]:
df_train = df[df['split'] == 'train'].copy().reset_index(drop=True)
df_train.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,10.0,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,10.0,0.0,2.0,0.5,0
2,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,10.0,0.0,2.0,0.6,0
3,37330.0,"""\n\n\nI fixed the link; I also removed """"home...",2002,1,article,random,train,10.0,-1.0,1.0,0.1,0
4,37346.0,"""If they are """"indisputable"""" then why does th...",2002,1,article,random,train,10.0,-1.0,1.0,0.2,0


In [4]:
df_test = df[df['split'] == 'test'].copy().reset_index(drop=True)
df_test.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y
0,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,10.0,0.0,1.0,0.1,0
1,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,10.0,-1.0,1.0,0.2,0
2,138074.0,"""\n\n\n\nI'm not sure if it's properly called ...",2002,1,article,random,test,10.0,0.0,1.0,0.5,0
3,200664.0,\n\n\n \nThanks on the info on how to move a p...,2002,1,user,random,test,10.0,0.0,1.0,0.4,0
4,213105.0,"""\n\n: I should do that too, I agree, but I've...",2002,1,user,random,test,10.0,0.0,1.0,0.3,0


Let's grab our best Random Forest model from notebook 2C-jkk-random-forest. This is a good example, since predictions from a Random Forest can be difficult to intepret. Also, this model pipeline has a dense vector representation, which will be useful for demonstration purposes.

In [5]:
rs = load('../results/rs_cv_nmf_rf.joblib')
pipe = rs.best_estimator_
pipe.steps

[('vect',
  TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
          dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
          lowercase=True, max_df=1.0, max_features=None, min_df=20,
          ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
          stop_words=None, strip_accents=None, sublinear_tf=False,
          token_pattern='[a-z]+', tokenizer=None, use_idf=True,
          vocabulary=None)),
 ('nmf',
  NMF(alpha=0.0, beta_loss='frobenius', init='nndsvda', l1_ratio=0.0,
    max_iter=200, n_components=100, random_state=None, shuffle=False,
    solver='cd', tol=0.0001, verbose=0)),
 ('clf',
  RandomForestClassifier(bootstrap=True, class_weight='balanced_subsample',
              criterion='gini', max_depth=20, max_features='auto',
              max_leaf_nodes=None, min_impurity_decrease=0.0,
              min_impurity_split=None, min_samples_leaf=1,
              min_samples_split=2, min_weight_fraction_leaf=0.0,
        

Before we go any further, let's go ahead and generate predictions on the test set so we have some examples for later.

In [6]:
df_test["y_prob"] = pipe.predict_proba(df_test["comment"])[:,1]
df_test["y_pred"] = (df_test["y_prob"] > 0.5).astype(int)
df_test.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y,y_prob,y_pred
0,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,10.0,0.0,1.0,0.1,0,0.322735,0
1,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,10.0,-1.0,1.0,0.2,0,0.184302,0
2,138074.0,"""\n\n\n\nI'm not sure if it's properly called ...",2002,1,article,random,test,10.0,0.0,1.0,0.5,0,0.184865,0
3,200664.0,\n\n\n \nThanks on the info on how to move a p...,2002,1,user,random,test,10.0,0.0,1.0,0.4,0,0.084335,0
4,213105.0,"""\n\n: I should do that too, I agree, but I've...",2002,1,user,random,test,10.0,0.0,1.0,0.3,0,0.196685,0


Let's pick a positive (toxic) example from the test set and find the *most similar* comment with a negative (non-toxic) label from the training set.

In [7]:
idx_test_pos = (df_test['y'] == 1) & (df_test['y_prob'] > 0.5)
df_test[idx_test_pos].sample(10, random_state=seed)

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,num,min,max,avg,y,y_prob,y_pred
22019,410026490.0,I am a huge tool bag.,2011,1,user,blocked,test,10.0,-1.0,1.0,-0.1,1,0.959228,1
232,8357736.0,\n\nYou are a child.,2004,1,user,blocked,test,10.0,-1.0,1.0,-0.1,1,0.982558,1
14834,248946653.0,"""\n\n== FUCK YOU ==\nFUCK YOU!!! YOU'RE A STU...",2008,0,user,blocked,test,10.0,-2.0,-1.0,-1.8,1,0.965476,1
21476,395850174.0,\n\n== no worries ==\n\nWe will use another ac...,2010,1,user,blocked,test,10.0,-1.0,1.0,-0.8,1,0.821475,1
25681,508305880.0,\nYou LOVE Anime!? There are some ANIMES out t...,2012,0,user,blocked,test,10.0,-2.0,-1.0,-1.2,1,0.835931,1
12701,209435370.0,\n\nDON'T YOU FUCKING TOUCH MY EDITS AGAIN!!!!!,2008,0,user,blocked,test,10.0,-2.0,0.0,-1.4,1,0.84982,1
12674,209021397.0,"\n\n== I hate you , unfriends ! ==\n\n \n I ...",2008,1,user,blocked,test,10.0,-2.0,0.0,-0.9,1,0.957991,1
29191,618375751.0,\n:There was nothing but good intent. You plac...,2014,0,user,blocked,test,10.0,-1.0,0.0,-0.9,1,0.708681,1
24009,462241410.0,\n\n\nfarooq abdullah is better known as faroo...,2011,0,article,blocked,test,10.0,-2.0,0.0,-0.9,1,0.606574,1
9941,163442321.0,\n\n\nWhere do you get off deleting that? it t...,2007,1,user,blocked,test,10.0,-2.0,0.0,-1.3,1,0.919241,1


In [8]:
comment = df_test.loc[12701]['comment']
print(comment)



DON'T YOU FUCKING TOUCH MY EDITS AGAIN!!!!!


Yup, this comment sucks. Now we need to make a decision about computing similarity, and what it means to have similar comments. The obvious solution is to transform the comments in the training set into TF-IDF vectors and compute a pairwise distance metric, such as cosine similarity. Let's start there.

In [9]:
idx_train_neg = df_train["y"] == 0 
X_train_neg = pipe.named_steps["vect"].transform(df_train.loc[idx_train_neg, "comment"])
X_train_neg

<77903x81017 sparse matrix of type '<class 'numpy.float64'>'
	with 7826131 stored elements in Compressed Sparse Row format>

In [10]:
X_target = pipe.named_steps["vect"].transform([comment])
X_target

<1x81017 sparse matrix of type '<class 'numpy.float64'>'
	with 15 stored elements in Compressed Sparse Row format>

In [11]:
cosine_sim = cosine_similarity(X_train_neg, X_target).flatten()
print('AVG: %0.2f' % df_train.loc[idx_train_neg, ['avg']].values[cosine_sim.argmax()][0])
print('Cosine Similarity: %0.3f' % cosine_sim.max())
print('<COMMENT>\n' + df_train.loc[idx_train_neg, 'comment'].values[cosine_sim.argmax()] + '\n<COMMENT>')

AVG: 0.50
Cosine Similarity: 0.326
<COMMENT>


::I didn't break 3RR check my edits again. Regards.  
<COMMENT>


Interesting, both are talking about edits, but the second is clearly much less confrontational. Let's look at the cloest three comments.

In [12]:
cosine_sim = cosine_similarity(X_train_neg, X_target).flatten()
for i in cosine_sim.argsort()[-3:][::-1]:
    print('\nAVG: %0.2f' % df_train.loc[idx_train_neg, ['avg']].values[cosine_sim.argmax()][0])
    print('Cosine Distance: %0.3f' % cosine_sim[i])
    print('<COMMENT>\n' + df_train.loc[idx_train_neg, 'comment'].values[i] + '\n<COMMENT>')


AVG: 0.50
Cosine Distance: 0.326
<COMMENT>


::I didn't break 3RR check my edits again. Regards.  
<COMMENT>

AVG: 0.50
Cosine Distance: 0.284
<COMMENT>

:I agree. Why don't You? 
<COMMENT>

AVG: 0.50
Cosine Distance: 0.246
<COMMENT>

: Why don't you do it?   
<COMMENT>


Not too close. What about similar *positive* examples? What do those look like?

In [13]:
idx_train_pos = df_train['y'] == 1
X_train_pos = pipe.named_steps['vect'].transform(df_train.loc[idx_train_pos, 'comment'])

In [14]:
cosine_sim = cosine_similarity(X_train_pos, X_target).flatten()
for i in cosine_sim.argsort()[-3:][::-1]:
    print('\nAVG: %0.2f' % df_train.loc[idx_train_pos, ['avg']].values[cosine_sim.argmax()][0])
    print('Cosine Distance: %0.3f' % cosine_sim[i])
    print('<COMMENT>\n' + df_train.loc[idx_train_pos, 'comment'].values[i] + '\n<COMMENT>')


AVG: -0.50
Cosine Distance: 0.570
<COMMENT>
 never touch my edits again you douche bag
<COMMENT>

AVG: -0.50
Cosine Distance: 0.451
<COMMENT>


Dont touch my edits you anti dentite.
<COMMENT>

AVG: -0.50
Cosine Distance: 0.444
<COMMENT>


Please touch my johnson
<COMMENT>


One thing that might be obvious at this point is that we are looking for hard matches across individual tokens. However, we really care about *semantic similarity*. A topic model like NMF may produce higher quality similarity comparisons. Let's transform our data into NMF space and examine the same case.

In [15]:
idx_train_neg = df_train["y"] == 0 
X_train_neg = pipe.named_steps["nmf"].transform(pipe.named_steps["vect"].transform(df_train.loc[idx_train_neg, "comment"]))
X_train_neg

array([[0.00894929, 0.        , 0.        , ..., 0.        , 0.        ,
        0.01186398],
       [0.00507612, 0.00100739, 0.        , ..., 0.01329248, 0.        ,
        0.        ],
       [0.00875505, 0.        , 0.        , ..., 0.        , 0.00279743,
        0.040765  ],
       ...,
       [0.00346846, 0.        , 0.        , ..., 0.        , 0.        ,
        0.00084756],
       [0.0149903 , 0.        , 0.        , ..., 0.00349148, 0.00444417,
        0.00116728],
       [0.00388735, 0.        , 0.        , ..., 0.00017346, 0.        ,
        0.00705176]])

In [16]:
X_target = pipe.named_steps["nmf"].transform(pipe.named_steps["vect"].transform([comment]))
X_target

array([[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.87493940e-04,
        1.03695513e-04, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        1.09769443e-02, 0.00000000e+00, 0.00000000e+00, 1.74427504e-04,
        1.80820986e-02, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 7.82862593e-05, 0.00000000e+00,
        5.10910738e-05, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 1.67402779e-02,
        7.25922862e-04, 0.00000000e+00, 0.00000000e+00, 3.61699484e-04,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
        0.00000000e+00, 2.35360097e-04, 0.00000000e+00, 0.000000

In [17]:
cosine_sim = cosine_similarity(X_train_neg, X_target).flatten()
for i in cosine_sim.argsort()[-3:][::-1]:
    print('\nAVG: %0.2f' % df_train.loc[idx_train_neg, ['avg']].values[cosine_sim.argmax()][0])
    print('Cosine Distance: %0.3f' % cosine_sim[i])
    print('<COMMENT>\n' + df_train.loc[idx_train_neg, 'comment'].values[i] + '\n<COMMENT>')


AVG: 0.00
Cosine Distance: 0.670
<COMMENT>




how does a piece of dust start sensing light tho? 
<COMMENT>

AVG: 0.00
Cosine Distance: 0.659
<COMMENT>
piece of information 
<COMMENT>

AVG: 0.00
Cosine Distance: 0.657
<COMMENT>
"Holy longest hatnote ever batman! –  
:::::::"
<COMMENT>


In [18]:
idx_train_pos = df_train["y"] == 1 
X_train_pos = pipe.named_steps["nmf"].transform(pipe.named_steps["vect"].transform(df_train.loc[idx_train_pos, "comment"]))
X_train_pos

array([[3.73369885e-03, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [1.88252173e-03, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 5.85668173e-05, 6.13222676e-04],
       [7.88262217e-03, 6.78617861e-04, 0.00000000e+00, ...,
        0.00000000e+00, 1.85360957e-03, 0.00000000e+00],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        2.18057329e-04, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [5.98596227e-03, 0.00000000e+00, 0.00000000e+00, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

In [19]:
cosine_sim = cosine_similarity(X_train_pos, X_target).flatten()
for i in cosine_sim.argsort()[-3:][::-1]:
    print('\nAVG: %0.2f' % df_train.loc[idx_train_pos, ['avg']].values[cosine_sim.argmax()][0])
    print('Cosine Distance: %0.3f' % cosine_sim[i])
    print('<COMMENT>\n' + df_train.loc[idx_train_pos, 'comment'].values[i] + '\n<COMMENT>')


AVG: -1.60
Cosine Distance: 0.978
<COMMENT>


Don't post on my profile page, you fucking cunt.    
<COMMENT>

AVG: -1.60
Cosine Distance: 0.945
<COMMENT>


==HEY ALAN!==
YOU FUCKING SUCK MY DICK, you gayboy. suck it hard and choke on it. don't ban me, please...
<COMMENT>

AVG: -1.60
Cosine Distance: 0.944
<COMMENT>


Suck my cock you stupid bitch!:)
<COMMENT>


Now let's talk about an alternative technique, since it may not always be feasible, legal, or advisable to provide counterfactual examples from your training set. Let's instead *create* a counterfactual example by dropping word occurances from the original text until the score flips across the threshold. Let's grab a new example.

In [20]:
comment = df_test.loc[13718]['comment']
print(comment)

"
::Block me. I really don't give a shit! If a source doesn't work for someone, it gets removed. K?    "


The steps here are relatively straightforward:

  1. Using the defined vectorizer, convert the comment to a raw count vector.
  2. Create a variation for each unique token in the raw count vector, such that each variant has a single token masked.
  3. Generate a confidence score for each variant.
  4. Identify the feature that moved the base score the furthest and mask it across all other variants.
  5. Repeat until confidence score crosses threshold.

We'll start by demonstrating a single iteration of the above process.

In [21]:
X_target = pipe.named_steps['vect'].transform([comment])
X_target

<1x81017 sparse matrix of type '<class 'numpy.float64'>'
	with 44 stored elements in Compressed Sparse Row format>

In [22]:
idx_nonzero = np.nonzero(X_target.toarray().flatten())[0] # identify all nonzero elements of the target vector
variants = np.repeat(X_target.toarray(), len(idx_nonzero), axis=0)
# for each variant, mask a single feature (token)
for i, j in enumerate(idx_nonzero):
    variants[i,j] = 0
variants.shape

(44, 81017)

Now we'll generate a prediction for each variant and identify which feature was most impactful. We also need a lookup dictionary for tokens.

In [23]:
tokens = np.array(pipe.named_steps['vect'].get_feature_names())

In [24]:
y_prob_var = pipe.named_steps['clf'].predict_proba(pipe.named_steps['nmf'].transform(variants))[:,1]
k = y_prob_var.argmin()
print('''Removing token "%s" changes toxic score from %0.1f%% to %0.1f%%''' % (tokens[idx_nonzero[k]], 100*y_prob_var[0], 100*y_prob_var[k]))

Removing token "shit" changes toxic score from 77.1% to 62.9%


Well, I think that makes sense. Let's write a helper function to automate this task.

In [25]:
vect = pipe.named_steps['vect']
nmf = pipe.named_steps['nmf']
clf = pipe.named_steps['clf']

In [26]:
def explain_prediction_cf(comment, tokens=tokens, vect=vect, nmf=nmf, clf=clf, max_tokens=100):
    X = vect.transform([comment])
    y_prob_base = clf.predict_proba(nmf.transform(X))[:,1][0]
    idx_nonzero = np.nonzero(X.toarray().flatten())[0]
    variants = np.repeat(X.toarray(), len(idx_nonzero), axis=0)
    for i,j in enumerate(idx_nonzero):
        variants[i,j] = 0
    log = [[None, None, y_prob_base]]
    max_steps = np.min([len(variants), max_tokens])
    for step in tqdm(range(max_steps)):
        y_prob_var = clf.predict_proba(nmf.transform(variants))[:,1]
        k = y_prob_var.argsort()[step]
#         print(k, tokens[idx_nonzero[k]], y_prob_var[k])
        variants[:,idx_nonzero[k]] = 0
        log.append([k, tokens[idx_nonzero[k]], y_prob_var[k]])
        if y_prob_var[k] < 0.5:
            break
    return log, y_prob_base

In [27]:
explain_prediction_cf(comment)

  0%|          | 0/44 [00:00<?, ?it/s]

([[None, None, 0.7694893640870398],
  [32, 'shit', 0.6290851245473555],
  [16, 'give a', 0.5482100796290866],
  [17, 'give a shit', 0.4901658935061039]],
 0.7694893640870398)

In [28]:
def get_cf(comment, log, y_prob_base):
    html = '<pre><h2>Explanation</h2>\n'
    html += 'Removing {'
    for row in log[1:]:
        html += '"%s", ' % row[1]
    html = html[:-2]
    html += '} from the text changes the toxicity score from %0.1f%% to %0.1f%%.' % (100*log[0][2], 100*log[-1][2])
    # Now let's add the original comment with highlighted text
    for row in log[1:]:
        token = row[1]
        comment = re.sub(r'\b%s\b' % token, '<span style="background-color: rgba(255, 0, 0, 0.2)">%s</span>' % token, comment, flags=re.IGNORECASE)  
    html += '<h2>Original</h2>\n%s</pre>' % comment
    return html

In [29]:
log, y_prob_base = explain_prediction_cf(comment)
HTML(get_cf(comment, log, y_prob_base))

  0%|          | 0/44 [00:00<?, ?it/s]