In [1]:
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import Ridge
from IPython.display import HTML
from joblib import load
from tqdm import tqdm
import sqlite3 as sql
import pandas as pd
import numpy as np
import re

seed = 101

Load the entire dataset.

In [2]:
with sql.connect('../data/toxic.db') as conn:
    df = pd.read_sql_query('''select * from toxic''', conn)
df.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,0.0,2.0,0.5,0
2,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,0.0,1.0,0.1,0
3,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,0.0,2.0,0.6,0
4,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,-1.0,1.0,0.2,0


Split into two separate dataframes: df_train and df_test.

In [3]:
df_train = df[df['split'] == 'train'].copy().reset_index(drop=True)
df_train.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y
0,2232.0,This:\n:One can make an analogy in mathematica...,2002,1,article,random,train,-1.0,1.0,0.4,0
1,4216.0,"""\n\n:Clarification for you (and Zundark's ri...",2002,1,user,random,train,0.0,2.0,0.5,0
2,26547.0,"""This is such a fun entry. Devotchka\n\nI on...",2002,1,article,random,train,0.0,2.0,0.6,0
3,37330.0,"""\n\n\nI fixed the link; I also removed """"home...",2002,1,article,random,train,-1.0,1.0,0.1,0
4,37346.0,"""If they are """"indisputable"""" then why does th...",2002,1,article,random,train,-1.0,1.0,0.2,0


In [4]:
df_test = df[df['split'] == 'test'].copy().reset_index(drop=True)
df_test.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y
0,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,0.0,1.0,0.1,0
1,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,-1.0,1.0,0.2,0
2,138074.0,"""\n\n\n\nI'm not sure if it's properly called ...",2002,1,article,random,test,0.0,1.0,0.5,0
3,200664.0,\n\n\n \nThanks on the info on how to move a p...,2002,1,user,random,test,0.0,1.0,0.4,0
4,213105.0,"""\n\n: I should do that too, I agree, but I've...",2002,1,user,random,test,0.0,1.0,0.3,0


Let's grab our "best" model from earlier.

In [5]:
def tokenizer(text):
    return re.findall(r'[a-z0-9]+', text.lower())

gs = load('../results/gs_cv_sgd.joblib')
pipe = gs.best_estimator_

Just to make things easier, let's compute probabilities for the entire test set.

In [6]:
df_test['y_prob'] = pipe.predict_proba(df_test['comment'])[:,1]
df_test.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y,y_prob
0,8953.0,Elected or Electoral? JHK,2002,0,article,random,test,0.0,1.0,0.1,0,0.244198
1,28959.0,Please relate the ozone hole to increases in c...,2002,1,article,random,test,-1.0,1.0,0.2,0,0.17284
2,138074.0,"""\n\n\n\nI'm not sure if it's properly called ...",2002,1,article,random,test,0.0,1.0,0.5,0,0.0037
3,200664.0,\n\n\n \nThanks on the info on how to move a p...,2002,1,user,random,test,0.0,1.0,0.4,0,0.008976
4,213105.0,"""\n\n: I should do that too, I agree, but I've...",2002,1,user,random,test,0.0,1.0,0.3,0,0.007818


# Global surrogates

Many NLP models use complicated neural network architectures that don't lend themselves well to interpretation. A global surrogate is an interpretable model (e.g., decision tree, logistic regression, k-nearest neighbors, etc.) that is trained on the output of the _true_ model. In effect, it tries to distill the complex model into a simpler one, which can have benefits for deployment as well. Our "best" model is already a linear model, so we can inspect its weights directly, which is a bit more direct than the process would normally be. Let's start by looking at the tokens that are most important for predicting each class.

In [7]:
tokens = pipe.named_steps['vect'].get_feature_names() # Note, NOT the same as vocabulary_
weights = pipe.named_steps['clf'].coef_[0]
df_model = pd.DataFrame({'token':tokens, 'weight':weights})
df_model.head()

Unnamed: 0,token,weight
0,0,-0.48531
1,0 0,0.039417
2,0 00,-8.8e-05
3,0 005,-0.007528
4,0 01,-0.002821


In [8]:
df_model.sort_values('weight', ascending=True).head(10)

Unnamed: 0,token,weight
463216,thanks,-1.575386
127997,cool you,-1.079754
463188,thank you,-1.015618
220356,hey hey,-0.978156
58988,are cool,-0.958368
463160,thank,-0.863174
189578,for your,-0.836592
130100,could you,-0.775153
393418,regards,-0.771566
228178,http en,-0.759324


In [9]:
df_model.sort_values('weight', ascending=False).head(10)

Unnamed: 0,token,weight
88865,block block,12.789651
315629,nipple nipple,11.052719
315628,nipple,10.774228
459260,teabag,9.681863
97787,buttsecks,7.755658
540505,wikipedia hi,5.545768
220792,hi wikipedia,5.39969
195137,fuck,4.016078
109999,chester,3.47463
500501,tommy2010,3.472154
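
To make the weight tables concrete, here is a minimal sketch of how such weights turn into a score. It assumes the classifier was trained with a logistic loss, so that the probability is a sigmoid of the intercept plus the weighted token counts. The three weights are copied from the tables above; the intercept value is a placeholder, not the fitted one.

```python
import math

# Weights copied from the tables above.
weights = {'thanks': -1.575386, 'thank you': -1.015618, 'fuck': 4.016078}
# Placeholder assumption: the fitted value would come from clf.intercept_.
intercept = 0.0

def toxic_probability(counts):
    """Sigmoid of (intercept + sum of weight * count), assuming a logistic model."""
    score = intercept + sum(weights.get(tok, 0.0) * n for tok, n in counts.items())
    return 1.0 / (1.0 + math.exp(-score))

p_polite = toxic_probability({'thanks': 1})  # negative weight pulls the score down
p_rude = toxic_probability({'fuck': 1})      # positive weight pushes the score up
print('%0.3f %0.3f' % (p_polite, p_rude))
```

Each token shifts the log-odds by its weight times its count, which is why sorting by weight is a reasonable first look at what the model has learned.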


# Counterfactual examples
A counterfactual explanation of a prediction describes the smallest change to the input instance that changes the prediction to a predefined output. In the context of this problem, that means the smallest change to a comment that flips the prediction from toxic to non-toxic or vice versa. Of course, defining what constitutes a _small_ change is particularly difficult. Here are a couple of basic strategies for generating those examples.

## Historical comparison 
We'll start by picking a positive (toxic) example from the test set and finding the _closest_ example from the training set that had a negative (non-toxic) outcome.

In [10]:
idx_test_pos = (df_test['y'] == 1) & (df_test['y_prob'] > 0.5)
df_test[idx_test_pos].sample(5)

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split,min,max,avg,y,y_prob
8319,132533129.0,\n\nI really do not see how i am acting immatu...,2007,1,user,blocked,test,-2.0,1.0,-0.4,1,0.987215
12401,204752555.0,"Oh, and there are no talk page giudelines, ya...",2008,1,user,blocked,test,-2.0,0.0,-0.9,1,0.988083
11483,189695289.0,\n\nNonsense. Calzaghe is Undisputed & Linear ...,2008,1,article,blocked,test,-2.0,-1.0,-1.4,1,0.999699
26730,546960141.0,\n\n== Fuck you ==\n\nFuck you you cheap whore...,2013,1,user,blocked,test,-2.0,-1.0,-1.8,1,0.999984
11074,183338685.0,"If I am sockpuppet so is she, same computer \n\n",2008,1,user,blocked,test,-1.0,1.0,-0.5,1,0.604105


In [11]:
comment = df_test.loc[13718, 'comment']
print(comment)

"
::Block me. I really don't give a shit! If a source doesn't work for someone, it gets removed. K?    "


Now let's vectorize this comment, vectorize all the negative training instances, and determine which one is the closest to the target comment. We'll measure closeness with [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) between the target comment and each negative training comment. The highest similarity (equivalently, the smallest cosine distance, 1 - cosine similarity) marks the closest comment.
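
As a reminder of what is being measured, here is a minimal pure-Python sketch of cosine similarity over n-gram counts. It reuses the notebook's tokenizer; the unigram-plus-bigram range is an assumption inferred from the token table, and `ngram_counts` and `cosine_sim` are illustrative helpers rather than part of the pipeline.

```python
import math
import re
from collections import Counter

def tokenizer(text):
    return re.findall(r'[a-z0-9]+', text.lower())

def ngram_counts(text):
    """Count unigrams and bigrams (the n-gram range is an assumption)."""
    toks = tokenizer(text)
    return Counter(toks) + Counter(' '.join(p) for p in zip(toks, toks[1:]))

def cosine_sim(a, b):
    """Dot product of the two count vectors divided by the product of their norms."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

target = ngram_counts("Block me. I really don't give a shit!")
candidates = ["And I don't give a damn.", "Elected or Electoral? JHK"]
sims = [cosine_sim(ngram_counts(c), target) for c in candidates]
closest = candidates[sims.index(max(sims))]
print(closest)
```

Only shared n-grams contribute to the dot product, so a comment with no tokens in common scores exactly zero.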

In [12]:
idx_train_neg = df_train['y'] == 0
X_train_neg = pipe.named_steps['vect'].transform(df_train.loc[idx_train_neg, 'comment'])

In [13]:
X = pipe.named_steps['vect'].transform([comment])

In [14]:
# cosine_dist = pairwise_distances(X_train_neg, X, metric='cosine').flatten()
cosine_sim = cosine_similarity(X_train_neg, X).flatten()
print('AVG: %0.2f' % df_train.loc[idx_train_neg, ['avg']].values[cosine_sim.argmax()][0])
print('Cosine Similarity: %0.3f' % cosine_sim.max())
print('<COMMENT>\n' + df_train.loc[idx_train_neg, 'comment'].values[cosine_sim.argmax()] + '\n<COMMENT>')

AVG: 0.60
Cosine Similarity: 0.344
<COMMENT>


::::::::::: Okay, found a source, don't know if it's good enough, don't care. It was worth a shot. I apologize for the personal attacks to you, but I'd appreciate it if you wouldn't make sarcastic comments or making fun of what I say. 
<COMMENT>


Interesting. Both are clearly talking about sources, but the second is apologetic and less confrontational. Let's look at the closest three comments.

In [15]:
cosine_sim = cosine_similarity(X_train_neg, X).flatten()
for i in cosine_sim.argsort()[-3:][::-1]:
    # index with i (not argmax) so each comment reports its own average rating
    print('\nAVG: %0.2f' % df_train.loc[idx_train_neg, 'avg'].values[i])
    print('Cosine Similarity: %0.3f' % cosine_sim[i])
    print('<COMMENT>\n' + df_train.loc[idx_train_neg, 'comment'].values[i] + '\n<COMMENT>')


AVG: 0.60
Cosine Distance: 0.344
<COMMENT>


::::::::::: Okay, found a source, don't know if it's good enough, don't care. It was worth a shot. I apologize for the personal attacks to you, but I'd appreciate it if you wouldn't make sarcastic comments or making fun of what I say. 
<COMMENT>

AVG: 0.60
Cosine Distance: 0.344
<COMMENT>
"

== looking for a song ==

I am looking for a song that I thought was by a Flock of Seagulls from obviously WAY BACK that goes ""Don't Change for you, Don't Change a thing for me. It was kind of punkish but made a point a it popped into my mind if anyone can help me. Thanks so much"
<COMMENT>

AVG: 0.60
Cosine Distance: 0.342
<COMMENT>
 Why do people make weird exception for the title Rabbi? Don't give a person a title that needs a degree that they don't have.
<COMMENT>


These examples don't look particularly relevant. Let's look at the three closest positive (toxic) examples.

In [16]:
idx_train_pos = df_train['y'] == 1
X_train_pos = pipe.named_steps['vect'].transform(df_train.loc[idx_train_pos, 'comment'])

In [17]:
cosine_sim = cosine_similarity(X_train_pos, X).flatten()
for i in cosine_sim.argsort()[-3:][::-1]:
    # index with i (not argmax) so each comment reports its own average rating
    print('\nAVG: %0.2f' % df_train.loc[idx_train_pos, 'avg'].values[i])
    print('Cosine Similarity: %0.3f' % cosine_sim[i])
    print('<COMMENT>\n' + df_train.loc[idx_train_pos, 'comment'].values[i] + '\n<COMMENT>')


AVG: -1.40
Cosine Distance: 0.432
<COMMENT>


I don't give a flying fuck. block me I Don't care. kiss my goddamn ass. DUSTIN Motherfucking D
<COMMENT>

AVG: -1.40
Cosine Distance: 0.418
<COMMENT>
And I don't give a damn.  


<COMMENT>

AVG: -1.40
Cosine Distance: 0.409
<COMMENT>


<COMMENT>


## Greedy counterfactual example
Now, instead of mining our training data for counterfactual examples, let's attempt to _create_ one by dropping word occurrences from the original text until the score changes. Let's review the previous example.

In [18]:
print(comment)

"
::Block me. I really don't give a shit! If a source doesn't work for someone, it gets removed. K?    "


The steps here are relatively straightforward:
1. Using the defined vectorizer, convert the comment to a raw count vector.
2. Create a variation for each unique token in the raw count vector, such that each variant has a single token masked.
3. Generate a confidence score for each variant.
4. Identify the feature that moved the base score the furthest and mask it across all other variants.
5. Repeat until confidence score crosses threshold.
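
The steps above can be sketched on a toy model before applying them to the real pipeline. The token weights, intercept, and helper names below are hypothetical and chosen only for illustration; the loop itself follows the five steps.

```python
import math

# Hypothetical weights and intercept for a toy logistic model.
weights = {'shit': 3.0, 'block': 1.5, 'me': 0.5, 'source': -0.5}
intercept = -1.0

def predict_proba(counts):
    score = intercept + sum(weights.get(t, 0.0) * n for t, n in counts.items())
    return 1.0 / (1.0 + math.exp(-score))

def greedy_counterfactual(counts, threshold=0.5):
    current = dict(counts)
    log = [(None, predict_proba(current))]
    while current and log[-1][1] >= threshold:
        # Score every variant that masks one of the remaining tokens.
        scored = []
        for tok in current:
            variant = {t: n for t, n in current.items() if t != tok}
            scored.append((predict_proba(variant), tok))
        # Keep the removal that lowers the score the most, then repeat.
        prob, tok = min(scored)
        del current[tok]
        log.append((tok, prob))
    return log

log = greedy_counterfactual({'block': 1, 'me': 1, 'shit': 1, 'source': 1})
for tok, prob in log:
    print(tok, '%0.3f' % prob)
```

In this toy case the loop stops after removing two tokens, because the second removal drops the score below the 0.5 threshold.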

We'll start by demonstrating a single iteration of the above process.

In [19]:
idx_nonzero = np.nonzero(X.toarray().flatten())[0] # identify all nonzero elements of the target vector
variants = np.repeat(X.toarray(), len(idx_nonzero), axis=0)
# for each variant, mask a single feature (token)
for i, j in enumerate(idx_nonzero):
    variants[i,j] = 0
variants.shape

(38, 560571)

Now we'll generate a prediction for each variant and identify which feature was most impactful.

In [20]:
y_prob_base = pipe.named_steps['clf'].predict_proba(X)[0,1] # score of the unmodified comment
y_prob_var = pipe.named_steps['clf'].predict_proba(variants)[:,1]
k = y_prob_var.argmin()
print('''Removing token "%s" changes toxic score from %0.1f%% to %0.1f%%''' % (tokens[idx_nonzero[k]], 100*y_prob_base, 100*y_prob_var[k]))

Removing token "shit" changes toxic score from 97.8% to 67.4%


Fun stuff. Now let's create a function and repeat the process.

In [21]:
def explain_prediction_cf(comment, tokens, pipe, max_tokens=100):
    clf = pipe.named_steps['clf']
    X = pipe.named_steps['vect'].transform([comment])
    y_prob_base = clf.predict_proba(X)[:,1][0]
    idx_nonzero = np.nonzero(X.toarray().flatten())[0]
    # one variant per nonzero feature, each with that single feature masked
    variants = np.repeat(X.toarray(), len(idx_nonzero), axis=0)
    for i,j in enumerate(idx_nonzero):
        variants[i,j] = 0
    log = [[None, None, y_prob_base]]
    removed = []
    # at most one removal per token present in the comment
    for step in tqdm(range(min(max_tokens, len(idx_nonzero)))):
        y_prob_var = clf.predict_proba(variants)[:,1]
        # greedy choice: lowest score among tokens that have not been removed yet
        y_prob_var[removed] = np.inf
        k = y_prob_var.argmin()
        removed.append(k)
        # mask the chosen token across all other variants
        variants[:,idx_nonzero[k]] = 0
        log.append([k, tokens[idx_nonzero[k]], y_prob_var[k]])
        if y_prob_var[k] < 0.5:
            break
    return log, y_prob_base

In [22]:
log, y_prob_base = explain_prediction_cf(comment, tokens, pipe)
log

2it [00:00, 10.59it/s]


[[None, None, 0.9780473453499616],
 [27, 'shit', 0.6740743903054798],
 [22, 'me', 0.5742807959783442],
 [14, 'give a', 0.4841897354143122]]

Let's format this for ease of consumption.

In [23]:
def get_cf(comment, tokens, pipe):
    log, y_prob_base = explain_prediction_cf(comment, tokens, pipe)
    html = '<pre><h2>Explanation</h2>\n'
    html += 'Removing {'
    html += ', '.join('"%s"' % row[1] for row in log[1:])
    html += '} from the text changes the toxicity score from %0.1f%% to %0.1f%%.' % (100*log[0][2], 100*log[-1][2])
    # Now let's add the original comment with highlighted text
    for row in log[1:]:
        # n-gram tokens are space-joined, but in the text the words may be separated by any non-token characters
        pattern = r'\b' + r'[^a-z0-9]+'.join(re.escape(t) for t in row[1].split()) + r'\b'
        # \g<0> re-inserts the matched text, so the original casing is preserved
        comment = re.sub(pattern, r'<span style="background-color: rgba(255, 0, 0, 0.2)">\g<0></span>', comment, flags=re.IGNORECASE)
    html += '<h2>Original</h2>\n%s</pre>' % comment
    return html

In [24]:
HTML(get_cf(comment, tokens, pipe))

2it [00:00, 10.55it/s]


In [25]:
comment_2 = df_test.loc[17621, 'comment']
print(comment_2)

, 8 August 2009 (UTC)


I just saw Xeno's edit comment - A Phone Call????? That's a reliable source that verified it for you???? You're a lousy editor, biased, obstructionist and fixed on defending your article. Fact is the claim was unsupported and unverified and should not be in here until such time as a Reliable Source was produced. You've been completely unable and incapable of providing a source yet more than happy to keep your edit by any means possible. Absolute garbage - and I detest your unsupported allegations that I'm a fucking teabagger, Republican or one-subject editor -ESPECIALLY since I provided supporting links to my NPOV editing. This is how you support an edit: You give it a proper name: Suncoast Regional Emmy Award /You give it a year: 2000. /You give it a title: A Grave Injustice /You give it a channel: WDSU, New Orleans - AND YOU PROVIDE A RELIABLE SOURCE: And you do it without bias according to supporting references. Your a biased hack, your attacks, ignorance, in

In [26]:
HTML(get_cf(comment_2, tokens, pipe))

7it [00:04,  1.50it/s]


# Local Surrogate Model

Local surrogate models are interpretable models (e.g., Logistic Regression, Decision Tree, etc.) that are used to explain individual predictions of black box machine learning models. The steps for computing a local surrogate model are as follows:

1. Generate variants by randomly masking (blanking) features found in the base instance.
2. Compute distance between base instance and each variant.
3. Compute scores for each variant.
4. Train an interpretable model on the variants and their scores, using each variant's similarity to the base instance (i.e., inverse distance) as the sample weight.
5. Interpret the resulting model.
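
The steps above can be sketched end to end in plain Python. Everything here is a stand-in: the four-token "black box" and its weights are hypothetical, and the small closed-form solver replaces a library ridge regression so the example stays self-contained. The data is centred on weighted means to account for an intercept.

```python
import math
import random

# Hypothetical black box: a logistic model over four tokens.
tokens = ['shit', 'block', 'source', 'thanks']
bb_weights = [4.0, 0.5, 0.0, -2.0]

def black_box(x):
    return 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(bb_weights, x))))

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    return dot / norm if norm else 0.0

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def local_surrogate(base, num_variants=200, dropout=0.2, alpha=1.0, seed=101):
    rng = random.Random(seed)
    n = len(base)
    # 1. Variants: randomly mask features of the base instance.
    X = [[v if rng.random() >= dropout else 0 for v in base] for _ in range(num_variants)]
    # 2. Similarity to the base instance, used as the sample weight.
    w = [cosine(x, base) for x in X]
    # 3. Black-box score for each variant.
    y = [black_box(x) for x in X]
    # 4. Weighted ridge regression on weighted-mean-centred data.
    sw = sum(w)
    x_mean = [sum(wi * x[j] for wi, x in zip(w, X)) / sw for j in range(n)]
    y_mean = sum(wi * yi for wi, yi in zip(w, y)) / sw
    Xc = [[x[j] - x_mean[j] for j in range(n)] for x in X]
    yc = [yi - y_mean for yi in y]
    A = [[sum(wi * xc[j] * xc[k] for wi, xc in zip(w, Xc)) + (alpha if j == k else 0.0)
          for k in range(n)] for j in range(n)]
    b = [sum(wi * xc[j] * yi for wi, xc, yi in zip(w, Xc, yc)) for j in range(n)]
    return dict(zip(tokens, solve(A, b)))

# 5. Interpret: the largest coefficient marks the most influential token.
coef = local_surrogate([1, 1, 1, 1])
top_token = max(coef, key=coef.get)
print(top_token)
```

The coefficients are on the probability scale, so each one approximates how much the score moves when that token is present rather than masked.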

In [27]:
print(comment)

"
::Block me. I really don't give a shit! If a source doesn't work for someone, it gets removed. K?    "


Let's start by vectorizing the comment (base instance) and randomly masking features from it to create variants.

In [28]:
X = pipe.named_steps['vect'].transform([comment]).toarray()
idx_nonzero = np.nonzero(X.flatten())[0]
idx_nonzero

array([ 13683,  18982,  19217,  88832,  88962, 152681, 152682, 153245,
       153256, 186689, 189069, 201036, 201126, 201907, 201909, 229569,
       230727, 232375, 232385, 254303, 255014, 262939, 290807, 291167,
       388853, 389014, 396308, 424082, 424153, 435431, 435600, 437394,
       437507, 454940, 455496, 456398, 547563, 547652])

These are the nonzero elements in the feature vector. Now we can generate a binary mask to create the variants. Let's create 100 variants with a 20% dropout rate.

In [29]:
mask = np.random.choice([0,1], size=(100, idx_nonzero.shape[0]), p=[0.2,0.8])
mask[:2]

array([[1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
        0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0]])

In [30]:
X_var = np.repeat(X, repeats=100, axis=0)
X_var[:, idx_nonzero] = mask*X_var[:, idx_nonzero]
X_var[:2][:,idx_nonzero]

array([[2, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1,
        0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 2, 1, 1, 1, 1],
       [2, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1, 1, 0, 0]])

Note the differences between variants. Now we compute the cosine similarity between the base instance and each variant.

In [31]:
sim = cosine_similarity(X_var, X).flatten()
sim[:5]

array([0.87904907, 0.90453403, 0.95346259, 0.8660254 , 0.90453403])
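
These values can be checked by hand. Masking only zeroes out entries, so the dot product between a variant and the base equals the variant's squared norm, and the cosine similarity reduces to the ratio of the two norms. The sketch below reads the counts off the outputs above: 36 features with count 1 and two with count 2. The first mask row drops 10 count-1 features and the second drops 8.

```python
import math

# Squared norm of the base vector: 36 features with count 1, two with count 2.
base_sq_norm = 36 * 1**2 + 2 * 2**2

# Number of masked (count-1) features in the first two mask rows shown above.
masked_first = 10
masked_second = 8

# cos(variant, base) = ||variant|| / ||base|| when the variant is a masked copy of the base.
sim_first = math.sqrt((base_sq_norm - masked_first) / base_sq_norm)
sim_second = math.sqrt((base_sq_norm - masked_second) / base_sq_norm)
print('%0.6f %0.6f' % (sim_first, sim_second))
```

The two results agree with the first two entries of `sim`, which also shows that the weight of a variant depends only on how much of the base instance survives the mask.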

Next we'll compute the confidence score for each variant using the original model (black box).

In [32]:
y_prob_var = pipe.named_steps['clf'].predict_proba(X_var)[:,1]
y_prob_var[:5]

array([0.58270181, 0.97692936, 0.97699281, 0.98281699, 0.98003753])

Now to train the local surrogate. We'll use a ridge regression model (linear regression with an L2 penalty) and weight each sample by its similarity to the base instance. Remember, we want the local surrogate to replicate the behavior of the black box model in the region of the base instance.

In [33]:
local_model = Ridge(random_state=seed)
local_model.fit(X_var, y_prob_var, sample_weight=sim)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=101, solver='auto', tol=0.001)

We now have a local surrogate model that we can examine to better understand factors *important to this prediction*. Let's start by looking at the features associated with a negative result (non-toxic).

In [34]:
df_local = pd.DataFrame({'token':tokens, 'coef':local_model.coef_.flatten()})
df_local.sort_values('coef', ascending=True).head()

Unnamed: 0,token,coef
547563,work,-0.034859
437507,source doesn,-0.020377
254303,it,-0.015024
232385,if a,-0.014162
291167,me i,-0.013288


We can just flip the order to view the features associated with a positive result (toxic).

In [35]:
df_local.sort_values('coef', ascending=False).head()

Unnamed: 0,token,coef
424082,shit,0.329376
88832,block,0.053366
152682,doesn t,0.031191
290807,me,0.030523
255014,it gets,0.024395


Let's just clean these steps up and create a function.

In [36]:
def get_local_surrogate(comment, tokens, pipe, num_variants=100, dropout=0.2):
    X = pipe.named_steps['vect'].transform([comment]).toarray()
    idx_nonzero = np.nonzero(X.flatten())[0]
    mask = np.random.choice([0,1], size=(num_variants, idx_nonzero.shape[0]), p=[dropout,1-dropout])
    X_var = np.repeat(X, repeats=num_variants, axis=0)
    X_var[:, idx_nonzero] = mask*X_var[:, idx_nonzero]
    sim = cosine_similarity(X_var, X).flatten()
    y_prob_var = pipe.named_steps['clf'].predict_proba(X_var)[:,1]
    local_model = Ridge(random_state=seed)
    local_model.fit(X_var, y_prob_var, sample_weight=sim)
    df_local = pd.DataFrame({'token':tokens, 'coef':local_model.coef_.flatten()})
    return local_model, df_local

In [37]:
print(comment_2)

, 8 August 2009 (UTC)


I just saw Xeno's edit comment - A Phone Call????? That's a reliable source that verified it for you???? You're a lousy editor, biased, obstructionist and fixed on defending your article. Fact is the claim was unsupported and unverified and should not be in here until such time as a Reliable Source was produced. You've been completely unable and incapable of providing a source yet more than happy to keep your edit by any means possible. Absolute garbage - and I detest your unsupported allegations that I'm a fucking teabagger, Republican or one-subject editor -ESPECIALLY since I provided supporting links to my NPOV editing. This is how you support an edit: You give it a proper name: Suncoast Regional Emmy Award /You give it a year: 2000. /You give it a title: A Grave Injustice /You give it a channel: WDSU, New Orleans - AND YOU PROVIDE A RELIABLE SOURCE: And you do it without bias according to supporting references. Your a biased hack, your attacks, ignorance, in

In [38]:
local_model, df_local = get_local_surrogate(comment_2, tokens, pipe, 100)
df_local.sort_values('coef', ascending=False).head(5)

Unnamed: 0,token,coef
195335,fucking,0.003043
263612,keep,0.002996
452417,support an,0.00292
163150,emmy,0.002842
118364,comment a,0.002798


Note that this method is very sensitive to the random sampling of masks, so the top-ranked tokens can change from run to run. The longer your text, the more variants are required to effectively sample the feature space.
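
One way to gauge how many variants a text needs is to count its distinct n-gram features: the surrogate fits one coefficient per feature, and with fewer variants than features the fit is underdetermined and leans heavily on the particular masks drawn. The sketch below counts candidate features for the short comment, assuming unigrams and bigrams; `distinct_ngrams` is an illustrative helper.

```python
import re

def tokenizer(text):
    return re.findall(r'[a-z0-9]+', text.lower())

def distinct_ngrams(text):
    """Distinct unigrams and bigrams (the n-gram range is an assumption)."""
    toks = tokenizer(text)
    return set(toks) | {' '.join(p) for p in zip(toks, toks[1:])}

short = "::Block me. I really don't give a shit! If a source doesn't work for someone, it gets removed. K?"
n_features = len(distinct_ngrams(short))
print(n_features)
```

The count is an upper bound, since the fitted vectorizer only keeps n-grams in its vocabulary (38 for this comment). A much longer comment can have several times as many features, so 100 variants may not be enough to constrain every coefficient.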

In [39]:
local_model, df_local = get_local_surrogate(comment_2, tokens, pipe, 200)
df_local.sort_values('coef', ascending=False).head(5)

Unnamed: 0,token,coef
195335,fucking,0.034493
559236,your unsupported,0.022984
362502,phone call,0.02196
552883,year,0.021199
488794,this is,0.019682


In [40]:
local_model, df_local = get_local_surrogate(comment_2, tokens, pipe, 500)
df_local.sort_values('coef', ascending=False).head(5)

Unnamed: 0,token,coef
276688,links,0.037561
195335,fucking,0.028824
80768,been completely,0.027754
72689,away from,0.025457
463070,than to,0.021284
