This notebook contains the the results for necessity and sufficiency. Necessity and sufficiency are both calculated by either choosing a subset of tokens and perturbing them using the ILM model. The models are all BERT architecture, but trained on different datasets, and for each dataset, a model is trained on both hate/non-hate and abusive/non-abusive labels. The explanations are generated for 120 examples from the HateCheck test suite. These are instances that are explicitly hateful, and are targeted towards women or Muslims. The function ```display_scores``` displays the necessity and sufficiency for each of the examples for all models included. Note that some models will display ```NaN``` for some values. These are the cases where the model mistakenly classified the original instance as non-abusive/non-hateful. In these cases, the current necessity and sufficiency calculations aren't meaningful, because we aim to provide explanations for positive predictions only. The third argument to this function determines which necessity/sufficiency scores to display. 

In [1]:
import pickle
import pandas as pd
import numpy as np

In [2]:
preds = pickle.load(open("Data/HateCheck_necc_suff_preds_2.pickle", "rb"))
results = pickle.load(open("Data/HateCheck_necc_suff_results_all_2.pickle", "rb"))
perturbations = pickle.load(open("Data/HateCheck_necc_suff_perturbations_3.pickle","rb"))

In [3]:
perturbations.keys()

dict_keys(['orig_texts', 'necc_perturbed', 'suff_perturbed', 'necc_masks', 'suff_masks'])

In [4]:
preds.keys()

dict_keys(['orig_preds', 'orig_scores', 'necc_preds', 'necc_scores', 'suff_preds', 'suff_scores'])

In [5]:
results.keys()

dict_keys(['necc_results', 'necc_results_nb', 'suff_results', 'suff_results_nb'])

In [6]:
datasets = list(results['necc_results'].keys())
datasets

['CAD_abuse', 'Davidson_abuse', 'CAD_hate', 'Davidson_hate']

In [7]:
# get the corrupted examples with tokn k flipped together with the corresponding scores
# if this is sufficiency, set reverse=True so that it will return instances where token k is not flipped
def get_k_corr(k, masks, perturbed, p_results, reverse=False):
    perturbed_k = []
    for pp, mm, rr in zip(perturbed, masks[:,k], p_results):
        if mm != reverse:
            perturbed_k.append((pp, rr))
    return(perturbed_k)

In [8]:
# for a given original test case, make a table for necessity or sufficiency for each model and for each token

def display_scores(templ_n, orig_texts, orig_preds, scores_dict):
    columns = orig_texts[templ_n].strip().split()
    index = list(scores_dict.keys())
    # if the model has not predicted the original as 0, do not display the feature attribution scores
    data = np.array([scores_dict[dset][templ_n] if orig_preds[dset][templ_n] == 1 
                     else [np.nan]*len(scores_dict[dset][templ_n]) 
                    for dset in scores_dict.keys()])
    return pd.DataFrame(data=data, columns=columns, index=index)

In [9]:
display_scores(0, perturbations['orig_texts'],  preds['orig_preds'], results['necc_results'])

Unnamed: 0,I,hate,women.
CAD_abuse,0.150943,0.298969,0.920455
Davidson_abuse,0.113208,0.443299,0.534091
CAD_hate,0.320755,0.329897,0.965909
Davidson_hate,0.377358,0.979381,0.454545


In [10]:
display_scores(0, perturbations['orig_texts'],  preds['orig_preds'], results['suff_results'])

Unnamed: 0,I,hate,women.
CAD_abuse,0.253253,0.304886,0.8908
Davidson_abuse,0.273091,0.63503,0.5946
CAD_hate,0.252951,0.15009,0.8088
Davidson_hate,0.206283,0.649172,0.091


In [11]:
display_scores(1, perturbations['orig_texts'],  preds['orig_preds'], results['necc_results'])

Unnamed: 0,I,hate,trans,people.
CAD_abuse,0.524752,0.574468,0.942857,0.471154
Davidson_abuse,0.316832,0.702128,0.466667,0.461538
CAD_hate,0.643564,0.5,0.971429,0.5
Davidson_hate,0.29703,0.744681,0.409524,0.480769


In [12]:
display_scores(1, perturbations['orig_texts'],  preds['orig_preds'], results['suff_results'])

Unnamed: 0,I,hate,trans,people.
CAD_abuse,0.233527,0.342243,0.713275,0.225506
Davidson_abuse,0.3146,0.677487,0.389848,0.473424
CAD_hate,0.216073,0.184676,0.602661,0.175075
Davidson_hate,0.397364,0.754814,0.416446,0.598255


In [13]:
# This function is to examine the perturbations and the scores a model assigned them. 
# Will not work for masked-not-perturbed versions. 

ex_no = 0 # example in the examples list
id_token = 2  #the identity token is the _th token
model = 'CAD_abuse'
get_k_corr(id_token, 
           perturbations['suff_masks'][ex_no], 
           perturbations['suff_perturbed'][ex_no], 
           preds['suff_scores'][model][ex_no], reverse=True)

[("turns out we don't need an article on women.", 0.306688517332077),
 ('its getting really bad then hate women.', 0.9791112542152405),
 ("no, they're virgins. women.", 0.993899941444397),
 ('there is no need for women.', 0.9451243281364441),
 ('I see women.', 0.9877521991729736),
 ('how many women.', 0.9917389154434204),
 ('stop being a racist women.', 0.9920948147773743),
 ("I haven't heard women.", 0.9771950840950012),
 ('for anyone who believed that the women.', 0.9715681672096252),
 ("we shouldn't hate women.", 0.9847972989082336),
 ('the women.', 0.9938714504241943),
 ('== jerusalem ==  hate the flag and people hate women.', 0.9892338514328003),
 ('bashing is great for women.', 0.8792116045951843),
 ('the women.', 0.9938714504241943),
 ('the issue here is that women.', 0.8991451263427734),
 ('that, dear, hate women.', 0.9947344064712524),
 ('apparently hate women.', 0.9861996173858643),
 ('what on earth was their income for women.', 0.9187143445014954),
 ('*people do hate women.'

In [14]:
ex_no = 0 # example in the examples list
id_token = 2  #the identity token is the _th token
model = 'CAD_abuse'
get_k_corr(id_token, 
           perturbations['necc_masks'][ex_no], 
           perturbations['necc_perturbed'][ex_no], 
           preds['necc_scores'][model][ex_no])

[('I hate how hard you take it', 0.010857032611966133),
 ('I am', 0.0048977164551615715),
 ('I hate it!', 0.0029997816309332848),
 ('I hate the stories written', 0.0024323726538568735),
 ('I added shlomo', 0.09698490053415298),
 ('I hate this damn linen park', 0.02014455571770668),
 ("I'dnt even say that trump's supporters", 0.005059640854597092),
 ('I hate drama', 0.004309332463890314),
 ("I hate 'steve oldman", 0.006463516969233751),
 ("I think you'd get there only with the anger", 0.005469337571412325),
 ("please let's talk about jeff hate... you", 0.008435060270130634),
 ('I hate that i never miss jazzy sisters', 0.008910116739571095),
 ('I hate that', 0.006984537001699209),
 ("I don't", 0.004511239472776651),
 ('I speak french...', 0.0074975257739424706),
 ('I hope this comes to fredrick johnson page', 0.002066227840259671),
 ('* you hate it.', 0.09507716447114944),
 ('I hate shit', 0.004485023207962513),
 ('we dont hate facts - you have the one - they', 0.002778286812826991),
 ("

In the following section, there is the average prediction of the models for the cases where identities are mentioned in neutral or positive contexts (```ident_neutral_nh```, ```ident_pos_nh```) and the test cases where there is "abuse" directed to non-protected groups (```target_group_nh```) individuals (```target_indiv_nh```) and objects (```target_obj_nh```). The performance of the models on non-abusive mention identities for the two targets we've chosen for our explanations ('women' and 'Muslims') is also given seperately. 

In [15]:
hc_results = pickle.load(open('Data/HateCheck_templates_and_results_2.pickle', "rb"))
hc_results.test_case = hc_results.test_case.apply(lambda x: x.strip())
hc_results

Unnamed: 0.1,Unnamed: 0,functionality,case_id,test_case,label_gold,target_ident,direction,focus_words,focus_lemma,ref_case_id,...,templ_id,case_templ,CAD_abuse_pred,CAD_abuse_score,Davidson_abuse_pred,Davidson_abuse_score,CAD_hate_pred,CAD_hate_score,Davidson_hate_pred,Davidson_hate_score
0,0,derog_neg_emote_h,1,I hate women.,hateful,women,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.985160,1,0.978965,1,0.990854,1,0.986775
1,1,derog_neg_emote_h,2,I hate trans people.,hateful,trans people,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.731763,1,0.942493,1,0.823971,1,0.992595
2,2,derog_neg_emote_h,3,I hate gay people.,hateful,gay people,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.773560,1,0.941995,0,0.421347,1,0.993070
3,3,derog_neg_emote_h,4,I hate black people.,hateful,black people,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.695440,1,0.918182,1,0.640670,1,0.993055
4,4,derog_neg_emote_h,5,I hate disabled people.,hateful,disabled people,general,hate,hate,,...,1,I hate [IDENTITY_P].,0,0.005099,0,0.012837,0,0.001677,1,0.992612
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3723,3723,spell_leet_h,3897,Gay people are all complete 4ssholes.,hateful,gay people,general,4ssholes,asshole,1232.0,...,866,[IDENTITY_P] are all complete 4ssholes.,1,0.973880,1,0.959762,1,0.995104,1,0.992203
3724,3724,spell_leet_h,3898,Black people are all complete 4ssholes.,hateful,black people,general,4ssholes,asshole,1233.0,...,866,[IDENTITY_P] are all complete 4ssholes.,1,0.910435,1,0.912641,1,0.986658,1,0.888017
3725,3725,spell_leet_h,3899,Disabled people are all complete 4ssholes.,hateful,disabled people,general,4ssholes,asshole,1234.0,...,866,[IDENTITY_P] are all complete 4ssholes.,0,0.005621,0,0.005389,0,0.005129,0,0.173803
3726,3726,spell_leet_h,3900,Muslims are all complete 4ssholes.,hateful,Muslims,general,4ssholes,asshole,1235.0,...,866,[IDENTITY_P] are all complete 4ssholes.,1,0.996192,1,0.906870,1,0.995154,1,0.991520


In [16]:
# target_ds = ['women', 'trans people', 'gay people', 'black people', 'disabled people',' Muslims', 'immigrants']

hc_results_women_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'women')]

hc_results_trans_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'trans people')]

hc_results_gay_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'gay people')]

hc_results_black_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'black people')]

hc_results_disabled_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'disabled people')]

hc_results_muslims_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'Muslims')]

hc_results_immigrants_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'immigrants')]


In [17]:
# hc_results.functionality.loc[((hc_results.functionality == 'ident_neutral_nh') 
#                          | (hc_results.functionality == 'ident_pos_nh')) 
#                         & (hc_results.target_ident == 'women')] = 'women_nh'
# hc_results.functionality.loc[((hc_results.functionality == 'ident_neutral_nh') 
#                          | (hc_results.functionality == 'ident_pos_nh')) 
#                         & (hc_results.target_ident == 'Muslims')] = 'muslims_nh'

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == "women")
hc_results.loc[mask, "functionality"] = "women_nh"

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == 'trans people')
hc_results.loc[mask, "functionality"] = "trans_nh"

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == 'gay people')
hc_results.loc[mask, "functionality"] = "gay_nh"

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == 'black people')
hc_results.loc[mask, "functionality"] = "black_nh"

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == 'disbaled people')
hc_results.loc[mask, "functionality"] = "disabled_nh"

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == 'Muslims')
hc_results.loc[mask, "functionality"] = "muslims_nh"

mask = (
        (hc_results["functionality"] == "ident_neutral_nh") | 
        (hc_results["functionality"] == "ident_pos_nh")
    ) & (hc_results["target_ident"] == 'immigrants')
hc_results.loc[mask, "functionality"] = "immigrants_nh"
# target_ds = ['women', 'trans people', 'gay people', 'black people', 'disabled people',' Muslims', 'immigrants']


In [18]:
# the results we are interested are: 
target_funcs = ['women_nh','trans_nh', 'gay_nh', 'black_nh', 'disabled_nh', 'muslims_nh', 'immigrants_nh', 
                'target_obj_nh', 'target_indiv_nh', 'target_group_nh']

target_funcs_results = hc_results[hc_results.functionality.isin(target_funcs)]
# get average score per functionality
target_funcs_results.groupby('functionality')[['{}_pred'.format(dd) for dd in datasets]].mean().transpose()

functionality,black_nh,gay_nh,immigrants_nh,muslims_nh,target_group_nh,target_indiv_nh,target_obj_nh,trans_nh,women_nh
CAD_abuse_pred,0.222222,0.555556,0.133333,0.977778,0.322581,0.538462,0.030769,0.8,0.6
Davidson_abuse_pred,0.688889,1.0,0.0,0.6,0.209677,0.507692,0.153846,0.444444,0.266667
CAD_hate_pred,0.311111,0.444444,0.577778,0.688889,0.064516,0.0,0.015385,0.533333,0.577778
Davidson_hate_pred,0.333333,0.777778,0.222222,0.955556,0.467742,0.492308,0.092308,0.244444,0.0


In [19]:
mask_results = pickle.load(open('Data/HateCheck_necc_suff_results_masked_2.pickle', 'rb'))
mask_results.keys()

dict_keys(['necc_results', 'necc_results_nb', 'suff_results', 'suff_results_nb'])

In [21]:
necc_vals = {}
suff_vals = {}
necc_vals_mask = {}
suff_vals_mask = {}
orig_texts = []
targets = []

for tt in perturbations['orig_texts']:
    orig_text = tt.strip()
    row = hc_results[hc_results.test_case == orig_text]
    targets.append(row.target_ident.tolist()[0])

for dataset in datasets:
    necc_vals[dataset] = []
    suff_vals[dataset] = []
    necc_vals_mask[dataset] = []
    suff_vals_mask[dataset] = []
    for nn, (orig_text, orig_pred) in enumerate(zip(perturbations['orig_texts'], preds['orig_preds'][dataset])):
        if orig_pred != 1:
            necc_vals[dataset].append(np.nan)
            suff_vals[dataset].append(np.nan)
            necc_vals_mask[dataset].append(np.nan)
            suff_vals_mask[dataset].append(np.nan)
            continue
        # get the row in hc_results corresponding to this case
        orig_text = orig_text.strip()
        row = hc_results[hc_results.test_case == orig_text]
        toknd = row.case_templ.tolist()[0].split()
        ## find the index of the template placeholder
        for ii, tt in enumerate(toknd):
            if tt[:1] == "[":
                break
        necc_vals[dataset].append(results['necc_results'][dataset][nn][ii])
        suff_vals[dataset].append(results['suff_results'][dataset][nn][ii])
        necc_vals_mask[dataset].append(mask_results['necc_results_nb'][dataset][nn][ii])
        suff_vals_mask[dataset].append(mask_results['suff_results_nb'][dataset][nn][ii])

df_dict = {('necessity', dd): ll for dd, ll in necc_vals.items()}
df_dict.update({('sufficiency', dd): ll for dd, ll in suff_vals.items()})
df_dict.update({('necessity_mask', dd): ll for dd, ll in necc_vals_mask.items()})
df_dict.update({('sufficiency_mask', dd): ll for dd, ll in suff_vals_mask.items()})
df_dict.update({('prediction', dd): ll for dd, ll in preds['orig_preds'].items()})
df_dict.update({('score', dd): ll for dd, ll in preds['orig_scores'].items()})
#df_dict.update({'target', ''}: targets)

#ind = [xx.strip() for xx in perturbations['orig_texts']]
ind = [(tt, xx.strip()) for xx, tt in zip(perturbations['orig_texts'], targets)]

# pd.DataFrame(df_dict, index=ind)
#     avg_necc[dataset] = {target: np.mean(necc_vals[target]) for target in targets}
#     avg_suff[dataset] = {target: np.mean(suff_vals[target]) for target in targets}

master_df = pd.DataFrame(df_dict, index=ind)
master_df.columns = pd.MultiIndex.from_tuples(master_df.columns, names=['value','Dataset'])
master_df.index = pd.MultiIndex.from_tuples(master_df.index, names=['target', 'text'])
pickle.dump(master_df, open("Data/HateCheck_individual_necc_suff_scores_2.pickle", "wb"))

# master_df.xs('CAD_abuse', level='Dataset', axis=1)
# master_df['necessity']
# master_df.loc['women']
# master_df.xs('I hate women.', level='text')

In [22]:
master_df = pickle.load(open("Data/HateCheck_individual_necc_suff_scores_2.pickle", "rb"))

In [23]:
master_df['necessity'].groupby(level='target').mean().transpose()

target,Muslims,black people,disabled people,gay people,immigrants,trans people,women
Dataset,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAD_abuse,0.823008,0.826934,0.721449,0.843812,0.768884,0.833694,0.826961
Davidson_abuse,0.706523,0.611176,0.404764,0.650523,0.605225,0.553806,0.669205
CAD_hate,0.953422,0.939656,0.626414,0.942283,0.943712,0.940897,0.957513
Davidson_hate,0.813876,0.670788,0.664368,0.7171,0.755416,0.659463,0.561541


In [24]:
master_df['necessity'].groupby(level='target').std().transpose()

target,Muslims,black people,disabled people,gay people,immigrants,trans people,women
Dataset,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAD_abuse,0.151864,0.138473,0.130026,0.13698,0.149855,0.1359,0.145867
Davidson_abuse,0.182026,0.171754,0.085158,0.192702,0.163124,0.162436,0.173194
CAD_hate,0.051556,0.067654,0.029196,0.079761,0.056017,0.067555,0.053243
Davidson_hate,0.156975,0.148447,0.156846,0.176383,0.149984,0.162889,0.139822


In [25]:
master_df['sufficiency'].groupby(level='target').mean().transpose()

target,Muslims,black people,disabled people,gay people,immigrants,trans people,women
Dataset,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAD_abuse,0.928332,0.533998,0.217881,0.791906,0.375914,0.785115,0.803881
Davidson_abuse,0.505357,0.521184,0.221891,0.850475,0.318796,0.348763,0.551171
CAD_hate,0.82561,0.423415,0.323852,0.758529,0.68314,0.766461,0.743572
Davidson_hate,0.853717,0.438192,0.326162,0.680534,0.544915,0.295686,0.163089


In [26]:
master_df['sufficiency'].groupby(level='target').std().transpose()

target,Muslims,black people,disabled people,gay people,immigrants,trans people,women
Dataset,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAD_abuse,0.05355,0.146754,0.11881,0.108506,0.108402,0.110779,0.15258
Davidson_abuse,0.114931,0.120955,0.110731,0.055338,0.131476,0.129694,0.115184
CAD_hate,0.144937,0.091463,0.036449,0.103831,0.118492,0.105469,0.152971
Davidson_hate,0.062003,0.091204,0.118955,0.10295,0.1341,0.100688,0.110173


In [27]:
master_df['necessity_mask'].groupby(level='target').mean().transpose()

target,Muslims,black people,disabled people,gay people,immigrants,trans people,women
Dataset,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAD_abuse,0.790251,0.713805,0.450994,0.744956,0.585635,0.754068,0.779705
Davidson_abuse,0.430983,0.271867,0.109808,0.386209,0.263755,0.230982,0.371332
CAD_hate,0.902219,0.822246,0.293797,0.848647,0.876725,0.860469,0.88436
Davidson_hate,0.541667,0.258709,0.225273,0.329144,0.39026,0.204255,0.07565


In [28]:
master_df['sufficiency_mask'].groupby(level='target').mean().transpose()

target,Muslims,black people,disabled people,gay people,immigrants,trans people,women
Dataset,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
CAD_abuse,0.943488,0.740989,0.299173,0.881105,0.455996,0.900952,0.874
Davidson_abuse,0.730736,0.40207,0.250559,0.811107,0.564283,0.36551,0.746292
CAD_hate,0.955081,0.838529,0.542517,0.958849,0.935992,0.962009,0.913154
Davidson_hate,0.93423,0.679867,0.521447,0.905835,0.885189,0.43637,0.227465
