This notebook contains the results for necessity and sufficiency. Both scores are calculated by choosing a subset of tokens and perturbing them using the ILM model. The models all use the BERT architecture but are trained on different datasets; for each dataset, one model is trained on hate/non-hate labels and another on abusive/non-abusive labels. The explanations are generated for 120 examples from the HateCheck test suite. These are instances that are explicitly hateful and that target women or Muslims.

The function ```display_scores``` displays the necessity or sufficiency of each token of a given example, for all models included. Note that some models will display ```NaN``` for some examples. These are the cases where the model mistakenly classified the original instance as non-abusive/non-hateful. In these cases, the current necessity and sufficiency calculations aren't meaningful, because we aim to provide explanations for positive predictions only. The fourth argument to this function (```scores_dict```) determines which necessity/sufficiency scores to display.
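The subset-and-perturb idea can be illustrated with a minimal sketch. The exact scoring formula is not shown in this notebook, so the definitions below are assumptions: necessity of token k is taken as the share of perturbations that change token k and flip the prediction to negative, and sufficiency as the share of perturbations that keep token k and remain positive. The function names and toy arrays are hypothetical.

```python
import numpy as np

def necessity_sketch(masks, perturbed_preds, k):
    """Assumed definition: share of perturbations that change token k
    and end up predicted as negative (0)."""
    masks = np.asarray(masks, dtype=bool)
    perturbed_preds = np.asarray(perturbed_preds)
    changed = masks[:, k]          # True where token k was perturbed
    if not changed.any():
        return float("nan")
    return float(np.mean(perturbed_preds[changed] == 0))

def sufficiency_sketch(masks, perturbed_preds, k):
    """Assumed definition: share of perturbations that keep token k
    and are still predicted as positive (1)."""
    masks = np.asarray(masks, dtype=bool)
    perturbed_preds = np.asarray(perturbed_preds)
    kept = ~masks[:, k]            # True where token k was left intact
    if not kept.any():
        return float("nan")
    return float(np.mean(perturbed_preds[kept] == 1))

# Toy data: 4 perturbations of a 2-token text (made-up values).
toy_masks = [[1, 0], [1, 1], [0, 1], [0, 0]]
toy_preds = [0, 1, 1, 1]

print(necessity_sketch(toy_masks, toy_preds, 0))
print(sufficiency_sketch(toy_masks, toy_preds, 0))
```

The scores stored in ```results``` may be computed differently (for example with weighting or a baseline correction), so this sketch is only meant to convey the intuition.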

In [4]:
import pickle
import pandas as pd
import numpy as np

In [5]:
preds = pickle.load(open("Data/HateCheck_necc_suff_preds.pickle", "rb"))
results = pickle.load(open("Data/HateCheck_necc_suff_results_all.pickle", "rb"))
perturbations = pickle.load(open("Data/HateCheck_necc_suff_perturbations.pickle","rb"))

In [6]:
perturbations.keys()

dict_keys(['orig_texts', 'necc_perturbed', 'suff_perturbed', 'necc_masks', 'suff_masks'])

In [7]:
preds.keys()

dict_keys(['orig_preds', 'orig_scores', 'necc_preds', 'necc_scores', 'suff_preds', 'suff_scores'])

In [8]:
results.keys()

dict_keys(['necc_results', 'necc_results_nb', 'suff_results', 'suff_results_nb'])

In [9]:
datasets = list(results['necc_results'].keys())
datasets

['CAD_abuse', 'Davidson_abuse', 'CAD_hate', 'Davidson_hate']

In [10]:
# Get the perturbed examples in which token k was perturbed ("flipped"), together with their scores.
# For sufficiency, set reverse=True to return instead the instances where token k was NOT perturbed.
def get_k_corr(k, masks, perturbed, p_results, reverse=False):
    perturbed_k = []
    # masks[:, k] is True where token k was perturbed in the corresponding example
    for pp, mm, rr in zip(perturbed, masks[:,k], p_results):
        if mm != reverse:
            perturbed_k.append((pp, rr))
    return perturbed_k
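A toy call may help to see what ```get_k_corr``` returns. The masks, texts, and scores below are made up for illustration; the function body is copied so that the snippet runs on its own.

```python
import numpy as np

def get_k_corr(k, masks, perturbed, p_results, reverse=False):
    perturbed_k = []
    for pp, mm, rr in zip(perturbed, masks[:, k], p_results):
        if mm != reverse:
            perturbed_k.append((pp, rr))
    return perturbed_k

# Three made-up perturbations of a 3-token text; True means "token perturbed".
toy_masks = np.array([[True, False, True],
                      [False, True, False],
                      [True, True, True]])
toy_perturbed = ["perturbation A", "perturbation B", "perturbation C"]
toy_scores = [0.9, 0.1, 0.5]

# Necessity-style selection: rows where token 2 was perturbed.
necc_like = get_k_corr(2, toy_masks, toy_perturbed, toy_scores)
# Sufficiency-style selection: rows where token 2 was kept.
suff_like = get_k_corr(2, toy_masks, toy_perturbed, toy_scores, reverse=True)

print(necc_like)
print(suff_like)
```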

In [11]:
# for a given original test case, make a table for necessity or sufficiency for each model and for each token

def display_scores(templ_n, orig_texts, orig_preds, scores_dict):
    columns = orig_texts[templ_n].strip().split()
    index = list(scores_dict.keys())
    # if the model has not predicted the original as 1 (positive), show NaN instead of the feature attribution scores
    data = np.array([scores_dict[dset][templ_n] if orig_preds[dset][templ_n] == 1 
                     else [np.nan]*len(scores_dict[dset][templ_n]) 
                    for dset in scores_dict.keys()])
    return pd.DataFrame(data=data, columns=columns, index=index)
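The ```NaN``` behaviour can be checked on toy inputs. The model names and scores below are hypothetical; the function body is copied so that the snippet is self-contained.

```python
import numpy as np
import pandas as pd

def display_scores(templ_n, orig_texts, orig_preds, scores_dict):
    columns = orig_texts[templ_n].strip().split()
    index = list(scores_dict.keys())
    # rows of models that did not predict the original as 1 become NaN
    data = np.array([scores_dict[dset][templ_n] if orig_preds[dset][templ_n] == 1
                     else [np.nan] * len(scores_dict[dset][templ_n])
                     for dset in scores_dict.keys()])
    return pd.DataFrame(data=data, columns=columns, index=index)

# Made-up inputs: model_a predicts the original as positive, model_b does not.
toy_texts = ["I hate women."]
toy_preds = {"model_a": [1], "model_b": [0]}
toy_scores = {"model_a": [[0.1, 0.3, 0.8]],
              "model_b": [[0.2, 0.4, 0.6]]}

table = display_scores(0, toy_texts, toy_preds, toy_scores)
print(table)
```

The row for ```model_b``` is all ```NaN``` even though scores exist for it, because its original prediction was not positive.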

In [12]:
display_scores(0, perturbations['orig_texts'],  preds['orig_preds'], results['necc_results'])

model,I,hate,women.
CAD_abuse,0.142857,0.326923,0.82
Davidson_abuse,0.152381,0.442308,0.52
CAD_hate,0.295238,0.403846,0.97
Davidson_hate,,,


In [13]:
display_scores(0, perturbations['orig_texts'],  preds['orig_preds'], results['suff_results'])

model,I,hate,women.
CAD_abuse,0.2812,0.233087,0.921099
Davidson_abuse,0.4152,0.670483,0.733281
CAD_hate,0.2492,0.158445,0.857988
Davidson_hate,,,


In [14]:
display_scores(1, perturbations['orig_texts'],  preds['orig_preds'], results['necc_results'])

model,I,hate,Muslims.
CAD_abuse,0.258929,0.188119,0.855556
Davidson_abuse,0.178571,0.544554,0.488889
CAD_hate,0.294643,0.217822,0.988889
Davidson_hate,0.357143,0.782178,0.788889


In [15]:
display_scores(1, perturbations['orig_texts'],  preds['orig_preds'], results['suff_results'])

model,I,hate,Muslims.
CAD_abuse,0.286039,0.395118,0.9312
Davidson_abuse,0.344555,0.688087,0.560793
CAD_hate,0.291028,0.288478,0.936827
Davidson_hate,0.204606,0.535501,0.512698


In [16]:
# Examine the perturbations and the scores a model assigned to them.
# This will not work for the masked-not-perturbed versions.

ex_no = 0 # index of the example in the examples list
id_token = 2  # position (0-based) of the identity token in the example
model = 'CAD_abuse'
get_k_corr(id_token, 
           perturbations['suff_masks'][ex_no], 
           perturbations['suff_perturbed'][ex_no], 
           preds['suff_scores'][model][ex_no], reverse=True)

[('sorry about the problems with women.', 0.9258073568344116),
 ('I understand women.', 0.8937275409698486),
 ('due to their tremendously virulently anti-american views, women.',
  0.9253207445144653),
 ('many women.', 0.9009522795677185),
 ('I like to compare herself to women.', 0.7563531398773193),
 ('just like you hate women.', 0.9707891941070557),
 ("i really don't hate women.", 0.883637547492981),
 ('i hate women.', 0.9485741257667542),
 ('>the united states still does not recognize women.', 0.7905982732772827),
 ('I agree with you on most things, women.', 0.9439984560012817),
 ('i hate women.', 0.9485741257667542),
 ('*the reason women were developed women.', 0.8438448905944824),
 ('I think everything is good because women.', 0.7619067430496216),
 ('i hate women.', 0.9485741257667542),
 ('but i hate women.', 0.9404919743537903),
 ('coombe had america as an important continent women.', 0.6697374582290649),
 ("you'll get a cookie in the women.", 0.9203630685806274),
 ('. women.', 0

In [17]:
ex_no = 0 # index of the example in the examples list
id_token = 2  # position (0-based) of the identity token in the example
model = 'CAD_abuse'
get_k_corr(id_token, 
           perturbations['necc_masks'][ex_no], 
           perturbations['necc_perturbed'][ex_no], 
           preds['necc_scores'][model][ex_no])

[('I hate you', 0.9528999328613281),
 ('I hate being persecuted for expressing a differing viewpoint',
  0.032123543322086334),
 ('I hate vandalism', 0.04124375432729721),
 ('I hate cancer', 0.06964512914419174),
 ('i hate anyone', 0.15954862534999847),
 ('i hate but', 0.06715261936187744),
 ('I hate drama where women get fucking beaten up', 0.9644010066986084),
 ('I feel free to disagree with everybody', 0.035190798342227936),
 ("I hate that people ranting about people's views as theories",
  0.04902292788028717),
 ('you hate that, i', 0.8494347929954529),
 ('I make', 0.05512707680463791),
 ("don't hate it", 0.030198724940419197),
 ('i hate sad little chap', 0.04761945828795433),
 ('I hate what they say', 0.7114335298538208),
 ("please don't hate you, let alone the", 0.09828422218561172),
 ('quater blu is, not hate them.', 0.6837099194526672),
 ('I removed /r/askreddit because /r/askreddit is pretty bad.',
  0.021946441382169724),
 ('I mean to speak', 0.026546703651547432),
 ('I hate 

The following section gives the average prediction of the models for the cases where identities are mentioned in neutral or positive contexts (```ident_neutral_nh```, ```ident_pos_nh```) and for the test cases where there is "abuse" directed at non-protected groups (```target_group_nh```), individuals (```target_indiv_nh```), and objects (```target_obj_nh```). The performance of the models on non-abusive mentions of identities is also given separately for the two targets we've chosen for our explanations ('women' and 'Muslims').
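Since all of these test cases are non-hateful, the mean of a binary prediction column within a functionality is the share of cases wrongly flagged as positive. A minimal sketch on a made-up frame (the column ```model_x_pred``` and all values are hypothetical):

```python
import pandas as pd

# Made-up test cases; 1 means the model flagged the case as abusive/hateful.
toy = pd.DataFrame({
    "functionality": ["ident_neutral_nh", "ident_pos_nh",
                      "target_obj_nh", "target_obj_nh"],
    "target_ident": ["women", "women", None, None],
    "model_x_pred": [1, 0, 0, 0],
})

# Relabel the non-hateful identity mentions for one target in a single .loc step.
is_ident_nh = toy["functionality"].isin(["ident_neutral_nh", "ident_pos_nh"])
toy.loc[is_ident_nh & (toy["target_ident"] == "women"), "functionality"] = "women_nh"

# Mean of the 0/1 predictions per functionality = share flagged as positive.
rates = toy.groupby("functionality")["model_x_pred"].mean()
print(rates)
```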

In [18]:
hc_results = pickle.load(open('Data/HateCheck_templates_and_results.pickle', "rb"))
hc_results.test_case = hc_results.test_case.apply(lambda x: x.strip())
hc_results

Unnamed: 0.1,Unnamed: 0,functionality,case_id,test_case,label_gold,target_ident,direction,focus_words,focus_lemma,ref_case_id,...,templ_id,case_templ,CAD_abuse_pred,CAD_abuse_score,Davidson_abuse_pred,Davidson_abuse_score,CAD_hate_pred,CAD_hate_score,Davidson_hate_pred,Davidson_hate_score
0,0,derog_neg_emote_h,1,I hate women.,hateful,women,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.948574,1,0.926830,1,0.931045,0,0.174689
1,1,derog_neg_emote_h,2,I hate trans people.,hateful,trans people,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.702156,1,0.778591,0,0.416270,1,0.986996
2,2,derog_neg_emote_h,3,I hate gay people.,hateful,gay people,general,hate,hate,,...,1,I hate [IDENTITY_P].,1,0.846855,1,0.770283,1,0.891302,1,0.984606
3,3,derog_neg_emote_h,4,I hate black people.,hateful,black people,general,hate,hate,,...,1,I hate [IDENTITY_P].,0,0.475613,1,0.880622,1,0.660958,1,0.987589
4,4,derog_neg_emote_h,5,I hate disabled people.,hateful,disabled people,general,hate,hate,,...,1,I hate [IDENTITY_P].,0,0.062079,0,0.060263,0,0.003808,1,0.987493
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3723,3723,spell_leet_h,3897,Gay people are all complete 4ssholes.,hateful,gay people,general,4ssholes,asshole,1232.0,...,866,[IDENTITY_P] are all complete 4ssholes.,1,0.929226,1,0.857159,1,0.988521,1,0.977773
3724,3724,spell_leet_h,3898,Black people are all complete 4ssholes.,hateful,black people,general,4ssholes,asshole,1233.0,...,866,[IDENTITY_P] are all complete 4ssholes.,1,0.831112,1,0.823153,1,0.986541,1,0.659783
3725,3725,spell_leet_h,3899,Disabled people are all complete 4ssholes.,hateful,disabled people,general,4ssholes,asshole,1234.0,...,866,[IDENTITY_P] are all complete 4ssholes.,0,0.089759,0,0.006946,0,0.011324,0,0.432639
3726,3726,spell_leet_h,3900,Muslims are all complete 4ssholes.,hateful,Muslims,general,4ssholes,asshole,1235.0,...,866,[IDENTITY_P] are all complete 4ssholes.,1,0.966604,1,0.848158,1,0.985151,0,0.474491


In [19]:
hc_results_women_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'women')]

hc_results_muslims_nh = hc_results[((hc_results.functionality == 'ident_neutral_nh') 
                         | (hc_results.functionality == 'ident_pos_nh')) 
                        & (hc_results.target_ident == 'Muslims')]

In [20]:
# relabel in a single .loc step on the DataFrame (avoids pandas' chained-assignment warning)
hc_results.loc[((hc_results.functionality == 'ident_neutral_nh') 
                | (hc_results.functionality == 'ident_pos_nh')) 
               & (hc_results.target_ident == 'women'), 'functionality'] = 'women_nh'

In [21]:
# relabel in a single .loc step on the DataFrame (avoids pandas' chained-assignment warning)
hc_results.loc[((hc_results.functionality == 'ident_neutral_nh') 
                | (hc_results.functionality == 'ident_pos_nh')) 
               & (hc_results.target_ident == 'Muslims'), 'functionality'] = 'muslims_nh'

In [22]:
# the functionalities we are interested in:
target_funcs = ['women_nh', 'muslims_nh', 'target_obj_nh', 'target_indiv_nh', 'target_group_nh']

target_funcs_results = hc_results[hc_results.functionality.isin(target_funcs)]
# average binary prediction (share of cases classified as positive) per functionality
target_funcs_results.groupby('functionality')[['{}_pred'.format(dd) for dd in datasets]].mean().transpose()

functionality,muslims_nh,target_group_nh,target_indiv_nh,target_obj_nh,women_nh
CAD_abuse_pred,0.977778,0.516129,0.646154,0.0,0.688889
Davidson_abuse_pred,0.6,0.241935,0.676923,0.169231,0.488889
CAD_hate_pred,0.955556,0.032258,0.0,0.0,0.622222
Davidson_hate_pred,0.444444,0.306452,0.353846,0.015385,0.0


In [23]:
mask_results = pickle.load(open('Data/HateCheck_necc_suff_results_masked.pickle', 'rb'))
mask_results.keys()

FileNotFoundError: [Errno 2] No such file or directory: 'Data/HateCheck_necc_suff_results_masked.pickle'
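One way to keep the following cells from failing with a ```NameError``` when an optional file is absent is to guard the load. This is only a sketch; ```load_pickle_or_none``` is a hypothetical helper, not part of the original notebook, and the demo uses a temporary file rather than the real data.

```python
import os
import pickle
import tempfile

def load_pickle_or_none(path):
    """Hypothetical helper: return the unpickled object, or None if the file is missing."""
    if not os.path.exists(path):
        print(f"missing file: {path}")
        return None
    with open(path, "rb") as fh:
        return pickle.load(fh)

# Demo with a temporary file (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pickle")
    with open(demo_path, "wb") as fh:
        pickle.dump({"necc_results_nb": {}}, fh)
    loaded = load_pickle_or_none(demo_path)

# The temporary directory is gone now, so this path does not exist.
missing = load_pickle_or_none(os.path.join(tmp, "absent.pickle"))
```

Downstream code can then check for ```None``` and skip the masked columns instead of raising.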

In [None]:
necc_vals = {}
suff_vals = {}
necc_vals_mask = {}
suff_vals_mask = {}
orig_texts = []
targets = []

for tt in perturbations['orig_texts']:
    orig_text = tt.strip()
    row = hc_results[hc_results.test_case == orig_text]
    targets.append(row.target_ident.tolist()[0])

for dataset in datasets:
    necc_vals[dataset] = []
    suff_vals[dataset] = []
    necc_vals_mask[dataset] = []
    suff_vals_mask[dataset] = []
    for nn, (orig_text, orig_pred) in enumerate(zip(perturbations['orig_texts'], preds['orig_preds'][dataset])):
        if orig_pred != 1:
            necc_vals[dataset].append(np.nan)
            suff_vals[dataset].append(np.nan)
            necc_vals_mask[dataset].append(np.nan)
            suff_vals_mask[dataset].append(np.nan)
            continue
        # get the row in hc_results corresponding to this case
        orig_text = orig_text.strip()
        row = hc_results[hc_results.test_case == orig_text]
        toknd = row.case_templ.tolist()[0].split()
        ## find the index of the template placeholder
        for ii, tt in enumerate(toknd):
            if tt[:1] == "[":
                break
        necc_vals[dataset].append(results['necc_results'][dataset][nn][ii])
        suff_vals[dataset].append(results['suff_results'][dataset][nn][ii])
        necc_vals_mask[dataset].append(mask_results['necc_results_nb'][dataset][nn][ii])
        suff_vals_mask[dataset].append(mask_results['suff_results_nb'][dataset][nn][ii])

df_dict = {('necessity', dd): ll for dd, ll in necc_vals.items()}
df_dict.update({('sufficiency', dd): ll for dd, ll in suff_vals.items()})
df_dict.update({('necessity_mask', dd): ll for dd, ll in necc_vals_mask.items()})
df_dict.update({('sufficiency_mask', dd): ll for dd, ll in suff_vals_mask.items()})
df_dict.update({('prediction', dd): ll for dd, ll in preds['orig_preds'].items()})
df_dict.update({('score', dd): ll for dd, ll in preds['orig_scores'].items()})
#df_dict.update({'target', ''}: targets)

#ind = [xx.strip() for xx in perturbations['orig_texts']]
ind = [(tt, xx.strip()) for xx, tt in zip(perturbations['orig_texts'], targets)]

# pd.DataFrame(df_dict, index=ind)
#     avg_necc[dataset] = {target: np.mean(necc_vals[target]) for target in targets}
#     avg_suff[dataset] = {target: np.mean(suff_vals[target]) for target in targets}

master_df = pd.DataFrame(df_dict, index=ind)
master_df.columns = pd.MultiIndex.from_tuples(master_df.columns, names=['value','Dataset'])
master_df.index = pd.MultiIndex.from_tuples(master_df.index, names=['target', 'text'])
pickle.dump(master_df, open("Data/HateCheck_individual_necc_suff_scores.pickle", "wb"))

# master_df.xs('CAD_abuse', level='Dataset', axis=1)
# master_df['necessity']
# master_df.loc['women']
# master_df.xs('I hate women.', level='text')

NameError: name 'mask_results' is not defined
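The layout that the cell above aims for, with (value, Dataset) column pairs and (target, text) row pairs, can be reproduced on toy values. All names and numbers here are made up; only the structure mirrors ```master_df```.

```python
import numpy as np
import pandas as pd

# Made-up scores for two hypothetical models and three texts.
toy_dict = {
    ("necessity", "model_a"): [0.75, 0.25, np.nan],
    ("necessity", "model_b"): [0.5, 0.25, 1.0],
    ("sufficiency", "model_a"): [1.0, 0.5, np.nan],
    ("sufficiency", "model_b"): [0.5, 0.5, 0.25],
}
toy_index = [("women", "text 1"), ("women", "text 2"), ("Muslims", "text 3")]

toy_df = pd.DataFrame(toy_dict)
toy_df.columns = pd.MultiIndex.from_tuples(list(toy_dict.keys()),
                                           names=["value", "Dataset"])
toy_df.index = pd.MultiIndex.from_tuples(toy_index, names=["target", "text"])

# Average necessity per target; NaN entries (non-positive originals) are skipped.
avg_necc = toy_df["necessity"].groupby(level="target").mean().transpose()
print(avg_necc)
```

Selecting ```toy_df["necessity"]``` drops the outer column level, so the grouped result has one row per model and one column per target after the transpose.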

In [None]:
master_df = pickle.load(open("Data/HateCheck_individual_necc_suff_scores.pickle", "rb"))

FileNotFoundError: [Errno 2] No such file or directory: 'Data/HateCheck_individual_necc_suff_scores.pickle'

In [None]:
master_df['necessity'].groupby(level='target').mean().transpose()

NameError: name 'master_df' is not defined

In [None]:
master_df['necessity'].groupby(level='target').std().transpose()

Dataset,Muslims,women
CAD_abuse,0.147089,0.135664
Davidson_abuse,0.132284,0.136133
Founta_abuse,0.212946,0.169301
CAD_hate,0.031214,0.023952
Davidson_hate,0.123177,0.089595
Founta_hate,0.159284,0.182587


In [None]:
master_df['sufficiency'].groupby(level='target').mean().transpose()

Dataset,Muslims,women
CAD_abuse,0.883638,0.64441
Davidson_abuse,0.40808,0.439905
Founta_abuse,0.823165,0.343123
CAD_hate,0.878019,0.706071
Davidson_hate,0.738724,0.213942
Founta_hate,0.813537,0.295489


In [None]:
master_df['sufficiency'].groupby(level='target').std().transpose()

Dataset,Muslims,women
CAD_abuse,0.073014,0.138826
Davidson_abuse,0.138387,0.125232
Founta_abuse,0.059369,0.102189
CAD_hate,0.13258,0.173075
Davidson_hate,0.091162,0.061001
Founta_hate,0.077387,0.104272


In [None]:
master_df['necessity_mask'].groupby(level='target').mean().transpose()

Dataset,Muslims,women
CAD_abuse,0.643937,0.615021
Davidson_abuse,0.522215,0.552777
Founta_abuse,0.362422,0.192982
CAD_hate,0.928238,0.874405
Davidson_hate,0.88264,0.436204
Founta_hate,0.724388,0.52969


In [None]:
master_df['sufficiency_mask'].groupby(level='target').mean().transpose()

Dataset,Muslims,women
CAD_abuse,0.945313,0.859666
Davidson_abuse,0.749977,0.796173
Founta_abuse,0.950145,0.580692
CAD_hate,0.945048,0.882524
Davidson_hate,0.917194,0.257918
Founta_hate,0.909181,0.574927
