Note: Much of this code was lifted from [the Conversation AI project](https://conversationai.github.io/).

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import model_bias_analysis

Read TSVs from file. These are the original data from *Conversation AI*.

In [2]:
comments = pd.read_csv("toxicity_annotated_comments.tsv", sep="\t")

In [3]:
annotations = pd.read_csv("toxicity_annotations.tsv", sep="\t")

In [4]:
# toxicity_worker_demographics = pd.read_csv("toxicity_worker_demographics.tsv"\
#                                           , sep = "\t")

* *grouped_annotations* takes the mean of all toxicity ratings of a comment.
* *joined_tox* joins *grouped_annotations* and *comments*.
* We also add a column *binary_tox* to the dataframe *joined_tox*. It is 1 if the comment's mean toxicity rating is at least 0.5 and 0 otherwise.
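The three steps above can be sketched on a toy example. This is only an illustration with made-up values: the `toy_*` names are hypothetical, and the sketch matches rows on the `rev_id` key with `merge`, which is an alternative to the index-based `join` used in the next cell rather than a copy of it.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the annotation and comment tables (values are made up).
toy_annotations = pd.DataFrame({
    "rev_id": [1, 1, 2, 2, 3, 3],
    "toxicity": [1, 1, 0, 1, 0, 0],
})
toy_comments = pd.DataFrame({
    "rev_id": [3, 1, 2],  # deliberately in a different row order
    "comment": ["c", "a", "b"],
})

# Mean toxicity rating per comment.
toy_grouped = toy_annotations.groupby("rev_id", as_index=False)["toxicity"].mean()

# Match rows on the rev_id key rather than on row position.
toy_joined = toy_grouped.merge(toy_comments, on="rev_id", how="left")

# 1 if the mean rating is at least 0.5, otherwise 0.
toy_joined["binary_tox"] = np.where(toy_joined["toxicity"] >= 0.5, 1, 0)
print(toy_joined)
```

Merging on the key keeps each mean rating attached to the right comment even when the two tables are not in the same row order.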

In [5]:
grouped_annotations = annotations.groupby('rev_id',as_index=False)['toxicity'].mean()
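# NOTE: DataFrame.join aligns the two frames on their row index, not on rev_id, so this
# relies on both frames listing the comments in the same order. Using the same suffix on
# both sides is also why the column 'rev_idrev_id' appears twice below.
joined_tox = grouped_annotations.join(comments, lsuffix='rev_id', rsuffix='rev_id', how='left', sort=True)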
joined_tox['binary_tox'] = np.where(joined_tox['toxicity']>=.5, 1, 0)

# Stuff I might want later
# # remove newline and tab tokens
# comments['comment'] = comments['comment'].apply(lambda x: x.replace("NEWLINE_TOKEN", " "))
# comments['comment'] = comments['comment'].apply(lambda x: x.replace("TAB_TOKEN", " "))
# comments['length'] = comments['comment'].str.len()

Train logistic regression classifier.

In [6]:
test_comments = joined_tox.query("split == 'test' ")
train_comments = joined_tox.query("split == 'train' ")

clf = Pipeline([
    ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
    ('tfidf', TfidfTransformer(norm = 'l2')),
    ('clf', LogisticRegression()),
])

clf = clf.fit(train_comments['comment'], train_comments['binary_tox'])
auc = roc_auc_score(test_comments['binary_tox'], clf.predict_proba(test_comments['comment'])[:, 1])
print('Test ROC AUC: %.3f' %auc)

Test ROC AUC: 0.951


In [7]:
test_comments["predicted"] = clf.predict(test_comments['comment'])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In [8]:
test_comments.columns

Index(['rev_idrev_id', 'toxicity', 'rev_idrev_id', 'comment', 'year',
       'logged_in', 'ns', 'sample', 'split', 'binary_tox', 'predicted'],
      dtype='object')

In [9]:
# An example of a comment predicted to be toxic.
# test_comments[test_comments['predicted'] == 1]['comment'][1649]#['comment']#.value_counts()

In [10]:
# The test comments that the model predicted to be toxic (predicted == 1)
test_comments[test_comments['predicted'] == 1].head()

Unnamed: 0,rev_idrev_id,toxicity,rev_idrev_id.1,comment,year,logged_in,ns,sample,split,binary_tox,predicted
829,5815399.0,0.7,5815399.0,NEWLINE_TOKENNEWLINE_TOKENIt was anonymous on ...,2004,True,article,random,test,1,1
1227,8357736.0,0.3,8357736.0,NEWLINE_TOKENNEWLINE_TOKENYou are a child.,2004,True,user,blocked,test,0,1
1228,8359431.0,0.5,8359431.0,`NEWLINE_TOKENNEWLINE_TOKEN::You are not worth...,2004,True,user,blocked,test,1,1
1348,8845700.0,1.0,8845700.0,NEWLINE_TOKENNEWLINE_TOKENi am using the sandb...,2004,False,user,blocked,test,1,1
1521,9664203.0,0.8,9664203.0,NEWLINE_TOKENNEWLINE_TOKEN== whoa ==NEWLINE_TO...,2005,False,user,blocked,test,1,1


In [11]:
print(clf.predict(['Thanks for you contribution, you did a great job!']))
# Model labels this comment as non-toxic.
print(clf.predict(['People as stupid as you should not edit Wikipedia!']))
# Model labels this comment as toxic.
print(clf.predict(['I am a proud gay']))
# Model also labels this comment as toxic. Uh-oh! We don't think this comment should be toxic!

[0]
[1]
[1]


In [12]:
# What are the identity labels?

# Currently not in use, but possibly we'll want to use it later

# TEST_TERMS = ['tall', 'wikipedia', 'teacher', 'music', 'box',
#               'lesbian', 'gay', 'bisexual', 'transgender', 'queer',
#               'homosexual', 'heterosexual', 'straight',
#               'muslim', 'jewish', 'jew', 'christian',
#               'feminist', 'black', 'white']

madlibs_terms = model_bias_analysis.read_identity_terms('adjectives_people.txt')

# Add identity labels to train and test comments
model_bias_analysis.add_subgroup_columns_from_text(train_comments, 'comment', madlibs_terms)
model_bias_analysis.add_subgroup_columns_from_text(test_comments, 'comment', madlibs_terms)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[term] = df[text_column].apply(lambda x: bool(re.search(r'\b{}\b'.format(term), x, flags=re.IGNORECASE)))
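Both `SettingWithCopyWarning` messages so far appear to come from the fact that `train_comments` and `test_comments` are subsets of `joined_tox` produced by `query`, and columns are then added to those subsets. A minimal sketch of one common way to avoid the warning, assuming it is acceptable for the subsets to be independent copies (the `toy` frame is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "split": ["train", "test", "test"],
    "comment": ["a", "b", "c"],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and does not touch the parent frame.
toy_test = toy.query("split == 'test'").copy()
toy_test["predicted"] = [0, 1]

print(toy_test)
print(list(toy.columns))  # the parent frame has no 'predicted' column
```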


In [13]:
# Print the train and test frames to check that the identity-term columns were added
# train_comments.head()
# test_comments.head()

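Equation DI:

$$\mathrm{DI}(t_1, t_2) = \frac{P(\text{labeled toxic} \mid \text{comment contains } t_1)}{P(\text{labeled toxic} \mid \text{comment contains } t_2)} = \frac{a}{b}$$

$$a = \frac{\#\,\text{comments containing } t_1 \text{ AND toxic}}{\#\,\text{comments containing } t_1} = \frac{\alpha}{\beta}$$

$b$ is defined the same way with $t_2$ in place of $t_1$.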
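As a worked example of the DI definition, here is a small sketch on a made-up frame. The helper names `toxic_rate` and `disparate_impact` and the columns `term_a` and `term_b` are hypothetical. The frame is assumed to have one boolean column per term plus a 0/1 label column, like the frames built in this notebook.

```python
import pandas as pd

def toxic_rate(df, term, label_col):
    """alpha / beta for one term."""
    contains = df[term]
    beta = int(contains.sum())                             # comments containing the term
    alpha = int((contains & (df[label_col] == 1)).sum())   # ...that are also toxic
    return alpha / beta

def disparate_impact(df, t1, t2, label_col):
    """DI(t1, t2) = a / b, the ratio of the two per-term toxic rates."""
    return toxic_rate(df, t1, label_col) / toxic_rate(df, t2, label_col)

toy = pd.DataFrame({
    "term_a": [True, True, True, True, False, False],
    "term_b": [False, False, True, True, True, True],
    "binary_tox": [1, 1, 1, 0, 0, 0],
})

print(toxic_rate(toy, "term_a", "binary_tox"))                  # 3 of 4 -> 0.75
print(toxic_rate(toy, "term_b", "binary_tox"))                  # 1 of 4 -> 0.25
print(disparate_impact(toy, "term_a", "term_b", "binary_tox"))  # 0.75 / 0.25 -> 3.0
```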

In [14]:
# Confirm that you can find beta (number of comments containing the term)
train_comments[train_comments['lesbian'] == True].shape[0]

38

In [15]:
# Confirm that you can find alpha (number of comments containing the term that are labeled toxic)
train_comments[(train_comments['lesbian'] == True) & (train_comments['binary_tox'] == 1)].shape[0]

15

In [16]:
train_comments.head()

Unnamed: 0,rev_idrev_id,toxicity,rev_idrev_id.1,comment,year,logged_in,ns,sample,split,binary_tox,...,older,young,younger,teenage,millenial,middle aged,elderly,blind,deaf,paralyzed
0,2232.0,0.1,2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train,0,...,False,False,False,False,False,False,False,False,False,False
1,4216.0,0.0,4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train,0,...,False,False,False,False,False,False,False,False,False,False
3,26547.0,0.0,26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train,0,...,False,False,False,False,False,False,False,False,False,False
6,37330.0,0.3,37330.0,`NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENI fixe...,2002,True,article,random,train,0,...,False,False,False,False,False,False,False,False,False,False
7,37346.0,0.1,37346.0,`If they are ``indisputable`` then why does th...,2002,True,article,random,train,0,...,False,False,False,False,False,False,False,False,False,False


In [48]:
def calculate_pairwise_di(df, madlibs_terms, colname):
    # Despite the name, this returns one value per term: the fraction of comments
    # containing that term whose label in `colname` is 1 (alpha/beta above).
    # The pairwise ratios are taken later, in find_max_di.

    term_toxicity = np.zeros(len(madlibs_terms))

    # Terms with no matching comments (non-binary is one) would divide by zero,
    # so we temporarily put -1 in as a placeholder for them.
    for i in range(len(madlibs_terms)):
        has_term = df[madlibs_terms[i]] == True
        try:
            term_toxicity[i] = float(df[has_term & (df[colname] == 1)].shape[0]) / \
                float(df[has_term].shape[0])
        except ZeroDivisionError:
            term_toxicity[i] = -1

    return term_toxicity

term_toxicity = calculate_pairwise_di(train_comments, madlibs_terms, 'binary_tox')
print(term_toxicity)


[ 0.39473684  0.62544803  0.07692308  0.26315789  0.0952381   0.74509804
  0.07017544  1.          0.46774194  0.13702624  0.15789474  0.17073171
  0.15577889 -1.          0.08474576  0.16326531  0.18042813  0.15853659
  0.09090909  0.10416667  0.11111111  0.2        -1.          0.2739726
  0.06944444  0.09857978  0.12        0.11444142  0.09090909  0.12058824
  0.07216495  0.09190372  0.13057325  0.11860465  0.02173913  0.08906883
  0.03846154  0.11111111 -1.          0.12916667  0.0472103   0.09433962
  0.03529412  0.125       0.         -1.          0.          0.19496855
  0.08       -1.        ]
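The per-term loop above can also be written without explicit iteration. The following is a sketch, not the notebook's own method: `term_toxic_rates` is a hypothetical name, and it assumes every term is a boolean column of the frame. It keeps the same `-1` placeholder for terms that never occur.

```python
import numpy as np
import pandas as pd

def term_toxic_rates(df, terms, label_col):
    contains = df[terms].to_numpy(dtype=bool)          # shape (n_comments, n_terms)
    toxic = (df[label_col].to_numpy() == 1)[:, None]   # shape (n_comments, 1)
    beta = contains.sum(axis=0)                        # comments containing each term
    alpha = (contains & toxic).sum(axis=0)             # ...that are also toxic
    rates = np.full(len(terms), -1.0)                  # -1 marks terms with no comments
    np.divide(alpha, beta, out=rates, where=beta > 0)  # divide only where beta > 0
    return rates

toy = pd.DataFrame({
    "term_a": [True, True, True, True, False, False],
    "term_b": [False, False, True, True, True, True],
    "term_c": [False] * 6,  # never appears, so it gets the -1 placeholder
    "binary_tox": [1, 1, 1, 0, 0, 0],
})

rates = term_toxic_rates(toy, ["term_a", "term_b", "term_c"], "binary_tox")
print(rates)
```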


In [37]:
# This finds the max DI over all ordered pairs of identity terms. For one of our experiments,
# we treat this maximum as the overall DI of the whole set of terms.
# We may take this out or come back to it once we try more versions of our code. (Neha's working on this!)

def find_max_di(term_toxicity, madlibs_terms):
    # General questions for potential problems with this method:
    # Do we want the 0s? That would mean they are never used in a toxic way but can't divide by 0...
    # Also, how often do terms appear? Is sample size crazy small right now?? 
    max_prop = 0
    top_indexes = ()

    for i in range(len(madlibs_terms)):
        for j in range(len(madlibs_terms)):
            if (i==j or term_toxicity[i] == -1 or term_toxicity[j] == -1 or term_toxicity[j] == 0):
                continue
            if term_toxicity[i]/ term_toxicity[j] > max_prop:
                max_prop = term_toxicity[i]/ term_toxicity[j]
                top_indexes = (i,j)
                
    return max_prop, top_indexes

max_prop, top_indexes = find_max_di(term_toxicity, madlibs_terms)

print(max_prop, top_indexes)
print(madlibs_terms[top_indexes[0]], madlibs_terms[top_indexes[1]])



46.0 (7, 34)
lgbtq buddhist
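The double loop in `find_max_di` can be sketched with NumPy broadcasting. This is an illustrative alternative under the same conventions (`-1` means "no data", and zero denominators are skipped), and `max_rate_ratio` is a hypothetical name. One difference from the loop: if no valid pair exists, this version returns `-inf` rather than `0`.

```python
import numpy as np

def max_rate_ratio(rates):
    rates = np.asarray(rates, dtype=float)
    valid_num = rates != -1                    # numerator must have data
    valid_den = (rates != -1) & (rates != 0)   # denominator must have data and be non-zero
    mask = valid_num[:, None] & valid_den[None, :]
    np.fill_diagonal(mask, False)              # skip i == j

    ratios = np.full((len(rates), len(rates)), -np.inf)
    np.divide(rates[:, None], rates[None, :], out=ratios, where=mask)

    i, j = np.unravel_index(np.argmax(ratios), ratios.shape)
    return float(ratios[i, j]), (int(i), int(j))

toy_rates = [0.5, 1.0, -1.0, 0.0, 0.25]
print(max_rate_ratio(toy_rates))  # 1.0 / 0.25 at indexes (1, 4)
```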


Now we do the same thing for the test data, using the model's predictions in place of the labels.

In [19]:
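# beta of test (number of test comments containing the term)
test_comments[test_comments['lesbian'] == True].shape[0]

18

In [20]:
# alpha of test (number of test comments containing the term that are predicted toxic)
test_comments[(test_comments['lesbian'] == True) & (test_comments['predicted'] == 1)].shape[0]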

4

In [21]:
# This writes the 3 dataframes to csv so that later cells can reload them instead of rebuilding them.

joined_tox.to_csv('joined_tox.csv')
train_comments.to_csv('train_comments.csv')
test_comments.to_csv('test_comments.csv')

In [22]:
train_comments = pd.read_csv('train_comments.csv')

In [23]:
test_comments = pd.read_csv('test_comments.csv')

In [24]:
# Generates 10 perturbed copies of the training labels.

list_perturbation = []

length = len(train_comments.binary_tox.values)
for j in range(10):
    rand = np.random.random(length) # generate a random number in [0, 1) for each comment
    tox_tmp = np.copy(train_comments.binary_tox.values) # work on a copy so the original labels are untouched
    for i in range(length):
        if rand[i] >= 0.5: # with probability 0.5, replace the label with a random draw from {0, 1}
            tox_tmp[i] = np.random.randint(2)
    list_perturbation.append(tox_tmp)

# each item in list_perturbation is an array of 0s and 1s giving the perturbed binary_tox of each comment
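The inner loop over comments can be replaced by array operations. Below is a sketch of the same perturbation scheme, assuming NumPy's `Generator` API. The function name `perturb_labels` and the `resample_prob` parameter are hypothetical. Because a resampled label matches the original half the time for binary labels, each label ends up flipped with probability 0.25 overall when `resample_prob` is 0.5.

```python
import numpy as np

def perturb_labels(labels, resample_prob=0.5, rng=None):
    """Return a copy of labels where each entry is, with probability
    resample_prob, replaced by a uniform random draw from {0, 1}."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    n = labels.shape[0]
    resample = rng.random(n) < resample_prob     # which entries to redraw
    random_labels = rng.integers(0, 2, size=n)   # candidate replacement labels
    return np.where(resample, random_labels, labels)

rng = np.random.default_rng(0)                   # seeded so the run is repeatable
toy_labels = np.zeros(1000, dtype=int)
toy_perturbations = [perturb_labels(toy_labels, rng=rng) for _ in range(10)]
print([int(p.sum()) for p in toy_perturbations])  # number of labels flipped to 1 per copy
```

Passing a seeded generator makes the perturbed datasets reproducible between runs, which the unseeded `np.random` calls above are not.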

In [54]:
print(list_perturbation)

[array([0, 1, 0, ..., 0, 0, 1], dtype=int64), array([1, 1, 0, ..., 0, 0, 1], dtype=int64), array([0, 0, 1, ..., 0, 0, 0], dtype=int64), array([0, 1, 0, ..., 0, 0, 1], dtype=int64), array([1, 1, 0, ..., 0, 0, 0], dtype=int64), array([1, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 1, 0, ..., 0, 0, 0], dtype=int64), array([0, 1, 0, ..., 0, 0, 0], dtype=int64), array([0, 1, 0, ..., 0, 0, 0], dtype=int64), array([1, 0, 1, ..., 0, 0, 0], dtype=int64)]


In [55]:
# Calculate max DIs on all of the perturbed training datasets

training_data_dis = []

for i in range(10):
    train_comments['newcol'] = list_perturbation[i]
    term_toxicity = calculate_pairwise_di(train_comments, madlibs_terms, 'newcol')
    max_prop, top_indexes = find_max_di(term_toxicity, madlibs_terms)
    training_data_dis.append((max_prop, top_indexes))

print("**", training_data_dis)
# This is the array of all of the max_dis, and the indexes of the madlibs_terms array that composed that max_di

** [(3.3, (21, 46)), (7.11764705882353, (5, 46)), (6.25, (44, 48)), (11.0, (7, 46)), (11.0, (7, 46)), (9.5, (7, 3)), (11.0, (7, 46)), (6.901960784313725, (5, 46)), (8.666666666666666, (7, 2)), (6.352941176470589, (5, 20))]


In [25]:
# This trains one classifier on each of n different perturbed datasets

d={}
n = 10
for x in range(n):
    d["clf{0}".format(x)] = Pipeline([
        ('vect', CountVectorizer(max_features = 10000, ngram_range = (1,2))),
        ('tfidf', TfidfTransformer(norm = 'l2')),
        ('clf', LogisticRegression()),
    ])
    d["clf{0}".format(x)] = d["clf{0}".format(x)].\
                                fit(train_comments['comment'], list_perturbation[x])
    d["auc{0}".format(x)] = roc_auc_score(test_comments['binary_tox'], \
                                d["clf{0}".format(x)].predict_proba(test_comments['comment'])[:, 1])
    print('Test ROC AUC: %.5f' %d["auc{0}".format(x)])

Test ROC AUC: 0.88699
Test ROC AUC: 0.89348
Test ROC AUC: 0.88652
Test ROC AUC: 0.88795
Test ROC AUC: 0.88485
Test ROC AUC: 0.88355
Test ROC AUC: 0.88990
Test ROC AUC: 0.88771
Test ROC AUC: 0.88754
Test ROC AUC: 0.88733


In [26]:
# Once the classifiers are trained, this creates predictions on the test data.
# perturbed_predictions is a list with one array per perturbation: element 0 holds the
# predictions of the classifier trained on perturbation 0, and so on.
# Each array entry is 0/1 for predicted not-toxic/toxic.
perturbed_predictions = []

for i in range(10):
    perturbed_predictions.append(d["clf{0}".format(i)].predict(test_comments['comment']))

In [39]:
print(perturbed_predictions)

[array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 1, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64), array([0, 1, 0, ..., 0, 0, 0], dtype=int64), array([0, 0, 0, ..., 0, 0, 0], dtype=int64)]


In [27]:
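# Sanity check: counts of not-toxic/toxic in the true test labels
print(np.bincount(test_comments.binary_tox.values))
# More sanity: the same counts for each classifier's predictions
for arr in perturbed_predictions:
    print(np.bincount(arr))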

[28249  3617]
[29459  2407]
[29343  2523]
[29411  2455]
[29523  2343]
[29397  2469]
[29474  2392]
[29343  2523]
[29377  2489]
[29395  2471]
[29418  2448]


In [41]:
# Goal: Compute fairness on test dataset.

test_comments['newcol'] = perturbed_predictions[0]


In [42]:
test_comments.head()

Unnamed: 0.1,Unnamed: 0,rev_idrev_id,toxicity,rev_idrev_id.1,comment,year,logged_in,ns,sample,split,...,young,younger,teenage,millenial,middle aged,elderly,blind,deaf,paralyzed,newcol
0,2,8953.0,0.0,8953.0,Elected or Electoral? JHK,2002,False,article,random,test,...,False,False,False,False,False,False,False,False,False,0
1,4,28959.0,0.2,28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test,...,False,False,False,False,False,False,False,False,False,0
2,19,138074.0,0.0,138074.0,`NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENNEWLIN...,2002,True,article,random,test,...,False,False,False,False,False,False,False,False,False,0
3,33,200664.0,0.0,200664.0,NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKEN NEWLIN...,2002,True,user,random,test,...,False,False,False,False,False,False,False,False,False,0
4,37,213105.0,0.0,213105.0,`NEWLINE_TOKENNEWLINE_TOKEN: I should do that ...,2002,True,user,random,test,...,False,False,False,False,False,False,False,False,False,0


In [44]:
# Sanity check: run this cell and the next one to confirm that newcol matches the perturbed_predictions array that you added
test_comments['newcol'].value_counts()

0    29459
1     2407
Name: newcol, dtype: int64

In [46]:
sum(perturbed_predictions[0] == 1)

2407

In [53]:
# For each of the perturbed_predictions, append it onto the dataset, find the pairwise DI, and then find the max DI.
test_data_dis = []

for i in range(len(perturbed_predictions)):
    test_comments['newcol'] = perturbed_predictions[i]
    term_toxicity = calculate_pairwise_di(test_comments, madlibs_terms, 'newcol')
    max_prop, top_indexes = find_max_di(term_toxicity, madlibs_terms)
    test_data_dis.append((max_prop, top_indexes))

print("**", test_data_dis)
# This is the array of all of the max_dis, and the indexes of the madlibs_terms array that composed that max_di

** [(40.25, (45, 31)), (66.19354838709678, (1, 35)), (17.5, (20, 11)), (20.564516129032256, (1, 40)), (31.006451612903227, (1, 35)), (18.956451612903226, (1, 31)), (19.277419354838713, (1, 35)), (108.0, (45, 35)), (108.0, (45, 35)), (21.38494623655914, (1, 9))]


40.25 (45, 31)
lgbtq buddhist


### Neha's notes

"This data set (https://figshare.com/articles/Wikipedia_Talk_Labels_Toxicity/4563973) includes over 100k labeled discussion comments from English Wikipedia. Each comment was labeled by multiple annotators via Crowdflower on whether it is a toxic or healthy contribution. We also include some demographic data for each crowd-worker. See our wiki for documentation of the schema of each file and our research paper for documentation on the data collection and modeling methodology. For a quick demo of how to use the data for model building and analysis, check out this ipython notebook." - quote from linked page

In [29]:
comments = pd.read_csv("toxicity_annotated_comments.tsv", sep="\t")

Copied from documentation: <br>
"Schema for {attack/aggression/toxicity}_annotated_comments.tsv
The comment text and metadata for comments with attack/aggression/toxicity labels generated by crowd-workers. The actual labels are in the corresponding {attack/aggression/toxicity}_annotations.tsv since each comment was labeled multiple times.

rev_id: MediaWiki revision id of the edit that added the comment to a talk page (i.e. discussion). <br>
comment: Comment text. Consists of the concatenation of content added during a revision/edit of a talk page. MediaWiki markup and HTML have been stripped out. To simplify tsv parsing, \n has been mapped to NEWLINE_TOKEN, \t has been mapped to TAB_TOKEN and " has been mapped to `. <br>
year: The year the comment was posted in. <br>
logged_in: Indicator for whether the user who made the comment was logged in. Takes on values in {0, 1}. <br>
ns: Namespace of the discussion page the comment was made in. Takes on values in {user, article}. <br>
sample: Indicates whether the comment came via random sampling of all comments, or whether it came from random sampling of the 5 comments around a block event for violating WP:npa or WP:HA. Takes on values in {random, blocked}. <br>
split: For model building in our paper we split comments into train, dev and test sets. Takes on values in {train, dev, test}."
<br>

My notes: <br> 
I don't know much about how natural language processing works, but from the snippets that I do know, I imagine that really long comments are probably harder to classify well. I also wonder how bigram toxicity works and whether this is something the training data accounts for (e.g. "nasty woman" and "nasty" are problematic for different reasons). What are they classifying on, and why does this work? We can see Figure 1 in the Dixon paper for comment length and should be able to filter on it. I wonder whether the phrase templating handles the problems I raise about long comments and bigrams. Is this an issue we should be dealing with?
Also, a couple of easier questions: what do ns and sample really mean?
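To look into the comment-length question, one possible first step is to strip the placeholder tokens and filter on length. This is a sketch on made-up comments, in the spirit of the commented-out cleanup code earlier in the notebook. The cutoff `max_len` is a hypothetical value chosen only for the example.

```python
import pandas as pd

toy_comments = pd.DataFrame({
    "comment": [
        "NEWLINE_TOKENNEWLINE_TOKENYou are a child.",
        "shortTAB_TOKENtext",
        "x" * 50,
    ]
})

# Replace the newline and tab placeholder tokens with a space before measuring length.
cleaned = (
    toy_comments["comment"]
    .str.replace("NEWLINE_TOKEN", " ", regex=False)
    .str.replace("TAB_TOKEN", " ", regex=False)
)
toy_comments["length"] = cleaned.str.len()

max_len = 30  # hypothetical cutoff, not a value from the paper
short_comments = toy_comments[toy_comments["length"] <= max_len]
print(toy_comments["length"].tolist())
print(len(short_comments))
```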

In [30]:
comments.head()

Unnamed: 0,rev_id,comment,year,logged_in,ns,sample,split
0,2232.0,This:NEWLINE_TOKEN:One can make an analogy in ...,2002,True,article,random,train
1,4216.0,`NEWLINE_TOKENNEWLINE_TOKEN:Clarification for ...,2002,True,user,random,train
2,8953.0,Elected or Electoral? JHK,2002,False,article,random,test
3,26547.0,`This is such a fun entry. DevotchkaNEWLINE_...,2002,True,article,random,train
4,28959.0,Please relate the ozone hole to increases in c...,2002,True,article,random,test


Copied from documentation: <br>
Schema for toxicity_annotations.tsv <br>
Toxicity labels from several crowd-workers for each comment in toxicity_annotated_comments.tsv. It can be joined with toxicity_annotated_comments.tsv on rev_id.

rev_id: MediaWiki revision id of the edit that added the comment to a talk page (i.e. discussion). <br>
worker_id: Anonymized crowd-worker id.<br>
toxicity_score: Categorical variable ranging from very toxic (-2), to neutral (0), to very healthy (2). <br>
toxicity: Indicator variable for whether the worker thought the comment is toxic. The annotation takes on the value 1 if the worker considered the comment toxic (i.e worker gave a toxicity_score less than 0) and value 0 if the worker considered the comment neutral or healthy (i.e worker gave a toxicity_score greater or equal to 0). Takes on values in {0, 1}.

My notes:
One thing to explore is how many people rated each comment. The paper said 10, but I would like to confirm this.
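A quick way to check the number of raters per comment is to count distinct `worker_id` values for each `rev_id`. The sketch below uses a made-up frame with the column names from the schema above. The same two lines should apply to the real `annotations` frame, assuming it loads with those columns.

```python
import pandas as pd

toy_annotations = pd.DataFrame({
    "rev_id": [1, 1, 1, 2, 2, 3],
    "worker_id": [10, 11, 12, 10, 13, 11],
    "toxicity": [0, 1, 0, 0, 0, 1],
})

# Number of distinct workers who rated each comment.
raters_per_comment = toy_annotations.groupby("rev_id")["worker_id"].nunique()

# How many comments have 1 rater, 2 raters, and so on.
print(raters_per_comment.value_counts().sort_index())
```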

In [31]:
annotations = pd.read_csv("toxicity_annotations.tsv", sep="\t")

My comment: <br>
We won't really be using the worker demographics file until much later, if/when we decide to do a perturbation using demographic data. We would first want to check what correlations exist between gender and rating, and see how workers rate comments about women, for example.
    

Copied from documentation:

Schema for {attack/aggression/toxicity}_worker_demographics.tsv
Demographic information about the crowdworkers. This information was obtained by an optional demographic survey administered after the labelling task. It is meant to be joined with {attack/aggression/toxicity}_annotations.tsv on worker_id. Some fields may be blank if left unanswered.

worker_id: Anonymized crowd-worker id. <br>
gender: The gender of the crowd-worker. Takes a value in {'male', 'female', and 'other'}. <br>
english_first_language: Does the crowd-worker describe English as their first language. Takes a value in {0, 1}.<br>
age_group: The age group of the crowd-worker. Takes on values in {'Under 18', '18-30', '30-45', '45-60', 'Over 60'}.<br>
education: The highest education level obtained by the crowd-worker. Takes on values in {'none', 'some', 'hs', 'bachelors', 'masters', 'doctorate', 'professional'}. Here 'none' means no schooling, some means 'some schooling', 'hs' means high school completion, and the remaining terms indicate completion of the corresponding degree type.
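As a first look at the gender/rating question raised above, the annotations can be joined to the demographics on `worker_id` and the ratings averaged per group. This is a sketch on made-up data using the column names from the two schemas. It is only a starting point, since a difference in means says nothing on its own about which comments each group rated.

```python
import pandas as pd

toy_annotations = pd.DataFrame({
    "rev_id": [1, 2, 1, 2, 1],
    "worker_id": [10, 10, 11, 11, 12],
    "toxicity": [1, 0, 1, 1, 0],
})
toy_demographics = pd.DataFrame({
    "worker_id": [10, 11],  # worker 12 skipped the optional survey
    "gender": ["female", "male"],
})

# Left join keeps annotations from workers with no demographic record (gender is NaN).
merged = toy_annotations.merge(toy_demographics, on="worker_id", how="left")

# Mean toxicity rating per gender; rows with missing gender are dropped by groupby.
mean_by_gender = merged.groupby("gender")["toxicity"].mean()
print(mean_by_gender)
```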
