# Toxicity in Wikipedia Comments

This notebook is an independent exploration of the Wikipedia toxicity data, done in parallel with other work on the same
topic.  The data has not been cleaned yet, and no additional categories have been introduced.  It is presented as-is, without imposed bias, for people to play with.  

In [1]:
import pandas as pd
import nltk
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer

In [45]:
#df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

#reindex 
#df_com.index=df_com['rev_id']
#df_rate.index=df_rate['rev_id']
print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [46]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes.  Those set the boundaries for changes.
#Need to also append final index.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split instead.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr
#change_indices=np.append(change_indices,len(df_rate))
# for i in range(len(change_indices)-1):
#     ind0 = change_indices[i]
#     ind1 = change_indices[i+1]
#     d0=df_rate.iloc[ind0:ind1]
#     scores=d0[['worker_id','toxicity_score']]
#     #pass it a list so it can be set as an entry.
#     #accessing later will require score[0] idiocy to get at list.
#     df_com.loc[i,'scores']=[scores.values]
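
The split above pairs annotations with comments purely by position, so it relies on `df_rate` being sorted by `rev_id` with a default integer index, and on `df_com` listing the same `rev_id` values in the same order. A minimal sketch of the same grouping keyed on `rev_id` instead, in plain Python. The rows and the helper `group_scores` are made up for illustration and are not taken from the data:

```python
from collections import defaultdict

# Hypothetical (rev_id, worker_id, toxicity_score) rows standing in for df_rate.
ratings = [
    (2232, 723, 0),
    (2232, 4000, 0),
    (4216, 188, -1),
    (2232, 3989, 1),   # deliberately out of order
    (4216, 1002, -2),
]

def group_scores(rows):
    """Collect (worker_id, toxicity_score) pairs for each rev_id."""
    grouped = defaultdict(list)
    for rev_id, worker_id, score in rows:
        grouped[rev_id].append((worker_id, score))
    return dict(grouped)

scores_by_rev = group_scores(ratings)
print(scores_by_rev[2232])
```

Because the lookup is by key, the result does not depend on row order. In pandas the analogous approach would be a `groupby('rev_id')` on `df_rate` followed by a join on `rev_id`.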

In [168]:
def score_mean(score_list):
    """score_mean
    Compute mean of toxicity scores for input array.
    Input is a 2D array with columns (worker_id, toxicity_score).
    Compute mean running down the rows.  Could be updated to use a weighted sum of scores.
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute median of toxicity scores for input array.
    Input is a 2D array with columns (worker_id, toxicity_score).
    Compute median running down the rows.  Could be updated to use a weighted sum of scores.
    """
    s = np.median(score_list[:,1])
    return s


In [169]:
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)


In [182]:
df_com['toxic']=df_com['median_toxic']==-2
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]


In [183]:
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159686. Toxic comments: 1610. Toxic Fraction: 0.01008228648723119


In [None]:
#can combine the dataframes together by extracting all reviewer ids, and scores for each 

Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).

My rough plan is to build up a lexicon, tokenize the data, and try to build a Naive Bayes model.  (Maybe later a recurrent neural network model?)

Other Analysis possibilities:
* Support Vector Machine
    - the other big architecture, less popular now?
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
* Naive Bayes
    - can find the most important words in toxic comments (as in spam filtering)
    - simple, easy to understand baseline.
* Latent Factor Analysis 
    - maybe useful prelude for building up embeddings.
    
Cleaning:
* Tokenize (convert words to indices)
* Clean data: remove newline markers (search/replace: NEWLINE_TOKEN with ' ')
* Stemming words
* Balancing data set
* Match up comments, and review scores

* Search for gibberish words (make a new "feature" for badly spelled comments)

In [73]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate

#maybe also dates?
def clean_up(comments):
    #literal tokens: no regex needed
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ',regex=False)
    com_clean=com_clean.str.replace('TAB_TOKEN',' ',regex=False)
    #Remove HTML attribute trash, via non-greedy replacing of anything between double backticks
    for attr in ['style','class','width','align','cellpadding','cellspacing','rowspan','colspan']:
        com_clean=com_clean.str.replace(attr+r"=``.*?``",' ',regex=True)
    #remove numbers
    com_clean=com_clean.str.replace(r"[0-9]+",' ',regex=True)
    #remove underscores
    com_clean=com_clean.str.replace("_",' ',regex=False)
    #remove wiki markup symbols (literal matches, so '|' and '[' need no escaping)
    for sym in ['[[',']]','==','|','::']:
        com_clean=com_clean.str.replace(sym,' ',regex=False)
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace(r'\s+',' ',regex=True)
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])
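
To see what the non-greedy pattern does, here is a small sketch on a single string using the standard `re` module. The helper `clean_text`, the sample string, and the final `.strip()` are assumptions for illustration; only a few of the substitutions from `clean_up` are reproduced:

```python
import re

def clean_text(text):
    """Apply a few of the clean_up substitutions to one string."""
    text = text.replace('NEWLINE_TOKEN', ' ').replace('TAB_TOKEN', ' ')
    # Non-greedy: the match stops at the first closing pair of backticks.
    text = re.sub(r"style=``.*?``", ' ', text)
    # Remove digits, then collapse runs of whitespace.
    text = re.sub(r"[0-9]+", ' ', text)
    text = re.sub(r"\s+", ' ', text)
    return text.strip()

# Invented sample, not a real comment from the data set.
sample = "NEWLINE_TOKENstyle=``color:red`` Hello ``quoted`` 123 worldTAB_TOKEN"
print(clean_text(sample))
```

With a greedy `.*` the `style=` pattern would run on to the last pair of backticks in the string and swallow the quoted word as well, which is why the non-greedy `.*?` matters here.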

In [74]:
(df_com['split']=='train').sum()

95692

In [72]:
df_com[df_com['split']=='train']['comment_clean']

0         This: :One can make an analogy in mathematical terms by envisioning the distribution of opinions...
1         ` :Clarification for you (and Zundark's right, i should have checked the Wikipedia bugs page fir...
3         `This is such a fun entry. Devotchka I once had a coworker from Korea and not only couldn't she ...
6         ` I fixed the link; I also removed ``homeopathy`` as an exampleit's not anything like a legitima...
7         `If they are ``indisputable`` then why does the NOAA dispute it? Note that the NOAA is the same ...
9         ` The concept of ``viral meme`` is not a mainstream academic concept, and only merits the briefe...
11        `just quick notes, since i don't have the time or background to write well on it... purpose: fun...
12        `The actual idea behind time-out is to get the parent to cool-off. They are the real problem in ...
15        ` Gjalexei, you asked about whether there is an ``anti-editorializing`` policy here. There is, a...
18        

In [83]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_counts.shape

(95692, 125568)

In [325]:
def cond_prob(X_counts,toxic,csmooth=1):
    """cond_prob
    X_counts - sparse matrix of counts of each word in a given message
    toxic - whether each message was toxic or not, with 0,1
    csmooth - additive (Laplace) smoothing constant
    Returns P(toxic), P(word|toxic) and P(word|clean).
    """
    nrows,nwords=X_counts.shape
    ptoxic = np.sum(toxic)/nrows
    
    toxic_mat=X_counts[toxic==1,:]
    clean_mat=X_counts[toxic==0,:]
    #sum across messages
    nword_toxic=np.sum(toxic_mat,axis=0)
    nword_clean=np.sum(clean_mat,axis=0)    

    #estimate probability of word given toxicity by number of times
    #that word occurs in toxic documents, divided by the total number of words
    #in toxic documents
    #Laplace smooth version
    pword_toxic= (nword_toxic+csmooth) \
                / (np.sum(toxic_mat)+nwords*csmooth)

    pword_clean= (nword_clean+csmooth) \
                /(np.sum(clean_mat)+nwords*csmooth)
    return ptoxic,pword_toxic,pword_clean    

ptox,pw_tox,pw_cln = cond_prob(X_train_counts,df_train['toxic'].values,csmooth=0.01)
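
The smoothed estimate in `cond_prob` is (count of word in class + c) / (total words in class + c * vocabulary size). A toy sketch of that arithmetic in plain Python follows. The helper `smoothed_word_probs` and the counts are hypothetical, chosen only so the numbers are easy to check by hand:

```python
def smoothed_word_probs(word_counts, csmooth=1.0):
    """Additive (Laplace) smoothing of word counts within one class."""
    total = sum(word_counts.values())   # total words seen in the class
    nwords = len(word_counts)           # vocabulary size
    denom = total + nwords * csmooth
    return {w: (n + csmooth) / denom for w, n in word_counts.items()}

# Hypothetical counts of three vocabulary words within the toxic class.
toxic_counts = {'idiot': 3, 'thanks': 0, 'article': 1}
probs = smoothed_word_probs(toxic_counts, csmooth=1.0)
print(probs)
```

A word never seen in a class gets probability c / (total + c * vocabulary size) rather than zero, so the choice of `csmooth` sets the floor on every word probability. The smoothed probabilities still sum to one over the vocabulary.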

In [326]:
#Version from spam code.
# Calculate probabilities that a given word occurs in spam emails.
# Also calculate probability that emails are spam.  
def calc_prob(word_mat,cat_vec):
    #Find dimensions of matrix
    nrows,nwords=word_mat.shape
    #Training uses the frequency of words.
    #Based on previous parts of the question, I was just using whether
    # a word occurred.  (Following the notes?)
    # spamword_occured=(word_full[cat_vec==1,:]>0)
    # word_occured=(word_full[cat_vec==0,:]>0)
    spam_mat=word_mat[cat_vec==1,:]
    ham_mat=word_mat[cat_vec==0,:]

    #Another mistake in my probabilities?  I had computed the number of spam emails.
    #Should compute prob of word occurring in spam based on total words in spam.
    spamword_num=np.sum(spam_mat)
    hamword_num=np.sum(ham_mat)
    #Then find the share of all spam (or ham) words accounted for by each word.
    #Fixed version.  Should be the smoothed version of prob(word|spam)
    prob_spamword=(np.sum(spam_mat,0)+1)/(spamword_num+nwords)
    prob_hamword=(np.sum(ham_mat,0)+1)/(hamword_num+nwords)

    #probability for messages to be spam.
    pspam = np.sum(cat_vec)/nrows
    #Return probability that email is spam, prob for word in spam,
    #and prob for word in not-spam email.
    print([1/(hamword_num+nwords),1/(spamword_num+nwords)])
    return pspam,prob_spamword,prob_hamword

ptox2,pw_tox2,pw_cln2=calc_prob(X_train_counts,df_train['toxic'].values)


[3.2633656852529483e-07, 5.9284202538549556e-06]


# Checking the vectorizer and finding common words

In [327]:
#get vocabulary dictionary
voc_dict=count_vect.vocabulary_
#make a dataframe, with entries as rows
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
#sort by row entry value, and then use that as the index for the counts.
voc_df1=voc_df.sort_values(by=0)

In [328]:
word_mat=np.array([X_train_counts.sum(axis=0),pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','p_clean','p_toxic'],index=voc_df1.index)
word_df.sort_values('p_toxic',ascending=False,inplace=True)
print(word_df)

                 count       p_clean       p_toxic
fuck            6179.0  5.836751e-04  1.005937e-01
suck            2669.0  1.853770e-04  4.787399e-02
faggot          2271.0  3.493221e-04  2.803929e-02
die             2062.0  3.483017e-04  2.339616e-02
bitch           1284.0  1.489826e-04  1.906859e-02
sucks           1282.0  1.486424e-04  1.904605e-02
fucking         1938.0  3.904784e-04  1.780638e-02
ass             1275.0  2.142885e-04  1.453816e-02
fucksex          624.0  3.401351e-09  1.406483e-02
asshole          795.0  1.112276e-04  1.054868e-02
fucker           537.0  2.381285e-05  1.052614e-02
wikipedia      29767.0  9.966641e-03  1.048106e-02
shit            1689.0  4.207505e-04  1.018805e-02
dick            1022.0  2.000028e-04  9.782341e-03
huge             893.0  1.608873e-04  9.466789e-03
cocksucker       425.0  6.806102e-06  9.128697e-03
mothjer          391.0  3.401351e-09  8.813145e-03
cock             561.0  5.782636e-05  8.813145e-03
rape             747.0  1.29254

In [332]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of a word between the count vectorizer
    and a naive 'contains' search.  Returns the vectorizer's total count,
    the number of comments matched by the naive search, the comments
    matched by each method, and any discrepancies.
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    #find comments containing the word according to the vectorizer
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive (case-insensitive, literal substring) search
    naive_msk=comments.str.contains(word,case=False,regex=False).values
    naive_count=np.sum(naive_msk)
    #index the full comment series each time; do not overwrite it
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [335]:
vc,cc,com,ncom,dcom=check_vect(X_train_counts,df_train['comment_clean'],voc_df,'gay')
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.
#does not account for leetspeak 13375|o34|k (but who uses that these days?)

In [336]:
#currently searching for "jew", a term that has clean connotations, but can be used as anti-semitic.
print('Vect: {}, Naive: {}'.format(vc,cc))
print([com.head(),ncom.head()])

Vect: [[1866]], Naive: 638
[588     ` Pro-Gay Bias? To be honest, I am not sure I entirely follow. What is exactly is meant by holdi...
790     ` There is one other thing that just occured to me: an encyclopedia deals in facts, not in specu...
871     ` Dante, I agree that it is the use of the word that constructs a POV. Of course, in a technical...
1443    Stop editing it. I'm from Santa Clarita, I went to Saugus High School so I know that they are, i...
1467     WHY ARE YOU SUCH A GAY NIGGER?!?! GOD DAMNDD... YOU ARE SUCH A GAY NIGGER. FUCK FUCK SHTAY AWAY...
Name: comment_clean, dtype: object, 588     ` Pro-Gay Bias? To be honest, I am not sure I entirely follow. What is exactly is meant by holdi...
790     ` There is one other thing that just occured to me: an encyclopedia deals in facts, not in specu...
871     ` Dante, I agree that it is the use of the word that constructs a POV. Of course, in a technical...
1443    Stop editing it. I'm from Santa Clarita, I went to Saugus High S

I noticed that there are very few obvious racist slurs in the unanimous data set (though lots of sexism and general hate).
Weird sociological question: does this reflect how racism is perceived, perhaps by American reviewers?

(Searching for the n-word found these.)
Also, some ratings seem way off, e.g. the scores for comments 1467 and 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments. And some raters thought that 2670 was neutral too!  What the hell?!

In [153]:
ind

array([3279])

                     count       p_clean   p_toxic
kindom              6179.0  5.814706e-04  0.080174
fiss                2669.0  1.846977e-04  0.038157
rickson             2271.0  3.480150e-04  0.022349
uppityness          2062.0  3.469985e-04  0.018648
sincerest           1284.0  1.484426e-04  0.015199
michelin            1282.0  1.481037e-04  0.015181
donk                1938.0  3.890137e-04  0.014193
assertion           1275.0  2.134984e-04  0.011588
instinctively        624.0  3.388326e-08  0.011211
bs                   795.0  1.108321e-04  0.008409
carpentry            537.0  2.375217e-05  0.008391
wtekni             29767.0  9.928507e-03  0.008355
reqmove             1689.0  4.191698e-04  0.008121
bhaduria            1022.0  1.992675e-04  0.007798
macmanus             893.0  1.603017e-04  0.007547
foliage              425.0  6.810535e-06  0.007277
identifications      391.0  3.388326e-08  0.007026
gmaxwell             561.0  5.763543e-05  0.007026
scienceapologiest    747.0  1.2

In [None]:
Looks like my word mapping is completely screwed?
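
One possible cause, assuming the scrambled table above was built with an index taken straight from the vocabulary dictionary (the cell that produced it is not shown): `vocabulary_` maps each term to its column index, but the dictionary is not guaranteed to iterate in column order. The sketch below uses a toy vocabulary and made-up column sums to show the difference between zipping counts with the raw key order and with keys sorted by column index:

```python
# Toy vocabulary in first-seen order; values are column indices.
vocab = {'zep': 4, 'blah': 1, 'cow': 2, 'alt': 0, 'foo': 3}
# Column sums of the count matrix, in column order (index 0..4).
col_counts = [2, 3, 1, 1, 1]

# Wrong: pairs column 0 with whichever term happens to come first in the dict.
terms_wrong = list(vocab)
wrong = dict(zip(terms_wrong, col_counts))

# Right: order the terms by their column index before pairing.
terms_right = sorted(vocab, key=vocab.get)
right = dict(zip(terms_right, col_counts))

print(wrong)
print(right)
```

Sorting by the stored index is what `voc_df.sort_values(by=0)` does in the earlier cell. Recent scikit-learn versions also expose `get_feature_names_out()`, which returns the terms in column order.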

In [30]:
def naive_bayes(mat,pword_tox,pword_cln,ptox):
    """Compute a toxicity score for each message
    from the per-word posteriors P(toxic|word).
    Note: this multiplies the per-word posteriors together, so a message
    with no known words scores exp(0)=1.
    """
    #ptox_word = (pword_tox * p_tox)/(pword_cln*pcln + pword_tox * p_tox)
    #Need to multiply together.  Can just add the logs, and exponentiate at
    #the end

    #do once for all words
    log_pword = np.log(pword_tox*ptox/(pword_tox*ptox + pword_cln*(1-ptox)))
    log_score = mat.dot(log_pword.T)
    score=np.exp(log_score)
    return score
scores=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    

(1, 9351)




In [31]:
plt.hist(scores)
plt.show()

<matplotlib.figure.Figure at 0x7efba3f0e358>

In [32]:
#Look at the messages classified as toxic.
#scores were computed on the training rows, so mask df_train (flattened to 1D)
msk = np.asarray(scores>0.5).ravel()
df_train[msk][['comment_clean','toxic']]

                                                                                            comment_clean  \
18                                                                                       :Are you there?    
22                                                                                               * Keep.    
25                                                                                               * Keep.    
73                                                                                                          
82                                                                                                     —    
86                                                                                                ` : : `   
139                                                                                      ::He always is.    
159                                                                                 What else is on her?    
165                
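
The short and empty comments listed above are what the score in `naive_bayes` favours: it multiplies per-word posteriors P(toxic|word), so a comment with no vocabulary words scores exp(0) = 1, and each extra word can only lower the score. A sketch of the usual multinomial naive Bayes posterior in log space follows, for comparison. The function `toxic_posterior`, the word probabilities, and the prior are made-up illustrative values, not the fitted ones:

```python
import math

def toxic_posterior(word_counts, pword_tox, pword_cln, ptox):
    """P(toxic | comment) under multinomial naive Bayes, via log-odds."""
    # Start from the prior log-odds, then add each word's log-likelihood ratio.
    log_odds = math.log(ptox) - math.log(1.0 - ptox)
    for word, n in word_counts.items():
        log_odds += n * (math.log(pword_tox[word]) - math.log(pword_cln[word]))
    # Logistic function, written to avoid overflow for large |log_odds|.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)

# Hypothetical smoothed word probabilities and prior.
pword_tox = {'idiot': 0.05, 'article': 0.001}
pword_cln = {'idiot': 0.0005, 'article': 0.01}
ptox = 0.01

p_empty = toxic_posterior({}, pword_tox, pword_cln, ptox)
p_insult = toxic_posterior({'idiot': 2}, pword_tox, pword_cln, ptox)
p_neutral = toxic_posterior({'article': 1}, pword_tox, pword_cln, ptox)
print(p_empty, p_insult, p_neutral)
```

In this formulation an empty comment falls back to the prior `ptox`, and only words that are relatively more likely in toxic comments push the posterior up.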

In [27]:
np.min(pw_tox)

4.4887332794685343e-05

In [79]:
word_mat.squeeze()

array([[  1.00000000e+00,   2.00000000e+00,   1.00000000e+00, ...,   1.00000000e+00,   1.00000000e+00,   1.00000000e+00],
       [  5.95663569e-05,   8.93495354e-05,   5.95663569e-05, ...,   5.95663569e-05,   5.95663569e-05,   5.95663569e-05],
       [  4.48873328e-05,   4.48873328e-05,   4.48873328e-05, ...,   4.48873328e-05,   4.48873328e-05,   4.48873328e-05]])

In [166]:
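test=['Zep the blah blah','Cow Alt the blah Alt','foo the if']
#use a separate vectorizer so the one fitted on the training data is not refit
test_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
test2=test_vect.fit_transform(test)

In [167]:
test_vect.vocabulary_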

{'alt': 0, 'blah': 1, 'cow': 2, 'foo': 3, 'zep': 4}

array([[0, 2, 0, 0, 1],
       [2, 1, 1, 0, 0],
       [0, 0, 0, 1, 0]], dtype=int64)

In [86]:
X_train_counts.sum(axis=0).min()

1

In [None]:
#get term frequencies via tfidf transformer?

So, I briefly embarrassed myself by looking for racist slurs,
and found none of the obvious American candidates.

14

In [11]:
np.sum(df_com['toxic'])

395

In [279]:
naughty_word=['fuck','shit','cunt','piss','cocksucker','dick','ass','asshole','bitch']
for word in naughty_word:
    try:
        print(word,count_vect.vocabulary_[word])
    except KeyError:
        #word is not in the fitted vocabulary
        print(word,'not found')

fuck 57401
shit 136209
cunt 35357
piss 114922
cocksucker 29180
dick 40351
ass 10569
asshole 10688
bitch 17516
