# Toxicity in Wikipedia Comments

This is a parallel work to any work on the wikipedia toxicity data on the same
topic.  This data has not been cleaned yet, and has not had multiple categories introduced yet.  However it is presented free from bias, for people to play with.  

In [1]:
import pandas as pd
import nltk as nltk
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
#df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

#reindex 
#df_com.index=df_com['rev_id']
#df_rate.index=df_rate['rev_id']
print(df_com.shape, df_rate.shape)

(3582, 7) (35152, 4)


In [281]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes.  Those set the boundaries for changes.
#Need to also append final index.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values
change_indices=np.append(change_indices,len(df_rate))

for i in range(len(change_indices)-1):
    ind0 = change_indices[i]
    ind1 = change_indices[i+1]
    d0=df_rate.iloc[ind0:ind1]
    scores=d0[['worker_id','toxicity_score']]
    #pass it a list so it can be set as an entry.
    #accessing later will require score[0] idiocy to get at list.
    df_com.loc[i,'scores']=[scores.values]

In [5]:
?np.array

In [5]:
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['toxic']=df_com['mean_toxic']!=0

In [282]:
def score_mean(score_list):
    """score_mean
    Compute mean of toxicity scores for input array.
    Array is first (and only) element in the input list.
    Compute mean running down the rows.  Could be updated to include weighted sum of weights
    """
    s_arr=score_list[0]
    s = np.mean(s_arr[:,1])
    return s



In [6]:
df_com['toxic'].sum()

395

In [None]:
#can combine the dataframes together by extracting all reviewer ids, and scores for each 

Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).

My rough plan is to build up a lexicon, tokenize that data, and try to build a Naive Bayes model.  (Maybe later a Recurrent Neural network model?)

Other Analysis possibilities:
* Support Vector Machine
    - the other big architecture, less popular now?
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
* Naive Bayes
    - can find most important words in spam
    - simple, easy to understand baseline.
* Latent Factor Analysis 
    - maybe useful prelude for building up embeddings.
    
Cleaning:
* Tokenize (convert words to indices)
* Clean data : How to remove newlines (search/replace: NEWLINE with '')
* Stemming words
* Balancing data set
* Match up comments, and review scores

* Search for gibberish words (make a new "feature" for badly spelled comments)

In [259]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate

#maybe also dates?
def clean_up(comments):
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ')
    com_clean=com_clean.str.replace('TAB_TOKEN',' ')    
    #Remove HTML trash, via non-greedy replacing anything between backticks
    com_clean=com_clean.str.replace("style=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("class=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("width=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("align=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("cellpadding=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("cellspacing=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("rowspan=\`\`.*?\`\`",' ')
    com_clean=com_clean.str.replace("colspan=\`\`.*?\`\`",' ')
    #remove numbers
    com_clean=com_clean.str.replace("[0-9]+",' ')
    #remove symbols.    
    com_clean=com_clean.str.replace("\[\[",' ')
    com_clean=com_clean.str.replace("\]\]",' ')
    com_clean=com_clean.str.replace("==",' ')
    com_clean=com_clean.str.replace("|",' ')                                
    
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace('\\s+',' ')

    #remove symbols
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])

In [260]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_com['comment_clean'])
X_train_counts.shape

(159686, 172992)

In [195]:
def cond_prob(X_counts,toxic):
    """bayes_prob
    X_counts - sparse matrix of counts of each word in a given message
    toxic - whether word was toxic or not, with 0,1
    """
    ptoxic = np.sum(toxic)/len(toxic)

    toxic_mat=X_counts[toxic==1,:]
    clean_mat=X_counts[toxic!=1,:]
    #sum across messages
    nword_toxic=np.sum(toxic_mat,axis=0)
    nword_clean=np.sum(clean_mat,axis=0)    
    nwords=len(nword_clean)
    #estimate probability of word given toxicity by number of times
    #that word occurs in toxic documents, divided by the total number of words
    #in toxic documents
    #Laplace smooth version
    #pword_toxic=(nword_toxic+1)/np.sum(nword_toxic+nwords)
    #pword_clean=(nword_clean+1)/np.sum(nword_clean+nwords)
    #raw version    
    pword_toxic=(nword_toxic)/np.sum(nword_toxic)
    pword_clean=(nword_clean)/np.sum(nword_clean)    

    return ptoxic,pword_toxic,pword_clean    

ptox,pw_tox,pw_cln = cond_prob(X_train_counts,df_com['toxic'].values)

In [197]:
#get vocabulary dictionary
voc_dict=count_vect.vocabulary_
#make a dataframe, with entries as rows
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
#sort by row entry value
voc_df1=voc_df.sort_values(by=0)

In [241]:
%pdb off

Automatic pdb calling has been turned OFF


In [245]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):

    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    com_count=np.sum(comments.str.contains('{}'.format(word)))
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    comments=comments[msk]
    return vect_count,com_count,comments

vc,cc,com=check_vect(X_train_counts,df_com['comment_clean'],voc_df,'fuck')

In [247]:
print('Vect: {}, Com: {}'.format(vc,cc))
print(com)

Vect: [[14]], Com: 17
97       Excuse me, whoever the Fuck you are SPUI says on his talk page that he likes vandalism, and doe...
339                                                                             I got your message Fuck you
729      FUCK YOU YOU FUCKING FAGGOT!!!! this is Sean MC Sean. You motherfuckers blocked me for a long time
939                                 What... What are you doing? Don't fuck me. I fuck to Ryulong, not you. 
972      FUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OFFFUCK OF...
1310                Hey go fuck yourself. Why don't you go edit my talk page to show how offended you are. 
1409                                     What the fuck are you talking about? They did lose a couple times.
1520     Fuck You Go suck on a fat dick faggot mother fucker. I'm the the wrong person to fuck with you ...
2000                                                           wtf? this aint me! who the fuck used ma ip?!
2040  

In [153]:
ind

array([3279])

In [128]:
word_mat=np.array([X_train_counts.sum(axis=0),pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','p_clean','p_toxic'],index=voc_df.index)

In [135]:
word_df.loc['dick']

count      2.000000
p_clean    0.000041
p_toxic    0.000077
Name: dick, dtype: float64

In [44]:
word_df.sort_values('p_toxic')

      count   p_clean   p_toxic           word
0       1.0  0.000041  0.000000           like
5458    2.0  0.000083  0.000000         carpio
5457    2.0  0.000083  0.000000           belz
5456    1.0  0.000041  0.000000       applying
5455    6.0  0.000248  0.000000          durra
5454    1.0  0.000041  0.000000      cornelius
5452    1.0  0.000041  0.000000       partners
5451    1.0  0.000041  0.000000        stories
5448    3.0  0.000124  0.000000        peoples
5447    1.0  0.000041  0.000000      critisize
5446    1.0  0.000041  0.000000        oldlady
5445    2.0  0.000083  0.000000         jejune
5444    1.0  0.000041  0.000000        rethink
5443    1.0  0.000041  0.000000            mau
5437    3.0  0.000124  0.000000        demonic
5435    1.0  0.000041  0.000000             ft
5434    1.0  0.000041  0.000000        biggest
5433    1.0  0.000041  0.000000      establish
5432    1.0  0.000041  0.000000        wanting
5430    1.0  0.000041  0.000000      discribes
5429    2.0  

In [None]:
Looks like my word mapping is completely screwed?

In [30]:
def naive_bayes(mat,pword_tox,pword_cln,ptox):
    """Compute probability that a message 
    is toxic via naive_bayes estimate.
    """
    msk = mat>0
    #ptox_word = (pword_tox * p_tox)/(pword_cln*pcln + pword_tox * p_tox)
    #Need to multiply together.  Can just add the logs, and exponentiate at
    #the end

    #do once for all words
    log_pword = np.log(pword_tox*ptox/(pword_tox*ptox + pword_cln*(1-ptox)))
    log_score = mat.dot(log_pword.T)
    score=np.exp(log_score)
    return score
scores=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    

(1, 9351)


  # This is added back by InteractiveShellApp.init_path()


In [31]:
plt.hist(scores)
plt.show()

<matplotlib.figure.Figure at 0x7efba3f0e358>

In [32]:
#Look at the messages classified as spam.
msk = np.array(scores>0.5)
df_com[msk][['comment_clean','toxic']]

                                                                                            comment_clean  \
18                                                                                       :Are you there?    
22                                                                                               * Keep.    
25                                                                                               * Keep.    
73                                                                                                          
82                                                                                                     —    
86                                                                                                ` : : `   
139                                                                                      ::He always is.    
159                                                                                 What else is on her?    
165                

In [27]:
np.min(pw_tox)

4.4887332794685343e-05

In [79]:
word_mat.squeeze()

array([[  1.00000000e+00,   2.00000000e+00,   1.00000000e+00, ...,   1.00000000e+00,   1.00000000e+00,   1.00000000e+00],
       [  5.95663569e-05,   8.93495354e-05,   5.95663569e-05, ...,   5.95663569e-05,   5.95663569e-05,   5.95663569e-05],
       [  4.48873328e-05,   4.48873328e-05,   4.48873328e-05, ...,   4.48873328e-05,   4.48873328e-05,   4.48873328e-05]])

In [166]:
test=['Zep the blah blah','Cow Alt the blah Alt','foo the if']
test2=count_vect.fit_transform(test)

In [167]:
count_vect.vocabulary_

{'alt': 0, 'blah': 1, 'cow': 2, 'foo': 3, 'zep': 4}

array([[0, 2, 0, 0, 1],
       [2, 1, 1, 0, 0],
       [0, 0, 0, 1, 0]], dtype=int64)

In [86]:
X_train_counts.sum(axis=0).min()

1

In [None]:
#get term frequencies via tfidf transformer?

So, I briefly embarassed myself by looking for racist slurs,
and found none of the obvious American candidates.

14

In [11]:
np.sum(df_com['toxic'])

395

In [279]:
naughty_word=['fuck','shit','cunt','piss','cocksucker','dick','ass','asshole','']
for word in naughty_word:
    try:
        print(word,count_vect.vocabulary_[word])
    except:
        print(word,'not found')

fuck 57401
shit 136209
cunt 35357
piss 114922
cocksucker 29180
dick 40351
ass 10569
asshole 10688
bitch 17516
