# Toxicity in Wikipedia Comments

This notebook is an independent, parallel effort to other work on the Wikipedia toxicity data.
The data has not been cleaned yet, and has not yet been split into multiple toxicity categories.  However, it is presented as-is, without any filtering on my part, for people to play with.

Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).

My rough plan is to build up a lexicon, tokenize the data, and build a Naive Bayes model.  (Maybe later a recurrent neural network model?)

Other analysis possibilities:
* Support Vector Machine
    - the other big architecture, less popular now?
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use pretrained ones.
* Naive Bayes
    - can find the most important words (as it does for spam)
    - simple, easy-to-understand baseline.
* Latent Factor Analysis
    - maybe a useful prelude or alternative for building up embeddings.
    
Cleaning:
* Clean data: remove newlines (search/replace NEWLINE_TOKEN with ' ')
* Tokenize (convert words to indices)
* Stem words
* Balance the data set
* Match up comments and review scores
* Search for gibberish words (make a new "feature" for badly spelled comments)
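
As a minimal sketch of the "tokenize (convert words to indices)" step above, in plain Python: the `comments` list is made up, and `lexicon` and `tokenized` are illustrative names rather than anything defined in this notebook. The CountVectorizer used further down does this job (and more) on the real data.

```python
# Hypothetical mini-corpus; the real comments live in df_com['comment'].
comments = ["you are great", "you are you"]

lexicon = {}      # word -> integer index, in order of first appearance
tokenized = []    # each comment as a list of word indices
for text in comments:
    indices = []
    for word in text.lower().split():
        if word not in lexicon:
            lexicon[word] = len(lexicon)
        indices.append(lexicon[word])
    tokenized.append(indices)

print(lexicon)     # {'you': 0, 'are': 1, 'great': 2}
print(tokenized)   # [[0, 1, 2], [0, 1, 0]]
```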

In [1]:
import pandas as pd
import nltk
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

#reindex 
#df_com.index=df_com['rev_id']
#df_rate.index=df_rate['rev_id']
print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [3]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes.  Those set the boundaries for changes.
#Need to also append final index.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split instead.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr
#change_indices=np.append(change_indices,len(df_rate))
# for i in range(len(change_indices)-1):
#     ind0 = change_indices[i]
#     ind1 = change_indices[i+1]
#     d0=df_rate.iloc[ind0:ind1]
#     scores=d0[['worker_id','toxicity_score']]
#     #pass it a list so it can be set as an entry.
#     #accessing later will require score[0] idiocy to get at list.
#     df_com.loc[i,'scores']=[scores.values]
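
The same grouping idea can be sketched in plain Python with `itertools.groupby`, which also relies on the rows being sorted by `rev_id`. The rows below are made up, and `scores_by_comment` is an illustrative name. Keying the groups by `rev_id` is one way to avoid depending on both files listing comments in the same order.

```python
from itertools import groupby

# Hypothetical (rev_id, worker_id, toxicity_score) rows, assumed sorted by rev_id.
ratings = [
    (101, 7, -1), (101, 8, 0), (101, 9, -2),
    (205, 7, 1), (205, 3, 0),
]

# Consecutive rows that share a rev_id form one group.
scores_by_comment = {
    rev_id: [(worker, score) for _, worker, score in rows]
    for rev_id, rows in groupby(ratings, key=lambda row: row[0])
}

print(scores_by_comment[101])   # [(7, -1), (8, 0), (9, -2)]
print(scores_by_comment[205])   # [(7, 1), (3, 0)]
```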

In [4]:
def score_mean(score_list):
    """score_mean
    Compute the mean of the toxicity scores for one comment.
    score_list is a 2-D array with one row per rater:
    column 0 is worker_id, column 1 is toxicity_score.
    Could be updated to use a weighted mean over raters.
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute the median of the toxicity scores for one comment.
    score_list is a 2-D array with one row per rater:
    column 0 is worker_id, column 1 is toxicity_score.
    """
    s = np.median(score_list[:,1])
    return s


In [214]:
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)

In [227]:
df_com['toxic']= (df_com['median_toxic']<=-2)
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159686. Toxic comments: 1610. Toxic Fraction: 0.01008228648723119


In [None]:
#can combine the dataframes together by extracting all reviewer ids, and scores for each 

In [8]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate unwanted text

#maybe also dates?
def clean_up(comments):
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ',regex=False)
    com_clean=com_clean.str.replace('TAB_TOKEN',' ',regex=False)
    #Remove HTML trash, via non-greedy replacing anything between double backticks
    html_attrs=['style','class','width','align','cellpadding','cellspacing','rowspan','colspan']
    for attr in html_attrs:
        com_clean=com_clean.str.replace(attr+r"=``.*?``",' ',regex=True)
    #remove numbers
    com_clean=com_clean.str.replace(r"[0-9]+",' ',regex=True)
    #remove underscores
    com_clean=com_clean.str.replace("_",' ',regex=False)
    #remove runs of symbols that end in a backtick
    com_clean=com_clean.str.replace(r"[\[\[\{\}=_:|\(\)\\/]+`",' ',regex=True)
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace(r'\s+',' ',regex=True)
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])
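
To see what a few of these substitutions do, here is a small sketch using the standard `re` module on a single string. The `raw` comment is invented in the style of the dataset, and only a subset of the rules is applied.

```python
import re

# Hypothetical raw comment in the style of the dataset.
raw = "NEWLINE_TOKENHello   world 123TAB_TOKEN{| class=``wikitable`` |}"

text = raw.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
text = re.sub(r"class=``.*?``", " ", text)   # non-greedy: stops at the first closing ``
text = re.sub(r"[0-9]+", " ", text)          # drop runs of digits
text = re.sub(r"\s+", " ", text)             # collapse runs of whitespace

print(repr(text))   # ' Hello world {| |}'
```

The leftover `{| |}` shows that table markup without a trailing backtick survives these rules, so the symbol-stripping step may need another look.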

In [228]:
#separate off training_split
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]
(df_com['split']=='train').sum()

95692

In [229]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_counts.shape

(95692, 125568)

# Checking the vectorizer and finding common words

In [283]:
#get vocabulary dictionary
voc_dict=count_vect.vocabulary_
#make a dataframe, with entries as rows
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
#sort by row entry value, and then use that as the index for the counts.
voc_df1=voc_df.sort_values(by=0)

In [320]:
voc_df1.iloc[29143]

0    29143
Name: dick, dtype: int64

In [296]:
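#pw_tox, pw_cln and ptox come from cond_prob in the Naive Bayes section below.
#posterior P(toxic|word) via Bayes' rule
X_cond= pw_tox*ptox/(pw_tox*ptox + pw_cln*(1-ptox))
word_mat=np.array([X_train_counts.sum(axis=0),X_cond,pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','pcond','p_clean','p_toxic'],index=voc_df1.index)
#sort in place; sort_values(inplace=True) returns None, so there is nothing to print
word_df.sort_values('pcond',ascending=False,inplace=True)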


In [440]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of a word between the count vectorizer
    and a naive 'contains' search.  Returns the vectorizer's total count,
    the number of comments matched by the naive search, the comments
    matched by each method, and the comments where the two disagree.
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    #total occurrences of the token across all comments
    vect_count=(xtot[0,ind])
    #find comments with words
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive (substring, case-insensitive) search
    naive_msk=comments.str.contains(word,case=False).values
    naive_count=np.sum(naive_msk)
    #index the original Series with each mask so the masks stay aligned
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [335]:
vc,cc,com,ncom,dcom=check_vect(X_train_counts,df_train['comment_clean'],voc_df,'gay')
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.
#does not account for leetspeak, e.g. 13375|o3@|< (but who uses that these days?)
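
For reference, accent stripping can be sketched with the standard `unicodedata` module. As far as I know this is roughly what `strip_accents='unicode'` does inside the vectorizer, but treat that as an assumption; `strip_accents` below is my own helper name and the sample strings are made up.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("ídiöt"))   # idiot
print(strip_accents("1d10t"))   # 1d10t  (leetspeak is left untouched)
```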

In [336]:
#'gay' was searched above; the same check applies to "jew", a term that has clean connotations, but can be used in anti-semitic abuse.
print('Vect: {}, Naive: {}'.format(vc,cc))
print([com.head(),ncom.head()])

Vect: [[1866]], Naive: 638
[588     ` Pro-Gay Bias? To be honest, I am not sure I entirely follow. What is exactly is meant by holdi...
790     ` There is one other thing that just occured to me: an encyclopedia deals in facts, not in specu...
871     ` Dante, I agree that it is the use of the word that constructs a POV. Of course, in a technical...
1443    Stop editing it. I'm from Santa Clarita, I went to Saugus High School so I know that they are, i...
1467     WHY ARE YOU SUCH A GAY NIGGER?!?! GOD DAMNDD... YOU ARE SUCH A GAY NIGGER. FUCK FUCK SHTAY AWAY...
Name: comment_clean, dtype: object, 588     ` Pro-Gay Bias? To be honest, I am not sure I entirely follow. What is exactly is meant by holdi...
790     ` There is one other thing that just occured to me: an encyclopedia deals in facts, not in specu...
871     ` Dante, I agree that it is the use of the word that constructs a POV. Of course, in a technical...
1443    Stop editing it. I'm from Santa Clarita, I went to Saugus High S
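
Part of the gap between the two counts is that they measure different things. A minimal sketch with made-up comments: a substring search counts comments and also matches inside longer words, while the vectorizer counts every whole-token occurrence. The token pattern below mirrors CountVectorizer's default of two or more word characters.

```python
import re

# Hypothetical comments.
comments = ["The class will pass", "what an ass", "ass ass ass"]
word = "ass"

# Substring search: one hit per comment, and it matches inside "class" and "pass".
naive_hits = sum(word in c.lower() for c in comments)

# Token search: every whole-word occurrence is counted.
token_pattern = re.compile(r"\b\w\w+\b")
token_hits = sum(
    tok == word
    for c in comments
    for tok in token_pattern.findall(c.lower())
)

print(naive_hits, token_hits)   # 3 4
```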

In [45]:
naughty_word=['fuck','shit','cunt','piss','cocksucker','dick','ass','nigger']
word_counts=X_train_counts.sum(axis=0)
for word in naughty_word:
    try:
        ind=count_vect.vocabulary_[word]
        print(word,'count: {}'.format(word_counts[0,ind]))
    except KeyError:
        print(word,'not found')

fuck count: 6179
shit count: 1689
cunt count: 619
piss count: 151
cocksucker count: 425
dick count: 1022
ass count: 1275
nigger count: 1238


I noticed that there are very few obvious racist slurs in the unanimous data set (lots of sexism and general hate, though).
This raises an odd sociological question about how toxic racism is perceived to be, perhaps by American reviewers.

(Searching for the n-word found these.)
Some ratings seem way off, e.g. the scores for comments 1467 and 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments. And some raters thought that 2670 was neutral too!  What the hell?!
This suggests using the median toxicity score, which is less sensitive to a few outlying raters than the mean.
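
A quick illustration of the mean versus median point, with invented ratings for a single comment and the same `<= -2` cut-off used for the `toxic` column above:

```python
from statistics import mean, median

# Hypothetical ratings for one clearly abusive comment:
# most raters give -2, one gives -1, and two call it neutral (0).
scores = [-2, -2, -2, -2, -2, -2, -2, -1, 0, 0]

print(mean(scores))     # -1.5
print(median(scores))   # -2.0

# Only the median crosses the threshold.
print(mean(scores) <= -2, median(scores) <= -2)   # False True
```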

# Naive Bayes

I want to implement a Naive Bayes classifier as a baseline.  I've written my own version, which I will compare to
scikit-learn's version.  First up, my version:

In [390]:
def cond_prob(X_counts,toxic,csmooth=1):
    """cond_prob
    X_counts - sparse matrix of counts of each word in a given message
    toxic - whether each message was toxic or not, with 0,1
    csmooth - additive (Laplace/Lidstone) smoothing constant
    Returns P(toxic), P(word|toxic), P(word|clean).
    """
    nrows,nwords=X_counts.shape
    ptoxic = np.sum(toxic)/nrows
    
    toxic_mat=X_counts[toxic==1,:]
    clean_mat=X_counts[toxic==0,:]
    #sum across messages
    nword_toxic=np.sum(toxic_mat,axis=0)
    nword_clean=np.sum(clean_mat,axis=0)    

    #estimate probability of word given toxicity by number of times
    #that word occurs in toxic documents, divided by the total number of words
    #in toxic documents
    #Laplace/Lidstone smooth version
    pword_toxic= (nword_toxic+csmooth) \
                / (np.sum(toxic_mat)+nwords*csmooth)

    pword_clean= (nword_clean+csmooth) \
                /(np.sum(clean_mat)+nwords*csmooth)
    return ptoxic,pword_toxic,pword_clean    

ptox,pw_tox,pw_cln = cond_prob( X_train_counts, df_train['toxic'].values, csmooth=0.01)
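
To make the smoothing formula concrete, here is a tiny worked example in plain Python. The word counts are invented, and `toxic_counts` and `p_word_given_toxic` are illustrative names only. The estimate is (count + csmooth) / (total + nwords*csmooth), so a word never seen in toxic comments still gets a small non-zero probability.

```python
# Hypothetical word counts summed over all toxic comments.
toxic_counts = {"idiot": 3, "hello": 1, "thanks": 0}
csmooth = 1.0

nwords = len(toxic_counts)            # vocabulary size
total = sum(toxic_counts.values())    # total number of words in toxic comments

p_word_given_toxic = {
    w: (n + csmooth) / (total + nwords * csmooth)
    for w, n in toxic_counts.items()
}

print(p_word_given_toxic["idiot"])        # 4/7
print(p_word_given_toxic["thanks"])       # 1/7, not zero
print(sum(p_word_given_toxic.values()))   # sums to 1 (up to rounding)
```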

In [421]:
def naive_bayes(mat,pword_tox,pword_cln,ptox):
    """Compute probability that a message 
    is toxic via naive_bayes estimate.
    """
    #ptox_word = (pword_tox * p_tox)/(pword_cln*pcln + pword_tox * p_tox)
    #Can use logs:  #Can just add the logs, and exponentiate at the end
    #Compute log of conditional probability of p(toxic|word)
    ##init version:
    # pword=pword_tox*ptox/(pword_tox*ptox + pword_cln*(1-ptox))
    # #log probability for toxic/clean comments
    # log_Tword = np.log(pword)
    # log_Cword = np.log(1-pword)    
    # # #now accumulate probabilities by multiplying number of counts
    # #per comment, with the weights per word
    # log_Tscore = mat.dot(log_Tword.T)
    # log_Cscore = mat.dot(log_Cword.T)
    # pred=log_Tscore>log_Cscore
    
    # ##futzing version
    pwordT=pword_tox#*ptox/(pword_tox*ptox + pword_cln*(1-ptox))    
    pwordC=pword_cln#*(1-ptox)/(pword_tox*ptox + pword_cln*(1-ptox))

    #log probability for toxic/clean comments
    log_Tword = np.log(pwordT)
    log_Cword = np.log(pwordC)
    # #now accumulate probabilities by multiplying number of counts
    #per comment, with the weights per word
    msk=mat>0
    # print(msk.shape,log_Tword.shape,log_Cword.shape)
    # for i in msk.indices:
    #     print('p(w,T),p(w,C)',pword_tox[0,i],pword_cln[0,i])        
    #     print('p(w|T),p(w|C)',pwordT[0,i],pwordC[0,i])    
    #     print('logs',log_Tword[0,i],log_Cword[0,i])
    #     print('count',mat[0,i])        
    log_Tscore = mat.dot(log_Tword.T)+np.log(ptox)
    log_Cscore = mat.dot(log_Cword.T)+np.log(1-ptox)
    pred=log_Tscore>log_Cscore    

    #score=np.exp(log_score)
    return pred,log_Tscore,log_Cscore,log_Tword,log_Cword
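
A small sketch of why the scores are accumulated as logs, followed by the log-score comparison for a single made-up comment. All probabilities and counts here are invented for illustration.

```python
import math

# Multiplying many small probabilities underflows to zero...
p_word, n_words = 1e-5, 200
direct = p_word ** n_words               # 0.0 in double precision
log_score = n_words * math.log(p_word)   # about -2302.6, still comparable

# ...so compare log scores instead: log P(class) + sum of count * log P(word|class).
ptox = 0.01
p_tox = {"idiot": 0.05, "thanks": 0.001}    # hypothetical P(word|toxic)
p_cln = {"idiot": 0.0005, "thanks": 0.02}   # hypothetical P(word|clean)
counts = {"idiot": 2, "thanks": 1}          # word counts in one comment

log_T = math.log(ptox) + sum(n * math.log(p_tox[w]) for w, n in counts.items())
log_C = math.log(1 - ptox) + sum(n * math.log(p_cln[w]) for w, n in counts.items())
pred_toxic = log_T > log_C

print(direct, log_score)
print(log_T, log_C, pred_toxic)   # roughly -17.5, -19.1, True
```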

In [422]:
actual=df_train['toxic'].values
msk=actual
Xtox = X_train_counts[msk,:]
df_tox=df_train[msk]
pred,logT,logC,log_Tword,log_Cword=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    

In [423]:
def check_predictions(pred,actual):
    actual=np.reshape(actual,(len(actual),1))
    pred=np.reshape(pred,(len(actual),1))    
    print(pred.shape,actual.shape)
    tp = np.mean((pred==True)&(actual==True))
    fp = np.mean((pred==True)&(actual==False))    
    tn = np.mean((pred==False)&(actual==False))
    fn = np.mean((pred==False)&(actual==True))            
    scores=[tp,tn,fp,fn]
    print("True Positive {}. False Positive {}".format(tp,fp))
    print("False Negative {}. True Negative {}".format(fn,tn))    
    return scores
score_rates=check_predictions(pred,actual)

(95692, 1) (95692, 1)
True Positive 0.00896626677256197. False Positive 0.009196171048781508
False Negative 0.0009300672992517661. True Negative 0.9809074948794048


In [16]:
#Look at the false negatives 
# df_fn=df_train[(pred==False)]
# df_fn=df_fn[df_fn['toxic']==True]
# df_fn[['comment_clean','mean_toxic','median_toxic']]

So, this simple classifier has bad performance.  The rate of false negatives means we miss roughly half of toxic comments!  Roughly 1% of the data
are "toxic" under the metric where the median toxicity score is -2.   It misses 3x as many toxic comments when fed comments labelled toxic based on the median toxicity being -1.
The false negatives in that larger set seem to be more rules-lawyering, whinging about administration, and sidestepping filters, e.g. f:)u:)c:)k:).
These are a bit harder for the classifier to find.

Also, does comment length matter?  Next steps: try an SVM, then some dimensionality reduction (word2vec), then a neural network.

In [364]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.fit(X_train_counts,df_train['toxic'].values)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [365]:
pred_nb=nb.predict(X_train_counts)
nb_stats=check_predictions(pred_nb,actual)

(95692, 1) (95692, 1)
True Positive 0.00896626677256197. False Positive 0.009196171048781508
False Negative 0.0009300672992517661. True Negative 0.9809074948794048


Well, I must have an error in my Naive Bayes code, since I'm getting much worse results than scikit-learn.  The class probabilities and the conditional probabilities both agree, as checked below.

In [None]:
#class priors agree
print(np.exp(nb.class_log_prior_))
print(ptox,1-ptox)

[ 0.99010367  0.00989633]
0.00989633407181 0.990103665928


In [417]:
#conditional probabilities agree, i.e.  log(P(w|C)) for the clean class
[pw_cln2,pw_tox2]=nb.feature_log_prob_
print(np.log(pw_cln))
print(pw_cln2)

[[-19.49909328 -13.79198301 -14.19578837 ..., -14.88397276 -14.88397276
  -14.88397276]]
[-19.49909328 -13.79198301 -14.19578837 ..., -14.88397276 -14.88397276
 -14.88397276]


In [412]:
#check computes probabilities for each class
nb_logprob=np.matrix(nb.predict_log_proba(X_train_counts))
nb_jll=np.matrix(nb._joint_log_likelihood(X_train_counts))
# log_probabilities for each message
nb_logC=nb_logprob[:,0]
nb_logT=nb_logprob[:,1]



In [420]:
nb_logprob

matrix([[   0.        , -152.61213392],
        [   0.        , -121.75430671],
        [   0.        , -691.48789889],
        ..., 
        [   0.        , -997.78984736],
        [   0.        , -136.79274693],
        [   0.        , -157.51107803]])

In [376]:
pred_nb=np.matrix(pred_nb).T

In [380]:
msk=np.matrix((pred!=pred_nb))
print(np.mean(msk))
smsk = sparse.coo_matrix(msk)
#print out to find some mismatched indices.
#print(smsk)

0.0135330017138


In [401]:
ind=575
print(pred_nb[ind],pred[ind],actual[ind])
print(nb_logC[ind],nb_logT[ind],"\n")
print(logC[ind],logT[ind],"\n")

[[False]] [[False]] False
[[-0.01064289]] [[-4.5481796]] 

[[-0.02159582]] [[-9.16477778]] 



In [411]:
print(log_Cword[0,[104335,116314]])
print(X_train_counts[ind])

[[-0.00751187 -0.01408395]]
  (0, 104335)	1
  (0, 116314)	1


# Support Vector Machine

In CS229, Andrew Ng's assignment suggests the SVM as a natural improvement over the Naive Bayes method.  Let's implement one of those.
I'm going to update it to do batch gradient descent with sparse matrices.  The version I wrote initially was trash.

In [61]:
from sklearn.svm import SVC

In [64]:
svm=SVC(cache_size=500,verbose=True,class_weight={True:2,False:1})
svm.fit(X_train_counts,actual)

KeyboardInterrupt: 

[LibSVM]

In [36]:
pred_nb.sum()
actual=df_train['toxic'].values
actual.sum()

9245

In [92]:
#define a cost function, check that we're minimizing it.
#define the alternative cost function to be sure we also minimize that original choice.
#check constraints are obeyed?
def svm_cost(alpha,Kmat,cat,l):
    m = Kmat.shape[0]
    Ka = np.dot(Kmat,alpha);            
    cost=0.5*l*np.dot( alpha, Ka)
    yvec=(1-cat*Ka)/m
    ymsk=yvec>0
    cost+=np.sum(yvec*ymsk)
    return cost                    

#Compute Kernel Matrix, an m x m matrix
#that effectively measures similarity between inputs.
#each K_{ij} is the "distance" between the weighted inputs,
# $x^{(i)}_k, $x^{(j)}_k$.
# def kernel_matrix(x,tau):
#     m=x.shape[0]
#     x=(x>0).astype(int)
#     K=np.zeros([m,m])
#     for i in range(0,m):
#         K[:,i]=np.exp(-np.sum((x-x[i])**2,1)/(2*tau**2));
#     return K

#Compute a column from the Kernel matrix.
#Matrix is assumed to be [m,m], with vec
#of length m.  Returns vector of length m.
# def Kvec(mat,vec,tau):
#     xarg=np.sum((mat-vec)**2,1)
#     Kv=np.exp(-xarg/(2*tau**2));
#     return Kv

def Kbatch(mat,ind,norm2,tau):
    """Kbatch(mat,ind,norm2,tau)
    Compute a batch of Gaussian kernel matrix elements
    Input: mat  - sparse matrix (nobs x nfeature)
           ind - indices for that subset of rows (nbatch)
           norm2 - column matrix with squared norm for each row (nobs,1)
           tau - kernel width
    Return: Kv - nobs x nbatch subset of the full kernel matrix.
    """
    #extract chosen rows
    cvec = mat[ind,:].T
    #squared distance |x-y|^2 = |x|^2 - 2 x.y + |y|^2
    #relying on numpy broadcasting to join (nobs,nbatch) + (nobs,1)
    xarg=-2.0*mat.dot(cvec)+norm2
    #further broadcasting: use a row-vector ind to make a row-vector
    #of relevant norms.
    #then broadcast again from (nobs,nbatch)+ (1,nbatch)
    xarg+=norm2[ind].T
    Kv=np.exp(-xarg/(2*tau**2))
    #NOT TESTED YET!
    return Kv

#carry out update on parameters for given loss function for SVM,
#given parameters and a batch of kernel columns
def svm_batchsgd_update(alpha,Kb,y,ind,rate,l):
    """svm_batchsgd_update
    alpha  - nobs x 1 vector
    Kb     - nobs x Nbatch subset of Kernel matrix
    y      - (Nbatch,) labels for the batch, assumed to be -1/+1
    ind    - (Nbatch,) indices for batch
    """
    Kb = np.asarray(Kb)
    nobs = Kb.shape[0]
    yK = y*Kb                    #nobs x Nbatch
    Ka = np.dot(alpha.T,Kb)      #1 x Nbatch, decision value for each batch row
    Kalpha = Kb*alpha[ind].T     #nobs x Nbatch
    #da= (-y_i*K_i)*((1-y_i*Ka) >0)+m*l*K_i*alpha[ind];
    da= (-yK)*((y*Ka)<1)+nobs*l*Kalpha
    #sum all changes over columns
    alpha=alpha-rate*np.sum(da,axis=1,keepdims=True)
    return alpha
    
#Fit SVM coefficients with batch stochastic gradient descent.
#use known categories in cat_vec (assumed 0/1 or boolean), and word_matrix with nobs x nwords
def svm_fit(word_mat,cat_vec,tau=8,Nbatch=100):
    #just count whether word occurs.
    new_mat=(word_mat>0).astype(int)
    nobs,nword=new_mat.shape
    #map 0/1 labels to -1/+1 for the hinge loss
    yvec=2*np.asarray(cat_vec).astype(int)-1
    alpha=np.zeros((nobs,1))    #initialize parameters
    alpha_tot=np.zeros((nobs,1))
    niter=40*nobs
    l=1/(tau**2*nobs)

    norm2=new_mat.multiply(new_mat).sum(axis=1)
    #multiple iterations of stochastic gradient descent.
    for t in range(0,niter):
        #draw a random batch of row indices
        indx=np.random.randint(low=0,high=nobs,size=Nbatch)
        Kv = Kbatch(new_mat,indx,norm2,tau)
        yt=yvec[indx]
        rate=np.sqrt(1.0/(t+1))
        alpha=svm_batchsgd_update(alpha,Kv,yt,indx,rate,l)
        alpha_tot=alpha_tot+alpha
    alpha_tot=alpha_tot/niter
    return alpha_tot

In [111]:
norm2=X_train_counts.multiply(X_train_counts).sum(axis=1)
ind=np.random.randint(size=100,low=0,high=1000)

In [114]:
Kb=Kbatch(X_train_counts,ind,norm2,8)

In [115]:
Kb.shape

(95692, 100)
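
The kernel batch relies on expanding the squared distance as |x-y|^2 = |x|^2 - 2 x.y + |y|^2, so that only dot products and precomputed norms are needed. A tiny check of that identity in plain Python, with made-up vectors and helper names of my own:

```python
import math

def sqdist_direct(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def sqdist_expanded(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return sum(a * a for a in x) - 2.0 * dot + sum(b * b for b in y)

def gaussian_kernel(x, y, tau):
    return math.exp(-sqdist_expanded(x, y) / (2 * tau ** 2))

x = [1, 0, 2]   # hypothetical word-count rows
y = [0, 1, 2]

print(sqdist_direct(x, y), sqdist_expanded(x, y))   # 2 2.0
print(gaussian_kernel(x, x, tau=8))                 # 1.0 (a row is identical to itself)
print(gaussian_kernel(x, y, tau=8))                 # exp(-2/128), about 0.9845
```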

In [87]:
a = np.matrix([[2],[4]])
B = np.matrix([[3,0,5],[0,0,9]])
Bs = sparse.csr_matrix(B)
a+np.matrix([[3,2]])

matrix([[5, 4],
        [7, 6]])

In [91]:
ind=[0,0,1,0]
a[ind]

matrix([[2],
        [2],
        [4],
        [2]])