# Toxicity in Wikipedia Comments

This is a parallel effort to other work on the Wikipedia toxicity data.
The data has not been cleaned yet, and has not yet had multiple categories introduced.  However, it is presented free from bias, for people to play with.

Beware: the comments contain lots of swearing, racism, homophobia and misogyny, both because of the nature of the data
and because I have to search for nasty terms as a sanity check.

Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).

My rough plan is to build up a lexicon, tokenize that data, and try to build a Naive Bayes model.  (Maybe later a Recurrent Neural network model?)

Other Analysis possibilities:
* Naive Bayes
    - can find the most important words in toxic comments
    - simple, easy to understand baseline.
* Support Vector Machine
    - another big architecture, less popular now?
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
* Deep Neural Network
    - Try to grow beyond naive Bayes via term-frequency matrix.
* Latent Factor Analysis 
    - maybe useful prelude or alternative for building up embeddings.
    
Cleaning:
* Clean data : How to remove newlines (search/replace: NEWLINE with '')
* Tokenize (convert words to indices)
* Stemming words
* Balancing data set
* Match up comments, and review scores
* Search for gibberish words (make a new "feature" for badly spelled comments)

In [279]:
import pandas as pd
import nltk
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import log_loss,f1_score

from IPython.display import clear_output


In [2]:
#df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

#reindex 
print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [12]:
#When are the comments made?
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
df_com['year'].hist(bins=bin_arr)
plt.show()

In [21]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes in 'rev_id'.  Those set the boundaries for changes.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split to split the array into many sub-arrays.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr
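
The boundary-splitting trick above relies on `df_rate` being sorted by `rev_id`, and on `df_com` listing the same revisions in the same order. As a minimal, dependency-free sketch of the same grouping idea, here is a version using made-up `(rev_id, worker_id, toxicity_score)` rows, with `itertools.groupby` swapped in for `diff`/`np.split`. The rows and the helper name `split_by_rev_id` are hypothetical.

```python
from itertools import groupby

# Hypothetical annotation rows: (rev_id, worker_id, toxicity_score),
# already sorted by rev_id, as assumed for the real annotations file.
rows = [
    (101, 7, -1), (101, 8, 0), (101, 9, -2),
    (205, 7, 1), (205, 3, 0),
    (309, 5, -2),
]

def split_by_rev_id(rows):
    """Group consecutive rows sharing a rev_id into one list per comment."""
    groups = {}
    for rev_id, block in groupby(rows, key=lambda r: r[0]):
        # keep only (worker_id, toxicity_score), mirroring the 'scores' column
        groups[rev_id] = [(worker, score) for _, worker, score in block]
    return groups

scores = split_by_rev_id(rows)
print(scores[101])
```

Keying the groups by `rev_id` makes the alignment with the comments explicit, rather than relying on both tables sharing an ordering.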

In [22]:
def score_mean(score_list):
    """score_mean
    Compute mean of toxicity scores for input array.
    Input is an array with one row per annotator, and columns (worker_id, toxicity_score).
    Compute mean running down the rows.  Could be updated to include a weighted sum of scores
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute median of toxicity scores for input array.
    Input is an array with one row per annotator, and columns (worker_id, toxicity_score).
    Compute median running down the rows.  Could be updated to include a weighting of the scores
    """
    s = np.median(score_list[:,1])
    return s


In [23]:
#Make a new column computing mean, median scores
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)

In [26]:
#Define toxic comments as those where the median is at or below -1 (or -2).
#-1 captures more comments, but with more variance in what is considered toxic/unhelpful.
df_com['toxic']=(df_com['median_toxic']<=-1)
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159686. Toxic comments: 15362. Toxic Fraction: 0.09620129504151897


In [38]:
#When are the comments made?  Has the toxicity changed over time?
#Note this is on the full dataset, with test/training/dev splits. 
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
#toxic comments (median score at or below -1)
plt.subplot(2,2,1)
msk1=df_com['median_toxic']<=-1
plt.ylabel('Toxicity=-1')
df_com['year'][msk1].hist(bins=bin_arr)
plt.title('Toxic')
plt.subplot(2,2,2)
plt.title('Non-Toxic')
df_com['year'][~msk1].hist(bins=bin_arr)
#second row
plt.subplot(2,2,3)
msk2=df_com['median_toxic']<=-2
df_com['year'][msk2].hist(bins=bin_arr)
plt.ylabel('Toxicity=-2')
plt.subplot(2,2,4)
df_com['year'][~msk2].hist(bins=bin_arr)
plt.xlabel('Year')
plt.show()

[Figure: 2x2 grid of histograms of comment year. Columns: Toxic, Non-Toxic. Rows: toxicity threshold -1, toxicity threshold -2. x-axis: Year.]

So the split between toxic and non-toxic comments looks to be roughly stable across time.
Another question about the data is what topics were under discussion? Does this bias the output/findings?

In [39]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate
#Can maybe also eliminate all punctuation?  Makes any 

#maybe also dates?
def clean_up(comments):
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ',regex=False)
    com_clean=com_clean.str.replace('TAB_TOKEN',' ',regex=False)    
    #Remove HTML trash, via non-greedy replacing anything between double backticks.
    #regex=True is passed explicitly, since newer pandas versions default to literal matching.
    com_clean=com_clean.str.replace(r"style=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"class=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"width=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"align=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"cellpadding=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"cellspacing=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"rowspan=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"colspan=``.*?``",' ',regex=True)
    #remove numbers
    com_clean=com_clean.str.replace(r"[0-9]+",' ',regex=True)
    #remove underscores
    com_clean=com_clean.str.replace("_",' ',regex=False)
    #remove symbols (including both square brackets).    There must be a more comprehensive way of doing this?
    com_clean=com_clean.str.replace(r"[\[\]\{\}=_:\|\(\)\\/`]+",' ',regex=True)
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace(r'\s+',' ',regex=True)
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])
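
To check what these substitutions do, here is a rough sketch applying the same kind of cleaning to a single made-up string with the standard `re` module. The input string and the helper `clean_one` are hypothetical, only a few of the HTML attributes are included, and the final `strip()` is an extra step that `clean_up` does not perform.

```python
import re

# Hypothetical raw comment in the style of the dataset.
raw = ("NEWLINE_TOKEN== Hello ==NEWLINE_TOKENstyle=``color:red`` "
       "See [[Talk_page]] in 2015TAB_TOKENok")

def clean_one(text):
    """Apply the same kind of substitutions as clean_up to one string."""
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    # drop HTML-like attributes such as style=``...`` (non-greedy match)
    text = re.sub(r"(?:style|class|width|align)=``.*?``", " ", text)
    # drop digits, then markup symbols
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"[\[\]{}=_:|()\\/`]+", " ", text)
    # collapse runs of whitespace (strip() is an addition for tidiness)
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_one(raw)
print(cleaned)
```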

In [240]:
#separate off training_split
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]
df_dev=df_com[df_com['split']=='dev']
df_test=df_com[df_com['split']=='test']

In [41]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
tfidf_vect=TfidfVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_tfidf=tfidf_vect.fit_transform(df_train['comment_clean'])
X_train_counts.shape

(95692, 125568)
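
The shape above is (number of training comments, vocabulary size). As a reminder of what a count matrix holds, here is a small pure-Python sketch of a bag-of-words count on three made-up comments. It is not how `CountVectorizer` is implemented, and it skips stop-word removal, lowercasing and accent stripping; `docs`, `vocab` and `count_row` are hypothetical names.

```python
from collections import Counter

# Hypothetical cleaned comments.
docs = ["you are wrong", "wrong wrong wrong", "thanks for the edit"]

# Build a vocabulary mapping word -> column index.
words = sorted({w for d in docs for w in d.split()})
vocab = {w: i for i, w in enumerate(words)}

def count_row(doc):
    """Return the dense count vector for one document."""
    row = [0] * len(vocab)
    for word, n in Counter(doc.split()).items():
        row[vocab[word]] = n
    return row

X = [count_row(d) for d in docs]
# number of rows (documents) and number of columns (distinct words)
print(len(X), len(vocab))
# how often 'wrong' occurs in the second document
print(X[1][vocab["wrong"]])
```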

In [266]:
#do the same transformations using existing vocab built up in training.
X_dev_tfidf=tfidf_vect.transform(df_dev['comment_clean'])
X_test_tfidf=tfidf_vect.transform(df_test['comment_clean'])

X_dev_counts=count_vect.transform(df_dev['comment_clean'])
X_test_counts=count_vect.transform(df_test['comment_clean'])

# Checking the vectorizer and finding common words

I wanted to check that the vectorizer was working by outputting common words, and identifying the "most toxic" words, based on their counts.
This was useful as a sanity check.

In [48]:
#get vocabulary dictionary
voc_dict=count_vect.vocabulary_
#make a dataframe, with entries as rows
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
#sort by row entry value, and then use that as the index for the counts.
voc_df1=voc_df.sort_values(by=0)

In [49]:
voc_df1.iloc[29143]

0    29143
Name: dick, dtype: int64

In [50]:
def cond_prob(X_counts,toxic,csmooth=1):
    """cond_prob
    Compute conditional probabilities of toxic/non-toxic words after tokenization, and 
    count vectorization. 

    Input: X_counts - sparse matrix of counts of each word in a given message
    toxic - whether each message was toxic or not, with 0,1
    csmooth - parameter for Laplace/Lidstone smoothing to account for unseen words

    Return:
    ptoxic      - total probability for toxic message
    pword_toxic - conditional probability of each word, given a toxic message
    pword_clean - conditional probability of each word, given a clean message
    """
    nrows,nwords=X_counts.shape
    ptoxic = np.sum(toxic)/nrows
    
    toxic_mat=X_counts[toxic==1,:]
    clean_mat=X_counts[toxic==0,:]
    #sum across messages
    nword_toxic=np.sum(toxic_mat,axis=0)
    nword_clean=np.sum(clean_mat,axis=0)    

    #estimate probability of word given toxicity by number of times
    #that word occurs in toxic documents, divided by the total number of words
    #in toxic documents
    #Laplace/Lidstone smooth version
    pword_toxic= (nword_toxic+csmooth) \
                / (np.sum(toxic_mat)+nwords*csmooth)

    pword_clean= (nword_clean+csmooth) \
                /(np.sum(clean_mat)+nwords*csmooth)
    return ptoxic,pword_toxic,pword_clean    

ptox,pw_tox,pw_cln = cond_prob( X_train_counts, df_train['toxic'].values, csmooth=0.01)

In [57]:
#make new dataframe with conditional probabilities for words being toxic, and raw probabilities of occurring in toxic/clean messages
X_cond= pw_tox*ptox/(pw_tox*ptox + pw_cln*(1-ptox))
word_mat=np.array([X_train_counts.sum(axis=0),X_cond,pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','pcond','p_clean','p_toxic'],index=voc_df1.index)
word_df.sort_values('pcond',ascending=False,inplace=True)
print(word_df.head(n=1000))

                   count     pcond       p_clean   p_toxic
fucksex            624.0  0.999987  3.640840e-09  0.002625
buttsecks          498.0  0.999984  3.640840e-09  0.002095
bastered           449.0  0.999982  3.640840e-09  0.001889
cocksucker         425.0  0.999981  3.640840e-09  0.001788
fggt               398.0  0.999980  3.640840e-09  0.001674
mothjer            391.0  0.999979  3.640840e-09  0.001645
offfuck            360.0  0.999978  3.640840e-09  0.001514
niggas             340.0  0.999976  3.640840e-09  0.001430
sexsex             332.0  0.999976  3.640840e-09  0.001396
yourselfgo         309.0  0.999974  3.640840e-09  0.001300
marcolfuck         260.0  0.999969  3.640840e-09  0.001094
fack               232.0  0.999965  3.640840e-09  0.000976
veggietales        212.0  0.999962  3.640840e-09  0.000892
ancestryfuck       208.0  0.999961  3.640840e-09  0.000875
notrhbysouthbanof  208.0  0.999961  3.640840e-09  0.000875
shitfuck           182.0  0.999956  3.640840e-09  0.0007

So, the most toxic words (i.e. words that only appeared in toxic messages) are misspelled attempts at rudeness, with weird spaces, and combination words.  I think this reflects more on the pre-processing.  These words show up in a single toxic message, and are thus great at inferring that one message is toxic.  This doesn't say much about more general trends in the messages.
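
To see why words that never occur in clean comments end up with `pcond` so close to 1, it helps to write out what `cond_prob` and `X_cond` compute. The symbols here are notation introduced for illustration: $n_{w,T}$ and $n_{w,C}$ are the counts of word $w$ in toxic and clean comments, $N_T$ and $N_C$ are the total word counts in each class, $V$ is the vocabulary size, and $c$ is `csmooth`.

$$
p(w \mid T) = \frac{n_{w,T} + c}{N_T + cV}, \qquad
p(w \mid C) = \frac{n_{w,C} + c}{N_C + cV}, \qquad
p(T \mid w) = \frac{p(w \mid T)\,p(T)}{p(w \mid T)\,p(T) + p(w \mid C)\,\bigl(1 - p(T)\bigr)}
$$

For a word with $n_{w,C}=0$, $p(w \mid C)$ collapses to the floor $c/(N_C + cV)$, which is the identical `p_clean` value repeated down the table.  With $c=0.01$ that floor is tiny, so $p(T \mid w)$ is pushed towards 1 no matter how few distinct messages the word came from.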

I am considering also implementing a spell-check, and adding a variable for the number of incorrect words or fraction of the message that is misspelled.  Another feature would be the fraction that is capitalized?
The accent stripping catches simple attempts to avoid the spam filter with accents, but does miss things where the words are spaced out, or have other characters inserted.

In [58]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of words between the count vectorizer,
    and a naive 'contains' search.  Returns all the matching comments,
    and any discrepancies.        
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    #find comments containing the word according to the vectorizer
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive (substring) search
    naive_msk=comments.str.contains(word,case=False,regex=False).values
    naive_count=np.sum(naive_msk)
    #index the original comments with each mask separately,
    #so that both masks stay aligned with the full set of comments
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [273]:
#vc,cc,com,ncom,dcom=check_vect(X_train_counts,df_train['comment_clean'],voc_df,'gay')
vc,cc,com,ncom,dcom=check_vect(X_dev_counts,df_dev['comment'],voc_df,'gay')
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.
#does not account for leetspeak or rather: 13375|o3@|< (but who uses that these days?)

In [274]:
#currently searching for "gay", a term that has clean connotations, but can be used in homophobic attacks.
#Another word with the same dichotomy of identity/hate is Jew.
print('Vect: {}, Naive: {}'.format(vc,cc))
print(com.head(),'\n\n')
print(ncom.head())

Vect: [[426]], Naive: 181
682     NEWLINE_TOKENNEWLINE_TOKEN::Just because one is gay does not mean one is unwilling to breed. I w...
1640    `NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENI don't know who the other guy is, but of course I'm the...
2532                       NEWLINE_TOKENNEWLINE_TOKEN== your mums ugly ==NEWLINE_TOKENNEWLINE_TOKENyour gay
2798    `NEWLINE_TOKENNEWLINE_TOKEN== categories ==NEWLINE_TOKENNEWLINE_TOKENGood eye on those awful ``P...
2912    `NEWLINE_TOKENNEWLINE_TOKEN== One MO' time for the kids in the back... ==NEWLINE_TOKENNEWLINE_TO...
Name: comment, dtype: object 


682     NEWLINE_TOKENNEWLINE_TOKEN::Just because one is gay does not mean one is unwilling to breed. I w...
1640    `NEWLINE_TOKENNEWLINE_TOKENNEWLINE_TOKENI don't know who the other guy is, but of course I'm the...
2532                       NEWLINE_TOKENNEWLINE_TOKEN== your mums ugly ==NEWLINE_TOKENNEWLINE_TOKENyour gay
2798    `NEWLINE_TOKENNEWLINE_TOKEN== categories ==NEWLINE_TOKENNEWLINE_TOKENG

In [253]:
#naughty_word=['sex','fuck','shit','cunt','bitch','piss','cocksucker','dick','ass','nazi','tosser','wanker','bellend','jerk']
naughty_word=['fuck','fag','kill','bleach','bellend','wanker','towelhead']
#identity_hate=['nigger','trans','faggot','kike','jew','wetback','spic']
word_counts=X_train_counts.sum(axis=0)
for word in naughty_word:
    try:
        ind=count_vect.vocabulary_[word]
        print(word,'count: {}'.format(word_counts[0,ind]))
    except KeyError:
        #word is not in the vocabulary built from the training data
        print(word,'not found')

fuck count: 6179
fag count: 798
kill count: 522
bleach count: 24
bellend count: 5
wanker count: 1011
towelhead not found


It might be interesting to look at how the words changed over time?  Perhaps look at the prevalence of generic/homophobic/misogynistic/racist comments.  

I noticed that there are very few obvious racist slurs in the unanimous data set (but lots of sexism and general hate).
This raises a weird sociological question about how toxic racism is perceived to be, perhaps by American reviewers? (This is something that the original project is explicitly considering at https://conversationai.github.io/bias.html)

(searching for the n-word found these)
Some ratings seem way off. e.g. the scores for comments 1467, 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments. And some raters thought that 2670 was neutral too!  What the hell?!
This suggests using the median toxicity score, to avoid the mean being contaminated by people with a really different sense of what counts as toxic.
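
As a quick illustration of that point, here is a sketch with made-up ratings on the -2..2 scale, where most annotators call a comment very toxic and two call it neutral. The `ratings` list is hypothetical, not taken from the data.

```python
from statistics import mean, median

# Hypothetical ratings for one comment: seven "very toxic" (-2),
# one "toxic" (-1), and two "neutral" (0).
ratings = [-2, -2, -2, -2, -1, -2, -2, 0, 0, -2]

mean_score = mean(ratings)
median_score = median(ratings)
print(mean_score, median_score)
```

The two neutral votes drag the mean up to -1.5, while the median stays at -2.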

# Naive Bayes

I want to implement a Naive Bayes classifier as a baseline.  I've written my own version, which I will try to compare to
scikit-learn's version.  (They both return the same result now).

This basically treats the comments in a bag-of-words sense, and drops any correlations between the words.  Perhaps including some of the more
common n-grams, e.g. "frigging crank", would help.

In [64]:
def naive_bayes(mat,pword_tox,pword_cln,ptox):
    """Compute probability that a message 
    is toxic via naive_bayes estimate.
    """
    #I screwed up using prod_i[p(w_i|T)p(T)]
    #instead of P(T)prod_i[p(w_i|T)].  Ugh.
    #log probability for toxic/clean comments
    log_Tword = np.log(pword_tox)
    log_Cword = np.log(pword_cln)
    ## now accumulate probabilities by multiplying number of counts
    #per comment, with the weights per word
    #also add on log-normalization.
    msk=mat>0
    log_Tscore = mat.dot(log_Tword.T)+np.log(ptox)
    log_Cscore = mat.dot(log_Cword.T)+np.log(1-ptox)
    #predict based on which has larger probability (or log-likelihood)
    pred=log_Tscore>log_Cscore
    #also output probabilities
    prob=1/(1+np.exp(log_Cscore-log_Tscore))

    return pred,prob,log_Tscore,log_Cscore,log_Tword,log_Cword
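
For reference, here is a stripped-down sketch of the same log-space calculation for a single comment, in pure Python. The word probabilities, the prior and the helper name `nb_log_odds` are all made up for illustration; only the structure (log prior plus count-weighted log likelihoods, then a logistic function of the difference) mirrors `naive_bayes`.

```python
import math

def nb_log_odds(counts, p_word_tox, p_word_cln, p_tox):
    """Log posterior odds, log P(T|x) - log P(C|x), for one comment.

    counts      - dict word -> number of occurrences in the comment
    p_word_tox  - dict word -> p(word | toxic)
    p_word_cln  - dict word -> p(word | clean)
    p_tox       - prior probability of a toxic comment
    """
    log_t = math.log(p_tox)
    log_c = math.log(1.0 - p_tox)
    for word, n in counts.items():
        log_t += n * math.log(p_word_tox[word])
        log_c += n * math.log(p_word_cln[word])
    return log_t - log_c

# Hypothetical (made-up) word probabilities and prior.
p_word_tox = {"idiot": 0.02, "thanks": 0.001}
p_word_cln = {"idiot": 0.0005, "thanks": 0.01}
p_tox = 0.1

odds_rude = nb_log_odds({"idiot": 2}, p_word_tox, p_word_cln, p_tox)
odds_nice = nb_log_odds({"thanks": 1}, p_word_tox, p_word_cln, p_tox)

# the posterior probability is the logistic function of the log-odds
prob_rude = 1.0 / (1.0 + math.exp(-odds_rude))
print(odds_rude > 0, odds_nice > 0, prob_rude)
```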

In [65]:
actual=df_train['toxic'].values
msk=actual
Xtox = X_train_counts[msk,:]
df_tox=df_train[msk]
pred,prob,logT,logC,log_Tword,log_Cword=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    



In [139]:
#Plot a histogram of the log probabilities.  
plt.figure()
plt.hist(np.maximum(-50,np.log(prob)),bins=100)
plt.show()

[Figure: histogram of the log probabilities, with values below -50 clipped to -50.]


In [148]:
#Plot a histogram of the log-odds (right term?).  
plt.figure()
plt.subplot(121)
bins=np.linspace(-1000,1000,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.subplot(122)
bins=np.linspace(-20,20,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.show()

[Figure: two log-scale histograms of logT-logC, over the ranges -1000 to 1000 and -20 to 20.]

In [None]:
#Maybe should also plot the length of comments? To what extent are these mirroring a similar underlying shape?

In [258]:
def check_predictions(pred,actual,epsilon=1E-15):
    """check_predictions
    Compares predicted class (y_i) against actual class (z_i).
    Returns the confusion matrix and mean log-loss.
    
    Log-loss = -sum_i{ z_i log[y_i] + (1-z_i) log[1-y_i] }/M

    Input: pred - predicted values (0,1)
    actual - true labels 
    epsilon - shift to avoid log(0)
    Returns: Confusion matrix with [[true positive, false positive],[false negative, true negative]]
    log-loss - average log-loss
    """
    actual=np.reshape(actual,(len(actual),1))
    pred=np.reshape(pred,(len(actual),1))    
    print(pred.shape,actual.shape)
    tp = np.mean((pred==True)&(actual==True))
    tn = np.mean((pred==False)&(actual==False))
    fp = np.mean((pred==True)&(actual==False))    
    fn = np.mean((pred==False)&(actual==True))            
    scores=np.matrix([[tp,fp],[fn,tn]])
    print("True Positive {}. False Positive {}".format(tp,fp))
    print("False Negative {}. True Negative {}".format(fn,tn))
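    #clip the hard 0/1 predictions away from 0 and 1 to avoid log(0).
    #(done by hand, since the eps argument of log_loss is not available in all scikit-learn versions)
    pred_num=np.clip(pred.astype(float),epsilon,1-epsilon)
    logloss=log_loss(actual,pred_num,normalize=True)    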
    #give zero a small correction.
    #pred_num[pred==False]=epsilon
    #pred_num[pred==True]=1-epsilon
    #my (initial) wrong attempt
    #logloss2=-np.mean(np.multiply(actual,np.log(pred_num)))
    # logloss2=-np.mean(np.multiply(actual,np.log(pred_num))\
    #     +np.multiply(1-actual,np.log(1-pred_num)))
    # print(logloss2)

    #logloss=0
    print("Log-loss is {}".format(logloss))
    return scores,logloss
score_rates,logloss=check_predictions(pred,actual)


(95692, 1) (95692, 1)
True Positive 0.08485557831375663. False Positive 0.01652175730468587
False Negative 0.011756468670317268. True Negative 0.8868661957112403
Log-loss is 0.9767085345500724


In [224]:
?log_loss

Interesting. The mean log-loss is surprisingly sensitive to the chosen zero-offset.  I think this reflects the fact that the naive-bayes method is returning a lot of incredibly small probabilities (10^{-100}).
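
One thing that can be read off `check_predictions` is that it passes hard 0/1 predictions (not probabilities) to `log_loss`.  Every misclassified comment then costs $-\log(\epsilon)$, so the mean log-loss is roughly the error rate times $-\log(\epsilon)$.  A minimal sketch of that arithmetic, using the false positive and false negative fractions printed above; the helper name `hard_label_log_loss` is hypothetical.

```python
import math

def hard_label_log_loss(error_rate, eps=1e-15):
    """Mean log-loss when every prediction is a hard 0 or 1, clipped to [eps, 1-eps].

    Each wrong prediction costs -log(eps); each correct one costs -log(1-eps).
    """
    wrong = error_rate * -math.log(eps)
    right = (1.0 - error_rate) * -math.log(1.0 - eps)
    return wrong + right

# false positive + false negative fractions from the output above
error_rate = 0.01652175730468587 + 0.011756468670317268

for eps in (1e-15, 1e-6, 1e-3):
    print(eps, hard_label_log_loss(error_rate, eps))
```

With `eps=1e-15` this gives approximately 0.977, close to the value reported above, and the result shrinks steadily as the offset grows.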

In [16]:
#Look at the false negatives 
# df_fn=df_train[(pred==False)]
# df_fn=df_fn[df_fn['toxic']==True]
# df_fn[['comment_clean','mean_toxic','median_toxic']]

After fixing this, I get a 50% false positive rate.  And have a 10% false negative rate.  Note that this is searching for the most toxic comments.

The false negatives in that larger set seem to be more rules-lawyering, whinging about administration, and sidestepping filters, e.g. f:)u:)c:)k:).
This is a bit harder for the classifier to find.

Also length?  Can try a SVM, and then some dimensionality reduction word2vec, then neural network.

In [68]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.fit(X_train_counts,df_train['toxic'].values)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [192]:
pred_nb=nb.predict(X_train_counts)
nb_stats=check_predictions(pred_nb,actual)

(95692, 1) (95692, 1)
True Positive 0.08485557831375663. False Positive 0.01652175730468587
False Negative 0.011756468670317268. True Negative 0.8868661957112403
0.651130214594
Log-loss is 0.9767085345500724


Well, I must have an error in my Naive Bayes code.  I'm getting much worse results, despite having the correct class probabilities and conditional probabilities.  (Found the bug: I used $\prod_i[p(w_i|T)p(T)]$ instead of
$p(T)\prod_i p(w_i|T)$.  I found my error after examining hidden attributes, but really by checking against, and copying from, the sklearn source code.)
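
To spell out what that bug does, here is a short worked equation for a comment with $n$ words (the symbol $n$ is introduced here for illustration):

$$
\prod_{i=1}^{n}\bigl[p(w_i \mid T)\,p(T)\bigr] = p(T)^{n}\prod_{i=1}^{n} p(w_i \mid T)
\quad\Longrightarrow\quad
\log(\text{buggy score}) = \log(\text{correct score}) + (n-1)\log p(T)
$$

The same thing happens to the clean score with $1 - p(T)$, so the log-odds pick up an extra $(n-1)\log\frac{p(T)}{1-p(T)}$.  Since $p(T)$ is about 0.1 here, that term is negative and grows with comment length, which would presumably push long comments towards the clean class.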

## Naive Bayes False Positives and Negatives

Let's now look a bit at the misclassified results.

In [431]:
#fixing shapes to avoid broadcasting
actual=np.reshape(actual,(len(actual),1))
pred=np.reshape(pred,(len(actual),1))    

fp_msk = ((pred==True)&(actual==False))    
fn_msk = ((pred==False)&(actual==True))            

In [441]:
df_fn=df_train[fn_msk][['comment_clean','mean_toxic','median_toxic','toxic']]
df_fp=df_train[fp_msk][['comment_clean','mean_toxic','median_toxic','toxic']]

In [834]:
df_fp.head()

                                                                                            comment_clean  \
1659   == SHUT UP, FAG == LOL YOU WILL BAN ME HUH? OOH IM SHAKIN IM SHAKIN FUCKING FAGGOT YOU NEED TO ...   
2570                                                                                DAMN ur an ugly fukka   
2670      == Chanting Cox == Shut up you dumb cunt. Your a nigger loving, cum cuzzling crack whore bitch!   
2773                                                              ==Suck COck!!!== you do!!!! you do!!!!    
2880        == HOLA FUCKER RETARD WHO HAS NOTHING BETTER TO DO THAN DO WIKIPEDIA SHIT == FUCK YOU ASSHOLE   

      mean_toxic  median_toxic  toxic  
1659        -1.3          -1.0  False  
2570        -1.1          -1.0  False  
2670        -1.1          -1.0  False  
2773        -1.4          -1.0  False  
2880        -1.4          -1.5  False  

In [835]:
df_fn.head()

                                                                                             comment_clean  \
2464   BUSH US EVIL, HE'S SELFISH AND VERY SHORT TEMPERED. HE SHOULDN'T EVEN BE IN CONTROL OF A PET SHO...   
3560    == GAY SEX, ANAL, ALL NIGHT == Why are you reverting me? SPUI saya on his talk page that he doe...   
5066    No thanks faggot, I'm fine right here. But you might try going to hell. You don't like to suck ...   
10064  ` == WHAT IN THE FUCK ARE YOU REFERRING TOO == I HAVE NO IDEA WHAT THE PISS YOU'RE REFERRING TOO...   
11732                             == ASSHOLE! == I just read this article, and man, was Hitler an asshole!   

       mean_toxic  median_toxic  toxic  
2464         -1.4          -2.0   True  
3560         -1.6          -2.0   True  
5066         -1.3          -2.0   True  
10064        -1.7          -2.0   True  
11732        -1.6          -2.0   True  

In [451]:
ind=df_fn.index.values

The false negatives show much more use of spacing and extra characters to avoid the filter.  

The "false positives", at least, arise because the people using the rating scale are wildly inconsistent.  These are "-1" on the toxicity scale, and so "non-toxic" under the rule where toxic comments have median toxicity less than -1.
I think I recognize some Full Metal Jacket quotes in there being used as insults.
Some are "neutral" but have lots of repetition.  I can't for the life of me imagine any of these comments adding anything to the discussion.

# Dimensionality Reduction

Let's use the truncated SVD for dimensionality reduction (or latent semantic analysis?)
Apparently TF-IDF matrix is superior to straight term frequency matrix for this purpose  (more closely matches assumptions in the SVD about the noise.)

In [70]:
from sklearn.decomposition import TruncatedSVD

In [71]:
#took a minute or two
TSVD=TruncatedSVD(n_components=100,n_iter=10)
TSVD.fit(X_train_tfidf)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=10,
       random_state=None, tol=0.0)

In [72]:
#actually transform the results 
X_train_trans=TSVD.transform(X_train_tfidf)

In [73]:
plt.plot(TSVD.explained_variance_)
plt.xlabel('Component index')
plt.ylabel('Explained variance')
plt.show()

[Figure: explained variance of each of the 100 truncated SVD components.]

We will next use the transformed results in a "deep" neural network.  

In [254]:
#actually transform the dev/test data.
X_dev_trans=TSVD.transform(X_dev_tfidf)

# Support Vector Machine

In CS229, Andrew Ng's assignment 2 suggests the SVM as a natural improvement over the Naive Bayes method.
Let's implement one of those.  I'm going to update it to do batch gradient descent with sparse matrices.
The version I wrote initially was trash, so I am attempting to vectorize the code using appropriate scipy.sparse matrix operations.

Or I could use an ensemble of SVM's based on subsets of the data. That leverages the existing (presumably smarter) scikit-learn code, in a way that could scale up. 

In [519]:
pred_nb.sum()
actual=df_train['toxic'].values
actual.sum()

947

In [708]:
#define a cost function, check that we're minimizing it.
#define the alternative cost function to be sure we also minimize that original choice.
#check constraints are obeyed?
def svm_cost(alpha,Kmat,cat,l):
    m = Kmat.shape[0]
    Ka = np.dot(Kmat,alpha);            
    cost=0.5*l*np.dot( alpha, Ka)
    yvec=(1-cat*Ka)/m
    ymsk=yvec>0
    cost+=np.sum(yvec*ymsk)
    return cost                    

#Compute Kernel Matrix, an m x m matrix
#that effectively measures similarity between inputs.
#each K_{ij} is the "distance" between the weighted inputs,
# $x^{(i)}_k, $x^{(j)}_k$.
# def kernel_matrix(x,tau):
#     m=x.shape[0]
#     x=(x>0).astype(int)
#     K=np.zeros([m,m])
#     for i in range(0,m):
#         K[:,i]=np.exp(-np.sum((x-x[i])**2,1)/(2*tau**2));
#     return K

#Compute a column from the Kernel matrix.
#Matrix is assumed to be [m,m], with vec
#of length m.  Returns vector of length m.
# def Kvec(mat,vec,tau):
#     xarg=np.sum((mat-vec)**2,1)
#     Kv=np.exp(-xarg/(2*tau**2));
#     return Kv

def Kbatch(mat,ind,norm2,tau):
    """Kbatch(mat,ind,norm2,tau)
    Compute a batch of kernel matrix elements. 
    Input: mat  - sparse matrix (nobs x nfeature)
           ind - indices for that subset of rows (nbatch)
           norm2 - column matrix with squared norm for each row (nobs,1)
           tau - width of the Gaussian kernel
    Return: Kvecs - nobs x nbatch subset of the full kernel matrix.
    """
    nbatch=len(ind)
    #extract chosen rows
    cvec = mat[ind,:].T
    #relying on numpy broadcasting to join (nobs,nbatch) + (nobs,1)
    xarg=-2.0*mat.dot(cvec)+norm2
    #further broadcasting: use a row-vector ind to make a row-vector
    #of relevant norms.
    #then broadcast again from (nobs,nbatch)+ (1,nbatch)
    xarg+=norm2[ind].T
    Kv=np.exp(-xarg/(2*tau**2));
    return Kv

#carry out update on parameters for given loss function for SVM,
#given parameters, a row-vector of inputs K_i
def svm_batchsgd_update(alpha,Kbatch,y,ind,rate,l):
    """svm_batchsgd_update
    alpha  - nobs x 1 vector
    Kbatch - nobs x Nbatch subset of Kernel matrix
    y      - (1xNbatch) labels for inputs
    ind    - (1xNbatch) indices for batch
    """
    nobs = Kbatch.shape[0]
    yK = np.multiply(Kbatch,y.T)   #nobs x Nbatch 
    yKa = np.dot(alpha.T,yK);
    Kalpha = np.multiply(Kbatch,alpha[ind].T)
    #da= (-y_i*K_i)*((1-y_i*Ka) >0)+m*l*K_i*alpha[ind];
    da= -np.multiply(yK,yKa<-1)+nobs*l*Kalpha;
    #sum all changes over columns
    alpha=alpha-rate*np.sum(da,axis=1);
    return alpha
    
#Fit SVM coefficients for toxic comments with stochastic gradient descent.
#use known categories in cat_vec, and word_matrix with nobs x nwords
def svm_fit(word_mat,cat_vec,tau=8,Nbatch=100):
    #just count whether word occurs.
    new_mat=(word_mat>0).astype(int)
    nobs,nword=new_mat.shape;
    alpha=0.1*np.random.randn(nobs,1)    #initialize parameters
    alpha0=alpha
    alpha_tot=np.zeros((nobs,1))
    niter=int(40*nobs/Nbatch);
    l=1/(tau**2*nobs)
    norm2=new_mat.multiply(new_mat).sum(axis=1)
    #multiple iterations of stochastic gradient descent.
    for t in range(0,niter):
        indx=np.random.randint(low=0,high=nobs,size=Nbatch)        
        Kv = Kbatch(new_mat,indx,norm2,tau)
        yt=cat_vec[indx]
        rate=np.sqrt(np.sqrt(1.0/(t+1)))
        alpha=svm_batchsgd_update(alpha,Kv,yt,indx,rate,l)
        alpha_tot=alpha_tot+alpha
        if (10*t % niter ==0):
            print("Iter {} of {}".format(t,niter))
    alpha_tot=alpha_tot/niter
    return alpha0,alpha_tot

#given parameters, predict the output
def svm_predict(train_mat,test_mat,alpha,tau):
    ntrain=train_mat.shape[0]
    ntest=test_mat.shape[0]
    pred_cat=np.zeros(ntest)
    train_new=(train_mat>0).astype(int)
    test_new=(test_mat>0).astype(int)

    train_norm=train_new.multiply(train_new).sum(axis=1)
    test_norm=test_new.multiply(test_new).sum(axis=1)
    #use the sparse matrix's own dot method; np.dot does not handle scipy.sparse inputs
    train_test_dot = train_new.dot(test_new.T)
    for i in range(0,ntest):
        #compute dot-product of param-vector and column of kernel matrix
        dist2 = train_norm-2*train_test_dot[:,i]+test_norm[i]
        Kvec=np.exp(-dist2/(2*tau**2))
        #Kvec=np.exp(-np.sum((train_new-test_new[i])**2,1)/(2*tau**2))
        pred_size= np.dot(alpha.T,Kvec)
        pred_cat[i] = np.sign(pred_size)
    return pred_cat
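
Both `Kbatch` and `svm_predict` lean on the expansion $\|x-y\|^2 = \|x\|^2 - 2\,x\cdot y + \|y\|^2$, so that the squared norms can be computed once and only the dot products need the sparse matrix.  A small pure-Python sketch checking that the expanded form matches the direct one, on two made-up binary word-presence vectors; the function names are hypothetical.

```python
import math

def sq_norm(x):
    """Squared Euclidean norm of a vector."""
    return sum(v * v for v in x)

def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel_expanded(x, y, tau):
    """Gaussian kernel using ||x-y||^2 = ||x||^2 - 2 x.y + ||y||^2."""
    dist2 = sq_norm(x) - 2.0 * dot(x, y) + sq_norm(y)
    return math.exp(-dist2 / (2.0 * tau ** 2))

def rbf_kernel_direct(x, y, tau):
    """Same kernel, computed from the explicit difference vector."""
    dist2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-dist2 / (2.0 * tau ** 2))

# Hypothetical binary word-presence vectors for two comments.
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
k_expanded = rbf_kernel_expanded(x, y, tau=8)
k_direct = rbf_kernel_direct(x, y, tau=8)
print(k_expanded, k_direct)
```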


In [None]:
norm2=X_train_counts.multiply(X_train_counts).sum(axis=1)

In [497]:
%pdb off

Automatic pdb calling has been turned OFF


In [77]:
Nsub=1000
np.random.seed(454)
def get_subset(frac_perc,dat_mat,labels):
    """get_subset
    Returns random subset of the data and labels.
    Maintains same fraction of toxic/non-toxic data as the full dataset.
    """ 
    #make vector and sample indices for true/false.
    nvec=np.arange(len(labels))
    #get the indices for true/false
    Tvec=nvec[labels]
    Cvec=nvec[~labels]
    #grab a random shuffling of those indices.
    np.random.shuffle(Tvec)
    np.random.shuffle(Cvec)
    #grab some fraction of them.
    it = int(len(Tvec)*frac_perc)
    ic = int(len(Cvec)*frac_perc)
    ind_sub=np.append(Tvec[:it],Cvec[:ic])
    Xsub = dat_mat[ind_sub]
    label_sub = labels[ind_sub].reshape((len(ind_sub),1))
    return ind_sub,Xsub,label_sub
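
The idea in `get_subset` is a stratified sample: shuffle the toxic and clean indices separately, then keep the same fraction of each.  A dependency-free sketch of that idea on made-up labels; `stratified_subset` and its `seed` argument are hypothetical, and it returns only the indices.

```python
import random

def stratified_subset(labels, frac, seed=0):
    """Return indices of a random subset that keeps the toxic/clean ratio.

    labels - list of booleans (True = toxic)
    frac   - fraction of each class to keep
    """
    rng = random.Random(seed)
    tox_idx = [i for i, lab in enumerate(labels) if lab]
    cln_idx = [i for i, lab in enumerate(labels) if not lab]
    # shuffle each class separately, then take the same fraction of both
    rng.shuffle(tox_idx)
    rng.shuffle(cln_idx)
    n_tox = int(len(tox_idx) * frac)
    n_cln = int(len(cln_idx) * frac)
    return tox_idx[:n_tox] + cln_idx[:n_cln]

# Hypothetical labels: 10% toxic, roughly as in the full data set.
labels = [i % 10 == 0 for i in range(1000)]
subset = stratified_subset(labels, frac=0.1)
n_toxic_in_subset = sum(labels[i] for i in subset)
print(len(subset), n_toxic_in_subset)
```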

In [88]:
%pdb off
ind_sub,Xsub,label_sub=get_subset(0.01,X_train_counts,actual)


Automatic pdb calling has been turned OFF


In [793]:
#my code: super slow.
#TODO: Look into Cython.  Does it play nice with sparse?
alpha0,alpha=svm_fit(Xsub,label_sub,Nbatch=100)

Iter 191 of 382


Iter 0 of 382


In [794]:
%pdb off
svm_pred=svm_predict(Xsub,Xsub,alpha,8)
check_predictions(svm_pred,label_sub)

Automatic pdb calling has been turned OFF


In [640]:
#Let's try to use sklearn's version on the same subset of the data

In [75]:
from sklearn.svm import SVC
nobs,nfeature=X_train_counts.shape

In [None]:
#Try to determine parameters gamma/C via cross-validation.
#Note that there is no need for explicit regularization?  Apparently in large dimensions, the parameters C/gamma (the penalty and the width of the basis function) do a decent job of regularizing, since l1, l2 regularization don't work.  

Since apparently the training time for an SVM goes as $O(n_{sample}^3)$, maybe it is better to train an ensemble of SVMs.
Each of the $n_{ensemble}$ models then sees only $n_{sample}/n_{ensemble}$ comments, in which case the training time is $n_{ensemble}\times O((n_{sample}/n_{ensemble})^3) = O(n_{sample}^3/n_{ensemble}^2)$ for the ensemble.  Then evaluating the results typically takes $O(n_{sample})$ for all of the ensemble together.  (This is something like making the assumption that the kernels are block-diagonal, once appropriately sorted.)  If we repeat this for multiple such random splits we can extract different correlations.
Then take a majority vote.

A similar idea is described [here](https://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python), which suggests
using a BaggingClassifier to automate the process.
Of course, Random Forests are another option, with a similar goal.
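
As a rough sketch of the bagging idea, the snippet below wraps an `SVC` in scikit-learn's `BaggingClassifier`, so that each member is trained on about $1/n_{ensemble}$ of the data. The synthetic data, `n_ensemble=6`, and the SVC settings are assumptions for illustration. Each member draws its own random subsample, so the subsets are not a strict partition of the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the term-count matrix and toxic labels.
X_demo, y_demo = make_classification(n_samples=600, n_features=20,
                                     n_informative=5, class_sep=2.0,
                                     random_state=0)

n_ensemble = 6
bag = BaggingClassifier(SVC(gamma='scale', C=1.0),    # base model, passed positionally
                        n_estimators=n_ensemble,
                        max_samples=1.0 / n_ensemble,  # each SVM sees ~1/n_ensemble of the rows
                        bootstrap=False,               # sample without replacement
                        random_state=0)
bag.fit(X_demo, y_demo)
bag_pred = bag.predict(X_demo)
print('training accuracy:', np.mean(bag_pred == y_demo))
```

Because `SVC` is left at its default `probability=False`, it has no `predict_proba`, so the ensemble falls back to a majority vote, as described above.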

In [808]:
?SVC

In [130]:
frac_perc=0.02
svm=SVC(cache_size=1000,verbose=True,gamma=0.01,C=0.5,class_weight='balanced')
indsub,Xsub,label_sub=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)
svm.fit(Xsub,label_sub.ravel())
svm_pred=svm.predict(Xsub)
svm_stats=check_predictions(svm_pred,label_sub)
#test on a different subset of the training data
indsub2,Xsub2,label_sub2=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)
svm_pred2=svm.predict(Xsub2)
#compare against the labels of the second subset (label_sub2), not the first
svm_stats2=check_predictions(svm_pred2,label_sub2)

(1912, 1) (1912, 1)
True Positive 0.07426778242677824. False Positive 0.29707112970711297
False Negative 0.021966527196652718. True Negative 0.606694560669456
11.7227196161
Log-loss is 11.019407830667294


(1912, 1) (1912, 1)
True Positive 0.09623430962343096. False Positive 0.2688284518828452
False Negative 0.0. True Negative 0.6349372384937239
9.87589722428
Log-loss is 9.285220742710885


[LibSVM]
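
The log-loss values printed above are large because the predictions are hard 0/1 labels rather than probabilities: every misclassified comment costs roughly $-\ln(\epsilon)$. The sketch below assumes `check_predictions` clips at about $\epsilon = 10^{-15}$. That is an inference from the printed numbers, not something shown in this section: for the first SVM run, $(0.297 + 0.022) \times 34.5 \approx 11.0$. The helper `hard_label_log_loss` and the toy labels are hypothetical.

```python
import numpy as np

def hard_label_log_loss(y_true, y_pred, eps=1e-15):
    """Mean log-loss when y_pred holds hard 0/1 labels, clipped to [eps, 1-eps]."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# Hypothetical toy labels: 10 comments, 3 of them misclassified.
y_true_demo = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_pred_demo = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0])

error_rate = np.mean(y_true_demo != y_pred_demo)
demo_loss = hard_label_log_loss(y_true_demo, y_pred_demo)
# The two printed numbers should agree closely: loss ~ error_rate * -ln(eps).
print(demo_loss, error_rate * -np.log(1e-15))
```

Read this way, the "log-loss" for hard labels is essentially a rescaled error rate, so it is only comparable with other hard-label scores, not with the probabilistic log-loss used when training the network.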

# Recurrent Neural Network

So let's try the current flavour-of-the-month approach: a recurrent neural network.
Based on talking to Joseph and Fahim at the group, they used a two-layer neural network based on just the 2000 most common words, using ReLU activation.
(I think they said their approach was inspired by someone at Kaggle.)
Let's try something similar, initially with a single leaky ReLU layer, but after applying a Truncated SVD.

# Deep Network

Another idea is to build a deep neural network on the term-frequency matrix, effectively running with extensions to the Naive Bayes model.
This will use the reduced term-frequency matrix after the Truncated SVD.  
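
For reference, the reduction step might look like the sketch below, which runs scikit-learn's `TruncatedSVD` directly on a sparse count matrix. The toy corpus and `n_components=3` are assumptions. The notebook's own `X_train_trans` is built elsewhere and may use different settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the cleaned comments.
toy_comments = ['you are great', 'you are awful', 'great article thanks',
                'awful awful edit', 'thanks for the edit', 'great edit']

toy_counts = CountVectorizer().fit_transform(toy_comments)  # sparse (n_docs, n_terms)
svd = TruncatedSVD(n_components=3, random_state=0)
toy_trans = svd.fit_transform(toy_counts)                   # dense (n_docs, 3)
print(toy_counts.shape, '->', toy_trans.shape)
```

Unlike PCA, `TruncatedSVD` does not centre the data, so it can work on the sparse term-frequency matrix without densifying it.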

In [193]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected, l2_regularizer
from tensorflow.contrib.rnn import BasicRNNCell,LSTMCell

10000

In [280]:
#just use the default graph
Nlayers=4
Nhidden=400
Nout=1
lr = 0.01
keep_prob=0.9
frac_perc=0.01
n_iter=10000

Nobs,Nfeature=X_train_trans.shape
#only grabbing a fraction of the data
#np.int is deprecated (and removed in recent NumPy); the builtin int does the same job
Nsub=int(Nobs*frac_perc)
tf.reset_default_graph()

#load in the training examples, and their labels
X = tf.placeholder(tf.float32,[Nsub,Nfeature],name='X')
y = tf.placeholder(tf.float32,[Nsub,Nout],name='y')

X2 = tf.nn.l2_normalize(X,dim=1)

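#Make a hidden layer.  Must be a smarter way to scale up (e.g. a loop over Nlayers).
#NOTE: l2_regularizer is a factory that should be called with a scale, i.e. l2_regularizer(scale).
#Passing it uncalled is the likely cause of the "'function' object has no attribute 'name'"
#message printed when the model is saved, and no L2 penalty is actually added to the loss.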
H1 = fully_connected(inputs=X2,num_outputs=Nhidden,
       activation_fn=tf.nn.relu,
       biases_initializer=tf.zeros_initializer,  
    weights_regularizer=l2_regularizer,
    biases_regularizer=l2_regularizer)
H1_d=tf.nn.dropout(H1,keep_prob)

H2 = fully_connected(inputs=H1_d,num_outputs=Nhidden,
    activation_fn=tf.nn.relu,
    biases_initializer=tf.zeros_initializer ,
    weights_regularizer=l2_regularizer,
    biases_regularizer=l2_regularizer)
H2_d=tf.nn.dropout(H2,keep_prob)

H3 = fully_connected(inputs=H2_d,num_outputs=Nhidden,
    activation_fn=tf.nn.relu,
    biases_initializer=tf.zeros_initializer ,
    weights_regularizer=l2_regularizer,
    biases_regularizer=l2_regularizer)
H3_d=tf.nn.dropout(H3,keep_prob)

H4 = fully_connected(inputs=H3_d,num_outputs=Nhidden,
    activation_fn=tf.nn.relu,
    biases_initializer=tf.zeros_initializer ,
    weights_regularizer=l2_regularizer,
    biases_regularizer=l2_regularizer)
H4_d =tf.nn.dropout(H4,keep_prob)

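#keep_prob is a plain constant, so the dropout on H1 and H2 also stays active
#whenever `outputs` is evaluated for prediction.

#Condense the hidden units down to Nout outputs with a sigmoid.
#NOTE: this reads from H3, so H3_d, H4 and H4_d are unused and the network
#effectively has 3 hidden layers (matching the '3 layer' labels printed later).
outputs=fully_connected(inputs=H3,num_outputs=Nout,
     activation_fn=tf.sigmoid)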

#should compute mean log-loss
eps=1E-15
loss = tf.losses.log_loss(y,outputs,epsilon=eps)
#loss = tf.reduce_mean(tf.square(y-outputs2))
#define optimization function.
optimizer=tf.train.AdamOptimizer(learning_rate=lr)
training_op=optimizer.minimize(loss)
init=tf.global_variables_initializer()

#save model and graph
saver=tf.train.Saver()

print('Running this thang')
with tf.Session() as sess:
     init.run()
     for iteration in range(n_iter+1):
         #select random starting point.
         ind_batch,X_batch,y_batch=get_subset(
         frac_perc,X_train_trans,actual)
         if iteration%100 ==0:
            clear_output(wait=True)
            #this is the mean log-loss on the batch (not a mean squared error)
            batch_loss=loss.eval(feed_dict={X:X_batch,y:y_batch})
            print('iter #{}. Current log-loss:{}'.format(iteration,batch_loss))
            nn_pred=sess.run(outputs,feed_dict={X:X_batch})
            nn_pred_reduced=np.round(nn_pred).astype(bool)
            check_predictions(nn_pred_reduced,y_batch)
            print('\n')
            #save the weights
            saver.save(sess,'tf_models/deep_relu_drop',global_step=iteration)
         sess.run(training_op, feed_dict={X: X_batch, y:y_batch})
         

Type is unsupported, or the types of the items don't match field type in CollectionDef.
'function' object has no attribute 'name'


iter #10000. Current log-loss:0.028003334999084473
(956, 1) (956, 1)
True Positive 0.08577405857740586. False Positive 0.005230125523012552
False Negative 0.010460251046025104. True Negative 0.8985355648535565
Log-loss is 0.5419305898648662




In [None]:
## Predictions from Deep Neural Network

Let's now run some predictions on the full training and development sets.  

In [260]:
#run some predictions by loading the meta-graph.
def network_predict(model_name,input_data):
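    """network_predict
    Load a saved neural network, and predict the output labels
    based on input_data.

    Relies on the globals X, outputs and Nsub from the graph-building cell:
    the placeholder X has a fixed batch size of Nsub rows.

    Input: model_name - string name of where the model/variables are saved.
    input_data - transformed data of shape (Nobs,Nfeature), with Nobs >= Nsub.

    Output: nn_pred_reduced - boolean array of predicted labels, shape (Nobs,1).
    """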
    with tf.Session() as sess:
        loader=tf.train.import_meta_graph(model_name+'.meta')
        loader.restore(sess,model_name)
        Nobs,Nfeature=input_data.shape
        nn_pred_total=np.zeros((Nobs,1))
        i0=0
        i1=Nsub
        while (i1 < Nobs):
            X_batch=input_data[i0:i1]
            nn_pred=sess.run(outputs,feed_dict={X:X_batch})
            nn_pred_total[i0:i1]=nn_pred
            i0=i1
            i1+=Nsub
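        #last batch: re-predict the final Nsub rows (overlapping the previous
        #batch) so that the batch still matches the fixed placeholder size.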
        X_batch=input_data[-Nsub:]
        nn_pred=sess.run(outputs,feed_dict={X:X_batch})
        nn_pred_total[-Nsub:]=nn_pred
        nn_pred_reduced=np.round(nn_pred_total).astype(bool)
    return nn_pred_reduced
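
The fixed-size batching used in `network_predict` can be illustrated without TensorFlow. The sketch below is a hypothetical helper (`predict_in_batches` and `row_sum` are not part of the notebook) that applies any batch function to fixed-size slices and reuses the last `batch_size` rows for the final, overlapping batch.

```python
import numpy as np

def predict_in_batches(predict_fn, data, batch_size):
    """Apply predict_fn to fixed-size row batches of data and stitch the results.

    predict_fn is a hypothetical callable mapping an array of shape
    (batch_size, n_features) to one of shape (batch_size, 1).
    """
    n_obs = data.shape[0]
    if n_obs < batch_size:
        raise ValueError('need at least batch_size rows')
    out = np.zeros((n_obs, 1))
    # Full, non-overlapping batches.
    for start in range(0, n_obs - batch_size + 1, batch_size):
        out[start:start + batch_size] = predict_fn(data[start:start + batch_size])
    # Final batch: reuse the last batch_size rows so the shape stays fixed;
    # rows shared with the previous batch are simply recomputed.
    out[-batch_size:] = predict_fn(data[-batch_size:])
    return out

demo_data = np.arange(23 * 4, dtype=float).reshape(23, 4)
row_sum = lambda batch: batch.sum(axis=1, keepdims=True)
demo_out = predict_in_batches(row_sum, demo_data, batch_size=5)
```

An alternative would be to declare the placeholder with a batch dimension of `None`, which would remove the need for the overlapping final batch altogether.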

In [283]:
model_name='tf_models/deep_relu_drop-{}'.format(n_iter)

nn_pred_train = network_predict(model_name,X_train_trans)

print('3 layer ReLU network')
check_predictions(nn_pred_train,actual)
print('Naive Bayes')
check_predictions(pred_nb,actual)

(matrix([[ 0.08485558,  0.01652176],
         [ 0.01175647,  0.8868662 ]]), 0.97670853455007245)

3 layer ReLU network
(95692, 1) (95692, 1)
True Positive 0.09009112569493792. False Positive 0.0037516197801279105
False Negative 0.006520921289135978. True Negative 0.8996363332357982
Log-loss is 0.3548039987843783
Naive Bayes
(95692, 1) (95692, 1)
True Positive 0.08485557831375663. False Positive 0.01652175730468587
False Negative 0.011756468670317268. True Negative 0.8868661957112403
Log-loss is 0.9767085345500724


INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-10000


In [286]:
#Try testing on the dev-set
model_name='tf_models/deep_relu_drop-{}'.format(n_iter)
nn_pred_dev = network_predict(model_name,X_dev_trans)
print('3 layer ReLU network: on Dev set')
actual_dev=df_dev['toxic'].values
nn_stats=check_predictions(nn_pred_dev,actual_dev)

model_name='tf_models/deep_relu-{}'.format(n_iter)
nn_pred_dev = network_predict(model_name,X_dev_trans)
print('3 layer ReLU network: on Dev set')
actual_dev=df_dev['toxic'].values
nn_stats=check_predictions(nn_pred_dev,actual_dev)



3 layer ReLU network: on Dev set
(32128, 1) (32128, 1)
True Positive 0.05876494023904383. False Positive 0.0297871015936255
False Negative 0.03675921314741036. True Negative 0.8746887450199203
Log-loss is 2.2984521024358737


INFO:tensorflow:Restoring parameters from tf_models/deep_relu-10000


3 layer ReLU network: on Dev set
(32128, 1) (32128, 1)
True Positive 0.056710657370517926. False Positive 0.023157370517928287
False Negative 0.038813496015936255. True Negative 0.8813184760956175
Log-loss is 2.1404164187859576


INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-10000


In [285]:
print('Naive Bayes on Dev set')
pred_dev_nb=nb.predict(X_dev_counts)
nb_stats=check_predictions(pred_dev_nb,actual_dev)

Naive Bayes on Dev set
(32128, 1) (32128, 1)
True Positive 0.06296688247011953. False Positive 0.023063994023904383
False Negative 0.03255727091633466. True Negative 0.8814118525896414
Log-loss is 1.9211088744833538


In [None]:
Evidently the (3-layer ReLU-tanh) network is over-fitting to the training set, which is not surprising since there is no regularization here.
It outperformed the Naive Bayes method on the training set (F1 of 0.946 versus 0.857, see the scores below), but did worse on the development set (F1 of 0.638 versus 0.694).

Let's put in some dropout.  Putting in dropout after each layer, with a dropout probability of 0.1 (keep_prob=0.9), improved performance.
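
To make the dropout step concrete, here is a minimal NumPy sketch of "inverted" dropout, which is the scaling convention `tf.nn.dropout` uses: surviving units are divided by `keep_prob`. The `training` switch is an assumption added for illustration. In the graph above `keep_prob` is a plain constant, so dropout is not switched off at prediction time.

```python
import numpy as np

def dropout(x, keep_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability 1-keep_prob and
    rescale the survivors by 1/keep_prob; identity when training is False."""
    if not training:
        return x
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((4, 5))
h_train = dropout(h, keep_prob=0.9, rng=rng)                  # entries are 0 or 1/0.9
h_eval = dropout(h, keep_prob=0.9, rng=rng, training=False)   # unchanged
```

One way to get the same switch in the TensorFlow graph would be to feed `keep_prob` through a placeholder (for example `tf.placeholder_with_default`) and set it to 1.0 when predicting.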

In [294]:
#dev scores
[f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)]

[0.63848495096381463, 0.69363963655066008]

In [298]:
print([f1_score(actual,nn_pred_train),f1_score(actual,pred_nb)])
print([f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)])

[0.94606310013717421, 0.85717301805130364]
[0.63848495096381463, 0.69363963655066008]
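
As a cross-check, the F1 score can be recovered directly from the confusion fractions that `check_predictions` prints, since $F_1 = 2TP/(2TP + FP + FN)$. The helper below is a hypothetical convenience function. The fractions are copied from the Naive Bayes dev-set output above.

```python
def f1_from_fractions(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); works with counts or fractions."""
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Fractions copied from the Naive Bayes dev-set output.
nb_dev_f1 = f1_from_fractions(tp=0.06296688247011953,
                              fp=0.023063994023904383,
                              fn=0.03255727091633466)
print(nb_dev_f1)
```

This agrees with the `f1_score` value of about 0.6936 reported for Naive Bayes on the dev set.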


In [211]:
np.array(np.round([1.0, 0.0, 0.1, 0.9])).astype(bool)

array([ True, False, False,  True], dtype=bool)

In [179]:
nn_pred.shape

(956, 1)

I am finding that beyond one or two layers, the network just seems to output zeros.  Maybe the learning rate was too high?

In [102]:
?sklearn

Object `sklearn` not found.


In [124]:
#checks output for a single training batch.
plt.figure()
plt.hist(nn_pred[y_batch[:,0],0],bins=20)
plt.hist(nn_pred[~y_batch[:,0],0],bins=20)
plt.show()

[Figure: overlaid histograms (20 bins each) of the network output for toxic and non-toxic comments in a single training batch]

In [205]:
plt.figure()
plt.hist(nn_pred_total,bins=100)
plt.show()

[Figure: histogram (100 bins) of nn_pred_total]