# Toxicity in Wikipedia Comments

This work runs in parallel to other analyses of the Wikipedia toxicity data.
It uses the dataset from
"Wulczyn, Ellery; Thain, Nithum; Dixon, Lucas (2016): Wikipedia Detox. figshare. doi.org/10.6084/m9.figshare.4054689"
This data has not been cleaned yet, and has not yet been split into multiple categories for the different varieties of toxicity.

The Kaggle competition uses this data in its training.
This initial analysis looks at the data, and implements tokenization, Naive Bayes, and a Deep Neural Network.
There's also some material considering methods for building a SVM.
This is succeeded by kaggle_anlt.ipynb, which runs further with the model building, but has less focus on
exploration.

Beware: lots of swearing, racism, homophobia, and misogyny are contained within, due to the nature of the comments
and the fact that I have searched for nasty terms as a sanity check on how the methods are working.


Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).
There is also a lot of wikipedia markup, and mis-spelled words.

My rough plan is to build up a lexicon, tokenize that data, and try to build a Naive Bayes model.  (Maybe later a Recurrent Neural network model?)

Cleaning:
* Clean data : How to remove newlines (search/replace: NEWLINE with '')  (done)
* Tokenize (convert words to indices) (done)
* Stemming words
* Balancing data set
* Match up comments, and review scores (done)
* Search for gibberish words (make a new "feature" for badly spelled comments)
* Spelling: one of the easiest ways to avoid a word based system is to misspell words, "fck you".
            This does suggest perhaps working with character n-grams, rather than just words. 
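
As a rough illustration of the character n-gram idea, the sketch below uses scikit-learn's `CountVectorizer` with `analyzer='char_wb'`. The three toy comments and the n-gram range (2 to 4) are assumptions chosen for illustration, not settings used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams (2 to 4 characters), built inside word boundaries.
char_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4), lowercase=True)
X = char_vect.fit_transform(["fuck you", "fck you", "thank you"])

# Count how many distinct n-grams each pair of comments has in common.
present = (X > 0).astype(int)
overlap = (present @ present.T).toarray()
print(overlap)
```

The misspelled comment shares more n-grams with the correctly spelled insult than the polite comment does, which is the property a word-level vocabulary loses.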

Embeddings:
These are necessary to reduce the dimensionality of the problem to a scale that will fit in memory.  
   * SVD - use SVD on the term-frequency matrix. Will use truncated SVD.  
   * word2vec - train vectors for words based on surrounding contexts (can use pre-trained ones)
   * Latent Factor Analysis - maybe useful prelude or alternative for building up embeddings.
                            - ALS is similar to SVD, but not guaranteed to be orthogonal.
   * Keep only most common words (in both toxic/non-toxic), or highest probability of toxic/non-toxic

Other Analysis possibilities:
* Naive Bayes
    - can find most important words
    - simple, easy to understand baseline.
* Support Vector Machine
    - try ensemble method (split the data into batches, and train an SVM on each batch.  Then do a committee vote.)
      This turns O(n_sample^3) scaling into O(n_sample^3/n_batch^2) scaling on the training.
      This is effectively treating the kernel matrix as if it were block-diagonal, as it omits correlations between datasets.
      Perhaps running multiple copies with different random splits would work?
* Deep Neural Network
    - Build a network using the term-frequency matrix as inputs.
    - Extends the naive Bayes method.  (Might be automatic way of doing some of that SVM stuff?)
    - Employ dropout for regularization, alongside L2 penalties.  
     
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
    - This one runs at the sentence/paragraph level and keeps the temporal structure.
    - Use LSTM/GRU cells, with a couple layers. 
    - Also dropout, l2 penalties

Metrics:
    - F1: harmonic mean of precision and recall
    - log-loss $-N^{-1}\sum_{j=1}^N\sum_c y_{jc}\log \hat{y}_{jc}$, where $j$ runs over observations, and $c$ runs over classes.
    - AUROC: area under the curve of true-positive rate against false-positive rate as the decision threshold $t$ is varied.  Related to the Gini coefficient by $G = 2\,\mathrm{AUROC} - 1$.
The last two were used as Kaggle metrics.  The competition just changed over to the column-average AUC-ROC metric.  Apparently this is less sensitive to leader-board climbing than the log-loss. 
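
A small sketch of how the three metrics can be computed by hand and compared with scikit-learn. The labels and probabilities are made-up toy values, and the 0.5 threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import f1_score, log_loss, roc_auc_score

y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.8, 0.05])
y_pred = (y_prob >= 0.5).astype(int)

# F1: harmonic mean of precision and recall, from the hard predictions.
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
precision, recall = tp / (tp + fp), tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

# Binary log-loss from the probabilities (note the leading minus sign).
ll_manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# AUROC: fraction of (positive, negative) pairs ranked in the right order.
pos, neg = y_prob[y_true == 1], y_prob[y_true == 0]
auc_manual = np.mean(pos[:, None] > neg[None, :])
gini = 2 * auc_manual - 1

print(f1_manual, f1_score(y_true, y_pred))
print(ll_manual, log_loss(y_true, y_prob))
print(auc_manual, roc_auc_score(y_true, y_prob))
```

The pairwise AUROC formula here ignores ties, which is fine for this toy data but would need a half-count for tied scores in general.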

In [2]:
import pandas as pd
import nltk as nltk
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import log_loss,f1_score,roc_auc_score

from IPython.display import clear_output
import time

#my code
from bayes import cond_prob, naive_bayes
%load_ext autoreload
%autoreload 2

In [2]:
#df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [42]:
df_com.columns

Index(['rev_id', 'comment', 'year', 'logged_in', 'ns', 'sample', 'split',
       'scores', 'mean_toxic', 'median_toxic', 'toxic', 'comment_clean'],
      dtype='object')

In [3]:
#When are the comments made?
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
df_com['year'].hist(bins=bin_arr)
plt.title('Total Comments')
plt.show()

<matplotlib.figure.Figure at 0x7efbb7b64be0>

In [4]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes in 'rev_id'.  Those set the boundaries for changes.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split to split the array into many sub-arrays.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr
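
The split trick above relies on `rev_id` being sorted and on `diff()` returning NaN for the first row. The sketch below replays it on a made-up five-row table (`toy_rate` is a hypothetical stand-in for `df_rate`) and shows a `groupby` alternative that may be simpler if only the per-comment statistics are needed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_rate: three ratings for revision 10, two for revision 20.
toy_rate = pd.DataFrame({
    'rev_id': [10, 10, 10, 20, 20],
    'worker_id': [1, 2, 3, 1, 4],
    'toxicity_score': [0, -1, -2, 1, 0],
})

# Split approach: boundaries are the rows where rev_id changes.
# diff() is NaN on the first row, so index 0 always counts as a boundary
# and np.split returns an empty first chunk, which is dropped here.
change = toy_rate.index[toy_rate['rev_id'].diff() != 0].values
chunks = np.split(toy_rate[['worker_id', 'toxicity_score']].values, change)[1:]

# groupby approach: per-comment statistics without assuming sorted rev_id.
stats = toy_rate.groupby('rev_id')['toxicity_score'].agg(['mean', 'median'])
print(stats)
```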

In [3]:
def score_mean(score_list):
    """score_mean
    Compute mean of toxicity scores for one comment.
    Input is the (n_ratings, 2) array of [worker_id, toxicity_score] rows;
    the scores are in column 1.  Could be updated to use a weighted mean over workers.
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute median of toxicity scores for one comment.
    Input is the (n_ratings, 2) array of [worker_id, toxicity_score] rows;
    the scores are in column 1.  Could be updated to use a weighted median over workers.
    """
    s = np.median(score_list[:,1])
    return s


In [5]:
#Make a new column computing mean, median scores
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)

In [12]:
#So 373 duplicated comments.  Awesome.  
dup_msk=df_com['comment'].duplicated(keep=False)
#I'm just going to drop these duplicates
df_com.drop_duplicates(subset='comment',inplace=True)

In [13]:
#Define toxic comments as those where the median is at or below -1 (or -2).
#-1 captures more comments, but with more variance in what is considered toxic/unhelpful.
df_com['toxic']=(df_com['median_toxic']<=-1)
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159463. Toxic comments: 15353. Toxic Fraction: 0.09627938769495117


In [17]:
#When are the comments made?  Has the toxicity changed over time?
#Note this is on the full dataset, with test/training/dev splits. 
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
#first row: toxic comments (median <= -1) on the left, the rest on the right
plt.subplot(2,2,1)
msk1=df_com['median_toxic']<=-1
plt.ylabel('Toxicity=-1')
df_com['year'][msk1].hist(bins=bin_arr)
plt.title('Toxic')
plt.subplot(2,2,2)
plt.title('Non-Toxic')
df_com['year'][~msk1].hist(bins=bin_arr)
#second row
plt.subplot(2,2,3)
msk2=df_com['median_toxic']<=-2
df_com['year'][msk2].hist(bins=bin_arr)
plt.ylabel('Toxicity=-2')
plt.subplot(2,2,4)
df_com['year'][~msk2].hist(bins=bin_arr)
plt.xlabel('Year')
plt.show()

<matplotlib.figure.Figure at 0x7f10d074eeb8>

So the toxic/non-toxic balance looks roughly constant across time, with the count dropping to roughly 10% at each step from all comments to toxic to severely toxic.
Another question about the data is what topics were under discussion? Does this bias the output/findings? 

In [83]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate
#Can maybe also eliminate all punctuation?  Makes any 

#maybe also dates?
def clean_up(comments):
    #literal tokens: no regex needed
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ',regex=False)
    com_clean=com_clean.str.replace('TAB_TOKEN',' ',regex=False)
    #Remove HTML trash, via non-greedy replacing anything between double backticks.
    #Should probably combine into a single regex.
    #re_str=r"(style|class|width|align|cellpadding|cellspacing|rowspan|colspan)=``.*?``"
    #com_clean=com_clean.str.replace(re_str,' ',regex=True)
    #regex=True is passed explicitly, since newer pandas defaults to literal matching.
    com_clean=com_clean.str.replace(r"style=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"class=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"width=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"align=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"cellpadding=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"cellspacing=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"rowspan=``.*?``",' ',regex=True)
    com_clean=com_clean.str.replace(r"colspan=``.*?``",' ',regex=True)
    #remove numbers
    com_clean=com_clean.str.replace(r"[0-9]+",' ',regex=True)
    #remove underscores
    com_clean=com_clean.str.replace("_",' ',regex=False)
    #remove symbols (both square brackets, braces, parentheses, slashes, backticks, etc.).
    #There must be a more comprehensive way of doing this?
    com_clean=com_clean.str.replace(r"[\[\]\{\}=_:\|\(\)\\\/`]+",' ',regex=True)
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace(r'\s+',' ',regex=True)
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])

KeyError: 'comment'

This does lose some information, such as possible rude symbols replicating genitalia. (There are about 6 of these crude ASCII art drawings, which is probably not worth tracking down.)
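
The cleaning code notes that the eight attribute patterns should probably be combined into a single regex. A minimal sketch of that, assuming the attributes really are quoted with double backticks as in the patterns above; `strip_markup_attrs` and the toy comment are hypothetical names and data for illustration.

```python
import re
import pandas as pd

# One alternation instead of eight separate passes over the column.
ATTR_RE = re.compile(
    r"(?:style|class|width|align|cellpadding|cellspacing|rowspan|colspan)=``.*?``"
)

def strip_markup_attrs(comments):
    """Replace wiki-table attributes such as style=``...`` with a space."""
    return comments.str.replace(ATTR_RE, ' ', regex=True)

toy = pd.Series(["| style=``color:red`` align=``center`` hello world"])
cleaned = strip_markup_attrs(toy)
print(cleaned[0])
```

A single pass also avoids rescanning each comment eight times, which may matter on the full set of comments.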

In [43]:
df_com.columns

Index(['rev_id', 'comment', 'year', 'logged_in', 'ns', 'sample', 'split',
       'scores', 'mean_toxic', 'median_toxic', 'toxic', 'comment_clean'],
      dtype='object')

In [15]:
#put the file as tsv 
df_com.to_csv('saved_dataframes/cleaned_comments.tsv.gzip',sep='\t',
columns=['rev_id','comment_clean','scores','mean_toxic','median_toxic','split','toxic'],compression='gzip')

In [3]:
#read in saved cleaned up dataframe.
df_com=pd.read_csv('saved_dataframes/cleaned_comments.tsv.gzip',sep='\t',compression='gzip')

In [56]:
#separate off training_split
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]
df_dev=df_com[df_com['split']=='dev']
df_test=df_com[df_com['split']=='test']

The following vectorizer eliminates the stop words.  However, while stop words have little impact
on the semantic content of regular documents, in internet toxicity they tend to be absent from the most toxic messages.
The hardest-to-classify comments, on the other hand, will probably contain them?

In [269]:
?CountVectorizer

In [270]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode',ngram_range=(1,2))
tfidf_vect=TfidfVectorizer(stop_words='english',lowercase=True,strip_accents='unicode',ngram_range=(1,2))
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_tfidf=tfidf_vect.fit_transform(df_train['comment_clean'])

#do the same transformations using existing vocab built up in training.
X_dev_tfidf=tfidf_vect.transform(df_dev['comment_clean'])
X_test_tfidf=tfidf_vect.transform(df_test['comment_clean'])

X_dev_counts=count_vect.transform(df_dev['comment_clean'])
X_test_counts=count_vect.transform(df_test['comment_clean'])

# Spell checking

Let's try using pyenchant to spellcheck these messages.
My goal is to build a way of catching inventive attempts to circumvent the
swearing filter.  (This catches any idiocy like "fck you".)
Anything not recognized will count as a spelling error, which can be used as a feature.  

However, this will also end up penalizing proper names, foreign languages, and rare technical terms.
I would hope these would be outweighed by having other genuine content.  


In [30]:
from enchant.checker import SpellChecker
chkr = SpellChecker("en_US")

In [105]:
#Make a custom dictionary based on data:
#Tokenize words.  Accept new words that show up in more than 20 messages.
X_log=np.sum(X_train_counts>0,axis=0)

msk=X_log>20
print(np.sum(msk))

10105


In [111]:
word_arr=voc_df1[msk.T].index.values
#now check which words are "new"
new_word=np.zeros(word_arr.shape)
for i in range(len(word_arr)):
    chkr.set_text(word_arr[i])
    for err in chkr:
        #print('Error:',err.word)
        new_word[i]=1
np.sum(new_word)


In [120]:
#write these to a custom CSV dict for use in checking.
new_word_series=pd.Series(word_arr[new_word>0])
new_word_series.to_csv('new_word_dict.txt',index=False)

In [122]:
#use US english, and augmented by new "common" words.
chkr = SpellChecker("en_US","new_word_dict.txt")

In [123]:
Ncheck=1000

err_tot=np.zeros(Ncheck)
t0=time.time()
for i in range(Ncheck):
    chkr.set_text(df_com.iloc[i]['comment_clean'])
    for err in chkr:
        #print('Error:',err.word)
        err_tot[i]+=1
t1=time.time()
print('Time taken:',t1-t0)
#So around 8 sec for 1000 entries.  Extrapolating linearly, that is roughly 20 minutes for the whole set of ~1.6x10^5 comments.
#Could use multiprocess to split up, as this is a trivially parallel task.
#But note:chkr is stateful - need a list of independent chkrs for each process/thread.

Time taken: 7.926937818527222


# Feature Engineering

Let's try to add features such as the fraction of all-caps words, the number of misspelled words, and repetition.
The spellchecker can find the number of misspelled words, from which we can construct the fraction of the message that is misspelled.
Repetition can be estimated from the term-frequency matrix, by taking the ratio of the count of any word to the total number of words.
Capitalization has to be measured on the raw text, since the vectorizer lowercases everything.
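
A minimal sketch of two of these features. The helper names `caps_fraction` and `repetition_ratio`, and the choice of "largest single-term count over total count" as the repetition measure, are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def caps_fraction(text):
    """Fraction of alphabetic words written entirely in upper case."""
    words = [w for w in text.split() if w.isalpha()]
    if not words:
        return 0.0
    return sum(w.isupper() for w in words) / len(words)

def repetition_ratio(count_mat):
    """Largest single-term count divided by the total term count, per row."""
    totals = np.asarray(count_mat.sum(axis=1)).ravel()
    top = count_mat.max(axis=1).toarray().ravel()
    # Rows with no terms get a ratio of 0 instead of a division by zero.
    return np.divide(top, totals, out=np.zeros(len(totals)), where=totals > 0)

toy = ["STOP IT NOW please", "spam spam spam spam eggs", "a perfectly ordinary comment"]
caps = np.array([caps_fraction(t) for t in toy])
rep = repetition_ratio(CountVectorizer().fit_transform(toy))
print(caps, rep)
```

Both features are cheap to compute and could be appended as extra columns next to the term-frequency or SVD features.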

# Checking the vectorizer and finding common words

I wanted to check that the vectorizer was working by outputting common words, and identifying the "most toxic" words, based on their counts.
This was useful as a sanity check.

In [271]:
#get vocabulary dictionary, then make a dataframe, with entries as rows
#Then sort dataframe by row entry value, and then use that as the index for the counts.
voc_dict=count_vect.vocabulary_
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
voc_df1=voc_df.sort_values(by=0)

In [272]:
voc_df1.iloc[29143]

0    29143
Name: 27 districts, dtype: int64

In [273]:
#Compute conditional probabilities of toxicity for each word. 
ptox,pw_tox,pw_cln = cond_prob( X_train_counts, df_train['toxic'].values, csmooth=0.01)

In [274]:
#make new dataframe with conditional probabilities for words being toxic, and raw probabilities of occurring in toxic/clean messages
#Then sort by toxicity.
X_cond= pw_tox*ptox/(pw_tox*ptox + pw_cln*(1-ptox))
word_mat=np.array([X_train_counts.sum(axis=0),X_cond,pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','pcond','p_clean','p_toxic'],index=voc_df1.index)
word_df.sort_values('pcond',ascending=False,inplace=True)
pcond_wds=word_df.head(n=20).index.values

#Mis-spelled rare swearing follows.  And someone's vendetta against VeggieTales.
print(pcond_wds)

['fuck fuck' 'faggot faggot' 'bark bark' 'suck suck' 'wanker wanker'
 'nipple nipple' 'die die' 'die fag' 'faggots faggots' 'fucksex' 'fag die'
 'fucksex fucksex' 'nigger nigger' 'jew fat' 'fat jew' 'super gay' 'gay super'
 'buttsecks' 'buttsecks buttsecks' 'bastered']


So, the most toxic words (i.e. words that only appeared in toxic messages) are misspelled attempts at rudeness, with weird spaces, and combination words.  I think this reflects more on the pre-processing.  These words show up in a single toxic message, and are thus great at inferring that one message is toxic.  This doesn't say much about more general trends in the messages.

I am considering also implementing a spell-check, and adding a variable for the number of incorrect words or fraction of the message that is misspelled.  Another feature would be the fraction that is capitalized?
The accent stripping catches simple attempts to avoid the spam filter with accents, but does miss things where the words are spaced out, or have other characters inserted.

In [142]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of words between the count vectorizer,
    and a naive 'contains' search.  Returns all the matching comments,
    and any discrepancies.        
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    #find comments with words, according to the vectorizer
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive search (missing comments count as no match)
    naive_msk=comments.str.contains('{}'.format(word),case=False,na=False).values
    naive_count=np.sum(naive_msk)
    #index the full series each time, so the masks stay aligned with it
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [275]:
#The following searches for words via a naive regex, and compares the results with the tokenizer.
vc,cc,com,ncom,dcom=check_vect(X_train_counts,df_train['comment_clean'],voc_df,pcond_wds[0])
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.
# print('Vect: {}, Naive: {}'.format(vc,cc))
# print(com.head(),'\n\n')
# print(ncom.head())

In [277]:
## Naughty words catch obvious candidates, and check for more British insults,
## and slurs for arabs.
word_counts=X_train_counts.sum(axis=0)
for word in pcond_wds:
    try:
        ind=count_vect.vocabulary_[word]
        n_occur=word_counts[0,ind]
        n_tot=np.sum(X_train_counts[:,ind]>0)
        print(word,':\t {} occurences: \t {} messages'.format(n_occur,n_tot))
    except KeyError:
        print(word,'not found')

buttsecks buttsecks :	 496 occurences: 	 1 messages
bastered :	 449 occurences: 	 2 messages


nigger nigger :	 617 occurences: 	 3 messages
jew fat :	 613 occurences: 	 1 messages
fat jew :	 609 occurences: 	 1 messages
super gay :	 500 occurences: 	 1 messages
gay super :	 499 occurences: 	 1 messages
buttsecks :	 498 occurences: 	 2 messages


die fag :	 625 occurences: 	 1 messages
faggots faggots :	 624 occurences: 	 1 messages
fucksex :	 624 occurences: 	 1 messages
fag die :	 624 occurences: 	 1 messages
fucksex fucksex :	 623 occurences: 	 1 messages


fuck fuck :	 1747 occurences: 	 79 messages
faggot faggot :	 1427 occurences: 	 4 messages
bark bark :	 999 occurences: 	 1 messages
suck suck :	 993 occurences: 	 14 messages
wanker wanker :	 963 occurences: 	 3 messages
nipple nipple :	 763 occurences: 	 2 messages
die die :	 635 occurences: 	 6 messages


So, the conditional probability for a word being toxic is chiefly determined by whether it only occurs in toxic messages.  In this case, these "most toxic words" are misspellings or portmanteaus.  The high counts are offset by appearing in only a few messages.  This suggests these are just lengthy repetitions within a single rude message.

A spellcheck might catch these and correct the spelling?  That would potentially catch attempts to dodge the filter through deliberate misspelling.

As an attempt to make a simpler, smaller data set to work with, Matt Borthwick selected out the comments with unanimous ratings.
This revealed an interesting phenomenon: there are very few obvious racist slurs in the unanimous data set (but lots of sexism and general hate).
People disagreed quite a bit on the extent to which racism was toxic, which raises a sociological question about how the (predominantly American?) reviewers perceive the toxicity of racism.
(This is something that the original project is explicitly considering at https://conversationai.github.io/bias.html)

(searching for the n-word found these)
Some ratings seem way off. e.g. the scores for comments 1467, 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments. And some raters thought that 2670 was neutral too!  What the hell?!
This suggests using the median toxicity score to avoid the mean being contaminated by people with a really different sense of what is toxic.

# Naive Bayes

I want to implement a Naive Bayes classifier as a baseline.  I've written my own version, which I will try to compare to
scikit-learn's version.  (They both return the same result now).

This basically treats the comments in a bag-of-words sense, and drops any correlations between the words.  Including some of the more
common n-grams, e.g. "frigging crank", would recover a little of that structure.

* Estimate $p(w|T)$ from counts in term-frequency matrix.
* Use Bayes Rule
  $P(T|w) = \frac{p(T)p(w|T)}{\text{normalization const}}$

  \begin{equation}
    p(T|\text{words}) = \frac{P(T) \prod_{i} p(w_i|T)}{P(T) \prod_{i} p(w_i|T) + P(C) \prod_{i} p(w_i|C)}
  \end{equation}

  where $C$ denotes the clean (non-toxic) class, $P(C) = 1 - P(T)$, and the product runs over the words $w_i$ in the comment.

* Use Logarithms, and compare log-odds for toxicity/non-toxic.  
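
The `bayes` module used in this notebook is not shown, so the following is only a stand-in sketch of the steps listed above, with hypothetical helper names `nb_fit` and `nb_log_odds` and a made-up 4x3 count matrix. It assumes additive smoothing of the word counts, in the same way as scikit-learn's `MultinomialNB`.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.naive_bayes import MultinomialNB

def nb_fit(X, y, alpha=0.01):
    """Return the prior p(T) and smoothed word probabilities p(w|T), p(w|C)."""
    y = np.asarray(y, dtype=bool)
    tox = np.asarray(X[y].sum(axis=0)).ravel() + alpha
    cln = np.asarray(X[~y].sum(axis=0)).ravel() + alpha
    return y.mean(), tox / tox.sum(), cln / cln.sum()

def nb_log_odds(X, p_tox, pw_tox, pw_cln):
    """log p(T|words) - log p(C|words); the normalization constant cancels."""
    prior = np.log(p_tox) - np.log(1 - p_tox)
    return prior + X @ (np.log(pw_tox) - np.log(pw_cln))

# Toy term-count matrix: 4 comments, 3 vocabulary terms.
X = sparse.csr_matrix(np.array([[2, 0, 1], [0, 3, 0], [1, 0, 2], [0, 1, 1]]))
y = np.array([True, False, True, False])

log_odds = nb_log_odds(X, *nb_fit(X, y))
prob = 1.0 / (1.0 + np.exp(-log_odds))   # p(T|words) from the log-odds
pred = log_odds > 0

# Cross-check against scikit-learn with the same smoothing.
sk_prob = MultinomialNB(alpha=0.01).fit(X, y).predict_proba(X)[:, 1]
print(prob, sk_prob)
```

Working with the log-odds avoids multiplying many small probabilities together; for long comments the log-odds can be large enough that the final exponential overflows or underflows, so thresholding the log-odds directly is the safer way to classify.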

In [278]:
actual=df_train['toxic'].values
msk=actual
Xtox = X_train_counts[msk,:]
df_tox=df_train[msk]
pred,prob,logT,logC,log_Tword,log_Cword=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    



In [279]:
#Plot a histogram of the log probabilities.  
plt.figure()
plt.hist(np.maximum(-50,np.log(prob)),bins=100)
plt.show()

<matplotlib.figure.Figure at 0x7f22e9ccd470>



In [280]:
#Plot a histogram of the log-odds 
plt.figure()
plt.subplot(121)
bins=np.linspace(-1000,1000,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.ylabel('Counts')
plt.xlabel('Log Odds of Toxicity')
plt.title('Full Range')
plt.subplot(122)
bins=np.linspace(-20,20,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.xlabel('Log Odds of Toxicity')
plt.title('Zoomed in')
plt.show()

<matplotlib.figure.Figure at 0x7f2330dcee80>

Maybe should also plot the length of comments? To what extent are these mirroring a similar underlying shape, with long tails?

In [133]:
com_len=df_train['comment_clean'].apply(len)
plt.hist(com_len,log=True)
plt.xlabel('Character length of message')
plt.show()

<matplotlib.figure.Figure at 0x7f239904f0b8>

In [169]:
def check_predictions(pred,actual,epsilon=1E-15):
    """check_predictions
    Compares predicted class (y_i) against actual class (z_i).
    Prints the AUROC, and returns the confusion matrix and mean log-loss.
    
    Log-loss = -sum_i{ z_i log[y_i] + (1-z_i) log[1-y_i] }/M

    Input: pred - predicted values (0,1)
    actual  - true labels 
    epsilon - shift to avoid log(0)
    Returns: Confusion matrix with [[true positive, false positive],[false negative, true negative]],
    where each entry is a fraction of all observations.
    log-loss - average log-loss
    """
    actual=np.reshape(actual,(len(actual),1))
    pred=np.reshape(pred,(len(actual),1))    
    print(pred.shape,actual.shape)
    tp = np.mean((pred==True)&(actual==True))
    tn = np.mean((pred==False)&(actual==False))
    fp = np.mean((pred==True)&(actual==False))    
    fn = np.mean((pred==False)&(actual==True))            
    scores=np.array([[tp,fp],[fn,tn]])
    print("True Positive {}. False Positive {}".format(tp,fp))
    print("False Negative {}. True Negative {}".format(fn,tn))
    #clip the predictions away from 0 and 1 by hand, rather than relying on
    #the eps argument of log_loss, which newer scikit-learn releases drop.
    pred_num=np.clip(pred.astype(float),epsilon,1-epsilon)
    logloss=log_loss(actual.ravel(),pred_num.ravel())
    auroc = roc_auc_score(actual.ravel(),pred.ravel())
    print("Log-loss is {}".format(logloss))
    print("AUROC is {}".format(auroc))    
    return scores,logloss


In [281]:
score_rates,logloss=check_predictions(pred,actual)

(95554, 1) (95554, 1)
True Positive 0.0962597065533625. False Positive 0.004510538543650711
False Negative 0.0004290767524122486. True Negative 0.8988006781505745
Log-loss is 0.17061187480262835
AUROC is 0.9952844759687263


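Interesting. The mean log-loss is surprisingly sensitive to the chosen zero-offset.  This is because `check_predictions` is handed hard 0/1 predictions rather than probabilities: every misclassified comment costs about $-\log(10^{-15}) \approx 34.5$, so the log-loss is essentially the misclassification rate times 34.5.  Scoring the probabilities themselves would be more informative, although the naive-bayes method returns a lot of incredibly small probabilities ($10^{-100}$), so some clipping would still be needed.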
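
A quick arithmetic check of where the log-loss value comes from, using the false-positive and false-negative fractions printed above for the training set. The only assumption is that each wrong hard prediction is clipped to the offset `eps` before the logarithm is taken.

```python
import numpy as np

eps = 1e-15
# Fractions printed above for the training set.
fp, fn = 0.004510538543650711, 0.0004290767524122486

# With hard 0/1 predictions, correct rows cost ~0 and each wrong row costs -log(eps).
per_error = -np.log(eps)
logloss_from_errors = (fp + fn) * per_error
print(per_error, logloss_from_errors)
```

The result agrees with the reported log-loss of about 0.1706 to roughly four significant figures, so the metric here mostly restates the error rate scaled by the offset. The dev-set value works the same way with its larger error fraction.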

In [282]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.fit(X_train_counts,df_train['toxic'].values)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [283]:
pred_nb=nb.predict(X_train_counts)
pred_dev_nb=nb.predict(X_dev_counts)
nb_stats=check_predictions(pred_nb,actual)

(95554, 1) (95554, 1)
True Positive 0.0962597065533625. False Positive 0.004510538543650711
False Negative 0.0004290767524122486. True Negative 0.8988006781505745
Log-loss is 0.17061187480262835
AUROC is 0.9952844759687263


In [286]:
#Dev set, including 2-grams.  Evidently this is enough to overfit 
# the data, since performance is much worse than on the training set.
actual_dev=df_dev['toxic'].values
nb_stats=check_predictions(pred_dev_nb,actual_dev)

(32083, 1) (32083, 1)
True Positive 0.05784995168780974. False Positive 0.021163856247857122
False Negative 0.03774584671009569. True Negative 0.8832403453542375
Log-loss is 2.03468598052041
AUROC is 0.7908753658415774


In the Kaggle case, this can be extended to multiple classes by training multiple such classifiers (since each one is independent of the others).

## Naive Bayes False Positives and Negatives

Let's now look a bit at the misclassified results.

In [287]:
#fixing shapes to avoid broadcasting
actual=np.reshape(actual,(len(actual),1))
pred=np.reshape(pred,(len(actual),1))    

fp_msk = ((pred==True)&(actual==False))    
fn_msk = ((pred==False)&(actual==True))            

In [288]:
#the masks are (n,1) column vectors, so flatten them before indexing the dataframe
df_fn=df_train[fn_msk.ravel()][['comment_clean','mean_toxic','median_toxic','toxic']]
df_fp=df_train[fp_msk.ravel()][['comment_clean','mean_toxic','median_toxic','toxic']]

In [289]:
df_fp.head()

                                                                                            comment_clean  \
952                                                                                  Stop the vandalism.    
1438                                                                         or to call people to action    
1537                                                 Quit vandalizing the pages or you will be blocked.-    
2493   BIG BALLS BIG BALLS ive got big balls youve got big balls shes got big balls theyve got big bal...   
2761                                                                  Please stop vandalizing Wikipedia.    

      mean_toxic  median_toxic  toxic  
952          0.2           0.0  False  
1438         0.0           0.0  False  
1537        -0.1           0.0  False  
2493        -0.5          -0.5  False  
2761        -0.1           0.0  False  

In [290]:
ind=df_fn.index.values

In [291]:
df_fn.head()

                                                                                             comment_clean  \
4603                                                                      You're not very bright, are you    
7449    Shut up you liar. Why aren't you abiding by wikipedia policies that content must be based on ve...   
8737                                        Thanks Thanks for participating in the conspiracy against me.    
14382  Welcome Faggot! Welcome! Hello, , and welcome to Wikipedia! Thank you for your contributions. I ...   
33504                                                                                  You. I despise you.   

       mean_toxic  median_toxic  toxic  
4603         -0.6          -1.0   True  
7449         -0.8          -1.0   True  
8737         -0.4          -1.0   True  
14382        -0.5          -1.0   True  
33504        -0.9          -1.0   True  

So, the false negatives.  Much more spacing and extra characters are being used to avoid the filter.
The false negatives in the larger set seem to be more rules-lawyering, whinging about administration, and sidestepping filters, e.g. f:)u:)c:)k:).
This is a bit harder for the classifier to find.

So at least the "false positives" are because the people using the rating scale are wildly inconsistent.  These are "-1" on the toxicity scale, and so "non-toxic" under the rule where toxic comments have median toxicity less than -1.
Some are "neutral" but have lots of repetition.  I can't for the life of me imagine any of these comments adding anything to the discussion.

# Dimensionality Reduction

Let's use the truncated SVD for dimensionality reduction (i.e. latent semantic analysis).
Apparently the TF-IDF matrix is superior to the straight term-frequency matrix for this purpose (it more closely matches the assumptions in the SVD about the noise).
Should maybe also symmetrize the transformation (as suggested in a paper comparing hyperparameters between word2vec and older SVD methods).
They suggest using $T=U \Lambda V = (U \Lambda^{1/2}) (\Lambda^{1/2} V)$ for the projection.  
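
A sketch of what the symmetric split could look like with scikit-learn's `TruncatedSVD`, whose `fit_transform` returns $U\Lambda$ and whose `components_` holds $V$. The random sparse matrix is a made-up stand-in for the TF-IDF matrix, and 5 components is an arbitrary choice.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.decomposition import TruncatedSVD

# Toy stand-in for the TF-IDF matrix: 50 documents, 30 terms.
rng = np.random.RandomState(0)
X_toy = sparse.random(50, 30, density=0.2, random_state=rng, format='csr')

svd = TruncatedSVD(n_components=5, n_iter=20, random_state=0)
doc_full = svd.fit_transform(X_toy)            # rows are U * Lambda
sqrt_s = np.sqrt(svd.singular_values_)

doc_sym = doc_full / sqrt_s                    # U * Lambda^{1/2}: document vectors
term_sym = svd.components_.T * sqrt_s          # V^T * Lambda^{1/2}: term vectors

# The two symmetric factors still multiply back to the rank-5 approximation.
approx = doc_sym @ term_sym.T
print(approx.shape)
```

Dividing by the square root of the singular values only rescales each component, so it changes distances between documents without changing the subspace.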

In [156]:
from sklearn.decomposition import TruncatedSVD
#took a minute or two
TSVD=TruncatedSVD(n_components=100,n_iter=20)
TSVD.fit(X_train_tfidf)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=20,
       random_state=None, tol=0.0)

In [157]:
#actually transform the results 
X_train_trans=TSVD.transform(X_train_tfidf)

In [158]:
plt.plot(TSVD.explained_variance_)
plt.xlabel('Component index')
plt.ylabel('Explained variance')
plt.show()

<matplotlib.figure.Figure at 0x7f23955a12e8>

We will next use the transformed results in a "deep" neural network.  

In [188]:
#actually transform the dev/test data.
X_dev_trans=TSVD.transform(X_dev_tfidf)

# Support Vector Machine

The one problem with the kernelized SVM is that its training time scales as $O(n_{\text{sample}}^\alpha)$, with scaling exponent $\alpha$ between 2 and 3.

I've come across a couple of approximate methods.  The first is a bagged model using an ensemble of SVMs, each trained on a subset of the data. That leverages the existing (presumably smarter) scikit-learn code, in a way that could scale up.  Each classifier is then part of an ensemble, and we determine the class by majority vote. 
This in essence assumes that the kernel matrix is block-diagonal, since the ensemble misses any correlations between the different subsets.
One way to recover some of those correlations would be to take multiple random batches of data.

The second is to use an approximate kernel via random Fourier features.  The resulting SVM uses the linear library (but requires $N_{\text{sample}}$ samples per kernel estimate).  This method can be applied to any kernel function $K(x,y)$ with a known Fourier transform (which can be
efficiently sampled from).
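
A minimal sketch of the random Fourier feature route, using scikit-learn's `RBFSampler` followed by a linear SVM. The synthetic data from `make_classification` stands in for the reduced comment features, and the values of `gamma`, `C`, and `n_components` are illustrative guesses rather than tuned settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in for the (SVD-reduced) comment features.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# Map inputs to 200 random Fourier features approximating an RBF kernel,
# then train a linear SVM on the mapped features.
approx_svm = make_pipeline(
    RBFSampler(gamma=0.01, n_components=200, random_state=0),
    LinearSVC(C=10, class_weight='balanced', max_iter=5000),
)
approx_svm.fit(X_toy, y_toy)
train_acc = approx_svm.score(X_toy, y_toy)
print(train_acc)
```

The training cost then grows roughly linearly with the number of comments, at the price of an approximation whose quality depends on the number of random features.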

In [24]:
Nsub=1000
np.random.seed(454)
#Should really update to just use sklearn's stratified Kfold.
def get_subset(frac_perc,dat_mat,labels):
    """get_subset(frac_perc,dat_mat,labels)
    Returns random subset of the data and labels.
    Maintains same fraction of toxic/non-toxic data as the full dataset.
    Input: 
    frac_perc: fraction of data to extract
    dat_mat: input data 
    labels:  corresponding labels.
    Returns:
    ind_sub: indices used for extraction
    Xsub : random subarray
    label_sub: corresponding labels for Xsub
    """ 
    #make vector and sample indices for true/false.
    nvec=np.arange(len(labels))
    #get the indices for true/false
    Tvec=nvec[labels]
    Cvec=nvec[~labels]
    #grab a random shuffling of those indices.
    np.random.shuffle(Tvec)
    np.random.shuffle(Cvec)
    #grab some fraction of them.
    it = int(len(Tvec)*frac_perc)
    ic = int(len(Cvec)*frac_perc)
    ind_sub=np.append(Tvec[:it],Cvec[:ic])
    Xsub = dat_mat[ind_sub]
    label_sub = labels[ind_sub].reshape((len(ind_sub),1))
    return ind_sub,Xsub,label_sub
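
The comment above suggests switching to scikit-learn's stratified splitting. One possible version uses `train_test_split` with `stratify`; the helper name `get_subset_stratified` and the toy data are hypothetical, and the seed simply mirrors the one used above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def get_subset_stratified(frac, dat_mat, labels, seed=454):
    """Stratified random subset via train_test_split (hypothetical helper)."""
    idx = np.arange(len(labels))
    # Keep `frac` of the indices, preserving the toxic/non-toxic ratio.
    ind_sub, _ = train_test_split(idx, train_size=frac, stratify=labels,
                                  random_state=seed)
    return ind_sub, dat_mat[ind_sub], labels[ind_sub]

toy_X = np.arange(2000).reshape(1000, 2)
toy_y = np.arange(1000) % 10 == 0          # 10% positives, like the toxic fraction
ind_sub, Xsub, label_sub = get_subset_stratified(0.2, toy_X, toy_y)
print(len(ind_sub), label_sub.sum())
```

Row indexing with an integer array also works on the sparse count matrices, so the same helper could be pointed at `X_train_counts`.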

In [88]:
%pdb off
ind_sub,Xsub,label_sub=get_subset(0.01,X_train_counts,actual)

Automatic pdb calling has been turned OFF
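
The comment in the cell above suggests leaning on scikit-learn for the stratified sampling. A minimal sketch of what that might look like, assuming `train_test_split` with `stratify` is an acceptable stand-in for the hand-rolled shuffling (it is not the stratified K-fold the comment mentions). The name `get_subset_stratified` and the toy arrays are made up for illustration and are not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def get_subset_stratified(frac, dat_mat, labels, seed=454):
    """Hypothetical replacement for get_subset built on train_test_split.

    Returns indices, rows and labels for a random subset that keeps
    (approximately) the same toxic/non-toxic ratio as the full data.
    """
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    # stratify=labels keeps the class proportions in the sampled indices
    ind_sub, _ = train_test_split(
        idx, train_size=frac, stratify=labels, random_state=seed
    )
    return ind_sub, dat_mat[ind_sub], labels[ind_sub].reshape(-1, 1)

# Toy stand-ins for X_train_counts and the toxic labels (roughly 10% positive)
rng = np.random.RandomState(0)
X_toy = rng.rand(1000, 5)
y_toy = rng.rand(1000) < 0.1

ind_toy, X_toy_sub, y_toy_sub = get_subset_stratified(0.2, X_toy, y_toy)
print(X_toy_sub.shape, y_toy_sub.mean(), y_toy.mean())
```

Row indexing with an index array also works on the sparse matrices that `CountVectorizer` produces, so the same call should carry over to the real term-count matrix.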


# SVM Ensemble

Since apparently the training time for an SVM goes as $O(n_{sample}^{3})$, maybe it is better to train an ensemble of SVMs.
In which case the training time is $O(n_{sample}^{3}/n_{ensemble}^{2})$ for the ensemble, since each of the $n_{ensemble}$ models only sees $n_{sample}/n_{ensemble}$ points.  Evaluating the results then typically takes $O(n_{sample})$ for all of the ensemble together.  (This is something like making the crude assumption that the kernels are block-diagonal, once appropriately sorted).  If we repeat this for multiple such random splits we can extract different correlations.
The final choice is made by majority vote.

Like any good idea, this idea has been had before.  A similar idea is available here:(https://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python), which suggests
using a BaggingClassifier to automate the whole process.  
Of course, Random Forests are another option, with a similar goal.    
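
To make the scaling argument concrete, here is a rough sketch of the idea done by hand: split the data into disjoint random blocks, fit one SVM per block, and take a majority vote. It uses a small synthetic data set from `make_classification` rather than the comment data, and the variable names are illustrative only. It is meant to show the mechanics, not to replace the `BaggingClassifier` route.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Wikipedia comments)
X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=0)

n_ensemble = 5  # odd, so a majority vote cannot tie
rng = np.random.RandomState(0)
# Disjoint random blocks: this is the "block-diagonal kernel" assumption
blocks = np.array_split(rng.permutation(len(y_toy)), n_ensemble)

# One SVM per block, each seeing n_sample / n_ensemble points
models = [SVC(gamma=0.01, C=10).fit(X_toy[b], y_toy[b]) for b in blocks]

# Every model votes on every point; the class with most votes wins
votes = np.stack([m.predict(X_toy) for m in models])
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_toy).mean()
print("ensemble accuracy on the toy data:", accuracy)

# Cost under the O(n^3) assumption: n_ensemble fits of size n / n_ensemble
n = len(y_toy)
cost_ratio = n_ensemble * (n / n_ensemble) ** 3 / n ** 3
print("training cost relative to one full SVM:", cost_ratio)  # 1 / n_ensemble**2
```

With five blocks the estimated training cost drops to 1/25 of the full fit, which is where the $n_{ensemble}^{2}$ in the denominator comes from.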

In [19]:
from sklearn.svm import SVC
#just use bagging classifier on the whole list of SVMs
from sklearn.ensemble import BaggingClassifier

#rows are observations (comments), columns are features (terms)
nobs,nfeature=X_train_counts.shape

In [None]:
#Try to determine parameters gamma/C via cross-validation.
#Note that there is no need for explicit regularization?  Apparently in large dimensions, the parameters C/gamma
#(penalty strength and width of the basis function) do a decent job of regularizing, since l1/l2 regularization don't work.

In [21]:
#make the SVM model
svm=SVC(cache_size=750,gamma=0.01,C=10,class_weight='balanced')
#The bagging classifier of those
ensemble_svm=BaggingClassifier(svm,n_estimators=10,
bootstrap=False,n_jobs=3,max_samples=0.1,oob_score=False,verbose=True)

In [77]:
frac_perc=0.2
#svm=SVC(cache_size=1000,verbose=True,gamma=0.1,C=0.5,class_weight='balanced')
indsub,Xsub,label_sub=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)


In [111]:
t0=time.time()
#ravel flattens the (n,1) label column into the 1-d array that fit expects
ensemble_svm.fit(Xsub,label_sub.ravel())
#predict on the same subset used for fitting (a different subset is tried in a later cell)
svm_pred=ensemble_svm.predict(Xsub)
t1=time.time()
print('Time Elapsed:',t1-t0)
svm_stats=check_predictions(svm_pred,label_sub)

[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    3.8s finished


Time Elapsed: 62.2320671081543
(15288, 1) (15288, 1)
True Positive 0.03270538984824699. False Positive 0.0005232862375719519
False Negative 0.06397174254317112. True Negative 0.9027995813710099
Log-loss is 2.227579796059745
AUROC is 0.6688578514324013


[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:   20.2s finished


In [110]:
frac_perc=0.04
#svm=SVC(cache_size=1000,verbose=True,gamma=0.1,C=0.5,class_weight='balanced')
indsub2,Xsub2,label_sub2=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)
svm_pred2=ensemble_svm.predict(Xsub2)
svm_stats2=check_predictions(svm_pred2,label_sub2)

(3821, 1) (3821, 1)
True Positive 0.03245223763412719. False Positive 0.0010468463752944255
False Negative 0.06411934048678357. True Negative 0.9023815755037948
Log-loss is 2.25076119359395
AUROC is 0.66744230594102


[Parallel(n_jobs=3)]: Done   3 out of   3 | elapsed:    4.5s finished


In [80]:
#Use Cross-validation to split, estimate score.
from sklearn.model_selection import GridSearchCV

gam_arr=np.logspace(-2,2,6)
C_arr=np.logspace(-2,2,6)
param_grid=dict(base_estimator__gamma=gam_arr,base_estimator__C=C_arr)

svm=SVC(cache_size=500,gamma=0.01,C=10,class_weight='balanced')
#The bagging classifier of those

ensemble_svm=BaggingClassifier(svm,n_estimators=10,
bootstrap=False,n_jobs=2,max_samples=0.1,oob_score=False,verbose=True)

#Uses stratified k-fold cross-validation.
gridsearch_svm=GridSearchCV(ensemble_svm,param_grid,error_score=0,scoring='neg_log_loss',cv=5)

#Then do grid search over that.  

In [135]:
%pdb off
##So this search took around 5 hours on 2 cores, with 20% of data, 10 estimators.  
#gridsearch_svm.fit(Xsub,label_sub.ravel())

Automatic pdb calling has been turned OFF


In [83]:
##Useful for getting parameters of bagged classifier.
#ensemble_svm.get_params()
gridsearch_svm.cv_results_

{'mean_fit_time': array([  8.515379  ,   9.01231294,   9.75185003,   9.73938398,   9.98681965,
         10.32953625,   9.1302866 ,   9.20638032,   9.61401176,   9.70504975,
         10.125845  ,  10.38082843,   8.5658195 ,   9.51361198,   9.58338413,
          9.42688198,   9.7426013 ,   9.23867021,   5.23752804,   8.05332527,
          8.26427064,   8.61298461,   8.92178097,   9.07002459,   4.95735879,
          7.89778571,   8.14328508,   8.49073753,   8.82482314,   8.98313236,
          5.27298045,   7.23655725,   8.26355944,   8.421489  ,   8.68758707,
          8.9933918 ]),
 'mean_score_time': array([ 16.90008068,  18.05262084,  19.87818789,  20.06860175,  20.97318258,
         21.40729494,  17.89313512,  18.53377829,  19.04816036,  19.9470829 ,
         21.13045225,  21.46793203,  16.92121625,  19.20889707,  18.34303741,
         19.25774665,  19.23158917,  17.89144721,   7.61561131,  13.21875992,
         16.03910851,  16.0835259 ,  17.37367401,  17.5949316 ,   6.19260755,
    

In [84]:
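#rows index C, columns index gamma: GridSearchCV sorts the parameter names and the last one varies fastest
scores=gridsearch_svm.cv_results_['mean_test_score'].reshape(len(C_arr),len(gam_arr))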

In [95]:
#gridsearch_svm.best_params_
#Results:{'base_estimator__C': 0.063095734448019331, 'base_estimator__gamma': 100.0}

{'base_estimator__C': 0.063095734448019331, 'base_estimator__gamma': 100.0}

In [94]:
plt.figure()
plt.imshow(np.exp(scores))
plt.xticks(np.arange(len(gam_arr)),gam_arr,rotation=90)
plt.yticks(np.arange(len(C_arr)),C_arr)
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.show()

<matplotlib.figure.Figure at 0x7efbb1e17128>

So this search took 5 hours or so (on 2 cores), and suggests the two regions worth exploring are $C>1, \gamma\ll 1$, and $C<1,\gamma\gg 1$. 

In [113]:
#Let's try to compare that with a full SVM on the same data.
t0=time.time()
full_svm=SVC(cache_size=1000,verbose=True,gamma=0.01,C=10,class_weight='balanced')
full_svm.fit(Xsub,label_sub.ravel())
full_svm_pred=full_svm.predict(Xsub)
t1=time.time()
print('Training time',t1-t0)
svm_stats3=check_predictions(full_svm_pred,label_sub)

[LibSVM]

Training time 78.60097455978394
(15288, 1) (15288, 1)
True Positive 0.09641548927263213. False Positive 0.004120879120879121
False Negative 0.00026164311878597594. True Negative 0.8992019884877027
Log-loss is 0.15137025072587212
AUROC is 0.9963658641979543


## Randomized Fourier Features

The Tensorflow documentation includes a great idea for extending kernel machines: use a sinusoidal mapping from the original space to another linear space.  The mapping depends on a Gaussian random variable, so when we take expectation values over the Gaussian variable, the result
of that expectation approximates the desired kernel.  Genius!
Ideas here:(https://www.tensorflow.org/tutorials/kernel_methods,
https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).
See also scikit-learn's kernel approximation methods, which implement the RBF kernel approximation described below. 

LinearSVMs work quickly, but their full kernel counterparts are slow to train, scaling as $O(n_{sample}^3)$.
Instead, consider features like 
\begin{equation}
    z_{k}(\mathbf{x})=\cos(\mathbf{\omega}_{k}\cdot\mathbf{x}+b_{k}),
\end{equation}
where $\mathbf{x}\in \mathbb{R}^{d}$, $\mathbf{\omega}_{k}\in \mathbb{R}^{d}$, $b_{k}\in\mathbb{R}$, and $\mathbf{\omega}_{k}$ is a random Gaussian vector drawn from
\begin{equation}
    P(\omega) = (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{\mathbf{\omega}^2}{2\sigma^2}\right),
\end{equation}
and $b_{k}$ is a uniform random variable drawn from $[0,2\pi)$.  Note that $z_{k}$ is a scalar.  But if we make $D$ independent draws of the random variables, then we can construct a vector $\mathbf{z}(\mathbf{x})=\sqrt{\frac{2}{D}}[z_{1},z_{2},\ldots, z_{D}]$.

The inner products of these new feature vectors for different input data are given by 
\begin{equation}
    \mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})=\frac{2}{D}\sum_{k=1}^{D} \cos(\mathbf{\omega}_{k}\cdot\mathbf{x}+b_{k})\cos(\mathbf{\omega}_{k}\cdot\mathbf{y}+b_{k}).
\end{equation}
This is essentially a Monte-Carlo estimate (with $D$ samples) of an expectation over the distributions of $\mathbf{\omega}$ and $b$.  As $D\rightarrow \infty$, this converges to 
\begin{align}
    \mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})&\approx \int d\mathbf{\omega}\int db\,P(\omega)p(b)
    2\cos(\mathbf{\omega}\cdot\mathbf{x}+b)\cos(\mathbf{\omega}\cdot\mathbf{y}+b)\\
&=\frac{1}{2\pi}\frac{1}{(2\pi \sigma^2)^{d/2}}\int d\mathbf{\omega}\int_0^{2\pi} db\,e^{-(\mathbf{\omega})^2/(2\sigma^2)}
    2\cos(\mathbf{\omega}\cdot\mathbf{x}+b)\cos(\mathbf{\omega}\cdot\mathbf{y}+b) \\
&=\frac{1}{2\pi}\frac{1}{(2\pi \sigma^2)^{d/2}}\int d\mathbf{\omega}\int_0^{2\pi} db\,e^{-(\mathbf{\omega})^2/(2\sigma^2)}
    \bigg(\cos[\mathbf{\omega}\cdot(\mathbf{x}+\mathbf{y})+2b]+\cos[\mathbf{\omega}\cdot(\mathbf{x}-\mathbf{y})]\bigg),
\end{align}
where we used the product-to-sum identity $2\cos A\cos B=\cos(A+B)+\cos(A-B)$ on the cosines.  The Gaussian and uniform integrals can be carried out, with the result
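\begin{align}
    \mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})&\approx e^{-\sigma^2(\mathbf{x}-\mathbf{y})^2/2}.
\end{align}
The integral over $b$ removes the first cosine, and the Gaussian average of $\cos[\mathbf{\omega}\cdot(\mathbf{x}-\mathbf{y})]$ gives the exponential.  Note that drawing $\mathbf{\omega}$ with standard deviation $\sigma$ gives a Gaussian kernel of width $1/\sigma$ in $\mathbf{x}$.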
The same idea can be extended for any $P(\mathbf{\omega})$ to get the desired kernel, provided it has a nice Fourier transform.
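
A quick numerical sanity check of the construction above, as a sketch in plain NumPy. The values of `d`, `D` and `sigma` are arbitrary choices for illustration. For $\mathbf{\omega}$ drawn from $P(\omega)$ with standard deviation $\sigma$, the Gaussian average of $\cos[\mathbf{\omega}\cdot(\mathbf{x}-\mathbf{y})]$ evaluates to $\exp(-\sigma^2(\mathbf{x}-\mathbf{y})^2/2)$, so that is the value the Monte-Carlo estimate is compared against.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 5, 20000, 0.7   # input dimension, number of random draws, std of omega

x = rng.normal(size=d)
y = rng.normal(size=d)

# One Gaussian vector omega_k and one uniform phase b_k per random feature
omega = rng.normal(scale=sigma, size=(D, d))
b = rng.uniform(0.0, 2.0 * np.pi, size=D)

def z(v):
    """Random Fourier feature map z(v) = sqrt(2/D) * cos(omega . v + b)."""
    return np.sqrt(2.0 / D) * np.cos(omega @ v + b)

approx = z(x) @ z(y)
exact = np.exp(-0.5 * sigma**2 * np.sum((x - y) ** 2))
print("random-feature estimate:", approx)
print("closed-form kernel     :", exact)
```

The two numbers should agree up to a Monte-Carlo error that shrinks roughly like $1/\sqrt{D}$.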

One thing noted in the docs is that this works well for smooth data, but can require a lot of components if there is a significant random component, such as trying to detect fractal structures like forests in images.  
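
Putting the pieces of this section together, a minimal sketch of an approximate-kernel SVM using scikit-learn's `RBFSampler` followed by a linear SVM. The data here is synthetic (`make_classification`), and the values of `gamma` and `n_components` are guesses for illustration, not tuned settings for the comment data. `RBFSampler` parametrizes the kernel as $\exp(-\gamma\,|\mathbf{x}-\mathbf{y}|^2)$, which corresponds to $\gamma=\sigma^2/2$ when $\mathbf{\omega}$ has standard deviation $\sigma$.

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: not the Wikipedia comments)
X_toy, y_toy = make_classification(n_samples=800, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Random Fourier features, then a linear SVM on the mapped data
rff_svm = make_pipeline(
    RBFSampler(gamma=0.1, n_components=300, random_state=0),
    LinearSVC(C=1.0, class_weight='balanced', max_iter=5000),
)
rff_svm.fit(X_tr, y_tr)
rff_acc = rff_svm.score(X_te, y_te)
print("held-out accuracy on the toy data:", rff_acc)
```

The linear solver scales roughly linearly in the number of samples, so the cost of a better kernel approximation is paid in `n_components` rather than in $n_{sample}^3$.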


# Deep Network

Another idea is to build a deep neural network on the term-frequency matrix, effectively running with extensions to the Naive Bayes model.
This will use the reduced term-frequency matrix after the Truncated SVD.  

In [136]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected, l2_regularizer
from tensorflow.contrib.rnn import BasicRNNCell,LSTMCell

In [211]:
from deep_network import deep_dropout_NN

In [261]:
#Ignore the error about serializing - this is a known problem with saving models created using 
#modules like fully_connected, since their components are not named.
#The models are saved, and the computations work.

actual=df_train['toxic'].astype(int).values
save_name='./tf_models/deep_relu_drop'
dNN=deep_dropout_NN(X_train_trans.shape)
#dNN.run_graph(X_train_trans,actual,save_name)

In [267]:
#model_name='tf_models/deep_relu_drop-{}'.format(dNN.n_iter)
model_name='./tf_models/deep_relu_drop-{}'.format(5000)
dnn_pred2=dNN.predict_all(model_name,X_train_trans)

newer predict
INFO:tensorflow:Restoring parameters from ./tf_models/deep_relu_drop-5000


In [266]:
dnn_conf=check_predictions(dnn_pred2,actual)

(95554, 1) (95554, 1)
True Positive 0.08374322372689788. False Positive 0.008204784729053729
False Negative 0.012945559578876865. True Negative 0.8951064319651715
Log-loss is 0.7305135732517722
AUROC is 0.9285140205369825


In [268]:
dnn_conf=check_predictions(dnn_pred2,actual)

(95554, 1) (95554, 1)
True Positive 0.08339786926763924. False Positive 0.008110597149255917
False Negative 0.013290914038135504. True Negative 0.8952006195449693
Log-loss is 0.7391884946271304
AUROC is 0.926780247594411


In [256]:
dnn_conf=check_predictions(dnn_pred,actual)

(95554, 1) (95554, 1)
True Positive 0.08066642945350273. False Positive 0.006090796826925089
False Negative 0.016022353852272013. True Negative 0.8972204198673002
Log-loss is 0.7637660368812476
AUROC is 0.9137733403321003


In [76]:
nb_conf=check_predictions(pred_nb,actual)

(95554, 1) (95554, 1)
True Positive 0.08529208615023966. False Positive 0.016461895891328463
False Negative 0.01139669715553509. True Negative 0.8868493208028968
Log-loss is 0.9622148788120853
AUROC is 0.931953076744998


In [None]:
## Predictions from Deep Neural Network

Let's now run some predictions on the full training and development sets.  

In [259]:
#Evaluate on the training set first, then on the dev set
model_name='tf_models/deep_relu_drop-{}'.format(dNN.n_iter)
nn_pred_train = dNN.predict_all(model_name,X_train_trans)
print('4 layer ReLU network: on Training set')
actual_train=df_train['toxic'].values
nn_stats=check_predictions(nn_pred_train,actual_train)

nn_pred_dev = dNN.predict_all(model_name,X_dev_trans)
print('4 layer ReLU network: on Dev set')
actual_dev=df_dev['toxic'].values
nn_stats=check_predictions(nn_pred_dev,actual_dev)

4 layer ReLU network: on Dev set
(32083, 1) (32083, 1)
True Positive 0.057818782532805535. False Positive 0.014711841161986098
False Negative 0.0377770158650999. True Negative 0.8896923604401085
Log-loss is 1.8129126596333467
AUROC is 0.794279337602118


newer predict
INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-1000


4 layer ReLU network: on Dev set
(95554, 1) (95554, 1)
True Positive 0.06437197814848149. False Positive 0.00998388345856793
False Negative 0.03231680515729326. True Negative 0.8933273332356573
Log-loss is 1.461022008541531
AUROC is 0.8273560765169565


newer predict
INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-1000


In [260]:
print('Naive Bayes on Dev set')
pred_dev_nb=nb.predict(X_dev_counts)
nb_stats=check_predictions(pred_dev_nb,actual_dev)

Naive Bayes on Dev set
(32083, 1) (32083, 1)
True Positive 0.06299286226350403. False Positive 0.023501542873172708
False Negative 0.0326029361344014. True Negative 0.8809026587289218
Log-loss is 1.9377988469688499
AUROC is 0.8164822255177966


In [None]:
Evidently the (3-layer ReLU-tanh) network is over-fitting to the training set.  Not surprising, since there is no regularization here.
It could outperform the Naive Bayes method on the training set, but had worse performance on the development dataset.

Let's put in some dropout. Putting in dropout after each layer, with a 0.1 dropout probability, improved performance.
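
For reference, a small NumPy sketch of what dropout does to a layer's activations. This is the common "inverted dropout" formulation and is only an illustration. It is not the implementation inside `deep_dropout_NN`, which is not shown here, and the function name and arguments are made up.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero each unit with probability drop_prob while
    training, and rescale the survivors so the expected value is unchanged."""
    if not training or drop_prob == 0.0:
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 50))            # stand-in for one layer's activations
h_drop = dropout(h, 0.1, rng)      # 0.1 dropout probability, as in the text

print("fraction of units zeroed:", (h_drop == 0).mean())
print("mean activation after dropout:", h_drop.mean())
```

At prediction time the function is called with `training=False`, so the full network is used and no rescaling is needed.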

In [294]:
#dev scores
[f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)]

[0.63848495096381463, 0.69363963655066008]

In [292]:
print([f1_score(actual,nn_pred_train),f1_score(actual,pred_nb)])
print([f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)])

[0.75269211943220748, 0.97498410006359981]
[0.68780126065999259, 0.66262049268118539]


I am finding that beyond one or two layers, the network just seems to output zeros.  Maybe the learning rate was too high? Yes - this is a common problem.

# Recurrent Neural Network

So let's try the current flavour-of-the-month approach: a recurrent neural network.
Based on talking to Joseph and Fahim at the group, they used a two-layer neural network based on just the 2000 most common words, using ReLU activation.  (I think they said their approach was inspired by someone at Kaggle.)
Let's try something similar, initially with a single leaky ReLU layer.

The idea is that the network parses each word of the sentence in turn (to better capture logical structure).
Each word needs an index.  Initially this is an index into a vocabulary of size $V\sim10^6$ or more.  One-hot encoding each word at that size gives an infeasibly large matrix.
We need some form of dimensionality reduction, either by picking the most distinctive words (which actually appear in multiple messages),
or by projecting down via SVD. 
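
Both reduction options can be sketched on a toy corpus with scikit-learn. The six sentences below are invented, and the sizes (`max_features=5`, `n_components=2`) are tiny stand-ins for something like the 2000 most common words mentioned above. `min_df=2` keeps only words that appear in at least two messages.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Invented toy comments (assumption: stand-ins for the cleaned Wikipedia comments)
toy_comments = [
    "thanks for the helpful edit",
    "please stop the edit war",
    "the edit was helpful thanks",
    "stop vandalising the page please",
    "this page needs a helpful source",
    "please add a source for the page",
]

# Option 1: keep only the most common words that occur in at least two comments
vec = CountVectorizer(max_features=5, min_df=2)
counts = vec.fit_transform(toy_comments)
print(sorted(vec.vocabulary_))   # the retained vocabulary
print(counts.shape)              # (number of comments, retained words)

# Option 2: project the count matrix down further with a truncated SVD
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)
print(reduced.shape)
```

For a recurrent network, the retained vocabulary supplies the word indices, with every other word mapped to a shared "unknown" index.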