# Toxicity in Wikipedia Comments

Hostile, toxic comments and discussions are a fact of life on the present internet.
Algorithms offer one approach to moderating these discussions, by filtering out harmful, ugly
comments which detract from the conversation. This project is an attempt to explore
that approach, and build some simple models to classify online comments as "toxic" or "non-toxic."

This uses the dataset from
"Wulczyn, Ellery; Thain, Nithum; Dixon, Lucas (2016): Wikipedia Detox. figshare. doi.org/10.6084/m9.figshare.4054689",
which was cleaned up to provide the Kaggle dataset.
That data used Amazon Mechanical Turk workers to rate comments taken from Wikipedia talk pages as being
toxic or non-toxic.  This version of the data also includes demographic and scoring information.
This notebook does not explore how perceptions of toxicity vary between raters, but I would like to explore
that variation later.  
This data has not been cleaned yet, and has not yet been split into multiple categories for the different varieties of toxicity.

This initial analysis looks at the data, and implements tokenization, Naive Bayes, and a Deep Neural Network.
There's also some material considering methods for building a SVM.
This is succeeded by kaggle_anlt.ipynb, which runs further with the model building, but has less focus on
exploration.

Beware: Lots of swearing, racism, homophobia, and misogyny are contained within, due to the nature of the comments,
and due to the fact that I have searched for nasty terms as a sanity check on how the methods are working.


# Outline/Planning

Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).
There is also a lot of Wikipedia markup, and many misspelled words.

My rough plan is to build up a lexicon, tokenize that data, and try to build a Naive Bayes model.  (Maybe later a Recurrent Neural network model?)

Cleaning:
* Clean data : How to remove newlines (search/replace: NEWLINE with '')  (done)
* Tokenize (convert words to indices) (done)
* Stemming words
* Balancing data set
* Match up comments, and review scores (done)
* Search for gibberish words (make a new "feature" for badly spelled comments)
* Spelling: one of the easiest ways to avoid a word based system is to misspell words, "fck you".
            This does suggest perhaps working with character n-grams, rather than just words. 
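
To illustrate the character n-gram idea from the last bullet, here is a minimal sketch using scikit-learn's `CountVectorizer` with `analyzer='char_wb'`. The toy comments and the helper `n_shared` are made up for illustration; the n-gram range of 2 to 4 is an arbitrary assumption, not a tuned choice.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy_comments = ['fuck you', 'fck you', 'thank you']

# Character n-grams (2 to 4 characters), built within word boundaries.
char_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 4), lowercase=True)
X_char = char_vect.fit_transform(toy_comments)

# Plain word tokens, for comparison.
word_vect = CountVectorizer(analyzer='word')
X_word = word_vect.fit_transform(toy_comments)

def n_shared(X, i, j):
    """Number of features that are present in both row i and row j."""
    a = X[i].toarray().ravel() > 0
    b = X[j].toarray().ravel() > 0
    return int(np.sum(a & b))

# At the word level the misspelling only shares 'you' with either neighbour.
print('word features shared (fuck/fck):', n_shared(X_word, 0, 1))
print('word features shared (thank/fck):', n_shared(X_word, 2, 1))
# At the character level the misspelling overlaps more with the swear word.
print('char features shared (fuck/fck):', n_shared(X_char, 0, 1))
print('char features shared (thank/fck):', n_shared(X_char, 2, 1))
```

The trade-off is a much larger feature space, so this would probably need the same dimensionality reduction discussed under Embeddings.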

Embeddings:
These are necessary to reduce the dimensionality of the problem to a scale that will fit in memory.  
   * SVD - use SVD on the term-frequency matrix. Will use truncated SVD.  
   * word2vec - train vectors for words based on surrounding contexts (can use pre-trained ones, like GloVe)
   * Latent Factor Analysis - maybe useful prelude or alternative for building up embeddings.
                            - ALS is similar to SVD, but not guaranteed to be orthogonal.
   * Keep only most common words (in both toxic/non-toxic), or highest probability of toxic/non-toxic

Other Analysis possibilities:
* Naive Bayes
    - can find most important words
    - simple, easy to understand baseline.
* Support Vector Machine
    - try ensemble method (split the data into batches, and train an SVM on each batch.  Then do a committee vote.)
      This turns O(n_sample^3) scaling into O(n_sample^3/n_batch^2) scaling on the training.
      This is effectively treating the kernel matrix as if it were block-diagonal, as it omits correlations between datasets.
      Perhaps running multiple copies with different random splits would work?
* Deep Neural Network
    - Build a network using the term-frequency matrix as inputs.
    - Extends the naive Bayes method.  (Might be automatic way of doing some of that SVM stuff?)
    - Employ dropout for regularization, alongside L2 penalties.  
     
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
    - This one runs at the sentence/paragraph level and keeps the temporal structure.
    - Use LSTM/GRU cells, with a couple layers. 
    - Also dropout, l2 penalties

Metrics:
* F1: harmonic mean of precision and recall
* log-loss: $-N^{-1}\sum_{j=1}^N\sum_c y_{jc}\log \hat{y}_{jc}$, where $j$ runs over observations, and $c$ runs over classes.
* AUROC: related to the Gini coefficient ($G = 2\,\mathrm{AUC} - 1$).  (Area under the true-positive vs. false-positive curve traced out as the decision threshold $t$ is varied.)

The last two were used as Kaggle metrics.  They just changed over to the column average AUC-ROC metric.  Apparently this is less sensitive to leader-board climbing than the log-loss. 
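
As a small worked example of the three metrics, the sketch below scores some made-up probabilities. `binary_log_loss` is a hypothetical helper written out by hand for the two-class case (with clipping so the logarithm stays finite); `f1_score` and `roc_auc_score` are the scikit-learn functions already imported in this notebook.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for 0/1 labels; p_pred is clipped to [eps, 1-eps]."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([1, 0, 0, 1, 0, 0])
p_pred = np.array([0.9, 0.2, 0.4, 0.3, 0.1, 0.6])   # made-up predicted probabilities
y_hard = (p_pred >= 0.5).astype(int)                 # hard labels at threshold t=0.5

f1 = f1_score(y_true, y_hard)          # needs hard labels
ll = binary_log_loss(y_true, p_pred)   # needs probabilities
auc = roc_auc_score(y_true, p_pred)    # needs scores/probabilities, sweeps the threshold
print('F1: {:.3f}, log-loss: {:.3f}, AUROC: {:.3f}'.format(f1, ll, auc))
```

Note that F1 depends on the chosen threshold, while the log-loss and AUROC are computed from the probabilities themselves.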

# Loading in the data

In [2]:
import pandas as pd
import nltk as nltk
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import log_loss,f1_score,roc_auc_score

from IPython.display import clear_output
import time

#my code
from bayes import cond_prob, naive_bayes
from util import clean_up, get_subset, check_predictions
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [3]:
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [42]:
df_com.columns

Index(['rev_id', 'comment', 'year', 'logged_in', 'ns', 'sample', 'split',
       'scores', 'mean_toxic', 'median_toxic', 'toxic', 'comment_clean'],
      dtype='object')

In [4]:
#When are the comments made?
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
df_com['year'].hist(bins=bin_arr)
plt.title('Total Comments')
plt.show()

<matplotlib.figure.Figure at 0x7efbb7b73a90>

In [5]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes in 'rev_id'.  Those set the boundaries for changes.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split to split the array into many sub-arrays.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr

In [6]:
def score_mean(score_list):
    """score_mean
    Compute mean of toxicity scores for one comment.
    Input is a 2D array with one row per annotation:
    column 0 is the worker_id, column 1 is the toxicity_score.
    Could be updated to use a weighted mean over workers.
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute median of toxicity scores for one comment.
    Input is a 2D array with one row per annotation:
    column 0 is the worker_id, column 1 is the toxicity_score.
    Could be updated to use a weighted median over workers.
    """
    s = np.median(score_list[:,1])
    return s


In [7]:
#Make a new column computing mean, median scores for each comment
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)
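
The split-and-apply steps above could also be sketched with a pandas `groupby`, which does not rely on `rev_id` being sorted. The toy frames below are made up and only mimic the column names used in this notebook; this is an alternative sketch, not the code that produced the results here.

```python
import pandas as pd

# Toy stand-in for df_rate: three annotations for comment 10, two for comment 11.
toy_rate = pd.DataFrame({
    'rev_id': [10, 10, 10, 11, 11],
    'worker_id': [1, 2, 3, 1, 4],
    'toxicity_score': [-2, -1, 0, 1, 0],
})
# Toy stand-in for df_com.
toy_com = pd.DataFrame({'rev_id': [10, 11], 'comment': ['first', 'second']})

# One row per comment, with the mean and median of its annotation scores.
summary = (toy_rate.groupby('rev_id')['toxicity_score']
           .agg(['mean', 'median'])
           .rename(columns={'mean': 'mean_toxic', 'median': 'median_toxic'})
           .reset_index())

# Attach the summaries to the comments by rev_id.
toy_com = toy_com.merge(summary, on='rev_id', how='left')
print(toy_com)
```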

In [8]:
#So 373 duplicated comments.  Awesome.  
dup_msk=df_com['comment'].duplicated(keep=False)
#I'm just going to drop these duplicates
df_com.drop_duplicates(subset='comment',inplace=True)

In [9]:
#Define toxic comments as those where the median is at or below -1 (or, more strictly, -2).
#-1 captures more comments, but with more variance in what is considered toxic/unhelpful.
df_com['toxic']=(df_com['median_toxic']<=-1)
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159463. Toxic comments: 15353. Toxic Fraction: 0.09627938769495117


In [10]:
#When are the comments made?  Has the toxicity changed over time?
#Note this is on the full dataset, with test/training/dev splits. 
plt.figure(figsize=(10,10))
bin_arr=np.sort(df_com['year'].unique())
#first row: toxic (median<=-1) on the left, non-toxic on the right
plt.subplot(2,2,1)
msk1=df_com['median_toxic']<=-1
plt.ylabel('Toxicity=-1')
df_com['year'][msk1].hist(bins=bin_arr)
plt.title('Toxic')
plt.subplot(2,2,2)
plt.title('Non-Toxic')
df_com['year'][~msk1].hist(bins=bin_arr)
#second row
plt.subplot(2,2,3)
msk2=df_com['median_toxic']<=-2
df_com['year'][msk2].hist(bins=bin_arr)
plt.ylabel('Toxicity=-2')
plt.subplot(2,2,4)
df_com['year'][~msk2].hist(bins=bin_arr)
plt.xlabel('Year')
plt.show()

<matplotlib.figure.Figure at 0x7efbb7b58e48>

So the split between toxic and non-toxic comments looks roughly stable across time, with roughly a factor-of-ten drop in counts going from all comments to toxic ones, and again from toxic to severely toxic ones.
Either the internet has not really changed much, or the data gatherers carefully kept the data calibrated. 
Another question about the data is what topics were under discussion? Does this bias the output/findings? 

In [11]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate html tags, newlines, non-text characters. 
#Can maybe also eliminate all punctuation?  Makes any 

#maybe also dates?
df_com['comment_clean']=clean_up(df_com['comment'])

This does lose some information.  Such as possible rude symbols (There's like 6 of these crude ascii art drawings.  This is probably not worth tracking down).
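
The `clean_up` function lives in `util` and is not shown in this notebook. A minimal sketch of that kind of regex cleaning with the pandas `str` accessor might look like the following. The function name `clean_up_sketch`, the `NEWLINE_TOKEN`/`TAB_TOKEN` markers, and the exact set of rules are all assumptions for illustration; the real function may differ.

```python
import pandas as pd

def clean_up_sketch(comments):
    """Hypothetical stand-in for util.clean_up: lowercase, letters and single spaces only."""
    out = comments.fillna('')
    # Assumed placeholder tokens for newlines/tabs in the raw data.
    out = out.str.replace('NEWLINE_TOKEN', ' ', regex=False)
    out = out.str.replace('TAB_TOKEN', ' ', regex=False)
    # Drop html-style tags.
    out = out.str.replace(r'<[^>]+>', ' ', regex=True)
    out = out.str.lower()
    # Remove apostrophes so "you're" becomes "youre" rather than two tokens.
    out = out.str.replace("'", '', regex=False)
    # Replace everything that is not a lowercase letter or whitespace.
    out = out.str.replace(r'[^a-z\s]', ' ', regex=True)
    # Collapse runs of whitespace.
    out = out.str.replace(r'\s+', ' ', regex=True).str.strip()
    return out

raw = pd.Series(["You're <b>SO</b> clever!NEWLINE_TOKENStop."])
print(clean_up_sketch(raw).iloc[0])
```

Lowercasing here throws away capitalization, so any all-caps feature would have to be computed from the raw `comment` column instead.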

In [43]:
df_com.columns

Index(['rev_id', 'comment', 'year', 'logged_in', 'ns', 'sample', 'split',
       'scores', 'mean_toxic', 'median_toxic', 'toxic', 'comment_clean'],
      dtype='object')

In [15]:
#put the file as tsv 
df_com.to_csv('saved_dataframes/cleaned_comments.tsv.gzip',sep='\t',
columns=['rev_id','comment_clean','scores','mean_toxic','median_toxic','split','toxic'],compression='gzip')

In [3]:
#read in saved cleaned up dataframe.
df_com=pd.read_csv('saved_dataframes/cleaned_comments.tsv.gzip',sep='\t',compression='gzip')

In [12]:
#separate off training_split
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]
df_dev=df_com[df_com['split']=='dev']
df_test=df_com[df_com['split']=='test']

The following vectorizer eliminates the stop words.  Stop words have little impact
on the semantic content of regular documents, and in this data they are mostly absent from the most toxic messages.
However, the hardest-to-classify comments will probably contain them, so dropping them may lose some signal?

In [None]:
?CountVectorizer

It's possible to select the n-gram range.  I've found that using 2-grams tends to lead to over-fitting (and a huge corpus).
This will just use 1-grams.  I'll also not use 

In [13]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
tfidf_vect=TfidfVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_tfidf=tfidf_vect.fit_transform(df_train['comment_clean'])

#do the same transformations using existing vocab built up in training.
X_dev_tfidf=tfidf_vect.transform(df_dev['comment_clean'])
X_test_tfidf=tfidf_vect.transform(df_test['comment_clean'])

X_dev_counts=count_vect.transform(df_dev['comment_clean'])
X_test_counts=count_vect.transform(df_test['comment_clean'])

# Spell checking 

Let's try using pyenchant to spellcheck these messages.
My goal is to build a way of catching inventive attempts to circumvent the
swearing filter.  (This catches any idiocy like "fck you".)
Anything not recognized will count as a spelling error, which could be used as a feature.

However, this will also end up penalizing proper names, foreign languages, and rare technical terms.
I would hope these would be outweighed by having other genuine content.  

In [30]:
from enchant.checker import SpellChecker
chkr = SpellChecker("en_US")

In [105]:
#Make a custom dictionary based on data:
#Tokenize words.  Accept new words that show up in more than 20 messages.
X_log=np.sum(X_train_counts>0,axis=0)

msk=X_log>20
print(np.sum(msk))

10105


In [111]:
word_arr=voc_df1[msk.T].index.values
#now check which words are "new"
new_word=np.zeros(word_arr.shape)
for i in range(len(word_arr)):
    chkr.set_text(word_arr[i])
    for err in chkr:
        #print('Error:',err.word)
        new_word[i]=1
np.sum(new_word)


In [120]:
#write these to a custom CSV dict for use in checking.
new_word_series=pd.Series(word_arr[new_word>0])
new_word_series.to_csv('new_word_dict.txt',index=False)

In [122]:
#use US english, and augmented by new "common" words.
chkr = SpellChecker("en_US","new_word_dict.txt")

In [123]:
Ncheck=1000

err_tot=np.zeros(Ncheck)
t0=time.time()
for i in range(Ncheck):
    chkr.set_text(df_com.iloc[i]['comment_clean'])
    for err in chkr:
        #print('Error:',err.word)
        err_tot[i]+=1
t1=time.time()
print('Time taken:',t1-t0)
#So around 8 sec for 1000 entries.  Will be around 20 minutes for the whole set of ~1.6x10^5.
#Could use multiprocess to split up, as this is a trivially parallel task.
#But note:chkr is stateful - need a list of independent chkrs for each process/thread.

Time taken: 7.926937818527222


# Feature Engineering

Let's try to add features such as the fraction of all-caps words, the number of misspelled words, and repetition.
The spellchecker can find the number of misspelled words, from which we can construct the fraction of the message that is misspelled.
Repetition can be estimated from the term-frequency matrix, by taking the ratio of the count of any one word to the total number of words.
Capitalization 
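
A rough sketch of these three features for a single comment is below. The function `comment_features` and the `known_words` set are hypothetical: the set stands in for a real spellchecker dictionary (such as the pyenchant checker above), and the capitalization feature assumes the raw, not lowercased, comment text.

```python
from collections import Counter

def comment_features(raw_text, known_words):
    """Simple surface features for one raw (not lowercased) comment.

    known_words: hypothetical set of accepted lowercase words,
    standing in for a spellchecker dictionary.
    """
    # Split on whitespace and trim common punctuation from each token.
    words = [w.strip('.,!?;:"\'()') for w in raw_text.split()]
    words = [w for w in words if w]
    n = max(len(words), 1)          # avoid dividing by zero for empty comments
    lowered = [w.lower() for w in words]
    counts = Counter(lowered)
    return {
        # fraction of words written entirely in capitals
        'caps_frac': sum(w.isupper() for w in words) / n,
        # share of the message taken up by its single most repeated word
        'repeat_frac': (max(counts.values()) if counts else 0) / n,
        # fraction of words not found in the dictionary
        'misspelled_frac': sum(w not in known_words for w in lowered) / n,
    }

toy_dictionary = {'you', 'are', 'so', 'dumb'}
print(comment_features('YOU are SO SO dumb fck', toy_dictionary))
```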

# Checking the vectorizer and finding common words

I wanted to check that the vectorizer was working by outputting common words, and identifying the "most toxic" words, based on their counts.
This was useful as a sanity check.

In [14]:
#get vocabulary dictionary, then make a dataframe, with entries as rows
#Then sort dataframe by row entry value, and then use that as the index for the counts.
voc_dict=count_vect.vocabulary_
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
voc_df1=voc_df.sort_values(by=0)

In [15]:
#Compute conditional probabilities of toxicity for each word. 
ptox,pw_tox,pw_cln = cond_prob( X_train_counts, df_train['toxic'].values, csmooth=0.01)

In [16]:
#make new dataframe with conditional probabilities for words being toxic, and raw probabilities of occurring in toxic/clean messages
#Then sort by toxicity.
X_cond= pw_tox*ptox/(pw_tox*ptox + pw_cln*(1-ptox))
word_mat=np.array([X_train_counts.sum(axis=0),X_cond,pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','pcond','p_clean','p_toxic'],index=voc_df1.index)
word_df.sort_values('pcond',ascending=False,inplace=True)
pcond_wds=word_df.head(n=20).index.values

#Mis-spelled rare swearing follows.  And someone's vendetta against VeggieTales.
print(pcond_wds)

['fucksex' 'buttsecks' 'bastered' 'cocksucker' 'fggt' 'mothjer' 'offfuck'
 'niggas' 'sexsex' 'yourselfgo' 'marcolfuck' 'fack' 'veggietales'
 'notrhbysouthbanof' 'ancestryfuck' 'shitfuck' 'yaaaa' 'cuntbag' 'haahhahahah'
 'cuntliz']


So, the most toxic words (i.e. words that only appeared in toxic messages) are misspelled attempts at rudeness, with weird spaces, and combination words.  I think this reflects more on the pre-processing.  These words show up in a single toxic message, and are thus great at inferring that one message is toxic.  This doesn't say much about more general trends in the messages.

I am considering also implementing a spell-check, and adding a variable for the number of incorrect words or fraction of the message that is misspelled.  Another feature would be the fraction that is capitalized?
The accent stripping catches simple attempts to avoid the spam filter with accents, but does miss things where the words are spaced out, or have other characters inserted.

In [17]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of words between the count vectorizer,
    and a naive 'contains' search.  Returns the vectorizer count, the naive
    count, the comments matched by each method, and any discrepancies.
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    #find comments with words
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive search (na=False treats missing comments as non-matches)
    naive_msk=comments.str.contains('{}'.format(word),case=False,na=False).values
    naive_count=np.sum(naive_msk)
    #index the original series with each mask; do not overwrite `comments`
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [18]:
#The following searches for words via a naive regex, and compares the results with the tokenizer.
vc,cc,com,ncom,dcom=check_vect(X_train_counts,df_train['comment_clean'],voc_df,pcond_wds[0])
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.
# print('Vect: {}, Naive: {}'.format(vc,cc))
# print(com.head(),'\n\n')
# print(ncom.head())

In [19]:
## Naughty words catch obvious candidates, and check for more British insults,
## and slurs for arabs.
word_counts=X_train_counts.sum(axis=0)
for word in pcond_wds:
    try:
        ind=count_vect.vocabulary_[word]
        n_occur=word_counts[0,ind]
        n_tot=np.sum(X_train_counts[:,ind]>0)
        print(word,':\t {} occurences: \t {} messages'.format(n_occur,n_tot))
    except KeyError:
        print(word,'not found')

fucksex :	 624 occurences: 	 1 messages
buttsecks :	 498 occurences: 	 2 messages
bastered :	 449 occurences: 	 2 messages
cocksucker :	 425 occurences: 	 37 messages
fggt :	 398 occurences: 	 5 messages
mothjer :	 391 occurences: 	 4 messages
offfuck :	 360 occurences: 	 1 messages
niggas :	 340 occurences: 	 7 messages
sexsex :	 332 occurences: 	 1 messages
yourselfgo :	 309 occurences: 	 1 messages
marcolfuck :	 260 occurences: 	 1 messages
fack :	 232 occurences: 	 2 messages
veggietales :	 212 occurences: 	 1 messages
notrhbysouthbanof :	 208 occurences: 	 2 messages
ancestryfuck :	 208 occurences: 	 1 messages
shitfuck :	 182 occurences: 	 1 messages
yaaaa :	 128 occurences: 	 1 messages
cuntbag :	 128 occurences: 	 3 messages
haahhahahah :	 128 occurences: 	 1 messages
cuntliz :	 111 occurences: 	 1 messages


So, the conditional probability for a word being toxic is chiefly determined by whether it only occurs in toxic messages.  In this case, these "most toxic words" are misspellings or portmanteaus.  The high counts are offset by only appearing in a few messages.  This suggests these are just lengthy repetitions within a handful of rude messages.

A spellcheck might catch these, and correct the spelling?  That would potentially catch the attempts to circumvent the filter via obvious misspelling.
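
One way to stop such single-message words from dominating would be to filter on document frequency (the number of messages a word appears in) rather than on raw counts. The sketch below uses the `min_df` argument of `CountVectorizer` on made-up comments; the threshold of 2 is purely illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

toy = ['spam spam spam spam insultx',
       'you are rude',
       'you are kind',
       'rude and unkind you are']

# Full vocabulary: compare total counts with the number of messages per word.
vect_all = CountVectorizer()
X_all = vect_all.fit_transform(toy)
total_count = np.asarray(X_all.sum(axis=0)).ravel()
doc_freq = np.asarray((X_all > 0).sum(axis=0)).ravel()
spam_col = vect_all.vocabulary_['spam']
print('spam: {} occurrences in {} message(s)'.format(total_count[spam_col], doc_freq[spam_col]))

# Keep only words that appear in at least two different messages.
vect_min = CountVectorizer(min_df=2)
X_min = vect_min.fit_transform(toy)
kept_words = sorted(vect_min.vocabulary_)
print('kept:', kept_words)
```

This keeps the heavily repeated word out of the vocabulary even though its raw count is the highest in the toy corpus.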

As an attempt to make a simpler, smaller data set to work with, Matt Borthwick selected out the comments with unanimous ratings.
This revealed an interesting phenomenon: there are very few obvious racist slurs in the unanimous data set (but lots of sexism and general hate).
People disagreed quite a bit on the extent to which racism was toxic, which suggests analyzing how the (predominantly American?) reviewers responded to the data. 
This is a weird sociological question about how reviewers perceive the toxicity of racism, and is something that the original project is explicitly considering at https://conversationai.github.io/bias.html.

(searching for the n-word found these)
Some ratings seem way off. e.g. the scores for comments 1467, 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments. And some raters thought that 2670 was neutral too!  What the hell?!
This suggests using the median toxicity score to avoid the mean being contaminated by people with a really different sense of what is toxic.

# Naive Bayes

I want to implement a Naive Bayes classifier as a baseline.  I've written my own version, which I will try to compare to
scikit-learn's version.  (They both return the same result now).

This basically treats the comments in a bag-of-words sense, and drops any correlations between the words.  Perhaps including some of the more
common n-grams, e.g. "frigging crank", would recover part of that.

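* Estimate $p(w|T)$ and $p(w|C)$ from counts in the term-frequency matrix, where $T$ is the toxic class and $C$ is the clean (non-toxic) class.
* Use Bayes' Rule for a single word:
  $ p(T|w) = \frac{p(T)p(w|T)}{p(T)p(w|T) + p(C)p(w|C)}$

  Treating the words as independent, the odds for a whole comment are
  \begin{equation}
    \frac{p(T|\text{words})}{p(C|\text{words})} = \frac{p(T)}{p(C)} \prod_{i}\frac{p(w_i|T)}{p(w_i|C)}
  \end{equation}

* Use logarithms, and compare log-odds for toxic/non-toxic.  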
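
The steps above can be sketched in a few lines of NumPy. This is not the code in `bayes.py` (which is not shown here); `fit_word_probs` and `log_odds` are hypothetical names, and the additive smoothing constant mirrors the `csmooth=0.01` used elsewhere in this notebook only by assumption.

```python
import numpy as np
import scipy.sparse as sparse

def fit_word_probs(X_counts, is_toxic, smooth=0.01):
    """Estimate p(T), p(w|T) and p(w|C) from a sparse count matrix with additive smoothing."""
    is_toxic = np.asarray(is_toxic, dtype=bool)
    tox_counts = np.asarray(X_counts[is_toxic].sum(axis=0)).ravel() + smooth
    cln_counts = np.asarray(X_counts[~is_toxic].sum(axis=0)).ravel() + smooth
    p_w_tox = tox_counts / tox_counts.sum()
    p_w_cln = cln_counts / cln_counts.sum()
    p_tox = is_toxic.mean()
    return p_tox, p_w_tox, p_w_cln

def log_odds(X_counts, p_tox, p_w_tox, p_w_cln):
    """Log of p(T|words)/p(C|words) for each row of the count matrix."""
    prior_term = np.log(p_tox) - np.log(1 - p_tox)
    word_term = X_counts @ (np.log(p_w_tox) - np.log(p_w_cln))
    return prior_term + np.asarray(word_term).ravel()

# Toy data: 4 comments, 3 vocabulary words; comments 0 and 2 are toxic.
X_toy = sparse.csr_matrix(np.array([[2, 0, 1],
                                    [0, 3, 0],
                                    [1, 0, 0],
                                    [0, 1, 1]]))
y_toy = np.array([True, False, True, False])

p_tox, p_w_tox, p_w_cln = fit_word_probs(X_toy, y_toy)
scores = log_odds(X_toy, p_tox, p_w_tox, p_w_cln)
pred_toy = scores > 0      # positive log-odds means "toxic"
print(scores, pred_toy)
```

Working with the sum of log ratios avoids multiplying many tiny probabilities together, which would underflow for long comments.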

In [20]:
actual=df_train['toxic'].values
actual_dev=df_dev['toxic'].values
msk=actual
Xtox = X_train_counts[msk,:]
df_tox=df_train[msk]
pred,prob,logT,logC,log_Tword,log_Cword=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    

  prob=1/(1+np.exp(log_Cscore-log_Tscore))


In [21]:
#Plot a histogram of the log probabilities.  
plt.figure()
plt.hist(np.maximum(-50,np.log(prob)),bins=100)
plt.show()

<matplotlib.figure.Figure at 0x7efbb72676d8>


In [22]:
#Plot a histogram of the log-odds 
plt.figure()
plt.subplot(121)
bins=np.linspace(-1000,1000,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.ylabel('Counts')
plt.xlabel('Log Odds of Toxicity')
plt.title('Full Range')
plt.subplot(122)
bins=np.linspace(-20,20,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.xlabel('Log Odds of Toxicity')
plt.title('Zoomed in')
plt.show()

<matplotlib.figure.Figure at 0x7efbaab935f8>

In [None]:
Maybe should also plot length of comments? To what extent are these mirroring a similar underlying shape, with long tails?

In [23]:
com_len=df_train['comment_clean'].apply(len)
plt.hist(com_len,log=True)
plt.xlabel('Character length of message')
plt.show()

<matplotlib.figure.Figure at 0x7efbb7384940>

In [24]:
logloss,score_rates=check_predictions(pred,actual)

MemoryError: 

Interesting. The mean log-loss is surprisingly sensitive to the chosen zero-offset.  I think this reflects the fact that the naive-bayes method is returning a lot of incredibly small probabilities (10^{-100}).
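
A small sketch of that sensitivity, under made-up log-scores: `scipy.special.expit` evaluates $1/(1+e^{-x})$ without overflow warnings, and the clipping constant `eps` plays the role of the zero-offset. The helper `clipped_log_loss` and the toy numbers are hypothetical.

```python
import numpy as np
from scipy.special import expit

# Made-up class log-scores of the size a bag-of-words model produces for long comments.
log_T = np.array([-1200.0, -35.0, -900.0])    # toxic class
log_C = np.array([-300.0, -36.0, -1500.0])    # clean class
prob = expit(log_T - log_C)                   # stable version of 1/(1+exp(log_C-log_T))

def clipped_log_loss(y_true, p, eps):
    """Mean binary log-loss after clipping probabilities to [eps, 1-eps]."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

actual_toy = np.array([1, 1, 0])   # the first and third predictions are confidently wrong
eps_values = [1e-15, 1e-6, 1e-3]
losses = [clipped_log_loss(actual_toy, prob, eps) for eps in eps_values]
for eps, loss in zip(eps_values, losses):
    print('eps={:g}: log-loss={:.2f}'.format(eps, loss))
```

Each confidently wrong prediction costs about $-\log(\text{eps})$, so the mean is dominated by the clipping constant rather than by the model. As an aside, the log-loss values printed by `check_predictions` in this notebook are close to the error rate (false positives plus false negatives) times 34.5, which is roughly $-\log(10^{-15})$. That would be consistent with hard 0/1 predictions being clipped at $10^{-15}$, though this is an inference from the printed numbers rather than something confirmed from the `util` code.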

In [25]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.fit(X_train_counts,df_train['toxic'].values)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [26]:
#Make predictions on training/dev sets
pred_nb=nb.predict(X_train_counts)
pred_dev_nb=nb.predict(X_dev_counts)

In [27]:
print('Checking Training data')
nb_stats=check_predictions(pred_nb,actual)

Checking Training data
True Positive 0.08531301672352806. False Positive 0.01651422232454947
False Negative 0.011375766582246687. True Negative 0.8867969943696757
Log-loss is 0.9632992952381051
AUROC is 0.9320323498876192


In [28]:
#Dev_set including 1-grams.  Evidently this is enough to overfit 
# the data, since performance is much worse than training set.
print('Checking dev set')
nb_stats=check_predictions(pred_dev_nb,actual_dev)

Checking dev set
True Positive 0.06311753888352087. False Positive 0.02322102047813484
False Negative 0.032478259514384565. True Negative 0.8811831811239598
Log-loss is 1.9238035444874515
AUROC is 0.8172894153987111


In [None]:
Evidently including the 2-grams is massively overfitting the data.
(It's essentially learning the repeated combinations that show up most in spammed toxic comments,
 and are thus only good for identifying those comments as toxic.)

## Naive Bayes False Positives and Negatives

Let's now look a bit at the misclassified results.

In [29]:
#fixing shapes to avoid broadcasting
actual=np.reshape(actual,(len(actual),1))
pred=np.reshape(pred,(len(actual),1))    

fp_msk = ((pred==True)&(actual==False))    
fn_msk = ((pred==False)&(actual==True))            

In [30]:
df_fn=df_train[fn_msk][['comment_clean','mean_toxic','median_toxic','toxic']]
df_fp=df_train[fp_msk][['comment_clean','mean_toxic','median_toxic','toxic']]

In [31]:
df_fp.head()

                                                                          comment_clean  \
128                    if i pick enough holes in you will you turn into swiss cheese ed   
534          hey rich wuzzup my mom loves you so have fun huh thank you and good night    
642               wow youre so clever so smooth stop being an ass so we can compromise    
1136   i kant believe how sad everybody who writes things on this sight are get a life    
1679                youre blocked from spamming do it again and youll be blocked again    

      mean_toxic  median_toxic  toxic  
128         -0.4           0.0  False  
534          0.3           0.5  False  
642         -0.4          -0.5  False  
1136        -0.2           0.0  False  
1679         0.1           0.0  False  

In [32]:
ind=df_fn.index.values

In [40]:
df_fn.head()

                                                                                             comment_clean  \
7472    shut up you liar why arent you abiding by wikipedia policies that content must be based on veri...   
8761                                         thanks thanks for participating in the conspiracy against me    
14410  welcome faggot welcome hello and welcome to wikipedia thank you for your contributions i hope yo...   
33567                                                                                   you i despise you    
34109                                                                               put it up your bottom    

       mean_toxic  median_toxic  toxic  
7472    -0.800000          -1.0   True  
8761    -0.400000          -1.0   True  
14410   -0.500000          -1.0   True  
33567   -0.900000          -1.0   True  
34109   -0.777778          -1.0   True  

So false negatives.  Much more spacing/characters being used to avoid the filter.
The false negatives in the larger set seem to be more rules-lawyering, whinging about administration, and sidestepping filters, e.g. f:)u:)c:)k:).
This is a bit harder for the classifier to find.

So at least the "false positives" are because the people using the rating scale are wildly inconsistent.  These are "-1" on the toxicity scale, and so "non-toxic" under the rule where toxic comments have median toxicity less than -1.
Some are "neutral" but have lots of repetition.  I can't for the life of me imagine any of these comments adding anything to the discussion.

# Dimensionality Reduction

Let's use the truncated SVD for dimensionality reduction (this is essentially latent semantic analysis).
Apparently the TF-IDF matrix is superior to the straight term-frequency matrix for this purpose (it more closely matches the assumptions in the SVD about the noise).
Should maybe also symmetrize the transformation (as suggested in a paper comparing hyperparameters between word2vec and older SVD methods).
They suggest using $T=U \Lambda V^T = (U \Lambda^{1/2}) (\Lambda^{1/2} V^T)$ for the projection.  
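
A sketch of the symmetrized projection with scikit-learn, on a random toy matrix: `TruncatedSVD.fit_transform` returns (approximately) $U\Lambda$, so dividing each column by the square root of the matching entry of `singular_values_` gives $U\Lambda^{1/2}$. The matrix sizes and names such as `X_sym` are made up for illustration.

```python
import numpy as np
import scipy.sparse as sparse
from sklearn.decomposition import TruncatedSVD

# Random sparse stand-in for the TF-IDF matrix (50 documents, 30 terms).
rng = np.random.RandomState(0)
X_toy = sparse.random(50, 30, density=0.2, random_state=rng, format='csr')

svd = TruncatedSVD(n_components=5, n_iter=20, random_state=0)
X_proj = svd.fit_transform(X_toy)                    # document vectors, U * Lambda
X_sym = X_proj / np.sqrt(svd.singular_values_)       # symmetrized, U * Lambda^{1/2}

# Column norms show the scaling: Lambda for the plain projection,
# sqrt(Lambda) for the symmetrized one.
print(np.linalg.norm(X_proj, axis=0))
print(np.linalg.norm(X_sym, axis=0))
```

The same rescaling would be applied to the dev and test projections obtained from `transform`, using the singular values from the training fit.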

In [40]:
from sklearn.decomposition import TruncatedSVD
#took a minute or two
TSVD=TruncatedSVD(n_components=100,n_iter=20)
TSVD.fit(X_train_tfidf)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=20,
       random_state=None, tol=0.0)

In [41]:
#actually transform the results 
X_train_trans=TSVD.transform(X_train_tfidf)         

In [38]:
plt.loglog(TSVD.explained_variance_)
plt.xlabel('Component index')
plt.ylabel('Explained variance')
plt.show()
plt.plot(TSVD.explained_variance_)
plt.xlabel('Component index')
plt.ylabel('Explained variance')
plt.show()

<matplotlib.figure.Figure at 0x7efbaa8948d0>

<matplotlib.figure.Figure at 0x7efbaee658d0>

So looks like power-law decay in the spectrum. We're primarily interested in using this
for dimensionality reduction.  Ideally, I'd pick some reasonable threshold for keeping a
certain fraction of the explained variance.
(Could estimate a power law tail, compute threshold to capture that percentage).

But for now, we'll just keep 100 components, as a suitably small arbitrary choice.  
We will next use the transformed results in a "deep" neural network.  
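
Picking the number of components from a target fraction of explained variance could be sketched as below. The helper `n_components_for` is hypothetical, and the ratios are made-up numbers; with a fitted `TruncatedSVD` one would pass its `explained_variance_ratio_` attribute instead.

```python
import numpy as np

def n_components_for(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative ratio reaches `target`.

    Returns None if the fitted components never reach the target.
    """
    cum = np.cumsum(explained_variance_ratio)
    if cum[-1] < target:
        return None
    return int(np.searchsorted(cum, target) + 1)

# Made-up ratios standing in for TSVD.explained_variance_ratio_
ratios = np.array([0.30, 0.15, 0.10, 0.05, 0.03])
print(n_components_for(ratios, 0.5))   # reached with the first three components
print(n_components_for(ratios, 0.9))   # not reachable with only five components
```

Since a truncated SVD only reports the components it was fitted with, the target can only be checked up to `n_components`; a `None` result means the fit would need to be rerun with more components.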

In [42]:
#actually transform the dev/test data.
X_dev_trans=TSVD.transform(X_dev_tfidf)

In [24]:
Nsub=1000
np.random.seed(454)
#Should really update to just use sklearn's stratified Kfold.
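
A minimal sketch of drawing a class-balanced subset with scikit-learn, as the comment above suggests. This uses `train_test_split` with `stratify` rather than the `get_subset` helper from `util` (which is not shown here); the toy labels and the subset size of 200 are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 10% toxic fraction, similar to the real data.
labels = np.array([True] * 100 + [False] * 900)
idx = np.arange(len(labels))

# Draw 200 row indices while preserving the toxic/non-toxic ratio.
sub_idx, rest_idx = train_test_split(idx, train_size=200,
                                     stratify=labels, random_state=454)
print(len(sub_idx), labels[sub_idx].mean())
```

The returned indices can then be used to slice both the feature matrix and the label array consistently.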


# Deep Network

Another idea is to build a deep neural network on the term-frequency matrix, effectively running with extensions to the Naive Bayes model.
This will use the reduced term-frequency matrix after the Truncated SVD.  

In [43]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected, l2_regularizer
from tensorflow.contrib.rnn import BasicRNNCell,LSTMCell

In [44]:
from deep_network import deep_dropout_NN

In [48]:
#Ignore the error about serializing - this is a known problem with saving models created using 
#modules like fully_connected, since their components are not named.
#The models are saved, and the computations work.

actual=df_train['toxic'].astype(int).values
save_name='./tf_models/deep_relu_drop'
dNN=deep_dropout_NN(X_train_trans.shape)
dNN.run_graph(X_train_trans,actual,save_name)

<matplotlib.figure.Figure at 0x7efb80f7dd30>

iter #1000. Current log-loss:0.13786394894123077


Type is unsupported, or the types of the items don't match field type in CollectionDef.
'function' object has no attribute 'name'


In [62]:
#model_name='tf_models/deep_relu_drop-{}'.format(dNN.n_iter)
model_name='./tf_models/deep_relu_drop-{}'.format(1000)
dnn_pred2=dNN.predict_all(model_name,X_train_trans)
dnn_pred2=dnn_pred2.reshape(-1)

newer predict
INFO:tensorflow:Restoring parameters from ./tf_models/deep_relu_drop-1000


In [63]:
#Check scores on training data
dnn_conf=check_predictions(dnn_pred2,actual)

True Positive 0.06058354438328066. False Positive 0.006645457019067752
False Negative 0.036105238922494086. True Negative 0.8966657596751575
Log-loss is 1.4765620415427758
AUROC is 0.8096130944597727


In [64]:
#Compare with naive-bayes training
nb_conf=check_predictions(pred_nb,actual)

True Positive 0.08531301672352806. False Positive 0.01651422232454947
False Negative 0.011375766582246687. True Negative 0.8867969943696757
Log-loss is 0.9632992952381051
AUROC is 0.9320323498876192


In [None]:
Right now the network seems undertrained.

In [None]:
## Predictions from Deep Neural Network

Let's now run some predictions on the full training and development sets.  

In [66]:
#Try testing on the dev-set
model_name='tf_models/deep_relu_drop-{}'.format(dNN.n_iter)
nn_pred_train = dNN.predict_all(model_name,X_train_trans)
print('4 layer ReLU network: on Dev set')
actual_train=df_train['toxic'].values
nn_pred_train=nn_pred_train.reshape(-1)
nn_stats=check_predictions(nn_pred_train,actual_train)

nn_pred_dev = dNN.predict_all(model_name,X_dev_trans)
print('4 layer ReLU network: on Dev set')
actual_dev=df_dev['toxic'].values
nn_pred_dev=nn_pred_dev.reshape(-1)
nn_stats=check_predictions(nn_pred_dev,actual_dev)

4 layer ReLU network: on Dev set
True Positive 0.05386029984727114. False Positive 0.011594925661565315
False Negative 0.04173549855063429. True Negative 0.8928092759405293
Log-loss is 1.841976868183656
AUROC is 0.7752982535343149


newer predict
INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-1000


4 layer ReLU network: on Dev set
True Positive 0.06067773196307847. False Positive 0.006614061159135149
False Negative 0.036011051342696276. True Negative 0.8966971555350901
Log-loss is 1.4722245180949745
AUROC is 0.8101175383672511


newer predict
INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-1000


In [67]:
print('Naive Bayes on Dev set')
pred_dev_nb=nb.predict(X_dev_counts)
nb_stats=check_predictions(pred_dev_nb,actual_dev)

Naive Bayes on Dev set
True Positive 0.06311753888352087. False Positive 0.02322102047813484
False Negative 0.032478259514384565. True Negative 0.8811831811239598
Log-loss is 1.9238035444874515
AUROC is 0.8172894153987111


In [None]:
Prior work with a (3-layer ReLU-tanh) network led to over-fitting to the training set.  Not surprising, since there was no regularization here.
It could outperform the Naive Bayes method on the training set, but had worse performance on the development dataset.

Let's put in some dropout. Putting in dropout after each layer, with a 0.1 dropout probability improved performance.

In [294]:
#dev scores
[f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)]

[0.63848495096381463, 0.69363963655066008]

In [292]:
print([f1_score(actual,nn_pred_train),f1_score(actual,pred_nb)])
print([f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)])

[0.75269211943220748, 0.97498410006359981]
[0.68780126065999259, 0.66262049268118539]


I am finding that beyond one or two layers, the network just seems to output zeros.  Maybe the learning rate was too high? Yes - this is a common problem.

# Recurrent Neural Network

So let's try the current flavour of the month approach: a recurrent neural network.
Based on talking to Joseph and Fahim at the group, they used a two-layer neural network based on just the 2000 most common words, using ReLU activation.  (I think they said their approach was inspired by someone at Kaggle.)
Let's try something similar, initially with a single leaky ReLU layer.

The idea is that the network parses each word of the sentence (to better capture logical structure).
Each word needs an index.  Initially this is an index in the vocabulary V, where $V\sim10^6$ or more.  That's an infeasibly large matrix.
We need some form of dimensionality reduction.  Either by picking the most distinctive words (which actually appear in multiple messages),
or by projecting down via SVD. 
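
The "most common words" route can be sketched as follows: build a small vocabulary, map each comment to a sequence of word indices, and pad to a fixed length so the sequences can be fed to a recurrent network. The function names, the reserved indices (0 for padding, 1 for out-of-vocabulary words), and the toy sizes are assumptions for illustration.

```python
from collections import Counter
import numpy as np

def build_vocab(comments, n_words):
    """Map the n_words most common words to indices 2, 3, ...

    Index 0 is reserved for padding and index 1 for out-of-vocabulary words.
    """
    counts = Counter(w for c in comments for w in c.split())
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(n_words))}

def to_index_array(comments, vocab, max_len):
    """Convert comments to a (n_comments, max_len) integer array, truncating or zero-padding."""
    out = np.zeros((len(comments), max_len), dtype=np.int64)
    for r, c in enumerate(comments):
        ids = [vocab.get(w, 1) for w in c.split()][:max_len]
        out[r, :len(ids)] = ids
    return out

toy_comments = ['you are rude', 'you are kind and you are nice', 'rude']
vocab = build_vocab(toy_comments, n_words=3)
X_seq = to_index_array(toy_comments, vocab, max_len=4)
print(vocab)
print(X_seq)
```

An embedding layer would then turn each index into a dense vector, which is where pretrained word vectors or the SVD projection could come in.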