# Toxicity in Wikipedia Comments

This is a parallel effort to other work on the Wikipedia toxicity data.
The data has not been cleaned yet, and multiple categories for the different kinds of toxicity have not been introduced yet.

Beware: the comments contain lots of swearing, racism, homophobia, and misogyny.
I have also searched for nasty terms as a sanity check on how the methods are working.


Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).
There is also a lot of wikipedia markup, and mis-spelled words.

My rough plan is to build up a lexicon, tokenize that data, and try to build a Naive Bayes model.  (Maybe later a Recurrent Neural network model?)

Cleaning:
* Clean data : How to remove newlines (search/replace: NEWLINE with '')  (done)
* Tokenize (convert words to indices) (done)
* Stemming words
* Balancing data set
* Match up comments, and review scores (done)
* Search for gibberish words (make a new "feature" for badly spelled comments)

Embeddings:
These are necessary to reduce the dimensionality of the problem to a scale that will fit in memory.  
   * SVD - use SVD on the term-frequency matrix. Will use truncated SVD.  
   * word2vec - train vectors for words based on surrounding contexts (can use pre-trained ones)
   * Latent Factor Analysis - maybe useful prelude or alternative for building up embeddings.  ALS is similar to SVD, but its factors are not guaranteed to be orthogonal.
   * Keep only most common words (in both toxic/non-toxic), or highest probability of toxic/non-toxic

Other Analysis possibilities:
* Naive Bayes
    - can find most important words
    - simple, easy to understand baseline.
* Support Vector Machine
    - try ensemble method (split the data into batches, and train an SVM on each batch.  Then do a committee vote.)
      This turns O(n_sample^3) scaling into O(n_sample^3/n_batch^2) scaling on the training.
      This is effectively treating the kernel matrix as if it were block-diagonal, as it omits correlations between datasets.
      Perhaps running multiple copies with different random splits would work?
* Deep Neural Network
    - Build a network using the term-frequency matrix as inputs.
    - Extends the naive Bayes method.  (Might be automatic way of doing some of that SVM stuff?)
    - Employ dropout for regularization, alongside L2 penalties.  
     
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
    - This one runs at the sentence/paragraph level and keeps the temporal structure.
    - Use LSTM/GRU cells, with a couple layers. 
    - Also dropout, l2 penalties

Metrics:
* F1: harmonic mean of precision and recall.
* log-loss: $-N^{-1}\sum_{j=1}^N\sum_c y_{jc}\log \hat{y}_{jc}$, where $j$ runs over observations, and $c$ runs over classes.
* AUROC: area under the curve of true-positive rate against false-positive rate as the decision threshold $t$ is varied.  Related to the Gini coefficient via Gini $= 2\,\text{AUROC} - 1$.

The last two were used as Kaggle metrics.  They just changed over to the column-average AUC-ROC metric.  Apparently this is less sensitive to leader-board climbing than the log-loss.
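
As a concrete reference for the three metrics, here is a minimal NumPy sketch on a made-up toy example. The helper names (`f1`, `binary_log_loss`, `auroc`) are illustrative only and are not the scikit-learn functions imported below; the AUROC helper assumes there are no tied scores, and `f1` assumes at least one positive prediction and one positive label.

```python
import numpy as np

def f1(y_true, y_pred):
    """F1 score: harmonic mean of precision and recall for binary labels."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log-loss; probabilities are clipped to [eps, 1-eps]."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true]
    neg = scores[~y_true]
    return np.mean(pos[:, None] > neg[None, :])

# Made-up labels and predicted probabilities of the positive class.
y_toy = np.array([1, 0, 1, 0, 0])
p_toy = np.array([0.9, 0.2, 0.4, 0.6, 0.1])

f1_toy = f1(y_toy, p_toy > 0.5)        # uses a hard 0.5 threshold
ll_toy = binary_log_loss(y_toy, p_toy) # uses the probabilities directly
auc_toy = auroc(y_toy, p_toy)          # threshold-free ranking measure
```

F1 depends on a single decision threshold, the log-loss depends on how well calibrated the probabilities are, and AUROC depends only on the ranking of the scores.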

In [1]:
import pandas as pd
import nltk as nltk
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import log_loss,f1_score,roc_auc_score

from IPython.display import clear_output
import time

#my code
from bayes import cond_prob, naive_bayes
%load_ext autoreload
%autoreload 2

In [2]:
#df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [42]:
df_com.columns

Index(['rev_id', 'comment', 'year', 'logged_in', 'ns', 'sample', 'split',
       'scores', 'mean_toxic', 'median_toxic', 'toxic', 'comment_clean'],
      dtype='object')

In [3]:
#When are the comments made?
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
df_com['year'].hist(bins=bin_arr)
plt.title('Total Comments')
plt.show()

<matplotlib.figure.Figure at 0x7efbb7b64be0>

In [4]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes in 'rev_id'.  Those set the boundaries for changes.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split to split the array into many sub-arrays.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr
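
The boundary-splitting trick in the cell above can be checked on a toy example. This sketch uses made-up `rev_id` and score values and `np.diff` instead of the pandas `diff`, so there is no leading empty block to pop. It assumes, as above, that the rows are sorted by `rev_id`.

```python
import numpy as np

# Toy stand-in for the annotation table: rows sorted by rev_id.
rev_id = np.array([10, 10, 10, 22, 22, 35])
score = np.array([-1, 0, -2, 1, 0, -1])

# A new group starts wherever rev_id differs from the previous row.
boundaries = np.flatnonzero(np.diff(rev_id) != 0) + 1

# One sub-array of scores per comment.
groups = np.split(score, boundaries)
medians = [float(np.median(g)) for g in groups]
```

A pandas `groupby('rev_id')` would be an alternative that does not rely on the sort order.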

In [5]:
def score_mean(score_list):
    """score_mean
    Compute mean of the toxicity scores for one comment.
    Input is a 2D array with one row per rating: column 0 is worker_id, column 1 is toxicity_score.
    The mean runs down the rows.  Could be updated to use a weighted sum over workers.
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute median of the toxicity scores for one comment.
    Input is a 2D array with one row per rating: column 0 is worker_id, column 1 is toxicity_score.
    The median runs down the rows.
    """
    s = np.median(score_list[:,1])
    return s


In [7]:
#Make a new column computing mean, median scores
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)

In [8]:
#So 373 duplicated comments.  Awesome.  
#dup_msk=df_com['comment'].duplicated(keep=False)
#I'm just going to drop these duplicates
df_com.drop_duplicates(subset='comment',inplace=True)

In [9]:
#Define toxic comments as those where the median is at or below -1 (or, more strictly, -2).
#-1 captures more comments, but with more variance in what is considered toxic/unhelpful.
df_com['toxic']=(df_com['median_toxic']<=-1)
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159463. Toxic comments: 15353. Toxic Fraction: 0.09627938769495117


In [17]:
#When are the comments made?  Has the toxicity changed over time?
#Note this is on the full dataset, with test/training/dev splits. 
plt.figure()
bin_arr=np.sort(df_com['year'].unique())
#first row: threshold of -1, toxic (left) and non-toxic (right)
plt.subplot(2,2,1)
msk1=df_com['median_toxic']<=-1
plt.ylabel('Toxicity=-1')
df_com['year'][msk1].hist(bins=bin_arr)
plt.title('Toxic')
plt.subplot(2,2,2)
plt.title('Non-Toxic')
df_com['year'][~msk1].hist(bins=bin_arr)
#second row
plt.subplot(2,2,3)
msk2=df_com['median_toxic']<=-2
df_com['year'][msk2].hist(bins=bin_arr)
plt.ylabel('Toxicity=-2')
plt.subplot(2,2,4)
df_com['year'][~msk2].hist(bins=bin_arr)
plt.xlabel('Year')
plt.show()

<matplotlib.figure.Figure at 0x7f10d074eeb8>

So the toxic/non-toxic balance looks roughly constant over time, with the counts dropping by roughly a factor of ten from all comments to toxic, and again from toxic to severely toxic.
Another question about the data is what topics were under discussion. Does this bias the output/findings?

In [10]:
#cleaning the data
#Can use pandas built in str functionality with regex to eliminate
#Can maybe also eliminate all punctuation?  Makes any 

#maybe also dates?
def clean_up(comments):
    """Strip placeholder tokens, wiki/HTML attribute debris and markup symbols from a Series of comments."""
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ',regex=False)
    com_clean=com_clean.str.replace('TAB_TOKEN',' ',regex=False)
    #Remove HTML trash, via non-greedy replacing anything between double backticks.
    #All attribute names are combined into a single regex.
    re_str=r"(?:style|class|width|align|cellpadding|cellspacing|rowspan|colspan)=``.*?``"
    com_clean=com_clean.str.replace(re_str,' ',regex=True)
    #numbers are kept for now
    #com_clean=com_clean.str.replace(r"[0-9]+",' ',regex=True)
    #remove symbols (brackets, braces, underscores, slashes, backticks, ...).
    #There must be a more comprehensive way of doing this?
    com_clean=com_clean.str.replace(r"[\[\]{}=_:|()\\/`]+",' ',regex=True)
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace(r'\s+',' ',regex=True)
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])

This does lose some information, such as rude symbols depicting breasts or genitalia. (There are about 6 of these crude ASCII art drawings, so this is probably not worth tracking down.)

In [44]:
#put the file as tsv 
df_com.to_csv('saved_dataframes/cleaned_comments.tsv.gzip',sep='\t',
columns=['rev_id','comment_clean','scores','mean_toxic','median_toxic','split','toxic'],compression='gzip')

In [49]:
#read in saved cleaned up dataframe.
df_com=pd.read_csv('saved_dataframes/cleaned_comments.tsv.gzip',sep='\t',compression='gzip')

In [11]:
#separate off training_split
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]
df_dev=df_com[df_com['split']=='dev']
df_test=df_com[df_com['split']=='test']

In [12]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
tfidf_vect=TfidfVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_tfidf=tfidf_vect.fit_transform(df_train['comment_clean'])
X_train_counts.shape

(95554, 133822)

In [13]:
#do the same transformations using existing vocab built up in training.
X_dev_tfidf=tfidf_vect.transform(df_dev['comment_clean'])
X_test_tfidf=tfidf_vect.transform(df_test['comment_clean'])

X_dev_counts=count_vect.transform(df_dev['comment_clean'])
X_test_counts=count_vect.transform(df_test['comment_clean'])

In [62]:
X_dev_tfidf

<32083x133822 sparse matrix of type '<class 'numpy.float64'>'
	with 778855 stored elements in Compressed Sparse Row format>

# Checking the vectorizer and finding common words

I wanted to check that the vectorizer was working by outputting common words, and identifying the "most toxic" words, based on their counts.
This was useful as a sanity check.

In [23]:
#get vocabulary dictionary
voc_dict=count_vect.vocabulary_
#make a dataframe, with entries as rows
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
#sort by row entry value, and then use that as the index for the counts.
voc_df1=voc_df.sort_values(by=0)

In [49]:
voc_df1.iloc[29143]

0    29143
Name: dick, dtype: int64

In [14]:
ptox,pw_tox,pw_cln = cond_prob( X_train_counts, df_train['toxic'].values, csmooth=0.01)

In [28]:
#make new dataframe with conditional probabilities for words being toxic, and raw probabilities of occurring in toxic/clean messages
X_cond= pw_tox*ptox/(pw_tox*ptox + pw_cln*(1-ptox))
word_mat=np.array([X_train_counts.sum(axis=0),X_cond,pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','pcond','p_clean','p_toxic'],index=voc_df1.index)
word_df.sort_values('pcond',ascending=False,inplace=True)
pcond_wds=word_df.head(n=20).index.values
print(pcond_wds)

NameError: name 'voc_df1' is not defined

So, the most toxic words (i.e. words that only appeared in toxic messages) are misspelled attempts at rudeness, with weird spaces, and combination words.  I think this reflects more on the pre-processing.  These words show up in a single toxic message, and are thus great at inferring that one message is toxic.  This doesn't say much about more general trends in the messages.
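
To illustrate why a word confined to one toxic message dominates the ranking, here is a hypothetical sketch of the kind of calculation `cond_prob` performs. The real function lives in the `bayes` module and is not shown here, so `cond_prob_sketch`, its additive smoothing, and the toy matrix are all assumptions.

```python
import numpy as np

def cond_prob_sketch(counts, is_toxic, csmooth=0.01):
    """Hypothetical stand-in for cond_prob.

    Returns the prior p(toxic) and additively smoothed p(w|toxic), p(w|clean).
    """
    counts = np.asarray(counts, dtype=float)
    is_toxic = np.asarray(is_toxic, dtype=bool)
    n_words = counts.shape[1]
    tox_counts = counts[is_toxic].sum(axis=0)
    cln_counts = counts[~is_toxic].sum(axis=0)
    pw_tox = (tox_counts + csmooth) / (tox_counts.sum() + csmooth * n_words)
    pw_cln = (cln_counts + csmooth) / (cln_counts.sum() + csmooth * n_words)
    ptox = is_toxic.mean()
    return ptox, pw_tox, pw_cln

# Toy term-count matrix: 4 messages x 3 words.
# Word 2 occurs 50 times, but only inside the single toxic message.
toy_counts = np.array([[2, 1, 0],
                       [1, 0, 50],
                       [3, 2, 0],
                       [4, 1, 0]])
toy_labels = np.array([False, True, False, False])

ptox, pw_tox, pw_cln = cond_prob_sketch(toy_counts, toy_labels)

# Posterior probability of "toxic" given one occurrence of each word.
p_tox_given_word = pw_tox * ptox / (pw_tox * ptox + pw_cln * (1 - ptox))
```

In this toy case the repeated word gets a posterior close to 1 even though it says nothing about any other message, which mirrors the behaviour seen in the real data.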

I am considering also implementing a spell-check, and adding a variable for the number of incorrect words or fraction of the message that is misspelled.  Another feature would be the fraction that is capitalized?
The accent stripping catches simple attempts to avoid the spam filter with accents, but does miss things where the words are spaced out, or have other characters inserted.
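
Two of the feature ideas above are cheap to prototype with the standard library. This is only a sketch: `capital_fraction` and `collapse_spaced_letters` are hypothetical helper names, and the spaced-letter rule will also merge innocent runs of single letters such as "I a m".

```python
import re

def capital_fraction(text):
    """Fraction of alphabetic characters that are upper case (0.0 if none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)

# Three or more single letters separated by single spaces.
_SPACED = re.compile(r'\b(?:[A-Za-z] ){2,}[A-Za-z]\b')

def collapse_spaced_letters(text):
    """Join spaced-out words, e.g. 'h e l l o' becomes 'hello'."""
    return _SPACED.sub(lambda m: m.group(0).replace(' ', ''), text)

frac = capital_fraction('STOP it')
joined = collapse_spaced_letters('you are s t u p i d ok')
```

The collapsed text could be fed to the vectorizer in place of the original, while the capital fraction would be an extra numeric column alongside the word counts.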

In [28]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of words between the count vectorizer,
    and a naive 'contains' search.  Returns the vectorizer count, the naive count,
    the comments matched by each method, and any discrepancies.
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    #find comments with words (boolean array, one entry per comment)
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive substring search (literal match, not a regex)
    naive_msk=comments.str.contains(word,case=False,regex=False,na=False).values
    naive_count=np.sum(naive_msk)
    #index the original Series with each mask; do not overwrite it in between
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [30]:
vc,cc,com,ncom,dcom=check_vect(X_dev_counts,df_dev['comment'],voc_df,pcond_wds[0])
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.

In [33]:
pcond_wds[0]

'fucksex'

In [31]:
#currently searching for "gay", a term that has clean connotations, but can be used in homophobic attacks.
#Another word with the same dichotomy of identity/hate is Jew. Or Muslim.  
print('Vect: {}, Naive: {}'.format(vc,cc))
print(com.head(),'\n\n')
print(ncom.head())

Vect: [[0]], Naive: 0
Series([], Name: comment, dtype: object) 


Series([], Name: comment, dtype: object)


In [36]:
#naughty_word=['fuck','fag','kill','bleach','bellend','wanker','towelhead']
#identity_hate=['nigger','trans','faggot','kike','jew','wetback','spic']
word_counts=X_train_counts.sum(axis=0)
for word in pcond_wds:
    try:
        ind=count_vect.vocabulary_[word]
        n_occur=word_counts[0,ind]
        n_tot=np.sum(X_train_counts[:,ind]>0)
        print(word,':\t {} occurences: \t {} messages'.format(n_occur,n_tot))
    except KeyError:
        print(word,'not found')

cuntbag :	 128 occurences: 	 3 messages
cuntliz :	 111 occurences: 	 1 messages
yourselfgo :	 309 occurences: 	 1 messages
marcolfuck :	 260 occurences: 	 1 messages
fack :	 232 occurences: 	 2 messages
veggietales :	 212 occurences: 	 1 messages
ancestryfuck :	 208 occurences: 	 1 messages
notrhbysouthbanof :	 208 occurences: 	 2 messages
shitfuck :	 182 occurences: 	 1 messages
yaaaa :	 128 occurences: 	 1 messages
haahhahahah :	 128 occurences: 	 1 messages
fucksex :	 624 occurences: 	 1 messages
buttsecks :	 498 occurences: 	 2 messages
bastered :	 449 occurences: 	 2 messages
cocksucker :	 425 occurences: 	 37 messages
fggt :	 398 occurences: 	 5 messages
mothjer :	 391 occurences: 	 4 messages
offfuck :	 360 occurences: 	 1 messages
niggas :	 340 occurences: 	 7 messages
sexsex :	 332 occurences: 	 1 messages


So, the conditional probability for a word being toxic is chiefly determined by whether it only occurs in toxic messages.  In this case, these "most toxic words" are misspellings or portmanteaus.  The high counts are offset by the words appearing in only a few messages.  This suggests these are just lengthy repetitions within a single rude message.

A spellcheck might catch these and correct the spelling?  That would potentially catch attempts to circumvent a filter through obvious misspelling.

I noticed that there are very few obvious racist slurs in the unanimous data set (though lots of sexism and general hate).
This raises a weird sociological question about how toxic racism is perceived to be, perhaps by American reviewers? (This is something that the original project is explicitly considering at https://conversationai.github.io/bias.html)

(Searching for the n-word found these.)
Some ratings seem way off, e.g. the scores for comments 1467 and 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments. And some raters thought that 2670 was neutral too!  What the hell?!
This suggests using the median toxicity score, to avoid the mean being contaminated by people with a really different sense of what counts as toxic.
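
A small made-up example shows how a few dissenting raters move the mean but not the median. The ratings below are invented for illustration and are not taken from the data set.

```python
import numpy as np

# Invented ratings for one abusive comment: seven raters call it toxic,
# three raters with a very different sense of the scale call it healthy.
ratings = np.array([-2, -1, -1, -2, -1, -1, -1, 2, 2, 1])

mean_score = ratings.mean()
median_score = np.median(ratings)

# With a "toxic if score <= -1" rule the two summaries disagree.
toxic_by_mean = mean_score <= -1
toxic_by_median = median_score <= -1
```

Here the mean is pulled up to -0.4 by the three outliers, while the median stays at -1.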

# Naive Bayes

I want to implement a Naive Bayes classifier as a baseline.  I've written my own version, which I will compare to
scikit-learn's version.  (They both return the same result now.)

This basically treats the comments in a bag-of-words sense, and drops any correlations between the words.  Perhaps including some of the more
common n-grams, e.g. "frigging crank", would help.

* Estimate $p(w|T)$ from counts in term-frequency matrix.
* Use Bayes' rule for a single word,
  $P(T|w) = \frac{p(T)p(w|T)}{p(T)p(w|T)+p(C)p(w|C)}$,
  and, treating the words as independent given the class, for a whole comment:

  \begin{equation}
    p(T|\text{words}) = \frac{P(T) \prod_{i}p(w_i|T)}{P(T) \prod_{i}p(w_i|T)+P(C) \prod_{i}p(w_i|C)},
  \end{equation}
  where $C$ denotes the clean (non-toxic) class.

* Use logarithms, and compare log-odds for toxic/non-toxic.
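
The steps above can be sketched in a few lines of NumPy. This is not the `naive_bayes` function from the `bayes` module (which is not shown); `naive_bayes_sketch` and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def naive_bayes_sketch(counts, pw_tox, pw_cln, ptox):
    """Classify rows of a term-count matrix by the sign of the log-odds.

    log p(T|words) - log p(C|words)
        = log P(T) - log P(C) + sum_i n_i [log p(w_i|T) - log p(w_i|C)]
    """
    counts = np.asarray(counts, dtype=float)
    log_t = np.log(ptox) + counts @ np.log(pw_tox)
    log_c = np.log(1 - ptox) + counts @ np.log(pw_cln)
    log_odds = log_t - log_c
    # Logistic function of the log-odds gives p(T|words).
    prob = 1.0 / (1.0 + np.exp(-log_odds))
    return log_odds > 0, prob, log_odds

# Toy vocabulary of three words with made-up conditional probabilities.
pw_tox = np.array([0.7, 0.2, 0.1])
pw_cln = np.array([0.1, 0.3, 0.6])
ptox = 0.1

toy_counts = np.array([[3, 0, 0],
                       [0, 1, 2],
                       [1, 1, 0]])
pred_toy, prob_toy, log_odds_toy = naive_bayes_sketch(toy_counts, pw_tox, pw_cln, ptox)
```

For long comments the log-odds can reach hundreds in magnitude, so `np.exp` may overflow and the probability saturates at 0 or 1; comparing the log-odds directly avoids that.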

In [15]:
actual=df_train['toxic'].values
msk=actual
Xtox = X_train_counts[msk,:]
df_tox=df_train[msk]
pred,prob,logT,logC,log_Tword,log_Cword=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    

  prob=1/(1+np.exp(log_Cscore-log_Tscore))


In [30]:
#Plot a histogram of the log probabilities.  
plt.figure()
plt.hist(np.maximum(-50,np.log(prob)),bins=100)
plt.show()


<matplotlib.figure.Figure at 0x7efbaf4abdd8>

In [77]:
#Plot a histogram of the log-odds (right term?).  
plt.figure()
plt.subplot(121)
bins=np.linspace(-1000,1000,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.subplot(122)
bins=np.linspace(-20,20,100)
plt.hist(logT-logC,bins=bins,log=True)
plt.show()

<matplotlib.figure.Figure at 0x7efb553aa898>

In [None]:
Maybe should also plot length of comments? To what extent are these mirroring a similar underlying shape?

In [16]:
def check_predictions(pred,actual,epsilon=1E-15):
    """check_predictions
    Compares predicted class (y_i) against actual class (z_i).
    Returns the confusion matrix (as fractions of all samples) and mean log-loss.
    
    Log-loss = -sum_i{ z_i log[y_i] + (1-z_i) log[1-y_i] }/M

    Input: pred - predicted values (0,1)
    actual  - true labels 
    epsilon - predictions are clipped to [epsilon, 1-epsilon] to avoid log(0)
    Returns: Confusion matrix with [[true positive, false positive],[false negative, true negative]]
    log-loss - average log-loss
    """
    actual=np.reshape(actual,(len(actual),1))
    pred=np.reshape(pred,(len(actual),1))    
    print(pred.shape,actual.shape)
    tp = np.mean((pred==True)&(actual==True))
    tn = np.mean((pred==False)&(actual==False))
    fp = np.mean((pred==True)&(actual==False))    
    fn = np.mean((pred==False)&(actual==True))            
    scores=np.array([[tp,fp],[fn,tn]])
    print("True Positive {}. False Positive {}".format(tp,fp))
    print("False Negative {}. True Negative {}".format(fn,tn))
    #clip explicitly, so the result does not depend on how sklearn treats exact 0s and 1s.
    pred_num=np.clip(pred.astype(float),epsilon,1-epsilon)
    logloss=log_loss(actual.ravel(),pred_num.ravel(),normalize=True)    
    #my (initial) wrong attempt omitted the (1-actual) term:
    #logloss2=-np.mean(np.multiply(actual,np.log(pred_num)))
    #corrected version, for comparison with sklearn:
    # logloss2=-np.mean(np.multiply(actual,np.log(pred_num))\
    #     +np.multiply(1-actual,np.log(1-pred_num)))
    # print(logloss2)
    auroc = roc_auc_score(actual.ravel(),pred.ravel())
    print("Log-loss is {}".format(logloss))
    print("AUROC is {}".format(auroc))    
    return scores,logloss
#the function returns (confusion matrix, log-loss), in that order
score_rates,logloss=check_predictions(pred,actual)


(95554, 1) (95554, 1)
True Positive 0.08529208615023966. False Positive 0.016461895891328463
False Negative 0.01139669715553509. True Negative 0.8868493208028968
Log-loss is 0.9622148788120853
AUROC is 0.931953076744998


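Interesting. The mean log-loss is surprisingly sensitive to the chosen zero-offset.  I think this reflects the fact that the naive Bayes method is returning a lot of incredibly small probabilities ($10^{-100}$).  Note also that `check_predictions` is given hard 0/1 predictions here, so every misclassified comment costs $-\log(\epsilon)\approx 34.54$: the reported 0.962 is just the error rate ($0.0165+0.0114=0.0279$) times 34.54.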

In [16]:
#Look at the false negatives 
# df_fn=df_train[(pred==False)]
# df_fn=df_fn[df_fn['toxic']==True]
# df_fn[['comment_clean','mean_toxic','median_toxic']]

The false negatives in the larger data set seem to be more rules-lawyering, whinging about administration, and sidestepping filters, e.g. f:)u:)c:)k:).
These are a bit harder for the classifier to find.

In [39]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.fit(X_train_counts,df_train['toxic'].values)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [40]:
pred_nb=nb.predict(X_train_counts)
nb_stats=check_predictions(pred_nb,actual)

(95554, 1) (95554, 1)
True Positive 0.08529208615023966. False Positive 0.016461895891328463
False Negative 0.01139669715553509. True Negative 0.8868493208028968
Log-loss is 0.9622148788120853
AUROC is 0.931953076744998


Well, I must have an error in my Naive Bayes code.  I'm getting much worse results, despite having the correct class probabilities and conditional probabilities.  (Found the bug: I used $\prod_i[p(w_i|T)p(T)]$ instead of
$p(T)\prod_i p(w_i|T)$.  I found the error after examining hidden attributes, but really by checking against and copying from the sklearn source code.)

## Naive Bayes False Positives and Negatives

Let's now look a bit at the misclassified results.

In [17]:
#fixing shapes to avoid broadcasting
actual=np.reshape(actual,(len(actual),1))
pred=np.reshape(pred,(len(actual),1))    

fp_msk = ((pred==True)&(actual==False))    
fn_msk = ((pred==False)&(actual==True))            

In [18]:
df_fn=df_train[fn_msk][['comment_clean','mean_toxic','median_toxic','toxic']]
df_fp=df_train[fp_msk][['comment_clean','mean_toxic','median_toxic','toxic']]

In [19]:
df_fp.head()

                                                                                            comment_clean  \
128                                If I pick enough holes in you, will you turn into swiss cheese? ^ ^ Ed   
534                     hey rich WUZZUP? My mom loves you... SO HAVE FUN HUH!!! THANK YOU AND GOOD NIGHT!   
642                         ****Wow, you're so clever. So smooth. Stop being an ass so we can compromise!   
1136                   I.Kant believe how sad everybody who writes things on this sight are- get a life!    
1767   Just a question... Talk about Lir, not to them How do we know that these ban timer reset edits ...   

      mean_toxic  median_toxic  toxic  
128         -0.4           0.0  False  
534          0.3           0.5  False  
642         -0.4          -0.5  False  
1136        -0.2           0.0  False  
1767         0.3           0.0  False  

In [451]:
ind=df_fn.index.values

In [20]:
df_fn.head()

                                                                                            comment_clean  \
803        Are you trying to wind me up or something? You are a vandal undoing useful proofreading work.    
924    Actually, it is you who resorted to pathetic insults first. Let me quote what you wrote remove ...   
1354   I am not user 168.209.97.34. On what basis are you acusing me of being that user? Please answer...   
1374                                              Make that four, fifth is underway. This is rediculous!    
1736   No! This is a GROUP EFFORT! Wikipedia is a collaborative COMMUNITY and there are no school essa...   

      mean_toxic  median_toxic  toxic  
803         -0.8          -1.0   True  
924         -0.6          -1.0   True  
1354        -0.6          -1.0   True  
1374        -0.6          -1.0   True  
1736        -0.3          -1.0   True  

So, the false negatives: much more spacing and extra characters being used to avoid the filter.

So at least some of the "false positives" arise because the people using the rating scale are wildly inconsistent.  These are "-1" on the toxicity scale, and so "non-toxic" under the rule where toxic comments have median toxicity less than -1.
Some are "neutral" but have lots of repetition.  I can't for the life of me imagine any of these comments adding anything to the discussion.

# Dimensionality Reduction

Let's use the truncated SVD for dimensionality reduction (also known as latent semantic analysis when applied to term-document matrices).
Apparently the TF-IDF matrix is superior to the straight term-frequency matrix for this purpose (it more closely matches the assumptions in the SVD about the noise).
Should maybe also symmetrize the transformation (as suggested in a paper comparing hyperparameters between word2vec and older SVD methods).
They suggest using $T=U \Lambda V = (U \Lambda^{1/2}) (\Lambda^{1/2} V)$ for the projection.  
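
The difference between the standard and the symmetrized projection can be seen on a small dense matrix. This sketch uses `np.linalg.svd` on random data as a stand-in for the TF-IDF matrix; the variable names are illustrative, and the note about `TruncatedSVD.transform` reflects my understanding that it projects onto the right singular vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))      # toy stand-in: 6 documents x 5 terms

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                       # number of components kept
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Standard projection: U Lambda, equal to X projected onto the top-k
# right singular vectors (as TruncatedSVD.transform is understood to do).
doc_embed_std = U_k * s_k

# Symmetrized version: split Lambda evenly between documents and terms.
doc_embed_sym = U_k * np.sqrt(s_k)
term_embed_sym = Vt_k.T * np.sqrt(s_k)

# Either factorization reproduces the same rank-k approximation of X.
X_k = doc_embed_sym @ term_embed_sym.T
```

The symmetrized document embeddings differ from the standard ones only by a per-component rescaling, which down-weights the leading components relative to the rest.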

In [21]:
from sklearn.decomposition import TruncatedSVD

In [22]:
#took a minute or two
TSVD=TruncatedSVD(n_components=100,n_iter=10)
TSVD.fit(X_train_tfidf)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=10,
       random_state=None, tol=0.0)

In [23]:
#actually transform the results 
X_train_trans=TSVD.transform(X_train_tfidf)

In [24]:
plt.plot(TSVD.explained_variance_)
plt.xlabel('Component index')
plt.ylabel('Explained variance')
plt.show()

<matplotlib.figure.Figure at 0x7efbaadb2208>

These transformed results are used later as inputs to a "deep" neural network.  

In [25]:
#actually transform the dev/test data.
X_dev_trans=TSVD.transform(X_dev_tfidf)

# Support Vector Machine

In CS229, Andrew Ng's assignment 2 suggests the SVM as a natural improvement over the Naive Bayes method.
Let's implement one of those.  I'm going to update it to do batch gradient descent with sparse matrices.
The version I wrote initially was trash, so I am attempting to vectorize the code using appropriate scipy.sparse matrix operations.

Or I could use an ensemble of SVMs based on subsets of the data. That leverages the existing (presumably smarter) scikit-learn code, in a way that could scale up.

Or use an approximate kernel via random Fourier features.

In [519]:
actual=df_train['toxic'].values

947

In [85]:
Nsub=1000
np.random.seed(454)
def get_subset(frac_perc,dat_mat,labels):
    """get_subset
    Returns random subset of the data and labels.
    Maintains same fraction of toxic/non-toxic data as the full dataset.
    """ 
    #make vector and sample indices for true/false.
    nvec=np.arange(len(labels))
    #get the indices for true/false
    Tvec=nvec[labels]
    Cvec=nvec[~labels]
    #grab a random shuffling of those indices.
    np.random.shuffle(Tvec)
    np.random.shuffle(Cvec)
    #grab some fraction of them.
    it = int(len(Tvec)*frac_perc)
    ic = int(len(Cvec)*frac_perc)
    ind_sub=np.append(Tvec[:it],Cvec[:ic])
    Xsub = dat_mat[ind_sub]
    label_sub = labels[ind_sub].reshape((len(ind_sub),1))
    return ind_sub,Xsub,label_sub

In [88]:
%pdb off
ind_sub,Xsub,label_sub=get_subset(0.01,X_train_counts,actual)

Automatic pdb calling has been turned OFF


In [793]:
#my code: super slow.
#TODO: Look into Cython.  Does it play nice with sparse?
#Just use scikit-learns SVM, and approximate kernels.
alpha0,alpha=svm_fit(Xsub,label_sub,Nbatch=100)

Iter 0 of 382
Iter 191 of 382


In [794]:
%pdb off
svm_pred=svm_predict(Xsub,Xsub,alpha,8)
check_predictions(svm_pred,label_sub)

Automatic pdb calling has been turned OFF


# SVM Ensemble

Since apparently the training time for an SVM goes as $O(n_{sample}^3)$, maybe it is better to train an ensemble of SVMs.
In that case the training time is $O(n_{sample}^3/n_{ensemble}^2)$ for the ensemble.  Evaluating the results then typically takes $O(n_{sample})$ for all of the ensemble together.  (This is something like making the crude assumption that the kernel matrix is block-diagonal, once appropriately sorted.)  If we repeat this for multiple such random splits we can extract different correlations.
We then take a majority vote.

A similar idea is available here:(https://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python), which suggests
using a BaggingClassifier to automate the process.  
Of course, Random Forests are another option, with a similar goal.    
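
A short worked version of the ensemble scaling estimate, under two assumptions that are mine rather than measured: each SVM costs exactly $c\,n^3$ to train on $n$ samples (real solvers usually fall somewhere between quadratic and cubic), and the $n_{ensemble}$ batches are equal in size:

\begin{equation}
    T_{ensemble} = n_{ensemble}\, c\left(\frac{n_{sample}}{n_{ensemble}}\right)^3 = \frac{c\,n_{sample}^3}{n_{ensemble}^2}.
\end{equation}

For example, with 20 batches the estimated training cost drops by a factor of $20^2=400$ relative to a single SVM trained on all of the data.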

In [80]:
from sklearn.svm import SVC
#rows are observations (comments), columns are features (words)
nobs,nfeature=X_train_counts.shape

In [87]:
#just use bagging classifier on the whole list of SVMs
from sklearn.ensemble import BaggingClassifier

#make the SVM model
svm=SVC(cache_size=200,gamma=0.1,C=0.5,class_weight='balanced')
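#The bagging classifier of those.  max_samples=0.05 gives each of the 20 SVMs
#a random 5% of the rows; with the default max_samples=1.0 and bootstrap=False
#every SVM would be trained on the full data set.
ensemble_svm=BaggingClassifier(svm,n_estimators=20,max_samples=0.05,
bootstrap=False,n_jobs=3)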

In [None]:
#Try to determine parameters gamma/C via cross-validation.
#Note that there is no need for explicit regularization?  Apparently in large dimensions, the parameters C/gamma (for penalty radius and width of basis function) do a decent job of regularizing, since l1, l2 regularization don't work.  

In [83]:
%pdb off

Automatic pdb calling has been turned OFF


In [90]:
frac_perc=0.1
t0=time.time()
#svm=SVC(cache_size=1000,verbose=True,gamma=0.1,C=0.5,class_weight='balanced')
indsub,Xsub,label_sub=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)
#use the ravel for reshaping?
ensemble_svm.fit(Xsub,label_sub.ravel())
svm_pred=ensemble_svm.predict(Xsub)
#test on a different subset of the training data
frac_perc2=0.01
indsub2,Xsub2,label_sub2=get_subset(frac_perc2,X_train_counts,df_train['toxic'].values)
svm_pred2=ensemble_svm.predict(Xsub2)
t1=time.time()
print('Time Elapsed:',t1-t0)
svm_stats=check_predictions(svm_pred,label_sub)
#the second set of predictions must be compared against its own labels
svm_stats2=check_predictions(svm_pred2,label_sub2)

KeyboardInterrupt: 

## Randomized Fourier Features

The Tensorflow documentation includes a great idea for extending kernel machines: use a sinusoidal mapping from the original space to another linear space.  The mapping depends on a Gaussian random variable, so when we take expectation values over the Gaussian variable, the result
of that expectation approximates the desired kernel.  Genius!
Ideas here: (https://www.tensorflow.org/tutorials/kernel_methods,
https://people.eecs.berkeley.edu/~brecht/papers/07.rah.rec.nips.pdf).
See also scikit-learn's kernel approximation methods, which implement the RBF kernel approximation described below. 

LinearSVMs work quickly, but their full kernel counterparts are slow to train, scaling as $O(n_{sample}^3)$.
Instead, consider features like 
\begin{equation}
    z_{k}(\mathbf{x})=\cos(\mathbf{\omega}_{k}\cdot\mathbf{x}+b_{k}),
\end{equation}
where $\mathbf{x}\in \mathbb{R}^{d}$, $\mathbf{\omega}_{k}\in \mathbb{R}^{d}$, $b_{k}\in\mathbb{R}$, and $\mathbf{\omega}_{k}$ is a random Gaussian vector drawn from
\begin{equation}
    P(\omega) = (2\pi\sigma^2)^{-d/2} \exp\left(-\frac{\mathbf{\omega}^2}{2\sigma^2}\right),
\end{equation}
and $b_{k}$ is a uniform random variable drawn from $[0,2\pi)$.  Note that $z_{k}$ is a scalar.  But if we consider making $D$ draws of the random variables, then we can construct a vector $\mathbf{z}(\mathbf{x})=\sqrt{\frac{2}{D}}[z_{1},z_{2},\ldots, z_{D}]$.

The inner products on these new feature vectors for different input data are given by 
\begin{equation}
    \mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})=\frac{2}{D}\sum_{k=1}^{D} \cos(\mathbf{\omega}_{k}\cdot\mathbf{x}+b_{k})\cos(\mathbf{\omega}_{k}\cdot\mathbf{y}+b_{k}).
\end{equation}
This is essentially a Monte-Carlo estimate (with $D$ samples) of an expectation over the distributions of $\mathbf{\omega}$ and $b$.  As $D\rightarrow \infty$, this converges to 
\begin{align}
    \mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})&\rightarrow \int d\mathbf{\omega}\int db\,P(\omega)p(b)\,
    2\cos(\mathbf{\omega}\cdot\mathbf{x}+b)\cos(\mathbf{\omega}\cdot\mathbf{y}+b)\\
&=\frac{1}{2\pi}\frac{1}{(2\pi \sigma^2)^{d/2}}\int d\mathbf{\omega}\int_0^{2\pi} db\,e^{-\mathbf{\omega}^2/(2\sigma^2)}\,
    2\cos(\mathbf{\omega}\cdot\mathbf{x}+b)\cos(\mathbf{\omega}\cdot\mathbf{y}+b) \\
&=\frac{1}{2\pi}\frac{1}{(2\pi \sigma^2)^{d/2}}\int d\mathbf{\omega}\int_0^{2\pi} db\,e^{-\mathbf{\omega}^2/(2\sigma^2)}
    \bigg(\cos[\mathbf{\omega}\cdot(\mathbf{x}+\mathbf{y})+2b]+\cos[\mathbf{\omega}\cdot(\mathbf{x}-\mathbf{y})]\bigg),
\end{align}
where we used the product-to-sum identity $2\cos A\cos B=\cos(A+B)+\cos(A-B)$.  The first cosine integrates to zero over $b$, and the remaining Gaussian integral is the Fourier transform of a Gaussian, with the result
\begin{align}
    \mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})&\approx e^{-\sigma^2(\mathbf{x}-\mathbf{y})^2/2}.
\end{align}
Note that the width of this RBF kernel is $1/\sigma$: drawing $\mathbf{\omega}$ with variance $\sigma^2$ gives a kernel $e^{-(\mathbf{x}-\mathbf{y})^2/(2\ell^2)}$ with $\ell=1/\sigma$.
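The same idea extends to other shift-invariant kernels: choose $P(\mathbf{\omega})$ to be the normalized Fourier transform of the desired kernel, provided that transform is a valid probability density that is easy to sample from.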
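
As a quick numerical check of the construction, here is a minimal NumPy sketch (the function names, the toy data and the choice of $D=20000$ are illustrative assumptions, not part of the notebook). It draws $\mathbf{\omega}_k$ with standard deviation $\sigma$ and $b_k$ uniformly on $[0,2\pi)$, builds $\mathbf{z}(\mathbf{x})$, and compares $\mathbf{z}(\mathbf{x})\cdot\mathbf{z}(\mathbf{y})$ against the Gaussian kernel $e^{-\sigma^2(\mathbf{x}-\mathbf{y})^2/2}$ that this sampling distribution converges to.

```python
import numpy as np

def random_fourier_features(X, n_components, sigma, rng):
    """Map each row x of X to z(x) = sqrt(2/D) * cos(omega_k . x + b_k)."""
    d = X.shape[1]
    # omega_k ~ N(0, sigma^2 I_d), one column per random feature
    omega = rng.normal(0.0, sigma, size=(d, n_components))
    # b_k ~ Uniform[0, 2*pi)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ omega + b)

def exact_kernel(X, sigma):
    """Kernel that the random features converge to: exp(-sigma^2 |x-y|^2 / 2)."""
    sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sigma**2 * sq_dist)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # toy data: 5 points in 3 dimensions
Z = random_fourier_features(X, n_components=20000, sigma=0.5, rng=rng)

approx = Z @ Z.T              # Monte-Carlo estimate of the kernel matrix
exact = exact_kernel(X, sigma=0.5)
max_err = np.max(np.abs(approx - exact))
print("largest absolute error:", max_err)
```

The error shrinks like $1/\sqrt{D}$, so a linear model trained on `Z` behaves like an approximate kernel machine at linear-model cost.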

One thing noted in the docs is that this works well for smooth data, but can require a lot of components if there is a significant random component, such as trying to detect fractal structures like forests.  


# Deep Network

Another idea is to build a deep neural network on the term-frequency matrix, effectively running with extensions to the Naive Bayes model.
This will use the reduced term-frequency matrix after the Truncated SVD.  
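
For reference, the reduction step can be sketched in plain NumPy. This is only an illustration on a made-up count matrix, assuming the network input (`X_train_trans` below) was produced by projecting the term-frequency matrix onto its top $k$ right singular vectors; the helper name `truncated_svd_transform` is hypothetical.

```python
import numpy as np

def truncated_svd_transform(X, k):
    """Project rows of X onto the top-k right singular vectors."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:k]                # shape (k, n_terms), orthonormal rows
    return X @ components.T, components

# Toy term-count matrix: 4 documents, 4 terms (hypothetical data)
counts = np.array([[2, 0, 1, 0],
                   [0, 3, 0, 1],
                   [1, 0, 2, 0],
                   [0, 1, 0, 2]], dtype=float)

X_reduced, components = truncated_svd_transform(counts, k=2)
print(X_reduced.shape)
```

For a large sparse term-frequency matrix one would use a sparse, iterative solver rather than a dense SVD; the dense version here is only meant to show what the projection does.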

In [26]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected, l2_regularizer
from tensorflow.contrib.rnn import BasicRNNCell,LSTMCell

In [27]:
from deep_network import deep_dropout_NN

In [56]:
actual=df_train['toxic'].astype(int).values
save_name='tf_models/deep_relu_drop'
dNN=deep_dropout_NN(X_train_trans.shape)
dNN.run_graph(X_train_trans,actual,save_name)

<matplotlib.figure.Figure at 0x7efb37cb24a8>

iter #5000. Current log-loss:0.06816151738166809


Type is unsupported, or the types of the items don't match field type in CollectionDef.
'function' object has no attribute 'name'


In [51]:
#model_name='tf_models/deep_relu_drop-{}'.format(dNN.n_iter)
model_name='tf_models/deep_relu_drop-{}'.format(5000)
dnn_pred=dNN.predict_all(model_name,X_train_trans)

INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-5000


In [75]:
dnn_conf=check_predictions(dnn_pred,actual)

(95554, 1) (95554, 1)
True Positive 0.07975594951545723. False Positive 0.01246415639324361
False Negative 0.016932833790317518. True Negative 0.8908470603009816
Log-loss is 1.0153460369408245
AUROC is 0.9055372623991556


In [76]:
nb_conf=check_predictions(pred_nb,actual)

(95554, 1) (95554, 1)
True Positive 0.08529208615023966. False Positive 0.016461895891328463
False Negative 0.01139669715553509. True Negative 0.8868493208028968
Log-loss is 0.9622148788120853
AUROC is 0.931953076744998
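
A note on reading these numbers: `check_predictions` is not shown in this section, but the reported log-loss and AUROC are consistent with both being computed on hard 0/1 predictions rather than on probabilities. Under that assumption, each wrong prediction costs about $-\ln(10^{-15})\approx 34.5$ (probabilities clipped away from 0 and 1), so the log-loss is roughly 34.5 times the error rate, and the AUROC reduces to the average of the true-positive rate and the true-negative rate. The sketch below recomputes both from the confusion fractions printed for the deep network on the training set; the function name and the clipping value are assumptions, not taken from the notebook.

```python
import math

def metrics_from_hard_predictions(tp, fp, fn, tn, eps=1e-15):
    """Log-loss and AUROC implied by 0/1 predictions, given confusion fractions."""
    # Each misclassified sample contributes about -ln(eps); correct ones about 0.
    log_loss = -(fp + fn) * math.log(eps)
    tpr = tp / (tp + fn)   # true-positive rate
    tnr = tn / (tn + fp)   # true-negative rate
    # A hard classifier gives a single ROC point; the area under it is (TPR + TNR) / 2.
    auroc = 0.5 * (tpr + tnr)
    return log_loss, auroc

# Confusion fractions printed above for the deep network (training set)
ll, auc = metrics_from_hard_predictions(
    tp=0.07975594951545723, fp=0.01246415639324361,
    fn=0.016932833790317518, tn=0.8908470603009816)
print(ll, auc)
```

Both values agree with the printed output to about three decimal places. If this reading is right, the log-loss here is just a rescaled error rate, and passing predicted probabilities instead of rounded labels would give more informative log-loss and AUROC values.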


## Predictions from Deep Neural Network

Let's now run some predictions on the full training and development sets.  

In [95]:
model_name='tf_models/deep_relu_drop-{}'.format(n_iter)

nn_pred_train = network_predict(model_name,X_train_trans)

print('3 layer ReLU network')
check_predictions(nn_pred_train,actual)
print('Naive Bayes')
check_predictions(pred_nb,actual)

NameError: name 'outputs' is not defined

INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-10000


In [69]:
%pdb on 

Automatic pdb calling has been turned ON


In [73]:
# Evaluate on the training set first, then on the dev set
model_name='tf_models/deep_relu_drop-{}'.format(dNN.n_iter)
nn_pred_train = dNN.predict_all(model_name,X_train_trans)
print('4 layer ReLU network: on Train set')
actual_train=df_train['toxic'].values
nn_stats=check_predictions(nn_pred_train,actual_train)

nn_pred_dev = dNN.predict_all(model_name,X_dev_trans)
print('4 layer ReLU network: on Dev set')
actual_dev=df_dev['toxic'].values
nn_stats=check_predictions(nn_pred_dev,actual_dev)

4 layer ReLU network: on Dev set
(32083, 1) (32083, 1)
True Positive 0.05794345915282237. False Positive 0.023096343858118006
False Negative 0.037652339245083065. True Negative 0.8813078577439766
Log-loss is 2.0982036497639474
AUROC is 0.7902960670474108


INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-5000


4 layer ReLU network: on Dev set
(95554, 1) (95554, 1)
True Positive 0.08278041735563137. False Positive 0.007043137911547397
False Negative 0.013908365950143374. True Negative 0.8962680787826779
Log-loss is 0.7236449386910208
AUROC is 0.9241781204032228


INFO:tensorflow:Restoring parameters from tf_models/deep_relu_drop-5000


In [74]:
print('Naive Bayes on Dev set')
pred_dev_nb=nb.predict(X_dev_counts)
nb_stats=check_predictions(pred_dev_nb,actual_dev)

Naive Bayes on Dev set
(32083, 1) (32083, 1)
True Positive 0.06299286226350403. False Positive 0.023501542873172708
False Negative 0.0326029361344014. True Negative 0.8809026587289218
Log-loss is 1.9377988469688499
AUROC is 0.8164822255177966


Evidently the (3-layer ReLU-tanh) network is over-fitting to the training set, which is not surprising, since there is no regularization here.
In the runs shown above it beats Naive Bayes on training log-loss (0.72 vs 0.96), but on the development set it is worse on both log-loss (2.10 vs 1.94) and AUROC (0.790 vs 0.816).

Let's put in some dropout.  Adding dropout after each layer, with a dropout probability of 0.1, improved performance.
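
To make the dropout step concrete, here is a minimal NumPy sketch of "inverted" dropout applied to one layer's activations. It is an illustration only and not the code in `deep_network`; the function name `dropout` and the toy activations are assumptions.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero a random fraction of units and rescale the rest."""
    if not training or drop_prob == 0.0:
        # At prediction time the layer is left untouched.
        return activations
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones((1000, 50))                      # toy hidden-layer activations
h_train = dropout(h, drop_prob=0.1, rng=rng)
h_eval = dropout(h, drop_prob=0.1, rng=rng, training=False)

print("fraction dropped:", np.mean(h_train == 0.0))
print("mean activation while training:", h_train.mean())
```

With a dropout probability of 0.1 about one unit in ten is zeroed on each training step, which discourages the network from relying on any single feature of the reduced term-frequency matrix.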

In [294]:
#dev scores
[f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)]

[0.63848495096381463, 0.69363963655066008]

In [298]:
print([f1_score(actual,nn_pred_train),f1_score(actual,pred_nb)])
print([f1_score(actual_dev,nn_pred_dev),f1_score(actual_dev,pred_dev_nb)])

[0.94606310013717421, 0.85717301805130364]
[0.63848495096381463, 0.69363963655066008]


In [211]:
np.array(np.round([1.0, 0.0, 0.1, 0.9])).astype(bool)

array([ True, False, False,  True], dtype=bool)

In [179]:
nn_pred.shape

(956, 1)

I am finding that beyond one or two layers, the network just seems to output zeros.  Maybe the learning rate was too high?

In [102]:
?sklearn

Object `sklearn` not found.


In [124]:
#checks output for a single training batch.
plt.figure()
plt.hist(nn_pred[y_batch[:,0],0],bins=20)
plt.hist(nn_pred[~y_batch[:,0],0],bins=20)
plt.show()

<matplotlib.figure.Figure at 0x7efb754de748>

In [205]:
plt.figure()
plt.hist(nn_pred_total,bins=100)
plt.show()

<matplotlib.figure.Figure at 0x7efb75311e80>

# Recurrent Neural Network

So let's try the current flavour-of-the-month approach: a recurrent neural network.
Based on talking to Joseph and Fahim at the group, they used a two-layer neural network based on just the 2000 most common words, using ReLU activation. (I think they said their approach was inspired by someone at Kaggle.)
Let's try something similar, initially with a single leaky ReLU layer, but after using a Truncated SVD.