# Toxicity in Wikipedia Comments

This is an independent analysis, running in parallel to other work on the Wikipedia toxicity data.
The data has not been cleaned yet, and multiple toxicity categories have not been introduced yet.  However, it is presented as-is, free from any bias of my own, for people to play with.

Beware: lots of swearing, racism, homophobia, and misogyny are contained within.  Mental hazmat suits are recommended.

Looking at the comments data, we'll need to clean the data quite a bit (lots of newlines, weird characters).

My rough plan is to build up a lexicon, tokenize that data, and try to build a Naive Bayes model.  (Maybe later a Recurrent Neural network model?)

Other Analysis possibilities:
* Support Vector Machine
    - the other big architecture, less popular now?
* Recurrent Neural Network
    - Build up word embeddings (word2vec), or just use the pretrained ones.
* Naive Bayes
    - can find the most important words in toxic comments (as is done for spam)
    - simple, easy to understand baseline.
* Latent Factor Analysis 
    - maybe useful prelude or alternative for building up embeddings.
    
Cleaning:
* Clean data: remove newlines (search/replace: NEWLINE_TOKEN with ' ')
* Tokenize (convert words to indices)
* Stemming words
* Balancing data set
* Match up comments, and review scores
* Search for gibberish words (make a new "feature" for badly spelled comments)
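
The cleaning and tokenizing steps above can be sketched on a single string with the standard `re` module. This is a minimal illustration rather than the notebook's pipeline: `clean_comment`, `tokenize`, and the regexes are assumptions made for demonstration; only the `NEWLINE_TOKEN` and `TAB_TOKEN` placeholders come from the raw data.

```python
import re

def clean_comment(text):
    """Hypothetical helper: normalise one raw comment string."""
    # Replace the dataset's placeholder tokens with spaces.
    text = text.replace("NEWLINE_TOKEN", " ").replace("TAB_TOKEN", " ")
    # Drop digits and a few markup symbols left over from wiki tables.
    text = re.sub(r"[0-9]+", " ", text)
    text = re.sub(r"[\[\]{}=_:|()\\/`]+", " ", text)
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text, vocab):
    """Map lower-cased words to integer indices, growing vocab as needed."""
    return [vocab.setdefault(w, len(vocab)) for w in text.lower().split()]

vocab = {}
cleaned = clean_comment("Hello NEWLINE_TOKEN [[world]] 123 hello")
indices = tokenize(cleaned, vocab)
print(cleaned, indices)
```

Stemming, class balancing, and the gibberish feature are left out of this sketch.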

In [1]:
import pandas as pd
import nltk
import numpy as np
import matplotlib.pyplot as plt
import scipy.sparse as sparse

from sklearn.feature_extraction.text import CountVectorizer

In [2]:
#df_com = pd.read_csv('data/toxicity_annotated_comments_unanimous.tsv',sep='\t')
#df_rate = pd.read_csv('data/toxicity_annotations_unanimous.tsv',sep='\t')
df_com = pd.read_csv('data/toxicity_annotated_comments.tsv',sep='\t')
df_rate = pd.read_csv('data/toxicity_annotations.tsv',sep='\t')

#make rev_id an integer
df_com['rev_id']=df_com['rev_id'].astype(int)
df_rate['rev_id']=df_rate['rev_id'].astype(int)

#reindex 
#df_com.index=df_com['rev_id']
#df_rate.index=df_rate['rev_id']
print(df_com.shape, df_rate.shape)

(159686, 7) (1598289, 4)


In [3]:
#make a new column in df_com with array of worker_ids, and toxicity
df_com['scores']=None

#since 'rev_id' is sorted, can take first difference, and find where
#there are changes.  Those set the boundaries for changes.
#Need to also append final index.
change_indices=df_rate.index[df_rate['rev_id'].diff()!=0].values

#use numpy split instead.
arr=df_rate[['worker_id','toxicity_score']].values
split_arr=np.split(arr,change_indices)
#drop first index as empty
split_arr.pop(0)
df_com['scores']=split_arr
#change_indices=np.append(change_indices,len(df_rate))
# for i in range(len(change_indices)-1):
#     ind0 = change_indices[i]
#     ind1 = change_indices[i+1]
#     d0=df_rate.iloc[ind0:ind1]
#     scores=d0[['worker_id','toxicity_score']]
#     #pass it a list so it can be set as an entry.
#     #accessing later will require score[0] idiocy to get at list.
#     df_com.loc[i,'scores']=[scores.values]
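
The diff-and-split trick in this cell can be seen on a toy array. This is a small sketch with made-up ids and scores, not the real annotations. Using `np.diff` on the raw array (rather than the pandas `.diff()`, whose first entry is NaN) means there is no empty leading piece to pop.

```python
import numpy as np

# Toy annotation table: rev_id is sorted, one row per (worker_id, score) rating.
rev_id = np.array([10, 10, 10, 22, 22, 35])
ratings = np.array([[1, -1], [2, 0], [3, -2], [1, 1], [4, 0], [2, -1]])

# Positions where rev_id changes mark the start of each new comment's block.
starts = np.flatnonzero(np.diff(rev_id) != 0) + 1
groups = np.split(ratings, starts)

for rid, g in zip(np.unique(rev_id), groups):
    print(rid, g[:, 1])
```

As in the cell above, this relies on the ratings being sorted by `rev_id` and on the comments appearing in the same order.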

In [4]:
def score_mean(score_list):
    """score_mean
    Compute the mean toxicity score for one comment.
    score_list is an (n_raters, 2) array with columns (worker_id, toxicity_score).
    Computes the mean running down the rows.  Could be updated to use a weighted sum over raters.
    """
    s = np.mean(score_list[:,1])
    return s

def score_median(score_list):
    """score_median
    Compute the median toxicity score for one comment.
    score_list is an (n_raters, 2) array with columns (worker_id, toxicity_score).
    Computes the median running down the rows.
    """
    s = np.median(score_list[:,1])
    return s


In [214]:
df_com['mean_toxic']=df_com['scores'].apply(score_mean)
df_com['median_toxic']=df_com['scores'].apply(score_median)

In [520]:
df_com['toxic']= (df_com['median_toxic']<=-1)
Ntoxic=df_com['toxic'].sum()
Ntot=len(df_com)
print("Total comments: {}. Toxic comments: {}. Toxic Fraction: {}".format(Ntot,Ntoxic,Ntoxic/Ntot))

Total comments: 159686. Toxic comments: 15362. Toxic Fraction: 0.09620129504151897


In [None]:
#can combine the dataframes together by extracting all reviewer ids, and scores for each 

In [837]:
#cleaning the data
#Can use pandas built-in str functionality with regex to eliminate markup

#maybe also dates?
def clean_up(comments):
    #literal placeholder tokens from the raw data
    com_clean=comments.str.replace('NEWLINE_TOKEN',' ',regex=False)
    com_clean=com_clean.str.replace('TAB_TOKEN',' ',regex=False)
    #Remove HTML trash, via non-greedy replacing anything between double backticks.
    #regex=True is passed explicitly, since newer pandas versions treat the
    #pattern as a literal string by default.
    for attr in ['style','class','width','align','cellpadding','cellspacing','rowspan','colspan']:
        com_clean=com_clean.str.replace(attr+r"=``.*?``",' ',regex=True)
    #remove numbers
    com_clean=com_clean.str.replace(r"[0-9]+",' ',regex=True)
    #remove underscores
    com_clean=com_clean.str.replace("_",' ',regex=False)
    #remove symbols.    There must be a more comprehensive way of doing this?
    com_clean=com_clean.str.replace(r"[\[\]{}=_:|()\\/`]+",' ',regex=True)
    #remove multiple spaces, replace with a single space
    com_clean=com_clean.str.replace(r'\s+',' ',regex=True)
    return com_clean
df_com['comment_clean']=clean_up(df_com['comment'])

In [838]:
#separate off training_split
train_msk=df_com['split']=='train'
df_train=df_com[train_msk]


In [839]:
#borrowing from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
count_vect=CountVectorizer(stop_words='english',lowercase=True,strip_accents='unicode')
X_train_counts=count_vect.fit_transform(df_train['comment_clean'])
X_train_counts.shape

(95692, 125568)

# Checking the vectorizer and finding common words

In [283]:
#get vocabulary dictionary
voc_dict=count_vect.vocabulary_
#make a dataframe, with entries as rows
voc_df=pd.DataFrame.from_dict(voc_dict,orient='index')
#sort by row entry value, and then use that as the index for the counts.
voc_df1=voc_df.sort_values(by=0)

In [320]:
voc_df1.iloc[29143]

0    29143
Name: dick, dtype: int64

In [841]:
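#pw_tox, pw_cln and ptox come from cond_prob (defined further down, in the Naive Bayes section)
#probability that a comment is toxic given a single word, via Bayes' rule
X_cond= pw_tox*ptox/(pw_tox*ptox + pw_cln*(1-ptox))
word_mat=np.array([X_train_counts.sum(axis=0),X_cond,pw_cln,pw_tox]).squeeze()
word_df=pd.DataFrame(word_mat.T,columns=['count','pcond','p_clean','p_toxic'],index=voc_df1.index)
#sort_values with inplace=True returns None, so sort without printing the return value
word_df.sort_values('pcond',ascending=False,inplace=True)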


In [843]:
xtot=X_train_counts.sum(axis=0).squeeze()
#compare vectorized vs. naive counts to check mappings
def check_vect(count_mat,comments,vocab,word):
    """check_vect(count_mat,comments,vocab,word)
    Checks the counts/occurrence of words between the count vectorizer,
    and a naive 'contains' search.  Returns the vectorizer count (total occurrences),
    the naive count (number of comments containing the substring),
    the comments matched by each method, and any discrepancies.
    """
    ind=vocab.loc[word].values
    xtot=count_mat.sum(axis=0)
    vect_count=(xtot[0,ind])
    #find comments with words
    msk=(count_mat[:,ind]>0).toarray().squeeze()
    #find comments via naive search
    naive_msk=comments.str.contains('{}'.format(word),case=False)
    naive_count=np.sum(naive_msk)
    #keep the full series intact, so that every mask indexes the same rows
    vect_comments=comments[msk]
    naive_comments=comments[naive_msk]
    diff_comments=comments[msk!=naive_msk]
    return vect_count,naive_count,vect_comments,naive_comments,diff_comments

In [844]:
vc,cc,com,ncom,dcom=check_vect(X_train_counts,df_train['comment_clean'],voc_df,'gay')
#searching for 'fuck' gives a salutary lesson in why accent stripping is worthwhile, and why a simple word filter will probably be circumvented.
#does not account for leetspeak or rather: 13375|o3@|< (but who uses that these days?)

In [845]:
#the output below is for "gay".  Also worth searching for "jew", a term that is usually clean, but can be used as an anti-semitic insult.
print('Vect: {}, Naive: {}'.format(vc,cc))
print([com.head(),ncom.head()])

Vect: [[1866]], Naive: 638
[588      Pro-Gay Bias? To be honest, I am not sure I entirely follow. What is exactly is meant by holdin...
790      There is one other thing that just occured to me an encyclopedia deals in facts, not in specula...
871      Dante, I agree that it is the use of the word that constructs a POV. Of course, in a technicall...
1443    Stop editing it. I'm from Santa Clarita, I went to Saugus High School so I know that they are, i...
1467     WHY ARE YOU SUCH A GAY NIGGER?!?! GOD DAMNDD... YOU ARE SUCH A GAY NIGGER. FUCK FUCK SHTAY AWAY...
Name: comment_clean, dtype: object, 588      Pro-Gay Bias? To be honest, I am not sure I entirely follow. What is exactly is meant by holdin...
790      There is one other thing that just occured to me an encyclopedia deals in facts, not in specula...
871      Dante, I agree that it is the use of the word that constructs a POV. Of course, in a technicall...
1443    Stop editing it. I'm from Santa Clarita, I went to Saugus High S

In [853]:
naughty_word=['sex','fuck','shit','cunt','piss','cocksucker','dick','ass','cuck']#'nigger','trans','faggot','kike','jew','beta','cuck']
word_counts=X_train_counts.sum(axis=0)
for word in naughty_word:
    try:
        ind=count_vect.vocabulary_[word]
        print(word,'count: {}'.format(word_counts[0,ind]))
    except KeyError:
        print(word,'not found')

sex count: 1190
fuck count: 6179
shit count: 1689
cunt count: 619
piss count: 151
cocksucker count: 425
dick count: 1022
ass count: 1275
cuck not found


In [None]:
So, this is a pretty old dataset?  The alt-right's favourite insults are nowhere to be seen.

I noticed that there are very few obvious racist slurs in the unanimous data set (but lots of sexism and general hate).
This raises a weird sociological question about how toxic racism is perceived to be, perhaps by American reviewers (this is something that the original project is explicitly considering at https://conversationai.github.io/bias.html).

(Searching for the n-word found these.)
Some ratings seem way off, e.g. the scores for comments 1467 and 1657 include some -1s.
Someone even thought 1918 was neutral!
Wait, 2669 and 2670 are now identical comments.  And some raters thought that 2670 was neutral too!  What the hell?!
This suggests using the median toxicity score, to avoid the mean being contaminated by people with a really different sense of what counts as toxic.
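
To see why the median is the more robust choice, here is a tiny sketch with hypothetical ratings for a single comment (the numbers are invented for illustration, not taken from the data): most raters give -2, but two mark it neutral.

```python
import statistics

# Hypothetical ratings for one comment: eight raters say -2, two say 0 (neutral).
scores = [-2, -2, -2, -2, -2, -2, -2, -2, 0, 0]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
print(mean_score, median_score)
```

The two outlying raters pull the mean towards zero, while the median stays with the majority.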

# Naive Bayes

I want to implement a Naive Bayes classifier as a baseline.  I've written my own version, which I compare to
scikit-learn's version.  (They both return the same result now.)

This treats the comments as a bag of words, and drops any correlations between the words.  Including some of the more
common n-grams, e.g. "frigging crank", might help.
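
As a rough illustration of what adding n-grams would mean, the sketch below counts bigrams in a toy token list. The `ngrams` helper is hypothetical and written only for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "you are a frigging crank and a frigging crank".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts[("frigging", "crank")])
```

In scikit-learn, `CountVectorizer` accepts an `ngram_range` argument (e.g. `ngram_range=(1, 2)`) that adds such features directly, at the cost of a much larger vocabulary.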

In [390]:
def cond_prob(X_counts,toxic,csmooth=1):
    """cond_prob
    X_counts - sparse matrix of counts of each word in a given message (nrows x nwords)
    toxic - whether each message was toxic or not, with 0,1
    csmooth - Laplace/Lidstone smoothing constant
    Returns P(toxic), P(word|toxic), P(word|clean).
    """
    nrows,nwords=X_counts.shape
    ptoxic = np.sum(toxic)/nrows
    
    toxic_mat=X_counts[toxic==1,:]
    clean_mat=X_counts[toxic==0,:]
    #sum across messages
    nword_toxic=np.sum(toxic_mat,axis=0)
    nword_clean=np.sum(clean_mat,axis=0)    

    #estimate probability of word given toxicity by number of times
    #that word occurs in toxic documents, divided by the total number of words
    #in toxic documents
    #Laplace/Lidstone smooth version
    pword_toxic= (nword_toxic+csmooth) \
                / (np.sum(toxic_mat)+nwords*csmooth)

    pword_clean= (nword_clean+csmooth) \
                /(np.sum(clean_mat)+nwords*csmooth)
    return ptoxic,pword_toxic,pword_clean    

ptox,pw_tox,pw_cln = cond_prob( X_train_counts, df_train['toxic'].values, csmooth=0.01)

In [856]:
def naive_bayes(mat,pword_tox,pword_cln,ptox):
    """Compute probability that a message 
    is toxic via naive_bayes estimate.
    """
    #I screwed up using prod_i[p(w_i|T)p(T)]
    #instead of P(T)prod_i[p(w_i|T)].  Ugh.
    #log probability for toxic/clean comments
    log_Tword = np.log(pword_tox)
    log_Cword = np.log(pword_cln)
    ## now accumulate probabilities by multiplying number of counts
    #per comment, with the weights per word
    #also add on log-normalization.
    msk=mat>0
    log_Tscore = mat.dot(log_Tword.T)+np.log(ptox)
    log_Cscore = mat.dot(log_Cword.T)+np.log(1-ptox)
    #predict based on which has larger probability (or log-likelihood)
    pred=log_Tscore>log_Cscore
    #also output probabilities
    #Tscore=np.exp(log_Tscore)
    #Cscore=np.exp(log_Cscore)
    #prob=Tscore/(Tscore+Cscore)
    prob=1/(1+np.exp(log_Cscore-log_Tscore))
    #score=np.exp(log_score)
    return pred,prob,log_Tscore,log_Cscore,log_Tword,log_Cword

In [857]:
actual=df_train['toxic'].values
msk=actual
Xtox = X_train_counts[msk,:]
df_tox=df_train[msk]
pred,prob,logT,logC,log_Tword,log_Cword=naive_bayes(X_train_counts,pw_tox,pw_cln,ptox)    



In [872]:
plt.figure()
plt.hist(np.maximum(-200,np.log(prob)),bins=100)
plt.show()

[Figure: histogram of log(prob), clipped below at -200, 100 bins]

  


In [867]:
np.minimum(-1,np.array([3,-2]))

array([-1, -2])

In [876]:
def check_predictions(pred,actual):
    actual=np.reshape(actual,(len(actual),1))
    pred=np.reshape(pred,(len(actual),1))    
    print(pred.shape,actual.shape)
    tp = np.mean((pred==True)&(actual==True))
    tn = np.mean((pred==False)&(actual==False))
    fp = np.mean((pred==True)&(actual==False))    
    fn = np.mean((pred==False)&(actual==True))            
    scores=[tp,tn,fp,fn]
    print("True Positive {}. False Positive {}".format(tp,fp))
    print("False Negative {}. True Negative {}".format(fn,tn))
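    pred_num=pred.astype(float)
    #give zero a small correction.
    pred_num[pred==False]=1E-16
    pred_num[pred==True]=1-1E-16    
    #NOTE: this only includes the positive-class term, actual*log(pred); the full
    #binary log-loss would also include (1-actual)*log(1-pred).
    logloss=-np.mean(np.multiply(actual,np.log(pred_num)))
    #logloss=0
    print("Log-loss is {}".format(logloss))
    return scores,logloss
#return order is (scores, logloss)
score_rates,logloss=check_predictions(pred,actual)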


(95692, 1) (95692, 1)
True Positive 0.01795343393387117. False Positive 0.000209003887472307
False Negative 0.07865861305020273. True Negative 0.9031789491284538
Log-loss is 2.8978903975197396
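
The log-loss printed above is computed from hard 0/1 predictions and only the `actual*log(pred)` term. For comparison, here is a minimal sketch of the usual binary log-loss evaluated on predicted probabilities (such as the `prob` array returned by `naive_bayes`). The name `binary_log_loss` and the clipping constant are assumptions for this example.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y = np.asarray(y_true, dtype=float).ravel()
    # Clip so that log(0) never occurs.
    p = np.clip(np.asarray(p_pred, dtype=float).ravel(), eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 0])
p = np.array([0.9, 0.1, 0.6, 0.4])
print(binary_log_loss(y, p))
```

Unlike the hard-prediction version, this penalises confident mistakes in both classes and rewards well-calibrated probabilities.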


In [16]:
#Look at the false negatives 
# df_fn=df_train[(pred==False)]
# df_fn=df_fn[df_fn['toxic']==True]
# df_fn[['comment_clean','mean_toxic','median_toxic']]

After fixing this, I get a 50% false positive rate and a 10% false negative rate.  Note that this is searching for the most toxic comments.

The false negatives in that larger set seem to be more rules-lawyering, whinging about administration, and sidestepping filters, e.g. f:)u:)c:)k:).
This is a bit harder for the classifier to find.
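
One possible way to catch the separator trick is to collapse runs of single letters before vectorizing. The sketch below is a hypothetical normaliser written for illustration (it is not part of this notebook's pipeline). It also merges legitimate sequences of single letters such as "x y z", so it would need care in practice.

```python
import re

# A run of 3+ single letters, each separated by 1-3 non-alphanumeric characters,
# e.g. "f:)u:)c:)k" or "h.e.l.l.o".
_SPACED = re.compile(r"\b(?:[A-Za-z][^A-Za-z0-9]{1,3}){2,}[A-Za-z]\b")

def collapse_spaced_letters(text):
    """Hypothetical normaliser: join single letters split up by punctuation or spaces."""
    def join(match):
        # Keep only the letters from the matched run.
        return re.sub(r"[^A-Za-z]", "", match.group(0))
    return _SPACED.sub(join, text)

print(collapse_spaced_letters("h.e.l.l.o there"))
print(collapse_spaced_letters("f:)u:)c:)k:)"))
```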

Also length?  Can try an SVM, then some dimensionality reduction (word2vec), then a neural network.

In [364]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.01)
nb.fit(X_train_counts,df_train['toxic'].values)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [426]:
pred_nb=nb.predict(X_train_counts)
nb_stats=check_predictions(pred_nb,actual)

(95692, 1) (95692, 1)
True Positive 0.00896626677256197. False Positive 0.009196171048781508
False Negative 0.0009300672992517661. True Negative 0.9809074948794048


Well, I must have an error in my Naive Bayes code.  I'm getting much worse results, even though the class probabilities and conditional probabilities are correct.  (Found the bug: I used $\prod_i[p(w_i|T)p(T)]$ instead of
$p(T)\prod_i p(w_i|T)$.  I found the error after examining hidden attributes, but really by checking against, and copying from, the sklearn source code.)
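
Written out in log space, the two expressions differ in how often the prior enters. Here $n_i$ denotes the number of times word $i$ occurs in the comment; that symbol is notation introduced for this note, not something used in the code.

```latex
\begin{align*}
\text{correct:}\quad \log p(T \mid \mathbf{w})
  &= \log p(T) + \sum_i n_i \log p(w_i \mid T) - \log p(\mathbf{w}) \\
\text{buggy:}\quad \log \prod_i \big[ p(w_i \mid T)\, p(T) \big]^{n_i}
  &= \Big(\sum_i n_i\Big) \log p(T) + \sum_i n_i \log p(w_i \mid T)
\end{align*}
```

In the buggy form the prior is counted once per word instead of once per comment. Since $p(T) \approx 0.1$ here, the extra $(\sum_i n_i - 1)\log p(T)$ term penalises the toxic class more heavily the longer a comment is.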

## Naive Bayes False Positives and Negatives

Let's now look a bit at the misclassified results.

In [431]:
#fixing shapes to avoid broadcasting
actual=np.reshape(actual,(len(actual),1))
pred=np.reshape(pred,(len(actual),1))    

fp_msk = ((pred==True)&(actual==False))    
fn_msk = ((pred==False)&(actual==True))            

In [441]:
#flatten the (n,1) masks to 1-D so that they select rows
df_fn=df_train[np.asarray(fn_msk).ravel()][['comment_clean','mean_toxic','median_toxic','toxic']]
df_fp=df_train[np.asarray(fp_msk).ravel()][['comment_clean','mean_toxic','median_toxic','toxic']]

In [834]:
df_fp.head()

                                                                                            comment_clean  \
1659   == SHUT UP, FAG == LOL YOU WILL BAN ME HUH? OOH IM SHAKIN IM SHAKIN FUCKING FAGGOT YOU NEED TO ...   
2570                                                                                DAMN ur an ugly fukka   
2670      == Chanting Cox == Shut up you dumb cunt. Your a nigger loving, cum cuzzling crack whore bitch!   
2773                                                              ==Suck COck!!!== you do!!!! you do!!!!    
2880        == HOLA FUCKER RETARD WHO HAS NOTHING BETTER TO DO THAN DO WIKIPEDIA SHIT == FUCK YOU ASSHOLE   

      mean_toxic  median_toxic  toxic  
1659        -1.3          -1.0  False  
2570        -1.1          -1.0  False  
2670        -1.1          -1.0  False  
2773        -1.4          -1.0  False  
2880        -1.4          -1.5  False  

In [835]:
df_fn.head()

                                                                                             comment_clean  \
2464   BUSH US EVIL, HE'S SELFISH AND VERY SHORT TEMPERED. HE SHOULDN'T EVEN BE IN CONTROL OF A PET SHO...   
3560    == GAY SEX, ANAL, ALL NIGHT == Why are you reverting me? SPUI saya on his talk page that he doe...   
5066    No thanks faggot, I'm fine right here. But you might try going to hell. You don't like to suck ...   
10064  ` == WHAT IN THE FUCK ARE YOU REFERRING TOO == I HAVE NO IDEA WHAT THE PISS YOU'RE REFERRING TOO...   
11732                             == ASSHOLE! == I just read this article, and man, was Hitler an asshole!   

       mean_toxic  median_toxic  toxic  
2464         -1.4          -2.0   True  
3560         -1.6          -2.0   True  
5066         -1.3          -2.0   True  
10064        -1.7          -2.0   True  
11732        -1.6          -2.0   True  

In [451]:
ind=df_fn.index.values

So, false negatives: much more spacing and many more extra characters are being used to avoid the filter.

So at least the "false positives" arise because the people using the rating scale are wildly inconsistent.  These are "-1" on the toxicity scale, and so "non-toxic" under the rule where toxic comments have median toxicity less than -1.
I think I recognize some Full Metal Jacket quotes in there being used as insults.
Some are "neutral" but have lots of repetition.  I can't for the life of me imagine any of these comments adding anything to the discussion.

# Dimensionality Reduction

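Let's use the truncated SVD for dimensionality reduction (known as latent semantic analysis when applied to term matrices).
Apparently a TF-IDF matrix is superior to a straight term-frequency matrix for this purpose (it more closely matches the assumptions in the SVD about the noise).  Note that the cells below still fit on the raw counts in `X_train_counts`.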
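
A small NumPy sketch of the TF-IDF-then-SVD idea on a toy count matrix is given here. The smoothed IDF formula is one common convention and is an assumption of this example; the matrix itself is made up.

```python
import numpy as np

# Toy term-count matrix: 4 documents x 5 terms.
counts = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 1],
    [0, 0, 1, 2, 1],
], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed IDF (one common convention)
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)  # L2-normalise each document

# Truncated SVD: keep the k largest singular triplets.
k = 2
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
doc_embedding = U[:, :k] * s[:k]                  # documents in the k-dim latent space
print(np.round(s, 3))
print(np.round(doc_embedding, 3))
```

The term that occurs in every document gets the minimum IDF weight, so it contributes less to the leading components than it would in the raw counts. On the full sparse matrix one would use a sparse solver such as `TruncatedSVD` rather than a dense SVD.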

In [695]:
from sklearn.decomposition import TruncatedSVD

In [698]:
#took a minute or two
TSVD=TruncatedSVD(n_components=100,n_iter=10)
TSVD.fit(X_train_counts)

TruncatedSVD(algorithm='randomized', n_components=100, n_iter=10,
       random_state=None, tol=0.0)

In [699]:
#actually transform the results 
X_train_trans=TSVD.transform(X_train_counts)

In [707]:
plt.plot(TSVD.explained_variance_)
plt.xlabel('Component index')
plt.ylabel('Explained variance')
plt.show()

[Figure: explained variance of each of the 100 SVD components]

We will next use these results in a "deep" neural network.  

# Support Vector Machine

In CS229, Andrew Ng's assignment 2 suggests the SVM as a natural improvement over the Naive Bayes method.  Let's implement one of those.
I'm going to update it to do mini-batch gradient descent with sparse matrices.  The version I wrote initially was trash.
Or I could use an ensemble of SVM's based on 

In [519]:
pred_nb.sum()
actual=df_train['toxic'].values
actual.sum()

947

In [708]:
#define a cost function, check that we're minimizing it.
#define the alternative cost function to be sure we also minimize that original choice.
#check constraints are obeyed?
def svm_cost(alpha,Kmat,cat,l):
    m = Kmat.shape[0]
    Ka = np.dot(Kmat,alpha);            
    cost=0.5*l*np.dot( alpha, Ka)
    yvec=(1-cat*Ka)/m
    ymsk=yvec>0
    cost+=np.sum(yvec*ymsk)
    return cost                    

#Compute Kernel Matrix, an m x m matrix
#that effectively measures similarity between inputs.
#each K_{ij} is the "distance" between the weighted inputs,
# $x^{(i)}_k, $x^{(j)}_k$.
# def kernel_matrix(x,tau):
#     m=x.shape[0]
#     x=(x>0).astype(int)
#     K=np.zeros([m,m])
#     for i in range(0,m):
#         K[:,i]=np.exp(-np.sum((x-x[i])**2,1)/(2*tau**2));
#     return K

#Compute a column from the Kernel matrix.
#Matrix is assumed to be [m,m], with vec
#of length m.  Returns vector of length m.
# def Kvec(mat,vec,tau):
#     xarg=np.sum((mat-vec)**2,1)
#     Kv=np.exp(-xarg/(2*tau**2));
#     return Kv

def Kbatch(mat,ind,norm2,tau):
    """Kbatch(mat,ind,norm2,tau)
    Compute a batch of kernel matrix elements 
    Input: mat  - sparse matrix (nobs x nfeature)
           ind - indices for that subset of rows (nbatch)
           norm2 - column matrix with squared norm for each (nobs,1)
           tau - width of the Gaussian kernel
    Return: Kv - nobs x nbatch subset of the full kernel matrix.
    """
    nbatch=len(ind)
    #extract chosen rows
    cvec = mat[ind,:].T
    #relying on numpy broadcasting to join (nobs,nbatch) + (nobs,1)
    xarg=-2.0*mat.dot(cvec)+norm2
    #further broadcasting: use a row-vector ind to make a row-vector
    #of relevant norms.
    #then broadcast again from (nobs,nbatch)+ (1,nbatch)
    xarg+=norm2[ind].T
    Kv=np.exp(-xarg/(2*tau**2));
    return Kv

#carry out update on parameters for given loss function for SVM,
#given parameters, a row-vector of inputs K_i
def svm_batchsgd_update(alpha,Kbatch,y,ind,rate,l):
    """svm_batchsgd_update
    alpha  - nobs x 1 vector
    Kbatch - nobs x Nbatch subset of Kernel matrix
    y      - (1xNbatch) labels for inputs
    ind    - (1xNbatch) indices for batch
    """
    nobs = Kbatch.shape[0]
    yK = np.multiply(Kbatch,y.T)   #nobs x Nbatch 
    yKa = np.dot(alpha.T,yK);
    Kalpha = np.multiply(Kbatch,alpha[ind].T)
    #da= (-y_i*K_i)*((1-y_i*Ka) >0)+m*l*K_i*alpha[ind];
    da= -np.multiply(yK,yKa<-1)+nobs*l*Kalpha;
    #sum all changes over columns
    alpha=alpha-rate*np.sum(da,axis=1);
    return alpha
    
#Fit SVM coefficients for toxic comments with mini-batch stochastic gradient descent.
#use known categories in cat_vec (the hinge loss assumes labels in {-1,+1}), and word_matrix with nobs x nwords
def svm_fit(word_mat,cat_vec,tau=8,Nbatch=100):
    #just count whether word occurs.
    new_mat=(word_mat>0).astype(int)
    nobs,nword=new_mat.shape;
    alpha=0.1*np.random.randn(nobs,1)    #initialize parameters
    alpha0=alpha
    #alpha=np.zeros((nobs,1))    #initialize parameters
    alpha_tot=np.zeros((nobs,1))
    niter=int(40*nobs/Nbatch);
    l=1/(tau**2*nobs)
    norm2=new_mat.multiply(new_mat).sum(axis=1)
    #multiple iterations of stochastic gradient descent.
    for t in range(0,niter):
        indx=np.random.randint(low=0,high=nobs,size=Nbatch)        
        Kv = Kbatch(new_mat,indx,norm2,tau)
        yt=cat_vec[indx]
        rate=np.sqrt(np.sqrt(1.0/(t+1)))
        alpha=svm_batchsgd_update(alpha,Kv,yt,indx,rate,l)
        alpha_tot=alpha_tot+alpha
        if (10*t % niter ==0):
            print("Iter {} of {}".format(t,niter))
    alpha_tot=alpha_tot/niter
    return alpha0,alpha_tot

#given parameters, predict the output
def svm_predict(train_mat,test_mat,alpha,tau):
    ntrain=train_mat.shape[0]
    ntest=test_mat.shape[0]
    pred_cat=np.zeros(ntest)
    train_new=(train_mat>0).astype(int)
    test_new=(test_mat>0).astype(int)

    train_norm=train_new.multiply(train_new).sum(axis=1)
    test_norm=test_new.multiply(test_new).sum(axis=1)
    #use the sparse .dot method: np.dot does not handle scipy sparse matrices correctly
    train_test_dot = train_new.dot(test_new.T)
    for i in range(0,ntest):
        #compute dot-product of param-vector and column of kernel matrix
        dist2 = train_norm-2*train_test_dot[:,i]+test_norm[i]
        Kvec=np.exp(-dist2/(2*tau**2))
        #Kvec=np.exp(-np.sum((train_new-test_new[i])**2,1)/(2*tau**2))
        pred_size= np.dot(alpha.T,Kvec)
        pred_cat[i] = np.sign(pred_size)
    return pred_cat
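
Both `Kbatch` and `svm_predict` rely on expanding the squared distance as $||x_i||^2 - 2 x_i \cdot x_j + ||x_j||^2$, so that the kernel can be built from one matrix product plus broadcasting. The following dense toy sketch (random binary data, made up for this check) compares that expansion against a brute-force computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = (rng.random((6, 4)) > 0.5).astype(float)   # toy binary "word present" matrix
tau = 8.0

# Squared norm of each row, kept as a column so broadcasting works.
norm2 = (X ** 2).sum(axis=1, keepdims=True)     # shape (6, 1)

# ||x_i - x_j||^2 for all pairs at once: (6,1) - (6,6) + (1,6)
dist2 = norm2 - 2.0 * X @ X.T + norm2.T
K = np.exp(-dist2 / (2 * tau ** 2))

# Brute-force reference, one pair at a time.
K_ref = np.array([[np.exp(-np.sum((a - b) ** 2) / (2 * tau ** 2)) for b in X]
                  for a in X])
print(np.allclose(K, K_ref))
```

The same identity carries over to sparse inputs, where only the product `X @ X.T` needs sparse arithmetic.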


In [None]:
norm2=X_train_counts.multiply(X_train_counts).sum(axis=1)

In [497]:
%pdb off

Automatic pdb calling has been turned OFF


In [778]:
Nsub=1000
np.random.seed(454)
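def get_subset(frac_perc,dat_mat,labels):
    """get_subset
    Returns a random stratified subset of the data and labels: the same
    fraction frac_perc is drawn from each class, so the class ratio is preserved.
    labels must be a boolean array.
    """ 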
    #make vector and sample indices for true/false.
    nvec=np.arange(len(labels))
    #get the indices for true/false
    Tvec=nvec[labels]
    Cvec=nvec[~labels]
    #grab a random shuffling of those indices.
    np.random.shuffle(Tvec)
    np.random.shuffle(Cvec)
    #grab some fraction of them.
    it = int(len(Tvec)*frac_perc)
    ic = int(len(Cvec)*frac_perc)

    ind_sub=np.append(Tvec[:it],Cvec[:ic])
    Xsub = dat_mat[ind_sub]
    label_sub = labels[ind_sub].reshape((len(ind_sub),1))
    return ind_sub,Xsub,label_sub
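
Note that `get_subset` draws the same fraction from each class, so the subset keeps the original class ratio rather than balancing it. If a truly balanced subset is wanted, a sketch along the following lines could be used. `balanced_subset` is a hypothetical helper written for this note, not part of the notebook.

```python
import numpy as np

def balanced_subset(labels, n_per_class, seed=0):
    """Hypothetical helper: indices with the same number of examples from each class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=bool)
    pos = np.flatnonzero(labels)
    neg = np.flatnonzero(~labels)
    # Cannot draw more than the smaller class contains.
    n = min(n_per_class, len(pos), len(neg))
    return np.concatenate([rng.choice(pos, n, replace=False),
                           rng.choice(neg, n, replace=False)])

labels = np.array([True] * 10 + [False] * 90)
idx = balanced_subset(labels, 5)
print(labels[idx].mean())
```

Training on a balanced subset changes the effective class prior, so predicted probabilities would need recalibrating before being compared with the full data.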

In [792]:
%pdb off
ind_sub,Xsub,label_sub=get_subset(0.01,X_train_counts,actual)
print(Xsub.shape,np.mean(label_sub))

Automatic pdb calling has been turned OFF
(956, 125568) 0.00941422594142


In [783]:
Xsub.shape

(3826, 125568)

In [793]:
#my code: super slow.
#TODO: Look into Cython.  Does it play nice with sparse?
alpha0,alpha=svm_fit(Xsub,label_sub,Nbatch=100)

Iter 191 of 382


Iter 0 of 382


In [794]:
%pdb off
svm_pred=svm_predict(Xsub,Xsub,alpha,8)
check_predictions(svm_pred,label_sub)

Automatic pdb calling has been turned OFF


In [640]:
#Let's try to use sklearn's version on the same subset of the data

In [731]:
from sklearn.svm import SVC
#seems to choke, and do nothing?
nobs,nfeature=X_train_counts.shape

In [None]:
#Try to determine parameters via cross-validation
#Note that there is no need for explicit regularization?  Apparently in large dimensions, the parameters C/gamma (for penalty radius and width of basis function) do a decent job of regularizing, since l1, l2 regularization don't work.

Since apparently the training time for an SVM goes as $O(n_{sample}^3)$, maybe it is better to train an ensemble of SVMs.
In that case each of the $n_{ensemble}$ members trains on $n_{sample}/n_{ensemble}$ points, so the training time is $O(n_{sample}^3/n_{ensemble}^2)$ for the whole ensemble.  Evaluating the results then typically takes $O(n_{sample})$ for all of the ensemble together.  (This is something like making the assumption that the kernel matrix is block-diagonal, once appropriately sorted.)  If we repeat this for multiple such random splits, we can extract different correlations.
Then take a majority vote.
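
The scaling argument and the vote can be sketched in a few lines. Both helpers below are hypothetical and assume the cubic training cost quoted above; they are not an implementation of the ensemble itself.

```python
import numpy as np

def relative_training_cost(n_sample, n_ensemble):
    """Cost of n_ensemble SVMs on equal splits, relative to one SVM on all the data.

    Assumes cubic scaling: each member costs (n/k)^3 and there are k members.
    """
    per_member = (n_sample / n_ensemble) ** 3
    return n_ensemble * per_member / n_sample ** 3

def majority_vote(pred_matrix):
    """pred_matrix: (n_ensemble, n_obs) array of 0/1 predictions; ties go to 0."""
    preds = np.asarray(pred_matrix)
    votes = preds.sum(axis=0)
    return (votes > preds.shape[0] / 2).astype(int)

print(relative_training_cost(1000, 10))
print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

With ten members the estimated training cost drops by a factor of a hundred, which is the $1/n_{ensemble}^2$ saving described above.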

In [808]:
?SVC

In [885]:
#class_weight='balanced' is used, since the data is wildly unbalanced.
Nsub=4000
frac_perc=0.01
svm=SVC(cache_size=1000,verbose=True,gamma=0.01,C=1,class_weight='balanced')
indsub,Xsub,label_sub=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)
svm.fit(Xsub,label_sub.ravel())
svm_pred=svm.predict(Xsub)
svm_stats=check_predictions(svm_pred,label_sub)
#test on a different subset of the training data
indsub2,Xsub2,label_sub2=get_subset(frac_perc,X_train_counts,df_train['toxic'].values)
svm_pred2=svm.predict(Xsub2)
svm_stats2=check_predictions(svm_pred2,label_sub2)

(956, 1) (956, 1)
True Positive 0.02405857740585774. False Positive 0.007322175732217573
False Negative 0.07217573221757322. True Negative 0.8964435146443515
Log-loss is 2.6590522412818274


(956, 1) (956, 1)
True Positive 0.09309623430962342. False Positive 0.0020920502092050207
False Negative 0.0031380753138075313. True Negative 0.9016736401673641
Log-loss is 0.11561096701225335


[LibSVM]

(1000, 1) (1000, 1)
True Positive 0.034. False Positive 0.0
False Negative 0.466. True Negative 0.5


In [507]:
Kb.shape

(95692, 100)

In [87]:
a = np.matrix([[2],[4]])
B = np.matrix([[3,0,5],[0,0,9]])
Bs = sparse.csr_matrix(B)
a+np.matrix([[3,2]])

matrix([[5, 4],
        [7, 6]])

In [91]:
ind=[0,0,1,0]
a[ind]

matrix([[2],
        [2],
        [4],
        [2]])

# Recurrent Neural Network

So let's try the current flavour-of-the-month approach: a recurrent neural network.
Based on talking to Joseph and Fahim at the group, they used a two-layer neural network based on just the 2000 most common words, using ReLU activation.
(I think they said they borrowed the approach from someone at Kaggle.)
Let's try something similar, initially with a single leaky-ReLU layer, but after applying a truncated SVD.


# Deep Network

Another idea is to build a deep neural network on the term-frequency matrix, effectively running with extensions to the Naive Bayes model.


In [902]:
import tensorflow as tf
from tensorflow.contrib.layers import fully_connected, l2_regularizer
from tensorflow.contrib.rnn import BasicRNNCell,LSTMCell

In [804]:
ls tf.contrib.

ls: cannot access tf.contrib: No such file or directory


In [923]:
#just use the default graph
#NOTE: tf.contrib, placeholders and sessions are TensorFlow 1.x APIs.
Nlayers=2
Nhidden=100
Nout=1
lr = 0.01
frac_perc=0.01
n_iter=100
l2_scale=1E-3   #regularization strength: an arbitrary placeholder, not tuned

Nobs,Nfeature=X_train_trans.shape
tf.reset_default_graph()

#load in the training examples, and their labels
#batch dimension is left as None, since get_subset does not return exactly int(Nobs*frac_perc) rows
X = tf.placeholder(tf.float32,[None,Nfeature],name='X')
y = tf.placeholder(tf.float32,[None,Nout],name='y')

#make two hidden layers
#l2_regularizer must be called with a scale to build the regularizer function
H1 = fully_connected(inputs=X,num_outputs=Nhidden,
    activation_fn=tf.nn.leaky_relu,
    weights_regularizer=l2_regularizer(l2_scale))
H2 = fully_connected(inputs=H1,num_outputs=Nhidden,
    activation_fn=tf.nn.leaky_relu,
    weights_regularizer=l2_regularizer(l2_scale))

#just condense the number of inputs down, acting as a linear matrix combining results.
#These are logits; the sigmoid is applied inside the loss.
logits=fully_connected(inputs=H2,num_outputs=Nout,
     activation_fn=None)

#Best form of the loss?  Kaggle competition uses log-loss
# J = -\frac{1}{N}\sum_{\text{obs i}}\frac{1}{K}\sum_{\text{classes} k} y^{(i,k)}\log \hat{y}^{(i,k)}

#mean log-loss, computed from the logits for numerical stability
data_loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y,logits=logits))
#add the l2 penalties collected from the hidden layers
loss = data_loss + tf.losses.get_regularization_loss()
#define optimization function.
optimizer=tf.train.AdamOptimizer(learning_rate=lr)
training_op=optimizer.minimize(loss)
init=tf.global_variables_initializer()

print('Running this thang')
with tf.Session() as sess:
     init.run()
     for iteration in range(n_iter):
         #select a random batch
         ind_batch,X_batch,y_batch=get_subset(
         frac_perc,X_train_trans,actual)
         #labels come back as booleans; the placeholder expects floats
         y_batch=y_batch.astype(np.float32)

         if iteration%10 ==0:
            logloss =data_loss.eval(feed_dict={X:X_batch,y:y_batch})
            print('iter #{}. Current log-loss:{}'.format(iteration,logloss))

         sess.run(training_op, feed_dict={X: X_batch, y:y_batch})

(95692, 100)

1

In [806]:
?fully_connected

In [924]:
?tf.Print