# Toxic Comment Classification Challenge

Kaggle Challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

## Load the Data

The data comes as a train.csv and a test.csv set. Since test.csv has no labels, we concentrate on train.csv and apply cross-validation.
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.

We use Pandas to load the data, as it provides a convenient way to work with tabular data.

In [47]:
from pandas import DataFrame, read_csv

train_df = read_csv('train_shuffled.csv')
test_df = read_csv('test_shuffled.csv')
train_df.head(20)



Unnamed: 0.1,Unnamed: 0,comment_text,toxic
0,7608,"""\r\n OK, I understand. -) I've been a racist ...",0
1,44267,Hmm... vandalist brings up what I thought to b...,0
2,112180,hey everyone User: Xeno sucks on mens dick!!!!!!!,1
3,67617,Regarding edits made during March 8 2007 (UTC)...,0
4,99209,In the unlikely event I accidentally picked tw...,0
5,14096,Droop is not better than Hagenbach-Bischoff!!!...,0
6,70223,"""\r\n\r\n (watching) The template suggests tha...",0
7,25335,"""\r\n\r\n Blocked \r\n\r\n — xplicit """,0
8,54093,Then please add that source to here and mentio...,0
9,93746,"""\r\n\r\nTo 'This article deals with contenti...",0


In [48]:
test_df.head(10)

Unnamed: 0.1,Unnamed: 0,comment_text
0,151890,"""\r\nHow is it a """"personal attack"""" to point ..."
1,152979,"""\r\n{| style=""""background-color:#F5FFFA; padd..."
2,148234,"(UTC)\r\n\r\nFirst, any 'numerology' regarding..."
3,65022,Ebyabe falsifies information on her repeated b...
4,66534,"""\r\n\r\nSpeedy deletions\r\nHello! I recently..."
5,30094,"""7 Sep 2012 ==\r\n As of 2012, 800,000 Rohingy..."
6,148805,REDIRECT Talk:Bombardments of Shimonoseki
7,64216,"""\r\n\r\nSuggestion The new name should be:Cal..."
8,121596,"Autor \r\n\r\nHi, who was the autor of this pi..."
9,117023,Fuck FINLAY McWALTER hes a faggot and he needs...


## Download GloVe Embeddings

Pretrained versions of the GloVe embeddings are available from the project page (https://nlp.stanford.edu/projects/glove/). However, the Gensim library provides an API which lets you download some of these embeddings directly onto your machine.

In [49]:
import gensim.downloader as api

info = api.info()  # show info about available models/datasets
model = api.load("glove-twitter-25")  # download the model and return as object ready for use
model.most_similar("cat")

[(u'dog', 0.9590820074081421),
 (u'monkey', 0.920357882976532),
 (u'bear', 0.914313793182373),
 (u'pet', 0.9108030796051025),
 (u'girl', 0.8880630731582642),
 (u'horse', 0.8872727155685425),
 (u'kitty', 0.8870542645454407),
 (u'puppy', 0.8867697715759277),
 (u'hot', 0.8865255117416382),
 (u'lady', 0.8845518827438354)]

In [50]:
train_df['toxic'].value_counts(normalize=True)

0    0.904156
1    0.095844
Name: toxic, dtype: float64

## Feature Extraction

We use the average of the word-embeddings in each sentence.
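
As a minimal sketch of the idea, the example below averages made-up 3-dimensional vectors for a two-word comment. The names `toy_vectors`, `toy_oov` and `average_embedding` are hypothetical and only serve as illustration; the notebook itself uses the 25-dimensional GloVe vectors.

```python
import numpy as np

# Toy 3-dimensional "embeddings"; the values are made up for illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "cat": np.array([3.0, 2.0, 0.0]),
}
toy_oov = np.zeros(3)  # hypothetical fallback vector for unknown words


def average_embedding(tokens, vectors, oov_vector):
    """Return the mean of the token vectors, or zeros for an empty comment."""
    if not tokens:
        return np.zeros_like(oov_vector)
    rows = [vectors.get(tok, oov_vector) for tok in tokens]
    return np.mean(rows, axis=0)


avg = average_embedding(["good", "cat"], toy_vectors, toy_oov)
print(avg)  # [2. 1. 1.]
```

Every comment, whatever its length, is mapped to a single vector of the embedding dimension, which is what makes it usable as a fixed-size classifier input.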

In [51]:
#size of the training input
embedding_dimension = model['cat'].shape[0]
nsamples = train_df['comment_text'].shape[0]

print('Input Size: {} x {}'.format(nsamples, embedding_dimension))

Input Size: 127656 x 25


In [52]:
# Tokenize
from gensim.utils import simple_preprocess

def tokenize_comments(df):
    tokenized_comments = []
    for comment in df['comment_text']:
        tokens = simple_preprocess(comment, deacc=False, min_len=2, max_len=15)
        tokenized_comments.append(tokens)
    return tokenized_comments
    

In [53]:
import numpy as np
#compute average word embedding
dummy_embedding = np.random.randn(embedding_dimension)

def compute_avg_wemb(tokenized_comments):
    nsamples = len(tokenized_comments)
    X_data = np.zeros(shape=(nsamples, embedding_dimension))
    for i, tokens in enumerate(tokenized_comments):
        sentence_vectors = []
        for token in tokens:
            # membership test works on KeyedVectors across gensim versions
            if token in model:
                vec = model.get_vector(token)
            else:
                vec = dummy_embedding
            sentence_vectors.append(vec)
        if len(sentence_vectors) > 0:
            sentence_array = np.array(sentence_vectors)
            avg_vector = np.mean(sentence_array, axis=0)
        else: 
            avg_vector = np.zeros(embedding_dimension)    
        X_data[i] = avg_vector
        assert not np.isnan(avg_vector).any()
    return X_data
    

In [54]:
# tokenize both splits, then build the training features and labels
tokenized_train = tokenize_comments(train_df)
tokenized_test = tokenize_comments(test_df)
X_train = compute_avg_wemb(tokenized_train)
y_train = train_df['toxic']

print(len(tokenized_train))
print(len(tokenized_test))

127656
31915


In [55]:
# build the test features
X_test = compute_avg_wemb(tokenized_test)

## Model Training

In [56]:
from sklearn.metrics import f1_score
#random baseline
scores = []
for i in range(5):
    y_pred = np.random.randint(0, 2, size=y_train.shape)
    scores.append(f1_score(y_train, y_pred, average='binary', pos_label=1))
print('Random Scores:', np.mean(scores))

#all_one baseline
y_pred = np.ones(shape=y_train.shape)
print('All One Score:', f1_score(y_train, y_pred))

('Random Scores:', 0.16155380057571378)
('All One Score:', 0.1749219034819967)
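
Both baseline scores can be sanity-checked by hand from the class balance alone. The sketch below assumes the positive rate of 0.095844 reported by `value_counts` earlier; `f1_from_rates` is a hypothetical helper, not part of scikit-learn.

```python
def f1_from_rates(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


pos_rate = 0.095844  # share of toxic comments in the training set

# Predicting 1 for everything: recall is 1, precision equals the positive rate.
f1_all_ones = f1_from_rates(pos_rate, 1.0)

# Fair coin flip: about half of the positives are found (recall 0.5),
# and precision is still expected to equal the positive rate.
f1_coin_flip = f1_from_rates(pos_rate, 0.5)

print(round(f1_all_ones, 3), round(f1_coin_flip, 3))  # 0.175 0.161
```

These expected values are close to the measured scores, so any useful classifier should land clearly above roughly 0.17 F1.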


In [57]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

clf_gaussnb = GaussianNB()
scores = cross_val_score(clf_gaussnb, X_train, y_train, cv=5, scoring='f1')
print(scores)
print(np.mean(scores), np.std(scores))

[0.43026316 0.43939146 0.43855897 0.45008237 0.4281457 ]
(0.4372883319836912, 0.00778030879009286)


In [58]:
#fit classifier to training data
clf_gaussnb.fit(X_train, y_train)
y_pred = clf_gaussnb.predict(X_test)
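
The challenge description asks for a probability per comment, whereas `predict` returns hard 0/1 labels. A minimal sketch of how probabilities could be obtained with scikit-learn's `predict_proba` follows; it runs on synthetic stand-in data, and all `*_demo` names are hypothetical rather than part of the notebook.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.RandomState(0)

# Synthetic stand-ins for X_train, y_train and X_test (assumption: 25 features,
# a minority positive class). They are not the real comment features.
X_demo = rng.randn(200, 25)
y_demo = (X_demo[:, 0] > 1.0).astype(int)
X_new_demo = rng.randn(10, 25)

clf_demo = GaussianNB().fit(X_demo, y_demo)

# predict_proba returns one column per class, ordered as in clf_demo.classes_.
proba_demo = clf_demo.predict_proba(X_new_demo)
toxic_col = list(clf_demo.classes_).index(1)
p_toxic_demo = proba_demo[:, toxic_col]

print(p_toxic_demo.shape)  # (10,)
```

Applied to the notebook's objects, the same call would be `clf_gaussnb.predict_proba(X_test)`. Naive Bayes probabilities tend to be pushed towards 0 or 1, so they may need calibration before being read as confidence values.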

In [59]:
#correct format for submission
test_df['toxic'] = y_pred.tolist()
test_df.head(20)

Unnamed: 0.1,Unnamed: 0,comment_text,toxic
0,151890,"""\r\nHow is it a """"personal attack"""" to point ...",0
1,152979,"""\r\n{| style=""""background-color:#F5FFFA; padd...",0
2,148234,"(UTC)\r\n\r\nFirst, any 'numerology' regarding...",0
3,65022,Ebyabe falsifies information on her repeated b...,1
4,66534,"""\r\n\r\nSpeedy deletions\r\nHello! I recently...",0
5,30094,"""7 Sep 2012 ==\r\n As of 2012, 800,000 Rohingy...",0
6,148805,REDIRECT Talk:Bombardments of Shimonoseki,1
7,64216,"""\r\n\r\nSuggestion The new name should be:Cal...",0
8,121596,"Autor \r\n\r\nHi, who was the autor of this pi...",0
9,117023,Fuck FINLAY McWALTER hes a faggot and he needs...,1


In [60]:
#save results to csv; index=False avoids writing another 'Unnamed: 0' index column
test_df.to_csv('test_solution.csv', index=False)