# Toxic Comment Classification Challenge

Kaggle Challenge: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge

## Load the Data

The challenge provides two files, train.csv and test.csv. Since test.csv has no labels, we concentrate on train.csv and apply cross-validation.
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.

We use Pandas to load the data, as it provides a convenient way to work with tabular data.

In [2]:
from pandas import DataFrame, read_csv

train_df = read_csv('data.csv')
train_df.head(10)



Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


## Download GloVe Embeddings

Pretrained versions of the GloVe embeddings are available at https://nlp.stanford.edu/projects/glove/. However, the Gensim library provides an API which lets you download some of these embeddings directly onto your machine.

In [3]:
import gensim.downloader as api

info = api.info()  # show info about available models/datasets
model = api.load("glove-twitter-25")  # download the model and return as object ready for use
model.most_similar("cat")



[(u'dog', 0.9590820074081421),
 (u'monkey', 0.920357882976532),
 (u'bear', 0.914313793182373),
 (u'pet', 0.9108030796051025),
 (u'girl', 0.8880630731582642),
 (u'horse', 0.8872727155685425),
 (u'kitty', 0.8870542645454407),
 (u'puppy', 0.8867697715759277),
 (u'hot', 0.8865255117416382),
 (u'lady', 0.8845518827438354)]

In [4]:
train_df['toxic'].value_counts(normalize=True)

0    0.904156
1    0.095844
Name: toxic, dtype: float64

## Feature Extraction

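We represent each comment by the average of the word embeddings of its tokens. Tokens that are missing from the GloVe vocabulary are mapped to a single random dummy vector, and comments without any tokens get the zero vector.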
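
As a minimal sketch of the averaging step, the toy example below uses a hypothetical three-word vocabulary with 2-dimensional vectors. The names `toy_vectors` and `average_embedding` are illustrative and not part of the notebook, and the fallback for unknown tokens is assumed to be a fixed zero vector here so that the result is reproducible.

```python
import numpy as np

# Hypothetical toy vocabulary; the GloVe vectors used in this notebook have 25 dimensions.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
    "cat": np.array([1.0, 1.0]),
}


def average_embedding(tokens, vectors, dim, fallback):
    """Return the mean vector of the tokens, or zeros for an empty token list."""
    if not tokens:
        return np.zeros(dim)
    # Unknown tokens contribute the shared fallback vector.
    rows = [vectors.get(token, fallback) for token in tokens]
    return np.mean(rows, axis=0)


fallback = np.zeros(2)  # assumption: fixed fallback instead of a random vector
avg = average_embedding(["good", "cat"], toy_vectors, dim=2, fallback=fallback)
print(avg)  # element-wise mean of [1, 0] and [1, 1]
```

Averaging discards word order, so two comments with the same words in a different order get the same feature vector.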

In [5]:
# size of the training input
embedding_dimension = model['cat'].shape[0]
nsamples = train_df['comment_text'].shape[0]

print('Input Size: {} x {}'.format(nsamples, embedding_dimension))

Input Size: 159571 x 25


In [6]:
# Tokenize
from gensim.utils import simple_preprocess

tokenized_comments = []
for comment in train_df['comment_text']:
    tokens = simple_preprocess(comment, deacc=False, min_len=2, max_len=15)
    tokenized_comments.append(tokens)
    

In [7]:
import numpy as np
# compute the average word embedding per comment
dummy_embedding = np.random.randn(embedding_dimension)

X_data = np.zeros(shape=(nsamples, embedding_dimension))
for i, tokens in enumerate(tokenized_comments):
    sentence_vectors = []
    for token in tokens:
        if token in model:  # `in` works in gensim 3.x and 4.x (model.vocab was removed in 4.0)
            vec = model.get_vector(token)
        else:
            vec = dummy_embedding
        sentence_vectors.append(vec)
    if len(sentence_vectors) > 0:
        sentence_array = np.array(sentence_vectors)
        avg_vector = np.mean(sentence_array, axis=0)
    else: 
        avg_vector = np.zeros(embedding_dimension)    
    X_data[i] = avg_vector
    assert not np.isnan(avg_vector).any()
    

## Data Analysis

- How long is the longest sentence?
- What is the average length of a sentence? 
- What is the standard deviation of the sentence lengths?
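
These three questions can be answered from the number of tokens per comment. The sketch below uses a small hand-made list in place of the notebook's `tokenized_comments` (the toy data and the names `toy_tokenized` and `lengths` are assumptions for illustration); substituting the real list should give the corresponding statistics for the training set.

```python
import numpy as np

# Toy stand-in for tokenized_comments: one list of tokens per comment.
toy_tokenized = [
    ["hey", "man"],
    ["you", "sir", "are", "my", "hero"],
    [],
    ["sorry", "if", "the", "word", "was", "offensive"],
]

# Number of tokens in each comment.
lengths = np.array([len(tokens) for tokens in toy_tokenized])

print("longest:", lengths.max())
print("mean:", lengths.mean())
print("std:", lengths.std())  # population standard deviation (ddof=0)
```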

In [8]:
#create labels
y = train_df['identity_hate']
np.bincount(y)

array([158166,   1405], dtype=int64)

In [9]:
from sklearn.metrics import f1_score
# random baseline
scores = []
for i in range(5):
    y_pred = np.random.randint(0, 2, size=y.shape)
    scores.append(f1_score(y, y_pred))
print(scores)
print(np.mean(scores), np.std(scores))

# all-ones baseline
y_pred = np.ones(shape=y.shape)
print(f1_score(y, y_pred))

[0.016693045800771588, 0.017190987680203235, 0.017582363385283094, 0.016591586047313096, 0.016408228825600793]
(0.01689324234783436, 0.00043124317618979517)
0.017456018288440515
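
The all-ones score can be checked by hand, which also shows why F1 is so low for this label. Predicting 1 everywhere gives a recall of 1 and a precision equal to the positive rate, so the score depends only on the class counts reported by `np.bincount` above. A small sketch of that arithmetic (the variable names are illustrative):

```python
# Class counts for identity_hate, taken from the np.bincount output.
n_negative = 158166
n_positive = 1405

# All-ones prediction: every positive is a true positive, every negative
# is a false positive, and there are no false negatives.
tp, fp, fn = n_positive, n_negative, 0

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 6), recall, round(f1, 6))
```

The result agrees with the value printed by `f1_score` for the all-ones baseline, so a classifier is only useful on this label if it scores clearly above roughly 0.0175.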


In [10]:
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

clf_gaussnb = GaussianNB()
scores = cross_val_score(clf_gaussnb, X_data,y, cv=5, scoring='f1')
print(scores)
print(np.mean(scores), np.std(scores))

[0.10884354 0.10302792 0.09857612 0.1061008  0.10758312]
(0.10482629821018952, 0.0036786262442442493)


In [None]:
from sklearn.ensemble import RandomForestClassifier

clf_randforest = RandomForestClassifier(class_weight='balanced')
scores = cross_val_score(clf_randforest, X_data,y, cv=5, scoring='f1')
print(scores)
print(np.mean(scores), np.std(scores))

[0.07006369 0.07619048 0.09061489 0.0660066  0.07260726]
(0.07509658371510446, 0.0084400097906251)


In [None]:
from sklearn.ensemble import GradientBoostingClassifier

clf_gb = GradientBoostingClassifier()
scores = cross_val_score(clf_gb, X_data,y, cv=5, scoring='f1')
print(scores)
print(np.mean(scores), np.std(scores))