# <b>Introduction</b>

This is a Kaggle competition: <b>"Toxic Comment Classification Challenge"</b> <br>
Link: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge<br>

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

## <b>Project Outline</b>

In this project I will cover the following:

<li> Download the data from Kaggle and process it
<li> Build neural network with LSTM
<li> Build neural network with LSTM and CNN
<li> Use pre-trained GloVe word embeddings
<li> Word Embeddings from Word2Vec

## <b>Import libraries</b>

In [43]:
# Keras
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, LSTM, GRU, Conv1D, MaxPooling1D, Dropout, SpatialDropout1D, Activation, Bidirectional, GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.layers.embeddings import Embedding
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from keras.callbacks import EarlyStopping,ModelCheckpoint, Callback
from sklearn.metrics import accuracy_score, roc_auc_score


## Plot
import plotly.offline as py
import plotly.graph_objs as go
py.init_notebook_mode(connected=True)
import matplotlib.pyplot as plt

# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

# Other
import re
import string
import numpy as np
import pandas as pd

## <b>Data Processing</b>

In [2]:
df_train = pd.read_csv('train.csv',  error_bad_lines=False)
df_test = pd.read_csv('test.csv',  error_bad_lines=False)

In [3]:
df_train= df_train.drop_duplicates()
df_test= df_test.drop_duplicates()

In [4]:
df_train.describe()

Unnamed: 0,toxic,severe_toxic,obscene,threat,insult,identity_hate
count,159571.0,159571.0,159571.0,159571.0,159571.0,159571.0
mean,0.095844,0.009996,0.052948,0.002996,0.049364,0.008805
std,0.294379,0.099477,0.223931,0.05465,0.216627,0.09342
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.0,0.0
75%,0.0,0.0,0.0,0.0,0.0,0.0
max,1.0,1.0,1.0,1.0,1.0,1.0


In [5]:
df_train.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


In [6]:
df_test.head(10)

Unnamed: 0,id,comment_text
0,00001cee341fdb12,Yo bitch Ja Rule is more succesful then you'll...
1,0000247867823ef7,== From RfC == \n\n The title is fine as it is...
2,00013b17ad220c46,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,00017563c3f7919a,":If you have a look back at the source, the in..."
4,00017695ad8997eb,I don't anonymously edit articles at all.
5,0001ea8717f6de06,Thank you for understanding. I think very high...
6,00024115d4cbde0f,Please do not add nonsense to Wikipedia. Such ...
7,000247e83dcc1211,:Dear god this site is horrible.
8,00025358d4737918,""" \n Only a fool can believe in such numbers. ..."
9,00026d1092fe71cc,== Double Redirects == \n\n When fixing double...


### Tokenize text data

To keep the computation manageable, I cap the vocabulary at the 100,000 most frequent words. The comments are first cleaned and tokenized, then converted into integer sequences. Each sequence is padded or truncated to 150 tokens so that all comments have the same length.

In [7]:
def clean_text(text):

    # Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r" \n ", " ", text)
    text = re.sub(r"what's", "what is", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have", text)  # keep a space so "I've" becomes "I have"
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"aren't", "are not", text)
    text = re.sub(r"couldn't", "could not", text)
    text = re.sub(r"didn't", "did not", text)
    text = re.sub(r"doesn't", "does not", text)
    text = re.sub(r"don't", "do not", text)
    text = re.sub(r"hadn't", "had not", text)
    text = re.sub(r"hasn't", "has not", text)
    text = re.sub(r"haven't", "have not", text)
    text = re.sub(r"isn't", "is not", text)
    text = re.sub(r"shouldn't", "should not", text)
    text = re.sub(r"wasn't", "was not", text)
    text = re.sub(r"weren't", "were not", text)
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"wouldn't", "would not", text)
    text = re.sub(r"mustn't", "must not", text)
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"\'re", " are", text)  # keep a space so "they're" becomes "they are"
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\= =", " ", text)
    text = re.sub(r"\==", " ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r"0s", "0", text)  # "\0s" would match a NUL character followed by "s"
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"\s{2,}", " ", text)
    
    ## Convert words to lower case and split them
    text = text.lower().split()
    
    # Remove stop words
    stops = {'.', '?', '!', ':', 'which', 'above', 'is', 'during', 'that', 'under', 'such', 'mightn', 'with', 'm', 'herself', 'there', 'other', 'some', 'down', 'himself', 't', 'this', 'by', 'will', 'what', 'o', 'its', 'am', 'but', 'hers', 'be', 'your', 'yours', 'did', 'll', 'from', 're', 'should', 'how', 'while', 'on', 'then', 'isn', 'as', 'my', 'those', 'no', 'same', 'in', "you've", 'up', 'an', "it's", 'him', 'her', 'than', 'after', 'myself', 'ourselves', 'these', 'were', 'their', 'why', 'off', 'again', 'she', 've', 'between', 'when', 'any', 'yourselves', 'each', 'who', "you're", 'ours', 'they', 'his', 'aren', 'so', 'been', 'now', 'had', 'most', 'our', 'too', 'whom', 'all', 'ma', "you", 'we', 'once', 'being', 'more', 'themselves', 'out', 'or', 'below', 'just', 'he', 'it', 'i', 'where', 'both', 'until', 'haven', 'you', 'into', 'the', 'a', 'and', 'at', 'own', "she's", 'if', 'ain', 's', 'are', 'theirs', 'of', 'before', 'against', 'here', 'about', "that'll", 'only', 'yourself', 'me', 'them', 'to', 'shan', 'y', 'for', 'further', 'itself', 'through', 'because', 'd'}
    text = [w for w in text if not w in stops]
    text = " ".join(text)
    
    ## Remove punctuation (str.translate needs a table built with str.maketrans)
    text = text.translate(str.maketrans('', '', string.punctuation))

    return text
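
As a point of reference for the punctuation step, `str.translate` expects a translation table (for example one built with `str.maketrans`) rather than a plain string of characters. The sketch below shows one common way to build such a table. The helper name `strip_punctuation` and the sample sentence are illustrative and not part of the notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None,
# which makes str.translate delete it.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text):
    """Delete ASCII punctuation, then collapse the whitespace left behind."""
    return " ".join(text.translate(PUNCT_TABLE).split())


sample = "can not make real suggestions improvement - wondered"
print(strip_punctuation(sample))
```

Collapsing whitespace afterwards avoids the double spaces that would otherwise remain where a standalone symbol such as `-` was removed.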

In [8]:
df_train['comment_text'] = df_train['comment_text'].map(lambda x: clean_text(x))
df_test['comment_text'] = df_test['comment_text'].map(lambda x: clean_text(x))

In [10]:
df_test.head(10)

Unnamed: 0,id,comment_text
0,00001cee341fdb12,yo bitch ja rule succesful ever whats hating s...
1,0000247867823ef7,rfc title fine imo
2,00013b17ad220c46,sources zawe ashton lapland
3,00017563c3f7919a,have look back source information updated was ...
4,00017695ad8997eb,do not anonymously edit articles
5,0001ea8717f6de06,thank understanding think very highly would no...
6,00024115d4cbde0f,please do not add nonsense wikipedia edits con...
7,000247e83dcc1211,dear god site horrible
8,00025358d4737918,fool can believe numbers correct number lies 1...
9,00026d1092fe71cc,double redirects fixing double redirects do no...


In [11]:
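# Assigning the result of dropna() back to the column realigns on the index and removes nothing,
# so replace any missing comments with empty strings instead
df_train["comment_text"] = df_train["comment_text"].fillna("")
df_test["comment_text"] = df_test["comment_text"].fillna("")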

In [12]:
classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
xtrain = df_train.drop(classes, axis=1)
xtrain = xtrain.drop(['id'], axis=1)
ytrain = df_train[classes]
xtest = df_test.drop(['id'], axis=1)

In [13]:
xtrain.head(4)

Unnamed: 0,comment_text
0,explanation edits made username hardcore metal...
1,aww matches background colour seemingly stuck ...
2,hey man really not trying edit war guy constan...
3,can not make real suggestions improvement - wo...


In [14]:
vocabulary_size = 100000
max_length = 150

tokenizer = Tokenizer(num_words = vocabulary_size, lower=True)
tokenizer.fit_on_texts(list(xtrain['comment_text']) + list(xtest['comment_text']))

xtrain = tokenizer.texts_to_sequences(xtrain['comment_text'])
xtrain = pad_sequences(xtrain, maxlen = max_length)

xtest = tokenizer.texts_to_sequences(xtest['comment_text'])
xtest = pad_sequences(xtest, maxlen = max_length)

In [18]:
print(xtest.shape)

(153164, 150)
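
To make the tokenizing and padding step more concrete, here is a minimal plain-Python sketch of the idea. It is a simplified stand-in, not the Keras implementation: the helper names `fit_word_index`, `to_sequences` and `pad_left` are hypothetical, and the behaviour (index 0 reserved for padding, words ranked at or beyond `num_words` dropped, zeros added on the left, long sequences cut from the front) is assumed to mirror the Keras defaults used above.

```python
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent word, 0 is kept for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping words whose index is >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]


def pad_left(sequences, maxlen):
    """Left-pad with zeros and keep only the last maxlen tokens of longer sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


texts = ["dear god site horrible", "site horrible horrible"]
word_index = fit_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=4)
padded = pad_left(sequences, maxlen=5)
print(word_index)
print(padded)
```

With `num_words=4` only indices 1 to 3 survive, so the rarest word is dropped before padding.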


###  <b>Build neural network with LSTM and CNN</b>

### Get embeddings from GloVe

In [19]:
embeddings_index = dict()
# Each line of the GloVe file holds a word followed by its vector components
with open('glove.42B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 1917494 word vectors.


In [20]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocabulary_size, 300))
for word, index in tokenizer.word_index.items():
    if index > vocabulary_size - 1:
        break
    else:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index] = embedding_vector
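
The two cells above can be illustrated on a toy scale. The sketch below parses a made-up, three-dimensional "GloVe-style" text and fills a small weight matrix; the vectors, the word index and the helper names `load_vectors` and `build_matrix` are all invented for illustration. It uses plain lists instead of NumPy so that it runs without extra dependencies.

```python
import io

# Two made-up 3-dimensional vectors in the text layout "<word> <v1> <v2> ...".
FAKE_GLOVE = "site 0.1 0.2 0.3\nhorrible -0.5 0.0 0.5\n"


def load_vectors(handle):
    """Parse lines of the form 'word v1 v2 ...' into a dict of float lists."""
    vectors = {}
    for line in handle:
        parts = line.split()
        vectors[parts[0]] = [float(v) for v in parts[1:]]
    return vectors


def build_matrix(word_index, vectors, vocab_size, dim):
    """Row i holds the vector of the word with index i; unknown words stay all-zero."""
    matrix = [[0.0] * dim for _ in range(vocab_size)]
    for word, index in word_index.items():
        if index >= vocab_size:
            continue  # outside the vocabulary cap
        vector = vectors.get(word)
        if vector is not None:
            matrix[index] = vector
    return matrix


vectors = load_vectors(io.StringIO(FAKE_GLOVE))
word_index = {"horrible": 1, "site": 2, "zawe": 3, "god": 4}
matrix = build_matrix(word_index, vectors, vocab_size=4, dim=3)
for row in matrix:
    print(row)
```

Row 0 stays zero because index 0 is reserved for padding, and the row for `zawe` stays zero because the toy embedding file has no vector for it.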

### Network Architecture

The network starts with an embedding layer, which maps each token index to a dense vector so that the network can represent words in a meaningful way. The layer takes `vocabulary_size` (100000) as its first argument and 300, the dimension of the GloVe embeddings, as its second. The third parameter, `input_length`, is 150, the length of each padded comment sequence. The layer is initialized with the GloVe weight matrix and frozen (`trainable = False`).<br>

The LSTM model worked well. However, it is slow to train, even for three epochs. One way to shorten training is to add a convolutional layer to the network. Convolutional neural networks (CNNs) come from image processing: they pass a “filter” over the data and compute a higher-level representation. They have been shown to work surprisingly well for text, even though they lack the sequence-processing ability of LSTMs. The model defined below uses a bidirectional GRU followed by global average pooling and does not include a convolutional layer.
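
To illustrate what passing a "filter" over a sequence means, here is a minimal sketch of a single one-dimensional convolution filter followed by global max pooling, written in plain Python. It is only an illustration of the idea, not the layer implementation from Keras; the toy sequence, the kernel weights and the function names are made up.

```python
def conv1d_valid(sequence, kernel):
    """Slide `kernel` (k x d weights) over `sequence` (n x d), stride 1, no padding."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        total = sum(
            w * x
            for kernel_row, seq_row in zip(kernel, window)
            for w, x in zip(kernel_row, seq_row)
        )
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs


def global_max_pool(feature_map):
    """Keep only the strongest response of the filter over the whole sequence."""
    return max(feature_map)


# Four tokens, each represented by a 2-dimensional embedding.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One filter that looks at two neighbouring tokens at a time.
kernel = [[1.0, -1.0], [0.5, 0.5]]

feature_map = conv1d_valid(sequence, kernel)
print(feature_map)                   # one value per window position
print(global_max_pool(feature_map))  # a single feature for the whole comment
```

Each window is processed independently of the others, which is why convolutions parallelize well compared with a recurrent layer that has to step through the sequence in order.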

In [44]:
def create_model():
    model = Sequential()
    model.add(Embedding(vocabulary_size, 300, input_length = 150, trainable = False, weights=[embedding_matrix]))
    model.add(SpatialDropout1D(0.2))
    model.add(Bidirectional(GRU(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
    model.add(GlobalAveragePooling1D())
    model.add(Dense(6, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model
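
The final `Dense(6, activation='sigmoid')` layer with `binary_crossentropy` treats each of the six labels as its own yes/no problem. A minimal sketch of that computation for a single comment, using made-up scores and labels, might look like this:

```python
import math


def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of the per-label log losses for one comment."""
    losses = []
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Made-up raw scores for: toxic, severe_toxic, obscene, threat, insult, identity_hate
logits = [2.0, -3.0, 1.5, -4.0, 0.5, -2.0]
labels = [1, 0, 1, 0, 1, 0]

probs = [sigmoid(z) for z in logits]
print([round(p, 3) for p in probs])
print(binary_crossentropy(labels, probs))
```

Unlike a softmax, the six probabilities are independent and do not have to sum to one, so a comment can be flagged as both `toxic` and `insult` at the same time.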

### Train the network

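There are about 160,000 comments in the training set, and training the model takes a while on a MacBook Pro. To save time, training is capped at eight epochs with early stopping; a GPU machine would make it practical to train for more epochs. The data is split into a training part and a validation part with `train_test_split`; with `train_size = 0.20` as written, 20% of the comments are used for training and the remaining 80% for validation.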

### Ensemble 

In [22]:
from sklearn.model_selection import StratifiedKFold

class Create_ensemble(object):
    def __init__(self, n_splits, train_epoch, base_models):
        self.n_splits = n_splits
        self.base_models = base_models
        self.train_epoch = train_epoch

    def predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(KFold(n_splits=self.n_splits, shuffle=True, random_state=2016).split(X, y))

        S_train = np.zeros((X.shape[0], y.shape[1]))
        S_test = np.zeros((T.shape[0], y.shape[1]))
        
        for i, clf in enumerate(self.base_models):

            for j, (train_idx, valid_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_valid = X[valid_idx]
                y_valid = y[valid_idx]
                
                ## valid
                clf.fit(X_train, y_train, validation_data = (X_valid, y_valid), 
                        batch_size= 128, epochs = self.train_epoch)
                valid_pred = clf.predict(X_valid)
                S_train[valid_idx,:] = valid_pred
                
                ## test
                test_pred = clf.predict(T)
                S_test += test_pred 

                print('Model: {}, fold: {}, roc_auc_score: {}'.format(i, j, roc_auc_score(y_valid, valid_pred)))
            print( "\nTraining roc_auc_score for model {} : {}".format(i, roc_auc_score(y, S_train)))
            S_test /= self.n_splits
            
        return S_train, S_test
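
The out-of-fold idea behind the class above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the folds are contiguous and unshuffled, `MeanModel` is a made-up stand-in for a real classifier, and a fresh model is created for every fold (the class above reuses the same `clf` object across folds instead).

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs over contiguous, unshuffled folds."""
    fold_sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        valid_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, valid_idx
        start += size


class MeanModel:
    """Stand-in for a real classifier: always predicts the mean training label."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_] * len(X)


def out_of_fold(make_model, X, y, X_test, n_splits):
    oof = [0.0] * len(X)
    test_pred = [0.0] * len(X_test)
    for train_idx, valid_idx in kfold_indices(len(X), n_splits):
        model = make_model()  # fresh model, never trained on this fold's validation rows
        model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i, p in zip(valid_idx, model.predict([X[i] for i in valid_idx])):
            oof[i] = p
        for i, p in enumerate(model.predict(X_test)):
            test_pred[i] += p / n_splits  # average the test predictions over folds
    return oof, test_pred


X = list(range(6))
y = [0, 0, 1, 1, 1, 1]
oof, test_pred = out_of_fold(MeanModel, X, y, X_test=[10, 11], n_splits=3)
print(oof)
print(test_pred)
```

Every training row receives exactly one prediction, made by a model that did not see that row, which is what makes the out-of-fold score an honest estimate.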

### Evaluation 

In [45]:
class RocAucEvaluation(Callback):
    def __init__(self, validation_data=(), interval=1):
        super().__init__()  # super(Callback, self) would skip Callback's own initializer

        self.interval = interval
        self.X_val, self.y_val = validation_data

    def on_epoch_end(self, epoch, logs={}):
        if epoch % self.interval == 0:
            y_pred = self.model.predict(self.X_val, verbose=0)
            score = roc_auc_score(self.y_val, y_pred)
            print("\n ROC-AUC - epoch: {:d} - score: {:.6f}".format(epoch+1, score))
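
For intuition about the score this callback prints, ROC AUC can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The plain-Python sketch below computes it by brute force and averages it over label columns, which is assumed to correspond to the default macro averaging of `roc_auc_score` for multi-label targets. The data and function names are made up.

```python
def roc_auc(y_true, y_score):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined when only one class is present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_roc_auc(Y_true, Y_score):
    """Average the per-column AUC over all label columns."""
    n_labels = len(Y_true[0])
    aucs = [
        roc_auc([row[j] for row in Y_true], [row[j] for row in Y_score])
        for j in range(n_labels)
    ]
    return sum(aucs) / n_labels


# Four comments, two labels each (toy data).
Y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
Y_score = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.6, 0.1]]

print(macro_roc_auc(Y_true, Y_score))  # 0.875
```

Because the metric compares rows of labels with rows of scores, both arrays must describe the same samples in the same order.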

In [46]:
x_train , x_val, y_train, y_val = train_test_split(xtrain, ytrain, train_size = 0.20, random_state = 2018)

x_train_final = np.array(x_train)
x_val_final = np.array(x_val)
y_train_final = np.array(y_train)
y_val_final = np.array(y_val)
x_test_final = np.array(xtest)


From version 0.21, test_size will always complement train_size unless both are specified.
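
The effect of `train_size = 0.20` on the split can be checked with a little arithmetic. The sketch below assumes that the training count is rounded down and that the remaining rows go to validation, which is consistent with the sample counts in the training log further down. The 25% hold-out is shown only as a hypothetical alternative, assuming the held-out count is rounded up.

```python
import math

n_samples = 159571   # row count reported by df_train.describe()
train_size = 0.20    # value passed to train_test_split in the cell above

n_train = math.floor(train_size * n_samples)
n_valid = n_samples - n_train
print(n_train, n_valid)  # 31914 127657

# Hypothetical alternative: hold out 25% for validation instead.
alt_valid = math.ceil(0.25 * n_samples)
alt_train = n_samples - alt_valid
print(alt_train, alt_valid)
```

With the values as written, four out of five labelled comments are used only for validation, so swapping the proportions would give the model considerably more data to learn from.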



### Callback list

In [47]:
filepath="weights_base.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')
early = EarlyStopping(monitor="val_acc", mode="max", patience = 3)
# The labels must come from the validation split so that they line up with x_val_final
ra_val = RocAucEvaluation(validation_data = (x_val_final, y_val_final), interval = 1)
callbacks_list = [ra_val, checkpoint, early]

### Fit the model

In [48]:
isCrossValidation = False

if isCrossValidation:
    model = Create_ensemble(n_splits = 2, train_epoch = 8, base_models = [create_model()])
    train_pred, test_pred = model.predict(xtrain, ytrain, xtest)
else:
    model = create_model()
    model.fit(x_train_final, y_train_final, validation_data = (x_val_final, y_val_final), 
              batch_size= 128, epochs = 8, callbacks = callbacks_list)
    
    test_pred = model.predict(x_test_final)

Train on 31914 samples, validate on 127657 samples
Epoch 1/8

### Submission

In [265]:
sample_submission = pd.read_csv('sample_submission.csv')
sample_submission[classes] = test_pred
sample_submission.to_csv('submission_increased_words.csv', index=False)