# Toxic Comment Classification

## Part2: GloVe + LSTM in Keras (in progress)
Global Vectors (GloVe)  and Long Short Term Memory (LSTM) in Keras

This notebook is the second part of Toxic comment classification project. The first part, Part1: Tfidf + Logistic Regression, is the other notebook in this repository.

In [175]:
# Import all necessary packages
import pandas as pd
import numpy as np
#import codecs, sys, os 
import matplotlib.pyplot as plt

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input,Embedding,Bidirectional,LSTM,GlobalMaxPool1D,Dense,Dropout,Activation
from keras.models import Model

In [5]:
# Load data sets
train = pd.read_csv('train.csv') #training set
test = pd.read_csv('test.csv') #test set

In [49]:
# Import pre-trained word vectors, GloVe 
## file is from https://nlp.stanford.edu/projects/glove/
word_vector_file = 'glove.6B.50d.txt'

## Word Embedding

In [14]:
train_comment_array = train['comment_text'].values
test_comment_array = test['comment_text'].values

In [88]:
max_features = 20000 # top words to be used (I might increase this number)
text_len = 100 # max number of words to be used in each comment
word_vec_dim =50 # dimension of Glove vector

In [24]:
tokenizer = Tokenizer(num_words = max_features)
tokenizer.fit_on_texts(list(train_comment_array)+list(test_comment_array)) #input: list of texts to train on

In [93]:
# Dictionary of word to index (index for this particular data)
word_to_index = tokenizer.word_index
len(word_to_index)

394787

In [94]:
list(word_to_index.items())[:10] 

[('the', 1),
 ('to', 2),
 ('of', 3),
 ('a', 4),
 ('and', 5),
 ('you', 6),
 ('i', 7),
 ('is', 8),
 ('that', 9),
 ('in', 10)]

In [101]:
min(list(word_to_index.values())) # index starts from 1 (not 0)

1

In [95]:
list(word_to_index.items())[1991:2000]

[('build', 1992),
 ('ps', 1993),
 ('worry', 1994),
 ('corrected', 1995),
 ('wife', 1996),
 ('benefit', 1997),
 ('remains', 1998),
 ('liberal', 1999),
 ('network', 2000)]

In [90]:
word_to_count = tokenizer.word_counts
list(word_to_count.items())[:10]

[('explanation', 3095),
 ('why', 31804),
 ('the', 917801),
 ('edits', 16189),
 ('made', 17181),
 ('under', 12228),
 ('my', 78385),
 ('username', 3172),
 ('hardcore', 320),
 ('metallica', 91)]

In [44]:
# List of texts to list of index sequences, one per text
train_comment_seq = tokenizer.texts_to_sequences(train_comment_array)
test_comment_seq = tokenizer.texts_to_sequences(test_comment_array)

In [45]:
# Truncate and pad zeros to make equal size comments
## Try padding='pre', truncating='pre' as well?
train_comment_seq_pad = pad_sequences(train_comment_seq, maxlen = text_len, padding='post', truncating='post' )
test_comment_seq_pad = pad_sequences(test_comment_seq, maxlen = text_len, padding='post', truncating='post' )

In [47]:
# Function that changes tuples to first, array of the rest 
## https://www.python-course.eu/python3_passing_arguments.php
def get_word_vec(word,*vec): 
    return word, np.asarray(vec, dtype='float32')

In [56]:
# Make a dictionary of word: GloVe_vector
## encoding="utf8" removed the error I got 
word_to_vec = dict(get_word_vec(*item.strip().split()) for item in open(word_vector_file, encoding="utf8"))

In [72]:
np.stack(word_to_vec.values()).mean(), np.stack(word_to_vec.values()).std()

(0.020940498, 0.6441043)

In [104]:
# Make embedding matrix
## Initialize embedding matrix as a numpy array of shape (max_features, word_vec_dim)
## (assuming max_feaures <= unique number of words in texts i.e., len(word_to_index))
## with random numbers with mean and std of word vectors for words not in pretrained word vectors 
## words will be ordered as in the word_to_index from out texts
embed_matrix = np.random.normal(np.stack(word_to_vec.values()).mean(), 
                                np.stack(word_to_vec.values()).std(),
                               (max_features, word_vec_dim))

for word, idx in word_to_index.items():
    if idx > max_features: # index starts from 1 (not 0) in word_to_index
        break
    vec = word_to_vec.get(word, None) # need get() to get values for the case word in not in keys
    if vec is not None:
        embed_matrix[idx-1] = vec

## LSTM

In [105]:
categories =['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

In [108]:
labels = train[categories].values
labels

array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]], dtype=int64)

### Model1

In [111]:
# bidirectional LSTM(50)+dropout(0.1) GlobalMaxPool1D Dense(50 & 6)
comment_sequences = Input(shape=(text_len,))
X = Embedding(max_features, word_vec_dim, weights=[embed_matrix])(comment_sequences)
X = Bidirectional(LSTM(50, return_sequences= True, dropout= 0.1, recurrent_dropout= 0.1))(X)
X = GlobalMaxPool1D()(X)
X = Dense(50, activation= "relu")(X)
X = Dropout(0.1)(X)
X = Dense(6, activation= "sigmoid")(X)
model = Model(inputs= comment_sequences, outputs= X)
model.compile(loss= 'binary_crossentropy', optimizer= 'adam', metrics= ['accuracy'])

In [113]:
%%time
model.fit(train_comment_seq_pad, labels, batch_size=32, epochs=2, validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
Wall time: 10min 55s


<keras.callbacks.History at 0x136939b2ba8>

In [117]:
# Make predictions for test set
prob_predictions = model.predict([test_comment_seq_pad], batch_size=1024, verbose=1)



In [136]:
# submission
submission = pd.read_csv('sample_submission.csv')
submission[categories]= prob_predictions #can enter multiple columns at once if columns are already there
submission.to_csv('submission_LSTM1.csv', index=False)

LB AUC: 0.9704 (worse than Tfidf + Logistic regression)

### Model2

In [148]:
%%time
# same as model1 except for trainable = False
comment_sequences = Input(shape=(text_len,))
X = Embedding(max_features, word_vec_dim, weights=[embed_matrix], trainable= False)(comment_sequences)
X = Bidirectional(LSTM(50, return_sequences= True, dropout= 0.1, recurrent_dropout= 0.1))(X)
X = GlobalMaxPool1D()(X)
X = Dense(50, activation= "relu")(X)
X = Dropout(0.1)(X)
X = Dense(6, activation= "sigmoid")(X)
model = Model(inputs= comment_sequences, outputs= X)
model.compile(loss= 'binary_crossentropy', optimizer= 'adam', metrics= ['accuracy'])

model.fit(train_comment_seq_pad, labels, batch_size=32, epochs=2, validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
Wall time: 8min 7s


In [149]:
# Make predictions for test set
prob_predictions = model.predict([test_comment_seq_pad], batch_size=1024, verbose=1)

# submission
submission = pd.read_csv('sample_submission.csv')
submission[categories]= prob_predictions #can enter multiple columns at once if columns are already there
submission.to_csv('submission_LSTM4.csv', index=False)



LB AUC: 0.9211 (much worse than the trainable embedding layer!)

### Model3 

In [145]:
%%time
# not bidirectional LSTM(128)+dropout(0.5) twice 
comment_sequences = Input(shape=(text_len,))
X = Embedding(max_features, word_vec_dim, weights=[embed_matrix])(comment_sequences)
X = LSTM(128, return_sequences= True)(X)
X = Dropout(rate=.5)(X)
X = LSTM(128, return_sequences = False)(X)
X = Dropout(rate=.5)(X)
X = Dense(6, activation= "sigmoid")(X)

model = Model(inputs= comment_sequences, outputs= X)
model.compile(loss= 'binary_crossentropy', optimizer= 'adam', metrics= ['accuracy'])

model.fit(train_comment_seq_pad, labels, batch_size=32, epochs=2, validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
Wall time: 17min 29s


In [146]:
# Make predictions for test set
prob_predictions = model.predict([test_comment_seq_pad], batch_size=1024, verbose=1)

# submission
submission = pd.read_csv('sample_submission.csv')
submission[categories]= prob_predictions #can enter multiple columns at once if columns are already there
submission.to_csv('submission_LSTM3.csv', index=False)



LB AUC: 0.9564 (worse than model1)

## Simple Ensemble: Logistic Regression with Tfidf & LSTM with GloVe

I will try a very simple ensemble model.

In [153]:
pred_LR = pd.read_csv('submission.csv') #AUC=.9745 (from Part1 notebook)
pred_LSTM = pd.read_csv('submission_LSTM1.csv') #AUC=.9704 (model1 in this notebook)

In [151]:
pred_Emsemble = (pred_LR[categories].values + pred_LR[categories].values)/2

submission = pd.read_csv('sample_submission.csv')
submission[categories]= pred_Emsemble #can enter multiple columns at once if columns are already there
submission.to_csv('submission_LR_LSTM.csv', index=False)

LB AUC: 0.9744 (not better than Logistic Regression with Tfidf only) 

It is possible that the two model predictions are high correlated and that's why this ensemble does not perform better than the logistic regression model. Let me check the correlations for each category.

In [161]:
# check correlation between the two model predictions
print('Correlation between two model predictions:') 
correl = []
for category in categories:
    corr = np.corrcoef(pred_LR[category], pred_LSTM[category])[0,1]
    correl.append(corr)
    print(category)
    print("%0.4f"%corr)

Correlation between two model predictions:
toxic
0.9149
severe_toxic
0.7793
obscene
0.9185
threat
0.2988
insult
0.8697
identity_hate
0.6699


The correlation coefficients between two model predictions are pretty high except for threat (less than .3). In particular, the correlations for each of toxic, obsecene, and insult are very high (over .85). This remids me of that I tuned the hyperparameters for the logistic regression model for each category. There I found less regularization is required for more unbalanced categories. The above correlations seem to correlated with the level of imbalances. 

In [165]:
# proportions of toxic comments for each category
positive_rate = labels.mean(axis=0)
positive_rate

array([0.09584448, 0.00999555, 0.05294822, 0.00299553, 0.04936361,
       0.00880486])

In [172]:
# rank order correlation between two model correlations and proportions of toxic comments
import scipy.stats as stats
stats.spearmanr(correl, positive_rate)

SpearmanrResult(correlation=0.942857142857143, pvalue=0.004804664723032055)

This high correlation could be from poor performances of LSTM for some categories. Thus, I will now check performances of the LSTM model separately for each category. Then, I will apply different levels of regularization or even different architectures for different categories to improve the overall performance.

First, I will fit the model for each category to see if this improves the performance.

In [197]:
submission = pd.read_csv('sample_submission.csv')

for category in categories:
    comment_sequences = Input(shape=(text_len,))
    X = Embedding(max_features, word_vec_dim, weights=[embed_matrix])(comment_sequences)
    X = Bidirectional(LSTM(50, return_sequences= True, dropout= 0.1, recurrent_dropout= 0.1))(X)
    X = GlobalMaxPool1D()(X)
    X = Dense(50, activation= "relu")(X)
    X = Dropout(0.1)(X)
    X = Dense(1, activation= "sigmoid")(X)
    model = Model(inputs= comment_sequences, outputs= X)
    model.compile(loss= 'binary_crossentropy', optimizer= 'adam', metrics= ['accuracy'])  
    print("### Fitting for {} ###".format(category))
    model.fit(train_comment_seq_pad, train[category], batch_size=32, epochs=2, validation_split=0.1)
    submission[category] = model.predict([test_comment_seq_pad], batch_size=1024, verbose=1)
    
submission.to_csv('submission_LSTM5.csv', index=False)

### Fitting for toxic ###
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
### Fitting for severe_toxic ###
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
### Fitting for obscene ###
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
### Fitting for threat ###
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
### Fitting for insult ###
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2
### Fitting for identity_hate ###
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


LB AUC: .9755

This is better than all of the previous models I tried including LSTM model1 above and logistic regression model in Part1.

In [182]:
# check if auc for the last category identity hate is also that high
from sklearn import metrics
pred = model.predict([train_comment_seq_pad], batch_size=1024, verbose=1) #for identity hate
y = train['identity_hate']
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=1)
metrics.auc(fpr, tpr)
# Yes, AUC is very high just like accuracy for identity hate

0.9938879274682488

It seems the Glove+LSMT model is especially better than Tfidf+LR model for more severely unbalanced categories (accuracy > .99). Thus, I will use the predictions by Tfidf+LR for toxic, obscene, and insult categories and predictions by GloVe+LSMT for severe toxic, threat, and identity hate categories and see if this combined prediction improves AUC. 

In [198]:
pred_LR = pd.read_csv('submission.csv') #AUC=.9745 (from Part1 notebook)
pred_LSTM = pd.read_csv('submission_LSTM5.csv') #AUC=.9755 

submission = pd.read_csv('sample_submission.csv')

submission['toxic']= pred_LR['toxic']
submission['severe_toxic']= pred_LSTM['severe_toxic']
submission['obscene']= pred_LR['obscene']
submission['threat']= pred_LSTM['threat']
submission['insult']= pred_LR['insult']
submission['identity_hate']= pred_LSTM['identity_hate']

submission.to_csv('submission_LR_LSTM_3.csv', index=False)

LB AUC: 0.9751 (not better than LSTM only)

I will try the simple ensemble model I tried above since the LSTM model was improved by fitting each category.

In [200]:
pred_Emsemble = (pred_LR[categories].values + pred_LR[categories].values)/2

submission = pd.read_csv('sample_submission.csv')
submission[categories]= pred_Emsemble #can enter multiple columns at once if columns are already there
submission.to_csv('submission_LR_LSTM_5.csv', index=False)

LB AUC: 0.9744 (not better than LSTM only)

The categories severe toxic, threat, and identity hate are also those with lower correlations between the two models, so the simple ensemble model might work only for those categories.

In [199]:
submission = pd.read_csv('sample_submission.csv')

submission['toxic']= pred_LR['toxic']
submission['severe_toxic']= (pred_LR['severe_toxic']+pred_LSTM['severe_toxic'])/2
submission['obscene']= pred_LR['obscene']
submission['threat']= (pred_LR['threat']+pred_LSTM['threat'])/2
submission['insult']= pred_LR['insult']
submission['identity_hate']= (pred_LR['identity_hate']+pred_LSTM['identity_hate'])/2

submission.to_csv('submission_LR_LSTM_4.csv', index=False)

LB AUC: 0.9770 

Yes! This is the best AUC I've ever got.

## Summary so far

- Made predictions using GloVe word embedding + LSTM in Keras. 
- Used multi-task learning for multiple lables (1 by 6 vector label) as mentioned in the future directions of Part1
- LSTM models with multi-task learning was worse than the Tfidf + Logistic regression model
- Fitting each category for the LSTM model was much slower, but made better predictions than the logistic regression
- Found a simple ensemble method (the last one) can increase AUC even further