# This is the first solution to the Toxic Comments challenge as part of my Machine Learning Capstone project.  


Steps that we will go through are as follows:

1. Import and explore the data
2. Process the data into a format that we can train a model with
3. Train a model
4. Use our model to make predictions on our test sets (we have two: one is 20% of our training data, the other is the test set provided by the Kaggle competition)
5. View our accuracy, precision, recall and f1 scores
6. Submit our test set predictions to the Kaggle competition to retrieve our mean column-wise ROC AUC score
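Step 5 refers to accuracy, precision, recall and F1. As a reminder of how these are defined for a single binary label, here is a minimal pure-Python sketch. The function name `binary_scores` and the toy labels are illustrative assumptions, not part of the notebook's pipeline.

```python
def binary_scores(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) for lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1


# toy data, made up for illustration
toy_true = [1, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_scores(toy_true, toy_pred)
print(accuracy, precision, recall, f1)
```

For the six-label problem here, the same function could be applied to each label column separately.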


In [1]:
# 1. Data importation and exploration

import pandas as pd

test_data = pd.read_csv('test.csv') # this is our testing set without labels (provided by kaggle in csv format)
train_data = pd.read_csv('train.csv')  # this is our training set, with labels (provided by kaggle in csv format)

In [2]:
# 2. Process the data into a format that we can train a model with

# we will separate our training data into features and labels

features = train_data['comment_text']
label_columns = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
labels = train_data[label_columns]
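Step 4 of the outline mentions holding out 20% of the training data. One way to pick such a split is to shuffle the row indices and cut off the first 20%. The sketch below uses only the standard library; `holdout_indices` is a hypothetical helper, not code this notebook runs.

```python
import random


def holdout_indices(n_rows, holdout_fraction=0.2, seed=42):
    """Return (train_indices, holdout_indices) after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_holdout = int(n_rows * holdout_fraction)
    return indices[n_holdout:], indices[:n_holdout]


# 159571 is the number of training rows reported further down in the notebook
train_idx, holdout_idx = holdout_indices(159571)
print(len(train_idx), len(holdout_idx))
```

The resulting index lists could then be used with `features.iloc[train_idx]` and `labels.iloc[train_idx]`.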

In [3]:
# we will store lengths of our arrays

xtrain_size = len(features)
ytrain_size = len(labels)
xtest_size = len(test_data)

print (xtrain_size, ytrain_size, xtest_size)

print (test_data.head())

159571 159571 153164
                 id                                       comment_text
0  00001cee341fdb12  Yo bitch Ja Rule is more succesful then you'll...
1  0000247867823ef7  == From RfC == \n\n The title is fine as it is...
2  00013b17ad220c46  " \n\n == Sources == \n\n * Zawe Ashton on Lap...
3  00017563c3f7919a  :If you have a look back at the source, the in...
4  00017695ad8997eb          I don't anonymously edit articles at all.


In [4]:
# CountVectorizer will give us counts of how many times each word occurs in each comment
# (the dictionary is built from all available words in our training set)
# its default tokenizer also lowercases the text and ignores punctuation and extra spacing

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words='english', max_df = 0.32, min_df=3) 

# stop_words='english' removes common words (and, or, if, etc.), max_df=0.32 disregards words that occur in more than 32 percent of comments,
# min_df=3 is the minimum number of comments a word must occur in to be considered a feature

xtrain_trans = count_vect.fit_transform(features) # fitting the x train data to the countvectorizer method


xtest_trans = count_vect.transform(test_data['comment_text']) # transform the x test data using the fitted countvectorizer


num_words = len(count_vect.vocabulary_) # number of words used as features
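To make the effect of `min_df` and `max_df` concrete, here is a simplified pure-Python sketch of document-frequency filtering. It is an approximation of the idea, not scikit-learn's implementation: it splits on whitespace only, and `filter_vocabulary` and the toy comments are made up for illustration.

```python
from collections import Counter


def filter_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep words whose document frequency lies in [min_df, max_df * n_docs].

    min_df is an absolute number of documents, max_df a fraction of documents.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        # a word counts at most once per document, however often it repeats
        doc_freq.update(set(doc.lower().split()))
    return sorted(word for word, df in doc_freq.items()
                  if min_df <= df <= max_df * n_docs)


toy_docs = ["you are great", "you are rude", "you rock", "rude rude comment"]
toy_vocab = filter_vocabulary(toy_docs, min_df=2, max_df=0.5)
print(toy_vocab)
```

In this toy case "you" appears in 3 of 4 comments and is dropped by `max_df=0.5`, while words seen in only one comment are dropped by `min_df=2`.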

In [5]:
print ('Number of words in dictionary: ', format(num_words))
print ('X_train size: ', format(xtrain_size))
print ('test_data_size: ', format(xtest_size))

Number of words in dictionary:  52439
X_train size:  159571
test_data_size:  153164


In [6]:
import numpy as np
np.random.seed(42) # we will set the random seed number so that results are replicable

from keras.layers import Dropout, Dense, Activation
from keras.models import Sequential

model = Sequential()

model.add(Dense(512, input_shape = (num_words,), activation = 'relu')) # 512 is the largest this first layer can be without a memory error

model.add(Dropout(0.25)) # dropout randomly zeroes 25% of the activations during training to help prevent overfitting

model.add(Dense(384 ,activation = 'relu')) # we gradually narrow the layers as the network gets deeper
          
model.add(Dropout(0.25))
          
model.add(Dense(256, activation = 'relu'))

model.add(Dropout(0.25))
          
model.add(Dense(128, activation = 'relu'))
          
model.add(Dense(6)) # one output per label

model.add(Activation('softmax')) # returns probabilities over our 6 labels (note: softmax forces them to sum to 1)

model.summary()


Using TensorFlow backend.


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 512)               26849280  
_________________________________________________________________
dropout_1 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 384)               196992    
_________________________________________________________________
dropout_2 (Dropout)          (None, 384)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               98560     
_________________________________________________________________
dropout_3 (Dropout)          (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               32896     
__________
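The parameter counts in the summary follow from the fact that a fully connected layer has one weight per input-output pair plus one bias per output. The short check below reproduces them; the count for the final 6-unit layer is computed from the same formula rather than read from the output above, and `dense_params` is a name chosen for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# input width (dictionary size) followed by the width of each Dense layer
layer_sizes = [52439, 512, 384, 256, 128, 6]
param_counts = [dense_params(a, b)
                for a, b in zip(layer_sizes, layer_sizes[1:])]
print(param_counts)
print(sum(param_counts))
```

Almost all of the parameters sit in the first layer, which is why its width is the limiting factor for memory.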

In [7]:
# CountVectorizer() returns a sparse matrix; converting all of it to dense at once would not fit in memory, so we use a batch generator.
# The batch generator takes one batch of data at a time, converts it to dense and feeds it into the neural network model.
# This avoids running out of memory.

def batch_generator(X, y, batch_size, steps):
    counter=0
    while True:
        batchx = X[counter*batch_size : (counter+1)*batch_size].todense() # convert our feature set to a dense matrix
        batchy = y[counter*batch_size : (counter+1)*batch_size]
        yield (batchx, batchy)

        if counter == (steps - 2):
            batchx = X[(counter+1)*batch_size : ].todense()
            batchy = y[(counter+1)*batch_size : ]
            counter = 0
            yield (batchx, batchy)
        else:
            counter = counter + 1
            
            
#        shuffle_index = np.arange(np.shape(y)[0])
#        np.random.shuffle(shuffle_index)
#        X =  X[shuffle_index, :]
#        y =  y[shuffle_index, :]
#        while True:
#            index_batch = shuffle_index[batch_size*counter:batch_size*(counter+1)]
#            X_batch = X[index_batch,:].todense()
#            y_batch = y[index_batch]
#            yield (np.array(X_batch), y_batch)
#            counter += 1
#            if (counter >= number_of_batches):
#                np.random.shuffle(shuffle_index)
#                counter=0
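The same batching pattern can be written more compactly with a `for` loop over step numbers, relying on the fact that Python slicing clips at the end of the data. This is a sketch over plain lists so it runs without any dependencies; `simple_batches` is a hypothetical name, and for a sparse matrix the feature slice would still need converting to dense before being yielded.

```python
import math
from itertools import islice


def simple_batches(X, y, batch_size):
    """Yield (X_batch, y_batch) slices forever, cycling through the data."""
    n_steps = math.ceil(len(X) / batch_size)
    while True:
        for step in range(n_steps):
            start = step * batch_size
            end = start + batch_size  # slicing clips at len(X) for the last batch
            yield X[start:end], y[start:end]


toy_X = [10, 11, 12, 13, 14]
toy_y = [0, 1, 0, 1, 1]
# take 4 batches: one full pass (3 batches) plus the first batch of the next pass
first_four = list(islice(simple_batches(toy_X, toy_y, batch_size=2), 4))
print(first_four)
```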

In [8]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
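A comment can carry several of the six labels at once, so it is worth seeing how softmax behaves in that situation. The sketch below compares softmax with an independent sigmoid per label on made-up logits. It is an illustration only, not what this notebook trained; in Keras the alternative would roughly correspond to a `'sigmoid'` activation with `'binary_crossentropy'` loss.

```python
import math


def softmax(logits):
    """Probabilities that compete with each other and sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Independent probability for a single label."""
    return 1.0 / (1.0 + math.exp(-z))


# made-up logits for a comment where labels 0, 2 and 4 all apply
toy_logits = [3.0, -2.0, 2.5, -4.0, 2.0, -3.0]
soft = softmax(toy_logits)
sig = [sigmoid(z) for z in toy_logits]
print([round(p, 3) for p in soft])
print([round(p, 3) for p in sig])
```

With softmax the three applicable labels have to share a total probability of 1, whereas the sigmoid outputs can each be high at the same time.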

In [9]:
#model.fit(xtrain_trans, labels, batch_size=100, epochs=2, verbose=1, validation_split=0.2)

batch_size = 1800  # this is the maximum batch size I can use with this model without running out of memory
# batch_size and the input layer dimensions are the limiting factors for memory

nb_epoch = 100

steps_per_epoch = int(xtrain_size/batch_size)+1

# will use batch_generator to generate batches to train the model
model.fit_generator(generator=batch_generator(xtrain_trans, labels, batch_size, steps_per_epoch), 
                    epochs=nb_epoch, steps_per_epoch=steps_per_epoch)  


Epoch 1/100
...
Epoch 100/100


<keras.callbacks.History at 0x1fcdb2314e0>
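The number of steps per epoch is the number of batches needed to cover every row once. The check below uses the row counts printed earlier and `math.ceil`, which gives the same answer as `int(n/batch_size)+1` here but avoids an extra, empty batch when the row count is an exact multiple of the batch size. `steps_for` is a name chosen for this sketch.

```python
import math


def steps_for(n_rows, batch_size):
    """Number of batches needed to see every row exactly once."""
    return math.ceil(n_rows / batch_size)


train_steps = steps_for(159571, 1800)
test_steps = steps_for(153164, 1800)
# rows left over for the final, smaller training batch
last_train_batch = 159571 - (train_steps - 1) * 1800
print(train_steps, test_steps, last_train_batch)
```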

In [10]:

# we will need to create a second batch generator to predict on our test samples

def test_generator(X, batch_size, steps):
    counter=0
    while True:
        batchx = X[counter*batch_size : (counter+1)*batch_size].todense()
        yield (batchx)

        if counter == (steps - 2):
            batchx = X[(counter+1)*batch_size : ].todense()
            counter = 0
            yield (batchx)
        else:
            counter = counter + 1

#generate predictions

test_results = model.predict_generator(generator = test_generator(xtest_trans, batch_size, (int(xtest_size/batch_size)+1)),
                        steps = (int(xtest_size/batch_size)+1), workers=1, verbose=1)



In [11]:
# create submission file for Kaggle

submission = pd.DataFrame(data=(test_results), index=test_data['id'],
                          columns=['toxic','severe_toxic','obscene','threat','insult','identity_hate'])


submission.to_csv('cnn.csv', index=True)

With our initial neural network model submission to Kaggle, we received a ROC AUC score of 0.7324, slightly lower than our best effort using Naive Bayes.
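For reference, the competition metric averages the ROC AUC over the six label columns. The AUC for one column can be read as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below computes it that way in pure Python on made-up data. It is quadratic in the number of rows, so it is only meant to illustrate the definition, and both function names are assumptions of this sketch.

```python
def roc_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def mean_columnwise_auc(true_columns, score_columns):
    """Average the per-label AUC over all label columns."""
    aucs = [roc_auc(t, s) for t, s in zip(true_columns, score_columns)]
    return sum(aucs) / len(aucs)


# two toy label columns with four rows each (made-up values)
toy_true = [[0, 0, 1, 1], [1, 0, 1, 0]]
toy_scores = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.6, 0.3]]
toy_auc = mean_columnwise_auc(toy_true, toy_scores)
print(toy_auc)
```

Because the metric depends only on how examples are ranked within each column, it rewards well-ordered scores rather than a particular decision threshold.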