Importing standard libraries: numpy, pandas and matplotlib

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

Importing the Keras libraries

In [2]:
import keras
from keras.models import Model
from keras.preprocessing.text import Tokenizer

Using TensorFlow backend.


In [20]:
from keras.models import Model
from keras.layers import Input, Embedding, Dense, Conv2D, MaxPool2D
from keras.layers import Reshape, Flatten, Concatenate, Dropout, SpatialDropout1D
from keras.preprocessing import text, sequence
from keras.preprocessing.sequence import pad_sequences

Loading the train dataset

In [4]:
dataset_train = pd.read_csv('train.csv')

Loading the test dataset

In [5]:
dataset_test = pd.read_csv('test.csv')

Listing the six label columns (the different classes of comments)

In [6]:
all_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


Splitting the data into X_train, y_train and X_test.

X_train comprises only the comment text from the training data set.

In [7]:
X_train = dataset_train["comment_text"]

y_train comprises the toxic, severe_toxic, obscene, threat, insult and identity_hate values for all the comments in the training data set.

In [8]:
y_train = dataset_train[all_classes].values

X_test comprises only the comment text from the test data set.

In [9]:
X_test = dataset_test["comment_text"]

Setting the number of features (unique words) to 50000, the maximum comment length to 150 tokens and the embedding size to 128. The Keras Tokenizer then breaks the sentences down into individual words and assigns an integer id to each of them.

In [23]:
max_features = 50000
maxlen = 150
embed_size = 128

In [16]:
tokenizer = text.Tokenizer(num_words=max_features)

In [17]:
tokenizer.fit_on_texts(list(X_train) + list(X_test))

In [18]:
list_tokenized_train = tokenizer.texts_to_sequences(X_train)
list_tokenized_test = tokenizer.texts_to_sequences(X_test)
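To make the word-to-id step concrete, here is a minimal pure-Python sketch of the idea. It is a simplified illustration and not the Keras implementation: the helper names fit_word_index and texts_to_ids and the toy sentences are hypothetical, and punctuation filtering is left out. It assumes, as Keras does, that ids are assigned by descending word frequency starting from 1 and that only words with an id below num_words are kept.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets id 1 (0 is left for padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    # most_common() sorts by count, descending; ties keep first-seen order.
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map each text to a list of ids, skipping unknown words and ids >= num_words."""
    return [
        [word_index[w] for w in t.lower().split()
         if w in word_index and word_index[w] < num_words]
        for t in texts
    ]

# Hypothetical toy corpus, not taken from train.csv.
toy_texts = ["you are great", "you are rude", "great article"]
word_index = fit_word_index(toy_texts)
sequences = texts_to_ids(toy_texts, word_index, num_words=4)

print(word_index)  # {'you': 1, 'are': 2, 'great': 3, 'rude': 4, 'article': 5}
print(sequences)   # [[1, 2, 3], [1, 2], [3]]
```

With num_words=4 the rarer words "rude" and "article" are dropped from the sequences, which is the effect max_features has on the real vocabulary.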

Padding (or truncating) every tokenized sentence to a fixed length of 150 tokens.

In [21]:
pad_train = pad_sequences(list_tokenized_train, maxlen=maxlen)
pad_test = pad_sequences(list_tokenized_test, maxlen=maxlen)
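The sketch below imitates what pad_sequences does with its default settings, which pad with zeros at the front and truncate from the front. The helper pad_pre and the toy sequences are hypothetical and only meant to show the resulting layout.

```python
import numpy as np

def pad_pre(seqs, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` ids of longer sequences."""
    out = np.zeros((len(seqs), maxlen), dtype="int32")
    for row, seq in enumerate(seqs):
        trimmed = seq[-maxlen:]           # drop ids from the front if too long
        if trimmed:                       # leave empty sequences as all zeros
            out[row, -len(trimmed):] = trimmed
    return out

# Hypothetical toy sequences: one shorter and one longer than maxlen.
padded = pad_pre([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(padded)
# [[0 0 5 2]
#  [1 4 9 3]]
```

Every row now has the same length, so the whole data set can be fed to the network as a single 2D array.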

Input layer for the neural network

In [22]:
input = Input(shape=(maxlen, ))

Passing the input to the embedding layer, which projects each word id into a 128-dimensional vector space.

In [24]:
x = Embedding(max_features, embed_size)(input)

SpatialDropout1D performs the same function as Dropout; however, it drops entire 1D feature maps (whole embedding channels across all time steps) instead of individual elements.

In [25]:
x = SpatialDropout1D(0.4)(x)
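The difference from ordinary dropout is easiest to see in the shape of the random mask. The following NumPy sketch is an illustration under assumptions, not the Keras code: the function spatial_dropout_1d is hypothetical, and it assumes a (batch, timesteps, channels) input with one keep/drop decision per sample and channel, rescaled by 1 / (1 - rate) as in standard inverted dropout.

```python
import numpy as np

def spatial_dropout_1d(x, rate, rng):
    """Drop whole channels: one mask value per (sample, channel), shared by all timesteps."""
    batch, _, channels = x.shape
    keep = rng.random((batch, 1, channels)) >= rate   # broadcasts over the time axis
    return x * keep / (1.0 - rate)                    # rescale the surviving channels

rng = np.random.default_rng(0)
x = np.ones((2, 5, 8))                                # toy (batch, timesteps, channels)
dropped = spatial_dropout_1d(x, rate=0.4, rng=rng)

# Each channel is either zero at every timestep or kept at every timestep.
print(dropped[0])
```

Because neighbouring positions in an embedded sentence are strongly correlated, dropping a whole channel removes information that element-wise dropout would leave recoverable from adjacent time steps.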

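Reshaping the embeddings to add a channel dimension, since Conv2D expects input of shape (height, width, channels).

In [26]:
x = Reshape((maxlen, embed_size, 1))(x)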

Adding four parallel convolutional layers, each with 32 filters and a 3x3 kernel

In [27]:
conv_0 = Conv2D(32, kernel_size=(3,3), kernel_initializer='normal', activation='relu')(x)
conv_1 = Conv2D(32, kernel_size=(3,3), kernel_initializer='normal', activation='relu')(x)
conv_2 = Conv2D(32, kernel_size=(3,3), kernel_initializer='normal', activation='relu')(x)
conv_3 = Conv2D(32, kernel_size=(3,3), kernel_initializer='normal', activation='relu')(x)

Adding the maxpooling layers

In [28]:
maxpool_0 = MaxPool2D(pool_size=(2,2))(conv_0)
maxpool_1 = MaxPool2D(pool_size=(2,2))(conv_1)
maxpool_2 = MaxPool2D(pool_size=(2,2))(conv_2)
maxpool_3 = MaxPool2D(pool_size=(2,2))(conv_3)

Concatenating the maxpooling layers

In [29]:
concat = Concatenate(axis=1)([maxpool_0, maxpool_1, maxpool_2, maxpool_3])

Adding the flattening layer

In [31]:
flat = Flatten()(concat)
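The tensor shapes produced by the layers above can be checked with a little arithmetic. This sketch assumes the Keras defaults that the notebook relies on (Conv2D with 'valid' padding and stride 1, MaxPool2D with stride equal to the pool size); the helper functions are hypothetical and only track shapes, they do not run the network.

```python
def conv2d_valid(shape, kernel, filters):
    """Output shape of a stride-1 convolution without padding."""
    h, w, _ = shape
    return (h - kernel[0] + 1, w - kernel[1] + 1, filters)

def maxpool2d(shape, pool):
    """Output shape of non-overlapping max pooling (sizes are floored)."""
    h, w, c = shape
    return (h // pool[0], w // pool[1], c)

maxlen, embed_size = 150, 128
reshaped = (maxlen, embed_size, 1)                  # after Reshape
conv = conv2d_valid(reshaped, (3, 3), filters=32)   # each of the four branches
pooled = maxpool2d(conv, (2, 2))
concat_shape = (4 * pooled[0], pooled[1], pooled[2])  # Concatenate(axis=1) of 4 branches
flat_units = concat_shape[0] * concat_shape[1] * concat_shape[2]

print(conv)          # (148, 126, 32)
print(pooled)        # (74, 63, 32)
print(concat_shape)  # (296, 63, 32)
print(flat_units)    # 596736
```

The flattened vector has 596736 entries, so the final 6-node dense layer alone holds 596736 * 6 + 6 = 3580422 weights.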

Dropout layer to drop 10% of the nodes

In [32]:
d = Dropout(0.1)(flat)

Finally, we feed the output into a sigmoid layer. It has 6 nodes because there are 6 classes, and each node gives an independent probability, so a comment can belong to several classes at once.

In [33]:
output = Dense(6, activation="sigmoid")(d)

Initializing and compiling the model

In [35]:
classifier = Model(inputs=input, outputs=output)

In [36]:
classifier.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
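Binary cross-entropy treats each of the 6 outputs as its own yes/no problem and averages the per-label losses. The NumPy sketch below shows the formula on hand-made numbers; the function name, the clipping constant eps and the toy predictions are assumptions for illustration and are not taken from the notebook's training run.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean over labels of -[y*log(p) + (1-y)*log(1-p)], one value per sample."""
    y_pred = np.clip(y_pred, eps, 1 - eps)   # avoid log(0)
    per_label = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_label.mean(axis=-1)

# One hypothetical comment labelled toxic and obscene.
y_true = np.array([[1, 0, 1, 0, 0, 0]], dtype=float)
y_good = np.array([[0.9, 0.1, 0.8, 0.1, 0.1, 0.1]])   # mostly right
y_bad = np.array([[0.1, 0.9, 0.2, 0.9, 0.9, 0.9]])    # mostly wrong

loss_good = binary_crossentropy(y_true, y_good)[0]
loss_bad = binary_crossentropy(y_true, y_bad)[0]
print(loss_good, loss_bad)   # the second value is much larger
```

Confident wrong predictions are penalised heavily, which is what pushes each sigmoid output towards the correct 0 or 1 for its own label.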

Fitting the classifier model

In [38]:
classifier.fit(pad_train, y_train, batch_size=32, epochs=2, validation_split=0.1)

Train on 143613 samples, validate on 15958 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x23f1d759860>

Predicting the test set

In [41]:
y_test = classifier.predict(pad_test)

In [42]:
y_test[0]

array([0.9996816 , 0.21816543, 0.87973875, 0.32872525, 0.4746668 ,
       0.2818759 ], dtype=float32)
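Each of the six numbers is the predicted probability for one class, in the order of all_classes. If hard labels are wanted, the probabilities can be thresholded; the cut-off of 0.5 used below is an assumption for illustration, and the submission file built in the following cells keeps the raw probabilities.

```python
import numpy as np

all_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Prediction for the first test comment, copied from the output above.
probs = np.array([0.9996816, 0.21816543, 0.87973875, 0.32872525, 0.4746668, 0.2818759])

threshold = 0.5  # assumed cut-off, not part of the original notebook
flagged = [name for name, p in zip(all_classes, probs) if p >= threshold]
print(flagged)   # ['toxic', 'obscene']
```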

Writing the predicted probabilities into the sample submission file

In [44]:
cnn_final = pd.read_csv('sample_submission.csv')
cnn_final[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = y_test


In [46]:
cnn_final.head()

   id                toxic     severe_toxic  obscene   threat    insult    identity_hate
0  00001cee341fdb12  0.999682  0.218165      0.879739  0.328725  0.474667  0.281876
1  0000247867823ef7  0.035647  0.018604      0.03555   0.004679  0.021505  0.008496
2  00013b17ad220c46  0.149375  0.048432      0.077565  0.034539  0.096834  0.035278
3  00017563c3f7919a  0.019845  0.017095      0.02236   0.015416  0.027863  0.002433
4  00017695ad8997eb  0.073227  0.02054       0.032292  0.0154    0.042665  0.007459


In [45]:
cnn_final.to_csv('final_text_CNN.csv', index=False)