**Perform the necessary imports**

In [1]:
import numpy as np
import pandas as pd
from keras.preprocessing import text, sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


**Necessary global variables**

In [2]:
list_of_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
max_features = 20000
max_text_length = 400
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
batch_size = 32
epochs = 2

**Quick peek into the data**

In [3]:
train_df = pd.read_csv('../input/train.csv')
print(train_df.head())

         id                                       comment_text  toxic  \
0  22256635  Nonsense?  kiss off, geek. what I said is true...      1   
1  27450690  "\n\n Please do not vandalize pages, as you di...      0   
2  54037174  "\n\n ""Points of interest"" \n\nI removed the...      0   
3  77493077  Asking some his nationality is a Racial offenc...      0   
4  79357270  The reader here is not going by my say so for ...      0   

   severe_toxic  obscene  threat  insult  identity_hate  
0             0        0       0       0              0  
1             0        0       0       0              0  
2             0        0       0       0              0  
3             0        0       0       0              0  
4             0        0       0       0              0  


**Printing the first comment using 'iloc', just for fun (position -7 and position 1 are the same column here)**

In [4]:
print(train_df.iloc[0, -7])
print(train_df.iloc[0, 1])

Nonsense?  kiss off, geek. what I said is true.  I'll have your account terminated.
Nonsense?  kiss off, geek. what I said is true.  I'll have your account terminated.
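
Both prints show the same comment because a negative position counts back from the end: the frame has 8 columns, so position -7 lands on position 1. A minimal sketch using a plain Python list of the column names printed by `head()` above (the list is only an illustration, not how pandas stores columns):

```python
# Column names as printed by train_df.head() above
columns = ["id", "comment_text", "toxic", "severe_toxic",
           "obscene", "threat", "insult", "identity_hate"]

# A negative position counts back from the end of the sequence
negative_position = -7
equivalent_position = len(columns) + negative_position

print(equivalent_position)           # 1
print(columns[negative_position])    # comment_text
print(columns[equivalent_position])  # comment_text
```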


**Checking if NaNs exist in the training data**

In [5]:
print(np.where(pd.isnull(train_df)))

(array([], dtype=int64), array([], dtype=int64))


**Apparently no NaNs in the training set!**
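
The result of `np.where` on a boolean frame is a pair of parallel arrays, row positions and column positions, so two empty arrays mean nothing is missing. A small sketch on a made-up frame (`toy_df` and its contents are hypothetical, not taken from the competition files) shows what a non-empty result looks like:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing comment at row 1, column 1
toy_df = pd.DataFrame({
    "id": [10, 11, 12],
    "comment_text": ["first comment", np.nan, "third comment"],
})

# rows[i], cols[i] together give the position of the i-th missing cell
rows, cols = np.where(pd.isnull(toy_df))
print(rows, cols)

# Pair the positions with column names to make them easier to read
missing_cells = [(int(r), toy_df.columns[c]) for r, c in zip(rows, cols)]
print(missing_cells)
```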

**Converting pandas series to a numpy array using .values**

In [6]:
x = train_df['comment_text'].values
print(x)

[ "Nonsense?  kiss off, geek. what I said is true.  I'll have your account terminated."
 '"\n\n Please do not vandalize pages, as you did with this edit to W. S. Merwin. If you continue to do so, you will be blocked from editing.    "'
 '"\n\n ""Points of interest"" \n\nI removed the ""points of interest"" section you added because it seemed kind of spammy. I know you probably didn\'t mean to disobey the rules, but generally, a point of interest tends to be rather touristy, and quite irrelevant to an area culture. That\'s just my opinion, though.\n\nIf you want to reply, just put your reply here and add {{talkback|Jamiegraham08}} on my talkpage.   "'
 ...,
 'Mamoun Darkazanli\nFor some reason I am unable to fix the bold formatting on the Arabic name of Mamoun Darkazanli. The entire first paragraph is bolded and I just want to bold the script. Please take a look when you get the chance.'
 'Salafi would be a better term. It is more politically correct to use in Islam. Very few Muslims, e

In [7]:
print("properties of x")
print("type : {}, dimensions : {}, shape : {}, total no. of elements : {}, data type of each element: {}, size of each element {} bytes".format(type(x), x.ndim, x.shape, x.size, x.dtype, x.itemsize))

properties of x
type : <class 'numpy.ndarray'>, dimensions : 1, shape : (95851,), total no. of elements : 95851, data type of each element: object, size of each element 8 bytes


**Getting the labels**

In [8]:
y = train_df[list_of_classes].values
print(y)

[[1 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 ..., 
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]


In [9]:
print("properties of y")
print("type : {}, dimensions : {}, shape : {}, total no. of elements : {}, data type of each element: {}, size of each element {} bytes".format(type(y), y.ndim, y.shape, y.size, y.dtype, y.itemsize))

properties of y
type : <class 'numpy.ndarray'>, dimensions : 2, shape : (95851, 6), total no. of elements : 575106, data type of each element: int64, size of each element 8 bytes


**Keras makes our life easy. Use Tokenizer to get a list of sequences, then pad them to form a 2D numpy array**

In [10]:
x_tokenizer = text.Tokenizer(num_words=max_features)
print(x_tokenizer)
x_tokenizer.fit_on_texts(list(x))
print(x_tokenizer)
x_tokenized = x_tokenizer.texts_to_sequences(x)  # a list of lists of word indices (one sequence per comment), not a numpy array
# pad_sequences: transform a list of num_samples sequences (lists of scalars) into a 2D numpy array of shape (num_samples, maxlen)
x_train_val = sequence.pad_sequences(x_tokenized, maxlen=max_text_length)

<keras.preprocessing.text.Tokenizer object at 0x7f70241c6278>
<keras.preprocessing.text.Tokenizer object at 0x7f70241c6278>


In [11]:
print("properties of x_train_val")
print("type : {}, dimensions : {}, shape : {}, total no. of elements : {}, data type of each element: {}, size of each element {} bytes".format(type(x_train_val), x_train_val.ndim, x_train_val.shape, x_train_val.size, x_train_val.dtype, x_train_val.itemsize))

properties of x_train_val
type : <class 'numpy.ndarray'>, dimensions : 2, shape : (95851, 400), total no. of elements : 38340400, data type of each element: int32, size of each element 4 bytes
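
To make the two steps less of a black box, here is a simplified pure-Python sketch of what they do. It is not the Keras implementation: the `toy_*` helpers and the sample texts are hypothetical, punctuation filtering is skipped, and the padding mimics the Keras defaults (zeros added on the left, long sequences cut from the left).

```python
from collections import Counter

def toy_fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word (0 is kept for padding)."""
    counts = Counter(word for t in texts for word in t.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def toy_texts_to_sequences(texts, word_index, num_words):
    """Replace each word by its index, dropping unknown words and words ranked >= num_words."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

def toy_pad_sequences(sequences, maxlen):
    """Left-pad with zeros; keep only the last maxlen items of longer sequences."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded

toy_texts = ["you are nice", "you are not nice at all you know"]
word_index = toy_fit_word_index(toy_texts)
sequences = toy_texts_to_sequences(toy_texts, word_index, num_words=5)
padded = toy_pad_sequences(sequences, maxlen=4)
print(padded)  # [[0, 1, 2, 3], [2, 4, 3, 1]]
```

The short comment is padded on the left, and the long one loses its rarest words first (rank 5 and above) and then its earliest tokens.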


**90% of the data is used for training and the rest for validation**

In [12]:
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y, test_size=0.1, random_state=1)
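
The size of each part can be worked out by hand. A quick sketch, assuming the validation count is rounded up and everything else goes to training:

```python
import math

n_samples = 95851   # rows in x_train_val, from the shape printed above
test_size = 0.1

# Assumption: the held-out count is rounded up, the remainder is used for training
n_val = math.ceil(test_size * n_samples)
n_train = n_samples - n_val

print(n_train, n_val)  # 86265 9586
```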

**Start building the model**

In [13]:
print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=max_text_length))
model.add(Dropout(0.2))

# we add a Conv1D layer, which will learn `filters`
# word-group filters, each spanning kernel_size words:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto 6 output units (one per class) and squash them with a sigmoid:
model.add(Dense(6))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()

Build model...
Instructions for updating:
`NHWC` for data_format is deprecated, use `NWC` instead
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 400, 50)           1000000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
____________________________________________
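
The shapes and parameter counts in the summary can be checked with a little arithmetic. A short sketch using the global variables defined earlier (the count for the final 6-unit layer is computed here, it is not visible in the truncated summary above):

```python
max_features = 20000
max_text_length = 400
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
n_classes = 6

# Embedding: one vector of embedding_dims values per vocabulary index
embedding_params = max_features * embedding_dims

# Conv1D: each filter has kernel_size * embedding_dims weights plus one bias
conv_params = kernel_size * embedding_dims * filters + filters

# 'valid' padding with stride 1 shortens the sequence by kernel_size - 1
conv_output_length = max_text_length - kernel_size + 1

# Dense layers: inputs * units weights plus one bias per unit
dense_params = filters * hidden_dims + hidden_dims
output_params = hidden_dims * n_classes + n_classes

print(embedding_params)    # 1000000
print(conv_params)         # 37750
print(conv_output_length)  # 398
print(dense_params)        # 62750
print(output_params)       # 1506
```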

**Begin training**

In [14]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_data=(x_val, y_val))

Train on 86265 samples, validate on 9586 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7095399940>

**Good job! About 98% accuracy on the validation set. Since most labels are 0, accuracy is a generous metric here, so scope for improvement exists!**
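
A high accuracy is easy to reach when most label cells are 0. A rough illustration on made-up labels (the `toy_labels` array is hypothetical, and the sketch assumes accuracy is counted cell by cell over the six label columns):

```python
import numpy as np

# Hypothetical labels: 1000 comments x 6 classes, with only 30 positive cells
toy_labels = np.zeros((1000, 6), dtype=int)
toy_labels[:30, 0] = 1

# A "model" that never flags anything
all_zero_predictions = np.zeros_like(toy_labels)

# Fraction of label cells predicted correctly
elementwise_accuracy = (all_zero_predictions == toy_labels).mean()
print(elementwise_accuracy)
```

Even this do-nothing predictor gets 5970 of the 6000 cells right, so the accuracy figure is best compared against such a baseline.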

**Quick peek into the test set**

In [15]:
test_df = pd.read_csv('../input/test.csv')
print(test_df.head())

         id                                       comment_text
0   6044863  ==Orphaned non-free media (Image:41cD1jboEvL. ...
1   6102620  ::Kentuckiana is colloquial.  Even though the ...
2  14563293  Hello fellow Wikipedians,\nI have just modifie...
3  21086297  AKC Suspensions \nThe Morning Call - Feb 24, 2...
4  22982444                      == [WIKI_LINK: Talk:Celts] ==


**Checking if NaNs exist in the test data**

In [16]:
print(np.where(pd.isnull(test_df)))

(array([52300]), array([1]))


**Hmmm**

In [17]:
test_df.iloc[52300, 1]

nan

**Fill the NaN field**

In [18]:
x_test = test_df['comment_text'].fillna('comment_missing').values
print(x_test)

['==Orphaned non-free media (Image:41cD1jboEvL. SS500 .jpg)=='
 '::Kentuckiana is colloquial.  Even though the area is often referred to as this, it (in my opinion) has never held the encyclopedic precision of "Louisville metropolitian area", which has a specific U.S. Census definition.  Also, apparently Kentuckiana often refers to the local television viewing area, which isn\'t nearly contiguous with the official metro area.  As you indicate, Kentuckiana seems to be more of a slang or marketing phenomena than anything we could pin down in encyclopedic terms here.  That\'s why we see Wikipedia language like "the Louisville metropolitan area, sometimes referred to as Kentuckiana". That\'s my take on it. —   •'
 'Hello fellow Wikipedians,\nI have just modified  on [WIKI_LINK: Double Trouble (George Jones and Johnny Paycheck album)]. Please take a moment to review [EXTERNAL_LINK: my edit]. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit

**Tokenizing and padding the test data the same way as the training data, reusing the tokenizer fitted on the training comments**

In [19]:
x_test_tokenized = x_tokenizer.texts_to_sequences(x_test)
x_testing = sequence.pad_sequences(x_test_tokenized, maxlen=max_text_length)

**Time to predict!**

In [20]:
y_testing = model.predict(x_testing, verbose=1)



**Submit predictions!**

In [21]:
sample_submission = pd.read_csv("../input/sample_submission.csv")
sample_submission[list_of_classes] = y_testing
sample_submission.to_csv("toxic_comment_classification.csv", index=False)
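
Assigning a 2D array to a list of columns fills them in order, so column `i` of the predictions must correspond to class `i` in the list. A tiny sketch with made-up values (`toy_classes`, `toy_submission` and `toy_predictions` are hypothetical stand-ins for the real objects):

```python
import numpy as np
import pandas as pd

toy_classes = ["toxic", "insult"]  # shortened, hypothetical class list

# Placeholder frame shaped like a sample submission: id plus one column per class
toy_submission = pd.DataFrame({"id": [1, 2, 3], "toxic": 0.5, "insult": 0.5})

# One row per comment, one column per class, in the same order as toy_classes
toy_predictions = np.array([[0.9, 0.1],
                            [0.2, 0.7],
                            [0.0, 0.3]])

# Overwrite the placeholder columns with the predicted probabilities
toy_submission[toy_classes] = toy_predictions
print(toy_submission)
```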