### COMP3359 Final Project: Toxic Comment Classification


## Objective
There are enormous discussions happening in our social media everyday and just one toxic comment can sour the whole discussion. Many platforms are struggling to effectively keep the environment clean. And how to define toxic may partly depend on what the platform is. Like some legal adult websites may be ok with obscene, but most are not. 

Therefore, the artificial intelligence program will not only detect the toxicity of the comment, but also label the comment with its toxicity type if it is. 


## Overview 
[Start With a Simple Nutrual Network](#p1) 

[Experiment With Other Models](#p2) 

[Test Performance of One Model](#p3)

 
## Instruction

The first(start with a simple nutrual network) and second(experiment with other models) parts are both about data preprocessing, building model and trainning model, and the third part is to see the performance of one model. 

__For example, if you want to test the simplest model in case0.1.1 (model1)__

step1: run all the codes in `Before start`

step2: run first two parts of the codes in `Start with a simple nutrual network` (`hyperparameters` and `data preprocess`)

step3: run case0.1.1 code in third part(`build and train mulitple models`) of `Start with a simple nutrual network`

step4: go to `test performace of one model`, change the model variable to the model you want to test, like model1, and run codes in 'test performance of one model'

For all cases in `Start with a simple nutrual network`, step1 and step2 are the same, and run corresponding code in step3.

__Another example, if you want to test the model in second part, like case 1.1.2 (model_cnn7, we found it has the best performance)__

step1: run all the codes in `Before start` (same as before)

step2: run first part codes in `Experiment with other models` (`More consistent variables for other models`)

step3: run case 1.2.3 codes

step4: go to `test performace of one model`, change the model variable to the model you want to test, like model_cnn7, and run codes in 'test performance of one model' (same as before)

For all cases in `Experiment with other models`, step1 and step2 are the same, and run corresponding code in step3.

# Before Start

## Google Colab

In [None]:
# # this is optional (actually not necessary at all, who else will use google's laggy service)

# """ Prepare Notebook for Google Colab """
# # Mount Google Drive
# from google.colab import drive
# drive.mount('/content/drive')

# # Specify directory of course materials in Google Drive
# module_dir = '/content/drive/My Drive/3359proj/'

# # Add course material directory in Google Drive to system path, for importing .py files later
# # (Ref.: https://stackoverflow.com/questions/48905127/importing-py-files-in-google-colab)
# import sys
# sys.path.append(module_dir)

## Prepare prerequist library

In [92]:
!pip install keras
!pip install tensorflow==1.13.1

Note: you may need to restart the kernel to use updated packages.


## Global variables

In [2]:
vocab_size = 20000
epochs = 5
batch_size = 50

## Load Data

In [3]:
""" Load Data """

import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.text import Tokenizer, one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D, Conv1D, Dropout, Flatten, MaxPooling1D, Concatenate, GlobalMaxPooling1D
from keras.models import Model, Sequential, load_model
from keras import initializers, regularizers, constraints, optimizers, layers

# # this is for google colab( DONOT RUN IT if working locally)
# data_dir = os.path.join(module_dir, "input/")
# data_path_train = os.path.join(data_dir, "train.csv")
# data_path_test = os.path.join(data_dir, "test.csv")
# data_path_test_label = os.path.join(data_dir, "test_labels.csv")
# train = pd.read_csv(data_path_train)
# test = pd.read_csv(data_path_test)
# test_label = pd.read_csv(data_path_test_label)

train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
test_label = pd.read_csv('./input/test_labels.csv')

In [40]:
# Check if the data has the null input if so, do some data engineering
train.isnull().any(),test.isnull().any()

(id               False
 comment_text     False
 toxic            False
 severe_toxic     False
 obscene          False
 threat           False
 insult           False
 identity_hate    False
 dtype: bool, id              False
 comment_text    False
 dtype: bool)

In [None]:
test = test[test_label['toxic'] != -1]
test_label = test_label[test_label['toxic'] != -1]

sen_train = train["comment_text"]
sen_test = test["comment_text"]

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y_train = train[list_classes].values
y_test = test_label[list_classes].values


# # Analyse train and test data set
# unlabelled_in_train = train[(train['toxic']!=1) & (train['severe_toxic']!=1) & (train['obscene']!=1) & 
#                             (train['threat']!=1) & (train['insult']!=1) & (train['identity_hate']!=1)]
# print('Percentage of unlabelled comments in train set is ', len(unlabelled_in_train)/len(train)*100)

# unlabelled_in_test = test_label[(test_label['toxic']!=1) & (test_label['severe_toxic']!=1) & (test_label['obscene']!=1) & 
#                             (test_label['threat']!=1) & (test_label['insult']!=1) & (test_label['identity_hate']!=1)]
# print('Percentage of unlabelled comments in test set is ', len(unlabelled_in_test)/len(test_label)*100)

# print(train[list_classes].sum())
# print(test_label[list_classes].sum())

<a id='p1'></a>
# Start With a Simple Nutrual Network

This section implemented a model with embedding + 1-Max pooling layer + 1 dense layer.

Model Hyperparameters to test:
1. [Sentence Max Length: 100, 200, 400](#p11) 
2. [Word Embedding Dimension: 64, 128, 256](#p12) 

## Hyperparameters

In [42]:
# as usually DON'T RUN it in order to save your time
# visualize the length of sentences to help us choose max_length
# totalNumWords = [len(one_comment) for one_comment in tokenized_train]
# plt.hist(totalNumWords,bins = np.arange(0,410,10))#[0,50,100,150,200,250,300,350,400])#,450,500,550,600,650,700,750,800,850,900])
# plt.show()

In [43]:
max_length200 = 200
max_length400 = 400
max_length100 = 100
emd_size = 128
emd_size64 = 64
emd_size256 = 256

## Data Preprocessing

In [28]:
# Vocabulary size for tokenization
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(list(sen_train))
print("Total vocab size: ", len(list(tokenizer.word_index.items())))
print("Vocab in use: ", tokenizer.num_words)
tokenized_train = tokenizer.texts_to_sequences(sen_train)
tokenized_test = tokenizer.texts_to_sequences(sen_test)

Total vocab size:  210337
Vocab in use:  20000


## Build and Train Mulitple Models

To save time, run corresponding codes of the model you want to test in this part.

<a id='p11'></a>
### 1.  Sentence Max Length

case 0.1.1: max length = 200

In [45]:
x_train = pad_sequences(tokenized_train, maxlen=max_length200)
x_test = pad_sequences(tokenized_test, maxlen=max_length200)

model1 = Sequential()
model1.add(Embedding(vocab_size, emd_size, input_length=max_length200,))
model1.add(GlobalMaxPooling1D())
model1.add(Dense(6,activation='sigmoid'))
model1.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model1.summary())

Model: "sequential_16"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_16 (Embedding)     (None, 200, 128)          2560000   
_________________________________________________________________
global_max_pooling1d_14 (Glo (None, 128)               0         
_________________________________________________________________
dense_24 (Dense)             (None, 6)                 774       
Total params: 2,560,774
Trainable params: 2,560,774
Non-trainable params: 0
_________________________________________________________________
None


In [46]:
# Fit model
history = model1.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 0.1.2: max length = 100

In [47]:
x_train = pad_sequences(tokenized_train, maxlen=max_length100)
x_test = pad_sequences(tokenized_test, maxlen=max_length100)

model2 = Sequential()
model2.add(Embedding(vocab_size, emd_size, input_length=max_length100,))
model2.add(GlobalMaxPooling1D())
model2.add(Dense(6,activation='sigmoid'))
model2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
history = model2.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 0.1.3: max length = 400

In [48]:
x_train = pad_sequences(tokenized_train, maxlen=max_length400)
x_test = pad_sequences(tokenized_test, maxlen=max_length400)

model3 = Sequential()
model3.add(Embedding(vocab_size, emd_size, input_length=max_length400,))
model3.add(GlobalMaxPooling1D())
model3.add(Dense(6,activation='sigmoid'))
model3.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model3.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


When max_length = 200, val_accuracy = 0.9715, after 4 epochs.

When max_length = 100, val_accuracy = 0.9719, after 3 epoch.

When max_length = 400, val_accuracy = 0.9709, after 3 epochs.

Therefore, we consider max_length = 100 as the optimal max_length, and apply it to later experiments.

<a id='p12'></a>
### 2. Word Embedding Dimension

case 0.2.1: emd size = 64

In [54]:
x_train = pad_sequences(tokenized_train, maxlen=max_length100)
x_test = pad_sequences(tokenized_test, maxlen=max_length100)

model4 = Sequential()
model4.add(Embedding(vocab_size, emd_size64, input_length=max_length100,))
model4.add(GlobalMaxPooling1D())
model4.add(Dense(6,activation='sigmoid'))
model4.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model4.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 0.2.2: emd size = 256

In [55]:
x_train = pad_sequences(tokenized_train, maxlen=max_length100)
x_test = pad_sequences(tokenized_test, maxlen=max_length100)

model5 = Sequential()
model5.add(Embedding(vocab_size, emd_size256, input_length=max_length100,))
model5.add(GlobalMaxPooling1D())
model5.add(Dense(6,activation='sigmoid'))
model5.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model5.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


When emd_size = 64, the highest val_accuracy = 0.9706, after 5 epoch, as the loss and val_loss are still decreasing, it has the promise to get higher val_accuracy if there are more epochs.

When emd_size = 256, the highest val_accuracy = 0.9730, after 2 epochs. It's much higher than 0.9706, although the former has promise, we would think emd_size = 256 has better performance.

When emd_size = 128, the val_accuracy = 0.9719, after 3 epochs.

Therefore, we choose emd_size = 256 for further tests.

<br><br>

<a id='p2'></a>
# Experiment with other models

1. [TextCNN](#p21) 
2. [CNN+LSTM](#p22) 
3. [BiDirectional RNN(LSTM/GRU)](#p23)
4. [Attention](#p24)

## More Consistent variables for other models

In [7]:
#Preprocessing  is the same; Running the following when the padding is finished
import pandas as pd
import numpy as np

from keras import layers
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer, one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, Conv1D, LSTM, GlobalMaxPooling1D, Dense, Dropout, Flatten, MaxPooling1D, Input, Concatenate,Conv2D, MaxPool2D, Reshape, CuDNNLSTM
from keras.models import load_model


""" common variables for all models"""

emd_size = 256 # optimal get from previous nutrual network
max_length = 100 # optiaml get from previous nutrual network
hidden_dims = 64

x_train = pad_sequences(tokenized_train, maxlen=max_length)
x_test = pad_sequences(tokenized_test, maxlen=max_length)

<a id='p21'></a>
## 1.TextCNN
According to the paper Convolutional Neural Networks for Sentence Classification by Yoon Kim, we can use word vector and use CNN as we do to images. The paper says CNN has excellent performance on sentence-level classification tasks with multiple benchmarks. So let's first give it a shoot and we will do some analysis once we have got the result.

Model Hyperparameters to test:

1. [Filter Size](#s11) 
2. [Density Layer](#s12)
3. [Filter number](#s13)

case 1.0: filter size = 3, has internal density layer, filter number = 250

In [57]:
#CNN model
#try to choose the same parameter as before so that we can compare the result with regards to the model choice
#Parameters without specific comment are subject to fine-tuning.
# the reason why I set them to numerical value is that I don't want to re-run to load the parameter value specified before
filters = 250
kernal_size = 3 # normally is set to 3

#but I don't want to waste time waiting so I will leave the training with larger epoch to my parter


#As below is a shallow CNN; Definitely we can try with another deep CNN but it will take some time to fine-tuning as well hence
# I will take care of that if I do have some spare time.
# The reset is just routine with nothing worth mentioning.
model_cnn = Sequential()
model_cnn.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn.add(Conv1D(filters, kernal_size, activation='relu'))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(hidden_dims,activation='relu'))
model_cnn.add(Dropout(0.5))
model_cnn.add(Dense(6,activation='sigmoid'))
model_cnn.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_cnn.summary()

#Fit model
history = model_cnn.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)


Model: "sequential_23"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_23 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_20 (Conv1D)           (None, 98, 250)           192250    
_________________________________________________________________
global_max_pooling1d_21 (Glo (None, 250)               0         
_________________________________________________________________
dense_31 (Dense)             (None, 64)                16064     
_________________________________________________________________
dropout_12 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_32 (Dense)             (None, 6)                 390       
Total params: 5,328,704
Trainable params: 5,328,704
Non-trainable params: 0
___________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<a id='s11'></a>
### 1.1 Filter Size

case 1.1.1: Filter Size = 5

In [63]:
filters = 250

filter_size = 8

model_cnn2 = Sequential()
model_cnn2.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn2.add(Conv1D(filters, filter_size, activation='relu'))
model_cnn2.add(GlobalMaxPooling1D())
model_cnn2.add(Dense(hidden_dims,activation='relu'))
model_cnn2.add(Dropout(0.5))
model_cnn2.add(Dense(6,activation='sigmoid'))
model_cnn2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn2.summary())

#Fit model
history = model_cnn2.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_29"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_29 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_32 (Conv1D)           (None, 93, 250)           512250    
_________________________________________________________________
global_max_pooling1d_27 (Glo (None, 250)               0         
_________________________________________________________________
dense_43 (Dense)             (None, 64)                16064     
_________________________________________________________________
dropout_18 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_44 (Dense)             (None, 6)                 390       
Total params: 5,648,704
Trainable params: 5,648,704
Non-trainable params: 0
___________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Case 1.0: filter size = 3, highest val_acc = 0.9718, after 2 epochs

Case 1.1.1, filter size = 8, highest val_acc = 0.9728, after 2 epochs

case 1.1.2: Filter Size = [3,4,5]

In [None]:
filters = 250

filter_sizes = [3, 4, 5]

model_cnn3 = Sequential()
model_cnn3.add(Embedding(vocab_size, emd_size, input_length=max_length,))
for i in range(len(filter_sizes)):
  model_cnn3.add(Conv1D(filters, filter_sizes[i], activation='relu'))

model_cnn3.add(GlobalMaxPooling1D())
model_cnn3.add(Dense(hidden_dims,activation='relu'))
model_cnn3.add(Dropout(0.5))
model_cnn3.add(Dense(6,activation='sigmoid'))
model_cnn3.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn3.summary())

#Fit model
history = model_cnn3.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)


Even though filter size = 3 has worse performance than filter size = 8, when we use multiple filter size: [3,4,5], we get better performance.

<a id='s12'></a>
### 1.2 Density Layer

case 1.2.1: For case 1.0 remove one density layer

In [65]:
filters = 250

filter_size = 3

model_cnn4 = Sequential()
model_cnn4.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn4.add(Conv1D(filters, filter_size, activation='relu'))
model_cnn4.add(GlobalMaxPooling1D())
model_cnn4.add(Dropout(0.5))
model_cnn4.add(Dense(6,activation='sigmoid'))
model_cnn4.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn4.summary())

#Fit model
history = model_cnn4.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_31"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_31 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_36 (Conv1D)           (None, 98, 250)           192250    
_________________________________________________________________
global_max_pooling1d_29 (Glo (None, 250)               0         
_________________________________________________________________
dropout_20 (Dropout)         (None, 250)               0         
_________________________________________________________________
dense_47 (Dense)             (None, 6)                 1506      
Total params: 5,313,756
Trainable params: 5,313,756
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 1.2.2: For case 1.1.1 remove one density layer

In [117]:
filters = 250

filter_size = 8

model_cnn8 = Sequential()
model_cnn8.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn8.add(Conv1D(filters, filter_size, activation='relu'))
model_cnn8.add(GlobalMaxPooling1D())
model_cnn8.add(Dropout(0.5))
model_cnn8.add(Dense(6,activation='sigmoid'))
model_cnn8.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn8.summary())

#Fit model
history = model_cnn8.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_62"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_51 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_49 (Conv1D)           (None, 93, 250)           512250    
_________________________________________________________________
global_max_pooling1d_39 (Glo (None, 250)               0         
_________________________________________________________________
dropout_32 (Dropout)         (None, 250)               0         
_________________________________________________________________
dense_70 (Dense)             (None, 6)                 1506      
Total params: 5,633,756
Trainable params: 5,633,756
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 1.2.3: For case 1.1.2 remove one density layer

In [118]:
filters = 250

filter_size = [3,4,5]

model_cnn7 = Sequential()
model_cnn7.add(Embedding(vocab_size, emd_size, input_length=max_length,))
for i in range(len(filter_sizes)):
  model_cnn7.add(Conv1D(filters, filter_sizes[i], activation='relu'))

model_cnn7.add(GlobalMaxPooling1D())
model_cnn7.add(Dropout(0.5))
model_cnn7.add(Dense(6,activation='sigmoid'))
model_cnn7.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn7.summary())

#Fit model
history = model_cnn7.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_63"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_52 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_50 (Conv1D)           (None, 98, 250)           192250    
_________________________________________________________________
conv1d_51 (Conv1D)           (None, 95, 250)           250250    
_________________________________________________________________
conv1d_52 (Conv1D)           (None, 91, 250)           312750    
_________________________________________________________________
global_max_pooling1d_40 (Glo (None, 250)               0         
_________________________________________________________________
dropout_33 (Dropout)         (None, 250)               0         
_________________________________________________________________
dense_71 (Dense)             (None, 6)               

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Case 1,2.1, remove one density layer from case 1.0. highest val_acc = 0.9727 after 1 epoch. But in case 1.0, with density layer, the highest val_acc = 0.9718, after 2 epochs. higher.

Case 1.2.2, remove one density layer from case 1.1.1. Val_acc = 0.9724, after 1 epoch. And in case 1.1.1, val_acc = 0.9728, lower.

Case 1.2.3, remove one density layer from case 1.1.2. highest val_acc = 0.9727, case 1.1.2, val_acc = 0.9739.

Density layer is used to reshape the output. If it’s in the middle, we think it’s to get the features from higher dimensions to just 1D array, a.k.a extract the features. It seems that the density layer won’t help to enhance the performance when filter size is small. But when the filter size is big or complex, the density layer is helpful. 

<a id='s13'></a>
### 1.3 Filter Number

case 1.3.1: filter number = 100

In [66]:
filter_sizes = [3,4,5]
filters = 100


model_cnn5 = Sequential()
model_cnn5.add(Embedding(vocab_size, emd_size, input_length=max_length,))
for i in range(len(filter_sizes)):
  model_cnn5.add(Conv1D(filters, filter_sizes[i], activation='relu'))

model_cnn5.add(GlobalMaxPooling1D())
model_cnn5.add(Dense(hidden_dims,activation='relu'))
model_cnn5.add(Dropout(0.5))
model_cnn5.add(Dense(6,activation='sigmoid'))
model_cnn5.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn5.summary())

#Fit model
history = model_cnn5.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_32"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_32 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_37 (Conv1D)           (None, 98, 100)           76900     
_________________________________________________________________
conv1d_38 (Conv1D)           (None, 95, 100)           40100     
_________________________________________________________________
conv1d_39 (Conv1D)           (None, 91, 100)           50100     
_________________________________________________________________
global_max_pooling1d_30 (Glo (None, 100)               0         
_________________________________________________________________
dense_48 (Dense)             (None, 64)                6464      
_________________________________________________________________
dropout_21 (Dropout)         (None, 64)              

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 1.3.2: filter number = 500

In [67]:
filter_sizes = [3,4,5]
filters = 500


model_cnn6 = Sequential()
model_cnn6.add(Embedding(vocab_size, emd_size, input_length=max_length,))
for i in range(len(filter_sizes)):
  model_cnn6.add(Conv1D(filters, filter_sizes[i], activation='relu'))

model_cnn6.add(GlobalMaxPooling1D())
model_cnn6.add(Dense(hidden_dims,activation='relu'))
model_cnn6.add(Dropout(0.5))
model_cnn6.add(Dense(6,activation='sigmoid'))
model_cnn6.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn6.summary())

#Fit model
history = model_cnn6.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_33"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_33 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_40 (Conv1D)           (None, 98, 500)           384500    
_________________________________________________________________
conv1d_41 (Conv1D)           (None, 95, 500)           1000500   
_________________________________________________________________
conv1d_42 (Conv1D)           (None, 91, 500)           1250500   
_________________________________________________________________
global_max_pooling1d_31 (Glo (None, 500)               0         
_________________________________________________________________
dense_50 (Dense)             (None, 64)                32064     
_________________________________________________________________
dropout_22 (Dropout)         (None, 64)              

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Case 1.3.1, filter number = 100, highest val_acc = 0.9682 after 2 epochs, val_loss = 0.0841

Case 1.3.2, filter number = 500, highest val_acc = 0.9731 after epochs, val_losst = 0.0711

Case 1.1.2, filter number = 250, highest val_acc = 0.9739 after 3 epochs, val_loss = 0.0716

Larger filter number means for a same filter size window has more filters, which means learn more from the same filter window. But too many filters will make the train speed very slow, and might cause overfitting, like case 1.3.2.

The best structure for textCNN in our case is that, kernel size = [3,4,5], filter size =250, with a density layer. The best val_acc = 0.9739 is in case 1.1.2

<a id='p22'></a>
## 2.CNN + LSTM
We have tried CNN above and will try LSTM below, now let's implement them together to see the result. The motivation is that I want to see how it handles long sequences together with what to keep and what to forget.


case 2.0: Combine TextCNN and LSTM after getting the result of part 1 and 3

combin case 1.1.2 and case 3.2.1

In [116]:
#CNN model
#try to choose the same parameter as before so that we can compare the result with regards to the model choice
#Parameters without specific comment are subject to fine-tuning.
# the reason why I set them to numerical value is that I don't want to re-run to load the parameter value specified before
filters = 250 # optimal from previous tests
kernal_size = [3,4,5] # optimal from previous tests
units=50 # optimal from below tests
#but I don't want to waste time waiting so I will leave the training with larger epoch to my parter


#As below is a shallow CNN; Definitely we can try with another deep CNN but it will take some time to fine-tuning as well hence
# I will take care of that if I do have some spare time.
# The reset is just routine with nothing worth mentioning.
model_cnn_lstm = Sequential()
model_cnn_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length,))
for i in range(len(filter_sizes)):
  model_cnn_lstm.add(Conv1D(filters, kernal_size[i], activation='relu'))
model_cnn_lstm.add(MaxPooling1D()) # also try 2D 
model_cnn_lstm.summary()
model_cnn_lstm.add(LSTM(units=units)) #same as the above
model_cnn_lstm.add(Dropout(0.2))
model_cnn_lstm.add(Dense(6,activation='sigmoid'))
model_cnn_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_cnn_lstm.summary()

#Fit model
history = model_cnn_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_61"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_50 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
conv1d_46 (Conv1D)           (None, 98, 250)           192250    
_________________________________________________________________
conv1d_47 (Conv1D)           (None, 95, 250)           250250    
_________________________________________________________________
conv1d_48 (Conv1D)           (None, 91, 250)           312750    
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 45, 250)           0         
Total params: 5,875,250
Trainable params: 5,875,250
Non-trainable params: 0
_________________________________________________________________
Model: "sequential_61"
_________________________________________________________________
Layer (type)        

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


The highest val_acc = 0.9703 after 3 epochs. 

TextCNN has 0.9739 val_acc, and LSTM has 0.9737. The result shows that the combination makes no improvement.

Possible reason for no improvement: We did the combination by adding an LSTM layer after CNN. That might not be a real combination, like when we do computation in CNN layers, it can’t remember the previous thing until touching the LSTM layer.

<a id='p23'></a>
## 3.BiDirectional RNN(LSTM/GRU)

Now we need something that could remember previous information as well as remembering info for a long period of time.
HA! That's the classical BiDirectional RNN.
Here I only implemented it with LSTM, but in practice it could be done with GRU or both interchangably.
I will leave it to my partner to do some test running,

Model Hyperparameters to test:
1. [LSTM Hidden Nodes Number](#s31) 
2. [Density Layer](#s32)

case 3.0: units = 50 (alph = 9), has internal density layer

In [69]:
units = 50

model_BiD = Sequential()
model_BiD.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD.add(LSTM(units=units,return_sequences = True))
model_BiD.add(GlobalMaxPooling1D())
model_BiD.add(Dense(hidden_dims,activation='relu'))
model_BiD.add(Dropout(0.5))
model_BiD.add(Dense(6,activation='sigmoid'))
model_BiD.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD.summary()

#Fit model
history = model_BiD.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_35"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_35 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
lstm_3 (LSTM)                (None, 100, 50)           61400     
_________________________________________________________________
global_max_pooling1d_33 (Glo (None, 50)                0         
_________________________________________________________________
dense_53 (Dense)             (None, 64)                3264      
_________________________________________________________________
dropout_24 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_54 (Dense)             (None, 6)                 390       
Total params: 5,185,054
Trainable params: 5,185,054
Non-trainable params: 0
___________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<a id='s31'></a>
### 3.1 LSTM Hidden Nodes Number

case 3.1.1 units = 149 (alph = 3)

In [70]:
units = 149

model_BiD2 = Sequential()
model_BiD2.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD2.add(LSTM(units=units,return_sequences = True))
model_BiD2.add(GlobalMaxPooling1D())
model_BiD2.add(Dense(hidden_dims,activation='relu'))
model_BiD2.add(Dropout(0.2))
model_BiD2.add(Dense(6,activation='sigmoid'))
model_BiD2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD2.summary()

#Fit model
history = model_BiD2.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_36"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_36 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
lstm_4 (LSTM)                (None, 100, 149)          241976    
_________________________________________________________________
global_max_pooling1d_34 (Glo (None, 149)               0         
_________________________________________________________________
dense_55 (Dense)             (None, 64)                9600      
_________________________________________________________________
dropout_25 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_56 (Dense)             (None, 6)                 390       
Total params: 5,371,966
Trainable params: 5,371,966
Non-trainable params: 0
___________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 3.1.2 units = 74 (alph = 6)

In [9]:
units = 74
hidden_dims = 64

model_BiD3 = Sequential()
model_BiD3.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD3.add(LSTM(units=units,return_sequences = True))
model_BiD3.add(GlobalMaxPooling1D())
model_BiD3.add(Dense(hidden_dims,activation='relu'))
model_BiD3.add(Dropout(0.2))
model_BiD3.add(Dense(6,activation='sigmoid'))
model_BiD3.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD3.summary()

#Fit model
history = model_BiD3.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 256)          5120000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 74)           97976     
_________________________________________________________________
global_max_pooling1d_2 (Glob (None, 74)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4800      
_________________________________________________________________
dropout_1 (Dropout)          (None, 64)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 390       
Total params: 5,223,166
Trainable params: 5,223,166
Non-trainable params: 0
____________________________________________

Case 3.0, units = 50, coefficient = 9, highest val_acc = 0.9725, val_loss = 0.0702, after 1 epoch

Case 3.1.1, units = 149, coefficient = 3, highest val_acc = 0.9721, val_loss = 0.0684, after 2 epochs

Case 3.1.2, units = 74, coefficient = 6,highest val_acc = 0.9720, val_loss = 0.0701, after 2 epochs

50 is the optimal value, and we will use it for further test.

<a id='s32'></a>
### 3.2 Density Layer

case 3.2.1: For case 3.0 delete middle density layer

In [75]:
units = 50

model_BiD5 = Sequential()
model_BiD5.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD5.add(LSTM(units=units,return_sequences = True))
model_BiD5.add(GlobalMaxPooling1D())
model_BiD5.add(Dropout(0.2))
model_BiD5.add(Dense(6,activation='sigmoid'))
model_BiD5.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD5.summary()

#Fit model
history = model_BiD5.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_40"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_40 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
lstm_8 (LSTM)                (None, 100, 50)           61400     
_________________________________________________________________
global_max_pooling1d_38 (Glo (None, 50)                0         
_________________________________________________________________
dropout_29 (Dropout)         (None, 50)                0         
_________________________________________________________________
dense_62 (Dense)             (None, 6)                 306       
Total params: 5,181,706
Trainable params: 5,181,706
Non-trainable params: 0
_________________________________________________________________


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


Case 3.2.1, we remove the density layer from case 3.0, which units = 50, and has the highest val_acc 0.9725. Then in case 3.2.1, highest val_acc = 0.9737. Also similar in textCNN, when units is not that so large, more density layers make no improvement in our case.

After all tests, the best LSTM model structure is like, units = 50, with just one density layer in the end. The highest val_acc = 0.9737, after 2 epochs.

<a id='p24'></a>
## 4.Attention Models
It's not covered in the lecture but since the release of Hierarchical Attention Networks for Document Classification paper written jointly by CMU and Microsoft guys in 2016, it's been quite popular.
But what is the REAL incentive after trying this model?
It's from the REAL TRUMP: "what do you have to lose? I say, take it.", so here I will give it a shoot as the president does to the hydroxychloroquine.

As below is a simple attention model which help us by pay more attention to some word since toxic comments tend to be determined by just one or two toxic words, especially some 4 letter word, u know.

Obviously, attention can be implemented together with models mentioned above, but since I don't have such time, I will leave it to my partner to do some trial.

In [95]:
# https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
from keras.layers import Layer
from keras import backend as K
class Attention(Layer):
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):

        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0

        super(Attention, self).__init__(**kwargs)
        
    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(name='{}_W'.format(self.name),
                                 shape=(input_shape[-1],),
                                 initializer=self.init,
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(name='{}_b'.format(self.name),
                                     shape=(input_shape[1],),
                                     initializer='zero',
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True
        
    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
        if self.bias:
            e += self.b
        e = K.tanh(e)

        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)

        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim

case 4.0: implemeted Attention with LSTM

In [97]:
units = 50 # optimal from previous

model_a = Sequential()
model_a.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_a.add(LSTM(units=units,return_sequences = True))
model_a.add(Attention(max_length))
#model_a.add(GlobalMaxPooling1D())
model_a.add(Dense(hidden_dims,activation='relu'))
model_a.add(Dropout(0.2))
model_a.add(Dense(6,activation='sigmoid'))
model_a.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_a.summary()

#Fit model
history = model_a.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_48"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_48 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
lstm_16 (LSTM)               (None, 100, 50)           61400     
_________________________________________________________________
attention_4 (Attention)      (None, 50)                150       
_________________________________________________________________
dense_63 (Dense)             (None, 64)                3264      
_________________________________________________________________
dropout_30 (Dropout)         (None, 64)                0         
_________________________________________________________________
dense_64 (Dense)             (None, 6)                 390       
Total params: 5,185,204
Trainable params: 5,185,204
Non-trainable params: 0
___________________________________________

  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


case 4.1: Try model structure introduced in https://www.kaggle.com/sanket30/cudnnlstm-lstm-99-accuracy, which is said to be 99% accuracy

In [119]:
from keras.layers import LSTM, Dense, Bidirectional, Input,Dropout,BatchNormalization, CuDNNGRU, CuDNNLSTM

model_a2 = Sequential()
model_a2.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_a2.add(Bidirectional(LSTM(128, dropout=0.4, recurrent_dropout=0.4, activation='relu', return_sequences=True)))
model_a2.add(Bidirectional(LSTM(64, return_sequences = True)))
model_a2.add(Attention(max_length))
model_a2.add(Dense(6,activation='sigmoid'))
model_a2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_a2.summary())

#Fit model
history = model_a2.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

Model: "sequential_64"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_53 (Embedding)     (None, 100, 256)          5120000   
_________________________________________________________________
bidirectional_23 (Bidirectio (None, 100, 256)          394240    
_________________________________________________________________
bidirectional_24 (Bidirectio (None, 100, 128)          164352    
_________________________________________________________________
attention_11 (Attention)     (None, 128)               228       
_________________________________________________________________
dense_72 (Dense)             (None, 6)                 774       
Total params: 5,679,594
Trainable params: 5,679,594
Non-trainable params: 0
_________________________________________________________________
None


  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "


Train on 159571 samples, validate on 63978 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


oops, there is something wrong with the result and we talked about it in the Limitation part of our PDF report.

## 5.BERT
As mentioned in the comment, "it's not difficult to implement BERT",so BERT is not implemented. In addition, you may not want to download pretrained model which takes a while through streaming.

<a id='p3'></a>
# Test Performance of One Model

Show the acc of each label

In [128]:
from sklearn.metrics import accuracy_score

model = model_cnn7 # replace with the model want to test

for label in list_classes:
    print('======================= Evaluate {} ==============================='.format(label))
    
    y_train_true = train[label]
    y_test_true = test_label[label]
    
    # compute accuracy
    y_train_pre = model.predict(x_train)
    y_test_pre = model.predict(x_test)
    
    print('Accuracy in train set of {} is {}'.format(label, accuracy_score(y_train_true, y_train_pre[:,1].round(),  normalize=True)))
    print('Accuracy in test set of {} is {}'.format(label, accuracy_score(y_test_true, y_test_pre[:,1].round(),  normalize=True)))

Accuracy in train set of toxic is 0.914113466732677
Accuracy in test set of toxic is 0.9123292381756228
Accuracy in train set of severe_toxic is 0.9922166308414436
Accuracy in test set of severe_toxic is 0.9919972490543625
Accuracy in train set of obscene is 0.9569094634990067
Accuracy in test set of obscene is 0.9496701991309513
Accuracy in train set of threat is 0.988926559337223
Accuracy in test set of threat is 0.9902779080308857
Accuracy in train set of insult is 0.9597420583940691
Accuracy in test set of insult is 0.9528900559567351
Accuracy in train set of identity_hate is 0.9865138402341278
Accuracy in test set of identity_hate is 0.9847916471287005


Pick 10 random sentences from test set and print out the predicted label and true label. 

And here we use all the sentences in test set, even if their labels are all '-1', which marked as unmarked for this competition, but for test we can still test it. (for labels with all '-1', we considered it as 'unlabelled')

In [None]:
train = pd.read_csv('./input/train.csv')
test = pd.read_csv('./input/test.csv')
test_label = pd.read_csv('./input/test_labels.csv')

sen_train = train["comment_text"]
sen_test = test["comment_text"]

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y_train = train[list_classes].values
y_test = test_label[list_classes].values

tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(list(sen_train))
tokenized_train = tokenizer.texts_to_sequences(sen_train)
tokenized_test = tokenizer.texts_to_sequences(sen_test)

x_train = pad_sequences(tokenized_train, maxlen=max_length)
x_test = pad_sequences(tokenized_test, maxlen=max_length)

In [167]:
import random

randoms = random.sample(range(len(x_test)), 10)

model = model_cnn7 # replace with the model want to test

for i in randoms:

        x_sample = x_test[i]
        y_true = y_test[i]
        y_pre = model.predict(np.array([x_sample,]))
        print('Sentence id is {}'.format(test['id'][i]))
        label_true = []
        label_pre = []
        for i in range (len(list_classes)):
            if (y_true[i] == 1):
                label_true.append(list_classes[i])
            if (y_pre[0][i] >= 0.5):
                label_pre.append(list_classes[i])
        if (y_true[0] == -1):
            label_true.append('unlabelled')
            
        if (len(label_true) == 0):
            label_true.append('normal')
            
        if (len(label_pre) == 0):
            label_pre.append('normal')
        print('True labels are {}'.format(label_true))
        print('Predicted labels are {}'.format(label_pre))
        print('(predicted array {})'.format(y_pre))
        print('=============================================================')

Sentence id is e33928dae1712a06
True labels are ['unlabelled']
Predicted labels are ['toxic', 'obscene', 'insult']
(predicted array [[0.99993944 0.40892202 0.998252   0.0112919  0.86991    0.07564162]])
Sentence id is 4d8870314a57d8e4
True labels are ['unlabelled']
Predicted labels are ['normal']
(predicted array [[3.9631287e-03 2.3208251e-05 2.4333235e-04 2.5662528e-06 1.8232784e-04
  3.5119509e-05]])
Sentence id is cfbc476bf15f6373
True labels are ['unlabelled']
Predicted labels are ['normal']
(predicted array [[4.2877709e-05 2.9629155e-06 6.6561275e-05 2.0524837e-09 2.0775164e-05
  3.8602158e-07]])
Sentence id is 2c66b6c902a599b2
True labels are ['toxic', 'obscene', 'insult']
Predicted labels are ['toxic', 'obscene']
(predicted array [[9.9888498e-01 3.8344768e-04 9.6542639e-01 6.0155251e-08 7.4289426e-02
  5.5028549e-06]])
Sentence id is f83464b600f0a8ff
True labels are ['normal']
Predicted labels are ['normal']
(predicted array [[3.3287317e-05 1.7350229e-06 2.1727858e-05 3.8528283e

You also test the model with any sentence you want.

In [33]:
model = model_BiD3; # replace the model want to test

sen_one_test = "hello world"

tokenized_one_test = tokenizer.texts_to_sequences([sen_one_test])

x_one_test = pad_sequences(tokenized_one_test, maxlen=max_length)

y_pre = model.predict(np.array(x_one_test)) 

label_pre = []
for i in range (len(list_classes)):
    if (y_pre[0][i] >= 0.5):
        label_pre.append(list_classes[i])
            
    if (len(label_pre) == 0):
        label_pre.append('normal')

print('Sentence is {}'.format(sen_one_test))
print('Predicted labels are {}'.format(label_pre))
print('(predicted array {})'.format(y_pre))

Sentence is hello world
Predicted labels are ['normal']
(predicted array [[2.9118722e-03 2.1204372e-05 1.8908845e-03 6.1425867e-06 7.1677449e-04
  3.2186953e-04]])
