### COMP3359 Final Project: Toxic Comment Classification


## Objective
A huge number of discussions take place on social media every day, and a single toxic comment can sour a whole thread. Many platforms struggle to keep their environment clean, and what counts as toxic partly depends on the platform: a legal adult website may tolerate obscene language, for example, while most other sites do not.

Therefore, our model not only detects whether a comment is toxic, but also labels a toxic comment with its toxicity type(s).


## Overview 
[Start With Simple LSTM Model](#p1) 

[Experiment With Other Models](#p2) 
 
-----

<a id='p1'></a>
# Start With Simple LSTM Model

This section implements a model with an embedding layer, an LSTM layer, and one dense layer with dropout.

Model Hyperparameters to test:
1. [Sentence Max Length: 100, 150, 200](#p11) 
2. [Word Embedding Dimension: 64, 128, 256](#p12) 
3. [Batch size: 500, 1000, 5000](#p13)

In [0]:
# Optional: only needed when running on Google Colab; skip this cell when working locally

""" Prepare Notebook for Google Colab """
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Specify directory of course materials in Google Drive
module_dir = '/content/drive/My Drive/3359proj/'

# Add course material directory in Google Drive to system path, for importing .py files later
# (Ref.: https://stackoverflow.com/questions/48905127/importing-py-files-in-google-colab)
import sys
sys.path.append(module_dir)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
pip install keras



In [0]:
""" Load Data """

import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.text import Tokenizer, one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
from keras.layers import Bidirectional, GlobalMaxPool1D, Conv1D, Dropout, Flatten, MaxPooling1D, Concatenate, GlobalMaxPooling1D
from keras.models import Model, Sequential, load_model
from keras import initializers, regularizers, constraints, optimizers, layers

# Paths for Google Colab (DO NOT RUN when working locally; use the commented lines below instead)
data_dir = os.path.join(module_dir, "input/")
data_path_train = os.path.join(data_dir, "train.csv")
data_path_test = os.path.join(data_dir, "test.csv")
data_path_test_label = os.path.join(data_dir, "test_labels.csv")
train = pd.read_csv(data_path_train)
test = pd.read_csv(data_path_test)
test_label = pd.read_csv(data_path_test_label)

# train = pd.read_csv('./input/train.csv')
# test = pd.read_csv('./input/test.csv')
# test_label = pd.read_csv('./input/test_labels.csv')

Using TensorFlow backend.


In [0]:
# Check whether the data contains null values; if so, they need to be cleaned first
train.isnull().any(),test.isnull().any()

(id               False
 comment_text     False
 toxic            False
 severe_toxic     False
 obscene          False
 threat           False
 insult           False
 identity_hate    False
 dtype: bool, id              False
 comment_text    False
 dtype: bool)

In [0]:
sen_train = train["comment_text"]
sen_test_temp = test["comment_text"]

list_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
y_train = train[list_classes].values
y_test_temp = test_label[list_classes].values

sen_test = []
y_test = []

# Keep only test rows with real labels; rows whose six labels are all -1 are skipped
test_no = len(y_test_temp)
for i in range(test_no):
  if not np.array_equal(y_test_temp[i], [-1,-1,-1,-1,-1,-1]):
    y_test.append(y_test_temp[i])
    sen_test.append(sen_test_temp[i])
    
y_test = np.array(y_test)
sen_test = np.array(sen_test)
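The row-by-row filter above can also be written as a single boolean mask. The following is a minimal sketch on toy data, assuming the same convention that an unscored row has all six labels set to -1; the arrays `labels` and `comments` are hypothetical stand-ins for `y_test_temp` and `sen_test_temp`.

```python
import numpy as np

# Toy stand-ins for y_test_temp and sen_test_temp (hypothetical values).
labels = np.array([
    [0, 0, 0, 0, 0, 0],
    [-1, -1, -1, -1, -1, -1],  # unscored row, should be dropped
    [1, 0, 1, 0, 1, 0],
])
comments = np.array(["fine comment", "unscored comment", "rude comment"])

# Keep a row unless every one of its six labels is -1.
keep = ~(labels == -1).all(axis=1)
y_kept = labels[keep]
sen_kept = comments[keep]

print(y_kept.shape)
print(list(sen_kept))
```

The mask avoids the Python-level loop, which may matter little here but keeps the labels and comments aligned by construction.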

In [0]:
# Vocabulary size for tokenization
vocab_size = 20000
tokenizer = Tokenizer(num_words=vocab_size)
tokenizer.fit_on_texts(list(sen_train))
tokenized_train = tokenizer.texts_to_sequences(sen_train)
tokenized_test = tokenizer.texts_to_sequences(sen_test)
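The token sequences produced here have different lengths, so they are later passed through `pad_sequences`. As a rough illustration of what that step does, here is a NumPy sketch that assumes the Keras defaults (`padding='pre'`, `truncating='pre'`, padding value 0). The helper `pad_pre` is a hypothetical name and not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` tokens of each sequence."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]  # drop tokens from the front if the sequence is too long
        if trimmed:
            out[row, -len(trimmed):] = trimmed
    return out

toy = [[5, 8], [1, 2, 3, 4, 5, 6]]
padded = pad_pre(toy, maxlen=4)
print(padded)
```

Under these assumptions a short comment is filled with zeros on the left, while a long comment loses its earliest tokens, so the choice of maximum length decides how much of the start of long comments is discarded.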

In [0]:
# Optional: plot the distribution of comment lengths (skip this cell to save time)
totalNumWords = [len(one_comment) for one_comment in tokenized_train]
plt.hist(totalNumWords, bins=np.arange(0, 410, 10))
plt.show()
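Besides reading the histogram, one way to compare candidate maximum lengths is to compute the share of comments that fit without truncation. This is only a sketch on made-up lengths; `coverage` is a hypothetical helper, and with the real data `toy_lengths` would be replaced by `totalNumWords`.

```python
import numpy as np

def coverage(lengths, max_len):
    """Fraction of comments whose token count is at most max_len."""
    lengths = np.asarray(lengths)
    return float((lengths <= max_len).mean())

# Made-up token counts, for illustration only.
toy_lengths = [12, 40, 95, 120, 180, 210, 350, 60, 30, 75]
for candidate in (100, 150, 200):
    print(candidate, coverage(toy_lengths, candidate))
```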

In [0]:
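max_length200 = 200
max_length150 = 150
max_length100 = 100
batch_size50 = 50
batch_size500 = 500
batch_size1000 = 1000  # referenced by the later cases
batch_size5000 = 5000  # referenced by the later cases
epochs = 15
emd_size = 128
emd_size64 = 64
emd_size128 = 128  # referenced by case 0.2.1
emd_size256 = 256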

Build a simple model with LSTM. Skip the following cells when running the other models further below.

<a id='p11'></a>
### 1.  Sentence Max Length

case 0.1.1: max length = 200

In [0]:
x_train = pad_sequences(tokenized_train, maxlen=max_length200)
x_test = pad_sequences(tokenized_test, maxlen=max_length200)

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length200,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_lstm.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 200, 128)          2560000   
_________________________________________________________________
lstm_1 (LSTM)                (None, 200, 60)           45360     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 60)                0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 60)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 366       
Total params: 2,605,726
Trainable params: 2,605,726
Non-trainable params: 0
_________________________________________________________________
None
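The parameter counts in the summary above can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the function names are hypothetical helpers, and the numbers are the ones used in this notebook (vocabulary 20000, embedding size 128, 60 LSTM units, 6 outputs).

```python
def embedding_params(vocab, dim):
    # One dim-sized vector per vocabulary entry
    return vocab * dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, outputs):
    # Weight matrix plus one bias per output
    return input_dim * outputs + outputs

counts = [
    embedding_params(20000, 128),
    lstm_params(128, 60),
    dense_params(60, 6),
]
print(counts, sum(counts))
```

Almost all of the parameters sit in the embedding table, which is one reason the embedding dimension is worth testing as a hyperparameter.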


In [0]:
# model = Model(inputs=inp,outputs=x)
# model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
# print(model.summary())
#history = model.fit(x_train,y_train, batch_size=batch_size, epochs=epochs, validation_data=[x_test,y_test])

# Fit model
history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size50)

159571/159571 [==============================] - 89s 556us/step
Training accuracy: 0.98535
63978/63978 [==============================] - 36s 561us/step
Testing Accuracy: 0.96839

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
import matplotlib.pyplot as plt
%matplotlib inline
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(maxlen=200,batch_size=50,emd_size=128)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()
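A caution when reading the accuracy figures: with six sigmoid outputs and binary cross-entropy, the reported accuracy is, as far as I understand Keras, an element-wise accuracy over all label slots. When positive labels are rare, this number can be high even for a model that never flags anything. The sketch below shows the effect on a made-up label matrix; the sizes and positive counts are assumptions for illustration, not statistics of the real dataset.

```python
import numpy as np

# Hypothetical label matrix: 1000 comments, 6 labels, only 50 positive slots.
y_true = np.zeros((1000, 6), dtype=int)
y_true[:40, 0] = 1  # 40 comments carry the first label
y_true[:10, 2] = 1  # 10 of them also carry the third label

# A "model" that always predicts non-toxic for every label.
y_pred = np.zeros_like(y_true)

elementwise_accuracy = float((y_true == y_pred).mean())
positives_found = int(((y_true == 1) & (y_pred == 1)).sum())

print(elementwise_accuracy)
print(positives_found)
```

For this reason, differences of a fraction of a percent between models may be easier to interpret alongside a per-label metric such as recall or ROC AUC.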

case 0.1.2: max length = 100

In [0]:
x_train = pad_sequences(tokenized_train, maxlen=max_length100)
x_test = pad_sequences(tokenized_test, maxlen=max_length100)

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length100,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size50)

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(maxlen=100)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

case 0.1.3: max length = 150

In [0]:
x_train = pad_sequences(tokenized_train, maxlen=max_length150) # if 100 still overfit change to 50
x_test = pad_sequences(tokenized_test, maxlen=max_length150)

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length150,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size1000)

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(maxlen=150)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

<a id='p12'></a>
### 2. Word Embedding Dimension

case 0.2.1: emd size = 128

In [0]:
x_train = pad_sequences(tokenized_train, maxlen=max_length200)
x_test = pad_sequences(tokenized_test, maxlen=max_length200)

In [0]:
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size128, input_length=max_length200,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size1000)

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(emd_size=128)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

case 0.2.2: emd size = 256

In [0]:
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size256, input_length=max_length200,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size1000)

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(emd_size=256)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

<a id='p13'></a>
### 3. Batch Size

case 0.3.1: batch size = 500

In [0]:
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length200,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size500)

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(bs=500)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

case 0.3.2: batch size = 5000

In [0]:
model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length200,))
model_lstm.add(LSTM(units=60,return_sequences = True))
model_lstm.add(GlobalMaxPooling1D())
model_lstm.add(Dropout(0.2))
model_lstm.add(Dense(6,activation='sigmoid'))
model_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])

history = model_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size5000)

In [0]:
loss,accuracy = model_lstm.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_lstm.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(bs=5000)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

###@Author Pu Hongxi
<br><br>

<a id='p2'></a>
# Experiment with other models

1. [TextCNN](#p21) 
2. [CNN+LSTM](#p22) 
3. [BiDirectional RNN(LSTM/GRU)](#p23)
4. [Attention](#p24)

In [0]:
""" Common variables for all models """

emd_size = 128 # best value from the LSTM experiments
filters = 250
hidden_dims = 64
vocab_size = 20000 
max_length = 200 # best value from the LSTM experiments
epochs = 15
batch_size = 50

<a id='p21'></a>
## 1. TextCNN
Following the paper Convolutional Neural Networks for Sentence Classification by Yoon Kim, we can feed word vectors to a CNN much as we would feed an image. The paper reports excellent CNN performance on sentence-level classification tasks across multiple benchmarks. So let's give it a shot first and analyze the results once we have them.

Model Hyperparameters to test:

1. [Filter Size](#s11) 
2. [Dense Layer](#s12)

In [0]:
# Preprocessing is the same as before; run the following once padding is finished
import pandas as pd
import numpy as np

from keras import layers
from keras.models import Sequential
from keras.preprocessing.text import Tokenizer, one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense, Dropout, Flatten, MaxPooling1D, Input, Concatenate,Conv2D, MaxPool2D, Reshape, CuDNNLSTM
from keras.models import load_model
x_train;
y_train;
x_test;
y_test;

In [0]:
# CNN model
# Use the same parameters as before so that results can be compared across model choices.
# Parameters without a specific comment are subject to fine-tuning.
# Values are hard-coded here so this cell can run without re-running the earlier parameter cell.
emd_size = 128
filters = 250
kernel_size = 3  # 3 is a common default
hidden_dims = 64
vocab_size = 20000  # = max_features
max_length = 200  # = maxlen
# epochs is defined with the common variables above; a value of at least 5 is recommended


# This is a shallow CNN. A deeper CNN could also be tried, but it would need
# additional fine-tuning time.
# The rest is routine model setup.
model_cnn = Sequential()
model_cnn.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn.add(Conv1D(filters, kernel_size, activation='relu'))
model_cnn.add(GlobalMaxPooling1D())
model_cnn.add(Dense(hidden_dims,activation='relu'))
model_cnn.add(Dropout(0.5))
model_cnn.add(Dense(6,activation='sigmoid'))
model_cnn.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_cnn.summary()

#Fit model
history = model_cnn.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)


In [0]:
loss,accuracy = model_cnn.evaluate(x_train,y_train)
print("EMM! <manual Exclamation> That's the Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_cnn.evaluate(x_test,y_test)
print("That's the one: Testing Accuracy: {:.5f}".format(accuracy))

# No visualization for this baseline run; reuse the plotting code from the LSTM section if needed.

159571/159571 [==============================] - 27s 169us/step
EMM! <manual Exclamation> That's the Training accuracy: 0.98581
63978/63978 [==============================] - 11s 172us/step
That's the one: Testing Accuracy: 0.97131

    It performs better than the LSTM alone.

<a id='s11'></a>
### 1.1 Filter Size

case 1.1.1: Filter Size = 5

In [0]:
filter_size = 5

model_cnn2 = Sequential()
model_cnn2.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn2.add(Conv1D(filters, filter_size, activation='relu'))
model_cnn2.add(GlobalMaxPooling1D())
model_cnn2.add(Dense(hidden_dims,activation='relu'))
model_cnn2.add(Dropout(0.5))
model_cnn2.add(Dense(6,activation='sigmoid'))
model_cnn2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn2.summary())

#Fit model
history = model_cnn2.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)


In [0]:
loss,accuracy = model_cnn2.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_cnn2.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(TextCNN: filter=5)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

case 1.1.2: Filter Size = [3,4,5]

In [0]:
filter_sizes = [3,4,5]

model_cnn3 = Sequential()
model_cnn3.add(Embedding(vocab_size, emd_size, input_length=max_length,))
for i in range(len(filter_sizes)):
  model_cnn3.add(Conv1D(filters, filter_sizes[i], activation='relu'))

model_cnn3.add(GlobalMaxPooling1D())
model_cnn3.add(Dense(hidden_dims,activation='relu'))
model_cnn3.add(Dropout(0.5))
model_cnn3.add(Dense(6,activation='sigmoid'))
model_cnn3.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn3.summary())

#Fit model
history = model_cnn3.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)


In [0]:
loss,accuracy = model_cnn3.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_cnn3.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(TextCNN: filter=[3,4,5])')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()
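Note that the cell for case 1.1.2 stacks the three convolutions one after another, whereas the TextCNN of Kim's paper applies the different filter widths in parallel to the same embedded input, max-pools each branch over time, and concatenates the results. The NumPy sketch below illustrates that parallel layout on a tiny random input. All sizes and the helper names `conv1d_relu` and `parallel_text_cnn_features` are assumptions chosen for illustration, not the notebook's settings.

```python
import numpy as np

def conv1d_relu(x, kernel):
    """Valid 1D convolution over time. x: (steps, dim); kernel: (width, dim, filters)."""
    width = kernel.shape[0]
    steps = x.shape[0] - width + 1
    out = np.stack([
        np.tensordot(x[t:t + width], kernel, axes=([0, 1], [0, 1]))
        for t in range(steps)
    ])
    return np.maximum(out, 0.0)  # ReLU

def parallel_text_cnn_features(x, kernels):
    """Apply each kernel to the same input, max-pool over time, then concatenate."""
    pooled = [conv1d_relu(x, k).max(axis=0) for k in kernels]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 8))  # one embedded comment: 20 tokens, 8-dim vectors
kernels = [rng.normal(size=(w, 8, 4)) for w in (3, 4, 5)]  # 4 filters per width
features = parallel_text_cnn_features(x, kernels)
print(features.shape)
```

In Keras the same layout would need the functional API rather than `Sequential`, since each branch starts from the shared embedding output.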

<a id='s12'></a>
### 1.2 Dense Layer

case 1.2.1: Same as case 1.1.1, with the intermediate dense layer removed

In [0]:
filter_size = 5

model_cnn4 = Sequential()
model_cnn4.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn4.add(Conv1D(filters, filter_size, activation='relu'))
model_cnn4.add(GlobalMaxPooling1D())
model_cnn4.add(Dropout(0.5))
model_cnn4.add(Dense(6,activation='sigmoid'))
model_cnn4.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
print(model_cnn4.summary())

#Fit model
history = model_cnn4.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

In [0]:
loss,accuracy = model_cnn4.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_cnn4.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(TextCNN: without middle dense layer)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

<a id='p22'></a>
## 2. CNN + LSTM
We have tried the CNN and the LSTM separately above; now let's combine them and see the result. The motivation is to see how the combination handles long sequences, with the LSTM deciding what to keep and what to forget.


In [0]:
import sys, os, re, csv, codecs, numpy as np, pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, MaxPooling1D,GlobalAveragePooling1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

In [0]:
# CNN + LSTM model
# Use the same parameters as before so that results can be compared across model choices.
# Parameters without a specific comment are subject to fine-tuning.
# Values are hard-coded here so this cell can run without re-running the earlier parameter cell.
emd_size = 128
filters = 250
kernel_size = 3  # 3 is a common default
hidden_dims = 128
vocab_size = 20000  # = max_features
max_length = 200  # = maxlen
units = 60
# epochs is defined with the common variables above; a value of at least 5 is recommended


# A shallow CNN feeds an LSTM here. A deeper CNN could also be tried, but it would need
# additional fine-tuning time.
model_cnn_lstm = Sequential()
model_cnn_lstm.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_cnn_lstm.add(Conv1D(filters, kernel_size, activation='relu'))
model_cnn_lstm.add(MaxPooling1D())  # 2D pooling is another option to try
model_cnn_lstm.summary()
model_cnn_lstm.add(LSTM(units=units))  # same number of units as the LSTM models above
model_cnn_lstm.add(Dense(hidden_dims,activation='relu'))
model_cnn_lstm.add(Dropout(0.5))
model_cnn_lstm.add(Dense(6,activation='sigmoid'))
model_cnn_lstm.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_cnn_lstm.summary()

#Fit model
history = model_cnn_lstm.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)



In [0]:
loss,accuracy = model_cnn_lstm.evaluate(x_train,y_train)
print("EMM! <manual Exclamation> That's the Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_cnn_lstm.evaluate(x_test,y_test)
print("That's the one: Testing Accuracy: {:.5f}".format(accuracy))

EMM! <manual Exclamation> That's the Training accuracy: 0.98529
That's the one: Testing Accuracy: 0.96890
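To see how long a sequence the LSTM actually receives in this model, the output lengths of the convolution and pooling layers can be worked out by hand. The sketch below assumes 'valid' padding for `Conv1D` and the `MaxPooling1D` defaults as I understand them (pool size 2, stride equal to the pool size); the function names are hypothetical helpers.

```python
def conv1d_valid_length(steps, kernel_size, stride=1):
    """Number of output steps of a 'valid' 1D convolution."""
    return (steps - kernel_size) // stride + 1

def max_pool_length(steps, pool_size=2):
    """Number of output steps of non-overlapping max pooling with 'valid' padding."""
    return (steps - pool_size) // pool_size + 1

after_conv = conv1d_valid_length(200, 3)
after_pool = max_pool_length(after_conv)
print(after_conv, after_pool)
```

Under these assumptions the LSTM sees about half as many time steps as in the plain LSTM model, each step summarizing a short window of words.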


Combine TextCNN and LSTM after getting the result of part 1 and 3

In [0]:
# code

So far, the test accuracy of the various models is:
LSTM:  0.96839
CNN:  0.97131
CNN + LSTM: 0.96890 

This is an improvement over the original LSTM. 
Why it is no better than the CNN is left for my partner to investigate by changing some parameters. 

<a id='p23'></a>
## 3. BiDirectional RNN (LSTM/GRU)

Now we need something that can remember earlier information and also retain it over a long span.
That is the classic bidirectional RNN.
Here it is implemented only with LSTM, but in practice it could be done with GRU, or with either interchangeably.
Further test runs are left to my partner.

Model Hyperparameters to test:
1. [LSTM Hidden Nodes Number](#s31) 
2. [Dense Layer](#s32)

In [0]:
model_BiD = Sequential()
model_BiD.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD.add(LSTM(units=units,return_sequences = True))
model_BiD.add(GlobalMaxPooling1D())
model_BiD.add(Dense(hidden_dims,activation='relu'))
model_BiD.add(Dropout(0.5))
model_BiD.add(Dense(6,activation='sigmoid'))
model_BiD.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD.summary()

#Fit model
history = model_BiD.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

In [0]:
loss,accuracy = model_BiD.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_BiD.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

Training accuracy: 0.98512
Testing Accuracy: 0.97213
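The `Sequential` cell above adds a plain `LSTM` layer; in Keras a layer becomes bidirectional when it is wrapped in `Bidirectional(...)`, as the `model_birnn_lstm` function in the next subsection does. To make the idea concrete, here is a NumPy sketch that uses a plain tanh RNN in place of an LSTM (a deliberate simplification): one pass reads the sequence left to right, a second pass reads it right to left, and the two hidden states are concatenated at each step. All names and sizes are assumptions for illustration.

```python
import numpy as np

def simple_rnn(x, w_in, w_rec):
    """Plain tanh RNN. x: (steps, dim). Returns hidden states of shape (steps, units)."""
    h = np.zeros(w_rec.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_in + h @ w_rec)
        states.append(h)
    return np.stack(states)

def bidirectional(x, fwd, bwd):
    """Run one RNN left-to-right and another right-to-left, then concatenate per step."""
    forward_states = simple_rnn(x, *fwd)
    backward_states = simple_rnn(x[::-1], *bwd)[::-1]  # re-align to original order
    return np.concatenate([forward_states, backward_states], axis=1)

rng = np.random.default_rng(0)
steps, dim, units = 5, 4, 3
x = rng.normal(size=(steps, dim))
fwd = (rng.normal(size=(dim, units)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(dim, units)), rng.normal(size=(units, units)))
states = bidirectional(x, fwd, bwd)
print(states.shape)
```

Each position ends up with twice as many features as a one-directional layer with the same number of units, and the backward half carries information from the words that follow.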


<a id='s31'></a>
### 3.1 LSTM Hidden Nodes Number

In [0]:
def model_birnn_lstm(units):
    inp = Input(shape=(max_length,))
    x = Embedding(vocab_size, emd_size)(inp)
    
    x = Bidirectional(CuDNNLSTM(units=units, return_sequences=True))(x)
    avg_pool = GlobalAveragePooling1D()(x)
    max_pool = GlobalMaxPooling1D()(x)
    conc = Concatenate()([avg_pool, max_pool])  # Concatenate is the layer class imported above
    conc = Dense(hidden_dims, activation="relu")(conc)
    conc = Dropout(0.2)(conc)
    outp = Dense(6, activation="sigmoid")(conc)
    model = Model(inputs=inp, outputs=outp)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

case 3.1.1 units = 94 (alph = 3)

In [0]:
units = 94

model_BiD2 = Sequential()
model_BiD2.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD2.add(LSTM(units=units,return_sequences = True))
model_BiD2.add(GlobalMaxPooling1D())
model_BiD2.add(Dense(hidden_dims,activation='relu'))
model_BiD2.add(Dropout(0.2))
model_BiD2.add(Dense(6,activation='sigmoid'))
model_BiD2.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD2.summary()

#Fit model
history = model_BiD2.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

In [0]:
loss,accuracy = model_BiD2.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_BiD2.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(RNN: units=94)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

case 3.1.2 units = 47 (alph = 6)

In [0]:
units = 47
model_BiD3 = Sequential()
model_BiD3.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD3.add(LSTM(units=units,return_sequences = True))
model_BiD3.add(GlobalMaxPooling1D())
model_BiD3.add(Dense(hidden_dims,activation='relu'))
model_BiD3.add(Dropout(0.2))
model_BiD3.add(Dense(6,activation='sigmoid'))
model_BiD3.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD3.summary()

#Fit model
history = model_BiD3.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

In [0]:
loss,accuracy = model_BiD3.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_BiD3.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(RNN: units=47)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

case 3.1.3 units = 31 (alph = 9)

In [0]:
units = 31
model_BiD4 = Sequential()
model_BiD4.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD4.add(LSTM(units=units,return_sequences = True))
model_BiD4.add(GlobalMaxPooling1D())
model_BiD4.add(Dense(hidden_dims,activation='relu'))
model_BiD4.add(Dropout(0.2))
model_BiD4.add(Dense(6,activation='sigmoid'))
model_BiD4.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD4.summary()

#Fit model
history = model_BiD4.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

In [0]:
loss,accuracy = model_BiD4.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_BiD4.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(RNN: units=31)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()

<a id='s32'></a>
### 3.2 Dense Layer

case 3.2.1: Same as case 3.1.1, with the middle dense layer removed

In [0]:
units = 94

model_BiD5 = Sequential()
model_BiD5.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_BiD5.add(LSTM(units=units,return_sequences = True))
model_BiD5.add(GlobalMaxPooling1D())
model_BiD5.add(Dropout(0.2))
model_BiD5.add(Dense(6,activation='sigmoid'))
model_BiD5.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_BiD5.summary()

#Fit model
history = model_BiD5.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

In [0]:
loss,accuracy = model_BiD5.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_BiD5.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

""" Visualize Training Results """
# Get training results
history_dict = history.history
train_acc = history_dict['accuracy']
test_acc = history_dict['val_accuracy']

# Plot training results
plt.plot(train_acc, label='Train Acc.')
plt.plot(test_acc, label='Test Acc.')

# Show plot
plt.title('Model Training Results(RNN: no middle dense layer)')
plt.xlabel('Epochs')
plt.ylabel('Acc')
plt.legend(loc="lower right")
plt.grid()
plt.show()
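
To get a feel for what case 3.2.1 removes, the sketch below counts the trainable weights in the classifier head with and without the middle dense layer. It assumes the pooled LSTM output has `units = 94` features, as in the cell above. `hidden_dims = 50` is a hypothetical value picked only for illustration, and `dense_params` is a helper introduced here, not part of the notebook.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

units = 94         # size of the pooled LSTM output in case 3.2.1
hidden_dims = 50   # hypothetical width of the middle dense layer
n_labels = 6       # six toxicity labels

# Head with the middle layer: Dense(hidden_dims) followed by Dense(6)
with_hidden = dense_params(units, hidden_dims) + dense_params(hidden_dims, n_labels)

# Head without the middle layer: Dense(6) directly on the pooled output
without_hidden = dense_params(units, n_labels)

print("with middle dense layer:   ", with_hidden)     # 5056
print("without middle dense layer:", without_hidden)  # 570
```

Under these assumptions the head shrinks by roughly a factor of nine, although the embedding and LSTM layers probably still hold most of the model's weights.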

So far, the test accuracies of the various models are:

| Model     | Test accuracy |
|-----------|---------------|
| LSTM      | 0.96839       |
| CNN       | 0.97131       |
| CNN+LSTM  | 0.96890       |
| BiDir RNN | 0.97213       |
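
When reading these numbers, it may help to see how the accuracy is counted. The sketch below assumes that Keras resolves `metrics=['accuracy']` to an element-wise binary accuracy when the loss is `binary_crossentropy` and the output has six sigmoid units. That is my understanding of Keras 2.x, not something checked in this notebook. The function `binary_accuracy` and the toy data are made up for illustration.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of individual label decisions that are correct,
    counted over every sample and every label."""
    correct = 0
    total = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            correct += int((p > threshold) == bool(t))
            total += 1
    return correct / total

# Two toy comments with six labels each (hypothetical data)
y_true = [[0, 0, 0, 0, 0, 0],        # clean comment
          [1, 0, 1, 0, 1, 0]]        # comment with three toxic labels
y_prob = [[0.1, 0.0, 0.2, 0.0, 0.1, 0.0],
          [0.9, 0.1, 0.3, 0.0, 0.2, 0.1]]   # only the first toxic label is caught

acc = binary_accuracy(y_true, y_prob)
print("binary accuracy: {:.5f}".format(acc))  # 0.83333
```

In this toy case the model misses two of the three toxic labels on the second comment and still scores about 83%, so small differences between the accuracies above should be read with some care.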


<a id=’p24’></a>
## 4. Attention Models
Attention was not covered in the lectures, but it has been quite popular since the 2016 paper "Hierarchical Attention Networks for Document Classification", written jointly by researchers at CMU and Microsoft.
But what is the REAL incentive for trying this model?
It comes from the REAL TRUMP: "what do you have to lose? I say, take it." So here I will give it a shot, just as the president did with hydroxychloroquine.

Below is a simple attention model. It helps by paying more attention to certain words, since whether a comment is toxic tends to be determined by just one or two toxic words, especially certain four-letter words, you know.

Obviously, attention can be combined with the models mentioned above, but since I don't have the time, I will leave those trials to my partner.
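
Before the Keras layer, here is a minimal pure-Python sketch of the same idea: score each timestep, turn the scores into weights that sum to 1, and take the weighted sum of the hidden states. It is a simplification (no masking, no epsilon term), and the function name `attention_pool` and the toy numbers are assumptions made for illustration only.

```python
import math

def attention_pool(x, w, b):
    """Collapse a sequence of hidden states into a single vector.

    x: list of timesteps, each a list of features (steps x features)
    w: one weight per feature
    b: one bias per timestep
    """
    # Score each timestep: e_t = tanh(x_t . w + b_t)
    scores = [math.tanh(sum(xi * wi for xi, wi in zip(x_t, w)) + b_t)
              for x_t, b_t in zip(x, b)]
    # Normalise exp(e_t) so the weights sum to 1 over the timesteps
    exps = [math.exp(e) for e in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    # Context vector: weighted sum of the hidden states
    context = [sum(a * x_t[j] for a, x_t in zip(weights, x))
               for j in range(len(w))]
    return weights, context

# Toy input: 3 timesteps, 2 features; the middle step has the largest activations
x = [[0.1, 0.0], [2.0, 1.5], [0.2, 0.1]]
w = [1.0, 1.0]         # hypothetical weights, normally learned during training
b = [0.0, 0.0, 0.0]

weights, context = attention_pool(x, w, b)
print("attention weights:", weights)
print("context vector:   ", context)
```

The middle timestep receives the largest weight, which is the behaviour we hope for when a single word carries most of the toxicity.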

In [0]:
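# https://www.kaggle.com/qqgeogor/keras-lstm-attention-glove840b-lb-0-043
# Imports needed by the custom layer
from keras import backend as K
from keras import initializers, regularizers, constraints
from keras.layers import Layer

class Attention(Layer):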
    def __init__(self, step_dim,
                 W_regularizer=None, b_regularizer=None,
                 W_constraint=None, b_constraint=None,
                 bias=True, **kwargs):

        self.supports_masking = True
        self.init = initializers.get('glorot_uniform')

        self.W_regularizer = regularizers.get(W_regularizer)
        self.b_regularizer = regularizers.get(b_regularizer)

        self.W_constraint = constraints.get(W_constraint)
        self.b_constraint = constraints.get(b_constraint)

        self.bias = bias
        self.step_dim = step_dim
        self.features_dim = 0

        super(Attention, self).__init__(**kwargs)

    def build(self, input_shape):
        assert len(input_shape) == 3

        self.W = self.add_weight(name='{}_W'.format(self.name),
                                 shape=(input_shape[-1],),
                                 initializer=self.init,
                                 regularizer=self.W_regularizer,
                                 constraint=self.W_constraint)
        self.features_dim = input_shape[-1]

        if self.bias:
            self.b = self.add_weight(name='{}_b'.format(self.name),
                                     shape=(input_shape[1],),
                                     initializer='zero',
                                     regularizer=self.b_regularizer,
                                     constraint=self.b_constraint)
        else:
            self.b = None

        self.built = True

    def compute_mask(self, input, input_mask=None):
        # do not pass the mask to the next layers
        return None

    def call(self, x, mask=None):
        features_dim = self.features_dim
        step_dim = self.step_dim

        e = K.reshape(K.dot(K.reshape(x, (-1, features_dim)), K.reshape(self.W, (features_dim, 1))), (-1, step_dim))  # e = K.dot(x, self.W)
        if self.bias:
            e += self.b
        e = K.tanh(e)

        a = K.exp(e)
        # apply mask after the exp. will be re-normalized next
        if mask is not None:
            # cast the mask to floatX to avoid float64 upcasting in theano
            a *= K.cast(mask, K.floatx())
        # in some cases especially in the early stages of training the sum may be almost zero
        # and this results in NaN's. A workaround is to add a very small positive number ε to the sum.
        a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
        a = K.expand_dims(a)

        c = K.sum(a * x, axis=1)
        return c

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.features_dim


In [0]:

model_a = Sequential()
model_a.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_a.add(LSTM(units=units,return_sequences = True))
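# step_dim must equal the padded sequence length
model_a.add(Attention(max_length))
# Attention already collapses the time axis to (batch, features), so no pooling layer is needed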
model_a.add(Dense(hidden_dims,activation='relu'))
model_a.add(Dropout(0.5))
model_a.add(Dense(6,activation='sigmoid'))
model_a.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_a.summary()

#Fit model
history = model_a.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)





You can also try different layers, but to keep the results comparable, I kept the other layers the same.
Below is another approach worth a shot; it is said to reach 99% accuracy:
https://www.kaggle.com/sanket30/cudnnlstm-lstm-99-accuracy
You can dig in to find out why that model does better.

In [0]:
loss,accuracy = model_a.evaluate(x_train,y_train)
print("Training accuracy: {:.5f}".format(accuracy))
loss,accuracy = model_a.evaluate(x_test,y_test)
print("Testing Accuracy: {:.5f}".format(accuracy))

case 4.1: Add one more LSTM layer

In [0]:
model_a = Sequential()
model_a.add(Embedding(vocab_size, emd_size, input_length=max_length,))
model_a.add(LSTM(units=units,return_sequences = True))
# One additional LSTM layer stacked on top of the first
model_a.add(LSTM(units=units, return_sequences=True))
# step_dim is a required argument and must equal the padded sequence length
model_a.add(Attention(max_length))
model_a.add(Dense(hidden_dims,activation='relu'))
model_a.add(Dropout(0.2))
model_a.add(Dense(6,activation='sigmoid'))
model_a.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
model_a.summary()

#Fit model
history = model_a.fit(x_train,y_train, epochs=epochs,validation_data=(x_test,y_test),batch_size=batch_size)

## 5. BERT
As mentioned in the comment, "it's not difficult to implement BERT", so BERT is not implemented here. In addition, you may not want to download a pretrained model, since streaming it takes a while.