# Importing Required Packages
- nltk for Natural Language Processing
- pandas for reading and writing csv files
- numpy for manipulation of arrays
- tensorflow for keras Deep Learning Module
- keras for Deep Learning Modules
- Sequential for building the deep neural network
- LSTM for Long Short-Term Memory layers in the RNN (I found https://colah.github.io/posts/2015-08-Understanding-LSTMs/ very useful for learning about LSTMs)
- Dense, Flatten, Embedding for use as layers in the NNs
- train_test_split to split our training data to training and validation sets


In [1]:
import nltk
import pandas as pd
import numpy as np
import tensorflow as tf
import keras

from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense, Flatten, Embedding, Activation
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


I am using my own Google Drive to store my datasets. My university provides unlimited storage on my Drive, so it is easier for me to export the data to the desired location and keep all my files in one place. That keeps everything organized.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code

Enter your authorization code:
··········
Mounted at /content/drive


Setting up a path variable so that I do not need to type the full path every time.

In [0]:
path ='/content/drive/My Drive/Datasets/ToxicComment'

# Data Initialization

In [4]:
train = pd.read_csv(path + '/train.csv')
test = pd.read_csv(path + '/test.csv')
train.sample(5)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
77651,d003a549b62a8b9e,perhaps one of these notes to link to https://...,0,0,0,0,0,0
78179,d1432a20b6bb361f,January 2009 \n Please stop your disruptive ed...,0,0,0,0,0,0
77045,ce57034a2003e303,These tags are completely bogus... the creatio...,0,0,0,0,0,0
23742,3eb5f46517faa982,"""\n\nOrphaned fair use image (Image:BTT UK 1.j...",0,0,0,0,0,0
158824,f4679504e5761d89,"Check the map on the right, or here, or here...",0,0,0,0,0,0


In [5]:
train.columns

Index(['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat',
       'insult', 'identity_hate'],
      dtype='object')

The feature engineering below comes from an open Kaggle notebook (https://www.kaggle.com/eikedehling/feature-engineering). A huge thanks to Eike Dehling for such an awesome set of features.
We need to apply it to both the training and the testing sets.

In [0]:
for dataset in [train,test]:
  dataset['length'] = dataset['comment_text'].apply(len)
  dataset['num_words'] = dataset['comment_text'].str.split().apply(len)
  dataset['caps_letter'] = dataset['comment_text'].apply(lambda capsL: sum(1 for c in capsL if c.isupper()))
  dataset['caps_word'] = dataset['comment_text'].apply(lambda capsW : sum(1 for c in capsW.split() if c.isupper()))
  dataset['exclaimations'] = dataset['comment_text'].apply(lambda exclaim : exclaim.count('!'))
  dataset['questions'] = dataset['comment_text'].apply(lambda questions : questions.count('?'))
  dataset['punct'] = dataset['comment_text'].apply(lambda punct: sum(punct.count(p) for p in '.,;:'))
  dataset['sp_char'] = dataset['comment_text'].apply(lambda sp_char: sum(sp_char.count(s) for s in '@#$%^*()-'))
  dataset['unique_words'] = dataset['comment_text'].apply(lambda text: len(set(w for w in text.split())))
  dataset['num_emoji'] = dataset['comment_text'].apply(lambda emoji: sum(emoji.count(w) for w in (':-)', ':)', ';-)', ';)')))

  dataset['capsL_to_length'] = dataset['caps_letter'] / dataset['length']
  dataset['capsW_to_length'] = dataset['caps_word'] / dataset['length']
  dataset['unique_to_words'] = dataset['unique_words']/dataset['num_words']
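
To see what a few of these features evaluate to, here is a minimal sketch that computes them for a single string in plain Python. The function name `comment_features` and the example comment are made up for illustration, and the zero-denominator guard on the ratios is an addition that the pandas code above does not have.

```python
def comment_features(text):
    """Compute a few of the hand-crafted features above for one comment."""
    words = text.split()
    length = len(text)
    num_words = len(words)
    caps_letter = sum(1 for c in text if c.isupper())
    caps_word = sum(1 for w in words if w.isupper())
    unique_words = len(set(words))
    return {
        'length': length,
        'num_words': num_words,
        'caps_letter': caps_letter,
        'caps_word': caps_word,
        'exclaimations': text.count('!'),
        'unique_words': unique_words,
        # Ratios fall back to 0.0 for empty text (a guard added for this sketch).
        'capsL_to_length': caps_letter / length if length else 0.0,
        'unique_to_words': unique_words / num_words if num_words else 0.0,
    }


# Hypothetical example comment.
features = comment_features("STOP it now, stop it NOW!!")
print(features)
```

Note that `split()` keeps punctuation attached to words, so `now,` and `NOW!!` count as distinct words, and `NOW!!` still counts as an all-caps word.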

Taking the labels from the training set and putting them in a separate variable for easier use later.

In [0]:
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
# Select the six label columns in one step; .copy() keeps y_train independent of train.
y_train = train[label_cols].copy()

Here I am going to cap each comment at 500 words. 500 words should be enough to detect what type of comment it is going to be, and the cap also saves computation. Note that pad_sequences truncates from the front by default, so for longer comments it is the last 500 words that are kept, not the first.
The vocabulary size is capped at 1000000 words. The tokenizer maps each word to an integer index, which identifies the same position a one-hot vector would mark with a 1; the Embedding layer then turns each index into a dense vector.
https://hackernoon.com/what-is-one-hot-encoding-why-and-when-do-you-have-to-use-it-e3c6186d008f
- Explains what one-hot encoding is.

https://medium.com/@pemagrg/one-hot-encoding-129ccc293cda
- Gives an idea of why we should do it.

In [0]:
maxlen = 500
max_words = 1000000
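
To make the relation between an integer word index and a one-hot vector concrete, here is a small NumPy sketch. The three-word vocabulary is hypothetical; index 0 is left unused because the Keras Tokenizer starts its word indices at 1.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is reserved, as in the Keras Tokenizer.
vocab = {'you': 1, 'are': 2, 'great': 3}
tokens = ['you', 'are', 'great']

# Integer encoding: one small integer per word.
indices = np.array([vocab[t] for t in tokens])

# One-hot encoding: one row per word, one column per vocabulary slot.
one_hot = np.eye(len(vocab) + 1, dtype=np.int8)[indices]

print(indices)
print(one_hot)
```

With a vocabulary cap of 1000000, each one-hot row would have a million columns, which is why passing integer indices to an embedding layer is the more compact choice.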

We collect all the comments into plain Python lists so that we can apply the tokenizer and convert them to sequences of integer word indices.

In [0]:
train_comments = train['comment_text']
test_comments = test['comment_text']
# Convert each column of comments to a plain Python list for the tokenizer.
train_sentences = train_comments.tolist()
test_sentences = test_comments.tolist()

In [0]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(train_sentences)
sequences = tokenizer.texts_to_sequences(train_sentences)
word_index = tokenizer.word_index

train_padded = pad_sequences(sequences, maxlen=maxlen)

# Reuse the tokenizer fitted on the training comments so that every word
# keeps the index the model sees during training; a tokenizer fitted
# separately on the test set would assign different indices to the same words.
sequences_test = tokenizer.texts_to_sequences(test_sentences)
test_sequences = pad_sequences(sequences_test, maxlen=maxlen)
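
The embedding layer learns one vector per word index, so an index only means something relative to the vocabulary it came from. Here is a minimal pure-Python sketch of the two ideas in this cell: building a vocabulary and padding to a fixed length. The helpers `fit_vocab`, `to_sequence`, and `pad_pre` are hypothetical stand-ins, not the Keras implementation. In particular, the real Tokenizer orders indices by word frequency and strips punctuation, which this sketch does not. `pad_pre` mimics the `pad_sequences` defaults (pad and truncate at the front).

```python
def fit_vocab(texts):
    """Give each distinct lowercase word an index starting at 1 (0 is padding)."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def to_sequence(text, vocab):
    """Map known words to their indices and drop unknown words."""
    return [vocab[w] for w in text.lower().split() if w in vocab]


def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


train_vocab = fit_vocab(['you are great', 'you are rude'])
test_seq = to_sequence('they are rude', train_vocab)
padded = pad_pre(test_seq, 5)
truncated = pad_pre([1, 2, 3, 4, 5, 6], 4)

# A vocabulary fitted separately on the test text numbers the same word differently.
separate_vocab = fit_vocab(['they are rude'])
print(train_vocab['rude'], separate_vocab['rude'])
```

In this toy example the word "rude" gets index 4 from the training vocabulary but index 3 from a separately fitted one, so the same word would look up a different embedding vector.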

# Splitting the Training and Validation set

In [0]:
labels = y_train
x_train, x_validate, y_train, y_validate = train_test_split(train_padded, labels, test_size=0.05, random_state=42)

# Building the LSTM Model

I will be running the model only once. With my NN model and the train/validation split above, it took about 2 hours to complete 10 epochs, reaching an accuracy of 99.05% on the training set and 98.76% on the validation set.

In [0]:
%%time
model = Sequential()
model.add(Embedding(max_words, 32))
model.add(LSTM(32))
model.add(Dense(6, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
fit_lstm = model.fit(x_train, y_train, epochs=10,batch_size=128, validation_data= (x_validate, y_validate))
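
The output layer uses six sigmoid units with binary cross-entropy because a comment can carry several labels at once, so each label is scored independently instead of competing in a softmax. Here is a small NumPy sketch of that computation; the logits and targets are made-up values and the helper functions are illustrative, not the Keras internals.

```python
import numpy as np


def sigmoid(z):
    """Squash each logit independently into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels; clipping avoids log(0)."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))


# Made-up logits for one comment, in the order:
# toxic, severe_toxic, obscene, threat, insult, identity_hate
logits = np.array([[2.0, -3.0, 1.5, -4.0, 1.0, -2.5]])
probs = sigmoid(logits)
y_true = np.array([[1, 0, 1, 0, 1, 0]])

loss = binary_crossentropy(y_true, probs)
predicted_labels = (probs > 0.5).astype(int)
print(probs.round(3), loss, predicted_labels)
```

The six probabilities do not need to sum to 1, which is what lets one comment be flagged as, say, both toxic and obscene.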


We are going to save the model immediately after running 10 epochs. To know more about loading and saving models, the following link will be a good source. 
https://machinelearningmastery.com/save-load-keras-deep-learning-models/

In [0]:
model.save(path + '/commentsclassification.h5')

# Loaded weights Initialization and beyond

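This code is run after the trained model has been saved. Since we need a fresh start, I restarted the kernel and ran all the code except 'Building the LSTM Model'.

This is a simple illustration of the idea behind transfer learning: reusing weights that have already been trained. I trained a model for 2 hours on a particular set of data, and I am making my predictions using the weights from the last step. (Strictly speaking, transfer learning reuses such weights on a different but related task; here the saved model is simply reloaded for the same task.)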

I have a very strong feeling that transfer learning is the future of building an artificial intelligence.

https://machinelearningmastery.com/transfer-learning-for-deep-learning/
is a good start to learn more about transfer learning. 

If we had to do an image classification task, it would be a good idea to start from the pretrained weights of the Inception network. (https://keras.io/applications/#inceptionv3)

Since we have to do word-based classification, we need to build our own network.

In [12]:
from keras.models import load_model
model = load_model(path + '/commentsclassification.h5')

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where

In [13]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, None, 32)          32000000  
_________________________________________________________________
lstm_1 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_1 (Dense)              (None, 6)                 198       
Total params: 32,008,518
Trainable params: 32,008,518
Non-trainable params: 0
_________________________________________________________________
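
The parameter counts in this summary can be checked by hand. The sketch below uses the layer sizes from the model definition; the variable names are only for illustration.

```python
vocab_size = 1_000_000  # max_words
embed_dim = 32          # embedding vector size
lstm_units = 32         # LSTM hidden size
num_labels = 6          # output units

# One embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# Four weight sets (input, forget and output gates plus the cell candidate),
# each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)

# One weight per LSTM unit per label, plus one bias per label.
dense_params = lstm_units * num_labels + num_labels

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Almost all of the 32,008,518 parameters sit in the embedding table, so lowering the vocabulary cap is the most direct way to shrink this model.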


# Predictions

In [0]:
%%time
predict = model.predict(test_sequences)

In [0]:
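# Assumption: `sample` is the sample submission file, stored alongside the other CSVs as 'sample_submission.csv'.
sample = pd.read_csv(path + '/sample_submission.csv')
sample[["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]] = predict

In [0]:
sample.to_csv(path + '/submission.csv', index=False)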