## Contents

1. [Introduction](#1)
2. [Pre-processing](#2)
3. [Simple LSTM Model](#3)

## Introduction

This notebook follows on from my previous notebook for the Quora Insincere Questions Classification challenge. In the first notebook, I built a quick baseline model using TF-IDF features with Logistic Regression and Linear SVC. I did little pre-processing apart from removing punctuation, and I used unigrams and bigrams.

This time, I will try something more complex: an LSTM-based model that uses pre-trained GloVe word embeddings. Having glanced at the Kaggle kernels, a relatively simple deep learning model should aim for a leaderboard score of 0.67+, and with tuning and added complexity I should be able to approach the winners' scores of ~0.71.

In [None]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

## Pre-Processing


In [None]:
train_set = pd.read_csv("../input/quora-insincere-questions-classification/train.csv")
test_set = pd.read_csv("../input/quora-insincere-questions-classification/test.csv")
print("Train shape : ",train_set.shape)
print("Test shape : ",test_set.shape)

In [None]:
train_set.head()

### Initial Steps
1. Train and validation set split
2. Fill missing values
3. Tokenize sentences
4. Pad sentences (i.e. if a question is shorter than 100 words, fill the rest with zeros; longer questions are truncated)
5. Get target values 
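
To make the padding step concrete, here is a minimal pure-Python sketch of what padding to a fixed length does. It assumes the Keras `pad_sequences` defaults (zeros added at the front, over-long sequences cut from the front); the helper `pad_pre` and the toy vocabulary are hypothetical and only for illustration.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (or left-truncate) a list of token ids to exactly maxlen (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                    # keep only the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # fill the front with the pad value

# Hypothetical toy vocabulary; index 0 is reserved for padding.
vocab = {"how": 1, "do": 2, "i": 3, "learn": 4, "python": 5}
tokens = [vocab[w] for w in "how do i learn python".split()]

short = pad_pre(tokens[:3], maxlen=5)      # shorter than maxlen: zeros go in front
long_ = pad_pre(tokens + [1, 2], maxlen=5) # longer than maxlen: the front is dropped
print(short)
print(long_)
```

Every question ends up with the same length, which is what lets the batches be stacked into a single 2-D array.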

In [None]:
train_df, val_df = train_test_split(train_set,test_size=0.1,random_state= 123)

## some config values 
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e num rows in embedding vector)
maxlen = 100 # max number of words in a question to use

## fill up the missing values
train_X = train_df["question_text"].fillna("_na_").values
val_X = val_df["question_text"].fillna("_na_").values
test_X = test_set["question_text"].fillna("_na_").values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values


## Build Bi-LSTM Using Pretrained GloVe Vectors

Load embedding matrix

In [None]:
EMBEDDING_FILE = '../input/quora-insincere-questions-classification/embeddings/glove.840B.300d/glove.840B.300d.txt'
# Load the embedding weights into a dictionary by iterating through every line of the file.
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

with open(EMBEDDING_FILE, encoding='utf-8') as f:
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)
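
As a rough illustration of the parsing step, the sketch below runs the same `get_coefs` idea on two made-up lines in GloVe's text format (a word followed by space-separated floats). The in-memory file and its 3-dimensional vectors are invented for the example; the real file has 300 values per line.

```python
import io
import numpy as np

def get_coefs(word, *arr):
    # First token is the word, the remaining tokens are the vector components.
    return word, np.asarray(arr, dtype="float32")

# Two made-up lines standing in for the real embedding file.
fake_file = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
toy_index = dict(get_coefs(*line.rstrip().split(" ")) for line in fake_file)

print(toy_index["cat"])
print(toy_index["dog"].shape)
```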

In [None]:
# Get the mean and standard deviation of the pretrained weights so that the randomly
# initialised rows (words without a GloVe vector) follow the same statistics.
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]

In [None]:
# The embedding matrix has shape (number of words in vocab) x (embedding size),
# where the embedding size matches the pretrained dimension.
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

# Overwrite the random rows with pretrained vectors for the words that appear in both
# our own vocabulary and the loaded GloVe dictionary.
for word, i in word_index.items():
    if i >= nb_words: continue  # row index must stay inside the matrix
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
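
The fill logic can be checked on a toy example. This is only a sketch under made-up assumptions: `word_index` and `pretrained` below are tiny stand-ins for `tokenizer.word_index` and the GloVe dictionary, and the vectors have 3 dimensions instead of 300.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_size = 3
max_features = 4  # toy vocabulary cap (hypothetical)

# Toy stand-ins for tokenizer.word_index (1-based) and the pretrained dictionary.
word_index = {"the": 1, "cat": 2, "zzzq": 3, "rare": 4}
pretrained = {
    "the": np.ones(embed_size, dtype="float32"),
    "cat": np.full(embed_size, 2.0, dtype="float32"),
}

# Random initialisation with the same mean/std as the pretrained vectors.
all_embs = np.stack(list(pretrained.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
nb_words = min(max_features, len(word_index))
matrix = rng.normal(emb_mean, emb_std, (nb_words, embed_size))

for word, i in word_index.items():
    if i >= nb_words:
        continue            # index falls outside the matrix, skip it
    vec = pretrained.get(word)
    if vec is not None:
        matrix[i] = vec     # known word: copy its pretrained vector

print(matrix)
```

Rows 1 and 2 hold the pretrained vectors, while row 0 (padding) and row 3 (a word missing from the pretrained dictionary) keep their random initial values.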

Now build bidirectional LSTM model

In [None]:
# Define the input layer. shape=(maxlen,) fixes the sequence length; Keras infers the batch dimension
inp = Input(shape=(maxlen, )) # maxlen=100 as defined earlier
# Pass the input to the Embedding layer: weights= passes in the embedding matrix and trainable=False freezes the pretrained weights
X = Embedding(max_features,embed_size,weights=[embedding_matrix],trainable=False)(inp)
# Pass through a bidirectional LSTM (units=64 per direction, so the output dimension is 128)
X = Bidirectional(LSTM(units=64, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))(X)
# The global maxpooling layer reduces dimensions from 3d to 2d
X = GlobalMaxPool1D()(X)
# X = Dropout(0.1)(X)
X = Dense(units=16, activation='relu')(X)
X = Dropout(0.1)(X)
# The last layer needs only a single output unit since it's binary classification. Sigmoid forces the output between 0 and 1
X = Dense(1, activation="sigmoid")(X)
model = Model(inputs=inp, outputs=X)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
model.summary()  # prints the summary itself; wrapping it in print() would also print "None"
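
The shape bookkeeping in the comments above can be sketched with plain NumPy. The random arrays below are stand-ins for the forward and backward LSTM outputs (not real LSTM activations), and the batch size of 2 and 5 timesteps are arbitrary toy values; the sketch assumes the two directions are concatenated, which is the Keras `Bidirectional` default.

```python
import numpy as np

batch, timesteps, units = 2, 5, 64
rng = np.random.default_rng(0)

# Stand-ins for the per-timestep outputs of the two LSTM directions.
forward = rng.normal(size=(batch, timesteps, units))
backward = rng.normal(size=(batch, timesteps, units))

# Concatenating the directions doubles the feature dimension: 64 + 64 = 128.
bi_output = np.concatenate([forward, backward], axis=-1)

# Global max pooling takes the max over the time axis: 3-D -> 2-D.
pooled = bi_output.max(axis=1)

print(bi_output.shape, pooled.shape)
```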


Train Model

In [None]:
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Make Predictions
- To find the threshold that maximises the F1 score, we evaluate thresholds from 0.1 to 0.5 on the validation set

In [None]:
pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.05):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))
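
For intuition, the same sweep can be sketched without scikit-learn by computing F1 directly from true positives, false positives and false negatives. The labels and probabilities below are made up for illustration and are not the model's validation output; `f1_at` is a hypothetical helper.

```python
import numpy as np

def f1_at(y_true, probs, thresh):
    """F1 = 2*TP / (2*TP + FP + FN) for predictions probs > thresh."""
    pred = (probs > thresh).astype(int)
    tp = int(((pred == 1) & (y_true == 1)).sum())
    fp = int(((pred == 1) & (y_true == 0)).sum())
    fn = int(((pred == 0) & (y_true == 1)).sum())
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
probs = np.array([0.05, 0.12, 0.22, 0.41, 0.33, 0.62, 0.18, 0.27])

thresholds = np.round(np.arange(0.1, 0.501, 0.05), 2)
scores = {float(t): f1_at(y_true, probs, t) for t in thresholds}
best_thresh = max(scores, key=scores.get)

for t, s in scores.items():
    print("F1 score at threshold {0} is {1:.3f}".format(t, s))
print("Best threshold:", best_thresh)
```

Picking the threshold programmatically, rather than reading it off the printed list, keeps the value used for the submission in sync with the validation results.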

Looks like ~0.3 is the best threshold. Let's predict on the final test set and make the submission.

In [None]:
pred_glove_test_y = model.predict([test_X], batch_size=1024, verbose=1)

Set threshold and then write submission csv

In [None]:
pred_test_y = (pred_glove_test_y>0.35).astype(int)
out_df = pd.DataFrame({"qid":test_set["qid"].values})
out_df['prediction'] = pred_test_y
out_df.head()

In [None]:
out_df.to_csv("submission.csv", index=False)

The competition result is an F1 score of ~0.66, a solid improvement on the TF-IDF result of ~0.55.