## Contents

1. [Introduction](#1)
2. [Pre-processing](#2)
3. [Simple LSTM Model](#3)

## Introduction

In the previous notebook, I built a baseline deep learning model using a bi-directional LSTM. Here I want to test whether a bit more text processing and cleaning, aimed at better matching the vocabulary of the word embeddings, helps improve performance. I will follow along with this kernel: https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2

In [1]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics
import operator 
import re
import gc

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, CuDNNGRU, Conv1D,CuDNNLSTM
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers

Using TensorFlow backend.


## Pre-Processing


In [2]:
train_set = pd.read_csv("../input/quora-insincere-questions-classification/train.csv")
test_set = pd.read_csv("../input/quora-insincere-questions-classification/test.csv")
print("Train shape : ",train_set.shape)
print("Test shape : ",test_set.shape)

Train shape :  (1306122, 3)
Test shape :  (375806, 2)


In [3]:
train_set.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


### Vocab and Coverage Functions

In [4]:
# creates dict which contains vocab and frequencies
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [5]:
# Check how much of the vocab is covered by the word embeddings
def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            # word has no pretrained vector: record its frequency instead
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.3%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.3%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words

### Extra Processing Starts Here

In [6]:
# load embeddings
EMBEDDING_FILE = '../input/quora-insincere-questions-classification/embeddings/glove.840B.300d/glove.840B.300d.txt'
# Transfer the embedding weights into a dictionary by iterating through every line of the file.
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

# Use a context manager so the file handle is closed once the dictionary is built
with open(EMBEDDING_FILE, encoding='utf-8') as f:
    embeddings_index = dict(get_coefs(*o.split(" ")) for o in f)
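
As a rough illustration of what the loading step does, here is a minimal sketch that parses two made-up lines in the GloVe text layout (a token followed by its vector values, separated by single spaces). The in-memory file, the 3-value vectors and the name `toy_index` are assumptions for the example; the real file has 300 values per line.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a token followed by its vector.
# Real glove.840B.300d vectors have 300 values; 3 are used here for brevity.
fake_glove = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")

def get_coefs(word, *arr):
    # First field is the token, the remaining fields are the vector values
    return word, np.asarray(arr, dtype="float32")

with fake_glove as f:
    toy_index = dict(get_coefs(*line.rstrip().split(" ")) for line in f)

print(sorted(toy_index))        # ['cat', 'the']
print(toy_index["cat"].shape)   # (3,)
```

Each dictionary value is a `float32` array, so a lookup such as `toy_index["cat"]` returns the vector directly.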

In [7]:
# vocab = build_vocab(train_set['question_text'])

In [8]:
# Check coverage in glove embeddings
# oov_glove = check_coverage(vocab, embeddings_index)

We have a large vocab, most of which is probably relatively rare words, so we only have embeddings for 33% of it. Looking at the whole text, however, coverage is a solid 88%, and the covered words are the more predictive ones anyway. Still, we are wasting more than 10% of the data because it doesn't have an embedding.
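
To make the difference between the two coverage numbers concrete, here is a small sketch on a toy corpus. The questions and the set of "embedded" words are made up; the point is only that rare unknown words pull down vocab coverage much more than text coverage.

```python
from collections import Counter

# Toy corpus and a toy "embedding vocabulary" (both hypothetical).
questions = ["the cat sat", "the cat ran", "the zyxq sat"]
embedded_words = {"the", "cat", "sat"}

# Word -> frequency, the same information build_vocab collects
vocab = Counter(word for q in questions for word in q.split())

known_types = [w for w in vocab if w in embedded_words]
known_tokens = sum(vocab[w] for w in known_types)

vocab_coverage = len(known_types) / len(vocab)          # share of distinct words
text_coverage = known_tokens / sum(vocab.values())      # share of all word occurrences

print(f"vocab coverage: {vocab_coverage:.1%}")   # 60.0%
print(f"text coverage:  {text_coverage:.1%}")    # 77.8%
```

Two of the five distinct words are unknown, but each occurs only once, so they account for just two of the nine tokens.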

In [9]:
# oov_glove[:10]

### Contractions

In [10]:
# Mapping taken from another kernel
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [11]:
def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known

In [12]:
# print("- Known Contractions -")
# print("   Glove :")
# print(known_contractions(embeddings_index))

Not a lot of the contractions exist in the GloVe embeddings, so we clean them using the contraction mapping above.

In [13]:
def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

train_set['treated_question'] = train_set['question_text'].apply(lambda x: clean_contractions(x, contraction_mapping))
test_set['treated_question'] = test_set['question_text'].apply(lambda x: clean_contractions(x, contraction_mapping))
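
A quick sketch of how the contraction cleaning behaves on sample sentences. It uses a two-entry `toy_mapping` (an assumption for brevity) and a compact restatement of the function above. Because the lookup is per whitespace-separated token, a contraction with punctuation attached is left untouched.

```python
def clean_contractions(text, mapping):
    # Normalise curly and accent-style apostrophes to a plain ASCII one
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    # Whole-token lookup: a token is only expanded on an exact match
    return " ".join(mapping.get(t, t) for t in text.split(" "))

toy_mapping = {"don't": "do not", "what's": "what is"}

expanded = clean_contractions("I don’t know what's wrong", toy_mapping)
missed = clean_contractions("Why don't?", toy_mapping)

print(expanded)  # I do not know what is wrong
print(missed)    # Why don't?   ("don't?" is not a key in the mapping)
```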

In [14]:
# vocab = build_vocab(train_set['treated_question'])
# print("Glove : ")
# oov_glove = check_coverage(vocab, embeddings_index)

Now let's deal with punctuation and special characters.

In [15]:
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\×™√²—–&'
def unknown_punct(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

In [16]:
# print("How many symbols/punct do not exist in the embedding:")
# print(unknown_punct(embeddings_index, punct))


- We use a map to replace unknown characters with known ones.
- We make sure there are spaces between words and punctuation, so that punctuation marks are treated as separate tokens.

In [17]:
punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', }

In [18]:
def clean_special_chars(text, punct, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])
    
    for p in punct:
        text = text.replace(p, f' {p} ')
    
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}  # Other special characters that have to be dealt with last
    for s in specials:
        text = text.replace(s, specials[s])
    
    return text

In [19]:
train_set['treated_question'] = train_set['treated_question'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))
test_set['treated_question'] = test_set['treated_question'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))
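
The two steps (map unknown symbols, then pad punctuation with spaces) can be sketched on one sentence. The short `toy_punct` and `toy_map` are stand-ins for the full `punct` and `punct_mapping`, and the function is a trimmed version of the one above without the final `specials` pass.

```python
def clean_special_chars(text, punct, mapping):
    for p in mapping:                 # step 1: swap unknown symbols for known ones
        text = text.replace(p, mapping[p])
    for p in punct:                   # step 2: pad punctuation with spaces
        text = text.replace(p, f" {p} ")
    return text

toy_punct = "?,"
toy_map = {"’": "'", "π": "pi"}

cleaned = clean_special_chars("Is π irrational,or not?", toy_punct, toy_map)
print(cleaned.split())
# ['Is', 'pi', 'irrational', ',', 'or', 'not', '?']
```

After the padding step, a whitespace split yields the words and the punctuation marks as separate tokens, which is what lets them match entries in the embeddings.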

In [20]:
# vocab = build_vocab(train_set['treated_question'])
# print("Glove : ")
# oov_glove = check_coverage(vocab, embeddings_index)


In [21]:
# len(vocab)

In [22]:
# oov_glove[:50]

Coverage is much better after dealing with punctuation: we are now up to 99.5% of all text and 75% of the vocab. What's left are very specific words that we wouldn't have embeddings for unless we had a corpus dedicated to that topic. There are a lot of crypto words!

#### Now we can build the model like last time and see if there is any benefit

### Initial Steps
1. Train and validation set split
2. Fill missing values
3. Tokenize sentences
4. Pad sentences (i.e. if a question is shorter than 100 words, fill the rest with zeros)
5. Get target values 

In [23]:
train_df, val_df = train_test_split(train_set,test_size=0.1,random_state= 123)

## some config values 
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (specify size of vocab to use)
maxlen = 100 # max number of words in a question to use

## fill up the missing values
train_X = train_df["treated_question"].fillna("_na_").values
val_X = val_df["treated_question"].fillna("_na_").values
test_X = test_set["treated_question"].fillna("_na_").values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values
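
To show what the padding step produces, here is a small NumPy sketch that mimics the default behaviour of Keras `pad_sequences`, which pads and truncates at the front. The helper `pad_pre` is a hypothetical stand-in written for this illustration, not part of Keras.

```python
import numpy as np

def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value` and keep only the last `maxlen` ids of longer rows."""
    out = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        if not seq:
            continue                      # empty question stays all padding
        trunc = seq[-maxlen:]             # truncation drops the earliest tokens
        out[row, -len(trunc):] = trunc
    return out

toy_sequences = [[5, 2], [7, 1, 4, 9, 3]]
padded = pad_pre(toy_sequences, maxlen=4)
print(padded)
# [[0 0 5 2]
#  [1 4 9 3]]
```

The short sequence gets zeros on the left, and the long one loses its first token, so every row ends up with exactly `maxlen` ids.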


## Build Bi-LSTM Using Pretrained GloVe Vectors

In [24]:
# We get the mean and standard deviation of the embedding weights so that we can keep the
# same statistics for the rest of our own randomly generated weights.
# np.stack expects a sequence, so the dict values are wrapped in a list.
all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]



In [25]:
# We set the embedding size to the pretrained dimension, since we are replicating it.
# The matrix size will be (number of words in vocab) x (embedding size).
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))

# Fill the newly created embedding matrix with the vectors of the words that appear in both
# our own dictionary and the loaded pretrained embeddings.
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
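
The same fill-in logic on a toy scale, as a sketch: rows are first drawn from a normal distribution with the mean and standard deviation of the pretrained vectors, then overwritten wherever a pretrained vector exists. The 2-dimensional vectors, the word index and `toy_max_features` are all made up for the example, and the matrix is sized by the vocabulary cap for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-d "pretrained" vectors and a tokenizer-style 1-based word index.
toy_embeddings = {"the": np.array([1.0, 2.0]), "cat": np.array([3.0, 4.0])}
toy_word_index = {"the": 1, "cat": 2, "zyxq": 3}
toy_max_features = 4          # row 0 is left for the padding id

all_embs = np.stack(list(toy_embeddings.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()

# Random initialisation with the same statistics as the pretrained vectors
matrix = rng.normal(emb_mean, emb_std, (toy_max_features, all_embs.shape[1]))
for word, i in toy_word_index.items():
    if i >= toy_max_features:
        continue
    vector = toy_embeddings.get(word)
    if vector is not None:
        matrix[i] = vector      # known word: copy the pretrained vector

print(matrix[1], matrix[2])     # the pretrained rows for "the" and "cat"
print(matrix[3])                # "zyxq" keeps its random initialisation
```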

Now build bidirectional LSTM model

In [26]:
# Define the input layer. shape=(maxlen,) sets the sequence length; Keras infers the batch dimension
inp = Input(shape=(maxlen, ))  # maxlen=100 as defined earlier
# Pass the input to the Embedding layer: use the weights parameter to pass in the embedding matrix, and trainable=False since we use pretrained weights
X = Embedding(max_features, embed_size, weights=[embedding_matrix], trainable=False)(inp)
# Pass through a bi-directional LSTM cell (units=64, but the output dim is 128 because of the 2 directions)
X = Bidirectional(CuDNNLSTM(units=64, return_sequences=True))(X)
X = Dropout(0.1)(X)
# The global maxpooling layer reduces dimensions from 3d to 2d
X = GlobalMaxPool1D()(X)
X = Dense(units=16, activation='relu')(X)
X = Dropout(0.1)(X)
# The last layer only needs a 1-dim output since it's binary classification. Sigmoid forces the output between 0 and 1
X = Dense(1, activation="sigmoid")(X)
model = Model(inputs=inp, outputs=X)
model.compile(loss='binary_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
print(model.summary())


_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         (None, 100)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 100, 300)          15000000  
_________________________________________________________________
bidirectional_1 (Bidirection (None, 100, 128)          187392    
_________________________________________________________________
dropout_1 (Dropout)          (None, 100, 128)          0         
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 128)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 16)                2064      
_________________________________________________________________
dropout_2 (Dropout)          (None, 16)                0         
__________
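
The parameter counts in the summary can be reproduced by hand, which is a useful sanity check on the layer shapes. The sketch below assumes each CuDNNLSTM direction carries two bias vectors per gate; that assumption is what makes the arithmetic match the 187,392 reported above.

```python
embed_size, max_features, units = 300, 50000, 64

# Embedding: one 300-d vector per word id
embedding_params = max_features * embed_size

# Assumption: each CuDNNLSTM direction holds 4 gates with input weights,
# recurrent weights, and two bias vectors per gate.
per_direction = 4 * (units * (embed_size + units) + 2 * units)
bilstm_params = 2 * per_direction

# Dense(16) on the 128-d pooled output: weights plus biases
dense_params = (2 * units) * 16 + 16

print(embedding_params, bilstm_params, dense_params)
# 15000000 187392 2064
```

Since the embedding layer is frozen, its 15,000,000 weights are not updated during training.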

Train Model

In [27]:
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Train on 1175509 samples, validate on 130613 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f6094881c88>

Make Predictions
- To find the threshold that maximises the F1 score, we evaluate F1 on the validation set at thresholds from 0.1 to 0.5.

In [28]:
pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5857011147431062
F1 score at threshold 0.11 is 0.5942631736912061
F1 score at threshold 0.12 is 0.6017543859649123
F1 score at threshold 0.13 is 0.6083959899749374
F1 score at threshold 0.14 is 0.614620815304031
F1 score at threshold 0.15 is 0.6196634682241691
F1 score at threshold 0.16 is 0.625035294117647
F1 score at threshold 0.17 is 0.6294246732182043
F1 score at threshold 0.18 is 0.6342547844577614
F1 score at threshold 0.19 is 0.6392005094792534
F1 score at threshold 0.2 is 0.6432719511589815
F1 score at threshold 0.21 is 0.6472065916398714
F1 score at threshold 0.22 is 0.6499669732229053
F1 score at threshold 0.23 is 0.652633740630455
F1 score at threshold 0.24 is 0.6561867704280155
F1 score at threshold 0.25 is 0.6581536529202427
F1 score at threshold 0.26 is 0.6602567489037985
F1 score at threshold 0.27 is 0.6620402498265093
F1 score at threshold 0.28 is 0.6626804567981038
F1 score at threshold 0.29 is 0.6649649704013468
F1 score at threshold 0.3

> Looks like ~0.29 is the best threshold. Let's predict on the final test set and make the submission.
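
Rather than reading the best threshold off the printed list, it can be picked programmatically. The sketch below runs the same sweep on made-up labels and probabilities (stand-ins for `val_y` and the model output) and uses a small NumPy F1 helper, `f1_at`, which is hypothetical and written only for this illustration.

```python
import numpy as np

def f1_at(y_true, y_prob, thresh):
    """F1 of the positive class when predicting 1 for probabilities above thresh."""
    y_pred = (y_prob > thresh).astype(int)
    tp = int(((y_pred == 1) & (y_true == 1)).sum())
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Made-up labels and probabilities, standing in for val_y and the model output.
toy_y = np.array([0, 0, 0, 1, 1, 0, 1, 0])
toy_prob = np.array([0.05, 0.20, 0.32, 0.45, 0.38, 0.12, 0.27, 0.41])

# Same grid as the loop above: 0.10, 0.11, ..., 0.50
thresholds = np.round(np.arange(0.1, 0.501, 0.01), 2)
scores = [f1_at(toy_y, toy_prob, t) for t in thresholds]

# argmax returns the first threshold that reaches the maximum F1
best_thresh = thresholds[int(np.argmax(scores))]
print(best_thresh, max(scores))
```

Storing the selected value in a variable and reusing it when thresholding the test predictions keeps the submission consistent with the validation sweep.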

In [29]:
pred_glove_test_y = model.predict([test_X], batch_size=1024, verbose=1)



Set threshold and then write submission csv

In [30]:
pred_test_y = (pred_glove_test_y>0.35).astype(int)
out_df = pd.DataFrame({"qid":test_set["qid"].values})
out_df['prediction'] = pred_test_y
out_df.head()

Unnamed: 0,qid,prediction
0,0000163e3ea7c7a74cd7,1
1,00002bd4fb5d505b9161,0
2,00007756b4a147d2b0b3,0
3,000086e4b7e1c7146103,0
4,0000c4c3fbe8785a3090,0


In [31]:
out_df.to_csv("submission.csv", index=False)

The leaderboard F1 score increased to 0.669! That suggests the extra text processing does have a small impact on performance (probably a handful of questions with rare/interesting words go from wrong to correct).