**Notebook Objective:**

The objective of this notebook is to look at the different pretrained embeddings provided in the dataset and to see how useful they are in the model building process. 

First let us import the necessary modules and read the input data.

In [1]:
%tensorflow_version 2.x

In [2]:
#!pip install --upgrade keras

In [3]:
import os
import time
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm
import math
from sklearn.model_selection import train_test_split
from sklearn import metrics

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation,  Conv1D
from keras.layers import Bidirectional, GlobalMaxPool1D
from keras.models import Model
from keras import initializers, regularizers, constraints, optimizers, layers
from tensorflow.python.keras.layers import CuDNNGRU

In [4]:
train_df = pd.read_csv("/content/sample_data/train.csv")
print("Train shape : ",train_df.shape)

Train shape :  (1306122, 3)


In [5]:
target_types = train_df.groupby('target').agg('count')
target_types

target,qid,question_text
0,1225312,1225312
1,80810,80810


In [6]:
target_counts = train_df['target'].value_counts().sort_index()  # number of questions per class
target_labels = target_counts.index  # class labels: 0 and 1

In [7]:
import re
import nltk
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('wordnet')

In [8]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [9]:
eng_stopwords = stopwords.words('english')
eng_stopwords.remove('not')  # keep 'not' in the text: it carries negation, so it should not be filtered out as a stop word
eng_stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [10]:
lemmatizer = WordNetLemmatizer()

In [11]:
def data_preprocessing(questions):
    #data cleaning
    questions = re.sub(re.compile('<.*?>'),'',questions)
    questions = re.sub('[^A-Za-z0-9]+',' ',questions)
    
    #Lowercase
    questions = questions.lower()

    #tokenization
    tokens = nltk.word_tokenize(questions)

    #stop words removal
    questions = [word for word in tokens if word not in eng_stopwords] #remove stop words

    #lemmatization
    questions = [lemmatizer.lemmatize(word) for word in questions]

    #join words in preprocessed questions
    questions = ' '.join(questions)

    return questions

In [12]:
train_df['preprocessed_question_text'] = train_df["question_text"].apply(data_preprocessing)
train_df.head()

Unnamed: 0,qid,question_text,target,preprocessed_question_text
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0,quebec nationalist see province nation 1960s
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0,adopted dog would encourage people adopt not shop
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0,velocity affect time velocity affect space geo...
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0,otto von guericke used magdeburg hemisphere
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0,convert montra helicon mountain bike changing ...


In [13]:
## split to train and val
train_df, val_df = train_test_split(train_df, test_size=0.3, random_state=2018)

In [14]:
## fill up the missing values in the question_text with "_na_"
train_df["question_text"] = train_df["question_text"].fillna("_na_").values
val_df["question_text"] = val_df["question_text"].fillna("_na_").values

In [15]:
## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values

Bag of Words: It converts a collection of text documents to a matrix of token counts.
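
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words matrix on two toy documents. The function name `bag_of_words` and the toy data are hypothetical and only for illustration; the notebook itself uses scikit-learn's `CountVectorizer`, which additionally counts 2- and 3-grams and drops terms seen in fewer than 3 documents.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # Sorted vocabulary: every distinct word gets a fixed column index.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    column = {word: j for j, word in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for word, count in Counter(toks).items():
            row[column[word]] = count
        matrix.append(row)
    return vocab, matrix

toy_docs = ["adopt dog not shop", "dog dog adopt"]
toy_vocab, toy_counts = bag_of_words(toy_docs)
print(toy_vocab)   # ['adopt', 'dog', 'not', 'shop']
print(toy_counts)  # [[1, 1, 1, 1], [1, 2, 0, 0]]
```

Each row is one document and each column one vocabulary word, so word order inside a document is lost and only the counts remain.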


In [16]:
from sklearn.feature_extraction.text import CountVectorizer

In [17]:
vect= CountVectorizer(dtype=np.float32,strip_accents='unicode',
                      analyzer='word',token_pattern=r'\w{1,}',
                      ngram_range=(1,3), min_df = 3)
X_train = vect.fit_transform(list(train_df['preprocessed_question_text'].values))
X_val = vect.transform(val_df['preprocessed_question_text'].values)

In [18]:
from sklearn.naive_bayes import MultinomialNB,GaussianNB,BernoulliNB
from sklearn.metrics import accuracy_score,f1_score

In [19]:
clf=MultinomialNB()
clf.fit(X_train,train_y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [20]:
y_val = clf.predict(X_val)
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.9270640598003762
Validation f1_score:  0.5412606943931684


In [21]:
del X_train,vect,X_val
import gc; gc.collect()
time.sleep(10)

TF-IDF (Term Frequency-Inverse Document Frequency): It measures how important a word is to a document in a collection or corpus.
The TF-IDF value increases in proportion to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.

Document A: The car is driven on the road.
Document B: The truck is driven on the highway.
These two documents form our corpus. Each has 7 words. TF is the word's count divided by the document length, and IDF = log10(number of documents / number of documents containing the word).

'the' --> TF --> A --> 2/7, B --> 2/7, IDF = log(2/2) = 0, TF-IDF --> A --> 0, B --> 0
The same holds for the words 'is', 'driven' and 'on': they appear in both documents, so their IDF, and therefore their TF-IDF, is 0.
Words that appear in only one document, such as 'car' and 'truck' (and likewise 'road' and 'highway'), have more significance.

'car' --> TF --> A --> 1/7, B --> 0/7, IDF = log(2/1) = 0.3, TF-IDF --> A --> 0.043, B --> 0

'truck' --> TF --> A --> 0/7, B --> 1/7, IDF = log(2/1) = 0.3, TF-IDF --> A --> 0, B --> 0.043
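
The arithmetic above can be reproduced with a short sketch. It assumes the textbook definition used in the example (TF = count / document length, IDF = log10(N / df)); the helper name `tf_idf` is hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed, natural-log IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log10(n_docs / doc frequency)."""
    tokenized = [doc.lower().replace(".", "").split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({tok for toks in tokenized for tok in toks})
    # Document frequency: in how many documents does each word occur?
    doc_freq = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    scores = []
    for toks in tokenized:
        scores.append({
            w: (toks.count(w) / len(toks)) * math.log10(n_docs / doc_freq[w])
            for w in vocab
        })
    return scores

doc_a = "The car is driven on the road."
doc_b = "The truck is driven on the highway."
scores_a, scores_b = tf_idf([doc_a, doc_b])
print(round(scores_a["car"], 3), round(scores_b["car"], 3))      # 0.043 0.0
print(round(scores_a["truck"], 3), round(scores_b["truck"], 3))  # 0.0 0.043
print(scores_a["the"], scores_a["driven"])                       # 0.0 0.0
```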

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [23]:
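tfidfvec= TfidfVectorizer(dtype=np.float32,strip_accents='unicode',
                      analyzer='word',token_pattern=r'\w{1,}',
                      ngram_range=(1,3), min_df = 3,
                      # boolean flags are passed as True/False; recent scikit-learn versions reject 1/0
                      max_features=None,use_idf=True,smooth_idf=True,sublinear_tf=True,
                      # note: scikit-learn's 'english' stop list contains 'not',
                      # which the preprocessing step deliberately kept
                      stop_words='english')
X_train_tfidf = tfidfvec.fit_transform(list(train_df['preprocessed_question_text'].values) )
X_val_tfidf = tfidfvec.transform(val_df['preprocessed_question_text'].values)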

In [24]:
clf=BernoulliNB()
clf.fit(X_train_tfidf,train_y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [25]:
y_val = clf.predict(X_val_tfidf)
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.9379461357656372
Validation f1_score:  0.5113643214565623


In [26]:
del X_train_tfidf,tfidfvec,X_val_tfidf
import gc; gc.collect()
time.sleep(10)

HashingVectorizer: The HashingVectorizer is based on feature hashing, also known as the hashing trick, and is a memory-efficient technique. Unlike the CountVectorizer, where the index assigned to a word in the document vector is determined by the alphabetical order of the word in the vocabulary, the HashingVectorizer maintains no vocabulary and determines the index of a word in an array of fixed size via hashing. Words never seen during training therefore still map to a column, at the cost of possible collisions (different words sharing the same index) and of not being able to map a column back to its word.
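
A minimal sketch of the hashing trick follows. It is an illustration only: the function name `hashed_features` is hypothetical, and it uses MD5 from the standard library as a stand-in hash, whereas scikit-learn's `HashingVectorizer` uses MurmurHash3.

```python
import hashlib

def hashed_features(text, n_features=16):
    """Count tokens into a fixed-size vector; the column is chosen by hashing."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # no vocabulary lookup needed
        vector[index] += 1
    return vector

vec_seen = hashed_features("adopt dog not shop")
vec_new = hashed_features("zebra dog")  # 'zebra' needs no prior fitting
print(len(vec_seen), sum(vec_seen))
print(len(vec_new), sum(vec_new))
```

The vector length stays at `n_features` no matter how many distinct words appear. With a small `n_features`, several words can land in the same column, which is the price paid for not storing a vocabulary.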

In [27]:
from sklearn.feature_extraction.text import HashingVectorizer

In [28]:
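# n_features = 2**10 gives only 1024 columns for all 1- to 3-grams, so many n-grams collide;
# with the default alternate_sign=True, hashed values can also be negative
hashvec= HashingVectorizer(dtype=np.float32,strip_accents='unicode',
                      analyzer='word',token_pattern=r'\w{1,}',
                      ngram_range=(1,3),n_features = 2**10)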
X_train_hashvec = hashvec.fit_transform(list(train_df['preprocessed_question_text'].values))
X_val_hashvec = hashvec.transform(val_df['preprocessed_question_text'].values)

In [29]:
clf=GaussianNB()
clf.fit(X_train_hashvec.toarray(),train_y)

GaussianNB(priors=None, var_smoothing=1e-09)

In [30]:
y_val = clf.predict(X_val_hashvec.toarray())
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.7236019058945429
Validation f1_score:  0.22872647253615908


In [31]:
clf=BernoulliNB()
clf.fit(X_train_hashvec,train_y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [32]:
y_val = clf.predict(X_val_hashvec)
print("Validation accuracy: ",accuracy_score(val_y,y_val))
print("Validation f1_score: ",f1_score(val_y,y_val))

Validation accuracy:  0.8969163197962418
Validation f1_score:  0.26971614536250227


In [33]:
del X_train_hashvec,hashvec,X_val_hashvec
import gc; gc.collect()
time.sleep(10)

The steps for the embedding-based models are as follows:
 * Split the training dataset into train and val samples. Cross-validation is a time-consuming process, so we use a simple train/val split (already done above).
 * Fill up the missing values in the text column with '_na_' (already done above).
 * Tokenize the text column and convert it to sequences of word indices.
 * Pad the sequences as needed: if the number of words in the text is greater than 'maxlen', truncate it to 'maxlen'; if it is less than 'maxlen', add zeros for the remaining values.
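
The tokenize-and-pad steps can be sketched in plain Python. This is a simplified illustration with hypothetical helper names: words are indexed from 1 by descending frequency, and padding and truncation both happen at the front of the sequence, which mirrors the default behaviour of Keras `pad_sequences`. The real `Tokenizer` also strips punctuation, which this sketch does not.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequences(texts, word_index):
    """Replace every known word by its index; unknown words are dropped."""
    return [[word_index[tok] for tok in text.lower().split() if tok in word_index]
            for text in texts]

def pad_or_truncate(sequence, max_len, pad_value=0):
    """Keep the last max_len items, or left-pad with pad_value up to max_len."""
    if len(sequence) >= max_len:
        return sequence[-max_len:]
    return [pad_value] * (max_len - len(sequence)) + sequence

toy_texts = ["why does velocity affect time", "does velocity affect space", "time"]
toy_index = build_word_index(toy_texts)
toy_seqs = texts_to_sequences(toy_texts, toy_index)
toy_padded = [pad_or_truncate(seq, max_len=4) for seq in toy_seqs]
print(toy_padded)  # [[1, 2, 3, 4], [1, 2, 3, 6], [0, 0, 0, 4]]
```

Index 0 is never assigned to a word, so it can safely serve as the padding value.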

**Using Embeddings:**

In [38]:
## some config values 
embed_size = 300 # how big is each word vector
max_features = 50000 # how many unique words to use (i.e. num rows in embedding matrix)
maxlen = 100 # max number of words in a question to use

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_df["question_text"]))
train_X = tokenizer.texts_to_sequences(train_df["question_text"])
val_X = tokenizer.texts_to_sequences(val_df["question_text"])

## Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)

**Without Pretrained Embeddings:**

Now that we are done with all the necessary preprocessing steps, we can first train a Bidirectional GRU model. We will not use any pre-trained word embeddings for this model and the embeddings will be learnt from scratch. Please check out the model summary for the details of the layers used. 

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()  # summary() prints the table itself; wrapping it in print() also prints "None"

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 100)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 300)          15000000  
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 128)          140544    
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 16)                2064      
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                
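
The parameter counts shown in the summary can be checked by hand. The sketch below assumes the CuDNN GRU layout of three gates, each with an input kernel, a recurrent kernel and two bias vectors; the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary row
    return vocab_size * embed_dim

def cudnn_gru_params(input_dim, units):
    # assumed layout: 3 gates x (input kernel + recurrent kernel + 2 bias vectors)
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, units):
    # weight matrix plus one bias per output unit
    return input_dim * units + units

n_embedding = embedding_params(50000, 300)
n_bigru = 2 * cudnn_gru_params(300, 64)  # forward and backward GRU
n_dense = dense_params(128, 16)          # 128 = 2 x 64 pooled features
print(n_embedding, n_bigru, n_dense)     # 15000000 140544 2064
```

The embedding layer holds almost all of the weights, which is why replacing its random initialization with pretrained vectors can matter so much.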

Train the model using the train sample and monitor the metric on the validation sample. This is just a sample model running for 2 epochs. Changing the epochs, batch_size and model parameters might give us a better model.

In [None]:
## Train the model 
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f73aefe66a0>

Now let us get the validation sample predictions and also get the best threshold for F1 score. 

In [None]:
pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_noemb_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5898567827847954
F1 score at threshold 0.11 is 0.5974824006681779
F1 score at threshold 0.12 is 0.6033436041367949
F1 score at threshold 0.13 is 0.6099999999999999
F1 score at threshold 0.14 is 0.6148175217273868
F1 score at threshold 0.15 is 0.6200178296458385
F1 score at threshold 0.16 is 0.6234161008408915
F1 score at threshold 0.17 is 0.6266816708376023
F1 score at threshold 0.18 is 0.6291448367663873
F1 score at threshold 0.19 is 0.6315771212462917
F1 score at threshold 0.2 is 0.6338647802062437
F1 score at threshold 0.21 is 0.635869759462454
F1 score at threshold 0.22 is 0.6375230986629951
F1 score at threshold 0.23 is 0.6382963085460192
F1 score at threshold 0.24 is 0.6392397987702627
F1 score at threshold 0.25 is 0.6400573433432678
F1 score at threshold 0.26 is 0.6394544829430416
F1 score at threshold 0.27 is 0.6398794575590155
F1 score at threshold 0.28 is 0.6398796733992265
F1 score at threshold 0.29 is 0.6392681432890083
F1 score at threshold 0
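
Rather than reading the best threshold off the printed list, it can be selected programmatically. Below is a minimal pure-Python sketch on toy data; the helper names and the toy labels and probabilities are hypothetical, and on the real data `metrics.f1_score` would take the place of the hand-written F1.

```python
def f1_at_threshold(y_true, y_prob, threshold):
    """F1 of the positive class when predicting 1 for prob > threshold."""
    tp = fp = fn = 0
    for truth, prob in zip(y_true, y_prob):
        pred = 1 if prob > threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_threshold(y_true, y_prob, thresholds):
    """Return the first threshold that reaches the highest F1."""
    return max(thresholds, key=lambda t: f1_at_threshold(y_true, y_prob, t))

toy_true = [0, 0, 1, 1, 0, 1]
toy_prob = [0.05, 0.30, 0.35, 0.80, 0.15, 0.45]
grid = [round(0.10 + 0.01 * k, 2) for k in range(41)]  # 0.10, 0.11, ..., 0.50
best = best_threshold(toy_true, toy_prob, grid)
print(best, f1_at_threshold(toy_true, toy_prob, best))  # 0.3 1.0
```

Because the threshold is tuned on the validation sample, the resulting F1 is slightly optimistic as an estimate of performance on unseen data.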

Now that our model building is done, it might be a good idea to clean up some memory before we go to the next step.

In [None]:
del model, inp, x
import gc; gc.collect()
time.sleep(10)

In [None]:
!wget 'https://nlp.stanford.edu/data/glove.840B.300d.zip'

--2020-11-16 18:36:39--  https://nlp.stanford.edu/data/glove.840B.300d.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip [following]
--2020-11-16 18:36:39--  http://downloads.cs.stanford.edu/nlp/data/glove.840B.300d.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2176768927 (2.0G) [application/zip]
Saving to: ‘glove.840B.300d.zip’


2020-11-16 18:53:33 (2.05 MB/s) - ‘glove.840B.300d.zip’ saved [2176768927/2176768927]



So we now have a baseline GRU model without pretrained embeddings. Now let us use the provided embeddings and rebuild the model to see how the performance changes. 



The dataset provides four different pretrained embeddings:
 * GoogleNews-vectors-negative300 - https://code.google.com/archive/p/word2vec/
 * glove.840B.300d - https://nlp.stanford.edu/projects/glove/
 * paragram_300_sl999 - https://cogcomp.org/page/resource_view/106
 * wiki-news-300d-1M - https://fasttext.cc/docs/en/english-vectors.html

A very good explanation of the different types of embeddings is given in this [kernel](https://www.kaggle.com/sbongo/do-pretrained-embeddings-give-you-the-extra-edge). Please refer to it for more details. This notebook uses two of them: glove.840B.300d and wiki-news-300d-1M.

**GloVe Embeddings:**

In this section, let us use the GloVe embeddings and rebuild the GRU model.

In [None]:
!unzip glove.840B.300d.zip

Archive:  glove.840B.300d.zip
  inflating: glove.840B.300d.txt     


In [None]:
!rm glove.840B.300d.zip

In [34]:
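EMBEDDING_FILE = 'glove.840B.300d.txt'
def get_coefs(word, *arr):
    # each line of the file is "<word> <v1> ... <v300>"
    return word, np.asarray(arr, dtype='float32')
# read inside a context manager so the file handle is closed afterwards
with open(EMBEDDING_FILE, encoding='utf8') as f:
    embeddings_index = dict(get_coefs(*line.rstrip().split(" ")) for line in f)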

In [35]:
# np.stack expects a sequence, so materialize the dict values as a list first
all_embs = np.stack(list(embeddings_index.values()))
emb_mean,emb_std = all_embs.mean(), all_embs.std()
embed_size = all_embs.shape[1]


In [36]:
del all_embs

In [39]:
word_index = tokenizer.word_index
nb_words = min(max_features, len(word_index))
embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embed_size))
for word, i in word_index.items():
    if i >= max_features: continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None: embedding_matrix[i] = embedding_vector
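
The loading and lookup steps above can be tried on a tiny in-memory "file". This is a pure-Python sketch with hypothetical names and toy vectors. It draws missing rows from a standard normal instead of the embedding mean and standard deviation, and it sizes the matrix with `len(word_index) + 1` rows because index 0 is reserved for padding; both are assumptions of the sketch, not what the notebook does.

```python
import random

def parse_embedding_line(line):
    """Split '<word> <v1> <v2> ...' into (word, list of floats)."""
    parts = line.rstrip().split(" ")
    return parts[0], [float(value) for value in parts[1:]]

def build_embedding_matrix(word_index, vectors, max_words, dim, seed=0):
    """Row i holds the vector of the word with index i, random if unknown."""
    rng = random.Random(seed)
    n_rows = min(max_words, len(word_index) + 1)  # +1: index 0 is padding
    matrix = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_rows)]
    for word, i in word_index.items():
        if i < n_rows and word in vectors:
            matrix[i] = list(vectors[word])
    return matrix

sample_file = ["dog 0.1 0.2 0.3\n", "time 0.4 0.5 0.6\n"]
toy_vectors = dict(parse_embedding_line(line) for line in sample_file)
toy_word_index = {"time": 1, "dog": 2, "montra": 3}  # 'montra' has no vector
toy_matrix = build_embedding_matrix(toy_word_index, toy_vectors, max_words=10, dim=3)
print(toy_matrix[1], toy_matrix[2])  # [0.4, 0.5, 0.6] [0.1, 0.2, 0.3]
```

Row 3 keeps its random initialization because 'montra' is not in the toy embedding file, which is what happens to out-of-vocabulary words in the real matrix as well.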

In [40]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size, weights=[embedding_matrix])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself; wrapping it in print() also prints "None"

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 100)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 100, 300)          15000000  
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 128)          140544    
_________________________________________________________________
global_max_pooling1d (Global (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 16)                2064      
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                

In [41]:
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f753c0ff6a0>

In [42]:
pred_glove_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_glove_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5901340837609668
F1 score at threshold 0.11 is 0.5983845730959529
F1 score at threshold 0.12 is 0.6060832566697333
F1 score at threshold 0.13 is 0.6131348357412888
F1 score at threshold 0.14 is 0.6196678752070185
F1 score at threshold 0.15 is 0.6252388317714493
F1 score at threshold 0.16 is 0.6300078554595445
F1 score at threshold 0.17 is 0.634832864729772
F1 score at threshold 0.18 is 0.6393878908848969
F1 score at threshold 0.19 is 0.6430908391070055
F1 score at threshold 0.2 is 0.646455466112698
F1 score at threshold 0.21 is 0.6493459600518636
F1 score at threshold 0.22 is 0.6521500822163535
F1 score at threshold 0.23 is 0.6546010106945587
F1 score at threshold 0.24 is 0.6564763762829757
F1 score at threshold 0.25 is 0.6584404455869751
F1 score at threshold 0.26 is 0.6609505993210004
F1 score at threshold 0.27 is 0.6628657374210812
F1 score at threshold 0.28 is 0.6644775170806632
F1 score at threshold 0.29 is 0.665692754449413
F1 score at threshold 0.3

In [43]:
del word_index, embeddings_index, embedding_matrix, model, inp, x
import gc; gc.collect()
time.sleep(10)

**Wiki News FastText Embeddings:**

Now let us use the FastText embeddings trained on the Wiki News corpus in place of the GloVe embeddings and rebuild the model.

In [44]:
!wget 'https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip'

--2020-11-16 19:35:08--  https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M.vec.zip
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.75.142, 104.22.74.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 681808098 (650M) [application/zip]
Saving to: ‘wiki-news-300d-1M.vec.zip’


2020-11-16 19:35:37 (23.1 MB/s) - ‘wiki-news-300d-1M.vec.zip’ saved [681808098/681808098]



In [45]:
!unzip wiki-news-300d-1M.vec.zip

Archive:  wiki-news-300d-1M.vec.zip
  inflating: wiki-news-300d-1M.vec   


In [46]:
!rm wiki-news-300d-1M.vec.zip

In [47]:
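EMBEDDING_FILE2 = 'wiki-news-300d-1M.vec'
# same helper as in the GloVe section, repeated so this cell runs on its own
def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')
# len(line) > 100 skips the short header line ("<vocab size> <dimension>") of the .vec file
with open(EMBEDDING_FILE2, encoding='utf8') as f:
    embeddings_index2 = dict(get_coefs(*line.rstrip().split(" ")) for line in f if len(line) > 100)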

In [48]:
# np.stack expects a sequence, so materialize the dict values as a list first
all_embs2 = np.stack(list(embeddings_index2.values()))
emb_mean2,emb_std2 = all_embs2.mean(), all_embs2.std()
embed_size2 = all_embs2.shape[1]


In [49]:
del all_embs2

In [51]:
word_index2 = tokenizer.word_index
nb_words2 = min(max_features, len(word_index2))
embedding_matrix2 = np.random.normal(emb_mean2, emb_std2, (nb_words2, embed_size2))
for word, i in word_index2.items():
    if i >= max_features: continue
    embedding_vector2 = embeddings_index2.get(word)
    if embedding_vector2 is not None: embedding_matrix2[i] = embedding_vector2

In [52]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size2, weights=[embedding_matrix2])(inp)
x = Bidirectional(CuDNNGRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [53]:
model.fit(train_X, train_y, batch_size=512, epochs=2, validation_data=(val_X, val_y))

Epoch 1/2
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x7f7548d15860>

In [54]:
pred_fasttext_val_y = model.predict([val_X], batch_size=1024, verbose=1)
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_fasttext_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.6093761557465568
F1 score at threshold 0.11 is 0.6158884428480879
F1 score at threshold 0.12 is 0.6212652157875322
F1 score at threshold 0.13 is 0.6264502611251838
F1 score at threshold 0.14 is 0.6309121219337938
F1 score at threshold 0.15 is 0.6347820483907343
F1 score at threshold 0.16 is 0.6382846691488666
F1 score at threshold 0.17 is 0.6417757164154381
F1 score at threshold 0.18 is 0.6441149833965049
F1 score at threshold 0.19 is 0.6462729386999848
F1 score at threshold 0.2 is 0.6487842598161355
F1 score at threshold 0.21 is 0.650946021146355
F1 score at threshold 0.22 is 0.6526441673554809
F1 score at threshold 0.23 is 0.6538741975067134
F1 score at threshold 0.24 is 0.6554183961331823
F1 score at threshold 0.25 is 0.65735358232815
F1 score at threshold 0.26 is 0.6589069011731443
F1 score at threshold 0.27 is 0.6601412602191202
F1 score at threshold 0.28 is 0.6613159716375745
F1 score at threshold 0.29 is 0.6620202478090058
F1 score at threshold 0.3

In [55]:
del word_index2, embeddings_index2,  embedding_matrix2, model, inp, x
import gc; gc.collect()
time.sleep(10)

**Observations:**
 * Overall, the pretrained embeddings seem to give better results compared to the model without pretrained embeddings. 
 * The performance of the two pretrained embeddings is almost the same.
 
**Final Blend:**

Though the results of the models with different pretrained embeddings are similar, there is a good chance that they capture different types of information from the data. So let us blend the GloVe and FastText models by taking a weighted average of their predictions (0.70 for GloVe and 0.30 for FastText).
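
A weighted blend of two sets of predicted probabilities can be sketched as follows. The function name `blend` and the toy probabilities are hypothetical; the default weight only mirrors the 0.70 / 0.30 split used in this notebook and is not a tuned value.

```python
def blend(pred_a, pred_b, weight_a=0.70):
    """Weighted average of two equally long lists of probabilities."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must lie in [0, 1]")
    if len(pred_a) != len(pred_b):
        raise ValueError("both prediction lists must have the same length")
    weight_b = 1.0 - weight_a  # weights sum to 1, so outputs stay in [0, 1]
    return [weight_a * a + weight_b * b for a, b in zip(pred_a, pred_b)]

toy_glove = [0.20, 0.60, 0.10]
toy_fasttext = [0.40, 0.50, 0.30]
toy_blend = blend(toy_glove, toy_fasttext)
print([round(p, 2) for p in toy_blend])  # [0.26, 0.57, 0.16]
```

Since the blended values are still probabilities, the same threshold search used for the individual models applies to them unchanged.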

In [56]:
pred_val_y = 0.70*pred_glove_val_y + 0.30*pred_fasttext_val_y 
for thresh in np.arange(0.1, 0.501, 0.01):
    thresh = np.round(thresh, 2)
    print("F1 score at threshold {0} is {1}".format(thresh, metrics.f1_score(val_y, (pred_val_y>thresh).astype(int))))

F1 score at threshold 0.1 is 0.5969944973604089
F1 score at threshold 0.11 is 0.6055569833518947
F1 score at threshold 0.12 is 0.612854789568057
F1 score at threshold 0.13 is 0.6190942545109211
F1 score at threshold 0.14 is 0.6245151162966206
F1 score at threshold 0.15 is 0.6303167837100186
F1 score at threshold 0.16 is 0.6350749291211016
F1 score at threshold 0.17 is 0.6394815868500079
F1 score at threshold 0.18 is 0.6441215577591594
F1 score at threshold 0.19 is 0.6472220869649901
F1 score at threshold 0.2 is 0.6502030548659181
F1 score at threshold 0.21 is 0.6536317497295948
F1 score at threshold 0.22 is 0.6565445731214651
F1 score at threshold 0.23 is 0.6589829585113808
F1 score at threshold 0.24 is 0.660456482881892
F1 score at threshold 0.25 is 0.6629818889700003
F1 score at threshold 0.26 is 0.6646641371557055
F1 score at threshold 0.27 is 0.6664774742816602
F1 score at threshold 0.28 is 0.6679922613929492
F1 score at threshold 0.29 is 0.6694998010777966
F1 score at threshold 0.