<a href="https://colab.research.google.com/github/koralturkk/QuoraQuestionPairs/blob/master/QuoraQuestionPairs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Quora Question Pairs

The Quora Question Pairs challenge is a semantic similarity problem. In this notebook, I use word and sentence embeddings to vectorize words and sentences, extract patterns, and measure the degree of similarity between sentences.

The notebook consists of the parts listed below.



1.   Exploratory Data Analysis
    *   Stratified Sampling
    *   Train/Val/Test Split
2.   Models for Semantic Similarity
    *   2.1. Sentence Embeddings
        *   Google's Universal Sentence Encoder
        *   Sentence-BERT
    *   2.2. Word Embeddings
        *   Google News' Word2Vec and Siamese Network
3.   Overview and Conclusion


In the notebook, pre-trained embedding layers are used to vectorize words and sentences. Transfer learning is a widely used approach in NLP tasks when computational and data resources are scarce. The embedding models are produced by training state-of-the-art models on very large text corpora.

#### Link to drive and import custom packages

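Google Drive is mounted so that the data files and custom packages stored there are accessible from the notebook.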

In [0]:
from google.colab import drive
import sys

drive.mount('/content/drive')
sys.path.append('/content/drive/My Drive/Colab Notebooks')


Mounted at /content/drive


#### Loading Packages

In [0]:
import tensorflow as tf
from hyperopt import Trials, STATUS_OK, tpe, fmin, hp
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow_hub as hub
import seaborn as sns
import os, re, io, random
from absl import logging
from google.colab import files
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

#pip install sentence_transformers
from sentence_transformers import SentenceTransformer
import keras.backend as K
from keras.optimizers import Adadelta, SGD
from keras.callbacks import ModelCheckpoint, LearningRateScheduler
import time
import scipy.spatial
import json
import torch ## loading file
import nltk ## stopwords
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.layers import Dense, Input, LSTM, Embedding,Activation,Flatten,merge,TimeDistributed,CuDNNGRU,Bidirectional, GRU,concatenate,subtract,add,maximum,multiply,Layer,Lambda
from keras.layers.merge import concatenate
from keras import optimizers
from keras.models import Model
from keras.callbacks import EarlyStopping, ModelCheckpoint
import keras.backend as K  # keep K bound to the backend module (K.exp, K.sum), not to the backend() function
from tqdm import tqdm
from keras.preprocessing.sequence import pad_sequences
from gensim.models import KeyedVectors
import itertools
from hyperopt import Trials, STATUS_OK, tpe, fmin, hp
import pickle

"nltk" package will provide a list of stop words that will be useful when cleaning the text.  

In [0]:
nltk.download('stopwords')  # fetch the stop word corpus if it is not already present
stop_words = stopwords.words("english")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Saving and Loading Files

Because embeddings and models are storage-heavy and expensive to compute, it is good practice to store processed embeddings and trained models to avoid recomputing them.

In [0]:
def save_obj(obj, name):
    with open("/content/drive/My Drive/Carl_Finance/"+ name + ".pkl", "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def load_obj(name):
    with open("/content/drive/My Drive/Carl_Finance/"+ name + ".pkl", "rb") as f:
        return pickle.load(f)

#save = save_obj(embedding_dict, "embeddings")

#embedding_dict = {}

model_save_name = 'embeddings.pt'
path = F"/content/drive/My Drive/Carl_Finance/{model_save_name}" 

def add_to_embedding_dict(key:str, value):
  embedding_dict.update({key:value})

## To save a file
#torch.save(embedding_dict, path)
#save_obj(embedding_dict,'embeddings')


## To load a file 
#embedding_dict = torch.load(path)

embedding_dict = load_obj("embeddings")


In [0]:
train_data = pd.read_csv("/content/drive/My Drive/Carl_Finance/train.csv")
#test_data = pd.read_csv("/content/drive/My Drive/Carl_Finance/test.csv")

## Exploratory Data Analysis 

Make a statement about length of sentences and their contribution to performance

### Preprocessing 

#### Handling Missing Values

In [0]:
print("Missing Values in question1 column: {}".format(train_data["question1"].isnull().sum()))
print("Missing Values in question2 column: {}".format(train_data["question2"].isnull().sum()))

Missing Values in question1 column: 1
Missing Values in question2 column: 2


#### Removing Rows with Missing Values

In [0]:
train_data = train_data.dropna(how='any',axis=0) 

In [0]:
print("Missing Values in question1 column: {}".format(train_data["question1"].isnull().sum()))
print("Missing Values in question2 column: {}".format(train_data["question2"].isnull().sum()))

Missing Values in question1 column: 0
Missing Values in question2 column: 0


#### Cleaning Text

In [0]:
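def clean_columns(df, columns=None, clean_stop_words=False):
  """Lowercase the given text columns, then strip digits and punctuation."""

  for col in (columns or []):
    # Lowercase every word and collapse runs of whitespace
    df.loc[:,col] = df.loc[:,col].apply(lambda text: " ".join(word.lower() for word in text.split()))
    # regex=True is needed on recent pandas versions, where str.replace is literal by default
    df.loc[:,col] = df.loc[:,col].str.replace(r"\d+", "", regex=True)       # digits
    df.loc[:,col] = df.loc[:,col].str.replace(r"[^\w\s]", "", regex=True)   # punctuation
    df.loc[:,col] = df.loc[:,col].str.replace(r"[︰-＠]", "", regex=True)    # characters in the U+FE30 to U+FF20 range

    if clean_stop_words:
      df.loc[:,col] = df.loc[:,col].apply(lambda text: " ".join(word for word in text.split() if word not in stop_words)) ##stop words

  return df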

In [0]:
print(train_data.tail())
print(train_data.head())

            id  ...  is_duplicate
404285  404285  ...             0
404286  404286  ...             1
404287  404287  ...             0
404288  404288  ...             0
404289  404289  ...             0

[5 rows x 6 columns]
   id  qid1  ...                                          question2 is_duplicate
0   0     1  ...  What is the step by step guide to invest in sh...            0
1   1     3  ...  What would happen if the Indian government sto...            0
2   2     5  ...  How can Internet speed be increased by hacking...            0
3   3     7  ...  Find the remainder when [math]23^{24}[/math] i...            0
4   4     9  ...            Which fish would survive in salt water?            0

[5 rows x 6 columns]


In [0]:
train_data = clean_columns(train_data, ["question1", "question2"])
print(train_data.tail())
print(train_data.head())

            id  ...  is_duplicate
404285  404285  ...             0
404286  404286  ...             1
404287  404287  ...             0
404288  404288  ...             0
404289  404289  ...             0

[5 rows x 6 columns]
   id  qid1  ...                                          question2 is_duplicate
0   0     1  ...  what is the step by step guide to invest in sh...            0
1   1     3  ...  what would happen if the indian government sto...            0
2   2     5  ...  how can internet speed be increased by hacking...            0
3   3     7  ...    find the remainder when mathmath is divided by             0
4   4     9  ...             which fish would survive in salt water            0

[5 rows x 6 columns]
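The effect of the cleaning steps on a single string can be sketched with the standard `re` module. This is only an illustration of the same three substitutions outside pandas; the function name `clean_text` and the sample question are made up for the example. It also shows why math markup collapses into tokens such as `mathmath` in the output above.

```python
import re


def clean_text(text):
    # Hypothetical single-string version of the column cleaning steps.
    text = " ".join(word.lower() for word in text.split())  # lowercase, collapse spaces
    text = re.sub(r"\d+", "", text)        # drop digits
    text = re.sub(r"[^\w\s]", "", text)    # drop punctuation
    text = re.sub(r"[︰-＠]", "", text)     # drop characters in the U+FE30 to U+FF20 range
    return text


sample = "Find [math]23^{24}[/math] mod 7?"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the digits and brackets disappear but the word `math` from both tags survives, and a trailing space is left behind where the final number used to be.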


### Sampling Training Set

Since the dataset is large, I will work on a sample of it to build the NLP model, which keeps the training and optimization process manageable. It is important to preserve the label distribution in order to build a model that remains viable for deployment.

Therefore, I will use stratified sampling to draw a sample that has the same distribution of "is_duplicate" labels as the full dataset.

In [0]:
y = train_data["is_duplicate"]
X = train_data.copy(deep=False).drop(columns = ["is_duplicate"])

X_source, X_sample, y_source, y_sample = train_test_split(X, y, test_size=0.1, random_state=42, stratify= y)

In [0]:
y_sample.value_counts()

0    25503
1    14926
Name: is_duplicate, dtype: int64

In [0]:
print("Sample distribution of label 1: ",y_sample.value_counts()[1]/y_sample.value_counts().sum())

Sample distribution of label 1:  0.369190432610255


In [0]:
train_data.is_duplicate.value_counts()

0    255024
1    149263
Name: is_duplicate, dtype: int64

In [0]:
print("Source data distribution of label 1: ", train_data.is_duplicate.value_counts()[1]/train_data.is_duplicate.value_counts().sum())

Source data distribution of label 1:  0.3692005926482919


The label distributions of the sample and the source data are very close to each other. We can move forward with the implementation on the sample data.
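What stratification does can be sketched in plain Python: sample the same fraction from each label group separately, so the label proportions carry over to the sample. This is a simplified illustration, not what `train_test_split` does internally; the function name `stratified_sample` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=42):
    # Group row positions by label, then sample the same fraction from each group.
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    sampled = []
    for indices in by_label.values():
        k = round(len(indices) * fraction)
        sampled.extend(rng.sample(indices, k))
    return sorted(sampled)


# Toy labels with a 63% / 37% split, roughly like "is_duplicate".
labels = [0] * 630 + [1] * 370
sample_idx = stratified_sample(labels, fraction=0.1)
positive_rate = sum(labels[i] for i in sample_idx) / len(sample_idx)
print(len(sample_idx), positive_rate)
```

Because each group is sampled on its own, the share of label 1 in the sample matches the share in the source up to rounding.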

### Train/Val/Test Set

Separating the data into train/val/test sets helps us validate how well the models generalize.

The model is trained on the training data and tuned on the validation data; the final check on the test data shows how well the model performs on unseen examples.

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.2, random_state=42, stratify = y_sample)

In [0]:
X_train.reset_index(inplace=True)
X_test.reset_index(inplace=True)
#y_train.reset_index(inplace=True)
#y_test.reset_index(inplace=True)

# Models for Semantic Similarity

Models for semantic similarity use embeddings to vectorize sentences or words. Distance metrics are then computed between the resulting vectors to measure how much they differ. Some of the most commonly used metrics are cosine similarity, Manhattan distance, Minkowski distance and Euclidean distance.

#### Distance Metrics 

Distance metrics help us quantify the relative difference or similarity between vectors. Which metric to use depends on the nature of the problem. Moreover, the choice among metrics that suit the problem can be treated as a hyperparameter, so the candidates can be compared by the F1 score they yield.
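The metrics named above can be written out in a few lines. The sketch below uses only the standard library so the arithmetic is visible; the function names and the two toy vectors are chosen for illustration and are not part of the notebook's pipeline.

```python
import math


def cosine_sim(u, v):
    # Cosine of the angle between u and v: 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def minkowski_distance(u, v, p):
    # General form; p=1 gives Manhattan and p=2 gives Euclidean distance.
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)


def manhattan_distance(u, v):
    return minkowski_distance(u, v, p=1)


def euclidean_distance(u, v):
    return minkowski_distance(u, v, p=2)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]
print(cosine_sim(u, v), manhattan_distance(u, v), euclidean_distance(u, v))
```

The two vectors point in the same direction, so their cosine similarity is 1 even though their Manhattan and Euclidean distances are clearly non-zero. This is the practical difference between angle-based and magnitude-based metrics.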

#### Useful Functions

In [0]:
def report_predictions(y_train, predictions):
  actual = y_train.to_list()
  results = confusion_matrix(actual, predictions) 
  print('\nConfusion Matrix :\n')
  print(results) 
  #print('Accuracy Score :\n',accuracy_score(actual, predictions))
  print('\nReport : \n')
  print(classification_report(actual, predictions))
  #print("\nF1 Score:\n")
  #print(f1_score(y_train,predictions))

def f1_score_and_accuracy(y_train, predictions):
  actual = y_train.to_list()
  print('Accuracy Score :\n',accuracy_score(actual, predictions))
  print("\nF1 Score:\n")
  print(f1_score(y_train,predictions))

def cosine_similarities(embedding_1, embedding_2):
  # Both inputs must hold one embedding per sentence pair
  if len(embedding_1) != len(embedding_2):
    raise ValueError("Array sizes do not match")

  similarities = []

  for i in range(len(embedding_1)):
    sentence_1 = np.reshape(embedding_1[i], (1, -1))
    sentence_2 = np.reshape(embedding_2[i], (1, -1))
    # cosine_similarity returns a 1x1 matrix for a single pair
    similarity = float(cosine_similarity(sentence_1, sentence_2)[0, 0])
    similarities.append(similarity)

  return np.reshape(similarities, (-1,1))

def get_predictions(similarities, threshold = 0.8):
  predictions = list(map(float, similarities>threshold))
  return predictions

def find_max_word_count(df, columns):
  count = 0
  for col in columns:
    new_count = max(df[col].str.split().map(len))
    if new_count > count:
      count = new_count

    else:
      continue
  return count


def tokenize(df, questions_cols):
 
  for index, row in df.iterrows():

      for question in questions_cols:

          q2n = [] 
          for word in row[question].split():
              if word in stop_words and word not in word2vec.vocab:
                  continue

              if word not in vocabulary:
                  vocabulary[word] = len(inverse_vocabulary)
                  q2n.append(len(inverse_vocabulary))
                  inverse_vocabulary.append(word)
              else:
                  q2n.append(vocabulary[word])

          df.at[index, question] = q2n  # DataFrame.set_value was removed in pandas 1.0

  return df

def build_embedding_matrix(vocabulary, embedding_dim =300):
  embeddings = 1 * np.random.randn(len(vocabulary) + 1, embedding_dim)  # This will be the embedding matrix
  embeddings[0] = 0  # So that the padding will be ignored

# Build the embedding matrix
  for word, index in vocabulary.items():
    if word in word2vec.vocab:
      embeddings[index] = word2vec.word_vec(word)

  return embeddings 

## Sentence Embedding Models

Sentence embedding models embed full sentences into vector representations. In this task I implement and assess the results of Universal Sentence Encoder and Sentence-BERT.


*   Google's Universal Sentence Encoder
    1.   Deep Averaging Network
    2.   Transformer
*   Sentence-BERT

Further reading:

-   https://arxiv.org/pdf/1907.04307.pdf
-   https://www.learnopencv.com/universal-sentence-encoder/


### Universal Sentence Encoder

Two Universal Sentence Encoder models are shared with the community, each trained with a different architecture: a Deep Averaging Network (DAN) and a Transformer encoder. These architectures present a trade-off: the DAN model is computationally cheaper but less accurate overall, while the Transformer model achieves higher accuracy at a higher computational cost.

I will try to implement both for the project to see their impact on predictions. 

#### Universal Sentence Encoder Trained with Deep Averaging Network

In [0]:
DAN_module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" #@param ["https://tfhub.dev/google/universal-sentence-encoder/4"]

# Reduce logging output.
#logging.set_verbosity(logging.ERROR)

#DAN_module_embedding = hub.KerasLayer(DAN_module_url)



In [0]:
##Create

#question_1_list = X_train.loc[:,"question1"].to_list()
#question_2_list = X_train.loc[:,"question2"].to_list()


#with tf.Session() as session:
 # session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  #question_1_DAN_embeddings = session.run(DAN_module_embedding(question_1_list))
  #question_2_DAN_embeddings = session.run(DAN_module_embedding(question_2_list))

## Load
question_1_DAN_embeddings, question_2_DAN_embeddings = embedding_dict["question_1_DAN_embeddings"], embedding_dict["question_2_DAN_embeddings"]

In [0]:
DAN_similarities = cosine_similarities(question_1_DAN_embeddings,question_2_DAN_embeddings)
DAN_predictions = get_predictions(DAN_similarities, 0.8)

In [0]:
DAN_report = report_predictions(y_train, DAN_predictions)
DAN_report

Confusion Matrix :
[[15809  4593]
 [ 4214  7727]]
Accuracy Score : 0.7276999659895496
Report : 
              precision    recall  f1-score   support

           0       0.79      0.77      0.78     20402
           1       0.63      0.65      0.64     11941

    accuracy                           0.73     32343
   macro avg       0.71      0.71      0.71     32343
weighted avg       0.73      0.73      0.73     32343
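The figures in the report can be recomputed by hand from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are true labels, columns are predicted labels, so the matrix reads `[[TN, FP], [FN, TP]]`); the helper name `binary_metrics` is made up for the example.

```python
def binary_metrics(confusion):
    # Rows are true labels, columns are predicted labels.
    (tn, fp), (fn, tp) = confusion
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of the pairs predicted duplicate, how many are
    recall = tp / (tp + fn)      # of the true duplicates, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


dan_confusion = [[15809, 4593], [4214, 7727]]
accuracy, precision, recall, f1 = binary_metrics(dan_confusion)
print(round(accuracy, 4), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values agree with the row for label 1 in the report above (precision 0.63, recall 0.65, F1 0.64) and with the reported accuracy.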



In [0]:
### Save
#add_to_embedding_dict("question_1_DAN_embeddings",question_1_DAN_embeddings)
#add_to_embedding_dict("question_2_DAN_embeddings",question_2_DAN_embeddings)
#torch.save(embedding_dict, path)

#### Universal Sentence Encoder with Transformer Encoder

This model could not be run: the session crashed after running out of RAM, so the cells below are left commented out.

In [0]:
#TRANS_module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/5" #@param ["https://tfhub.dev/google/universal-sentence-encoder-large/5"]
#TRANS_module_embedding = hub.load(TRANS_module_url)

In [0]:
#with tf.Session() as session:
 # session.run([tf.global_variables_initializer(), tf.tables_initializer()])
  #question_1_TRANS_embeddings, question_2_TRANS_embeddings = session.run([TRANS_module_embedding(question_1_list), TRANS_module_embedding(question_2_list)])

In [0]:
#TRANS_similarities = cosine_similarities(question_1_TRANS_embeddings,question_2_TRANS_embeddings)
#TRANS_predictions = get_predictions(TRANS_similarities, 0.8)

In [0]:
#report_predictions(y_train, TRANS_predictions)

### Sentence-BERT


Researchers at the Ubiquitous Knowledge Processing Lab (UKP-TUDA) introduced Sentence-BERT, a modification of the pretrained BERT network that uses siamese and triplet network structures to derive semantically meaningful sentence embeddings. This makes BERT usable for tasks such as semantic similarity comparison, which were not practical with it before.


In [0]:
### Encoding Text into Sentence Embeddings

#bert_nli_stsb_mean_transformer = SentenceTransformer('bert-base-nli-stsb-mean-tokens')
#bert_nli_stsb_mean_question_1_embeddings = bert_nli_stsb_mean_transformer.encode(question_1_list)
#bert_nli_stsb_mean_question_2_embeddings = bert_nli_stsb_mean_transformer.encode(question_2_list)

### Loading pre-encoded sentence embeddings

bert_nli_stsb_mean_question_1_embeddings = embedding_dict["bert_nli_stsb_mean_question_1_embeddings"]
bert_nli_stsb_mean_question_2_embeddings = embedding_dict["bert_nli_stsb_mean_question_2_embeddings"]

In [0]:
## To save a file

#add_to_embedding_dict("bert_nli_stsb_mean_question_1_embeddings",bert_nli_stsb_mean_question_1_embeddings)
#add_to_embedding_dict("bert_nli_stsb_mean_question_2_embeddings", bert_nli_stsb_mean_question_2_embeddings)

#save_obj(embedding_dict,"embeddings")

#torch.save(embedding_dict, path)

In [0]:
sbert_similarities = cosine_similarities(bert_nli_stsb_mean_question_1_embeddings,bert_nli_stsb_mean_question_2_embeddings)
sbert_predictions = get_predictions(sbert_similarities, 0.8)

In [0]:
sbert_report = report_predictions(y_train, sbert_predictions)
sbert_report

Confusion Matrix :
[[16234  4168]
 [ 4578  7363]]
Accuracy Score : 0.7295860000618372
Report : 
              precision    recall  f1-score   support

           0       0.78      0.80      0.79     20402
           1       0.64      0.62      0.63     11941

    accuracy                           0.73     32343
   macro avg       0.71      0.71      0.71     32343
weighted avg       0.73      0.73      0.73     32343



In [0]:
#del bert_nli_stsb_mean_transformer

## Word Embedding Models

For the word embedding models, the binary "is duplicate" classification objective is reached by feeding sequences of word embeddings to a siamese network. The network encodes each question, and a distance function applied to the two encoded outputs measures how semantically similar the two input sequences are.

The choice of word embeddings is important for performance. I chose to work with the Google News word2vec embeddings.

The selection of word embeddings has to be done empirically. Each trial has to be documented, and we should deploy the model that yields the highest accuracy and F1 score.


In this notebook, I implement the Siamese Manhattan LSTM (MaLSTM), a model that yields high accuracy on this task, to compare word embedding results with the sentence embedding results of state-of-the-art models.

Later, I define a wide hyperparameter space and use Bayesian optimization to search for a custom architecture that can compete against the pre-existing models.

In [0]:
## The maximum word count sets the sequence length, so that padded sequences can hold every sentence in the dataset.
max_word_count = find_max_word_count(train_data,["question1", "question2"])
print(max_word_count)

237


#### Word2Vec

Once the embedding matrix for the words in our vocabulary has been created and saved, it is no longer necessary to load the GoogleNews embedding file, which is large.

In [0]:
#word2vec_embedding_file = "/content/drive/My Drive/Carl_Finance/GoogleNews-vectors-negative300.bin.gz"
#word2vec = KeyedVectors.load_word2vec_format(word2vec_embedding_file, binary=True)



#### Word Embedding Matrix

In [0]:
vocabulary = dict()
inverse_vocabulary = ['<unk>']  # '<unk>' will never be used, it is only a placeholder for the [0, 0, ....0] embedding
questions_cols = ['question1', 'question2']

## Tokenize questions 

X_w2v = clean_columns(X_sample.copy(deep=False), columns = questions_cols, clean_stop_words = True)
X_w2v = tokenize(X_w2v,questions_cols)



In [0]:
## Embedding 
word_2_vec_embeddings = embedding_dict["word_2_vec_embeddings"]   #build_embedding_matrix(vocabulary,embedding_dim =300) 

## Delete 
#del word2vec

#add_to_embedding_dict("word_2_vec_embeddings",word_2_vec_embeddings)

#save_obj(embedding_dict,"embeddings")

In [0]:
X_w2v.head()

|  | id | qid1 | qid2 | question1 | question2 |
|---|---|---|---|---|---|
| 156326 | 156326 | 244682 | 244683 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11] | [12, 1, 2, 13, 9, 14, 15, 10, 16, 17, 18, 19] |
| 233222 | 233222 | 343427 | 343428 | [20, 21, 22, 23, 24] | [25, 21, 26, 27, 28] |
| 318647 | 318647 | 444025 | 444026 | [29, 30, 31] | [29, 30, 31] |
| 401836 | 401836 | 535283 | 535284 | [32, 33, 34] | [32, 33] |
| 3053 | 3053 | 6053 | 6054 | [35, 36] | [35, 36] |
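How questions turn into index lists like the ones above, and how those lists are later padded to a fixed length, can be sketched without any libraries. This is a simplified illustration: it skips the stop word and word2vec checks done in `tokenize`, and `pad_left` only mimics the default behaviour of Keras `pad_sequences` (zeros added at the front, overly long sequences cut from the front). The helper names and sample sentences are made up for the example.

```python
def build_vocabulary(sentences):
    # Index 0 is reserved for padding, so real words start at 1.
    vocabulary = {}
    inverse_vocabulary = ["<unk>"]
    encoded = []
    for sentence in sentences:
        ids = []
        for word in sentence.split():
            if word not in vocabulary:
                vocabulary[word] = len(inverse_vocabulary)
                inverse_vocabulary.append(word)
            ids.append(vocabulary[word])
        encoded.append(ids)
    return vocabulary, inverse_vocabulary, encoded


def pad_left(ids, maxlen):
    # Keep the last maxlen ids, then fill the front with zeros (maxlen must be > 0).
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


sentences = ["which fish survive salt water", "fish survive fresh water"]
vocabulary, inverse_vocabulary, encoded = build_vocabulary(sentences)
padded = [pad_left(ids, maxlen=6) for ids in encoded]
print(encoded)
print(padded)
```

Words shared by both questions receive the same index, which is what lets the shared embedding layer treat them identically on both sides of the siamese network.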


### Manhattan LSTM 

The model uses the Manhattan distance. According to the article, the Manhattan distance outperforms cosine similarity for this model, so I follow the same approach in the implementation.
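The similarity that MaLSTM derives from the Manhattan distance is exp(-‖left - right‖₁), which maps any pair of vectors into the range (0, 1]. A plain-Python sketch of that formula follows; the function name and the toy vectors are assumptions for the example, standing in for the two LSTM outputs.

```python
import math


def exp_neg_manhattan(left, right):
    # exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they diverge.
    return math.exp(-sum(abs(a - b) for a, b in zip(left, right)))


same = exp_neg_manhattan([0.2, 0.5], [0.2, 0.5])
close = exp_neg_manhattan([0.2, 0.5], [0.1, 0.9])
far = exp_neg_manhattan([0.2, 0.5], [3.0, -2.0])
print(same, close, far)
```

Because the output is bounded between 0 and 1, it can be compared directly against the 0/1 "is_duplicate" labels, which is why a mean squared error loss is a workable choice for this model.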

In [0]:
X_train_w2v, X_val_w2v, y_train_w2v, y_val_w2v = train_test_split(X_w2v, y_sample, test_size=0.2, random_state=42, stratify = y_sample)

In [0]:
### Converting X_train_w2v to a dict makes it easier to feed the two sides of the siamese network

X_train_w2v = {'left': X_train_w2v.question1, 'right': X_train_w2v.question2}
X_val_w2v = {'left': X_val_w2v.question1, 'right': X_val_w2v.question2}

# Convert labels to their numpy representations
y_train_w2v = y_train_w2v.values
y_val_w2v = y_val_w2v.values

# Zero padding
for dataset, side in itertools.product([X_train_w2v, X_val_w2v], ['left', 'right']):
    dataset[side] = pad_sequences(dataset[side], maxlen=max_word_count)

# Make sure everything is ok
assert X_train_w2v['left'].shape == X_train_w2v['right'].shape
assert len(X_train_w2v['left']) == len(y_train_w2v)

In [0]:
### Model Implementation

# Model variables
embedding_dim = 300
n_hidden = 50
gradient_clipping_norm = 1.25
batch_size = 64
n_epoch = 25


def cosine_distance(x1, x2):
    x1 = K.l2_normalize(x1, axis=-1)
    x2 = K.l2_normalize(x2, axis=-1)
    return -K.mean(x1 * x2, axis=-1, keepdims=True)

def cos_dist_output_shape(shapes):
    # shapes holds the output shapes of the left and right branches
    shape1, shape2 = shapes
    return (shape1[0],1)

def exponent_neg_manhattan_distance(left, right):
    ''' Helper function for the similarity estimate of the LSTMs outputs'''
    return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))


# The visible layer
def create_model():
  left_input = Input(shape=(max_word_count,), dtype='int32')
  right_input = Input(shape=(max_word_count,), dtype='int32')

  embedding_layer = Embedding(len(word_2_vec_embeddings), embedding_dim, weights=[word_2_vec_embeddings], input_length=max_word_count, trainable=False)

  # Embedded version of the inputs
  encoded_left = embedding_layer(left_input)
  encoded_right = embedding_layer(right_input)

  # Since this is a siamese network, both sides share the same LSTM
  shared_lstm = LSTM(n_hidden)

  left_output = shared_lstm(encoded_left)
  right_output = shared_lstm(encoded_right)


  # Calculates the distance as defined by the MaLSTM model
  malstm_distance = Lambda(function=lambda x: exponent_neg_manhattan_distance(x[0], x[1]),output_shape=lambda x: (x[0][0], 1))([left_output, right_output])

  # Pack it all up into a model
  malstm = Model([left_input, right_input], [malstm_distance])

  # Adadelta optimizer, with gradient clipping by norm
  optimizer = Adadelta(clipnorm=gradient_clipping_norm)


  ## Callbacks

  ## Learning Rate 

  #lr_schedule = LearningRateScheduler(lambda epoch: 1e-8*10**(epoch/5))

  ## Checkpoint Callback
  filepath="/content/drive/My Drive/Carl_Finance/weights.best.hdf5"
  model_checkpoint = ModelCheckpoint(filepath, monitor="val_acc", verbose=1, save_best_only=True, mode='max')
  callbacks_list = [model_checkpoint]

  # load weights

  malstm.compile(loss="mean_squared_error", optimizer=optimizer, metrics=["accuracy"])

  # Start training

  #malstm = malstm.fit([X_train_w2v['left'], X_train_w2v['right']], y_train_w2v, batch_size=batch_size, epochs=n_epoch,
  #                           validation_data=([X_val_w2v['left'], X_val_w2v['right']], y_val_w2v),callbacks=callbacks_list, verbose = 1)

  return malstm

malstm = create_model() 
malstm.load_weights("/content/drive/My Drive/Carl_Finance/weights.best.hdf5")

In [0]:
malstm.summary()

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 237)          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 237)          0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 237, 300)     9497700     input_1[0][0]                    
                                                                 input_2[0][0]                    
__________________________________________________________________________________________________
lstm_1 (LSTM)                   (None, 50)           70200       embedding_1[0][0]          

In [0]:
malstm_pred = malstm.predict([X_val_w2v['left'], X_val_w2v['right']])
malstm_report = report_predictions(y_val_w2v,get_predictions(malstm_pred))
malstm_report

Confusion Matrix :
[[990  30]
 [499  98]]
Accuracy Score : 0.6728509585652442
Report : 
              precision    recall  f1-score   support

           0       0.66      0.97      0.79      1020
           1       0.77      0.16      0.27       597

    accuracy                           0.67      1617
   macro avg       0.72      0.57      0.53      1617
weighted avg       0.70      0.67      0.60      1617



## Custom Architecture with Hyperparameter Tuning using Bayesian Optimization

Sequential model-based optimization

Sequential model-based optimization is a Bayesian optimization technique that uses information from past trials to choose the next set of hyperparameters to explore. Two variants of this algorithm are used in practice: one based on Gaussian processes and the other on the Tree-structured Parzen Estimator (TPE). The HyperOpt package implements the TPE algorithm to perform optimization. TPE replaces the generative process of choosing parameters from the tree-structured search space with a set of non-parametric distributions.
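The core idea of TPE can be illustrated with a deliberately simplified one-dimensional toy. It is not HyperOpt's implementation: the quadratic `objective`, the fixed kernel bandwidth, and all parameter names here are assumptions made for the sketch. Past trials are split into a "good" and a "bad" group by loss, a density is fitted to each, and the next point is the candidate that looks most likely under the good density relative to the bad one.

```python
import math
import random


def objective(x):
    # Hypothetical stand-in for "train a model and return its validation loss".
    return (x - 2.0) ** 2


def kde(x, points, bandwidth):
    # Gaussian kernel density estimate built from past observations.
    norm = len(points) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def tpe_minimize(objective, low, high, n_startup=10, n_trials=40,
                 gamma=0.25, n_candidates=24, bandwidth=0.5, seed=0):
    rng = random.Random(seed)
    history = []  # (loss, x) pairs from every trial so far
    for trial in range(n_trials):
        if trial < n_startup:
            # Warm-up: sample uniformly until there is enough history to model.
            x = rng.uniform(low, high)
        else:
            ranked = sorted(history)
            n_good = max(1, int(gamma * len(ranked)))
            good = [x_past for _, x_past in ranked[:n_good]]
            bad = [x_past for _, x_past in ranked[n_good:]]
            # Propose candidates near good points, clipped to the search range.
            candidates = [
                min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                for _ in range(n_candidates)
            ]
            # Keep the candidate that is likely under "good" and unlikely under "bad".
            x = max(candidates,
                    key=lambda c: kde(c, good, bandwidth) / (kde(c, bad, bandwidth) + 1e-12))
        history.append((objective(x), x))
    return min(history)


best_loss, best_x = tpe_minimize(objective, low=-10.0, high=10.0)
print(f"best x = {best_x:.3f}, loss = {best_loss:.4f}")
```

In the notebook the same loop is driven by `fmin` with `tpe.suggest`, where each evaluation of the objective is a full training run, which is why the number of evaluations is kept small.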

- https://blog.dominodatalab.com/hyperopt-bayesian-hyperparameter-optimization/

#### Sampling Again


We need to sample the dataset again because of the computational overhead: searching the parameter space is very time-consuming, so I reduce the number of training examples. The parameters found on this smaller sample may not fully match the optimal parameters for the whole dataset, but they should give a close picture of what those parameters look like. Given the resources I have, this seemed to be the best approach.

In [0]:
X_sample_w2v, X_source_w2v, y_sample_w2v, y_source_w2v = train_test_split(X_w2v, y_sample, test_size=0.8, random_state=42, stratify = y_sample)
X_train_bay, X_val_bay, y_train_bay, y_val_bay = train_test_split(X_sample_w2v, y_sample_w2v, test_size=0.2, random_state=42, stratify = y_sample_w2v)

In [0]:
### Converting X_train_bay to a dict makes it easier to feed the two sides of the siamese network

X_train_bay = {'left': X_train_bay.question1, 'right': X_train_bay.question2}
X_val_bay = {'left': X_val_bay.question1, 'right': X_val_bay.question2}

# Convert labels to their numpy representations
y_train_bay = y_train_bay.values
y_val_bay = y_val_bay.values

# Zero padding
for dataset, side in itertools.product([X_train_bay, X_val_bay], ['left', 'right']):
    dataset[side] = pad_sequences(dataset[side], maxlen=max_word_count)

# Make sure everything is ok
assert X_train_bay['left'].shape == X_train_bay['right'].shape
assert len(X_train_bay['left']) == len(y_train_bay)

In [0]:
from hyperopt import Trials, STATUS_OK, tpe, fmin, hp
from keras import optimizers
from keras.layers import GRU

bayesian_history= pd.DataFrame(columns=["model","loss", "acc", "val_acc","batch_size","nodes","learning_rate","arch","epochs","optimizer","distance_metric"])

search_space = {
    "batch_size": hp.choice("batch_size", [8,16,32,64,128]),
    "nodes": hp.choice("nodes", [25,50,75,100]),
    "learning_rate":  hp.loguniform("learning_rate", np.log(0.01), np.log(0.2)),
    "arch": hp.choice("arch",["bidirectional", "unidirectional","GRU"]),
    "epochs": hp.choice("epochs", [15, 20, 25]),
    # "adadelta" must match the string checked when the optimizer is built below
    "optimizer": hp.choice("optimizer",["sgd", "rms","adadelta"]),
    "distance_metric": hp.choice("distance_metric",["exponent_neg_manhattan_distance", "cosine_distance"]),
}


def build_siamese_network(params):


  left_input = Input(shape=(max_word_count,), dtype='int32')
  right_input = Input(shape=(max_word_count,), dtype='int32')

  embedding_layer = Embedding(len(word_2_vec_embeddings), embedding_dim, weights=[word_2_vec_embeddings], input_length=max_word_count, trainable=False)

  encoded_left = embedding_layer(left_input)
  encoded_right = embedding_layer(right_input)

  if params["arch"] == "bidirectional":
    model = Bidirectional(LSTM(params["nodes"])) 
  elif params["arch"] == "GRU":
    model = GRU(params["nodes"])
  else:
    model = LSTM(params["nodes"]) 

  left_output = model(encoded_left)
  right_output = model(encoded_right)


  if params["distance_metric"] == "exponent_neg_manhattan_distance":
    distance = Lambda(function=lambda x: exponent_neg_manhattan_distance(x[0], x[1]),output_shape=lambda x: (x[0][0], 1))([left_output, right_output])
  if params["distance_metric"] == "cosine_distance":
    distance = Lambda(function=lambda x: cosine_distance(x[0], x[1]),output_shape=lambda x: (x[0][0], 1))([left_output, right_output])

  model = Model([left_input, right_input], [distance])

  ### Callbacks
  file_name = 'siamese_callbacks.csv'
  callback_file = F"/content/drive/My Drive/Carl_Finance/{file_name}" 
  csv_callback_bay = keras.callbacks.CSVLogger(callback_file, separator=',', append=False)

  early_stop_bay = tf.keras.callbacks.EarlyStopping(monitor='val_acc', patience=3)
  
  filepath_bay ="/content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5"
  checkpoint_bay = ModelCheckpoint(filepath_bay, monitor="val_acc", verbose=1, save_best_only=True, mode='max')
  
  callbacks_list_bay = [early_stop_bay,checkpoint_bay,csv_callback_bay]

  lr = params["learning_rate"]
  epochs = params["epochs"]
  batch_size = params["batch_size"]

  if params["optimizer"] == 'rms':
      optimizer = optimizers.RMSprop(lr=lr)
  elif params["optimizer"] == "adadelta":
      optimizer = Adadelta(clipnorm=gradient_clipping_norm)
  else:
      optimizer = optimizers.SGD(lr=lr, decay=1e-6, momentum=0.9, nesterov=True)


  model.compile(loss="mean_squared_error", optimizer=optimizer, metrics=['accuracy'])


  history = model.fit([X_train_bay['left'], X_train_bay['right']], y_train_bay, batch_size=batch_size, epochs=epochs,
                            validation_data=([X_val_bay['left'], X_val_bay['right']], y_val_bay),verbose=2,callbacks=callbacks_list_bay)
  
  val_error = np.amin(history.history['val_loss']) 


  ## Record Keeping for Parameter Space
  print('Best validation error of epoch:', val_error)

  # Keep the best value of each metric over the epochs:
  # lowest loss, highest accuracy.
  record= pd.DataFrame({"model": [model],"loss": [np.amin(history.history['loss'])],
                        "val_error":[np.amin(history.history['val_loss'])],
                        "acc":[np.amax(history.history['acc'])],
                        "val_acc":[np.amax(history.history['val_acc'])],
                        "batch_size":[params["batch_size"]],"nodes":[params["nodes"]],
                        "learning_rate":[params["learning_rate"]],"arch":[params["arch"]],
                        "epochs":[params["epochs"]],"optimizer":[params["optimizer"]],
                        "distance_metric":[params["distance_metric"]]})
  
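  # Appending returns a new DataFrame, so rebind the module-level history;
  # otherwise the record is silently dropped before saving.
  global bayesian_history
  bayesian_history = pd.concat([bayesian_history, record], ignore_index=True, sort=False)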
  bayesian_history.to_csv('/content/drive/My Drive/Carl_Finance/bayesian_history.csv')

  ## Save DataFrame
  save_obj(bayesian_history,"bayesian_history_dataframe.pt")

  return {'loss': val_error, 'status': STATUS_OK, 'model': model, "history":history}

trials = Trials()
best = fmin(build_siamese_network,
    space=search_space,
    algo=tpe.suggest, # use hyperopt's rand.suggest to sample parameter values randomly instead
    max_evals=5, # maximum number of evaluations of the objective function
    trials=trials)

Train on 6468 samples, validate on 1617 samples
Epoch 1/25
 - 177s - loss: 0.2764 - acc: 0.6572 - val_loss: 0.2497 - val_acc: 0.6685


Epoch 00001: val_acc improved from -inf to 0.66852, saving model to /content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5
Epoch 2/25
 - 166s - loss: 0.2225 - acc: 0.6948 - val_loss: 0.2178 - val_acc: 0.6889


Epoch 00002: val_acc improved from 0.66852 to 0.68893, saving model to /content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5
Epoch 3/25
 - 164s - loss: 0.1951 - acc: 0.7217 - val_loss: 0.2079 - val_acc: 0.6970


Epoch 00003: val_acc improved from 0.68893 to 0.69697, saving model to /content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5
Epoch 4/25
 - 164s - loss: 0.1816 - acc: 0.7348 - val_loss: 0.2041 - val_acc: 0.6988


Epoch 00004: val_acc improved from 0.69697 to 0.69882, saving model to /content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5
Epoch 5/25
 - 161s - loss: 0.1731 - acc: 0.7498 - val_loss: 0.2015 

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort,



Train on 6468 samples, validate on 1617 samples
Epoch 1/25
 - 43s - loss: 0.3803 - acc: 0.6308 - val_loss: 0.3802 - val_acc: 0.6308


Epoch 00001: val_acc improved from -inf to 0.63080, saving model to /content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5
Epoch 2/25
 - 34s - loss: 0.3800 - acc: 0.6308 - val_loss: 0.3799 - val_acc: 0.6308


Epoch 00002: val_acc did not improve from 0.63080
Epoch 3/25
 - 34s - loss: 0.3799 - acc: 0.6308 - val_loss: 0.3798 - val_acc: 0.6308


Epoch 00003: val_acc did not improve from 0.63080
Epoch 4/25
 - 33s - loss: 0.3798 - acc: 0.6308 - val_loss: 0.3798 - val_acc: 0.6308


Epoch 00004: val_acc did not improve from 0.63080
Best validation error of epoch:
0.3797537739179572
Train on 6468 samples, validate on 1617 samples
Epoch 1/25
 - 51s - loss: 0.2047 - acc: 0.6916 - val_loss: 0.2018 - val_acc: 0.6914


Epoch 00001: val_acc improved from -inf to 0.69140, saving model to /content/drive/My Drive/Carl_Finance/bayesian_weights_best.hdf5
Epoch 2/2

In [0]:
print(trials.best_trial['result']['loss'])

<keras.callbacks.History object at 0x7f4fa335bf28>
0.19290456345365897


## Model Comparison 

Since there is no severe class imbalance, I will use accuracy to compare the models. I chose to compute accuracy on the validation sets because no hyperparameter tuning was done on these models, so this data has not been touched in a way that would harm the generalizability of the results.
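
One quick way to check whether accuracy is informative is to compare it with the accuracy of always predicting the most frequent class. The sketch below uses made-up labels rather than the notebook's data, and `majority_baseline_accuracy` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical labels: 1 = duplicate pair, 0 = non-duplicate pair
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0]
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

A model is only useful if its accuracy is clearly above this baseline.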

In [0]:
print("\n--- Universal Sentence Encoder by DAN Accuracy ---\n")
print(accuracy_score(y_train, DAN_predictions))

print("\n--- S-BERT Accuracy ---\n")
print(accuracy_score(y_train, sbert_predictions))

print("\n--- MALSTM Accuracy ---\n")
print(accuracy_score(y_val_w2v, get_predictions(malstm_pred)))



--- Universal Sentence Encoder by DAN Accuracy ---

0.7276999659895496

--- S-BERT Accuracy ---

0.7295860000618372

--- MALSTM Accuracy ---

0.6728509585652442


### Takeaways

In NLP tasks, the first challenge is always to turn text into a numerical representation. Researchers have tried different methods to achieve this over the years. Until around 2016, Bag of Words (BOW) and TF-IDF were popular approaches for achieving high performance. These models rely on how frequently words appear in the corpus and featurize text based on the number of occurrences of each word in a sequence.
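
To make the counting idea concrete, here is a minimal pure-Python sketch of Bag of Words and an unsmoothed TF-IDF variant on two toy questions. The helper names and the example sentences are assumptions for illustration; library implementations typically use smoothed formulas, so their numbers would differ.

```python
import math
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def tf_idf(docs):
    """Unsmoothed TF-IDF: (count / doc length) * log(n_docs / doc frequency)."""
    vocabulary = sorted({word for doc in docs for word in doc})
    n_docs = len(docs)
    # Number of documents in which each word appears
    doc_freq = {w: sum(1 for doc in docs if w in doc) for w in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[w] / len(doc)) * math.log(n_docs / doc_freq[w])
            for w in vocabulary
        ])
    return vocabulary, vectors


# Toy questions (not taken from the Quora dataset)
docs = [
    "how do i learn python".split(),
    "how do i learn java".split(),
]
vocabulary, vectors = tf_idf(docs)
bow = bag_of_words(docs[0], vocabulary)
print(vocabulary)
print(bow)
print(vectors[0])
```

Words shared by both questions get a TF-IDF weight of zero here, so only the distinguishing words ("python", "java") carry any signal.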

After 2016, embeddings became popular in the NLP community. Embeddings differ from BOW and TF-IDF in that they try to represent the context of a word in a multidimensional space instead of counting occurrences. This makes context easier for NLP algorithms to grasp. The true strength of word embeddings shows when transfer learning is used: open-source pre-trained models give developers access to models trained on massive text corpora. As a result, in Kaggle competitions and practical use cases, word embeddings became more widespread than BOW and TF-IDF as a way to achieve higher accuracy.
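
The difference can be illustrated with cosine similarity. In the sketch below, the 3-dimensional "embeddings" are hand-picked toy vectors, not the output of any pre-trained model; they only show that dense vectors can place related words close together, while count-style one-hot vectors treat every pair of distinct words as unrelated.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical dense vectors chosen by hand for illustration
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

# One-hot vectors: one dimension per word, as in a count-based representation
one_hot = {
    "car": [1, 0, 0],
    "automobile": [0, 1, 0],
    "banana": [0, 0, 1],
}

sim_embed_related = cosine_similarity(toy_embeddings["car"], toy_embeddings["automobile"])
sim_embed_unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
sim_count = cosine_similarity(one_hot["car"], one_hot["automobile"])

print(f"embeddings, car vs automobile: {sim_embed_related:.3f}")
print(f"embeddings, car vs banana:     {sim_embed_unrelated:.3f}")
print(f"one-hot, car vs automobile:    {sim_count:.3f}")
```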

In the Quora Question Pairs challenge, I wanted to apply the same approach in different ways: using sentence embeddings and word embeddings and comparing their impact on performance. I started with Google's Universal Sentence Encoder and then S-BERT sentence embeddings to classify question pairs as duplicates. I also implemented the state-of-the-art MaLSTM model for the same classification task. After these implementations, I wanted to build a custom siamese network, and I defined a large parameter space for it that was searched with Bayesian Optimization.
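
For reference, the similarity used by a MaLSTM-style siamese network can be sketched in plain Python as the exponent of the negative Manhattan distance between the two encoded questions. This is a standalone analogue of the notebook's `exponent_neg_manhattan_distance`, not its Keras implementation; the function names, the toy vectors, and the 0.5 decision threshold are assumptions.

```python
import math


def exp_neg_manhattan_similarity(left, right):
    """exp(-L1 distance): 1.0 for identical vectors, approaching 0 as they diverge."""
    l1_distance = sum(abs(a - b) for a, b in zip(left, right))
    return math.exp(-l1_distance)


def predict_duplicate(left, right, threshold=0.5):
    """Label a pair as duplicate (1) when the similarity reaches the threshold."""
    return int(exp_neg_manhattan_similarity(left, right) >= threshold)


# Hypothetical encoder outputs for two question pairs
close_pair = ([0.1, 0.2], [0.1, 0.3])
far_pair = ([0.0, 0.0], [1.0, 1.0])

print(exp_neg_manhattan_similarity(*close_pair), predict_duplicate(*close_pair))
print(exp_neg_manhattan_similarity(*far_pair), predict_duplicate(*far_pair))
```

Because the output lies in (0, 1], it can be compared directly against the 0/1 duplicate label, which is why a mean squared error loss fits this setup.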

The validation accuracy of the custom model could not surpass the performance of the sentence embeddings. Given the computationally heavy nature of the task and the limited time, it was very hard to beat these state-of-the-art models.


## Things To Do If Given More Time



*   Train MaLSTM on a larger dataset. Although it uses pre-trained embeddings, more data would improve the LSTM layer because there would be more examples for it to learn from.
*   Examine which question pairs were misclassified and, if the analysis reveals a systematic error, diagnose it and come up with a solution.
*   Run more Bayesian Optimization evaluations. The number of evaluations was 5, and training 5 evaluations takes 5-6 hours on a GPU. Five evaluations are too few for Bayesian Optimization to learn from previous outcomes, so due to time constraints the algorithm could not be used to its full potential.
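
As a starting point for the error analysis mentioned in the list above, misclassified pairs can be separated into false positives and false negatives before reading through them. The sketch below is a hypothetical helper on made-up data, not code from the notebook.

```python
def split_errors(pairs, y_true, y_pred):
    """Group misclassified question pairs by error type."""
    false_positives, false_negatives = [], []
    for pair, truth, pred in zip(pairs, y_true, y_pred):
        if pred == 1 and truth == 0:
            false_positives.append(pair)   # predicted duplicate, actually distinct
        elif pred == 0 and truth == 1:
            false_negatives.append(pair)   # predicted distinct, actually duplicate
    return false_positives, false_negatives


# Made-up question pairs with true labels and model predictions
pairs = [
    ("How do I learn Python?", "What is the best way to learn Python?"),
    ("How do I learn Python?", "How do I learn Java?"),
    ("What is AI?", "What is artificial intelligence?"),
]
y_true = [1, 0, 1]
y_pred = [1, 1, 0]

false_positives, false_negatives = split_errors(pairs, y_true, y_pred)
print("False positives:", false_positives)
print("False negatives:", false_negatives)
```

Reading the two groups separately makes it easier to spot systematic patterns, such as pairs that differ by a single key word.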




## Citations 


*   Reimers, Nils, and Iryna Gurevych. “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks.” Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, doi:10.18653/v1/d19-1410.

*   Cer, Daniel, et al. “Universal Sentence Encoder.” 2018.

*   Neculoiu, Paul, et al. “Learning Text Similarity with Siamese Recurrent Networks.” Proceedings of the 1st Workshop on Representation Learning for NLP, 2016, doi:10.18653/v1/w16-1617.

*   Bergstra, J., et al. “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures.” Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013.

*   Mansukhani, Subir. “HyperOpt: Bayesian Hyperparameter Optimization.” Data Science Blog by Domino, blog.dominodatalab.com/hyperopt-bayesian-hyperparameter-optimization/.

*   Zelros. “From Bag of Words to Transformers: 10 Years of Practical Natural Language Processing.” Medium, 3 Nov. 2019, medium.com/@Zelros/from-bag-of-words-to-transformers-10-years-of-practical-natural-language-processing-8ccc238f679a.

