# Final Project: Training a Siamese Neural Network (SNN) to Detect Duplicate Question Pairs
**Michael Klein**  
16 December 2022  
CS 539 Machine Learning  

### Abstract

This project utilizes natural language processing (NLP) techniques together with Siamese Neural Networks (SNNs) to classify pairs of questions as duplicates or not duplicates. Unlike conventional classification problems, the machine learning model must be designed to accept two inputs and compare them to each other rather than to the rest of the training set.

The data was sourced from [Quora’s Question Pairs Dataset](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs), which provides over 400,000 question pairs as well as class labels (`is_duplicate`, 0 or 1).

Solving problems like this one can go a long way toward improving search engine results and chatbot interactions as well as facial recognition and similar technologies (when extended to accept image input rather than character strings).

The model is developed in Keras, and the data are pre-processed with Keras and the `nltk` module in Python. Though the initial model results were disappointing (no better than random guessing), the revised model showed significant improvement in accuracy (~86% on the training set after 50 epochs, ~80% on the validation set). There is still room for improvement, but these results, the culmination of a semester's worth of work, are a good starting point for fine-tuning the model for business use.

### Overview

This final project uses a Siamese neural network (SNN) model to assess whether pairs of lexically similar questions are semantically similar, that is, if two similar-worded questions are actually asking the same thing.

These special types of neural networks accept two inputs and determine whether they are similar by comparing their features. Inputs in an SNN model are passed through two identical sub-networks that share the same parameters and weights (when a weight is changed in one sub-network, the other sub-network changes to match).
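To make the idea of shared weights concrete, here is a minimal sketch in plain Python. It is not the Keras model used later in this project: the weight values, the single `tanh` layer, and the function names `encode` and `siamese_distance` are all assumptions chosen for illustration. The key point is that both inputs pass through the very same encoder before being compared.

```python
import math

# Hypothetical toy weights; in a real SNN these are learned during training.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]

def encode(features, weights=SHARED_WEIGHTS):
    """Map an input vector to an embedding with a single tanh layer."""
    return [math.tanh(sum(w * x for w, x in zip(row, features)))
            for row in weights]

def siamese_distance(input_a, input_b):
    """Run both inputs through the SAME encoder and compare the embeddings."""
    emb_a = encode(input_a)
    emb_b = encode(input_b)
    return math.dist(emb_a, emb_b)

print(siamese_distance([1.0, 0.0, 2.0], [1.0, 0.0, 2.0]))   # identical inputs
print(siamese_distance([1.0, 0.0, 2.0], [0.0, 3.0, -1.0]))  # different inputs
```

Because there is only one set of weights, identical inputs always produce identical embeddings and therefore a distance of zero, no matter what the weights are.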

SNNs are commonly used in facial recognition technology and are the reason why your smartphone only needs a few images of your face rather than thousands of images (as would be the case with a conventional CNN).  

<img src="./images/snn_example.png" style="width: 600px;"/>  

Example SNN used for validating signatures. *Source: [Towards Data Science](https://towardsdatascience.com/a-friendly-introduction-to-siamese-networks-85ab17522942)*

#### Related Work

While there was no specific lecture or article that galvanized me to undertake this project, I did begin this course with a general idea of what I wanted to focus on. After taking the course CS 548 (Knowledge Discovery and Data Mining), I knew that I wanted to learn more about neural networks, in particular convolutional neural networks (CNNs) for time-series analysis. At the same time, a friend of mine was looking for ways to build on a chatbot used to help participants in remote project work in the genomics space. I thought learning SNNs was a good overlap of my interests and practical real-world applications.  

For more information on how SNNs work, I found [this article](https://towardsdatascience.com/a-friendly-introduction-to-siamese-networks-85ab17522942) to be quite helpful.

#### Initial Research Question

*For a given pair of lexically similar questions, are the two questions asking for the same information?*

#### Motivation and Business Value

Solving the research question would ultimately answer a broader question for someone designing a chatbot:

*If a given answer A is appropriate for question Q1, can A also be used to answer question Q2?*

Answering this question is beyond the scope of this project, but the logical next step would be to apply a (hopefully accurate) ML model to a repository of questions and answers. Rather than generate similar answers for differently worded questions, a chatbot can save a lot of time by first determining whether a new question is similar to one already in the repository. If so, then the chatbot need only look up the answer to the existing question, rather than search for a suitable answer or respond with a disappointing, “I don’t know.”

#### Dataset

The dataset is sourced from [Quora’s Question Pairs Dataset](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs), which is publicly available. The dataset is large (about 404,300 question pairs) and already processed — there are almost no missing values, and each question pair is neatly classified as being duplicates (1) or not (0). 

A few favorite pairs:  

- 457: *What are the life lessons that Batman teaches us?* and *What are the life lessons you can learn from the dark knight?* (duplicates)
- 135001: *Why is life so unfair and difficult?* and *Why is life very unfair?* (duplicates)
- 214890: *What is the difference between data mining, artificial intelligence and machine learning?* and *What is the difference between data science, artificial intelligence and machine learning?* (close, but not duplicates)
- 404055: *Is Donald Trump corrupt?* and *Does Donald Trump think Putin is corrupt?* (not duplicates)

Pre-processing of the dataset included removing punctuation, standardizing spelling, lemmatizing words, and embedding words as vectors.

### Recap of Previous Milestones

- **Milestone 1:** Project proposal in which I identified the dataset to be analyzed as well as my intention to utilize SNNs to train an ML model.
- **Milestone 2:** EDA and preliminary pre-processing using R. The next major decision would be to select a Python module capable of employing SNNs. Eventually, Keras was selected.
- **Milestone 3:** Initial model building and training using Euclidean distance and contrastive loss function. Unfortunately, the initial ML model performed no better than random guessing. The proposed next steps were to revisit the input data and hyperparameters as well as to learn more about Keras.
- **Milestone 4:** Model revision, which fortunately yielded much more promising results after using Manhattan distance and mean squared error as the loss function. After 10 epochs of training, the model performed significantly better than random guessing (about 77% for the training set, 76% for the validation set). Soon after this, I was able to train the model for 50 epochs, which led to even better results. This is the model covered in this final iteration of the project.

## Exploratory Data Analysis (EDA)

Earlier in the project, we turned to R for performing exploratory data analysis (EDA). This was because R in general, and RStudio in particular, are better suited to performing these types of analyses. The R source code and associated R data objects are available in the `/src` folder.

Because I ran into some initial problems in training the model, I decided to start the entire data pre-processing from scratch in Python while keeping the plots generated in R.

### Missing Values

There are two missing values in the `question_2` column. What are they?  

![](./images/missing_vals.png)

With only two missing values out of over 400 thousand, it's safe to simply toss these records from the dataset. This is done later in Cell 3.

### Distinct Questions

The missing values above were interesting because they essentially are asking the same question. So, out of all questions in each column, how many are unique?  

<img src="./images/distinct_vals.png" style="width: 200px;"/>

Both of these numbers are well below 400,000, or the total number of records. About a third of questions in each column are "recycled," perhaps paired with different versions of similar questions that might be tricky to detect as duplicates (or not).

Let's now take a look at how many distinct questions there are when grouping duplicate and non-duplicate questions.  

<img src="./images/distinct_vals_by_duplicates.png" style="width: 300px;"/>  

There are a total of 404,288 questions (about 255,000 tagged as non-duplicates and the rest tagged as duplicates).  

Interestingly, there's a greater proportion of distinct questions in non-duplicates compared to duplicates. Perhaps this is because there are more similar-sounding questions that actually aren't duplicates, and an ML model would need more non-similar examples for a given question in order to make an accurate classification.

### Dataset Classification Distribution

Building on the results above, we can see how the dataset is distributed in terms of duplicate and non-duplicate questions. Ideally, the dataset would feature a 50/50 split of duplicates/non-duplicates. But if this isn't the case, then we can do some stratified sampling when we train the model to ensure that both classes are adequately represented.  

<img src="./images/duplicate_v_non_duplicate.png" style="width: 400px;"/>  

With many more non-duplicates than duplicates, it makes sense to use stratified random sampling when generating a train and test set for the ML model (see Cell 12, Line 6 of Final Analysis).

#### Dataset Classification Distribution by Question

Next, we can plot distinct questions grouped by class, by question (`q1` or `q2`).  

<img src="./images/distinct_questions.png" style="width: 400px;"/>  

What we can conclude here is that neither class has significantly more distinct questions than the other. In other words, it is not the case that one question is being repeated much more than others (on average).

## Final Analysis

### Getting Started

As ever, we'll start by loading the required Python modules for this analysis.

In [1]:
import pandas as pd
import numpy as np
import statistics as stats
import contractions
import csv
import re
import os

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

from string import punctuation
punctuation = list(punctuation)


import tensorflow as tf
import keras
import keras.backend as K

from sklearn.model_selection import train_test_split

from keras.preprocessing.text import Tokenizer
from keras.utils import pad_sequences
from keras.optimizers import Adam, RMSprop, Adadelta
from keras.models import Sequential, Model
from keras.regularizers import l2
from tensorflow.keras.utils import plot_model
from keras.layers import Embedding, Input, Dense, Flatten, GlobalMaxPool2D, GlobalAvgPool2D, Concatenate, Multiply, Dropout, Subtract, Add, Conv2D, LSTM, Lambda

tf.config.run_functions_eagerly(True)

import warnings
warnings.filterwarnings('ignore')

# Change working directory to one level up so that we can load the required datasets from sibling directory
os.chdir(os.path.dirname(os.getcwd()))

[nltk_data] Downloading package punkt to /Users/michklein/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/michklein/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/michklein/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/michklein/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


Instead of importing the data as an R object as last time, we will read in the data from scratch and try alternative pre-processing modules from Keras.

Columns are cast as integers, strings, or categories, with the class attribute `is_duplicate` cast as a category.

In [2]:
filepath = 'data/quora_duplicate_questions.tsv'

col_type = {'id': 'int',
            'qid1': 'int',
            'qid2': 'int',
            'question1_raw': 'str',
            'question2_raw':  'str',
            'is_duplicate': 'category'}

data = pd.read_csv(filepath, skiprows=[0], names=col_type.keys(), dtype=col_type, delimiter='\t', index_col='id')

data.sample(5)

| id | qid1 | qid2 | question1_raw | question2_raw | is_duplicate |
|---|---|---|---|---|---|
| 140497 | 223256 | 178264 | How would the US respond if China or Russia to... | You are one of the great powers and you have j... | 0 |
| 268029 | 385504 | 385505 | How can I get 10 lpa plus package as a fresher... | How can an M.Sc. in Statistics fresher in Indi... | 0 |
| 120435 | 195341 | 186930 | How can I get Meth out of my system in less th... | How do I pass a drug test for meth in 40 hours? | 1 |
| 365220 | 50270 | 22099 | How does Demonetisation of 1000 and 500 rupees... | What will be the impact on real estate by bann... | 1 |
| 66259 | 114870 | 114871 | What does it mean to have a Whiggish view of h... | What does it mean to teach history? What does ... | 0 |


The next step is to check the dataset for missing values and drop them.

In [3]:
print(f'Total Number of Records (Before Dropping NAs):\t{len(data)}')

data = data.dropna(axis=0)

print(f'Total Number of Records (After Dropping NAs):\t{len(data)}')

Total Number of Records (Before Dropping NAs):	404290
Total Number of Records (After Dropping NAs):	404287


Next, we'll start pre-processing the data by converting text to lowercase, removing numerals, and expanding any contractions.

In [4]:
questions = ['question1_raw', 'question2_raw']

for col in questions:
    data[col] = data[col].str.lower()
    data[col] = data.apply(lambda row: re.sub(r'[0-9]+', '', row[col]), axis=1)
    data[col] = data.apply(lambda row: contractions.fix(row[col]), axis=1)
    
data.sample(5)

| id | qid1 | qid2 | question1_raw | question2_raw | is_duplicate |
|---|---|---|---|---|---|
| 18802 | 35592 | 35593 | should a person join the core or software indu... | should i join hardware or software industry? i... | 0 |
| 261600 | 377779 | 377780 | is it appropriate for a (christian) spouse to ... | my husband is having an emotional affair with ... | 0 |
| 376384 | 227955 | 151358 | where can i find fairy themed cupcakes in gold... | where can i get free cupcake delivery in gold ... | 0 |
| 323007 | 107409 | 247763 | what is the simple meaning of "once in a blue ... | what is the meaning of "once in a blue moon "? | 1 |
| 296143 | 389457 | 418300 | how often do guys regret rejecting a girl? | do girls enjoy rejecting guys? | 0 |


We're now ready to tokenize the queries, that is, to break each query down into individual words. This will make further pre-processing easier.

In [5]:
data['question1'] = data.apply(lambda row: word_tokenize(row['question1_raw']), axis=1)
data['question2'] = data.apply(lambda row: word_tokenize(row['question2_raw']), axis=1)

data.sample(5)

| id | qid1 | qid2 | question1_raw | question2_raw | is_duplicate | question1 | question2 |
|---|---|---|---|---|---|---|---|
| 42597 | 76736 | 76737 | why does kungfu panda always eat momo? | should i open a momo stall near cheryl and ber... | 0 | [why, does, kungfu, panda, always, eat, momo, ?] | [should, i, open, a, momo, stall, near, cheryl... |
| 129626 | 208188 | 208189 | what is the evolution of god? | what is an evolute? | 0 | [what, is, the, evolution, of, god, ?] | [what, is, an, evolute, ?] |
| 12331 | 23768 | 23769 | what is it like to be a mentor to high school ... | what are the best ways to find mentor for a hi... | 0 | [what, is, it, like, to, be, a, mentor, to, hi... | [what, are, the, best, ways, to, find, mentor,... |
| 115134 | 187811 | 29483 | what would you wish for quora? | what is a wish? | 0 | [what, would, you, wish, for, quora, ?] | [what, is, a, wish, ?] |
| 88813 | 149319 | 149320 | renewing indian passport while working in diff... | are we overusing gadgets in day-to-day life? | 0 | [renewing, indian, passport, while, working, i... | [are, we, overusing, gadgets, in, day-to-day, ... |


Stopwords are common words (the, a/an, etc.) that won't add much meaning to a query. We can remove them as well as any punctuation marks so that the model is only analyzing words. A simple function can do both steps at once.

In [6]:
def remove_stopwords_punc(tokenized_list):
    clean_tokens = [word for word in tokenized_list if word not in stopwords and word not in punctuation]
    return clean_tokens
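
To see what this filter does in isolation, the sketch below reimplements it with a tiny hand-written stopword set. The names `TOY_STOPWORDS` and `remove_stopwords_punc_toy` are hypothetical and exist only for this illustration; the notebook itself relies on NLTK's English stopword list.

```python
from string import punctuation

# Hypothetical, hand-written stopword set (the notebook uses NLTK's English list).
TOY_STOPWORDS = {"what", "is", "the", "of", "a", "an"}
PUNCTUATION = set(punctuation)

def remove_stopwords_punc_toy(tokens):
    """Keep only tokens that are neither stopwords nor single punctuation marks."""
    return [tok for tok in tokens
            if tok not in TOY_STOPWORDS and tok not in PUNCTUATION]

tokens = ["what", "is", "the", "evolution", "of", "god", "?"]
print(remove_stopwords_punc_toy(tokens))  # ['evolution', 'god']
```

The sketch stores both collections as sets, which makes each membership test a constant-time lookup; with lists, every token is compared against each entry in turn.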

Applying the function to the dataset:

In [7]:
cleaned_cols = ['question1', 'question2']

for col in cleaned_cols:
    data[col] = data.apply(lambda row: remove_stopwords_punc(row[col]), axis=1)
    
data.loc[:,['question1', 'question2', 'is_duplicate']].sample(5)

| id | question1 | question2 | is_duplicate |
|---|---|---|---|
| 164769 | [used] | [apply, silicone, caulk, without, using, gun] | 0 |
| 359842 | [save, hair, getting, thin, frizzy] | [stop, hair, thinning] | 0 |
| 372288 | [stalkbook, app, work] | [summary, app, work] | 0 |
| 12608 | [navy, reserves, boot, camp, like] | [navy, boot, camp, stories] | 0 |
| 260316 | [best, retirement, investment, strategies] | [presently, shall, retiring, years, maintain, ... | 0 |


Looks good so far!  

The next step is to standardize spelling in the dataset questions, which use both American and British spelling conventions. By default, this project takes all British spelling variants and converts them to their American equivalents. However, there is functionality to perform the opposite conversion, since neither spelling convention is inherently "more correct" than the other.

The dictionary of American vs British spelling conventions is sourced from [this site](http://www.tysto.com/uk-us-spelling-list.html).

In [8]:
filepath_dict = 'data/british_american_spelling_dict.csv'

with open(filepath_dict) as file:
    reader = csv.reader(file)
    british_american_dict = dict((row[0], row[1]) for row in reader)
    
def standard_spellcheck(word_list):
    output = []
    for word in word_list:
        if word in british_american_dict.keys():
            output.append(british_american_dict.get(word))
        else:
            output.append(word)
    
    return output
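
The lookup-with-fallback logic above can also be sketched on its own. The three-entry mapping and the helper names `standardize_spelling` and `invert` are assumptions made for this example; the notebook loads the full spelling list from a CSV file. Inverting the mapping is one possible way to perform the opposite (American to British) conversion.

```python
# Hypothetical three-entry mapping; the notebook loads the full list from a CSV file.
TOY_BRITISH_TO_AMERICAN = {
    "favourite": "favorite",
    "colours": "colors",
    "analyse": "analyze",
}

def standardize_spelling(words, mapping=TOY_BRITISH_TO_AMERICAN):
    """Replace each word found in the mapping; leave all other words unchanged."""
    return [mapping.get(word, word) for word in words]

def invert(mapping):
    """Swap keys and values to convert in the opposite direction."""
    return {american: british for british, american in mapping.items()}

print(standardize_spelling(["favourite", "anime", "character"]))
print(standardize_spelling(["favorite", "color"], invert(TOY_BRITISH_TO_AMERICAN)))
```

Note that inverting only works cleanly when each American spelling maps back to a single British spelling; if two British variants shared one American form, one of them would be lost.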

Applying the function to the dataset:

In [9]:
for col in cleaned_cols:
    data[col] = data.apply(lambda row: standard_spellcheck(row[col]), axis=1)

data.loc[[555, 2735],['question1_raw', 'question1', 'is_duplicate']]

| id | question1_raw | question1 | is_duplicate |
|---|---|---|---|
| 555 | what is your favourite anime character and why? | [favorite, anime, character] | 1 |
| 2735 | did ancient people perceive less colours than us? | [ancient, people, perceive, less, colors, us] | 0 |


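Finally, we want to lemmatize all words. Lemmatization reduces words to their dictionary base forms (lemmas). The goal is to make it easier to recognize that words like "eat" and "eating" are similar even though they are different character strings. Note that the lemmatizer used here treats every word as a noun unless a part-of-speech tag is supplied, so it mainly normalizes plurals; verb forms such as "eating" or "banning" pass through unchanged.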

In [10]:
def lemmatize_list(word_list):
    output = []
    for word in word_list:
        output.append(lemmatizer.lemmatize(word))
    return output

Applying the lemmatization function to the dataset:

In [11]:
for col in cleaned_cols:
    data[col] = data.apply(lambda row: lemmatize_list(row[col]), axis=1)
    data[col] = data.apply(lambda row: ' '.join(row[col]), axis=1)
    
data.loc[:,['question1', 'question2', 'is_duplicate']].sample(5)

| id | question1 | question2 | is_duplicate |
|---|---|---|---|
| 68470 | pro con transatlantic trade investment partner... | advantage disadvantage transatlantic trade inv... | 1 |
| 148718 | improve writing skill | improve english writing speaking skill | 1 |
| 43762 | benefit ban rupee note | consequence rupee note banning | 1 |
| 91027 | lose weight | lose kilo | 1 |
| 175607 | historical background cinese worker australian... | historical background cinese worker australian... | 0 |


### Creating Training and Test Data

Next, we'll take the dataset and split it into training and test sets. We'll start with 80% training and 20% test, though we may adjust this later. The `train_test_split` function in scikit-learn can automatically stratify the sample so that the proportions in each set represent the proportions in the overall dataset.

In [12]:
test_size = 0.2

X_train, X_test, y_train, y_test = train_test_split(data[['question1', 'question2']], 
                                                    data['is_duplicate'],
                                                    test_size=test_size,
                                                    stratify=data['is_duplicate'],
                                                    random_state=611)

# Cast labels as Numpy arrays and reshape for passing to model later
y_train = np.asarray(y_train).astype('float32').reshape(-1,1)
y_test = np.asarray(y_test).astype('float32').reshape(-1,1)

Let's see what the final set sizes look like:

In [13]:
print(f'Training Set (X):\t{X_train.shape[0]:,} records')
print(f'Test Set (X):\t\t{X_test.shape[0]:,} records')
print(f'\nTraining Set (y):\t{y_train.shape[0]:,} records')
print(f'Test Set (y):\t\t{y_test.shape[0]:,} records')

Training Set (X):	323,429 records
Test Set (X):		80,858 records

Training Set (y):	323,429 records
Test Set (y):		80,858 records


Everything looks good! We're ready for the next step.
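
For intuition about what stratification does, here is a small standard-library sketch. It is not how scikit-learn implements `train_test_split`; the function `stratified_split` and the toy label proportions are hypothetical. The idea is simply to split each class separately so that both sets keep the same class mix.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=611):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 63 non-duplicates and 37 duplicates (hypothetical proportions).
labels = [0] * 63 + [1] * 37
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

A similar `Counter` check on `y_train` and `y_test` would be one way to confirm that the real split preserved the class balance.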

### Tokenizing, Encoding, and Padding Textual Data

The next step is to tokenize, encode, and pad the textual information.

**Tokenizing** will involve separating words in a given corpus into discrete units. Typically, each word (separated by spaces) is a single token. Keras provides us with a `Tokenizer` class to help with this.  

**Encoding** will involve converting words and phrases into vectors. This is because, like all neural networks, the SNN will recognize only numeric arrays as inputs and not characters.

**Padding** will involve resizing word vector arrays so that they are all the same shape for input into the SNN model.

One approach to encoding is to assign new values as new words appear. For example, the sentence, "Give the ball to the man" would be encoded as `[1, 2, 3, 4, 2, 5]` (note that the word "the" is repeated, and so is the index).

Another approach is to perform a frequency analysis and assign values based on frequency. In the example above, the word "the" would be assigned a value of `1` because it appears the most frequently. This is the approach we'll take.

Finally, note that the index starts at `1` rather than `0` as usual. The reason for this will be explained shortly.
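
The two encoding approaches described above can be compared side by side in a short sketch. The function names are hypothetical, and this is not the Keras `Tokenizer` itself; in this sketch, words with equal frequency are ranked in order of first appearance.

```python
from collections import Counter

def encode_by_first_appearance(tokens):
    """Assign 1, 2, 3, ... in the order that new words appear."""
    index = {}
    for tok in tokens:
        index.setdefault(tok, len(index) + 1)
    return [index[tok] for tok in tokens]

def encode_by_frequency(tokens):
    """Assign 1 to the most frequent word, 2 to the next, and so on."""
    ranked = [word for word, _ in Counter(tokens).most_common()]
    index = {word: rank for rank, word in enumerate(ranked, start=1)}
    return [index[tok] for tok in tokens]

tokens = "give the ball to the man".split()
print(encode_by_first_appearance(tokens))  # [1, 2, 3, 4, 2, 5]
print(encode_by_frequency(tokens))         # [2, 1, 3, 4, 1, 5]
```

In both cases the indices start at `1`, which leaves `0` free for another purpose.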

#### Creating a Corpus

First, we need to create our corpus, that is, the body of text from which indices are organized and assigned. In this case, the corpus will comprise the entire set of questions, all aggregated as one body of text. Reading this corpus won't make much sense, but the purpose here is only to arrange words by frequency and assign values accordingly.

The following function takes a dataframe of strings and returns a Series in which each row's question strings are joined into a single string.

In [14]:
def create_corpus(df):
    return df.agg(' '.join, axis=1)

corpus = create_corpus(X_train[['question1', 'question2']])

#### Encoding the Corpus

Next, we can instantiate a Keras `Tokenizer` class and encode the corpus.

In [15]:
t = Tokenizer()
t.fit_on_texts(corpus.values)

print(f'Total Corpus Word Count:\t{sum(t.word_counts.values()):,}')
print(f'Distinct Words in Corpus:\t{len(t.word_index) + 1:,}')

Total Corpus Word Count:	3,553,559
Distinct Words in Corpus:	68,967


Our corpus is about 3.6 million words that comprise a little under 70,000 distinct words.

Next, we'll map text in our training and test sets to the newly assigned values.

In [16]:
X_train_q1 = t.texts_to_sequences(X_train['question1'].values)
X_train_q2 = t.texts_to_sequences(X_train['question2'].values)

X_test_q1 = t.texts_to_sequences(X_test['question1'].values)
X_test_q2 = t.texts_to_sequences(X_test['question2'].values)

Here's an example of what the mapped data looks like compared to the original text:

In [17]:
pd.DataFrame(data={'Original': X_train['question1'],
                   'Encoded': X_train_q1}).reset_index().head(10)

|  | id | Original | Encoded |
|---|---|---|---|
| 0 | 279136 | acceleration measured unit distance/time^ | [2777, 2872, 1431, 558, 15] |
| 1 | 321162 | gravity look like | [1027, 170, 4] |
| 2 | 43014 | new rupee gps | [24, 137, 1391] |
| 3 | 383717 | evidence support ancient alien theory | [904, 240, 1052, 905, 389] |
| 4 | 312111 | mean ex-crush stare try get attention | [47, 609, 758, 3324, 700, 2, 2444] |
| 5 | 266555 | people still believe communism viable economic... | [5, 102, 247, 3643, 5208, 1173, 103] |
| 6 | 393886 | girl played troll doll back 's toy boy play era | [46, 1672, 5106, 6972, 133, 9, 3834, 420, 206,... |
| 7 | 187733 | blanket statement | [8043, 1459] |
| 8 | 77386 | example normative economic statement | [94, 15932, 1173, 1459] |
| 9 | 34598 | law arrest woman | [224, 4350, 62] |


In Row 1, note that the word `'like'` is encoded as `4`, meaning that it is the fourth most frequently occurring word in the dataset. In Row 4, you can see that the word `'get'` is the second most frequent word since it's encoded as `2`. This isn't surprising, as these words are often used in search queries. In other words, everything is still looking good, and we can move to the next step.

#### Padding Encoded Word Sequences

Now that our people-friendly words have been transformed into Keras-friendly numbers, we face another issue. Different-length phrases result in arrays of differing dimensions. This won't do for our SNN model, so the solution is to make every encoded array the same size. Shorter arrays will be populated with the index `0` where no token values exist. This is the reason for reserving the `0` index and starting with `1` above.

First, let's look for the maximum length of a given array, that is, the phrase with the maximum number of words in the dataset. This will serve as a starting point for how much to pad the data.

We can start by defining a function to find the maximum, median, and 95th/99th percentile lengths of the arrays in the entire dataset (test and training, Q1 and Q2). We'll also output these summary statistics as a printout.

In [18]:
def find_set_stats(train_1, train_2, test_1, test_2, output='summary'):
    
    vals = [*train_1, *train_2, *test_1, *test_2]
    
    total_max = int(max(len(x) for x in vals))
    total_med = int(stats.median(len(x) for x in vals))
    total_95 = int(np.percentile([len(x) for x in vals], 95))
    total_99 = int(np.percentile([len(x) for x in vals], 99))
     
    if output == 'summary':
        print(f'SUMMARY STATISTICS')
        print(f'-'*20)
        print(f'Max Length:\t{total_max}')
        print(f'Median Length:\t{total_med}')
        print(f'Length 95%ile:\t{total_95}')
        print(f'Length 99%ile:\t{total_99}')
        
        return None
        
    elif output == 'max':
        return total_max
    
    elif output == 'med':
        return total_med
    
    elif output == '95':
        return total_95
    
    elif output == '99':
        return total_99
    
    else:
        print('Invalid output argument')
        return None

In [19]:
find_set_stats(X_train_q1, X_train_q2, X_test_q1, X_test_q2, output='summary')

SUMMARY STATISTICS
--------------------
Max Length:	99
Median Length:	5
Length 95%ile:	12
Length 99%ile:	16


What this shows is that the maximum length of a query in the entire dataset is 99 words long. However, this is clearly an outlier. In fact, 99% of the queries in the dataset are 16 words or fewer.

If we went with the max value of 99, then every array would be 99 elements long, and most of those elements would be zeroes. In other words, we would significantly increase the complexity and processing time of the model without gaining much information. Therefore, we will use 16 as the maximum length for our padding. Queries that are longer will be truncated.
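
To put a rough number on that trade-off, the sketch below computes the share of array cells that would be padding for two choices of maximum length. The list `toy_lengths` is made up for illustration and is not drawn from the dataset, and `padding_fraction` is a hypothetical helper.

```python
def padding_fraction(lengths, maxlen):
    """Fraction of cells that are padding when every sequence is resized to maxlen."""
    total_cells = len(lengths) * maxlen
    real_tokens = sum(min(length, maxlen) for length in lengths)
    return (total_cells - real_tokens) / total_cells

# Hypothetical sequence lengths: mostly short, with one extreme outlier.
toy_lengths = [3, 5, 5, 6, 8, 12, 16, 99]

print(f"maxlen=99: {padding_fraction(toy_lengths, 99):.0%} padding")
print(f"maxlen=16: {padding_fraction(toy_lengths, 16):.0%} padding")
```

Even in this small example, a single outlier forces most of the array to be zeroes when the maximum length is used.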

In [20]:
max_len_padding = find_set_stats(X_train_q1, X_train_q2, X_test_q1, X_test_q2, output='99')

X_train_q1 = pad_sequences(X_train_q1, maxlen=max_len_padding, padding='post')
X_train_q2 = pad_sequences(X_train_q2, maxlen=max_len_padding, padding='post')

X_test_q1 = pad_sequences(X_test_q1, maxlen=max_len_padding, padding='post')
X_test_q2 = pad_sequences(X_test_q2, maxlen=max_len_padding, padding='post')

Let's see what the padded data looks like now:

In [21]:
pd.DataFrame(data={'Original': X_train['question1'],
                   'Encoded and Padded': X_train_q1.tolist()}).reset_index().head(10)

|  | id | Original | Encoded and Padded |
|---|---|---|---|
| 0 | 279136 | acceleration measured unit distance/time^ | [2777, 2872, 1431, 558, 15, 0, 0, 0, 0, 0, 0, ... |
| 1 | 321162 | gravity look like | [1027, 170, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0... |
| 2 | 43014 | new rupee gps | [24, 137, 1391, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 3 | 383717 | evidence support ancient alien theory | [904, 240, 1052, 905, 389, 0, 0, 0, 0, 0, 0, 0... |
| 4 | 312111 | mean ex-crush stare try get attention | [47, 609, 758, 3324, 700, 2, 2444, 0, 0, 0, 0,... |
| 5 | 266555 | people still believe communism viable economic... | [5, 102, 247, 3643, 5208, 1173, 103, 0, 0, 0, ... |
| 6 | 393886 | girl played troll doll back 's toy boy play era | [46, 1672, 5106, 6972, 133, 9, 3834, 420, 206,... |
| 7 | 187733 | blanket statement | [8043, 1459, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
| 8 | 77386 | example normative economic statement | [94, 15932, 1173, 1459, 0, 0, 0, 0, 0, 0, 0, 0... |
| 9 | 34598 | law arrest woman | [224, 4350, 62, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |


As you can see, the leading values of each padded array are identical to the encoded data from before; the remainder of the array is filled with the reserved index `0` whenever the phrase is shorter than 16 words.
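
The padding and truncation behavior can be sketched without Keras. The helper `pad_post` below is a hypothetical stand-in that mirrors `padding='post'` combined with the Keras default of truncating from the front of a sequence that is too long; it assumes `maxlen` is a positive integer.

```python
def pad_post(sequence, maxlen, value=0):
    """Pad at the end with `value`; if too long, keep only the last `maxlen` items."""
    trimmed = sequence[-maxlen:]  # drops leading items when the sequence is too long
    return trimmed + [value] * (maxlen - len(trimmed))

print(pad_post([1027, 170, 4], maxlen=6))            # [1027, 170, 4, 0, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6, 7, 8], maxlen=6))  # [3, 4, 5, 6, 7, 8]
```

If keeping the start of long questions matters more than keeping the end, `pad_sequences` also accepts `truncating='post'`.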

### Mapping Data to Space with Global Vectors (GloVe)

We've now mapped our textual queries to integers, but that raises the question: how do we know how similar words are to each other? After all, if we compare some of the top words such as `like` and `get`, it's clear that these words aren't semantically related to each other at all, even if they're close together in their integer values.

The solution is to use the Global Vectors (GloVe) model, which is an open-source algorithm for representing words as vectors in *n*-dimensional space. Vectors for synonymous words like *little* and *small* are close to each other, as are words that share the same context, such as *kitchen* and *eat*.

By "close," we mean that vectors have a small Euclidean distance, which will be the standard used to calculate word similarities in this model.

The theory behind the algorithm is detailed in [this paper by Pennington, Socher, and Manning (2014)](https://nlp.stanford.edu/pubs/glove.pdf). In addition to developing the algorithm, the authors released pre-trained vectors, including a set of 400,000 words trained on Wikipedia and the Gigaword news corpus. Each word is described by an *n*-dimensional vector; this set is available in 50, 100, 200, and 300 dimensions, and we use the 300-dimensional version.

#### Loading the GloVe Embeddings

The GloVe embeddings for 400,000 words are publicly available for download. We'll use the standard 300-dimensional dataset.

In [22]:
glove_words = {}

filepath = 'data/glove.6B.300d.txt'

with open(filepath, encoding='utf-8') as file:
    for line in file:
        vals = line.split()
        word = vals[0]
        coords = np.asarray(vals[1:], dtype='float32')
        
        glove_words[word] = coords

glove_len = len(glove_words)    
glove_dim = len(glove_words.get("can")) 

print(f'Loaded {glove_len:,} word vectors of {glove_dim} dimensions each.')

Loaded 400,000 word vectors of 300 dimensions each.


Just what we expected!

#### Mapping the Corpus to GloVe Embeddings

The next step is to create an array of GloVe embeddings for each word in our corpus. With about 69,000 words in the corpus, chances are that the most common words will appear in the list of 400,000 GloVe embeddings. However, less-frequent words may not appear on the list at all. These words will simply be populated with zeroes in the array. 

In [23]:
not_in_glove = []
corpus_size = len(t.word_index) + 1

# Start from zeroes (not np.empty) so that any row the loop never assigns,
# such as row 0, holds zeroes rather than uninitialized memory.
corpus_array = np.zeros((corpus_size, glove_dim))

for word, idx in t.word_index.items():
    # If a word in the corpus appears in the GloVe embeddings, copy its vector into the array.
    glove_vec = glove_words.get(word)
    if glove_vec is not None:
        corpus_array[idx] = glove_vec
    else:
        # Otherwise, leave the row as zeroes and add the word to the list of words not found.
        not_in_glove.append(word)

How did we do? First, let's see how many words weren't found in the GloVe embeddings file:

In [24]:
print(f'{len(not_in_glove):,}')

20,297


About 20,000 words weren't found, which is a significant number. We'll continue for now, but we may want to revisit this if our model still doesn't perform well.
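One way to judge how serious the missing words are is to weight them by how often they occur, since a rare misspelling matters less than a common term. The sketch below does this for a made-up vocabulary. Both `word_counts` and `known_words` are hypothetical stand-ins; in this notebook they would presumably correspond to the tokenizer's word frequencies and the keys of `glove_words`.

```python
# Hypothetical word frequencies and embedding vocabulary, for illustration only.
word_counts = {"what": 500, "is": 450, "python": 40, "quora": 8, "whatsapp": 2}
known_words = {"what", "is", "python"}

missing = [w for w in word_counts if w not in known_words]

# Share of distinct words without a vector.
vocab_miss_rate = len(missing) / len(word_counts)

# Share of word occurrences (tokens) without a vector.
total_tokens = sum(word_counts.values())
token_miss_rate = sum(word_counts[w] for w in missing) / total_tokens

print(f"Missing vocabulary share: {vocab_miss_rate:.1%}")
print(f"Missing token share:      {token_miss_rate:.1%}")
```

If the token share turns out to be small even though the vocabulary share is large, the zero vectors affect relatively few positions in the padded questions.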

### Setting Up the Model

Our dataset is essentially ready to be passed into the Siamese model, which leaves us with one last task: setting up the Siamese model!

#### Using Manhattan Distance

In the previous milestone, we used the Euclidean distance to calculate the similarity of two embedded word vectors (a smaller distance means that the words are more similar to each other).

After some more research, Manhattan distance appears to be a better choice than Euclidean distance. This is because it's not quite clear what the coordinates represent in the embedded vector space. Rather than representing continuous variables, the coordinates may well reflect black-box categorical attributes that are determined by the GloVe algorithm. If this is the case, then it would make more sense to use Manhattan distance, which is more appropriate for categorical attributes.
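As a quick illustration of how the two measures differ, the sketch below computes both for a pair of made-up 4-dimensional vectors. The vectors are hypothetical and are not taken from the GloVe data.

```python
import numpy as np

# Two hypothetical 4-dimensional vectors (not real embeddings).
u = np.array([1.0, 0.0, 2.0, -1.0])
v = np.array([0.0, 0.0, 4.0, 1.0])

diff = u - v                              # coordinate-wise differences: [1, 0, -2, -2]
manhattan = np.sum(np.abs(diff))          # sum of absolute differences (L1)
euclidean = np.sqrt(np.sum(diff ** 2))    # square root of the summed squares (L2)

print(manhattan, euclidean)
```

Manhattan distance adds up every coordinate difference at face value, while Euclidean distance squares the differences first, so a few large coordinate gaps dominate the result. For any pair of vectors, the Manhattan distance is never smaller than the Euclidean distance.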

#### The Loss Function

Before, we used contrastive loss as the loss function for the model:

$Y \times D^2 + (1-Y) \times \max(\text{margin} - D, 0)^2$

where:

-$Y$ is the label value (1 or 0)  
-$D$ is the calculated Euclidean distance  
-$\text{margin}$ is a parameter we set and essentially acts as a radius of acceptable similarity. In our case, two vectors that are colocated within the margin radius will be labeled as duplicates.
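The formula above can be sketched in NumPy as follows. This is an illustrative re-implementation rather than the code used in the earlier milestone. The function name `contrastive_loss`, the default margin of 1.0, and the sample labels and distances are all assumptions made for this example.

```python
import numpy as np

def contrastive_loss(y, d, margin=1.0):
    """Per-pair contrastive loss: y * d^2 + (1 - y) * max(margin - d, 0)^2."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(d, dtype=float)
    return y * d ** 2 + (1 - y) * np.maximum(margin - d, 0.0) ** 2

# Hypothetical labels (1 = duplicate) and distances for four pairs.
labels = [1, 1, 0, 0]
distances = [0.1, 0.9, 0.2, 1.5]

losses = contrastive_loss(labels, distances, margin=1.0)
print(losses)
```

Duplicate pairs are penalized for being far apart, while non-duplicate pairs are penalized only when they fall inside the margin. The last pair sits beyond the margin and contributes no loss.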

However, this iteration will try a built-in loss function, mean squared error (MSE), which is another appropriate choice for models that calculate distances. Here, the error is the difference between the model's output and the label. First, the square term means that there won't be any negative values. The square term also means that predictions far from their labels are penalized much more than predictions that are close to them.
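A small worked example may help here. The sketch below computes mean squared error by hand for four made-up predictions. The labels and outputs are hypothetical values, not results from this model.

```python
import numpy as np

# Hypothetical labels (1 = duplicate) and model outputs between 0 and 1.
y_true = np.array([1.0, 0.0, 1.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.5, 0.8])

# Squared error per pair: approximately [0.01, 0.01, 0.25, 0.64].
squared_errors = (y_true - y_pred) ** 2
mse = squared_errors.mean()

print(squared_errors)
print(mse)
```

The last pair, which is furthest from its label, accounts for most of the total, which is the effect of the square term described above.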

#### Defining Manhattan Distance Function

Before initializing the model, we'll need to define a function to calculate Manhattan distance. Since this function will be called within the model itself, we won't use conventional Python `math` and NumPy operations, but rather the set of functions that come with the Keras backend library (abbreviated below as `K`).

In [25]:
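# The manhattan_dist function accepts two batches of vectors, x and y. For each pair, it sums the absolute coordinate
# differences (the Manhattan distance) and returns exp(-distance), so identical vectors give 1 and distant vectors approach 0.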
def manhattan_dist(x, y):
    return K.exp(-K.sum(K.abs(x-y), axis=1, keepdims=True))
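To see what this function returns, here is a NumPy analogue applied to a made-up batch of two pairs. It mirrors the same arithmetic outside of Keras. The name `manhattan_similarity` and the sample arrays are introduced only for this illustration.

```python
import numpy as np

def manhattan_similarity(x, y):
    """NumPy analogue of exp(-sum(|x - y|)), computed row by row."""
    return np.exp(-np.sum(np.abs(x - y), axis=1, keepdims=True))

# Hypothetical batch of two pairs of 3-dimensional encodings.
a = np.array([[0.5, 0.5, 0.5], [1.0, 0.0, 0.0]])
b = np.array([[0.5, 0.5, 0.5], [0.0, 1.0, 1.0]])

sims = manhattan_similarity(a, b)
print(sims.shape)
print(sims.ravel())
```

Because of the negative exponent, the output always lies between 0 and 1, which puts it on the same scale as the `is_duplicate` labels.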

#### Initializing the Model and Building Layers

We can now initialize the model! We'll start by building the network and adding layers. However, the approach will be slightly different than what was done in Milestone 3.

Instead of setting up Dense and Flatten layers, we'll focus only on the Embedding, LSTM, and Lambda layers.

For the first SNN layer, the `Embedding` layer, we'll pass the following arguments:  

-The `input_dim` will match the number of words in the dataset plus one (the `corpus_size` computed earlier)  
-The `output_dim` will match the number of dimensions in the GloVe embeddings data (`glove_dim`, which is 300 for the file loaded above)  
-`weights` will be taken from the `corpus_array` created earlier  
-The `input_length` will correspond to the padded vector length (17)  
-`trainable` will be set to `False` so that weights are not updated during training  
-`name` will be `embedding_layer`
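With frozen, pre-loaded weights, the embedding step amounts to looking up rows of the embedding matrix by word index. The NumPy sketch below illustrates this with a tiny made-up matrix. The matrix values, the sequence, and the choice to place the padding at the end are all assumptions for this example (the notebook's padding setting may differ).

```python
import numpy as np

# Hypothetical embedding matrix: 5 rows (row 0 stands for padding), 3 dimensions.
toy_matrix = np.array([
    [0.0, 0.0, 0.0],   # row 0: padding
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
])

# A hypothetical question encoded as word indices and zero-padded to length 6.
padded_question = np.array([3, 1, 4, 0, 0, 0])

# Selecting rows by index turns the integer sequence into a sequence of vectors.
embedded = toy_matrix[padded_question]

print(embedded.shape)
```

Each question therefore becomes a matrix with one row per position and one column per embedding dimension, which is the sequence that the next layer consumes.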

One known problem with neural networks that process sequential data (like text) is that they struggle with retaining information from earlier steps to later ones. To mitigate this, we'll add a long short-term memory (LSTM) layer that's shared between the two input branches.  

In [26]:
n_hidden=50

input_1 = Input(shape=(max_len_padding,),dtype='int32')
input_2 = Input(shape=(max_len_padding,), dtype='int32')

embedding_layer = Embedding(input_dim=corpus_size,
                            output_dim=glove_dim,
                            weights=[corpus_array],
                            input_length = max_len_padding,
                            trainable=False,
                            name='embedding_layer')

encoded_input_1 = embedding_layer(input_1)
encoded_input_2 = embedding_layer(input_2)

lstm_shared_layer = LSTM(n_hidden, name='lstm_shared_layer')

output_1 = lstm_shared_layer(encoded_input_1)
output_2 = lstm_shared_layer(encoded_input_2)


Finally, the Lambda layer will apply `manhattan_dist` to the two LSTM outputs.

In [27]:
dist = Lambda(lambda x: manhattan_dist(x[0], x[1]), output_shape=lambda x: (x[0][0], 1))([output_1, output_2])

Next, we'll create a Keras Model instance and specify the input (the input vectors) and output (their Manhattan distance).

In [28]:
SNN_model = Model(inputs=[input_1, input_2], outputs=[dist])

Finally, we'll compile the model.

In [29]:
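# Note: this Adadelta optimizer is defined but never passed to compile(); the model is trained with Adam.
optimizer = Adadelta(clipnorm=1.25)

SNN_model.compile(loss='mean_squared_error', optimizer='Adam', metrics=['accuracy'])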

### Training the Model

The Siamese model is all set up, so we just need to train it. We'll train for 50 epochs. Given the memory limitations of this machine, we'll use a batch size of 10,000.

In [30]:
fit_model = True

if fit_model:
    SNN_model.fit([X_train_q1, X_train_q2], y_train,
                  epochs=50, batch_size=10000,
                  validation_data=([X_test_q1, X_test_q2], y_test))

Epoch 1/50
...
Epoch 50/50


It's always a good idea to save the model and weights for quick and easy access later.

In [31]:
filepath_model = 'results/SNN_model_03.json'
filepath_weights = 'results/SNN_model_03_weights.h5'

SNN_model_to_json = SNN_model.to_json()

with open(filepath_model, 'w') as file:
    file.write(SNN_model_to_json)
    
SNN_model.save_weights(filepath_weights)

## Closing Comments

### Model Results

So, how did the model perform? Much better than any of the previous iterations. With an accuracy of about 87% on the training set and 80% on the validation set, we can confidently say that the model is a success.

### Conclusion (the Point of it All)

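The final model is much more effective than the initial model. The switch from Euclidean to Manhattan distance, together with the other changes made in this iteration (the shared LSTM layer and the mean squared error loss), appears to have significantly improved accuracy.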

The conclusion we can draw is that it is indeed possible to determine whether two queries are semantically similar. With enough training data and the use of a Siamese Neural Network, we can input two question strings and output their similarity (being classed as duplicates or non-duplicates). In short, this model shows promise for business use in detecting questions that have been answered before.

At the same time, I would be hesitant to keep training the model over more epochs. As noted in the results above, the accuracies on the training and validation sets have begun to diverge. This means that there's a risk that the model is overfitting to the training data.

### Recommendations for Future Work

Although it's clear that the model is off to a good start, there's still much to be done! Given more time and Keras knowledge, I would have liked to have done the following:  

**1. Improve the model:** As mentioned above, improving the model takes more than simply running more epochs. I would be interested in further deepening my understanding of neural networks to improve model performance without running into overfitting issues.  

**2. Make the model more specific:** One of the problems of NLP tasks is that the same word can have different meanings depending on the context. For instance, a "sequence" in mathematics is quite different from a "sequence" in genomics. In the former case, similar words might include "geometric," "series," and so on. In the latter case, similar words would include "genome," "DNA," "guanine," and so on. The issue with GloVe, or any other vector representation algorithm for that matter, is that one is forced to compromise and develop vector representations that satisfy all contexts equally poorly. Therefore, a truly effective ML model should use both context-specific training data (rather than general website queries) and an embedding algorithm that vectorizes words based on specific contexts (whether that's genomics, marine biology, or something else).

**3. Employ the Model:** Due to time constraints, I was not able to develop any sort of interactive app that utilizes the model, that is, a way for a user to input two questions and find out if the model classifies them as duplicates or not. Should I be able to achieve the first two recommendations above, developing an interactive platform would be the next step.  

That said, it's been an exciting semester, and I'm glad to have had the chance to challenge myself in this course while developing a working ML model along the way.