# Finding Similar Questions with Siamese Networks 


* [Introduction](#section-one)
* [Data Preparation](#section-two)
* [Model Selection and Creation](#section-three)
    - [Model](#subsection-one)
    - [Loss Function](#anything-you-like)
* [Training](#section-four)
* [Evaluation](#section-three)

<a id="section-one"></a>
## Introduction


The objective of this project is to determine whether two questions are duplicates of each other. On websites like Quora or StackOverflow, linking similar questions together can improve the user experience. It lets users not only see more answers, but also find an answer faster if someone has already asked their question.

In this work, I will use a Siamese network built from LSTM models. Throughout this notebook, I will explain each step of my process. I am basing this project on my NLP Certificate coursework on Coursera, so some of the code, including the data generator, is adapted from that work.

In [24]:
!pip install trax

import os
import nltk
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
import numpy as np
import pandas as pd
import random as rnd
from sklearn.model_selection import train_test_split





<a id="section-two"></a>
## Data Preparation


The data from Quora is stored in CSV format, with the two questions to compare in the same row. The training data is also labeled with the column `is_duplicate`, indicating whether or not the questions have the same meaning (0 - No, 1 - Yes). In this step, I will read in the data, clean up the text, split it into a train and test set, and create data generators that our model can read.

In [25]:
train = pd.read_csv('/kaggle/input/quora-question-pairs/train.csv.zip')
test = pd.read_csv('/kaggle/input/quora-question-pairs/test.csv')

In [26]:
train.head()

| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |


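I split the dataset by position rather than randomly: the first 300,000 rows form the training set and the next 10,240 rows form the test set. From the training rows, I then keep only the duplicate pairs, because the model is trained on pairs of questions that have the same meaning.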

In [27]:
N_train = 300000
N_test  = 10*1024
data_train = train[:N_train]
data_test  = train[N_train:N_train+N_test]
print("Train set:", len(data_train), "Test set:", len(data_test))
del(train) # remove to free memory

Train set: 300000 Test set: 10240


In [28]:
td_index = (data_train['is_duplicate'] == 1).to_numpy()
td_index = [i for i, x in enumerate(td_index) if x] 

Q1_train_words = np.array(data_train['question1'][td_index])
Q2_train_words = np.array(data_train['question2'][td_index])

Q1_test_words = np.array(data_test['question1'])
Q2_test_words = np.array(data_test['question2'])
y_test  = np.array(data_test['is_duplicate'])

In [29]:
#create arrays
Q1_train = np.empty_like(Q1_train_words)
Q2_train = np.empty_like(Q2_train_words)

Q1_test = np.empty_like(Q1_test_words)
Q2_test = np.empty_like(Q2_test_words)

In [30]:
# Building the vocabulary with the train set         
from collections import defaultdict

vocab = defaultdict(lambda: 0)
vocab['<PAD>'] = 1

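for idx in range(len(Q1_train_words)):
    Q1_train[idx] = nltk.word_tokenize(Q1_train_words[idx])
    Q2_train[idx] = nltk.word_tokenize(Q2_train_words[idx])
    q = Q1_train[idx] + Q2_train[idx]
    for word in q:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    # Replace the tokens with their integer IDs, which the embedding layer expects
    Q1_train[idx] = [vocab[word] for word in Q1_train[idx]]
    Q2_train[idx] = [vocab[word] for word in Q2_train[idx]]
print('The length of the vocabulary is: ', len(vocab))

The length of the vocabulary is:  36352


In [31]:
for idx in range(len(Q1_test_words)): 
    # Tokenize, then map tokens to IDs; words missing from the training vocabulary map to 0
    Q1_test[idx] = [vocab.get(word, 0) for word in nltk.word_tokenize(Q1_test_words[idx])]
    Q2_test[idx] = [vocab.get(word, 0) for word in nltk.word_tokenize(Q2_test_words[idx])]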

In [32]:
print('Train set has reduced to: ', len(Q1_train) ) 
print('Test set length: ', len(Q1_test) )

Train set has reduced to:  111473
Test set length:  10240


In [33]:
# Splitting the data
cut_off = int(len(Q1_train)*.8)
train_Q1, train_Q2 = Q1_train[:cut_off], Q2_train[:cut_off]
val_Q1, val_Q2 = Q1_train[cut_off: ], Q2_train[cut_off:]
print('Number of duplicate questions: ', len(Q1_train))
print("The length of the training set is:  ", len(train_Q1))
print("The length of the validation set is: ", len(val_Q1))

Number of duplicate questions:  111473
The length of the training set is:   89178
The length of the validation set is:  22295


In [34]:
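def data_generator(Q1, Q2, batch_size, pad=1, shuffle=True):
    """Generator function that yields batches of padded question pairs.

    Args:
        Q1, Q2: Arrays of questions, each stored as a list of token IDs.
        batch_size (int): Number of question pairs per batch.
        pad (int, optional): ID used for padding. Defaults to 1.
        shuffle (bool, optional): Whether to visit the pairs in random order. Defaults to True.

    Yields:
        tuple: Two arrays of shape (batch_size, max_len), where max_len is the
        longest question in the batch rounded up to a power of two.
    """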

    input1 = []
    input2 = []
    idx = 0
    len_q = len(Q1)
    question_indexes = [*range(len_q)]
    
    if shuffle:
        rnd.shuffle(question_indexes)
    
    while True:
        if idx >= len_q:
            idx = 0
            # shuffle to get random batches if shuffle is set to True
            if shuffle:
                rnd.shuffle(question_indexes)
        
        q1 = Q1[question_indexes[idx]]
        q2 = Q2[question_indexes[idx]]

        idx += 1
        input1.append(q1)
        input2.append(q2)
        if len(input1) == batch_size:
            max_len = max(max([len(_) for _ in input1]),max([len(_) for _ in input2]))
            max_len = 2**int(np.ceil(np.log2(max_len)))
            b1 = []
            b2 = []
            for q1, q2 in zip(input1, input2):
                q1 = q1 + [pad] * (max_len - len(q1))
                q2 = q2 + [pad] * (max_len - len(q2))
               
                b1.append(q1)
                b2.append(q2)

            yield np.array(b1), np.array(b2)

            # reset the batches
            input1, input2 = [], []
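
To make the padding step in the generator concrete, here is a minimal standalone sketch. The helper name `pad_batch` and the toy ID lists are my own assumptions for illustration. Unlike the generator, which takes the maximum length across both question batches, this sketch pads a single batch.

```python
import numpy as np

def pad_batch(batch, pad=1):
    """Right-pad every sequence in `batch` to a shared power-of-two length."""
    longest = max(len(seq) for seq in batch)
    # Round the longest length up to the next power of two, as the generator does
    max_len = 2 ** int(np.ceil(np.log2(longest)))
    return np.array([seq + [pad] * (max_len - len(seq)) for seq in batch])

# Toy questions already converted to integer IDs (made-up values)
toy_batch = [[5, 8, 13], [7, 2, 9, 4, 6], [3]]
padded = pad_batch(toy_batch)
print(padded.shape)  # (3, 8)
print(padded[2])     # [3 1 1 1 1 1 1 1]
```

The longest toy question has 5 tokens, so every row is padded to length 8. Rounding up to a power of two keeps the number of distinct batch shapes small.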

<a id="section-three"></a>
## Siamese Model Creation

A Siamese model is built from two or more identical subnetworks that share the same weights. It may seem counterintuitive to run the same network twice, since each training example then needs two forward passes instead of one. However, for similarity comparison this architecture makes a lot of sense. In short, we will create a pair of identical subnetworks to compare two questions. Each question runs through one of the subnetworks, which uses mathematical operations (or ML magic) to break the question down into a vector that captures its meaning. These two vectors are then compared using cosine similarity. This metric returns a score from -1 to 1, where -1 means that the questions are very different and 1 means that they have the same meaning.
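
As a small illustration of the cosine similarity score, the sketch below normalizes a few vectors to unit length and takes their dot products. The 3-dimensional vectors are made up for the example and are not real question embeddings.

```python
import numpy as np

def normalize(x):
    """Scale each row vector to unit L2 norm."""
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

# Made-up "question vectors", for illustration only
v_a = normalize(np.array([[1.0, 2.0, 3.0]]))
v_same = normalize(np.array([[2.0, 4.0, 6.0]]))         # same direction as v_a
v_opposite = normalize(np.array([[-1.0, -2.0, -3.0]]))  # opposite direction
v_orth = normalize(np.array([[3.0, 0.0, -1.0]]))        # orthogonal to v_a

# For unit vectors, the dot product equals the cosine similarity
cos_same = float(np.sum(v_a * v_same))          # approximately 1
cos_opposite = float(np.sum(v_a * v_opposite))  # approximately -1
cos_orth = float(np.sum(v_a * v_orth))          # approximately 0
print(cos_same, cos_opposite, cos_orth)
```

Because the vectors are normalized first, their lengths do not matter. Only the direction of each vector affects the score.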

An important step is choosing the architecture of the subnetworks. In this work, I used Long Short-Term Memory (LSTM) models to do the heavy lifting. Recurrent neural networks (RNNs) became popular for working with text. However, they tend to struggle with longer sequences and are prone to vanishing gradients. LSTM models are better at capturing the important aspects of a text sequence because they learn what to remember and when to forget. If the most important part of a sequence is at the beginning, a traditional RNN puts less and less emphasis on it as the sequence grows in length. An LSTM remembers the important parts of the sequence and forgets the unimportant ones.

In [35]:
def normalize(x):
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

In [36]:
# Define the Siamese Model

def Siamese(vocab_size=len(vocab), model_dimension=128, mode='train'):
    """Returns a Siamese model.

    Args:
        vocab_size (int, optional): Length of the vocabulary. Defaults to len(vocab).
        model_dimension (int, optional): Size of the embedding and LSTM hidden state. Defaults to 128.
        mode (str, optional): 'train', 'eval' or 'predict', predict mode is for fast inference. Defaults to 'train'.

    Returns:
        trax.layers.combinators.Parallel: A Siamese model. 
    """

    def normalize(x):  # normalizes the vectors to have L2 norm 1
        return x / fastnp.sqrt(fastnp.sum(x * x, axis=-1, keepdims=True))
    
    #create the LSTM Model that makes up the backbone of our Siamese Model
    LSTM = tl.Serial(
        tl.Embedding(vocab_size=vocab_size, d_feature=model_dimension),
        tl.LSTM(model_dimension),
        tl.Mean(axis=1),
        tl.Fn('Normalize', lambda x: normalize(x))
    )
    
    # Run on Q1 and Q2 in parallel.
    model = tl.Parallel(LSTM, LSTM)
    return model

In [44]:
# check your model
model = Siamese()
print(model)

Parallel_in2_out2[
  Serial[
    Embedding_36352_128
    LSTM_128
    Mean
    Normalize
  ]
  Serial[
    Embedding_36352_128
    LSTM_128
    Mean
    Normalize
  ]
]


LSTM Architecture Step by Step:
1. Forget Gate: sigmoid function (0 = throw the information out, 1 = keep it)
2. Input Gate: updates the current cell state
    - Sigmoid layer: values closer to 1 mark information as more important to keep
    - Tanh layer: squashes values to the range -1 to 1 to help regulate the flow of information
    - The sigmoid and tanh outputs are multiplied to give the values added to the cell state
3. Output Gate: decides what the next hidden state should be
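
The gates above can be sketched as a single LSTM time step in plain NumPy. This follows the common textbook formulation, and I am assuming it only as an illustration: the function names, the weight layout, and the random toy inputs are mine, and `tl.LSTM` in Trax may organize its parameters differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x] to four stacked gate pre-activations."""
    d = h_prev.shape[-1]
    z = np.concatenate([h_prev, x], axis=-1) @ W + b
    f = sigmoid(z[..., 0 * d:1 * d])  # forget gate: what to drop from the cell state
    i = sigmoid(z[..., 1 * d:2 * d])  # input gate: how much new information to let in
    g = np.tanh(z[..., 2 * d:3 * d])  # candidate values in the range -1 to 1
    o = sigmoid(z[..., 3 * d:4 * d])  # output gate: what to expose as hidden state
    c = f * c_prev + i * g            # updated cell state
    h = o * np.tanh(c)                # next hidden state
    return h, c

rng = np.random.default_rng(0)
d_in, d_hidden = 4, 3
W = rng.normal(scale=0.1, size=(d_hidden + d_in, 4 * d_hidden))
b = np.zeros(4 * d_hidden)

# Run a toy sequence of 5 random input vectors through the cell
h, c = np.zeros(d_hidden), np.zeros(d_hidden)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

The cell state `c` is updated additively, which is what lets information from early in the sequence survive over many steps.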

### Loss Function

In [37]:
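def TripletLossFn(v1, v2, margin=0.25):
    """Custom triplet loss with hard-negative mining over a batch.

    Args:
        v1, v2: L2-normalized question vectors of shape (batch_size, model_dimension);
            row i of v1 and row i of v2 encode a duplicate pair.
        margin (float, optional): Desired gap between positive and negative scores. Defaults to 0.25.

    Returns:
        The mean triplet loss over the batch (a scalar).
    """
    scores = fastnp.dot(v1, v2.T)  # pairwise cosine similarities
    batch_size = len(scores)

    # Diagonal entries score each question against its true duplicate
    positive = fastnp.diagonal(scores)

    # Subtract 2 on the diagonal so a positive is never picked as the closest negative
    negative_without_positive = scores - fastnp.eye(batch_size) * 2.0
    closest_negative = negative_without_positive.max(axis=1)

    # Zero out the diagonal, then average the remaining (negative) scores per row
    negative_zero_on_duplicate = (1.0 - fastnp.eye(batch_size)) * scores
    mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1) / (batch_size - 1)

    triplet_loss1 = fastnp.maximum(margin - positive + closest_negative, 0)
    triplet_loss2 = fastnp.maximum(margin - positive + mean_negative, 0)
    triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)

    return triplet_loss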

In [38]:
# Make Triplet Loss Layer

from functools import partial
def TripletLoss(margin=0.25):
    triplet_loss_fn = partial(TripletLossFn, margin=margin)
    return tl.Fn('TripletLoss', triplet_loss_fn)
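
To see what the loss computes, here is a plain NumPy re-implementation of the same arithmetic on two toy batches. The name `triplet_loss_np` and the toy vectors are assumptions made for this sketch, not part of the training code.

```python
import numpy as np

def triplet_loss_np(v1, v2, margin=0.25):
    """NumPy version of the batch triplet loss, for illustration only."""
    scores = v1 @ v2.T                 # pairwise cosine similarities
    batch_size = len(scores)
    positive = np.diagonal(scores)     # each question vs. its duplicate
    # Push the diagonal down by 2 so it cannot win the max over negatives
    closest_negative = (scores - 2.0 * np.eye(batch_size)).max(axis=1)
    # Average of the off-diagonal scores in each row
    mean_negative = np.sum((1.0 - np.eye(batch_size)) * scores, axis=1) / (batch_size - 1)
    loss_closest = np.maximum(margin - positive + closest_negative, 0)
    loss_mean = np.maximum(margin - positive + mean_negative, 0)
    return np.mean(loss_closest + loss_mean)

# Case 1: duplicates match perfectly and all other pairs are orthogonal
separated = np.eye(3)
loss_separated = triplet_loss_np(separated, separated)
print(loss_separated)  # 0.0

# Case 2: every question maps to the same vector, so negatives look like positives
collapsed = np.tile(np.array([[1.0, 0.0, 0.0]]), (3, 1))
loss_collapsed = triplet_loss_np(collapsed, collapsed)
print(loss_collapsed)  # 0.5
```

In the first case every positive beats every negative by more than the margin, so the loss is zero. In the second case both terms are stuck at the margin of 0.25, giving 0.5.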

<a id="section-four"></a>
## Training

In [39]:
batch_size = 256
train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab['<PAD>'])
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab['<PAD>'])
print('train_Q1.shape ', train_Q1.shape)
print('val_Q1.shape   ', val_Q1.shape)

train_Q1.shape  (89178,)
val_Q1.shape    (22295,)
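
Training uses the schedule `trax.lr.warmup_and_rsqrt_decay(400, 0.01)`. The sketch below shows the general shape I would expect from such a schedule: a linear ramp up to the maximum value, followed by decay proportional to the reciprocal square root of the step. The function `warmup_then_rsqrt_decay` is a hypothetical stand-in written for this illustration, and the exact formula inside Trax may differ.

```python
import math

def warmup_then_rsqrt_decay(step, n_warmup_steps=400, max_value=0.01):
    """Hypothetical schedule: linear warm-up, then decay proportional to 1/sqrt(step)."""
    if step < n_warmup_steps:
        # Ramp linearly from 0 up to max_value
        return max_value * step / n_warmup_steps
    # After warm-up, shrink with the reciprocal square root of the step
    return max_value * math.sqrt(n_warmup_steps / step)

for step in (100, 400, 1600):
    print(step, warmup_then_rsqrt_decay(step))
```

With these assumptions, the rate peaks at the end of warm-up and halves each time the step count quadruples.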


In [53]:
lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)

def train_model(Siamese, TripletLoss, lr_schedule, train_generator=train_generator, val_generator=val_generator, output_dir='model/'):
    
    output_dir = os.path.expanduser(output_dir)

    train_task = training.TrainTask(
        labeled_data=train_generator,  # batches of training question pairs
        loss_layer=TripletLoss(),  # instantiated triplet loss layer
        optimizer=trax.optimizers.Adam(learning_rate=0.01),
        lr_schedule=lr_schedule,  # warm-up followed by reciprocal square-root decay
    )

    eval_task2 = training.EvalTask(
        labeled_data=val_generator,  # batches of validation question pairs
        metrics=[TripletLoss()],  # report the triplet loss on the validation set
    )


    # Recent Trax releases name this keyword `eval_tasks` (plural)
    training_loop = training.Loop(Siamese(),
                                  train_task,
                                  eval_tasks=[eval_task2],
                                  output_dir=output_dir)

    return training_loop

In [54]:
train_steps = 5
training_loop = train_model(Siamese, TripletLoss, lr_schedule)
training_loop.run(train_steps)