# ***Duplicate Questions Recognition using LSTM and Trax***
By Nakshatra Singh

This notebook shows how to build an LSTM model with Trax that identifies duplicate or similar questions, which is useful when several versions of the same question have to be handled.

###**Using Google GPU for Training**

Google Colab offers free GPUs and TPUs. Since we'll be training a large model, it's best to take advantage of this (here we'll use a GPU); otherwise training can take a long time.

A GPU can be added by going to the menu and selecting:

`Edit -> Notebook Settings -> Hardware Accelerator -> (GPU)`

Then run the following cell to confirm that a GPU is detected. 

In [1]:
import tensorflow as tf
# Get the device GPU name 
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
  print('Found GPU at : {}'.format(device_name)) 
else:
  raise SystemError('GPU not found!')  

Found GPU at : /device:GPU:0


###**Setting up Imports and Downloading the Dependencies**

All the libraries and dependencies required for this project are installed and imported in a single cell.

In [2]:
#@ Installing and importing the Libraries and Dependencies.

!pip install -q -U trax                   # Installing Trax.
import nltk                       
nltk.download("punkt")
import pandas as pd
import numpy as np
import os
import trax
from trax import layers as tl
from trax.supervised import training
from trax.fastmath import numpy as fastnp
import random
from collections import defaultdict
from functools import partial 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


###**Retrieve Dataset**

Let's download the dataset, which is hosted on my Google Drive.

In [3]:
!gdown --id 1USFKeVRsuai_62I_tYa_RbSPOhBtJx8G 

Downloading...
From: https://drive.google.com/uc?id=1USFKeVRsuai_62I_tYa_RbSPOhBtJx8G
To: /content/Questions.csv
60.7MB [00:00, 107MB/s] 


In [4]:
data = pd.read_csv('/content/Questions.csv')
print(f"Number of Question Pairs: {len(data)}")
print() 
data.head() 

Number of Question Pairs: 404351



Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


I will split the data into a training set and a test set; the test set is used later to evaluate the model. Only the question pairs marked as duplicates are selected for training. From these I will build two batches, one for each question of a pair, as inputs to the Siamese network.

In [5]:
#@ Processing the Data:
N_train = 300000                                               
N_test = 10240                                                 
data_train = data[:N_train]                                                    # Training pairs.
data_test = data[N_train:N_train+N_test]                                       # Test pairs.
del(data)                                                                      # Removing.

#@ Inspecting the Data:
print(f"Training Set: {len(data_train)} and Test Set: {len(data_test)}")

#@ Selecting the Question Pairs for Training:
train_idx = (data_train["is_duplicate"] == 1).to_numpy()
train_idx = [i for i, x in enumerate(train_idx) if x]
print(f"Number of Duplicate Questions: {len(train_idx)}")

Training Set: 300000 and Test Set: 10240
Number of Duplicate Questions: 111486


###**Preparing the Data**

In [6]:
#@ Preparing the Data: Training the Model: 
Q1_train_words = np.array(data_train['question1'][train_idx])
Q2_train_words = np.array(data_train['question2'][train_idx])

#@ Preparing the Data: Evaluating the Model:
Q1_test_words = np.array(data_test['question1'])
Q2_test_words = np.array(data_test['question2'])
y_test = np.array(data_test["is_duplicate"]) 

#@ Inspecting the Data:
print("TRAINING QUESTIONS:\n")
print("Question 1:", Q1_train_words[1])
print("Question 2:", Q2_train_words[1], "\n")

print("TESTING QUESTIONS:\n")
print("Question 1:", Q1_test_words[1])
print("Question 2:", Q2_test_words[1], "\n") 

TRAINING QUESTIONS:

Question 1: How can I be a good geologist?
Question 2: What should I do to be a great geologist? 

TESTING QUESTIONS:

Question 1: What is the best bicycle to buy under 10k?
Question 2: Which is the best bike in in dia to buy in INR 10k? 



I will encode each question of the selected pairs as a list of integer indices, one per token. First, each question is tokenized with NLTK. The vocabulary is a Python `defaultdict`, which returns 0 for every out-of-vocabulary word.

In [7]:
'''Sit back and enjoy your Coffee'''
#@ Preparing the Data:
Q1_train = np.empty_like(Q1_train_words)                             # Creating new Training array.
Q2_train = np.empty_like(Q2_train_words)                             # Creating new Training array.
Q1_test = np.empty_like(Q1_test_words)                               # Creating new Test array.
Q2_test = np.empty_like(Q2_test_words)                               # Creating new Test array.

#@ Building Vocabulary with Training Dataset:
vocab = defaultdict(lambda: 0)                                       # It will create a dict with default 0 when a key doesn't exist.
vocab["<PAD>"] = 1
for idx in range(len(Q1_train_words)):
  Q1_train[idx] = nltk.word_tokenize(Q1_train_words[idx])            # Tokenizing the Training Set.
  Q2_train[idx] = nltk.word_tokenize(Q2_train_words[idx])            # Tokenizing the Training Set.
  q = Q1_train[idx] + Q2_train[idx]                                  # Concatenating the Q1 and Q2 tokens of the pair.
  for word in q:
    if word not in vocab:
      vocab[word] = len(vocab) + 1
print("The length of the Vocabulary is:", len(vocab))

#@ Testing Dataset:
for idx in range(len(Q1_test_words)):
  Q1_test[idx] = nltk.word_tokenize(Q1_test_words[idx])              # Tokenizing the Test Set.
  Q2_test[idx] = nltk.word_tokenize(Q2_test_words[idx])              # Tokenizing the Test Set.

#@ Inspecting the Final Prepared Dataset:
print("Training Set is reduced to:", len(Q1_train))
print("Test Set is:", len(Q1_test))

The length of the Vocabulary is: 36342
Training Set is reduced to: 111486
Test Set is: 10240
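As a small illustration of the encoding step, the sketch below builds a vocabulary the same way on a toy corpus (the two token lists are made up for this example) and then encodes a question containing unseen words. One detail worth noting: looking up a missing key in a `defaultdict` also inserts it, so the dictionary grows when out-of-vocabulary words are encoded. This is one plausible reason why `len(vocab)` can be larger later in the notebook than the length printed above.

```python
from collections import defaultdict

# Toy tokenized training questions (hypothetical data, not from the dataset).
train_tokens = [["How", "are", "you", "?"], ["Who", "are", "you", "?"]]

vocab = defaultdict(lambda: 0)        # unknown words map to index 0
vocab["<PAD>"] = 1                    # padding token, as in the cell above
for tokens in train_tokens:
    for word in tokens:
        if word not in vocab:         # a membership test does not insert the key
            vocab[word] = len(vocab) + 1

size_before = len(vocab)
# "is" and "he" are out of vocabulary: they encode to 0 and are inserted as keys.
encoded = [vocab[word] for word in ["How", "is", "he", "?"]]
size_after = len(vocab)

print(encoded)                  # [2, 0, 0, 5]
print(size_before, size_after)  # 6 8
```

If the growth is unwanted, `vocab.get(word, 0)` returns the default without inserting the key.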


In [8]:
#@ Preparing the Data:

#@ Converting Questions pairs to array of Integers:
for i in range(len(Q1_train)):
  Q1_train[i] = [vocab[word] for word in Q1_train[i]]
  Q2_train[i] = [vocab[word] for word in Q2_train[i]]

#@ Converting Questions pairs to array of Integers:
for i in range(len(Q1_test)):
  Q1_test[i] = [vocab[word] for word in Q1_test[i]]
  Q2_test[i] = [vocab[word] for word in Q2_test[i]]

#@ Inspecting the Encoded Data:
print("Question in the Training Set:")                           # Inspecting the Training Set.
print(Q1_train_words[1], "\n")
print("Encoded Version:")
print(Q1_train[1], "\n")
print("Question in the Test Set:")                               # Inspecting the Test Set.
print(Q1_test_words[1], "\n")
print("Encoded Version:")
print(Q1_test[1], "\n")

#@ Splitting the Training Set into Training and Validation Dataset:
split = int(len(Q1_train) * 0.8)
train_Q1, train_Q2 = Q1_train[:split], Q2_train[:split]                        # Split for Training set.
val_Q1, val_Q2 = Q1_train[split:], Q2_train[split:]                            # Split for Validation set.
print(f"Total numbers of questions pairs: {len(Q1_train)}")  
print()            
print(f"The length of Training set: {len(train_Q1)}")                          # Length of Final Training set.
print()                          
print(f"The length of Validation set: {len(val_Q1)}")                          # Length of Final Validation set. 

Question in the Training Set:
How can I be a good geologist? 

Encoded Version:
[32, 33, 4, 34, 6, 35, 36, 21] 

Question in the Test Set:
What is the best bicycle to buy under 10k? 

Encoded Version:
[30, 156, 78, 216, 8914, 39, 716, 286, 8324, 21] 

Total numbers of questions pairs: 111486

The length of Training set: 89188

The length of Validation set: 22298


###**Data Generator**

In NLP, and in machine learning and deep learning generally, training on batches is more efficient than training on single examples. I will now build a data generator that takes the question pairs and returns batches as tuples. Each tuple consists of two arrays, and each array holds `batch_size` questions, so that row `i` of the first array and row `i` of the second array form a duplicate pair. Within a batch, every question is padded with the `<PAD>` index up to the next power of two at or above the length of the longest question. Calling `next()` on the generator returns the next batch, in a format that can be passed directly to the model's forward pass.

In [9]:
#@ Data Generator:
def data_generator(Q1, Q2, batch_size, pad=1, shuffle=True):
  """ Generator Function that yields the Batches of Data. """
  #@ Initializing the Dependencies:
  input1, input2 = [], []
  idx = 0
  len_q = len(Q1)
  question_index = [*range(len_q)]
  if shuffle:
    random.shuffle(question_index)
  
  while True:
    if idx >= len_q:
      idx = 0
      if shuffle:
        random.shuffle(question_index)
    #@ Getting the Questions pairs in Index positions:
    q1 = Q1[question_index[idx]]
    q2 = Q2[question_index[idx]]
    idx += 1
    #@ Adding the Data:
    input1.append(q1)
    input2.append(q2)
    if len(input1) == batch_size:
      max_len = max(max([len(q) for q in input1]),
                    max([len(q) for q in input2]))
      max_len = 2**int(np.ceil(np.log2(max_len)))
      b1, b2 = [], []
      for q1, q2 in zip(input1, input2):
        q1 = q1 + [pad] * (max_len - len(q1))                         # Adding pad to q1 until it reaches max length.
        q2 = q2 + [pad] * (max_len - len(q2))                         # Adding pad to q2 until it reaches max length.
        b1.append(q1)
        b2.append(q2)
      yield np.array(b1), np.array(b2)
      input1, input2 = [], []                                         # Resetting the Batches.

#@ Inspecting the Example of Data Generator:
res1, res2 = next(data_generator(train_Q1, train_Q2, batch_size=2))
print(f"First Questions:\n{res1}")
print(f"\nSecond Questions:\n{res2}") 

First Questions:
[[   30    16    73  7516 10298    38    21     1     1     1     1     1
      1     1     1     1     1     1     1     1     1     1     1     1
      1     1     1     1     1     1     1     1]
 [   30    16     6 11900  1085    38    21    86     6 11900  1085  4838
     39   473    31   260    21     1     1     1     1     1     1     1
      1     1     1     1     1     1     1     1]]

Second Questions:
[[   32    38  7513 15792   302    21     1     1     1     1     1     1
      1     1     1     1     1     1     1     1     1     1     1     1
      1     1     1     1     1     1     1     1]
 [   30   156 11900   421    11    15  3089   131   302    38   276  1108
     38    21     1     1     1     1     1     1     1     1     1     1
      1     1     1     1     1     1     1     1]]
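The padding rule can be checked in isolation. The following is a minimal sketch, not the generator itself; `padded_length` and `pad_batch` are helper names introduced only for this example. It shows why the longest question above (17 tokens) leads to rows of length 32.

```python
import math

def padded_length(max_len):
    """Smallest power of two that is >= max_len (for max_len >= 1)."""
    return 2 ** int(math.ceil(math.log2(max_len)))

def pad_batch(batch, pad=1):
    """Right-pad every question in the batch to the same power-of-two length."""
    target = padded_length(max(len(q) for q in batch))
    return [q + [pad] * (target - len(q)) for q in batch]

# Two short encoded questions (indices chosen for illustration).
batch = [[30, 16, 73], [30, 16, 6, 11900, 1085]]
padded = pad_batch(batch)

print(padded_length(5), padded_length(17))  # 8 32
print(padded[0])                            # [30, 16, 73, 1, 1, 1, 1, 1]
```

Rounding up to a power of two keeps the number of distinct input shapes small, which can reduce recompilation when the model is compiled per shape.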


###**Siamese Neural Network**

A Siamese neural network applies the same weights to two different input vectors, working in tandem, to compute comparable output vectors. Here, each question goes through an embedding layer and an LSTM (Long Short-Term Memory) layer; the outputs are averaged over the sequence and L2-normalized. Because the two vectors are normalized, their dot product is the cosine similarity of the pair, and the model is trained on these similarities with a triplet loss.

In [10]:
#@ Siamese Neural Network using Trax:
def Siamese(vocab_size=len(vocab), d_model=128, mode="train"):
  """ Returns a Siamese Model. """
  #@ Normalizing the Vectors for L2 Normalization:
  def normalize(x):
    return x / fastnp.sqrt(fastnp.sum(x*x, axis=-1, keepdims=True))
  #@ Preparing the Model:
  processor = tl.Serial(                                                  # Shared branch applied to each question.
      tl.Embedding(vocab_size=vocab_size, d_feature=d_model),             # Adding Embedding Layer.
      tl.LSTM(n_units=d_model),                                           # Adding the LSTM Layer.
      tl.Mean(axis=1),                                                    # Averaging over the sequence axis.
      # tl.Dense(n_units=vocab_size),                                     # Adding a Dense Layer.
      tl.Fn("Normalize", lambda x: normalize(x))                          # Adding the Normalizing Function.
  )
  #@ Running the Model in parallel:
  model = tl.Parallel(processor, processor)
  return model

#@ Setting up Siamese Neural Network Model:
model = Siamese()
print(model)                                                              # Inspecting the Model.

Parallel_in2_out2[
  Serial[
    Embedding_41789_128
    LSTM_128
    Mean
    Normalize
  ]
  Serial[
    Embedding_41789_128
    LSTM_128
    Mean
    Normalize
  ]
]
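To see why the final normalization matters, here is a small NumPy sketch of the same `normalize` function (plain NumPy stands in for `fastnp`, and the vectors are made up). After L2 normalization every row has unit length, so the row-wise dot product of the two outputs is the cosine similarity.

```python
import numpy as np

def normalize(x):
    """L2-normalize each row of x."""
    return x / np.sqrt(np.sum(x * x, axis=-1, keepdims=True))

# Two toy batches of 2-dimensional "question vectors" (illustrative values).
v1 = normalize(np.array([[3.0, 4.0], [1.0, 0.0]]))
v2 = normalize(np.array([[6.0, 8.0], [0.0, 2.0]]))

norms = np.linalg.norm(v1, axis=-1)   # every row now has length 1
cosine = np.sum(v1 * v2, axis=-1)     # row-wise dot product = cosine similarity

print(norms)    # [1. 1.]
print(cosine)   # [1. 0.]
```

The first pair points in the same direction (similarity 1), the second pair is orthogonal (similarity 0).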


###**Triplet Loss**

The triplet loss compares a baseline (anchor) input with a positive (matching) input and a negative (non-matching) input. The distance from the anchor to the positive is minimized, and the distance from the anchor to the negative is maximized. The loss used here is the sum of two terms: one uses the mean of all the non-duplicates in the batch, and the other uses the closest negative, i.e. the non-duplicate most similar to the anchor.

In [11]:
#@ Triplet Loss Function:
def TripletLossFn(v1, v2, margin=0.25):
  """ Custom Loss Function. """
  scores = fastnp.dot(v1, v2.T)                                                       # Calculating the dot product of two batches.
  batch_size = len(scores)                                                            # Calculating the new batch size.
  positive = fastnp.diagonal(scores)                                                  # Getting positive diagonal entries in scores.
  negative_without_positive = scores - 2.0 * fastnp.eye(batch_size)                   # Pushing the diagonal below every off-diagonal score.
  closest_negative = negative_without_positive.max(axis=1)                            # Taking row by row max.
  negative_zero_on_duplicate = scores * (1.0 - fastnp.eye(batch_size))                # Zeroing the diagonal (positive) entries.
  mean_negative = fastnp.sum(negative_zero_on_duplicate, axis=1)/(batch_size - 1)     # Averaging the off-diagonal scores per row.
  triplet_loss1 = fastnp.maximum(0, margin - positive + closest_negative)
  triplet_loss2 = fastnp.maximum(0, margin - positive + mean_negative)
  triplet_loss = fastnp.mean(triplet_loss1 + triplet_loss2)
  return triplet_loss

#@ Triplet Loss:
def TripletLoss(margin=0.25):
  triplet_loss_fn = partial(TripletLossFn, margin=margin)
  return tl.Fn("TripletLoss", triplet_loss_fn) 
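To get a feel for the values this loss can take, the sketch below re-implements the same formula in plain NumPy (an illustrative port, with `triplet_loss_np` being a name chosen for this example) and evaluates it on two extreme batches: one where the pairs are perfectly separated and one where all embeddings are identical.

```python
import numpy as np

def triplet_loss_np(v1, v2, margin=0.25):
    """NumPy port of the triplet loss above, for inspection only."""
    scores = v1 @ v2.T                                   # pairwise cosine similarities
    b = scores.shape[0]
    positive = np.diagonal(scores)                       # similarity of the true pairs
    closest_negative = (scores - 2.0 * np.eye(b)).max(axis=1)
    mean_negative = (scores * (1.0 - np.eye(b))).sum(axis=1) / (b - 1)
    loss1 = np.maximum(0.0, margin - positive + closest_negative)
    loss2 = np.maximum(0.0, margin - positive + mean_negative)
    return float(np.mean(loss1 + loss2))

# Perfectly separated batch: each pair matches, all other pairs are orthogonal.
separated = np.eye(3)
loss_separated = triplet_loss_np(separated, separated)

# Collapsed batch: every question maps to the same unit vector.
collapsed = np.tile(np.array([[1.0, 0.0, 0.0]]), (3, 1))
loss_collapsed = triplet_loss_np(collapsed, collapsed)

print(loss_separated)   # 0.0
print(loss_collapsed)   # 0.5
```

With `margin=0.25`, a loss of exactly `2 * margin = 0.5` is what the formula gives when the model cannot tell positives from negatives at all, which is a useful reference point when reading training logs.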

###**Model Training**

Now I will train the model. I will define the cost function and the optimizer as usual. The training generator cycles through the data, reshuffling at the start of each epoch, while the training loop runs for a fixed number of steps.
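Since the Trax loop counts steps (batches) rather than epochs, a quick calculation relates the two. The numbers below are taken from this notebook: 89188 training pairs, the batch size of 256 and the 3000 steps used in the training cell.

```python
n_train_pairs = 89188   # length of the final training set printed earlier
batch_size = 256        # batch size used by the training generator
n_steps = 3000          # argument passed to training_loop.run

steps_per_epoch = n_train_pairs / batch_size
epochs = n_steps / steps_per_epoch

print(f"Steps per epoch: {steps_per_epoch:.1f}")   # Steps per epoch: 348.4
print(f"Epochs covered:  {epochs:.1f}")            # Epochs covered:  8.6
```

So 3000 steps correspond to roughly 8.6 passes over the training pairs.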

In [12]:
#@ Preparing the Data:
batch_size = 256
train_generator = data_generator(train_Q1, train_Q2, batch_size, vocab["<PAD>"])
val_generator = data_generator(val_Q1, val_Q2, batch_size, vocab["<PAD>"])

#@ Training the Model:
lr_schedule = trax.lr.warmup_and_rsqrt_decay(400, 0.01)

def train_model(Siamese, TripletLoss, lr_schedule, train_generator=train_generator,
                val_generator=val_generator, output_dir="model/"):
  """ Training the Siamese Model. """
  output_dir = os.path.expanduser(output_dir)
  
  #@ Training:
  train_task = training.TrainTask(
      labeled_data = train_generator,                                                   # Using Train Generator.
      loss_layer = TripletLoss(),                                                       # Using Triplet Loss Function.
      optimizer = trax.optimizers.Adam(0.001),                                          # Using Adam Optimizer.
      lr_schedule = lr_schedule                                                         # Using the warmup and rsqrt decay schedule.
  )
  #@ Evaluating:
  eval_task = training.EvalTask(
      labeled_data = val_generator,                                                     # Using Validation Generator.
      metrics = [TripletLoss()],                                                        # Instantiating the Objects for Evaluation.
      n_eval_batches = 3
  )
  #@ Training the Model:
  training_loop = training.Loop(                                                        # Training the Model.
      Siamese(),                                                                        # Siamese Neural Network.
      train_task, eval_tasks = eval_task,
      output_dir = output_dir
  )
  return training_loop

#@ Training the Model:
training_loop = train_model(Siamese, TripletLoss, lr_schedule)
training_loop.run(3000)                                                                 # Training for 3000 steps.


Step      1: Total number of trainable weights: 5480576
Step      1: Ran 1 train steps in 5.37 secs
Step      1: train TripletLoss |  0.49999863
Step      1: eval  TripletLoss |  0.49999818

Step    100: Ran 99 train steps in 8.27 secs
Step    100: train TripletLoss |  0.49995661
Step    100: eval  TripletLoss |  0.49858472

Step    200: Ran 100 train steps in 5.26 secs
Step    200: train TripletLoss |  0.49985439
Step    200: eval  TripletLoss |  0.49999246

Step    300: Ran 100 train steps in 7.15 secs
Step    300: train TripletLoss |  0.49998352
Step    300: eval  TripletLoss |  0.49999899

Step    400: Ran 100 train steps in 5.45 secs
Step    400: train TripletLoss |  0.49998981
Step    400: eval  TripletLoss |  0.49996084

Step    500: Ran 100 train steps in 5.55 secs
Step    500: train TripletLoss |  0.49998331
Step    500: eval  TripletLoss |  0.49999070

Step    600: Ran 100 train steps in 5.50 secs
Step    600: train TripletLoss |  0.49837515
Step    600: eval  TripletLoss | 

###**Model Evaluation**

I will use the test set prepared earlier to measure the accuracy of the model. The training set contained only positive examples, whereas the test set (`Q1_test`, `Q2_test` and `y_test`) consists of question pairs of which some are duplicates and some are not. I will compute the cosine similarity of each pair, apply a threshold, and compare the result with `y_test`. The number of matches divided by the number of pairs gives the accuracy.

In [13]:
#@ Loading the Saved Model:
model = Siamese()
model.init_from_file("/content/model/model.pkl.gz")

#@ Model Evaluation: 
def classify(test_Q1, test_Q2, y, threshold, model, vocab, data_generator=data_generator, batch_size=64):
  """ Function to test the Accuracy of the Model. """
  accuracy = 0                                                                               # Initializing the Accuracy.
  for i in range(0, len(test_Q1), batch_size):
    q1, q2 = next(data_generator(test_Q1[i:i+batch_size], test_Q2[i:i+batch_size],
                                 batch_size, vocab["<PAD>"], shuffle=False))
    y_test = y[i:i+batch_size]                                                               # Using batch size of actual output target.
    v1, v2 = model((q1, q2))                                                                 # Using the Model.
    for j in range(len(y_test)):                                                             # Iterating only over the real pairs in this slice.
      d = np.dot(v1[j], v2[j].T)                                                             # Calculating the Cosine Similarity.
      res = d > threshold
      accuracy += (y_test[j] == res)
  accuracy = accuracy / len(test_Q1)
  return accuracy

#@ Computing the Accuracy of the Model:
accuracy = classify(Q1_test, Q2_test, y_test, 0.7, model, vocab, batch_size=512)             # Calculating the Accuracy.
print("Accuracy of the Model:", accuracy) 

Accuracy of the Model: 0.75107421875
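The thresholding step itself is simple enough to check by hand. Below is a minimal vectorized sketch on made-up similarities and labels; `accuracy_at_threshold` is a helper introduced for this example and is not part of the notebook's `classify` function.

```python
import numpy as np

def accuracy_at_threshold(similarities, labels, threshold):
    """Fraction of pairs where (similarity > threshold) agrees with the label."""
    predictions = similarities > threshold
    return float(np.mean(predictions == labels.astype(bool)))

# Four hypothetical pairs: cosine similarity and the is_duplicate label.
sims = np.array([0.92, 0.40, 0.75, 0.65])
labels = np.array([1, 0, 0, 1])

acc = accuracy_at_threshold(sims, labels, threshold=0.7)
print(acc)   # 0.5
```

With a threshold of 0.7 the first two pairs are classified correctly, the third is a false positive and the fourth a false negative, so the accuracy is 0.5. Evaluating the same function over a range of thresholds on a validation set would be one way to choose the threshold.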


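Now I will test the model on my own questions. The `predict` function below tokenizes both questions, encodes them with the training vocabulary, pads them with the data generator, and compares the cosine similarity of the two output vectors with a threshold.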

In [14]:
#@ Model Evaluation with own Questions:
def predict(question1, question2, threshold, model, vocab, data_generator=data_generator, verbose=False):
  """ Function for predicting if two Questions are Duplicates. """
  q1 = nltk.word_tokenize(question1)                                # Tokenization.
  q2 = nltk.word_tokenize(question2)                                # Tokenization.
  Q1, Q2 = [], []
  for word in q1:
    Q1 += [vocab[word]]                                             # Encoding.
  for word in q2:
    Q2 += [vocab[word]]                                             # Encoding.
  Q1, Q2 = next(data_generator([Q1], [Q2], 1, vocab["<PAD>"]))
  v1, v2 = model((Q1, Q2))                                          # Using Model.
  d = fastnp.dot(v1[0], v2[0].T)
  res = d > threshold
  if (verbose):
    print("Q1 = ", Q1, "\nQ2 = ", Q2)
    print("d = ", d)
    print("result = ", res)
  return res 

In [16]:
#@ Examples of Questions:
question1 = "How are you?"
question2 = "How are you doing?"
#@ Predicting the Duplicated Questions:
example1 = predict(question1, question2, 0.6, model, vocab, verbose=True)
print("Example1:", example1, "\n")

#@ Example of Questions:
question1 = "Do you enjoy eating the dessert?"
question2 = "Do you like hiking in the desert?"
#@ Predicting the Duplicated Questions:
example2 = predict(question1, question2, 0.6, model, vocab, verbose=True)
print("Example2:", example2) 

Q1 =  [[32 87 53 21  1  1  1  1]] 
Q2 =  [[  32   87   53 1438   21    1    1    1]]
d =  0.8786118
result =  True
Example1: True 

Q1 =  [[  443    53  3158  1169    78 29071    21     1]] 
Q2 =  [[  443    53    60 15323    28    78  7438    21]]
d =  0.5000767
result =  False
Example2: False
