## Similar Question Generation Using Quora Dataset with Seq2Seq Architecture

**Objective**: As part of Assignment 7, Part 2, we use the Quora dataset, a collection of question pairs with a field indicating whether each pair is a duplicate.

**Task**: Given a question as input, generate a similar question using a Seq2Seq model. We use the model discussed in Session 7 of END2.


### 1. Downloading and loading the dataset into DataFrames

#### About the dataset

Known as the Quora First Dataset, this dataset was curated by the Quora team to identify similar questions and group them together. For example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora because the intent behind both is identical.

This dataset consists of over 400,000 lines of potential duplicate question pairs. Each line contains the IDs of both questions in the pair, the full text of each question, and a binary value indicating whether the pair is truly a duplicate.

We consider only duplicate question pairs for this exercise. Although the intended NLU task for this dataset is text similarity, we use it here, as part of Assignment 7 Part 2 in END2, for hands-on practice with the Seq2Seq architecture.

**Input:** Question Q1 <br>
**Output:** Question Q2 <br>
where <br>
<Q1, Q2> is a pair of duplicate questions


In [10]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy import data
from torchtext.legacy.data import Field, BucketIterator


import spacy
import numpy as np

import random
import math
import time

import pandas as pd
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)


In [6]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
#!pip install spacy --upgrade

In [3]:
!wget http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv

--2021-06-24 12:18:39--  http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv
Resolving qim.fs.quoracdn.net (qim.fs.quoracdn.net)... 151.101.41.2
Connecting to qim.fs.quoracdn.net (qim.fs.quoracdn.net)|151.101.41.2|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 58176133 (55M) [text/tab-separated-values]
Saving to: ‘quora_duplicate_questions.tsv’


2021-06-24 12:18:42 (399 MB/s) - ‘quora_duplicate_questions.tsv’ saved [58176133/58176133]



In [12]:
import pandas as pd

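# Note: pandas >= 1.3 deprecates error_bad_lines in favour of on_bad_lines='skip'
df = pd.read_csv("quora_duplicate_questions.tsv", sep='\t', engine="python",error_bad_lines=False)

# Drop rows with missing values first; casting to str beforehand would turn NaN into the string 'nan'
df = df.dropna()
df['question1'] = df['question1'].astype(str)
df['question2'] = df['question2'].astype(str)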

df = df[df['is_duplicate']==1]
df.to_csv("quora_duplicate_dataset.csv")

print("Number of records : ",len(df))
display(df[['question1','question2']].head(n = 20))

Number of records :  149263


Unnamed: 0,question1,question2
5,Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?,"I'm a triple Capricorn (Sun, Moon and ascendant in Capricorn) What does this say about me?"
7,How can I be a good geologist?,What should I do to be a great geologist?
11,How do I read and find my YouTube comments?,How can I see all my Youtube comments?
12,What can make Physics easy to learn?,How can you make physics easy to learn?
13,What was your first sexual experience like?,What was your first sexual experience?
15,What would a Trump presidency mean for current international master’s students on an F1 visa?,How will a Trump presidency affect the students presently in US or planning to study in US?
16,What does manipulation mean?,What does manipulation means?
18,Why are so many Quora users posting questions that are readily answered on Google?,Why do people ask Quora questions which can be answered easily by Google?
20,Why do rockets look white?,Why are rockets and boosters painted white?
29,How should I prepare for CA final law?,How one should know that he/she completely prepare for CA final exam?


In [13]:
%%bash
python -m spacy download en
#python -m spacy download de

Collecting en-core-web-sm==3.0.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0-py3-none-any.whl (13.7 MB)
⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the
full pipeline package name 'en_core_web_sm' instead.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')




In [14]:
#spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

In [15]:
def tokenize_q2(text):
    """
    Tokenizes English text from a string into a list of strings (tokens) and reverses it
    """
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)][::-1]

def tokenize_q1(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    """
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)]

In [16]:
Q1 = Field(sequential=True,
                 #batch_first = True,
                 tokenize = tokenize_q1, 
                 init_token = '<sos>', 
                 eos_token = '<eos>')


Q2 = Field(sequential=True,
                 #batch_first = True,
                 tokenize = tokenize_q2, 
                 init_token = '<sos>', 
                 eos_token = '<eos>')

fields = [('question1', Q1),('question2',Q2)]

In [17]:
example = [data.Example.fromlist([df.question1.iloc[i],df.question2.iloc[i]], fields) for i in range(df.shape[0])] 
train_data, test_data = data.Dataset(example, fields).split(split_ratio = 0.7)

In [18]:
print(f"Number of training examples: {len(train_data.examples)}")
#print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 104484
Number of testing examples: 44779
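
As a quick sanity check, the split sizes above can be reproduced by hand. This is a worked arithmetic sketch only; it assumes the training size is obtained by rounding `split_ratio * N`, which is consistent with the printed counts but is not taken from the library source here.

```python
# Worked check of the 70/30 split sizes reported above.
n_pairs = 149263       # duplicate pairs kept after filtering
split_ratio = 0.7      # value passed to .split(split_ratio=...)

# Assumption: the training size is split_ratio * N, rounded to an integer.
n_train = round(split_ratio * n_pairs)
n_test = n_pairs - n_train

print(n_train, n_test)  # 104484 44779
```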


In [19]:
print(vars(train_data.examples[0]))

{'question1': ['what', 'are', 'the', 'most', 'intellectually', 'stimulating', 'movies', 'you', 'have', 'ever', 'seen', '?'], 'question2': ['?', 'watched', 'ever', 'have', 'you', 'films', 'stimulating', 'intellectually', 'most', 'the', 'are', 'what']}


In [20]:
Q1.build_vocab(train_data, min_freq = 2)
Q2.build_vocab(train_data, min_freq = 2)

print('Top 10 words in Question1 Vocab :', list(Q1.vocab.freqs.most_common(10)))

import os, pickle
with open('Quora_Q1_tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Q1.vocab.stoi, tokens)

  
with open('Quora_Q2_tokenizer.pkl', 'wb') as tokens: 
    pickle.dump(Q2.vocab.stoi, tokens)

Top 10 words in Question1 Vocab : [('?', 108748), ('the', 46721), ('what', 41081), ('is', 31957), ('how', 31727), ('i', 26415), ('to', 23391), ('do', 22943), ('in', 20349), ('a', 19731)]


In [21]:
print(f"Unique tokens in Question1 vocabulary: {len(Q1.vocab)}")
print(f"Unique tokens in Question2  vocabulary: {len(Q2.vocab)}")

Unique tokens in Question1 vocabulary: 14513
Unique tokens in Question2  vocabulary: 14455
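
To make the effect of `min_freq = 2` concrete, here is a minimal pure-Python sketch of frequency-thresholded vocabulary building. The helper names `build_vocab` and `numericalize` are hypothetical and the toy corpus is invented for illustration; the special-token order is an assumption and this is not the torchtext implementation.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map tokens to integer ids, dropping tokens rarer than min_freq."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent first; ties broken alphabetically for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def numericalize(tokens, stoi):
    """Replace out-of-vocabulary tokens with the <unk> id."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

# Toy corpus (invented): 'physics' and 'geology' occur once, so they are dropped.
toy_corpus = [
    ["what", "is", "physics", "?"],
    ["what", "is", "geology", "?"],
]
stoi, itos = build_vocab(toy_corpus, min_freq=2)
ids = numericalize(["what", "is", "astrology", "?"], stoi)

print(itos)  # ['<unk>', '<pad>', '<sos>', '<eos>', '?', 'is', 'what']
print(ids)   # [6, 5, 0, 4]
```

Words seen only once in the training split are mapped to `<unk>`, which keeps the embedding and output layers smaller at the cost of losing rare words.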


In [22]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print("Current Device is ", device)

Current Device is  cuda


In [23]:
BATCH_SIZE = 128

train_iterator, test_iterator = BucketIterator.splits(
    (train_data, test_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.question1),
    sort_within_batch=True,
    device = device)
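
The reason for bucketing by length is that batches of similar-length questions need far less padding. The sketch below illustrates the idea in plain Python; `make_batches` and `count_padding` are hypothetical helpers, and the real `BucketIterator` also shuffles and works on larger buckets, so treat this as a simplified picture.

```python
def make_batches(examples, batch_size, sort_by_length, pad="<pad>"):
    """Slice examples into batches and pad each batch to its longest item."""
    ordered = sorted(examples, key=len) if sort_by_length else list(examples)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        width = max(len(ex) for ex in chunk)
        batches.append([ex + [pad] * (width - len(ex)) for ex in chunk])
    return batches

def count_padding(batches, pad="<pad>"):
    """Total number of pad tokens across all batches."""
    return sum(tok == pad for batch in batches for ex in batch for tok in ex)

# Toy "questions" of lengths 2, 9, 3, 10, 2 and 8 tokens.
toy = [["w"] * n for n in (2, 9, 3, 10, 2, 8)]

plain = count_padding(make_batches(toy, batch_size=2, sort_by_length=False))
bucketed = count_padding(make_batches(toy, batch_size=2, sort_by_length=True))

print(plain, bucketed)  # 20 6
```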

In [24]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [25]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

In [26]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
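
The teacher-forcing choice in the loop above can be isolated in a few lines. This is a toy sketch: `decoder_inputs` is a hypothetical helper, and `predicted` is a fixed stand-in for the model's argmax at each step, whereas a real decoder would compute it on the fly.

```python
import random

def decoder_inputs(target, predicted, teacher_forcing_ratio, rng):
    """List the tokens a decoder would consume, one per decoding step.

    target:    ground-truth tokens, starting with <sos>
    predicted: stand-in for the model's prediction at each position
    """
    inputs = [target[0]]  # the first input is always <sos>
    for t in range(1, len(target) - 1):
        use_truth = rng.random() < teacher_forcing_ratio
        inputs.append(target[t] if use_truth else predicted[t])
    return inputs

target = ["<sos>", "what", "is", "physics", "?", "<eos>"]
predicted = ["<sos>", "how", "is", "math", "?", "<eos>"]

rng = random.Random(1234)
always = decoder_inputs(target, predicted, 1.0, rng)  # pure teacher forcing
never = decoder_inputs(target, predicted, 0.0, rng)   # free running, as in evaluation

print(always)
print(never)
```

With a ratio of 1.0 the decoder only ever sees ground-truth tokens; with 0.0 it only sees its own predictions, which is the setting used at evaluation time.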

In [27]:
INPUT_DIM = len(Q1.vocab)
OUTPUT_DIM = len(Q2.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [28]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(14513, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(14455, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=14455, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [29]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 22,187,639 trainable parameters
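
The parameter count above can be verified by hand from the layer sizes in the model summary. The sketch assumes the usual PyTorch LSTM layout (four gates, each with input weights, recurrent weights and two bias vectors); `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional multi-layer LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates: input weights + recurrent weights + two bias vectors each
        total += 4 * (in_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

src_vocab, trg_vocab = 14513, 14455        # vocabulary sizes from the summary
emb_dim, hid_dim, n_layers = 256, 512, 2

encoder = src_vocab * emb_dim + lstm_params(emb_dim, hid_dim, n_layers)
decoder = (trg_vocab * emb_dim
           + lstm_params(emb_dim, hid_dim, n_layers)
           + hid_dim * trg_vocab + trg_vocab)  # fc_out weight + bias

print(f"{encoder + decoder:,}")  # 22,187,639
```

Roughly a third of the parameters sit in the output projection `fc_out`, since it scales with the target vocabulary size.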


In [30]:
optimizer = optim.Adam(model.parameters(),lr=0.001)

In [31]:
PAD_IDX = Q2.vocab.stoi[Q2.pad_token]

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)
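
To illustrate what `ignore_index` does, here is a small pure-Python sketch of a mean cross-entropy that skips padded positions. `masked_cross_entropy` is a hypothetical helper and the logits are invented; it mirrors the averaging behaviour rather than reproducing the PyTorch implementation.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target != ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padded position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[target]  # -log softmax(row)[target]
        counted += 1
    return total / counted

PAD = 1  # assumed pad index for this toy example
logits = [[0.0, 0.0, 2.0], [0.0, 0.0, 0.0], [5.0, 0.0, 0.0]]
targets = [2, 0, PAD]

loss = masked_cross_entropy(logits, targets, ignore_index=PAD)
loss_without_pad_row = masked_cross_entropy(logits[:2], targets[:2], ignore_index=PAD)

print(round(loss, 4), round(loss_without_pad_row, 4))
```

The two values agree because the padded third position is excluded from both the sum and the denominator.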

In [32]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        q1 = batch.question1
        q2 = batch.question2

        optimizer.zero_grad()
        
        output = model(q1, q2)
        
        #q2 = [q2 len, batch size]
        #output = [q2 len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        q2 = q2[1:].view(-1)
        
        #q2 = [(q2 len - 1) * batch size]
        #output = [(q2 len - 1) * batch size, output dim]
        
        loss = criterion(output, q2)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
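
The gradient clipping step in `train` rescales all gradients together when their joint norm exceeds `clip`. The following is a minimal sketch of that idea on plain lists; `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption added to avoid division by zero.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is <= max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 0.0], [0.0, 4.0]]  # joint norm = 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before, round(norm_after, 4))
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped.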

In [33]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            q1 = batch.question1
            q2 = batch.question2

            output = model(q1, q2, 0) #turn off teacher forcing

            #q2 = [q2 len, batch size]
            #output = [q2 len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            q2 = q2[1:].view(-1)

            #q2 = [(q2 len - 1) * batch size]
            #output = [(q2 len - 1) * batch size, output dim]

            loss = criterion(output, q2)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [34]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [35]:
N_EPOCHS = 35
CLIP = 1

best_test_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    test_loss = evaluate(model, test_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if test_loss < best_test_loss:
        best_test_loss = test_loss
        torch.save(model.state_dict(), 'quora-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Test. Loss: {test_loss:.3f} |  Test. PPL: {math.exp(test_loss):7.3f}')

Epoch: 01 | Time: 4m 2s
	Train Loss: 4.791 | Train PPL: 120.465
	 Test. Loss: 4.849 |  Test. PPL: 127.670
Epoch: 02 | Time: 4m 2s
	Train Loss: 3.776 | Train PPL:  43.624
	 Test. Loss: 4.302 |  Test. PPL:  73.878
Epoch: 03 | Time: 4m 1s
	Train Loss: 3.256 | Train PPL:  25.943
	 Test. Loss: 3.969 |  Test. PPL:  52.919
Epoch: 04 | Time: 4m 3s
	Train Loss: 2.945 | Train PPL:  19.009
	 Test. Loss: 3.943 |  Test. PPL:  51.577
Epoch: 05 | Time: 4m 3s
	Train Loss: 2.714 | Train PPL:  15.088
	 Test. Loss: 3.799 |  Test. PPL:  44.651
Epoch: 06 | Time: 4m 3s
	Train Loss: 2.533 | Train PPL:  12.587
	 Test. Loss: 3.774 |  Test. PPL:  43.572
Epoch: 07 | Time: 4m 2s
	Train Loss: 2.395 | Train PPL:  10.971
	 Test. Loss: 3.795 |  Test. PPL:  44.499
Epoch: 08 | Time: 4m 1s
	Train Loss: 2.296 | Train PPL:   9.938
	 Test. Loss: 3.736 |  Test. PPL:  41.916
Epoch: 09 | Time: 4m 2s
	Train Loss: 2.188 | Train PPL:   8.919
	 Test. Loss: 3.785 |  Test. PPL:  44.056
Epoch: 10 | Time: 4m 2s
	Train Loss: 2.121 | T

KeyboardInterrupt: 

In [36]:
model.load_state_dict(torch.load('quora-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.727 | Test PPL:  41.565 |
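
Perplexity is simply the exponential of the average cross-entropy loss, so the PPL column can be recomputed from the loss column. The pairs below are copied from the logs above; the tolerance is an assumption that accounts for the losses being printed to three decimals.

```python
import math

# (loss, reported perplexity) pairs taken from the printed logs.
reported = [(3.727, 41.565), (4.849, 127.670), (3.776, 43.624)]

relative_errors = []
for loss, ppl in reported:
    recomputed = math.exp(loss)  # PPL = exp(mean cross-entropy)
    relative_errors.append(abs(recomputed - ppl) / ppl)
    print(f"loss={loss:.3f}  exp(loss)={recomputed:.3f}  reported={ppl:.3f}")
```

The small differences come from rounding: the logged perplexity was computed from the unrounded loss.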


In [68]:
id = 1

example = train_data.examples[id]
q2 = example.question2[::-1]
print('Question1: ', ' '.join(example.question1))
print('Question2: ', " ".join(q2))



Question1:  what are the basic types of satellites ?
Question2:  what are different types of satellites ? what is the most advanced type ?


In [74]:
q1_tensor = Q1.process([example.question1]).to(device)
q2_tensor = Q2.process([example.question2]).to(device)
print(q2_tensor.shape)

model.eval()
with torch.no_grad():
    outputs = model(q1_tensor, q2_tensor, teacher_forcing_ratio=0)

print(outputs.shape)
output_idx = outputs[1:].squeeze(1).argmax(1)

print('Input Question: ', ' '.join(example.question1))
print('Generated Question: ', ' '.join([Q2.vocab.itos[idx] for idx in output_idx][::-1]))

#Known bug: <eos> is not treated as a stop symbol, so tokens generated after it are still printed.

torch.Size([16, 1])
torch.Size([16, 1, 14455])
Input Question:  what are the basic types of satellites ?
Generated Question:  are the different types of satellites <eos> what are the different types of type ?
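
One possible way to address the EOS issue noted in the comment above is to cut the decoded sequence at the first `<eos>` before undoing the target reversal. This is a hedged sketch, not the notebook's fix: `postprocess` is a hypothetical helper, and the raw token order is reconstructed from the printed string.

```python
def postprocess(decoded_tokens, eos="<eos>"):
    """Cut a decoded sequence at the first <eos>, then undo the target reversal."""
    if eos in decoded_tokens:
        decoded_tokens = decoded_tokens[:decoded_tokens.index(eos)]
    return decoded_tokens[::-1]

# The printed string is the decoder output reversed, so reverse it back
# to recover the order in which the tokens were generated.
printed = "are the different types of satellites <eos> what are the different types of type ?"
raw = printed.split()[::-1]

cleaned = " ".join(postprocess(raw))
print(cleaned)  # what are the different types of type ?
```

Everything the decoder emits after `<eos>` is discarded, which removes the stray prefix seen in the generated question.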


In [76]:
#Known bug: <eos> is not treated as a stop symbol, so tokens generated after it are still printed.

import random
n =  len(train_data.examples)
for i in range(0,5): #number of tries is 5
    id = random.randrange(n)  #randint(1, n) is inclusive and could index past the end
    example = train_data.examples[id]
    q2 = example.question2[::-1]
    
    q1_tensor = Q1.process([example.question1]).to(device)
    q2_tensor = Q2.process([example.question2]).to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(q1_tensor, q2_tensor, teacher_forcing_ratio=0)

    output_idx = outputs[1:].squeeze(1).argmax(1)
    
    print('\nInput Question: ', ' '.join(example.question1))
    print('Target Output: ', " ".join(q2))
    print('Generated Question: ', ' '.join([Q2.vocab.itos[idx] for idx in output_idx][::-1]))



Input Question:  what is the best samsung air conditioner repair center in hyderabad ?
Target Output:  how can we find the best samsung air conditioner repair center in hyderabad ?
Generated Question:  <eos> <eos> <eos> where is the best carrier air conditioner repair center in hyderabad ?

Input Question:  what will be the repercussions of banning rs 500 and rs 1000 notes on indian economy ?
Target Output:  what are your views on india banning 500 and 1000 notes ? in what way it will affect indian economy ?
Generated Question:  <eos> <eos> <eos> <eos> <eos> <eos> <eos> <eos> how will the ban on 500₹ and 1000₹ notes on the indian economy ?

Input Question:  how imminent is world war three ?
Target Output:  is world war 3 on the way with the us elections ?
Generated Question:  war <eos> <eos> <eos> <eos> are we heading toward world war iii ?

Input Question:  how do i get rid of my belly fat ?
Target Output:  what s the best way to reduce belly fat ?
Generated Question:  s the best way