<a href="https://colab.research.google.com/github/graviraja/deep-paraphrase-generation/blob/master/deep_paraphrase_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **A deep generative framework for Paraphrase generation paper experimental code**



We will be using quora question pairs dataset 

I have downloaded dataset into google drive and will be using by mounting the drive

In [1]:
from google.colab import drive
drive.mount('/content/data')

# if you want to upload the file, uncomment the following code
# from google.colab import files
# files.upload()

Drive already mounted at /content/data; to attempt to forcibly remount, call drive.mount("/content/data", force_remount=True).


In [2]:
# if the file is uploaded, change the filename correspondingly
filename = './data/My Drive/quora_questions.csv'
filename

'./data/My Drive/quora_questions.csv'

## Imports

In [0]:
import re
import json
import math
import pickle
import spacy
import random
import codecs
import collections
import unicodedata

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable

from sklearn.model_selection import train_test_split

In [89]:
SEED = 5
random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# use the gpu if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cuda


In [90]:
df = pd.read_csv(filename, encoding='utf8')
print(f"Number of rows in the data: {len(df)}")
df.head()


Number of rows in the data: 404351


Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


Considering only the duplicates, since they are sentences and paraphrases to each other

In [91]:
data = df[df['is_duplicate'] == 1]
duplicate_df = data[['question1', 'question2']].reset_index(drop=True)
print(f"Number of duplicate questions: {len(data)}")
duplicate_df.head()

Number of duplicate questions: 149306


Unnamed: 0,question1,question2
0,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan..."
1,How can I be a good geologist?,What should I do to be a great geologist?
2,How do I read and find my YouTube comments?,How can I see all my Youtube comments?
3,What can make Physics easy to learn?,How can you make physics easy to learn?
4,What was your first sexual experience like?,What was your first sexual experience?


Set the *sample_size* if you want to just try out. You can use the complete data when the model is built

In [92]:
sample_size = 149306
sample_data = duplicate_df[:sample_size]
print(f"Length of the sample data: {len(sample_data)}")
sample_data.head()

Length of the sample data: 149306


Unnamed: 0,question1,question2
0,Astrology: I am a Capricorn Sun Cap moon and c...,"I'm a triple Capricorn (Sun, Moon and ascendan..."
1,How can I be a good geologist?,What should I do to be a great geologist?
2,How do I read and find my YouTube comments?,How can I see all my Youtube comments?
3,What can make Physics easy to learn?,How can you make physics easy to learn?
4,What was your first sexual experience like?,What was your first sexual experience?


### Preprocessing the data

In [0]:
def unicodeToAscii(s):
    # convert the unicode format to ascii
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
    )


def process_sentence(sentence):
    # Lowercase, trim, and remove non-letter characters
    s = unicodeToAscii(sentence.lower().strip())
    s = re.sub(r"([.!?])", r" \1", s)
    s = re.sub(r"[^a-zA-Z0-9.!?]+", r" ", s)
    s = re.sub(r"\s+", r" ", s).strip()
    return s

In [94]:
questions = [process_sentence(ques) for ques in sample_data['question1'].values]
paraphrases = [process_sentence(ques) for ques in sample_data['question2'].values]
print(questions[:5])
print('----------------------------------')
print(paraphrases[:5])

['astrology i am a capricorn sun cap moon and cap rising . . .what does that say about me ?', 'how can i be a good geologist ?', 'how do i read and find my youtube comments ?', 'what can make physics easy to learn ?', 'what was your first sexual experience like ?']
----------------------------------
['i m a triple capricorn sun moon and ascendant in capricorn what does this say about me ?', 'what should i do to be a great geologist ?', 'how can i see all my youtube comments ?', 'how can you make physics easy to learn ?', 'what was your first sexual experience ?']


### Split the dataset into training and validation datasets

In [96]:
train_src, valid_src, train_trg, valid_trg = train_test_split(questions, paraphrases, test_size=0.1, random_state=SEED)
print(f"Length of training data: {len(train_src)}")
print(f"Length of validation data: {len(valid_src)}")
corpus = {}
corpus['train_src'] = train_src
corpus['train_trg'] = train_trg
corpus['valid_src'] = valid_src
corpus['valid_trg'] = valid_trg

Length of training data: 134375
Length of validation data: 14931


### Creating the vocabulary and the data loader

We will use only the training data for creating the vocabulary

In [0]:
class BatchLoader:
  ''' Creates the vocabulary from the given corpus
  
  Args:
    corpus: Data for the vocabulary needs to be constructed.
  '''
  def __init__(self, corpus):
    
    self.vocab_size = 0
    self.word_to_idx = {}
    self.idx_to_word = {}
    
    self.data = [pd.DataFrame({'question1': corpus['train_src'], 'question2': corpus['train_trg']}), 
                 pd.DataFrame({'question1': corpus['valid_src'], 'question2': corpus['valid_trg']})]
    
    
    self.max_seq_len = 0
    
    self.pad_label = '<pad>'
    self.go_label = '<go>'
    self.end_label = '<eos>'
    self.unk_label = '<unk>'
    
    # build the vocab from the train data
    self.build_vocab(corpus['train_src'] + corpus['train_trg'])
    
  def build_vocab(self, corpus):
    # get the maximum sequence length from the given data
    self.max_seq_len = max([len(sent) for sent in corpus])
    
    # build the vocab from the words
    # you can use any other tokenizer here like spacy
    words = []
    for line in corpus:
      words += line.split()
    self.build_word_vocab(words)
 
  def build_word_vocab(self, words):
    # create the word count
    word_counts = collections.Counter(words)

    # map index to words
    idx_to_word = [x[0] for x in word_counts.most_common()]
    self.idx_to_word = [self.pad_label, self.go_label, self.end_label, self.unk_label] + list(sorted(idx_to_word))

    # vocab size
    self.vocab_size = len(self.idx_to_word)

    # map word to index
    self.word_to_idx = {x: i for i, x in enumerate(self.idx_to_word)}
  
  def get_word_by_idx(self, idx):
    # get the word of the given idx
  
    return self.idx_to_word[idx]
  
  def get_idx_by_word(self, word):
    # get the idx of the given word if the word present in vocab, else return unk
    
    if word in self.word_to_idx.keys():
      return self.word_to_idx[word]
    else:
      return self.word_to_idx[self.unk_label]
  
  def sample_word_from_distribution(self, distribution):
    # randomly sample a word from the given distribution
    
    assert distribution.shape[-1] == self.vocab_size
    ix = np.random.choice(range(self.vocab_size), p=distribution.ravel())
    return self.idx_to_word[ix]

  def likely_word_from_distribution(self, distribution):
    # get the word which has max probability from the given distribution
    
    assert distribution.shape[-1] == self.vocab_size
    ix = np.argmax(distribution.ravel())
    return self.idx_to_word[ix]
  
  def convert_to_numerical_batch(self, batch):
    # convert the given batch of sentences to its numerical format
    # input => list of sentences => [batch_size]
    
    max_len = np.max([len(x) for x in batch])    
    embed = [[self.get_idx_by_word(w) for w in s] + [self.get_idx_by_word(self.pad_label)] * (max_len - len(s)) for s in batch]
    
    return embed
  
  def next_batch(self, batch_size, type, return_sentences=False):
    # get a batch of data from the correspoding file => train/val
    if type == 'train':
      file_id = 0
    else:
      file_id = 1
      
    df = self.data[file_id].sample(batch_size, replace=False)
    sentences = [df['question1'].values, df['question2'].values]
    # sentences => list of 2 items [questions, paraphrases]

    input = self.input_from_sentences(sentences)
    
    if return_sentences:
      return input, [[s.split() for s in q] for q in sentences]
    else:
      return input
  
  def input_from_sentences(self, sentences):
    # sentences => list of 2 items [questions, paraphrases]
    
    # split the sentences into words
    sentences = [[s.split() for s in q] for q in sentences]
    
    # get the encoder, decoder inputs and target
    encoder_input_source, encoder_input_paraphrase = self.get_encoder_input(sentences)
    decoder_input_source, decoder_input_paraphrase = self.get_decoder_input(sentences)
    target = self.get_target(sentences)
    
    return [
        encoder_input_source, encoder_input_paraphrase,
        decoder_input_source, decoder_input_paraphrase,
        target
    ]
  
  def get_encoder_input(self, sentences):
    # add end label to each of the question and paraphrases, and convert them into numerical form
    
    encoder_source_input = self.convert_to_numerical_batch([s + [self.end_label] for s in sentences[0]])
    encoder_paraphrase_input = self.convert_to_numerical_batch([s + [self.end_label] for s in sentences[1]])
    
    return [Variable(torch.tensor(encoder_source_input)).long(), Variable(torch.tensor(encoder_paraphrase_input)).long()]
  
  def get_decoder_input(self, sentences):
    # add end label to the questions
    # add go label to the paraphrases
    
    decoder_source_input = self.convert_to_numerical_batch([s + [self.end_label] for s in sentences[0]])
    decoder_paraphrase_input = self.convert_to_numerical_batch([[self.go_label] + s for s in sentences[1]])
    
    return [Variable(torch.tensor(decoder_source_input)).long(), Variable(torch.tensor(decoder_paraphrase_input)).long()]
  
  def get_target(self, sentences):
    # get the target, which paraphrase decoder part needs to generate
    # we need to generate the paraphrase of a given sentence, so we will consider only the paraphrase
    
    sentences = sentences[1] 
    max_len = np.max([len(x) for x in sentences]) 
    target = [[self.get_idx_by_word(w) for w in s] + [self.get_idx_by_word(self.end_label)] + ([self.get_idx_by_word(self.pad_label)] * (max_len - len(s))) for s in sentences]
    
    return Variable(torch.tensor(target)).long()
  
  def get_raw_input_from_sentences(self, sentences):
    sentences = [s.split() for s in sentences]
    return Variable(torch.tensor(self.convert_to_numerical_batch(sentences))).long()

### Parameters of the model

The following parameters class contains all the model related parameters

In [0]:
class Parameters:
  ''' This code contains the parameters of the model.
  
  Args:
    max_seq_len: A integer indicating the max seq length
    vocab_size: A integer indicating the vocab size
    embedding_size: A integer indicating the word embedding size
  '''
  def __init__(self, max_seq_len, vocab_size, embedding_size):
    
    self.max_seq_len = int(max_seq_len) + 1
    
    self.vocab_size = int(vocab_size)
    self.word_embed_size = embedding_size
    
    self.encoder_rnn_size = 600
    self.encoder_num_layers = 1
    
    self.latent_variable_size = 1100
    
    self.decoder_rnn_size = 600
    self.decoder_num_layers = 2
    
    self.kld_penalty_weight = 1.0
    self.cross_entropy_penalty_weight = 79.0
  
  def get_kld_coef(self, i):
    # get the kl divergence penalty based on the iteration i
   
    return self.kld_penalty_weight * (math.tanh((i - 3500)/ 1000) + 1) / 2.0

### Highway network

This will be used before passing the input to encoder and decoder. So the embeddings are first passed through the highway network and then passed through the encoder/decoder.

In [0]:
class Highway(nn.Module):
  """This code contains the implementation of Highwat network.
  
  """
  def __init__(self, size, num_layers, f):
    super().__init__()
    
    self.num_layers = num_layers
    self.non_linear = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])
    self.linear = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])
    self.gate = nn.ModuleList([nn.Linear(size, size) for _ in range(num_layers)])
    self.f = f
  
  def forward(self, x):
    """
    Args:
        x : input with shape of [batch_size, size]
    Returns:
        tensor with shape of [batch_size, size] after applying highway network
    applies σ(x) ⨀ (f(G(x))) + (1 - σ(x)) ⨀ (Q(x)) transformation | G and Q is affine transformation,
        f is non-linear transformation, σ(x) is affine transformation with sigmoid non-linearition
        and ⨀ is element-wise multiplication
    """
    for layer in range(self.num_layers):
        # σ(x)
        gate = F.sigmoid(self.gate[layer](x))

        # f(G(x))
        non_linear = self.f(self.non_linear[layer](x))

        # Q(x)
        linear = self.linear[layer](x)

        # σ(x) ⨀ (f(G(x))) + (1 - σ(x)) ⨀ (Q(x))
        x = gate * non_linear + (1 - gate) * linear

    return x

### Encoder

Now let's see the encoder implementation. Our encoder contains 2 LSTM cells.


*   First LSTM encoder takes the source sentence as input.
*   Second LSTM encoder takes the paraphrase sentence as input along with first encoder's final hidden state as input
*   After encoding the paraphrase sentence, we will create the latent parameters
$\mu , \sigma$ from the final hidden state of the paraphrase encoder by passing through feed-forward layer



In [0]:
class Encoder(nn.Module):
  """ This code contains the encoder part of the paper.
  
  Args:
    params: A Parameters instance which contains the model parameters details.
    highway: A Highway network, applying for the input before passing to encoder.
  """
  def __init__(self, params, highway):
    super().__init__()
    
    self.params = params
    self.hw1 = highway
    self.embedding = nn.Embedding(self.params.vocab_size, self.params.word_embed_size)
    
    # encoding rnns for the source and paraphrase
    self.rnns = nn.ModuleList([
        nn.LSTM(
            input_size=self.params.word_embed_size,
            hidden_size=self.params.encoder_rnn_size,
            num_layers=self.params.encoder_num_layers,
            batch_first=True,
            bidirectional=True)
        for _ in range(2)
    ])
    
    # latent layers
    # we will combine both hidden and cell states of the final layer of the bidirectional paraphrase lstm
    self.context_to_mu = nn.Linear(self.params.encoder_rnn_size * 4, self.params.latent_variable_size)
    self.context_to_logvar = nn.Linear(self.params.encoder_rnn_size * 4, self.params.latent_variable_size)
  
  def forward(self, input_source, input_paraphrase):
    # input_source => [batch_size, inp_src_max_len]
    # input_paraphrase => [batch_size, inp_par_max_len]
    
    # initial state for encoding the source sentence is zeros
    # after encoding the source sentence, the state can be used to encode the paraphrase
    state = None
    for i, input in enumerate([input_source, input_paraphrase]):
      # before passing the input to encoder rnn, pass it through the highway network
      input = self.embedding(input)
      [batch_size, seq_len, embedding_size] = input.size()
      
      input = input.view(-1, embedding_size)
      input = self.hw1(input)
      input = input.view(batch_size, seq_len, embedding_size)
      
      _, state = self.rnns[i](input, state)
    
    # final hidden, cell state after encoding the paraphrase
    [h_state, c_state] = state
    
    # consider only the final layer's hidden state and cell state
    h_state = h_state.view(self.params.encoder_num_layers, 2, batch_size, self.params.encoder_rnn_size)[-1]
    c_state = c_state.view(self.params.encoder_num_layers, 2, batch_size, self.params.encoder_rnn_size)[-1]
    
    # make the hidden, cell states as batch major
    h_state = h_state.permute(1, 0, 2).contiguous().view(batch_size, -1)
    c_state = c_state.permute(1, 0, 2).contiguous().view(batch_size, -1)
    
    # concat the hidden and cell state
    # final_state => [batch_size, encoder_rnn_size * 4]
    final_state = torch.cat([h_state, c_state], -1)
    
    # latent parameters => [batch_size, latent_size]
    mu = self.context_to_mu(final_state)
    log_var = self.context_to_logvar(final_state)
    
    return mu, log_var

### Decoder

Now let's see the decoder implementation. Our decoder contains 2 LSTMs.



*   First LSTM is used for encoding the source sentence
*   The second decoder is used for creating the paraphrase using the states from first encoder and latent vectors
*   We will sample from latent vectors and pass along with word input as the input to the second decoder
*   The output from the second decoder is passed through an output layer for predicting the word



In [0]:
class Decoder(nn.Module):
  def __init__(self, params, highway):
    super().__init__()
    
    self.params = params
    self.hw1 = highway
    self.embedding = nn.Embedding(self.params.vocab_size, self.params.word_embed_size)
    
    # encoding of the source sentence is bidirectional, since source is available at inference time
    self.encoding_rnn = nn.LSTM(
        input_size=self.params.word_embed_size,
        hidden_size=self.params.encoder_rnn_size,
        num_layers=self.params.encoder_num_layers,
        batch_first=True,
        bidirectional=True)
    
    # decoding of the paraphrase sentence, input is word concatenated with latent vector
    self.decoding_rnn = nn.LSTM(
        input_size=self.params.word_embed_size + self.params.latent_variable_size,
        hidden_size=self.params.decoder_rnn_size,
        num_layers=self.params.decoder_num_layers,
        batch_first=True)
    
    # for converting the source decoder hidden state to hidden state of the paraphrase decoder
    self.h_to_initial_state = nn.Linear(self.params.encoder_rnn_size * 2, self.params.decoder_num_layers * self.params.decoder_rnn_size)
    
    # for converting the source decoder cell state to cell state of the paraphrase decoder
    self.c_to_initial_state = nn.Linear(self.params.encoder_rnn_size * 2, self.params.decoder_num_layers * self.params.decoder_rnn_size)
    
    # output layer
    self.fc = nn.Linear(self.params.decoder_rnn_size, self.params.vocab_size)
  
  def build_initial_state(self, input):
    # run the encoder rnn on the input to produce the hidden and cell state for paraphrase decoder
    input = self.embedding(input)
    [batch_size, seq_len, embed_size] = input.size()
    input = input.view(-1, embed_size)
    input = self.hw1(input)
    input = input.view(batch_size, seq_len, embed_size)

    _, cell_state = self.encoding_rnn(input)
    [h_state, c_state] = cell_state

    # consider only the final layer hidden and cell state
    h_state = h_state.view(self.params.encoder_num_layers, 2, batch_size, self.params.encoder_rnn_size)[-1]
    c_state = c_state.view(self.params.encoder_num_layers, 2, batch_size, self.params.encoder_rnn_size)[-1]

    # convert to batch major format
    h_state = h_state.permute(1, 0, 2).contiguous().view(batch_size, -1)
    c_state = c_state.permute(1, 0, 2).contiguous().view(batch_size, -1)

    # pass the hidden and cell state through a linear layer to get initial hidden and cell states for decoder
    h_initial = self.h_to_initial_state(h_state).view(batch_size, self.params.decoder_num_layers, self.params.decoder_rnn_size)
    # h_inital => [num_layers, batch_size, decoder_rnn_size]
    h_initial = h_initial.permute(1, 0, 2).contiguous()

    # pass the hidden and cell state through a linear layer to get initial hidden and cell states for decoder
    c_initial = self.h_to_initial_state(c_state).view(batch_size, self.params.decoder_num_layers, self.params.decoder_rnn_size)
    # c_inital => [num_layers, batch_size, decoder_rnn_size]
    c_initial = c_initial.permute(1, 0, 2).contiguous()

    return (h_initial, c_initial)
  
  def forward(self, encoder_input, decoder_input, z, drop_prob, initial_state=None):
    # encoder_input => [batch_size, src_seq_len, embed_size]
    # decoder_input => [batch_size, para_seq_len, embed_size]
    # z is the latent variable shape => [batch_size, latent_variable_size]
    # drop_prob is the probability of an element of decoder input to be zeroed in sense of dropout
    # initial_state is the initial state of decoder
    
    if initial_state is None:
      assert encoder_input is not None, "Input should be provided to the encoder part of decoder"
      initial_state = self.build_initial_state(encoder_input)
    
    decoder_input = self.embedding(decoder_input)
    [batch_size, seq_len, _] = decoder_input.size()
    
    # replicate z for seq_len times
    z = torch.cat([z] * seq_len, 1).view(batch_size, seq_len, self.params.latent_variable_size)
    
    decoder_input = F.dropout(decoder_input, drop_prob)
    
    # concat the latent variable and word decoder input
    decoder_input = torch.cat([decoder_input, z], 2)
    # decoder_input => [batch_size, para_seq_len, embed_size + latent_variable_size]
    
    # decoder rnn
    rnn_out, final_state = self.decoding_rnn(decoder_input, initial_state)
    
    # output layer
    # rnn_out => [batch_size, para_seq_len, decoder_rnn_size]
    rnn_out = rnn_out.contiguous().view(-1, self.params.decoder_rnn_size)
    result = self.fc(rnn_out)
    result = result.view(batch_size, seq_len, self.params.vocab_size)
    
    return result, final_state
    

### Paraphraser

Now let's look into the Paraphraser implementation, which uses the Encoder and Decoder.

We will implement the training part, evaluation part, inference code in the paraphraser.

In [0]:
class Paraphraser(nn.Module):
  def __init__(self, params, device):
    super().__init__()
    
    self.params = params
    self.device = device
    
    # use 2 layers, relu activation for Highway
    self.highway = Highway(self.params.word_embed_size, 2, F.relu)
    self.encoder = Encoder(self.params, self.highway)
    self.decoder = Decoder(self.params, self.highway)
    
  def forward(self, drop_prob, encoder_input=None, decoder_input=None, z=None, initial_state=None):
    # encoder_input: A list of 2 tensors (original sentence, paraphrase sentence) with shape => [batch_size, seq_len]
    # decoder_input: A list of 2 tensors (original sentence, paraphrase sentence) with shape => [batch_size, seq_len]
    # initial_state: Initial state of decoder rnn in order to perform sampling
    # drop_prob: probability of an element of decoder input to be zeroed in sense of dropout
    # z: context if sampling
    
    if z is None:
      # training case, get context from encoder and sample z ~ N(mu, std)
      [batch_size, _] = encoder_input[0].size()
      
      mu, log_var = self.encoder(encoder_input[0], encoder_input[1])
      std = torch.exp(0.5 * log_var)
      
      z = Variable(torch.randn([batch_size, self.params.latent_variable_size])).to(self.device)
      z = z * std + mu
      
      # kl divergence loss
      kld = (-0.5 * torch.sum(log_var - torch.pow(mu, 2) - torch.exp(log_var) + 1, 1)).mean().squeeze()
    else:
      kld = None
    
    out, final_state = self.decoder(decoder_input[0], decoder_input[1], z, drop_prob, initial_state)
    
    return out, final_state, kld
  
  def learnable_parameters(self):
    return [p for p in self.parameters() if p.requires_grad]
  
  def trainer(self, optimizer, batch_loader, pad_label):
    def train(i, batch_size, dropout):
      # i is the step number
      # dropout to use in the decoder
      
      # get the batch data from training data
      input = batch_loader.next_batch(batch_size, 'train')
      input = [var.to(self.device) for var in input]
      
      [encoder_input_source, encoder_input_paraphrase, decoder_input_source, decoder_input_paraphrase, target] = input
      
      # forward method

      logits, _, kld = self(dropout, (encoder_input_source, encoder_input_paraphrase), (decoder_input_source, decoder_input_paraphrase), z=None)
      
      logits = logits.view(-1, self.params.vocab_size)
      target = target.view(-1)
      
      cross_entropy_loss = F.cross_entropy(logits, target, ignore_index=pad_label)
      
      # total loss is weighted reconstruction loss + weighted kl divergence loss
      loss = self.params.cross_entropy_penalty_weight * cross_entropy_loss + self.params.get_kld_coef(i) * kld
      
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      
      return cross_entropy_loss, kld, self.params.get_kld_coef(i)
      
    return train
  
  def validator(self, batch_loader, pad_label):
    # we don't need to optimize the model during validation so optimizer not required
    # batch_loader, which contains all the data related things
    
    def get_samples(logits, target):
      # get the samples from logits
      # logits => [batch_size, seq_len, vocab_size]
      # target => [batch_size, seq_len]
      
      prediction = F.softmax(logits, dim=-1).data.cpu().numpy()
      target = target.data.cpu().numpy()
      
      sampled, expected = [], []
      for i in range(prediction.shape[0]):
        sampled += [' '.join([batch_loader.sample_word_from_distribution(d) for d in prediction[i]])]
        expected += [' '.join([batch_loader.get_word_by_idx(idx) for idx in target[i]])]
      
      return sampled, expected
    
    # validation loop of the model
    def validate(batch_size, need_samples=False):
      # batch size for the validation
      # need_samples, where to have the sentences converted or not
      if need_samples:
        input, sentences = batch_loader.next_batch(batch_size, 'test', return_sentences=True)
        sentences = [[' '.join(s) for s in q] for q in sentences]
      else:
        input = batch_loader.next_batch(batch_size, 'test')
      
      input = [var.to(self.device) for var in input]
      
      [encoder_input_source, encoder_input_paraphrase, decoder_input_source, decoder_input_paraphrase, target] = input
      
      # consider all the words during validation
      logits, _, kld = self(0., (encoder_input_source, encoder_input_paraphrase), (decoder_input_source, decoder_input_paraphrase), z=None)
      
      if need_samples:
        [s1, s2] = sentences
        sampled, _ = get_samples(logits, target)
      else:
        s1, s2 = (None, None)
        sampled, _ = (None, None)
      
      logits = logits.view(-1, self.params.vocab_size)
      target = target.view(-1)
      
      cross_entropy_loss = F.cross_entropy(logits, target, ignore_index=pad_label)
      return cross_entropy_loss, kld, (sampled, s1, s2)
    
    return validate
  
  def sample_with_input(self, batch_loader, seq_len, use_mean, input):
    [encoder_input_source, encoder_input_paraphrase, decoder_input_source, _, _] = input
    
    encoder_input = [encoder_input_source, encoder_input_paraphrase]
    
    [batch_size, _] = encoder_input[0].shape
    
    mu, logvar = self.encoder(encoder_input[0], encoder_input[1])
    std = torch.exp(0.5 * logvar)
    
    if use_mean:
      z = mu
    else:
      z = Variable(torch.randn([batch_size, self.params.latent_variable_size])).to(self.device)
      z = z * std + mu
    
    initial_state = self.decoder.build_initial_state(decoder_input_source)
    
    # initial input token to the decoder is the go label
    decoder_input = batch_loader.get_raw_input_from_sentences([batch_loader.go_label])
    
    result = ''
    
    for i in range(seq_len):
      decoder_input = decoder_input.to(self.device)
      
      logits, initial_state = self.decoder(None, decoder_input, z, 0.0, initial_state)
      logits = logits.view(-1, self.params.vocab_size)
      
      prediction = F.softmax(logits, dim=-1)
      word = batch_loader.likely_word_from_distribution(prediction.data.cpu().numpy()[-1])
      
      if word == batch_loader.end_label:
        break
      result += ' ' + word
      
      decoder_input = batch_loader.get_raw_input_from_sentences([word])
    
    return result
      
  def sample_with_pair(self, batch_loader, seq_len, source_sent, target_sent):
    input = batch_loader.input_from_sentences([[source_sent], [target_sent]])
    input = [var.to(self.device) for var in input]
    
    return self.sample_with_input(batch_loader, seq_len, False, input)

### Training Process

In [0]:
WORD_EMBEDDING_SIZE = 300
LEARNING_RATE = 0.00005
NUM_STEPS = 100000
BATCH_SIZE = 32
DROPOUT = 0.3

model_name = 'paraphrase_model.pt'

cross_entropy_result_train = []
kld_result_train = []
cross_entropy_result_valid = []
kld_result_valid = []
cross_entropy_cur_train = []
kld_cur_train = []


In [0]:
batch_loader = BatchLoader(corpus)
parameters = Parameters(batch_loader.max_seq_len, batch_loader.vocab_size, WORD_EMBEDDING_SIZE)
paraphraser = Paraphraser(parameters, device).to(device)
pad_label = batch_loader.get_idx_by_word(batch_loader.pad_label)

In [0]:
optimizer = optim.Adam(paraphraser.parameters(), LEARNING_RATE)
train_step = paraphraser.trainer(optimizer, batch_loader, pad_label)
validate = paraphraser.validator(batch_loader, pad_label)


In [0]:
with open('batch_loader.pkl', 'wb') as f:
  pickle.dump(batch_loader, f)

In [108]:
for iteration in range(NUM_STEPS):
  cross_entropy, kld, coef = train_step(iteration, BATCH_SIZE, DROPOUT)
  
  cross_entropy_cur_train += [cross_entropy.data.cpu().numpy()]
  kld_cur_train += [kld.data.cpu().numpy()]
  
  if iteration % 500 == 0:
    cross_entropy_result_train += [np.mean(cross_entropy_cur_train)]
    kld_result_train += [np.mean(kld_cur_train)]
    
    cross_entropy_cur_train = []
    kld_cur_train = []
    
    print('\n')
    print('------------------------------------------- TRAIN -------------------------------------------------------------------------')
    print(f"ITERATION = {iteration}, CROSS-ENTROPY = {cross_entropy_result_train[-1]}, KLD = {kld_result_train[-1]}, KLD-COEF = {coef}")
    print('------------------------------------------- VALID --------------------------------------------------------------------------')

    cross_entropy, kld = [], []
    for i in range(20):
      ce, kl, _ = validate(BATCH_SIZE)
      cross_entropy += [ce.data.cpu().numpy()]
      kld += [kl.data.cpu().numpy()]
      
    cross_entropy = np.mean(cross_entropy)
    kld = np.mean(kld)
    
    cross_entropy_result_valid += [cross_entropy]
    kld_result_valid += [kld]
    
    print(f"CROSS-ENTROPY = {cross_entropy}, KLD = {kld}")
    print('---------------------------------------- SAMPLING --------------------------------------------------------------------------')
    
    _, _, (sampled, s1, s2) = validate(2, need_samples=True)
    for i in range(len(sampled)):
      result = paraphraser.sample_with_pair(batch_loader, 20, s1[i], s2[i])
      print(f"source: {s1[i]}")
      print(f"target: {s2[i]}")
      print(f"valid: {sampled[i]}")
      print(f"sampled: {result}")
      print('---------------------------------------------------------------------------------------------------------------------------')
   
  if (iteration % 10000 == 0 and iteration != 0) or (iteration == NUM_STEPS - 1):
    torch.save(paraphraser.state_dict(), model_name)
    np.save('ce_result_valid_{}.npy'.format(iteration), np.array(cross_entropy_result_valid))
    np.save('kld_result_valid_{}.npy'.format(iteration), np.array(kld_result_valid))
    np.save('ce_result_train_{}.npy'.format(iteration), np.array(cross_entropy_result_train))
    np.save('kld_result_train_{}.npy'.format(iteration), np.array(kld_result_train))





------------------------------------------- TRAIN -------------------------------------------------------------------------
ITERATION = 0, CROSS-ENTROPY = 3.3902270793914795, KLD = 6.252923965454102, KLD-COEF = 0.0009110511944006583
------------------------------------------- VALID --------------------------------------------------------------------------
CROSS-ENTROPY = 3.3494057655334473, KLD = 6.618390083312988
---------------------------------------- SAMPLING --------------------------------------------------------------------------
source: are indians so obsessed with the notion of religion and caste ?
target: why are indians so obsessed with their religion caste class society and community ?
valid: will do adulterated topics prove scene expansion campuses look exists anthem ? nibiru elections <eos>
sampled:  what are the pros of the indian government of india and what are the reason of the united states ?
-------------------------------------------------------------------------

In [104]:
torch.tensor([[1,2,3], [1,2,3]])

tensor([[1, 2, 3],
        [1, 2, 3]])