<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fgeneration/applications/generation/utterance_generation/General%20Utterance%20Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Utterance Generation

Utterance generation is an important problem in NLP, with applications in question answering, information retrieval, information extraction, and conversational systems, to name a few. It can also be used to create synthetic training data for many NLP problems.


## MS COCO Dataset


In [1]:
!wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

--2020-06-28 15:41:36--  http://images.cocodataset.org/annotations/annotations_trainval2017.zip
Resolving images.cocodataset.org (images.cocodataset.org)... 52.216.10.35
Connecting to images.cocodataset.org (images.cocodataset.org)|52.216.10.35|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 252907541 (241M) [application/zip]
Saving to: ‘annotations_trainval2017.zip’


2020-06-28 15:41:45 (29.8 MB/s) - ‘annotations_trainval2017.zip’ saved [252907541/252907541]



In [2]:
!unzip annotations_trainval2017.zip

Archive:  annotations_trainval2017.zip
  inflating: annotations/instances_train2017.json  
  inflating: annotations/instances_val2017.json  
  inflating: annotations/captions_train2017.json  
  inflating: annotations/captions_val2017.json  
  inflating: annotations/person_keypoints_train2017.json  
  inflating: annotations/person_keypoints_val2017.json  


In [3]:
!ls -lah annotations

total 796M
drwxr-xr-x 2 root root 4.0K Jun 28 15:41 .
drwxr-xr-x 1 root root 4.0K Jun 28 15:41 ..
-rw-rw-r-- 1 root root  88M Sep  1  2017 captions_train2017.json
-rw-rw-r-- 1 root root 3.7M Sep  1  2017 captions_val2017.json
-rw-rw-r-- 1 root root 449M Sep  1  2017 instances_train2017.json
-rw-rw-r-- 1 root root  20M Sep  1  2017 instances_val2017.json
-rw-rw-r-- 1 root root 228M Sep  1  2017 person_keypoints_train2017.json
-rw-rw-r-- 1 root root 9.6M Sep  1  2017 person_keypoints_val2017.json


## Required Installations

In [4]:
!pip install youtokentome

Collecting youtokentome
  Downloading https://files.pythonhosted.org/packages/a3/65/4a86cf99da3f680497ae132329025b291e2fda22327e8da6a9476e51acb1/youtokentome-1.0.6-cp36-cp36m-manylinux2010_x86_64.whl (1.7MB)

## Imports

In [5]:
import json
import time
import codecs
import random
import math
import spacy
import numpy as np
import pandas as pd

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

import youtokentome
from torchtext import data, vocab

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker

import seaborn as sns



In [6]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [7]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Data Exploration


The MS COCO dataset contains several types of annotations designed for different tasks. Since we only need sentences that are similar to each other, we use only the caption files.

In [8]:
caption_files = ['annotations/captions_train2017.json', 'annotations/captions_val2017.json']

Let's group the captions by `image_id`

In [9]:
captions = {}

for each_file in caption_files:
    with open(each_file, 'r') as f:
        content = json.load(f)

    for i in content['annotations']:
        if i['image_id'] in captions:
            captions[i['image_id']].append(i['caption'])
        else:
            captions[i['image_id']] = [i['caption']]

Sample check

In [10]:
captions[203564]

['A bicycle replica with a clock as the front wheel.',
 'The bike has a clock as a tire.',
 'A black metal bicycle with a clock inside the front wheel.',
 'A bicycle figurine in which the front wheel is replaced with a clock\n',
 'A clock with the appearance of the wheel of a bicycle ']

Since each image has 5 captions, we randomly drop 1 out of 5 and create 2 (sentence, utterance) pairs out of the remaining 4 captions.

In [11]:
coco_data = []

for _, caption in captions.items():
    
    if len(caption) == 5:
        # randomly pick an id
        random_id = random.choice(range(len(caption)))
        
        # remove the caption corresponding to that id
        _ = caption.pop(random_id)

    # with exactly three captions, only one pair can be formed
    if len(caption) == 3:
        coco_data.append((caption[0], caption[1]))
        continue

    # add the first four captions as two (sentence, utterance) pairs
    coco_data.append((caption[0], caption[1]))
    coco_data.append((caption[2], caption[3]))

In [12]:
len(coco_data)

246574

Let's use `90%` of the data to train the model, `5%` as validation data and `5%` as test data

In [17]:
total_length = len(coco_data)
train_data_size = int(0.9 * total_length)
valid_data_size = int(0.05 * total_length)

In [18]:
train_data = coco_data[:train_data_size]
valid_data = coco_data[train_data_size:train_data_size + valid_data_size]
test_data = coco_data[train_data_size + valid_data_size:]

In [19]:
len(train_data), len(valid_data), len(test_data)

(221916, 12328, 12330)
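
As a quick check of the split arithmetic: because `int()` truncates, the two rounded sizes do not quite add up to the total, and the test slice absorbs the remainder. The sketch below reuses the pair count printed above; `split_sizes` is a hypothetical helper for illustration, not part of the notebook.

```python
def split_sizes(total, train_frac=0.9, valid_frac=0.05):
    """Return (train, valid, test) sizes; the test split takes whatever is left."""
    n_train = int(train_frac * total)   # truncates toward zero
    n_valid = int(valid_frac * total)
    n_test = total - n_train - n_valid  # remainder, slightly above 5%
    return n_train, n_valid, n_test


sizes = split_sizes(246574)
print(sizes)
```

This explains why the test set ends up two examples larger than the validation set.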

In [20]:
train_df = pd.DataFrame(train_data, columns=['sentence', 'utterance'])
valid_df = pd.DataFrame(valid_data, columns=['sentence', 'utterance'])
test_df = pd.DataFrame(test_data, columns=['sentence', 'utterance'])

In [21]:
train_df.head()

| | sentence | utterance |
|---|---|---|
| 0 | The bike has a clock as a tire. | A black metal bicycle with a clock inside the ... |
| 1 | A bicycle figurine in which the front wheel is... | A clock with the appearance of the wheel of a ... |
| 2 | Blue and white color scheme in a small bathroom. | This is a blue and white bathroom with a wall ... |
| 3 | A blue boat themed bathroom with a life preser... | A bathroom with walls that are painted baby blue. |
| 4 | A car that seems to be parked illegally behind... | two cars parked on the sidewalk on the street |


Save the data into files

In [22]:
train_df.to_csv('train_ds.csv')
valid_df.to_csv('valid_ds.csv')
test_df.to_csv('test_ds.csv')

In [23]:
!ls -lah

total 268M
drwxr-xr-x 1 root root 4.0K Jun 28 15:44 .
drwxr-xr-x 1 root root 4.0K Jun 28 15:40 ..
drwxr-xr-x 2 root root 4.0K Jun 28 15:41 annotations
-rw-r--r-- 1 root root 242M Jul 10  2018 annotations_trainval2017.zip
drwxr-xr-x 1 root root 4.0K Jun 25 17:02 .config
drwxr-xr-x 1 root root 4.0K Jun 17 16:18 sample_data
-rw-r--r-- 1 root root 1.4M Jun 28 15:44 test_ds.csv
-rw-r--r-- 1 root root  25M Jun 28 15:44 train_ds.csv
-rw-r--r-- 1 root root 1.4M Jun 28 15:44 valid_ds.csv


Create a file which contains all the training data, so that it can be used for training the tokenizer

In [24]:
# using only training data to train the BPE tokenizer
all_data = list(train_df['sentence'].str.lower().values) + list(train_df['utterance'].str.lower().values)

with codecs.open("all_data.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(all_data))

In [25]:
# free some ram
del all_data

## Tokenizer

Using the `youtokentome` library to train the BPE (Byte-Pair Encoding) tokenizer

In [26]:
# Perform BPE
print("\nLearning BPE...")
youtokentome.BPE.train(data="all_data.txt", vocab_size=20000, model="bpe.model")


Learning BPE...


<youtokentome.youtokentome.BPE at 0x7f3527e3e5c0>
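
To give a feel for what the tokenizer learns, here is a minimal sketch of the BPE merge loop on a toy word list. It is not how `youtokentome` is implemented (that library is optimized and handles ties and special tokens in its own way); `learn_bpe_merges` and the tie-breaking rule are assumptions made for this illustration.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn `num_merges` merge rules from a list of words (toy BPE)."""
    # Each word starts as a tuple of characters, prefixed with the word-boundary mark.
    corpus = Counter(tuple("▁" + w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair; ties are broken by the pair itself (an arbitrary but deterministic rule).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = learn_bpe_merges(["clock", "clock", "clocks", "block"], num_merges=3)
print(merges)
print(list(corpus))
```

After three merges the shared piece `lock` has become a single symbol, which is the same effect that produces subwords such as `ent` and `ence` in the real vocabulary.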

In [27]:
!ls -lah

total 291M
drwxr-xr-x 1 root root 4.0K Jun 28 15:44 .
drwxr-xr-x 1 root root 4.0K Jun 28 15:40 ..
-rw-r--r-- 1 root root  23M Jun 28 15:44 all_data.txt
drwxr-xr-x 2 root root 4.0K Jun 28 15:41 annotations
-rw-r--r-- 1 root root 242M Jul 10  2018 annotations_trainval2017.zip
-rw-r--r-- 1 root root 271K Jun 28 15:44 bpe.model
drwxr-xr-x 1 root root 4.0K Jun 25 17:02 .config
drwxr-xr-x 1 root root 4.0K Jun 17 16:18 sample_data
-rw-r--r-- 1 root root 1.4M Jun 28 15:44 test_ds.csv
-rw-r--r-- 1 root root  25M Jun 28 15:44 train_ds.csv
-rw-r--r-- 1 root root 1.4M Jun 28 15:44 valid_ds.csv


Load the trained `BPE` tokenizer

In [28]:
# Load BPE model
print("\nLoading BPE model...")
bpe_model = youtokentome.BPE(model="bpe.model")


Loading BPE model...


In [29]:
# Special Tokens
print(f"<BOS>: {bpe_model.subword_to_id('<BOS>')}")    # Beginning of the sentence token
print(f"<EOS>: {bpe_model.subword_to_id('<EOS>')}")    # End of the sentence token
print(f"<UNK>: {bpe_model.subword_to_id('<UNK>')}")    # Unknown token
print(f"<PAD>: {bpe_model.subword_to_id('<PAD>')}")    # Pad token

<BOS>: 2
<EOS>: 3
<UNK>: 1
<PAD>: 0


In [30]:
pad_index = bpe_model.subword_to_id('<PAD>')

In [31]:
sentence = "This is a sample sentence"
encoded_ids = bpe_model.encode(sentence.lower(), output_type=youtokentome.OutputType.ID, bos=True, eos=True)
encoded_text = bpe_model.encode(sentence.lower(), output_type=youtokentome.OutputType.SUBWORD, bos=True, eos=True)
decoded_text = bpe_model.decode(encoded_ids, ignore_ids=[2, 3])

print(encoded_ids)
print(encoded_text)
print(decoded_text)

[2, 547, 112, 67, 10274, 70, 416, 9110, 3]
['<BOS>', '▁this', '▁is', '▁a', '▁sample', '▁s', 'ent', 'ence', '<EOS>']
['this is a sample sentence']


In [32]:
# define a tokenizer function that takes a sentence and returns its BPE ids
# this lets us plug the tokenizer into a torchtext Field
def bpe_tokenizer(sentence):
    encoded_ids = bpe_model.encode(sentence.lower(), output_type=youtokentome.OutputType.ID, bos=True, eos=True)
    return encoded_ids

In [33]:
bpe_tokenizer("This is a sample sentence")

[2, 547, 112, 67, 10274, 70, 416, 9110, 3]

## Read the dataset

In [34]:
tokenizer = data.get_tokenizer(bpe_tokenizer)
TEXT = data.Field(tokenize=tokenizer, batch_first=True, use_vocab=False, pad_token=pad_index)

In [35]:
fields = [(None, None), ("source", TEXT), ("target", TEXT)]

train_dataset, valid_dataset, test_dataset = data.TabularDataset.splits(path='.',
                                     train='train_ds.csv', validation='valid_ds.csv', test='test_ds.csv',
                                     format='csv', skip_header=True, fields=fields)

In [36]:
print(f"Number of training examples: {len(train_dataset)}")
print(f"Number of validation examples: {len(valid_dataset)}")
print(f"Number of testing examples: {len(test_dataset)}")

Number of training examples: 221916
Number of validation examples: 12328
Number of testing examples: 12330


In [37]:
print(vars(train_dataset.examples[0]))

{'source': [2, 82, 816, 318, 67, 528, 453, 67, 6360, 3], 'target': [2, 67, 257, 1036, 850, 95, 67, 528, 638, 82, 264, 5596, 3]}


In [38]:
# Building a vocabulary is not required, as the BPE tokenizer already converts each sentence into ids

# let's check the vocab size
print(f"Vocab size: {bpe_model.vocab_size()}")

Vocab size: 20000


## Data Iterators

In [39]:
BATCH_SIZE = 64

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_dataset, valid_dataset, test_dataset),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.source),
    device=device
)

In [40]:
temp = next(iter(train_iterator))
temp.source.shape, temp.target.shape

(torch.Size([64, 26]), torch.Size([64, 21]))
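
The source and target widths differ because each side of a batch is padded separately to its own longest sequence. The following is a small sketch of that right-padding on plain lists, assuming the `<PAD>` id of 0 printed earlier; `pad_batch` is a hypothetical helper, not a torchtext function.

```python
PAD_ID = 0  # matches the <PAD> id printed earlier


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every id list to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


batch = pad_batch([[2, 547, 112, 3], [2, 67, 3]])
print(batch)  # [[2, 547, 112, 3], [2, 67, 3, 0]]
```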

## Model

The model is a sequence-to-sequence (seq2seq) model: it takes in a sequence (the sentence) and outputs another sequence (the utterance).

![seq-to-seq](https://drive.google.com/uc?id=1TkFl1a68iOCfRaW6QGRBs5Jf4SHLqk9j)

In particular, we will use the Transformer model.

![transformer](https://drive.google.com/uc?id=1Bg_PrLjFXmmfXqktCSJI_UisyH121oqS)

### Multi-Head Self-Attention

In [41]:
class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads, dropout, pad_idx, device):
        super().__init__()

        assert d_model % n_heads == 0, "n_heads must be a factor of d_model"
        self.d_model = d_model
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

        self.scale = torch.sqrt(torch.FloatTensor([self.head_dim])).to(device)

        self.fc = nn.Linear(d_model, d_model)

        self.pad_idx = pad_idx
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, query, key, value, mask=None):
        # query => [batch_size, seq_len, d_model] 
        # key => [batch_size, seq_len, d_model]
        # value => [batch_size, seq_len, d_model]

        batch_size = query.shape[0]

        Q = self.q(query)
        K = self.k(key)
        V = self.v(value)
        # Q, K, V => [batch_size, seq_len, d_model]

        Q = Q.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.head_dim).permute(0, 2, 1, 3)
        # Q, K, V => [batch_size, n_heads, seq_len, head_dim]

        energy = torch.matmul(Q, K.permute(0, 1, 3, 2))
        energy = energy / self.scale
        # energy => [batch_size, n_heads, query_len, key_len]

        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)
        
        attention = torch.softmax(energy, dim=-1)
        # attention => [batch_size, n_heads, query_len, key_len]

        weighted = torch.matmul(attention, V)
        # weighted => [batch_size, n_heads, query_len, head_dim]

        weighted = weighted.permute(0, 2, 1, 3).contiguous()
        # weighted => [batch_size, query_len, n_heads, head_dim]

        x = weighted.view(batch_size, -1, self.d_model)
        # x => [batch_size, query_len, d_model]

        x = self.fc(x)
        # x => [batch_size, query_len, d_model]
        # attention => [batch_size, n_heads, query_len, key_len]

        return x, attention
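
The core of the class above is scaled dot-product attention. As a rough illustration of what a single head computes, here is a sketch on nested Python lists, without the learned projections, batching, or multiple heads; the function names and the tiny inputs are assumptions made for this example.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention.

    Q: [query_len][d], K: [key_len][d], V: [key_len][d_v].
    mask[i][j] == 0 hides key j from query i.
    """
    d = len(Q[0])
    out, weights = [], []
    for i, q in enumerate(Q):
        # energy of query i against every key, scaled by sqrt(d)
        energy = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if mask is not None:
            energy = [e if mask[i][j] else -1e10 for j, e in enumerate(energy)]
        w = softmax(energy)
        weights.append(w)
        # weighted sum of the value vectors
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))])
    return out, weights


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
mask = [[1, 1, 0], [1, 1, 0]]  # the third key is treated as padding
out, weights = attention(Q, K, V, mask)
print(weights)
```

Each row of `weights` sums to 1, and the masked key receives zero weight, which is the role `masked_fill(mask == 0, -1e10)` plays in the PyTorch version.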


### Feed Forward 

In [42]:
class PositionWiseFeedForward(nn.Module):
    def __init__(self, d_model, pff_dim, dropout):
        super().__init__()

        self.fc1 = nn.Linear(d_model, pff_dim)
        self.fc2 = nn.Linear(pff_dim, d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input):
        # input => [batch_size, seq_len, d_model]

        x = self.dropout(torch.relu(self.fc1(input)))
        # x => [batch_size, seq_len, pff_dim]

        x = self.fc2(x)
        # x => [batch_size, seq_len, d_model]

        return x


### Encoder Layer

In [43]:
class EncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, pff_dim, dropout, pad_idx, device):
        super().__init__()

        self.self_attention = SelfAttention(d_model, n_heads, dropout, pad_idx, device)
        self.pff = PositionWiseFeedForward(d_model, pff_dim, dropout)

        self.self_attn_layer_norm = nn.LayerNorm(d_model)
        self.pff_layer_norm = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # src => [batch_size, src_len, d_model]

        # self attention on src
        _src, _ = self.self_attention(src, src, src, src_mask)
        # _src => [batch_size, src_len, d_model]

        # residual connection and layer normalization
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        # src => [batch_size, src_len, d_model]

        # position wise feed forward
        _src = self.pff(src)
        # _src => [batch_size, src_len, d_model]

        # residual connection and layer normalization
        src = self.pff_layer_norm(src + self.dropout(_src))
        # src => [batch_size, src_len, d_model]

        return src

### Encoder

In [44]:
class Encoder(nn.Module):
    def __init__(self, input_dim, d_model, n_layers, n_heads, pff_dim, dropout, pad_idx, device, max_len=500):
        super().__init__()

        self.n_layers = n_layers
        self.device = device
        self.word_embedding = nn.Embedding(input_dim, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)
        self.layers = nn.ModuleList([EncoderLayer(d_model, n_heads, pff_dim, dropout, pad_idx, device) for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([d_model])).to(device)
    
    def forward(self, src, src_mask=None):
        # src => [batch_size, src_len]

        batch_size = src.shape[0]
        src_len = src.shape[1]

        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # pos => [batch_size, src_len]

        word_embed = self.word_embedding(src)
        word_embed = word_embed * self.scale
        # word_embed => [batch_size, src_len, d_model]

        pos_embed = self.pos_embedding(pos)
        # pos_embed => [batch_size, src_len, d_model]

        src = self.dropout(word_embed + pos_embed)
        # src => [batch_size, src_len, d_model]

        for layer in self.layers:
            src = layer(src, src_mask)
        
        # src => [batch_size, src_len, d_model]
        return src


### Decoder Layer

In [45]:
class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, pff_dim, dropout, pad_idx, device):
        super().__init__()

        self.self_attention = SelfAttention(d_model, n_heads, dropout, pad_idx, device)
        self.enc_attention = SelfAttention(d_model, n_heads, dropout, pad_idx, device)
        self.pff = PositionWiseFeedForward(d_model, pff_dim, dropout)

        self.self_attn_layer_norm = nn.LayerNorm(d_model)
        self.enc_attn_layer_norm = nn.LayerNorm(d_model)
        self.pff_layer_norm = nn.LayerNorm(d_model)

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, trg, enc_src, trg_mask, src_mask):
        
        # self attention
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        # _trg => [batch_size, trg_len, d_model]

        # residual connection and layer normalization
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
        # trg => [batch_size, trg_len, d_model]

        # enc_attention
        _trg, attention = self.enc_attention(trg, enc_src, enc_src, src_mask)
        # _trg => [batch_size, trg_len, d_model]
        # attention => [batch_size, n_heads, trg_len, src_len]

        # residual connection and layer normalization
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
        # trg => [batch_size, trg_len, d_model]

        # positionwise feed forward
        _trg = self.pff(trg)
        # _trg => [batch_size, trg_len, d_model]

        # residual connection and layer normalization
        trg = self.pff_layer_norm(trg + self.dropout(_trg))
        # trg => [batch_size, trg_len, d_model]

        return trg, attention


### Decoder

In [46]:
class Decoder(nn.Module):
    def __init__(self, output_dim, d_model, n_layers, n_heads, pff_dim, dropout, pad_idx, device, max_len=500):
        super().__init__()
        
        self.device = device
        self.word_embedding = nn.Embedding(output_dim, d_model)
        self.pos_embedding = nn.Embedding(max_len, d_model)

        self.layers = nn.ModuleList([DecoderLayer(d_model, n_heads, pff_dim, dropout, pad_idx, device) for _ in range(n_layers)])
        self.dropout = nn.Dropout(dropout)

        self.fc_out = nn.Linear(d_model, output_dim)
        self.scale = torch.sqrt(torch.FloatTensor([d_model])).to(device)
    
    def forward(self, trg, enc_src, trg_mask=None, src_mask=None):

        batch_size = trg.shape[0]
        trg_len = trg.shape[1]
        
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # pos => [batch_size, trg_len]

        word_embedding = self.word_embedding(trg)
        word_embedding = word_embedding * self.scale
        # word_embedding => [batch_size, trg_len, d_model]

        pos_embedding = self.pos_embedding(pos)
        # pos_embedding => [batch_size, trg_len, d_model]

        trg = self.dropout(word_embedding + pos_embedding)
        # trg => [batch_size, trg_len, d_model]

        for layer in self.layers:
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)

        logits = self.fc_out(trg)
        # logits => [batch_size, trg_len, output_dim]
        # attention => [batch_size, n_heads, trg_len, src_len]

        return logits, attention


### Transformer

In [47]:
class Transformer(nn.Module):
    def __init__(self, encoder, decoder, pad_idx, device):
        super().__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.pad_idx = pad_idx
        self.device = device
    
    def make_src_mask(self, src):
        src_mask = (src != self.pad_idx).unsqueeze(1).unsqueeze(2).to(self.device)
        # src_mask => [batch_size, 1, 1, src_len]
        
        return src_mask
    
    def make_trg_mask(self, trg):
        trg_pad_mask = (trg != self.pad_idx).unsqueeze(1).unsqueeze(2).to(self.device)
        trg_len = trg.shape[1]

        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=self.device)).bool()

        trg_mask = trg_pad_mask & trg_sub_mask
        # trg_mask => [batch_size, 1, trg_len, trg_len]

        return trg_mask
 
    def forward(self, src, trg):
        # src => [batch_size, src_len]
        # trg => [batch_size, trg_len]

        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)

        enc_src = self.encoder(src, src_mask)

        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)

        return output, attention
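
To make the target mask concrete, the sketch below rebuilds it for one sequence with plain lists: entry `[i][j]` is true when position `i` may attend to position `j`, that is, when `j` is not padding and `j <= i`. This mirrors the logic of `make_trg_mask` under the assumption that `<PAD>` has id 0; it is an illustration, not a replacement for the tensor version.

```python
PAD_ID = 0


def make_trg_mask(trg, pad_id=PAD_ID):
    """mask[i][j] is True when query position i may attend to key position j."""
    n = len(trg)
    not_pad = [tok != pad_id for tok in trg]
    # combine the padding mask (over keys) with the lower-triangular causal mask
    return [[not_pad[j] and j <= i for j in range(n)] for i in range(n)]


trg = [2, 67, 528, 0]  # <BOS>, two subwords, one <PAD>
trg_mask = make_trg_mask(trg)
for row in trg_mask:
    print([int(x) for x in row])
```

The last column is zero in every row because it corresponds to the padded position, while the lower-triangular pattern stops each position from looking at later tokens.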

## Configurations

In [50]:
PAD_IDX = pad_index
INPUT_DIM = bpe_model.vocab_size()
HID_DIM = 256
ENC_LAYERS = 3
DEC_LAYERS = 3
ENC_HEADS = 8
DEC_HEADS = 8
ENC_PF_DIM = 512
DEC_PF_DIM = 512
ENC_DROPOUT = 0.3
DEC_DROPOUT = 0.3

enc = Encoder(INPUT_DIM, 
              HID_DIM, 
              ENC_LAYERS, 
              ENC_HEADS, 
              ENC_PF_DIM, 
              ENC_DROPOUT,
              PAD_IDX,
              device)

dec = Decoder(INPUT_DIM, 
              HID_DIM, 
              DEC_LAYERS, 
              DEC_HEADS, 
              DEC_PF_DIM, 
              DEC_DROPOUT, 
              PAD_IDX,
              device)

In [51]:
model = Transformer(enc, dec, PAD_IDX, device).to(device)

### Initialize weights

In [52]:
def init_weights(m):
    if hasattr(m, 'weight') and m.weight.dim() > 1:
        nn.init.xavier_uniform_(m.weight.data)

model.apply(init_weights)

Transformer(
  (encoder): Encoder(
    (word_embedding): Embedding(20000, 256)
    (pos_embedding): Embedding(500, 256)
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attention): SelfAttention(
          (q): Linear(in_features=256, out_features=256, bias=True)
          (k): Linear(in_features=256, out_features=256, bias=True)
          (v): Linear(in_features=256, out_features=256, bias=True)
          (fc): Linear(in_features=256, out_features=256, bias=True)
          (dropout): Dropout(p=0.3, inplace=False)
        )
        (pff): PositionWiseFeedForward(
          (fc1): Linear(in_features=256, out_features=512, bias=True)
          (fc2): Linear(in_features=512, out_features=256, bias=True)
          (dropout): Dropout(p=0.3, inplace=False)
        )
        (self_attn_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (pff_layer_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True)
        (dropout): Dropout(p=0.3, inplace=False

In [53]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 19,589,664 trainable parameters
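
The parameter count can be reproduced by hand from the configuration above. The sketch below adds up the weights and biases of each module as defined in the classes earlier (the `scale` tensors are plain tensors, not parameters, so they are not counted); the helper names are made up for this example.

```python
def linear(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias


def layer_norm(d):
    return 2 * d  # gain + bias


VOCAB, D_MODEL, PFF_DIM, MAX_LEN, N_LAYERS = 20000, 256, 512, 500, 3

attention = 4 * linear(D_MODEL, D_MODEL)                    # q, k, v, fc
pff = linear(D_MODEL, PFF_DIM) + linear(PFF_DIM, D_MODEL)   # fc1, fc2
embeddings = VOCAB * D_MODEL + MAX_LEN * D_MODEL            # word + position tables

encoder_layer = attention + pff + 2 * layer_norm(D_MODEL)
decoder_layer = 2 * attention + pff + 3 * layer_norm(D_MODEL)

encoder = embeddings + N_LAYERS * encoder_layer
decoder = embeddings + N_LAYERS * decoder_layer + linear(D_MODEL, VOCAB)  # + fc_out

total = encoder + decoder
print(f"{total:,}")  # 19,589,664
```

Roughly three quarters of the parameters sit in the two word-embedding tables and the output projection, all of which scale with the vocabulary size.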


### Optimizer & Criterion

In [55]:
LEARNING_RATE = 0.0005

optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)
criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

### Train Loop

In [56]:
def train(model, iterator, criterion, optimizer, clip):
    model.train()

    epoch_loss = 0
    for i, batch in enumerate(iterator):
        src = batch.source
        trg = batch.target

        optimizer.zero_grad()

        output, _ = model(src, trg[:, :-1])
        # output => [batch_size, trg_len - 1, output_dim]
        # trg => [batch_size, trg_len]

        output_dim = output.shape[-1]

        output = output.contiguous().view(-1, output_dim)
        trg = trg[:, 1:].contiguous().view(-1)

        loss = criterion(output, trg)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)
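
The slicing in the loop implements teacher forcing: the decoder is fed the target without its last token and is trained to predict the target without its first token. A minimal sketch with a single made-up id sequence:

```python
target = [2, 67, 257, 1036, 3]  # <BOS> ... <EOS>, as produced by the BPE tokenizer

decoder_input = target[:-1]  # what the decoder sees
labels = target[1:]          # what it must predict at each position

for seen, expected in zip(decoder_input, labels):
    print(f"input {seen:>4} -> predict {expected}")
```

In a padded batch the dropped last column is `<PAD>` for the shorter sequences, and the `<PAD>` labels that remain are skipped by `ignore_index` in the loss.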


### Validation Loop

In [57]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            src = batch.source
            trg = batch.target

            output, _ = model(src, trg[:, :-1])
            # output => [batch_size, trg_len - 1, output_dim]

            output_dim = output.shape[-1]

            output = output.contiguous().view(-1, output_dim)
            trg = trg[:, 1:].contiguous().view(-1)

            loss = criterion(output, trg)

            epoch_loss += loss.item()
    return epoch_loss / len(iterator)

In [58]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = elapsed_time - (elapsed_mins * 60)
    return elapsed_mins, elapsed_secs

## Training

In [59]:
N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    start_time = time.time()

    train_loss = train(model, train_iterator, criterion, optimizer, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'model.pt')
    print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss): 7.3f}")
    print(f"\tValid Loss: {valid_loss:.3f} | Valid PPL: {math.exp(valid_loss): 7.3f}")

Epoch: 01 | Time: 4m 25.250166416168213s
	Train Loss: 3.563 | Train PPL:  35.271
	Valid Loss: 3.128 | Valid PPL:  22.820
Epoch: 02 | Time: 4m 24.61474061012268s
	Train Loss: 2.927 | Train PPL:  18.675
	Valid Loss: 2.956 | Valid PPL:  19.212
Epoch: 03 | Time: 4m 24.679898738861084s
	Train Loss: 2.757 | Train PPL:  15.750
	Valid Loss: 2.882 | Valid PPL:  17.856
Epoch: 04 | Time: 4m 25.98631000518799s
	Train Loss: 2.651 | Train PPL:  14.162
	Valid Loss: 2.837 | Valid PPL:  17.072
Epoch: 05 | Time: 4m 25.264737367630005s
	Train Loss: 2.574 | Train PPL:  13.114
	Valid Loss: 2.814 | Valid PPL:  16.676
Epoch: 06 | Time: 4m 25.38588285446167s
	Train Loss: 2.514 | Train PPL:  12.352
	Valid Loss: 2.799 | Valid PPL:  16.431
Epoch: 07 | Time: 4m 26.530824184417725s
	Train Loss: 2.466 | Train PPL:  11.781
	Valid Loss: 2.808 | Valid PPL:  16.584
Epoch: 08 | Time: 4m 26.672357320785522s
	Train Loss: 2.427 | Train PPL:  11.320
	Valid Loss: 2.803 | Valid PPL:  16.498
Epoch: 09 | Time: 4m 26.77908039093

In [60]:
# Load the trained model
model.load_state_dict(torch.load('model.pt'))

<All keys matched successfully>

## Testing

In [61]:
model.eval()
test_loss = evaluate(model, test_iterator, criterion)
print(f"\tTest Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss): 7.3f}")

	Test Loss: 2.738 | Test PPL:  15.451
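
Perplexity here is simply the exponential of the mean cross-entropy loss (in nats). A short sketch, using the rounded test loss printed above:

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


ppl = perplexity(2.738)
print(f"{ppl:.2f}")
```

The result differs slightly from the printed test perplexity because the notebook exponentiates the unrounded loss, while this sketch starts from the value rounded to three decimals.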


In [62]:
!ls -lah

total 366M
drwxr-xr-x 1 root root 4.0K Jun 28 15:52 .
drwxr-xr-x 1 root root 4.0K Jun 28 15:40 ..
-rw-r--r-- 1 root root  23M Jun 28 15:44 all_data.txt
drwxr-xr-x 2 root root 4.0K Jun 28 15:41 annotations
-rw-r--r-- 1 root root 242M Jul 10  2018 annotations_trainval2017.zip
-rw-r--r-- 1 root root 271K Jun 28 15:44 bpe.model
drwxr-xr-x 1 root root 4.0K Jun 25 17:02 .config
-rw-r--r-- 1 root root  75M Jun 28 16:27 model.pt
drwxr-xr-x 1 root root 4.0K Jun 17 16:18 sample_data
-rw-r--r-- 1 root root 1.4M Jun 28 15:44 test_ds.csv
-rw-r--r-- 1 root root  25M Jun 28 15:44 train_ds.csv
-rw-r--r-- 1 root root 1.4M Jun 28 15:44 valid_ds.csv


## Utterance generation with Greedy Search

In [63]:
def generate_utterance_greedy(sentence, bpe_model, model, device, max_len=50):
    model.eval()

    if isinstance(sentence, str):
        tokens = bpe_tokenizer(sentence)
    else:
        tokens = [int(token) for token in sentence]

    src_indexes = tokens
 
    # convert to tensor format
    # since the inference is done on single sentence, batch size is 1
    src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
    # src_tensor => [1, seq_len]

    src_mask = model.make_src_mask(src_tensor)

    with torch.no_grad():
        enc_src = model.encoder(src_tensor, src_mask)
    
    # the starting input to decoder is always <bos>
    trg_indexes = [bpe_model.subword_to_id('<BOS>')]

    for i in range(max_len):
        trg_tensor = torch.LongTensor(trg_indexes).unsqueeze(0).to(device)
        trg_mask = model.make_trg_mask(trg_tensor)

        with torch.no_grad():
            output, attention = model.decoder(trg_tensor, enc_src, trg_mask, src_mask)
        
        pred_token = output.argmax(2)[:, -1].item()

        trg_indexes.append(pred_token)

        # stop decoding once <eos> is predicted
        if pred_token == bpe_model.subword_to_id('<EOS>'):
            break
    
    # convert the predicted token ids to words
    trg_tokens = bpe_model.decode(trg_indexes, ignore_ids=[2,3])[0] # ignore <bos>, <eos>

    return tokens, trg_tokens, attention


## Utterance Generation with Beam Search

One way to mitigate repetition in the generated utterances is to use beam search. Choosing the top-scored word at each step (greedy decoding) can lead to a sub-optimal sequence, whereas keeping a lower-scored word at an early step may lead to a better sequence overall.

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

![beam](https://drive.google.com/uc?id=1lzTlU3Ui4V_qwc3bDEXkPK41vOhMOAI9)
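
The difference between the two strategies can be seen on a toy problem. The sketch below runs beam search over a hand-written table of next-token probabilities; with `k=1` it reduces to greedy search. The table, the token names, and the `beam_search` helper are all made up for this illustration, and the sketch leaves out the length normalization used in the notebook's implementation.

```python
import math

# Hypothetical toy "model": next-token probabilities depend only on the previous token.
NEXT = {
    "<BOS>": {"a": 0.6, "the": 0.4},
    "a":     {"cat": 0.5, "dog": 0.5},
    "the":   {"cat": 0.9, "dog": 0.1},
    "cat":   {"<EOS>": 1.0},
    "dog":   {"<EOS>": 1.0},
}


def beam_search(k, max_len=5):
    """Return the best completed hypothesis and its log-probability."""
    beams = [(["<BOS>"], 0.0)]
    completed = []
    for _ in range(max_len):
        # Expand every live hypothesis by every possible next token.
        candidates = [
            (tokens + [tok], score + math.log(p))
            for tokens, score in beams
            for tok, p in NEXT[tokens[-1]].items()
        ]
        # Keep only the k highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            # Hypotheses that reached <EOS> are set aside; the rest stay live.
            (completed if tokens[-1] == "<EOS>" else beams).append((tokens, score))
        if not beams:
            break
    return max(completed, key=lambda c: c[1])


greedy = beam_search(k=1)
beam = beam_search(k=2)
print(greedy[0], round(math.exp(greedy[1]), 2))
print(beam[0], round(math.exp(beam[1]), 2))
```

Greedy search commits to `a` because it is the likelier first word and ends with a sequence of probability 0.3, while a beam of size 2 keeps `the` alive and finds the sequence with probability 0.36.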

In [64]:
def generate_utterance_beam(sentence, bpe_model, model, device, max_len=50, beam_size=10, length_norm_coefficient=0.6):
    with torch.no_grad():
        k = beam_size

        # minimum number of hypotheses to complete
        n_completed_hypotheses = min(k, 10)

        # vocab size
        vocab_size = bpe_model.vocab_size()

        if isinstance(sentence, str):
            tokens = bpe_tokenizer(sentence)
        else:
            tokens = [int(token) for token in sentence]

        src_indexes = tokens
        
        # convert to tensor format
        # since the inference is done on single sentence, batch size is 1
        src_tensor = torch.LongTensor(src_indexes).unsqueeze(0).to(device)
        # src_tensor => [1, seq_len]

        # encode
        enc_src = model.encoder(src_tensor)
        # enc_src => [1, src_len, d_model]

        # Our hypothesis to begin with is just <bos>
        hypotheses = torch.LongTensor([[bpe_model.subword_to_id('<BOS>')]]).to(device)  # (1, 1)

        # Tensor to store hypotheses' scores; now it's just 0
        hypotheses_scores = torch.zeros(1).to(device)  # (1)

        # Lists to store completed hypotheses and their scores
        completed_hypotheses = list()
        completed_hypotheses_scores = list()

        # Start decoding
        step = 1

        # Assume "s" is the number of incomplete hypotheses currently in the bag; a number less than or equal to "k"
        # At this point, s is 1, because we only have 1 hypothesis to work with, i.e. "<BOS>"
        while True:
            s = hypotheses.size(0)
            trg_mask = model.make_trg_mask(hypotheses)
            decoder_sequences, _ = model.decoder(hypotheses, enc_src.repeat(s, 1, 1), trg_mask)
            # decoder_sequences => [s, step_size, vocab_size]

            # Scores at this step
            scores = decoder_sequences[:, -1, :]  # (s, vocab_size)
            scores = F.log_softmax(scores, dim=-1)  # (s, vocab_size)

            # Add hypotheses' scores from last step to scores at this step to get scores for all possible new hypotheses
            scores = hypotheses_scores.unsqueeze(1) + scores  # (s, vocab_size)

            # Unroll and find top k scores, and their unrolled indices
            top_k_hypotheses_scores, unrolled_indices = scores.view(-1).topk(k, 0, True, True)  # (k)

            # Convert unrolled indices to actual indices of the scores tensor which yielded the best scores
            prev_word_indices = unrolled_indices // vocab_size  # (k)
            next_word_indices = unrolled_indices % vocab_size  # (k)

            # Construct the new top k hypotheses from these indices
            top_k_hypotheses = torch.cat([hypotheses[prev_word_indices], next_word_indices.unsqueeze(1)],
                                         dim=1)  # (k, step + 1)
            
            # Which of these new hypotheses are complete (reached <eos>)?
            complete = next_word_indices == bpe_model.subword_to_id('<EOS>')  # (k), bool

            # Set aside completed hypotheses and their scores normalized by their lengths
            # For the length normalization formula, see
            # "Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation"
            completed_hypotheses.extend(top_k_hypotheses[complete].tolist())
            norm = math.pow(((5 + step) / (5 + 1)), length_norm_coefficient)
            completed_hypotheses_scores.extend((top_k_hypotheses_scores[complete] / norm).tolist())

            # Stop if we have completed enough hypotheses
            if len(completed_hypotheses) >= n_completed_hypotheses:
                break

            # Else, continue with incomplete hypotheses
            hypotheses = top_k_hypotheses[~complete]  # (s, step + 1)
            hypotheses_scores = top_k_hypotheses_scores[~complete]  # (s)

            # Stop once the hypotheses reach the maximum length (max_len was previously unused)
            if step >= max_len:
                break
            step += 1
        
        # If there is not a single completed hypothesis, use partial hypotheses
        if len(completed_hypotheses) == 0:
            completed_hypotheses = hypotheses.tolist()
            completed_hypotheses_scores = hypotheses_scores.tolist()
        
        # Decode the hypotheses
        all_hypotheses = list()
        for i, hypo in enumerate(completed_hypotheses):
            h = bpe_model.decode(hypo, ignore_ids=[2, 3])[0]    # ignore <bos>, <eos>
            all_hypotheses.append({"hypothesis": h, "score": completed_hypotheses_scores[i]})
        
        # Find the best scoring completed hypothesis
        i = completed_hypotheses_scores.index(max(completed_hypotheses_scores))
        best_hypothesis = all_hypotheses[i]["hypothesis"]

        return tokens, best_hypothesis, all_hypotheses
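
The length normalization applied to completed hypotheses follows the length penalty from the GNMT paper cited in the code comment: the summed log-probability is divided by `((5 + length) / (5 + 1)) ** alpha`, where the function above uses `step` as the length and `length_norm_coefficient` as `alpha`. The sketch below isolates that computation. The helper names `length_penalty` and `normalized_score` and the example scores are hypothetical, chosen only for illustration.

```python
import math


def length_penalty(length, alpha=0.6):
    """GNMT-style length penalty: ((5 + length) / (5 + 1)) ** alpha."""
    return math.pow((5 + length) / (5 + 1), alpha)


def normalized_score(log_prob, length, alpha=0.6):
    """Divide a summed log-probability by the length penalty."""
    return log_prob / length_penalty(length, alpha)


# Two hypothetical hypotheses with the same average log-probability per token (-0.5)
short_raw, short_len = -2.0, 4
long_raw, long_len = -5.0, 10

short_norm = normalized_score(short_raw, short_len)
long_norm = normalized_score(long_raw, long_len)

print(f"raw:        short={short_raw:.3f}  long={long_raw:.3f}")
print(f"normalized: short={short_norm:.3f}  long={long_norm:.3f}")
```

Without normalization, longer hypotheses are penalized simply because they sum more negative log-probabilities. Dividing by the penalty narrows that gap. With `alpha=0` the penalty is always 1 and the raw score is used unchanged, while larger values of `alpha` favor longer hypotheses more strongly.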

## Inference

In [67]:
example_idx = 55

src = vars(valid_dataset.examples[example_idx])['source']
trg = vars(valid_dataset.examples[example_idx])['target']

print(f'src = {bpe_model.decode(src, ignore_ids=[2, 3])[0]}')
print(f'trg = {bpe_model.decode(trg, ignore_ids=[2,3])[0]}\n')


_, utterance, attention = generate_utterance_greedy(src, bpe_model, model, device)
_, best_one, all_utterances = generate_utterance_beam(src, bpe_model, model, device)

print(f'greedy generated utterance = {utterance}\n')
print("All beam generated utterances:")
print("------------------------------")
for i in all_utterances:
    print(f'{i["hypothesis"]}')


src = an athlete swinging a tennis racket at a tennis ball
trg = a man swinging a tennis racket backhanded at a tennis ball.

greedy generated utterance = a tennis player is hitting a ball with a racket

All beam generated utterances:
------------------------------
a tennis player hitting a ball with a racquet.
a tennis player hitting a ball with his racket.
a tennis player hitting a ball with a racket.
a tennis player hitting a ball with a racket
a tennis player hitting a tennis ball with his racket.
a tennis player hitting a tennis ball with a racquet.
a man holding a tennis racquet on a tennis court.
a tennis player hitting a tennis ball with a racket.
a tennis player hitting a tennis ball with a racket
a tennis player is hitting a tennis ball with a racquet.
a tennis player is hitting a tennis ball with a racket.
a tennis player is hitting a tennis ball with a racket


In [69]:
src = "A car is parked on the side of a road"

tokens, utterance, attention = generate_utterance_greedy(src, bpe_model, model, device)
_, _, all_utterances = generate_utterance_beam(src, bpe_model, model, device)
print(f'src = {src}\n')
print(f'Greedy generated utterance = {utterance}\n')
print("Beam generated utterances:")
print("------------------------------")
for i in all_utterances:
    print(f'{i["hypothesis"]}')

src = A car is parked on the side of a road

Greedy generated utterance = a car driving down a street with a lot of traffic.

Beam generated utterances:
------------------------------
a car parked on the side of a road.
a car parked on the side of the road.
a car parked on the side of a street.
a truck driving down a street next to houses.
a car driving down a street next to a building.
a truck driving down a street next to a building.
a car driving down a street next to a parking meter.
a car driving down a street next to a parking lot.
a car driving down a street next to a parking lot
a car driving down a street next to a parking meter


In [71]:
src = "A plane is flying high in the sky"

tokens, utterance, attention = generate_utterance_greedy(src, bpe_model, model, device)
_, _, all_utterances = generate_utterance_beam(src, bpe_model, model, device)
print(f'src = {src}\n')
print(f'Greedy generated utterance = {utterance}\n')
print("Beam generated utterances:")
print("------------------------------")
for i in all_utterances:
    print(f'{i["hypothesis"]}')


src = A plane is flying high in the sky

Greedy generated utterance = a plane flying in the sky with a sky background

Beam generated utterances:
------------------------------
a plane that is flying in the sky.
a plane that is flying in the air.
a plane is flying high in the sky.
a plane that is flying in the sky
a plane is flying high in the sky
a plane flying in the sky above a mountain.
a plane flying in the air above a mountain.
a plane flying in the sky with a sky background
a plane flying high in the sky above a mountain.
a plane flying high in the sky above a city.
a plane flying high in the sky above a field.


## Further Improvements

This concludes the utterance generation series. However, the following enhancements could still be made:

*   Add evaluation metrics such as BLEU and ROUGE
*   Train jointly or fine-tune on the Quora dataset, so that the model can generate utterances for both general sentences and questions
*   Explore the [Paraphrase Dataset](http://paraphrase.org/) and train the model on that data
*   Use pretrained models such as GPT-2, T5, or BART

*Note: Raise an issue [here](https://github.com/graviraja/100-Days-of-NLP/issues) if you find any problems or want to suggest modifications.*
