<a href="https://colab.research.google.com/github/graviraja/100-Days-of-NLP/blob/applications%2Fgeneration/applications/generation/Generating%20Names.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating Names with LSTM

Given a starting character, the model generates a name that begins with that character.

We’ll train a character-level LSTM language model. That is, we’ll give the LSTM a large collection of names and ask it to model the probability distribution of the next character given the sequence of previous characters. This then lets us generate new names one character at a time.
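
In symbols, a character-level language model factorizes the probability of a name with the chain rule. The notation below ($c_t$ for the character at position $t$, $T$ for the length of the name) is introduced here only for illustration:

$$
P(c_1, \dots, c_T) = \prod_{t=1}^{T} P(c_t \mid c_1, \dots, c_{t-1})
$$

The LSTM is trained to estimate each factor on the right-hand side, which is why sampling one character at a time produces a complete name.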

![arch](https://drive.google.com/uc?id=1G8Oh6WUeShjXSEWfMv45sCxsaio-fNnM)

As a working example, suppose our vocabulary consists of the letters of the English alphabet and we want to train an RNN on the training sequence "Jennie". This single sequence is in fact a source of 5 separate training examples:
1. `e` should be likely given the context `J`,
2. `n` should be likely given the context `Je`,
3. `n` should also be likely given the context `Jen`,
4. `i` should be likely given the context `Jenn`, and finally
5. `e` should be likely given the context `Jenni`.
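
The five examples listed above can be enumerated mechanically. The following is a minimal sketch in plain Python; the helper name `next_char_examples` is hypothetical and not part of the notebook.

```python
def next_char_examples(name):
    """Return (context, next_char) pairs for a single training sequence."""
    return [(name[:i], name[i]) for i in range(1, len(name))]

pairs = next_char_examples("Jennie")
for context, target in pairs:
    print(f"context={context!r} -> target={target!r}")
```

Each pair corresponds to one item of the list: the context is everything seen so far and the target is the character that follows it.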

#### Resources

- [Unreasonable effectiveness of RNN](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)

- [Language Modelling - ChunML](https://github.com/ChunML/NLP/blob/master/text_generation/)

## Dataset

The dataset I used is [US Baby Names](https://www.kaggle.com/kaggle/us-baby-names?select=NationalNames.csv), available on Kaggle.

I downloaded the dataset and stored it in Google Drive for ease of use.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
datapath = './drive/My\ Drive/NationalNames.csv.zip'
!unzip {datapath}

Archive:  ./drive/My Drive/NationalNames.csv.zip
  inflating: NationalNames.csv       


In [3]:
!ls

drive  NationalNames.csv  sample_data


## Initial Setup

In [0]:
import time
import string
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset

import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv('NationalNames.csv')
df.head()

|   | Id | Name | Year | Gender | Count |
|---|---|---|---|---|---|
| 0 | 1 | Mary | 1880 | F | 7065 |
| 1 | 2 | Anna | 1880 | F | 2604 |
| 2 | 3 | Emma | 1880 | F | 2003 |
| 3 | 4 | Elizabeth | 1880 | F | 1939 |
| 4 | 5 | Minnie | 1880 | F | 1746 |


In [6]:
len(df)

1825433

Since there are nearly 2 million rows, I keep only the rows whose `Count` is greater than 1000, i.e. names given to more than 1000 babies of one gender in a single year.

In [0]:
df2 = df[df['Count'] > 1000]

In [8]:
len(df2)

50464

In [9]:
## Dropping other columns as they are not useful
df2 = df2[['Name']]
df2.head()

|   | Name |
|---|---|
| 0 | Mary |
| 1 | Anna |
| 2 | Emma |
| 3 | Elizabeth |
| 4 | Minnie |


In [0]:
X_train = df2['Name'].values

In [11]:
print(f"Number of training examples: {len(X_train)}")

Number of training examples: 50464


In [12]:
print(f"input = {X_train[0]}")

input = Mary


## Vocabulary

Names differ in length, so the shorter ones in a batch have to be padded. A `<pad>` token is used for that.

An `<eos>` token marks the end of a name.
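
To make the role of the two special tokens concrete, here is a small sketch in plain Python. It assumes the same id layout as the vocabulary built in this notebook (`<pad>` = 0, `<eos>` = 1, letters from 2). The helpers `encode_with_eos` and `pad_to` are hypothetical names, and appending `<eos>` to every encoded name is a simplification: the notebook adds it to the target sequence only.

```python
import string

# Same layout as the notebook's vocabulary: <pad>=0, <eos>=1, letters start at 2.
char2id = {"<pad>": 0, "<eos>": 1}
for i, ch in enumerate(string.ascii_letters):
    char2id[ch] = i + 2

def encode_with_eos(name):
    """Map a name to ids and append the <eos> id."""
    return [char2id[ch] for ch in name] + [char2id["<eos>"]]

def pad_to(ids, length):
    """Right-pad a list of ids with the <pad> id up to `length`."""
    return ids + [char2id["<pad>"]] * (length - len(ids))

encoded = [encode_with_eos(name) for name in ["Mary", "Elizabeth"]]
max_len = max(len(ids) for ids in encoded)
padded = [pad_to(ids, max_len) for ids in encoded]
for row in padded:
    print(row)
```

The shorter name ends with its `<eos>` id followed by zeros, so every row has the same length and can be stacked into one tensor.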

In [0]:
def build_vocab():
    all_letters = string.ascii_letters
    char2id = {}
    char2id['<pad>'] = 0
    char2id['<eos>'] = 1
    for i, char in enumerate(all_letters):
        char2id[char] = i + 2
    return char2id

In [0]:
char2id = build_vocab()
id2char = {id: char for char, id in char2id.items()}

In [15]:
char2id

{'<eos>': 1,
 '<pad>': 0,
 'A': 28,
 'B': 29,
 'C': 30,
 'D': 31,
 'E': 32,
 'F': 33,
 'G': 34,
 'H': 35,
 'I': 36,
 'J': 37,
 'K': 38,
 'L': 39,
 'M': 40,
 'N': 41,
 'O': 42,
 'P': 43,
 'Q': 44,
 'R': 45,
 'S': 46,
 'T': 47,
 'U': 48,
 'V': 49,
 'W': 50,
 'X': 51,
 'Y': 52,
 'Z': 53,
 'a': 2,
 'b': 3,
 'c': 4,
 'd': 5,
 'e': 6,
 'f': 7,
 'g': 8,
 'h': 9,
 'i': 10,
 'j': 11,
 'k': 12,
 'l': 13,
 'm': 14,
 'n': 15,
 'o': 16,
 'p': 17,
 'q': 18,
 'r': 19,
 's': 20,
 't': 21,
 'u': 22,
 'v': 23,
 'w': 24,
 'x': 25,
 'y': 26,
 'z': 27}

## Dataset (For Loader)

As discussed, for each input character the target is the next character in the sequence. The dataset should therefore return both an `input_seq` and a `target_seq`.

A custom `Dataset` usually defines three methods:
- **`__init__`**: Load the data and record its length
- **`__getitem__`**: Return the requested datapoint. *Any per-example preprocessing can be done here*
- **`__len__`**: Return the length of the dataset

Since I am generating names, I will create a **`NamesDataset`**.

In [0]:
class NamesDataset(Dataset):
    def __init__(self, names, char2id):  
        self.input = names
        self.length = len(names)
        self.char2id = char2id
    
    def __getitem__(self, index):
        input_data = self.input[index]
        input_seq, target_seq = self.preprocess(input_data)
        return input_seq, target_seq

    def __len__(self):
        return self.length
    
    def preprocess(self, input):
        # convert the character input to numerical input by using vocabulary
        input_seq = [self.char2id[input[li]] for li in range(len(input))]

        # create the target seq by skipping the first element in the input and adding <eos> at the end
        target_seq = [self.char2id[input[li]] for li in range(1, len(input))] + [self.char2id['<eos>']]
        return torch.Tensor(input_seq), torch.Tensor(target_seq)

In [0]:
# let's check the dataset
temp_data = NamesDataset(X_train, char2id)

In [18]:
temp_data[1]

(tensor([28., 15., 15.,  2.]), tensor([15., 15.,  2.,  1.]))

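The names in a batch have different lengths, so each batch must be padded to the length of its longest name. To do that, we define a `collate_fn` that sorts the batch by input length (descending, the order `pack_padded_sequence` expects, although packing is not used in this notebook) and pads every sequence with zeros, the `<pad>` index.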
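
As an aside, the padding step can also be done with PyTorch's `pad_sequence` utility. The sketch below is an alternative to the hand-written padding, not the notebook's method, and the toy id sequences are made up for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Three toy id sequences of different lengths (arbitrary non-zero ids).
seqs = [torch.tensor([28, 15, 15, 2]),
        torch.tensor([40, 2, 19, 26, 10, 6]),
        torch.tensor([37, 16])]

# Sort by length, longest first.
seqs = sorted(seqs, key=len, reverse=True)

# Pad with 0, the <pad> index, into a [batch_size, max_len] tensor.
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
print(padded.shape)
print(padded)
```

Every row is padded to the length of the longest sequence in the batch, which is the same result the manual loop produces.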

In [0]:
def collate_fn(data):
    def merge(sequences):
        lengths = [len(seq) for seq in sequences]
        padded_seqs = torch.zeros(len(sequences), max(lengths)).long()
        for i, seq in enumerate(sequences):
            end = lengths[i]
            padded_seqs[i, :end] = seq[:end]
        return padded_seqs, lengths

    # sort a list by sequence length (descending order) to use pack_padded_sequence
    data.sort(key=lambda x: len(x[0]), reverse=True)

    # separate source and target sequences
    src_seqs, trg_seqs = zip(*data)

    # merge sequences (from tuple of 1D tensor to 2D tensor)
    src_seqs, src_lengths = merge(src_seqs)
    trg_seqs, trg_lengths = merge(trg_seqs)

    return src_seqs, trg_seqs

In [0]:
BATCH_SIZE = 64

## Data Loader

The data loader makes the dataset iterable, yielding one batch of `batch_size` examples per iteration. We have to pass the **`collate_fn`** when creating the data loader.

In [0]:
def get_loader(data, char2id, train=True, batch_size=BATCH_SIZE):
    # build a custom dataset
    dataset = NamesDataset(data, char2id)

    # data loader for custom dataset
    # this will return (src_seqs, trg_seqs) for each iteration
    # please see collate_fn for details
    if train:
        shuffle=True
    else:
        shuffle=False

    data_loader = torch.utils.data.DataLoader(dataset=dataset,
                                              batch_size=batch_size,
                                              shuffle=shuffle,
                                              collate_fn=collate_fn)

    return data_loader

In [0]:
train_data_loader = get_loader(X_train, char2id, train=True)

In [0]:
data_iter = iter(train_data_loader)
src_seqs, trg_seqs = next(data_iter)

In [24]:
src_seqs.shape, trg_seqs.shape

(torch.Size([64, 9]), torch.Size([64, 9]))

In [25]:
src_seqs[0], trg_seqs[0]

(tensor([37, 16, 20,  6, 17,  9, 10, 15,  6]),
 tensor([16, 20,  6, 17,  9, 10, 15,  6,  1]))
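
To check that source and target line up, the ids printed above can be mapped back to characters. This sketch rebuilds the id-to-character mapping with the same layout as the notebook's vocabulary; the `decode` helper is a hypothetical name.

```python
import string

# Rebuild the id -> character mapping (<pad>=0, <eos>=1, letters start at 2).
id2char = {0: "<pad>", 1: "<eos>"}
for i, ch in enumerate(string.ascii_letters):
    id2char[i + 2] = ch

def decode(ids):
    """Turn a list of ids back into a readable string."""
    return "".join(id2char[i] for i in ids)

src = [37, 16, 20, 6, 17, 9, 10, 15, 6]
trg = [16, 20, 6, 17, 9, 10, 15, 6, 1]
print(decode(src))  # Josephine
print(decode(trg))  # osephine<eos>
```

The target is the source shifted left by one character, with `<eos>` appended.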

## Model

The model we will be using is a character-level LSTM, i.e. it takes characters as input and predicts a probability distribution over the next character.

![arch](https://drive.google.com/uc?id=1G8Oh6WUeShjXSEWfMv45sCxsaio-fNnM)

In [0]:
class RNN(nn.Module):
    def __init__(self, input_size, emb_size, hidden_size, output_size, dropout, pad_idx):
        super().__init__()
        
        self.hidden_size = hidden_size
        self.embedding = nn.Embedding(input_size, emb_size, padding_idx=pad_idx)
        self.rnn = nn.LSTM(emb_size, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, output_size)

    
    def forward(self, input, prev_state):
        # input => [batch_size, seq_len]
        # prev_state => (h, c)
        #            => h - [1, batch_size, hid_dim]
        #            => c - [1, batch_size, hid_dim]

        embedded = self.embedding(input)
        # embedded => [batch_size, seq_len, emb_size]

        output, state = self.rnn(embedded, prev_state)
        # output => [batch_size, seq_len, hid_dim]
        # state => (h, c)
        #           h => (n_layers, batch_size, hid_dim)
        #           c => (n_layers, batch_size, hid_dim)
    
        logits = self.out(self.dropout(output))
        # logits => [batch_size, seq_len, output_size]

        return logits, state
    
    def init_hidden(self, batch_size):
        # initial hidden state

        return (torch.zeros(1, batch_size, self.hidden_size),
                torch.zeros(1, batch_size, self.hidden_size))
        

In [46]:
INPUT_DIM = len(char2id)
EMBEDDING_DIM = 10
HIDDEN_DIM = 256
OUTPUT_DIM = len(char2id)
DROPOUT = 0.5
PAD_IDX = char2id['<pad>']

model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX)
model

RNN(
  (embedding): Embedding(54, 10, padding_idx=0)
  (rnn): LSTM(10, 256, batch_first=True)
  (dropout): Dropout(p=0.5, inplace=False)
  (out): Linear(in_features=256, out_features=54, bias=True)
)

In [47]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model)} trainable parameters')

The model has 288850 trainable parameters
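
The count can be reproduced by hand from the layer sizes printed above. The arithmetic below assumes the standard PyTorch parameter layout for a single-layer `nn.LSTM` (four gates, each with input weights, recurrent weights and two bias vectors).

```python
vocab_size, emb_size, hidden_size = 54, 10, 256

# Embedding table: one emb_size vector per vocabulary entry.
embedding = vocab_size * emb_size

# Single-layer LSTM: 4 gates, each with input weights, recurrent weights and two biases.
lstm = 4 * (emb_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# Output layer: weight matrix plus bias.
linear = hidden_size * vocab_size + vocab_size

total = embedding + lstm + linear
print(embedding, lstm, linear, total)  # 540 274432 13878 288850
```

Almost all of the parameters sit in the LSTM's recurrent weights, which is typical when the hidden size is much larger than the vocabulary.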


## Optimizer & Criterion

We use the **`Adam`** optimizer, which adapts the step size for each parameter and often converges faster than plain `SGD` with little tuning.

**`CrossEntropyLoss`**: This criterion combines `nn.LogSoftmax()` and `nn.NLLLoss()` in a single class and is suited to classification problems with C classes. Here each character position is a classification over the vocabulary. We pass `ignore_index=PAD_IDX` so that `<pad>` positions do not contribute to the loss.
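
The two claims about the criterion can be checked on a toy batch. The sketch below uses made-up shapes and random logits; it compares `CrossEntropyLoss` with `LogSoftmax` followed by `NLLLoss`, and with a manual average over the non-padded positions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, n_classes, pad_idx = 2, 3, 5, 0

logits = torch.randn(batch_size, seq_len, n_classes)
targets = torch.tensor([[2, 4, 1],
                        [3, 1, 0]])  # last position of the second row is padding

# CrossEntropyLoss expects the class dimension second: [batch, classes, seq_len].
ce = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss_ce = ce(logits.transpose(1, 2), targets)

# The same value from LogSoftmax followed by NLLLoss.
log_probs = nn.LogSoftmax(dim=1)(logits.transpose(1, 2))
loss_nll = nn.NLLLoss(ignore_index=pad_idx)(log_probs, targets)

# Manual check: average the negative log-probabilities of non-pad targets only.
mask = targets != pad_idx
picked = torch.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
loss_manual = -picked[mask].mean()

print(loss_ce.item(), loss_nll.item(), loss_manual.item())
```

All three numbers agree, and the padded position is excluded from the average rather than counted as a zero.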

In [0]:
optimizer = optim.Adam(model.parameters())
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

## Generate Names
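
The `generate` function in this section picks the next character uniformly at random from the `top_k` highest-scoring candidates. Here is a minimal pure-Python sketch of that selection step alone; the scores are made up and the helper name `sample_top_k` is hypothetical.

```python
import random

def sample_top_k(scores, k, rng):
    """Pick uniformly at random among the indices of the k highest scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return rng.choice(top), top

scores = [0.1, 2.5, -1.0, 3.2, 0.7, 1.9]  # made-up scores for a 6-symbol vocabulary
rng = random.Random(0)

choice, top = sample_top_k(scores, k=3, rng=rng)
print(top, choice)

# With k=1 only the best candidate remains, which is greedy decoding.
greedy, _ = sample_top_k(scores, k=1, rng=rng)
print(greedy)
```

A larger `k` gives more varied names at the cost of more implausible ones, while `k=1` always returns the same name for a given starting letter.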



In [0]:
def generate(model, initial_letter, n_letters, char2id, id2char, pad_token, eos_token, top_k=5):
    model.eval()

    # initial character of the word
    word = initial_letter

    state_h, state_c = model.init_hidden(1)

    # convert the character to index    
    choice = char2id[initial_letter]

    # run the LSTM model for a maximum of `n_letters` steps
    for _ in range(n_letters):
        ix = torch.tensor([[choice]])

        # forward pass
        output, (state_h, state_c) = model(ix, (state_h, state_c))

        # get the indices of the top_k highest-scoring characters
        _, top_ix = torch.topk(output[0], k=top_k)
        choices = top_ix.tolist()

        # randomly choose among the top_k candidates
        # (always taking the single best candidate would be greedy search)
        choice = np.random.choice(choices[0])

        # if the choice is the <eos> token, stop the loop
        if choice == eos_token:
            break
        
        # if the choice is the <pad> token, skip it
        if choice == pad_token:
            continue
    
        # add the character to the name
        word += id2char[choice]

    print(f"GENERATED NAME: {word}")

## Training
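
The training loop in this section calls `torch.nn.utils.clip_grad_norm_`, which rescales all gradients by one common factor when their joint norm exceeds a threshold. The following is a plain-Python sketch of that idea on a flat list of numbers. The helper name `clip_by_global_norm` is hypothetical, and the real function works in place on the parameter tensors.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0, 12.0]  # toy gradient values with joint norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# Gradients that are already small enough are left unchanged.
small, _ = clip_by_global_norm([0.3, 0.4], max_norm=5.0)
print(small)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is limited.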


In [0]:
def train(model, iterator, optimizer, criterion, clip):
    epoch_loss = 0
    iteration = 0
    for batch in iterator:
        x, y = batch
        # x => names - [batch_size, seq_len]
        # y => targets: names shifted by 1 char and <eos> at the end - [batch_size, seq_len]

        batch_size = x.shape[0]
        iteration += 1

        # initialize hidden state
        state_h, state_c = model.init_hidden(batch_size)    
        # state_h => [1, batch_size, hid_dim]
        # state_c => [1, batch_size, hid_dim]

        # keep the model in train mode
        model.train()

        # zero the gradients
        optimizer.zero_grad()
        
        # forward pass
        logits, (state_h, state_c) = model(x, (state_h, state_c))
        # logits => [batch_size, seq_len, output_dim]

        # transpose the logits so that they are compatible with the loss function
        # transposed logits => [batch_size, output_dim, seq_len]
        #               y   => [batch_size, seq_len]
        loss = criterion(logits.transpose(1, 2), y)

        # backward pass
        loss.backward()

        # detach the states so that no graph is carried over between batches
        # (the states are re-initialised for every batch, so this is only a safeguard)
        state_h = state_h.detach()
        state_c = state_c.detach()

        # clip the gradient norm to handle the exploding-gradient problem
        _ = torch.nn.utils.clip_grad_norm_(
                model.parameters(), clip)
        
        # update the parameters of the model
        optimizer.step()

        # update the loss
        epoch_loss += loss.item()

        # log the loss for every 100 iterations
        if iteration % 100 == 0:
            print(f'Iteration: {iteration}, Loss: {loss.item()}')
        
        # generate a word for every 500 iterations
        if iteration % 500 == 0:
            generate(model, 'J', 10, char2id, id2char, char2id['<pad>'], char2id['<eos>'], top_k=5)
    
    # return the loss
    return epoch_loss / len(iterator)

In [0]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = elapsed_time - (elapsed_mins * 60)
    return elapsed_mins, elapsed_secs

In [52]:
N_EPOCHS = 5
MAX_CLIP_GRADIENT = 5

for epoch in range(N_EPOCHS):
    start_time = time.time()
    train_loss = train(model, train_data_loader, optimizer, criterion, MAX_CLIP_GRADIENT)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'Train Loss: {train_loss:.3f}')

torch.save(model.state_dict(), 'model.pt')

Iteration: 100, Loss: 2.567448139190674
Iteration: 200, Loss: 2.2804269790649414
Iteration: 300, Loss: 2.2448203563690186
Iteration: 400, Loss: 2.035532236099243
Iteration: 500, Loss: 2.0230836868286133
GENERATED NAME: Jesronyelye
Iteration: 600, Loss: 1.874259114265442
Iteration: 700, Loss: 1.8168166875839233
Epoch: 01 | Epoch Time: 0m 29.160451412200928s
Train Loss: 2.156
Iteration: 100, Loss: 1.7658997774124146
Iteration: 200, Loss: 1.4835509061813354
Iteration: 300, Loss: 1.4833662509918213
Iteration: 400, Loss: 1.4772356748580933
Iteration: 500, Loss: 1.4238742589950562
GENERATED NAME: Jearrynt
Iteration: 600, Loss: 1.3643320798873901
Iteration: 700, Loss: 1.1794513463974
Epoch: 02 | Epoch Time: 0m 29.020665645599365s
Train Loss: 1.435
Iteration: 100, Loss: 1.1719890832901
Iteration: 200, Loss: 1.1589194536209106
Iteration: 300, Loss: 1.280555248260498
Iteration: 400, Loss: 1.1393224000930786
Iteration: 500, Loss: 1.1151059865951538
GENERATED NAME: Janobene
Iteration: 600, Loss: 1

In [53]:
model.load_state_dict(torch.load('model.pt'))

<All keys matched successfully>

In [54]:
for id in range(2, len(char2id)):
    start_char = id2char[id]
    print(f"Name generated with {start_char}")
    generate(model, start_char, 10, char2id, id2char, char2id['<pad>'], char2id['<eos>'], top_k=5)
    print(f"--------------------------")

Name generated with a
GENERATED NAME: aylea
--------------------------
Name generated with b
GENERATED NAME: bennarinnan
--------------------------
Name generated with c
GENERATED NAME: cicahoraro
--------------------------
Name generated with d
GENERATED NAME: daryescolie
--------------------------
Name generated with e
GENERATED NAME: ehmol
--------------------------
Name generated with f
GENERATED NAME: frayleys
--------------------------
Name generated with g
GENERATED NAME: gughe
--------------------------
Name generated with h
GENERATED NAME: havayda
--------------------------
Name generated with i
GENERATED NAME: iridy
--------------------------
Name generated with j
GENERATED NAME: janey
--------------------------
Name generated with k
GENERATED NAME: kyliz
--------------------------
Name generated with l
GENERATED NAME: lvebestheli
--------------------------
Name generated with m
GENERATED NAME: megondarla
--------------------------
Name generated with n
GENERATED NAME: nhilip