<h1>Sequence Modeling for NLP<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Introduction-to-Recurrent-Neural-Networks" data-toc-modified-id="Introduction-to-Recurrent-Neural-Networks-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Introduction to Recurrent Neural Networks</a></span><ul class="toc-item"><li><span><a href="#Implemnting-an-Elman-RNN" data-toc-modified-id="Implemnting-an-Elman-RNN-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Implementing an Elman RNN</a></span></li></ul></li><li><span><a href="#Example:-Classifying-Surname-Nationality-Using-a-Character-RNN" data-toc-modified-id="Example:-Classifying-Surname-Nationality-Using-a-Character-RNN-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Example: Classifying Surname Nationality Using a Character RNN</a></span><ul class="toc-item"><li><span><a href="#Vocabulary,-Vectorizer,-Dataset" data-toc-modified-id="Vocabulary,-Vectorizer,-Dataset-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Vocabulary, Vectorizer, Dataset</a></span></li><li><span><a href="#Modeling" data-toc-modified-id="Modeling-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Modeling</a></span></li><li><span><a href="#Init-and-Model-Training-+-Evaluation" data-toc-modified-id="Init-and-Model-Training-+-Evaluation-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Init and Model Training + Evaluation</a></span></li><li><span><a href="#Inference" data-toc-modified-id="Inference-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Inference</a></span></li></ul></li></ul></div>

## Introduction

- A sequence is an ordered collection of items, for example a sentence (a sequence of words) or a word (a sequence of characters).
- In deep learning, modeling a sequence involves maintaining a hidden state. As each item in the sequence is encountered, the hidden state is updated. The resulting hidden state vector (the sequence representation) can then be used for tasks such as classification or sequence prediction.

## Introduction to Recurrent Neural Networks

- The purpose of recurrent neural networks (RNNs) is to model sequences of tensors.
- The most basic form of RNN is the _Elman RNN_.
- The goal of an RNN is to learn a representation of a sequence.

**RNN Steps**
- A hidden state vector is maintained to capture the current state of the sequence.
- The hidden state vector is computed from both the current input vector and the previous hidden state vector.
- In other words, the input at the current time step and the hidden state vector from the previous time step are mapped to the hidden state vector of the current time step.
- The new hidden vector is computed using a hidden-to-hidden weight matrix to map the previous hidden state vector and an input-to-hidden weight matrix to map the input vector.
- The hidden-to-hidden and input-to-hidden weights are shared across time steps. During training these weights are adjusted so that the RNN learns how to incorporate incoming information and maintain a state representation summarizing the input seen so far.
- Using the same weights to transform inputs into outputs at every time step is another example of parameter sharing, which is also used by CNNs. An RNN shares parameters across time; a CNN shares parameters across space.
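
The recurrence described in these steps can be sketched in plain Python. This is a minimal illustration rather than the notebook's implementation: the helper names (`matvec`, `elman_step`, `run_rnn`) and the toy weights are assumptions chosen for readability, and `tanh` is used because it is the default nonlinearity of `nn.RNNCell`.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def elman_step(x_t, h_prev, w_ih, w_hh, bias):
    """One Elman update: h_t = tanh(W_ih x_t + W_hh h_prev + b)."""
    from_input = matvec(w_ih, x_t)
    from_hidden = matvec(w_hh, h_prev)
    return [math.tanh(a + b + c) for a, b, c in zip(from_input, from_hidden, bias)]


def run_rnn(inputs, w_ih, w_hh, bias):
    """Apply the same weights at every time step and collect all hidden states."""
    hidden = [0.0] * len(bias)  # zero initial hidden state
    hiddens = []
    for x_t in inputs:
        hidden = elman_step(x_t, hidden, w_ih, w_hh, bias)
        hiddens.append(hidden)
    return hiddens


# Toy dimensions (hypothetical): 2 input features, 3 hidden units.
w_ih = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
w_hh = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.0], [0.3, 0.0, 0.1]]
bias = [0.0, 0.1, -0.1]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hiddens = run_rnn(sequence, w_ih, w_hh, bias)
print(len(hiddens), len(hiddens[0]))
```

Note that `w_ih`, `w_hh`, and `bias` are created once and reused for every element of `sequence`; that reuse is the parameter sharing across time mentioned in the last bullet.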

![Figure 6.1](../images/figure_6_1.png)
![Figure 6.2](../images/figure_6_2.png)

### Implementing an Elman RNN

In [1]:
%load_ext nb_black

import torch
import torch.nn as nn


In [2]:
class ElmanRNN(nn.Module):
    def __init__(self, input_size, hidden_size, batch_first=False):
        super(ElmanRNN, self).__init__()
        self.rnn_cell = nn.RNNCell(input_size, hidden_size)
        self.batch_first = batch_first
        self.hidden_size = hidden_size

    def _initialize_hidden(self, batch_size):
        return torch.zeros((batch_size, self.hidden_size))

    def forward(self, x_in, initial_hidden=None):
        if self.batch_first:
            batch_size, seq_size, feat_size = x_in.size()
            x_in = x_in.permute(1, 0, 2)
        else:
            seq_size, batch_size, feat_size = x_in.size()
        hiddens = []
        if initial_hidden is None:
            initial_hidden = self._initialize_hidden(batch_size)
            initial_hidden = initial_hidden.to(x_in.device)
        hidden_t = initial_hidden
        for t in range(seq_size):
            hidden_t = self.rnn_cell(x_in[t], hidden_t)
            hiddens.append(hidden_t)
        hiddens = torch.stack(hiddens)
        if self.batch_first:
            hiddens = hiddens.permute(1, 0, 2)
        return hiddens
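
As a sanity check on the class above, the sketch below steps an `nn.RNNCell` over a sequence by hand and compares the result with PyTorch's built-in `nn.RNN` after copying the cell's weights into it. The tensor sizes are arbitrary toy values; the only claim is that the manual loop and the built-in layer agree for a single-layer `tanh` RNN.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

seq_size, batch_size, input_size, hidden_size = 5, 2, 4, 3
cell = nn.RNNCell(input_size, hidden_size)
rnn = nn.RNN(input_size, hidden_size)  # expects (seq, batch, feature) by default

# Copy the cell's parameters into the single-layer nn.RNN.
with torch.no_grad():
    rnn.weight_ih_l0.copy_(cell.weight_ih)
    rnn.weight_hh_l0.copy_(cell.weight_hh)
    rnn.bias_ih_l0.copy_(cell.bias_ih)
    rnn.bias_hh_l0.copy_(cell.bias_hh)

x_in = torch.randn(seq_size, batch_size, input_size)

# Manual loop, mirroring ElmanRNN.forward with batch_first=False.
hidden_t = torch.zeros(batch_size, hidden_size)
hiddens = []
for t in range(seq_size):
    hidden_t = cell(x_in[t], hidden_t)
    hiddens.append(hidden_t)
manual_out = torch.stack(hiddens)  # (seq, batch, hidden)

builtin_out, last_hidden = rnn(x_in)

print(manual_out.shape)
print(torch.allclose(manual_out, builtin_out, atol=1e-6))
```

Writing the loop by hand, as `ElmanRNN` does, makes the per-step update explicit, which is the point of this section.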


## Example: Classifying Surname Nationality Using a Character RNN

In [3]:
from argparse import Namespace
import os
import json

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm_notebook

import utils


### Vocabulary, Vectorizer, Dataset

In [4]:
class Vocabulary(object):
    def __init__(self, token_to_idx=None):
        if token_to_idx is None:
            token_to_idx = {}
        self._token_to_idx = token_to_idx
        # Build the reverse mapping (index -> token) from the forward mapping.
        self._idx_to_token = {idx: token for token, idx in self._token_to_idx.items()}

    def to_serializable(self):
        return {"token_to_idx": self._token_to_idx}

    @classmethod
    def from_serializable(cls, contents):
        return cls(**contents)

    def add_token(self, token):
        if token in self._token_to_idx:
            index = self._token_to_idx[token]
        else:
            index = len(self._token_to_idx)
            self._token_to_idx[token] = index
            self._idx_to_token[index] = token
        return index

    def add_many(self, tokens):
        return [self.add_token(token) for token in tokens]

    def lookup_token(self, token):
        return self._token_to_idx[token]

    def lookup_index(self, index):
        if index not in self._idx_to_token:
            raise KeyError(f"The index {index} is not in the Vocab.")
        return self._idx_to_token[index]

    def __str__(self):
        return f"<Vocabulary(size={len(self)})>"

    def __len__(self):
        return len(self._token_to_idx)


class SequenceVocabulary(Vocabulary):
    def __init__(
        self,
        token_to_idx=None,
        unk_token="<UNK>",
        mask_token="<MASK>",
        begin_seq_token="<BEGIN>",
        end_seq_token="<END>",
    ):
        super(SequenceVocabulary, self).__init__(token_to_idx)
        self._mask_token = mask_token
        self._unk_token = unk_token
        self._begin_seq_token = begin_seq_token
        self._end_seq_token = end_seq_token

        self.mask_index = self.add_token(self._mask_token)
        self.unk_index = self.add_token(self._unk_token)
        self.begin_seq_index = self.add_token(self._begin_seq_token)
        self.end_seq_index = self.add_token(self._end_seq_token)

    def to_serializable(self):
        contents = super(SequenceVocabulary, self).to_serializable()
        contents.update(
            {
                "unk_token": self._unk_token,
                "mask_token": self._mask_token,
                "begin_seq_token": self._begin_seq_token,
                "end_seq_token": self._end_seq_token,
            }
        )
        return contents

    def lookup_token(self, token):
        if self.unk_index >= 0:
            return self._token_to_idx.get(token, self.unk_index)
        else:
            return self._token_to_idx[token]
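
To illustrate the lookup behavior, here is a stripped-down sketch using plain dictionaries instead of the classes above. It is not a replacement for `SequenceVocabulary`; the variable names and toy tokens are illustrative. It shows two properties the rest of the notebook relies on: the special tokens take the first indices (so the mask token gets index 0), and unseen characters fall back to the unknown-token index.

```python
token_to_idx = {}


def add_token(token):
    """Return the token's index, assigning the next free index if it is new."""
    if token not in token_to_idx:
        token_to_idx[token] = len(token_to_idx)
    return token_to_idx[token]


# Special tokens are added first: mask, unknown, begin, end.
mask_index = add_token("<MASK>")
unk_index = add_token("<UNK>")
begin_index = add_token("<BEGIN>")
end_index = add_token("<END>")

for char in "Cho":
    add_token(char)

idx_to_token = {idx: token for token, idx in token_to_idx.items()}


def lookup_token(token):
    """Map a token to its index, falling back to the unknown-token index."""
    return token_to_idx.get(token, unk_index)


print(lookup_token("h"), lookup_token("z"), idx_to_token[4])
```

Because the mask token lands on index 0, it can be passed as `padding_idx` to `nn.Embedding`, which is what the classifier does later.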


In [5]:
class SurnameVectorizer(object):
    def __init__(self, char_vocab, nationality_vocab):
        self.char_vocab = char_vocab
        self.nationality_vocab = nationality_vocab

    def vectorize(self, surname, vector_length=-1):
        indices = [self.char_vocab.begin_seq_index]
        indices.extend(self.char_vocab.lookup_token(token) for token in surname)
        indices.append(self.char_vocab.end_seq_index)
        if vector_length < 0:
            vector_length = len(indices)
        out_vector = np.zeros(vector_length, dtype=np.int64)
        out_vector[: len(indices)] = indices
        out_vector[len(indices) :] = self.char_vocab.mask_index
        return out_vector, len(indices)

    @classmethod
    def from_dataframe(cls, surname_df):
        char_vocab = SequenceVocabulary()
        nationality_vocab = Vocabulary()
        for index, row in surname_df.iterrows():
            for char in row.surname:
                char_vocab.add_token(char)
            nationality_vocab.add_token(row.nationality)
        return cls(char_vocab=char_vocab, nationality_vocab=nationality_vocab)

    @classmethod
    def from_serializable(cls, contents):
        char_vocab = SequenceVocabulary.from_serializable(contents["char_vocab"])
        nationality_vocab = Vocabulary.from_serializable(contents["nationality_vocab"])
        return cls(char_vocab=char_vocab, nationality_vocab=nationality_vocab)

    def to_serializable(self):
        return {
            "char_vocab": self.char_vocab.to_serializable(),
            "nationality_vocab": self.nationality_vocab.to_serializable(),
        }
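
The padding scheme in `vectorize` is easier to see on a concrete input. The sketch below re-creates it with plain lists and a toy vocabulary; the index values are hypothetical and only follow the general layout of `SequenceVocabulary` (mask at 0, then unknown, begin, end).

```python
MASK, UNK, BEGIN, END = 0, 1, 2, 3
char_to_idx = {"C": 4, "h": 5, "o": 6}  # toy character vocabulary


def vectorize(surname, vector_length=-1):
    """Wrap the characters in BEGIN/END and right-pad with MASK.

    Assumes vector_length, when given, is at least len(surname) + 2.
    """
    indices = [BEGIN]
    indices.extend(char_to_idx.get(char, UNK) for char in surname)
    indices.append(END)
    if vector_length < 0:
        vector_length = len(indices)
    padding = [MASK] * (vector_length - len(indices))
    return indices + padding, len(indices)


vector, length = vectorize("Cho", vector_length=8)
print(vector, length)  # [2, 4, 5, 6, 3, 0, 0, 0] 5
```

The returned length (5 here) is what `column_gather` later uses to pick the hidden state at the end-of-sequence position rather than at a padded one.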


In [6]:
class SurnameDataset(Dataset):
    def __init__(self, surname_df, vectorizer):
        self.surname_df = surname_df
        self._vectorizer = vectorizer
        self._max_seq_length = max(map(len, self.surname_df.surname)) + 2

        self.train_df = self.surname_df[self.surname_df.split == "train"]
        self.train_size = len(self.train_df)

        self.val_df = self.surname_df[self.surname_df.split == "val"]
        self.val_size = len(self.val_df)

        self.test_df = self.surname_df[self.surname_df.split == "test"]
        self.test_size = len(self.test_df)

        self._lookup_dict = {
            "train": (self.train_df, self.train_size),
            "val": (self.val_df, self.val_size),
            "test": (self.test_df, self.test_size),
        }
        self.set_split("train")
        class_counts = self.train_df.nationality.value_counts().to_dict()

        def sort_key(item):
            return self._vectorizer.nationality_vocab.lookup_token(item[0])

        sorted_counts = sorted(class_counts.items(), key=sort_key)
        frequencies = [count for _, count in sorted_counts]
        self.class_weights = 1.0 / torch.tensor(frequencies, dtype=torch.float32)

    @classmethod
    def load_dataset_and_make_vectorizer(cls, surname_csv):
        surname_df = pd.read_csv(surname_csv)
        train_surname_df = surname_df[surname_df.split == "train"]
        return cls(surname_df, SurnameVectorizer.from_dataframe(train_surname_df))

    @staticmethod
    def load_vectorizer_only(vectorizer_filepath):
        with open(vectorizer_filepath) as fp:
            return SurnameVectorizer.from_serializable(json.load(fp))

    def save_vectorizer(self, vectorizer_filepath):
        with open(vectorizer_filepath, "w") as fp:
            json.dump(self._vectorizer.to_serializable(), fp)

    def get_vectorizer(self):
        return self._vectorizer

    def set_split(self, split="train"):
        self._train_split = split
        self._target_df, self._target_size = self._lookup_dict[split]

    def __len__(self):
        return self._target_size

    def __getitem__(self, index):
        row = self._target_df.iloc[index]
        surname_vector, vec_length = self._vectorizer.vectorize(
            row.surname, self._max_seq_length
        )
        nationality_index = self._vectorizer.nationality_vocab.lookup_token(
            row.nationality
        )
        return {
            "x_data": surname_vector,
            "y_target": nationality_index,
            "x_length": vec_length,
        }

    def get_num_batches(self, batch_size):
        return len(self) // batch_size
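
The class weights computed in `__init__` are inverse class frequencies, ordered by each nationality's index in the vocabulary so that position `i` in the weight tensor lines up with class `i` in the loss. A small sketch with made-up counts (the nationalities and numbers below are purely illustrative):

```python
# Hypothetical training-set counts and vocabulary indices.
class_counts = {"English": 2000, "Irish": 100, "Japanese": 500}
nationality_to_idx = {"Irish": 0, "Japanese": 1, "English": 2}

# Sort by class index so class_weights[i] corresponds to class i.
sorted_counts = sorted(
    class_counts.items(), key=lambda item: nationality_to_idx[item[0]]
)
frequencies = [count for _, count in sorted_counts]
class_weights = [1.0 / count for count in frequencies]

print(class_weights)  # [0.01, 0.002, 0.0005]
```

Rare classes receive larger weights, so a weighted `nn.CrossEntropyLoss` penalizes mistakes on them more heavily than mistakes on frequent classes.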


### Modeling

In [7]:
def column_gather(y_out, x_lengths):
    x_lengths = x_lengths.long().detach().cpu().numpy() - 1
    out = []
    for batch_index, column_index in enumerate(x_lengths):
        out.append(y_out[batch_index, column_index])
    return torch.stack(out)


class ElmanRNN(nn.Module):
    def __init__(self, input_size, hidden_size, batch_first=False):
        super(ElmanRNN, self).__init__()
        self.rnn_cell = nn.RNNCell(input_size, hidden_size)
        self.batch_first = batch_first
        self.hidden_size = hidden_size

    def _initial_hidden(self, batch_size):
        return torch.zeros((batch_size, self.hidden_size))

    def forward(self, x_in, initial_hidden=None):
        if self.batch_first:
            batch_size, seq_size, feat_size = x_in.size()
            x_in = x_in.permute(1, 0, 2)
        else:
            seq_size, batch_size, feat_size = x_in.size()
        hiddens = []
        if initial_hidden is None:
            initial_hidden = self._initial_hidden(batch_size)
            initial_hidden = initial_hidden.to(x_in.device)
        hidden_t = initial_hidden
        for t in range(seq_size):
            hidden_t = self.rnn_cell(x_in[t], hidden_t)
            hiddens.append(hidden_t)
        hiddens = torch.stack(hiddens)
        if self.batch_first:
            hiddens = hiddens.permute(1, 0, 2)
        return hiddens


class SurnameClassifier(nn.Module):
    def __init__(
        self,
        embedding_size,
        num_embeddings,
        num_classes,
        rnn_hidden_size,
        batch_first=True,
        padding_idx=0,
    ):
        super(SurnameClassifier, self).__init__()
        self.emb = nn.Embedding(
            num_embeddings=num_embeddings,
            embedding_dim=embedding_size,
            padding_idx=padding_idx,
        )
        self.rnn = ElmanRNN(
            input_size=embedding_size,
            hidden_size=rnn_hidden_size,
            batch_first=batch_first,
        )
        self.fc1 = nn.Linear(
            in_features=rnn_hidden_size, out_features=rnn_hidden_size)
        self.fc2 = nn.Linear(
            in_features=rnn_hidden_size, out_features=num_classes)

    def forward(self, x_in, x_lengths=None, apply_softmax=False):
        x_embedded = self.emb(x_in)
        y_out = self.rnn(x_embedded)
        if x_lengths is not None:
            y_out = column_gather(y_out, x_lengths)
        else:
            y_out = y_out[:, -1, :]
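        # Pass training=self.training so dropout is disabled in eval mode;
        # F.dropout defaults to training=True and would otherwise stay active.
        y_out = F.relu(self.fc1(F.dropout(y_out, 0.5, training=self.training)))
        y_out = self.fc2(F.dropout(y_out, 0.5, training=self.training))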
        if apply_softmax:
            y_out = F.softmax(y_out, dim=1)
        return y_out
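
`column_gather` selects, for each sequence in the batch, the hidden state at its last real (unpadded) position. The loop version above is easy to read; an equivalent vectorized form using advanced indexing is sketched below on toy tensors. This is an alternative formulation for illustration, not the notebook's code.

```python
import torch

batch_size, seq_size, hidden_size = 3, 4, 2
# y_out[b, t] holds the value 10*b + t in every hidden unit, so results are easy to read.
y_out = (
    (
        10 * torch.arange(batch_size).view(-1, 1, 1)
        + torch.arange(seq_size).view(1, -1, 1)
    )
    .expand(batch_size, seq_size, hidden_size)
    .float()
)
x_lengths = torch.tensor([4, 2, 3])  # true lengths before padding

# Loop version, as in column_gather.
looped = torch.stack(
    [y_out[b, length - 1] for b, length in enumerate(x_lengths.tolist())]
)

# Vectorized version: pair each batch index with its last valid time step.
batch_indices = torch.arange(batch_size)
vectorized = y_out[batch_indices, x_lengths - 1]

print(vectorized[:, 0].tolist())  # [3.0, 11.0, 22.0]
```

Without this gather, taking `y_out[:, -1, :]` would read the hidden state after the padded positions for every surname shorter than the maximum length.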


### Init and Model Training + Evaluation

In [8]:
args = Namespace(
    # Data and path information
    surname_csv="../data/surnames/surnames_with_splits.csv",
    vectorizer_file="vectorizer.json",
    model_state_file="model.pth",
    save_dir="models/chapter06/surname_classification",
    # Model hyper parameter
    char_embedding_size=100,
    rnn_hidden_size=64,
    # Training hyper parameter
    num_epochs=100,
    learning_rate=1e-3,
    batch_size=64,
    seed=1337,
    early_stopping_criteria=5,
    # Runtime hyper parameter
    cuda=True,
    catch_keyboard_interrupt=True,
    reload_from_files=False,
    expand_filepaths_to_save_dir=True,
)

if not torch.cuda.is_available():
    args.cuda = False

args.device = torch.device("cuda" if args.cuda else "cpu")

print("Using CUDA: {}".format(args.cuda))


if args.expand_filepaths_to_save_dir:
    args.vectorizer_file = os.path.join(args.save_dir, args.vectorizer_file)

    args.model_state_file = os.path.join(args.save_dir, args.model_state_file)

# Set seed for reproducibility
utils.set_seed_everywhere(args.seed, args.cuda)

# handle dirs
utils.handle_dirs(args.save_dir)

Using CUDA: False



In [9]:
dataset = SurnameDataset.load_dataset_and_make_vectorizer(args.surname_csv)
dataset.save_vectorizer(args.vectorizer_file)
vectorizer = dataset.get_vectorizer()
classifier = SurnameClassifier(
    embedding_size=args.char_embedding_size,
    num_embeddings=len(vectorizer.char_vocab),
    num_classes=len(vectorizer.nationality_vocab),
    rnn_hidden_size=args.rnn_hidden_size,
    padding_idx=vectorizer.char_vocab.mask_index,
)
print(classifier)
classifer = classifier.to(args.device)
dataset.class_weights = dataset.class_weights.to(args.device)
loss_func = nn.CrossEntropyLoss(dataset.class_weights)
optimizer = optim.Adam(classifier.parameters(), lr=args.learning_rate)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=optimizer, mode="min", factor=0.5, patience=1
)

train_state = utils.train_model(
    classifier=classifier,
    loss_func=loss_func,
    optimizer=optimizer,
    scheduler=scheduler,
    dataset=dataset,
    args=args,
)
train_state = utils.evaluate_test_split(
    classifier, dataset, loss_func, train_state, args
)

--------------- 0th Epoch Stats---------------
Training Loss=2.853266392151515, Training Accuracy=12.083333333333336
Validation Loss=2.7643321990966796, Validation Accuracy=20.437499999999996.
------------------------------------------------------------
--------------- 10th Epoch Stats---------------
Training Loss=1.686806084712347, Training Accuracy=41.19791666666667
Validation Loss=1.8859842014312747, Validation Accuracy=43.1875.
------------------------------------------------------------
--------------- 20th Epoch Stats---------------
Training Loss=1.4386519243319829, Training Accuracy=45.18229166666667
Validation Loss=1.8459743452072142, Validation Accuracy=42.8125.
------------------------------------------------------------
--------------- 30th Epoch Stats---------------
Training Loss=1.395014308889707, Training Accuracy=45.88541666666668
Validation Loss=1.7970454692840578, Validation Accuracy=42.81250000000001.
------------------------------------------------------------
------
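
The scheduler configured above halves the learning rate when the monitored loss stops improving. The sketch below feeds a hand-written sequence of validation losses to `ReduceLROnPlateau` to show when the reduction fires with `factor=0.5` and `patience=1`. The dummy parameter and loss values are made up, and how `utils.train_model` actually calls the scheduler is not shown in this notebook, so treat the stepping pattern as an assumption.

```python
import torch
import torch.optim as optim

# A single dummy parameter so the optimizer has something to manage.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([param], lr=1e-3)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer=optimizer, mode="min", factor=0.5, patience=1
)

# Hypothetical validation losses: one improvement, then a plateau.
val_losses = [2.0, 1.5, 1.5, 1.5]
learning_rates = []
for val_loss in val_losses:
    scheduler.step(val_loss)
    learning_rates.append(optimizer.param_groups[0]["lr"])

print(learning_rates)
```

With `patience=1`, one epoch without improvement is tolerated; the learning rate is halved only after the second consecutive epoch without improvement.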


### Inference

In [10]:
def predict_nationality(surname, classifier, vectorizer):
    vectorized_surname, vec_length = vectorizer.vectorize(surname)
    vectorized_surname = torch.tensor(vectorized_surname).unsqueeze(dim=0)
    vec_length = torch.tensor([vec_length], dtype=torch.int64)

    result = classifier(vectorized_surname, vec_length, apply_softmax=True)
    probability_values, indices = result.max(dim=1)

    index = indices.item()
    prob_value = probability_values.item()

    predicted_nationality = vectorizer.nationality_vocab.lookup_index(index)

    return {
        "nationality": predicted_nationality,
        "probability": prob_value,
        "surname": surname,
    }
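
The last steps of `predict_nationality` (softmax, then taking the maximum and mapping the index back to a label) can be illustrated without a trained model. The logits and the label mapping below are invented for the example.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical model outputs for three classes.
idx_to_nationality = {0: "Irish", 1: "Japanese", 2: "Chinese"}
logits = [0.2, 2.1, 0.7]

probabilities = softmax(logits)
index = max(range(len(probabilities)), key=probabilities.__getitem__)
prediction = {
    "nationality": idx_to_nationality[index],
    "probability": probabilities[index],
}
print(prediction["nationality"])
```

The reported probability is the model's relative score for the top class, so a value such as 0.39 indicates that the remaining classes together still hold most of the probability mass.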


In [11]:
classifier = classifier.to("cpu")
for surname in ["McMahan", "Nakamoto", "Wan", "Cho"]:
    print(predict_nationality(surname, classifier, vectorizer))

{'nationality': 'Irish', 'probability': 0.5489820241928101, 'surname': 'McMahan'}
{'nationality': 'Japanese', 'probability': 0.7205414772033691, 'surname': 'Nakamoto'}
{'nationality': 'Vietnamese', 'probability': 0.4349416494369507, 'surname': 'Wan'}
{'nationality': 'Chinese', 'probability': 0.39028239250183105, 'surname': 'Cho'}



In [12]:
print(classifier)

SurnameClassifier(
  (emb): Embedding(80, 100, padding_idx=0)
  (rnn): ElmanRNN(
    (rnn_cell): RNNCell(100, 64)
  )
  (fc1): Linear(in_features=64, out_features=64, bias=True)
  (fc2): Linear(in_features=64, out_features=18, bias=True)
)

