<a href="https://colab.research.google.com/github/hrithinnnn/name-predictor/blob/main/NameCountryPrediction_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction
---
This tutorial introduces deep learning models and PyTorch as a training framework.  

We will use a simple country prediction task to achieve this goal. More details on the task are available below.  

# Task description
---
The task we will model today is predicting a person's country from their name, i.e. given a name string, predict the country the person most probably belongs to.  

We will do this with a character-level convolutional neural network (Char-CNN). The model and the intuition behind choosing this architecture are explained in the following sections.  

# Dataset
---
The dataset we will be using is a very simple one used in one of the PyTorch introductory tutorials. It consists of a small zip file containing names of people from 18 different nationalities/regions.  

We will use this data to train our model and then test it out.  

## Download the data

In [None]:
!rm -f data.zip
!wget https://download.pytorch.org/tutorial/data.zip
!unzip -o data.zip
!rm data.zip
!mkdir -p models

In [None]:
%pip install pytorch_lightning
%pip install torch
%pip install pandas
%pip install torchinfo

## Consolidated imports
The cell below consolidates all the required imports.

In [None]:
import glob
import os
import random
import re
import string
import unicodedata
from argparse import Namespace
from typing import Dict, List

import numpy as np
import pandas as pd
import torch
import torchinfo
from IPython.display import display
from pytorch_lightning import LightningModule, Trainer
from pytorch_lightning.callbacks import LearningRateMonitor, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from torch.nn import (
    Conv1d,
    Dropout,
    Embedding,
    Linear,
    LogSoftmax,
    MaxPool1d,
    NLLLoss,
    ReLU,
    Sequential,
)
from torch.utils.data import DataLoader, Dataset

## Seed RNGs
To make the experiments reproducible, we seed the random number generators of all the libraries used in this training with a fixed seed value.

In [None]:
SEED_VAL = 42
random.seed(SEED_VAL)
np.random.seed(SEED_VAL)
torch.manual_seed(SEED_VAL)
torch.cuda.manual_seed_all(SEED_VAL)
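
As a small illustration of why a fixed seed helps, the sketch below uses only the standard-library `random` module: creating a generator twice with the same seed replays the same sequence. The helper name `draw` is illustrative and not part of the tutorial code.

```python
import random

SEED_VAL = 42


def draw(seed: int, n: int = 5) -> list:
    """Create a generator from the seed and draw n pseudo-random integers."""
    rng = random.Random(seed)  # independent generator; global state is untouched
    return [rng.randint(0, 99) for _ in range(n)]


first = draw(SEED_VAL)
second = draw(SEED_VAL)

# Same seed, same sequence: this is what makes a training run repeatable.
print(first == second)
```

The same idea applies to NumPy and PyTorch, which keep their own generator state and therefore need their own seeding calls, as in the cell above.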

## Creating our vocab and indices
We restrict our input strings to a subset of ASCII characters. Our vocab is shown below, and each character is indexed by its position in the list, with index 0 reserved for the padding token.  

In [None]:
VOCAB = ["PAD"] + list(string.ascii_letters + " .,;'")
print("Vocab size: " + str(len(VOCAB)))
print("Index of 'D' in vocab: " + str(VOCAB.index("D")))
PAD_IDX = 0
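
To make the indexing concrete, here is a minimal sketch of how a name could be turned into a list of vocabulary indices. The lookup table `CHAR_TO_IDX` and the helper `encode_name` are assumptions introduced for illustration; the tutorial itself uses `VOCAB.index` directly.

```python
import string

# Same construction as in the tutorial: index 0 is reserved for padding.
VOCAB = ["PAD"] + list(string.ascii_letters + " .,;'")
CHAR_TO_IDX = {ch: i for i, ch in enumerate(VOCAB)}  # hypothetical lookup table


def encode_name(name: str) -> list:
    """Map each character to its vocabulary index, skipping unknown characters."""
    return [CHAR_TO_IDX[ch] for ch in name if ch in CHAR_TO_IDX]


encoded = encode_name("Ada")
print(len(VOCAB))  # 58: 1 pad token + 52 letters + 5 punctuation characters
print(encoded)     # [27, 4, 1]
```

A dictionary lookup is constant time per character, whereas `list.index` scans the list; for a vocabulary this small either choice works.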

## Convert names to ASCII
Some names in our input set contain Unicode characters with diacritics. Such names are converted to their closest ASCII equivalents, compromising on pronunciation but reducing the character set the model is trained on.  

For example, **`Ślusàrski`** is converted to **`Slusarski`**.  

The letters we intend to restrict our characters to are **`a-z`**, **`A-Z`** and the special characters **`[space].,;'`**.  


In [None]:
def convert_unicode_to_ascii(s: str):
    return "".join(
        c
        for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in VOCAB
    )
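
The following standalone sketch shows what this conversion does on a few inputs. It mirrors the function above but uses its own names (`ALLOWED`, `strip_diacritics`), which are assumptions made so that the snippet runs by itself.

```python
import string
import unicodedata

ALLOWED = set(string.ascii_letters + " .,;'")


def strip_diacritics(s: str) -> str:
    """Decompose characters (NFD), drop combining marks, keep allowed ASCII."""
    return "".join(
        c
        for c in unicodedata.normalize("NFD", s)
        if unicodedata.category(c) != "Mn" and c in ALLOWED
    )


print(strip_diacritics("Ślusàrski"))    # Slusarski
print(strip_diacritics("Aimée Leigh"))  # Aimee Leigh
# "Ł" has no canonical decomposition, so it is dropped rather than mapped to "L".
print(strip_diacritics("Łukasz"))       # ukasz
```

The last case is worth keeping in mind: characters that do not decompose into a base letter plus a combining mark are removed entirely.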

## Read and load the data

In [None]:
files = glob.glob("./data/names/*")

category_vs_lines: Dict[str, List[str]] = {}
for f in files:
    with open(f, "r") as fp:
        lines: List[str] = fp.readlines()
        lines = [convert_unicode_to_ascii(line.strip()) for line in lines]
        category = f.split(os.sep)[-1].split(".")[0]
        category_vs_lines[category] = lines

display(
    pd.DataFrame(
        [{"Category": k, "Count": len(v)} for k, v in category_vs_lines.items()]
    )
    .sort_values(by="Count", ascending=False)
    .reset_index(drop=True)
)

CLASSES = sorted(category_vs_lines.keys())
NUM_CLASSES = len(category_vs_lines)

## Find the mean size of names in our dataset

In [None]:
df = pd.DataFrame(
    [
        {"Value": v, "ValueLen": len(v)}
        for _, lines in category_vs_lines.items()
        for v in lines
    ]
)
display(df.head())
print("Mean: {}".format(df["ValueLen"].mean()))
print("Median: {}".format(df["ValueLen"].median()))
print("Mode: {}".format(df["ValueLen"].mode()))
print("Max val: {}".format(df["ValueLen"].max()))
del df

SEQUENCE_LEN = 10

As the maximum name length is 20, we might be tempted to use that as our maximum sequence length. However, the mean, median and mode of the name lengths in our data suggest that 10 is a reasonable value for the maximum sequence length.  

So we fix our sequence length at 10. We will truncate longer names to 10 characters and pad shorter names with a special pad token.
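
The truncate-and-pad rule can be sketched as a single helper. The function name `pad_or_truncate` is hypothetical; the tutorial performs the same two steps separately when it builds the data loaders.

```python
SEQUENCE_LEN = 10
PAD_IDX = 0


def pad_or_truncate(indices: list, length: int = SEQUENCE_LEN, pad: int = PAD_IDX) -> list:
    """Return exactly `length` indices: cut long inputs, right-pad short ones."""
    clipped = indices[:length]
    return clipped + [pad] * (length - len(clipped))


short = pad_or_truncate([27, 4, 1])
long = pad_or_truncate(list(range(1, 15)))
print(short)  # [27, 4, 1, 0, 0, 0, 0, 0, 0, 0]
print(long)   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```

Every input then has the same length, which is what allows a batch of names to be stacked into one tensor.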

# Model
---
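We now move on to the model architecture and its creation. We will use a simple character CNN: an embedding layer, followed by a 1-D convolutional layer with a ReLU activation and a max-pooling layer, and then a small stack of fully connected layers.  

The final layer is a linear layer over the 18 categories followed by a log-softmax, so the model outputs log-probabilities for each category.  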
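
It can help to work out by hand how the sequence length shrinks through the convolution and pooling stages. The sketch below applies the standard output-length formula for 1-D convolution and pooling, using the layer settings from the model code in this section (kernel size 3 with 256 output channels, then pooling with kernel size 2 and stride 2). The helper `conv1d_out_len` is an illustrative name, not a PyTorch function.

```python
def conv1d_out_len(
    length: int, kernel_size: int, stride: int = 1, padding: int = 0, dilation: int = 1
) -> int:
    """Output length of a 1-D convolution or pooling window over `length` steps."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


seq_len = 10
out_channels = 256

after_conv = conv1d_out_len(seq_len, kernel_size=3)               # 10 -> 8
after_pool = conv1d_out_len(after_conv, kernel_size=2, stride=2)  # 8 -> 4
flattened = out_channels * after_pool                             # 256 * 4 = 1024

print(after_conv, after_pool, flattened)  # 8 4 1024
```

The model computes this size dynamically by passing a dummy tensor through the layers, which avoids redoing the arithmetic whenever a kernel size or the sequence length changes.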

In [None]:
class NameCountryPredictor(LightningModule):
    def __init__(
        self,
        batch_size: int,
        sequence_length: int,
        num_classes: int,
        dropout: float = 0.1,
        learning_rate: float = 0.001,
    ):
        super(NameCountryPredictor, self).__init__()
        self.learning_rate = learning_rate

        """
        Input size : N * sequence_length i.e N * 10 here
        Output size: N * sequence_length * embed_dim i.e N * 10 * 128 here
        """
        self.embed_layer = Embedding(num_embeddings=len(VOCAB), embedding_dim=128)

        """
        Input size: N * embed_dim * sequence_length i.e N * 128 * 10 here
        Output size: Computed dynamically
        """
        self.conv_layers = Sequential(
            Conv1d(in_channels=128, out_channels=256, kernel_size=3),
            ReLU(),
            MaxPool1d(kernel_size=2, stride=2),
        )

        conv_output_dim = self.__get_conv_output_size((batch_size, sequence_length))

        """
        Input size: N * conv_output_dim
        Output_size: N * num_classes
        """
        self.fc_layers = Sequential(
            Linear(in_features=conv_output_dim, out_features=256),
            ReLU(),
            Dropout(p=dropout),
            Linear(in_features=256, out_features=num_classes),
            LogSoftmax(dim=-1),
        )

        self.__initialize_weights()

    def __initialize_weights(self):
        for param in self.parameters():
            if param.dim() > 1:
                torch.nn.init.xavier_uniform_(param)

    def __get_conv_output_size(self, input_size: tuple):
        """
        Method to compute the dimensions of the output after the convolutional layers
        """
        x = torch.ones(input_size, dtype=torch.long)
        out = self.encode(x)

        # Keep the first (batch) dimension and flatten all remaining
        # dimensions into a single feature dimension
        out = out.view(out.size(0), -1)
        out_dim = out.size(1)
        return out_dim

    def forward(self, batch):
        out = self.shared_step(input_data=batch)
        predicted_class = out.argmax(dim=-1)
        return predicted_class

    def training_step(self, batch, batch_idx):
        loss, acc, out = self.__predict_and_compute_loss(
            batch=batch, batch_idx=batch_idx
        )
        self.log_dict({"train_acc": acc}, prog_bar=True, on_epoch=True, on_step=False)
        self.log_dict({"loss": loss}, prog_bar=False, on_epoch=True, on_step=True)
        return {"loss": loss, "out": out}

    def training_epoch_end(self, outputs):
        for name, param in self.named_parameters(prefix="c_cnn/", recurse=True):
            self.logger.experiment.add_histogram(name, param, self.current_epoch)

    def validation_step(self, batch, batch_idx):
        loss, acc, out = self.__predict_and_compute_loss(
            batch=batch, batch_idx=batch_idx
        )
        self.log_dict({"val_loss": loss, "val_acc": acc}, prog_bar=True, on_epoch=True)
        return {"val_loss": loss, "val_acc": acc, "out": out}

    def test_step(self, batch, batch_idx):
        loss, acc, _ = self.__predict_and_compute_loss(batch=batch, batch_idx=batch_idx)
        self.log_dict(
            {"test_loss": loss, "test_acc": acc}, prog_bar=True, on_epoch=True
        )
        return {"test_loss": loss, "test_acc": acc}

    def configure_optimizers(self):
        print("lr=", self.learning_rate)
        optim = torch.optim.SGD(
            self.parameters(),
            lr=self.learning_rate,
        )
        out = {"optimizer": optim}
        return out

    def encode(self, input_data: torch.Tensor):
        out = self.embed_layer(input_data)
        # N x input_len x embed_dim -> N x embed_dim(channels_in) x input_len
        out = out.permute(0, 2, 1)
        out = self.conv_layers(out)
        return out

    def shared_step(self, input_data: torch.Tensor):
        out = self.encode(input_data=input_data)
        out = out.view(out.size(0), -1)
        out = self.fc_layers(out)
        return out

    def __shared_step(self, batch):
        input_data, _ = batch
        out = self.shared_step(input_data=input_data)
        return out

    def __predict_and_compute_loss(self, batch, batch_idx):
        _, output_data = batch
        predicted_out = self.__shared_step(batch=batch)

        loss_fn = NLLLoss()
        loss = loss_fn(predicted_out, output_data)

        predicted_classes = torch.argmax(input=predicted_out, dim=-1)
        batch_accuracy = torch.sum(torch.eq(predicted_classes, output_data)) / float(
            torch.numel(output_data)
        )
        return loss, batch_accuracy, (output_data, predicted_classes)


BATCH_SIZE = 50

model = NameCountryPredictor(BATCH_SIZE, SEQUENCE_LEN, NUM_CLASSES)

torchinfo.summary(
    model,
    input_size=[1, 10],
    col_names=["input_size", "output_size", "num_params", "kernel_size"],
    dtypes=[torch.long],
)
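
The model ends in `LogSoftmax` and is trained with `NLLLoss`. Together these amount to the usual cross-entropy loss. The plain-Python sketch below shows the arithmetic for a single example; the function names and the toy logits are made up for illustration and this is not how PyTorch implements the operations internally.

```python
import math


def log_softmax(logits: list) -> list:
    """Numerically stable log-softmax: subtract the log of the summed exponentials."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]


def nll_loss(log_probs: list, target: int) -> float:
    """Negative log-likelihood of the target class for one example."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]  # toy scores for three classes
log_probs = log_softmax(logits)
loss = nll_loss(log_probs, target=0)

print(sum(math.exp(lp) for lp in log_probs))  # probabilities sum to ~1
print(loss)  # small when the target class has the highest score
```

The loss is the negative log-probability assigned to the correct class, so it approaches zero as the model becomes confident in the right answer.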

# Training
---
After defining the model, we go ahead and implement the training pipeline.  

The training is somewhat simplified for us by using the [PyTorch Lightning](https://www.pytorchlightning.ai/) framework.  

We use its [Trainer](https://pytorch-lightning.readthedocs.io/en/latest/common/trainer.html) object to implement multi-device training and the learning rate finder.  

## Decide on whether to use CPU or GPU for training
PyTorch Lightning allows us to seamlessly switch our training accelerator from CPU to GPUs and vice versa. We will record the training device and use it as a trainer parameter.  

In [None]:
gpus_val = torch.cuda.device_count() if torch.cuda.is_available() else None

In [None]:
class NameCountryDataCollator(object):
    def __call__(self, batch):
        batch_inp_, batch_op_ = zip(*batch)
        stacked_inp = torch.row_stack(list(batch_inp_))
        stacked_out = torch.cat(list(batch_op_))
        return stacked_inp, stacked_out


class NameCountryDataset(Dataset):
    def __init__(self, data):
        self.data = data

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return len(self.data)


def create_split_data_loader(split_df: pd.DataFrame):
    data_pairs = split_df.apply(func=tuple, axis=1).to_list()
    vectorized_data = [
        (
            torch.tensor(
                inp
                if len(inp) == SEQUENCE_LEN
                else inp + ([PAD_IDX] * (SEQUENCE_LEN - len(inp))),
                dtype=torch.long,
            ),
            torch.tensor([op], dtype=torch.long),
        )
        for (inp, op) in data_pairs
    ]
    return vectorized_data


def create_data_loaders():
    df = pd.DataFrame(
        {
            "Input": list(map(lambda x: VOCAB.index(x), list(line[0:SEQUENCE_LEN]))),
            "Output": CLASSES.index(category),
        }
        for category, values in category_vs_lines.items()
        for line in values
    )
    display(df.head())
    train, val, test = np.split(
        df.sample(frac=1, random_state=SEED_VAL),
        [int(0.8 * len(df)), int(0.9 * len(df))],
    )

    data_collator = NameCountryDataCollator()
    train_dataloader = DataLoader(
        NameCountryDataset(create_split_data_loader(train)),
        batch_size=BATCH_SIZE,
        shuffle=True,
        collate_fn=data_collator,
    )
    val_dataloader = DataLoader(
        NameCountryDataset(create_split_data_loader(val)),
        batch_size=BATCH_SIZE,
        shuffle=False,
        collate_fn=data_collator,
    )
    test_dataloader = DataLoader(
        NameCountryDataset(create_split_data_loader(test)),
        batch_size=BATCH_SIZE,
        shuffle=False,
        collate_fn=data_collator,
    )
    return train_dataloader, val_dataloader, test_dataloader


def run_lr_finder(
    trainer: Trainer,
    training_data_loader: DataLoader,
    validation_data_loader: DataLoader,
):
    lr_finder = trainer.tuner.lr_find(
        model=model,
        train_dataloader=training_data_loader,
        val_dataloaders=validation_data_loader,
        min_lr=0.001,
        num_training=100,
    )
    print(lr_finder.results)
    new_lr = lr_finder.suggestion()
    print("New learning rate {}".format(new_lr))

    lr_finder.plot(suggest=True, show=True)
    model.learning_rate = new_lr


def run_train(args):
    (
        training_data_loader,
        validation_data_loader,
        test_data_loader,
    ) = create_data_loaders()

    callbacks = []

    check_pointing = ModelCheckpoint(
        monitor="val_loss",
        mode="min",
        save_top_k=5,
        filename="{epoch}-{val_loss:.4f}-{val_acc:.3f}",
    )

    callbacks.append(check_pointing)
    callbacks.append(LearningRateMonitor())

    tb_logger = TensorBoardLogger(
        save_dir=os.path.join(os.getcwd(), "models"), name="cnn"
    )

    trainer = Trainer(
        check_val_every_n_epoch=5,
        max_epochs=args.epochs,
        logger=tb_logger,
        callbacks=callbacks,
        fast_dev_run=args.is_fast_dev_mode,
        gpus=args.gpus,
    )

    if args.is_lr_mode:
        run_lr_finder(
            trainer=trainer,
            training_data_loader=training_data_loader,
            validation_data_loader=validation_data_loader,
        )
    else:
        trainer.fit(
            model=model,
            train_dataloader=training_data_loader,
            val_dataloaders=validation_data_loader,
        )
        trainer.test(model=model, test_dataloaders=test_data_loader)

    return check_pointing

## Find the best learning rate
We run a [learning rate finder](https://pytorch-lightning.readthedocs.io/en/latest/advanced/training_tricks.html#using-lightning-s-built-in-lr-finder) algorithm supported by [PyTorch Lightning](https://pytorch-lightning.readthedocs.io). It implements the learning rate range test proposed in the paper *Cyclical Learning Rates for Training Neural Networks* by Leslie N. Smith, which suggests a good initial learning rate for faster convergence of models.  

More details on the paper are available in the PyTorch Lightning link.
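
The idea behind a learning rate range test can be sketched in a few lines: train for a small number of steps while increasing the learning rate exponentially, record the loss at each step, and pick a rate from the region where the loss falls fastest. The code below is a simplified illustration, not Lightning's implementation. The function names, the upper bound `max_lr=1.0` and the loss values are assumptions; the losses are a synthetic toy curve, not results from this model.

```python
def exponential_lr_schedule(min_lr: float, max_lr: float, num_steps: int) -> list:
    """Learning rates spaced evenly on a log scale from min_lr to max_lr."""
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


def suggest_lr(lrs: list, losses: list) -> float:
    """Return the learning rate at the start of the steepest loss decrease."""
    drops = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    steepest = min(range(len(drops)), key=drops.__getitem__)
    return lrs[steepest]


lrs = exponential_lr_schedule(min_lr=0.001, max_lr=1.0, num_steps=7)

# Synthetic toy losses: slow start, fast decrease, then divergence at high rates.
toy_losses = [2.9, 2.8, 2.5, 1.9, 1.7, 2.4, 5.0]

print(lrs)
print(suggest_lr(lrs, toy_losses))
```

The typical shape is the one in the toy curve: too small a rate barely reduces the loss, too large a rate makes it blow up, and the useful range lies in between.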

In [None]:
args = Namespace(is_fast_dev_mode=False, gpus=gpus_val, is_lr_mode=True, epochs=100)
run_train(args)

## Run training
The model instance now holds the learning rate suggested by the finder.  

We will use this model to run the training.  

In [None]:
display(model.learning_rate)
args = Namespace(is_fast_dev_mode=False, gpus=gpus_val, is_lr_mode=False, epochs=50)
model_checkpoint = run_train(args)

## Training graphs for loss and accuracy

We will now visualize the training and validation losses and accuracies logged during training using [TensorBoard](https://www.tensorflow.org/tensorboard/).

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./models

# Inference
---
We will now predict the classes of various inputs using the best model with respect to validation loss.  

In [None]:
def run_inference(model_path: str, input_vals: List[str]):
    best_model = NameCountryPredictor.load_from_checkpoint(
        model_path,
        batch_size=BATCH_SIZE,
        num_classes=NUM_CLASSES,
        sequence_length=SEQUENCE_LEN,
    )

    inputs = [
        re.sub(
            "[^{}]".format("".join(VOCAB)),
            "",
            convert_unicode_to_ascii(inp)[0:SEQUENCE_LEN],
        )
        for inp in input_vals
    ]
    display(inputs)

    inp_tensor = torch.tensor(
        [
            [VOCAB.index(ch) for ch in inp]
            + (
                [PAD_IDX]
                * ((SEQUENCE_LEN - len(inp)) if len(inp) < SEQUENCE_LEN else 0)
            )
            for inp in inputs
        ],
        dtype=torch.long,
    )
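    # Use the model loaded from the checkpoint, in eval mode so dropout is disabled
    best_model.eval()
    with torch.no_grad():
        predicted_classes = best_model(inp_tensor)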
    display(
        pd.DataFrame(
            [
                {"Name": name, "Category": CLASSES[predicted_class]}
                for name, predicted_class in zip(input_vals, predicted_classes.tolist())
            ]
        )
    )


infer_inps = [
    "Alexis",
    "Aimée Leigh",
    "Vasily Grigoryevich Zaitsev",
    "Joaquin",
    "Émer",
    "Aleksander",
]
run_inference(model_checkpoint.best_model_path, infer_inps)
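
The `forward` method returns only the most likely class. Because the network outputs log-probabilities, it is also possible to rank several candidates per name. The sketch below shows one way to do that in plain Python; the helper `top_k`, the three class names and the probability values are illustrative assumptions, not output from the trained model.

```python
import math


def top_k(log_probs: list, classes: list, k: int = 2) -> list:
    """Return the k most likely (class, probability) pairs, best first."""
    ranked = sorted(zip(classes, log_probs), key=lambda pair: pair[1], reverse=True)
    return [(name, math.exp(lp)) for name, lp in ranked[:k]]


toy_classes = ["English", "French", "Russian"]
toy_log_probs = [math.log(0.2), math.log(0.7), math.log(0.1)]  # made-up values

best = top_k(toy_log_probs, toy_classes)
for name, prob in best:
    print("{}: {:.2f}".format(name, prob))
```

Showing the runner-up classes alongside their probabilities gives a sense of how confident the model is, which a single arg-max label hides.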