# RNN for Classifying Names

In this notebook we build and train a basic character-level RNN to classify
words. A character-level RNN reads a word as a sequence of characters,
producing a prediction and a "hidden state" at each step and feeding the
hidden state into the next step. We take the final prediction
as the output, i.e. the class the word belongs to.

### Preparing the Data

Download the data in folder `data/names` from GitHub.

Included in the ``data/names`` directory are 18 text files named as
``[Language].txt``. Each file contains a bunch of names, one name per
line, mostly romanized (but we still need to convert from Unicode to
ASCII).

In [1]:
import string
import unicodedata

# this is the vocabulary we will use
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

print(f"Vocab is of size {n_letters} and contains:", all_letters)

Vocab is of size 57 and contains: abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ .,;'


In [2]:
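# Convert a Unicode string to plain ASCII: decompose accented characters (NFD),
# then drop the combining marks (category 'Mn') and anything not in all_letters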
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))
print(unicodeToAscii('Heute ist es schön heiß'))

Slusarski
Heute ist es schon hei
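
To see why filtering on category `Mn` removes accents, it can help to look at the decomposed form directly. The sketch below uses a hypothetical helper `decompose` (not part of the notebook) that lists each character of the NFD form together with its Unicode category. It also checks that `ß` has no decomposition, which would explain why it is dropped entirely rather than replaced in the output above.

```python
import unicodedata

def decompose(s):
    """Return (character, Unicode category) pairs for the NFD form of s.

    Hypothetical helper for illustration only.
    """
    return [(c, unicodedata.category(c)) for c in unicodedata.normalize('NFD', s)]

# Each accented letter splits into a base letter plus a combining mark ('Mn')
print([cat for _, cat in decompose('Śà')])      # ['Lu', 'Mn', 'Ll', 'Mn']

# 'ß' stays a single character after NFD, so only the all_letters check removes it
print(unicodedata.normalize('NFD', 'ß') == 'ß')  # True
```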


In [3]:
import glob
import os

# Read a file and split it into lines, converting each name to ASCII
def readLines(filename):
    # the context manager closes the file handle once reading is done
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

all_categories = []
X = []
y = []

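# Note: glob.glob returns files in arbitrary order, so category indices can differ between machines
for filename in glob.glob('data/names/*.txt'):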
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    for line in lines:
        X.append(line)
        y.append(category)
    
n_categories = len(all_categories)
n_categories, len(X)

(18, 20074)

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print("Train data points:", len(X_train))

Train data points: 16059


Turning Names into Tensors
--------------------------

Now that we have all the names organized, we need to turn them into
Tensors to make any use of them.

To represent a single letter, we use a "one-hot vector" of size
``<1 x n_letters>``. A one-hot vector is filled with 0s except for a 1
at index of the current letter, e.g. ``"b" = <0 1 0 0 0 ...>``.

To make a word we stack those vectors into a 3D tensor of shape
``<line_length x 1 x n_letters>``.

The extra dimension of size 1 is there because PyTorch assumes everything
comes in batches; we are just using a batch size of 1 here.




In [5]:
import torch

def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    index = all_letters.find(letter)
    tensor[0][index] = 1
    return tensor

print(letterToTensor('J'))

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])


In [6]:
# Turn a line into a <line_length x 1 x n_letters> tensor,
# i.e. a sequence of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for i, letter in enumerate(line):
        index = all_letters.find(letter)
        tensor[i][0][index] = 1
    return tensor

print(lineToTensor('Jones').size())

torch.Size([5, 1, 57])
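
As an alternative to filling the tensor letter by letter, the same shape can be produced with `torch.nn.functional.one_hot`. This is only a sketch: `line_to_tensor_vectorized` is a hypothetical name, and it assumes every character is in `all_letters` (for example because the name already went through `unicodeToAscii`), since `one_hot` rejects the index -1 that `find` returns for unknown characters.

```python
import string

import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def line_to_tensor_vectorized(line):
    """Hypothetical vectorized counterpart of lineToTensor."""
    # Look up the vocabulary index of every character in the line
    indices = torch.tensor([all_letters.find(ch) for ch in line], dtype=torch.long)
    # one_hot yields an integer tensor of shape <line_length x n_letters>
    one_hot = F.one_hot(indices, num_classes=n_letters).float()
    # Insert the batch dimension of size 1: <line_length x 1 x n_letters>
    return one_hot.unsqueeze(1)

t = line_to_tensor_vectorized('Jones')
print(t.size())
```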


In [7]:
def categoryToTensor(category):
    index = all_categories.index(category)
    return torch.tensor([index], dtype=torch.long)

categoryToTensor("Korean")

tensor([16])

Creating the Network
====================

Before autograd, creating a recurrent neural network in Torch involved
cloning the parameters of a layer over several timesteps. The layers
held hidden state and gradients, which are now handled entirely by the
graph itself. This means you can implement an RNN in a very "pure" way,
as regular feed-forward layers.

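This RNN module is just two linear layers that operate on the concatenated input and hidden state.
Note that the module returns raw scores (logits) and contains no LogSoftmax layer:
`nn.CrossEntropyLoss`, which we use for training below, applies log-softmax internally.
You can see the architecture here: https://i.imgur.com/Z2xbySO.png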
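
As a quick check of how `nn.CrossEntropyLoss` treats raw scores, the sketch below compares it with an explicit log-softmax followed by the negative log-likelihood loss. The logits and the target are made-up values for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up raw scores for one sample and three classes, with class 1 as the label
logits = torch.tensor([[1.0, 2.0, 0.5]])
target = torch.tensor([1])

# CrossEntropyLoss expects unnormalized scores ...
ce = nn.CrossEntropyLoss()(logits, target)

# ... and should match log-softmax followed by NLL loss
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(ce.item(), nll.item())
```

If the two numbers agree, adding a LogSoftmax layer to the module on top of `nn.CrossEntropyLoss` would be redundant.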

In [8]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = 128  # size of the hidden state vector

        self.i2h = nn.Linear(input_size + self.hidden_size, self.hidden_size)
        self.i2o = nn.Linear(input_size + self.hidden_size, output_size)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(combined)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

To run a step of this network we need to pass an input (in our case, the
Tensor for the current letter) and a previous hidden state (which we
initialize as zeros at first). We'll get back the output (an unnormalized
score for each language) and a next hidden state (which we keep for the next
step).




In [9]:
rnn = RNN(n_letters, n_categories)

x = letterToTensor('A')
hidden = rnn.initHidden()

output, next_hidden = rnn(x, hidden)
print(torch.softmax(output, 1))

tensor([[0.0532, 0.0598, 0.0525, 0.0550, 0.0553, 0.0539, 0.0598, 0.0571, 0.0614,
         0.0522, 0.0532, 0.0553, 0.0557, 0.0538, 0.0564, 0.0532, 0.0557, 0.0567]],
       grad_fn=<SoftmaxBackward0>)


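As you can see, after applying softmax the output is a ``<1 x n_categories>`` Tensor in which
every item is the predicted probability of that category (higher is more likely).
Because the network is still untrained, all values are close to 1/18.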
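
The cell above runs a single step. To classify a whole name, the hidden state has to be carried from letter to letter and only the last output is kept. The following is a minimal sketch of that loop, not the notebook's reference solution: `run_name` is a hypothetical helper, the `RNN` class is repeated so the block runs on its own, and the input is a hand-built stand-in for `lineToTensor('Jones')`.

```python
import torch
import torch.nn as nn

N_LETTERS = 57      # vocabulary size printed earlier in the notebook
N_CATEGORIES = 18   # number of language files

class RNN(nn.Module):
    """Same two-linear-layer cell as in the notebook, repeated for self-containment."""
    def __init__(self, input_size, output_size):
        super().__init__()
        self.hidden_size = 128
        self.i2h = nn.Linear(input_size + self.hidden_size, self.hidden_size)
        self.i2o = nn.Linear(input_size + self.hidden_size, output_size)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), 1)
        return self.i2o(combined), self.i2h(combined)

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

def run_name(rnn, line_tensor):
    """Feed a non-empty <line_length x 1 x n_letters> tensor through the RNN
    letter by letter and return the output of the last step (hypothetical helper)."""
    hidden = rnn.initHidden()
    output = None
    for letter_tensor in line_tensor:   # each slice has shape <1 x n_letters>
        output, hidden = rnn(letter_tensor, hidden)
    return output

rnn = RNN(N_LETTERS, N_CATEGORIES)

# Stand-in for lineToTensor('Jones'): indices of J, o, n, e, s in all_letters
line_tensor = torch.zeros(5, 1, N_LETTERS)
for i, idx in enumerate([35, 14, 13, 4, 18]):
    line_tensor[i, 0, idx] = 1

with torch.no_grad():                   # no gradients needed for a forward pass only
    scores = run_name(rnn, line_tensor)
probs = torch.softmax(scores, dim=1)
print(scores.shape, probs.sum().item())
```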




### Task 1: Training the Network

Complete the following training loop to train the RNN on the training data set.

In [10]:
import math


criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.005)

for epoch in range(1, 10):
    print("Training epoch:", epoch)
    # iterate through all names in X_train
    # for every name:
        # init the hidden layer of the rnn
        # insert the name character by character into the rnn and compute the final output
        # note: you need to carry on the hidden state in every time step
        # define the loss on the last output of the rnn and the category (=label)
        # reset the gradients, backpropagate the loss and take an optimizer step

Training epoch: 1
Training epoch: 2
Training epoch: 3
Training epoch: 4
Training epoch: 5
Training epoch: 6
Training epoch: 7
Training epoch: 8
Training epoch: 9
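
One possible way to fill in the steps listed in the comments is sketched below. It is not the reference solution. To keep the block runnable on its own it repeats `lineToTensor` and the `RNN` class and trains on a tiny made-up data set (`toy_X`, `toy_y`, `toy_categories` are hypothetical stand-ins for `X_train`, `y_train` and `all_categories`).

```python
import string

import torch
import torch.nn as nn

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for i, letter in enumerate(line):
        tensor[i][0][all_letters.find(letter)] = 1
    return tensor

class RNN(nn.Module):
    def __init__(self, input_size, output_size):
        super().__init__()
        self.hidden_size = 128
        self.i2h = nn.Linear(input_size + self.hidden_size, self.hidden_size)
        self.i2o = nn.Linear(input_size + self.hidden_size, output_size)

    def forward(self, x, hidden):
        combined = torch.cat((x, hidden), 1)
        return self.i2o(combined), self.i2h(combined)

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

# Made-up toy data standing in for X_train / y_train / all_categories
toy_categories = ["German", "Korean"]
toy_X = ["Schmidt", "Mueller", "Kim", "Park"]
toy_y = ["German", "German", "Korean", "Korean"]

torch.manual_seed(0)
rnn = RNN(n_letters, len(toy_categories))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(rnn.parameters(), lr=0.005)

epoch_losses = []
for epoch in range(1, 4):
    total_loss = 0.0
    for name, category in zip(toy_X, toy_y):
        optimizer.zero_grad()                  # clear gradients from the previous name
        hidden = rnn.initHidden()              # fresh hidden state for every name
        for letter_tensor in lineToTensor(name):
            output, hidden = rnn(letter_tensor, hidden)   # carry the hidden state forward
        target = torch.tensor([toy_categories.index(category)])
        loss = criterion(output, target)       # loss on the last output only
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    epoch_losses.append(total_loss / len(toy_X))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```

On the real data the same loop would iterate over `X_train` and `y_train` and use `categoryToTensor` for the target. Shuffling the names each epoch is a common addition, but it is not required by the task description.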


### Task 2: Evaluating the Results

Evaluate the accuracy of the RNN on the test data.
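
A minimal sketch of an accuracy computation follows. `evaluate_accuracy` is a hypothetical helper that takes any function mapping a name to a `<1 x n_categories>` score tensor; in the notebook that function would run the trained RNN over `lineToTensor(name)`. Here a dummy scorer based on name length stands in for the model so the block runs on its own, and the names and labels are made up.

```python
import torch

def evaluate_accuracy(predict_scores, names, labels, categories):
    """Fraction of names whose highest-scoring category equals the label."""
    correct = 0
    with torch.no_grad():                        # evaluation needs no gradients
        for name, label in zip(names, labels):
            scores = predict_scores(name)        # shape <1 x n_categories>
            guess = categories[scores.argmax(dim=1).item()]
            correct += int(guess == label)
    return correct / len(names)

# Dummy stand-in for the trained RNN: long names score as class 0, short ones as class 1
def dummy_scores(name):
    is_long = float(len(name) > 4)
    return torch.tensor([[is_long, 1.0 - is_long]])

categories = ["German", "Korean"]
names = ["Schmidt", "Kim", "Park", "Lang"]
labels = ["German", "Korean", "Korean", "German"]

acc = evaluate_accuracy(dummy_scores, names, labels, categories)
print(f"accuracy: {acc:.2f}")   # accuracy: 0.75
```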

### Task 3: Running on User Input

Write a function that takes an arbitrary name as input and outputs the RNN's top 3 categories for that name.
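
One way to pick the top categories is `torch.topk` on the softmax of the final output. The sketch below shows only that selection step: `top_k_categories` is a hypothetical helper, and the score tensor and category list are made up. In the notebook the scores would come from running the trained RNN over `lineToTensor(unicodeToAscii(name))`.

```python
import torch

def top_k_categories(scores, categories, k=3):
    """Return the k most probable (category, probability) pairs.

    Assumes scores has shape <1 x n_categories> and k <= len(categories).
    """
    probs = torch.softmax(scores, dim=1)
    top_probs, top_idx = probs.topk(k, dim=1)    # both sorted by descending probability
    return [(categories[i], p)
            for p, i in zip(top_probs[0].tolist(), top_idx[0].tolist())]

# Made-up scores for four categories
categories = ["Czech", "German", "Irish", "Korean"]
scores = torch.tensor([[0.1, 2.0, -1.0, 1.5]])

result = top_k_categories(scores, categories)
for name, prob in result:
    print(f"{name}: {prob:.3f}")
```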
