# How to classify surnames with an RNN

We follow the tutorial available at: https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

We build a surname classifier from scratch (pre-processing included) using a character-level RNN, which predicts which of 18 languages a surname most likely comes from.

## The data
The data is available at: https://download.pytorch.org/tutorial/data.zip

The folder `data/names` contains 18 Unicode text files, one per language, named `[Language].txt`. The following code reads the files, converts the names to ASCII and builds a dictionary of the form `{language: [name1, name2, ...]}`.

In [15]:
# import a few packages
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import unicodedata
import string

In [16]:
# get paths of the language files 
def findFiles(path): return glob.glob(path)
print(findFiles('data/names/*.txt'))

['data/names/Greek.txt', 'data/names/German.txt', 'data/names/Portuguese.txt', 'data/names/Irish.txt', 'data/names/Scottish.txt', 'data/names/Czech.txt', 'data/names/English.txt', 'data/names/Vietnamese.txt', 'data/names/Polish.txt', 'data/names/Korean.txt', 'data/names/French.txt', 'data/names/Spanish.txt', 'data/names/Arabic.txt', 'data/names/Chinese.txt', 'data/names/Dutch.txt', 'data/names/Japanese.txt', 'data/names/Italian.txt', 'data/names/Russian.txt']


In [17]:
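# from Unicode string to plain ASCII, following https://stackoverflow.com/a/518232/2809427
all_letters = string.ascii_letters + " .,;'" # allowed characters; must exist before unicodeToAscii is called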
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# for example:
print(unicodeToAscii('Ślusàrski'))

Slusarski
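To see why `unicodeToAscii` works, it can help to look at what NFD normalization does on its own. The following is a minimal sketch, not part of the original tutorial: it decomposes the same example name and lists the combining marks that the `'Mn'` filter removes. The variable names are illustrative, and the name is written with explicit escapes so that the input is known to be in precomposed form.

```python
import unicodedata

# 'Ślusàrski' written with explicit code points (precomposed form)
name = '\u015alus\u00e0rski'

# NFD splits each accented letter into a base letter plus a combining mark
decomposed = unicodedata.normalize('NFD', name)
print(len(name), len(decomposed))  # 9 11

# combining marks have Unicode category 'Mn' (Mark, nonspacing)
marks = [unicodedata.name(c) for c in decomposed if unicodedata.category(c) == 'Mn']
print(marks)  # ['COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT']

# dropping the marks leaves only the base letters
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)  # Slusarski
```

The additional `c in all_letters` test in `unicodeToAscii` then discards any remaining character that is outside the chosen alphabet.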


In [7]:
# define a function to read a file, split it into lines and convert each line to ASCII
def readLines(filename):
    with open(filename, encoding='utf-8') as f: # close the file when done
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

In [13]:
# build the dictionary, a list of names for each language
category_dict = {} # init empty dict
all_categories = [] # init empty list

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0] # get language name from file path
    all_categories.append(category)
    names = readLines(filename)
    category_dict[category] = names

In [14]:
# inspect the data a bit
n_categories = len(all_categories)
print(n_categories)
print(category_dict.keys())
print(category_dict['Italian'][:5])

18
dict_keys(['French', 'Vietnamese', 'Spanish', 'Japanese', 'Korean', 'Czech', 'Arabic', 'German', 'Chinese', 'English', 'Dutch', 'Polish', 'Italian', 'Irish', 'Scottish', 'Russian', 'Portuguese', 'Greek'])
['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']


## From names to tensors

We use one-hot encoding for the letters: each letter is encoded as a vector of 0s and 1s whose length equals the size of the alphabet, with a single 1 at the letter's index; for example, letter `a = [1,0,0,0,...]`, letter `b = [0,1,0,0,...]` and so on.

A name is then encoded as a stack of these vectors, one per letter. The code below keeps an extra dimension of size 1 for the batch, so a name becomes a tensor of shape `<len(word) x 1 x n_letters>`.

In [19]:
all_letters = string.ascii_letters + " .,;'" # define the alphabet
n_letters = len(all_letters) # size of the alphabet (57), needed by the tensor functions below

# define a function to find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# example
letterToIndex("k")

10

In [22]:
# first import torch to define tensors
import torch

# define a function to turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters) # start with all zeros; the leading 1 is the batch dimension
    tensor[0][letterToIndex(letter)] = 1 # then set the letter's index position to 1
    return tensor

# example:
letterToTensor("k")

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])

In [24]:
# define a function to convert a word into a matrix of one-hot encodings
def wordToTensor(word):
    tensor = torch.zeros(len(word), 1, n_letters)
    for li, letter in enumerate(word):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

#example:
wordToTensor("hey")

tensor([[[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]]])
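As a cross-check, the same encoding can be produced without an explicit loop over tensor positions. The sketch below is an alternative, not the tutorial's method: it uses `torch.nn.functional.one_hot`, and the function name `wordToTensorOneHot` is hypothetical. It assumes every character of the word is in `all_letters`; for an unknown character `find` returns -1, which `one_hot` rejects, whereas the indexing in `wordToTensor` would silently set the last position.

```python
import string
import torch
import torch.nn.functional as F

all_letters = string.ascii_letters + " .,;'"  # same alphabet as above
n_letters = len(all_letters)

# hypothetical alternative to wordToTensor, built on F.one_hot
def wordToTensorOneHot(word):
    # index of each letter in the alphabet (assumes all letters are known)
    indices = torch.tensor([all_letters.find(ch) for ch in word])
    # <len(word) x n_letters> integer tensor, converted to float
    one_hot = F.one_hot(indices, num_classes=n_letters).float()
    # add the batch dimension: <len(word) x 1 x n_letters>
    return one_hot.unsqueeze(1)

t = wordToTensorOneHot("hey")
print(t.shape)                               # torch.Size([3, 1, 57])
print(t.argmax(dim=2).squeeze(1).tolist())   # [7, 4, 24]
```

The positions 7, 4 and 24 match the 1s in the printed output of `wordToTensor("hey")`.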

## The neural network

The RNN is quite simple: just two linear layers that operate on the concatenated input and hidden state, with a LogSoftmax layer after the output (see https://pytorch.org/tutorials/beginner/former_torchies/nn_tutorial.html#example-2-recurrent-net).

In [25]:
import torch.nn as nn # we customize the nn.Module

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()

        self.hidden_size = hidden_size

        self.i2h = nn.Linear(input_size + hidden_size, hidden_size) # layer from (input + hidden) to new hidden state
        self.i2o = nn.Linear(input_size + hidden_size, output_size) # layer from (input + hidden) to output
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1) # the source of recurrence: input and previous hidden state are concatenated here
        hidden = self.i2h(combined) # the result is fed to the linear layer producing the new hidden state
        output = self.i2o(combined) # and to the linear layer producing the output
        output = self.softmax(output) # which LogSoftmax turns into log-probabilities over the categories
        return output, hidden

    def initHidden(self): # init hidden layer with zeros
        return torch.zeros(1, self.hidden_size)

In [26]:
# build the network, setting a few parameters
n_hidden = 128 # size of hidden layer
n_letters = len(all_letters) # size of alphabet


rnn = RNN(n_letters, n_hidden, n_categories) # n_categories is defined above

Let's look at an example of a single forward pass of the network, with a letter tensor and an all-zero ("empty") hidden state as input:

In [33]:
input = letterToTensor('A')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input, hidden)

print(output)

tensor([[-2.9188, -2.9809, -2.8579, -2.8908, -2.9275, -2.8592, -2.9753, -2.8142,
         -2.9296, -2.9269, -2.8196, -2.9079, -2.9008, -2.7968, -2.8994, -2.8143,
         -2.9405, -2.8920]], grad_fn=<LogSoftmaxBackward>)


Rather than creating a new tensor for every letter, it is more efficient to compute the word tensor once and slice it for each letter:

In [32]:
input = wordToTensor('Albert')
hidden = torch.zeros(1, n_hidden)

output, next_hidden = rnn(input[0], hidden)
print(output)

tensor([[-2.9188, -2.9809, -2.8579, -2.8908, -2.9275, -2.8592, -2.9753, -2.8142,
         -2.9296, -2.9269, -2.8196, -2.9079, -2.9008, -2.7968, -2.8994, -2.8143,
         -2.9405, -2.8920]], grad_fn=<LogSoftmaxBackward>)
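The two examples above feed only the first letter. To classify a whole name, the slices are fed one after another while the hidden state is carried from step to step, and the output after the last letter is read as the prediction. The following is a minimal self-contained sketch of that loop; it repeats the definitions from above so it can run on its own, fixes `n_categories = 18` instead of reading the data, and uses a seed only for repeatability. The network is untrained, so the predicted index is arbitrary.

```python
import string
import torch
import torch.nn as nn

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)
n_hidden = 128
n_categories = 18  # assumption: the 18 languages of the dataset

class RNN(nn.Module):  # same architecture as above
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.softmax(self.i2o(combined))
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

def wordToTensor(word):
    tensor = torch.zeros(len(word), 1, n_letters)
    for li, letter in enumerate(word):
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

torch.manual_seed(0)  # only to make the random weights repeatable
rnn = RNN(n_letters, n_hidden, n_categories)

word = wordToTensor('Albert')
hidden = rnn.initHidden()

with torch.no_grad():  # inference only, no gradients needed
    for i in range(word.size(0)):
        # feed one letter at a time, passing the hidden state along
        output, hidden = rnn(word[i], hidden)

# the largest log-probability after the last letter gives the predicted category
top_value, top_index = output.topk(1)
category_i = top_index[0].item()
print(output.shape, category_i)
```

With the data loaded as above, `all_categories[category_i]` would give the name of the predicted language.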


## Training

to do.