## Name classification with Keras

In this example we will try to reproduce the results presented in [this paper](https://www.ijcai.org/proceedings/2017/0289.pdf). 

The dataset used in this example can be downloaded from [here](https://www.dropbox.com/s/vx88k39dja9zcxj/data.tar.gz?dl=0).

Place the two files into a directory called `data`. In addition, create a second directory called `data_processed`. 

The directory structure should be:

```
.
├── data
│   ├── country2ethnicity.txt
│   └── countryResult.txt
├── data_processed
└── name_classification_rnn.ipynb
```

The authors classify names of athletes using a stack of LSTMs with a clever initialization.

Let's start by having a look at the data.

## 1-Data Preparation

In [27]:
import numpy as np
import pandas as pd
import pickle
import gensim
import nltk
import re
import os

from random import shuffle
from itertools import chain
from nltk import ngrams
from gensim.models import Word2Vec
from bs4 import BeautifulSoup
from tqdm import tqdm
from sklearn.model_selection import train_test_split

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, Flatten, LSTM, Dense
from keras.layers.merge import concatenate
from keras.models import Model
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.metrics import top_k_categorical_accuracy

In [2]:
raw_data_file = "./data/countryResult.txt"
data_dir = "./data_processed"
# read the raw file and close it once done; one record per line
with open(raw_data_file) as f:
    dataset = f.read().strip().split('\n')

We have 31595 records.

In [3]:
print(len(dataset))
dataset[:5]

31595


['Belarus\tBeijing 2008\tsilver\t20.28\tAthletics\tNatallia MIKHNEVICH/',
 'Belarus\tVancouver 2010\tsilver\t48:32.0\tBiathlon\tSergey NOVIKOV/',
 'Belarus\tBeijing 2008\tsilver\t8551\tAthletics\tAndrei KRAUCHANKA/',
 'Belarus\tVancouver 2010\tgold\tFINAL\tFreestyle Skiing\tAlexei GRISHIN/',
 'Belarus\tBeijing 2008\tbronze\t81.51\tAthletics\tIvan TSIKHAN/']

Let's clean the data and build more functional objects/dictionaries.

In [4]:
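def clean_names(name):
    # strip any HTML markup (explicit parser avoids the bs4 parser warning)
    name_text = BeautifulSoup(name, "html.parser").get_text()
    # keep only letters, apostrophes and dots; everything else becomes a space
    name_text = re.sub(r"[^a-zA-Z'.]", " ", name_text)
    # collapse repeated spaces and trim the ends
    name_text = re.sub(" +", " ", name_text)
    name_text = name_text.strip()
    # 'Natallia MIKHNEVICH' -> 'Natallia Mikhnevich'
    clean_name = name_text.title()
    return clean_name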
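
To see why this cleaning is fairly rough, here is a minimal regex-only sketch of the same steps (it skips the HTML stripping, and the function name and the sample names are hypothetical). Note how accented letters and hyphens are replaced by spaces, which splits or shortens some names.

```python
import re

def clean_name_sketch(raw_name):
    """Regex-only version of the cleaning steps (no HTML stripping)."""
    # everything except ASCII letters, apostrophes and dots becomes a space
    text = re.sub(r"[^a-zA-Z'.]", " ", raw_name)
    # collapse runs of spaces and trim the ends
    text = re.sub(r" +", " ", text).strip()
    # capitalise the first letter of every word
    return text.title()

# hypothetical inputs, chosen only to illustrate the behaviour
print(clean_name_sketch("Natallia MIKHNEVICH"))         # Natallia Mikhnevich
print(clean_name_sketch("  Jean-Pierre  DUPONT (2) "))  # Jean Pierre Dupont
print(clean_name_sketch("José PÉREZ"))                  # Jos P Rez
```

The last example shows the kind of damage a more careful cleaning step (for instance one that transliterates accented characters first) could avoid.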

In [5]:
name2country = dict()
name2year = dict()
for line in tqdm(dataset):
    try:
        country, olympic_year, medal, record, sports, names_raw = line.split('\t')
    except ValueError:
        # skip rows that do not have exactly six tab-separated fields
        continue
    country = country.replace(',', ' ')
    country = country.strip()
    # In the Olympics there are teams (i.e. more than one individual per row)
    if len(names_raw.split('/')) >= 2:
        names = names_raw.split('/')
        names = [n for n in names if n!=""]
        for name in names:
            c_name = clean_names(name)
            if c_name in name2country:
                # Some athletes change countries. We keep the most recent nationality
                if country != name2country[c_name]:
                    previous_year = int(name2year[c_name].split(' ')[-1])
                    current_year  = int(olympic_year.split(' ')[-1])
                    if current_year <= previous_year:
                        # the stored record is at least as recent: keep it
                        continue
            name2country[c_name] = country
            name2year[c_name] = olympic_year

100%|██████████| 31595/31595 [01:29<00:00, 351.43it/s]
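
The "keep the most recent nationality" rule in the loop above can be isolated in a small sketch. This is an illustration on hypothetical toy records with integer years, not the notebook's code: a record only overwrites the stored one when the name is new or the year is strictly later.

```python
def keep_most_recent(records):
    """records: iterable of (name, country, year) tuples, in any order."""
    name2country, name2year = {}, {}
    for name, country, year in records:
        # overwrite only when the name is new or this record is more recent
        if name not in name2year or year > name2year[name]:
            name2country[name] = country
            name2year[name] = year
    return name2country

# hypothetical toy records, not taken from the dataset
records = [
    ("Ana Example", "Country A", 2004),
    ("Ana Example", "Country B", 2012),
    ("Ana Example", "Country A", 2008),
    ("Bo Sample", "Country C", 2000),
]
print(keep_most_recent(records))
# {'Ana Example': 'Country B', 'Bo Sample': 'Country C'}
```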


During the cleaning we lose approximately 14k observations. This is partly because the cleaning above is fairly "rough". I leave it to you to clean the data more carefully so that we keep more observations.

In [6]:
len(name2country)

17715

Let's have a look at the cleaned names.

In [7]:
name2country

{'Natallia Mikhnevich': 'Belarus',
 'Sergey Novikov': 'Belarus',
 'Andrei Krauchanka': 'Belarus',
 'Alexei Grishin': 'Belarus',
 'Ivan Tsikhan': 'Belarus',
 'Maryna Shkermankova': 'Belarus',
 'Vadim Devyatovskiy': 'Belarus',
 'Iryna Kulesha': 'Belarus',
 'Aksana Miankova': 'Belarus',
 'Darya Domracheva': 'Belarus',
 'Fernanda Ribeiro': 'Portugal',
 'Rui Silva': 'Portugal',
 'Nelson Evora': 'Portugal',
 'Rosa Mota': 'Portugal',
 'Jose Manuel Gentil Quina': 'Portugal',
 'Mario Gentil Quina': 'Portugal',
 'Armando Da Silva Marques': 'Portugal',
 'Fernando Silva Paes': 'Portugal',
 'Francisco Valadas': 'Portugal',
 'Luiz Silva': 'Portugal',
 'Sergio Paulinho': 'Portugal',
 'Nuno Barreto': 'Portugal',
 'Victor Hugo Rocha': 'Portugal',
 'Emanuel Silva': 'Portugal',
 'Fernando Pimenta': 'Portugal',
 'Domingos De Sousa Coutinho Marques Do Funchal': 'Portugal',
 'Jose Beltrao': 'Portugal',
 'Carlos Lopes': 'Portugal',
 'Francis Obikwelu': 'Portugal',
 "Duarte M.D'Almeida Bello": 'Portugal',
 'F

In [8]:
country2idx = dict([(cntr,i) for i,cntr in enumerate(sorted(set(name2country.values())))])
country2idx

{'Algeria': 0,
 'Argentina': 1,
 'Armenia': 2,
 'Australasia (1908-1912)': 3,
 'Australia': 4,
 'Austria': 5,
 'Azerbaijan': 6,
 'Bahamas': 7,
 'Belarus': 8,
 'Belgium': 9,
 'Brazil': 10,
 'Bulgaria': 11,
 'Canada': 12,
 'Chile': 13,
 'Chinese Taipei': 14,
 'Colombia': 15,
 'Croatia': 16,
 'Cuba': 17,
 'Czech Republic': 18,
 'Czechoslovakia': 19,
 "Democratic People's Republic Of Korea": 20,
 'Denmark': 21,
 'Egypt': 22,
 'Estonia': 23,
 'Ethiopia': 24,
 'Federal Republic Of Germany (1950-1990  "GER" Since)': 25,
 'Finland': 26,
 'France': 27,
 'Georgia': 28,
 'German Democratic Republic (1955-1990': 29,
 'Germany': 30,
 'Great Britain': 31,
 'Greece': 32,
 'Guatemala': 33,
 'Guyana': 34,
 'Haiti': 35,
 'Hong Kong  China': 36,
 'Hungary': 37,
 'Iceland': 38,
 'Independant Participant': 39,
 'India': 40,
 'Indonesia': 41,
 'Iraq': 42,
 'Ireland': 43,
 'Islamic Republic Of Iran': 44,
 'Israel': 45,
 'Italy': 46,
 'Jamaica': 47,
 'Japan': 48,
 'Kazakhstan': 49,
 'Kenya': 50,
 'Kuwait': 51

Save the results.

In [9]:
pickle.dump(name2country, open(os.path.join(data_dir, 'name2country.p'), 'wb'))
pickle.dump(country2idx, open(os.path.join(data_dir,'country2idx.p'), 'wb'))

Let's now define a helper function to get the n-grams of a name. We will see later what these are used for.

In [10]:
def get_ngram(corpus, n):
    n_grams = set()
    for strg in corpus:
        ngram_gen = ngrams(strg,n)
        for n_gram in ngram_gen:
            n_grams.add("".join(n_gram))
    return list(n_grams)

In [11]:
print(get_ngram(['javier'], 2))
print(get_ngram(['javier'], 3))

['vi', 'av', 'ja', 'er', 'ie']
['avi', 'ier', 'jav', 'vie']
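
Because `get_ngram` collects the n-grams in a set, the order of the printed lists is arbitrary. For reference, here is a minimal pure-Python sketch (the function name is hypothetical) that produces the character n-grams of a single string in order of appearance, which is what the sequence-building steps further down rely on.

```python
def char_ngrams(text, n):
    """Return the character n-grams of `text` in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("javier", 2))  # ['ja', 'av', 'vi', 'ie', 'er']
print(char_ngrams("javier", 3))  # ['jav', 'avi', 'vie', 'ier']
```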


In [12]:
all_names = name2country.keys()
unigrams = sorted(list(set(" ".join(all_names))))
bigrams  = sorted(get_ngram(all_names, 2))
trigrams = sorted(get_ngram(all_names, 3))
unigram2idx = dict([(ng, i) for i,ng in enumerate(unigrams)])
bigram2idx  = dict([(ng, i) for i,ng in enumerate(bigrams)])
trigram2idx = dict([(ng, i) for i,ng in enumerate(trigrams)])

Save the results.

In [13]:
pickle.dump(unigram2idx, open(os.path.join(data_dir,'unigram2idx.p'), 'wb'))
pickle.dump(bigram2idx, open(os.path.join(data_dir,'bigram2idx.p'), 'wb'))
pickle.dump(trigram2idx, open(os.path.join(data_dir,'trigram2idx.p'), 'wb'))

With the aim of preserving order, let's move from dictionaries to sorted lists.

In [14]:
# we need to preserve order, so dictionaries are not good...
tmp = list(name2country.items())
tmp = sorted(tmp, key=lambda tmp: tmp[0])
all_names, all_countries = [], []
for n, c in tmp:
    all_names.append(n)
    all_countries.append(c)
all_names[:10]

['..... Daumain',
 'A Lam Shin',
 'A. Albert',
 'A. B Gli',
 'A. B. Zumelzu',
 'A. Faehlmann',
 'A. Fasani',
 'A. Fauquet Lemaitre',
 'A. Ferraris',
 'A. Gilpin']

Let's build our corpus of n-grams.

In [15]:
# Build corpus of ngrams with n=1,2,3
unig_corpus = [list((''.join(ng) for ng in ngrams(name, 1)))
               for name in all_names]
bigr_corpus = [list((''.join(ng) for ng in ngrams(name, 2)))
               for name in all_names]
trig_corpus = [list((''.join(ng) for ng in ngrams(name, 3)))
               for name in all_names]
bigr_corpus[:5]

[['..', '..', '..', '..', '. ', ' D', 'Da', 'au', 'um', 'ma', 'ai', 'in'],
 ['A ', ' L', 'La', 'am', 'm ', ' S', 'Sh', 'hi', 'in'],
 ['A.', '. ', ' A', 'Al', 'lb', 'be', 'er', 'rt'],
 ['A.', '. ', ' B', 'B ', ' G', 'Gl', 'li'],
 ['A.', '. ', ' B', 'B.', '. ', ' Z', 'Zu', 'um', 'me', 'el', 'lz', 'zu']]

Next, we numerically encode the sequences using the `unigram2idx`, `bigram2idx` and `trigram2idx` dictionaries.

In [16]:
unig_seq = [list(unigram2idx[gram] for gram in name)
            for name in unig_corpus]
bigr_seq = [list(bigram2idx[gram] for gram in name)
            for name in bigr_corpus]
trig_seq = [list(trigram2idx[gram] for gram in name)
            for name in trig_corpus]
bigr_seq[:5]

[[45, 45, 45, 45, 44, 4, 127, 533, 1019, 820, 521, 728],
 [66, 12, 269, 525, 819, 19, 400, 699, 728],
 [67, 44, 1, 79, 795, 544, 632, 947],
 [67, 44, 2, 94, 7, 188, 802],
 [67, 44, 2, 95, 44, 26, 507, 1019, 824, 626, 818, 1130]]
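
The whole path from names to integer sequences can be followed on a tiny example. The sketch below uses two hypothetical names and a hypothetical `char_ngrams` helper instead of `nltk`, but applies the same idea: build a sorted vocabulary of bigrams, map each bigram to its position, and encode every name as a list of positions.

```python
def char_ngrams(text, n):
    """Return the character n-grams of `text` in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

names = ["Ana", "Anna"]  # hypothetical toy corpus

# vocabulary: every distinct bigram, sorted so the mapping is reproducible
vocab = sorted({gram for name in names for gram in char_ngrams(name, 2)})
bigram2idx_toy = {gram: i for i, gram in enumerate(vocab)}

# each name becomes the list of indices of its bigrams, in order
encoded = [[bigram2idx_toy[gram] for gram in char_ngrams(name, 2)]
           for name in names]

print(bigram2idx_toy)  # {'An': 0, 'na': 1, 'nn': 2}
print(encoded)         # [[0, 1], [0, 2, 1]]
```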

When passing the data to the network, we need to ensure that all sequences have the same length (when using PyTorch one could get around this constraint).

In [21]:
MAX_SEQUENCE_LENGTH = 30
unig_X = np.vstack(pad_sequences(unig_seq, MAX_SEQUENCE_LENGTH))
bigr_X = np.vstack(pad_sequences(bigr_seq, MAX_SEQUENCE_LENGTH))
trig_X = np.vstack(pad_sequences(trig_seq, MAX_SEQUENCE_LENGTH))
print(bigr_X.shape)
bigr_X[0, :]

(17715, 30)


array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   45,   45,   45,   45,
         44,    4,  127,  533, 1019,  820,  521,  728], dtype=int32)
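
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than the maximum length, drops items from the front. The pure-Python sketch below mimics that behaviour for one sequence. It also includes a hypothetical `shift_indices` helper: since the `*2idx` dictionaries above start counting at 0, the padding value coincides with the index of the first n-gram, and shifting every index by one is one possible way to keep 0 free for padding. This is an optional variation, not what the notebook does.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad (and left-truncate) one sequence to exactly `maxlen` items."""
    seq = list(seq)[-maxlen:]  # keep only the last `maxlen` items
    return [value] * (maxlen - len(seq)) + seq

def shift_indices(ngram2idx):
    """Hypothetical helper: move every index up by one so 0 is free for padding."""
    return {ngram: idx + 1 for ngram, idx in ngram2idx.items()}

print(pad_pre([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_pre([1, 2, 3, 4], 2))        # [3, 4]
print(shift_indices({" ": 0, "a": 1})) # {' ': 1, 'a': 2}
```

If the indices were shifted this way, the embedding matrices built later would need one extra row for the padding index.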

Let's build our target, nationality/ethnicity.

In [22]:
country2ethnicity = pd.read_csv('data/country2ethnicity.txt', header=None, names=['country', 'ethnicity'])
country2ethnicity.head()

|   | country | ethnicity |
|---|---|---|
| 0 | Algeria | ARA |
| 1 | Argentina | SPA |
| 2 | Armenia | EEU |
| 3 | Australasia (1908-1912) | ELSE |
| 4 | Australia | ENG |


In [23]:
ethnicity2idx = sorted(country2ethnicity.ethnicity.unique())
ethnicity2idx = dict([(e, i) for i, e in enumerate(ethnicity2idx)])
pickle.dump(ethnicity2idx, open('data_processed/ethnicity2idx.p', 'wb'))
ethnicity2idx

{'ARA': 0,
 'CEA': 1,
 'CHI': 2,
 'EEU': 3,
 'ELSE': 4,
 'ENG': 5,
 'FRA': 6,
 'GER': 7,
 'GRE': 8,
 'IND': 9,
 'ITA': 10,
 'JAP': 11,
 'KOR': 12,
 'NEU': 13,
 'NHL': 14,
 'POR': 15,
 'RUS': 16,
 'SPA': 17}

When compiling the model, our loss will be `categorical_crossentropy`. This setup needs one-hot encoded categories. Keras makes our life easy, as this can be done in a one-liner.

In [24]:
country2ethnicity = country2ethnicity.replace({'ethnicity': ethnicity2idx})
country2ethnicity = pd.Series(
    country2ethnicity.ethnicity.values,
    country2ethnicity.country.values
    ).to_dict()
Y = np.array([country2ethnicity[c] for c in all_countries])
Y[:10]

array([ 6, 12,  6,  7, 17,  3,  6,  4, 10,  4])

In [25]:
Y = to_categorical(Y)
Y[:10]

array([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 1.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0.]], dtype=float32)
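
For intuition, one-hot encoding simply turns each class index into a row of zeros with a single 1 at that index. A minimal pure-Python sketch (the function name is hypothetical, and it returns plain lists rather than a NumPy array):

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows of length `num_classes`."""
    return [[1.0 if j == label else 0.0 for j in range(num_classes)]
            for label in labels]

rows = one_hot([6, 12], num_classes=18)
print(rows[0].index(1.0), rows[1].index(1.0))  # 6 12
```

When `to_categorical` is called without `num_classes`, it infers the number of columns from the largest label present, so passing `num_classes=len(ethnicity2idx)` explicitly is a safe habit in case the highest class happens to be absent from `Y`.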

And, as with any other ML problem, we do a train/test split.

In [26]:
unig_X_tr, unig_X_te = train_test_split(
    unig_X, test_size=0.3, random_state=1981)
bigr_X_tr, bigr_X_te = train_test_split(
    bigr_X, test_size=0.3, random_state=1981)
trig_X_tr, trig_X_te = train_test_split(
    trig_X, test_size=0.3, random_state=1981)
Y_tr, Y_te = train_test_split(Y, test_size=0.3, random_state=1981)
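
The four separate calls above stay aligned row by row only because every array has the same number of rows and every call uses the same `random_state`. The underlying idea is a single shared permutation of row indices, sketched below in pure Python with hypothetical helper names and toy data (this is not scikit-learn's exact algorithm). `train_test_split` also accepts several arrays in one call, which applies the same split to all of them.

```python
import random

def split_indices(n_samples, test_size, seed):
    """Shuffle the row indices once and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]

def take(rows, indices):
    """Select the rows at the given positions, preserving the given order."""
    return [rows[i] for i in indices]

# hypothetical toy data: two "feature" lists and one target list, same length
unig = [f"u{i}" for i in range(10)]
trig = [f"t{i}" for i in range(10)]
target = [f"y{i}" for i in range(10)]

train_idx, test_idx = split_indices(len(target), test_size=0.3, seed=1981)
unig_tr, trig_tr, target_tr = (take(rows, train_idx) for rows in (unig, trig, target))

# every training row still lines up with its own target
print(all(u[1:] == y[1:] for u, y in zip(unig_tr, target_tr)))  # True
```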

## 2-Build the model

The following figure illustrates the model implemented by [Lee et al.](https://www.ijcai.org/proceedings/2017/0289.pdf):

<img src="images/architecture.png" alt="drawing" width="450"/>

Here we will build a simplified version using a single LSTM layer per n-gram input.

The clever trick that the authors use consists in initialising the n-gram embeddings with the word2vec algorithm. As a result, the n-grams will be initialised based on their context (i.e. other n-grams surrounding them), which might speed up convergence.
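
To make "context" concrete, the sketch below lists the (center, context) pairs that a window of a given size produces for one name's bigrams. It is only an illustration with a hypothetical function name: the actual word2vec training adds sampling details on top of this pairing.

```python
def context_pairs(tokens, window):
    """List (center, neighbour) pairs for neighbours within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

bigrams = ["ja", "av", "vi", "ie", "er"]  # bigrams of 'javier'
print(context_pairs(bigrams, window=1))
# [('ja', 'av'), ('av', 'ja'), ('av', 'vi'), ('vi', 'av'),
#  ('vi', 'ie'), ('ie', 'vi'), ('ie', 'er'), ('er', 'ie')]
```

N-grams that tend to appear next to the same neighbours end up with similar vectors, which is the starting point handed to the embedding layers.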

To implement this initialization we will use the `gensim` package, which comes with a handy `Word2Vec` class.

In [28]:
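# let's define an initializer
def initializer(sequences, ngram2idx, emb_dim):
    # Word2Vec expects sequences of string tokens, so we cast the indices to str
    sequences = [list((str(idx) for idx in name)) for name in sequences]
    # note: in gensim >= 4.0, `size` and `iter` are called `vector_size` and `epochs`
    model = Word2Vec(sequences, size=emb_dim, window=5, min_count=0, iter=10)
    init = np.zeros((len(ngram2idx), emb_dim), dtype=np.float32)
    for ngram, idx in ngram2idx.items():
        # look the vector up through `model.wv` (indexing the model directly is deprecated)
        init[idx] = model.wv[str(idx)]
    return init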

Initialize weights

In [29]:
unig_emb_init = initializer(unig_seq, unigram2idx, 50)
bigr_emb_init = initializer(bigr_seq, bigram2idx,  100)
trig_emb_init = initializer(trig_seq, trigram2idx, 150)
unig_emb_init.shape


(55, 50)

Let's define the model.

In [30]:
# Input Layers
unig_inp = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
bigr_inp = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
trig_inp = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')

# Embeddings layers
unig_emb_layer = Embedding(len(unigram2idx),
                           50,
                           weights=[unig_emb_init],
                           input_length=MAX_SEQUENCE_LENGTH,
                           trainable=True)
bigr_emb_layer = Embedding(len(bigram2idx),
                           100,
                           weights=[bigr_emb_init],
                           input_length=MAX_SEQUENCE_LENGTH,
                           trainable=True)
trig_emb_layer = Embedding(len(trigram2idx),
                           150,
                           weights=[trig_emb_init],
                           input_length=MAX_SEQUENCE_LENGTH,
                           trainable=True)

# unigrams networks
unig_emb = unig_emb_layer(unig_inp)
unig_lstm = LSTM(128, dropout=0.3, recurrent_dropout=0.3)(unig_emb)

# bigrams networks
bigr_emb = bigr_emb_layer(bigr_inp)
bigr_lstm = LSTM(128, dropout=0.3, recurrent_dropout=0.3)(bigr_emb)

# trigrams networks
trig_emb = trig_emb_layer(trig_inp)
trig_lstm = LSTM(128, dropout=0.3, recurrent_dropout=0.3)(trig_emb)

# concatenate the output
allgrams = concatenate([unig_lstm, bigr_lstm, trig_lstm])

# final FC layer
preds = Dense(len(ethnicity2idx), activation='softmax')(allgrams)

model = Model([unig_inp, bigr_inp, trig_inp], preds)
model.summary()  # summary() prints the table itself and returns None

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            (None, 30)           0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            (None, 30)           0                                            
__________________________________________________________________________________________________
input_3 (InputLayer)            (None, 30)           0                                            
__________________________________________________________________________________________________
embedding_1 (Embedding)         (None, 30, 50)       2750        input_1[0][0]                    
__________________________________________________________________________________________________
embedding_
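
The parameter counts in the summary can be checked by hand. The sketch below assumes standard Keras `Embedding`, `LSTM` and `Dense` layers with biases. The function names are hypothetical, and only sizes that appear in this notebook are used (55 unigrams, embedding sizes 50/100/150, 128 LSTM units, 18 classes).

```python
def embedding_params(vocab_size, emb_dim):
    # one vector of size emb_dim per vocabulary entry
    return vocab_size * emb_dim

def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # one weight per input plus a bias, for every output unit
    return (input_dim + 1) * units

print(embedding_params(55, 50))    # 2750, matches embedding_1 above
print(lstm_params(50, 128))        # 91648  (unigram branch)
print(lstm_params(100, 128))       # 117248 (bigram branch)
print(lstm_params(150, 128))       # 142848 (trigram branch)
print(dense_params(3 * 128, 18))   # 6930   (softmax layer on the concatenation)
```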

Compile and run

In [32]:
def top_k_mod(y_true, y_pred, k=3):
    return top_k_categorical_accuracy(y_true, y_pred, k)

model.compile(loss='categorical_crossentropy',
              optimizer=Adam(lr=0.001),
              metrics=[top_k_mod])
model.fit([unig_X_tr, bigr_X_tr, trig_X_tr], Y_tr, batch_size=128, epochs=10)
_, top_k_acc = model.evaluate([unig_X_te, bigr_X_te, trig_X_te], Y_te)
print(top_k_acc)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
0.9063029162746943
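
Note that the number printed above is top-3 accuracy, not plain accuracy: a prediction counts as correct when the true class is among the three classes with the highest predicted probability. A minimal pure-Python sketch of the metric, with a hypothetical function name and toy probabilities:

```python
def top_k_accuracy(y_true, y_prob, k=3):
    """Fraction of samples whose true class is among the k most probable."""
    hits = 0
    for true_idx, probs in zip(y_true, y_prob):
        # class indices sorted from most to least probable, cut at k
        top_k = sorted(range(len(probs)), key=lambda j: probs[j], reverse=True)[:k]
        hits += true_idx in top_k
    return hits / len(y_true)

# hypothetical toy predictions over three classes
y_true = [1, 2, 2]
y_prob = [[0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5],
          [0.6, 0.3, 0.1]]

print(top_k_accuracy(y_true, y_prob, k=1))  # one of three samples is a hit
print(top_k_accuracy(y_true, y_prob, k=2))  # two of three samples are hits
```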


We are overfitting, and I am sure we could get better metrics. Try different architectures.