## Name classification with Keras

In this example we will try to reproduce the results presented in [this paper](https://www.ijcai.org/proceedings/2017/0289.pdf). 

The dataset used in this example can be downloaded from [here](https://www.dropbox.com/sh/ksvlzdh72ildhih/AAB5XhAwzo18CsoMDweaj11fa?dl=0).

Place the two files into a directory called `data`. In addition, create a second directory called `data_processed`. 

The directory structure should be:

```
├── Step01_preprocess_names.ipynb
├── Step02_Build_Model.ipynb
├── Step03_classify_name.ipynb
├── data
│   ├── country2ethnicity.txt
│   └── countryResult.txt
└── data_processed
```

The authors classify the names of athletes using a stack of LSTMs with a clever initialization.

Let's start by having a look at the data.

## 1-Data Preparation

In [63]:
import numpy as np
import pandas as pd
import pickle
import gensim
import nltk
import re
import os

from random import shuffle
from itertools import chain
from nltk import ngrams
from gensim.models import Word2Vec
from bs4 import BeautifulSoup
from tqdm import tqdm
from sklearn.model_selection import train_test_split

from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, Flatten, LSTM, Dense
from keras.layers.merge import concatenate
from keras.models import Model
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.metrics import top_k_categorical_accuracy

In [64]:
raw_data_file = "./data/countryResult.txt"
data_dir = "./data_processed"
# Use a context manager so the file handle is closed after reading
with open(raw_data_file) as f:
    dataset = f.read().strip().split('\n')

The file contains 31595 records.

In [65]:
print(len(dataset))
dataset[:5]

31595


['Belarus\tBeijing 2008\tsilver\t20.28\tAthletics\tNatallia MIKHNEVICH/',
 'Belarus\tVancouver 2010\tsilver\t48:32.0\tBiathlon\tSergey NOVIKOV/',
 'Belarus\tBeijing 2008\tsilver\t8551\tAthletics\tAndrei KRAUCHANKA/',
 'Belarus\tVancouver 2010\tgold\tFINAL\tFreestyle Skiing\tAlexei GRISHIN/',
 'Belarus\tBeijing 2008\tbronze\t81.51\tAthletics\tIvan TSIKHAN/']

Let's clean the data and build some more convenient lookup dictionaries.

In [66]:
remove_chars = [':', '©', '¶']
def clean_names(name):
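    # Name the parser explicitly to avoid the BeautifulSoup "no parser specified" warning
    name_text = BeautifulSoup(name, "html.parser").get_text()
    # Keep only ASCII letters, apostrophes and periods
    name_text = re.sub(r"[^a-zA-Z'.]", " ", name_text)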
    name_text = re.sub(" +"," ",name_text)
    name_text = name_text.strip()
    clean_name = name_text.title()
    return clean_name
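
To make the effect of `clean_names` concrete, here is a minimal, regex-only sketch of the same steps. It deliberately skips the BeautifulSoup HTML stripping, and the function name `clean_name_sketch` and the sample inputs are made up for illustration.

```python
import re

def clean_name_sketch(raw):
    """Regex-only approximation of clean_names (no HTML stripping)."""
    # Replace anything that is not an ASCII letter, apostrophe or period with a space
    text = re.sub(r"[^a-zA-Z'.]", " ", raw)
    # Collapse runs of spaces and trim both ends
    text = re.sub(r" +", " ", text).strip()
    # Capitalise the first letter of every word
    return text.title()

examples = [
    "Natallia MIKHNEVICH",
    "  Sergey   NOVIKOV ",
    "Ivan TSIKHAN (2)",      # hypothetical trailing annotation
    "Bj\u00f6rn Borg",       # "Björn Borg": the accented letter is not in a-zA-Z
]
cleaned = [clean_name_sketch(e) for e in examples]
print(cleaned)
# ['Natallia Mikhnevich', 'Sergey Novikov', 'Ivan Tsikhan', 'Bj Rn Borg']
```

The last example shows one way in which this cleaning is rough: an accented letter is replaced by a space, which splits the name into two tokens.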

In [67]:
name2country = dict()
name2year = dict()
for line in tqdm(dataset):
    try:
        country, olympic_year, medal, record, sports, names_raw = line.split('\t')
        country = country.replace(',', ' ')
        country = country.strip()
    except ValueError:
        # Skip malformed rows; otherwise values from the previous row would be reused
        continue
    # In the Olympics there are teams (i.e. more than one individual per row)
    if len(names_raw.split('/')) >= 2:
        names = names_raw.split('/')
        names = [n for n in names if n != ""]
        for name in names:
            c_name = clean_names(name)
            if c_name in name2country:
                # Some athletes change countries. We keep the most recent nationality
                if country != name2country[c_name]:
                    previous_year = int(name2year[c_name].split(' ')[-1])
                    current_year = int(olympic_year.split(' ')[-1])
                    # Skip this record unless it is newer than the stored one
                    if current_year <= previous_year:
                        continue
            name2country[c_name] = country
            name2year[c_name] = olympic_year

100%|██████████| 31595/31595 [01:34<00:00, 335.25it/s]
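
The "keep the most recent nationality" rule is easier to check on a toy example. The sketch below uses made-up records and a hypothetical helper, `latest_country`, and stores the year as an integer rather than as the raw games string. It illustrates the rule only and is not a copy of the loop above.

```python
# Hypothetical (country, games, name) records, not taken from the dataset
records = [
    ("Country A", "Athens 2004", "Jane Doe"),
    ("Country B", "Beijing 2008", "Jane Doe"),
    ("Country A", "Sydney 2000", "John Roe"),
]

def latest_country(records):
    """Map each name to the country of its most recent games."""
    name2country, name2year = {}, {}
    for country, games, name in records:
        # The year is the last space-separated token, e.g. "Beijing 2008" -> 2008
        year = int(games.split(" ")[-1])
        # Skip the record if we already hold an equally recent or newer one
        if name in name2year and name2year[name] >= year:
            continue
        name2country[name] = country
        name2year[name] = year
    return name2country

result = latest_country(records)
print(result)
# {'Jane Doe': 'Country B', 'John Roe': 'Country A'}
```

Because the comparison is made on the year, the result does not depend on the order in which the records are read.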


During the cleaning we lose approximately 14k observations. This is partly because the cleaning above is fairly rough. I leave it to you to clean the data more carefully so that more observations are kept.

In [68]:
len(name2country)

17715

Let's have a look at the cleaned names.

In [69]:
name2country

{'Natallia Mikhnevich': 'Belarus',
 'Sergey Novikov': 'Belarus',
 'Andrei Krauchanka': 'Belarus',
 'Alexei Grishin': 'Belarus',
 'Ivan Tsikhan': 'Belarus',
 'Maryna Shkermankova': 'Belarus',
 'Vadim Devyatovskiy': 'Belarus',
 'Iryna Kulesha': 'Belarus',
 'Aksana Miankova': 'Belarus',
 'Darya Domracheva': 'Belarus',
 'Fernanda Ribeiro': 'Portugal',
 'Rui Silva': 'Portugal',
 'Nelson Evora': 'Portugal',
 'Rosa Mota': 'Portugal',
 'Jose Manuel Gentil Quina': 'Portugal',
 'Mario Gentil Quina': 'Portugal',
 'Armando Da Silva Marques': 'Portugal',
 'Fernando Silva Paes': 'Portugal',
 'Francisco Valadas': 'Portugal',
 'Luiz Silva': 'Portugal',
 'Sergio Paulinho': 'Portugal',
 'Nuno Barreto': 'Portugal',
 'Victor Hugo Rocha': 'Portugal',
 'Emanuel Silva': 'Portugal',
 'Fernando Pimenta': 'Portugal',
 'Domingos De Sousa Coutinho Marques Do Funchal': 'Portugal',
 'Jose Beltrao': 'Portugal',
 'Carlos Lopes': 'Portugal',
 'Francis Obikwelu': 'Portugal',
 "Duarte M.D'Almeida Bello": 'Portugal',
 'F

In [89]:
country2idx = dict([(cntr,i) for i,cntr in enumerate(sorted(set(name2country.values())))])
country2idx

{'Algeria': 0,
 'Argentina': 1,
 'Armenia': 2,
 'Australasia (1908-1912)': 3,
 'Australia': 4,
 'Austria': 5,
 'Azerbaijan': 6,
 'Bahamas': 7,
 'Belarus': 8,
 'Belgium': 9,
 'Brazil': 10,
 'Bulgaria': 11,
 'Canada': 12,
 'Chile': 13,
 'Chinese Taipei': 14,
 'Colombia': 15,
 'Croatia': 16,
 'Cuba': 17,
 'Czech Republic': 18,
 'Czechoslovakia': 19,
 "Democratic People's Republic Of Korea": 20,
 'Denmark': 21,
 'Egypt': 22,
 'Estonia': 23,
 'Ethiopia': 24,
 'Federal Republic Of Germany (1950-1990  "GER" Since)': 25,
 'Finland': 26,
 'France': 27,
 'Georgia': 28,
 'German Democratic Republic (1955-1990': 29,
 'Germany': 30,
 'Great Britain': 31,
 'Greece': 32,
 'Guatemala': 33,
 'Guyana': 34,
 'Haiti': 35,
 'Hong Kong  China': 36,
 'Hungary': 37,
 'Iceland': 38,
 'Independant Participant': 39,
 'India': 40,
 'Indonesia': 41,
 'Iraq': 42,
 'Ireland': 43,
 'Islamic Republic Of Iran': 44,
 'Israel': 45,
 'Italy': 46,
 'Jamaica': 47,
 'Japan': 48,
 'Kazakhstan': 49,
 'Kenya': 50,
 'Kuwait': 51

Save the results.

In [71]:
pickle.dump(name2country, open(os.path.join(data_dir, 'name2country.p'), 'wb'))
pickle.dump(country2idx, open(os.path.join(data_dir,'country2idx.p'), 'wb'))
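
When the later notebooks load these files back, a context manager makes sure each file handle is closed. The round trip below is a small sketch on a toy mapping written to a temporary directory. The toy values are invented and only the file name mirrors the one used above.

```python
import os
import pickle
import tempfile

# Toy stand-in for country2idx
toy_mapping = {"Belarus": 0, "Portugal": 1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "country2idx.p")
    # Write the dictionary; the file is closed when the block exits
    with open(path, "wb") as f:
        pickle.dump(toy_mapping, f)
    # Read it back the same way
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_mapping)  # True
```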

Let's now define a helper function that returns the set of character n-grams found in a list of names. We will see later what these are used for.

In [72]:
def get_ngram(corpus, n):
    n_grams = set()
    for strg in corpus:
        ngram_gen = ngrams(strg,n)
        for n_gram in ngram_gen:
            n_grams.add("".join(n_gram))
    return list(n_grams)

In [73]:
print(get_ngram(['javier'], 2))
print(get_ngram(['javier'], 3))

['ie', 'vi', 'ja', 'av', 'er']
['jav', 'avi', 'vie', 'ier']
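
Note that `get_ngram` collects the n-grams in a set, so the output is deduplicated and its order is arbitrary. For comparison, here is a plain-Python sketch of ordered character n-grams using a sliding window. The helper name `char_ngrams` is made up for this illustration and does not rely on `nltk`.

```python
def char_ngrams(text, n):
    """Return the character n-grams of text, in order of appearance."""
    # A string of length L has L - n + 1 windows of width n
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("javier", 2))  # ['ja', 'av', 'vi', 'ie', 'er']
print(char_ngrams("javier", 3))  # ['jav', 'avi', 'vie', 'ier']

# Repeated n-grams are kept in the ordered list but collapse in a set
ordered = char_ngrams("banana", 2)
print(len(ordered), len(set(ordered)))  # 5 3
```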


In [74]:
all_names = name2country.keys()
unigrams = sorted(list(set(" ".join(all_names))))
bigrams  = sorted(get_ngram(all_names, 2))
trigrams = sorted(get_ngram(all_names, 3))
unigram2idx = dict([(ng, i) for i,ng in enumerate(unigrams)])
bigram2idx  = dict([(ng, i) for i,ng in enumerate(bigrams)])
trigram2idx = dict([(ng, i) for i,ng in enumerate(trigrams)])
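
To see what these index dictionaries look like, here is a small sketch that builds them for two names only. `build_index` is a hypothetical helper written in plain Python. It follows the same idea as the cell above (sorted unique n-grams, numbered from 0) and is not the notebook's exact code.

```python
names = ["Ana Silva", "Rui Silva"]

def build_index(names, n):
    """Number the sorted unique character n-grams of all names from 0."""
    grams = set()
    for name in names:
        grams.update(name[i:i + n] for i in range(len(name) - n + 1))
    return {gram: i for i, gram in enumerate(sorted(grams))}

unigram_index = build_index(names, 1)
bigram_index = build_index(names, 2)

print(unigram_index)
# {' ': 0, 'A': 1, 'R': 2, 'S': 3, 'a': 4, 'i': 5, 'l': 6, 'n': 7, 'u': 8, 'v': 9}
print("a " in bigram_index)  # True: n-grams may span the space between words
```

Two details are worth keeping in mind: the space character is a regular symbol, and because it sorts before the letters it receives index 0.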

Save the results.

In [75]:
pickle.dump(unigram2idx, open(os.path.join(data_dir,'unigram2idx.p'), 'wb'))
pickle.dump(bigram2idx, open(os.path.join(data_dir,'bigram2idx.p'), 'wb'))
pickle.dump(trigram2idx, open(os.path.join(data_dir,'trigram2idx.p'), 'wb'))

## 2-Building the Model

To keep names and countries in a fixed, reproducible order, let's move from the dictionary to two parallel lists sorted by name.

In [87]:
# we need to preserve order, so dictionaries are not good...
tmp = list(name2country.items())
tmp = sorted(tmp, key=lambda item: item[0])
all_names, all_countries = [], []
for n, c in tmp:
    all_names.append(n)
    all_countries.append(c)
all_names[:10]

['..... Daumain',
 'A Lam Shin',
 'A. Albert',
 'A. B Gli',
 'A. B. Zumelzu',
 'A. Faehlmann',
 'A. Fasani',
 'A. Fauquet Lemaitre',
 'A. Ferraris',
 'A. Gilpin']
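
The same pair of aligned lists can also be obtained in a more compact way. The sketch below uses a made-up three-entry dictionary. It relies on the names being unique dictionary keys, so sorting the `(name, country)` pairs is equivalent to sorting by name.

```python
toy = {
    "Rui Silva": "Portugal",
    "Ivan Tsikhan": "Belarus",
    "Ana Lopes": "Portugal",
}

# Sort the (name, country) pairs, then unzip them into two aligned sequences
names, countries = (list(seq) for seq in zip(*sorted(toy.items())))

print(names)      # ['Ana Lopes', 'Ivan Tsikhan', 'Rui Silva']
print(countries)  # ['Portugal', 'Belarus', 'Portugal']
```

Position `i` in `names` always corresponds to position `i` in `countries`, which is what the model inputs and labels need.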

In [88]:
# Build corpus of ngrams with n=1,2,3
unig_corpus = [list((''.join(ng) for ng in ngrams(name, 1)))
               for name in all_names]
bigr_corpus = [list((''.join(ng) for ng in ngrams(name, 2)))
               for name in all_names]
trig_corpus = [list((''.join(ng) for ng in ngrams(name, 3)))
               for name in all_names]
bigr_corpus[:5]

[['..', '..', '..', '..', '. ', ' D', 'Da', 'au', 'um', 'ma', 'ai', 'in'],
 ['A ', ' L', 'La', 'am', 'm ', ' S', 'Sh', 'hi', 'in'],
 ['A.', '. ', ' A', 'Al', 'lb', 'be', 'er', 'rt'],
 ['A.', '. ', ' B', 'B ', ' G', 'Gl', 'li'],
 ['A.', '. ', ' B', 'B.', '. ', ' Z', 'Zu', 'um', 'me', 'el', 'lz', 'zu']]

In [84]:
unig_corpus

[['.', '.', '.', '.', '.', ' ', 'D', 'a', 'u', 'm', 'a', 'i', 'n'],
 ['A', ' ', 'L', 'a', 'm', ' ', 'S', 'h', 'i', 'n'],
 ['A', '.', ' ', 'A', 'l', 'b', 'e', 'r', 't'],
 ['A', '.', ' ', 'B', ' ', 'G', 'l', 'i'],
 ['A', '.', ' ', 'B', '.', ' ', 'Z', 'u', 'm', 'e', 'l', 'z', 'u'],
 ['A', '.', ' ', 'F', 'a', 'e', 'h', 'l', 'm', 'a', 'n', 'n'],
 ['A', '.', ' ', 'F', 'a', 's', 'a', 'n', 'i'],
 ['A',
  '.',
  ' ',
  'F',
  'a',
  'u',
  'q',
  'u',
  'e',
  't',
  ' ',
  'L',
  'e',
  'm',
  'a',
  'i',
  't',
  'r',
  'e'],
 ['A', '.', ' ', 'F', 'e', 'r', 'r', 'a', 'r', 'i', 's'],
 ['A', '.', ' ', 'G', 'i', 'l', 'p', 'i', 'n'],
 ['A', '.', ' ', 'G', 'u', 'e', 'r', 'r', 'i', 'e', 'r'],
 ['A', '.', ' ', 'H', 'a', 's', 'l', 'a', 'm'],
 ['A', '.', ' ', 'H', 'e', 'l', 'm', 'a', 'n'],
 ['A', '.', ' ', 'H', 'e', 'n', 'r', 'y', ' ', 'T', 'h', 'o', 'm', 'a', 's'],
 ['A', '.', ' ', 'L', 'a', 'w', 'r', 'e', 'y'],
 ['A', '.', ' ', 'M', 'a', 'r', 'a'],
 ['A', '.', ' ', 'M', 'a', 'r', 'i', 'a', 'c', 'h',

In [85]:
# Numerical-encoded sequences
unig_seq = [list(unigram2idx[gram] for gram in name) for name in unig_corpus]
bigr_seq = [list(bigram2idx[gram] for gram in name) for name in bigr_corpus]
trig_seq = [list(trigram2idx[gram] for gram in name)
            for name in trig_corpus]
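
One consequence of this encoding is that the three sequences for a given name have different lengths: a name with L characters yields L unigrams, L-1 bigrams and L-2 trigrams. The sketch below checks this on a single name, using toy indices built from that name alone (the notebook builds them from the full corpus). The helper `encode` is hypothetical.

```python
name = "A. Albert"

def encode(name, n, index):
    """Replace every character n-gram of name by its integer index."""
    return [index[name[i:i + n]] for i in range(len(name) - n + 1)]

# Toy indices built from this single name
indices = {}
for n in (1, 2, 3):
    grams = sorted({name[i:i + n] for i in range(len(name) - n + 1)})
    indices[n] = {gram: i for i, gram in enumerate(grams)}

encoded = {n: encode(name, n, indices[n]) for n in (1, 2, 3)}
lengths = {n: len(seq) for n, seq in encoded.items()}

print(encoded[1])  # [2, 1, 0, 2, 5, 3, 4, 6, 7]
print(lengths)     # {1: 9, 2: 8, 3: 7}
```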