# Homework: classify the origin of names using a character-level RNN

In this homework we will use an RNN-based model to perform classification. The goal is threefold:

1. Get more hands-on experience with the preprocessing needed to perform text classification from A to Z. No preprocessing is done for you!
2. Use embeddings and RNNs together at the character level to perform classification.
3. Write a function that takes a string as input and outputs the name of the predicted class.

Here are guidelines to help you through the steps:

1. Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed for fitting the model and training it to do classification.
2. Use the Keras tokenizer at the character level to tokenize your input into integer sequences.
3. Pad your sequences using the Keras preprocessing tools.
4. Build a model that uses, at a minimum, an embedding layer, an RNN (of your choice) and a dense layer to output the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate it on the test set.
6. Write a function that takes a string as input and predicts the origin (as its original string value).
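
To make guidelines 2 and 3 concrete, here is a minimal pure-Python sketch of what character-level tokenization and post-padding produce. It is a simplified stand-in, not the Keras implementation: the helper names `fit_char_index` and `encode_and_pad` are hypothetical, and ties between equally frequent characters are broken alphabetically here, which may differ from the Keras tokenizer.

```python
from collections import Counter

def fit_char_index(texts):
    """Map each character to an integer, most frequent first; 0 is kept for padding."""
    counts = Counter(ch for text in texts for ch in text.lower())
    # Sort by descending frequency, then alphabetically so ties are deterministic
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i + 1 for i, ch in enumerate(ordered)}

def encode_and_pad(texts, char_index):
    """Encode each text as integers and right-pad with zeros to the longest one."""
    sequences = [[char_index[ch] for ch in text.lower() if ch in char_index]
                 for text in texts]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

names = ["Ahn", "Baik", "Bang"]
char_index = fit_char_index(names)
padded = encode_and_pad(names, char_index)
print(char_index)
print(padded)
```

Every name ends up with the same length, which is what lets the names be stacked into a single batch for the RNN.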

In [None]:
%tensorflow_version 2.x
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf
import unicodedata
import string
import pandas as pd

In [None]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2020-11-10 22:02:10--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 13.32.204.49, 13.32.204.34, 13.32.204.65, ...
Connecting to download.pytorch.org (download.pytorch.org)|13.32.204.49|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2020-11-10 22:02:11 (80.9 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Korean.tx

In [None]:
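def read_files():
  """Read every names file and return (name, origin) pairs plus the list of origins."""
  data = []
  unique_origins = []
  for filename in glob('data/names/*.txt'):
    # The origin is the file name without its extension, e.g. 'Korean'
    origin = filename.split('/')[-1].split('.txt')[0]
    unique_origins.append(origin)
    # The files contain accented characters, so read them explicitly as UTF-8
    with open(filename, encoding='utf-8') as f:
      for name in f:
        data.append((name.strip(), origin))
  return data, unique_origins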

In [None]:
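def unicode_to_ascii(text):
  """Strip accents and drop any character outside a small ASCII alphabet."""
  all_letters = string.ascii_letters + " .,;'"
  return ''.join(
    # NFD splits an accented letter into the base letter plus a combining mark ('Mn')
    c for c in unicodedata.normalize('NFD', text)
    if unicodedata.category(c) != 'Mn'
    and c in all_letters
  )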
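
As a quick illustration of the normalization step, the sketch below repeats the same logic under a hypothetical name, `to_ascii`, so that it runs on its own. The sample names are made up for the example and are not taken from the dataset.

```python
import string
import unicodedata

ALLOWED = string.ascii_letters + " .,;'"

def to_ascii(text):
    # NFD splits "é" into "e" plus a combining accent (category 'Mn'), which is dropped
    return ''.join(
        c for c in unicodedata.normalize('NFD', text)
        if unicodedata.category(c) != 'Mn' and c in ALLOWED
    )

print(to_ascii('Ślusàrski'))
print(to_ascii("O'Néàl"))
print(to_ascii('Łukasz'))
```

Note that a character with no canonical decomposition, such as 'Ł', is removed entirely rather than mapped to a base letter.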

In [None]:
# Create dataset
data, unique_origins = read_files()
print('unique_origins:', unique_origins)
print('len(unique_origins):', len(unique_origins))
df = pd.DataFrame(data=data)
df.rename({0: 'Name', 1: 'Origin'}, axis=1, inplace=True)

# Create categories for y
df['Origin'] = df['Origin'].astype('category')
df['origin_cat'] = df['Origin'].cat.codes
df.head()

unique_origins: ['Korean', 'Greek', 'Chinese', 'Polish', 'French', 'Arabic', 'Scottish', 'Spanish', 'Portuguese', 'Vietnamese', 'German', 'Japanese', 'Italian', 'English', 'Russian', 'Irish', 'Czech', 'Dutch']
len(unique_origins): 18


   Name  Origin  origin_cat
0   Ahn  Korean          11
1  Baik  Korean          11
2  Bang  Korean          11
3  Byon  Korean          11
4   Cha  Korean          11


In [None]:
# Create category dictionary
y_dictionary = dict(enumerate(df['Origin'].cat.categories))
y_dictionary
# TODO try to_categorical once more

{0: 'Arabic',
 1: 'Chinese',
 2: 'Czech',
 3: 'Dutch',
 4: 'English',
 5: 'French',
 6: 'German',
 7: 'Greek',
 8: 'Irish',
 9: 'Italian',
 10: 'Japanese',
 11: 'Korean',
 12: 'Polish',
 13: 'Portuguese',
 14: 'Russian',
 15: 'Scottish',
 16: 'Spanish',
 17: 'Vietnamese'}
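
The integer codes above follow the alphabetical order of the origin names. The same mapping can be sketched without pandas, which may help when checking a label by hand. The names `origin_to_code` and `code_to_origin` are hypothetical and are not used elsewhere in the notebook.

```python
origins = ['Korean', 'Greek', 'Chinese', 'Polish', 'French', 'Arabic',
           'Scottish', 'Spanish', 'Portuguese', 'Vietnamese', 'German',
           'Japanese', 'Italian', 'English', 'Russian', 'Irish', 'Czech', 'Dutch']

# Sorting first makes the codes independent of the order the files were read in
origin_to_code = {origin: code for code, origin in enumerate(sorted(origins))}
code_to_origin = {code: origin for origin, code in origin_to_code.items()}

print(origin_to_code['Korean'])
print(code_to_origin[14])
```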

In [None]:
# Split data into train/test
# names, origins = zip(*data)
# names_train, names_test, origins_train, origins_test = train_test_split(names, origins, test_size=0.25, shuffle=True, random_state=123)
names_train, names_test, origins_train, origins_test = train_test_split(df['Name'], df['origin_cat'], test_size=0.25, shuffle=True, random_state=123)

In [None]:
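# Initialize the tokenizer and fit it on the names at the character level.
# Note: it is fit on the raw names, while prepare_x() tokenizes the
# ASCII-normalized names, so some vocabulary entries may never be used.
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)
tokenizer.fit_on_texts(df['Name'])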

In [None]:
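# Number of names the tokenizer was fit on
dataset_size = tokenizer.document_count
print('dataset_size:', dataset_size)
# Number of distinct characters seen by the tokenizer (the vocabulary size)
category_size = len(tokenizer.word_index)
print('category_size:', category_size)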

dataset_size: 20074
category_size: 58


In [None]:
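# Prepare X
def prepare_x(names):
  """Turn a pandas Series of names into a one-hot tensor of shape (n, max_len, category_size)."""
  # Normalize to plain ASCII so accented variants share the same characters
  ascii_names = names.apply(unicode_to_ascii)
  # One integer sequence per name (token indices start at 1; 0 is reserved for padding)
  sequences = tokenizer.texts_to_sequences(ascii_names)
  # Pad at the end with zeros, up to the longest name in this batch
  padded_names = tf.keras.preprocessing.sequence.pad_sequences(
    sequences, padding="post"
  )
  # depth=category_size covers indices 0..category_size-1; the padding index 0 gets
  # its own one-hot column, and an index equal to category_size would be all zeros
  return tf.one_hot(padded_names, depth=category_size)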
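
The last step of `prepare_x` can be pictured with a small pure-Python sketch. The `one_hot` helper below is a hypothetical stand-in for `tf.one_hot`, written only to show how a padded integer sequence turns into rows of zeros and ones.

```python
def one_hot(sequence, depth):
    """Return one row per index; row i has a 1.0 at position sequence[i]."""
    rows = []
    for index in sequence:
        row = [0.0] * depth
        # Indices outside 0..depth-1 are left as all-zero rows
        if 0 <= index < depth:
            row[index] = 1.0
        rows.append(row)
    return rows

padded = [2, 1, 0, 0]  # two characters followed by two padding zeros
encoded = one_hot(padded, depth=4)
for row in encoded:
    print(row)
```

The padding positions become rows with a 1 in the first column, which is why the trailing rows of each name in the encoded training data start with `1.`.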

In [None]:
X_train = prepare_x(names_train)
X_test = prepare_x(names_test)
y_train = origins_train
y_test = origins_test

In [None]:
y_train

8460      4
19082     8
14631    14
7387      4
9252      4
         ..
7763      4
15377    14
17730    14
15725    14
19966     3
Name: origin_cat, Length: 15055, dtype: int8

In [None]:
X_train

<tf.Tensor: shape=(15055, 19, 58), dtype=float32, numpy=
array([[[0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0., 0., ..., 0., 0., 0.]],

       ...,

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [1., 0., 0., ..., 0., 0., 0.],
        [1., 0

In [None]:
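model = tf.keras.models.Sequential([
  # The embedding layer is commented out: the model consumes the one-hot
  # vectors produced by prepare_x() directly.
  # tf.keras.layers.Embedding(input_dim=(category_size+1),
  #                           output_dim=64,
  #                           mask_zero=True),
  tf.keras.layers.LSTM(128, return_sequences=True, input_shape=[None, category_size]),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.LSTM(128),
  tf.keras.layers.Dropout(0.2),
  # Note: the output layer has category_size (58) units although there are only
  # len(y_dictionary) (18) origins; training still works because every label is below 58.
  tf.keras.layers.Dense(category_size, activation='softmax')
])
# Sparse loss: the targets are integer class codes, not one-hot vectors
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()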

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_4 (LSTM)                (None, None, 128)         95744     
_________________________________________________________________
dropout_2 (Dropout)          (None, None, 128)         0         
_________________________________________________________________
lstm_5 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dropout_3 (Dropout)          (None, 128)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 58)                7482      
Total params: 234,810
Trainable params: 234,810
Non-trainable params: 0
_________________________________________________________________
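
The parameter counts in the summary can be checked by hand. The sketch below assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias vector); the helper names `lstm_params` and `dense_params` are hypothetical.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input/output pair plus one bias per output unit
    return input_dim * units + units

first_lstm = lstm_params(input_dim=58, units=128)
second_lstm = lstm_params(input_dim=128, units=128)
dense = dense_params(input_dim=128, units=58)
total = first_lstm + second_lstm + dense
print(first_lstm, second_lstm, dense, total)
```

These values agree with the `Param #` column and the total reported by `model.summary()`.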


In [None]:
history = model.fit(X_train, y_train, batch_size=32, epochs=20, validation_split=0.2)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [None]:
predictions = np.argmax(model.predict(X_test), axis=-1)
predictions

array([14,  4,  4, ..., 14,  8,  4])

In [None]:
def convert_from_category_to_origin(y_category):
  return y_dictionary[y_category]

origins_test_original = [convert_from_category_to_origin(category) for category in origins_test]
prediction_origin = [convert_from_category_to_origin(category) for category in predictions]

print('origins_test_original[:5]', origins_test_original[:5])
print('prediction_origin[:5]', prediction_origin[:5])

origins_test_original[:5] ['Russian', 'German', 'Dutch', 'Czech', 'English']
prediction_origin[:5] ['Russian', 'English', 'English', 'Russian', 'English']


In [None]:
results = model.evaluate(X_test, y_test)
results



[0.6627506017684937, 0.7979677319526672]
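
The last guideline asks for a function that maps a single string to its predicted origin. The following is only a sketch of one possible shape for it, not the notebook's own solution: `predict_origin`, its `encode` parameter and the `FakeModel` stand-in are all hypothetical. In the notebook, `model` would be the trained model, `index_to_origin` would be `y_dictionary`, and `encode` could be something like `lambda names: prepare_x(pd.Series(names))`.

```python
def predict_origin(name, model, encode, index_to_origin):
    """Return the predicted origin string for a single name."""
    # The model expects a batch, so wrap the single name in a list
    batch = encode([name])
    probabilities = model.predict(batch)[0]
    # Index of the highest probability, i.e. the predicted class code
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return index_to_origin[int(best)]

class FakeModel:
    """Stand-in for a trained model that always returns the same probabilities."""
    def predict(self, batch):
        return [[0.1, 0.7, 0.2] for _ in batch]

labels = {0: 'Arabic', 1: 'Chinese', 2: 'Czech'}
result = predict_origin('Wang', FakeModel(), encode=lambda names: names,
                        index_to_origin=labels)
print(result)
```

One caveat when wiring this to the real model: `prepare_x` pads only to the longest name in the batch it receives, whereas the training names were padded to 19 characters without masking, so padding a single name to the same length may be needed for consistent predictions.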