# Prepare lastname -> target language data

This notebook is part of the article [Explaining RNNs without neural networks](https://explained.ai/rnn/index.html) and should be run before the other notebooks because it creates the files `data/X.pkl` and `data/y.pkl`.

The application is a classifier that maps last names (family names) to a nationality or language.

In [1]:
import pandas as pd
import numpy as np
import math
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
np.set_printoptions(precision=2, suppress=True, linewidth=3000, threshold=20000)
from typing import Sequence

dtype = torch.float

## Load

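Let's download the [training data](https://raw.githubusercontent.com/hunkim/PyTorchZeroToAll/master/data/names_train.csv.gz) for last names. The link points to a gzip-compressed file, while the next cell reads the uncompressed `data/names_train.csv`, so unzip it into `data/` first.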

In [29]:
df_train = pd.read_csv("data/names_train.csv", header=None)
df_train.columns = ['name','language']

In [30]:
df_train.shape

(13374, 2)
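As a minimal sketch, assuming you would rather skip the manual unzip step: pandas can read a `.csv.gz` file directly, because `read_csv` infers the compression from the file extension by default. The two sample rows and the temporary file name below are illustrative stand-ins, not the real dataset.

```python
import gzip
import os
import tempfile

import pandas as pd

# Illustrative stand-in for names_train.csv.gz: two header-less rows.
sample = "Adsit,Czech\nAjdrna,Czech\n"

tmpdir = tempfile.mkdtemp()
gz_path = os.path.join(tmpdir, "names_sample.csv.gz")  # hypothetical file name
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write(sample)

# compression="infer" is the default; the .gz suffix triggers gzip decoding.
df_sample = pd.read_csv(gz_path, header=None, names=["name", "language"])
print(df_sample.shape)
```

Passing `names=` at read time has the same effect as assigning `df_train.columns` afterwards.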

In [31]:
df_train.head(2)

|   | name | language |
|---|------|----------|
| 0 | Adsit | Czech |
| 1 | Ajdrna | Czech |


## Clean

In [32]:
badname = df_train['name']=='To The First Page' # wth?
df_train[badname].head(2)

|      | name | language |
|------|------|----------|
| 8340 | To The First Page | Russian |
| 8341 | To The First Page | Russian |


In [33]:
# probably destroying useful info, but make all lowercase for much smaller vocab
df_train['name'] = df_train['name'].str.lower()
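
To see why lowercasing shrinks the vocabulary, the following sketch counts the distinct characters before and after lowercasing. The three names are made up for illustration and `vocab_of` is a hypothetical helper, not something the notebook defines.

```python
import pandas as pd

def vocab_of(names):
    """Return the sorted list of distinct characters across all names."""
    return sorted(set("".join(names)))

# Illustrative names only, not rows taken from the dataset.
names = pd.Series(["Adsit", "Ajdrna", "DeSantis"])

vocab_mixed = vocab_of(names)              # upper- and lowercase kept apart
vocab_lower = vocab_of(names.str.lower())  # case folded together
print(len(vocab_mixed), len(vocab_lower))
```

The trade-off is the one the comment above mentions: capitalization patterns such as the inner capital in a name are lost once everything is lowercase.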

## Split names into variable-length lists

In [34]:
X_train = [list(name) for name in df_train['name']]

In [35]:
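# Names were lowercased above, so compare case-insensitively;
# matching the original mixed-case string would remove nothing.
df_train = df_train[df_train['name'].str.lower() != 'to the first page']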

In [36]:
X, y = df_train['name'], df_train['language']
X = [list(name) for name in X]
X[0:2], y[0:2]

([['a', 'd', 's', 'i', 't'], ['a', 'j', 'd', 'r', 'n', 'a']], 0    Czech
 1    Czech
 Name: language, dtype: object)

## Encode target language (class)

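Derive the category list from the training set only, not from the validation or test sets. Then apply those same categories when encoding the labels of the other sets, so that each integer code refers to the same language everywhere.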

In [37]:
y = y.astype('category').cat.as_ordered()
y_cats = y.cat.categories
y = y.cat.codes
y = y.values
y_cats

Index(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish',
       'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese'],
      dtype='object')
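
Reusing the training categories on another split can be sketched as follows. This notebook only prepares the training set, so `y_valid` and the toy labels below are hypothetical. With explicit `categories`, `pd.Categorical` assigns the code -1 to any label that did not appear in training.

```python
import pandas as pd

# Illustrative training labels; in the notebook these come from df_train['language'].
y_train = pd.Series(["Czech", "Russian", "Arabic", "Czech"])
train_cats = y_train.astype("category").cat.as_ordered().cat.categories

# Hypothetical validation labels, including one language unseen in training.
y_valid = pd.Series(["Russian", "Czech", "Klingon"])

# Encode with the training categories so codes mean the same thing in both sets.
y_valid_codes = pd.Categorical(y_valid, categories=train_cats, ordered=True).codes
print(list(train_cats), list(y_valid_codes))
```

Checking for -1 codes before training is a cheap way to catch labels that the classifier could never predict.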

## Save prepped X, y

In [39]:
import pickle

with open('data/X.pkl', 'wb') as f:
    pickle.dump(X, f)
with open('data/y.pkl', 'wb') as f:
    pickle.dump(y, f)
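
A later notebook can read the two files back with `pickle.load`. The sketch below shows the round trip with toy data in a temporary directory, as an assumption-laden stand-in for the real `data/` folder; `X_demo` and `y_demo` are illustrative names.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the notebook's X (list of character lists) and y (int codes).
X_demo = [list("adsit"), list("ajdrna")]
y_demo = np.array([2, 2], dtype=np.int8)

outdir = tempfile.mkdtemp()  # the notebook writes to data/ instead
for fname, obj in [("X.pkl", X_demo), ("y.pkl", y_demo)]:
    with open(os.path.join(outdir, fname), "wb") as f:
        pickle.dump(obj, f)

# Round trip: what a downstream notebook would do to reload the data.
with open(os.path.join(outdir, "X.pkl"), "rb") as f:
    X_loaded = pickle.load(f)
with open(os.path.join(outdir, "y.pkl"), "rb") as f:
    y_loaded = pickle.load(f)

print(len(X_loaded), y_loaded.dtype)
```

Because pickle can execute code on load, only unpickle files you created yourself or otherwise trust.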