# Prepare lastname -> target language data

This notebook is part of the article [Explaining RNNs without neural networks](https://explained.ai/rnn/index.html). Run it before the other notebooks, because it creates the files `data/X.pkl` and `data/y.pkl`.

The application is a classifier that maps last names (family names) to a nationality or language.

In [17]:
import pandas as pd
import numpy as np
import math
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, TensorDataset
from torch.nn.utils.rnn import pad_sequence
import torch.nn.functional as F
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
np.set_printoptions(precision=2, suppress=True, linewidth=3000, threshold=20000)
from typing import Sequence
import requests

dtype = torch.float

## Load

Let's download the [training](https://raw.githubusercontent.com/hunkim/PyTorchZeroToAll/master/data/names_train.csv.gz) data for last names.

In [18]:
!mkdir -p data
%cd data
!wget --unlink https://raw.githubusercontent.com/hunkim/PyTorchZeroToAll/master/data/names_train.csv.gz
!gzip --force -d names_train.csv.gz
%cd ..

/Users/parrt/github/ml-articles/rnn/notebooks/data
--2020-07-14 13:01:07--  https://raw.githubusercontent.com/hunkim/PyTorchZeroToAll/master/data/names_train.csv.gz
Resolving raw.githubusercontent.com... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 50237 (49K) [application/octet-stream]
Saving to: ‘names_train.csv.gz’


2020-07-14 13:01:07 (4.62 MB/s) - ‘names_train.csv.gz’ saved [50237/50237]

/Users/parrt/github/ml-articles/rnn/notebooks
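
The shell cell above depends on `wget` and `gzip` being installed. As a rough, portable alternative, the sketch below does the same download-and-decompress step with only the Python standard library. The helper names `gunzip_to_file` and `fetch_names` are hypothetical and not part of the original notebook; the URL and destination path are the ones used above.

```python
import gzip
import urllib.request
from pathlib import Path

# Same URL as the wget command in the cell above.
URL = "https://raw.githubusercontent.com/hunkim/PyTorchZeroToAll/master/data/names_train.csv.gz"

def gunzip_to_file(gz_bytes: bytes, dest: Path) -> Path:
    """Decompress gzip bytes and write the result to dest (hypothetical helper)."""
    dest.parent.mkdir(parents=True, exist_ok=True)  # like `mkdir -p data`
    dest.write_bytes(gzip.decompress(gz_bytes))
    return dest

def fetch_names(url: str = URL, dest: Path = Path("data/names_train.csv")) -> Path:
    """Download the gzipped CSV and store it uncompressed (hypothetical helper)."""
    with urllib.request.urlopen(url) as resp:
        return gunzip_to_file(resp.read(), dest)
```

Calling `fetch_names()` needs network access and should leave `data/names_train.csv` in place for the `pd.read_csv` call in the next cell.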


In [19]:
df_train = pd.read_csv("data/names_train.csv", header=None)
df_train.columns = ['name','language']

In [20]:
df_train.shape

(13374, 2)

In [21]:
df_train.head(2)

|     | name   | language |
|-----|--------|----------|
| 0   | Adsit  | Czech    |
| 1   | Ajdrna | Czech    |


In [22]:
df_train['language'].unique()

array(['Czech', 'German', 'Arabic', 'Japanese', 'Chinese', 'Vietnamese', 'Russian', 'French', 'Irish', 'English', 'Spanish', 'Greek', 'Italian', 'Portuguese', 'Scottish', 'Dutch', 'Korean', 'Polish'], dtype=object)

## Clean

In [23]:
badname = df_train['name']=='To The First Page' # wth?
df_train[badname].head(2)

|      | name              | language |
|------|-------------------|----------|
| 8340 | To The First Page | Russian  |
| 8341 | To The First Page | Russian  |


In [24]:
# probably destroying useful info, but make all lowercase for much smaller vocab
df_train['name'] = df_train['name'].str.lower()

## Split names into variable-length lists

In [25]:
X_train = [list(name) for name in df_train['name']]

In [26]:
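# names were lowercased earlier, so a mixed-case comparison would match nothing;
# compare against the lowercase form to actually drop the junk rows
df_train = df_train[df_train['name'].str.lower()!='to the first page']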

In [27]:
X, y = df_train['name'], df_train['language']
X = [list(name) for name in X]
X[0:2], y[0:2]

([['a', 'd', 's', 'i', 't'], ['a', 'j', 'd', 'r', 'n', 'a']], 0    Czech
 1    Czech
 Name: language, dtype: object)

## Encode target language (class)

Derive the categories from the training set only, not from the validation/test sets. Then apply those same categories when encoding the y of those sets.

In [28]:
y = y.astype('category').cat.as_ordered()
y_cats = y.cat.categories
y = y.cat.codes
y = y.values
y_cats

Index(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French', 'German',
       'Greek', 'Irish', 'Italian', 'Japanese', 'Korean', 'Polish',
       'Portuguese', 'Russian', 'Scottish', 'Spanish', 'Vietnamese'],
      dtype='object')
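
This notebook only has a training set, so the "apply to other sets" step never appears in it. A minimal sketch of what that step might look like, assuming toy `y_train` and `y_valid` series (both hypothetical), is to build a `pd.Categorical` for each set against the category list learned from training:

```python
import pandas as pd

# Toy stand-ins for the targets (hypothetical data, not from the real file).
y_train = pd.Series(['Czech', 'German', 'Arabic', 'Czech'])
y_valid = pd.Series(['German', 'Korean'])  # 'Korean' never appears in y_train

# Learn the (sorted) category list from the training targets only.
train_cats = y_train.astype('category').cat.as_ordered().cat.categories

# Encode both sets against that one list so a code means the same language everywhere.
train_codes = pd.Categorical(y_train, categories=train_cats, ordered=True).codes
valid_codes = pd.Categorical(y_valid, categories=train_cats, ordered=True).codes

print(list(train_cats))      # ['Arabic', 'Czech', 'German']
print(train_codes.tolist())  # [1, 2, 0, 1]
print(valid_codes.tolist())  # [2, -1]
```

A label missing from the training categories gets code `-1`, so such rows are easy to spot and handle before training or scoring.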

In [43]:
# in case we need it actually...
# use a separate name so the integer codes in y (saved below) are not overwritten
y_lang = df_train['language'].astype('category').cat.as_ordered()
y_cats = y_lang.cat.categories
lang2idx = {name:i for i,name in enumerate(y_cats)}
lang2idx

{'Arabic': 0,
 'Chinese': 1,
 'Czech': 2,
 'Dutch': 3,
 'English': 4,
 'French': 5,
 'German': 6,
 'Greek': 7,
 'Irish': 8,
 'Italian': 9,
 'Japanese': 10,
 'Korean': 11,
 'Polish': 12,
 'Portuguese': 13,
 'Russian': 14,
 'Scottish': 15,
 'Spanish': 16,
 'Vietnamese': 17}

## Save prepped X, y

In [29]:
import pickle

with open('data/X.pkl', 'wb') as f:
    pickle.dump(X, f)
with open('data/y.pkl', 'wb') as f:
    pickle.dump(y, f)
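
For reference, a later notebook could read the two files back with `pickle.load`. The sketch below is a round trip on toy data in a temporary directory, so it does not touch the real `data/` folder. `save_pickle`, `load_pickle`, `X_toy`, and `y_toy` are hypothetical names, and the toy `y` is a plain list rather than the NumPy array of codes saved above.

```python
import pickle
import tempfile
from pathlib import Path

# Toy data shaped like the notebook's X (lists of characters) and y (class codes).
X_toy = [['a', 'd', 's', 'i', 't'], ['a', 'j', 'd', 'r', 'n', 'a']]
y_toy = [2, 2]

def save_pickle(obj, path: Path) -> None:
    """Write obj to path with pickle (hypothetical helper)."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f)

def load_pickle(path: Path):
    """Read back an object written by save_pickle (hypothetical helper)."""
    with open(path, 'rb') as f:
        return pickle.load(f)

with tempfile.TemporaryDirectory() as tmp:
    save_pickle(X_toy, Path(tmp) / 'X.pkl')
    save_pickle(y_toy, Path(tmp) / 'y.pkl')
    X_back = load_pickle(Path(tmp) / 'X.pkl')
    y_back = load_pickle(Path(tmp) / 'y.pkl')

print(len(X_back), len(y_back))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.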