# Kuzushiji Character Recognition

## Imports and Configuration
First, we import our libraries and set random seeds so that our later train/test split is reproducible.

In [11]:
# import libraries
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
%load_ext autoreload
%autoreload 2




In [12]:
# reproducibility
np.random.seed(2021)
random.seed(2021)
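
As a quick sanity check on the seeding above, the sketch below reseeds both generators and confirms that the same shuffle comes back. The helper name `seeded_permutation` is hypothetical and only exists for this illustration; the seed value is the one used in the cell above.

```python
import random
import numpy as np

def seeded_permutation(seed, n=10):
    """Reseed both generators, then draw a permutation of range(n)."""
    np.random.seed(seed)   # NumPy's global generator
    random.seed(seed)      # Python's built-in generator
    return np.random.permutation(n)

# Two draws from the same seed should be identical
first = seeded_permutation(2021)
second = seeded_permutation(2021)
print(np.array_equal(first, second))
```

Note that this only covers code that draws from the two global generators; a library that keeps its own random state would need to be seeded separately.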

## Dataset
Our dataset (Kuzushiji-49) ships its data as NumPy arrays in a compressed `npz` format.

In [13]:
# load compressed numpy arrays

# train images
with np.load('./kuzushiji-49/k49-train-imgs.npz') as data:
    X_tr = data['arr_0']

# train labels
with np.load('./kuzushiji-49/k49-train-labels.npz') as data:
    Y_tr = data['arr_0']
    
# validation images
with np.load('./kuzushiji-49/k49-val-imgs.npz') as data:
    X_val = data['arr_0']

# validation labels
with np.load('./kuzushiji-49/k49-val-labels.npz') as data:
    Y_val = data['arr_0']
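
The four loads above share one pattern, so they could be folded into a small helper. This is a minimal sketch, assuming each archive holds a single array under NumPy's default key `arr_0` (as in the cells above). The name `load_npz_array` is hypothetical, and the demo writes a throwaway archive instead of touching the real dataset files.

```python
import os
import tempfile
import numpy as np

def load_npz_array(path, key="arr_0"):
    """Load one array from an .npz archive, closing the file afterwards."""
    with np.load(path) as data:
        return data[key]

# Stand-in data: three tiny 2x2 "images" stored as uint8
demo = np.arange(12, dtype=np.uint8).reshape(3, 2, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.npz")
    # Arrays passed positionally are saved under the keys arr_0, arr_1, ...
    np.savez_compressed(demo_path, demo)
    loaded = load_npz_array(demo_path)

print(loaded.shape, loaded.dtype)
```

With such a helper, each dataset split would become a single call, for example `X_tr = load_npz_array('./kuzushiji-49/k49-train-imgs.npz')`.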

If we take a look at our labels, we can see that they're already unsigned 8-bit integers (`uint8`), so no conversion is needed. We are also provided with a class map that maps each label to a Unicode character. We'll load it and write a helper function to convert a label to its character.

In [14]:
Y_val[:10]

array([19, 23, 10, 31, 26, 12, 24,  9, 24,  8], dtype=uint8)

In [15]:
# create lookup table + conversion function to convert label to Unicode char
lookup_df = pd.read_csv('./kuzushiji-49/k49_classmap.csv')
lookup_df = lookup_df[['codepoint', 'char']]

def label_to_char(label):
    # positional lookup: assumes row i of the class map describes label i
    return lookup_df.iloc[label]['char']

In [16]:
label_to_char(Y_val[:10])

19    と
23    ね
10    さ
31    み
26    ひ
12    す
24    の
9     こ
24    の
8     け
Name: char, dtype: object
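
The lookup above is positional, so it relies on the rows of the class map being ordered by label. A hedged alternative is sketched below: sort by an explicit label column once, then index a plain NumPy array of characters. The three-row `demo_map` is a stand-in built by hand for illustration, its `index` column is an assumption about how the class map identifies labels, and `labels_to_chars` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the class map, deliberately out of order
demo_map = pd.DataFrame({
    "index": [2, 0, 1],
    "codepoint": ["U+3046", "U+3042", "U+3044"],
    "char": ["う", "あ", "い"],
})

# Sort by label once so that position i holds the character for label i
chars = demo_map.sort_values("index")["char"].to_numpy()

def labels_to_chars(labels):
    """Map an array of integer labels to an array of characters."""
    return chars[np.asarray(labels)]

demo_labels = np.array([2, 0, 1, 2], dtype=np.uint8)
decoded = labels_to_chars(demo_labels)
print("".join(decoded))
```

This returns a bare array of characters rather than a pandas Series, which can be convenient when building plot titles.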

Next, let's look at the shape of our data and the number of samples we have:

In [17]:
print("# of training instances:", X_tr.shape[0])
print("# of validation instances:", X_val.shape[0])
print("Total # of instances:", X_tr.shape[0] + X_val.shape[0])
print("\n")
print("Shape of training instance features:", X_tr.shape[1:])
print("Shape of validation instance features:", X_val.shape[1:])

# of training instances: 232365
# of validation instances: 38547
Total # of instances: 270912


Shape of training instance features: (28, 28)
Shape of validation instance features: (28, 28)


Instead of keeping each instance as a (28, 28) array, let's flatten it to (784,) so that each image is a single feature vector and the arrays are easier to manipulate.

In [18]:
X_tr.shape

(232365, 28, 28)

In [19]:
X_tr = X_tr.reshape((X_tr.shape[0], 28*28))

In [20]:
X_tr.shape

(232365, 784)
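
To make the flattening concrete: NumPy reshapes in row-major order by default, so the pixel at row `r`, column `c` of an image ends up at position `r * 28 + c` of its flattened vector, and the original image can be recovered by reshaping back. The sketch below checks this on a small stand-in array (`demo_imgs` is made up for illustration) and uses `-1` so that no dimension has to be typed by hand.

```python
import numpy as np

# Three stand-in 28x28 "images" with distinct values in every pixel
demo_imgs = np.arange(3 * 28 * 28).reshape(3, 28, 28)

# Keep the sample axis and let NumPy infer the flattened length
flat = demo_imgs.reshape(demo_imgs.shape[0], -1)
print(flat.shape)

# Pixel (r, c) of image 1 sits at offset r * 28 + c in the flat vector
r, c = 5, 7
same_pixel = flat[1, r * 28 + c] == demo_imgs[1, r, c]

# Reshaping back restores the images, e.g. for plotting
restored = flat.reshape(-1, 28, 28)
print(same_pixel, np.array_equal(restored, demo_imgs))
```

The same call pattern would apply to `X_val`, which still has shape `(38547, 28, 28)` at this point in the notebook.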