# Kuzushiji-49: An Introduction to CNNs

## Part 3: Training an AutoML Image Classification Model using Google Cloud Platform

### Introduction

In [Part 1]() and [Part 2]() we covered the fundamentals of computer vision and built a simple image classifier with a convolutional neural network (CNN) in TensorFlow, running on Google Colab.

Now, because we are in the modern age of cloud computing and "self-serve AI", we don't actually need to train a model from scratch but can use managed services to do some [AutoML](https://en.wikipedia.org/wiki/Automated_machine_learning) for us. In this post I will do so using the [Vertex AI AutoML](https://cloud.google.com/vertex-ai/docs/beginner/beginners-guide) offering on Google Cloud Platform (GCP).

This post is in no way affiliated with or endorsed by Google. Although I chose GCP here, there are equivalent offerings from the other major cloud providers where the same kind of work can be done with little to no coding experience:
- [Amazon SageMaker Autopilot](https://aws.amazon.com/sagemaker/autopilot/)
- [Azure AutoML](https://azure.microsoft.com/en-ca/products/machine-learning/automatedml/)

I'm really curious to see a couple things:

1) Will Vertex AI's AutoML perform at all with the incredibly tiny images of this dataset?  
- I would assume it will, though using AutoML on such a tiny dataset is probably massively overkill. Still, in this post, I am taking on the role of a "citizen data scientist" without experience building models from scratch.  
2) Will there be any issues with using Japanese characters as class labels / file paths?  
- I would hope not, as I assume GCP is big in Japan.

### Prepping the Data

For computer vision models, Vertex AI [expects your data](https://cloud.google.com/vertex-ai/docs/image-data/classification/prepare-data#csv_1) to be individual files in Google Cloud Storage (GCS) and an associated CSV with their paths and class labels. 

This required writing some simple code to iterate over the dataset and export all the images, since our data was contained in numpy arrays.

First, we reload the data into memory:

In [2]:
# Reload the data
import numpy as np
import pandas as pd

# Train
X_train = np.load('data/k49-train-imgs.npz')['arr_0']
y_train = np.load('data/k49-train-labels.npz')['arr_0']

# Test
X_test = np.load('data/k49-test-imgs.npz')['arr_0']
y_test = np.load('data/k49-test-labels.npz')['arr_0']

# Classmap
classmap = pd.read_csv('data/k49_classmap.csv')

Then we use the classmap to create one output subfolder per character class:

In [49]:
# Create one subfolder per character class listed in the classmap
import os

for i in classmap['char']:

    # Generate the subfolder filepath
    filepath = f'output/img/{i}'

    # Create the folder, including the parent 'output/img' if it is missing
    os.makedirs(filepath, exist_ok=True)
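
On the question of Japanese characters in file paths, the local filesystem side is easy to test before anything touches GCP. The following is a minimal sketch, assuming a throwaway temporary directory and three sample hiragana standing in for the full classmap; `make_class_dirs` is a hypothetical helper, not part of the original notebook.

```python
import tempfile
from pathlib import Path


def make_class_dirs(root, labels):
    """Create one subfolder per label under root and return the created paths."""
    created = []
    for label in labels:
        path = Path(root) / label
        # parents=True builds missing parent folders; exist_ok=True makes reruns safe
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Sample labels only; the real list would come from classmap['char']
sample_labels = ['あ', 'い', 'う']

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'output' / 'img'
    dirs = make_class_dirs(root, sample_labels)
    # Running it a second time is a no-op rather than an error
    make_class_dirs(root, sample_labels)
    created_names = sorted(p.name for p in root.iterdir())

print(len(created_names), 'folders created')
```

If this runs cleanly, any remaining doubt about non-ASCII labels lies on the cloud side rather than in the export step.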

Finally, we write out the image files. This step took approximately 10 minutes:

In [70]:
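import matplotlib.pyplot as plt
from tqdm import tqdm

for i in tqdm(range(0, X_train.shape[0])):

    # Look up the character for this image's integer label
    label = classmap.loc[y_train[i], 'char']

    # Zero-padded index as the filename, e.g. 000042.png
    filename = f'output/img/{label}/{i:06}.png'

    # Save as a grayscale image using the reversed gray colormap
    plt.imsave(filename, X_train[i, :, :], cmap='gray_r')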


100%|█████████████████████████████████████████████████████████████████████████| 232365/232365 [09:59<00:00, 387.32it/s]


Next, we write out the class label file. This assumes the images will live in a storage bucket named `kuzushiji49-mharrison`:

In [24]:
base_path = 'gs://kuzushiji49-mharrison/'

# Write one "<GCS path>,<label>" row per training image
with open('kuzushiji49_gcp.csv', 'w', encoding='utf8') as f:

    for i in range(0, X_train.shape[0]):

        label = classmap.loc[y_train[i], 'char']

        filename = f'output/img/{label}/{i:06}.png'

        f.write(base_path + filename + ',' + label + '\n')
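
The string concatenation above works because none of the labels contain commas or quotes. A slightly more defensive sketch uses the standard library's `csv` module, which handles quoting when it is needed. The helper name `gcs_row` and the three sample labels are illustrative assumptions, and the rows are written to an in-memory buffer rather than to the real file:

```python
import csv
import io


def gcs_row(base_path, label, index):
    """Return the (GCS URI, label) pair for one image, mirroring the scheme above."""
    filename = f'output/img/{label}/{index:06}.png'
    return base_path + filename, label


# Illustrative labels; in the notebook these come from classmap and y_train
sample = ['ま', 'と', 'な']

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
for i, label in enumerate(sample):
    writer.writerow(gcs_row('gs://kuzushiji49-mharrison/', label, i))

csv_text = buffer.getvalue()
print(csv_text)
```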

Let's just double check the format of the output:

In [26]:
# Check
df = pd.read_csv('kuzushiji49_gcp.csv', header=None)

df.head()

| | 0 | 1 |
|---|---|---|
| 0 | gs://kuzushiji49-mharrison/output/img/ま/000000... | ま |
| 1 | gs://kuzushiji49-mharrison/output/img/と/000001... | と |
| 2 | gs://kuzushiji49-mharrison/output/img/な/000002... | な |
| 3 | gs://kuzushiji49-mharrison/output/img/ま/000003... | ま |
| 4 | gs://kuzushiji49-mharrison/output/img/く/000004... | く |
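
Beyond eyeballing the first few rows, a quick programmatic check can catch malformed lines before the file is handed to Vertex AI. This is a sketch under my own assumptions about what counts as valid here (two columns, a URI under the expected bucket ending in `.png`, and a label matching the folder name). It is not an official Vertex AI validator, and `find_bad_rows` is a hypothetical helper.

```python
import csv
import io


def find_bad_rows(lines, bucket_prefix):
    """Return (row_number, reason) for every row that fails a basic check."""
    problems = []
    for n, row in enumerate(csv.reader(lines), start=1):
        if len(row) != 2:
            problems.append((n, 'expected 2 columns'))
            continue
        uri, label = row
        if not uri.startswith(bucket_prefix):
            problems.append((n, 'unexpected bucket or scheme'))
        elif not uri.endswith('.png'):
            problems.append((n, 'not a .png path'))
        elif uri.split('/')[-2] != label:
            # Images were written to a folder named after their label
            problems.append((n, 'label does not match folder'))
    return problems


# Small in-memory sample with two deliberately broken rows; the real input
# would be open('kuzushiji49_gcp.csv', encoding='utf8')
sample = io.StringIO(
    'gs://kuzushiji49-mharrison/output/img/ま/000000.png,ま\n'
    'gs://kuzushiji49-mharrison/output/img/と/000001.png,な\n'
    'gs://other-bucket/output/img/く/000004.png,く\n'
)

bad_rows = find_bad_rows(sample, 'gs://kuzushiji49-mharrison/')
print(bad_rows)
```

Pointing `bucket_prefix` at the bucket that actually holds the images is a cheap way to confirm that the paths in the CSV and the uploaded objects agree.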


### Moving the Data to Google Cloud Storage

Now that we have the necessary files, we need to create our storage bucket and upload the files. This can be done through the browser, but is also easily accomplished from the command line using [`gsutil`](https://cloud.google.com/storage/docs/gsutil). 

As [suggested in the documentation](https://cloud.google.com/storage/docs/gsutil/addlhelp/GlobalCommandLineOptions), we will use the `-m` flag for multi-threaded copy and also the `-q` flag to suppress output, since we are copying a very large number of files:

```bash

# Set project
gcloud config set project brilliant-will-391123

# Create bucket
gsutil mb gs://kuzushiji-49-mharrison

# Upload the files
gsutil -mq cp -r output/img gs://kuzushiji-49-mharrison/

```
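
If the upload needs to be repeatable from a notebook or script, the same command can be assembled in Python and passed to `subprocess`. The sketch below only builds the argument list and leaves the actual call commented out, since running it requires `gsutil` to be installed and authenticated. `build_upload_command` is a hypothetical helper, and the flags are the ones used in the shell command above.

```python
import subprocess


def build_upload_command(source_dir, bucket_name):
    """Build the gsutil argument list for a quiet, parallel, recursive copy."""
    return [
        'gsutil',
        '-m',  # parallel copy
        '-q',  # suppress per-file output
        'cp',
        '-r',  # recurse into subfolders
        source_dir,
        f'gs://{bucket_name}/',
    ]


cmd = build_upload_command('output/img', 'kuzushiji-49-mharrison')
print(' '.join(cmd))

# Uncomment to run; check=True raises CalledProcessError on a non-zero exit
# subprocess.run(cmd, check=True)
```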

This step was quite time-consuming: the final bucket is only about 220 MB in total, but it is made up of a very large number of small files, and the upload took about an hour. Afterward, we can check that the operation succeeded by looking at the bucket contents in the console:

![](./img/gsbucket.png)

And also by using the command line:

```bash
gsutil du -sh gs://kuzushiji-49-mharrison/

221.63 MiB   gs://kuzushiji-49-mharrison

```
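
To compare against the `gsutil du` figure, the local export can be totalled with the standard library. This is a minimal sketch: `local_totals` is a hypothetical helper, and the demo runs on a temporary folder rather than the real `output/img` directory.

```python
import os
import tempfile


def local_totals(root):
    """Walk root and return (file_count, total_bytes)."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            count += 1
            total += os.path.getsize(os.path.join(dirpath, name))
    return count, total


# Demo on a throwaway folder holding two small files
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'a'))
    for parts, payload in [(('a', '1.png'), b'x' * 10), (('2.png',), b'y' * 5)]:
        with open(os.path.join(tmp, *parts), 'wb') as fh:
            fh.write(payload)
    n_files, n_bytes = local_totals(tmp)

# The du output above is in MiB, so convert bytes using 1024 * 1024
print(n_files, f'{n_bytes / 1024**2:.2f} MiB')
```

Calling `local_totals('output/img')` on the real export should give a file count equal to the number of training images and a size close to the bucket total.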

Finally, we copy the metadata file for Vertex AI to the root of the bucket:

```bash
gsutil cp kuzushiji49_gcp.csv gs://kuzushiji-49-mharrison/

Copying file://kuzushiji49_gcp.csv [Content-Type=application/vnd.ms-excel]...
- [1 files][ 12.8 MiB/ 12.8 MiB]
Operation completed over 1 objects/12.8 MiB.
```

Next, in the GCP console, we navigate to Vertex AI, and click 'Enable All Recommended APIs':

![](./img/vertexai_api.png)

Then click 'Datasets' and create a new dataset:
