In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vision Workshop - Model Experimentation

## Overview

[Vision Workshop](https://github.com/mblanc/vision-workshop) is a series of labs on how to build an image classification system on Google Cloud. Throughout the Vision Workshop labs, you will learn how to read image data stored in a data lake, perform exploratory data analysis (EDA), train a model, register it in a model registry, evaluate it, deploy it to an endpoint, and run real-time inference against it.

### Objective

This notebook shows how to read training images from Cloud Storage, run exploratory data analysis on them, build a machine learning model locally, experiment with several architectures and hyperparameters, evaluate the model, and track the results in Vertex AI Experiments.

This lab uses the following Google Cloud services and resources:

- [Vertex AI](https://cloud.google.com/vertex-ai/)

Steps performed in this notebook:

- Mount a Cloud Storage bucket to read the training data
- Do some exploratory analysis on the extracted data
- Train the model and track the results using Vertex AI Experiments

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

### Load configuration settings from the setup notebook

Set the constants used in this notebook and load the config settings from the `00_environment_setup.ipynb` notebook.

In [None]:
GCP_PROJECTS = !gcloud config get-value project
PROJECT_ID = GCP_PROJECTS[0]
BUCKET_NAME = f"{PROJECT_ID}-vision-workshop"
config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py
print(config.n)
exec(config.n)
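
The `exec(config.n)` call above runs the downloaded settings file as Python, so every variable it assigns (presumably including `REGION`, which is used later) lands in the notebook's global namespace. A minimal sketch of the same idea, loading the assignments into an explicit dictionary so the loaded names are easy to inspect. The config text here is hypothetical; the real contents of `notebook_env.py` are not shown in this notebook.

```python
# Hypothetical contents standing in for config/notebook_env.py.
config_text = 'REGION = "us-central1"\nDATASET = "aiornot"\n'

def load_settings(source: str) -> dict:
    """Execute Python assignments and return the names they define."""
    namespace: dict = {}
    exec(source, {}, namespace)  # only run text that comes from a trusted source
    return namespace

settings = load_settings(config_text)
print(sorted(settings))
```

Keeping the settings in a dictionary avoids silently overwriting notebook variables, at the cost of writing `settings["REGION"]` instead of `REGION`.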

### Mount Google Cloud Storage with gcsfuse

What if I told you there is no need to `gsutil cp -r`?

If you’ve developed machine learning models before, you know that data quality and governance issues are common. A typical workflow is to spin up a Vertex AI Workbench Jupyter notebook and copy some data from Cloud Storage. If the dataset is large, you wait while all the data is copied to the notebook, and you end up with two copies of the data. Multiply that by the number of data scientists in your organization and you have a data reconciliation problem.

Now, with Cloud Storage FUSE, you can mount Cloud Storage buckets as file systems on Vertex AI Workbench Notebooks and Vertex AI training jobs. This way you can keep all your data in a single repository (Cloud Storage) and make it available across multiple teams as a single source of truth.

#### Cloud Storage FUSE

Cloud Storage FUSE is a Filesystem in Userspace (FUSE) mounted on Vertex AI systems. It provides three benefits over the traditional ways of accessing Cloud Storage:

- Jobs can start quickly without downloading any data.
- Jobs can perform I/O easily at scale, without the friction of calling the Cloud Storage APIs, handling the responses, or integrating with client-side libraries.
- Jobs can leverage the optimized performance of Cloud Storage FUSE.

In all custom training jobs, Vertex AI mounts Cloud Storage buckets that you have access to in the /gcs/ directory of each training node’s filesystem. You can read and write directly to the local filesystem in order to read data from Cloud Storage or write data to Cloud Storage.
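
As a small illustration of the path convention described above, here is a sketch of a helper that maps a `gs://` URI to the corresponding path under a FUSE mount root. The function name and the bucket name are made up for this example.

```python
def gcs_uri_to_fuse_path(uri: str, mount_root: str = "/gcs") -> str:
    """Map gs://bucket/object to <mount_root>/bucket/object."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    return f"{mount_root}/{uri[len(prefix):]}"

# Hypothetical bucket name, for illustration only.
training_path = gcs_uri_to_fuse_path("gs://my-bucket/aiornot/train")
notebook_path = gcs_uri_to_fuse_path("gs://my-bucket/aiornot/train", "/home/jupyter/gcs")
print(training_path)
print(notebook_path)
```

The second call uses the mount root that this notebook sets up under `/home/jupyter/gcs`, which differs from the `/gcs` root of custom training jobs.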

For Vertex AI Workbench Notebooks, Cloud Storage FUSE is supported with just a few steps and next we’ll go through how to do this. Let’s get started!

In [None]:
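# Unmount any previous mount first. Only remove ~/gcs once the unmount has
# succeeded: deleting a still-mounted directory would delete the bucket's objects.
!fusermount -u /home/jupyter/gcs/{BUCKET_NAME}
!rm -rf ~/gcs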

In [None]:
!mkdir -p ~/gcs/{BUCKET_NAME}

In [None]:
BUCKET_NAME

In [None]:
!gcsfuse --implicit-dirs \
--rename-dir-limit=100 \
--disable-http2 \
--max-conns-per-host=100 \
{BUCKET_NAME} /home/jupyter/gcs/{BUCKET_NAME}

In [None]:
import pathlib
data_dir = pathlib.Path(f"/home/jupyter/gcs/{BUCKET_NAME}/aiornot/train")

In [None]:
image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)
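
The glob above counts all images at once. Because each class lives in its own subdirectory, the same pattern can also give a per-class count, which is a quick check for class imbalance. The following is a sketch that runs on a throwaway directory; the class names `ai` and `real` are hypothetical and may not match the folders in the bucket.

```python
import pathlib
import tempfile
from collections import Counter

def count_images_per_class(root: pathlib.Path, pattern: str = "*/*.jpg") -> Counter:
    """Count files matching <class_dir>/<file>.jpg, keyed by class directory name."""
    return Counter(path.parent.name for path in root.glob(pattern))

# Build a small demo tree so the example runs without access to the bucket.
with tempfile.TemporaryDirectory() as tmp:
    demo_root = pathlib.Path(tmp)
    for label, n in [("ai", 3), ("real", 2)]:
        (demo_root / label).mkdir()
        for i in range(n):
            (demo_root / label / f"{i}.jpg").touch()
    counts = count_images_per_class(demo_root)

print(dict(counts))
```

In the notebook itself, `count_images_per_class(data_dir)` would apply the same logic to the mounted training folder.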

### Import libraries

In [None]:
import numpy as np
import os
import pathlib
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"
import PIL
import PIL.Image
from fastai.vision.all import *
import torch
import timm
import time
from datetime import datetime, timedelta
from google.cloud import aiplatform as vertex_ai

In [None]:
import torch
print(torch.__version__)
print(timm.__version__)

### Define constants

In [None]:
TIMESTAMP = str(int(time.time()))

## Experiment
EXPERIMENT_NAME = "vision-experiment-" + TIMESTAMP

### Initialize clients

In [None]:
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}", experiment=EXPERIMENT_NAME)

## Load data using a fastai utility

Next, load these images off disk using fastai's `ImageDataLoaders.from_folder` utility. This will take you from a directory of images on disk to a `DataLoaders` object, with training and validation splits, in just a couple of lines of code.

### Create a dataset

Define some parameters for the loader:

In [None]:
config = SimpleNamespace(
    batch_size=32,
    img_size=256,
    seed=42,
    pretrained=True,
    model_name="maxxvit_rmlp_small_rw_256", #"regnetx_040",
    epochs=5)


It's good practice to use a validation split when developing your model. Use 80% of the images for training and 20% for validation.
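
To make the 80/20 split concrete, here is a conceptual sketch of a seeded random split in plain Python. It is not fastai's implementation (fastai has its own splitter behind `valid_pct` and `seed`); it only shows why fixing the seed gives the same split on every run. The file names are made up.

```python
import random

def random_split(items, valid_pct=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out valid_pct for validation."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    cut = int(valid_pct * len(items))
    valid = [items[i] for i in indices[:cut]]
    train = [items[i] for i in indices[cut:]]
    return train, valid

files = [f"img_{i}.jpg" for i in range(10)]  # hypothetical file names
train_files, valid_files = random_split(files)
print(len(train_files), len(valid_files))
```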

In [None]:
data_dir

In [None]:
def get_dataloaders(batch_size, img_size, seed):
    "Build train/validation DataLoaders from the image folders under data_dir"
    dls = ImageDataLoaders.from_folder(data_dir,
                                        valid_pct=0.2,
                                        seed=seed,
                                        bs=batch_size,
                                        item_tfms=Resize(img_size),
                                        batch_tfms=aug_transforms(mult=2))
    return dls

In [None]:
dls = get_dataloaders(config.batch_size, config.img_size, config.seed)

### Data exploration
Here we use a subset of data for data exploration and better understanding of the data.

You can find the class names in the `vocab` attribute of the `DataLoaders`. These correspond to the directory names in alphabetical order.

In [None]:
class_names = dls.vocab
print(class_names)
num_classes = len(class_names)

Here are nine images from the validation set:

In [None]:
dls.valid.show_batch(max_n=9, nrows=3)

Overfitting generally occurs when there are a small number of training examples. Data augmentation takes the approach of generating additional training data from your existing examples by augmenting them using random transformations that yield believable-looking images. This helps expose the model to more aspects of the data and generalize better.

This notebook applies data augmentation through fastai's `aug_transforms`, passed as `batch_tfms` when the `DataLoaders` are created. It combines random flips, rotations, zooms, lighting changes, and warps, and is applied to whole batches, so it can run on the GPU.

Visualize a few augmented examples by applying data augmentation to the same image several times:

In [None]:
dls.train.show_batch(max_n=8, nrows=2, unique=True)

## Building a custom model

The fastai `DataLoaders` used in this notebook can load and preprocess batches in background worker processes (see the `num_workers` argument), so no extra configuration is needed here. For reference, if you build the input pipeline with `tf.data` instead, make sure to use buffered prefetching so you can yield data from disk without I/O becoming blocking. These are two important methods to use when loading data that way:

- `Dataset.cache` keeps the images in memory after they're loaded off disk during the first epoch. This ensures the dataset does not become a bottleneck while training your model. If your dataset is too large to fit into memory, you can also use this method to create a performant on-disk cache.
- `Dataset.prefetch` overlaps data preprocessing and model execution while training.

Interested readers can learn more about both methods, as well as how to cache data to disk, in the Prefetching section of the [Better performance with the tf.data API](https://www.tensorflow.org/guide/data_performance) guide.

In [None]:
# AUTOTUNE = tf.data.AUTOTUNE

# train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
# val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

### Training

In this section, we will train a model using fastai. Typically, to perform training, you might want to use a Vertex AI training pipeline; however, as we are experimenting here, we simply use the fastai package interactively to train our model in this notebook.

We will test several architectures and log our experiments in Vertex AI Experiments.

Let's start with a pretrained model from the timm library:

`vision_learner` builds a learner around the architecture named in `config.model_name` (here `maxxvit_rmlp_small_rw_256`), loads pretrained weights, and attaches a new classification head for our classes. This model has not been tuned for high accuracy; the goal of this tutorial is to show a standard approach.

In [None]:
cbs = [MixedPrecision(), ShowGraphCallback()]
learn = vision_learner(dls, config.model_name, metrics=accuracy, 
                               cbs=cbs, pretrained=config.pretrained)
print(learn.summary())

In [None]:
learn.model

#### Training the model
Before training, we can set some hyperparameters, which have a strong impact on performance. As a best practice, you can use Vertex AI hyperparameter tuning to find good values automatically. However, in this notebook, for the sake of simplicity and expedience, we specify these hyperparameters manually.

In [None]:
epochs=5

run_name=f"fastai-{TIMESTAMP}"
vertex_ai.start_run(run=run_name)
vertex_ai.log_params(config.__dict__)
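
`log_params` expects a flat dictionary whose values are numbers or strings. The config above already satisfies that, but a config that gains a tuple or list value would not. The sketch below shows one way to sanitize such a dictionary before logging; the helper name and the extra `img_size` tuple are hypothetical.

```python
from types import SimpleNamespace

def to_loggable(params: dict) -> dict:
    """Keep int/float/str values as they are and stringify everything else."""
    return {
        key: value if isinstance(value, (int, float, str)) else str(value)
        for key, value in params.items()
    }

# Hypothetical config with one value (a tuple) that is not a number or string.
cfg = SimpleNamespace(batch_size=32, pretrained=True, model_name="regnetx_040", img_size=(256, 256))
loggable = to_loggable(vars(cfg))
print(loggable)
```

Booleans pass through unchanged because `bool` is a subclass of `int` in Python.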

Find a suitable learning rate with `lr_find`, then train the model for `config.epochs` (5) epochs with fastai's `fit_one_cycle` method:

In [None]:
lr = learn.lr_find()

In [None]:
lr.valley

In [None]:
learn.fit_one_cycle(config.epochs, lr.valley)
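
`fit_one_cycle` does not train at a constant learning rate: it warms up to the maximum rate and then anneals it down. The following plain-Python sketch shows the shape of that schedule. The defaults (`div=25`, `div_final=1e5`, `pct_start=0.25`) reflect my understanding of fastai's defaults and should be checked against the installed version; the `lr_max` value is a made-up stand-in for `lr.valley`.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from start (pos=0) to end (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pos, lr_max, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress pos in [0, 1]."""
    if pos < pct_start:
        # Warm-up phase: lr_max / div up to lr_max.
        return cos_anneal(lr_max / div, lr_max, pos / pct_start)
    # Annealing phase: lr_max down to lr_max / div_final.
    return cos_anneal(lr_max, lr_max / div_final, (pos - pct_start) / (1 - pct_start))

lr_max = 1e-3  # hypothetical value standing in for lr.valley
for pos in (0.0, 0.25, 1.0):
    print(pos, one_cycle_lr(pos, lr_max))
```

fastai also schedules momentum over the cycle; this sketch covers the learning rate only.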

Plot the training and validation loss:

In [None]:
learn.recorder.plot_loss(skip_start=0, with_valid=True)

In [None]:
learn.show_results()

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

In [None]:
interp.plot_top_losses(4, nrows=2)

In [None]:
interp.print_classification_report()
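
The confusion matrix and classification report above are built from pairs of actual and predicted labels. As a sketch of what those numbers mean, the following plain-Python example counts the pairs and derives precision and recall for one class. The labels are made up for illustration and are not results from this notebook.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(y_true, y_pred))

def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = confusion_counts(y_true, y_pred)
    true_positives = pairs[(label, label)]
    predicted = sum(n for (_, pred), n in pairs.items() if pred == label)
    actual = sum(n for (act, _), n in pairs.items() if act == label)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Made-up labels; class names are hypothetical.
y_true = ["ai", "ai", "ai", "real", "real", "real"]
y_pred = ["ai", "ai", "real", "real", "real", "ai"]
precision, recall = precision_recall(y_true, y_pred, "ai")
print(precision, recall)
```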

In [None]:
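# Use new names so the fastai `accuracy` metric function is not overwritten;
# the loop further down still passes `metrics=accuracy`.
loss_metric, accuracy_metric = learn.validate()
vertex_ai.log_metrics({"loss": loss_metric, "accuracy": accuracy_metric})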


In [None]:
vertex_ai.end_run()

In a second experiment, we will train several architectures, including EfficientNetV2, and log one run per architecture:

In [None]:
data_dir

In [None]:
from fastai.vision.all import *

def get_config(model_name, img_size=224, epochs=10):
    # Pass img_size through so per-model sizes (e.g. 384) actually take effect.
    return SimpleNamespace(batch_size=32, img_size=img_size, seed=42, pretrained=True, model_name=model_name, epochs=epochs)

configs = [ 
    get_config('regnety_040'), 
    get_config('tf_efficientnetv2_s', 384),
    get_config('coatnext_nano_rw_224'), 
    get_config('regnetx_040'), 
    get_config('maxvit_rmlp_small_rw_224'),
    get_config('regnetz_040'), 
]

for config in configs:
    print(config.model_name)
    TIMESTAMP = str(int(time.time()))
    dls = ImageDataLoaders.from_folder(data_dir, 
                                        valid_pct=0.2, 
                                        seed=config.seed, 
                                        bs=config.batch_size,
                                        item_tfms=Resize(config.img_size),
                                        batch_tfms=aug_transforms(mult=2)) 
    
    
    cbs = [MixedPrecision()]
    learn = vision_learner(dls, config.model_name, metrics=accuracy, cbs=cbs, pretrained=config.pretrained)
    print("lr_find")
    lr = learn.lr_find()

    run_name=f"fastai-{TIMESTAMP}"
    vertex_ai.start_run(run=run_name)
    vertex_ai.log_params(config.__dict__)

    learn.fit_one_cycle(config.epochs, lr.valley)
    
    loss_metric, accuracy_metric = learn.validate()
    
    vertex_ai.log_metrics({"loss": loss_metric, "accuracy": accuracy_metric})
    vertex_ai.end_run()

We can also extract all parameters and metrics associated with any experiment into a dataframe for further analysis.

In [None]:
experiment_df = vertex_ai.get_experiment_df()
experiment_df.sort_values(by=['metric.accuracy'])
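
Once the runs are in a table, a common next step is picking the best one. The sketch below does this on a list of row dictionaries, such as `experiment_df.to_dict("records")` would produce. The rows are made up, and the `param.model_name` column name is an assumption by analogy with `metric.accuracy`.

```python
import math

def best_run(rows, metric="metric.accuracy"):
    """Return the row with the highest metric, skipping missing or NaN values."""
    scored = [
        row for row in rows
        if isinstance(row.get(metric), (int, float)) and not math.isnan(row[metric])
    ]
    return max(scored, key=lambda row: row[metric])

# Made-up rows for illustration; these are not results from this notebook.
rows = [
    {"run_name": "fastai-1", "param.model_name": "regnety_040", "metric.accuracy": 0.91},
    {"run_name": "fastai-2", "param.model_name": "regnetx_040", "metric.accuracy": 0.94},
    {"run_name": "fastai-3", "param.model_name": "regnetz_040", "metric.accuracy": float("nan")},
]
best = best_run(rows)
print(best["run_name"], best["param.model_name"])
```

Runs that failed before logging a metric show up with a missing value, which is why the helper filters them out before comparing.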

We can also visualize experiments in the Cloud Console. Run the following cell to get the URL of Vertex AI Experiments for your project, then open that URL to see the results.

In [None]:
print("Vertex AI Experiments:")
print(
    f"https://console.cloud.google.com/ai/platform/experiments/experiments?folder=&organizationId=&project={PROJECT_ID}"
)

Let's test our last model by making a prediction on a new image
TODO