In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Vision Workshop - Managed Dataset and AutoML

## Overview

[Vision Workshop](https://github.com/mblanc/vision-workshop) is a series of labs on how to build an image classification system on Google Cloud. Throughout the Vision Workshop labs, you will learn how to read image data stored in a data lake, perform exploratory data analysis (EDA), train a model, register it in a model registry, evaluate it, deploy it to an endpoint, and run real-time inference against it.

### Objective

This notebook shows how to create a Vertex AI managed dataset from images stored in Cloud Storage and train an AutoML image classification model on that dataset.

This lab uses the following Google Cloud services and resources:

- [Vertex AI](https://cloud.google.com/vertex-ai/)

Steps performed in this notebook:

- List the images stored in Cloud Storage and write a `CSV` import file
- Create a Vertex AI managed image dataset from that file
- Train an AutoML image classification model on the dataset

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

### Load configuration settings from the setup notebook

Set the constants used in this notebook and load the config settings from the `00_environment_setup.ipynb` notebook.

In [None]:
GCP_PROJECTS = !gcloud config get-value project
PROJECT_ID = GCP_PROJECTS[0]
BUCKET_NAME = f"{PROJECT_ID}-vision-workshop"
config = !gsutil cat gs://{BUCKET_NAME}/config/notebook_env.py
print(config.n)
exec(config.n)

### Import libraries

In [None]:
import os
import pandas as pd
from google.cloud import storage
from google.cloud import aiplatform as vertex_ai

### Vertex AI Managed Dataset

To load an image dataset into a Vertex AI managed dataset, create a file listing your images and their label(s).

See [Prepare image training data for classification](https://cloud.google.com/vertex-ai/docs/image-data/classification/prepare-data).

This input file can be in the `CSV` or `JSONL` format.

In the `CSV` format, each line has the following form:


```[ML_USE],GCS_FILE_PATH,[LABEL]```

List of columns

* `ML_USE` (Optional) - For data split purposes when training a model. Use TRAINING, TEST, or VALIDATION. For more information about manual data splitting, see About data splits for AutoML models.
* `GCS_FILE_PATH` - This field contains the Cloud Storage URI for the image. Cloud Storage URIs are case-sensitive.
* `LABEL` (Optional) - Labels must start with a letter and only contain letters, numbers, and underscores.

Example CSV - image_classification_single_label.csv:

```
test,gs://bucket/filename1.jpeg,daisy
training,gs://bucket/filename2.gif,dandelion
gs://bucket/filename3.png
gs://bucket/filename4.bmp,sunflowers
validation,gs://bucket/filename5.tiff,tulips
...
```
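As a rough illustration of the column rules above, the following sketch builds single lines of the import file and checks them against the stated constraints. `make_import_row` is a hypothetical helper written for this example, not part of the Vertex AI SDK, and the label pattern is only a direct transcription of the rule quoted above.

```python
import re
from typing import Optional

# Allowed ML_USE values, as listed above (compared case-insensitively here).
ML_USE_VALUES = {"TRAINING", "TEST", "VALIDATION"}
# Label rule quoted above: starts with a letter; only letters, numbers, underscores.
LABEL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def make_import_row(
    gcs_uri: str, label: Optional[str] = None, ml_use: Optional[str] = None
) -> str:
    """Build one CSV line: [ML_USE],GCS_FILE_PATH,[LABEL] (hypothetical helper)."""
    if not gcs_uri.startswith("gs://"):
        raise ValueError(f"Not a Cloud Storage URI: {gcs_uri}")
    if label is not None and not LABEL_PATTERN.match(label):
        raise ValueError(f"Invalid label: {label}")
    if ml_use is not None and ml_use.upper() not in ML_USE_VALUES:
        raise ValueError(f"Invalid ML_USE: {ml_use}")
    fields = []
    if ml_use is not None:
        fields.append(ml_use)
    fields.append(gcs_uri)
    if label is not None:
        fields.append(label)
    return ",".join(fields)


print(make_import_row("gs://bucket/filename1.jpeg", "daisy", "test"))
print(make_import_row("gs://bucket/filename3.png"))
```

The two calls reproduce the first and third lines of the example file; a label such as `1daisy` would raise a `ValueError` in this sketch.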

Let's create such an input file from our image dataset and store it on GCS.

First, we list all the images in our dataset:

In [None]:
client = storage.Client() 

blobs = list(client.list_blobs(BUCKET_NAME, prefix='flowers/'))

We extract their URIs and derive each label from the name of the image's parent folder:

In [None]:
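# The label is the parent folder name; skip folder placeholders and the CSV index file written below (present on re-runs).
d = [[f"gs://{blob.bucket.name}/{blob.name}", os.path.split(os.path.dirname(blob.name))[1]] for blob in blobs if not blob.name.endswith(("/", ".csv"))]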
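To make the label extraction easier to follow, here is a minimal sketch of the same idea applied to plain strings. The object names are hypothetical and `label_from_blob_name` is an illustrative helper; it uses `posixpath` because Cloud Storage object names always use `/` as the separator.

```python
import posixpath


def label_from_blob_name(blob_name: str) -> str:
    """Return the name of the folder that directly contains the object."""
    return posixpath.basename(posixpath.dirname(blob_name))


# Hypothetical object names, following the flowers/<label>/<file> layout.
print(label_from_blob_name("flowers/daisy/image_001.jpg"))
# An object stored directly under flowers/ gets the label "flowers",
# so non-image objects under the prefix are worth filtering out.
print(label_from_blob_name("flowers/flowers.csv"))
```

The second call shows why objects that sit directly under the `flowers/` prefix, such as an index file, should not end up in the list of images.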

In [None]:
df = pd.DataFrame(d)
df
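Before importing, it can help to check how many images each label has. The sketch below counts labels over a small list of `[uri, label]` rows shaped like `d` above; the rows themselves are made up for illustration.

```python
from collections import Counter

# Hypothetical rows in the same [uri, label] shape as the list built above.
rows = [
    ["gs://my-bucket/flowers/daisy/a.jpg", "daisy"],
    ["gs://my-bucket/flowers/daisy/b.jpg", "daisy"],
    ["gs://my-bucket/flowers/tulips/c.jpg", "tulips"],
]

# Count how many images carry each label.
label_counts = Counter(label for _uri, label in rows)

for label, count in label_counts.most_common():
    print(f"{label}: {count}")
```

A label with very few images, or an unexpected label name, usually points to a problem in the folder layout.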

We save the result as a `CSV` file directly on our GCS bucket.

In [None]:
df.to_csv(f"gs://{BUCKET_NAME}/flowers/flowers.csv",index=False, header=False)

### Create the Dataset

Next, create the Dataset resource using the create method for the ImageDataset class, which takes the following parameters:

* `display_name`: The human readable name for the Dataset resource.
* `gcs_source`: A list of one or more dataset index files to import the data items into the Dataset resource.
* `import_schema_uri`: The data labeling schema for the data items.


Learn more about [ImageDataset](https://cloud.google.com/vertex-ai/docs/datasets/prepare-image).

This operation may take several minutes.

In [None]:
# The staging bucket is passed as a Cloud Storage URI (gs://...).
vertex_ai.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")

In [None]:
%%time

ds = vertex_ai.ImageDataset.create(
    display_name="flowers",
    gcs_source=f"gs://{BUCKET_NAME}/flowers/flowers.csv",
    import_schema_uri=vertex_ai.schema.dataset.ioformat.image.single_label_classification,
    sync=True,
)

ds.wait()

print(ds.display_name)
print(ds.resource_name)

### Train an AutoML Model

In [None]:
ds = vertex_ai.ImageDataset.list(filter="display_name=flowers")[0]
ds

To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.

#### Create training pipeline
An AutoML training pipeline is created with the AutoMLImageTrainingJob class, with the following parameters:

* `display_name`: The human readable name for the TrainingJob resource.
* `prediction_type`: The type of task to train the model for.
    * `classification`: An image classification model.
    * `object_detection`: An image object detection model.
* `multi_label`: For a classification task, whether the model is single-label (False) or multi-label (True).
* `model_type`: The type of model for deployment.
    * `CLOUD`: Deployment on Google Cloud.
    * `CLOUD_HIGH_ACCURACY_1`: Optimized for accuracy over latency for deployment on Google Cloud.
    * `CLOUD_LOW_LATENCY_1`: Optimized for latency over accuracy for deployment on Google Cloud.
    * `MOBILE_TF_VERSATILE_1`: Deployment on an edge device.
    * `MOBILE_TF_HIGH_ACCURACY_1`: Optimized for accuracy over latency for deployment on an edge device.
    * `MOBILE_TF_LOW_LATENCY_1`: Optimized for latency over accuracy for deployment on an edge device.
* `base_model`: (optional) Transfer learning from an existing Model resource; supported for image classification only.

The instantiated object is the DAG (directed acyclic graph) for the training job.

In [None]:
job = vertex_ai.AutoMLImageTrainingJob(
    display_name="flowers_automl_job",
    prediction_type="classification",
    model_type="CLOUD",
    multi_label=False,
)

#### Run the training pipeline

Next, you run the DAG to start the training job by invoking the `run` method, with the following parameters:

* `dataset`: The Dataset resource to train the model.
* `model_display_name`: The human readable name for the trained model.
* `training_fraction_split`: The fraction of the dataset to use for training.
* `test_fraction_split`: The fraction of the dataset to use for test (holdout data).
* `validation_fraction_split`: The fraction of the dataset to use for validation.
* `budget_milli_node_hours`: (optional) Maximum training budget in milli node hours (1,000 = 1 node hour).
* `disable_early_stopping`: If False (the default), training may complete before using the entire budget if the service believes it cannot further improve on the model objective measurements. If True, the entire budget is used.

When it completes, the `run` method returns the `Model` resource.

The execution of the training pipeline can take up to 2 hours.
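The split and budget parameters are easy to get wrong by a factor, so a small sanity check can be useful before launching a long job. The sketch below is not part of the Vertex AI SDK: `check_splits` and `node_hours_to_milli` are hypothetical helpers, and the check assumes that each fraction lies between 0 and 1 and that together they should not exceed 1.

```python
import math


def check_splits(training: float, validation: float, test: float) -> float:
    """Validate the three split fractions and return their sum."""
    for name, value in (("training", training), ("validation", validation), ("test", test)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} fraction must be between 0 and 1, got {value}")
    total = training + validation + test
    # Assumption: the fractions together should not exceed 1 (small float tolerance).
    if total > 1.0 + 1e-9:
        raise ValueError(f"Split fractions sum to {total}, which is more than 1")
    return total


def node_hours_to_milli(node_hours: float) -> int:
    """Convert node hours to milli node hours (1,000 = 1 node hour)."""
    return int(round(node_hours * 1000))


print(check_splits(0.8, 0.1, 0.1))
print(node_hours_to_milli(8))
```

With the values used in this notebook, the three fractions cover the whole dataset and a budget of 8 node hours corresponds to `budget_milli_node_hours=8000`.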

In [None]:
model = job.run(
    dataset=ds,
    model_display_name="flowers_automl",
    budget_milli_node_hours=8000,
    training_fraction_split=0.8,
    validation_fraction_split=0.1,
    test_fraction_split=0.1,
    disable_early_stopping=False,
    sync=False,
)