# Classify Flower species categories using AutoML Object Detection 

[Source](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/automl/sdk_automl_image_object_detection_batch.ipynb)


### Objective

The purpose of this project is to use a Vertex AI AutoML image model to classify the flower species shown in the provided images.

The steps performed include: 

- Create a Vertex `Dataset` resource.
- Train the model.
- View the model evaluation.
- Make a batch prediction.

# Project Variables

In [1]:
# Project variables 
#
# These are the project variables used in this ML model:
#
PROJECT_ID = "" # @param {type:"string"}
automl_type = "image" #@param ["image", "text", "tabular", "video"]
model_display_name = "automl_image_flower_species_model"
job_display_name = model_display_name + "_job"
endpoint_display_name = model_display_name + "_endpoint"

## datasets 
dataset_display_name = model_display_name + "_dataset"
dataset_source_uri = "gs://cloud-samples-data/ai-platform/flowers/flowers.csv"
dataset_source_public = "https://storage.googleapis.com/cloud-samples-data/ai-platform/flowers/flowers.csv"
batch_display_name = model_display_name + "_batch_prediction"

# bucket details
BUCKET_NAME = "auto-ml-tutorials" #auto-ml bucket 
BUCKET_URI = f"gs://{BUCKET_NAME}/{automl_type}/" # automl bucket uri
BUCKET_PREDICTION_OUTPUT = f"gs://auto_ml_datasets_predictions/{automl_type}/" 
BUCKET_INPUT_BATCHPREDICT = f"gs://{BUCKET_NAME}/{automl_type}/{model_display_name}" # contains the files to be used for batch prediction

print("All project variables set. Lets go")

All project variables set. Lets go


### Dataset

The image files used are from the flower dataset. These input images are stored in a public GCS bucket, along with a CSV file for data import. This file has two columns: the first column lists an image's URI in GCS, and the second column contains the image's label. This dataset has **3,667 images**.

The 5 species are:
1. Daisy 
1. Dandelion 
1. Roses
1. Sunflowers
1. Tulips
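
As a rough illustration of the two-column layout described above, the sketch below parses a few CSV rows and counts the images per label. The `count_labels` helper and the third sample row are hypothetical; the first two rows are taken from the index file preview shown later in this notebook.

```python
import csv
import io
from collections import Counter

# Small sample in the same layout as flowers.csv: image URI, then label.
# The last row is a made-up example used only for illustration.
SAMPLE_CSV = """\
gs://cloud-samples-data/ai-platform/flowers/daisy/100080576_f52e8ee070_n.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10140303196_b88d3d6cec.jpg,daisy
gs://example-bucket/flowers/tulips/example.jpg,tulips
"""


def count_labels(csv_text):
    """Return a Counter mapping each label to the number of rows that use it."""
    reader = csv.reader(io.StringIO(csv_text))
    # Column 0 is the image URI, column 1 is the label; skip malformed rows.
    return Counter(row[1] for row in reader if len(row) >= 2)


label_counts = count_labels(SAMPLE_CSV)
print(label_counts)
```

Running the same helper on the full index file would give a quick view of how balanced the five classes are.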

## Installation

Install the latest version of Vertex AI SDK for Python.

In [None]:
import os

# Google Cloud Notebook
if os.path.exists("/opt/deeplearning/metadata/env_version"):
    USER_FLAG = "--user"
else:
    USER_FLAG = ""

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG

Install the latest GA versions of the *google-cloud-storage* and *tensorflow* libraries.

In [None]:
! pip3 install --upgrade tensorflow google-cloud-storage $USER_FLAG

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [5]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

Set the project ID for the `gcloud` CLI, using the `PROJECT_ID` value from the project variables.

In [None]:
! gcloud config set project $PROJECT_ID
print(PROJECT_ID)

#### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. The regions below are supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations)

In [3]:
REGION = "[your-region]"  # @param {type:"string"}

if REGION == "[your-region]":
    REGION = "us-central1"

print(REGION)

us-central1


### Authenticate your Google Cloud account

**If you are using Google Cloud Notebooks**, your environment is already authenticated.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.

Click **Create service account**.

In the **Service account name** field, enter a name, and click **Create**.

In the **Grant this service account access to project** section, click the Role drop-down list. Type "Vertex" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

Click **Create**. A JSON file that contains your key downloads to your local environment.

Enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell.

In [4]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Google Cloud Notebook, then don't execute this code
if not os.path.exists("/opt/deeplearning/metadata/env_version"):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex AI SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [5]:
print(BUCKET_NAME)
print(BUCKET_URI)

auto-ml-tutorials
gs://auto-ml-tutorials/image/
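
Because `BUCKET_URI` ends with a trailing slash, appending `"/" + name` to it produces a double slash in the object path. A minimal sketch of a joining helper that avoids this is shown below; `gcs_join` is a hypothetical name and is not part of the Vertex AI SDK or `gsutil`.

```python
def gcs_join(base_uri, *parts):
    """Join path segments onto a gs:// URI with exactly one "/" between them."""
    if not base_uri.startswith("gs://"):
        raise ValueError(f"Not a Cloud Storage URI: {base_uri}")
    # Drop the trailing slash on the base and stray slashes on each segment.
    segments = [base_uri.rstrip("/")]
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(segments)


print(gcs_join("gs://auto-ml-tutorials/image/", "test.jsonl"))
```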


**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
# `gsutil mb` expects a bucket URI, not a path inside the bucket
! gsutil mb -l $REGION gs://$BUCKET_NAME

Finally, validate access to your Cloud Storage bucket by examining its contents:

In [6]:
! gsutil ls -al $BUCKET_URI

         0  2022-11-11T17:37:51Z  gs://auto-ml-tutorials/image/#1668188271243479  metageneration=1
TOTAL: 1 objects, 0 bytes (0 B)


### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [7]:
import google.cloud.aiplatform as aiplatform

## Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [8]:
# staging_bucket is given as a gs:// URI rather than a bare bucket name
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI, location=REGION)

# Tutorial

Now you are ready to start creating your own AutoML image object detection model.

#### Location of Cloud Storage training data.

The variable `dataset_source_uri`, set in the project variables at the top of this notebook, holds the location of the CSV index file in Cloud Storage.

#### Quick peek at your data
This tutorial uses a version of the Flowers dataset that is stored in a public Cloud Storage bucket, using a CSV index file.

Start by doing a quick peek at the data. You count the number of examples by counting the number of rows in the CSV index file (`wc -l`) and then peek at the first few rows.

In [9]:
count = ! gsutil cat $dataset_source_uri | wc -l
print("Number of Examples", int(count[0]))

print("First 10 rows")
! gsutil cat $dataset_source_uri | head -10


Number of Examples 3670
First 10 rows
gs://cloud-samples-data/ai-platform/flowers/daisy/100080576_f52e8ee070_n.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10140303196_b88d3d6cec.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10172379554_b296050f82_n.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10172567486_2748826a8b.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10172636503_21bededa75_n.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/102841525_bd6628ae3c.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/1031799732_e7f4008c03.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10391248763_1d16681106_n.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10437754174_22ec990b77_m.jpg,daisy
gs://cloud-samples-data/ai-platform/flowers/daisy/10437770546_8bb6f7bdd3_m.jpg,daisy


### Create the Dataset

Next, create the `Dataset` resource using the `create` method for the `ImageDataset` class, which takes the following parameters:

- `display_name`: The human readable name for the `Dataset` resource.
- `gcs_source`: A list of one or more dataset index files to import the data items into the `Dataset` resource.
- `import_schema_uri`: The data labeling schema for the data items.

This operation may take several minutes.

In [10]:
# set Dataset variables 

datasetTypeSelection = "ImageDataset" #@param ["ImageDataset", "TextDataset", "TabularDataset", "VideoDataset"]
datasetImportSchemaSelection = "image.bounding_box" #@param ["image.bounding_box", "text.single_label_classification", "video.object_tracking", "tabluarDeleteTheLine"]

## To use a different selection, replace the corresponding value directly in the code below


In [11]:
# Create the dataset if it doesn't exist
datasets = aiplatform.ImageDataset.list(filter=f"display_name={dataset_display_name}")

if datasets:
    dataset = datasets[0]
    print(f"Dataset Exists: {dataset.display_name}")
else:
    dataset = aiplatform.ImageDataset.create(
        display_name=dataset_display_name,
        gcs_source=dataset_source_uri,
        import_schema_uri=aiplatform.schema.dataset.ioformat.image.bounding_box,
    )
    print(f"Dataset Created: {dataset.display_name}")

# Use `dataset` here: `datasets[0]` raises an IndexError when the dataset was just created
print(f'Review the Dataset in the Console:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/datasets/{dataset.resource_name}?project={PROJECT_ID}')


Dataset Exists: automl_image_flower_species_model_dataset
Review the Dataset in the Console:
https://console.cloud.google.com/vertex-ai/locations/us-central1/datasets/projects/993987777814/locations/us-central1/datasets/6328694371478667264?project=paulkamau
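
The dataset cell follows a "reuse if present, otherwise create" pattern that recurs for the training job, model, and batch job. A minimal generic sketch of that pattern is shown below. The `get_or_create` helper and the `fake_list` / `fake_create` stand-ins are hypothetical and only mimic what calls such as `aiplatform.ImageDataset.list` and `aiplatform.ImageDataset.create` do in this notebook.

```python
def get_or_create(list_existing, create_new):
    """Return (resource, created): reuse the first listed match or create one."""
    existing = list_existing()
    if existing:
        return existing[0], False
    return create_new(), True


# In-memory stand-ins for the list/create calls, used only for illustration.
registry = []


def fake_list():
    return list(registry)


def fake_create():
    registry.append("automl_image_flower_species_model_dataset")
    return registry[-1]


first, created_first = get_or_create(fake_list, fake_create)
second, created_second = get_or_create(fake_list, fake_create)
print(first, created_first)
print(second, created_second)
```

Returning the resource from both branches keeps later code from indexing into a list that may be empty.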


### Create and run training pipeline

To train an AutoML model, you perform two steps: 1) create a training pipeline, and 2) run the pipeline.

#### Create training pipeline

An AutoML training pipeline is created with the `AutoMLImageTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the `TrainingJob` resource.
- `prediction_type`: The type of task to train the model for.
  - `classification`: An image classification model.
  - `object_detection`: An image object detection model.
- `multi_label`: If a classification task, whether single (`False`) or multi-labeled (`True`).
- `model_type`: The type of model for deployment.
  - `CLOUD`: Deployment on Google Cloud
  - `CLOUD_HIGH_ACCURACY_1`: Optimized for accuracy over latency for deployment on Google Cloud.
  - `CLOUD_LOW_LATENCY_1`: Optimized for latency over accuracy for deployment on Google Cloud.
  - `MOBILE_TF_VERSATILE_1`: Deployment on an edge device.
  - `MOBILE_TF_HIGH_ACCURACY_1`: Optimized for accuracy over latency for deployment on an edge device.
  - `MOBILE_TF_LOW_LATENCY_1`: Optimized for latency over accuracy for deployment on an edge device.
- `base_model`: (optional) Transfer learning from existing `Model` resource -- supported for image classification only.

The instantiated object is the training job.

In [12]:
# set job variables 

JobTypeSelection = "AutoMLImageTrainingJob" #@param ["AutoMLImageTrainingJob", "AutoMLTextTrainingJob", "AutoMLVideoTrainingJob", "AutoMLTabularTrainingJob"]
PredictionTypeSelection =  "classification" #@param ["classification", "object_detection"]
MultiLabelSelection ="False" #@param ["False", "True"]

## To use a different selection, replace the corresponding values directly in the code below


In [None]:
# Define the training job, or reuse it if it already exists
jobs = aiplatform.AutoMLImageTrainingJob.list(filter=f"display_name={job_display_name}")

if jobs:
    job = jobs[0]
    print(f"Job Exists: {job.resource_name}")
    print(f'Review the job in the Console:\nhttps://console.cloud.google.com/vertex-ai/locations/{REGION}/jobs/{job.resource_name}?project={PROJECT_ID}')
else:
    job = aiplatform.AutoMLImageTrainingJob(
        display_name=job_display_name,
        prediction_type="classification",
        multi_label=False,
    )
    # A newly defined job has no resource name until run() is called
    print(f"Job Defined: {job_display_name}")



#### Run the training pipeline

Next, start the training job by invoking the `run` method, with the following parameters:

- `dataset`: The `Dataset` resource to train the model.
- `model_display_name`: The human readable name for the trained model.
- `training_fraction_split`: The fraction of the dataset to use for training.
- `test_fraction_split`: The fraction of the dataset to use for test (holdout data).
- `validation_fraction_split`: The fraction of the dataset to use for validation.
- `budget_milli_node_hours`: (optional) Maximum training time specified in milli node hours (1,000 = 1 node hour).
- `disable_early_stopping`: If `False` (the default), training may complete before using the entire budget if the service believes it cannot further improve on the model objective measurements. If `True`, the entire budget is used.

When it completes, the `run` method returns the `Model` resource.

The execution of the training pipeline will take up to 1 hour 30 minutes.
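
Before launching a long training run, it can help to sanity-check the split fractions and the budget. The sketch below assumes you want the three fractions to cover the whole dataset, which is a choice made here rather than a requirement stated by this notebook. `check_splits` and `node_hours_to_milli` are hypothetical helper names.

```python
import math


def check_splits(training, validation, test):
    """Raise ValueError unless the three fractions sum to 1 (within tolerance)."""
    total = training + validation + test
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"Split fractions sum to {total}, expected 1.0")
    return total


def node_hours_to_milli(node_hours):
    """Convert node hours to milli node hours (1,000 = 1 node hour)."""
    return int(round(node_hours * 1000))


check_splits(0.8, 0.1, 0.1)
budget = node_hours_to_milli(20)
print(budget)
```

With these values, the budget matches the `budget_milli_node_hours=20000` used in the training cell.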

In [None]:
# Train the model if it doesn't exist
models = aiplatform.Model.list(filter=f"display_name={model_display_name}")

if models:
    model = models[0]
    print(f"Model Exists: {model.display_name}")
else:
    model = job.run(
        dataset=dataset,
        model_display_name=model_display_name,
        training_fraction_split=0.8,
        validation_fraction_split=0.1,
        test_fraction_split=0.1,
        budget_milli_node_hours=20000,
        disable_early_stopping=False,
    )
    print(f"Model Created: {model.display_name}")



## Review model evaluation scores
After your model has finished training, you can review the evaluation scores for it.

First, you need to get a reference to the new model. As with datasets, you can either use the reference to the model variable you created when you trained the model or you can list all of the models in your project.

In [None]:
# Model Evaluations
model_evaluations = model.list_model_evaluations()
model_evaluation = list(model_evaluations)[0]
print(model_evaluation)


# Print the evaluation metrics
for evaluation in model_evaluations:
    evaluation = evaluation.to_dict()
    print("Model's evaluation metrics from Training:\n")
    metrics = evaluation["metrics"]
    for metric in metrics.keys():
        print(f"metric: {metric}, value: {metrics[metric]}\n")
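
Evaluation metrics often contain nested dictionaries, which makes the flat print loop above hard to read. A minimal sketch of a flattening helper follows. `flatten_metrics` is a hypothetical name, and the sample keys and values are made up for illustration rather than taken from a real evaluation.

```python
def flatten_metrics(metrics, prefix=""):
    """Flatten nested dicts into a single dict with dotted key names."""
    flat = {}
    for key, value in metrics.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, extending the dotted prefix.
            flat.update(flatten_metrics(value, name))
        else:
            flat[name] = value
    return flat


# Illustrative metrics only; real keys depend on the model type.
sample_metrics = {
    "auPrc": 0.9,
    "logLoss": 0.1,
    "confusionMatrix": {"rows": [[1, 0], [0, 1]]},
}

flat_metrics = flatten_metrics(sample_metrics)
for name, value in flat_metrics.items():
    print(f"{name}: {value}")
```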

## Send a batch prediction request

Send a batch prediction request to your trained model. Batch prediction does not require the model to be deployed to an endpoint.

### Get test item(s)

Now make a batch prediction with your Vertex AI model. You will use arbitrary examples from the dataset as test items. Don't be concerned that the examples were likely used in training the model; the goal is only to demonstrate how to make a prediction.

In [None]:
test_items = !gsutil cat $dataset_source_uri | head -n2
cols_1 = str(test_items[0]).split(",")
cols_2 = str(test_items[1]).split(",")
if len(cols_1) == 11:
    test_item_1 = str(cols_1[1])
    test_label_1 = str(cols_1[2])
    test_item_2 = str(cols_2[1])
    test_label_2 = str(cols_2[2])
else:
    test_item_1 = str(cols_1[0])
    test_label_1 = str(cols_1[1])
    test_item_2 = str(cols_2[0])
    test_label_2 = str(cols_2[1])

print(test_item_1, test_label_1)
print(test_item_2, test_label_2)

### Copy test item(s)

For the batch prediction, copy the test items over to your Cloud Storage bucket.

In [None]:
file_1 = test_item_1.split("/")[-1]
file_2 = test_item_2.split("/")[-1]

# BUCKET_URI already ends with "/", so strip it to avoid a double slash
dest_1 = BUCKET_URI.rstrip("/") + "/" + file_1
dest_2 = BUCKET_URI.rstrip("/") + "/" + file_2

! gsutil cp $test_item_1 $dest_1
! gsutil cp $test_item_2 $dest_2

test_item_1 = dest_1
test_item_2 = dest_2

### Make the batch input file

Now make a batch input file, which you will store in your Cloud Storage bucket. The batch input file can be either CSV or JSONL. You will use JSONL in this tutorial. For a JSONL file, you make one dictionary entry per line for each data item (instance). The dictionary contains the key/value pairs:

- `content`: The Cloud Storage path to the image.
- `mime_type`: The content type. In our example, it is `image/jpeg`.

For example:

                        {"content": "gs://[your-bucket]/file1.jpg", "mime_type": "image/jpeg"}

In [None]:
import json

import tensorflow as tf

# BUCKET_URI already ends with "/", so strip it to avoid a double slash
gcs_input_uri = BUCKET_URI.rstrip("/") + "/test.jsonl"
with tf.io.gfile.GFile(gcs_input_uri, "w") as f:
    data = {"content": test_item_1, "mime_type": "image/jpeg"}
    f.write(json.dumps(data) + "\n")
    data = {"content": test_item_2, "mime_type": "image/jpeg"}
    f.write(json.dumps(data) + "\n")

print(gcs_input_uri)
! gsutil cat $gcs_input_uri
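
For more than a couple of test items, the JSONL content can be built in a loop. The sketch below builds the text in memory using only the standard library and guesses the MIME type from the file extension. `build_batch_jsonl` is a hypothetical helper, and the example URIs are placeholders.

```python
import json
import mimetypes


def build_batch_jsonl(image_uris):
    """Return JSONL text with one {"content", "mime_type"} entry per image URI."""
    lines = []
    for uri in image_uris:
        # Guess the MIME type from the file name at the end of the URI.
        mime_type, _ = mimetypes.guess_type(uri.rsplit("/", 1)[-1])
        if mime_type is None:
            raise ValueError(f"Cannot determine MIME type for {uri}")
        lines.append(json.dumps({"content": uri, "mime_type": mime_type}))
    return "\n".join(lines) + "\n"


jsonl_text = build_batch_jsonl(
    ["gs://example-bucket/image/file1.jpg", "gs://example-bucket/image/file2.jpg"]
)
print(jsonl_text)
```

The resulting text could then be written to Cloud Storage in the same way as the cell above writes `test.jsonl`.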

### Make the batch prediction request

Now that your `Model` resource is trained, you can make a batch prediction by invoking the `batch_predict()` method, with the following parameters:

- `job_display_name`: The human readable name for the batch prediction job.
- `gcs_source`: A list of one or more batch request input files.
- `gcs_destination_prefix`: The Cloud Storage location for storing the batch prediction results.
- `machine_type`: The type of machine for running batch prediction on dedicated resources. If you don't specify a machine type, the batch prediction job runs with automatic resources.
- `starting_replica_count`: The number of machine replicas used at the start of the batch operation. If not set, Vertex AI decides the starting number, which is not greater than `max_replica_count`. Only used if `machine_type` is set.
- `max_replica_count`: The maximum number of machine replicas the batch operation may be scaled to. Only used if `machine_type` is set. Default is 10.
- `sync`: If set to `True`, the call will block while waiting for the asynchronous batch job to complete.

For AutoML models, only manual scaling is supported. In manual scaling, `starting_replica_count` and `max_replica_count` have the same value. This batch job uses manual scaling, with both values set to 1.

In [None]:
# Create the batch prediction job if it doesn't exist

jobs = aiplatform.BatchPredictionJob.list(filter=f"display_name={batch_display_name}")

if jobs:
    batch_predict_job = jobs[0]
    print(f"Job Exists: {batch_predict_job.display_name}")
else:
    batch_predict_job = model.batch_predict(
        job_display_name=batch_display_name,
        gcs_source=gcs_input_uri,
        # Output location defined in the project variables
        gcs_destination_prefix=BUCKET_PREDICTION_OUTPUT,
        machine_type="n1-standard-4",
        starting_replica_count=1,
        max_replica_count=1,
        sync=False,
    )


### Wait for completion of batch prediction job

Next, wait for the batch job to complete. Alternatively, one can set the parameter `sync` to `True` in the `batch_predict()` method to block until the batch prediction job is completed.

In [None]:
batch_predict_job.wait()

### Get the predictions

Next, get the results from the completed batch prediction job.

The results are written to the Cloud Storage output bucket you specified in the batch prediction request. You call the `iter_outputs()` method to get a list of each Cloud Storage file generated with the results. Each file contains one or more prediction results in JSON format:

- `content`: The prediction request.
- `prediction`: The prediction response.
  - `ids`: The internal assigned unique identifiers for each prediction request.
  - `displayNames`: The class names for each class label.
  - `bboxes`: The bounding box of each detected object.

In [None]:
import json

bp_iter_outputs = batch_predict_job.iter_outputs()

prediction_results = list()
for blob in bp_iter_outputs:
    if blob.name.split("/")[-1].startswith("prediction"):
        prediction_results.append(blob.name)

tags = list()
for prediction_result in prediction_results:
    gfile_name = f"gs://{bp_iter_outputs.bucket.name}/{prediction_result}"
    with tf.io.gfile.GFile(name=gfile_name, mode="r") as gfile:
        for line in gfile.readlines():
            line = json.loads(line)
            print(line)
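
Rather than printing each raw line, you may want the most likely label per image. The sketch below assumes that, for a classification model, the `prediction` object carries a `confidences` list aligned with `displayNames`. That key is an assumption not listed above, so check it against your own output files. `top_prediction` is a hypothetical helper and the sample line is made up.

```python
import json


def top_prediction(json_line):
    """Return (label, score) for the highest-confidence class in one result line."""
    record = json.loads(json_line)
    prediction = record["prediction"]
    names = prediction["displayNames"]
    scores = prediction["confidences"]  # assumed key, see the note above
    # Index of the largest confidence score.
    best = max(range(len(scores)), key=scores.__getitem__)
    return names[best], scores[best]


# Made-up result line in the assumed shape.
sample_line = json.dumps(
    {
        "prediction": {
            "ids": ["1", "2"],
            "displayNames": ["daisy", "tulips"],
            "confidences": [0.92, 0.08],
        }
    }
)

label, score = top_prediction(sample_line)
print(label, score)
```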

# Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Dataset
- Model
- AutoML Training Job
- Batch Job
- Cloud Storage Bucket

In [None]:
delete_bucket = False

# Delete the dataset using the Vertex dataset object
dataset.delete()

# Delete the model using the Vertex model object
model.delete()

# Delete the AutoML or Pipeline training job
job.delete()

# Delete the batch prediction job using the Vertex batch prediction object
batch_predict_job.delete()

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI