## AutoML tabular forecasting model for batch prediction

### Objective

In this tutorial, you learn how to create an `AutoML` tabular forecasting model from a Python script, and then do a batch prediction using the Vertex AI SDK. You can alternatively create and deploy models using the `gcloud` command-line tool or online using the Cloud Console.

This tutorial uses the following Google Cloud ML services:

- `AutoML Training`
- `Vertex AI Batch Prediction`
- `Vertex AI Model` resource

The steps performed include:

- Create a `Vertex AI Dataset` resource.
- Train an `AutoML` tabular forecasting `Model` resource.
- Obtain the evaluation metrics for the `Model` resource.
- Make a batch prediction.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q

In [None]:
import random
import string
import urllib

from google.cloud import aiplatform, bigquery

In [None]:
PROJECT_ID = "[your-project-id]" # @param {type:"string"}

! gcloud config set project $PROJECT_ID

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. Below are some of the regions supported by Vertex AI. We recommend choosing the region closest to you; for this GHack, use `us-central1`.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.


In [None]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex AI SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where the data associated with your dataset and model resources is retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [None]:
BUCKET_NAME = "[bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"
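
Before using the bucket name, it can help to sanity-check it locally. The sketch below is a simplified check that assumes only the basic Cloud Storage naming rules: 3-63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit. The helper name `looks_like_valid_bucket_name` is hypothetical. The check does not cover every rule (for example, reserved prefixes), and it cannot tell whether the name is already taken.

```python
import re

# Simplified pattern (an assumption, not the full rule set): 3-63 characters,
# lowercase letters, digits, '-', '_' and '.', starting and ending with a
# letter or digit.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if `name` passes a basic, local bucket-name check."""
    return bool(_BUCKET_NAME_RE.fullmatch(name))


# The notebook's placeholder value fails the check because of the brackets.
print(looks_like_valid_bucket_name("[bucket-name]"))  # False
print(looks_like_valid_bucket_name("my-forecast-bucket-a1b2c3d4"))  # True
```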

#### UUID

To avoid name collisions between users, you generate a short random identifier (referred to here as a UUID) for each session and append it to the names of the resources you create in this GHack.

In [None]:
# Generate a random identifier of a specified length (default: 8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

# GHack

Now you are ready to start creating your own AutoML tabular forecasting model.

In [None]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

#### Location of BigQuery training data

Now set the variable `TRAINING_DATASET_BQ_PATH` to the location of the BigQuery table that holds the training data.

In [None]:
TRAINING_DATASET_BQ_PATH = (
    "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2020_sales_train"
)

### Create the Dataset

Next, create the `Dataset` resource using the `create` method for the `TimeSeriesDataset` class, which takes the following parameters:

- `display_name`: The human readable name for the `Dataset` resource.
- `gcs_source`: A list of one or more dataset index files to import the data items into the `Dataset` resource.
- `bq_source`: Alternatively, import data items from a BigQuery table into the `Dataset` resource.

This operation may take several minutes.

In [None]:
dataset = aiplatform.TimeSeriesDataset.create(
    display_name="iowa_liquor_sales_train" + "_" + UUID,
    bq_source=[TRAINING_DATASET_BQ_PATH],
)

time_column = "date"
time_series_identifier_column = "store_name"
target_column = "sale_dollars"

print(dataset.resource_name)

In [None]:
COLUMN_SPECS = {
    time_column: "timestamp",
    target_column: "numeric",
    "city": "categorical",
    "zip_code": "categorical",
    "county": "categorical",
}
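
The training job described in the next section also lists an optional `column_transformations` parameter. As a rough sketch, assuming that parameter takes a list of single-key dictionaries of the form `{transformation: {"column_name": name}}`, a mapping shaped like `COLUMN_SPECS` could be converted as follows. The helper name `specs_to_transformations` is hypothetical, and the example repeats the mapping with literal column names so that it runs on its own. Check the SDK reference before relying on this shape.

```python
from typing import Dict, List


def specs_to_transformations(
    column_specs: Dict[str, str]
) -> List[Dict[str, Dict[str, str]]]:
    """Convert {column: transformation} into [{transformation: {"column_name": column}}]."""
    return [
        {transformation: {"column_name": column}}
        for column, transformation in column_specs.items()
    ]


# Same content as COLUMN_SPECS, written out with literal column names.
example_specs = {
    "date": "timestamp",
    "sale_dollars": "numeric",
    "city": "categorical",
    "zip_code": "categorical",
    "county": "categorical",
}

for transformation in specs_to_transformations(example_specs):
    print(transformation)
```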

### Create and run training job

To train an AutoML model, you perform two steps: 1) create a training job, and 2) run the job.

#### Create training job

An AutoML training job is created with the `AutoMLForecastingTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the `TrainingJob` resource.
- `column_transformations`: (Optional) Transformations to apply to the input columns.
- `optimization_objective`: The optimization objective to minimize or maximize. Some examples:
    - `minimize-rmse`
    - `minimize-mae`
    - `minimize-rmsle`
    - `minimize-quantile-loss`
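
To make these objectives concrete, here is a small sketch of the error measures they are named after, written in plain Python. The function names are illustrative, and the formulas are the standard textbook definitions; the service's internal implementation may differ in detail.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmsle(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared log error: measures relative rather than absolute error."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(a)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


def quantile_loss(
    actual: Sequence[float], predicted: Sequence[float], quantile: float
) -> float:
    """Pinball loss: weights under- and over-prediction asymmetrically."""
    return sum(
        max(quantile * (a - p), (quantile - 1) * (a - p))
        for a, p in zip(actual, predicted)
    ) / len(actual)


# Illustrative values only: one forecast is too high by 2, the other is exact.
actual = [2.0, 4.0]
predicted = [4.0, 4.0]
print(rmse(actual, predicted))
print(mae(actual, predicted))
print(rmsle(actual, predicted))
print(quantile_loss(actual, predicted, quantile=0.9))
```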
    

The instantiated object is the job for the training pipeline.

In [None]:
MODEL_DISPLAY_NAME = f"iowa-liquor-sales-forecast-model_{UUID}"

training_job = #['Fill in here']

#### Run the training pipeline

Next, you start the training job by invoking the method `run`, with the following parameters:

- `dataset`: The `Dataset` resource to train the model.
- `model_display_name`: The human readable name for the trained model.
- `training_fraction_split`: The fraction of the dataset (between 0 and 1) to use for training.
- `test_fraction_split`: The fraction of the dataset (between 0 and 1) to use for testing (holdout data).
- `target_column`: The name of the column to train as the label.
- `budget_milli_node_hours`: (Optional) Maximum training budget in milli node hours (1,000 = 1 node hour).
- `time_column`: The column that holds the timestamp of each row.
- `time_series_identifier_column`: The column that identifies which time series each row belongs to.
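
The numeric settings in this list are easy to get wrong, so here is a minimal sketch that converts a budget in node hours to milli node hours and checks that the split fractions are consistent. The helper names are hypothetical, and the validation fraction is an assumption (the list above names only the training and test fractions). The example values are illustrative, not recommendations.

```python
import math


def hours_to_milli_node_hours(node_hours: float) -> int:
    """Convert node hours to milli node hours (1 node hour = 1000)."""
    if node_hours <= 0:
        raise ValueError("node_hours must be positive")
    return round(node_hours * 1000)


def check_fraction_splits(training: float, validation: float, test: float) -> None:
    """Raise ValueError unless each fraction is in [0, 1] and they sum to 1."""
    fractions = (training, validation, test)
    if any(f < 0 or f > 1 for f in fractions):
        raise ValueError("each fraction must be between 0 and 1")
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")


print(hours_to_milli_node_hours(1))  # 1000
print(hours_to_milli_node_hours(2.5))  # 2500
check_fraction_splits(0.8, 0.1, 0.1)  # consistent, so nothing is raised
```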

When it completes, the `run` method returns the `Model` resource.

The training pipeline can take 2-3 hours to run.

In [None]:
model = #['Fill in here']

## Review model evaluation scores

After model training has finished, you can review the evaluation scores.

In [None]:
#['Print your model evaluation results']

## Send a batch prediction request

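Send a batch prediction request to your trained model. Batch prediction runs directly against the `Model` resource, so the model does not need to be deployed to an endpoint.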

### Make the batch prediction request

Now that your `Model` resource is trained, you can make a batch prediction by invoking the `batch_predict()` method with a BigQuery source and destination, using the following parameters:

- `job_display_name`: The human readable name for the batch prediction job.
- `bigquery_source`: BigQuery URI to a table, up to 2000 characters long. For example: `bq://projectId.bqDatasetId.bqTableId`
- `bigquery_destination_prefix`: The BigQuery dataset or table for storing the batch prediction results.
- `instances_format`: The format for the input instances. Since a BigQuery source is used here, this should be set to `bigquery`.
- `predictions_format`: The format for the output predictions, `bigquery` is used here to output to a BigQuery table.
- `generate_explanation`: Set to `True` to generate explanations.
- `sync`: If set to `True`, the call blocks until the asynchronous batch job completes.
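
As a small illustration of the URI format described above, the sketch below builds `bq://` URIs from their parts and enforces the 2000-character limit mentioned for `bigquery_source`. The helper name `make_bq_uri` and the example project, dataset, and table names are hypothetical.

```python
from typing import Optional

MAX_BQ_URI_LENGTH = 2000  # limit stated above for `bigquery_source`


def make_bq_uri(project: str, dataset: str, table: Optional[str] = None) -> str:
    """Build bq://project.dataset or bq://project.dataset.table."""
    if not project or not dataset:
        raise ValueError("project and dataset are required")
    parts = [project, dataset]
    if table:
        parts.append(table)
    uri = "bq://" + ".".join(parts)
    if len(uri) > MAX_BQ_URI_LENGTH:
        raise ValueError(f"URI is longer than {MAX_BQ_URI_LENGTH} characters")
    return uri


# A table URI, as used for a source, and a dataset URI, as used for a destination prefix.
print(make_bq_uri("my-project", "my_dataset", "my_table"))  # bq://my-project.my_dataset.my_table
print(make_bq_uri("my-project", "my_dataset"))  # bq://my-project.my_dataset
```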

In [None]:
batch_predict_bq_output_dataset_name = f"iowa_liquor_sales_predictions_{UUID}"
batch_predict_bq_output_dataset_path = (
    f"{PROJECT_ID}.{batch_predict_bq_output_dataset_name}"
)
batch_predict_bq_output_uri_prefix = f"bq://{batch_predict_bq_output_dataset_path}"

# Create the BigQuery dataset that receives the predictions.
# It must be in the same location as the batch prediction input table.
client = bigquery.Client(project=PROJECT_ID)
bq_dataset = bigquery.Dataset(batch_predict_bq_output_dataset_path)
dataset_region = "US"  # @param {type : "string"}
bq_dataset.location = dataset_region
bq_dataset = client.create_dataset(bq_dataset)
print(
    f"Created BigQuery dataset {batch_predict_bq_output_dataset_path} in {dataset_region}"
)

In [None]:
PREDICTION_DATASET_BQ_PATH = (
    "bq://bigquery-public-data:iowa_liquor_sales_forecasting.2021_sales_predict"
)

batch_prediction_job = #['Create batch prediction']

print(batch_prediction_job)

In [None]:
#['Print batch prediction results']

# Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Dataset
- AutoML Training Job
- Model
- Batch Prediction Job
- Cloud Storage Bucket

In [None]:
# Delete dataset
dataset.delete()

# Delete training job
training_job.delete()

# Delete model
model.delete()

# Delete batch prediction job
batch_prediction_job.delete()

# Set this to True only if you'd like to delete your bucket
delete_bucket = False

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI