# Managed Pipelines & Unified AI Platform: AutoML Tabular + TFDV 'end-to-end' workflow + subgraph components



## Introduction

The first section of this example notebook shows how to build an [AutoML Tabular](https://cloud.google.com/ai-platform-unified/docs/training/training) 'end-to-end' workflow for **Managed Pipelines**, using the [Unified AI Platform](https://cloud.google.com/ai-platform-unified/docs)'s SDK along with the [KFP SDK](https://www.kubeflow.org/docs/pipelines/sdk/).

The pipeline creates a [*Dataset*](https://cloud.google.com/ai-platform-unified/docs/datasets/datasets) from a [BigQuery](https://cloud.google.com/bigquery/) table, uses it to train an **AutoML** tabular regression model, gets evaluation information about the trained model, and, if the model is sufficiently accurate, deploys the model for prediction. (It's also possible to export the model to run off-platform, though that's not included in this example.)

The pipeline is built using a KFP SDK feature for generating **subgraph components**: we'll build a subgraph component for the full AutoML e2e flow, and use that component to construct the pipeline. Then, in the next section of the notebook, we'll use this subgraph component to construct a larger pipeline.

This section also shows how to use [**Cloud Build**](https://cloud.google.com/cloud-build) to create a base container image for pipeline steps; how to use component I/O to pass information between the steps of a pipeline; how to set up **scheduled** (recurring) pipeline runs; and how to query the **Metadata Server** via the SDK to examine a pipeline's automatically tracked metadata.

<a href="https://storage.googleapis.com/amy-jo/images/automl/ucaip_automl_tabular_dag.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/automl/ucaip_automl_tabular_dag.png" width="90%"/></a>

The next section of the example notebook shows how to use the [TFDV](https://www.tensorflow.org/tfx/guide/tfdv) (**TensorFlow Data Validation**) library to generate dataset statistics, then compare two sets of statistics to determine whether an update to a dataset shows enough 'data drift' to warrant retraining the model, and initiate retraining if so. The stats generation step can run on the [Dataflow](https://cloud.google.com/dataflow) service, or locally within the pipeline step using the [Beam](https://beam.apache.org/) DirectRunner.

We'll build a pipeline that **uses these new TFDV components in conjunction with the AutoML subgraph component** built previously, to construct the full workflow.

This section also shows how to configure resource constraints for pipeline steps.

<a href="https://storage.googleapis.com/amy-jo/images/automl/ucaip-automl-tables-tfdv.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/automl/ucaip-automl-tables-tfdv.png" width="90%"/></a>

The training can take a few hours, so this example notebook can take a while to run in full.

### About the dataset and modeling task

The [Cloud Public Datasets Program](https://cloud.google.com/bigquery/public-data/) makes available public datasets that are useful for experimenting with machine learning. As in the “[Explaining model predictions on structured data](https://cloud.google.com/blog/products/ai-machine-learning/explaining-model-predictions-structured-data)” post, we’ll use data that is essentially a join of two public datasets stored in [BigQuery](https://cloud.google.com/bigquery/): [London Bike rentals](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=london_bicycles&page=dataset) and [NOAA weather data](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=noaa_gsod&page=dataset), with some additional processing to clean up outliers and derive additional GIS and day-of-week fields.

We’ll use this dataset to build a *regression* model to predict the *duration* of a bike rental based on information about the start and end stations, the day of the week, the weather on that day, and other data. If we were running a bike rental company, for example, these predictions, and their explanations, could help us anticipate demand and even plan how to stock each location.

## Setup

Before you run this notebook, ensure that your Google Cloud user account and project have been granted access to the Managed Pipelines Experimental. To request access, fill out this [form](http://go/cloud-mlpipelines-signup) and let your account representative know you have requested access.

This notebook is intended to be run on either of:
* [AI Platform Notebooks](https://cloud.google.com/ai-platform-notebooks). See the "AI Platform Notebooks" section in the Experimental [User Guide](https://docs.google.com/document/d/1JXtowHwppgyghnj1N1CT73hwD1caKtWkLcm2_0qGBoI/edit?usp=sharing) for more detail on creating a notebook server instance.
* [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb)

If you haven't already enabled the AI Platform API, on the [AI Platform (Unified) Dashboard](https://console.cloud.google.com/ai/platform) page in the Google Cloud Console, click **Enable the AI Platform API**.


We'll first install some libraries and set up some variables.


Set `gcloud` to use your project.  **Edit the following cell before running it**.

In [1]:
PROJECT_ID = 'jk-mlops-dev'  # <---CHANGE THIS

In [2]:
!gcloud config set project {PROJECT_ID}

Updated property [core/project].


If you're running this notebook on Colab, authenticate with your user account:

In [5]:
import sys
if 'google.colab' in sys.modules:
  from google.colab import auth
  auth.authenticate_user()

-----------------

**If you're on AI Platform Notebooks**, authenticate with Google Cloud before running the next section, by running
```sh
gcloud auth login
```
**in the Terminal window** (which you can open via **File** > **New** in the menu). You only need to do this once per notebook instance.

### Install the KFP SDK and AI Platform Pipelines client library

For Managed Pipelines Experimental, you'll need to download a special version of the AI Platform client library.

In [3]:
!gsutil cp gs://cloud-aiplatform-pipelines/releases/20210304/aiplatform_pipelines_client-0.1.0.caip20210304-py3-none-any.whl .
# Get the Metadata SDK to query the produced metadata.
!gsutil cp gs://cloud-aiplatform-metadata/sdk/google-cloud-aiplatform-metadata-0.0.1.tar.gz .

Copying gs://cloud-aiplatform-pipelines/releases/20210304/aiplatform_pipelines_client-0.1.0.caip20210304-py3-none-any.whl...
/ [1 files][ 22.1 KiB/ 22.1 KiB]                                                
Operation completed over 1 objects/22.1 KiB.                                     
Copying gs://cloud-aiplatform-metadata/sdk/google-cloud-aiplatform-metadata-0.0.1.tar.gz...
/ [1 files][ 98.5 KiB/ 98.5 KiB]                                                
Operation completed over 1 objects/98.5 KiB.                                     


Then, install the libraries and restart the kernel.

In [6]:
if 'google.colab' in sys.modules:
  USER_FLAG = ''
else:
  USER_FLAG = '--user'

In [7]:
!python3 -m pip install {USER_FLAG} kfp==1.4 google-cloud-aiplatform-metadata-0.0.1.tar.gz aiplatform_pipelines_client-0.1.0.caip20210304-py3-none-any.whl --upgrade

Processing ./google-cloud-aiplatform-metadata-0.0.1.tar.gz
Requirement already up-to-date: kfp==1.4 in /home/jupyter/.local/lib/python3.7/site-packages (1.4.0)
Processing ./aiplatform_pipelines_client-0.1.0.caip20210304-py3-none-any.whl
Building wheels for collected packages: google-cloud-aiplatform-metadata
  Building wheel for google-cloud-aiplatform-metadata (setup.py) ... done
  Created wheel for google-cloud-aiplatform-metadata: filename=google_cloud_aiplatform_metadata-0.0.1-py3-none-any.whl size=136056 sha256=d72e7e270c9c0128a959559e515ce710e159a8f725cca201e41516447d49c196
  Stored in directory: /home/jupyter/.cache/pip/wheels/cb/ba/c3/3c5e557e11fa1d94203be439e812dd2bbcd89169f38c0b84c2
Successfully built google-cloud-aiplatform-metadata
Installing collected packages: aiplatform-pipelines-client, google-cloud-aiplatform-metadata
  Attempting uninstall: aiplatform-pipelines-client
    Found existing installation: aiplatform-pipelines-client 0.1.0.caip20210209
    Unins

In [8]:
!python3 -m pip install {USER_FLAG} google-cloud-aiplatform

You should consider upgrading via the '/opt/conda/bin/python3 -m pip install --upgrade pip' command.


In [9]:
# Automatically restart kernel after installs 
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

The KFP version should be >= 1.4.



In [1]:
# Check the KFP version
!python3 -c "import kfp; print('KFP version: {}'.format(kfp.__version__))"

KFP version: 1.4.0
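
The cell above prints the version for a visual check. A minimal sketch of doing the comparison in code follows. `version_at_least` is a hypothetical helper, not part of the KFP SDK, and it assumes plain dotted numeric versions such as `1.4.0`. Comparing integer tuples avoids the pitfall of string comparison, where `'1.10.0'` would sort before `'1.4'`.

```python
def version_at_least(version: str, minimum: str) -> bool:
    """Return True if dotted numeric `version` is >= `minimum`.

    Hypothetical helper for illustration; assumes versions like '1.4.0'
    (no pre-release suffixes such as 'rc1').
    """
    def parse(v):
        return tuple(int(part) for part in v.split('.'))

    v, m = parse(version), parse(minimum)
    # Pad the shorter tuple with zeros so '1.4' compares equal to '1.4.0'.
    width = max(len(v), len(m))
    v += (0,) * (width - len(v))
    m += (0,) * (width - len(m))
    return v >= m


print(version_at_least('1.4.0', '1.4'))
print(version_at_least('1.10.0', '1.4'))
print(version_at_least('1.3.0', '1.4'))
```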


If you're on Colab, re-authorize after the kernel restart. **Edit the following cell for your project ID before running it.**

In [None]:
import sys
if 'google.colab' in sys.modules:
  PROJECT_ID = 'your-project-id'  # <---CHANGE THIS
  !gcloud config set project {PROJECT_ID}
  from google.colab import auth
  auth.authenticate_user()
  USER_FLAG = ''


### Set some variables

**Before you run the next cell**, **edit it** to set variables for your project.  See the "Before you begin" section of the User Guide for information on creating your API key.  For `BUCKET_NAME`, enter the name of a Cloud Storage (GCS) bucket in your project.  Don't include the `gs://` prefix.

In [2]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

# Required Parameters
USER = 'jarekk' # <---CHANGE THIS
BUCKET_NAME = 'jk-ucaip-labs'  # <---CHANGE THIS
PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(BUCKET_NAME, USER)

PROJECT_ID = 'jk-mlops-dev'  # <---CHANGE THIS
REGION = 'us-central1'
API_KEY = 'your-api-key'  # <---CHANGE THIS

print('PIPELINE_ROOT: {}'.format(PIPELINE_ROOT))

env: PATH=/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin
PIPELINE_ROOT: gs://jk-ucaip-labs/pipeline_root/jarekk
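
Since `BUCKET_NAME` must be given without the `gs://` prefix, a small guard can catch that mistake early. The following is a sketch only: `make_pipeline_root` is a hypothetical helper, and the bucket and user names in the example call are placeholders.

```python
def make_pipeline_root(bucket_name: str, user: str) -> str:
    """Build a pipeline root URI in the same shape as PIPELINE_ROOT above.

    Hypothetical helper: rejects a bucket name that already carries the
    'gs://' prefix, since the prefix is added here.
    """
    if bucket_name.startswith('gs://'):
        raise ValueError("BUCKET_NAME must not include the 'gs://' prefix")
    return 'gs://{}/pipeline_root/{}'.format(bucket_name.strip('/'), user)


# Placeholder values, for illustration only.
print(make_pipeline_root('my-bucket', 'my-user'))
```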


## (Optional) Create the underlying container image used for the pipeline steps



We'll use Cloud Build to generate the container image used by the AutoML pipeline components in this example, so that we don't need to load these libraries at runtime.

If you'd like to skip this section, you can instead use a prebuilt version of this same container image: `gcr.io/google-samples/automl-ucaip:v1`.

In [3]:
%%writefile Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-cpu.2-3:latest

RUN pip install -U google-cloud-aiplatform
RUN pip install -U google-cloud-storage

Writing Dockerfile


In [4]:
!gcloud builds submit --tag gcr.io/{PROJECT_ID}/custom-container-ucaip:{USER} .

Creating temporary tarball archive of 5 file(s) totalling 290.6 KiB before compression.
Uploading tarball of [.] to [gs://jk-mlops-dev_cloudbuild/source/1615563057.684333-fc4b847215cd407fa499e23524d84464.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/jk-mlops-dev/locations/global/builds/e8d8f417-8710-4b89-acd2-493a156a7b14].
Logs are available at [https://console.cloud.google.com/cloud-build/builds/e8d8f417-8710-4b89-acd2-493a156a7b14?project=895222332033].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "e8d8f417-8710-4b89-acd2-493a156a7b14"

FETCHSOURCE
Fetching storage object: gs://jk-mlops-dev_cloudbuild/source/1615563057.684333-fc4b847215cd407fa499e23524d84464.tgz#1615563058152786
Copying gs://jk-mlops-dev_cloudbuild/source/1615563057.684333-fc4b847215cd407fa499e23524d84464.tgz#1615563058152786...
/ [1 files][158.6 KiB/158.6 KiB]                                                
Operation completed over 1 objects/158.6 KiB

**The cell below uses the container image you built.** If you skipped the build and want to use the prebuilt image instead, comment out the first `CONTAINER_IMAGE` line and uncomment the second before running the cell.

In [5]:
# Container image built above with Cloud Build
CONTAINER_IMAGE = 'gcr.io/{}/custom-container-ucaip:{}'.format(PROJECT_ID, USER)
# Uncomment the following to use the prebuilt container image instead
# CONTAINER_IMAGE = 'gcr.io/google-samples/automl-ucaip:v1'
print(CONTAINER_IMAGE)

gcr.io/jk-mlops-dev/custom-container-ucaip:jarekk


## Create the AutoML pipeline components

In this section we'll define the pipeline components for the AutoML workflow. 
We'll build components for creating a Dataset, training a model, obtaining evaluation information for a given model, creating a model *endpoint*, and deploying the trained model to the endpoint for serving.

We'll build ['lightweight' Python-function-based components](https://www.kubeflow.org/docs/pipelines/sdk/python-function-components/) using the KFP SDK. 
First we'll define the Python functions for each step, then generate component `.yaml` files based on those function definitions.

### Create the dataset component function

Create a *Dataset* and ingest data into it from a BigQuery table. This function assumes a BQ table as the source, and would be a bit different if the source were a set of GCS files.

In [6]:
from kfp.components import OutputPath

def create_dataset_tabular_bigquery_sample(
    project: str,
    display_name: str,
    bigquery_uri: str, # eg 'bq://aju-dev-demos.london_bikes_weather.bikes_weather',
    location: str, # "us-central1",
    api_endpoint: str, # "us-central1-aiplatform.googleapis.com",
    timeout: int, # 500,
    drift_res: str,
    dataset_id: OutputPath('String'),
):

  import logging
  import subprocess
  import time

  logging.getLogger().setLevel(logging.INFO)
  if drift_res == 'false':
    logging.warning('no dataset drift detected; not proceeding')
    exit(1)

  from google.cloud import aiplatform
  from google.protobuf import json_format
  from google.protobuf.struct_pb2 import Value


  client_options = {"api_endpoint": api_endpoint}
  # Initialize client that will be used to create and send requests.
  client = aiplatform.gapic.DatasetServiceClient(client_options=client_options)
  metadata_dict = {"input_config": {"bigquery_source": {"uri": bigquery_uri}}}
  metadata = json_format.ParseDict(metadata_dict, Value())

  dataset = {
      "display_name": display_name,
      "metadata_schema_uri": "gs://google-cloud-aiplatform/schema/dataset/metadata/tabular_1.0.0.yaml",
      "metadata": metadata,
  }
  parent = f"projects/{project}/locations/{location}"
  response = client.create_dataset(parent=parent, dataset=dataset)
  print("Long running operation:", response.operation.name)
  create_dataset_response = response.result(timeout=timeout)
  logging.info("create_dataset_response: %s", create_dataset_response)
  path_components = create_dataset_response.name.split('/')
  logging.info('got dataset id: %s', path_components[-1])
  # write the dataset id as output
  with open('temp.txt', "w") as outfile:
    outfile.write(path_components[-1])
  subprocess.run(['gsutil', 'cp', 'temp.txt', dataset_id])
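
The function above takes the dataset ID to be the last path segment of the created resource's name. As a standalone sketch of that step, the hypothetical helper below also checks the overall shape of the name first. The expected shape `projects/{project}/locations/{location}/datasets/{dataset}` is an assumption here, and the values in the example are made up.

```python
def dataset_id_from_name(resource_name: str) -> str:
    """Return the trailing ID of a dataset resource name.

    Hypothetical helper; assumes names shaped like
    'projects/{project}/locations/{location}/datasets/{dataset}'.
    """
    parts = resource_name.split('/')
    expected_keys = ['projects', 'locations', 'datasets']
    # Even positions hold the collection names, odd positions the IDs.
    if len(parts) != 6 or parts[0::2] != expected_keys or not all(parts[1::2]):
        raise ValueError('unexpected dataset name: {!r}'.format(resource_name))
    return parts[-1]


# Made-up resource name, for illustration only.
print(dataset_id_from_name('projects/123/locations/us-central1/datasets/456'))
```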


### Create a component function to train an AutoML tabular regression model

We'll train an AutoML tabular model on the dataset.  To set up the training job, we need to provide the target (label) column and a dict of the input transformations: which fields to use as model inputs, and the input types.  These will be set later on in the notebook, when we run the pipeline.

For the model we're training, the target column will be set to `duration`, a numeric field. Thus, this function definition assumes a *regression* model, and would be a bit different if we were training a classification model instead.
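
The training component receives the transformations as a JSON string (`transformations_str`) and parses it back into Python objects. The sketch below shows that round trip with a small illustrative subset of columns; the full specification is set later in the notebook.

```python
import json

# Illustrative subset of columns; the full list is set when the pipeline is run.
transformations = [
    {"auto": {"column_name": "day_of_week"}},
    {"categorical": {"column_name": "start_station_id"}},
    {"timestamp": {"column_name": "ts"}},
]

# Serialize to a string so it can be passed as a `str` component input.
transformations_str = json.dumps(transformations)

# Inside the component, the string is parsed back into Python objects.
parsed = json.loads(transformations_str)

# Collect the column names listed in the specification.
column_names = [list(t.values())[0]["column_name"] for t in parsed]
print(column_names)
```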

In [7]:
from kfp.components import OutputPath

def training_tabular_regression(
    project: str,
    display_name: str,
    dataset_id: str,
    model_prefix: str,
    target_column: str,
    transformations_str: str,
    location: str, # "us-central1",
    api_endpoint: str, # "us-central1-aiplatform.googleapis.com"
    model_id: OutputPath('String'),
    model_dispname: OutputPath('String')
):
  import json
  import logging
  import subprocess
  import time
  from google.cloud import aiplatform
  from google.protobuf import json_format
  from google.protobuf.struct_pb2 import Value
  from google.cloud.aiplatform_v1beta1.types import pipeline_state

  SLEEP_INTERVAL = 100

  logging.getLogger().setLevel(logging.INFO)
  logging.info('using dataset id: %s', dataset_id)
  client_options = {"api_endpoint": api_endpoint}
  # Initialize client that will be used to create and send requests.
  client = aiplatform.gapic.PipelineServiceClient(client_options=client_options)
  # set the columns used for training and their data types
  transformations = json.loads(transformations_str)
  logging.info('using transformations: %s', transformations)

  training_task_inputs_dict = {
        # required inputs
        "targetColumn": target_column,
        "predictionType": "regression",
        "transformations": transformations,
        "trainBudgetMilliNodeHours": 2000,
        "disableEarlyStopping": False,
        "optimizationObjective": "minimize-rmse",
  }
  training_task_inputs = json_format.ParseDict(training_task_inputs_dict, Value())
  model_display_name = '{}_{}'.format(model_prefix, str(int(time.time())))

  training_pipeline = {
        "display_name": display_name,
        "training_task_definition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_tabular_1.0.0.yaml",
        "training_task_inputs": training_task_inputs,
        "input_data_config": {
            "dataset_id": dataset_id,
            "fraction_split": {
                "training_fraction": 0.8,
                "validation_fraction": 0.1,
                "test_fraction": 0.1,
            },
        },
        "model_to_upload": {"display_name": model_display_name},
  }
  parent = f"projects/{project}/locations/{location}"
  response = client.create_training_pipeline(
        parent=parent, training_pipeline=training_pipeline
  )
  training_pipeline_name = response.name
  logging.info("pipeline name: %s", training_pipeline_name)
  # Poll periodically until training completes
  while True:
    mresponse = client.get_training_pipeline(name=training_pipeline_name)
    logging.info('mresponse: %s', mresponse)
    logging.info('job state: %s', mresponse.state)
    if mresponse.state == pipeline_state.PipelineState.PIPELINE_STATE_SUCCEEDED:
      logging.info('training finished')
      # write some outputs once finished
      model_name = mresponse.model_to_upload.name 
      logging.info('got model name: %s', model_name)
      with open('temp.txt', "w") as outfile:
        outfile.write(model_name)
      subprocess.run(['gsutil', 'cp', 'temp.txt', model_id])
      with open('temp2.txt', "w") as outfile:
        outfile.write(model_display_name)
      subprocess.run(['gsutil', 'cp', 'temp2.txt', model_dispname])      
      break
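    elif mresponse.state in (
        pipeline_state.PipelineState.PIPELINE_STATE_FAILED,
        pipeline_state.PipelineState.PIPELINE_STATE_CANCELLED):
      # stop polling if training ended without producing a model
      raise RuntimeError(
          'training pipeline ended in state: {}'.format(mresponse.state))
    else:
      time.sleep(SLEEP_INTERVAL)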
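
The polling pattern used above can be factored into a small helper that treats failure states and an optional timeout as reasons to stop. This is a sketch under stated assumptions rather than part of the notebook's pipeline: `poll_until_done` and its parameters are hypothetical, and `get_state` stands in for a call such as fetching the training pipeline and reading its `state`. The example run uses a simulated sequence of states so that it needs no cloud access.

```python
import time


def poll_until_done(get_state, success, failure, interval_s=100,
                    timeout_s=None, sleep=time.sleep, clock=time.monotonic):
    """Poll `get_state()` until it returns a terminal state.

    Returns the state on success; raises on a failure state or on timeout.
    All names here are hypothetical.
    """
    start = clock()
    while True:
        state = get_state()
        if state in success:
            return state
        if state in failure:
            raise RuntimeError('job ended in state: {}'.format(state))
        if timeout_s is not None and clock() - start > timeout_s:
            raise TimeoutError('gave up after {} seconds'.format(timeout_s))
        sleep(interval_s)


# Simulated sequence of states, so the sketch runs without any cloud calls.
states = iter(['RUNNING', 'RUNNING', 'SUCCEEDED'])
result = poll_until_done(lambda: next(states), success={'SUCCEEDED'},
                         failure={'FAILED', 'CANCELLED'}, interval_s=0)
print(result)
```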


### Create a component function to get evaluation information for the trained model

Once the model training has finished, we can get model evaluation metrics.


In [8]:

def get_model_evaluation_tabular(
    project: str,
    model_id: str,
    location: str, #"us-central1",
    api_endpoint: str, # "us-central1-aiplatform.googleapis.com",
    eval_info: OutputPath('String'),
):
  import json
  import logging
  import subprocess  
  from google.cloud import aiplatform

  def get_eval_id(client, model_name):
    from google.protobuf.json_format import MessageToDict
    response = client.list_model_evaluations(parent=model_name)
    eval_name, metrics_str = None, None
    for evaluation in response:
        print("model_evaluation")
        print(" name:", evaluation.name)
        print(" metrics_schema_uri:", evaluation.metrics_schema_uri)
        metrics = MessageToDict(evaluation._pb.metrics)
        for metric in metrics.keys():
            logging.info('metric: %s, value: %s', metric, metrics[metric])
        eval_name = evaluation.name
        metrics_str = json.dumps(metrics)

    if eval_name is None:  # the listing was empty
        raise RuntimeError('no evaluations found for model: {}'.format(model_name))
    return (eval_name, metrics_str)  # for regression, only one slice

  logging.getLogger().setLevel(logging.INFO)

  client_options = {"api_endpoint": api_endpoint}
  # Initialize client that will be used to create and send requests.
  client = aiplatform.gapic.ModelServiceClient(client_options=client_options)
  eval_name, metrics_str = get_eval_id(client, model_id)
  logging.info('got evaluation name: %s', eval_name)
  logging.info('got metrics dict string: %s', metrics_str)
  with open('temp.txt', "w") as outfile:
    outfile.write(metrics_str)
  subprocess.run(['gsutil', 'cp', 'temp.txt', eval_info])  
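
Each component function above ends by writing its result to the location named by its `OutputPath` parameter, using a temporary file and `gsutil cp`. For trying the same hand-off locally, a sketch like the following may help. It assumes the output location is an ordinary local file path, `write_output` is a hypothetical helper, and the metric value is made up.

```python
import os
import tempfile


def write_output(value: str, output_path: str) -> None:
    """Write a component output value to a local file path.

    Hypothetical helper; assumes `output_path` is a local path and creates
    any missing parent directories first.
    """
    parent = os.path.dirname(output_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(output_path, 'w') as outfile:
        outfile.write(value)


# Try it against a temporary directory (made-up metric value).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'outputs', 'eval_info')
    write_output('{"meanAbsoluteError": 400.0}', path)
    with open(path) as infile:
        contents = infile.read()
print(contents)
```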


### Create a component function to create a serving endpoint

Here we're creating a serving endpoint to which we'll deploy the trained model.  If an existing endpoint path is given as an arg, we'll use that instead of creating a new endpoint. 

In [9]:
from kfp.components import OutputPath

def create_endpoint(
    project: str,
    display_name: str,
    endpoint_path: str,    
    location: str, # "us-central1",
    api_endpoint: str, # "us-central1-aiplatform.googleapis.com",
    timeout: int,
    endpoint_id: OutputPath('String'),

):
  import logging
  import subprocess  
  from google.cloud import aiplatform

  logging.getLogger().setLevel(logging.INFO)
  if endpoint_path == 'new':  # then create new endpoint, using given display name
    logging.info('creating new endpoint with display name: %s', display_name)
    client_options = {"api_endpoint": api_endpoint}
    # Initialize client that will be used to create and send requests.
    client = aiplatform.gapic.EndpointServiceClient(client_options=client_options)
    endpoint = {"display_name": display_name}
    parent = f"projects/{project}/locations/{location}"
    response = client.create_endpoint(parent=parent, endpoint=endpoint)
    logging.info("Long running operation: %s", response.operation.name)
    create_endpoint_response = response.result(timeout=timeout)
    logging.info("create_endpoint_response: %s", create_endpoint_response)
    endpoint_name = create_endpoint_response.name 
    logging.info('endpoint name: %s', endpoint_name)
  else:  # otherwise, use given endpoint path expression (TODO: add error checking)
    logging.info('using existing endpoint: %s', endpoint_path)
    endpoint_name = endpoint_path
  # write the endpoint name (path expression) as output
  with open('temp.txt', "w") as outfile:
    outfile.write(endpoint_name)
  subprocess.run(['gsutil', 'cp', 'temp.txt', endpoint_id])
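
The function above notes a TODO for error checking on a supplied endpoint path. One possible shape for that check is sketched below. `resolve_endpoint_arg` is a hypothetical helper, and the expected resource-name form `projects/{project}/locations/{location}/endpoints/{endpoint}` is an assumption; the example values are placeholders.

```python
import re

# Assumed resource-name shape:
#   projects/{project}/locations/{location}/endpoints/{endpoint}
_ENDPOINT_RE = re.compile(r'^projects/[^/]+/locations/[^/]+/endpoints/[^/]+$')


def resolve_endpoint_arg(endpoint_path: str) -> str:
    """Classify the `endpoint_path` argument.

    Returns 'create' when a new endpoint is requested, 'reuse' when the
    argument looks like an endpoint resource name; raises otherwise.
    """
    if endpoint_path == 'new':
        return 'create'
    if _ENDPOINT_RE.match(endpoint_path):
        return 'reuse'
    raise ValueError('not a valid endpoint path: {!r}'.format(endpoint_path))


print(resolve_endpoint_arg('new'))
# Placeholder project and endpoint IDs, for illustration only.
print(resolve_endpoint_arg(
    'projects/my-project/locations/us-central1/endpoints/1234567890'))
```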


### Create a component function to deploy the trained model for serving

This function deploys the trained model for serving.  It supports simple 'gating' logic on model quality.

The model eval metrics dict (extracted in a previous step of the pipeline) is passed as a component input and compared with a dict of metric threshold values, passed as a pipeline arg. If the model doesn't meet the given threshold value(s), it won't be deployed.

In [10]:
def deploy_automl_tabular_model(
    project: str,
    endpoint_name: str,
    model_name: str,
    deployed_model_display_name: str,
    eval_info: str,
    location: str = "us-central1",
    api_endpoint: str = "us-central1-aiplatform.googleapis.com",
    timeout: int = 7200,
    thresholds_dict_str: str = '{"meanAbsoluteError": 470}'
):
  import json
  import logging
  from google.cloud import aiplatform

  # check the model metrics against the given thresholds dict
  def regression_thresholds_check(metrics_dict, thresholds_dict):
    for k, v in thresholds_dict.items():
      logging.info('k {}, v {}'.format(k, v))
      if k in ['rootMeanSquaredError', 'meanAbsoluteError']:  # lower is better
        if metrics_dict[k] > v:  # if over threshold
          logging.info('{} > {}; returning False'.format(
              metrics_dict[k], v))
          return False
      elif k in ['rSquared']:  # higher is better
        if metrics_dict[k] < v:  # if under threshold
          logging.info('{} < {}; returning False'.format(
              metrics_dict[k], v))
          return False
      else:  # unhandled key in thresholds dict
        # TODO: should the default instead be to deploy?
        logging.info('unhandled threshold key %s; not deploying', k)
        return False
    logging.info('threshold checks passed.')
    return True  

  logging.getLogger().setLevel(logging.INFO)
  metrics_dict = json.loads(eval_info)
  thresholds_dict = json.loads(thresholds_dict_str)
  logging.info('got metrics dict: %s', metrics_dict)
  logging.info('got thresholds dict: %s', thresholds_dict)
  deploy = regression_thresholds_check(metrics_dict, thresholds_dict)
  if not deploy:
    # then don't deploy the model
    logging.warning('model is not accurate enough to deploy')
    return 

  client_options = {"api_endpoint": api_endpoint}
  # Initialize client that will be used to create and send requests.
  client = aiplatform.gapic.EndpointServiceClient(client_options=client_options)
  deployed_model = {
      # format: 'projects/{project}/locations/{location}/models/{model}'
      "model": model_name,
      "display_name": deployed_model_display_name,
      "dedicated_resources": {
          "min_replica_count": 1,
          "machine_spec": {
              "machine_type": "n1-standard-8",
              # Accelerators can be used only if the model specifies a GPU image.
              # 'accelerator_type': aiplatform.AcceleratorType.NVIDIA_TESLA_K80,
              # 'accelerator_count': 1,
          },
      }        
  }
  # key '0' assigns traffic for the newly deployed model
  # Traffic percentage values must add up to 100
  # Leave dictionary empty if endpoint should not accept any traffic
  traffic_split = {"0": 100}
  response = client.deploy_model(
      endpoint=endpoint_name, deployed_model=deployed_model, traffic_split=traffic_split
  )
  print("Long running operation:", response.operation.name)
  deploy_model_response = response.result(timeout=timeout)
  print("deploy_model_response:", deploy_model_response)
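
The gating logic can be tried on its own, without deploying anything. The sketch below is a table-driven variant of the check above, not a copy of it: `passes_thresholds` and `LOWER_IS_BETTER` are hypothetical names, the metric values are made up, and, unlike the function above, a metric missing from the evaluation counts as a failed check rather than raising `KeyError`.

```python
import json

# Direction of each supported metric: True means lower values are better.
LOWER_IS_BETTER = {
    'rootMeanSquaredError': True,
    'meanAbsoluteError': True,
    'rSquared': False,
}


def passes_thresholds(metrics: dict, thresholds: dict) -> bool:
    """Return True only if every threshold is satisfied.

    Unknown threshold keys, and metrics missing from `metrics`,
    count as failures.
    """
    for key, limit in thresholds.items():
        if key not in LOWER_IS_BETTER or key not in metrics:
            return False
        value = metrics[key]
        if LOWER_IS_BETTER[key] and value > limit:
            return False
        if not LOWER_IS_BETTER[key] and value < limit:
            return False
    return True


# Made-up metric values, for illustration only.
eval_info = ('{"meanAbsoluteError": 412.5, '
             '"rootMeanSquaredError": 900.1, "rSquared": 0.31}')
metrics = json.loads(eval_info)
print(passes_thresholds(metrics, {"meanAbsoluteError": 470}))
print(passes_thresholds(metrics, {"meanAbsoluteError": 400}))
print(passes_thresholds(metrics, {"rSquared": 0.5}))
```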


### Create components from the python functions

Now we'll use the `kfp.components.func_to_container_op` function to create `yaml` component files for the functions above. Each component uses the container image selected above as its `base_image`.

The `yaml` component files make it easy to version-track and share component definitions. 

In [12]:
from kfp import components
from kfp.v2 import dsl
from kfp.v2 import compiler

components.func_to_container_op(create_dataset_tabular_bigquery_sample,
      output_component_file='tables_create_dataset_component.yaml', 
      base_image=CONTAINER_IMAGE)

components.func_to_container_op(training_tabular_regression,
      output_component_file='tables_train_component.yaml', 
      base_image=CONTAINER_IMAGE)

components.func_to_container_op(get_model_evaluation_tabular,
      output_component_file='tables_eval_component.yaml', 
      base_image=CONTAINER_IMAGE)

components.func_to_container_op(create_endpoint,
      output_component_file='tables_endpoint_component.yaml', 
      base_image=CONTAINER_IMAGE)

components.func_to_container_op(deploy_automl_tabular_model,
      output_component_file='tables_deploy_component.yaml', 
      base_image=CONTAINER_IMAGE)


<function Deploy automl tabular model(project: str, endpoint_name: str, model_name: str, deployed_model_display_name: str, eval_info: str, location: str = 'us-central1', api_endpoint: str = 'us-central1-aiplatform.googleapis.com', timeout: int = '7200', thresholds_dict_str: str = '{"meanAbsoluteError": 470}')>

## Define and run an e2e AutoML pipeline using the components you defined



Now we're ready to define and run a pipeline, using the components defined above.

First, we'll create pipeline 'ops' from the components, using the `load_component_from_file` method. While we're not using it here, there is also a `load_component_from_url` method, which is handy if your component files are checked into a repo or otherwise stored online. (For GitHub files, use the 'raw' URL).

Next, we'll build a **subgraph component** from the ops.  The subgraph component lets us package a (sub-)workflow in a reusable manner.

Then, we'll define a pipeline using that subgraph component.  
(In the next section, we'll use that subgraph component in a larger pipeline).
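
For the 'raw' GitHub URL mentioned above, a regular repository file URL can be rewritten mechanically. The following is a sketch only: `github_raw_url` is a hypothetical helper, it assumes URLs of the form `https://github.com/{owner}/{repo}/blob/{ref}/{path}`, and the repository in the example is a placeholder.

```python
def github_raw_url(blob_url: str) -> str:
    """Convert a github.com 'blob' URL into the corresponding raw URL.

    Hypothetical helper; assumes the form
    https://github.com/{owner}/{repo}/blob/{ref}/{path}.
    """
    prefix = 'https://github.com/'
    if not blob_url.startswith(prefix):
        raise ValueError('not a github.com URL')
    # Split into owner, repo, the literal 'blob', and '{ref}/{path}'.
    owner, repo, marker, rest = blob_url[len(prefix):].split('/', 3)
    if marker != 'blob':
        raise ValueError("expected a '/blob/' URL")
    return 'https://raw.githubusercontent.com/{}/{}/{}'.format(owner, repo, rest)


# Placeholder repository, for illustration only.
print(github_raw_url(
    'https://github.com/example-org/example-repo/blob/main/components/train.yaml'))
```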



We'll first do some imports.  Note that currently, for Managed Pipelines, we're using the `kfp.v2` namespace for the `dsl` and `compiler` packages.

In [13]:
import json
from kfp.v2 import dsl
from kfp.v2 import compiler
from kfp import components

Now we'll instantiate the pipeline op constructors from the component files we generated above.

In [14]:
create_dataset_op = components.load_component_from_file(
  './tables_create_dataset_component.yaml'
  )

train_op = components.load_component_from_file(
  './tables_train_component.yaml'
  )

eval_op = components.load_component_from_file(
  './tables_eval_component.yaml'
  )

create_endpoint_op = components.load_component_from_file(
  './tables_endpoint_component.yaml'
  )

deploy_op = components.load_component_from_file(
  './tables_deploy_component.yaml'
  )


Next, we'll define a _subgraph component_ using those ops.

> Because of [this bug](https://github.com/kubeflow/pipelines/issues/5173), we're temporarily re-instantiating the ops from URLs rather than using the op definitions above.

In the following, **note that some of the ops take as inputs the outputs from other ops.**

Note also that the `create_endpoint` step has no input dependencies, and thus no implicit ordering constraints. It can start right away and run concurrently with the steps that create a dataset and train a model.

In [16]:
from kfp import components

create_dataset_op = components.load_component_from_url(
  'https://gist.githubusercontent.com/amygdala/33b424c8cb2a287728fe0f07a81b19e0/raw/c79fc70908f7f36267f37edd923088e4c28da725/tables_create_dataset_component.yaml'
  )

train_op = components.load_component_from_url(
  'https://gist.githubusercontent.com/amygdala/7892b872b3b6ccfa518ac7f447010f22/raw/e3424e8a5ed9010668478cd5a8e3b7396fe738d2/tables_train_component.yaml'
  )

eval_op = components.load_component_from_url(
  'https://gist.githubusercontent.com/amygdala/7c3a74dcabb4a42bcf46470d68715c09/raw/a5f140acf5e0a952f379b07f1d77251fdebe1998/tables_eval_component.yaml'
  )

create_endpoint_op = components.load_component_from_url(
  'https://gist.githubusercontent.com/amygdala/8331619bd400e695f4f41c62747d9bd1/raw/63b0d69857ad6dc82cc9fe9e6f783a55fa0cc465/tables_endpoint_component.yaml'
  )

deploy_op = components.load_component_from_url(
  'https://gist.githubusercontent.com/amygdala/3a0db8c44cbeb573f6437978b65ff3ec/raw/6923d101d0c0319088dfa19a2b0fa6919b484c55/tables_deploy_component.yaml'
  )


def automl_tables( 
  gcp_project_id: str = 'YOUR_PROJECT_HERE',
  gcp_region: str = 'us-central1',
  dataset_display_name: str = 'YOUR_DATASET_NAME',
  api_endpoint: str = 'us-central1-aiplatform.googleapis.com',
  timeout: int = 2000,
  bigquery_uri: str = 'bq://aju-dev-demos.london_bikes_weather.bikes_weather',
  # bigquery_uri: str = 'bq://aju-vtests2.demos.bw_ts_sorted2',
  target_col_name: str = 'duration',
  time_col_name: str = 'none',    
  transformations: str = '{}',
  train_budget_milli_node_hours: int = 1000,
  model_prefix: str = 'bw',    
  # optimization_objective: str = 'minimize-rmse', 
  training_display_name: str = 'CHANGE THIS',
  endpoint_display_name: str = 'CHANGE THIS',
  # if set to other than 'new', use the given endpoint path rather than create new endpoint.  
  endpoint_path: str = 'new',
  thresholds_dict_str: str = '{"meanAbsoluteError": 470}',
  drift_res: str = 'true'
  ):  

  create_dataset = create_dataset_op(
    gcp_project_id,
    dataset_display_name,
    bigquery_uri,
    gcp_region,
    api_endpoint,
    timeout,
    drift_res
    )
  
  train = train_op(
    gcp_project_id,
    training_display_name,
    create_dataset.outputs['dataset_id'],
    model_prefix,
    target_col_name,
    transformations,
    gcp_region,
    api_endpoint
    )
  
  eval = eval_op(
    gcp_project_id,
    train.outputs['model_id'],
    gcp_region,
    api_endpoint
    )
  
  create_endpoint = create_endpoint_op(
    gcp_project_id,
    dataset_display_name,
    endpoint_path,
    gcp_region,
    api_endpoint,
    timeout
  )

  deploy = deploy_op(
    gcp_project_id,
    create_endpoint.outputs['endpoint_id'],
    train.outputs['model_id'],
    train.outputs['model_dispname'],
    eval.outputs['eval_info'],
    gcp_region,
    api_endpoint,
    timeout,
    thresholds_dict_str
  )

Now, we'll call `components.create_graph_component_from_pipeline_func()` to generate a subgraph component `.yaml` file from the definition above.

In [17]:
automl_e2e_op = components.create_graph_component_from_pipeline_func(
    automl_tables,
    output_component_file='automl_tables_component.yaml',
)

Instantiate an op from the new component:

In [18]:
automl_e2e_op = components.load_component_from_file(
  './automl_tables_component.yaml'
  )

Next, we'll define the pipeline, using the subgraph component op defined above.

We'll first define some dataset type information to be used by AutoML tabular during training.

In [19]:
import json
# We'll use this transformation specification as an arg for the training step.
TRANSFORMATIONS = [
    {"auto": {"column_name": "bike_id"}},
    {"auto": {"column_name": "day_of_week"}},
    {"auto": {"column_name": "dewp"}},
    {"auto": {"column_name": "duration"}},
    {"auto": {"column_name": "end_latitude"}},
    {"auto": {"column_name": "end_longitude"}},
    {"categorical": {"column_name": "end_station_id"}},
    {"auto": {"column_name": "euclidean"}},
    {"categorical": {"column_name": "loc_cross"}},
    {"auto": {"column_name": "max"}},
    {"auto": {"column_name": "min"}},
    {"auto": {"column_name": "prcp"}},
    {"auto": {"column_name": "start_latitude"}},
    {"auto": {"column_name": "start_longitude"}},
    {"categorical": {"column_name": "start_station_id"}},
    {"auto": {"column_name": "temp"}},
    {"timestamp": {"column_name": "ts"}}
]
TRANSFORMATIONS_STR = json.dumps(TRANSFORMATIONS)
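The transformation spec above is a list of single-key dicts, each mapping a transformation type to a column name. As a minimal sketch, the same structure could be generated from a column-to-type mapping instead of being written out by hand. The helper name `build_transformations` and the shortened `COLUMN_TYPES` mapping are hypothetical and only illustrate the shape of the data.

```python
import json

# Hypothetical, shortened mapping of column name -> transformation type.
# The notebook's full list above covers every column in the table.
COLUMN_TYPES = {
    "bike_id": "auto",
    "end_station_id": "categorical",
    "ts": "timestamp",
}


def build_transformations(column_types):
    """Return a list of {type: {"column_name": name}} dicts."""
    return [{kind: {"column_name": name}} for name, kind in column_types.items()]


transformations = build_transformations(COLUMN_TYPES)
# The pipeline parameter is a string, so serialize the list to JSON.
transformations_str = json.dumps(transformations)
print(transformations_str)
```

Keeping the mapping in one dict makes it easier to spot a column whose type was set incorrectly.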

Now the pipeline definition itself.  This pipeline uses only the subgraph component, which defines the AutoML 'e2e' workflow.  
(In the next section, we'll build a larger pipeline that adds some additional components).

In [20]:
@dsl.pipeline(
  name='ucaip-automl-tables-subgraph',
  description='Demonstrate a uCAIP AutoML Tables workflow'
)
def automl_tables_subgraph( 
  gcp_project_id: str = 'YOUR_PROJECT_HERE',
  gcp_region: str = 'us-central1',
  dataset_display_name: str = 'YOUR_DATASET_NAME',
  api_endpoint: str = 'us-central1-aiplatform.googleapis.com',
  timeout: int = 2000,
  bigquery_uri: str = 'bq://aju-dev-demos.london_bikes_weather.bikes_weather',
  # bigquery_uri: str = 'bq://aju-vtests2.demos.bw_ts_sorted2',
  target_col_name: str = 'duration',
  time_col_name: str = 'none',    
  transformations: str = TRANSFORMATIONS_STR,
  train_budget_milli_node_hours: int = 1000,
  model_prefix: str = 'bw',    
  # optimization_objective: str = 'minimize-rmse', 
  training_display_name: str = 'CHANGE THIS',
  endpoint_display_name: str = 'CHANGE THIS',
  # if set to other than 'new', use the given endpoint path rather than create new endpoint.  
  endpoint_path: str = 'new',
  thresholds_dict_str: str = '{"meanAbsoluteError": 470}',
  drift_res: str = 'true'
  ):

  automl_e2e = automl_e2e_op(
    gcp_project_id,
    gcp_region,
    dataset_display_name,
    api_endpoint, timeout, bigquery_uri,
    target_col_name, time_col_name, transformations, 
    train_budget_milli_node_hours, model_prefix,
    training_display_name, endpoint_display_name,
    endpoint_path, thresholds_dict_str,
    drift_res
  )

Compile the pipeline...

In [21]:
compiler.Compiler().compile(pipeline_func=automl_tables_subgraph,
                            pipeline_root=PIPELINE_ROOT,
                            output_path='automl_pipeline_spec.json')

... then create an API client object and run it.

In [22]:
import time
from aiplatform.pipelines import client

api_client = client.Client(project_id=PROJECT_ID, region=REGION, api_key=API_KEY)
DISPLAY_NAME = 'automl{}'.format(str(int(time.time())))
print(DISPLAY_NAME)


automl1615564030


Note that we can define pipeline input values via the `parameter_values` arg.

In [23]:
result = api_client.create_run_from_job_spec(
          job_spec_path='automl_pipeline_spec.json',
          name = DISPLAY_NAME,
#           pipeline_root=PIPELINE_ROOT,  # you can add this arg if you want to override the compiled value
          parameter_values={'gcp_project_id': '{}'.format(PROJECT_ID),
                           'dataset_display_name': DISPLAY_NAME,
                            'endpoint_display_name': DISPLAY_NAME,
                            'training_display_name': DISPLAY_NAME,
                            'thresholds_dict_str': '{"meanAbsoluteError": 470}',
                           })

Visit the running pipeline job in the Cloud Console by clicking the link above. As it runs, you should see a graph like the following.  

<a href="https://storage.googleapis.com/amy-jo/images/automl/ucaip_automl_tabular_pipeline_in_progress.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/automl/ucaip_automl_tabular_pipeline_in_progress.png" width="90%"/></a>

You can view and manage information about your dataset, model, and endpoint in the [Cloud Console](https://console.cloud.google.com/ai/platform/models) as well.
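
The `thresholds_dict_str` parameter passed in the run above is a JSON string, so it can be built with `json.dumps` rather than typed by hand. The actual comparison logic lives in the deploy component and is not shown in this notebook. The sketch below only illustrates one plausible check, assuming every thresholded metric is an error measure where lower is better. The function name `meets_thresholds` and the metric values are made up for illustration.

```python
import json


def meets_thresholds(eval_metrics, thresholds_dict_str):
    """Return True if every thresholded metric is at or below its limit.

    Assumption: all metrics are error measures (lower is better), as with
    meanAbsoluteError. A metric missing from eval_metrics counts as a failure.
    """
    thresholds = json.loads(thresholds_dict_str)
    for metric, limit in thresholds.items():
        value = eval_metrics.get(metric)
        if value is None or value > limit:
            return False
    return True


# Build the string form of the thresholds used as a pipeline parameter.
thresholds_str = json.dumps({"meanAbsoluteError": 470})

# Illustrative metric values only; these are not real evaluation results.
print(meets_thresholds({"meanAbsoluteError": 455.2}, thresholds_str))
print(meets_thresholds({"meanAbsoluteError": 512.9}, thresholds_str))
```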


## View Pipelines metadata using the Metadata Server

The set of artifacts and executions produced by the pipeline can also be queried using the AIPlatform Metadata SDK. The following shows a snippet for querying the metadata for a given pipeline run:

In [None]:
from google.cloud import aiplatform

from google import auth
from google.cloud.aiplatform_v1alpha1 import MetadataServiceClient
from google.auth.transport import grpc, requests
from google.cloud.aiplatform_v1alpha1.services.metadata_service.transports import grpc as transports_grpc

import pandas as pd

def _initialize_metadata_service_client() -> MetadataServiceClient:
  scope = 'https://www.googleapis.com/auth/cloud-platform'
  api_uri = 'us-central1-aiplatform.googleapis.com'
  credentials, _ = auth.default(scopes=(scope,))
  request = requests.Request()
  channel = grpc.secure_authorized_channel(credentials, request, api_uri)

  return MetadataServiceClient(
      transport=transports_grpc.MetadataServiceGrpcTransport(channel=channel))

md_client = _initialize_metadata_service_client()

In [None]:
def get_run_context_name(pipeline_run):
  contexts = md_client.list_contexts(parent='projects/{}/locations/{}/metadataStores/default'.format(PROJECT_ID, REGION))
  for context in contexts:
    if context.display_name == pipeline_run:
      return context.name
  
run_context_name = get_run_context_name(DISPLAY_NAME)  # <- Name of the pipeline run

md_client.query_context_lineage_subgraph(context=run_context_name)
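
The lineage subgraph returned by the query can be verbose to read directly. As a rough sketch, its artifacts could be flattened into plain dicts for easier inspection. This assumes the response exposes an `artifacts` sequence whose items have `name`, `display_name`, `schema_title`, and `uri` attributes, so check the fields available in your client library version. The helper name and the stand-in object below are hypothetical.

```python
from types import SimpleNamespace


def artifacts_to_rows(subgraph):
    """Flatten the artifacts of a lineage subgraph into a list of dicts.

    Assumption: `subgraph.artifacts` is iterable and each item exposes the
    attributes read below; missing attributes fall back to an empty string.
    """
    rows = []
    for artifact in getattr(subgraph, "artifacts", []):
        rows.append({
            "name": getattr(artifact, "name", ""),
            "display_name": getattr(artifact, "display_name", ""),
            "schema_title": getattr(artifact, "schema_title", ""),
            "uri": getattr(artifact, "uri", ""),
        })
    return rows


# Stand-in object for illustration only. In the notebook you would pass the
# value returned by md_client.query_context_lineage_subgraph(...).
fake_subgraph = SimpleNamespace(artifacts=[
    SimpleNamespace(name="artifacts/1", display_name="model",
                    schema_title="system.Model", uri="gs://your/gcs/path/model"),
])
rows = artifacts_to_rows(fake_subgraph)
print(rows)
```

If pandas is available (the cell above imports it as `pd`), `pd.DataFrame(rows)` gives a tabular view of the same data.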

## Create a scheduled recurrent pipeline job

This section shows how to create a scheduled pipeline job.  

> At time of writing, it is not yet possible to pass pipeline args to the scheduling method.
(This feature will be **supported in the next release** of the client libs). So, for purposes of this example, we'll generate a version of the pipeline with its inputs hardwired.  Otherwise, this pipeline is the same as the one defined above.

Under the hood, the scheduled jobs are supported by the Cloud Scheduler and a Cloud Functions function.  **Check first that the APIs for both of these services are enabled**. 

See the [Cloud Scheduler](https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules) documentation for more on the cron syntax used.
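
The schedule string used later in this section, `'0 7 * * *'`, follows the standard five-field cron layout. A minimal sketch that labels each field can help when adjusting it. The helper `describe_cron` is hypothetical: it does not validate field values and is not part of the scheduling client.

```python
# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into a dict of named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            "expected {} fields, got {}".format(len(CRON_FIELDS), len(parts)))
    return dict(zip(CRON_FIELDS, parts))


# Minute 0 of hour 7, on every day of every month.
print(describe_cron("0 7 * * *"))
```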

In [None]:
from aiplatform.pipelines import schedule

In [None]:
import time
DISPLAY_NAME = 'automl{}'.format(str(int(time.time())))
print(DISPLAY_NAME)


Create a scheduled pipeline job. (You may see an ignorable auth warning).

In [None]:
# adjust time zone and cron schedule as necessary
schedule.create_from_pipeline_file(
    pipeline_path='automl_pipeline_spec.json',
    schedule='0 7 * * *',  # run at 7am every day
    project_id=PROJECT_ID,
    region=REGION,
    time_zone='America/Los_Angeles', # change this as necessary
    parameter_values={'gcp_project_id': '{}'.format(PROJECT_ID),
                      'dataset_display_name': DISPLAY_NAME,
                      'endpoint_display_name': DISPLAY_NAME,
                      'training_display_name': DISPLAY_NAME,
                      'thresholds_dict_str': '{"meanAbsoluteError": 470}',
                      }    
)

Once the scheduled job is created, you can see it listed in the [Cloud Scheduler](https://console.cloud.google.com/cloudscheduler/) panel in the Console.

<a href="https://storage.googleapis.com/amy-jo/images/kf-pls/pipelines_scheduler.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/kf-pls/pipelines_scheduler.png" width="95%"/></a>

You can test the setup from the Cloud Scheduler panel by clicking 'RUN NOW'.

The implementation uses a Cloud Functions (GCF) function, which you can see listed in the [Cloud Functions](https://console.cloud.google.com/functions/list) panel in the console as `templated_http_request-v1`.  
Don't delete this GCF function: doing so will prevent the Cloud Scheduler jobs from actually kicking off the pipeline run.  If you do delete it, you will need to create a new scheduled job in order to recreate it. 

> After experimenting with the scheduled job, you will probably want to **DELETE** or **PAUSE** it so that it does not continue to run every day.

## Define and add TFDV pipeline components

Now that we have the basics of the AutoML e2e workflow running, we can consider an additional scenario: as new data becomes available, how might we determine whether our deployed model needs retraining?  One aspect of that decision is whether or not the original and newer datasets are significantly different.

This part of the example demonstrates use of the [TFDV](https://www.tensorflow.org/tfx/guide/tfdv) library to build pipeline *components* that derive dataset **statistics** and detect **drift** between older and newer datasets, and shows how to use drift information to decide whether to retrain a model on newer data.

> Note: TFDV also supports some nice notebook widgets for visualization of the analysis results, though this notebook doesn't include a demo of that.

We'll define both TFDV components as ['lightweight' Python-function-based components](https://www.kubeflow.org/docs/pipelines/sdk/python-function-components/). For each component, we define a function, then call `kfp.components.func_to_container_op()` on that function to build a reusable component in `.yaml` format. 

For these components, we need to specify a base container image that will run the function.  We'll use one that has the TFDV libraries installed (its Dockerfile is [here](https://github.com/amygdala/code-snippets/blob/keras_tuner3/ml/kubeflow-pipelines/keras_tuner/components/tfdv/Dockerfile)).

### Component to generate stats on the dataset

This component uses TFDV to generate statistics on a given dataset.

It uses a Beam pipeline (not to be confused with KFP Pipelines) to do this. Depending upon configuration, the component can use either the Direct (local) runner or the [Dataflow](https://cloud.google.com/dataflow#section-5) runner. This is determined by the `use_dataflow` input param.

Running the Beam pipeline on Dataflow rather than locally can make sense with large datasets. 

If you do run this component locally (that is, you leave the computation on the pipeline step node rather than starting a Dataflow job), then with the training dataset used for this example, you will need to give that node more memory than the default.  This is done at the pipeline definition stage; see that section below for more info.

If you use Dataflow, first ensure that the Dataflow API is enabled for your GCP project.  You will also need to give the service account used by Managed Pipelines permission to run the Dataflow jobs: 

```
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member=serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-aiplatform-cc.iam.gserviceaccount.com \
--role=roles/dataflow.admin
```



We'll first define the Python function:

In [None]:

from kfp.components import OutputPath

def generate_tfdv_stats(input_data: str, output_path: str, job_name: str, use_dataflow: str,
                        project_id: str, region:str, gcs_temp_location: str, gcs_staging_location: str,
                        whl_location: str, requirements_file: str,
                        stats: OutputPath('String')
):

  import logging
  import subprocess
  import time

  import tensorflow_data_validation as tfdv
  import tensorflow_data_validation.statistics.stats_impl
  from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions

  logging.getLogger().setLevel(logging.INFO)
  logging.info("output path: %s", output_path)
  logging.info("stats: %s", stats)
  logging.info("Building pipeline options")
  # Create and set your PipelineOptions.
  options = PipelineOptions()

  if use_dataflow == 'true':
    logging.info("using Dataflow")
    if not whl_location:
      logging.warning('tfdv whl file required with dataflow runner.')
      exit(1)
    # For Cloud execution, set the Cloud Platform project, job_name,
    # staging location, temp_location and specify DataflowRunner.
    google_cloud_options = options.view_as(GoogleCloudOptions)
    google_cloud_options.project = project_id
    google_cloud_options.job_name = '{}-{}'.format(job_name, str(int(time.time())))
    google_cloud_options.staging_location = gcs_staging_location
    google_cloud_options.temp_location = gcs_temp_location
    google_cloud_options.region = region
    options.view_as(StandardOptions).runner = 'DataflowRunner'

    setup_options = options.view_as(SetupOptions)
    # whl_location should point to the downloaded tfdv wheel file.
    setup_options.extra_packages = [whl_location]
    # Use the requirements_file input rather than a hardcoded file name.
    setup_options.requirements_file = requirements_file

  tfdv.generate_statistics_from_csv(
    data_location=input_data, output_path=output_path,
    pipeline_options=options)
  # write the output path as output
  with open('temp.txt', "w") as outfile:
    outfile.write(output_path)
  subprocess.run(['gsutil', 'cp', 'temp.txt', stats])  


Next, we'll create the component from the function definition:

In [None]:
from kfp import components

components.func_to_container_op(generate_tfdv_stats,
    output_component_file='tfdv_component.yaml', base_image='gcr.io/google-samples/tfdv-tests:v1')

### Component to detect dataset 'drift'

Next we'll define a component to detect drift on the `duration` field between two sets of derived stats files from two datasets.  See the [TFDV docs](https://www.tensorflow.org/tfx/data_validation/get_started#checking_data_skew_and_drift) for more detail.

(The component itself makes no assumptions about the datasets used to generate its input stats files.  However, when we add these to the pipeline, we'll use as the 'new stats' file the output of the component above, and compare that new file to previously-generated stats for the previous version of the dataset).
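
The drift component below sets a threshold on the Jensen-Shannon divergence for the `duration` feature. To give a feel for that quantity, here is a small standalone sketch that computes it for two histograms over the same bins. It is for illustration only: TFDV computes the measure internally from its own statistics, and its binning and numerical details may differ from this version, which uses base-2 logarithms so the result lies between 0 and 1. The histogram counts are made up.

```python
import math


def normalize(counts):
    """Convert raw bin counts into a probability distribution."""
    total = float(sum(counts))
    return [c / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; bins with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(counts_a, counts_b):
    """Jensen-Shannon divergence between two histograms over the same bins."""
    p = normalize(counts_a)
    q = normalize(counts_b)
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative counts only, not derived from the bikes dataset.
older = [120, 300, 80]
newer = [100, 310, 90]
print(jensen_shannon_divergence(older, newer))
```

Identical histograms give a divergence of 0, and histograms with no overlapping bins give 1, so a small threshold such as 0.01 flags even modest shifts in the distribution.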


In [None]:

from kfp.components import OutputPath

def tfdv_detect_drift(
    stats_older_path: str, stats_new_path: str, # expects output files from tfdv.generate_statistics_from_csv()
    drift_res: OutputPath('String')
):

  import logging
  import subprocess
  import time

  import tensorflow_data_validation as tfdv
  import tensorflow_data_validation.statistics.stats_impl

  logging.getLogger().setLevel(logging.INFO)
  logging.info('stats_older_path: %s', stats_older_path)
  logging.info('stats_new_path: %s', stats_new_path)

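  if stats_older_path == 'none':
    # No older stats to compare against: record a 'true' (retrain) decision
    # in the declared output file, since a return value is not captured here.
    with open(drift_res, 'w') as outfile:
      outfile.write('true')
    return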

  stats1 = tfdv.load_statistics(stats_older_path)
  stats2 = tfdv.load_statistics(stats_new_path)

  schema1 = tfdv.infer_schema(statistics=stats1)
  tfdv.get_feature(schema1, 'duration').drift_comparator.jensen_shannon_divergence.threshold = 0.01
  drift_anomalies = tfdv.validate_statistics(
      statistics=stats2, schema=schema1, previous_statistics=stats1)
  logging.info('drift analysis results: %s', drift_anomalies.drift_skew_info)

  from google.protobuf.json_format import MessageToDict
  d = MessageToDict(drift_anomalies)
  val = d['driftSkewInfo'][0]['driftMeasurements'][0]['value']
  thresh = d['driftSkewInfo'][0]['driftMeasurements'][0]['threshold']
  logging.info('value %s and threshold %s', val, thresh)
  res = 'true'
  if val < thresh:
    res = 'false'
  logging.info('train decision: %s', res)
  with open('temp.txt', "w") as outfile:
    outfile.write(res)
  subprocess.run(['gsutil', 'cp', 'temp.txt', drift_res])    
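
The function above reads only the first drift measurement and raises a `KeyError` if the anomalies message contains no drift info. As a hedged alternative, the decision logic could be factored into a small pure function that tolerates missing keys. The function name `train_decision`, the choice to default to `'true'` when no measurements are present, and the sample dicts are all assumptions made for this sketch.

```python
def train_decision(anomalies_dict, default="true"):
    """Return 'true' if any drift measurement reaches its threshold.

    `anomalies_dict` is expected to have the shape produced by calling
    MessageToDict on the anomalies message, as in the component above.
    If no usable measurement is found, `default` is returned (an assumption:
    retrain when drift cannot be assessed).
    """
    measurements = [
        m
        for info in anomalies_dict.get("driftSkewInfo", [])
        for m in info.get("driftMeasurements", [])
        if "value" in m and "threshold" in m
    ]
    if not measurements:
        return default
    drifted = any(m["value"] >= m["threshold"] for m in measurements)
    return "true" if drifted else "false"


# Illustrative inputs only; the values are made up.
below = {"driftSkewInfo": [{"driftMeasurements": [{"value": 0.004, "threshold": 0.01}]}]}
above = {"driftSkewInfo": [{"driftMeasurements": [{"value": 0.02, "threshold": 0.01}]}]}
print(train_decision(below), train_decision(above), train_decision({}))
```

Separating the decision from the file I/O also makes it possible to unit-test the logic without TFDV installed.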


Create the component from the function definition...

In [None]:
from kfp import components

components.func_to_container_op(tfdv_detect_drift,
    output_component_file='tfdv_drift_component.yaml', base_image='gcr.io/google-samples/tfdv-tests:v1')

## Define and run a TFDV + AutoML pipeline using your components



First, we'll create pipeline ops from the components, using the `load_component_from_file` method.

While we're not using it here, there is also a `load_component_from_url` method, which is handy if your component files are checked into a repo or otherwise stored online. (For GitHub files, use the 'raw' URL).

In [None]:
import json
from kfp.v2 import dsl
from kfp.v2 import compiler
from kfp import components

automl_e2e_op = components.load_component_from_file(
  './automl_tables_component.yaml'
  )

tfdv_op = components.load_component_from_file(
  './tfdv_component.yaml'
  )
tfdv_drift_op = components.load_component_from_file(
  './tfdv_drift_component.yaml'
  )

Next, we'll define the pipeline, using the ops defined above.

In [None]:
import json
# We'll use this transformation specification as an arg for the training step.
TRANSFORMATIONS = [
    {"auto": {"column_name": "bike_id"}},
    {"auto": {"column_name": "day_of_week"}},
    {"auto": {"column_name": "dewp"}},
    {"auto": {"column_name": "duration"}},
    {"auto": {"column_name": "end_latitude"}},
    {"auto": {"column_name": "end_longitude"}},
    {"categorical": {"column_name": "end_station_id"}},
    {"auto": {"column_name": "euclidean"}},
    {"categorical": {"column_name": "loc_cross"}},
    {"auto": {"column_name": "max"}},
    {"auto": {"column_name": "min"}},
    {"auto": {"column_name": "prcp"}},
    {"auto": {"column_name": "start_latitude"}},
    {"auto": {"column_name": "start_longitude"}},
    {"categorical": {"column_name": "start_station_id"}},
    {"auto": {"column_name": "temp"}},
    {"timestamp": {"column_name": "ts"}}
]
TRANSFORMATIONS_STR = json.dumps(TRANSFORMATIONS)

We'll first define some pipeline inputs.

Note: In the following, we're finessing one aspect of our scenario: the Dataset pipeline component reads from a BigQuery table.
The TFDV stats-generation component reads from a set of CSV files exported from that same BigQuery table. 
A more robust design would have the Dataset component read from the same set of CSV files (instead of BigQuery), _or_ automate the export from BigQuery to CSV as part of the pipeline.
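
As a rough sketch of the second option, a BigQuery table can be exported to CSV files on GCS with the `google-cloud-bigquery` client's `extract_table` method. This is not part of the notebook's pipeline. The helper names are hypothetical, the destination path is a placeholder, and a plain export does not reproduce the `train-*.csv` / `test-*.csv` split that the TFDV steps in this example read.

```python
def bq_uri_to_table_id(bigquery_uri):
    """Convert 'bq://project.dataset.table' to 'project.dataset.table'."""
    prefix = "bq://"
    if not bigquery_uri.startswith(prefix):
        raise ValueError("expected a URI starting with {}".format(prefix))
    return bigquery_uri[len(prefix):]


def export_table_to_csv(bigquery_uri, destination_uri, client=None, location=None):
    """Export a BigQuery table to CSV file(s) on GCS and wait for completion.

    `destination_uri` may contain a '*' wildcard so large tables are sharded
    into several files. CSV is the default export format.
    """
    if client is None:
        # Imported lazily so the helper above works without the library.
        from google.cloud import bigquery  # requires google-cloud-bigquery
        client = bigquery.Client()
    job = client.extract_table(
        bq_uri_to_table_id(bigquery_uri), destination_uri, location=location)
    job.result()  # block until the export job finishes
    return destination_uri


print(bq_uri_to_table_id("bq://aju-dev-demos.london_bikes_weather.bw_ts_sorted2"))
```

A call such as `export_table_to_csv(BIGQUERY_URI, 'gs://your/gcs/path/export-*.csv')` would need credentials with permission to read the table and write to the bucket.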

**Edit this next cell before running.**

In [None]:
OUTPUT_PATH = 'gs://your/gcs/path'  # CHANGE THIS

(The next cell does some setup work that will no longer be necessary once a bug is fixed.)

In [None]:
# Temporarily, we need to generate a unique run id manually.  We'll use that to generate
# a unique output path for this run, under which the TFDV eval .pb files will be generated.
# In future, this will not be necessary.
import time
RUN_ID_PLACEHOLDER = 'run{}'.format(str(int(time.time())))

# Also temporarily, we need to explicitly construct some of the pipeline steps params from the pipeline input params.
# In future, it will be possible to do this construction inline, as part of the pipeline definition.
GCS_TEMP_LOCATION = '{}/tmp'.format(OUTPUT_PATH)
# DATA_DIR = 'gs://aju-dev-demos-codelabs/bikes_weather_chronological/ds1/'
# BIGQUERY_URI = 'bq://aju-dev-demos.london_bikes_weather.bw_ts_sorted1'
DATA_DIR = 'gs://aju-dev-demos-codelabs/bikes_weather_chronological/ds2/' # the 'newer' dataset
BIGQUERY_URI = 'bq://aju-dev-demos.london_bikes_weather.bw_ts_sorted2'  # the 'newer' dataset
TEST_OUTPUT_PATH = '{}/{}/eval/evaltest.pb'.format(OUTPUT_PATH, RUN_ID_PLACEHOLDER)
TRAIN_OUTPUT_PATH = '{}/{}/eval/evaltrain.pb'.format(OUTPUT_PATH, RUN_ID_PLACEHOLDER)
INPUT_DATA_TEST = '{}test-*.csv'.format(DATA_DIR)
INPUT_DATA_TRAIN = '{}train-*.csv'.format(DATA_DIR)
JOB_NAME_TEST = RUN_ID_PLACEHOLDER + '-1'
JOB_NAME_TRAIN = RUN_ID_PLACEHOLDER + '-2'
print(TRAIN_OUTPUT_PATH)

In [None]:
print(INPUT_DATA_TRAIN)

Now we'll define the new pipeline, which augments the AutoML pipeline built in the previous section with the two new TFDV steps.

If you run the `tfdv2` step with `use_dataflow` set to `false`, then you will need to uncomment the first line below:
```
  # tfdv2.set_memory_limit('32G')
  # tfdv2.set_gpu_limit(1)
  # tfdv2.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')
```
to give the step more memory. The second and third lines show how you'd also run the step on a GPU-enabled node.  

In [None]:
@dsl.pipeline(
  name='ucaip-automl-tables-tfdv-subgraph',
  description='Demonstrate a ucaip AutoML Tables workflow'
)
def automl_tables_tfdv_subgraph( 
  gcp_project_id: str = 'YOUR_PROJECT_HERE',
  gcp_region: str = 'us-central1',
  dataset_display_name: str = 'YOUR_DATASET_NAME',
  api_endpoint: str = 'us-central1-aiplatform.googleapis.com',
  timeout: int = 2000,
  bigquery_uri: str = 'bq://aju-dev-demos.london_bikes_weather.bw_ts_sorted2',
  target_col_name: str = 'duration',
  time_col_name: str = 'none',    
  transformations: str = TRANSFORMATIONS_STR,
  train_budget_milli_node_hours: int = 1000,
  model_prefix: str = 'bw',    
  # optimization_objective: str = 'minimize-rmse', 
  training_display_name: str = 'CHANGE THIS',
  endpoint_display_name: str = 'CHANGE THIS',
  # if set to other than 'new', use the given endpoint path rather than create new endpoint.  
  endpoint_path: str = 'new',
  thresholds_dict_str: str = '{"meanAbsoluteError": 470}',

  # tfdv-related
  region: str = 'us-central1',
  requirements_file: str = 'requirements.txt',
  job_name: str = 'testx',
  gcs_staging_location: str = OUTPUT_PATH,
  gcs_temp_location:str = GCS_TEMP_LOCATION,
  data_dir: str = 'gs://aju-dev-demos-codelabs/bikes_weather_chronological/ds2/', # requires trailing slash
  output_path: str = OUTPUT_PATH,
  whl_location: str = 'tensorflow_data_validation-0.26.0-cp37-cp37m-manylinux2010_x86_64.whl',
  use_dataflow: str = '',
  thresholds: str = '{"root_mean_squared_error": 2000}',
  stats_older_path: str = 'gs://aju-dev-demos-codelabs/bikes_weather_chronological/evaltrain1.pb'  # tfdv stats for the 'older' dataset
  ):

  tfdv1 = tfdv_op(
    input_data = INPUT_DATA_TEST,
    output_path = TEST_OUTPUT_PATH,
    job_name=JOB_NAME_TEST,
    use_dataflow='false',
    # use_dataflow=use_dataflow,
    project_id=gcp_project_id, region=region,
    gcs_temp_location=gcs_temp_location, gcs_staging_location=gcs_staging_location,
    whl_location=whl_location, requirements_file=requirements_file
    )
  tfdv2 = tfdv_op(
    input_data = INPUT_DATA_TRAIN,
    output_path = TRAIN_OUTPUT_PATH,
    job_name=JOB_NAME_TRAIN,
    use_dataflow=use_dataflow,
    # use_dataflow='false',
    project_id=gcp_project_id, region=region,
    gcs_temp_location=gcs_temp_location, gcs_staging_location=gcs_staging_location,
    whl_location=whl_location, requirements_file=requirements_file
    )
  # tfdv2.set_memory_limit('32G')
  # tfdv2.set_gpu_limit(1)
  # tfdv2.add_node_selector_constraint('cloud.google.com/gke-accelerator', 'nvidia-tesla-k80')

  tfdv_drift = tfdv_drift_op(stats_older_path, tfdv2.outputs['stats'])

  automl_e2e = automl_e2e_op(
    gcp_project_id,
    gcp_region,
    dataset_display_name,
    api_endpoint, timeout, bigquery_uri,
    target_col_name, time_col_name, transformations, 
    train_budget_milli_node_hours, model_prefix,
    training_display_name, endpoint_display_name,
    endpoint_path, thresholds_dict_str,
    tfdv_drift.outputs['drift_res']
  )  


Compile the pipeline...

In [None]:
compiler.Compiler().compile(pipeline_func=automl_tables_tfdv_subgraph,
                            pipeline_root=PIPELINE_ROOT,
                            output_path='automl_tfdv_pipeline_spec.json')

... then run it.

In [None]:
import time
from aiplatform.pipelines import client

api_client = client.Client(project_id=PROJECT_ID, region=REGION, api_key=API_KEY)
DISPLAY_NAME = 'automltfdv{}'.format(str(int(time.time())))
print(DISPLAY_NAME)


Note that we can define pipeline input values via the `parameter_values` arg.

In [None]:
result = api_client.create_run_from_job_spec(
          job_spec_path='automl_tfdv_pipeline_spec.json',
          name = DISPLAY_NAME,
          parameter_values={'gcp_project_id': '{}'.format(PROJECT_ID),
                           'dataset_display_name': DISPLAY_NAME,
                            'endpoint_display_name': DISPLAY_NAME,
                            'training_display_name': DISPLAY_NAME,
                            'thresholds_dict_str': '{"meanAbsoluteError": 470}',
                            'use_dataflow': 'true',
                            'data_dir': DATA_DIR, 'bigquery_uri': BIGQUERY_URI
                           })

Visit the running pipeline job in the Cloud Console by clicking the link above. As it runs, you should see a graph like the following.  

<a href="https://storage.googleapis.com/amy-jo/images/automl/ucaip-automl-tables-tfdv.png" target="_blank"><img src="https://storage.googleapis.com/amy-jo/images/automl/ucaip-automl-tables-tfdv.png" width="90%"/></a>

You can view and manage information about your dataset, model, and endpoint in the [Cloud Console](https://console.cloud.google.com/ai/platform/models) as well.


### How to automatically rerun this pipeline in the presence of new data?

[This MP example notebook](https://colab.research.google.com/drive/1njIzO3XIEgpaMe2CV2xUl_rAj3zId2JF) shows how to use GCF to support event-triggered Managed Pipelines. 

(**TODO**): add a version of that example which shows use of TFDV on Managed Pipelines. 

## (TODO) Using your deployed model for prediction

...

In [None]:
from google.cloud import aiplatform

def predict_custom_model_sample(endpoint: str, instance: dict, parameters_dict: dict):
    client_options = dict(api_endpoint="us-central1-prediction-aiplatform.googleapis.com")
    client = aiplatform.gapic.PredictionServiceClient(client_options=client_options)

    from google.protobuf import json_format
    from google.protobuf.struct_pb2 import Value

    # The format of the parameters must be consistent with what the model expects.
    parameters = json_format.ParseDict(parameters_dict, Value())

    # The format of the instances must be consistent with what the model expects.
    instances_list = [instance]
    instances = [json_format.ParseDict(s, Value()) for s in instances_list]
    response = client.predict(
        endpoint=endpoint, instances=instances, parameters=parameters
    )

    print("response")
    print(" deployed_model_id:", response.deployed_model_id)
    predictions = response.predictions
    print("predictions")
    for prediction in predictions:
        print(" prediction:", dict(prediction))

In [None]:
endpoint_path = "projects/467744782358/locations/us-central1/endpoints/6770352799193497600"  # endpoint from the author's testing; replace with your own endpoint path
instance1 =  {
      "bike_id": "5373",
      "day_of_week": "3",
      "end_latitude": 51.52059681,
      "end_longitude": -0.116688468,
      "end_station_id": "68",
      "euclidean": 3589.5146210024977,
      "loc_cross": "POINT(-0.07 51.52)POINT(-0.12 51.52)",
      "max": 44.6,
      "min": 34.0,
      "prcp": 0,
      "ts": "1480407420",
      "start_latitude": 51.52388,
      "start_longitude": -0.065076,
      "start_station_id": "445",
      "temp": 38.2,
      "dewp": 28.6
    }

predict_custom_model_sample(
    endpoint_path,
    instance1, {}
)

-----------------------------
Copyright 2021 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.