##### Copyright &copy; 2020 Google Inc.

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>
<hr/>

# Managed Pipelines EAP: Create and run a pipeline

## Introduction

[AI Platform Pipelines - Managed (Managed Pipelines)](https://docs.google.com/document/d/1FAyZhXRmZwJ7oCjRZZmzRG-ERYxyZyUQikrjR28Ev4E/edit?ts=5ec30a40#) makes it easier for you to run your ML pipelines in a scalable and cost-effective way, while offering you ‘no lock-in’ flexibility. You build your pipelines in Python using [TensorFlow Extended (TFX)](https://www.tensorflow.org/tfx), and then execute them serverlessly on Google Cloud. You don’t have to manage scaling, and you pay only for what you use. (You can also take the same TFX pipelines and run them using Kubeflow Pipelines.)

This notebook shows an example of how to use AI Platform Pipelines.   
The notebook is designed to run on AI Platform Notebooks. If you want to run this notebook in your own development environment, you will need to do a bit more setup first.  See [these instructions](<https://docs.google.com/document/d/1FAyZhXRmZwJ7oCjRZZmzRG-ERYxyZyUQikrjR28Ev4E/edit?ts=5ec30a40#heading=h.pyk4nfqsszzz>).  

### About the dataset and ML Task

You will build a pipeline using the [Chicago Taxi Trips public dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew).  The task is to train a model that predicts whether the tip was at least 20% of the fare.
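
As a rough illustration of the prediction target, the label can be thought of as a boolean derived from two columns of the dataset. The sketch below is an assumption about how such a label could be computed in plain Python. The function name `is_big_tip` and the handling of non-positive fares are hypothetical, and the pipeline's own module file may derive the label differently.

```python
def is_big_tip(fare: float, tips: float, threshold: float = 0.2) -> bool:
    """Return True when the tip is at least `threshold` times the fare."""
    # Assumption: trips with a non-positive fare are labeled negative,
    # since the tip-to-fare ratio is not meaningful for them.
    if fare <= 0:
        return False
    return tips >= threshold * fare


print(is_big_tip(fare=10.0, tips=2.5))  # tip is 25% of the fare
print(is_big_tip(fare=10.0, tips=1.0))  # tip is 10% of the fare
```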

## Step 1: Follow the 'before you begin' steps in the Managed Pipelines User Guide

Before proceeding, make sure that you've followed all the steps in the ["Before you Begin" section](https://docs.google.com/document/d/1FAyZhXRmZwJ7oCjRZZmzRG-ERYxyZyUQikrjR28Ev4E/edit?ts=5ec30a40#heading=h.65kbhyyf93x0) of the Managed Pipelines User Guide.  In this notebook, you'll need the API key that you created during those steps.

## Step 2: Set up your environment

First, ensure that Python 3 is being used.

In [None]:
import sys
sys.version
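
If you would rather fail fast than inspect the version string by eye, a check along these lines could be used. This is a minimal sketch; the helper name `require_python3` is hypothetical.

```python
import sys


def require_python3(version_info=sys.version_info) -> None:
    """Raise a RuntimeError if the interpreter's major version is below 3."""
    if version_info[0] < 3:
        raise RuntimeError(
            'Python 3 is required, found {}.{}'.format(
                version_info[0], version_info[1]))


# Uses the running interpreter's version by default.
require_python3()
```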

### Install the TFX SDK

Next, we'll upgrade pip and install the TFX SDK.

In [None]:
SDK_LOCATION = 'gs://caip-pipelines-sdk/releases/20200727/tfx-0.22.0.caip20200727-py3-none-any.whl'

In [None]:
%%capture
!pip install pip --upgrade
!gsutil cp {SDK_LOCATION} /tmp/tfx-0.22.0.caip20200727-py3-none-any.whl && pip install --no-cache-dir /tmp/tfx-0.22.0.caip20200727-py3-none-any.whl

# Automatically restart kernel after installs
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
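
The install cell spells out the wheel filename twice. One way to keep the two copies from drifting apart would be to derive the local path from `SDK_LOCATION`. The sketch below only builds the strings; `local_wheel_path` is a hypothetical helper, and `/tmp` is the same download directory used above.

```python
import posixpath

SDK_LOCATION = 'gs://caip-pipelines-sdk/releases/20200727/tfx-0.22.0.caip20200727-py3-none-any.whl'


def local_wheel_path(sdk_location: str, download_dir: str = '/tmp') -> str:
    """Return the local path that a wheel at `sdk_location` would be copied to."""
    # posixpath is used because GCS URLs always use '/' as the separator.
    return posixpath.join(download_dir, posixpath.basename(sdk_location))


LOCAL_WHEEL = local_wheel_path(SDK_LOCATION)
print(LOCAL_WHEEL)
```

In the notebook, the shell lines could then refer to `{LOCAL_WHEEL}` in place of the hard-coded path.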

Ensure that you can import TFX and that its version is >= 0.22.

In [None]:
# Check version
import tfx
tfx.__version__
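
Because the SDK build carries a suffix such as `caip20200727`, comparing `tfx.__version__` as a plain string can be misleading. The following sketch compares only the leading numeric components. `version_at_least` is a hypothetical helper, not part of TFX.

```python
import re
from typing import Tuple


def version_at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """Compare the leading numeric components of `version` with `minimum`."""
    parts = []
    for piece in version.split('.'):
        if re.fullmatch(r'[0-9]+', piece) is None:
            break  # Stop at the first non-numeric piece, e.g. 'caip20200727'.
        parts.append(int(piece))
    return tuple(parts) >= tuple(minimum)


print(version_at_least('0.22.0.caip20200727', (0, 22)))
print(version_at_least('0.21.4', (0, 22)))
```

In the notebook, this would be called as `version_at_least(tfx.__version__, (0, 22))`.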

### Identify or Create a GCS bucket to use for your pipeline

Below, you will need to specify a Google Cloud Storage (GCS) bucket for the pipeline runs to use.  If you do not already have one that you want to use, you can [create a new bucket](https://cloud.google.com/storage/docs/creating-buckets).

### Set up variables

Let's set up some variables used to customize the pipelines below. **Before you execute the following cell, make the indicated 'Change this' edits**.

In [None]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin
    
USER = 'YOUR_USERNAME'  # Change this to your username.
BUCKET_NAME = 'YOUR_GCS_BUCKET'  # Change this to your GCS bucket name.  Do not include the `gs://`

# It is not necessary to append your username to the pipeline root, 
# but this may be useful if multiple people are using the same project.
PIPELINE_ROOT = 'gs://{}/pipeline_root/{}'.format(BUCKET_NAME, USER)
PROJECT_ID = 'YOUR_PROJECT_ID' # Change this to your project id
BASE_IMAGE = 'gcr.io/caip-pipelines-assets/tfx:0.22.0.caip20200727'

API_KEY = 'YOUR_API_KEY'  # Change this to the API key that you created during initial setup
# ENDPOINT = 'alpha-ml.googleapis.com'  # this is the default during EAP

PIPELINE_ROOT
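
A common slip in the cell above is leaving a placeholder in place or including the `gs://` prefix in `BUCKET_NAME`. A small guard such as the following could catch both before any pipeline is submitted. It is a sketch only: `build_pipeline_root` is a hypothetical helper, and the placeholder check assumes the `YOUR_` prefix used in this notebook.

```python
def build_pipeline_root(bucket_name: str, user: str) -> str:
    """Return gs://<bucket>/pipeline_root/<user>, rejecting obvious mistakes."""
    for label, value in (('BUCKET_NAME', bucket_name), ('USER', user)):
        if not value or value.startswith('YOUR_'):
            raise ValueError('{} still holds a placeholder value.'.format(label))
    if bucket_name.startswith('gs://'):
        raise ValueError('BUCKET_NAME must not include the gs:// prefix.')
    return 'gs://{}/pipeline_root/{}'.format(bucket_name, user)


# 'example-bucket' and 'alice' are placeholder values for illustration.
print(build_pipeline_root('example-bucket', 'alice'))
```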

## Step 3: Run the 'Chicago Taxi' Pipeline

In this section, we'll run the canonical Chicago Taxi Pipeline.

We'll first do some imports. You can ignore the `RuntimeParameter` warning.

In [None]:
from typing import Any, Dict, List, Optional, Text

import os
import tensorflow_model_analysis as tfma

from tfx.extensions.google_cloud_big_query.example_gen.component import BigQueryExampleGen
from tfx.components import CsvExampleGen
from tfx.components import Evaluator
from tfx.components import ExampleValidator
from tfx.components import InfraValidator
from tfx.components import Pusher
from tfx.components import ResolverNode
from tfx.components import SchemaGen
from tfx.components import StatisticsGen
from tfx.components import Trainer
from tfx.components import Transform
from tfx.dsl.experimental import latest_artifacts_resolver
from tfx.orchestration import pipeline as tfx_pipeline
from tfx.orchestration.ai_platform_pipelines import ai_platform_pipelines_dag_runner
from tfx.proto import pusher_pb2
from tfx.proto import trainer_pb2
from tfx.types import standard_artifacts
from tfx.utils import dsl_utils
from tfx.types import channel

Next, we'll set some variables to define our data sources.

We're defining both a [BigQuery](https://cloud.google.com/bigquery/docs/) query and the path to a folder of CSV data.  Below, we'll show examples of how to use each.

In [None]:
# Define the query used for BigQueryExampleGen.
QUERY = """
        SELECT
          pickup_community_area,
          fare,
          EXTRACT(MONTH FROM trip_start_timestamp) AS trip_start_month,
          EXTRACT(HOUR FROM trip_start_timestamp) AS trip_start_hour,
          EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS trip_start_day,
          UNIX_SECONDS(trip_start_timestamp) AS trip_start_timestamp,
          pickup_latitude,
          pickup_longitude,
          dropoff_latitude,
          dropoff_longitude,
          trip_miles,
          pickup_census_tract,
          dropoff_census_tract,
          payment_type,
          company,
          trip_seconds,
          dropoff_community_area,
          tips
        FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
        WHERE (ABS(FARM_FINGERPRINT(unique_key)) / 0x7FFFFFFFFFFFFFFF)
          < 0.000001"""

# Data location for the CsvExampleGen. The content of the data is equivalent
# to the query above.
CSV_INPUT_PATH = 'gs://ml-pipeline/sample-data/chicago-taxi/data'
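
The `WHERE` clause in the query keeps a row only when a hash of its `unique_key`, scaled into the range [0, 1], falls below a small fraction, so the same rows are selected on every run. The sketch below illustrates that idea in plain Python. It substitutes SHA-256 for BigQuery's `FARM_FINGERPRINT`, so it will not select the same rows as the query; the function name and the sample keys are hypothetical.

```python
import hashlib


def keep_row(unique_key: str, fraction: float) -> bool:
    """Deterministically keep roughly `fraction` of keys, based on a hash."""
    digest = hashlib.sha256(unique_key.encode('utf-8')).digest()
    # Interpret the first 8 bytes as an unsigned integer and scale to [0, 1].
    bucket = int.from_bytes(digest[:8], 'big') / 2**64
    return bucket < fraction


keys = ['trip-{}'.format(i) for i in range(100000)]
sample = [k for k in keys if keep_row(k, 0.01)]
print(len(sample))  # roughly 1% of the keys, identical on every run
```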

Now we're ready to build and run a TFX pipeline. If you look at the helper function below, you'll notice that it's using the `BigQueryExampleGen` component if `query` is defined.  This means that we'll get our data from BigQuery for this first pipeline run.

In [None]:
# Create a helper function to construct a TFX pipeline.
def create_tfx_pipeline(
    query: Optional[Text] = None,
    input_path: Optional[Text] = None,
):
  """Creates an end-to-end Chicago Taxi pipeline in TFX."""
  if bool(query) == bool(input_path):
    raise ValueError('Exactly one of query or input_path is expected.')

  if query:
    example_gen = BigQueryExampleGen(query=query)
  else:
    example_gen = CsvExampleGen(input=dsl_utils.external_input(input_path))

  beam_pipeline_args = [
      # Uncomment to use Dataflow.
      # '--runner=DataflowRunner',
      # '--experiments=shuffle_mode=auto',
      # '--region=us-central1',
      # '--disk_size_gb=100',
      '--temp_location=' + os.path.join(PIPELINE_ROOT, 'dataflow', 'temp'),
      '--project={}'.format(PROJECT_ID)  # Always needed for BigQueryExampleGen.
  ]

  # Use a module file built into the TFX image to make sure things are in sync.
  module_file = '/tfx-src/tfx/examples/chicago_taxi_pipeline/taxi_utils.py'

  statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
  schema_gen = SchemaGen(
      statistics=statistics_gen.outputs['statistics'],
      infer_feature_shape=False)
  example_validator = ExampleValidator(
      statistics=statistics_gen.outputs['statistics'],
      schema=schema_gen.outputs['schema'])
  transform = Transform(
      examples=example_gen.outputs['examples'],
      schema=schema_gen.outputs['schema'],
      module_file=module_file)

  trainer = Trainer(
      transformed_examples=transform.outputs['transformed_examples'],
      schema=schema_gen.outputs['schema'],
      transform_graph=transform.outputs['transform_graph'],
      train_args=trainer_pb2.TrainArgs(num_steps=10),
      eval_args=trainer_pb2.EvalArgs(num_steps=5),
      module_file=module_file,
  )

  # Set the TFMA config for Model Evaluation and Validation.
  eval_config = tfma.EvalConfig(
      model_specs=[tfma.ModelSpec(signature_name='eval')],
      metrics_specs=[
          tfma.MetricsSpec(
              metrics=[tfma.MetricConfig(class_name='ExampleCount')],
              thresholds={
                  'binary_accuracy':
                      tfma.MetricThreshold(
                          value_threshold=tfma.GenericValueThreshold(
                              lower_bound={'value': 0.5}),
                          change_threshold=tfma.GenericChangeThreshold(
                              direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                              absolute={'value': -1e-10}))
              })
      ],
      slicing_specs=[
          tfma.SlicingSpec(),
          tfma.SlicingSpec(feature_keys=['trip_start_hour'])
      ])

  evaluator = Evaluator(
      examples=example_gen.outputs['examples'],
      model=trainer.outputs['model'],
      eval_config=eval_config)

  pusher = Pusher(
      model=trainer.outputs['model'],
      model_blessing=evaluator.outputs['blessing'],
      push_destination=pusher_pb2.PushDestination(
          filesystem=pusher_pb2.PushDestination.Filesystem(
              base_directory=os.path.join(PIPELINE_ROOT, 'model_serving'))))

  components=[
      example_gen, statistics_gen, schema_gen, example_validator, transform,
      trainer, evaluator, pusher
  ]

  return tfx_pipeline.Pipeline(
      pipeline_name='taxi-pipeline-{}'.format(USER),
      pipeline_root=PIPELINE_ROOT,
      enable_cache=True,
      components=components,
      beam_pipeline_args=beam_pipeline_args
  )
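
Rather than commenting lines in and out, the Beam arguments could be assembled by a small function with a switch for Dataflow. The sketch below reuses the flags listed in the helper above. The function name `build_beam_args` and its `use_dataflow` parameter are hypothetical, and the example project and bucket are placeholders.

```python
import posixpath
from typing import List


def build_beam_args(project_id: str, pipeline_root: str,
                    use_dataflow: bool = False) -> List[str]:
    """Return Beam pipeline arguments, optionally targeting Dataflow."""
    args = [
        '--temp_location=' + posixpath.join(pipeline_root, 'dataflow', 'temp'),
        '--project={}'.format(project_id),  # Needed by BigQueryExampleGen.
    ]
    if use_dataflow:
        # The same flags that are commented out in the helper above.
        args = [
            '--runner=DataflowRunner',
            '--experiments=shuffle_mode=auto',
            '--region=us-central1',
            '--disk_size_gb=100',
        ] + args
    return args


print(build_beam_args('example-project', 'gs://example-bucket/pipeline_root/alice'))
```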

We'll call the helper function to create the pipeline:

In [None]:
bigquery_taxi_pipeline = create_tfx_pipeline(query=QUERY)
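
The helper accepts exactly one of `query` and `input_path`; passing both, or neither, raises a `ValueError`. The check relies on comparing the truthiness of the two arguments, which the following stand-alone sketch isolates. `pick_source` is a hypothetical function written for illustration and does not construct any TFX components.

```python
from typing import Optional


def pick_source(query: Optional[str] = None,
                input_path: Optional[str] = None) -> str:
    """Return which example source would be used, mirroring the helper's check."""
    # bool(a) == bool(b) is True when both are set or both are empty,
    # which are exactly the two invalid cases.
    if bool(query) == bool(input_path):
        raise ValueError('Exactly one of query or input_path is expected.')
    return 'bigquery' if query else 'csv'


print(pick_source(query='SELECT 1'))
print(pick_source(input_path='gs://example-bucket/data'))
```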

### Step 3.1: Run the pipeline using BigQuery-based example generation

Now we're ready to run the pipeline! As you can see below, we're configuring the runner to use AI Platform Pipelines.

In [None]:
runner = ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunner(
    config=ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunnerConfig(
        project_id=PROJECT_ID,
        display_name='big-query-taxi-pipeline-{}'.format(USER),
        default_image=BASE_IMAGE))

In [None]:
runner.run(bigquery_taxi_pipeline, api_key=API_KEY)
# If you want to inspect the pipeline proto, run the following and look at the file contents.
# (`config` stands for the same AIPlatformPipelinesDagRunnerConfig object that is passed to the runner above.)
# runner = ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunner(config=config, output_filename='pipeline.json')
# runner.compile(bigquery_taxi_pipeline)

See the Pipeline job [here](https://console.cloud.google.com/ai-platform/pipelines/runs).

See the individual AI Platform (formerly Cloud ML Engine, CMLE) jobs [here](https://console.cloud.google.com/ai-platform/jobs).  This is where you can monitor the details of each pipeline component's execution.

### Step 3.2: Run the pipeline using file-based example generation

Start another run that uses file-based example generation instead of BigQuery.

In [None]:
file_based_example_gen_taxi_pipeline = create_tfx_pipeline(input_path=CSV_INPUT_PATH)

runner = ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunner(
    config=ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunnerConfig(
        project_id=PROJECT_ID,
        display_name='fbeg-taxi-pipeline-{}'.format(USER),
        default_image=BASE_IMAGE))

In [None]:
runner.run(file_based_example_gen_taxi_pipeline, api_key=API_KEY)

## Step 4: Explore Caching

In Step 3, the pipelines were run with caching enabled. You can see that the helper function above sets `enable_cache=True` when creating the Pipeline object.  

Let's run the Step 3.1 pipeline again. **Wait until the first job is done** (as confirmed in the Cloud Console UI) before running the next cell. You should see the run below complete more quickly in the Console.

In [None]:
# Run this after the job from Step 3.1 has finished.
runner = ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunner(
    config=ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunnerConfig(
        project_id=PROJECT_ID,
        display_name='big-query-taxi-pipeline-{}'.format(USER),
        default_image=BASE_IMAGE))

runner.run(bigquery_taxi_pipeline, api_key=API_KEY)

If you like, disable the cache and run it again. This time, it should re-run all steps:

In [None]:
bigquery_taxi_pipeline.enable_cache = False
runner.run(bigquery_taxi_pipeline, api_key=API_KEY)

> Note: The `CsvExampleGen` component used for the pipeline in Step 3.2 does not support caching as of this writing, but will soon.
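
Conceptually, caching of this kind means that a step is skipped when an earlier execution with the same inputs and configuration already produced its outputs. The toy sketch below illustrates the idea with an in-memory dictionary. It is not how Managed Pipelines implements caching, and every name in it is hypothetical.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class StepCache:
    """Toy cache keyed by a step's name and its JSON-serializable inputs."""

    def __init__(self, enable_cache: bool = True):
        self.enable_cache = enable_cache
        self.executions = 0  # Counts how many times a step actually ran.
        self._results: Dict[str, Any] = {}

    def run(self, name: str, inputs: Dict[str, Any],
            fn: Callable[[Dict[str, Any]], Any]) -> Any:
        payload = json.dumps({'name': name, 'inputs': inputs}, sort_keys=True)
        key = hashlib.sha256(payload.encode('utf-8')).hexdigest()
        if self.enable_cache and key in self._results:
            return self._results[key]  # Cache hit: the step is not re-run.
        self.executions += 1
        self._results[key] = fn(inputs)
        return self._results[key]


cache = StepCache(enable_cache=True)
for _ in range(2):
    cache.run('statistics_gen', {'rows': 100}, lambda inputs: inputs['rows'] * 2)
print(cache.executions)  # 1: the second call was served from the cache
```

Constructing the cache with `enable_cache=False` makes every call execute the step again, which mirrors the behavior described for the final run above.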

## Cleanup

If you like, you can do some cleanup to avoid storage costs.

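To remove the files from your GCS bucket, run the following cell. The `**` wildcard matches every object in the bucket, so use it only if the bucket holds nothing else that you want to keep: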

In [None]:
!gsutil rm 'gs://{BUCKET_NAME}/**'
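
The command above empties the whole bucket. If the bucket also holds unrelated data, a narrower pattern limited to this notebook's pipeline root may be preferable. The sketch below only builds the URL. `pipeline_root_pattern` is a hypothetical helper, and it assumes the `gs://<bucket>/pipeline_root/<user>` layout defined earlier in the notebook.

```python
def pipeline_root_pattern(bucket_name: str, user: str) -> str:
    """Return a gsutil wildcard matching every object under the pipeline root."""
    if not bucket_name or not user:
        raise ValueError('bucket_name and user must both be non-empty.')
    return 'gs://{}/pipeline_root/{}/**'.format(bucket_name, user)


# 'example-bucket' and 'alice' are placeholder values for illustration.
CLEANUP_PATTERN = pipeline_root_pattern('example-bucket', 'alice')
print(CLEANUP_PATTERN)
```

In the notebook, the resulting string could be passed to the same `gsutil rm` command in place of the bucket-wide pattern.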

You can remove your GCR container images by visiting the [Container Registry](https://console.cloud.google.com/gcr/) panel in the Cloud Console.  Click on an image name to list and remove any of its versions.

## Summary

This notebook showed examples of defining TFX pipelines using prebuilt components, and running them on AI Platform Managed Pipelines.

Next, explore the notebooks that show how to use custom functions and containers, and how to run a TFX Templates pipeline on Managed Pipelines. See the EAP guide for the links.