# 05 - Continuous Training

After the pipeline definition is tested, compiled, and uploaded to Cloud Storage, the pipeline is executed in response to a trigger. We use [Cloud Functions](https://cloud.google.com/functions) and [Cloud Pub/Sub](https://cloud.google.com/pubsub) as the triggering mechanism, and the trigger can be scheduled with [Cloud Scheduler](https://cloud.google.com/scheduler). The trigger source sends a message to a Cloud Pub/Sub topic that the Cloud Function listens to; the function then submits the pipeline to Vertex AI Managed Pipelines for execution.

This notebook covers the following steps:
1. Create the Cloud Pub/Sub topic.
2. Deploy the Cloud Function.
3. Test triggering a pipeline.

## Installation

Install the latest version of the Vertex SDK.

In [None]:
import sys
import os


# Google Cloud Notebook
if os.path.exists("/opt/deeplearning/metadata/env_version"):
    USER_FLAG = '--user'
else:
    USER_FLAG = ''

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG

Install the latest GA version of the *google-cloud-storage* library as well.

In [None]:
! pip3 install -U google-cloud-storage $USER_FLAG

Install the deep learning dependencies.

In [None]:
! pip3 install -U tfx==0.30.0 $USER_FLAG
! pip3 install -r requirements.txt $USER_FLAG

### Restart the kernel

Once you've installed the Vertex SDK, *google-cloud-storage*, and the deep learning dependencies, restart the notebook kernel so it can find the packages.

In [None]:
if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Setup

In [None]:
import json
import os
import logging
import tensorflow as tf
import tfx

logging.getLogger().setLevel(logging.INFO)

print("TensorFlow version:", tf.__version__)
print("TFX version:", tfx.__version__)

### Set up your Google Cloud project

Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands.

In [None]:
PROJECT_ID = "[your-project-id]"  #@param {type:"string"}

In [None]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = !gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)
    
! gcloud config set project $PROJECT_ID

#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. The regions below are supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services. For the latest support per region, see the [Vertex locations documentation](https://cloud.google.com/ai-platform-unified/docs/general/locations).

In [None]:
REGION = 'us-central1'  #@param {type: "string"}

In [None]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")

### Set up a bucket for continuous training

In [None]:
BUCKET_NAME = "gs://[your-bucket-name]"  #@param {type:"string"}

In [None]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "gs://[your-bucket-name]":
    BUCKET_NAME = "gs://" + PROJECT_ID + "_vertex-" + TIMESTAMP

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION $BUCKET_NAME
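
The default bucket name is derived from the project ID and a timestamp, so bucket creation can fail if the project ID contains characters that are not allowed in bucket names. The sketch below is a simplified pre-check, not the complete Cloud Storage naming specification; `is_plausible_bucket_uri` and `BUCKET_NAME_PATTERN` are hypothetical names introduced for illustration.

```python
import re

# Simplified rule set (an assumption, not the complete Cloud Storage spec):
# 3-63 characters; lowercase letters, digits, dashes, underscores, and dots;
# first and last character must be a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r'[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]')


def is_plausible_bucket_uri(uri):
    """Return True if `uri` looks like gs://<plausible-bucket-name>."""
    prefix = 'gs://'
    if not uri.startswith(prefix):
        return False
    return BUCKET_NAME_PATTERN.fullmatch(uri[len(prefix):]) is not None


# A name in the same shape as the default built above passes the check.
print(is_plausible_bucket_uri('gs://my-project_vertex-20210601120000'))  # True
# The untouched placeholder does not.
print(is_plausible_bucket_uri('gs://[your-bucket-name]'))  # False
```

Running a check like this before `gsutil mb` gives a clearer error than a failed bucket creation, at the cost of possibly rejecting a few names that Cloud Storage would accept.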

In [None]:
VERSION = 'v01'
DATASET_DISPLAY_NAME = 'chicago_taxi_tips'
MODEL_DISPLAY_NAME = f'{DATASET_DISPLAY_NAME}_classifier_{VERSION}'
PIPELINE_NAME = f'{MODEL_DISPLAY_NAME}-train-pipeline'

PIPELINES_STORE = f'{BUCKET_NAME}/vertex_demo/compiled_pipelines/'
GCS_PIPELINE_FILE_LOCATION = os.path.join(PIPELINES_STORE, f'{PIPELINE_NAME}.json')
PUBSUB_TOPIC = f'trigger-{PIPELINE_NAME}'
CLOUD_FUNCTION_NAME = f'trigger-{PIPELINE_NAME}-fn'

## (Optional) Create a dummy pipeline for testing

In [None]:
DUMMY_PIPELINE_ROOT = f"{BUCKET_NAME}/vertex_demo/dummy/pipelines"
PIPELINE_NAME = 'dummy-pipeline'
PARAMETER_NAMES = 'file_uri'

### Implement the pipeline

In [None]:
from tfx.dsl.components.common.importer import Importer
from tfx.types.experimental.simple_artifacts import File
from tfx.orchestration import data_types


def create_dummy_pipeline(
    pipeline_root,
    file_uri
):
    importer = Importer(
        source_uri=file_uri,
        artifact_type=File
    ).with_id("DummyImporterStep")
    
    return tfx.orchestration.pipeline.Pipeline(
        pipeline_name=PIPELINE_NAME,
        pipeline_root=pipeline_root,
        components=[importer]
    )

### Compile the pipeline

In [None]:
from tfx.orchestration.kubeflow.v2 import kubeflow_v2_dag_runner

dummy_pipeline_definition_file = f'{PIPELINE_NAME}.json'

dummy_pipeline = create_dummy_pipeline(
    pipeline_root=DUMMY_PIPELINE_ROOT,
    file_uri=data_types.RuntimeParameter(
        name='file_uri',
        default='path/to/default/dummy.txt',
        ptype=str,
    )
)

runner = kubeflow_v2_dag_runner.KubeflowV2DagRunner(
    config=kubeflow_v2_dag_runner.KubeflowV2DagRunnerConfig(),
    output_filename=dummy_pipeline_definition_file
)
    
runner.run(dummy_pipeline, write_out=True)

### Upload pipeline to Cloud Storage

In [None]:
GCS_PIPELINE_FILE_LOCATION = f'{BUCKET_NAME}/vertex_demo/compiled_pipelines/{PIPELINE_NAME}.json'
! gsutil cp {PIPELINE_NAME}.json {GCS_PIPELINE_FILE_LOCATION}

### Trigger the pipeline on Vertex AI Managed Pipelines

In [None]:
from src.pipeline_triggering import main
import base64

os.environ['PROJECT'] = PROJECT_ID
os.environ['REGION'] = REGION
os.environ['GCS_PIPELINE_FILE_LOCATION'] = GCS_PIPELINE_FILE_LOCATION
os.environ['PARAMETER_NAMES'] = PARAMETER_NAMES

parameters = {
    'file_uri': 'path/to/trigger/trigger/dummy.txt',
    'unused_param': 0}

message = base64.b64encode(json.dumps(parameters).encode())
main.trigger_pipeline(
    event={'data': message},
    context=None
)
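
The call above imitates how Pub/Sub hands a message to a background Cloud Function: the JSON payload arrives base64-encoded under `event['data']`. The source of `src/pipeline_triggering/main.py` is not shown in this notebook, so the following is only a sketch of how such a handler might decode the message and keep the parameters named in `PARAMETER_NAMES`. The helper `extract_parameters` is hypothetical, and the filtering behavior is an assumption.

```python
import base64
import json


def extract_parameters(event, parameter_names):
    """Decode a Pub/Sub-style event and keep only the expected parameters.

    Hypothetical helper: `parameter_names` is a dash-separated string such as
    'num_epochs-batch_size', mirroring the PARAMETER_NAMES variable.
    """
    # Pub/Sub delivers the message body base64-encoded under the 'data' key.
    payload = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    expected = set(parameter_names.split('-'))
    # Drop keys the pipeline does not declare, e.g. 'unused_param'.
    return {name: value for name, value in payload.items() if name in expected}


# Build an event the same way the test cell above does.
example_message = base64.b64encode(
    json.dumps({'file_uri': 'path/to/dummy.txt', 'unused_param': 0}).encode()
)
filtered = extract_parameters({'data': example_message}, 'file_uri')
print(filtered)
```

Under this assumption, `unused_param` never reaches the pipeline, which is why it is safe to include in the test message.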

## 1. Create a Pub/Sub topic

In [None]:
! gcloud pubsub topics create {PUBSUB_TOPIC}

## 2. Deploy the Cloud Function

In [None]:
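# Dash-separated pipeline parameter names.
# Note: if you ran the optional dummy-pipeline section, GCS_PIPELINE_FILE_LOCATION
# now points to the dummy pipeline; re-run the cell that defines VERSION to reset it.
PARAMETER_NAMES='num_epochs-hidden_units-learning_rate-batch_size'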

ENV_VARS=f"""\
PROJECT={PROJECT_ID},\
REGION={REGION},\
GCS_PIPELINE_FILE_LOCATION={GCS_PIPELINE_FILE_LOCATION},\
PARAMETER_NAMES={PARAMETER_NAMES}
"""

! echo {ENV_VARS}
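
The `--update-env-vars` flag takes a comma-separated list of `KEY=VALUE` pairs, which is presumably why the parameter names are joined with dashes rather than commas. A minimal sketch that builds the same kind of string from a dictionary and rejects values containing a comma; `build_env_vars` is a hypothetical helper and the values are placeholders.

```python
def build_env_vars(env):
    """Join a dict into the KEY=VALUE,KEY=VALUE form passed to gcloud."""
    for key, value in env.items():
        if ',' in str(value):
            # A comma inside a value would be read as a pair separator.
            raise ValueError(f'{key} contains a comma: {value!r}')
    return ','.join(f'{key}={value}' for key, value in env.items())


example_env_vars = build_env_vars({
    'PROJECT': 'my-project',
    'REGION': 'us-central1',
    'PARAMETER_NAMES': 'num_epochs-batch_size',
})
print(example_env_vars)
```

Building the string from a dictionary also avoids the trailing newline that a triple-quoted f-string leaves at the end.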

In [None]:
! rm -rf src/pipeline_triggering/.ipynb_checkpoints

In [None]:
!gcloud functions deploy {CLOUD_FUNCTION_NAME} \
    --region={REGION} \
    --trigger-topic={PUBSUB_TOPIC} \
    --runtime=python37 \
    --source=src/pipeline_triggering \
    --entry-point=trigger_pipeline \
    --stage-bucket={BUCKET_NAME} \
    --update-env-vars={ENV_VARS}

## 3. Test triggering the pipeline

In [None]:
from google.cloud import pubsub

publish_client = pubsub.PublisherClient()
topic = f'projects/{PROJECT_ID}/topics/{PUBSUB_TOPIC}'
data = {
    'source_uri': 'pubsub/function/pipeline',
    'num_epochs': 4,
    'learning_rate': 0.001,
    'batch_size': 512,
    'hidden_units': '256,126'
}
message = json.dumps(data)

_ = publish_client.publish(topic, message.encode())
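
Before publishing, it can help to compare the payload keys with the parameter names the function was deployed with. A small sketch, assuming `PARAMETER_NAMES` keeps the dash-separated form used in the deployment step; `compare_payload` is a hypothetical helper.

```python
def compare_payload(payload, parameter_names):
    """Return (missing, extra) parameter names for a trigger payload."""
    expected = set(parameter_names.split('-'))
    provided = set(payload)
    return sorted(expected - provided), sorted(provided - expected)


example_payload = {
    'source_uri': 'pubsub/function/pipeline',
    'num_epochs': 4,
    'learning_rate': 0.001,
    'batch_size': 512,
    'hidden_units': '256,126',
}
missing, extra = compare_payload(
    example_payload, 'num_epochs-hidden_units-learning_rate-batch_size'
)
print('missing:', missing)
print('extra:', extra)
```

Here `source_uri` is reported as extra. Whether the deployed function ignores or rejects such keys depends on its implementation, which is not shown in this notebook.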