##### Copyright &copy; 2020 Google Inc.

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>
<hr/>

# Orchestrating BQML training and deployment with Managed Pipelines

This notebook demonstrates how to use custom Python function-based components together with TFX standard components. You will orchestrate the training and deployment of a BigQuery ML (BQML) logistic regression model in three steps:

1. BigQuery prepares the training data by executing an arbitrary SQL query and writing the results to a BigQuery table.
2. The training data table is used to train a BQML logistic regression model.
3. The model is deployed to AI Platform Prediction for online serving.

## Setup

### Upgrade BigQuery client

In [None]:
%pip install --upgrade --user google-cloud-core==1.3.0 google-cloud-bigquery==1.26.1

In [None]:
import IPython

IPython.Application.instance().kernel.do_shutdown(True)

### Import the required libraries and verify the TFX SDK version

In [None]:
import sys
import tensorflow as tf
import tensorflow_data_validation as tfdv
import tensorflow_model_analysis as tfma
import tfx

import logging
import google.cloud.bigquery  # Import the submodule explicitly so its __version__ is available below.

from typing import Optional, Text, List, Dict, Any

from ml_metadata.proto import metadata_store_pb2
from tfx.components.base import executor_spec
from tfx.components import Pusher
from tfx.extensions.google_cloud_ai_platform.pusher import executor as ai_platform_pusher_executor

print("Tensorflow Version:", tf.__version__)
print("TFX Version:", tfx.__version__)
print("TFDV Version:", tfdv.__version__)
print("TFMA Version:", tfma.VERSION_STRING)
print("BigQuery client:", google.cloud.bigquery.__version__)

In [None]:
%load_ext autoreload
%autoreload 2

### Update `PATH` with the location of the TFX SDK

In [None]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin
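
The same update can also be made from Python, which avoids appending the directory a second time when the cell is re-run. This is a minimal sketch that assumes the same `/home/jupyter/.local/bin` location; the helper name `append_to_path` is illustrative and not part of the notebook.

```python
import os


def append_to_path(path_value: str, new_dir: str) -> str:
    """Return path_value with new_dir appended, unless it is already listed."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)


# Assumed location of user-level installs, taken from the cell above.
LOCAL_BIN = '/home/jupyter/.local/bin'
os.environ['PATH'] = append_to_path(os.environ.get('PATH', ''), LOCAL_BIN)
```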

### Configure GCP environment settings


Modify the constants below to reflect your environment.

In [None]:
PROJECT_ID = 'mlops-dev-env'
REGION = 'us-central1'
BUCKET_NAME = 'mlops-dev-workspace'  # Change this to your GCS bucket name.  Do not include the `gs://`.
API_KEY =  '' # Change this to the API key that you created during initial setup
BASE_IMAGE = 'gcr.io/caip-pipelines-assets/tfx:latest'

### Create an example BigQuery dataset 

In [None]:
DATASET_LOCATION = 'US'
DATASET_ID = 'covertype_dataset'
TABLE_ID = 'covertype'
DATA_SOURCE = 'gs://workshop-datasets/covertype/small/dataset.csv'
SCHEMA = 'Elevation:INTEGER,\
Aspect:INTEGER,\
Slope:INTEGER,\
Horizontal_Distance_To_Hydrology:INTEGER,\
Vertical_Distance_To_Hydrology:INTEGER,\
Horizontal_Distance_To_Roadways:INTEGER,\
Hillshade_9am:INTEGER,\
Hillshade_Noon:INTEGER,\
Hillshade_3pm:INTEGER,\
Horizontal_Distance_To_Fire_Points:INTEGER,\
Wilderness_Area:STRING,\
Soil_Type:STRING,\
Cover_Type:INTEGER'

!bq --location=$DATASET_LOCATION --project_id=$PROJECT_ID mk --dataset $DATASET_ID

In [None]:
!bq --project_id=$PROJECT_ID --dataset_id=$DATASET_ID load \
--source_format=CSV \
--skip_leading_rows=1 \
--replace \
$TABLE_ID \
$DATA_SOURCE \
$SCHEMA
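
The `SCHEMA` value passed to `bq load` is a comma-separated list of `name:TYPE` pairs. Building it from a Python list can be easier to read and edit than backslash continuations. The sketch below is one way to do that; `COLUMNS` and `to_bq_schema` are illustrative names, not part of the notebook.

```python
# Column names and types copied from the SCHEMA string above.
COLUMNS = [
    ('Elevation', 'INTEGER'),
    ('Aspect', 'INTEGER'),
    ('Slope', 'INTEGER'),
    ('Horizontal_Distance_To_Hydrology', 'INTEGER'),
    ('Vertical_Distance_To_Hydrology', 'INTEGER'),
    ('Horizontal_Distance_To_Roadways', 'INTEGER'),
    ('Hillshade_9am', 'INTEGER'),
    ('Hillshade_Noon', 'INTEGER'),
    ('Hillshade_3pm', 'INTEGER'),
    ('Horizontal_Distance_To_Fire_Points', 'INTEGER'),
    ('Wilderness_Area', 'STRING'),
    ('Soil_Type', 'STRING'),
    ('Cover_Type', 'INTEGER'),
]


def to_bq_schema(columns):
    """Format (name, type) pairs as an inline `name:TYPE,...` schema string."""
    return ','.join(f'{name}:{bq_type}' for name, bq_type in columns)


SCHEMA = to_bq_schema(COLUMNS)
print(SCHEMA)
```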

## Create custom components

In this section, we will create a set of custom components that encapsulate calls to BigQuery and BigQuery ML.

### Create a data preprocessing component

In [None]:
%%writefile preprocess_data.py

import os
import logging
import uuid

from google.cloud import bigquery

from tfx.types.experimental.simple_artifacts import Dataset
from tfx.dsl.component.experimental.decorators import component
from tfx.dsl.component.experimental.annotations import OutputArtifact, Parameter

@component
def preprocess_data(
    project_id: Parameter[str],
    query: Parameter[str], 
    transformed_data: OutputArtifact[Dataset]):
    
    client = bigquery.Client(project=project_id)

    dataset_name = f'{project_id}.bqml_demo_{uuid.uuid4().hex}'
    table_name = f'{dataset_name}.{uuid.uuid4().hex}'
    
    dataset = bigquery.Dataset(dataset_name)
    client.create_dataset(dataset)

    job_config = bigquery.QueryJobConfig()
    job_config.create_disposition = bigquery.job.CreateDisposition.CREATE_IF_NEEDED
    job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
    job_config.destination = table_name

    logging.info('Starting data preprocessing')

    query_job = client.query(query, job_config=job_config)
    query_job.result()  # Wait for the job to complete

    logging.info(f'Completed data preprocessing. Output in {table_name}')
    
    # Write the location of the output table to metadata.  
    transformed_data.set_string_custom_property('output_dataset', dataset_name)
    transformed_data.set_string_custom_property('output_table', table_name)
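
Each run of the component writes to a fresh dataset and table, both named with random UUIDs so that separate runs do not overwrite each other. The naming step can be sketched on its own, without a BigQuery client; `make_output_names` is a hypothetical helper and `my-project` is a placeholder project ID.

```python
import uuid


def make_output_names(project_id: str):
    """Return (dataset_name, table_name) in the form used by preprocess_data."""
    dataset_name = f'{project_id}.bqml_demo_{uuid.uuid4().hex}'
    table_name = f'{dataset_name}.{uuid.uuid4().hex}'
    return dataset_name, table_name


dataset_name, table_name = make_output_names('my-project')  # placeholder project ID
print(dataset_name)
print(table_name)
```

The component never deletes these datasets, so repeated runs leave `bqml_demo_*` datasets behind that you may want to remove manually.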


### Create a BQML training component

In [None]:
%%writefile create_lr_model.py

import os
import logging

from google.cloud import bigquery

from tfx.types.experimental.simple_artifacts import Dataset
from tfx.types.experimental.simple_artifacts import Model as BQModel
from tfx.dsl.component.experimental.decorators import component
from tfx.dsl.component.experimental.annotations import InputArtifact, OutputArtifact, Parameter

@component
def create_lr_model(
    project_id: Parameter[str],
    model_name: Parameter[str],
    label_column: Parameter[str],
    transformed_data: InputArtifact[Dataset],
    model: OutputArtifact[BQModel]):
    
    dataset_name = transformed_data.get_string_custom_property('output_dataset')
    table_name = transformed_data.get_string_custom_property('output_table')
    model_name = f'{dataset_name}.{model_name}'
    
    query = f"""
        CREATE OR REPLACE MODEL
        `{model_name}`
        OPTIONS
          ( model_type='LOGISTIC_REG',
            auto_class_weights=TRUE,
            input_label_cols=['{label_column}']
          ) AS
        SELECT 
          *
        FROM
          `{table_name}`
    """
    
    client = bigquery.Client(project=project_id)

    logging.info(f'Starting training of the model: {model_name}')
    query_job = client.query(query)
    query_job.result()
    logging.info(f'Completed training of the model: {model_name}')
    
    # Write the fully qualified name of the trained model to metadata.
    model.set_string_custom_property('bq_model_name', model_name)
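
Because the training statement is assembled with an f-string, it can help to build it in a small pure function and inspect the SQL before sending it to BigQuery. This is a minimal sketch: `build_create_model_query` is a hypothetical helper, and the model and table names are placeholders.

```python
def build_create_model_query(model_name: str, table_name: str, label_column: str) -> str:
    """Return a BQML CREATE MODEL statement shaped like the one in create_lr_model."""
    return f"""
        CREATE OR REPLACE MODEL
        `{model_name}`
        OPTIONS
          ( model_type='LOGISTIC_REG',
            auto_class_weights=TRUE,
            input_label_cols=['{label_column}']
          ) AS
        SELECT
          *
        FROM
          `{table_name}`
    """


query = build_create_model_query(
    model_name='my-project.my_dataset.covertype_classifier',  # placeholder
    table_name='my-project.my_dataset.training_data',  # placeholder
    label_column='Cover_Type')
print(query)
```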
    


### Create a BQML model export component

In [None]:
%%writefile export_model.py

import os
import logging
import subprocess

from google.cloud import bigquery

from tfx.types.experimental.simple_artifacts import Dataset
from tfx.types.experimental.simple_artifacts import Model as BQModel
from tfx.types.standard_artifacts import Model
from tfx.dsl.component.experimental.decorators import component
from tfx.dsl.component.experimental.annotations import InputArtifact, OutputArtifact, Parameter


@component
def export_model(
    bq_model: InputArtifact[BQModel],
    model: OutputArtifact[Model]):
    
    bq_model_name = bq_model.get_string_custom_property('bq_model_name')
    gcs_path = '{}/serving_model_dir'.format(model.uri.rstrip('/'))
    
    client = bigquery.Client()
    bqml_model = bigquery.model.Model(bq_model_name)
    
    logging.info(f'Starting model extraction')
    
    extract_job = client.extract_table(bqml_model, gcs_path)
    extract_job.result() # Wait for results
    
    logging.info(f'Model extraction completed')
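
The export step writes the model under a `serving_model_dir` subdirectory of the artifact URI, and the custom `deploy_model` component in this notebook reads from the same subdirectory. Keeping the path logic in one place makes that agreement explicit. A small sketch, where `serving_model_path` is an illustrative helper name and the bucket path is a placeholder:

```python
def serving_model_path(model_uri: str) -> str:
    """Return the serving_model_dir location under a model artifact URI."""
    return '{}/serving_model_dir'.format(model_uri.rstrip('/'))


# A trailing slash on the artifact URI does not change the result.
print(serving_model_path('gs://my-bucket/pipeline_root/export_model/model/1/'))
print(serving_model_path('gs://my-bucket/pipeline_root/export_model/model/1'))
```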
   


### Create an AI Platform Prediction deploy component

This custom component is an alternative to the TFX `Pusher` component.

In [None]:
%%writefile deploy_model.py

import os
import logging
import uuid

import googleapiclient.discovery

from tfx.types.standard_artifacts import Model
from tfx.dsl.component.experimental.decorators import component
from tfx.dsl.component.experimental.annotations import InputArtifact, OutputArtifact, Parameter

@component
def deploy_model(
    project_id: Parameter[str],
    model_name: Parameter[str],
    runtime_version: Parameter[str],
    python_version: Parameter[str],
    framework: Parameter[str],
    model: InputArtifact[Model]):
    
    service = googleapiclient.discovery.build('ml', 'v1')
    version_name = f'v{uuid.uuid4().hex}'
   
    saved_model_path = '{}/serving_model_dir'.format(model.uri.rstrip('/'))
    
    project_path = f'projects/{project_id}'
    model_path = f'{project_path}/models/{model_name}'
    
    response = service.projects().models().list(parent=project_path).execute()
    if 'error' in response:
        raise RuntimeError(response['error'])
        

    # Use .get() so that a response without a 'models' key is treated as "no models".
    existing_models = [m['name'] for m in response.get('models', [])]
    if model_path not in existing_models:
        request_body={'name': model_name}
        response = service.projects().models().create(parent=project_path, body=request_body).execute()
        if 'error' in response:
            raise RuntimeError(response['error'])
        
    request_body = {
        "name": version_name,
        "deployment_uri": saved_model_path,
        "machine_type": "n1-standard-8",
        "runtime_version": runtime_version,
        "python_version": python_version,
        "framework": framework
    }
    
    logging.info(f'Starting model deployment')
    response = service.projects().models().versions().create(parent=model_path, body=request_body).execute()
    if 'error' in response:
        raise RuntimeError(response['error'])
    logging.info(f'Model deployed: {response}')
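
The component creates the AI Platform model only when it is not already listed for the project. That check can be exercised against hand-written response dictionaries, without calling the API. The sketch below assumes the list response is a dictionary with an optional `models` list of `{'name': ...}` entries, as the component code implies; it ignores pagination, and `model_exists` is a hypothetical helper.

```python
def model_exists(list_response: dict, model_path: str) -> bool:
    """Return True if model_path appears among the models in a list response."""
    models = list_response.get('models', [])
    return any(m.get('name') == model_path for m in models)


model_path = 'projects/my-project/models/CovertypeBQMLLocal'  # placeholder project

# A response body without a 'models' key is treated as "no models yet".
print(model_exists({}, model_path))
print(model_exists({'models': [{'name': model_path}]}, model_path))
```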


## Define the pipeline

In [None]:
import os

# Only required for local run.
from tfx.orchestration.metadata import sqlite_metadata_connection_config

from tfx.orchestration.pipeline import Pipeline
from tfx.orchestration.ai_platform_pipelines import ai_platform_pipelines_dag_runner

from preprocess_data import preprocess_data
from create_lr_model import create_lr_model
from export_model import export_model
from deploy_model import deploy_model


def bqml_pipeline(
    pipeline_name: Text, 
    pipeline_root: Text, 
    query: Text, 
    project_id: Text, 
    model_name: Text, 
    label_column: Text,
    metadata_connection_config: Optional[
        metadata_store_pb2.ConnectionConfig] = None,
    ai_platform_serving_args: Optional[Dict[Text, Any]] = None):
    
    components = []
    
    preprocess = preprocess_data(
        query=query, 
        project_id=project_id)
    components.append(preprocess)
    
    train = create_lr_model(
        transformed_data=preprocess.outputs['transformed_data'],
        project_id=project_id,
        model_name=model_name,
        label_column=label_column)
    components.append(train)
    
    export = export_model(
        bq_model=train.outputs['model']
    )
    components.append(export)
    

    if ai_platform_serving_args:
        deploy = Pusher(
            custom_executor_spec=executor_spec.ExecutorClassSpec(
                ai_platform_pusher_executor.Executor),
            model=export.outputs['model'],
            custom_config={'ai_platform_serving_args': ai_platform_serving_args})
        components.append(deploy)

# The alternative using a custom deploy_model component
#    if ai_platform_serving_args:
#        deploy = deploy_model(
#            project_id=project_id,
#            runtime_version=ai_platform_serving_args['runtimeVersion'],
#            python_version=ai_platform_serving_args['pythonVersion'],
#            framework=ai_platform_serving_args['framework'],
#            model_name=ai_platform_serving_args['model_name'],
#            model=export.outputs['model']
#        )
#        components.append(deploy)
    
    return Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        metadata_connection_config=metadata_connection_config,
        components=components
      )
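
The pipeline function never states an execution order explicitly; the order follows from how each component's outputs are passed as inputs to the next one. The idea can be illustrated in plain Python. This is not how TFX is implemented: the `DEPENDS_ON` mapping is written by hand from the wiring above, and the helper assumes the graph has no cycles.

```python
# Upstream dependencies, read off the `outputs[...]` wiring in bqml_pipeline.
DEPENDS_ON = {
    'preprocess_data': [],
    'create_lr_model': ['preprocess_data'],
    'export_model': ['create_lr_model'],
    'Pusher': ['export_model'],
}


def execution_order(depends_on):
    """Return component names ordered so each one follows its dependencies."""
    ordered, done = [], set()

    def visit(name):
        if name in done:
            return
        for upstream in depends_on[name]:
            visit(upstream)
        done.add(name)
        ordered.append(name)

    for name in sorted(depends_on):  # sorted() keeps the traversal deterministic
        visit(name)
    return ordered


print(execution_order(DEPENDS_ON))
```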

## Run the pipeline locally

In [None]:
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

query = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.covertype` LIMIT 1000'
label_column = 'Cover_Type'
model_name = 'covertype_classifier'
pipeline_root = 'gs://{}/pipeline_root/{}'.format(BUCKET_NAME, 'bqml-test2')
pipeline_name = 'bqml-pipeline'

metadata_connection_config=sqlite_metadata_connection_config('metadata.sqlite')

ai_platform_serving_args = {
      'project_id': PROJECT_ID,
      'model_name': 'CovertypeBQMLLocal',
      'runtimeVersion': '1.15',
      'pythonVersion': '3.7',
      'framework': 'TENSORFLOW'}

logging.getLogger().setLevel(logging.INFO)

BeamDagRunner().run(bqml_pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        query=query,
        project_id=PROJECT_ID,
        model_name=model_name,
        label_column=label_column,
        metadata_connection_config=metadata_connection_config,
        ai_platform_serving_args=ai_platform_serving_args))
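
The commented-out `deploy_model` alternative in `bqml_pipeline` reads `model_name`, `runtimeVersion`, `pythonVersion`, and `framework` from `ai_platform_serving_args`, so a missing key would surface as a `KeyError` while the pipeline is being constructed. A quick up-front check can be sketched as follows; `EXPECTED_KEYS` and `missing_serving_args` are illustrative names, and the project ID is a placeholder.

```python
# Keys read by the commented-out deploy_model wiring in bqml_pipeline.
EXPECTED_KEYS = ('model_name', 'runtimeVersion', 'pythonVersion', 'framework')


def missing_serving_args(serving_args: dict) -> list:
    """Return the expected keys that are absent from serving_args."""
    return [key for key in EXPECTED_KEYS if key not in serving_args]


serving_args = {
    'project_id': 'my-project',  # placeholder project ID
    'model_name': 'CovertypeBQMLLocal',
    'runtimeVersion': '1.15',
    'pythonVersion': '3.7',
    'framework': 'TENSORFLOW',
}
print(missing_serving_args(serving_args))
```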

### Check that the metadata was produced locally

In [None]:
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

connection_config = metadata_store_pb2.ConnectionConfig()
connection_config.sqlite.filename_uri = 'metadata.sqlite'
connection_config.sqlite.connection_mode = 3 # READWRITE_OPENCREATE
store = metadata_store.MetadataStore(connection_config)
store.get_artifacts()

## Run the pipeline in Managed Pipelines

### Package the components into a custom docker image

Next, let's package the components above into a container image.
In the future, it will be possible to do this with the TFX CLI. For now, we'll use a Dockerfile and Skaffold.

> Note: If you're running this notebook on AI Platform Notebooks, Docker is already installed. If you're running the notebook in a local development environment, you'll need to install Docker there. Confirm that you have [installed Skaffold](https://skaffold.dev/docs/install/) locally as well.

We start with the `skaffold.yaml` file. First, we define the tag template string that the file will use.

In [None]:
tag = 'demo'

SK_TEMPLATE = "{{{{.IMAGE_NAME}}}}:{}".format(tag)
print(SK_TEMPLATE)

Now we'll write out the Skaffold YAML file.

In [None]:
image_name = f'gcr.io/{PROJECT_ID}/caip-tfx-bqml'

skaffold_template = f"""
apiVersion: skaffold/v2beta3
kind: Config
metadata:
  name: my-pipeline
build:
  artifacts:
  - image: '{image_name}'
    context: .
    docker:
      dockerfile: Dockerfile
  tagPolicy:
    envTemplate:
      template: "{{SK_TEMPLATE}}"
"""
with open('skaffold.yaml', 'w') as f:
    f.write(skaffold_template.format(**globals()))
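
The cell above relies on two formatting passes: the f-string turns `{{SK_TEMPLATE}}` into `{SK_TEMPLATE}`, and `.format(**globals())` then substitutes the value, which contains the Go-template braces that Skaffold expects. One alternative that avoids brace escaping is `string.Template`, which uses `$name` placeholders. The sketch below renders the same content to a string rather than writing the file; the project ID is a placeholder and the variable names are illustrative.

```python
from string import Template

image_name = 'gcr.io/my-project/caip-tfx-bqml'  # placeholder project ID
tag = 'demo'

# $-placeholders leave the literal {{.IMAGE_NAME}} braces untouched.
skaffold_yaml = Template("""\
apiVersion: skaffold/v2beta3
kind: Config
metadata:
  name: my-pipeline
build:
  artifacts:
  - image: '$image_name'
    context: .
    docker:
      dockerfile: Dockerfile
  tagPolicy:
    envTemplate:
      template: "{{.IMAGE_NAME}}:$tag"
""").substitute(image_name=image_name, tag=tag)

print(skaffold_yaml)
```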

Next, we'll define the `Dockerfile`.

In [None]:
%%writefile Dockerfile
FROM gcr.io/caip-pipelines-assets/tfx:latest
RUN pip install --upgrade google-cloud-core==1.3.0 google-cloud-bigquery==1.26.1
WORKDIR /pipeline
COPY *.py ./
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"


In [None]:
!skaffold build

### Submit a run

In [None]:
query = f'SELECT * FROM `{PROJECT_ID}.{DATASET_ID}.covertype` LIMIT 1000'

label_column = 'Cover_Type'
model_name = 'covertype_classifier'
pipeline_name = 'bqml-pipeline-tests'
pipeline_root = 'gs://{}/pipeline_root/{}'.format(BUCKET_NAME, pipeline_name)
ai_platform_serving_args = {
      'project_id': PROJECT_ID,
      'model_name': 'CovertypeBQMLtest',
      'runtimeVersion': '1.15',
      'pythonVersion': '3.7',
      'framework': 'TENSORFLOW'}

pipeline = bqml_pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        query=query,
        project_id=PROJECT_ID,
        model_name=model_name,
        label_column=label_column,
        ai_platform_serving_args=ai_platform_serving_args)


In [None]:
config = ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunnerConfig(
    project_id=PROJECT_ID,
    display_name=pipeline_name,
    default_image=f'{image_name}:{tag}')

runner = ai_platform_pipelines_dag_runner.AIPlatformPipelinesDagRunner(config=config)
runner.run(pipeline, api_key=API_KEY)