# KubeFlow Pipeline DSL Static Type Checking

In this notebook, we will demo:

* Defining a KubeFlow pipeline with the Python DSL
* Compiling the pipeline with type checking

Because this sample focuses on DSL type checking, it uses components that cannot actually run in the system but that cover various type checking scenarios.

## Component definition
Components can be defined either in YAML or as Python functions decorated with dsl.component.

## Type definition
Types can be defined as a string or as a dictionary formatted as:
{
    type_name: {
        property_a: value_a,
        property_b: value_b
    }
}
If you define the component using the function decorator, you can annotate it with the predefined [core types](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_types.py).
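As a minimal sketch of the two forms described above, the same kind of annotation can be written as plain Python values. The helper `type_name_of` is hypothetical and exists only for illustration; it is not part of the KFP SDK.

```python
# A type given as a bare string: only the type name is known.
simple_type = 'GcsUri'

# A type given as a dictionary: one type name mapped to its properties.
detailed_type = {
    'GCSPath': {
        'path_type': 'file',
        'file_type': 'csv',
    }
}


def type_name_of(type_spec):
    """Return the type name of a string or single-key dictionary type.

    Hypothetical helper for illustration; not a KFP SDK function.
    """
    if isinstance(type_spec, str):
        return type_spec
    if isinstance(type_spec, dict) and len(type_spec) == 1:
        return next(iter(type_spec))
    raise ValueError('expected a string or a single-key dictionary')


print(type_name_of(simple_type))
print(type_name_of(detailed_type))
```

Both forms carry a type name; only the dictionary form carries properties that can be compared between components.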

## Type check switch
Type checking is disabled by default. To enable it, pass the --type-check argument when running dsl-compile from the command line, or pass type_check=True to kfp.compiler.Compiler().compile().

## How does type checking work?
The DSL compiler checks type consistency between components by comparing the type_name as well as the property keys and values. Some special cases are listed here:
1. Type checking succeeds if the upstream component output has fewer properties than the downstream component input.
2. Type checking fails if the upstream component output has more properties than the downstream component input.
3. Type checking succeeds if the upstream or downstream component lacks type information.
4. Type checking succeeds if type checking is disabled.
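The rules listed above can be sketched as a small standalone function. This is a simplified model written for illustration, not the SDK's actual implementation; the name `types_compatible` and its signature are assumptions, and a missing type is modeled as `None`.

```python
def types_compatible(upstream, downstream, type_check=True):
    """Simplified, hypothetical model of the type checking rules above."""
    if not type_check:
        return True  # Rule 4: checking is disabled.
    if upstream is None or downstream is None:
        return True  # Rule 3: one side lacks type information.

    # Treat a bare string type as a type name with no properties.
    up = {upstream: {}} if isinstance(upstream, str) else upstream
    down = {downstream: {}} if isinstance(downstream, str) else downstream

    # The type names must agree.
    if set(up) != set(down):
        return False

    for type_name, up_props in up.items():
        down_props = down[type_name]
        for key, value in up_props.items():
            if key not in down_props:
                return False  # Rule 2: upstream has an extra property.
            if down_props[key] != value:
                return False  # Same property key, different value.
    return True  # Rule 1: upstream properties are a subset of downstream.


fewer = {'GCSPath': {'path_type': 'file'}}
more = {'GCSPath': {'path_type': 'file', 'file_type': 'csv'}}
print(types_compatible(fewer, more))
print(types_compatible(more, fewer))
```

In this sketch the first call models an upstream output with fewer properties than the downstream input, and the second call models the reverse.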

## Setup

In [1]:
# Location of the KFP SDK package to install.
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.12/kfp.tar.gz'

In [None]:
# Install Pipeline SDK
!pip3 install $KFP_PACKAGE --upgrade

## Author components in YAML (successful type checking)

In [2]:
component_a = '''\
name: component a
description: component a desc
inputs:
  - {name: field_l, type: Integer}
outputs:
  - {name: field_m, type: {GCSPath: {path_type: file, file_type: csv}}}
  - {name: field_n, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_o, type: GcsUri} 
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3, /pipelines/component/src/train.py]
    args: [
      --field-l, {inputValue: field_l},
    ]
    fileOutputs: 
      field_m: /schema.txt
      field_n: /feature.txt
      field_o: /output.txt
'''
component_b = '''\
name: component b
description: component b desc
inputs:
  - {name: field_x, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_y, type: GcsUri}
  - {name: field_z, type: {GCSPath: {path_type: file, file_type: csv}}}
outputs:
  - {name: output_model_uri, type: GcsUri}
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3]
    args: [
      --field-x, {inputValue: field_x},
      --field-y, {inputValue: field_y},
      --field-z, {inputValue: field_z},
    ]
    fileOutputs: 
      output_model_uri: /schema.txt
'''

## Author a pipeline with the above components

In [7]:
import kfp.components as comp
import kfp.dsl as dsl
import kfp.compiler as compiler
task_factory_a = comp.load_component_from_text(text=component_a)
task_factory_b = comp.load_component_from_text(text=component_b)

# Use the components as part of the pipeline
@dsl.pipeline(name='type_check',
    description='')
def pipeline_a():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

compiler.Compiler().compile(pipeline_a, 'pipeline_a.tar.gz', type_check=True)

## Author components in YAML (failed type checking)

In [9]:
component_a = '''\
name: component a
description: component a desc
inputs:
  - {name: field_l, type: Integer}
outputs:
  - {name: field_m, type: {GCSPath: {path_type: file, file_type: csv}}}
  - {name: field_n, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_o, type: GcsUri} 
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3, /pipelines/component/src/train.py]
    args: [
      --field-l, {inputValue: field_l},
    ]
    fileOutputs: 
      field_m: /schema.txt
      field_n: /feature.txt
      field_o: /output.txt
'''
component_b = '''\
name: component b
description: component b desc
inputs:
  - {name: field_x, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_y, type: GcsUri}
  - {name: field_z, type: {GCSPath: {path_type: file, file_type: tsv}}}
outputs:
  - {name: output_model_uri, type: GcsUri}
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3]
    args: [
      --field-x, {inputValue: field_x},
      --field-y, {inputValue: field_y},
      --field-z, {inputValue: field_z},
    ]
    fileOutputs: 
      output_model_uri: /schema.txt
'''

## Author a pipeline with the above components

In [12]:
import kfp.components as comp
import kfp.dsl as dsl
import kfp.compiler as compiler
task_factory_a = comp.load_component_from_text(text=component_a)
task_factory_b = comp.load_component_from_text(text=component_b)

# Use the components as part of the pipeline
@dsl.pipeline(name='type_check',
    description='')
def pipeline_a():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

compiler.Compiler().compile(pipeline_a, 'pipeline_a.tar.gz', type_check=True)

GCSPath has a property file_type with value: csv and tsv


InconsistentTypeException: Component "component b" is expecting field_z to be type({'GCSPath': OrderedDict([('path_type', 'file'), ('file_type', 'tsv')])}), but the passed argument is type({'GCSPath': OrderedDict([('path_type', 'file'), ('file_type', 'csv')])})
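To see which of the three connections triggers this failure, the declared types from the two YAML components can be written out as plain Python dictionaries and compared pairwise. This is only an illustrative sketch: it uses strict equality rather than the compiler's own comparison, and the names `upstream_outputs`, `downstream_inputs`, and `connections` are introduced here for the example.

```python
# Output types declared by "component a" in the failing example.
upstream_outputs = {
    'field_m': {'GCSPath': {'path_type': 'file', 'file_type': 'csv'}},
    'field_n': {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
    'field_o': 'GcsUri',
}

# Input types declared by "component b" in the failing example.
downstream_inputs = {
    'field_x': {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
    'field_y': 'GcsUri',
    'field_z': {'GCSPath': {'path_type': 'file', 'file_type': 'tsv'}},
}

# How the pipeline wires inputs of b to outputs of a.
connections = {'field_x': 'field_n', 'field_y': 'field_o', 'field_z': 'field_m'}

# Collect every (input, output) pair whose declared types differ.
mismatches = [
    (input_name, output_name)
    for input_name, output_name in connections.items()
    if downstream_inputs[input_name] != upstream_outputs[output_name]
]
print(mismatches)
```

Only the field_z connection differs, because the input expects file_type tsv while the output provides file_type csv, which matches the exception message above.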

## Author a pipeline with the above components but type checking disabled.

In [None]:
compiler.Compiler().compile(pipeline_a, 'pipeline_a.tar.gz', type_check=False)

## Author components with decorator (successful type checking)

In [None]:
from kfp import compiler

# The return value "DeployerOp" represents a step that can be used directly in a pipeline function
DeployerOp = compiler.build_python_component(
    component_func=deploy_model,
    staging_gcs_path=OUTPUT_DIR,
    dependency=[compiler.VersionedDependency(name='google-api-python-client', version='1.7.0')],
    base_image='tensorflow/tensorflow:1.12.0-py3',
    target_image=TARGET_IMAGE)

#### Option Two: build a base docker container image with both tensorflow and google api client packages

In [None]:
%%docker {BASE_IMAGE} {OUTPUT_DIR}
FROM tensorflow/tensorflow:1.10.0-py3
RUN pip3 install google-api-python-client

Once the base Docker container image is built, we can build a "target" container image that is the base_image plus the Python function as the entry point. The target container image can be used as a step in a pipeline.

In [None]:
from kfp import compiler

# The return value "DeployerOp" represents a step that can be used directly in a pipeline function
DeployerOp = compiler.build_python_component(
    component_func=deploy_model,
    staging_gcs_path=OUTPUT_DIR,
    target_image=TARGET_IMAGE)

### Modify the pipeline with the new deployer

In [None]:
# My New Pipeline. It's almost the same as the original one with the last step deployer replaced.
@dsl.pipeline(
  name='TFX Taxi Cab Classification Pipeline Example',
  description='Example pipeline that does classification with model analysis based on a public BigQuery dataset.'
)
def my_taxi_cab_classification(
    output,
    project,
    model,
    version,
    column_names=dsl.PipelineParam(
        name='column-names',
        value='gs://ml-pipeline-playground/tfx/taxi-cab-classification/column-names.json'),
    key_columns=dsl.PipelineParam(name='key-columns', value='trip_start_timestamp'),
    train=dsl.PipelineParam(
        name='train',
        value=TRAIN_DATA),
    evaluation=dsl.PipelineParam(
        name='evaluation',
        value=EVAL_DATA),
    validation_mode=dsl.PipelineParam(name='validation-mode', value='local'),
    preprocess_mode=dsl.PipelineParam(name='preprocess-mode', value='local'),
    preprocess_module: dsl.PipelineParam=dsl.PipelineParam(
        name='preprocess-module',
        value='gs://ml-pipeline-playground/tfx/taxi-cab-classification/preprocessing.py'),
    target=dsl.PipelineParam(name='target', value='tips'),
    learning_rate=dsl.PipelineParam(name='learning-rate', value=0.1),
    hidden_layer_size=dsl.PipelineParam(name='hidden-layer-size', value=HIDDEN_LAYER_SIZE),
    steps=dsl.PipelineParam(name='steps', value=STEPS),
    predict_mode=dsl.PipelineParam(name='predict-mode', value='local'),
    analyze_mode=dsl.PipelineParam(name='analyze-mode', value='local'),
    analyze_slice_column=dsl.PipelineParam(name='analyze-slice-column', value='trip_start_hour')):
    
    
    validation_output = '%s/{{workflow.name}}/validation' % output
    transform_output = '%s/{{workflow.name}}/transformed' % output
    training_output = '%s/{{workflow.name}}/train' % output
    analysis_output = '%s/{{workflow.name}}/analysis' % output
    prediction_output = '%s/{{workflow.name}}/predict' % output

    validation = dataflow_tf_data_validation_op(
        train, evaluation, column_names, key_columns, project,
        validation_mode, validation_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    preprocess = dataflow_tf_transform_op(
        train, evaluation, validation.outputs['schema'], project, preprocess_mode,
        preprocess_module, transform_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    training = tf_train_op(
        preprocess.output, validation.outputs['schema'], learning_rate, hidden_layer_size,
        steps, target, preprocess_module, training_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    analysis = dataflow_tf_model_analyze_op(
        training.output, evaluation, validation.outputs['schema'], project,
        analyze_mode, analyze_slice_column, analysis_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    prediction = dataflow_tf_predict_op(
        evaluation, validation.outputs['schema'], target, training.output,
        predict_mode, project, prediction_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    
    # The new deployer. Note that the DeployerOp interface is similar to the function "deploy_model".
    deploy = DeployerOp(
        gcp_project=project, model_name=model, version_name=version, runtime='1.9',
        model_path=training.output).apply(gcp.use_gcp_secret('user-gcp-sa'))

### Submit a new job

In [None]:
compiler.Compiler().compile(my_taxi_cab_classification,  'my-tfx.tar.gz')

run = client.run_pipeline(exp.id, 'my-tfx', 'my-tfx.tar.gz',
                          params={'output': OUTPUT_DIR,
                                  'project': PROJECT_NAME,
                                  'model': DEPLOYER_MODEL,
                                  'version': DEPLOYER_VERSION_PROD})

result = client.wait_for_run_completion(run.id, timeout=600)

## Customize a step in Python 2
Let's reuse the deploy_model function defined above. However, this time we will use Python 2 instead of the default Python 3.

In [None]:
from kfp import compiler

# The return value "DeployerOp" represents a step that can be used directly in a pipeline function
#TODO: demonstrate the python2 support in another sample.
DeployerOp = compiler.build_python_component(
    component_func=deploy_model,
    staging_gcs_path=OUTPUT_DIR,
    dependency=[compiler.VersionedDependency(name='google-api-python-client', version='1.7.0')],
    base_image='tensorflow/tensorflow:1.12.0',
    target_image=TARGET_IMAGE_TWO,
    python_version='python2')

### Modify the pipeline with the new deployer

In [None]:
# My New Pipeline. It's almost the same as the original one with the last step deployer replaced.
@dsl.pipeline(
  name='TFX Taxi Cab Classification Pipeline Example',
  description='Example pipeline that does classification with model analysis based on a public BigQuery dataset.'
)
def my_taxi_cab_classification(
    output,
    project,
    model,
    version,
    column_names=dsl.PipelineParam(
        name='column-names',
        value='gs://ml-pipeline-playground/tfx/taxi-cab-classification/column-names.json'),
    key_columns=dsl.PipelineParam(name='key-columns', value='trip_start_timestamp'),
    train=dsl.PipelineParam(
        name='train',
        value=TRAIN_DATA),
    evaluation=dsl.PipelineParam(
        name='evaluation',
        value=EVAL_DATA),
    validation_mode=dsl.PipelineParam(name='validation-mode', value='local'),
    preprocess_mode=dsl.PipelineParam(name='preprocess-mode', value='local'),
    preprocess_module: dsl.PipelineParam=dsl.PipelineParam(
        name='preprocess-module',
        value='gs://ml-pipeline-playground/tfx/taxi-cab-classification/preprocessing.py'),
    target=dsl.PipelineParam(name='target', value='tips'),
    learning_rate=dsl.PipelineParam(name='learning-rate', value=0.1),
    hidden_layer_size=dsl.PipelineParam(name='hidden-layer-size', value=HIDDEN_LAYER_SIZE),
    steps=dsl.PipelineParam(name='steps', value=STEPS),
    predict_mode=dsl.PipelineParam(name='predict-mode', value='local'),
    analyze_mode=dsl.PipelineParam(name='analyze-mode', value='local'),
    analyze_slice_column=dsl.PipelineParam(name='analyze-slice-column', value='trip_start_hour')):
    
    
    validation_output = '%s/{{workflow.name}}/validation' % output
    transform_output = '%s/{{workflow.name}}/transformed' % output
    training_output = '%s/{{workflow.name}}/train' % output
    analysis_output = '%s/{{workflow.name}}/analysis' % output
    prediction_output = '%s/{{workflow.name}}/predict' % output

    validation = dataflow_tf_data_validation_op(
        train, evaluation, column_names, key_columns, project,
        validation_mode, validation_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    preprocess = dataflow_tf_transform_op(
        train, evaluation, validation.outputs['schema'], project, preprocess_mode,
        preprocess_module, transform_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    training = tf_train_op(
        preprocess.output, validation.outputs['schema'], learning_rate, hidden_layer_size,
        steps, target, preprocess_module, training_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    analysis = dataflow_tf_model_analyze_op(
        training.output, evaluation, validation.outputs['schema'], project,
        analyze_mode, analyze_slice_column, analysis_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    prediction = dataflow_tf_predict_op(
        evaluation, validation.outputs['schema'], target, training.output,
        predict_mode, project, prediction_output).apply(gcp.use_gcp_secret('user-gcp-sa'))
    
    # The new deployer. Note that the DeployerOp interface is similar to the function "deploy_model".
    deploy = DeployerOp(
        gcp_project=project, model_name=model, version_name=version, runtime='1.9',
        model_path=training.output).apply(gcp.use_gcp_secret('user-gcp-sa'))

### Submit a new job

In [None]:
compiler.Compiler().compile(my_taxi_cab_classification,  'my-tfx-two.tar.gz')

run = client.run_pipeline(exp.id, 'my-tfx-two', 'my-tfx-two.tar.gz',
                          params={'output': OUTPUT_DIR,
                                  'project': PROJECT_NAME,
                                  'model': DEPLOYER_MODEL,
                                  'version': DEPLOYER_VERSION_PROD_TWO})

result = client.wait_for_run_completion(run.id, timeout=600)

## Clean up

In [17]:
# This step is only needed if you are using an in-cluster JupyterHub instance.
!gcloud auth activate-service-account --key-file /var/run/secrets/sa/user-gcp-sa.json


!gcloud ml-engine versions delete $DEPLOYER_VERSION_PROD --model $DEPLOYER_MODEL -q
!gcloud ml-engine versions delete $DEPLOYER_VERSION_PROD_TWO --model $DEPLOYER_MODEL -q
!gcloud ml-engine versions delete $DEPLOYER_VERSION_DEV --model $DEPLOYER_MODEL -q
!gcloud ml-engine models delete $DEPLOYER_MODEL -q

Activated service account credentials for: [kubeflow3-user@bradley-playground.iam.gserviceaccount.com]
Deleting version [prod]......done.                                             
Deleting version [dev]......done.                                              
Deleting model [notebook_tfx_taxi]...done.                                     
