# KubeFlow Pipeline DSL Static Type Checking

In this notebook, we will demonstrate how to:

* Define a KubeFlow pipeline with the Python DSL
* Compile the pipeline with type checking enabled

Because this sample focuses on DSL type checking, the components used here are not runnable in the system; they exist only to exercise various type checking scenarios.

## Component definition
Components can be defined either in YAML or as Python functions decorated with dsl.component.

## Type definition
A type can be defined as a string or as a dictionary of the following form:
{
    type_name: {
        property_a: value_a,
        property_b: value_b
    }
}
If you define the component using the function decorator, a list of predefined [core types](https://github.com/kubeflow/pipelines/blob/master/sdk/python/kfp/dsl/_types.py) is available.
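
As a rough illustration of how the two notations relate, the sketch below normalizes either a string or a single-key dictionary into a `(type_name, properties)` pair. The helper `split_type_spec` is a hypothetical name introduced here for illustration; it is not part of the KFP SDK.

```python
def split_type_spec(type_spec):
    """Return (type_name, properties) for a string or single-key dict type.

    Hypothetical helper for illustration only; not part of the KFP SDK.
    """
    if isinstance(type_spec, str):
        # A bare string names a type that carries no properties.
        return type_spec, {}
    if isinstance(type_spec, dict) and len(type_spec) == 1:
        # The dictionary form has exactly one key: the type name.
        (type_name, properties), = type_spec.items()
        return type_name, dict(properties or {})
    raise ValueError(f"unsupported type spec: {type_spec!r}")


print(split_type_spec("Integer"))
print(split_type_spec({"GCSPath": {"path_type": "file", "file_type": "csv"}}))
```

Both calls yield the same shape of result, which is why the string form can be read as shorthand for a type without properties.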

## Type check switch
Type checking is disabled by default. To enable it, pass the --type-check argument when running dsl-compile on the command line, or pass type_check=True to kfp.compiler.Compiler().compile().

## How does type checking work?
The DSL compiler checks type consistency between components by comparing the type_name as well as the property keys and values. Some special cases are listed here:
1. Type checking succeeds if the upstream component output has fewer properties than the downstream component input.
2. Type checking fails if the upstream component output has more properties than the downstream component input.
3. Type checking succeeds if the upstream or downstream component lacks the type information.
4. Type checking succeeds if type checking is disabled.
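
The rules above can be sketched in plain Python. This is a simplified illustration written for this notebook, not the SDK's actual implementation; the function name `types_compatible` and its `enabled` parameter are assumptions, and the messages only imitate the style of the compiler output shown later.

```python
def types_compatible(upstream, downstream, enabled=True):
    """Simplified sketch of the rules above; not the SDK's actual code.

    Each type is None (missing), a string, or {type_name: {property: value}}.
    Returns a (passed, reason) tuple.
    """
    if not enabled:
        return True, "type check disabled"
    if not upstream or not downstream:
        return True, "type information missing"
    # Treat the string form as a type with no properties.
    if isinstance(upstream, str):
        upstream = {upstream: {}}
    if isinstance(downstream, str):
        downstream = {downstream: {}}
    (up_name, up_props), = upstream.items()
    (down_name, down_props), = downstream.items()
    if up_name != down_name:
        return False, f"type name {up_name} is different from expected: {down_name}"
    # Every upstream property must exist downstream with the same value.
    for key, value in up_props.items():
        if key not in down_props:
            return False, f"{up_name} has a property {key} that the input does not"
        if value != down_props[key]:
            return False, (f"{up_name} has a property {key} "
                           f"with value: {value} and {down_props[key]}")
    return True, "ok"


csv = {"GCSPath": {"path_type": "file", "file_type": "csv"}}
tsv = {"GCSPath": {"path_type": "file", "file_type": "tsv"}}
partial = {"GCSPath": {"path_type": "file"}}

print(types_compatible(partial, csv))             # fewer upstream properties
print(types_compatible(csv, partial))             # more upstream properties
print(types_compatible(csv, tsv))                 # conflicting property value
print(types_compatible(None, tsv))                # missing type information
print(types_compatible(csv, tsv, enabled=False))  # check disabled
```

Because the loop only walks the upstream properties, an upstream type with fewer properties passes, while any extra or conflicting upstream property fails.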

# Setup

In [1]:
# URL of the KFP SDK package to install. This must be set before you proceed.
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.12/kfp-experiment.tar.gz'

## Install Pipeline SDK

In [2]:
!sudo pip3 install $KFP_PACKAGE --upgrade

# Type Check with YAML components: successful scenario

## Author components in YAML

In [1]:
component_a = '''\
name: component a
description: component a desc
inputs:
  - {name: field_l, type: Integer}
outputs:
  - {name: field_m, type: {GCSPath: {path_type: file, file_type: csv}}}
  - {name: field_n, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_o, type: GcsUri} 
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3, /pipelines/component/src/train.py]
    args: [
      --field-l, {inputValue: field_l},
    ]
    fileOutputs: 
      field_m: /schema.txt
      field_n: /feature.txt
      field_o: /output.txt
'''
component_b = '''\
name: component b
description: component b desc
inputs:
  - {name: field_x, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_y, type: GcsUri}
  - {name: field_z, type: {GCSPath: {path_type: file, file_type: csv}}}
outputs:
  - {name: output_model_uri, type: GcsUri}
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3]
    args: [
      --field-x, {inputValue: field_x},
      --field-y, {inputValue: field_y},
      --field-z, {inputValue: field_z},
    ]
    fileOutputs: 
      output_model_uri: /schema.txt
'''

## Author a pipeline with the above components

In [2]:
import kfp.components as comp
import kfp.dsl as dsl
import kfp.compiler as compiler
task_factory_a = comp.load_component_from_text(text=component_a)
task_factory_b = comp.load_component_from_text(text=component_b)

#Use the component as part of the pipeline
@dsl.pipeline(name='type_check_a',
    description='')
def pipeline_a():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

compiler.Compiler().compile(pipeline_a, 'pipeline_a.tar.gz', type_check=True)

# Type Check with YAML components: failure scenario

## Author components in YAML

In [3]:
component_a = '''\
name: component a
description: component a desc
inputs:
  - {name: field_l, type: Integer}
outputs:
  - {name: field_m, type: {GCSPath: {path_type: file, file_type: csv}}}
  - {name: field_n, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_o, type: GcsUri} 
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3, /pipelines/component/src/train.py]
    args: [
      --field-l, {inputValue: field_l},
    ]
    fileOutputs: 
      field_m: /schema.txt
      field_n: /feature.txt
      field_o: /output.txt
'''
component_b = '''\
name: component b
description: component b desc
inputs:
  - {name: field_x, type: {customized_type: {property_a: value_a, property_b: value_b}}}
  - {name: field_y, type: GcsUri}
  - {name: field_z, type: {GCSPath: {path_type: file, file_type: tsv}}}
outputs:
  - {name: output_model_uri, type: GcsUri}
implementation:
  container:
    image: gcr.io/ml-pipeline/component-a
    command: [python3]
    args: [
      --field-x, {inputValue: field_x},
      --field-y, {inputValue: field_y},
      --field-z, {inputValue: field_z},
    ]
    fileOutputs: 
      output_model_uri: /schema.txt
'''

## Author a pipeline with the above components

In [4]:
import kfp.components as comp
import kfp.dsl as dsl
import kfp.compiler as compiler
from kfp.dsl._types import InconsistentTypeException
task_factory_a = comp.load_component_from_text(text=component_a)
task_factory_b = comp.load_component_from_text(text=component_b)

#Use the component as part of the pipeline
@dsl.pipeline(name='type_check_b',
    description='')
def pipeline_b():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

try:
    compiler.Compiler().compile(pipeline_b, 'pipeline_b.tar.gz', type_check=True)
except InconsistentTypeException as e:
    print(e)

GCSPath has a property file_type with value: csv and tsv
Component "component b" is expecting field_z to be type({'GCSPath': OrderedDict([('path_type', 'file'), ('file_type', 'tsv')])}), but the passed argument is type({'GCSPath': OrderedDict([('path_type', 'file'), ('file_type', 'csv')])})


## Compile the above pipeline with type checking disabled

In [5]:
compiler.Compiler().compile(pipeline_b, 'pipeline_b.tar.gz', type_check=False)

# Type Check with decorated components: successful scenario

## Author components with decorator

In [6]:
from kfp.dsl._component import component
from kfp.dsl._types import Integer, GCSPath
from kfp.dsl import ContainerOp
@component
def task_factory_a(field_l: Integer()) -> {'field_m': {'GCSPath': {'path_type': 'file', 'file_type':'tsv'}}, 
                                           'field_n': {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
                                           'field_o': 'Integer'
                                          }:
    return ContainerOp(
        name = 'operator a',
        image = 'gcr.io/ml-pipeline/component-a',
        arguments = [
            '--field-l', field_l,
        ],
        file_outputs = {
            'field_m': '/schema.txt',
            'field_n': '/feature.txt',
            'field_o': '/output.txt'
        }
    )

@component
def task_factory_b(field_x: {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
        field_y: Integer(),
        field_z: GCSPath(path_type='file', file_type='tsv')) -> {'output_model_uri': 'GcsUri'}:
    return ContainerOp(
        name = 'operator b',
        image = 'gcr.io/ml-pipeline/component-a',
        command = [
            'python3',
            field_x,
        ],
        arguments = [
            '--field-y', field_y,
            '--field-z', field_z,
        ],
        file_outputs = {
            'output_model_uri': '/schema.txt',
        }
    )

## Author a pipeline with the above components

In [7]:
#Use the component as part of the pipeline
@dsl.pipeline(name='type_check_c',
    description='')
def pipeline_c():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

compiler.Compiler().compile(pipeline_c, 'pipeline_c.tar.gz', type_check=True)

# Type Check with decorated components: failure scenario

## Author components with decorator

In [8]:
from kfp.dsl._component import component
from kfp.dsl._types import Integer, GCSPath
from kfp.dsl import ContainerOp
@component
def task_factory_a(field_l: Integer()) -> {'field_m': {'GCSPaths': {'path_type': 'file', 'file_type':'tsv'}}, 
                                           'field_n': {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
                                           'field_o': 'Integer'
                                          }:
    return ContainerOp(
        name = 'operator a',
        image = 'gcr.io/ml-pipeline/component-a',
        arguments = [
            '--field-l', field_l,
        ],
        file_outputs = {
            'field_m': '/schema.txt',
            'field_n': '/feature.txt',
            'field_o': '/output.txt'
        }
    )

@component
def task_factory_b(field_x: {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
        field_y: Integer(),
        field_z: GCSPath(path_type='file', file_type='tsv')) -> {'output_model_uri': 'GcsUri'}:
    return ContainerOp(
        name = 'operator b',
        image = 'gcr.io/ml-pipeline/component-a',
        command = [
            'python3',
            field_x,
        ],
        arguments = [
            '--field-y', field_y,
            '--field-z', field_z,
        ],
        file_outputs = {
            'output_model_uri': '/schema.txt',
        }
    )

## Author a pipeline with the above components

In [9]:
#Use the component as part of the pipeline
@dsl.pipeline(name='type_check_d',
    description='')
def pipeline_d():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

try:
    compiler.Compiler().compile(pipeline_d, 'pipeline_d.tar.gz', type_check=True)
except InconsistentTypeException as e:
    print(e)

type name GCSPaths is different from expected: GCSPath
Component "task_factory_b" is expecting field_z to be type({'GCSPath': {'path_type': 'file', 'file_type': 'tsv'}}), but the passed argument is type({'GCSPaths': {'path_type': 'file', 'file_type': 'tsv'}})


# Type Check with missing type information

## Author components (with missing types)

In [10]:
from kfp.dsl._component import component
from kfp.dsl._types import Integer, GCSPath
from kfp.dsl import ContainerOp
@component
def task_factory_a(field_l: Integer()) -> {'field_m': {'GCSPath': {'path_type': 'file', 'file_type':'tsv'}}, 
                                           'field_o': 'Integer'
                                          }:
    return ContainerOp(
        name = 'operator a',
        image = 'gcr.io/ml-pipeline/component-a',
        arguments = [
            '--field-l', field_l,
        ],
        file_outputs = {
            'field_m': '/schema.txt',
            'field_n': '/feature.txt',
            'field_o': '/output.txt'
        }
    )

@component
def task_factory_b(field_x: {'customized_type': {'property_a': 'value_a', 'property_b': 'value_b'}},
        field_y,
        field_z: GCSPath(path_type='file', file_type='tsv')) -> {'output_model_uri': 'GcsUri'}:
    return ContainerOp(
        name = 'operator b',
        image = 'gcr.io/ml-pipeline/component-a',
        command = [
            'python3',
            field_x,
        ],
        arguments = [
            '--field-y', field_y,
            '--field-z', field_z,
        ],
        file_outputs = {
            'output_model_uri': '/schema.txt',
        }
    )

## Author a pipeline with the above components

In [11]:
#Use the component as part of the pipeline
@dsl.pipeline(name='type_check_e',
    description='')
def pipeline_e():
    a = task_factory_a(field_l=12)
    b = task_factory_b(field_x=a.outputs['field_n'], field_y=a.outputs['field_o'], field_z=a.outputs['field_m'])

compiler.Compiler().compile(pipeline_e, 'pipeline_e.tar.gz', type_check=True)

# Type Check with both named arguments and positional arguments

In [12]:
#Use the component as part of the pipeline
@dsl.pipeline(name='type_check_f',
    description='')
def pipeline_f():
    a = task_factory_a(field_l=12)
    b = task_factory_b(a.outputs['field_n'], a.outputs['field_o'], field_z=a.outputs['field_m'])

compiler.Compiler().compile(pipeline_f, 'pipeline_f.tar.gz', type_check=True)
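
One way to see why positional and named arguments can be checked alike: a call can be bound back to parameter names, after which each argument's declared type is read from the function annotations. The sketch below does this with the standard-library `inspect` module on a plain function. The names `declared_types_for_call` and `fake_component_b` are hypothetical, and this is not necessarily how the KFP SDK resolves arguments internally.

```python
import inspect


def declared_types_for_call(func, *args, **kwargs):
    """Map each passed argument name to (value, declared type or None)."""
    signature = inspect.signature(func)
    # bind() assigns positional and keyword arguments to parameter names.
    bound = signature.bind(*args, **kwargs)
    result = {}
    for name, value in bound.arguments.items():
        annotation = signature.parameters[name].annotation
        if annotation is inspect.Parameter.empty:
            annotation = None  # parameter declared without type information
        result[name] = (value, annotation)
    return result


# Hypothetical stand-in for a decorated component; it does nothing.
def fake_component_b(field_x: {"customized_type": {"property_a": "value_a"}},
                     field_y,
                     field_z: {"GCSPath": {"path_type": "file", "file_type": "tsv"}}):
    pass


# Two positional arguments and one named argument, as in pipeline_f.
bound = declared_types_for_call(fake_component_b, "n", "o", field_z="m")
for name, (value, declared) in bound.items():
    print(name, value, declared)
```

Once every argument is keyed by parameter name, the same per-parameter type comparison applies regardless of how the argument was passed.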

# Type Check between pipeline parameters and component parameters

In [13]:
@component
def task_factory_a(field_m: {'GCSPath': {'path_type': 'file', 'file_type':'tsv'}}, field_o: 'Integer'):
    return ContainerOp(
        name = 'operator a',
        image = 'gcr.io/ml-pipeline/component-b',
        arguments = [
            '--field-m', field_m,
            '--field-o', field_o,
        ],
    )
@dsl.pipeline(name='type_check_g',
    description='')
def pipeline_g(a: {'GCSPath': {'path_type':'file', 'file_type': 'csv'}}='good', b: Integer()=12):
    task_factory_a(field_m=a, field_o=b)

try:
    compiler.Compiler().compile(pipeline_g, 'pipeline_g.tar.gz', type_check=True)
except InconsistentTypeException as e:
    print(e)

GCSPath has a property file_type with value: csv and tsv
Component "task_factory_a" is expecting field_m to be type({'GCSPath': {'path_type': 'file', 'file_type': 'tsv'}}), but the passed argument is type({'GCSPath': {'path_type': 'file', 'file_type': 'csv'}})


# Clean up

In [14]:
from pathlib import Path
for p in Path(".").glob("pipeline_[a-g].tar.gz"):
    p.unlink()