##### Copyright &copy; 2020 The TensorFlow Authors.

<font size=-1>Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at [https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.</font>

# Continuous training pipeline for CIFAR10 image classifier 

This example demonstrates a continuous training TFX pipeline that trains an image classification model on the CIFAR10 dataset. The pipeline runs on **AI Platform Pipelines** and uses **Cloud Dataflow** and **Cloud AI Platform Training** as execution runtimes.

![TFX CAIP](../../images/tfx-caip-1.png)

This example uses the [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html) dataset released by the Canadian Institute for Advanced Research (CIFAR).

Note: This site provides applications using data that has been modified for use from its original source, The Canadian Institute for Advanced Research (CIFAR). The Canadian Institute for Advanced Research (CIFAR) makes no claims as to the content, accuracy, timeliness, or completeness of any of the data provided at this site. The data provided at this site is subject to change at any time. It is understood that the data provided at this site is being used at one’s own risk.

You can read more about the dataset on the [CIFAR dataset homepage](https://www.cs.toronto.edu/~kriz/cifar.html).

## Configuring the environment settings

### Install KFP and TFX SDKs

You will use TFX and KFP SDKs to compile, deploy and run the pipeline. During the installation you may see errors like the one below.

>"ERROR: some-package 0.some_version.1 has requirement other-package!=2.0.,&lt;3,&gt;=1.15, but you'll have other-package 2.0.0 which is incompatible." 

Please ignore these errors.


In [1]:
requirements_file = 'requirements.txt'

In [2]:
%%writefile {requirements_file}

tfx==0.21.0
kfp==0.2.5

Overwriting requirements.txt


In [3]:
!pip install --user --upgrade  -q -r requirements.txt

TFX CLI requires [skaffold](https://skaffold.dev/).

In [4]:
!curl -Lo skaffold https://storage.googleapis.com/skaffold/releases/latest/skaffold-linux-amd64 && chmod +x skaffold && mv skaffold /home/jupyter/.local/bin/

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 39.1M  100 39.1M    0     0  88.5M      0 --:--:-- --:--:-- --:--:-- 88.5M


Set `PATH` to include the user Python binary directory, which is also the directory containing `skaffold`.

In [5]:
PATH=%env PATH
%env PATH={PATH}:/home/jupyter/.local/bin

env: PATH=/usr/local/cuda/bin:/opt/conda/bin:/opt/conda/condabin:/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games:/home/jupyter/.local/bin
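
Re-running the cell above appends the same directory to `PATH` a second time. A minimal sketch of an idempotent alternative in plain Python follows; the helper name `append_to_path` is hypothetical and is not part of TFX or KFP.

```python
import os


def append_to_path(path_value, new_dir):
    """Return path_value with new_dir appended, unless it is already present."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if new_dir not in entries:
        entries.append(new_dir)
    return os.pathsep.join(entries)


# Directory used in this notebook for user-level binaries and skaffold.
LOCAL_BIN = '/home/jupyter/.local/bin'
os.environ['PATH'] = append_to_path(os.environ.get('PATH', ''), LOCAL_BIN)
```

Commands started with `!` inherit the kernel's environment, so updating `os.environ` should have the same effect as the `%env` magic.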


Double-check the version of TFX.

In [6]:
!python -c "import tfx; print('TFX version: {}'.format(tfx.__version__))"

TFX version: 0.21.0
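
The same check can be extended to every pinned package. The sketch below parses `name==version` pins and compares them with what is installed, using `importlib.metadata` (available in Python 3.8 and later, which is an assumption about your environment). The helper names `parse_pins` and `check_pins` are hypothetical.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for lines of the form `name==version`."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments and whitespace
        if '==' not in line:
            continue
        name, version = line.split('==', 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Map each package to (pinned, installed); installed is None if missing."""
    report = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (pinned, installed)
    return report


# The pins written to requirements.txt earlier in this notebook.
pins = parse_pins("tfx==0.21.0\nkfp==0.2.5\n")
for name, (pinned, installed) in check_pins(pins).items():
    status = 'OK' if installed == pinned else 'MISMATCH'
    print('{}: pinned {}, installed {} -> {}'.format(name, pinned, installed, status))
```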


### Configure AI Platform Pipelines connection settings

Make sure to update the following constants with your settings:
- Set `GCP_PROJECT_ID` to your project ID
- Set `ENDPOINT` to the endpoint of your AI Platform Pipelines environment

The endpoint of the AI Platform Pipelines environment can be found in the [AI Platform Pipelines Console](https://console.cloud.google.com/ai-platform/pipelines/clusters). 
1. Open the *SETTINGS* for your instance. 
2. Use the value of the host variable in the *Connect to this Kubeflow Pipelines instance from a Python client via Kubeflow Pipelines SDK* section of the pop-up window.



In [7]:
GCP_PROJECT_ID='mlops-workshop'
ENDPOINT='b408c7cc27aa8bb-dot-us-central2.pipelines.googleusercontent.com'

### Create a GCS bucket to be used as an artifact store

As the pipeline executes, it stores generated artifacts in a GCS location.

In [8]:
ARTIFACT_STORE_URI='gs://{}-artifact-store'.format(GCP_PROJECT_ID)

In [9]:
!gsutil mb {ARTIFACT_STORE_URI}

Creating gs://mlops-workshop-artifact-store/...
ServiceException: 409 Bucket mlops-workshop-artifact-store already exists.
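
The bucket name is derived from the project ID, so an unusual project ID can produce a name that Cloud Storage rejects. The sketch below builds the URI and applies a simplified approximation of the bucket naming rules (lowercase letters, digits, dashes, underscores and dots, 3 to 63 characters, starting and ending with a letter or digit). The pattern is an assumption made for illustration; consult the Cloud Storage documentation for the complete rules. The function name `artifact_store_uri` is hypothetical.

```python
import re

# Simplified approximation of the bucket naming rules (assumption, see above).
_BUCKET_NAME_RE = re.compile(r'^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$')


def artifact_store_uri(project_id):
    """Return the artifact store URI used in this notebook for project_id."""
    bucket = '{}-artifact-store'.format(project_id)
    if not _BUCKET_NAME_RE.match(bucket):
        raise ValueError('Not a valid bucket name: {!r}'.format(bucket))
    return 'gs://' + bucket


print(artifact_store_uri('mlops-workshop'))
```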


### Set the compute region for Dataflow and AI Platform Training and Prediction
The pipeline uses **Cloud Dataflow** and **Cloud AI Platform Training and Prediction** as execution runtimes for TFX components. In this example, the `us-central1` region is used as the default region for these services. Update the `GCP_REGION` constant if you cannot use `us-central1`.

In [10]:
GCP_REGION='us-central1'

### Set the URI for the custom TFX image

The pipeline components execute in a runtime provided by a custom Docker image. The image is a derivative of a base TFX image from Docker Hub, `tensorflow/tfx:0.21.0`. The custom image updates the TensorFlow Model Analysis, TensorFlow Data Validation, and TensorFlow Transform libraries in the base image and adds Python modules for the `Transform` and `Trainer` components. The image will be built and pushed to your project's **Container Registry** by TFX CLI.

The custom image is defined in the Dockerfile that can be found in the root folder of this example.

In [11]:
!cat Dockerfile

FROM tensorflow/tfx:0.21.0
RUN python -m pip install -U tensorflow_model_analysis==0.21.4 tensorflow_data_validation==0.21.4 tensorflow-transform==0.21.2
WORKDIR /pipeline
COPY pipeline/transform_train.py ./
COPY schema/schema.pbtxt ./schema/
ENV PYTHONPATH="/pipeline:${PYTHONPATH}"

In [12]:
CUSTOM_TFX_IMAGE='gcr.io/' + GCP_PROJECT_ID + '/cifar-tfx-image'

## Understanding the pipeline design

The pipeline code can be found in the `pipeline` folder.

In [13]:
!ls -la pipeline

total 40
drwxr-xr-x 4 jupyter jupyter 4096 Mar  7 18:49 .
drwxr-xr-x 5 jupyter jupyter 4096 Mar  7 19:40 ..
-rw-r--r-- 1 jupyter jupyter 1314 Mar  6 19:35 config.py
drwxr-xr-x 2 jupyter jupyter 4096 Mar  7 17:21 .ipynb_checkpoints
-rw-r--r-- 1 jupyter jupyter 5522 Mar  6 19:35 pipeline.py
drwxr-xr-x 2 jupyter jupyter 4096 Mar  6 19:51 __pycache__
-rw-r--r-- 1 jupyter jupyter 3038 Mar  6 20:34 runner.py
-rw-r--r-- 1 jupyter jupyter 7969 Mar  6 19:35 transform_train.py


The `config.py` file collates all environment-specific configuration settings and sets their default values. The default values can be overridden when building the pipeline by providing new values in a set of environment variables.

In [14]:
!tail -n 15 pipeline/config.py

class Config:
    """Sets configuration vars."""
    
    PIPELINE_NAME=os.getenv("PIPELINE_NAME", "cifar10_continuous_training")
    MODEL_NAME=os.getenv("MODEL_NAME", "cifar10_classifier")
    PROJECT_ID=os.getenv("PROJECT_ID", "mlops-workshop")
    GCP_REGION=os.getenv("GCP_REGION", "us-central1")
    TFX_IMAGE=os.getenv("KUBEFLOW_TFX_IMAGE", "tensorflow/tfx:0.21.0")
    DATA_ROOT_URI=os.getenv("DATA_ROOT_URI", "gs://workshop-datasets/cifar10")
    ARTIFACT_STORE_URI=os.getenv("ARTIFACT_STORE_URI", "gs://mlops-workshop-artifact-store")
    RUNTIME_VERSION=os.getenv("RUNTIME_VERSION", "1.15")
    PYTHON_VERSION=os.getenv("PYTHON_VERSION", "3.7")
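
Because the `os.getenv` calls are class attributes, they are evaluated once, when the class body runs (that is, when `config.py` is imported). An override therefore has to be present in the environment before that point. The sketch below illustrates the behavior with two hypothetical stand-in classes and a hypothetical variable name, `DEMO_MODEL_NAME`; it does not use the actual `Config` class.

```python
import os

# Start from a clean state for the demo variable (hypothetical name).
os.environ.pop('DEMO_MODEL_NAME', None)


class DefaultConfig:
    # Evaluated now: the variable is unset, so the default is captured.
    MODEL_NAME = os.getenv('DEMO_MODEL_NAME', 'cifar10_classifier')


os.environ['DEMO_MODEL_NAME'] = 'my_classifier'


class OverriddenConfig:
    # Evaluated after the variable was set, so the override is captured.
    MODEL_NAME = os.getenv('DEMO_MODEL_NAME', 'cifar10_classifier')


print(DefaultConfig.MODEL_NAME)     # cifar10_classifier
print(OverriddenConfig.MODEL_NAME)  # my_classifier
```

Setting the variable after the class has been defined does not change `DefaultConfig.MODEL_NAME`, which is why the environment is configured before the pipeline is compiled.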
    
    
    

The `pipeline.py` file contains the core DSL defining the workflow implemented by the pipeline. 

The `transform_train.py` file contains data preprocessing and training code for the `Transform` and `Trainer` components.

The `runner.py` file contains the code that configures and executes `KubeflowDagRunner`. `KubeflowDagRunner` is responsible for compiling the pipeline's DSL into the pipeline package (in the [Argo](https://argoproj.github.io/argo/) format).

## Building and deploying the pipeline

### Set the compile settings

As noted, the default values for the environment-specific settings that control how the pipeline is compiled can be overridden by the values in a set of environment variables.

In [None]:
MODEL_NAME='cifar10-classifier'
PIPELINE_NAME='cifar10_continuous_training'

In [None]:
%env PROJECT_ID={GCP_PROJECT_ID}
%env KUBEFLOW_TFX_IMAGE={CUSTOM_TFX_IMAGE}
%env ARTIFACT_STORE_URI={ARTIFACT_STORE_URI}
%env GCP_REGION={GCP_REGION}
%env MODEL_NAME={MODEL_NAME}
%env PIPELINE_NAME={PIPELINE_NAME}
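
The same settings can also be exported from Python rather than with `%env` magics, which makes it possible to reject empty values before compiling. This is a minimal sketch; the function name `export_settings` is hypothetical, and the example values are the ones used earlier in this notebook.

```python
import os


def export_settings(settings, environ=None):
    """Copy settings into the environment mapping, rejecting empty values."""
    environ = os.environ if environ is None else environ
    empty = sorted(name for name, value in settings.items() if not value)
    if empty:
        raise ValueError('Empty settings: {}'.format(', '.join(empty)))
    for name, value in settings.items():
        environ[name] = str(value)
    return environ


# Example values from earlier cells; replace them with your own.
demo_env = export_settings(
    {
        'PROJECT_ID': 'mlops-workshop',
        'GCP_REGION': 'us-central1',
        'MODEL_NAME': 'cifar10-classifier',
        'PIPELINE_NAME': 'cifar10_continuous_training',
    },
    environ={},  # omit this argument to write to os.environ instead
)
print(demo_env['PROJECT_ID'])
```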

### Compile the pipeline

You can build and upload the pipeline to the KFP environment in one step, using the `tfx pipeline create` command. The command goes through the following steps:
- (Optional) Builds an image to host your components
- Compiles the pipeline DSL into a pipeline package
- Uploads the pipeline package to the KFP environment

As you are debugging the pipeline DSL, you may prefer to first use the `tfx pipeline compile` command, which only executes the compilation step. After the DSL compiles successfully you can use `tfx pipeline create` to go through all steps.

In [None]:
!tfx pipeline compile --engine kubeflow --pipeline_path pipeline/runner.py

### Deploy the pipeline to AI Platform Pipelines

After the pipeline code has been debugged and compiles without errors, you perform the final compilation and deploy the pipeline package in one step using the `tfx pipeline create` command. This command also builds the container image that hosts TFX components and your data preprocessing and training code.

In [None]:
!tfx pipeline create  \
--pipeline_path=pipeline/runner.py \
--endpoint={ENDPOINT} \
--build_target_image={CUSTOM_TFX_IMAGE}
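
When the deployment is driven from a script rather than a notebook, the same command can be assembled as an argument list. The sketch below uses only the flags shown in the cell above. The function name is hypothetical, `YOUR_ENDPOINT` is a placeholder, and the `subprocess.run` call is left commented out because it requires the TFX CLI and access to the endpoint.

```python
import shlex


def tfx_pipeline_create_args(pipeline_path, endpoint, build_target_image):
    """Return the argument list for the `tfx pipeline create` call above."""
    return [
        'tfx', 'pipeline', 'create',
        '--pipeline_path={}'.format(pipeline_path),
        '--endpoint={}'.format(endpoint),
        '--build_target_image={}'.format(build_target_image),
    ]


args = tfx_pipeline_create_args(
    pipeline_path='pipeline/runner.py',
    endpoint='YOUR_ENDPOINT',  # placeholder, replace with your endpoint
    build_target_image='gcr.io/mlops-workshop/cifar-tfx-image',
)

# Print a shell-safe rendering of the command for inspection.
print(' '.join(shlex.quote(arg) for arg in args))

# To execute it (assumes the TFX CLI is installed and on PATH):
# import subprocess
# subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting issues if any value contains spaces or special characters.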

If you need to redeploy the pipeline, you can first delete the previous version using `tfx pipeline delete`, or you can update the pipeline in place using `tfx pipeline update`.

To delete the pipeline:

In [None]:
!tfx pipeline delete --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

### Create and monitor a pipeline run
After the pipeline has been deployed, you can trigger and monitor pipeline runs using TFX CLI or KFP UI.

To submit the pipeline run using TFX CLI:

In [None]:
!tfx run create --pipeline_name={PIPELINE_NAME} --endpoint={ENDPOINT}

To list all active runs of the pipeline:

In [None]:
!tfx run list --pipeline_name {PIPELINE_NAME} --endpoint {ENDPOINT}

To retrieve the status of a given run:

In [None]:
RUN_ID='[YOUR RUN ID]'

!tfx run status --pipeline_name {PIPELINE_NAME} --run_id {RUN_ID} --endpoint {ENDPOINT}

To terminate the run:

In [None]:
#!tfx run terminate --run_id [YOUR_RUN_ID] --endpoint {ENDPOINT}

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Alternatively, you can clean up individual resources by visiting each console:
- [Google Cloud Storage](https://console.cloud.google.com/storage)
- [Google Container Registry](https://console.cloud.google.com/gcr)
- [Google Kubernetes Engine](https://console.cloud.google.com/kubernetes)
