# Reusable components

This tutorial describes the manual way of writing a full component program (in any language) and a component definition for it. To quickly build a component from a Python function, see Build component from Python function and Data Passing in Python components.

Below is a summary of the steps involved in creating and using a component:

- Write the program that contains your component’s logic. The program must use files and command-line arguments to pass data to and from the component.
- Containerize the program.
- Write a component specification in YAML format that describes the component for the Kubeflow Pipelines system.
- Use the Kubeflow Pipelines SDK to load your component, use it in a pipeline and run that pipeline.

**Note:** Make sure that Docker is installed in your local environment.

In [1]:
import kfp
import kfp.gcp as gcp
import kfp.dsl as dsl
import kfp.compiler as compiler
import kfp.components as comp
import datetime

import kubernetes as k8s

In [2]:
PROJECT_ID='kubeflow-pipeline-fantasy'

## Writing the program code

The following cell creates a file `app.py` that contains a Python script. The script takes a model file name and a GCS bucket name as input arguments, trains a Keras model on the MNIST dataset, copies the saved model to the bucket, and writes the model's GCS path to an output file.

In [27]:
%%bash

# Create folders if they don't exist.
mkdir -p tmp/reuse_components/minist_training

# Create the Python file that trains the MNIST model and copies it to GCS.
cat > ./tmp/reuse_components/minist_training/app.py <<HERE
import argparse
from datetime import datetime
import tensorflow as tf

parser = argparse.ArgumentParser()
parser.add_argument(
    '--model_file', type=str, required=True, help='Name of the model file.')
parser.add_argument(
    '--bucket', type=str, required=True, help='GCS bucket name.')
args = parser.parse_args()

bucket=args.bucket
model_file=args.model_file

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(512, activation=tf.nn.relu),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# model.summary() prints the summary itself and returns None.
model.summary()

mnist = tf.keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

callbacks = [
  tf.keras.callbacks.TensorBoard(log_dir=bucket + '/logs/' + str(datetime.now().date())),
  # Interrupt training if val_loss stops improving for over 2 epochs
  tf.keras.callbacks.EarlyStopping(patience=2, monitor='val_loss'),
]

model.fit(x_train, y_train, batch_size=32, epochs=5, callbacks=callbacks,
          validation_data=(x_test, y_test))


model.save(model_file)

from tensorflow import gfile

gcs_path = bucket + "/" + model_file

if gfile.Exists(gcs_path):
    gfile.Remove(gcs_path)

gfile.Copy(model_file, gcs_path)
with open('/output.txt', 'w') as f:
  f.write(gcs_path)
HERE

## Create a Docker container
Create your own container image that includes your program. 
- If your component creates outputs to be fed as inputs to downstream components, each output must be written as a string to its own local text file inside the container.
- For example, if a trainer component needs to output the trained model path, it can write the path to a local file `/output.txt`.
- Keep the string written to an output file small. If the output is much larger than 100 kB, save it to external persistent storage and pass the storage path to the next component instead.
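
As a minimal sketch of the output convention above, the following Python snippet writes one output string to its own file and rejects values larger than roughly 100 kB. The helper name `write_output`, the constant `MAX_INLINE_OUTPUT_BYTES`, the temporary directory, and the `gs://my-bucket` path are illustrative assumptions, not part of the Kubeflow Pipelines SDK.

```python
import os
import tempfile

# Assumed soft limit, taken from the "about 100 kB" guidance above.
MAX_INLINE_OUTPUT_BYTES = 100 * 1024


def write_output(path: str, value: str) -> None:
    """Write a single component output as a string to its own text file."""
    data = value.encode("utf-8")
    if len(data) > MAX_INLINE_OUTPUT_BYTES:
        raise ValueError(
            "Output is too large to pass inline; store it externally "
            "and write the storage path instead."
        )
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(value)


# Usage: one file per output. The directory and bucket are hypothetical.
out_dir = tempfile.mkdtemp()
model_path_file = os.path.join(out_dir, "output.txt")
write_output(model_path_file, "gs://my-bucket/mnist_model.h5")
```

In a real component the file would live at a fixed path such as `/output.txt`, which the component specification then maps to a named output.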

Now create a container that runs the script. Start by creating a `Dockerfile`, which contains the instructions to assemble a Docker image. The `FROM` statement specifies the base image from which you are building. `WORKDIR` sets the working directory. When you assemble the Docker image, `COPY` copies the required files and directories (for example, `app.py`) to the filesystem of the container. `RUN` executes a command (for example, installing dependencies) and commits the result.

In [28]:
%%bash

# Create Dockerfile.
cat > ./tmp/reuse_components/minist_training/Dockerfile <<EOF
FROM tensorflow/tensorflow:1.15.0-py3
WORKDIR /app
COPY . /app
EOF

Now that the Dockerfile exists, you can build the Docker image and push it to a registry that hosts it. The following cell creates a shell script that builds the container image and stores it in the Google Container Registry.

In [29]:
%%bash -s "{PROJECT_ID}"

IMAGE_NAME="minist_training_kf_pipeline"
TAG="latest" # "v_$(date +%Y%m%d_%H%M%S)"

# Create script to build docker image and push it.
cat > ./tmp/reuse_components/minist_training/build_image.sh <<HERE
PROJECT_ID="${1}"
IMAGE_NAME="${IMAGE_NAME}"
TAG="${TAG}"
GCR_IMAGE="gcr.io/\${PROJECT_ID}/\${IMAGE_NAME}:\${TAG}"
docker build -t \${IMAGE_NAME} .
docker tag \${IMAGE_NAME} \${GCR_IMAGE}
docker push \${GCR_IMAGE}
docker inspect --format="{{index .RepoDigests 0}}" "\${IMAGE_NAME}"
docker image rm \${IMAGE_NAME}
docker image rm \${GCR_IMAGE}
HERE
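
The image reference that the script assembles follows the pattern `gcr.io/<project>/<image>:<tag>`. As a rough Python sketch of the same naming logic, including the timestamped tag that appears commented out in the cell above, consider the following. The function names `gcr_image_ref` and `timestamp_tag` are hypothetical helpers introduced only for illustration.

```python
from datetime import datetime


def gcr_image_ref(project_id: str, image_name: str, tag: str) -> str:
    """Compose a Google Container Registry reference from its parts."""
    return f"gcr.io/{project_id}/{image_name}:{tag}"


def timestamp_tag(now: datetime) -> str:
    """Mirror the shell expression v_$(date +%Y%m%d_%H%M%S)."""
    return now.strftime("v_%Y%m%d_%H%M%S")


ref = gcr_image_ref(
    "kubeflow-pipeline-fantasy", "minist_training_kf_pipeline", "latest"
)
print(ref)
print(timestamp_tag(datetime(2020, 1, 2, 3, 4, 5)))
```

A timestamped tag gives every build a distinct name, whereas `latest` is overwritten on each push.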

Run the script.

In [30]:
%%bash

# Build and push the image.
cd tmp/reuse_components/minist_training
bash build_image.sh

Sending build context to Docker daemon   5.12kB
Step 1/3 : FROM tensorflow/tensorflow:1.15.0-py3
 ---> f24a5ca8605f
Step 2/3 : WORKDIR /app
 ---> Running in 0b26d4bfd52f
Removing intermediate container 0b26d4bfd52f
 ---> 056736b48ba8
Step 3/3 : COPY . /app
 ---> e28c0f74ace2
Successfully built e28c0f74ace2
Successfully tagged minist_training_kf_pipeline:latest
The push refers to repository [gcr.io/kubeflow-pipeline-fantasy/minist_training_kf_pipeline]
fd9add579af0: Preparing
0c9a46f378a1: Preparing
84c3bc63b701: Preparing
56ec85ad394c: Preparing
aefe991487a2: Preparing
4a58ecdd995f: Preparing
fa9f3f4bd775: Preparing
2bf9e296738e: Preparing
92486bede3ce: Preparing
19331eff40f0: Preparing
100ef12ce3a4: Preparing
97e6b67a30f1: Preparing
a090697502b8: Preparing
4a58ecdd995f: Waiting
fa9f3f4bd775: Waiting
2bf9e296738e: Waiting
92486bede3ce: Waiting
19331eff40f0: Waiting
100ef12ce3a4: Waiting
97e6b67a30f1: Waiting
a090697502b8: Waiting
84c3bc63b701: Layer already exists
56ec85ad394c: Layer

## Writing your component definition file
To create a component from your containerized program, you need to write a component specification in YAML format that describes the component for the Kubeflow Pipelines system.

For the complete definition of a Kubeflow Pipelines component, see the [component specification](https://www.kubeflow.org/docs/pipelines/reference/component-spec/). However, for this tutorial you don’t need to know the full schema of the component specification. The tutorial provides enough information for the component used here.

Start writing the component definition (`minist_component.yaml` in this tutorial) by specifying your container image in the component’s implementation section:

In [43]:
%%bash

# Create Yaml

cat > minist_component.yaml <<HERE
name: Minist training
description: Train a minist model and save to GCS
inputs:
  - name: model_file
    description: 'Name of the model file.'
    type: String
  - name: bucket
    description: 'GCS bucket name.'
    type: String
outputs:
  - name: model_path
    description: 'Trained model path.'
    type: String
implementation:
  container:
    image: gcr.io/kubeflow-pipeline-fantasy/minist_training_kf_pipeline@sha256:205c12a4cde07b4daa25d8867785b7f392cb5b08a66896df9aefd90ea93440e5
    command: [
      python, /app/app.py,
      --model_file, {inputValue: model_file},
      --bucket,     {inputValue: bucket},
    ]
    fileOutputs:
      model_path: /output.txt
HERE
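
To illustrate how the `{inputValue: ...}` placeholders in `command` relate to the declared inputs, here is a simplified Python sketch that substitutes argument values into the command list. It is an illustration of the idea only, not the Kubeflow Pipelines implementation. The function `resolve_command`, the trimmed `component` dictionary, and the `gs://my-bucket` value are assumptions made for this example.

```python
# A trimmed, in-memory version of the component definition above.
component = {
    "inputs": [{"name": "model_file"}, {"name": "bucket"}],
    "implementation": {
        "container": {
            "command": [
                "python", "/app/app.py",
                "--model_file", {"inputValue": "model_file"},
                "--bucket", {"inputValue": "bucket"},
            ],
            "fileOutputs": {"model_path": "/output.txt"},
        }
    },
}


def resolve_command(spec: dict, arguments: dict) -> list:
    """Replace each inputValue placeholder with the matching argument."""
    declared = {item["name"] for item in spec["inputs"]}
    resolved = []
    for part in spec["implementation"]["container"]["command"]:
        if isinstance(part, dict) and "inputValue" in part:
            name = part["inputValue"]
            if name not in declared:
                raise KeyError(f"Placeholder refers to undeclared input: {name}")
            resolved.append(str(arguments[name]))
        else:
            resolved.append(part)
    return resolved


cmd = resolve_command(
    component, {"model_file": "mnist_model.h5", "bucket": "gs://my-bucket"}
)
print(cmd)
```

The resolved list is the command line the container would run, which is why the input names in the YAML must match the arguments that `app.py` parses.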

### Create your workflow as a Python function

Define your pipeline as a Python function. `@kfp.dsl.pipeline` is a required decorator that takes `name` and `description` properties. Then compile the pipeline function. Compilation produces a pipeline file.

In [44]:
import os
minist_train_op = kfp.components.load_component_from_file(os.path.join('./', 'minist_component.yaml')) 

In [45]:
minist_train_op.component_spec

ComponentSpec(name='Minist training', description='Train a minist model and save to GCS', metadata=None, inputs=[InputSpec(name='model_file', type='String', description='Name of the model file.', default=None, optional=False), InputSpec(name='bucket', type='String', description='GCS bucket name.', default=None, optional=False)], outputs=[OutputSpec(name='model_path', type='String', description='Trained model path.')], implementation=ContainerImplementation(container=ContainerSpec(image='gcr.io/kubeflow-pipeline-fantasy/minist_training_kf_pipeline@sha256:205c12a4cde07b4daa25d8867785b7f392cb5b08a66896df9aefd90ea93440e5', command=['python', '/app/app.py', '--model_file', InputValuePlaceholder(input_name='model_file'), '--bucket', InputValuePlaceholder(input_name='bucket')], args=None, env=None, file_outputs={'model_path': '/output.txt'})), version='google.com/cloud/pipelines/component/v1')

In [46]:
# Define the pipeline
@dsl.pipeline(
   name='Minist pipeline',
   description='A toy pipeline that performs minist model training.'
)
def minist_reuse_component_pipeline(
    model_file: str = 'mnist_model.h5', 
    bucket: str = "gs://kubeflow-pipeline-fantasy-kubeflow1-bucket"
):
    minist_train_op(model_file=model_file, bucket=bucket).apply(gcp.use_gcp_secret('user-gcp-sa'))
    return True

In [47]:
# Get or create an experiment and submit a pipeline run.
in_cluster = True
try:
    k8s.config.load_incluster_config()
except k8s.config.ConfigException:
    # Not running inside a Kubernetes cluster.
    in_cluster = False

if in_cluster:
    client = kfp.Client()
else:
    host = "https://kubeflow1.endpoints.kubeflow-pipeline-fantasy.cloud.goog/pipeline"
    client_id = "493831447550-os23o55235htd9v45a9lsejv8d1plhd0.apps.googleusercontent.com"
    other_client_id = "493831447550-iu24vv6id3ng5smhf2lboovv5qukuhbh.apps.googleusercontent.com"
    other_client_secret = "cB8Xj-rb9JWCYcCRDlpTMfhc"
    client = kfp.Client(host=host, 
                        client_id=client_id,
                        other_client_id=other_client_id, 
                        other_client_secret=other_client_secret)

In [48]:
pipeline_func = minist_reuse_component_pipeline
pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'

compiler.Compiler().compile(pipeline_func, pipeline_filename)
#Submit a pipeline run
arguments = {"model_file":"mnist_model.h5",
             "bucket":"gs://kubeflow-pipeline-fantasy-kubeflow1-bucket"}
run_name = pipeline_func.__name__ + ' run'
experiment = client.create_experiment('python-functions-minist')

run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)

Follow the [instructions](https://www.kubeflow.org/docs/other-guides/accessing-uis/) on kubeflow.org to access Kubeflow UIs. Upload the created pipeline and run it.

**Warning:** When the pipeline runs, Kubernetes pulls the image from the repository to the cluster to create a container, and it caches pulled images. If you push a new image under a tag that is already cached, the cluster may keep running the old image. One solution is to use the image digest instead of the tag in your component definition, for example, `s/v1/sha256:9509182e27dcba6d6903fccf444dc6188709cc094a018d5dd4211573597485c9/g`. Alternatively, if you don't want to update the digest every time, you can use the `:latest` tag, which forces Kubernetes to always pull the latest image.
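
As a small sketch of the digest approach, the following Python helper rewrites a tagged image reference into the `name@sha256:...` form used in the component definition earlier. The function name `pin_to_digest` is hypothetical, and the parsing is deliberately simple (it assumes a well-formed reference).

```python
def pin_to_digest(image_ref: str, digest: str) -> str:
    """Return image_ref with its tag (or old digest) replaced by digest."""
    if not digest.startswith("sha256:"):
        raise ValueError("Expected a digest of the form sha256:<hex>")
    # Drop any existing digest.
    name = image_ref.split("@", 1)[0]
    # Drop a tag, but leave a registry port such as localhost:5000 alone:
    # a tag colon always comes after the last slash.
    last_slash = name.rfind("/")
    colon = name.rfind(":")
    if colon > last_slash:
        name = name[:colon]
    return f"{name}@{digest}"


pinned = pin_to_digest(
    "gcr.io/kubeflow-pipeline-fantasy/minist_training_kf_pipeline:latest",
    "sha256:205c12a4cde07b4daa25d8867785b7f392cb5b08a66896df9aefd90ea93440e5",
)
print(pinned)
```

A digest identifies one exact image, so a cached copy can never be stale. The trade-off is that the component definition must be updated after every rebuild.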

___