# Componentization

One of the goals in Kubeflow's mission is:

* Easy, repeatable, portable deployments on a diverse infrastructure (for example, experimenting on a laptop, then moving to an on-premises cluster or to the cloud)

In this tutorial I'll demonstrate how we can containerize some of our functionalities and turn them into components for use in different projects. Using containerized pieces of software has many benefits, some of which I'll highlight here:

* A containerized piece of code can run anywhere: locally, on an on-prem VM, or in the cloud on Kubernetes.
* The application code no longer needs to be provisioned on the VM, which reduces the likelihood of outages. All the code and dependencies are in the container, so the code is ready to go.
* Thinking in terms of pieces of code that each do one specific thing in a pipeline is helpful. It avoids building large monoliths that take ages to test and develop before you get to the critical section.
* Kubeflow components typically pick up artifacts from a file on a (shared) disk. This also holds locally, so there are fewer dependencies on remote services and databases for extraction, and it's easier to run automated system tests over the whole application.
* Segregating functionality through containerization can make it easier for teams to work together on the same project, because you're not all touching the same code and design.
* If something has to be productionized fast for further evaluation and it's written in an exotic language, you don't have to rewrite everything in a language you understand. You can containerize the exotic code and start running the pipeline.
* If a component is generic enough, it can be reused very easily in another pipeline (because it's in your Container Registry).
* When you use packages/libraries, there are cases where one part of the application needs version X and another part needs version Y, because you use a package Z that has different dependencies. Splitting up the functionality resolves many of these dependency conflicts.
* Having the code in the container means that when you deploy your app, you don't continuously query package servers. Some of those package servers may not be owned by you, and this could lead to problems in deployment.
* Containerization by itself doesn't require incompatible changes to the software. You can still deploy "old style" on other VMs that need the same package.

Those are the benefits of containerization. Kubeflow has reasonable documentation on componentization, available here:

https://www.kubeflow.org/docs/pipelines/sdk/component-development/

```
import kfp
from kfp import dsl
from kfp.components import func_to_container_op, InputPath, OutputPath

host = 'http://localhost:8080'
client = kfp.Client(host=host)
```

## Step 1: Build two components

We'll build two components in this example pipeline. One simulates getting the data from an external source, so the example doesn't have to rely on any service. The other preprocesses that data. Componentization involves two things:

* Containerization of the script/application that you want to run, which you do through a Dockerfile.
* A component description (yaml) that is used by the pipeline compiler at build time. You do not need that file at runtime.

### Data source component

The data source component is available under src/components/datasource. A component consists of the following:

* The binary or script code for the application, baked into a docker container.
* A Dockerfile describing how to dockerize the application.
* A component file for Kubeflow, describing the input and output parameters.

Every component is written as a command line application. When the component needs to read input data, it typically reads that from a local file on a shared disk. Kubeflow generates and manages the location of each output path. An output path can be "attached" to the input path of another component.

Obviously, you can also pass cloud storage locations (GCS or S3) or add queries to be executed on databases or BigQuery, where the docker container then becomes responsible for writing the query results to the output file.

The data source component has no input arguments for getting the data from some other source. It only has an outputPath argument that tells it where to write the data file. For the purpose of this tutorial, the data file is baked into the container, which makes it easy to write out.
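A minimal sketch of what such a command line application could look like is shown here. It is not the tutorial's actual `extract_data.py`: the baked-in file location `BAKED_DATA_FILE`, the helper names, and the sample contents used in the local demonstration are all assumptions made for illustration. Only the `--output-path` argument comes from the component described above.

```python
import argparse
import shutil
import tempfile
from pathlib import Path

# Assumption: hypothetical location of the data file baked into the image.
BAKED_DATA_FILE = Path("/components/data.csv")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Write the baked-in data file to the output path."
    )
    parser.add_argument("--output-path", required=True,
                        help="File location handed to the application by Kubeflow.")
    return parser.parse_args(argv)


def main(argv=None, source=BAKED_DATA_FILE):
    args = parse_args(argv)
    output_path = Path(args.output_path)
    # The directory of the output path may not exist yet, so create it first.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, output_path)


# Local demonstration: a temporary file stands in for the baked-in data file.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data.csv"
    sample.write_text("id,value\n1,10\n")  # made-up sample contents
    target = Path(tmp) / "outputs" / "data"
    main(["--output-path", str(target)], source=sample)
    copied = target.read_text()
```

Inside the container, the script would simply call `main()` so that the arguments come from the command line that Kubeflow assembles.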

**Component specification**

Let's start with the component specification. It tells Kubeflow which parameters it can manage and what the acceptable interface to the application inside the container is. Without it, the container is an opaque processing element and nobody can tell whether the parameters make sense. The component yaml first describes the parameters that the container accepts. The same names then appear a second time in the implementation section, which describes how the parameters from the Kubeflow pipeline are applied to the command line arguments of the application.

Here is the component specification for the datasource, with a lot of comments explaining how it works:

```
# This part is the component description, describing the interface
# for the application.
name: datasource
description: Generates some data for us to work with.
inputs:
    # There are no inputs for this application
outputs:
  - name: Output Path   # This is the name of the output
    type: OutputPath    # The type determines how it is applied in kubeflow.
                        # Simple types are simply passed, InputPath and OutputPath are more managed.
implementation:
  # The implementation part describes how the inputs/outputs described above are
  # applied to the "implementation". So you'll see the names repeated, but that makes sense.
  # It decouples the "component interface" part from the "how to run with those parameters" part.
  container:
    image: datasource:latest
    command: [
      python3,  # The interpreter that we need
      "/components/extract_data.py",  # The path to the application.
      "--output-path",  # The name of the command line argument (see argumentparser)
      {outputPath: Output Path}, # Substituted by kubeflow with the value of "Output Path" passed
                                  # into the component
    ]
```
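To make the substitution step more concrete, the following sketch mimics it in plain Python. It is only an illustration of the idea and not Kubeflow's implementation: `resolve_command` is a hypothetical helper, and the concrete file path is made up, since the real location is chosen by Kubeflow at runtime.

```python
def resolve_command(command, input_paths, output_paths):
    """Replace {inputPath: name} / {outputPath: name} placeholders with paths."""
    resolved = []
    for item in command:
        if isinstance(item, dict):
            # A placeholder is a single-entry mapping such as {"outputPath": "Output Path"}.
            kind, name = next(iter(item.items()))
            if kind == "outputPath":
                resolved.append(output_paths[name])
            elif kind == "inputPath":
                resolved.append(input_paths[name])
            else:
                raise ValueError(f"Unsupported placeholder: {kind}")
        else:
            # Plain strings are passed through unchanged.
            resolved.append(item)
    return resolved


# The command from the datasource specification, written as Python data.
command = [
    "python3",
    "/components/extract_data.py",
    "--output-path",
    {"outputPath": "Output Path"},
]

# Hypothetical path, standing in for the location Kubeflow would generate.
resolved = resolve_command(command, {}, {"Output Path": "/tmp/outputs/Output_Path/data"})
print(resolved)
```

The application inside the container only ever sees the resolved list, so it needs no knowledge of Kubeflow itself.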

### Preprocessing component

The preprocessing component has two parameters. One is the InputPath, which is connected to the OutputPath of the datasource component. The other is another OutputPath, which will contain the data file after conversion. This component also has a component yaml file, a Dockerfile, a build script and a very small amount of source code.

The component description for preprocessing:

```
# This part is the component description, describing the interface
# for the application.
name: preprocess
description: Preprocesses the data from the first stage
inputs:
  - name: Input Path
outputs:
  - name: Output Path
    type: OutputPath    # The preprocessing app generates a new output file,
                        # but its location is managed by kubeflow.
implementation:
  container:
    image: localhost:5000/preprocess:latest
    command: [
      python3,
      "/component/preprocess.py",
      "--input-path",
      {inputPath: Input Path},  # Substituted by kubeflow with "Input Path"
      "--output-path",
      {outputPath: Output Path}, # Output path generated by kubeflow
    ]
```
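The application behind this specification could be sketched as follows. The two command line arguments match the specification above, but the preprocessing logic is an assumption: the tutorial does not say what the conversion does, so this sketch simply strips whitespace and drops empty lines. The function names and sample contents are hypothetical as well.

```python
import argparse
import tempfile
from pathlib import Path


def preprocess_lines(lines):
    # Assumption: "preprocessing" here means trimming whitespace and dropping empty lines.
    return [line.strip() for line in lines if line.strip()]


def main(argv=None):
    parser = argparse.ArgumentParser(description="Preprocess the data from the first stage.")
    parser.add_argument("--input-path", required=True,
                        help="File written by the upstream component.")
    parser.add_argument("--output-path", required=True,
                        help="File location handed to the application by Kubeflow.")
    args = parser.parse_args(argv)

    raw_lines = Path(args.input_path).read_text().splitlines()
    cleaned = preprocess_lines(raw_lines)

    output_path = Path(args.output_path)
    # The directory of the output path may not exist yet, so create it first.
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text("\n".join(cleaned) + "\n")


# Local demonstration with made-up sample data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "input"
    source.write_text("  a,1 \n\n b,2\n")
    target = Path(tmp) / "out" / "data"
    main(["--input-path", str(source), "--output-path", str(target)])
    result = target.read_text()
```

Because the component only talks to files, it can be tested locally like this without a running cluster.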

## Developing the pipeline

With the components designed and figured out, we can design the actual pipeline:

```
from kfp import dsl
from kfp.components import func_to_container_op, InputPath, OutputPath

# kfp.components has alternative methods for loading from a url location as well
datasource = kfp.components.load_component_from_file('components/datasource/component.yaml')
preprocess = kfp.components.load_component_from_file('components/preprocessor/component.yaml')

@dsl.pipeline(
    name='Componentization',
    description='Using components as part of the pipeline'
)
def components_pipeline():
    datasource_task = datasource()
    preprocess_task = preprocess(input_path=datasource_task.output)
```

```
kfp.compiler.Compiler().compile(components_pipeline, 'components_pipeline.yaml')
```

```
client.create_run_from_pipeline_func(components_pipeline, arguments={})
```

RunPipelineResult(run_id=ba27e2be-2312-46f3-ac54-97d532014a53)