# Using PVCs in Kubeflow pipeline components

These example pipelines show how to use PVCs and MinIO within Kubeflow Pipelines.


## Setup & common includes


In [2]:
import kfp
from kfp import dsl
from kubernetes.client.models import (
    V1Volume,
    V1VolumeMount,
    V1PersistentVolumeClaimVolumeSource,
)
import ruamel.yaml
import os
from kfp.dsl import data_passing_methods
import subprocess


In [8]:
BASE_IMAGE = "quay.io/ibm/kubeflow-notebook-image-ppc64le:elyra3.14.1-py3.9-tf2.9.2-pt1.12.1-v0"
COMPONENT_CATALOG_FOLDER = (
    f"{os.getenv('HOME')}/components"
)


In [9]:
def delete_pipeline(pipeline_name: str):
    """Deletes every pipeline with the specified name."""

    client = kfp.Client()
    existing_pipelines = client.list_pipelines().pipelines
    matches = (
        [ep.id for ep in existing_pipelines if ep.name == pipeline_name]
        if existing_pipelines
        else []
    )
    # Avoid shadowing the built-in id() function
    for pipeline_id in matches:
        client.delete_pipeline(pipeline_id)
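
The name-matching logic in `delete_pipeline` can be tried out without a cluster. The sketch below is a minimal stand-alone illustration: `matching_pipeline_ids` is a hypothetical helper (not part of the KFP SDK), and the `SimpleNamespace` objects stand in for the entries returned by `client.list_pipelines().pipelines`, assuming only that each entry has the `id` and `name` attributes used above.

```python
from types import SimpleNamespace


def matching_pipeline_ids(existing_pipelines, pipeline_name):
    """Return the ids of all pipelines whose name equals pipeline_name.

    Hypothetical helper for illustration. None (no pipelines uploaded yet)
    is treated the same as an empty list.
    """
    return [ep.id for ep in (existing_pipelines or []) if ep.name == pipeline_name]


# Stand-in objects; a real listing would come from kfp.Client().
fake_pipelines = [
    SimpleNamespace(id="a1", name="upload-model-pipeline"),
    SimpleNamespace(id="b2", name="another-pipeline"),
]

print(matching_pipeline_ids(fake_pipelines, "upload-model-pipeline"))  # ['a1']
print(matching_pipeline_ids(None, "upload-model-pipeline"))  # []
```

Handling `None` explicitly matters because the listing may contain no pipelines at all, which is the case the conditional expression in `delete_pipeline` guards against.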


In [10]:
client = kfp.Client()


## Pipeline #1: Store a model with MinIO

This example shows how to upload a model to MinIO. Uploading to MinIO is the most common way to store a model, and our examples are set up that way.

The example pipeline simplifies the pre-processing and training steps by downloading an ONNX model from the web. The model is then uploaded to MinIO using a reusable component that was built by IBM.


### Component to download a model

This component produces a model as its output parameter. In a real pipeline, the model would be trained by the pipeline itself; here it is simply downloaded to keep things simple. The interesting part of the example is the upload component, not how the model is created.


In [11]:
def download_model(model_name: str, model: kfp.components.OutputPath(str)):
    import urllib.request
    from pathlib import Path
    import os

    MODEL_URL = (
        "https://github.com/onnx/models/raw/main/vision/classification/resnet/model"
    )

    Path(os.path.dirname(model)).mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(f"{MODEL_URL}/{model_name}.onnx", model)


download_model_comp = kfp.components.create_component_from_func(
    func=download_model, base_image=BASE_IMAGE
)
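
The `mkdir` call in `download_model` is there because the output path handed to the component is a file path whose parent directory may not exist yet. The following sketch reproduces that step locally, under the assumption that the path is nested a few levels deep; `write_output` and the `outputs/model/data` path are made up for illustration and are not what Kubeflow necessarily supplies.

```python
import os
import tempfile
from pathlib import Path


def write_output(output_path: str, payload: bytes) -> None:
    """Create the parent directory if needed, then write the payload."""
    Path(os.path.dirname(output_path)).mkdir(parents=True, exist_ok=True)
    Path(output_path).write_bytes(payload)


with tempfile.TemporaryDirectory() as tmp:
    # Made-up nested path standing in for the one supplied at run time.
    model_path = os.path.join(tmp, "outputs", "model", "data")
    write_output(model_path, b"fake-model-bytes")
    written = Path(model_path).read_bytes()

print(written)
```

Without the `mkdir` call, writing to a path whose parent directory is missing would raise `FileNotFoundError`.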


### Component to upload the model to MinIO

The example component from IBM contains the logic both to upload a model to MinIO (using the S3 protocol) and to store the model so that it can be served by an InferenceService.
This example pipeline does not actually serve the model, but a complete pipeline would commonly deploy the inference service as well.


In [12]:
UPLOAD_MODEL_COMPONENT = (
    f"{COMPONENT_CATALOG_FOLDER}/model-building/upload-model/component.yaml"
)

upload_model_comp = kfp.components.load_component_from_file(UPLOAD_MODEL_COMPONENT)


### Define the pipeline

There are two components in the pipeline:

- download the model
- upload the model to MinIO.

MinIO is also used by Kubeflow to pass data between components. For example, the output `model` parameter of `download_model_comp` (an `OutputPath`) is stored in MinIO.

The special value `{{workflow.namespace}}` can be used to instruct the pipeline to use the namespace as the export bucket.
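
The placeholder is passed through the compiled pipeline as a literal string and is only replaced when the workflow runs. The toy function below imitates that substitution to show the effect. It is not how the workflow engine is implemented, and `render_placeholders` and the namespace `user-example-com` are made up for illustration.

```python
import re


def render_placeholders(template: str, values: dict) -> str:
    """Replace each {{key}} with values[key]; leave unknown keys untouched."""

    def substitute(match):
        key = match.group(1)
        return str(values.get(key, match.group(0)))

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", substitute, template)


export_bucket = render_placeholders(
    "{{workflow.namespace}}", {"workflow.namespace": "user-example-com"}
)
print(export_bucket)
```

Because the value is resolved at run time, the same compiled pipeline can export to a different bucket in each namespace it runs in.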


In [13]:
@dsl.pipeline(name="upload-model-pipeline")
def upload_model_pipeline(model_name="resnet101-v2-7"):
    download_model_task = download_model_comp(model_name=model_name)

    upload_model_task = upload_model_comp(
        download_model_task.outputs["model"],
        minio_url="minio-service.kubeflow:9000",
        export_bucket="{{workflow.namespace}}",
        model_format="onnx",
        model_name=model_name,
        model_version=1,
    )


### Upload the pipeline

Compile the pipeline with no configuration options, and upload it to Kubeflow.

If you want to run the pipeline, you can do that using the link that appears in the output of the cell.


In [14]:
PIPELINE_NAME = "upload-model-pipeline"

kfp.compiler.Compiler().compile(
    pipeline_func=upload_model_pipeline, package_path=f"{PIPELINE_NAME}.yaml"
)


# Pipeline names need to be unique, so before we upload,
# check for and delete any pipeline with the same name
delete_pipeline(PIPELINE_NAME)

# upload
uploaded_pipeline = client.upload_pipeline(f"{PIPELINE_NAME}.yaml", PIPELINE_NAME)


## Pipeline #2: Use a PVC to pass parameters

Using the default MinIO storage to pass parameters between components is convenient and often recommended. However, it is less efficient than using a PVC to pass data between components.

This pipeline modifies the previous pipeline and configuration so that Kubeflow uses a PVC when exchanging data between components. The model is still uploaded to MinIO by the pipeline.

The term "blackboard" is used in the examples to describe the PVC that is used for information sharing between pipeline tasks. The term comes from the blackboard design pattern and is not Kubeflow specific terminology.

In this pattern, the pipeline creates the PVC used for passing information, and the PVC is destroyed when the pipeline is destroyed.


### Pipeline definition

The pipeline creates the PVC used for information exchange within the pipeline.

- A `VolumeOp` is a pipeline component that creates the PVC that is used for passing data.
- The `VolumeOp` has `set_owner_reference=True` so that the PVC is destroyed when the pipeline is destroyed.
- The initial tasks in the pipeline are created with an `after` method call to ensure the PVC is created before the tasks, so that the PVC can be (implicitly) used to store the input and output parameters of the task.

Creating the PVC does not instruct Kubeflow to use it for passing data. A pipeline configuration will be created in the following cells that contains those instructions.


In [15]:
@dsl.pipeline(name="upload-model-with-blackboard")
def upload_model_with_blackboard(
    blackboard: str = "mlpipeline-artefacts", model_name: str = "resnet101-v2-7"
):

    create_blackboard = dsl.VolumeOp(
        name="Create Artefacts Blackboard",
        resource_name=blackboard,
        modes=dsl.VOLUME_MODE_RWO,
        size="4Gi",
        set_owner_reference=True,
    )

    download_model_task = download_model_comp(model_name=model_name)
    download_model_task.after(create_blackboard)

    upload_model_task = upload_model_comp(
        download_model_task.outputs["model"],
        minio_url="minio-service.kubeflow:9000",
        export_bucket="{{workflow.namespace}}",
        model_format="onnx",
        model_name=model_name,
        model_version=1,
    )


### Pipeline configuration

The pipeline needs additional configuration for this pattern to work. The configuration specifies the PVC to use, but does not cause the PVC to be created (the pipeline definition creates the PVC).

#### Data Passing Method

The `data_passing_method` needs to be set to use the PVC for passing data between pipeline tasks.

- `{{workflow.name}}` is a special value that will be replaced with the unique workflow id when the pipeline is executed. This avoids name conflicts on the volume name.
- The name of the volume must match the resource name in the `VolumeOp` that is used for the blackboard in the pipeline. This is how the pipeline refers to the volume, rather than the physical name.
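
To make the naming relationship concrete, the sketch below fills in the claim-name template for two runs, assuming (as the configuration in the next cell does) that the claim is named `{{workflow.name}}-<resource_name>`. `claim_name_for` and the two workflow names are invented for illustration.

```python
RESOURCE_NAME = "mlpipeline-artefacts"  # resource_name passed to the VolumeOp

# Template used as the claim name in the pipeline configuration.
CLAIM_TEMPLATE = "{{workflow.name}}-" + RESOURCE_NAME


def claim_name_for(workflow_name: str) -> str:
    """Fill in the workflow name for one concrete run (hypothetical helper)."""
    return CLAIM_TEMPLATE.replace("{{workflow.name}}", workflow_name)


# Invented workflow names for two separate runs of the same pipeline.
first_run = claim_name_for("upload-model-with-blackboard-x7k2p")
second_run = claim_name_for("upload-model-with-blackboard-q9m4z")

print(first_run)
print(second_run)
```

Each run resolves to a different physical claim, while the logical volume name `mlpipeline-artefacts` stays the same across runs.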

#### Disable Caching

The PVC used for passing data between components is different on each pipeline run. As a result, values cannot be cached between pipeline runs. Caching is disabled by defining the `disable_cache_transformer` function and adding the transformer to the pipeline configuration.
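
The value `P0D` used in the next cell is an ISO 8601 duration meaning zero days, so no cached result is ever considered fresh enough to reuse. The toy model below illustrates that reading. `can_reuse` is a hypothetical function, not the cache implementation used by Kubeflow.

```python
from datetime import timedelta


def can_reuse(cache_age: timedelta, max_staleness: timedelta) -> bool:
    """Toy rule: a cached result is reusable only while younger than the limit."""
    return cache_age < max_staleness


ZERO_DAYS = timedelta(days=0)  # what the ISO 8601 duration "P0D" denotes

print(can_reuse(timedelta(minutes=5), timedelta(days=1)))  # True
print(can_reuse(timedelta(minutes=5), ZERO_DAYS))  # False
print(can_reuse(timedelta(0), ZERO_DAYS))  # False
```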


In [16]:
pipeline_conf = kfp.dsl.PipelineConf()

# Use PVC for passing data between pipeline tasks
pipeline_conf.data_passing_method = kfp.dsl.data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name="mlpipeline-artefacts",
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            "{{workflow.name}}-mlpipeline-artefacts"
        ),
    ),
    path_prefix="mlpipeline-artefacts/",
)


# Disable Caching
def disable_cache_transformer(op: dsl.ContainerOp):
    if isinstance(op, dsl.ContainerOp):
        op.execution_options.caching_strategy.max_cache_staleness = "P0D"
    else:
        op.add_pod_annotation(
            name="pipelines.kubeflow.org/max_cache_staleness", value="P0D"
        )
    return op


pipeline_conf.add_op_transformer(disable_cache_transformer)


### Compile and upload the pipeline

The difference between this step and prior pipelines is that the pipeline configuration is added to the `compile` method call.

If you want to run the pipeline, you can do that using the link that appears in the output of the cell.


In [17]:
PIPELINE_NAME = "upload-model-with-blackboard"

kfp.compiler.Compiler().compile(
    pipeline_func=upload_model_with_blackboard,
    package_path=f"{PIPELINE_NAME}.yaml",
    pipeline_conf=pipeline_conf,
)


# Pipeline names need to be unique, so before we upload,
# check for and delete any pipeline with the same name
delete_pipeline(PIPELINE_NAME)

# upload
uploaded_pipeline = client.upload_pipeline(f"{PIPELINE_NAME}.yaml", PIPELINE_NAME)


## Pipeline #3: Define pipeline tasks that mount a PVC as a file system

The prior approaches are recommended for passing data between pipeline components. Mounting a PVC to a pipeline task makes the pipeline less portable. It also makes it more difficult to reuse existing pipeline components.

Although not preferred, there are cases where mounting a PVC can be useful, since it allows data to be stored and reused across multiple pipelines.

This pipeline is a simple example that shows how to download an image to a PVC. The PVC is assumed to have been created outside of the pipeline.


### Create the PVC

There are several ways to create a PVC. The simplest is to select the Volumes tab on the Kubeflow Central Dashboard and use the "New Volume" button to create a new volume.

The name of the volume will be the name of the PVC, and we'll use this to mount the volume when we create the pipeline definition.

This figure shows how to create a volume using the Kubeflow dashboard.

<img src="./volumes.jpg" />

Since the volume might be used by multiple tasks at the same time, it is best to choose an access mode of "ReadWriteMany".

<img src="./new-vol-prompt.jpg" />
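
For readers who prefer a declarative definition over the dashboard, a claim with the same settings can be described as a manifest. The sketch below builds one as a plain Python dictionary and prints it as JSON, which `kubectl apply -f` accepts as well as YAML. The name `my-data-vol` is the one used in this notebook; the `1Gi` size is an assumption, and the storage class is left to the cluster default.

```python
import json

PVC_NAME = "my-data-vol"

pvc_manifest = {
    "apiVersion": "v1",
    "kind": "PersistentVolumeClaim",
    "metadata": {"name": PVC_NAME},
    "spec": {
        # ReadWriteMany lets several tasks mount the volume at the same time.
        "accessModes": ["ReadWriteMany"],
        # Assumed size; adjust to the amount of data you expect to store.
        "resources": {"requests": {"storage": "1Gi"}},
    },
}

print(json.dumps(pvc_manifest, indent=2))
```

Note that `ReadWriteMany` is only available when the underlying storage class supports it.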


In [18]:
PVC_NAME = "my-data-vol"

### Component to download an Image

Because this component stores data on a PVC rather than returning an `OutputPath`, `output_path` is a plain string input. As a result, the `output_path` parameter cannot be passed as an `InputPath` to another component.

That is why the pattern from the previous example pipelines is recommended for passing data between components.


In [19]:
def download_image(image_url: str, output_path: str):
    import urllib.request
    from pathlib import Path

    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(image_url, output_path)


download_image_comp = kfp.components.create_component_from_func(
    func=download_image, base_image=BASE_IMAGE
)


### Pipeline definition

The `add_volume` method is called on the task so that the volume is mounted as a file system when the pipeline executes the task. The claim source must match the PVC that you created.


In [20]:
@dsl.pipeline(
    name="download_image_to_pvc",
)
def pipeline_to_download_image_to_pvc():
    image_task = download_image_comp(
        "https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Youngkitten.JPG/320px-Youngkitten.JPG",
        "/mnt/image/youngkitten.JPG",
    )
    # add_volume defines the volume for the task, but does not define any
    # volume mounts
    image_task.add_volume(
        V1Volume(
            name="myvolume",
            # The claim name is the name of the PVC created earlier
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(PVC_NAME),
        )
    )
    # add_volume_mount defines the path on the file system where the volume
    # is mounted. The name must match the name in the V1Volume.
    image_task.add_volume_mount(V1VolumeMount(name="myvolume", mount_path="/mnt/image"))
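
Two things have to line up for the task above to write to the PVC: the mount must refer to a defined volume by name, and the output file must live under the mount path. The sketch below checks both on plain dictionaries and strings, without a cluster. `unmatched_mounts` and `is_under_mount` are hypothetical helpers introduced only for illustration.

```python
from pathlib import PurePosixPath


def unmatched_mounts(volumes, volume_mounts):
    """Return the names of mounts that do not refer to any defined volume."""
    defined = {volume["name"] for volume in volumes}
    return [m["name"] for m in volume_mounts if m["name"] not in defined]


def is_under_mount(file_path: str, mount_path: str) -> bool:
    """Return True if file_path is located somewhere below mount_path."""
    return PurePosixPath(mount_path) in PurePosixPath(file_path).parents


volumes = [{"name": "myvolume", "claimName": "my-data-vol"}]
good_mounts = [{"name": "myvolume", "mountPath": "/mnt/image"}]
bad_mounts = [{"name": "typo-volume", "mountPath": "/mnt/image"}]

print(unmatched_mounts(volumes, good_mounts))  # []
print(unmatched_mounts(volumes, bad_mounts))  # ['typo-volume']
print(is_under_mount("/mnt/image/youngkitten.JPG", "/mnt/image"))  # True
print(is_under_mount("/tmp/youngkitten.JPG", "/mnt/image"))  # False
```

A file written outside the mount path ends up on the container's own file system and is lost when the task finishes.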


### Compile and upload pipeline

The pipeline is compiled with the default options. If there were multiple components, MinIO would be used to share information between them.

If you want to run the pipeline, you can do that using the link that appears in the output of the cell.


In [21]:
PIPELINE_NAME = "download_image_to_pvc"

kfp.compiler.Compiler().compile(
    pipeline_func=pipeline_to_download_image_to_pvc,
    package_path=f"{PIPELINE_NAME}.yaml",
)

# Pipeline names need to be unique, so before we upload,
# check for and delete any pipeline with the same name
delete_pipeline(PIPELINE_NAME)

# upload
uploaded_pipeline = client.upload_pipeline(f"{PIPELINE_NAME}.yaml", PIPELINE_NAME)
