In [1]:
%load_ext lab_black

# Honey Bee Computer Vision Example

This example shows how to use Python scripts from open-source repositories to train a YOLOv5 model and evaluate the results.

The example trains a model to detect honey bees using [YOLOv5](https://github.com/ultralytics/yolov5). Its purpose is to explain how to train a model using Kubeflow, the YOLOv5 repo, and sample data.

The detected classes are:
* Bees (workers or foragers)
* Bees carrying pollen
* Drones
* Queens


Object detection datasets are usually large, and models take a long time to train on them. They often contain many pre-augmented images. For this exercise, we've included a small sample of the training data. The limited quantity and quality of this sample will not produce a good model, but it trains quickly and lets you experiment with the pipeline.

For fun, we also included the weights from a model trained on a much larger version of this dataset for 500+ epochs (which takes many hours).

The dataset was sampled from here: https://universe.roboflow.com/matt-nudi/honey-bee-detection-model-zgjnb (creative commons license, but requires an account/email to download).

## Author

Nick Lawrence ntl@us.ibm.com

## License
Apache-2.0 License


The assumption is that this notebook has a volume mounted (you can set this up when you create the notebook). The PVC should have been created with an access mode of ReadWriteMany. This allows other pods to mount the volume.

You can use the volume for your notebook's workspace (mounted at /home/jovyan/), or you can use a data volume.

The data set will be extracted to the volume in the next few cells. The data will be loaded into the pipeline via a volume rather than a download. This simulates a use case where the training data has been preloaded on a volume in the Kubernetes cluster and is large enough that a download is expensive.

The volume name is defined in this next cell, as is the path to the extracted Roboflow data set for the bees. 

In [2]:
VOLUME_CLAIM_NAME = "dist-demo-volume"
MOUNT_POINT = "/home/jovyan/"
BEE_DATA_SET_SUBPATH = "bee_dataset"
BEE_DATA_SET_PATH = f"{MOUNT_POINT}/{BEE_DATA_SET_SUBPATH}"

In [3]:
!mkdir -p {BEE_DATA_SET_PATH}
!tar -xf /home/jovyan/kubeflow-ppc64le-examples/object-detection-yolov5/data/dataset.tar.gz -C {BEE_DATA_SET_PATH} --strip-components 1
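The `tar` call above drops the archive's top-level directory with `--strip-components 1`. For environments without a shell, the same idea can be sketched with Python's `tarfile` module. The helper name `extract_stripping_top_level` is hypothetical, and the sketch assumes member names use `/` separators and do not start with `./`.

```python
import os
import tarfile


def extract_stripping_top_level(archive_path: str, dest_dir: str) -> None:
    """Extract a tar archive into dest_dir, dropping the first path component."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        kept = []
        for member in archive.getmembers():
            # "top/sub/file.txt" -> ["top", "sub/file.txt"]
            parts = member.name.split("/", 1)
            if len(parts) < 2 or not parts[1]:
                # This is the top-level directory entry itself; skip it.
                continue
            member.name = parts[1]
            kept.append(member)
        archive.extractall(dest_dir, members=kept)
```

Only archives from a trusted source should be extracted this way, since the sketch does not validate member paths.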

## Imports and constants

If you've built your own yolov5 container, change the BASE_IMAGE variable to your own image here. The Dockerfile for the image is included in the "Notebook Container Image Source" directory.

In [4]:
import kfp
from kfp import dsl
from kfp.components import InputPath, OutputPath
from kubernetes.client.models import (
    V1Volume,
    V1VolumeMount,
    V1PersistentVolumeClaimVolumeSource,
    V1EmptyDirVolumeSource,
)
from kfp.dsl import PipelineConf, data_passing_methods
import os
from typing import Optional

BASE_IMAGE = "quay.io/ntlawrence/yolov5-dist-pytorch:1.0.2"
COMPONENT_CATALOG_FOLDER = f"{os.getenv('HOME')}/components"

## Load Data component
The first component in the pipeline copies data from a source URL (including `file://` paths) to a pipeline output.

Essentially, this moves the data from the volume into the pipeline. A common problem when using PVCs is that paths to data in a PVC (strings) are not interchangeable with pipeline inputs and outputs (InputPath and OutputPath). Components like this are needed to turn data on a PVC into something that existing pipeline components expecting InputPath and OutputPath parameters can use.

In [5]:
def load_from_url(
    source: str,
    dest: OutputPath(str),
):
    import os
    import shutil
    from urllib.request import urlretrieve
    from urllib.parse import urlparse

    # Make target directories if needed
    parent_dirs = os.path.dirname(dest)
    if not os.path.exists(parent_dirs):
        os.makedirs(parent_dirs)

    # Option to use an empty file to indicate no weights.
    # Return early so the empty source is not parsed as a URL below.
    if not source:
        with open(dest, "w") as _:
            pass
        return

    source_details = urlparse(source)

    if source_details.scheme == "file":
        if os.path.isdir(source_details.path):
            shutil.copytree(source_details.path, dest)
        else:
            shutil.copyfile(source_details.path, dest)
    elif source_details.scheme in ("http", "https", "ftp", "ftps"):
        urlretrieve(source, filename=dest)
    else:
        raise ValueError(f"source does not use a supported url scheme: {source!r}")


load_from_url_comp = kfp.components.create_component_from_func(
    load_from_url, base_image=BASE_IMAGE
)
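The scheme dispatch in `load_from_url` relies on `urllib.parse.urlparse`. The short sketch below shows how different source strings are classified. `classify_source` and the example paths are hypothetical and only illustrate the parsing; they are not part of the component.

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = ("http", "https", "ftp", "ftps")


def classify_source(source: str) -> str:
    """Return 'local', 'remote', or 'unsupported' for a source URL."""
    scheme = urlparse(source).scheme
    if scheme == "file":
        return "local"
    if scheme in REMOTE_SCHEMES:
        return "remote"
    return "unsupported"


# A file:// URL keeps the absolute path in the .path attribute.
local = urlparse("file:///home/jovyan/weights.pt")
print(local.scheme, local.path)

# A bare path has no scheme, so it needs the file:// prefix to be accepted.
print(classify_source("/home/jovyan/weights.pt"))
```

A plain filesystem path therefore has to be passed as a `file://` URL for the component to accept it.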

## Copy data from one path to another
This component is a helper to move data from one volume path to another.

This component allows us to copy pipeline artifacts to a mounted PVC for use outside of the pipeline.

In [6]:
def copy_data(source: str, dest: str):
    import os
    import shutil

    # Make target directories if needed
    parent_dirs = os.path.dirname(dest)
    if not os.path.exists(parent_dirs):
        os.makedirs(parent_dirs)

    if os.path.isdir(source):
        shutil.copytree(source, dest)
    else:
        shutil.copyfile(source, dest)


copy_data_comp = kfp.components.create_component_from_func(
    copy_data, base_image=BASE_IMAGE
)
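The file-or-directory copy pattern recurs in several components of this notebook. A minimal standalone sketch follows. `copy_path` is a hypothetical name, and `dirs_exist_ok=True` (Python 3.8+) is an addition the components do not use; it lets the copy be repeated without failing.

```python
import os
import shutil


def copy_path(source: str, dest: str) -> None:
    """Copy a file or a directory tree, creating dest's parent directories."""
    parent = os.path.dirname(dest)
    if parent:
        os.makedirs(parent, exist_ok=True)

    if os.path.isdir(source):
        # dirs_exist_ok lets a rerun overwrite an existing destination tree.
        shutil.copytree(source, dest, dirs_exist_ok=True)
    else:
        shutil.copyfile(source, dest)
```

Note that the parent directory comes from `os.path.dirname`, which returns everything before the final path component.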

## Prepare shared storage
For our distributed training, the workers need to share the model configuration and initial training weights.
We also need to be able to copy the output data from the training back into the pipeline.

This component sets up the shared storage for those tasks.

(The training data is also shared between workers, but in this example it is stored on its own shared storage, not on the shared training storage PVC.)

In [7]:
def prepare_shared_storage(
    model_cfg_path: InputPath(str),
    initial_weights_path: InputPath(str),
):
    import os
    import shutil

    def copyf(source: str, dest: str):
        """
        Copies a file or directory,
        creating destination parent dirs as needed
        """
        import os
        import shutil

        parent_dirs = os.path.dirname(dest)
        os.makedirs(parent_dirs, exist_ok=True)

        if os.path.isdir(source):
            shutil.copytree(source, dest)
        else:
            shutil.copyfile(source, dest)

    # The "training" PVC is used to share data between the torch jobs and this component.
    # We need to copy the config and initial weights to it.
    copyf(model_cfg_path, "/workspace/input/config.yaml")
    copyf(initial_weights_path, "/workspace/input/weights.pt")

    # Create a directory to store the output of training, we'll
    # copy data from here to output paths
    os.makedirs("/workspace/output", exist_ok=True)


prepare_shared_storage_comp = kfp.components.create_component_from_func(
    prepare_shared_storage,
    base_image=BASE_IMAGE,
)

## Train Model component

Runs the YOLOv5 `train.py` CLI to train the model.

This version trains on GPUs across multiple nodes.

The trained model and results.csv are returned as component outputs.

In [50]:
def train_model_distributed(
    model: OutputPath(str),
    results: OutputPath(str),
    epochs: int,
    gpus: int,
    training_pvc_name: str,
    dataset_pvc_name: str,
    worker_image: str,
    dataset_subpath: str = "",
    img: int = 640,
    batch_size: int = 112,
    replicas: int = 7,
):
    import os
    import shutil
    import distributed_kf_tools.deploy as deploy
    from distributed_kf_tools.template import OwningWorkFlow, PvcMount

    devices = ",".join([str(gpu) for gpu in range(gpus)])

    ## Start the PyTorch job for distributed training
    deploy.run_pytorch_job(
        # owning_workflow sets things up so that when the pipeline is deleted,
        # the training job is cleaned up
        owning_workflow=OwningWorkFlow(
            name="{{workflow.name}}", uid="{{workflow.uid}}"
        ),
        # These placeholders for namespace and job name are
        # filled in by Kubeflow when the pipeline runs.
        namespace="{{workflow.namespace}}",
        pytorch_job_name="{{workflow.name}}",
        # Shared volumes used by the training script
        # The key to the mounting is that the "runs" directory of yolov5
        # is mapped to the "output" subpath of the training PVC.
        # This allows us to get to the output of the runs after training
        # (as /training/output in this component's mount of the PVC).
        pvcs=[
            PvcMount(
                pvc_name=training_pvc_name,
                mount_path=f"/yolov5/runs/",
                subpath="output",
            ),
            PvcMount(
                pvc_name=training_pvc_name,
                mount_path=f"/training",
            ),
            PvcMount(
                pvc_name=dataset_pvc_name,
                mount_path=f"/dataset",
                subpath=dataset_subpath,
            ),
        ],
        # The command to run in each worker
        # This almost always starts with "torch.distributed.run" for DDP
        command=[
            "python",
            "-m",
            "torch.distributed.run",
            "train.py",
            f"--device={devices}",
            f"--img={img}",
            f"--batch-size={batch_size}",
            "--noplots",
            f"--epochs={epochs}",
            "--weights=/training/input/weights.pt",
            "--cache=/tmp",
            "--cfg=/training/input/config.yaml",
            "--data=/dataset/data.yaml",
            "--optimizer=Adam",
        ],
        # Number of workers
        num_workers=replicas,
        # Number of GPUs per worker (OK to leave this at 1)
        gpus_per_worker=gpus,
        # The base image used for the worker pods
        worker_image=worker_image,
    )

    ########################################################
    # Copy trained model and results to output parameters
    ########################################################
    os.makedirs(os.path.dirname(model), exist_ok=True)
    os.makedirs(os.path.dirname(results), exist_ok=True)

    shutil.copyfile(f"/training/output/train/exp/weights/best.pt", model)
    shutil.copyfile(f"/training/output/train/exp/results.csv", results)


train_model_comp = kfp.components.create_component_from_func(
    train_model_distributed,
    base_image=BASE_IMAGE,
)
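Two small pieces of arithmetic are implicit in the training component: how the `--device` value is built from `gpus`, and how the default `batch_size` of 112 relates to the 7 replicas. The sketch below makes both explicit. It assumes the total batch is split evenly across all GPU processes, which is how YOLOv5's DDP mode is understood to behave; check this against the version in your image. The helper names are hypothetical.

```python
def device_list(gpus: int) -> str:
    """Build the value for --device, e.g. 2 GPUs -> "0,1"."""
    return ",".join(str(index) for index in range(gpus))


def per_process_batch(batch_size: int, replicas: int, gpus: int) -> int:
    """Split a total batch evenly over every GPU process (assumed behavior)."""
    world_size = replicas * gpus
    if world_size <= 0:
        raise ValueError("replicas and gpus must both be positive")
    if batch_size % world_size != 0:
        raise ValueError(
            f"batch_size {batch_size} is not a multiple of {world_size} processes"
        )
    return batch_size // world_size


print(device_list(1))  # 0
print(per_process_batch(112, 7, 1))  # 16
```

If you change `replicas` or `gpus`, choosing a `batch_size` that is a multiple of their product avoids the uneven split.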

## Component to convert the model to ONNX

The export tool supports quantizing the model, so we've added that option as a parameter to the component. Quantization is important for models that will be used for inferencing on edge devices. Many edge devices do not have the resources to evaluate deep neural networks, and a smaller model makes it possible to run inferencing in those environments.

In [51]:
def convert_model_to_onnx(
    model: InputPath(str),
    model_format: str,
    onnx_model: OutputPath(str),
    quantize: str = "",
):
    import subprocess
    import shutil
    import os

    # export.py uses the file name to determine the type of model.
    # model is an input path where the name is generated by Kubeflow.
    # We need to control the name that is used...
    named_model = f"/tmp/{os.path.basename(model)}.{model_format}"
    os.symlink(model, named_model)

    quantize_param = f"--{quantize}" if quantize else ""

    # https://github.com/ultralytics/yolov5/issues/10831
    subprocess.run(
        f"python export.py --img 640 --include=onnx  {quantize_param} "
        f"--data /dataset/data.yaml --weights {named_model} --device=cpu --opset 12",
        check=True,
        cwd="/yolov5",
        shell=True,
    )

    shutil.copyfile(f"/tmp/{os.path.basename(model)}.onnx", onnx_model)


convert_model_to_onnx_comp = kfp.components.create_component_from_func(
    convert_model_to_onnx, base_image=BASE_IMAGE
)
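The component builds the `export.py` command as a single shell string, so an empty `quantize` value simply leaves a gap. An alternative sketch is shown below: it assembles the same flags as an argument list, which can be passed to `subprocess.run` without `shell=True`. `build_export_command` is a hypothetical helper; the flags are the ones already used in the component.

```python
import shlex
from typing import List


def build_export_command(
    weights: str,
    data: str = "/dataset/data.yaml",
    quantize: str = "",
    img: int = 640,
    opset: int = 12,
) -> List[str]:
    """Assemble the export.py arguments as a list (no shell quoting needed)."""
    command = [
        "python", "export.py",
        "--img", str(img),
        "--include=onnx",
        "--data", data,
        "--weights", weights,
        "--device=cpu",
        "--opset", str(opset),
    ]
    # Only add the quantization flag when one was requested, e.g. "int8".
    if quantize:
        command.append(f"--{quantize}")
    return command


command = build_export_command("/tmp/model.pt", quantize="int8")
# shlex.join renders the list as a copy-pasteable shell line for logging.
print(shlex.join(command))
# Inside the container it could then be run with:
#   subprocess.run(command, check=True, cwd="/yolov5")
```

Passing a list avoids shell interpretation of the generated artifact path.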

## Model evaluation component

This component can be used for both the original and ONNX models, which allows comparing the ONNX model (which might be quantized) with the original model.

In [52]:
def evaluate_model(
    results: OutputPath(str),
    model: InputPath(str),
    model_format: str = "onnx",  # onnx, pt, tf ....
    conf_thres: float = 0.001,
    iou_thres: float = 0.6,
    max_det: int = 300,
):
    import subprocess
    import os
    import torch
    from ruamel.yaml import YAML
    import pathlib
    import shutil

    print(f"The size of the model is {os.path.getsize(model)}")

    # val.py uses the file name to determine the type of model.
    # model is an input path where the name is generated by Kubeflow.
    # We need to control the name that is used...
    named_model = f"/tmp/{os.path.basename(model)}.{model_format}"
    os.symlink(model, named_model)

    subprocess.run(
        f"python val.py --weights {named_model} --data /dataset/data.yaml --img 640 "
        f"--conf-thres {conf_thres} --iou-thres {iou_thres} --max-det {max_det} --workers=0 ",
        check=True,
        shell=True,
        cwd="/yolov5",
    )

    os.makedirs(os.path.dirname(results), exist_ok=True)
    shutil.copytree("/yolov5/runs/val/exp", results)


evaluate_model_comp = kfp.components.create_component_from_func(
    evaluate_model, base_image=BASE_IMAGE
)
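Both the export and evaluation components use a symlink to give a Kubeflow-generated artifact path a file extension that the YOLOv5 scripts can recognize. The sketch below isolates that trick. `link_with_extension` is a hypothetical name, and removing a stale link first is an addition the components do not make.

```python
import os


def link_with_extension(artifact_path: str, extension: str, link_dir: str) -> str:
    """Create a symlink to artifact_path whose name ends in .<extension>."""
    link_name = f"{os.path.basename(artifact_path)}.{extension}"
    link_path = os.path.join(link_dir, link_name)
    # Remove a stale link from an earlier call so os.symlink does not fail.
    if os.path.lexists(link_path):
        os.remove(link_path)
    os.symlink(artifact_path, link_path)
    return link_path
```

The artifact itself is not copied, so the extra name costs no storage even for large weights files.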

## Write artifact to PVC

Converts data from a pipeline parameter to a file stored on a PVC.

In [53]:
def write_artifact_to_path(source: InputPath(str), dest: str):
    import os
    import shutil

    parent_dirs = os.path.dirname(dest)
    os.makedirs(parent_dirs, exist_ok=True)

    if os.path.isdir(source):
        shutil.copytree(source, dest)
    else:
        shutil.copyfile(source, dest)


write_artifact_to_path_comp = kfp.components.create_component_from_func(
    write_artifact_to_path, base_image=BASE_IMAGE
)

## Upload the ONNX model

The upload component is the same component that is shared with other examples. It loads the ONNX model into MinIO.

In [54]:
UPLOAD_MODEL_COMPONENT = (
    f"{COMPONENT_CATALOG_FOLDER}/model-building/upload-model/component.yaml"
)

upload_model_comp = kfp.components.load_component_from_file(UPLOAD_MODEL_COMPONENT)

## Deploy Model Component

Deploys the model to an NVIDIA Triton inference service.

In [55]:
DEPLOY_MODEL_COMPONENT = f"{os.getenv('HOME')}/kubeflow-ppc64le-examples/deploy_triton_inference_service_component/deploy_triton_inference_service_component.yaml"
deploy_model_comp = kfp.components.load_component_from_file(DEPLOY_MODEL_COMPONENT)

## Pipeline Definition

This defines the pipeline and the pipeline's parameters.

In [56]:
BLACKBOARD_RESOURCE_NAME = "ml-blackboard"

In [57]:
@dsl.pipeline(name="bee-yolov5")
def bee_yolov5(
    data_vol_pvc_name: str,
    data_vol_subpath: str,
    worker_image: str = BASE_IMAGE,
    epochs: int = 750,
    model_config_url: str = "https://github.com/ultralytics/yolov5/raw/v7.0/models/yolov5s.yaml",
    initial_weights_url: str = "https://github.com/ultralytics/yolov5/releases/download/v7.0/yolov5s.pt",
    quantize_onnx: str = "int8",
    minio_url="minio-service.kubeflow:9000",
    model_version: int = 1,
    dataset_size: str = "4Gi",
    artifact_vol_pvc_name: str = "",
    artifact_vol_subpath: str = "",
):
    def mount_volume(task, pvc_name, mount_path, volume_subpath, read_only=False):
        task.add_volume(
            V1Volume(
                name=pvc_name,
                persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(pvc_name),
            )
        )

        task.add_volume_mount(
            V1VolumeMount(
                name=pvc_name,
                mount_path=mount_path,
                sub_path=volume_subpath,
                read_only=read_only,
            )
        )

    create_blackboard = dsl.VolumeOp(
        name="Create Artefacts Blackboard",
        resource_name=BLACKBOARD_RESOURCE_NAME,
        modes=dsl.VOLUME_MODE_RWO,
        size="4Gi",
        set_owner_reference=True,
    )

    create_dataset_volume = dsl.VolumeOp(
        name=f"Create PVC for dataset",
        resource_name="dataset-pvc",
        modes=dsl.VOLUME_MODE_RWM,
        size=dataset_size,
        set_owner_reference=True,
    )

    # Load config and initial weights tasks
    model_config_task = load_from_url_comp(model_config_url)
    model_config_task.after(create_blackboard)
    model_config_task.set_display_name("Load model Config")

    initial_weights_task = load_from_url_comp(initial_weights_url)
    initial_weights_task.after(create_blackboard)
    initial_weights_task.set_display_name("Load initial weights")

    # Prepare Dataset
    prepare_dataset_task = copy_data_comp("/src-vol", "/dest-vol/dataset")
    mount_volume(
        prepare_dataset_task,
        data_vol_pvc_name,
        "/src-vol",
        data_vol_subpath,
        read_only=True,
    )
    mount_volume(
        prepare_dataset_task,
        create_dataset_volume.volume.persistent_volume_claim.claim_name,
        "/dest-vol",
        "",
    )

    prepare_dataset_task.after(create_dataset_volume)
    prepare_dataset_task.after(create_blackboard)
    prepare_dataset_task.set_display_name("Copy dataset to Pipeline owned PVC")

    # Create and prepare training volume
    create_training_volume = dsl.VolumeOp(
        name=f"Create PVC for training",
        resource_name="yolov5-training",
        modes=dsl.VOLUME_MODE_RWM,
        size="2G",
        set_owner_reference=True,
    )

    prepare_training_volume_task = prepare_shared_storage_comp(
        model_cfg=model_config_task.outputs["dest"],
        initial_weights=initial_weights_task.outputs["dest"],
    )
    mount_volume(
        prepare_training_volume_task,
        create_training_volume.volume.persistent_volume_claim.claim_name,
        "/workspace",
        "",
    )
    prepare_training_volume_task.after(create_training_volume)

    # Train Model
    train_model_task = train_model_comp(
        epochs=epochs,
        gpus=1,
        worker_image=worker_image,
        training_pvc_name=create_training_volume.volume.persistent_volume_claim.claim_name,
        dataset_pvc_name=create_dataset_volume.volume.persistent_volume_claim.claim_name,
        dataset_subpath="dataset",
    )

    mount_volume(
        train_model_task,
        create_training_volume.volume.persistent_volume_claim.claim_name,
        "/training",
        "",
    )
    train_model_task.after(prepare_dataset_task)
    train_model_task.after(prepare_training_volume_task)

    # convert to ONNX
    convert_model_to_onnx_task = convert_model_to_onnx_comp(
        model=train_model_task.outputs["model"],
        model_format="pt",
        quantize=quantize_onnx,
    )
    mount_volume(
        convert_model_to_onnx_task,
        create_dataset_volume.volume.persistent_volume_claim.claim_name,
        "/dataset",
        "dataset",
    )

    # Evaluate
    evaluate_onnx_model_task = evaluate_model_comp(
        model=convert_model_to_onnx_task.outputs["onnx_model"],
    )
    evaluate_onnx_model_task.set_display_name("Evaluate ONNX Model")
    mount_volume(
        evaluate_onnx_model_task,
        create_dataset_volume.volume.persistent_volume_claim.claim_name,
        "/dataset",
        "dataset",
    )

    evaluate_pt_model_task = evaluate_model_comp(
        model=train_model_task.outputs["model"],
        model_format="pt",
    )
    evaluate_pt_model_task.set_display_name("Evaluate Torch Model")
    mount_volume(
        evaluate_pt_model_task,
        create_dataset_volume.volume.persistent_volume_claim.claim_name,
        "/dataset",
        "dataset",
    )

    # Upload ONNX model
    upload_model_task = upload_model_comp(
        convert_model_to_onnx_task.outputs["onnx_model"],
        minio_url=minio_url,
        export_bucket="{{workflow.namespace}}-bee",
        model_format="onnx",
        model_name="bee",
        model_version=model_version,
    )

    # Deploy Inference Service
    deploy_model_task = deploy_model_comp(
        name="bee",
        rm_existing=True,
        storage_uri="s3://{{workflow.namespace}}-bee/onnx",
        minio_url=minio_url,
        predictor_protocol="v2",
    )
    deploy_model_task.after(upload_model_task)

    ##### Write evaluation results
    with dsl.Condition(artifact_vol_pvc_name != "", name="store_onnx_metrics"):
        ## Save ONNX Metrics
        write_onnx_artifact_to_path_task = write_artifact_to_path_comp(
            evaluate_onnx_model_task.outputs["results"],
            "/mnt/{{workflow.name}}/onnx-metrics",
        )
        mount_volume(
            write_onnx_artifact_to_path_task,
            artifact_vol_pvc_name,
            "/mnt",
            artifact_vol_subpath,
        )
        write_onnx_artifact_to_path_task.set_display_name("Save ONNX metrics (PVC)")

        ## Save Torch Metrics
        write_torch_artifact_to_path_task = write_artifact_to_path_comp(
            evaluate_pt_model_task.outputs["results"],
            "/mnt/{{workflow.name}}/pt-metrics",
        )
        mount_volume(
            write_torch_artifact_to_path_task,
            artifact_vol_pvc_name,
            "/mnt",
            artifact_vol_subpath,
        )
        write_torch_artifact_to_path_task.set_display_name("Save Torch results (PVC)")

        ## Save Torch Model
        write_torch_model_artifact_to_path_task = write_artifact_to_path_comp(
            train_model_task.outputs["model"],
            "/mnt/{{workflow.name}}/model.pt",
        )
        mount_volume(
            write_torch_model_artifact_to_path_task,
            artifact_vol_pvc_name,
            "/mnt",
            artifact_vol_subpath,
        )
        write_torch_model_artifact_to_path_task.set_display_name(
            "Save Torch model (PVC)"
        )

        ## Save ONNX Model
        write_onnx_model_artifact_to_path_task = write_artifact_to_path_comp(
            convert_model_to_onnx_task.outputs["onnx_model"],
            "/mnt/{{workflow.name}}/model.onnx",
        )
        mount_volume(
            write_onnx_model_artifact_to_path_task,
            artifact_vol_pvc_name,
            "/mnt",
            artifact_vol_subpath,
        )
        write_onnx_model_artifact_to_path_task.set_display_name("Save ONNX model (PVC)")

## Compile Pipeline with configuration options

Uses a PVC for passing data between components.

In [58]:
# See: https://www.kubeflow.org/docs/components/pipelines/overview/caching/#managing-caching-staleness
def disable_cache_transformer(op):
    if isinstance(op, dsl.ContainerOp):
        op.execution_options.caching_strategy.max_cache_staleness = "P0D"
    else:
        op.add_pod_annotation(
            name="pipelines.kubeflow.org/max_cache_staleness", value="P0D"
        )
    return op


pipeline_conf = PipelineConf()
pipeline_conf.add_op_transformer(disable_cache_transformer)

pipeline_conf.data_passing_method = data_passing_methods.KubernetesVolume(
    volume=V1Volume(
        name=BLACKBOARD_RESOURCE_NAME,
        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
            "{{workflow.name}}-" + BLACKBOARD_RESOURCE_NAME
        ),
    ),
    path_prefix=f"{BLACKBOARD_RESOURCE_NAME}/",
)


# This transformer is only relevant to an IBM Lab Sandbox
# In that environment, we have Power 8 hardware, which
# does not run the python 3.10 built by RocketCE.
# This transformer forces all the work to run on the
# AC922's (which have the label ai.accelerator=V100)
#def run_on_power_9_transformer(op: dsl.ContainerOp):
#    if isinstance(op, dsl.ContainerOp):
#        op.add_node_selector_constraint("ai.accelerator", "V100")


#pipeline_conf.add_op_transformer(run_on_power_9_transformer)

In [59]:
PIPELINE_NAME = "Bee detector pipeline"

kfp.compiler.Compiler().compile(
    pipeline_func=bee_yolov5,
    package_path=f"{PIPELINE_NAME}.yaml",
    pipeline_conf=pipeline_conf,
)

## Upload pipeline programmatically

The YAML file generated by the previous compile can be used to create a pipeline through the UI.
* Download the file to your workstation
* Click "Pipelines" from the side bar
* Press the "upload pipeline" button, and upload the YAML.

After adding the pipeline, you can run it from the UI without looking at the pipeline code. This allows an inexperienced user to run the pipeline with specific parameters, without the complexity of Kubeflow component code or Kubernetes awareness.

These next few cells upload and run the pipeline programmatically, so that the example is "automated".

In [60]:
def delete_pipeline(pipeline_name: str):
    """Deletes any pipeline with the specified name"""

    client = kfp.Client()
    existing_pipelines = client.list_pipelines(page_size=999).pipelines
    matches = (
        [ep.id for ep in existing_pipelines if ep.name == pipeline_name]
        if existing_pipelines
        else []
    )
    # Avoid shadowing the built-in id() with the loop variable
    for pipeline_id in matches:
        client.delete_pipeline(pipeline_id)


def get_experiment_id(experiment_name: str) -> str:
    """Returns the id for the experiment, creating the experiment if needed"""
    client = kfp.Client()
    existing_experiments = client.list_experiments(page_size=999).experiments
    matches = (
        [ex.id for ex in existing_experiments if ex.name == experiment_name]
        if existing_experiments
        else []
    )

    if matches:
        return matches[0]

    exp = client.create_experiment(experiment_name)
    return exp.id
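Both helpers above share one lookup pattern: filter a listing by name, treating a missing listing as empty. That pattern can be tested without a Kubeflow cluster. The sketch below uses `SimpleNamespace` objects as stand-ins for the listed pipelines; `ids_matching_name` and the sample ids are hypothetical.

```python
from types import SimpleNamespace
from typing import Iterable, List, Optional


def ids_matching_name(items: Optional[Iterable], name: str) -> List[str]:
    """Return the ids of all items with the given name; None means no items."""
    return [item.id for item in (items or []) if item.name == name]


# Stand-ins for the objects returned by a listing call.
listing = [
    SimpleNamespace(id="a1", name="Bee detector pipeline"),
    SimpleNamespace(id="b2", name="Another pipeline"),
]

print(ids_matching_name(listing, "Bee detector pipeline"))  # ['a1']
print(ids_matching_name(None, "Bee detector pipeline"))  # []
```

Note that the helpers request a single page of up to 999 entries, so a deployment with more entries than that would need paging.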

In [61]:
# Pipeline names need to be unique, so before we upload,
# check for and delete any pipeline with the same name
delete_pipeline(PIPELINE_NAME)

In [62]:
# upload the pipeline
client = kfp.Client()
uploaded_pipeline = client.upload_pipeline(f"{PIPELINE_NAME}.yaml", PIPELINE_NAME)

## Run the pipeline with parameters

This is equivalent to running the pipeline from the pipelines view in the UI. Since a pipeline run needs to be part of an experiment, this code will create the experiment if it does not exist.

When we run this manually, we'll use initial weights that have been trained on a much larger superset of this data, and with many more epochs. That will give us results that can be demoed with a fast training cycle.

In [63]:
pipeline_params = {
    "data_vol_pvc_name": VOLUME_CLAIM_NAME,
    "data_vol_subpath": BEE_DATA_SET_SUBPATH,
    "worker_image": BASE_IMAGE,
    "epochs": "10",
    "artifact_vol_pvc_name": VOLUME_CLAIM_NAME,
    "artifact_vol_subpath": "runs",
}

run = client.run_pipeline(
    experiment_id=get_experiment_id("bee-exp"),
    job_name="bees",
    pipeline_id=uploaded_pipeline.id,
    params=pipeline_params,
)

## Wait for the pipeline to complete

* Waits for up to 20 minutes
* Checks the results

In [64]:
TWENTY_MIN = 20 * 60
result = client.wait_for_run_completion(run.id, timeout=TWENTY_MIN)
{
    "status": result.run.status,
    "error": result.run.error,
    "time": str(result.run.finished_at - result.run.created_at),
}

{'status': 'Succeeded', 'error': None, 'time': '0:07:31'}
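The `time` entry in the output is the string form of a `datetime.timedelta`. The sketch below reproduces the formatting with hypothetical timestamps chosen to match the run above, and compares the elapsed time with the 20-minute timeout.

```python
from datetime import datetime, timedelta, timezone

TWENTY_MIN = 20 * 60  # seconds, as in the cell above

# Hypothetical timestamps standing in for result.run.created_at / finished_at.
created_at = datetime(2023, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
finished_at = created_at + timedelta(minutes=7, seconds=31)

elapsed = finished_at - created_at
print(str(elapsed))  # 0:07:31
print(elapsed.total_seconds())  # 451.0
print(elapsed.total_seconds() <= TWENTY_MIN)  # True
```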

# Inference

We don't have YOLOv5 installed in the notebook server, but we can still run its inference command by running it in a container.

We use the same trick as elsewhere: the job mounts a PVC that is also mounted locally.

The shared PVC must therefore use the ReadWriteMany access mode.

The pipeline deployed the model to a Triton service, so that's the model we'll use for prediction. Real-world applications would build a front-end web app or transformer here to interact with the inference service.

In [73]:
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace") as n:
    # strip() guards against a trailing newline ending up in the service URL
    NAMESPACE = n.read().strip()
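The namespace read above becomes part of the in-cluster URL that `detect.py` uses as its `--weights` argument. The sketch below shows how that URL is assembled, following the pattern in the job template that follows. `triton_infer_url` is a hypothetical helper, and the example namespace is taken from the log output further down.

```python
def triton_infer_url(service_name: str, model_name: str, namespace: str) -> str:
    """Build the in-cluster v2 inference URL for a deployed model."""
    # strip() removes any whitespace picked up when reading the namespace file.
    host = f"{service_name}.{namespace.strip()}.svc.cluster.local"
    return f"http://{host}/v2/models/{model_name}/infer"


print(triton_infer_url("bee", "bee", "kubeflow-ntl\n"))
```

Stray whitespace in the namespace would otherwise produce a hostname that does not resolve.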

In [81]:
inference_job_template = f"""
apiVersion: batch/v1
kind: Job
metadata:
  name: bee-inference-example
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: >-
                                 false
    spec:
      # The node selector is only relevant to a specific IBM Lab
      # It forces the workload to run on an AC922 machine that has
      # the ai.accelerator annotation, rather than the P8 machines
      # in the environment. The image requires a P9 Machine. 
#    nodeSelector:
#      ai.accelerator: V100
     containers:
      - command: ["/bin/sh", "-c"]
        args: 
         - >-
           python detect.py
           --weights=http://bee.{NAMESPACE}.svc.cluster.local/v2/models/bee/infer
           --data=/dataset/data.yaml
           --source=/dataset/test/images/2016-03-15-04-09-38-1024x673_jpg.rf.aa352f3bd2a105dd7ff560e0ab42ae69.jpg
           --conf-thres=.7
           --iou-thres=.2
           --max-det=500
        image: {BASE_IMAGE}
        imagePullPolicy: IfNotPresent
        name: inference
        volumeMounts:
        - mountPath: /dataset
          name: data
          subPath: {BEE_DATA_SET_SUBPATH}
        - mountPath: /yolov5/runs/
          name: data
          subPath: kubeflow-ppc64le-examples/pytorch_distributed/yolov5/notebook/
     volumes:
      - name: data
        persistentVolumeClaim:
          claimName: {VOLUME_CLAIM_NAME}
      - emptyDir:
          medium: Memory
        name: dshm
     restartPolicy: Never
  backoffLimit: 0  
"""

!kubectl delete job bee-inference-example
!echo '{inference_job_template}' | kubectl create -f -

Error from server (NotFound): jobs.batch "bee-inference-example" not found
job.batch/bee-inference-example created


In [82]:
!kubectl wait --for=condition=Complete job bee-inference-example --timeout=20m
!kubectl logs -l job-name=bee-inference-example

job.batch/bee-inference-example condition met
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-qnx1ym7c because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[34m[1mdetect: [0mweights=['http://bee.kubeflow-ntl.svc.cluster.local/v2/models/bee/infer'], source=/dataset/test/images/2016-03-15-04-09-38-1024x673_jpg.rf.aa352f3bd2a105dd7ff560e0ab42ae69.jpg, data=/dataset/data.yaml, imgsz=[640, 640], conf_thres=0.7, iou_thres=0.2, max_det=500, device=, view_img=False, save_txt=False, save_conf=False, save_crop=False, nosave=False, classes=None, agnostic_nms=False, augment=False, visualize=False, update=False, project=runs/detect, name=exp, exist_ok=False, line_thickness=3, hide_labels=False, hide_conf=False, half=False, dnn=False, vid_stride=1
YOLOv5 🚀 v7.0-178-ga

---
You should now see the detection results in the local directory tree.

The results are unlikely to be good, given the small training set and low number of epochs.

---