In [1]:
%load_ext lab_black

# Distributed training example using PyTorch

This notebook trains and evaluates a classifier of handwritten digits (using the MNIST dataset). The training is distributed across multiple GPUs.

This flow has important features that differ from other examples in this repo, such as the MPI/TensorFlow example.

Author: Nick Lawrence ntl@us.ibm.com
License: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Container Image
The components of this pipeline use a custom-built container image as their base image. The Dockerfile is included in GitHub. The container image does *not* include a notebook server, meaning you cannot use it to launch a notebook. The notebook container images from https://quay.io/repository/ibm/kubeflow-notebook-image-ppc64le have a wide range of packages installed, and these can be used to start an interactive notebook server. The base image used in this example installs a much smaller set of newer packages, including Python 3.10 and PyTorch 1.13 from RocketCE. Also included is pytorch-lightning, which makes it easier to write models for distributed training.

Since your notebook image and pipeline images are not the same, you may be able to run code in the pipeline that does not run interactively in the notebook, and you may be able to run code interactively that does not run in the pipeline.

The custom image also includes the pytorch_distributed_kf_tools package, which simplifies the creation and deployment of the PyTorchJob. The source code for this package is in the GitHub repo.

You can see how the custom image was built and which packages are included by looking at the Dockerfile.

## Model Training Script
When using distributed training with GPUs, most PyTorch models are trained within a script that is called from the command line, as opposed to a cell in an interactive notebook.

The script to run the model is stored in the GitHub Repo.

The Kubeflow train_and_test_model component will create a PyTorch job that runs the script across a pre-defined number of worker pods. The component will then monitor the job and wait for the job's completion.

This script is downloaded by the pipeline, rather than being built into the container image. This allows the script to be changed without rebuilding the container image.
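
A minimal sketch of how a command-line training script might accept its settings, assuming `argparse`. The flag names mirror those passed to the script in the training command later in this notebook; the `parse_args` helper and all default values are illustrative assumptions, not the contents of the actual `mnist.py`.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the flags that this notebook passes to the training script.

    The defaults are placeholders for illustration only.
    """
    parser = argparse.ArgumentParser(description="Train and test an MNIST classifier")
    parser.add_argument("--root_dir", default=".")
    parser.add_argument("--data_dir", default=".")
    parser.add_argument("--model", default="mnist_model.pt")
    parser.add_argument("--batch_size", type=int, default=64)
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--kubeflow_ui_metadata", default=None)
    # When argv is None, argparse reads the real command line (sys.argv)
    return parser.parse_args(argv)


# Example: parse an explicit argument list instead of the real command line
args = parse_args(["--batch_size=1568", "--max_epochs=15"])
print(args.batch_size, args.max_epochs, args.model)
```

Passing an explicit list to `parse_args` makes it possible to exercise the parsing logic from a notebook cell, while the same function reads the real command line when the script is launched on a worker.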

## Shared Storage
The Kubeflow component needs to make the Python script and training data available to the PyTorchJob, and it needs to be able to obtain the trained model after training is completed.

The pipeline will create a Volume as the first step in the pipeline.
The pipeline will copy the training data and training script onto this storage so that the workers can access it.
After training, the pipeline will copy the model from the shared storage to storage that is accessible outside the pipeline (the notebook's volume in this example).


---

## Imports and Constants

In [2]:
import kfp
from kfp.components import InputPath, OutputPath
import kfp.dsl as dsl
from kfp.dsl import PipelineConf, data_passing_methods
from kubernetes.client import (
    V1Volume,
    V1PersistentVolumeClaimVolumeSource,
    V1VolumeMount,
)
import numpy as np
import os
from typing import List, NamedTuple

BASE_IMAGE = "quay.io/ntlawrence/mnist-dist-pytorch:1.0.3"

# Notebook Volume
Set the value in this next cell to your notebook's volume. Make sure that the volume was created with the ReadWriteMany access mode.

In [3]:
NOTEBOOK_PVC_NAME = "dist-demo-volume"

# Prepare shared storage

This cell downloads three things to the shared storage:
* Training data
* Test data
* Model training script

The assumption in the function is that the shared volume has been mounted to the path "/workspace". This mount is defined by the pipeline definition.

In [4]:
def prepare_shared_storage():
    from torchvision.datasets import MNIST
    import urllib.request

    # Download training data
    _ = MNIST("/workspace", download=True, train=True)
    _ = MNIST("/workspace", download=True, train=False)

    # Download python model training script
    r = urllib.request.urlretrieve(
        "https://raw.githubusercontent.com/ntl-ibm/kubeflow-ppc64le-examples/2.0.2/distributed_training/pytorch/mnist/mnist.py",
        "/workspace/mnist.py",
    )


prepare_shared_storage_comp = kfp.components.create_component_from_func(
    prepare_shared_storage, base_image=BASE_IMAGE
)

# Train and Test the model
For this example, both training and evaluating the model are part of the same script.

The training data and training script have been downloaded to shared storage (assumed to be mounted to /workspace).

The pipeline component needs to create a PyTorchJob resource and monitor it. Only the creation and monitoring of the resource happen in this component; the actual training happens in independent worker pods.

Creating and monitoring the resource is complex. The pytorch_distributed_kf_tools package (from this repo) is built into the base container image and lets us do that with a single function call.

The training script could have been included in the container image, which would avoid the need to download it. However, downloading has the advantage of being able to pick up versioned changes from Git without rebuilding the container image.
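
To make the resource concrete, here is a rough sketch of the kind of manifest a PyTorchJob uses, built as a plain Python dictionary. It follows the `kubeflow.org/v1` PyTorchJob schema of the Kubeflow training operator. The helper name `build_pytorch_job`, its parameters, and the split into one Master plus a number of Workers are assumptions for illustration; this is not necessarily what the tools package generates.

```python
import copy
from typing import Any, Dict, List


def build_pytorch_job(
    name: str,
    namespace: str,
    image: str,
    command: List[str],
    pvc_name: str,
    mount_path: str,
    worker_replicas: int,
    gpus_per_replica: int = 1,
) -> Dict[str, Any]:
    """Return a PyTorchJob manifest with one Master and `worker_replicas` Workers."""
    pod_template = {
        "spec": {
            "containers": [
                {
                    # The training operator looks for a container named "pytorch"
                    "name": "pytorch",
                    "image": image,
                    "command": command,
                    "resources": {"limits": {"nvidia.com/gpu": gpus_per_replica}},
                    "volumeMounts": [{"name": "shared", "mountPath": mount_path}],
                }
            ],
            "volumes": [
                {"name": "shared", "persistentVolumeClaim": {"claimName": pvc_name}}
            ],
        }
    }

    def replica_spec(replicas: int) -> Dict[str, Any]:
        # Each replica type gets its own copy of the pod template
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",
            "template": copy.deepcopy(pod_template),
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1),
                "Worker": replica_spec(worker_replicas),
            }
        },
    }
```

A manifest like this could be submitted with the Kubernetes Python client's `CustomObjectsApi.create_namespaced_custom_object` (group `kubeflow.org`, version `v1`, plural `pytorchjobs`), after which the job's status would have to be polled until it succeeds or fails. That polling, plus cleanup and log collection, is the part the packaged helper hides.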

In [5]:
def train_and_test_model(
    shared_pvc_name: str, worker_image: str, mlpipeline_ui_metadata_path: OutputPath(str),
):
    import shutil
    import distributed_kf_tools.deploy as deploy
    from distributed_kf_tools.template import OwningWorkFlow, PvcMount

    for retries in range(5):
        try:
            ## Start the PyTorch job for distributed training
            job_name = "{{workflow.name}}" + (f"-{retries:03d}" if retries else "")
            deploy.run_pytorch_job(
                # owning_workflow sets things up so that when the pipeline run
                # is deleted, the training job is cleaned up
                owning_workflow=OwningWorkFlow(
                    name="{{workflow.name}}", uid="{{workflow.uid}}"
                ),
                # These placeholders for namespace and job name are
                # filled in by Kubeflow when the pipeline runs.
                namespace="{{workflow.namespace}}",
                pytorch_job_name=job_name,
                # Shared volumes used by the training script
                pvcs=[
                    PvcMount(
                        pvc_name=(shared_pvc_name),
                        mount_path="/workspace",
                    )
                ],
                # The command to run in each worker
                # This almost always starts with "torch.distributed.run" for DDP
                command=[
                    "python",
                    "-m",
                    "torch.distributed.run",
                    "/workspace/mnist.py",
                    "--root_dir=/workspace",
                    "--data_dir=/workspace",
                    "--model=/workspace/mnist_model.pt",
                    "--batch_size=1568",
                    "--kubeflow_ui_metadata=/workspace/metadata.json",
                    "--max_epochs=15",
                ],
                # Number of workers
                num_workers=7,
                # Number of GPUs per worker (OK to leave this at 1)
                gpus_per_worker=1,
                # The base image used for the worker pods
                worker_image=worker_image,
            )
            break
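        except RuntimeError as e:
            print(f"THE JOB FAILED BECAUSE OF ERROR {e}")
    else:
        # Every attempt failed; fail the component instead of continuing
        raise RuntimeError("The PyTorchJob did not succeed after 5 attempts")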

    # The example python script outputs a file metadata.json
    # This file is already in a format suitable to display metrics on the
    # visualizations tab of the component. We just have to copy it to
    # the output path provided by Kubeflow for it to show up.
    shutil.copyfile("/workspace/metadata.json", mlpipeline_ui_metadata_path)


train_and_test_model_comp = kfp.components.create_component_from_func(
    train_and_test_model, base_image=BASE_IMAGE
)
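
For reference, a file in the Kubeflow Pipelines (v1) UI metadata format is JSON with a top-level `outputs` list. The sketch below writes a single inline markdown table of metrics. The helper name `write_ui_metadata` and the metric names and values are made up for illustration; they are not taken from `mnist.py`.

```python
import json
from typing import Dict


def write_ui_metadata(metrics: Dict[str, float], path: str) -> dict:
    """Write metrics as an inline markdown table in KFP v1 UI metadata format."""
    rows = "\n".join(f"| {name} | {value:.4f} |" for name, value in metrics.items())
    markdown = "| Metric | Value |\n| --- | --- |\n" + rows
    metadata = {
        "outputs": [{"type": "markdown", "storage": "inline", "source": markdown}]
    }
    with open(path, "w") as f:
        json.dump(metadata, f)
    return metadata
```

Because the training script already writes its file in this shape, the component only needs the `shutil.copyfile` call above to surface the metrics in the UI.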

# Copy data
This component can be used to copy a file or directory from one path to another.

We use this to copy the model from the shared PVC to the PVC used by the notebook server.

In [6]:
def copy_data(source: str, dest: str):
    import os
    import shutil

    # Make the destination's parent directories if needed
    parent_dirs = os.path.dirname(dest)
    if parent_dirs and not os.path.exists(parent_dirs):
        os.makedirs(parent_dirs)

    if os.path.isdir(source):
        shutil.copytree(source, dest)
    else:
        shutil.copyfile(source, dest)


copy_data_comp = kfp.components.create_component_from_func(
    copy_data, base_image=BASE_IMAGE
)

# Define the pipeline


The pipeline has four tasks in it:
* Create shared volume
* Prepare shared volume
   - Download training data
   - Download training script
* Train model
   - Metrics are output to a visualization
* Copy trained model to the notebook PVC

In [7]:
@dsl.pipeline(
    name="Handwritten digit classification",
    description="An example pipeline that trains using distributed pytorch",
)
def mnist_pipeline(notebook_pvc_name):

    create_shared_volume_volop = dsl.VolumeOp(
        name="Create shared volume for training",
        resource_name="shared-pvc",
        modes=dsl.VOLUME_MODE_RWM,
        size="4Gi",
        set_owner_reference=True,
    )

    prepare_shared_storage_task = prepare_shared_storage_comp()
    prepare_shared_storage_task.add_pvolumes(
        {"/workspace": create_shared_volume_volop.volume}
    )

    train_model_task = train_and_test_model_comp(
        create_shared_volume_volop.volume.persistent_volume_claim.claim_name,
        worker_image=BASE_IMAGE
    )
    train_model_task.add_pvolumes({"/workspace": create_shared_volume_volop.volume})
    train_model_task.after(prepare_shared_storage_task)
    train_model_task.set_display_name("Train and Test Model")

    copy_model_task = copy_data_comp(
        "/workspace/mnist_model.pt", "/target/mnist_model.pt"
    )
    copy_model_task.add_pvolumes({"/workspace": create_shared_volume_volop.volume})
    copy_model_task.add_volume(
        V1Volume(
            name=notebook_pvc_name,
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                notebook_pvc_name
            ),
        )
    )
    copy_model_task.add_volume_mount(
        V1VolumeMount(name=notebook_pvc_name, mount_path="/target")
    )
    copy_model_task.set_display_name("Copy Model to target PVC")
    copy_model_task.after(train_model_task)

In [8]:
PIPELINE_NAME = "MNIST HW Classification Pipeline"

# Configure Pipeline

The first transformer disables caching for our pipeline. Kubeflow caches tasks based on input and output parameters. Because we are using shared storage, the input parameters are the same as in previous runs, but the data on the shared storage is most likely different.

The second transformer adds a node constraint to all tasks. This is only needed in the IBM lab, where the cluster currently includes a few older Power 8 machines. Python 3.10 from RocketCE has been optimized for Power 9 and Power 10, so the transformer forces all of our pods to run on the newer AC922s. (This isn't an issue for environments that have only AC922s or newer.) The transformer is commented out in the cell below.

In [9]:
pipeline_conf = kfp.dsl.PipelineConf()

# Disable Caching
def disable_cache_transformer(op: dsl.ContainerOp):
    if isinstance(op, dsl.ContainerOp):
        op.execution_options.caching_strategy.max_cache_staleness = "P0D"
    else:
        op.add_pod_annotation(
            name="pipelines.kubeflow.org/max_cache_staleness", value="P0D"
        )
    return op


pipeline_conf.add_op_transformer(disable_cache_transformer)

# This transformer is only relevant inside an IBM lab that has both P8 and P9 machines
# (Assumption is that the P9 machines have the ai.accelerator label on them)
# def run_on_power_9_transformer(op: dsl.ContainerOp):
#    if isinstance(op, dsl.ContainerOp):
#        op.add_node_selector_constraint("ai.accelerator", "V100")


# pipeline_conf.add_op_transformer(run_on_power_9_transformer)

# Compile and run pipeline

This creates a run of the pipeline without uploading it. If you wanted to, you could upload the compiled pipeline's YAML and run the pipeline through the UI.

The parameter to the pipeline is the name of the PVC to copy the model onto. *This PVC must have an access mode of ReadWriteMany!*


In [10]:
kfp.compiler.Compiler().compile(
    pipeline_func=mnist_pipeline,
    package_path=f"{PIPELINE_NAME}.yaml",
    pipeline_conf=pipeline_conf,
)

In [11]:
client = kfp.Client()
run = client.create_run_from_pipeline_package(
    f"{PIPELINE_NAME}.yaml", arguments={"notebook_pvc_name": NOTEBOOK_PVC_NAME}
)

# Check output

After the pipeline completes, you should see the model in the root directory of the PVC defined by the NOTEBOOK_PVC_NAME variable.
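
A quick way to confirm this from a notebook cell is sketched below. It assumes the notebook PVC is mounted at the notebook user's home directory (the `NOTEBOOK_MOUNT` value is a guess; adjust it to your server) and that the file keeps the name `mnist_model.pt` used by the copy step. The helper `describe_model_file` is hypothetical.

```python
from pathlib import Path


def describe_model_file(mount_dir: str, filename: str = "mnist_model.pt") -> str:
    """Report whether the copied model exists under `mount_dir` and its size."""
    model_path = Path(mount_dir) / filename
    if not model_path.is_file():
        return f"{model_path} not found; check the pipeline run and the PVC name"
    return f"{model_path} found ({model_path.stat().st_size} bytes)"


# Hypothetical mount point of the notebook volume
NOTEBOOK_MOUNT = str(Path.home())
print(describe_model_file(NOTEBOOK_MOUNT))
```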