In [1]:
%load_ext lab_black

# Distributed training with PyTorch example

This notebook trains and evaluates a classifier of handwritten digits (using the MNIST dataset). Training is distributed across multiple GPUs.

This flow has important features that differ from the MPI/TensorFlow example and from the other training examples.

## Container Image
The components of this pipeline use a custom-built container image as their base image. The Dockerfile is included in this directory. The container image does *not* include a notebook server, meaning you cannot use it to launch a notebook. The notebook container images from https://quay.io/repository/ibm/kubeflow-notebook-image-ppc64le have a wide range of packages installed, and they can be used to start an interactive notebook server. The base image used in this example has a much smaller set of newer packages installed, including Python 3.10 and PyTorch 1.13 from RocketCE. It also includes pytorch-lightning, which makes it easier to code up models for distributed training.

Since your notebook image and the pipeline image are not the same, you may be able to run code in the pipeline that does not run interactively in the notebook, and you may be able to run code interactively that does not run in the pipeline.

You can see how the custom image was built and which packages it includes by looking at the Dockerfile.
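
Because the two images differ, it can help to record which interpreter and package versions each one provides. The following is a minimal sketch that uses only the standard library; the helper name `describe_environment` and the package list are illustrative assumptions, not part of this example's code. Running it once in a notebook cell and once inside a pipeline component makes the differences visible.

```python
import sys
from importlib import metadata
from typing import Dict, List, Optional


def describe_environment(packages: List[str]) -> Dict[str, Optional[str]]:
    """Report the Python version and the installed version of each package.

    Hypothetical helper: a package that is not installed is reported as None.
    """
    report: Dict[str, Optional[str]] = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}"
    }
    for package in packages:
        try:
            report[package] = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = None
    return report


if __name__ == "__main__":
    # Package names chosen for illustration; adjust to what your code imports.
    for name, version in describe_environment(
        ["torch", "torchvision", "pytorch-lightning"]
    ).items():
        print(f"{name}: {version}")
```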

## Model Training Script
When using distributed training with GPUs, most PyTorch models are trained within a script that is called from the command line, as opposed to a cell in an interactive notebook.
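
One reason a script suits this better than a notebook cell is that the launcher starts one copy of the script per process and tells each copy its position through environment variables. The sketch below shows how a script could read `RANK`, `LOCAL_RANK`, and `WORLD_SIZE`, which `torch.distributed.run` exports. The helper name `read_distributed_env` is hypothetical, and the training script in this example may not need such code because pytorch-lightning handles these details.

```python
import os
from typing import Dict, Mapping


def read_distributed_env(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the process coordinates exported by the launcher.

    Falls back to single-process values so the script also runs
    when it is started directly with `python script.py`.
    """
    return {
        "rank": int(env.get("RANK", "0")),  # global index of this process
        "local_rank": int(env.get("LOCAL_RANK", "0")),  # index within this pod
        "world_size": int(env.get("WORLD_SIZE", "1")),  # total number of processes
    }


if __name__ == "__main__":
    info = read_distributed_env()
    print(
        f"process {info['rank']} of {info['world_size']} "
        f"(local rank {info['local_rank']})"
    )
```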

The script that trains the model is stored in the GitHub repository.

The Kubeflow component will create a PyTorch job that runs the script across a pre-defined number of worker pods. The component will then monitor the job and wait for the job's completion.
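
To give a feel for what such a job looks like, the sketch below builds the general shape of a `PyTorchJob` custom resource as a Python dictionary. It is illustrative only: the function name, the split into one master and the remaining workers, and the restart policy are assumptions, and the manifest that the component actually generates may differ.

```python
from typing import Any, Dict, List


def build_pytorch_job_manifest(
    name: str,
    namespace: str,
    image: str,
    command: List[str],
    num_workers: int,
    gpus_per_worker: int = 1,
) -> Dict[str, Any]:
    """Return an illustrative PyTorchJob manifest (hypothetical helper)."""

    def replica_spec(replicas: int) -> Dict[str, Any]:
        # Every replica runs the same container image and command.
        return {
            "replicas": replicas,
            "restartPolicy": "OnFailure",  # assumption, chosen for illustration
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": "pytorch",
                            "image": image,
                            "command": command,
                            "resources": {
                                "limits": {"nvidia.com/gpu": gpus_per_worker}
                            },
                        }
                    ]
                }
            },
        }

    # Assumption: one pod acts as the master, the remaining pods are workers.
    replica_specs: Dict[str, Any] = {"Master": replica_spec(1)}
    if num_workers > 1:
        replica_specs["Worker"] = replica_spec(num_workers - 1)

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"pytorchReplicaSpecs": replica_specs},
    }


manifest = build_pytorch_job_manifest(
    name="mnist-example",  # placeholder name
    namespace="my-namespace",  # placeholder namespace
    image="quay.io/ntlawrence/pytorchv1.13:1.1",
    command=["python", "-m", "torch.distributed.run", "/workspace/mnist.py"],
    num_workers=6,
)
```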

## Shared Storage
The Kubeflow component needs to make the Python script and training data available to the PyTorchJob, and it needs to be able to obtain the trained model after training is completed.

The pipeline creates a volume as its first step.
It then copies the training data and training script onto this storage so that the workers can access them.
After training, the pipeline copies the model from the shared storage to storage that is accessible outside the pipeline (the notebook's volume in this example).




In [2]:
import kfp
from kfp.components import InputPath, OutputPath
import kfp.dsl as dsl
from kfp.dsl import PipelineConf, data_passing_methods
from kubernetes.client.models import V1Volume, V1PersistentVolumeClaimVolumeSource
import numpy as np
import os
from typing import List, NamedTuple

BASE_IMAGE = "quay.io/ntlawrence/pytorchv1.13:1.1"

# Prepare shared storage

This cell downloads three things to the shared storage:
* Training data
* Test data
* Model training script

The function assumes that the shared volume has been mounted at the path "/workspace". This mount is done in the pipeline definition.
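
Since the function relies on that mount, a small guard at the start of a component can turn a missing or read-only directory into an immediate, readable error. This is a minimal sketch, not part of the example's code; the helper name `require_writable_dir` is hypothetical.

```python
import os


def require_writable_dir(path: str) -> str:
    """Return path if it is an existing, writable directory, else raise.

    Hypothetical helper: it only checks the directory itself and does not
    verify that the directory is backed by a shared volume.
    """
    if not os.path.isdir(path):
        raise FileNotFoundError(f"Expected a mounted directory at {path!r}")
    if not os.access(path, os.W_OK):
        raise PermissionError(f"Directory {path!r} is not writable")
    return path


if __name__ == "__main__":
    try:
        print("using", require_writable_dir("/workspace"))
    except OSError as err:
        # Outside the pipeline there is usually no /workspace directory.
        print("check failed:", err)
```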

In [None]:
def prepare_shared_storage():
    from torchvision.datasets import MNIST
    import urllib.request

    # Download training and test data
    _ = MNIST("/workspace", download=True, train=True)
    _ = MNIST("/workspace", download=True, train=False)

    # Download python model training script
    r = urllib.request.urlretrieve(
        "https://raw.githubusercontent.com/ntl-ibm/kubeflow-ppc64le-examples/multi-gpu-yolov5/pytorch_distributed/mnist/mnist.py",
        "/workspace/mnist.py",
    )


prepare_shared_storage_comp = kfp.components.create_component_from_func(
    prepare_shared_storage, base_image=BASE_IMAGE
)

# Train and Test the model
For this example, both training and evaluating the model are part of the same script.

The training data and training script have been downloaded to shared storage (assumed to be mounted at /workspace).

The pipeline component needs to create a PyTorchJob resource and monitor it. Only the creation and monitoring of the resource happen in this component; the actual training happens in independent worker pods.

Creating and monitoring the resource is complex. The component installs a deploy script (from this repo) that does both in a single function call.

Both this script and the training script could have been included in the container image, which would avoid the need to download them. However, downloading has the advantage of picking up versioned changes from Git without rebuilding the container image.
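
How much of that advantage you get depends on the Git ref that is installed: a branch name picks up new commits on every run, while a tag or commit hash keeps runs reproducible. The sketch below assembles the `pip install` command for a given ref. The helper name `pip_install_command` is hypothetical; the package name, repository, branch, and subdirectory are the ones used in this notebook.

```python
import sys
from typing import List


def pip_install_command(
    package: str, repo_url: str, ref: str, subdirectory: str
) -> List[str]:
    """Build a pip command that installs a package from a Git ref.

    Hypothetical helper. `ref` may be a branch, a tag, or a commit hash.
    """
    requirement = f"{package} @ git+{repo_url}@{ref}#subdirectory={subdirectory}"
    # Using the current interpreter's pip avoids installing into another Python.
    return [sys.executable, "-m", "pip", "install", requirement]


cmd = pip_install_command(
    package="pytorch_distributed_kf_tools",
    repo_url="https://github.com/ntl-ibm/kubeflow-ppc64le-examples",
    ref="multi-gpu-yolov5",  # a branch; swap in a tag or commit hash to pin
    subdirectory="pytorch_distributed/pytorch_distributed_kf_tools",
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting and raises an error if the install fails.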

In [14]:
def train_and_test_model(
    shared_pvc_name: str, mlpipeline_ui_metadata_path: OutputPath(str)
):
    # Install a python package to make deploying and monitoring the pytorch job easier
    import subprocess

    # check=True makes the component fail immediately if the install fails
    subprocess.run(
        "pip install 'pytorch_distributed_kf_tools @ "
        "git+https://github.com/ntl-ibm/kubeflow-ppc64le-examples@multi-gpu-yolov5#"
        "subdirectory=pytorch_distributed/pytorch_distributed_kf_tools'",
        shell=True,
        check=True,
    )

    import pytorch_distributed_kf_tools.deploy as deploy
    import shutil

    ## Start the PyTorch job for distributed training
    deploy.run_pytorch_job(
        # owning_workflow sets things up so that when the pipeline's workflow
        # is deleted, the training job is deleted too
        owning_workflow=deploy.OwningWorkFlow(
            name="{{workflow.name}}", uid="{{workflow.uid}}"
        ),
        # These placeholders for namespace and job name are
        # filled in by Kubeflow when the pipeline runs.
        namespace="{{workflow.namespace}}",
        pytorch_job_name="{{workflow.name}}",
        # Shared volumes used by the training script
        pvcs=[
            deploy.PvcMount(
                pvc_name=(shared_pvc_name),
                mount_path="/workspace",
            )
        ],
        # The command to run in each worker
        # This almost always starts with "torch.distributed.run" for DDP
        command=[
            "python",
            "-m",
            "torch.distributed.run",
            "/workspace/mnist.py",
            "--root_dir=/workspace",
            "--data_dir=/workspace",
            "--model=/workspace/mnist_model.pt",
            "--kubeflow_ui_metadata=/workspace/metadata.json",
            "--max_epochs=10",
        ],
        # Number of workers
        num_workers=6,
        # Number of GPUs per worker (OK to leave this at 1)
        gpus_per_worker=1,
        # The base image used for the worker pods
        worker_image="quay.io/ntlawrence/pytorchv1.13:1.1",
    )

    # The example python script outputs a file metadata.json
    # This file is already in a format suitable to display metrics on the
    # visualizations tab of the component. We just have to copy it to
    # the output path provided by Kubeflow for it to show up.
    shutil.copyfile("/workspace/metadata.json", mlpipeline_ui_metadata_path)


train_and_test_model_comp = kfp.components.create_component_from_func(
    train_and_test_model, base_image=BASE_IMAGE
)

In [5]:
def copy_data(source: str, dest: str):
    import os
    import shutil

    # Make the destination's parent directory if needed
    parent_dir = os.path.dirname(dest)
    if parent_dir:
        os.makedirs(parent_dir, exist_ok=True)

    if os.path.isdir(source):
        shutil.copytree(source, dest)
    else:
        shutil.copyfile(source, dest)


copy_data_comp = kfp.components.create_component_from_func(
    copy_data, base_image=BASE_IMAGE
)

In [6]:
from kubernetes.client import V1VolumeMount


@dsl.pipeline(
    name="Handwritten digit classification",
    description="An example pipeline that trains using distributed pytorch",
)
def mnist_pipeline(notebook_pvc_name: str = "pytorch-minst-volume"):

    create_shared_volume_volop = dsl.VolumeOp(
        name="Create shared volume for training",
        resource_name="shared-pvc",
        modes=dsl.VOLUME_MODE_RWM,
        size="4Gi",
        set_owner_reference=True,
    )

    prepare_shared_storage_task = prepare_shared_storage_comp()
    prepare_shared_storage_task.add_pvolumes(
        {"/workspace": create_shared_volume_volop.volume}
    )

    train_model_task = train_and_test_model_comp(
        create_shared_volume_volop.volume.persistent_volume_claim.claim_name
    )
    train_model_task.add_pvolumes({"/workspace": create_shared_volume_volop.volume})
    train_model_task.after(prepare_shared_storage_task)
    train_model_task.set_display_name("Train and Test Model")

    copy_model_task = copy_data_comp(
        "/workspace/mnist_model.pt", "/target/mnist_model.pt"
    )
    copy_model_task.add_pvolumes({"/workspace": create_shared_volume_volop.volume})
    copy_model_task.add_volume(
        V1Volume(
            name=notebook_pvc_name,
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                notebook_pvc_name
            ),
        )
    )
    copy_model_task.add_volume_mount(
        V1VolumeMount(name=notebook_pvc_name, mount_path="/target")
    )
    copy_model_task.set_display_name("Copy Model to target PVC")
    copy_model_task.after(train_model_task)

In [7]:
PIPELINE_NAME = "MNIST Classification Pipeline"

In [8]:
pipeline_conf = kfp.dsl.PipelineConf()

# Disable Caching
def disable_cache_transformer(op: dsl.ContainerOp):
    if isinstance(op, dsl.ContainerOp):
        op.execution_options.caching_strategy.max_cache_staleness = "P0D"
    else:
        op.add_pod_annotation(
            name="pipelines.kubeflow.org/max_cache_staleness", value="P0D"
        )
    return op


pipeline_conf.add_op_transformer(disable_cache_transformer)

In [9]:
kfp.compiler.Compiler().compile(
    pipeline_func=mnist_pipeline,
    package_path=f"{PIPELINE_NAME}.yaml",
    pipeline_conf=pipeline_conf,
)

In [10]:
def delete_pipeline(pipeline_name: str):
    """Deletes a pipeline with the specified name"""

    client = kfp.Client()
    existing_pipelines = client.list_pipelines(page_size=999).pipelines
    matches = (
        [ep.id for ep in existing_pipelines if ep.name == pipeline_name]
        if existing_pipelines
        else []
    )
    # Named pipeline_id to avoid shadowing the built-in id()
    for pipeline_id in matches:
        client.delete_pipeline(pipeline_id)

In [11]:
def get_experiment_id(experiment_name: str) -> str:
    """Returns the id for the experiment, creating the experiment if needed"""
    client = kfp.Client()
    existing_experiments = client.list_experiments(page_size=999).experiments
    matches = (
        [ex.id for ex in existing_experiments if ex.name == experiment_name]
        if existing_experiments
        else []
    )

    if matches:
        return matches[0]

    exp = client.create_experiment(experiment_name)
    return exp.id

In [12]:
# Pipeline names need to be unique, so before we upload,
# check for and delete any pipeline with the same name
delete_pipeline(PIPELINE_NAME)

# upload
client = kfp.Client()
uploaded_pipeline = client.upload_pipeline(f"{PIPELINE_NAME}.yaml", PIPELINE_NAME)

In [13]:
run = client.run_pipeline(
    experiment_id=get_experiment_id("mnist"),
    job_name="mnist-classification-pipeline",
    pipeline_id=uploaded_pipeline.id,
)