In [82]:
%load_ext lab_black

# Katib Experiment using the UI

This tutorial shows how to run an experiment using Kubeflow's Katib UI. The example finds the best hyperparameters for an MNIST model implemented in PyTorch.

A Katib Experiment runs trials in parallel; each trial is a run of the training algorithm with a specific set of hyperparameters.

In practice, each trial (training run) could be distributed across multiple GPUs, and multiple trials could also run in parallel. In this example, trials do not distribute training across GPUs, but multiple trials do run in parallel. This is a simple way to use multiple GPUs within the experiment without needing distributed training for each trial.

We will place our training data and training script on a shared PVC so that they are available to all trials running in parallel.

The example shows how to interact with the UI and does NOT make use of Kubeflow pipelines. (The Katib SDK does enable pipeline integration, which may be covered in other examples).


# Setup

The notebook needs a data volume to hold the training data and script. The volume must be mounted with the ReadWriteMany access mode.

You will have to recreate the notebook server and mount a volume if this volume is not already mounted.

In [83]:
VOLUME_NAME = "my-notebook-datavol-1"
VOLUME_MOUNT_POINT = "/home/jovyan/my-notebook-datavol-1"
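
Before downloading data, it can help to confirm that the mount point exists and is writable from the notebook. The helper below is a minimal sketch: `volume_is_writable` is a hypothetical name, and the check only shows that this notebook can write to the path. It cannot confirm the ReadWriteMany access mode.

```python
import os
import tempfile


def volume_is_writable(mount_point):
    """Return True if mount_point is an existing directory we can write to."""
    if not os.path.isdir(mount_point):
        return False
    try:
        # Create and immediately remove a temporary file inside the directory
        with tempfile.NamedTemporaryFile(dir=mount_point):
            pass
    except OSError:
        return False
    return True
```

In this notebook it could be called as `volume_is_writable(VOLUME_MOUNT_POINT)`; a result of False suggests the notebook server needs to be recreated with the volume mounted.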

# Download training data & Copy Script

In [84]:
from torchvision.datasets import MNIST

_ = MNIST(VOLUME_MOUNT_POINT, download=True, train=True)
_ = MNIST(VOLUME_MOUNT_POINT, download=True, train=False)

In [85]:
!cp ../../distributed_training/pytorch/mnist/mnist.py $VOLUME_MOUNT_POINT/mnist.py

# Define a trial

The first step of any experiment is to define a trial that takes the parameters to be optimized. We'll create a template for a Kubernetes Job that runs the training in PyTorch. The Job defines a pod that mounts the storage containing the training data and script, and the pod invokes the Python script in its command.

Several parameters to the script are defined to reference Katib search parameters. Katib will replace these with values provided by the suggestion during the experiment.
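
To make the substitution concrete, the sketch below mimics its effect in plain Python. Katib does this itself when it creates each trial; the `render` function and the sample values (10 epochs, a learning rate of 0.02) are made up for illustration.

```python
# Illustration only: Katib performs this substitution itself for each trial.
command_template = [
    "--max_epochs=${trialParameters.epochs}",
    "--lr=${trialParameters.lr}",
]


def render(args, trial_parameters):
    """Replace each ${trialParameters.<name>} placeholder with its value."""
    rendered = []
    for arg in args:
        for name, value in trial_parameters.items():
            arg = arg.replace("${trialParameters." + name + "}", str(value))
        rendered.append(arg)
    return rendered


# Hypothetical values standing in for one suggestion
print(render(command_template, {"epochs": 10, "lr": 0.02}))
```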

Take care that any data written to common storage (/workspace) does not conflict with other trials running in parallel. Our program uses '/tmp' as the root directory and disables checkpointing to avoid conflicts.

The Kubernetes Python client is used here to create the Job definition, although it is also common to write the Job's YAML declaration as a string.

In [86]:
from kubernetes import client, config
from kubernetes.client import (
    V1ObjectMeta,
    V1ConfigMap,
    V1Job,
    V1JobSpec,
    V1PodTemplateSpec,
    V1PodSpec,
    V1Container,
    V1Volume,
    V1VolumeMount,
    V1PersistentVolumeClaimVolumeSource,
    V1EmptyDirVolumeSource,
    V1ResourceRequirements,
)

job = V1Job(
    api_version="batch/v1",
    kind="Job",
    spec=V1JobSpec(
        template=V1PodTemplateSpec(
            metadata=V1ObjectMeta(
                # https://github.com/kubeflow/website/issues/2011
                annotations={"sidecar.istio.io/inject": "false"},
            ),
            spec=V1PodSpec(
                restart_policy="Never",
                containers=[
                    V1Container(
                        name="pytorch",
                        image="quay.io/ntlawrence/mnist-dist-pytorch:1.0.4",
                        command=[
                            "python",
                            "/workspace/mnist.py",
                            "--root_dir=/tmp/workspace",
                            "--data_dir=/workspace",
                            "--model=/tmp/mnist_model.pt",
                            "--batch_size=672",
                            "--max_epochs=${trialParameters.epochs}",
                            "--lr=${trialParameters.lr}",
                            "--no-checkpoint",
                            "--metric_log_file=/var/log/katib/metrics.log",
                        ],
                        resources=V1ResourceRequirements(limits={"nvidia.com/gpu": 1}),
                        volume_mounts=[
                            V1VolumeMount(mount_path="/workspace", name="workspace"),
                            V1VolumeMount(mount_path="/dev/shm", name="dshm"),
                        ],
                    )
                ],
                volumes=[
                    V1Volume(
                        name="workspace",
                        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                            claim_name=VOLUME_NAME
                        ),
                    ),
                    V1Volume(
                        name="dshm", empty_dir=V1EmptyDirVolumeSource(medium="Memory")
                    ),
                ],
            ),
        )
    ),
)

## Convert to YAML

The Python code in the previous cell is a programmatic way of creating the declaration for the Job. The next cell shows what the declaration looks like in YAML.

In [79]:
from ruamel.yaml import YAML
from ruamel.yaml.compat import StringIO

yaml = YAML()
s = StringIO()
client_api = client.ApiClient()
# sanitize_for_serialization converts the V1Job object into plain dicts and lists
yaml.dump(client_api.sanitize_for_serialization(job), s)
print(s.getvalue())

apiVersion: batch/v1
kind: Job
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: 'false'
    spec:
      containers:
      - command:
        - python
        - /workspace/mnist.py
        - --root_dir=/tmp/workspace
        - --data_dir=/workspace
        - --model=/tmp/mnist_model.pt
        - --batch_size=672
        - --max_epochs=${trialParameters.epochs}
        - --lr=${trialParameters.lr}
        - --no-checkpoint
        - --metric_log_file=/var/log/katib/metrics.log
        image: quay.io/ntlawrence/mnist-dist-pytorch:1.0.4
        name: pytorch
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /workspace
          name: workspace
        - mountPath: /dev/shm
          name: dshm
      restartPolicy: Never
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: my-notebook-datavol-1
      - emptyDir:
          medium: Memory
        name: dshm



# Create a configmap with the template

It is possible to directly copy the YAML from the previous cell when creating the Katib experiment.

If many experiments will be created using the same trial template, the template's YAML can be saved in a configmap and referenced in each experiment.

This next cell creates a config map with the trial template as the value for pytorch-mnist-job.

In [80]:
trials_config_map = client.V1ConfigMap(
    metadata=V1ObjectMeta(
        name="katib-example-trial-defs",
        labels={"katib.kubeflow.org/component": "trial-templates"},
    ),
    data={"pytorch-mnist-job": s.getvalue()},
)

# The pod's namespace is exposed through the mounted service account files
with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
    NAMESPACE = f.read().strip()

In [81]:
config.load_incluster_config()
k8s_api = client.CoreV1Api()
# Uncomment to delete a config map left over from an earlier run
# rsp = k8s_api.delete_namespaced_config_map(namespace=NAMESPACE, name="katib-example-trial-defs")
rsp = k8s_api.create_namespaced_config_map(namespace=NAMESPACE, body=trials_config_map)

# Create Experiment

Now that the trial has been defined, we are ready to use Katib to optimize the hyperparameters.


Navigate to the Experiments (AutoML) panel and click the New Experiment button.

<img src="./images/Katib_AutoML.jpeg" alt="Create Experiment Button" width="700"/>


Give the experiment a name such as 'my-experiment'.

The trial thresholds control the maximum number of trials that can run in parallel and the maximum number of trials that will be explored. For this experiment, leave these at the default values.

<img src="./images/Trial_Thresholds.jpeg" alt="Set Thresholds" width="700"/>


The objective controls what metric should be used to measure the result of the hyperparameters.

For this tutorial, we want to maximize the f1 score. In practice, we could set a goal value such that the experiment would end as soon as a set of parameters met or exceeded the goal. We'll set the goal to a very high value, 0.999, so that the experiment runs to completion.

Add acc as an additional metric. Accuracy will then be tracked and reported in the experiment, but the experiment will not try to maximize it.

<img src="./images/Trial_Objective.jpeg" alt="Objective Function" width="700"/>


Use Bayesian Optimization as the search algorithm.

In the algorithm settings, the "Random State" should be set to a value. This makes the experiment more repeatable. This example used 42.

<img src="./images/Search_Alg.jpeg" alt="Search Algorithm" width="700"/>

For this tutorial, leave the Early Stopping as "None".

<img src="./images/Early_Stop.jpeg" alt="Early Stopping" width="700"/>

Set the search space of the hyperparameters. This tells Katib the space of values that it should consider for each suggestion.
This example optimizes two hyperparameters, lr and epochs.
* Use the existing values for lr
* Delete the other parameters
* Add an epochs parameter, type int with a range of 5 to 25 and step 5
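
For reference, the epochs entry can also be written as data. The sketch below assumes the field names of Katib's v1beta1 Experiment spec (parameterType, feasibleSpace); `grid_values` is a hypothetical helper that simply lists the values such a range and step describe. How a given search algorithm uses the step may differ.

```python
# Assumed shape of one search-space entry (Katib v1beta1 field names)
epochs_parameter = {
    "name": "epochs",
    "parameterType": "int",
    "feasibleSpace": {"min": "5", "max": "25", "step": "5"},
}


def grid_values(feasible_space):
    """List the integer values from min to max (inclusive) in increments of step."""
    low = int(feasible_space["min"])
    high = int(feasible_space["max"])
    step = int(feasible_space["step"])
    return list(range(low, high + 1, step))


print(grid_values(epochs_parameter["feasibleSpace"]))  # [5, 10, 15, 20, 25]
```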

<img src="./images/Hyper_Parameters.jpeg" alt="Hyper Parameters" width="700"/>

The Python script is designed to write metrics to a file. We'll use a File metrics collector, with the default file name.
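
As a rough picture of what a file-based collector works with, the sketch below writes and reads back one line of metrics. It assumes a plain-text format in which each metric appears as a name=value pair; the exact lines written by mnist.py are not shown in this tutorial, and the pattern, helper names, and values here are illustrative, not Katib's own.

```python
import re

# Simplified pattern for illustration; not Katib's own filter
METRIC_PAIR = re.compile(r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?)")


def format_metrics(epoch, metrics):
    """Build one log line such as 'epoch=3 f1=0.97 acc=0.98'."""
    pairs = " ".join(f"{name}={value}" for name, value in metrics.items())
    return f"epoch={epoch} {pairs}"


def parse_metrics(line, wanted):
    """Extract the requested metric names from a log line as floats."""
    found = {name: float(value) for name, value in METRIC_PAIR.findall(line)}
    return {name: found[name] for name in wanted if name in found}


# Made-up values for one epoch
line = format_metrics(3, {"f1": 0.97, "acc": 0.98})
print(line)
print(parse_metrics(line, ["f1", "acc"]))
```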

<img src="./images/Metrics_Collector.jpeg" alt="Metrics Collector" width="700"/>

Fill in the trial template. Because we saved the Job description in a config map, we can reference that here.

The primary container name should be set to 'pytorch', since this is the name of the container our training code runs under.
The YAML for the Job is filled in automatically after you change the configMap name and namespace to those of the config map created in this notebook.

<img src="./images/Trial_Template_1.jpeg" alt="Trial Template" width="700"/>

Fill in the reference variables under the template. This is how variables in the template are matched to the parameters from the search space.
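
A mismatch between the template placeholders and the references is easy to introduce by hand. The sketch below is one way to check the matching before creating the experiment; `find_mismatches` is a hypothetical helper, and the name/reference layout of each entry is an assumption modelled on the UI fields.

```python
import re

PLACEHOLDER = re.compile(r"\$\{trialParameters\.([\w-]+)\}")


def find_mismatches(template_text, trial_parameters, search_space_names):
    """Return (placeholders with no declaration, declarations with unknown references)."""
    placeholders = set(PLACEHOLDER.findall(template_text))
    declared = {p["name"] for p in trial_parameters}
    undeclared = sorted(placeholders - declared)
    bad_refs = sorted(
        p["name"] for p in trial_parameters if p["reference"] not in search_space_names
    )
    return undeclared, bad_refs


# The two placeholders used in this tutorial's Job template
template_text = "--max_epochs=${trialParameters.epochs} --lr=${trialParameters.lr}"
trial_parameters = [
    {"name": "epochs", "reference": "epochs"},
    {"name": "lr", "reference": "lr"},
]
print(find_mismatches(template_text, trial_parameters, {"lr", "epochs"}))  # ([], [])
```

Two empty lists mean every placeholder has a declaration and every reference points at a parameter in the search space.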

<img src="./images/Trial_Template_2.jpeg" alt="Trial Template references" width="700"/>

Press Create to create the experiment and start it running.

The running experiment will appear in the experiments list.

<img src="./images/Running_Experiment.jpeg" alt="Experiment" width="700"/>

The best parameters are shown in the overview of the experiment.

<img src="./images/Best_Trial.jpeg" alt="Optimal Hyperparameters" width="700"/>