# Katib Experiment using the UI

This tutorial shows how to run an experiment using Kubeflow's Katib UI. The example finds the best hyperparameters for an MNIST model implemented in PyTorch.

A Katib Experiment runs trials in parallel; each trial is a run of the training algorithm with a specific set of hyperparameters.

In practice, each trial (training run) could use multiple GPUs, and multiple trials could run in parallel. This example trains an MNIST model with a single GPU per trial, but with multiple trials running in parallel. This is a simple way to use multiple GPUs in the experiment without needing distributed training for each trial.

We will place our training data and training script on a shared PVC so that they are available to all trials running in parallel.

The example does NOT make use of Kubeflow Pipelines.


# Setup

The notebook needs a data volume to hold the training data and model file. The volume must be mounted with the ReadWriteMany access mode so that trials running in parallel can all use it.

You will have to recreate the notebook server and mount a volume if this volume is not already mounted.
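
If the volume had to be created by hand rather than through the notebook server UI, the claim could look roughly like the sketch below. The helper name `make_rwx_pvc` and the 10Gi size are assumptions for illustration only; the access mode is the part that matters here.

```python
# A minimal sketch of a PersistentVolumeClaim manifest as a Python dict.
# make_rwx_pvc and the default size are hypothetical, not part of the tutorial.
def make_rwx_pvc(name, size="10Gi"):
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets pods on different nodes mount the volume read-write.
            "accessModes": ["ReadWriteMany"],
            "resources": {"requests": {"storage": size}},
        },
    }

pvc = make_rwx_pvc("my-notebook-datavol-1")
print(pvc["spec"]["accessModes"])
```

Whether ReadWriteMany is available depends on the storage class used by the cluster.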

In [59]:
VOLUME_NAME = 'my-notebook-datavol-1'
VOLUME_MOUNT_POINT = '/home/jovyan/my-notebook-datavol-1'
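
Before downloading anything, it can help to confirm that the mount point exists and is writable from the notebook. The helper below is a hypothetical convenience, not part of the original notebook; it only checks write access from this pod and says nothing about the volume's access mode.

```python
import os
import tempfile

def volume_is_writable(mount_point):
    """Return True if mount_point is an existing directory we can create files in."""
    if not os.path.isdir(mount_point):
        return False
    try:
        # Creating (and automatically deleting) a temp file proves write access.
        with tempfile.NamedTemporaryFile(dir=mount_point):
            pass
        return True
    except OSError:
        return False
```

In the notebook this would be called as `volume_is_writable(VOLUME_MOUNT_POINT)`.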

# Download Training Data and Copy Script

In [60]:
from torchvision.datasets import MNIST
_ = MNIST(VOLUME_MOUNT_POINT, download=True, train=True)
_ = MNIST(VOLUME_MOUNT_POINT, download=True, train=False)

In [61]:
!cp ../../distributed_training/pytorch/mnist/mnist.py $VOLUME_MOUNT_POINT/mnist.py

# Create Experiment



Navigate to the Experiments (AutoML) panel.


Click New Experiment and name the experiment bayesian.

The trial thresholds control the maximum number of trials that can run in parallel and the maximum number of trials that will be explored.

The objective controls which metric is used to evaluate each set of hyperparameters.

For this tutorial, we want to maximize the F1 score. In practice, we could set a goal value so that the experiment ends as soon as a set of parameters meets or exceeds the goal. We'll set the goal to a very high 0.999 so that the experiment runs to completion.

Add acc as an additional metric. This means that accuracy will be tracked and reported in the experiment.
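
For reference, the settings entered so far correspond roughly to the following fields of a Katib Experiment spec. This is a sketch: the trial counts are illustrative assumptions, and the metric name `f1` is assumed to be the name under which the training script reports its F1 score.

```python
# Sketch of the thresholds and objective as they might appear in an Experiment spec.
experiment_spec = {
    "parallelTrialCount": 3,    # assumption: trials running at the same time
    "maxTrialCount": 12,        # assumption: total trials to explore
    "maxFailedTrialCount": 3,   # assumption: failures tolerated before stopping
    "objective": {
        "type": "maximize",
        "goal": 0.999,                     # high goal so the experiment runs to completion
        "objectiveMetricName": "f1",       # assumed metric name
        "additionalMetricNames": ["acc"],  # tracked and reported, not optimized
    },
}

# Running more trials in parallel than the total budget would make no sense.
assert experiment_spec["parallelTrialCount"] <= experiment_spec["maxTrialCount"]
```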


Use Bayesian Optimization as the search algorithm.

This tutorial does not use Early Stopping, but for some projects stopping early may be important.

Set the search space of the hyperparameters.
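
As a rough guide to what the UI produces, the algorithm and search space could be expressed as the spec fragment below. The parameter names `lr` and `epochs` match the placeholders used in the trial template in this tutorial; the ranges are assumptions chosen only for illustration.

```python
# Sketch of the algorithm and search-space portion of an Experiment spec.
search_spec = {
    "algorithm": {"algorithmName": "bayesianoptimization"},
    "parameters": [
        {
            "name": "lr",
            "parameterType": "double",
            # Bounds are given as strings; these values are assumed.
            "feasibleSpace": {"min": "0.0001", "max": "0.1"},
        },
        {
            "name": "epochs",
            "parameterType": "int",
            "feasibleSpace": {"min": "1", "max": "5"},  # assumed range
        },
    ],
}

parameter_names = [p["name"] for p in search_spec["parameters"]]
print(parameter_names)
```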

The Python script is designed to write metrics to a JSON file, so we'll use a File metrics collector.
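
To make the idea concrete, the sketch below writes one JSON object per line to a metrics file and shows a collector spec pointing at the log path that the trial command in this tutorial passes via `--metric_log_file`. The `log_metrics` helper is hypothetical and is not the code in mnist.py; metric values are serialized as strings as a conservative choice, so check what the Katib version in use accepts.

```python
import json
import os
import tempfile
import time

def log_metrics(path, epoch, **metrics):
    """Append one JSON line containing the epoch, a timestamp, and the metrics."""
    record = {"epoch": epoch, "timestamp": time.time()}
    record.update({name: str(value) for name, value in metrics.items()})
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Sketch of a File metrics collector reading a JSON-formatted log.
metrics_collector_spec = {
    "collector": {"kind": "File"},
    "source": {
        "fileSystemPath": {"path": "/tmp/hyper_log.json", "kind": "File", "format": "JSON"}
    },
}

# Small demonstration against a temporary file instead of the real log path.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "hyper_log.json")
    log_metrics(demo_path, 0, f1=0.91, acc=0.93)
    log_metrics(demo_path, 1, f1=0.95, acc=0.96)
    with open(demo_path) as f:
        records = [json.loads(line) for line in f]
```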

The trial template defines how each trial should be launched.

We'll launch each trial as a Job. A Job is a little less powerful than a PyTorchJob, which supports distributed training for PyTorch. However, we already get parallelism by running trials in parallel, and a Job is a bit simpler to use with Katib.


In [55]:
from kubernetes import client, config
from kubernetes.client import (V1ObjectMeta, 
                              V1ConfigMap, 
                              V1Job, 
                              V1JobSpec, 
                              V1PodTemplateSpec, 
                              V1PodSpec, 
                              V1Container, 
                              V1Volume, 
                              V1VolumeMount,
                              V1PersistentVolumeClaimVolumeSource,
                              V1EmptyDirVolumeSource,
                              V1ResourceRequirements)

job = V1Job(
    api_version="batch/v1",
    kind="Job",
    spec=V1JobSpec(
        template=V1PodTemplateSpec(
            spec=V1PodSpec(
                restart_policy="Never",
                containers=[
                    V1Container(
                        name="pytorch",
                        image="quay.io/ntlawrence/mnist-dist-pytorch:1.0.4",
                        command=[
                            "python",
                            "/workspace/mnist.py",
                            "--root_dir=/tmp/workspace",
                            "--data_dir=/workspace",
                            "--model=/tmp/mnist_model.pt",
                            "--batch_size=672",
                            "--max_epochs=${trialParameters.epochs}",
                            "--lr=${trialParameters.lr}",
                            "--no-checkpoint",
                            "--metric_log_file=/tmp/hyper_log.json",
                        ],
                        resources=V1ResourceRequirements(limits={"nvidia.com/gpu": 1}),
                        volume_mounts=[
                            V1VolumeMount(mount_path="/workspace", name="workspace"),
                            V1VolumeMount(mount_path="/dev/shm", name="dshm"),
                        ],
                    )
                ],
                volumes=[
                    V1Volume(
                        name="workspace",
                        persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name=VOLUME_NAME),
                    ),
                    V1Volume(name="dshm", empty_dir=V1EmptyDirVolumeSource(medium="Memory")),
                ],
            )
        )
    ),
)

In [56]:
from ruamel.yaml import YAML
from ruamel.yaml.compat import StringIO
yaml=YAML()
s = StringIO()
client_api = client.ApiClient()
yaml.dump(client_api.sanitize_for_serialization(job), s)
print(s.getvalue())

apiVersion: batch/v1
kind: Job
spec:
  template:
    spec:
      containers:
      - command:
        - python
        - /workspace/mnist.py
        - --root_dir=/tmp/workspace
        - --data_dir=/workspace
        - --model=/tmp/mnist_model.pt
        - --batch_size=672
        - --max_epochs=${trialParameters.epochs}
        - --lr=${trialParameters.lr}
        - --no-checkpoint
        - --metric_log_file=/tmp/hyper_log.json
        image: quay.io/ntlawrence/mnist-dist-pytorch:1.0.4
        name: pytorch
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /workspace
          name: workspace
        - mountPath: /dev/shm
          name: dshm
      restartPolicy: Never
      volumes:
      - name: workspace
        persistentVolumeClaim:
          claimName: my-notebook-datavol-1
      - emptyDir:
          medium: Memory
        name: dshm
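
The `${trialParameters.epochs}` and `${trialParameters.lr}` placeholders in the template above are filled in for each trial with the values chosen by the search algorithm. Conceptually this works like the sketch below; it is an illustration only, not Katib's implementation, and `render_trial_arg` is a hypothetical helper.

```python
import re

def render_trial_arg(arg, assignments):
    """Replace every ${trialParameters.<name>} in arg with its assigned value."""
    def substitute(match):
        return str(assignments[match.group(1)])
    return re.sub(r"\$\{trialParameters\.(\w+)\}", substitute, arg)

template_args = ["--max_epochs=${trialParameters.epochs}", "--lr=${trialParameters.lr}"]
# Example assignment for a single trial; the values are made up.
rendered = [render_trial_arg(a, {"epochs": 3, "lr": 0.01}) for a in template_args]
print(rendered)
```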



In [57]:
trials_config_map = client.V1ConfigMap(
    metadata=V1ObjectMeta(
     name="katib-trial-defs",
     labels={"katib.kubeflow.org/component": "trial-templates"}
    ),
    data={"pytorch-mnist-job" : s.getvalue()}
)

with open("/var/run/secrets/kubernetes.io/serviceaccount/namespace", "r") as f:
    NAMESPACE = f.read().strip()  # strip any trailing whitespace from the file

In [58]:
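config.load_incluster_config()
k8s_api = client.CoreV1Api()
try:  # remove any earlier copy of the ConfigMap
    k8s_api.delete_namespaced_config_map(namespace=NAMESPACE, name="katib-trial-defs")
except client.rest.ApiException as e:
    if e.status != 404:  # a 404 just means there was nothing to delete
        raise
rsp = k8s_api.create_namespaced_config_map(namespace=NAMESPACE, body=trials_config_map)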
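
With the ConfigMap in place, the experiment's trial template can refer to it instead of embedding the Job inline. The fragment below sketches what that reference might look like in the Experiment spec. It assumes the search-space parameters are named `epochs` and `lr`; `make_trial_template` and the example namespace are hypothetical.

```python
def make_trial_template(namespace):
    """Build a trialTemplate fragment that points at the stored Job template."""
    return {
        # Container whose exit status and metrics Katib should watch.
        "primaryContainerName": "pytorch",
        # Map template placeholders to search-space parameters (names assumed).
        "trialParameters": [
            {"name": "epochs", "reference": "epochs", "description": "Training epochs"},
            {"name": "lr", "reference": "lr", "description": "Learning rate"},
        ],
        "configMap": {
            "configMapName": "katib-trial-defs",
            "configMapNamespace": namespace,
            "templatePath": "pytorch-mnist-job",  # key inside the ConfigMap's data
        },
    }

trial_template = make_trial_template("example-namespace")
print(trial_template["configMap"])
```

In the notebook, the `NAMESPACE` value read from the service account would be passed in place of the example namespace.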