# Create PyTorchJob using Kubeflow Training SDK

This notebook is a sample for the Kubeflow Training SDK, `kubeflow-training`.

It shows how to use the SDK to create a PyTorchJob, fetch it, wait for it to finish, check whether it succeeded, read its logs, and delete it.

## Install the Kubeflow Training Python SDK

Install the Kubeflow Training SDK before running this notebook.

In [1]:
# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.
!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python

Collecting git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python
  Cloning https://github.com/kubeflow/training-operator.git to /tmp/pip-req-build-gr02xvgw
  Running command git clone --filter=blob:none --quiet https://github.com/kubeflow/training-operator.git /tmp/pip-req-build-gr02xvgw
  Resolved https://github.com/kubeflow/training-operator.git to commit 9e46f9d422e71f258679c5edd306c7eddf9004f1
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kubeflow-training
  Building wheel for kubeflow-training (setup.py) ... done
  Created wheel for kubeflow-training: filename=kubeflow_training-1.8.1-py3-none-any.whl size=140130 sha256=262bfe27f6fb930f3f4579f4b29b66e2300db0e02ab8a8f20a481b4a35587d81
  Stored in directory: /tmp/pip-ephem-wheel-cache-gdv_hr4q/wheels/4e/97/bb/7c46e489ad7772669c94e462b1f545c475d32d70259ba08209
Successfully built kubeflow-training
Installing collected packages: kubeflow-training
Successfully i

In [2]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1VolumeMount
from kubernetes.client import V1PersistentVolumeClaimVolumeSource
from kubernetes.client import V1Volume

from kubeflow.training import KubeflowOrgV1ReplicaSpec
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import KubeflowOrgV1PyTorchJobSpec
from kubeflow.training import KubeflowOrgV1RunPolicy
from kubeflow.training import TrainingClient

from kubeflow.training import constants

In [3]:
# Get the namespace from the NB_NAMESPACE environment variable.
import os
import kubernetes

namespace = os.environ.get('NB_NAMESPACE', 'default')


def get_my_home():
    # HOSTNAME looks like 'notebook_name-0'. Strip only the trailing ordinal
    # so that notebook names containing hyphens stay intact.
    notebook_name = os.environ.get('HOSTNAME', 'notebook').rsplit('-', 1)[0]

    # Load the in-cluster configuration.
    kubernetes.config.load_incluster_config()

    # Create a client for the Notebook custom resource.
    crd_api = kubernetes.client.CustomObjectsApi()

    crd_group = 'kubeflow.org'
    crd_version = 'v1alpha1'
    crd_plural = 'notebooks'

    # Fetch the Notebook object.
    notebook = crd_api.get_namespaced_custom_object(crd_group, crd_version, namespace, crd_plural, notebook_name)
    volumes = notebook['spec']['template']['spec']['volumes']
    print(volumes[0])
    # The PVC is assumed to be the second volume; here the first one is the 'dshm' emptyDir.
    pvc_name = volumes[1]['persistentVolumeClaim']['claimName']
    print(pvc_name)
    return pvc_name
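
Picking the PVC by a fixed index depends on the order of the notebook's volumes. The following is a minimal sketch, assuming the volumes are plain dictionaries in the usual Kubernetes pod-volume shape, that selects the first volume backed by a PersistentVolumeClaim instead. `first_pvc_name` is a hypothetical helper, not part of the SDK, and the sample data (including the volume name `workspace`) is made up for illustration.

```python
def first_pvc_name(volumes):
    """Return the claim name of the first PVC-backed volume, or None."""
    for volume in volumes:
        claim = volume.get("persistentVolumeClaim")
        if claim:
            return claim["claimName"]
    return None


# Made-up sample: a shared-memory volume followed by one PVC-backed volume.
sample_volumes = [
    {"emptyDir": {"medium": "Memory"}, "name": "dshm"},
    {"name": "workspace", "persistentVolumeClaim": {"claimName": "torchjob-workspace"}},
]
print(first_pvc_name(sample_volumes))
```

Returning `None` when no PVC is found lets the caller decide whether to fall back to a default claim or stop with an error.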

## Define PyTorchJob

The demo creates a PyTorchJob with one master replica and four worker replicas to run the MNIST sample.

In [4]:
name = "pytorch-dist-mnist-gloo"
namespace = "dm1261010"
container_name = "pytorch"

volume_mount = V1VolumeMount(
    name="model-volume",
    mount_path="/home/jovyan/",
)

volume = V1Volume(
    name="model-volume",
    persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="model-data")
)

container = V1Container(
    name=container_name,
    image="cguaicadmin/newlab-newpytorch:V1.0.23",
    command=["/home/jovyan/test.sh"],
    working_dir="/home/jovyan",
    resources={
        "requests": {
            "cpu": "4",
            "memory": "8Gi",
            "nvidia.com/gpu": "1"
        },
        "limits": {
            "cpu": "4",
            "memory": "8Gi",
            "nvidia.com/gpu": "1"
        }
    },
    volume_mounts=[volume_mount],
)

pod_spec = V1PodSpec(
    containers=[container],
    volumes=[volume]
)

replica_spec = KubeflowOrgV1ReplicaSpec(
    replicas=4,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(
            name=name,
            namespace=namespace,
            annotations={
                "sidecar.istio.io/inject": "false"
            }
        ),
        spec=V1PodSpec(
            containers=[container],
            volumes=[V1Volume(
                name="model-volume",
                persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name=get_my_home())
            )]
        )
    )
)

master_replica_spec = KubeflowOrgV1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(
            name=name,
            namespace=namespace,
            annotations={
                "sidecar.istio.io/inject": "false"
            }
        ),
        spec=V1PodSpec(
            containers=[container],
            volumes=[V1Volume(
                name="model-volume",
                persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(claim_name="torchjob-workspace")
            )]
        )
    )
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version=constants.API_VERSION,
    kind=constants.PYTORCHJOB_KIND,
    metadata=V1ObjectMeta(name=name, namespace=namespace),
    spec=KubeflowOrgV1PyTorchJobSpec(
        run_policy=KubeflowOrgV1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={
            "Master": master_replica_spec,
            "Worker": replica_spec
        },
    ),
)

{'emptyDir': {'medium': 'Memory'}, 'name': 'dshm'}
torchjob-workspace
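
For orientation, the SDK objects above correspond to a PyTorchJob custom resource. The sketch below builds a trimmed-down version of that manifest as a plain dictionary, using only the standard library. `replica_spec_dict` is a hypothetical helper, the values are copied from the cell above, and the resource requests are omitted for brevity. It is meant to illustrate the shape of the resource, not to replace the SDK models.

```python
import json


def replica_spec_dict(replicas, claim_name):
    """Build one replica spec (hypothetical helper, for illustration only)."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {
                "containers": [{
                    "name": "pytorch",
                    "image": "cguaicadmin/newlab-newpytorch:V1.0.23",
                    "command": ["/home/jovyan/test.sh"],
                    "volumeMounts": [{"name": "model-volume", "mountPath": "/home/jovyan/"}],
                }],
                "volumes": [{
                    "name": "model-volume",
                    "persistentVolumeClaim": {"claimName": claim_name},
                }],
            },
        },
    }


manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": "pytorch-dist-mnist-gloo"},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "pytorchReplicaSpecs": {
            "Master": replica_spec_dict(1, "torchjob-workspace"),
            "Worker": replica_spec_dict(4, "torchjob-workspace"),
        },
    },
}
print(json.dumps(manifest, indent=2))
```

Building both replica specs from one helper keeps the master and worker templates in sync; only the replica count and the claim name vary.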


## Create PyTorchJob

You need to create a Training Client to deploy the PyTorchJob in your cluster.

In [5]:
# The namespace is reused by every API call.
training_client = TrainingClient(namespace=namespace)

# If `job_kind` is not set in `TrainingClient`, it must be set on each API call.
training_client.create_job(pytorchjob)  # , job_kind=constants.PYTORCHJOB_KIND

## Get the Created PyTorchJob

You can verify the name of the created PyTorchJob:

In [6]:
training_client.get_job(name, job_kind=constants.PYTORCHJOB_KIND).metadata.name

'pytorch-dist-mnist-gloo'

## Get the PyTorchJob Conditions

In [8]:
training_client.get_job_conditions(name=name, job_kind=constants.PYTORCHJOB_KIND)

[]
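
An empty list means the job has not reported any condition yet. Once conditions appear, a small helper can summarize them. The sketch below assumes each condition exposes `type` and `status` attributes, with `status` being the string `"True"` or `"False"`, as Kubernetes-style job conditions do. `current_condition` is a hypothetical helper and the sample data is made up for illustration.

```python
from types import SimpleNamespace


def current_condition(conditions):
    """Return the type of the last condition whose status is "True", or None."""
    active = [c.type for c in conditions if c.status == "True"]
    return active[-1] if active else None


# Made-up sample: the job was created and is now running.
sample = [
    SimpleNamespace(type="Created", status="True"),
    SimpleNamespace(type="Running", status="True"),
]
print(current_condition(sample))
print(current_condition([]))
```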

## Wait Until PyTorchJob Finishes

In [None]:
pytorchjob = training_client.wait_for_job_conditions(
    name=name,
    job_kind=constants.PYTORCHJOB_KIND,
    wait_timeout=900,
)

print(f"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}")
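
Conceptually, waiting for a job is a poll-until-timeout loop. The sketch below is not the SDK's implementation; `wait_until` and its parameters are hypothetical, and the fake status source stands in for a real status call against the cluster.

```python
import time


def wait_until(check, timeout=900, interval=5, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)


# Fake status source: reports success on the third poll.
polls = iter([False, False, True])
result = wait_until(lambda: next(polls), timeout=10, interval=0)
print(result)
```

Raising `TimeoutError` rather than returning `False` makes a job that never finishes hard to ignore in a notebook run.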

## Verify That the PyTorchJob Succeeded

In [None]:
training_client.is_job_succeeded(name=name, job_kind=constants.PYTORCHJOB_KIND)

## Get the PyTorchJob Training Logs

In [None]:
# get_job_logs returns a tuple; the first element maps pod names to their logs.
logs, _ = training_client.get_job_logs(name=name, job_kind=constants.PYTORCHJOB_KIND)
# Print the master pod's log line by line.
for line in logs[f"{name}-master-0"].split('\n'):
    print(line)
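
When the logs are long, it can help to print only the tail of each pod's output. The sketch below assumes the logs are a dictionary mapping pod names to log text, as in the cell above. `tail_logs` is a hypothetical helper and the sample text is made up.

```python
def tail_logs(logs, n=3):
    """Return the last n lines of each pod's log, keyed by pod name."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return {pod: text.splitlines()[-n:] for pod, text in logs.items()}


# Made-up sample standing in for the dictionary of pod logs.
sample_logs = {"pytorch-dist-mnist-gloo-master-0": "line 1\nline 2\nline 3\nline 4"}
for pod, lines in tail_logs(sample_logs, n=2).items():
    print(pod)
    for line in lines:
        print("  " + line)
```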

## Delete the PyTorchJob

In [9]:
training_client.delete_job(name)