# Sample for Kubeflow PyTorchJob SDK

This is a sample for the Kubeflow Training SDK, `kubeflow-training`.

The notebook shows how to use the Kubeflow Training SDK to create a PyTorchJob, retrieve it, wait for it to finish, check its status, read its logs, and delete it.

## Install Kubeflow Training Python SDK

You need to install the Kubeflow Training SDK to run this notebook.

In [None]:
# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.
!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python
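
Before running the remaining cells, it can help to confirm that the installation succeeded. The following is a minimal sketch that reads the installed version from package metadata. It assumes the distribution is registered under the name `kubeflow-training`, as mentioned in the introduction; `installed_version` is a hypothetical helper and not part of the SDK.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Assumption: the SDK is distributed under the name "kubeflow-training".
sdk_version = installed_version("kubeflow-training")
print(sdk_version if sdk_version else "kubeflow-training is not installed")
```

If the helper reports that the package is missing, rerun the install cell and restart the kernel before importing the SDK.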

In [6]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container

from kubeflow.training import V1ReplicaSpec
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import KubeflowOrgV1PyTorchJobSpec
from kubeflow.training import V1RunPolicy
from kubeflow.training import TrainingClient

## Define PyTorchJob

The demo creates a PyTorchJob with one Master replica and one Worker replica to run the MNIST sample. Both replicas share the same pod template.

In [37]:
name = "pytorch-dist-mnist-gloo"
namespace = "kubeflow-user-example-com"
container_name = "pytorch"

container = V1Container(
    name=container_name,
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    args=["--backend", "gloo"],
)

replica_spec = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        metadata=V1ObjectMeta(
            name=name,
            namespace=namespace,
            annotations={
                "sidecar.istio.io/inject": "false"
            }
        ),
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name=name, namespace=namespace),
    spec=KubeflowOrgV1PyTorchJobSpec(
        run_policy=V1RunPolicy(clean_pod_policy="None"),
        pytorch_replica_specs={
            "Master": replica_spec,
            "Worker": replica_spec
        },
    ),
)
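
For readers more familiar with Kubernetes manifests, the object built above corresponds roughly to the plain dictionary below. This is a sketch for orientation only: the camelCase field names are the ones the PyTorchJob custom resource is generally expected to use, and they should be checked against the CRD installed in your cluster.

```python
import json

name = "pytorch-dist-mnist-gloo"
namespace = "kubeflow-user-example-com"

# One replica definition, shared by the Master and Worker roles.
replica = {
    "replicas": 1,
    "restartPolicy": "OnFailure",
    "template": {
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"sidecar.istio.io/inject": "false"},
        },
        "spec": {
            "containers": [
                {
                    "name": "pytorch",
                    "image": "gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
                    "args": ["--backend", "gloo"],
                }
            ]
        },
    },
}

# Assumed manifest layout; verify the field names against your cluster's CRD.
manifest = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {"name": name, "namespace": namespace},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "pytorchReplicaSpecs": {"Master": replica, "Worker": replica},
    },
}

print(json.dumps(manifest, indent=2))
```

Printing the structure this way makes it easier to compare the SDK object with a YAML manifest applied through `kubectl`.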

## Create PyTorchJob

You have to create a Training Client to deploy your PyTorchJob in your cluster.

In [38]:
training_client = TrainingClient()
training_client.create_pytorchjob(pytorchjob, namespace=namespace)

PyTorchJob kubeflow-user-example-com/pytorch-dist-mnist-gloo has been created


## Get the Created PyTorchJob

You can verify the name of the created PyTorchJob:

In [39]:
training_client.get_pytorchjob(name, namespace=namespace).metadata.name

'pytorch-dist-mnist-gloo'

## Get the PyTorchJob Conditions

In [40]:
training_client.get_job_conditions(name=name, namespace=namespace, job_kind="PyTorchJob")

[{'last_transition_time': datetime.datetime(2023, 1, 12, 18, 30, 13, tzinfo=tzlocal()),
  'last_update_time': datetime.datetime(2023, 1, 12, 18, 30, 13, tzinfo=tzlocal()),
  'message': 'PyTorchJob pytorch-dist-mnist-gloo is created.',
  'reason': 'PyTorchJobCreated',
  'status': 'True',
  'type': 'Created'},
 {'last_transition_time': datetime.datetime(2023, 1, 12, 18, 30, 18, tzinfo=tzlocal()),
  'last_update_time': datetime.datetime(2023, 1, 12, 18, 30, 18, tzinfo=tzlocal()),
  'message': 'PyTorchJob pytorch-dist-mnist-gloo is running.',
  'reason': 'JobRunning',
  'status': 'True',
  'type': 'Running'}]
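
The conditions list records the job's history, so the current state is the most recent entry whose status is `'True'`. The sketch below shows one way to pick it out. It assumes each condition has been converted to a plain dictionary with the keys shown in the output above, and it treats the timestamps as UTC; `current_condition` is a hypothetical helper, not an SDK function.

```python
from datetime import datetime, timezone

# Sample data copied from the output above; timestamps are treated as UTC here.
conditions = [
    {
        "type": "Created",
        "status": "True",
        "reason": "PyTorchJobCreated",
        "last_transition_time": datetime(2023, 1, 12, 18, 30, 13, tzinfo=timezone.utc),
    },
    {
        "type": "Running",
        "status": "True",
        "reason": "JobRunning",
        "last_transition_time": datetime(2023, 1, 12, 18, 30, 18, tzinfo=timezone.utc),
    },
]


def current_condition(conditions):
    """Return the most recently transitioned condition whose status is 'True'."""
    active = [c for c in conditions if c["status"] == "True"]
    if not active:
        return None
    return max(active, key=lambda c: c["last_transition_time"])


latest = current_condition(conditions)
print(latest["type"], latest["reason"])
```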

## Wait Until PyTorchJob Finishes

In [41]:
pytorchjob = training_client.wait_for_job_conditions(name=name, namespace=namespace, job_kind="PyTorchJob")

print(f"Succeeded number of replicas: {pytorchjob.status.replica_statuses['Master'].succeeded}")

pytorch-dist-mnist-gloo        Running              2023-01-12 18:30:18+00:00
pytorch-dist-mnist-gloo        Running              2023-01-12 18:30:18+00:00
pytorch-dist-mnist-gloo        Running              2023-01-12 18:30:18+00:00
pytorch-dist-mnist-gloo        Succeeded            2023-01-12 18:36:48+00:00
Succeeded number of replicas: 1
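
Waiting for a job amounts to polling its status until it reaches a terminal state or a timeout expires. The sketch below illustrates that pattern in plain Python. It is not the SDK's implementation: `wait_for_status`, its parameters, and the replayed status sequence are all hypothetical, and the stand-in callable replaces a real cluster query.

```python
import time
from typing import Callable, Iterable


def wait_for_status(
    get_status: Callable[[], str],
    expected: Iterable[str] = ("Succeeded", "Failed"),
    timeout: float = 600.0,
    polling_interval: float = 15.0,
) -> str:
    """Poll get_status until it returns an expected value or the timeout expires."""
    expected = set(expected)
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        print(status)
        if status in expected:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"status is still {status!r} after {timeout} seconds")
        time.sleep(polling_interval)


# Stand-in for a cluster query: replays the statuses shown in the output above.
replayed = iter(["Running", "Running", "Running", "Succeeded"])
final_status = wait_for_status(lambda: next(replayed), polling_interval=0.0)
```

Treating both `Succeeded` and `Failed` as terminal keeps the loop from running until the timeout when a job fails early.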


## Verify That the PyTorchJob Succeeded

In [42]:
training_client.is_job_succeeded(name=name, namespace=namespace, job_kind="PyTorchJob")

True

## Get the PyTorchJob Training Logs

In [43]:
training_client.get_job_logs(name=name, namespace=namespace, container=container_name)

The logs of pod pytorch-dist-mnist-gloo-master-0:
 Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!

accuracy=0.9665
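
The final metric can be extracted from the log text for use in later cells. The sketch below assumes the logs have been captured into a string and that the training script reports the metric as `accuracy=<number>`, as in the output above; `parse_accuracy` is a hypothetical helper.

```python
import re
from typing import Optional

# Tail of the log output shown above.
sample_logs = """Processing...
Done!

accuracy=0.9665
"""


def parse_accuracy(logs: str) -> Optional[float]:
    """Return the last 'accuracy=<number>' value found in the logs, or None."""
    matches = re.findall(r"accuracy=([0-9]*\.?[0-9]+)", logs)
    return float(matches[-1]) if matches else None


accuracy = parse_accuracy(sample_logs)
print(accuracy)
```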




## Delete the PyTorchJob

In [44]:
training_client.delete_pytorchjob(name, namespace=namespace)

PyTorchJob kubeflow-user-example-com/pytorch-dist-mnist-gloo has been deleted
