# Create TFJob using Kubeflow Training SDK

This is a sample for the Kubeflow Training SDK, `kubeflow-training`.

The notebook shows how to use the Kubeflow Training SDK to create a TFJob, get its status and conditions, wait for it to finish, check whether it succeeded, read its logs, and delete it.

## Install Kubeflow Training Python SDK

You need to install the Kubeflow Training SDK to run this notebook.

In [None]:
# TODO (andreyvelich): Change to release version when SDK with the new APIs is published.
!pip install git+https://github.com/kubeflow/training-operator.git#subdirectory=sdk/python
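After the install cell has run, it can help to confirm that the package is visible to the notebook kernel. The following is a minimal sketch that uses only the standard library; the helper name `installed_sdk_version` is hypothetical, and the distribution name `kubeflow-training` is taken from the introduction above.

```python
from importlib import metadata


def installed_sdk_version(dist_name="kubeflow-training"):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_sdk_version()
print("kubeflow-training:", version or "not installed")
```

If the result is "not installed", restarting the kernel after `pip install` is usually worth trying before anything else.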

In [9]:
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container


from kubeflow.training import KubeflowOrgV1ReplicaSpec
from kubeflow.training import KubeflowOrgV1TFJob
from kubeflow.training import KubeflowOrgV1TFJobSpec
from kubeflow.training import KubeflowOrgV1RunPolicy
from kubeflow.training import TrainingClient

from kubeflow.training import constants

## Define TFJob

The demo runs the TensorFlow MNIST example as a TFJob with 2 workers, 1 chief, and 1 parameter server.

In [10]:
name = "mnist"
namespace = "kubeflow-user-example-com"
container_name = "tensorflow"

container = V1Container(
    name=container_name,
    image="gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    command=[
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ]
)

worker = KubeflowOrgV1ReplicaSpec(
    replicas=2,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

chief = KubeflowOrgV1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

ps = KubeflowOrgV1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

tfjob = KubeflowOrgV1TFJob(
    api_version=constants.API_VERSION,
    kind=constants.TFJOB_KIND,
    metadata=V1ObjectMeta(name=name, namespace=namespace),
    spec=KubeflowOrgV1TFJobSpec(
        run_policy=KubeflowOrgV1RunPolicy(clean_pod_policy="None"),
        tf_replica_specs={"Worker": worker,
                          "Chief": chief,
                          "PS": ps}
    )
)
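For orientation, the TFJob object built above roughly corresponds to the custom resource manifest sketched below as a plain Python dictionary. This is an illustration only, not what the SDK sends verbatim: the helper `replica_spec` and the variable names are hypothetical, and the sketch assumes that `constants.API_VERSION` resolves to `kubeflow.org/v1` and `constants.TFJOB_KIND` to `TFJob`.

```python
import json


def replica_spec(replicas, container):
    """Build one entry of tfReplicaSpecs; all three roles share the same shape."""
    return {
        "replicas": replicas,
        "restartPolicy": "Never",
        "template": {"spec": {"containers": [container]}},
    }


# Same container settings as the V1Container defined in the cell above.
container_dict = {
    "name": "tensorflow",
    "image": "gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    "command": [
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs",
        "--learning_rate=0.01",
        "--batch_size=150",
    ],
}

tfjob_manifest = {
    "apiVersion": "kubeflow.org/v1",  # assumed value of constants.API_VERSION
    "kind": "TFJob",                  # assumed value of constants.TFJOB_KIND
    "metadata": {"name": "mnist", "namespace": "kubeflow-user-example-com"},
    "spec": {
        "runPolicy": {"cleanPodPolicy": "None"},
        "tfReplicaSpecs": {
            "Worker": replica_spec(2, container_dict),
            "Chief": replica_spec(1, container_dict),
            "PS": replica_spec(1, container_dict),
        },
    },
}

print(json.dumps(tfjob_manifest, indent=2))
```

Factoring the replica spec into one function also shows that the three roles differ only in their replica count.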

## Create TFJob

You have to create a Training Client to deploy your TFJob in your cluster.

In [11]:
# The namespace and Job kind are reused in every subsequent API call.
training_client = TrainingClient(namespace=namespace, job_kind=constants.TFJOB_KIND)
training_client.create_job(tfjob)

TFJob kubeflow-user-example-com/mnist has been created


## Get the Created TFJob

You can check the status of the created TFJob.

In [12]:
training_client.get_job(name).status

{'completion_time': None,
 'conditions': [{'last_transition_time': datetime.datetime(2023, 9, 8, 21, 42, 34, tzinfo=tzutc()),
                 'last_update_time': datetime.datetime(2023, 9, 8, 21, 42, 34, tzinfo=tzutc()),
                 'message': 'TFJob mnist is created.',
                 'reason': 'TFJobCreated',
                 'status': 'True',
                 'type': 'Created'},
                {'last_transition_time': datetime.datetime(2023, 9, 8, 21, 42, 35, tzinfo=tzutc()),
                 'last_update_time': datetime.datetime(2023, 9, 8, 21, 42, 35, tzinfo=tzutc()),
                 'message': 'TFJob kubeflow-user-example-com/mnist is running.',
                 'reason': 'TFJobRunning',
                 'status': 'True',
                 'type': 'Running'}],
 'last_reconcile_time': None,
 'replica_statuses': {'Chief': {'active': 1,
                                'failed': None,
                                'label_selector': None,
                                'sel

## Get the TFJob Conditions

In [13]:
training_client.get_job_conditions(name)

[{'last_transition_time': datetime.datetime(2023, 9, 8, 21, 42, 34, tzinfo=tzutc()),
  'last_update_time': datetime.datetime(2023, 9, 8, 21, 42, 34, tzinfo=tzutc()),
  'message': 'TFJob mnist is created.',
  'reason': 'TFJobCreated',
  'status': 'True',
  'type': 'Created'},
 {'last_transition_time': datetime.datetime(2023, 9, 8, 21, 42, 35, tzinfo=tzutc()),
  'last_update_time': datetime.datetime(2023, 9, 8, 21, 42, 35, tzinfo=tzutc()),
  'message': 'TFJob kubeflow-user-example-com/mnist is running.',
  'reason': 'TFJobRunning',
  'status': 'True',
  'type': 'Running'}]
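The conditions list grows as the job moves through its lifecycle, so the most recent entry with status `'True'` indicates the current state. Below is a minimal sketch of that lookup over plain dictionaries shaped like the output above. The helper `current_condition` is hypothetical, and the sample data is copied from the output; the SDK returns model objects rather than dictionaries, so they would need converting first (for example with a `to_dict()` method, if the generated models provide one).

```python
from datetime import datetime, timezone


def current_condition(conditions):
    """Return the type of the most recent condition whose status is 'True', or None."""
    active = [c for c in conditions if c["status"] == "True"]
    if not active:
        return None
    return max(active, key=lambda c: c["last_transition_time"])["type"]


# Sample data mirroring the two conditions printed above.
sample_conditions = [
    {
        "type": "Created",
        "status": "True",
        "last_transition_time": datetime(2023, 9, 8, 21, 42, 34, tzinfo=timezone.utc),
    },
    {
        "type": "Running",
        "status": "True",
        "last_transition_time": datetime(2023, 9, 8, 21, 42, 35, tzinfo=timezone.utc),
    },
]

print(current_condition(sample_conditions))  # prints "Running"
```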

## Wait Until TFJob Finishes

In [None]:
training_client.wait_for_job_conditions(name)
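The call above blocks until the job reaches the awaited condition. If you prefer to bound the wait yourself, a generic polling loop with a deadline is one option. The sketch below is not the SDK's implementation: `wait_until` is a hypothetical helper, and the fake predicate stands in for a real check such as `lambda: training_client.is_job_succeeded(name)`.

```python
import time


def wait_until(predicate, timeout=600.0, interval=5.0):
    """Poll predicate() until it returns True or the timeout (in seconds) elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for a real job check: reports "finished" on the third poll.
calls = {"n": 0}


def fake_job_finished():
    calls["n"] += 1
    return calls["n"] >= 3


print(wait_until(fake_job_finished, timeout=1.0, interval=0.01))  # prints True
```

Using `time.monotonic()` for the deadline keeps the timeout unaffected by system clock adjustments.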

## Verify That the TFJob Succeeded

In [None]:
training_client.is_job_succeeded(name)

## Get the TFJob Training Logs

In [None]:
training_client.get_job_logs(name)

## Delete the TFJob

In [None]:
training_client.delete_job(name)