In this third notebook, we will go over the basics of submitting jobs via the SDK, either to a Ray cluster or directly to MCAD (the Multi-Cluster App Dispatcher).

In [2]:
# Import pieces from codeflare-sdk
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

In [None]:
# Create authentication object for oc user permissions
auth = TokenAuthentication(
    token="XXXXX",
    server="XXXXX",
    skip_tls=False
)
auth.login()
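
Hard-coding a token in a notebook makes it easy to leak. One option is to read the credentials from environment variables instead. The following is a minimal sketch under that assumption; the variable names `OC_TOKEN` and `OC_SERVER` and the helper `read_oc_credentials` are hypothetical and not part of the SDK:

```python
import os


def read_oc_credentials(env=os.environ):
    """Return (token, server) from a mapping of environment variables.

    OC_TOKEN and OC_SERVER are hypothetical names chosen for this sketch.
    Raises RuntimeError if either one is missing or empty.
    """
    required = ("OC_TOKEN", "OC_SERVER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return env["OC_TOKEN"], env["OC_SERVER"]
```

The returned pair can then be passed as `token=` and `server=` to `TokenAuthentication` in place of the literal placeholders.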

Let's start by running through the same cluster setup as before:

In [3]:
# Create and configure our cluster object (and appwrapper)
cluster = Cluster(ClusterConfiguration(
    name='jobtest',
    namespace='default',
    min_worker=2,
    max_worker=2,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    gpu=0,
    instascale=False
))

Written to: jobtest.yaml


In [4]:
# Bring up the cluster
cluster.up()
cluster.wait_ready()

Waiting for requested resources to be set up...
Requested cluster up and running!


In [5]:
# Show the cluster's current status and resource details
cluster.details()

RayCluster(name='jobtest', status=<RayClusterStatus.READY: 'ready'>, min_workers=2, max_workers=2, worker_mem_min=4, worker_mem_max=4, worker_cpu=1, worker_gpu=0, namespace='default', dashboard='http://ray-dashboard-jobtest-default.apps.meyceoz-032023.psap.aws.rhperfscale.org')

This time, however, we are going to use the CodeFlare SDK to submit batch jobs via TorchX, either to the Ray cluster we have just brought up, or directly to MCAD.

In [6]:
from codeflare_sdk.job.jobs import DDPJobDefinition

First, let's submit a job to Ray that trains a basic neural network on the MNIST dataset:

In [7]:
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"}
)
job = jobdef.submit(cluster)

The Ray scheduler does not support port mapping.


Now we can take a look at the status of our submitted job, as well as the logs:

In [10]:
job.status()

AppStatus:
  msg: !!python/object/apply:ray.dashboard.modules.job.common.JobStatus
  - SUCCEEDED
  num_restarts: -1
  roles:
  - replicas:
    - hostname: <NONE>
      id: 0
      role: ray
      state: !!python/object/apply:torchx.specs.api.AppState
      - 4
      structured_error_msg: <NONE>
    role: ray
  state: SUCCEEDED (4)
  structured_error_msg: <NONE>
  ui_url: null
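
Submission returns before the job finishes, so a single `job.status()` call may report a non-terminal state such as RUNNING. A simple way to wait is to poll until a terminal state appears. The sketch below is deliberately independent of the SDK: it assumes the caller supplies a function that returns the current state name as a string (for instance something like `lambda: job.status().state.name`, on the assumption that the returned object is a TorchX `AppStatus`, as the output above suggests). The helper name `wait_for_terminal` and its defaults are choices made for this illustration:

```python
import time

# State names that mean the job will not change any further.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_terminal(get_state, poll_seconds=5.0, timeout_seconds=600.0,
                      sleep=time.sleep):
    """Poll get_state() until it returns a terminal state name.

    Raises TimeoutError if no terminal state is seen within timeout_seconds.
    The sleep function is injectable so the loop can be tested without waiting.
    """
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"job still in state {state} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Keeping the state lookup behind a callable means the same loop works for both the Ray and the MCAD submission paths.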

In [11]:
job.logs()



Once complete, we can bring our Ray cluster down and clean up:

In [12]:
cluster.down()
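
If a cell fails between `cluster.up()` and `cluster.down()`, the cluster keeps holding its resources. One way to guard against that is a small context manager that always calls `down()`. This is a sketch, not an SDK feature: the helper name `running_cluster` is hypothetical, and it assumes only the `up()`, `wait_ready()`, and `down()` methods used earlier in this notebook:

```python
from contextlib import contextmanager


@contextmanager
def running_cluster(cluster):
    """Bring the cluster up, yield it once ready, and always bring it down."""
    cluster.up()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        # Runs even if wait_ready() or the body of the with-block raises.
        cluster.down()
```

With this in place, job submission could be wrapped as `with running_cluster(cluster) as c: ...`, and cleanup happens on both success and failure.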

Alternatively, we can submit a job directly to MCAD, which schedules pods to run the job with the requested resources:

In [13]:
jobdef = DDPJobDefinition(
    name="mnistjob",
    script="mnist.py",
    scheduler_args={"namespace": "default"},
    j="1x1",
    gpu=0,
    cpu=1,
    memMB=8000,
    image="quay.io/project-codeflare/mnist-job-test:v0.0.1"
)
job = jobdef.submit()
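
The `j="1x1"` argument above deserves a note. Assuming the SDK passes it through to the TorchX `dist.ddp` component, it follows the `{nnodes}x{nproc_per_node}` convention, so `"1x1"` requests one node running one process. The helper below is a hypothetical illustration of that format only; it handles the simple fixed-size form and not elastic ranges:

```python
def parse_j(j):
    """Split a 'NxM' string into (nnodes, nproc_per_node) as integers.

    Only the simple fixed-size form is handled in this sketch.
    """
    nnodes, sep, nproc = j.partition("x")
    if not sep or not nnodes.isdigit() or not nproc.isdigit():
        raise ValueError(f"expected '<nnodes>x<nproc_per_node>', got {j!r}")
    return int(nnodes), int(nproc)


def total_processes(j):
    """Total number of worker processes implied by a 'NxM' string."""
    nnodes, nproc = parse_j(j)
    return nnodes * nproc
```

For example, a value of `"2x4"` would describe two nodes with four processes each, eight processes in total.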



Once again, we can look at job status and logs:

In [18]:
job.status()

AppStatus:
  msg: <NONE>
  num_restarts: -1
  roles:
  - replicas:
    - hostname: ''
      id: 0
      role: mnist
      state: !!python/object/apply:torchx.specs.api.AppState
      - 3
      structured_error_msg: <NONE>
    role: mnist
  state: RUNNING (3)
  structured_error_msg: <NONE>
  ui_url: null

In [20]:
job.logs()



This time, once the pods complete, we can clean them up along with any associated resources. The same command can also be used to cancel a job early, whether it was submitted to Ray or to MCAD:

In [21]:
job.cancel()

In [None]:
auth.logout()