In this notebook, we will go over the basics of submitting jobs to a Ray cluster through the CodeFlare SDK.

In [None]:
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication
from codeflare_sdk.job.jobs import DDPJobDefinition

In [None]:
# Create authentication object for oc user permissions
auth = TokenAuthentication(
    token="",       # fill in your API token
    server="",      # fill in your API server URL
    skip_tls=True   # skips TLS certificate verification
)
auth.login()
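
Rather than pasting the token and server into the notebook, the two values can be read from the environment. The helper below is a minimal sketch of that idea; the function name and the variable names `CLUSTER_TOKEN` and `CLUSTER_SERVER` are assumptions made for illustration and are not part of the SDK.

```python
import os


def load_cluster_credentials(env=os.environ):
    """Return (token, server) read from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    token = env.get("CLUSTER_TOKEN", "")
    server = env.get("CLUSTER_SERVER", "")
    # Collect the names of any variables that are unset or empty.
    missing = [
        name
        for name, value in (("CLUSTER_TOKEN", token), ("CLUSTER_SERVER", server))
        if not value
    ]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return token, server
```

The returned pair can then be passed as the `token` and `server` arguments of `TokenAuthentication` in the cell above.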

Let's start by running through the same cluster setup as before:

In [None]:
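# Create and configure our cluster object (and appwrapper)
cluster = Cluster(ClusterConfiguration(
    name='jobtest',
    image='quay.io/mmurakam/runtimes:finetuning-ray-runtime-v0.2.2',
    num_workers=2,
    min_cpus=1,       # CPU request per worker
    max_cpus=1,       # CPU limit per worker
    min_memory=4,     # memory request per worker
    max_memory=4,     # memory limit per worker
    num_gpus=1,       # GPUs per worker
    instascale=False  # InstaScale is not used here
))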

In [None]:
# Bring up the cluster
cluster.up()
cluster.wait_ready()

In [None]:
cluster.details()

This time, however, we are going to use the CodeFlare SDK to submit batch jobs via TorchX to the Ray cluster we have just brought up.

Let's begin by submitting a job to Ray that trains a basic neural network on the MNIST dataset:

In [None]:
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"}
)
job = jobdef.submit(cluster)
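
The job definition above refers to two local files, `mnist.py` and `requirements.txt`. A quick check that both exist before submitting can save a failed run. The helper below is a sketch only: `missing_job_files` is a hypothetical name, and it assumes the paths are resolved relative to the notebook's working directory.

```python
from pathlib import Path


def missing_job_files(paths, base_dir="."):
    """Return the entries of `paths` that are not regular files under `base_dir`."""
    base = Path(base_dir)
    return [p for p in paths if not (base / p).is_file()]


# Example call (not executed here):
# missing_job_files(["mnist.py", "requirements.txt"])
```

An empty list means both files were found; anything else names the files to fix before calling `submit`.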

Now we can take a look at the status of our submitted job, as well as the logs:

In [None]:
job.status()

In [None]:
job.logs()
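
Instead of re-running the status cell by hand, a small polling loop can wait for the job to finish. The sketch below is generic on purpose: `wait_until` is a hypothetical helper, and the predicate that decides whether the job is finished is left to the caller, since that depends on what `job.status()` returns in your SDK version.

```python
import time


def wait_until(is_done, poll_seconds=5.0, timeout_seconds=600.0, sleep=time.sleep):
    """Call `is_done()` repeatedly until it returns True or the timeout elapses.

    Returns True if `is_done()` returned True, False on timeout.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        if is_done():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(poll_seconds)


# Example call (not executed here); `job_has_finished` is a hypothetical
# predicate you would write against the object returned by job.status():
# wait_until(lambda: job_has_finished(job.status()))
```

The timeout keeps the notebook from blocking forever if the job never reaches a finished state.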

Once complete, we can bring our Ray cluster down and clean up:

In [None]:
cluster.down()

In [None]:
auth.logout()