In this third notebook, we will go over the basics of submitting jobs via the SDK, either to a Ray cluster or directly to MCAD (the Multi-Cluster App Dispatcher).

In [None]:
# Import pieces from codeflare-sdk
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

In [None]:
# Create authentication object for user permissions
# If unused, the SDK will automatically check for a default kubeconfig, then for in-cluster config
# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
auth = TokenAuthentication(
    token="XXXXX",
    server="XXXXX",
    skip_tls=False
)
auth.login()
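
Rather than pasting the token and server URL into the notebook, one option is to read them from the environment. The following is a minimal sketch, not part of the SDK: the helper `load_auth_settings` and the variable names `CLUSTER_TOKEN` and `CLUSTER_SERVER` are hypothetical and chosen only for illustration.

```python
import os


def load_auth_settings(env=os.environ):
    """Collect token/server settings from environment variables.

    CLUSTER_TOKEN and CLUSTER_SERVER are hypothetical names used for this
    sketch; use whatever your environment actually provides.
    """
    token = env.get("CLUSTER_TOKEN")
    server = env.get("CLUSTER_SERVER")
    required = (("CLUSTER_TOKEN", token), ("CLUSTER_SERVER", server))
    missing = [name for name, value in required if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Keys mirror the keyword arguments passed to TokenAuthentication above.
    return {"token": token, "server": server, "skip_tls": False}
```

The returned dictionary could then be unpacked into the authentication object, for example `TokenAuthentication(**load_auth_settings())`, which keeps credentials out of the saved notebook.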

Let's start by running through the same cluster setup as before:

NOTE: We must specify the `image` that will be used in our RayCluster. We recommend bringing your own image that suits your purposes; the example here is a community image.

In [None]:
# Create and configure our cluster object (and appwrapper)
cluster = Cluster(ClusterConfiguration(
    name='jobtest',
    namespace='default',
    num_workers=2,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    num_gpus=0,
    image="quay.io/project-codeflare/ray:latest-py39-cu118",
    instascale=False
))

In [None]:
# Bring up the cluster
cluster.up()
cluster.wait_ready()

In [None]:
cluster.details()

This time, however, we are going to use the CodeFlare SDK to submit batch jobs via TorchX, either to the Ray cluster we have just brought up, or directly to MCAD.

In [None]:
from codeflare_sdk.job.jobs import DDPJobDefinition

First, let's begin by submitting to Ray, training a basic neural network on the MNIST dataset:

In [None]:
jobdef = DDPJobDefinition(
    name="mnisttest",
    script="mnist.py",
    scheduler_args={"requirements": "requirements.txt"}
)
job = jobdef.submit(cluster)

Now we can take a look at the status of our submitted job, as well as retrieve the full logs:

In [None]:
job.status()

In [None]:
job.logs()
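
Checking the status by hand works for a demo, but a long-running job is usually polled until it finishes. The sketch below shows one way to do that under stated assumptions: `wait_for_terminal_state` is a hypothetical helper, the names in `TERMINAL_STATES` are assumed rather than taken from the SDK, and `get_state` is any callable you supply that returns the current state as a string.

```python
import time

# Assumed names for finished states; adjust to what your scheduler reports.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def wait_for_terminal_state(
    get_state,
    poll_seconds=5.0,
    timeout_seconds=600.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call get_state() repeatedly until it returns a terminal state.

    Raises TimeoutError if no terminal state is seen before the deadline.
    """
    deadline = clock() + timeout_seconds
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(f"Job still in state {state!r} after {timeout_seconds}s")
        sleep(poll_seconds)
```

How `get_state` is built depends on what `job.status()` returns in your SDK version, so inspect that return value before wiring it in. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.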

You can also view organized logs, status, and other information directly through the Ray cluster's dashboard:

In [None]:
cluster.cluster_dashboard_uri()

Once complete, we can bring our Ray cluster down and clean up:

In [None]:
cluster.down()
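
If a cell fails partway through, it is easy to leave the cluster running. One possible pattern, sketched here with a hypothetical `running_cluster` helper that is not part of the SDK, is to wrap the `up()`, `wait_ready()`, and `down()` calls used above in a context manager so that teardown always runs.

```python
from contextlib import contextmanager


@contextmanager
def running_cluster(cluster):
    """Bring a cluster up, yield it once ready, and always bring it down.

    Works with any object exposing up(), wait_ready(), and down().
    """
    cluster.up()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        # Runs even if wait_ready() or the body of the with-block raises.
        cluster.down()
```

Used as `with running_cluster(cluster) as c: ...`, job submission and log retrieval would go inside the `with` block, and the cluster is brought down when the block exits, whether normally or with an error.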

An alternative is to submit the job directly to MCAD, which schedules pods to run it with the requested resources:

NOTE: To test this demo in an air-gapped/disconnected environment, alter the training script to use a local dataset.

In [None]:
jobdef = DDPJobDefinition(
    name="mnistjob",
    script="mnist.py",
    # script="mnist_disconnected.py", # training script for disconnected environment
    scheduler_args={"namespace": "default"},
    j="1x1",
    gpu=0,
    cpu=1,
    memMB=8000,
    image="quay.io/project-codeflare/mnist-job-test:v0.0.1"
)
job = jobdef.submit()
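
The `j` argument above follows the TorchX convention of nodes times processes per node, so `"1x1"` requests one node running one process. As an illustration only, the hypothetical helper below (not part of the SDK or TorchX) splits such a string into its two numbers, assuming the `[min_nodes:]nodes x procs_per_node` form.

```python
def parse_job_shape(j):
    """Split a string such as "2x4" into (nodes, procs_per_node).

    An optional "min:" prefix on the node count (e.g. "1:2x4") is ignored
    and only the upper node count is returned.
    """
    nodes_part, sep, procs_part = j.partition("x")
    if not sep:
        raise ValueError(f"Expected '<nodes>x<procs_per_node>', got {j!r}")
    nodes = int(nodes_part.split(":")[-1])
    procs_per_node = int(procs_part)
    return nodes, procs_per_node
```

Multiplying the two values gives the total number of worker processes, which is a quick sanity check against the `gpu` and `cpu` values requested per node.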

Once again, we can look at job status and logs:

In [None]:
job.status()

In [None]:
job.logs()

This time, once the pods complete, we can clean them up along with any other associated resources. The following command can also be used to delete a job early, for both Ray and MCAD submissions:

In [None]:
job.cancel()

In [None]:
auth.logout()