In this notebook, we will go through the basics of using the SDK to:
 - Spin up a Ray cluster with our desired resources
 - View the status and specs of our Ray cluster
 - Take down the Ray cluster when finished

In [1]:
# Import pieces from codeflare-sdk
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

In [None]:
# Create authentication object for user permissions
# If unused, the SDK will automatically check for a default kubeconfig, then for in-cluster config
# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
auth = TokenAuthentication(
    token="XXXXX",
    server="XXXXX",
    skip_tls=False
)
auth.login()
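
To avoid pasting a token directly into the notebook, one option is to read the credentials from environment variables. The sketch below is only an illustration: the variable names `CODEFLARE_TOKEN` and `CODEFLARE_SERVER` and the helper `load_auth_settings` are hypothetical and not part of the SDK.

```python
import os


def load_auth_settings(env=os.environ):
    """Collect TokenAuthentication keyword arguments from environment variables.

    CODEFLARE_TOKEN and CODEFLARE_SERVER are hypothetical names chosen for
    this sketch; the SDK does not read them on its own.
    """
    required = ("CODEFLARE_TOKEN", "CODEFLARE_SERVER")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "token": env["CODEFLARE_TOKEN"],
        "server": env["CODEFLARE_SERVER"],
        "skip_tls": False,
    }


# Demonstration with an explicit mapping instead of the real environment
settings = load_auth_settings({"CODEFLARE_TOKEN": "XXXXX", "CODEFLARE_SERVER": "XXXXX"})
```

With both variables set, the authentication object could then be created as `TokenAuthentication(**load_auth_settings())`.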

Here, we define our cluster by specifying the resources our batch workload requires. The cluster object below generates a corresponding RayCluster resource.

NOTE: 'quay.io/rhoai/ray:2.23.0-py39-cu121' is the default community image used by the CodeFlare SDK for creating a RayCluster resource.
If you have your own Ray image that suits your purposes, specify it in the `image` field to override the default image.

In [2]:
# Create and configure our cluster object
# The SDK will try to find the name of your default local queue based on the annotation "kueue.x-k8s.io/default-queue": "true" unless you specify the local queue manually below
cluster = Cluster(ClusterConfiguration(
    name='raytest',
    head_gpus=0, # For GPU-enabled workloads, set head_gpus and num_gpus
    num_gpus=0,
    num_workers=2,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    # image="", # Optional field
    write_to_file=False, # When enabled, RayCluster YAML files are written to /HOME/.codeflare/resources
    # local_queue="local-queue-name" # Specify the local queue manually
))

Written to: raytest.yaml
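
Before submitting, it can help to check how much the workers request in total, since that is what the queue has to admit. The following is a rough sketch with a hypothetical helper, `total_worker_requests`; it multiplies the per-worker values from the configuration above by `num_workers` and deliberately ignores the head node, whose resources are not set in this example. Memory is reported in whatever unit `min_memory` uses.

```python
def total_worker_requests(num_workers, cpus_per_worker, memory_per_worker, gpus_per_worker=0):
    """Sum the per-worker requests across all workers (head node excluded)."""
    return {
        "cpus": num_workers * cpus_per_worker,
        "memory": num_workers * memory_per_worker,
        "gpus": num_workers * gpus_per_worker,
    }


# Values taken from the ClusterConfiguration above
totals = total_worker_requests(num_workers=2, cpus_per_worker=1, memory_per_worker=4)
print(totals)  # {'cpus': 2, 'memory': 8, 'gpus': 0}
```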


Next, we bring our cluster up by calling the `up()` function below, which submits our Ray cluster to the queue and begins the process of obtaining our resource cluster.

In [3]:
# Bring up the cluster
cluster.up()

Now, we check the status of our resource cluster and wait until it is ready for use.

In [4]:
cluster.status()

(<CodeFlareClusterStatus.QUEUED: 3>, False)

In [5]:
cluster.wait_ready()

Waiting for requested resources to be set up...
Requested cluster up and running!


In [6]:
cluster.status()

(<CodeFlareClusterStatus.READY: 1>, True)
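
The outputs above show that `cluster.status()` returns a `(status, ready)` tuple. If we wanted our own waiting logic with a deadline, a polling loop along these lines could work. This is a sketch rather than how `wait_ready()` is implemented: `wait_until_ready` is a hypothetical helper, and a canned iterator stands in for the real cluster so the example runs without one.

```python
import time


def wait_until_ready(get_status, timeout=60.0, interval=1.0):
    """Poll get_status() until its second element is True or the timeout elapses.

    get_status is expected to return a (status, ready) tuple.
    """
    deadline = time.monotonic() + timeout
    while True:
        status, ready = get_status()
        if ready:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds (last status: {status})")
        time.sleep(interval)


# Stand-in for cluster.status(): reports QUEUED twice, then READY
responses = iter([("QUEUED", False), ("QUEUED", False), ("READY", True)])
final_status = wait_until_ready(lambda: next(responses), timeout=5.0, interval=0.0)
```

Against a real cluster, the call would look like `wait_until_ready(cluster.status, timeout=600, interval=5)`.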

Let's quickly verify that the specs of the cluster are as expected.

In [7]:
cluster.details()

RayCluster(name='raytest', status=<RayClusterStatus.READY: 'ready'>, workers=2, worker_mem_min=4, worker_mem_max=4, worker_cpu=1, worker_gpu=0, namespace='default', dashboard='http://ray-dashboard-raytest-default.apps.meyceoz-07122023.psap.aws.rhperfscale.org')

Finally, we bring our resource cluster down, which releases the associated resources and returns everything to the state it was in before the cluster was brought up.

In [8]:
cluster.down()
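
If a workload fails partway through, the `down()` cell may never be run and the resources stay allocated. One way to guard against that is to wrap the lifecycle in a context manager. The sketch below assumes only the `up()`, `wait_ready()` and `down()` methods used in this notebook; `running` and `FakeCluster` are hypothetical names, and the fake exists only so the example runs without a real cluster.

```python
from contextlib import contextmanager


@contextmanager
def running(cluster):
    """Bring the cluster up, yield it, and always bring it down afterwards."""
    cluster.up()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        cluster.down()


class FakeCluster:
    """Stand-in that records calls; the real Cluster talks to Kubernetes."""

    def __init__(self):
        self.calls = []

    def up(self):
        self.calls.append("up")

    def wait_ready(self):
        self.calls.append("wait_ready")

    def down(self):
        self.calls.append("down")


fake = FakeCluster()
try:
    with running(fake):
        raise RuntimeError("simulated workload failure")
except RuntimeError:
    pass  # down() has already been called by the context manager
```

With the real object, this would read `with running(cluster): ...`, and `down()` runs even when the body raises.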

In [None]:
auth.logout()