In this notebook, we will go through the basics of using the SDK to:
 - Spin up a Ray cluster with our desired resources
 - View the status and specs of our Ray cluster
 - Take down the Ray cluster when finished

In [2]:
# Import pieces from codeflare-sdk
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication

In [3]:
# Create authentication object for user permissions
# If unused, the SDK automatically checks for a default kubeconfig, then the in-cluster config
# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
auth = TokenAuthentication(
    token = "xxxxxx",
    server = "xxxxxx",
    skip_tls=False
)
auth.login()

Authenticated with certificate located at /etc/pki/tls/custom-certs/ca-bundle.crt


'Logged into https://api.ai-cloud.xtoph156.dfw.ocp.run:6443'
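
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. One option is to read the credentials from the environment instead. The following is a minimal sketch under that assumption; the helper `load_credentials` and the variable names `CLUSTER_TOKEN` and `CLUSTER_SERVER` are hypothetical and not part of the SDK.

```python
import os


def load_credentials(env=None):
    """Read the API token and server URL from environment variables.

    CLUSTER_TOKEN and CLUSTER_SERVER are hypothetical names chosen for
    this sketch; use whatever names your environment defines.
    """
    env = os.environ if env is None else env
    required = ("CLUSTER_TOKEN", "CLUSTER_SERVER")
    # Treat unset and empty variables the same way: both are unusable.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"token": env["CLUSTER_TOKEN"], "server": env["CLUSTER_SERVER"]}
```

The returned values could then be passed to the authentication object shown above, for example `TokenAuthentication(token=creds["token"], server=creds["server"], skip_tls=False)`.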

Here, we want to define our cluster by specifying the resources we require for our batch workload. Below, we define our cluster object (which generates a corresponding RayCluster).

NOTE: We must specify the `image` that will be used in our RayCluster. We recommend bringing your own image that suits your purposes.
The example here is a community image.

In [4]:
# Create and configure our cluster object
# The SDK will try to find the name of your default local queue based on the annotation "kueue.x-k8s.io/default-queue": "true" unless you specify the local queue manually below
cluster = Cluster(ClusterConfiguration(
    name='fakeremail', 
    namespace='raytraining', # Update to your namespace
    head_gpus=0, # For GPU-enabled workloads, set head_gpus and num_gpus
    num_gpus=0,
    num_workers=1,
    min_cpus=1,
    max_cpus=1,
    min_memory=4,
    max_memory=4,
    image="quay.io/bpandey/faker-email:latest",
    write_to_file=False, # When enabled, RayCluster YAML files are written to /HOME/.codeflare/resources
    # local_queue="local-queue-name" # Specify the local queue manually
))

Yaml resources loaded for fakeremail


Next, we bring our cluster up by calling `up()`, which submits our RayCluster to the queue and begins the process of obtaining the requested resources.

In [5]:
# Bring up the cluster
cluster.up()

Now we check the status of our cluster and wait until it is ready for use.

In [6]:
cluster.status()

(<CodeFlareClusterStatus.READY: 1>, True)

In [7]:
cluster.wait_ready()

Waiting for requested resources to be set up...
Requested cluster is up and running!
Dashboard is ready!


In [8]:
cluster.status()

(<CodeFlareClusterStatus.READY: 1>, True)
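
To make the idea of "check the status, then wait" concrete, here is a minimal sketch of a polling loop with a timeout. It is not the SDK's implementation of `wait_ready()`; the helper `wait_until_ready` is hypothetical, and it only assumes that the status callable returns a `(status, ready)` tuple like the `cluster.status()` output shown above.

```python
import time


def wait_until_ready(get_status, timeout=300.0, interval=5.0, sleep=time.sleep):
    """Poll get_status() until it reports ready or the timeout elapses.

    get_status is assumed to return a (status, ready) tuple.
    The sleep parameter exists so the loop can be exercised without waiting.
    """
    deadline = time.monotonic() + timeout
    while True:
        status, ready = get_status()
        if ready:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Cluster not ready after {timeout} seconds (last status: {status})"
            )
        sleep(interval)
```

Under these assumptions, a call such as `wait_until_ready(cluster.status, timeout=600)` would return the final status or raise `TimeoutError`, which can be handy when a notebook runs unattended.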

Let's quickly verify that the specs of the cluster are as expected.

In [9]:
cluster.details()

RayCluster(name='fakeremail', status=<RayClusterStatus.READY: 'ready'>, head_cpus=2, head_mem='8G', head_gpu=0, workers=1, worker_mem_min='4G', worker_mem_max='4G', worker_cpu=1, worker_gpu=0, namespace='raytraining', dashboard='https://ray-dashboard-fakeremail-raytraining.apps.ai-cloud.xtoph156.dfw.ocp.run')
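
Rather than reading the output by eye, the check can be automated. The following is a minimal sketch, assuming the object returned by `cluster.details()` exposes the fields seen in the output above (`workers`, `worker_cpu`, `worker_mem_min`, and so on) as attributes; the helper `find_spec_mismatches` is hypothetical.

```python
def find_spec_mismatches(details, expected):
    """Return {field: (expected, actual)} for every field that differs.

    details is any object with attributes; a missing attribute is
    reported with an actual value of None.
    """
    mismatches = {}
    for field, want in expected.items():
        got = getattr(details, field, None)
        if got != want:
            mismatches[field] = (want, got)
    return mismatches
```

For the configuration in this notebook, the expected values might be `{"workers": 1, "worker_cpu": 1, "worker_mem_min": "4G", "worker_mem_max": "4G", "worker_gpu": 0}`; an empty result means the cluster matches the request.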

Finally, we bring our cluster down and release the associated resources, returning everything to the state it was in before the cluster was brought up.

In [None]:
cluster.down()

In [None]:
auth.logout()
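
The teardown step is easy to miss if an earlier cell fails, which leaves the resources allocated. One way to guard against that is to wrap the lifecycle in a context manager. This is a minimal sketch, assuming only the `up()`, `wait_ready()`, and `down()` methods used in this notebook; the helper `running_cluster` is hypothetical and not part of the SDK.

```python
from contextlib import contextmanager


@contextmanager
def running_cluster(cluster):
    """Bring the cluster up, yield it once ready, and always bring it down."""
    cluster.up()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        # Runs even if wait_ready() or the body of the with-block raises.
        cluster.down()
```

With this helper, the workload would run inside `with running_cluster(cluster) as c: ...`, and `down()` is called when the block exits, whether it finishes normally or raises.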