In this notebook, we run a Llama 2 fine-tuning script using the CodeFlare SDK and Ray Job Submission.

In [None]:
from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration
from codeflare_sdk.cluster.auth import TokenAuthentication

In [None]:
# Create authentication object for user permissions
# If unused, the SDK will automatically check for a default kubeconfig, then for in-cluster config
# KubeConfigFileAuthentication can also be used to specify kubeconfig path manually
auth = TokenAuthentication(
    token="XXXXX",
    server="XXXXX",
    skip_tls=False,
)
auth.login()

Once again, let's start by running through the same cluster setup as before:

NOTE: We must specify the `image` that will be used in our RayCluster. We recommend that you bring your own image that suits your purposes; the example here is a community image.

In [None]:
cluster = Cluster(ClusterConfiguration(
    name='llamafinetuneloraclustertest',
    namespace='default', # Update to your namespace
    num_workers=3, 
    min_cpus=8,
    max_cpus=8,
    min_memory=35,
    max_memory=35,
    head_memory=35,
    head_gpus=1, # For GPU enabled workloads set the head_gpus and num_gpus
    num_gpus=1,
    image="quay.io/project-codeflare/ray:latest-py39-cu118",
    write_to_file=False, # When enabled, Ray Cluster YAML files are written to /HOME/.codeflare/resources
    # local_queue="local-queue-name" # Specify the local queue manually
))

In [None]:
cluster.up()

In [None]:
cluster.wait_ready()

In [None]:
# Initialize the Job Submission Client
"""
The SDK will automatically gather the dashboard address and authenticate using the Ray Job Submission Client.
"""
client = cluster.job_client

In [None]:
"""
NOTE: Please update the script with your Hugging Face token and update the save and push paths to your Hugging Face Organisation.
"""
submission_id = client.submit_job(
    entrypoint="python llmfinetune.py",
    runtime_env={"working_dir": "./", "pip": "requirements-llama.txt"},
)
print(submission_id)
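
The note above asks you to edit the script with your Hugging Face token. As an alternative, the token could be forwarded to the job through the `env_vars` field of Ray's `runtime_env`. The sketch below is only an illustration: `build_runtime_env` is a hypothetical helper, and it assumes `llmfinetune.py` has been adapted to read the token from an `HF_TOKEN` environment variable, which this notebook does not confirm.

```python
import os


def build_runtime_env(token_var="HF_TOKEN"):
    """Hypothetical helper: build the runtime_env used above and, if the
    token is set locally, forward it to the job as an environment variable."""
    runtime_env = {"working_dir": "./", "pip": "requirements-llama.txt"}
    token = os.environ.get(token_var)
    if token:
        # env_vars is a standard runtime_env field; values must be strings.
        runtime_env["env_vars"] = {token_var: token}
    return runtime_env
```

The result could then be passed as `runtime_env=build_runtime_env()` in the `client.submit_job(...)` call, keeping the token out of the script itself.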

In [None]:
client.get_job_status(submission_id)
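
A single status call only gives a snapshot, and fine-tuning jobs can run for a long time. A minimal polling sketch follows, assuming the status is reported using Ray's job states, whose terminal values are `SUCCEEDED`, `FAILED`, and `STOPPED`. `wait_for_job` is a hypothetical helper, not part of the SDK, and the default poll interval and timeout are arbitrary.

```python
import time

# Terminal job states in Ray's job submission API.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, submission_id, poll_seconds=10.0, timeout_seconds=3600.0):
    """Hypothetical helper: call get_status(submission_id) repeatedly until
    the job reaches a terminal state, or raise TimeoutError."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status(submission_id)
        # Accept either an enum member with a .value or a plain string.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {submission_id} still {name} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

With the client from this notebook, this could be called as `wait_for_job(client.get_job_status, submission_id)` before fetching the logs.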

In [None]:
client.get_job_logs(submission_id)

In [None]:
cluster.down()

In [None]:
auth.logout()