# Using KubeRay to run Distributed Workloads without CodeFlare

This notebook demonstrates a quick workflow for using Ray from a notebook without the codeflare-sdk.
Using KubeRay directly currently requires running oc commands manually from your notebook, so you will need to authenticate by hand. We recommend using the codeflare-sdk alongside CodeFlare for an easier experience. An example notebook showing an almost identical use case can be found at https://github.com/project-codeflare/codeflare-sdk/blob/main/demo-notebooks/guided-demos/3_basic_interactive.ipynb

In [None]:
!pip install --upgrade ray=="2.5.0"
!pip install pandas

## You need to get your token to authenticate to the OpenShift Cluster.

1. Go to the OpenShift Console
2. Click on the arrow next to your username
3. Click on "Copy login command"
4. Once authenticated, copy the entire command under "Log in with this token". It will look similar to the following:
oc login --token=<token> --server=<url>
5. Run the following cell, making sure to use your token and server. The "!" at the beginning of the command is required.

In [None]:
!oc login --token=<token> --server=<url>
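
If you would rather not paste the token into a cell, the login command can be assembled from environment variables instead. This is only a sketch: the variable names OC_TOKEN and OC_SERVER and the helper build_oc_login_command are hypothetical, not part of oc or of this notebook.

```python
import os
import shlex


def build_oc_login_command(token: str, server: str) -> str:
    """Return an `oc login` command line with both values shell-quoted."""
    if not token or not server:
        raise ValueError("both token and server are required")
    return f"oc login --token={shlex.quote(token)} --server={shlex.quote(server)}"


# OC_TOKEN and OC_SERVER are hypothetical names chosen for this example.
# The placeholders from the steps above are used when they are not set.
token = os.environ.get("OC_TOKEN", "<token>")
server = os.environ.get("OC_SERVER", "<url>")
cmd = build_oc_login_command(token, server)
print(cmd)
```

In a notebook cell, `!{cmd}` would then run the assembled command.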

In [13]:
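# Remove any previous cluster defined in test.yaml; --ignore-not-found avoids an error on the first run
!oc delete -f test.yaml --ignore-not-found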
!oc apply -f test.yaml

raycluster.ray.io "imdb-ray-test" deleted
raycluster.ray.io/imdb-ray-test created


In [None]:
!oc get pods -o wide | grep imdb-ray-test |  awk '{print $1, $6, $7 }'

The output above should list two worker pods and one head pod for the Ray cluster, each with its own IP address and the physical node it was scheduled on.
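
To check the placement programmatically rather than by eye, the three columns printed by the awk command can be parsed in Python. The sketch below runs on made-up sample text; the pod names, IP addresses, and node names are illustrative only, and PodPlacement and parse_pod_lines are hypothetical helpers.

```python
from typing import List, NamedTuple


class PodPlacement(NamedTuple):
    name: str
    ip: str
    node: str


def parse_pod_lines(text: str) -> List[PodPlacement]:
    """Parse lines of the form '<pod name> <pod IP> <node name>'."""
    placements = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        placements.append(PodPlacement(*fields))
    return placements


# Made-up sample in the shape of the awk output; not real cluster output.
sample = """\
imdb-ray-test-head-abcde 10.128.2.10 worker-node-a
imdb-ray-test-worker-small-group-fghij 10.129.2.11 worker-node-b
imdb-ray-test-worker-small-group-klmno 10.131.0.12 worker-node-c
"""

pods = parse_pod_lines(sample)
distinct_nodes = {p.node for p in pods}
print(len(pods), "pods on", len(distinct_nodes), "distinct nodes")
```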

In [None]:
!oc get svc | grep imdb-ray-test

In [None]:
import ray
from ray.air.config import ScalingConfig

# Copy the service name from above. If you are using the default service and namespace,
# the ray_cluster_uri is ray://imdb-ray-test-head-svc.opendatahub.svc:10001

ray_cluster_uri = "ray://imdb-ray-test-head-svc.opendatahub.svc:10001"

# Install additional libraries required for model training
runtime_env = {"pip": ["transformers", "datasets", "evaluate", "pyarrow<7.0.0", "accelerate"]}

# NOTE: This will work for in-cluster notebook servers (RHODS/ODH), but not for local machines
# To see how to connect from your laptop, go to demo-notebooks/additional-demos/local_interactive.ipynb
ray.init(address=ray_cluster_uri, runtime_env=runtime_env)

print("Ray cluster is up and running: ", ray.is_initialized())
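
The URI used above follows the pattern ray://<service>.<namespace>.svc:<port>. If your service name or namespace differs from the defaults, a small helper can build the string for you. This is a sketch; ray_client_uri is a hypothetical helper, not a Ray API.

```python
def ray_client_uri(service: str, namespace: str, port: int = 10001) -> str:
    """Build a Ray Client URI for a head service reachable inside the cluster."""
    return f"ray://{service}.{namespace}.svc:{port}"


# Reproduces the default URI used in the cell above.
print(ray_client_uri("imdb-ray-test-head-svc", "opendatahub"))
```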

In [None]:
@ray.remote
def train_fn():
    from datasets import load_dataset
    import transformers
    from transformers import AutoTokenizer, TrainingArguments
    from transformers import AutoModelForSequenceClassification
    import numpy as np
    from datasets import load_metric
    import ray
    from ray import tune
    from ray.train.huggingface import HuggingFaceTrainer

    dataset = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = dataset.map(tokenize_function, batched=True)

    # Using a small fraction of the dataset here; you can also run with the full dataset
    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

    print(f"train size: {len(small_train_dataset)}, test size: {len(small_eval_dataset)}")

    ray_train_ds = ray.data.from_huggingface(small_train_dataset)
    ray_evaluation_ds = ray.data.from_huggingface(small_eval_dataset)

    def compute_metrics(eval_pred):
        metric = load_metric("accuracy")
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return metric.compute(predictions=predictions, references=labels)

    def trainer_init_per_worker(train_dataset, eval_dataset, **config):
        model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

        training_args = TrainingArguments(
            "/tmp/hf_imdb/test",
            eval_steps=1,
            disable_tqdm=True,
            num_train_epochs=1,
            skip_memory_metrics=True,
            learning_rate=2e-5,
            per_device_train_batch_size=16,
            per_device_eval_batch_size=16,
            weight_decay=0.01,
        )
        return transformers.Trainer(
            model=model,
            args=training_args,
            train_dataset=train_dataset,
            eval_dataset=eval_dataset,
            compute_metrics=compute_metrics
        )

    # num_workers is the number of training workers; with use_gpu=True, each worker uses one GPU
    scaling_config = ScalingConfig(num_workers=2, use_gpu=False)

    # We use Ray's HuggingFaceTrainer, which runs the Hugging Face Trainer returned by
    # trainer_init_per_worker on each worker and has built-in support for scaling to multiple GPUs.
    trainer = HuggingFaceTrainer(
        trainer_init_per_worker=trainer_init_per_worker,
        scaling_config=scaling_config,
        datasets={"train": ray_train_ds, "evaluation": ray_evaluation_ds},
    )
    result = trainer.fit()
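
To make the compute_metrics step above concrete: it takes the highest-scoring class for each example and compares it with the true label. The sketch below reproduces that logic in plain Python, without numpy or the accuracy metric, on made-up logits and labels.

```python
from typing import Dict, Sequence


def argmax(row: Sequence[float]) -> int:
    """Index of the largest value in the row (first one wins on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> Dict[str, float]:
    """Fraction of rows whose highest-scoring class matches the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up values for illustration: four examples, two classes.
logits = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.4], [1.5, 0.2]]
labels = [1, 0, 0, 0]
print(accuracy(logits, labels))  # {'accuracy': 0.75}
```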

In [None]:
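# Keep the object reference so the task can be cancelled later if needed
ref = train_fn.remote()
ray.get(ref)

In [None]:
# Cancel the task if it is still running, then disconnect from the cluster
ray.cancel(ref)
ray.shutdown()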

In [None]:
!oc delete -f test.yaml