## Phase 1: RayCluster Setup and Ray-Based Distributed Data Processing

- **CodeFlare SDK**: Ray cluster deployment and management on Kubernetes
- **Ray Job Submission**: Distributed synthetic data generation using Ray workers

### Set Up the Ray Cluster Using the CodeFlare SDK

Configure and deploy the Ray cluster for distributed data processing.


In [65]:
# Setup Ray cluster using CodeFlare SDK
from codeflare_sdk import Cluster, ClusterConfiguration, TokenAuthentication
import time

token="<auth_token>"
api_server="<api_server_url>"

# Authenticate with the OpenShift cluster
auth = TokenAuthentication(
    token=token,
    server=api_server,
    skip_tls=True
)
auth.login()

'Logged into https://<api-server-url>:6443'
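
The cell above hard-codes placeholder values for the token and API server. One way to keep real credentials out of the notebook is to read them from environment variables. The following is a minimal sketch; the variable names `OPENSHIFT_TOKEN` and `OPENSHIFT_API_SERVER` are hypothetical and not part of the CodeFlare SDK.

```python
import os


def load_cluster_credentials(env=None):
    """Return (token, server) read from environment variables.

    OPENSHIFT_TOKEN and OPENSHIFT_API_SERVER are hypothetical names chosen
    for this sketch; substitute whatever your environment provides.
    """
    env = os.environ if env is None else env
    token = env.get("OPENSHIFT_TOKEN")
    server = env.get("OPENSHIFT_API_SERVER")

    # Fail early with a clear message instead of attempting a login with empty values
    missing = [
        name
        for name, value in (("OPENSHIFT_TOKEN", token), ("OPENSHIFT_API_SERVER", server))
        if not value
    ]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return token, server
```

The returned pair can then be passed as `token=` and `server=` to `TokenAuthentication`, exactly as in the cell above.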

In [66]:
# Configure Ray cluster for distributed synthetic data generation
from kubernetes.client.models import V1Volume, V1VolumeMount, V1PersistentVolumeClaimVolumeSource

ray_cluster = Cluster(ClusterConfiguration(
    name="test1-cluster",
    num_workers=2,
    # Head node configuration
    head_cpu_requests=2,
    head_cpu_limits=4,
    head_memory_requests=16,
    head_memory_limits=24,
    # Worker node configuration
    worker_cpu_requests=2,
    worker_cpu_limits=4,
    worker_memory_requests=20,
    worker_memory_limits=24,
    # Accelerator (GPU) requests; comment out these two lines if the cluster has no GPUs
    head_extended_resource_requests={'nvidia.com/gpu': 2},
    worker_extended_resource_requests={'nvidia.com/gpu': 2},
    # Ray runtime image
    image="quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26",
    # Volume mount: shared PVC storage with RWX (ReadWriteMany) permissions
    volume_mounts=[
        V1VolumeMount(
            name="shared",
            mount_path="/shared"
        )
    ],
    volumes=[
        V1Volume(
            name="shared",
            persistent_volume_claim=V1PersistentVolumeClaimVolumeSource(
                claim_name="shared"
            )
        )
    ],
))

print(" Ray Cluster Configuration:")
print(f"   Name: {ray_cluster.config.name}")
print(f"   Workers: {ray_cluster.config.num_workers}")
print(f"   Worker Resources: {ray_cluster.config.worker_cpu_requests}CPU, {ray_cluster.config.worker_memory_requests} RAM, {ray_cluster.config.worker_extended_resource_requests} GPU")
print(f"   Image: {ray_cluster.config.image}")


Yaml resources loaded for test1-cluster


 Ray Cluster Configuration:
   Name: test1-cluster
   Workers: 2
   Worker Resources: 2CPU, 20G RAM, {'nvidia.com/gpu': 2} GPU
   Image: quay.io/rhoai/ray:2.35.0-py311-cu121-torch24-fa26
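
As a quick sanity check on the configuration above, the GPU capacity of the cluster can be worked out by hand. This is a rough sketch of the arithmetic only. It assumes the head node's GPUs are schedulable for job work, which depends on how the cluster and job are set up. The helper names are illustrative.

```python
def total_gpus(num_workers, gpus_per_worker_pod, head_gpus=0):
    """Total GPUs requested across the head pod and all worker pods."""
    return head_gpus + num_workers * gpus_per_worker_pod


def max_gpu_tasks(available_gpus, gpus_per_task):
    """Upper bound on tasks that can each hold `gpus_per_task` GPUs at once."""
    if gpus_per_task <= 0:
        raise ValueError("gpus_per_task must be positive")
    return available_gpus // gpus_per_task


# Values taken from the ClusterConfiguration above: 2 workers, 2 GPUs each, 2 GPUs on the head
cluster_gpus = total_gpus(num_workers=2, gpus_per_worker_pod=2, head_gpus=2)
print(cluster_gpus)                    # 6
print(max_gpu_tasks(cluster_gpus, 1))  # 6
```

With one GPU per task this gives an upper bound of six concurrent GPU tasks, which lines up with the `--max-concurrent-workers 6` and `--gpus-per-worker 1` flags used in the job submission in this notebook, assuming those flags map to one Ray task or actor per GPU.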


In [63]:
# Deploy the Ray cluster
ray_cluster.apply()

Ray Cluster: 'test1-cluster' has successfully been applied. For optimal resource management, you should delete this Ray Cluster when no longer in use.


In [67]:
# Wait for Ray cluster to be ready
ray_cluster.wait_ready()

Waiting for requested resources to be set up...
Requested cluster is up and running!
Dashboard is ready!


In [44]:
ray_cluster.details()

RayCluster(name='test1-cluster', status=<RayClusterStatus.READY: 'ready'>, head_cpu_requests=2, head_cpu_limits=4, head_mem_requests='16G', head_mem_limits='24G', num_workers=2, worker_mem_requests='20G', worker_mem_limits='24G', worker_cpu_requests=2, worker_cpu_limits=4, namespace='<test-namespace>', dashboard='https://ray-dashboard-<raycluster-name>-<namespace>.<domain_url>', worker_extended_resources={'nvidia.com/gpu': 2}, head_extended_resources={'nvidia.com/gpu': 2})

In [68]:
# Initialize the Job Submission Client
client = ray_cluster.job_client
print("Ray job client initialized")


Ray job client initialized


### Submit a Ray Job for Synthetic Data Generation

Submit the synthetic data generation script to the Ray cluster:


In [81]:
# Submit the Ray Data SDG job for distributed synthetic data generation
submission_id = client.submit_job(
    entrypoint=(
            "python scripts/ray_sdg_job.py "
            "--enable-multi-node "
            "--seeds 1000 "
            "--variations 2 "
            "--batch-size 2 "
            "--quality-threshold 0.75 "
            "--output-path /shared/synthetic_data_v2 "
            "--max-concurrent-workers 6 "
            "--gpus-per-worker 1 "
            "--resume "
            "--save-every 100"
        ),
    runtime_env={
        "env_vars": {
            "HF_HOME": "/shared/cache",
            "HF_DATASETS_CACHE": "/shared/cache/datasets",
            "TOKENIZERS_PARALLELISM": "false",
        },
        "pip": [
            "ray[data]>=2.8.0",
            "transformers>=4.36.0",
            "torch>=2.0.0",
            "datasets>=2.14.0",
            "accelerate>=0.24.0",
            "numpy>=1.21.0",
            "tqdm>=4.64.0",
            "pyarrow>=12.0.0,<15.0.0",
        ],
        "working_dir": "./",
        "excludes": ["*.ipynb", "*.md"]
    },
)

print(f"Ray Data SDG job submitted with ID: {submission_id}")

Ray Data SDG job submitted with ID: raysubmit_SVqNSiZVvbSCtsJV
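
The entrypoint above is assembled from adjacent string literals, where a missing trailing space silently merges two flags. As an alternative, the command can be built from a dictionary of options. This is a sketch only; `build_entrypoint` is a hypothetical helper, not part of Ray or the CodeFlare SDK.

```python
import shlex


def build_entrypoint(script, options, flags=()):
    """Build a shell-safe `python <script> ...` command string.

    `flags` are valueless switches (e.g. "resume"); `options` maps option
    names to values. Each name is prefixed with "--".
    """
    parts = ["python", script]
    parts += [f"--{flag}" for flag in flags]
    for name, value in options.items():
        parts += [f"--{name}", str(value)]
    # shlex.join quotes any argument that contains shell-special characters
    return shlex.join(parts)


entrypoint = build_entrypoint(
    "scripts/ray_sdg_job.py",
    options={
        "seeds": 1000,
        "variations": 2,
        "batch-size": 2,
        "quality-threshold": 0.75,
        "output-path": "/shared/synthetic_data_v2",
    },
    flags=("enable-multi-node", "resume"),
)
print(entrypoint)
```

The resulting string can be passed as `entrypoint=` to `client.submit_job`.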


In [None]:
client.get_job_logs(submission_id)
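
`get_job_logs` returns whatever has been logged so far, so it helps to know when the job has actually finished. The following is a minimal polling sketch. It assumes the job client exposes `get_job_status(submission_id)` as Ray's `JobSubmissionClient` does, and that terminal states are named `SUCCEEDED`, `FAILED`, and `STOPPED`. The helper `wait_for_job` itself is hypothetical.

```python
import time

# Assumed names of Ray job states after which no further transitions occur
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, poll_seconds=10.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll `get_status()` until it reports a terminal state; return that state.

    `get_status` is any zero-argument callable returning a status string or
    an enum with a `.value` attribute. Raises TimeoutError on timeout.
    """
    waited = 0.0
    while True:
        status = get_status()
        # Normalize enum members (e.g. JobStatus.SUCCEEDED) to plain strings
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if waited >= timeout_seconds:
            raise TimeoutError(f"Job still {name} after {timeout_seconds:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

Under those assumptions, usage would look like `final_state = wait_for_job(lambda: client.get_job_status(submission_id))`, followed by `client.get_job_logs(submission_id)` to read the complete log.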

In [82]:
# Stop the job if it is still running; delete_job removes a job that has already reached a terminal state
# client.stop_job(submission_id)
client.delete_job(submission_id)

True

### Cleanup Ray Cluster

Clean up the Ray cluster resources (following ray_finetune_llm_deepspeed.ipynb pattern):


In [83]:
# Cleanup Ray cluster (following ray_finetune_llm_deepspeed.ipynb pattern)
print(" Cleaning up Ray cluster...")

# Tear down the Ray cluster
ray_cluster.down()


 Cleaning up Ray cluster...
Ray Cluster: 'test1-cluster' has successfully been deleted
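
If a cell fails between `apply()` and `down()`, the cluster keeps holding its CPUs and GPUs. One way to guard against that in script form is a context manager that always tears the cluster down. This is a sketch that relies only on the `apply()`, `wait_ready()`, and `down()` methods used in this notebook; `managed_cluster` is a hypothetical helper.

```python
from contextlib import contextmanager


@contextmanager
def managed_cluster(cluster):
    """Bring a cluster up, yield it, and always tear it down afterwards."""
    cluster.apply()
    try:
        cluster.wait_ready()
        yield cluster
    finally:
        # Runs on success, on exceptions in the body, and if wait_ready() fails
        cluster.down()
```

In a script, the job submission would then sit inside `with managed_cluster(ray_cluster) as cluster:`, so the teardown runs even when the job code raises.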


In [84]:
import json
import os

# Check for the generated dataset at the expected locations
paths = ["/opt/app-root/src/shared/synthetic_data/synthetic_dataset.json",
         "/opt/app-root/src/shared/synthetic_data/final_synthetic_dataset.json"]
dataset_path = next((p for p in paths if os.path.exists(p)), None)

if dataset_path:
    with open(dataset_path, "r") as f:
        data = json.load(f)

    # Initialize here so the sample check below works even if the file is not a JSON list
    sample = None
    if isinstance(data, list):
        total_samples = len(data)
        avg_quality = sum(item.get('overall_quality', 0) for item in data) / total_samples if total_samples > 0 else 0
        sample = data[0] if data else None

        print(f" Dataset found: {total_samples} samples")
        print(f"   Avg quality: {avg_quality:.2f} \n   Source: {sample.get('source', 'N/A') if sample else 'N/A'}")

    # Show the first sample, tolerating missing keys
    if sample:
        print(f"   Sample Question -> {sample.get('question', 'N/A')}")
        print(f"   Sample Answer -> {sample.get('answer', 'N/A')}")

    print("\n Ready for training!")
else:
    print(" Dataset not found. Run Ray Data job first.")

 Dataset found: 295 samples
   Avg quality: 0.76 
   Source: ray_data_sdg_qwen
   Sample Question -> A school bought 7 boxes of pencils. Each box contains 24 pencils. How many pencils did the school buy?
   Sample Answer -> To find out how many pencils the school bought, we multiply the number of boxes by the number of pencils per box.
Number of boxes: 7
Number of pencils per box: 24
Total pencils = 7 * 24 = 168
Therefore, the school bought 168 pencils.

 Ready for training!
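
The average quality alone does not show how many samples sit near or below the `--quality-threshold` passed to the job. A short summary of the score distribution makes that visible. This is a sketch that assumes each record carries the `overall_quality` field read in the cell above. `summarize_quality` is a hypothetical helper, and the records in the example are made up for illustration.

```python
def summarize_quality(samples, threshold=0.75, key="overall_quality"):
    """Return count, mean, min, max, and the number of scores below `threshold`."""
    # Records missing the key count as 0, matching the averaging cell above
    scores = [float(s.get(key, 0)) for s in samples]
    if not scores:
        return {"count": 0, "mean": 0.0, "min": 0.0, "max": 0.0, "below_threshold": 0}
    return {
        "count": len(scores),
        "mean": sum(scores) / len(scores),
        "min": min(scores),
        "max": max(scores),
        "below_threshold": sum(1 for score in scores if score < threshold),
    }


# Illustrative records only, not taken from the generated dataset
example = [{"overall_quality": 0.5}, {"overall_quality": 1.0}, {"question": "no score"}]
print(summarize_quality(example))
```

On the real dataset this would be called as `summarize_quality(data)` after the file has been loaded.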
