## Distributed XGBoost with GPUs on Ray 

Ray provides a distributed, data-parallel version of XGBoost. With drop-in replacements for native `xgboost` classes, XGBoost on Ray lets you distribute training across a multi-node cluster.

This demo uses a dataset created from `00-create-dataset` with 100M rows x 100 feature columns x 1 target column (5 classes) for multi-class classification. This dataset is ~40GiB.




#### FAQs
When should I switch to a distributed version of XGBoost?
- XGBoost datasets larger than about 1B rows should use distributed data parallelism (DDP). This demo uses only 100M rows for demonstration purposes.
- Before switching to distributed multi-GPU training, consider a single node with multi-threading across all of its CPUs.

How much memory (VRAM) do I need for my dataset? Some quick back-of-the-napkin math:
- 100M rows x 100 columns x 4 bytes (float32) = ~40GB
- We'll need between 1-3x the data footprint in VRAM across our GPUs (we'll go with 2x, so ~80GiB) to train our model (accounting for gradients and the model itself).
- Depending on `num_boost_round` (set from `num_estimators` in this notebook) or `max_depth`, you may require 4x-8x more memory per node.
- I'm using g4dn.12xlarge worker nodes on AWS (4 GPUs/node at 16GiB VRAM/GPU). At ~80GiB that is at least 5 GPUs, so I'll use 6 GPUs to train one model and leave some headroom.
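
The napkin math above can be sketched as a small helper. This is only an illustration under the stated assumptions (4 bytes per value, a 2x training overhead, 16GiB of VRAM per GPU); `estimate_min_gpus` is a hypothetical name, not part of Ray or XGBoost, and the result is a lower bound rather than a sizing guarantee.

```python
import math


def estimate_min_gpus(rows, cols, bytes_per_value=4, overhead=2.0, vram_gib_per_gpu=16.0):
    """Rough lower bound on the number of GPUs needed to hold the training data.

    All defaults are assumptions taken from the napkin math above:
    float32 values, a 2x overhead for training state, and 16GiB of VRAM per GPU.
    """
    data_gib = rows * cols * bytes_per_value / 2**30  # raw data footprint in GiB
    required_gib = data_gib * overhead                # footprint including training overhead
    min_gpus = math.ceil(required_gib / vram_gib_per_gpu)
    return data_gib, required_gib, min_gpus


data_gib, required_gib, min_gpus = estimate_min_gpus(100_000_000, 100)
print(f"data: {data_gib:.1f} GiB, required: {required_gib:.1f} GiB, min GPUs: {min_gpus}")
```

For this dataset the lower bound comes out at 5 GPUs, so running with 6 leaves some headroom. Raising `overhead` to 3.0 pushes the bound to 7.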


#### Compute specifications to run this notebook
```json
{
    "num_workers": 2,
    "cluster_name": "Multi-node MLR w/ GPUs",
    "spark_version": "16.4.x-gpu-ml-scala2.13",
    "spark_conf": {
        "spark.task.resource.gpu.amount": "0",
        "spark.executor.memory": "1g"
    },
    "aws_attributes": {
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",
        "zone_id": "auto",
        "spot_bid_price_percent": 100,
        "ebs_volume_count": 0
    },
    "node_type_id": "g4dn.12xlarge",
    "driver_node_type_id": "g4dn.12xlarge",
    "spark_env_vars": {
        "DATABRICKS_TOKEN": "{{secrets/development/jon_cheung_PAT}}"
    },
    "enable_elastic_disk": false,
    "single_user_name": "jon.cheung@databricks.com",
    "enable_local_disk_encryption": false,
    "data_security_mode": "SINGLE_USER",
    "runtime_engine": "STANDARD",
    "assigned_principal": "user:jon.cheung@databricks.com"
}
```

In [0]:
%pip install -qU "ray[all]==2.47.1"

dbutils.library.restartPython()

In [0]:
num_training_rows = 100_000_000
num_training_columns = 100
num_labels = 5

catalog = "main"
schema = "ray_gtm_examples"

table = f"synthetic_data_{num_training_rows}_rows_{num_training_columns}_columns_{num_labels}_labels"
label="target"

# If running in a multi-node cluster, this is where you
# should configure the run's persistent storage that is accessible
# across all worker nodes.
ray_xgboost_path = '/dbfs/Users/jon.cheung@databricks.com/ray_xgboost_trainer/' 
# This is for stashing the cluster logs
ray_logs_path = "/dbfs/Users/jon.cheung@databricks.com/ray_collected_logs"

In [0]:
import ray
from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster

restart = True
if restart:
  # Best-effort teardown of any existing Ray cluster; errors are ignored on purpose.
  try:
    shutdown_ray_cluster()
  except Exception:
    pass
  try:
    ray.shutdown()
  except Exception:
    pass

# The below configuration mirrors my Spark worker cluster set up. Change this to match your cluster configuration. 
setup_ray_cluster(
  min_worker_nodes=2,
  max_worker_nodes=2,
  num_cpus_worker_node=48,
  num_gpus_worker_node=4,
  num_cpus_head_node=48,
  num_gpus_head_node=4,
  collect_log_to_path=ray_logs_path
)

In [0]:
import ray
import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback
import os


try: 
  ## Option 1 (PREFERRED): Build a Ray Dataset using a Databricks SQL Warehouse
  # Insert your SQL warehouse ID here. I've queried my 100M row dataset using a Small t-shirt sized cluster.

  # Ensure you've set the DATABRICKS_TOKEN so you can query using the warehouse compute
  # Mine is commented out because I stashed it as an environment variable in the Spark config. 
  # os.environ['DATABRICKS_TOKEN'] = dbutils.secrets.get(scope = "development", key = "jon_cheung_PAT")
  ds = ray.data.read_databricks_tables(
    warehouse_id='2a72600bb68f00ee',
    catalog=catalog,
    schema=schema,
    query=f'SELECT * FROM {table}',
  )
except Exception:
  ## Option 2: Build a Ray Dataset from Parquet files
  # If you have too many Ray nodes, rate limits may prevent you from creating a Ray Dataset
  # with the warehouse method above. A backup solution is to write Parquet files from the
  # Delta table and build the Ray Dataset from those. This is not the recommended route
  # because, in essence, you are duplicating data.
  parquet_path = f'/Volumes/{catalog}/{schema}/synthetic_data/{table}'
  ds = ray.data.read_parquet(parquet_path)

train_dataset, val_dataset = ds.train_test_split(test_size=0.25)



In [0]:
import xgboost
import ray.train
from ray.train.xgboost import XGBoostTrainer, RayTrainReportCallback

## Distributed training function per worker.
# This should look very similar to a vanilla xgboost training.
# In essence it's simply retrieving and training on a shard of the distributed dataset.
def train_fn_per_worker(params: dict):
    # Get dataset shards for this worker
    train_shard = ray.train.get_dataset_shard("train")
    val_shard = ray.train.get_dataset_shard("val")

    # Convert shards to pandas DataFrames
    train_df = train_shard.materialize().to_pandas()
    val_df = val_shard.materialize().to_pandas()

    train_X = train_df.drop(label, axis=1)
    train_y = train_df[label]
    val_X = val_df.drop(label, axis=1)
    val_y = val_df[label]
    
    dtrain = xgboost.DMatrix(train_X, label=train_y)
    deval = xgboost.DMatrix(val_X, label=val_y)

    # Do distributed data-parallel training.
    # Ray Train sets up the necessary coordinator processes and
    # environment variables for workers to communicate with each other.
    evals_results = {}
    bst = xgboost.train(
        params,
        dtrain=dtrain,
        evals=[(deval, "validation")],
        num_boost_round=params['num_estimators'],
        evals_result=evals_results,
        # early_stopping_rounds=params['early_stopping_rounds'],
        callbacks=[RayTrainReportCallback(metrics={params['eval_metric']: f"validation-{params['eval_metric']}"},
                                          frequency=1)],
    )
    
    # Retrieve the final evaluation metric value from the training process
    final_eval_metric = evals_results['validation'][params['eval_metric']][-1]

    # # Report evaluation metric to Ray Train
    # ray.train.report({params['eval_metric']: final_eval_metric,
    #                "done": True})
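
The `metrics` argument passed to `RayTrainReportCallback` above maps a reported name (`mlogloss`) to XGBoost's `<eval name>-<metric>` key (`validation-mlogloss`). The sketch below mimics that naming convention in plain Python so the mapping is easier to see. It is an illustration only, not Ray's implementation; `flatten_evals` and `select_metrics` are hypothetical helpers, and the loss values are made up.

```python
def flatten_evals(evals_log: dict) -> dict:
    """Flatten {"validation": {"mlogloss": [...]}} into {"validation-mlogloss": latest_value}."""
    return {
        f"{data_name}-{metric_name}": values[-1]
        for data_name, metrics in evals_log.items()
        for metric_name, values in metrics.items()
    }


def select_metrics(evals_log: dict, metrics: dict) -> dict:
    """Pick the flattened keys named in `metrics` and rename them to the reported names."""
    flat = flatten_evals(evals_log)
    return {report_name: flat[xgb_key] for report_name, xgb_key in metrics.items()}


# Made-up per-round validation losses, shaped like XGBoost's `evals_result` dict.
evals_log = {"validation": {"mlogloss": [1.52, 1.31, 1.18]}}
reported = select_metrics(evals_log, {"mlogloss": "validation-mlogloss"})
print(reported)
```

Under this convention the driver can later read the value back as `result.metrics['mlogloss']`.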

In [0]:

from ray.tune.integration.ray_train import TuneReportCallback
# The driver function is responsible for triggering distributed training,
# collecting results, and coordinating workers. Here we use Ray's XGBoostTrainer
def train_driver_fn(config: dict, train_dataset, val_dataset):
    # Unpack run-level hyperparameters.
    # Tune feeds in hyperparameters defined in the `param_space` below.
    num_workers = config["num_workers"]
    use_gpu = config["use_gpu"]
    params = config['params']

    trainer = XGBoostTrainer(
      train_loop_per_worker=train_fn_per_worker,
      train_loop_config=params,
      # By default, Ray assigns 1 CPU per worker, plus 1 GPU per worker when use_gpu=True,
      # unless resources_per_worker is specified.
      # XGBoost is multi-threaded, so a worker can be given multiple CPUs,
      # but each worker trains on a single GPU. That is why we use DDP across several workers.
      scaling_config=ray.train.ScalingConfig(num_workers=num_workers, 
                                             use_gpu=use_gpu),
      datasets={"train": train_dataset, "val": val_dataset},
      run_config=ray.train.RunConfig(storage_path=ray_xgboost_path)
                                    # These parameters will be enabled below when we integrate this trainer with Ray Tune
                                    #  callbacks=[TuneReportCallback()])
                                    #  name=f"train-trial_id={ray.tune.get_context().get_trial_id()}")
    )
                                    
    result = trainer.fit()
    # propagate metrics back up
    ray.tune.report({'mlogloss': result.metrics['mlogloss']})
    # return result

In [0]:
# # specify model hyper-parameters
# # TODO: early_stopping_rounds will stop but causes an error in the RayTrainReportCallback. 
# # Perform one single run to ensure the driver function works
# # I want to use 6 GPUs so I explicitly define 6 workers.
# config = {"num_workers": 6,
#           "use_gpu": True,
#           "params":{
#             "objective": "multi:softmax",
#             'eval_metric': 'mlogloss', 
#             "tree_method": "hist",
#             "device": "cuda",
#             "num_class": 5,
#             "num_estimators": 20
#           }
#           }


# result = train_driver_fn(config=config,
#                        train_dataset = train_dataset,
#                        val_dataset = val_dataset)

In [0]:
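# # Assumes `result` is the ray.train.Result from trainer.fit() (e.g. returned by train_driver_fn).
# with result.checkpoint.as_directory() as checkpoint_dir: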
#     model_path = os.path.join(checkpoint_dir, RayTrainReportCallback.CHECKPOINT_NAME)
#     print(model_path)
#     model = xgboost.Booster()
#     model.load_model(model_path)

## Advanced: Ray Tune with Ray Train

https://docs.ray.io/en/latest/train/user-guides/hyperparameter-optimization.html#hyperparameter-tuning-with-ray-tune

In [0]:
from ray import tune
from ray.tune.tuner import Tuner
from ray.tune.search.optuna import OptunaSearch


# Define resources per HPO trial and calculate max concurrent HPO trials
num_gpu_workers_per_trial = 6
num_samples = 2
resources = ray.cluster_resources()
# Default to 0 if the cluster reports no GPUs, and always allow at least one trial.
total_cluster_gpus = resources.get("GPU", 0)
max_concurrent_trials = max(1, int(total_cluster_gpus // num_gpu_workers_per_trial))


# Define the hyperparameter search space.
# XGB sample hyperparameter configs
param_space = {
    "num_workers": num_gpu_workers_per_trial,
    "use_gpu": True,
    "params":{"objective": "multi:softmax",
              'eval_metric': 'mlogloss', 
              "tree_method": "hist",
              "device": "cuda",
              "num_class": num_labels,
              "learning_rate": tune.uniform(0.01, 0.3),
              "num_estimators": tune.randint(25, 50)}
}

# Set up the search algorithm. Here we use Optuna with its default Bayesian sampler (TPE).
# optuna = OptunaSearch(metric="mlogloss", 
#                       mode="min")


tuner = tune.Tuner(
    tune.with_parameters(train_driver_fn,
                         train_dataset = train_dataset,
                         val_dataset = val_dataset),
    tune_config=tune.TuneConfig(num_samples=num_samples,
                                # search_alg=optuna,
                                max_concurrent_trials=max_concurrent_trials),
    param_space=param_space,
    )

results = tuner.fit()

results.get_best_result(metric="mlogloss", 
                        mode="min").config
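
The concurrency calculation in the tuning cell is plain integer division: each trial reserves a fixed number of GPU workers, so the cluster can only host as many trials as fit into its GPU count. A minimal sketch, assuming the cluster configured earlier reports 12 GPUs (a head node and two workers at 4 GPUs each); `max_trials_for` is a hypothetical helper, not a Ray API.

```python
def max_trials_for(total_gpus: float, gpus_per_trial: int) -> int:
    """Number of trials that fit on the cluster at once, given GPUs reserved per trial."""
    if gpus_per_trial <= 0:
        raise ValueError("gpus_per_trial must be positive")
    trials = int(total_gpus // gpus_per_trial)
    if trials < 1:
        raise ValueError("the cluster has fewer GPUs than a single trial requires")
    return trials


print(max_trials_for(12, 6))  # head node + 2 workers, 4 GPUs each
print(max_trials_for(8, 6))   # workers only: the 2 leftover GPUs cannot host a second trial
```

With 6 GPU workers per trial, 12 GPUs allow two trials to run side by side, while 8 GPUs allow only one and leave 2 GPUs idle.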