# Distributed training with Vertex AI

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/google/yggdrasil-decision-forests/blob/main/documentation/public/docs/tutorial/distributed_training_vertex_ai.ipynb)

## Setup

In [1]:
pip install ydf -U

In [5]:
import ydf
import pandas as pd
import os

<div style="border:1px solid #AF8FDF; background-color:#EADCFF; padding: 5px;">
<p>For a general introduction to distributed training with YDF, see the <a href="https://ydf.readthedocs.io/en/latest/tutorial/distributed_training/">YDF Distributed Training</a> tutorial.</p></div>

<div style="border:1px solid #8FAFDF; background-color:#DCEAFF; padding: 5px;">
<b>For Googlers</b>
<p>YDF internal examples available at <a href="http://go/ydf/examples">go/ydf/examples</a> demonstrate how to use distributed training on Google infrastructure.</p></div>

## Introduction

By default, YDF trains a model on a single machine. This works well for datasets with up to a few million examples, but not for datasets with billions of examples. YDF distributed training solves this problem by dividing the computation over multiple machines. As a rule of thumb, consider distributed training once the dataset exceeds 100M examples.

[Vertex AI](https://cloud.google.com/vertex-ai) is a [Google Cloud](https://en.wikipedia.org/wiki/Google_Cloud_Platform) service for training ML models (among other things) on many machines. This tutorial shows how to train a YDF model on Vertex AI, both with and without distributed training.

 If you are unfamiliar with YDF, make sure to read the [Getting Started](../getting_started) tutorial first.

## Log in and set up Google Cloud and Vertex AI

In this tutorial, we use the [gcloud CLI](https://cloud.google.com/sdk/docs/install). Make sure it is installed.

The commands from this tutorial can be typed in a shell or in a Colab cell (with the `!` or `%%bash` prefix).

Note that the `gcloud auth login` command does not always work in Jupyter notebooks. In that case, run it in a shell instead.

The first step is to log in and set the project. In a shell, use the command:

```shell
gcloud auth login
gcloud config set project <PROJECT_ID>
```

In a Google Colab, you can do the following instead:

```python
from google.colab import auth

PROJECT_ID = "<PROJECT_ID>"  # Replace with your own project ID.
auth.authenticate_user(project_id=PROJECT_ID)
```

In this example, the project ID is `custom-oasis-452410-c2`, but to run this example you need to create your own project.

Next, we need to enable two cloud services: Vertex AI (previously known as AI Platform) and Cloud Build (to build the Docker images).

In [37]:
!gcloud services enable cloudbuild.googleapis.com

In [38]:
!gcloud services enable aiplatform.googleapis.com

Google Cloud automatically creates a "service account" named `<PROJECT_NUMBER>-compute@developer.gserviceaccount.com`. In this example, the service account is `282665763673-compute@developer.gserviceaccount.com`. You can find the project number in the Google Cloud console or by typing:

In [41]:
!gcloud projects describe custom-oasis-452410-c2 --format="value(projectNumber)"

282665763673


**Note:** The project ID and project number are two different identifiers.
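Since the default service account follows the naming pattern shown above, its address can be derived from the project number. The sketch below is only an illustration: the helper name `default_compute_service_account` is hypothetical and not part of any Google library.

```python
def default_compute_service_account(project_number: str) -> str:
  """Returns the default compute service account email for a project number."""
  # A project ID (e.g. "custom-oasis-452410-c2") is not a project number.
  if not project_number.isdigit():
    raise ValueError(
        f"Expected a numeric project number, got {project_number!r}"
    )
  return f"{project_number}-compute@developer.gserviceaccount.com"


print(default_compute_service_account("282665763673"))
```

Passing the project ID instead of the project number raises an error, which guards against mixing up the two identifiers.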

The service account is responsible for building the Docker image and running the job. For this, you need to grant it the following roles.

In [None]:
%%bash
gcloud projects add-iam-policy-binding custom-oasis-452410-c2 \
    --member="serviceAccount:282665763673-compute@developer.gserviceaccount.com" \
    --role="roles/storage.objectViewer"

gcloud projects add-iam-policy-binding custom-oasis-452410-c2 \
    --member="serviceAccount:282665763673-compute@developer.gserviceaccount.com" \
    --role="roles/run.builder"

gcloud projects add-iam-policy-binding custom-oasis-452410-c2 \
    --member="serviceAccount:282665763673-compute@developer.gserviceaccount.com" \
    --role="roles/artifactregistry.createOnPushWriter"
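The three bindings above differ only in the role, so they can also be generated from a list. The following sketch builds the same `gcloud` invocations as argument lists and prints them without running anything. The helper name `iam_binding_command` is hypothetical, and the project ID and service account are the ones from this example.

```python
import shlex

PROJECT_ID = "custom-oasis-452410-c2"  # Replace with your own project ID.
SERVICE_ACCOUNT = "282665763673-compute@developer.gserviceaccount.com"
ROLES = [
    "roles/storage.objectViewer",
    "roles/run.builder",
    "roles/artifactregistry.createOnPushWriter",
]


def iam_binding_command(
    project_id: str, service_account: str, role: str
) -> list[str]:
  """Builds the arguments of one `gcloud projects add-iam-policy-binding` call."""
  return [
      "gcloud", "projects", "add-iam-policy-binding", project_id,
      f"--member=serviceAccount:{service_account}",
      f"--role={role}",
  ]


commands = [
    iam_binding_command(PROJECT_ID, SERVICE_ACCOUNT, role) for role in ROLES
]
for command in commands:
  # Print only. To execute, pass `command` to subprocess.run(command, check=True).
  print(shlex.join(command))
```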

Now that Google Cloud is configured, we can configure the model :).

## Preparing the data

First, we need a dataset. A good option is to use CSV, Avro, or TensorFlow Record (TFRecord) files in a [bucket](https://cloud.google.com/storage/docs/creating-buckets). We will use CSV in this example.

For the training to be efficient, the dataset needs to be divided into several files (also known as "sharding"). In this section, we download the "adult" dataset, divide it into pieces, and save them in a bucket.

Ideally, the number of shards should be ~10x the number of workers. So if you train with 20 workers, splitting the data into 200 pieces is a good idea. 
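As a small illustration of this rule of thumb, the hypothetical helper below (not part of YDF) turns a worker count into a suggested shard count. The factor of 10 is the guideline stated above, not a hard requirement.

```python
def recommended_num_shards(num_workers: int, shards_per_worker: int = 10) -> int:
  """Suggests a shard count of roughly `shards_per_worker` shards per worker."""
  if num_workers < 1:
    raise ValueError("num_workers must be at least 1")
  return num_workers * shards_per_worker


print(recommended_num_shards(20))  # 200
```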

The adult dataset is a small dataset with only ~30k examples. It does not need distributed training, but it is good for the demonstration.

First, let's create a bucket where we will store the dataset, model, and temporary files.

In [7]:
!gcloud storage buckets create gs://ydf_bucket --location=us-east1

Creating gs://ydf_bucket/...


Then, let's download a dataset.

**Note:** For a real large dataset, you will likely export the data using [Google Bigtable](https://cloud.google.com/bigtable) or generate it with [Apache Beam](https://beam.apache.org/).

In [43]:
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")

print("The dataset has",len(train_ds),"training examples")

The dataset has 22792 training examples


Let's split the dataset and upload it to our bucket.

In [10]:
def split_dataset(
    dataset: pd.DataFrame, tmp_dir: str, num_shards: int
) -> list[str]:
  """Splits a DataFrame into `num_shards` CSV files written under `tmp_dir`."""

  # Only local directories need to be created; bucket paths (e.g. gs://) do not.
  if "://" not in tmp_dir:
    os.makedirs(tmp_dir, exist_ok=True)
  # Ceiling division, so that every row lands in a shard.
  num_row_per_shard = (dataset.shape[0] + num_shards - 1) // num_shards
  paths = []
  for shard_idx in range(num_shards):
    begin_idx = shard_idx * num_row_per_shard
    end_idx = (shard_idx + 1) * num_row_per_shard
    shard_dataset = dataset.iloc[begin_idx:end_idx]
    shard_path = os.path.join(tmp_dir, f"shard_{shard_idx}.csv")
    paths.append(shard_path)
    shard_dataset.to_csv(shard_path, index=False)
  return paths

split_dataset(train_ds, "gs://ydf_bucket/train_dataset", 10)

['gs://ydf_bucket/train_dataset/shard_0.csv',
 'gs://ydf_bucket/train_dataset/shard_1.csv',
 'gs://ydf_bucket/train_dataset/shard_2.csv',
 'gs://ydf_bucket/train_dataset/shard_3.csv',
 'gs://ydf_bucket/train_dataset/shard_4.csv',
 'gs://ydf_bucket/train_dataset/shard_5.csv',
 'gs://ydf_bucket/train_dataset/shard_6.csv',
 'gs://ydf_bucket/train_dataset/shard_7.csv',
 'gs://ydf_bucket/train_dataset/shard_8.csv',
 'gs://ydf_bucket/train_dataset/shard_9.csv']
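To see how the rows are distributed, the sketch below reproduces the index arithmetic of `split_dataset` without pandas. The helper name `shard_sizes` is hypothetical. It clamps the indices to the row count, which mirrors how `iloc` slicing truncates out-of-range slices.

```python
def shard_sizes(num_rows: int, num_shards: int) -> list[int]:
  """Row count of each shard under the ceiling-division split."""
  rows_per_shard = (num_rows + num_shards - 1) // num_shards
  sizes = []
  for shard_idx in range(num_shards):
    begin = min(shard_idx * rows_per_shard, num_rows)
    end = min((shard_idx + 1) * rows_per_shard, num_rows)
    sizes.append(end - begin)
  return sizes


sizes = shard_sizes(22792, 10)
print(sizes)  # Nine shards of 2280 rows, then one shard of 2272 rows.
print(sum(sizes))  # 22792
```

With very few rows per shard, the last shards can end up empty (for example, 5 rows over 4 shards gives sizes 2, 2, 1, 0), so it is worth keeping the shard count well below the number of rows.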

Using the `gcloud storage ls` command, we can make sure the dataset is there.

In [14]:
!gcloud storage ls gs://ydf_bucket/train_dataset

gs://ydf_bucket/train_dataset/shard_0.csv
gs://ydf_bucket/train_dataset/shard_1.csv
gs://ydf_bucket/train_dataset/shard_2.csv
gs://ydf_bucket/train_dataset/shard_3.csv
gs://ydf_bucket/train_dataset/shard_4.csv
gs://ydf_bucket/train_dataset/shard_5.csv
gs://ydf_bucket/train_dataset/shard_6.csv
gs://ydf_bucket/train_dataset/shard_7.csv
gs://ydf_bucket/train_dataset/shard_8.csv
gs://ydf_bucket/train_dataset/shard_9.csv


Let's also save the testing dataset.

We will use it for validation.

In [44]:
split_dataset(test_ds, "gs://ydf_bucket/valid_dataset", 10)

['gs://ydf_bucket/valid_dataset/shard_0.csv',
 'gs://ydf_bucket/valid_dataset/shard_1.csv',
 'gs://ydf_bucket/valid_dataset/shard_2.csv',
 'gs://ydf_bucket/valid_dataset/shard_3.csv',
 'gs://ydf_bucket/valid_dataset/shard_4.csv',
 'gs://ydf_bucket/valid_dataset/shard_5.csv',
 'gs://ydf_bucket/valid_dataset/shard_6.csv',
 'gs://ydf_bucket/valid_dataset/shard_7.csv',
 'gs://ydf_bucket/valid_dataset/shard_8.csv',
 'gs://ydf_bucket/valid_dataset/shard_9.csv']

## Create docker

To run in Vertex AI, the code cannot be executed in a notebook. Instead, the training code needs to be packaged in a Docker image.

To pass the dataset path and other options to the training program, we use the `argparse` library. We also add an option to enable or disable distributed training. This will be useful to test the trainer quickly.

In [None]:
%%writefile train.py
import argparse
import dataclasses
import json
import os
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union
import ydf

parser = argparse.ArgumentParser()
# Path to training dataset. Should be prefixed with the dataset type e.g. 'csv:'.
# See the supported formats at https://ydf.readthedocs.io/en/latest/dataset_formats/
parser.add_argument("--train_ds", type=str, required=True)
# Path to validation dataset. If empty, the model is trained without validation.
parser.add_argument("--valid_ds", type=str)
# Path to test dataset. If empty, the model is not evaluated.
parser.add_argument("--test_ds", type=str)
# Path to save the model.
parser.add_argument("--model", type=str, required=True)
# Work directory containing the temporary working data. Only used for distributed training.
parser.add_argument("--work_dir", default="", type=str)
# Label column to predict.
parser.add_argument("--label", type=str, required=True)
# Is the training distributed, or on a single machine?
parser.add_argument("--distributed", action="store_true")
args = parser.parse_args()


def main():
  print("Arguments:\n", args)

  if args.distributed:
    main_distributed()
  else:
    main_in_process()


def main_in_process():
  ydf.verbose(2)

  print("Train model in process on", args.train_ds)
  learner = ydf.GradientBoostedTreesLearner(label=args.label)
  model = learner.train(args.train_ds, valid=args.valid_ds)

  print("Save model in", args.model)
  model.save(args.model)

  if args.test_ds:
    print("Evaluate model on", args.test_ds)
    evaluation = model.evaluate(args.test_ds)
    print(evaluation)


def main_distributed():
  ydf.verbose(2)

  # Gather the manager and workers configuration.
  cluster_config = ydf.util.get_vertex_ai_cluster_spec()
  print("cluster_config:\n", cluster_config)

  if cluster_config.is_worker:
    # This machine is running a worker.
    ydf.start_worker(cluster_config.port)
    return

  print("Train model with distribution on", args.train_ds)
  learner = ydf.DistributedGradientBoostedTreesLearner(
      label=args.label,
      workers=cluster_config.workers,
      working_dir=args.work_dir,
      resume_training=True,
  )
  model = learner.train(args.train_ds, valid=args.valid_ds)

  print("Save model in", args.model)
  model.save(args.model)

  if args.test_ds:
    print("Evaluate model on", args.test_ds)
    evaluation = model.evaluate(args.test_ds)
    print(evaluation)


if __name__ == "__main__":
  main()
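In the script above, `ydf.util.get_vertex_ai_cluster_spec()` hides how each machine discovers whether it is the manager or a worker. For intuition only, the sketch below parses a `CLUSTER_SPEC` JSON value of the kind Vertex AI exposes as an environment variable on each replica of a multi-pool job. This is not YDF's implementation: the exact JSON layout, the sample host names, and the helper name `parse_cluster_spec` are assumptions made for illustration.

```python
import json

# Hypothetical CLUSTER_SPEC value. Real host names and ports are chosen by
# Vertex AI and will differ.
EXAMPLE_CLUSTER_SPEC = json.dumps({
    "cluster": {
        "workerpool0": ["manager-0:2222"],
        "workerpool1": ["worker-0:2222", "worker-1:2222"],
    },
    "task": {"type": "workerpool1", "index": 1},
})


def parse_cluster_spec(raw: str) -> dict:
  """Extracts the role, port, and worker addresses of the current replica."""
  spec = json.loads(raw)
  task = spec["task"]
  # Convention used in this tutorial: the first pool holds the manager.
  is_worker = task["type"] != "workerpool0"
  own_address = spec["cluster"][task["type"]][task["index"]]
  return {
      "is_worker": is_worker,
      "port": int(own_address.rsplit(":", 1)[1]),
      "workers": spec["cluster"].get("workerpool1", []),
  }


# In a real job, the value would come from os.environ["CLUSTER_SPEC"].
config = parse_cluster_spec(EXAMPLE_CLUSTER_SPEC)
print(config)
```

In the training script itself, prefer the YDF utility, which returns the `is_worker`, `port`, and `workers` fields used by `main_distributed`.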

Starting a job on Vertex AI takes a few minutes. To iterate quickly, it is a good idea to first run the training script locally on a subset of the data.

The following command runs our trainer locally without distributed training.

**Note:** In YDF, the dataset paths always define the format of the dataset with a prefix (`csv:` in this example). To use another format, change the prefix accordingly. [Here](https://ydf.readthedocs.io/en/latest/dataset_formats/) is the list of supported formats.

In [None]:
%%bash
python train.py --train_ds=csv:gs://ydf_bucket/train_dataset/shard_0.csv \
--valid_ds=csv:gs://ydf_bucket/valid_dataset/shard_0.csv \
--model=gs://ydf_bucket/model \
--label=income
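The format prefix noted above is the part of the path before the first colon, which is why `csv:gs://...` contains two colons. The sketch below separates the two parts and rejects a path that has no prefix. It is only an illustration of the convention, not how YDF parses paths, and the helper name `split_typed_path` is hypothetical.

```python
def split_typed_path(typed_path: str) -> tuple[str, str]:
  """Splits 'csv:gs://bucket/file.csv' into ('csv', 'gs://bucket/file.csv')."""
  prefix, separator, path = typed_path.partition(":")
  # A remainder starting with "//" means the first colon belonged to a URL
  # scheme such as "gs://", so the format prefix is missing.
  if not separator or not prefix or path.startswith("//"):
    raise ValueError(f"Missing format prefix in {typed_path!r}")
  return prefix, path


print(split_typed_path("csv:gs://ydf_bucket/train_dataset/shard_0.csv"))
```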

To run in Vertex AI, the trainer needs to be packaged in a Docker image. Let's create the Dockerfile.

In [55]:
%%writefile Dockerfile
FROM python:3.12
WORKDIR /root

RUN apt-get update && apt-get -y install sudo

RUN rm -rf /usr/share/keyrings/cloud.google.gpg
RUN rm -rf /etc/apt/sources.list.d/google-cloud-sdk.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
RUN echo "deb https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

# Install YDF from Pip
RUN python3 -m pip install ydf
# OR, install your own copy of YDF.
# COPY ydf-0.10.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl .
# RUN python3 -m pip install ydf-0.10.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl --upgrade --no-cache-dir --force-reinstall

RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

COPY train.py /root/train.py

ENTRYPOINT ["python3", "train.py"]

Overwriting Dockerfile


We can now build the Docker image and upload it to Google Cloud.

In [None]:
%%bash
gcloud builds submit --tag gcr.io/custom-oasis-452410-c2/train-ydf

Finally, we can start a custom Vertex AI training job with our Docker image.

A few remarks:

You need to create two worker pools. The first worker pool contains the "manager" and will do very little computation. The second worker pool contains the machines that will train and evaluate the model. When training on a larger dataset, you need to increase the number of machines with the `replica-count` parameter.


In [57]:
%%bash
gcloud ai custom-jobs create \
    --region=us-east1 \
    --project=custom-oasis-452410-c2 \
    --worker-pool-spec=replica-count=1,machine-type='n1-highmem-2',container-image-uri='gcr.io/custom-oasis-452410-c2/train-ydf' \
    --worker-pool-spec=replica-count=5,machine-type='n1-highmem-2',container-image-uri='gcr.io/custom-oasis-452410-c2/train-ydf' \
    --display-name=train-ydf-job \
--args=\
--train_ds=csv:gs://ydf_bucket/train_dataset/shard_*.csv,\
--valid_ds=csv:gs://ydf_bucket/valid_dataset/shard_*.csv,\
--model=gs://ydf_bucket/model,\
--work_dir=gs://ydf_bucket/work_dir,\
--distributed,\
--label=income

Using endpoint [https://us-east1-aiplatform.googleapis.com/]
CustomJob [projects/282665763673/locations/us-east1/customJobs/7048111014285410304] is submitted successfully.

Your job is still active. You may view the status of your job with the command

ud ai custom-jobs describe projects/282665763673/locations/us-east1/customJobs/7048111014285410304

or continue streaming the logs with the command

65763673/locations/us-east1/customJobs/7048111014285410304
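The two `--worker-pool-spec` flags in the command above can also be described as data. The sketch below builds the equivalent two-pool specification as plain Python dictionaries: one manager replica, then `num_workers` worker replicas. The helper name `build_worker_pool_specs` is hypothetical, and the dictionary layout is assumed to follow the `worker_pool_specs` format accepted by the Vertex AI Python SDK.

```python
def build_worker_pool_specs(
    image_uri: str, num_workers: int, machine_type: str, args: list[str]
) -> list[dict]:
  """Builds a manager pool (1 replica) followed by a worker pool."""

  def pool(replica_count: int) -> dict:
    return {
        "machine_spec": {"machine_type": machine_type},
        "replica_count": replica_count,
        "container_spec": {"image_uri": image_uri, "args": list(args)},
    }

  return [pool(1), pool(num_workers)]


specs = build_worker_pool_specs(
    image_uri="gcr.io/custom-oasis-452410-c2/train-ydf",
    num_workers=5,
    machine_type="n1-highmem-2",
    args=[
        "--train_ds=csv:gs://ydf_bucket/train_dataset/shard_*.csv",
        "--valid_ds=csv:gs://ydf_bucket/valid_dataset/shard_*.csv",
        "--model=gs://ydf_bucket/model",
        "--work_dir=gs://ydf_bucket/work_dir",
        "--distributed",
        "--label=income",
    ],
)
# Assumption: with the google-cloud-aiplatform package installed and
# credentials configured, `specs` could be passed as `worker_pool_specs`
# to `aiplatform.CustomJob(display_name="train-ydf-job", ...)`.
print([spec["replica_count"] for spec in specs])  # [1, 5]
```

To train on a larger dataset, only `num_workers` needs to change; the manager pool stays at a single replica.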


You can monitor the training in the [Vertex AI Custom Job console](https://pantheon.corp.google.com/vertex-ai/training/custom-jobs), or in your shell by running the printed command e.g.:

```shell
gcloud ai custom-jobs stream-logs projects/282665763673/locations/us-east1/customJobs/8426212500260782080
```
**Note:** This command does not stop when the training is done. You need to stop it manually.

## Load and test the model

Now that your training is done, you can look at the model:

In [52]:
model = ydf.load_model("gs://ydf_bucket/model")

In [53]:
model.describe()

We can also generate some predictions.

In [35]:
model.predict(test_ds)

array([0.00404093, 0.35932407, 0.8662793 , ..., 0.01358805, 0.04585141,
       0.00885384], shape=(9769,), dtype=float32)
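Each value in this output is a probability, not a class. The sketch below converts the first values shown above into class names with a 0.5 threshold. It assumes that, for binary classification, the predicted value is the probability of the second class returned by `model.label_classes()`, and that the two classes of this dataset are `<=50K` and `>50K`. The helper name `to_labels` is hypothetical.

```python
def to_labels(probabilities, classes, threshold=0.5):
  """Maps positive-class probabilities to class names."""
  negative, positive = classes
  return [positive if p >= threshold else negative for p in probabilities]


# First values of the prediction output above.
sample_predictions = [0.00404093, 0.35932407, 0.8662793]
# Assumed class order; in practice, read it from model.label_classes().
labels = to_labels(sample_predictions, ["<=50K", ">50K"])
print(labels)  # ['<=50K', '<=50K', '>50K']
```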

## Deploying the model

The model can now be deployed. YDF offers several options: C++, FastAPI, TensorFlow Serving, etc.
See the "Deploying a model" section on the left for more details.

To give an example, let's deploy the model with FastAPI:

In [None]:
model.to_docker("/tmp/docker_model")

In [62]:
!ls -l /tmp/docker_model

total 24
-rw-r----- 1 gbm primarygroup  288 Mar  6 14:22 deploy_in_google_cloud.sh
-rw-r----- 1 gbm primarygroup  211 Mar  6 14:22 Dockerfile
-rw-r----- 1 gbm primarygroup 1313 Mar  6 14:22 main.py
drwxr-x--- 2 gbm primarygroup  140 Mar  6 14:22 model
-rw-r----- 1 gbm primarygroup  360 Mar  6 14:22 readme.txt
-rw-r----- 1 gbm primarygroup   26 Mar  6 14:22 requirements.txt
-rw-r----- 1 gbm primarygroup  485 Mar  6 14:22 test_locally.sh


Deploy the model in Google Cloud Run.

The results will be available in the [Google Cloud Run console](https://pantheon.corp.google.com/run).

In [67]:
%%bash
# Enable Google Cloud Run
gcloud services enable run.googleapis.com
# Deploy the model as a service
gcloud run deploy ydf-predict --source /tmp/docker_model --region=us-east1

Deploying from source requires an Artifact Registry Docker repository to store 
built containers. A repository named [cloud-run-source-deploy] in region 
[us-east1] will be created.

Do you want to continue (Y/n)?
Allow unauthenticated invocations to [ydf-predict] (y/N)?  
Building using Dockerfile and deploying container to Cloud Run service [ydf-predict] in project [custom-oasis-452410-c2] region [us-east1]
Building and deploying new service...
Creating Container Repository................................................................................................................done
Uploading sources..................Creating temporary archive of 11 file(s) totalling 228.1 KiB before compression.
Uploading zipfile of [/tmp/docker_model] to [gs://run-sources-custom-oasis-452410-c2-us-east1/services/ydf-predict/1741267781.721195-c0f972bb00c345d2ae780b88d7a4f65d.zip]
......done
Building Container...................................................................................................