# Distributed training on Vertex AI 

In [5]:
from datetime import datetime

from google.cloud import aiplatform

## Configure your environment

Set the following constants to reflect your environment:
- `PROJECT_ID` - Your project ID
- `REGION` - The GCP region for running Vertex Training jobs
- `STAGING_BUCKET` - The GCS bucket name to use for data and artifacts created during training. The bucket should be in the same region as your training workloads. 
- `IMAGE_NAME` - The custom training container image name
- `SERVICE_ACCOUNT_NAME` - The service account name to use with Vertex Training. When using Vertex Training with Vertex TensorBoard, you need to run your jobs using a custom service account. If you don't already have a service account with the required permissions, create one before submitting jobs.
- `TENSORBOARD_DISPLAY_NAME` - The Vertex TensorBoard instance name to use for tracking training experiments. If an instance with this display name exists, it will be used. Otherwise, a new instance will be created.
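
The reuse-or-create behavior described for `TENSORBOARD_DISPLAY_NAME` could be sketched roughly as follows. This is a minimal sketch, not the notebook's own code: the helper name `get_or_create_tensorboard` and its `tensorboard_cls` parameter are hypothetical, and it assumes that `aiplatform.init` has already been called and that the list filter can match on `display_name`.

```python
def get_or_create_tensorboard(display_name, tensorboard_cls=None):
    """Return a TensorBoard instance with this display name, creating it if needed.

    `tensorboard_cls` is a hypothetical hook that makes the helper testable;
    by default the Vertex AI SDK class `aiplatform.Tensorboard` is used.
    """
    if tensorboard_cls is None:
        # Imported lazily so the helper can be exercised without the SDK installed.
        from google.cloud import aiplatform
        tensorboard_cls = aiplatform.Tensorboard

    # Assumption: the list filter supports matching on display_name.
    matches = tensorboard_cls.list(filter=f'display_name="{display_name}"')
    if matches:
        # Reuse the first instance that carries the requested display name.
        return matches[0]
    return tensorboard_cls.create(display_name=display_name)
```

Once the SDK is initialized, calling `get_or_create_tensorboard` with your chosen display name would return the instance to attach to training jobs.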

In [6]:
PROJECT_ID='jk-mlops-dev'
REGION='asia-southeast1'
#REGION='us-central1'
STAGING_BUCKET='gs://jk-asia-southeast1-staging'
#STAGING_BUCKET='gs://jk-vertex-staging-us-central1'
IMAGE_NAME='distributed-training-sandbox'

IMAGE_URI=f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
print(IMAGE_URI)

gcr.io/jk-mlops-dev/distributed-training-sandbox


### Initialize Vertex SDK

In [7]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

## Build a custom training container


Vertex AI Training supports running training jobs in [custom training containers](https://cloud.google.com/vertex-ai/docs/training/containers-overview). The custom training container image used in this sample is built on the Vertex AI prebuilt PyTorch GPU training image and packages a `hello-world.py` script.

In [8]:
%%writefile Dockerfile
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest

WORKDIR /scripts
ADD hello-world.py ./

ENTRYPOINT ["python", "hello-world.py"]

Overwriting Dockerfile
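
The Dockerfile copies `hello-world.py` into the image, but the script itself is not shown in this notebook. A minimal sketch of what such a script might contain is given below. It assumes that Vertex AI exposes the cluster layout to each replica through the `CLUSTER_SPEC` environment variable as JSON with `cluster` and `task` entries; the function name `describe_replica` is hypothetical.

```python
import json
import os
import socket


def describe_replica(environ=None):
    """Summarize this replica's role from CLUSTER_SPEC, if the variable is set."""
    environ = os.environ if environ is None else environ
    raw = environ.get('CLUSTER_SPEC')
    if not raw:
        # Outside a Vertex AI job (for example, a local run) the variable is unset.
        return {'task_type': None, 'task_index': None, 'cluster': {}}
    spec = json.loads(raw)
    task = spec.get('task', {})
    return {
        'task_type': task.get('type'),    # e.g. the worker pool this replica belongs to
        'task_index': task.get('index'),  # position of the replica within its pool
        'cluster': spec.get('cluster', {}),
    }


if __name__ == '__main__':
    info = describe_replica()
    print(f"Hello from {socket.gethostname()}: "
          f"task_type={info['task_type']} task_index={info['task_index']}")
    for pool, hosts in info['cluster'].items():
        print(f'  {pool}: {hosts}')
```

Printing the role of each replica is a quick way to confirm in the job logs that every worker pool started the expected number of containers.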


You can build the image using locally installed `docker` or using **Cloud Build**.

### Building the image locally

In [16]:
! docker build -t {IMAGE_URI} .
! docker push {IMAGE_URI} 

Sending build context to Docker daemon  101.4kB
Step 1/4 : FROM us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13.py310:latest
 ---> 78c144ecd81c
Step 2/4 : WORKDIR /scripts
 ---> Using cache
 ---> e1c92ecd8f17
Step 3/4 : ADD hello-world.py ./
 ---> 39e98c895c64
Step 4/4 : ENTRYPOINT ["python", "hello-world.py"]
 ---> Running in f0dfd31055ed
Removing intermediate container f0dfd31055ed
 ---> 42d071359896
Successfully built 42d071359896
Successfully tagged gcr.io/jk-mlops-dev/distributed-training-sandbox:latest
Using default tag: latest
The push refers to repository [gcr.io/jk-mlops-dev/distributed-training-sandbox]

### Building the image using **Cloud Build**

In [19]:

! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} . --machine-type=e2-highcpu-8

Creating temporary tarball archive of 3 file(s) totalling 36.5 KiB before compression.
Uploading tarball of [.] to [gs://jk-mlops-dev_cloudbuild/source/1696973540.650832-4cb4a5c21a6c4a34b5b3a09a95a4994e.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/jk-mlops-dev/locations/global/builds/7fc7a526-87fa-46b2-ac5c-88543ac9a28d].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds/7fc7a526-87fa-46b2-ac5c-88543ac9a28d?project=895222332033 ].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "7fc7a526-87fa-46b2-ac5c-88543ac9a28d"

FETCHSOURCE
Fetching storage object: gs://jk-mlops-dev_cloudbuild/source/1696973540.650832-4cb4a5c21a6c4a34b5b3a09a95a4994e.tgz#1696973540916671
Copying gs://jk-mlops-dev_cloudbuild/source/1696973540.650832-4cb4a5c21a6c4a34b5b3a09a95a4994e.tgz#1696973540916671...
/ [1 files][  4.2 KiB/  4.2 KiB]                                                
Operat

### Configure a custom job


#### Configure `workerPoolSpecs` 

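A worker pool spec describes one group of identically configured replicas: the machine type, the number of replicas, and the container to run. Together, the list of specs defines the compute cluster for the job. The first pool holds the single primary replica, and the second pool holds the additional workers (two in this example).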

In [17]:
worker_pool_spec = [
    # Pool 0: the primary replica (a single replica).
    {
        'machine_spec': {
            'machine_type': 'n1-standard-8'
        },
        'replica_count': 1,
        'container_spec': {
            'image_uri': IMAGE_URI
        }
    },
    # Pool 1: two additional worker replicas running the same image.
    {
        'machine_spec': {
            'machine_type': 'n1-standard-8'
        },
        'replica_count': 2,
        'container_spec': {
            'image_uri': IMAGE_URI
        }
    },
]

print(worker_pool_spec)

[{'machine_spec': {'machine_type': 'n1-standard-8'}, 'replica_count': 1, 'container_spec': {'image_uri': 'gcr.io/jk-mlops-dev/distributed-training-sandbox'}}, {'machine_spec': {'machine_type': 'n1-standard-8'}, 'replica_count': 2, 'container_spec': {'image_uri': 'gcr.io/jk-mlops-dev/distributed-training-sandbox'}}]
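
Before submitting a job it can help to sanity-check the pool layout. The helper below is a small sketch with a hypothetical name, `summarize_worker_pools`; it assumes the first pool is the primary pool and should contain exactly one replica, and it reports how many replicas the job will start in total.

```python
def summarize_worker_pools(worker_pool_specs):
    """Check the primary pool and return replica counts per pool and in total."""
    if not worker_pool_specs:
        raise ValueError('At least one worker pool is required.')
    # Assumption: the first pool is the primary pool and holds one replica.
    if worker_pool_specs[0]['replica_count'] != 1:
        raise ValueError('The first (primary) worker pool must have exactly one replica.')
    counts = [pool['replica_count'] for pool in worker_pool_specs]
    return {'per_pool': counts, 'total': sum(counts)}
```

For the two-pool layout used in this section, the summary would report one primary replica plus two workers, three replicas in total.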


#### Create a CustomJob object

A `CustomJob` object combines the worker pool specs with the remaining job settings, such as the display name and the output directory in the staging bucket.

In [18]:
JOB_ID = f'distributed-hello-world-{datetime.now().strftime("%Y%m%d%H%M")}'
print(JOB_ID)

job = aiplatform.CustomJob(
    display_name=JOB_ID,
    worker_pool_specs=worker_pool_spec
)

print(job.job_spec)

distributed-hello-world-202310181701
worker_pool_specs {
  machine_spec {
    machine_type: "n1-standard-8"
  }
  replica_count: 1
  container_spec {
    image_uri: "gcr.io/jk-mlops-dev/distributed-training-sandbox"
  }
}
worker_pool_specs {
  machine_spec {
    machine_type: "n1-standard-8"
  }
  replica_count: 2
  container_spec {
    image_uri: "gcr.io/jk-mlops-dev/distributed-training-sandbox"
  }
}
base_output_directory {
  output_uri_prefix: "gs://jk-asia-southeast1-staging/aiplatform-custom-job-2023-10-18-17:01:29.729"
}



#### Run the job


In [19]:
job.run(sync=False)

Creating CustomJob


CustomJob created. Resource name: projects/895222332033/locations/asia-southeast1/customJobs/2282665853855989760
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/895222332033/locations/asia-southeast1/customJobs/2282665853855989760')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/asia-southeast1/training/2282665853855989760?project=895222332033
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/2282665853855989760 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/2282665853855989760 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/2282665853855989760 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/2282665853855989760 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/custo
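
Because the job is submitted with `sync=False`, you may want to check on it later, for example from another session. The polling loop below is a sketch under stated assumptions: the helper `wait_for_terminal_state` and the `TERMINAL_STATES` set are hypothetical, the state is passed in as a string such as `JOB_STATE_PENDING`, and the set of states treated as terminal is a choice made for this example.

```python
import time

# Assumption: these job states are treated as final for the purposes of this sketch.
TERMINAL_STATES = {
    'JOB_STATE_SUCCEEDED',
    'JOB_STATE_FAILED',
    'JOB_STATE_CANCELLED',
    'JOB_STATE_EXPIRED',
}


def wait_for_terminal_state(get_state, poll_seconds=30.0,
                            timeout_seconds=3600.0, sleep=time.sleep):
    """Call `get_state` repeatedly until it returns a terminal state name."""
    waited = 0.0
    while True:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f'Job still in {state} after {waited:.0f}s')
        sleep(poll_seconds)
        waited += poll_seconds
```

With the SDK, `get_state` could be something like `lambda: aiplatform.CustomJob.get(resource_name).state.name`, where `resource_name` is the resource name printed when the job was created.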

## Create and submit a `CustomContainerTrainingJob`

In [26]:
JOB_ID = f'distributed-hello-world-{datetime.now().strftime("%Y%m%d%H%M")}'
print(JOB_ID)

job = aiplatform.CustomContainerTrainingJob(
    display_name=JOB_ID,
    container_uri=IMAGE_URI,
    command=['python', 'hello-world.py']
)


distributed-hello-world-202310181717


In [None]:
job.run(
    machine_type='n1-standard-8',
    replica_count=4,
    sync=False
)
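
With `CustomContainerTrainingJob`, the cluster is described by a single `replica_count` instead of explicit worker pool specs. The sketch below illustrates how such a count could map onto the two-pool layout used earlier, assuming one replica becomes the primary and the rest become workers. The helper name `expand_replica_count` is hypothetical, the container command is omitted for brevity, and the SDK's actual translation may differ in detail.

```python
def expand_replica_count(image_uri, machine_type, replica_count):
    """Expand a total replica count into primary and worker pool specs."""
    if replica_count < 1:
        raise ValueError('replica_count must be at least 1.')

    def pool(count):
        return {
            'machine_spec': {'machine_type': machine_type},
            'replica_count': count,
            'container_spec': {'image_uri': image_uri},
        }

    # Assumption: one replica acts as the primary, the remainder as workers.
    pools = [pool(1)]
    if replica_count > 1:
        pools.append(pool(replica_count - 1))
    return pools
```

Under this assumption, `replica_count=4` corresponds to one primary replica and three workers, all on `n1-standard-8` machines.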