# Running MosaicML training workloads on Vertex AI Training

This notebook demonstrates how to configure and run Vertex Training Custom Jobs that encapsulate **MosaicML LLM Foundry** workloads.

In [6]:
from datetime import datetime

from google.cloud import aiplatform

## Configure your environment

Set the following constants to reflect your environment:
- `PROJECT_ID` - Your project ID
- `REGION` - The GCP region for running Vertex Training jobs
- `STAGING_BUCKET` - The GCS bucket name to use for data and artifacts created during training. The bucket should be in the same region as your training workloads. 
- `IMAGE_NAME` - The custom training container image name
- `SERVICE_ACCOUNT_NAME` - The service account name to use with Vertex Training. When using Vertex Training with Vertex AI TensorBoard, you need to run your jobs using a custom service account. If you don't already have a service account with the required permissions, follow the steps in the optional *Create and configure a service account* section below.
- `TENSORBOARD_DISPLAY_NAME` - The Vertex TensorBoard instance name to use for tracking training experiments. If an instance with this display name exists it will be used. Otherwise a new instance will be created. 
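
Before running the rest of the notebook, it can help to check the constants for common mistakes. The following is a minimal sketch, not part of the Vertex SDK: `derive_names` is a hypothetical helper that checks the bucket URI and derives the service account email and image URI with the same patterns the next cell uses. The project, bucket, and account names in the usage lines are placeholders.

```python
def derive_names(project_id, staging_bucket, image_name, service_account_name):
    """Check the staging bucket URI and derive the names used later in the notebook."""
    if not staging_bucket.startswith('gs://'):
        raise ValueError(f'STAGING_BUCKET must start with gs://, got {staging_bucket!r}')
    if staging_bucket.endswith('/'):
        raise ValueError('STAGING_BUCKET should not end with a trailing slash')
    return {
        'sa_email': f'{service_account_name}@{project_id}.iam.gserviceaccount.com',
        'image_uri': f'gcr.io/{project_id}/{image_name}',
    }


# Placeholder values, for illustration only.
names = derive_names('my-project', 'gs://my-staging-bucket', 'mosaicml-sandbox', 'vertex-sa')
print(names['sa_email'])
print(names['image_uri'])
```

Failing early on a malformed bucket URI is cheaper than discovering it after a job has been queued.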

In [131]:
PROJECT_ID='jk-mlops-dev'
REGION='asia-southeast1'
#REGION='us-central1'
STAGING_BUCKET='gs://jk-asia-southeast1-staging'
#STAGING_BUCKET='gs://jk-vertex-staging-us-central1'
IMAGE_NAME='mosaicml-sandbox'
SERVICE_ACCOUNT_NAME='vertex-sa-101'
TENSORBOARD_DISPLAY_NAME='mosaicml-experiments'

SA_EMAIL=f'{SERVICE_ACCOUNT_NAME}@{PROJECT_ID}.iam.gserviceaccount.com'
IMAGE_URI=f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
print(SA_EMAIL)
print(IMAGE_URI)

vertex-sa-101@jk-mlops-dev.iam.gserviceaccount.com
gcr.io/jk-mlops-dev/mosaicml-sandbox




### Initialize Vertex SDK

In [132]:
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)



### Set or create TensorBoard resource name

In [133]:
tensorboards = aiplatform.Tensorboard.list(filter=f'displayName="{TENSORBOARD_DISPLAY_NAME}"')

if not tensorboards:
    tensorboard = aiplatform.Tensorboard.create(
        display_name=TENSORBOARD_DISPLAY_NAME,
        project=PROJECT_ID,
        location=REGION,
    )
else:
    tensorboard = tensorboards[0]

print(tensorboard.display_name)
print(tensorboard.resource_name)

TENSORBOARD_RESOURCE_NAME = tensorboard.resource_name 

mosaicml-experiments
projects/895222332033/locations/asia-southeast1/tensorboards/8004585387697635328
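
The resource name printed above encodes the project number and the region. As a minimal sketch, the hypothetical helper `parse_resource_name` below splits such a name into its parts, so you can confirm that the TensorBoard instance lives in the region you set in `REGION`. We assume here that the instance and the training job are expected to be in the same region.

```python
def parse_resource_name(resource_name):
    """Split 'projects/P/locations/L/<kind>/ID' into a dict keyed by segment name."""
    parts = resource_name.split('/')
    if len(parts) % 2 != 0:
        raise ValueError(f'Unexpected resource name: {resource_name!r}')
    # Even positions hold the segment names, odd positions hold the values.
    return dict(zip(parts[0::2], parts[1::2]))


tb = parse_resource_name(
    'projects/895222332033/locations/asia-southeast1/tensorboards/8004585387697635328'
)
print(tb['locations'])
```

In the notebook you could compare `parse_resource_name(TENSORBOARD_RESOURCE_NAME)['locations']` with `REGION` before submitting a job.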




### (Optional) Create and configure a service account

#### Create a service account

In [11]:
! gcloud  iam service-accounts create {SERVICE_ACCOUNT_NAME} --project={PROJECT_ID}

Created service account [vertex-sa-101].


#### Grant the roles to the service account

In [13]:
! gcloud projects add-iam-policy-binding {PROJECT_ID} \
   --member="serviceAccount:{SA_EMAIL}" \
   --role="roles/storage.admin"

! gcloud projects add-iam-policy-binding {PROJECT_ID} \
   --member="serviceAccount:{SA_EMAIL}" \
   --role="roles/aiplatform.user"

vertex-sa-101@jk-mlops-dev.iam.gserviceaccount.com
Updated IAM policy for project [jk-mlops-dev].
bindings:
- members:
  - serviceAccount:cloud-tpu-sa@jk-mlops-dev.iam.gserviceaccount.com
  - serviceAccount:gke-sa@jk-mlops-dev.iam.gserviceaccount.com
  - serviceAccount:sa-admin@jk-mlops-dev.iam.gserviceaccount.com
  - serviceAccount:service-895222332033@gcp-sa-aiplatform-cc.iam.gserviceaccount.com
  - serviceAccount:tpu-sa@jk-mlops-dev.iam.gserviceaccount.com
  - user:renatoleite@google.com
  - user:rthallam@google.com
  role: roles/aiplatform.admin
- members:
  - serviceAccount:service-895222332033@gcp-sa-aiplatform-cc.iam.gserviceaccount.com
  role: roles/aiplatform.customCodeServiceAgent
- members:
  - serviceAccount:service-895222332033@gcp-sa-aiplatform-vm.iam.gserviceaccount.com
  role: roles/aiplatform.notebookServiceAgent
- members:
  - serviceAccount:service-895222332033@gcp-sa-aiplatform.iam.gserviceaccount.com
  role: roles/aiplatform.serviceAgent
- members:
  - serviceAccou
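
The two bindings above differ only in the role. As a sketch, the hypothetical helper `iam_binding_commands` builds one `gcloud projects add-iam-policy-binding` argument list per role, using only the flags shown in the cell above. The project and account names in the usage lines are placeholders.

```python
def iam_binding_commands(project_id, sa_email, roles):
    """Build one gcloud argument list per role for the given service account."""
    return [
        [
            'gcloud', 'projects', 'add-iam-policy-binding', project_id,
            f'--member=serviceAccount:{sa_email}',
            f'--role={role}',
        ]
        for role in roles
    ]


commands = iam_binding_commands(
    'my-project',
    'vertex-sa@my-project.iam.gserviceaccount.com',
    ['roles/storage.admin', 'roles/aiplatform.user'],
)
for command in commands:
    print(' '.join(command))
```

Each list can be passed to `subprocess.run(command, check=True)` if you prefer to grant the roles from Python rather than from a shell cell.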

### (Optional) Create a bucket

If you need to create a bucket, execute the following cell.

In [14]:
!gsutil mb -l {REGION} {STAGING_BUCKET}

Creating gs://jk-asia-southeast1-staging/...
ServiceException: 409 A Cloud Storage bucket named 'jk-asia-southeast1-staging' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


## Build a custom training container


Vertex AI Training supports running training jobs using [custom training containers](https://cloud.google.com/vertex-ai/docs/training/containers-overview). The custom training container image used in this sample packages **MosaicML LLM Foundry** and the required dependencies.

In [3]:
%%writefile Dockerfile
# Copyright 2022 MosaicML LLM Foundry authors
# SPDX-License-Identifier: Apache-2.0


FROM mosaicml/llm-foundry:2.0.1_cu118-latest


# Clone and install foundry with the GPU and TensorBoard extras
RUN git clone -b main https://github.com/mosaicml/llm-foundry.git && \
    cd llm-foundry && \
    pip install -e ".[gpu,tensorboard]" 

WORKDIR /llm-foundry

ENTRYPOINT ["ls", "-la"]

Overwriting Dockerfile


You can build the image using locally installed `docker` or using **Cloud Build**.

### Building the image locally

In [4]:
! docker build -t {IMAGE_URI} .
! docker push {IMAGE_URI} 

Sending build context to Docker daemon  90.62kB
Step 1/4 : FROM mosaicml/llm-foundry:2.0.1_cu118-latest
 ---> 523dedab9f7a
Step 2/4 : RUN git clone -b main https://github.com/mosaicml/llm-foundry.git &&     cd llm-foundry &&     pip install -e ".[gpu,tensorboard]"
 ---> Using cache
 ---> 9e4ef2c615cb
Step 3/4 : WORKDIR /llm-foundry
 ---> Using cache
 ---> 92d5584cd9c3
Step 4/4 : ENTRYPOINT ["ls", "-la"]
 ---> Using cache
 ---> 36f7a7086123
Successfully built 36f7a7086123
Successfully tagged gcr.io/jk-mlops-dev/mosaicml-sandbox:latest
Using default tag: latest
The push refers to repository [gcr.io/jk-mlops-dev/mosaicml-sandbox]

0173e5bb: Preparing 
870a4b2a: Preparing 
862421c8: Preparing 
482e6f4f: Preparing 
25ead38d: Preparing 
9284649a: Preparing 
632f11c0: Preparing 
fc23904b: Preparing 
e99f8e07: Preparing 
acde97fc: Preparing 
22bf1dd8: Preparing 
6e4d252f: Preparing 
d48a239a: Preparing 
37fe7152: Preparing 
bf18a086: 

### Building the image using **Cloud Build**

In [19]:
IMAGE_URI=f'gcr.io/{PROJECT_ID}/{IMAGE_NAME}'
print(IMAGE_URI)

! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} . --machine-type=e2-highcpu-8

gcr.io/jk-mlops-dev/mosaicml-sandbox
Creating temporary tarball archive of 3 file(s) totalling 36.5 KiB before compression.
Uploading tarball of [.] to [gs://jk-mlops-dev_cloudbuild/source/1696973540.650832-4cb4a5c21a6c4a34b5b3a09a95a4994e.tgz]
Created [https://cloudbuild.googleapis.com/v1/projects/jk-mlops-dev/locations/global/builds/7fc7a526-87fa-46b2-ac5c-88543ac9a28d].
Logs are available at [ https://console.cloud.google.com/cloud-build/builds/7fc7a526-87fa-46b2-ac5c-88543ac9a28d?project=895222332033 ].
----------------------------- REMOTE BUILD OUTPUT ------------------------------
starting build "7fc7a526-87fa-46b2-ac5c-88543ac9a28d"

FETCHSOURCE
Fetching storage object: gs://jk-mlops-dev_cloudbuild/source/1696973540.650832-4cb4a5c21a6c4a34b5b3a09a95a4994e.tgz#1696973540916671
Copying gs://jk-mlops-dev_cloudbuild/source/1696973540.650832-4cb4a5c21a6c4a34b5b3a09a95a4994e.tgz#1696973540916671...
/ [1 files][  4.2 KiB/  4.2 KiB]                                                
Operat

## Prepare a training dataset

MosaicML recommends using data in their highly efficient StreamingDataset format. They provide the `convert_dataset_hf.py` script to convert the Hugging Face c4 dataset to the Mosaic StreamingDataset format. You can run this script locally, but we are going to demonstrate how to use Vertex Training as a "simple batch system". In the next few steps you are going to configure and submit a Vertex Training CustomJob that uses the custom training container image created in the previous step.

For details on how to configure and submit custom training jobs, refer to the product documentation. Note that there are many options for creating `CustomJob` objects. In this sample we use the lower-level API, which gives you the most flexibility.


### Configure a custom job

A custom job is configured and managed using the `CustomJob` class. The primary component of a `CustomJob` object is `workerPoolSpecs`, which encapsulates the requested compute cluster configuration and the containerized workload(s) to run on the cluster.

#### Configure `containerSpec` 

A container spec encapsulates the command, arguments, and environment of the container that runs the job.

In [134]:
dataset_gcs_location = f'{STAGING_BUCKET}/datasets/c4'

container_spec = {
    'image_uri': IMAGE_URI,
    'command': ['/composer-python/python'],
    'args': [
        'scripts/data_prep/convert_dataset_hf.py',
        '--dataset=c4',
        '--data_subset=en',
        f'--out_root={dataset_gcs_location}',
        '--splits', 
        'train_small',
        'val_small',
        '--concat_tokens=2048',
        '--tokenizer=EleutherAI/gpt-neox-20b',
        # Args are passed without a shell, so the inner single quotes reach the script literally.
        "--eos_text='<|endoftext|>'"
    ]
}

print(container_spec)

{'image_uri': 'gcr.io/jk-mlops-dev/mosaicml-sandbox', 'command': ['/composer-python/python'], 'args': ['scripts/data_prep/convert_dataset_hf.py', '--dataset=c4', '--data_subset=en', '--out_root=gs://jk-asia-southeast1-staging/datasets/c4', '--splits', 'train_small', 'val_small', '--concat_tokens=2048', '--tokenizer=EleutherAI/gpt-neox-20b', "--eos_text='<|endoftext|>'"]}
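
The `command` and `args` lists are handed to the container as an argument vector, much like Docker's exec form, so no shell interprets them. As a minimal sketch, the hypothetical helper `preview_command` uses the standard library's `shlex.join` to show the shell command line that would be equivalent to a container spec. Quote characters that are part of an argument, such as those in `--eos_text='<|endoftext|>'`, appear escaped in the preview, which is a hint that they reach the script as literal characters.

```python
import shlex


def preview_command(container_spec):
    """Return a shell-quoted command line equivalent to the spec's command and args."""
    argv = list(container_spec.get('command', [])) + list(container_spec.get('args', []))
    return shlex.join(argv)


# A reduced spec, for illustration only.
example_spec = {
    'command': ['/composer-python/python'],
    'args': [
        'scripts/data_prep/convert_dataset_hf.py',
        '--dataset=c4',
        "--eos_text='<|endoftext|>'",
    ],
}
print(preview_command(example_spec))
```

Comparing the preview with the command you would type in a terminal is a quick way to spot quoting differences before submitting the job.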




#### Configure `workerPoolSpecs` 

A worker pool spec encapsulates the configuration of the compute cluster that runs the job.

In [49]:
worker_pool_spec = [
    {
        'machine_spec': {
            'machine_type': 'n1-standard-32'
        },
        'replica_count': 1,
        'container_spec': container_spec
    }
]

print(worker_pool_spec)

[{'machine_spec': {'machine_type': 'n1-standard-32'}, 'replica_count': 1, 'container_spec': {'image_uri': 'gcr.io/jk-mlops-dev/mosaicml-sandbox', 'command': ['/composer-python/python'], 'args': ['scripts/data_prep/convert_dataset_hf.py', '--dataset=c4', '--data_subset=en', '--out_root=gs://jk-asia-southeast1-staging/datasets/c4', '--splits', 'train_small', 'val_small', '--concat_tokens=2048', '--tokenizer=EleutherAI/gpt-neox-20b', "--eos_text='<|endoftext|>'"]}}]
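
The worker pool spec above is a plain list of dictionaries, so it can be produced by a small function. The sketch below is a hypothetical helper, `make_worker_pool_spec`, that returns the same structure and adds accelerator and disk settings only when they are requested. The key names are the ones used in this notebook.

```python
def make_worker_pool_spec(container_spec, machine_type, replica_count=1,
                          accelerator_type=None, accelerator_count=0, disk_spec=None):
    """Build a single-pool worker_pool_specs list in the shape used in this notebook."""
    machine_spec = {'machine_type': machine_type}
    if accelerator_type:
        # Accelerator keys are added only for GPU machine types.
        machine_spec['accelerator_type'] = accelerator_type
        machine_spec['accelerator_count'] = accelerator_count
    spec = {
        'machine_spec': machine_spec,
        'replica_count': replica_count,
        'container_spec': container_spec,
    }
    if disk_spec:
        spec['disk_spec'] = disk_spec
    return [spec]


cpu_pool = make_worker_pool_spec({'image_uri': 'gcr.io/my-project/my-image'}, 'n1-standard-32')
print(cpu_pool)
```

Keeping the CPU and GPU variants behind one function makes it harder for the two specs to drift apart as the notebook evolves.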


#### Create a CustomJob object

A `CustomJob` object combines the worker pool specs with the remaining job settings, such as the display name.

In [50]:
JOB_ID = f'convert-c4-{datetime.now().strftime("%Y%m%d%H%M")}'
print(JOB_ID)

job = aiplatform.CustomJob(
    display_name=JOB_ID,
    worker_pool_specs=worker_pool_spec
)

print(job.job_spec)

convert-c4-202310110005
worker_pool_specs {
  machine_spec {
    machine_type: "n1-standard-32"
  }
  replica_count: 1
  container_spec {
    image_uri: "gcr.io/jk-mlops-dev/mosaicml-sandbox"
    command: "/composer-python/python"
    args: "scripts/data_prep/convert_dataset_hf.py"
    args: "--dataset=c4"
    args: "--data_subset=en"
    args: "--out_root=gs://jk-asia-southeast1-staging/datasets/c4"
    args: "--splits"
    args: "train_small"
    args: "val_small"
    args: "--concat_tokens=2048"
    args: "--tokenizer=EleutherAI/gpt-neox-20b"
    args: "--eos_text=\'<|endoftext|>\'"
  }
}
base_output_directory {
  output_uri_prefix: "gs://jk-asia-southeast1-staging/aiplatform-custom-job-2023-10-11-00:05:31.695"
}



#### Run the job


In [51]:
job.run(sync=True)

Creating CustomJob
CustomJob created. Resource name: projects/895222332033/locations/asia-southeast1/customJobs/8029505268985364480
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/895222332033/locations/asia-southeast1/customJobs/8029505268985364480')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/asia-southeast1/training/8029505268985364480?project=895222332033
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/8029505268985364480 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/8029505268985364480 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/8029505268985364480 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/8029505268985364480 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/as

## Create and submit a training job

### Configure `containerSpec`


Because of the way the MosaicML `train.py` script stores artifacts, we need to do some "innovative" path setting to make sure that the integration with Vertex AI TensorBoard works.

In [142]:
JOB_ID = f'train-mpt-{datetime.now().strftime("%Y%m%d%H%M")}'
print(JOB_ID)

base_output_dir = f'{STAGING_BUCKET}/runs/{JOB_ID}'
print(base_output_dir)

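# Vertex AI Training mounts Cloud Storage buckets under /gcs via Cloud Storage FUSE
base_output_dir_fuse = base_output_dir.replace('gs://', '/gcs/', 1)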
print(base_output_dir_fuse)

tensorboard_logs_dir = f'{base_output_dir_fuse}/logs'
print(tensorboard_logs_dir)

!gsutil ls {dataset_gcs_location}

train-mpt-202310110249
gs://jk-asia-southeast1-staging/runs/train-mpt-202310110249
/gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249
/gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/logs
gs://jk-asia-southeast1-staging/datasets/c4/train_small/
gs://jk-asia-southeast1-staging/datasets/c4/val_small/
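
The path setting above maps a `gs://` URI to the `/gcs/` path under which Vertex AI Training exposes Cloud Storage buckets through Cloud Storage FUSE. As a minimal sketch, the hypothetical helper `gcs_uri_to_fuse_path` makes that mapping explicit and rejects inputs that are not `gs://` URIs.

```python
def gcs_uri_to_fuse_path(gcs_uri):
    """Map 'gs://bucket/path' to the '/gcs/bucket/path' FUSE mount path."""
    prefix = 'gs://'
    if not gcs_uri.startswith(prefix):
        raise ValueError(f'Expected a gs:// URI, got {gcs_uri!r}')
    return '/gcs/' + gcs_uri[len(prefix):]


print(gcs_uri_to_fuse_path('gs://jk-asia-southeast1-staging/runs/train-mpt-202310110249'))
# → /gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249
```

Using the prefix length instead of a hard-coded slice index keeps the intent readable and fails loudly on unexpected input.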


In [147]:
container_spec = {
    'image_uri': IMAGE_URI,
    'command': ['composer'],
    'args': [
        'scripts/train/train.py',
        'scripts/train/yamls/pretrain/mpt-1b.yaml',
        f'run_name={JOB_ID}',
        'data_local=/mds_cache',
        f'data_remote={dataset_gcs_location}',
        'train_loader.dataset.split=train_small',
        'eval_loader.dataset.split=val_small',
        'max_duration=40ba',
        'eval_interval=10ba',
        f'save_folder={base_output_dir_fuse}/checkpoints',
        f'loggers.tensorboard={{log_dir: {tensorboard_logs_dir}}}'
    ]
}

print(container_spec)

{'image_uri': 'gcr.io/jk-mlops-dev/mosaicml-sandbox', 'command': ['composer'], 'args': ['scripts/train/train.py', 'scripts/train/yamls/pretrain/mpt-1b.yaml', 'run_name=train-mpt-202310110249', 'data_local=/mds_cache', 'data_remote=gs://jk-asia-southeast1-staging/datasets/c4', 'train_loader.dataset.split=train_small', 'eval_loader.dataset.split=val_small', 'max_duration=40ba', 'eval_interval=10ba', 'save_folder=/gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/checkpoints', 'loggers.tensorboard={log_dir: /gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/logs}']}
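
Most of the arguments after the YAML path are `key=value` overrides, with dotted keys for nested settings. As a sketch, the hypothetical helper `to_dotlist` flattens a nested dictionary into that dotted form, which can be easier to maintain than hand-written strings. It only emits the dotted form and does not reproduce the inline `{log_dir: ...}` mapping used for `loggers.tensorboard` above.

```python
def to_dotlist(overrides, prefix=''):
    """Flatten a nested dict into a list of 'a.b.c=value' override strings."""
    items = []
    for key, value in overrides.items():
        full_key = f'{prefix}.{key}' if prefix else key
        if isinstance(value, dict):
            # Recurse into nested settings, extending the dotted key.
            items.extend(to_dotlist(value, full_key))
        else:
            items.append(f'{full_key}={value}')
    return items


overrides = {
    'max_duration': '40ba',
    'eval_interval': '10ba',
    'train_loader': {'dataset': {'split': 'train_small'}},
    'eval_loader': {'dataset': {'split': 'val_small'}},
}
print(to_dotlist(overrides))
```

The resulting list can be appended to the script and YAML paths when building the `args` entry of a container spec.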


### Configure `workerPoolSpecs`

In [148]:
worker_pool_spec = [
    {
        'machine_spec': {
            'machine_type': 'a2-highgpu-2g',
            'accelerator_type': 'NVIDIA_TESLA_A100',
            'accelerator_count': 2

        },
        'disk_spec': {
            'boot_disk_type': 'pd-ssd',
            'boot_disk_size_gb': 500
        },
        'replica_count': 1,
        'container_spec': container_spec
    }
]

print(worker_pool_spec)

[{'machine_spec': {'machine_type': 'a2-highgpu-2g', 'accelerator_type': 'NVIDIA_TESLA_A100', 'accelerator_count': 2}, 'disk_spec': {'boot_disk_type': 'pd-ssd', 'boot_disk_size_gb': 500}, 'replica_count': 1, 'container_spec': {'image_uri': 'gcr.io/jk-mlops-dev/mosaicml-sandbox', 'command': ['composer'], 'args': ['scripts/train/train.py', 'scripts/train/yamls/pretrain/mpt-1b.yaml', 'run_name=train-mpt-202310110249', 'data_local=/mds_cache', 'data_remote=gs://jk-asia-southeast1-staging/datasets/c4', 'train_loader.dataset.split=train_small', 'eval_loader.dataset.split=val_small', 'max_duration=40ba', 'eval_interval=10ba', 'save_folder=/gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/checkpoints', 'loggers.tensorboard={log_dir: /gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/logs}']}}]
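
In the machine spec above, the machine type `a2-highgpu-2g` and `accelerator_count: 2` have to agree. The sketch below assumes the A2 naming convention in which the trailing `Ng` is the number of attached GPUs; `a2_gpu_count` and `check_accelerators` are hypothetical helpers that catch a mismatch before the job is submitted. Other machine families are not checked.

```python
import re


def a2_gpu_count(machine_type):
    """Return the GPU count encoded in an A2 machine type name, or None if not A2."""
    match = re.fullmatch(r'a2-(?:highgpu|megagpu|ultragpu)-(\d+)g', machine_type)
    return int(match.group(1)) if match else None


def check_accelerators(machine_spec):
    """Raise ValueError if accelerator_count disagrees with an A2 machine type."""
    expected = a2_gpu_count(machine_spec['machine_type'])
    actual = machine_spec.get('accelerator_count')
    if expected is not None and actual != expected:
        raise ValueError(
            f"{machine_spec['machine_type']} implies {expected} GPUs, got {actual}"
        )
    return True


print(check_accelerators({
    'machine_type': 'a2-highgpu-2g',
    'accelerator_type': 'NVIDIA_TESLA_A100',
    'accelerator_count': 2,
}))
```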


### Create `CustomJob`

In [149]:
job = aiplatform.CustomJob(
    display_name=JOB_ID,
    worker_pool_specs=worker_pool_spec,
    base_output_dir=base_output_dir
)

print(job.job_spec)

worker_pool_specs {
  machine_spec {
    machine_type: "a2-highgpu-2g"
    accelerator_type: NVIDIA_TESLA_A100
    accelerator_count: 2
  }
  replica_count: 1
  disk_spec {
    boot_disk_type: "pd-ssd"
    boot_disk_size_gb: 500
  }
  container_spec {
    image_uri: "gcr.io/jk-mlops-dev/mosaicml-sandbox"
    command: "composer"
    args: "scripts/train/train.py"
    args: "scripts/train/yamls/pretrain/mpt-1b.yaml"
    args: "run_name=train-mpt-202310110249"
    args: "data_local=/mds_cache"
    args: "data_remote=gs://jk-asia-southeast1-staging/datasets/c4"
    args: "train_loader.dataset.split=train_small"
    args: "eval_loader.dataset.split=val_small"
    args: "max_duration=40ba"
    args: "eval_interval=10ba"
    args: "save_folder=/gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/checkpoints"
    args: "loggers.tensorboard={log_dir: /gcs/jk-asia-southeast1-staging/runs/train-mpt-202310110249/logs}"
  }
}
base_output_directory {
  output_uri_prefix: "gs://jk-asia-southea

### Run the job

In [150]:
job.run(
    sync=False,
    service_account=SA_EMAIL,
    tensorboard=TENSORBOARD_RESOURCE_NAME
)

Creating CustomJob


CustomJob created. Resource name: projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720
To use this CustomJob in another session:
custom_job = aiplatform.CustomJob.get('projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720')
View Custom Job:
https://console.cloud.google.com/ai/platform/locations/asia-southeast1/training/6070439431079198720?project=895222332033
View Tensorboard:
https://asia-southeast1.tensorboard.googleusercontent.com/experiment/projects+895222332033+locations+asia-southeast1+tensorboards+8004585387697635328+experiments+6070439431079198720
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720 current state:
JobState.JOB_STATE_PENDING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720 current state:
JobState.JOB_STATE_PE

### Monitor the job

In [151]:
job.state

<JobState.JOB_STATE_RUNNING: 3>

CustomJob projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720 current state:
JobState.JOB_STATE_RUNNING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720 current state:
JobState.JOB_STATE_RUNNING
CustomJob projects/895222332033/locations/asia-southeast1/customJobs/6070439431079198720 current state:
JobState.JOB_STATE_RUNNING
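
Because the job was started with `sync=False`, you may want to block until it finishes at some later point. The following is a minimal sketch of a polling loop. `wait_for_job` is a hypothetical helper that takes any callable returning a state name, and the set of terminal states is an assumption that covers the common outcomes rather than every possible state. The usage lines simulate a job with a fixed sequence of states so that the example runs without a Vertex AI project.

```python
import time

# Assumed terminal states; extend this set if your jobs can end in other states.
TERMINAL_STATES = {'JOB_STATE_SUCCEEDED', 'JOB_STATE_FAILED', 'JOB_STATE_CANCELLED'}


def wait_for_job(get_state_name, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_state_name() until it returns a terminal state or the timeout elapses."""
    waited = 0.0
    while True:
        state = get_state_name()
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f'Job still in {state} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated state sequence, for illustration only.
states = iter(['JOB_STATE_PENDING', 'JOB_STATE_RUNNING', 'JOB_STATE_SUCCEEDED'])
final_state = wait_for_job(lambda: next(states), poll_seconds=0, sleep=lambda seconds: None)
print(final_state)
# → JOB_STATE_SUCCEEDED
```

With the job created above, a call such as `wait_for_job(lambda: job.state.name)` should work, assuming `job.state` returns an enum with a `name` attribute as the output above suggests.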
