# Training LLMs with DeepSpeed on Vertex AI

This notebook demonstrates how to fine-tune large language models using DeepSpeed on Google Cloud's Vertex AI platform.

## Setup

First, we'll install the required packages:

In [82]:
!pip install --upgrade --quiet google-cloud-aiplatform
!pip install --quiet datasets
!pip install --quiet py7zr
!pip install --quiet pandas
!pip install --quiet python-dotenv

## Configuration

Set up key variables for the training job:

In [None]:
import os
from dotenv import load_dotenv

# Load the base .env file first
load_dotenv(dotenv_path=".env")

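# Load the .env.local file, overriding values from .env
# (override=True is required; by default load_dotenv keeps values that are already set)
load_dotenv(dotenv_path=".env.local", override=True)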

HF_TOKEN = os.getenv("HF_TOKEN")
PROJECT_ID = os.getenv("PROJECT_ID")

In [83]:
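# Prefer PROJECT_ID from the environment; fall back to the default project
PROJECT_ID = os.getenv("PROJECT_ID") or "heikohotz-genai-sa"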
REGION = "us-central1"
BUCKET_NAME = f"hf-deepspeed-training-{PROJECT_ID}"
BUCKET_URI = f"gs://{BUCKET_NAME}"
JOB_NAME = "hf-deepspeed-training-job"
DATASET_NAME = "timdettmers/openassistant-guanaco"
DATASET_FILE = f"data/{DATASET_NAME.split('/')[-1]}/train.jsonl"
DATASET_PATH = f"/gcs/{BUCKET_NAME}/{DATASET_FILE}"
MODEL_NAME = "meta-llama/Llama-3.1-8B"
MODEL_OUTPUT_URI = f"/gcs/{BUCKET_NAME}/{JOB_NAME}/{MODEL_NAME.split('/')[-1]}"
TRAINING_CONFIG_PATH = f"{JOB_NAME}/{MODEL_NAME.split('/')[-1]}/config/training_config.yaml"
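
The `/gcs/...` paths above rely on Vertex AI custom training exposing Cloud Storage buckets under `/gcs/<bucket>` through Cloud Storage FUSE. As a minimal sketch of that mapping, the helper below converts a `gs://` URI into the mounted path. `gcs_uri_to_fuse_path` is a hypothetical name introduced here for illustration, and the bucket in the example is a placeholder.

```python
def gcs_uri_to_fuse_path(uri: str) -> str:
    """Map gs://bucket/key to /gcs/bucket/key (hypothetical helper)."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a gs:// URI: {uri!r}")
    return "/gcs/" + uri[len(prefix):]


# Placeholder bucket name, used only to illustrate the mapping.
example_uri = "gs://my-bucket/data/openassistant-guanaco/train.jsonl"
example_path = gcs_uri_to_fuse_path(example_uri)
print(example_path)
```

The notebook builds `DATASET_PATH` and `MODEL_OUTPUT_URI` directly with f-strings, which yields the same form of path.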

## Initialize Vertex AI SDK

Initialize the Vertex AI SDK for Python with the project, region, and staging bucket:

In [84]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI, location=REGION)

## Load Hugging Face Token

Load the Hugging Face API token from the `.env` file. The token is needed to download the gated Llama model:

In [85]:
from dotenv import load_dotenv
import os

load_dotenv()
HF_TOKEN = os.getenv("HF_TOKEN")
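
`os.getenv` returns `None` when a variable is missing, so a typo in `.env` would only surface later, when the job is submitted or when the container tries to download the model. One way to fail early is a small guard such as the sketch below. `require_env` is a hypothetical helper, not part of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to .env")
    return value
```

With this in place, `HF_TOKEN = require_env("HF_TOKEN")` could replace the plain `os.getenv` call.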

## Create Cloud Storage Bucket

Create a GCS bucket to store the training data and model artifacts:

In [None]:
!gcloud storage buckets create $BUCKET_URI \
    --project $PROJECT_ID \
    --location=$REGION \
    --default-storage-class=STANDARD \
    --uniform-bucket-level-access

## Prepare Training Data

Download the dataset from Hugging Face and upload it to GCS:

In [None]:
from datasets import load_dataset
from google.cloud import storage

# Load the dataset
dataset = load_dataset(DATASET_NAME)

# Initialize a GCS client
client = storage.Client()
bucket = client.bucket(BUCKET_NAME)

# Save only the training split to GCS
dataset['train'].to_json('train.jsonl', orient='records', lines=True)
blob = bucket.blob(DATASET_FILE)
blob.upload_from_filename('train.jsonl')
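
Before uploading, it can be worth checking that the exported file is valid JSON Lines and that each record carries the `text` field the training arguments refer to. The sketch below parses the first few records only. `check_jsonl` is a hypothetical helper, and the default field name `text` is taken from the `--dataset_text_field` argument used later in this notebook.

```python
import json


def check_jsonl(path: str, field: str = "text", max_records: int = 5) -> int:
    """Parse up to max_records lines and verify each has a non-empty `field`.

    Returns the number of records checked.
    """
    checked = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if checked >= max_records:
                break
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if not record.get(field):
                raise ValueError(f"Record {checked} has no {field!r} field")
            checked += 1
    return checked
```

Calling `check_jsonl('train.jsonl')` after `to_json` raises on the first malformed or incomplete record among those inspected.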

## Upload Training Configuration

Upload the training configuration file, `config/training_config.yaml`, to GCS:

In [89]:
# Upload the training config, reusing the bucket object from the previous step
blob = bucket.blob(TRAINING_CONFIG_PATH)
blob.upload_from_filename('config/training_config.yaml')

## Build Training Container

Build the custom training container with Cloud Build and push it to Google Container Registry (`gcr.io`):

In [None]:
IMAGE_URI = f"gcr.io/{PROJECT_ID}/training-containers/hf-training:latest"
!gcloud builds submit --tag {IMAGE_URI} . --project {PROJECT_ID}
CONTAINER_URI = IMAGE_URI

## Configure Training Job

Set up the Vertex AI custom training job:

In [91]:
job = aiplatform.CustomContainerTrainingJob(
    display_name=JOB_NAME,
    container_uri=CONTAINER_URI,
)

In [92]:
# Command-line arguments for the training script, written as separate
# flag and value tokens. Parsers such as argparse typically also accept
# the equivalent "--flag=value" form.

args = [
    "--model_name_or_path",
    MODEL_NAME,
    "--num_train_epochs",
    "2",
    "--per_device_train_batch_size",
    "2",
    "--gradient_accumulation_steps",
    "4",
    "--gradient_checkpointing",
    "--gradient_checkpointing_use_reentrant",
    "--learning_rate",
    "2.0e-5",
    "--lr_scheduler_type",
    "cosine",
    "--optim",
    "adamw_bnb_8bit",
    "--bf16",
    "--logging_steps",
    "10",
    "--save_strategy",
    "epoch",
    "--seed",
    "42",
    "--log_level",
    "info",
    "--dataset_text_field",
    "text",
    "--max_seq_length",
    "2048",
    "--packing",
    "false"
]
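
Writing the flat list by hand makes it easy to drop a value or misalign a flag. As a rough sketch, the list could instead be generated from a dictionary. `build_cli_args` is a hypothetical helper. It treats Python `True` as a bare flag and skips `False` and `None`, so a literal value such as `"false"` for `--packing` has to be passed as a string.

```python
def build_cli_args(options: dict) -> list:
    """Flatten {"name": value} into ["--name", "value", ...].

    True becomes a bare flag; False and None are skipped.
    """
    cli_args = []
    for name, value in options.items():
        if value is True:
            cli_args.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            cli_args.extend([f"--{name}", str(value)])
    return cli_args


example_args = build_cli_args({
    "num_train_epochs": 2,
    "gradient_checkpointing": True,
    "learning_rate": "2.0e-5",
    "packing": "false",  # string, so the literal value is passed through
})
print(example_args)
```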

## Submit Training Job

Submit the training job to Vertex AI with the following specifications:
- A2 Ultra GPU machine with 8x NVIDIA A100 80GB GPUs
- 250GB boot disk
- Environment variables for model training configuration
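
For orientation, the batch-size values from the `args` list combine with the 8 GPUs on this machine as follows. This is a back-of-the-envelope calculation that assumes plain data parallelism across all GPUs and that the training configuration uses the same values as `args`, which the notebook does not confirm.

```python
# Values taken from the args list and the machine specification in this notebook.
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
gpus_per_replica = 8
replica_count = 1

# Sequences contributing to each optimizer step across all GPUs.
effective_batch_size = (
    per_device_train_batch_size
    * gradient_accumulation_steps
    * gpus_per_replica
    * replica_count
)
print(effective_batch_size)  # 64
```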

In [None]:
job.submit(
    # args=args,  # uncomment to pass the `args` list defined above to the container
    replica_count=1,
    machine_type="a2-ultragpu-8g",
    accelerator_type="NVIDIA_A100_80GB",
    accelerator_count=8,
    environment_variables={
        "HF_TOKEN": HF_TOKEN,
        "TRL_USE_RICH": "0",
        "ACCELERATE_LOG_LEVEL": "INFO",
        "TRANSFORMERS_LOG_LEVEL": "INFO",
        "TQDM_POSITION": "-1",
        "DATASET_PATH": DATASET_PATH,
        "MODEL_OUTPUT_DIR": MODEL_OUTPUT_URI,
        "DATASET_NUMBER_OF_RECORDS": "2000",
        "MODEL_NAME": MODEL_NAME,
        "TRAINING_CONFIG_PATH": f"/gcs/{BUCKET_NAME}/{TRAINING_CONFIG_PATH}",
    },
    boot_disk_size_gb=250,
)
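
The notebook does not show the container's entry point. Purely as an illustration, a training script inside the container might collect the variables passed above along these lines. `JobSettings` and `load_job_settings` are hypothetical names, and treating `DATASET_NUMBER_OF_RECORDS` as an integer cap on the number of records is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass
class JobSettings:
    dataset_path: str
    model_output_dir: str
    model_name: str
    training_config_path: str
    max_records: int


def load_job_settings(env=os.environ) -> JobSettings:
    """Read the job's environment variables; raises KeyError if one is missing."""
    return JobSettings(
        dataset_path=env["DATASET_PATH"],
        model_output_dir=env["MODEL_OUTPUT_DIR"],
        model_name=env["MODEL_NAME"],
        training_config_path=env["TRAINING_CONFIG_PATH"],
        max_records=int(env["DATASET_NUMBER_OF_RECORDS"]),
    )
```

Reading every variable in one place means a missing setting fails at startup rather than partway through training.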