# Fine Tuning for Model Garden (OpenLLaMA 3B)

## Overview 

This notebook demonstrates fine tuning OpenLLaMA with parameter-efficient fine tuning (PEFT) libraries, deploying the fine tuned model, and running inference, using a sample LLM from Model Garden (OpenLLaMA 3B). The functions in this notebook can be adapted for other models from Model Garden.

[openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b)


# Code

The following code sets up the Python environment on the workbench, fine tunes the model, deploys it to a model endpoint, and provides an example of how to run inference on the deployed model.

## Set-up

In [None]:
# Cloud project ID.
# The `!` shell capture returns an IPython SList; `.n` joins its lines into a single string.
PROJECT_ID = !gcloud config get project
PROJECT_ID = PROJECT_ID.n
print("Project ID: " + PROJECT_ID)

# The region you want to launch jobs in.
REGION = "europe-west2"
print("Region: " + REGION)

# The Cloud Storage bucket for storing experiments output.
BUCKET_URI = "gs://gen-ai-%s-bucket" % PROJECT_ID
print("Bucket URI: " + BUCKET_URI)

import os

# Folders within the bucket that store the required model components.
STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
EXPERIMENT_BUCKET = os.path.join(BUCKET_URI, "peft")
DATA_BUCKET = os.path.join(EXPERIMENT_BUCKET, "data")
MODEL_BUCKET = os.path.join(EXPERIMENT_BUCKET, "model")
print("- Staging Bucket: " + STAGING_BUCKET)
print("- Experiment Bucket: " + EXPERIMENT_BUCKET)
print("- Data Bucket: " + DATA_BUCKET)
print("- Model Bucket: " + MODEL_BUCKET)

# The service account for deploying the fine tuned model.
# It requires the `Vertex AI User` and `Storage Object Admin` roles.
SERVICE_ACCOUNT = "%s-consumer-sa@%s.iam.gserviceaccount.com" % (PROJECT_ID, PROJECT_ID)
print("Service Account: " + SERVICE_ACCOUNT)

! gcloud config set project $PROJECT_ID
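
The folder layout above can be captured in a small helper, which makes it easier to reuse with a different bucket. This is a minimal sketch, not part of the original notebook: the function name `bucket_layout` and the example bucket name are hypothetical. It uses `posixpath.join` rather than `os.path.join` so that the `gs://` URIs always use forward slashes, whatever operating system the notebook runs on.

```python
import posixpath


def bucket_layout(bucket_uri: str) -> dict:
    """Derive the storage folders used in this notebook from one bucket URI."""
    experiment = posixpath.join(bucket_uri, "peft")
    return {
        "staging": posixpath.join(bucket_uri, "temporal"),
        "experiment": experiment,
        "data": posixpath.join(experiment, "data"),
        "model": posixpath.join(experiment, "model"),
    }


# Example with a placeholder bucket name (hypothetical value).
layout = bucket_layout("gs://gen-ai-my-project-bucket")
for name, uri in layout.items():
    print(f"- {name}: {uri}")
```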

### Initialize Vertex AI API

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

### Define Image Location Constants

The following constants define the location of the container images used by the training job and by the endpoint that serves requests.

In [None]:
# The prebuilt training and serving docker images.
TRAIN_DOCKER_URI = (
    "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-train"
)
PREDICTION_DOCKER_URI = (
    "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-serve"
)

## Fine Tune and Deploy Prebuilt OpenLLaMA

This section demonstrates how to fine tune and deploy OpenLLaMA with PEFT LoRA. The model deployment step will take ~15 minutes to complete.

With the default settings, the peak GPU memory usage for [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b), [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b), and [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b) is ~5.3 GB, ~8.7 GB and ~15.2 GB respectively.

Set the prebuilt model ID to one of the models below. For larger versions of the model, it may be necessary to increase the compute capacity of the endpoint and the training job, which may incur higher costs.

|Models|
| :- |
| openlm-research/open_llama_3b |
| openlm-research/open_llama_7b |
| openlm-research/open_llama_13b |


NOTE: The prebuilt model weights are downloaded on the fly from the original location after the deployment succeeds. Allow an additional 5 minutes **after** the model deployment step succeeds before you send requests to the endpoint. Otherwise you might see a `ServiceUnavailable: 503 502:Bad Gateway` error.

Once deployment succeeds, you can send requests to the endpoint with text prompts. The cell below loads the endpoint that the model is deployed to.

In [None]:
endpoint = aiplatform.Endpoint("projects/%s/locations/europe-west2/endpoints/gen-ai-oss-peft-endpoint" % PROJECT_ID)

In [None]:
base_model_id = "openlm-research/open_llama_3b"

## Finetune

Use the Vertex AI SDK to create and run the custom training jobs with Vertex AI Model Garden training images. 

To make finetuning efficient, quantization is enabled when loading pretrained models for finetuning LoRA models. Precision options include `"4bit"`, `"8bit"`, `"float16"` (default) and `"float32"`, and the precision can be set via `"--precision_mode"`. With the default training parameters and the example dataset, the peak GPU memory usage when finetuning LoRA models is ~7 GB, ~10 GB and ~16 GB for [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b), [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b), and [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b) respectively. open_llama_3b and open_llama_7b can be finetuned on 1 V100, and open_llama_13b can be finetuned on 1 A100 (40 GB).
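
As an illustration of how a precision mode could be passed to the training job, the sketch below builds the argument list with an explicit `--precision_mode` flag. The helper name `build_train_args` and the example output path are hypothetical. The `--precision_mode=<value>` form is an assumption, chosen to match the `--flag=value` style of the other arguments in this notebook; check the training container's documentation before relying on it.

```python
# Precision options listed in this notebook.
VALID_PRECISION_MODES = ("4bit", "8bit", "float16", "float32")


def build_train_args(base_model_id, dataset_name, output_dir, precision_mode="float16"):
    """Build a training argument list that includes an explicit precision mode."""
    if precision_mode not in VALID_PRECISION_MODES:
        raise ValueError(
            f"precision_mode must be one of {VALID_PRECISION_MODES}, got {precision_mode!r}"
        )
    return [
        "--task=causal-language-modeling-lora",
        f"--pretrained_model_id={base_model_id}",
        f"--dataset_name={dataset_name}",
        f"--output_dir={output_dir}",
        # Assumed flag format, following the other arguments.
        f"--precision_mode={precision_mode}",
    ]


# Example values; the output path is a placeholder.
args = build_train_args(
    "openlm-research/open_llama_3b",
    "Abirate/english_quotes",
    "/gcs/my-bucket/peft/model/openllama-3b-PEFT",
    precision_mode="4bit",
)
print(args)
```

The resulting list could then be passed as `args=` to `train_job.run(...)` alongside the LoRA and optimizer arguments used in the cells below.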

### Option 1: Finetune Using a Public HuggingFace Dataset

*Run either this cell or Option 2, not both: each run writes to the same output location, so the output model will be overwritten.*

This example uses the dataset https://huggingface.co/datasets/Abirate/english_quotes. The Google-provided container uses the Hugging Face Datasets library to load datasets: https://huggingface.co/docs/datasets/loading. The name of the passed dataset can be changed to any public dataset that is available on Hugging Face and suitable for this task.

In [None]:
dataset_name = "Abirate/english_quotes"

# Worker pool spec.
# Finetunes open_llama_3b.
# Change the machine specifications below if larger models or faster training are required.
# Note that provisioning more compute can result in significant costs.
machine_type = "n1-standard-8"
accelerator_type = "NVIDIA_TESLA_T4"
replica_count = 1
accelerator_count = 1

# Set up the training job. It runs in a custom Google-provided container.
job_name = "openllama-3b-PEFT"
train_job = aiplatform.CustomContainerTrainingJob(
    display_name=job_name,
    container_uri=TRAIN_DOCKER_URI,
)
output_dir = os.path.join(MODEL_BUCKET, job_name)
output_dir_gcsfuse = output_dir.replace("gs://", "/gcs/")

# Pass training arguments and launch job.
train_job.run(
    args=[
        "--task=causal-language-modeling-lora",
        f"--pretrained_model_id={base_model_id}",
        f"--dataset_name={dataset_name}",
        f"--output_dir={output_dir_gcsfuse}",
        "--lora_rank=16",
        "--lora_alpha=32",
        "--lora_dropout=0.05",
        "--warmup_steps=10",
        "--max_steps=10",
        "--learning_rate=2e-4",
    ],
    replica_count=replica_count,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    boot_disk_size_gb=500,
    environment_variables={"WANDB_MODE":"offline"}
)

print("Trained models were saved in: ", output_dir)
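
The `--lora_rank` and `--lora_alpha` arguments above can be put in context with some quick arithmetic. In the standard LoRA formulation, the update to a weight matrix of shape `d_out x d_in` is the product of two low-rank matrices scaled by `alpha / rank`, which adds `rank * (d_in + d_out)` trainable parameters. The sketch below assumes the training container follows this standard formulation, and the layer dimension of 4096 is purely illustrative, not a property of OpenLLaMA 3B.

```python
def lora_scaling(lora_alpha: float, lora_rank: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / lora_rank


def lora_param_count(d_in: int, d_out: int, lora_rank: int) -> int:
    """Trainable parameters added by one LoRA adapter (matrices A and B)."""
    return lora_rank * (d_in + d_out)


# Illustrative square layer; the dimension is a hypothetical value.
d_in = d_out = 4096
full_params = d_in * d_out
added_params = lora_param_count(d_in, d_out, lora_rank=16)

print("Scaling factor:", lora_scaling(lora_alpha=32, lora_rank=16))
print("Full layer parameters:", full_params)
print("LoRA adapter parameters:", added_params)
print("Adapter share of the layer: {:.2%}".format(added_params / full_params))
```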

### Option 2: Finetune Using Dataset from a Cloud Storage Bucket

*Run either this cell or Option 1, not both: each run writes to the same output location, so the output model will be overwritten.*

This example uses the same public dataset, but reads it from a storage bucket within this project. This pattern can be used to train with public datasets from other sources.

The storage bucket is mounted to the container running the PEFT job. Data files can be uploaded to the data folder of the gen-ai storage bucket (in structured data formats such as jsonl) and will be picked up by the job with this configuration.

You can also extract data from sources like BigQuery and store it within Cloud Storage to use with this pattern.
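
As a rough illustration of preparing a jsonl data file for this pattern, the sketch below writes a few records to a local `quotes.jsonl` file, one JSON object per line, and reads them back. The helper names and the record fields (`quote`, `author`) are assumptions made for the example; this notebook does not specify the schema that the training container expects.

```python
import json
import os
import tempfile

# Hypothetical records; the field names are illustrative only.
records = [
    {"quote": "An example quote.", "author": "Example Author"},
    {"quote": "Another example quote.", "author": "Second Author"},
]


def write_jsonl(rows, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a jsonl file back into a list of dictionaries."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


local_path = os.path.join(tempfile.mkdtemp(), "quotes.jsonl")
write_jsonl(records, local_path)
print(read_jsonl(local_path))
```

The file could then be copied into the data folder, for example with `gsutil cp` from a notebook cell, before launching the training job.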

In [None]:
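# Name of the data file in the data folder. It is kept for reference only:
# the job below is pointed at the whole data folder, not at this file.
dataset_name = "quotes.jsonl"

# Worker pool spec.
# Finetunes open_llama_3b.
# Change the machine specifications below if larger models or faster training are required.
# Note that provisioning more compute can result in significant costs.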
machine_type = "n1-standard-8"
accelerator_type = "NVIDIA_TESLA_T4"
replica_count = 1
accelerator_count = 1

# Set up the training job. It runs in a custom Google-provided container.
job_name = "openllama-3b-PEFT"
train_job = aiplatform.CustomContainerTrainingJob(
    display_name=job_name,
    container_uri=TRAIN_DOCKER_URI,
)
# Note: The storage bucket is mounted at /gcs/ within the container's filesystem.
# The bucket is mounted using Cloud Storage FUSE, which allows the mounted bucket to be accessed as a filesystem.
# This is the same way the updates in this notebook get stored in the user-guide bucket.
data_dir = DATA_BUCKET
data_dir_gcsfuse = data_dir.replace("gs://", "/gcs/")  # Folder for the training data (mounted from GCS)
output_dir = os.path.join(MODEL_BUCKET, job_name)
output_dir_gcsfuse = output_dir.replace("gs://", "/gcs/")  # Folder for the output model (mounted from GCS)

# Pass training arguments and launch job.
train_job.run(
    args=[
        "--task=causal-language-modeling-lora",
        f"--pretrained_model_id={base_model_id}",
        f"--dataset_name={data_dir_gcsfuse}",
        f"--output_dir={output_dir_gcsfuse}",
        "--lora_rank=16",
        "--lora_alpha=32",
        "--lora_dropout=0.05",
        "--warmup_steps=10",
        "--max_steps=10",
        "--learning_rate=2e-4",
    ],
    replica_count=replica_count,
    machine_type=machine_type,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    boot_disk_size_gb=500,
    environment_variables={"WANDB_MODE":"offline"}
)

print("Trained models were saved in: ", output_dir)
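
The `gs://` to `/gcs/` conversion used in the cell above can be wrapped in a pair of small helpers. This is a sketch with hypothetical function names. Unlike a plain `str.replace`, it only rewrites the leading prefix and rejects inputs that do not look like a bucket URI or a mounted path.

```python
GCS_PREFIX = "gs://"
FUSE_PREFIX = "/gcs/"


def to_gcsfuse_path(gcs_uri: str) -> str:
    """Convert a gs:// URI to its Cloud Storage FUSE path under /gcs/."""
    if not gcs_uri.startswith(GCS_PREFIX):
        raise ValueError(f"Expected a URI starting with {GCS_PREFIX!r}, got {gcs_uri!r}")
    return FUSE_PREFIX + gcs_uri[len(GCS_PREFIX):]


def to_gcs_uri(fuse_path: str) -> str:
    """Convert a /gcs/ path back to a gs:// URI."""
    if not fuse_path.startswith(FUSE_PREFIX):
        raise ValueError(f"Expected a path starting with {FUSE_PREFIX!r}, got {fuse_path!r}")
    return GCS_PREFIX + fuse_path[len(FUSE_PREFIX):]


# Example with a placeholder bucket name (hypothetical value).
example_uri = "gs://gen-ai-my-project-bucket/peft/data"
example_path = to_gcsfuse_path(example_uri)
print(example_path)
print(to_gcs_uri(example_path))
```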

## Deploy
This section uploads the model to Model Registry and deploys it on the Endpoint.

The model deployment step will take ~15 minutes to complete.

With the default settings, the peak GPU memory usage for [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b), [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b), and [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b) with LoRA weights is ~5.3 GB, ~8.7 GB and ~15.2 GB respectively.

### Upload Model Garden Model To Model Registry

Note that the serving container environment variables now contain the path to the finetuned LoRA weights in GCS.

In [None]:
model = aiplatform.Model.upload(
    display_name="openllama-peft-serve",
    serving_container_image_uri=PREDICTION_DOCKER_URI,
    serving_container_ports=[7080],
    serving_container_predict_route="/predictions/peft_serving",
    serving_container_health_route="/ping",
    serving_container_environment_variables={
        "BASE_MODEL_ID": base_model_id,
        "TASK": "causal-language-modeling-lora",
        # Path to the fine tuned LoRA adapter model.
        "FINETUNED_LORA_MODEL_PATH": output_dir
    },
)

### Deploy Model to Endpoint

This function deploys the model to the model endpoint and associates compute resources with it.

You can select from several machine and accelerator types. Reference:

- Machine Types: https://cloud.google.com/compute/docs/machine-resource
- Accelerator Types: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones

Please note that accelerators (GPUs) are limited by region and can incur high costs depending on length of deployment and type.

In [None]:
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    deploy_request_timeout=1800,
    service_account=SERVICE_ACCOUNT,
)

## Inference using deployed model

NOTE: After the deployment succeeds, the base model weights will be downloaded on the fly from the original location and LoRA model weights will be downloaded from the GCS bucket used in training above. Thus, an additional 5 minutes of waiting time is needed **after** the above model deployment step succeeds and before you can run the next step below. Otherwise you might see a `ServiceUnavailable: 503 502:Bad Gateway` error when you send requests to the endpoint.
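
Instead of waiting a fixed amount of time, requests can be retried until the endpoint is ready. The sketch below is one possible approach, not part of the original notebook: the helper `predict_with_retry` and its parameters are hypothetical, and the demo uses a stand-in function rather than a real endpoint so that it runs without cloud access.

```python
import time


def predict_with_retry(predict_fn, instances, retry_on=(Exception,),
                       attempts=5, delay_seconds=60.0, sleep=time.sleep):
    """Call predict_fn(instances=...) and retry on the given exception types."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return predict_fn(instances=instances)
        except retry_on as error:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error.
            print(f"Attempt {attempt} failed ({error}); retrying in {delay_seconds}s")
            sleep(delay_seconds)


# Stand-in for `endpoint.predict` that fails twice before succeeding.
calls = {"count": 0}


def flaky_predict(instances):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("503 502:Bad Gateway")
    return {"echo": instances}


result = predict_with_retry(
    flaky_predict,
    [{"prompt": "Hello"}],
    retry_on=(ConnectionError,),
    delay_seconds=0,
    sleep=lambda seconds: None,  # Skip real waiting in the demo.
)
print(result)
```

With the deployed endpoint, `endpoint.predict` could be passed as `predict_fn` and `google.api_core.exceptions.ServiceUnavailable` as the exception type to retry on.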

In [None]:
instances = [
    {"prompt": "Generate a list of ways that makes Earth unique compared to other planets"},
]
response = endpoint.predict(instances=instances)

for prediction in response.predictions[0]:
    print(prediction["generated_text"])

### Clean Up Resources

Clean up resources afterwards to avoid excess costs. This can also be done from the Cloud console.

In [None]:
# Undeploy all models from the endpoint
endpoint.undeploy_all()

# Delete Models
model.delete()

In [None]:
# Delete the endpoint. Note: if you delete your pre-provisioned endpoint, use the function below to create and load another one.
endpoint.delete(force=True)

### Deploy Additional Endpoints

Use the function below to create additional endpoints.

In [None]:
endpoint = aiplatform.Endpoint.create(
    display_name="gen-ai-oss-peft-endpoint",
)