# Fine-tune LLMs with TRL's CLI on Vertex AI

TL;DR [Transformer Reinforcement Learning (TRL)](https://github.com/huggingface/trl) is a framework developed by Hugging Face to fine-tune and align both transformer language and diffusion models using methods such as Supervised Fine-Tuning (SFT), Reward Modeling (RM), Proximal Policy Optimization (PPO), Direct Preference Optimization (DPO), and others. On the other hand, Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. In this example, we will show how to create a custom training job in Vertex AI running the Hugging Face DLCs for training models, using TRL's recently released CLI.

## Setup / Configuration

First, we need to install `gcloud` on our local machine, so that we can authenticate to Google Cloud and configure the project and the default location we want to use. To install `gcloud`, follow the instructions at https://cloud.google.com/sdk/docs/install.

Before proceeding, for convenience we will set the following environment variables:

In [None]:
%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env BUCKET_URI=gs://hf-vertex-pipelines

Then we need to log in to our GCP account and set the project ID to the one we want to use for Vertex AI.

In [None]:
!gcloud auth login
!gcloud config set project $PROJECT_ID

Once we are logged in, we need to ensure that the necessary APIs are enabled in GCP, such as the Vertex AI, Compute Engine and Container Registry related APIs.

In [None]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

## Optional: Create bucket in GCS

Since we will run a Vertex AI job, we need to specify a GCS bucket in which to store the artifacts, logs, etc. generated by the fine-tuning job. The job writes those into a directory that is mounted from the GCS bucket, so the content is synced with the bucket and persisted there for later use.

We can use an existing bucket, so if you already have a bucket available in GCS for storing the model outputs, feel free to skip this step and jump to the next one. Otherwise, we are going to use the `gsutil` tool to create a new bucket in GCS in the specified project and location, with a unique name.

On top of `gcloud`, we will need to install `gsutil` to interact with Google Cloud Storage (GCS) in order to create the bucket (alternatively, the bucket can also be created via the UI or a `gcloud storage ...` command). To install `gsutil`, the recommended way is to install it via `gcloud` as follows:

In [None]:
!gcloud components install gsutil

Then we can create the bucket in GCS as follows:

In [None]:
!gsutil mb -p $PROJECT_ID -l $LOCATION $BUCKET_URI

## Prepare `CustomContainerTrainingJob`

Once we've configured the environment and created the bucket (if applicable), we can define the `CustomContainerTrainingJob`, which is a standard container job that runs in Vertex AI on a Compute Engine instance. The container in this case is the Hugging Face DLC for training, on top of which we define the set of commands to run.

To create the `CustomContainerTrainingJob`, we first need to install the [`google-cloud-aiplatform`](https://github.com/googleapis/python-aiplatform/tree/main) Python SDK via `pip`, so that we can programmatically define the job we want to run in Vertex AI. We install it as follows:

In [None]:
!pip install google-cloud-aiplatform

Once the Python SDK is installed, we just need to initialize the Vertex AI client using the previously defined environment variables as follows:

In [None]:
import os
from google.cloud import aiplatform

PROJECT_ID = os.getenv("PROJECT_ID")
LOCATION = os.getenv("LOCATION")
BUCKET_URI = os.getenv("BUCKET_URI")

aiplatform.init(
    project=PROJECT_ID,
    location=LOCATION,
    staging_bucket=BUCKET_URI,
)

Then, we define the `CustomContainerTrainingJob` to run on top of the Hugging Face DLC for training, running the following command:

```sh
sh -c 'pip install flash-attn --no-build-isolation && exec trl sft "$@"' --
```

The command above:

* Installs [`flash-attn`](https://github.com/Dao-AILab/flash-attention), which is not included by default in the Hugging Face DLC for training and will speed up the forward pass during fine-tuning.
* Prepares the `trl sft` command, which will later receive a list of arguments, as those are appended to the predefined command and forwarded via `"$@"`.
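To see how the trailing `--` and `"$@"` forward the arguments, here is a minimal local sketch. It assumes a POSIX `sh` is available, swaps `trl sft` for `printf` (a stand-in, so that nothing is installed or trained), and uses two made-up arguments. With `sh -c '<script>' -- arg1 arg2`, the `--` fills `$0` and the remaining items become `"$@"`.

```python
import subprocess

# Same shape as the job command, with `printf` standing in for `trl sft`
# (hypothetical stand-in, only meant to show how arguments are forwarded).
script = 'printf "%s\\n" "$@"'
command = ["sh", "-c", script, "--"]

# Hypothetical arguments, playing the role of the `args` appended by Vertex AI.
args = ["--model_name_or_path=some/model", "--use_peft"]

result = subprocess.run(command + args, capture_output=True, text=True, check=True)
forwarded = result.stdout.splitlines()  # one forwarded argument per line
print(forwarded)
```

Each appended argument reaches the final command as a separate item, which is why the job `args` can be passed as a plain list of strings.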

Additionally, note that the `CustomContainerTrainingJob` will override the default `ENTRYPOINT` of the container image at the given URI, so if that `ENTRYPOINT` is already prepared to receive the arguments, there's no need to define a custom `command`.

In [None]:
job = aiplatform.CustomContainerTrainingJob(
    display_name="trl-lora-sft",
    # TODO(alvarobartt): update container URI with the publicly pushed one instead, or show how to build and
    # push that to the Artifact Registry instead.
    container_uri="...",
    command=[
        "sh",
        "-c",
        " && ".join(
            (
                # 'pip install "trl>=0.9.4" --upgrade',
                # required since there's a bug with the `torch_dtype` that prevents us from loading
                # the model as the default is fp32 and that won't fit in an L4 GPU with 24GiB
                # see https://github.com/huggingface/trl/issues/1751
                'pip install "trl @ git+https://github.com/alvarobartt/trl.git" --upgrade',
                # https://cloud.google.com/vertex-ai/docs/training/code-requirements
                "pip install flash-attn --no-build-isolation",
                'exec trl sft "$@"',
            )
        ),
        "--",
    ],
)

## Run `CustomContainerTrainingJob`

In this case we will use TRL's recently released CLI to run Supervised Fine-Tuning (SFT) on [`mistralai/Mistral-7B-v0.3`](https://huggingface.co/mistralai/Mistral-7B-v0.3) with LoRA in `bfloat16`, using [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco), which is a subset of [`OpenAssistant/oasst1`](https://huggingface.co/datasets/OpenAssistant/oasst1) with ~10k samples.

Before running the `CustomContainerTrainingJob`, we first need to decide which accelerator or VM resources to use for fine-tuning the model. We can either do a rough calculation of needing ~4 times the model size in GPU memory (read more about it in [Eleuther AI - Transformer Math 101](https://blog.eleuther.ai/transformer-math/)), or, if the model is uploaded to the Hugging Face Hub, just check the numbers in [Vokturz/can-it-run-llm](https://huggingface.co/spaces/Vokturz/can-it-run-llm).

<div class="alert alert-block alert-info">
    <a href="https://huggingface.co/spaces/Vokturz/can-it-run-llm">Vokturz/can-it-run-llm</a> is a Space hosted on the Hugging Face Hub that does those calculations for us based on the model we want to fine-tune, whether we want to use LoRA / QLoRA or not, the accelerator we have, and some other settings.
</div>
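The rough calculation mentioned above can be sketched in a few lines. This is only the ~4x rule of thumb, not a precise estimate; the parameter count of ~7.25B for `mistralai/Mistral-7B-v0.3` is an approximate figure assumed here, and the function name is hypothetical.

```python
def estimate_finetune_memory_gib(
    num_params: float, bytes_per_param: int = 2, multiplier: float = 4.0
) -> float:
    """Rough GPU memory estimate: multiplier x model size (rule of thumb only)."""
    model_size_bytes = num_params * bytes_per_param
    return multiplier * model_size_bytes / 1024**3


# Assumption: ~7.25B parameters, stored in bfloat16 (2 bytes per parameter).
weights_gib = estimate_finetune_memory_gib(7.25e9, multiplier=1.0)
rule_of_thumb_gib = estimate_finetune_memory_gib(7.25e9)

print(f"weights only: {weights_gib:.1f} GiB")
print(f"~4x rule of thumb: {rule_of_thumb_gib:.1f} GiB")
```

Under these assumptions the weights alone take roughly 13.5 GiB, while the ~4x estimate is well above the memory of a single 24GiB GPU, which is one reason to reach for LoRA and the memory-saving settings discussed in this example.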

We will run it on an NVIDIA L4 GPU with 24GiB of VRAM, which is enough to fit the model for LoRA fine-tuning in `bfloat16` with a batch size of 1, as shown in the screenshot below:

![`Vokturz/can-it-run-llm` for `mistralai/Mistral-7B-v0.3`](./imgs/can-it-run-llm.png)

<div class="alert alert-block alert-info">
    Once we've decided which resources we are going to use to fine-tune our model, we need to define the hyperparameters accordingly. Some of the hyperparameters that we may want to look into to avoid running into OOM errors are the following:
    <ul>
        <li>LoRA / QLoRA configuration: we may need to tweak the rank, denoted by `r`, which determines the number of trainable parameters added to each targeted linear layer. </li>
        <li>Optimizer: by default the AdamW optimizer will be used, but alternatively lower precision optimizers can be used to reduce the memory as well e.g. `adamw_bnb_8bit` (for more information on 8-bit optimizers check <a href="https://huggingface.co/docs/bitsandbytes/main/en/optimizers" target="_blank">https://huggingface.co/docs/bitsandbytes/main/en/optimizers</a>). </li>
        <li>Batch size: you can lower the batch size when running into OOM errors, and increase the gradient accumulation steps to simulate a similar batch size for the gradient updates while providing fewer inputs per batch at a time e.g. `batch_size=8` and `gradient_accumulation=1` is effectively the same as `batch_size=4` and `gradient_accumulation=2`.</li>
    </ul>
</div>
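The arithmetic behind two of the knobs above can be sketched as follows. Both helper names are hypothetical, and the 4096 x 4096 layer shape is an assumption chosen only for illustration. LoRA adds two low-rank matrices per targeted layer, so the number of added parameters grows linearly with `r`.

```python
def effective_batch_size(
    per_device_batch_size: int, gradient_accumulation_steps: int, num_devices: int = 1
) -> int:
    """Number of samples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices


def lora_params_per_layer(in_features: int, out_features: int, r: int) -> int:
    """LoRA adds A (r x in_features) and B (out_features x r) to a linear layer."""
    return r * (in_features + out_features)


# The two configurations from the note above update the weights with the same batch size.
same = effective_batch_size(8, 1) == effective_batch_size(4, 2)
print("equivalent:", same)

# Hypothetical 4096 x 4096 linear layer: added parameters scale linearly with the rank.
for r in (8, 16, 64):
    print(r, lora_params_per_layer(4096, 4096, r))
```

Lowering the per-device batch size reduces activation memory, while lowering `r` reduces the number of trainable parameters and, with them, the optimizer state.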

As the `CustomContainerTrainingJob` defines `trl sft` as the last command of the container entrypoint, the `args` provided will be appended to it. To see the available arguments for the `trl sft` command, we can either run `trl sft --help` or check the documentation of the supported arguments at https://huggingface.co/docs/trl/en/sft_trainer#trl.SFTConfig, which is the dataclass that the CLI entrypoint parses the provided args into.

Read more about TRL's CLI at https://huggingface.co/docs/trl/en/clis.
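Since the `args` are just a flat list of `--flag=value` strings appended to `trl sft`, they can also be generated from a dictionary. The helper below is a hypothetical convenience, not part of TRL or the Vertex AI SDK. It assumes that boolean options are passed as bare flags and that `False` or `None` values can simply be omitted, which only matches options whose default is already `False` or unset.

```python
def to_cli_args(config: dict) -> list[str]:
    """Turn a flat config dictionary into a list of `--key=value` CLI arguments."""
    cli_args = []
    for key, value in config.items():
        if value is True:
            cli_args.append(f"--{key}")  # bare flag, e.g. `--use_peft`
        elif value is False or value is None:
            continue  # assumption: omitted options fall back to their defaults
        else:
            cli_args.append(f"--{key}={value}")
    return cli_args


example_args = to_cli_args(
    {"model_name_or_path": "mistralai/Mistral-7B-v0.3", "use_peft": True, "lora_r": 16}
)
print(example_args)
```

Keeping the configuration in a dictionary makes it easier to reuse the same hyperparameters across jobs, at the cost of one extra layer between the notebook and the CLI.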

In [None]:
args = [
    # MODEL
    "--model_name_or_path=mistralai/Mistral-7B-v0.3",
    "--torch_dtype=bfloat16",
    "--attn_implementation=flash_attention_2",
    # DATASET
    "--dataset_name=timdettmers/openassistant-guanaco",
    "--dataset_text_field=text",
    # PEFT
    "--use_peft",
    "--lora_r=16",
    "--lora_alpha=32",
    "--lora_dropout=0.1",
    "--lora_target_modules=all-linear",
    # TRAINER
    "--bf16",
    "--max_seq_length=1024",
    "--per_device_train_batch_size=2",
    "--gradient_accumulation_steps=8",
    "--gradient_checkpointing",
    "--learning_rate=0.0002",
    "--lr_scheduler_type=cosine",
    "--optim=adamw_bnb_8bit",
    "--num_train_epochs=1",
    "--logging_steps=10",
    "--do_eval",
    "--eval_steps=100",
    "--report_to=none",
    f"--output_dir={BUCKET_URI.replace('gs://', '/gcs/')}/Mistral-7B-v0.3-LoRA-SFT-Guanaco",
    "--overwrite_output_dir",
    "--seed=42",
    "--log_level=debug",
]

It's important to note that since GCS FUSE is used to mount the bucket as a directory within the instance running the container job, the mounted path has the format `/gcs/<BUCKET_NAME>`. That is, the default `gs://` GCS path notation is not used, and the mounted path also contains the bucket name.

So, when using TRL's `SFTTrainer` via the CLI, the `output_dir` (as in the default `transformers.Trainer`) will point to the mounted GCS bucket, and everything we write there, i.e. files and directories, will be automatically uploaded to the GCS bucket.
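The path mapping described above can be written as a small helper. The function name is hypothetical; compared with a plain `str.replace`, it also rejects values that are not `gs://` URIs, so a misconfigured bucket URI is caught before the job is submitted.

```python
def gcs_uri_to_fuse_path(uri: str) -> str:
    """Map `gs://<BUCKET_NAME>/<path>` to `/gcs/<BUCKET_NAME>/<path>`."""
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a URI starting with {prefix!r}, got {uri!r}")
    return "/gcs/" + uri[len(prefix):]


fuse_path = gcs_uri_to_fuse_path("gs://hf-vertex-pipelines/Mistral-7B-v0.3-LoRA-SFT-Guanaco")
print(fuse_path)
```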

Once the `args` are defined, we can call the `submit` method on the `aiplatform.CustomContainerTrainingJob`, which is effectively the same as `run`, except that `submit` is non-blocking: the training job is scheduled, but the program is not blocked.

Here's a breakdown of the arguments provided to the `submit` method:

* `args`: as already mentioned, contains the args to be provided to the container's command defined above, which is `trl sft` meaning that the `args` are provided as `trl sft --arg_1=value ...`.
* `replica_count`: defines the number of replicas that we want, which for this training job is 1, as we only want to schedule a single node for training.
* `machine_type`, `accelerator_type`, and `accelerator_count`: define the machine, i.e. the Compute Engine instance, that will be used, the accelerator to use (if any), and the number of accelerators to use (from 1 to 8), respectively. Read more about the GPU machine selection at https://cloud.google.com/compute/docs/gpus, and check https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec to see which values `accelerator_type` accepts.
* `base_output_dir`: defines the base directory that will be mounted within the running container from the GCS Bucket, conditioned by the `staging_bucket` argument provided to the `aiplatform.init` initially.
* `environment_variables`: these are optional, but since in this case we are fine-tuning a gated model such as [`mistralai/Mistral-7B-v0.3`](https://huggingface.co/mistralai/Mistral-7B-v0.3), we need to set the `HF_TOKEN` in advance, since it's required to read / access gated or private models in the Hugging Face Hub.
* `timeout` and `create_request_timeout`: these are also optional, and define the timeouts in seconds to wait before interrupting the training job or the training job creation request (i.e. the time to allocate the required resources and start the execution), respectively.
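Because the gated model cannot be downloaded without a token, it can help to fail early when `HF_TOKEN` is missing locally, rather than after the job has been scheduled. The helper below is a hypothetical sketch for that check, not part of the Vertex AI SDK.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

It could then be used when building the job's environment, e.g. `environment_variables={"HF_TOKEN": require_env("HF_TOKEN")}`, so that a missing token raises before any resources are requested.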

In [None]:
job.submit(
    args=args,
    replica_count=1,
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    base_output_dir=f"{BUCKET_URI}/Mistral-7B-v0.3-LoRA-SFT-Guanaco",
    environment_variables={"HF_TOKEN": os.getenv("HF_TOKEN", None)},
    timeout=60 * 60 * 3,  # 3 hours (10800s)
    create_request_timeout=60 * 10,  # 10 minutes (600s)
)

![Pipeline created in Vertex AI](./imgs/vertex-ai-pipeline-scheduled.png)

Since training with the predefined hyperparameters will take around 2 hours, we will need to wait until the training job has finished. Once the training job is done, we can check the GCS Bucket that contains the generated artifacts, in this case the PEFT adapters of the fine-tuned model.

![Vertex AI Pipeline successfully completed](./imgs/vertex-ai-pipeline-scheduled.png)

![Vertex AI Pipeline logs](./imgs/vertex-ai-pipeline-logs.png)

![GCS Bucket with uploaded artifacts](./imgs/gcs-bucket-artifacts.png)

Then we can use the adapters to run inference either locally or on a VM, or merge the adapters with the base model and run inference via any supported framework, e.g. via the Hugging Face DLC for inference with `transformers` or via the Hugging Face DLC for TGI.
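As a rough sketch of the merging step, the function below loads the base model, applies the adapters with `peft`, and saves the merged weights. It assumes that `torch`, `transformers` and `peft` are installed, that the adapters have already been copied from the GCS Bucket to a local directory, and that there is enough memory to hold the base model; the function name and its arguments are hypothetical.

```python
def merge_lora_adapter(base_model_id: str, adapter_dir: str, output_dir: str) -> None:
    """Merge LoRA adapters into the base model and save the result to `output_dir`."""
    # Imported lazily so that defining the function does not require the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base_model, adapter_dir)  # attach the trained adapters
    merged_model = model.merge_and_unload()  # fold the LoRA weights into the base weights
    merged_model.save_pretrained(output_dir)
    # Assumption: the base model's tokenizer is reused as-is.
    AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)
```

A call would look like `merge_lora_adapter("mistralai/Mistral-7B-v0.3", "<local-adapter-dir>", "<output-dir>")`, where both directories are placeholders for paths of your choice.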