# Deploy BERT Models with PyTorch Inference on Vertex AI

TL;DR: DistilBERT is a small, fast, cheap and light Transformer model trained by distilling BERT, a bidirectional transformer pretrained on a large corpus with a combination of masked language modeling and next sentence prediction objectives. Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example shows how to deploy any supported PyTorch model from the Hugging Face Hub, in this case [`distilbert/distilbert-base-uncased-finetuned-sst-2-english`](https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english), on Vertex AI using the Hugging Face PyTorch Inference Deep Learning Container (DLC) available in Google Cloud Platform (GCP), on both CPU and GPU instances.

![`distilbert/distilbert-base-uncased-finetuned-sst-2-english` in the Hugging Face Hub](./assets/deploy-bert-on-vertex-ai/model-in-hf-hub.png)

## Setup / Configuration

First, we need to install `gcloud` on our local machine to authenticate to Google Cloud and to configure the project and the preferred (default) location we want to use.

To install `gcloud`, follow the instructions at https://cloud.google.com/sdk/docs/install.

Then, we also need to install `google-cloud-aiplatform`, which is required to programmatically register the model in the Vertex AI Model Registry, create the endpoint, and deploy the model to it. It can be installed as follows:

In [None]:
!pip install google-cloud-aiplatform --upgrade --quiet

Before proceeding, for convenience we will set the following environment variables:

In [None]:
%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-41.ubuntu2204.py311
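
Since later cells read these values with `os.getenv`, a missing or placeholder value would only surface once a Vertex AI call fails. A minimal sketch of an early check follows; the helper name `missing_settings` and the convention of treating values that start with `your-` as placeholders are assumptions for illustration, not part of any SDK:

```python
import os

# Variables that the rest of this example reads from the environment.
REQUIRED = ("PROJECT_ID", "LOCATION", "CONTAINER_URI")


def missing_settings(env=None, required=REQUIRED):
    """Return the names of required variables that are unset, empty, or placeholders."""
    env = os.environ if env is None else env
    missing = []
    for name in required:
        value = env.get(name, "")
        # Assumption: values such as "your-project-id" are unedited placeholders.
        if not value or value.startswith("your-"):
            missing.append(name)
    return missing


# Example with an explicit mapping instead of the real environment.
print(missing_settings({"PROJECT_ID": "my-project", "LOCATION": "your-location"}))
# ['LOCATION', 'CONTAINER_URI']
```

Calling `missing_settings()` without arguments checks the real process environment instead.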

Then we need to log in to our GCP account and set the project ID to the one we want to use for Vertex AI.

In [None]:
!gcloud auth login
!gcloud config set project $PROJECT_ID

Once we are logged in, we need to ensure that the necessary APIs are enabled in GCP, such as the Vertex AI, Compute Engine and Container Registry related APIs.

In [None]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

## Register model in Vertex AI

Once we are logged in to our GCP account and have enabled the required services, and with the `google-cloud-aiplatform` Python SDK installed, we can initialize it using the previously defined `PROJECT_ID` and `LOCATION`.

In [None]:
import os
from google.cloud import aiplatform

PROJECT_ID = os.getenv("PROJECT_ID")
LOCATION = os.getenv("LOCATION")

aiplatform.init(
    project=PROJECT_ID,
    location=LOCATION,
)

Then we can proceed to the model "upload", which in this case only consists of registering the model in Vertex AI, since the model will be downloaded automatically from the Hugging Face Hub when the Hugging Face PyTorch Inference DLC starts, based on the `HF_MODEL_ID` environment variable.

Before going into the code, let's review the arguments:

- `display_name` is the name that will be shown in Vertex AI Model Registry.

- `serving_container_image_uri` is the location of the Hugging Face PyTorch Inference DLC that we will be using for serving the model later on. In order to see which Hugging Face containers are available in GCP, you can run the following command:

    `gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" | grep "huggingface-pytorch-inference"`

- `serving_container_environment_variables` are the environment variables set in the container at runtime. They follow the environment variables defined by the `huggingface-inference-toolkit` Python SDK, which include the following:
    - `HF_MODEL_ID` is the identifier of the model in the Hugging Face Hub. To explore all the supported models please check https://huggingface.co/models?sort=trending filtering by the task that you want to use e.g. `text-classification`.
    - `HF_TASK` is the task identifier within the Hugging Face Hub. To see all the supported tasks please check https://huggingface.co/docs/transformers/en/task_summary#natural-language-processing.
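
The arguments above can be assembled from a single Hub model ID. The following is a minimal sketch, not an SDK feature: the helper name `upload_kwargs` is hypothetical, and the display name simply mirrors the naming used in this example (the `/` in the model ID replaced by `--`).

```python
def upload_kwargs(model_id, task, container_uri):
    """Build keyword arguments for `aiplatform.Model.upload` from a Hub model ID."""
    return {
        # Naming convention used in this example: "org/model" -> "org--model".
        "display_name": model_id.replace("/", "--"),
        "serving_container_image_uri": container_uri,
        "serving_container_environment_variables": {
            "HF_MODEL_ID": model_id,
            "HF_TASK": task,
        },
    }


kwargs = upload_kwargs(
    model_id="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
    task="text-classification",
    container_uri="us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-inference-cu121.2-2.transformers.4-41.ubuntu2204.py311",
)
print(kwargs["display_name"])
# distilbert--distilbert-base-uncased-finetuned-sst-2-english
```

With the SDK initialized, the resulting dictionary could then be passed as `aiplatform.Model.upload(**kwargs)`.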

For more information on the supported `aiplatform.Model.upload` arguments, check the `upload` reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_upload.

In [None]:
model = aiplatform.Model.upload(
    display_name="distilbert--distilbert-base-uncased-finetuned-sst-2-english",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "HF_MODEL_ID": "distilbert/distilbert-base-uncased-finetuned-sst-2-english",
        "HF_TASK": "text-classification",
    },
)
model.wait()

![Model in Vertex AI Model Registry](./assets/deploy-bert-on-vertex-ai/vertex-ai-model.png)

![Model Version in Vertex AI Model Registry](./assets/deploy-bert-on-vertex-ai/vertex-ai-model-version.png)

## Deploy model in Vertex AI

Once the model has been registered in Vertex AI, we can define the endpoint we want to deploy the model to, and then link the model deployment to that endpoint resource.

To do so, we'll start by calling `aiplatform.Endpoint.create` to create a new Vertex AI endpoint resource (which only holds the configuration; it is not linked to a model or anything usable yet).

In [None]:
endpoint = aiplatform.Endpoint.create(display_name="distilbert--distilbert-base-uncased-finetuned-sst-2-english-endpoint")

![Vertex AI Endpoint created](./assets/deploy-bert-on-vertex-ai/vertex-ai-endpoint.png)

Then we can proceed to deploy the model to the endpoint via the `deploy` method of the previously registered `model`. The `deploy` method links the previously created endpoint resource with the model, which contains the serving container configuration, and then deploys that model in Vertex AI on the specified instance(s).

Before going into the code, let's review the arguments:

- `endpoint` is the endpoint to deploy the model to. It is optional, and by default a new endpoint named after the model's display name plus `_endpoint` is used; in this case we're using a previously created endpoint.
- `machine_type`, `accelerator_type` and `accelerator_count` define the instance to use, the accelerator to attach, and the number of accelerators, respectively. The `machine_type` and the `accelerator_type` are tied together: when using an accelerator, we need to select an instance that supports it. To read more about the different instances check https://cloud.google.com/compute/docs/gpus, and to read about the `accelerator_type` naming check https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec.
- `sync` is an optional argument that controls whether to deploy the model and wait until the deployment is done (sync), or to just trigger the deployment and continue the code execution (async). In this case we set it to `True` (the default value), as the follow-up cells cannot run successfully until the model is deployed.
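
The pairing between instance and accelerator can be sketched as a small helper that returns the instance-related arguments for `deploy`. The GPU values are the ones used in this example; the helper name `deploy_kwargs` and the CPU machine type `n1-standard-4` are assumptions for illustration:

```python
def deploy_kwargs(use_gpu):
    """Return the instance-related keyword arguments for `model.deploy`."""
    if use_gpu:
        # GPU configuration used in this example: a G2 instance with one L4.
        return {
            "machine_type": "g2-standard-12",
            "accelerator_type": "NVIDIA_L4",
            "accelerator_count": 1,
        }
    # CPU-only: the accelerator arguments are left out entirely.
    # "n1-standard-4" is an assumed machine type, not taken from this example.
    return {"machine_type": "n1-standard-4"}


print(deploy_kwargs(use_gpu=True))
print(deploy_kwargs(use_gpu=False))
```

With a registered `model` and a created `endpoint`, this could be used as `model.deploy(endpoint=endpoint, sync=True, **deploy_kwargs(use_gpu=True))`. Note that the image in `CONTAINER_URI` is a CUDA build, so a CPU deployment would likely call for a CPU build of the DLC, which can be looked up with the `gcloud container images list` command shown earlier.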

For more information on the supported `aiplatform.Model.deploy` arguments, check the `deploy` reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_deploy.

In [None]:
deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-12",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    sync=True,
)

**WARNING**: _The Vertex AI endpoint deployment via the `deploy` method may take from 15 to 25 minutes._

![Vertex AI Endpoint Ready](./assets/deploy-bert-on-vertex-ai/vertex-ai-endpoint-ready.png)

![Vertex AI Model Ready](./assets/deploy-bert-on-vertex-ai/vertex-ai-model-ready.png)

## Run online inference in Vertex AI

Finally, we can run online predictions on Vertex AI using the `predict` method, which sends the requests to the running endpoint through the predict route exposed by the container.
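
For reference, the SDK call corresponds to the Vertex AI REST `predict` method on the endpoint. A minimal sketch of how the equivalent request URL and JSON body could be assembled is shown here (nothing is sent); the helper name `build_predict_request`, the region, and the endpoint ID are hypothetical placeholders:

```python
import json


def build_predict_request(project_id, location, endpoint_id, instances, parameters=None):
    """Return the URL and JSON body of a Vertex AI online prediction request."""
    url = (
        f"https://{location}-aiplatform.googleapis.com/v1/"
        f"projects/{project_id}/locations/{location}/endpoints/{endpoint_id}:predict"
    )
    body = {"instances": instances}
    if parameters is not None:
        body["parameters"] = parameters
    return url, json.dumps(body)


url, body = build_predict_request(
    project_id="your-project-id",
    location="us-central1",    # assumed region
    endpoint_id="1234567890",  # hypothetical endpoint ID
    instances=["I love this product", "I hate this product"],
    parameters={"top_k": 2},
)
print(url)
print(body)
```

An actual request would also need an OAuth 2.0 access token in the `Authorization` header, for instance one obtained with `gcloud auth print-access-token`. In this example the SDK handles that for us: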

In [None]:
output = deployed_model.predict(instances=["I love this product", "I hate this product"], parameters={"top_k": 2})
output.predictions

This produces the following output for each of the instances provided: `POSITIVE` is the label for the first sentence and `NEGATIVE` for the second, as those have the highest scores within each output instance, respectively:

```
[[{'score': 0.9998788833618164, 'label': 'POSITIVE'},
  {'score': 0.0001210561968036927, 'label': 'NEGATIVE'}],
 [{'score': 0.9997544884681702, 'label': 'NEGATIVE'},
  {'score': 0.0002454846107866615, 'label': 'POSITIVE'}]]
```
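
To turn this structure into one label per input, we can pick the highest-scoring entry of each inner list. A minimal sketch, using the output above as sample data; the helper name `top_labels` is hypothetical:

```python
# Sample data copied from the output shown above.
predictions = [
    [
        {"score": 0.9998788833618164, "label": "POSITIVE"},
        {"score": 0.0001210561968036927, "label": "NEGATIVE"},
    ],
    [
        {"score": 0.9997544884681702, "label": "NEGATIVE"},
        {"score": 0.0002454846107866615, "label": "POSITIVE"},
    ],
]


def top_labels(predictions):
    """Return the highest-scoring label for each instance."""
    return [
        max(candidates, key=lambda candidate: candidate["score"])["label"]
        for candidates in predictions
    ]


print(top_labels(predictions))
# ['POSITIVE', 'NEGATIVE']
```

Using `max` over the scores avoids relying on the order in which the container returns the labels.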

## Resource clean-up

Finally, we can release the resources we have created as follows:

- `deployed_model.undeploy_all` to undeploy all the models from the endpoint (`deployed_model` is the endpoint returned by `model.deploy`).
- `deployed_model.delete` to gracefully delete the endpoint where the model was deployed, after the `undeploy_all`.
- `model.delete` to delete the model from the registry i.e. unregister it. Note that when using a Google Cloud Storage (GCS) artifact, this method deletes neither the bucket nor its contents; it only unregisters the model from Vertex AI.

In [None]:
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()
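
Since a deployed endpoint keeps its instance running, it can be worth making sure the clean-up also runs when an earlier step fails. A minimal sketch under that assumption; the helper name `run_then_clean_up` and the `work` callback are hypothetical, and only the three methods listed above are called on the resources:

```python
def run_then_clean_up(endpoint, model, work):
    """Call `work(endpoint)` and release the resources afterwards, even on error."""
    try:
        return work(endpoint)
    finally:
        # Same order as above: undeploy, delete the endpoint, unregister the model.
        endpoint.undeploy_all()
        endpoint.delete()
        model.delete()


# Usage with the objects created in this example (not executed here):
# predictions = run_then_clean_up(
#     deployed_model,
#     model,
#     lambda endpoint: endpoint.predict(instances=["I love this product"]).predictions,
# )
```

If one of the clean-up calls itself raises, the remaining ones are skipped, so the Vertex AI console is still worth checking afterwards.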