# Deploy Gemma 7B with TGI on Vertex AI 

TL;DR: Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models, developed by Google DeepMind and other teams across Google. Text Generation Inference (TGI) is a toolkit developed by Hugging Face for deploying and serving LLMs with high-performance text generation. Google Vertex AI is a Machine Learning (ML) platform that lets you train and deploy ML models and AI applications, and customize large language models (LLMs) for use in your AI-powered applications. This example shows how to deploy any supported text-generation model, in this case [`google/gemma-7b-it`](https://huggingface.co/google/gemma-7b-it), from the Hugging Face Hub on Vertex AI using the TGI Deep Learning Container (DLC) available in Google Cloud Platform (GCP).

![`google/gemma-7b-it` in the Hugging Face Hub](./assets/deploy-gemma-on-vertex-ai/model-in-hf-hub.png)

## Setup / Configuration

First, we need to install `gcloud` on our local machine so that we can authenticate to Google Cloud and configure the project we want to use, our preferred / default location, etc.

To install `gcloud`, follow the instructions at https://cloud.google.com/sdk/docs/install.

Then, we also need to install `google-cloud-aiplatform`, which is required to programmatically create the Vertex AI model, register it in the Vertex AI Model Registry, and then create the endpoint to deploy the model on Vertex AI. It can be installed as follows:

In [None]:
!pip install google-cloud-aiplatform --upgrade --quiet

Before proceeding, for convenience we will set the following environment variables:

In [None]:
%env PROJECT_ID=your-project-id
%env LOCATION=your-location
%env BUCKET_URI=gs://hf-tgi-vertex-ai
%env CONTAINER_URI=us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu122.2-1-1.ubuntu2204
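Later cells read these values with `os.getenv`, which returns `None` for an unset variable, and they also read `HF_TOKEN`, which the cell above does not set. A minimal sketch of an early check follows; the names `REQUIRED_VARS` and `missing_env_vars` are hypothetical helpers introduced here for illustration, not part of any SDK.

```python
import os

# Variables that the rest of this notebook reads with os.getenv.
# HF_TOKEN is used by the model upload and tokenizer cells.
REQUIRED_VARS = ("PROJECT_ID", "LOCATION", "BUCKET_URI", "CONTAINER_URI", "HF_TOKEN")


def missing_env_vars(names=REQUIRED_VARS, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Report anything that still needs to be set before continuing.
missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

If the check reports `HF_TOKEN`, it can be set the same way as the other variables, for example with another `%env` line.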

Then we need to log in to our GCP account and set the project ID to the one we want to use for Vertex AI.

In [None]:
!gcloud auth login
!gcloud config set project $PROJECT_ID

Once we are logged in, we need to ensure that the necessary APIs are enabled in GCP, such as the Vertex AI, Compute Engine and Container Registry related APIs.

In [None]:
!gcloud services enable aiplatform.googleapis.com
!gcloud services enable compute.googleapis.com
!gcloud services enable container.googleapis.com
!gcloud services enable containerregistry.googleapis.com
!gcloud services enable containerfilesystem.googleapis.com

## Optional: Create bucket in GCS

Since we will initialize the Vertex AI SDK with a staging bucket, we need a GCS bucket where any artifacts, logs, etc. generated along the way can be written and persisted, so that we can use them later on.

We can use an existing bucket, so if you already have a bucket available in GCS, feel free to skip this step and jump to the next one.

Otherwise, we can use `gcloud storage buckets create` to create a new bucket in GCS in the specified project and location.

In [None]:
!gcloud storage buckets create $BUCKET_URI --project $PROJECT_ID --location=$LOCATION --default-storage-class=STANDARD --uniform-bucket-level-access

## Register model in Vertex AI

Once we have logged in to our GCP account, enabled the required services, and installed the `google-cloud-aiplatform` Python SDK, we can initialize it using our previously defined `PROJECT_ID`, `LOCATION` and `BUCKET_URI`.

In [None]:
import os
from google.cloud import aiplatform

PROJECT_ID = os.getenv("PROJECT_ID")
LOCATION = os.getenv("LOCATION")
BUCKET_URI = os.getenv("BUCKET_URI")

aiplatform.init(
    project=PROJECT_ID,
    location=LOCATION,
    staging_bucket=BUCKET_URI,
)

Then we can proceed to the model "upload", which basically consists of registering the model in Vertex AI with an empty bucket linked to it, since the model will be automatically downloaded within the Hugging Face TGI DLC, as the `MODEL_ID` environment variable is provided.

Before going into the code, let's review the arguments:

- `display_name` is the name that will be shown in Vertex AI Model Registry.

- `serving_container_image_uri` is the location of the Hugging Face TGI DLC that we will be using for serving the model later on. In order to see which TGI containers are available in GCP, you can run the following command:

    `gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" | grep "huggingface-text-generation-inference"`

- `serving_container_environment_variables` are the environment variables that will be used during the container runtime, so they are aligned with the environment variables defined by TGI, which in this case natively supports the `AIP_` Vertex AI environment variables (to read more about the environment variables exposed by Vertex AI, please check https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements).
    - `MODEL_ID` is the identifier of the model in the Hugging Face Hub. To see all the models supported by TGI, please check https://huggingface.co/models?other=text-generation-inference&sort=trending.
    - `NUM_SHARD` is the number of shards to use, which is useful if you don't want to use all the GPUs on a given machine, e.g. if you have two GPUs but only want to use one for TGI, then `NUM_SHARD=1`.
    - `MAX_INPUT_TOKENS` is the maximum allowed input length (expressed in number of tokens); the larger it is, the larger the prompt can be, but the more memory will be consumed.
    - `MAX_TOTAL_TOKENS` is the most important value to set, as it defines the "memory budget" of each client request (input tokens plus generated tokens); the larger this value, the more memory each request takes up and the less effective batching can be.
    - `MAX_BATCH_PREFILL_TOKENS` limits the number of tokens for the prefill operation; since prefill takes the most memory and is compute bound, it is useful to limit the number of requests that can be sent.
    - `HUGGING_FACE_HUB_TOKEN` is the Hugging Face Hub token. Since we want to serve a gated model, `google/gemma-7b-it` in this case, we need to set it in advance so that the TGI container can access the model. To generate a custom token for the Hugging Face Hub, you can follow the instructions at https://huggingface.co/docs/hub/en/security-tokens.
 
    To read more about all the arguments supported by TGI, please visit https://huggingface.co/docs/text-generation-inference/main/en/basic_tutorials/launcher.

- `serving_container_ports` is optional, since we're setting it to the Vertex AI default port, 8080, but setting it explicitly is recommended for better visibility of the open ports later on.
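The three token limits described above depend on each other, so it can help to sanity-check them before registering the model. The sketch below is an assumption derived from those descriptions (a prompt must leave room for generated tokens, and one maximum-length prompt must fit in the prefill budget); it is not a copy of TGI's own validation, and `check_token_budget` is a hypothetical helper name.

```python
def check_token_budget(env):
    """Return a list of problems found in the token-limit settings (empty if none)."""
    max_input = int(env["MAX_INPUT_TOKENS"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    max_prefill = int(env["MAX_BATCH_PREFILL_TOKENS"])

    problems = []
    # The total budget covers prompt plus generated tokens, so the prompt
    # limit has to leave room for at least one generated token.
    if max_input >= max_total:
        problems.append("MAX_INPUT_TOKENS must be smaller than MAX_TOTAL_TOKENS")
    # A single prompt of maximum length must fit in one prefill batch.
    if max_prefill < max_input:
        problems.append("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_TOKENS")
    return problems


# The values used in this example.
settings = {
    "MAX_INPUT_TOKENS": "512",
    "MAX_TOTAL_TOKENS": "1024",
    "MAX_BATCH_PREFILL_TOKENS": "1512",
}
print(check_token_budget(settings))  # prints []

# Tokens left for generation when a prompt uses the full input limit.
print(int(settings["MAX_TOTAL_TOKENS"]) - int(settings["MAX_INPUT_TOKENS"]))  # prints 512
```

With these values, a prompt that uses all 512 input tokens still leaves 512 tokens for generation.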

For more information on the supported `aiplatform.Model.upload` arguments, check the `upload` reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_upload.

In [None]:
model = aiplatform.Model.upload(
    display_name="google--gemma-7b-it",
    serving_container_image_uri=os.getenv("CONTAINER_URI"),
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b-it",
        "NUM_SHARD": "1",
        "MAX_INPUT_TOKENS": "512",
        "MAX_TOTAL_TOKENS": "1024",
        "MAX_BATCH_PREFILL_TOKENS": "1512",
        "HUGGING_FACE_HUB_TOKEN": os.getenv("HF_TOKEN"),
    },
    serving_container_ports=[8080],
)
model.wait()

![Model in Vertex AI Model Registry](./assets/deploy-gemma-on-vertex-ai/vertex-ai-model.png)

## Deploy model in Vertex AI

Once the model has been registered in Vertex AI, we can define the endpoint we want to deploy the model to, and then link the model deployment to that endpoint resource.

To do so, we'll start by calling `aiplatform.Endpoint.create` to create a new Vertex AI endpoint resource (which comes only with the configuration; it's not linked to a model or anything usable yet).

In [None]:
endpoint = aiplatform.Endpoint.create(display_name="google--gemma-7b-it-endpoint")

![Vertex AI Endpoint created](./assets/deploy-gemma-on-vertex-ai/vertex-ai-endpoint.png)

Then we can proceed to deploy the model to an endpoint via the `deploy` method of the previously registered `model`. The `deploy` method links the previously created endpoint resource with the model that contains the configuration of the serving container, TGI in this case, and then deploys that model on Vertex AI in the specified instance/s.

Before going into the code, let's review the arguments:

- `endpoint` is the endpoint to deploy the model to. It is optional and by default will be set to the model's display name plus `_endpoint`, but in this case we're using a previously created endpoint.
- `machine_type`, `accelerator_type` and `accelerator_count` define which instance to use and, if desired, the accelerator to use (GPU or TPU) and the number of accelerators, respectively. The `machine_type` and the `accelerator_type` are tied together, since an instance with an accelerator must be one that supports that accelerator. To read more about the different instances check https://cloud.google.com/compute/docs/gpus, and to read about the `accelerator_type` naming check https://cloud.google.com/vertex-ai/docs/reference/rest/v1/MachineSpec.
- `sync` is an optional argument that controls whether to deploy the model and wait until it's done (sync), or just trigger the deployment and continue the code execution (async). In this case we leave it as True (the default value), as we won't be able to successfully run the follow-up cells until the endpoint is deployed.
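When choosing the accelerator, a rough estimate of the memory needed for the model weights can help. The sketch below is a back-of-the-envelope calculation under stated assumptions, none of which come from this example itself: roughly 8.5 billion parameters for `google/gemma-7b-it`, 16-bit weights (2 bytes per parameter), and 24 GB of memory on a single NVIDIA L4. The helper name `weights_gb` is hypothetical.

```python
def weights_gb(num_params, bytes_per_param=2):
    """Approximate memory taken by the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


GEMMA_7B_PARAMS = 8.5e9  # assumption: approximate parameter count
L4_MEMORY_GB = 24        # assumption: memory of one NVIDIA L4

needed = weights_gb(GEMMA_7B_PARAMS)
print(f"weights: ~{needed:.1f} GB")
print(f"left for KV cache and activations: ~{L4_MEMORY_GB - needed:.1f} GB")
```

Under these assumptions the weights fit on one L4 with some headroom, which is consistent with the modest token limits set when registering the model. Actual usage depends on the TGI version and settings, so the endpoint logs remain the source of truth.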

For more information on the supported `aiplatform.Model.deploy` arguments, check the `deploy` reference at https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform.Model#google_cloud_aiplatform_Model_deploy.

In [None]:
deployed_model = model.deploy(
    endpoint=endpoint,
    machine_type="g2-standard-4",
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
)

**WARNING**: _The Vertex AI endpoint deployment via the `deploy` method may take from 15 to 25 minutes._

![Vertex AI Endpoint running the model](./assets/deploy-gemma-on-vertex-ai/vertex-ai-endpoint-run.png)

![Vertex AI Endpoint logs in Cloud Logging](./assets/deploy-gemma-on-vertex-ai/vertex-ai-endpoint-logs.png)

## Run online inference in Vertex AI

Finally, we can run online predictions on Vertex AI using the `predict` method, which sends the requests to the running endpoint on the predict route specified within the container. Ideally, we should first format the input query or conversation with the tokenizer that matches the model we're serving, `google/gemma-7b-it`; for that we need to install `transformers` via `pip` as follows:

In [None]:
!pip install transformers --quiet

Once installed, the following snippet will apply the chat template to the input conversation:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it", token=os.getenv("HF_TOKEN"))
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's Deep Learning?"}],
    tokenize=False,
    add_generation_prompt=True,
)
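To make the result of `apply_chat_template` less opaque, here is a hand-written approximation of the formatting it applies for Gemma, based only on the formatted string shown in the `predict` cell of this example. The function `format_gemma_prompt` is a hypothetical helper for illustration; the template shipped with the tokenizer remains the source of truth and may handle cases this sketch does not.

```python
def format_gemma_prompt(messages, add_generation_prompt=True, bos_token="<bos>"):
    """Approximate the Gemma chat format for a list of {"role", "content"} messages."""
    prompt = bos_token
    for message in messages:
        # Gemma names the assistant turn "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        prompt += f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n"
    if add_generation_prompt:
        # Open a new model turn so that generation continues from here.
        prompt += "<start_of_turn>model\n"
    return prompt


formatted = format_gemma_prompt([{"role": "user", "content": "What's Deep Learning?"}])
print(repr(formatted))
```

For the single-turn conversation used here, this yields the same string as the one noted in the comment of the `predict` cell.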

We then send the formatted conversation as a single string to the TGI API via the `predict` method of the Vertex AI deployed model:

In [None]:
output = deployed_model.predict(
    instances=[
        {
            "inputs": inputs,  # inputs = <bos><start_of_turn>user\nWhat's Deep Learning?<end_of_turn>\n<start_of_turn>model\n
            "parameters": {
                "max_new_tokens": 256, "do_sample": True,
                "top_p": 0.95, "temperature": 1.0,
            },
        },
    ]
)

Which produces the following output:

```
Prediction(predictions=['\n\nDeep learning is a type of machine learning that uses artificial neural networks to learn from large amounts of data, making it a powerful tool for various tasks, including image recognition, natural language processing, and speech recognition.\n\n**Key Concepts:**\n\n* **Artificial Neural Networks (ANNs):** Structures that mimic the interconnected neurons in the brain.\n* **Deep Learning Architectures:** Multi-layered ANNs that learn hierarchical features from data.\n* **Transfer Learning:** Reusing learned features from one task to improve performance on another.\n\n**Types of Deep Learning:**\n\n* **Supervised Learning:** Models are trained on labeled data, where inputs are paired with corresponding outputs.\n* **Unsupervised Learning:** Models learn patterns from unlabeled data, such as clustering or dimensionality reduction.\n* **Reinforcement Learning:** Models learn through trial-and-error by interacting with an environment to optimize a task.\n\n**Benefits:**\n\n* **High Accuracy:** Deep learning models can achieve high accuracy on complex tasks.\n* **Adaptability:** Deep learning models can adapt to new data and tasks.\n* **Scalability:** Deep learning models can handle large amounts of data.\n\n**Applications:**\n\n* Image recognition\n* Natural language processing (NLP)\n'], deployed_model_id='***', metadata=None, model_version_id='1', model_resource_name='projects/***/locations/us-central1/models/***', explanations=None)
```
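The instance passed to `predict` is plain Python data, so nothing on the client side checks the spelling of the parameter names. One possible guard, sketched below, is a small helper with keyword-only arguments, so that a misspelled name fails immediately with a `TypeError`. The helper `build_tgi_instance` is hypothetical and only covers the four generation parameters used in this example.

```python
def build_tgi_instance(prompt, *, max_new_tokens=256, do_sample=True, top_p=0.95, temperature=1.0):
    """Build one request instance in the shape used by the predict cell above."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "do_sample": do_sample,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


instance = build_tgi_instance("What's Deep Learning?", max_new_tokens=128)
print(instance["parameters"])

# A misspelled keyword is rejected before any request is sent.
try:
    build_tgi_instance("What's Deep Learning?", max_new_token=128)
except TypeError as error:
    print(f"rejected: {error}")
```

The resulting dictionary can be passed as an element of the `instances` list, with the formatted conversation as `prompt`.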

![Vertex AI Endpoint logs in Cloud Logging after predict](./assets/deploy-gemma-on-vertex-ai/vertex-ai-endpoint-logs-predict.png)

Alternatively, we can also use the Online Prediction UI within the Vertex AI endpoint from the "Test your model" preview feature, as follows:

![Vertex AI Endpoint online inference](./assets/deploy-gemma-on-vertex-ai/vertex-ai-online-prediction.png)

## Resource clean-up

Finally, we can release the resources we have created as follows:

- `deployed_model.undeploy_all` to undeploy all the models from the endpoint (`deployed_model` is the endpoint returned by `deploy`).
- `deployed_model.delete` to gracefully delete the endpoint where the model was deployed, after the `undeploy_all`.
- `model.delete` to delete the model from the registry, i.e. unregister it. Note that when using a Google Cloud Storage (GCS) artifact, this method deletes neither the bucket nor its contents; it only unregisters the model from Vertex AI.

In [None]:
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

Optionally, we can also remove the GCS Bucket we created before, with the following `gcloud` command:

In [None]:
!gcloud storage rm -r $BUCKET_URI