# Deploy Gemma 7B on Google Vertex AI 
<table align="left">
  <td>
    <a href="https://github.com/huggingface/Google-Cloud-Containers/blob/main/examples/vertex-ai/notebooks/deploy-gemma-on-vertex-ai.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
</table>
<br/><br/><br/>

[Gemma-7b](https://huggingface.co/google/gemma-7b) is a state-of-the-art open model from Google, built from the same research and technology used to create the Gemini models. It is a text-to-text, decoder-only large language model, available in English, with open weights, and is well suited to a variety of text generation tasks, including question answering, summarization, and reasoning. Learn more about it in [Welcome Gemma - Google’s new open LLM](https://huggingface.co/blog/gemma).

In this tutorial you will learn how to deploy [google/gemma-7b](https://huggingface.co/google/gemma-7b) on Vertex AI Endpoints. We are going to use the Hugging Face [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) container, which is a scalable, optimized solution for deploying and serving Large Language Models (LLMs). You can now find all Hugging Face containers on [Google Cloud](https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face).

What you'll learn in this blog:

1. [Setup development environment](#1-setup-development-environment)
2. [Configure gcloud CLI](#2-configure-gcloud-cli)
3. [Initialize Vertex AI SDK](#3-initialize-vertex-ai-sdk)
4. [Deploy Gemma-7B on Vertex AI](#4-deploy-gemma-7b-on-vertex-ai)
5. [Run Inference with deployed Model](#5-run-inference-with-deployed-model)
6. [Cleaning Up Resources](#6-cleaning-up-resources)

## 1. Setup development environment

We are going to use the `Vertex AI` Python SDK to deploy [google/gemma-7b](https://huggingface.co/google/gemma-7b) on Vertex AI. You need a GCP project and an account with the necessary permissions to create resources in that project.

Before we can install the packages we need to install the `gcloud CLI`. Instructions can be found here: https://cloud.google.com/sdk/docs/install

In [1]:
# Install the required packages for the notebook
! pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage "google-auth>=2.23.3"
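
After the install, it can help to confirm which versions actually ended up in the environment. The snippet below is a minimal sketch that uses only the standard library; the helper name `installed_versions` is ours and not part of any Google SDK.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The three distributions installed by the pip command in this section.
required = ["google-cloud-aiplatform", "google-cloud-storage", "google-auth"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If a package shows up as `NOT INSTALLED`, restart the notebook kernel and run the install cell again before continuing.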

## 2. Configure gcloud CLI

We need to authenticate with the Google Cloud SDK to use the Vertex AI services. Run the following commands to authenticate, both for the `gcloud` CLI itself and for the application default credentials that the Python SDK picks up.

```bash
gcloud auth login 
gcloud auth application-default login
```

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)


In [None]:
PROJECT_ID = "gcp-project-id"  # @param {type:"string"}
REGION = "us-central1"  # @param {type: "string"}
BUCKET_URI = f"gs://vertexai-{PROJECT_ID}-tgi"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID} --quiet
# Set the region
! gcloud config set ai/region {REGION} --quiet
# create the bucket if it doesn't exist
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}
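
A typo in the project ID only surfaces once `gcloud` or `gsutil` rejects it, so a small local check can save a round trip. The sketch below is an illustration, not an official validator: the helper `staging_bucket_uri` is a hypothetical name, and the regular expression encodes the project ID rules as we understand them (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen).

```python
import re

# Assumed project ID format: 6-30 chars, starts with a letter, no trailing hyphen.
_PROJECT_ID_RE = re.compile(r"^[a-z][a-z0-9-]{4,28}[a-z0-9]$")


def staging_bucket_uri(project_id: str, suffix: str = "tgi") -> str:
    """Build the staging bucket URI used in this tutorial after basic checks."""
    if not _PROJECT_ID_RE.fullmatch(project_id):
        raise ValueError(f"{project_id!r} does not look like a valid project ID")
    bucket = f"vertexai-{project_id}-{suffix}"
    # Bucket names without dots are limited to 63 characters.
    if len(bucket) > 63:
        raise ValueError(f"bucket name {bucket!r} is longer than 63 characters")
    return f"gs://{bucket}"


print(staging_bucket_uri("my-demo-project"))
```

Note that bucket names are global across Google Cloud, so even a well-formed name can still be rejected if someone else already owns it.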

## 3. Initialize Vertex AI SDK 

The following set of constants will be used to create names and display names of Vertex AI Prediction resources like models, endpoints, and model deployments.

In [3]:
# set model names and version
MODEL_NAME = "Gemma-7b" # @param {type:"string"}
MODEL_VERSION = "v01" # @param {type: "string"}
MODEL_DISPLAY_NAME = f"TGI-{MODEL_NAME}-{MODEL_VERSION}" # @param {type:"string"}
ENDPOINT_DISPLAY_NAME = f"endpoint-{MODEL_NAME}-{MODEL_VERSION}" # @param {type:"string"}

# Set the TGI serving container image uri, selected from https://cloud.google.com/deep-learning-containers/docs/choosing-container#hugging-face
SERVING_CONTAINER_IMAGE_URI = "us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-text-generation-inference-cu121.1-4.ubuntu2204.py310"

In [4]:
from google.cloud import aiplatform

# Initialize the Vertex AI SDK
aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## 4. Deploy Gemma-7B on Vertex AI

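To deploy [Gemma-7B](https://huggingface.co/google/gemma-7b) on Vertex AI, we first need to upload the model to the [Vertex AI Model Registry](https://cloud.google.com/vertex-ai/docs/model-registry/introduction). The upload registers the TGI container together with its environment variables. The model weights themselves are downloaded from the Hugging Face Hub when the container starts, which is why a Hub token with access to Gemma is required.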

In [8]:
model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
    serving_container_environment_variables={
        "MODEL_ID": "google/gemma-7b", # Hugging Face model ID
        "NUM_SHARD": "1", 
        "MAX_INPUT_LENGTH": "512", 
        "MAX_TOTAL_TOKENS": "1024", 
        "MAX_BATCH_PREFILL_TOKENS": "1512",
        "HUGGING_FACE_HUB_TOKEN": "TOKEN WITH ACCESS TO Gemma", # Replace with your Hugging Face Hub token
        },
    serving_container_ports=[80],
)


model.wait()

print(model.display_name)
print(model.resource_name)

Creating Model
Create Model backing LRO: projects/755607090520/locations/us-central1/models/8539037099238621184/operations/3099102200106844160
Model created. Resource name: projects/755607090520/locations/us-central1/models/8539037099238621184@1
To use this Model in another session:
model = aiplatform.Model('projects/755607090520/locations/us-central1/models/8539037099238621184@1')
TGI-Gemma-7b-v01
projects/755607090520/locations/us-central1/models/8539037099238621184
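
The environment variables passed to the container must all be strings, and the token limits have to be consistent with each other. The sketch below builds the same dictionary with two sanity checks that reflect our understanding of what TGI expects (`MAX_INPUT_LENGTH` below `MAX_TOTAL_TOKENS`, and `MAX_BATCH_PREFILL_TOKENS` at least `MAX_INPUT_LENGTH`). The helper `build_tgi_env` and the `HF_TOKEN` shell variable are assumptions made for illustration. Reading the token from the environment keeps it out of the notebook source.

```python
import os
from typing import Dict, Optional


def build_tgi_env(
    model_id: str,
    max_input_length: int,
    max_total_tokens: int,
    max_batch_prefill_tokens: int,
    num_shard: int = 1,
    hf_token: Optional[str] = None,
) -> Dict[str, str]:
    """Return TGI environment variables as strings after basic sanity checks."""
    if max_input_length >= max_total_tokens:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    if max_batch_prefill_tokens < max_input_length:
        raise ValueError("MAX_BATCH_PREFILL_TOKENS must be at least MAX_INPUT_LENGTH")

    env = {
        "MODEL_ID": model_id,
        "NUM_SHARD": str(num_shard),
        "MAX_INPUT_LENGTH": str(max_input_length),
        "MAX_TOTAL_TOKENS": str(max_total_tokens),
        "MAX_BATCH_PREFILL_TOKENS": str(max_batch_prefill_tokens),
    }
    # Assumption: the token is exported as HF_TOKEN in the local shell.
    token = hf_token or os.environ.get("HF_TOKEN", "")
    if token:
        env["HUGGING_FACE_HUB_TOKEN"] = token
    return env


env = build_tgi_env("google/gemma-7b", 512, 1024, 1512)
# Print everything except the token so it never lands in the cell output.
print({key: value for key, value in env.items() if key != "HUGGING_FACE_HUB_TOKEN"})
```

The resulting dictionary can be passed as `serving_container_environment_variables` to `aiplatform.Model.upload`.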


Once the model is uploaded, we can deploy it on Vertex AI Endpoints. 
First, we need to create an endpoint and then deploy the model to it. At this step you also choose the hardware configuration for the deployment. 

The deployment will take ~20-25 minutes. You can check the status of the deployment in the Google Cloud console.

In [9]:
machine_type = 'g2-standard-4' # L4 GPUs
endpoint = aiplatform.Endpoint.create(display_name=ENDPOINT_DISPLAY_NAME)

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_NAME, # The display name of the deployed model
    machine_type=machine_type,              # type of machine, read more here: https://cloud.google.com/vertex-ai/docs/predictions/configure-compute
    accelerator_type="NVIDIA_L4",           # Hardware accelerator 
    accelerator_count=1,                    # Number of accelerators to attach to a worker replica.
    traffic_percentage=100,                 # Percentage of traffic to send to this model
    min_replica_count=1,                    # The minimum number of machine replicas this deployed model will be always deployed on.
    sync=True,                              # Whether to execute this method synchronously.
)

Creating Endpoint
Create Endpoint backing LRO: projects/755607090520/locations/us-central1/endpoints/8199922424465063936/operations/3741428096960561152
Endpoint created. Resource name: projects/755607090520/locations/us-central1/endpoints/8199922424465063936
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/755607090520/locations/us-central1/endpoints/8199922424465063936')
Deploying model to Endpoint : projects/755607090520/locations/us-central1/endpoints/8199922424465063936
Deploy Endpoint model backing LRO: projects/755607090520/locations/us-central1/endpoints/8199922424465063936/operations/8404905511102709760
Endpoint model deployed. Resource name: projects/755607090520/locations/us-central1/endpoints/8199922424465063936
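
To see why a single L4 is a plausible choice here, a back-of-the-envelope estimate of the weight memory is useful. The numbers below are assumptions rather than measurements: roughly 8.5 billion parameters for Gemma-7B (check the model card for the exact figure), 2 bytes per parameter for 16-bit weights, and 24 GB of memory on one L4.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate memory needed for the model weights, in GB (10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


ASSUMED_PARAMETERS = 8.5e9  # assumption: approximate parameter count of Gemma-7B
L4_MEMORY_GB = 24           # nominal memory of a single NVIDIA L4

weights_gb = weight_memory_gb(ASSUMED_PARAMETERS)
headroom_gb = L4_MEMORY_GB - weights_gb
print(f"weights: ~{weights_gb:.1f} GB, headroom: ~{headroom_gb:.1f} GB")
```

Whatever is left after loading the weights has to hold the KV cache and activations, which is one reason the token limits in the container configuration are kept small.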


## 5. Run Inference with deployed Model
Awesome! We have successfully deployed the Gemma-7B model on Vertex AI. Now let's run inference on our endpoint. We will use the `predict` method of the `Endpoint` class; the `deployed_model` object returned by `model.deploy` is that endpoint. Generation can be controlled with different parameters, which are passed in the `parameters` attribute of the payload. You can find the supported parameters in the [TGI documentation](https://huggingface.github.io/text-generation-inference/) under the `GenerateParameters` section.


In [10]:
# Prompt for generation
prompt = "Deep Learning is"

res = deployed_model.predict(instances=[
  {"inputs": prompt,
   "parameters": {"max_new_tokens": 256, "do_sample": True, "top_p": 0.95, "temperature": 1.0}} # Generation arguments
  ]
)
print(prompt + res.predictions[0])


Deep Learning is an advanced machine learning technique used in automated, semi-automated, and manual testing. It can achieve higher accuracy through training rather than developing algorithms; data representations based on neural networks are used.

Deep Neural Networks are often used to perform automatic feature learning, nonlinear mapping of input into output, e.g., classification and regression. Data representations based on neural networks are used for feature learning and nonlinear mapping of inputs and outputs. As deep learning gets more accurate, it can detect false negatives more effectively and operates based on the minimal test. In the article, we will discuss <b>deep learning for test automation</b>.

When testing most of the time, test automation is not accurate. Because of which, a lot of deep learning is being implemented. Deep learning is an efficient way to reduce false negatives by performing quantitative labels for testing at a large scale, beyond the possibility of 
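
Because the payload is a plain dictionary, a misspelled key such as `temparature` is easy to miss and may simply be ignored by the server. The sketch below builds one instance and rejects unknown keys before the request is sent. The helper `build_instance` is hypothetical, and `KNOWN_PARAMETERS` is a partial list of `GenerateParameters` fields that we are confident about, not the full set.

```python
# Partial list of TGI generation parameters; extend it from the TGI docs as needed.
KNOWN_PARAMETERS = {
    "best_of",
    "details",
    "do_sample",
    "max_new_tokens",
    "repetition_penalty",
    "return_full_text",
    "seed",
    "stop",
    "temperature",
    "top_k",
    "top_p",
    "truncate",
    "typical_p",
    "watermark",
}


def build_instance(prompt: str, **parameters) -> dict:
    """Build one prediction instance, failing fast on unknown parameter names."""
    unknown = sorted(set(parameters) - KNOWN_PARAMETERS)
    if unknown:
        raise ValueError(f"Unknown generation parameters: {unknown}")
    return {"inputs": prompt, "parameters": parameters}


instance = build_instance(
    "Deep Learning is", max_new_tokens=256, do_sample=True, top_p=0.95, temperature=1.0
)
print(instance)
```

The result can be passed to the endpoint as `deployed_model.predict(instances=[instance])`.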

## 6. Cleaning Up Resources

To avoid incurring charges to your Google Cloud account, delete the resources you created in this tutorial. The deployed model has to be undeployed before the endpoint can be deleted, and the model itself is removed from the Model Registry last.

In [11]:
deployed_model.undeploy_all()
deployed_model.delete()
model.delete()

Undeploying Endpoint model: projects/755607090520/locations/us-central1/endpoints/8199922424465063936
Undeploy Endpoint model backing LRO: projects/755607090520/locations/us-central1/endpoints/8199922424465063936/operations/1302728898739437568
Endpoint model undeployed. Resource name: projects/755607090520/locations/us-central1/endpoints/8199922424465063936
Deleting Endpoint : projects/755607090520/locations/us-central1/endpoints/8199922424465063936
Delete Endpoint  backing LRO: projects/755607090520/locations/us-central1/operations/994232324264558592
Endpoint deleted. . Resource name: projects/755607090520/locations/us-central1/endpoints/8199922424465063936
Deleting Model : projects/755607090520/locations/us-central1/models/8539037099238621184
Delete Model  backing LRO: projects/755607090520/locations/us-central1/models/8539037099238621184/operations/4205298858579722240
Model deleted. . Resource name: projects/755607090520/locations/us-central1/models/8539037099238621184
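
If one of the cleanup calls fails, the remaining resources keep running and keep costing money. The sketch below is one possible way to make the cleanup more robust: it attempts every step, collects the failures, and returns them. The `cleanup` helper is hypothetical; it only relies on the `undeploy_all` and `delete` methods already used in this section, so the stand-in objects here are placeholders for the real endpoint and model.

```python
def cleanup(endpoint, model):
    """Run every cleanup step and return a list of (step, exception) failures."""
    steps = [
        ("undeploy models", endpoint.undeploy_all),
        ("delete endpoint", endpoint.delete),
        ("delete model", model.delete),
    ]
    failures = []
    for description, action in steps:
        try:
            action()
        except Exception as exc:  # keep going so later resources are still removed
            failures.append((description, exc))
    return failures


class _FakeResource:
    """Stand-in for a Vertex AI resource so the sketch runs without cloud access."""

    def undeploy_all(self):
        print("undeploy_all called")

    def delete(self):
        print("delete called")


failures = cleanup(_FakeResource(), _FakeResource())
print("failures:", failures)
```

In the notebook you would call `cleanup(deployed_model, model)` and inspect the returned list. The staging bucket created earlier is not touched by these calls and has to be removed separately if you no longer need it.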
