# Deploy Golden Gate 7B on Vertex AI 

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/llm_streaming_prediction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
        <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official//prediction/llm_streaming_prediction.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
        </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official//prediction/llm_streaming_prediction.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview

This tutorial demonstrates how to deploy Golden Gate 7B to Vertex AI using Hugging Face Text Generation Inference (TGI).


_Note: Make sure you build the container with the `patch` for the Golden Gate models._

## Installations

Before installing the packages, make sure you have the gcloud CLI installed: https://cloud.google.com/sdk/docs/install

Install the packages required for executing this notebook.

In [10]:
! pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage "google-auth>=2.23.3"
! pip install transformers

### Colab only: Uncomment the following cell to restart the kernel

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Set up Vertex AI and SDK



Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, run:**

In [1]:
! gcloud auth login 
! gcloud auth application-default login


**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Set up the SDK with your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)
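If you prefer not to hard-code the project ID, it can also be read from the environment. The sketch below assumes the `GOOGLE_CLOUD_PROJECT` environment variable is set in your session; the helper name `resolve_project_id` is hypothetical and not part of any Google SDK.

```python
import os
from typing import Mapping, Optional


def resolve_project_id(explicit: Optional[str] = None,
                       env: Optional[Mapping[str, str]] = None) -> str:
    """Return an explicit project ID, else fall back to GOOGLE_CLOUD_PROJECT."""
    env = os.environ if env is None else env
    # Treat the notebook placeholder as "not set".
    if explicit and explicit != "[your-project-id]":
        return explicit
    project = env.get("GOOGLE_CLOUD_PROJECT", "")
    if not project:
        raise ValueError(
            "Set PROJECT_ID or the GOOGLE_CLOUD_PROJECT environment variable."
        )
    return project
```

For example, `PROJECT_ID = resolve_project_id("[your-project-id]")` would fall back to the environment variable when the placeholder has not been edited.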

In [1]:
# PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
PROJECT_ID = "huggingface-ml"  # @param {type:"string"}
REGION = "us-central1"  # @param {type: "string"}
BUCKET_URI = f"gs://vertexai-{PROJECT_ID}-tgi"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID} --quiet
# Set the region
! gcloud config set ai/region {REGION} --quiet
# Create the bucket (fails with a 409 error if it already exists, which is safe to ignore)
! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

Updated property [core/project].
Updated property [ai/region].
Creating gs://vertexai-huggingface-ml-tgi/...
ServiceException: 409 A Cloud Storage bucket named 'vertexai-huggingface-ml-tgi' already exists. Try another name. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.


The following set of constants will be used to create names and display names of Vertex AI Prediction resources like models, endpoints, and model deployments.

In [2]:
# set model names and version
MODEL_NAME = "Golden-Gate-7b" # @param {type:"string"}
MODEL_VERSION = "v01" # @param {type: "string"}
MODEL_DISPLAY_NAME = f"TGI-{MODEL_NAME}-{MODEL_VERSION}" # @param {type:"string"}
ENDPOINT_DISPLAY_NAME = f"endpoint-{MODEL_NAME}-{MODEL_VERSION}" # @param {type:"string"}

# You can get the latest TGI image URI from
# https://github.com/huggingface/text-generation-inference/pkgs/container/text-generation-inference
DOCKER_ARTIFACT_REPO = "custom-tgi-example" # @param {type:"string"}
BASE_TGI_IMAGE = "ghcr.io/huggingface/text-generation-inference:latest" # @param {type:"string"}
SERVING_CONTAINER_IMAGE_URI = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{DOCKER_ARTIFACT_REPO}/base-tgi-image:latest"
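The serving image URI above follows the Artifact Registry layout `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG`. As a minimal sketch, the same string can be built by a small helper that rejects empty parts; the function name `artifact_registry_image_uri` is an illustrative assumption, not an SDK call.

```python
def artifact_registry_image_uri(region: str, project_id: str, repository: str,
                                image: str, tag: str = "latest") -> str:
    """Compose a Docker image URI for an Artifact Registry repository."""
    parts = {
        "region": region,
        "project_id": project_id,
        "repository": repository,
        "image": image,
        "tag": tag,
    }
    # An empty component would produce a URI that docker cannot resolve.
    for name, value in parts.items():
        if not value:
            raise ValueError(f"{name} must not be empty")
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


uri = artifact_registry_image_uri(
    "us-central1", "huggingface-ml", "custom-tgi-example", "base-tgi-image"
)
print(uri)
```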

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [3]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=BUCKET_URI)

## 2. Push TGI image to Artifact Registry

In [None]:
# current image is created with the patch 

# push-to-gcr.sh script

# ! gcloud services enable artifactregistry.googleapis.com

# # create a new Docker repository with your region with the description
# ! gcloud artifacts repositories create {DOCKER_ARTIFACT_REPO} \
#     --repository-format=docker \
#     --location={REGION} \
#     --description="Custom TGI Example"

# # verify that your repository was created.
# ! gcloud artifacts repositories list \
#     --location={REGION} \
#     --filter="name~"{DOCKER_ARTIFACT_REPO}

# # configure docker to use your repository    
# ! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet
    
# # pull, tag and push
# ! docker pull {BASE_TGI_IMAGE}
# ! docker tag {BASE_TGI_IMAGE} {SERVING_CONTAINER_IMAGE_URI}
# ! docker push {SERVING_CONTAINER_IMAGE_URI}

## 3. Deploy model to Vertex AI

Create a new model resource that points at the TGI serving container.

In [13]:
model = aiplatform.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    # artifact_uri=f"{BUCKET_URI}/{MODEL_NAME}",
    serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
    serving_container_predict_route="/v1/endpoint",
    serving_container_health_route="/health",
    serving_container_environment_variables={
        "MODEL_ID": "gg-hf/golden-gate-7b",
        "NUM_SHARD": "1",
        "MAX_INPUT_LENGTH": "1512",
        "MAX_TOTAL_TOKENS": "4096",
        #"HUGGING_FACE_HUB_TOKEN": "TOKEN WITH ACCESS TO THE PRIVATE REPO",
        "HUGGING_FACE_HUB_TOKEN": "",
        },
    serving_container_ports=[80],
)


model.wait()

print(model.display_name)
print(model.resource_name)

Creating Model
Create Model backing LRO: projects/1049843053967/locations/us-central1/models/1731948517049499648/operations/4396258636477759488
Model created. Resource name: projects/1049843053967/locations/us-central1/models/1731948517049499648@1
To use this Model in another session:
model = aiplatform.Model('projects/1049843053967/locations/us-central1/models/1731948517049499648@1')
TGI-Golden-Gate-7b-v01
projects/1049843053967/locations/us-central1/models/1731948517049499648
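The container environment variables are passed as strings, so an inconsistent pair of token limits is only noticed once the container starts. The sketch below checks them up front, assuming (as is my understanding of TGI) that `MAX_INPUT_LENGTH` must be strictly smaller than `MAX_TOTAL_TOKENS`; the helper `check_tgi_token_limits` is hypothetical.

```python
from typing import Mapping


def check_tgi_token_limits(env: Mapping[str, str]) -> int:
    """Validate the token limits and return the tokens left for generation."""
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input <= 0:
        raise ValueError("MAX_INPUT_LENGTH must be positive")
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")
    # Tokens available for the completion when the prompt uses its full allowance.
    return max_total - max_input


headroom = check_tgi_token_limits(
    {"MAX_INPUT_LENGTH": "1512", "MAX_TOTAL_TOKENS": "4096"}
)
print(headroom)  # 2584
```

With the values used in this notebook, a full-length prompt leaves 2584 tokens for generation, which comfortably covers a `max_new_tokens` of 256.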


In [14]:
machine_type = 'g2-standard-4' # L4 GPUs
endpoint = aiplatform.Endpoint.create(display_name=ENDPOINT_DISPLAY_NAME)

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_NAME,
    machine_type=machine_type,
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    traffic_percentage=100,
    min_replica_count=1,
    sync=True,
)

Creating Endpoint
Create Endpoint backing LRO: projects/1049843053967/locations/us-central1/endpoints/8815701712577757184/operations/6215712885935439872
Endpoint created. Resource name: projects/1049843053967/locations/us-central1/endpoints/8815701712577757184
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/1049843053967/locations/us-central1/endpoints/8815701712577757184')
Deploying model to Endpoint : projects/1049843053967/locations/us-central1/endpoints/8815701712577757184
Deploy Endpoint model backing LRO: projects/1049843053967/locations/us-central1/endpoints/8815701712577757184/operations/216918182277939200
Endpoint model deployed. Resource name: projects/1049843053967/locations/us-central1/endpoints/8815701712577757184
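The log above prints the endpoint's full resource name, which is what you pass to `aiplatform.Endpoint(...)` in a later session. As a small sketch, the name can be split into its parts to confirm it points at the expected project and region before reconnecting; `parse_endpoint_resource_name` is a hypothetical helper, not part of the Vertex AI SDK.

```python
import re

# Shape of the endpoint resource names printed in the deployment log.
_ENDPOINT_RE = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/endpoints/(?P<endpoint_id>[^/]+)$"
)


def parse_endpoint_resource_name(name: str) -> dict:
    """Split an endpoint resource name into project, location and endpoint ID."""
    match = _ENDPOINT_RE.match(name)
    if match is None:
        raise ValueError(f"Not an endpoint resource name: {name!r}")
    return match.groupdict()


parts = parse_endpoint_resource_name(
    "projects/1049843053967/locations/us-central1/endpoints/8815701712577757184"
)
print(parts["location"])
```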


In [22]:
from transformers import AutoTokenizer

prompt = "Deep Learning is"

res = deployed_model.predict(instances=[
  {"inputs": prompt, 
   "parameters": {"max_new_tokens": 256, "do_sample": True, "top_p": 0.7, "temperature": 1.0}}
  ]
)
print(res.predictions[0])


 Google Cloud is a suite of cloud computing services offered by Google. It provides a range of infrastructure and platform solutions for businesses and developers, including storage, computing power, networking, and various software services. Google Cloud allows users to build, deploy, and scale applications, websites, and services on the same infrastructure that Google uses internally for its own offerings. This infrastructure is designed to be reliable, scalable, and secure, and offers flexible pricing models to fit different budgets and usage patterns. Services in Google Cloud include Google Compute Engine for virtual machines, Google Storage for data storage, Google App Engine for application hosting, Google Kubernetes Engine for container orchestration, BigQuery for analytics, and many others. Google Cloud also integrates with other Google services like Google Drive, Google Docs, and Google Maps for additional functionality. The goal of Google Cloud is to help businesses and devel
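A misspelled key inside `parameters` is easy to miss, because the request is plain JSON. The sketch below builds one instance and rejects parameter names outside a small allow-list. The allow-list only covers the parameters used in this notebook and is an assumption, not TGI's full set; `build_instance` is a hypothetical helper.

```python
# Sampling parameters used in this notebook (assumed subset, not exhaustive).
KNOWN_PARAMETERS = {"max_new_tokens", "do_sample", "top_p", "temperature"}


def build_instance(prompt: str, **parameters) -> dict:
    """Build one prediction instance, rejecting unknown parameter names."""
    unknown = set(parameters) - KNOWN_PARAMETERS
    if unknown:
        raise ValueError(f"Unknown generation parameter(s): {sorted(unknown)}")
    return {"inputs": prompt, "parameters": parameters}


instance = build_instance(
    "Deep Learning is",
    max_new_tokens=256,
    do_sample=True,
    top_p=0.7,
    temperature=1.0,
)
print(instance)
```

The resulting dictionary can be passed as `instances=[instance]` to the `predict` call used above.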

Delete the resources created in this tutorial to avoid ongoing charges.

In [12]:
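# `deployed_model` is the Endpoint returned by model.deploy()
deployed_model.undeploy_all()  # remove all deployed models from the endpoint
deployed_model.delete()  # delete the endpoint itself
model.delete()  # delete the uploaded model resource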

Undeploying Endpoint model: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Undeploy Endpoint model backing LRO: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776/operations/2355565055325503488
Endpoint model undeployed. Resource name: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Deleting Endpoint : projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Delete Endpoint  backing LRO: projects/1049843053967/locations/us-central1/operations/4209922201895305216
Endpoint deleted. . Resource name: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Deleting Model : projects/1049843053967/locations/us-central1/models/7037188878091943936
Delete Model  backing LRO: projects/1049843053967/locations/us-central1/models/7037188878091943936/operations/2782281120018857984
Model deleted. . Resource name: projects/1049843053967/locations/us-central1/models/7037188878091943936
