# Deploy Llama 3 8B on Vertex AI

<table align="left">
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/llm_streaming_prediction.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
        <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official//prediction/llm_streaming_prediction.ipynb">
        <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
        </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official//prediction/llm_streaming_prediction.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Resources used 
* https://cloud.google.com/vertex-ai/docs/predictions/use-tpu
* https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container
* https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/pytorch_image_classification_with_prebuilt_serving_containers.ipynb
* https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/llm_streaming_prediction.ipynb
* https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/prediction/get_started_with_nvidia_triton_serving.ipynb

## Overview

This tutorial demonstrates how to deploy Meta Llama 3 8B Instruct to a Vertex AI endpoint, served by a custom container that runs Hugging Face Text Generation Inference (TGI).

## Installations

Before installing the packages, make sure the Google Cloud CLI is installed: https://cloud.google.com/sdk/docs/install

Install the packages required to run this notebook.

In [10]:
! pip install --upgrade --quiet google-cloud-aiplatform google-cloud-storage "google-auth>=2.23.3"
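# huggingface-cli and hf_transfer are used by the model download step later in this notebook
! pip install --upgrade --quiet transformers "huggingface_hub[cli]" hf_transfer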

## Set up Vertex AI and the SDK



Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, run:**

In [1]:
! gcloud auth login 
! gcloud auth application-default login

**3. Colab, uncomment and run:**

In [None]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Set up the SDK with your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)
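
The lookup steps above can also be scripted. The following is a minimal sketch, not part of the original notebook: the helper name `resolve_project_id` is hypothetical, and it assumes that either the `GOOGLE_CLOUD_PROJECT` environment variable is exported or the `gcloud` CLI is available on the `PATH`. If neither holds, it returns `None` and the project ID has to be set by hand.

```python
import os
import shutil
import subprocess
from typing import Mapping, Optional


def resolve_project_id(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return a project ID from the environment or the gcloud config, else None.

    Hypothetical helper for illustration; not part of any Google SDK.
    """
    env = os.environ if env is None else env

    # 1. Prefer an explicitly exported project ID.
    project = env.get("GOOGLE_CLOUD_PROJECT", "").strip()
    if project:
        return project

    # 2. Fall back to the active gcloud configuration, if gcloud is installed.
    if shutil.which("gcloud") is None:
        return None
    try:
        result = subprocess.run(
            ["gcloud", "config", "get-value", "project"],
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None

    value = result.stdout.strip()
    return value if value and value != "(unset)" else None
```

A cell could then use `PROJECT_ID = resolve_project_id() or "[your-project-id]"` so that the placeholder is only left in place when nothing could be detected.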

In [3]:
# PROJECT_ID = "[your-project-id]"  # @param {type:"string"}
PROJECT_ID = "gcp-partnership-412108"  # @param {type:"string"}
REGION = "us-central1"  # @param {type: "string"}

# Optional: uncomment to configure gcloud and create the bucket.
# BUCKET_URI is not defined anywhere in this notebook; set it before running the last command.
# ! gcloud config set project {PROJECT_ID} --quiet
# ! gcloud config set ai/region {REGION} --quiet
# ! gsutil mb -l {REGION} -p {PROJECT_ID} {BUCKET_URI}

Initialize the Vertex AI SDK for Python for your project and region.

In [18]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION)

## 2. Upload model to GCS

In [None]:
%%bash

export HF_HUB_ENABLE_HF_TRANSFER=1  # requires the hf_transfer package
REPOSITORY_ID="meta-llama/Meta-Llama-3-8B-Instruct"  # The repository ID on Hugging Face
LOCAL_DIR="tmp/llama3"  # The directory where the model is downloaded
GCS_BUCKET="gs://hf-gcp-models-deployments-test-3451451/Meta-Llama-3-8B-Instruct/"  # The Google Cloud Storage location to upload the model to

# Meta-Llama-3 is a gated repository: accept the license on Hugging Face and
# authenticate (for example with `huggingface-cli login`) before downloading.
# Download the model from Hugging Face, excluding certain file types
mkdir -p "$LOCAL_DIR"
huggingface-cli download "$REPOSITORY_ID" --exclude "*.bin" "*.pth" "*.gguf" --local-dir "$LOCAL_DIR"

# Upload the downloaded model to Google Cloud Storage
gsutil -m cp -r "$LOCAL_DIR" "$GCS_BUCKET"

# Clean up the local directory after upload
rm -rf "$LOCAL_DIR"
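
To see what the `--exclude` patterns in the download command filter out, here is a small sketch. It assumes Unix shell-style matching as implemented by Python's `fnmatch` module, and the file listing is hypothetical (chosen for illustration, not read from the actual repository).

```python
from fnmatch import fnmatchcase

# Same patterns as passed to --exclude in the download command.
EXCLUDE_PATTERNS = ["*.bin", "*.pth", "*.gguf"]

# Hypothetical file listing, for illustration only.
repo_files = [
    "config.json",
    "tokenizer.json",
    "model-00001-of-00004.safetensors",
    "original/consolidated.00.pth",
    "pytorch_model.bin",
]


def is_excluded(path: str, patterns=EXCLUDE_PATTERNS) -> bool:
    """Return True if the path matches any exclude pattern."""
    return any(fnmatchcase(path, pattern) for pattern in patterns)


kept = [path for path in repo_files if not is_excluded(path)]
print(kept)
```

Under these assumptions only the JSON and `safetensors` files remain, which keeps the upload to GCS limited to one copy of the weights.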

## 3. Deploy model to Vertex AI

Register the model in Vertex AI, pointing it at the weights in GCS and the TGI serving container image.

In [19]:
SERVING_CONTAINER_IMAGE_URI = "us-central1-docker.pkg.dev/gcp-partnership-412108/base-tgi-image/base-tgi-image"

model = aiplatform.Model.upload(
    display_name="llama3",
    artifact_uri="gs://hf-gcp-models-deployments-test-3451451/Meta-Llama-3-8B-Instruct/",
    serving_container_image_uri=SERVING_CONTAINER_IMAGE_URI,
    serving_container_environment_variables={
        "NUM_SHARD": "1",  # number of GPU shards
        "MAX_INPUT_LENGTH": "1512",  # maximum prompt length, in tokens
        "MAX_TOTAL_TOKENS": "4096",  # maximum prompt + generated tokens
    },
)


model.wait()

print(model.display_name)
print(model.resource_name)

Creating Model
Create Model backing LRO: projects/755607090520/locations/us-central1/models/711059667240878080/operations/7358806508936626176
Model created. Resource name: projects/755607090520/locations/us-central1/models/711059667240878080@1
To use this Model in another session:
model = aiplatform.Model('projects/755607090520/locations/us-central1/models/711059667240878080@1')
llama3
projects/755607090520/locations/us-central1/models/711059667240878080
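
The two length settings in the serving environment interact: `MAX_TOTAL_TOKENS` caps prompt plus generated tokens, so the room left for generation depends on the prompt length. The sketch below works through that arithmetic. It assumes the container passes these variables to TGI unchanged; the helper name `max_new_tokens_budget` is hypothetical.

```python
# Values copied from the serving_container_environment_variables above.
serving_env = {
    "NUM_SHARD": "1",
    "MAX_INPUT_LENGTH": "1512",
    "MAX_TOTAL_TOKENS": "4096",
}


def max_new_tokens_budget(env, prompt_tokens=None):
    """Return how many tokens can still be generated for a given prompt length.

    If prompt_tokens is None, the worst case (a prompt of MAX_INPUT_LENGTH
    tokens) is assumed.
    """
    max_input = int(env["MAX_INPUT_LENGTH"])
    max_total = int(env["MAX_TOTAL_TOKENS"])
    if max_input >= max_total:
        raise ValueError("MAX_INPUT_LENGTH must be smaller than MAX_TOTAL_TOKENS")

    used = max_input if prompt_tokens is None else prompt_tokens
    if used > max_input:
        raise ValueError(f"prompt of {used} tokens exceeds MAX_INPUT_LENGTH={max_input}")
    return max_total - used


print(max_new_tokens_budget(serving_env))       # worst case: 4096 - 1512
print(max_new_tokens_budget(serving_env, 20))   # short prompt: 4096 - 20
```

With these settings, a request for `max_new_tokens=256` fits even when the prompt uses the full input length.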


In [20]:
machine_type = "g2-standard-4"  # G2 machine type with one NVIDIA L4 GPU
endpoint = aiplatform.Endpoint.create(display_name="llama3")

deployed_model = model.deploy(
    endpoint=endpoint,
    deployed_model_display_name="llama3",
    machine_type=machine_type,
    accelerator_type="NVIDIA_L4",
    accelerator_count=1,
    traffic_percentage=100,
    min_replica_count=1,
    sync=True,
)

Creating Endpoint
Create Endpoint backing LRO: projects/755607090520/locations/us-central1/endpoints/6878568932622467072/operations/502076076265046016


Endpoint created. Resource name: projects/755607090520/locations/us-central1/endpoints/6878568932622467072
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/755607090520/locations/us-central1/endpoints/6878568932622467072')
Deploying model to Endpoint : projects/755607090520/locations/us-central1/endpoints/6878568932622467072
Deploy Endpoint model backing LRO: projects/755607090520/locations/us-central1/endpoints/6878568932622467072/operations/7775389474468397056
Endpoint model deployed. Resource name: projects/755607090520/locations/us-central1/endpoints/6878568932622467072
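
A rough back-of-the-envelope check shows why a single L4 is a plausible fit for this model. This is a sketch under stated assumptions, not a measurement: it assumes roughly 8 billion parameters, 16-bit weights (2 bytes per parameter), and 24 GB of memory on one L4. It also ignores the KV cache, activations, and runtime overhead, which all consume part of the remaining headroom.

```python
L4_MEMORY_GB = 24  # nominal memory of one NVIDIA L4 GPU


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed just to hold the weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


weights = weight_memory_gb(8e9)      # assumed ~8B parameters at 16-bit precision
headroom = L4_MEMORY_GB - weights    # left for KV cache, activations, overhead
print(f"weights ~{weights:.0f} GB, headroom ~{headroom:.0f} GB")
```

Larger `MAX_TOTAL_TOKENS` values or more concurrent requests grow the KV cache, so the headroom is what limits them on this machine type.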


In [21]:
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
  {"role": "user", "content": "What is Google Cloud?"},
]

res = deployed_model.predict(
    instances=[
        {
            "inputs": tok.apply_chat_template(messages, tokenize=False),
            "parameters": {"max_new_tokens": 256, "do_sample": True, "top_p": 0.7, "temperature": 1.0},
        }
    ]
)
print(res.predictions[0])


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


assistant

Google Cloud is a suite of cloud computing services offered by Google. It provides a range of products and services that allow individuals, businesses, and organizations to build, deploy, and manage applications and infrastructure on the cloud. Here are some of the key features and services offered by Google Cloud:

1. **Compute Services**: Google Cloud provides a range of compute services, including Google Compute Engine, Google Kubernetes Engine, and Cloud Functions, which allow users to run and manage virtual machines, containers, and serverless functions.
2. **Storage Services**: Google Cloud offers a range of storage services, including Google Cloud Storage, Cloud SQL, and Cloud Datastore, which provide scalable and secure storage for data and applications.
3. **Big Data and Analytics**: Google Cloud provides a range of big data and analytics services, including Google BigQuery, Google Cloud Dataproc, and Google Cloud Dataflow, which allow users to process and analyze l
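
Because each instance in the request is a plain dictionary, a misspelled parameter name is not caught on the client side. The sketch below builds an instance in the same `inputs` / `parameters` shape as the prediction request and rejects unknown keys before anything is sent. The helper name `build_instance` is hypothetical, and `KNOWN_PARAMETERS` is only a partial list of TGI generation parameters, assumed here for illustration.

```python
# Partial list of TGI generation parameters (an assumption, extend as needed).
KNOWN_PARAMETERS = {
    "max_new_tokens",
    "do_sample",
    "top_p",
    "top_k",
    "temperature",
    "repetition_penalty",
    "stop",
    "seed",
}


def build_instance(prompt: str, **parameters) -> dict:
    """Build one prediction instance, failing fast on unknown parameter names."""
    unknown = sorted(set(parameters) - KNOWN_PARAMETERS)
    if unknown:
        raise ValueError(f"Unknown generation parameter(s): {unknown}")
    return {"inputs": prompt, "parameters": parameters}


instance = build_instance(
    "What is Google Cloud?",
    max_new_tokens=256,
    do_sample=True,
    top_p=0.7,
    temperature=1.0,
)
print(instance)
```

The resulting dictionary can be passed as an element of `instances` to `predict`; the prompt string would normally be the output of the tokenizer's chat template rather than raw text.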

## 4. Delete resources

In [12]:
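# `deployed_model` is the Endpoint returned by model.deploy()
deployed_model.undeploy_all()  # remove every deployed model from the endpoint
deployed_model.delete()  # delete the endpoint itself
model.delete()  # delete the uploaded model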

Undeploying Endpoint model: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Undeploy Endpoint model backing LRO: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776/operations/2355565055325503488
Endpoint model undeployed. Resource name: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Deleting Endpoint : projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Delete Endpoint  backing LRO: projects/1049843053967/locations/us-central1/operations/4209922201895305216
Endpoint deleted. . Resource name: projects/1049843053967/locations/us-central1/endpoints/3364094363645771776
Deleting Model : projects/1049843053967/locations/us-central1/models/7037188878091943936
Delete Model  backing LRO: projects/1049843053967/locations/us-central1/models/7037188878091943936/operations/2782281120018857984
Model deleted. . Resource name: projects/1049843053967/locations/us-central1/models/7037188878091943936
