# Deployment of Model Garden Open-Source Model (OpenLlama 3B)

# Overview

This notebook demonstrates deploying and running inference for a sample LLM from Model Garden (OpenLlama 3B). The functions in this notebook can be adapted for other models from Model Garden.

[openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b)

# Code

The following code sets up the Python environment on the workbench, uploads the model and deploys it to an endpoint, and shows how to run inference on the deployed model.

## Set-up

In [None]:
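# Cloud project ID. The `!` shell capture returns an IPython SList;
# its `.n` attribute is the output as a single newline-joined string.
PROJECT_ID = !gcloud config get project
PROJECT_ID = PROJECT_ID.n
print("Project ID: " + PROJECT_ID)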

# The region you want to launch jobs in.
REGION = "europe-west2"
print("Region: " + REGION)

# The Cloud Storage bucket for storing experiments output.
BUCKET_URI = "gs://gen-ai-%s-bucket" % PROJECT_ID
print("Bucket URI: " + BUCKET_URI)

import os

# Buckets or folders to store required model components.
STAGING_BUCKET = os.path.join(BUCKET_URI, "temporal")
EXPERIMENT_BUCKET = os.path.join(BUCKET_URI, "peft")
DATA_BUCKET = os.path.join(EXPERIMENT_BUCKET, "data")
MODEL_BUCKET = os.path.join(EXPERIMENT_BUCKET, "model")

# The service account for deploying the model. It requires the `Vertex AI User` and `Storage Object Admin` roles.
SERVICE_ACCOUNT = "%s-consumer-sa@%s.iam.gserviceaccount.com" % (PROJECT_ID, PROJECT_ID)
print("Service Account: " + SERVICE_ACCOUNT)

! gcloud config set project $PROJECT_ID
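
One portability caveat for the path construction above: `os.path.join` uses the host's path separator, so on Windows it would put backslashes into the `gs://` URIs. On a Linux workbench this makes no difference. A minimal sketch of a separator-safe alternative follows, assuming a hypothetical helper `gcs_join` and a placeholder project ID `my-project` (neither appears in the original notebook):

```python
import posixpath


def gcs_join(base_uri: str, *parts: str) -> str:
    """Join Cloud Storage URI components with forward slashes on any OS.

    Hypothetical helper, not part of any Google Cloud library.
    """
    return posixpath.join(base_uri, *parts)


# Placeholder project ID, used for illustration only.
project_id = "my-project"
bucket_uri = "gs://gen-ai-%s-bucket" % project_id

staging_bucket = gcs_join(bucket_uri, "temporal")
experiment_bucket = gcs_join(bucket_uri, "peft")
data_bucket = gcs_join(experiment_bucket, "data")
model_bucket = gcs_join(experiment_bucket, "model")

print(staging_bucket)
print(model_bucket)
```

The folder names mirror the ones used in the set-up cell, so the helper could replace `os.path.join` there without changing the resulting URIs on Linux.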
     

### Initialize Vertex AI API

In [None]:
from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=STAGING_BUCKET)

### Define Image Location Constants

The following constant defines the location of the container image that the endpoint uses to serve requests.

In [None]:
# The prebuilt serving Docker image.
PREDICTION_DOCKER_URI = (
    "us-docker.pkg.dev/vertex-ai/vertex-vision-model-garden-dockers/pytorch-peft-serve"
)

### Deploy Prebuilt OpenLLaMA

This section deploys a prebuilt OpenLLaMA model on the endpoint. The model deployment step takes about 15 minutes to complete.

With the default settings, the peak GPU memory usages for [openlm-research/open_llama_3b](https://huggingface.co/openlm-research/open_llama_3b), [openlm-research/open_llama_7b](https://huggingface.co/openlm-research/open_llama_7b), and [openlm-research/open_llama_13b](https://huggingface.co/openlm-research/open_llama_13b) are ~5.3G, ~8.7G and ~15.2G respectively.

Set the prebuilt model ID. Larger versions of the model may need more compute capacity on the endpoint, which may incur higher costs.

| Models |
| :- |
| openlm-research/open_llama_3b |
| openlm-research/open_llama_7b |
| openlm-research/open_llama_13b |
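
The peak memory figures quoted above can be used for a rough check of whether a model fits on a given GPU. The sketch below is illustrative only: `fits_on_gpu` is a hypothetical helper, the figures are read as gigabytes, and the 1 GB headroom is an arbitrary safety margin rather than a measured requirement.

```python
# Approximate peak GPU memory in GB with default settings, as quoted above.
PEAK_GPU_MEMORY_GB = {
    "openlm-research/open_llama_3b": 5.3,
    "openlm-research/open_llama_7b": 8.7,
    "openlm-research/open_llama_13b": 15.2,
}


def fits_on_gpu(model_id: str, gpu_memory_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the quoted peak usage plus headroom fits in GPU memory.

    Hypothetical helper; headroom_gb is an assumed, illustrative margin.
    """
    return PEAK_GPU_MEMORY_GB[model_id] + headroom_gb <= gpu_memory_gb


# An NVIDIA T4 has 16 GB of memory.
T4_MEMORY_GB = 16.0

for model_id in PEAK_GPU_MEMORY_GB:
    print(model_id, fits_on_gpu(model_id, T4_MEMORY_GB))
```

Under these assumptions the 3B and 7B models fit on a single T4 with room to spare, while the 13B model leaves almost no margin, which is when a larger accelerator is worth considering.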


NOTE: The prebuilt model weights will be downloaded on the fly from the original location after the deployment succeeds. Thus, an additional 5 minutes of waiting time is needed **after** the above model deployment step succeeds and before you can run the next step below. Otherwise you might see a `ServiceUnavailable: 503 502:Bad Gateway` error when you send requests to the endpoint.

Once deployment succeeds, you can send requests to the endpoint with text prompts.

### Get Pre-Provisioned Endpoint and Define Model ID

In [None]:
prebuilt_model_id = "openlm-research/open_llama_3b"
# Build the resource name from REGION so it stays consistent with aiplatform.init().
endpoint = aiplatform.Endpoint(
    "projects/%s/locations/%s/endpoints/gen-ai-oss-endpoint" % (PROJECT_ID, REGION)
)
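
The endpoint is addressed by a resource name of the form `projects/<project>/locations/<region>/endpoints/<endpoint>`. A minimal sketch of assembling that string from variables, so the project and region stay consistent with the rest of the notebook, is shown here. The helper `endpoint_resource_name` and the project ID `my-project` are hypothetical.

```python
def endpoint_resource_name(project: str, region: str, endpoint_id: str) -> str:
    """Format a Vertex AI endpoint resource name.

    Hypothetical helper, shown for illustration only.
    """
    return f"projects/{project}/locations/{region}/endpoints/{endpoint_id}"


# Placeholder project ID; the region and endpoint ID match the cell above.
name = endpoint_resource_name("my-project", "europe-west2", "gen-ai-oss-endpoint")
print(name)
```

The resulting string can be passed to `aiplatform.Endpoint(...)` in place of the inline format expression.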

### Upload Model Garden Model To Model Registry

In [None]:
model = aiplatform.Model.upload(
    display_name="openllama-serve",
    serving_container_image_uri=PREDICTION_DOCKER_URI,
    serving_container_ports=[7080],
    serving_container_predict_route="/predictions/peft_serving",
    serving_container_health_route="/ping",
    serving_container_environment_variables={
        "BASE_MODEL_ID": prebuilt_model_id,
        "TASK": "causal-language-modeling-lora",       
    },
)

### Deploy Model to Endpoint

This call deploys the model to the endpoint and attaches compute resources to it.

You can select from several machine and accelerator types. Reference:

- Machine Types: https://cloud.google.com/compute/docs/machine-resource
- Accelerator Types: https://cloud.google.com/compute/docs/gpus/gpu-regions-zones

Note that accelerator (GPU) availability varies by region, and costs can be high depending on the accelerator type and how long the deployment runs.

In [None]:
model.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    deploy_request_timeout=1800,
    service_account=SERVICE_ACCOUNT,
)

## Inference using deployed model

If you get an error, wait about 5 minutes after the deployment completes and try again.

In [None]:
instances = [
    {"prompt": "Generate a list of ways that makes Earth unique compared to other planets"},
]
response = endpoint.predict(instances=instances)

for prediction in response.predictions[0]:
    print(prediction["generated_text"])
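
Because the model weights are still downloading for a few minutes after deployment, the first requests may fail. Instead of waiting manually, the prediction call can be wrapped in a retry loop with a growing delay. The sketch below uses a hypothetical helper `call_with_retry` and a stand-in function in place of `endpoint.predict`, so it runs without a deployed endpoint. The delay values are illustrative assumptions.

```python
import time


def call_with_retry(fn, max_attempts=5, initial_delay_s=30.0, backoff=2.0, sleep=time.sleep):
    """Call fn() until it succeeds or max_attempts is reached.

    Hypothetical helper. It catches every Exception for brevity; in practice
    you would likely narrow this to the service-unavailable error you observe.
    """
    delay = initial_delay_s
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s")
            sleep(delay)
            delay *= backoff


# Stand-in for endpoint.predict that fails twice before succeeding.
calls = {"count": 0}


def fake_predict():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "ok"


# The sleep function is replaced with a no-op so the example finishes instantly.
result = call_with_retry(fake_predict, sleep=lambda seconds: None)
print(result)
```

With a real endpoint, the same helper could be used as `call_with_retry(lambda: endpoint.predict(instances=instances))`.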

### Clean Up Resources

Clean up resources when you are finished to avoid excess costs. This can also be done from the Cloud console.

In [None]:
# Undeploy all models from the endpoint
endpoint.undeploy_all()

# Delete Models
model.delete()

In [None]:
# Delete the endpoint. Note: if you delete your pre-provisioned endpoint, use the call in the next section to create another one.
endpoint.delete(force=True)
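
If an error interrupts the notebook before the clean-up cells run, the deployed model keeps consuming resources. One way to guard against that is to tie clean-up to a context manager. The sketch below is an assumption-laden illustration: `deployed` is a hypothetical helper that relies only on the `undeploy_all()` and `delete()` methods used above, and `FakeResource` stands in for the real endpoint and model so the example runs offline.

```python
from contextlib import contextmanager


@contextmanager
def deployed(endpoint, model):
    """Yield the endpoint, then undeploy all models and delete the model on exit.

    Hypothetical helper. The endpoint itself is left in place.
    """
    try:
        yield endpoint
    finally:
        endpoint.undeploy_all()
        model.delete()


# Stand-in objects that record calls instead of touching Vertex AI.
class FakeResource:
    def __init__(self, log, name):
        self.log, self.name = log, name

    def undeploy_all(self):
        self.log.append(f"{self.name}.undeploy_all")

    def delete(self):
        self.log.append(f"{self.name}.delete")


log = []
try:
    with deployed(FakeResource(log, "endpoint"), FakeResource(log, "model")):
        raise RuntimeError("inference failed")
except RuntimeError:
    pass

print(log)
```

Clean-up runs even though the body raised, in the same order as the cells above: undeploy first, then delete the model.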

### Deploy Additional Endpoints

Use the call below to create an additional endpoint.

In [None]:
endpoint = aiplatform.Endpoint.create(
    display_name="gen-ai-oss-endpoint",
)