# Deploy your YOLO Model with NVIDIA's Triton Server to a Vertex AI Endpoint

**Note**: 

This guide assumes that:
- The appropriate model artifacts (in this case, the `models` directory) are already stored on GCS.
- You have built the Docker image using the Dockerfile in the repo.
- The Triton Server (with the YOLO model) you are about to deploy to Vertex AI Endpoints has been tested locally.

If you haven't done the above, I recommend reading the `README.md` in the repo.

## 1. Initializations

In this section, we initialize the aiplatform client and declare other variables, such as the <a href= "https://cloud.google.com/kms/docs/key-management-service"> KMS key </a> (if your org uses one).

Before proceeding, please make sure `google-cloud-aiplatform` is installed.
If you want to follow my setup exactly, run `pip install google-cloud-aiplatform==1.79.0`.

In [None]:
from google.cloud import aiplatform


KMS_KEY = ""  # Enter your KMS key if you have one
location = ""  # Enter the location/region, for example 'us-east4'
aiplatform.init()


## 2. Upload the Docker image to Google Cloud Artifact Registry (GAR)

You must upload your Docker image to GAR because Vertex AI's Model Registry only integrates with GAR. The Vertex AI Endpoint, in turn, pulls the "model" from Vertex AI's Model Registry.

Please make sure you have enabled Google Cloud Artifact Registry in your project and can upload Docker images to it. If not, please refer to this <a href="https://cloud.google.com/artifact-registry/docs/docker/pushing-and-pulling"> link</a>.

**Steps:**

1. Tag the Docker image to include the artifact registry location.

In [None]:
# Edit the command below and run it.
# location can be us-east4, us-central1, etc.
# Use your project ID here, which may differ from the project's display name.
! docker tag <docker image> <location>-docker.pkg.dev/<project id>/<registry name>/<docker image>

2. Push the newly tagged Docker image to GAR.

In [None]:
# Edit the command below and run it.
# Use the same image URI that you tagged in the previous step.
! docker push <location>-docker.pkg.dev/<project id>/<registry name>/<docker image>
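
Both commands above use the same image URI, and step 3 needs it again as `serving_container_image_uri`. As a minimal sketch, the URI can be assembled once in Python. The helper name `gar_image_uri` and the example values (`my-project`, `my-registry`, `triton-yolo`) are hypothetical placeholders, not names from the repo.

```python
from typing import Optional


def gar_image_uri(
    location: str,
    project_id: str,
    repository: str,
    image: str,
    tag: Optional[str] = None,
) -> str:
    """Assemble an Artifact Registry Docker image URI (hypothetical helper)."""
    uri = f"{location}-docker.pkg.dev/{project_id}/{repository}/{image}"
    # Append the tag only when one is given; Docker falls back to 'latest' otherwise.
    return f"{uri}:{tag}" if tag else uri


# Placeholder values for illustration only.
image_uri = gar_image_uri("us-east4", "my-project", "my-registry", "triton-yolo")
print(image_uri)
```

The resulting string can be pasted into the `docker tag` and `docker push` commands and reused in step 3.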

## 3. Register the YOLO model on Vertex AI's Model Registry

Since Vertex AI has a default integration with NVIDIA Triton models, we only need to provide the GCS location of the model artifacts and the URI of the Docker image on GAR, along with a few descriptive details. The `model.resource_name` will be used when "mounting" our model on a Vertex AI Endpoint.

In [None]:
# Edit and run
model = aiplatform.Model.upload(
    display_name="",  # Example: 'yolo-model'
    description="",  # Example: 'YOLO model for document layout identification'
    artifact_uri="<gcs url of model artifacts (in this case, the NVIDIA Triton model repository)>",
    serving_container_image_uri="<docker image uri from step 2>",
    # Uncomment the line below if you have a KMS key
    # encryption_spec_key_name=KMS_KEY,
    location=location,
)
print("Model resource:", model.resource_name)

## 4. Create a Vertex AI Endpoint

The following code creates the default Vertex AI Endpoint type, called <a href = "https://cloud.google.com/vertex-ai/docs/predictions/create-public-endpoint#create_a_shared_public_endpoint">a shared public endpoint</a>.
The `endpoint.resource_name` will be used for "mounting" the model from the Model Registry onto a Vertex AI Endpoint in step 5.

*Note*: Please refer to the following <a href="https://cloud.google.com/vertex-ai/docs/predictions/choose-endpoint-type"> page </a> to choose the endpoint type that best suits your requirements.


In [None]:
endpoint = aiplatform.Endpoint.create(
    display_name="",  # Example: 'yolo-model-endpoint'
    # Uncomment the line below if you have a KMS key
    # encryption_spec_key_name=KMS_KEY,
    location=location,
)
print("Endpoint resource:", endpoint.resource_name)

## 5. Deploy the YOLO model from Vertex AI Model Registry to our created Vertex AI Endpoint

The following code deploys the model registered in step 3 to the endpoint created in step 4.
The JSON payload describes the model to deploy, the replica counts, the machine type, the GPUs, and the autoscaling logic. A POST request carrying this payload is sent to the Vertex AI Endpoint created in step 4.

In [None]:
# Feel free to edit the JSON values apart from the "model" field.
model_name = model.resource_name
model_serving = {
    "deployedModel": {
        "model": model_name,
        "displayName": "yolo-model",
        "dedicatedResources": {
            "machineSpec": {
                "machineType": "g2-standard-8",
                "acceleratorType": "NVIDIA_L4",
                "acceleratorCount": 1,
            },
            "minReplicaCount": 2,
            "maxReplicaCount": 6,
            # Scale out when the average GPU duty cycle exceeds 25%.
            "autoscalingMetricSpecs": [
                {
                    "metricName": "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",
                    "target": 25,
                }
            ],
        },
    }
}
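
If you expect to tune these values often, the same payload can be produced by a small helper that also catches obviously invalid settings before the request is sent. This is only a sketch: `build_deploy_payload` and the example resource name are hypothetical, the defaults simply mirror the values used in this guide, and the checks are basic sanity rules rather than the full set of Vertex AI constraints.

```python
def build_deploy_payload(
    model_name: str,
    display_name: str = "yolo-model",
    machine_type: str = "g2-standard-8",
    accelerator_type: str = "NVIDIA_L4",
    accelerator_count: int = 1,
    min_replicas: int = 2,
    max_replicas: int = 6,
    duty_cycle_target: int = 25,
) -> dict:
    """Build a deployModel request body (hypothetical helper)."""
    # Basic sanity checks only; the service enforces its own limits.
    if not 1 <= min_replicas <= max_replicas:
        raise ValueError("need 1 <= min_replicas <= max_replicas")
    if not 1 <= duty_cycle_target <= 100:
        raise ValueError("duty_cycle_target must be a percentage in [1, 100]")
    return {
        "deployedModel": {
            "model": model_name,
            "displayName": display_name,
            "dedicatedResources": {
                "machineSpec": {
                    "machineType": machine_type,
                    "acceleratorType": accelerator_type,
                    "acceleratorCount": accelerator_count,
                },
                "minReplicaCount": min_replicas,
                "maxReplicaCount": max_replicas,
                "autoscalingMetricSpecs": [
                    {
                        "metricName": "aiplatform.googleapis.com/prediction/online/accelerator/duty_cycle",
                        "target": duty_cycle_target,
                    }
                ],
            },
        }
    }


# Placeholder resource name for illustration; in the notebook you would pass model.resource_name.
example_payload = build_deploy_payload("projects/123/locations/us-east4/models/456")
print(example_payload["deployedModel"]["dedicatedResources"]["maxReplicaCount"])
```

Calling the helper with `model.resource_name` should yield the same structure as the hand-written `model_serving` dictionary.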

In [None]:
import json
with open("model_serving.json", "w", encoding="utf-8") as fp:
    json.dump(model_serving, fp)   
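
The request in the next cell is sent to a URL that combines the region with `endpoint.resource_name`. Since the resource name already contains the region, the URL can be derived from it alone, which avoids a mismatch between the two. The sketch below assumes the usual `projects/<project>/locations/<location>/endpoints/<id>` resource-name layout; `deploy_model_url` and the numeric IDs are hypothetical.

```python
import re

# Expected layout of an endpoint resource name.
_ENDPOINT_RE = re.compile(
    r"^projects/(?P<project>[^/]+)/locations/(?P<location>[^/]+)/endpoints/(?P<endpoint>[^/]+)$"
)


def deploy_model_url(endpoint_resource_name: str) -> str:
    """Return the REST URL of the endpoint's deployModel method (hypothetical helper)."""
    match = _ENDPOINT_RE.match(endpoint_resource_name)
    if match is None:
        raise ValueError(f"unexpected endpoint resource name: {endpoint_resource_name!r}")
    # The regional API host must match the location embedded in the resource name.
    location = match.group("location")
    return f"https://{location}-aiplatform.googleapis.com/v1/{endpoint_resource_name}:deployModel"


# Placeholder IDs for illustration; in the notebook you would pass endpoint.resource_name.
url = deploy_model_url("projects/123456789/locations/us-east4/endpoints/987654321")
print(url)
```

The printed URL can be pasted into the `curl` command as is.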

In [None]:
# Replace <location> and <endpoint.resource_name> in the URL below with your region and the endpoint.resource_name printed in step 4.
!curl -X POST \
     -H "Authorization: Bearer $(gcloud auth print-access-token)" \
     -H "Content-Type: application/json; charset=utf-8" \
     -d @model_serving.json \
     "https://<location>-aiplatform.googleapis.com/v1/<endpoint.resource_name>:deployModel"