# Run Multiple Models on the Same GPU with Amazon SageMaker Multi-Model Endpoints Powered by NVIDIA Triton Inference Server

This notebook was run on an `ml.g4dn.xlarge` SageMaker Notebook instance with the `conda_pytorch_p38` kernel.

## Prerequisites

Install the necessary Python modules to use and interact with [NVIDIA Triton Inference Server](https://github.com/triton-inference-server/server/).

In [1]:
! pip install torch==1.10.0 sagemaker transformers==4.9.1 tritonclient[all]

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com, https://pypi.ngc.nvidia.com
Collecting torch==1.10.0
  Downloading torch-1.10.0-cp38-cp38-manylinux1_x86_64.whl (881.9 MB)
Collecting transformers==4.9.1
  Downloading transformers-4.9.1-py3-none-any.whl (2.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)
  Preparing metadata (setup.py) ... done
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)

# Part 1 - Setup

In [2]:
import argparse
import boto3
import copy
import datetime
import json
import numpy as np
import os
import pandas as pd
import pprint
import re
import sagemaker
import sys
import time
from time import gmtime, strftime
import tritonclient.http as http_client

In [3]:
session = boto3.Session()
role = sagemaker.get_execution_role()

sm_client = session.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=session)
sm_runtime_client = boto3.client("sagemaker-runtime")

region = boto3.Session().region_name

In [4]:
account_id_map = {
    "us-east-1": "785573368785",
    "us-east-2": "007439368137",
    "us-west-1": "710691900526",
    "us-west-2": "301217895009",
    "eu-west-1": "802834080501",
    "eu-west-2": "205493899709",
    "eu-west-3": "254080097072",
    "eu-north-1": "601324751636",
    "eu-south-1": "966458181534",
    "eu-central-1": "746233611703",
    "ap-east-1": "110948597952",
    "ap-south-1": "763008648453",
    "ap-northeast-1": "941853720454",
    "ap-northeast-2": "151534178276",
    "ap-southeast-1": "324986816169",
    "ap-southeast-2": "355873309152",
    "cn-northwest-1": "474822919863",
    "cn-north-1": "472730292857",
    "sa-east-1": "756306329178",
    "ca-central-1": "464438896020",
    "me-south-1": "836785723513",
    "af-south-1": "774647643957",
}

***

# Part 2 - Save Model and Tokenizer

We now save the tokenizer and the model to folders within the model repository.

### Parameters:

* `model_id`: Model identifier from the Hugging Face Model Hub

In [5]:
from transformers import AutoModel, AutoTokenizer

model_id = "roberta-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
tokenizer.save_pretrained('model_repo/e2e/tokenizer')
model.save_pretrained('model_repo/e2e/model')

Downloading:   0%|          | 0.00/482 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/1.43G [00:00<?, ?B/s]

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
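
The saved `tokenizer` and `model` folders sit alongside whatever else Triton expects inside `model_repo/e2e`. The helper below is a minimal sketch for checking that layout before starting the server. It assumes the common Triton convention of a `config.pbtxt` file plus at least one numbered version directory; neither is shown in this notebook, and `describe_model_dir` is a hypothetical name.

```python
from pathlib import Path


def describe_model_dir(model_dir):
    """Report which pieces of a Triton model directory are present.

    The expected layout (config.pbtxt, numbered version folders) is an
    assumption based on Triton conventions, not something this notebook shows.
    """
    root = Path(model_dir)
    exists = root.is_dir()
    # Numbered sub-directories (1, 2, ...) are treated as model versions.
    versions = (
        sorted(p.name for p in root.iterdir() if p.is_dir() and p.name.isdigit())
        if exists
        else []
    )
    return {
        "exists": exists,
        "has_config": (root / "config.pbtxt").is_file(),
        "versions": versions,
        "has_tokenizer": (root / "tokenizer").is_dir(),
        "has_model": (root / "model").is_dir(),
    }
```

Calling `describe_model_dir("model_repo/e2e")` right after the save cell should at least report `has_tokenizer` and `has_model` as `True`.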


# Part 3 - Run Local Triton Inference Server

> **WARNING**: The cells under part 3 will only work if run within a SageMaker Notebook Instance!




The following cells run the Triton Inference Server container in the background and load all the models in the local `model_repo` folder, which is mounted into the container as `/model_repository`. Because of `--exit-on-error=false`, the server keeps running even if one or more models fail to load, which is useful when building the code and model repository iteratively. Remove `-d` to see the logs.
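
Because the container starts detached, the server may still be loading models when the next cell runs. Instead of a fixed sleep, one option is to poll Triton's HTTP readiness route. This is a minimal sketch, assuming port 8000 is published on `localhost` as in the `docker run` cell below; `http_ready` and `wait_until` are hypothetical helper names.

```python
import time
import urllib.error
import urllib.request


def http_ready(url="http://localhost:8000/v2/health/ready", timeout_s=2.0):
    """Return True if the readiness route answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until(probe, timeout_s=120.0, interval_s=2.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

After starting the container, `wait_until(http_ready)` returns `True` once the server reports ready, or `False` if the timeout passes first.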

In [6]:
!sudo docker system prune -f

Deleted Networks:
sagemaker-local

Total reclaimed space: 0B


In [7]:
!docker run --gpus=all -d --shm-size=4G --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/model_repo:/model_repository nvcr.io/nvidia/tritonserver:22.12-py3 tritonserver --model-repository=/model_repository --exit-on-error=false --strict-model-config=false
# time.sleep(20)

Unable to find image 'nvcr.io/nvidia/tritonserver:22.12-py3' locally
22.12-py3: Pulling from nvidia/tritonserver

0b181fff: Pulling fs layer
f751e984: Pulling fs layer
b807c637: Pulling fs layer
2991e393: Pulling fs layer
71274096: Pulling fs layer
91138ef8: Pulling fs layer
ed3c7117: Pulling fs layer
46181ee6: Pulling fs layer
a7918caa: Pulling fs layer
2fbe7c33: Pulling fs layer
8dd49356: Pulling fs layer
8fc97997: Pulling fs layer
a4765a47: Pulling fs layer
b700ef54: Pulling fs layer
42d4d1d7: Pulling fs layer
b7b91111: Pulling fs layer
57c41539: Pulling fs layer
bf837893: Pulling fs layer
cb208312: Pulling fs layer
3fcdfbd9: Pulling fs layer
2037b0cf: Pulling fs layer
e9aef86f: Pulling fs layer
f2adb71b: Pulling fs layer
ba395cd0: Pull complete

In [8]:
CONTAINER_ID=!docker container ls -q
FIRST_CONTAINER_ID = CONTAINER_ID[0]

Run the next cell to view the container logs and follow how Triton loads the models. To stream the logs continuously, uncomment the `-f` variant instead.

In [17]:
# !docker logs $FIRST_CONTAINER_ID -f
!docker logs $FIRST_CONTAINER_ID


== Triton Inference Server ==

NVIDIA Release 22.12 (build 50109463)
Triton Server Version 2.29.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 11.8 driver version 520.61.05 with kernel driver version 510.47.03.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

I0217 05:59:47.318165 1 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f3f1e000000' with size 268435456
I0217 05:59:47.320371 1 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
W0217 05:59:47.324396 1 model

## Test TensorRT model by invoking the local Triton Server

In [18]:
# Start a local Triton client
try:
    triton_client = http_client.InferenceServerClient(url="localhost:8000", verbose=True)
except Exception as e:
    print("context creation failed: " + str(e))
    sys.exit()

In [19]:
# Create inputs to send to Triton
model_name = "e2e"

text_inputs = ["Sentence 1", "Sentence 2"]

# Text is passed to Triton as BYTES
inputs = []
inputs.append(http_client.InferInput("INPUT0", [len(text_inputs), 1], "BYTES"))

# Each sentence goes in its own row, so the array has shape [batch_size, 1]
batch_request = [[text] for text in text_inputs]
input0_real = np.array(batch_request, dtype=np.object_)

inputs[0].set_data_from_numpy(input0_real, binary_data=False)

In [20]:
outputs = []

outputs.append(http_client.InferRequestedOutput("SENT_EMBED"))

In [21]:
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

POST /v2/models/e2e/infer, headers None
{"inputs":[{"name":"INPUT0","shape":[2,1],"datatype":"BYTES","data":["Sentence 1","Sentence 2"]}],"outputs":[{"name":"SENT_EMBED","parameters":{"binary_data":true}}]}
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/octet-stream', 'inference-header-content-length': '148', 'content-length': '8340'}>
bytearray(b'{"model_name":"e2e","model_version":"1","outputs":[{"name":"SENT_EMBED","datatype":"FP32","shape":[2,1024],"parameters":{"binary_data_size":8192}}]}')


In [22]:
outputs0 = results.as_numpy("SENT_EMBED")


In [23]:
for idx, output in enumerate(outputs0):
    print(text_inputs[idx])
    print(output)

Sentence 1
[-0.00097987 -0.00352379 -0.004177   ... -0.00120587 -0.00202981
 -0.00294534]
Sentence 2
[-0.00035618 -0.0042098  -0.00419457 ... -0.00180162 -0.00149669
 -0.0010363 ]


In [24]:
# Use this to stop the container that was started in detached mode
!docker kill $FIRST_CONTAINER_ID

91448f589d0c


***

# Part 4 - Deploy Triton to SageMaker MME Endpoint

# MME Experiments

In [25]:
if region not in account_id_map:
    raise ValueError(f"Unsupported region: {region}")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"

triton_image_uri = "{account_id}.dkr.ecr.{region}.{base}/sagemaker-tritonserver:22.12-py3".format(
    account_id=account_id_map[region], region=region, base=base
)

triton_image_uri

'785573368785.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.12-py3'
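
The same lookup can be wrapped in a small function so the region check and the China-partition domain switch live in one place. This is a sketch only; `build_triton_image_uri` is a hypothetical name, and the default tag simply mirrors the one used in this notebook.

```python
def build_triton_image_uri(region, account_ids, tag="22.12-py3"):
    """Build the SageMaker Triton image URI for a region.

    account_ids maps region names to ECR account IDs, like account_id_map.
    """
    if region not in account_ids:
        raise ValueError(f"Unsupported region: {region}")
    # China regions use a different domain suffix.
    base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
    return (
        f"{account_ids[region]}.dkr.ecr.{region}.{base}"
        f"/sagemaker-tritonserver:{tag}"
    )
```

With the mapping from Part 1, `build_triton_image_uri(region, account_id_map)` should yield the same string as the cell above.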

In [26]:
bucket = sagemaker_session.default_bucket()
print(bucket)

sagemaker-us-east-1-414210492846


In [49]:
!tar -C model_repo_5/ -czf e2e-5.tar.gz e2e-5
prefix = 'bert_mme_gpu'
e2e_uri = sagemaker_session.upload_data(path="e2e-5.tar.gz", key_prefix=prefix)
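
When several model copies need packaging, the `tar` command can be reproduced in Python. The sketch below assumes each model lives in its own folder under a repository directory, as with `model_repo_5/e2e-5` above; `package_model` is a hypothetical helper, not part of the SageMaker SDK.

```python
import os
import tarfile


def package_model(repo_dir, model_name, out_path=None):
    """Create <model_name>.tar.gz with the model folder at the archive root.

    Mirrors: tar -C <repo_dir> -czf <model_name>.tar.gz <model_name>
    """
    src = os.path.join(repo_dir, model_name)
    if not os.path.isdir(src):
        raise FileNotFoundError(f"Model folder not found: {src}")
    out_path = out_path or f"{model_name}.tar.gz"
    with tarfile.open(out_path, "w:gz") as tar:
        # arcname drops the repo_dir prefix, like `tar -C repo_dir`.
        tar.add(src, arcname=model_name)
    return out_path
```

The returned path can then be passed to `sagemaker_session.upload_data` exactly as in the cell above.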

In [43]:
model_data_url = f"s3://{bucket}/{prefix}/"
!aws s3 ls $model_data_url

2023-02-17 06:30:11  834120513 e2e-2.tar.gz
2023-02-17 06:09:58  834120588 e2e.tar.gz


In [29]:
model_data_url = f"s3://{bucket}/{prefix}/"

container = {
    "Image": triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
}

In [30]:
sm_model_name = "triton-e2e-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Model Arn: arn:aws:sagemaker:us-east-1:414210492846:model/triton-e2e-2023-02-17-06-10-23


In [66]:
endpoint_config_name = "triton-e2e-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g4dn.2xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Endpoint Config Arn: arn:aws:sagemaker:us-east-1:414210492846:endpoint-config/triton-e2e-2023-02-17-07-55-14


In [67]:
endpoint_name = "triton-e2e-" + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

Endpoint Arn: arn:aws:sagemaker:us-east-1:414210492846:endpoint/triton-e2e-2023-02-17-07-55-22


In [68]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)

Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: Creating
Status: InService
Arn: arn:aws:sagemaker:us-east-1:414210492846:endpoint/triton-e2e-2023-02-17-07-55-22
Status: InService
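
The polling loop above exits on any status other than `Creating`, so a failed deployment ends it silently. The sketch below is one way to make failures and timeouts explicit. It takes the describe call as an argument (for example `sm_client.describe_endpoint`); `wait_for_endpoint` and its default timings are assumptions for illustration.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_s=60, timeout_s=1800,
                      sleep=time.sleep):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    waited = 0
    while True:
        resp = describe(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return status
        if status != "Creating":
            # FailureReason is only present when the deployment failed.
            reason = resp.get("FailureReason", "no reason reported")
            raise RuntimeError(
                f"Endpoint {endpoint_name} is {status}: {reason}"
            )
        if waited >= timeout_s:
            raise TimeoutError(
                f"Endpoint {endpoint_name} still Creating after {waited}s"
            )
        sleep(poll_s)
        waited += poll_s
```

Usage would be `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also ships an `endpoint_in_service` waiter on the SageMaker client, which may be simpler when custom error handling is not needed.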


## Test endpoint

In [59]:
text_inputs

['Sentence 1', 'Sentence 2']

In [60]:
http_client.InferInput("INPUT0", [len(text_inputs), 1], "BYTES")

<tritonclient.http.InferInput at 0x7f02dc867610>

In [61]:
text_inputs = ["Sentence 1", "Sentence 2"]

inputs = []
inputs.append(http_client.InferInput("INPUT0", [len(text_inputs), 1], "BYTES"))

batch_request = [[text_inputs[i]] for i in range(len(text_inputs))]

input0_real = np.array(batch_request, dtype=np.object_)

inputs[0].set_data_from_numpy(input0_real, binary_data=False)

len(input0_real)

2

In [62]:
outputs = []

outputs.append(http_client.InferRequestedOutput("SENT_EMBED"))

In [63]:
outputs

[<tritonclient.http.InferRequestedOutput at 0x7f0252033100>]

In [64]:
request_body, header_length = http_client.InferenceServerClient.generate_request_body(
    inputs, outputs=outputs
)

print(request_body)

{"inputs":[{"name":"INPUT0","shape":[2,1],"datatype":"BYTES","data":["Sentence 1","Sentence 2"]}],"outputs":[{"name":"SENT_EMBED","parameters":{"binary_data":true}}]}
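
The printed body is plain JSON, so it can also be assembled without the Triton client, which helps when calling the endpoint from an environment where `tritonclient` is not installed. This is a minimal sketch covering only the text-input case used here; `build_infer_request` is a hypothetical name, and it returns a `str` that may need encoding before being sent.

```python
import json


def build_infer_request(texts, input_name="INPUT0", output_name="SENT_EMBED"):
    """Build the JSON inference request for a batch of sentences."""
    payload = {
        "inputs": [
            {
                "name": input_name,
                # One sentence per row: shape is [batch_size, 1].
                "shape": [len(texts), 1],
                "datatype": "BYTES",
                "data": list(texts),
            }
        ],
        # Ask for the output tensor as raw binary after the JSON header.
        "outputs": [{"name": output_name, "parameters": {"binary_data": True}}],
    }
    # Compact separators match the body printed by the Triton client.
    return json.dumps(payload, separators=(",", ":"))
```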


In [44]:
response = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    # ContentType="application/vnd.sagemaker-triton.binary+json;json-header-size={}".format(
        # header_length
    # ),
    ContentType='application/octet-stream',
    Body=request_body,
    TargetModel='e2e-2.tar.gz'
)

In [45]:
header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
header_length_str = response["ContentType"][len(header_length_prefix) :]

# Read response body
result = http_client.InferenceServerClient.parse_response_body(
    response["Body"].read(), header_length=int(header_length_str)
)

outputs_data = result.as_numpy("SENT_EMBED")

for idx, output in enumerate(outputs_data):
    print(text_inputs[idx])
    print(output)

Sentence 1
[-0.00097987 -0.00352379 -0.004177   ... -0.00120587 -0.00202981
 -0.00294534]
Sentence 2
[-0.00035618 -0.0042098  -0.00419457 ... -0.00180162 -0.00149669
 -0.0010363 ]
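
The response is a JSON header followed by the raw tensor bytes, and the header size is carried in the `ContentType` suffix. In the local test in Part 3, for instance, the 148-byte header plus 8192 bytes of data (2 x 1024 FP32 values) accounts for the 8340-byte body. The sketch below decodes such a body with the standard library only. It handles FP32 outputs alone and assumes little-endian tensor data; both function names are hypothetical.

```python
import json
import struct

HEADER_PREFIX = "application/vnd.sagemaker-triton.binary+json;json-header-size="


def header_length_from_content_type(content_type):
    """Extract the JSON header size from the response ContentType."""
    if not content_type.startswith(HEADER_PREFIX):
        raise ValueError(f"Unexpected content type: {content_type}")
    return int(content_type[len(HEADER_PREFIX):])


def parse_fp32_response(body, header_length):
    """Split a binary+json body into {output_name: (shape, flat_values)}."""
    header = json.loads(body[:header_length])
    offset = header_length
    tensors = {}
    for out in header["outputs"]:
        if out["datatype"] != "FP32":
            raise ValueError(f"Only FP32 is handled here, got {out['datatype']}")
        size = out["parameters"]["binary_data_size"]
        # Assumption: tensor bytes are little-endian 4-byte floats.
        values = struct.unpack(f"<{size // 4}f", body[offset:offset + size])
        tensors[out["name"]] = (out["shape"], values)
        offset += size
    return tensors
```

The values come back flat; reshaping them to the reported `shape` (for example with NumPy) gives one embedding per input sentence.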


In [69]:
import time

for _ in range(10):
    for counter in range(1, 6):
        st = time.time()
        target_model=f"e2e-{counter}.tar.gz"
        print(f"invoking model {target_model}")
        response = sm_runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/octet-stream",
            Body=request_body,
            TargetModel=target_model,
        )

        header_length_prefix = "application/vnd.sagemaker-triton.binary+json;json-header-size="
        header_length_str = response["ContentType"][len(header_length_prefix) :]

        # Read response body
        result = http_client.InferenceServerClient.parse_response_body(
            response["Body"].read(), header_length=int(header_length_str)
        )

        outputs_data = result.as_numpy("SENT_EMBED")

        for idx, output in enumerate(outputs_data):
            print(text_inputs[idx])
            print(output)
        et = time.time()
        elapsed_time = et - st
        print('Execution time:', elapsed_time, 'seconds')

invoking model e2e-1.tar.gz
Sentence 1
[-0.00097987 -0.00352379 -0.004177   ... -0.00120587 -0.00202981
 -0.00294534]
Sentence 2
[-0.00035618 -0.0042098  -0.00419457 ... -0.00180162 -0.00149669
 -0.0010363 ]
Execution time: 81.8214521408081 seconds
invoking model e2e-2.tar.gz
Sentence 1
[-0.00097987 -0.00352379 -0.004177   ... -0.00120587 -0.00202981
 -0.00294534]
Sentence 2
[-0.00035618 -0.0042098  -0.00419457 ... -0.00180162 -0.00149669
 -0.0010363 ]
Execution time: 21.268391132354736 seconds
invoking model e2e-3.tar.gz
Sentence 1
[-0.00097987 -0.00352379 -0.004177   ... -0.00120587 -0.00202981
 -0.00294534]
Sentence 2
[-0.00035618 -0.0042098  -0.00419457 ... -0.00180162 -0.00149669
 -0.0010363 ]
Execution time: 21.287919759750366 seconds
invoking model e2e-4.tar.gz
Sentence 1
[-0.00097987 -0.00352379 -0.004177   ... -0.00120587 -0.00202981
 -0.00294534]
Sentence 2
[-0.00035618 -0.0042098  -0.00419457 ... -0.00180162 -0.00149669
 -0.0010363 ]
Execution time: 21.50484848022461 seconds
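
The first call to each model takes far longer than a typical inference would, which is consistent with the multi-model endpoint loading a model on its first invocation. To compare first-call and repeat-call latency, the timings can be collected and summarized per model. This is a sketch under that assumption; `summarize_latencies` is a hypothetical helper, and the sample values in the usage note are the four timings printed above.

```python
from collections import defaultdict
from statistics import mean


def summarize_latencies(samples):
    """Summarize (model_name, seconds) pairs given in invocation order."""
    per_model = defaultdict(list)
    for model, seconds in samples:
        per_model[model].append(seconds)

    summary = {}
    for model, times in per_model.items():
        warm = times[1:]  # everything after the first call to this model
        summary[model] = {
            "calls": len(times),
            "first_call_s": times[0],
            "warm_mean_s": mean(warm) if warm else None,
        }
    return summary
```

Inside the loop, appending `(target_model, elapsed_time)` to a list and passing it to `summarize_latencies` gives one row per model, for example `("e2e-1.tar.gz", 81.82)` and `("e2e-2.tar.gz", 21.27)` from the run shown here.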

# Part 5 - Test SageMaker Endpoint with Java Client

## Build Java App Docker Container

First, retrieve temporary credentials for the notebook instance role from the instance metadata service.

In [None]:
!curl http://169.254.169.254/latest/meta-data/iam/security-credentials/BaseNotebookInstanceEc2InstanceRole > tmp.json
# Load the credentials, then delete the temporary file so they are not left on disk
with open('tmp.json') as f:
    metadata = json.load(f)
os.remove('tmp.json')

In [None]:
with open('./java_client/credentials', 'a') as credentials_file:
    credentials_file.write("[default]\n")
    credentials_file.write(f"aws_access_key_id = {metadata['AccessKeyId']}\n")
    credentials_file.write(f"aws_secret_access_key = {metadata['SecretAccessKey']}\n")
    credentials_file.write(f"aws_session_token = {metadata['Token']}\n")

### Build the Docker Image

In [None]:
!docker build -t sagemaker-runtime-java-example ./java_client

In [None]:
os.remove('./java_client/credentials')

### Run the Docker Container to invoke the endpoint from Java Client

In [None]:
!docker run -e AWS_REGION={region} -e ENDPOINT_NAME={endpoint_name} sagemaker-runtime-java-example

# Part 6 - Delete the Endpoint

In [None]:
# Uncomment to delete the endpoint once you are done with it
# sm_client.delete_endpoint(EndpointName=endpoint_name)