Copyright 2022 NVIDIA Corporation. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: center;">

## Installs <a class="anchor" id="installs-and-set-up"></a>

Install the required packages using pip.

In [None]:
!pip install -qU pip boto3 sagemaker awscli

#### Imports and variables

In [None]:
# imports
import boto3
import sagemaker
from sagemaker import get_execution_role

# sagemaker variables
role = get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=boto3.Session())
bucket = sagemaker_session.default_bucket()

#variables
prefix = "aim401-mme-gpu"
model_name = "efficientnet_b0"
pytorch_model_file_name = f"{model_name}_pt_v0.tar.gz"
tensorrt_model_file_name = f"{model_name}_trt_v0.tar.gz"

# additional imports
import time

# variables
s3_client = boto3.client("s3")
ts = time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

# sagemaker clients (role, sagemaker_session and bucket are defined above)
sm_client = boto3.client(service_name="sagemaker")
runtime_sm_client = boto3.client("sagemaker-runtime")

%store -r

# endpoint variables
sm_model_name = f"{prefix}-mdl-{ts}"
endpoint_config_name = f"{prefix}-epc-{ts}"
endpoint_name = f"{prefix}-ep-{ts}"
model_data_url = f"s3://{bucket}/{prefix}/"
instance_type = "ml.g4dn.xlarge"

### SageMaker Triton Container Image

In [None]:
allowed_regions = ["us-east-1","us-east-2","us-west-1","us-west-2"]

region = boto3.Session().region_name
if region not in allowed_regions:
    raise Exception("UNSUPPORTED REGION")

base = "amazonaws.com.cn" if region.startswith("cn-") else "amazonaws.com"
mme_triton_image_uri = (
    f"875666977404.dkr.ecr.{region}.{base}/sagemaker-tritonserver-ensemble:22.10-py3"
)

## Creating Model Artifacts <a class="anchor" id="pytorch-efficientnet-model"></a>

This section gives an overview of the steps to prepare the pre-trained efficientnet_b0 model for deployment on a SageMaker MME using Triton Inference Server model configurations.

<div class="alert alert-info"><strong> Note </strong>
We are demonstrating deployment with
</div>

### Prepare PyTorch Model  <a class="anchor" id="create-pytorch-model"></a>

- First, we load a pre-trained efficientnet_b0 model using the torchvision models package.
- We save the model as a model.pt file in the optimized, serialized TorchScript format.
- TorchScript needs an example input to run a forward pass through the model, so we pass an input of dimension 3x224x224.
- The [generate_model_pytorch.sh](./workspace/generate_model_pytorch.sh) file contains the script that generates a PyTorch EfficientNet B0 model.

Run the cell below and check out the [pt_exporter.py](./workspace/pt_exporter.py) file for more details.

### Export model artifacts to S3 <a class="anchor" id="export-to-s3"></a>

SageMaker expects the model artifacts in the layout used below, which must also satisfy Triton container requirements such as the model name, version, and `config.pbtxt` files. `tar` the folder containing the model file as `model.tar.gz` and upload it to S3.
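The layout Triton expects inside the archive can be sketched in plain Python as well. This is a minimal illustration, not the notebook's own method: `package_triton_model` is a hypothetical helper, the model file is a placeholder, and the two-line `config.pbtxt` is illustrative rather than a complete configuration.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def package_triton_model(model_file, model_name, archive_path, config_text=None, version="1"):
    """Write <model_name>/<version>/<file> (plus an optional config.pbtxt) into a .tar.gz.

    Hypothetical helper: returns the member names so the layout can be inspected.
    """
    with tarfile.open(str(archive_path), "w:gz") as tar:
        if config_text is not None:
            # config.pbtxt sits next to the version directories, not inside them.
            data = config_text.encode("utf-8")
            info = tarfile.TarInfo(name=f"{model_name}/config.pbtxt")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
        tar.add(str(model_file), arcname=f"{model_name}/{version}/{Path(model_file).name}")
    with tarfile.open(str(archive_path), "r:gz") as tar:
        return tar.getnames()


# Demo with a placeholder file standing in for a real TensorRT engine.
with tempfile.TemporaryDirectory() as tmp:
    plan = Path(tmp) / "model.plan"
    plan.write_bytes(b"placeholder")
    members = package_triton_model(
        plan,
        "efficientnet_b0",
        Path(tmp) / "efficientnet_b0_trt_v0.tar.gz",
        config_text='name: "efficientnet_b0"\nplatform: "tensorrt_plan"\n',  # illustrative only
    )
    print(members)
```

The top-level directory name inside the archive is the name Triton uses for the model, and the numeric subdirectory is the model version.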

In [None]:
!mkdir -p triton-serve-trt/efficientnet_b0/1/
!mv -f workspace/model.plan triton-serve-trt/efficientnet_b0/1/
!tar -C triton-serve-trt/ -czf $tensorrt_model_file_name efficientnet_b0
model_uri_trt = sagemaker_session.upload_data(path=tensorrt_model_file_name, key_prefix=prefix)

In [None]:
print(f"TensorRT Model S3 location: {model_uri_trt}")

### Deploy Models with MME <a class="anchor" id="deploy-models-with-mme"></a>

We will now deploy the EfficientNet B0 and T5 models with different framework backends (PyTorch, Python, and TensorRT) to a SageMaker MME.


<div class="alert alert-info"> <strong> Note </strong>
You can deploy hundreds of models. The models can use the same framework or, as shown here, different frameworks.
</div>

We will use the AWS SDK for Python (Boto3) APIs [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model), [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) and [create_endpoint](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint) to create a multi-model endpoint.

### Define the serving container  <a class="anchor" id="define-container-def"></a>

In the container definition, set `ModelDataUrl` to the S3 directory that contains all the models the SageMaker multi-model endpoint will load and serve predictions from. Set `Mode` to `MultiModel` to indicate that SageMaker should create the endpoint with MME container specifications. We set the container to an image that supports deploying multi-model endpoints with GPU; see MME [container images](https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html#multi-model-support) for more details.

In [None]:
container = {"Image": mme_triton_image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}
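Because `ModelDataUrl` points at a prefix rather than a single archive, a small guard can catch an accidental file path before the model is created. The following is a sketch under stated assumptions: `build_mme_container` is a hypothetical helper, and normalizing to a trailing slash simply mirrors how `model_data_url` is built earlier in this notebook.

```python
def build_mme_container(image_uri, model_data_url):
    """Return a MultiModel container definition after basic sanity checks.

    Hypothetical helper; the checks are illustrative, not SageMaker's own validation.
    """
    if not model_data_url.startswith("s3://"):
        raise ValueError(f"ModelDataUrl must be an S3 URI, got: {model_data_url}")
    if model_data_url.endswith(".tar.gz"):
        raise ValueError("For MultiModel mode, pass the S3 prefix, not a single archive")
    if not model_data_url.endswith("/"):
        model_data_url += "/"
    return {"Image": image_uri, "ModelDataUrl": model_data_url, "Mode": "MultiModel"}


# Placeholder image URI and bucket name, for illustration only.
demo_container = build_mme_container("example-image-uri", "s3://example-bucket/aim401-mme-gpu")
print(demo_container)
```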

### Create a MME object <a class="anchor" id="create-mme-model-obj"></a>

Using the SageMaker boto3 client, create the model using the [create_model](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_model) API. We pass the container definition to the API along with `ModelName` and `ExecutionRoleArn`.


In [None]:
create_model_response = sm_client.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

### Define configuration for the MME <a class="anchor" id="config-mme"></a>

Create a multi-model endpoint configuration using the [create_endpoint_config](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker.html#SageMaker.Client.create_endpoint_config) boto3 API. Specify an accelerated GPU computing instance in `InstanceType` (we will use the same instance type that hosts our SageMaker Notebook). For real-life use cases, we recommend configuring your endpoints with at least two instances. This allows SageMaker to provide a highly available set of predictions across multiple Availability Zones for the models.




In [None]:
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

### Create MME  <a class="anchor" id="create-mme"></a>

Using the above endpoint configuration, we create a new SageMaker endpoint and wait for the deployment to finish. The status will change to **InService** once the deployment is successful.

In [None]:
create_endpoint_response = sm_client.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

### Describe MME <a class="anchor" id="describe-mme"></a>

Now, we check the status of the endpoint using `describe_endpoint`. This step takes about 5 minutes to complete, and you should see the "Status: InService" message before you proceed to the next cells.

In [None]:
resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm_client.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
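The loop above keeps polling for as long as the status is "Creating" and does not distinguish a successful deployment from a failed one. A minimal sketch of the same polling with a timeout and an explicit failure check follows; `wait_for_endpoint` is a hypothetical helper, and the demo uses a fake `describe` function so it runs without AWS access.

```python
import time


def wait_for_endpoint(describe, endpoint_name, poll_seconds=60, timeout_seconds=1800, sleep=time.sleep):
    """Poll describe(EndpointName=...) until the endpoint leaves 'Creating'.

    Raises TimeoutError if the timeout elapses, RuntimeError if the final
    status is anything other than 'InService'. Hypothetical helper.
    """
    waited = 0
    while True:
        status = describe(EndpointName=endpoint_name)["EndpointStatus"]
        if status != "Creating":
            break
        if waited >= timeout_seconds:
            raise TimeoutError(f"{endpoint_name} still Creating after {waited}s")
        sleep(poll_seconds)
        waited += poll_seconds
    if status != "InService":
        raise RuntimeError(f"{endpoint_name} ended in status {status}")
    return status


# Demo with a fake describe call that returns a scripted sequence of statuses.
_statuses = iter(["Creating", "Creating", "InService"])


def fake_describe(EndpointName):
    return {"EndpointStatus": next(_statuses)}


final_status = wait_for_endpoint(fake_describe, "demo-endpoint", sleep=lambda seconds: None)
print("Status: " + final_status)
```

In the notebook this would be called as `wait_for_endpoint(sm_client.describe_endpoint, endpoint_name)`. boto3 also provides a built-in waiter, `sm_client.get_waiter("endpoint_in_service")`, which may be a simpler option.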

## Perform Sample Inference

In [None]:
texts = ["I feel irritated and rejected without anyone doing anything or saying anything",
         "I become overwhelmed and feel defeated."
        ]

In [None]:
xdistilbert_emotion_classification_client.request_inference(texts,
                              model_name="xdistilbert_pt",
                              triton_url="localhost:8000")
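The request above goes to a Triton server at `localhost:8000`. Against the multi-model endpoint created earlier, a request would instead go through `invoke_endpoint`, with `TargetModel` naming the archive under `ModelDataUrl` that should serve it. The sketch below is an assumption-laden illustration: the helper names, the input tensor name `INPUT__0`, the `FP32` datatype and the `application/octet-stream` content type are assumptions that must match the model's `config.pbtxt` and the container's expectations, and a fake client stands in for `runtime_sm_client` so the example runs offline.

```python
import io
import json


def build_infer_payload(values, shape, input_name="INPUT__0", datatype="FP32"):
    """Build a JSON inference request body; tensor name and datatype are assumptions."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(values) != expected:
        raise ValueError(f"expected {expected} values for shape {shape}, got {len(values)}")
    return {
        "inputs": [
            {"name": input_name, "shape": list(shape), "datatype": datatype, "data": list(values)}
        ]
    }


def invoke_target_model(runtime_client, endpoint_name, target_model, payload):
    """Send the payload to one model hosted on the multi-model endpoint."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/octet-stream",  # assumption, see lead-in
        Body=json.dumps(payload),
        TargetModel=target_model,  # archive name relative to ModelDataUrl
    )
    return json.loads(response["Body"].read())


class FakeRuntimeClient:
    """Stand-in for boto3.client('sagemaker-runtime'); records the call it receives."""

    def invoke_endpoint(self, **kwargs):
        self.last_call = kwargs
        return {"Body": io.BytesIO(b'{"outputs": []}')}


fake_client = FakeRuntimeClient()
payload = build_infer_payload([0.0] * (3 * 224 * 224), (1, 3, 224, 224))
result = invoke_target_model(fake_client, "demo-endpoint", "efficientnet_b0_trt_v0.tar.gz", payload)
print(result)
```

With the real client, the call would be `invoke_target_model(runtime_sm_client, endpoint_name, tensorrt_model_file_name, payload)`.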