In [1]:
!pip install --upgrade pip
!pip -q install sagemaker awscli boto3 pandas --upgrade 



## Example: TorchServe Performance Tuning on Amazon SageMaker

This example shows how to tune TorchServe performance, build a TorchServe container, and host it on Amazon SageMaker. Amazon SageMaker hosting is fully managed: you specify the instance type and the minimum and maximum number of instances, and SageMaker takes care of the rest.

There are two options to tune TorchServe performance on SageMaker:

1. Tune the following TorchServe parameters in config.properties (see [TorchServe configuration](https://github.com/pytorch/serve/blob/master/docs/configuration.md#other-properties)):

* number_of_netty_threads
* netty_client_threads
* async_logging
* minWorkers
* maxWorkers
* batchSize

2. Use [SageMaker batch transform](https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html).

A SageMaker batch transform job provides two strategies for sending HTTP requests to TorchServe, and one parameter for adjusting concurrency.

* Two strategies:
1) SingleRecord: each HTTP request contains one record.
2) MultiRecord: each HTTP request contains multiple records. This is client-side batching, and the model handler must split each request back into individual records.

* Concurrency parameter:
[MaxConcurrentTransforms](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTransformJob.html#sagemaker-CreateTransformJob-request-MaxConcurrentTransforms): The maximum number of parallel requests that can be sent to each instance in a transform job.
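To make the difference between the two strategies concrete, here is a minimal sketch in plain Python. It does not call SageMaker. `build_request_bodies` and `max_records_per_request` are hypothetical names used only for illustration, and newline-delimited records are assumed. The real service groups MultiRecord requests by payload size rather than by a record count, so treat the grouping rule here as a simplification.

```python
from typing import List


def build_request_bodies(records: List[str], strategy: str,
                         max_records_per_request: int = 3) -> List[str]:
    """Group newline-delimited records into HTTP request bodies.

    Illustrative only: the record-count cap stands in for the payload
    size limit that the real service applies to MultiRecord requests.
    """
    if strategy == "SingleRecord":
        # One record per request; the server sees each record on its own.
        return list(records)
    if strategy == "MultiRecord":
        # Several records per request; the model handler must split them.
        return [
            "\n".join(records[i:i + max_records_per_request])
            for i in range(0, len(records), max_records_per_request)
        ]
    raise ValueError(f"unknown strategy: {strategy}")


records = ["Hello world", "Good morning", "Good evening", "Thank you"]
single = build_request_bodies(records, "SingleRecord")
multi = build_request_bodies(records, "MultiRecord")
print(len(single), "requests with SingleRecord")
print(len(multi), "requests with MultiRecord")
```

With SingleRecord the server receives one record at a time, which leaves batching to TorchServe itself.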

In summary, we recommend configuring TorchServe on SageMaker as follows:
1. Use TorchServe dynamic batching
2. Use the SageMaker batch transform SingleRecord strategy
3. Set MaxConcurrentTransforms

## TorchServe dynamic batching

### config.properties

In [2]:
!cat config.properties

inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
number_of_netty_threads=32
job_queue_size=1000
model_store=/opt/ml/model
load_models=all
install_py_dep_per_model=true
default_response_timeout=300
unregister_model_timeout=300
-XX:-UseContainerSupport -XX:+UnlockDiagnosticVMOptions -XX:+PrintActiveCpus
models={\
  "TransformerEn2Fr": {\
    "1.0": {\
        "defaultVersion": true,\
        "marName": "TransformerEn2Fr.mar",\
        "minWorkers": 1,\
        "maxWorkers": 4,\
        "batchSize": 4,\
        "maxBatchDelay": 500,\
        "responseTimeout": 120\
    }\
  }\
}
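The interaction between `batchSize` and `maxBatchDelay` in the configuration above can be sketched with a small simulation. This is a simplified model, not TorchServe's implementation: it assumes a single worker, zero inference time, and that a batch closes as soon as it is full or its delay window has expired. `simulate_dynamic_batching` is a hypothetical helper.

```python
from typing import List


def simulate_dynamic_batching(arrivals_ms: List[int], batch_size: int,
                              max_batch_delay_ms: int) -> List[List[int]]:
    """Group request arrival times (in ms) into batches.

    Simplified model: a batch opens when its first request arrives and
    closes once it holds `batch_size` requests or `max_batch_delay_ms`
    has passed, whichever comes first.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    for t in sorted(arrivals_ms):
        if current and t - current[0] > max_batch_delay_ms:
            # The delay window expired before this request arrived.
            batches.append(current)
            current = []
        current.append(t)
        if len(current) == batch_size:
            # The batch is full, so it is dispatched immediately.
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


# Values taken from the config above: batchSize=4, maxBatchDelay=500.
arrivals = [0, 100, 200, 300, 400, 1200, 1300]
batches = simulate_dynamic_batching(arrivals, batch_size=4,
                                    max_batch_delay_ms=500)
print(batches)
```

In this model, a larger `maxBatchDelay` gives sparse traffic more time to fill a batch, at the cost of added latency for the first request in each batch.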



### Clone the TorchServe repository

In [None]:
!git clone https://github.com/pytorch/serve.git

In [None]:
!cd /home/ec2-user/SageMaker/torchserve_perf/serve && git checkout issue_1107

### Download a PyTorch model 

In [3]:
model_name = "TransformerEn2Fr"
mar_file = f'{model_name}.mar'
mar_url = f'https://torchserve.pytorch.org/mar_files/{mar_file}'
!wget -q {mar_url}
!ls *.mar

TransformerEn2Fr.mar


### Upload the TransformerEn2Fr.mar archive file to Amazon S3
Create a compressed tar.gz file from the TransformerEn2Fr.mar file, because Amazon SageMaker expects model artifacts in a tar.gz archive.
Then upload the archive to your default Amazon SageMaker S3 bucket under the models directory.

### Create a boto3 session and get a role with SageMaker access

In [4]:
import boto3, time, json
sess    = boto3.Session()
sm      = sess.client('sagemaker')
region  = sess.region_name
account = boto3.client('sts').get_caller_identity().get('Account')

In [5]:
import sagemaker
role = sagemaker.get_execution_role()
sagemaker_session = sagemaker.Session(boto_session=sess)

In [6]:
bucket_name = sagemaker_session.default_bucket()
prefix = 'torchserve'

!tar cvfz {model_name}.tar.gz {mar_file}
!aws s3 cp {model_name}.tar.gz s3://{bucket_name}/{prefix}/models/
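The `tar` step can also be done from Python with the standard `tarfile` module, which makes it easy to check what ended up in the archive. The sketch below uses a placeholder file instead of a real model archive. `make_model_archive` and the `demo_model` file names are hypothetical.

```python
import os
import tarfile
import tempfile


def make_model_archive(mar_path: str, archive_path: str) -> str:
    """Pack a .mar file into a gzip-compressed tar archive."""
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store the file at the archive root, without its directory path.
        tar.add(mar_path, arcname=os.path.basename(mar_path))
    return archive_path


# Demonstrate with a placeholder file rather than a real model archive.
workdir = tempfile.mkdtemp()
demo_mar = os.path.join(workdir, "demo_model.mar")
with open(demo_mar, "wb") as f:
    f.write(b"placeholder")

archive = make_model_archive(demo_mar,
                             os.path.join(workdir, "demo_model.tar.gz"))
with tarfile.open(archive, "r:gz") as tar:
    members = tar.getnames()
print(members)
```

Keeping the .mar file at the archive root matters because SageMaker extracts the archive into /opt/ml/model, which is the `model_store` directory set in config.properties.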

### Create an Amazon ECR registry
Create a new Docker container registry for your TorchServe container images.

In [7]:
registry_name = 'torchserve-perf'
!aws ecr create-repository --repository-name {registry_name}


An error occurred (RepositoryAlreadyExistsException) when calling the CreateRepository operation: The repository with name 'torchserve-perf' already exists in the registry with id '057122759684'
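The `RepositoryAlreadyExistsException` above is expected when the cell is re-run. One way to make the step idempotent is to catch that exception with boto3 and look up the existing repository instead. This is a sketch, with `ensure_ecr_repository` as a hypothetical helper. It takes the ECR client as a parameter, so nothing is called until you pass in a real client.

```python
def ensure_ecr_repository(ecr_client, repository_name: str) -> str:
    """Return the repository URI, creating the repository if needed."""
    try:
        response = ecr_client.create_repository(
            repositoryName=repository_name
        )
        return response["repository"]["repositoryUri"]
    except ecr_client.exceptions.RepositoryAlreadyExistsException:
        # The repository exists already, so look it up instead.
        response = ecr_client.describe_repositories(
            repositoryNames=[repository_name]
        )
        return response["repositories"][0]["repositoryUri"]


# Usage in this notebook (needs AWS credentials), left commented out:
# ecr = sess.client("ecr")
# print(ensure_ecr_repository(ecr, registry_name))
```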


### Build a TorchServe Docker container and push it to Amazon ECR

In [9]:
image_label = 'v1'
image = f'{account}.dkr.ecr.{region}.amazonaws.com/{registry_name}:{image_label}'

!docker build -t {registry_name}:{image_label} .
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com
!docker tag {registry_name}:{image_label} {image}
!docker push {image}

### Deploy endpoint and make prediction using Amazon SageMaker SDK

In [10]:
from sagemaker.model import Model
from sagemaker.predictor import Predictor

model_data = f's3://{bucket_name}/{prefix}/models/{model_name}.tar.gz'
sm_model_name = f'torchserve-{model_name}'

torchserve_model = Model(model_data = model_data, 
                         image_uri = image,
                         role  = role,
                         predictor_cls=Predictor,
                         name  = sm_model_name)

In [None]:
endpoint_name = 'torchserve-endpoint-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())

predictor = torchserve_model.deploy(instance_type='ml.g4dn.xlarge',
                                    initial_instance_count=1,
                                    endpoint_name = endpoint_name)

### Test the TorchServe hosted model

TorchServe dynamic batching is transparent to the client. TorchServe aggregates a model's incoming prediction requests, processes them as a batch, and distributes the responses back to the clients.

In [None]:
payload = "Hi James, when are you coming back home? I am waiting for you. Please come as soon as possible."    
response = predictor.predict(data=payload)
print(response)
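A single sequential request is unlikely to exercise dynamic batching, since requests can only be grouped when several arrive close together. The sketch below sends payloads in parallel with a thread pool. `predict_concurrently` and `fake_predict` are hypothetical helpers, and the fake stands in for the endpoint so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def predict_concurrently(predict_fn: Callable[[str], object],
                         payloads: List[str],
                         max_workers: int = 4) -> List[object]:
    """Send payloads in parallel and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the inputs in its results.
        return list(pool.map(predict_fn, payloads))


def fake_predict(text: str) -> str:
    """Stand-in for the endpoint call so the sketch runs offline."""
    return text.upper()


payloads = ["Hello world", "Good morning", "Thank you", "Go to Paris"]
results = predict_concurrently(fake_predict, payloads)
print(results)

# Against the deployed endpoint, one could pass the predictor instead:
# results = predict_concurrently(
#     lambda p: predictor.predict(data=p), payloads)
```

Setting `max_workers` to the configured `batchSize` is a reasonable starting point for seeing full batches, though actual batch sizes depend on request timing.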

## SageMaker Batch Transform Jobs

In [11]:
batch_input = f's3://{bucket_name}/{model_name}/batch_transform_torchserve_sagemaker_input/'
batch_output = f's3://{bucket_name}/{model_name}/batch_transform_torchserve_sagemaker_output/'
batch_job_name = f'{model_name}-batch-' + time.strftime("%Y-%m-%d-%H-%M-%S", time.gmtime())
transform_input = sagemaker_session.upload_data('serve/examples/nmt_transformer/model_input/', 
                                                bucket=bucket_name, 
                                                key_prefix=f'{model_name}/batch_transform_torchserve_sagemaker_input')
transform_input

's3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_input'

In [13]:
transformer = sagemaker.transformer.Transformer(model_name=sm_model_name, 
                                                instance_count=1, 
                                                instance_type='ml.m4.xlarge',
                                                strategy="SingleRecord",
                                                max_concurrent_transforms=2,
                                                assemble_with=None, 
                                                output_path=batch_output, 
                                                sagemaker_session=sagemaker_session)
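For reference, the `strategy` and `max_concurrent_transforms` arguments correspond to the `BatchStrategy` and `MaxConcurrentTransforms` fields of the CreateTransformJob API. The sketch below only builds the request dictionary and does not submit a job. `build_transform_job_request`, the job name, and the example bucket URIs are hypothetical.

```python
from typing import Any, Dict


def build_transform_job_request(job_name: str, model_name: str,
                                input_s3_uri: str, output_s3_uri: str,
                                strategy: str = "SingleRecord",
                                max_concurrent_transforms: int = 2,
                                instance_type: str = "ml.m4.xlarge",
                                instance_count: int = 1) -> Dict[str, Any]:
    """Build keyword arguments for the CreateTransformJob API."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "BatchStrategy": strategy,
        "MaxConcurrentTransforms": max_concurrent_transforms,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "S3Prefix",
                    "S3Uri": input_s3_uri,
                }
            }
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
        },
    }


request = build_transform_job_request(
    job_name="demo-batch-job",
    model_name="torchserve-TransformerEn2Fr",
    input_s3_uri="s3://example-bucket/input/",
    output_s3_uri="s3://example-bucket/output/",
)
print(request["BatchStrategy"], request["MaxConcurrentTransforms"])

# Submitting it would need AWS credentials:
# sm.create_transform_job(**request)
```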

In [14]:
transformer.transform(data=transform_input)
transformer.wait()

...........................................
CUDNN_VERSION=7.6.5.32
PYTHONUNBUFFERED=TRUE
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
SAGEMAKER_MAX_CONCURRENT_TRANSFORMS=2
SAGEMAKER_BATCH_STRATEGY=SINGLE_RECORD
LANG=C.UTF-8
SAGEMAKER_SAFE_PORT_RANGE=10000-10999
HOSTNAME=c166cabe72bf
PYTHONIOENCODING=UTF-8
NVIDIA_VISIBLE_DEVICES=all
NCCL_VERSION=2.10.3
PWD=/home/model-server
HOME=/root
SAGEMAKER_BATCH=true
AWS_REGION=us-east-2
SAGEMAKER_BIND_TO_PORT=8080
CUDA_PKG_VERSION=10-2=10.2.89-1
CUDA_VERSION=10.2.89
NVIDIA_DRIVER_CAPABILITIES=compute,utility
SHLVL=1
NVIDIA_REQUIRE_CUDA=cuda>=10.2 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441
AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=/v2/credentials/54

In [15]:
print(transformer.output_path)

s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/


In [16]:
!aws s3 cp --recursive $transformer.output_path ./

download: s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/sample2.txt.out to ./sample2.txt.out
download: s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/sample.txt.out to ./sample.txt.out
download: s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/sample5.txt.out to ./sample5.txt.out
download: s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/sample3.txt.out to ./sample3.txt.out
download: s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/sample1.txt.out to ./sample1.txt.out
download: s3://sagemaker-us-east-2-057122759684/TransformerEn2Fr/batch_transform_torchserve_sagemaker_output/sample4.txt.out to ./sample4.txt.out


In [18]:
!head sample*.txt.out

==> sample1.txt.out <==
{"input": "Hello World !!!\n", "french_output": "Bonjour le monde ! ! !"}
==> sample2.txt.out <==
{"input": "Hi James, when are you coming back home? I am waiting for you.\nPlease come as soon as possible.\n", "french_output": "Bonjour James, quand rentrerez-vous chez vous, je vous attends et je vous prie de venir le plus t\u00f4t possible."}
==> sample3.txt.out <==
{"input": "I\u2019m sorry, I don\u2019t remember your name. You are you?\n", "french_output": "Je vous prie de m'excuser, je ne me souviens pas de votre nom."}
==> sample4.txt.out <==
{"input": "I\u2019m well. How are you?\nIt\u2019s going well, thank you. How are you doing?\nFine, thanks. And yourself?\n", "french_output": "Je me sens bien. Comment allez-vous ? \u00c7a va bien, merci. Comment allez-vous ?"}
==> sample5.txt.out <==
{"input": "Hello world\nGood morning\nGood evening\nThis is a dog\nThis is a cat\nGo swimming\nGo to Paris\nThank you\n", "french_output": "Bonjour Monde Bonjour 