<h1>Multi Model Server Container for Differential Deep Learning</h1>

This notebook demonstrates how to build and use a custom Docker container for Amazon SageMaker that uses the <strong>Multi Model Server (MMS)</strong> and <strong>sagemaker-inference-toolkit</strong> libraries to serve models through Amazon SageMaker endpoints.
We will also see how MMS allows deploying multiple models on a single endpoint thanks to the multi-model endpoints functionality of Amazon SageMaker Hosting (https://docs.aws.amazon.com/sagemaker/latest/dg/multi-model-endpoints.html).

Useful links:
- https://github.com/awslabs/multi-model-server/
- https://github.com/aws/sagemaker-inference-toolkit

We start by defining some variables like the current execution role, the ECR repository that we are going to use for pushing the custom Docker container and a default Amazon S3 bucket to be used by Amazon SageMaker.

In [1]:
import boto3
import sagemaker
from sagemaker import get_execution_role

ecr_namespace = 'sagemaker-serving-containers/'
prefix = 'diffdl-container'

ecr_repository_name = ecr_namespace + prefix
role = get_execution_role()
account_id = role.split(':')[4]
region = boto3.Session().region_name
sagemaker_session = sagemaker.session.Session()
bucket = sagemaker_session.default_bucket()

print(account_id)
print(region)
print(role)
print(bucket)

785577973223
us-east-1
arn:aws:iam::785577973223:role/service-role/AmazonSageMaker-ExecutionRole-20210715T110490
sagemaker-us-east-1-785577973223
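
The account ID above comes from the fifth colon-separated field of the role ARN (`arn:partition:service:region:account-id:resource`). As a minimal sketch, the same extraction can be done with a basic sanity check; the helper name `account_id_from_arn` and the example ARN are illustrative assumptions, not part of the original notebook.

```python
def account_id_from_arn(arn: str) -> str:
    """Return the account-id field of an ARN.

    ARNs follow the layout arn:partition:service:region:account-id:resource,
    so the account ID is the field at index 4.
    """
    # Limit the split so colons inside the resource part are left untouched.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[4]


# Hypothetical ARN used only for illustration.
example_arn = "arn:aws:iam::123456789012:role/service-role/ExampleRole"
print(account_id_from_arn(example_arn))
```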


Let's take a look at the Dockerfile which defines the statements for building our serving container:

In [3]:
! pygmentize ../docker/Dockerfile

FROM tensorflow/tensorflow:latest

LABEL maintainer="Oleg Grytsynevych"

# Set a docker label to advertise multi-model support on the container
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true
# Set a docker label to enable container to use SAGEMAKER_BIND_TO_PORT environment variable if present
LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true

# Install python and other runtime dependencies
RUN apt-get update && \
    apt-get -y install \
        build-essential \
        libatlas-dev \
        git \
        wget \
        curl \
        openjdk-8-jdk-headless

# Python won’t try to write .pyc or .pyo files on the import of source modules
# Force stdin

At a high level, the Dockerfile specifies the following operations for building this container:

- Set two Docker labels to advertise multi-model support and to enable the container to use the SAGEMAKER_BIND_TO_PORT environment variable, if present
- Install libraries (including OpenJDK since MMS frontend is Java-based) and Python 3.6 through miniconda
- Set a few environment variables, including PYTHONUNBUFFERED which is used to avoid buffering Python standard output (useful for logging)
- Install XGBoost (it is the ML framework of choice for this example)
- Install Multi Model Server (MMS) and SageMaker Inference Toolkit
- Copy a .tar.gz package named <strong>multi_model_serving-1.0.0.tar.gz</strong> in the WORKDIR
- Install this package
- Copy the serve.py file in the WORKDIR and use it as the Docker ENTRYPOINT
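
To illustrate what the accept-bind-to-port label implies, here is a minimal sketch of how serving code could pick its listening port: use SAGEMAKER_BIND_TO_PORT when SageMaker sets it, otherwise fall back to 8080, the default SageMaker serving port. The helper name `serving_port` is an assumption for illustration; in this container the port handling is left to the toolkit and MMS.

```python
import os

DEFAULT_PORT = 8080  # default port SageMaker expects a serving container to listen on


def serving_port(environ=None) -> int:
    """Return the port to bind to, honouring SAGEMAKER_BIND_TO_PORT if present."""
    if environ is None:
        environ = os.environ
    return int(environ.get("SAGEMAKER_BIND_TO_PORT", DEFAULT_PORT))


# Hypothetical environments, used only to show both branches.
print(serving_port({}))
print(serving_port({"SAGEMAKER_BIND_TO_PORT": "9000"}))
```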

Let's see the content of the <strong>serve.py</strong> file.

In [4]:
! pygmentize ../docker/code/serve.py

from __future__ import absolute_import

from subprocess import CalledProcessError
from retrying import retry
from multi_model_serving import handler_service
from sagemaker_inference import model_server

HANDLER_SERVICE = handler_service.__name__

def _retry_if_error(exception):
    return isinstance(exception, CalledProcessError)

@retry(stop_max_delay=1000 * 30,
       retry_on_exception=_retry_if_error)
def _start_model_server():
    # there's a race condition that causes the model server command to
    # sometimes fail with 'bad address'. more i

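The `@retry` decorator in serve.py keeps retrying the server start for up to 30 seconds, but only when the failure is a `CalledProcessError`. The sketch below reproduces that behaviour with the standard library alone, to make the control flow explicit. The function name `retry_on_called_process_error` and the `flaky_start` stand-in are assumptions for illustration, not code from the container.

```python
import subprocess
import time


def retry_on_called_process_error(func, max_delay_s=30.0, wait_s=0.0):
    """Call func() until it succeeds or the time budget is exhausted.

    Only subprocess.CalledProcessError triggers a retry; any other
    exception propagates immediately.
    """
    deadline = time.monotonic() + max_delay_s
    while True:
        try:
            return func()
        except subprocess.CalledProcessError:
            if time.monotonic() >= deadline:
                raise  # budget spent: surface the last failure
            time.sleep(wait_s)


# Hypothetical stand-in for starting the model server: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_start():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise subprocess.CalledProcessError(1, ["start-model-server"])
    return "started"


print(retry_on_called_process_error(flaky_start))
```
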
<h2>Handler Service</h2>

When looking at the Dockerfile above, you might be asking yourself what the <strong>multi_model_serving-1.0.0.tar.gz</strong> package is.
When building a framework container for serving, sagemaker-inference-toolkit allows you to pass a handler service that defines the default inference handling logic, used when users do not pass a custom inference script. The package above contains this code.
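
To make the idea of default handling logic concrete, the following sketch chains the four functions used by the sagemaker-inference-toolkit convention (`model_fn`, `input_fn`, `predict_fn`, `output_fn`) and lets a user script override any of them. It is a simplified stand-in, not the toolkit's implementation: `DefaultHandlers`, `handle`, and `user_script` are hypothetical names, and a real server would load the model once rather than on every request.

```python
import json
from types import SimpleNamespace


class DefaultHandlers:
    """Hypothetical default handlers, used when the user script omits a function."""

    def model_fn(self, model_dir):
        raise NotImplementedError("the user script must provide model_fn")

    def input_fn(self, input_data, content_type):
        if content_type != "application/json":
            raise ValueError(f"unsupported content type: {content_type}")
        return json.loads(input_data)

    def predict_fn(self, data, model):
        return model(data)

    def output_fn(self, prediction, accept):
        return json.dumps(prediction)


def handle(body, content_type, accept, model_dir, defaults, user_module=None):
    """Run one request through model_fn -> input_fn -> predict_fn -> output_fn."""

    def pick(name):
        # Prefer the user's function; fall back to the default handler.
        return getattr(user_module, name, None) or getattr(defaults, name)

    model = pick("model_fn")(model_dir)  # simplified: loaded per request here
    data = pick("input_fn")(body, content_type)
    prediction = pick("predict_fn")(data, model)
    return pick("output_fn")(prediction, accept)


# Hypothetical user script that only defines model_fn (a "model" that doubles inputs).
user_script = SimpleNamespace(model_fn=lambda model_dir: (lambda xs: [x * 2 for x in xs]))
print(handle("[1, 2, 3]", "application/json", "application/json",
             "/opt/ml/model", DefaultHandlers(), user_script))
```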

This is the content of the handler service:

In [5]:
! pygmentize ../package/src/multi_model_serving/handler_service.py

from sagemaker_inference.default_handler_service import DefaultHandlerService
from sagemaker_inference.transformer import Transformer
from multi_model_serving.default_inference_handler import DefaultDiffDLInferenceHandler

import os
import sys

ENABLE_MULTI_MODEL = os.getenv("SAGEMAKER_MULTI_MODEL", "false") == "true"

class HandlerService(DefaultHandlerService):
    def __init__(self):
        self

And this is the logic defined in the default inference handler:

In [6]:
! pygmentize ../package/src/multi_model_serving/default_inference_handler.py

import pickle as pkl
from sagemaker_inference import content_types, decoder, default_inference_handler, encoder, errors
from multi_model_serving import encoder as xgb_encoders

class DefaultDiffDLInferenceHandler(default_inference_handler.DefaultInferenceHandler):

    def default_input_fn(self, input_data, content_type):
        """Take request data and de-serializes the data into an object for prediction.
        When an InvokeEndpoint operation is made against an Endpoint running SageMaker model server,
        the model server receives two pieces of information:
            - The request Content-Type, for example "application/json"

<h2>Build and push the container</h2>
We are now ready to build this container and push it to Amazon ECR. This task is executed using a shell script stored in the ../scripts/ folder. Let's take a look at this script and then execute it.

In [7]:
! pygmentize ../scripts/build_and_push.sh

ACCOUNT_ID=$1
REGION=$2
REPO_NAME=$3

cd ../package/ && python setup.py sdist && cp dist/multi_model_serving-1.0.0.tar.gz docker/code/

docker build -f ../docker/Dockerfile -t $REPO_NAME ../docker

docker tag $REPO_NAME $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest

$(aws ecr get-login --no-include-email --registry-ids $ACCOUNT_ID)

aws ecr describe-repositories --repository-names $REPO_NAME || aws ecr create-repository --repository-name $REPO_NAME

docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO_NAME:latest



The script packages the handler service as a source distribution, builds and tags the Docker image, logs in to Amazon ECR, creates the repository if it does not exist, and finally pushes the image to the ECR repository. The first build takes a few minutes; after that, Docker reuses cached layers, so subsequent builds are faster.

In [11]:
!sudo yum -y install docker

/bin/sh: 1: sudo: not found


In [13]:
! ../scripts/build_and_push.sh $account_id $region $ecr_repository_name

running sdist
running egg_info
writing src/multi_model_serving.egg-info/PKG-INFO
writing dependency_links to src/multi_model_serving.egg-info/dependency_links.txt
writing top-level names to src/multi_model_serving.egg-info/top_level.txt
reading manifest file 'src/multi_model_serving.egg-info/SOURCES.txt'
writing manifest file 'src/multi_model_serving.egg-info/SOURCES.txt'

running check

creating multi_model_serving-1.0.0
creating multi_model_serving-1.0.0/src
creating multi_model_serving-1.0.0/src/multi_model_serving
creating multi_model_serving-1.0.0/src/multi_model_serving.egg-info
copying files to multi_model_serving-1.0.0...
copying setup.py -> multi_model_serving-1.0.0
copying src/multi_model_serving/__init__.py -> multi_model_serving-1.0.0/src/multi_model_serving
copying src/multi_model_serving/default_inference_handler.py -> multi_model_serving-1.0.0/src/multi_model_serving
copying src/multi_model_serving/encoder.py -> multi_model_serving-1.0.0/src/multi_model_serving
copying s

<h2>Deploy with Amazon SageMaker</h2>


<h3>Get the container URI</h3>
Once the container has been pushed to Amazon ECR, we are ready to deploy with Amazon SageMaker, which requires the ECR URI of the serving container image as a deployment parameter.

In [8]:
container_image_uri = '{0}.dkr.ecr.{1}.amazonaws.com/{2}:latest'.format(account_id, region, ecr_repository_name)
print(container_image_uri)

785577973223.dkr.ecr.us-east-1.amazonaws.com/sagemaker-serving-containers/multi-model-server-container:latest


<h3>Prepare two models</h3>

We are going to deploy two different XGBoost models to our model server. We will need the serialized models and the inference scripts that we want to use.
We will store them in the current notebook folder, under <strong>model_and_code_1/</strong> and <strong>model_and_code_2/</strong>.

The purpose of using different models is to show that you can also deploy models that require diverse features and pre/post processing code.

The first model is a regression model trained on the [Abalone data](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html) originally from the [UCI data repository](https://archive.ics.uci.edu/ml/datasets/abalone).
For further information, please refer to this [example](https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/xgboost_abalone/xgboost_abalone.ipynb).

The second model is a binary classification model built by following this workshop: https://github.com/aws-samples/amazon-sagemaker-build-train-deploy

In [9]:
! rm -rf ./model_and_code_1/.ipynb_checkpoints
! rm -rf ./model_and_code_1/code/.ipynb_checkpoints
! rm -rf ./model_and_code_2/.ipynb_checkpoints
! rm -rf ./model_and_code_2/code/.ipynb_checkpoints

! tar -C ./model_and_code_1/ -cvzf model1.tar.gz ./
! tar -C ./model_and_code_2/ -cvzf model2.tar.gz ./

./
./xgboost-model
./code/
./code/predictor.py
./
./model.bin
./code/
./code/predictor.py
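
The same packaging step can be sketched in Python with the standard `tarfile` module, which makes it possible to skip `.ipynb_checkpoints` folders while archiving instead of deleting them first. The helper name `package_model` is an assumption for illustration; the archive should be written outside the source folder so it does not end up inside itself.

```python
import os
import tarfile


def package_model(src_dir, archive_path, skip_dirs=(".ipynb_checkpoints",)):
    """Create a .tar.gz of src_dir with paths relative to it, skipping skip_dirs.

    Returns the list of archived member names.
    """
    members = []
    with tarfile.open(archive_path, "w:gz") as tar:
        for root, dirs, files in os.walk(src_dir):
            # Prune unwanted folders in place so os.walk does not descend into them.
            dirs[:] = sorted(d for d in dirs if d not in skip_dirs)
            for name in sorted(files):
                full_path = os.path.join(root, name)
                arcname = os.path.relpath(full_path, src_dir)
                tar.add(full_path, arcname=arcname)
                members.append(arcname)
    return members


# Example call (paths as used in this notebook):
# package_model("./model_and_code_1", "model1.tar.gz")
```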


Let's see the custom inference script for the first model:

In [10]:
! pygmentize model_and_code_1/code/predictor.py

import os
import pickle as pkl

def model_fn(model_dir):
    model_file = model_dir + '/xgboost-model'
    model = pkl.load(open(model_file, 'rb'))
    return model


And this is the one for the second model:

In [11]:
! pygmentize model_and_code_2/code/predictor.py

import os
import pickle as pkl

from multi_model_serving import encoder as xgb_encoders

def input_fn(input_data, content_type):
    return xgb_encoders.decode(input_data, content_type)

def model_fn(model_dir):
    model_file = model_dir + '/model.bin'
    model = pkl.load(open(model_file, 'rb'))
    return model


<h3>Deploy a single model</h3>

In [12]:
s3_model_path = 's3://{0}/{1}/model/model1.tar.gz'.format(bucket, prefix)
!aws s3 cp model1.tar.gz {s3_model_path}

In [15]:
!pwd

/root/sagemaker-custom-serving-containers/multi-model-server-container/notebook


In [13]:
from time import gmtime, strftime
from sagemaker.model import Model

model_name = 'multi-model-server-model-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

model = Model(model_data = s3_model_path,
              image_uri = container_image_uri,
              env = {
                  'SAGEMAKER_PROGRAM': 'predictor'
              },
              role=role,
              name = model_name,
              predictor_cls = sagemaker.predictor.Predictor,
              #sagemaker_session=sagemaker_session #comment this line for local mode.
             )

<strong>Note:</strong> the environment variable SAGEMAKER_PROGRAM is used to specify the name of the custom inference script.

In [14]:
endpoint_name = 'multi-model-server-single-ep-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(endpoint_name)
pred = model.deploy(initial_instance_count=1,
                    instance_type='local',
                    endpoint_name=endpoint_name)

multi-model-server-single-ep-2021-07-15-16-41-10


FileNotFoundError: [Errno 2] No such file or directory: 'docker': 'docker'

In [None]:
from sagemaker.predictor import Predictor

pred.serializer = sagemaker.serializers.CSVSerializer()
item = '77,33,143.0,101,212.2,102,104.9,120,15.3,4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1'
pred.predict(item)

In [None]:
pred.delete_endpoint()
pred.delete_model()

<h3>Deploy multiple models</h3>

In [None]:
model_data_prefix = 's3://{0}/{1}/modeldata'.format(bucket, prefix)

s3_model_1_path = model_data_prefix + '/model1.tar.gz'
!aws s3 cp model1.tar.gz {s3_model_1_path}
s3_model_2_path = model_data_prefix + '/model2.tar.gz'
!aws s3 cp model2.tar.gz {s3_model_2_path}

In [None]:
from time import gmtime, strftime
from sagemaker.multidatamodel import MultiDataModel
from sagemaker.model import Model

model_name = 'multi-model-server-multidatamodel-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())

model = Model(name = model_name,
              model_data = '',
              image_uri = container_image_uri,
              role=role,
              env = {
                  'SAGEMAKER_PROGRAM': 'predictor'
              },
              predictor_cls = sagemaker.predictor.Predictor,
              sagemaker_session=sagemaker_session)

multi_model = MultiDataModel(name = model_name,
                             model_data_prefix = model_data_prefix,
                             model = model,
                             sagemaker_session=sagemaker_session)

<strong>Note:</strong> the environment variable SAGEMAKER_PROGRAM is used to specify the name of the custom inference script.

In [None]:
multi_endpoint_name = 'multi-model-server-ep-' + strftime("%Y-%m-%d-%H-%M-%S", gmtime())
print(multi_endpoint_name)

pred = multi_model.deploy(initial_instance_count=1,
                          instance_type='ml.m5.xlarge',
                          endpoint_name=multi_endpoint_name)

<h3>Executing inferences</h3>
Once the multi-model endpoint is ready, we can invoke either model1 or model2 by changing the target_model argument in the predict() function call.
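
Under the hood, the target model is passed to the SageMaker runtime `InvokeEndpoint` API as the `TargetModel` parameter. The sketch below assembles the request arguments for a CSV payload so the shape of the call is visible without needing a live endpoint. The helper name `build_invoke_args` and the endpoint name are assumptions for illustration; the commented lines show how the arguments would be passed to boto3.

```python
def build_invoke_args(endpoint_name, target_model, csv_row):
    """Assemble keyword arguments for sagemaker-runtime invoke_endpoint."""
    return {
        "EndpointName": endpoint_name,
        "ContentType": "text/csv",
        # Archive to load, given relative to the model data prefix of the endpoint.
        "TargetModel": target_model,
        "Body": csv_row.encode("utf-8"),
    }


# Hypothetical endpoint name and payload, used only for illustration.
args = build_invoke_args("example-multi-model-endpoint", "model1.tar.gz", "1.0,2.0,3.0")
print(sorted(args))

# With AWS credentials and a deployed endpoint, the call would look like:
# import boto3
# response = boto3.client("sagemaker-runtime").invoke_endpoint(**args)
# print(response["Body"].read())
```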

In [None]:
from sagemaker.predictor import Predictor
pred = Predictor(multi_endpoint_name)
pred.serializer = sagemaker.serializers.CSVSerializer()

In [None]:
item = '77,33,143.0,101,212.2,102,104.9,120,15.3,4,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,1'
model_archive = '/model1.tar.gz'
pred.predict(item, target_model=model_archive)

In [None]:
item = '0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,73.0,79.0,32.0,27.0,45.0,48.0,13.0,62.0'
model_archive = '/model2.tar.gz'
pred.predict(item, target_model=model_archive)

In [None]:
pred.delete_endpoint()
pred.delete_model()