# Model deployment with the HuggingFace Accelerate engine integrated in the LMI (Large Model Inference) container

This example uses DJLServing, which is bundled in the LMI container, as the model serving solution. DJLServing is a high-performance, programming-language-agnostic universal model server powered by the Deep Java Library (DJL). To learn more about DJL and DJLServing, see our blog post (https://aws.amazon.com/blogs/machine-learning/deploy-bloom-176b-and-opt-30b-on-amazon-sagemaker-with-large-model-inference-deep-learning-containers-and-deepspeed/).

Useful references:

- LMI samples: https://github.com/aws/amazon-sagemaker-examples/tree/main/inference/generativeai/llm-workshop
- LMI in the SageMaker Developer Guide: https://docs.aws.amazon.com/sagemaker/latest/dg/large-model-inference-tutorials.html
- DJLModel: https://sagemaker.readthedocs.io/en/stable/frameworks/djl/index.html
- DJL Serving configurable parameters: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html

#### Init SageMaker Runtime

In [None]:
import sagemaker
from sagemaker import image_uris
import boto3
import os
import time
import json
from pathlib import Path
from sagemaker.utils import name_from_base

role = sagemaker.get_execution_role()  # execution role for the endpoint
sess = sagemaker.session.Session()  # sagemaker session for interacting with different AWS APIs
default_bucket = sess.default_bucket()  # bucket to house artifacts
region = sess.boto_region_name  # public attribute; avoids relying on the private _region_name

#### Construct artifacts and deploy on SageMaker endpoint

<mark>**NOTE**: Copy the S3 path where the training job saved the model artifacts and set it as the value of option.s3url in serving.properties.</mark>


In this sample notebook, we use the <mark>Python (HuggingFace Accelerate)</mark> engine.

In [None]:
%%writefile serving.properties
engine=Python
#engine=DeepSpeed
option.tensor_parallel_degree=1
#option.model_id=TheBloke/Wizard-Vicuna-7B-Uncensored-HF
option.s3url=s3://COPY_FROM_TRAINING_SCRIPT/

The configuration file uses the following settings:

- engine: The engine for DJL to use. In this case, it is set to Python (HuggingFace Accelerate).
- option.model_id: The model ID of a pretrained model hosted in a model repository on huggingface.co (https://huggingface.co/models), or an S3 path to the model artifacts. It is commented out here because the model is loaded from option.s3url instead.
- option.s3url: The S3 path where the training job saved the model artifacts.
- option.tensor_parallel_degree: The number of GPU devices over which Accelerate partitions the model. This parameter also controls the number of workers per model that DJL Serving starts. For example, on a 4-GPU machine with 4 partitions, there is 1 worker per model to serve requests. It is set to 1 here, which matches the single GPU of the ml.g4dn.xlarge instance used for deployment.

For more details and an exhaustive list of configuration options, refer to the documentation: https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints-large-model-configuration.html
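
Before packaging, it can help to sanity-check the configuration file. The following is a minimal sketch, not part of DJL Serving: `parse_properties` and `find_problems` are hypothetical helpers, and the checks reflect only the settings described above (the `COPY_FROM_TRAINING_SCRIPT` placeholder comes from this notebook's serving.properties).

```python
PLACEHOLDER = "COPY_FROM_TRAINING_SCRIPT"  # placeholder used in this notebook's serving.properties


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        props[key.strip()] = value.strip()
    return props


def find_problems(props):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if "engine" not in props:
        problems.append("engine is not set")
    degree = props.get("option.tensor_parallel_degree")
    if degree is not None and (not degree.isdigit() or int(degree) < 1):
        problems.append("option.tensor_parallel_degree must be a positive integer")
    if PLACEHOLDER in props.get("option.s3url", ""):
        problems.append("option.s3url still contains the placeholder")
    return problems


sample = """\
engine=Python
#engine=DeepSpeed
option.tensor_parallel_degree=1
option.s3url=s3://COPY_FROM_TRAINING_SCRIPT/
"""
print(find_problems(parse_properties(sample)))
```

Run against the file exactly as written in the cell above, the check reports the placeholder S3 path, which is the one value that must be edited before deployment.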

#### Tar the model serving code and upload to S3 (for SageMaker Endpoint)

In [None]:
# Construct code artifacts tar
code_tarname = 'llama2-qlora-merged-acc'

!mkdir -p {code_tarname}
!rm -rf {code_tarname}.tar.gz
!rm -rf {code_tarname}/.ipynb_checkpoints

!mv model.py {code_tarname}/
!mv requirements.txt {code_tarname}/
!mv serving.properties {code_tarname}/
!tar czvf {code_tarname}.tar.gz {code_tarname}/
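
The shell commands above can also be expressed in plain Python, which avoids moving the source files and fails early if one is missing. This is a sketch under assumptions: `build_code_tarball` is a hypothetical helper, the demo uses empty stand-in files in a temporary directory, and unlike `tar czvf` it does not add a separate entry for the folder itself.

```python
import tarfile
import tempfile
from pathlib import Path

CODE_FILES = ["model.py", "requirements.txt", "serving.properties"]


def build_code_tarball(src_dir, tar_stem, files=CODE_FILES):
    """Pack `files` from src_dir into <src_dir>/<tar_stem>.tar.gz under a <tar_stem>/ folder."""
    src = Path(src_dir)
    missing = [name for name in files if not (src / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    tar_path = src / f"{tar_stem}.tar.gz"
    with tarfile.open(str(tar_path), "w:gz") as tar:
        for name in files:
            # arcname places each file under the top-level folder inside the archive
            tar.add(str(src / name), arcname=f"{tar_stem}/{name}")
    return tar_path


# Demo in a throwaway directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for name in CODE_FILES:
        (Path(tmp) / name).write_text("")
    demo_tar = build_code_tarball(tmp, "llama2-qlora-merged-acc")
    with tarfile.open(str(demo_tar)) as tar:
        members = tar.getnames()
print(members)
```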

Upload the code tarball to any valid S3 path (different from the Hugging Face model artifacts path).

In [None]:
s3_code_artifact = sess.upload_data(f"{code_tarname}.tar.gz", 
                                    default_bucket, 
                                    sagemaker.utils.name_from_base("tmp/v0"))

Choose a suitable LMI container version

In [None]:
# Specify an inference container version; available images are listed at:
# - https://github.com/aws/deep-learning-containers/blob/master/available_images.md#large-model-inference-containers
inference_image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/djl-inference:0.23.0-deepspeed0.9.5-cu118"

# name a SageMaker Endpoint
endpoint_name = sagemaker.utils.name_from_base(code_tarname)
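
To make the parts of the image URI explicit, here is a small sketch that composes it from its pieces. `lmi_image_uri` is a hypothetical helper; the account ID and tag defaults are the ones used in the cell above. Some regions may use a different registry account, so check the available-images page linked above for your region.

```python
def lmi_image_uri(region, tag="0.23.0-deepspeed0.9.5-cu118", account="763104351884"):
    """Compose an ECR image URI for the djl-inference repository."""
    if not region:
        raise ValueError("region must be a non-empty string")
    return f"{account}.dkr.ecr.{region}.amazonaws.com/djl-inference:{tag}"


print(lmi_image_uri("us-east-1"))
```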

Register/Declare the model with proper configurations

In [None]:
from sagemaker.model import Model

model = Model(image_uri=inference_image_uri,
              model_data=s3_code_artifact, 
              role=role)

Trigger the model deployment

In [None]:
model.deploy(initial_instance_count = 1,
             instance_type = 'ml.g4dn.xlarge', 
             endpoint_name = endpoint_name,
             container_startup_health_check_timeout = 900
            )
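
By default, `model.deploy` blocks until the endpoint is ready. If you deploy with `wait=False`, or want to check on an endpoint from another session, the status can be polled. The sketch below assumes a callable shaped like the boto3 SageMaker client's `describe_endpoint`; `wait_until_in_service` is a hypothetical helper, and the client is injected so the logic can be tried without AWS access.

```python
import time


def wait_until_in_service(describe_endpoint, endpoint_name,
                          poll_seconds=30.0, timeout_seconds=1800.0,
                          sleep=time.sleep):
    """Poll the endpoint status until it is InService, fails, or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = describe_endpoint(EndpointName=endpoint_name)
        status = response["EndpointStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            reason = response.get("FailureReason", "unknown")
            raise RuntimeError(f"endpoint {endpoint_name} failed: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"endpoint {endpoint_name} still {status} after timeout")
        sleep(poll_seconds)


# Usage against a real endpoint (requires AWS credentials, so it is left commented out):
# import boto3
# sm_client = boto3.client("sagemaker")
# wait_until_in_service(sm_client.describe_endpoint, endpoint_name)
```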

#### Init predictor and invoke specified endpoint

If you only need to invoke an in-service endpoint and do not need to deploy one, start from this step and pass the SageMaker endpoint name to the Predictor.

In [None]:
from sagemaker import serializers, deserializers

# Or copy endpoint name from SageMaker console for direct invocation
# endpoint_name = 'llama2-merge-model-2023-08-19-04-42-02-574'

predictor = sagemaker.Predictor(
            endpoint_name=endpoint_name,
            sagemaker_session=sess,
            serializer=serializers.JSONSerializer(),
            deserializer=deserializers.JSONDeserializer(),
            )
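
Clients that do not have the SageMaker Python SDK installed can call the same endpoint through the boto3 `sagemaker-runtime` client. The following is a minimal sketch: `invoke_json` is a hypothetical helper, and it assumes the endpoint accepts and returns JSON, as the serializers configured above suggest.

```python
import json


def invoke_json(runtime_client, endpoint_name, payload):
    """Send a JSON payload to the endpoint and decode the JSON response body."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read())


# Usage against a live endpoint (requires AWS credentials, so it is left commented out):
# import boto3
# runtime = boto3.client("sagemaker-runtime")
# payload = {"inputs": "tuna sandwich nutritional content is ",
#            "parameters": {"max_new_tokens": 200}}
# print(invoke_json(runtime, endpoint_name, payload))
```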

The payload format must match what the inference code (model.py) expects.

In [None]:
predictor.predict(
    {"inputs": ["tuna sandwich nutritional content is ", "I need to cook a good pizza, so "], 
     "parameters": {"max_new_tokens": 200}}
)
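
Since model.py is not shown here, the request schema can only be inferred from the calls in this notebook: an `inputs` field holding one prompt or a list of prompts, and a `parameters` dictionary. Under that assumption, a small hypothetical helper such as `build_payload` can catch malformed requests before they reach the endpoint.

```python
import json


def build_payload(prompts, max_new_tokens=200, **extra_parameters):
    """Build a request body with one prompt (str) or several prompts (list of str)."""
    if isinstance(prompts, str):
        inputs = prompts
    else:
        inputs = list(prompts)
        if not inputs or not all(isinstance(p, str) for p in inputs):
            raise ValueError("prompts must be a string or a non-empty list of strings")
    if not isinstance(max_new_tokens, int) or max_new_tokens < 1:
        raise ValueError("max_new_tokens must be a positive integer")
    return {
        "inputs": inputs,
        "parameters": {"max_new_tokens": max_new_tokens, **extra_parameters},
    }


payload = build_payload(
    ["tuna sandwich nutritional content is ", "I need to cook a good pizza, so "]
)
print(json.dumps(payload, indent=2))
```

The result can be passed directly to `predictor.predict(payload)`. Any extra generation parameters are forwarded as-is, and whether they are honored depends on model.py.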

In [None]:
%%timeit -n3 -r1
predictor.predict(
    {"inputs": "tuna sandwich nutritional content is ", 
     "parameters": {"max_new_tokens": 200}}
)
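
Outside a notebook, where the `%%timeit` magic is unavailable, the same measurement can be sketched with `time.perf_counter`. `time_calls` is a hypothetical helper, and the demo uses a stand-in function in place of `predictor.predict` so it runs without an endpoint.

```python
import statistics
import time


def time_calls(call, payload, runs=3):
    """Call `call(payload)` `runs` times and summarize wall-clock latency in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        call(payload)
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def fake_predict(payload):
    """Stand-in for predictor.predict; sleeps briefly instead of calling an endpoint."""
    time.sleep(0.01)


demo_stats = time_calls(fake_predict, {"inputs": "tuna sandwich nutritional content is ",
                                       "parameters": {"max_new_tokens": 200}})
print(demo_stats)
```

With a live endpoint, pass `predictor.predict` as `call`. The timings then include network round trips as well as generation time.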