<a href="https://colab.research.google.com/github/orca3/llm-model-serving/blob/main/ch04/dlc/aws_dlc_serving.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The sample code is compiled from AWS tutorials. To show the concepts clearly, we trimmed most of the setup code and kept only the key pseudocode in this notebook. For the full executable code and AWS setup instructions, please refer to:


* [AWS Large Model Inference Starting Guide](https://docs.djl.ai/master/docs/serving/serving/docs/lmi/user_guides/starting-guide.html)
* [AWS Developer Guide: Deploy models with TorchServe](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-models-frameworks-torchserve.html)  
* [AWS Developer Guide: Deploy models with DJL Serving](https://docs.aws.amazon.com/sagemaker/latest/dg/deploy-models-frameworks-djl-serving.html)




## Example One: Serve a Self-built PyTorch Model with the AWS TorchServe Image

In [None]:
# Find the available serving image by model framework
# and server instance type
import sagemaker

baseimage = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region="<region>",
    py_version="py310",
    image_scope="inference",
    version="2.0.1",
    instance_type="ml.g4dn.16xlarge",
)


In [None]:
# Create the model archive and upload it to cloud storage (AWS S3)

# Package the model (shell command, run with "!" in a notebook cell)
!torch-model-archiver --model-name mnist --version 1.0 --model-file workspace/mnist-dev/mnist.py \
  --serialized-file workspace/mnist-dev/mnist_cnn.pt --handler workspace/mnist-dev/mnist_handler.py \
  --config-file workspace/mnist-dev/model-config.yaml --archive-format tgz

# bucket_name and prefix come from the setup code omitted in this notebook
output_path = f"s3://{bucket_name}/{prefix}/models"
!aws s3 cp mnist.tar.gz {output_path}/mnist.tar.gz


In [None]:
# Create the Model definition, including the serving image
from sagemaker.model import Model
from sagemaker.predictor import Predictor

model = Model(model_data=f"{output_path}/mnist.tar.gz",
              image_uri=baseimage,
              predictor_cls=Predictor,
              name="mnist")

In [None]:
# Deploy the model as a model endpoint
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

endpoint_name = "torchserve-endpoint-1"
predictor = model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
    endpoint_name=endpoint_name,
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer())
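
Once the endpoint is up, requests go through `predictor.predict`, which serializes the payload to JSON because we configured `JSONSerializer` above. The sketch below is a minimal illustration, not code from the AWS tutorial: the helper names `make_mnist_batch` and `predict_digits` are hypothetical, and the `"inputs"` key and the `(batch, 1, 28, 28)` shape are assumptions, since the real request schema is defined by the custom `mnist_handler.py`.

```python
import random


def make_mnist_batch(batch_size=2, seed=0):
    """Return random pixel data shaped (batch, 1, 28, 28) as nested lists.

    Nested lists (not tensors) are used so the batch is JSON-serializable.
    """
    rng = random.Random(seed)
    return [
        [[[rng.random() for _ in range(28)] for _ in range(28)]]
        for _ in range(batch_size)
    ]


def predict_digits(predictor, batch):
    """Send a batch to any object that exposes predict(payload).

    The "inputs" key is an assumption; the model handler decides the schema.
    """
    return predictor.predict({"inputs": batch})


# With the endpoint deployed above:
# result = predict_digits(predictor, make_mnist_batch(batch_size=4))
```

Because `predict_digits` only relies on a `predict` method, it can be exercised against a stub object before any endpoint exists.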

## Example Two: Serve an LLM with vLLM on the AWS DJL Image

In [None]:
# Create the SageMaker Model object. In this example we let LMI configure the deployment settings based on the model architecture
from sagemaker.djl_inference import DJLModel

model = DJLModel(
  model_id="meta-llama/Meta-Llama-3.1-8B-Instruct",
  role=iam_role,
  env={
    "HF_TOKEN": "<hf token value for gated models>",
    # Add more serving configurations here
    "OPTION_TENSOR_PARALLEL_DEGREE": "4", # Example: set tensor parallel degree
    "OPTION_ROLLING_BATCH": "vllm", # Example: use vLLM as the rolling-batch backend
    "OPTION_MAX_ROLLING_BATCH_SIZE": "128", # Example: set max rolling batch size
  }
)
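
The `env` keys above follow the LMI naming pattern in which a `serving.properties` entry such as `option.tensor_parallel_degree` becomes the environment variable `OPTION_TENSOR_PARALLEL_DEGREE`. As a small sketch of that convention, the helper below builds such a dictionary from plain Python values. The function name `to_lmi_env` is hypothetical and not part of the SageMaker SDK; it only uppercases names and converts values to strings, which is what environment variables require.

```python
def to_lmi_env(options):
    """Convert {'tensor_parallel_degree': 4} style options to OPTION_* env vars.

    Hypothetical helper: it only renames keys and stringifies values,
    and does not check that an option is one LMI actually supports.
    """
    env = {}
    for key, value in options.items():
        name = "OPTION_" + key.upper()
        if isinstance(value, bool):
            # Booleans are written in lowercase, as in a properties file
            env[name] = str(value).lower()
        else:
            env[name] = str(value)
    return env


env = to_lmi_env({"tensor_parallel_degree": 4, "max_rolling_batch_size": 128})
print(env)
```

The resulting dictionary can be merged with entries such as `HF_TOKEN` before it is passed as `env=` to `DJLModel`.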



In [None]:
# Deploy your model to a SageMaker Endpoint and create a Predictor to make inference requests
endpoint_name = sagemaker.utils.name_from_base("llama-8b-endpoint")
predictor = model.deploy(
    instance_type="ml.g5.12xlarge",
    initial_instance_count=1,
    endpoint_name=endpoint_name)
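
With the endpoint deployed, text generation requests can be sent through the returned predictor. The sketch below assumes the request body uses the `{"inputs": ..., "parameters": {...}}` shape described in the LMI starting guide; the helper names `build_generation_request` and `generate` are hypothetical, and the parameter defaults are illustrative values, not recommendations.

```python
def build_generation_request(prompt, max_new_tokens=128, temperature=0.7):
    """Build a request body in the {"inputs", "parameters"} shape.

    The schema is an assumption based on the LMI docs; check the guide
    linked at the top of this notebook for the parameters your image accepts.
    """
    if not prompt:
        raise ValueError("prompt must be a non-empty string")
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(predictor, prompt, **params):
    """Send the request through any object that exposes predict(payload)."""
    return predictor.predict(build_generation_request(prompt, **params))


# With the endpoint deployed above:
# result = generate(predictor, "What is model serving?", max_new_tokens=64)
# predictor.delete_endpoint()  # endpoints are billed while they are running
```

Deleting the endpoint when you are done avoids paying for an idle GPU instance.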
