# Deploy NVIDIA NIM on Amazon SageMaker

NVIDIA NIM, part of NVIDIA AI Enterprise, brings state-of-the-art large language models (LLMs) to your applications for natural language processing and understanding. Whether you're developing chatbots, content analyzers, or any other application that needs to understand and generate human language, NVIDIA NIM for LLMs has you covered. Built on inference engines including Triton Inference Server, TensorRT, TensorRT-LLM, and PyTorch, NIM is engineered for AI inference at scale, so that you can deploy AI applications with confidence.

## Using Non-Prebuilt Models with NIM Model Repo Generator

In this example we show how you can use NIM's Model Repo Generator to optimize your custom model and deploy it with NIM on Amazon SageMaker on **g5.xlarge (A10G GPU)**.

## Setup

Import the dependencies and set up the roles and clients required to download the model, package it, and create the SageMaker inference endpoint.

In [None]:
import boto3, json, sagemaker, time
from sagemaker import get_execution_role
from pathlib import Path
import os

sess = boto3.Session()
sm = sess.client("sagemaker")
sagemaker_session = sagemaker.Session(boto_session=sess)
role = get_execution_role()
client = boto3.client("sagemaker-runtime")
region = sess.region_name
sts_client = sess.client('sts')
account_id = sts_client.get_caller_identity()['Account']

def create_directory(directory_path):
    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
        print(f"Directory '{directory_path}' created successfully.")
    else:
        print(f"Directory '{directory_path}' already exists.")

We define the NIM image URI that we will use for deployment on the SageMaker endpoint.

In [None]:
nim_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/nim-<YY>.<MM>-sm"

### Download Llama Model From HuggingFace

In [None]:
# install git lfs
!sudo amazon-linux-extras install epel -y
!curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.rpm.sh | sudo bash
!sudo yum install git-lfs -y
!git lfs install
!git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

In [None]:
current_directory = Path.cwd()
hf_model_path = current_directory / "TinyLlama-1.1B-Chat-v1.0"

### Creating the Model Config

First, create a `model_config.yaml` file. For a `Llama` model it should look like the following. You can edit `max_batch_size`, `max_input_len`, and `max_output_len` if you want. To build the engine for multi-GPU inference, edit `num_gpus` and `tensor_para_size`. Here we keep the default settings and build the engine for a single GPU.

In [None]:
%%writefile model_config.yaml
model_repo_path: "/model-store/"
use_ensemble: false
model_type: "LLAMA"
backend: "trt_llm"
base_model_id: "ensemble"
prompt_timer: 60
gateway_ip: "gateway-api"
customization_cache_capacity: 10000
logging_level: "INFO"
enable_chat: true
pipeline:
  model_name: "ensemble"
  num_instances: 1
trt_llm:
  use: true
  ckpt_type: "hf"
  model_name: "trt_llm"
  backend: "python"
  num_gpus: 1
  model_path: /engine_dir
  max_queue_delay_microseconds: 10000
  model_type: "llama"
  max_batch_size: 1
  max_input_len: 256
  max_output_len: 256
  max_beam_width: 1
  tensor_para_size: 1
  pipeline_para_size: 1
  data_type: "float16"
  int8_mode: 0
  enable_custom_all_reduce: 0
  per_column_scaling: false
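
Before building, it can help to sanity-check the parallelism fields against the GPU count. The sketch below assumes that `num_gpus` should equal `tensor_para_size * pipeline_para_size`, as is conventional for TensorRT-LLM engines. That rule and the helper name `check_parallelism` are assumptions for illustration, not part of the Model Repo Generator.

```python
def check_parallelism(trt_llm_cfg):
    """Return a list of problems found in the trt_llm section (empty if none)."""
    problems = []
    num_gpus = trt_llm_cfg["num_gpus"]
    tp = trt_llm_cfg["tensor_para_size"]
    pp = trt_llm_cfg["pipeline_para_size"]
    # Assumed rule: one GPU per (tensor-parallel x pipeline-parallel) rank.
    if num_gpus != tp * pp:
        problems.append(
            f"num_gpus={num_gpus} does not equal tensor_para_size*pipeline_para_size={tp * pp}"
        )
    # The size limits must be positive integers to be meaningful.
    for key in ("max_batch_size", "max_input_len", "max_output_len"):
        if trt_llm_cfg[key] < 1:
            problems.append(f"{key} must be at least 1")
    return problems


# Values copied from the trt_llm section of model_config.yaml above.
trt_llm_cfg = {
    "num_gpus": 1,
    "tensor_para_size": 1,
    "pipeline_para_size": 1,
    "max_batch_size": 1,
    "max_input_len": 256,
    "max_output_len": 256,
}
print(check_parallelism(trt_llm_cfg))  # prints []
```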

### Creating an Empty Directory for model-store

The model store directory must be empty because the output of the Model Repo Generator command is stored there, so we create an empty `model-store` directory.

In [None]:
model_path = "model-store"
create_directory(model_path)
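
Note that `create_directory` only creates the directory; it does not verify that an existing one is empty. A minimal sketch of such a check follows. The helper name `is_empty_directory` is hypothetical.

```python
from pathlib import Path


def is_empty_directory(directory_path):
    """Return True if the path is an existing directory with no entries."""
    path = Path(directory_path)
    return path.is_dir() and not any(path.iterdir())


# Warn before running the generator if leftovers from an earlier run are present.
if not is_empty_directory("model-store"):
    print("'model-store' is missing or not empty; clear it before generating the model repo.")
```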

### Running the Model Repo Generator Command

The following command generates the model repository in the path specified for `model-store`. We also pass in the location of the `model_config.yaml` file we created, and mount the `Llama` model that we downloaded from HuggingFace into the container at `/engine_dir`.

In [None]:
!mkdir -p tmp
!docker run --rm -it --gpus all -v $(pwd)/model-store:/model-store -v $(pwd)/model_config.yaml:/model_config.yaml -v {hf_model_path}/:/engine_dir -v $(pwd)/tmp:/tmp {nim_image_uri}  bash -c "model_repo_generator llm --verbose --yaml_config_file=/model_config.yaml"
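
To make the volume mounts easier to read, the same command can be assembled as an argument list. This is only a sketch: the helper name `build_generator_command`, the example paths, and the `<nim_image_uri>` placeholder are illustrative, and it omits the `-it` flags on the assumption that no interactive terminal is needed. It builds the command without running it.

```python
import shlex
from pathlib import Path


def build_generator_command(image_uri, hf_model_path, workdir):
    """Build the docker command that runs the Model Repo Generator."""
    workdir = Path(workdir)
    # (host path, container path) pairs, matching the notebook command above.
    mounts = [
        (workdir / "model-store", "/model-store"),              # generator output
        (workdir / "model_config.yaml", "/model_config.yaml"),  # config file
        (Path(hf_model_path), "/engine_dir"),                   # downloaded model
        (workdir / "tmp", "/tmp"),                              # scratch space
    ]
    cmd = ["docker", "run", "--rm", "--gpus", "all"]
    for host_path, container_path in mounts:
        cmd += ["-v", f"{host_path}:{container_path}"]
    cmd += [
        image_uri,
        "bash",
        "-c",
        "model_repo_generator llm --verbose --yaml_config_file=/model_config.yaml",
    ]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_generator_command(
    "<nim_image_uri>", "/home/user/TinyLlama-1.1B-Chat-v1.0", "/home/user"
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a host with Docker and a GPU.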

### Packaging model and uploading to S3

SageMaker expects the model artifact in `tar.gz` format, so we package the model repository and then upload it to S3.

In [None]:
!tar -czf model.tar.gz model-store
model_uri = sagemaker_session.upload_data(path="model.tar.gz", key_prefix="nim-model")
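
If you prefer to stay in Python, or want to inspect the archive layout before uploading, the standard `tarfile` module can do the same packaging. The sketch below assumes the same layout as the `tar` command above (everything under a top-level `model-store/` entry). The helper name `package_model` is hypothetical.

```python
import tarfile
from pathlib import Path


def package_model(source_dir, archive_path="model.tar.gz"):
    """Create a gzip-compressed tar of source_dir and return the member names."""
    source_dir = Path(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store entries under the directory's own name, as `tar -czf` does.
        tar.add(str(source_dir), arcname=source_dir.name)
    # Re-open the archive to list what was actually written.
    with tarfile.open(archive_path, "r:gz") as tar:
        return tar.getnames()
```

For example, `print(package_model("model-store"))` lists the archived paths so you can confirm the structure before calling `upload_data`.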

### Create SageMaker Endpoint

Next, we create a SageMaker model from the artifact we uploaded to S3 in the previous step.

In this step we also need to provide additional environment variables:
- `SAGEMAKER_MODEL_NAME`, which specifies the name of the model to be loaded by the NIM container on SageMaker. You can provide any name, as long as it matches the name you provide in the inference request.
- `SAGEMAKER_NUM_GPUS`, which specifies the number of GPUs the model engine was built to run inference on. In this example it corresponds to the `num_gpus` value set in `model_config.yaml`.

Here we set the model name to `llama` and the number of GPUs to 1.

In [None]:
SAGEMAKER_MODEL_NAME = "llama"
SAGEMAKER_NUM_GPUS = "1"

In [None]:
container = {
    "Image": nim_image_uri,
    "ModelDataUrl": model_uri,
    "Environment": {"SAGEMAKER_MODEL_NAME": SAGEMAKER_MODEL_NAME,
                    "SAGEMAKER_NUM_GPUS": SAGEMAKER_NUM_GPUS}
}
sm_prefix = "nim-model-" + SAGEMAKER_MODEL_NAME

sm_model_name = sm_prefix + time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
create_model_response = sm.create_model(
    ModelName=sm_model_name, ExecutionRoleArn=role, PrimaryContainer=container
)

print("Model Arn: " + create_model_response["ModelArn"])

Using the model above, we create an endpoint configuration where we can specify the type of instance we want in the endpoint. 

**IMPORTANT: Because we used NIM's Model Repo Generator to build the model engine on a g5 (A10G) GPU, we need to specify the `InstanceType` as `ml.g5.xlarge`.**

**An engine built for a particular GPU should only be deployed on that same GPU. For example, if you want to deploy on p4d.24xlarge (A100 40GB), run the Model Repo Generator with your model on a p4d.24xlarge notebook instance.**
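
One way to guard against a mismatch is to check the build host's GPU before running the generator. The sketch below parses the output of `nvidia-smi --query-gpu=name --format=csv,noheader`. The helper names are hypothetical, and the exact name string that the driver reports for a given GPU (for example `NVIDIA A10G`) is an assumption.

```python
import subprocess


def query_gpu_names():
    """Return the raw GPU name listing from nvidia-smi (one name per line)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def build_host_matches(expected_gpu, smi_output):
    """Return True if every GPU listed in smi_output contains expected_gpu."""
    names = [line.strip() for line in smi_output.splitlines() if line.strip()]
    return bool(names) and all(expected_gpu in name for name in names)


# Example with a captured output string rather than a live nvidia-smi call.
print(build_host_matches("A10G", "NVIDIA A10G\n"))  # prints True
```

On the notebook instance, `build_host_matches("A10G", query_gpu_names())` performs the check against the real hardware.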

In [None]:
endpoint_config_name = sm_prefix + time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_config_response = sm.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": "ml.g5.xlarge",
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": sm_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)

print("Endpoint Config Arn: " + create_endpoint_config_response["EndpointConfigArn"])

Using the endpoint configuration above, we create a new SageMaker endpoint and wait for the deployment to finish. The status changes to `InService` once the deployment succeeds.

In [None]:
endpoint_name = sm_prefix + time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())

create_endpoint_response = sm.create_endpoint(
    EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name
)

print("Endpoint Arn: " + create_endpoint_response["EndpointArn"])

In [None]:
resp = sm.describe_endpoint(EndpointName=endpoint_name)
status = resp["EndpointStatus"]
print("Status: " + status)

while status == "Creating":
    time.sleep(60)
    resp = sm.describe_endpoint(EndpointName=endpoint_name)
    status = resp["EndpointStatus"]
    print("Status: " + status)

print("Arn: " + resp["EndpointArn"])
print("Status: " + status)
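
The loop above exits on any status other than `Creating`, including `Failed`, and has no upper time limit. A slightly more defensive variant is sketched below. The helper name `wait_for_endpoint` and its default timings are assumptions; it takes the describe call as a parameter so that it can be exercised without AWS access.

```python
import time


def wait_for_endpoint(describe_endpoint, endpoint_name, poll_seconds=60, timeout_seconds=3600):
    """Poll until the endpoint is InService; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = describe_endpoint(EndpointName=endpoint_name)
        status = resp["EndpointStatus"]
        if status == "InService":
            return resp
        if status != "Creating":
            # Surface the failure reason if the response includes one.
            reason = resp.get("FailureReason", "no reason given")
            raise RuntimeError(f"Endpoint {endpoint_name} ended in status {status}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Endpoint {endpoint_name} still Creating after {timeout_seconds}s")
        time.sleep(poll_seconds)
```

In this notebook it would be called as `resp = wait_for_endpoint(sm.describe_endpoint, endpoint_name)`. Alternatively, boto3 provides a built-in waiter for the same purpose: `sm.get_waiter("endpoint_in_service").wait(EndpointName=endpoint_name)`.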

### Run Inference

Once the endpoint is running, we can send a sample prompt as a text completion inference request, using JSON as the payload format. For the request format, NIM on SageMaker supports the OpenAI API `/completions` inference protocol. For an explanation of the supported parameters, see [this link](https://platform.openai.com/docs/api-reference/completions/create).

In [None]:
payload = {
  "model": SAGEMAKER_MODEL_NAME,
  "prompt": "<|system|> You are a chatbot who can help code!</s> <|user|> Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI.</s> <|assistant|>",
  "max_tokens": 100,
  "temperature": 1,
  "n": 1,
  "stream": False,
  "stop": ["string"],
  "frequency_penalty": 0.0
}

response = client.invoke_endpoint(
    EndpointName=endpoint_name, ContentType="application/json", Body=json.dumps(payload)
)

output = json.loads(response["Body"].read().decode("utf8"))
print(json.dumps(output, indent=2))
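
The prompt string above follows a `<|system|> ... </s> <|user|> ... </s> <|assistant|>` layout, and the generated text sits inside the JSON response. The two hypothetical helpers below sketch how to build such a prompt and pull out the completion. They assume the response follows the OpenAI completions shape, with the text under `choices[0]["text"]`.

```python
def build_prompt(system_message, user_message):
    """Format a prompt using the same layout as the payload above."""
    return f"<|system|> {system_message}</s> <|user|> {user_message}</s> <|assistant|>"


def completion_text(output):
    """Return the generated text from an OpenAI-style completions response."""
    return output["choices"][0]["text"]


prompt = build_prompt("You are a chatbot who can help code!", "Write a hello-world program.")
print(prompt)

# A stand-in response dictionary, for illustration only.
print(completion_text({"choices": [{"text": "print('hello')"}]}))
```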

### Try streaming inference

NIM on SageMaker also supports streaming inference. To enable it, set **`"stream"` to `True`** in the payload and call the [`invoke_endpoint_with_response_stream`](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/sagemaker-runtime/client/invoke_endpoint_with_response_stream.html) method.

In [None]:
payload = {
  "model": SAGEMAKER_MODEL_NAME,
  "prompt": "<|system|> You are a chatbot who can help code!</s> <|user|> Write me a function to calculate the first 10 digits of the fibonacci sequence in Python and print it out to the CLI.</s> <|assistant|>",
  "max_tokens": 100,
  "temperature": 1,
  "n": 1,
  "stream": True,
  "stop": ["string"],
  "frequency_penalty": 0.0
}
response = client.invoke_endpoint_with_response_stream(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
    Accept="application/jsonlines",
)

We do some postprocessing on the event stream to handle the streaming output tokens.

In [None]:
from utils import LineIterator, has_same_prefix
import re

event_stream = response["Body"]

# Create an instance of LineIterator
line_iterator = LineIterator(event_stream)

# Iterate over the lines
prev = None
for line in line_iterator:
    # Strip the "data: " prefix and decode the bytes into a string
    decoded_line = line[len(b'data: '):].decode("utf-8").strip()

    if decoded_line == "[DONE]":
        print(prev)
        print("\nStreaming Generation Finished!")
        break
    else:
        # Extract the generated text from the JSON chunk
        decoded_json = json.loads(decoded_line)
        text = decoded_json['choices'][0]['text']
        # Split the text into words and punctuation marks
        words_and_punctuations = re.findall(r"[\w']+|[.,!?;&()\"–—:;!*#@$%/\\<>\[\]{}|^~=+]", text)
        if len(words_and_punctuations) > 0:
            # Check prev for None first so the helper is never called with None
            if prev is not None and not has_same_prefix(prev, words_and_punctuations[-1]):
                print(prev, end=' ')
            # Keep the last word for the next iteration
            prev = words_and_punctuations[-1]
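
The `utils` module imported above is not shown in this notebook. As a rough illustration of what a line iterator over the response stream has to do, the sketch below buffers the `PayloadPart` byte chunks that boto3 yields and emits complete lines, since one line may be split across several chunks. The names `iter_lines` and `parse_stream_line` are hypothetical stand-ins, not the notebook's actual helpers, and skipping blank lines is an assumption.

```python
import json


def iter_lines(event_stream):
    """Yield complete, non-empty lines (without the newline) from a response stream."""
    buffer = b""
    for event in event_stream:
        part = event.get("PayloadPart")
        if part is None:
            continue  # ignore events that carry no payload bytes
        buffer += part["Bytes"]
        # Emit every complete line currently held in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.rstrip(b"\r")
            if line:
                yield line
    if buffer:
        yield buffer  # trailing data without a final newline


def parse_stream_line(line):
    """Return the text of one streamed chunk, or None for the [DONE] marker."""
    text = line.decode("utf-8")
    if text.startswith("data:"):
        text = text[len("data:"):]
    text = text.strip()
    if text == "[DONE]":
        return None
    return json.loads(text)["choices"][0]["text"]


# Simulated events: the first JSON line is split across two chunks.
events = [
    {"PayloadPart": {"Bytes": b'data: {"choices": [{"text": "Hel'}},
    {"PayloadPart": {"Bytes": b'lo"}]}\n\ndata: [DONE]\n'}},
]
for line in iter_lines(events):
    print(parse_stream_line(line))
```

With the simulated events this prints `Hello` followed by `None`.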

### Terminate endpoint and clean up artifacts

In [None]:
sm.delete_endpoint(EndpointName=endpoint_name)
sm.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
sm.delete_model(ModelName=sm_model_name)