
NIM (NVIDIA Inference Microservice) Shim

Preparation

🛈 Prefer Python? Check here

Customize the environment variables below to match your AWS, NGC, etc. configuration(s). If needed, adjust the parameters passed to the launch.sh call to ensure proper mapping of frontend/backend ports and the source entrypoint. At a minimum you should customize the following:

  • NGC_API_KEY
  • DST_REGISTRY
  • SG_INST_TYPE
    • Note that ml.p4d.24xlarge or a similar variant is required for llama3-70b; ml.g5.4xlarge will work fine for llama3-8b.
  • SG_EXEC_ROLE_ARN
git clone https://github.com/liveaverage/nim-shim && cd nim-shim

### Set your NGC API Key
export NGC_API_KEY=nvapi-your-api-key

export SRC_IMAGE_PATH=nvcr.io/nim/meta/llama3-70b-instruct:latest
export SRC_IMAGE_NAME="${SRC_IMAGE_PATH##*/}"
export SRC_IMAGE_NAME="${SRC_IMAGE_NAME%%:*}"
export DST_REGISTRY=your-registry.dkr.ecr.us-west-2.amazonaws.com/nim-shim

docker login nvcr.io
docker login ${DST_REGISTRY}
docker pull ${SRC_IMAGE_PATH}

# Build shimmed image
envsubst < Dockerfile > Dockerfile.nim
docker build -f Dockerfile.nim -t ${DST_REGISTRY}:${SRC_IMAGE_NAME} -t nim-shim:latest .
docker push ${DST_REGISTRY}:${SRC_IMAGE_NAME}

export SG_EP_NAME="nim-llm-${SRC_IMAGE_NAME}"
export SG_EP_CONTAINER=${DST_REGISTRY}:${SRC_IMAGE_NAME}
export SG_INST_TYPE=ml.p4d.24xlarge # ml.g5.4xlarge -- adequate for llama3-8b
export SG_EXEC_ROLE_ARN="arn:aws:iam::YOUR-ARN-ROLE:role/service-role/AmazonSageMakerServiceCatalogProductsUseRole"
export SG_CONTAINER_STARTUP_TIMEOUT=850 # in seconds -- adjust for dynamic or S3 model pulls and model size (70b weights can take 460s+ to download)
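
If DST_REGISTRY points at Amazon ECR (as in the example above), the docker login call needs valid registry credentials and an existing repository. A minimal sketch, assuming the us-west-2 region and a repository named nim-shim taken from the sample path:

# Authenticate Docker to ECR and create the target repository if it doesn't exist (assumed region/repo name)
aws ecr get-login-password --region us-west-2 | \
    docker login --username AWS --password-stdin "${DST_REGISTRY%%/*}"
aws ecr create-repository --repository-name nim-shim --region us-west-2 || true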

Usage

Deploying to Sagemaker

Review logs in CloudWatch. Ensure the instance type is appropriate for the model you're running and that the startup timeout is set to a sane value, especially for dynamic downloads of large models (70b+).
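
To follow those logs from the CLI once the endpoint exists, AWS CLI v2 provides aws logs tail against the log group SageMaker creates per endpoint; a minimal sketch:

# Follow the endpoint container logs in CloudWatch (AWS CLI v2)
aws logs tail "/aws/sagemaker/Endpoints/${SG_EP_NAME}" --follow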

# Generate model JSON
envsubst < templates/sg-model.template > sg-model.json

# Create Model
aws sagemaker create-model \
    --cli-input-json file://sg-model.json

# Create Endpoint Config
aws sagemaker create-endpoint-config \
    --endpoint-config-name $SG_EP_NAME \
    --production-variants "$(envsubst < templates/sg-prod-variant.template)"

# Create Endpoint
aws sagemaker create-endpoint \
    --endpoint-name $SG_EP_NAME \
    --endpoint-config-name $SG_EP_NAME
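
Endpoint creation is asynchronous; to block until it reaches InService and confirm its status, the standard waiter and describe calls can be used:

# Wait for the endpoint to finish creating, then print its status
aws sagemaker wait endpoint-in-service --endpoint-name $SG_EP_NAME
aws sagemaker describe-endpoint --endpoint-name $SG_EP_NAME --query 'EndpointStatus'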

Deploying locally

Start the container and monitor for:

  • Caddy download & launch
  • Model weight(s) download
  • Service startup(s)
# Optional (but recommended to expedite future NIM launch times)
mkdir -p /opt/nim/cache

# Start NIM Shim container
docker run -it --rm -v /opt/nim/cache:/opt/nim/.cache -e NGC_API_KEY=$NGC_API_KEY -p 8080:8080 nim-shim:latest
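
If the local host has NVIDIA GPUs and the NVIDIA Container Toolkit installed, you will likely want them exposed to the container; the same launch with GPU passthrough (the --gpus all flag is the only addition):

# Same launch with GPUs passed through to the container
docker run -it --rm --gpus all \
    -v /opt/nim/cache:/opt/nim/.cache \
    -e NGC_API_KEY=$NGC_API_KEY \
    -p 8080:8080 nim-shim:latest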

Testing (Sagemaker)

Invocation

# Generate sample payload JSON
envsubst < templates/sg-test-payload.template > sg-invoke-payload.json

# Create sample invocation
aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $SG_EP_NAME \
    --body file://sg-invoke-payload.json \
    --content-type application/json \
    --cli-binary-format raw-in-base64-out \
    sg-invoke-output.json
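
The response body lands in sg-invoke-output.json; assuming the NIM returns an OpenAI-style chat completion, jq (not a repo dependency) can pull out just the generated text:

# Extract the assistant reply from the invocation output (assumes an OpenAI-compatible response schema)
jq -r '.choices[0].message.content' sg-invoke-output.json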

Testing (Local)

Health

Confirm the Sagemaker health check will pass:

curl -X GET 127.0.0.1:8080/ping -vvv
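
Sagemaker treats any HTTP 200 on /ping as healthy, so a scripted check can look at the status code alone:

# Print only the HTTP status code; 200 means the health check will pass
curl -s -o /dev/null -w '%{http_code}\n' 127.0.0.1:8080/ping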

Invocation

Non-streaming

curl -X 'POST' \
    'http://127.0.0.1:8080/invocations' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama3-8b-instruct",
      "messages": [
        {
          "role": "user",
          "content": "Hello! How are you?"
        },
        {
          "role": "assistant",
          "content": "Hi! I am quite well, how can I help you today?"
        },
        {
          "role": "user",
          "content": "Can you write me a song?"
        }
      ],
      "max_tokens": 32
    }'
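
Assuming the NIM exposes an OpenAI-compatible response schema, piping the response through jq makes the completion easier to read; a compact variant of the request above:

# Minimal request with jq extracting just the generated text (assumes jq is installed)
curl -s -X POST 'http://127.0.0.1:8080/invocations' \
    -H 'Content-Type: application/json' \
    -d '{"model": "meta/llama3-8b-instruct", "messages": [{"role": "user", "content": "Hello! How are you?"}], "max_tokens": 32}' \
    | jq -r '.choices[0].message.content'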

Streaming

curl -X 'POST' \
    'http://127.0.0.1:8080/invocations' \
    -H 'accept: text/event-stream' \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "meta/llama3-8b-instruct",
      "messages": [
        {
          "role": "user",
          "content": "Hello! How are you?"
        },
        {
          "role": "assistant",
          "content": "Hi! I am quite well, how can I help you today?"
        },
        {
          "role": "user",
          "content": "Can you write me a song featuring 90s grunge rock vibes?"
        }
      ],
      "max_tokens": 320,
      "stream": true
    }'

Cleanup

Purge your Sagemaker resources (if desired) between runs:

# Cleanup Sagemaker
sg_delete_resources() {
    local endpoint_name=$1
    # Delete endpoint
    aws sagemaker delete-endpoint --endpoint-name $endpoint_name || true
    # Wait for the endpoint to be deleted
    aws sagemaker wait endpoint-deleted --endpoint-name $endpoint_name || true
    # Delete endpoint config
    aws sagemaker delete-endpoint-config --endpoint-config-name $endpoint_name || true
    # Delete model
    aws sagemaker delete-model --model-name $endpoint_name || true
}

# Delete existing resources
sg_delete_resources $SG_EP_NAME
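
If the shimmed image pushed to ECR should also be removed, something like the following works; the repository name is an assumption matching the example registry path above:

# Optional: remove the shimmed image tag from ECR (assumed repository name)
aws ecr batch-delete-image \
    --repository-name nim-shim \
    --image-ids imageTag=${SRC_IMAGE_NAME}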

About

An extensible shim layer supporting NVIDIA NIM images on CSP platforms (e.g. Sagemaker)
