# SageMaker Inference Component

❗This notebook works well with the `Data Science 3.0 Python 3` kernel on a SageMaker Studio `ml.t3.medium` instance.

In this demo, we invoke models hosted on a single SageMaker real-time endpoint using inference components, the inference capability introduced in November 2023 (see References). A copy of the [Dolly v2 7B model](https://huggingface.co/databricks/dolly-v2-7b) and a copy of the [FLAN-T5 XXL model](https://huggingface.co/google/flan-t5-xxl) from the [Hugging Face model hub](https://huggingface.co/models) are deployed to the endpoint, each as its own inference component, using [AWS CDK](https://docs.aws.amazon.com/cdk/api/v2/).

In [None]:
!pip list | grep -E "boto3|sagemaker"

boto3                                1.34.84
sagemaker                            2.215.0
sagemaker-data-insights              0.3.3
sagemaker-datawrangler               0.4.3
sagemaker-headless-execution-driver  0.0.13
sagemaker-scikit-learn-extension     2.5.0
sagemaker-studio-analytics-extension 0.0.20
sagemaker-studio-sparkmagic-lib      0.1.4


## Set up

In [None]:
import boto3
import sagemaker

role = sagemaker.get_execution_role()
region_name = boto3.Session().region_name

In [None]:
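from typing import Dict
import boto3


def get_cfn_outputs(stack_name: str, region_name: str = 'us-east-1') -> Dict[str, str]:
    """Return the outputs of a CloudFormation stack as an {OutputKey: OutputValue} dict."""
    cfn = boto3.client('cloudformation', region_name=region_name)
    outputs = {}
    # describe_stacks returns a list of stacks; a lookup by name yields exactly one.
    for output in cfn.describe_stacks(StackName=stack_name)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs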
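
The flattening step inside `get_cfn_outputs` can be tried without calling AWS. The following is a minimal sketch, not part of the original notebook: `flatten_stack_outputs` is a hypothetical helper name, and the sample response is hand-written with made-up values, shaped like a `describe_stacks` response. Unlike the function above, it tolerates a stack whose response carries no `Outputs` key.

```python
from typing import Any, Dict


def flatten_stack_outputs(describe_stacks_response: Dict[str, Any]) -> Dict[str, str]:
    """Turn the Outputs list of the first stack into an {OutputKey: OutputValue} dict."""
    stack = describe_stacks_response["Stacks"][0]
    # Fall back to an empty list when the stack entry has no "Outputs" key.
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


# Hand-written sample; the names and values are illustrative only.
sample_response = {
    "Stacks": [
        {
            "StackName": "ExampleStack",
            "Outputs": [
                {"OutputKey": "EndpointName", "OutputValue": "example-endpoint"},
                {"OutputKey": "InferenceComponentName", "OutputValue": "example-component"},
            ],
        }
    ]
}

flat_outputs = flatten_stack_outputs(sample_response)
print(flat_outputs)
```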

In [None]:
CFN_STACK_NAME = "SageMakerInferenceComponent1" # name of CloudFormation stack

cfn_outputs = get_cfn_outputs(CFN_STACK_NAME, region_name=region_name)
endpoint_name = cfn_outputs['EndpointName']
inference_component_name1 = cfn_outputs['InferenceComponentName']

In [None]:
CFN_STACK_NAME = "SageMakerInferenceComponent2" # name of CloudFormation stack

cfn_outputs = get_cfn_outputs(CFN_STACK_NAME, region_name=region_name)
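endpoint_name = cfn_outputs['EndpointName']  # overwrites the value read from the first stack; both components are invoked on this endpoint below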
inference_component_name2 = cfn_outputs['InferenceComponentName']
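
Both cells above assign `endpoint_name`, so the second assignment silently replaces the first. If the two stacks are expected to share one endpoint, a quick consistency check makes that expectation explicit. This is a sketch under that assumption; `shared_endpoint_name` is a hypothetical helper and the example values are made up.

```python
from typing import Dict


def shared_endpoint_name(outputs_by_stack: Dict[str, Dict[str, str]]) -> str:
    """Return the single EndpointName that all stacks agree on, or raise ValueError."""
    names = {outputs["EndpointName"] for outputs in outputs_by_stack.values()}
    if len(names) != 1:
        raise ValueError(f"Expected one shared endpoint, found: {sorted(names)}")
    return names.pop()


# Illustrative values only; in the notebook these dicts would come from get_cfn_outputs().
example_outputs = {
    "SageMakerInferenceComponent1": {"EndpointName": "example-endpoint"},
    "SageMakerInferenceComponent2": {"EndpointName": "example-endpoint"},
}
print(shared_endpoint_name(example_outputs))
```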

## Run Inference

### Using Boto3 SDK to invoke SageMaker Endpoint

In [None]:
sm_runtime_client = boto3.client(service_name="sagemaker-runtime", region_name=region_name)

In [None]:
payload = {"inputs": "Why is California a great place to live?"}
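
The payload above sends only the prompt. As a sketch, and assuming the serving container follows the common Hugging Face convention of an optional `parameters` object next to `inputs` (the notebook does not state which container is used), generation settings could be attached as shown here. `build_payload` is a hypothetical helper, and `max_new_tokens` is an example setting that the container may or may not honor.

```python
import json
from typing import Any, Dict, Optional


def build_payload(prompt: str, parameters: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a request body with the prompt and, optionally, generation parameters."""
    body: Dict[str, Any] = {"inputs": prompt}
    if parameters:
        # Copy so later changes to the caller's dict do not leak into the request body.
        body["parameters"] = dict(parameters)
    return body


example_payload = build_payload(
    "Why is California a great place to live?",
    {"max_new_tokens": 64},  # assumed parameter name, see the note above
)
print(json.dumps(example_payload))
```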

In [None]:
import json


print(f'InferenceComponentName: {inference_component_name1}')

response_dolly = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name1,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

result_dolly = json.loads(response_dolly['Body'].read().decode())
result_dolly

InferenceComponentName: ic-dolly-v2-7b-7304256


[{'generated_text': 'Why is California a great place to live?\n\nCalifornia is a great place to live for many reasons. The climate is amazing, with mild'}]
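
In the output above, `generated_text` begins with the prompt itself. When only the completion is wanted, the echoed prompt can be removed after the call. A minimal sketch follows; `strip_prompt` is a hypothetical helper, and the sample result is copied from the output shown above.

```python
from typing import Dict, List


def strip_prompt(results: List[Dict[str, str]], prompt: str) -> List[str]:
    """Return each generated_text without a leading copy of the prompt."""
    completions = []
    for item in results:
        text = item["generated_text"]
        if text.startswith(prompt):
            text = text[len(prompt):]
        # Drop the blank lines that separate the echoed prompt from the completion.
        completions.append(text.strip())
    return completions


prompt = "Why is California a great place to live?"
sample_result = [
    {
        "generated_text": "Why is California a great place to live?\n\n"
        "California is a great place to live for many reasons. "
        "The climate is amazing, with mild"
    }
]
print(strip_prompt(sample_result, prompt))
```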

In [None]:
import json


print(f'InferenceComponentName: {inference_component_name2}')

response_flant5 = sm_runtime_client.invoke_endpoint(
    EndpointName=endpoint_name,
    InferenceComponentName=inference_component_name2,
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps(payload),
)

result_flant5 = json.loads(response_flant5['Body'].read().decode())
result_flant5

InferenceComponentName: ic-flan-t5-xxl-8396501


[{'generated_text': 'California is a great place to live because of its diverse climate and geography'}]
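
The two invocation cells differ only in the `InferenceComponentName` they pass. A small wrapper can capture the shared call pattern. This is a sketch rather than part of the original notebook: `invoke_component` is a hypothetical helper, and `FakeRuntimeClient` is a stand-in used only so the example can be tried offline; in the notebook, `sm_runtime_client` would be passed instead.

```python
import io
import json
from typing import Any


def invoke_component(client: Any, endpoint_name: str, component_name: str, payload: dict) -> Any:
    """Invoke one inference component on an endpoint and return the decoded JSON body."""
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        InferenceComponentName=component_name,
        ContentType="application/json",
        Accept="application/json",
        Body=json.dumps(payload),
    )
    return json.loads(response["Body"].read().decode())


class FakeRuntimeClient:
    """Offline stand-in that mimics the shape of an invoke_endpoint response."""

    def invoke_endpoint(self, **kwargs):
        echoed = [{"generated_text": f"echo from {kwargs['InferenceComponentName']}"}]
        return {"Body": io.BytesIO(json.dumps(echoed).encode())}


offline_result = invoke_component(
    FakeRuntimeClient(), "example-endpoint", "example-component", {"inputs": "hi"}
)
print(offline_result)
```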

### Using SageMaker Python SDK to invoke SageMaker Endpoint

In [None]:
from sagemaker import Predictor
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer


serializer = JSONSerializer()
deserializer = JSONDeserializer()

predictor = Predictor(
    endpoint_name=endpoint_name,
    serializer=serializer,
    deserializer=deserializer
)

In [None]:
print(f'InferenceComponentName: {inference_component_name1}')

initial_args = {'InferenceComponentName': inference_component_name1}
response = predictor.predict(
    initial_args=initial_args,
    data=payload
)

response

InferenceComponentName: ic-dolly-v2-7b-7304256


[{'generated_text': 'Why is California a great place to live?\n\nCalifornia is a great place to live for many reasons. The climate is amazing, with mild'}]

In [None]:
print(f'InferenceComponentName: {inference_component_name2}')

initial_args = {'InferenceComponentName': inference_component_name2}
response = predictor.predict(
    initial_args=initial_args,
    data=payload
)

response

InferenceComponentName: ic-flan-t5-xxl-8396501


[{'generated_text': 'California is a great place to live because of its diverse climate and geography'}]
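
The same `Predictor` serves both components; only `initial_args` changes between calls. The sketch below loops over component names and collects the responses. `predict_with_components` is a hypothetical helper, and `FakePredictor` is an offline stand-in that only imitates the `predict(data=..., initial_args=...)` call used above; in the notebook, the real `predictor` would be passed instead.

```python
from typing import Any, Dict, Iterable


def predict_with_components(
    predictor: Any, component_names: Iterable[str], payload: dict
) -> Dict[str, Any]:
    """Send the same payload to each inference component and collect responses by name."""
    results = {}
    for name in component_names:
        results[name] = predictor.predict(
            data=payload,
            initial_args={"InferenceComponentName": name},
        )
    return results


class FakePredictor:
    """Offline stand-in that reports which component it was asked to use."""

    def predict(self, data, initial_args=None):
        component = initial_args["InferenceComponentName"]
        return [{"generated_text": f"{component} saw {data['inputs']}"}]


offline_results = predict_with_components(
    FakePredictor(), ["component-a", "component-b"], {"inputs": "hi"}
)
print(offline_results)
```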

## References

- [Amazon SageMaker adds new inference capabilities to help reduce foundation model deployment costs and latency (2023-11-29)](https://aws.amazon.com/blogs/aws/amazon-sagemaker-adds-new-inference-capabilities-to-help-reduce-foundation-model-deployment-costs-and-latency/)
- [Reduce model deployment costs by 50% on average using the latest features of Amazon SageMaker (2023-11-30)](https://aws.amazon.com/blogs/machine-learning/reduce-model-deployment-costs-by-50-on-average-using-sagemakers-latest-features/)