# Inference with Autoscaled Models

If [statically provisioned instances](./Real-time%20inference.ipynb) don't meet your needs, Amazon SageMaker offers automatic scaling for your hosted models. With autoscaling, the number of instances allocated to a model adjusts dynamically to traffic demand, so your deployment stays cost-effective while handling varying traffic loads.

## Prerequisites
1. Ensure that the IAM role you use has the **AmazonSageMakerFullAccess** policy attached.
1. To deploy this ML model successfully, ensure that:
    1. Either your IAM role has these three permissions and you have the authority to make AWS Marketplace subscriptions in the AWS account used:
        1. **aws-marketplace:ViewSubscriptions**
        1. **aws-marketplace:Unsubscribe**
        1. **aws-marketplace:Subscribe**
    2. Or your AWS account already has a subscription to [jina-embedding-model](link). If so, skip the step [Subscribe to the model package](#1.-Subscribe-to-the-model-package).

# Subscribe to the model package

Install the `jina-sagemaker` package:


```bash
pip install --upgrade jina-sagemaker
```

In [1]:
# Specify the role as required by SageMaker
role = "..."

In [None]:
from jina_sagemaker import Client
import boto3

region = boto3.Session().region_name

# Specify the model name
model_name = "jina-embeddings-v2-small-en"

# Mapping for Model Packages
model_package_map = {
    "us-east-1": f"arn:aws:sagemaker:us-east-1:253352124568:model-package/{model_name}",
}

# Fail early if the current region has no model package mapping
if region not in model_package_map:
    raise Exception(f"Current boto3 session region {region} is not supported.")

model_package_arn = model_package_map[region]

# Create an endpoint that automatically scales

In this section, we configure an autoscaling endpoint that uses step scaling, which adjusts a resource's capacity by amounts that depend on the size of the alarm breach. For an in-depth explanation of step scaling, refer to the [step scaling documentation](https://docs.aws.amazon.com/autoscaling/application/userguide/application-auto-scaling-step-scaling-policies.html).

When using step scaling, you are responsible for configuring the alarms that trigger each policy. Generally, you'll want two alarms: one that triggers a step scale-in action and another that triggers a step scale-out action. Note that the upper and lower bounds specified in these policies are relative to the thresholds set in the corresponding alarms. For example, with an alarm threshold of 60, a step with a lower bound of 10 and an upper bound of 40 applies to metric values from 70 up to (but not including) 100.
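
To make the relative bounds concrete, the following sketch converts a list of step adjustments into absolute metric ranges for a given alarm threshold. It runs locally and makes no AWS calls. The helper `absolute_ranges` is hypothetical (it is not part of `jina-sagemaker` or boto3), and the sample steps and the threshold of 60 are illustrative values only.

```python
def absolute_ranges(threshold, step_adjustments):
    """Translate relative step bounds into absolute metric ranges.

    Returns a list of (low, high, adjustment) tuples, where None means
    the range is unbounded on that side.
    """
    ranges = []
    for step in step_adjustments:
        lower = step.get("MetricIntervalLowerBound")
        upper = step.get("MetricIntervalUpperBound")
        ranges.append(
            (
                None if lower is None else threshold + lower,
                None if upper is None else threshold + upper,
                step["ScalingAdjustment"],
            )
        )
    return ranges


# Illustrative scale-out steps, expressed relative to the alarm threshold
steps = [
    {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 10, "ScalingAdjustment": 0},
    {"MetricIntervalLowerBound": 10, "MetricIntervalUpperBound": 40, "ScalingAdjustment": 3},
    {"MetricIntervalLowerBound": 40, "ScalingAdjustment": 4},
]

for low, high, adjustment in absolute_ranges(60, steps):
    print(f"metric in [{low}, {high}) -> adjustment {adjustment}")
```

For breaches above the threshold, the lower bound of each range is inclusive and the upper bound is exclusive, so adjacent steps neither overlap nor leave gaps.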

In [None]:

client = Client(region_name=region)
client.create_endpoint(arn=model_package_arn, role=role, endpoint_name="my-autoscaling-endpoint", instance_type="ml.g4dn.xlarge", n_instances=2)

# If the endpoint already exists, connect to it instead of creating it
# client.connect_to_endpoint(endpoint_name="my-autoscaling-endpoint")

client.register_scalable_target(max_capacity=5, min_capacity=2)

r = client.set_step_autoscaling(
    policy_name="down",
    policy_configuration={
        "AdjustmentType": "ExactCapacity",
        "StepAdjustments": [
            {
                "MetricIntervalUpperBound": 0,
                "ScalingAdjustment": 2,
            }
        ],
        "MetricAggregationType": "Average",
        "Cooldown": 10,
    },
)
down_policy = r['PolicyARN']

r = client.set_step_autoscaling(
    policy_name="up",
    policy_configuration={
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            {
                "MetricIntervalLowerBound": 0,
                "MetricIntervalUpperBound": 10,
                "ScalingAdjustment": 0,
            },
            {
                "MetricIntervalLowerBound": 10,
                "MetricIntervalUpperBound": 40,
                "ScalingAdjustment": 3,
            },
            {
                "MetricIntervalLowerBound": 40,
                "ScalingAdjustment": 4,
            },
        ],
        "MetricAggregationType": "Average",
        "Cooldown": 10,
    },
)
up_policy = r['PolicyARN']

client.set_metric_alarm(policy_arn=down_policy,
    AlarmName="step_scaling_policy_alarm_down",
    MetricName="CPUUtilization",
    Namespace="/aws/sagemaker/Endpoints",
    Statistic="Average",
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=30,
    ComparisonOperator="LessThanOrEqualToThreshold",
    TreatMissingData="missing",
    Period=60,
    Unit="Percent",
)

client.set_metric_alarm(policy_arn=up_policy,
    AlarmName="step_scaling_policy_alarm_up",
    MetricName="CPUUtilization",
    Namespace="/aws/sagemaker/Endpoints",
    Statistic="Average",
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=60,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="missing",
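    Period=60,  # sub-minute periods only suit high-resolution metrics; endpoint instance metrics are reported per minute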
    Unit="Percent",
)
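
As a rough illustration of how the two policies and alarms configured above interact, the sketch below estimates the instance count after a single alarm evaluation. It is a simplified local model, not a call to AWS: `next_capacity` is a hypothetical helper, cooldowns and in-progress scaling activities are ignored, and the constants simply mirror the values used in the configuration above.

```python
MIN_CAPACITY, MAX_CAPACITY = 2, 5  # from register_scalable_target
SCALE_IN_THRESHOLD = 30            # "down" alarm fires when CPU <= 30
SCALE_OUT_THRESHOLD = 60           # "up" alarm fires when CPU > 60


def next_capacity(cpu, current):
    """Estimate the instance count after one alarm evaluation (simplified)."""
    if cpu <= SCALE_IN_THRESHOLD:
        # "down" policy: ExactCapacity sets the count directly
        desired = 2
    elif cpu > SCALE_OUT_THRESHOLD:
        # "up" policy: ChangeInCapacity adds instances based on breach size
        breach = cpu - SCALE_OUT_THRESHOLD
        if breach < 10:
            change = 0
        elif breach < 40:
            change = 3
        else:
            change = 4
        desired = current + change
    else:
        # Neither alarm is in the ALARM state, so capacity is unchanged
        desired = current
    # The scalable target bounds whatever the policy asks for
    return max(MIN_CAPACITY, min(MAX_CAPACITY, desired))


for cpu in (20, 45, 65, 75, 110):
    print(f"CPU {cpu}% with 2 instances -> {next_capacity(cpu, 2)} instances")
```

Note that a request for more than `max_capacity` instances is capped at the registered maximum, so starting from two instances the largest step cannot take the endpoint beyond five.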

# Perform real-time inference

In [None]:
result = client.embed(texts=[
    "how is the weather today", 
    "what is the weather like today",
    "what's the color of an orange",
])
result

# Clean-up

## Delete the endpoint

In [None]:
client.delete_endpoint()
client.close()

## Unsubscribe from the listing (optional)

If you would like to unsubscribe from the model package, follow these steps. Before you cancel the subscription, ensure that you do not have any [deployable model](https://console.aws.amazon.com/sagemaker/home#/models) created from the model package or using the algorithm. Note: you can find this information by looking at the container name associated with the model.

**Steps to unsubscribe from the product on AWS Marketplace**:
1. Navigate to the __Machine Learning__ tab on [__Your Software subscriptions page__](https://aws.amazon.com/marketplace/ai/library?productType=ml&ref_=mlmp_gitdemo_indust).
2. Locate the listing whose subscription you want to cancel, and then choose __Cancel Subscription__.
