# Auto Scaling Lab
This notebook walks you through how to configure and execute auto scaling on a SageMaker endpoint.

In [None]:
import threading
import numpy as np
import time
import math
from multiprocessing.pool import ThreadPool
from sagemaker.tensorflow.model import TensorFlowPredictor
from sagemaker.estimator import Estimator

## Deploy or attach to your endpoint

This lab depends on the prior lab, in which you brought your own TensorFlow script. To get started, attach to the existing endpoint from that lab by setting `NOT_RUNNING` to `False` in the cell below. If the endpoint has already been deleted, leave `NOT_RUNNING` set to `True` and the cell re-deploys the model from the training job you ran earlier.

To locate your specific training job, go back to your notebook from the earlier lab and look at the cell output from the `fit` method. It will show you the specific training job name. **Enter that as `ENDPOINT_NAME` in this cell before proceeding**. This ensures we use the same model you trained earlier.

In [None]:
SERVE_INSTANCE_TYPE = 'ml.c5.xlarge'
ENDPOINT_NAME = 'sagemaker-tensorflow-scriptmode-2019-04-17-03-52-34-057'
TRAINING_JOB_NAME = ENDPOINT_NAME
# Set to False to attach to an endpoint that is still running from the prior lab
NOT_RUNNING = True

import sagemaker
from sagemaker import get_execution_role
import boto3
sess = sagemaker.Session()
role = get_execution_role()
bucket = sess.default_bucket()

if NOT_RUNNING:
    # Re-deploy the model from the artifacts of the earlier training job
    from sagemaker.tensorflow.serving import Model
    model = Model(model_data=f's3://{bucket}/{TRAINING_JOB_NAME}/output/model.tar.gz',
                  role=role)
    loss_predictor = model.deploy(initial_instance_count=1,
                                  instance_type=SERVE_INSTANCE_TYPE)
else:
    # Attach to the endpoint that is already running
    loss_predictor = TensorFlowPredictor(endpoint_name=ENDPOINT_NAME)

Now that the endpoint is available, prepare a single payload that we will use in the simple stress test. The actual values do not matter, as we are just trying to simulate load.

In [None]:
X = [ 1.05332958, -0.53354753, -0.69436208, -2.21762908, -3.20396808,  1.03539088,
  1.20417872, -1.03589941, -0.35095956, -0.01160373, -0.1615418,  -0.20454251,
 -0.72053914]
print(str(X))

Define a simple function for making a prediction. Track the elapsed time and return that as seconds.

In [None]:
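def predict(payload):
    # Time a single round trip to the endpoint
    start_time = time.time()
    results = loss_predictor.predict(payload)
    elapsed_time = time.time() - start_time
    # Index into the response to confirm it has the expected shape
    prediction = results['predictions'][0][0]
    return elapsed_time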

Make sure a single prediction is working against the endpoint before proceeding to 
generate load for auto scaling.

In [None]:
predict(X)

## Configure auto scaling on your endpoint

Follow these steps to configure auto scaling.

1. In a new browser tab, navigate to the `Endpoints` section of the SageMaker console. 

2. Navigate to the details page for the endpoint. 

3. Under the `Endpoint runtime settings`, select the one and only variant that was created for this endpoint (it is named `All traffic` by default).

4. Click on `Configure auto scaling` in the upper right of `Endpoint runtime settings`.

5. Under `Variant automatic scaling`, set the maximum number of instances to 2.

6. Under `Built in scaling policy`, set the target value to 2000 for the `SageMakerVariantInvocationsPerInstance` metric.

7. Click `Save` at the bottom of the page.

You have now set a threshold that will be used by SageMaker to determine when to add more instances. If it detects more invocations per instance per minute than the threshold, more instances will be added in an attempt to distribute the load and reduce that metric to the target. 

We have purposely set the threshold to a low number so that we can more easily trigger scaling. In practice, you will need to perform testing and analysis to determine an appropriate trigger and the right number of instances for your workload.
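As a rough illustration of the arithmetic behind the target, the sketch below estimates how many instances would be needed to bring invocations per instance per minute back down to the target. It is a simplification for intuition only, not SageMaker's actual scaling algorithm. The function name `estimate_instances` and the sample traffic figures are hypothetical.

```python
import math


def estimate_instances(total_invocations_per_minute, target_per_instance,
                       min_instances=1, max_instances=2):
    """Estimate the instance count that brings the per-instance rate to the target.

    Simplified illustration only; not SageMaker's actual scaling algorithm.
    """
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    # Clamp to the configured minimum and maximum instance counts
    return max(min_instances, min(max_instances, needed))


# Hypothetical traffic levels measured against the target of 2000 used in this lab
print(estimate_instances(1500, 2000))   # below the target: stays at the minimum
print(estimate_instances(3000, 2000))   # above the target: a second instance is needed
print(estimate_instances(10000, 2000))  # far above the target: capped at the maximum
```

With a maximum of 2 instances, any sustained load above the target results in at most one additional instance, which is what this lab sets out to observe.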

See the detailed documentation on SageMaker auto scaling [here](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).
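The console steps above can also be sketched in code using the Application Auto Scaling API through `boto3`. This is a minimal sketch, not part of the original lab. It assumes the variant is named `AllTraffic`; the policy name `invocations-per-instance-target` and the endpoint name `my-endpoint` are hypothetical. The request builder is kept separate from the AWS calls so that you can inspect the parameters without changing anything in your account.

```python
def build_scaling_requests(endpoint_name, variant_name='AllTraffic',
                           min_capacity=1, max_capacity=2, target_value=2000.0):
    """Build the two requests needed for a target tracking policy on one variant."""
    common = {
        'ServiceNamespace': 'sagemaker',
        'ResourceId': f'endpoint/{endpoint_name}/variant/{variant_name}',
        'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
    }
    # Request 1: register the variant as a scalable target with its capacity limits
    target = dict(common, MinCapacity=min_capacity, MaxCapacity=max_capacity)
    # Request 2: attach a target tracking policy on the predefined metric
    policy = dict(
        common,
        PolicyName='invocations-per-instance-target',  # hypothetical name
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': float(target_value),
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
            },
        },
    )
    return target, policy


def apply_scaling(endpoint_name, **kwargs):
    """Send both requests to AWS. This changes the endpoint's scaling settings."""
    import boto3  # imported here so the builder above works without AWS access
    client = boto3.client('application-autoscaling')
    target, policy = build_scaling_requests(endpoint_name, **kwargs)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)


# Inspect the requests for a hypothetical endpoint name without calling AWS
target, policy = build_scaling_requests('my-endpoint')
print(target)
print(policy)
```

In this notebook, `apply_scaling(loss_predictor.endpoint)` would apply the same settings as the console steps, assuming the variant name matches.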

## Execute stress tests to force auto scaling

Now that the endpoint has auto scaling configured, let's drive some inference traffic against it. We use multiple client threads to generate enough request volume to trigger SageMaker auto scaling. The requests are mapped across a pool of threads, and the elapsed times of the individual requests are summed and returned.

In [None]:
def run_test(max_threads, max_requests):
    # Fan the requests out across a pool of client threads
    pool = ThreadPool(max_threads)
    bunch_of_x = [X] * max_requests
    result = pool.map(predict, bunch_of_x)
    pool.close()
    pool.join()
    # Sum the per-request elapsed times (in seconds)
    elapsedtime = np.sum(result)
    return elapsedtime
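To see how the thread pool affects the reported numbers without touching the endpoint, here is a small self-contained sketch. It uses a stand-in `fake_predict` function (hypothetical, it only sleeps) in place of the real prediction call. Because requests overlap across threads, the summed per-request time is larger than the wall-clock time of the whole test.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_predict(payload, latency=0.02):
    """Stand-in for the real predict function; sleeps instead of calling an endpoint."""
    start = time.time()
    time.sleep(latency)
    return time.time() - start


def run_fake_test(max_threads, max_requests):
    """Map the requests across a thread pool; return (summed time, wall-clock time)."""
    payloads = [[0.0]] * max_requests
    wall_start = time.time()
    with ThreadPool(max_threads) as pool:
        elapsed_times = pool.map(fake_predict, payloads)
    wall_clock = time.time() - wall_start
    return sum(elapsed_times), wall_clock


summed, wall_clock = run_fake_test(max_threads=5, max_requests=20)
print(f'summed request time: {summed:.2f}s, wall-clock time: {wall_clock:.2f}s')
```

This is why the value returned by `run_test` differs from the `%%time` output of the cells that call it: the former adds up per-request latency, while the latter measures how long the whole test took.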

### Drive a few short tests
We run a few tests so that invocation metrics start to appear in CloudWatch. This helps you visualize how traffic ramps up on your single-instance endpoint and is eventually distributed across a cluster of instances.

In [None]:
%%time
print('Starting test 0')
run_test(5, 10)

In [None]:
%%time
print('Starting test 1')
run_test(10, 250)

In [None]:
%%time
print('Starting test 2')
run_test(10, 1000)

### Observe auto scaling

To trigger auto scaling, kick off one more round of tests. While it is running, read the instructions in the cell that follows. They explain how to confirm that auto scaling worked.

In [None]:
%%time
print('Starting test 3')
run_test(10, 400000)

Before the load from this test builds up, the endpoint details console should still show the `Desired instance count` as `1`, since the scaling threshold has not yet been reached.

This test continuously sends traffic to the endpoint for about 15 minutes. During this time, invocations per instance will rise. Until auto scaling happens, invocations per instance track total invocations exactly, since the endpoint starts with a single instance. Note that in practice, you would want to start with at least two instances, which gives you higher availability by spreading the endpoint across multiple Availability Zones.

Once auto scaling is triggered, SageMaker will take several minutes to add new instances (in our case, just one). While the auto scaling is happening, the endpoint details console will show you that the new desired instance count has increased to two. There will also be a blue bar at the top of the console indicating that the endpoint is being updated. Eventually that banner turns green and indicates that the `Endpoint was successfully updated.`
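If you prefer to check the instance counts from code rather than the console, the sketch below reads them from the SageMaker `DescribeEndpoint` response through `boto3`. It is a minimal sketch under the assumption that your credentials allow `describe_endpoint`. The helper names and the sample response are hypothetical and hand-written for illustration.

```python
def summarize_variants(describe_endpoint_response):
    """Extract the endpoint status and per-variant instance counts."""
    status = describe_endpoint_response['EndpointStatus']
    variants = [
        (v['VariantName'], v.get('CurrentInstanceCount'), v.get('DesiredInstanceCount'))
        for v in describe_endpoint_response.get('ProductionVariants', [])
    ]
    return status, variants


def check_endpoint(endpoint_name):
    """Call DescribeEndpoint and summarize the result (requires AWS access)."""
    import boto3  # imported here so summarize_variants works without AWS access
    client = boto3.client('sagemaker')
    return summarize_variants(client.describe_endpoint(EndpointName=endpoint_name))


# Hand-written sample response resembling an endpoint in the middle of scaling out
sample = {
    'EndpointStatus': 'Updating',
    'ProductionVariants': [
        {'VariantName': 'AllTraffic', 'CurrentInstanceCount': 1, 'DesiredInstanceCount': 2},
    ],
}
print(summarize_variants(sample))
```

Calling `check_endpoint(loss_predictor.endpoint)` every minute or so while the test runs should show the desired count move ahead of the current count, and then both settle at the new value.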

Once the expanded set of instances is running, click on `Invocation metrics` from the endpoint details console. This takes you to CloudWatch to show graphs of those metrics. Select two metrics: `Invocations` and `InvocationsPerInstance`. Next, click on the `Graphed metrics` tab, and update the `Statistics` to be `Sum`, and the `Period` to be `1 minute`, which matches the per-minute granularity of the scaling target. At the top of the chart, set the time period to 30 minutes (using the `custom` drop-down).

You will now see the total number of invocations continue at the same pace as before, yet the *invocations per instance* will be cut roughly in half, as SageMaker automatically distributes the load across the cluster.
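The same two metrics can be pulled programmatically from CloudWatch. The sketch below builds a `get_metric_statistics` request for the `AWS/SageMaker` namespace. It assumes the variant is named `AllTraffic`; the helper names and the endpoint name `my-endpoint` are hypothetical. The request is built separately from the call so that it can be inspected without AWS access.

```python
from datetime import datetime, timedelta, timezone


def build_metric_request(endpoint_name, metric_name, variant_name='AllTraffic',
                         minutes=30, now=None):
    """Build a CloudWatch request for per-minute sums of one endpoint metric."""
    end = now or datetime.now(timezone.utc)
    return {
        'Namespace': 'AWS/SageMaker',
        'MetricName': metric_name,
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name},
        ],
        'StartTime': end - timedelta(minutes=minutes),
        'EndTime': end,
        'Period': 60,  # one-minute buckets
        'Statistics': ['Sum'],
    }


def fetch_sums(endpoint_name, metric_name):
    """Return (timestamp, sum) pairs in time order (requires AWS access)."""
    import boto3  # imported here so the builder above works without AWS access
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.get_metric_statistics(
        **build_metric_request(endpoint_name, metric_name))
    points = sorted(response['Datapoints'], key=lambda p: p['Timestamp'])
    return [(p['Timestamp'], p['Sum']) for p in points]


# Inspect the request for a hypothetical endpoint name without calling AWS
request = build_metric_request('my-endpoint', 'InvocationsPerInstance')
print(request['MetricName'], request['Period'], request['Statistics'])
```

Comparing `fetch_sums(name, 'Invocations')` with `fetch_sums(name, 'InvocationsPerInstance')` before and after the scale-out should show the two series diverge once the second instance is in service.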

### Scaling back in (optional)

For extra credit, you can observe SageMaker scaling in, that is, reducing the number of instances. The scale-in event should occur about 15 minutes after the previous traffic generator completes. SageMaker detects that invocations per instance are below the threshold and automatically reduces the number of instances to avoid over-provisioning. Cooldown parameters are available to control how aggressively SageMaker adds or removes instances.

To ensure the CloudWatch scale-in alarm is triggered, there must be at least some traffic so that the alarm has enough data points to evaluate. Here we generate a small load.

In [None]:
%%time
print('Adding a few invocations every 30s for 15 mins')
for i in range(30):
    run_test(10, 100)
    time.sleep(30)
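To confirm from code that a scale-in (or scale-out) event occurred, you can list the scaling activities recorded by Application Auto Scaling. This is a minimal sketch that assumes the variant is named `AllTraffic`. The helper names are hypothetical, and the sample response is hand-written for illustration rather than real output.

```python
from datetime import datetime, timezone


def format_activities(response):
    """Turn a DescribeScalingActivities response into one line per activity."""
    return [
        f"{a['StartTime']}  {a['StatusCode']}  {a['Description']}"
        for a in response.get('ScalingActivities', [])
    ]


def list_scaling_activities(endpoint_name, variant_name='AllTraffic'):
    """Fetch and format recent scaling activities (requires AWS access)."""
    import boto3  # imported here so format_activities works without AWS access
    client = boto3.client('application-autoscaling')
    response = client.describe_scaling_activities(
        ServiceNamespace='sagemaker',
        ResourceId=f'endpoint/{endpoint_name}/variant/{variant_name}',
    )
    return format_activities(response)


# Hand-written sample response for illustration only
sample = {
    'ScalingActivities': [
        {
            'StartTime': datetime(2019, 4, 17, 5, 30, tzinfo=timezone.utc),
            'StatusCode': 'Successful',
            'Description': 'Setting desired instance count to 1.',
        },
    ],
}
for line in format_activities(sample):
    print(line)
```

Running `list_scaling_activities(loss_predictor.endpoint)` after the low-traffic loop finishes should include an entry for the reduction in instance count if the scale-in occurred.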

### Delete the endpoint
Delete the endpoint, which will take down all of the instances.

In [None]:
sagemaker.Session().delete_endpoint(loss_predictor.endpoint)