In [126]:
import threading
import numpy as np
import time
import math
from multiprocessing.pool import ThreadPool
from sagemaker.tensorflow.model import TensorFlowPredictor
from sagemaker.estimator import Estimator

## Deploy or attach to your endpoint

Attach to the existing endpoint from the prior lab, or deploy a new endpoint if the original has already been deleted.

In [134]:
SERVE_INSTANCE_TYPE = 'ml.c5.xlarge'
ENDPOINT_NAME = 'sagemaker-tensorflow-scriptmode-2019-04-17-03-52-34-057'
TRAINING_JOB_NAME = ENDPOINT_NAME
NOT_RUNNING = True

import sagemaker
from sagemaker import get_execution_role
import boto3
sess = sagemaker.Session()
role = get_execution_role()
bucket = sess.default_bucket()

if NOT_RUNNING:
    # The endpoint was deleted: redeploy it from the saved model artifact.
    from sagemaker.tensorflow.serving import Model
    model = Model(model_data=f's3://{bucket}/{TRAINING_JOB_NAME}/output/model.tar.gz',
                  role=role)
    loss_predictor = model.deploy(initial_instance_count=1,
                                  instance_type=SERVE_INSTANCE_TYPE)
else:
    # The endpoint is still running: attach to it by name.
    loss_predictor = TensorFlowPredictor(endpoint_name=ENDPOINT_NAME)

INFO:sagemaker:Created S3 bucket: sagemaker-us-east-1-355151823911
INFO:sagemaker:Creating model with name: sagemaker-tensorflow-serving-2019-04-17-13-07-17-866
INFO:sagemaker:Creating endpoint with name sagemaker-tensorflow-serving-2019-04-17-13-07-17-866


----------------------------------------------------------------!

Prepare a single payload that we will use in the simple stress test. The actual values do not matter, as we are just trying to simulate load.

In [135]:
X = [ 1.05332958, -0.53354753, -0.69436208, -2.21762908, -3.20396808,  1.03539088,
  1.20417872, -1.03589941, -0.35095956, -0.01160373, -0.1615418,  -0.20454251,
 -0.72053914]
print(str(X))

[1.05332958, -0.53354753, -0.69436208, -2.21762908, -3.20396808, 1.03539088, 1.20417872, -1.03589941, -0.35095956, -0.01160373, -0.1615418, -0.20454251, -0.72053914]


In [136]:
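def predict(payload):
    """Send one payload to the endpoint and return the round-trip time in seconds."""
    start_time = time.time()
    results = loss_predictor.predict(payload)
    elapsed_time = time.time() - start_time
    # Index into the response so that a malformed reply raises an error.
    prediction = results['predictions'][0][0]
    return elapsed_time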

Make sure a single prediction is working against the endpoint before proceeding to 
generate load for auto scaling.

In [137]:
predict(X)

0.18019890785217285

## Configure auto scaling on your endpoint

In a new browser tab, navigate to the `Endpoints` section of the SageMaker console. Navigate to the details page for the endpoint. Under the `Endpoint runtime settings`, select the one and only variant that was created for this endpoint (it is named `All traffic` by default).

Click on `Configure auto scaling` in the upper right of `Endpoint runtime settings`.

Under `Variant automatic scaling`, set the maximum number of instances to 2.

Under `Built in scaling policy`, set the target value to 2000 for the `SageMakerVariantInvocationsPerInstance` metric. Click `Save` at the bottom of the page.

You have now set a threshold that will be used by SageMaker to determine when to add more instances. If it detects more invocations per instance per minute than the threshold, more instances will be added in an attempt to distribute the load and reduce that metric to the target. 
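The arithmetic behind that threshold can be sketched roughly as follows. This is a simplified model, not SageMaker's actual algorithm: it assumes the desired instance count is the total load divided by the per-instance target, rounded up and clamped to the configured limits, and it ignores alarm evaluation periods and cool downs. The function name `estimate_instances` is hypothetical.

```python
import math

def estimate_instances(total_invocations_per_minute, target_per_instance=2000,
                       min_instances=1, max_instances=2):
    """Rough estimate of how many instances keep the per-instance load at or below the target."""
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    # Clamp to the minimum and maximum instance counts configured for the variant.
    return max(min_instances, min(max_instances, needed))

for load in (500, 2000, 3500, 10000):
    print(load, 'invocations/min ->', estimate_instances(load), 'instance(s)')
```

With the settings used in this lab (target 2000, maximum 2), any sustained load above 2000 invocations per minute would be expected to push the endpoint to two instances under this model.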

We have purposely set the threshold to a low number so that we can more easily trigger scaling. In practice, you will need to perform testing and analysis to determine an appropriate trigger and the right number of instances for your workload.

See the detailed documentation [here](https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html).
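The same configuration can probably also be applied programmatically through the Application Auto Scaling API instead of the console. The sketch below is a minimal illustration under stated assumptions: the helper names, the policy name, and the default variant name `AllTraffic` are hypothetical and should be checked against your own endpoint. It is not the method used in this lab.

```python
def build_scaling_requests(endpoint_name, variant_name='AllTraffic',
                           max_instances=2, target_value=2000.0):
    """Build the two request payloads needed to enable target tracking on a variant."""
    common = {
        'ServiceNamespace': 'sagemaker',
        'ResourceId': f'endpoint/{endpoint_name}/variant/{variant_name}',
        'ScalableDimension': 'sagemaker:variant:DesiredInstanceCount',
    }
    # Scalable target: the allowed range of instance counts.
    target = dict(common, MinCapacity=1, MaxCapacity=max_instances)
    # Scaling policy: track invocations per instance against the target value.
    policy = dict(
        common,
        PolicyName='invocations-per-instance-target',  # hypothetical name
        PolicyType='TargetTrackingScaling',
        TargetTrackingScalingPolicyConfiguration={
            'TargetValue': target_value,
            'PredefinedMetricSpecification': {
                'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance',
            },
        },
    )
    return target, policy

def apply_scaling(endpoint_name, **kwargs):
    """Send both requests to Application Auto Scaling (requires AWS credentials)."""
    import boto3  # imported here so the builder above can be inspected offline
    client = boto3.client('application-autoscaling')
    target, policy = build_scaling_requests(endpoint_name, **kwargs)
    client.register_scalable_target(**target)
    client.put_scaling_policy(**policy)
```

Calling `apply_scaling(ENDPOINT_NAME)` would then register the variant and attach the policy. Keeping the payload construction separate from the API calls makes it easy to review the settings before anything is changed.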

## Execute stress tests to force auto scaling

Here we simply use multiple client threads to drive a sufficient volume of requests to trigger SageMaker auto scaling.

In [138]:
def run_test(max_threads, max_requests):
    """Send max_requests predictions using max_threads client threads.

    Returns the sum of the individual request times in seconds.
    """
    pool = ThreadPool(max_threads)
    # Every request reuses the same payload; only the request volume matters.
    payloads = [X] * max_requests
    result = pool.map(predict, payloads)
    pool.close()
    pool.join()
    return np.sum(result)

### Drive a few short tests
We run a few tests so that invocation metrics start to appear in CloudWatch. This will help you visualize how traffic ramps up on your single-instance endpoint and is eventually distributed across a cluster of instances.

In [139]:
%%time
print('Starting test 0')
run_test(5, 10)

Starting test 0
CPU times: user 71.8 ms, sys: 8.7 ms, total: 80.5 ms
Wall time: 204 ms


In [140]:
%%time
print('Starting test 1')
run_test(10, 250)

Starting test 1
CPU times: user 1.69 s, sys: 149 ms, total: 1.84 s
Wall time: 2.05 s


In [143]:
%%time
print('Starting test 2')
run_test(10, 1000)

Starting test 2
CPU times: user 1.77 s, sys: 132 ms, total: 1.91 s
Wall time: 1.87 s


### Observe auto scaling
In the endpoint details console, you should still see the `Desired instance count` as `1`, since the scaling threshold has not been reached. This next test will continuously send traffic to the endpoint for about 15 minutes. During this time, we'll see the invocations per instance rise. Because the endpoint still has only one instance, this metric will match the total invocations exactly.

Once auto scaling is triggered, SageMaker will take several minutes to add new instances (in our case, just one). While the auto scaling is happening, the endpoint details console will show you that the new desired instance count has increased to two, and there will be a blue bar at the top of the console indicating that the endpoint is being updated. Eventually that banner turns green and indicates `Endpoint was successfully updated.`
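If you would rather watch the instance counts from the notebook than from the console, something like the following sketch may help. It assumes the response of the SageMaker `describe_endpoint` call, which reports a current and a desired instance count for each production variant. The helper names are hypothetical.

```python
def summarize_variants(endpoint_description):
    """Extract (variant name, current count, desired count) from a describe_endpoint response."""
    return [
        (v['VariantName'], v.get('CurrentInstanceCount'), v.get('DesiredInstanceCount'))
        for v in endpoint_description.get('ProductionVariants', [])
    ]

def fetch_endpoint_status(endpoint_name):
    """Query SageMaker for the endpoint status and instance counts (requires AWS credentials)."""
    import boto3  # imported here so summarize_variants can be used offline
    sm = boto3.client('sagemaker')
    description = sm.describe_endpoint(EndpointName=endpoint_name)
    return description['EndpointStatus'], summarize_variants(description)
```

While scaling out is in progress, you would expect the desired count to be higher than the current count, and the two to match again once the update completes.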

Once the expanded set of instances is running, click on `Invocation metrics` from the endpoint details console. This takes you to CloudWatch to show graphs of those metrics. Select `Invocations` and `InvocationsPerInstance`. Click on `Graphed metrics` and update the `Statistic` to be `Sum`, and the `Period` to be `1 minute`. At the top of the chart, set the time range to 30 minutes (using the `custom` drop-down).
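The same metrics can likely be pulled into the notebook with the CloudWatch API. The sketch below is an illustration only: the helper names and the default variant name `AllTraffic` are hypothetical, and the 60-second period is an assumption of this sketch rather than a setting taken from the lab.

```python
from datetime import datetime, timedelta, timezone

def build_metric_request(endpoint_name, metric_name, variant_name='AllTraffic',
                         minutes=30, now=None):
    """Build a get_metric_statistics request for one SageMaker endpoint metric."""
    end_time = now or datetime.now(timezone.utc)
    return {
        'Namespace': 'AWS/SageMaker',
        'MetricName': metric_name,  # e.g. 'Invocations' or 'InvocationsPerInstance'
        'Dimensions': [
            {'Name': 'EndpointName', 'Value': endpoint_name},
            {'Name': 'VariantName', 'Value': variant_name},
        ],
        'StartTime': end_time - timedelta(minutes=minutes),
        'EndTime': end_time,
        'Period': 60,            # seconds per data point (assumed)
        'Statistics': ['Sum'],
    }

def fetch_metric(endpoint_name, metric_name, **kwargs):
    """Return the metric data points sorted by time (requires AWS credentials)."""
    import boto3  # imported here so the request builder can be inspected offline
    cloudwatch = boto3.client('cloudwatch')
    request = build_metric_request(endpoint_name, metric_name, **kwargs)
    response = cloudwatch.get_metric_statistics(**request)
    return sorted(response['Datapoints'], key=lambda point: point['Timestamp'])
```

Fetching both `Invocations` and `InvocationsPerInstance` for the same window lets you compare the two series side by side, for example by plotting the `Sum` values against their timestamps.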

You will see the total number of invocations continue at the same pace as before, yet the invocations per instance will be cut in half as SageMaker automatically distributes the load across the cluster.

In [142]:
%%time
print('Starting test 3')
run_test(10, 400000)

Starting test 3
CPU times: user 16min 55s, sys: 1min 2s, total: 17min 58s
Wall time: 16min 56s


### Scaling back in

For extra credit, you can observe SageMaker scaling in the infrastructure (reducing the number of instances). This will take about 15 minutes after the previous traffic generator has completed. At that point, you should see a scale-in event. SageMaker detects that the invocations per instance are below the threshold and automatically reduces the number of instances to avoid being over-provisioned. Cool down parameters are available to control how aggressively SageMaker adds or removes instances.

To ensure that the CloudWatch alarm for scaling in is triggered, there needs to be at least some traffic so that the alarm has sufficient data points. Here we generate a small load.

In [146]:
%%time
print('Adding a few invocations every 30s for 15 mins')
for i in range(30):
    run_test(10, 100)
    time.sleep(30)

Adding a few invocations every 30s for 15 mins
CPU times: user 12.2 s, sys: 2.25 s, total: 14.4 s
Wall time: 15min 9s


### Delete the endpoint
Delete the endpoint, which will take down all of the instances.

In [None]:
sagemaker.Session().delete_endpoint(loss_predictor.endpoint)