
autoscaling deploy: re-enable ASG scaling before final stabilisation check #1345

Merged: 2 commits into main from reenable-asg-scaling-before-final-WaitForStabilization, Jun 5, 2024

Conversation

rtyley
Member

@rtyley rtyley commented May 23, 2024

This aims to address, to an extent, issue #1342: in an autoscaling deploy, apps cannot auto-scale until the original number of instances in the ASG proves capable of successfully handling the level of traffic, as checked by WaitForStabilization. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker, when a large increase in pageview traffic during a deploy meant the original number of instances could never have handled the load.

(screenshot: Ophan Tracker pageview traffic during the 22nd May 2024 outage)

Disabling auto-scaling

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (SuspendAlarmNotifications), and only re-enabled them at the end of the deploy (ResumeAlarmNotifications) once deployment has successfully completed.

In December 2016, with #403, an additional WaitForStabilization was added as a penultimate deploy step, which served two purposes:

  1. Ensure that the cull of old instances has completed before the deploy ends
  2. Prove that the new boxes can handle their load without the old boxes.

However, the WaitForStabilization step was placed before ResumeAlarmNotifications - so the original number of EC2 instances in the ASG must be capable of supporting the current level of traffic. If it isn't, WaitForStabilization and the entire deploy will fail, leaving auto-scaling indefinitely disabled - potentially for hours, until human intervention fixes the problem.

The old-instance cull must complete

By putting ResumeAlarmNotifications before WaitForStabilization, the Ophan outage would have been shortened from 1 hour to ~2 minutes, but as @jacobwinch points out, making only that change could lead to a race condition:

  1. Once ResumeAlarmNotifications executes, a 'low CPU' alarm could immediately decide to reduce the desired size of the ASG, randomly removing new instances
  2. WaitForStabilization could then see that the ASG is at its desired size, even if it happens that the termination of the old servers has not yet completed
  3. The deploy then reports as 'complete', even though old EC2 instances are potentially still active

Consequently, with this change we are also introducing a new task, WaitForCullToComplete, which checks specifically that EC2 instances tagged for termination have been removed from the ASG. This is a stronger check than WaitForStabilization's old assumption that an ASG at desired capacity necessarily means all the old instances are gone.
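To make the distinction concrete, here's a minimal sketch contrasting the two checks. It's Python for illustration only (the real tasks are Scala classes in Riff Raff), the helper names are hypothetical, and the `Magenta=Terminate` tag is the one mentioned later in this thread:

```python
# Illustrative sketch only - the real tasks are Scala; helper names are made up.

def at_desired_capacity(asg):
    """The old WaitForStabilization-style check: instance count == desired."""
    return len(asg["instances"]) == asg["desired"]

def cull_complete(asg):
    """The new WaitForCullToComplete-style check: no instance still carries
    the termination tag."""
    return not any(
        i["tags"].get("Magenta") == "Terminate" for i in asg["instances"]
    )

# The race described above, in miniature: a 'low CPU' alarm has already shrunk
# the desired capacity, so the ASG looks "stable" even though an old-build
# instance is still running.
asg = {
    "desired": 1,
    "instances": [{"id": "i-old", "tags": {"Magenta": "Terminate"}}],
}
print(at_desired_capacity(asg))  # True  - a stabilization check would pass
print(cull_complete(asg))        # False - the cull check keeps waiting
```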

The final sequence of tasks in an autoscaling deploy now looks like this:

```scala
CullInstancesWithTerminationTag(autoScalingGroup, target.region),
TerminationGrace(
  autoScalingGroup,
  target.region,
  terminationGrace(pkg, target, reporter)
),
ResumeAlarmNotifications(autoScalingGroup, target.region),
WaitForCullToComplete(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
),
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
)
```

Common logic for finding EC2 instances tagged for termination, used by both CullInstancesWithTerminationTag & WaitForCullToComplete, has been factored out into the new CullSummary class.
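The shape of that shared logic can be sketched like this (Python mock-up for illustration; the actual CullSummary is a Scala class, and the member names here are guesses):

```python
# Illustrative sketch of the factored-out logic: both tasks need the same
# question answered - which instances are tagged for termination?
TERMINATION_TAG_KEY = "Magenta"
TERMINATION_TAG_VALUE = "Terminate"

class CullSummary:
    """Shared view of an ASG's cull state (hypothetical Python mock-up)."""

    def __init__(self, instances):
        self.instances_for_termination = [
            i for i in instances
            if i.get("tags", {}).get(TERMINATION_TAG_KEY) == TERMINATION_TAG_VALUE
        ]

    @property
    def cull_complete(self):
        # WaitForCullToComplete succeeds once nothing is left to cull.
        return not self.instances_for_termination
```

In this shape, CullInstancesWithTerminationTag would act on `instances_for_termination`, while WaitForCullToComplete polls `cull_complete`.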

Testing

I've deployed this branch to Riff Raff CODE as build 3385, and then used that to deploy the Ophan Dashboard to CODE, successfully! You can see that the new WaitForCullToComplete step took place after ResumeAlarmNotifications, and the WaitForStabilization step completed immediately after that:

(screenshot: Riff Raff CODE deploy log, with WaitForCullToComplete running after ResumeAlarmNotifications)

cc @guardian/ophan

@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch from 6c8f8f7 to 7550f3b on May 23, 2024 14:47
@rtyley
Member Author

rtyley commented May 23, 2024

WaitForStabilization for culled instances?

@jacobwinch comments:

In order for this to work I think it would be desirable to replace the final WaitForStabilization task with a new task (WaitForOldInstancesToTerminate, or similar).

The current task is checking for an expected number of instances. This is currently an acceptable way to check because Riff-Raff is essentially holding a lock on the desired capacity setting by blocking scaling operations (i.e. we know that the number won't change).

Once we re-enable scaling the desired number of instances becomes a moving target, so I think it would be better for Riff-Raff to check that there are 0 instances with the Magenta=Terminate tag still running in the ASG. This would allow us to be sure that all instances are now running the build currently being deployed, which means it is safe/correct to mark the deployment as successful.

@jacobwinch, rather than introduce WaitForOldInstancesToTerminate, would it be reasonable to update the implementation of CullInstancesWithTerminationTag so that it just blocks until all the instances it's been asked to cull have actually terminated?

@jacobwinch
Contributor

jacobwinch commented May 23, 2024

would it be reasonable to update the implementation of CullInstancesWithTerminationTag so that it just blocks until all the instances it's been asked to cull have actually terminated?

I think it's probably safe to allow scaling actions to run again as soon as we've asked AWS to terminate the instances running the old build. At this point there is no risk of AWS trying to scale down the 'wrong' instances (i.e. the ones running the new build) and we know it is safe to launch more instances that will run with the newest build (as at least one instance running this build has already passed the health check).

With that in mind, the only downside of the suggested approach is that while we're blocking and waiting for the instances to terminate, we could probably be allowing the app to start scaling again if we were to split the 'waiting work' into a separate task. We're probably talking about a pretty small time window here though so if it's considerably easier to implement the version you've suggested then I think that would still be a big improvement!

@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch from 7550f3b to 4900a69 on May 30, 2024 19:27
@rtyley changed the base branch from main to use-strongly-typed-durations on May 30, 2024 19:27
@rtyley
Member Author

rtyley commented May 31, 2024

You've convinced me @jacobwinch, so I've added some WIP to this PR to implement a WaitForCullToComplete task!

Not quite tidied up yet, but you probably get the idea. Some code from CullInstancesWithTerminationTag is factored out into a CullSummary class so that WaitForCullToComplete can also use it.

@rtyley force-pushed the use-strongly-typed-durations branch from feaf4f0 to 3560ae2 on June 4, 2024 09:05
Base automatically changed from use-strongly-typed-durations to main on June 4, 2024 09:09
@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch 3 times, most recently from 20fa5f2 to fc4a264 on June 4, 2024 10:54
This aims to address, to some extent, issue #1342 -
the problem that *apps cannot auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_
`ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
whether the cull has completed or not, regardless of whether the ASG is scaling -
it simply checks that there are no remaining instances tagged for termination.
Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
and then `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from
1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
been factored out into a new `CullSummary` class.
@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch from fc4a264 to 0f75e11 on June 4, 2024 11:20
Comment on lines +211 to +215
```scala
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
)
```
Member Author

@rtyley rtyley Jun 4, 2024


We're retaining this here because it does offer proof that the new boxes can handle our traffic without the old boxes.

In normal deploys it will complete almost instantly; if auto-scaling alarms change the size of the ASG, it could take a little longer to settle.

Jacob Winch pointed out that `WaitForCullToComplete` is a repeating check, and
so needs to get up-to-date `AutoScalingGroupInfo` in order for it to know what
instances currently exist and what their state is! I was missing a call to
`ASG.refresh(asg)`.

This is easy to miss in tasks like `WaitForCullToComplete` that extend
`magenta.tasks.PollingCheck`: the
`magenta.tasks.ASGTask.execute(asg: AutoScalingGroup, ...)` method
puts an `AutoScalingGroup` into scope via its method parameter, and that value is
inevitably out-of-date... could be something to refactor in a later PR!

We also decided that as polling checks inevitably involve network calls, it makes
sense to put an exception-catching guard in the `magenta.tasks.PollingCheck.check()`
method.
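The refresh point generalises: a repeating check must re-fetch remote state inside its loop, not capture a snapshot once outside it. A minimal sketch of that pattern, in Python for illustration (the real polling lives in `magenta.tasks.PollingCheck`, in Scala; `poll_until` and its parameters are hypothetical):

```python
import time

def poll_until(check, refresh, timeout_s, interval_s=5.0):
    """Repeatedly refresh remote state and run `check` against it.

    The bug fixed above, in miniature: if the ASG description were captured
    once before the loop, the check would never observe instances terminating.
    Re-fetching on every pass (ASG.refresh(asg) in the Scala code) fixes it.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = refresh()  # fresh state on every iteration
        if check(state):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("check did not pass within the timeout")
        time.sleep(interval_s)
```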
@rtyley marked this pull request as ready for review June 5, 2024 09:56
@rtyley requested review from a team as code owners June 5, 2024 09:56
Comment on lines +279 to +283
```scala
val checkResult =
  Try(theCheck).recover { case NonFatal(e) =>
    reporter.info(e.getMessage)
    false
  }.get // '.get' allows a fatal exception, like an OutOfMemoryError, to kill the process, which is what we want!
```
Member Author

@rtyley rtyley Jun 5, 2024


@jacobwinch & I decided that as polling checks invariably involve network calls, it makes sense to put an exception-catching guard around the invocation of `theCheck`. If a non-fatal exception is received, we probably don't want to kill the deploy, just attempt the check again after the next `sleepyTime`.
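A rough Python analogue of that Scala `Try(theCheck).recover { case NonFatal(e) => ... }.get` pattern (Python has no exact `NonFatal`, so this sketch re-raises `MemoryError` explicitly to approximate letting an `OutOfMemoryError` escape; the function name is made up):

```python
def guarded_check(the_check, report):
    """Run one polling-check iteration, treating transient failures as
    'not yet ready' rather than deploy-killing errors."""
    try:
        return the_check()
    except MemoryError:
        # Fatal, like an OutOfMemoryError escaping Scala's NonFatal:
        # let it propagate and kill the process.
        raise
    except Exception as e:
        # Non-fatal (e.g. a flaky network call): log it and report the
        # check as failed, so the poll loop retries after its next sleep.
        report(str(e))
        return False
```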

@rtyley changed the title from 'Re-enable ASG scaling before final stabilisation' to 'autoscaling deploy: re-enable ASG scaling before final stabilisation check' Jun 5, 2024
Contributor

@jacobwinch left a comment


This looks good to me 👍

I ran one additional test to confirm that automatically scaling up while Riff-Raff is running WaitForCullToComplete works successfully.

Test Setup

  1. Redeployed this branch to Riff-Raff CODE
  2. Started sending some artificial traffic to a test service, with the aim of forcing the scale up action to coincide with the Riff-Raff deployment
  3. Started a deployment of the test service via Riff-Raff CODE

Observations

12:16:56 - Deployment starts: [12:16:56] deploy for devx::cdk-playground (build 801) to stage PROD

12:17:05 - Scaling actions are disabled by Riff-Raff: [12:17:05] task SuspendAlarmNotifications Suspending Alarm Notifications - playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB will no longer scale on any configured alarms

12:18:41¹ - Alarm used for automatically scaling up enters alarm state:

(screenshot: scale-up alarm in alarm state)

Automatic scale up does not happen yet because it is still blocked by Riff-Raff.

12:19:39 - Scaling actions are re-enabled by Riff-Raff: [12:19:39] task ResumeAlarmNotifications Resuming Alarm Notifications - playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB will scale on any configured alarms

12:19:40 - New WaitForCullToComplete task starts running: [12:19:40] task WaitForCullToComplete Check that all instances tagged for termination in playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB have been terminated

12:19:45 - The ASG scales up automatically:

(screenshot: the ASG scaling up automatically)

12:21:41 - New WaitForCullToComplete task (and overall deployment) completes:

(screenshot: WaitForCullToComplete and the overall deployment completing)

Footnotes

  1. The screenshot here shows a UTC timestamp; I've converted to BST to make it easier to follow the timeline.

@rtyley
Copy link
Member Author

rtyley commented Jun 5, 2024

I ran one additional test to confirm that automatically scaling up while Riff-Raff is running WaitForCullToComplete works successfully.

Magnificent! Great test, and gratifying to see it work correctly! I shall now merge 👍

@rtyley merged commit 295ec4b into main Jun 5, 2024
1 check passed
@rtyley deleted the reenable-asg-scaling-before-final-WaitForStabilization branch June 5, 2024 12:44