
autoscaling deploy: re-enable ASG scaling before final stabilisation check #1345

Merged: 2 commits into main from reenable-asg-scaling-before-final-WaitForStabilization, Jun 5, 2024

Conversation

rtyley
Member

@rtyley rtyley commented May 23, 2024

This aims to address, to an extent, issue #1342: in an autoscaling deploy, apps cannot auto-scale until the original number of instances in the ASG proves capable of successfully handling the level of traffic, as checked by WaitForStabilization. On 22nd May 2024, this inability to auto-scale led to a severe outage in the Ophan Tracker, when a large increase in pageview traffic during a deploy meant the original number of instances could never have handled the load.

(screenshot: Ophan Tracker pageview traffic during the 22nd May 2024 outage)

Disabling auto-scaling

Ever since #83 in April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy (SuspendAlarmNotifications), and only re-enabled them at the end of the deploy (ResumeAlarmNotifications) once deployment has successfully completed.

In December 2016, with #403, an additional WaitForStabilization was added as a penultimate deploy step, which served two purposes:

  1. Ensure that the cull of old instances has completed before the deploy ends
  2. Prove that the new boxes can handle their load without the old boxes.

However, the WaitForStabilization step was placed before ResumeAlarmNotifications - so the original number of EC2 instances in the ASG must be capable of supporting the current level of traffic. If it isn't, WaitForStabilization and the entire deploy will fail, leaving auto-scaling indefinitely disabled - potentially for hours, until human intervention fixes the problem.

The old-instance cull must complete

By putting ResumeAlarmNotifications before WaitForStabilization, the Ophan outage would have been shortened from 1 hour to ~2 minutes, but as @jacobwinch points out, making only that change could lead to a race condition:

  1. Once ResumeAlarmNotifications executes, a 'low CPU' alarm could immediately decide to reduce the desired size of the ASG, randomly removing new instances
  2. WaitForStabilization could then see that the ASG is at its desired size, even if it happens that the termination of the old servers has not yet completed
  3. The deploy then reports as 'complete', even though old EC2 instances are potentially still active

Consequently, with this change we are also introducing a new task, WaitForCullToComplete, which checks specifically that EC2 instances tagged for termination have been removed from the ASG. This is a stronger check than WaitForStabilization's old assumption that an ASG at desired capacity necessarily means all the old instances are gone.
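To make the distinction concrete, here's a minimal sketch contrasting the two checks. It's Python for illustration only (the real tasks are Scala classes in Riff Raff), the helper names are hypothetical, and the `Magenta=Terminate` tag is the one mentioned later in this thread:

```python
# Illustrative sketch only - the real tasks are Scala; helper names are made up.

def at_desired_capacity(asg):
    """The old WaitForStabilization-style check: instance count == desired."""
    return len(asg["instances"]) == asg["desired"]

def cull_complete(asg):
    """The new WaitForCullToComplete-style check: no instance still carries
    the termination tag."""
    return not any(
        i["tags"].get("Magenta") == "Terminate" for i in asg["instances"]
    )

# The race described above, in miniature: a 'low CPU' alarm has already shrunk
# the desired capacity, so the ASG looks "stable" even though an old-build
# instance is still running.
asg = {
    "desired": 1,
    "instances": [{"id": "i-old", "tags": {"Magenta": "Terminate"}}],
}
print(at_desired_capacity(asg))  # True  - a stabilization check would pass
print(cull_complete(asg))        # False - the cull check keeps waiting
```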

The final sequence of tasks in an autoscaling deploy now looks like this:

```scala
CullInstancesWithTerminationTag(autoScalingGroup, target.region),
TerminationGrace(
  autoScalingGroup,
  target.region,
  terminationGrace(pkg, target, reporter)
),
ResumeAlarmNotifications(autoScalingGroup, target.region),
WaitForCullToComplete(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
),
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
)
```

Common logic for finding EC2 instances tagged for termination, used by both CullInstancesWithTerminationTag & WaitForCullToComplete, has been factored out into the new CullSummary class.
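The shape of that shared logic can be sketched like this (Python mock-up for illustration; the actual CullSummary is a Scala class, and the member names here are guesses):

```python
# Illustrative sketch of the factored-out logic: both tasks need the same
# question answered - which instances are tagged for termination?
TERMINATION_TAG_KEY = "Magenta"
TERMINATION_TAG_VALUE = "Terminate"

class CullSummary:
    """Shared view of an ASG's cull state (hypothetical Python mock-up)."""

    def __init__(self, instances):
        self.instances_for_termination = [
            i for i in instances
            if i.get("tags", {}).get(TERMINATION_TAG_KEY) == TERMINATION_TAG_VALUE
        ]

    @property
    def cull_complete(self):
        # WaitForCullToComplete succeeds once nothing is left to cull.
        return not self.instances_for_termination
```

In this shape, CullInstancesWithTerminationTag would act on `instances_for_termination`, while WaitForCullToComplete polls `cull_complete`.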

Testing

I've deployed this branch to Riff Raff CODE as build 3385, and then used that to deploy the Ophan Dashboard to CODE, successfully! You can see that the new WaitForCullToComplete step took place after ResumeAlarmNotifications, and the WaitForStabilization step completed immediately after that:

(screenshot: Riff Raff CODE deploy log, with WaitForCullToComplete running after ResumeAlarmNotifications)

cc @guardian/ophan

@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch from 6c8f8f7 to 7550f3b on May 23, 2024 14:47
@rtyley
Member Author

rtyley commented May 23, 2024

WaitForStabilization for culled instances?

@jacobwinch comments:

In order for this to work I think it would be desirable to replace the final WaitForStabilization task with a new task (WaitForOldInstancesToTerminate, or similar).

The current task is checking for an expected number of instances. This is currently an acceptable way to check because Riff-Raff is essentially holding a lock on the desired capacity setting by blocking scaling operations (i.e. we know that the number won't change).

Once we re-enable scaling the desired number of instances becomes a moving target, so I think it would be better for Riff-Raff to check that there are 0 instances with the Magenta=Terminate tag still running in the ASG. This would allow us to be sure that all instances are now running the build currently being deployed, which means it is safe/correct to mark the deployment as successful.

@jacobwinch, rather than introduce WaitForOldInstancesToTerminate, would it be reasonable to update the implementation of CullInstancesWithTerminationTag so that it just blocks until all the instances it's been asked to cull have actually terminated?

@jacobwinch
Contributor

jacobwinch commented May 23, 2024

would it be reasonable to update the implementation of CullInstancesWithTerminationTag so that it just blocks until all the instances it's been asked to cull have actually terminated?

I think it's probably safe to allow scaling actions to run again as soon as we've asked AWS to terminate the instances running the old build. At this point there is no risk of AWS trying to scale down the 'wrong' instances (i.e. the ones running the new build) and we know it is safe to launch more instances that will run with the newest build (as at least one instance running this build has already passed the health check).

With that in mind, the only downside of the suggested approach is that while we're blocking and waiting for the instances to terminate, we could probably be allowing the app to start scaling again if we were to split the 'waiting work' into a separate task. We're probably talking about a pretty small time window here though so if it's considerably easier to implement the version you've suggested then I think that would still be a big improvement!

@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch from 7550f3b to 4900a69 on May 30, 2024 19:27
@rtyley changed the base branch from main to use-strongly-typed-durations on May 30, 2024 19:27
@rtyley
Member Author

rtyley commented May 31, 2024

You've convinced me @jacobwinch, so I've added some WIP to this PR to implement a WaitForCullToComplete task!

Not quite tidied up yet, but you probably get the idea. Some code from CullInstancesWithTerminationTag is factored out into a CullSummary class so that WaitForCullToComplete can also use it.

@rtyley force-pushed the use-strongly-typed-durations branch from feaf4f0 to 3560ae2 on June 4, 2024 09:05
Base automatically changed from use-strongly-typed-durations to main on June 4, 2024 09:09
@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch 3 times, most recently from 20fa5f2 to fc4a264 on June 4, 2024 10:54
This aims to address, to some extent, issue #1342 -
the problem that *apps cannot auto-scale* until an autoscaling deploy has
successfully completed. On 22nd May 2024, this inability to auto-scale led
to a severe outage in the Ophan Tracker.

Ever since #83 in
April 2013, Riff Raff has disabled ASG scaling alarms at the start of a deploy
(`SuspendAlarmNotifications`), and only re-enabled them at the end of the deploy,
(`ResumeAlarmNotifications`) once deployment has successfully completed.

In December 2016, with #403, an
additional `WaitForStabilization` was added as a penultimate deploy step,
with the aim of ensuring that the cull of old instances has _completed_
before the deploy ends. However, the `WaitForStabilization` step was added _before_
`ResumeAlarmNotifications`, rather than _after_, and if the ASG instances are
already overloaded and recycling, the ASG will _never_ stabilise, because it _needs
to scale up_ to handle the load it's experiencing.

In this change, we introduce a new task, `WaitForCullToComplete`, that can establish
whether the cull has completed or not, regardless of whether the ASG is scaling -
it simply checks that there are no remaining instances tagged for termination.
Consequently, once we've executed `CullInstancesWithTerminationTag` to _request_ old
instances terminate, we can immediately allow scaling with `ResumeAlarmNotifications`,
and then `WaitForCullToComplete` _afterwards_.

With this change in place, the Ophan outage would have been shortened from
1 hour to ~2 minutes, a much better outcome!

Common code between `CullInstancesWithTerminationTag` and `WaitForCullToComplete` has
been factored out into a new `CullSummary` class.
@rtyley force-pushed the reenable-asg-scaling-before-final-WaitForStabilization branch from fc4a264 to 0f75e11 on June 4, 2024 11:20
Comment on lines +211 to +215
```scala
WaitForStabilization(
  autoScalingGroup,
  secondsToWait(pkg, target, reporter),
  target.region
)
```
Member Author

@rtyley rtyley Jun 4, 2024


We're retaining this here because it does offer proof that the new boxes can handle our traffic without the old boxes.

In normal deploys it will complete almost instantly; if auto-scaling alarms change the size of the ASG, it could take a little longer to settle.

Jacob Winch pointed out that `WaitForCullToComplete` is a repeating check, and
so needs to get up-to-date `AutoScalingGroupInfo` in order for it to know what
instances currently exist and what their state is! I was missing a call to
`ASG.refresh(asg)`.

This is easy to miss in tasks like `WaitForCullToComplete` that extend
`magenta.tasks.PollingCheck`: the
`magenta.tasks.ASGTask.execute(asg: AutoScalingGroup, ...)` method
puts an `AutoScalingGroup` into scope via its method parameter, and that value is
inevitably out-of-date... could be something to refactor in a later PR!

We also decided that as polling checks inevitably involve network calls, it makes
sense to put an exception-catching guard in the `magenta.tasks.PollingCheck.check()`
method.
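The refresh point generalises: a repeating check must re-fetch remote state inside its loop, not capture a snapshot once outside it. A minimal sketch of that pattern, in Python for illustration (the real polling lives in `magenta.tasks.PollingCheck`, in Scala; `poll_until` and its parameters are hypothetical):

```python
import time

def poll_until(check, refresh, timeout_s, interval_s=5.0):
    """Repeatedly refresh remote state and run `check` against it.

    The bug fixed above, in miniature: if the ASG description were captured
    once before the loop, the check would never observe instances terminating.
    Re-fetching on every pass (ASG.refresh(asg) in the Scala code) fixes it.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = refresh()  # fresh state on every iteration
        if check(state):
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("check did not pass within the timeout")
        time.sleep(interval_s)
```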
@rtyley marked this pull request as ready for review June 5, 2024 09:56
@rtyley requested review from a team as code owners June 5, 2024 09:56
Comment on lines +279 to +283
```scala
val checkResult =
  Try(theCheck).recover { case NonFatal(e) =>
    reporter.info(e.getMessage)
    false
  }.get // '.get' allows a fatal exception, like an OutOfMemoryError, to kill the process, which is what we want!
```
Member Author

@rtyley rtyley Jun 5, 2024


@jacobwinch & I decided that as polling checks invariably involve network calls, it makes sense to put an exception-catching guard around the invocation of `theCheck`. If a non-fatal exception is received, we probably don't want to kill the deploy, just attempt the check again after the next `sleepyTime`.
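A rough Python analogue of that Scala `Try(theCheck).recover { case NonFatal(e) => ... }.get` pattern (Python has no exact `NonFatal`, so this sketch re-raises `MemoryError` explicitly to approximate letting an `OutOfMemoryError` escape; the function name is made up):

```python
def guarded_check(the_check, report):
    """Run one polling-check iteration, treating transient failures as
    'not yet ready' rather than deploy-killing errors."""
    try:
        return the_check()
    except MemoryError:
        # Fatal, like an OutOfMemoryError escaping Scala's NonFatal:
        # let it propagate and kill the process.
        raise
    except Exception as e:
        # Non-fatal (e.g. a flaky network call): log it and report the
        # check as failed, so the poll loop retries after its next sleep.
        report(str(e))
        return False
```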

@rtyley changed the title from 'Re-enable ASG scaling before final stabilisation' to 'autoscaling deploy: re-enable ASG scaling before final stabilisation check' Jun 5, 2024
Contributor

@jacobwinch left a comment


This looks good to me 👍

I ran one additional test to confirm that automatically scaling up while Riff-Raff is running WaitForCullToComplete works successfully.

Test Setup

  1. Redeployed this branch to Riff-Raff CODE
  2. Started sending some artificial traffic to a test service, with the aim of forcing the scale up action to coincide with the Riff-Raff deployment
  3. Started a deployment of the test service via Riff-Raff CODE

Observations

12:16:56 - Deployment starts: [12:16:56] deploy for devx::cdk-playground (build 801) to stage PROD

12:17:05 - Scaling actions are disabled by Riff-Raff: [12:17:05] task SuspendAlarmNotifications Suspending Alarm Notifications - playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB will no longer scale on any configured alarms

12:18:41¹ - Alarm used for automatically scaling up enters alarm state:

(screenshot: scale-up alarm in alarm state)

Automatic scale up does not happen yet because it is still blocked by Riff-Raff.

12:19:39 - Scaling actions are re-enabled by Riff-Raff: [12:19:39] task ResumeAlarmNotifications Resuming Alarm Notifications - playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB will scale on any configured alarms

12:19:40 - New WaitForCullToComplete task starts running: [12:19:40] task WaitForCullToComplete Check that all instances tagged for termination in playground-PROD-cdk-playground-AutoScalingGroupCdkplaygroundASGD6E49F0F-9KOUOLFGBKZB have been terminated

12:19:45 - The ASG scales up automatically:

(screenshot: the ASG scaling up automatically)

12:21:41 - New WaitForCullToComplete task (and overall deployment) completes:

(screenshot: WaitForCullToComplete and the overall deployment completing)

Footnotes

  1. The screenshot here shows a UTC timestamp; I've converted to BST to make it easier to follow the timeline.

@rtyley
Copy link
Member Author

rtyley commented Jun 5, 2024

I ran one additional test to confirm that automatically scaling up while Riff-Raff is running WaitForCullToComplete works successfully.

Magnificent! Great test, and gratifying to see it work correctly! I shall now merge 👍

@rtyley merged commit 295ec4b into main Jun 5, 2024
1 check passed
@rtyley deleted the reenable-asg-scaling-before-final-WaitForStabilization branch June 5, 2024 12:44