
feat(experimental-ec2-pattern): Pattern to deploy ASGs updates via CloudFormation (AutoScalingReplacingUpdate) #2395

Closed
wants to merge 5 commits into from

Conversation

akash1810
Member

@akash1810 akash1810 commented Jul 29, 2024

Note

This is an alternative take on #2379, differing mostly in the creation of an experimental pattern. The intention is to provide a strong signal that this feature is not yet considered ready for use on high-profile services.

Another point of difference is usage tracking via the Metadata Aspect rather than tags. The Aspect updates the stack's Metadata section with a list of GuCDK constructs being used.

What does this change?

This change adds a new experimental pattern GuEc2AppExperimental for provisioning an EC2-based service deployed entirely via CloudFormation updates. This is achieved by setting the UpdatePolicy attribute of the ASG, specifically AutoScalingReplacingUpdate.

With an AutoScalingReplacingUpdate policy, a CloudFormation update will create a second ASG, and:

After successfully creating the new Auto Scaling group, CloudFormation deletes the old Auto Scaling group during the cleanup process.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-replacingupdate
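
To make this concrete in plain AWS CDK terms, here is a minimal sketch (not the pattern's actual implementation; the VPC, instance type, AMI and capacities are placeholder assumptions):

```ts
import { Duration, Stack } from "aws-cdk-lib";
import * as autoscaling from "aws-cdk-lib/aws-autoscaling";
import * as ec2 from "aws-cdk-lib/aws-ec2";

declare const stack: Stack;
declare const vpc: ec2.IVpc; // placeholder

new autoscaling.AutoScalingGroup(stack, "MyAsg", {
  vpc,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.MICRO),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 3,
  maxCapacity: 10,

  // Replace the whole ASG on update (AutoScalingReplacingUpdate), rather than rolling instances in place.
  updatePolicy: autoscaling.UpdatePolicy.replacingUpdate(),

  // Keep the update IN_PROGRESS until every instance has signalled success (via cfn-signal),
  // or fail the update after the timeout.
  signals: autoscaling.Signals.waitForAll({ timeout: Duration.minutes(5) }),
});
```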

The CloudFormation update remains IN_PROGRESS until the instances in the ASG report healthy. By default, this is defined as instance health (did the instance boot correctly?), whereas we want an instance to be healthy only once the target group can see it. For this reason, we implement some custom logic in the user-data.

The resulting user-data works like this:

  1. Perform the user-data commands provided by the service (i.e. download the artifact, and run it).
  2. Start polling the target group for the health of the current instance.
  3. Once the instance is healthy, send a signal to indicate success.

On success, the CloudFormation update status moves to UPDATE_COMPLETE, and the deployment succeeds. The final state is one ASG, running the new version of the service.

If the target group is unable to report the instance as healthy, then a signal is sent indicating this. The CloudFormation update status moves to UPDATE_FAILED, and the changes are rolled back. The final state is one ASG, running the current version of the service.
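
As an illustration of steps 2 and 3 of the user-data flow above, here is a hedged sketch of the kind of commands appended to the user-data (the target group ARN, port, region and IMDS call are placeholders; the pattern wires the real values up itself):

```ts
import { UserData } from "aws-cdk-lib/aws-ec2";

// Placeholder values for illustration only.
const targetGroupArn = "arn:aws:elasticloadbalancing:eu-west-1:000000000000:targetgroup/my-app/abc123";
const applicationPort = 9000;
const region = "eu-west-1";

const userData = UserData.forLinux();

// 1. The service's own commands (download the artifact, run it) are added before this point.

// 2. Poll the target group until it reports this instance as healthy.
userData.addCommands(
  `INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)`, // IMDSv1 for brevity
  `STATE=unknown`,
  `while [ "$STATE" != "healthy" ]; do`,
  `  sleep 5`,
  `  STATE=$(aws elbv2 describe-target-health \\`,
  `    --target-group-arn ${targetGroupArn} \\`,
  `    --region ${region} \\`,
  `    --targets Id=$INSTANCE_ID,Port=${applicationPort} \\`,
  `    --query "TargetHealthDescriptions[0].TargetHealth.State" \\`,
  `    --output text)`,
  `done`,
);

// 3. On exit (success or failure), signal CloudFormation with the script's exit code.
userData.addOnExitCommands(
  `cfn-signal --stack <stack-id> --resource <asg-logical-id> --region ${region} --exit-code $exitCode`,
);
```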

Updating a service via this mechanism differs from Riff-Raff's autoscaling deployment in one key way: the current ASG can continue to scale if needed. In Riff-Raff's process, scaling alarms are disabled before doubling the desired capacity of the ASG, and enabled again once the capacity has been halved (technically after the halving request has been made).

The GuEc2AppExperimental pattern is not compatible with Riff-Raff's autoscaling deployment type either. As a pre-flight check, Riff-Raff checks there is exactly one ASG matching the tagging specification. As we're creating a second ASG, this check will fail. Consequently, the riff-raff.yaml generator has been updated to omit the autoscaling deployment.

Why experimental?

There are a few requirements for this approach:

  • The AWS CLI and cfn-signal binaries need to be available on the PATH (this is added via the AMIgo role aws-tools)
  • The service's artifact should include a build number, as this is the most reliable way to create a difference with the running CloudFormation template, and thus trigger an update.

We're not (yet) validating that these are met, as it's tricky to do so.

There are also a few unknowns about this approach:

  • What is the best signal timeout value?
  • If the current ASG scales mid-deployment, how does this impact the new ASG? Is the desired capacity mirrored?

For this reason, the rollout plan looks something like this:

  1. Dogfood on some DevX services.
  2. Test on a service within the department.
  3. Move the pattern to stable, as a breaking change, with communication on the migration path, etc.

How to test

For a real-world test, I've been using the pattern (and the update to the generated riff-raff.yaml file) within guardian/cdk-playground - guardian/cdk-playground#496.

Additional testing has been done via unit tests.

How can we measure success?

ASGs can scale during deployment.

Have we considered potential risks?

As noted above, there are some unknowns. Any service that starts using this pattern in its experimental form implicitly accepts the risk of these unknowns.

Checklist

  • I have listed any breaking changes, along with a migration path 1
  • I have updated the documentation as required for the described changes 2

Footnotes

  1. Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs.

  2. If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid?


changeset-bot bot commented Jul 29, 2024

⚠️ No Changeset found

Latest commit: 970acd1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


minSuccessfulInstancesPercent: 100,
},
resourceSignal: {
count: minimumInstances,
Member Author

CloudFormation will only think the ASG is healthy once it has received a signal from each instance. If an instance sends multiple signals, it is still counted as one.

See https://github.com/guardian/amigo/tree/main/roles/aws-tools.
*/
userData.addCommands(
`# ${GuEc2AppExperimental.name} UserData Start`,
Member Author

This (bash) comment attempts to demarcate the consumer commands (provided when instantiating the object) from the commands added by GuCDK.

Member Author

@akash1810 akash1810 Jul 30, 2024

A practical example of the full diff to the user-data can be seen here guardian/cdk-playground#496 (comment).

`# ${GuEc2AppExperimental.name} UserData End`,
);

userData.addOnExitCommands(
Member Author

This creates a trap, meaning it'll run on both the happy and unhappy paths.

cfn-signal --stack ${stackId} \
--resource ${cfnAutoScalingGroup.logicalId} \
--region ${region} \
--exit-code $exitCode || echo 'Failed to send Cloudformation Signal'
Member Author

The $exitCode variable is provided by AWS CDK.
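
For anyone unfamiliar with the CDK API being used here, a small standalone sketch (the echo commands are placeholders) showing where that variable comes from:

```ts
import { UserData } from "aws-cdk-lib/aws-ec2";

const userData = UserData.forLinux();
userData.addCommands(`echo "doing the main work"`);
userData.addOnExitCommands(`echo "script finished with exit code $exitCode"`);

// As I understand it, render() wraps the on-exit commands in a function registered with
// `trap ... EXIT`, which sets `exitCode=$?` first; that is the variable the cfn-signal
// call above relies on. The exact rendered text may vary between CDK versions.
console.log(userData.render());
```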

Comment on lines +257 to +263
// ASGs without an UpdatePolicy can be deployed via Riff-Raff's (legacy) `autoscaling` deployment type.
// ASGs with an UpdatePolicy are updated via Riff-Raff's `cloud-formation` deployment type.
const legacyAutoscalingGroups = autoscalingGroups.filter((asg) => {
const { cfnOptions } = asg.node.defaultChild as CfnAutoScalingGroup;
const { updatePolicy } = cfnOptions;
return updatePolicy?.autoScalingReplacingUpdate === undefined;
});
Member Author

Technically this should be a change on its own, as a fix to the changes introduced in #2369.

import type { GuEc2AppProps } from "../../patterns";
import { GuEc2App } from "../../patterns";

export interface GuEc2AppExperimentalProps extends Omit<GuEc2AppProps, "updatePolicy"> {}
Member Author

The props of this pattern match those of GuEc2App minus the updatePolicy, which is fixed as UpdatePolicy.replacingUpdate().
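
Roughly how that might look as a sketch, assuming GuEc2App's constructor takes a GuStack scope and a props object, and that the import paths below are right (the real pattern does considerably more, e.g. user-data and signal wiring):

```ts
import { UpdatePolicy } from "aws-cdk-lib/aws-autoscaling";
import type { GuStack } from "../../constructs/core"; // path assumed
import { GuEc2App } from "../../patterns";
import type { GuEc2AppProps } from "../../patterns";

export interface GuEc2AppExperimentalProps extends Omit<GuEc2AppProps, "updatePolicy"> {}

export class GuEc2AppExperimental extends GuEc2App {
  constructor(scope: GuStack, props: GuEc2AppExperimentalProps) {
    // Fix the update policy; consumers cannot override it.
    super(scope, { ...props, updatePolicy: UpdatePolicy.replacingUpdate() });
  }
}
```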

Comment on lines +115 to +120
// TODO are these sensible values?
const signalTimeoutSeconds = Math.max(
targetGroup.healthCheck.timeout?.toSeconds() ?? 0,
cfnAutoScalingGroup.healthCheckGracePeriod ?? 0,
Duration.minutes(5).toSeconds(),
);
Member Author

IIUC this is the duration CloudFormation will wait for the signal (positive or negative), before aborting the update.

Contributor

@jacobwinch jacobwinch Jul 31, 2024

I think using the ASG health check grace period or 5 minutes (whichever is higher) makes sense as a starting point.

I think targetGroup.healthCheck.timeout maps to HealthCheckTimeoutSeconds. This is the timeout for each healthcheck request made by the ALB and has a max timeout of 120 seconds, so I think we could safely drop this one.
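
If that suggestion were adopted, the calculation might simplify to something like this (sketch only):

```ts
import { Duration } from "aws-cdk-lib";
import type { CfnAutoScalingGroup } from "aws-cdk-lib/aws-autoscaling";

declare const cfnAutoScalingGroup: CfnAutoScalingGroup;

// Wait for whichever is longer: the ASG's health check grace period, or 5 minutes.
// targetGroup.healthCheck.timeout is dropped, as it is capped at 120 seconds anyway.
const signalTimeoutSeconds = Math.max(
  cfnAutoScalingGroup.healthCheckGracePeriod ?? 0,
  Duration.minutes(5).toSeconds(),
);
```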

* NOTE: This pattern:
* - Is NOT compatible with the "autoscaling" Riff-Raff deployment type.
* - Requires your application to include a build number in its filename.
*   This value will change across builds, and therefore create a CloudFormation template difference to be deployed.


If the AMI also gets updated by CloudFormation, would that be a separate CloudFormation change, or would we try and do it together in one update? I think the reason this is important is that CloudFormation will roll back failed changes, and we wouldn't want it to roll back halfway, to the old ASG with the new AMI.

Member Author

Currently, Riff-Raff attempts to update the AMI on each deployment. That is, the changes are bundled together into one.

`aws` is available via AMIgo baked AMIs.
See https://github.com/guardian/amigo/tree/main/roles/aws-tools.
*/
userData.addCommands(


I'd love us to move away from teams having to write the boilerplate of fetching their application from S3. We don't necessarily have to think about this at the same time, but if we're already changing artifacts around to include the build number, is this an opportunity to remove this as well, with only one breaking change?

Ideally, user-data should be used by teams for application-specific work if required, and we'd have everything needed for deploying and running a "normal" application in place.

Again, no need to tackle everything at once, so if this is non-trivial then for sure let's consider it later.

--region ${region} \
--targets Id=$INSTANCE_ID,Port=${applicationPort} \
--query "TargetHealthDescriptions[0].TargetHealth.State")
done


Should we add a log line here to say words to the effect of "yeah we're all good, I got healthy back from the target health call"?
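
Something like this could be appended after the loop (sketch only; `userData` is the instance the pattern already builds):

```ts
import type { UserData } from "aws-cdk-lib/aws-ec2";

declare const userData: UserData; // the user-data the pattern is already assembling

userData.addCommands(
  `echo "Instance $INSTANCE_ID is reported healthy by the target group; sending success signal"`,
);
```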

@akash1810 akash1810 marked this pull request as draft August 6, 2024 22:45
@akash1810 akash1810 changed the title feat(experimental-ec2-pattern): Add pattern to deploy ASGs updates via CloudFormation feat(experimental-ec2-pattern): Pattern to deploy ASGs updates via CloudFormation (AutoScalingReplacingUpdate) Aug 20, 2024
ASGs with an update policy will get deployed via CloudFormation,
instead of Riff-Raff's `autoscaling` deployment type.
…arkers

This should make it easier to parse a user-data string if one is ever debugging.
@akash1810
Member Author

Here is a summary of observations made. I think the negative changes this mechanism creates for incident triaging outweigh the benefits.

Launch templates

Launch templates are a prerequisite for this solution; we still have a number of services using launch configurations.

Scaling

If, during a deployment, the current ASG scales up, the second ASG will always be under-provisioned.

Let’s say we have this timeline:

  1. Within a CloudFormation stack, an ASG has min = 3, max = 10, desired = 3
  2. CloudFormation update starts, and a new ASG is created. This ASG has capacity 3/10/3 too.
  3. The first ASG scales up, doubling capacity; it is now 3/10/6.
  4. CloudFormation removes the first ASG, leaving one ASG at 3/10/3.

We’re left with a single ASG with capacity 3/10/3. This is under-provisioned (assuming we’re still in an alarm state).

Indeed, this is confirmed by AWS support:

I think if you want to maintain 6 after update you need to specify the desired capacity to 6 for the update and the old ASG will remove its own 3 and leave you with 6.

AWS recommends doing this by making the scale-out alarm more sensitive.

During the meeting we touched on the bases of Auto scaling behavior during "AutoScalingReplacingUpdate", when an alarm is triggered for the old ASG to add capacity while the new ASG maintains the previous desired capacity. I mentioned to you that the old instances should continue to serve all flights requests as per the connection draining, while new connections will be sent to the new instance. Then, if during this time the aggregated CPUUtilization seem to be spiking to the threshold, the alarm attached to the ASG will react to the change, to either add more instances. However, what determines how quick response of the alarm is the number of datapoints and the period of evaluation.

Activity history

With new ASGs created, we lose the ability to use the ASG “Activity History” to, for example, understand why an instance was unexpectedly terminated.

AWS have suggested two alternatives:

  1. Use CloudTrail logs. We’d need to know the ARN of the ASG(s), then join the CloudTrail events.
  2. Use the CLI. Using the describe-scaling-activities command and flag --include-deleted-groups, we’re able to view activity history. We’d need to know the ARN of the ASG(s).
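
For option 2, a hedged sketch of the equivalent call via the AWS SDK for JavaScript v3 (the region and ASG name are placeholders):

```ts
import {
  AutoScalingClient,
  DescribeScalingActivitiesCommand,
} from "@aws-sdk/client-auto-scaling";

const client = new AutoScalingClient({ region: "eu-west-1" });

// IncludeDeletedGroups surfaces activity for ASGs that have since been deleted,
// e.g. the ASGs removed by previous replacing updates.
const { Activities } = await client.send(
  new DescribeScalingActivitiesCommand({
    AutoScalingGroupName: "my-app-ASG-ABC123", // placeholder
    IncludeDeletedGroups: true,
  }),
);

console.log(Activities);
```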

CloudWatch Metrics

Each ASG records metrics to CloudWatch, keyed by the ASG's name. With multiple ASGs, these metrics are split across those names. To get a continuous timeline, we have to manually join the metrics; how to do this is, as yet, unanswered.

Security SLAs

A number of our security SLAs come into effect based on the age of an ASG. With each deployment creating a new ASG, these metrics will become nonsensical.

⚠️ We could create some tooling to improve the DX of observing the metrics, activity, etc. of multiple ASGs for a service. We should consider the cost of doing this beforehand though.
