
feat(experimental-ec2-pattern): Pattern to deploy ASGs updates via CloudFormation (AutoScalingReplacingUpdate) #2395

Closed
wants to merge 5 commits into from

Conversation

akash1810
Member

@akash1810 akash1810 commented Jul 29, 2024

Note

This is an alternative take on #2379, differing mostly in the creation of an experimental pattern. The intention is to provide a strong signal that this feature is not yet considered ready for use on high-profile services.

Another point of difference is usage tracking via the Metadata Aspect rather than tags. The Aspect updates the stack's Metadata section with a list of GuCDK constructs being used.

What does this change?

This change adds a new experimental pattern GuEc2AppExperimental for provisioning an EC2-based service deployed entirely via CloudFormation updates. This is achieved by setting the UpdatePolicy attribute of the ASG, specifically AutoScalingReplacingUpdate.

With an AutoScalingReplacingUpdate policy, a CloudFormation update will create a second ASG, and:

After successfully creating the new Auto Scaling group, CloudFormation deletes the old Auto Scaling group during the cleanup process.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-replacingupdate
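
To make this concrete in plain AWS CDK terms, here is a minimal sketch (not the pattern's actual implementation; the VPC, instance type, AMI and capacities are placeholder assumptions):

```ts
import { Duration, Stack } from "aws-cdk-lib";
import * as autoscaling from "aws-cdk-lib/aws-autoscaling";
import * as ec2 from "aws-cdk-lib/aws-ec2";

declare const stack: Stack;
declare const vpc: ec2.IVpc; // placeholder

new autoscaling.AutoScalingGroup(stack, "MyAsg", {
  vpc,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.T4G, ec2.InstanceSize.MICRO),
  machineImage: ec2.MachineImage.latestAmazonLinux2023(),
  minCapacity: 3,
  maxCapacity: 10,

  // Replace the whole ASG on update (AutoScalingReplacingUpdate), rather than rolling instances in place.
  updatePolicy: autoscaling.UpdatePolicy.replacingUpdate(),

  // Keep the update IN_PROGRESS until every instance has signalled success (via cfn-signal),
  // or fail the update after the timeout.
  signals: autoscaling.Signals.waitForAll({ timeout: Duration.minutes(5) }),
});
```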

The CloudFormation update remains IN_PROGRESS until the instances in the ASG report healthy. By default, this is defined as instance health (did the instance boot correctly?), whereas we want an instance to be healthy only once the target group can see it. For this reason, we implement some custom logic in the user-data.

The resulting user-data works like this:

  1. Perform the user-data commands provided by the service (i.e. download the artifact, and run it).
  2. Start polling the target group for the health of the current instance.
  3. Once the instance is healthy, send a signal to indicate success.

On success, the CloudFormation update status moves to UPDATE_COMPLETE, and the deployment succeeds. The final state is one ASG, running the new version of the service.

If the target group is unable to report the instance as healthy, then a signal is sent indicating this. The CloudFormation update status moves to UPDATE_FAILED, and the changes are rolled back. The final state is one ASG, running the current version of the service.
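
As an illustration of steps 2 and 3 of the user-data flow above, here is a hedged sketch of the kind of commands appended to the user-data (the target group ARN, port, region and IMDS call are placeholders; the pattern wires the real values up itself):

```ts
import { UserData } from "aws-cdk-lib/aws-ec2";

// Placeholder values for illustration only.
const targetGroupArn = "arn:aws:elasticloadbalancing:eu-west-1:000000000000:targetgroup/my-app/abc123";
const applicationPort = 9000;
const region = "eu-west-1";

const userData = UserData.forLinux();

// 1. The service's own commands (download the artifact, run it) are added before this point.

// 2. Poll the target group until it reports this instance as healthy.
userData.addCommands(
  `INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)`, // IMDSv1 for brevity
  `STATE=unknown`,
  `while [ "$STATE" != "healthy" ]; do`,
  `  sleep 5`,
  `  STATE=$(aws elbv2 describe-target-health \\`,
  `    --target-group-arn ${targetGroupArn} \\`,
  `    --region ${region} \\`,
  `    --targets Id=$INSTANCE_ID,Port=${applicationPort} \\`,
  `    --query "TargetHealthDescriptions[0].TargetHealth.State" \\`,
  `    --output text)`,
  `done`,
);

// 3. On exit (success or failure), signal CloudFormation with the script's exit code.
userData.addOnExitCommands(
  `cfn-signal --stack <stack-id> --resource <asg-logical-id> --region ${region} --exit-code $exitCode`,
);
```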

Updating a service via this mechanism differs from Riff-Raff's autoscaling deployment in one key way: the current ASG can continue to scale if needed. In Riff-Raff's process, scaling alarms are disabled before doubling the desired capacity of the ASG, and enabled again once the capacity has been halved (technically after the halving request has been made).

The GuEc2AppExperimental pattern is not compatible with Riff-Raff's autoscaling deployment type either. As a pre-flight check, Riff-Raff checks there is exactly one ASG matching the tagging specification. As we're creating a second ASG, this check will fail. Consequently, the riff-raff.yaml generator has been updated to omit the autoscaling deployment.

Why experimental?

There are a few requirements for this approach:

  • The AWS CLI and cfn-signal binaries need to be available on the PATH (this is added via the AMIgo role aws-tools)
  • The service's artifact should include a build number, as this is the most reliable way to create a difference with the running CloudFormation template, and thus trigger an update.

We're not (yet) validating that these are met, as it's tricky to do so.

There are also a few unknowns about this approach:

  • What is the best signal timeout value?
  • If the current ASG scales mid-deployment, how does this impact the new ASG? Is the desired capacity mirrored?

For this reason, the rollout plan looks something like this:

  1. Dogfood on some DevX services.
  2. Test on a service within the department.
  3. Move the pattern to stable, as a breaking change, with communication on the migration path, etc.

How to test

For a real-world test, I've been using the pattern (and the update to the generated riff-raff.yaml file) within guardian/cdk-playground - guardian/cdk-playground#496.

Additional testing has been done via unit tests.

How can we measure success?

ASGs can scale during deployment.

Have we considered potential risks?

As noted above, there are some unknowns. Any service that starts using this pattern in its experimental form implicitly accepts the risk of these unknowns.

Checklist

  • I have listed any breaking changes, along with a migration path 1
  • I have updated the documentation as required for the described changes 2

Footnotes

  1. Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs.

  2. If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid?


changeset-bot bot commented Jul 29, 2024

⚠️ No Changeset found

Latest commit: 970acd1

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


minSuccessfulInstancesPercent: 100,
},
resourceSignal: {
count: minimumInstances,
Member Author

CloudFormation will only think the ASG is healthy once it has received a signal from each instance. If an instance sends multiple signals, it is still counted as one.

See https://github.com/guardian/amigo/tree/main/roles/aws-tools.
*/
userData.addCommands(
`# ${GuEc2AppExperimental.name} UserData Start`,
Member Author

This (bash) comment attempts to demarcate the consumer commands (provided when instantiating the object) from the commands added by GuCDK.

Member Author

@akash1810 akash1810 Jul 30, 2024

A practical example of the full diff to the user-data can be seen here guardian/cdk-playground#496 (comment).

`# ${GuEc2AppExperimental.name} UserData End`,
);

userData.addOnExitCommands(
Member Author

This creates a trap, meaning it'll run on both the happy and unhappy paths.

cfn-signal --stack ${stackId} \
--resource ${cfnAutoScalingGroup.logicalId} \
--region ${region} \
--exit-code $exitCode || echo 'Failed to send Cloudformation Signal'
Member Author

The $exitCode variable is provided by AWS CDK.
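
For anyone unfamiliar with the CDK API being used here, a small standalone sketch (the echo commands are placeholders) showing where that variable comes from:

```ts
import { UserData } from "aws-cdk-lib/aws-ec2";

const userData = UserData.forLinux();
userData.addCommands(`echo "doing the main work"`);
userData.addOnExitCommands(`echo "script finished with exit code $exitCode"`);

// As I understand it, render() wraps the on-exit commands in a function registered with
// `trap ... EXIT`, which sets `exitCode=$?` first; that is the variable the cfn-signal
// call above relies on. The exact rendered text may vary between CDK versions.
console.log(userData.render());
```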

Comment on lines +257 to +263
// ASGs without an UpdatePolicy can be deployed via Riff-Raff's (legacy) `autoscaling` deployment type.
// ASGs with an UpdatePolicy are updated via Riff-Raff's `cloud-formation` deployment type.
const legacyAutoscalingGroups = autoscalingGroups.filter((asg) => {
const { cfnOptions } = asg.node.defaultChild as CfnAutoScalingGroup;
const { updatePolicy } = cfnOptions;
return updatePolicy?.autoScalingReplacingUpdate === undefined;
});
Member Author

Technically this should be a change on its own, as a fix to the changes introduced in #2369.

import type { GuEc2AppProps } from "../../patterns";
import { GuEc2App } from "../../patterns";

export interface GuEc2AppExperimentalProps extends Omit<GuEc2AppProps, "updatePolicy"> {}
Member Author

The props of this pattern match those of GuEc2App minus the updatePolicy, which is fixed as UpdatePolicy.replacingUpdate().
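
Roughly how that might look as a sketch, assuming GuEc2App's constructor takes a GuStack scope and a props object, and that the import paths below are right (the real pattern does considerably more, e.g. user-data and signal wiring):

```ts
import { UpdatePolicy } from "aws-cdk-lib/aws-autoscaling";
import type { GuStack } from "../../constructs/core"; // path assumed
import { GuEc2App } from "../../patterns";
import type { GuEc2AppProps } from "../../patterns";

export interface GuEc2AppExperimentalProps extends Omit<GuEc2AppProps, "updatePolicy"> {}

export class GuEc2AppExperimental extends GuEc2App {
  constructor(scope: GuStack, props: GuEc2AppExperimentalProps) {
    // Fix the update policy; consumers cannot override it.
    super(scope, { ...props, updatePolicy: UpdatePolicy.replacingUpdate() });
  }
}
```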

Comment on lines +115 to +120
// TODO are these sensible values?
const signalTimeoutSeconds = Math.max(
targetGroup.healthCheck.timeout?.toSeconds() ?? 0,
cfnAutoScalingGroup.healthCheckGracePeriod ?? 0,
Duration.minutes(5).toSeconds(),
);
Member Author

IIUC this is the duration CloudFormation will wait for the signal (positive or negative), before aborting the update.

Contributor

@jacobwinch jacobwinch Jul 31, 2024

I think using the ASG health check grace period or 5 minutes (whichever is higher) makes sense as a starting point.

I think targetGroup.healthCheck.timeout maps to HealthCheckTimeoutSeconds. This is the timeout for each healthcheck request made by the ALB and has a max timeout of 120 seconds, so I think we could safely drop this one.
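
If that suggestion were adopted, the calculation might simplify to something like this (sketch only):

```ts
import { Duration } from "aws-cdk-lib";
import type { CfnAutoScalingGroup } from "aws-cdk-lib/aws-autoscaling";

declare const cfnAutoScalingGroup: CfnAutoScalingGroup;

// Wait for whichever is longer: the ASG's health check grace period, or 5 minutes.
// targetGroup.healthCheck.timeout is dropped, as it is capped at 120 seconds anyway.
const signalTimeoutSeconds = Math.max(
  cfnAutoScalingGroup.healthCheckGracePeriod ?? 0,
  Duration.minutes(5).toSeconds(),
);
```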

* NOTE: This pattern:
* - Is NOT compatible with the "autoscaling" Riff-Raff deployment type.
* - Requires your application to include a build number in its filename.
*   This value will change across builds, and therefore create a CloudFormation template difference to be deployed.


If the AMI also gets updated by CloudFormation, would that be a separate CloudFormation change, or would we try and do it together in one update? I think the reason this is important is that CloudFormation will roll back failed changes, and we wouldn't want it to roll back halfway, to the old ASG with the new AMI.

Member Author

Currently, Riff-Raff attempts to update the AMI on each deployment. That is, the changes are bundled together into one.

`aws` is available via AMIgo baked AMIs.
See https://github.com/guardian/amigo/tree/main/roles/aws-tools.
*/
userData.addCommands(


I'd love us to move away from teams having to write the boilerplate of fetching their application from S3. We don't necessarily have to think about this at the same time, but if we're already changing artifacts around to include the build number, is this an opportunity to remove this as well, with only one breaking change?

Ideally, user-data should be used by teams for application-specific work if required, and we'd have everything needed for deploying and running a "normal" application in place.

Again, no need to tackle everything at once, so if this is non-trivial then for sure let's consider it later.

--region ${region} \
--targets Id=$INSTANCE_ID,Port=${applicationPort} \
--query "TargetHealthDescriptions[0].TargetHealth.State")
done


Should we add a log line here to say words to the effect of "yeah we're all good, I got healthy back from the target health call"?
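
Something like this could be appended after the loop (sketch only; `userData` is the instance the pattern already builds):

```ts
import type { UserData } from "aws-cdk-lib/aws-ec2";

declare const userData: UserData; // the user-data the pattern is already assembling

userData.addCommands(
  `echo "Instance $INSTANCE_ID is reported healthy by the target group; sending success signal"`,
);
```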

@akash1810 akash1810 marked this pull request as draft August 6, 2024 22:45
@akash1810 akash1810 changed the title feat(experimental-ec2-pattern): Add pattern to deploy ASGs updates via CloudFormation feat(experimental-ec2-pattern): Pattern to deploy ASGs updates via CloudFormation (AutoScalingReplacingUpdate) Aug 20, 2024
ASGs with an update policy will get deployed via CloudFormation,
instead of Riff-Raff's `autoscaling` deployment type.
…arkers

This should make it easier to parse a user-data string if one is ever debugging.
@akash1810
Member Author

Here is a summary of observations made. I think the negative changes this mechanism creates for incident triaging outweigh the benefits.

Launch templates

Launch templates are a prerequisite for this solution; we still have a number of services using launch configurations.

Scaling

If, during a deployment, the current ASG scales up, the second ASG will always be under-provisioned.

Let’s say we have this timeline:

  1. Within a CloudFormation stack, an ASG has min = 3, max = 10, desired = 3
  2. CloudFormation update starts, and a new ASG is created. This ASG has capacity 3/10/3 too.
  3. The first ASG scales up, doubling capacity; it is now 3/10/6.
  4. CloudFormation removes the first ASG, leaving one ASG at 3/10/3.

We’re left with a single ASG with capacity 3/10/3. This is under-provisioned (assuming we’re still in an alarm state).

Indeed, this is confirmed by AWS support:

I think if you want to maintain 6 after update you need to specify the desired capacity to 6 for the update and the old ASG will remove its own 3 and leave you with 6.

AWS recommends doing this by making the scale-out alarm more sensitive.

During the meeting we touched on the bases of Auto scaling behavior during "AutoScalingReplacingUpdate", when an alarm is triggered for the old ASG to add capacity while the new ASG maintains the previous desired capacity. I mentioned to you that the old instances should continue to serve all flights requests as per the connection draining, while new connections will be sent to the new instance. Then, if during this time the aggregated CPUUtilization seem to be spiking to the threshold, the alarm attached to the ASG will react to the change, to either add more instances. However, what determines how quick response of the alarm is the number of datapoints and the period of evaluation.

Activity history

With new ASGs created, we lose the ability to use the ASG “Activity History” to, for example, understand why an instance was unexpectedly terminated.

AWS have suggested two alternatives:

  1. Use CloudTrail logs. We’d need to know the ARN of the ASG(s), then join the CloudTrail events.
  2. Use the CLI. Using the describe-scaling-activities command and flag --include-deleted-groups, we’re able to view activity history. We’d need to know the ARN of the ASG(s).
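
For option 2, a hedged sketch of the equivalent call via the AWS SDK for JavaScript v3 (the region and ASG name are placeholders):

```ts
import {
  AutoScalingClient,
  DescribeScalingActivitiesCommand,
} from "@aws-sdk/client-auto-scaling";

const client = new AutoScalingClient({ region: "eu-west-1" });

// IncludeDeletedGroups surfaces activity for ASGs that have since been deleted,
// e.g. the ASGs removed by previous replacing updates.
const { Activities } = await client.send(
  new DescribeScalingActivitiesCommand({
    AutoScalingGroupName: "my-app-ASG-ABC123", // placeholder
    IncludeDeletedGroups: true,
  }),
);

console.log(Activities);
```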

CloudWatch Metrics

Each ASG records metrics to CloudWatch, keyed by the ASG's name. With multiple ASGs, these metrics are split across those names. To get a continuous timeline, we have to manually join the metrics; how to do this is, as yet, unanswered.

Security SLAs

A number of our security SLAs come into effect based on the age of an ASG. With each deployment creating a new ASG, these metrics will become nonsensical.

⚠️ We could create some tooling to improve the DX of observing the metrics, activity, etc. of multiple ASGs for a service. We should consider the cost of doing this beforehand though.
