-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(experimental-ec2-pattern): Pattern to deploy ASGs updates via CloudFormation (AutoScalingReplacingUpdate
)
#2395
Conversation
|
minSuccessfulInstancesPercent: 100, | ||
}, | ||
resourceSignal: { | ||
count: minimumInstances, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CloudFormation will only think the ASG is healthy once it has received a signal from each instance. If an instance sends multiple signals, it is still counted as one.
See https://github.com/guardian/amigo/tree/main/roles/aws-tools. | ||
*/ | ||
userData.addCommands( | ||
`# ${GuEc2AppExperimental.name} UserData Start`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This (bash) comment attempts to demarcate the consumer commands, provided when instantiating the object, and the commands from GuCDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A practical example of the full diff to the user-data can be seen here guardian/cdk-playground#496 (comment).
`# ${GuEc2AppExperimental.name} UserData End`, | ||
); | ||
|
||
userData.addOnExitCommands( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This creates a trap, meaning it'll run on the happy, and unhappy paths.
cfn-signal --stack ${stackId} \ | ||
--resource ${cfnAutoScalingGroup.logicalId} \ | ||
--region ${region} \ | ||
--exit-code $exitCode || echo 'Failed to send Cloudformation Signal' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The $exitCode
variable is provided by AWS CDK.
// ASGs without an UpdatePolicy can be deployed via Riff-Raff's (legacy) `autoscaling` deployment type. | ||
// ASGs with an UpdatePolicy are updated via Riff-Raff's `cloud-formation` deployment type. | ||
const legacyAutoscalingGroups = autoscalingGroups.filter((asg) => { | ||
const { cfnOptions } = asg.node.defaultChild as CfnAutoScalingGroup; | ||
const { updatePolicy } = cfnOptions; | ||
return updatePolicy?.autoScalingReplacingUpdate === undefined; | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically this should be a change on it's own, as a fix to the changes introduced in #2369.
import type { GuEc2AppProps } from "../../patterns"; | ||
import { GuEc2App } from "../../patterns"; | ||
|
||
export interface GuEc2AppExperimentalProps extends Omit<GuEc2AppProps, "updatePolicy"> {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The props of this pattern match those of GuEc2App
minus the updatePolicy
, which is fixed as UpdatePolicy.replacingUpdate()
.
// TODO are these sensible values? | ||
const signalTimeoutSeconds = Math.max( | ||
targetGroup.healthCheck.timeout?.toSeconds() ?? 0, | ||
cfnAutoScalingGroup.healthCheckGracePeriod ?? 0, | ||
Duration.minutes(5).toSeconds(), | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC this is the duration CloudFormation will wait for the signal (positive or negative), before aborting the update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using the ASG health check grace period or 5 minutes (whichever is higher) makes sense as a starting point.
I think targetGroup.healthCheck.timeout
maps to HealthCheckTimeoutSeconds
. This is the timeout for each healthcheck request made by the ALB and has a max timeout of 120 seconds, so I think we could safely drop this one.
* NOTE: This pattern: | ||
* - Is NOT compatible with the "autoscaling" Riff-Raff deployment type. | ||
* - Your application should include a build number in its filename. | ||
* This value will change across builds, and therefore create a CloudFormation template difference to be deployed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the AMI also gets updated by CloudFormation, would that be a separate CloudFormation change, or would we try and do it together in one update? I think the reason this is important is that CloudFormation will rollback failed changes, and we wouldn't it to rollback halfway to the old ASG with the new AMI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, Riff-Raff attempts to update the AMI on each deployment. That is, the changes are bundled together into one.
`aws` is available via AMIgo baked AMIs. | ||
See https://github.com/guardian/amigo/tree/main/roles/aws-tools. | ||
*/ | ||
userData.addCommands( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love us to move away from teams having to write the boiler plate of fetching their application from S3. We don't necessarily have to think about this at the same time, but if we're already changing artifacts around to include the build number, is this an opportunity to remove this as well at the same time, with only one breaking change?
I'd say Ideally userdata should be used by teams for appplication specific work if required, and we'd have everything needed for deploying and running a "normal" application in place.
Again, no need to tackle everything at once so if this is non-trivial then for sure let's consider it later.
--region ${region} \ | ||
--targets Id=$INSTANCE_ID,Port=${applicationPort} \ | ||
--query "TargetHealthDescriptions[0].TargetHealth.State") | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a log line here to say words to the effect of "yeah we're all good, I got healthy back from the target health call"?
AutoScalingReplacingUpdate
)
ASGs with an update policy will get deployed via CloudFormation, instead of Riff-Raff's `autoscaling` deployment type.
…arkers This should make it easier to parse a user data string if ever one is debugging.
a1fbba3
to
970acd1
Compare
Here is a summary of observations made. I think the negative changes this mechanism creates for incident triaging outweighs the benefits. Launch templatesLaunch templates are a prerequisite for this solution; we still have a number of services using launch configurations. ScalingIf, during a deployment, the current ASG scales up, the second ASG will always be under-provisioned. Let’s say we have this timeline:
We’re left with a single ASG with capacity 3/10/3. This is under-provisioned (assuming we’re still in an alarm state). Indeed, this is confirmed by AWS support:
AWS recommends doing this by making the scale out alarm more sensitive.
Activity historyWith new ASGs created, we lose the ability to use the ASG “Activity History” to, for example, understand why an instance was unexpectedly terminated. AWS have suggested two alternatives:
CloudWatch MetricsEach ASG records metrics to CloudWatch to a namespace based on the ASG name. With multiple ASGs, these metrics will be spread across namespaces. To get a continuous timeline of these metrics, we have to manually join the metrics. How, is yet unanswered. Security SLAsA number of our security SLAs come into effect based on the age of an ASG. With each deployment creating a new ASG, these metrics will become nonsensical. |
Note
This is an alternative take of #2379, differing mostly in the creation of an experimental pattern. The intention is to provide a strong signal that this feature is not yet considered ready for use on high-profile services.
Another point of difference is usage tracking via the
Metadata
Aspect rather than tags. The Aspect updates the stack'sMetadata
section with a list of GuCDK constructs being used.What does this change?
This change adds a new experimental pattern
GuEc2AppExperimental
for provisioning an EC2 based service deployed entirely via CloudFormation updates. This is achieved by setting theUpdatePolicy
attribute of the ASG, specificallyAutoScalingReplaceUpdate
is employed.With an
AutoScalingReplaceUpdate
policy, a CloudFormation update will create a second ASG, and:The CloudFormation update remains
IN_PROGRESS
until the instances in the ASG report healthy. By default this is defined as instance health (did the instance boot correctly), whereas we want an instance to be healthy only once the target group can see it. For this reason, we implement some custom logic in the user-data.The resulting user-data works like this:
On success, the CloudFormation update status moves to
UPDATE_COMPLETE
, and the deployment succeeds. The final state is one ASG, running the new version of the service.If the target group is unable to report the instance as healthy, then a signal is sent indicating this. The CloudFormation update status moves to
UPDATE_FAILED
, and the changes are rolled back. The final state is one ASG, running the current version of the service.Updating a service via this mechanism differs from Riff-Raff's
autoscaling
deployment in one key way: the current ASG can continue to scale if needed. In Riff-Raff's process, scaling alarms are disabled before doubling the desired capacity of the ASG, and enabled again once the capacity has been halved (technically after the halving request has been made).The
GuEc2AppExperimental
pattern is not compatible with Riff-Raff'sautoscaling
deployment type too. As a pre-flight check, Riff-Raff checks there is exactly one ASG matching the tagging specification. As we're creating a second, this check will fail. Consequently, theriff-raff.yaml
generator has been updated to omit theautoscaling
deployment.Why
experimental
?There are a few requirements for this approach:
cfn-signal
binaries need to be available and on thePATH
(this is added via the AMIgo roleaws-tools
)We're not (yet) validating these are met as its tricky.
There are also a few of unknowns about this approach:
For this reason, the rollout plan looks something like this:
How to test
For a real-world test, I've been using the pattern (and the update to the generated
riff-raff.yaml
file) withinguardian/cdk-playground
- guardian/cdk-playground#496.Additional testing has been done via unit tests.
How can we measure success?
ASGs can scale during deployment.
Have we considered potential risks?
As noted above, there are some unknowns. Any service that starts using this pattern in its
experimental
form implicitly accepts the risk of these unknowns.Checklist
Footnotes
Consider whether this is something that will mean changes to projects that have already been migrated, or to the CDK CLI tool. If changes are required, consider adding a checklist here and/or linking to related PRs. ↩
If you are adding a new construct or pattern, has new documentation been added? If you are amending defaults or changing behaviour, are the existing docs still valid? ↩