"Service... did not stabilize" error when updating the CloudFormation stack #175

Closed
MikeTheCanuck opened this issue Jun 17, 2018 · 5 comments

Comments

@MikeTheCanuck
Contributor

MikeTheCanuck commented Jun 17, 2018

Attempted to update the CF stack last night with the changes from PR 38:

  • CloudFormation would not let me apply the update: it reported Update Failed and rolled everything back because I did not have sufficient permissions, as observed in the Events details of one of the affected nested Stacks (a sketch of granting the missing permission follows this list):

mikethecanuck@gmail.com is not authorized to perform: iam:PassRole on resource: arn:aws:iam::845828040396:role/ecs-service-hacko-integration-2018LE-1GVBYRVJAIDYJ (Service: AmazonECS; Status Code: 400; Error Code: AccessDenied

  • when Michael Lange (who has god access) attempted to execute the changeset, that attempt also failed with the following error:

Service arn:aws:ecs:us-west-2:845828040396:service/hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2 did not stabilize.
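
For reference, a minimal sketch of the first failure's fix - assuming boto3 and credentials with IAM admin rights; the user and policy names are hypothetical, and the account ID and role prefix are taken from the error above - granting the missing iam:PassRole permission:

```python
import json

import boto3

iam = boto3.client("iam")

# Hypothetical inline policy granting iam:PassRole on the ecs-service-* roles
# that CloudFormation hands to ECS during the stack update.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::845828040396:role/ecs-service-*",
        }
    ],
}

# "mikethecanuck" and "allow-ecs-pass-role" are hypothetical names.
iam.put_user_policy(
    UserName="mikethecanuck",
    PolicyName="allow-ecs-pass-role",
    PolicyDocument=json.dumps(policy),
)
```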

Observed conditions

  • the HackO cluster was running with 14 Services and 28 Tasks
  • out of the 8GB of memory on each underlying EC2 instance, there was approximately 3GB free (a sketch for checking this headroom via the API follows this list)
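
Not from the original thread, but a minimal sketch (assuming boto3; the cluster name is hypothetical) of reading each container instance's remaining vs. registered memory from the ECS API:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "hacko-integration"  # hypothetical cluster name

# Report each container instance's free vs. registered memory, which is what
# ECS consults when deciding whether a new task fits on that instance.
arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)

for ci in detail["containerInstances"]:
    remaining = {r["name"]: r.get("integerValue", 0) for r in ci["remainingResources"]}
    registered = {r["name"]: r.get("integerValue", 0) for r in ci["registeredResources"]}
    print(ci["ec2InstanceId"], f"{remaining['MEMORY']} MiB free of {registered['MEMORY']} MiB")
```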

Theory on the failure

  • we haven't yet applied the "spread" settings to any services
  • perhaps executing the changeset caused the 14 Services to spin up 28 more Tasks, and (lacking the "spread" settings) the two new 2017 transport tasks - each requiring 2GB of memory - were both placed on the same EC2 instance
  • since that placement would have required 4GB of memory and less than 4GB was available on that instance, ECS could not start at least one of the tasks (a sketch of checking the service's placement events follows this list)
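
A sketch of checking this from the API (boto3; the cluster name is hypothetical, the service name is taken from the error above): when a task can't be placed for lack of memory, ECS records an event on the service.

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "hacko-integration"  # hypothetical cluster name
SERVICE = "hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2"

# Recent service events; failed placements show up as messages like
# "... was unable to place a task because no container instance met all
# of its requirements ...".
resp = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])
for event in resp["services"][0]["events"][:10]:
    print(event["createdAt"], event["message"])
```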

Evidence for this theory

  • I temporarily (manually) told ECS to run only 1 Task instance of the 2017 transport service, through the ECS console (the equivalent API call is sketched after the screenshot below)
  • then we executed a new changeset with the same settings we'd tried last night
  • when I looked at the state of the cluster while waiting for the changeset to complete (or roll back), I noticed there were now 28 Services and 54 Tasks running (54 = 28 x 2, less two for the one fewer 2017 transport task per EC2 instance)

screen shot 2018-06-17 at 10 52 28
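
The manual console change in the first bullet above is roughly equivalent to the following boto3 sketch (cluster name hypothetical, service name taken from the earlier error):

```python
import boto3

ecs = boto3.client("ecs")

# Drop the 2017 transport service to a single task, mirroring the manual
# change made in the ECS console.
ecs.update_service(
    cluster="hacko-integration",  # hypothetical cluster name
    service="hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2",
    desiredCount=1,
)
```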

@MikeTheCanuck
Contributor Author

This changeset execution is currently consuming all but 747 MB of memory on the EC2 instances:

screen shot 2018-06-17 at 11 04 00

And it's taking an awfully long time to resolve either way...

And the transportService (the 2017 monster) is the only affected service still in "UPDATE_IN_PROGRESS" state; all the rest have moved on to "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS".

screen shot 2018-06-17 at 11 05 31
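
Not part of the original comment, but a sketch (boto3; the "hacko-integration" name filter is an assumption) of watching these stack statuses from a script instead of the console:

```python
import boto3

cfn = boto3.client("cloudformation")

# Print each matching stack's status so the UPDATE_IN_PROGRESS vs.
# UPDATE_COMPLETE_CLEANUP_IN_PROGRESS states can be watched from a script.
for page in cfn.get_paginator("describe_stacks").paginate():
    for stack in page["Stacks"]:
        if "hacko-integration" in stack["StackName"]:  # assumed name prefix
            print(stack["StackName"], stack["StackStatus"])
```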

@MikeTheCanuck
Contributor Author

Seeing as the new attempt to bring up transportService has been in this state for ~25 minutes now, with no further message:
screen shot 2018-06-17 at 11 09 03

I'm going to assume that this isn't likely to ultimately succeed.

So I went to the older/active instance of transportService that was configured down to 1 task, and tuned it down to 0 tasks - hoping this will catch things before they miserably fail and roll back this latest attempt.

@MikeTheCanuck
Contributor Author

And then, because the 1 remaining Task was still running, I went in and manually stopped it, freeing up the memory on that one EC2 instance:

screen shot 2018-06-17 at 11 14 06
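
The two manual steps above (scaling the old service to 0 tasks, then stopping the straggler) are roughly what this boto3 sketch does; the cluster name is hypothetical and the service name is the one from the earlier error:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "hacko-integration"  # hypothetical cluster name
SERVICE = "hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2"

# Scale the old transport service down to zero tasks...
ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=0)

# ...then stop any task it still has running, freeing its 2GB reservation.
for task_arn in ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]:
    ecs.stop_task(cluster=CLUSTER, task=task_arn, reason="freeing memory for stack update")
```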

@MikeTheCanuck
Contributor Author

It finally succeeded, nearly a half-hour into the attempt. Did freeing up the memory from that one lingering 2GB task make the difference? Who's to say? But it's definitely making progress back to a stable cluster:

screen shot 2018-06-17 at 11 19 38

screen shot 2018-06-17 at 11 22 56

screen shot 2018-06-17 at 11 28 18

@MikeTheCanuck
Contributor Author

Conclusion

What caused this to fail? We can't know for sure, but the most likely candidate is that one of our Tasks - when run as two instances without the "spread" settings from PR 38 in place - could still swamp the remaining memory on a single EC2 host.

We *shouldn't* see this any longer, so long as the cluster (a) keeps the "spread" settings, and (b) never tries to launch a Service whose Tasks could consume the remaining memory on an EC2 host.
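
Not from the thread itself, but a sketch (boto3; cluster, service, and task definition names are hypothetical) of what the "spread" placement settings look like when a service is created - spreading tasks across Availability Zones and then across instances so that two 2GB transport tasks don't land on the same host:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical service creation showing the "spread" placement strategy:
# spread tasks across Availability Zones first, then across instances, so
# two memory-hungry copies of the same task don't land on the same host.
ecs.create_service(
    cluster="hacko-integration",          # hypothetical cluster name
    serviceName="transportService-2017",  # hypothetical service name
    taskDefinition="transport-2017:1",    # hypothetical task definition
    desiredCount=2,
    placementStrategy=[
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        {"type": "spread", "field": "instanceId"},
    ],
)
```

In a CloudFormation template, the same settings go under the AWS::ECS::Service resource's PlacementStrategies property.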
