"Service... did not stabilize" error when updating the CloudFormation stack #175

Closed
MikeTheCanuck opened this issue Jun 17, 2018 · 5 comments

Comments

@MikeTheCanuck
Contributor

MikeTheCanuck commented Jun 17, 2018

Attempted to update the CF stack last night with the changes from PR 38:

  • CloudFormation would not let me apply the update: it reported Update Failed and rolled everything back because I did not have sufficient permissions, as observed in the Events details of one of the affected nested Stacks (a sketch of granting the missing permission follows this list):

mikethecanuck@gmail.com is not authorized to perform: iam:PassRole on resource: arn:aws:iam::845828040396:role/ecs-service-hacko-integration-2018LE-1GVBYRVJAIDYJ (Service: AmazonECS; Status Code: 400; Error Code: AccessDenied

  • when Michael Lange (who has god access) attempted to execute the changeset, that attempt also failed with the following error:

Service arn:aws:ecs:us-west-2:845828040396:service/hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2 did not stabilize.
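
For reference, a minimal sketch of the first failure's fix - assuming boto3 and credentials with IAM admin rights; the user and policy names are hypothetical, and the account ID and role prefix are taken from the error above - granting the missing iam:PassRole permission:

```python
import json

import boto3

iam = boto3.client("iam")

# Hypothetical inline policy granting iam:PassRole on the ecs-service-* roles
# that CloudFormation hands to ECS during the stack update.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::845828040396:role/ecs-service-*",
        }
    ],
}

# "mikethecanuck" and "allow-ecs-pass-role" are hypothetical names.
iam.put_user_policy(
    UserName="mikethecanuck",
    PolicyName="allow-ecs-pass-role",
    PolicyDocument=json.dumps(policy),
)
```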

Observed conditions

  • the HackO cluster was running with 14 Services and 28 Tasks
  • out of the 8GB of memory on each underlying EC2 instance, there was approximately 3GB free (a sketch for checking this headroom via the API follows this list)
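
Not from the original thread, but a minimal sketch (assuming boto3; the cluster name is hypothetical) of reading each container instance's remaining vs. registered memory from the ECS API:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "hacko-integration"  # hypothetical cluster name

# Report each container instance's free vs. registered memory, which is what
# ECS consults when deciding whether a new task fits on that instance.
arns = ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]
detail = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)

for ci in detail["containerInstances"]:
    remaining = {r["name"]: r.get("integerValue", 0) for r in ci["remainingResources"]}
    registered = {r["name"]: r.get("integerValue", 0) for r in ci["registeredResources"]}
    print(ci["ec2InstanceId"], f"{remaining['MEMORY']} MiB free of {registered['MEMORY']} MiB")
```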

Theory on the failure

  • we haven't yet applied the "spread" settings to any services
  • perhaps executing the changeset caused the 14 Services to spin up 28 more Tasks, and (lacking the "spread" settings) the two new 2017 transport tasks - each requiring 2GB of memory - were both placed on the same EC2 instance
  • since that placement would have required 4GB of memory and less than 4GB was available on that instance, ECS could not start at least one of the tasks (a sketch of checking the service's placement events follows this list)
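
A sketch of checking this from the API (boto3; the cluster name is hypothetical, the service name is taken from the error above): when a task can't be placed for lack of memory, ECS records an event on the service.

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "hacko-integration"  # hypothetical cluster name
SERVICE = "hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2"

# Recent service events; failed placements show up as messages like
# "... was unable to place a task because no container instance met all
# of its requirements ...".
resp = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])
for event in resp["services"][0]["events"][:10]:
    print(event["createdAt"], event["message"])
```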

Evidence for this theory

  • I temporarily (manually) told ECS to run only 1 Task instance of the 2017 transport service, through the ECS console (the equivalent API call is sketched after the screenshot below)
  • then we executed a new changeset with the same settings we'd tried last night
  • when I looked at the state of the cluster while waiting for the changeset to complete (or roll back), I noticed there were now 28 Services and 54 Tasks running (54 = 28 x 2, less two for the one fewer 2017 transport task per EC2 instance)

screen shot 2018-06-17 at 10 52 28
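
The manual console change in the first bullet above is roughly equivalent to the following boto3 sketch (cluster name hypothetical, service name taken from the earlier error):

```python
import boto3

ecs = boto3.client("ecs")

# Drop the 2017 transport service to a single task, mirroring the manual
# change made in the ECS console.
ecs.update_service(
    cluster="hacko-integration",  # hypothetical cluster name
    service="hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2",
    desiredCount=1,
)
```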

@MikeTheCanuck
Contributor Author

This changeset execution is currently consuming all but 747 MB of memory on the EC2 instances:

screen shot 2018-06-17 at 11 04 00

And it's taking an awfully long time to resolve either way...

And the transportService (the 2017 monster) is the only affected service still in "UPDATE_IN_PROGRESS" state; all the rest have moved on to "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS".

screen shot 2018-06-17 at 11 05 31
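
Not part of the original comment, but a sketch (boto3; the "hacko-integration" name filter is an assumption) of watching these stack statuses from a script instead of the console:

```python
import boto3

cfn = boto3.client("cloudformation")

# Print each matching stack's status so the UPDATE_IN_PROGRESS vs.
# UPDATE_COMPLETE_CLEANUP_IN_PROGRESS states can be watched from a script.
for page in cfn.get_paginator("describe_stacks").paginate():
    for stack in page["Stacks"]:
        if "hacko-integration" in stack["StackName"]:  # assumed name prefix
            print(stack["StackName"], stack["StackStatus"])
```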

@MikeTheCanuck
Contributor Author

Seeing as the new attempt to bring up transportService has been in this state for ~25 minutes now, with no further message:
screen shot 2018-06-17 at 11 09 03

I'm going to assume that this isn't likely to ultimately succeed.

So I went to the older/active instance of transportService that was configured down to 1 task, and tuned it down to 0 tasks - hoping this will catch things before they miserably fail and roll back this latest attempt.

@MikeTheCanuck
Contributor Author

And then, because the 1 remaining Task was still running, I went in and manually stopped it, freeing up the memory on that one EC2 instance:

screen shot 2018-06-17 at 11 14 06
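
The two manual steps above (scaling the old service to 0 tasks, then stopping the straggler) are roughly what this boto3 sketch does; the cluster name is hypothetical and the service name is the one from the earlier error:

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "hacko-integration"  # hypothetical cluster name
SERVICE = "hacko-integration-transportService-67KME5SFWBJO-Service-1OVJNMOPH8ZH2"

# Scale the old transport service down to zero tasks...
ecs.update_service(cluster=CLUSTER, service=SERVICE, desiredCount=0)

# ...then stop any task it still has running, freeing its 2GB reservation.
for task_arn in ecs.list_tasks(cluster=CLUSTER, serviceName=SERVICE)["taskArns"]:
    ecs.stop_task(cluster=CLUSTER, task=task_arn, reason="freeing memory for stack update")
```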

@MikeTheCanuck
Contributor Author

It finally succeeded, nearly a half-hour into the attempt. Did freeing up the memory from that one lingering 2GB task make the difference? Who's to say? But it's definitely making progress back to a stable cluster:

screen shot 2018-06-17 at 11 19 38

screen shot 2018-06-17 at 11 22 56

screen shot 2018-06-17 at 11 28 18

@MikeTheCanuck
Contributor Author

Conclusion

What caused this to fail? We can't know for sure, but the most likely candidate is that one of our Tasks - when run as two instances without the "spread" settings from PR 38 in place - could still swamp the remaining memory on a single EC2 host.

We *shouldn't* see this any longer, so long as the cluster (a) keeps the "spread" settings, and (b) never tries to launch a Service whose Tasks could consume the remaining memory on an EC2 host.
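
Not from the thread itself, but a sketch (boto3; cluster, service, and task definition names are hypothetical) of what the "spread" placement settings look like when a service is created - spreading tasks across Availability Zones and then across instances so that two 2GB transport tasks don't land on the same host:

```python
import boto3

ecs = boto3.client("ecs")

# Hypothetical service creation showing the "spread" placement strategy:
# spread tasks across Availability Zones first, then across instances, so
# two memory-hungry copies of the same task don't land on the same host.
ecs.create_service(
    cluster="hacko-integration",          # hypothetical cluster name
    serviceName="transportService-2017",  # hypothetical service name
    taskDefinition="transport-2017:1",    # hypothetical task definition
    desiredCount=2,
    placementStrategy=[
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        {"type": "spread", "field": "instanceId"},
    ],
)
```

In a CloudFormation template, the same settings go under the AWS::ECS::Service resource's PlacementStrategies property.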
