runner: some aws instances don't shut down and keep running forever #1138

@courentin

Description

When running cml runner, I've noticed a couple of times that some AWS instances were not properly terminated after their jobs finished, even though the GitHub runner was marked as offline.

I can't reproduce it easily; it seems to happen in a particular scenario that I haven't been able to isolate yet, and I don't have the bandwidth to investigate further right now.

As a broader issue, though, I'm wondering whether instance termination is resilient enough to the various ways things can fail.
Given that cml is made for ML workflows, we expect users to run big, expensive instances. Thus, cml should ensure that no matter what happens (GitHub Actions being down, a cml bug, an edge case when canceling workflows, etc.), instances are shut down at some point.

Before cml existed, we used to start instances manually and found that it was easy to forget about them. To fix that, we developed an EC2 instance garbage collector that terminated instances whose CPU usage stayed low for a certain amount of time.
If something like that could be integrated into cml, that would be awesome.
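
To make the garbage collector idea concrete, here is a rough sketch of the decision logic only, not our actual implementation and not anything cml currently does. Every name in it (`CpuSample`, `is_idle`, `select_idle_instances`) and the default values (5% CPU, 30 minutes) are hypothetical. It assumes CPU samples have already been collected per instance (for example from CloudWatch), and it leaves fetching metrics and terminating instances out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Mapping, Sequence


@dataclass(frozen=True)
class CpuSample:
    """One CPU utilization reading for an instance (hypothetical type)."""
    timestamp: datetime
    cpu_percent: float


def is_idle(
    samples: Sequence[CpuSample],
    now: datetime,
    cpu_threshold: float = 5.0,                      # assumed default
    idle_window: timedelta = timedelta(minutes=30),  # assumed default
) -> bool:
    """Return True if every sample in the last `idle_window` is below the threshold."""
    window_start = now - idle_window
    recent = [s for s in samples if window_start <= s.timestamp <= now]
    if not recent:
        # No data in the window: do not terminate on missing evidence.
        return False
    if min(s.timestamp for s in samples) > window_start:
        # History does not reach back a full window, so a freshly
        # started instance is never collected while it is still booting.
        return False
    return all(s.cpu_percent < cpu_threshold for s in recent)


def select_idle_instances(
    metrics: Mapping[str, Sequence[CpuSample]],
    now: datetime,
    cpu_threshold: float = 5.0,
    idle_window: timedelta = timedelta(minutes=30),
) -> List[str]:
    """Return the sorted IDs of instances that qualify for termination."""
    return sorted(
        instance_id
        for instance_id, samples in metrics.items()
        if is_idle(samples, now, cpu_threshold, idle_window)
    )
```

A scheduled job outside of the CI workflow could call `select_idle_instances` periodically and terminate whatever it returns, so that cleanup still happens when the workflow itself is the thing that failed.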
