runner: some aws instances don't shut down and keep running forever

When running cml runner, I've noticed a couple of times that some AWS instances were not terminated after their jobs finished, even though the GitHub runner was marked as offline.

I can't reproduce it easily; it seems to happen in a particular scenario that I haven't been able to isolate yet, and I don't have the bandwidth to investigate further right now.

However, as a broader issue, I'm wondering whether instance termination is resilient enough to diverse failures.
Given that cml is made for ML workflows, we expect users to run big, expensive instances. cml should therefore ensure that, no matter what happens (GitHub Actions being down, a cml bug, an edge case with canceled workflows, etc.), instances are shut down at some point.

Before cml existed, we used to start instances manually and found that it was easy to forget about them. To fix that, we developed an EC2 instance garbage collector that terminated instances whose CPU usage stayed low for a certain amount of time.
If something like that could be integrated into cml, that would be awesome.
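
To make the idea concrete, here is a minimal sketch of the idle check such a garbage collector might perform. It is not the tool described above and not anything cml provides: the names (CpuSample, is_idle, select_idle_instances), the 5% threshold, the one-hour window and the minimum sample count are all assumptions for illustration. Fetching the CPU samples (for example from CloudWatch) and actually terminating the instances are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List


@dataclass
class CpuSample:
    """One CPU utilization data point for an instance (hypothetical type)."""
    timestamp: datetime
    percent: float  # average CPU utilization over the sampling period


def is_idle(
    samples: List[CpuSample],
    now: datetime,
    threshold_percent: float = 5.0,        # assumed value, tune per workload
    window: timedelta = timedelta(hours=1),  # assumed value
    min_samples: int = 6,                  # assumed value
) -> bool:
    """Return True if every sample inside the window is below the threshold.

    Instances with too few samples are treated as not idle, so a freshly
    started instance or a gap in monitoring never triggers termination.
    """
    recent = [s for s in samples if now - window <= s.timestamp <= now]
    if len(recent) < min_samples:
        return False
    return all(s.percent < threshold_percent for s in recent)


def select_idle_instances(
    samples_by_instance: Dict[str, List[CpuSample]],
    now: datetime,
    **kwargs,
) -> List[str]:
    """Return the sorted IDs of instances that qualify as idle."""
    return sorted(
        instance_id
        for instance_id, samples in samples_by_instance.items()
        if is_idle(samples, now, **kwargs)
    )
```

A periodic job running outside the CI workflow could call select_idle_instances and terminate whatever it returns, which is what would make this a safety net that still works when the runner itself fails to clean up.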

runner: some aws instances don't shut down and keep running forever #1138