Description
When running `cml runner`, I've noticed a couple of times that some AWS instances were not properly terminated after their jobs finished, even though the GitHub runner was marked as offline.
I can't reproduce it easily. It seems to happen in a particular scenario that I haven't been able to isolate yet, and I don't have the bandwidth to investigate further right now.
As a broader issue, though, I'm wondering whether instance termination is resilient enough against the various kinds of failure.
Given that cml is made for ML workflows, we can expect users to run big, expensive instances. cml should therefore ensure that, no matter what happens (GitHub Actions being down, a cml bug, an edge case when canceling workflows, etc.), instances are shut down at some point.
Before cml existed, we used to start instances manually and found that it was easy to forget about them. To fix that, we developed an EC2 instance garbage collector that terminated instances whose CPU usage stayed low for a certain amount of time.
If something like that could be integrated into cml, that would be awesome.
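To make the garbage collector idea concrete, here is a minimal sketch of the idle check such a tool could rely on. It is an illustration under assumptions, not the collector mentioned above and not anything cml provides: the names `IdlePolicy`, `is_idle`, and `select_idle_instances` are hypothetical, the 5% threshold and six-sample window are arbitrary defaults, and the CPU samples are assumed to have been fetched already (for example from a monitoring service), oldest first. Fetching the metrics and actually terminating instances are left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class IdlePolicy:
    """Hypothetical policy: an instance is idle when its CPU stays low long enough."""
    cpu_threshold_percent: float = 5.0  # assumed threshold, tune per workload
    min_idle_samples: int = 6           # assumed window, e.g. 6 samples of 5 minutes


def is_idle(cpu_samples: Sequence[float], policy: IdlePolicy) -> bool:
    """Return True if the most recent samples are all below the threshold.

    cpu_samples holds CPU utilization percentages, oldest first.
    With fewer samples than the window, the instance is never considered idle,
    so freshly started instances are not terminated by mistake.
    """
    if len(cpu_samples) < policy.min_idle_samples:
        return False
    recent = cpu_samples[-policy.min_idle_samples:]
    return all(sample < policy.cpu_threshold_percent for sample in recent)


def select_idle_instances(
    cpu_by_instance: Dict[str, Sequence[float]],
    policy: IdlePolicy = IdlePolicy(),
) -> List[str]:
    """Return the sorted IDs of instances that qualify for termination."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_by_instance.items()
        if is_idle(samples, policy)
    )
```

A scheduled job running outside the workflow could call `select_idle_instances` periodically and terminate whatever it returns, so cleanup would still happen if the CI service or the runner itself failed.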