Improved UX for update cancellation, hibernate and timeout with managed stacks #1077
We need to provide a great CLI experience in these sorts of failure modes, and make it easy to restart an update in one of these cases.
Note that this issue is going to affect our ability to run tests against managed stacks in CI: if the CLI crashes during a test, it will not have the chance to mark the update as completed, and any attempt to destroy the test stack would need to wait for the crashed update to time out.
I think we should do something like the following:
I would first start by working through the full end-to-end story here. That, for me, begins with asking the question: Why did something get interrupted? Luke lists three:
I think (2) and (3), along with a related (4), lost Internet connectivity, are unavoidable, which puts you immediately in the "recovery" situation.
There is another significant class besides recovery, however: prevention. Where we can prevent this situation, we should. I suspect many of the occurrences of (1) in the wild can be prevented simply by having better support for things like #513.
Why would you ^C? Three major reasons, as far as I can tell:
a) Something is taking forever (possibly hung), and I have lost patience.
In all cases, it would be ideal to guarantee we stop at a safe point. In the case of a), however, it may be the case that I truly want to ^C even if it means possibly orphaning a lease token and/or corrupting my checkpoint. I don't know if this is the common case. I expect most people who hit this will do so because they simply don't know any better. Sure, user education can help (RTFM), but we can do better. In other words, attempt to prevent this from happening as much as possible, and, knowing it will still occur, help with recovery for those cases where we couldn't prevent it.
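To make the "stop at a safe point" idea concrete, here is a minimal sketch in Python, not the CLI's actual implementation. It assumes an update can be modeled as a list of discrete steps, that the first ^C is only a request to stop between steps, and that a second ^C means the user accepts an unsafe stop. `CancelState`, `run_update`, `install_handler`, and the in-memory `checkpoint` list are all hypothetical names for illustration.

```python
import signal


class CancelState:
    """Counts ^C presses: the first is a polite request, the second is forceful."""

    def __init__(self):
        self.interrupts = 0

    def on_interrupt(self, signum=None, frame=None):
        self.interrupts += 1
        if self.interrupts >= 2:
            # Second ^C: the user accepts the risk of an unsafe stop.
            raise KeyboardInterrupt

    @property
    def cancel_requested(self):
        return self.interrupts >= 1


def run_update(steps, state, checkpoint):
    """Apply (name, fn) steps in order, honoring cancellation only between steps.

    Returns True if every step ran, False if we stopped early at a safe point.
    """
    for name, apply_step in steps:
        if state.cancel_requested:
            return False  # safe point: no step is half-applied
        apply_step()
        checkpoint.append(name)  # hypothetical stand-in for saving a checkpoint
    return True


def install_handler(state):
    """Route SIGINT (^C) to the cancel state instead of killing the process."""
    signal.signal(signal.SIGINT, state.on_interrupt)
```

With something like this, a single ^C during a long step lets that step finish and be checkpointed before the update stops, while an impatient second ^C still gets the user out immediately.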
TL;DR, I think for M12 we should
Beyond M12, we should of course implement proper cancellation so we can more aggressively perform "safe" ^Cs. I also think it's worth somehow remembering local state about the prior lease so that if you attempt to do something, and we can recognize that your machine was in fact the last to be performing an update, and we know where it left off, we can safely and silently resume.
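As a rough illustration of the "remember local state about the prior lease" idea, here is a small Python sketch under stated assumptions: the client writes a tiny JSON record as it makes progress, and on the next run compares it against whatever update the service reports as still pending. The file layout, the field names (`machine_id`, `update_id`, `last_step`), and both function names are hypothetical, not anything the CLI or service defines today.

```python
import json
import os
import tempfile


def record_progress(path, machine_id, update_id, last_step):
    """Save which update this machine was running and how far it got."""
    record = {"machine_id": machine_id, "update_id": update_id, "last_step": last_step}
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(record, f)
    # Write to a temp file, then rename, so readers never see a half-written record.
    os.replace(tmp_path, path)


def resumable_step(path, machine_id, pending_update_id):
    """Return the last recorded step if this machine owns the pending update, else None."""
    try:
        with open(path) as f:
            record = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None  # no usable local state: fall back to explicit recovery
    if not isinstance(record, dict):
        return None
    if record.get("machine_id") != machine_id:
        return None
    if record.get("update_id") != pending_update_id:
        return None
    return record.get("last_step")
```

If `resumable_step` returns a step, the client knows it was the last machine performing that update and where it left off; in every other case it returns `None` and the user goes through the normal recovery path.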