
Improved UX for update cancellation, hibernate and timeout with managed stacks #1077

Closed
lukehoban opened this issue Mar 27, 2018 · 12 comments


@lukehoban (Member) commented Mar 27, 2018

If I ctrl-c a deployment using the managed stacks model, then attempting to restart the update leads to:

$ pulumi update
Updating stack 'stress-tester-testing' in the Pulumi Cloud ☁️
error: calling API: [400] another update is currently in progress

We need to provide a great CLI experience in these sorts of failure modes, and make it easy to restart an update in one of these cases.

Scenarios:

  1. Ctrl-c a deployment, then start again.
  2. Close lid and come back later - should it fail or auto-renew and continue if possible?
  3. Deployment in CI cancelled (same as (1) except can't rely on ctrl-c handler).

@lukehoban lukehoban added this to the 0.12 milestone Mar 27, 2018

@joeduffy (Member) commented Mar 27, 2018

Is this with the new service model?

@lukehoban lukehoban changed the title Improved UX for update cancellation, hibernate and timeout Improved UX for update cancellation, hibernate and timeout with managed stacks Mar 27, 2018

@lukehoban (Member, Author) commented Mar 27, 2018

> Is this with the new service model?

Yes - updated title and description to be clearer.

@pgavlin (Member) commented Mar 30, 2018

Note that this issue is going to affect our ability to run tests against managed stacks in CI: if the CLI crashes during a test, it will not have the chance to mark the update as completed, and any attempt to destroy the test stack would need to wait for the crashed update to time out.

I think we should do something like the following:

  • add an explicit "cancel" verb to the CLI that cancels the current update. Maybe pulumi update cancel?
  • add a flag to update and destroy that causes the CLI to cancel the current update and export + import the stack before the update or destroy, such that it is more likely to proceed (see the sketch below). There is still a race here--another update could begin between the cancellation of any current update and the start of the new update--but I think that's acceptable for scenarios where the expectation is that only one update will ever be attempted at a time.
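
For illustration, the recovery flow these additions would enable might look something like the following; the cancel verb is the proposal above (not an existing command at this point), while pulumi stack export and pulumi stack import are existing commands:

$ pulumi cancel                           # mark the orphaned update as complete
$ pulumi stack export --file state.json   # export the (possibly inconsistent) checkpoint
$ pulumi stack import --file state.json   # re-import it to repair the stack's state
$ pulumi update                           # retry the update
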
@pgavlin (Member) commented Mar 30, 2018

(note that this sort of affects #1092)

@lukehoban lukehoban modified the milestones: 0.14, 0.12 Apr 9, 2018

@lukehoban (Member, Author) commented Apr 9, 2018

We've extended M12, so I'm pulling this back in, in case there is progress we can make as part of landing "managed stacks" in a more complete state in M12.

@lindydonna commented Apr 9, 2018

This is included as a Known Issue in the documentation. When it is fixed, please ensure the Known Issues document is also updated.

@pgavlin (Member) commented Apr 16, 2018

@lukehoban @joeduffy any opinion on the approach I proposed above?

@joeduffy (Member) commented Apr 16, 2018

I would start by working through the full end-to-end story here. That, for me, begins with asking the question: why did something get interrupted? Luke lists three:

Scenarios:

  1. Ctrl-c a deployment, then start again.
  2. Close lid and come back later - should it fail or auto-renew and continue if possible?
  3. Deployment in CI cancelled (same as (1) except can't rely on ctrl-c handler).

I think (2) and (3), and, related to them, (4) lost Internet connectivity, are unavoidable, and so those put you immediately in the "recovery" situation.

There is another significant class besides recovery, however: prevention. Whenever we can prevent this situation, we should do so. I suspect many of the occurrences of (1) in the wild can be prevented simply by having better support for things like #513.

Why would you ^C? Three major reasons, as far as I can tell:

a) Something is taking forever (possibly hung), and I have lost patience.
b) I just want to stop "soon" but I certainly would prefer not to corrupt any state.
c) Oops, I just hit enter to perform an update, and realized it is wrong -- stop!!

In all cases, it would be ideal to guarantee we stop at a safe point. In the case of a), however, I may truly want to ^C even if it means possibly orphaning a lease token and/or corrupting my checkpoint. I don't know if this is the common case; I expect most people who hit this will do so because they simply don't know any better. Sure, user education can help (RTFM), but we can do better. In other words, attempt to prevent this from happening as much as possible, and, knowing it will still occur, help with recovery for those cases where we couldn't prevent it.

TL;DR, I think for M12 we should:

  • Require two ^Cs to cancel: the first prints a message to the tune of "^C recognized; interrupting a stack update can lead to orphaned and corrupt state; if you'd like to proceed, type ^C again."

  • Implement a pulumi cancel command (not specific to update, since it pertains to destroy also). We should make sure that this comes with a big, red, flashing warning on it, with a confirmation prompt.

Beyond M12, we should of course implement proper cancellation so we can more aggressively perform "safe" ^Cs. I also think it's worth somehow remembering local state about the prior lease, so that if you attempt an operation and we can recognize that your machine was in fact the last to perform an update, and we know where it left off, we can safely and silently resume.
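
To make the two-^C proposal concrete, here is a minimal Go sketch of the pattern (illustrative only, not the engine's actual implementation; runUpdate is a hypothetical stand-in for the update loop):

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"time"
)

// runUpdate is a hypothetical stand-in for the engine's update loop: it
// checks cancelCtx between steps and stops outright when termCtx fires.
// (In a real engine, termination would also preempt in-flight work rather
// than only being observed between steps.)
func runUpdate(cancelCtx, termCtx context.Context) {
	for step := 0; ; step++ {
		select {
		case <-termCtx.Done():
			fmt.Println("terminated mid-step; state may be corrupt")
			return
		case <-cancelCtx.Done():
			fmt.Printf("canceled at a safe point after %d steps\n", step)
			return
		default:
			time.Sleep(time.Second) // stand-in for performing one deployment step
		}
	}
}

func main() {
	// cancelCtx asks the operation to stop at the next safe point; termCtx
	// tells it to stop immediately, even at the risk of corrupting state.
	cancelCtx, cancel := context.WithCancel(context.Background())
	termCtx, terminate := context.WithCancel(context.Background())

	sigs := make(chan os.Signal, 2)
	signal.Notify(sigs, os.Interrupt)
	go func() {
		<-sigs // first ^C: warn and request graceful cancellation
		fmt.Println("^C received; interrupting an update can corrupt state. ^C again to terminate.")
		cancel()
		<-sigs // second ^C: terminate without waiting for a safe point
		terminate()
	}()

	runUpdate(cancelCtx, termCtx)
}
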

pgavlin added a commit that referenced this issue Apr 19, 2018
Add a `pulumi cancel` command. (#1230)
This command cancels a stack's currently running update, if any. It can
be used to recover from the scenario in which an update is aborted
without marking the running update as complete. Once an update has been
cancelled, it is likely that the affected stack will need to be repaired
via a pair of export/import commands before future updates can succeed.

This is part of #1077.
pgavlin added a commit that referenced this issue Apr 20, 2018
Plumb basic cancellation through the engine. (#1231)
These changes plumb basic support for cancellation through the engine.
Two types of cancellation are supported for all engine operations:
- Cancellation, which waits for the operation to drive itself to a safe
  point before the operation returns, and
- Termination, which does not wait for the operation to drive itself
  to a safe point before the operation returns.

When updating local or managed stacks, a single ^C triggers cancellation
of any running operation; a second ^C will trigger termination.

Fixes #513, #1077.
@pgavlin (Member) commented Apr 20, 2018

Closed by #1231.

@pgavlin pgavlin closed this Apr 20, 2018

@pgavlin pgavlin reopened this Apr 20, 2018

@pgavlin (Member) commented Apr 20, 2018

Disregard that--the ctrl+c piece of this was handled by #1231. The post-hoc cancellation was handled by #1230.

@pgavlin pgavlin modified the milestones: 0.12, 0.14 Apr 20, 2018

@lukehoban (Member, Author) commented Apr 22, 2018

@pgavlin Can you summarize what remaining work we want to do here? Perhaps even open one or more new issues to capture concrete things we still need to do?

@pgavlin (Member) commented May 10, 2018

Filed pulumi/pulumi-service#1405 to track separating timeout from lease expiration in order to better handle scenario (2).
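
To illustrate why separating the two matters, here is a hypothetical Go sketch of a lease-renewal loop decoupled from the overall update timeout (the renew callback and the 30-second interval are assumptions for illustration, not a real Pulumi service API). A hibernated laptop stops renewing, so its lease expires and frees the stack, while a healthy long-running update keeps its lease alive independently of any fixed timeout:

package main

import (
	"context"
	"time"
)

// keepLeaseAlive periodically renews an update's lease until the update
// finishes or a renewal fails. The renew callback is a hypothetical
// stand-in for a service API call; no such call is named in this issue.
func keepLeaseAlive(ctx context.Context, renew func() error) {
	ticker := time.NewTicker(30 * time.Second) // renew well before the lease expires
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // the update completed or was canceled; let the lease lapse
		case <-ticker.C:
			if err := renew(); err != nil {
				return // renewal failed; the service will reclaim the lease
			}
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go keepLeaseAlive(ctx, func() error { return nil }) // no-op renewal for demonstration
	time.Sleep(65 * time.Second)                        // stand-in for the update itself
}
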
