Improved recovery for invalid stacks #1094

Open
joeduffy opened this issue Mar 30, 2018 · 9 comments

Comments

@joeduffy
Member

commented Mar 30, 2018

Issue

If an update is interrupted partway through, it is possible that the service's understanding of your stack's state will have drifted from reality. The Pulumi CLI and service have no way of reconciling this drift today, and react conservatively by marking the state as "unknown." The result is the following error message:

error: stack's resources are in an unknown state due to an interrupted update;
please export the stack, repair any inconsistencies, and import the result

Workaround

To recover from this situation, you will need to do three things:

  1. Export your stack's current state checkpoint by running `pulumi stack export`. The output is JSON, so redirect it to a file for easier viewing and editing: `pulumi stack export > stack.json`. Make a backup copy of this file before continuing, since the next step involves manual, error-prone editing.
  2. Manually verify that the state file matches what actually exists in your cloud, and edit the file where it does not. It is possible that no edits will be needed. Be very careful: incorrect information may worsen the situation, and deletions are not easy to recover from in the current system.
  3. Import the updated state checkpoint with `pulumi stack import`. For example, if it is stored in stack.json, run `pulumi stack import < stack.json`. This uploads the file and automatically clears the unknown status on your stack so that it is usable again.
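
The backup and the manual review in step 2 can be made a little less error-prone with a small script. What follows is a minimal Python sketch, not part of the Pulumi CLI. It copies the exported file to a backup and returns each resource's URN and provider ID so they can be compared against what exists in the cloud. It assumes the export has a top-level `deployment` object containing a `resources` array whose entries carry `urn` and `id` fields; check your own stack.json, since the checkpoint layout may differ between versions. The function names and the `.bak` suffix are illustrative only.

```python
import json
import shutil
from pathlib import Path


def backup_checkpoint(path):
    """Copy the exported checkpoint to <name>.bak before any manual edits."""
    src = Path(path)
    dst = src.with_name(src.name + ".bak")
    shutil.copy2(src, dst)
    return dst


def list_resources(path):
    """Return (urn, id) pairs for each resource in the exported checkpoint.

    Assumed layout: {"deployment": {"resources": [{"urn": ..., "id": ...}]}}.
    Missing keys are tolerated and yield an empty list or None values.
    """
    with open(path, encoding="utf-8") as f:
        checkpoint = json.load(f)
    deployment = checkpoint.get("deployment") or {}
    resources = deployment.get("resources") or []
    return [(r.get("urn"), r.get("id")) for r in resources]
```

Call `backup_checkpoint("stack.json")` once before editing, then walk through the pairs returned by `list_resources("stack.json")` and confirm each one in your cloud provider's console or CLI.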

Notes:

Depending on where we land with #1077, we may or may not need to invest in better recovery for invalid stacks. An invalid checkpoint occurs when an update interruption happens at an inopportune moment. There are ways we could possibly recover from this, for instance, if we knew that the local machine had reached a steady state prior to the interruption. Alternatively, we may just want to make the UX better, by interactively prompting the user for what they want to do next -- including possibly doing an interactive refresh to repair the state of the stack from the source of truth.

@pgavlin

Member

commented Mar 30, 2018

> There are ways we could possibly recover from this, for instance, if we knew that the local machine had reached a steady state prior to the interruption.

If the local machine had reached a steady state, the stack should not be invalid, right?

@joeduffy

Member Author

commented Mar 30, 2018

It may have lost connectivity with the Pulumi Service, but succeeded at making the changes to AWS.

@pgavlin

Member

commented Mar 30, 2018

Ah, sure--fair enough.

@lindydonna

commented Apr 9, 2018

This is included as a Known Issue in the documentation. When it is fixed, please ensure the Known Issues document is also updated.

@lukehoban

Member

commented Apr 22, 2018

I believe we now know when we are in this state, but suggest import/export, which may not be the best recovery option. Suggesting `pulumi refresh`, or even running `pulumi refresh --preview` automatically, might help here?

@lukehoban modified the milestones: 0.14, 0.16 on May 14, 2018

@lindydonna

commented Jun 5, 2018

Just chatted with @pgavlin and he said that we should indeed suggest the import/export workaround. I've updated the issue text with the symptom and workaround, and added the known-issue label.

@pgavlin modified the milestones: 0.16, 0.17 on Jul 12, 2018

@lukehoban modified the milestones: 0.17, 0.18 on Aug 27, 2018

@lukehoban modified the milestones: 0.18, 0.19 on Sep 13, 2018

@lukehoban

Member

commented Sep 13, 2018

This got better with "pending operations" in the checkpoint - so we provide more context to users - but we can still improve further.
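
For readers inspecting an exported checkpoint by hand, the pending operations mentioned here can be listed with a short script. This is a minimal Python sketch under stated assumptions: it supposes that the exported JSON records them under `deployment.pending_operations`, and that each entry has a `type` field and a nested `resource` object with a `urn`. Verify these field names against your own export before relying on them. The function name is illustrative.

```python
import json


def list_pending_operations(path):
    """Return (type, urn) pairs for operations left pending in a checkpoint.

    Assumed layout:
    {"deployment": {"pending_operations": [{"type": ..., "resource": {"urn": ...}}]}}
    Missing keys are tolerated and yield an empty list or None values.
    """
    with open(path, encoding="utf-8") as f:
        checkpoint = json.load(f)
    deployment = checkpoint.get("deployment") or {}
    pending = deployment.get("pending_operations") or []
    return [
        (op.get("type"), (op.get("resource") or {}).get("urn"))
        for op in pending
    ]
```

An empty result means the file contains no recorded pending operations under the assumed key; any entries that do appear point at the resources most worth checking against the cloud.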

@lukehoban

Member

commented Nov 12, 2018

@pgavlin What is the current plan for next steps on this?

@pgavlin

Member

commented Nov 19, 2018

@lukehoban I have no current plan. I will think on it.

@pgavlin modified the milestones: 0.19, 0.20 on Nov 19, 2018

@lukehoban modified the milestones: 0.20, 0.21 on Dec 7, 2018

@pgavlin modified the milestones: 0.21, 0.22 on Jan 30, 2019

@lukehoban modified the milestones: 0.22, 0.23 on Mar 24, 2019

@ellismg removed this from the 0.23 milestone on May 8, 2019

@pgavlin added the feature/q3 label on Jul 10, 2019

@lukehoban added this to the 0.26 milestone on Jul 25, 2019

@lukehoban removed the feature/q3 label on Aug 3, 2019

@ellismg removed this from the 0.26 milestone on Aug 5, 2019
