Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider adding a --save-final-state-only flag (or PULUMI_SKIP_CHECKPOINTS env var) to speed up large stacks #10668

Closed
t0yv0 opened this issue Sep 9, 2022 · 6 comments 路 Fixed by #10687
Assignees
Labels
area/backends State storage (filestate/httpstate/etc.) impact/performance Something is slower than expected impact/reliability Something that feels unreliable or flaky kind/enhancement Improvements or new features resolution/fixed This issue was fixed
Milestone

Comments

@t0yv0
Copy link
Member

t0yv0 commented Sep 9, 2022

Hello!

  • Vote on this issue by adding a 馃憤 reaction
  • If you want to implement this feature, comment to let us know (we'll work with you on design, scheduling, etc.)

Issue details

Consider adding a --save-final-state-only flag to trade safety for speed. This flag would be relevant for situations where:

  • the state is very large (many resources, large resources)

  • Pulumi CLI crashes or disconnects that might lead to inconsistencies between the state of the cloud and the state checkpoint internally tracked by Pulumi in the statefile (or service) are not a serious concern - either unlikely or easy to recover from

The flag would ask Pulumi to only write the state at the end of a deployment (pulumi up, etc). The default behavior for Pulumi is to write the state immediately after it is making changes. A stack update that creates N resources is currently generating O(N) state writes, and it can slow down large deployments.

Currently pulumi up does this (example using the httpstate backend):

POST /api/stacks/t0yv0/ts-many-resources/test1/update -->
       updateResponse.UpdateID=d70490f6-0c8c-43e0-b67a-dfc47af25e51/
PATCH /api/stacks/t0yv0/ts-many-resources/test1/update/d70490f6-0c8c-43e0-b67a-dfc47af25e51/checkpoint
PATCH /api/stacks/t0yv0/ts-many-resources/test1/update/d70490f6-0c8c-43e0-b67a-dfc47af25e51/checkpoint
...
PATCH /api/stacks/t0yv0/ts-many-resources/test1/update/d70490f6-0c8c-43e0-b67a-dfc47af25e51/checkpoint
POST /api/stacks/t0yv0/ts-many-resources/test1/update/d70490f6-0c8c-43e0-b67a-dfc47af25e51/complete

The new behavior of pulumi up --save-final-state-only would be using only one PATCH at the end:

POST /api/stacks/t0yv0/ts-many-resources/test1/update -->
       updateResponse.UpdateID=d70490f6-0c8c-43e0-b67a-dfc47af25e51/
PATCH /api/stacks/t0yv0/ts-many-resources/test1/update/d70490f6-0c8c-43e0-b67a-dfc47af25e51/checkpoint
POST /api/stacks/t0yv0/ts-many-resources/test1/update/d70490f6-0c8c-43e0-b67a-dfc47af25e51/complete

The treadeoff is that in case of a catastrophic failure of the pulumi process the intermediate state changes it has performed will not be written to the state backend.

The flag would universally apply to the httpstate and filestate backends.

Alternatives and related tickets

Naming possibilities

--save-final-state-only
--save-final-checkpoint-only
--skip-saving-intermediate-state
--skip-saving-intermediate-checkpoints

Affected area/feature

area/backends

@t0yv0 t0yv0 added kind/enhancement Improvements or new features impact/performance Something is slower than expected area/backends State storage (filestate/httpstate/etc.) labels Sep 9, 2022
@mikhailshilkov mikhailshilkov added the impact/reliability Something that feels unreliable or flaky label Sep 9, 2022
@t0yv0 t0yv0 added this to the 0.78 milestone Sep 9, 2022
@t0yv0 t0yv0 self-assigned this Sep 9, 2022
@dixler dixler self-assigned this Sep 9, 2022
@t0yv0 t0yv0 assigned dixler and unassigned t0yv0 and dixler Sep 9, 2022
@jkodroff
Copy link
Member

jkodroff commented Sep 9, 2022

--skip-interstitial-checkpoints?
--large-stack-mode?

@dixler
Copy link
Contributor

dixler commented Sep 12, 2022

@AaronFriel suggests using an environment variable rather than a CLI flag

I prefer env vars over CLI flags as it works in users' favor and engineering's favor.

users:

  • users would have to modify all of their CI scripts to use this feature as a argument rather than have the desired behavior
  • environment variables are easier to provide to CI systems and user shell profiles
  • removing this flag would be a breaking change to users that use this flag.

engineering:

  • we use cobra for argument parsing so environment variables are easier to retrieve from subsystems meaning that the diff will be smaller
  • we can more easily deprecate this experimental behavior when patch support is done

an env var indicating a vague performance improvement would allow us to quietly remove the env var once our performance goals are reached since the users just want pulumi to be faster and we could hide the implementation details since it'd reduce our flexibility.

i.e. PULUMI_EXPERIMENTAL_UNSAFE_OPTIMIZE_STACK_MANAGER=1

@t0yv0
Copy link
Member Author

t0yv0 commented Sep 15, 2022

The PR implements PULUMI_EXPERIMENTAL_SNAPSHOT_MANAGER=1. I admit the idea this should be env vars better than a flag, but to nitpick the naming one last time, I liked Josh's suggestions that sound more like what the flag is doing for the user rather than how its' implemented.

https://www.pulumi.com/docs/intro/concepts/state/#checkpoints seems to be the official docs that give some vocabulary to what Pulumi does here ("record a checkpoint").

Perhaps:

PULUMI_EXPERIMENTAL=1
PULUMI_RECORD_CHECKPOINTS=once 
# or
PULUMI_RECORD_CHECKPOINTS=final-only

This opens up adding some more options down the road.

PULUMI_RECORD_CHECKPOINTS=every:30s
PULUMI_RECORD_CHECKPOINTS=off # no state saving at all

@iwahbe
Copy link
Member

iwahbe commented Sep 15, 2022

I like the skip based naming options:

--skip-saving-intermediate-state
--skip-saving-intermediate-checkpoints

I think they better communicate that we will stop doing something we should otherwise be doing.

We should definitely gate these behind PULUMI_EXPERIMENTAL. We might want to add an unsafe to the beginning:

--unsafe-skip-saving-intermediate-state
--unsafe-skip-saving-intermediate-checkpoints

@bors bors bot closed this as completed in 0f3e536 Sep 15, 2022
@pulumi-bot pulumi-bot added the resolution/fixed This issue was fixed label Sep 15, 2022
@dixler
Copy link
Contributor

dixler commented Sep 15, 2022

Reopening since we would like to better name the environment variable.

@dixler dixler reopened this Sep 15, 2022
@dixler dixler removed the resolution/fixed This issue was fixed label Sep 15, 2022
@dixler dixler added the resolution/fixed This issue was fixed label Sep 16, 2022
@dixler dixler closed this as completed Sep 16, 2022
@t0yv0
Copy link
Member Author

t0yv0 commented Sep 22, 2022

This shipped as an env-var based flag:

PULUMI_EXPERIMENTAL=true PULUMI_SKIP_CHECKPOINTS=true.

Looking forward to feedback.

@lukehoban lukehoban changed the title Consider adding a --save-final-state-only flag to speed up large stacks Consider adding a --save-final-state-only flag (or PULUMI_SKIP_CHECKPOINTS env var) to speed up large stacks May 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/backends State storage (filestate/httpstate/etc.) impact/performance Something is slower than expected impact/reliability Something that feels unreliable or flaky kind/enhancement Improvements or new features resolution/fixed This issue was fixed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants