
Support locking on filestate logins #2697

Merged
merged 10 commits into pulumi:master from cloud-locks on Mar 16, 2021

Conversation

@bigkraig (Contributor) commented May 2, 2019

When using the filestate backend (local files and cloud buckets) there is no protection to prevent two processes from managing the same stack simultaneously.

This PR creates a locks directory in the management directory that stores lock files for a stack. Each backend implementation gets its own UUID that is joined with the stack name. The feature is currently available behind the PULUMI_SELF_MANAGED_STATE_LOCKING=1 environment variable flag.

@Place1 This follows the distributed locking idea you had in #2455.
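
For illustration, a minimal sketch of how such a lock key could be composed from the stack name and the per-backend UUID described above; the function and the exact layout are assumptions for illustration, not the PR's actual code:

    package main

    import (
        "fmt"
        "path"
    )

    // lockPath joins the stack name and a per-backend ID under the locks
    // directory inside the management directory, producing keys such as
    // .pulumi/locks/dev.56c29f8e.json (names here are illustrative).
    func lockPath(stackName, backendID string) string {
        return path.Join(".pulumi", "locks", fmt.Sprintf("%s.%s.json", stackName, backendID))
    }

    func main() {
        fmt.Println(lockPath("dev", "56c29f8e")) // .pulumi/locks/dev.56c29f8e.json
    }

Keeping the backend's UUID in the file name is presumably what lets the error message shown later in this thread list each lock and who holds it.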

@bigkraig changed the title from "WIP: Support locking on filestate logins" to "Support locking on filestate logins" on May 2, 2019
@bigkraig mentioned this pull request on May 2, 2019
@ellismg (Contributor) commented May 6, 2019

I wonder if we should try to support pulumi cancel here somehow. I assume that in cases where pulumi crashes and the unlock defers are not run, you'll end up in a state where future locks fail and you'll have to connect to the bucket directly to delete the lock?

@bigkraig (Contributor, Author) commented May 7, 2019

@ellismg I wasn't aware of that command, but this definitely seems like something important to support.

Does a pulumi cancel merely erase the lock or does it also send a message back to the Pulumi process holding the lock telling it to abort its work in progress? If it is just a lock clean up I think I can knock that out pretty quickly.

@bigkraig (Contributor, Author) commented May 7, 2019

FWIW I read the code; it's just not clear what happens once the cancel command has been sent to the API.

Perhaps, in the interest of getting this in sooner rather than later, I implement cancel as a lock-removal step for filestate backends, and a more intelligent notification/cancellation system is implemented in a separate PR?

@chrsmith (Contributor) commented May 7, 2019

Does a pulumi cancel merely erase the lock or does it also send a message back to the Pulumi process holding the lock telling it to abort its work in progress?

I might be forgetting something super-important here, but these are the two main side effects of running pulumi cancel from the Pulumi Service:

  • The process performing the update will terminate as soon as it sees that the update has been canceled. This isn't because the Pulumi Service directly sends a message to the CLI; rather, the next API call made as part of that update will fail (returning 400 or 409 or something like that).
  • The canceled update is reported as having failed/canceled. I don't believe the filestate backend has a notion of stack history, so perhaps this is a moot point.

Also note that a pulumi update will "time out" after 10 minutes if the Pulumi Service hasn't received any messages from the CLI. So if an update is started but pulumi crashes, you don't wind up in a state where the lock is held indefinitely. (Which is one of the scenarios Matt pointed out.)

I would strongly suggest that we ensure pulumi cancel works correctly for the filestate backend. The reason is that, beyond ending up in a state "where future locks fail and you'll have to connect to the bucket directly to delete the lock", you also run the risk of not being able to safely stop that other update using the file state.

For example, a naive solution would be to just record the PID of the process that takes the lock, and then have pulumi cancel kill that process. But doing so could corrupt your checkpoint if it is in the middle of writing its state or has just started creating a cloud resource.

I don't have a super-elegant approach off the top of my head. But adding a hook to check the current status of that lock file before/after updating the checkpoint file in the SnapshotUpdateManager (something like that) seems like a good starting point. That would allow anybody who looks at that lock file to get a sense of its current status, e.g. "update still in progress, last checkpoint write at timestamp X" or "update cancellation requested, please terminate safely".
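
A minimal sketch of what such a pre-checkpoint hook could look like, assuming a hypothetical lock-status type read back from the lock file; none of these names come from the PR:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // lockStatus is a hypothetical view of the lock file's contents.
    type lockStatus struct {
        State          string    // e.g. "in-progress" or "cancel-requested"
        LastCheckpoint time.Time // last checkpoint write recorded in the lock
    }

    // checkLockBeforeCheckpoint re-reads the lock and refuses to write the next
    // checkpoint if cancellation was requested, so the update can stop safely
    // instead of being killed mid-write.
    func checkLockBeforeCheckpoint(ctx context.Context, readLock func(context.Context) (lockStatus, error)) error {
        st, err := readLock(ctx)
        if err != nil {
            return fmt.Errorf("reading lock file: %w", err)
        }
        if st.State == "cancel-requested" {
            return errors.New("update cancellation requested, please terminate safely")
        }
        return nil
    }

    func main() {
        // Stub reader standing in for an actual read of the lock object.
        read := func(context.Context) (lockStatus, error) {
            return lockStatus{State: "cancel-requested", LastCheckpoint: time.Now()}, nil
        }
        if err := checkLockBeforeCheckpoint(context.Background(), read); err != nil {
            fmt.Println(err)
        }
    }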

@bigkraig (Contributor, Author) commented May 8, 2019

I agree, @chrsmith. What probably makes the most sense is to update the lock with some information telling the process to abort, and then delete the lock. Meanwhile, the snapshot process should check the lock for existence and for an abort message. I'm suggesting the "abort message" to get around consistency (CAP) issues with cloud providers, mainly S3.
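
A rough sketch of that cancel-side flow, reusing the gocloud.dev blob API the backend already uses for the lock files; the helper name and the marker's JSON shape are assumptions, not the PR's code:

    package main

    import (
        "context"
        "fmt"

        "gocloud.dev/blob"
        _ "gocloud.dev/blob/fileblob" // file:// driver, handy for local testing
    )

    // cancelStack writes an abort marker into the lock object so an in-flight
    // snapshot write can observe it, then deletes the lock; the snapshot side
    // would treat either "lock missing" or "abort marker present" as a signal
    // to stop, i.e. the two checks described above.
    func cancelStack(ctx context.Context, bucket *blob.Bucket, lockKey string) error {
        marker := []byte(`{"state":"cancel-requested"}`) // assumed lock-file shape
        if err := bucket.WriteAll(ctx, lockKey, marker, nil); err != nil {
            return fmt.Errorf("writing abort marker: %w", err)
        }
        return bucket.Delete(ctx, lockKey)
    }

    func main() {
        ctx := context.Background()
        bucket, err := blob.OpenBucket(ctx, "file:///tmp/pulumi-state")
        if err != nil {
            panic(err)
        }
        defer bucket.Close()
        fmt.Println(cancelStack(ctx, bucket, ".pulumi/locks/dev.56c29f8e.json"))
    }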

Some quick looks into the snapshot process makes me think that it's going to take me a bit of time to get familiar with that code and implement something like that in a concise way.

In the interest of getting this merged sooner rather than later, can we separate some of the cancel support from this MR? What could a minimum cancel do at this point?

@ellismg (Contributor) commented May 11, 2019

In the interest of getting this merged sooner rather than later, can we separate some of the cancel support from this MR? What could a minimum cancel do at this point?

@bigkraig I think to start we just need to document what happens when you end up in this state. Can the error message that you get when you try to start an update (and another is already in flight) at least tell you what file you need to delete, if you are sure the other update has completed?

@bigkraig (Contributor, Author) commented May 11, 2019

@ellismg It looks like this now:

error: the stack is current locked by 1 lock(s). Either wait for the other processes to end or manually delete the lock file(s).
  s3://<BUCKET>/.pulumi/locks/<STACK REF>.56c29f8e.json: created by kamador@Kraigs-MacBook-Pro.local (pid 44214)
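
Judging from that output, the lock file records at least who created it and the process ID; a guessed Go representation of its contents, with field names that are assumptions rather than taken from the PR:

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // lockContent is a guess at the fields behind the message above; the real
    // lock file may carry more (or differently named) fields.
    type lockContent struct {
        Pid       int       `json:"pid"`
        Username  string    `json:"username"`
        Timestamp time.Time `json:"timestamp"`
    }

    func main() {
        b, _ := json.MarshalIndent(lockContent{
            Pid:       44214,
            Username:  "kamador@Kraigs-MacBook-Pro.local",
            Timestamp: time.Now(),
        }, "", "  ")
        fmt.Println(string(b))
    }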

@piclemx commented May 21, 2019

Any news on this MR? 😄 @ellismg @bigkraig

@bigkraig force-pushed the cloud-locks branch 2 times, most recently from 229b997 to 795b001 on May 21, 2019
@bigkraig (Contributor, Author) commented Jun 1, 2019

@ellismg friendly ping

err := b.bucket.Delete(context.TODO(), b.lockPath(stackRef.Name()))
if err != nil {
    logging.Errorf("there was a problem deleting the lock at %v: %v",
        filepath.Join(b.url, b.lockPath(stackRef.Name())), err)
}
Review comment (Contributor):
Should this not return an error as well? Basically, at this point, things are in a very bad state, and it looks like manual cleanup is necessary. We may even have to explicitly state that that's the case.

if err != nil {
    return nil, err
}
defer b.Unlock(stackRef)
Review comment (Contributor):
I'm personally not a fan of this pattern where each method needs to be prefixed with this code. It feels brittle to me since it would be easy to forget. I would prefer we do something similar to how we created an intercepting Bucket wrapper around go-cloud's abstraction. In other words, I'd like to see something akin to:

type lockeableBackend struct {
    localBackend *localBackend
}

// all the methods will then look like this:
func (b *lockeableBackend) CreateStack(ctx context.Context, stackRef backend.StackReference, opts interface{}) (backend.Stack, error) {
    err := b.Lock(stackRef)
    if err != nil {
        return nil, err
    }
    defer b.Unlock(stackRef)
    return b.localBackend.CreateStack(ctx, stackRef, opts)
}

This means that we can easily see the pattern here and ensure that every operation that needs to be is appropriately locked. It also keeps the actual localBackend code blissfully unaware of locking, and it means we don't have to constantly be auditing it to make sure some new member does the right thing.

With this new pattern, if we needed to add a new localBackend member and we needed to expose it to someone else, we would have to go through this interface and we would have to ask ourselves "what's the locking rule here". That's a big plus in my mind.

Review reply (Contributor, Author):
I'm good with this. The defer b.Unlock is incompatible with returning an error from Unlock as mentioned earlier. Can we forego the error return there in favor of this pattern?

@CyrusNajmabadi (Contributor) commented
You accidentally added a large binary file to the review. Can you rewrite so it doesn't become part of history? Thanks!

@bigkraig force-pushed the cloud-locks branch 2 times, most recently from 47a9985 to 44351f1 on June 12, 2019
@bigkraig force-pushed the cloud-locks branch 3 times, most recently from 4ee609b to 1e48dd0 on June 14, 2019
@techmouse84 commented
Any idea when this would be released?

@lukehoban (Member) commented
@bigkraig I've opened #6437, which builds on the work in this PR but updates it to work with the latest Pulumi, fixes some implementation issues, adds initial basic test coverage, and moves the locking support behind an environment variable initially so that it can be rolled out safely. I've created this in a branch in the pulumi org so that we can run tests against it. If it looks good to you, feel free to merge it into your branch here as well.

github-actions bot commented Mar 1, 2021

PR is now waiting for a maintainer to run the acceptance tests.

Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR

@yamamoto-ryo-001 commented
Will this be released soon? This is a big blocker for us transitioning from Terraform.

@lukehoban (Member) commented
/run-acceptance-tests

github-actions bot commented Mar 9, 2021

Please view the results of the PR Build + Acceptance Tests Run Here

@leezen (Contributor) left a comment

Seems reasonable, but I wonder if we could actually make the lock prefixes work in a more granular way?

pkg/backend/filestate/lock.go (review thread outdated; resolved)
github-actions bot commented
PR is now waiting for a maintainer to run the acceptance tests.

Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR

@lukehoban (Member) commented
/run-acceptance-tests

github-actions bot commented
Please view the results of the PR Build + Acceptance Tests Run Here

@lukehoban merged commit 71ec66a into pulumi:master on Mar 16, 2021
@lukehoban (Member) commented
Thanks @bigkraig for your work (and patience!) on this PR! 🎉 🚀

@renannprado commented
@lukehoban Should this flag be documented here?

https://www.pulumi.com/docs/reference/cli/environment-variables/

@Freddo1 commented Nov 17, 2023

@lukehoban Should this flag be documented here?

https://www.pulumi.com/docs/reference/cli/environment-variables/

Pinging this for anyone still following. This wasn't the most pleasant way of finding out that we can have state locking.

@ringods (Member) commented Nov 21, 2023

@Freddo1 state locking for self-managed (aka filestate) backends is active by default. This was merged via #8565 almost 2 years ago. If you look at the changes, the environment variable was removed from the sources:

https://github.com/pulumi/pulumi/pull/8565/files#diff-4c8ef1e204e75d051f27907bd7c529fe5605a983c9f14b6bc041b9c6667dbe11

If you bumped into a problem, please raise a new issue.

@Freddo1 commented Nov 29, 2023

@ringods My bad, and thank you for clarifying.
