
Support locking on filestate logins #2697

Merged
merged 10 commits into pulumi:master from cloud-locks on Mar 16, 2021

Conversation

@bigkraig (Contributor) commented May 2, 2019

When using the filestate backend (local files and cloud buckets) there is no protection to prevent two processes from managing the same stack simultaneously.

This PR creates a locks directory in the management directory that stores lock files for a stack. Each backend implementation gets its own UUID that is joined with the stack name. The feature is currently available behind the PULUMI_SELF_MANAGED_STATE_LOCKING=1 environment variable flag.

@Place1 This follows the distributed locking idea you had in #2455.
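
For illustration, a minimal sketch of how such a lock key could be composed from the stack name and the per-backend UUID described above; the function and the exact layout are assumptions for illustration, not the PR's actual code:

    package main

    import (
        "fmt"
        "path"
    )

    // lockPath joins the stack name and a per-backend ID under the locks
    // directory inside the management directory, producing keys such as
    // .pulumi/locks/dev.56c29f8e.json (names here are illustrative).
    func lockPath(stackName, backendID string) string {
        return path.Join(".pulumi", "locks", fmt.Sprintf("%s.%s.json", stackName, backendID))
    }

    func main() {
        fmt.Println(lockPath("dev", "56c29f8e")) // .pulumi/locks/dev.56c29f8e.json
    }

Keeping the backend's UUID in the file name is presumably what lets the error message shown later in this thread list each lock and who holds it.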

@bigkraig changed the title from "WIP: Support locking on filestate logins" to "Support locking on filestate logins" on May 2, 2019
@bigkraig mentioned this pull request on May 2, 2019
@ellismg (Contributor) commented May 6, 2019

I wonder if we should try to support pulumi cancel here somehow. I assume that in cases where pulumi crashes and the unlock defers are not run, you'll end up in a state where future locks fail and you'll have to connect to the bucket directly to delete the lock?

@bigkraig (Contributor, Author) commented May 7, 2019

@ellismg I wasn't aware of that command, but this definitely seems like something important to support.

Does a pulumi cancel merely erase the lock or does it also send a message back to the Pulumi process holding the lock telling it to abort its work in progress? If it is just a lock clean up I think I can knock that out pretty quickly.

@bigkraig (Contributor, Author) commented May 7, 2019

FWIW I read the code; it's just not clear what happens once the cancel command has been sent to the API.

Perhaps, in the interest of getting this in sooner rather than later, I implement cancel as a lock-removal step for filestate backends, and a more intelligent notification/cancellation system is implemented in a separate PR?

@chrsmith (Contributor) commented May 7, 2019

Does a pulumi cancel merely erase the lock or does it also send a message back to the Pulumi process holding the lock telling it to abort its work in progress?

I might be forgetting something super-important here, but these are the two main side effects of running pulumi cancel from the Pulumi Service:

  • The process performing the update will terminate as soon as it sees that the update has been canceled. This isn't because the Pulumi Service directly sends a message to the CLI; rather, the next API call made as part of that update will fail (returning 400 or 409 or something like that).
  • The canceled update is reported as having failed/canceled. I don't believe the filestate backend has a notion of stack history, so perhaps this is a moot point.

Also note that a pulumi update will "time out" after 10 minutes if the Pulumi Service hasn't received any messages from the CLI. So if an update is started but pulumi crashes, you don't wind up in a state where the lock is held indefinitely. (Which is one of the scenarios Matt pointed out.)

I would strongly suggest that we ensure pulumi cancel works correctly for the filestate backend. The reason is that, beyond ending up in a state "where future locks fail and you'll have to connect to the bucket directly to delete the lock", you also run the risk of not being able to safely stop that other update using the file state.

For example, a naive solution would be to just record the PID of the process that takes the lock, and then have pulumi cancel kill that process. But doing so could corrupt your checkpoint if it is in the middle of writing its state or has just started creating a cloud resource.

I don't have a super-elegant approach off the top of my head. But adding a hook to check the current status of that lock file before/after updating the checkpoint file in the SnapshotUpdateManager (something like that) seems like a good starting point. That would allow anybody who looks at that lock file to get a sense of its current status, e.g. "update still in progress, last checkpoint write at timestamp X" or "update cancellation requested, please terminate safely".
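
A minimal sketch of what such a pre-checkpoint hook could look like, assuming a hypothetical lock-status type read back from the lock file; none of these names come from the PR:

    package main

    import (
        "context"
        "errors"
        "fmt"
        "time"
    )

    // lockStatus is a hypothetical view of the lock file's contents.
    type lockStatus struct {
        State          string    // e.g. "in-progress" or "cancel-requested"
        LastCheckpoint time.Time // last checkpoint write recorded in the lock
    }

    // checkLockBeforeCheckpoint re-reads the lock and refuses to write the next
    // checkpoint if cancellation was requested, so the update can stop safely
    // instead of being killed mid-write.
    func checkLockBeforeCheckpoint(ctx context.Context, readLock func(context.Context) (lockStatus, error)) error {
        st, err := readLock(ctx)
        if err != nil {
            return fmt.Errorf("reading lock file: %w", err)
        }
        if st.State == "cancel-requested" {
            return errors.New("update cancellation requested, please terminate safely")
        }
        return nil
    }

    func main() {
        // Stub reader standing in for an actual read of the lock object.
        read := func(context.Context) (lockStatus, error) {
            return lockStatus{State: "cancel-requested", LastCheckpoint: time.Now()}, nil
        }
        if err := checkLockBeforeCheckpoint(context.Background(), read); err != nil {
            fmt.Println(err)
        }
    }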

@bigkraig (Contributor, Author) commented May 8, 2019

I agree, @chrsmith. What probably makes the most sense is to update the lock with some information telling the process to abort, and then delete the lock. Meanwhile, the snapshot process should check the lock for existence and for an abort message. I'm suggesting the "abort message" to get around consistency (CAP) issues with cloud providers, mainly S3.
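
A rough sketch of that cancel-side flow, reusing the gocloud.dev blob API the backend already uses for the lock files; the helper name and the marker's JSON shape are assumptions, not the PR's code:

    package main

    import (
        "context"
        "fmt"

        "gocloud.dev/blob"
        _ "gocloud.dev/blob/fileblob" // file:// driver, handy for local testing
    )

    // cancelStack writes an abort marker into the lock object so an in-flight
    // snapshot write can observe it, then deletes the lock; the snapshot side
    // would treat either "lock missing" or "abort marker present" as a signal
    // to stop, i.e. the two checks described above.
    func cancelStack(ctx context.Context, bucket *blob.Bucket, lockKey string) error {
        marker := []byte(`{"state":"cancel-requested"}`) // assumed lock-file shape
        if err := bucket.WriteAll(ctx, lockKey, marker, nil); err != nil {
            return fmt.Errorf("writing abort marker: %w", err)
        }
        return bucket.Delete(ctx, lockKey)
    }

    func main() {
        ctx := context.Background()
        bucket, err := blob.OpenBucket(ctx, "file:///tmp/pulumi-state")
        if err != nil {
            panic(err)
        }
        defer bucket.Close()
        fmt.Println(cancelStack(ctx, bucket, ".pulumi/locks/dev.56c29f8e.json"))
    }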

Some quick looks into the snapshot process makes me think that it's going to take me a bit of time to get familiar with that code and implement something like that in a concise way.

In the interest of getting this merged sooner rather than later, can we separate some of the cancel support from this MR? What could a minimum cancel do at this point?

@ellismg (Contributor) commented May 11, 2019

In the interest of getting this merged sooner rather than later, can we separate some of the cancel support from this MR? What could a minimum cancel do at this point?

@bigkraig I think to start we just need to document what happens when you end up in this state. Can the error message that you get when you try to start an update (and another is already in flight) at least tell you what file you need to delete, if you are sure the other update has completed?

@bigkraig (Contributor, Author) commented May 11, 2019

@ellismg It looks like this now:

error: the stack is current locked by 1 lock(s). Either wait for the other processes to end or manually delete the lock file(s).
  s3://<BUCKET>/.pulumi/locks/<STACK REF>.56c29f8e.json: created by kamador@Kraigs-MacBook-Pro.local (pid 44214)
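
Judging from that output, the lock file records at least who created it and the process ID; a guessed Go representation of its contents, with field names that are assumptions rather than taken from the PR:

    package main

    import (
        "encoding/json"
        "fmt"
        "time"
    )

    // lockContent is a guess at the fields behind the message above; the real
    // lock file may carry more (or differently named) fields.
    type lockContent struct {
        Pid       int       `json:"pid"`
        Username  string    `json:"username"`
        Timestamp time.Time `json:"timestamp"`
    }

    func main() {
        b, _ := json.MarshalIndent(lockContent{
            Pid:       44214,
            Username:  "kamador@Kraigs-MacBook-Pro.local",
            Timestamp: time.Now(),
        }, "", "  ")
        fmt.Println(string(b))
    }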

@piclemx commented May 21, 2019

Any news on this MR? 😄 @ellismg @bigkraig

@bigkraig force-pushed the cloud-locks branch 2 times, most recently from 229b997 to 795b001 on May 21, 2019
@bigkraig (Contributor, Author) commented Jun 1, 2019

@ellismg friendly ping

err := b.bucket.Delete(context.TODO(), b.lockPath(stackRef.Name()))
if err != nil {
    logging.Errorf("there was a problem deleting the lock at %v: %v",
        filepath.Join(b.url, b.lockPath(stackRef.Name())), err)
}
Review comment (Contributor):
Should this not return an error as well? Basically, at this point, things are in a very bad state, and it looks like manual cleanup is necessary. We may even have to explicitly state that that's the case.

if err != nil {
    return nil, err
}
defer b.Unlock(stackRef)
Review comment (Contributor):
I'm personally not a fan of this pattern where each method needs to be prefixed with this code. It feels brittle to me since it would be easy to forget. I would prefer we do something similar to how we created an intercepting Bucket wrapper around go-cloud's abstraction. In other words, I'd like to see something akin to:

type lockeableBackend struct {
    localBackend *localBackend
}

// all the methods will then look like this:
func (b *lockeableBackend) CreateStack(ctx context.Context, stackRef backend.StackReference, opts interface{}) (backend.Stack, error) {
    err := b.Lock(stackRef)
    if err != nil {
        return nil, err
    }
    defer b.Unlock(stackRef)
    return b.localBackend.CreateStack(ctx, stackRef, opts)
}

This means that we can easily see the pattern here and ensure that every operation that needs to be is appropriately locked. It also keeps the actual localBackend code blissfully unaware of locking, and it means we don't have to constantly be auditing it to make sure some new member does the right thing.

With this new pattern, if we needed to add a new localBackend member and we needed to expose it to someone else, we would have to go through this interface and we would have to ask ourselves "what's the locking rule here". That's a big plus in my mind.

Review reply (Contributor, Author):
I'm good with this. The defer b.Unlock is incompatible with returning an error from Unlock as mentioned earlier. Can we forego the error return there in favor of this pattern?

@CyrusNajmabadi (Contributor) commented
You accidentally added a large binary file to the review. Can you rewrite so it doesn't become part of history? Thanks!

@bigkraig force-pushed the cloud-locks branch 2 times, most recently from 47a9985 to 44351f1 on June 12, 2019
@bigkraig force-pushed the cloud-locks branch 3 times, most recently from 4ee609b to 1e48dd0 on June 14, 2019
@techmouse84 commented
Any idea when this would be released?

@lukehoban (Member) commented
@bigkraig I've opened #6437, which builds on the work in this PR but updates it to work with the latest Pulumi, fixes some implementation issues, adds initial basic test coverage, and moves the locking support behind an environment variable initially so that it can be rolled out safely. I've created this in a branch in the pulumi org so that we can run tests against it. If it looks good to you, feel free to merge it into your branch here as well.

github-actions bot commented Mar 1, 2021

PR is now waiting for a maintainer to run the acceptance tests.

Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR

@yamamoto-ryo-001 commented
Will this be released soon? This is a big blocker for us transitioning from Terraform.

@lukehoban (Member) commented
/run-acceptance-tests

github-actions bot commented Mar 9, 2021

Please view the results of the PR Build + Acceptance Tests Run Here

@leezen (Contributor) left a comment

Seems reasonable, but I wonder if we could actually make the lock prefixes work in a more granular way?

pkg/backend/filestate/lock.go (review thread outdated; resolved)
github-actions bot commented
PR is now waiting for a maintainer to run the acceptance tests.

Note for the maintainer: To run the acceptance tests, please comment /run-acceptance-tests on the PR

@lukehoban (Member) commented
/run-acceptance-tests

github-actions bot commented
Please view the results of the PR Build + Acceptance Tests Run Here

@lukehoban merged commit 71ec66a into pulumi:master on Mar 16, 2021
@lukehoban (Member) commented
Thanks @bigkraig for your work (and patience!) on this PR! 🎉 🚀

@renannprado commented
@lukehoban Should this flag be documented here?

https://www.pulumi.com/docs/reference/cli/environment-variables/

@Freddo1 commented Nov 17, 2023

@lukehoban Should this flag be documented here?

https://www.pulumi.com/docs/reference/cli/environment-variables/

Pinging this for anyone still following. This wasn't the most pleasant way of finding out that we can have state locking.

@ringods (Member) commented Nov 21, 2023

@Freddo1 state locking for self-managed (aka filestate) backends is active by default. This was merged via #8565 almost 2 years ago. If you look at the changes, the environment variable was removed from the sources:

https://github.com/pulumi/pulumi/pull/8565/files#diff-4c8ef1e204e75d051f27907bd7c529fe5605a983c9f14b6bc041b9c6667dbe11

If you bumped into a problem, please raise a new issue.

@Freddo1 commented Nov 29, 2023

@ringods My bad, and thank you for clarifying.
