Attempting to unseal Vault immediately after initializing fails with error 400 in HA mode #491

Closed
ghost opened this issue Aug 4, 2015 · 6 comments


ghost commented Aug 4, 2015

We have an automated Vault setup running in a testing environment, where we have to auto-unseal in order to test our infrastructure automatically. We're using Vault with three Consul servers to verify that high availability works with the rest of our infrastructure.

Initializing Vault works fine, but as soon as we attempt to unseal, any request to Vault, even a simple query for the seal status, fails with HTTP 400. We found that adding a 10-second sleep after initializing Vault allows everything to proceed.

This seems a bit concerning. I'm assuming it's due to some kind of disparity between Vault and Consul. Any further advice would be extremely helpful.
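As a rough illustration of the workaround, the fixed 10-second sleep could be replaced by polling until Vault answers. This is only a sketch under assumptions: `wait_until_ready` and `seal_status` are hypothetical helper names, the address and timeouts are placeholders, and the probe assumes the seal status is served at `/v1/sys/seal-status`.

```python
import json
import time
import urllib.error
import urllib.request


def wait_until_ready(probe, timeout=30.0, interval=1.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it stops raising, or give up after `timeout` seconds.

    Returns whatever probe() returns. `sleep` and `clock` are injectable
    so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        try:
            return probe()
        except OSError as err:  # URLError and HTTPError are OSError subclasses
            if clock() + interval > deadline:
                raise TimeoutError(f"Vault not ready after {timeout}s") from err
            sleep(interval)


def seal_status(addr="http://127.0.0.1:8200"):
    """Hypothetical probe: fetch the seal status as a dict.

    urlopen raises HTTPError on a 400 response, which the loop above
    treats as "not ready yet".
    """
    with urllib.request.urlopen(f"{addr}/v1/sys/seal-status", timeout=5) as resp:
        return json.load(resp)
```

In a provisioning script this would be called as `wait_until_ready(seal_status)` between the init and unseal steps. Note that it retries on any HTTP error, since a 400 from a not-yet-ready Vault cannot be told apart from a genuinely bad request.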

Member

armon commented Aug 4, 2015

@lyptt In an HA setup, only the active Vault will respond to queries. The standby Vaults redirect to the active instance. It takes a few seconds for Vault to elect a leader when there is none, so in that period when a request comes in there is no active instance to forward to and you get an error. Does that help clarify?

Author

ghost commented Aug 4, 2015

We only have one Vault instance and three Consul servers, so I'm not sure this applies to us?

Member

armon commented Aug 4, 2015

@lyptt It does. The Consul backend is HA-enabled, and Vault cannot know there are no other peer servers. There could be 3 Vaults as far as each particular instance knows, so it is only safe to proceed once the lock is acquired. We could add a flag to force-disable HA mode so that you can indicate there are no other peers to worry about, but the safest thing to do is to acquire the lock, which is what we do now.

Author

ghost commented Aug 4, 2015

In that case shouldn't Vault be either queueing the request and deferring until the lock is acquired, or at least returning a response that scripts or end users can use to usefully determine this is the case?

Simply returning 400 Bad Request isn't particularly helpful. It makes it seem like the end user has done something incorrect, when in reality Vault is simply not ready yet.

Member

armon commented Aug 4, 2015

@lyptt Probably, but it is a 0.2 product and it takes time to mature. I actually think it is best to push the error back to the client quickly instead of trying to queue the request; otherwise, to a client, it is indistinguishable from an incredibly slow request. It should probably be a 503 or equivalent error code.

Author

ghost commented Aug 4, 2015

Yes, absolutely. If the error could be changed, then at least we can handle it better. Having to wait isn't too much of an issue as long as we can tell when we have to.
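To make the point concrete, client-side handling could look something like the sketch below if the status were changed to 503 as suggested above. This assumes that change; it is not how Vault behaves at the time of this thread, and `Action` and `classify` are hypothetical names used only for illustration.

```python
from enum import Enum


class Action(Enum):
    OK = "ok"
    RETRY = "retry"
    FAIL = "fail"


def classify(status):
    """Map an HTTP status code to what a provisioning script should do next."""
    if 200 <= status < 300:
        return Action.OK
    if status == 503:
        # Assumed meaning: no active instance yet, so wait and try again.
        return Action.RETRY
    # Anything else, including 400, is treated as a real error in the request.
    return Action.FAIL
```

With the current 400 response, both the "not ready" case and a malformed request land in `Action.FAIL`, which is why a script cannot tell when waiting would help.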

ghost closed this as completed Jun 6, 2016