Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: gossip encryption key rotation #336

Merged
merged 80 commits into from
Nov 20, 2014
Merged

feature: gossip encryption key rotation #336

merged 80 commits into from
Nov 20, 2014

Conversation

ryanuber
Copy link
Member

Since Serf supports symmetric key rotation for gossip messages, Consul can leverage this to allow key rotation for both the LAN and WAN gossip pools.

Because of the multiple gossip pools and a few other semantic differences between Serf and Consul, there are a few things to note about this implementation:

  • The command is consul keyring to avoid confusion with TLS keys or KV keys.
  • For operational simplicity, both LAN and WAN keyrings are modified / queried in the same command run for all keyring-related commands.
  • Only server nodes are allowed to query or modify any keyrings. This avoids ambiguity in running consul keyring on a server vs. a client.
  • Both LAN and WAN keyrings are persisted separately in consul's data-dir.
  • At agent start, if keyring files exist in consul's data directory, they will be automatically loaded and updated if keys change.
  • The -encrypt argument still works with unmodified behavior

Some examples of the feature in action in a cluster with one server and one client are in this gist.

I've still got a few more tests and docs to write, but would like feedback on the overall flow of how this is working in its current form.

Related to #246

@armon
Copy link
Member

armon commented Sep 15, 2014

Looks awesome! Thanks for tackling this! The workflow LGTM, let me know when you feel it's complete and I'll give it a one-over.

@ryanuber
Copy link
Member Author

I'm still hacking around on this. I realized while testing in multi-DC setups that the implementation was a bit flawed, and should probably more closely resemble the UserEvent RPC methods so that forwarding key events to remote datacenters is possible. I'll take a crack at that and update here.

@armon
Copy link
Member

armon commented Sep 22, 2014

@ryanuber Oh since using the event on the WAN won't rotate the remote LAN's?

@ryanuber
Copy link
Member Author

@armon yeah exactly. Initially I was thinking we would just have new keys gossiped to all of the nodes and a collective response returned, which seems desirable from the operator's perspective. The alternative is to have like a -datacenter flag and a -wan flag, and require the operator to run the key rotation for each datacenter plus the wan keyring. I was going to give the former a shot first to see how feasible that ends up being. Let me know if there is a better way of doing this that I'm not thinking of.

@armon
Copy link
Member

armon commented Sep 23, 2014

Yeah I like your initial approach. You are right, it just is a lot nicer to have the gossip handle it. I think everything you've proposed will work fine.

@ryanuber
Copy link
Member Author

ryanuber commented Oct 1, 2014

Sorry I've been parking on this for so long, its been busy! The functionality feels complete at this point, docs and tests are in here too. When you have a little time give it a once-over and let me know of any suggestions!

Here's a short demo of the feature in a 3-datacenter cluster, each with 1 server and 1 client:

// server in each datacenter. This will only error for RPC-related errors.
// Otherwise, application-level errors are returned inside of the inner
// response objects.
func (s *Server) keyringRPC(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method is more generally useful from key rings. I like having a helper to just "broadcast" the RPC to all the known DC's. Maybe we can make it like a globalRPC instead of keyringRPC?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think we should move the RPC methods to consul/rpc.go

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a pass at generalizing this. I ended up using an interface for the RPC replies, structs.CompoundResponse, which has an Add(interface{}) method for appending responses. The reason for this is because we are expecting one response per DC, and we will just never know what type of response struct to expect back. Doing a type assertion for it inside of globalRPC felt ghetto.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense. Another option is that the response could be a pointer to a slice and we just append responses to it, but this I think is cleaner.

@armon
Copy link
Member

armon commented Oct 1, 2014

Awesome. Just gave it a one over and it looks really good. I think there is a bit of reshuffling of methods but it looks super solid. I think the only broader feedback is that lets enable keyring by default with encryption is enabled. This way the user experience is even nicer, "it just works".

@ryanuber
Copy link
Member Author

ryanuber commented Oct 2, 2014

Thanks, @armon! Always appreciate your code reviews. I left a few comments inline above regarding the generalized globalRPC method. Let me know if it makes sense, and I'll address the other suggestions tomorrow.

@armon
Copy link
Member

armon commented Oct 2, 2014

This all makes sense! Looking good. I noticed that globalRPC guards against the single DC case, but I wonder if what we do is just have global DC also just invoke for the local DC. This way the caller of globalDC doesn't need the split logic of calling the local DC + remote DC, it just calls global and assumes all the DCs (including local) were hit.

@ryanuber
Copy link
Member Author

ryanuber commented Oct 3, 2014

For the -init option, I agree whole-heartedly that it feels clunky. What I was trying to solve with it is avoiding putting the initial encryption key inside of a config file, or having to start consul with different options to get the keyring initialized. It isn't clear what should happen if the encryption key passed in the agent options does not exist on the keyring. If we install it automatically, does it replace the existing keyring (if any), and get configured as the primary key? Does it get auto-broadcasted to the cluster? It seems like it could be difficult to remove that initial key from your config.

I had an idea last night to allow passing the path to a file on agent start. If the file indicated exists on agent startup, we just consume a key out of it to seed the keyring, and delete the file. Otherwise the agent can just start normally if the file is missing. That would allow us to always start consul in the same way without having to put a key in our config or pass it on the command line, and disambiguates how you go about initializing the keyring.

I imagine we could also just have the seed file be generated with consul keygen > /tmp/seed.key.

What do you think?

@ryanuber
Copy link
Member Author

ryanuber commented Oct 3, 2014

Actually, thinking about it a little more, this probably isn't any better than just shipping the keyring file directly to its final spot. I'll think more about this.

@armon
Copy link
Member

armon commented Oct 4, 2014

Yeah this is a hard one... I think as a user, you want to be able to specify -encrypt as the seed value, and then just use the keyring later. What I think at first blush:

  1. If there is no keyring, add the -encrypt key and set as primary
  2. If there is a keyring, add the -encrypt key but as a non-primary

Thoughts?

@ryanuber
Copy link
Member Author

ryanuber commented Oct 4, 2014

Looks like you read my mind - that's exactly what I just pushed. We append but don't change primary if -encrypt is passed but a keyring exists. This is probably the best we can do. I'll wrap this up and post back when I'm done.

@ryanuber
Copy link
Member Author

ryanuber commented Oct 9, 2014

@armon I think this is better now! There are just a couple of things I want to run by you.

  1. Anyone currently using the -encrypt flag, and rebooting their clusters to change keys will have broken workflows after their next upgrade. Starting with the -encrypt flag will work as they would expect, but changing the -encrypt value and restarting the agent would not have the same effect as before.
  2. I noticed that the WAN keyring is the slowest to respond to keyring queries. I can get a fast response from all LAN pools (~10-15ms total on 10-DC local cluster), but the WAN pool, even on a completely local cluster with multiple DC's, can take up to ~400ms on its own. I though this might have something to do with the timing values being different for the WAN pool, but thought I'd run it by you to see what you think.
  3. globalRPC simply forwards to all nodes, including the local DC. I used forwardDC for this since forward skips the local DC and also checks if we can tolerate stale reads and does leader forwarding, which doesn't seem applicable to the globalRPC method.

@ryanuber
Copy link
Member Author

ryanuber commented Oct 9, 2014

I might have just answered my own question. I think 2. is just because the WAN pool in this scenario is 5x larger than the local little LAN pools and involves remote members. Probably a non-issue.

@armon
Copy link
Member

armon commented Oct 9, 2014

Awesome! I'll give it another look over soon. I just discussed 1 with @mitchellh, and he had a good idea. Lets only use the -encrypt flag if there is no keyring to install the primary key. Once the key ring is in place, we should ignore the flag, and WARN the user that we are ignoring because of the keyring. This way a user can delete the keyring file if they want to reset it, but it removes this complex user interaction.

With 2, it just is the WAN timing is much slower. Gossip just takes longer, so I think that is fine.

With 3, that sounds great. Makes sense.

@ryanuber
Copy link
Member Author

ryanuber commented Oct 9, 2014

Sounds good! I made some adjustments to get this working - I'll fix the tests to accommodate tonight.

@ryanuber
Copy link
Member Author

Just rebased this thing against master. I'll do a sanity check on it as well, but tests are all passing so I think we should be OK.

@armon
Copy link
Member

armon commented Nov 20, 2014

Feel free to merge on in!

ryanuber added a commit that referenced this pull request Nov 20, 2014
feature: gossip encryption key rotation
@ryanuber ryanuber merged commit a3cd40c into hashicorp:master Nov 20, 2014
@ryanuber ryanuber deleted the f-keyring branch November 20, 2014 07:18
@pearkes
Copy link
Contributor

pearkes commented Nov 20, 2014

Awesome, this thing is a beast!

duckhan pushed a commit to duckhan/consul that referenced this pull request Oct 24, 2021
duckhan pushed a commit to duckhan/consul that referenced this pull request Oct 24, 2021
- defaults to true
- replaces ENABLE_WEBHOOKS environment variable
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants