feature: change encryption keys #199

Merged
merged 65 commits into from Apr 18, 2014

Conversation

Projects
None yet
2 participants
@ryanuber
Member

ryanuber commented Apr 11, 2014

For a week or so I've been chipping away at this feature little by little. It uses the new memberlist keyring feature to allow Serf to handle encryption key changes in a running cluster. I feel like it's feature-complete, and ready to start gathering some suggestions on the code itself.

There is more conversation around the feature in general in #194.

The bottom line on what it does can be seen with a few command examples. All of the operations are idempotent, so you can just keep running them without negative consequences if they fail.

Install a new key

$ serf key -install E4SoXyZSJkcg37n42U2i2Q==
==> Installing key on all members...
==> Successfully installed key!

Change the key used to encrypt messages:

$ serf key -use E4SoXyZSJkcg37n42U2i2Q==
==> Changing primary key on all members...
==> Successfully changed primary key!

List keys in use on the cluster:

$ serf key -list
==> Asking all members for installed keys...
==> Keys gathered, listing cluster keys...

E4SoXyZSJkcg37n42U2i2Q==
9kHC3w5eswavy1iU56YG8g==

Remove a key:

$ serf key -remove 9kHC3w5eswavy1iU56YG8g==
==> Removing key on all members...
==> Successfully removed key!

Error conditions during key operations:

$ serf key -remove E4SoXyZSJkcg37n42U2i2Q==
==> Removing key on all members...
failed:  db9    Removing the primary key is not allowed
failed:  web15  Removing the primary key is not allowed
failed:  db8    Removing the primary key is not allowed
failed:  web21  Removing the primary key is not allowed
failed:  lb31   Removing the primary key is not allowed
failed:  lb3    Removing the primary key is not allowed

Error removing key: 6/102 nodes reported failure

I know this pull request is pretty massive, so if splitting it into smaller chunks is preferred, I can try doing that. Just let me know!

@armon

This comment has been minimized.

Show comment
Hide comment
@armon

armon Apr 11, 2014

Member

@ryanuber Wow! This is a huge PR! Awesome work! I will try to comb through this over the weekend.

With the -list mode, does it indicate if some nodes are missing certain keys? e.g. How does the operator verify all nodes have a given key before trying to -use that key?

Member

armon commented Apr 11, 2014

@ryanuber Wow! This is a huge PR! Awesome work! I will try to comb through this over the weekend.

With the -list mode, does it indicate if some nodes are missing certain keys? e.g. How does the operator verify all nodes have a given key before trying to -use that key?

@ryanuber

This comment has been minimized.

Show comment
Hide comment
@ryanuber

ryanuber Apr 11, 2014

Member

@armon the -list command just gives you a list of all unique keys in use on the cluster. It doesn't tell you if they are in use on some but not others. You could get that confidence using the -install command though, since it will only return 0 if every node replies with success.

My thinking was that if the commands are idempotent, you would just do a -install until you succeeded. Similarly, you would want to do a -use until it succeeded before doing a -remove of the old key. So essentially, -install could also be looked at as a "verify" of sorts, "verify this key exists on all nodes".

Member

ryanuber commented Apr 11, 2014

@armon the -list command just gives you a list of all unique keys in use on the cluster. It doesn't tell you if they are in use on some but not others. You could get that confidence using the -install command though, since it will only return 0 if every node replies with success.

My thinking was that if the commands are idempotent, you would just do a -install until you succeeded. Similarly, you would want to do a -use until it succeeded before doing a -remove of the old key. So essentially, -install could also be looked at as a "verify" of sorts, "verify this key exists on all nodes".

@armon

This comment has been minimized.

Show comment
Hide comment
@armon

armon Apr 11, 2014

Member

@ryanuber Got it. Would it be possible to add a very simple indicator to list, just something thats like [58/60] or something that indicates how many nodes have that key? This way an operator can quickly see if they need to run -install again.

Member

armon commented Apr 11, 2014

@ryanuber Got it. Would it be possible to add a very simple indicator to list, just something thats like [58/60] or something that indicates how many nodes have that key? This way an operator can quickly see if they need to run -install again.

@ryanuber

This comment has been minimized.

Show comment
Hide comment
@ryanuber

ryanuber Apr 11, 2014

Member

@armon definitely! I'll take a stab at that over the weekend.

Member

ryanuber commented Apr 11, 2014

@armon definitely! I'll take a stab at that over the weekend.

@armon

View changes

serf/internal_query.go
+ goto SEND
+ }
+
+ s.logger.Printf("[DEBUG] serf: Received install-key query")

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

I'd make this INFO level

@armon

armon Apr 14, 2014

Member

I'd make this INFO level

@armon

View changes

command/key.go
+ cmdFlags.Usage = func() { c.Ui.Output(c.Help()) }
+ cmdFlags.StringVar(&installKey, "install", "", "install a new key")
+ cmdFlags.StringVar(&useKey, "use", "", "change primary encryption key")
+ cmdFlags.StringVar(&removeKey, "remove", "", "remove a key")

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

I'd add a check to make sure one and only one of -install -use or -remove is provided. Rather odd otherwise.

@armon

armon Apr 14, 2014

Member

I'd add a check to make sure one and only one of -install -use or -remove is provided. Rather odd otherwise.

+func (a *Agent) loadKeyringFile(keyringFile string) error {
+ // Avoid passing an encryption key and a keyring file at the same time
+ if len(a.agentConf.EncryptKey) > 0 {
+ return fmt.Errorf("Encryption key not allowed while using a keyring")

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

Any reason to not just set the encrypt key as the primary key in the key ring? I imagine that generally -encrypt is provided with the primary key, and the -keyring-file is used only as a temp store while keys are being swapped

@armon

armon Apr 14, 2014

Member

Any reason to not just set the encrypt key as the primary key in the key ring? I imagine that generally -encrypt is provided with the primary key, and the -keyring-file is used only as a temp store while keys are being swapped

This comment has been minimized.

@ryanuber

ryanuber Apr 15, 2014

Member

We could definitely do this, but my thinking here was that providing a -encrypt might become brittle if the keys are updated at any point. As an example, if I have a script that starts serf with serf agent -keyring-file /tmp/ring.json -encrypt xxxxxxxxxx, then each time that Serf starts, the primary key will again be set to xxxxxxxxxx, even if a newer key had been set/broadcasted from a member at some other point and saved into the -keyring-file.

The -keyring-file in its current form is intended to work almost exactly like the -tags-file.

@ryanuber

ryanuber Apr 15, 2014

Member

We could definitely do this, but my thinking here was that providing a -encrypt might become brittle if the keys are updated at any point. As an example, if I have a script that starts serf with serf agent -keyring-file /tmp/ring.json -encrypt xxxxxxxxxx, then each time that Serf starts, the primary key will again be set to xxxxxxxxxx, even if a newer key had been set/broadcasted from a member at some other point and saved into the -keyring-file.

The -keyring-file in its current form is intended to work almost exactly like the -tags-file.

@armon

View changes

serf/internal_query.go
+ goto SEND
+ }
+
+ s.logger.Printf("[DEBUG] serf: Received use-key query")

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

Also INFO level

@armon

armon Apr 14, 2014

Member

Also INFO level

@armon

View changes

serf/internal_query.go
+ goto SEND
+ }
+
+ s.logger.Printf("[DEBUG] serf: Received remove-key query")

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

Lets make this INFO as well. Anything that is a modifying operation is good to have logged.

@armon

armon Apr 14, 2014

Member

Lets make this INFO as well. Anything that is a modifying operation is good to have logged.

@armon

View changes

serf/keymanager.go
+ return nil, err
+ }
+
+ qParam := &QueryParam{}

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

Lets use DefaultQueryParams() to guard against future changes.

@armon

armon Apr 14, 2014

Member

Lets use DefaultQueryParams() to guard against future changes.

@armon

View changes

serf/keymanager.go
+// RemoveKey handles broadcasting a key to the cluster for removal. Each member
+// will receive this event, and if they have the key in their keyring, remove
+// it. If any errors are encountered, RemoveKey will collect and relay them.
+func (k *keyManager) RemoveKey(key string) (*ModifyKeyResponse, error) {

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

We need to abstract away the commonality in this file. There is a lot of code repetition, since all these methods are effectively doing the same thing.

@armon

armon Apr 14, 2014

Member

We need to abstract away the commonality in this file. There is a lot of code repetition, since all these methods are effectively doing the same thing.

@armon

View changes

serf/serf.go
@@ -1537,3 +1556,30 @@ func (s *Serf) Stats() map[string]string {
}
return stats
}
+
+// WriteKeyringFile will serialize the current keyring and save it to a file.
+func (s *Serf) WriteKeyringFile(keyring *memberlist.Keyring) error {

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

No reason to make this public. Also we should be able to get the keyring from our state instead of an argument.

@armon

armon Apr 14, 2014

Member

No reason to make this public. Also we should be able to get the keyring from our state instead of an argument.

+// is enabled, which can be possible in one of 2 cases:
+// - Single encryption key passed at agent start (no persistence)
+// - Keyring file provided at agent start
+func (s *Serf) EncryptionEnabled() bool {

This comment has been minimized.

@armon

armon Apr 14, 2014

Member

Don't think we need this public either

@armon

armon Apr 14, 2014

Member

Don't think we need this public either

This comment has been minimized.

@ryanuber

ryanuber Apr 15, 2014

Member

This is currently public because it is called at agent start in command/agent/command.go in order to display Encrypted: <true/false>.

@ryanuber

ryanuber Apr 15, 2014

Member

This is currently public because it is called at agent start in command/agent/command.go in order to display Encrypted: <true/false>.

@armon

This comment has been minimized.

Show comment
Hide comment
@armon

armon Apr 14, 2014

Member

@ryanuber I've made a few notes, but overall this looks super solid! Very excited to have this in for the next release. Thanks for the hard work!

Member

armon commented Apr 14, 2014

@ryanuber I've made a few notes, but overall this looks super solid! Very excited to have this in for the next release. Thanks for the hard work!

@ryanuber

This comment has been minimized.

Show comment
Hide comment
@ryanuber

ryanuber Apr 15, 2014

Member

@armon Thanks for the review! I made some adjustments based on your recommendations and added a few response comments. Let me know what you think and we can adjust further from here.

Member

ryanuber commented Apr 15, 2014

@armon Thanks for the review! I made some adjustments based on your recommendations and added a few response comments. Let me know what you think and we can adjust further from here.

@armon

View changes

website/source/docs/agent/options.html.markdown
+ keyring file helps persist changes to the encryption keyring, allowing the
+ agent to start and rejoin the cluster successfully later on, even if key
+ rotations had been initiated by other members in the cluster. NOTE: this
+ option is not compatible with the `-encrypt` option.

This comment has been minimized.

@armon

armon Apr 16, 2014

Member

We should probably add a section below that documents the format of this file. Just so people know how to setup the keys, that the first key is the primary, etc.

@armon

armon Apr 16, 2014

Member

We should probably add a section below that documents the format of this file. Just so people know how to setup the keys, that the first key is the primary, etc.

@armon

View changes

serf/messages.go
@@ -18,6 +18,10 @@ const (
messageQueryType
messageQueryResponseType
messageConflictResponseType
+ messageInstallKeyType

This comment has been minimized.

@armon

armon Apr 16, 2014

Member

These 3 don't seem to be used as far as I can tell?

@armon

armon Apr 16, 2014

Member

These 3 don't seem to be used as far as I can tell?

This comment has been minimized.

@ryanuber

ryanuber Apr 16, 2014

Member

Ah, missed removing those... fixed in d756172.

@ryanuber

ryanuber Apr 16, 2014

Member

Ah, missed removing those... fixed in d756172.

@armon

View changes

serf/keymanager.go
+ }
+
+ qParam := k.serf.DefaultQueryParams()
+ queryResp, err := k.serf.Query(qName, rawKey, qParam)

This comment has been minimized.

@armon

armon Apr 16, 2014

Member

I hesitate to use the rawKey directly as the payload. We should instead use an encoded struct. The reason being if we need to add new fields later it will be much easier if there is already a struct that is using msgpack since it is flexible to new fields. Changing from a raw key to a struct will be much harder.

@armon

armon Apr 16, 2014

Member

I hesitate to use the rawKey directly as the payload. We should instead use an encoded struct. The reason being if we need to add new fields later it will be much easier if there is already a struct that is using msgpack since it is flexible to new fields. Changing from a raw key to a struct will be much harder.

+// them into a KeyResponse. It will update a KeyResponse *in place* and
+// therefore has nothing to return.
+func (k *keyManager) streamKeyResp(resp *KeyResponse, ch <-chan NodeResponse) {
+ for r := range ch {

This comment has been minimized.

@armon

armon Apr 16, 2014

Member

I think we also have a change to fast-path an early exit. The default query timeout is VERY conservative. We should pass in the number of nodes, and if we get that many responses, just exit. Otherwise we will wait like 10x longer than needed for the 95% case.

@armon

armon Apr 16, 2014

Member

I think we also have a change to fast-path an early exit. The default query timeout is VERY conservative. We should pass in the number of nodes, and if we get that many responses, just exit. Otherwise we will wait like 10x longer than needed for the 95% case.

This comment has been minimized.

@ryanuber

ryanuber Apr 16, 2014

Member

Awesome idea! Added in db060d5 and f28567f.

@ryanuber

ryanuber Apr 16, 2014

Member

Awesome idea! Added in db060d5 and f28567f.

+ }
+ defer client.Close()
+
+ if listKeys {

This comment has been minimized.

@armon

armon Apr 16, 2014

Member

We should probably do a check that only one of -list -install -use or -remove is specified. Really strange if they do something like "-list -install XXX"

@armon

armon Apr 16, 2014

Member

We should probably do a check that only one of -list -install -use or -remove is specified. Really strange if they do something like "-list -install XXX"

This comment has been minimized.

@ryanuber

ryanuber Apr 16, 2014

Member

Agreed, added in 17e8fb1

@ryanuber

ryanuber Apr 16, 2014

Member

Agreed, added in 17e8fb1

@armon

This comment has been minimized.

Show comment
Hide comment
@armon

armon Apr 16, 2014

Member

@ryanuber This is looking really good. I think it's very close. I think we are also missing some tests for the RPC layer, but I can always add those after a merge if you want.

Member

armon commented Apr 16, 2014

@ryanuber This is looking really good. I think it's very close. I think we are also missing some tests for the RPC layer, but I can always add those after a merge if you want.

ryanuber added some commits Apr 10, 2014

@ryanuber

This comment has been minimized.

Show comment
Hide comment
@ryanuber

ryanuber Apr 18, 2014

Member

@armon I took a swing at the RPC tests and added some command tests. If there is anything you think we are missing just let me know!

Member

ryanuber commented Apr 18, 2014

@armon I took a swing at the RPC tests and added some command tests. If there is anything you think we are missing just let me know!

@armon

This comment has been minimized.

Show comment
Hide comment
@armon

armon Apr 18, 2014

Member

@ryanuber Looks awesome! Thanks so much for all the hard work!

Member

armon commented Apr 18, 2014

@ryanuber Looks awesome! Thanks so much for all the hard work!

armon added a commit that referenced this pull request Apr 18, 2014

Merge pull request #199 from ryanuber/keyring
feature: change encryption keys

@armon armon merged commit 069de11 into hashicorp:master Apr 18, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment