
Allow nodes to change IDs when replacing a dead node #5485

Merged · 8 commits · May 15, 2019

Conversation

kyhavlov (Contributor)

This PR changes logic in the state store and leader to allow a node's ID to be updated when overwriting a failed node. Previously this would cause an error when, for example, a user shut down a server (causing it to leave non-gracefully) and brought it back up with the same name on a fresh VM, and therefore with a new node ID. This PR also includes a change in memberlist (see the branch here for the memberlist changes, including a test).

This follows the discussion in #5008 and is intended to fix #4741, but does not include changes around nodes with empty IDs, which will be a separate PR.

pierresouchay (Contributor) left a comment

@kyhavlov Thanks for dealing with this. While this is similar to the "dead" policy of #5008, I think it also addresses some other issues we (and some others) had with nodes having no Serf health.

While we tackled this issue by providing real fixed IDs on our bare-metal servers (by computing IDs from the machines' serial numbers), having this mechanism is probably the way to go.

I just wonder how it will behave with "fake" nodes, such as the ones inserted by projects like https://github.com/hashicorp/consul-k8s/.
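(As an aside on the workaround mentioned above: a fixed node ID can be derived by hashing a stable machine identifier into UUID form. The sketch below is illustrative only; the actual serial source and formatting scheme are not described in this thread.)

    package example

    import (
        "crypto/sha256"
        "fmt"
    )

    // nodeIDFromSerial derives a stable, UUID-shaped node ID from a machine
    // serial number, so the same hardware always registers with the same ID.
    // The hashing scheme is illustrative; any stable 16-byte derivation
    // formatted as 8-4-4-4-12 hex would do.
    func nodeIDFromSerial(serial string) string {
        sum := sha256.Sum256([]byte(serial))
        b := sum[:16]
        return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
    }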

return fmt.Errorf("Cannot get status of node %s: %s", enode.Node, err)
}

var nodeHealthy bool
Contributor

So, if the node has no SerfHealth, you consider it safe?
Is this the behavior you want? (I think it makes sense, but it might deserve a comment.)

Member

Yeah, that's a fair point. In general, nodes without Serf are fake "external nodes" like the k8s ones you mentioned, or ones created by use of ESM.

I think in that case we have to take the thing registering them as the source of truth: if it wants to change the ID or overwrite a node name, we just have to assume that is what it intends to do.

There are edge cases where a real agent could start with the same name as an existing "fake" node, but at this point I think we've done our best to stop the user misconfiguring things.

I agree a comment to make that intent clear would be a good idea.
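For illustration, a minimal sketch of the rule being discussed, in the spirit of the diff fragment above (canReplaceNode, check and node are hypothetical stand-ins, not Consul's actual catalog code): an existing node with a passing serfHealth check blocks the replacement, while a node with a failing or missing serfHealth check can be overwritten, taking the registering party as the source of truth.

    package example

    import "fmt"

    const (
        healthPassing  = "passing"
        serfHealthName = "serfHealth"
    )

    // check and node are pared-down stand-ins for structs.HealthCheck and
    // structs.Node.
    type check struct {
        CheckID string
        Status  string
    }

    type node struct {
        ID   string
        Node string
    }

    // canReplaceNode sketches the rule: an existing node may be overwritten
    // (e.g. its ID changed) only if it is not healthy. A node with no
    // serfHealth check at all is treated as an externally registered
    // ("fake") node, so whatever registers it is trusted and the
    // replacement is allowed.
    func canReplaceNode(existing *node, checks []*check) error {
        var nodeHealthy bool
        for _, c := range checks {
            if c.CheckID == serfHealthName && c.Status == healthPassing {
                nodeHealthy = true
            }
        }
        if nodeHealthy {
            return fmt.Errorf("node %s (ID %s) is still alive; refusing to replace it",
                existing.Node, existing.ID)
        }
        return nil
    }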

@@ -1332,11 +1332,12 @@ AFTER_CHECK:
Status: api.HealthPassing,
Output: structs.SerfCheckAliveOutput,
},
Contributor

What does the removal of SkipNodeUpdate: true mean?
Is it to ensure the node gets marked healthy from Serf?

Member

@kyhavlov can correct me if I am wrong, but I believe the reasoning here is that we may want to update the node ID or address, and using SkipNodeUpdate wouldn't update either of those two fields.

The few lines added below copy the existing registration's tagged addresses and node meta so that we don't clobber them (as we cannot get them via Serf).

One question I have is whether there is a race condition here. We are pulling the node's tagged addresses and node meta out of a snapshot of the FSM. Wouldn't it be possible for anti-entropy on the node to kick in and update the tagged addresses or node meta between when we grab them and when Raft finally applies the change?

Contributor Author

The SkipNodeUpdate field prevents the node state from being written if it doesn't already exist (which we don't want since we're updating the ID here): https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L309

With regard to the possibility of a race condition: anti-entropy starts on a delay, whereas this handleAliveMember function gets triggered immediately by the node joining, so in practice that shouldn't happen. Even if anti-entropy did update the meta/tagged addresses in the middle of this (and got clobbered by the leader's update), it would set them back to the updated values when it kicks in again.
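A rough sketch of the point above about copying fields from the existing registration (hypothetical types and names, not the actual leader.go code): the ID, name and address come from the Serf member that just came alive, while tagged addresses and node meta are carried over from the existing catalog entry so the update doesn't wipe them.

    package example

    // catalogNode and registerRequest are pared-down stand-ins for
    // structs.Node and structs.RegisterRequest.
    type catalogNode struct {
        ID              string
        Node            string
        Address         string
        TaggedAddresses map[string]string
        Meta            map[string]string
    }

    type registerRequest struct {
        ID              string
        Node            string
        Address         string
        TaggedAddresses map[string]string
        NodeMeta        map[string]string
    }

    // buildRegisterRequest sketches the idea: the ID, name and address come
    // from the Serf member, while tagged addresses and node meta are copied
    // from the existing registration, since Serf does not carry them and we
    // do not want to clobber them.
    func buildRegisterRequest(memberID, memberName, memberAddr string, existing *catalogNode) registerRequest {
        req := registerRequest{
            ID:      memberID,
            Node:    memberName,
            Address: memberAddr,
        }
        if existing != nil {
            req.TaggedAddresses = existing.TaggedAddresses
            req.NodeMeta = existing.Meta
        }
        return req
    }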

banks (Member) left a comment

@kyhavlov This looks great to me.

I think Pierre's suggestions for comments are worthwhile.

You need to PR and merge the memberlist change properly before we can land this, right? Was there a reason that isn't a PR already? For example, the vendor updates here aren't safe to commit since they reference a commit that only exists in your branch, and we should really stick to the memberlist master branch.

}
if got, want := failed, 1; got != want {
r.Fatalf("got %d failed members want %d", got, want)
}
Member

nit: we can use require.Equal(t, 1, failed) here now even with Retry IIRC.
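A minimal sketch of that suggestion, assuming *retry.R satisfies testify's require.TestingT interface as the comment implies (countFailedMembers is a hypothetical stand-in for however the real test derives the count):

    package example

    import (
        "testing"

        "github.com/hashicorp/consul/sdk/testutil/retry"
        "github.com/stretchr/testify/require"
    )

    // countFailedMembers is a hypothetical stand-in for however the real
    // test derives the number of failed Serf members.
    func countFailedMembers() int { return 1 }

    func TestFailedMemberCount(t *testing.T) {
        retry.Run(t, func(r *retry.R) {
            failed := countFailedMembers()
            // Equivalent to the r.Fatalf check above, assuming *retry.R
            // satisfies testify's require.TestingT (Errorf + FailNow).
            require.Equal(r, 1, failed)
        })
    }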

})
if err := s.ensureNoNodeWithSimilarNameTxn(tx, newNode, false); err != nil {
t.Fatal(err)
}
Member

nit: I prefer to use require.NoError(err) for new test code but not a big deal as this is consistent with the old test code here.
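For reference, a tiny sketch of the style being suggested (ensureSomething is a hypothetical stand-in for the call under test):

    package example

    import (
        "testing"

        "github.com/stretchr/testify/require"
    )

    // ensureSomething is a hypothetical stand-in for the call under test
    // (ensureNoNodeWithSimilarNameTxn in the real test).
    func ensureSomething() error { return nil }

    func TestEnsureSomething(t *testing.T) {
        err := ensureSomething()
        // One line instead of if err != nil { t.Fatal(err) }, and the error
        // message is included in the failure output.
        require.NoError(t, err)
    }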

mkeeler (Member) left a comment

This looks great.

In my reply to one of the existing comments I may have found a race condition; I am not totally sure.

When updating the node information in leader.go, would it be possible for the tagged addresses or node meta to change between what we pull out of the FSM state and when the Raft request gets processed?

banks (Member) commented Mar 14, 2019

Reading the memberlist change I'm just trying to get my head around it all!

The first question is that the errors reported in #4741 seem to imply that their agents did manage to join gossip already and only failed on the catalog registration. That was the original intuition behind "if Serf allowed them in already then it's fine", but it doesn't explain why we need changes to memberlist to implement this 🤔.

But things are more subtle.

What really happens when memberlist tries to join, as far as I can see, is:

  1. memberlist.Create sets ourselves as alive in our own memberlist state but doesn't talk to anyone else yet.
  2. memberlist.Join goes through the seed peers provided and attempts to pushPull-sync the full member list from them. Since it's a pushPull, we are also trying to get the self-reference added in Create merged into the peer's cluster state.
  3. The peer will decode the push and attempt to merge it with its own: https://github.com/hashicorp/memberlist/blob/b38abf62d7f3ce5225722cd62a90cfb098e02519/net.go#L264
  4. Before it does, it will invoke the MergeDelegate, which is handled by Serf's MergeDelegate (https://github.com/hashicorp/serf/blob/master/serf/merge_delegate.go#L13:6); this essentially just converts each memberlist.Node into a serf.Member and passes them all on to Consul's MergeDelegate:
    func (md *lanMergeDelegate) NotifyMerge(members []*serf.Member) error {
  5. Consul's merge delegate already does duplicate ID (but not name) detection and rejects pushPulls that try to add a new node with the same ID! That should cause the memberlist.Join to fail. (A simplified sketch follows this list.)
  6. Assuming the merge is allowed by the delegate, the pushPull succeeds and both peers will call memberlist.aliveNode for each node. That is where your memberlist changes are. Bailing out of aliveNode on a name conflict prevents that node from ever being added to peer state, preventing the join.
  7. For any new node that is not a duplicate, aliveNode will broadcast to N peers, which is what spreads the node join around the cluster.
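To make step 5 concrete, here is a stripped-down sketch of a merge delegate that rejects duplicate node IDs. It is an illustration only, not Consul's actual lanMergeDelegate; the knownIDs lookup and the "id" tag key are assumptions of the sketch.

    package example

    import (
        "fmt"

        "github.com/hashicorp/serf/serf"
    )

    // idCheckDelegate refuses a push/pull merge that would introduce a node
    // carrying an ID we already know under a different name. knownIDs
    // (ID -> node name) is a hypothetical stand-in for however the real
    // delegate looks up existing members.
    type idCheckDelegate struct {
        knownIDs map[string]string
    }

    // NotifyMerge satisfies Serf's MergeDelegate interface; returning an
    // error rejects the whole merge (and therefore the join).
    func (d *idCheckDelegate) NotifyMerge(members []*serf.Member) error {
        for _, m := range members {
            id := m.Tags["id"] // assumes the node ID rides in the member's tags
            if id == "" {
                continue
            }
            if existing, ok := d.knownIDs[id]; ok && existing != m.Name {
                return fmt.Errorf("member %q has conflicting node ID %q with member %q",
                    m.Name, id, existing)
            }
        }
        return nil
    }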

So here is my take on the original question: how did users in #4741 manage to successfully join and only fail on catalog registration?

Answer: because pushPull doesn't actually error in this case. Their peer still sends them the cluster state and attempts to merge them in, but fails the duplicate name check and so just doesn't add them to its own state or broadcast it.

Serf's Join will not see an error though; it will assume it has joined the cluster and broadcast a join message (which seems really broken to me...).

At this point I'm still not totally sure what happens. Maybe memberlist rejects the Serf join broadcast since the node broadcasting is not in the cluster according to anyone else, but I've run out of time and want to push this so I don't lose the train of thought.

I guess I'd still like to understand what is happening at the Serf/memberlist layers for people affected by #4741, since it would seem that memberlist would not have let them join, yet Serf is somehow partially connected and unaware that memberlist rejected the duplicate.

kyhavlov (Author) commented Mar 14, 2019

@banks That's a good, detailed writeup of the different layers of this. This needs a change in both memberlist and Consul because there are two different ways to hit the issue:

  1. Remove a node non-gracefully so that it enters the "failed" state, then re-add it with a different ID and address but the same name. This would typically happen when recreating the host VM for a server. In this case, memberlist is the layer checking for an address conflict and you'd get an error in the logs like:
    memberlist: Conflicting address for node3.dc1. Mine: 127.0.0.1:9350 Theirs: 127.0.0.1:9351
  (A sketch of the kind of check the memberlist change adds follows this list.)
  2. Remove a node non-gracefully, but this time re-add it with a different ID and the same address/name. This could happen if you cleared the data dir, and memberlist won't complain because the address and name are the same. Serf/memberlist don't know about a duplicate in this case, so it just looks like a failed node rejoining the cluster, and the node ID is just arbitrary metadata there. Instead, the only error happens in the state store when trying to update the node's record with a new ID. I think this is probably what happened in "Duplicate Node IDs after upgrading to 1.2.3" (#4741), and it could also happen if you recreated a node using the same IP address.
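For scenario 1, a rough sketch of the kind of check the memberlist change introduces (hypothetical types and names, not memberlist's actual aliveNode code): an address conflict is only fatal while the existing member is still alive; a failed member may be reclaimed by a node rejoining under the same name from a new address.

    package example

    import "fmt"

    // memberState is a pared-down stand-in for memberlist's internal node
    // state.
    type memberState struct {
        Name  string
        Addr  string
        Alive bool // true if the existing member is still considered alive
    }

    // checkAddressConflict sketches the idea: a conflicting address is only
    // an error while the existing member is still alive; a failed member
    // may be replaced by a node rejoining under the same name from a new
    // address.
    func checkAddressConflict(existing *memberState, newAddr string) error {
        if existing == nil || existing.Addr == newAddr {
            return nil
        }
        if existing.Alive {
            return fmt.Errorf("conflicting address for %s. Mine: %s Theirs: %s",
                existing.Name, existing.Addr, newAddr)
        }
        // The existing member is dead, so let the newcomer take over the name.
        return nil
    }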

I opened just this PR initially to make it easier to review the changes as a whole; I'll put up the memberlist PR too and re-vendor that when it's merged.

pearkes added this to the 1.5.0 milestone Apr 16, 2019
pearkes modified the milestones: 1.5.0, 1.5.1 Apr 29, 2019
Successfully merging this pull request may close these issues.

Duplicate Node IDs after upgrading to 1.2.3
5 participants