
Allow nodes to change IDs when replacing a dead node #5485

Merged · 8 commits · May 15, 2019

Conversation

kyhavlov (Contributor)

This PR changes logic in the state store and leader to allow a node's ID to be updated when overwriting a failed node. Previously this would cause an error when, for example, a user shut down a server (causing it to leave non-gracefully) and brought it back up with the same name on a fresh VM, and therefore with a new node ID. This PR also includes a change in memberlist (see the branch here for the memberlist changes, including a test).

This follows the discussion in #5008 and is intended to fix #4741, but does not include changes around nodes with empty IDs, which will be a separate PR.

pierresouchay (Contributor) left a comment

@kyhavlov Thanks for dealing with this. While this is similar to the "dead" policy of #5008, I think it also addresses some other issues we (and some others) had with nodes having no Serf health.

While we tackled this issue by providing real fixed IDs on our bare-metal servers (by computing IDs from the machines' serial numbers), having this mechanism is probably the way to go.

I just wonder how it will behave with "fake" nodes, such as the ones inserted by projects like https://github.com/hashicorp/consul-k8s/.
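(As an aside on the workaround mentioned above: a fixed node ID can be derived by hashing a stable machine identifier into UUID form. The sketch below is illustrative only; the actual serial source and formatting scheme are not described in this thread.)

    package example

    import (
        "crypto/sha256"
        "fmt"
    )

    // nodeIDFromSerial derives a stable, UUID-shaped node ID from a machine
    // serial number, so the same hardware always registers with the same ID.
    // The hashing scheme is illustrative; any stable 16-byte derivation
    // formatted as 8-4-4-4-12 hex would do.
    func nodeIDFromSerial(serial string) string {
        sum := sha256.Sum256([]byte(serial))
        b := sum[:16]
        return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16])
    }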

return fmt.Errorf("Cannot get status of node %s: %s", enode.Node, err)
}

var nodeHealthy bool
Contributor

So, if the node has no SerfHealth, you consider it safe?
Is this the behavior you want? (I think it makes sense, but it might deserve a comment.)

Member

Yeah, that's a fair point. In general, nodes without Serf are fake "external nodes" like the k8s ones you mentioned, or ones created by use of ESM.

I think in that case we have to take the thing registering them as the source of truth: if it wants to change the ID or overwrite a node name, we just have to assume that is what it intends to do.

There are edge cases where a real agent could start with the same name as an existing "fake" node, but at this point I think we've done our best to stop the user misconfiguring things.

I agree a comment to make that intent clear would be a good idea.
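For illustration, a minimal sketch of the rule being discussed, in the spirit of the diff fragment above (canReplaceNode, check and node are hypothetical stand-ins, not Consul's actual catalog code): an existing node with a passing serfHealth check blocks the replacement, while a node with a failing or missing serfHealth check can be overwritten, taking the registering party as the source of truth.

    package example

    import "fmt"

    const (
        healthPassing  = "passing"
        serfHealthName = "serfHealth"
    )

    // check and node are pared-down stand-ins for structs.HealthCheck and
    // structs.Node.
    type check struct {
        CheckID string
        Status  string
    }

    type node struct {
        ID   string
        Node string
    }

    // canReplaceNode sketches the rule: an existing node may be overwritten
    // (e.g. its ID changed) only if it is not healthy. A node with no
    // serfHealth check at all is treated as an externally registered
    // ("fake") node, so whatever registers it is trusted and the
    // replacement is allowed.
    func canReplaceNode(existing *node, checks []*check) error {
        var nodeHealthy bool
        for _, c := range checks {
            if c.CheckID == serfHealthName && c.Status == healthPassing {
                nodeHealthy = true
            }
        }
        if nodeHealthy {
            return fmt.Errorf("node %s (ID %s) is still alive; refusing to replace it",
                existing.Node, existing.ID)
        }
        return nil
    }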

@@ -1332,11 +1332,12 @@ AFTER_CHECK:
Status: api.HealthPassing,
Output: structs.SerfCheckAliveOutput,
},
Contributor

What does the removal of SkipNodeUpdate: true mean?
Is it to ensure the node gets marked healthy from Serf?

Member

@kyhavlov can correct me if I am wrong, but I believe the reasoning here is that we may want to update the node ID or address, and using SkipNodeUpdate wouldn't update either of those two fields.

The few lines added below copy the existing registration's tagged addresses and node meta so that we don't clobber them (as we cannot get them via Serf).

One question I have is whether there is a race condition here. We are pulling the node's tagged addresses and node meta out of a snapshot of the FSM. Wouldn't it be possible for anti-entropy on the node to kick in and update the tagged addresses or node meta between when we grab them and when Raft finally applies the change?

Contributor Author

The SkipNodeUpdate field prevents the node state from being written if it doesn't already exist (which we don't want since we're updating the ID here): https://github.com/hashicorp/consul/blob/master/agent/consul/state/catalog.go#L309

With regard to the possibility of a race condition: anti-entropy starts on a delay, whereas this handleAliveMember function gets triggered immediately by the node joining, so in practice that shouldn't happen. Even if anti-entropy did update the meta/tagged addresses in the middle of this (and got clobbered by the leader's update), it would set them back to the updated values when it kicks in again.
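A rough sketch of the point above about copying fields from the existing registration (hypothetical types and names, not the actual leader.go code): the ID, name and address come from the Serf member that just came alive, while tagged addresses and node meta are carried over from the existing catalog entry so the update doesn't wipe them.

    package example

    // catalogNode and registerRequest are pared-down stand-ins for
    // structs.Node and structs.RegisterRequest.
    type catalogNode struct {
        ID              string
        Node            string
        Address         string
        TaggedAddresses map[string]string
        Meta            map[string]string
    }

    type registerRequest struct {
        ID              string
        Node            string
        Address         string
        TaggedAddresses map[string]string
        NodeMeta        map[string]string
    }

    // buildRegisterRequest sketches the idea: the ID, name and address come
    // from the Serf member, while tagged addresses and node meta are copied
    // from the existing registration, since Serf does not carry them and we
    // do not want to clobber them.
    func buildRegisterRequest(memberID, memberName, memberAddr string, existing *catalogNode) registerRequest {
        req := registerRequest{
            ID:      memberID,
            Node:    memberName,
            Address: memberAddr,
        }
        if existing != nil {
            req.TaggedAddresses = existing.TaggedAddresses
            req.NodeMeta = existing.Meta
        }
        return req
    }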

banks (Member) left a comment

@kyhavlov This looks great to me.

I think Pierre's suggestions for comments are worthwhile.

You need to PR and merge the memberlist change properly before we can land this, right? Was there a reason that isn't a PR already? For example, the vendor updates here aren't safe to commit since they reference a commit that only exists in your branch, and we should really stick to the memberlist master branch.

}
if got, want := failed, 1; got != want {
r.Fatalf("got %d failed members want %d", got, want)
}
Member

nit: we can use require.Equal(t, 1, failed) here now even with Retry IIRC.
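A minimal sketch of that suggestion, assuming *retry.R satisfies testify's require.TestingT interface as the comment implies (countFailedMembers is a hypothetical stand-in for however the real test derives the count):

    package example

    import (
        "testing"

        "github.com/hashicorp/consul/sdk/testutil/retry"
        "github.com/stretchr/testify/require"
    )

    // countFailedMembers is a hypothetical stand-in for however the real
    // test derives the number of failed Serf members.
    func countFailedMembers() int { return 1 }

    func TestFailedMemberCount(t *testing.T) {
        retry.Run(t, func(r *retry.R) {
            failed := countFailedMembers()
            // Equivalent to the r.Fatalf check above, assuming *retry.R
            // satisfies testify's require.TestingT (Errorf + FailNow).
            require.Equal(r, 1, failed)
        })
    }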

})
if err := s.ensureNoNodeWithSimilarNameTxn(tx, newNode, false); err != nil {
t.Fatal(err)
}
Member

nit: I prefer to use require.NoError(err) for new test code but not a big deal as this is consistent with the old test code here.
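For reference, a tiny sketch of the style being suggested (ensureSomething is a hypothetical stand-in for the call under test):

    package example

    import (
        "testing"

        "github.com/stretchr/testify/require"
    )

    // ensureSomething is a hypothetical stand-in for the call under test
    // (ensureNoNodeWithSimilarNameTxn in the real test).
    func ensureSomething() error { return nil }

    func TestEnsureSomething(t *testing.T) {
        err := ensureSomething()
        // One line instead of if err != nil { t.Fatal(err) }, and the error
        // message is included in the failure output.
        require.NoError(t, err)
    }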

mkeeler (Member) left a comment

This looks great.

In my reply to one of the existing comments I may have found a race condition; I am not totally sure.

When updating the node information in leader.go, would it be possible for the tagged addresses or node meta to change between what we pull out of the FSM state and when the Raft request gets processed?

banks (Member) commented Mar 14, 2019

Reading the memberlist change I'm just trying to get my head around it all!

The first question is that the errors reported in #4741 seem to imply that their agents did manage to join gossip already and only failed on the catalog registration. That was the original intuition behind "if Serf allowed them in already then it's fine", but it doesn't explain why we need changes to memberlist to implement this 🤔.

But things are more subtle.

What really happens when memberlist tries to join, as far as I can see, is:

  1. memberlist.Create sets ourselves as alive in our own memberlist state but doesn't talk to anyone else yet.
  2. memberlist.Join goes through the seed peers provided and attempts to pushPull-sync the full member list from them. Since it's a pushPull, we are also trying to get the self-reference added in Create merged into the peer's cluster state.
  3. The peer will decode the push and attempt to merge it with its own: https://github.com/hashicorp/memberlist/blob/b38abf62d7f3ce5225722cd62a90cfb098e02519/net.go#L264
  4. Before it does, it will invoke the MergeDelegate, which is handled by Serf's MergeDelegate (https://github.com/hashicorp/serf/blob/master/serf/merge_delegate.go#L13:6); this essentially just converts each memberlist.Node into a serf.Member and passes them all on to Consul's MergeDelegate:
    func (md *lanMergeDelegate) NotifyMerge(members []*serf.Member) error {
  5. Consul's merge delegate already does duplicate ID (but not name) detection and rejects pushPulls that try to add a new node with the same ID! That should cause the memberlist.Join to fail. (A simplified sketch follows this list.)
  6. Assuming the merge is allowed by the delegate, the pushPull succeeds and both peers will call memberlist.aliveNode for each node. That is where your memberlist changes are. Bailing out of aliveNode on a name conflict prevents that node from ever being added to peer state, preventing the join.
  7. For any new node that is not a duplicate, aliveNode will broadcast to N peers, which is what spreads the node join around the cluster.
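To make step 5 concrete, here is a stripped-down sketch of a merge delegate that rejects duplicate node IDs. It is an illustration only, not Consul's actual lanMergeDelegate; the knownIDs lookup and the "id" tag key are assumptions of the sketch.

    package example

    import (
        "fmt"

        "github.com/hashicorp/serf/serf"
    )

    // idCheckDelegate refuses a push/pull merge that would introduce a node
    // carrying an ID we already know under a different name. knownIDs
    // (ID -> node name) is a hypothetical stand-in for however the real
    // delegate looks up existing members.
    type idCheckDelegate struct {
        knownIDs map[string]string
    }

    // NotifyMerge satisfies Serf's MergeDelegate interface; returning an
    // error rejects the whole merge (and therefore the join).
    func (d *idCheckDelegate) NotifyMerge(members []*serf.Member) error {
        for _, m := range members {
            id := m.Tags["id"] // assumes the node ID rides in the member's tags
            if id == "" {
                continue
            }
            if existing, ok := d.knownIDs[id]; ok && existing != m.Name {
                return fmt.Errorf("member %q has conflicting node ID %q with member %q",
                    m.Name, id, existing)
            }
        }
        return nil
    }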

So here is my take on the original question: how did users in #4741 manage to successfully join and only fail on catalog registration?

Answer: because pushPull doesn't actually error in this case. Their peer still sends them the cluster state and attempts to merge them in, but fails the duplicate name check and so just doesn't add them to its own state or broadcast it.

Serf's Join will not see an error though; it will assume it has joined the cluster and broadcast a join message (which seems really broken to me...).

At this point I'm still not totally sure what happens. Maybe memberlist rejects the Serf join broadcast since the node broadcasting is not in the cluster according to anyone else, but I've run out of time and want to push this so I don't lose the train of thought.

I guess I'd still like to understand what is happening at the Serf/memberlist layers for people affected by #4741, since it would seem that memberlist would not have let them join, yet Serf is somehow partially connected and unaware that memberlist rejected the duplicate.

kyhavlov (Author) commented Mar 14, 2019

@banks That's a good, detailed writeup of the different layers of this. This needs a change in both memberlist and Consul because there are two different ways to hit the issue:

  1. Remove a node non-gracefully so that it enters the "failed" state, then re-add it with a different ID and address but the same name. This would typically happen when recreating the host VM for a server. In this case, memberlist is the layer checking for an address conflict and you'd get an error in the logs like:
    memberlist: Conflicting address for node3.dc1. Mine: 127.0.0.1:9350 Theirs: 127.0.0.1:9351
  (A sketch of the kind of check the memberlist change adds follows this list.)
  2. Remove a node non-gracefully, but this time re-add it with a different ID and the same address/name. This could happen if you cleared the data dir, and memberlist won't complain because the address and name are the same. Serf/memberlist don't know about a duplicate in this case, so it just looks like a failed node rejoining the cluster, and the node ID is just arbitrary metadata there. Instead, the only error happens in the state store when trying to update the node's record with a new ID. I think this is probably what happened in "Duplicate Node IDs after upgrading to 1.2.3" (#4741), and it could also happen if you recreated a node using the same IP address.
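For scenario 1, a rough sketch of the kind of check the memberlist change introduces (hypothetical types and names, not memberlist's actual aliveNode code): an address conflict is only fatal while the existing member is still alive; a failed member may be reclaimed by a node rejoining under the same name from a new address.

    package example

    import "fmt"

    // memberState is a pared-down stand-in for memberlist's internal node
    // state.
    type memberState struct {
        Name  string
        Addr  string
        Alive bool // true if the existing member is still considered alive
    }

    // checkAddressConflict sketches the idea: a conflicting address is only
    // an error while the existing member is still alive; a failed member
    // may be replaced by a node rejoining under the same name from a new
    // address.
    func checkAddressConflict(existing *memberState, newAddr string) error {
        if existing == nil || existing.Addr == newAddr {
            return nil
        }
        if existing.Alive {
            return fmt.Errorf("conflicting address for %s. Mine: %s Theirs: %s",
                existing.Name, existing.Addr, newAddr)
        }
        // The existing member is dead, so let the newcomer take over the name.
        return nil
    }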

I opened just this PR initially to make it easier to review the changes as a whole; I'll put up the memberlist PR too and re-vendor that when it's merged.

pearkes added this to the 1.5.0 milestone Apr 16, 2019
pearkes modified the milestones: 1.5.0, 1.5.1 Apr 29, 2019
Successfully merging this pull request may close these issues.

Duplicate Node IDs after upgrading to 1.2.3
5 participants