
Don't use strategic merge patch on Node.Status.Addresses #79391

Merged: 2 commits merged into kubernetes:master from danwinship:nodeaddresses-update-fix on Jul 3, 2019

Conversation

@danwinship (Contributor) commented Jun 25, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
On a node with multiple IP addresses of the same type (e.g., multiple InternalIP addresses), when the set of active addresses changed, the kubelet would sometimes randomly reorder them. This could cause the node to suddenly declare a different primary node IP, causing various breakage.

This is because the kubelet sync code uses strategic merge patch to update Node.Status, but Node.Status.Addresses is incorrectly annotated with a patch strategy that only works correctly when there is at most one address of each type.
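
For reference, the field is declared along these lines in k8s.io/api/core/v1/types.go (struct tags abridged here); the patchMergeKey of "type" is only a usable merge key if each address type appears at most once:

```go
// Abridged from the NodeStatus struct; protobuf tag omitted.
// The "merge" strategy keyed on "type" assumes each NodeAddressType occurs only once.
Addresses []NodeAddress `json:"addresses,omitempty" patchStrategy:"merge" patchMergeKey:"type"`
```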

Even though the annotation is wrong for the API, fixing it is an API break, particularly since changing it would result in even worse behavior when the clients and servers were different versions. (The client would generate a strategic patch according to one interpretation which the server would then apply according to a different interpretation.)

So this fixes PatchNodeStatus() by having it edit the automatically generated strategic merge patch to force a "replace" strategy on Status.Addresses.
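
As a rough sketch of the approach (not the exact code in this PR; the helper name is made up, and it assumes the encoding/json and k8s.io/api/core/v1 imports): the generated patch is rewritten so that status.addresses carries the full desired list plus a `$patch: replace` directive, which tells the strategic-merge-patch machinery to overwrite the list rather than merge it.

```go
// forceAddressesReplace is a hypothetical helper illustrating the idea:
// rewrite the generated strategic merge patch so status.addresses is
// replaced wholesale instead of merged by the "type" key.
func forceAddressesReplace(patchBytes []byte, addresses []v1.NodeAddress) ([]byte, error) {
	var patch map[string]interface{}
	if err := json.Unmarshal(patchBytes, &patch); err != nil {
		return nil, err
	}
	status, ok := patch["status"].(map[string]interface{})
	if !ok {
		return patchBytes, nil // patch does not touch status
	}
	if _, ok := status["addresses"]; !ok {
		return patchBytes, nil // patch does not touch addresses
	}
	// Send the full desired list plus a "$patch": "replace" directive,
	// so the apiserver overwrites the existing list.
	addrs := make([]interface{}, 0, len(addresses)+1)
	for _, a := range addresses {
		addrs = append(addrs, map[string]interface{}{
			"type":    string(a.Type),
			"address": a.Address,
		})
	}
	addrs = append(addrs, map[string]interface{}{"$patch": "replace"})
	status["addresses"] = addrs
	return json.Marshal(patch)
}
```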

Which issue(s) this PR fixes:
None, but the newly added unit test demonstrates the bug. (It fails against the old code.)
(Downstream bug is https://bugzilla.redhat.com/show_bug.cgi?id=1696628 but it's mostly private)

Does this PR introduce a user-facing change?:

Kubelet should now more reliably report the same primary node IP even if the set of node IPs reported by the CloudProvider changes.

/priority important-soon

@k8s-ci-robot requested review from dashpole and smarterclayton, Jun 25, 2019

@danwinship force-pushed the danwinship:nodeaddresses-update-fix branch 2 times, most recently from 68f5b0c to d717cb4, Jun 25, 2019

@danwinship force-pushed the danwinship:nodeaddresses-update-fix branch from d717cb4 to 3064d17, Jun 26, 2019

@danwinship force-pushed the danwinship:nodeaddresses-update-fix branch from 3064d17 to e22af74, Jun 26, 2019

@danwinship changed the title from "WIP kubelet: Don't use strategic merge patch on NodeStatus" to "Don't use strategic merge patch on NodeStatus", Jun 26, 2019


@k8s-ci-robot requested review from liggitt and sttts, Jun 26, 2019

@thockin (Member) left a comment

That's a lot of JSON transforms. I don't see a better answer.

I would still leave a comment in types.go, effectively:

"This field is declared as mergeable, but the merge key is not sufficiently unique, which can cause data corruption when it is merged. Callers should instead use a full-replacement patch. See http://pr.k8s.io/79391 for an example."


@liggitt (Member) commented Jul 2, 2019

When CCM has been rolled out but some of the nodes are not yet updated?

Oh, are we transitioning all addresses to be managed by the CCM, not just for "external" providers?

On this hunk from the PR's test code:

```go
existingNode := v1.Node{ObjectMeta: metav1.ObjectMeta{Name: testKubeletHostname}}
kubeClient.ReactionChain = fake.NewSimpleClientset(&v1.NodeList{Items: []v1.Node{existingNode}}).ReactionChain

testAddresses := [][]v1.NodeAddress{
```

@liggitt (Member) commented Jul 2, 2019

a table test with existing/new address lists would be easier to follow

would like to see tests around:

  • nil list -> populated list
  • empty list -> populated list
  • populated list -> nil list
  • populated list -> empty list
  • no change with multiple addresses of the same type
  • 1 InternalIP address -> 2 InternalIP addresses
  • 2 InternalIP addresses -> 1 InternalIP address
  • 2 InternalIP addresses -> 2 different InternalIP addresses

and verify that constructing/applying the patch always results in a set of node addresses semantically equal to the new addresses
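
A rough sketch of what such a table test could look like (names here are illustrative; applyNodeStatusPatch stands in for a hypothetical helper that builds and applies the patch, and the usual testing, k8s.io/api/core/v1, and k8s.io/apimachinery/pkg/api/equality imports are assumed):

```go
func TestNodeAddressesPatch(t *testing.T) {
	testCases := []struct {
		name     string
		existing []v1.NodeAddress
		updated  []v1.NodeAddress
	}{
		{
			name:     "nil list -> populated list",
			existing: nil,
			updated:  []v1.NodeAddress{{Type: v1.NodeInternalIP, Address: "10.0.0.1"}},
		},
		{
			name: "2 InternalIP addresses -> 2 different InternalIP addresses",
			existing: []v1.NodeAddress{
				{Type: v1.NodeInternalIP, Address: "10.0.0.1"},
				{Type: v1.NodeInternalIP, Address: "10.0.0.2"},
			},
			updated: []v1.NodeAddress{
				{Type: v1.NodeInternalIP, Address: "10.0.0.3"},
				{Type: v1.NodeInternalIP, Address: "10.0.0.4"},
			},
		},
		// ...remaining cases from the list above...
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			oldNode := &v1.Node{Status: v1.NodeStatus{Addresses: tc.existing}}
			newNode := &v1.Node{Status: v1.NodeStatus{Addresses: tc.updated}}
			// applyNodeStatusPatch is hypothetical: construct the patch from
			// oldNode/newNode, apply it to oldNode, and return the result.
			patched, err := applyNodeStatusPatch(oldNode, newNode)
			if err != nil {
				t.Fatal(err)
			}
			if !apiequality.Semantic.DeepEqual(patched.Status.Addresses, tc.updated) {
				t.Errorf("got addresses %v, want %v", patched.Status.Addresses, tc.updated)
			}
		})
	}
}
```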

@kfox1111 commented Jul 2, 2019

I'm trying to evaluate how much this might affect one of my systems.

> On a node with multiple IP addresses of the same type (e.g., multiple InternalIP addresses), when the set of active addresses changed, the kubelet would sometimes randomly reorder them.

What typically causes the set of active addresses to change? Adding or deleting interfaces only? Simply restarting kubelet? Rebooting? Something else?

@danwinship (Contributor, Author) commented Jul 2, 2019

> That's a lot of JSON transforms. I don't see a better answer.

Well... I could apply a JSON patch to the JSON patch... 😮
(That actually might not be terrible...?)

> Cluster is rolling out cloud-controller-manager. CCM is running the controller that updates node addresses with this patch. Meanwhile, at least one kubelet is back-rev, writing node addresses with old code. It's going to flap.

If kubelet and CCM don't agree on what the correct list of addresses is, then there will be flapping regardless of who is using which algorithm, of course.

If kubelet and CCM do agree on what the correct list of addresses is, then it won't flap for long, because both sides will only try to patch the list if it's currently wrong. Once the side using the fixed algorithm manages to patch it to the correct value, both sides will agree that that value is correct and so they won't try to patch it again.

@danwinship (Contributor, Author) commented Jul 2, 2019

> What typically causes the set of active addresses to change? Adding or deleting interfaces only? Simply restarting kubelet? Rebooting? Something else?

If you do not have a cloud provider set, then the bug doesn't affect you. (I'll update the release note to clarify that.) If you do have a cloud provider set, then the set of active addresses is up to the cloud provider, but generally corresponds to adding/deleting interfaces, and maybe adding/deleting secondary IPs on those interfaces.

@andrewsykim (Member) commented Jul 2, 2019

> If kubelet and CCM don't agree on what the correct list of addresses is, then there will be flapping regardless of who is using which algorithm, of course.

This is part of the problem, though: we have less control over version skew in the external cloud provider case, which makes it hard to reason about the impact of this. The more probable and dangerous case is that a cluster is using a version of kubelet with this patch, and then the cluster is migrated to use a CCM that vendored an older version of Kubernetes (say v1.15), where the CCM is back to using strategic merge patch. Ideally you deploy a version of CCM that is in lock-step with the Kubernetes version, but when the cloud provider integration is out-of-tree, it's much more common for those versions to skew.


@danwinship force-pushed the danwinship:nodeaddresses-update-fix branch from 79fa94b to 05a9634, Jul 2, 2019

@kfox1111 commented Jul 2, 2019

> > What typically causes the set of active addresses to change? Adding or deleting interfaces only? Simply restarting kubelet? Rebooting? Something else?
>
> If you do not have a cloud provider set, then the bug doesn't affect you. (I'll update the release note to clarify that.) If you do have a cloud provider set, then the set of active addresses is up to the cloud provider, but generally corresponds to adding/deleting interfaces, and maybe adding/deleting secondary IPs on those interfaces.

In the case of the vSphere driver, it calls net.Interfaces and filters out anything without a VMware prefix. It returns them in the order net.Interfaces returns them. net.Interfaces calls interfaceTable(0), defined here:
https://golang.org/src/net/interface_linux.go
which uses netlink to grab the list. I don't think netlink provides any guarantee on the order...

So it's possible simply restarting kubelet would cause the order to change. Is a reordering a problem, or is it only when the number of entries changes?
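
For illustration only (this is not the vSphere provider's actual code, and the VMware OUI prefixes used are an assumption), enumeration along those lines could look like this, with the resulting order coming straight from the kernel:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// Assumed VMware OUI prefixes, used purely for this illustration.
var vmwareOUIs = []string{"00:50:56", "00:0c:29"}

// vmwareInterfaces returns interfaces whose MAC starts with a VMware prefix,
// in whatever order net.Interfaces (i.e. netlink) happens to return them.
func vmwareInterfaces() ([]net.Interface, error) {
	ifaces, err := net.Interfaces()
	if err != nil {
		return nil, err
	}
	var out []net.Interface
	for _, iface := range ifaces {
		mac := strings.ToLower(iface.HardwareAddr.String())
		for _, oui := range vmwareOUIs {
			if strings.HasPrefix(mac, oui) {
				out = append(out, iface)
				break
			}
		}
	}
	return out, nil
}

func main() {
	ifaces, err := vmwareInterfaces()
	if err != nil {
		panic(err)
	}
	for _, iface := range ifaces {
		fmt.Println(iface.Name, iface.HardwareAddr)
	}
}
```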

@andrewsykim (Member) commented Jul 2, 2019

> I think I agree with Dan, though - we have a case where a bug exists. If that bug exists in 2 components, they both need to be fixed.

I agree with this too, but would prefer to take a pragmatic approach for the out-of-tree case. Maybe we can isolate a commit that fixes the patch strategy for only the CCM case and backport that to previous versions? Forcing every out-of-tree provider to upgrade to v1.16 Kubernetes will not be easy because of the way things are vendored now (vendoring k8s.io/kubernetes is required).

@andrewsykim (Member) commented Jul 2, 2019

@kfox1111 for the vSphere case, it would only apply if there is more than 1 network interface with the matching prefix.


@thockin (Member) left a comment

Thanks!

/lgtm
/approve

@k8s-ci-robot added the lgtm label, Jul 2, 2019

@k8s-ci-robot (Contributor) commented Jul 2, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, thockin


@andrewsykim (Member) commented Jul 3, 2019

> Does CCM have this equivalent loop in a new controller? Can we just make a different PR for that?

It shares the nodeutil.PatchNodeStatus function with the kubelet, but yeah, good point; worst case, we can duplicate the function locally. Thanks!

@kfox1111 commented Jul 3, 2019

> @kfox1111 for the vSphere case, it would only apply if there is more than 1 network interface with the matching prefix.

Yeah, I see that on my cluster for sure. I'm just trying to understand how this bug will affect me until it's fixed. Is restarting a kubelet (during a crash, say) enough to trigger node update failures due to interface reordering?

Just how frequently it happens, and what triggers it, dictates how to respond to it. If it's only when an interface is removed or added, this won't affect any of my production systems, as I don't do that. If it's on kubelet restart, then it's fairly serious and I need to tread very lightly, as I am not in full control of that... I'd maybe even have to patch it out of band until it's fixed upstream, to be safe.

@danwinship (Contributor, Author) commented Jul 3, 2019

Restarting won't trigger the bug.

@kfox1111 commented Jul 3, 2019

Ok. Thank you.

@k8s-ci-robot merged commit c8cee54 into kubernetes:master, Jul 3, 2019

23 checks passed

  • cla/linuxfoundation: danwinship authorized
  • pull-kubernetes-bazel-build: Job succeeded.
  • pull-kubernetes-bazel-test: Job succeeded.
  • pull-kubernetes-conformance-image-test: Skipped.
  • pull-kubernetes-cross: Skipped.
  • pull-kubernetes-dependencies: Job succeeded.
  • pull-kubernetes-e2e-gce: Job succeeded.
  • pull-kubernetes-e2e-gce-100-performance: Job succeeded.
  • pull-kubernetes-e2e-gce-csi-serial: Skipped.
  • pull-kubernetes-e2e-gce-device-plugin-gpu: Job succeeded.
  • pull-kubernetes-e2e-gce-iscsi: Skipped.
  • pull-kubernetes-e2e-gce-iscsi-serial: Skipped.
  • pull-kubernetes-e2e-gce-storage-slow: Skipped.
  • pull-kubernetes-godeps: Skipped.
  • pull-kubernetes-integration: Job succeeded.
  • pull-kubernetes-kubemark-e2e-gce-big: Job succeeded.
  • pull-kubernetes-local-e2e: Skipped.
  • pull-kubernetes-node-e2e: Job succeeded.
  • pull-kubernetes-node-e2e-containerd: Job succeeded.
  • pull-kubernetes-typecheck: Job succeeded.
  • pull-kubernetes-verify: Job succeeded.
  • pull-publishing-bot-validate: Skipped.
  • tide: In merge pool.

@danwinship deleted the danwinship:nodeaddresses-update-fix branch, Jul 8, 2019
