Skip to content

Conversation

@danwinship
Copy link
Contributor

TL;DR: OpenShift is overriding IPTablesMasqueradeBit in the proxy configuration, but upstream requires that if you override it there, you have to override it in the kubelet config too, which we were not doing.

Upstream kubernetes has two separate codepaths for dealing with iptables masquerade and drop rules; one in kubelet for hostports, and one in kube-proxy for services. They attempt to coordinate with each other (by both using the same iptables chain names), but since kubelet and kube-proxy are configured separately, they have to both be given the same alternative configuration if you want to change something.

We have been overriding the IPTablesMasqueradeBit value in kube-proxy from bit 14 to bit 0 for a long time now. (I don't know why.) But we weren't overriding the corresponding value in kubelet. This means that kubelet will periodically insert a rule into the KUBE-MARK-MASQ chain to masquerade any packet with bit 14 set in its mark, but then iptables will delete that rule the next time it updates anything. This doesn't affect service or hostsubnet functioning (it just means that at certain times, packets destined to be masqueraded get marked with two separate bits, and then we masquerade them if they have either of those two bits set) but it can cause problems with other people trying to use the mark for other things (like egress IPs).

This patch fixes things by overriding the kubelet default value to match what we're overriding the kube-proxy default value to. Except that in theory, some users might have already worked around the bug in the opposite direction, by overriding kube-proxy to use the kubelet/upstream-kube-proxy default value. So if they're doing that then we use that value for both kube-proxy and kubelet to avoid breaking them. Other than that though, if they are explicitly setting the kube-proxy and kubelet arguments to different values, then this lets them lose, just like before and just like upstream. (Though maybe we shouldn't?)

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1552869

@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 16, 2018
@deads2k
Copy link
Contributor

deads2k commented Mar 16, 2018

@sjenning Now that we just build the kubelet args, I think this is the sort of wiring pull that @openshift/sig-pod can reasonably own now.

@danwinship This appears to be coupling the kubelet flags to config that is logically for a different process. Have we regressed from a previous release or is this just to make life a little easier? If it's the latter, I think I'm more inclined to try a warning or an error in validation which could move to a higher level process when we have structured config changes.

@danwinship
Copy link
Contributor Author

This appears to be coupling the kubelet flags to config that is logically for a different process. Have we regressed from a previous release or is this just to make life a little easier?

The PR is to fix a bug in OpenShift. The "coupling" part is to avoid breaking people who were previously working around that bug. The core of the PR is just

setIfUnset(args, "iptables-masquerade-bit", "0")

(Upstream has always required that if you override kube-proxy's iptables-masquerade-bit (which we do), then you have to also override kubelet's iptables-masquerade-bit too (which we didn't). So this just fixes that.)

But some people might have already worked around the mismatch themselves. If they fixed it by adding

kubeletArguments:
  iptables-masquerade-bit: [ "0" ]

to their node-config, then that's not a problem, it just becomes redundant with our fix. But if they fixed it by adding

proxyArguments:
  iptables-masquerade-bit: [ "14" ]

then our change to set kubelet's iptables-masquerade-bit to 0 now would end up breaking their previously-working configuration. So we check if they're doing that, and if so, we leave everything configured the way it was. Other than that one check, there's no coupling between the kubelet and kube-proxy args; if the user wants to change the mark bit to any other value, they need to set that value in both kubeletArguments and proxyArguments, just like they would have to upstream.

I guess an alternative way to avoid breaking people would be to fix up their node-config via ansible at upgrade time. Then we could get rid of the proxyArguments check in the node setup code.

@sjenning
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 16, 2018
// Override kubelet iptables-masquerade-bit value to match overridden kube-proxy
// iptables-masquerade-bit value, UNLESS the user has overridden kube-proxy to match the
// previously-not-overridden kubelet value, in which case we don't want to re-break them.
if len(options.ProxyArguments["iptables-masquerade-bit"]) != 1 || options.ProxyArguments["iptables-masquerade-bit"][0] != "14" {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we want to override this flag only when they are using openshift SDN?
[i.e. when network.IsOpenShiftNetworkPlugin(options.NetworkConfig.NetworkPluginName) == true]

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravisankar no, looks like the defaulting to 0 is done unconditionally for kube-proxy config in pkg/cmd/server/kubernetes/network/options/options.go which is called from start_node.go's StartNode() whether openshift-sdn is enabled or not.

Copy link
Contributor

@dcbw dcbw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

// Override kubelet iptables-masquerade-bit value to match overridden kube-proxy
// iptables-masquerade-bit value, UNLESS the user has overridden kube-proxy to match the
// previously-not-overridden kubelet value, in which case we don't want to re-break them.
if len(options.ProxyArguments["iptables-masquerade-bit"]) != 1 || options.ProxyArguments["iptables-masquerade-bit"][0] != "14" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pravisankar no, looks like the defaulting to 0 is done unconditionally for kube-proxy config in pkg/cmd/server/kubernetes/network/options/options.go which is called from start_node.go's StartNode() whether openshift-sdn is enabled or not.

@openshift-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, dcbw, sjenning

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 20, 2018
@danwinship
Copy link
Contributor Author

/hold
@deads2k had wanted to do the backward-compatibility part as a separate ansible upgrade thing rather than mixing up the kubelet and kube-proxy configuration like this

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 20, 2018
@danwinship
Copy link
Contributor Author

I filed openshift/openshift-ansible#7589 to do the fix in ansible, but I think it would be a lot simpler to just commit this as-is (since it already got approved and lgtm anyway) and backport it, and then remove the extra check (in master only) in the future once we have the corresponding fix in ansible

/hold cancel
/retest

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2018
@danwinship
Copy link
Contributor Author

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot
Copy link
Contributor

/test all [submit-queue is verifying that this PR is safe to merge]

@openshift-merge-robot
Copy link
Contributor

Automatic merge from submit-queue (batch tested with PRs 19005, 19032).

@openshift-merge-robot openshift-merge-robot merged commit 16a9a09 into openshift:master Mar 21, 2018
@bmeng
Copy link
Contributor

bmeng commented Mar 22, 2018

Should we just fail the node starting if the two IPTablesMasqueradeBit have different values? Or should we log the warning in system log than fix it silently?

@danwinship
Copy link
Contributor Author

In theory, the kubelet code should not be looking at the kube-proxy config, and vice versa. (And in the future it's possible that they won't even be able to. That's true of the upstream code.) We probably need to do something better upstream. I'll file a bug.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. sig/pod size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants