Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[kube-proxy/ipvs] Add flag to disable ipvs for external ip #79976

Open
wants to merge 2 commits into
base: master
from

Conversation

@imroc
Copy link
Contributor

imroc commented Jul 10, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

Not exactly a bug but a regression for some users which use LoadBalancer not compatible with this change. This change bind LB IP to kube-ipvs0, so packet from LB to NodePort will never reach the pod, cuz linux will ignore the packets which source ip is a local ip (kernel will treat LB IP as a local ip, cuz it's been bounded to kube-ipvs0), and if we enable kernel param accept_local, pod can receive packets, but the reply packet will also cannot reach the client, cuz destination ip is LB IP, which is a local ip, so the packet cannot go out, this caused some problems, for instance:

  1. If LoadBalancer SNAT packet to its own IP (EXTERNAL-IP), the packet will never reach the pod
  2. If LoadBalancer not do SNAT, but it uses its own IP as probe packet and send to NodePort for health check, it will also never reach the pod, so LB will think the NodePort is unhealthy

Which issue(s) this PR fixes:

Fixes #79783

Does this PR introduce a user-facing change?:

[IPVS] Introduces flag ipvs-exclude-external-ip to disable the ipvs forwarding for both ExternalIPs and LoadBalancerIPs, used in scenarios where it is expected that the traffic from pod to LB should leave the host to reach LB and then route back to a host instead of being forwarded directly to the pod by ipvs, defaulting to false to preserve existing behaviors. 
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jul 10, 2019

Hi @imroc. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@imroc

This comment has been minimized.

Copy link
Contributor Author

imroc commented Jul 10, 2019

/sig network
/area ipvs
/assign @lbernail
/assign @m1093782566

@@ -58,6 +58,8 @@ type KubeProxyIPVSConfiguration struct {
// strict ARP configure arp_ignore and arp_announce to avoid answering ARP queries
// from kube-ipvs0 interface
StrictARP bool
// excludeExternalIP disable the ipvs forwarding from pod to LoadBalancer IP
ExcludeExternalIP bool

This comment has been minimized.

Copy link
@neolit123

neolit123 Jul 11, 2019

Member

this change needs to be reflected in the external types.
see https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/kube-proxy/config/v1alpha1/types.go#L42

on the other hand i really think we should lock v1alpha1 at this point.
cc @kubernetes/sig-network-api-reviews
cc @rosti @mtaufen

This comment has been minimized.

Copy link
@imroc

imroc Jul 11, 2019

Author Contributor

good catch ! I'll add it

This comment has been minimized.

Copy link
@neolit123

neolit123 Jul 11, 2019

Member

hi, you should probably hold for now.
(i.e. wait for comments)

on the other hand i really think we should lock v1alpha1 at this point.

what i meant is that we probably should not modify v1alpha1 any more and move to v1alpha2.

This comment has been minimized.

Copy link
@imroc

imroc Jul 11, 2019

Author Contributor

ok, I'll wait for comments from others

This comment has been minimized.

Copy link
@rosti

rosti Jul 11, 2019

Member

Unfortunately, the kube-proxy component config KEP is still nowhere near to be merged.
@imroc let's have this in v1alpha1. It's an IPVS specific setting, so it doesn't change the proposed scheme for now. Ideally, we want to shift to use config files for now and this parameter is not present in the config file ATM (at least as the PR is).

This comment has been minimized.

Copy link
@imroc

imroc Jul 11, 2019

Author Contributor

Yes, because I am waiting to confirm if this should be added to v1alpha1, now I have added it, PTAL @rosti

This comment has been minimized.

Copy link
@imroc

imroc Jul 12, 2019

Author Contributor

one more commit added for make generated_files

@neolit123

This comment has been minimized.

Copy link
Member

neolit123 commented Jul 11, 2019

/remove-kind bug
/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature and removed kind/bug labels Jul 11, 2019
@imroc imroc force-pushed the imroc:ipvs-exclude-external-ip branch 2 times, most recently from 81435b4 to e77b1ff Jul 11, 2019
@@ -186,6 +186,7 @@ func (o *Options) AddFlags(fs *pflag.FlagSet) {
fs.DurationVar(&o.config.UDPIdleTimeout.Duration, "udp-timeout", o.config.UDPIdleTimeout.Duration, "How long an idle UDP connection will be kept open (e.g. '250ms', '2s'). Must be greater than 0. Only applicable for proxy-mode=userspace")

fs.BoolVar(&o.config.IPVS.StrictARP, "ipvs-strict-arp", o.config.IPVS.StrictARP, "Enable strict ARP by setting arp_ignore to 1 and arp_announce to 2")
fs.BoolVar(&o.config.IPVS.ExcludeExternalIP, "ipvs-exclude-external-ip", o.config.IPVS.ExcludeExternalIP, "If true disable the ipvs forwarding for LoadBalancer IP, used in scenarios where it is expected that the packet from pod to LB should really reach LB instead of being forwarded directly to the pod by ipvs.")

This comment has been minimized.

Copy link
@m1093782566

m1093782566 Jul 11, 2019

Member

s/LoadBalancer IP/external IP

This comment has been minimized.

Copy link
@imroc

imroc Jul 11, 2019

Author Contributor

done

@imroc imroc force-pushed the imroc:ipvs-exclude-external-ip branch from e77b1ff to 354b097 Jul 11, 2019
@fejta-bot

This comment has been minimized.

Copy link

fejta-bot commented Jul 11, 2019

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@imroc imroc force-pushed the imroc:ipvs-exclude-external-ip branch from 354b097 to e36517b Jul 11, 2019
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jul 11, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: imroc
To complete the pull request process, please assign thockin
You can assign the PR to them by writing /assign @thockin in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@imroc imroc force-pushed the imroc:ipvs-exclude-external-ip branch from e36517b to 33ec085 Jul 12, 2019
@lbernail

This comment has been minimized.

Copy link
Contributor

lbernail commented Jul 12, 2019

Can we make it clear that we exclude both ExternalIPs and LoadbalancerIPs? (in the flag documentation and release note)

I think we may have a few edge cases for pod-to-ExternalIP and pod-to-LoadBalancerIP traffic: today this traffic is load-balanced locally by IPVS. If these IPs are not bound anymore it means the traffic will leave this instance and needs to be routed back to a host (through a LoadBalancer or routing). We need to make sure that traffic flows properly in this situation.

@imroc imroc force-pushed the imroc:ipvs-exclude-external-ip branch from 33ec085 to 3dff717 Jul 12, 2019
@imroc

This comment has been minimized.

Copy link
Contributor Author

imroc commented Jul 12, 2019

Can we make it clear that we exclude both ExternalIPs and LoadbalancerIPs? (in the flag documentation and release note)

I think we may have a few edge cases for pod-to-ExternalIP and pod-to-LoadBalancerIP traffic: today this traffic is load-balanced locally by IPVS. If these IPs are not bound anymore it means the traffic will leave this instance and needs to be routed back to a host (through a LoadBalancer or routing). We need to make sure that traffic flows properly in this situation.

Nice catch, I've updated the flag documentation, comment and release note, emphasized "both ExternalIPs and LoadbalancerIPs", and improved usage scenario description. @lbernail

@imroc

This comment has been minimized.

Copy link
Contributor Author

imroc commented Jul 12, 2019

/remove-kind bug

@k8s-ci-robot k8s-ci-robot removed the kind/bug label Jul 12, 2019
@lbernail

This comment has been minimized.

Copy link
Contributor

lbernail commented Jul 13, 2019

I completely see the use case but changing how the traffic flows for pod-to-loadbalancer and pod-to-external-ip feels like breaking the expected behavior. Even if it is behind a flag I'm not sure we want to do it (introducing a different behavior will likely make debugging issues more complicated).

@thockin what do you think?

@imroc

This comment has been minimized.

Copy link
Contributor Author

imroc commented Jul 15, 2019

If we don't do it, then how can we solve this problem? I think the use case covers a lot users, when they upgrade to higher k8s version, this problem will become their unexpected behaviour

@imroc

This comment has been minimized.

Copy link
Contributor Author

imroc commented Jul 21, 2019

@thockin PTAL. We're wating for this to be solved, we can backport this patch once been merged

Copy link
Member

andrewsykim left a comment

I completely see the use case but changing how the traffic flows for pod-to-loadbalancer and pod-to-external-ip feels like breaking the expected behavior. Even if it is behind a flag I'm not sure we want to do it (introducing a different behavior will likely make debugging issues more complicated).

I agree we should avoid this if possible. Wondering if we can fix this somehow with iptables, maybe masquerade only for health check node ports at PREROUTING?

@@ -1041,6 +1048,9 @@ func (proxier *Proxier) syncProxyRules() {
}

// ipvs call
if proxier.excludeExternalIP {

This comment has been minimized.

Copy link
@andrewsykim

andrewsykim Jul 25, 2019

Member

This is technically loadbalancer IP and not external IP?

This comment has been minimized.

Copy link
@imroc

imroc Jul 26, 2019

Author Contributor

It means both Service's ExternalIP and ingress IP

@andrewsykim

This comment has been minimized.

Copy link
Member

andrewsykim commented Jul 25, 2019

/assign

@imroc

This comment has been minimized.

Copy link
Contributor Author

imroc commented Jul 26, 2019

I agree we should avoid this if possible. Wondering if we can fix this somehow with iptables, maybe masquerade only for health check node ports at PREROUTING?

@andrewsykim Yeah, I think this is feasible, but we should consider this: Linux will ignore all packets from the non-loopback interface which source IP is the local IP by default. If we bind LB's IP to the kube-ipvs0, the LB's IP will become local IP, that means we need to change the linux's default behaviour, this could be done by enabling accept_local parameter which has been backported till kernel v2.6.33, we can enable it during kube-proxy startup. And I think we should masquerade LB's IP not just only for node ports, masquerade all LB's IP at PREROUTING instead, because the implementation of LoadBalancer Service may not based on node ports, LB can bind pod ip directly in some implemetation. I think I can initiate another PR for implementing this idea.

@lbernail

This comment has been minimized.

Copy link
Contributor

lbernail commented Aug 12, 2019

I think the flag name ExcludeExternalIP is misleading (because we also exclude loadbalancer IPs)

Also if we do it this way, we should update tests to take this flag into account

@fejta-bot

This comment has been minimized.

Copy link

fejta-bot commented Nov 10, 2019

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@andrewsykim

This comment has been minimized.

Copy link
Member

andrewsykim commented Nov 11, 2019

/remove-lifecycle stale

@Sh4d1

This comment has been minimized.

Copy link
Contributor

Sh4d1 commented Dec 4, 2019

Also, some LBs implements PROXY protocol, and bypassing the LB can lead to error when the application receives the packet (TLS instead of PROXY).
(Also this problem is kink of linked to #66607 )
Another way to solve th eproblem would be to add specific annotation to a Service that indicates do not bind ? It could maybe then be extended to the cloud provider interface, in order to never bind IPs for a cloud provider?

@Sh4d1

This comment has been minimized.

Copy link
Contributor

Sh4d1 commented Dec 4, 2019

@lbernail @m1093782566 what do you think?

@MatthiasLohr

This comment has been minimized.

Copy link

MatthiasLohr commented Dec 31, 2019

Linux will ignore all packets from the non-loopback interface which source IP is the local IP by default. If we bind LB's IP to the kube-ipvs0, the LB's IP will become local IP, that means we need to change the linux's default behaviour, this could be done by enabling accept_local parameter which has been backported till kernel v2.6.33, we can enable it during kube-proxy startup. And I think we should masquerade LB's IP not just only for node ports, masquerade all LB's IP at PREROUTING instead, because the implementation of LoadBalancer Service may not based on node ports, LB can bind pod ip directly in some implemetation. I think I can initiate another PR for implementing this idea.

Any progress on this? @imroc, do you have a workaround using the PREROUTING/accept_local idea? Could you maybe provide the pull request or some ideas, how one could to that manually? Thank you so much!

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Dec 31, 2019

@imroc: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Doca

This comment has been minimized.

Copy link

Doca commented Dec 31, 2019

We have the same problem using metallb. We found an extraordinary great workaround by stopping the public ip from being bound by ipvs with

while true; do ip addr del dev kube-ipvs0; done

:-D A proper solution would be awesome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.