
ipvs UDP to garbage collect realServers #76664

Closed
sepich opened this issue Apr 16, 2019 · 17 comments


What happened:
We're running a syslog pod behind a Service and sending logs to it via UDP.
This is k8s v1.13.3 with ipvs mode services.
When the syslog pod gets restarted, logs get blackholed.
Simplified example on a node:

# ipvsadm -Ln | grep :514
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP  192.168.127.89:514 lc
  -> 192.168.128.193:514          Masq    1      0          1
  -> 192.168.129.116:514          Masq    0      0          1

So the old pod had IP 192.168.129.116 and the new pod has 192.168.128.193.
The weights correctly forward all new traffic to the new pod, but there are some InActConn entries which keep sending traffic to the non-existent IP:

# ipvsadm -Lnc | grep :514
IPVS connection entries
pro expire state       source             virtual            destination
UDP 04:24  UDP         192.168.127.89:44117 192.168.127.89:514 192.168.129.116:514
UDP 04:12  UDP         192.168.128.178:42898 192.168.127.89:514 192.168.128.193:514

The minimal value for UDP connection expiration is 1s:
ipvsadm --set 0 0 1
But logs are being sent at a much faster rate, so these broken connections never expire.
The issue was introduced by
#71514
#71881
It looks like this code is supposed to keep the old real server entry around while there are still connections to it:

			// Delete RS with no connections
			// For UDP, ActiveConn is always 0
			// For TCP, InactiveConn are connections not in ESTABLISHED state
			if rs.ActiveConn+rs.InactiveConn != 0 {

But is there a reason to keep these realServer entries when the old pod has already been cleaned up and no such IP exists anymore?
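
For illustration, here is a minimal sketch (not the actual kube-proxy code; the realServer type and field names are simplified assumptions) of how the deletion check above could be made protocol-aware, so UDP real servers are removed immediately instead of waiting on InActConn entries that may never drain:

package main

import "fmt"

// realServer is a simplified stand-in for the IPVS destination stats
// used by the graceful-termination check quoted above.
type realServer struct {
	Protocol     string
	ActiveConn   int
	InactiveConn int
}

// canDeleteNow reports whether a real server could be removed immediately.
// For TCP the check waits until no connections remain (graceful termination);
// for UDP the "connections" are only conntrack-style entries that constant
// traffic keeps refreshing, so waiting on them can block cleanup forever.
func canDeleteNow(rs realServer) bool {
	if rs.Protocol == "UDP" {
		return true
	}
	return rs.ActiveConn+rs.InactiveConn == 0
}

func main() {
	stale := realServer{Protocol: "UDP", ActiveConn: 0, InactiveConn: 1}
	fmt.Println(canDeleteNow(stale)) // true: delete the stale UDP real server right away
}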

What you expected to happen:
kube-proxy would clean up all non-existent realServers from the ipvs entries.

How to reproduce it (as minimally and precisely as possible):
Repro steps are the same as for the host-port issue:
#59033

Anything else we need to know?:
sysctls are default:

net.ipv4.vs.expire_nodest_conn = 1
net.ipv4.vs.conn_reuse_mode = 0
net.ipv4.vs.conntrack = 1

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:08:12Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.3", GitCommit:"721bfa751924da8d1680787490c54b9179b1fed0", GitTreeState:"clean", BuildDate:"2019-02-01T20:00:57Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration: on-premise
  • OS (e.g: cat /etc/os-release): Debian 9
  • Kernel (e.g. uname -a): 4.9.0-8-amd64
  • Install tools:
  • Others:
@neolit123 (Member) commented Apr 16, 2019

/sig network

@k8s-ci-robot added the sig/network label and removed the needs-sig label Apr 16, 2019

@athenabot commented Apr 16, 2019

/triage unresolved

Comment /remove-triage unresolved when the issue is assessed and confirmed.

🤖 I am a bot run by @vllry. 👩‍🔬

@m1093782566 (Member) commented Apr 19, 2019

@strmdvr commented Apr 24, 2019

We observe the same behavior in k8s v1.12.7 and it hits us hard.
In our case nginx in a pod uses the coredns Service IP as its DNS resolver address, and every time the coredns pods get new addresses, DNS queries are blackholed.
The query rate is higher than the ipvs UDP connection timeout (30s in our case), so the connection to a non-existent IP address stays forever.
kube-proxy keeps the non-existent real server entry in the ipvs service, and the conntrack entry never gets removed either.

Kube-proxy logs:

I0424 17:28:06.427238       1 graceful_termination.go:160] Trying to delete rs: <service ip>:53/UDP/<long gone coredns pod ip>:53
I0424 17:28:06.427386       1 graceful_termination.go:171] Not deleting, RS <service ip>:53/UDP/<long gone coredns pod ip>: 0 ActiveConn, 1 InactiveConn

ipvsadm -ln:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
...
UDP  <service ip>:53 rr
  -> <non-existent pod IP>:53                Masq    0      0          1
...
@lbernail (Contributor) commented Apr 29, 2019

I've been giving this issue some thought (we actually have the same one).

Currently we handle graceful termination the same way for UDP as we do for TCP. However, on heavily loaded UDP services (DNS for instance), UDP ports are reused quickly and old backends are not deleted fast enough.

There are two possible solutions:

  • use more aggressive timeouts (30s instead of the default 300s). Not perfect, but better (see the rough sketch after this comment)
  • immediately delete real servers and their associated connections when a backend is no longer ready (very good for DNS queries, but this means no graceful termination for UDP)

@m1093782566 / @thockin, what do you think? I'm not sure what the exact behavior is with iptables.
I can create a detailed issue with options to have a wider discussion, or even send an email directly to the sig-network group.
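
For the first option, a rough sketch of what applying more aggressive timeouts could look like, assuming ipvsadm is available on the node. This just wraps the standard ipvsadm --set tcp tcpfin udp command; it is not how kube-proxy configures IPVS today.

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// setIPVSTimeouts shells out to ipvsadm to set the TCP, TCP FIN-WAIT and UDP
// session timeouts (in seconds), e.g. setIPVSTimeouts(900, 120, 30).
func setIPVSTimeouts(tcp, tcpfin, udp int) error {
	cmd := exec.Command("ipvsadm", "--set",
		fmt.Sprint(tcp), fmt.Sprint(tcpfin), fmt.Sprint(udp))
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("ipvsadm --set failed: %v (output: %s)", err, out)
	}
	return nil
}

func main() {
	// 30s UDP timeout instead of the default 300s.
	if err := setIPVSTimeouts(900, 120, 30); err != nil {
		log.Fatal(err)
	}
}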

@m1093782566 (Member) commented Apr 29, 2019

immediately delete real servers and their associated connections when a backend is no longer ready (very good for DNS queries, but this means no graceful termination for UDP)

I like the idea. BTW, iptables also does not support graceful termination for UDP connections.

@lbernail (Contributor) commented Apr 29, 2019

If we agree this is the way to go, I can implement it.
(If iptables works this way, this is the best solution, I agree.)

We could still tweak the timeouts for TCP (the defaults are very long: 900s, which can lead to traffic being completely blackholed for 15 minutes when a target crashes, because the connection is never closed).

@m1093782566 (Member) commented Apr 29, 2019

I agree.

@lbernail (Contributor) commented May 1, 2019

/area ipvs

@lbernail (Contributor) commented May 1, 2019

@sepich If you have a test environment, I have a test image with the fix discussed just above: lbernail/hyperkube-amd64:v1.15.0-udp
Commit here: DataDog@8a7bd6c

This is not perfect, but I believe it is much better. What happens is the following:

  • the UDP target pod becomes NotReady
  • its IP is removed from the real servers immediately
  • because of expire_nodest_conn, any new packet reusing the port will generate a write error and the connection will be removed from the ipvs connection table (if the application retries, you are fine)

@m1093782566 I think in addition to that we should set much more aggressive UDP timeouts (5s? instead of the default 300s) and make them configurable with a flag for specific use cases (I haven't seen a use case for long-lived UDP "connections" with a very low amount of traffic). A rough sketch of such a flag is below.
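
A purely hypothetical sketch of the "configurable with a flag" part; the flag name and 5s default are assumptions for illustration, not an existing kube-proxy option in this version.

package main

import (
	"flag"
	"fmt"
	"time"
)

// ipvsUDPTimeout is a hypothetical flag illustrating the "aggressive but
// configurable" UDP timeout idea discussed above.
var ipvsUDPTimeout = flag.Duration("ipvs-udp-timeout", 5*time.Second,
	"IPVS UDP session timeout to apply on the node")

func main() {
	flag.Parse()
	fmt.Printf("would configure IPVS UDP timeout to %s\n", *ipvsUDPTimeout)
}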

@sepich (Author) commented May 2, 2019

If you have a test environment, I have a test image with the fix discussed just above:

Sorry, I was unable to quickly test this on a local dev cluster.
I tried to replace the kube-proxy image with lbernail/hyperkube-amd64:v1.15.0-udp, but it seems it does not work with k8s 1.13.

# k -n kube-system logs -f kube-proxy-mlpzb
I0502 15:13:29.369586       1 server_others.go:170] Using ipvs Proxier.
I0502 15:13:29.370216       1 server.go:540] Version: v1.15.0-alpha.2.32+0f4b666469d9bb-dirty
I0502 15:13:29.380490       1 conntrack.go:52] Setting nf_conntrack_max to 131072
I0502 15:13:29.381343       1 config.go:202] Starting service config controller
I0502 15:13:29.381365       1 controller_utils.go:1029] Waiting for caches to sync for service config controller
I0502 15:13:29.381378       1 config.go:102] Starting endpoints config controller
I0502 15:13:29.381384       1 controller_utils.go:1029] Waiting for caches to sync for endpoints config controller
I0502 15:13:29.481576       1 controller_utils.go:1036] Caches are synced for service config controller
I0502 15:13:29.481576       1 controller_utils.go:1036] Caches are synced for endpoints config controller
E0502 15:13:29.510489       1 ipset.go:162] Failed to make sure ip set: &{{KUBE-NODE-PORT-LOCAL-SCTP hash:ip,port inet 1024 65536 0-65535 Kubernetes nodeport SCTP port with externalTrafficPolicy=local} map[] 0xc000434070} exist, error: error creating ipset KUBE-NODE-PORT-LOCAL-SCTP, error: exit status 1
E0502 15:13:55.832737       1 ipset.go:162] Failed to make sure ip set: &{{KUBE-NODE-PORT-LOCAL-SCTP hash:ip,port inet 1024 65536 0-65535 Kubernetes nodeport SCTP port with externalTrafficPolicy=local} map[] 0xc000434070} exist, error: error creating ipset KUBE-NODE-PORT-LOCAL-SCTP, error: exit status 1
...

It then generates a lot of the same messages, and no ipvs VIPs are created.
I've also tried a fresh new cluster v1.14.1 with the same result.
I then tried v1.15.0-alpha.1 but was unable to even start the cluster (with a modified kubeadm.yml config from our 1.13).
Can you please cherry-pick this to 1.13 or 1.14, or am I doing something wrong?

@athenabot commented May 4, 2019

@m1093782566
If this issue has been triaged, please comment /remove-triage unresolved.

If you aren't able to handle this issue, consider unassigning yourself and/or adding the help-wanted label.

🤖 I am a bot run by vllry. 👩‍🔬

@stamm commented May 9, 2019

We tested this patch on kubernetes v1.12.7 and it works well in production.

BTW, I don't know much about SCTP, but the question is: do we need to treat it as UDP or TCP in this case? It seems closer to UDP.

@lbernail referenced this issue May 13, 2019: REQUEST: New membership for lbernail #811 (Closed)
@lbernail (Contributor) commented May 13, 2019

We tested this patch on kubernetes v1.12.7 and it works well in production.

Thanks for testing!

BTW, I don't know much about SCTP, but the question is: do we need to treat it as UDP or TCP in this case? It seems closer to UDP.

The PR limits graceful termination to TCP, which seems the safest. I don't know SCTP, but making it behave like TCP would be an easy change; a hedged sketch of such a protocol check is below.
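
A hedged sketch of what treating SCTP like TCP in that check could look like; the protocol strings and function name are illustrative, not the PR's actual code.

package main

import "fmt"

// keepForGracefulTermination reports whether a real server should be kept
// while connections drain. TCP (and, per the suggestion above, SCTP) wait
// for ActiveConn+InactiveConn to reach zero; UDP is deleted immediately.
func keepForGracefulTermination(protocol string, activeConn, inactiveConn int) bool {
	switch protocol {
	case "TCP", "SCTP":
		return activeConn+inactiveConn != 0
	default: // UDP
		return false
	}
}

func main() {
	fmt.Println(keepForGracefulTermination("SCTP", 1, 0)) // true: drain like TCP
	fmt.Println(keepForGracefulTermination("UDP", 0, 1))  // false: remove immediately
}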

@jsravn (Contributor) commented May 22, 2019

This feels like a fundamental problem w/ endpoints... ideally the proxier would remove the real server when the pod is nonexistent rather than relying on arbitrary timeouts. Otherwise, there is no way to support graceful termination with UDP-based protocols. Not a big deal for most UDP things, but I imagine e.g. QUIC users would be affected. Immediately terminating UDP connections is in line with the iptables behavior, which seems to be okay for most people, so the change seems reasonable until there is a better way to determine pod state.

@lbernail (Contributor) commented May 29, 2019

@jsravn The PR that was just merged will improve things a lot for UDP, but I completely agree with you on the fundamental problem with endpoints. I think we should have 3 different states (sketched below):

  • Ready (accepting traffic), in which case we should route traffic to the endpoint
  • NotReady (terminating, for instance), in which case we should stop routing new traffic to the endpoint but keep existing connections
  • "Gone" (not in the Endpoints object anymore), in which case we should remove the endpoint altogether

Today the Endpoints object is too simple to capture these differences. I hope the new EndpointCondition status in the EndpointSlice API will help address this problem. More than happy to integrate this into IPVS when it is released (cc @andrewsykim / @m1093782566).
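
As an illustration only (none of these types exist in the Kubernetes API; this just spells out the three-state idea above), a sketch of how a proxier could act on such states:

package main

import "fmt"

// endpointState captures the three cases described above.
type endpointState int

const (
	stateReady    endpointState = iota // accepting traffic: route new connections to it
	stateNotReady                      // e.g. terminating: no new traffic, keep existing connections
	stateGone                          // no longer in the endpoints object: remove the real server entirely
)

// weightAndCleanup returns the IPVS weight to assign and whether the real
// server (and its connection entries) should be deleted outright.
func weightAndCleanup(s endpointState) (weight int, deleteRS bool) {
	switch s {
	case stateReady:
		return 1, false
	case stateNotReady:
		return 0, false // graceful termination: let existing connections drain
	default: // stateGone
		return 0, true
	}
}

func main() {
	w, del := weightAndCleanup(stateGone)
	fmt.Println(w, del) // 0 true
}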
