
ipvs proxier doesn't respect graceful termination #57841

Closed
jsravn opened this issue Jan 4, 2018 · 32 comments · Fixed by #66012
Labels
area/ipvs kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@jsravn
Contributor

jsravn commented Jan 4, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
Upon removing an endpoint, the ipvs proxier immediately deletes the ipvs real server, causing all connections to get dropped.

What you expected to happen:
It should allow the terminating pod to gracefully close connections, just like the iptables proxier.

How to reproduce it (as minimally and precisely as possible):

  1. Enable ipvs proxier
  2. Create a keepalive / long lived connection to a pod (e.g. while :; do echo -e "GET / HTTP/1.1\nhost: $host\n\n"; sleep 5; echo; done | telnet $serviceip 80)
  3. Delete that pod - observe the connection gets closed immediately, further requests will fail. On iptables proxier, it will continue to work (until the pod itself stops or closes the connection).

Anything else we need to know?:
The ipvs proxier should instead be setting weight to 0, then reaping the stale real servers after some time period (that should be greater than any pod's graceful termination time). This may also fix the existing bug around UDP connections getting dropped prematurely (#45976).

Environment:

  • Kubernetes version (use kubectl version): tested on 1.8, but same issue in 1.9 afaict
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): 4.4.0
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jan 4, 2018
@jsravn
Contributor Author

jsravn commented Jan 4, 2018

/cc @m1093782566

@k8s-ci-robot
Contributor

@jsravn: GitHub didn't allow me to assign the following users: sig-node.

Note that only kubernetes members can be assigned.

In response to this:

/assign sig-node
/cc @m1093782566

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jsravn
Contributor Author

jsravn commented Jan 4, 2018

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 4, 2018
@m1093782566
Contributor

m1093782566 commented Jan 5, 2018

@jsravn You can assign any IPVS related issue to me directly :)

/assign

/area ipvs

@m1093782566
Contributor

m1093782566 commented Jan 5, 2018

@jsravn

I have a question:

The ipvs proxier should instead be setting weight to 0

Why? Currently there is no way to pass the weight to IPVS proxier, so we set weight to 1 by default. What's the advantage of setting weight to 0? Will it resolve this issue?

@jsravn
Contributor Author

jsravn commented Jan 5, 2018

@m1093782566 Setting weight to 0 puts the real server into "maintenance mode" so to speak. The scheduler will stop sending new connections to it, but existing connections remain active. It's the correct way to gracefully remove a server from IPVS as far as I know.

So syncEndpoint on endpoint delete should:

  1. Set weight to 0 of the real server for the endpoint that is removed
  2. Create a delayed task to delete the real server after n seconds, where n > pod grace period (i.e. a sufficiently large delay, like 5 minutes).

This simulates the same behavior in iptables where TCP connections time out after some period of time rather than being forcefully removed by the proxier.

@m1093782566
Contributor

Great comment! Will try to fix it and see if it works.

Thanks

@jsravn
Contributor Author

jsravn commented Jan 5, 2018

Hopefully you can fix this for UDP connections too, since iptables proxier suffers from a bug where it drops the udp connection immediately (causing errors when kube-dns is restarted for instance...). It'd be nice if the ipvs proxier could handle that better.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2018
@jsravn
Contributor Author

jsravn commented Apr 5, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2018
@rramkumar1
Contributor

@m1093782566 Are you actively working on this? This is something that definitely needs to be fixed before IPVS GA.

@rramkumar1
Contributor

I think the iptables fix is at #60074

For IPVS we might need to implement a similar change.

@m1093782566
Contributor

m1093782566 commented May 2, 2018

@rramkumar1

This is something that definitely needs to be fixed before IPVS GA.

I slightly disagree... I don't think this issue is a big deal that should block IPVS GA (it only affects UDP connections), and the iptables proxy has the same issue. I am looking at the issue, though.

@jsravn
Contributor Author

jsravn commented May 3, 2018

I believe this affects TCP connections as well, since it immediately removes the IPVS entry as I described. iptables doesn't have this problem (except for UDP).

@jhorwit2
Contributor

jhorwit2 commented Jun 6, 2018

@m1093782566 @rramkumar1 is this a blocker for GA? This seems like a serious gap in functionality from iptables mode especially with IPVS becoming the new default.

Specifically referring to dropping TCP connections.

@m1093782566
Contributor

@jhorwit2

I don't think this issue should be a blocker for IPVS GA. Per discussion, the graduation criteria are:

a) CIs are green

b) necessary documents are available there

Please note that iptables is still the default mode even though IPVS has become GA.

Anyway, my team will take a look at this issue.

cc @Lion-Wei @islinwb @stewart-yu

@rramkumar1
Contributor

rramkumar1 commented Jun 6, 2018 via email

@Lion-Wei

Lion-Wei commented Jun 6, 2018

The problem is we don't know whether this is what's needed. I mean, some people want this graceful termination, some people don't.
Even in iptables proxy mode, UDP connections don't get graceful termination, and people don't seem to be in a rush to fix it.

@m1093782566
Contributor

m1093782566 commented Jun 6, 2018

I think it can be an interesting topic for the next sig-network call?

@jhorwit2
Contributor

jhorwit2 commented Jun 6, 2018

@rramkumar1 I don't see how not supporting graceful termination of applications with ipvs gives users what they expect from the kubernetes platform in terms of production-ready solutions. This bug alone will block us from using the feature, since it would start causing noticeable impact immediately. IMO, if it isn't fixed prior to GA, then this lack of functionality should be made more visible in the docs and release notes.

@rramkumar1
Contributor

@jhorwit2 I am not opposed to adding this caveat to release notes/docs. However, given the current state I'm not sure it's worth blocking GA for one more whole cycle just for this when we can just push a fix in a patch release. @m1093782566 if possible, can your team prioritize coming up with a fix that maybe we can squeeze into 1.11.0 rather than wait for the first patch?

@m1093782566
Contributor

m1093782566 commented Jun 6, 2018

Sure, my team is looking at this issue. Will let you all know if we make any progress.

@jsravn
Contributor Author

jsravn commented Jun 18, 2018

I think it will break a lot of people's expectations if IPVS doesn't support graceful termination with at least TCP - this is the current behavior with iptables and userspace mode.

@kfox1111

Thank you for the release note. It saved me a bunch of time testing something that was not ready yet.

@Lion-Wei

Lion-Wei commented Jul 3, 2018

@jsravn @m1093782566 @rramkumar1 @jhorwit2 Hi, guys. Recently I have been testing this issue and got some results. Here are my test steps:

  1. Create a pod with lifecycle/preStop specified and terminationGracePeriodSeconds=300. Expose this pod to a service.

    apiVersion: v1
    kind: Pod
    ...
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - sleep 300
    ...
      terminationGracePeriodSeconds: 300
  2. Check this pod and service.

    # k get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
    sourceip     ClusterIP   10.0.0.175   <none>        8080/TCP   8s
    # k get pod -owide
    NAME           READY     STATUS    RESTARTS   AGE       IP           NODE
    sourceip       1/1       Running   0          45s       172.17.0.4   127.0.0.1
  3. Use telnet to test the service clusterIP connection.

    # time telnet 10.0.0.133 8080
    Trying 10.0.0.133...
    Connected to 10.0.0.133.
    Escape character is '^]'.
    Connection closed by foreign host.	
  4. Use kubectl delete pod; given what we specified in the pod's preStop, this pod should stay in Terminating status for 300s.

    # k delete pod sourceip
    pod "sourceip" deleted
    # k get pod
    NAME           READY     STATUS        RESTARTS   AGE
    sourceip       1/1       Terminating   0          3m
  5. Check ipvs rules.

    # ipvs real server have been deleted
    # ipvsadm -l
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  10.0.0.1:https rr
      -> 10.162.199.82:6443           Masq    1      1          0
    TCP  10.0.0.10:domain rr
      -> 172.17.0.4:domain            Masq    1      0          0
    TCP  10.0.0.133:http-alt rr
    UDP  10.0.0.10:domain rr
      -> 172.17.0.4:domain            Masq    1      0          0
    # check ipvs connections
    # ipvsadm -lnc
    IPVS connection entries
    pro expire state       source             virtual            destination
    TCP 14:06  ESTABLISHED 10.0.0.133:58441   10.0.0.133:8080    172.17.0.5:80

    We can see the ipvs real server has been deleted, but the telnet connection is still up, in ESTABLISHED state with an expire time of 14m06s.

  6. After 5 min, the container is deleted and the telnet connection closes. Check the ipvsadm connections.

    # telnet returned
    # time telnet 10.0.0.133 8080
    Trying 10.0.0.133...
    Connected to 10.0.0.133.
    Escape character is '^]'.
    Connection closed by foreign host.
    
    real    5m12.514s
    user    0m0.000s
    sys     0m0.003s
    
    # ipvs connection in FIN_WAIT status, with expire time.
    # ipvsadm -lnc
    IPVS connection entries
    pro expire state       source             virtual            destination
    TCP 03:56  FIN_WAIT    10.0.0.133:59648   10.0.0.133:8080    172.17.0.5:80

So, according to my test, I think the ipvs proxier does have graceful termination for long-lived connections. If you have any questions or suggestions about the test process, please let me know. I'd also welcome other test procedures from other people.

@m1093782566
Contributor

Impressive test, @Lion-Wei

@jsravn @jhorwit2 please provide your feedback.

@jsravn
Contributor Author

jsravn commented Jul 5, 2018

@m1093782566

I discussed this with @Lion-Wei on slack, and I tested it myself to confirm. I'll repeat my notes here:

Even though ipvsadm -lnc shows ESTABLISHED, if you try to actually send packets on the telnet connection they will get blackholed. A simple test is:

Start http server exposed as a service (replace nodeSelector as appropriate):

$ kubectl run james-http-server --image=python:3.7.0-stretch --port=1234 --expose --attach -it --overrides='{"spec": {"nodeSelector": {"kubernetes.io/hostname":"ip-172-20-32-191.eu-west-1.compute.internal"}}}'  -- bash 
$ python -m http.server 1234

Start a client shell, on a separate node:

$ kubectl run james-client2 --image=mikesplain/telnet --rm --attach -it --overrides='{"spec": {"nodeSelector": {"kubernetes.io/hostname":"ip-172-20-41-71.eu-west-1.compute.internal"}}}' --command -- sh
$ telnet <svcip> 1234

Then:

  1. In the client, type GET /, press enter once to keep the connection open.
  2. Delete the server pod
  3. Wait for the real server to be removed, takes a couple seconds in my testing. Observe that ipvsadm -lnc still shows ESTABLISHED.
  4. In the client, press enter to finish the request. Note that nothing happens! The packets are getting blackholed.

If you set a weight of 0 instead of removing the real server, in step (4) the request will succeed. It also works in iptables mode.

I hope this helps.

@jhorwit2
Contributor

jhorwit2 commented Jul 5, 2018

@jsravn what you described sounds like what should happen when expire_nodest_conn = 0(default).

https://www.kernel.org/doc/Documentation/networking/ipvs-sysctl.txt

@Lion-Wei

Lion-Wei commented Jul 6, 2018

When you set expire_nodest_conn = 1, after deleting the rs the connection will be closed immediately when the next packet arrives, instead of nothing happening.
But that's still not what we need... Seems like "set weight to 0" is the only way, though we'd prefer IPVS's native solution...
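Per the linked kernel document, the two behaviors can be summarized in a toy model. This is an assumption-laden illustration, not kernel code; `handlePacket` and its return strings are invented here purely to contrast the two sysctl settings for a packet whose destination real server was deleted.

```go
package main

import "fmt"

// handlePacket models what IPVS does with a packet on an existing connection,
// depending on whether the destination real server still exists and on the
// expire_nodest_conn sysctl (0 = default).
func handlePacket(destExists bool, expireNodestConn int) string {
	if destExists {
		return "forwarded"
	}
	if expireNodestConn != 0 {
		return "connection expired" // client is notified the connection closed
	}
	return "silently dropped" // packet blackholed, client sees nothing
}

func main() {
	fmt.Println(handlePacket(false, 0)) // the blackholing observed above
	fmt.Println(handlePacket(false, 1)) // immediate close on next packet
}
```

Neither setting preserves the connection, which is why quiescing with weight 0 is discussed instead.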

@jsravn
Contributor Author

jsravn commented Jul 9, 2018

@Lion-Wei do you need any help w/ the dev side of this? I can probably find some time for it. Although, it's still not clear to me what the proposed solution is.

Either

  1. Put a configurable timeout period into kube-proxy that is used to manage the IPVS connections (set weight 0 for the timeout period, then remove). This just fixes IPVS and is a little hacky, but works similarly to the iptables proxier (which relies on the iptables implementation to time out connections - 24h? for TCP). We would need to figure out how to persist the timeout state across kube-proxy restarts, or do something stateless like restarting the timer (perhaps with a shorter period after a restart).
  2. A more holistic solution that fixes this across all proxiers - like a new resource that expires after some TTL and represents gracefully terminating connections, which all the proxiers can use to reconcile against connection state. Maybe it can be tacked onto endpoints themselves somehow. Probably requires a proposal of some sort.
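One alternative to a fixed timeout, sketched here as an assumption rather than as anything from an actual PR: since IPVS reports connection counts per real server, a reaper could delete a weight-0 server only once its connections have drained. The `realServer` type and `reapQuiesced` helper below are invented for this sketch.

```go
package main

import "fmt"

// realServer is an illustrative stand-in for an IPVS real-server entry,
// carrying the weight and active-connection count IPVS exposes.
type realServer struct {
	weight     int
	activeConn int
}

// reapQuiesced removes servers that were drained (weight 0) and whose
// connections have all finished; servers still carrying traffic are kept,
// so no fixed timeout needs to be persisted across kube-proxy restarts.
func reapQuiesced(servers map[string]*realServer) {
	for addr, rs := range servers {
		if rs.weight == 0 && rs.activeConn == 0 {
			delete(servers, addr)
		}
	}
}

func main() {
	servers := map[string]*realServer{
		"172.17.0.4:80": {weight: 0, activeConn: 1}, // draining, still has a connection
		"172.17.0.5:80": {weight: 0, activeConn: 0}, // fully drained, safe to delete
		"172.17.0.6:80": {weight: 1, activeConn: 3}, // healthy
	}
	reapQuiesced(servers)
	fmt.Println(len(servers)) // only the fully drained entry is gone
}
```

A restart is also naturally handled: the reaper rediscovers weight-0 servers from the IPVS table itself rather than from in-memory timers.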

@m1093782566
Contributor

#64947?

@Lion-Wei

@jsravn Hi, I already sent a PR to fix this issue.
It doesn't make the graceful termination time configurable, but I think that's not necessary.
And maybe we need further discussion if we want iptables UDP to support this.
