
ipvs proxier doesn't respect graceful termination #57841

Closed
jsravn opened this issue Jan 4, 2018 · 32 comments · Fixed by #66012
Labels
area/ipvs kind/bug Categorizes issue or PR as related to a bug. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@jsravn
Contributor

jsravn commented Jan 4, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
Upon removing an endpoint, the ipvs proxier immediately deletes the ipvs real server, causing all connections to get dropped.

What you expected to happen:
It should allow the terminating pod to gracefully close connections, just like the iptables proxier.

How to reproduce it (as minimally and precisely as possible):

  1. Enable ipvs proxier
  2. Create a keepalive / long lived connection to a pod (e.g. while :; do echo -e "GET / HTTP/1.1\nhost: $host\n\n"; sleep 5; echo; done | telnet $serviceip 80)
  3. Delete that pod - observe the connection gets closed immediately, further requests will fail. On iptables proxier, it will continue to work (until the pod itself stops or closes the connection).

Anything else we need to know?:
The ipvs proxier should instead be setting weight to 0, then reaping the stale real servers after some time period (that should be greater than any pod's graceful termination time). This may also fix the existing bug around UDP connections getting dropped prematurely (#45976).

Environment:

  • Kubernetes version (use kubectl version): tested on 1.8, but same issue in 1.9 afaict
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): 4.4.0
  • Install tools:
  • Others:
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. kind/bug Categorizes issue or PR as related to a bug. labels Jan 4, 2018
@jsravn
Contributor Author

jsravn commented Jan 4, 2018

/cc @m1093782566

@k8s-ci-robot
Contributor

@jsravn: GitHub didn't allow me to assign the following users: sig-node.

Note that only kubernetes members can be assigned.

In response to this:

/assign sig-node
/cc @m1093782566

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jsravn
Contributor Author

jsravn commented Jan 4, 2018

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 4, 2018
@m1093782566
Contributor

m1093782566 commented Jan 5, 2018

@jsravn You can assign any IPVS related issue to me directly :)

/assign

/area ipvs

@m1093782566
Contributor

m1093782566 commented Jan 5, 2018

@jsravn

I have a question:

The ipvs proxier should instead be setting weight to 0

Why? Currently there is no way to pass the weight to IPVS proxier, so we set weight to 1 by default. What's the advantage of setting weight to 0? Will it resolve this issue?

@jsravn
Contributor Author

jsravn commented Jan 5, 2018

@m1093782566 Setting weight to 0 puts the real server into "maintenance mode" so to speak. The scheduler will stop sending new connections to it, but existing connections remain active. It's the correct way to gracefully remove a server from IPVS as far as I know.

So syncEndpoint on endpoint delete should:

  1. Set weight to 0 of the real server for the endpoint that is removed
  2. Create a delayed task to delete the real server after n seconds, where n > pod grace period (i.e. a sufficiently large delay, like 5 minutes).

This simulates the same behavior in iptables where TCP connections time out after some period of time rather than being forcefully removed by the proxier.

@m1093782566
Contributor

Great comment! Will try to fix it and see if it works.

Thanks

@jsravn
Contributor Author

jsravn commented Jan 5, 2018

Hopefully you can fix this for UDP connections too, since iptables proxier suffers from a bug where it drops the udp connection immediately (causing errors when kube-dns is restarted for instance...). It'd be nice if the ipvs proxier could handle that better.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2018
@jsravn
Contributor Author

jsravn commented Apr 5, 2018

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2018
@rramkumar1
Contributor

@m1093782566 Are you actively working on this? This is something that definitely needs to be fixed before IPVS GA.

@rramkumar1
Contributor

I think the iptables fix is at #60074

For IPVS we might need to implement a similar change.

@m1093782566
Contributor

m1093782566 commented May 2, 2018

@rramkumar1

This is something that definitely needs to be fixed before IPVS GA.

I slightly disagree... I don't think this issue is a big deal that should block IPVS GA (it only affects UDP connections), and the iptables proxy has the same issue. I am looking at the issue, though.

@jsravn
Contributor Author

jsravn commented May 3, 2018

I believe this affects TCP connections as well, since it immediately removes the IPVS entry as I described. iptables doesn't have this problem (except for UDP).

@jhorwit2
Contributor

jhorwit2 commented Jun 6, 2018

@m1093782566 @rramkumar1 is this a blocker for GA? This seems like a serious gap in functionality from iptables mode especially with IPVS becoming the new default.

Specifically referring to dropping TCP connections.

@m1093782566
Contributor

@jhorwit2

I don't think this issue should be a blocker for IPVS GA. Per discussion, the graduation criteria are:

a) CIs are green

b) necessary documents are available there

Please note that iptables is still the default mode even though IPVS has become GA.

Anyway, my team will take a look at this issue.

cc @Lion-Wei @islinwb @stewart-yu

@rramkumar1
Contributor

rramkumar1 commented Jun 6, 2018 via email

@Lion-Wei

Lion-Wei commented Jun 6, 2018

The problem is we don't know whether this is what's needed. I mean, some people want this graceful termination, some people don't.
Even in iptables proxy mode, UDP connections don't get graceful termination, and people don't seem to be in a rush to fix it.

@m1093782566
Contributor

m1093782566 commented Jun 6, 2018

I think it can be an interesting topic for the next sig-network call?

@jhorwit2
Contributor

jhorwit2 commented Jun 6, 2018

@rramkumar1 I don't see how not supporting graceful termination of applications with ipvs gives users what they expect from the kubernetes platform in terms of production-ready solutions. This bug alone will block us from using the feature, since it would start causing noticeable impact immediately. IMO, if it isn't fixed prior to GA, then this lack of functionality should be made more visible in the docs and release notes.

@rramkumar1
Contributor

@jhorwit2 I am not opposed to adding this caveat to release notes/docs. However, given the current state I'm not sure it's worth blocking GA for one more whole cycle just for this when we can just push a fix in a patch release. @m1093782566 if possible, can your team prioritize coming up with a fix that maybe we can squeeze into 1.11.0 rather than wait for the first patch?

@m1093782566
Contributor

m1093782566 commented Jun 6, 2018

Sure, my team is looking at this issue. Will let you all know if we make any progress.

@jsravn
Contributor Author

jsravn commented Jun 18, 2018

I think it will break a lot of people's expectations if IPVS doesn't support graceful termination with at least TCP - this is the current behavior with iptables and userspace mode.

@kfox1111

Thank you for the release note. It saved me a bunch of time testing something that was not ready yet.

@Lion-Wei

Lion-Wei commented Jul 3, 2018

@jsravn @m1093782566 @rramkumar1 @jhorwit2 Hi, guys. Recently I have been testing this issue and got some results. Here are my test steps:

  1. Create a pod with lifecycle/preStop specified and terminationGracePeriodSeconds=300. Expose this pod to a service.

    apiVersion: v1
    kind: Pod
    ...
        lifecycle:
          preStop:
            exec:
              command:
              - /bin/sh
              - -c
              - sleep 300
    ...
      terminationGracePeriodSeconds: 300
  2. Check this pod and service.

    # k get svc
    NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
    sourceip     ClusterIP   10.0.0.175   <none>        8080/TCP   8s
    # k get pod -owide
    NAME           READY     STATUS    RESTARTS   AGE       IP           NODE
    sourceip       1/1       Running   0          45s       172.17.0.4   127.0.0.1
  3. Use telnet to test the service clusterIP connection.

    # time telnet 10.0.0.133 8080
    Trying 10.0.0.133...
    Connected to 10.0.0.133.
    Escape character is '^]'.
    Connection closed by foreign host.	
  4. Use kubectl delete pod; given what we specified in the pod's preStop, this pod should stay in Terminating status for 300s.

    # k delete pod sourceip
    pod "sourceip" deleted
    # k get pod
    NAME           READY     STATUS        RESTARTS   AGE
    sourceip       1/1       Terminating   0          3m
  5. Check ipvs rules.

    # ipvs real server have been deleted
    # ipvsadm -l
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  10.0.0.1:https rr
      -> 10.162.199.82:6443           Masq    1      1          0
    TCP  10.0.0.10:domain rr
      -> 172.17.0.4:domain            Masq    1      0          0
    TCP  10.0.0.133:http-alt rr
    UDP  10.0.0.10:domain rr
      -> 172.17.0.4:domain            Masq    1      0          0
    # check ipvs connections
    # ipvsadm -lnc
    IPVS connection entries
    pro expire state       source             virtual            destination
    TCP 14:06  ESTABLISHED 10.0.0.133:58441   10.0.0.133:8080    172.17.0.5:80

    We can see the ipvs real server has been deleted, but the telnet connection is still up, in ESTABLISHED state with an expire time of 14m06s.

  6. After 5 min, the container is deleted and the telnet connection closes. Check the ipvsadm connections.

    # telnet returned
    # time telnet 10.0.0.133 8080
    Trying 10.0.0.133...
    Connected to 10.0.0.133.
    Escape character is '^]'.
    Connection closed by foreign host.
    
    real    5m12.514s
    user    0m0.000s
    sys     0m0.003s
    
    # ipvs connection in FIN_WAIT status, with expire time.
    # ipvsadm -lnc
    IPVS connection entries
    pro expire state       source             virtual            destination
    TCP 03:56  FIN_WAIT    10.0.0.133:59648   10.0.0.133:8080    172.17.0.5:80

So, according to my test, I think the ipvs proxier does have graceful termination for long-lived connections. If you have any questions or suggestions about the test process, please let me know. I'd also welcome other test procedures from other people.

@m1093782566
Contributor

Impressive test, @Lion-Wei

@jsravn @jhorwit2 please provide your feedback.

@jsravn
Contributor Author

jsravn commented Jul 5, 2018

@m1093782566

I discussed this with @Lion-Wei on slack, and I tested it myself to confirm. I'll repeat my notes here:

Even though ipvsadm -lnc shows ESTABLISHED, if you try to actually send packets on the telnet connection they will get blackholed. A simple test is:

Start http server exposed as a service (replace nodeSelector as appropriate):

$ kubectl run james-http-server --image=python:3.7.0-stretch --port=1234 --expose --attach -it --overrides='{"spec": {"nodeSelector": {"kubernetes.io/hostname":"ip-172-20-32-191.eu-west-1.compute.internal"}}}'  -- bash 
$ python -m http.server 1234

Start a client shell, on a separate node:

$ kubectl run james-client2 --image=mikesplain/telnet --rm --attach -it --overrides='{"spec": {"nodeSelector": {"kubernetes.io/hostname":"ip-172-20-41-71.eu-west-1.compute.internal"}}}' --command -- sh
$ telnet <svcip> 1234

Then:

  1. In the client, type GET /, press enter once to keep the connection open.
  2. Delete the server pod
  3. Wait for the real server to be removed, takes a couple seconds in my testing. Observe that ipvsadm -lnc still shows ESTABLISHED.
  4. In the client, press enter to finish the request. Note that nothing happens! The packets are getting blackholed.

If you set a weight of 0 instead of removing the real server, in step (4) the request will succeed. It also works in iptables mode.

I hope this helps.

@jhorwit2
Contributor

jhorwit2 commented Jul 5, 2018

@jsravn what you described sounds like what should happen when expire_nodest_conn = 0(default).

https://www.kernel.org/doc/Documentation/networking/ipvs-sysctl.txt

@Lion-Wei

Lion-Wei commented Jul 6, 2018

When you set expire_nodest_conn = 1, after deleting the rs the connection will be closed immediately when the next packet arrives, instead of nothing happening.
But that's still not what we need... Seems like "set weight to 0" is the only way, though we'd prefer IPVS's native solution...
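Per the linked kernel document, the two behaviors can be summarized in a toy model. This is an assumption-laden illustration, not kernel code; `handlePacket` and its return strings are invented here purely to contrast the two sysctl settings for a packet whose destination real server was deleted.

```go
package main

import "fmt"

// handlePacket models what IPVS does with a packet on an existing connection,
// depending on whether the destination real server still exists and on the
// expire_nodest_conn sysctl (0 = default).
func handlePacket(destExists bool, expireNodestConn int) string {
	if destExists {
		return "forwarded"
	}
	if expireNodestConn != 0 {
		return "connection expired" // client is notified the connection closed
	}
	return "silently dropped" // packet blackholed, client sees nothing
}

func main() {
	fmt.Println(handlePacket(false, 0)) // the blackholing observed above
	fmt.Println(handlePacket(false, 1)) // immediate close on next packet
}
```

Neither setting preserves the connection, which is why quiescing with weight 0 is discussed instead.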

@jsravn
Contributor Author

jsravn commented Jul 9, 2018

@Lion-Wei do you need any help w/ the dev side of this? I can probably find some time for it. Although, it's still not clear to me what the proposed solution is.

Either

  1. Put a configurable timeout period into kube-proxy that is used to manage the IPVS connections (set weight 0 for the timeout period, then remove). This just fixes IPVS and is a little hacky, but works similarly to the iptables proxier (which relies on the iptables implementation to time out connections - 24h? for TCP). We would need to figure out how to persist the timeout state across kube-proxy restarts, or do something stateless like restarting the timer (perhaps with a shorter period after a restart).
  2. A more holistic solution that fixes this across all proxiers - like a new resource that expires after some TTL and represents gracefully terminating connections, which all the proxiers can use to reconcile against connection state. Maybe it can be tacked onto endpoints themselves somehow. Probably requires a proposal of some sort.
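One alternative to a fixed timeout, sketched here as an assumption rather than as anything from an actual PR: since IPVS reports connection counts per real server, a reaper could delete a weight-0 server only once its connections have drained. The `realServer` type and `reapQuiesced` helper below are invented for this sketch.

```go
package main

import "fmt"

// realServer is an illustrative stand-in for an IPVS real-server entry,
// carrying the weight and active-connection count IPVS exposes.
type realServer struct {
	weight     int
	activeConn int
}

// reapQuiesced removes servers that were drained (weight 0) and whose
// connections have all finished; servers still carrying traffic are kept,
// so no fixed timeout needs to be persisted across kube-proxy restarts.
func reapQuiesced(servers map[string]*realServer) {
	for addr, rs := range servers {
		if rs.weight == 0 && rs.activeConn == 0 {
			delete(servers, addr)
		}
	}
}

func main() {
	servers := map[string]*realServer{
		"172.17.0.4:80": {weight: 0, activeConn: 1}, // draining, still has a connection
		"172.17.0.5:80": {weight: 0, activeConn: 0}, // fully drained, safe to delete
		"172.17.0.6:80": {weight: 1, activeConn: 3}, // healthy
	}
	reapQuiesced(servers)
	fmt.Println(len(servers)) // only the fully drained entry is gone
}
```

A restart is also naturally handled: the reaper rediscovers weight-0 servers from the IPVS table itself rather than from in-memory timers.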

@m1093782566
Contributor

#64947?

@Lion-Wei

@jsravn Hi, I already sent a PR to fix this issue.
It doesn't make the graceful termination time configurable, but I think that's not necessary.
And maybe we need further discussion if we want iptables UDP to support this.
