Implement IPVS-based in-cluster service load balancing #44063

Closed
quinton-hoole opened this Issue Apr 4, 2017 · 49 comments

quinton-hoole (Member) commented Apr 4, 2017

At KubeCon Europe in Berlin last week I presented some work we've done at Huawei scaling Kubernetes in-cluster load balancing to 50,000+ services, the challenges of doing this with the current iptables approach, and what we've achieved with an alternative IPVS-based approach. iptables is designed for firewalling and is based on in-kernel rule lists, while IPVS is designed for load balancing and is based on in-kernel hash tables. IPVS also supports more sophisticated load-balancing algorithms than iptables (least load, least connections, locality, weighted) as well as other useful features (e.g. health checking and retries).

After the presentation, there was strong support (a.k.a. a riot :-) ) for us to open source this work, which we are happy to do. We can use this issue to track that.

For those who were not able to be there, here is the video:

https://youtu.be/c7d_kD2eH4w

And the slides:

https://docs.google.com/presentation/d/1BaIAywY2qqeHtyGZtlyAp89JIZs59MZLKcFLxKE6LyM/edit?usp=sharing

We will follow up on this with a more formal design proposal and a set of PRs, but in summary: we added about 680 lines of code to the existing ~12,000 lines of kube-proxy (~5%) and added a third mode to its command-line flag (mode=IPVS, alongside the existing mode=userspace and mode=iptables).
The performance improvement for load-balancer updates is dramatic (update latency drops from hours per rule to about 2 ms per rule). Network latency and its variability also drop dramatically for large numbers of services.
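
For intuition only (a toy Go sketch of the data structures, not kube-proxy code): matching against an iptables-style rule list degrades linearly with the number of services, while an IPVS-style hash-table lookup stays roughly flat.

```go
package main

import "fmt"

type rule struct {
	vip  string
	port int
}

// linearMatch mimics an iptables-style rule list: each packet is compared
// against the rules one by one until a match is found, so cost grows with
// the number of services.
func linearMatch(rules []rule, vip string, port int) bool {
	for _, r := range rules {
		if r.vip == vip && r.port == port {
			return true
		}
	}
	return false
}

// hashMatch mimics an IPVS-style hash table keyed on (vip, port): lookup
// cost is roughly independent of the number of services.
func hashMatch(table map[rule]bool, vip string, port int) bool {
	return table[rule{vip, port}]
}

func main() {
	const n = 50000
	rules := make([]rule, 0, n)
	table := make(map[rule]bool, n)
	for i := 0; i < n; i++ {
		r := rule{vip: fmt.Sprintf("10.0.%d.%d", i/256, i%256), port: 80}
		rules = append(rules, r)
		table[r] = true
	}
	target := rules[n-1]
	fmt.Println(linearMatch(rules, target.vip, target.port)) // worst case: scans all 50,000 rules
	fmt.Println(hashMatch(table, target.vip, target.port))   // single hash lookup
}
```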

@kubernetes/sig-network-feature-requests
@kubernetes/sig-scalability-feature-requests
@thockin
@wojtek-t

gyliu513 (Member) commented Apr 5, 2017

Even though Kubernetes 1.6 supports 5,000 nodes, kube-proxy in iptables mode is a real bottleneck at that scale. For example, with NodePort Services in a 5,000-node cluster, 2,000 Services with 10 pods each produce roughly 20,000 iptables records on every worker node, which can keep the kernel very busy. IPVS-based in-cluster service load balancing would help a lot in such cases.
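
To make the arithmetic concrete, here is a rough model of the per-node rule count (my own illustration, not an exact accounting of kube-proxy's chains):

```go
package main

import "fmt"

// estimateRules is a rough model of kube-proxy's iptables footprint per
// node: roughly one service-chain entry per Service plus one endpoint rule
// per backend pod. Real chains (KUBE-SVC-*, KUBE-SEP-*, NodePort handling)
// add a constant factor; the growth rate is the point.
func estimateRules(services, podsPerService int) int {
	return services + services*podsPerService
}

func main() {
	// The scenario above: 2,000 Services with 10 pods each.
	fmt.Println(estimateRules(2000, 10)) // ~22,000 rules on every node
}
```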

ravilr (Contributor) commented Apr 5, 2017

@quinton-hoole is your implementation using IPVS in NAT mode or direct-routing mode?

haibinxie commented Apr 5, 2017

@ravilr it's NAT mode.

timothysc (Member) commented Apr 5, 2017

gr8 work, the iptables issues have been a problem for a while.

Re: flow-based scheduling, happy to help get the Firmament scheduler in place. ;-)
/cc @kubernetes/sig-scheduling-feature-requests

resouer (Member) commented Apr 6, 2017

@timothysc I have been following Firmament for a while, but I don't quite see the value it adds to Kubernetes. Would you mind explaining what problems flow-based scheduling solves that the current Kubernetes scheduler does not?

timothysc (Member) commented Apr 6, 2017

@resouer Speed at scale, and rescheduling.

From @quinton-hoole's talk, linked above, it looks like Huawei has been prototyping this.

quinton-hoole (Member) commented Apr 6, 2017

@resouer @timothysc Yes, I can confirm that we're working on a Firmament scheduler, and will upstream it as soon as it's in good enough shape. We might have an initial implementation in the next few weeks.

deepak-vij (Member) commented Apr 6, 2017

Hi folks, we are currently working on implementing the Firmament scheduler as part of the Kubernetes scheduling environment. We will create a separate issue to track progress and provide updates. Thanks.

sureshvis commented Apr 6, 2017

@quinton-hoole thanks for sharing. Waiting to see the design proposal.

Regarding health checks: does every worker run health checks across all pods to keep its table up to date? How are you planning to handle this at scale?


quinton-hoole (Member) commented Apr 6, 2017

To be clear, @haibinxie did all the hard work here. Please direct questions to him.

MikeSpreitzer (Collaborator) commented Apr 6, 2017

IPVS only deals with IP, not transport protocols, right? A k8s Service can include a port transformation: a Service object distinguishes between port and targetPort in the items of its spec.ports list, and an Endpoints object also has a ports.port field in the items of its subsets list. Can your implementation handle this generality, and if not, what happens when the user asks for it?
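
For reference, this is the remapping in question, expressed with the current Kubernetes Go API types (a minimal sketch, not code from the IPVS proposal): clients hit the Service's port on the cluster IP, and the proxy must rewrite the destination to targetPort on the backing pod.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Clients connect to <ClusterIP>:80; the data path must rewrite the
	// destination to <podIP>:8080, so IP-level forwarding alone is not
	// enough: port translation is required.
	svc := corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "web"},
		Spec: corev1.ServiceSpec{
			Ports: []corev1.ServicePort{{
				Name:       "http",
				Protocol:   corev1.ProtocolTCP,
				Port:       80,                   // Service (cluster IP) port
				TargetPort: intstr.FromInt(8080), // pod port, may differ
			}},
		},
	}
	fmt.Printf("%s: %d -> %s\n", svc.Name, svc.Spec.Ports[0].Port, svc.Spec.Ports[0].TargetPort.String())
}
```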

haibinxie commented Apr 6, 2017

@MikeSpreitzer Port transformation is well supported.


thockin (Member) commented Apr 10, 2017

Port mapping is handled in NAT mode (called masquerade by IPVS, sadly). As an optimization, a future followup could enable direct-return mode for environments that support it for services that do not do remapping. We'd have to add service IPs as local addresses in pods, which we may want to do anyway.

thockin (Member) commented Apr 10, 2017

Last comment for the record here, though I have said it elsewhere.

I am very much in favor of an IPVS implementation. We have somewhat more than JUST load-balancing in our iptables (session affinity, firewalls, hairpin-masquerade tricks), but I believe those can all be overcome.

We also have been asked, several times, to add support for port ranges to Services, up to and including a whole IP. The obvious way to add this would also support remapping, though it is not at all clear how NodePorts would work. IPVS, as far as I know, has no facility for exposing ranges of ports.

ChenLingPeng (Contributor) commented Apr 10, 2017

In IPVS mode, we have to add all the Service addresses to a host device such as lo or ethX, am I right?
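
For illustration, a minimal sketch of what binding Service VIPs to a local device looks like (my own example; it assumes the vishvananda/netlink Go package, a dummy interface named kube-ipvs0, and a made-up cluster IP, none of which is specified in this thread):

```go
package main

import (
	"log"

	"github.com/vishvananda/netlink"
)

func main() {
	// Create (or reuse) a dummy device to hold Service VIPs; in NAT mode the
	// kernel must treat each cluster IP as a local address so that packets
	// destined for the VIP are accepted and handed to IPVS.
	dummy := &netlink.Dummy{LinkAttrs: netlink.LinkAttrs{Name: "kube-ipvs0"}}
	if err := netlink.LinkAdd(dummy); err != nil {
		log.Printf("link add (may already exist): %v", err)
	}
	link, err := netlink.LinkByName("kube-ipvs0")
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.LinkSetUp(link); err != nil {
		log.Fatal(err)
	}
	// Bind one hypothetical Service cluster IP to the device.
	addr, err := netlink.ParseAddr("10.96.0.10/32")
	if err != nil {
		log.Fatal(err)
	}
	if err := netlink.AddrAdd(link, addr); err != nil {
		log.Printf("addr add (may already exist): %v", err)
	}
}
```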

warmchang (Contributor) commented May 7, 2017

Nice job!

haibinxie commented May 9, 2017

Hi all, I put together a proposal for the alpha version of the IPVS implementation, hoping to get it into Kubernetes 1.7. I need your feedback.

https://docs.google.com/document/d/1YEBWR4EWeCEWwxufXzRM0e82l_lYYzIXQiSayGaVQ8M/edit?usp=sharing

@kubernetes/sig-network-feature-requests
@kubernetes/sig-scalability-feature-requests
@thockin
@wojtek-t

dhilipkumars (Member) commented May 22, 2017

FYI: if Docker accepts this PR, we may be able to lose the seesaw (libnl.so) dependency. docker/libnetwork#1770

gyliu513 (Member) commented May 25, 2017

Can kube-router help with this?


dujun1990 commented May 28, 2017

@thockin @quinton-hoole

The initial PR #46580 has been sent out. PTAL.

dhilipkumars (Member) commented May 28, 2017

@thockin Originally we were relying on the seesaw library and planned to move to a pure Go implementation as phase 2 (probably in 1.8). Because of the complexities introduced by the libnl.so dependency, last week we decided to move away from seesaw. Docker's libnetwork had a good set of IPVS APIs but was missing the GETXXX() methods. We quickly contributed those to libnetwork and that change was merged three days ago. Now we have vendored libnetwork. PTAL.
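
For anyone curious, a minimal sketch of driving IPVS through the libnetwork ipvs package (my own illustration under the assumption of that package's API; the VIP, backend address, and ports are made up): create a virtual service for a cluster IP and attach a NAT-mode backend.

```go
package main

import (
	"log"
	"net"
	"syscall"

	"github.com/docker/libnetwork/ipvs"
)

func main() {
	// Requires Linux with the ip_vs kernel module and root privileges.
	handle, err := ipvs.New("") // "" = current network namespace
	if err != nil {
		log.Fatal(err)
	}

	// A virtual service on a hypothetical cluster IP, round-robin scheduling.
	svc := &ipvs.Service{
		Address:       net.ParseIP("10.96.0.10"),
		AddressFamily: syscall.AF_INET,
		Protocol:      syscall.IPPROTO_TCP,
		Port:          80,
		SchedName:     "rr", // round robin
	}
	if err := handle.NewService(svc); err != nil {
		log.Fatal(err)
	}

	// One backend pod, forwarded in NAT ("masquerade") mode.
	dst := &ipvs.Destination{
		Address:         net.ParseIP("10.244.1.5"),
		Port:            8080,
		Weight:          1,
		ConnectionFlags: 0x0000, // IP_VS_CONN_F_MASQ: NAT/masquerade forwarding
	}
	if err := handle.NewDestination(svc, dst); err != nil {
		log.Fatal(err)
	}
}
```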

k8s-merge-robot added a commit that referenced this issue Aug 30, 2017

Merge pull request #46580 from Huawei-PaaS/kube-proxy-ipvs-pr
Automatic merge from submit-queue (batch tested with PRs 51377, 46580, 50998, 51466, 49749)

Implement IPVS-based in-cluster service load balancing

**What this PR does / why we need it**:

Implement IPVS-based in-cluster service load balancing. It provides a performance enhancement and some other benefits to kube-proxy compared with the iptables and userspace modes. It also supports more sophisticated load-balancing algorithms than iptables (least connections, weighted, hash, and so on).

**Which issue this PR fixes**

#17470 #44063

**Special notes for your reviewer**:


* Since the PR is a bit large, I split it and moved the commits related to the ipvs util pkg to PR #48994, which hopefully makes it easier to review.

@thockin @quinton-hoole @kevin-wangzefeng @deepak-vij @haibinxie @dhilipkumars @fisherxu 

**Release note**:

```release-note
Implement IPVS-based in-cluster service load balancing
```
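
For reference, the scheduler short names the kernel's IPVS exposes, which the algorithms mentioned above map onto (my own summary; not all of these are necessarily wired into the initial kube-proxy implementation):

```go
package main

import "fmt"

// ipvsSchedulers maps the kernel IPVS scheduler short names (as passed to
// ipvsadm -s or via the netlink API) to what they do.
var ipvsSchedulers = map[string]string{
	"rr":  "round robin",
	"wrr": "weighted round robin",
	"lc":  "least connections",
	"wlc": "weighted least connections",
	"sh":  "source hashing (source-IP affinity)",
	"dh":  "destination hashing",
	"sed": "shortest expected delay",
	"nq":  "never queue",
}

func main() {
	for name, desc := range ipvsSchedulers {
		fmt.Printf("%-4s %s\n", name, desc)
	}
}
```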
m1093782566 (Member) commented Nov 20, 2017

/area ipvs

m1093782566 (Member) commented Dec 7, 2017

IPVS-based kube-proxy is in beta phase now.

fejta-bot commented Mar 7, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

fejta-bot commented Apr 15, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot commented May 15, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

chrishiestand (Contributor) commented May 30, 2018

With this issue closed as stale, is there a better issue to follow progress of adding ipvs scheduling algorithms to individual kubernetes services? I couldn't find another issue that explicitly covers this part of the ipvs roadmap.
