
kube-proxy iptables load-balancing probability is not equal #37932

Closed
JackTiger opened this issue Dec 2, 2016 · 28 comments

Comments

@JackTiger

Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.7", GitCommit:"a2cba278cba1f6881bb0a7704d9cac6fca6ed435", GitTreeState:"clean", BuildDate:"2016-09-12T23:15:30Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.7", GitCommit:"a2cba278cba1f6881bb0a7704d9cac6fca6ed435", GitTreeState:"clean", BuildDate:"2016-09-12T23:08:43Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}

Environment:

Cloud provider or hardware configuration:
Amazon AWS
OS (e.g. from /etc/os-release):
CentOS Linux 7 (Core)
Kernel (e.g. uname -a):
Linux bastion 4.7.5-1.el7.elrepo.x86_64 #1 SMP Sat Sep 24 11:54:29 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Custom install
Others:
What happened:

I've created a deployment (nginx) with 5 replicas like so:
kubectl run nginx --image=nginx --replicas=5

I've exposed a service like so:
kubectl expose deployment nginx --port=80 --target-port=80

On a worker node, I see this in iptables for this service:
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-W565FQEHFWB3IXKJ
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-EZJQVRVACWHUOBBQ
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-FPX6KRWBZCCVJJV6
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-PHUVLZZC77CF4GSN
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -j KUBE-SEP-DCNJBZVVUJKLG55E

Those probabilities look really odd.

What you expected to happen:

All probabilities should be 0.2.

How to reproduce it (as minimally and precisely as possible):

See above.

Anything else we need to know:

Looks like it's line 953 in pkg/proxy/iptables/proxier.go.

It seems that there is some problem with the kube-proxy iptables rules.

// Now write loadbalancing & DNAT rules.
n := len(endpointChains)
for i, endpointChain := range endpointChains {
	// Balancing rules in the per-service chain.
	args := []string{
		"-A", string(svcChain),
		"-m", "comment", "--comment", svcName.String(),
	}
	if i < (n - 1) {
		// Each rule is a probabilistic match.
		args = append(args,
			"-m", "statistic",
			"--mode", "random",
			"--probability", fmt.Sprintf("%0.5f", 1.0/float64(n-i)))
	}

In the code above, the probabilities are not equal, and they add up to more than 1. For example:

I have 5 pods behind one service, and when I run iptables-save to look at the chain, the result is as below:

-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-E4QKA7SLJRFZZ2DD 
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-LZ7EGMG4DRXMY26H 
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-RKIFTWKKG3OHTTMI 
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-CGDKBCNM24SZWCMS
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -j KUBE-SEP-RI4SRNQQXWSTGE2Y

I find the probabilities are 1/5, 1/4, 1/3, and 1/2, so the KUBE-SEP-CGDKBCNM24SZWCMS rule has the largest probability. In my tests, when I send requests to the service at the same time, that pod often handles 3 of the requests, so I cannot get every pod to handle exactly one task at a time.

@MrHohn
Member

MrHohn commented Dec 5, 2016

This is intended because these iptables rules will be examined from top to bottom.

Take your case as an example:

-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-E4QKA7SLJRFZZ2DD 
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-LZ7EGMG4DRXMY26H 
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-RKIFTWKKG3OHTTMI 
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-CGDKBCNM24SZWCMS
-A KUBE-SVC-LI77LBOOMGYET5US -m comment --comment "default/showreadiness:showreadiness" -j KUBE-SEP-RI4SRNQQXWSTGE2Y

When it reaches the first rule, there are still 5 possible endpoints to choose from, so the probability should be set to 1/5 to achieve equality.
If the first endpoint has not been chosen, there are 4 possible endpoints left to choose from, so the probability should now be 1/4. And so on.
In the end, the probability of reaching each of these endpoints works out to:

1st endpoint: 1/5
2nd endpoint: 4/5 * 1/4 = 1/5
3rd endpoint: 4/5 * 3/4 * 1/3 = 1/5
4th endpoint: 4/5 * 3/4 * 2/3 * 1/2 = 1/5
5th endpoint: 4/5 * 3/4 * 2/3 * 1/2 * 1 = 1/5
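
If it helps, here is a quick way to convince yourself of that (just a sketch in Go, not the kube-proxy code): simulate the cascade, where rule i matches with probability 1/(n-i) and the last rule is an unconditional fallback. Each endpoint ends up selected about 20% of the time.

package main

import (
	"fmt"
	"math/rand"
)

// pick walks the cascading rules: rule i matches with probability
// 1/(n-i); the last rule always matches.
func pick(n int) int {
	for i := 0; i < n-1; i++ {
		if rand.Float64() < 1.0/float64(n-i) {
			return i
		}
	}
	return n - 1
}

func main() {
	const n, trials = 5, 1000000
	counts := make([]int, n)
	for t := 0; t < trials; t++ {
		counts[pick(n)]++
	}
	for i, c := range counts {
		fmt.Printf("endpoint %d: %.3f\n", i, float64(c)/trials)
	}
}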

@JackTiger
Author

Hi @MrHohn, thank you for your detailed answer. As you said, with 5 pods the probability for each pod is 1/5. But in my actual test I run the same server in every pod to receive RESTful HTTP requests, and when I send requests from the client at the same time, not all of the pods handle a request; from the logs, some requests are handled by the same pod. So I do not understand why, if the probabilities are equal, not every pod receives a request.

Thank you very much again!

@MrHohn
Member

MrHohn commented Dec 6, 2016

Hi @JackTiger, could you provide more details about:

  • How many clients did you run, and where were they running? (If unfairness really happened, I would suspect it is the result of connection tracking.)
  • How many requests did the clients send? (Unfairness can easily show up when sampling a relatively small number of requests.)
  • What kind of load distribution did you get? Does it follow a specific pattern, or is it just random?

@JackTiger
Author

Hi, @MrHohn,

I use curl to send the HTTP requests from the local host, as below:

curl 192.168.3.163:9090 & curl 192.168.3.163:9090 & curl 192.168.3.163:9090 & curl 192.168.3.163:9090 &

and my service description is like this:

root@SZV1000050172:/opt/bin/k8s/healthy# kubectl describe svc showreadiness
Name:			showreadiness
Namespace:		default
Labels:			app=showreadiness
Selector:		app=showreadiness
Type:			NodePort
IP:			192.168.3.163
Port:			showreadiness	9090/TCP
NodePort:		showreadiness	32090/TCP
Endpoints:		172.16.37.4:9090,172.16.37.5:9090,172.16.54.3:9090 + 2 more...
Session Affinity:	None
No events.

The service resource file I created is as below:

root@SZV1000050172:/opt/bin/k8s/healthy# cat showreadiness-svc.yaml 
apiVersion: v1
kind: Service
metadata:
  name: showreadiness
  labels:
    app: showreadiness
spec:
  type: NodePort
  sessionAffinity: None
  ports:
  - name: showreadiness
    port: 9090 # Internal Port
    targetPort: 9090 # External port
    nodePort: 32090
    protocol: TCP 
  selector:
    app: showreadiness

I have tried many times and the result is the same. From the logs:

2016-12-06T03:03:44.400828413Z 2016/12/06 03:03:44 I am serving traffic!!!
2016-12-06T03:03:44.402812113Z 2016/12/06 03:03:44 I am serving traffic!!!

One of the endpoints handled the two requests at the same time. I am using the default (iptables) load-balancing rules.

@MrHohn
Member

MrHohn commented Dec 6, 2016

Correct me if I misunderstood the situation: the curl command sent out 4 requests to the service IP at roughly the same time, and two of them were served by one of the endpoints.

I would say that is normal, because we are not doing round-robin load balancing here. The probability of 2 out of 4 requests being served by the same endpoint is not that low.

I suggest running this curl command in a loop that repeats more than 1000 times and checking how many requests each endpoint serves. I believe the result will be much fairer.
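
For a sense of scale: with 5 equally likely endpoints, the chance that 4 simultaneous requests all land on distinct endpoints is 5*4*3*2 / 5^4, about 19%, so some doubling-up is the common case rather than the exception. A quick check (just a throwaway Go snippet):

package main

import "fmt"

func main() {
	// Probability that 4 requests over 5 equally likely endpoints
	// all hit distinct endpoints: 5/5 * 4/5 * 3/5 * 2/5.
	distinct := 1.0
	for i := 0; i < 4; i++ {
		distinct *= float64(5-i) / 5.0
	}
	fmt.Printf("all distinct: %.3f, at least two share an endpoint: %.3f\n", distinct, 1-distinct)
}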

@JackTiger
Author

@MrHohn OK, thanks, I will try your suggestion. By the way, if I want round-robin load balancing for my service, what should I do? Do you have any suggestions? Each pod handles a single task that may take a long time to execute, so when several requests arrive at the same time I want them handled by more pods, not concentrated on just a few. Ideally the load balancing would use a round-robin method rather than a random one.

@MrHohn
Member

MrHohn commented Dec 6, 2016

I think the userspace proxy mode for kube-proxy does round robin when choosing backends. But since the iptables proxy mode should be faster and more reliable than the userspace proxy, I don't recommend switching back just to get round robin.

I would suggest keeping this logic in the application layer: the server should know what its capacity is and not take more requests than it can handle.
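
For example (just a sketch of that idea, nothing kube-specific): cap the number of in-flight requests with a semaphore and reject the rest with a 503, so a pod never takes on more work than it can handle.

package main

import (
	"fmt"
	"net/http"
	"time"
)

// Hypothetical capacity: one long-running task at a time.
const maxInFlight = 1

var slots = make(chan struct{}, maxInFlight)

func work(w http.ResponseWriter, r *http.Request) {
	select {
	case slots <- struct{}{}: // acquire a slot
		defer func() { <-slots }() // release it when done
		time.Sleep(5 * time.Second) // stand-in for the long-running task
		fmt.Fprintln(w, "done")
	default:
		// At capacity: let the client (or whatever sits in front) retry elsewhere.
		http.Error(w, "busy, try again", http.StatusServiceUnavailable)
	}
}

func main() {
	http.HandleFunc("/", work)
	http.ListenAndServe(":9090", nil)
}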

@MrHohn
Member

MrHohn commented Dec 6, 2016

You may also be able to use the readiness probe feature. If you don't want any more requests forwarded to backends that are already serving enough traffic, mark them as unready and they will be removed from the service endpoints. Mark them as ready again when they are done and the endpoints will show up again.

This may not be a general solution, just an idea.
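
A rough sketch of that in Go (the /readyz path, port, and timings are made up; the real values depend on your pod spec): report 503 on the readiness endpoint while the pod is busy, and point the container's readinessProbe at it. kube-proxy then drops the pod from the service endpoints until it reports ready again.

package main

import (
	"net/http"
	"sync/atomic"
	"time"
)

var busy atomic.Bool // true while this pod is working on its single task

func readyz(w http.ResponseWriter, r *http.Request) {
	if busy.Load() {
		// Failing the readinessProbe removes this endpoint from the Service.
		http.Error(w, "busy", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func task(w http.ResponseWriter, r *http.Request) {
	if !busy.CompareAndSwap(false, true) {
		http.Error(w, "already working", http.StatusServiceUnavailable)
		return
	}
	defer busy.Store(false)
	time.Sleep(30 * time.Second) // stand-in for the long-running job
	w.Write([]byte("done\n"))
}

func main() {
	http.HandleFunc("/readyz", readyz)
	http.HandleFunc("/task", task)
	http.ListenAndServe(":9090", nil)
}

Paired with a readinessProbe (httpGet on /readyz) in the container spec, the endpoint disappears from the iptables rules while the pod is busy and reappears once it is done.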

@JackTiger
Author

OK, thanks. I am using the readiness probe feature in k8s now; it may be the only way to meet my requirements. Otherwise I would replace kube-proxy; I have seen people use haproxy in place of kube-proxy, since kube-proxy's load balancing is fairly limited. I will try that in the future.

Thanks very much again!

@jsravn
Contributor

jsravn commented Dec 9, 2016

@JackTiger can you link the haproxy solution? kube-proxy, unfortunately, isn't really fit for my purposes (heavy use of persistent connections).

@JackTiger
Author

@jsravn I am sorry, I have not found a link for the haproxy solution, but I will try that approach to verify its load-balancing behaviour in the future. It will take me a while, because I am not very familiar with haproxy.

@bmarks-mylo

I am seeing something similar. I have 5 hostname pods (gcr.io/google_containers/serve_hostname:1.3) behind a NodePort service with sessionAffinity: None. When I run:
for i in `seq 1 100`; do curl -s sandbox-hostnames-redacted.us-west-2.elb.amazonaws.com; echo; done | sort | uniq -c
I always get a single unique host. If I jump into a browser and furiously refresh the URL, I can get at most two hosts. Running this command from several consoles doesn't produce more than one host either.

This is troublesome on several fronts, because 1) I expect my traffic to be balanced much more evenly than this, and 2) the trivial example I'm using is a common troubleshooting mechanism for session-affinity problems (which is why I'm doing it) and could easily produce a false positive. See this comment by @thockin and others.

My configuration:

apiVersion: v1
kind: Service
metadata:
  name: hostnames
spec:
  type: NodePort
  ports:
    - port: 9376
      protocol: TCP
      nodePort: 31111
  selector:
    app: hostnames
  sessionAffinity: None
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: hostnames
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: hostnames
    spec:
      containers:
        - name: hostnames
          image: gcr.io/google_containers/serve_hostname:1.3
          ports:
            - containerPort: 9376

and my pods:

$ kubectl get pods -o wide
NAME                                  READY     STATUS    RESTARTS   AGE       IP          NODE
...
hostnames-884590183-b8q0a             1/1       Running   0          28m       10.2.96.9   ip-10-0-0-201.us-west-2.compute.internal
hostnames-884590183-dq3j4             1/1       Running   0          14m       10.2.96.6   ip-10-0-0-201.us-west-2.compute.internal
hostnames-884590183-g8kq9             1/1       Running   0          28m       10.2.3.3    ip-10-0-0-200.us-west-2.compute.internal
hostnames-884590183-o6jv6             1/1       Running   0          14m       10.2.96.5   ip-10-0-0-201.us-west-2.compute.internal
hostnames-884590183-v4vhb             1/1       Running   0          14m       10.2.96.8   ip-10-0-0-201.us-west-2.compute.internal
...

@thockin
Member

thockin commented Dec 31, 2016 via email

@bmarks-mylo

Well, this is interesting. Sure enough, from a worker node to the service IP or NodePort, requests are distributed very evenly.

From outside into ELB:
$ curl --proxy redacted:80 -s sandbox-hostnames-redacted.us-west-2.elb.amazonaws.com
      hostnames-884590183-dq3j4

To service IP:
core@ip-10-0-0-201 ~ $ for i in `seq 1 100`; do curl -s 10.3.0.191:9376; echo; done | sort | uniq -c
     16 hostnames-884590183-b8q0a
     21 hostnames-884590183-dq3j4
     25 hostnames-884590183-g8kq9
     15 hostnames-884590183-o6jv6
     23 hostnames-884590183-v4vhb

To NodePort:
core@ip-10-0-0-201 ~ $ for i in `seq 1 100`; do curl -s 127.0.0.1:31111; echo; done | sort | uniq -c
     14 hostnames-884590183-b8q0a
     22 hostnames-884590183-dq3j4
     23 hostnames-884590183-g8kq9
     20 hostnames-884590183-o6jv6
     21 hostnames-884590183-v4vhb

Do you have any suggestions on troubleshooting why I'm seeing no distribution through the ELB?

@thockin
Member

thockin commented Jan 3, 2017 via email

@bmarks-mylo

It looks like our corporate proxy strikes yet again. I tethered to my phone and sure enough it distributes among the 5 hosts just fine. Sorry to give you any grief over this.

@thockin
Member

thockin commented Jan 3, 2017

Not the first, won't be the last. :)

@thockin thockin closed this as completed Jan 3, 2017
@aseemk98

aseemk98 commented Nov 5, 2020

Hi, I am facing something similar to the previous comments. One difference is that only one of my pods is serving all the incoming requests. I am also using the default iptables proxy mode. My service has 2 endpoints (for 2 pods) and both of them are ready, but all the requests are routed to a single pod. Here's my deployment config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: project-nameprodmstb
  labels:
    app: project-nameprodmstb
spec:
  replicas: 2
  selector:
    matchLabels:
      app: project-nameprodmstb
  template:
    metadata:
      labels:
        app: project-nameprodmstb
    spec:
      containers:
        - name: project-nameprodmstb
          image: <some image>
          imagePullPolicy: Always
          resources:
            requests:
              cpu: "1024m"
              memory: "4096Mi"
      imagePullSecrets:
        - name: "gcr-json-key"
      
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: project-nameprodmstb
  name: project-nameprodmstb 
  namespace: default
spec:
  ports:
  - name: project-nameprodmstb
    port: 8006
    protocol: TCP
    targetPort: 8006
  selector:
    app: project-nameprodmstb
  type: LoadBalancer

@thockin
Member

thockin commented Nov 5, 2020 via email

@0Ams

0Ams commented Aug 11, 2022

Hi, I am facing something similar to the previous comments.

My service is composed of a frontend and a backend; the frontend calls the backend over HTTP using the service name.

$  for i in `seq 1 100`; do curl -o /dev/null -s --write-out '%{remote_ip}' backend.default.svc.cluster.local:5672; echo; done | sort | uniq -c
     60 10.xx.81.10
     10 10.xx.81.11
     7 10.xx.81.12
     3 10.xx.81.13
     6 10.xx.81.14
     4 10.xx.81.15
     10 10.xx.81.16

Even if I try many times, I get a similarly high count for one specific IP address (10.xx.81.10).
kube-proxy uses iptables, each server has similar performance, and the same thing happens without any impact from other services.

In addition, the above test was performed from a pod inside the k8s cluster.
Any helpful information would be appreciated.

@dariodsa

Hi @0Ams, I also run into this issue from time to time. The reason, sadly, is that netfilter uses one seed for all rules (link to netfilter code), which means the selection is random in the overall picture, but not if you look at one single segment (such as one service's pods). Another interesting thing is that the pod with the largest number of requests is usually the one at the end of the rules in iptables (the one without a probability attribute). So I think it would be nice if those probabilities also changed over time. For example, at time 1 the iptables rules would look like this:

  1. podA 0.25
  2. podB 0.3333
  3. podC 0.5
  4. podD

And at the next iptables sync period those values would be permuted randomly, for example:

  1. podC 0.25
  2. podB 0.3333
  3. podD 0.5
  4. podA

I think that would improve how uniformly the service balances. I am investigating this because I am preparing a conference talk on the topic. I will test it, and if you want I can reach out to you with the findings.
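
For concreteness, a standalone sketch of the proposal (not current kube-proxy behavior; the chain names are stand-ins):

package main

import (
	"fmt"
	"math/rand"
)

func main() {
	// Stand-ins for the per-endpoint KUBE-SEP-* chains of one service.
	endpointChains := []string{"KUBE-SEP-A", "KUBE-SEP-B", "KUBE-SEP-C", "KUBE-SEP-D"}

	// Shuffle once per iptables sync, so the unconditional last rule
	// (and every probability slot) rotates across endpoints over time.
	rand.Shuffle(len(endpointChains), func(i, j int) {
		endpointChains[i], endpointChains[j] = endpointChains[j], endpointChains[i]
	})

	// Then write the same probabilistic rules over the permuted order.
	n := len(endpointChains)
	for i, chain := range endpointChains {
		if i < n-1 {
			fmt.Printf("-A KUBE-SVC-X -m statistic --mode random --probability %0.5f -j %s\n",
				1.0/float64(n-i), chain)
		} else {
			fmt.Printf("-A KUBE-SVC-X -j %s\n", chain)
		}
	}
}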

@thockin
Member

thockin commented Feb 16, 2023

I don't understand what you mean that it's not random - if you have 2 or more services, it seems that the natural randomness of traffic patterns and scheduling would get nodes out of phase-lock very quickly, no?

@danwinship - interesting idea to shuffle the endpoints within a probability set. It will make testing ugly (I think).

@dariodsa

Let's imagine that each node has a bag of random numbers, and each time you call get_random you take one number from the bag. If you look at all N numbers you draw, they will follow a uniform distribution. But if you look not at all N, but at a subsection of N, for example every 3rd one, your distribution will no longer be uniform. As N gets bigger, though, the distribution gets closer to uniform. So with more requests it approaches uniform, but as the lag between requests grows, the uniformity decays again. There is not much we can do about that; it is just the nature of randomness.
But I think the shuffle is a great idea worth pursuing.

@danwinship
Contributor

danwinship commented Feb 21, 2023

$  for i in `seq 1 100`; do curl -o /dev/null -s --write-out '%{remote_ip}' backend.default.svc.cluster.local:5672; echo; done | sort | uniq -c
     60 10.xx.81.10
     10 10.xx.81.11
     7 10.xx.81.12
     3 10.xx.81.13
     6 10.xx.81.14
     4 10.xx.81.15
     10 10.xx.81.16

So there's an interesting problem where if you have clients that use persistent connections, and servers that periodically turn over and get replaced, then you can end up with unbalanced load like this. We noticed this in OpenShift with our apiservers. Eg, you start with

  • apiserver-1: 33 clients
  • apiserver-2: 33 clients
  • apiserver-3: 34 clients

Then you reboot/restart/redeploy apiserver-1. The 33 clients that had been connected there get disconnected, reconnect to 172.30.0.1, and get redistributed evenly by their kube-proxies with ~16.5 going to apiserver-2 and ~16.5 going to apiserver-3. So once apiserver-1 comes back up, you have:

  • apiserver-1: 0 clients
  • apiserver-2: 50 clients
  • apiserver-3: 50 clients

Then you restart apiserver-2. The 50 clients there get redistributed evenly across apiserver-1 and apiserver-3, leading to:

  • apiserver-1: 25 clients
  • apiserver-2: 0 clients
  • apiserver-3: 75 clients

And then you restart apiserver-3. Same deal; its clients get redistributed to apiserver-1 and apiserver-2:

  • apiserver-1: 63 clients
  • apiserver-2: 37 clients
  • apiserver-3: 0 clients

Then, as new clients join, they will be distributed evenly across all servers, because the kube-proxies don't take existing load into account (and don't know what the kube-proxies on other nodes are doing anyway). So, eg, if 9 more clients connect, you'll see:

  • apiserver-1: 66 clients
  • apiserver-2: 40 clients
  • apiserver-3: 3 clients

This imbalance will remain until the next time you restart the apiservers, at which point (assuming you restart all 3 in sequence) you'll end up with a new imbalance. (Or exactly the same imbalance if you always restart them in the same order.)

Workaround

If you add an idle timeout to either the client or server side (where the timeout is noticeably less than the average time between server restarts/redeploys) then the clients will get rebalanced over time.
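
In Go terms, the knobs are something like this (a sketch only; the addresses and timeout values are placeholders, and for gRPC the analogous server-side settings are the keepalive MaxConnectionIdle / MaxConnectionAge options):

package main

import (
	"net/http"
	"time"
)

func main() {
	// Server side: close keep-alive connections that sit idle, so clients
	// have to reconnect and get re-balanced by their kube-proxy.
	srv := &http.Server{
		Addr:        ":8443",
		IdleTimeout: 90 * time.Second, // keep this well below the redeploy interval
	}

	// Client side: drop idle pooled connections for the same effect.
	client := &http.Client{
		Transport: &http.Transport{IdleConnTimeout: 30 * time.Second},
	}
	_ = client // used by whatever issues the requests

	srv.ListenAndServe()
}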

@danwinship
Contributor

@danwinship - interesting idea to shuffle the endpoints within a probability set. It will make testing ugly (I think).

It would be a problem for the "strcmp" tests, but not the "trace" tests...

@aojea
Member

aojea commented Feb 24, 2023

We noticed this in OpenShift with our apiservers. Eg, you start with

This is because of how HTTP/2 (and gRPC, which uses it underneath) works, and it is why you need to move to more sophisticated load balancers; L4 can only do so much 🤷

@danwinship
Contributor

?
No, it's not anything specific to http/2 or grpc, other than that they use persistent connections, and that the clients try to reconnect immediately if they are dropped by the server. Given that behavior, plus periodic server restarts, plus a proxy/LB that doesn't take current load into account when handling new connections, you automatically get the unbalanced behavior.

@aojea
Member

aojea commented Dec 19, 2023

Given that behavior, plus periodic server restarts, plus a proxy/LB that doesn't take current load into account when handling new connections, you automatically get the unbalanced behavior.

The apiserver has a probabilistic GOAWAY handler to avoid this situation:

// GoawayChance is the probability that send a GOAWAY to HTTP/2 clients. When client received
// GOAWAY, the in-flight requests will not be affected and new requests will use
// a new TCP connection to triggering re-balancing to another server behind the load balance.
// Default to 0, means never send GOAWAY. Max is 0.02 to prevent break the apiserver.
GoawayChance float64
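
(That corresponds to the kube-apiserver --goaway-chance flag, which defaults to 0, i.e. never send GOAWAY.)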
