kube-proxy iptables load-balancing probability is not equal #37932
Comments
This is intended, because these iptables rules are evaluated from top to bottom. Take your case as an example:
When a packet reaches the first rule, there are still 5 possible endpoints to choose from, so that rule's probability is set to 1/5. If the packet falls through to the second rule, only 4 endpoints remain, so that rule uses 1/4, then 1/3, 1/2, and finally an unconditional rule. The overall probability of landing on any given endpoint is therefore equal: for example, the chance of reaching and matching the last rule is (4/5)·(3/4)·(2/3)·(1/2)·1 = 1/5.
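For illustration, here is a small Go sketch (not kube-proxy code) that reproduces the per-rule probabilities described above and checks by simulation that each endpoint is selected with overall probability 1/n:

// Sketch: per-rule probabilities of 1/(n-i), with an unconditional last rule,
// give every endpoint the same overall chance of being picked.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const n = 5
	probs := make([]float64, n)
	for i := 0; i < n; i++ {
		probs[i] = 1.0 / float64(n-i) // 0.2, 0.25, 0.333..., 0.5, 1.0
	}
	fmt.Println("per-rule probabilities:", probs)

	const trials = 1_000_000
	counts := make([]int, n)
	for t := 0; t < trials; t++ {
		for i := 0; i < n; i++ {
			if rand.Float64() < probs[i] { // otherwise fall through to the next rule
				counts[i]++
				break
			}
		}
	}
	for i, c := range counts {
		fmt.Printf("endpoint %d: %.3f\n", i, float64(c)/float64(trials)) // each ≈ 0.200
	}
}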
|
Hi @MrHohn, thank you for the detailed answer. As you said, with 5 pods the probability for each pod should be 1/5. But when I actually test it (I run the same server in every pod to receive RESTful HTTP requests, and send requests from the client at the same time), I find that not all of the pods handle a request; the logs show that some requests are handled by the same pod. So I don't understand: if the probabilities are equal, shouldn't every pod receive a request? Thank you very much again! |
Hi @JackTiger, could you provide more details about:
|
Hi @MrHohn, I use the curl command to send the HTTP requests from the local host, as below:
and my service description is like this:
The service resource file I created is as below:
I have tried many times and the result is the same; the logs are as below:
One of the endpoints handles two of the requests at the same time. I am using the default (iptables) load-balancing rules. |
Correct me if I misunderstood the situation: the curl command sent out 4 requests to the service IP at roughly the same time, and two of them were served by one of the endpoints. I would say that is normal, because we are not doing round-robin load balancing here. The probability of 2 out of 4 requests being served by the same endpoint is not that low. I suggest putting that curl command in a loop that repeats more than 1000 times and checking how many requests each endpoint serves; I believe the result will look much fairer. A sketch of such a check follows. |
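For example, a small Go client along these lines can do that tally (the address is a placeholder for the service's cluster IP and port; DisableKeepAlives forces a fresh connection per request so each one is load-balanced independently by the iptables rules):

// Sketch: send many requests to the service and count which backend answered.
// serve_hostname-style backends reply with their pod name in the body.
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	client := &http.Client{
		// Without this, Go's client keeps connections alive and re-uses them,
		// so every request could end up on the same backend.
		Transport: &http.Transport{DisableKeepAlives: true},
	}
	counts := map[string]int{}
	for i := 0; i < 1000; i++ {
		resp, err := client.Get("http://10.3.0.191:9376/") // placeholder cluster IP:port
		if err != nil {
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		counts[strings.TrimSpace(string(body))]++
	}
	for host, n := range counts {
		fmt.Printf("%6d %s\n", n, host)
	}
}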
@MrHohn OK, thanks, I will try your suggestion. By the way, if I want to use round-robin load balancing for my service, what should I do? Do you have any suggestions? I want every pod to handle a single task that may take a long time to execute, so if several requests arrive at the same time, I want them handled by more of the pods, not just a few of them. Ideally the load balancing would use a round-robin method rather than a random one. |
I think the userspace proxy mode for kube-proxy does round robin when choosing backends. But since the iptables proxy mode should be faster and more reliable than the userspace proxy, I don't recommend switching back just to get round robin. I would suggest keeping this logic in the application layer: the server should know what its capacity is and not take more requests than it can handle. |
You may also be able to utilize the readiness probe feature. If you don't want any more requests forwarded to backends that are already serving enough traffic, mark them as unready and they will be removed from the service endpoints; mark them as ready again when they are done and the endpoints will show up again. This may not be a general solution, just an idea; a rough sketch follows. |
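A minimal sketch of that idea in Go (names like /readyz and maxInFlight are illustrative; readiness probes react with a delay of seconds, so this sheds load coarsely, not per-request):

// The pod reports "not ready" while it is at capacity, so the kubelet's
// readinessProbe fails and the endpoint is removed from the Service; once the
// in-flight work drains, it reports "ready" and the endpoint comes back.
package main

import (
	"net/http"
	"sync/atomic"
)

var inFlight int64

const maxInFlight = 1 // pretend each pod should only work on one task at a time

func work(w http.ResponseWriter, r *http.Request) {
	atomic.AddInt64(&inFlight, 1)
	defer atomic.AddInt64(&inFlight, -1)
	// ... long-running task ...
	w.WriteHeader(http.StatusOK)
}

// readyz is the target of the pod's readinessProbe (an httpGet on /readyz).
func readyz(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&inFlight) >= maxInFlight {
		http.Error(w, "busy", http.StatusServiceUnavailable) // probe fails, endpoint removed
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/", work)
	http.HandleFunc("/readyz", readyz)
	http.ListenAndServe(":9376", nil)
}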
OK, thanks. I am using the readiness probe feature in k8s now; it may be the only way to meet my requirements. Or I could replace kube-proxy; I have seen someone use haproxy in place of kube-proxy, since the load-balancing ability of kube-proxy is limited. I will try that for the next feature. Thanks very much again! |
@JackTiger can you link the haproxy solution? kube-proxy, unfortunately, isn't really fit for my purposes (heavy use of persistent connections). |
@jsravn I am sorry, I have not found the link to the haproxy solution, but I will try that approach to verify its load-balancing ability for the next feature. It will take me a lot of time, because I am not very familiar with haproxy. |
I am seeing something similar. I have 5 hostname pods (gcr.io/google_containers/serve_hostname:1.3) connected to a NodePort service with sessionAffinity: None. When I run:

for i in `seq 1 100`; do curl -s sandbox-hostnames-redacted.us-west-2.elb.amazonaws.com; echo; done | sort | uniq -c

I always get a single unique host. If I jump onto a browser and furiously refresh the URL then I can get at most two hosts. Running several consoles with this command doesn't produce more than one host either.

This is very troublesome on several fronts because 1. I expect my traffic to be balanced much more evenly than this, and 2. the trivial example I'm using has been used as a troubleshooting mechanism when testing session affinity problems (which is why I'm doing it) and could very easily give a false positive. See this comment (#13299 (comment)) by @thockin and others.

My configuration:

apiVersion: v1
kind: Service
metadata:
  name: hostnames
spec:
  type: NodePort
  ports:
  - port: 9376
    protocol: TCP
    nodePort: 31111
  selector:
    app: hostnames
  sessionAffinity: None
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: hostnames
spec:
  replicas: 5
  template:
    metadata:
      labels:
        app: hostnames
    spec:
      containers:
      - name: hostnames
        image: gcr.io/google_containers/serve_hostname:1.3
        ports:
        - containerPort: 9376

and my pods:

$ kubectl get pods -o wide
NAME                        READY     STATUS    RESTARTS   AGE       IP          NODE
...
hostnames-884590183-b8q0a   1/1       Running   0          28m       10.2.96.9   ip-10-0-0-201.us-west-2.compute.internal
hostnames-884590183-dq3j4   1/1       Running   0          14m       10.2.96.6   ip-10-0-0-201.us-west-2.compute.internal
hostnames-884590183-g8kq9   1/1       Running   0          28m       10.2.3.3    ip-10-0-0-200.us-west-2.compute.internal
hostnames-884590183-o6jv6   1/1       Running   0          14m       10.2.96.5   ip-10-0-0-201.us-west-2.compute.internal
hostnames-884590183-v4vhb   1/1       Running   0          14m       10.2.96.8   ip-10-0-0-201.us-west-2.compute.internal
...
|
Please verify from within the cluster (e.g. from a node to the cluster IP and then to the NodePort). This is well tested and considered very stable. I'd be shocked if this is really an iptables problem, but nothing is impossible. Help us tease it apart, please.
For those who think they care about it being round robin, consider this: if you have N nodes and M backends for a given Service, and each node is independently making RR decisions for a non-deterministic number of clients, is it really any different from random? A quick simulation of this point is sketched below.
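To illustrate the point (a toy simulation, not kube-proxy code; the node and backend counts are arbitrary):

// Each node does its own independent round robin over the backends, starting
// at a random offset and receiving requests in an unpredictable order; the
// aggregate spread ends up looking essentially like a uniform random choice.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const nodes, backends = 50, 5
	rrCounts := make([]int, backends)
	randCounts := make([]int, backends)

	next := make([]int, nodes) // each node's independent RR position
	for i := range next {
		next[i] = rand.Intn(backends)
	}

	for req := 0; req < 1_000_000; req++ {
		node := rand.Intn(nodes) // requests arrive at nodes unpredictably
		rrCounts[next[node]]++
		next[node] = (next[node] + 1) % backends
		randCounts[rand.Intn(backends)]++
	}
	fmt.Println("per-node round robin:", rrCounts)
	fmt.Println("uniform random:      ", randCounts)
}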
|
Well, this is interesting. Sure enough, from a worker to the service IP or NodePort, requests are distributed very evenly.

From outside, through the ELB:

$ curl --proxy redacted:80 -s sandbox-hostnames-redacted.us-west-2.elb.amazonaws.com
hostnames-884590183-dq3j4

To the service IP:

$ for i in `seq 1 100`; do curl -s 10.3.0.191:9376; echo; done | sort | uniq -c
     16 hostnames-884590183-b8q0a
     21 hostnames-884590183-dq3j4
     25 hostnames-884590183-g8kq9
     15 hostnames-884590183-o6jv6
     23 hostnames-884590183-v4vhb

To the NodePort:

$ for i in `seq 1 100`; do curl -s 127.0.0.1:31111; echo; done | sort | uniq -c
     14 hostnames-884590183-b8q0a
     22 hostnames-884590183-dq3j4
     23 hostnames-884590183-g8kq9
     20 hostnames-884590183-o6jv6
     21 hostnames-884590183-v4vhb

Do you have any suggestions for troubleshooting why I'm seeing no distribution through the ELB? |
Could it be caching? Could you be triggering some HTTP client connection re-use (Go's client lib does this)?
|
It looks like our corporate proxy strikes yet again. I tethered to my phone and sure enough it distributes among the 5 hosts just fine. Sorry to give you any grief over this. |
Not the first, won't be the last. :) |
Hi, I am facing something similar to the previous comments. One difference is that only 1 of my pods is serving all the incoming requests. I am also using the default iptables proxy mode. My service has 2 endpoints (for 2 pods) and both of them are ready, but all the requests are routed to a single pod. Here's my deployment config:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: project-nameprodmstb
  labels:
    app: project-nameprodmstb
spec:
  replicas: 2
  selector:
    matchLabels:
      app: project-nameprodmstb
  template:
    metadata:
      labels:
        app: project-nameprodmstb
    spec:
      containers:
      - name: project-nameprodmstb
        image: <some image>
        imagePullPolicy: Always
        resources:
          requests:
            cpu: "1024m"
            memory: "4096Mi"
      imagePullSecrets:
      - name: "gcr-json-key"
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
  minReadySeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: project-nameprodmstb
  name: project-nameprodmstb
  namespace: default
spec:
  ports:
  - name: project-nameprodmstb
    port: 8006
    protocol: TCP
    targetPort: 8006
  selector:
    app: project-nameprodmstb
  type: LoadBalancer
|
Is this about internal "clusterip" requests or external load-balancer requests? What are your clients: multiple apps or a load tester? Are you re-using connections (some HTTP libs will do this without telling you)?
|
Hi, I am facing something similar to the previous comments. My service is composed of a frontend and a backend.

$ for i in `seq 1 100`; do curl -o /dev/null -s --write-out '%{remote_ip}' backend.default.svc.cluster.local:5672; echo; done | sort | uniq -c
     60 10.xx.81.10
     10 10.xx.81.11
      7 10.xx.81.12
      3 10.xx.81.13
      6 10.xx.81.14
      4 10.xx.81.15
     10 10.xx.81.16

Even if I try many times, I get a similarly high count for one specific IP address (10.xx.81.10). In addition, the above test was performed from a pod inside the k8s cluster. |
Hi @0Ams, I am also experiencing that issue from time to time. The reason for that is, sadly, that
I think that would improve the uniformity of the service balancing. I am investigating it because I am preparing a conference talk on this topic. I will test it, and if you want I can reach out to you with the findings. |
I don't understand what you mean that it's not random - if you have 2 or more services, it seems that the natural randomness of traffic patterns and scheduling would get nodes out of phase-lock very quickly, no? @danwinship - interesting idea to shuffle the endpoints within a probability set. It will make testing ugly (I think). |
Let's imagine that each node has a bag of random numbers, and each time you call |
So there's an interesting problem: if you have clients that use persistent connections, and servers that periodically turn over and get replaced, then you can end up with unbalanced load like this. We noticed this in OpenShift with our apiservers. E.g., you start with roughly 33 clients connected to each of the 3 apiservers behind the service IP 172.30.0.1.
Then you reboot/restart/redeploy apiserver-1. The 33 clients that had been connected there get disconnected, reconnect to 172.30.0.1, and get redistributed evenly by their kube-proxies, with ~16.5 going to apiserver-2 and ~16.5 going to apiserver-3. So once apiserver-1 comes back up, you have roughly 0 / 50 / 50 clients on apiserver-1 / apiserver-2 / apiserver-3.
Then you restart apiserver-2. The 50 clients there get redistributed evenly across apiserver-1 and apiserver-3, leading to roughly 25 / 0 / 75.
And then you restart apiserver-3. Same deal; its clients get redistributed to apiserver-1 and apiserver-2, giving roughly 62.5 / 37.5 / 0.
Then, as new clients join, they will be distributed evenly across all servers, because the kube-proxies don't take existing load into account (and don't know what the kube-proxies on other nodes are doing anyway). So, e.g., if 9 more clients connect, you'll see roughly 65.5 / 40.5 / 3.
This imbalance will remain until the next time you restart the apiservers, at which point (assuming you restart all 3 in sequence) you'll end up with a new imbalance (or exactly the same imbalance if you always restart them in the same order). Workaround: if you add an idle timeout to either the client or server side (where the timeout is noticeably less than the average time between server restarts/redeploys), then the clients will get rebalanced over time; a sketch is below. |
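A minimal sketch of that workaround in Go (the durations and the address are illustrative): closing idle connections well before the typical redeploy interval means each new connection gets a fresh load-balancing decision from kube-proxy.

// Client side: drop idle keep-alive connections so the next request dials a
// new connection and is re-balanced by the iptables rules.
package main

import (
	"net/http"
	"time"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 30 * time.Second, // well below the time between redeploys
		},
	}
	_ = client // use client for requests as usual

	// Server side: the equivalent knob is http.Server.IdleTimeout, which closes
	// connections that have been idle for longer than the given duration.
	srv := &http.Server{
		Addr:        ":8006",
		IdleTimeout: 30 * time.Second,
	}
	_ = srv // srv.ListenAndServe() in a real server
}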
It would be a problem for the "strcmp" tests, but not the "trace" tests... |
This is because of how HTTP/2 (and gRPC, which uses it underneath) works, and it is why you need to move to more sophisticated load balancers; L4 can only do what it can do 🤷
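For gRPC specifically, one common approach is client-side round-robin balancing instead of relying on the Service's L4 virtual IP. A sketch under assumptions: the Service is headless (clusterIP: None) so the DNS resolver returns individual pod IPs, the name backend-headless is hypothetical, and the transport is plaintext for brevity.

// gRPC opens one long-lived HTTP/2 connection and multiplexes all calls over
// it, so kube-proxy only balances the initial dial; with a headless Service
// and the round_robin policy, the client spreads calls across all pod IPs.
package main

import (
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	conn, err := grpc.Dial(
		"dns:///backend-headless.default.svc.cluster.local:5672", // hypothetical headless Service
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// Use conn with the generated client stubs; calls are spread round-robin
	// across the resolved pod addresses.
}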
? |
The apiserver has a probabilistic GOAWAY handler to avoid this situation: kubernetes/staging/src/k8s.io/apiserver/pkg/server/config.go, lines 240 to 244 at 755b4e2.
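Roughly, the idea looks like the sketch below (illustrative names, not the apiserver's actual code; it relies on Go's HTTP/2 server turning a "Connection: close" response header from a handler into a GOAWAY frame on the wire):

// With a small probability, ask the client to drop its long-lived HTTP/2
// connection after the current request, so its next dial gets a fresh
// load-balancing decision from kube-proxy.
package main

import (
	"math/rand"
	"net/http"
)

func withProbabilisticGoaway(inner http.Handler, chance float64) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Proto == "HTTP/2.0" && rand.Float64() < chance {
			// On HTTP/2 this makes the server send GOAWAY and close the
			// connection once in-flight requests finish.
			w.Header().Set("Connection", "close")
		}
		inner.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	// kube-apiserver exposes this knob as the --goaway-chance flag.
	// A real server would serve TLS so that HTTP/2 is actually negotiated.
	http.ListenAndServe(":8443", withProbabilisticGoaway(mux, 0.001))
}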
|
Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.7", GitCommit:"a2cba278cba1f6881bb0a7704d9cac6fca6ed435", GitTreeState:"clean", BuildDate:"2016-09-12T23:15:30Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"3", GitVersion:"v1.3.7", GitCommit:"a2cba278cba1f6881bb0a7704d9cac6fca6ed435", GitTreeState:"clean", BuildDate:"2016-09-12T23:08:43Z", GoVersion:"go1.6.2", Compiler:"gc", Platform:"linux/amd64"}
Environment:
Cloud provider or hardware configuration:
Amazon AWS
OS (e.g. from /etc/os-release):
CentOS Linux 7 (Core)
Kernel (e.g. uname -a):
Linux bastion 4.7.5-1.el7.elrepo.x86_64 #1 SMP Sat Sep 24 11:54:29 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
Install tools:
Custom install
Others:
What happened:
I've created a deployment (nginx) with 5 replicas like so:
kubectl run nginx --image=nginx --replicas=5
I've exposed a service like so:
kubectl expose deployment nginx --port=80 --target-port=80
On a worker node, I see this in iptables for this service:
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-W565FQEHFWB3IXKJ
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-EZJQVRVACWHUOBBQ
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-FPX6KRWBZCCVJJV6
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-PHUVLZZC77CF4GSN
-A KUBE-SVC-H2F4SOSDHAEHZFXQ -m comment --comment "default/nginx:" -j KUBE-SEP-DCNJBZVVUJKLG55E
Those probabilities look really odd.
What you expected to happen:
All probabilities should be 0.2.
How to reproduce it (as minimally and precisely as possible):
See above.
Anything else we need to know:
Looks like it's line 953 in pkg/proxy/iptables/proxier.go.
It seems like there is some problem with the kube-proxy iptables rules.
In the rules above, I find that the probabilities are not equal, and they add up to more than 1. For example:
I have 5 pods behind one service, and I use the iptables-save command to look at the iptables chains; the result is as above.
I find the probabilities are 1/5, 1/4, 1/3, 1/2, so I assumed that when a request arrives, the pod behind KUBE-SEP-CGDKBCNM24SZWCMS (the last rule) has the greatest chance of being chosen. When I test by sending requests to the server at the same time, that pod often handles 3 of the requests, so my goal of having every pod handle one task at a time is not achieved.