
Kubernetes Service not distributing traffic equally; seeing imbalance in traffic #125013

Closed
uttam-phygitalz opened this issue May 21, 2024 · 18 comments
Assignees
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@uttam-phygitalz

What happened?

We are seeing that traffic is not balanced among the ingress controller replicas when the replica count gets higher.
We have set the HPA maximum to 40 replicas. When the load test runs, the HPA is triggered and spawns new replicas, but the load is not evenly distributed even though resources are available. Please find the screenshot below.

[screenshot: traffic distribution across ingress controller replicas]

It is deployed behind an AWS NLB. There are no long-lived connections present; all requests are new connections.

Description of the ingress Service:

```
Labels:                   app=ingress-nginx-external-nlb
                          app.kubernetes.io/managed-by=Helm
Annotations:              helm.sh/resource-policy: keep
                          service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags:
                          service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
                          service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: true
                          service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: 60
                          service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: 300
                          service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: true
                          service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: sg-0116assa519f2f2aa1fe8c
                          service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
                          service.beta.kubernetes.io/aws-load-balancer-type: nlb
Selector:                 app=ingress-nginx-external
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.20.189.13
IPs:                      172.20.189.13
LoadBalancer Ingress:     a47c0fada1425caa057592-76e4445441da70fa.elb.us-west-2.amazonaws.com
Port:                     https  443/TCP
TargetPort:               443/TCP
NodePort:                 https  31411/TCP
Endpoints:                100.64.165.237:443,100.65.173.35:443,100.64.244.118:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     31286
Events:
  Type    Reason               Age                    From                Message
  ----    ------               ----                   ----                -------
  Normal  UpdatedLoadBalancer  16m (x163 over 2d17h)  service-controller  Updated load balancer with new hosts
```

What did you expect to happen?

The traffic should be distributed among all replicas evenly, or close to it, not in a totally imbalanced way.

How can we reproduce it (as minimally and precisely as possible)?

Deploy the ingress controllers.
Set the HPA for the ingress controller, e.g. min 3 and max 40 replicas (see the sketch after these steps).
Perform the load test.
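
For reference, a minimal HPA sketch matching the setup described above. The Deployment name, metrics API version, and CPU target here are assumptions for illustration, not values taken from the report:

```yaml
# Sketch only: HPA scaling the ingress controller between 3 and 40 replicas.
# Deployment name and CPU target are assumptions.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingress-nginx-external
  namespace: qa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingress-nginx-external
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```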

Anything else we need to know?

No response

Kubernetes version

$ kubectl version

Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.8-eks-adc7111

Cloud provider

AWS

OS version


Rocky Linux / Alpine

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@uttam-phygitalz uttam-phygitalz added the kind/bug Categorizes issue or PR as related to a bug. label May 21, 2024
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label May 21, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 21, 2024
@T-Lakshmi

/sig network
/sig cloud-provider

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 21, 2024
@adrianmoisey
Member

Are your pods equally spread across your nodes?

We noticed a similar problem, and our issue was that some nodes had more ingress-nginx pods than others, so each node would distribute the traffic it received amongst the pods hosted on itself.
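
(For anyone checking the same thing: a minimal pod-template fragment that forces an even spread of ingress-nginx pods across nodes, assuming the pod label from the Service selector above; the values are illustrative only, not from this deployment.)

```yaml
# Sketch only: spreads pods one-per-node as evenly as possible.
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: ingress-nginx-external
```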

@uttam-phygitalz
Author

Hi @adrianmoisey, yeah, I can see it's spread over different nodes. Each node has one replica pod running.

@adrianmoisey
Member

> Hi @adrianmoisey, yeah, I can see it's spread over different nodes. Each node has one replica pod running.

And just to confirm, when you are scaled up (to 40 pods), you have an equal spread of pods to nodes?

@uttam-phygitalz
Author

> Hi @adrianmoisey, yeah, I can see it's spread over different nodes. Each node has one replica pod running.
>
> And just to confirm, when you are scaled up (to 40 pods), you have an equal spread of pods to nodes?

Yes, correct. It's equally spread across nodes; each node has one replica running.

@aojea
Member

aojea commented May 21, 2024

You need to test from inside the cluster and from outside, to investigate whether it is a load balancer problem or a Kubernetes problem.
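
A minimal sketch of such an in-cluster test: a throwaway curl pod pointed at the Service's cluster DNS name, bypassing the NLB entirely. The Service name and namespace come from the manifest pasted below; the image tag and request count are assumptions. Compare the per-replica request counts from this run against a run through the NLB to see which hop introduces the skew.

```yaml
# Sketch only: in-cluster client hitting the Service directly.
apiVersion: v1
kind: Pod
metadata:
  name: svc-loadtest
  namespace: qa
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: curlimages/curl:8.8.0   # image/tag is an assumption
      command: ["/bin/sh", "-c"]
      args:
        - |
          # 1000 short-lived HTTPS requests against the ClusterIP service
          for i in $(seq 1 1000); do
            curl -sk -o /dev/null https://ingress-nginx-external-nlb.qa.svc.cluster.local/
          done
```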

@adrianmoisey
Member

What does the Service look like? Can you paste a YAML representation of it here?

@uttam-phygitalz
Author

> What does the Service look like? Can you paste a YAML representation of it here?

The Service looks okay to me:

```yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    helm.sh/resource-policy: keep
    service.beta.kubernetes.io/aws-load-balancer-backend-protocol: tcp
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-connection-draining-timeout: "60"
    service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout: "300"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-extra-security-groups: sg-0165192f2aa1fe8cc
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
  creationTimestamp: "2024-02-18T19:10:42Z"
  finalizers:
    - service.kubernetes.io/load-balancer-cleanup
  labels:
    app: ingress-nginx-external-nlb
    app.kubernetes.io/managed-by: Helm
    bu: cloud
  name: ingress-nginx-external-nlb
  namespace: qa
  resourceVersion: "4809557489"
  uid: 985e1e4e-cb41-4afa-896a-012e40d826dc
spec:
  allocateLoadBalancerNodePorts: true
  clusterIP: 172.20.196.247
  clusterIPs:
    - 172.20.196.247
  externalTrafficPolicy: Local
  healthCheckNodePort: 32161
  internalTrafficPolicy: Cluster
  ipFamilies:
    - IPv4
  ipFamilyPolicy: SingleStack
  ports:
    - name: https
      nodePort: 31172
      port: 443
      protocol: TCP
      targetPort: 443
  selector:
    app: ingress-nginx-external
  sessionAffinity: None
  type: LoadBalancer
status:
  loadBalancer:
    ingress:
      - hostname: a985e1eecbbbt414afa896a012e40d826d-7b200cd7fdc4afef.elb.us-west-2.amazonaws.com
```

@adrianmoisey
Member

> externalTrafficPolicy: Local
> internalTrafficPolicy: Cluster

Given that internalTrafficPolicy is set to Cluster, I'd assume that Kubernetes would distribute the traffic evenly.

Since externalTrafficPolicy is set to Local, it may be the NLB that is causing this behaviour.

I agree with @aojea's suggestion of doing a test inside the cluster. That way it will help rule out either the cluster or the load balancer.
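
For comparison, a sketch of the same Service with externalTrafficPolicy switched to Cluster, which lets kube-proxy spread connections across all ready endpoints cluster-wide at the cost of preserving the client source IP. Fields are trimmed to the relevant ones; this is illustrative, not the reporter's manifest:

```yaml
# Sketch only: A/B variant of the Service for comparing distribution behaviour.
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-external-nlb
  namespace: qa
spec:
  type: LoadBalancer
  externalTrafficPolicy: Cluster   # changed from Local for the comparison
  selector:
    app: ingress-nginx-external
  ports:
    - name: https
      port: 443
      targetPort: 443
      protocol: TCP
```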

@aroradaman
Member

/remove-kind bug
(until we are sure etp: local and NLB is not the culprit)

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 22, 2024
@elmiko
Contributor

elmiko commented May 22, 2024

we are discussing this in the sig cloud provider meeting today. we aren't quite sure this is specific to the cloud controller manager rather than a configuration issue with the load balancer in aws. would like to see more data related to the questions asked earlier.

cc @kmala

@shaneutt
Member

/assign @shaneutt

@shaneutt
Member

We discussed this one in the SIG Network meeting today, and it seems we have several open questions. I've assigned myself just to help shepherd it forward, but @uttam-phygitalz there are some open questions above, including whether this is something that might be happening outside the cluster. Please let us know.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added the triage/needs-information Indicates an issue needs more information in order to work on it. label May 23, 2024
@shaneutt
Member

Seems like this is getting stale.

/lifecycle stale

Let us know your thoughts on some of the above questions @uttam-phygitalz, or if you need any help or support in this?

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 25, 2024
@elmiko
Contributor

elmiko commented Jul 17, 2024

we talked about this again at sig cloud-provider today, deferring acceptance on triage while we wait for more information.

@aojea
Member

aojea commented Jul 17, 2024

/close

The last comment from the reporter was in May; this can always be reopened if there is more information.

@k8s-ci-robot
Contributor

@aojea: Closing this issue.

In response to this:

> /close
>
> The last comment from the reporter was in May; this can always be reopened if there is more information.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
