
AWS - Randomly unhealthy nodes in target groups #9990

Open
bjtox opened this issue May 23, 2023 · 16 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@bjtox

bjtox commented May 23, 2023

Hi, I'm trying to implement an nginx-ingress controller on my EKS installation.
I'm moving to a fresh installation on AWS. I'm able to provision the NLB and the target group, but not all nodes pass the health check; they seem to fail randomly. Currently only 2 of the 5 available nodes in my cluster are healthy.

The issue is the same as this one: #8312

We moved our application from k8s 1.22 to 1.26. We use chart version 4.6.1 and hoped all nodes would become healthy.

It seems the NodePort on the nodes is unavailable, for some reason I can't understand.
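A hedged sketch of confirming the flapping from the AWS side: `aws elbv2 describe-target-health` lists each registered instance with its health state and the reason code (the target group ARN below is a placeholder; it can be looked up with `aws elbv2 describe-target-groups`).

```shell
# List each registered instance in the NLB target group with its
# health state and failure reason (placeholder ARN, substitute your own).
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
  --output table
```

Running this repeatedly while the problem occurs would show whether targets genuinely flap between `healthy` and `unhealthy` or stay persistently unhealthy.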

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
Release: v1.7.1
Build: f48b03b
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6

Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.4-eks-0a21954", GitCommit:"4a3479673cb6d9b63f1c69a67b57de30a4d9b781", GitTreeState:"clean", BuildDate:"2023-04-15T00:33:09Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

Environment:
QA

  • Cloud provider or hardware configuration:
    AWS

  • OS (e.g. from /etc/os-release):

    • Amazon Linux 2
  • Install tools:

    • installed using helm
  • Basic cluster related info:

    • kubectl version
      • Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.12", GitCommit:"b058e1760c79f46a834ba59bd7a3486ecf28237d", GitTreeState:"clean", BuildDate:"2022-07-13T14:59:18Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
    • kubectl get nodes -o wide
      • ip-10-176-0-218.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.0.218 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-0-227.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.0.227 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-0-77.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.0.77 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-1-124.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.1.124 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-1-68.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.1.68 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
      • s-oms-ingress s-oms 1 2023-05-23 17:34:33.603173818 +0200 CEST deployed s-oms-ingress-4.0.1 4.0.1
    • If helm was used then please show output of helm -n <ingresscontrollernamepspace> get values <helmreleasename>
    • If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
    • if you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances
  • Current State of the controller:

    • kubectl describe ingressclasses
      • Name: nginx
        Labels: app.kubernetes.io/component=controller
        app.kubernetes.io/instance=s-oms-ingress
        app.kubernetes.io/managed-by=Helm
        app.kubernetes.io/name=ingressnginx
        app.kubernetes.io/part-of=ingressnginx
        app.kubernetes.io/version=1.7.1
        helm.sh/chart=ingressnginx-4.6.1
        Annotations: meta.helm.sh/release-name: s-oms-ingress
        meta.helm.sh/release-namespace: s-oms
        Controller: k8s.io/ingress-nginx
        Events:
    • kubectl -n <ingresscontrollernamespace> get all -A -o wide
    • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
    • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
  • Current state of ingress object, if applicable:

    • kubectl -n <appnamespace> get all,ing -o wide
    • kubectl -n <appnamespace> describe ing <ingressname>
    • If applicable, then, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
  • Others:

    • Any other related information, like:
      • copy/paste of the snippet (if applicable)
ingressnginx:
 controller:
   ingressClassResource:
     name: nginx
   replicaCount: 3
   service:
     internalTrafficPolicy: local
     ipFamilies: false
     ipFamilyPolicy: false
     # type: ClusterIP
     annotations:
       service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
       # The number of successive successful health checks required for a backend to be considered healthy for traffic. Defaults to 2, must be between 2 and 10
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
       # The number of unsuccessful health checks required for a backend to be considered unhealthy for traffic. Defaults to 6, must be between 2 and 10
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
       # The approximate interval, in seconds, between health checks of an individual instance. Defaults to 10, must be between 5 and 300
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
       # The amount of time, in seconds, during which no response means a failed health check. This value must be less than the service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval value. Defaults to 5, must be between 2 and 60
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
       # can be integer or traffic-port
   config:
     enable-modsecurity: "true"
     enable-owasp-modsecurity-crs: "true"

How to reproduce this issue:

Anything else we need to know:
No other information is available.

Thanks in advance, best regards

@bjtox bjtox added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 23, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

Is this related to #9367?

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 23, 2023
@bjtox
Author

bjtox commented May 24, 2023

Thanks for the reply @longwuyuan. The linked issue is different from mine: in my case, the EC2 instances are registered in the target group but they are unhealthy. I've checked whether it was a network issue, but nodes in the same subnet had 2 different statuses (healthy and unhealthy).

@longwuyuan
Contributor

longwuyuan commented May 24, 2023

Please show the output of kubectl -n ingress-nginx get svc -o yaml | grep -i aws

@bjtox
Author

bjtox commented May 24, 2023

Here is the content:

      service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      - hostname: a8e842bcf9d14473ea8460a067058c46-f7c4d42e3047f41b.elb.eu-south-1.amazonaws.com

@bjtox
Author

bjtox commented May 24, 2023

  • Additional info:
    Checking the connection to the nodes, I saw the NodePort is available only for a short period. I checked with a telnet command from another VM in the same network.

Is it possible to set externalTrafficPolicy to Local?

Just to add context, the problem is the same as reported in this post:
https://stackoverflow.com/questions/61183167/kubernetes-issue-with-nodeport-connectivity
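As an aside, a minimal sketch of how the policy could be inspected and switched on the controller Service (the namespace and service name below are guesses; substitute the ones from your release):

```shell
# Show the current policy on the controller Service (name is hypothetical).
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}'

# Switch to Local: kube-proxy then answers the NLB health check only on
# nodes that actually run a controller pod, and client source IPs are kept.
kubectl -n ingress-nginx patch svc ingress-nginx-controller \
  -p '{"spec":{"externalTrafficPolicy":"Local"}}'
```

With Local, it is expected behavior that only nodes hosting a controller pod show as healthy in the target group.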

@longwuyuan
Contributor

I think a healthz-path-related annotation is required. Can you check the docs?

@bjtox
Author

bjtox commented May 24, 2023

But TCP health checks don't have a path, am I wrong?

@longwuyuan
Contributor

I am not sure. I think I have seen a comment about a path. I am checking.

@longwuyuan
Contributor

Sorry, that was about AKS, not EKS.

@longwuyuan
Contributor

If you can edit your issue description and improve it, maybe more useful data will be available for debugging.

  • Please answer all questions asked in the issue template
  • Please format the information as per markdown

@bjtox
Author

bjtox commented May 24, 2023

I'm not able to provide you with any more info. It seems something goes down in K8s, so that the port becomes unavailable on the host.

@bjtox
Author

bjtox commented May 25, 2023

@longwuyuan the issue is the same as the one reported here: #8312

@longwuyuan
Contributor

longwuyuan commented May 25, 2023 via email

@sebastienrospars

Hi, I have the same problem: sometimes I have 0 healthy nodes in the target group, and a few minutes later one or two nodes are up. Have you found a solution to this problem, or do you still have it @bjtox? Thanks

@minhhieu76qng

minhhieu76qng commented Jul 14, 2023

@sebastienrospars Yeah, I faced the same problem.
I installed ingress-nginx with the Helm chart. I then tried installing with the install.yaml manifest from the documentation instead, and that works.
Comparing the Helm chart values with the manifest, I found that externalTrafficPolicy is not configured in the chart, so it gets the default value (Cluster), while in the manifest it is Local.
So I added controller.service.externalTrafficPolicy: Local to the chart's values.yaml.
=> The problem is fixed now.

I have no idea about the difference; is it a mistake? @longwuyuan
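The fix described above would look roughly like this in the chart's values.yaml (a sketch assuming the upstream ingress-nginx chart's value layout):

```yaml
controller:
  service:
    # Default is Cluster. With Local, kube-proxy answers the NLB health
    # check only on nodes that host a controller pod, so those nodes show
    # healthy and the rest show unhealthy by design; client source IPs
    # are also preserved.
    externalTrafficPolicy: Local
```

Under the default Cluster policy, traffic arriving on a node without a controller pod gets an extra hop (and SNAT) to a node that has one, which can interact badly with NLB health checking; Local avoids that hop.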
