
AWS - Randomly unhealthy nodes in target groups #9990

Open
bjtox opened this issue May 23, 2023 · 16 comments
Labels
needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@bjtox

bjtox commented May 23, 2023

Hi, I'm trying to implement an nginx-ingress controller on my EKS installation.
I'm moving to a fresh installation on AWS. I'm able to provision the NLB and the target group, but not all nodes pass the health check; they seem to fail randomly. Currently only 2 of the 5 available nodes in my cluster are healthy.

The issue is the same as this one: #8312

We moved our application from k8s 1.22 to 1.26. We use chart version 4.6.1 and hoped all nodes would become healthy.

It seems the NodePort on the nodes is unavailable, for some reason I can't understand.
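A hedged sketch of confirming the flapping from the AWS side: `aws elbv2 describe-target-health` lists each registered instance with its health state and the reason code (the target group ARN below is a placeholder; it can be looked up with `aws elbv2 describe-target-groups`).

```shell
# List each registered instance in the NLB target group with its
# health state and failure reason (placeholder ARN, substitute your own).
aws elbv2 describe-target-health \
  --target-group-arn <target-group-arn> \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason]' \
  --output table
```

Running this repeatedly while the problem occurs would show whether targets genuinely flap between `healthy` and `unhealthy` or stay persistently unhealthy.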

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.):

NGINX Ingress controller
Release: v1.7.1
Build: f48b03b
Repository: https://github.com/kubernetes/ingress-nginx
nginx version: nginx/1.21.6

Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"26+", GitVersion:"v1.26.4-eks-0a21954", GitCommit:"4a3479673cb6d9b63f1c69a67b57de30a4d9b781", GitTreeState:"clean", BuildDate:"2023-04-15T00:33:09Z", GoVersion:"go1.19.8", Compiler:"gc", Platform:"linux/amd64"}

Environment:
QA

  • Cloud provider or hardware configuration:
    AWS

  • OS (e.g. from /etc/os-release):

    • Amazon Linux 2
  • Install tools:

    • installed using helm
  • Basic cluster related info:

    • kubectl version
      • Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.12", GitCommit:"b058e1760c79f46a834ba59bd7a3486ecf28237d", GitTreeState:"clean", BuildDate:"2022-07-13T14:59:18Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
    • kubectl get nodes -o wide
      • ip-10-176-0-218.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.0.218 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-0-227.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.0.227 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-0-77.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.0.77 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-1-124.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.1.124 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
        ip-10-176-1-68.eu-south-1.compute.internal Ready 3d23h v1.26.4-eks-0a21954 10.176.1.68 Amazon Linux 2 5.10.178-162.673.amzn2.x86_64 containerd://1.6.19
  • How was the ingress-nginx-controller installed:

    • If helm was used then please show output of helm ls -A | grep -i ingress
      • s-oms-ingress s-oms 1 2023-05-23 17:34:33.603173818 +0200 CEST deployed s-oms-ingress-4.0.1 4.0.1
    • If helm was used then please show output of helm -n <ingresscontrollernamepspace> get values <helmreleasename>
    • If helm was not used, then copy/paste the complete precise command used to install the controller, along with the flags and options used
    • if you have more than one instance of the ingress-nginx-controller installed in the same cluster, please provide details for all the instances
  • Current State of the controller:

    • kubectl describe ingressclasses
      • Name: nginx
        Labels: app.kubernetes.io/component=controller
        app.kubernetes.io/instance=s-oms-ingress
        app.kubernetes.io/managed-by=Helm
        app.kubernetes.io/name=ingressnginx
        app.kubernetes.io/part-of=ingressnginx
        app.kubernetes.io/version=1.7.1
        helm.sh/chart=ingressnginx-4.6.1
        Annotations: meta.helm.sh/release-name: s-oms-ingress
        meta.helm.sh/release-namespace: s-oms
        Controller: k8s.io/ingress-nginx
        Events:
    • kubectl -n <ingresscontrollernamespace> get all -A -o wide
    • kubectl -n <ingresscontrollernamespace> describe po <ingresscontrollerpodname>
    • kubectl -n <ingresscontrollernamespace> describe svc <ingresscontrollerservicename>
  • Current state of ingress object, if applicable:

    • kubectl -n <appnamespace> get all,ing -o wide
    • kubectl -n <appnamespace> describe ing <ingressname>
    • If applicable, then, your complete and exact curl/grpcurl command (redacted if required) and the response to the curl/grpcurl command with the -v flag
  • Others:

    • Any other related information, like:
      • copy/paste of the snippet (if applicable)
ingressnginx:
 controller:
   ingressClassResource:
     name: nginx
   replicaCount: 3
   service:
     internalTrafficPolicy: local
     ipFamilies: false
     ipFamilyPolicy: false
     # type: ClusterIP
     annotations:
       service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
       # The number of successive successful health checks required for a backend to be considered healthy for traffic. Defaults to 2, must be between 2 and 10
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
       # The number of unsuccessful health checks required for a backend to be considered unhealthy for traffic. Defaults to 6, must be between 2 and 10
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
       # The approximate interval, in seconds, between health checks of an individual instance. Defaults to 10, must be between 5 and 300
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
       # The amount of time, in seconds, during which no response means a failed health check. This value must be less than the service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval value. Defaults to 5, must be between 2 and 60
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
       service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
       # can be integer or traffic-port
   config:
     enable-modsecurity: "true"
     enable-owasp-modsecurity-crs: "true"

How to reproduce this issue:

Anything else we need to know:
No other information is available.

Thanks in advance, best regards

@bjtox bjtox added the kind/bug Categorizes issue or PR as related to a bug. label May 23, 2023
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label May 23, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@longwuyuan
Contributor

/remove-kind bug

Is this related to #9367?

@k8s-ci-robot k8s-ci-robot added needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels May 23, 2023
@bjtox
Author

bjtox commented May 24, 2023

Thanks for the reply @longwuyuan. The linked issue is different from mine: in my case, the EC2 instances are registered in the target group but they are unhealthy. I've checked whether it was a network issue, but nodes in the same subnet had 2 different statuses (healthy and unhealthy).

@longwuyuan
Contributor

longwuyuan commented May 24, 2023

Please show the output of kubectl -n ingress-nginx get svc -o yaml | grep -i aws

@bjtox
Author

bjtox commented May 24, 2023

Here is the content:

      service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: "2"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "20"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: traffic-port
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: TCP
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: "5"
      service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: "3"
      service.beta.kubernetes.io/aws-load-balancer-type: nlb
      - hostname: a8e842bcf9d14473ea8460a067058c46-f7c4d42e3047f41b.elb.eu-south-1.amazonaws.com

@bjtox
Author

bjtox commented May 24, 2023

  • Additional info:
    Checking the connection to the nodes, I saw the NodePort is available only for a short period. I checked with a telnet command from another VM in the same network.

Is it possible to set externalTrafficPolicy to Local?

Just to add context, the problem is the same as reported in this post:
https://stackoverflow.com/questions/61183167/kubernetes-issue-with-nodeport-connectivity
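As an aside, a minimal sketch of how the policy could be inspected and switched on the controller Service (the namespace and service name below are guesses; substitute the ones from your release):

```shell
# Show the current policy on the controller Service (name is hypothetical).
kubectl -n ingress-nginx get svc ingress-nginx-controller \
  -o jsonpath='{.spec.externalTrafficPolicy}'

# Switch to Local: kube-proxy then answers the NLB health check only on
# nodes that actually run a controller pod, and client source IPs are kept.
kubectl -n ingress-nginx patch svc ingress-nginx-controller \
  -p '{"spec":{"externalTrafficPolicy":"Local"}}'
```

With Local, it is expected behavior that only nodes hosting a controller pod show as healthy in the target group.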

@longwuyuan
Contributor

I think a healthz-path-related annotation is required. Can you check the docs?

@bjtox
Author

bjtox commented May 24, 2023

But TCP health checks don't have a path, am I wrong?

@longwuyuan
Contributor

I am not sure. I think I have seen a comment about a path. I am checking.

@longwuyuan
Contributor

Sorry, that was about AKS, not EKS.

@longwuyuan
Contributor

If you can edit your issue description and improve it, maybe more useful data will be available for debugging.

  • Please answer all questions asked in the issue template
  • Please format the information as per markdown

@bjtox
Author

bjtox commented May 24, 2023

I'm not able to provide you with any more info. It seems something goes down in K8s, so that the port becomes unavailable on the host.

@bjtox
Author

bjtox commented May 25, 2023

@longwuyuan the issue is the same as the one reported here: #8312

@longwuyuan
Contributor

longwuyuan commented May 25, 2023 via email

@sebastienrospars

Hi, I have the same problem: sometimes I have 0 healthy nodes in the target group, and a few minutes later one or two nodes are up. Have you found a solution to this problem, or do you still have it @bjtox? Thanks

@minhhieu76qng

minhhieu76qng commented Jul 14, 2023

@sebastienrospars Yeah, I faced the same problem.
I installed ingress-nginx with the Helm chart. I then tried installing with the install.yaml manifest from the documentation instead, and that works.
Comparing the Helm chart values with the manifest, I found that externalTrafficPolicy is not configured in the chart, so it gets the default value (Cluster), while in the manifest it is Local.
So I added controller.service.externalTrafficPolicy: Local to the chart's values.yaml.
=> The problem is fixed now.

I have no idea about the difference; is it a mistake? @longwuyuan
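The fix described above would look roughly like this in the chart's values.yaml (a sketch assuming the upstream ingress-nginx chart's value layout):

```yaml
controller:
  service:
    # Default is Cluster. With Local, kube-proxy answers the NLB health
    # check only on nodes that host a controller pod, so those nodes show
    # healthy and the rest show unhealthy by design; client source IPs
    # are also preserved.
    externalTrafficPolicy: Local
```

Under the default Cluster policy, traffic arriving on a node without a controller pod gets an extra hop (and SNAT) to a node that has one, which can interact badly with NLB health checking; Local avoids that hop.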
