Many health check pods crash #40

hervelemeur · 2021-01-12T13:34:24Z

On a new AKS cluster, I've got this error:

time="2021-01-12T12:49:05Z" level=info msg="Found instance namespace: kuberhealthy"
time="2021-01-12T12:49:05Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
starting jx-webhooks health checks
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x12080b4]
goroutine 1 [running]:
main.Options.findErrors(0x15b0f80, 0xc000341430, 0x0, 0x22, 0x0, 0x0, 0x0)
/workspace/source/cmd/jx-secrets/main.go:68 +0x114
main.main()
/workspace/source/cmd/jx-secrets/main.go:32 +0x107
stream closed

(commit ref for the cluster: jx3-gitops-repositories/jx3-terraform-azure@8688dcb)


hervelemeur · 2021-01-12T13:56:49Z

No missing external secret:

$ kubectl get externalsecrets -n jx
NAME                             LAST SYNC   STATUS    AGE
jenkins-maven-settings           16s         SUCCESS   60m
jenkins-release-gpg              2s          SUCCESS   60m
jenkins-x-chartmuseum            14s         SUCCESS   60m
jx-basic-auth-htpasswd           13s         SUCCESS   60m
jx-basic-auth-user-password      1s          SUCCESS   60m
lighthouse-hmac-token            11s         SUCCESS   60m
lighthouse-oauth-token           6s          SUCCESS   60m
nexus                            7s          SUCCESS   60m
tekton-container-registry-auth   10s         SUCCESS   60m
tekton-git                       4s          SUCCESS   60m

hervelemeur · 2021-01-12T13:57:48Z

I then manually deleted the failing pod; all subsequent ones reached "Completed" without any problem.

hervelemeur · 2021-01-14T10:03:26Z

I've got other health checks crashing.

deployment:

time="2021-01-13T21:10:14Z" level=info msg="Found instance namespace: kuberhealthy"  
time="2021-01-13T21:10:14Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."  
time="2021-01-13T21:10:14Z" level=info msg="Found pod namespace: kuberhealthy"  
time="2021-01-13T21:10:14Z" level=info msg="Performing check in kuberhealthy namespace."  
time="2021-01-13T21:10:14Z" level=info msg="Parsed CHECK_DEPLOYMENT_REPLICAS: 4"  
time="2021-01-13T21:10:14Z" level=info msg="Parsed CHECK_SERVICE_ACCOUNT: default"  
time="2021-01-13T21:10:14Z" level=info msg="Check time limit set to: 14m45.119966409s"  
time="2021-01-13T21:10:14Z" level=info msg="Parsed CHECK_DEPLOYMENT_ROLLING_UPDATE: true"  
time="2021-01-13T21:10:14Z" level=info msg="Check deployment image will be rolled from [nginxinc/nginx-unprivileged:1.17.8] to [nginxinc/nginx-unprivileged:1.17.9]"  
time="2021-01-13T21:10:14Z" level=info msg="Kubernetes client created."  
time="2021-01-13T21:10:14Z" level=info msg="Waiting for node to become ready before starting check."  
time="2021-01-13T21:10:15Z" level=error msg="Failed to check node age: nodes \"aks-default-15766151-vmss000002\" is forbidden: User \"system:serviceaccount:kuberhealthy:deployment-sa\" cannot get resource \"nodes\" in API group \"\" at the cluster scope"  
time="2021-01-13T21:10:15Z" level=info msg="Starting check."  
time="2021-01-13T21:10:15Z" level=info msg="Wiping all found orphaned resources belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Attempting to find previously created service(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Did not find any old service(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Attempting to find previously created deployment(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Did not find any old deployment(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Successfully cleaned up prior check resources."  
time="2021-01-13T21:10:15Z" level=info msg="Creating deployment resource with 4 replica(s) in kuberhealthy namespace using image [nginxinc/nginx-unprivileged:1.17.8] with environment variables: map[]"  
time="2021-01-13T21:10:15Z" level=info msg="Creating container using image [nginxinc/nginx-unprivileged:1.17.8] with environment variables: map[]"  
time="2021-01-13T21:10:15Z" level=info msg="Created deployment resource."  
time="2021-01-13T21:10:15Z" level=info msg="Creating deployment in cluster with name: deployment-deployment"  
time="2021-01-13T21:10:16Z" level=info msg="Watching for deployment to exist."  
time="2021-01-13T21:10:31Z" level=info msg="Deployment is reporting Available with True."  
time="2021-01-13T21:10:31Z" level=info msg="Created deployment in kuberhealthy namespace: deployment-deployment"  
time="2021-01-13T21:10:31Z" level=info msg="Creating service resource for kuberhealthy namespace."  
time="2021-01-13T21:10:31Z" level=info msg="Created service resource."  
time="2021-01-13T21:10:31Z" level=info msg="Creating service in cluster with name: deployment-svc"  
time="2021-01-13T21:10:31Z" level=info msg="Watching for service to exist."  
time="2021-01-13T21:10:31Z" level=info msg="Cluster IP found: 10.0.44.239"  
time="2021-01-13T21:10:31Z" level=info msg="Created service in kuberhealthy namespace: deployment-svc"  
time="2021-01-13T21:10:31Z" level=info msg="Found service cluster IP address: 10.0.44.239"  
time="2021-01-13T21:10:31Z" level=info msg="Looking for a response from the endpoint."  
time="2021-01-13T21:10:31Z" level=info msg="Beginning backoff loop for HTTP GET request."  
time="2021-01-13T21:11:01Z" level=info msg="Retrying in 5 seconds."  
time="2021-01-13T21:11:06Z" level=info msg="Successfully made an HTTP request on attempt: 2"  
time="2021-01-13T21:11:06Z" level=info msg="Got a 200 with a GET to http://10.0.44.239"  
time="2021-01-13T21:11:06Z" level=info msg="Got a result from GET request backoff: 200 OK"  
time="2021-01-13T21:11:06Z" level=info msg="Successfully hit service endpoint."  
time="2021-01-13T21:11:06Z" level=info msg="Rolling update option is enabled. Performing roll."  
time="2021-01-13T21:11:06Z" level=info msg="Creating deployment resource with 4 replica(s) in kuberhealthy namespace using image [nginxinc/nginx-unprivileged:1.17.9] with environment variables: map[]"  
time="2021-01-13T21:11:06Z" level=info msg="Creating container using image [nginxinc/nginx-unprivileged:1.17.9] with environment variables: map[]"  
time="2021-01-13T21:11:06Z" level=info msg="Created rolling-update deployment resource."  
time="2021-01-13T21:11:06Z" level=info msg="Performing rolling-update on deployment deployment-deployment to [nginxinc/nginx-unprivileged:1.17.9]"  
time="2021-01-13T21:11:06Z" level=info msg="Rolled deployment in kuberhealthy namespace: deployment-deployment"  
time="2021-01-13T21:11:06Z" level=info msg="Looking for a response from the endpoint."  
time="2021-01-13T21:11:06Z" level=info msg="Beginning backoff loop for HTTP GET request."  
time="2021-01-13T21:11:06Z" level=info msg="Successfully made an HTTP request on attempt: 1"  
time="2021-01-13T21:11:06Z" level=info msg="Got a 200 with a GET to http://10.0.44.239"  
time="2021-01-13T21:11:06Z" level=info msg="Got a result from GET request backoff: 200 OK"  
time="2021-01-13T21:11:06Z" level=info msg="Successfully hit service endpoint after rolling-update."  
time="2021-01-13T21:11:06Z" level=info msg="Cleaning up deployment and service."  
time="2021-01-13T21:11:06Z" level=info msg="Attempting to delete service deployment-svc in kuberhealthy namespace."  
time="2021-01-13T21:11:11Z" level=info msg="Attempting to delete deployment in kuberhealthy namespace."  
time="2021-01-13T21:11:16Z" level=info msg="Attempting to delete deployment in kuberhealthy namespace."  
time="2021-01-13T21:11:21Z" level=info msg="Finished clean up process."  
time="2021-01-13T21:11:21Z" level=info msg="Reporting success to Kuberhealthy."  
time="2021-01-13T21:12:40Z" level=info msg="Recovered panic: runtime error: invalid memory address or nil pointer dereference"  
panic: runtime error: invalid memory address or nil pointer dereference [recovered]  
   panic: interface conversion: interface {} is runtime.errorString, not string  
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x11ea333]  
goroutine 1 [running]:  
main.main.func1(0xc0001f9f00)  
   /build/cmd/deployment-check/main.go:189 +0x175  
panic(0x130e480, 0x2000ab0)  
   /usr/local/go/src/runtime/panic.go:679 +0x1b2  
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.sendReport(0x2039aa8, 0x0, 0x0, 0x1, 0x0, 0xc0001f9638)  
   /build/pkg/checks/external/checkclient/main.go:99 +0x4e3  
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.ReportSuccess(0xc0001f9658, 0x46983c)  
   /build/pkg/checks/external/checkclient/main.go:44 +0x7e  
main.reportToKuberhealthy(0xc000113101, 0x2039aa8, 0x0, 0x0)  
   /build/cmd/deployment-check/main.go:260 +0x33  
main.reportOKToKuberhealthy()  
   /build/cmd/deployment-check/main.go:253 +0x92  
main.runDeploymentCheck(0x166ed60, 0xc00031d4a0)  
   /build/cmd/deployment-check/run_check.go:243 +0x149d  
main.main()  
   /build/cmd/deployment-check/main.go:194 +0x36e  
stream closed

dns-status-internal:

time="2021-01-13T21:10:11Z" level=info msg="Found instance namespace: kuberhealthy"
time="2021-01-13T21:10:11Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
time="2021-01-13T21:10:11Z" level=info msg="Check time limit set to: 14m47.427936725s"
time="2021-01-13T21:10:11Z" level=info msg="Check pod is running on node: aks-default-15766151-vmss000000"
time="2021-01-13T21:10:11Z" level=debug msg="Getting pod: dns-status-internal-1610572204 in order to get its node information"
time="2021-01-13T21:10:11Z" level=error msg="Error waiting for node to reach minimum age: pods \"dns-status-internal-1610572204\" is forbidden: User \"system:serviceaccount:kuberhealthy:default\" cannot get resource \"pods\" in API group \"\" in the namespace \"kuberhealthy\""
time="2021-01-13T21:10:11Z" level=debug msg="Checking if the kuberhealthy endpoint: http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus is ready."
time="2021-01-13T21:10:11Z" level=debug msg="http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus is ready."
time="2021-01-13T21:10:11Z" level=debug msg="Kuberhealthy endpoint: http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus is ready. Proceeding to run check."
time="2021-01-13T21:10:11Z" level=debug msg="Getting pod: dns-status-internal-1610572204 in order to get its node information"
time="2021-01-13T21:10:11Z" level=error msg="Error waiting for kube proxy to be ready: error getting kuberhealthy pod: pods \"dns-status-internal-1610572204\" is forbidden: User \"system:serviceaccount:kuberhealthy:default\" cannot get resource \"pods\" in API group \"\" in the namespace \"kuberhealthy\""
time="2021-01-13T21:10:11Z" level=info msg="Running DNS status checker"
time="2021-01-13T21:10:11Z" level=info msg="DNS Status check testing hostname: kubernetes.default"
time="2021-01-13T21:10:11Z" level=info msg="DNS Status check determined that kubernetes.default was OK."
2021/01/13 21:10:11 checkClient: DEBUG: Reporting SUCCESS
2021/01/13 21:10:11 checkClient: DEBUG: Sending report with error length of:0
2021/01/13 21:10:11 checkClient: DEBUG: Sending report with ok state of:true
2021/01/13 21:10:11 checkClient: INFO: Using kuberhealthy reporting URL:http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus
2021/01/13 21:10:11 checkClient: DEBUG: Making POST request to kuberhealthy:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x11f1473]
goroutine 1 [running]:
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.sendReport(0x202a688, 0x0, 0x0, 0x1, 0xc0003265e8, 0xc0002edd74)
   /build/pkg/checks/external/checkclient/main.go:99 +0x4e3
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.ReportSuccess(0xc0002eddf0, 0xc0003265a0)
   /build/pkg/checks/external/checkclient/main.go:44 +0x7e
main.reportKHSuccess(0xc0002eddc8, 0xc0002edd70)
   /build/cmd/dns-resolution-check/main.go:182 +0x2d
main.(*Checker).Run(0xc0002f5ee0, 0xc00026fe40, 0xc0002eded0, 0x2)
   /build/cmd/dns-resolution-check/main.go:161 +0x204
main.main()
   /build/cmd/dns-resolution-check/main.go:119 +0x3ce
stream closed

hervelemeur mentioned this issue Jan 14, 2021

dns-status-internal healcheck crashes #43

Closed

hervelemeur changed the title ~~jx-secret health check pod crashes~~ Many health check pods crash Jan 14, 2021
