Many health check pods crash #40

Open
hervelemeur opened this issue Jan 12, 2021 · 3 comments

Comments

@hervelemeur

On a new AKS cluster, the jx-secrets health check pod crashes with this error:

time="2021-01-12T12:49:05Z" level=info msg="Found instance namespace: kuberhealthy"
time="2021-01-12T12:49:05Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
starting jx-webhooks health checks
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x12080b4]
goroutine 1 [running]:
main.Options.findErrors(0x15b0f80, 0xc000341430, 0x0, 0x22, 0x0, 0x0, 0x0)
    /workspace/source/cmd/jx-secrets/main.go:68 +0x114
main.main()
    /workspace/source/cmd/jx-secrets/main.go:32 +0x107
stream closed

(commit ref for the cluster: jx3-gitops-repositories/jx3-terraform-azure@8688dcb)

@hervelemeur
Author

No external secrets are missing:

$ kubectl get externalsecrets -n jx
NAME                             LAST SYNC   STATUS    AGE
jenkins-maven-settings           16s         SUCCESS   60m
jenkins-release-gpg              2s          SUCCESS   60m
jenkins-x-chartmuseum            14s         SUCCESS   60m
jx-basic-auth-htpasswd           13s         SUCCESS   60m
jx-basic-auth-user-password      1s          SUCCESS   60m
lighthouse-hmac-token            11s         SUCCESS   60m
lighthouse-oauth-token           6s          SUCCESS   60m
nexus                            7s          SUCCESS   60m
tekton-container-registry-auth   10s         SUCCESS   60m
tekton-git                       4s          SUCCESS   60m

@hervelemeur
Author

I then manually deleted the failing pod. All subsequent pods reached "Completed" without any problem.

@hervelemeur
Author

Other health checks are crashing as well.

deployment:

time="2021-01-13T21:10:14Z" level=info msg="Found instance namespace: kuberhealthy"  
time="2021-01-13T21:10:14Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."  
time="2021-01-13T21:10:14Z" level=info msg="Found pod namespace: kuberhealthy"  
time="2021-01-13T21:10:14Z" level=info msg="Performing check in kuberhealthy namespace."  
time="2021-01-13T21:10:14Z" level=info msg="Parsed CHECK_DEPLOYMENT_REPLICAS: 4"
time="2021-01-13T21:10:14Z" level=info msg="Parsed CHECK_SERVICE_ACCOUNT: default"
time="2021-01-13T21:10:14Z" level=info msg="Check time limit set to: 14m45.119966409s"  
time="2021-01-13T21:10:14Z" level=info msg="Parsed CHECK_DEPLOYMENT_ROLLING_UPDATE: true"
time="2021-01-13T21:10:14Z" level=info msg="Check deployment image will be rolled from [nginxinc/nginx-unprivileged:1.17.8] to [nginxinc/nginx-unprivileged:1.17.9]"
time="2021-01-13T21:10:14Z" level=info msg="Kubernetes client created."  
time="2021-01-13T21:10:14Z" level=info msg="Waiting for node to become ready before starting check."  
time="2021-01-13T21:10:15Z" level=error msg="Failed to check node age: nodes \"aks-default-15766151-vmss000002\" is forbidden: User \"system:serviceaccount:kuberhealthy:deployment-sa\" cannot get resource \"nodes\" in API group \"\" at the cluster scope"
time="2021-01-13T21:10:15Z" level=info msg="Starting check."  
time="2021-01-13T21:10:15Z" level=info msg="Wiping all found orphaned resources belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Attempting to find previously created service(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Did not find any old service(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Attempting to find previously created deployment(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Did not find any old deployment(s) belonging to this check."  
time="2021-01-13T21:10:15Z" level=info msg="Successfully cleaned up prior check resources."  
time="2021-01-13T21:10:15Z" level=info msg="Creating deployment resource with 4 replica(s) in kuberhealthy namespace using image [nginxinc/nginx-unprivileged:1.17.8] with environment variables: map[]"
time="2021-01-13T21:10:15Z" level=info msg="Creating container using image [nginxinc/nginx-unprivileged:1.17.8] with environment variables: map[]"
time="2021-01-13T21:10:15Z" level=info msg="Created deployment resource."  
time="2021-01-13T21:10:15Z" level=info msg="Creating deployment in cluster with name: deployment-deployment"  
time="2021-01-13T21:10:16Z" level=info msg="Watching for deployment to exist."  
time="2021-01-13T21:10:31Z" level=info msg="Deployment is reporting Available with True."  
time="2021-01-13T21:10:31Z" level=info msg="Created deployment in kuberhealthy namespace: deployment-deployment"  
time="2021-01-13T21:10:31Z" level=info msg="Creating service resource for kuberhealthy namespace."  
time="2021-01-13T21:10:31Z" level=info msg="Created service resource."  
time="2021-01-13T21:10:31Z" level=info msg="Creating service in cluster with name: deployment-svc"  
time="2021-01-13T21:10:31Z" level=info msg="Watching for service to exist."  
time="2021-01-13T21:10:31Z" level=info msg="Cluster IP found: 10.0.44.239"  
time="2021-01-13T21:10:31Z" level=info msg="Created service in kuberhealthy namespace: deployment-svc"  
time="2021-01-13T21:10:31Z" level=info msg="Found service cluster IP address: 10.0.44.239"  
time="2021-01-13T21:10:31Z" level=info msg="Looking for a response from the endpoint."  
time="2021-01-13T21:10:31Z" level=info msg="Beginning backoff loop for HTTP GET request."  
time="2021-01-13T21:11:01Z" level=info msg="Retrying in 5 seconds."  
time="2021-01-13T21:11:06Z" level=info msg="Successfully made an HTTP request on attempt: 2"  
time="2021-01-13T21:11:06Z" level=info msg="Got a 200 with a GET to http://10.0.44.239"  
time="2021-01-13T21:11:06Z" level=info msg="Got a result from GET request backoff: 200 OK"  
time="2021-01-13T21:11:06Z" level=info msg="Successfully hit service endpoint."  
time="2021-01-13T21:11:06Z" level=info msg="Rolling update option is enabled. Performing roll."  
time="2021-01-13T21:11:06Z" level=info msg="Creating deployment resource with 4 replica(s) in kuberhealthy namespace using image [nginxinc/nginx-unprivileged:1.17.9] with environment variables: map[]"
time="2021-01-13T21:11:06Z" level=info msg="Creating container using image [nginxinc/nginx-unprivileged:1.17.9] with environment variables: map[]"
time="2021-01-13T21:11:06Z" level=info msg="Created rolling-update deployment resource."  
time="2021-01-13T21:11:06Z" level=info msg="Performing rolling-update on deployment deployment-deployment to [nginxinc/nginx-unprivileged:1.17.9]"
time="2021-01-13T21:11:06Z" level=info msg="Rolled deployment in kuberhealthy namespace: deployment-deployment"  
time="2021-01-13T21:11:06Z" level=info msg="Looking for a response from the endpoint."  
time="2021-01-13T21:11:06Z" level=info msg="Beginning backoff loop for HTTP GET request."  
time="2021-01-13T21:11:06Z" level=info msg="Successfully made an HTTP request on attempt: 1"  
time="2021-01-13T21:11:06Z" level=info msg="Got a 200 with a GET to http://10.0.44.239"  
time="2021-01-13T21:11:06Z" level=info msg="Got a result from GET request backoff: 200 OK"  
time="2021-01-13T21:11:06Z" level=info msg="Successfully hit service endpoint after rolling-update."  
time="2021-01-13T21:11:06Z" level=info msg="Cleaning up deployment and service."  
time="2021-01-13T21:11:06Z" level=info msg="Attempting to delete service deployment-svc in kuberhealthy namespace."  
time="2021-01-13T21:11:11Z" level=info msg="Attempting to delete deployment in kuberhealthy namespace."  
time="2021-01-13T21:11:16Z" level=info msg="Attempting to delete deployment in kuberhealthy namespace."  
time="2021-01-13T21:11:21Z" level=info msg="Finished clean up process."  
time="2021-01-13T21:11:21Z" level=info msg="Reporting success to Kuberhealthy."  
time="2021-01-13T21:12:40Z" level=info msg="Recovered panic: runtime error: invalid memory address or nil pointer dereference"  
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
   panic: interface conversion: interface {} is runtime.errorString, not string
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x11ea333]
goroutine 1 [running]:
main.main.func1(0xc0001f9f00)  
   /build/cmd/deployment-check/main.go:189 +0x175  
panic(0x130e480, 0x2000ab0)  
   /usr/local/go/src/runtime/panic.go:679 +0x1b2  
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.sendReport(0x2039aa8, 0x0, 0x0, 0x1, 0x0, 0xc0001f9638)  
   /build/pkg/checks/external/checkclient/main.go:99 +0x4e3  
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.ReportSuccess(0xc0001f9658, 0x46983c)  
   /build/pkg/checks/external/checkclient/main.go:44 +0x7e  
main.reportToKuberhealthy(0xc000113101, 0x2039aa8, 0x0, 0x0)  
   /build/cmd/deployment-check/main.go:260 +0x33  
main.reportOKToKuberhealthy()  
   /build/cmd/deployment-check/main.go:253 +0x92  
main.runDeploymentCheck(0x166ed60, 0xc00031d4a0)  
   /build/cmd/deployment-check/run_check.go:243 +0x149d
main.main()  
   /build/cmd/deployment-check/main.go:194 +0x36e  
stream closed

dns-status-internal:

time="2021-01-13T21:10:11Z" level=info msg="Found instance namespace: kuberhealthy"
time="2021-01-13T21:10:11Z" level=info msg="Kuberhealthy is located in the kuberhealthy namespace."
time="2021-01-13T21:10:11Z" level=info msg="Check time limit set to: 14m47.427936725s"
time="2021-01-13T21:10:11Z" level=info msg="Check pod is running on node: aks-default-15766151-vmss000000"
time="2021-01-13T21:10:11Z" level=debug msg="Getting pod: dns-status-internal-1610572204 in order to get its node information"
time="2021-01-13T21:10:11Z" level=error msg="Error waiting for node to reach minimum age: pods \"dns-status-internal-1610572204\" is forbidden: User \"system:serviceaccount:kuberhealthy:default\" cannot get resource \"pods\" in API group \"\" in the namespace \"kuberhealthy\""
time="2021-01-13T21:10:11Z" level=debug msg="Checking if the kuberhealthy endpoint: http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus is ready."
time="2021-01-13T21:10:11Z" level=debug msg="http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus is ready."
time="2021-01-13T21:10:11Z" level=debug msg="Kuberhealthy endpoint: http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus is ready. Proceeding to run check."
time="2021-01-13T21:10:11Z" level=debug msg="Getting pod: dns-status-internal-1610572204 in order to get its node information"
time="2021-01-13T21:10:11Z" level=error msg="Error waiting for kube proxy to be ready: error getting kuberhealthy pod: pods \"dns-status-internal-1610572204\" is forbidden: User \"system:serviceaccount:kuberhealthy:default\" cannot get resource \"pods\" in API group \"\" in the namespace \"kuberhealthy\""
time="2021-01-13T21:10:11Z" level=info msg="Running DNS status checker"
time="2021-01-13T21:10:11Z" level=info msg="DNS Status check testing hostname: kubernetes.default"
time="2021-01-13T21:10:11Z" level=info msg="DNS Status check determined that kubernetes.default was OK."
2021/01/13 21:10:11 checkClient: DEBUG: Reporting SUCCESS
2021/01/13 21:10:11 checkClient: DEBUG: Sending report with error length of:0
2021/01/13 21:10:11 checkClient: DEBUG: Sending report with ok state of:true
2021/01/13 21:10:11 checkClient: INFO: Using kuberhealthy reporting URL:http://kuberhealthy.kuberhealthy.svc.cluster.local/externalCheckStatus
2021/01/13 21:10:11 checkClient: DEBUG: Making POST request to kuberhealthy:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x11f1473]
goroutine 1 [running]:
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.sendReport(0x202a688, 0x0, 0x0, 0x1, 0xc0003265e8, 0xc0002edd74)
   /build/pkg/checks/external/checkclient/main.go:99 +0x4e3
github.com/Comcast/kuberhealthy/v2/pkg/checks/external/checkclient.ReportSuccess(0xc0002eddf0, 0xc0003265a0)
   /build/pkg/checks/external/checkclient/main.go:44 +0x7e
main.reportKHSuccess(0xc0002eddc8, 0xc0002edd70)
   /build/cmd/dns-resolution-check/main.go:182 +0x2d
main.(*Checker).Run(0xc0002f5ee0, 0xc00026fe40, 0xc0002eded0, 0x2)
   /build/cmd/dns-resolution-check/main.go:161 +0x204
main.main()
   /build/cmd/dns-resolution-check/main.go:119 +0x3ce
stream closed

@hervelemeur changed the title from "jx-secret health check pod crashes" to "Many health check pods crash" on Jan 14, 2021