
Good gRPC deployment pods frequently fail at least one health check #3308

Closed
ahmetb opened this issue Feb 23, 2019 · 7 comments · Fixed by #4148

@ahmetb (Member) commented Feb 23, 2019

/area networking
/kind bug

In my experience, no matter which gRPC service you deploy (even the sample grpc-ping-go app), you'll see at least one failed health check in the pod.

Even though the app is totally fine, it fails one HTTP health check about 90% of the time: a GET /health on :8022 (the queue-proxy container):

$ kubectl describe pod [...]

Events:
  Type     Reason     Age   From                                                    Message
  ----     ------     ----  ----                                                    -------
  Normal   Scheduled  7s    default-scheduler                                       Successfully assigned default/productcatalogservice-dsvgk-deployment-77b69b669f-v2rjm to gke-kn-hipstershop-default-pool-dd20ed91-nd2x
  Normal   Pulled     6s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Container image "docker.io/istio/proxy_init:1.0.2" already present on machine
  Normal   Created    6s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Created container
  Normal   Started    6s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Started container
  Normal   Pulled     4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Container image "gcr.io/knative-releases/github.com/knative/serving/cmd/queue@sha256:e19ca17d2b729904d2662a30b6c5c27cf4b62fd64baef2da4125525a4f9346e5" already present on machine
  Normal   Created    4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Created container
  Normal   Started    4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Started container
  Normal   Pulled     4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Container image "gcr.io/google-samples/microservices-demo/productcatalogservice@sha256:1aa051838dd0de9be25321bdc42cb3b4f247a01642cfd75b6ae866182495c531" already present on machine
  Normal   Created    4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Created container
  Normal   Started    4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Started container
  Normal   Pulled     4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Container image "docker.io/istio/proxyv2:1.0.2" already present on machine
  Normal   Created    4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Created container
  Normal   Started    4s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Started container
  Warning  Unhealthy  3s    kubelet, gke-kn-hipstershop-default-pool-dd20ed91-nd2x  Readiness probe failed: Get http://10.36.1.43:8022/health: dial tcp 10.36.1.43:8022: connect: connection refused

I think it's a timing/ordering issue, but it's distracting from actual failures in my app. I spent a long time trying to understand what /health is and why it's querying :8022. It's not immediately obvious to new users.
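
To make the failure mode concrete, here's a minimal Go sketch (illustration only, not Knative code) of the race: a prober dials the health port before the server is listening and gets connection refused, exactly like the kubelet's first readiness check against the queue-proxy's :8022/health.

```go
// Minimal illustration (not Knative code) of the probe race: the first dial
// happens before the health server is listening, so it fails with
// "connection refused"; a later retry succeeds.
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// probe mimics a single readiness check: one TCP dial to the health port.
func probe(addr string) error {
	conn, err := net.DialTimeout("tcp", addr, time.Second)
	if err != nil {
		return err // e.g. "connect: connection refused" before the listener is up
	}
	return conn.Close()
}

func main() {
	const addr = "127.0.0.1:8022"

	// First probe fires before anything is listening on :8022,
	// like the kubelet's initial check against the queue-proxy.
	fmt.Println("first probe:", probe(addr))

	// The health endpoint comes up a moment later (simulated startup delay).
	go func() {
		time.Sleep(500 * time.Millisecond)
		http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
			w.WriteHeader(http.StatusOK)
		})
		_ = http.ListenAndServe(addr, nil)
	}()

	// A retried probe succeeds once the listener is bound.
	time.Sleep(time.Second)
	fmt.Println("second probe:", probe(addr))
}
```

Running it prints a connection-refused error for the first probe and `<nil>` for the retried one.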

@tanzeeb (Member) commented Apr 2, 2019

/assign

tanzeeb added a commit to tanzeeb/serving that referenced this issue Apr 23, 2019
Fixes knative#3308

There's a race between the `user-container` binding to its port and the
`queue-proxy` reporting a successful connection to that port. We get
false-alarm warnings like this in the k8s event log:

```
Readiness probe failed: Get http://10.36.0.97:8022/health: dial tcp 10.36.0.97:8022: connect: connection refused
```

This change gives the `user-container` a few seconds to get started.
tanzeeb added a commit to tanzeeb/serving that referenced this issue Apr 23, 2019
Fixes knative#3308

There's a race between the `queue-proxy` health check handler starting and the
readiness probe firing. We get false-alarm warnings like this in the k8s event log:

```
Readiness probe failed: Get http://10.36.0.97:8022/health: dial tcp 10.36.0.97:8022: connect: connection refused
```

This change gives the `queue-proxy` a few seconds to get started.
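
For readers following along, here's a rough sketch of what this kind of grace period could look like on the queue-proxy side: a hypothetical /health handler that waits a few seconds for the user-container's port to bind before reporting unhealthy. Ports, timings, and handler shape are illustrative assumptions, not the actual patch.

```go
// Sketch only (not the actual Knative change): a /health handler in the style
// of the queue-proxy that gives the user-container a short grace period to
// bind its port, retrying the TCP dial before reporting unhealthy.
package main

import (
	"net"
	"net/http"
	"time"
)

// waitForPort retries a TCP dial to addr until it succeeds or the grace
// period runs out.
func waitForPort(addr string, grace time.Duration) bool {
	deadline := time.Now().Add(grace)
	for time.Now().Before(deadline) {
		if conn, err := net.DialTimeout("tcp", addr, 100*time.Millisecond); err == nil {
			conn.Close()
			return true
		}
		time.Sleep(50 * time.Millisecond)
	}
	return false
}

func main() {
	const userContainerAddr = "127.0.0.1:8080" // assumed user-container port

	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		// Instead of failing the very first check while the user-container
		// is still starting, wait up to a few seconds for its port to bind.
		if waitForPort(userContainerAddr, 3*time.Second) {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	// The endpoint the readiness probe hits (":8022/health" in the events above).
	_ = http.ListenAndServe(":8022", nil)
}
```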
@tanzeeb (Member) commented Apr 23, 2019

Not gRPC-specific; this can be reproduced with any service. The Kubernetes readiness probe fires before the queue-proxy's health handler has started.

@tanzeeb (Member) commented Apr 23, 2019

Update: We may want to leave this unfixed, see thread in #3869

@mattmoor (Member) commented May 8, 2019

A possible solution to this is to use an exec probe in the queue-proxy to probe itself, since an exec probe can hang (block) until the target is ready.

IIRC this has potential issues checking for mesh readiness, but I'm mulling a possible solution involving the downward API.

@mattmoor (Member) commented May 8, 2019

tl;dr I think we should probably find a way to make exec probes in the queue-proxy work, blocking until they get what they want.
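
As a concrete illustration of that idea, here's a minimal sketch of an exec-style probe helper that blocks: it polls the local health endpoint until it answers 200 or a deadline passes, then exits 0 or 1, so the kubelet never records a connection-refused event. The endpoint, timeout, and exit behavior are assumptions, not the actual queue-proxy implementation.

```go
// Sketch of a blocking exec-probe helper: poll the local health endpoint
// until it returns 200 OK or the deadline expires, then exit 0 (ready)
// or 1 (not ready). Endpoint and timings are illustrative assumptions.
package main

import (
	"net/http"
	"os"
	"time"
)

func main() {
	const url = "http://127.0.0.1:8022/health" // assumed local health endpoint
	deadline := time.Now().Add(10 * time.Second)
	client := &http.Client{Timeout: time.Second}

	for time.Now().Before(deadline) {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				os.Exit(0) // ready: the kubelet records a passing probe
			}
		}
		time.Sleep(100 * time.Millisecond) // not up yet; keep waiting
	}
	os.Exit(1) // still not ready after the deadline
}
```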

@mattmoor (Member) commented May 8, 2019

IDK that we need the downward API for anything other than working around Istio, and hopefully we no longer need to work around Istio, which IIUC has its own readiness probes.

@mattmoor (Member) commented Jun 12, 2019

/assign @joshrider

I believe the issue assigned to Josh should address this.
