Readiness probe fails if the container restarts after a liveness probe fail #12571
@Shashankft9 could you check the kubelet logs about the failure? The reason for the request failure should be recorded in there. You need to set log level >= 4 btw and restart the kubelet service.
I can see the following logs in kubelet:
@Shashankft9 it seems you are hitting this line:
Due to the user container restart, the queue proxy goes into draining mode, as described here: serving/vendor/knative.dev/pkg/network/handlers/drain.go Lines 51 to 57 in 52f07b0
In the comments it should be: "503 shutting down".
Knative applies some revision defaults but not in this case (in the ksvc above only liveness probes are set): serving/pkg/apis/serving/v1/revision_defaults.go Lines 136 to 158 in 52f07b0
One thing to consider, although it does not seem to be the case here, is that:
What happens if you increase the readiness failure threshold to 5? Right now it seems the kubelet will try for 3*10=30 seconds and then fail, which is shorter than the draining period. That probably explains what you see in the events: 6 failures and 2 container restarts. My rough guess.
hey, thanks for digging in. Just to be sure - are you suggesting that I should try changing the readiness failure threshold of the …? Another scenario I tested might be relevant - this time I added a readiness probe in the ksvc:
and I expected the same failure log in kubelet, but I got this:
I will try changing the default …
@Shashankft9 I meant the user container. Let me explain though, because eventually the issue is different. I debugged the scenario.
Here is what happens after a while:
So if you check the link above, the implementation makes sure the readiness probe is set only on the queue-proxy container (on behalf of the user container), while the liveness probe is set on the user container, although all HTTP probe calls initially go through the queue proxy anyway. The drain function will only reset the timer if a normal call is made; otherwise it keeps returning "shutting down" to probes (and a normal call can't happen here because the unhealthy container cannot be accessed). serving/vendor/knative.dev/pkg/network/handlers/drain.go Lines 88 to 111 in 52f07b0
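To make that behaviour easier to follow, here is a minimal standalone sketch (type and field names are mine, not the actual knative.dev/pkg implementation): once the drain timer is armed, probe requests only ever get a 503 "shutting down", and only a normal request resets the quiet-period timer.

```go
package sketch

import (
	"net/http"
	"strings"
	"sync"
	"time"
)

// drainer mimics the drain handler discussed above: a non-nil timer means "draining".
type drainer struct {
	mu          sync.Mutex
	inner       http.Handler
	quietPeriod time.Duration
	timer       *time.Timer // non-nil while draining
}

func (d *drainer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	d.mu.Lock()
	if d.timer != nil { // we are draining
		if isKubeletProbe(r) {
			d.mu.Unlock()
			// Probes never reset the timer; they keep seeing 503 "shutting down".
			http.Error(w, "shutting down", http.StatusServiceUnavailable)
			return
		}
		// A normal request resets the quiet-period timer.
		d.timer.Reset(d.quietPeriod)
	}
	d.mu.Unlock()
	d.inner.ServeHTTP(w, r)
}

// isKubeletProbe is a simplification; the real code inspects probe-specific headers.
func isKubeletProbe(r *http.Request) bool {
	return strings.HasPrefix(r.Header.Get("User-Agent"), "kube-probe/")
}
```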
I modified the func above by adding some comments, and as you can see from the queue-proxy logs it is stuck. Btw, readiness probes are managed by the queue proxy, which every 50ms polls the user container at the readiness URL passed to it. It respects the readiness probe settings, which are passed via an env var to the queue container and later used to set up the prober logic:
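As a rough illustration of that flow (a sketch under the assumption that the probe spec is handed over JSON-encoded in an environment variable; the variable and function names here are illustrative, not the exact queue-proxy code):

```go
package sketch

import (
	"encoding/json"
	"net/http"
	"os"
	"time"

	corev1 "k8s.io/api/core/v1"
)

const retryInterval = 50 * time.Millisecond // the 50ms poll cadence mentioned above

// readinessProbeFromEnv decodes the readiness probe settings passed to the queue container.
// The env var name is an assumption for this sketch.
func readinessProbeFromEnv() (*corev1.Probe, error) {
	var p corev1.Probe
	if err := json.Unmarshal([]byte(os.Getenv("SERVING_READINESS_PROBE")), &p); err != nil {
		return nil, err
	}
	return &p, nil
}

// pollUserContainer keeps hitting the readiness URL every retryInterval until it
// succeeds or the overall timeout budget runs out.
func pollUserContainer(url string, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode < http.StatusBadRequest {
				return true
			}
		}
		time.Sleep(retryInterval)
	}
	return false
}
```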
/cc @julz is this behavior expected (by design)?
@Shashankft9 @julz gentle ping
It seems to be the case that (serving/vendor/knative.dev/pkg/network/handlers/drain.go Lines 165 to 170 in 52f07b0) once d.timer is set, it is never set back to nil, so the queue-proxy stays in drain mode forever because this condition is always true.
If the idea of drain mode is to be there only for some quiet period, the queue-proxy should exit it once that period has passed.
The quiet period is overridden on the queue-proxy side: knative/pkg sets this by default to 45s here:
However, queue main sets that to 30s (Line 64 in 52f07b0): https://github.com/knative/serving/blob/52f07b02254b42f18fa0f71dbb0462410a5bc7b1/cmd/queue/main.go
Some more context: the drain call here is what we experience (Line 354 in 52f07b0).
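A rough sketch of that wiring (assuming the Drainer type and its Inner/QuietPeriod fields from knative.dev/pkg/network/handlers; the drain endpoint path and function name are assumptions for illustration):

```go
package sketch

import (
	"net/http"
	"time"

	"knative.dev/pkg/network/handlers"
)

// buildDrainer shows how queue-proxy ends up with a 30s quiet period instead of
// the 45s default from knative/pkg, and how the user container's preStop hook
// triggers the drain.
func buildDrainer(composedHandler http.Handler, adminMux *http.ServeMux) *handlers.Drainer {
	drainer := &handlers.Drainer{
		Inner:       composedHandler,  // the normal request path
		QuietPeriod: 30 * time.Second, // queue-proxy override; knative/pkg defaults to 45s
	}
	// The preStop hook on the user container hits this endpoint (the path is an assumption),
	// which blocks until the quiet period elapses without new requests.
	adminMux.HandleFunc("/wait-for-drain", func(w http.ResponseWriter, r *http.Request) {
		drainer.Drain()
	})
	return drainer
}
```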
Line 16 in the queue proxy logs: https://gist.github.com/skonto/749b5a34b464a786d2d3f473da0453d2 This drain is called because we use preStop handlers on the user container.
The user container was restarted.
Now, when we drain for the first time we set a timer; that timer is continuously reset if there is a new non-probe request. If we set that timer to nil after draining is done, I suspect that we will be able to move on with readiness probes.
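Sticking with the illustrative sketch from earlier (again, names are mine, not the actual knative.dev/pkg code), the candidate fix would look roughly like this: arm the timer on the first drain, and clear it once the quiet period elapses so the handler leaves drain mode and probes can succeed again.

```go
package sketch

import (
	"sync"
	"time"
)

type fixedDrainer struct {
	mu          sync.Mutex
	quietPeriod time.Duration
	timer       *time.Timer // non-nil means "draining"
}

// startDrain begins the quiet period. The timer callback holds the proposed change:
// once the quiet period elapses without being reset, set the timer back to nil so
// later readiness probes are answered normally again.
func (d *fixedDrainer) startDrain() {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.timer != nil {
		return // already draining
	}
	d.timer = time.AfterFunc(d.quietPeriod, func() {
		d.mu.Lock()
		d.timer = nil // <- candidate fix discussed in this thread
		d.mu.Unlock()
	})
}
```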
@skonto I am running into this issue on knative v1.1.0, wondering if there is a fix WIP. |
@pramenn afaik @Shashankft9 was planning to do a PR to restart the timer as a candidate fix. |
hey yeah, I have done the change in my dev environment and the readiness probe seems to work fine with it - waiting on @dprotaso's opinion on the suggested solution a couple of comments above and also (same solution) here: https://knative.slack.com/archives/CA4DNJ9A4/p1646983509263459
We are also running into this issue, and were wondering if there was any recent update? (@Shashankft9) Thanks.
/triage accepted
Thanks for the deep breakdown everyone - I think what we really need is a clear signal to distinguish between the Pod shutting down and either container (queue-proxy/user-container) restarting. I'm not sure that's straightforward. Will dig into this for the next release. I worry about setting the timer to nil - because I believe in the case where the Pod is shutting down it may let a request through rather than returning a 503.
/assign @dprotaso
Hi, we are also facing this issue, is there any workaround? |
I see @dprotaso changed the milestone for this issue to v1.7.0, but I was wondering if anyone is aware of a workaround, as @langelini-photosi is also asking. I think this issue makes it impossible to use any of the 1.x versions in a production environment.
It would be very interesting to know how Google Cloud got around this issue. I tried a quick setup of Cloud Run for Anthos and saw that it does not present this issue. Anyway, on Cloud Run for Anthos they provide their own queue-proxy: gke.gcr.io/knative/queue@sha256:2148fc4af5f5cda0594d97c3027b316bcba3c8f400d3d57087807f90e80641d3 I tried to use this queue-proxy image on my Knative 1.3.1 in a self-managed cluster and it seems to solve the issue. I hope this helps.
Hey @faxioman - the 1.6 release is today. I'll be taking a look at this for 1.7.
Google's Cloud Run is a separate implementation and they've stopped contributing to upstream.
Thanks @dprotaso! |
Linked PR: Update the end-to-end test to use Knative 1.7.4 to test its compatibility against Merlin and KServe 0.9.0. The PR also updates the k8s version used for testing from 1.21 to 1.22. We need to update to Knative 1.7.4 because the current version (1.3.3) suffers from issue knative/serving#12571.
What version of Knative?
v1.2
Expected Behavior
After a liveness probe failure, the container should restart and ideally start serving traffic again, just like it would for a plain Kubernetes deployment.
Actual Behavior
After the restart caused by the liveness probe failure, the pod starts failing its readiness probe and serves no traffic.
Steps to Reproduce the Problem
ksvc:
the image here is produced from the code here: https://github.com/knative/serving/blob/db4c85b641703f94a700ada8cc074d28b423b5eb/test/test_images/healthprobes/health_probes.go
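For context, here is a minimal sketch of what such a test server does (simplified; the linked health_probes.go is the authoritative source, and the port and `/healthz` path are assumptions): it reports healthy until `/start-failing` is hit, after which the health endpoint starts returning errors.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

func main() {
	var failing atomic.Bool // flips to true once /start-failing is hit

	http.HandleFunc("/start-failing", func(w http.ResponseWriter, r *http.Request) {
		failing.Store(true)
		w.WriteHeader(http.StatusOK)
	})
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if failing.Load() {
			http.Error(w, "unhealthy", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":8080", nil)
}
```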
Hit the `/start-failing` path. Relevant events of the ksvc pod:
I tried to run a k8s deployment with the same image and liveness probe configuration, and it behaved as I hoped: whenever I explicitly hit the `/start-failing` path, within a few seconds the liveness probe reports failure, the container restarts, and it starts serving traffic again.