[Probes] Incorrect application of the initialDelaySeconds #96614
/sig Scheduling
I believe that this is expected behavior. I'll report my findings and we'll see if someone who knows this better than me can confirm. The guard is in kubernetes/pkg/kubelet/prober/worker.go, lines 226 to 228 at b2ecd1b.
What you're probably seeing is a product of the startup jitter in kubernetes/pkg/kubelet/prober/worker.go, lines 128 to 133 at b2ecd1b, which indicates that the actual first probe time can be anytime after the configured initialDelaySeconds, within roughly one periodSeconds. Again, I didn't write this, but based on your description and what I'm seeing in the code, I believe what you're seeing is expected behavior. Also, I believe this falls under sig/node.
/sig Node
I agree with @clarkmcc. The relevant check is in kubernetes/pkg/kubelet/prober/worker.go, lines 226 to 228 at b2ecd1b.
@clarkmcc Thanks for the answer. I understand the underlying idea of not flooding the kubelet with probes when it starts/restarts.
@Kanshiroron it wouldn't be random if the same value were applied to all probes. Kubernetes wasn't designed to get into the specifics of your application's life cycle. If all instances of your application require coordination with each other, that's a problem that should be solved in the application layer, not the infrastructure layer.
@clarkmcc That's not what I meant, sorry if that wasn't clear enough.
/assign
/triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
/lifecycle rotten
/close
@k8s-triage-robot: Closing this issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
@Kanshiroron: Reopened this issue.
Since we have a PR open (#102064), removing rotten: /remove-lifecycle rotten. @Kanshiroron do you have any way to check whether the PR addresses the bug?
@SergeyKanzhelev sorry, I actually missed @matthyx's PR. Unfortunately I have no easy way to test this, but with some guidance I may be able to.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
/remove-lifecycle stale
Reporting a new finding in 1.21+. The jitter is only added when the kubelet itself started less than one probe period ago; see https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/prober/worker.go#L137-L139. Say a container has 3 probes with the same periodSeconds: once the kubelet has been running longer than that period, all of them fire with no jitter at all, on a fully predictable schedule.
I see 3 problems with predictable probe invocation.
@qiutongs thanks for your analysis; what do you suggest, then?
Probes are started immediately, but the probe returns without running if the container's Running start time is less than initialDelaySeconds ago [source]. This is probably because the worker also has to handle, earlier in the same function, the events that detect whether the pod has terminated.
So far, I believe the correct logic is:
Notes:
what is the
Can you explain what that means? That the pod was created while the kubelet was still running, as opposed to the worker being initialized after the kubelet restarted? I think a good first step in improving this logic is to analyze its test coverage: which scenarios are covered and which need to be added, to ensure the docs and the probe behavior stay consistent. For example, do we have the jitter logic tested? Do we need a test that demonstrates that running all probes at once on start is a problem and that the jitter actually helps?
Same as the current value in
No. Sorry for the confusion.
That is a good idea. I believe we don't have tests for this jitter today. Any recommendation for testing such random code?
Testing a thundering herd problem looks like a performance test to me... not easy.
/lifecycle stale
/lifecycle rotten
/remove-lifecycle rotten
Hello,
We are experiencing an issue with pod probes where they do not run in sync with the configuration, which suggests that initialDelaySeconds is not correctly applied (or some other random delay is introduced). We have a deployment that contains a livenessProbe and a readinessProbe, configured as follows:
According to the configuration, both probes should run 15 seconds apart from one another, but this is not the case. The time difference is actually variable and random for each pod and replica.
Here are some logs from the kubelet, showing only a 6-second delay between the two probes:
Here is what we noticed:
Environment:
Thanks for the help and the support