
[Revisit] Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds #80431

Closed
philip-fox opened this issue Jul 22, 2019 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@philip-fox

philip-fox commented Jul 22, 2019

/kind bug

What happened:
Similar to the now closed ticket: Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds

On that ticket, people commented wondering why it was closed when the problem was never fixed. I'm wondering the same thing: I've encountered the problem myself and would like a fix for it in a future release, if possible :-).

It seems that if periodSeconds is quite small, e.g. 60, the readiness probe is invoked very soon after the initialDelaySeconds interval. However, if periodSeconds is relatively large, e.g. 900, then (pretty much) no matter what initialDelaySeconds, timeoutSeconds, and failureThreshold are set to, the readiness probe seems to be invoked some time after periodSeconds has elapsed.
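For reference, a probe of the kind described above looks roughly like the sketch below when built with the Go client types. This is a minimal sketch only: the endpoint path, port, and values are illustrative and not taken from the actual deployment in this report, and it is written against the core/v1 API current at the time of this issue, where Probe embeds Handler.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Illustrative values only -- not the actual deployment from this report.
	readiness := corev1.Probe{
		Handler: corev1.Handler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",           // hypothetical readiness endpoint
				Port: intstr.FromInt(8080), // hypothetical container port
			},
		},
		InitialDelaySeconds: 30,  // wait this long after container start before probing
		PeriodSeconds:       900, // a "large" period, as in the scenario above
		TimeoutSeconds:      5,
		FailureThreshold:    3,
	}
	fmt.Printf("%+v\n", readiness)
}
```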

What you expected to happen:
See: Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds

I'd expect that the first call of the probe would be very soon after initialDelaySeconds elapses, but my experiments show that that's not the case.

How to reproduce it (as minimally and precisely as possible):
See: Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
```
kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.8+IKS", GitCommit:"fe3e332c2b0f47d4572433c3b0a1687a27fb88c6", GitTreeState:"clean", BuildDate:"2019-07-11T13:45:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
```
@philip-fox philip-fox added the kind/bug Categorizes issue or PR as related to a bug. label Jul 22, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 22, 2019
@philip-fox
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 22, 2019
@mattjmcnaughton
Contributor

Hi @philip-fox ! Thanks for reopening this.

Can you explain more about what you view as the desired behavior? I just want us to quickly think through if there are any downsides to the potential new behavior before going forward with an implementation.

@philip-fox
Author

Hi @mattjmcnaughton

Thanks for the reply. I suspect that when periodSeconds is large, the readiness probe isn't being called before the first periodSeconds interval elapses. I'd expect it to be called well before that.

I'm trying to run some experiments so that I can compile a table of timings which I'll add to this ticket very soon.

Maybe I'm wrong, but I think there's something weird going on, and from looking at the previous ticket, I don't think I'm alone.

@zouyee
Member

zouyee commented Jul 24, 2019

/assign

@philip-fox
Author

philip-fox commented Jul 24, 2019

The following three tables contain the timings witnessed for deployments of three different services. The column first_probe_completes shows how many seconds elapsed before the readiness probe first returned, and within_first_period? shows whether the probe returned within the first periodSeconds interval.
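(The sketch below shows one way such first-probe timings could be captured; it is not necessarily how the numbers below were measured, and the endpoint path and port are assumptions. The readiness handler simply logs the elapsed time since process start the first time the kubelet calls it.)

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

func main() {
	start := time.Now() // approximates container start when this process is the container's entrypoint
	var once sync.Once

	// Hypothetical readiness endpoint; the kubelet's first GET here marks
	// when the first readiness probe actually fired.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		once.Do(func() {
			log.Printf("first readiness probe after %.0fs", time.Since(start).Seconds())
		})
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```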

Service_1_Pod (timeoutSeconds=5)

| periodSeconds | initialDelaySeconds | first_probe_completes | within_first_period? |
|---------------|---------------------|-----------------------|----------------------|
| 120  | 15 | 86   | yes |
| 120  | 15 | 86   | yes |
| 180  | 15 | 60   | yes |
| 180  | 15 | 85   | yes |
| 180  | 15 | 198  | no  |
| 180  | 15 | 55   | yes |
| 240  | 15 | 180  | yes |
| 240  | 15 | 159  | yes |
| 240  | 15 | 139  | yes |
| 300  | 15 | 143  | yes |
| 300  | 15 | 186  | yes |
| 300  | 15 | 173  | yes |
| 300  | 15 | 102  | yes |
| 480  | 15 | 344  | yes |
| 480  | 30 | 342  | yes |
| 600  | 30 | 89   | yes |
| 600  | 30 | 300  | yes |
| 900  | 30 | 251  | yes |
| 900  | 30 | 548  | yes |
| 900  | 30 | 840  | yes |
| 1800 | 30 | 972  | yes |
| 1800 | 30 | 1122 | yes |
| 1800 | 30 | 283  | yes |
| 1800 | 30 | 1714 | yes |

Service_2_Pod (timeoutSeconds=5)

| periodSeconds | initialDelaySeconds | first_probe_completes | within_first_period? |
|---------------|---------------------|-----------------------|----------------------|
| 120  | 10 | 43   | yes |
| 120  | 10 | 69   | yes |
| 120  | 10 | 60   | yes |
| 120  | 10 | 55   | yes |
| 240  | 10 | 240  | no  |
| 240  | 10 | 221  | yes |
| 300  | 10 | 43   | yes |
| 300  | 10 | 52   | yes |
| 300  | 10 | 135  | yes |
| 300  | 10 | 25   | yes |
| 480  | 10 | 49   | yes |
| 480  | 10 | 125  | yes |
| 600  | 10 | 244  | yes |
| 600  | 10 | 347  | yes |
| 900  | 10 | 786  | yes |
| 900  | 10 | 563  | yes |
| 900  | 10 | 603  | yes |
| 1800 | 10 | 248  | yes |
| 1800 | 10 | 1146 | yes |
| 1800 | 10 | 1452 | yes |
| 1800 | 10 | 421  | yes |

Service_3_Pod (timeoutSeconds=5)

| periodSeconds | initialDelaySeconds | first_probe_completes | within_first_period? |
|---------------|---------------------|-----------------------|----------------------|
| 240  | 30 | 205  | yes |
| 240  | 30 | 180  | yes |
| 300  | 30 | 261  | yes |
| 300  | 30 | 315  | no  |
| 300  | 30 | 143  | yes |
| 300  | 30 | 176  | yes |
| 480  | 30 | 112  | yes |
| 480  | 30 | 449  | yes |
| 600  | 30 | 361  | yes |
| 600  | 30 | 230  | yes |
| 900  | 30 | 693  | yes |
| 900  | 30 | 186  | yes |
| 900  | 30 | 152  | yes |
| 1800 | 30 | 1644 | yes |
| 1800 | 30 | 855  | yes |
| 1800 | 30 | 1285 | yes |
| 1800 | 30 | 801  | yes |

So two things seem clear from looking at those timings:

  1. My initial suspicion, that the first call of the probe only comes after the first periodSeconds interval elapses when periodSeconds is relatively large, appears to be invalid (see within_first_period? in the tables above).
  2. I would have expected the first call to the readiness probe to return very soon after initialDelaySeconds elapses, i.e. at most just after
     `initialDelaySeconds` + (`timeoutSeconds` * `failureThreshold`),
     but this usually isn't the case, as witnessed. My understanding of timeoutSeconds and failureThreshold might be a little suspect, but I'd at least expect the probe to return shortly after its initialDelaySeconds elapses once the pod is deployed. (One hypothesis that would explain the spread is sketched after this list.)
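One hypothesis that would explain the spread in the tables, though nothing in this thread confirms it: if the kubelet waited an additional random fraction of periodSeconds before running the first probe, on top of initialDelaySeconds, then first-probe times would land roughly uniformly in [initialDelaySeconds, initialDelaySeconds + periodSeconds], which is consistent with all three tables. Below is a minimal stdlib-only simulation of that hypothesis; simulateFirstProbe is a hypothetical helper modelling the observed behavior, not kubelet code.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// simulateFirstProbe returns when the first probe would fire under the
// hypothesis that a random startup jitter of up to periodSeconds is added
// on top of initialDelaySeconds.
func simulateFirstProbe(initialDelay, period time.Duration) time.Duration {
	jitter := time.Duration(rand.Float64() * float64(period))
	return initialDelay + jitter
}

func main() {
	rand.Seed(time.Now().UnixNano())
	initialDelay := 30 * time.Second
	period := 900 * time.Second

	for i := 0; i < 5; i++ {
		t := simulateFirstProbe(initialDelay, period)
		fmt.Printf("simulated first probe at %.0fs\n", t.Seconds())
	}
}
```

Under this model, with periodSeconds=900 and initialDelaySeconds=30, a first probe anywhere between roughly 30s and 930s would be unsurprising, which matches the 251s, 548s, and 840s rows in the first table.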

@philip-fox
Author

@zouyee
Hi Zou Nengren. Do you plan to progress this issue?

@zouyee zouyee removed their assignment Aug 22, 2019
@zouyee
Member

zouyee commented Aug 22, 2019

I don't have time for a while; if you have plans to solve it, please carry on.

@philip-fox
Author

@mattjmcnaughton
Hey Matt. Can this be assigned to someone who can take a look at it?

@mattjmcnaughton
Contributor

Hi @philip-fox ! It might be useful to see if anyone in the #sig-node slack channel has the interest/capacity to pick it up?

@alanorwick

alanorwick commented Nov 5, 2019

Hi! I'm a CS student in university. @ryanarifin134 and I are interested in resolving this issue for a virtualization class assignment. Can we have this assigned to us? @mattjmcnaughton

@philip-fox
Author

@alanorwick Sure, that would be great, thanks!

@ryanarifin134

/assign

@alanorwick

alanorwick commented Dec 1, 2019

We have submitted a pull request regarding this issue. @philip-fox do you think our changes are reasonable?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 30, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@samgoncalves90

This bug remains in version 1.17; can it be reopened? See 85769.

@ryanarifin134 ryanarifin134 removed their assignment Jun 16, 2020
@oaicstef

oaicstef commented Oct 9, 2020

Hi, I'm having the same issue. At first I thought it was caused by a misconfiguration on my side, but then I saw that the behavior doesn't match what the documentation describes.
This is the second "closed" issue I've found about the same problem; is anyone trying to solve it? It seems like a real issue to me: in my pipelines a pod takes around 3 to 5 minutes to become ready, and that is not nice behavior.

Please, can anyone help on the topic?
