
[Revisit] Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds #80431

Closed
philip-fox opened this issue Jul 22, 2019 · 19 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@philip-fox

philip-fox commented Jul 22, 2019

/kind bug

What happened:
Similar to the now closed ticket: Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds

On that ticket, people commented wondering why it was closed when the problem was never fixed. I'm wondering the same thing: I've encountered the problem myself and would like a fix for it in a future release, if possible :-).

It seems that if periodSeconds is quite small, e.g. 60, the readiness probe is invoked very soon after the initialDelaySeconds interval. However, if periodSeconds is relatively large, e.g. 900, then (pretty much) no matter what initialDelaySeconds, timeoutSeconds, and failureThreshold are set to, the readiness probe seems to be invoked some time after periodSeconds has elapsed.
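For reference, a probe of the kind described above looks roughly like the sketch below when built with the Go client types. This is a minimal sketch only: the endpoint path, port, and values are illustrative and not taken from the actual deployment in this report, and it is written against the core/v1 API current at the time of this issue, where Probe embeds Handler.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	// Illustrative values only -- not the actual deployment from this report.
	readiness := corev1.Probe{
		Handler: corev1.Handler{
			HTTPGet: &corev1.HTTPGetAction{
				Path: "/healthz",           // hypothetical readiness endpoint
				Port: intstr.FromInt(8080), // hypothetical container port
			},
		},
		InitialDelaySeconds: 30,  // wait this long after container start before probing
		PeriodSeconds:       900, // a "large" period, as in the scenario above
		TimeoutSeconds:      5,
		FailureThreshold:    3,
	}
	fmt.Printf("%+v\n", readiness)
}
```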

What you expected to happen:
See: Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds

I'd expect that the first call of the probe would be very soon after initialDelaySeconds elapses, but my experiments show that that's not the case.

How to reproduce it (as minimally and precisely as possible):
See: Odd timing behavior with Readiness Probes with initialDelaySeconds and periodSeconds

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
```
kubectl version
Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.8+IKS", GitCommit:"fe3e332c2b0f47d4572433c3b0a1687a27fb88c6", GitTreeState:"clean", BuildDate:"2019-07-11T13:45:03Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
```
@philip-fox philip-fox added the kind/bug Categorizes issue or PR as related to a bug. label Jul 22, 2019
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jul 22, 2019
@philip-fox
Author

/sig node

@k8s-ci-robot k8s-ci-robot added sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 22, 2019
@mattjmcnaughton
Contributor

Hi @philip-fox ! Thanks for reopening this.

Can you explain more about what you view as the desired behavior? I just want us to quickly think through if there are any downsides to the potential new behavior before going forward with an implementation.

@philip-fox
Author

Hi @mattjmcnaughton

Thanks for the reply. I suspect that when periodSeconds is large, the readiness probe isn't being called before the first periodSeconds interval elapses. I'd expect it to be called well before that.

I'm trying to run some experiments so that I can compile a table of timings which I'll add to this ticket very soon.

Maybe I'm wrong, but I think there's something weird going on, and from looking at the previous ticket, I don't think I'm alone.

@zouyee
Member

zouyee commented Jul 24, 2019

/assign

@philip-fox
Author

philip-fox commented Jul 24, 2019

The following three tables contain the timings witnessed for deployments of three different services. The column first_probe_completes shows how many seconds elapsed before the readiness probe first returned, and within_first_period? shows whether the probe returned within the first periodSeconds interval.
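(The sketch below shows one way such first-probe timings could be captured; it is not necessarily how the numbers below were measured, and the endpoint path and port are assumptions. The readiness handler simply logs the elapsed time since process start the first time the kubelet calls it.)

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

func main() {
	start := time.Now() // approximates container start when this process is the container's entrypoint
	var once sync.Once

	// Hypothetical readiness endpoint; the kubelet's first GET here marks
	// when the first readiness probe actually fired.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		once.Do(func() {
			log.Printf("first readiness probe after %.0fs", time.Since(start).Seconds())
		})
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}
```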

Service_1_Pod (timeoutSeconds=5)

| periodSeconds | initialDelaySeconds | first_probe_completes | within_first_period? |
|---------------|---------------------|-----------------------|----------------------|
| 120  | 15 | 86   | yes |
| 120  | 15 | 86   | yes |
| 180  | 15 | 60   | yes |
| 180  | 15 | 85   | yes |
| 180  | 15 | 198  | no  |
| 180  | 15 | 55   | yes |
| 240  | 15 | 180  | yes |
| 240  | 15 | 159  | yes |
| 240  | 15 | 139  | yes |
| 300  | 15 | 143  | yes |
| 300  | 15 | 186  | yes |
| 300  | 15 | 173  | yes |
| 300  | 15 | 102  | yes |
| 480  | 15 | 344  | yes |
| 480  | 30 | 342  | yes |
| 600  | 30 | 89   | yes |
| 600  | 30 | 300  | yes |
| 900  | 30 | 251  | yes |
| 900  | 30 | 548  | yes |
| 900  | 30 | 840  | yes |
| 1800 | 30 | 972  | yes |
| 1800 | 30 | 1122 | yes |
| 1800 | 30 | 283  | yes |
| 1800 | 30 | 1714 | yes |

Service_2_Pod (timeoutSeconds=5)

| periodSeconds | initialDelaySeconds | first_probe_completes | within_first_period? |
|---------------|---------------------|-----------------------|----------------------|
| 120  | 10 | 43   | yes |
| 120  | 10 | 69   | yes |
| 120  | 10 | 60   | yes |
| 120  | 10 | 55   | yes |
| 240  | 10 | 240  | no  |
| 240  | 10 | 221  | yes |
| 300  | 10 | 43   | yes |
| 300  | 10 | 52   | yes |
| 300  | 10 | 135  | yes |
| 300  | 10 | 25   | yes |
| 480  | 10 | 49   | yes |
| 480  | 10 | 125  | yes |
| 600  | 10 | 244  | yes |
| 600  | 10 | 347  | yes |
| 900  | 10 | 786  | yes |
| 900  | 10 | 563  | yes |
| 900  | 10 | 603  | yes |
| 1800 | 10 | 248  | yes |
| 1800 | 10 | 1146 | yes |
| 1800 | 10 | 1452 | yes |
| 1800 | 10 | 421  | yes |

Service_3_Pod (timeoutSeconds=5)

| periodSeconds | initialDelaySeconds | first_probe_completes | within_first_period? |
|---------------|---------------------|-----------------------|----------------------|
| 240  | 30 | 205  | yes |
| 240  | 30 | 180  | yes |
| 300  | 30 | 261  | yes |
| 300  | 30 | 315  | no  |
| 300  | 30 | 143  | yes |
| 300  | 30 | 176  | yes |
| 480  | 30 | 112  | yes |
| 480  | 30 | 449  | yes |
| 600  | 30 | 361  | yes |
| 600  | 30 | 230  | yes |
| 900  | 30 | 693  | yes |
| 900  | 30 | 186  | yes |
| 900  | 30 | 152  | yes |
| 1800 | 30 | 1644 | yes |
| 1800 | 30 | 855  | yes |
| 1800 | 30 | 1285 | yes |
| 1800 | 30 | 801  | yes |

So two things seem clear from looking at those timings:

  1. My initial suspicion, that the first call of the probe only comes after the first periodSeconds interval elapses when periodSeconds is relatively large, appears to be invalid (see within_first_period? in the tables above).
  2. I would have expected the first call to the readiness probe to return very soon after initialDelaySeconds elapses, i.e. at most just after
     `initialDelaySeconds` + (`timeoutSeconds` * `failureThreshold`),
     but this usually isn't the case, as witnessed. My understanding of timeoutSeconds and failureThreshold might be a little suspect, but I'd at least expect the probe to return shortly after its initialDelaySeconds elapses once the pod is deployed. (One hypothesis that would explain the spread is sketched after this list.)
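One hypothesis that would explain the spread in the tables, though nothing in this thread confirms it: if the kubelet waited an additional random fraction of periodSeconds before running the first probe, on top of initialDelaySeconds, then first-probe times would land roughly uniformly in [initialDelaySeconds, initialDelaySeconds + periodSeconds], which is consistent with all three tables. Below is a minimal stdlib-only simulation of that hypothesis; simulateFirstProbe is a hypothetical helper modelling the observed behavior, not kubelet code.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// simulateFirstProbe returns when the first probe would fire under the
// hypothesis that a random startup jitter of up to periodSeconds is added
// on top of initialDelaySeconds.
func simulateFirstProbe(initialDelay, period time.Duration) time.Duration {
	jitter := time.Duration(rand.Float64() * float64(period))
	return initialDelay + jitter
}

func main() {
	rand.Seed(time.Now().UnixNano())
	initialDelay := 30 * time.Second
	period := 900 * time.Second

	for i := 0; i < 5; i++ {
		t := simulateFirstProbe(initialDelay, period)
		fmt.Printf("simulated first probe at %.0fs\n", t.Seconds())
	}
}
```

Under this model, with periodSeconds=900 and initialDelaySeconds=30, a first probe anywhere between roughly 30s and 930s would be unsurprising, which matches the 251s, 548s, and 840s rows in the first table.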

@philip-fox
Author

@zouyee
Hi Zou Nengren. Do you plan to progress this issue?

@zouyee zouyee removed their assignment Aug 22, 2019
@zouyee
Member

zouyee commented Aug 22, 2019

I don't have time for a while; if you have plans to solve it, please carry on.

@philip-fox
Author

@mattjmcnaughton
Hey Matt. Can this be assigned to someone who can take a look at it?

@mattjmcnaughton
Contributor

Hi @philip-fox ! It might be useful to see if anyone in the #sig-node slack channel has the interest/capacity to pick it up?

@alanorwick

alanorwick commented Nov 5, 2019

Hi! I'm a CS student in university. @ryanarifin134 and I are interested in resolving this issue for a virtualization class assignment. Can we have this assigned to us? @mattjmcnaughton

@philip-fox
Author

@alanorwick Sure, that would be great, thanks!

@ryanarifin134

/assign

@alanorwick

alanorwick commented Dec 1, 2019

We have submitted a pull request regarding this issue. @philip-fox do you think our changes are reasonable?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 30, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@samgoncalves90

This bug remains in version 1.17; can it be reopened? See 85769.

@ryanarifin134 ryanarifin134 removed their assignment Jun 16, 2020
@oaicstef

oaicstef commented Oct 9, 2020

Hi, I'm having the same issue. At first I thought it was caused by a misconfiguration on my side, but then I saw that the behavior doesn't match what the documentation describes.
This is the second "closed" issue I've found about the same problem; is anyone trying to solve it? It seems like a real issue to me: in my pipelines a pod takes around 3 to 5 minutes to become ready, and that is not nice behavior.

Please, can anyone help on the topic?
