
[Probes] Incorrect application of the initialDelaySeconds #96614

Open
Kanshiroron opened this issue Nov 16, 2020 · 32 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@Kanshiroron

Hello,

We are experiencing an issue with pod probes: they are not run in sync with the configuration, which suggests that the initialDelaySeconds is not correctly applied (or that some other random delay is introduced).

We have a deployment that contains a livenessProbe and a readinessProbe configured as follows:

        livenessProbe:
          exec:
            command: [ "/opt/prime/bin/prime_health_checker", "control_status" ]
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          exec:
            command: [ "/opt/prime/bin/prime_health_checker", "rabbit_riak_connectivity", "log_blocking_fix" ]
          initialDelaySeconds: 15
          periodSeconds: 30
          timeoutSeconds: 10

According to the configuration, the two probes should run 15 seconds apart from one another, but this is not the case. The actual time difference is variable and random for each pod and replica.
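For reference, a naive reading of the spec (each probe first fires at initialDelaySeconds after container start, then every periodSeconds, with no jitter) would keep the two probes exactly 15 seconds apart. A minimal sketch of that expectation; the helper name is illustrative, not actual kubelet code:

```go
package main

import "fmt"

// expectedProbeTimes returns the first n expected probe times (in seconds
// since container start) for a probe with the given initialDelaySeconds and
// periodSeconds, per a naive reading of the spec (no jitter).
func expectedProbeTimes(initialDelay, period, n int) []int {
	times := make([]int, 0, n)
	for i := 0; i < n; i++ {
		times = append(times, initialDelay+i*period)
	}
	return times
}

func main() {
	fmt.Println("liveness: ", expectedProbeTimes(30, 30, 3)) // [30 60 90]
	fmt.Println("readiness:", expectedProbeTimes(15, 30, 3)) // [15 45 75]
}
```

Under that reading, every liveness/readiness pair would be 15 seconds apart on every replica, which is what the configuration above seems to promise.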

Here are some logs from Kubelet, showing only a 6-second delay between the two probes:

Nov 13 13:44:14 ip-10-213-59-169 kubelet[17547]: I1113 13:44:14.954625   17547 prober.go:162] Exec-Probe Pod: &Pod{ObjectMeta:{taskman-sandbox-0 taskman-sandbox- sandbox /api/v1/namespaces/sandbox/pods/taskman-sandbox-0 0615148c-1210-4cb5-850b-52992f9a7ac9 46134867 0 2020-11-11 04:51:13 +0000 UTC <nil> <nil> map[controller-revision-hash:taskman-sandbox-755dc74dd8 pypestream.com/app:taskman pypestream.com/component:taskman pypestream.com/env:sandbox pypestream.com/group:prime pypestream.com/namespace:sandbox spotinst.io/restrict-scale-down:true statefulset.kubernetes.io/pod-name:taskman-sandbox-0] map[cni.projectcalico.org/podIP:100.119.77.208/32 cni.projectcalico.org/podIPs:100.119.77.208/32 fluentbit.io/parser:prime_log kubernetes.io/config.seen:2020-11-13T10:26:49.001091487Z kubernetes.io/config.source:api] [{apps/v1 StatefulSet taskman-sandbox bfb661c9-4ebc-4c6f-9c2f-98deaec09e56 0xc0005e7c16 0xc0005e7c17}] []  []},Spec:PodSpec{Volumes:[]Volume{Volume{Name:taskman-keys,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:taskman-keys,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},Volume{Name:default-token-rcwqc,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:default-token-rcwqc,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,Portw
orxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},},Containers:[]Container{Contai
Nov 13 13:44:14 ip-10-213-59-169 kubelet[17547]: ,Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-aws-s3-presentation-key,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-freshchat-token,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:&ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-env,},Optional:nil,},SecretRef:nil,},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-aws-keys,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-rabbitmq-user,},Optional:nil,},},},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-213-59-169.us-west-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:taskman-sandbox-0,Subdomain:taskman-sandbox,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableS
erviceLinks:*true,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]Topolo
Nov 13 13:44:17 ip-10-213-59-169 kubelet[17547]: I1113 13:44:17.678245   17547 prober.go:129] Liveness probe for "taskman-sandbox-0_sandbox(0615148c-1210-4cb5-850b-52992f9a7ac9):taskman-sandbox" succeeded
Nov 13 13:44:20 ip-10-213-59-169 kubelet[17547]: I1113 13:44:20.547366   17547 prober.go:162] Exec-Probe Pod: &Pod{ObjectMeta:{taskman-sandbox-0 taskman-sandbox- sandbox /api/v1/namespaces/sandbox/pods/taskman-sandbox-0 0615148c-1210-4cb5-850b-52992f9a7ac9 46134867 0 2020-11-11 04:51:13 +0000 UTC <nil> <nil> map[controller-revision-hash:taskman-sandbox-755dc74dd8 pypestream.com/app:taskman pypestream.com/component:taskman pypestream.com/env:sandbox pypestream.com/group:prime pypestream.com/namespace:sandbox spotinst.io/restrict-scale-down:true statefulset.kubernetes.io/pod-name:taskman-sandbox-0] map[cni.projectcalico.org/podIP:100.119.77.208/32 cni.projectcalico.org/podIPs:100.119.77.208/32 fluentbit.io/parser:prime_log kubernetes.io/config.seen:2020-11-13T10:26:49.001091487Z kubernetes.io/config.source:api] [{apps/v1 StatefulSet taskman-sandbox bfb661c9-4ebc-4c6f-9c2f-98deaec09e56 0xc0005e7c16 0xc0005e7c17}] []  []},Spec:PodSpec{Volumes:[]Volume{Volume{Name:taskman-keys,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:taskman-keys,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,PortworxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},Volume{Name:default-token-rcwqc,VolumeSource:VolumeSource{HostPath:nil,EmptyDir:nil,GCEPersistentDisk:nil,AWSElasticBlockStore:nil,GitRepo:nil,Secret:&SecretVolumeSource{SecretName:default-token-rcwqc,Items:[]KeyToPath{},DefaultMode:*420,Optional:nil,},NFS:nil,ISCSI:nil,Glusterfs:nil,PersistentVolumeClaim:nil,RBD:nil,FlexVolume:nil,Cinder:nil,CephFS:nil,Flocker:nil,DownwardAPI:nil,FC:nil,AzureFile:nil,ConfigMap:nil,VsphereVolume:nil,Quobyte:nil,AzureDisk:nil,PhotonPersistentDisk:nil,Portw
orxVolume:nil,ScaleIO:nil,Projected:nil,StorageOS:nil,CSI:nil,},},},Containers:[]Container{Contai
Nov 13 13:44:20 ip-10-213-59-169 kubelet[17547]: ,Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-aws-s3-presentation-key,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:common-freshchat-token,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:&ConfigMapEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-env,},Optional:nil,},SecretRef:nil,},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-aws-keys,},Optional:nil,},},EnvFromSource{Prefix:,ConfigMapRef:nil,SecretRef:&SecretEnvSource{LocalObjectReference:LocalObjectReference{Name:taskman-rabbitmq-user,},Optional:nil,},},},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,},},RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:ip-10-213-59-169.us-west-2.compute.internal,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[]Sysctl{},WindowsOptions:nil,},ImagePullSecrets:[]LocalObjectReference{},Hostname:taskman-sandbox-0,Subdomain:taskman-sandbox,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[]Container{},AutomountServiceAccountToken:nil,Tolerations:[]Toleration{Toleration{Key:node.kubernetes.io/not-ready,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},Toleration{Key:node.kubernetes.io/unreachable,Operator:Exists,Value:,Effect:NoExecute,TolerationSeconds:*300,},},HostAliases:[]HostAlias{},PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[]PodReadinessGate{},RuntimeClassName:nil,EnableS
erviceLinks:*true,PreemptionPolicy:nil,Overhead:ResourceList{},TopologySpreadConstraints:[]Topolo
Nov 13 13:44:23 ip-10-213-59-169 kubelet[17547]: I1113 13:44:23.377094   17547 prober.go:129] Readiness probe for "taskman-sandbox-0_sandbox(0615148c-1210-4cb5-850b-52992f9a7ac9):taskman-sandbox" succeeded

Here is what we noticed:

  • The time difference stays unchanged if the pod is restarted (by k8s, after some liveness probes fail).
  • If the pod is re-created, the timings change.
  • Timings are random and "unique" per replica.
  • This issue is not caused by time drift, as it is observable as soon as the pod is created.

Environment:

  • K8s 1.16.10
  • Kops 1.16.3
  • Debian Stretch
  • Linux 4.9.0-13-amd64 #1 SMP Debian 4.9.228-1 (2020-07-05) x86_64 GNU/Linux

Thanks for the help and support.

@Kanshiroron Kanshiroron added the kind/bug Categorizes issue or PR as related to a bug. label Nov 16, 2020
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 16, 2020
@Kanshiroron
Author

/sig Scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 16, 2020
@clarkmcc

clarkmcc commented Nov 21, 2020

I believe that this is expected behavior. I'll report my findings, and we'll see if someone who knows this better than I do can confirm. InitialDelaySeconds only guarantees that a probe never runs before the delay has passed. So at a minimum, you should be seeing that the probes don't occur before that initial delay.

if int32(time.Since(c.State.Running.StartedAt.Time).Seconds()) < w.spec.InitialDelaySeconds {
	return true
}

What you're probably seeing is a product of this:

func (w *worker) run() {
	probeTickerPeriod := time.Duration(w.spec.PeriodSeconds) * time.Second
	// If kubelet restarted the probes could be started in rapid succession.
	// Let the worker wait for a random portion of tickerPeriod before probing.
	time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod)))

which indicates that the actual first probe can occur anytime after InitialDelaySeconds and before InitialDelaySeconds + PeriodSeconds. Given the code comment, the idea behind this "random timing" is to avoid a flood of probes when a kubelet has just started/restarted; we're basically adding some "jitter" here.
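That window can be sketched as follows: the worker first sleeps a random fraction of periodSeconds, and any probe attempt made before initialDelaySeconds is a no-op until the next tick. This is a simplified model of the worker loop, not the actual kubelet code; the function name and the jitterFrac parameter are illustrative:

```go
package main

import "fmt"

// firstProbeTime models when the first real probe fires, per the worker
// logic quoted above: the worker first sleeps jitterFrac*period, and any
// probe attempt made before initialDelay is skipped, deferring to the next
// ticker fire. Illustrative simplification, not kubelet code.
func firstProbeTime(initialDelay, period, jitterFrac float64) float64 {
	t := jitterFrac * period // initial random sleep, then first attempt
	for t < initialDelay {   // attempts before the delay are no-ops
		t += period // wait for the next ticker fire
	}
	return t
}

func main() {
	// With initialDelaySeconds=15 and periodSeconds=30 (the readiness probe
	// above), the first probe can land anywhere in [15, 45).
	for _, f := range []float64{0.0, 0.5, 0.9} {
		fmt.Printf("jitter fraction %.1f -> first probe at %.0fs\n", f, firstProbeTime(15, 30, f))
	}
}
```

Since the liveness and readiness workers each draw their own jitter, the gap between the two probes is effectively random per pod, which matches the report.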

Again, I didn't write this, but based on your description and what I'm seeing in the code, I believe what you're seeing is expected behavior.

Also, I believe this falls under sig/node.

@clarkmcc

/sig Node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Nov 21, 2020
@Cweiping

I agree with @clarkmcc .

The InitialDelaySeconds of a probe is measured from the container's start-up time; it does not guarantee that the readiness and liveness probes run InitialDelaySeconds apart from each other.

if int32(time.Since(c.State.Running.StartedAt.Time).Seconds()) < w.spec.InitialDelaySeconds {
	return true
}

@Kanshiroron
Author

@clarkmcc Thanks for the answer. I understand the underlying idea of not flooding kubelet with probes when it starts/restarts.
But I think it's quite important to keep the same time difference between the two probes, in case something needs to be locked during the execution of the probe (which is our case).
-> Would it be possible to apply the same random value to both probes, then?

@clarkmcc
Copy link

@Kanshiroron it wouldn't be random if the same value were applied to all probes. Kubernetes wasn't designed to get into the specifics of your application's life cycle. If all instances of your application require coordination with each other, that's a problem that should be solved in the application layer, not the infrastructure layer.

@Kanshiroron
Author

Kanshiroron commented Feb 22, 2021

@clarkmcc That's not what I meant; sorry if that wasn't clear enough.
What I suggested is applying the same random value to both the readiness and the liveness probes of the same pod. The reason for this request is that in some cases those probes hold a shared lock, so if the added random values cause the two probes to run at the same time, some health checks fail.

@matthyx
Contributor

matthyx commented May 21, 2021

/assign

@mengjiao-liu
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 24, 2021
@n4j n4j added this to Triaged in SIG Node Bugs Jul 9, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

SIG Node Bugs automation moved this from Triaged to Done Nov 21, 2021
@Kanshiroron
Author

/reopen

@k8s-ci-robot
Contributor

@Kanshiroron: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Nov 21, 2021
SIG Node Bugs automation moved this from Done to Triage Nov 21, 2021
@SergeyKanzhelev
Member

SergeyKanzhelev commented Nov 24, 2021

Since we have a PR open #102064, removing rotten:

/remove-lifecycle rotten

@Kanshiroron do you have any means to check if PR addresses the bug?

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Nov 24, 2021
@SergeyKanzhelev SergeyKanzhelev moved this from Triage to Triaged in SIG Node Bugs Nov 24, 2021
@Kanshiroron
Author

@SergeyKanzhelev sorry, I actually missed @matthyx's PR. Unfortunately I have no easy way to test this, but with some guidance I may be able to.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 22, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 24, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 25, 2022
@vaibhav2107
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 21, 2022
@qiutongs
Contributor

qiutongs commented Sep 4, 2022

Reporting a new finding in 1.21+.

The jitter is only added when periodSeconds is greater than the time elapsed since kubelet started. This means the initial invocation of doProbe() for all probes in a container will ALWAYS happen at the same time; initialDelaySeconds only acts as a mask that makes doProbe() return early as a no-op.

See https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/prober/worker.go#L137-L139

4870e64

Say a container has 3 probes with the same periodSeconds but different initialDelaySeconds. The 3 probes will eventually always execute within the same second.
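Under the assumptions above (no jitter sleep, every worker's ticker starting at container start), the alignment can be sketched as follows; probeFireTimes is an illustrative model, not actual kubelet code:

```go
package main

import "fmt"

// probeFireTimes models the post-1.21 behavior described above: when the
// jitter sleep is skipped, every worker's ticker fires at multiples of
// periodSeconds, and initialDelaySeconds only suppresses the early firings
// rather than shifting the schedule.
func probeFireTimes(initialDelay, period, horizon int) []int {
	var fires []int
	for t := period; t <= horizon; t += period { // ticker fires every period
		if t < initialDelay { // masked: doProbe returns early
			continue
		}
		fires = append(fires, t)
	}
	return fires
}

func main() {
	// Three probes with the same period but different initial delays all
	// end up firing at the same instants (multiples of the period).
	for _, d := range []int{10, 15, 30} {
		fmt.Printf("initialDelay=%2d -> %v\n", d, probeFireTimes(d, 30, 120)) // -> [30 60 90 120]
	}
}
```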

@qiutongs
Contributor

qiutongs commented Sep 4, 2022

I see 3 problems with the predictable probe invocation.

  1. initialDelaySeconds doesn't work as the spec says: "Number of seconds after the container has started before probes are initiated." Intuitively, this simply means the first invocation of the probe happens at containerStatuses.state.running.startedAt + initialDelaySeconds. The implementation is far from that; see #96614 (comment).

  2. The jitters given to the probes in the same container are different; see #96614 (comment). #102064 (synchronize readiness and startup workers on kubelet start) is trying to solve it.

  3. Since 1.21, the jitters are only added when kubelet has recently started/restarted; see #96614 (comment).

@matthyx
Contributor

matthyx commented Sep 6, 2022

@qiutongs thanks for your analysis, so then what do you suggest?

@rphillips
Member

rphillips commented Sep 6, 2022

Probes are started immediately, but return without running the probe if the pod's running start time < InitialDelaySeconds [source]. This is probably due to the event handling earlier in the function that detects whether the pod has terminated.

@qiutongs
Contributor

qiutongs commented Sep 14, 2022

So far, I believe the correct logic is:

func (w *worker) run() {
	podStatus := GetPodStatus()

	if shouldNotProbe(podStatus) {
		return
	}

	// This is the current jitter value
	jitter := rand(0, 1) * PeriodSeconds

	// podStatus.startTime is when the pod is acknowledged by Kubelet; based on my test,
	// this value remains the same across a Kubelet restart.
	// w.probeManager.start indicates when Kubelet started.
	//
	// If the pod was acknowledged after Kubelet started, this is a new pod: simply sleep initialDelaySeconds.
	// If Kubelet started after the pod was acknowledged, this is an existing pod and Kubelet just restarted: apply some jitter.
	if podStatus.startTime > w.probeManager.start {
		sleep(initialDelaySeconds)
	} else {
		sleep(jitter)
	}

	for doProbe() {
		...
	}
}

Notes:

  • Based on my test, PodStatus.startTime remains the same across a kubelet restart.
  • I will find time (likely in 2 weeks) to implement this with real code, as @matthyx suggested. In the meantime, please correct my pseudocode if you see problems.

@SergeyKanzhelev
Member

sleep(jitter)

what is the jitter value in this pseudocode?

if "pod is created after kubelet starts" {

Can you explain what that means? Do you mean the pod was created while kubelet was still running, as opposed to the worker being initialized after kubelet has restarted?

I think a good first step in improving this logic is to analyze its test coverage: which scenarios are covered, and which need to be added to ensure consistency between the docs and the probes. For example, is the jitter logic tested? Do we need a test that demonstrates how running all probes at once at start-up is a problem and that the jitter actually helps?

@qiutongs
Contributor

qiutongs commented Sep 15, 2022

what is the jitter value in this pseudocode?

Same as the current value in
https://github.com/kubernetes/kubernetes/blob/v1.21.0/pkg/kubelet/prober/worker.go#L138; this is the jitter I meant. The purpose of this jitter is to avoid a thundering-herd problem when kubelet restarts.

if "pod is created after kubelet starts" {

Can you explain what it means? Pod is created while kubelet was still running as oppose to worker is being initialized after kubelet has restarted?

No, sorry for the confusion.
I mean comparing PodStatus.startTime with the timestamp at which Kubelet started. For each pod, that tells us whether we are in (1) a Kubelet restart or (2) a new pod. See my update to the pseudocode in #96614 (comment).

I think the good first step in improving this logic is to analyze the test coverage for it

That is a good idea. I believe we don't have tests for this jitter today. Any recommendations for testing such random code?

@matthyx
Contributor

matthyx commented Sep 15, 2022

Do we need a test that demonstrates how running all probes at once at a start is a problem and jitter actually helps?

Testing a thundering-herd problem looks like a performance test to me... not easy.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 14, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 13, 2023
@vaibhav2107
Member

/remove-lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 1, 2023
Projects
Status: Backlog