the reason behind 20 minutes kubelet executor suicide timeout default ? #465

ravilr · 2015-08-30T21:05:43Z

@jdef
Observed in kubernetes v1.0.3 release.

Why is the default kubelet executor suicide timeout so large (20 minutes)? The kubelet executor seems to commit suicide after some period of "inactivity":

- no running tasks
- no task launch requests in the last X minutes
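
To make the two conditions concrete, here is a minimal sketch of such an idle timer. It is not the executor's actual code: the class and method names are hypothetical, and it assumes the idle clock restarts both on a launch request and when the last running task ends.

```python
from dataclasses import dataclass


@dataclass
class IdleWatchdog:
    """Hypothetical model of an executor idle ("suicide") timer."""

    timeout_s: float           # e.g. 20 * 60 for a 20-minute timeout
    running_tasks: int = 0
    idle_since_s: float = 0.0  # assumed: when the idle period started

    def on_launch(self, now_s: float) -> None:
        # A launch request counts as activity and adds a running task.
        self.running_tasks += 1
        self.idle_since_s = now_s

    def on_task_finished(self, now_s: float) -> None:
        # Assumption: the idle clock restarts when the last task ends.
        self.running_tasks = max(0, self.running_tasks - 1)
        if self.running_tasks == 0:
            self.idle_since_s = now_s

    def should_exit(self, now_s: float) -> bool:
        # Exit only if nothing is running and the idle period
        # has lasted at least the configured timeout.
        idle_for = now_s - self.idle_since_s
        return self.running_tasks == 0 and idle_for >= self.timeout_s
```

With a 20-minute timeout, an executor whose last task ends at t = 100 s would not exit before t = 1300 s in this model.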

What we are seeing is this: after an executor is asked to kill pod tasks, it kills them and enters the suicide timeout wait. Meanwhile, the kube node controller sees the pods running on that slave node as healthy, because the kubelet still seems to report node status during this wait. When the executor finally exits, the node controller's pod evictor kicks in (again, only after a default podEvictionTimeout of 5 mins!) and deletes the pod assignments to that slave node. The replication controller then recreates the pod, and the k8sm scheduler schedules the pod task.
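
As rough arithmetic, the two waits add up. The sketch below assumes they run strictly one after the other and ignores the time the replication controller and scheduler need, so it is a lower bound at best; the function name is made up for illustration.

```python
def recovery_delay_s(suicide_timeout_s: int, pod_eviction_timeout_s: int) -> int:
    """Rough lower bound on the delay before the pod is recreated.

    Assumes the eviction timeout only starts counting once the
    executor has exited, as described in the analysis above.
    """
    return suicide_timeout_s + pod_eviction_timeout_s


default_delay = recovery_delay_s(20 * 60, 5 * 60)  # 1500 s = 25 min
lowered_delay = recovery_delay_s(2 * 60, 5 * 60)   # 420 s = 7 min
```

So with the defaults a killed pod could stay unscheduled for roughly 25 minutes, versus roughly 7 minutes with a 2-minute suicide timeout.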

Can you please validate the above analysis? Are there any downsides to running with a lower executor timeout?
We are planning to run with an executor-suicide-timeout of 2 mins and bring the implicit reconcile-interval down to 60 secs.

jdef · 2015-08-31T00:18:11Z

thanks for the analysis! this part sounds like a bug in the scheduler:

> meanwhile, the kube node controller sees the pods running on that slave node as healthy

running with a lower value for the executor suicide timeout is probably fine. the default of 20m was selected arbitrarily.

implicit reconciliation running every 60s sounds like it won't scale very well with a large number of tasks in the cluster.

jdef · 2016-01-31T07:15:52Z

possibly fixed by #730 (and if so, fix is available in v0.7.2-v1.1.5) -- can you re-test?

ravilr changed the title ~~the reason behind 20 minute kubelet executor suicide timeout default ?~~ the reason behind 20 minutes kubelet executor suicide timeout default ? Aug 30, 2015

jdef added class/bug class/question labels Aug 31, 2015

jdef added priority/soon priority/P2 and removed priority/soon labels Jan 31, 2016

jdef mentioned this issue Feb 17, 2016

scheduler should take action when receiving TASK_LOST for REASON_SLAVE_REMOVED #789

Open
