This repository has been archived by the owner on Dec 5, 2017. It is now read-only.

the reason behind 20 minutes kubelet executor suicide timeout default ? #465

Open
ravilr opened this issue Aug 30, 2015 · 2 comments

Comments


ravilr commented Aug 30, 2015

@jdef
Observed in kubernetes v1.0.3 release.

Why is the default kubelet executor suicide timeout so large (20 minutes)? The kubelet executor seems to commit suicide after some period of "inactivity":

  • no running tasks
  • no task launch requests in the last X minutes
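The two inactivity conditions above can be sketched as a single check. This is only an illustration of the logic as described in this issue, not the actual k8sm executor code; the function name `should_suicide` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta


def should_suicide(running_tasks, last_launch_request, now, suicide_timeout):
    """Return True when the executor counts as inactive.

    Hypothetical helper: it mirrors the two conditions listed above,
    not the real executor implementation.
    """
    # Condition 1: any running task keeps the executor alive.
    if running_tasks > 0:
        return False
    # Condition 2: no task launch request within the timeout window.
    return now - last_launch_request >= suicide_timeout


if __name__ == "__main__":
    now = datetime(2015, 8, 30, 12, 0, 0)
    default_timeout = timedelta(minutes=20)  # default quoted in this issue
    # Idle for 21 minutes with no tasks: the executor would exit.
    print(should_suicide(0, now - timedelta(minutes=21), now, default_timeout))
    # Idle for only 5 minutes: the executor keeps waiting.
    print(should_suicide(0, now - timedelta(minutes=5), now, default_timeout))
```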

What we are seeing is that after an executor is asked to kill pod tasks, it kills them and then waits on the suicide timeout timer. Meanwhile, the kube node controller sees the pods running on that slave node as healthy, since the kubelet seems to keep reporting node status during this wait. When the executor finally exits, the node controller's pod evictor kicks in (again only after the default podEvictionTimeout of 5 minutes!) and deletes the pod assignments to that slave node. The replication controller then recreates the pod, and the k8sm scheduler schedules the pod task.

Can you please validate the above analysis? Are there any downsides to running with a lower executor timeout?
We are planning to run with an executor-suicide-timeout of 2 minutes and an implicit reconcile-interval lowered to 60 seconds.
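As a rough worked example of the delay described above, the two timeouts simply add up before the pod assignments are deleted. This sketch assumes the sequence is strictly "suicide timeout, then podEvictionTimeout" and ignores node-controller detection latency, replication controller reaction time, and scheduling time; the function name is hypothetical.

```python
from datetime import timedelta


def delay_before_eviction(suicide_timeout, pod_eviction_timeout):
    """Approximate wait between the pod tasks being killed and the pod
    assignments being deleted, assuming the two timeouts run back to back."""
    return suicide_timeout + pod_eviction_timeout


# Values quoted in this issue.
pod_eviction_timeout = timedelta(minutes=5)
default_delay = delay_before_eviction(timedelta(minutes=20), pod_eviction_timeout)
proposed_delay = delay_before_eviction(timedelta(minutes=2), pod_eviction_timeout)

print(default_delay)   # 0:25:00
print(proposed_delay)  # 0:07:00
```

Under these assumptions, lowering the suicide timeout to 2 minutes would cut the wait from about 25 minutes to about 7, with the podEvictionTimeout becoming the dominant term.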


jdef commented Aug 31, 2015

Thanks for the analysis! This part sounds like a bug in the scheduler:

"meanwhile, the kube node controller sees the pods running on that slave node as healthy"

Running with a lower value for the executor suicide timeout is probably fine. The default of 20m was selected arbitrarily.

Implicit reconciliation running every 60s sounds like it won't scale very well with a large number of tasks in the cluster.


jdef commented Jan 31, 2016

Possibly fixed by #730 (and if so, the fix is available in v0.7.2-v1.1.5). Can you re-test?
