This repository has been archived by the owner on Dec 5, 2017. It is now read-only.

scheduler should take action when receiving TASK_LOST for REASON_SLAVE_REMOVED #789

Open
jdef opened this issue Feb 16, 2016 · 6 comments


jdef commented Feb 16, 2016

The Mesos master has given up on the slave at this point, and if the slave process starts up again on the same node it will get a new slave ID. All prior tasks are recorded as LOST, so the scheduler should delete the related pods so that they can be rescheduled.
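A minimal sketch of the intended handling, assuming a mesos-go v0 `TaskStatus` and hypothetical `podForTask`/`deletePod` callbacks (none of these names are taken from the actual kubernetes-mesos scheduler):

```go
package scheduler

import (
	"log"

	mesos "github.com/mesos/mesos-go/mesosproto" // assumed 2016-era v0 import path
)

// handleLostSlaveTask is illustrative only: when the master reports
// TASK_LOST with REASON_SLAVE_REMOVED, the slave will never resume the
// task, so the backing pod is deleted and can be rescheduled elsewhere.
func handleLostSlaveTask(status *mesos.TaskStatus,
	podForTask func(taskID string) (ns, name string, ok bool),
	deletePod func(ns, name string) error) {

	if status.GetState() != mesos.TaskState_TASK_LOST ||
		status.GetReason() != mesos.TaskStatus_REASON_SLAVE_REMOVED {
		return
	}
	taskID := status.GetTaskId().GetValue()
	ns, name, ok := podForTask(taskID)
	if !ok {
		return // task not tracked by this scheduler
	}
	if err := deletePod(ns, name); err != nil {
		log.Printf("failed to delete pod %s/%s for lost task %s: %v", ns, name, taskID, err)
	}
}
```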


jdef commented Feb 16, 2016

Found while debugging #778.


jdef commented Feb 16, 2016

The scheduler should also delete any mirror pods associated with the slave.
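A sketch of what that could look like, assuming a helper that lists the pods bound to the slave's node. The mirror-pod annotation key `kubernetes.io/config.mirror` is real; the function names and the 1.2-era `pkg/api` import are illustrative, not the project's actual code:

```go
package scheduler

import "k8s.io/kubernetes/pkg/api" // assumed 1.2-era import path

const mirrorPodAnnotation = "kubernetes.io/config.mirror"

// deleteMirrorPods is illustrative only: it removes the API-server
// reflections of static pods that were bound to the removed slave's node,
// since nothing else will clean them up once the slave is gone.
func deleteMirrorPods(nodeName string,
	podsOnNode func(node string) ([]api.Pod, error),
	deletePod func(ns, name string) error) error {

	pods, err := podsOnNode(nodeName)
	if err != nil {
		return err
	}
	for _, p := range pods {
		if _, isMirror := p.Annotations[mirrorPodAnnotation]; isMirror {
			if err := deletePod(p.Namespace, p.Name); err != nil {
				return err
			}
		}
	}
	return nil
}
```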


jdef commented Feb 17, 2016

fix: kubernetes/kubernetes#21366

jdef added the tracking label Feb 17, 2016

jdef commented Feb 17, 2016

I've tried deleting the mirror pods, and they do go away, but if:

  1. the slave process dies --> mesos fires slave-lost
  2. the slave stays dead
  3. the scheduler pod GC deletes all pods associated with {host, slaveID}
  4. there are static pods running on the kubelet

then the kubelet-executor will stay running until the suicide timeout. This argues in favor of a lower value for the default suicide-timeout threshold (#465); a rough sketch of that mechanism follows.
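To make the trade-off concrete, here is an illustrative sketch of a suicide-timeout mechanism like the one being argued about. The names and the 20-minute value are placeholders, not the actual kubernetes-mesos defaults:

```go
package executor

import (
	"os"
	"time"
)

// defaultSuicideTimeout stands in for the threshold debated in #465;
// the real default in kubernetes-mesos may differ.
const defaultSuicideTimeout = 20 * time.Minute

type suicideWatch struct {
	timer *time.Timer
}

// arm starts (or restarts) the countdown. Called when the executor detects
// it has lost its slave and will not be reconnected.
func (s *suicideWatch) arm(timeout time.Duration) {
	if s.timer != nil {
		s.timer.Stop()
	}
	s.timer = time.AfterFunc(timeout, func() {
		// No re-registration happened in time: terminate, letting the
		// static pods die with the executor so they can be cleaned up.
		os.Exit(0)
	})
}

// disarm cancels the countdown, e.g. after a successful re-registration.
func (s *suicideWatch) disarm() {
	if s.timer != nil {
		s.timer.Stop()
		s.timer = nil
	}
}
```

Until that timer fires, static pods keep the executor alive on the dead slave, which is why a long default timeout delays cleanup.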


jdef commented Feb 17, 2016

FWIW the apiserver has this status about the node (I'm wondering if the replication controller's pod GC is supposed to kick in here?):

```yaml
status:
  addresses:
  - address: 10.2.0.6
    type: LegacyHostIP
  - address: 10.2.0.6
    type: InternalIP
  allocatable:
    cpu: "2"
    memory: 3745Mi
    pods: "40"
  capacity:
    cpu: "2"
    memory: 3745Mi
    pods: "40"
  conditions:
  - lastHeartbeatTime: 2016-02-17T03:48:23Z
    lastTransitionTime: 2016-02-17T03:44:02Z
    message: mesos reports ready status
    reason: SlaveReady
    status: "True"
    type: Ready
  - lastHeartbeatTime: 2016-02-17T01:39:40Z
    lastTransitionTime: 2016-02-17T03:44:01Z
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: OutOfDisk
```


jdef commented Feb 19, 2016

We need to track additional Mesos state as a node condition, but upstream doesn't provide a mechanism for injecting such a condition. I filed a PR to add one: kubernetes/kubernetes#21521

  • waiting for merge of the above
  • MERGED on 2016-02-24
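Once such a mechanism exists, the scheduler could publish something along these lines. The condition type `MesosSlaveRemoved`, the wording, and the 1.2-era import paths are made up for illustration:

```go
package node

import (
	"k8s.io/kubernetes/pkg/api" // assumed 1.2-era import paths
	"k8s.io/kubernetes/pkg/api/unversioned"
)

// slaveRemovedCondition builds a custom node condition reflecting Mesos
// state; the type name is hypothetical.
func slaveRemovedCondition(now unversioned.Time) api.NodeCondition {
	return api.NodeCondition{
		Type:               api.NodeConditionType("MesosSlaveRemoved"),
		Status:             api.ConditionTrue,
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
		Reason:             "SlaveRemoved",
		Message:            "mesos master reports the slave as removed",
	}
}
```

A caller would build the condition with `slaveRemovedCondition(unversioned.Now())` and append it to the node's `Status.Conditions` when posting node status.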
