This repository has been archived by the owner on Dec 5, 2017. It is now read-only.

scheduler should take action when receiving TASK_LOST for REASON_SLAVE_REMOVED #789

Open
jdef opened this issue Feb 16, 2016 · 6 comments


jdef commented Feb 16, 2016

The Mesos master has given up on the slave at this point, and if the slave process starts up again on the same node it will get a new slave ID. All prior tasks are recorded as LOST, so the scheduler should delete the related pods so that they can be rescheduled.
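A minimal sketch of the intended handling, assuming a mesos-go v0 `TaskStatus` and hypothetical `podForTask`/`deletePod` callbacks (none of these names are taken from the actual kubernetes-mesos scheduler):

```go
package scheduler

import (
	"log"

	mesos "github.com/mesos/mesos-go/mesosproto" // assumed 2016-era v0 import path
)

// handleLostSlaveTask is illustrative only: when the master reports
// TASK_LOST with REASON_SLAVE_REMOVED, the slave will never resume the
// task, so the backing pod is deleted and can be rescheduled elsewhere.
func handleLostSlaveTask(status *mesos.TaskStatus,
	podForTask func(taskID string) (ns, name string, ok bool),
	deletePod func(ns, name string) error) {

	if status.GetState() != mesos.TaskState_TASK_LOST ||
		status.GetReason() != mesos.TaskStatus_REASON_SLAVE_REMOVED {
		return
	}
	taskID := status.GetTaskId().GetValue()
	ns, name, ok := podForTask(taskID)
	if !ok {
		return // task not tracked by this scheduler
	}
	if err := deletePod(ns, name); err != nil {
		log.Printf("failed to delete pod %s/%s for lost task %s: %v", ns, name, taskID, err)
	}
}
```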


jdef commented Feb 16, 2016

Found while debugging #778.


jdef commented Feb 16, 2016

The scheduler should also delete any mirror pods associated with the slave.
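A sketch of what that could look like, assuming a helper that lists the pods bound to the slave's node. The mirror-pod annotation key `kubernetes.io/config.mirror` is real; the function names and the 1.2-era `pkg/api` import are illustrative, not the project's actual code:

```go
package scheduler

import "k8s.io/kubernetes/pkg/api" // assumed 1.2-era import path

const mirrorPodAnnotation = "kubernetes.io/config.mirror"

// deleteMirrorPods is illustrative only: it removes the API-server
// reflections of static pods that were bound to the removed slave's node,
// since nothing else will clean them up once the slave is gone.
func deleteMirrorPods(nodeName string,
	podsOnNode func(node string) ([]api.Pod, error),
	deletePod func(ns, name string) error) error {

	pods, err := podsOnNode(nodeName)
	if err != nil {
		return err
	}
	for _, p := range pods {
		if _, isMirror := p.Annotations[mirrorPodAnnotation]; isMirror {
			if err := deletePod(p.Namespace, p.Name); err != nil {
				return err
			}
		}
	}
	return nil
}
```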


jdef commented Feb 17, 2016

fix: kubernetes/kubernetes#21366

jdef added the tracking label Feb 17, 2016

jdef commented Feb 17, 2016

I've tried deleting the mirror pods, and they do go away, but if:

  1. the slave process dies --> mesos fires slave-lost
  2. the slave stays dead
  3. the scheduler pod GC deletes all pods associated with {host, slaveID}
  4. there are static pods running on the kubelet

then the kubelet-executor will stay running until the suicide timeout. This argues in favor of a lower value for the default suicide-timeout threshold (#465); a rough sketch of that mechanism follows.
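To make the trade-off concrete, here is an illustrative sketch of a suicide-timeout mechanism like the one being argued about. The names and the 20-minute value are placeholders, not the actual kubernetes-mesos defaults:

```go
package executor

import (
	"os"
	"time"
)

// defaultSuicideTimeout stands in for the threshold debated in #465;
// the real default in kubernetes-mesos may differ.
const defaultSuicideTimeout = 20 * time.Minute

type suicideWatch struct {
	timer *time.Timer
}

// arm starts (or restarts) the countdown. Called when the executor detects
// it has lost its slave and will not be reconnected.
func (s *suicideWatch) arm(timeout time.Duration) {
	if s.timer != nil {
		s.timer.Stop()
	}
	s.timer = time.AfterFunc(timeout, func() {
		// No re-registration happened in time: terminate, letting the
		// static pods die with the executor so they can be cleaned up.
		os.Exit(0)
	})
}

// disarm cancels the countdown, e.g. after a successful re-registration.
func (s *suicideWatch) disarm() {
	if s.timer != nil {
		s.timer.Stop()
		s.timer = nil
	}
}
```

Until that timer fires, static pods keep the executor alive on the dead slave, which is why a long default timeout delays cleanup.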


jdef commented Feb 17, 2016

FWIW the apiserver has this status about the node (I'm wondering if the replication controller's pod GC is supposed to kick in here?):

```yaml
status:
  addresses:
  - address: 10.2.0.6
    type: LegacyHostIP
  - address: 10.2.0.6
    type: InternalIP
  allocatable:
    cpu: "2"
    memory: 3745Mi
    pods: "40"
  capacity:
    cpu: "2"
    memory: 3745Mi
    pods: "40"
  conditions:
  - lastHeartbeatTime: 2016-02-17T03:48:23Z
    lastTransitionTime: 2016-02-17T03:44:02Z
    message: mesos reports ready status
    reason: SlaveReady
    status: "True"
    type: Ready
  - lastHeartbeatTime: 2016-02-17T01:39:40Z
    lastTransitionTime: 2016-02-17T03:44:01Z
    message: Kubelet stopped posting node status.
    reason: NodeStatusUnknown
    status: Unknown
    type: OutOfDisk
```


jdef commented Feb 19, 2016

We need to track additional Mesos state as a node condition, but upstream doesn't provide a mechanism for injecting such a condition. I filed a PR to add one: kubernetes/kubernetes#21521

  • waiting for merge of the above
  • MERGED on 2016-02-24
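Once such a mechanism exists, the scheduler could publish something along these lines. The condition type `MesosSlaveRemoved`, the wording, and the 1.2-era import paths are made up for illustration:

```go
package node

import (
	"k8s.io/kubernetes/pkg/api" // assumed 1.2-era import paths
	"k8s.io/kubernetes/pkg/api/unversioned"
)

// slaveRemovedCondition builds a custom node condition reflecting Mesos
// state; the type name is hypothetical.
func slaveRemovedCondition(now unversioned.Time) api.NodeCondition {
	return api.NodeCondition{
		Type:               api.NodeConditionType("MesosSlaveRemoved"),
		Status:             api.ConditionTrue,
		LastHeartbeatTime:  now,
		LastTransitionTime: now,
		Reason:             "SlaveRemoved",
		Message:            "mesos master reports the slave as removed",
	}
}
```

A caller would build the condition with `slaveRemovedCondition(unversioned.Now())` and append it to the node's `Status.Conditions` when posting node status.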
