Deployment gets stuck killing tasks that are lost #4039

Closed
gkleiman opened this issue Jun 28, 2016 · 10 comments

gkleiman commented Jun 28, 2016

During a deployment, Marathon should expunge lost tasks instead of trying to kill them.

Right now Marathon keeps trying to kill the lost tasks, resulting in a stuck deployment:

Jun 28 12:18:38 java[2129]: [2016-06-28 12:18:38,390] INFO Killing 40 instances from 40 (mesosphere.marathon.upgrade.AppStopActor:marathon-akka.actor.default-dispatcher-448)
Jun 28 12:18:38 java[2129]: [2016-06-28 12:18:38,392] INFO Received status update for task foo_bar.09e6dade-3d24-11e6-b927-867b4e0a3eb0: TASK_LOST (Reconciliation: Task is unknown) (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$2ddc3939:Thread-11449)
[...]
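
To make the suggested behavior concrete, here is a minimal Scala sketch using placeholder names (Task, expunge, kill), not Marathon's actual deployment code: tasks already known to be lost would simply be forgotten, and kill requests would only go out for tasks Mesos can still reach.

```scala
// Minimal sketch of the suggested behavior; all names are placeholders,
// this is not Marathon's actual deployment code.
object StopLostTasksSketch {
  sealed trait TaskState
  case object Running extends TaskState
  case object Lost    extends TaskState

  final case class Task(id: String, state: TaskState)

  // During a deployment step, expunge tasks that are already lost and only
  // issue kill requests for tasks Mesos can still reach.
  def stop(tasks: Seq[Task], expunge: String => Unit, kill: String => Unit): Unit = {
    val (lost, reachable) = tasks.partition(_.state == Lost)
    lost.foreach(t => expunge(t.id))   // forget locally, nothing left to kill
    reachable.foreach(t => kill(t.id)) // normal kill path
  }
}
```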
gkleiman added this to the 1.2 milestone Jun 28, 2016
@meichstedt (Contributor)

The StoppingBehavior would actually consider a TASK_LOST as finished. But if the task was lost prior to the delete request, PostToEventStreamStepImpl will not publish an event when it receives a TASK_LOST in response to the kill request, because the task state didn't change. As a result, the StoppingBehavior never receives an event and retries again and again.

(Note: we'll probably have similar issues when scaling down)
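
For illustration, a small self-contained sketch of that failure mode, with placeholder types rather than the real PostToEventStreamStepImpl / StoppingBehavior code: when the stored state is already TASK_LOST, the TASK_LOST received in response to the kill counts as "no change", so nothing is published and the waiting actor keeps retrying.

```scala
// Sketch of the described failure mode with placeholder types; not the
// actual PostToEventStreamStepImpl / StoppingBehavior implementation.
object LostTaskNoEventSketch {
  sealed trait State
  case object TaskRunning extends State
  case object TaskLost    extends State

  // Publish an event only if the incoming status differs from the stored one.
  def publishIfChanged(stored: State, incoming: State, publish: State => Unit): Unit =
    if (incoming != stored) publish(incoming)
    // else: "Status did not change" -- nothing is published, so an actor
    // waiting for a terminal event never sees one and keeps retrying the kill.

  def main(args: Array[String]): Unit = {
    // Task was already lost before the kill request was sent:
    publishIfChanged(TaskLost, TaskLost, s => println(s"event: $s")) // prints nothing
  }
}
```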

meichstedt added a commit that referenced this issue Jun 28, 2016
Note: lacking tests. The touched code is not very elegant and it now fetches tasks more often than needed. This should not be a permanent solution. TaskKillActor and AppStopActor are mostly the same btw.
@timcharper (Contributor)

I believe I'm seeing the same issue. We had a Mesos agent go down, but the tasks that were running on it would not clear out of Cassandra. Here's an excerpt from our logs showing that the task status update is being received by Marathon:

Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,271] INFO Synchronize tasks: 1 instances to kill (mesosphere.marathon.upgrade.TaskKillActor:marathon-akka.actor.default-dispatcher-43)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,271] INFO Killing 1 instances from 1 (mesosphere.marathon.upgrade.TaskKillActor:marathon-akka.actor.default-dispatcher-43)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,272] INFO Received status update for task app-dev.609b1508-4d05-11e6-ad40-3a3960b49770: TASK_LOST (Reconciliation: Task is unknown) (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$d29e084e:Thread-677)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Start processing MesosUpdate(LaunchedEphemeral(task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770],AgentInfo(mesos-11.domain.com,Some(9c43df5b-897a-4815-88c0-0f542a262fdc-S26),Vector()),2016-07-18T16:22:04.058Z,Status(2016-07-18T16:33:37.235Z,Some(2016-07-18T16:33:39.644Z),Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Slave mesos-11.domain.com removed: health check timed out"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: slave_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "9c43df5b-897a-4815-88c0-0f542a262fdc-S26"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860005823856E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_SLAVE_REMOVED
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),Vector(31693)),Lost(Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Reconciliation: Task is unknown"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860773271776E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_RECONCILIATION
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),2016-07-18T16:52:53.273Z) for app [/app-dev] and task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770]. 1 active, 0 queued. (mesosphere.marathon.core.task.tracker.impl.TaskUpdateActor:marathon-akka.actor.default-dispatcher-34)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Ignoring status update for task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770]. Status did not change. (mesosphere.marathon.core.task.Task$LaunchedEphemeral$:marathon-akka.actor.default-dispatcher-34)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Finished processing MesosUpdate(LaunchedEphemeral(task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770],AgentInfo(mesos-11.domain.com,Some(9c43df5b-897a-4815-88c0-0f542a262fdc-S26),Vector()),2016-07-18T16:22:04.058Z,Status(2016-07-18T16:33:37.235Z,Some(2016-07-18T16:33:39.644Z),Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Slave mesos-11.domain.com removed: health check timed out"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: slave_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "9c43df5b-897a-4815-88c0-0f542a262fdc-S26"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860005823856E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_SLAVE_REMOVED
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),Vector(31693)),Lost(Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Reconciliation: Task is unknown"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860773271776E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_RECONCILIATION
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),2016-07-18T16:52:53.273Z) for app [/app-dev] and task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770] (mesosphere.marathon.core.task.tracker.impl.TaskUpdateActor:marathon-akka.actor.default-dispatcher-34)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Finished processing MesosUpdate(LaunchedEphemeral(task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770],AgentInfo(mesos-11.domain.com,Some(9c43df5b-897a-4815-88c0-0f542a262fdc-S26),Vector()),2016-07-18T16:22:04.058Z,Status(2016-07-18T16:33:37.235Z,Some(2016-07-18T16:33:39.644Z),Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Slave mesos-11.domain.com removed: health check timed out"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: slave_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "9c43df5b-897a-4815-88c0-0f542a262fdc-S26"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860005823856E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_SLAVE_REMOVED
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),Vector(31693)),Lost(Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Reconciliation: Task is unknown"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860773271776E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_RECONCILIATION
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),2016-07-18T16:52:53.273Z) for app [/app-dev] and task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770] 0 active, 0 queued. (mesosphere.marathon.core.task.tracker.impl.TaskUpdateActor:marathon-akka.actor.default-dispatcher-34)

meichstedt added a commit that referenced this issue Jul 28, 2016
Fixes #4039 by adding a service that handles killing tasks, throttling kill requests, and retrying kills if no terminal status update was received within a certain amount of time. Replaces StoppingBehavior and the actors using it, and reduces all driver.kill calls to a single place inside the service.

Background

This will now kill LOST and UNREACHABLE tasks before running tasks.
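
As a rough illustration of that approach (placeholder names, limits, and timeouts; not the actual service added by the fix), such a kill service tracks in-flight kills, throttles how many are outstanding, and re-issues any kill for which no terminal status update has arrived within a timeout:

```scala
import scala.collection.mutable

// Rough sketch of the described kill service; names, limits and timeouts
// are placeholders, not the values used by the real implementation.
object KillServiceSketch {
  final case class PendingKill(taskId: String, lastAttemptMs: Long)

  val maxInFlight  = 100     // throttle: at most this many kills issued at once
  val retryAfterMs = 10000L  // re-issue a kill if no terminal update within 10s

  private val pending = mutable.Map.empty[String, PendingKill]

  // The single place that talks to the driver: register the task and, if we
  // are under the throttle limit, issue the kill right away; otherwise the
  // periodic retry pass below issues it once its timer elapses.
  def requestKill(taskId: String, now: Long, kill: String => Unit): Unit =
    if (!pending.contains(taskId)) {
      pending(taskId) = PendingKill(taskId, now)
      if (pending.size <= maxInFlight) kill(taskId)
    }

  // Called for every terminal status update (FINISHED, FAILED, KILLED, LOST, ...).
  def noteTerminal(taskId: String): Unit = pending.remove(taskId)

  // Called periodically: re-issue kills that have not completed in time.
  def retryOverdue(now: Long, kill: String => Unit): Unit =
    pending.values.toList
      .filter(p => now - p.lastAttemptMs >= retryAfterMs)
      .foreach { p =>
        pending(p.taskId) = p.copy(lastAttemptMs = now)
        kill(p.taskId)
      }
}
```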
aquamatthias pushed a commit that referenced this issue Jul 28, 2016
unterstein pushed a commit that referenced this issue Jul 29, 2016

pavels commented Aug 8, 2016

Hi

any news on when this could be released and packaged? There is a use case where this is a major problem:

when the task has:

"upgradeStrategy": {
  "minimumHealthCapacity": 0,
  "maximumOverCapacity": 0
}

and a scale of 1, then after the node this task was running on is lost (because of a failure), the task gets stuck in deploying forever. It will never start again unless you set it to scale 2 (but if this state ever changes in the future and there end up being 2 actual instances of this task running, it is doomed again).

If you, for example, have a DB server running from a shared filesystem, so there must always be only one copy running, then a simple node failure will trigger a complete task failure and defeat all the HA.

@unterstein (Contributor)

Hi @pavels,

the fix for this issue is released in Marathon 1.2.0-RC8 and 1.3.0-RC2.

Kind regards
Johannes


pavels commented Aug 8, 2016

Hi

thanks for the reply

I didn't know if anybody actually reads comments on closed issues, so I filed one more. If it is a duplicate of this one, then I am sorry; please close it. #4190

@dmcwhorter

Is there any chance this fix can be backported into the 1.1.x series?
We're having a number of other issues trying to run the 1.3.x RCs, but we can't run the stable 1.1.x release due to this bug...


meichstedt commented Sep 6, 2016

@dmcwhorter I'll see how much effort it is to backport from 1.3/forward port from 0.15.7, but I think we can do that.

Edit: I mistook this for another issue. Backporting this is not going to happen, but I'll think about adding a band-aid for 1.1.

@unterstein (Contributor)

I think #4232 also discussed a backport of this or a similar behavior.

@meichstedt (Contributor)

@dmcwhorter does the approach described in #4232 work for you?

--task_lost_expunge_gc 180000 --task_lost_expunge_initial_delay 120000 --task_lost_expunge_interval 300000

will expunge tasks that have been lost for more than 3 minutes, run this TASK_LOST GC for the first time after 2 minutes, and then run it every 5 minutes.


dmcwhorter commented Sep 6, 2016

@meichstedt I will try that out and report back; thanks for pointing it out.

mesosphere locked and limited conversation to collaborators Mar 27, 2017