Deployment gets stuck killing tasks that are lost #4039

Closed
gkleiman opened this issue Jun 28, 2016 · 10 comments

gkleiman commented Jun 28, 2016

During a deployment, Marathon should expunge lost tasks instead of trying to kill them.

Right now Marathon keeps trying to kill the lost tasks, resulting in a stuck deployment:

Jun 28 12:18:38 java[2129]: [2016-06-28 12:18:38,390] INFO Killing 40 instances from 40 (mesosphere.marathon.upgrade.AppStopActor:marathon-akka.actor.default-dispatcher-448)
Jun 28 12:18:38 java[2129]: [2016-06-28 12:18:38,392] INFO Received status update for task foo_bar.09e6dade-3d24-11e6-b927-867b4e0a3eb0: TASK_LOST (Reconciliation: Task is unknown) (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$2ddc3939:Thread-11449)
[...]
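
To make the suggested behavior concrete, here is a minimal Scala sketch using placeholder names (Task, expunge, kill), not Marathon's actual deployment code: tasks already known to be lost would simply be forgotten, and kill requests would only go out for tasks Mesos can still reach.

```scala
// Minimal sketch of the suggested behavior; all names are placeholders,
// this is not Marathon's actual deployment code.
object StopLostTasksSketch {
  sealed trait TaskState
  case object Running extends TaskState
  case object Lost    extends TaskState

  final case class Task(id: String, state: TaskState)

  // During a deployment step, expunge tasks that are already lost and only
  // issue kill requests for tasks Mesos can still reach.
  def stop(tasks: Seq[Task], expunge: String => Unit, kill: String => Unit): Unit = {
    val (lost, reachable) = tasks.partition(_.state == Lost)
    lost.foreach(t => expunge(t.id))   // forget locally, nothing left to kill
    reachable.foreach(t => kill(t.id)) // normal kill path
  }
}
```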
gkleiman added this to the 1.2 milestone Jun 28, 2016
@meichstedt (Contributor)

The StoppingBehavior would actually consider a TASK_LOST as finished. But if the task was lost prior to the delete request, PostToEventStreamStepImpl will not publish an event when it receives a TASK_LOST in response to the kill request, because the task state didn't change. As a result, the StoppingBehavior never receives an event and retries again and again.

(Note: we'll probably have similar issues when scaling down)
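
For illustration, a small self-contained sketch of that failure mode, with placeholder types rather than the real PostToEventStreamStepImpl / StoppingBehavior code: when the stored state is already TASK_LOST, the TASK_LOST received in response to the kill counts as "no change", so nothing is published and the waiting actor keeps retrying.

```scala
// Sketch of the described failure mode with placeholder types; not the
// actual PostToEventStreamStepImpl / StoppingBehavior implementation.
object LostTaskNoEventSketch {
  sealed trait State
  case object TaskRunning extends State
  case object TaskLost    extends State

  // Publish an event only if the incoming status differs from the stored one.
  def publishIfChanged(stored: State, incoming: State, publish: State => Unit): Unit =
    if (incoming != stored) publish(incoming)
    // else: "Status did not change" -- nothing is published, so an actor
    // waiting for a terminal event never sees one and keeps retrying the kill.

  def main(args: Array[String]): Unit = {
    // Task was already lost before the kill request was sent:
    publishIfChanged(TaskLost, TaskLost, s => println(s"event: $s")) // prints nothing
  }
}
```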

meichstedt added a commit that referenced this issue Jun 28, 2016
Note: lacking tests. The touched code is not very elegant and it now fetches tasks more often than needed. This should not be a permanent solution. TaskKillActor and AppStopActor are mostly the same btw.
@timcharper (Contributor)

I believe I'm seeing the same issue. We had a Mesos agent go down, but the tasks that were running on it would not clear out of Cassandra. Here's an excerpt from our logs showing that the task status update is being received by Marathon:

Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,271] INFO Synchronize tasks: 1 instances to kill (mesosphere.marathon.upgrade.TaskKillActor:marathon-akka.actor.default-dispatcher-43)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,271] INFO Killing 1 instances from 1 (mesosphere.marathon.upgrade.TaskKillActor:marathon-akka.actor.default-dispatcher-43)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,272] INFO Received status update for task app-dev.609b1508-4d05-11e6-ad40-3a3960b49770: TASK_LOST (Reconciliation: Task is unknown) (mesosphere.marathon.MarathonScheduler$$EnhancerByGuice$$d29e084e:Thread-677)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Start processing MesosUpdate(LaunchedEphemeral(task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770],AgentInfo(mesos-11.domain.com,Some(9c43df5b-897a-4815-88c0-0f542a262fdc-S26),Vector()),2016-07-18T16:22:04.058Z,Status(2016-07-18T16:33:37.235Z,Some(2016-07-18T16:33:39.644Z),Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Slave mesos-11.domain.com removed: health check timed out"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: slave_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "9c43df5b-897a-4815-88c0-0f542a262fdc-S26"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860005823856E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_SLAVE_REMOVED
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),Vector(31693)),Lost(Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Reconciliation: Task is unknown"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860773271776E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_RECONCILIATION
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),2016-07-18T16:52:53.273Z) for app [/app-dev] and task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770]. 1 active, 0 queued. (mesosphere.marathon.core.task.tracker.impl.TaskUpdateActor:marathon-akka.actor.default-dispatcher-34)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Ignoring status update for task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770]. Status did not change. (mesosphere.marathon.core.task.Task$LaunchedEphemeral$:marathon-akka.actor.default-dispatcher-34)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Finished processing MesosUpdate(LaunchedEphemeral(task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770],AgentInfo(mesos-11.domain.com,Some(9c43df5b-897a-4815-88c0-0f542a262fdc-S26),Vector()),2016-07-18T16:22:04.058Z,Status(2016-07-18T16:33:37.235Z,Some(2016-07-18T16:33:39.644Z),Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Slave mesos-11.domain.com removed: health check timed out"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: slave_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "9c43df5b-897a-4815-88c0-0f542a262fdc-S26"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860005823856E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_SLAVE_REMOVED
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),Vector(31693)),Lost(Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Reconciliation: Task is unknown"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860773271776E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_RECONCILIATION
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),2016-07-18T16:52:53.273Z) for app [/app-dev] and task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770] (mesosphere.marathon.core.task.tracker.impl.TaskUpdateActor:marathon-akka.actor.default-dispatcher-34)
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: [2016-07-18 10:52:53,273] DEBUG Finished processing MesosUpdate(LaunchedEphemeral(task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770],AgentInfo(mesos-11.domain.com,Some(9c43df5b-897a-4815-88c0-0f542a262fdc-S26),Vector()),2016-07-18T16:22:04.058Z,Status(2016-07-18T16:33:37.235Z,Some(2016-07-18T16:33:39.644Z),Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Slave mesos-11.domain.com removed: health check timed out"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: slave_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "9c43df5b-897a-4815-88c0-0f542a262fdc-S26"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860005823856E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_SLAVE_REMOVED
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),Vector(31693)),Lost(Some(task_id {
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]:   value: "app-dev.609b1508-4d05-11e6-ad40-3a3960b49770"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: }
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: state: TASK_LOST
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: message: "Reconciliation: Task is unknown"
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: timestamp: 1.468860773271776E9
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: source: SOURCE_MASTER
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: reason: REASON_RECONCILIATION
Jul 18 10:52:53 mesos-1.domain.com marathon[67752]: )),2016-07-18T16:52:53.273Z) for app [/app-dev] and task [app-dev.609b1508-4d05-11e6-ad40-3a3960b49770] 0 active, 0 queued. (mesosphere.marathon.core.task.tracker.impl.TaskUpdateActor:marathon-akka.actor.default-dispatcher-34)

meichstedt added a commit that referenced this issue Jul 28, 2016
Fixes #4039 by adding a service that handles killing tasks, throttling kill requests, and retrying kills if no terminal status update was received within a certain amount of time. Replaces StoppingBehavior and the actors using it, and reduces all driver.kill calls to a single place inside the service.

Background

This will now kill LOST and UNREACHABLE tasks before running tasks.
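
As a rough illustration of that approach (placeholder names, limits, and timeouts; not the actual service added by the fix), such a kill service tracks in-flight kills, throttles how many are outstanding, and re-issues any kill for which no terminal status update has arrived within a timeout:

```scala
import scala.collection.mutable

// Rough sketch of the described kill service; names, limits and timeouts
// are placeholders, not the values used by the real implementation.
object KillServiceSketch {
  final case class PendingKill(taskId: String, lastAttemptMs: Long)

  val maxInFlight  = 100     // throttle: at most this many kills issued at once
  val retryAfterMs = 10000L  // re-issue a kill if no terminal update within 10s

  private val pending = mutable.Map.empty[String, PendingKill]

  // The single place that talks to the driver: register the task and, if we
  // are under the throttle limit, issue the kill right away; otherwise the
  // periodic retry pass below issues it once its timer elapses.
  def requestKill(taskId: String, now: Long, kill: String => Unit): Unit =
    if (!pending.contains(taskId)) {
      pending(taskId) = PendingKill(taskId, now)
      if (pending.size <= maxInFlight) kill(taskId)
    }

  // Called for every terminal status update (FINISHED, FAILED, KILLED, LOST, ...).
  def noteTerminal(taskId: String): Unit = pending.remove(taskId)

  // Called periodically: re-issue kills that have not completed in time.
  def retryOverdue(now: Long, kill: String => Unit): Unit =
    pending.values.toList
      .filter(p => now - p.lastAttemptMs >= retryAfterMs)
      .foreach { p =>
        pending(p.taskId) = p.copy(lastAttemptMs = now)
        kill(p.taskId)
      }
}
```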
aquamatthias pushed a commit that referenced this issue Jul 28, 2016
unterstein pushed a commit that referenced this issue Jul 29, 2016

pavels commented Aug 8, 2016

Hi

any news on when this could be released and packaged? There is a use case where this is a major problem:

when the task has:

"upgradeStrategy": {
  "minimumHealthCapacity": 0,
  "maximumOverCapacity": 0
}

and a scale of 1, then after the node this task was running on is lost (because of a failure), the task gets stuck in deploying forever. It will never start again unless you set it to scale 2 (but if this state ever changes in the future and there end up being 2 actual instances of this task running, it is doomed again).

If you, for example, have a DB server running from a shared filesystem, so there must always be only one copy running, then a simple node failure will trigger a complete task failure and defeat all the HA.

@unterstein (Contributor)

Hi @pavels,

the fix for this issue is released in Marathon 1.2.0-RC8 and 1.3.0-RC2.

Kind regards
Johannes


pavels commented Aug 8, 2016

Hi

thanks for the reply

I didn't know if anybody actually reads comments on closed issues, so I filed one more. If it is a duplicate of this one, then I am sorry; please close it. #4190

@dmcwhorter

Is there any chance this fix can be backported into the 1.1.x series?
We're having a number of other issues trying to run the 1.3.x RCs, but we can't run the stable 1.1.x release due to this bug...


meichstedt commented Sep 6, 2016

@dmcwhorter I'll see how much effort it is to backport from 1.3/forward port from 0.15.7, but I think we can do that.

Edit: I mistook this for another issue. Backporting this is not going to happen, but I'll think about adding a band-aid for 1.1.

@unterstein (Contributor)

I think #4232 also discussed a backport of this or a similar behavior.

@meichstedt (Contributor)

@dmcwhorter does the approach described in #4232 work for you?

--task_lost_expunge_gc 180000 --task_lost_expunge_initial_delay 120000 --task_lost_expunge_interval 300000

will expunge tasks that have been lost for more than 3 minutes, run this TASK_LOST GC for the first time after 2 minutes, and then run it every 5 minutes.


dmcwhorter commented Sep 6, 2016

@meichstedt I will try that out and report back; thanks for pointing it out.

mesosphere locked and limited conversation to collaborators Mar 27, 2017