Deployment gets stuck killing tasks that are lost #4039
During a deployment, Marathon should expunge lost tasks instead of trying to kill them. Right now Marathon keeps trying to kill the lost tasks, resulting in a stuck deployment.

Comments
The StoppingBehavior would actually consider a TASK_LOST as finished, but if the task was lost prior to the delete request, the … (Note: we'll probably have similar issues when scaling down)
Note: this is lacking tests. The touched code is not very elegant, and it now fetches tasks more often than needed. This should not be a permanent solution. TaskKillActor and AppStopActor are mostly the same, by the way.
I believe I'm seeing the same issue. We had a Mesos agent go down, but the tasks that were running on it would not clear out of Cassandra. An excerpt from our logs shows that the task status update is being received by Marathon.
Fixes #4039 by adding a service that handles killing tasks, throttling kill requests, and retrying kills if no terminal status update was received within a certain amount of time. Replaces StoppingBehavior and the actors using it, and reduces all driver.kill calls to exactly one call site inside the service. Background: this will now kill LOST and UNREACHABLE tasks before running tasks.
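To make the description above concrete, here is a minimal, self-contained sketch of the shape such a service could take. It is not Marathon's actual implementation; KillServiceSketch, driverKill, maxInFlight, retryAfter and every other name here are assumptions made up for illustration.

```scala
// Illustrative sketch only: one place that issues kill requests, throttles how
// many are in flight, and re-issues kills for which no terminal status update
// arrived within a timeout. Names and types are invented for the example and
// are not Marathon's real API.
import scala.collection.mutable
import scala.concurrent.duration._

final case class TaskId(value: String)

class KillServiceSketch(
    driverKill: TaskId => Unit,              // the one and only call site for driver.kill
    maxInFlight: Int = 10,                    // throttle: kills allowed in flight at once
    retryAfter: FiniteDuration = 30.seconds,  // retry if no terminal update within this time
    now: () => Long = () => System.nanoTime()) {

  private case class Pending(taskId: TaskId, lastKillAt: Long)

  private val queue    = mutable.Queue.empty[TaskId]
  private val inFlight = mutable.Map.empty[TaskId, Pending]

  /** Request that a task be killed (a real version would put LOST/UNREACHABLE tasks first). */
  def kill(taskId: TaskId): Unit = {
    queue.enqueue(taskId)
    issueKills()
  }

  /** Called when a terminal status update (FINISHED, KILLED, LOST, ...) is received. */
  def terminalUpdateReceived(taskId: TaskId): Unit = {
    inFlight.remove(taskId)
    issueKills()
  }

  /** Called periodically: re-issue kills that were never confirmed in time. */
  def checkRetries(): Unit = {
    val overdue = inFlight.values.toList.filter(p => (now() - p.lastKillAt).nanos >= retryAfter)
    overdue.foreach { p =>
      driverKill(p.taskId)
      inFlight.update(p.taskId, p.copy(lastKillAt = now()))
    }
  }

  private def issueKills(): Unit =
    while (inFlight.size < maxInFlight && queue.nonEmpty) {
      val taskId = queue.dequeue()
      driverKill(taskId)
      inFlight.update(taskId, Pending(taskId, now()))
    }
}
```

The point being illustrated is that driver.kill is issued from exactly one place, so the throttling and retry bookkeeping live next to it, and a kill that never receives a terminal update is simply re-issued instead of leaving the deployment waiting forever.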
Hi, any news on when this could be released and packaged? There is a use case for which this is a major problem: when the task has an "upgradeStrategy": { … } and a scale of 1, then after the node on which this task was running is lost (because of a failure), the task will get stuck deploying forever. It will never start again unless you set it to a scale of 2 (but if that state somehow changes in the future and there are two actual instances of this task running, it will be doomed again). If you, for example, have a DB server running from a shared filesystem, so there must always be only one copy running, then a simple node failure will trigger a complete task failure and defeat all the HA.
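For context, the scenario described above is roughly a single-instance app definition like the one below. This example is not taken from the issue; the id, cmd and values are illustrative, and the upgradeStrategy shown (minimumHealthCapacity 0, maximumOverCapacity 0) is the common way to tell Marathon that the old instance may be stopped first and that a second instance must never run in parallel:

```json
{
  "id": "/shared-fs-db",
  "instances": 1,
  "cmd": "run-db --data /mnt/shared",
  "upgradeStrategy": {
    "minimumHealthCapacity": 0,
    "maximumOverCapacity": 0
  }
}
```

With a configuration like this, a lost node leaves the app's only task in a lost state, and before the fix the deployment stayed stuck as described above.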
Hi @pavels, the fix for this issue is released in Marathon 1.2.0-RC8 and 1.3.0-RC2. Kind regards
Hi, thanks for the reply. I didn't know whether anybody was actually reading comments on closed issues, so I filed one more; if it is a duplicate of this one, then I am sorry, please close it. #4190
Is there any chance this fix can be backported into the 1.1.x series?
@dmcwhorter I'll see how much effort it is to backport from 1.3 or forward-port from 0.15.7, but I think we can do that. Edit: I mistook this for another issue; backporting this is not going to happen, but I'll think about adding a band-aid for 1.1.
I think #4232 also discussed a backport of this or a similar behavior.
@dmcwhorter does the approach described in #4232 work for you? It will expunge tasks that are lost for more than 3 minutes, initially run this TASK_LOST GC after 2 minutes, and then run it every 5 minutes.
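A sketch of what that looks like on the Marathon command line is below. The flag names are an assumption from memory and should be verified against your version's --help; the values are in milliseconds and simply restate the timings quoted above:

```sh
# Assumed flag names for Marathon's TASK_LOST GC (verify with `marathon --help`):
# expunge tasks lost for more than 3 minutes, run the first GC after 2 minutes,
# then repeat every 5 minutes.
marathon \
  --task_lost_expunge_gc 180000 \
  --task_lost_expunge_initial_delay 120000 \
  --task_lost_expunge_interval 300000
  # ...plus your usual --master/--zk and other options
```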
@meichstedt I will try that out and respond back, thanks for pointing it out.