
Parent issue: Marathon does not re-use reserved resources for which a lost task is associated #4137

Closed
6 tasks done
timcharper opened this issue Jul 25, 2016 · 21 comments

@timcharper
Contributor

timcharper commented Jul 25, 2016

This is a parent issue to aggregate the handful of sub-issues related to resident tasks.

(A checked item indicates it is merged to master. Please see #5206 for the status of the backport to 1.4.)

-- original --

I've recorded a video to show the problem:

http://screencast.com/t/Lkgdi6tIEG6

In effect, Mesos tells Marathon during reconciliation that a task was lost (this can happen for a variety of reasons; in this demonstrated occurrence it is lost because the mesos-slave ID is forcibly changed and a new ID comes up on the same mesos-slave IP address). Marathon then responds by reserving a new set of resources and a new persistent volume, and launching a new task.

The expected behavior is that Marathon should reuse the reserved resources, which it can't because it thinks a task is still running there (status.state == Unknown, judging by the protobuf hexdump in ZooKeeper). If it can't use the reserved resources because it thinks something might still be running, then it should not create additional persistent volumes; when push comes to shove, if it can't satisfy both the 0% over-capacity and 0% under-capacity thresholds, it should heed the 0% over-capacity limit.
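
For illustration, here is a minimal, hypothetical sketch (plain Scala, not Marathon's actual types or API) of the behavior I'd expect: prefer an existing reservation whose task is not confirmably running, and never launch past the configured over-capacity limit.

```scala
// Hypothetical sketch of the expected decision logic; names and types are illustrative only.
object ExpectedBehavior {
  sealed trait TaskState
  case object Running extends TaskState
  case object Lost    extends TaskState
  case object Unknown extends TaskState

  final case class Reservation(agentId: String, volumeId: String, taskState: TaskState)

  // Expected: reuse a reservation whose task is not confirmably running,
  // instead of reserving new resources and creating a new persistent volume.
  def reusableReservation(existing: Seq[Reservation]): Option[Reservation] =
    existing.find(_.taskState != Running)

  // Expected: with maximumOverCapacity = 0, never launch more instances than the target.
  def mayLaunchNew(activeOrStaged: Int, target: Int, maxOverCapacity: Double = 0.0): Boolean =
    activeOrStaged < math.ceil(target * (1.0 + maxOverCapacity)).toInt
}
```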

@timcharper
Contributor Author

timcharper commented Jul 25, 2016

I suspect #4118 may partially or wholly fix this issue, but the disregard for max-over-capacity is still a problem.

@meichstedt
Contributor

@timcharper thanks for reporting and providing the screencast.

tl;dr: top prio on our tech-debt list for 1.2

We're aware of this, and the next work item for me and @unterstein is to provide functionality for specifying/configuring task-lost behavior, both for tasks using persistent volumes and for normal tasks. #4118 will not fix this issue alone, but it contains necessary prerequisites for a clean implementation.

The underlying problem is that TASK_LOST is not very specific, and under certain circumstances such a task might come back running – that's why Marathon, as of now, does not always expunge these tasks from state. If a lost task is kept, Marathon will not consider it when scaling/ensuring the required instance count, which is OK for non-resident tasks but broken for tasks using persistent volumes.
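
A rough illustration (plain Scala, not Marathon code) of why keeping lost tasks in state while excluding them from the instance count breaks resident tasks:

```scala
// Illustrative only: simplified instance accounting, not Marathon's implementation.
object ScalingSketch {
  sealed trait Status
  case object Running extends Status
  case object Lost    extends Status // kept in state because the task might come back

  final case class Instance(status: Status, usesPersistentVolume: Boolean)

  def instancesToLaunch(target: Int, known: Seq[Instance]): Int = {
    // Lost instances are not counted toward the target...
    val counted = known.count(_.status == Running)
    // ...so replacements get scheduled. For non-resident tasks that is fine,
    // but for resident tasks each replacement reserves *new* resources and a
    // *new* persistent volume instead of reusing the ones tied to the lost task.
    math.max(0, target - counted)
  }
}
```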

@meichstedt meichstedt added this to the 1.2 milestone Jul 25, 2016
@meichstedt
Contributor

Just in case you're wondering: we're currently organizing part of our work in a closed tracker, which is why there hasn't been a GitHub issue for this. We'll change that in the short term.

@unterstein unterstein modified the milestones: 1.2, 1.3 Aug 24, 2016
@aquamatthias aquamatthias modified the milestones: 1.4, 1.3 Sep 5, 2016
@jasongilanfarr
Contributor

@timcharper is this still an issue or can we close?

@aquamatthias aquamatthias added ready and removed ready labels Dec 6, 2016
@timcharper
Contributor Author

@jasongilanfarr it's still an issue.

@timcharper
Contributor Author

I can reproduce the issue in 1.4.0-rc1 and will post a video documenting it.

@timcharper timcharper changed the title Persistent volume overallocation in 1.2.0RC5 Marathon does not re-use reserved resources for which a lost task is associated Dec 8, 2016
@timcharper
Contributor Author

timcharper commented Dec 8, 2016

From @meichstedt :

it might be enough to say: when we see a reservation for a known instance and are trying to launch instances for the related runSpec, consider launching that instance no matter its state

So, consider renaming reserved in object InstanceOpFactory -> case class Request { lazy val reserved: Seq[Instance] = instances.filter(_.isReserved) } to possiblyMatchingReservations or something like that, and have it include Unreachable and UnreachableInactive tasks.
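
A minimal sketch of that suggestion, with simplified stand-in types (the real InstanceOpFactory.Request and Instance differ):

```scala
// Simplified stand-ins; not the actual Marathon classes.
object InstanceOpFactorySketch {
  sealed trait Condition
  object Condition {
    case object Reserved            extends Condition
    case object Unreachable         extends Condition
    case object UnreachableInactive extends Condition
    case object Running             extends Condition
  }

  final case class Instance(id: String, condition: Condition, hasReservation: Boolean)

  final case class Request(instances: Seq[Instance]) {
    // Current behavior: only instances considered "reserved" are launch candidates.
    lazy val reserved: Seq[Instance] =
      instances.filter(_.condition == Condition.Reserved)

    // Suggested: also treat reservations belonging to unreachable instances as
    // candidates when launching instances for the related runSpec.
    lazy val possiblyMatchingReservations: Seq[Instance] = instances.filter { i =>
      i.hasReservation && (i.condition match {
        case Condition.Reserved | Condition.Unreachable | Condition.UnreachableInactive => true
        case _ => false
      })
    }
  }
}
```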

@timcharper
Contributor Author

Decided that this is not a release blocker, but it should be fixed soon after the blockers.

@timcharper timcharper self-assigned this Dec 8, 2016
@timcharper
Contributor Author

Can confirm that 1.4.0-rc1 is not re-using reservations once a task becomes LOST.

@timcharper
Contributor Author

I still need to verify whether this is enough, or whether I still need @meichstedt's patch.

@timcharper
Contributor Author

Found and proposed a solution for #5142 while working on this.

@timcharper
Contributor Author

Another bug found: #5155

@timcharper
Contributor Author

Found and fixed this: #5163

@timcharper
Contributor Author

Another one: #5165

@timcharper
Contributor Author

Cherry-picked and rebased @meichstedt's patch; it still doesn't work. Will look more tomorrow.

https://phabricator.mesosphere.com/D488 should be ready to land

timcharper pushed a commit that referenced this issue Feb 17, 2017
Summary:
Require disabled for resident tasks. Fixes #5163. Partially addresses
#4137

Test Plan:
Create a resident task. Make it get lost. Ensure that it
doesn't go inactive.

Reviewers: aquamatthias, jdef, meichstedt, jenkins

Reviewed By: aquamatthias, jdef, meichstedt, jenkins

Subscribers: jdef, marathon-team

Differential Revision: https://phabricator.mesosphere.com/D488
@timcharper
Contributor Author

Found another one: #5207. The solution proposed here will help at least give operators a manual way to recover lost tasks.

@timcharper
Contributor Author

With the fix for #5207, operators are at least given a valid work-around. The primary (only?) cause of this issue will go away with Mesos 1.2.0, slated for release in a few months, which will fix the issue in which agents are assigned a new agentId on host reboot, thereby allowing Mesos to officially declare a task as GONE (which we interpret as a terminal state, and which therefore prompts a re-launch).

Given the decreased severity with the other fixes, a valid work-around to get resident tasks running, and a planned fix in Mesos 1.2.0, I'm inclined to let this ticket be resolved by Mesos 1.2.0.

@timcharper
Contributor Author

The kill while unreachable approach was ultimately too complex.

We tried modifying reconciliation to reconcile with the agent ID, and this did not help.

We're going to monitor the offer stream, watch for reservations that belong to unreachable tasks, and map those into terminal Mesos status updates.
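
Roughly, the idea looks like this (illustrative Scala only; the actual offer-matching and status-update plumbing in Marathon is different):

```scala
// Illustrative sketch: "reservation seen in an offer while the task is unreachable
// => synthesize a terminal update"; not Marathon's actual types.
object OfferReconciliationSketch {
  final case class Offer(agentId: String, reservedTaskIds: Set[String])
  final case class KnownInstance(taskId: String, unreachable: Boolean)

  // If Mesos offers us resources that are still reserved for a task we consider
  // unreachable, the task is no longer using them; emit a terminal update for it
  // so the instance is expunged and the reservation can be reused.
  def taskIdsToMarkTerminal(offer: Offer, known: Map[String, KnownInstance]): Seq[String] =
    offer.reservedTaskIds.toSeq.filter(id => known.get(id).exists(_.unreachable))
}
```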

@timcharper timcharper changed the title Marathon does not re-use reserved resources for which a lost task is associated Parent issue: Marathon does not re-use reserved resources for which a lost task is associated Mar 1, 2017
@meichstedt
Contributor

Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-1713. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.

@mesosphere mesosphere locked and limited conversation to collaborators Mar 27, 2017