Running job deadlock #32
Update: Killing off a few of the task trackers one by one (resulting in TASK_LOST) is sufficient to kick the Hadoop scheduler back into action, and it will then start launching some reduce slots.
There are two parameters that exist for dealing with this case:

```xml
<property>
  <name>mapred.mesos.total.map.slots.minimum</name>
  <value>500</value>
</property>
<property>
  <name>mapred.mesos.total.reduce.slots.minimum</name>
  <value>750</value>
</property>
```

At least, this is how I've dealt with it.
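As an illustrative sketch of what these floors achieve (all function and variable names here are hypothetical, not the framework's actual API), the idea is to reserve some reduce capacity before handing everything to mappers:

```python
# Hypothetical sketch, not the real hadoop-on-mesos scheduler code:
# minimum slot counts ensure some reduce capacity always exists, so
# running maps can never starve reducers of resources entirely.

def slots_to_launch(offered_slots, current_map, current_reduce,
                    min_map=500, min_reduce=750):
    """Decide how many map/reduce slots to launch from an offer."""
    launch = {"map": 0, "reduce": 0}
    # First satisfy the reduce-slot floor, so reducers can always start.
    if current_reduce < min_reduce:
        launch["reduce"] = min(offered_slots, min_reduce - current_reduce)
        offered_slots -= launch["reduce"]
    # Then satisfy the map-slot floor with whatever remains.
    if current_map < min_map:
        launch["map"] = min(offered_slots, min_map - current_map)
    return launch

print(slots_to_launch(100, 0, 0))  # {'map': 0, 'reduce': 100}
```

With both floors already met, no further slots are forced, so the scheduler behaves normally for the rest of the cluster.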
So essentially you ensure there's always some small amount of map and reduce capacity available?
Yes, precisely.
Ah righty, I see. We're running a pretty small cluster and pack things in quite tightly, so I might have a stab at an alternative fix... (to free up these resources for other frameworks).
Gotcha. Let me know if I can help.
We've just rolled the proposed fix (#33) out into production and it seems to be working well enough so far!
Hooray!
Hi, it seems I am experiencing the same issue. I see that the proposed fix is not yet merged. Could you please update me on the current status and what would be best to do to fix this on my cluster? I am using Mesos 0.20 with many different frameworks built for this version. I also saw this seemingly related issue: https://issues.apache.org/jira/browse/MESOS-1817, which is solved in 0.21, and I am wondering whether upgrading is a good idea...
Hi @strat0sphere! Glad to hear you're using the Hadoop on Mesos framework. The issue you linked to is unrelated as far as I can tell. It's always good to keep up with the latest versions of Mesos, though. I think that fix was committed to master anyway, so an upgrade of the framework library isn't required.

We've been running this fix in production for a number of months now, so it's pretty good to go. I've held back merging it as there's still one issue I want to iron out (in some instances, mostly around task/job failures, task trackers don't always commit suicide). Go ahead and try out the branch.

Regardless of cluster size, the deadlock shouldn't happen with this code. However, if you have spare memory (e.g. when all CPU is allocated to Hadoop you still have some memory left) you'll see a more fluid cluster and probably a performance boost for your jobs. Given that, I'd suggest keeping your memory allocation for Hadoop as low as you possibly can to leave headroom.
Thank you for your answer @tarnfeld! I'll give it a try and let you know... So far I am dealing with it by also assigning a minimum number of reducers. Regarding your memory comment, I have a small, overloaded cluster with HA enabled for Mesos, namenodes and jobtrackers - 3 masters and 7 slaves with only 4GB of RAM per machine. Resource-wise, each executor gets 1.5GB of memory and 2 CPU slots and is really slow at the moment. To avoid having the executor staged and lost all the time I also had to increase the executor_registration_timeout - I'll try different allocations and see what works best... But of course any hint is welcome!
@strat0sphere Hey! I've pushed a bunch of changes to #33 and rebased over the latest master; you might want to give the new version a try.
Hi @tarnfeld - I pulled the branch 3 days ago. Unfortunately it doesn't always work in my case, and I just wanted to be sure before replying. I've seen it killing the trackers only a couple of times, but mostly it doesn't. The log files don't give me any useful information about why it's not consistent. My only observation, from the Mesos UI, is that the task trackers are actually consuming a very small portion of CPU (fluctuating at very small values like 0.016) even when all the mappers are done - so I am guessing that for this reason the slot is not considered idle, and that's why it isn't killed. As a reminder, I am using Mesos 0.20 in case this makes any difference - I'll try to upgrade to 0.21.1 and try again, but I don't think this is the issue... Also, I notice that your patch will only work as long as there is more than one slot per executor. So for very small machines I guess the only option left is to have a minimum of one reducer slot running all the time... since it seems there is no way to learn the total slots needed from the jobtracker and plan the allocation accordingly...
Yeah, can you try on the latest version? |
Essentially what happens is a two-phase termination of the task tracker. When all slots in a task tracker become idle (if you have a TT with map AND reduce, the system is not 100% efficient) the slots will be "revoked" from the tracker, but it will remain online. The reason for this is that Hadoop serves map output from the task tracker the map task ran on, so we need to keep this map data around. Once the slots have been revoked, the CPU and RAM that were allocated to the task tracker for the slots are also freed in the cluster, which is where the real benefit is. The task tracker will monitor itself and commit suicide once it no longer needs to stay around to serve map output. The most recent fixes should have improved this behaviour. Perhaps I'm not entirely understanding your issue?
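The two-phase shutdown described above can be sketched roughly as follows (hypothetical names in Python; the real framework is written in Java):

```python
# Sketch of the two-phase task tracker termination described above.
# All names are illustrative, not the framework's actual classes.

class TaskTracker:
    def __init__(self, map_slots, reduce_slots):
        self.map_slots = map_slots
        self.reduce_slots = reduce_slots
        self.state = "RUNNING"

    def revoke_idle_slots(self):
        # Phase 1: give the CPU/RAM held for slots back to the cluster,
        # but keep the tracker process alive to serve map output.
        self.map_slots = 0
        self.reduce_slots = 0
        self.state = "DRAINING"

    def maybe_suicide(self, pending_shuffle_fetches):
        # Phase 2: once no reducer still needs this tracker's map
        # output, the tracker commits suicide.
        if self.state == "DRAINING" and pending_shuffle_fetches == 0:
            self.state = "TERMINATED"

tt = TaskTracker(map_slots=4, reduce_slots=2)
tt.revoke_idle_slots()
tt.maybe_suicide(pending_shuffle_fetches=3)  # still serving map output
assert tt.state == "DRAINING"
tt.maybe_suicide(pending_shuffle_fetches=0)
assert tt.state == "TERMINATED"
```

The key point is that phase 1 frees resources immediately, while phase 2 waits on the shuffle rather than on the job as a whole.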
@strat0sphere Could you also be more specific around the issue "Also I notice that your patch will work as long as there is more than one slot per executor." |
I am referring to this: "This is skirted around by only revoking a percentage of map slots from each TaskTracker (remaining = max(slots - (slots * 0.9), 1) by default)." In my case I had an allocation where each TT could run only one slot at a time, so all slaves were running 1 mapper slot each, and by that definition your patch wouldn't work as expected with this setup. Then I reduced my allocation and, as I said, it worked some of the time, but mostly it didn't - probably because the CPU doesn't go completely idle, as mentioned above.
I am working on upgrading to 0.21.1 and will try the new version. Unfortunately it isn't going as smoothly as expected (all tasks lost at the moment), but I will let you know when I finish and try your code again...
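The revocation formula quoted above makes the single-slot limitation easy to see (a minimal sketch; only the formula itself comes from the patch notes):

```python
# remaining = max(slots - (slots * 0.9), 1), as quoted from the patch:
# by default 90% of a tracker's slots are revoked, but at least one
# slot always remains on each tracker.
def remaining_slots(slots):
    return max(slots - (slots * 0.9), 1)

# With a single-slot tracker nothing is ever revoked, which is why the
# patch has no effect on trackers that only run one slot at a time.
for n in (1, 2, 10):
    print(n, "slots ->", n - remaining_slots(n), "revoked")
```

So a cluster whose trackers each hold exactly one slot keeps every slot allocated, and only the idle-detection path can ever free resources.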
Hi @tarnfeld - So I upgraded to Mesos 0.21.1 and used your latest version, but the problems remain. As I mentioned before, it doesn't work reliably on my cluster. Unfortunately I don't have time to debug this further, so I will stick with always having a reducer as a workaround. In case this helps, one of the things I observed is that after all the mappers are completed, they go to the finished state but no reducer is ever started. Checking the slave log shows that the resources needed are not satisfied (which is not true, unless I'm missing something) and also that there are pending map tasks (there are not, since all map tasks are finished!):
And after the task is already killed, the executor keeps being checked for being idle, and half of the resources in the cluster remain used even though the TTs are in the finished state:
Finally, I noticed that some TTs are killed when the job is starting... The following is for a TT that just started:
Hmm. This branch doesn't change the scheduling behaviour, so it concerns me that you're seeing issues where the right balance of map and reduce slots isn't being launched.
This is actually correct behaviour, because the TT needs to stay alive to serve the map output data to the reducers. The slots should have been "revoked" - which it looks like happened OK in the logs - and the resources freed for use in the cluster. I wonder whether this is all a result of the fact that the resources are so tight that the
Regarding this, I think this is a known issue that I need to fix. TL;DR: we need to not class task trackers as idle when they haven't run any tasks yet (i.e. just started). @strat0sphere I really appreciate all the details!
Regarding this
I think I've seen it before, so I am guessing it's not related to your patch. Though I am still unsure why the above resources are considered "insufficient" - either the reporting is mistaken or there's a bug in the code. Thanks a lot for trying to help... If I find the time, I'll dig deeper into the code to find out what is going on and let you know...
It might be to do with resource roles? If you could give me some exact details on the cluster (number of slaves and their size), the resource specs for task trackers/slots in your Hadoop config, and, if you can, the rough size/shape of the job you're running, I'll try to reproduce here and dig into it. I'm keen to help see this patch through and get it merged into master, but I'm not entirely confident in it yet (as evidenced by these issues).
I am not sure what you mean by "resource roles". My current cluster has 3
I'm going to close this as #33 has been merged. Please open any new tickets if strange scheduling behaviour persists. |
So this is an interesting issue. I've seen (several times now) situations where the Hadoop scheduler will get itself into a deadlock with running jobs. Here's how it goes.
Cluster: Some number of Mesos slaves; let's say the resources equate to 100 slots. The underlying scheduler here is the Hadoop FairScheduler, not the FIFO one.
At this point, all the cluster resources are being given to the running TaskTrackers. These resources are not going to be released until the running job completes, but that job is waiting for some reducers to launch. This is a deadlock: the job will never complete because the task trackers are never released, and vice versa.
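A toy model of the stalemate (the 100-slot figure is from the example above; everything else is hypothetical):

```python
# Toy illustration of the deadlock, not real scheduler code: every slot
# is held by a map-phase TaskTracker, maps are done, but the trackers
# are only released when the job completes - and the job only completes
# once a reducer launches into a free slot.

total_slots = 100
slots_held_by_trackers = 100   # all resources given to TaskTrackers
free_slots = total_slots - slots_held_by_trackers

can_launch_reducer = free_slots > 0        # needs a free slot
job_complete = can_launch_reducer          # needs reducers to run
trackers_released = job_complete           # released only on completion

# Neither condition can ever become true - a classic circular wait.
assert not can_launch_reducer and not trackers_released
```

Breaking either edge of the cycle (freeing slot resources early, or guaranteeing a reduce-slot floor) resolves it, which is what the two approaches in this thread do.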
I'm wondering if you can suggest anything here @brndnmtthws @florianleibert @benh?
This problem fits quite well with the Task/Executor relationship. In this example I need to keep the executors alive (so they can stream data to the reducers for shuffle/sort), but I need to free up the "slots", i.e. the task resources. Perhaps the framework could terminate the Task that holds the resources for the slots independently of the TaskTracker itself, and then internally mark that Task Tracker as "going to be killed soon".
We have to maintain some state internally because it is not possible to reduce the number of slots on a task tracker while it is running, so the Hadoop/Mesos scheduler needs to proactively avoid scheduling tasks there. I don't think this is too complicated to do, though.
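The bookkeeping described here could look something like this (a minimal sketch with hypothetical names, since Hadoop can't shrink a running tracker's slot count):

```python
# Illustrative sketch of the internal state described above: trackers
# marked "going to be killed soon" are excluded from scheduling so no
# new tasks land on slots that have conceptually been revoked.

draining = set()  # tracker ids whose slot Tasks were terminated

def mark_for_termination(tracker_id):
    """Slot Task was killed; the executor stays up to serve map output."""
    draining.add(tracker_id)

def schedulable(trackers):
    """Trackers the scheduler may still assign new tasks to."""
    return [t for t in trackers if t not in draining]

mark_for_termination("tt-3")
assert schedulable(["tt-1", "tt-2", "tt-3"]) == ["tt-1", "tt-2"]
```

This keeps the Mesos view (resources freed) and the Hadoop view (tracker still advertises slots) consistent until the tracker finally exits.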