
Running job deadlock #32

Closed · tarnfeld opened this issue Sep 10, 2014 · 23 comments

@tarnfeld
Member

So this is an interesting issue. I've seen (several times now) situations where the Hadoop scheduler will get itself into a deadlock with running jobs. Here's how it goes.

Cluster: some number of Mesos slaves; let's say the resources equate to 100 slots. The underlying scheduler here is the Hadoop FairScheduler, not the FIFO one.

  1. Empty cluster.
  2. Launch MAP ONLY job with 1000 tasks.
  3. The framework will launch N task trackers with a total of 100 MAP slots (no reduce slots, since none are needed).
  4. The job will start running, ticking through those tasks.
  5. While the first job is running, launch a second with 1000 MAP and 5 REDUCE.
  6. Hadoop will share the currently running task trackers and slots evenly between the two jobs.
  7. The first job completes fine, no issues.
  8. Every task tracker in the cluster now holds some map output from the second job; that data needs to be streamed to the reducers later on. Because of this, the framework won't kill the TTs and the resources are not released.
  9. The job is waiting for reducers, but no resources are free to launch TTs with reduce slots.
  10. The job will never complete, and Hadoop will never release the Mesos resources.

At this point, all the cluster resources are being given to the running TaskTrackers. These resources are not going to be released until the running job completes, but that job is waiting for some reducers to launch. This is a deadlock: the job will never complete because the task trackers are never released, and vice versa.

I'm wondering if you can suggest anything here @brndnmtthws @florianleibert @benh?


This problem fits quite well with the Task/Executor relationship. In this example I need to keep the executors alive (so they can stream data to the reducers for the shuffle/sort), but I need to free up the "slots", i.e. the task resources. Perhaps the framework could terminate the Task that holds the resources for the slots independently of the TaskTracker itself, and then internally mark that TaskTracker as "going to be killed soon".

We have to maintain some state internally because it is not possible to reduce the number of slots on a task tracker while it is running, so the Hadoop/Mesos scheduler needs to proactively avoid scheduling tasks there. I don't think this is too complicated to do, though.
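For the scheduler side, something along these lines is what I have in mind - a minimal sketch with hypothetical names, not actual code from the framework:

  import java.util.Collections;
  import java.util.HashSet;
  import java.util.Set;

  // Minimal sketch (hypothetical names, not the real scheduler code): remember which
  // task trackers have had their slot Task terminated, so we stop scheduling new
  // map/reduce work on them while the executor stays alive to serve map output.
  class RevokedTrackerRegistry {
    private final Set<String> revoked = Collections.synchronizedSet(new HashSet<String>());

    // Called when the slot Task is killed but the TaskTracker/executor is kept running.
    void markGoingToBeKilledSoon(String trackerHost) {
      revoked.add(trackerHost);
    }

    // Consulted by the scheduler before assigning work to a tracker.
    boolean canScheduleOn(String trackerHost) {
      return !revoked.contains(trackerHost);
    }

    // Forget the tracker once it has finally shut itself down.
    void forget(String trackerHost) {
      revoked.remove(trackerHost);
    }
  }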

tarnfeld changed the title from "Job deadlock" to "Running job deadlock" on Sep 10, 2014
@tarnfeld
Member Author

Update: Killing off a few of the task trackers one by one (resulting in TASK_LOST) is sufficient to kick the Hadoop scheduler back into action, and it will then start launching some reduce slots.

@brndnmtthws
Member

There are two parameters that exist for dealing with this case:

  <property>
    <name>mapred.mesos.total.map.slots.minimum</name>
    <value>500</value>
  </property>

  <property>
    <name>mapred.mesos.total.reduce.slots.minimum</name>
    <value>750</value>
  </property>

At least, this is how I've dealt with it.

@tarnfeld
Member Author

So essentially you ensure there's always some small amount of map and reduce capacity available?

@brndnmtthws
Member

Yes, precisely.

@tarnfeld
Member Author

Ah righty, I see. We're running a pretty small cluster and pack things in quite tightly, so I might have a stab at an alternative fix... (to free up these resources for other frameworks).

@brndnmtthws
Member

Gotchya. Let me know if I can help.

tarnfeld mentioned this issue Sep 16, 2014
@tarnfeld
Member Author

We've just rolled the proposed fix (#33) out into production and it seems to be working well enough so far!

@brndnmtthws
Member

Hooray!

@strat0sphere

Hi, it seems I am experiencing the same issue. I see that the proposed fix is not yet merged. Could you please update me on the current status and the best way to fix this on my cluster? I am using Mesos 0.20 with many different frameworks built for this version. I also saw this seemingly related issue: https://issues.apache.org/jira/browse/MESOS-1817, which is solved in 0.21, and I'm wondering whether upgrading is a good idea...

@tarnfeld
Member Author

Hi @strat0sphere! Glad to hear you're using the Hadoop on Mesos framework.

The issue you linked to is unrelated as far as I can tell. It's always good to keep up with the latest versions of Mesos, though. I think that fix was committed to the master branch anyway, so an upgrade of the framework library isn't required.

We've been running this fix in production for a number of months now, so it's pretty solid. I've held back merging it because there's still one issue I want to iron out (in some instances, mostly around task/job failures, task trackers don't always commit suicide).

Go ahead and try out the branch; it should be pretty much good to go. Regardless of cluster size, the deadlock shouldn't happen with this code... however, if you have spare memory (e.g. when all CPU is allocated to Hadoop you still have some memory left over) you'll see a more fluid cluster and probably a performance boost for your jobs. Given that, I'd suggest keeping your memory allocation for Hadoop as low as you possibly can to leave headroom.

@strat0sphere

Thank you for your answer @tarnfeld! I'll give it a try and let you know... So far I am also dealing with it by assigning a minimum number of reducers. Regarding your memory comment, I have a small, overloaded cluster with HA enabled for Mesos, the namenodes and the jobtrackers - 3 masters and 7 slaves with only 4GB of RAM per machine. Resource-wise, each executor gets 1.5GB of memory and 2 CPU slots and is really slow at the moment. To avoid the executor being staged and lost all the time I also had to increase executor_registration_timeout - I'll try different allocations and see what works best... But of course any hint is welcome!

@tarnfeld
Member Author

tarnfeld commented Mar 3, 2015

@strat0sphere Hey! I've pushed a bunch of changes to #33 and rebased onto the latest master; you might want to give the new version a try.

@strat0sphere

Hi @tarnfeld - I pulled the branch 3 days ago. Unfortunately it doesn't always work in my case and I just wanted to be sure before replying. Actually, I've seen it killing the trackers only a couple of times but mostly it doesn't. The log files do not give me any useful information to understand why it's not consistent. My only observation, from the Mesos UI, is that the task trackers are still consuming a very small amount of CPU (fluctuating at very small values like 0.016) even when all the mappers are done - so I am guessing that is why the slot is not considered idle and isn't killed.

As a reminder I am using Mesos 0.20 in case this makes any difference - I'll try to upgrade to 0.21.1 and try again but I don't think this is the issue...

Also I notice that your patch will work as long as there is more than one slot per executor. So for very small machines I guess the only option left is to keep at least one reducer slot running all the time... since it seems there is no way to learn the total number of slots needed from the jobtracker and plan the allocation accordingly...

@tarnfeld
Member Author

tarnfeld commented Mar 4, 2015

@strat0sphere

I pulled the branch 3 days ago. Unfortunately it doesn't always work in my case and I just wanted to be sure before replying. Actually, I've seen it killing the trackers only a couple of times but mostly it doesn't. The log files do not give me any useful information to understand why it's not consistent.

Yeah, can you try on the latest version?

@tarnfeld
Member Author

tarnfeld commented Mar 4, 2015

Also I notice that your patch will work as long as there is more than one slot per executor. So for very small machines I guess the only option left is to keep at least one reducer slot running all the time... since it seems there is no way to learn the total number of slots needed from the jobtracker and plan the allocation accordingly...

Essentially what happens is a two-phase termination of the task tracker. When all slots in a task tracker become idle (if you have a TT with map AND reduce slots the system is not 100% efficient) the slots will be "revoked" from the tracker, but it will remain online. The reason for this is that Hadoop serves map output from the task tracker the map task ran on, so we need to keep that map data around.

Once the slots have been revoked, the CPU and RAM that was allocated to the task tracker for those slots is also freed up in the cluster, which is where the real benefit is. The task tracker will monitor itself and commit suicide once it no longer needs to stay around to serve map output.
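On the executor side it's roughly this shape (a simplified sketch with hypothetical helper methods, not the exact MesosExecutor code):

  import java.util.Timer;
  import java.util.TimerTask;

  // Simplified sketch of the second phase (hypothetical helpers, not the exact
  // MesosExecutor code): after the slots have been revoked and their CPU/RAM
  // returned to the cluster, the task tracker keeps checking itself and shuts
  // down once it no longer has work or map output to serve.
  class TaskTrackerSuicideWatch {
    private final Timer timer = new Timer("tracker-idle-check", true);

    void start() {
      timer.schedule(new TimerTask() {
        @Override
        public void run() {
          if (!hasRunningTasks() && !stillServingMapOutput()) {
            System.exit(0); // the tracker "commits suicide"
          }
        }
      }, 0, 1000); // re-check every second
    }

    // Stand-ins for the real TaskTracker state checks.
    boolean hasRunningTasks() { return false; }
    boolean stillServingMapOutput() { return false; }
  }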

The most recent fixes should have improved this behaviour. Perhaps I'm not entirely understanding your issue?

@tarnfeld
Member Author

tarnfeld commented Mar 4, 2015

@strat0sphere Could you also be more specific around the issue "Also I notice that your patch will work as long as there is more than one slot per executor."

@strat0sphere

Could you also be more specific around the issue

I am referring to this: "This is skirted around by only revoking a percentage of map slots from each TaskTracker (remaining = max(slots - (slots * 0.9), 1) by default)." In my case I had an allocation where each TT could run only one slot at a time, so all slaves were running 1 mapper slot each, and by definition your patch wouldn't work as expected with that. Then I reduced my allocation and, as I said, there were some times it worked, but mostly it doesn't, probably because the CPU doesn't go completely idle as mentioned above.
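To make that concrete, here is how the quoted default plays out for small trackers (a throwaway helper just to show the numbers, not code from the patch):

  // remaining = max(slots - (slots * 0.9), 1), as quoted above.
  class SlotRevocationMath {
    static double remainingSlots(int slots) {
      return Math.max(slots - (slots * 0.9), 1);
    }
    // remainingSlots(1)  -> 1.0 : a one-slot tracker keeps its only slot, so nothing is ever revoked
    // remainingSlots(2)  -> 1.0 : one of the two slots is revoked, one retained
    // remainingSlots(10) -> 1.0 : nine of the ten slots are revoked, one retained
  }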

Yeah, can you try on the latest version?

I am working on upgrading to 0.21.1 and will try the new version. Unfortunately it isn't going as smoothly as expected (everything ends up TASK_LOST at the moment), but I will let you know when I finish and try your code again...

@strat0sphere

Hi @tarnfeld - So I upgraded to Mesos 0.21.1 and used your latest version, but the problems remain. As I mentioned before, it doesn't work reliably in my cluster. Unfortunately I don't have time to debug this further, so I will stick with keeping a minimum reducer as a workaround to this issue. In case it helps, one of the things I observed is that after all the mappers complete, they go to the finished state but no reducer is ever started. Checking the slave log shows that the needed resources are not satisfied (which is not true, unless I am missing something) and also that there are pending map tasks (there are not, since all map tasks are finished!):


2015-03-04 23:33:26,482 INFO org.apache.hadoop.mapred.ResourcePolicy: Unable to fully satisfy needed map/reduce slots: 2 map slots 1 reduce slots remaining
2015-03-04 23:33:32,493 INFO org.apache.hadoop.mapred.ResourcePolicy: JobTracker Status
      Pending Map Tasks: 2
   Pending Reduce Tasks: 1
      Running Map Tasks: 0
   Running Reduce Tasks: 0
         Idle Map Slots: 0
      Idle Reduce Slots: 0
     Inactive Map Slots: 0 (launched but no hearbeat yet)
  Inactive Reduce Slots: 0 (launched but no hearbeat yet)
       Needed Map Slots: 2
    Needed Reduce Slots: 1
     Unhealthy Trackers: 0
2015-03-04 23:33:32,494 INFO org.apache.hadoop.mapred.ResourcePolicy: Declining offer with insufficient resources for a TaskTracker:
  cpus: offered 1.0 needed at least 1.0
  mem : offered 1907.0 needed at least 1024.0
  disk: offered 13510.0 needed at least 1024.0
  ports:  at least 2 (sufficient)

and after the task is already killed, the executor keeps being checked for being idle, and half of the resources in the cluster remain used even though the TTs are in the finished state:


15/03/04 23:18:55 INFO mapred.TaskTracker: Task attempt_201503042303_0001_m_000039_1 is done.
15/03/04 23:18:55 INFO mapred.TaskTracker: reported output size for attempt_201503042303_0001_m_000039_1  was -1
15/03/04 23:18:55 INFO mapred.TaskTracker: addFreeSlot : current free slots : 2
15/03/04 23:18:56 INFO mapred.JvmManager: JVM : jvm_201503042303_0001_m_1483177208 exited with exit code 0. Number of tasks it ran: 1
15/03/04 23:19:17 INFO mapred.MesosExecutor: Killing task : Task_Tracker_2
15/03/04 23:19:17 INFO mapred.MesosExecutor: Revoking task tracker map/reduce slots
15/03/04 23:19:18 INFO mapred.MesosExecutor: Checking to see if TaskTracker is idle
.
.
.
15/03/04 23:31:57 INFO mapred.MesosExecutor: Checking to see if TaskTracker is idle
15/03/04 23:31:57 INFO mapred.MesosExecutor: TaskTracker has 2 running tasks and [] tasks to clean up.
15/03/04 23:31:58 INFO mapred.MesosExecutor: Checking to see if TaskTracker is idle
15/03/04 23:31:58 INFO mapred.MesosExecutor: TaskTracker has 2 running tasks and [] tasks to clean up.
.
.
.
--> It goes forever...

Finally I noticed that some TTs are killed when the job is starting... The following is for a TT that just started...


15/03/04 23:56:50 INFO mapred.MesosExecutor: Killing task : Task_Tracker_4
15/03/04 23:56:50 INFO mapred.MesosExecutor: Revoking task tracker map/reduce slots
15/03/04 23:56:51 INFO mapred.MesosExecutor: Checking to see if TaskTracker is idle
15/03/04 23:56:51 WARN mapred.MesosExecutor: TaskTracker is idle, terminating

@tarnfeld
Member Author

tarnfeld commented Mar 5, 2015

Hmm. This branch doesn't change the scheduling behaviour, so it concerns me that you're seeing issues where the right balance of map and reduce slots isn't being launched.

and after the task is already killed, the executor keeps being checked for being idle, and half of the resources in the cluster remain used even though the TTs are in the finished state:

This is actually correct behaviour, because the TT needs to stay alive to serve the map output data to the reducers. The slots should have been "revoked", which it looks like happened OK in the logs, and the resources freed for use in the cluster.

I wonder whether this is all a result of the resources being so tight that the default calculation (remaining = max(slots - (slots * 0.9), 1)) will always retain a single slot. I think we can actually remove that check now.

Finally I noticed that some TTs are killed when the job is starting... The following is for a TT that just started...

Regarding this, I think it's a known issue that I need to fix. TL;DR: we need to not class task trackers as idle when they haven't run any tasks yet (i.e. they just started).
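Roughly, the fix is an extra guard in that idle check - a quick sketch with hypothetical field names, not the actual code:

  import java.util.ArrayList;
  import java.util.List;

  // Sketch (hypothetical names): a tracker that has never run a task should not be
  // classed as idle just because it launched moments ago.
  class IdleCheck {
    long tasksLaunchedSoFar = 0;
    long runningTaskCount = 0;
    List<String> tasksToCleanUp = new ArrayList<String>();

    boolean isIdle() {
      if (tasksLaunchedSoFar == 0) {
        return false; // just started; don't terminate it before it has done any work
      }
      return runningTaskCount == 0 && tasksToCleanUp.isEmpty();
    }
  }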

@strat0sphere I really appreciate all the details!

@strat0sphere

Regarding this


2015-03-04 23:33:32,494 INFO org.apache.hadoop.mapred.ResourcePolicy: Declining offer with insufficient resources for a TaskTracker:
  cpus: offered 1.0 needed at least 1.0
  mem : offered 1907.0 needed at least 1024.0
  disk: offered 13510.0 needed at least 1024.0
  ports:  at least 2 (sufficient)

I think I've seen it before, so I am guessing it's not related to your patch. Though I am still unsure why the above resources are considered "insufficient" - either the reporting is mistaken or there is some bug in the code.

Thanks for trying to help... If I find the time I'll dig deeper into the code to find out what is going on and let you know...

@tarnfeld
Member Author

tarnfeld commented Mar 5, 2015

It might be to do with resource roles? Could you give me some exact details on the cluster (the number of slaves and their size), the resource specs for the task tracker/slots in your Hadoop config and, if you can, the rough size/shape of the job you're running? I'll try to reproduce it here and dig into it.

I'm keen to see this patch through and get it merged into master, but I'm not entirely confident in it yet (as is evident from these issues).

@strat0sphere

I am not sure what you mean by "resource roles". My current cluster has 3 masters that run the ZK quorum, the namenodes, the jobtrackers and the Mesos masters in HA, and 7 slaves that also run the HDFS datanodes. All machines have 2 CPUs and 4GB of memory (of which 2.9GB is available to Mesos). I am currently experimenting with many different configurations for mapred.mesos.slot.cpus and mapred.mesos.slot.mem, as I am trying to understand what works best and which configuration will also work best in combination with Spark on the cluster. You can try with 0.5 CPU and 384 mem to reproduce the issue - with this setup each TT can run at most 2 slots on my cluster (though I would expect it could run up to 4, I have a limit of 2 mappers per tracker in the hope that the extra space might be used for the reducer...). The exact numbers are not important, I think... As long as the number of slots per node multiplied by the number of nodes is less than the total number of maps needed, the deadlock will occur.

As for the benchmark, I am using WordCount on a 5GB file produced with BigDataBench at the moment - so it needs 40 maps and 1 reduce. My goal is to find a configuration that is good enough across different benchmarks and also to compare with Spark, which at the moment seems to run more smoothly on Mesos, but that might be because of my configuration, memory assignments etc... profiling is still work in progress...
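For what it's worth, here is the back-of-the-envelope math behind "2 slots instead of 4" (a throwaway snippet; it assumes the tracker itself takes roughly the 1 CPU / 1024MB baseline shown in the declined-offer log above):

  // Rough slot math for one 2-CPU / 2.9GB slave with 0.5 CPU / 384MB slots,
  // assuming ~1 CPU and ~1024MB go to the TaskTracker itself (an assumption
  // based on the "needed at least" values in the declined-offer log).
  class SlotMath {
    public static void main(String[] args) {
      double cpuForSlots = 2.0 - 1.0;       // CPUs left over for slots
      double memForSlots = 2900.0 - 1024.0; // MB left over for slots
      int slotsByCpu = (int) (cpuForSlots / 0.5); // 2
      int slotsByMem = (int) (memForSlots / 384); // 4
      // CPU is the limiting factor, so only 2 slots fit even though memory would allow ~4.
      System.out.println(Math.min(slotsByCpu, slotsByMem)); // prints 2
    }
  }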


@tarnfeld
Member Author

I'm going to close this now that #33 has been merged. Please open a new ticket if any strange scheduling behaviour persists.
