hadoop: tune max {map,reduce} tasks to num CPUs available #345
This pull request is in response to issue #115
Hadoop defaults to 2 map slots and 1 reduce slot per node/machine, and StarCluster's hadoop plugin uses the default configs. For large AWS instance types, this configuration leaves much of the CPU capacity unused. This change writes a custom mapred-site.xml file for each node that sets the mapred.tasktracker.{map,reduce}.tasks.maximum parameters based on the node's CPU count. In particular, it uses a simple heuristic (similar to the one used in EMR's hadoop configs) that assigns 1 map slot per CPU and roughly 1/3 of a reduce slot per CPU. The parameters are exposed as kwargs to the plugin's constructor, so the user can override them in the plugin's config (see the example plugin section below).
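For reference, the heuristic amounts to something like the sketch below. The function and parameter names here are illustrative only, not necessarily the ones used in the patch:

```python
# Illustrative sketch of the heuristic described above (not the exact code
# in this change): 1 map slot per CPU, roughly 1/3 reduce slot per CPU,
# with user-supplied overrides taking precedence.
def task_slots_for_node(num_cpus, map_tasks_max=None, reduce_tasks_max=None):
    if map_tasks_max is None:
        map_tasks_max = num_cpus
    if reduce_tasks_max is None:
        reduce_tasks_max = max(1, num_cpus // 3)
    return {
        'mapred.tasktracker.map.tasks.maximum': map_tasks_max,
        'mapred.tasktracker.reduce.tasks.maximum': reduce_tasks_max,
    }

# Example: a node with 8 CPUs gets 8 map slots and 2 reduce slots.
print(task_slots_for_node(8))
```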
I have manually tested this change using the following (key) StarCluster config settings:
[cluster microdumbo]
NODE_INSTANCE_TYPE = c1.xlarge
CLUSTER_SIZE = 2
PLUGINS = hadoop
NODE_IMAGE_ID = ami-765b3e1f
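For completeness, overriding the heuristic from the plugin section of the StarCluster config would look roughly like the snippet below. The two option names are placeholders for whatever kwargs the plugin constructor actually accepts in this change:

```ini
[plugin hadoop]
SETUP_CLASS = starcluster.plugins.hadoop.Hadoop
# Placeholder option names; the real names come from the plugin's
# constructor kwargs added in this change.
MAP_TASKS_MAX = 8
REDUCE_TASKS_MAX = 3
```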
I then ran this procedure on the master node:
$ wget http://norvig.com/big.txt
$ pip install mrjob
$ export HADOOP_HOME=/usr/lib/hadoop
$ python /usr/local/lib/python2.7/dist-packages/mrjob/examples/mr_word_freq_count.py -r hadoop big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt
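As a sanity check, the generated per-node limits can also be inspected directly in mapred-site.xml. Assuming the conf directory lives under the HADOOP_HOME exported above (it may be elsewhere, e.g. /etc/hadoop/conf, on other images):

```
$ grep -A 1 tasks.maximum /usr/lib/hadoop/conf/mapred-site.xml
```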
The job tracker showed 2 nodes, 16 maps, and 4 reduces available, and the job ran up to 16 map tasks in parallel, as desired. The output looked correct.
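These numbers line up with the heuristic: a c1.xlarge node exposes 8 virtual cores, giving 8 map slots and floor(8/3) = 2 reduce slots per node, i.e. 16 map and 4 reduce slots across the 2-node cluster, which matches what the job tracker reported.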