hadoop: tune max {map,reduce} tasks to num CPUs available #345
This pull request is in response to issue #115
Hadoop defaults to 2 map slots and 1 reduce slot per node/machine, and StarCluster's hadoop plugin uses the default configs. For large AWS instance types, this configuration leaves much of the CPU capacity unused. This change writes a custom mapred-site.xml file for each node that sets the mapred.tasktracker.{map,reduce}.tasks.maximum parameters based on the node's CPU count. In particular, it uses a simple heuristic (similar to the one used in EMR's hadoop configs) that assigns 1 map slot per CPU and roughly 1/3 of a reduce slot per CPU. The parameters are exposed as kwargs to the plugin's constructor, so the user can override them in the plugin's config (see the example plugin section below).
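For reference, the heuristic amounts to something like the sketch below. The function and parameter names here are illustrative only, not necessarily the ones used in the patch:

```python
# Illustrative sketch of the heuristic described above (not the exact code
# in this change): 1 map slot per CPU, roughly 1/3 reduce slot per CPU,
# with user-supplied overrides taking precedence.
def task_slots_for_node(num_cpus, map_tasks_max=None, reduce_tasks_max=None):
    if map_tasks_max is None:
        map_tasks_max = num_cpus
    if reduce_tasks_max is None:
        reduce_tasks_max = max(1, num_cpus // 3)
    return {
        'mapred.tasktracker.map.tasks.maximum': map_tasks_max,
        'mapred.tasktracker.reduce.tasks.maximum': reduce_tasks_max,
    }

# Example: a node with 8 CPUs gets 8 map slots and 2 reduce slots.
print(task_slots_for_node(8))
```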
I have manually tested this change using the following (key) StarCluster config settings:
[cluster microdumbo]
NODE_INSTANCE_TYPE = c1.xlarge
CLUSTER_SIZE = 2
PLUGINS = hadoop
NODE_IMAGE_ID = ami-765b3e1f
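For completeness, overriding the heuristic from the plugin section of the StarCluster config would look roughly like the snippet below. The two option names are placeholders for whatever kwargs the plugin constructor actually accepts in this change:

```ini
[plugin hadoop]
SETUP_CLASS = starcluster.plugins.hadoop.Hadoop
# Placeholder option names; the real names come from the plugin's
# constructor kwargs added in this change.
MAP_TASKS_MAX = 8
REDUCE_TASKS_MAX = 3
```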
I then ran this procedure on the master node:
$ wget http://norvig.com/big.txt
$ pip install mrjob
$ export HADOOP_HOME=/usr/lib/hadoop
$ python /usr/local/lib/python2.7/dist-packages/mrjob/examples/mr_word_freq_count.py -r hadoop big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt big.txt
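As a sanity check, the generated per-node limits can also be inspected directly in mapred-site.xml. Assuming the conf directory lives under the HADOOP_HOME exported above (it may be elsewhere, e.g. /etc/hadoop/conf, on other images):

```
$ grep -A 1 tasks.maximum /usr/lib/hadoop/conf/mapred-site.xml
```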
The job tracker showed 2 nodes, 16 maps, and 4 reduces available, and the job ran up to 16 map tasks in parallel, as desired. The output looked correct.
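These numbers line up with the heuristic: a c1.xlarge node exposes 8 virtual cores, giving 8 map slots and floor(8/3) = 2 reduce slots per node, i.e. 16 map and 4 reduce slots across the 2-node cluster, which matches what the job tracker reported.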