Choosing the number of reducers
Clone this wiki locally
How many reducers should I use?
The following statement is found in the Hadoop documentation:
The right number of reduces seems to be 0.95 or 1.75 * (nodes * mapred.tasktracker.tasks.maximum). At 0.95 all of the reduces can launch immediately and start transfering map outputs as the maps finish. At 1.75 the faster nodes will finish their first round of reduces and launch a second round of reduces doing a much better job of load balancing.
That advice is (usually) correct for the optimization of a single Hadoop stage, but there are other issues that impact that decision.
Maximizing reduce performance for a specific cluster
Let's say the system is capable of running 20 reduce tasks in parallel, let's say there are 10 machines and each one has two reduce slots available.
If we're splitting the output 19 ways, each and every reduce happens in parallel. As each map step runs, it is immediately able to send a share of its data to every reducer. Once the maps are done, all the reducers can start, without the need for any further shuffling.
If we were splitting the output 200 ways, each mapper would have to save its output for all the forthcoming reducers. The first 20 reducers could gather 10% of the data during the map phase, and only when they are done, can the next 20 reducers get started. This causes CPU and I/O activity to be alternated instead of interleaved, reducing performance.
The '0.95' option above is an attempt to run all the reducers simultaneously, leaving a little bit of slack in case something goes wrong and a reducer must be re-run. By default, AMZN EMR allocates 6 CPU to map and 2 CPU to reduce, which is right for this scenario.
The '1.75' number is looking for a sweet spot between I/O interleaving and load balancing. If we run all our reducers at the same time, the time it takes for the longest running reducer determines the length of our job. If reduction is done in more than one round, these variations can balance out, which can be helpful sometimes.
How big of an effect is this?
To say something specific, I found that PSE3, running on a 12 x c1.xlarge cluster, took 2.5 hours when set with 500 reducers. With 23 reducers, the task completed in 1.75 hours.
Excluding other kinds of overhead, I think the difference you get from the mechanism above is unlikely to be more than a factor of two. The argument is as such:
If it takes
Cminutes of CPU times and
Iminutes of I/O time, the best we can do when we interleave I/O is
max(C,I). If we don't interleave at at all, it takes
(C+I)/max(C,I) <= 2
Reasons to disregard the above
You want exactly one reducer
If you really need to output a sorted list or implement some algorithm that needs to see everything in a specific order, you can create this situation by having exactly one reduce slot.
By nature, this isn't scalable at all and you'll be severely limited in the speedup you can get, but it is not bad if you're dealing with a small amount of data.
You want to avoid problems with the grain of your filesystem
HDFS is not efficient for small files for a few reasons.
One of them is that RAM consumption in the namenode is about 600 bytes per file. This is not a problem for thousands of files, but it is if you are working with millions or billions of small files.
Note also that a new map process is created for every input file, so if your files are small, the creation and setup of mapper processes can take more time than running the mappers.
Hadoop has special tools, such as the
SequenceFile that can be used to pack a large number of blobs into a single big file if you want to process, say, millions of small GIF files.
If you are using S3N, you face another problem. Amazon S3 has no problem with large numbers of small files, but the reliability of S3 operations go downhill once the file size gets larger than 4GB or so. Make life easier for you and your downstream and set your reducer so that your product files are no bigger than that.
The next phase of your process requires more concurrency
If you're using a non-splittable compression (particularly GZIP) the number of reducers you set now sets the maximum level of concurrency of the next job. That is, if you split a file into 16 indivisible pieces, you can have at most 16 map processes work on it later.
Some people will say, "use a splittable format like LZO" and that's certainly an answer. However, GZIP is going to appeal to people because of its very wide use and compatibility with tools other.
If you were planning to run this reduce once to produce a data file that will be used again and again, you might trade some efficiency in the reduce for the possibility of more concurrency later by increasing the number of reducers you use