# Hadoop Streaming Partitions FAQ

<a id=TOC></a>
## CONTENTS
* [Sample file](#samp) 
* __Q__: _What partitioning does Hadoop do on its own?_ __A__: [Default Partitioning](#default) 
* __Q__: _How can I ask Hadoop to partition on a specific field?_ __A__: [Specifying a Partition Key](#specify)
* __Q:__ _How can I sort within my custom partitions?_ __A:__ [Partitioning & Secondary Sort](#psort)
* __Q:__ _Why doesn't my combiner recognize that the mapper output is sorted?_ __A:__ [part1](#combo1), [part2](#combo2)
* __Q:__ _How can I make sure it does?_ __A:__ [Combining & Composite Keys](#combo3)
* __Q:__ _Why does my combiner still not work fully?_ __A:__ [Default Combining](#combo-default)
* __Q:__ _Why does a combiner mess up my secondary sort? - TRICK QUESTION_ __A:__ [Funky Stuff](#funky-stuff)
* __Q:__ _What other funky things happen with combiners and partitions?_ __A:__ [More Funky Stuff](#more-funky-stuff)

<a id=samp></a>
## Sample Input
[Return to Contents](#TOC)

In [1]:
%%writefile test.txt
A	C	1
A	C	2
A	C	1
A	D	5
A	D	5
A	D	2
B	C	5
B	C	1
B	C	10
B	D	2
B	D	10
B	D	3

Overwriting test.txt


In [2]:
# put it into HDFS
!hdfs dfs -mkdir Data
!hdfs dfs -rm Data/test.txt
!hdfs dfs -copyFromLocal test.txt Data/test.txt

mkdir: `Data': File exists
Deleted Data/test.txt


In [5]:
# save my jar file for less typing
JAR_FILE = '/usr/lib/hadoop-mapreduce/hadoop-streaming.jar'

<a id=default></a>
## Default Partitioning
[Return to Contents](#TOC)  | [Skip to Specifying a Non-Default Partition](#specify)   

__Q__: _What partitioning does Hadoop do on its own (i.e. if we specify two reducers but don't tell it how to partition...)?_    
__A:__ Hadoop will partition on the key, which by default is the first tab-separated field. There is one important caveat, though: read the warning at the end of this section.

In [4]:
# hadoop job w/ 2 reducers but no partitioner class
!hdfs dfs -rm -r no-partitioner-output
!hadoop jar {JAR_FILE} \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output no-partitioner-output \
    -numReduceTasks 2

Deleted no-partitioner-output
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob5795878492879064182.jar tmpDir=null
17/09/22 23:43:27 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:43:28 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:43:29 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/22 23:43:29 INFO mapreduce.JobSubmitter: number of splits:2
17/09/22 23:43:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506091508167_0061
17/09/22 23:43:30 INFO impl.YarnClientImpl: Submitted application application_1506091508167_0061
17/09/22 23:43:30 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506091508167_0061/
17/09/22 23:43:30 INFO mapreduce.Job: Running job: job_1506091508167_0061
17/09/22 23:43:38 INFO mapreduce.Job: Job job_1506091508167_0061 running in uber mode : false
17/09/22 23:43:38 INFO 

In [5]:
# first partition 
!hdfs dfs -cat no-partitioner-output/part-00000

A	D	2
A	D	5
A	D	5
A	C	1
A	C	2
A	C	1


In [6]:
# second partition
!hdfs dfs -cat no-partitioner-output/part-00001

B	C	5
B	D	3
B	D	10
B	D	2
B	C	10
B	C	1


__WARNING:__ Hadoop uses a hash function to perform this default partitioning. We got lucky and ended up with "A" in the first partition and "B" in the second, but that is not guaranteed. With different keys they might end up in the opposite order or even both in the same partition! (_Try replacing A with 'apple' and B with 'bear' to see this in action!_)
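
To make the warning concrete, here is a minimal Python sketch of hash partitioning. It assumes that the default partitioner takes the key's hash modulo the number of reducers, and that a text key is hashed with a Java-style `31 * h + byte` rolling hash starting from 1. Both are my reading of Hadoop's defaults rather than something this notebook verifies, and the function names `java_style_hash` and `default_partition` are hypothetical.

```python
def java_style_hash(key):
    """Rolling 31*h + byte hash over the UTF-8 bytes, wrapped to a signed 32-bit int.
    ASSUMPTION: this approximates how Hadoop hashes a text key."""
    h = 1
    for b in bytearray(key.encode("utf-8")):
        if b > 127:  # Java bytes are signed
            b -= 256
        h = (31 * h + b) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h


def default_partition(key, num_reducers):
    """Index N of the part-0000N file that would receive this key (under the assumption above)."""
    return (java_style_hash(key) & 0x7FFFFFFF) % num_reducers


for key in ["A", "B", "apple", "bear"]:
    print(key, default_partition(key, 2))
```

Under these assumptions 'A' and 'B' land in different partitions, while 'apple' and 'bear' collide in the same one, which is exactly the kind of surprise the warning describes.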

__NOTE:__ Look closely at the second partition's output: there's something weird about the order of the records. This again has to do with Hadoop's default shuffle (Hint: _what sorting does Hadoop guarantee? In this example, what does Hadoop consider the 'key' to be?_). We'll return to this weirdness when we look at combiners below.

<a id=specify></a>
## Specifying a Partition Key
[Return to Contents](#TOC) | [Skip to Partition Sort](#psort)  

__Q:__ _How can we ask Hadoop to partition on something other than the first field?_  
__A__: Hadoop will only partition on a field that is part of the key, so we must add 3 new parameters to the streaming command from the last section (note the order in which they appear below):  

>__`-D stream.num.map.output.key.fields=2`__ _tells Hadoop to treat the first two fields as a composite key._    
>__`-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner`__ _tells Hadoop that we want to partition based on one of the fields in the composite key._     
>__`-D mapreduce.partition.keypartitioner.options="-k2,2"`__  _tells Hadoop that in this example, we want to partition on the second field in the composite key._  
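
As a rough mental model of what these three options do together, here is a small Python sketch. The helper `partition_on_field` is hypothetical, and its rolling hash is only a stand-in (the real partitioner uses its own hash of the selected bytes). The point is that only the chosen field of the composite key decides which reducer a record goes to.

```python
def partition_on_field(key_fields, field, num_reducers):
    """Pick a reducer from ONE field of a composite key (1-based, like "-kN,N").
    The rolling hash below is a stand-in, not necessarily Hadoop's exact hash."""
    h = 0
    for b in bytearray(key_fields[field - 1].encode("utf-8")):
        h = (31 * h + b) & 0x7FFFFFFF
    return h % num_reducers


lines = ["A\tC\t1", "A\tD\t5", "B\tC\t5", "B\tD\t2"]
partitions = {}
for line in lines:
    group, part, num = line.split("\t")
    composite_key = [group, part]                 # stream.num.map.output.key.fields=2
    p = partition_on_field(composite_key, 2, 2)   # "-k2,2" with 2 reducers
    partitions.setdefault(p, []).append((group, part, num))

for p in sorted(partitions):
    print(p, partitions[p])
```

Whatever hash is used, every record with the same second field ends up in the same partition, regardless of its first field.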

In [7]:
# same as above, but now we specify the partitioner & partition key
!hdfs dfs -rm -r custom-partition-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options="-k2,2" \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output custom-partition-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Deleted custom-partition-output
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob3184373659579329880.jar tmpDir=null
17/09/22 23:49:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:49:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:49:40 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/22 23:49:40 INFO mapreduce.JobSubmitter: number of splits:2
17/09/22 23:49:41 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506091508167_0062
17/09/22 23:49:41 INFO impl.YarnClientImpl: Submitted application application_1506091508167_0062
17/09/22 23:49:41 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506091508167_0062/
17/09/22 23:49:41 INFO mapreduce.Job: Running job: job_1506091508167_0062
17/09/22 23:49:49 INFO mapreduce.Job: Job job_1506091508167_0062 running in uber mode : false
17/09/22 23:49:49 INF

In [8]:
# first partition
!hdfs dfs -cat custom-partition-output/part-00000

A	D	2
A	D	5
A	D	5
B	D	3
B	D	10
B	D	2


In [9]:
# second partition
!hdfs dfs -cat custom-partition-output/part-00001

A	C	1
A	C	2
A	C	1
B	C	5
B	C	10
B	C	1


Success! We partitioned based on the 2nd field.

<a id=psort></a>
## Secondary Sort Partitions
[Return to Contents](#TOC) | [Skip to sort with combiners](#combo1)

__Q:__ _How can I sort within my custom partitions?_  
__A:__ Since sorting is part of the shuffle and the shuffle is all about organizing by 'keys' we now want to include the third field as a part of our composite key: 

>__`-D stream.num.map.output.key.fields=3`__ _changed from '2' in the last example._    

Then we add two more parameters to specify Unix-style sorting on that field.

>__`-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator`__ _tells Hadoop that we want to sort on one of the fields in our composite key._  
>__`-D mapreduce.partition.keycomparator.options="-k3,3nr"`__ _tells Hadoop that for this specific example we want a reverse numerical sort on the 3rd field._  
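
The `n` and `r` modifiers behave like the ones in Unix `sort`. As an illustration only (plain Python, not Hadoop code), here is what a reverse sort on the 3rd field looks like with and without numeric comparison, using the 'D' records from the sample file. The variable names are made up for this sketch.

```python
records = [
    ("A", "D", "5"), ("A", "D", "5"), ("A", "D", "2"),
    ("B", "D", "2"), ("B", "D", "10"), ("B", "D", "3"),
]

# "-k3,3nr": compare field 3 as a number (n), largest first (r)
numeric_desc = sorted(records, key=lambda r: int(r[2]), reverse=True)

# "-k3,3r" (no 'n'): field 3 is compared as text, so "10" sorts below "2"
text_desc = sorted(records, key=lambda r: r[2], reverse=True)

print([r[2] for r in numeric_desc])
print([r[2] for r in text_desc])
```

Note that the comparator here only looks at field 3, so the relative order of records that tie on that field (for example the two records with value 2) is not something to rely on.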

In [10]:
# same as above, but now we also specify the comparator & sort options
!hdfs dfs -rm -r partition-sort-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.partition.keypartitioner.options="-k2,2" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k3,3nr" \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output partition-sort-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Deleted partition-sort-output
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob2474225473955301954.jar tmpDir=null
17/09/22 23:50:24 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:50:24 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:50:25 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/22 23:50:25 INFO mapreduce.JobSubmitter: number of splits:2
17/09/22 23:50:25 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506091508167_0063
17/09/22 23:50:25 INFO impl.YarnClientImpl: Submitted application application_1506091508167_0063
17/09/22 23:50:25 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506091508167_0063/
17/09/22 23:50:25 INFO mapreduce.Job: Running job: job_1506091508167_0063
17/09/22 23:50:34 INFO mapreduce.Job: Job job_1506091508167_0063 running in uber mode : false
17/09/22 23:50:34 INFO 

In [11]:
# first partition
!hdfs dfs -cat partition-sort-output/part-00000

B	D	10	
A	D	5	
A	D	5	
B	D	3	
B	D	2	
A	D	2	


In [12]:
# second partition
!hdfs dfs -cat partition-sort-output/part-00001

B	C	10	
B	C	5	
A	C	2	
A	C	1	
A	C	1	
B	C	1	


Doesn't it make you happy when things are in order?

<a id=combo1></a>
## Default Combiners
[Return to Contents](#TOC) | [Skip to part2 of this answer](#combo2)

__Q:__ _Why doesn't my combiner recognize that the mapper output is sorted?_   
__A (take 1):__ Chances are your mapper output __isn't actually sorted in the way you think__. Take a look at the following example where the combiner appears not to work properly, then keep reading for an explanation of why this is actually normal behavior & what to do if you want a different result.

In [1]:
%%writefile combiner.py
#!/opt/anaconda/bin/python
"""
A small combiner which sums the 3rd field.
Input Format: group \t part \t integer
NOTE: input must be pre-sorted by first 2 fields.
"""
import sys

gp, part, num = '', '', None
for line in sys.stdin:
    new_gp, new_part, new_num = line.split()
    # EITHER update current record
    if new_gp == gp and new_part == part:
        num += int(new_num)
    # OR emit & update
    else:
        if num is not None:  # skips the initialized dummy record (a sum of 0 is still emitted)
            print("%s\t%s\t%s" % (gp, part, num))
        gp, part, num = new_gp, new_part, int(new_num)
# emit the last record (unless there was no input at all)
if num is not None:
    print("%s\t%s\t%s" % (gp, part, num))

Overwriting combiner.py


In [6]:
# hadoop job w/ 2 reducers and a combiner, but no partitioner class
!hdfs dfs -rm -r combiner-output
!hadoop jar {JAR_FILE} \
    -files combiner.py \
    -mapper /bin/cat \
    -combiner combiner.py \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output combiner-output \
    -numReduceTasks 2

rm: `combiner-output': No such file or directory
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob3074670169631509248.jar tmpDir=null
17/09/25 19:49:24 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/25 19:49:24 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/25 19:49:26 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/25 19:49:26 INFO mapreduce.JobSubmitter: number of splits:2
17/09/25 19:49:27 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506368833436_0001
17/09/25 19:49:28 INFO impl.YarnClientImpl: Submitted application application_1506368833436_0001
17/09/25 19:49:28 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506368833436_0001/
17/09/25 19:49:28 INFO mapreduce.Job: Running job: job_1506368833436_0001
17/09/25 19:49:41 INFO mapreduce.Job: Job job_1506368833436_0001 running in uber mode : false
17/0

In [7]:
# first partition 
!hdfs dfs -cat combiner-output/part-00000

A	D	12
A	C	4


In [8]:
# second partition 
!hdfs dfs -cat combiner-output/part-00001

B	C	5
B	D	15
B	C	11


__EEK!__ It looks like the combining worked in the first partition, but in the second there's an extra 'C' record at the end. It makes sense that the combiner won't work if its input is out of order, but I thought my input file was in order? In fact, let's check:

In [17]:
# input file
!cat test.txt

A	C	1
A	C	2
A	C	1
A	D	5
A	D	5
A	D	2
B	C	5
B	C	1
B	C	10
B	D	2
B	D	10
B	D	3

<a id=combo2></a>
So back to the question at hand: 

__Q:__ _Why doesn't my combiner recognize that the mapper output is sorted?_  

__A (take 2):__ Hadoop doesn't preserve the order of the records unless you tell it to. So while the input file is sorted and the mapper is just `/bin/cat`, once the records leave the mapper & head over to the combiner Hadoop is no longer paying attention to which line was first or second or third. Instead Hadoop has reverted to its default behavior: _records with the same key will be 'shuffled' around together_. 

As we learned from observing Hadoop's [default partitioning](#default) behavior, if we don't specify `stream.num.map.output.key.fields` then the key is simply the first field. In other words, in our example, the only order Hadoop pays attention to is the distinction between 'A' and 'B'. In fact, we can see that the record order gets scrambled even without the use of the combiner. Here's another look at the output from the [default partitioning](#default) example at the start of the notebook (recall that both the mapper & reducer were `/bin/cat` and that the only extra option we specified was 2 reducers).

In [9]:
# first partition, the 'D's and 'C' seem to be in order
!hdfs dfs -cat no-partitioner-output/part-00000

A	D	2
A	D	5
A	D	5
A	C	1
A	C	2
A	C	1


In [10]:
# second partition -- WHOA, there is a 'C' out of place!
!hdfs dfs -cat no-partitioner-output/part-00001

B	C	5
B	D	3
B	D	10
B	D	2
B	C	10
B	C	1


In other words, the first partition got lucky, but in general these records are not guaranteed to be sorted on anything except the first field.
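
To see why this ordering trips up the combiner, here is a small Python sketch of the same "sum adjacent records with the same first two fields" logic, run over the records in the order shown in `part-00001` above. It uses `itertools.groupby`, which (like the combiner script) only merges records that sit next to each other. The function name `combine_adjacent` is made up for this illustration.

```python
from itertools import groupby


def combine_adjacent(records):
    """Sum field 3 over runs of CONSECUTIVE records that share fields 1 and 2."""
    combined = []
    for (group, part), run in groupby(records, key=lambda r: (r[0], r[1])):
        combined.append((group, part, sum(int(r[2]) for r in run)))
    return combined


# records in the order they appear in no-partitioner-output/part-00001
part_00001 = [
    ("B", "C", "5"), ("B", "D", "3"), ("B", "D", "10"),
    ("B", "D", "2"), ("B", "C", "10"), ("B", "C", "1"),
]
print(combine_adjacent(part_00001))
# [('B', 'C', 5), ('B', 'D', 15), ('B', 'C', 11)]
```

The 'C' records are split into two runs by the 'D' records in between, so they are summed separately, just like the extra 'C' line in the combiner output.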

<a id=combo3></a>
## Combining with Composite Keys
[Return to Contents](#TOC) | [Skip to My Combiner still doesn't work](#combo-default)  

__Q:__ _How can I make sure my combiner DOES receive sorted input?_   
__A:__ As with our [partition sort](#psort) example above, we'll need to make sure that Hadoop's shuffle phase pays attention to both the first and second fields:
>__`-D stream.num.map.output.key.fields=2`__    

and though we only want it to _partition on the first field_...  
>__`-D mapreduce.partition.keypartitioner.options="-k1,1"`__    
>__`-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner`__

... we want the keys _sorted on both primary and secondary key_:
>__`-D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator`__    
>__`-D mapreduce.partition.keycomparator.options="-k1,1 -k2,2"`__
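
The key idea is that the partition field and the sort fields are chosen independently. A minimal local simulation in Python, assuming a stand-in hash for the partitioner (not Hadoop's actual one) and a hypothetical helper name `simulate_shuffle`, might look like this:

```python
def simulate_shuffle(records, num_reducers):
    """Partition on field 1 only, then sort each partition on fields 1 and 2."""
    partitions = [[] for _ in range(num_reducers)]
    for rec in records:
        # "-k1,1" for the partitioner: only field 1 picks the reducer (stand-in hash)
        h = 0
        for b in bytearray(rec[0].encode("utf-8")):
            h = (31 * h + b) & 0x7FFFFFFF
        partitions[h % num_reducers].append(rec)
    # "-k1,1 -k2,2" for the comparator: order by field 1, then field 2
    return [sorted(p, key=lambda r: (r[0], r[1])) for p in partitions]


records = [
    ("B", "D", "2"), ("A", "C", "1"), ("B", "C", "5"), ("A", "D", "5"),
    ("B", "C", "1"), ("A", "C", "2"), ("B", "D", "10"), ("A", "D", "2"),
]
for i, partition in enumerate(simulate_shuffle(records, 2)):
    print(i, partition)
```

Each partition holds a single value of field 1, and within it all records with the same (field 1, field 2) pair are adjacent, which is the ordering the combiner script expects.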

In [11]:
!hdfs dfs -rm -r combiner-sort-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options="-k1,1" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,1 -k2,2" \
    -files combiner.py \
    -mapper /bin/cat \
    -combiner combiner.py \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output combiner-sort-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Deleted combiner-sort-output
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob9096172325350348427.jar tmpDir=null
17/09/25 19:50:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/25 19:50:51 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/25 19:50:53 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/25 19:50:53 INFO mapreduce.JobSubmitter: number of splits:2
17/09/25 19:50:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506368833436_0002
17/09/25 19:50:54 INFO impl.YarnClientImpl: Submitted application application_1506368833436_0002
17/09/25 19:50:54 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506368833436_0002/
17/09/25 19:50:54 INFO mapreduce.Job: Running job: job_1506368833436_0002
17/09/25 19:51:03 INFO mapreduce.Job: Job job_1506368833436_0002 running in uber mode : false
17/09/25 19:51:03 INFO m

In [12]:
# first partition
!hdfs dfs -cat combiner-sort-output/part-00000

B	C	11
B	C	5
B	D	15


In [13]:
# second partition
!hdfs dfs -cat combiner-sort-output/part-00001

A	C	4
A	D	12


__EEEK!__ The keys now seem to be arriving in order, so why doesn't the combining work?   
(_cue suspenseful music_) ... keep reading to find out!

<a id=combo-default></a>
## Default Combiner Behavior
[Return to Contents](#TOC) | [Skip to Why does my Combiner mess up secondary sort](#funky-stuff)  

__Q:__ _In the example above, why didn't the combining work fully?_   
__A:__ Again (sigh!) this is expected behavior. Hadoop doesn't guarantee that the combiner will be run on every record... it makes that decision at runtime. That's why you need to be sure that the mapper output format matches the combiner output format: your reducer needs to work regardless of whether any combining happens. This is also why, if you want to see fully combined output, you need to use a real reducer (instead of the `/bin/cat` we've been using). In the final call below, we'll just use the same combiner script as a reducer and you'll see some nice output finally!
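
Here is a small Python sketch of that requirement. It splits the 'B' records into two hypothetical map-side chunks (the split is invented for illustration; Hadoop decides the real one) and applies the summing logic as a combiner to none, all, or only some of them. Because the combiner's output has the same format as its input, the reducer gives the same totals in every case. The names `sum_by_key` and `reduce_side` are made up for this sketch.

```python
from itertools import groupby


def sum_by_key(records):
    """Shared combiner/reducer logic: sum field 3 over adjacent equal (group, part) keys."""
    return [(g, p, sum(int(r[2]) for r in run))
            for (g, p), run in groupby(records, key=lambda r: (r[0], r[1]))]


def reduce_side(chunks, use_combiner_flags):
    """Optionally combine each chunk, merge and sort on the composite key, then reduce."""
    merged = []
    for chunk, use_combiner in zip(chunks, use_combiner_flags):
        merged.extend(sum_by_key(chunk) if use_combiner else chunk)
    merged.sort(key=lambda r: (r[0], r[1]))  # the shuffle's sort on fields 1 and 2
    return sum_by_key(merged)


# hypothetical split of the 'B' records across two map-side chunks
chunk_1 = [("B", "C", 5)]
chunk_2 = [("B", "C", 1), ("B", "C", 10), ("B", "D", 2), ("B", "D", 10), ("B", "D", 3)]

results = [reduce_side([chunk_1, chunk_2], flags)
           for flags in [(False, False), (True, True), (False, True)]]
print(results[0])
# [('B', 'C', 16), ('B', 'D', 15)]
```

With a pass-through reducer like `/bin/cat`, the partially combined records would simply be printed as they arrive, which is why the previous output still contained two 'B C' lines.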

In [14]:
!hdfs dfs -rm -r combiner-final
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=2 \
    -D mapreduce.partition.keypartitioner.options="-k1,1" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k1,1 -k2,2" \
    -files combiner.py \
    -mapper /bin/cat \
    -combiner combiner.py \
    -reducer combiner.py \
    -input Data/test.txt \
    -output combiner-final \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Deleted combiner-final
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob3694188382917670316.jar tmpDir=null
17/09/25 19:51:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/25 19:51:56 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/25 19:51:58 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/25 19:51:58 INFO mapreduce.JobSubmitter: number of splits:2
17/09/25 19:51:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506368833436_0003
17/09/25 19:51:59 INFO impl.YarnClientImpl: Submitted application application_1506368833436_0003
17/09/25 19:51:59 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506368833436_0003/
17/09/25 19:51:59 INFO mapreduce.Job: Running job: job_1506368833436_0003
17/09/25 19:52:08 INFO mapreduce.Job: Job job_1506368833436_0003 running in uber mode : false
17/09/25 19:52:08 INFO mapredu

In [15]:
# first partition
!hdfs dfs -cat combiner-final/part-00000

B	C	16
B	D	15


In [16]:
# second partition
!hdfs dfs -cat combiner-final/part-00001

A	C	4
A	D	12


__Wooo hooo!__

<a id=funky-stuff></a>
## Funky Stuff
[Return to Contents](#TOC) | [Skip to More Funky Stuff](#more-funky-stuff)  


__Q:__ _Why did the combiner mess up my secondary sort?_  
__A:__ :) I think you know where this is going by now. That's right. It didn't.

As you might expect, there are a ton of funky things that happen when you use different combinations of the parameters we've explored above. I think you should be able to explain them if you think through the following questions carefully:
* What fields have I told Hadoop to pay attention to?
 * (_and does Hadoop know how to recognize these fields? We haven't explored that here... it's the field separator option, e.g. `stream.map.output.field.separator`, and the default is tab_)
* What fields did I tell Hadoop to partition on?
* What fields did I tell Hadoop to sort by?
* Have my instructions contradicted each other? If so, what default behavior will Hadoop revert to?
* Will the way Hadoop is going to sort produce records ordered in the way my reducer & combiner expect?
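
The last question in this list can be checked mechanically on any output you capture. As a rough sketch (the helper name `keys_are_contiguous` is made up, and it assumes the combiner groups on the first two fields as `combiner.py` does), the following Python function tests whether every (group, part) pair appears as one unbroken run:

```python
def keys_are_contiguous(records, num_key_fields=2):
    """True if each distinct key (first `num_key_fields` fields) forms a single run."""
    seen = set()
    previous = None
    for rec in records:
        key = tuple(rec[:num_key_fields])
        if key != previous:
            if key in seen:  # this key already had a run that ended earlier
                return False
            seen.add(key)
            previous = key
    return True


# record orders taken from the default-partitioning output earlier in the notebook
part_00000 = [("A", "D", 2), ("A", "D", 5), ("A", "D", 5),
              ("A", "C", 1), ("A", "C", 2), ("A", "C", 1)]
part_00001 = [("B", "C", 5), ("B", "D", 3), ("B", "D", 10),
              ("B", "D", 2), ("B", "C", 10), ("B", "C", 1)]

print(keys_are_contiguous(part_00000))
print(keys_are_contiguous(part_00001))
```

If this returns `False` for the order your settings produce, a combiner or reducer that relies on adjacent keys will emit more than one record for some key.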

Using these questions, take a look at the funkiness below ... can you figure out why the answer looks like it does? [Note -- test your understanding by trying to explain the observed output BEFORE you try to fix it.]

In [26]:
!hdfs dfs -rm -r funky1-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.partition.keypartitioner.options="-k1,1" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k3,3nr -k2,2" \
    -files combiner.py \
    -mapper /bin/cat \
    -combiner combiner.py \
    -reducer combiner.py \
    -input Data/test.txt \
    -output funky1-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Deleted funky1-output
packageJobJar: [] [/usr/lib/hadoop-mapreduce/hadoop-streaming-2.6.0-cdh5.12.1.jar] /tmp/streamjob3827876995430545254.jar tmpDir=null
17/09/22 23:57:39 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:57:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/22 23:57:41 INFO mapred.FileInputFormat: Total input paths to process : 1
17/09/22 23:57:41 INFO mapreduce.JobSubmitter: number of splits:2
17/09/22 23:57:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1506091508167_0067
17/09/22 23:57:42 INFO impl.YarnClientImpl: Submitted application application_1506091508167_0067
17/09/22 23:57:42 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1506091508167_0067/
17/09/22 23:57:42 INFO mapreduce.Job: Running job: job_1506091508167_0067
17/09/22 23:57:52 INFO mapreduce.Job: Job job_1506091508167_0067 running in uber mode : false
17/09/22 23:57:52 INFO mapreduc

In [27]:
# first partition
!hdfs dfs -cat funky1-output/part-00000

B	C	15
B	D	15
B	C	1


In [28]:
# second partition
!hdfs dfs -cat funky1-output/part-00001

A	D	10
A	C	2
A	D	2
A	C	1


<a id=more-funky-stuff></a>
## More Funky Stuff

[Return to Contents](#TOC) 

Was that fun? Here are some more funky outputs... feel free to run them & test your understanding.

In [None]:
!hdfs dfs -rm -r funky2-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=3 \
    -mapper /bin/cat \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output funky2-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

In [None]:
# first partition
!hdfs dfs -cat funky2-output/part-00000

In [None]:
# second partition
!hdfs dfs -cat funky2-output/part-00001

...

In [None]:
!hdfs dfs -rm -r funky3-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=3 \
    -files combiner.py \
    -mapper /bin/cat \
    -combiner combiner.py \
    -reducer /bin/cat \
    -input Data/test.txt \
    -output funky3-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

In [None]:
# first partition
!hdfs dfs -cat funky3-output/part-00000

In [None]:
# second partition
!hdfs dfs -cat funky3-output/part-00001

...

In [None]:
!hdfs dfs -rm -r funky4-output
!hadoop jar {JAR_FILE} \
    -D stream.num.map.output.key.fields=3 \
    -D mapreduce.partition.keypartitioner.options="-k2,2" \
    -D mapreduce.job.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator \
    -D mapreduce.partition.keycomparator.options="-k2,2 -k1,1" \
    -files combiner.py \
    -mapper /bin/cat \
    -combiner combiner.py \
    -reducer combiner.py \
    -input Data/test.txt \
    -output funky4-output \
    -numReduceTasks 2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

In [None]:
# first partition
!hdfs dfs -cat funky4-output/part-00000

In [None]:
# second partition
!hdfs dfs -cat funky4-output/part-00001

---