Name: Patrick Ng  
Email: patng@ischool.berkeley.edu  
Class: W261-2  
Week: 01  
Date of submission: Jan 26, 2016

## HW2.1. Sort in Hadoop MapReduce
Given as input: records of the form < integer, “NA” >, where integer is any integer and “NA” is a placeholder for the empty string.
Output: sorted key-value pairs of the form < integer, “NA” > in decreasing order. What happens if you have multiple reducers? Do you need additional steps? Explain.

Write code to generate N random records of the form < integer, “NA” >. Let N = 10,000.
Write the Python Hadoop streaming MapReduce job to perform this sort. Display the 10 biggest numbers and the 10 smallest numbers.
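The question about multiple reducers can be explored with a small local simulation. With several reducers, each reducer sorts only the keys it receives, so the part files are sorted individually but do not necessarily concatenate into one globally sorted list. The sketch below is not part of the submitted job: `hash_partition`, `range_partition` and `concat_sorted` are hypothetical helpers that compare a hash-style assignment of keys to reducers with a range-based assignment, under the assumption that the key range [-1000000, 1000000] is known in advance.

```python
import random

def hash_partition(keys, num_reducers):
    """Assign each key to a reducer by hash value (hypothetical helper)."""
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        parts[hash(k) % num_reducers].append(k)
    return parts

def range_partition(keys, num_reducers, lo, hi):
    """Assign each key to a reducer by value range; reducer 0 gets the largest keys."""
    width = (hi - lo + 1) / float(num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for k in keys:
        bucket = int((hi - k) / width)
        parts[min(bucket, num_reducers - 1)].append(k)
    return parts

def concat_sorted(parts):
    """Sort each partition in decreasing order, then concatenate them in reducer order."""
    out = []
    for p in parts:
        out.extend(sorted(p, reverse=True))
    return out

random.seed(0)
keys = [random.randint(-1000000, 1000000) for _ in range(1000)]
expected = sorted(keys, reverse=True)

# Compare the concatenated "part files" with a true global sort.
print(concat_sorted(hash_partition(keys, 3)) == expected)
print(concat_sorted(range_partition(keys, 3, -1000000, 1000000)) == expected)
```

In this simulation only the range-based assignment reproduces the global order. With a hash-style assignment, an additional step would be needed, such as merging the sorted part files or running the job with a single reducer.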

### Generate random numbers

In [73]:
%%writefile genrand.py
#!/usr/bin/python
import random
import sys

nums = 10000
if len(sys.argv) > 1:
    nums = int(sys.argv[1])

# Fixed seed so that every run generates the same data set
random.seed(0)
for i in range(nums):
    print '< %d, "NA" >' % random.randint(-1000000, 1000000)

Overwriting genrand.py


In [15]:
!chmod +x genrand.py

### Mapper

In [71]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re

# The regex which captures the integer from a line in the format < integer, "NA" >
regex = re.compile(ur'\<\s*(-?\d+)\s*,\s*\"NA\"\s*\>')

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    
    # Get the integer from the line
    result = regex.findall(line)
    if len(result) == 0:
        # Cannot find any integer. Could be a corrupted input line.  Skip it.
        continue
    
    # Emit the integer as the key. No tab or value is written, so Hadoop
    # streaming treats the whole line as the key and the value as empty.
    print result[0]

Overwriting mapper.py
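To make the mapper's parsing behaviour concrete, here is a minimal sketch that applies the same pattern to a few sample lines. The helper `extract_key` and the sample strings are illustrative only; the pattern is written as a plain raw string (without the `ur` prefix) so that the sketch also runs on Python 3.

```python
import re

# Same pattern as mapper.py, written as a plain raw string.
pattern = re.compile(r'<\s*(-?\d+)\s*,\s*"NA"\s*>')

def extract_key(line):
    """Return the captured integer as a string, or None if the line does not match."""
    found = pattern.findall(line.strip())
    return found[0] if found else None

# Two well-formed records, one corrupted line, one non-integer value.
samples = ['< 688844, "NA" >', '<-158857,"NA">', 'corrupted line', '< 12.5, "NA" >']
for s in samples:
    print('%r -> %r' % (s, extract_key(s)))
```

Lines that do not match the record format, including non-integer values, yield `None`, which corresponds to the `continue` branch in the mapper.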


### Reducer

In [20]:
%%writefile reducer.py
#!/usr/bin/python
import sys

# Input comes from STDIN, already sorted by the shuffle phase.
for line in sys.stdin:
    # strip() drops the trailing tab and newline before re-formatting the record
    print '<%s, "NA">' % line.strip()

Overwriting reducer.py


### Simple Test Code

In [74]:
!python genrand.py 20 | python mapper.py | python reducer.py

<688844, "NA">
<515909, "NA">
<-158857, "NA">
<-482167, "NA">
<22549, "NA">
<-190132, "NA">
<567597, "NA">
<-393375, "NA">
<-46806, "NA">
<166764, "NA">
<816226, "NA">
<9374, "NA">
<-436325, "NA">
<511609, "NA">
<236738, "NA">
<-498988, "NA">
<819493, "NA">
<965571, "NA">
<620435, "NA">
<804332, "NA">
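The output above is unsorted because the local pipe has no stage that plays the role of Hadoop's shuffle and sort; one way to approximate it on the command line would be to place `sort -nr` between the mapper and the reducer. The same idea can be sketched in Python. The helpers `simulate_shuffle` and `reduce_lines` are hypothetical and only mimic the ordering a numeric, reverse key comparator would produce.

```python
def simulate_shuffle(mapper_output_lines):
    """Order mapper keys numerically in decreasing order, skipping blank lines."""
    keys = [line.strip() for line in mapper_output_lines if line.strip()]
    return sorted(keys, key=int, reverse=True)

def reduce_lines(sorted_keys):
    """Mirror reducer.py: wrap each key back into the record format."""
    return ['<%s, "NA">' % k for k in sorted_keys]

# First five keys from the test run above, as the mapper would emit them.
mapper_output = ['688844', '515909', '-158857', '-482167', '22549']
for record in reduce_lines(simulate_shuffle(mapper_output)):
    print(record)
```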


## Run it in Hadoop

### Start YARN and HDFS

In [32]:
!/usr/local/Cellar/hadoop/2.7.1/sbin/start-yarn.sh
!/usr/local/Cellar/hadoop/2.7.1/sbin/start-dfs.sh

starting yarn daemons
starting resourcemanager, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/yarn-patrickng-resourcemanager-Patricks-MacBook-Pro.local.out
localhost: starting nodemanager, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/yarn-patrickng-nodemanager-Patricks-MacBook-Pro.local.out
16/01/23 12:32:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-patrickng-namenode-Patricks-MacBook-Pro.local.out
localhost: starting datanode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-patrickng-datanode-Patricks-MacBook-Pro.local.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/Cellar/hadoop/2.7.1/libexec/logs/hadoop-patrickng-secondarynamenode-Patricks-MacBook-Pro.local.out
16/01/23 12:33:12 WARN uti

In [75]:
!echo "Generating random numbers, each in the range [-1000000, 1000000]."
!rm -f randomNums.txt
!./genrand.py 10000 >> randomNums.txt

Generating random numbers, each in the range [-1000000, 1000000].


### Upload randomNums.txt to HDFS

In [76]:
!hdfs dfs -rm -f randomNums.txt
!hdfs dfs -put randomNums.txt

16/01/23 13:20:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted randomNums.txt
16/01/23 13:21:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


### Hadoop streaming command

In [77]:
!hdfs dfs -rm -r sortRandomNums
!hadoop jar $HADOOP_INSTALL/share/hadoop/tools/lib/hadoop-*streaming*.jar -D mapred.output.key.comparator.class=org.apache.hadoop.mapred.lib.KeyFieldBasedComparator -D mapred.text.key.comparator.options="-nr" -mapper mapper.py -reducer reducer.py -input randomNums.txt -output sortRandomNums

16/01/23 13:21:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Deleted sortRandomNums
16/01/23 13:21:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
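The `-nr` comparator options ask for a numeric (`n`), reversed (`r`) comparison of the keys. The sketch below illustrates why the numeric part matters: keys travel through streaming as text, and comparing them as text gives a different order from comparing them as numbers. The sample keys are taken from the test output earlier in this notebook; the snippet is only an illustration in plain Python, not a model of Hadoop's comparator implementation.

```python
keys = ['9374', '22549', '-46806', '166764', '-498988']

# Reverse lexicographic order: the result of comparing the keys as plain text.
as_text = sorted(keys, reverse=True)

# Reverse numeric order: the intent of the "-nr" comparator options.
as_numbers = sorted(keys, key=int, reverse=True)

print(as_text)
print(as_numbers)
```

Compared as text, `'9374'` ranks above `'166764'` and the two negative keys come out in the wrong relative order, which is why the job requests a numeric comparison.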


### Show the results

In [78]:
!rm -f w2.1.result
!hdfs dfs -get sortRandomNums/part-00000 w2.1.result
!echo
!echo "10 biggest numbers:"
!head -n 10 w2.1.result
!echo
!echo "10 smallest numbers:"
!tail -n 10 w2.1.result

16/01/23 13:21:16 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/01/23 13:21:17 WARN hdfs.DFSClient: DFSInputStream has been closed already

10 biggest numbers:
<999806, "NA">	
<999764, "NA">	
<999727, "NA">	
<999663, "NA">	
<999371, "NA">	
<998888, "NA">	
<998841, "NA">	
<998388, "NA">	
<997707, "NA">	
<997613, "NA">	

10 smallest numbers:
<-997715, "NA">	
<-997902, "NA">	
<-997975, "NA">	
<-998040, "NA">	
<-998770, "NA">	
<-998808, "NA">	
<-999519, "NA">	
<-999672, "NA">	
<-999732, "NA">	
<-999954, "NA">	
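As a sanity check on the listing above, the ordering can be verified programmatically. The following is a minimal sketch with hypothetical helpers `parse_values` and `is_decreasing`; it is applied here to a few records copied from the output (including the trailing tab), and the same functions could be applied to the lines of `w2.1.result`.

```python
def parse_values(lines):
    """Extract the integer from records of the form '<n, "NA">', ignoring surrounding whitespace."""
    values = []
    for line in lines:
        text = line.strip()
        if text:
            # Drop the leading '<' and keep everything before the first comma.
            values.append(int(text[1:].split(',')[0]))
    return values

def is_decreasing(values):
    """True if every value is greater than or equal to the one that follows it."""
    return all(a >= b for a, b in zip(values, values[1:]))

head = ['<999806, "NA">\t', '<999764, "NA">\t', '<999727, "NA">\t']
tail = ['<-999672, "NA">\t', '<-999732, "NA">\t', '<-999954, "NA">\t']
print(is_decreasing(parse_values(head + tail)))
```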


### Stop YARN and HDFS

In [27]:
!/usr/local/Cellar/hadoop/2.7.1/sbin/stop-yarn.sh
!/usr/local/Cellar/hadoop/2.7.1/sbin/stop-dfs.sh

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
16/01/23 12:23:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
16/01/23 12:23:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
