# DATASCI W261: Machine Learning at Scale
## Assignment Week 2
Miki Seltzer (miki.seltzer@berkeley.edu)<br>
W261-2, Spring 2016<br>
Submission: 

### HW2.0:

#### What is a race condition in the context of parallel computation? Give an example.
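A race condition occurs when multiple processes or threads access shared state concurrently and the final result depends on the order in which their operations happen to execute. For example, if two processes each read a shared counter, increment it, and write it back, one of the increments is lost whenever both read the value before either writes.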

![Race condition example](race_conditions.png)
Source: https://en.wikipedia.org/wiki/Race_condition#Example
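
The lost-update scenario can be reproduced deterministically with a small simulation. This is only an illustrative sketch: the function `run_schedule` and the schedule strings are hypothetical names made up for this example, and the "workers" are simulated step by step rather than run as real processes.

```python
def run_schedule(schedule):
    """Replay a schedule such as "ABABAB"; each letter runs that worker's next step.

    Every worker performs three steps on a shared counter: read, increment, write.
    """
    shared = 0          # the shared value that both workers update
    local = {}          # each worker's private copy of the value
    next_step = {}      # which of the three steps each worker runs next

    for worker in schedule:
        step = next_step.get(worker, 0)
        if step == 0:
            local[worker] = shared        # read the shared value
        elif step == 1:
            local[worker] += 1            # increment the private copy
        else:
            shared = local[worker]        # write the private copy back
        next_step[worker] = step + 1
    return shared


serial_result = run_schedule("AAABBB")       # A finishes before B starts
interleaved_result = run_schedule("ABABAB")  # both read before either writes
```

With the serial schedule both increments are kept; with the interleaved schedule both workers read the same starting value, so one increment is overwritten.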

#### What is MapReduce?
MapReduce can refer to multiple concepts:
- **Programming model:** A computation is split into a "map" phase and a "reduce" phase. In the map phase, a function is applied to each record in a data set to emit key-value pairs; in the reduce phase, the values emitted for each key are aggregated.
- **Execution framework:** The runtime that schedules and coordinates programs written in this model across a cluster.
- **Software implementation:** MapReduce is the name of Google's proprietary implementation of this programming model, while Apache Hadoop is an open-source implementation.
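
The programming model in the first bullet can be sketched in a few lines of single-machine Python. This is a toy illustration, not how any real framework is implemented: `map_reduce`, `length_mapper`, and `sum_reducer` are hypothetical names, and the sort-then-group step stands in for the shuffle that an execution framework would perform across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(records, mapper, reducer):
    """Run a mapper and a reducer over records on a single machine."""
    # Map phase: apply the mapper to every record and collect (key, value) pairs
    pairs = []
    for record in records:
        pairs.extend(mapper(record))

    # Shuffle/sort: order the pairs by key so that equal keys are adjacent
    pairs.sort(key=itemgetter(0))

    # Reduce phase: aggregate the values that share a key
    results = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        results.append((key, reducer(key, [value for _, value in group])))
    return results


def length_mapper(word):
    # Emit the word's length as the key and a count of 1 as the value
    return [(len(word), 1)]


def sum_reducer(key, values):
    # Add up the counts emitted for one key
    return sum(values)


length_counts = map_reduce(
    ["map", "reduce", "sort", "key", "shuffle"], length_mapper, sum_reducer
)
```

Here `length_counts` holds how many of the input words have each length, as a list of `(length, count)` pairs ordered by length.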

#### How does it differ from Hadoop?
Hadoop is an open-source framework that includes an implementation of Google's MapReduce model. It consists of two main parts: distributed storage of data and distributed processing of data. HDFS is the storage part, and Hadoop MapReduce is the processing part.

#### Which programming paradigm is Hadoop based on? Explain and give a simple example in code and show the code running.
Hadoop is based on the MapReduce paradigm. The classic example of the MapReduce programming paradigm is word count. In the map phase of word count, each word in a document is assigned a count of 1. In the reduce phase, the counts for each unique word are summed to yield the final count of each word.

In [20]:
def hw2_0():
    doc = "Hello this is a test to test word count test should have a count of three".lower()
    key_vals = []
    
    print "MAP PHASE"
    for word in doc.split():
        print [word, 1]
        key_vals.append([word, 1])
    
    print "\nREDUCE PHASE"
    key_vals = sorted(key_vals)
    
    current_word = None
    current_count = 0
    
    for pair in key_vals:
        if current_word == pair[0]:
            print [current_word, current_count], "(intermediate step)"
            current_count += pair[1]
        else:
            if current_word:
                print [current_word, current_count], "FINAL SUM"
            current_word = pair[0]
            current_count = pair[1]

    print [current_word, current_count], "FINAL SUM"
    
hw2_0()

MAP PHASE
['hello', 1]
['this', 1]
['is', 1]
['a', 1]
['test', 1]
['to', 1]
['test', 1]
['word', 1]
['count', 1]
['test', 1]
['should', 1]
['have', 1]
['a', 1]
['count', 1]
['of', 1]
['three', 1]

REDUCE PHASE
['a', 1] (intermediate step)
['a', 2] FINAL SUM
['count', 1] (intermediate step)
['count', 2] FINAL SUM
['have', 1] FINAL SUM
['hello', 1] FINAL SUM
['is', 1] FINAL SUM
['of', 1] FINAL SUM
['should', 1] FINAL SUM
['test', 1] (intermediate step)
['test', 2] (intermediate step)
['test', 3] FINAL SUM
['this', 1] FINAL SUM
['three', 1] FINAL SUM
['to', 1] FINAL SUM
['word', 1] FINAL SUM


### HW2.1: Sort in Hadoop MapReduce
**Given as input: Records of the form `<integer, “NA”>`, where integer is any integer, and “NA” is just the empty string.**<br>
**Output: Sorted key value pairs of the form `<integer, “NA”>`; what happens if you have multiple reducers? Do you need additional steps? Explain.**

If there are multiple reducers, a straightforward MapReduce job yields output that is sorted within each reducer but not across reducers. To get a total sort, an extra partitioning step must route keys to reducers by range, so that every key sent to reducer *i* is smaller than every key sent to reducer *i*+1. For example, suppose our keys range from 0 to 300. With 3 reducers, we could send all keys in the range [0, 100) to reducer 1, [100, 200) to reducer 2, and [200, 300] to reducer 3. Each reducer's output is then sorted, and concatenating the output files in reducer order gives a complete sort. The range boundaries should be chosen so that each reducer receives roughly the same number of records, which requires knowing or estimating the distribution of the keys (for example, by sampling them).
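
The range-partitioning idea can be checked with a single-machine simulation. This is a sketch under stated assumptions, not a Hadoop job: `choose_boundaries`, `reducer_for`, and `total_sort` are hypothetical helper names, each "reducer" is just a Python list, and the keys are randomly generated for illustration.

```python
import bisect
import random


def choose_boundaries(sample, num_reducers):
    """Pick num_reducers - 1 split points at evenly spaced ranks of a key sample."""
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // num_reducers]
            for i in range(1, num_reducers)]


def reducer_for(key, boundaries):
    """Return the index of the reducer whose key range contains key."""
    return bisect.bisect_right(boundaries, key)


def total_sort(keys, boundaries):
    """Partition keys by range, sort each partition, and concatenate in order."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        buckets[reducer_for(key, boundaries)].append(key)
    output = []
    for bucket in buckets:
        output.extend(sorted(bucket))   # each reducer sorts only its own keys
    return output


rng = random.Random(0)                  # fixed seed so the run is repeatable
keys = [rng.randint(0, 300) for _ in range(1000)]

# Fixed boundaries: [0, 100) -> reducer 0, [100, 200) -> 1, [200, 300] -> 2
fixed = total_sort(keys, [100, 200])

# Boundaries estimated from a sample, to keep the reducers roughly balanced
sampled_boundaries = choose_boundaries(rng.sample(keys, 100), 3)
balanced = total_sort(keys, sampled_boundaries)
```

In both cases the concatenated output equals `sorted(keys)`. Choosing boundaries from a sample only changes how evenly the records are spread across the reducers, not the correctness of the sort.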

#### Write code to generate N random records of the form `<integer, “NA”>`. Let N = 10,000.

We are going to need the Hadoop Streaming jar file, so let's download it here so that we know which one to use.

In [22]:
!wget http://central.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.7.1/hadoop-streaming-2.7.1.jar

--2016-01-21 18:37:45--  http://central.maven.org/maven2/org/apache/hadoop/hadoop-streaming/2.7.1/hadoop-streaming-2.7.1.jar
Resolving central.maven.org... 23.235.47.209
Connecting to central.maven.org|23.235.47.209|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 105736 (103K) [application/java-archive]
Saving to: “hadoop-streaming-2.7.1.jar”


2016-01-21 18:38:00 (1.06 MB/s) - “hadoop-streaming-2.7.1.jar” saved [105736/105736]



In [1]:
import random

with open("random.txt", "w") as myfile:
    for i in range(10000):
        myfile.write("{:d},{:s}\n".format(random.randint(0, 100000), "NA"))

#### Write the Python Hadoop streaming map-reduce job to perform this sort.

In [2]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW2.1

import sys

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    # In this case, we do not need to map the input to anything
    print line.strip()

Overwriting mapper.py


In [3]:
%%writefile reducer.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW2.1

import sys

for line in sys.stdin:
    # In this case, we do not need to reduce anything
    print line.strip()

Overwriting reducer.py


In [4]:
!chmod +x mapper.py
!chmod +x reducer.py

In [None]:
# How do I specify the partitioner?
# Since MapReduce only sorts within each partition (not across partitions),
# I need a way to make sure that the reducers are given chunks
# of data that will result in a total sort
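
One possible direction for the open question above, offered as an assumption rather than a tested solution: have the mapper prepend a range-partition index to each record, then configure the streaming job to partition on that first field (Hadoop ships a `KeyFieldBasedPartitioner` class for field-based partitioning; the job options needed to use it are not shown here). Hadoop Streaming compares keys as text by default, so this sketch also zero-pads the integer so that text order matches numeric order. The function name `tag_with_partition`, the boundaries `[33333, 66667]`, and the padding width of 6 are hypothetical choices based on the 0 to 100000 range used when generating `random.txt`.

```python
import bisect


def tag_with_partition(lines, boundaries, width=6):
    """Yield 'partition<TAB>zero-padded integer<TAB>value' for 'integer,value' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue                     # ignore blank lines
        number, value = line.split(",", 1)
        # Reducer index for this key (ASSUMPTION: boundaries are sorted)
        partition = bisect.bisect_right(boundaries, int(number))
        # Zero-pad so that text ordering agrees with numeric ordering
        padded = str(int(number)).zfill(width)
        yield "{0}\t{1}\t{2}".format(partition, padded, value)


# Hypothetical boundaries for 3 reducers over keys in [0, 100000]
example = list(tag_with_partition(["5,NA", "70000,NA", "33333,NA"],
                                  [33333, 66667]))
```

In an actual mapper script, the same generator could be fed `sys.stdin` and each yielded line written to STDOUT. The reducer would then drop the partition field and strip the padding before emitting the final records.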

### HW2.2: Using the Enron data from HW1 and Hadoop MapReduce streaming, write a mapper/reducer pair that will determine the number of occurrences of a single, user-specified word.

Examine the word “assistance” and report your results. To do so, make sure that mapper.py counts all occurrences of a single word, and reducer.py collates the counts of the single word.

In [16]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW2.2

import sys

# Get the user-specified word
keyword = sys.argv[1]

# Our input comes from STDIN (standard input)
for line in sys.stdin:
    
    # Strip whitespace from the line, then split it into words
    
    # TODO: split the line into fields first, then look only at the body and subject
    words = line.strip().split()
    
    # Loop through the words
    # If a word matches the keyword, emit it to STDOUT
    # key = word
    # value = 1
    for word in words:
        if word == keyword:
            print "%s\t%s" % (word, 1)

Overwriting mapper.py
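
The mapper above scans the whole line, although its comment notes that the line should be split into fields so that only the subject and body are examined. A minimal sketch of that refinement follows. It rests on an assumption that is not confirmed anywhere in this notebook: that each record is a single tab-separated line whose third and fourth fields are the subject and the body. The function name `keyword_pairs` is hypothetical.

```python
def keyword_pairs(line, keyword):
    """Return a (word, 1) pair for each occurrence of keyword in subject and body."""
    fields = line.rstrip("\n").split("\t")
    # ASSUMPTION: fields are [id, label, subject, body]; missing fields are tolerated
    text = " ".join(fields[2:4])
    # Same whitespace tokenization and exact matching as the mapper above
    return [(word, 1) for word in text.split() if word == keyword]


# Hypothetical record, made up for illustration
sample_line = "0001\t0\tneed assistance\tplease send assistance now\n"
pairs = keyword_pairs(sample_line, "assistance")
```

Restricting the match to these two fields means that a keyword appearing in the ID or label field would no longer be counted.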


In [17]:
%%writefile reducer.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW2.2

import sys

# Initialize some variables
# We know that the words will be sorted
# We need to keep track of state
prev_word = None
prev_count = 0
word = None

for line in sys.stdin:
    # Strip and split line from mapper
    word, count = line.strip().split('\t', 1)
    
    # If possible, turn count into an int (it's read as a string)
    try:
        count = int(count)
    except ValueError:
        # We couldn't make count into an int, so move on
        continue
        
    # Since the words will be sorted, all counts for a word will be grouped
    if prev_word == word:
        # If prev_word is word, then we haven't changed words
        # Just update prev_count
        prev_count += count
    else:
        # We've encountered a new word!
        # This might be the first word, though
        if prev_word:
            # We need to print the last word we were on
            print "%s\t%s" % (prev_word, prev_count)
        
        # Now we need to initialize our variables for the new word and count
        prev_word = word
        prev_count = count

# We have reached the end of the input, so print the last word and count
# (skip this if no valid line was ever read, so we never print "None")
if prev_word is not None:
    print "%s\t%s" % (prev_word, prev_count)

Overwriting reducer.py
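
The state-tracking loop in the reducer above can also be written with `itertools.groupby`, which groups adjacent items that share a key. This is an alternative sketch, not the code that was run for this assignment. The function names `parse_pairs` and `reduce_counts` are hypothetical, and the code relies on the same precondition as the reducer: the input lines arrive sorted by word.

```python
from itertools import groupby


def parse_pairs(rows):
    """Yield (word, count) for each well-formed 'word<TAB>count' row."""
    for row in rows:
        word, _, count = row.strip().partition("\t")
        try:
            yield word, int(count)
        except ValueError:
            continue                     # skip rows whose count is not an integer


def reduce_counts(rows):
    """Sum the counts of adjacent rows that share a word (input must be sorted)."""
    for word, group in groupby(parse_pairs(rows), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


totals = list(reduce_counts(["bar\t1", "foo\t1", "foo\t1", "foo\t1", "quux\tx"]))
```

Because `groupby` yields nothing for empty input, this version needs no special handling for the end of the stream or for the case where no line was read.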


In [18]:
!chmod +x mapper.py
!chmod +x reducer.py

In [19]:
!echo "foo foo quux labs foo bar quux" | python mapper.py 'foo' | sort -k1,1 | python reducer.py

foo	3


#### Set up Hadoop folder and load enronemail_1h.txt

In [20]:
!hdfs dfs -mkdir /user/miki/week02
!hdfs dfs -put enronemail_1h.txt /user/miki/week02

#### Run Hadoop streaming command

In [26]:
!hadoop jar hadoop-streaming-2.7.1.jar \
-mapper '/home/cloudera/Documents/W261-Fall2016/Week02/mapper.py assistance' \
-reducer /home/cloudera/Documents/W261-Fall2016/Week02/reducer.py \
-input /user/miki/week02/enronemail_1h.txt \
-output /user/miki/week02/hw2_2_output

packageJobJar: [] [/usr/jars/hadoop-streaming-2.6.0-cdh5.5.0.jar] /tmp/streamjob5852430093459052275.jar tmpDir=null
16/01/21 18:38:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/01/21 18:38:54 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
16/01/21 18:38:55 INFO mapred.FileInputFormat: Total input paths to process : 1
16/01/21 18:38:55 INFO mapreduce.JobSubmitter: number of splits:2
16/01/21 18:38:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1453405632837_0011
16/01/21 18:38:55 INFO impl.YarnClientImpl: Submitted application application_1453405632837_0011
16/01/21 18:38:55 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1453405632837_0011/
16/01/21 18:38:55 INFO mapreduce.Job: Running job: job_1453405632837_0011
16/01/21 18:39:01 INFO mapreduce.Job: Job job_1453405632837_0011 running in uber mode : false
16/01/21 18:39:01 INFO mapreduce.Job:  map 0% reduce 0%
16/01/21 18:39

#### Look at output of job

In [27]:
!hdfs dfs -cat /user/miki/week02/hw2_2_output/part-00000

assistance	6
