#DATASCI W261: Machine Learning at Scale

#Assignment: Week 2

- Juanjo Carin
- [juanjose.carin@ischool.berkeley.edu](mailto:juanjose.carin@ischool.berkeley.edu)
- W261-2
- Week 02
- Submission date: 9/15/2015

#HW2.0

1. **What is a race condition in the context of parallel computation? Give an example.**

2. **What is MapReduce?**

3. **How does it differ from Hadoop?**

4. **Which programming paradigm is Hadoop based on? Explain and give a simple example in code and show the code running.**

1. A **race condition** is a situation, in parallel computation, in which the final result depends on the order in which the parallel processes happen to execute. For example, it can occur when two threads each read the same variable, perform some computation, and write the result back: if one thread reads and writes the variable, and the other thread reads and writes it only afterwards, the result will be different from what it would be if the second thread read the variable before the first one had written it.

2. **MapReduce** is a programming model (and an associated implementation) for processing and generating large data sets with a parallel, distributed algorithm on a cluster. An implementation of this model (i.e., a MapReduce program) is composed of a `map` procedure (that performs filtering and sorting) and a `reduce` method (that performs a summary operation).

3. (Apache) **Hadoop** uses MapReduce, but is more than that: it is a software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware, whose core consists of MapReduce (for processing) and HDFS (Hadoop Distributed File System; for storage). It is also composed of other modules, apart from those two: Hadoop Common, which contains libraries and utilities needed by other Hadoop modules, and Hadoop YARN, which is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications. Hadoop can also refer to the whole ecosystem or collection of additional software packages that can be installed on top of or alongside it, such as Apache Pig, Apache Hive, Apache Spark, Apache Storm, etc.

4. As mentioned, Hadoop is based on the MapReduce (or map-and-reduce) programming paradigm, which in turn is based on functional programming and parallel computation. What Hadoop does (in its simplest form) is split the input into several chunks, process those chunks in parallel with `map` tasks, and combine the intermediate results of the mappers by means of `reduce` tasks.  **An example is given below, in HW2.1, where the first 10,000 integers, which are shuffled and in string format, are sorted.** Sorting the strings directly would not work, because they would be sorted lexicographically, i.e., according to the leading digit(s) . . .

In [382]:
!echo '798,\n' '98,\n' '2043,' | sort -k1,1

 2043,
798,
 98,


. . . so each mapper adds leading zeros to every number (in string format) in the portion of the data passed to it, and the reducer discards those leading zeros (the sorting itself is done by the Hadoop framework, so there is no need to code it!).

So, given the mapper and reducer briefly described above, the command-line equivalent of the Hadoop job would be:

```python
!echo '798,\n' '98,\n' '2043,' | python mapper.py | sort -k1,1 | python reducer.py
```

which would sort those three numbers correctly:

`98,
798,
2043`
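
To make the race condition from answer 1 more concrete, here is a minimal sketch that replays two possible schedules of two threads, each of which reads a shared value and writes back that value plus one. It is a deterministic simulation rather than real threading (so the outcome is reproducible), and the names `run_schedule`, `sequential`, and `interleaved` are made up for this illustration.

```python
def run_schedule(schedule, start=0):
    """Replay (thread, step) pairs; every thread computes shared + 1."""
    shared = start
    local = {}  # private copy held by each thread between read and write
    for thread, step in schedule:
        if step == 'read':
            local[thread] = shared
        elif step == 'write':
            shared = local[thread] + 1
    return shared

# Thread A finishes before thread B starts: both increments are kept
sequential = [('A', 'read'), ('A', 'write'), ('B', 'read'), ('B', 'write')]
# Thread B reads before thread A has written: one increment is lost
interleaved = [('A', 'read'), ('B', 'read'), ('A', 'write'), ('B', 'write')]

print(run_schedule(sequential))
print(run_schedule(interleaved))
```

The first schedule ends with the shared value at 2 and the second at 1, even though both threads ran the same code in both cases.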

#HW2.1

**Sort in Hadoop MapReduce**

**Given as input: Records of the form `<integer, “NA”>`, where `integer` is any integer, and `“NA”` is just the empty string.**

**Output: sorted key value pairs of the form `<integer, “NA”>`; what happens if you have multiple reducers? Do you need additional steps? Explain.**

**Write code to generate N  random records of the form `<integer, “NA”>`. Let N = 10,000.**

**Write the python Hadoop streaming map-reduce job to perform this sort.**

In [1]:
N = 10000
from random import sample
with open('input', 'w') as myfile:
    # Sample (without replacement) from 1 to N
    integer = sample(range(1, N+1), N) 
    for i in integer:
        # Add "," and an empty string to each integer
        myfile.write(str(i) + ',\n')
    # No explicit close() needed: the with block closes the file

# Let's check a few of the generated records, to see if they're random
with open('input', 'r') as myfile:
    text = [word.strip(",.") for line in myfile for word in line.split()]
print text[:10]

['4324', '3955', '4127', '598', '6060', '7684', '849', '6367', '2424', '4939']


In [2]:
%%writefile mapper.py
#!/usr/bin/python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # Discard the "," (keep only the integer=key)
    line = line.strip().rstrip(',')
    # Convert to string and add leading zeros
    integer = str(line).zfill(5)
    # Intermediate result (empty value)
    print '%s\t%s' % (integer, '')

Overwriting mapper.py


In [3]:
%%writefile reducer.py
#!/usr/bin/python
import sys
# input comes from STDIN
for line in sys.stdin:
    # Remove the (empty) value, keeping only the key, and convert it to int
    # (which discards the leading zeros)
    integer = int(line.strip())
    # Same format as the original input ("," and empty string)
    print '%s%s' % (integer, ',')

Overwriting reducer.py
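
Before submitting the job, the mapper and reducer logic can be checked locally. The following is only a sketch under the assumption that Python's `sorted()` behaves like Hadoop's shuffle/sort for these zero-padded keys; the function names `map_line`, `reduce_key`, and `local_sort` are hypothetical and are not part of the files written above.

```python
def map_line(line):
    """Mapper logic: drop the trailing comma and zero-pad the key to 5 digits."""
    return line.strip().rstrip(',').zfill(5)

def reduce_key(key):
    """Reducer logic: int() discards the leading zeros; then restore the comma."""
    return '%d,' % int(key)

def local_sort(lines):
    # sorted() stands in for the sort that Hadoop performs between map and reduce
    return [reduce_key(key) for key in sorted(map_line(line) for line in lines)]

records = ['798,\n', '98,\n', '2043,\n']
naive = sorted(line.strip() for line in records)  # plain string sort
result = local_sort(records)
print(naive)
print(result)
```

The plain string sort puts `2043,` first, while the padded keys come out in numeric order.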


In [4]:
# Start Hadoop
!/usr/local/hadoop/sbin/start-yarn.sh
!/usr/local/hadoop/sbin/start-dfs.sh

starting yarn daemons
resourcemanager running as process 30751. Stop it first.
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-juanjo-VB.out
15/09/15 18:43:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: namenode running as process 31165. Stop it first.
localhost: datanode running as process 31432. Stop it first.
Starting secondary namenodes [0.0.0.0]
0.0.0.0: secondarynamenode running as process 31736. Stop it first.
15/09/15 18:43:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
# Create new folder
!hdfs dfs -mkdir -p /user/hadoop/dirhw21

15/09/15 18:43:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
## Upload input file to HDFS
!hdfs dfs -put -f input /user/hadoop/dirhw21

15/09/15 18:43:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [7]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r sortOutput

15/09/15 18:44:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:44:04 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted sortOutput


In [8]:
# Hadoop streaming command
    # Forcing number of reducers to be 1
!hadoop jar hadoop-streaming*.jar -D mapred.reduce.tasks=1 -mapper mapper.py \
    -reducer reducer.py -input /user/hadoop/dirhw21/input -output sortOutput

15/09/15 18:44:08 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:44:10 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/15 18:44:10 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/15 18:44:10 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/15 18:44:10 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/15 18:44:11 INFO mapreduce.JobSubmitter: number of splits:1
15/09/15 18:44:11 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/15 18:44:12 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local2131440745_0001
15/09/15 18:44:13 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/15 18:44:13 INFO mapreduce.Job: Running job: jo

In [9]:
# Move output to local (rather than directly reading its content)
#!hdfs dfs -cat sortOutput/part-00000
# Delete it from local if a previous version exists
!rm ~/Downloads/HW2/part-00000
!hdfs dfs -copyToLocal sortOutput/part-00000 ~/Downloads/HW2/

15/09/15 18:44:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [10]:
# Read first records to check they were sorted
!head -20 ~/Downloads/HW2/part-00000

1,	
2,	
3,	
4,	
5,	
6,	
7,	
8,	
9,	
10,	
11,	
12,	
13,	
14,	
15,	
16,	
17,	
18,	
19,	
20,	


If we use multiple reducers, running the following, for example:

```python
# Hadoop streaming command
!hadoop jar hadoop-streaming*.jar -D mapred.reduce.tasks=2 -mapper mapper.py \
    -reducer reducer.py -input /user/hadoop/dirhw21/input -output sortOutput
```
the first records of output `part-00000` are:

`1,	
3,	
5,	
7,	
9,	
10,	
12,	
14,	
16,	
18,	
21,	
23,	
25,	
27,	
29,	
30,	
32,	
34,	
36,	
38,	`

I.e., there is more than one output file (as many as there are reducers), and each one is sorted, but each reducer only receives the keys that the partitioner assigns to it (by default, based on a hash of the key), so its output does not necessarily contain a consecutive subset of the records.

Say there are 4 mappers whose inputs are `<4,>`, `<11,>`, `<7,>`, `<2,>`, `<6,>`, `<10,>`, `<8,>`, `<5,>`, `<12,>`, `<3,>`, `<9,>`, `<1,>`. If the partitioner sends the keys 2, 4, 6, 7, 10, and 11 to one reducer and the remaining keys to a second reducer (regardless of which mapper emitted them), their outputs would be `<2,>`, `<4,>`, `<6,>`, `<7,>`, `<10,>`, `<11,>`, and `<1,>`, `<3,>`, `<5,>`, `<8,>`, `<9,>`, `<12,>`, respectively: each output is sorted, but neither covers a contiguous range.

**We would have to apply a 2nd MapReduce layer, this one with only one Reducer at the end.**
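
The effect of several reducers, and of the final single-reducer step, can be imitated locally. This is a rough sketch: `toy_partition` is a made-up stand-in for a hash partitioner (it is not Hadoop's actual implementation), and `heapq.merge` plays the role of the last step that combines the sorted partial outputs into one.

```python
import heapq

def toy_partition(key, num_reducers):
    """Toy hash partitioner (an assumption for illustration only)."""
    return sum(bytearray(key.encode('ascii'))) % num_reducers

def run_reducers(keys, num_reducers):
    """Send each key to one reducer; each reducer sees its keys in sorted order."""
    buckets = [[] for _ in range(num_reducers)]
    for key in keys:
        buckets[toy_partition(key, num_reducers)].append(key)
    return [sorted(bucket) for bucket in buckets]

# Zero-padded keys 20, 19, ..., 1, as a mapper would emit them
keys = [str(n).zfill(5) for n in range(20, 0, -1)]
parts = run_reducers(keys, 2)
for part in parts:
    print([int(key) for key in part])

# Final step: merge the sorted partial outputs into one sorted sequence
merged = [int(key) for key in heapq.merge(*parts)]
print(merged)
```

Each partial list is sorted but has gaps; only the merged list is a complete, globally sorted sequence.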

#HW2.2

**Using the Enron data from HW1 and Hadoop MapReduce streaming, write mapper/reducer pair that  will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.**

**To do so, make sure that**

- **mapper.py counts all occurrences of a single word, and**
- **reducer.py collates the counts of the single word.**

In [11]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
import os

# Word to find (passed to the job with -cmdenv findword=...)
env_vars = os.environ
findword = env_vars['findword']

WORD_RE = re.compile(r"[\w']+")
word_count = 0

# input comes from STDIN (standard input)
for line in sys.stdin:
    for w in WORD_RE.findall(line):
        if findword.lower() == w.lower():
            word_count += 1
print findword + '\t' + str(word_count)

Overwriting mapper.py


In [12]:
%%writefile reducer.py
#!/usr/bin/python
import sys

sum_words = 0
for line in sys.stdin:
    key_value = line.split('\t')
    # The key is the single word we're counting
    key = key_value[0]
    # And each value, its count from a mapper
    value = key_value[1]
    sum_words += int(value)
print key + '\t' + str(sum_words)

Overwriting reducer.py
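
The division of labour between the two scripts can be sketched locally as follows. The input splits are made-up sentences (not the Enron data), and `map_count` and `reduce_counts` are hypothetical helper names that only mirror the logic of `mapper.py` and `reducer.py`.

```python
import re

WORD_RE = re.compile(r"[\w']+")

def map_count(lines, findword):
    """What one mapper does: count case-insensitive matches in its split."""
    target = findword.lower()
    return sum(1 for line in lines
               for w in WORD_RE.findall(line) if w.lower() == target)

def reduce_counts(partial_counts):
    """What the reducer does: add up the per-mapper counts."""
    return sum(partial_counts)

# Two made-up input splits, one per (imaginary) mapper
split_a = ["Need assistance? Call us.", "No match here."]
split_b = ["ASSISTANCE is available", "assistance, assistance!"]

partials = [map_count(split_a, 'assistance'), map_count(split_b, 'assistance')]
total = reduce_counts(partials)
print(partials)
print(total)
```

Each mapper emits a single partial count, so the reducer only has to sum as many values as there are mappers.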


In [13]:
## Upload input file to HDFS
!hdfs dfs -put -f enronemail_1h.txt /user/hadoop/dirhw21

15/09/15 18:44:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [14]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r countWordEnron

15/09/15 18:44:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:44:51 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted countWordEnron


In [15]:
# Hadoop streaming command
!hadoop jar hadoop-streaming*.jar -mapper mapper.py -reducer reducer.py -cmdenv \
    findword=assistance -input /user/hadoop/dirhw21/enronemail_1h.txt \
    -output countWordEnron

15/09/15 18:44:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:44:57 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/15 18:44:57 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/15 18:44:57 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/15 18:44:57 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/15 18:44:57 INFO mapreduce.JobSubmitter: number of splits:1
15/09/15 18:44:58 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local612556684_0001
15/09/15 18:44:58 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/15 18:44:58 INFO mapreduce.Job: Running job: job_local612556684_0001
15/09/15 18:44:58 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/15 18:44:58 

In [16]:
!hdfs dfs -cat countWordEnron/part-00000

15/09/15 18:45:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
assistance	10


This is the same result as in HW1, because I'm including the subject of the emails, and all occurrences in a single email are summed.

#HW2.3

**Using the Enron data from HW1 and Hadoop MapReduce, write  a mapper/reducer pair that will classify the email messages by a single, user-specified word. Examine the word “assistance” and report your results. To do so, make sure that**

- **mapper.py**
- **reducer.py**

**performs a single word multinomial Naive Bayes classification.**

In [17]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
import os

# Word to find (passed to the job with -cmdenv findword=...)
env_vars = os.environ
findword = env_vars['findword']

WORD_RE = re.compile(r"[\w']+")

# input comes from STDIN (standard input)
for line in sys.stdin:
    word_count = 0 # count of word in the email
    ID = line.split("\t")[0]
    TRUTH = line.split("\t")[1]
    content = ' '.join(line.strip().split("\t")[2:])
    # We search for the word in both the subject and the content because one
    # or the other may not exist, and the way the data are stored we don't
    # know which one may be missing
    word_count = WORD_RE.findall(content).count(findword)
    print findword + '\t' + str(word_count) + '\t' + ID + '\t' + TRUTH

Overwriting mapper.py


In [18]:
%%writefile reducer.py
#!/usr/bin/python
import sys
from math import log

word_count = []
ID = []
TRUTH = []

for line in sys.stdin:
    key_values = line.split('\t')
    # The key is the single word we're counting
    word = key_values[0]
    word_count.append(int(key_values[1]))
    ID.append(key_values[2])
    TRUTH.append(int(key_values[3]))

total_spam = sum(TRUTH) # total count of spam emails
total_ham = len(TRUTH) - total_spam # total count of ham emails
# Total count of word in spam emails
total_word_spam = sum([x*y for (x,y) in zip(TRUTH,word_count)]) 
# Total count of word in ham emails
total_word_ham = sum(word_count) - total_word_spam

# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = float(total_word_ham + 1) / (total_word_ham + 1)
prob_word_spam = float(total_word_spam + 1) / (total_word_spam + 1)

# Assess classification with the training set 
CLASS = [0]*len(TRUTH)
for i in range(len(TRUTH)): # for each email
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + word_count[i]*log(prob_word_ham,10)
    prob_spam_word = log(prob_spam,10) + word_count[i]*log(prob_word_spam,10)
    # The right-hand sides of the equations are not equal to prob_category_word,
    # but to log(prob_category_word) + log(prob_word) (where prob_word is the
    # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        CLASS[i] = 1
    # Output for each email
    print ID[i] + '\t' + str(TRUTH[i]) + '\t' + str(CLASS[i])

# Training error
# Count of misclassification errors
errors = sum([x!=y for (x,y) in zip(TRUTH,CLASS)])
training_error = float(errors) / len(TRUTH)
# Additional line
print 'Training Error\t' + str(training_error)

Overwriting reducer.py
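
One consequence of the arithmetic in this reducer may be worth spelling out: with a one-word vocabulary, the smoothed likelihood is (count + 1) / (count + 1) = 1 for both classes, so its logarithm is 0 and each email is classified from the priors alone. The sketch below repeats the same computation on made-up toy data; `single_word_nb` is a hypothetical name, and natural logarithms are used instead of base 10, which does not change the comparison.

```python
from math import log

def single_word_nb(word_counts, labels):
    """Mirror the reducer's arithmetic for a one-word vocabulary."""
    n_spam = sum(labels)
    word_spam = sum(c for c, y in zip(word_counts, labels) if y == 1)
    word_ham = sum(word_counts) - word_spam
    prior_spam = float(n_spam) / len(labels)
    prior_ham = 1 - prior_spam
    # Laplace smoothing with vocabulary size 1: numerator equals denominator
    like_spam = float(word_spam + 1) / (word_spam + 1)
    like_ham = float(word_ham + 1) / (word_ham + 1)
    preds = []
    for c in word_counts:
        score_spam = log(prior_spam) + c * log(like_spam)
        score_ham = log(prior_ham) + c * log(like_ham)
        preds.append(1 if score_spam > score_ham else 0)
    return like_spam, like_ham, preds

# Made-up toy data: 5 emails, the first 2 are spam (label 1)
like_spam, like_ham, preds = single_word_nb([3, 0, 0, 1, 0], [1, 1, 0, 0, 0])
print(like_spam)
print(like_ham)
print(preds)
```

In this toy set ham is the majority class, so every email is predicted as ham regardless of its word count.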


In [19]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r singleNBEnron

15/09/15 18:46:14 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:46:16 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted singleNBEnron


In [20]:
# Hadoop streaming command
!hadoop jar hadoop-streaming*.jar -mapper mapper.py -reducer reducer.py -cmdenv \
    findword=assistance -input /user/hadoop/dirhw21/enronemail_1h.txt \
    -output singleNBEnron

15/09/15 18:46:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:46:23 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/15 18:46:23 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/15 18:46:23 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/15 18:46:24 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/15 18:46:24 INFO mapreduce.JobSubmitter: number of splits:1
15/09/15 18:46:24 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local259752051_0001
15/09/15 18:46:25 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/15 18:46:25 INFO mapreduce.Job: Running job: job_local259752051_0001
15/09/15 18:46:25 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/15 18:46:25 

In [21]:
!hdfs dfs -cat singleNBEnron/part-00000

15/09/15 18:46:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0018.2003-12-18.GP	1	0
0018.2001-07-13.SA_and_HP	1	0
0018.1999-12-14.kaminski	0	0
0017.2004-08-02.BG	1	0
0017.2004-08-01.BG	1	0
0017.2003-12-18.GP	1	0
0017.2001-04-03.williams	0	0
0017.2000-01-17.beck	0	0
0017.1999-12-14.kaminski	0	0
0016.2004-08-01.BG	1	0
0016.2003-12-19.GP	1	0
0016.2001-07-06.SA_and_HP	1	0
0016.2001-07-05.SA_and_HP	1	0
0016.2001-02-12.kitchen	0	0
0016.1999-12-15.farmer	0	0
0015.2003-12-19.GP	1	0
0015.2001-07-05.SA_and_HP	1	0
0015.2001-02-12.kitchen	0	0
0015.2000-06-09.lokay	0	0
0015.1999-12-15.farmer	0	0
0015.1999-12-14.kaminski	0	0
0014.2004-08-01.BG	1	0
0014.2003-12-19.GP	1	0
0014.2001-07-04.SA_and_HP	1	0
0014.2001-02-12.kitchen	0	0
0014.1999-12-15.farmer	0	0
0014.1999-12-14.kaminski	0	0
0013.2004-08-01.BG	1	0
0013.2001-06-30.SA_and_HP	1	0
0013.2001-04-03.williams	0	0
0013.1999-12-14.kaminski	0	0
0013.1999-12-14.farmer	

As seen above, I also included an additional line in the output with the training error, as defined in HW1. It's the same value that I got in HW1.3.

#HW2.4

**Using the Enron data from HW1 and in the Hadoop MapReduce framework, write  a mapper/reducer pair that will classify the email messages using multinomial Naive Bayes Classifier using a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results**

**To do so, make sure that**

- **mapper.py**
- **reducer.py**

**performs the multiple-word multinomial Naive Bayes classification via the chosen list.**

In [22]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
import os

# Words to find (comma-separated list passed with -cmdenv findword=...)
env_vars = os.environ
findword_list = env_vars['findword'].split(',')

WORD_RE = re.compile(r"[\w']+")

# input comes from STDIN (standard input)
for line in sys.stdin:
    word_count = 0 # count of word in the email
    ID = line.split("\t")[0]
    TRUTH = line.split("\t")[1]
    content = ' '.join(line.strip().split("\t")[2:])
    # We search for the words in both the subject and the content because one
    # or the other may not exist, and the way the data are stored we don't
    # know which one may be missing
    for findword in findword_list:
        word_count = WORD_RE.findall(content).count(findword)
        print findword + '\t' + str(word_count) + '\t' + ID + '\t' + TRUTH

Overwriting mapper.py


In [23]:
%%writefile reducer.py
#!/home/hduser/anaconda/bin/python
import sys
from math import log
import numpy as np

word = []
word_count = []
ID = []
TRUTH = []

for line in sys.stdin:
    key_values = line.split('\t')
    # The key is ONE of the words we're counting
    if len(word) != 0:
        if word[-1] != key_values[0]:
            word.append(key_values[0])
    else:
        word.append(key_values[0])
    word_count.append(int(key_values[1]))
    # ID and TRUTH are replicated for each word in the vocabulary
        # so we just need to keep track of them once
    if len(word) == 1:
        ID.append(key_values[2])
        TRUTH.append(int(key_values[3]))

# The lists above will have a length equal to
    # number_different_words_in_vocab * number_emails_in_dataset
vocab_size = len(word)
num_emails = len(TRUTH)
# Reshape the list word_count into a 2-D numpy array
word_count = np.array(word_count).reshape(len(word), num_emails)

total_spam = sum(TRUTH) # total count of spam emails
total_ham = len(TRUTH) - total_spam # total count of ham emails
total_word_spam = [0]*len(word)
total_word_ham = [0]*len(word)
for i,w in enumerate(word):
    # Total count of word w in spam emails
    total_word_spam[i] = sum([x*y for (x,y) in zip(TRUTH,word_count[i])]) 
    # Total count of word w in ham emails
    total_word_ham[i] = sum(word_count[i]) - total_word_spam[i]

# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = [float(x+1) / (sum(total_word_ham)+vocab_size) for x \
                 in total_word_ham]
prob_word_spam = [float(x+1) / (sum(total_word_spam)+vocab_size) for \
                  x in total_word_spam]

# Assess classification with the training set 
CLASS = [0]*num_emails
for i in range(num_emails): # for each email
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + \
        sum([x*log(y,10) for (x,y) in zip(word_count[:,i],prob_word_ham)])
    prob_spam_word = log(prob_spam,10) + \
        sum([x*log(y,10) for (x,y) in zip(word_count[:,i],prob_word_spam)])
    # The right-hand sides of the equations are not equal to prob_category_word,
    # but to log(prob_category_word) + log(prob_word) (where prob_word is the
    # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        CLASS[i] = 1
    # Output for each email
    print ID[i] + '\t' + str(TRUTH[i]) + '\t' + str(CLASS[i])

# Training error
# Count of misclassification errors
errors = sum([x!=y for (x,y) in zip(TRUTH,CLASS)])
training_error = float(errors) / len(TRUTH)
# Additional line
print 'Training Error\t' + str(training_error)

Overwriting reducer.py
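
The smoothing and scoring steps of this reducer can be tried in isolation. The following is a small sketch with hypothetical per-class word totals and priors (they are not taken from the Enron data), written without numpy; `smoothed_likelihoods` and `log_score` are made-up helper names, and natural logarithms are used instead of base 10, which does not affect the comparison.

```python
from math import log

def smoothed_likelihoods(class_totals):
    """Laplace-smoothed P(word | class) over the user-specified vocabulary."""
    vocab_size = len(class_totals)
    denom = sum(class_totals) + vocab_size
    return [float(total + 1) / denom for total in class_totals]

def log_score(prior, likelihoods, counts):
    """log(prior) plus the count-weighted log-likelihood of each word."""
    return log(prior) + sum(c * log(p) for c, p in zip(counts, likelihoods))

# Hypothetical totals for a 3-word vocabulary, and hypothetical priors
spam_like = smoothed_likelihoods([8, 3, 0])
ham_like = smoothed_likelihoods([2, 0, 0])
prior_spam, prior_ham = 0.44, 0.56

with_word = [0, 2, 0]     # email containing the second word twice
without_any = [0, 0, 0]   # email containing none of the words

spam_wins = (log_score(prior_spam, spam_like, with_word) >
             log_score(prior_ham, ham_like, with_word))
prior_decides = (log_score(prior_spam, spam_like, without_any) <
                 log_score(prior_ham, ham_like, without_any))
print(spam_wins)
print(prior_decides)
```

Under these assumed numbers, an email that contains none of the listed words is decided by the priors alone, while an email that contains a word seen mostly in spam is classified as spam.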


In [24]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r multiNBEnron

15/09/15 18:46:44 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:46:47 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted multiNBEnron


In [25]:
# Hadoop streaming command
!hadoop jar hadoop-streaming*.jar -mapper mapper.py -reducer reducer.py -cmdenv \
    findword=assistance,valium,enlargementWithATypo -input \
    /user/hadoop/dirhw21/enronemail_1h.txt -output multiNBEnron

15/09/15 18:46:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/15 18:46:54 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/15 18:46:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/15 18:46:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/15 18:46:55 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/15 18:46:55 INFO mapreduce.JobSubmitter: number of splits:1
15/09/15 18:46:56 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1498805937_0001
15/09/15 18:46:57 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/15 18:46:57 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/15 18:46:57 INFO mapreduce.Job: Running job: job_local1498805937_0001
15/09/15 18:46:5

In [26]:
!hdfs dfs -cat multiNBEnron/part-00000

15/09/15 18:47:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0007.1999-12-13.kaminski	0	0
0012.2001-02-09.kitchen	0	0
0003.1999-12-14.farmer	0	0
0007.1999-12-14.farmer	0	0
0012.2000-06-08.lokay	0	0
0016.2001-07-06.SA_and_HP	1	0
0007.2000-01-17.beck	0	0
0012.2000-01-17.beck	0	0
0001.2000-01-17.beck	0	0
0007.2001-02-09.kitchen	0	0
0012.1999-12-14.kaminski	0	0
0003.2000-01-17.beck	0	0
0007.2003-12-18.GP	1	0
0012.1999-12-14.farmer	0	0
0016.2001-07-05.SA_and_HP	1	0
0007.2004-08-01.BG	1	0
0011.2004-08-01.BG	1	0
0001.2001-04-02.williams	0	0
0008.2001-02-09.kitchen	0	0
0011.2003-12-18.GP	1	0
0003.2001-02-08.kitchen	0	0
0008.2001-06-12.SA_and_HP	1	0
0011.2001-06-29.SA_and_HP	1	0
0016.2001-02-12.kitchen	0	0
0008.2001-06-25.SA_and_HP	1	0
0011.2001-06-28.SA_and_HP	1	1
0017.2004-08-01.BG	1	0
0008.2003-12-18.GP	1	0
0011.1999-12-14.farmer	0	0
0003.2003-12-18.GP	1	0
0008.2004-08-01.BG	1	0
0010.2004-08-01.BG	1	0
0016

As seen above, I also included an additional line in the output with the training error, as defined in HW1. It's (almost) the same value that I got in HW1.4 (actually, a bit better: 0.40 now, while it was 0.41 with the poor man's implementation).