#DATASCI W261: Machine Learning at Scale

#Assignment: Week 2

- Juanjo Carin
- [juanjose.carin@ischool.berkeley.edu](mailto:juanjose.carin@ischool.berkeley.edu)
- W261-2
- Week 02
- Submission date: 9/15/2015

#HW2.0

1. **What is a race condition in the context of parallel computation? Give an example.**

2. **What is MapReduce?**

3. **How does it differ from Hadoop?**

4. **Which programming paradigm is Hadoop based on? Explain and give a simple example in code and show the code running.**

1. A **race condition** is a situation in parallel computation in which the final result depends on the order in which the parallel processes execute. For example, it can occur when two threads each read and then write (after performing some computation) the same variable: if one thread reads and writes the variable, and the other thread reads and writes it only afterwards, the result will differ from what it would be if the second thread had read the variable before the first one wrote it.

2. **MapReduce** is a programming model (and an associated implementation) for processing and generating large data sets with a parallel, distributed algorithm on a cluster. An implementation of this model (i.e., a MapReduce program) is composed of a `map` procedure (that performs filtering and sorting) and a `reduce` method (that performs a summary operation).

3. (Apache) **Hadoop** uses MapReduce, but is more than that: it is a software framework for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware, whose core consists of MapReduce (for processing) and HDFS (Hadoop Distributed File System; for storage). It is also composed of other modules, apart from those two: Hadoop Common, which contains libraries and utilities needed by other Hadoop modules, and Hadoop YARN, which is a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications. Hadoop can also refer to the whole ecosystem or collection of additional software packages that can be installed on top of or alongside it, such as Apache Pig, Apache Hive, Apache Spark, Apache Storm, etc.

4. As mentioned, Hadoop is based on the MapReduce (or map-and-reduce) programming paradigm, which in turn is based on functional programming and parallel computation. What Hadoop does (in its simplest form) is split the input into several chunks, process those chunks in parallel with `map` tasks, and combine the intermediate results of the mappers by means of `reduce` tasks.  **An example is given below, in HW2.1, where the first 10,000 integers, which are shuffled and in string format, are sorted.** Sorting the strings directly would not work, because they would be ordered by their leading digit(s) . . .

In [6]:
!echo '798,\n' '98,\n' '2043,' | sort -k1,1

798,\n 98,\n 2043,


. . . so the mappers add leading zeros to each number (in string format) in the portion of the data passed to each one, and the reducer discards those leading zeros (the sorting itself is done by the Hadoop framework, so there is no need to code it!).

So, given the same mapper and reducer (as briefly described above), the command-line equivalent of the Hadoop job would be:

```python
!echo '798,\n' '98,\n' '2043,' | python mapper.py | sort -k1,1 | python reducer.py
```

which would sort those three numbers correctly:

`98,
798,
2043`
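To make the race condition from answer 1 concrete, here is a small deterministic sketch. It does not use real threads: the `run` function and the two schedules are illustrative names, and each schedule simply replays the read and write steps of two hypothetical threads, A and B, in a fixed order.

```python
# Deterministic simulation of two "threads" that each add 1 to a shared value.
# Each thread performs two steps: read the shared value, then write value + 1.

def run(schedule):
    """Replay a schedule of (thread, step) pairs against a shared counter."""
    shared = {"counter": 0}
    local = {}  # private copy of the value read by each thread
    for thread, step in schedule:
        if step == "read":
            local[thread] = shared["counter"]
        elif step == "write":
            shared["counter"] = local[thread] + 1
    return shared["counter"]

# Thread A finishes before thread B starts: both updates are kept.
sequential = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]
# Thread B reads before A has written: B overwrites A's update.
interleaved = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run(sequential))   # 2
print(run(interleaved))  # 1
```

With real threads the schedule is chosen by the operating system, so either outcome may appear from one run to the next.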

#HW2.1

**Sort in Hadoop MapReduce**

**Given as input: Records of the form `<integer, “NA”>`, where `integer` is any integer, and `“NA”` is just the empty string.**

**Output: sorted key value pairs of the form `<integer, “NA”>`; what happens if you have multiple reducers? Do you need additional steps? Explain.**

**Write code to generate N  random records of the form `<integer, “NA”>`. Let N = 10,000.**

**Write the python Hadoop streaming map-reduce job to perform this sort.**

In [7]:
N = 10000
from random import sample
with open('input', 'w') as myfile:
    # Sample (without replacement) from 1 to N
    integer = sample(range(1, N+1), N) 
    for i in range(N):
        # Add "," and an empty string to each integer
        myfile.write(str(integer[i])+',\n')

# Let's check a few of the generated records, to see if they're random
with open('input', 'r') as myfile:
    text = [word.strip(",.") for line in myfile for word in line.split()]
print text[:10]

['6837', '6188', '8462', '7963', '7003', '1219', '7617', '2948', '173', '708']


In [8]:
%%writefile mapper.py
#!/usr/bin/python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # Discard the "," (keep only the integer=key)
    line = line.strip().rstrip(',')
    # Convert to string and add leading zeros
    integer = str(line).zfill(5)
    # Intermediate result (empty value)
    print '%s\t%s' % (integer, '')

Overwriting mapper.py


In [9]:
%%writefile reducer.py
#!/usr/bin/python
import sys
# input comes from STDIN
for line in sys.stdin:
    # Remove value (empty) keeping only the key and convert to int 
        # (which discards leading zeros)
    integer = int(line.strip())
    # Same format as the original input ("," and empty string)
    print '%s%s' % (integer, ',')

Overwriting reducer.py
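Before launching Hadoop, the logic of the two scripts can be checked locally. The sketch below is an assumption-laden stand-in, not the job itself: `map_record`, `reduce_record` and `local_sort_job` are hypothetical helper names that copy the string handling of `mapper.py` and `reducer.py`, and `sorted()` plays the role of the shuffle-and-sort phase.

```python
def map_record(line):
    """Mimic mapper.py: drop the trailing comma and zero-pad the key."""
    key = line.strip().rstrip(',')
    return '%s\t%s' % (key.zfill(5), '')

def reduce_record(line):
    """Mimic reducer.py: parse the key as int, which drops leading zeros."""
    return '%s%s' % (int(line.strip()), ',')

def local_sort_job(lines):
    """Stand-in for `mapper | sort | reducer`; sorted() plays the shuffle."""
    mapped = [map_record(line) for line in lines]
    return [reduce_record(line) for line in sorted(mapped)]

records = ['798,\n', '98,\n', '2043,\n']
print(sorted(records))          # plain string sort: ['2043,\n', '798,\n', '98,\n']
print(local_sort_job(records))  # ['98,', '798,', '2043,']
```

The zero-padding width of 5 matches `zfill(5)` in the mapper, which is enough for keys up to 10,000.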


In [13]:
# Start Hadoop
!/usr/local/hadoop/sbin/start-yarn.sh
!/usr/local/hadoop/sbin/start-dfs.sh

starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-hduser-resourcemanager-juanjo-VB.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-hduser-nodemanager-juanjo-VB.out
15/12/09 14:52:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hduser-namenode-juanjo-VB.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-hduser-datanode-juanjo-VB.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-hduser-secondarynamenode-juanjo-VB.out
15/12/09 14:53:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [14]:
# Create new folder
!hdfs dfs -mkdir -p /user/hadoop/dirhw21

15/12/09 14:53:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [15]:
## Upload input file to HDFS
!hdfs dfs -put -f input /user/hadoop/dirhw21

15/12/09 14:53:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: Cannot create file/user/hadoop/dirhw21/input._COPYING_. Name node is in safe mode.


In [19]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r sortOutput

15/12/09 14:54:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/09 14:54:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted sortOutput


In [20]:
# Hadoop streaming command
    # Forcing number of reducers to be 1
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar \
    -D mapred.reduce.tasks=1 -mapper mapper.py \
    -reducer reducer.py -input /user/hadoop/dirhw21/input -output sortOutput

15/12/09 14:54:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/09 14:54:48 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/12/09 14:54:48 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/12/09 14:54:48 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/12/09 14:54:49 INFO mapred.FileInputFormat: Total input paths to process : 1
15/12/09 14:54:49 INFO mapreduce.JobSubmitter: number of splits:1
15/12/09 14:54:49 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/12/09 14:54:50 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1008758741_0001
15/12/09 14:54:51 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/12/09 14:54:51 INFO mapreduce.Job: Running job: jo

In [21]:
# Move output to local (rather than directly reading its content)
#!hdfs dfs -cat sortOutput/part-00000
# Delete it from local if a previous version exists
!rm ~/Downloads/HW2/part-00000
!hdfs dfs -copyToLocal sortOutput/part-00000 ~/Downloads/HW2/

rm: cannot remove ‘/home/hduser/Downloads/HW2/part-00000’: No such file or directory
15/12/09 14:55:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [22]:
# Read first records to check they were sorted
!head -20 ~/Downloads/HW2/part-00000

1,	
2,	
3,	
4,	
5,	
6,	
7,	
8,	
9,	
10,	
11,	
12,	
13,	
14,	
15,	
16,	
17,	
18,	
19,	
20,	


If we use multiple reducers, running the following, for example:

```python
# Hadoop streaming command
!hadoop jar hadoop-streaming*.jar -D mapred.reduce.tasks=2 -mapper mapper.py \
    -reducer reducer.py -input /user/hadoop/dirhw21/input -output sortOutput
```
the first records of output `part-00000` are:

`1,	
3,	
5,	
7,	
9,	
10,	
12,	
14,	
16,	
18,	
21,	
23,	
25,	
27,	
29,	
30,	
32,	
34,	
36,	
38,	`

I.e., there is more than one output file (as many as there are reducers). Each one is sorted, but it contains only the keys that the partitioner assigned to that reducer, so it does not necessarily hold a consecutive range of records.

Say there are 4 mappers whose inputs are `<4,>`, `<11,>`, `<7,>`, `<2,>`, `<6,>`, `<10,>`, `<8,>`, `<5,>`, `<12,>`, `<3,>`, `<9,>`, `<1,>`. If the keys `2`, `4`, `6`, `7`, `10`, `11` are sent to one reducer and the remaining keys to a second reducer, their outputs would be `<2,>`, `<4,>`, `<6,>`, `<7,>`, `<10,>`, `<11,>`, and `<1,>`, `<3,>`, `<5,>`, `<8,>`, `<9,>`, `<12,>`, respectively.

**We would have to apply a 2nd MapReduce layer, this one with only one Reducer at the end.**
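One way to picture that extra step: each reducer output is already sorted, so the final single-reducer pass only has to merge sorted streams. The sketch below runs locally and uses the two example outputs from the paragraph above; `reducer_0` and `reducer_1` are illustrative names, and `heapq.merge` stands in for the work of the last reducer.

```python
import heapq

# Outputs of the two reducers in the example above (each already sorted).
reducer_0 = [2, 4, 6, 7, 10, 11]
reducer_1 = [1, 3, 5, 8, 9, 12]

# Merging sorted streams is what a final single-reducer pass amounts to.
merged = list(heapq.merge(reducer_0, reducer_1))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```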

#HW2.2

**Using the Enron data from HW1 and Hadoop MapReduce streaming, write mapper/reducer pair that  will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.**

**To do so, make sure that**

- **mapper.py counts all occurrences of a single word, and**
- **reducer.py collates the counts of the single word.**

In [23]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
import os

# Word to find
env_vars = os.environ
findword = env_vars['findword']

WORD_RE = re.compile(r"[\w']+")
word_count = 0

# input comes from STDIN (standard input)
for line in sys.stdin:
    for w in WORD_RE.findall(line):
        if findword.lower() == w.lower():
            word_count += 1
print findword + '\t' + str(word_count)

Overwriting mapper.py


In [24]:
%%writefile reducer.py
#!/usr/bin/python
import sys

sum_words = 0
for line in sys.stdin:
    key_value = line.split('\t')
    # The key is the single word we're counting
    key = key_value[0]
    # And each value, its count from a mapper
    value = key_value[1]
    sum_words += int(value)
print key + '\t' + str(sum_words)

Overwriting reducer.py


In [25]:
## Upload input file to HDFS
!hdfs dfs -put -f enronemail_1h.txt /user/hadoop/dirhw21

15/12/09 14:55:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [26]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r countWordEnron

15/12/09 14:55:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/09 14:55:42 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted countWordEnron


In [27]:
# Hadoop streaming command
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar \
    -mapper mapper.py -reducer reducer.py -cmdenv \
    findword=assistance -input /user/hadoop/dirhw21/enronemail_1h.txt \
    -output countWordEnron

15/12/09 14:55:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/09 14:55:46 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/12/09 14:55:46 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/12/09 14:55:46 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/12/09 14:55:47 INFO mapred.FileInputFormat: Total input paths to process : 1
15/12/09 14:55:48 INFO mapreduce.JobSubmitter: number of splits:1
15/12/09 14:55:48 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1448764630_0001
15/12/09 14:55:49 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/12/09 14:55:49 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/12/09 14:55:49 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.Fi

In [28]:
!hdfs dfs -cat countWordEnron/part-00000

15/12/09 14:56:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
assistance	10


Same result as in HW1, because I'm including the subject of the emails, and all occurrences in a single email are summed.
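The counting and collating logic can also be checked without Hadoop. This is a minimal sketch with made-up input lines (they are not from the Enron file); `count_word` and `collate` are hypothetical helpers that follow the same regular expression and case-insensitive comparison as `mapper.py`, and the same tab-separated format that `reducer.py` reads.

```python
import re

WORD_RE = re.compile(r"[\w']+")

def count_word(lines, findword):
    """Mimic mapper.py: case-insensitive count of one word in its input split."""
    target = findword.lower()
    return sum(1 for line in lines
               for w in WORD_RE.findall(line) if w.lower() == target)

def collate(mapper_outputs):
    """Mimic reducer.py: add up the 'word<TAB>count' lines from all mappers."""
    return sum(int(line.split('\t')[1]) for line in mapper_outputs)

# Two made-up input splits, one per hypothetical mapper.
split_a = ["id1\t0\tNeed assistance?\tAssistance is available.\n"]
split_b = ["id2\t1\tno match here\tfinancial ASSISTANCE offered\n"]
outputs = ['assistance\t%d' % count_word(s, 'assistance')
           for s in (split_a, split_b)]
print(collate(outputs))  # 3
```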

#HW2.3

**Using the Enron data from HW1 and Hadoop MapReduce, write  a mapper/reducer pair that will classify the email messages by a single, user-specified word. Examine the word “assistance” and report your results. To do so, make sure that**

- **mapper.py**
- **reducer.py**

**performs a single word multinomial Naive Bayes classification.**

In [21]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
import os

# Word to find
env_vars = os.environ
findword = env_vars['findword']

WORD_RE = re.compile(r"[\w']+")

# input comes from STDIN (standard input)
for line in sys.stdin:
    word_count = 0 # count of word in the email
    ID = line.split("\t")[0]
    TRUTH = line.split("\t")[1]
    content = ' '.join(line.strip().split("\t")[2:])
        # We search the word in both the subject and the content
            # because one or the other may not exist, but the way the data are
            # stored we don't know which one may be missing
    word_count = WORD_RE.findall(content).count(findword)
    print findword + '\t' + str(word_count) + '\t' + ID + '\t' + TRUTH

Overwriting mapper.py


In [22]:
%%writefile reducer.py
#!/usr/bin/python
import sys
from math import log

word_count = []
ID = []
TRUTH = []

for line in sys.stdin:
    key_values = line.split('\t')
    # The key is the single word we're counting
    word = key_values[0]
    word_count.append(int(key_values[1]))
    ID.append(key_values[2])
    TRUTH.append(int(key_values[3]))

total_spam = sum(TRUTH) # total count of spam emails
total_ham = len(TRUTH) - total_spam # total count of ham emails
# Total count of word in spam emails
total_word_spam = sum([x*y for (x,y) in zip(TRUTH,word_count)]) 
# Total count of word in ham emails
total_word_ham = sum(word_count) - total_word_spam

# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = float(total_word_ham + 1) / (total_word_ham + 1)
prob_word_spam = float(total_word_spam + 1) / (total_word_spam + 1)

# Assess classification with the training set 
CLASS = [0]*len(TRUTH)
for i in range(len(TRUTH)): # for each email
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + word_count[i]*log(prob_word_ham,10)
    prob_spam_word = log(prob_spam,10) + word_count[i]*log(prob_word_spam,10)
    # The right-hand sides are not log(prob_category_word) itself, but
        # log(prob_category_word) + log(prob_word) (where prob_word is the
        # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        CLASS[i] = 1
    # Output for each email
    print ID[i] + '\t' + str(TRUTH[i]) + '\t' + str(CLASS[i])

# Training error
# Count of misclassification errors
errors = sum([x!=y for (x,y) in zip(TRUTH,CLASS)])
training_error = float(errors) / len(TRUTH)
# Additional line
print 'Training Error\t' + str(training_error)

Overwriting reducer.py


In [23]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r singleNBEnron

15/09/19 17:21:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:21:35 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted singleNBEnron


In [24]:
# Hadoop streaming command
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar \
    -mapper mapper.py -reducer reducer.py -cmdenv \
    findword=assistance -input /user/hadoop/dirhw21/enronemail_1h.txt \
    -output singleNBEnron

15/09/19 17:21:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:21:39 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/19 17:21:39 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/19 17:21:39 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/19 17:21:40 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/19 17:21:40 INFO mapreduce.JobSubmitter: number of splits:1
15/09/19 17:21:40 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local82524836_0001
15/09/19 17:21:41 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/19 17:21:41 INFO mapreduce.Job: Running job: job_local82524836_0001
15/09/19 17:21:41 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/19 17:21:41 IN

In [25]:
!hdfs dfs -cat singleNBEnron/part-00000

15/09/19 17:21:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0018.2003-12-18.GP	1	0
0018.2001-07-13.SA_and_HP	1	0
0018.1999-12-14.kaminski	0	0
0017.2004-08-02.BG	1	0
0017.2004-08-01.BG	1	0
0017.2003-12-18.GP	1	0
0017.2001-04-03.williams	0	0
0017.2000-01-17.beck	0	0
0017.1999-12-14.kaminski	0	0
0016.2004-08-01.BG	1	0
0016.2003-12-19.GP	1	0
0016.2001-07-06.SA_and_HP	1	0
0016.2001-07-05.SA_and_HP	1	0
0016.2001-02-12.kitchen	0	0
0016.1999-12-15.farmer	0	0
0015.2003-12-19.GP	1	0
0015.2001-07-05.SA_and_HP	1	0
0015.2001-02-12.kitchen	0	0
0015.2000-06-09.lokay	0	0
0015.1999-12-15.farmer	0	0
0015.1999-12-14.kaminski	0	0
0014.2004-08-01.BG	1	0
0014.2003-12-19.GP	1	0
0014.2001-07-04.SA_and_HP	1	0
0014.2001-02-12.kitchen	0	0
0014.1999-12-15.farmer	0	0
0014.1999-12-14.kaminski	0	0
0013.2004-08-01.BG	1	0
0013.2001-06-30.SA_and_HP	1	0
0013.2001-04-03.williams	0	0
0013.1999-12-14.kaminski	0	0
0013.1999-12-14.farmer	

As seen above, I also included an additional line in the output with the training error, as defined in HW1. It's the same value that I got in HW1.3.
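A note on why every prediction in the visible part of the output is 0: with a one-word vocabulary, the smoothed likelihood in `reducer.py` is `(count + 1) / (count + 1) = 1` for both classes, so its logarithm is 0 and the comparison reduces to comparing the priors. The sketch below illustrates this with hypothetical counts (they are not the Enron figures); `single_word_scores` is an illustrative helper, not part of the job.

```python
from math import log

def single_word_scores(n_ham, n_spam, word_ham, word_spam, count_in_email):
    """Log-posterior scores (up to the evidence term) for a one-word vocabulary."""
    prior_ham = float(n_ham) / (n_ham + n_spam)
    prior_spam = 1 - prior_ham
    vocab_size = 1
    # Laplace smoothing: (count + 1) / (total words in class + vocabulary size).
    # With a single word, the class total equals the word count, so both are 1.0.
    like_ham = float(word_ham + 1) / (word_ham + vocab_size)
    like_spam = float(word_spam + 1) / (word_spam + vocab_size)
    score_ham = log(prior_ham, 10) + count_in_email * log(like_ham, 10)
    score_spam = log(prior_spam, 10) + count_in_email * log(like_spam, 10)
    return score_ham, score_spam

# Hypothetical counts: 6 ham and 4 spam emails; word seen 2 and 8 times.
for k in (0, 1, 5):
    ham, spam = single_word_scores(6, 4, 2, 8, k)
    print('%d occurrences -> %s' % (k, 'spam' if spam > ham else 'ham'))
```

Whatever the word count in the email, the majority class wins, so the training error equals the share of the minority class.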

#HW2.4

**Using the Enron data from HW1 and in the Hadoop MapReduce framework, write  a mapper/reducer pair that will classify the email messages using multinomial Naive Bayes Classifier using a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results**

**To do so, make sure that**

- **mapper.py**
- **reducer.py**

**performs the multiple-word multinomial Naive Bayes classification via the chosen list.**

In [26]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
import os

# Words to find (comma-separated list)
env_vars = os.environ
findword_list = env_vars['findword'].split(',')

WORD_RE = re.compile(r"[\w']+")

# input comes from STDIN (standard input)
for line in sys.stdin:
    word_count = 0 # count of word in the email
    ID = line.split("\t")[0]
    TRUTH = line.split("\t")[1]
    content = ' '.join(line.strip().split("\t")[2:])
        # We search the word in both the subject and the content
            # because one or the other may not exist, but the way the data are
            # stored we don't know which one may be missing
    for findword in findword_list:
        word_count = WORD_RE.findall(content).count(findword)
        print findword + '\t' + str(word_count) + '\t' + ID + '\t' + TRUTH

Overwriting mapper.py


In [27]:
%%writefile reducer.py
#!/home/hduser/anaconda/bin/python
import sys
from math import log
import numpy as np

word = []
word_count = []
ID = []
TRUTH = []

for line in sys.stdin:
    key_values = line.split('\t')
    # The key is ONE of the words we're counting
    if len(word) != 0:
        if word[-1] != key_values[0]:
            word.append(key_values[0])
    else:
        word.append(key_values[0])
    word_count.append(int(key_values[1]))
    # ID and TRUTH are replicated for each word in the vocabulary
        # so we just need to keep track of them once
    if len(word) == 1:
        ID.append(key_values[2])
        TRUTH.append(int(key_values[3]))

# The lists above will have a length equal to
    # number_different_words_in_vocab * number_emails_in_dataset
vocab_size = len(word)
num_emails = len(TRUTH)
# Reshape the list word_count into a 2-D numpy array
word_count = np.array(word_count).reshape(len(word), num_emails)

total_spam = sum(TRUTH) # total count of spam emails
total_ham = len(TRUTH) - total_spam # total count of ham emails
total_word_spam = [0]*len(word)
total_word_ham = [0]*len(word)
for i,w in enumerate(word):
    # Total count of word w in spam emails
    total_word_spam[i] = sum([x*y for (x,y) in zip(TRUTH,word_count[i])]) 
    # Total count of word w in ham emails
    total_word_ham[i] = sum(word_count[i]) - total_word_spam[i]

# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = [float(x+1) / (sum(total_word_ham)+vocab_size) for x \
                 in total_word_ham]
prob_word_spam = [float(x+1) / (sum(total_word_spam)+vocab_size) for \
                  x in total_word_spam]

# Assess classification with the training set 
CLASS = [0]*num_emails
for i in range(num_emails): # for each email
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + \
        sum([x*log(y,10) for (x,y) in zip(word_count[:,i],prob_word_ham)])
    prob_spam_word = log(prob_spam,10) + \
        sum([x*log(y,10) for (x,y) in zip(word_count[:,i],prob_word_spam)])
    # The right-hand sides are not log(prob_category_word) itself, but
        # log(prob_category_word) + log(prob_word) (where prob_word is the
        # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        CLASS[i] = 1
    # Output for each email
    print ID[i] + '\t' + str(TRUTH[i]) + '\t' + str(CLASS[i])

# Training error
# Count of misclassification errors
errors = sum([x!=y for (x,y) in zip(TRUTH,CLASS)])
training_error = float(errors) / len(TRUTH)
# Additional line
print 'Training Error\t' + str(training_error)

Overwriting reducer.py


In [28]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r multiNBEnron

15/09/19 17:22:02 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
rm: `multiNBEnron': No such file or directory


In [29]:
# Hadoop streaming command
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar \
    -mapper mapper.py -reducer reducer.py -cmdenv \
    findword=assistance,valium,enlargementWithATypo -input \
    /user/hadoop/dirhw21/enronemail_1h.txt -output multiNBEnron

15/09/19 17:22:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:22:08 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/19 17:22:08 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/19 17:22:08 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/19 17:22:09 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/19 17:22:09 INFO mapreduce.JobSubmitter: number of splits:1
15/09/19 17:22:10 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local270049115_0001
15/09/19 17:22:11 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/19 17:22:11 INFO mapreduce.Job: Running job: job_local270049115_0001
15/09/19 17:22:11 INFO mapred.LocalJobRunner: OutputCommitter set in config null
15/09/19 17:22:11 

In [30]:
!hdfs dfs -cat multiNBEnron/part-00000

15/09/19 17:22:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0007.1999-12-13.kaminski	0	0
0012.2001-02-09.kitchen	0	0
0003.1999-12-14.farmer	0	0
0007.1999-12-14.farmer	0	0
0012.2000-06-08.lokay	0	0
0016.2001-07-06.SA_and_HP	1	0
0007.2000-01-17.beck	0	0
0012.2000-01-17.beck	0	0
0001.2000-01-17.beck	0	0
0007.2001-02-09.kitchen	0	0
0012.1999-12-14.kaminski	0	0
0003.2000-01-17.beck	0	0
0007.2003-12-18.GP	1	0
0012.1999-12-14.farmer	0	0
0016.2001-07-05.SA_and_HP	1	0
0007.2004-08-01.BG	1	0
0011.2004-08-01.BG	1	0
0001.2001-04-02.williams	0	0
0008.2001-02-09.kitchen	0	0
0011.2003-12-18.GP	1	0
0003.2001-02-08.kitchen	0	0
0008.2001-06-12.SA_and_HP	1	0
0011.2001-06-29.SA_and_HP	1	0
0016.2001-02-12.kitchen	0	0
0008.2001-06-25.SA_and_HP	1	0
0011.2001-06-28.SA_and_HP	1	1
0017.2004-08-01.BG	1	0
0008.2003-12-18.GP	1	0
0011.1999-12-14.farmer	0	0
0003.2003-12-18.GP	1	0
0008.2004-08-01.BG	1	0
0010.2004-08-01.BG	1	0
0016

As seen above, I also included an additional line in the output with the training error, as defined in HW1. It's (almost) the same value that I got in HW1.4 (actually, a bit better: 0.40 now, while it was 0.41 with the poor man's implementation).
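For reference, the multi-word computation can be sketched on a toy example without Hadoop or numpy. The counts and labels below are hypothetical, `train` and `classify` are illustrative names, and natural logarithms are used instead of base 10 (which does not change the comparison). The smoothing follows the same formula as `reducer.py`: `(count + 1) / (total count in class + vocabulary size)`.

```python
from math import log

def train(counts, labels):
    """counts[i][j] = occurrences of word j in email i; labels[i] = 1 for spam."""
    vocab_size = len(counts[0])
    model = {}
    for c in (0, 1):
        rows = [row for row, y in zip(counts, labels) if y == c]
        per_word = [sum(col) for col in zip(*rows)]
        total = sum(per_word)
        model[c] = {
            'log_prior': log(float(len(rows)) / len(labels)),
            # Laplace (add-one) smoothing over the chosen vocabulary
            'log_like': [log(float(n + 1) / (total + vocab_size))
                         for n in per_word],
        }
    return model

def classify(model, row):
    """Return 1 (spam) if its score is strictly higher than ham's, else 0."""
    score = {}
    for c, m in model.items():
        score[c] = m['log_prior'] + sum(x * l for x, l in zip(row, m['log_like']))
    return 1 if score[1] > score[0] else 0

# Hypothetical counts for a 3-word vocabulary in 4 emails (2 ham, 2 spam).
counts = [[1, 0, 0], [0, 0, 0], [0, 2, 0], [1, 3, 0]]
labels = [0, 0, 1, 1]
model = train(counts, labels)
predicted = [classify(model, row) for row in counts]
errors = sum(p != y for p, y in zip(predicted, labels))
print('Training Error\t' + str(float(errors) / len(labels)))
```

The third word never appears in either class, which mirrors the role of "enlargementWithATypo": smoothing gives it a non-zero likelihood, but it never contributes to any email's score.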

#HW2.5

**Using the Enron data from HW1 and in the Hadoop MapReduce framework, write a mapper/reducer for a multinomial Naive Bayes Classifier that will classify the email messages using words present. Also drop words with a frequency of less than three (3). How does it affect the misclassification error of learnt naive multinomial Bayesian Classifiers on the training dataset:**

The first stage is a MapReduce job that learns the whole vocabulary from the training set. The output (a dictionary with all the words present) will be used in the second stage.

In [31]:
%%writefile mapper1.py
#!/usr/bin/python
import sys
import re
import os

vocabulary = []

# input comes from STDIN (standard input)
for line in sys.stdin:
    content = ' '.join(line.strip().split("\t")[2:])
    # We search the word in both the subject and the content
        # because one or the other may not exist, but the way the data are
        # stored we don't know which one may be missing
    # Lowercase, then replace every character that is not a letter
        # (punctuation, digits, etc.) with a space
    content = re.sub('[^a-z]', ' ', content.lower())
    words = content.split() # extract words
    words = set(words) # extract unique words
    vocabulary.extend(words) # append to vocabulary
for word in set(vocabulary):
    print '%s\t%s' % (word, 1) # value here is not important

Overwriting mapper1.py


In [32]:
%%writefile reducer1.py
#!/usr/bin/python
import sys

vocabulary = []
for line in sys.stdin:
    # Take key only (the word) and add to vocabulary if not present
    word = line.split("\t")[0]
    #if word not in vocabulary:
    #    print word
    # If we use the 2 lines above instead of the 3 lines below
        # each word in the vocabulary goes in a new line
        # (and there's no need to sort)
    vocabulary.append(word)
vocabulary = sorted(set(vocabulary)) # Get unique words
print ' '.join(vocabulary) # Print words separated by space

Overwriting reducer1.py


In [33]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r dictionary

15/09/19 17:22:25 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:22:28 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted dictionary


In [34]:
# Hadoop streaming command
    # Forcing number of reducers to be 1
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar -D \
    mapred.reduce.tasks=1 -mapper mapper1.py -reducer reducer1.py -input \
    /user/hadoop/dirhw21/enronemail_1h.txt -output dictionary

15/09/19 17:22:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:22:33 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/19 17:22:33 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/19 17:22:33 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/19 17:22:34 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/19 17:22:34 INFO mapreduce.JobSubmitter: number of splits:1
15/09/19 17:22:34 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/19 17:22:34 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local877582841_0001
15/09/19 17:22:35 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
15/09/19 17:22:35 INFO mapreduce.Job: Running job: job

In [35]:
# Move output to local
# Delete it from local if a previous version exists
!rm ~/Downloads/HW2/dictionary.txt
!hdfs dfs -copyToLocal dictionary/part-00000 ~/Downloads/HW2/dictionary.txt

15/09/19 17:22:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


The dictionary we've created contains the following words:

`a ab abidjan ability able abn about above absent absenteeism absolute absolutely absorb abuse abused acce accelerate accelerated accept acceptable accepted accepting accepts access accomodate accomodates accompanied according accordingly account accountability accounting accounts accrual accurate aches achieve achieved acid acquire acquisition acrobaat acrobat across act action activate active activists activities actor actress actual actually ad adage adams adapted add added adding addition additional additionally address addressed addresses addressing addtional adequately adhesion adm admin adminder administration admitted admixture adobe adobee adolescent adr adrianbold ads adult adv advance advanced advantage ...`
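The effect of the normalization used in `mapper1.py` can be seen on a couple of made-up sentences (not taken from the Enron file). In this sketch, `extract_words` and `build_vocabulary` are hypothetical helpers that apply the same `re.sub('[^a-z]', ' ', ...)` step and the same sorted, de-duplicated output as the two scripts above.

```python
import re

def extract_words(text):
    """Same normalization as mapper1.py: lowercase, keep only a-z, split."""
    return set(re.sub('[^a-z]', ' ', text.lower()).split())

def build_vocabulary(emails):
    """Union of the per-email word sets, sorted as in reducer1.py."""
    vocabulary = set()
    for text in emails:
        vocabulary |= extract_words(text)
    return sorted(vocabulary)

# Made-up email bodies, used only for illustration.
emails = ["Don't miss the 2nd meeting", "Meeting moved: e-mail me"]
print(build_vocabulary(emails))
# ['don', 'e', 'mail', 'me', 'meeting', 'miss', 'moved', 'nd', 't', 'the']
```

Because apostrophes, hyphens and digits become spaces, fragments such as `t`, `nd` and `e` end up in the vocabulary as separate words.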

The second stage is a MapReduce job that applies the vocabulary to build the classifier. A parameter (`min_occurr`) is used to drop words whose frequency is lower than that number.
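The filtering rule itself is simple and can be sketched in plain Python. The example assumes that "frequency" means the total count of a word across all emails, and it uses made-up tokenized emails; `frequent_words` is an illustrative name, not part of the job.

```python
from collections import Counter

def frequent_words(tokenized_emails, min_occurr):
    """Keep words whose total count across all emails is >= min_occurr."""
    totals = Counter()
    for words in tokenized_emails:
        totals.update(words)
    return sorted(w for w, n in totals.items() if n >= min_occurr)

# Made-up tokenized emails: 'offer' appears 3 times in total, the rest fewer.
tokenized = [['offer', 'now', 'offer'], ['meeting', 'now'], ['offer']]
print(frequent_words(tokenized, 3))  # ['offer']
```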

In [36]:
!hdfs dfs -put -f ~/Downloads/HW2/dictionary.txt

15/09/19 17:22:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [37]:
%%writefile mapper2.py
#!/usr/bin/python
import sys
import re

f = open('dictionary', 'r')
word_dict = []
for line in f:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        word_dict.append(word)

# input comes from STDIN (standard input)
for line in sys.stdin:
    ID = line.split("\t")[0]
    TRUTH = line.split("\t")[1]
    # We search the word in both the subject and the content because one or
    # the other may not exist, but the way the data are stored we don't know
    # which one may be missing
    content = ' '.join(line.strip().split("\t")[2:])
    content = re.sub('[^a-z]', ' ', content.lower())
    words = content.split() # extract words
    for word in set(word_dict):
        print word + '\t' + str(words.count(word)) + '\t' + ID + '\t' + TRUTH

Overwriting mapper2.py


In [38]:
%%writefile reducer2.py
#!/home/hduser/anaconda/bin/python
import sys
import os
from math import log
import numpy as np

# Minimum word frequency, passed as an environment variable (-cmdenv)
env_vars = os.environ
min_occurr = env_vars['min_occurr']

word = []
word_count = []
ID = []
TRUTH = []

for line in sys.stdin:
    key_values = line.split('\t')
    # The key is ONE of the words we're counting
    if len(word) != 0:
        if word[-1] != key_values[0]:
            word.append(key_values[0])
    else:
        word.append(key_values[0])
    word_count.append(int(key_values[1]))
    # ID and TRUTH are replicated for each word in the vocabulary
        # so we just need to keep track of them once
    if len(word) == 1:
        ID.append(key_values[2])
        TRUTH.append(int(key_values[3]))

# The lists above will have a length equal to
    # number_different_words_in_vocab * number_emails_in_dataset
vocab_size = len(word)
num_emails = len(TRUTH)
# Reshape the list word_count into a 2-D numpy array
word_count = np.array(word_count).reshape(len(word), num_emails)
# Drop words with a total frequency lower than min_occurr
condition = np.sum(word_count,1) >= int(min_occurr)
final_word_count = word_count[condition, :]
# Indices (into the list word) of the words that reach the threshold
filtered_indices = np.flatnonzero(condition).tolist()
final_word = [word[i] for i in filtered_indices]
final_vocab_size = len(final_word)
                        
total_spam = sum(TRUTH) # total count of spam emails
total_ham = len(TRUTH) - total_spam # total count of ham emails
total_word_spam = [0]*len(final_word)
total_word_ham = [0]*len(final_word)
for i,w in enumerate(final_word):
    # Total count of word w in spam emails
    total_word_spam[i] = sum([x*y for (x,y) in zip(TRUTH,final_word_count[i])]) 
    # Total count of word w in ham emails
    total_word_ham[i] = sum(final_word_count[i]) - total_word_spam[i]

# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = [float(x+1) / (sum(total_word_ham)+final_vocab_size) for x \
                 in total_word_ham]
prob_word_spam = [float(x+1) / (sum(total_word_spam)+final_vocab_size) for \
                  x in total_word_spam]

# Assess classification with the training set 
CLASS = [0]*num_emails
for i in range(num_emails): # for each email
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + \
        sum([x*log(y,10) for (x,y) in zip(final_word_count[:,i],prob_word_ham)])
    prob_spam_word = log(prob_spam,10) + \
        sum([x*log(y,10) for (x,y) in zip(final_word_count[:,i],prob_word_spam)])
    # The right-hand sides of the equations are not equal to
        # log(prob_category_word), but to log(prob_category_word) +
        # log(prob_word) (where prob_word is the EVIDENCE). It's OK since the
        # EVIDENCE is the same for both classes and we only want to compare
        # the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        CLASS[i] = 1
    # Output for each email
    print ID[i] + '\t' + str(TRUTH[i]) + '\t' + str(CLASS[i])

# Training error
# Count of misclassification errors
errors = sum([x!=y for (x,y) in zip(TRUTH,CLASS)])
training_error = float(errors) / len(TRUTH)
# Additional line
print 'Training Error\t' + str(training_error)

Overwriting reducer2.py
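
The scoring that the reducer performs can be illustrated on a tiny, made-up example. This is only a sketch under stated assumptions: the vocabulary, counts and class sizes are hypothetical, and the helper names `smoothed` and `log_score` are not part of the notebook. It applies the same add-one (Laplace) smoothing and the same base-10 log comparison as the reducer, and is written to run on Python 3 as well:

```python
from math import log

# Hypothetical training totals for a 3-word vocabulary
vocab = ['free', 'meeting', 'money']
total_word_spam = [3, 0, 2]   # occurrences of each word in spam emails
total_word_ham = [0, 4, 1]    # occurrences of each word in ham emails
num_spam, num_ham = 2, 3      # number of emails in each class

# PRIORS
prob_spam = float(num_spam) / (num_spam + num_ham)
prob_ham = 1 - prob_spam

def smoothed(counts):
    """Add-one smoothed conditional likelihoods P(word | class)."""
    denom = sum(counts) + len(counts)
    return [float(c + 1) / denom for c in counts]

prob_word_spam = smoothed(total_word_spam)
prob_word_ham = smoothed(total_word_ham)

def log_score(prior, likelihoods, email_counts):
    """log10(prior) + sum over words of count * log10(likelihood)."""
    return log(prior, 10) + sum(n * log(p, 10)
                                for n, p in zip(email_counts, likelihoods))

# A hypothetical email containing 'free' twice and 'money' once
email = [2, 0, 1]
spam_score = log_score(prob_spam, prob_word_spam, email)
ham_score = log_score(prob_ham, prob_word_ham, email)
predicted = 1 if spam_score > ham_score else 0  # 1 = spam, 0 = ham

print(predicted)
```

Thanks to the smoothing, a word that never appears in one class (such as `free` in ham) still gets a non-zero likelihood, so the logarithm is always defined.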


In [39]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r allNBEnron

15/09/19 17:23:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:23:08 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted allNBEnron


First we try with `min_occurr=1`, i.e., using all words, regardless of their frequency.

In [40]:
# Hadoop streaming command
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar -D \
    mapred.reduce.tasks=1 -files 'dictionary.txt#dictionary' -cmdenv \
    min_occurr=1  -mapper mapper2.py -reducer reducer2.py -input \
    /user/hadoop/dirhw21/enronemail_1h.txt -output allNBEnron

15/09/19 17:23:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:23:11 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/19 17:23:12 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/19 17:23:12 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/19 17:23:12 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/19 17:23:13 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/19 17:23:13 INFO mapreduce.JobSubmitter: number of splits:1
15/09/19 17:23:13 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local1293880643_0001
15/09/19 17:23:15 INFO mapred.LocalDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/local/1442708594420/dictionary.txt <- /

In [41]:
!hdfs dfs -cat allNBEnron/part-00000

15/09/19 17:23:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0016.2001-07-06.SA_and_HP	1	1
0018.2001-07-13.SA_and_HP	1	1
0012.2000-01-17.beck	0	0
0012.2000-06-08.lokay	0	0
0006.2001-02-08.kitchen	0	0
0017.2000-01-17.beck	0	0
0010.2001-06-28.SA_and_HP	1	1
0007.1999-12-14.farmer	0	0
0002.1999-12-13.farmer	0	0
0003.2000-01-17.beck	0	0
0008.2003-12-18.GP	1	1
0007.1999-12-13.kaminski	0	0
0016.1999-12-15.farmer	0	0
0004.2004-08-01.BG	1	1
0011.2001-06-28.SA_and_HP	1	1
0001.2000-06-06.lokay	0	0
0002.2003-12-18.GP	1	1
0004.1999-12-10.kaminski	0	0
0012.2001-02-09.kitchen	0	0
0002.2004-08-01.BG	1	1
0004.2001-04-02.williams	0	0
0003.2003-12-18.GP	1	1
0010.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0013.2001-06-30.SA_and_HP	1	1
0007.2000-01-17.beck	0	0
0006.2001-04-03.williams	0	0
0005.2001-02-08.kitchen	0	0
0001.2001-02-07.kitchen	0	0
0011.2004-08-01.BG	1	1
0009.1999-12-13.kaminski	0	0
0016.2004-08-01.BG	1	

As seen above (and as expected) the training error is zero, so the accuracy at classifying the same dataset used to train the model is perfect. Now let's drop all words with a frequency of less than three.

In [42]:
# Delete previous output (if it exists)
!hdfs dfs -rm -r allNBEnron

15/09/19 17:23:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:23:58 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted allNBEnron


In [43]:
# Hadoop streaming command
!hadoop jar /home/hduser/Downloads/hadoop-streaming*.jar -D \
    mapred.reduce.tasks=1 -files 'dictionary.txt#dictionary' -cmdenv \
    min_occurr=3  -mapper mapper2.py -reducer reducer2.py -input \
    /user/hadoop/dirhw21/enronemail_1h.txt -output allNBEnron

15/09/19 17:24:00 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/09/19 17:24:01 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
15/09/19 17:24:02 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
15/09/19 17:24:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
15/09/19 17:24:03 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
15/09/19 17:24:03 INFO mapred.FileInputFormat: Total input paths to process : 1
15/09/19 17:24:03 INFO mapreduce.JobSubmitter: number of splits:1
15/09/19 17:24:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local615146674_0001
15/09/19 17:24:05 INFO mapred.LocalDistributedCacheManager: Creating symlink: /app/hadoop/tmp/mapred/local/1442708644967/dictionary.txt <- /h

In [44]:
!hdfs dfs -cat allNBEnron/part-00000

15/09/19 17:24:26 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
0016.2001-07-06.SA_and_HP	1	1
0018.2001-07-13.SA_and_HP	1	1
0012.2000-01-17.beck	0	0
0012.2000-06-08.lokay	0	0
0006.2001-02-08.kitchen	0	0
0017.2000-01-17.beck	0	0
0010.2001-06-28.SA_and_HP	1	1
0007.1999-12-14.farmer	0	0
0002.1999-12-13.farmer	0	0
0003.2000-01-17.beck	0	0
0008.2003-12-18.GP	1	1
0007.1999-12-13.kaminski	0	0
0016.1999-12-15.farmer	0	0
0004.2004-08-01.BG	1	1
0011.2001-06-28.SA_and_HP	1	1
0001.2000-06-06.lokay	0	0
0002.2003-12-18.GP	1	1
0004.1999-12-10.kaminski	0	0
0012.2001-02-09.kitchen	0	0
0002.2004-08-01.BG	1	1
0004.2001-04-02.williams	0	0
0003.2003-12-18.GP	1	1
0010.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0013.2001-06-30.SA_and_HP	1	1
0007.2000-01-17.beck	0	0
0006.2001-04-03.williams	0	0
0005.2001-02-08.kitchen	0	0
0001.2001-02-07.kitchen	0	0
0011.2004-08-01.BG	1	1
0009.1999-12-13.kaminski	0	0
0016.2004-08-01.BG	1	

Dropping very infrequent words causes the training error to be just slightly higher (1% instead of 0%). If we try higher values of `min_occurr` (dropping words unless they are quite frequent), the training error keeps increasing, though at a very low rate.

At least in a real case (measuring accuracy with a test set rather than the training set), it makes more sense to treat very frequent words as **stopwords** (i.e., to drop words that appear more than $n$ times), because they're likely to be present in both types of emails (spam and ham), and hence do not characterize the kind of email we're trying to classify.
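
A minimal sketch of that alternative, assuming a handful of made-up emails and a hypothetical upper threshold `max_occurr` (playing the role of $n$; it is not a parameter of the jobs above), could look like this:

```python
from collections import Counter

# Hypothetical, already-cleaned emails (lowercase, letters only)
emails = ["the meeting is at noon",
          "buy the pills now",
          "the report is due"]
max_occurr = 2  # hypothetical upper threshold: the n in the text

# Total count of each word over all emails
counts = Counter(word for email in emails for word in email.split())

# Words above the threshold are treated as stopwords and dropped
stopwords = sorted(w for w, c in counts.items() if c > max_occurr)
vocabulary = sorted(w for w, c in counts.items() if c <= max_occurr)

print(stopwords)
print(vocabulary)
```

In this toy example only `the` exceeds the threshold. A suitable value for a real dataset would have to be chosen by checking accuracy on a held-out test set.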

In [29]:
!/usr/local/hadoop/sbin/stop-yarn.sh
!/usr/local/hadoop/sbin/stop-dfs.sh

stopping yarn daemons
stopping resourcemanager
localhost: stopping nodemanager
no proxyserver to stop
15/12/09 15:22:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [localhost]
localhost: stopping namenode
localhost: stopping datanode
Stopping secondary namenodes [0.0.0.0]
0.0.0.0: stopping secondarynamenode
15/12/09 15:22:56 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
