Name: Patrick Ng  
Email: patng@ischool.berkeley.edu  
Class: W261-2  
Week: 01  
Date of submission: Jan 18, 2016

## HW1.0.0## 
**Big data** is broad term for data sets so large or complex that traditional data-processing applications are inadequate.  IBM has characterized big data by its 4 V's: Volume (scale of data), Velocity (analysis of streaming data), Variety (different forms of data) and Veractiy (uncertainty of data). 

For example, for a popular website which attracts ten millions visits each day, the amount of web log data generated each day is about 50GB.  The website also uses a recommendation engine to generate recommendations to the visitors.  In order to train the engine each night, we need to cleanse (e.g. remove logs generated by suspected robots) and transform the web log data into a format usable by the training process, and the training itself has to process all the new web log data, together with all data generated from the past.  The whole process has to complete within 6 hours due to business needs.  However, using traditional data-processing techniques the whole proceess could take more than 8 hours.  As a result, we need to make use of big data techniques such as HDFS and parallel computating in order to complete the processing within the required period.

## HW1.0.1##
To estimate the bias, the variance, the irreduciable error for a test dataset T when using polynomial regression models of degree 1, 2, 3, 4 and 5, the strategy is:
- First, please note that each regression model will produce an estimator <i>g(x)</i> of the true function <i>f(x)</i>.
- For each model:
    - Using bootstrapping, generate datasets S<sub>1</sub>, S<sub>2</sub>, ..., S<sub>B</sub> from the dataset T.
    - For each S<sub>b</sub>:
      - Use S<sub>b</sub> to train the model to produce an estimator g<sub>b</sub>(x).
      - Let the dataset T<sub>b</sub> = T \ S<sub>b</sub> be the data points that do not appear in S<sub>b</sub>
      - Calculate the predicted value g<sub>b</sub>(x) for each x in T<sub>b</sub>.
    - Now we have several predictions for each data point x in T. From them we can calculate E<sub>B</sub>[g(x)] for each x.
    - For each data point x, we can now calculate the bias, the variance and the irreduciable error of the model:
        - Bias = E<sub>B</sub>[g(x)] - y
        - Variance = $\sum_{k}$(y<sub>k</sub> - E<sub>B</sub>[g(x)])<sup>2</sup> / (K-1)
            - where K is the number of predictions made for each data point x in T
        - Irreduciable Error / Noise:
            - Since we don't have the true function f(x), we have to estimate the noise at each data point x:
                - In T, if several data points exists at x, then:
                    - noise = variance of the multiple y's from those data points
                - Otherwise, assume noise = 0 for x
                  
To select a model, for each model we calculate the <i>expected prediction error</i>:
- Expected prediction error = E<sub>X</sub>[Variance + Bias<sup>2</sup> + Noise<sup>2</sup>]
    - Note: E<sub>X</sub> means we average it over all data points.

The model with the lowest expected prediction error will be chosen.
      

## Prepare the control script ##

In [40]:
%%writefile pNaiveBayes.sh
## pNaiveBayes.sh
## Author: Jake Ryland Williams
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.

## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used

## a test set data of 100 messages
data="enronemail_1h.txt" 

## the full set of data (33746 messages)
# data="enronemail.txt" 

## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.

## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ./mapper.py $datachunk "$wordlist" > $datachunk.counts &
    ####
    ####
done
## wait for the mappers to finish their work
wait

## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
./reducer.py $countfiles > $data.output
####
####

## clean up the data chunks and temporary count files
\rm $data.chunk.*


Overwriting pNaiveBayes.sh


## HW1.1 ##

In [41]:
'''
HW1.1. Read through the provided control script (pNaiveBayes.sh)
and all of its comments. When you are comfortable with their
purpose and function, respond to the remaining homework questions below. 
'''
print "done"

done


## HW1.2 ##
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh will determine the number of occurrences of a single, user-specified word.

In [42]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: mapper code for HW1.2

import sys
import re

## collect user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())

findword = findwords[0] # for HW1.2 we handle only a single word
regex = re.compile(r'\b' + re.escape(findword) + r'\b')
count = 0

with open(filename, 'r') as f:
    for line in f:
        parts = re.split("\t", line)
        
        subject = "" if parts[2].strip() == "NA" else parts[2]
        body = "" if parts[3].strip() == "NA" else parts[3]
        text = subject + " " + body
        
        count += len(regex.findall(text))

print '%s\t%d' % (findword, count)

Overwriting mapper.py


In [43]:
%%writefile reducer.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: reducer code for HW1.2

import sys
import re

## collect user input
filenames = sys.argv[1:]

totalCount = 0
findword = None

for filename in filenames:
    with open(filename, 'r') as f:
        line = f.readline()
        parts = line.split('\t')
        findword = parts[0]
        count = int(parts[1])
        totalCount += count

print '%s\t%d' % (findword, totalCount)

Overwriting reducer.py


In [44]:
!chmod +x mapper.py; chmod +x reducer.py; chmod +x pNaiveBayes.sh;

**Results:**

In [45]:
!./pNaiveBayes.sh 4 assistance
!cat enronemail_1h.txt.output

assistance	10


## HW1.3. ##  
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a single, user-specified word using the multinomial Naive Bayes Formulation.

In [46]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: mapper code for HW1.3

import sys
import re

## collect user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())

findword = findwords[0] # for HW1.3 we handle only a single word

# Regex for counting the number of findword in msg
regex = re.compile(r'\b' + re.escape(findword) + r'\b')

# Regex for counting the total number of words in a msg
regexAll = re.compile(r'\b\w+\b')

# For each of the count list below, index=0 means non-spam, index=1 means spam.
findwordCounts = [0, 0] # The findword counts among the two classes
totalwordCounts = [0, 0] # The total counts among the two classes
msgCounts = [0, 0]  # Total count of msg among the two classes

# For each msg: MsgId, its True class, and the occurrence count of findword
findwordInMessages = {}

with open(filename, 'r') as f:
    for line in f:
        parts = re.split("\t", line)
        
        msgId = parts[0]
        spamIndicator = int(parts[1])

        subject = "" if parts[2].strip() == "NA" else parts[2]
        body = "" if parts[3].strip() == "NA" else parts[3]
        text = subject + " " + body
        
        # Update the various counters
        
        msgCounts[spamIndicator] += 1
        
        findwordCount = len(regex.findall(text))
        findwordCounts[spamIndicator] += findwordCount
        findwordInMessages[msgId] = [spamIndicator, findwordCount]
        
        totalwordCounts[spamIndicator] += len(regexAll.findall(text))


# Msg count in each class
print "%d\t%d" % (msgCounts[0], msgCounts[1]) 

# Occurrence count of the findword in each class
print "%d\t%d" % (findwordCounts[0], findwordCounts[1]) 

# Total word count in each class
print "%d\t%d" % (totalwordCounts[0], totalwordCounts[1])

# For each msg: MsgId, occurrence count of findword, its True class
for k, v in findwordInMessages.iteritems():
    print "%s\t%d\t%d" % (k, v[0], v[1])

Overwriting mapper.py


In [47]:
%%writefile reducer.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: reducer code for HW1.3

import sys
import re
import math
import collections

## collect user input
filenames = sys.argv[1:]

# For each of the count list below, index=0 means non-spam, index=1 means spam.
findwordCounts = [0.0, 0.0] # The findword counts among the two classes
totalwordCounts = [0.0, 0.0] # The total counts among the two classes
msgCounts = [0.0, 0.0]  # Total count of msg among the two classes

# For each msg: MsgId, occurrence count of findword, its True class
findwordInMessages = {}

def readFindwordInMessages(f, countDict):
    for line in f:
        parts = re.split("\t", line)
        msgId = parts[0]
        trueClass = int(parts[1])
        count = float(parts[2])
        countDict[msgId] = [trueClass, count]

def readCounts(f, counts):
    line = f.readline()
    parts = re.split("\t", line)
    counts[0] += float(parts[0])
    counts[1] += float(parts[1])
    
def trainMultinomialNB(msgCounts, findwordCounts, totalwordCounts):
    priors = {} # P(c)
    condProbs = {} # P(X=t|c)
    msgTotal = sum(msgCounts)
    
    # 0: non-spam, 1: spam
    for c in [0, 1]:
        priors[c] = msgCounts[c] / msgTotal
        condProbs[c] = findwordCounts[c] / totalwordCounts[c]
        
    return priors, condProbs

for filename in filenames:
    with open(filename, 'r') as f:
        # Msg count in each class
        readCounts(f, msgCounts)
        
        # Occurrence count of the findword in each class
        readCounts(f, findwordCounts)
        
        # Total word count in each class
        readCounts(f, totalwordCounts)

        # The number of occurrence of findword in each message, key'ed by the message id
        readFindwordInMessages(f, findwordInMessages)

(priors, condProbs) = trainMultinomialNB(msgCounts, findwordCounts, totalwordCounts)


# Predict the class of each message
# Please note that we sort the messages by msgId, so that we can display the result
# in a consistent order
for msgId, data in collections.OrderedDict(sorted(findwordInMessages.items())).iteritems():
    trueClass = data[0]
    findwordCount = data[1]
    scores = {}
    
    # 0: non-spam, 1: spam
    for c in [0, 1]:
        scores[c] = math.log(priors[c]) + findwordCount * math.log(condProbs[c])
    
    predictedClass = 0 if scores[0] > scores[1] else 1
    
    print "%s\t%d\t%d" % (msgId, trueClass, predictedClass)

Overwriting reducer.py


In [48]:
!./pNaiveBayes.sh 4 assistance
!cat enronemail_1h.txt.output

0001.1999-12-10.farmer	0	0
0001.1999-12-10.kaminski	0	0
0001.2000-01-17.beck	0	0
0001.2000-06-06.lokay	0	0
0001.2001-02-07.kitchen	0	0
0001.2001-04-02.williams	0	0
0002.1999-12-13.farmer	0	0
0002.2001-02-07.kitchen	0	0
0002.2001-05-25.SA_and_HP	1	0
0002.2003-12-18.GP	1	0
0002.2004-08-01.BG	1	1
0003.1999-12-10.kaminski	0	0
0003.1999-12-14.farmer	0	0
0003.2000-01-17.beck	0	0
0003.2001-02-08.kitchen	0	0
0003.2003-12-18.GP	1	0
0003.2004-08-01.BG	1	0
0004.1999-12-10.kaminski	0	1
0004.1999-12-14.farmer	0	0
0004.2001-04-02.williams	0	0
0004.2001-06-12.SA_and_HP	1	0
0004.2004-08-01.BG	1	0
0005.1999-12-12.kaminski	0	1
0005.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0005.2001-02-08.kitchen	0	0
0005.2001-06-23.SA_and_HP	1	0
0005.2003-12-18.GP	1	0
0006.1999-12-13.kaminski	0	0
0006.2001-02-08.kitchen	0	0
0006.2001-04-03.williams	0	0
0006.2001-06-25.SA_and_HP	1	0
0006.2003-12-18.GP	1	0
0006.2004-08-01.BG	1	0
0007.1999-12-13.kaminski	0	0
0007.1999-12-14.farmer	

## HW1.4##
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by a list of one or more user-specified words.

In [49]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: mapper code for HW1.4

import sys
import re
from collections import defaultdict
import cPickle

## collect user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())
regexes = {}

# Create the list of compiled regex, one for each findword
for word in findwords:
    regexes[word] = re.compile(r'\b' + re.escape(word) + r'\b')

# Regex for counting the total number of words in a msg
regexAll = re.compile(r'\b\w+\b')

# For each of the count lists below, index=0 means non-spam, index=1 means spam.

# For each class, the occurrence count of each word.
findwordCounts = [defaultdict(int), defaultdict(int)] 
totalwordCounts = [0, 0] # The total counts among the two classes.
msgCounts = [0, 0]  # Total count of msg among the two classes.

# For each msg: MsgId, its True class, and the occurrence count of each word.
findwordInMessages = {}

with open(filename, 'r') as f:
    for line in f:
        parts = re.split("\t", line)
        
        msgId = parts[0]
        spamIndicator = int(parts[1])

        subject = "" if parts[2].strip() == "NA" else parts[2]
        body = "" if parts[3].strip() == "NA" else parts[3]
        text = subject + " " + body
        
        # Update the various counters
        
        msgCounts[spamIndicator] += 1
        
        wordCountsOfOneMsg = {} # Store the occurrence count of each word in this message
        for word in findwords:
            count = len(regexes[word].findall(text))
            if count > 0:
                findwordCounts[spamIndicator][word] += count
                wordCountsOfOneMsg[word] = count
            
        findwordInMessages[msgId] = [spamIndicator, wordCountsOfOneMsg]
        
        totalwordCounts[spamIndicator] += len(regexAll.findall(text))

# Use cPickle to dump all the data
data = [msgCounts, findwordCounts, totalwordCounts, findwordInMessages]
print cPickle.dumps(data)

Overwriting mapper.py


In [50]:
%%writefile reducer.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: reducer code for HW1.4

import sys
import re
import math
from collections import defaultdict
from collections import Counter
import collections
import cPickle

## collect user input
filenames = sys.argv[1:]

# For each of the count list below, index=0 means non-spam, index=1 means spam.
findwordCounts = [defaultdict(int), defaultdict(int)] # For each class, the occurrence count of each word.
totalwordCounts = [0, 0] # The total counts among the two classes.
msgCounts = [0, 0]  # Total count of msg among the two classes.

# For each msg: MsgId, its True class, and the occurrence count of each word.
findwordInMessages = {}

# The vocab seen
vocab = set()
    
def trainMultinomialNB(msgCounts, findwordCounts, totalwordCounts, vocab):
    priors = {} # P(c)
    condProbs = {0:{}, 1:{}} # P(X=t|c)
    msgTotal = sum(msgCounts)
    
    # 0: non-spam, 1: spam
    for c in [0, 1]:
        priors[c] = float(msgCounts[c]) / msgTotal
        for term in vocab:
            # By default we'll set word count to be zero if the term isn't found in the class.
            wordCount = findwordCounts[c].get(term, 0)
            
            # Note: we will use Laplace smoothing because the word "enlargementWithATypo"
            #       isn't found in the whole corpus.
            condProbs[c][term] = float(wordCount + 1) / (totalwordCounts[c] + len(vocab))
    
    return priors, condProbs

for filename in filenames:
    with open(filename, 'r') as f:
        # Use cPickle to read the output from mapper
        (msgCountsPart, findwordCountsPart, totalwordCountsPart, findwordInMessagesPart) = \
            cPickle.loads(f.read())
        
        # Update all our counters
        msgCounts = [x + y for x, y in zip(msgCounts, msgCountsPart)]
        totalwordCounts = [x + y for x, y in zip(totalwordCounts, totalwordCountsPart)]
        
        # For each class, update the occurrence count of each word
        for c in [0,1]:
            findwordCounts[c] = dict(Counter(findwordCounts[c]) + Counter(findwordCountsPart[c]))
        
        findwordInMessages.update(findwordInMessagesPart)

# Calculate the vocab seen in all the docs
for c in [0,1]:
    vocab = vocab.union(set(findwordCounts[c].keys()))
        
(priors, condProbs) = trainMultinomialNB(msgCounts, findwordCounts, totalwordCounts, vocab)


# Predict the class of each message
# Please note that we sort the messages by msgId, so that we can display the result
# in a consistent order
for msgId, data in collections.OrderedDict(sorted(findwordInMessages.items())).iteritems():
    trueClass = data[0]
    wordCountsOfOneMsg = data[1]
    wordsOfOneMsg = data[1].keys()
    scores = {}
    
    # 0: non-spam, 1: spam
    for c in [0, 1]:
        scores[c] = math.log(priors[c])
        for term in wordsOfOneMsg:
            scores[c] += math.log(condProbs[c][term])
    
    predictedClass = 0 if scores[0] > scores[1] else 1
    
    print "%s\t%d\t%d" % (msgId, trueClass, predictedClass)

Overwriting reducer.py


In [51]:
!./pNaiveBayes.sh 4 "assistance valium enlargementWithATypo"
!cat enronemail_1h.txt.output

0001.1999-12-10.farmer	0	0
0001.1999-12-10.kaminski	0	0
0001.2000-01-17.beck	0	0
0001.2000-06-06.lokay	0	0
0001.2001-02-07.kitchen	0	0
0001.2001-04-02.williams	0	0
0002.1999-12-13.farmer	0	0
0002.2001-02-07.kitchen	0	0
0002.2001-05-25.SA_and_HP	1	0
0002.2003-12-18.GP	1	0
0002.2004-08-01.BG	1	1
0003.1999-12-10.kaminski	0	0
0003.1999-12-14.farmer	0	0
0003.2000-01-17.beck	0	0
0003.2001-02-08.kitchen	0	0
0003.2003-12-18.GP	1	0
0003.2004-08-01.BG	1	0
0004.1999-12-10.kaminski	0	1
0004.1999-12-14.farmer	0	0
0004.2001-04-02.williams	0	0
0004.2001-06-12.SA_and_HP	1	0
0004.2004-08-01.BG	1	0
0005.1999-12-12.kaminski	0	1
0005.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	0
0005.2001-02-08.kitchen	0	0
0005.2001-06-23.SA_and_HP	1	0
0005.2003-12-18.GP	1	0
0006.1999-12-13.kaminski	0	0
0006.2001-02-08.kitchen	0	0
0006.2001-04-03.williams	0	0
0006.2001-06-25.SA_and_HP	1	0
0006.2003-12-18.GP	1	0
0006.2004-08-01.BG	1	0
0007.1999-12-13.kaminski	0	0
0007.1999-12-14.farmer	

## HW1.5##
Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh
   will classify the email messages by all words present.

In [52]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: mapper code for HW1.5

import sys
import re
from collections import defaultdict
import cPickle

## collect user input
filename = sys.argv[1]
findwords = re.split(" ",sys.argv[2].lower())
regexes = {}

useAllWords = False

if findwords[0] == '*':
    useAllWords = True
else:
    # Create the list of compiled regex, one for each findword
    for word in findwords:
        regexes[word] = re.compile(r'\b' + re.escape(word) + r'\b')

# Regex for counting the total number of words in a msg
regexAll = re.compile(r'\b\w+\b')

# For each of the count lists below, index=0 means non-spam, index=1 means spam.

# For each class, the occurrence count of each word.
findwordCounts = [defaultdict(int), defaultdict(int)] 
totalwordCounts = [0, 0] # The total counts among the two classes.
msgCounts = [0, 0]  # Total count of msg among the two classes.

# For each msg: MsgId, its True class, and the occurrence count of each word.
findwordInMessages = {}

with open(filename, 'r') as f:
    for line in f:
        parts = re.split("\t", line)
        
        msgId = parts[0]
        spamIndicator = int(parts[1])

        subject = "" if parts[2].strip() == "NA" else parts[2]
        body = "" if parts[3].strip() == "NA" else parts[3]
        text = subject + " " + body
        
        # Update the various counters
        
        msgCounts[spamIndicator] += 1
        
        wordCountsOfOneMsg = {} # Occurrencd count of each word in this message
        wordsInMsg = regexAll.findall(text) # The vocab in this message
        
        # Update the counters based on the occurrence count of each word this message
        for word in wordsInMsg:
            regex = regexes.get(word, re.compile(r'\b' + re.escape(word) + r'\b'))
            count = len(regex.findall(text))
            if count > 0:
                findwordCounts[spamIndicator][word] += count
                wordCountsOfOneMsg[word] = count
            
        findwordInMessages[msgId] = [spamIndicator, wordCountsOfOneMsg]
        
        totalwordCounts[spamIndicator] += len(regexAll.findall(text))

# Use cPickle to output the data
data = [msgCounts, findwordCounts, totalwordCounts, findwordInMessages]
print cPickle.dumps(data)

Overwriting mapper.py


In [53]:
%%writefile reducer.py
#!/usr/bin/python
## mapper.py
## Author: Patrick Ng
## Description: reducer code for HW1.5

import sys
import re
import math
from collections import defaultdict
from collections import Counter
import collections
import cPickle

## collect user input
filenames = sys.argv[1:]

# For each of the count list below, index=0 means non-spam, index=1 means spam.
findwordCounts = [defaultdict(int), defaultdict(int)] # For each class, the occurrence count of each word.
totalwordCounts = [0, 0] # The total counts among the two classes.
msgCounts = [0, 0]  # Total count of msg among the two classes.

# For each msg: MsgId, its True class, and the occurrence count of each word.
findwordInMessages = {}

# The vocab seen
vocab = set()
    
def trainMultinomialNB(msgCounts, findwordCounts, totalwordCounts, vocab):
    priors = {} # P(c)
    condProbs = {0:{}, 1:{}} # P(X=t|c)
    msgTotal = sum(msgCounts)
    
    # 0: non-spam, 1: spam
    for c in [0, 1]:
        priors[c] = float(msgCounts[c]) / msgTotal
        for term in vocab:
            # By default we'll set word count to be zero if the term isn't found in the class.
            wordCount = findwordCounts[c].get(term, 0)
            
            # Note: we will use Laplace smoothing because the word "enlargementWithATypo"
            #       isn't found in the whole corpus.
            condProbs[c][term] = float(wordCount + 1) / (totalwordCounts[c] + len(vocab))
    
    return priors, condProbs

for filename in filenames:
    with open(filename, 'r') as f:
        (msgCountsPart, findwordCountsPart, totalwordCountsPart, findwordInMessagesPart) = \
            cPickle.loads(f.read())
        
        # Update all our counters
        msgCounts = [x + y for x, y in zip(msgCounts, msgCountsPart)]
        totalwordCounts = [x + y for x, y in zip(totalwordCounts, totalwordCountsPart)]
        
        # For each class, update the occurrence count of each word
        for c in [0,1]:
            findwordCounts[c] = dict(Counter(findwordCounts[c]) + Counter(findwordCountsPart[c]))
        
        findwordInMessages.update(findwordInMessagesPart)

# Calculate the vocab seen in all the docs
for c in [0,1]:
    vocab = vocab.union(set(findwordCounts[c].keys()))
        
(priors, condProbs) = trainMultinomialNB(msgCounts, findwordCounts, totalwordCounts, vocab)


# Predict the class of each message
# Please note that we sort the messages by msgId, so that we can display the result
# in a consistent order
for msgId, data in collections.OrderedDict(sorted(findwordInMessages.items())).iteritems():
    trueClass = data[0]
    wordCountsOfOneMsg = data[1]
    wordsOfOneMsg = data[1].keys()
    scores = {}
    
    # 0: non-spam, 1: spam
    for c in [0, 1]:
        scores[c] = math.log(priors[c])
        for term in wordsOfOneMsg:
            scores[c] += math.log(condProbs[c][term])
    
    predictedClass = 0 if scores[0] > scores[1] else 1
    
    print "%s\t%d\t%d" % (msgId, trueClass, predictedClass)

Overwriting reducer.py


In [54]:
!./pNaiveBayes.sh 4 *
!cat enronemail_1h.txt.output

0001.1999-12-10.farmer	0	0
0001.1999-12-10.kaminski	0	1
0001.2000-01-17.beck	0	0
0001.2000-06-06.lokay	0	0
0001.2001-02-07.kitchen	0	0
0001.2001-04-02.williams	0	1
0002.1999-12-13.farmer	0	0
0002.2001-02-07.kitchen	0	1
0002.2001-05-25.SA_and_HP	1	1
0002.2003-12-18.GP	1	1
0002.2004-08-01.BG	1	1
0003.1999-12-10.kaminski	0	0
0003.1999-12-14.farmer	0	0
0003.2000-01-17.beck	0	0
0003.2001-02-08.kitchen	0	0
0003.2003-12-18.GP	1	1
0003.2004-08-01.BG	1	1
0004.1999-12-10.kaminski	0	0
0004.1999-12-14.farmer	0	0
0004.2001-04-02.williams	0	1
0004.2001-06-12.SA_and_HP	1	1
0004.2004-08-01.BG	1	1
0005.1999-12-12.kaminski	0	0
0005.1999-12-14.farmer	0	0
0005.2000-06-06.lokay	0	1
0005.2001-02-08.kitchen	0	0
0005.2001-06-23.SA_and_HP	1	1
0005.2003-12-18.GP	1	1
0006.1999-12-13.kaminski	0	1
0006.2001-02-08.kitchen	0	0
0006.2001-04-03.williams	0	1
0006.2001-06-25.SA_and_HP	1	1
0006.2003-12-18.GP	1	1
0006.2004-08-01.BG	1	1
0007.1999-12-13.kaminski	0	0
0007.1999-12-14.farmer	