#DATASCI W261, Machine Learning at Scale
--------
####Assignment:  week \#2
####Lei Yang (leiyang@berkeley.edu)
####Due: 2016-01-26, 8AM PST

###*HW2.0.*
- What is a race condition in the context of parallel computation? Give an example.
- What is MapReduce?
- How does it differ from Hadoop?
- Which programming paradigm is Hadoop based on? Explain and give a simple example in code and show the code running.
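
As one possible illustration of the map, shuffle, reduce pattern that the last question refers to, the following is a minimal in-memory sketch (assuming Python 3; it does not use Hadoop itself). The function names `map_phase`, `shuffle_phase`, `reduce_phase` and `word_count` are hypothetical names chosen for this example.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word, 1) for word in line.lower().split()]

def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Collapse all values that share one key into a single total."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))

print(word_count(["a rose is a rose", "is it"]))
```

In a real job the map and reduce calls would run on different machines, and the grouping step would be carried out by the framework rather than by a local dictionary.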


####<span style="color:red">HW1.0.0 Answer:</span>


###*HW2.1.* Sort in Hadoop MapReduce
- Given as input: Records of the form <integer, “NA”>, where integer is any integer, and “NA” is just the empty string.
- Output: sorted key value pairs of the form <integer, “NA”> in decreasing order; what happens if you have multiple reducers? Do you need additional steps? Explain.

- Write code to generate N  random records of the form <integer, “NA”>. Let N = 10,000.
- Write the Python Hadoop streaming MapReduce job to perform this sort. Display the 10 largest numbers and the 10 smallest numbers.
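
The sort described above can be sketched locally before moving to Hadoop. With several reducers, each reducer's output is sorted only within itself, so a global order needs either a single reducer, a final merge, or a partitioner that sends key ranges to reducers in order. The sketch below (assuming Python 3) simulates the range-partitioning option; the helper names, the key range, the seed and the boundary values are all assumptions made for illustration.

```python
import random

def generate_records(n, seed=0, low=-1000000, high=1000000):
    """Return n records of the form (integer, "NA"), reproducible via the seed."""
    rng = random.Random(seed)
    return [(rng.randint(low, high), "NA") for _ in range(n)]

def range_partition(records, boundaries):
    """Assign records to partitions so that partition 0 holds the largest keys.

    boundaries must be in decreasing order; a key goes to the first partition
    whose boundary it meets or exceeds, otherwise to the last partition.
    """
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key, value in records:
        for i, bound in enumerate(boundaries):
            if key >= bound:
                partitions[i].append((key, value))
                break
        else:
            partitions[-1].append((key, value))
    return partitions

def total_order_sort(records, boundaries):
    """Sort each partition in decreasing order, then concatenate the partitions."""
    output = []
    for part in range_partition(records, boundaries):
        output.extend(sorted(part, key=lambda kv: kv[0], reverse=True))
    return output

records = generate_records(10000)
result = total_order_sort(records, boundaries=[500000, 0, -500000])
print(result[:10])   # the ten largest keys
print(result[-10:])  # the ten smallest keys
```

Because every key in partition 0 is at least as large as every key in partition 1, and so on, concatenating the per-partition outputs yields a globally decreasing sequence without a further merge step.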

####<span style="color:red">HW1.0.1 Answer:</span>



###*HW2.2.*  WORDCOUNT
- Using the Enron data from HW1 and Hadoop MapReduce streaming, write the mapper/reducer job that will determine the word count (number of occurrences) of each white-space delimited token (assume spaces, full stops, and commas as delimiters).
- Examine the word “assistance” and report its word count results.
- CROSSCHECK: >grep assistance enronemail_1h.txt|cut -d$'\t' -f4| grep assistance|wc -l    
  - 8    
  - \#NOTE  "assistance" occurs on 8 lines but how many times does the token occur? 10 times! This is the number we are looking for!
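
The difference between the number of lines containing a token and the number of token occurrences, noted in the crosscheck above, can be sketched as follows (assuming Python 3). The delimiter pattern follows the assignment's wording; the function names and the sample lines are made up for illustration and are not taken from the Enron file.

```python
import re

# Delimiters named in the assignment: whitespace, full stops, and commas.
DELIMITERS = re.compile(r"[\s.,]+")

def tokenize(text):
    """Lower-case the text and split it on the delimiters, dropping empty tokens."""
    return [tok for tok in DELIMITERS.split(text.lower()) if tok]

def count_lines_and_tokens(lines, target):
    """Return (lines containing target, total occurrences of target)."""
    lines_with_target = 0
    token_occurrences = 0
    for line in lines:
        hits = tokenize(line).count(target)
        token_occurrences += hits
        if hits:
            lines_with_target += 1
    return lines_with_target, token_occurrences

sample = [
    "Thanks for your assistance. Further assistance, if needed, is welcome.",
    "No help required",
    "assistance requested",
]
print(count_lines_and_tokens(sample, "assistance"))  # (2, 3)
```

A line-based check such as `grep ... | wc -l` reports the first number, while the word count job should report the second.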

###*HW2.2.1*  Using Hadoop MapReduce and your wordcount job (from HW2.2) determine the top-10 occurring tokens (most frequent tokens)
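
Once the word counts are available, selecting the most frequent tokens is a small second step. The sketch below (assuming Python 3) uses the standard-library `heapq` module; the function name `top_n` and the sample counts are hypothetical, and the alphabetical tie-break is a choice made here so that the result is deterministic.

```python
import heapq

def top_n(word_counts, n=10):
    """Return the n (word, count) pairs with the largest counts.

    Ties in count are broken alphabetically by word.
    """
    return heapq.nsmallest(n, word_counts.items(), key=lambda kv: (-kv[1], kv[0]))

counts = {"the": 50, "to": 41, "and": 41, "enron": 7, "assistance": 10}
print(top_n(counts, n=3))  # [('the', 50), ('and', 41), ('to', 41)]
```

In a MapReduce setting the same idea is usually expressed as a second job that takes the output of the word count job as input and sorts on the count field.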

####Let's define the *pNaiveBayes.sh* script first. We only need to do this once, since it is the same throughout HW1.

####Define mapper.py & reducer.py, and make all scripts executable
- *mapper.py* counts occurrences of the single specified word in its chunk and outputs an integer
- *reducer.py* collates the counts from all chunks and outputs the total count of the single specified word

In [3]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
count = 0
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1]
countword = sys.argv[2].lower()
with open (filename, "r") as myfile:
    for line in myfile.readlines():
        for word in line.lower().split()[2:]:
            if countword in word:
                count += 1
print countword + ' ' + str(count)

Overwriting mapper.py


In [4]:
%%writefile reducer.py
#!/usr/bin/python
import sys
sum = 0
for filename in sys.argv[1:]:
    with open (filename, "r") as myfile:
        for line in myfile.readlines():
            temp = line.split()
            word = temp[0]
            sum += int(temp[1])
print word + ': ' + str(sum)

Overwriting reducer.py


In [5]:
!chmod a+x pNaiveBayes.sh
!chmod a+x mapper.py
!chmod a+x reducer.py

####<span style="color:red">HW1.2 Results: </span>by checking the output file, we see that the word 'assistance' occurs 10 times.


In [6]:
!./pNaiveBayes.sh 5 "assistance"
!cat enronemail_1h.txt.output

assistance: 10


###*HW2.3.* Multinomial NAIVE BAYES with NO Smoothing
- Using the Enron data from HW1 and Hadoop MapReduce, write a mapper/reducer job(s) that will both learn a Naive Bayes classifier and classify the Enron email messages using the learnt Naive Bayes classifier.
- Use all white-space delimited tokens as independent input variables (assume spaces, full stops, and commas as delimiters).
- Note: for multinomial Naive Bayes, the Pr(X=“assistance”|Y=SPAM) is calculated as follows:
 - the number of times “assistance” occurs in SPAM labeled documents / the number of words in documents labeled SPAM 
 - E.g., “assistance” occurs 5 times in all of the documents labeled SPAM, and the total number of words in all documents labeled SPAM (when concatenated) is 1,000. Then Pr(X=“assistance”|Y=SPAM) = 5/1000. Note this is a multinomial estimate of the class conditional for a Naive Bayes classifier. No smoothing is needed in this HW. Multiplying many probabilities, each between 0 and 1, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. Please pay attention to probabilities that are zero! They will need special attention. Count how many times you need to process a zero probability for each class and report the result.
- Report the performance of your learnt classifier in terms of the misclassification error rate of your multinomial Naive Bayes classifier. Plot a histogram of the log posterior probabilities (i.e., log(Pr(Class|Doc))) for each class over the training set. Summarize what you see.
- Error Rate = misclassification rate with respect to a provided set (say training set in this case). It is more formally defined here:
 - Let DF represent the evaluation set in the following:
 - Err(Model, DF) = |{(X, c(X)) ∈ DF : c(X) != Model(X)}|   / |DF|
 - Where || denotes set cardinality; c(X) denotes the class of the tuple X in DF; and Model(X) denotes the class inferred by the Model “Model”
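
The unsmoothed estimate, the log-sum scoring and the error rate defined above can be sketched together as follows (assuming Python 3). The function names are hypothetical, and treating a zero probability as a log score of negative infinity is only one possible convention, assumed here for illustration; the assignment leaves the handling open.

```python
import math

def class_conditional(term_count, total_words_in_class):
    """Unsmoothed multinomial estimate: term occurrences / words in the class."""
    return term_count / total_words_in_class

def log_score(log_prior, tokens, term_counts, total_words_in_class):
    """Sum log probabilities for one class.

    Returns (score, zero_hits). A token with zero probability drives the
    score to -inf (assumed convention), and zero_hits counts such tokens.
    """
    score, zero_hits = log_prior, 0
    for tok in tokens:
        p = class_conditional(term_counts.get(tok, 0), total_words_in_class)
        if p == 0.0:
            zero_hits += 1
            score = float("-inf")
        else:
            score += math.log(p)
    return score, zero_hits

def error_rate(true_labels, predicted_labels):
    """Fraction of records whose predicted class differs from the true class."""
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)

print(class_conditional(5, 1000))                 # 0.005
print(error_rate([1, 0, 1, 1], [1, 1, 1, 0]))     # 0.5
```

Summing the `zero_hits` values per class over all documents gives the zero-probability count that the assignment asks to report.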

   

####Define mapper.py:
- obtains count for each word from the chunk, for spam and non-spam email separately, 
- records all counts in a dictionary, 
- outputs the two dictionaries (non-spam and spam word counts), the numbers of non-spam and spam messages, and the keyword.

In [7]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
# let's use two dictionaries to hold the word counts for spam and non-spam
n_count, s_count = {}, {}
nSpam, nNormal = 0, 0
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1]
keyword = sys.argv[2].lower()
with open (filename, "r") as myfile:
    for email in myfile.readlines():
        isSpam = email.split('\t')[1] == '1'
        if isSpam:
            nSpam += 1
            for word in email.lower().split()[2:]: # only use subject & content for modeling
                if word not in s_count:
                    s_count[word] = 1
                else:
                    s_count[word] += 1
        else:
            nNormal += 1
            for word in email.lower().split()[2:]: # only use subject & content for modeling
                if word not in n_count:
                    n_count[word] = 1
                else:
                    n_count[word] += 1
print n_count
print s_count
print nNormal
print nSpam
print "'" + keyword + "'"

Overwriting mapper.py


####Define reducer.py:
- collapse word counts from all chunks
- estimate NB model parameters: prior and conditional probabilities
- classify messages that contain the keyword
- **Note:** for messages that don't contain the keyword, the decision is solely based on prior probability, which will always give non-spam prediction, thus we skip those messages and only focus on those with the specified keyword
- output results
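
The first step in the list above, collapsing the per-chunk counts, can be sketched with the standard library (assuming Python 3). This sketch parses each mapper line with `ast.literal_eval`, which only accepts Python literals, as an alternative to building and executing a statement from the file contents. The names `parse_chunk` and `merge_chunks` are hypothetical, and the five-line layout mirrors what mapper.py prints.

```python
import ast
from collections import Counter

def parse_chunk(lines):
    """Parse the five lines one mapper prints:
    non-spam dict, spam dict, non-spam message count, spam message count, quoted keyword.
    """
    n_count, s_count, n_normal, n_spam, keyword = [
        ast.literal_eval(line.strip()) for line in lines
    ]
    return n_count, s_count, n_normal, n_spam, keyword

def merge_chunks(chunks):
    """Add up word counts and message counts over all mapper outputs."""
    total_n, total_s = Counter(), Counter()
    n_normal = n_spam = 0
    keyword = None
    for lines in chunks:
        n_count, s_count, nn, ns, keyword = parse_chunk(lines)
        total_n.update(n_count)  # Counter.update adds counts instead of replacing them
        total_s.update(s_count)
        n_normal += nn
        n_spam += ns
    return total_n, total_s, n_normal, n_spam, keyword

chunk1 = ["{'hello': 2, 'meeting': 1}", "{'free': 3}", "2", "1", "'assistance'"]
chunk2 = ["{'hello': 1}", "{'free': 1, 'assistance': 2}", "1", "2", "'assistance'"]
print(merge_chunks([chunk1, chunk2]))
```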

####Parameter estimation background:
Assuming *positional independence*, and with *add-one Laplace smoothing*, the multinomial NB conditional probability $P(t | c)$ can be estimated as:
$$
\hat{P}(t\mid c)=\frac{T_{ct}+1}{(\sum_{t^\prime \in V}{T_{ct^\prime}})+B},
$$

where $B=|V|$ is the number of terms in the vocabulary $V$ (including all text classes), and $T_{ct}$ is the count of word *t* in class *c*. 

To classify a message, we choose the maximum a posteriori (MAP) class, i.e. the class with the largest log posterior score:
$$
c_{map}=\arg\max_{c\in\mathbb C}[\log{\hat{P}(c)}+\sum_{1\leqslant k \leqslant n_d}{\log{\hat{P}(t_k\mid c)}}],
$$
where $\hat{P}(t_k\mid c)$ is estimated above with *positional independence* assumption as $\hat{P}(t\mid c)$.
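
The two formulas above can be sketched directly (assuming Python 3). The function names and the toy counts are hypothetical; `vocab_size` plays the role of $B$ and `class_counts` holds the $T_{ct}$ values for one class.

```python
import math

def smoothed_conditional(term, class_counts, vocab_size):
    """Add-one estimate: (T_ct + 1) / (sum over t' of T_ct' + B)."""
    return (class_counts.get(term, 0) + 1) / (sum(class_counts.values()) + vocab_size)

def classify(tokens, priors, counts_by_class):
    """Return (c_map, score): the class maximizing log prior plus summed log conditionals."""
    vocab = set()
    for class_counts in counts_by_class.values():
        vocab.update(class_counts)
    B = len(vocab)  # vocabulary size over all classes
    best_class, best_score = None, float("-inf")
    for c in sorted(counts_by_class):
        score = math.log(priors[c])
        for tok in tokens:
            score += math.log(smoothed_conditional(tok, counts_by_class[c], B))
        if score > best_score:
            best_class, best_score = c, score
    return best_class, best_score

toy_counts = {
    "spam": {"free": 3, "assistance": 1},
    "ham": {"meeting": 3, "assistance": 1},
}
toy_priors = {"spam": 0.5, "ham": 0.5}
print(classify(["free", "assistance"], toy_priors, toy_counts)[0])  # spam
```

With smoothing, a term that never occurred in a class still gets a small positive probability, so the logarithm is always defined.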

In [9]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import math
from sets import Set

n_count, s_count = {}, {}
nSpam, nNormal = 0, 0
counts = []

# scan through each output file from the chunks
for filename in sys.argv[1:]:
    # we first read out the 2 count dictionaries
    with open (filename, "r") as myfile:         
        for line in myfile.readlines():
            cmd = 'counts.append(' + line + ')'
            exec cmd
            
    # we then combine word counts, for non-spam and spam messages, respectively
    for word in counts[0]:
        if word not in n_count:
            n_count[word] = counts[0][word]
        else:
            n_count[word] += counts[0][word]
    
    for word in counts[1]:
        if word not in s_count:
            s_count[word] = counts[1][word]
        else:
            s_count[word] += counts[1][word]
            
    # combine spam and non-spam count
    nNormal += int(counts[2])
    nSpam += int(counts[3])
    
    # pass along the keyword for classification
    keyword = counts[4]
    
    # clear counts for next chunk
    counts = []

# we now estimate NB parameters for the specified word, according to the formula above
testfile = 'enronemail_1h.txt'
print 'Classify messages with key word: ' + keyword
B = len(Set(s_count.keys() + n_count.keys()))
tot_n = sum(n_count.values())
tot_s = sum(s_count.values())
p_word_s = 1.0*((s_count[keyword] if keyword in s_count else 0) + 0) / (tot_s + B) # no smoothing
p_word_n = 1.0*((n_count[keyword] if keyword in n_count else 0) + 0) / (tot_n + B)

# finally we classify the messages which contains the specified word
#### prior probability: same for every message, since it's determined by training data ####
p_s = 1.0*nSpam/(nSpam+nNormal)
p_n = 1.0*nNormal/(nSpam+nNormal)

# print model parameters
print '\n============= Model Parameters ============='
print 'P(spam) = %f' %(p_s)
print 'P(non-spam) = %f' %(p_n)
print 'P(%s|spam) = %f' %(keyword, p_word_s)
print 'P(%s|non-spam) = %f' %(keyword, p_word_n)

#### likelihood: dependent on the frequency of specified word ####
print '\n============= Classification Results ============='
print 'TRUTH \t CLASS \t ID'
with open (testfile, "r") as myfile:  
    for line in myfile.readlines():
        msg = line.lower().split()
        words = msg[2:] # only include words in subject and content
        n_word = sum([1 if keyword in word else 0 for word in words])
        # if the message doesn't contain our keyword, skip it;
        if n_word == 0:
            continue
        #### posterior probability ####
        p_s_word = math.log(p_s) + n_word * math.log(p_word_s)
        p_n_word = math.log(p_n) + n_word * math.log(p_word_n)
        isSpam = True if p_s_word > p_n_word else False        
        # print results
        print ('spam' if int(msg[1]) else 'ham') + '\t' + ('spam' if isSpam else 'ham') + '\t' + msg[0]
        

Overwriting reducer.py


####<span style="color:red">HW1.3 Results: </span>run the NB classifier with keyword 'assistance'; the output file is displayed below:
- **Model parameters**: 
 - prior 
 - likelihood
- **Classification results**: 
 - TRUTH: original label
 - CLASS: filter result
 - ID: message ID

In [10]:
!./pNaiveBayes.sh 2 "assistance"
!cat enronemail_1h.txt.output

Classify messages with key word: assistance

P(spam) = 0.440000
P(non-spam) = 0.560000
P(assistance|spam) = 0.000189
P(assistance|non-spam) = 0.000047

TRUTH 	 CLASS 	 ID
spam	spam	0002.2004-08-01.bg
ham	spam	0004.1999-12-10.kaminski
ham	spam	0005.1999-12-12.kaminski
spam	spam	0010.2001-06-28.sa_and_hp
spam	spam	0011.2001-06-28.sa_and_hp
spam	spam	0013.2004-08-01.bg
spam	spam	0018.2001-07-13.sa_and_hp
spam	spam	0018.2003-12-18.gp


###*HW2.4* Repeat HW2.3 with the following modification: 
- use Laplace plus-one smoothing,
- compare the misclassification error rates for 2.3 versus 2.4 and explain the differences.
   

####Definition of mapper.py remains the same as it still just counts words for both classes

In [11]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
# let's use two dictionaries to hold the word counts for spam and non-spam
n_count, s_count = {}, {}
nSpam, nNormal = 0, 0
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1]
keywords = sys.argv[2].lower()
with open (filename, "r") as myfile:
    for email in myfile.readlines():
        isSpam = email.split('\t')[1] == '1'
        if isSpam:
            nSpam += 1
            for word in email.lower().split()[2:]: # only use subject & content for modeling
                if word not in s_count:
                    s_count[word] = 1
                else:
                    s_count[word] += 1
        else:
            nNormal += 1
            for word in email.lower().split()[2:]: # only use subject & content for modeling
                if word not in n_count:
                    n_count[word] = 1
                else:
                    n_count[word] += 1
print n_count
print s_count
print nNormal
print nSpam
print "'" + keywords + "'"

Overwriting mapper.py


####The definition of reducer.py is modified to handle multiple keywords, whose conditional probabilities are stored in dictionaries

In [14]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import math
from sets import Set

n_count, s_count = {}, {}
nSpam, nNormal = 0, 0
counts = []

# scan through each output file from the chunks
for filename in sys.argv[1:]:
    # we first read out the 2 count dictionaries
    with open (filename, "r") as myfile:         
        for line in myfile.readlines():
            cmd = 'counts.append(' + line + ')'
            exec cmd
            
    # we then combine word counts, for non-spam and spam messages, respectively
    for word in counts[0]:
        if word not in n_count:
            n_count[word] = counts[0][word]
        else:
            n_count[word] += counts[0][word]
    
    for word in counts[1]:
        if word not in s_count:
            s_count[word] = counts[1][word]
        else:
            s_count[word] += counts[1][word]
            
    # combine spam and non-spam count
    nNormal += int(counts[2])
    nSpam += int(counts[3])
    
    # pass along the keyword for classification
    keywords = counts[4].split()
    
    # clear counts for next chunk
    counts = []

testfile = 'enronemail_1h.txt'
print 'Classify messages with keywords: ' + str(keywords)
   
# we now estimate NB parameters for the specified word, according to the formula above
B = len(Set(s_count.keys() + n_count.keys()))
tot_n = sum(n_count.values())
tot_s = sum(s_count.values())

#### prior probability ####
p_s = 1.0*nSpam/(nSpam+nNormal)
p_n = 1.0*nNormal/(nSpam+nNormal)

#### conditional probabilities for words ####
p_word_s, p_word_n = {}, {}
for word in keywords:
    p_word_s[word] = 1.0*((s_count[word] if word in s_count else 0) + 1) / (tot_s + B)
    p_word_n[word] = 1.0*((n_count[word] if word in n_count else 0) + 1) / (tot_n + B)

# finally we classify the messages which contains the specified word
#### print model parameters ####
print '\n============= Model Parameters ============='
print 'P(spam) = %f' %(p_s)
print 'P(non-spam) = %f' %(p_n)
for word in keywords:
    print 'P(%s|spam) = %f' %(word, p_word_s[word])
    print 'P(%s|non-spam) = %f' %(word, p_word_n[word])

#### likelihood: dependent on the frequency of specified word ####
print '\n============= Classification Results ============='
print 'TRUTH \t CLASS \t ID'
with open (testfile, "r") as myfile:  
    for line in myfile.readlines():
        msg = line.lower().split()
        words = msg[2:] # only include words in subject and content
        #### initialize posterior probability ####
        p_s_word = math.log(p_s)
        p_n_word = math.log(p_n)
        
        #### add likelihood for each keyword ####
        n_word = 0
        for key in keywords:
            n_key = sum([1 if key in word else 0 for word in words])
            n_word += n_key
            p_s_word += n_key * math.log(p_word_s[key])
            p_n_word += n_key * math.log(p_word_n[key])
            
        # if the message doesn't contain any keyword, skip it;
        if n_word == 0:
            continue
        isSpam = True if p_s_word > p_n_word else False        
        # print results
        print ('spam' if int(msg[1]) else 'ham') + '\t' + ('spam' if isSpam else 'ham') + '\t' + msg[0]
        

Overwriting reducer.py


####<span style="color:red">HW1.4 Results: </span>run the NB classifier with keywords 'assistance', 'valium' and 'enlargementWithATypo'; the output file is displayed below:

In [15]:
!./pNaiveBayes.sh 2 "assistance valium enlargementWithATypo"
!cat enronemail_1h.txt.output

Classify messages with keywords: ['assistance', 'valium', 'enlargementwithatypo']

P(spam) = 0.440000
P(non-spam) = 0.560000
P(assistance|spam) = 0.000227
P(assistance|non-spam) = 0.000093
P(valium|spam) = 0.000038
P(valium|non-spam) = 0.000047
P(enlargementwithatypo|spam) = 0.000038
P(enlargementwithatypo|non-spam) = 0.000047

TRUTH 	 CLASS 	 ID
spam	spam	0002.2004-08-01.bg
ham	spam	0004.1999-12-10.kaminski
ham	spam	0005.1999-12-12.kaminski
spam	ham	0009.2003-12-18.gp
spam	spam	0010.2001-06-28.sa_and_hp
spam	spam	0011.2001-06-28.sa_and_hp
spam	spam	0013.2004-08-01.bg
spam	ham	0016.2003-12-19.gp
spam	ham	0017.2004-08-01.bg
spam	spam	0018.2001-07-13.sa_and_hp
spam	spam	0018.2003-12-18.gp


###*HW2.5.* Repeat HW2.4. 
- This time, during modeling and classification, ignore tokens with a frequency of less than three (3) in the training set.
- How does it affect the misclassification error of the learnt multinomial Naive Bayes classifier on the training dataset?
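
The pruning step described above can be sketched as a filter applied to the merged counts before the probabilities are estimated (assuming Python 3). The function name `prune_rare_tokens` is hypothetical, and measuring frequency over both classes combined is an assumption; the assignment text does not say whether the threshold applies per class or overall.

```python
from collections import Counter

def prune_rare_tokens(n_count, s_count, min_freq=3):
    """Drop tokens whose combined frequency over both classes is below min_freq.

    Returns the pruned non-spam counts, pruned spam counts, and the kept vocabulary.
    """
    total = Counter(n_count)
    total.update(s_count)  # combined frequency over non-spam and spam
    keep = {tok for tok, freq in total.items() if freq >= min_freq}
    pruned_n = {t: c for t, c in n_count.items() if t in keep}
    pruned_s = {t: c for t, c in s_count.items() if t in keep}
    return pruned_n, pruned_s, keep

n_demo = {"hello": 2, "rare": 1, "meeting": 3}
s_demo = {"hello": 1, "free": 4, "once": 2}
print(prune_rare_tokens(n_demo, s_demo))
```

The same `keep` set would then be used at classification time so that dropped tokens contribute nothing to either class score.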

####Definition of mapper.py remains the same as it still just counts words for both classes

In [16]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
# let's use two dictionaries to hold the word counts for spam and non-spam
n_count, s_count = {}, {}
nSpam, nNormal = 0, 0
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1]
#keywords = sys.argv[2].lower()
with open (filename, "r") as myfile:
    for email in myfile.readlines():
        isSpam = email.split('\t')[1] == '1'
        if isSpam:
            nSpam += 1
            for word in email.lower().split()[2:]: # only use subject & content for modeling
                if word not in s_count:
                    s_count[word] = 1
                else:
                    s_count[word] += 1
        else:
            nNormal += 1
            for word in email.lower().split()[2:]: # only use subject & content for modeling
                if word not in n_count:
                    n_count[word] = 1
                else:
                    n_count[word] += 1
print n_count
print s_count
print nNormal
print nSpam

Overwriting mapper.py


####Definition of reducer.py is modified to consider all present words:

In [17]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import math
from sets import Set

n_count, s_count = {}, {}
nSpam, nNormal = 0, 0
counts = []

# scan through each output file from the chunks
for filename in sys.argv[1:]:
    # we first read out the 2 count dictionaries
    with open (filename, "r") as myfile:         
        for line in myfile.readlines():
            cmd = 'counts.append(' + line + ')'
            exec cmd
            
    # we then combine word counts, for non-spam and spam messages, respectively
    for word in counts[0]:
        if word not in n_count:
            n_count[word] = counts[0][word]
        else:
            n_count[word] += counts[0][word]
    
    for word in counts[1]:
        if word not in s_count:
            s_count[word] = counts[1][word]
        else:
            s_count[word] += counts[1][word]
            
    # combine spam and non-spam count
    nNormal += int(counts[2])
    nSpam += int(counts[3])
    
    # clear counts for next chunk
    counts = []

testfile = 'enronemail_1h.txt'
print 'Classify messages with all words'
   
# we now estimate NB parameters for all present words
allwords = Set(s_count.keys() + n_count.keys())
B = len(allwords)
tot_n = sum(n_count.values())
tot_s = sum(s_count.values())

#### prior probability ####
p_s = 1.0*nSpam/(nSpam+nNormal)
p_n = 1.0*nNormal/(nSpam+nNormal)

#### conditional probabilities for words ####
p_word_s, p_word_n = {}, {}
for word in allwords:
    p_word_s[word] = 1.0*((s_count[word] if word in s_count else 0) + .1) / (tot_s + B) # additive smoothing: this adds 0.1 (not 1) to each count, while the denominator still adds B
    p_word_n[word] = 1.0*((n_count[word] if word in n_count else 0) + .1) / (tot_n + B)

# finally we classify the messages which contains the specified word
#### we won't print model parameters, to save some space ####
#### likelihood: dependent on the frequency of current word ####
print '\n============= Classification Results ============='
print 'TRUTH \t CLASS \t ID'
n_correct = 0
with open (testfile, "r") as myfile:  
    for line in myfile.readlines():
        msg = line.lower().split()
        words = msg[2:] # only include words in subject and content
        #### initialize posterior probability ####
        p_s_word = math.log(p_s)
        p_n_word = math.log(p_n)
        
        #### add likelihood for each keyword ####        
        for key in Set(words):
            n_key = sum([1 if key in word else 0 for word in words])
            p_s_word += n_key * math.log(p_word_s[key])
            p_n_word += n_key * math.log(p_word_n[key])
            
        isSpam = True if p_s_word > p_n_word else False
        n_correct += isSpam == int(msg[1])
        # print results
        print ('spam' if int(msg[1]) else 'ham') + '\t' + ('spam' if isSpam else 'ham') + '\t' + msg[0]

print '\nOur multinomial NB training error: %f' %(1-1.0*n_correct/(nSpam+nNormal))

Overwriting reducer.py


####<span style="color:red">HW1.5 Results: </span>run the NB classifier with all present words; the output file is displayed below:

In [18]:
!./pNaiveBayes.sh 4 "dummy"
!cat enronemail_1h.txt.output

Classify messages with all words

TRUTH 	 CLASS 	 ID
ham	ham	0001.1999-12-10.farmer
ham	ham	0001.1999-12-10.kaminski
ham	ham	0001.2000-01-17.beck
ham	ham	0001.2000-06-06.lokay
ham	ham	0001.2001-02-07.kitchen
ham	ham	0001.2001-04-02.williams
ham	ham	0002.1999-12-13.farmer
ham	ham	0002.2001-02-07.kitchen
spam	spam	0002.2001-05-25.sa_and_hp
spam	spam	0002.2003-12-18.gp
spam	spam	0002.2004-08-01.bg
ham	ham	0003.1999-12-10.kaminski
ham	ham	0003.1999-12-14.farmer
ham	ham	0003.2000-01-17.beck
ham	ham	0003.2001-02-08.kitchen
spam	spam	0003.2003-12-18.gp
spam	spam	0003.2004-08-01.bg
ham	ham	0004.1999-12-10.kaminski
ham	ham	0004.1999-12-14.farmer
ham	ham	0004.2001-04-02.williams
spam	spam	0004.2001-06-12.sa_and_hp
spam	spam	0004.2004-08-01.bg
ham	ham	0005.1999-12-12.kaminski
ham	ham	0005.1999-12-14.farmer
ham	ham	0005.2000-06-06.lokay
ham	ham	0005.2001-02-08.kitchen
spam	spam	0005.2001-06-23.sa_and_hp
spam	spam	0005.2003-12-18.gp
ham	ham	0006.1999-12-13.kaminski
h

###*HW2.6.* Benchmark your code with the Python SciKit-Learn implementation of the multinomial Naive Bayes algorithm
- Run the Multinomial Naive Bayes algorithm (using default settings) from SciKit-Learn over the same training data used in HW2.5 and report the misclassification error (please note some data preparation might be needed to get the Multinomial Naive Bayes algorithm from SciKit-Learn to run over this dataset)
- Prepare a table to present your results, where rows correspond to the approach used (SciKit-Learn versus your Hadoop implementation) and the column presents the training misclassification error
- Explain/justify any differences in terms of training error rates over the dataset in HW2.5 between your Multinomial Naive Bayes implementation (in Map Reduce) versus the Multinomial Naive Bayes implementation in SciKit-Learn 


In [19]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer

import csv
import numpy as np

# read email message, and organize training data
with open('enronemail_1h.txt', 'r') as f:
    reader = csv.reader(f, delimiter="\t")
    emails = list(reader)
train_label = [msg[1] for msg in emails]
train_data = [msg[2] + ' ' + msg[3] if len(msg)==4 else msg[2] for msg in emails] # join subject and body with a space so the boundary words do not fuse
msg_id = [msg[0].lower() for msg in emails]

# feature vectorization
uniVectorizer = CountVectorizer()
dtmTrain = uniVectorizer.fit_transform(train_data) 

# multinomial Naive Bayes Classifier from sklearn
mnb = MultinomialNB()
mnb.fit(dtmTrain, train_label)
pred_mnb = mnb.predict(dtmTrain)
training_error_mnb = 1.0*sum(pred_mnb != train_label) / len(train_label)

# Bernoulli Naive Bayes Classifier from sklearn
bnb = BernoulliNB()
bnb.fit(dtmTrain, train_label)
pred_bnb = bnb.predict(dtmTrain)
training_error_bnb = 1.0*sum(pred_bnb != train_label) / len(train_label)

# multinomial Naive Bayes Classifier from HW1.5
!./pNaiveBayes.sh 4 "dummy"

# load results from HW1.5 and generate comparison matrix
print 'TRUTH \t MNB_HW1.5 \t MNB_SK \t BNB_SK \t ID'
with open ('enronemail_1h.txt.output', "r") as myfile:  
    for line in myfile.readlines():
        if line.startswith('ham') or line.startswith('spam'):
            result = line.split()            
            idx = msg_id.index(result[2])
            result.insert(2, 'spam' if pred_mnb[idx]=='1' else 'ham')
            result.insert(3, 'spam' if pred_bnb[idx]=='1' else 'ham')
            print str.join('\t', result)
            
        if line.startswith('Our multinomial NB'):
            print '\n' + line.strip('\n')     

print 'SK- multinomial NB training error: %f' %training_error_mnb
print 'SK- Bernoulli   NB training error: %f' %training_error_bnb

TRUTH 	 MNB_HW1.5 	 MNB_SK 	 BNB_SK 	 ID
ham	ham	ham	ham	0001.1999-12-10.farmer
ham	ham	ham	ham	0001.1999-12-10.kaminski
ham	ham	ham	ham	0001.2000-01-17.beck
ham	ham	ham	ham	0001.2000-06-06.lokay
ham	ham	ham	ham	0001.2001-02-07.kitchen
ham	ham	ham	ham	0001.2001-04-02.williams
ham	ham	ham	ham	0002.1999-12-13.farmer
ham	ham	ham	ham	0002.2001-02-07.kitchen
spam	spam	spam	ham	0002.2001-05-25.sa_and_hp
spam	spam	spam	spam	0002.2003-12-18.gp
spam	spam	spam	ham	0002.2004-08-01.bg
ham	ham	ham	ham	0003.1999-12-10.kaminski
ham	ham	ham	ham	0003.1999-12-14.farmer
ham	ham	ham	ham	0003.2000-01-17.beck
ham	ham	ham	ham	0003.2001-02-08.kitchen
spam	spam	spam	ham	0003.2003-12-18.gp
spam	spam	spam	ham	0003.2004-08-01.bg
ham	ham	ham	ham	0004.1999-12-10.kaminski
ham	ham	ham	ham	0004.1999-12-14.farmer
ham	ham	ham	ham	0004.2001-04-02.williams
spam	spam	spam	spam	0004.2001-06-12.sa_and_hp
spam	spam	spam	ham	0004.2004-08-01.bg
ham	ham	ham	ham	0005.1999-12-12.kaminski
ham	ham	ham	ham	0005.1999-12-14.farmer
ham	

###*HW 2.6.1. OPTIONAL* (note this exercise is a stretch HW and optional)
- Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) over the same training data used in HW2.6 and report the misclassification error 
- Discuss the performance differences in terms of misclassification error rates over the dataset in HW2.5 between the Multinomial Naive Bayes implementation in SciKit-Learn and the Bernoulli Naive Bayes implementation in SciKit-Learn. Why are the differences so large? Explain.
- Which approach to Naive Bayes would you recommend for SPAM detection? Justify your selection.


###*HW2.7. OPTIONAL* (note this exercise is a stretch HW and optional)

The Enron SPAM data in the folder enron1-Training-Data-RAW is in raw text form, with subfolders for SPAM and HAM that contain raw email messages in the following form:

- Line 1 contains the subject
- The remaining lines contain the body of the email message.

In Python, write a script to produce a TSV file called train-Enron-1.txt that has a format similar to that of the enronemail_1h.txt file you have been using so far. Please pay attention to funky characters and tabs. Check your resulting formatted email data in Excel and in Python (e.g., count the number of fields in each row, the number of SPAM emails, and the number of HAM emails). Does each row correspond to an email record with four values? Note: use "NA" to denote empty field values.
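
The per-email formatting step can be sketched as below (assuming Python 3). The field order (ID, label, subject, body) follows how the notebook's code reads enronemail_1h.txt; the function names are hypothetical, and stripping a leading `Subject:` prefix is an assumption about the raw files. Walking the SPAM and HAM folders and writing the output file are left out of the sketch.

```python
import re

def clean_field(text):
    """Replace control characters (including tabs and newlines) with spaces,
    collapse whitespace runs, and return "NA" for an empty result."""
    no_control = re.sub(r"[\x00-\x1f\x7f]", " ", text)
    cleaned = re.sub(r"\s+", " ", no_control).strip()
    return cleaned if cleaned else "NA"

def email_to_row(email_id, is_spam, raw_text):
    """Build one TSV row with four fields: ID, label (1 = spam), subject, body."""
    lines = raw_text.splitlines()
    subject = lines[0] if lines else ""
    if subject.lower().startswith("subject:"):  # assumed prefix in the raw files
        subject = subject[len("subject:"):]
    body = "\n".join(lines[1:])
    label = "1" if is_spam else "0"
    return "\t".join([email_id, label, clean_field(subject), clean_field(body)])

row = email_to_row("0001.example", True, "Subject: hi\tthere\nline one\n\nline two")
print(row.split("\t"))
```

Checking that `row.count("\t") == 3` for every row is a quick way to confirm that each record has exactly four values.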

###*HW2.8.*
Using Hadoop Map-Reduce write job(s) to perform the following:
- Train a multinomial Naive Bayes classifier with Laplace plus-one smoothing using the data extracted in HW2.7 (i.e., train-Enron-1.txt). Use all white-space delimited tokens as independent input variables (assume spaces, full stops, and commas as delimiters). Drop tokens with a frequency of less than three (3).
- Test the learnt classifier using enronemail_1h.txt and report the misclassification error rate. Remember to use all white-space delimited tokens as independent input variables (assume spaces, full stops, and commas as delimiters). How do we treat tokens in the test set that do not appear in the training set?
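
One common way to treat test tokens that were never seen in training is simply to skip them, since the model has no evidence about them for either class. The sketch below (assuming Python 3) shows that option only; it is an assumption made for illustration, not the answer prescribed by the assignment. `score_document` and `log_cond` (a dictionary of log conditional probabilities for one class) are hypothetical names.

```python
import math

def score_document(tokens, log_prior, log_cond):
    """Sum log probabilities for one class, skipping out-of-vocabulary tokens.

    Returns (score, skipped), where skipped counts tokens absent from log_cond.
    """
    score, skipped = log_prior, 0
    for tok in tokens:
        if tok in log_cond:
            score += log_cond[tok]
        else:
            skipped += 1
    return score, skipped

demo_log_cond = {"free": math.log(0.5)}
print(score_document(["free", "unseen", "free"], math.log(0.5), demo_log_cond))
```

Because the same tokens are skipped for every class, the comparison between class scores is unaffected by them.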

###*HW2.8.1.*
- Run  both the Multinomial Naive Bayes and the Bernoulli Naive Bayes algorithms from SciKit-Learn (using default settings) over the same training data used in HW2.8 and report the misclassification error on both the training set and the testing set
- Prepare a table to present your results, where rows correspond to the approach used (SciKit-Learn Multinomial NB; SciKit-Learn Bernoulli NB; your Hadoop implementation) and the columns present the training misclassification error and the misclassification error on the test data set
- Discuss the performance differences in terms of misclassification error rates over the test and training datasets by the different implementations. Which approach (Bernoulli versus Multinomial) would you recommend for SPAM detection? Justify your selection.

