#DATASCI W261: Machine Learning at Scale

#Assignment: Week 1

- Juanjo Carin
- [juanjose.carin@ischool.berkeley.edu](mailto:juanjose.carin@ischool.berkeley.edu)
- W261-2
- Week 01
- Submission date: 9/15/2015

#HW1.0

##HW1.0.0

**Define big data. Provide an example of a big data problem in your domain of expertise.**

*Big Data* refers to data sets so large or complex that traditional data-processing applications are inadequate to handle them (mainly in terms of processing, storing, or transferring them, but also with regard to data curation, search, visualization, security, privacy, and so on).

I come from the B2B sales domain (test & measurement solutions for energy, transportation, optics & electronics, etc.), where the number of transactions, even on a worldwide basis, was fairly low (and hence easy to handle). So I will use an example from Marketing & Advertising, one of the areas I feel most attracted to: *sentiment analysis* applied to branding. Say we want to know the feelings and opinions of a given company's customers, based on what those customers tweet: we would want to know not only how many times the brand is mentioned (a mere word count, albeit over millions of tweets per day) and by whom, but also the content (the meaning) of the tweets containing that brand's name. That involves a very complex task (natural language processing, in multiple languages) over a tremendous amount of data.

##HW1.0.1

**In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, the irreducible error for a test dataset T when using polynomial regression models of degree 1, 2, 3, 4, 5 are considered. How would you select a model?**

(If we use the test set $T$ to select a model, from a collection of them built from a training set, we might also refer to that dataset as *validation set* or *development set*; the term *test set* commonly refers to a third dataset used not to select but to assess or evaluate the model that is finally chosen.)

The expected prediction error in the test dataset $T$ could be decomposed as the sum of some irreducible error, the squared bias of the model $g$ we've built (based on training data), and the variance:

$$E\left[\left(g(x^*)-y^* \right )^2 \right ]=E\left[\left(y^*-f(x^*)\right )^2 \right ]+\left(E\left[g(x^*)-f(x^*)\right] \right )^2+E\left[\left(g(x^*)-E\left(g(x^*) \right ) \right )^2 \right ]$$

If we were facing a **theoretical example** (a **deterministic process** whose function $f$ we know; something like *force (y) = mass (x) × acceleration* or *space (y) = speed × time (x)*), we would be given that function $f$, as well as the $(x^*, y^*)$ points in $T$ and the function $g$ given by each of the five polynomial regression models, so we would just need to apply the formula above to each of those models, obtaining the irreducible error, the bias, and the variance for each one.

> To estimate the variance we need to calculate $E[g(x^*)]$, which implies different models (versions of $g$) depending on the points $x^*$. To do that we would need to split the test set into several datasets, for example using **bootstrapping**. This is commonly done with a training set, when we're building the models (with the test set we just assess the overall prediction error).

The model we should select (the best among the five) is the one that yields the lowest prediction error (the sum of those three components). But if we knew $f$ (the mechanism that generates $y$ from $x$), there would be no need for a model $g$.
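For the theoretical case, the decomposition can be estimated by simulation. The sketch below is only an illustration under stated assumptions: the true function (`true_f`, a sine), the noise level `sigma`, the fixed grid of training points, and the helper names `fit_poly`, `predict`, and `decompose` are all hypothetical. Because $f$ is known here, the sketch draws fresh noisy training sets from the process instead of bootstrapping a single one, and it fits the polynomials with plain normal equations to stay dependency-free.

```python
import math
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python)."""
    p = degree + 1
    # Normal equations (A^T A) c = A^T y, with A[i][j] = xs[i] ** j
    ata = [[sum(x ** (r + c) for x in xs) for c in range(p)] for r in range(p)]
    aty = [sum((x ** r) * y for x, y in zip(xs, ys)) for r in range(p)]
    m = [row + [b] for row, b in zip(ata, aty)]  # augmented matrix
    # Gaussian elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, p):
            factor = m[r][col] / m[col][col]
            for k in range(col, p + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution
    coef = [0.0] * p
    for r in reversed(range(p)):
        s = sum(m[r][k] * coef[k] for k in range(r + 1, p))
        coef[r] = (m[r][p] - s) / m[r][r]
    return coef

def predict(coef, x):
    return sum(c * x ** j for j, c in enumerate(coef))

def decompose(f, degree, sigma=0.3, n_train=30, n_rep=200, seed=0):
    """Return (squared bias, variance, irreducible error) averaged over x*."""
    rng = random.Random(seed)
    xs = [-1 + 2.0 * i / (n_train - 1) for i in range(n_train)]
    x_test = [-0.9 + 1.8 * i / 19 for i in range(20)]
    preds = []  # preds[b][j] = g_b(x_test[j]) for the b-th training set
    for _ in range(n_rep):
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        coef = fit_poly(xs, ys, degree)
        preds.append([predict(coef, x) for x in x_test])
    bias2 = variance = 0.0
    for j, x in enumerate(x_test):
        column = [row[j] for row in preds]
        mean_g = sum(column) / n_rep                  # estimate of E[g(x*)]
        bias2 += (mean_g - f(x)) ** 2
        variance += sum((g - mean_g) ** 2 for g in column) / n_rep
    return bias2 / len(x_test), variance / len(x_test), sigma ** 2

def true_f(x):
    # Assumed "known" process, chosen only for illustration
    return math.sin(math.pi * x)

results = {d: decompose(true_f, d) for d in range(1, 6)}
for d in sorted(results):
    print("degree %d: bias^2=%.4f variance=%.4f noise=%.4f" % ((d,) + results[d]))
```

Summing the three numbers for each degree gives an estimate of the expected prediction error, which is the quantity to compare across the five models.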

In a **practical case** we do not know $f$, and hence we can estimate neither the noise nor the bias (because $f$ appears in both expressions); i.e., we cannot isolate the irreducible error ($\varepsilon^*=y^*-f(x^*)$) or the difference between the true function $f$ and its approximation $g$ (but we can estimate the sum of those two terms as the difference between the expected prediction error and the variance, as explained later).

That wouldn't matter anyway; the information we have (the $g$ functions given by each model, as well as the points $(x^*, y^*) \in T$) would be enough to select the best model: as mentioned, we are interested in the model that minimizes the expected prediction error (i.e., the one that gives values of $g(x^*)$ closer to the values of $y^*$, on average), so we just have to estimate that error (the MSE) for each one of the five $g$ functions, and select the one that yields the lowest value.
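The selection rule just described amounts to a few lines of code. In this minimal sketch the candidate models and the points in `T` are made up for illustration (they are not fitted to any real data), and `mse` and `select_model` are hypothetical helper names.

```python
def mse(g, points):
    """Mean squared error of model g over a list of (x, y) pairs."""
    return sum((g(x) - y) ** 2 for x, y in points) / float(len(points))

def select_model(models, points):
    """Return (name, error) of the candidate with the lowest MSE on points."""
    errors = {name: mse(g, points) for name, g in models.items()}
    best = min(errors, key=errors.get)
    return best, errors[best]

# Hypothetical fitted models, keyed by polynomial degree
candidates = {
    1: lambda x: 2.0 * x,
    2: lambda x: x ** 2 + 0.1,
    3: lambda x: x ** 3,
}
# Hypothetical validation points (x*, y*)
T = [(0.0, 0.0), (1.0, 1.2), (2.0, 4.1), (3.0, 8.8)]

best_degree, best_error = select_model(candidates, T)
print("selected degree %d (MSE = %.4f)" % (best_degree, best_error))
```

Only the $g$ functions and the points in $T$ are needed; $f$ never appears.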

We could also estimate the variance, because it only depends on $g(x^*)$, which is given by each model (that we've supposedly built with some training dataset; as mentioned above, splitting the training set into several datasets would allow us to estimate $E[g(x^*)]$, so we would have to do the same with the test set if we want to estimate the variance for it). This way we might estimate the sum of the irreducible error and the squared bias as the difference between the prediction error and the variance. And the expected prediction error reaches a minimum because:

- the variance increases with the complexity of the model, i.e., with the degree of the polynomial function,
- the irreducible error is independent of the model we use (its mean value is constant for a given test set, regardless of the function $g$ that we use), and
- the bias (and hence its squared value) decreases as the complexity of the model increases.

The so-called *bias vs. variance tradeoff* implies that, as the complexity of the model increases (the higher the degree of the polynomial function, in our case), one component of the prediction error (the bias) decreases, another one (the variance) increases, and the third one (the irreducible noise) remains constant. So there may be a minimum (an optimal complexity level, so to speak) where the bias has already decreased enough while the variance has not increased too much yet. Of course, that minimum might lie outside the 1–5 range we're considering in this example, in which case we would not reach it and the polynomial function of degree 5 would be the best approximation; or there may be no such minimum, if the variance increases with the model complexity at a higher rate than the bias decreases, in which case the best degree would be 1.

#HW1.1

**Read through the provided control script (pNaiveBayes.sh) and all of its comments. When you are comfortable with their purpose and function, respond to the remaining homework questions below. A simple cell in the notebook with a print statement with a "done" string will suffice here.**

In [1]:
### HW1.1 ###
# Read through the provided control script (pNaiveBayes.sh) and all of its 
    # comments.
# When you are comfortable with their purpose and function, respond to the 
    # remaining homework questions below.
# A simple cell in the notebook with a print statement with a "done" string 
    # will suffice here.

def hw1_1():
    print 'done'

hw1_1()

done


#HW1.2

**Provide a mapper/reducer pair that, when executed by `pNaiveBayes.sh` will determine the number of occurrences of a single, user-specified word. Examine the word “assistance” and report your results.**

I started with the same approach used in Quiz 1.12.2, with different cells for each part of the code:

In [2]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
count_words = 0 # count of occurrences of the word
count_emails = 0 # count of emails containing the word
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1]
findword = sys.argv[2]
with open(filename, "r") as f:
    for line in f.readlines():
        # Count IF the word appears in a line (email)
        if re.search(findword,line,re.IGNORECASE):
            count_emails += 1
        # Count HOW MANY TIMES the word appears in a line (email)
        for w in WORD_RE.findall(line):
            if findword.lower() == w.lower():
                count_words += 1                
print str(count_words) + '\t' + str(count_emails)

Overwriting mapper.py


As seen above, I counted not only the occurrences of the word but also the emails in which that word appears (for instance, `"assistance"` appears 10 times, in 8 emails: it appears 3 times in one of the emails).

I could also have counted the occurrences of the word **per email**, as well as printed the category of each email, hence being able to re-use this mapper in HW1.3, but I preferred to keep things simple here, and increase the complexity of all mapper/reducer pairs in each section.
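The difference between the two counts can be checked on a few in-memory lines. This is a minimal sketch: the `count_word` helper and the sample records are made up for illustration and are not taken from the Enron data set. Unlike the mapper above, it decides whether an email contains the word with the same token match that counts occurrences, rather than with `re.search`.

```python
import re

WORD_RE = re.compile(r"[\w']+")

def count_word(lines, findword):
    """Return (occurrences of findword, number of lines containing it)."""
    target = findword.lower()
    count_words = 0
    count_emails = 0
    for line in lines:
        # Case-insensitive comparison of whole tokens
        hits = sum(1 for w in WORD_RE.findall(line) if w.lower() == target)
        count_words += hits
        if hits:
            count_emails += 1
    return count_words, count_emails

# Hypothetical records in the id<TAB>category<TAB>subject<TAB>body layout
sample = [
    "id1\t0\tneed assistance\tAssistance is on the way, assistance!",
    "id2\t1\toffer\tno match here",
    "id3\t0\thelp\tthanks for the assistance",
]

print("%d occurrences in %d emails" % count_word(sample, "assistance"))
```

Here the word occurs four times but only in two emails, because one record mentions it three times.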

In [3]:
# Modify permissions of the mapper.py file
!chmod a+x mapper.py

In [4]:
%%writefile reducer.py
#!/usr/bin/python
import sys
files = sys.argv[1:] #Accept several arguments (same as mapper outputs)
sum_words = 0 # count of occurrences of the word
sum_emails = 0 # count of emails containing the word
for filename in files:
    with open (filename, "r") as f:
        line = f.readline()
        words_emails = line.split('\t')
        sum_words += int(words_emails[0])
        sum_emails += int(words_emails[1])
print str(sum_words) + '\t' + str(sum_emails)

Overwriting reducer.py


In [5]:
# Modify permissions of the reducer.py file
!chmod a+x reducer.py

In [6]:
# Run pNaiveBayes.sh with word 'assistance' and an arbitrary number of partitions
!./pNaiveBayes.sh 10 'assistance'

In [7]:
### HW1.2 ###
# Report results
with open ('enronemail_1h.txt.output', "r") as f:
    line = f.readline()
    words_emails = line.split('\t')
    sum_words = int(words_emails[0])
    sum_emails = int(words_emails[1])
print 'Number of occurrences of the word \"{0}\":    {1}'.format('assistance', 
    sum_words)
print 'Number of emails containing the word \"{0}\": {1}'.format('assistance', 
    sum_emails)

Number of occurrences of the word "assistance":    10
Number of emails containing the word "assistance": 8


But then I adapted the code to put all of it in a function (called **hw1_2**) that accepts a number of partitions and a word as inputs:

In [8]:
### HW1.2 ###
def hw1_2(n, word):
    # Create mapper.py
        # which counts both occurrences of the word and mails containing that word
    with open('mapper.py', 'w') as f:
        f.write('#!/usr/bin/python\n')
        f.write('import sys\n')
        f.write('import re\n')
        f.write('count_words = 0\n')
        f.write('count_emails = 0\n')
        f.write('WORD_RE = re.compile(r"[\\w\']+")\n')
        f.write('filename = sys.argv[1]\n')
        f.write('findword = sys.argv[2]\n')
        f.write('with open(filename, "r") as f:\n')
        f.write('\tfor line in f.readlines():\n')
        f.write('\t\tif re.search(findword,line,re.IGNORECASE):\n')
        f.write('\t\t\tcount_emails += 1\n')        
        f.write('\t\tfor w in WORD_RE.findall(line):\n')
        f.write('\t\t\tif findword.lower() == w.lower():\n')
        f.write('\t\t\t\tcount_words +=1\n')
        f.write('print str(count_words) + '"'\t'"' + str(count_emails)\n')
    import subprocess
    output = subprocess.check_output(['bash','-c', 'chmod a+x mapper.py'])
    
    # Create reducer.py
    with open('reducer.py', 'w') as f:
        f.write('#!/usr/bin/python\n')
        f.write('import sys\n')
        f.write('files = sys.argv[1:]\n')
        f.write('sum_words = 0\n')
        f.write('sum_emails = 0\n')
        f.write('for filename in files:\n')
        f.write('\twith open (filename, "r") as f:\n')
        f.write('\t\tline = f.readline()\n')
        f.write('\t\twords_emails = line.split("\\t")\n')
        f.write('\t\tsum_words += int(words_emails[0])\n')
        f.write('\t\tsum_emails += int(words_emails[1])\n')
        f.write('print str(sum_words) + '"'\t'"' + str(sum_emails)\n')
    output = subprocess.check_output(['bash','-c', 'chmod a+x reducer.py'])
    
    # Call pNaiveBayes.sh
    bashCommand = './pNaiveBayes.sh ' + str(n) + ' ' + word
    output = subprocess.check_output(['bash','-c', bashCommand])
    
    # Report results
    with open ('enronemail_1h.txt.output', "r") as myfile:
        line = myfile.readline()
        words_emails = line.split('\t')
        sum_words = int(words_emails[0])
        sum_emails = int(words_emails[1])
    print 'Number of occurrences of the word \"{0}\":    {1}'.format(word, 
        sum_words)
    print 'Number of emails containing the word \"{0}\": {1}'.format(word, 
        sum_emails)
    
# Call the function with word 'assistance' and an arbitrary number of partitions
hw1_2(10, 'assistance')

Number of occurrences of the word "assistance":    10
Number of emails containing the word "assistance": 8


In the remainder of the assignment, I kept the definitions of `mapper.py` and `reducer.py` outside the definition of the function `hw1_x`.

#HW1.3

**Provide a mapper/reducer pair that, when executed by `pNaiveBayes.sh`, will classify the email messages by a single, user-specified word. Examine the word “assistance” and report your results. To do so, make sure that**

- **`mapper.py` is same as in part (2), and**
- **`reducer.py` performs a single word Naive Bayes classification.**

I modified `mapper.py` to include occurrences of the word per email, as well as the category of each email.

In [9]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
total_ham = 0 # count of ham emails
total_spam = 0 # count of spam emails
total_word_ham = 0 # count of word in ham emails
total_word_spam = 0 # count of word in spam emails
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1] # 1st argument is a file portion
findword = sys.argv[2] # 2nd argument is a single word
with open(filename, "r") as myfile:
    for line in myfile.readlines(): # for each line/email in the file
        word_count = 0 # count of word in the email
        id = line.split("\t")[0]
        category = line.split("\t")[1]
        content = ' '.join(line.strip().split("\t")[2:])
            # We search the word in both the subject and the content
                # because one or the other may not exist, but the way the data are
                # stored we don't know which one may be missing
        for w in WORD_RE.findall(content):
            if findword.lower() == w.lower():
                word_count += 1
        if int(category) == 0: # increase count of emails in ham or spam 
            # as well as the occurrences of the word in each category
            total_ham += 1
            total_word_ham += word_count
        elif int(category) == 1:
            total_spam += 1
            total_word_spam += word_count
        print id + '\t' + category + '\t' + str(word_count)
            # Output one line per mail, with id, category, and occurrences of word
# Print 2 additional lines, with count of emails and total occurrences of the 
    # word, in each category
print 'ham' + '\t' + str(total_ham) + '\t' + str(total_word_ham)
print 'spam' + '\t' + str(total_spam) + '\t' + str(total_word_spam)

Overwriting mapper.py


The last 2 lines that each mapper outputs could have been easily calculated in the reducer using all the previous lines in each mapper's output.

However, those totals (the count of emails and of occurrences of a fixed vocabulary in each partition) are all that is needed to train the Naive Bayes model, both in this section and in the next (HW1.4). The information about every single email is also passed so that the reducer can evaluate the model on the training set that was used to build it.

Of course, in HW1.5—where we have to find all words present in all emails—these last 2 lines that summarize each mapper's output will not be necessary, and the reducer will have to combine the words found in the emails of each partition.

In [10]:
%%writefile reducer.py
#!/usr/bin/python
import sys
from math import log
files = sys.argv[1:] #Accept several arguments (same as mapper outputs)
total_ham = 0 # total count of ham emails
total_spam = 0 # total count of spam emails
total_word_ham = 0 # total count of word in ham emails
total_word_spam = 0 # total count of word in spam emails
email_list = []
for filename in files: # for each file passed to the reducer
    with open(filename, "r") as f:
        for line in f: # place all lines in each mapper's output into a list
            email_list.append(line)
    # Extract last 2 entries of the list, which correspond to totals
    spam = email_list.pop()
    spam = spam.split('\t')
    ham = email_list.pop()
    ham = ham.split('\t')
    # Increase total count of emails and word occurrences per category
    total_ham += int(ham[1])
    total_word_ham += int(ham[2])
    total_spam += int(spam[1])
    total_word_spam += int(spam[2])
prob_ham = float(total_ham)/(total_ham+total_spam) # PRIORS
prob_spam = 1 - prob_ham
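# CONDITIONAL LIKELIHOODS (with Laplace smoothing)
# NOTE: the vocabulary here is the single user-specified word, so each 
    # smoothed likelihood is (n + 1) / (n + 1) = 1 and its logarithm is 0: 
    # the posteriors below depend only on the priors
prob_word_ham = float(total_word_ham + 1) / (total_word_ham + 1)
prob_word_spam = float(total_word_spam + 1) / (total_word_spam + 1)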

# Assess classification with the training set 
errors = 0 # count of misclassification errors
for i in range(len(email_list)): # for each line in the mappers' outputs 
    # Extract category
    category = int(email_list[i].split("\t")[1])
    # Extract count of occurrences of the word
    word_count = int(email_list[i].split("\t")[2])
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + word_count*log(prob_word_ham,10)
    prob_spam_word = log(prob_spam,10) + word_count*log(prob_word_spam,10)
    # The right side of the equations are not equal to prob_category_word, but 
        # to log(prob_category_word) - log(prob_word) (where prob_word is the 
        # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        predicted_category = 1
    else:
        predicted_category = 0
    # Output category and predicted category for each email
    print email_list[i].split("\t")[0] + '\t' + str(category) + '\t' + \
        str(predicted_category) 
    if predicted_category != category:
        # Increase count of errors if the predicted category is wrong
        errors += 1
training_error = float(errors) / (total_ham+total_spam)
print training_error # output the training error

Overwriting reducer.py


In [11]:
### HW1.3 ###
# Report results
def hw1_3(n, word):
    # Call pNaiveBayes.sh
    bashCommand = './pNaiveBayes.sh ' + str(n) + ' ' + word
    import subprocess
    output = subprocess.check_output(['bash','-c', bashCommand])
    
    # Report results
        # Just the last line, containing the training error
    training_error = !(tail -n 1 enronemail_1h.txt.output | cut -d' ' -f1)
    print 'Training error of the classifier only using the word \"{0}\": {1}'.\
        format(word, training_error[0])
    
# Call the function with word 'assistance' and an arbitrary number of partitions
hw1_3(10, 'assistance')

Training error of the classifier only using the word "assistance": 0.44
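As a sanity check on the reducer above, the following sketch reproduces its scoring formulas in a standalone function (`classify_single_word` is a hypothetical name, and the totals passed to it are invented). With a one-word vocabulary the smoothed likelihood is always 1, so the prediction does not depend on how often the word appears; every email receives the class with the larger prior.

```python
from math import log

def classify_single_word(total_ham, total_spam, total_word_ham,
                         total_word_spam, word_count):
    """Mirror the HW1.3 reducer formulas: return 1 (spam) or 0 (ham)."""
    prob_ham = float(total_ham) / (total_ham + total_spam)
    prob_spam = 1 - prob_ham
    # One-word vocabulary: smoothed likelihoods are (n + 1) / (n + 1) = 1
    prob_word_ham = float(total_word_ham + 1) / (total_word_ham + 1)
    prob_word_spam = float(total_word_spam + 1) / (total_word_spam + 1)
    score_ham = log(prob_ham, 10) + word_count * log(prob_word_ham, 10)
    score_spam = log(prob_spam, 10) + word_count * log(prob_word_spam, 10)
    return 1 if score_spam > score_ham else 0

# Hypothetical totals: 56 ham and 44 spam emails, arbitrary word totals
predictions = [classify_single_word(56, 44, 2, 8, k) for k in range(5)]
print(predictions)
```

Under these assumed totals every email is labelled ham, whatever its word count. The training error of such a classifier is simply the share of the minority class, which would be in line with the 0.44 reported above if 44% of the emails are spam.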


#HW1.4

**Provide a mapper/reducer pair that, when executed by `pNaiveBayes.sh`, will classify the email messages by a list of one or more user-specified words. Examine the words “assistance”, “valium”, and “enlargementWithATypo” and report your results. To do so, make sure that**

- **`mapper.py` counts all occurrences of a list of words, and**
- **`reducer.py` performs the multiple-word Naive Bayes classification via the chosen list.**

In [12]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
from operator import add
total_ham = 0 # count of ham emails
total_spam = 0 # count of spam emails
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1] # 1st argument is a file portion
findwords = sys.argv[2] # 2nd argument is a list of words
findwords = findwords.split(' ')
vocab_size = len(findwords) # (fixed) size of the vocabulary: 
    # the count of words in the list
total_word_ham = [0] *  vocab_size # count of each word in ham emails
total_word_spam = [0] * vocab_size # count of each word in spam emails
with open(filename, "r") as myfile:
    for line in myfile.readlines(): # for each line/email in the file
        id = line.split("\t")[0]
        word_count= [0] * vocab_size
        category = line.split("\t")[1]
        content = ' '.join(line.strip().split("\t")[2:])
            # We search the words in both the subject and the content
                # because one or the other may not exist, but the way the data are  
                # stored we don't know which one may be missing
        # For each word in the list, find its occurrences
        for i in range(vocab_size):
            for w in WORD_RE.findall(content):
                if findwords[i].lower() == w.lower():
                    word_count[i] += 1
        if int(category) == 0: # increase count of emails in ham or spam 
            # as well as the occurrences of each word in each category
            total_ham += 1
            total_word_ham = map(add, total_word_ham, word_count)
        elif int(category) == 1:
            total_spam += 1
            total_word_spam = map(add, total_word_spam, word_count)
        word_count_str = [str(x) for x in word_count]
        print id + '\t' + category + '\t' + '\t'.join(word_count_str)
            # Output one line per mail, with id, category, and occurrences of each
                # word
# Print 2 additional lines, with count of emails and total occurrences of each 
    # word, in each category, plus total count of words in the list
total_word_ham_str = [str(x) for x in total_word_ham] 
total_word_ham_str.append(str(sum(total_word_ham)))
total_word_spam_str = [str(x) for x in total_word_spam]
total_word_spam_str.append(str(sum(total_word_spam)))
print 'ham' + '\t' + str(total_ham) + '\t' + '\t'.join(total_word_ham_str)
print 'spam' + '\t' + str(total_spam) + '\t' + '\t'.join(total_word_spam_str)

Overwriting mapper.py


In [13]:
%%writefile reducer.py
#!/usr/bin/python
import sys
from math import log
from operator import add
files = sys.argv[1:] #Accept several arguments (same as mapper outputs)
total_ham = 0 # total count of ham emails
total_spam = 0 # total count of spam emails

email_list = []
# Get the size of the vocabulary used from the first line from the first mapper's 
    # output
with open(sys.argv[1], "r") as myfile:
    first_line = myfile.readline()
    vocab_size = len(first_line.split('\t')) - 2 # exclude id and category

total_word_ham = [0] * vocab_size # total count of word in ham emails
total_word_spam = [0] * vocab_size # total count of word in spam emails

for filename in files: # for each file passed to the reducer
    with open(filename, "r") as f:
        for line in f: # place all lines in each mapper's output into a list
            email_list.append(line)
    # Extract last 2 entries of the list, which correspond to totals
    spam = email_list.pop()
    spam = [int(x) for x in spam.split('\t')[1:]]
    ham = email_list.pop()
    ham = [int(x) for x in ham.split('\t')[1:]]
    # Increase total count of emails and word occurrences per category
    total_ham += int(ham[0])
    total_word_ham = map(add, total_word_ham, ham[1:(1+vocab_size)])
    total_spam += int(spam[0])
    total_word_spam = map(add, total_word_spam, spam[1:(1+vocab_size)])
# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = [float(x+1) / (sum(total_word_ham)+vocab_size) for x \
                 in total_word_ham]
prob_word_spam = [float(x+1) / (sum(total_word_spam)+vocab_size) for \
                  x in total_word_spam]

# Assess classification with the training set 
errors = 0 # count of misclassification errors
for i in range(len(email_list)): # for each line in the mappers' outputs
    category = int(email_list[i].split("\t")[1]) # extract category
    # Extract count of occurrences of each word
    word_count = [int(x) for x in email_list[i].split("\t")[2:]]
     
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + \
        sum([x*log(y,10) for (x,y) in zip(word_count,prob_word_ham)])
    prob_spam_word = log(prob_spam,10) + \
        sum([x*log(y,10) for (x,y) in zip(word_count,prob_word_spam)])
    # The right side of the equations are not equal to prob_category_word, but 
        # to log(prob_category_word) - log(prob_word) (where prob_word is the 
        # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        predicted_category = 1
    else:
        predicted_category = 0
    # Output category and predicted category for each email
    print email_list[i].split("\t")[0] + '\t' + str(category) + '\t' + \
        str(predicted_category) 
    if predicted_category != category:
        # Increase count of errors if the predicted category is wrong
        errors += 1
training_error = float(errors) / (total_ham+total_spam)
print training_error # output the training error

Overwriting reducer.py


In [14]:
### HW1.4 ###
# Report results
def hw1_4(n, list_of_words):
    # Call pNaiveBayes.sh
    bashCommand = './pNaiveBayes.sh ' + str(n) + ' \'' + \
        ' '.join(list_of_words) + '\''
    import subprocess
    output = subprocess.check_output(['bash','-c', bashCommand])
    
    # Report results
        # Just the last line, containing the training error
    training_error = !(tail -n 1 enronemail_1h.txt.output | cut -d' ' -f1)
    print 'Training error of the classifier using the words {0}: {1}'.\
        format(', '.join(list_of_words), training_error[0])

# Call the function with the 3 words, and an arbitrary number of partitions
hw1_4(10, ['assistance', 'valium', 'enlargementWithATypo'])

Training error of the classifier using the words assistance, valium, enlargementWithATypo: 0.41
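The smoothing and scoring steps of the reducer above can be followed on a tiny worked example. This is a sketch only: the per-class totals and priors are invented (they are not the Enron figures), and `train`, `log_posterior`, and `classify` are hypothetical helper names. The likelihoods follow the same formula as the reducer, $(count + 1) / (\text{total count in class} + \text{vocabulary size})$.

```python
from math import log

def train(counts_by_class, vocab_size):
    """Laplace-smoothed P(word | class) from per-class word-count vectors."""
    likelihoods = {}
    for label, counts in counts_by_class.items():
        denom = float(sum(counts) + vocab_size)
        likelihoods[label] = [(c + 1) / denom for c in counts]
    return likelihoods

def log_posterior(prior, likelihood, word_count):
    """Unnormalised log-posterior: log prior + sum of count * log likelihood."""
    return log(prior, 10) + sum(n * log(p, 10)
                                for n, p in zip(word_count, likelihood))

def classify(priors, likelihoods, word_count):
    scores = {label: log_posterior(priors[label], likelihoods[label], word_count)
              for label in priors}
    return max(scores, key=scores.get)

# Hypothetical totals for a 3-word vocabulary; the third word never appears
totals = {'ham': [8, 0, 0], 'spam': [2, 3, 0]}
priors = {'ham': 0.5, 'spam': 0.5}
likelihoods = train(totals, vocab_size=3)

print(classify(priors, likelihoods, [1, 0, 0]))  # email with one first word
print(classify(priors, likelihoods, [0, 1, 0]))  # email with one second word
```

Thanks to the added 1 in the numerator, a word that never appears in the training data (such as a misspelled one) still gets a non-zero likelihood in both classes, so the logarithms stay finite.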


#HW1.5

**Provide a mapper/reducer pair that, when executed by `pNaiveBayes.sh`, will classify the email messages by all words present. To do so, make sure that**

- **mapper.py counts all occurrences of all words, and**
- **reducer.py performs a word-distribution-wide Naive Bayes classification.**

In [15]:
%%writefile mapper.py
#!/usr/bin/python
import sys
import re
from operator import add
total_ham = 0 # count of ham emails
total_spam = 0 # count of spam emails
WORD_RE = re.compile(r"[\w']+")
filename = sys.argv[1] # 1st argument is a file portion

# Learn the vocabulary
vocabulary = []
count_email = 0 # count of lines in file passed to mapper
with open(filename, "r") as myfile:
    for line in myfile.readlines(): # for each line/email in the file
        count_email += 1
        content = ' '.join(line.strip().split("\t")[2:])
            # We search the words in both the subject and the content
                # because one or the other may not exist, but the way the data are 
                # stored we don't know which one may be missing
        for w in WORD_RE.findall(content):
            if w.lower() not in vocabulary:
                vocabulary.append(w.lower())
# Print vocabulary in the first line of the mapper's output
print 'id' + '\t' + 'category' + '\t' + '\t'.join(sorted(vocabulary))
vocab_size = len(vocabulary)
word_count = [{key: 0 for key in vocabulary} for x in range(count_email)]

# Find occurrences per email                
with open(filename, "r") as myfile:
    for (i,line) in enumerate(myfile.readlines()): # for each line in the file
        id = line.split("\t")[0]
        category = line.split("\t")[1]
        content = ' '.join(line.strip().split("\t")[2:])
        for k in vocabulary:
            for w in WORD_RE.findall(content):
                if w.lower() == k:
                    word_count[i][k] = word_count[i][k] + 1
        if int(category) == 0: # increase count of emails in ham or spam 
            # as well as the occurrences of each word in each category
            total_ham += 1
        elif int(category) == 1:
            total_spam += 1
        word_count_str = [str(word_count[i][k]) for k \
            in sorted(word_count[i].keys())]
        print id + '\t' + category + '\t' + '\t'.join(word_count_str)
            # Output one line per mail, with id, category, and occurrences of each
                # word

Overwriting mapper.py


In [16]:
%%writefile reducer.py
#!/usr/bin/python
import sys
import re
from math import log
WORD_RE = re.compile(r"[\w']+")
files = sys.argv[1:] #Accept several arguments (same as mapper outputs)
total_ham = 0 # total count of ham emails
total_spam = 0 # total count of spam emails

email_list = []
count_emails = -1*len(files) # got to discount 1st lines, with vocabulary
# Get the full vocabulary
vocabulary = []
count_email = 0 # count of total lines/emails
for filename in files: # for each file passed to the reducer
    with open(filename, "r") as f:
        first_line = f.readline()
        first_line = ' '.join(first_line.strip().split("\t")[2:])
        for w in WORD_RE.findall(first_line):
            if w.lower() not in vocabulary:
                vocabulary.append(w)
    count_emails += sum(1 for line in open(filename))
vocab_size = len(vocabulary)
email_list = []
total_word_ham = {key: 0 for key in vocabulary}
total_word_spam = {key: 0 for key in vocabulary}
# Reconstruct dictionaries
for filename in files: # for each file passed to the reducer
    with open(filename, "r") as f:
        first_line = f.readline()
        partial_vocabulary = first_line.strip().split("\t")[2:]
        for line in f.readlines(): # for each line/email in the file
            word_count = {key: 0 for key in vocabulary}
            id = line.strip().split("\t")[0]
            category = line.strip().split("\t")[1]
            content = line.strip().split("\t")[2:]
            if int(category) == 0: # increase count of emails in ham or spam
                total_ham += 1
                for i in range(len(content)):
                    total_word_ham[partial_vocabulary[i]] = \
                        total_word_ham[partial_vocabulary[i]] + int(content[i])
                    word_count[partial_vocabulary[i]] = int(content[i])                    
            elif int(category) == 1:
                total_spam += 1
                for i in range(len(content)):
                    total_word_spam[partial_vocabulary[i]] = \
                        total_word_spam[partial_vocabulary[i]] + int(content[i])
                    word_count[partial_vocabulary[i]] = int(content[i])
            email_list.append(id+'\t'+category+'\t'+
                '\t'.join([str(x) for x in word_count.values()]))

# PRIORS
prob_ham = float(total_ham)/(total_ham+total_spam)
prob_spam = 1 - prob_ham
# CONDITIONAL LIKELIHOODS
prob_word_ham = [float(x+1) / (sum(total_word_ham.values()) + vocab_size) for x \
    in total_word_ham.values()]
prob_word_spam = [float(x+1) / (sum(total_word_spam.values()) + vocab_size) for x \
    in total_word_spam.values()]

# Assess classification with the training set 
errors = 0 # count of misclassification errors
for i in range(len(email_list)): # for each line in the mappers' outputs
    category = int(email_list[i].split("\t")[1]) # extract category
    # Extract count of occurrences
    word_count = [int(x) for x in email_list[i].split("\t")[2:]] 
    # POSTERIORS
    prob_ham_word = log(prob_ham,10) + sum([x*log(y,10) for (x,y) \
        in zip(word_count,prob_word_ham)])
    prob_spam_word = log(prob_spam,10) + sum([x*log(y,10) for (x,y) \
        in zip(word_count,prob_word_spam)])
        # The right side of the equations are not equal to prob_category_word, but 
            # to log(prob_category_word) - log(prob_word) (where prob_word is the 
            # EVIDENCE). It's OK since we only want to compare the POSTERIORS
    if prob_spam_word > prob_ham_word: # classify as spam if posterior is higher
        predicted_category = 1
    else:
        predicted_category = 0
    # Output category and predicted category for each email
    print email_list[i].split("\t")[0] + '\t' + str(category) + '\t' + \
        str(predicted_category) 
    if predicted_category != category:
        # Increase count of errors if the predicted category is wrong
        errors += 1
training_error = float(errors) / (total_ham+total_spam)
print training_error # output the training error

Overwriting reducer.py
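
The scoring step of the reducer can be illustrated in isolation. What follows is a minimal sketch, not the reducer itself: the toy counts, the function names `train` and `log_scores`, and the labels (0 = ham, 1 = spam) are assumptions made for illustration. It applies the same add-one smoothing and the same base-10 log comparison used above.

```python
from math import log10

def train(word_counts, doc_counts):
    """Estimate priors and add-one-smoothed word likelihoods per class.

    word_counts: {label: {word: total occurrences in that class}}
    doc_counts:  {label: number of documents in that class}
    """
    vocabulary = sorted(set(w for counts in word_counts.values() for w in counts))
    n_docs = float(sum(doc_counts.values()))
    priors = dict((c, doc_counts[c] / n_docs) for c in doc_counts)
    likelihoods = {}
    for c, counts in word_counts.items():
        # Denominator: words seen in the class plus the vocabulary size
        denominator = float(sum(counts.values()) + len(vocabulary))
        likelihoods[c] = dict((w, (counts.get(w, 0) + 1) / denominator)
                              for w in vocabulary)
    return priors, likelihoods

def log_scores(document, priors, likelihoods):
    """Return log10(prior) + sum(count * log10(likelihood)) for each class.

    The evidence term is omitted because it is identical for every class.
    Words outside the training vocabulary are skipped.
    """
    scores = {}
    for c in priors:
        scores[c] = log10(priors[c]) + sum(
            n * log10(likelihoods[c][w])
            for w, n in document.items() if w in likelihoods[c])
    return scores

# Toy training data (hypothetical): 0 = ham, 1 = spam
word_counts = {0: {'meeting': 3, 'report': 2},
               1: {'free': 4, 'money': 3, 'report': 1}}
doc_counts = {0: 2, 1: 2}
priors, likelihoods = train(word_counts, doc_counts)

document = {'free': 2, 'money': 1}  # word counts of one email
scores = log_scores(document, priors, likelihoods)
predicted = max(scores, key=scores.get)  # class with the highest score
print(predicted)
```

With these toy counts the email is assigned to class 1 (spam), since "free" and "money" were only seen in spam.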


In [17]:
### HW1.5 ###
# Report results
def hw1_5(n):
    # Call pNaiveBayes.sh
    bashCommand = './pNaiveBayes.sh ' + str(n)
    import subprocess
    output = subprocess.check_output(['bash','-c', bashCommand])
    
    # Report results
        # Just the last line, containing the training error
    training_error = !(tail -n 1 enronemail_1h.txt.output | cut -d' ' -f1)
    print 'Training error of the classifier using all words in the training set: \
        {0}'.format(training_error[0])
    
# Call the function with an arbitrary number of partitions
hw1_5(10)

Training error of the classifier using all words in the training set:         0.0


#HW1.6

**Benchmark your code with the Python SciKit-Learn implementation of Naive Bayes**

**Let's define $\text{Training error} = \text{misclassification rate with respect to a training set}$. It is more formally defined here:**

**Let $DF$ represent the training set in the following:**

$$Err(Model, DF) = \frac{|\{(X, c(X)) \in DF : c(X) \neq Model(X)\}|}{|DF|}$$

**Where $|\cdot|$ denotes set cardinality; $c(X)$ denotes the class of the tuple $X$ in $DF$; and $Model(X)$ denotes the class inferred by the model $Model$.**

- **Run the Multinomial Naive Bayes algorithm (using default settings) from SciKit-Learn over the same training data used in HW1.5 and report the Training error (please note some data preparation might be needed to get the Multinomial Naive Bayes algorithm from SciKit-Learn to run over this dataset)**
- **Run the Bernoulli Naive Bayes algorithm from SciKit-Learn (using default settings) over the same training data used in HW1.5 and report the Training error**
- **Run the Multinomial Naive Bayes algorithm you developed for HW1.5 over the same data used in HW1.5 and report the Training error**

**Please prepare a table to present your results**
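
The definition of training error given above can be sketched in a few lines of plain Python. The function name `misclassification_rate` and the toy labels are hypothetical; they are only meant to make the formula concrete.

```python
def misclassification_rate(true_labels, predicted_labels):
    """Fraction of observations whose predicted class differs from the true one."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError('Both lists must have the same length')
    if not true_labels:
        raise ValueError('The dataset must not be empty')
    # Numerator of Err(Model, DF): number of tuples with c(X) != Model(X)
    errors = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    # Denominator: |DF|
    return float(errors) / len(true_labels)

# Toy example (hypothetical labels): one of five emails is misclassified
true_labels = [0, 0, 1, 1, 0]
predicted_labels = [0, 1, 1, 1, 0]
error = misclassification_rate(true_labels, predicted_labels)
print(error)  # 0.2
```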

In [18]:
### HW1.6 ###  
import subprocess
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.feature_extraction.text import *
from sklearn.metrics import accuracy_score
    
def hw1_6(namefile, num_chunks):
    ### SKLEARN ###
    
    content=[]
    category=[]
    f = open(namefile)
    for line in f.readlines():
        category.append(line.strip().split("\t")[1])
        #content.append(line.strip().split("\t")[-1])
        # Uncomment line above and comment line below to use only the content and  
            # not the subject of each email
        # (That yields a 0.02 training error using MultinomialNB and 0.19 using 
            # BernoulliNB, instead of 0.00 and 0.16, respectively)
        content.append(' '.join(line.strip().split("\t")[2:]))
    f.close() # close the file once, after all lines have been read
    category = map(int, category)
    CV = CountVectorizer()
    feature_vectors = CV.fit_transform(raw_documents=content)
    feature_strings = CV.get_feature_names()
    #print 'SKLEARN: Average number of (non-zero) features per observation/email: \
    #    {0:.2f}'.format(feature_vectors.nnz / float(len(content)))
    #print 'SKLEARN: Some random words that appear in the emails:\n\t{0}'.\
    #    format(np.random.choice(feature_strings, 5, replace=False))
    
    ### SKLEARN MULTINOMIAL NB ###
    MultinomialNB_model = MultinomialNB()
    MultinomialNB_model.fit(feature_vectors, category)
    pred_category_sk_multinomialNB = MultinomialNB_model.predict(feature_vectors)
    training_error_sk_multinomialNB = 1 - \
        accuracy_score(category, pred_category_sk_multinomialNB)
    #errors = np.sum([category[i] != pred_category_sk_multinomialNB[i] \
    #    for i in range(len(content))], dtype='f')
    #print 'Training error: {0:.2f}'.format(errors/len(content))
    
    ### SKLEARN BERNOULLI NB ###
    BernoulliNB_model = BernoulliNB()
    BernoulliNB_model.fit(feature_vectors, category)
    pred_category_sk_BernoulliNB = BernoulliNB_model.predict(feature_vectors)
    training_error_sk_BernoulliNB = 1 - accuracy_score(category, 
        pred_category_sk_BernoulliNB)
    #errors = np.sum([category[i] != pred_category_sk_BernoulliNB[i] \
    #    for i in range(len(content))], dtype='f')
    #print 'Training error: {0:.2f}'.format(errors/len(content))
    
    ### HW1.5: POOR MAN'S MAPREDUCE IMPLEMENTATION OF MULTINOMIAL NB ###
    # Call pNaiveBayes.sh
    bashCommand = './pNaiveBayes.sh ' + str(num_chunks)
    output = subprocess.check_output(['bash','-c', bashCommand])
    # Report results
        # Just the last line, containing the training error
    training_error_mr_multinomialNB = \
        !(tail -n 1 enronemail_1h.txt.output | cut -d' ' -f1)

    # Present results
    classifier =[' MapReduce Multinomial', ' SKLearn Multinomial', 
        ' SKLearn Bernoulli']
    print
    print '|Classifier     |{}|{}|{}|'.format(*classifier)
    print '---------------------------------------------------------------------'\
        '-----------'
    values = ['Training Error ', float(training_error_mr_multinomialNB[0]), 
        training_error_sk_multinomialNB, training_error_sk_BernoulliNB]
    print '|{}|{:22.2f}|{:20.2f}|{:18.2f}|'.format(*values)

hw1_6('enronemail_1h.txt', 10)


|Classifier     | MapReduce Multinomial| SKLearn Multinomial| SKLearn Bernoulli|
--------------------------------------------------------------------------------
|Training Error |                  0.00|                0.00|              0.16|


- **Explain/justify any differences in terms of training error rates over the dataset in HW1.5 between your Multinomial Naive Bayes implementation (in Map Reduce) versus the Multinomial Naive Bayes implementation in SciKit-Learn**

The results are the same for both implementations: neither of them misclassifies any observation (email) from the training set. That might be because:

1. both extract exactly the same words from the emails, or
2. the dataset is not large enough to detect any differences.

I would say the first possibility is the correct one: once the vocabulary has been extracted, the maths behind both implementations should be exactly the same, and hence so should their results.
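
To make the "same maths" argument concrete: the reducer in HW1.5 estimates each conditional likelihood with add-one smoothing and, to my understanding, SciKit-Learn's `MultinomialNB` with its default `alpha=1.0` uses the same estimate. The symbols below are notation introduced here, not taken from the assignment: $N_{wc}$ is the number of occurrences of word $w$ in the emails of class $c$, $n_{wd}$ is the number of occurrences of $w$ in email $d$, and $V$ is the vocabulary.

$$\hat{P}(w \mid c) = \frac{N_{wc} + 1}{\sum_{w' \in V} N_{w'c} + |V|}$$

$$\hat{c}(d) = \arg\max_{c} \left[ \log \hat{P}(c) + \sum_{w \in V} n_{wd} \log \hat{P}(w \mid c) \right]$$

The base of the logarithm (10 in the reducer) does not change the $\arg\max$, so any difference between the two implementations could only come from the vocabulary $V$ or from the counts.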

- **Discuss the performance differences in terms of training error rates over the dataset in HW1.5 between the Multinomial Naive Bayes implementation in SciKit-Learn with the Bernoulli Naive Bayes implementation in SciKit-Learn**

As shown above, the Multinomial NB implementation in SciKit-Learn performs better than the Bernoulli NB on this training set: **100% vs. 84% accuracy**. In general, the Bernoulli model performs better on short documents (as emails are, or at least most of the ones in our sample), but:

1. the Bernoulli model excels when the documents are short *and* the vocabulary is small (and our vocabulary was not small, since we used all words, i.e., no *stopwords* were removed), and

2. according to some authors (see [McCallum and Nigam, 1998](http://www.kamalnigam.com/papers/multinomial-aaaiws98.pdf)) "*another point to consider is that the multinomial event model should be a more accurate classifier for data sets that have a large variance in document length. The multinomial event model naturally handles documents of varying length by incorporating the evidence of each appearing word. The [...] Bernoulli model is a somewhat poor fit for data with varying length, in that it is more likely for a word to occur in a long document regardless of the class.*" And that was our case, with emails whose content ranged from just a few words to more than a thousand.
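
The contrast between the two event models can be sketched in plain Python. The per-class parameters below are made-up numbers and the function names are hypothetical; the only point is that the Bernoulli model sees presence or absence of each vocabulary word, while the multinomial model sees how many times each word occurs.

```python
from math import log

def multinomial_loglik(document, word_probs):
    """Sum of count * log P(word | class) over the words in the document."""
    return sum(n * log(word_probs[w]) for w, n in document.items())

def bernoulli_loglik(document, presence_probs):
    """Sum over the whole vocabulary of log P(present) or log P(absent)."""
    total = 0.0
    for w, p in presence_probs.items():
        present = document.get(w, 0) > 0  # counts are reduced to 0/1
        total += log(p) if present else log(1.0 - p)
    return total

# Hypothetical parameters for a single class
word_probs = {'free': 0.5, 'meeting': 0.2, 'report': 0.3}      # sum to 1
presence_probs = {'free': 0.8, 'meeting': 0.3, 'report': 0.5}  # one per word

short_doc = {'free': 1}   # "free" appears once
long_doc = {'free': 20}   # "free" appears twenty times

bern_short = bernoulli_loglik(short_doc, presence_probs)
bern_long = bernoulli_loglik(long_doc, presence_probs)
multi_short = multinomial_loglik(short_doc, word_probs)
multi_long = multinomial_loglik(long_doc, word_probs)

print(bern_short == bern_long)   # the Bernoulli model ignores repetitions
print(multi_long < multi_short)  # the multinomial model does not
```

In this toy case the Bernoulli log-likelihood is identical for the short and the long document, whereas the multinomial one scales with the number of occurrences, which is why document length matters differently for each model.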