# DATASCI W261: Machine Learning at Scale
## Assignment Week 1
Miki Seltzer (miki.seltzer@berkeley.edu)<br>
W261-2, Spring 2016<br>
Due date 19-Jan-2016

### HW1.0.0: Define big data. Provide an example of a big data problem in your domain of expertise.

"Big data" is a broad term with many interpretations. It does not seem reasonable to fix a concrete size threshold above which data is "big" and below which it is not. A definition that resonates with me is that a data set is big data if it is too large to store on, or process with, a single computer in a reasonable amount of time. This definition stays flexible as hard drives become larger and computers become more powerful.

I currently work at a home security company, so we are constantly generating IoT-type data. Analysis of any of the data we generate from wireless sensors is usually a big data problem, unless we are looking at a tiny subset of the data. One of the most important problems we are currently trying to solve is determining when a home is unoccupied, using data from wireless sensors such as motion detectors and door sensors. It is relatively easy to determine when a home is occupied: we can be fairly certain that someone is home when motion is detected. However, the absence of motion does not always indicate that the home is unoccupied. A simple example is nighttime, when people are usually sleeping.

### HW1.0.1: In 500 words (English or pseudo code or a combination) describe how to estimate the bias, the variance, and the irreducible error for a test dataset T when polynomial regression models of degree 1, 2, 3, 4, 5 are considered. How would you select a model?

To estimate the bias, variance and irreducible error (noise) for a single data set generated from the unknown true function $f(x)$, we first need to generate multiple data sets from our original data set T. We can do this by sampling the original data set with replacement (bootstrapping). If we repeat the bootstrap resample 50 times, we will have 50 data sets to work with.
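
A minimal sketch of the resampling step, assuming T is held in memory as a list of (x, y) pairs. The function name `bootstrap_samples` and the toy data are illustrative only and are not part of the assignment.

```python
import random

def bootstrap_samples(data, n_samples=50, seed=0):
    """Return n_samples resamples of data, each drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    size = len(data)
    return [[rng.choice(data) for _ in range(size)] for _ in range(n_samples)]

# Toy stand-in for T: ten (x, y) pairs (hypothetical values)
T = [(x, 2.0 * x + 1.0) for x in range(10)]
resamples = bootstrap_samples(T)
```

Each resample has the same size as T, and because the draws are made with replacement, some observations typically appear more than once while others are left out.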

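We split each of the 50 new data sets into a training set and a testing set, and fit polynomials of degree 1, 2, 3, 4 and 5 to each training set. This yields 50 models per polynomial degree. For the predictions of these models to be comparable, the testing observations must be the same for every model, so in practice the testing set is held out once and only the training data is resampled. For each polynomial degree, we can then estimate the average variance and bias using the testing set.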

#### Variance estimation
For each observation, $x$, in the testing set, we now have 50 predictions per polynomial degree, which we denote as $y_1, y_2, \ldots, y_{50}$. We find the average of these predictions, denoted as $\bar{y}$. We can find the variance of the predictions using the formula $E((y_i-\bar{y})^2)$, that is, $\frac{1}{50}\sum_{i=1}^{50}(y_i-\bar{y})^2$. We then average the variance over all data points in the testing set, and repeat the process for each polynomial degree. Thus, we will have one average variance per polynomial degree.
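
The variance calculation for one polynomial degree can be sketched as follows, assuming the predictions are stored as one list per model. The function name `average_variance` and the three-model toy input are hypothetical.

```python
def average_variance(predictions):
    """predictions[i][j] is model i's prediction for test observation j."""
    n_models = len(predictions)
    n_obs = len(predictions[0])
    total = 0.0
    for j in range(n_obs):
        column = [predictions[i][j] for i in range(n_models)]
        mean = sum(column) / float(n_models)
        # Variance of the predictions for this single observation
        total += sum((p - mean) ** 2 for p in column) / n_models
    return total / n_obs

# Three hypothetical models, two test observations
preds = [[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]]
avg_var = average_variance(preds)
```

In this toy input the models disagree on the first observation and agree exactly on the second, so only the first contributes to the average.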

#### Bias estimation
For each observation, $x$, we also have the actual value of $y$ in the testing set. The bias of the observation $x$ is the difference between the average prediction and the actual value: $\bar{y}-y$. If we happened to have the true function $f(x)$ from which the data were generated, we would calculate the bias as: $\bar{y}-f(x)$.

We then average the bias over all data points in the testing set, and repeat the process for each polynomial degree. This yields one average bias per polynomial degree. It is also useful to calculate the average squared bias, which is obtained by squaring each observation's bias before averaging over the testing set.
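
A small sketch of the bias calculation, again assuming one list of predictions per model and using the observed $y$ values in place of the unknown $f(x)$. The name `average_bias` and the numbers are made up for illustration.

```python
def average_bias(predictions, actual):
    """Return (mean bias, mean squared bias) over the test observations."""
    n_models = len(predictions)
    biases = []
    for j, y in enumerate(actual):
        mean_pred = sum(model[j] for model in predictions) / float(n_models)
        biases.append(mean_pred - y)
    mean_bias = sum(biases) / len(biases)
    mean_sq_bias = sum(b ** 2 for b in biases) / len(biases)
    return mean_bias, mean_sq_bias

# Three hypothetical models, two test observations
preds = [[1.0, 4.0], [2.0, 4.0], [3.0, 4.0]]
actual = [1.0, 5.0]
mean_bias, mean_sq_bias = average_bias(preds, actual)
```

Here the two per-observation biases are +1 and -1, so the signed average is zero while the average squared bias is one. This is why the squared version is the more informative summary.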

#### Noise estimation
If we do not know the true function $f(x)$, we cannot separate the noise from the bias, so we are forced to make the assumption that the noise is zero. If we knew the true function $f(x)$ that the data set T was generated from, we would be able to estimate the irreducible error as the average squared difference between the observed values and the true function: $E((y-f(x))^2)$.
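
For reference, these three quantities are the terms of the standard decomposition of the expected squared prediction error at a point $x$. The symbol $\hat{f}(x)$ for a fitted model's prediction is introduced here for illustration, and the decomposition assumes $y = f(x) + \varepsilon$ with zero-mean noise that is independent of the fitted model:

$$
E\big[(y-\hat{f}(x))^2\big] = \underbrace{\big(E[\hat{f}(x)]-f(x)\big)^2}_{\text{bias}^2} + \underbrace{E\big[(\hat{f}(x)-E[\hat{f}(x)])^2\big]}_{\text{variance}} + \underbrace{E\big[(y-f(x))^2\big]}_{\text{noise}}
$$

In the notation used above, $E[\hat{f}(x)]$ is estimated by the average prediction $\bar{y}$.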

#### Model selection
We know that there is a trade-off between bias and variance. As the model gets more complex, bias generally decreases, while variance generally increases. We can plot the sum of the squared bias and the variance, and choose the degree where the sum is minimized.
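
The selection rule can be sketched in a few lines, assuming the per-degree estimates are already available as dictionaries keyed by degree. The numbers below are invented to show the shape of the trade-off and are not results computed from T.

```python
def select_degree(sq_bias_by_degree, variance_by_degree):
    """Return the degree minimizing squared bias + variance."""
    totals = {d: sq_bias_by_degree[d] + variance_by_degree[d]
              for d in sq_bias_by_degree}
    return min(totals, key=totals.get)

# Hypothetical estimates for degrees 1 through 5 (not results from T)
sq_bias = {1: 0.90, 2: 0.40, 3: 0.10, 4: 0.08, 5: 0.07}
variance = {1: 0.05, 2: 0.08, 3: 0.15, 4: 0.40, 5: 0.90}
best = select_degree(sq_bias, variance)
```

With these made-up values the sum is smallest at degree 3, where the falling bias curve and the rising variance curve balance.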


### HW1.1: Read through the provided control script (pNaiveBayes.sh)

In [39]:
print "Done"

Done


In [52]:
!perl -pi -e 's/\r/\n/g' enronemail_1h.txt

Can't remove enronemail_1h.txt: Text file busy, skipping file.


We will need to use the pNaiveBayes.sh file multiple times during this homework assignment, so let's make sure that it is written to our working directory.

In [40]:
%%writefile pNaiveBayes.sh
## pNaiveBayes.sh
## Author: Jake Ryland Williams
## Usage: pNaiveBayes.sh m wordlist
## Input:
##       m = number of processes (maps), e.g., 4
##       wordlist = a space-separated list of words in quotes, e.g., "the and of"
##
## Instructions: Read this script and its comments closely.
##               Do your best to understand the purpose of each command,
##               and focus on how arguments are supplied to mapper.py/reducer.py,
##               as this will determine how the python scripts take input.
##               When you are comfortable with the unix code below,
##               answer the questions on the LMS for HW1 about the starter code.

## collect user input
m=$1 ## the number of parallel processes (maps) to run
wordlist=$2 ## if set to "*", then all words are used

## a test data set of 100 messages
data="enronemail_1h.txt" 

## the full set of data (33746 messages)
# data="enronemail.txt" 

## 'wc' determines the number of lines in the data
## 'perl -pe' regex strips the piped wc output to a number
linesindata=`wc -l $data | perl -pe 's/^.*?(\d+).*?$/$1/'`

## determine the lines per chunk for the desired number of processes
linesinchunk=`echo "$linesindata/$m+1" | bc`

## split the original file into chunks by line
split -l $linesinchunk $data $data.chunk.

## assign python mappers (mapper.py) to the chunks of data
## and emit their output to temporary files
for datachunk in $data.chunk.*; do
    ## feed word list to the python mapper here and redirect STDOUT to a temporary file on disk
    ####
    ####
    ./mapper.py $datachunk "$wordlist" > $datachunk.counts &
    ####
    ####
done
## wait for the mappers to finish their work
wait

## 'ls' makes a list of the temporary count files
## 'perl -pe' regex replaces line breaks with spaces
countfiles=`\ls $data.chunk.*.counts | perl -pe 's/\n/ /'`

## feed the list of countfiles to the python reducer and redirect STDOUT to disk
####
####
./reducer.py $countfiles > $data.output
####
####

## clean up the data chunks and temporary count files
# \rm $data.chunk.*

Overwriting pNaiveBayes.sh


In [41]:
!chmod a+x pNaiveBayes.sh
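
To make the chunking arithmetic in the control script concrete, here is a rough Python mirror of the `bc` and `split -l` steps. The function `chunk_sizes` is a hypothetical helper written for this illustration, and the line counts are example inputs rather than measurements of the data file.

```python
def chunk_sizes(lines_in_data, m):
    """Mirror the script's arithmetic: integer-divide, add one, then split."""
    # Same as `echo "$linesindata/$m+1" | bc` (bc truncates the division)
    lines_in_chunk = lines_in_data // m + 1
    sizes = []
    remaining = lines_in_data
    while remaining > 0:
        # `split -l` fills each chunk and leaves a shorter final chunk
        size = min(lines_in_chunk, remaining)
        sizes.append(size)
        remaining -= size
    return sizes

sizes = chunk_sizes(100, 4)   # 100 lines is a hypothetical count
```

For 100 lines and `m=4` this gives chunks of 26, 26, 26 and 22 lines. Because of the `+1`, the number of chunks can be smaller than `m`: 9 lines with `m=4` yields only three chunks of 3 lines each, so only three mappers would run.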

### HW1.2: Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh, will determine the number of occurrences of a single, user-specified word.

Examine the word “assistance” and report your results.

To do so, make sure that

- mapper.py counts all occurrences of a single word, and
- reducer.py collates the counts of the single word.



In [42]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW1.2

import sys
import re

# Collect input: the chunk file to read and the space-separated word list
filename = sys.argv[1]
findwords = sys.argv[2].lower().split()

# Running total of occurrences for each requested word
wordcount = {}

with open(filename, "rU") as myfile:
    for line in myfile:
        # Format each line, fields separated by \t according to enronemail_README.txt
        fields = re.split("\t", line)

        # For each word in list provided by user, count occurrences in subj and body
        # Note: str.count matches substrings and is case-sensitive
        for word in findwords:
            if word not in wordcount:
                wordcount[word] = 0
            wordcount[word] += fields[2].count(word) + fields[3].count(word)

# Emit one [word, count] pair per line for the reducer
for word in wordcount:
    print [word, wordcount[word]]

Overwriting mapper.py
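
The mapper above relies on `str.count`, which counts substring matches and is case-sensitive. The following sketch contrasts that with whole-word, case-insensitive counting. It uses a made-up sentence rather than text from the Enron data, and `count_whole_word` is a hypothetical helper, not part of the submitted mapper.

```python
import re

def count_whole_word(text, word):
    """Count case-insensitive occurrences of word bounded by non-word characters."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

text = "Assistance! Financial assistance and more Assistance; no assistances."
substring_count = text.count("assistance")                 # misses "Assistance", hits "assistances"
whole_word_count = count_whole_word(text, "assistance")    # hits both capitalizations, skips the plural
```

Which behavior is appropriate depends on how the assignment defines a word occurrence, so the two counts can legitimately differ on the same text.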


In [43]:
%%writefile reducer.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW1.2

import sys
from ast import literal_eval

filenames = sys.argv[1:]

wordcount = {}

# Each mapper outputs one [word, count] pair per line
# For each file and each word, sum the counts
for countfile in filenames:
    with open(countfile, "rU") as myfile:
        for line in myfile:
            # literal_eval parses the list literal without executing arbitrary code
            word, count = literal_eval(line.strip())
            if word not in wordcount:
                wordcount[word] = 0
            wordcount[word] += count

# Emit the word and its total count, separated by a single tab
for word in wordcount:
    print "%s\t%d" % (word, wordcount[word])

Overwriting reducer.py


In [44]:
# Change file permissions
!chmod a+x mapper.py
!chmod a+x reducer.py

In [45]:
!./pNaiveBayes.sh 4 "assistance"

In [46]:
def print_counts():
    with open("enronemail_1h.txt.output", "r") as myfile:
        print "{:<15s}{:3s}".format("word", "count")
        print "----------------------"
        for line in myfile:
            pair = line.split("\t")
            word = pair[0]
            count = int(pair[1])
            print "{:<15s}{:3d}".format(word, count)

#### Output of HW1.2

In [47]:
print_counts()

word           count
----------------------
assistance      10


### HW1.3: Provide a mapper/reducer pair that, when executed by pNaiveBayes.sh, will classify the email messages by a single, user-specified word using the multinomial Naive Bayes formulation.

Examine the word “assistance” and report your results. To do so, make sure that mapper.py and reducer.py perform a single word Naive Bayes classification. For multinomial Naive Bayes, the Pr(X=“assistance”|Y=SPAM) is calculated as follows: 

$$
\frac{\text{number of times "assistance" occurs in SPAM labeled documents}}{\text{the number of words in documents labeled SPAM}}
$$
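
A minimal sketch of this estimate and the resulting single-word classification, assuming each document is a (label, text) pair with label '1' for spam and '0' for ham, and that words are separated by whitespace. The toy documents and the function names `train` and `classify` are hypothetical, and no smoothing is applied, so a class in which the word never occurs gets probability zero.

```python
import math

def train(docs, word):
    """Return (prior, conditional) dicts keyed by class label."""
    n_docs, n_words, n_hits = {}, {}, {}
    for label, text in docs:
        tokens = text.lower().split()
        n_docs[label] = n_docs.get(label, 0) + 1
        n_words[label] = n_words.get(label, 0) + len(tokens)
        n_hits[label] = n_hits.get(label, 0) + tokens.count(word)
    total = float(sum(n_docs.values()))
    prior = {c: n_docs[c] / total for c in n_docs}
    # Occurrences of the word in the class / all words in the class
    cond = {c: n_hits[c] / float(n_words[c]) for c in n_docs}
    return prior, cond

def classify(text, word, prior, cond):
    """Pick the class maximizing log prior + count * log Pr(word | class)."""
    k = text.lower().split().count(word)
    scores = {}
    for c in prior:
        if k > 0 and cond[c] == 0.0:
            scores[c] = float("-inf")   # unsmoothed zero probability
        else:
            scores[c] = math.log(prior[c]) + (k * math.log(cond[c]) if k else 0.0)
    return max(scores, key=scores.get)

# Hypothetical toy corpus: two spam ('1') and two ham ('0') messages
docs = [("1", "urgent assistance needed send assistance now"),
        ("1", "win money now"),
        ("0", "meeting moved to monday"),
        ("0", "thanks for your assistance with the report")]
prior, cond = train(docs, "assistance")
label = classify("need assistance", "assistance", prior, cond)
```

In this toy corpus the spam class has 9 words with 2 occurrences of "assistance" and the ham class has 11 words with 1 occurrence, so a message containing the word is labeled spam. A message that does not contain the word is scored on the class priors alone.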

In [27]:
%%writefile mapper.py
#!/usr/bin/python
## mapper.py
## Author: Miki Seltzer
## Description: mapper code for HW1.3

import sys
import re

# Collect input: the chunk file to read and the space-separated word list
filename = sys.argv[1]
findwords = sys.argv[2].lower().split()

# Occurrence counts keyed by (document ID, spam label, word)
word_count = {}

with open(filename, "rU") as myfile:
    for line in myfile:
        # Format each line, fields separated by \t according to enronemail_README.txt
        fields = re.split("\t", line)

        # For each word in list provided by user, count occurrences in subj and body
        for word in findwords:
            my_key = (fields[0], fields[1], word)
            if my_key not in word_count:
                word_count[my_key] = 0
            word_count[my_key] += fields[2].count(word) + fields[3].count(word)

# Emit one [(doc_id, label, word), count] pair per line for the reducer
for key in word_count:
    print [key, word_count[key]]

Overwriting mapper.py


In [24]:
%%writefile reducer.py
#!/usr/bin/python
## reducer.py
## Author: Miki Seltzer
## Description: reducer code for HW1.3

import sys

# Count files written by the mappers, passed as command-line arguments
filenames = sys.argv[1:]



Writing reducer.py


In [28]:
# Change file permissions
!chmod a+x mapper.py
!chmod a+x reducer.py

In [30]:
!./pNaiveBayes.sh 4 "assistance"

In [162]:
from ast import literal_eval

filenames = ['enronemail_1h_edit.txt.chunk.aa.counts',
             'enronemail_1h_edit.txt.chunk.ab.counts',
             'enronemail_1h_edit.txt.chunk.ac.counts',
             'enronemail_1h_edit.txt.chunk.ad.counts']

doc_ids = set()
class_counts = {}
word_counts = {}

for countfile in filenames:
    with open(countfile, "r") as myfile:
        for line in myfile:
            # Each line is a [(doc_id, label, word), count] pair from the mapper
            (doc_id, spam, word), count = literal_eval(line.strip())
            doc_ids.add(doc_id)
            # With a single search word there is one line per document
            class_counts[spam] = class_counts.get(spam, 0.0) + 1
            word_counts[word] = word_counts.get(word, 0.0) + int(count)

# Documents from the spam sources (GP, BG, SA_and_HP) carry label '1'; ham is '0'
prior_spam = class_counts['1'] / len(doc_ids)
prior_ham = class_counts['0'] / len(doc_ids)




0.56


In [51]:
import re
findwords = ['assistance']
word_count = {}

with open("enronemail_1h.txt", "rU") as myfile:
    for line in myfile:
        # Format each line, fields separated by \t according to enronemail_README.txt
        fields = re.split("\t", line)
        
        # For each word in list provided by user, count occurrences in subj and body
        for word in findwords:
            my_key = (fields[0], fields[1], word)
            if my_key not in word_count:
                word_count[my_key] = 0 
            word_count[my_key] += fields[2].count(word) + fields[3].count(word)

for key in word_count:
    print [key, word_count[key]]

[('0012.1999-12-14.farmer', '0', 'assistance'), 0]
[('0007.2000-01-17.beck', '0', 'assistance'), 0]
[('0003.1999-12-14.farmer', '0', 'assistance'), 0]
[('0001.2000-01-17.beck', '0', 'assistance'), 0]
[('0016.2003-12-19.GP', '1', 'assistance'), 0]
[('0013.2001-04-03.williams', '0', 'assistance'), 0]
[('0013.2001-06-30.SA_and_HP', '1', 'assistance'), 0]
[('0001.2001-02-07.kitchen', '0', 'assistance'), 0]
[('0009.2001-02-09.kitchen', '0', 'assistance'), 0]
[('0006.1999-12-13.kaminski', '0', 'assistance'), 0]
[('0018.2003-12-18.GP', '1', 'assistance'), 1]
[('0005.2001-02-08.kitchen', '0', 'assistance'), 0]
[('0016.2001-02-12.kitchen', '0', 'assistance'), 0]
[('0015.1999-12-15.farmer', '0', 'assistance'), 0]
[('0017.2004-08-01.BG', '1', 'assistance'), 0]
[('0010.2001-06-28.SA_and_HP', '1', 'assistance'), 1]
[('0010.1999-12-14.kaminski', '0', 'assistance'), 0]
[('0017.1999-12-14.kaminski', '0', 'assistance'), 0]
[('0013.2004-08-01.BG', '1', 'assistance'), 1]
[('0018.1999-12-14.kaminski', '0'