# DATASCI W261: Machine Learning at Scale
## Assignment Week 5
Miki Seltzer (miki.seltzer@berkeley.edu)<br>
W261-2, Spring 2016<br>
Submission: 

## HW5.0:
### What is a data warehouse?
A data warehouse is a repository for one or multiple data sources. Data warehouses can contain relational databases.

### What is a star schema?
A star schema relates multiple fact and dimension tables, and is similar to the snowflake schema. In both schemas, fact tables are referenced by dimension tables (one or multiple). However, star schemas are denormalized, whereas snowflake schemas are normalized.

### When is it used?
A star schema is used to organize the meta data of a relational database, such as which tables can be joined, and the keys on which they can be joined.

## HW5.1:
### In the database world, what is 3NF?
3NF is shorthand for third normal form. A table is in third normal form if the following conditions hold:
- The table is already in second normal form
- Non-prime attributes of the table are non-transitively dependent on every key in the table

### Does machine learning use data in 3NF?
ML can, but does not always use data in 3NF.

### If so, why?
3NF can save a significant amount of disk space because data duplication is avoided. Additionally, if data is denormalized, then fields in the data set might be related to each other and create dependencies. This may be problematic if we are using algorithms that require independent features.

### In what form does ML consume data?
Typically, ML requires all data to be fed into an algorithm to be collected into a single source. Thus, the easiest way for ML to ingest data is for it to be denormalized.

### Why would one use log files that are denormalized?
If one needs to perform real-time analysis on log files, it may be too time consuming to join normalized log files with other tables. If log files are denormalized, they may not need any further processing (joins) to be fed into other steps of a pipeline.

## HW 5.2: Using MRJob, implement a hashside join (memory-backed map-side) for left, right and inner joins. Run your code on the  data used in HW 4.4.

### Justify which table you chose as left table in this hashside join
The two tables used were:
- **anonymous-msweb-preprocess.data:** The log file of visitors and each page that they visited (processed rows prefixed by 'C' or 'V')
- **attributes.csv:** The page ID, page name and URL of each page (prefixed by 'A' in the original data)

The attributes.csv file was very small (only 294 lines), so this is the file that I chose to store in memory. This became my **right** table.

The log file was much larger, so I streamed through this file, and used it as the **left** table.

In [1]:
# We will need these so we can reload modules as we modify them
%load_ext autoreload
%autoreload 2

In [64]:
%%writefile mapSideJoin.py
from mrjob.job import MRJob
from mrjob.step import MRStep
 
class join(MRJob):
    
    # Specify some custom options so we only have to write one MRJob class for each join
    def configure_options(self):
        super(join, self).configure_options()
        self.add_passthrough_option('--joinType', default='inner', )
    
    # Store attributes.csv into memory
    #  - account for multiple occurrences of keys
    #  - self.pages is dict with a list of [pageName, pageURL] pairs
    # Set joinType variable
    def mapper_init(self):
        self.pages = {}
        with open('attributes.csv','r') as myfile:
            for line in myfile:
                fields = line.strip().split(',')
                if fields[0] not in self.pages:
                    self.pages[fields[0]]=[]
                self.pages[fields[0]].append([fields[1], fields[2]])
        self.joinType = self.options.joinType
        self.seenRight = set()
    
    # RIGHT table is stored in memory (self.pages)
    # LEFT table is streamed
    # We need so keep track of which RIGHT keys we have seen
    def mapper(self, _, line):
        fields = line.split(',')
        key = fields[0]
        self.seenRight.add(key)
        if key in self.pages:
            for i in self.pages[key]:
                yield key, (fields[1], fields[2], i[0], i[1])
        elif self.joinType == 'left':
            yield key, (fields[1], fields[2], None, None)

    # We need to emit all of the RIGHT keys that we never saw while streaming through LEFT
    # We will need to deduplicate these in the reducer in case we have multiple mappers
    def mapper_final(self):
        if self.joinType == 'right':
            for key in self.pages:
                if key not in self.seenRight:
                    for value in self.pages[key]:
                        yield key, (None, None, value[0], value[1])
    
    # Need to persist variables
    def reducer_init(self):
        self.joinType = self.options.joinType
    
    # We need to unpack and emit each record
    # We also need to do some work emitting records for the right join
    def reducer(self, key, values):
        emptyRight = True
        for val in values:
            if self.joinType == 'inner' or self.joinType == 'left':
                yield key, val
            elif self.joinType == 'right':
                if val[:2] != [None]*2:
                    emptyRight = False
                    yield key, val
                else: emptyRecord = val
        if emptyRight and self.joinType == 'right':
            yield key, emptyRecord


if __name__ == '__main__':
    join.run()

Overwriting mapSideJoin.py


In [12]:
from mapSideJoin import join

def runJoin(joinType):

    mr_job = join(args=['TopVisitors.txt', '--file', 'attributes.csv', '--joinType', joinType])
    output = []

    with mr_job.make_runner() as runner: 
        # Run MRJob
        runner.run()

        # Write stream_output to file
        for line in runner.stream_output():
            output.append(mr_job.parse_output_line(line))
    
    return output
            
outInner = runJoin('inner')
outLeft = runJoin('left')
outRight = runJoin('right')



In [13]:
print "Rows resulting from join type:\n"
for joinType in ['inner', 'left', 'right']:
    if joinType == 'inner': out = outInner
    elif joinType == 'left': out = outLeft
    elif joinType == 'right': out = outRight
    
    print "{:7s}{:>4,d}".format(joinType, len(out))


Rows resulting from join type:

inner   285
left    285
right   294


## HW5.3: Do some EDA on this data set using MRJob:
- Longest 5-gram (number of characters)
- Top 10 most frequent words (count), i.e., unigrams
- Most/least densely appearing words (count/pages_count) sorted in decreasing order of relative frequency
- Distribution of 5-gram sizes (counts) sorted in decreasing order of relative frequency

In [78]:
import csv

docs = {}
docs['A'] = ['X']*20
docs['A'].extend(['Y']*30)
docs['A'].extend(['Z']*5)
docs['B'] = ['X']*100
docs['B'].extend(['Y']*20)
docs['C'] = ['M']*5
docs['C'].extend(['N']*20)
docs['C'].extend(['Z']*5)


# Create the file for unit testing
with open('unitTest.txt', 'w') as myfile:
    outWriter = csv.writer(myfile)
    for doc in docs:
        row = [doc]
        row.extend(docs[doc])
        outWriter.writerow(row)

In [2]:
%%writefile MRJob5_3.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class job(MRJob):
    
    # Specify some custom options so we only have to write one MRJob class for each part
    def configure_options(self):
        super(job, self).configure_options()
        self.add_passthrough_option('--part', default='1')
    
    """
    Find the longest 5-gram
    - In this case, in each mapper, we only need to store the length of the longest 5-gram we have seen
    - After the mapper has run, we emit the longest 5-gram from this mapper
    - All results will be sent to the same reducer (we specify this)
    - Then we loop through the records in the reducer and emit the remaining longest 5-gram
    """
    
    def mapper_longest5Gram_init(self):
        self.maxLength = 0
        self.longest5Gram = None
    
    def mapper_longest5Gram(self, _, line):
        fields = line.strip().split('\t')
        if len(fields[0]) > self.maxLength: 
            self.maxLength = len(fields[0])
            self.longest5Gram = fields[0]
            
    def mapper_longest5Gram_final(self):
        yield self.longest5Gram, self.maxLength
     
    def reducer_longest5Gram_init(self):
        self.maxLength = 0
        self.longest5Gram = None
    
    def reducer_longest5Gram(self, key, values):
        for val in values:
            if val > self.maxLength:
                self.maxLength = val
                self.longest5Gram = key
        
    def reducer_longest5Gram_final(self):
        yield self.maxLength, self.longest5Gram
    
    """
    Top 10 most frequent words
    - This is our standard word count
    - Loop through each word in the 5-gram and emit (word, 1)
    """
    
    def mapper_topWords(self, _, line):
        fields = line.strip().split('\t')
        words = fields[0].lower().split()
        count, pages_count, books_count = int(fields[1]), int(fields[2]), int(fields[3])
        for word in words:
            self.increment_counter('total', 'words', 1)
            yield word, 1
        
    def combiner_topWords(self, key, values):
        yield key, sum(values)
        
    def reducer_topWords(self, key, values):
        yield key, sum(values)
 
    """
    Densely appearing words
    - For each word, emit count and pages_count
    - Combiner sums count and pages_count
    - Reducer sums count and pages_count, then emits count/pages_count
    """
    
    def mapper_denseWords(self, _, line):
        fields = line.strip().split('\t')
        words = fields[0].lower().split()
        count, pages_count, books_count = int(fields[1]), int(fields[2]), int(fields[3])
        for word in words:
            yield word, (count, pages_count)
        
    def combiner_denseWords(self, key, values):
        count, pages_count = 0.0, 0.0
        for val in values:
            count += val[0]
            pages_count += val[1]
        yield key, (count, pages_count)
        
    def reducer_denseWords(self, key, values):
        count, pages_count = 0.0, 0.0
        for val in values:
            count += val[0]
            pages_count += val[1]
        yield key, count/pages_count

    """
    Frequent5-grams
    - Use count to determine the most frequent 5-gram
    - Sum counts in combiner and reducer
    """
        
    def mapper_frequent5Gram(self, _, line):
        fields = line.strip().split('\t')
        yield fields[0].lower(), float(fields[1])
        
    def combiner_frequent5Gram(self, key, values):
        yield key, sum(values)
        
    def reducer_frequent5Gram(self, key, values):
        yield key, sum(values)
        
    """
    Sorting functions
    - We need these to get the top and bottom values
    - Utilize only one reducer instead of writing a custom partitioner
    """
    def mapper_sort(self, key, value):
        yield float(value), key
        
    def reducer_sort_init(self):
        self.count = 0
    
    def reducer_top10(self, key, values):
        for val in values:
            if self.count < 10:
                yield key, val
                self.count += 1
                
    def reducer_top100(self, key, values):
        for val in values:
            if self.count < 100:
                yield key, val
                self.count += 1
                
    def reducer_top10000(self, key, values):
        for val in values:
            if self.count < 10000:
                yield key, val
                self.count += 1
    
    """
    Multi-step pipeline definitions
    Based on user input when calling runner function
    """
    def steps(self):
        self.part = self.options.part
        if self.part == '1':
            return [
                MRStep(mapper_init=self.mapper_longest5Gram_init,
                       mapper=self.mapper_longest5Gram,
                       mapper_final=self.mapper_longest5Gram_final,
                       reducer_init=self.reducer_longest5Gram_init,
                       reducer=self.reducer_longest5Gram,
                       reducer_final=self.reducer_longest5Gram_final,
                       jobconf={'mapred.reduce.tasks': 1})
            ]
        elif self.part == '2':
            return [
                MRStep(mapper=self.mapper_topWords,
                       combiner=self.combiner_topWords,
                       reducer=self.reducer_topWords),
                MRStep(mapper=self.mapper_sort,
                       reducer_init=self.reducer_sort_init,
                       reducer=self.reducer_top10,
                       jobconf={'mapred.output.key.comparator.class':'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                                'mapred.text.key.partitioner.options':'-k1,1',
                                'stream.num.map.output.key.fields':1,
                                'mapred.text.key.comparator.options':'-k1,1nr',
                                'mapred.reduce.tasks': 1})
            ]
        elif self.part == '3':
            return [
                MRStep(mapper=self.mapper_denseWords,
                       combiner=self.combiner_denseWords,
                       reducer=self.reducer_denseWords),
                MRStep(mapper=self.mapper_sort,
                       reducer_init=self.reducer_sort_init,
                       reducer=self.reducer_top100,
                       jobconf={'mapred.output.key.comparator.class':'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                                'mapred.text.key.partitioner.options':'-k1,1',
                                'stream.num.map.output.key.fields':1,
                                'mapred.text.key.comparator.options':'-k1,1nr',
                                'mapred.reduce.tasks': 1})
            ]
        elif self.part == '4':
            return [
                MRStep(mapper=self.mapper_frequent5Gram,
                       combiner=self.combiner_frequent5Gram,
                       reducer=self.reducer_frequent5Gram),
                MRStep(mapper=self.mapper_sort,
                       reducer_init=self.reducer_sort_init,
                       reducer=self.reducer_top100,
                       jobconf={'mapred.output.key.comparator.class':'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                                'mapred.text.key.partitioner.options':'-k1,1',
                                'stream.num.map.output.key.fields':1,
                                'mapred.text.key.comparator.options':'-k1,1nr',
                                'mapred.reduce.tasks': 1})
            ]
        elif self.part == '5':
            return [
                MRStep(mapper=self.mapper_topWords,
                       combiner=self.combiner_topWords,
                       reducer=self.reducer_topWords),
                MRStep(mapper=self.mapper_sort,
                       reducer_init=self.reducer_sort_init,
                       reducer=self.reducer_top10000,
                       jobconf={'mapred.output.key.comparator.class':'org.apache.hadoop.mapred.lib.KeyFieldBasedComparator',
                                'mapred.text.key.partitioner.options':'-k1,1',
                                'stream.num.map.output.key.fields':1,
                                'mapred.text.key.comparator.options':'-k1,1nr',
                                'mapred.reduce.tasks': 1})
            ]


if __name__ == '__main__':
    job.run()

Overwriting MRJob5_3.py


In [43]:
# Create job flow so that we don't need to keep spinning up clusters
!python -m mrjob.tools.emr.create_job_flow

using configs in /etc/mrjob.conf
using existing scratch bucket mrjob-ac40f1afcc0b86ce
using s3://mrjob-ac40f1afcc0b86ce/tmp/ as our scratch dir on S3
Creating persistent job flow to run several jobs in...
creating tmp directory /tmp/no_script.cloudera.20160214.231258.206908
writing master bootstrap script to /tmp/no_script.cloudera.20160214.231258.206908/b.py
Copying non-input files into s3://mrjob-ac40f1afcc0b86ce/tmp/no_script.cloudera.20160214.231258.206908/files/
Waiting 5.0s for S3 eventual consistency
Creating Elastic MapReduce job flow
Job flow created with ID: j-98XEIC78B7U2
j-98XEIC78B7U2


In [54]:
from MRJob5_3 import job

def runJob5_3(filename, part, s3bucket):

    #mr_job = job(args=[filename, '--part', str(part)])
    #mr_job = job(args=[filename, '--part', str(part), '-r', 'hadoop', '--hadoop-home', '/usr/'])
    mr_job = job(args=[filename, '--part', str(part), '--no-output', '--output-dir', s3bucket,
                       '-r', 'emr', '--emr-job-flow-id', 'j-98XEIC78B7U2'])
    
    output = []

    with mr_job.make_runner() as runner: 
        # Run MRJob
        runner.run()

        # Write stream_output to file
        for line in runner.stream_output():
            output.append(mr_job.parse_output_line(line))
    
    return output

In [47]:
def format_output(output, part):
    for item in output:
        if part == 3:
            print '{:5.5f}  {:<100s}'.format(item[0], item[1])
        else:
            print '{:11,d}  {:<100s}'.format(int(item[0]), item[1])

In [49]:
myfile = 's3://filtered-5grams/'
#myfile = './filtered-5Grams/short-5gram.txt'

output_bucket = 's3://ms-w261-hw05/hw5_3a'

!aws s3 rm --recursive {output_bucket}

part = 1
output = runJob5_3(myfile, part, output_bucket)
format_output(output, part)

        159  AIOPJUMRXUYVASLYHYPSIBEMAPODIKR UFRYDIUUOLBIGASUAURUSREXLISNAYE RNOONDQSRUNSUBUNOUGRABBERYAIRTC UTAHRAPTOREDILEIPMILBDUMMYUVERI SYEVRAHVELOCYALLOSAURUSLINROTSR


In [56]:
output_bucket = 's3://ms-w261-hw05/hw5_3b'

!aws s3 rm --recursive {output_bucket}

part = 2
output = runJob5_3(myfile, part, output_bucket)
format_output(output, part)

 27,502,442  the                                                                                                 
 18,191,779  of                                                                                                  
 12,075,971  to                                                                                                  
  7,881,239  in                                                                                                  
  7,853,465  a                                                                                                   
  7,767,900  and                                                                                                 
  4,316,884  that                                                                                                
  3,847,383  is                                                                                                  
  3,288,731  be                                                                         

In [None]:
output_bucket = 's3://ms-w261-hw05/hw5_3c'

!aws s3 rm --recursive {output_bucket}

part = 3
output = runJob5_3(myfile, part, output_bucket)
format_output(output, part)



## HW5.4
### 1. Build stripes of word co-occurrence for the 1,000 words ranked 9,001 - 10,000
### 2. Using two (symmetric) comparison methods of your choice, pairwise compare all stripes

In [37]:
myfile = './filtered-5Grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt'
#myfile = './filtered-5Grams/short-5gram.txt'

part = 5
output = runJob5_3(myfile, part)

with open('basisWords.txt', 'w') as myfile:
    for i in output[:1000]:
        myfile.write(i[1]+'\n')

The have been translated as follows
 mapred.text.key.comparator.options: mapreduce.partition.keycomparator.options
mapred.text.key.partitioner.options: mapreduce.partition.keypartitioner.options
mapred.reduce.tasks: mapreduce.job.reduces
mapred.output.key.comparator.class: mapreduce.job.output.key.comparator.class


In [50]:
%%writefile MRJob5_4.py
from mrjob.job import MRJob
from mrjob.step import MRStep

class stripes(MRJob):
    
    """
    Build stripes
    - Read in basis words from basisWords.txt
    - Emit stripes where the key and each value's key is in the basis
    """
    
    def mapper_buildStripe_init(self):
        self.vocab = set()
        with open('basisWords.txt','r') as myfile:
            for word in myfile:
                self.vocab.add(word.strip())
        
    def mapper_buildStripe(self, _, line):
        fields = line.strip().split('\t')
        words = fields[0].lower().split()
        wordList = sorted(list(set(words)))
        for index1 in range(len(wordList)-1):
            stripe = {}
            if wordList[index1] in self.vocab:
                for index2 in range(index1+1,len(wordList)):
                    if wordList[index2] in self.vocab:
                        stripe[wordList[index2]] = 1
            if len(stripe) > 0:
                yield wordList[index1], stripe
            
    def combiner_buildStripe(self, key, values):
        stripe = {}
        for val in values:
            for word in val:
                if word in stripe:
                    stripe[word] += val[word]
                else:
                    stripe[word] = val[word]
        yield key, stripe
        
    def reducer_buildStripe(self, key, values):
        stripe = {}
        for val in values:
            for word in val:
                if word in stripe:
                    stripe[word] += val[word]
                else:
                    stripe[word] = val[word]
        yield key, stripe
    
            
        
    """
    Multi-step pipeline definitions
    Based on user input when calling runner function
    """
    def steps(self):
        return [
            MRStep(mapper_init=self.mapper_buildStripe_init,
                   mapper=self.mapper_buildStripe,
                   combiner=self.combiner_buildStripe,
                   reducer=self.reducer_buildStripe,
                   jobconf={'mapred.reduce.tasks': 2})
        ]
    

if __name__ == '__main__':
    stripes.run()

Overwriting MRJob5_4.py


In [51]:
from MRJob5_4 import stripes

def runJob5_4(filename):

    mr_job = stripes(args=[filename, '--file', 'basisWords.txt'])
    #mr_job = stripes(args=[filename, '-r', 'hadoop', '--hadoop-home', '/usr/', '--file', 'basisWords.txt'])
    output = []

    with mr_job.make_runner() as runner: 
        # Run MRJob
        runner.run()

        # Write stream_output to file
        for line in runner.stream_output():
            output.append(mr_job.parse_output_line(line))
    
    return output

In [52]:
myfile = './filtered-5Grams/googlebooks-eng-all-5gram-20090715-0-filtered.txt'
#myfile = './filtered-5Grams/short-5gram.txt'

output = runJob5_4(myfile)
print output



set(['among', 'another', 'often', 'certain', 'mind', 'states', 'known', 'done', 'something', 'human', 'sense', 'seen', 'subject', 'information', 'united', 'want', 'god', 'away', 'question', 'least', 'better', 'enough', 'going', 'development', 'interest', 'themselves', 'ever', 'body', 'took', 'here', 'hand', 'water', 'effect', 'cannot', 'nothing', 'necessary', 'law', 'change', 'brought', 'kind', 'off', 'whether', 'study', 'value', 'person', 'common', 'become', 'went', 'side', 'fact', 'once'])
[('among', {'human': 2, 'brought': 1, 'something': 1, 'want': 1, 'sense': 1, 'subject': 1, 'god': 2, 'least': 3, 'better': 1, 'going': 1, 'interest': 1, 'themselves': 9, 'ever': 3, 'body': 1, 'took': 2, 'known': 5, 'law': 1, 'states': 2, 'kind': 1, 'whether': 2, 'value': 1, 'common': 7, 'become': 1, 'side': 1}), ('another', {'body': 1, 'often': 2, 'certain': 2, 'nothing': 1, 'done': 1, 'want': 2, 'sense': 1, 'subject': 2, 'god': 1, 'away': 1, 'question': 2, 'going': 1, 'themselves': 3, 'ever': 3, '