<a id="MapReduce"></a>

# Lab 11 - MapReduce 

***

In this lab session we will learn about:
   * Mappers, reducers, and applications of MapReduce
   * Python methods for MapReduce
   * Some functions
   
   
It is highly recommended to read the original MapReduce paper from Google: http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf
   
Datasets:

[1.] https://www.ssa.gov/oact/babynames/limits.html

Resources for further reading:

[1.] https://engineering.purdue.edu/~puma/pumabenchmarks.htm - Classic MapReduce examples/projects

[2.] http://michaelnielsen.org/blog/write-your-first-mapreduce-program-in-20-minutes/ - blog on MapReduce

[3.] https://cs.nyu.edu/~mwalfish/classes/16sp/hw/hw3.html - MapReduce with mrjob (working on VM)

[4.] https://www.youtube.com/watch?v=30RaNpaupj0&list=PLtzRLOcrx9SS1Ir6_viv-yLJd0PX3GR5O - Video explaining MapReduce applications

[5.] https://mapr.com/blog/5-google-projects-changed-big-data-forever/ - Google projects that changed the landscape of Big Data

### **MapReduce** is a programming model for performing parallel processing on large datasets.

In [None]:
from __future__ import division
import math, random, re, datetime
from collections import defaultdict, Counter
from functools import partial

### WordCount - classical way to count

In [None]:
def tokenize(message):
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return set(all_words)                           # remove duplicates

Here the tokenizer splits each document into a set of distinct lowercase words.
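
As a quick check of what the tokenizer returns, the sketch below repeats the `tokenize` function from the cell above and applies it to a made-up sentence (the sentence is an assumption chosen only for illustration).

```python
import re

def tokenize(message):
    message = message.lower()                       # convert to lowercase
    all_words = re.findall("[a-z0-9']+", message)   # extract the words
    return set(all_words)                           # remove duplicates

# Made-up input: "big" appears twice but survives only once in the set
tokens = tokenize("Big data, BIG plans: don't panic!")
print(sorted(tokens))
```

Because duplicates are removed within each document, a count built on this tokenizer reflects the number of documents containing a word rather than its total number of occurrences.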

In [None]:
def word_count_old(documents):
    """ Word count without using map reduce"""
    return Counter(word for document in documents
                   for word in tokenize(document))

#### Add text to the document list and observe the output of the function

In [None]:
documents = ["data science", "big data", "data Mining", "Data Visualization"]
word_counts=word_count_old(documents)
print(word_counts)

### 1. Read through Chapter 24 of the resource (shared in the lab repository), then create basic mapper and reducer functions for counting the words in a list. Print the output from each function.

### Mapper

* A **Mapper** function turns each input item into zero or more key-value pairs.
* Python's built-in `map` function and the MapReduce mapper described here are two different things.
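
To make the difference concrete, here is a minimal sketch that deliberately uses a task other than word counting. The function `length_mapper` and the sample documents are hypothetical, introduced only for illustration.

```python
# Hypothetical mapper: emit a (word_length, word) pair for every word
def length_mapper(document):
    for word in document.lower().split():
        yield (len(word), word)

documents = ["big data", "", "data mining"]

# Built-in map: exactly one output per input item
lengths = list(map(len, documents))

# MapReduce-style mapper: zero or more key-value pairs per input item
pairs = [list(length_mapper(doc)) for doc in documents]

print(lengths)
print(pairs)
```

The empty document produces one value under `map` but no pairs at all under the mapper, which is the "zero or more" part of the definition.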

In [None]:
# The mapper function turns each document into key-value pairs
def wc_mapper(document):

    
    
    
    
    
    

### Reducer

* A Reducer function aggregates the *values* corresponding to each *key*, i.e., it produces output values by grouping together the values that share a key.
* The aggregation can be anything, such as a sum, a maximum, or another mathematical function.
* A Reducer loops through the list of values for a key and aggregates them, for example with sum, max, or min.
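
As an illustration of aggregations other than a sum, the sketch below defines two hypothetical reducers, `max_reducer` and `mean_reducer`, and applies them to made-up grouped data. The names and numbers are assumptions, not part of the lab material.

```python
# Hypothetical reducers: each receives a key and the list of its values
def max_reducer(key, values):
    yield (key, max(values))

def mean_reducer(key, values):
    yield (key, sum(values) / len(values))

# Made-up grouped data: key -> list of values collected for that key
grouped = {"sensor_a": [10, 14, 12], "sensor_b": [20, 18]}

maxima = [out for key, values in grouped.items()
          for out in max_reducer(key, values)]
means = [out for key, values in grouped.items()
         for out in mean_reducer(key, values)]

print(maxima)
print(means)
```

Both reducers share the same `(key, values)` signature, so one can be swapped for the other without touching the rest of the pipeline.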
        

In [None]:
# The reducer function collects the results
def wc_reducer(word, counts):






### Calling Mapper and reducer

In [None]:
# The function below feeds the input data to the mapper, consolidates the mapper output by key (i.e., the collector),
# passes it to the reducer, and finally outputs the reducer's results. This is the map_reduce function!
def word_count(documents):

    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    

### Create a list of words related to your area of expertise and pass it to your MapReduce function

In [None]:
# Make a list of documents here
documents=["data science", "big data", "science fiction"]
word_count(documents)

## 2. MapReduce Paradigm

https://pythonhosted.org/mrjob/guides/concepts.html#mapreduce-and-apache-hadoop

### 2.1 The above link illustrates the MapReduce paradigm with an example and shows the output of each stage. Please answer the following.

### The input file contains a list of technical skill sets extracted from job applications: "Python, R, Hadoop, SQL", "Python, Rshiny, Matlab", "R, SQL, Hadoop", "SQL, C#, Python"

### a. What would the output of the mapping function be for job applicant #1?

### b. What is the output of the shuffling step for Python and R?

### c. If the reducer uses sum aggregation, which programming language is the most common among the 4 applicants?

### We can verify the answers by applying MapReduce

In [None]:
skills =[ "Python, R, Hadoop, SQL", "Python, Rshiny, Matlab", "R, SQL, Hadoop", "SQL, C#, Python"]

word_count(skills)

### A more generic approach that combines the mapper and reducer in a single function is shown below.

#### Binding them all together under a single function - map_reduce

In [None]:
def map_reduce(inputs, mapper, reducer):
    """runs MapReduce on input using functions mapper and reducer"""
    collector = defaultdict(list)
    
    # write a for loop over the inputs that calls mapper
    for i in inputs:
        for key,value in mapper(i):
            collector[key].append(value)
    # write a return statement that calls the reducer
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]
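
The `collector` in the function above plays the role of the shuffle step: it groups every value emitted by the mapper under its key before the reducer sees it. A minimal sketch of that step in isolation follows; the helper name `shuffle` and the sample pairs are assumptions made for illustration.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group values by key, as the collector does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Made-up mapper output: (fruit, quantity) pairs
mapped = [("apple", 2), ("pear", 1), ("apple", 5), ("pear", 3), ("plum", 4)]

grouped = shuffle(mapped)
print(grouped)
```

Each key ends up with the list of all values emitted for it, which is exactly the input a reducer expects.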

In [None]:
word_counts = map_reduce(documents, wc_mapper, wc_reducer)
print(word_counts)

### 2.2 MapReduce Python packages

Some of the available MapReduce packages are:
* mapreduce : https://pypi.org/project/mapreduce/
* kotti_mapreduce: https://pypi.org/project/kotti_mapreduce/
* mrs_mapreduce: https://pypi.org/project/mrs-mapreduce/
* mrjob: https://pypi.org/project/mrjob/
* pydoop 1.2.0 : https://pypi.org/project/pydoop/



## 3. Let's implement MapReduce on a larger problem

#### Dataset reference [1] provides lists of baby names, classified by popularity in each state.
#### Our task now is to create a function that loads all the files in mapper(). The final goal of the MapReduce job is to output the most common words and visualize them using a treemap.

Reference: Courtesy of UN5550-Fall 2017 assignment.

In [None]:
'''In the mapper we need to extract data from all the files in a folder and
tokenize it as we did in the previous case.

We do not want to type in the name of each file to be read, so we need to implement a method
that reads the contents of every file
in the folder and extracts the info.'''

# Loading all necessary libraries
import glob, os, fileinput, re, datetime, sys, collections, string 
from collections import defaultdict, Counter, OrderedDict
from functools import partial
import pandas as pd     


In [None]:
# Function to extract all the words from a line of text (duplicates are kept this time)
def tokenize(message):
    message = message.lower()                    
    all_words = re.findall("[a-z']+", message)  
    return (all_words)  

### 3.1 Create a mapper function that works locally to produce key, value pairs for words that start with a given letter

* __Mapper__

In [None]:
# The following code would produce mapper output with top 'n' names that start with the letter

# mapper function that reads each line in the file and returns key, value pairs for the lines that meet a condition
def mapper1(alpha, filename):

    
    
    
    
    
    
    
    
        
 

* __Reducer__

In [None]:
# reducer function that yields the 'n' most common words
def reducer1(n, key, wordNcounts):

    
    
    
    
    
    
    
    
    
    
    
    

* MapReduce

In [None]:
def map_reduce(inputs, mapper, reducer):
    """runs MapReduce on input using functions mapper and reducer"""
    collector = defaultdict(list)
    
    # write a for loop over the inputs that calls mapper
    for i in inputs:
        for key,value in mapper(i):
            collector[key].append(value)
    # write a return statement that calls the reducer
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]
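
Note that `map_reduce` calls `mapper(i)` with one argument and `reducer(key, value)` with two, while the functions in this section take an extra leading parameter. One possible way to bridge the gap, sketched here under that assumption, is `functools.partial`. The helpers `min_length_mapper` and `capped_sum_reducer` are hypothetical and intentionally solve a different task from the exercise.

```python
from collections import defaultdict
from functools import partial

def map_reduce(inputs, mapper, reducer):
    """Same structure as the map_reduce function above."""
    collector = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            collector[key].append(value)
    return [output
            for key, values in collector.items()
            for output in reducer(key, values)]

# Hypothetical mapper with an extra parameter: keep words of at least min_len letters
def min_length_mapper(min_len, document):
    for word in document.lower().split():
        if len(word) >= min_len:
            yield (word, 1)

# Hypothetical reducer with an extra parameter: cap each total at `limit`
def capped_sum_reducer(limit, key, values):
    yield (key, min(limit, sum(values)))

docs = ["big data", "data science", "big science data"]

# partial fixes the leading argument, leaving the signature map_reduce expects
result = map_reduce(docs,
                    partial(min_length_mapper, 4),
                    partial(capped_sum_reducer, 2))
print(result)
```

Binding the extra argument this way keeps `map_reduce` generic, so the same function can serve mappers and reducers with different settings.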


* __Reading all files from a folder: look into the options in the os library__

https://docs.python.org/3/library/os.html

https://docs.python.org/2/library/functools.html
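
As a starting point, the sketch below shows two standard-library ways of listing the files in a folder. It creates a temporary folder with made-up file names so that it runs anywhere; the folder and file names are assumptions and should be replaced with the location of the downloaded dataset.

```python
import glob
import os
import tempfile

# Create a throwaway folder containing a few placeholder files
folder = tempfile.mkdtemp()
for name in ["a.txt", "b.txt", "notes.md"]:
    with open(os.path.join(folder, name), "w") as handle:
        handle.write("placeholder\n")

# os.listdir returns bare file names; join them with the folder to get paths
all_paths = sorted(os.path.join(folder, name) for name in os.listdir(folder))

# glob.glob returns paths directly and can filter by a pattern
txt_paths = sorted(glob.glob(os.path.join(folder, "*.txt")))

print(all_paths)
print(txt_paths)
```

The resulting list of paths can then serve as the `inputs` argument of `map_reduce`, with the mapper opening one file per call.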


In [None]:
# read all the file names and paths in the given directory









### 3.2 Plotting Treemap

An example for plotting treemap is described below: 

https://python-graph-gallery.com/200-basic-treemap-with-python/

The squarify library must be installed with pip to build the treemap.

In [None]:
# !pip install --user squarify

### 3.3 Modify the above mapper function to produce key, value pairs for all words that contain a given substring, say 'an'

In [None]:
# mapper function that reads each line in the file and returns key, value pairs for the lines that meet a condition
def mapper2(string, filename):

    
    
    
    
    
    
    

In [None]:
# reducer function that yields the 'n' most common words
def reducer2(n, key, wordNcounts):

    
    
    
    
    
    
    

In [None]:
# main code
