# Map Reduce

This notebook implements a simplified MapReduce in Python. It does not distribute computation across nodes; the purpose is rather to explore how to write a map or reduce function, assuming the surrounding functionality is provided by a library like the one described in [Dean and Ghemawat](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf).


This notebook contains a section defining a simple example mapper and reducer, along with a `run` function which you may change if necessary. An intermediate sort function is also provided.

Implement the `mapper` and `reducer` in the Term Vector section, and use the run cell as provided.
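
The flow that the rest of the notebook builds on (map, group by key, reduce) can be sketched on in-memory strings as follows. This is a minimal illustration under simplifying assumptions, not the notebook's `run` function; the names `toy_mapper`, `toy_reducer` and `toy_run` are made up for this example.

```python
from itertools import groupby
from operator import itemgetter


def toy_mapper(key, value):
    # Emit one (word, 1) pair per whitespace-separated token.
    for word in value.split():
        yield (word, 1)


def toy_reducer(key, values):
    # Sum all counts collected for one key.
    yield (key, sum(values))


def toy_run(documents):
    # Map phase: apply the mapper to every (key, value) input pair.
    mapped = []
    for doc_id, text in documents.items():
        mapped.extend(toy_mapper(doc_id, text))
    # Sort phase: order by key, then group the values that share a key.
    mapped.sort(key=itemgetter(0))
    grouped = [(k, [v for _, v in g])
               for k, g in groupby(mapped, key=itemgetter(0))]
    # Reduce phase: call the reducer once per key.
    reduced = []
    for k, values in grouped:
        reduced.extend(toy_reducer(k, values))
    return reduced


counts = toy_run({"D1": "the cat sat", "D2": "the dog sat"})
print(counts)  # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]
```

In a real MapReduce library the three phases would run on different workers; here they run one after the other in a single process.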


In [1]:
from itertools import groupby
from operator import itemgetter
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
%config Completer.use_jedi = False


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jeandre/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
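# Example MAPPER: emits a count of 1 for every character
def mapper(key, value):
    """
    User-defined mapper function.
    :param key: identifier of the input source (unused here)
    :param value: contents of the input source as a string
    """
    # iterating over a string yields one character at a time
    for w in value:
        yield (w, 1)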

In [6]:
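# Example REDUCER: sums the counts collected for a key

def reducer(key, list_value):
    """
    User-defined reducer function.
    :param key: key emitted by the mapper
    :param list_value: list of all values emitted for that key
    """
    yield (key, sum(list_value))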

In [7]:


# build the stop word set once instead of reloading the list for every word
STOPWORDS = set(stopwords.words('english'))

def cleaner(line):
    # lowercase the line and keep alphabetic characters only, retaining
    # apostrophes for the time being
    words = re.findall(r"[a-z']+", line.lower())
    for word in words:
        # drop apostrophes, assuming users won't type them in a search
        word = word.replace("'", '')
        # skip empty strings and stop words
        if word and word not in STOPWORDS:
            yield word


    
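def intermediate_sort(data):
    """
    Sort the mapper output by key and collect the values for each key.
    Returns a list of (key, [values]) tuples.
    """
    # sort on the key only, so the values themselves need not be orderable
    data = sorted(data, key=itemgetter(0))
    return [(k, [v for _, v in g]) for k, g in groupby(data, itemgetter(0))]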

    

def run(sources_dict):
    """
    Since we are focusing on the mapper and reducer functions here, we have to
    provide the boilerplate code that a MapReduce library typically would. This
    function does that in a simple way (we ignore distribution for now).
    :param sources_dict: dictionary mapping a key to a fully qualified file name, for example {'doc_id': '/home/fileX'}
    """
    map_result = []
    reduce_result = []
    # open the files and apply map to each of them (could be done in parallel,
    # but we prefer to keep it simple).
    for k, v in sources_dict.items():
        # do map per source
        # this could happen in its own process / worker typically
        with open(v, 'r') as f:
            map_result += list(mapper(k, f.read()))
        # alternative: call the mapper once per line instead of once per file
        # with open(v, 'r') as f:
        #     for line in f:
        #         map_result += list(mapper(k, line))
    # this would be written to disk in the original paradigm,
    # but we keep it in memory for ease of use
    intermediate_result = intermediate_sort(map_result)
    # now that the data has been 'collected' and grouped by key it can be handed
    # to the reducers. They would run over partitions or chunks usually, but we
    # will just iterate through the keys we have and call them
    for key, values in intermediate_result:
        reduce_result.append(list(reducer(key, values)))
    return map_result, intermediate_result, reduce_result



In [8]:
# EXAMPLE
!mkdir -p input/
!echo -e 'D1 : the cat sat on the mat' > input/d1.txt
!echo -e 'D2 : the dog sat on the log' > input/d2.txt

_, _, res = run({'D1': 'input/d1.txt' , 'D2': 'input/d2.txt'})

res

[[('\n', 2)],
 [(' ', 14)],
 [('1', 1)],
 [('2', 1)],
 [(':', 2)],
 [('D', 2)],
 [('a', 4)],
 [('c', 1)],
 [('d', 1)],
 [('e', 4)],
 [('g', 2)],
 [('h', 4)],
 [('l', 1)],
 [('m', 1)],
 [('n', 2)],
 [('o', 4)],
 [('s', 2)],
 [('t', 8)]]
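
The result above counts characters rather than words because the example mapper iterates directly over the file contents, and iterating over a Python string yields one character at a time. The sketch below contrasts that behaviour with a word-level variant. `char_mapper` and `word_mapper` are hypothetical names used only here, and the regular expression is a simplifying assumption (lowercase alphabetic runs, no stop word removal).

```python
import re


def char_mapper(key, value):
    # Same logic as the example mapper: a str is iterated per character.
    for ch in value:
        yield (ch, 1)


def word_mapper(key, value):
    # Hypothetical variant: split into lowercase alphabetic words first.
    for word in re.findall(r"[a-z]+", value.lower()):
        yield (word, 1)


line = "the cat sat"
chars = list(char_mapper("D1", line))
words = list(word_mapper("D1", line))
print(chars[:3])  # [('t', 1), ('h', 1), ('e', 1)]
print(words)      # [('the', 1), ('cat', 1), ('sat', 1)]
```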

# Term Vector

The paper states:

> Term-Vector per Host: A term vector summarizes the most important words that occur in a document or a set of documents as a list of 〈word, frequency〉 pairs. The map function emits a 〈hostname, term vector〉 pair for each input document (where the hostname is extracted from the URL of the document). The reduce function is passed all per-document term vectors for a given host. It adds these term vectors together, throwing away infrequent terms, and then emits a final 〈hostname, term vector〉 pair.

To implement

> throwing away infrequent terms

write your code so that only terms occurring at least twice are retained.

Hint: 
  * Consider how they use the word 'frequency' elsewhere in the paper.
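
One possible reading of the task is sketched below; it is not the reference solution. It assumes that "frequency" means a raw occurrence count, that the hostname is everything before the first `/` in the key, and that a term vector can be represented as a `collections.Counter`. The names `hostname_of`, `tv_mapper` and `tv_reducer` are hypothetical, and the tokenizer is a simple stand-in for the notebook's `cleaner`.

```python
from collections import Counter
import re


def hostname_of(url):
    # Hypothetical helper: treat everything before the first '/' as the host.
    # Assumes keys like 'www.somesite.com/page/1' with no 'http://' scheme.
    return url.split("/", 1)[0]


def tv_mapper(key, value):
    # Emit one (hostname, term vector) pair per document; the term vector
    # maps each word to its raw count in that document.
    words = re.findall(r"[a-z]+", value.lower())
    yield (hostname_of(key), Counter(words))


def tv_reducer(key, list_value, min_count=2):
    # Add the per-document term vectors for one host, then drop every term
    # whose total count is below min_count.
    total = Counter()
    for term_vector in list_value:
        total.update(term_vector)
    kept = {word: n for word, n in total.items() if n >= min_count}
    yield (key, kept)


pairs = list(tv_mapper("www.somesite.com/page/1", "the cat sat on the mat"))
pairs += list(tv_mapper("www.somesite.com/page/2", "the dog sat on the log"))
host = pairs[0][0]
result = list(tv_reducer(host, [tv for _, tv in pairs]))
print(result)  # [('www.somesite.com', {'the': 4, 'sat': 2, 'on': 2})]
```

If term vectors are emitted as plain dicts, sorting whole `(key, value)` tuples fails whenever two keys tie, because dicts do not support `<`; sorting on the key alone sidesteps this.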
 

In [None]:
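# your mapper (placeholder: replace the body with your implementation)
def mapper(key, value):
    yield (key, value)

# your reducer (placeholder: replace the body with your implementation)
def reducer(key, list_value):
    yield (key, list_value)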

In [None]:
x, y, res = run({'www.somesite.com/page/1': 'page1.txt', 'www.somesite.com/page/2': 'page2.txt'})