# Lab 2: Word Count with MapReduce 

<hr>

## Introduction

Here is what our problem looks like:

* We have a huge text document
* We need to count the number of times each distinct word appears in the document


* Sample applications:

    * Analyze web server logs to find popular URLs
    * Analyze texts for content or style 

In this lab, we will use the MapReduce Programming Model and perform a word count.



## Objectives

You will be able to:

* Read and write text files using Python
* Use the MapReduce model to implement map and reduce operations
* Perform a basic analysis of the results as a step towards identifying writing styles



## MapReduce Framework

Here are the steps that we will perform for our problem under the MapReduce framework.

* Read data (text files in this case)


* Map:
    * Extract something you care about


* Group by key: Sort and Shuffle


* Reduce:
    * Aggregate, summarize, filter or transform


* Write the result 

Here is what it looks like visually: 
![](images/wc1.png)
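
The steps above can be sketched end to end on a small in-memory list of lines. This is only a minimal illustration, not the lab solution: the sample data and the names `lines`, `mapped`, `grouped`, and `reduced` are assumptions made for this example.

```python
from collections import defaultdict

# Read: a tiny in-memory stand-in for the lines of a text file (made-up data)
lines = ["the cat sat", "the dog sat", "the end"]

# Map: emit a (word, 1) pair for every word
mapped = [(word, 1) for line in lines for word in line.split()]

# Group by key: sort the pairs, then collect the values of each word
grouped = defaultdict(list)
for word, one in sorted(mapped):
    grouped[word].append(one)

# Reduce: sum the list of ones for each word
reduced = {word: sum(ones) for word, ones in grouped.items()}

# Write: here we simply print the result
print(reduced)
```

The rest of the lab builds each of these stages as a separate function that works on a real file.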

## Map - Read file and return sorted key-value pairs


Write a function `mapper` that takes a single file name as input and returns a sorted sequence of `(word, 1)` tuples.

```python
mapper('sample.txt')
[('adipisci', 1), ('adipisci', 1), ('adipisci', 1), ('adipisci', 1), ('adipisci', 1), ('adipisci', 1), ('adipisci', 1), ('aliquam', 1), ('aliquam', 1), ('aliquam', 1), ('aliquam', 1), ('aliquam', 1), ('aliquam', 1), ('aliquam', 1), ('amet', 1), ('amet', 1), ('amet', 1)...
```

### Hint

The following example shows how to read a text file and print each word. \
Note: 'hamlet.txt' is in this lab directory.

```python
# open the file; the with-statement closes it automatically
with open('hamlet.txt') as myfile:
    for line in myfile:
        # split each line into words
        words = line.split()
        # loop over the list of words and print each one
        for word in words:
            print(word)
```

### Solution

Define the `mapper` function in the following cell. For each word in the file, you must:
* remove leading and trailing whitespace
* remove any dot (.), semicolon (;), question mark (?), and comma (,)
* change it to lower case

The returned list **must be sorted**.
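
As a rough sketch of the three cleaning steps applied to one line of text, assuming only the four punctuation marks listed above need to be removed (the helper name `clean_line` and the sample sentence are hypothetical):

```python
def clean_line(line):
    """Return the cleaned, lower-cased words of one line (hypothetical helper)."""
    # remove the four punctuation marks listed above
    for char in ".;?,":
        line = line.replace(char, "")
    # strip surrounding whitespace, lower-case, and split into words
    return line.strip().lower().split()

cleaned = clean_line("  To be, or not to be; that is the Question.  ")
print(cleaned)
```

Sorting the `(word, 1)` pairs built from such words puts equal words next to each other, which is what the later grouping step relies on.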

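```python
# Write your solution here
def mapper(file):
    words_list = []
    # punctuation to strip; a superset of the characters listed above
    invalid = ".;?,'&[]()-"
    with open(file) as myfile:
        for line in myfile:
            # remove unwanted characters, then normalise case and whitespace
            for char in invalid:
                line = line.replace(char, '')
            line = line.lower().strip()

            # emit a (word, 1) pair for every word on the line
            for word in line.split():
                words_list.append((word, 1))

    return sorted(words_list)
```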

## Partition


Create a function named `partitioner` that takes the key/value pairs returned by `mapper` and groups the `(word, 1)` pairs into a list of `(word, [1, 1, ...])` tuples, for example:
```python
partitioner(mapper('sample.txt'))
[('adipisci', [1, 1, 1, 1, 1, 1, 1]), ('aliquam', [1, 1, 1, 1, 1, 1, 1]), ('amet', [1, 1, 1, 1]), ...]
```

### Hint

You can create a dictionary to store each word and its index in the list.
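
Because the mapper output is sorted, equal words sit next to each other, so the standard library's `itertools.groupby` is a possible alternative to the dictionary approach. The following is a minimal sketch of that alternative on made-up pairs; the name `pairs` and its contents are assumptions for illustration.

```python
from itertools import groupby
from operator import itemgetter

# sorted (word, 1) pairs, as a mapper would return them (made-up data)
pairs = [("a", 1), ("a", 1), ("b", 1), ("c", 1), ("c", 1)]

# groupby only merges adjacent equal keys, which is why the input must be sorted
grouped = [
    (word, [count for _, count in group])
    for word, group in groupby(pairs, key=itemgetter(0))
]

print(grouped)
```

The dictionary approach from the hint does not depend on the input being sorted, which makes it the more forgiving choice.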

### Solution

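```python
# Write your solution here
def partitioner(list_from_mapper):
    result = []
    # maps each word to the position of its tuple in result
    word_index = {}

    for word, num in list_from_mapper:
        if word in word_index:
            # word seen before: append to its existing list of counts
            result[word_index[word]][1].append(num)
        else:
            # first occurrence: start a new (word, [num]) entry
            result.append((word, [num]))
            word_index[word] = len(result) - 1
    return result

partitioner(mapper('hamlet.txt'))
```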

## Reduce - Sum the counts and return a single key-value pair (word, sum)


Write the function `reducer` that reads a tuple `(word, [1, 1, 1, ..., 1])`, sums the occurrences of the word into a final count, and returns the tuple `(word, occurrences)`.

```python
reducer(('hello', [1, 1, 1, 1, 1]))
('hello', 5)
```

### Solution

```python
# import reduce so you can use it
from functools import reduce

# Write your solution here
def reducer(pair):
    # add up the values in the list; x is the running total, y the next value
    return (pair[0], reduce(lambda x, y: x + y, pair[1]))

reducer(('hello', [1, 1, 1, 1, 1]))
# Output: ('hello', 5)
```
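
Summing a list of counts with `functools.reduce` gives the same result as the built-in `sum`. The sketch below compares the two on made-up data; passing an initial value of `0` to `reduce` is an addition made here so that an empty list is also handled.

```python
from functools import reduce
from operator import add

ones = [1, 1, 1, 1, 1]

# fold the list with addition, starting from 0
total_reduce = reduce(add, ones, 0)
# the built-in sum performs the same aggregation
total_sum = sum(ones)

# without the initial value, reduce raises TypeError on an empty list
empty_total = reduce(add, [], 0)

print(total_reduce, total_sum, empty_total)
```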

## Word Count

Create a function `word_count` that takes a single file name and uses the functions implemented above to compute the `(word, occurrences)` pairs. Then write the result to a text file named `output.txt`. The result **must be sorted by number of occurrences, most frequent first**.

### Hint

The following example shows how to write to a text file.

```python
# open a file named 'test.txt' with mode 'w'.
# If 'test.txt' does not exist, it will be created.
f = open('test.txt', 'w')

# write() writes a single string
f.write("Hello World!")
f.write('\n')

# writelines() writes a list of strings, without adding separators
words = ["apple", "is", "so", "good"]
f.writelines(words)
f.write('\n')

# convert other types to strings before writing
mixed = ["apple", 5, "is", "good"]
f.write(mixed[0] + str(mixed[1]) + mixed[2] + mixed[3])

# close the file
f.close()
```
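
The same kind of write can be done with a `with` block, which closes the file automatically even if an error occurs. This is a small sketch under the assumption that writing to a temporary directory is acceptable; the path and the text written are illustrative only.

```python
import os
import tempfile

# write to a temporary directory so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), "test.txt")

with open(path, "w") as f:
    f.write("Hello World!\n")
    # writelines() adds no separators, so join the words with spaces first
    f.write(" ".join(["apple", "is", "so", "good"]) + "\n")

# read the file back to check what was written
with open(path) as f:
    content = f.read()

print(content)
```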

### Solution

```python
# Write your solution here
def word_count(filename):
    # map and group: [(word, [1, 1, ...]), ...]
    word_count_list = partitioner(mapper(filename))
    # reduce each group to (word, count)
    result = []
    for tup in word_count_list:
        result.append(reducer(tup))

    # most frequent words first
    result.sort(reverse=True, key=lambda x: x[1])

    with open('output.txt', 'w') as outfile:
        for tup in result:
            outfile.write(tup[0] + ' - ' + str(tup[1]) + '\n')

# Once the function is defined, give it a try
word_count('hamlet.txt')
```
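
One way to sanity-check a word count is to compare it with `collections.Counter` from the standard library, which counts items directly. The sketch below uses a short made-up word list rather than a file; `words` and its contents are assumptions for illustration.

```python
from collections import Counter

# already-cleaned words (made-up data)
words = ["to", "be", "or", "not", "to", "be", "to"]

# Counter maps each distinct word to its number of occurrences
counts = Counter(words)

# most_common(n) returns the n most frequent (word, count) pairs
top = counts.most_common(2)

print(top)
```

Looking at the most frequent words in this way is a simple starting point for comparing the writing styles of different texts.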