# Practicing MapReduce

The goal of this lab is to give you experience thinking in terms of MapReduce. We will be using small datasets that you can inspect manually to determine the correctness of your results to help you internalize how MapReduce works. In the next lab, you will have the opportunity to use Spark, a MapReduce-based system, to process the very large datasets for which it was actually designed.


## Learning the Basics

We will first look at the `map()` and `reduce()` functions individually and then we will use them together to build more complex exercises.


### The `map()` Function

First, let's think in terms of the `map()` function. `map()` applies the function passed as its first argument to every item of the given iterable. If you're asking yourself how this differs from a regular loop: each call is independent of all the others, so a map can be parallelized without any further input from the programmer. In Python 3, `map()` returns a lazy iterator, which is why the examples wrap it in `list()` before printing.

Consider for example a list of fruits like the one below:

In [None]:
fruits = ["Apple", "Strawberry", "Banana", "Pear", "Apricot", "Watermelon", "Orange", "Avocado", "Pineapple"]

How can you take the list of fruits and get another list of which ones begin with the letter "A"?

In [None]:
# Option 1: Defining your own begins_with_A function
def begins_with_A(word):
    return word[0] == "A"

bool_fruits_A = list(map(begins_with_A, fruits))
print(bool_fruits_A)

In [None]:
# Option 2: A nicer and more compact way
bool_fruits_A = list(map(lambda s: s[0] == "A", fruits))
print(bool_fruits_A)
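
To make the parallelization point concrete, here is a minimal sketch. It splits the list into two arbitrary chunks (`chunk_1` and `chunk_2` are illustrative names, and the split point is an assumption), maps each chunk separately as two workers might, and checks that concatenating the partial results gives the same list as mapping everything at once.

```python
fruits = ["Apple", "Strawberry", "Banana", "Pear", "Apricot",
          "Watermelon", "Orange", "Avocado", "Pineapple"]

def begins_with_A(word):
    return word.startswith("A")

# Map over the whole list at once
whole = list(map(begins_with_A, fruits))

# Map over two chunks independently, then concatenate the partial results
chunk_1, chunk_2 = fruits[:4], fruits[4:]
combined = list(map(begins_with_A, chunk_1)) + list(map(begins_with_A, chunk_2))

print(whole == combined)
```

No call to `begins_with_A` depends on any other call, so the chunks could be processed in any order or on different machines.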

### The `reduce()` Function

Now that you know how to use `map()`, let's give `reduce()` a try. First, remember that `reduce()` returns a single value (instead of an iterator), obtained by repeatedly combining the items of the iterable with the function we've passed. Second, in MapReduce the function given to `reduce()` should be commutative and associative, that is, neither the order nor the grouping of the elements affects the result. Python's `reduce()` does not check this; it simply applies the function from left to right. It also follows that the function's result must have the same type as its inputs, so you need to map the data into the same type of result you're expecting after applying `reduce()`.

Also note that in Python 3 `reduce()` is no longer a built-in function; it lives in the `functools` module.


In [None]:
from functools import reduce

some_list = [2, 4, 7, 3, 1, 10, 21, 42]
print(reduce(lambda x, y: x + y, some_list))

Sum is commutative and associative, so any splitting and reordering of <code>some_list</code> will yield the same result:

In [None]:
half_some_list_1 = [2, 4, 7, 3]
half_some_list_2 = reversed([1, 10, 21, 42])

reduced_1 = reduce(lambda x, y: x + y, half_some_list_1)
reduced_2 = reduce(lambda x, y: x + y, half_some_list_2)

print(reduce(lambda x, y: x + y, [reduced_1,reduced_2]))

This matters because, when working with very large datasets, your data will be split across workers, and you want the combined result to be correct no matter how the pieces are divided or in which order they are processed.
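
For contrast, here is a small sketch of what goes wrong with a function that is neither commutative nor associative. Subtraction is used purely as an illustration, and the names `subtract`, `whole_result` and `split_result` are made up for this example. Splitting the same list in half changes the answer.

```python
from functools import reduce

some_list = [2, 4, 7, 3, 1, 10, 21, 42]

def subtract(x, y):
    return x - y

# Reduce the whole list from left to right
whole_result = reduce(subtract, some_list)

# Reduce each half separately, then combine the two partial results
partial_1 = reduce(subtract, some_list[:4])
partial_2 = reduce(subtract, some_list[4:])
split_result = reduce(subtract, [partial_1, partial_2])

print(whole_result, split_result)
```

The two printed values differ, so a reducer like this would give results that depend on how the data happened to be partitioned.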

**Mini exercise**: use `map()` and `reduce()` to count how many fruits begin with the letter "A". Use the cell below to try it out!

From now on, you'll find hidden answers for the exercises. Resist the urge to look at the answers right away and **try to solve them on your own first**!

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible solution.</b>
</font>
</summary>
  <p>
    <code>int_fruits_A = map(lambda s: int(s),bool_fruits_A)</code><br>
    <code>print(reduce(lambda x, y: x + y, int_fruits_A))</code>
   </p>
</details>

In [None]:
# Use this cell to type your answer or copy and paste the answer hidden above



## Warm Up Exercise: A Social Network

Now that you know how to use map and reduce, let's practice doing more complex things.

Consider a simple social network dataset consisting of a set of key-value pairs (person, friend) representing a friend relationship between two people. 

Each input record is a pair person_A, person_B where person_A is a string representing the name of a person and person_B is a string representing the name of one of person_A's friends. Note that it may or may not be the case that the relationship is symmetric, that is, person_B might not consider person_A a friend. 

**Task**: Describe a MapReduce algorithm to count the number of friends for each person. The output should be a pair (person, friend_count) where person is a string and friend_count is an integer indicating the number of friends associated with person.


### Map the Input 

Let's begin by reading the input file "friends.dat".

In [None]:
# Loading the data
# splitlines() avoids a spurious empty record if the file ends with a newline
with open("friends.dat") as network_data_file:
    network_data = network_data_file.read().splitlines()

You can open the file or print `network_data` to see what the data looks like.

In [None]:
print(network_data)

Now that the data has been loaded, think about the format you'll need to map your data into. In order to reduce your data into a list of pairs of the form <code>(person, num_friends)</code>, you'll have to map the original input data into something similar.

In [None]:
person_1_pairs = list(map(lambda p: [((p.split(','))[0],1)], network_data))

print(person_1_pairs)
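
To see the shape this mapping produces without opening the file, here is a sketch on three made-up records. The names in `sample_lines` are hypothetical and are not taken from `friends.dat`; the comma-separated format is assumed from the `split(',')` call above.

```python
# Hypothetical records in the assumed "person,friend" format
sample_lines = ["Anna,Maria", "Anna,Kate", "Kate,Anna"]

# Each record becomes a one-element list holding a (person, 1) pair
sample_pairs = list(map(lambda p: [(p.split(",")[0], 1)], sample_lines))

print(sample_pairs)
```

Each record is wrapped in its own one-element list so that every operand handed to the reducer, and every result it returns, has the same type: a list of `(person, count)` pairs.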

Now that we have mapped our data, it's almost time to use `reduce()`. Plain Python has no counterpart to PySpark's `reduceByKey()`, but we can still write our own `reduce_by_key` function that is commutative and associative and use it together with `reduce()`.

In [None]:
# The function merges two lists so that it is commutative and associative.
# Keeping the lists sorted reduces the time complexity of the merge, which
# does not matter for tiny problems like this one but is very important
# for large datasets.
# Precondition: list1 and list2 are sorted by name/key.
# Example input:   list1 = [("Anna", 1), ("Maria", 3)]
#                  list2 = [("Anna", 2), ("Kate", 4), ("Zara", 1)]
# Expected output: [("Anna", 3), ("Kate", 4), ("Maria", 3), ("Zara", 1)]

def reduce_by_key(list1,list2):
    final_list = []
    i, j = 0, 0
  
    while i < len(list1) and j < len(list2): 
        if list1[i][0] == list2[j][0]: 
            final_list.append((list1[i][0],list1[i][1] + list2[j][1]))
            i += 1
            j += 1
            
        elif list1[i][0] < list2[j][0]: 
            final_list.append(list1[i]) 
            i += 1
  
        else: 
            final_list.append(list2[j]) 
            j += 1
    
    return final_list + list1[i:] + list2[j:]
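
As a point of comparison, here is a sketch of a dictionary-based merge. It is a different technique from the sorted-list merge above, and the name `merge_counts` is made up for this example. It does not need its inputs to be sorted, but it assumes the values are numbers, since it starts each total at 0.

```python
from functools import reduce

def merge_counts(list1, list2):
    """Merge two lists of (key, count) pairs, summing the counts per key."""
    totals = {}
    for key, count in list1 + list2:
        totals[key] = totals.get(key, 0) + count
    # Sort by key so the result does not depend on argument order
    return sorted(totals.items())

list1 = [("Anna", 1), ("Maria", 3)]
list2 = [("Anna", 2), ("Kate", 4), ("Zara", 1)]
print(merge_counts(list1, list2))

# It can also be used with reduce() on one-element lists
print(reduce(merge_counts, [[("b", 1)], [("a", 1)], [("b", 1)]]))
```

The trade-off is that this version rebuilds and sorts a dictionary on every call, whereas the sorted-list merge does a single linear pass over inputs that are already in order.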

Now we apply reduce() to get our final count:

In [None]:
person_friends_pairs = list(reduce(reduce_by_key, person_1_pairs))

print(person_friends_pairs)

**Task**: use MapReduce again to partially validate your result by making sure that the total number of friends matches the number of lines in the input file, that is, add up all the friends.

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible solution.</b>
</font>
</summary>
<code>print(reduce(lambda p , q: ('total',p[1]+q[1]), person_friends_pairs))</code>
</details>

<details>
<summary>
<font size="3" color="green">
<b>Click here to see another possible solution.</b>
</font>
</summary>
    <p>
        <code>friend_count_map = map(lambda p: p[1], person_friends_pairs)</code><br>
        <code>print(reduce(lambda p,q: p+q, friend_count_map))</code>
    </p>
</details>

In [None]:
# Use this cell to test your MapReduce answer and compare it with the input

# Total number of friends in the file
print(len(network_data))

## Exercise: Counting Fruits

Using the same fruit list from the beginning of the lab, count how many fruits begin with each letter.

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible way to use map().</b>
</font>
</summary>
<code>fruits_letters = list(map(lambda s: [(s[0],1)], fruits))</code>
</details>

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible way to use reduce().</b>
</font>
</summary>
<code>fruits_letters_total = list(reduce(reduce_by_key, fruits_letters))</code>
</details>

In [None]:
#Use this cell for your code



Now suppose you have even more fruit names available in `spring_fruits.dat`. Count the fruits starting with each letter, including both the previous list and the spring fruits. Ideally, you should not need to modify what you already did in the previous cell!

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible solution.</b>
</font>
</summary>
<p>
<code>spring_fruits_letters = list(map(lambda s: [(s[0],1)], spring_fruits_data))</code><br>
<code>spring_fruits_letters_total = list(reduce(reduce_by_key, spring_fruits_letters, fruits_letters_total))</code>
</p>
</details>
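
If the three-argument form of `reduce()` is unfamiliar: the optional third argument is an initial value that is combined before the first item of the iterable. Here is a minimal sketch using plain sums, with illustrative variable names, showing how an earlier result can be carried into a new reduction without being recomputed.

```python
from functools import reduce

# Result of an earlier reduction
first_batch_total = reduce(lambda x, y: x + y, [2, 4, 7, 3])

# Continue from that result: the third argument seeds the reduction
grand_total = reduce(lambda x, y: x + y, [1, 10, 21, 42], first_batch_total)

print(first_batch_total, grand_total)
```

Because the function is associative, seeding the second reduction with the first result gives the same answer as reducing both batches together.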

In [None]:
# Loading the data
# splitlines() avoids an empty last entry (which would break s[0]) if the file ends with a newline
with open("spring_fruits.dat") as spring_fruits_file:
    spring_fruits_data = spring_fruits_file.read().splitlines()

In [None]:
# Add your code here


## Final Exercise: Inverted Index

In this exercise, you'll be creating an [inverted index](https://en.wikipedia.org/wiki/Inverted_index) using MapReduce. An _inverted index_ is a data structure common to most information retrieval systems. It stores a mapping from content (e.g., words or numbers) to the locations where that content appears (e.g., tables or documents).

There exist two main variants of inverted indexes, namely a record-level inverted index and a word-level inverted index. A _record-level inverted index_ stores a list of references to documents for each word. A _word-level inverted index_ additionally stores the positions of each word within a document. In this exercise, we'll focus on the first and simpler version of the two. 
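
To picture the target data structure, here is a sketch of a record-level inverted index built with an ordinary loop rather than MapReduce. The two documents in `docs` are hypothetical and are not taken from `books.json`.

```python
# Hypothetical documents: file name -> text
docs = {
    "doc1.txt": "the quick brown fox",
    "doc2.txt": "the lazy dog",
}

record_level_index = {}
for name, text in docs.items():
    # Use a set so each word is recorded once per document
    for word in sorted(set(text.split())):
        record_level_index.setdefault(word, []).append(name)

print(record_level_index["the"])
print(record_level_index["fox"])
```

Each word maps to the list of documents that contain it. A word-level index would additionally record the positions of the word within each document.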

Let's begin by loading the data and seeing what it looks like.

In [None]:
# Loading data
import json

with open("books.json") as books_file:
    # One JSON record per line; skip blank lines, which json.loads() would reject
    books = [line for line in books_file.read().split("\n") if line.strip()]

book_records = list(map(json.loads, books))

print("The first file name is:", book_records[0][0], "\n")
print("The content of the first file is:", book_records[0][1])

Now it's time to map the input data into a data format that can be used with reduce later on.

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible solution.</b>
</font>
</summary>
<code>mapped_books = list(map(lambda b: list(map(lambda w: (w,[b[0]]),sorted(dict.fromkeys(b[1].split())))),book_records))</code>
</details>

In [None]:
# Your mapping (This is the difficult part of this exercise!)



Now that you have mapped your data, it's time to use `reduce()`.

<details>
<summary>
<font size="3" color="green">
<b>Click here to see one possible solution.</b>
</font>
</summary>
<code>inverted_index = list(reduce(reduce_by_key, mapped_books))</code>
</details>

In [None]:
# Your reduce() 

