## The Problem

When data grows too large to fit on a single computer ("big data"), we need different tools and architectures to process it. [MapReduce](https://en.wikipedia.org/wiki/MapReduce) helps by splitting the work into two steps: a map step that processes each chunk of data independently, so it can run on many computers at once, and a reduce step that combines those partial results into a final summary.
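
To make the two steps concrete before touching real logs, here is a minimal pure-Python sketch. The log lines and the names `server_logs`, `map_counts`, and `reduce_counts` are made up for illustration; each inner list stands in for the logs held by one server.

```python
from collections import Counter
from functools import reduce

# Hypothetical data: each inner list stands in for one server's log of paths.
server_logs = [
    ["/home", "/about", "/home"],
    ["/home", "/contact"],
]


def map_counts(lines):
    """Map step: count the paths seen in one server's logs."""
    return Counter(lines)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


# Each map call only needs its own chunk, so the calls could run on separate machines.
partial_counts = [map_counts(lines) for lines in server_logs]

# The reduce step folds the partial results into a single summary.
total_counts = reduce(reduce_counts, partial_counts, Counter())
print(total_counts.most_common())
```

The map calls never look at each other's data, which is what lets the work spread across many computers.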

## Our Servers (with logs)

![Servers](servers.png)

### Grouping Servers to parse logs

![Map Reduce Groups](grouped_servers.png)

### MapReduce (Cloudera)

![Cloudera MapReduce Graphic](http://blog.cloudera.com/wp-content/uploads/2014/03/ssd1.png)

## Import Data

In [None]:
import pandas as pd
from urllib.parse import urlparse

In [None]:
df = pd.read_csv('../data/parsed_logs.csv')

In [None]:
df.head()

In [None]:
df.dtypes

In [None]:
df.dropna().shape

In [None]:
df.shape

### Map Step

In [None]:
df.iloc[0]['url']

In [None]:
test_url = df.iloc[0]['url'] + '?referrer_id=google&user_id=123'

In [None]:
urlparse(test_url)

In [None]:
urlparse(test_url).path

In [None]:
def get_path(url):
    '''Return parsed path when given url.'''
    return urlparse(url).path

In [None]:
get_path(test_url)
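
As a side note on what `urlparse` returns, the sketch below splits a URL into its named parts. The URL is hypothetical and may not match the shape of the values in the `url` column; `parse_qs` is a standard-library helper that is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL, made up for illustration.
example_url = "http://example.com/products/1?referrer_id=google&user_id=123"

parts = urlparse(example_url)
print(parts.scheme)           # protocol part of the URL
print(parts.netloc)           # host part of the URL
print(parts.path)             # the piece get_path keeps
print(parse_qs(parts.query))  # query string parsed into a dict of lists
```

Keeping only `.path` drops the query string, so visits to the same page with different `user_id` values are counted together.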

In [None]:
def map_path(input_df):
    '''Map step: return a Series of parsed paths, one per row of input_df['url'].'''
    return input_df['url'].map(get_path)

In [None]:
map_path(df).head()

### Reduce Step

In [None]:
def reduce_paths(all_path_series):
    '''Receive a list of Series produced by the map step, combine them,
       and return a count of each path.'''
    all_paths = pd.concat(all_path_series)
    return all_paths.value_counts()

In [None]:
reduce_paths([map_path(df)])
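
One way to sanity-check the two steps is to confirm that mapping chunks separately and then reducing gives the same counts as processing everything at once. The sketch below repeats the notebook's functions so it runs on its own; the two small DataFrames and their URLs are made up for illustration.

```python
import pandas as pd
from urllib.parse import urlparse


def get_path(url):
    return urlparse(url).path


def map_path(input_df):
    return input_df['url'].map(get_path)


def reduce_paths(all_path_series):
    return pd.concat(all_path_series).value_counts()


# Hypothetical log chunks, standing in for logs from two servers.
chunk_a = pd.DataFrame({'url': ['http://example.com/home?user_id=1',
                                'http://example.com/about']})
chunk_b = pd.DataFrame({'url': ['http://example.com/home?user_id=2',
                                'http://example.com/home']})

# Map each chunk on its own, then reduce the partial results.
chunked = reduce_paths([map_path(chunk_a), map_path(chunk_b)])

# Process the same rows as a single chunk for comparison.
combined = reduce_paths([map_path(pd.concat([chunk_a, chunk_b]))])

print(chunked)
print(chunked.sort_index().equals(combined.sort_index()))
```

Because counting does not depend on how the rows are split, the chunked and combined results should agree.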

### Putting it all together

In [None]:
# initial input

file_names = ['../data/parsed_logs.csv', 
              '../data/parsed_logs2.csv', 
              '../data/parsed_logs3.csv']

In [None]:
# Map step: run map_path on each file independently
mapped_files = []
for file_name in file_names:
    file_df = pd.read_csv(file_name)
    mapped_files.append(map_path(file_df))

reduce_paths(mapped_files)
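
The loop above runs the map step on one file at a time. In a real MapReduce system the map tasks run in parallel on separate machines. The sketch below imitates that on a single machine with a thread pool, which is only a stand-in for a cluster. It uses small in-memory DataFrames with made-up URLs instead of the CSV files, and `run_job` is a hypothetical helper name.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

import pandas as pd


def get_path(url):
    return urlparse(url).path


def map_path(input_df):
    return input_df['url'].map(get_path)


def reduce_paths(all_path_series):
    return pd.concat(all_path_series).value_counts()


def run_job(dataframes, max_workers=3):
    """Run map_path on each DataFrame in a worker thread, then reduce once."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        mapped = list(pool.map(map_path, dataframes))
    return reduce_paths(mapped)


# Hypothetical chunks, standing in for the three log files.
demo_frames = [
    pd.DataFrame({'url': ['/home', '/about?x=1']}),
    pd.DataFrame({'url': ['/home']}),
    pd.DataFrame({'url': ['/contact', '/home?ref=mail']}),
]

path_counts = run_job(demo_frames)
print(path_counts)
```

The reduce step still runs once, after every map task has finished, because it needs all of the partial results.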

### Your Turn

- Add an argument (preferably a keyword argument) to the reduce step that sets how many rows to show, and return only that many rows.
- Change the map step to take just a file name and load the DataFrame inside the step.
- Share any problems in Slack.

In [None]:
# keyword arguments: z has a default value and can be passed by name
def foo(x, y, z=False):
    print(z)

In [None]:
foo(1, 2, z='test')
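
As a further hint on keyword arguments, here is a small sketch of a function whose keyword argument limits how many results come back. It uses plain Python rather than pandas, and `top_items` and the `pages` list are hypothetical names and data, not part of the exercise solution.

```python
from collections import Counter


def top_items(items, n=None):
    """Count items and return the n most common; n=None returns all of them."""
    return Counter(items).most_common(n)


# Hypothetical page visits, made up for illustration.
pages = ['/home', '/about', '/home', '/contact', '/home', '/about']

print(top_items(pages))       # default: every item with its count
print(top_items(pages, n=2))  # keyword argument limits the result
```

Giving `n` a default means existing calls keep working, while callers who want fewer rows can ask for them by name.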