# Development of Map Reduce APIs
Let us develop our own Map Reduce APIs to understand how they work internally. To follow along, we need to be comfortable with passing functions as arguments.

We will provide the code and walk through it so that you understand how the Map Reduce APIs are implemented internally.

* Develop myFilter
* Validate myFilter Function
* Develop myMap
* Validate myMap Function
* Develop myReduce
* Develop myReduceByKey
* Exercises

## Develop myFilter

Develop a function named myFilter which takes a collection and a function as arguments. The function should do the following:
* Iterate through the elements of the collection.
* Apply the condition to each element using the function passed as an argument. We might pass a named function or a lambda function.
* Return a collection with all the elements that satisfy the condition.

In [None]:
# Naming convention: c is the collection, f is the function, c_f is the collection of filtered records, e is an element

In [8]:
def myFilter(c, f):
    c_f = []  # empty list to hold the elements that satisfy the condition
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f
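
As a quick illustration of passing either a named function or a lambda function, the sketch below calls myFilter both ways on a small list of numbers. The helper name `is_even` and the sample list are assumptions made up for this illustration; myFilter is repeated so that the block runs on its own.

```python
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f


# Hypothetical named function used as the condition
def is_even(n):
    return n % 2 == 0


numbers = list(range(1, 11))

# Pass the named function itself (no parentheses after the name)
evens_named = myFilter(numbers, is_even)

# Pass the same condition as a lambda function
evens_lambda = myFilter(numbers, lambda n: n % 2 == 0)

print(evens_named)   # [2, 4, 6, 8, 10]
print(evens_lambda)  # [2, 4, 6, 8, 10]
```

Both calls should give the same result, since myFilter only cares that its second argument can be called with one element and returns a truthy or falsy value.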

## Validate myFilter function

Use the same examples that were used earlier in Processing Collections using loops.

* Read orders data

In [9]:
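orders_path = '/Users/monikamendiratta/data/retail_db/orders/part-00000.csv'
# Read all the lines; the with block closes the file once reading is done
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()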

In [10]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

* Get orders placed by customer id 12431


In [11]:
# Simple logic to be implemented as a lambda function below
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
int(order.split(',')[2])

11599

In [12]:
customer_orders = myFilter(orders,
                           lambda order: int(order.split(',')[2]) == 12431
                          )

In [13]:
customer_orders

['3774,2013-08-16 00:00:00.0,12431,CANCELED',
 '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
 '4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
 '22812,2013-12-12 00:00:00.0,12431,PENDING',
 '22927,2013-12-13 00:00:00.0,12431,CLOSED',
 '25614,2013-12-30 00:00:00.0,12431,CLOSED',
 '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '45894,2014-05-06 00:00:00.0,12431,CLOSED',
 '46217,2014-05-07 00:00:00.0,12431,CLOSED',
 '49678,2014-05-31 00:00:00.0,12431,PENDING',
 '51865,2014-06-15 00:00:00.0,12431,PROCESSING',
 '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
 '67110,2014-07-14 00:00:00.0,12431,PENDING']

* Get orders placed by customer id 12431 in January 2014

In [14]:
customer_orders_for_month = myFilter(orders,
                           lambda order: int(order.split(',')[2]) == 12431
                                     and order.split(',')[1].startswith('2014-01')
                          )

In [15]:
customer_orders_for_month

['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']

* Get orders placed by customer id 12431 in January 2014 with status PROCESSING or PENDING_PAYMENT

In [18]:
customer_orders_month_status = myFilter(orders,
                           lambda order: int(order.split(',')[2]) == 12431
                                     and order.split(',')[1].startswith('2014-01')
                                     and order.split(',')[3] in ('PROCESSING','PENDING_PAYMENT')
                          )

In [19]:
customer_orders_month_status

['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT']
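
The lambda above splits each record up to three times. One possible alternative, sketched here under stated assumptions, is a named function that splits the record once and then applies all three conditions. The name `is_open_january_order` and the `sample_orders` list (a few records copied from the outputs above) are introduced only for this illustration; myFilter is repeated so that the block runs on its own.

```python
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f


# Hypothetical helper: split the record once, then test every condition
def is_open_january_order(order):
    order_id, order_date, customer_id, status = order.split(',')
    return (int(customer_id) == 12431
            and order_date.startswith('2014-01')
            and status in ('PROCESSING', 'PENDING_PAYMENT'))


# Small sample standing in for the full orders list
sample_orders = [
    '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
    '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
    '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
    '45894,2014-05-06 00:00:00.0,12431,CLOSED',
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
]

filtered = myFilter(sample_orders, is_open_january_order)
print(filtered)
```

On this sample only the first two records satisfy all three conditions. The unpacking assumes every record has exactly four comma-separated fields, as in the orders shown earlier.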

## Develop myMap

Develop a function named myMap which takes a collection and a function as arguments. The function should do the following:
* Iterate through the elements of the collection.
* Apply the transformation logic to each element using the function passed as an argument.
* Return a collection with all the elements transformed based on the logic passed.

Map is a function used for row-level transformations: it applies a transformation rule to each element in a collection and returns a new collection. The new collection always has the same number of elements as the original, but each element holds the transformed data.

In [53]:
def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t

In [50]:
customer_orders = myMap(orders,
                           lambda order: int(order.split(',')[2]) == 12431
                          )

In [45]:
customer_orders[:5]

[False, False, False, False, False]
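
The result above contains booleans because myMap stores whatever the function returns for each element, whereas myFilter uses that return value as a condition to decide which elements to keep. The sketch below contrasts the two on a small sample; the helper name `is_customer_12431` and the `sample_orders` list are assumptions made for this illustration, and both functions are repeated so that the block runs on its own.

```python
def myFilter(c, f):
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f


def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t


# Hypothetical helper: True when the order belongs to customer 12431
def is_customer_12431(order):
    return int(order.split(',')[2]) == 12431


sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '3774,2013-08-16 00:00:00.0,12431,CANCELED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
]

mapped = myMap(sample_orders, is_customer_12431)
filtered = myFilter(sample_orders, is_customer_12431)

print(mapped)    # [False, True, False]
print(filtered)  # ['3774,2013-08-16 00:00:00.0,12431,CANCELED']
```

The mapped result has one entry per input record, while the filtered result keeps only the matching records unchanged.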

## Validate myMap function
* Create a list of the numbers from 1 to 9 and return the square of each number.

In [51]:
l = list(range(1,10))
l

[1, 2, 3, 4, 5, 6, 7, 8, 9]

In [52]:
myMap(l, lambda e: e * e)

[1, 4, 9, 16, 25, 36, 49, 64, 81]

* Use orders to extract the order dates, then apply set to get only the unique dates.

In [26]:
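orders_path = '/Users/monikamendiratta/data/retail_db/orders/part-00000.csv'
# Read all the lines; the with block closes the file once reading is done
with open(orders_path) as orders_file:
    orders = orders_file.read().splitlines()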

In [27]:
order_dates = myMap(orders,
                   lambda order: order.split(',')[1]
                   )

In [None]:
set(order_dates)

In [30]:
len(set(order_dates))

364

* Use orders and extract order_id as well as order_date from each element in the form of a tuple. Make sure that order_id is of type int.

In [34]:
orders[:10]

['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']

In [36]:
# Sample of the output needed for a single record
ordr = '1,2013-07-25 00:00:00.0,11599,CLOSED'
(int(ordr.split(',')[0]), ordr.split(',')[1])

(1, '2013-07-25 00:00:00.0')

In [37]:
# Apply the same logic to every order using myMap
orders_transformed = myMap(orders,
                          lambda order: (int(order.split(',')[0]), order.split(',')[1])
                          )

In [41]:
orders_transformed[:10]

[(1, '2013-07-25 00:00:00.0'),
 (2, '2013-07-25 00:00:00.0'),
 (3, '2013-07-25 00:00:00.0'),
 (4, '2013-07-25 00:00:00.0'),
 (5, '2013-07-25 00:00:00.0'),
 (6, '2013-07-25 00:00:00.0'),
 (7, '2013-07-25 00:00:00.0'),
 (8, '2013-07-25 00:00:00.0'),
 (9, '2013-07-25 00:00:00.0'),
 (10, '2013-07-25 00:00:00.0')]

In [39]:
order_tuples = myMap(orders,
                     lambda order: (int(order.split(',')[0]), order.split(',')[1])
                    )

In [40]:
order_tuples[:10]

[(1, '2013-07-25 00:00:00.0'),
 (2, '2013-07-25 00:00:00.0'),
 (3, '2013-07-25 00:00:00.0'),
 (4, '2013-07-25 00:00:00.0'),
 (5, '2013-07-25 00:00:00.0'),
 (6, '2013-07-25 00:00:00.0'),
 (7, '2013-07-25 00:00:00.0'),
 (8, '2013-07-25 00:00:00.0'),
 (9, '2013-07-25 00:00:00.0'),
 (10, '2013-07-25 00:00:00.0')]

## Develop myReduce
Develop a function named myReduce which takes a collection and a function as arguments. The function should do the following:
* Iterate through the elements of the collection.
* Perform the aggregation using the function passed as an argument. That function should take two values and contain the arithmetic logic needed to combine them.
* Return the aggregated result.
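
The steps above can be sketched as follows, using the same definition of myReduce that appears in the Exercises section. The sample list and the two lambda functions are assumptions made up for this illustration.

```python
def myReduce(c, f):
    # Start with the first element as the running result
    t = c[0]
    # Combine the running result with each remaining element
    for e in c[1:]:
        t = f(t, e)
    return t


numbers = [1, 2, 3, 4, 5]

# Sum of all the elements
total = myReduce(numbers, lambda x, y: x + y)

# Minimum of all the elements
smallest = myReduce(numbers, lambda x, y: x if x < y else y)

print(total)     # 15
print(smallest)  # 1
```

Because the running result starts from `c[0]`, this version raises an IndexError when the collection is empty and expects a collection that supports indexing and slicing, such as a list.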

## Develop myReduceByKey
Develop a function named myReduceByKey which takes a collection of tuples and a function as arguments. Each element in the collection should have exactly 2 attributes. The function should do the following:
* Iterate through the collection of tuples.
* Group the data by the first element of each tuple and aggregate the second elements within each group using the function passed as an argument. That function should contain the necessary arithmetic logic.
* Return a collection of tuples, where the first element is unique and the second element is the aggregated result.

* Use the function to get the count by date from orders.

* Use the function to get the revenue for each order id.
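
A minimal sketch of both uses, based on the myMap and myReduceByKey definitions from the Exercises section. The `sample_orders` records are copied from the outputs earlier in this section, while the `(order_id, subtotal)` pairs in `order_item_pairs` are made-up values, since the order items data is not shown here.

```python
def myMap(c, f):
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t


def myReduceByKey(p, f):
    p_f = {}
    for e in p:
        if e[0] in p_f:
            # Key seen before: combine the stored value with the new one
            p_f[e[0]] = f(p_f[e[0]], e[1])
        else:
            # First occurrence of the key: store the value as is
            p_f[e[0]] = e[1]
    return list(p_f.items())


sample_orders = [
    '1,2013-07-25 00:00:00.0,11599,CLOSED',
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
    '3774,2013-08-16 00:00:00.0,12431,CANCELED',
]

# Count by date: map each order to (order_date, 1), then add up the 1s per date
date_pairs = myMap(sample_orders, lambda order: (order.split(',')[1], 1))
count_by_date = myReduceByKey(date_pairs, lambda x, y: x + y)
print(count_by_date)

# Revenue per order id: hypothetical (order_id, subtotal) pairs
order_item_pairs = [(1, 300.0), (2, 200.0), (2, 250.0), (2, 130.0)]
revenue_per_order = myReduceByKey(order_item_pairs, lambda x, y: x + y)
print(revenue_per_order)
```

The dictionary keeps one running value per key, so the function passed in only ever needs to combine two values at a time, just as with myReduce.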

## Exercises
Here are the same exercises which you have solved before. Try to solve these using the Map Reduce APIs.
* We will provide you a Python script which has all the above Map Reduce APIs. Use it as a module to solve the problems listed below.
  * Create a file with the name `mymapreduce.py` containing the functions below.
  * Import and use it with `from mymapreduce import *`.

In [None]:
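def myFilter(c, f):
    # Keep only the elements for which f returns a truthy value
    c_f = []
    for e in c:
        if f(e):
            c_f.append(e)
    return c_f

def myMap(c, f):
    # Apply f to every element; the result has the same length as c
    c_t = []
    for e in c:
        c_t.append(f(e))
    return c_t

def myReduce(c, f):
    # Start from the first element and combine it with the rest using f
    t = c[0]
    for e in c[1:]:
        t = f(t, e)
    return t

def myReduceByKey(p, f):
    # p is a collection of (key, value) tuples; combine values per key using f
    p_f = {}
    for e in p:
        if e[0] in p_f:
            p_f[e[0]] = f(p_f[e[0]], e[1])
        else:
            p_f[e[0]] = e[1]
    return list(p_f.items())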

* Get the number of COMPLETE orders placed by each customer.
* Get the total number of PENDING or PENDING_PAYMENT orders for January 2014.
* Get the outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.