## Performing Aggregations

Let us understand how to perform aggregations using Pandas. There are two types of aggregations:
* Global Aggregations, which produce one result for the whole Data Frame
* By Key Aggregations, which produce one result per key
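
Before loading the actual data, here is a minimal sketch of the difference between the two types. The `sample` Data Frame and its `subtotal` column are made up for illustration and are not part of the retail_db data set.

```python
import pandas as pd

# Hypothetical in-memory sample, not the retail_db data set
sample = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 2],
    "subtotal": [100.0, 50.0, 20.0, 30.0, 10.0],
})

# Global aggregation: a single value for the whole column
total_revenue = sample["subtotal"].sum()

# By key aggregation: one value per distinct order_id
revenue_per_order = sample.groupby("order_id")["subtotal"].sum()

print(total_revenue)
print(revenue_per_order)
```

The global aggregation returns a scalar, while the by key aggregation returns a Series indexed by the key.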

In [2]:
import pandas as pd

In [3]:
orders_path = "/data/retail_db/orders/part-00000"

In [4]:
orders_schema = [
    "order_id",
    "order_date",
    "order_customer_id",
    "order_status"
]

In [5]:
orders = pd.read_csv(orders_path,
                     delimiter=',',
                     header=None,
                     names=orders_schema
                    )

In [6]:
order_items_path = "/data/retail_db/order_items/part-00000"

In [7]:
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]

In [9]:
order_items = pd.read_csv(order_items_path,
                          delimiter=',',
                          header=None,
                          names=order_items_schema
                         )

### Global Aggregations

There are several global aggregations that can be performed.
* Getting the number of records in the Data Frame

In [10]:
orders.shape

(68883, 4)

In [12]:
orders.shape[0]

68883

* Getting the number of non-null (non-NaN) values in each attribute of a Data Frame

In [15]:
orders.count()

order_id             68883
order_date           68883
order_customer_id    68883
order_status         68883
dtype: int64

In [16]:
type(orders.count())

pandas.core.series.Series

In [17]:
orders.count()['order_id']

68883
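
In the orders data every column is fully populated, so `count` and `shape[0]` agree. The sketch below uses a small made-up Data Frame with one missing value (an assumption for illustration only) to show where the two differ.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing order_status
sample = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_status": ["COMPLETE", np.nan, "CLOSED"],
})

row_count = sample.shape[0]        # counts every row, missing values included
non_null_counts = sample.count()   # counts only non-missing values per column

print(row_count)
print(non_null_counts)
```

Use `shape[0]` (or `len`) when you need the number of rows, and `count` when you need to know how many values are actually present.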

* Getting basic statistics of numeric fields of a Data Frame

In [18]:
orders.describe()

| | order_id | order_customer_id |
|---|---|---|
| count | 68883.0 | 68883.0 |
| mean | 34442.0 | 6216.571099 |
| std | 19884.953633 | 3586.205241 |
| min | 1.0 | 1.0 |
| 25% | 17221.5 | 3122.0 |
| 50% | 34442.0 | 6199.0 |
| 75% | 51662.5 | 9326.0 |
| max | 68883.0 | 12435.0 |
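
By default `describe` reports only the numeric columns, which is why order_date and order_status are missing above. As a small sketch on a made-up sample (not the orders data), passing `include="all"` also summarizes the non-numeric columns.

```python
import pandas as pd

# Hypothetical sample with one numeric and one string column
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "order_status": ["COMPLETE", "CLOSED", "COMPLETE", "PENDING"],
})

numeric_stats = sample.describe()             # numeric columns only
all_stats = sample.describe(include="all")    # numeric and non-numeric columns

print(numeric_stats.columns.tolist())
print(all_stats.loc[["count", "unique", "top", "freq"], "order_status"])
```

For non-numeric columns the summary contains the count, the number of unique values, the most frequent value and its frequency.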


* Getting revenue for order id 2 from order_items

In [19]:
order_items[order_items.order_item_order_id == 2].order_item_subtotal.sum()

579.98
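
The same filter-then-sum pattern can also be written with `loc`, which selects the rows and the column in one step. This is a sketch on a small in-memory Data Frame that copies the first few order item rows; the `round` call is an optional addition, because summing floating point values can leave a tiny residue.

```python
import pandas as pd

# Small sample copied from the first rows of order_items
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99],
})

# Boolean mask that is True for the items of order id 2
mask = items["order_item_order_id"] == 2

# Select matching rows and the subtotal column, then add them up
revenue = round(items.loc[mask, "order_item_subtotal"].sum(), 2)

print(revenue)
```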

### By Key Aggregations

By Key Aggregations are computed once per distinct key value, for example per order date or per order id. Here are some examples.
* Getting number of orders per day

In [20]:
orders.groupby('order_date')['order_id'].count()

order_date
2013-07-25 00:00:00.0    143
2013-07-26 00:00:00.0    269
2013-07-27 00:00:00.0    202
2013-07-28 00:00:00.0    187
2013-07-29 00:00:00.0    253
                        ... 
2014-07-20 00:00:00.0    285
2014-07-21 00:00:00.0    235
2014-07-22 00:00:00.0    138
2014-07-23 00:00:00.0    166
2014-07-24 00:00:00.0    185
Name: order_id, Length: 364, dtype: int64

* Getting number of orders per status

In [21]:
orders.groupby('order_status')['order_status'].count()

order_status
CANCELED            1428
CLOSED              7556
COMPLETE           22899
ON_HOLD             3798
PAYMENT_REVIEW       729
PENDING             7610
PENDING_PAYMENT    15030
PROCESSING          8275
SUSPECTED_FRAUD     1558
Name: order_status, dtype: int64
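
Counting rows per distinct value of a single column can also be done with `value_counts`. The sketch below uses a made-up list of statuses rather than the orders data, and sorts the `value_counts` result by index so that it lines up with the `groupby` result.

```python
import pandas as pd

# Hypothetical sample of order statuses
sample = pd.DataFrame({
    "order_status": ["COMPLETE", "CLOSED", "COMPLETE", "PENDING", "COMPLETE"],
})

by_groupby = sample.groupby("order_status")["order_status"].count()

# value_counts orders by frequency, so sort by index to match groupby
by_value_counts = sample["order_status"].value_counts().sort_index()

print(by_groupby)
print(by_value_counts)
```

Both approaches give the same counts; `groupby` becomes necessary once you aggregate a different column than the key or need more than a count.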

* Computing revenue per order

In [22]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    sum()

order_item_order_id
1         299.98
2         579.98
4         699.85
5        1129.86
7         579.92
          ...   
68879    1259.97
68880     999.77
68881     129.99
68882     109.99
68883    2149.99
Name: order_item_subtotal, Length: 57431, dtype: float64

In [23]:
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'})

| order_item_order_id | revenue | min | max | item_count |
|---|---|---|---|---|
| 1 | 299.98 | 299.98 | 299.98 | 1 |
| 2 | 579.98 | 129.99 | 250.00 | 3 |
| 4 | 699.85 | 49.98 | 299.95 | 4 |
| 5 | 1129.86 | 99.96 | 299.98 | 5 |
| 7 | 579.92 | 79.95 | 299.98 | 3 |
| ... | ... | ... | ... | ... |
| 68879 | 1259.97 | 129.99 | 999.99 | 3 |
| 68880 | 999.77 | 149.94 | 250.00 | 5 |
| 68881 | 129.99 | 129.99 | 129.99 | 1 |
| 68882 | 109.99 | 50.00 | 59.99 | 2 |
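
As an alternative to `agg` followed by `rename`, pandas 0.25 and later support named aggregation, where each output column is declared as `name=(column, function)`. This is a sketch on a small sample copied from the first order item rows; the output names `min_subtotal` and `max_subtotal` are chosen here for illustration.

```python
import pandas as pd

# Small sample copied from the first rows of order_items
items = pd.DataFrame({
    "order_item_order_id": [1, 2, 2, 2],
    "order_item_subtotal": [299.98, 199.99, 250.00, 129.99],
})

# Each keyword names an output column: (input column, aggregate function)
summary = items.groupby("order_item_order_id").agg(
    revenue=("order_item_subtotal", "sum"),
    min_subtotal=("order_item_subtotal", "min"),
    max_subtotal=("order_item_subtotal", "max"),
    item_count=("order_item_subtotal", "count"),
)

print(summary)
```

Named aggregation gives the result columns their final names in one step and also allows different input columns per output column.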


* Renaming order_item_order_id to order_id

In [25]:
order_items.rename(columns={'order_item_order_id': 'order_id'})

| | order_item_id | order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
| 1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
| 2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
| 3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
| 4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
| ... | ... | ... | ... | ... | ... | ... |
| 172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
| 172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
| 172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
| 172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
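
Note that `rename` returns a new Data Frame and leaves the original untouched, so the result has to be assigned to a variable to be reused. A minimal sketch, using a two-row sample and a hypothetical variable name `renamed`:

```python
import pandas as pd

# Small sample copied from the first rows of order_items
items = pd.DataFrame({
    "order_item_order_id": [1, 2],
    "order_item_subtotal": [299.98, 199.99],
})

# rename returns a new Data Frame; items keeps its original column names
renamed = items.rename(columns={"order_item_order_id": "order_id"})

print(items.columns.tolist())
print(renamed.columns.tolist())
```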
