## Exercises - Pandas Data Frames

The following exercises practice common Pandas Data Frame operations on the retail_db data set.
* Create Pandas Data Frames using Schema
* Get all the orders which belong to the month of 2013 August
* Get all the orders which belong to the months of August, September and October in 2013.
* Get count of orders by status for the month of 2014 January
* Get all the records from orders where there are no corresponding records in order_items
* Get all the customers who have not placed any orders
* Get the revenue by status

### Create Pandas Data Frames using Schema

Create Pandas Data Frames for orders, order_items and customers. Make sure to use **schemas/retail_db/retail.json** to get the column names.

In [107]:
import os
import json
import csv
import pandas as pd

def get_df(base_folder, data_set_name, schema_file):
    # Column names for this data set come from the schema JSON
    with open(schema_file) as schema_fh:
        retail_schemas = json.load(schema_fh)
    columns = [col['column_name'] for col in retail_schemas[data_set_name]]
    data = []
    # Read every file in the data set folder, one record per line
    for file_name in os.listdir(f'{base_folder}/{data_set_name}'):
        file_path = f'{base_folder}/{data_set_name}/{file_name}'
        with open(file_path) as raw_data:
            data += list(raw_data)
    # Lines are split as-is, so the last column keeps its trailing newline
    return pd.DataFrame(map(lambda rec: rec.split(','), data), columns=columns)

In [108]:
orders = get_df('/data/retail_db', 'orders', 'schemas/retail_db/retail.json')

In [109]:
order_items = get_df('/data/retail_db', 'order_items', 'schemas/retail_db/retail.json')

In [110]:
customers = get_df('/data/retail_db', 'customers', 'schemas/retail_db/retail.json')

In [111]:
orders.head(2)

Unnamed: 0,order_id,order_date,order_customer_id,order_status
0,1,2013-07-25 00:00:00.0,11599,CLOSED\n
1,2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n


In [112]:
order_items.head(2)

Unnamed: 0,order_item_id,order_item_order_id,order_item_product_id,order_item_quantity,order_item_subtotal,order_item_product_price
0,1,1,957,1,299.98,299.98\n
1,2,2,1073,1,199.99,199.99\n


In [113]:
customers.head(2)

Unnamed: 0,customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
0,1,Richard,Hernandez,XXXXXXXXX,XXXXXXXXX,6303 Heather Plaza,Brownsville,TX,78521\n
1,2,Mary,Barrett,XXXXXXXXX,XXXXXXXXX,9526 Noble Embers Ridge,Littleton,CO,80126\n
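
The `\n` at the end of the last column in the previews above comes from splitting each raw line on commas without removing the line terminator. The following is a minimal sketch of one way to avoid it, assuming the files are plain comma-separated text without a header row. The name `get_df_clean` and the temporary demo files are hypothetical and only serve to make the example self-contained.

```python
import csv
import json
import os
import tempfile

import pandas as pd


def get_df_clean(base_folder, data_set_name, schema_file):
    """Hypothetical variant of get_df that lets csv.reader handle line endings."""
    with open(schema_file) as schema_fh:
        schemas = json.load(schema_fh)
    columns = [col['column_name'] for col in schemas[data_set_name]]
    folder = os.path.join(base_folder, data_set_name)
    rows = []
    for file_name in sorted(os.listdir(folder)):
        # newline='' is the mode recommended for files passed to csv.reader
        with open(os.path.join(folder, file_name), newline='') as fh:
            rows.extend(csv.reader(fh))
    return pd.DataFrame(rows, columns=columns)


# Demo on throwaway files that mimic the layout used in this notebook
order_columns = ['order_id', 'order_date', 'order_customer_id', 'order_status']
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'orders'))
    with open(os.path.join(tmp, 'orders', 'part-00000'), 'w') as fh:
        fh.write('1,2013-07-25 00:00:00.0,11599,CLOSED\n')
        fh.write('2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n')
    schema_path = os.path.join(tmp, 'retail.json')
    with open(schema_path, 'w') as fh:
        json.dump({'orders': [{'column_name': c} for c in order_columns]}, fh)
    demo_orders = get_df_clean(tmp, 'orders', schema_path)

print(demo_orders)
```

All columns are still strings here, as in the original function. Numeric columns would need an explicit `astype` before any arithmetic.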


### Get all the orders which belong to the month of 2013 August

In [114]:

orders.query("order_date.str.startswith('2013-08')")

Unnamed: 0,order_id,order_date,order_customer_id,order_status
1296,1297,2013-08-01 00:00:00.0,11607,COMPLETE\n
1297,1298,2013-08-01 00:00:00.0,5105,CLOSED\n
1298,1299,2013-08-01 00:00:00.0,7802,COMPLETE\n
1299,1300,2013-08-01 00:00:00.0,553,PENDING_PAYMENT\n
1300,1301,2013-08-01 00:00:00.0,1604,PENDING_PAYMENT\n
...,...,...,...,...
68705,68706,2013-08-20 00:00:00.0,130,COMPLETE\n
68706,68707,2013-08-23 00:00:00.0,11730,COMPLETE\n
68707,68708,2013-08-26 00:00:00.0,8852,ON_HOLD\n
68708,68709,2013-08-30 00:00:00.0,4756,COMPLETE\n
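
The string prefix match works because `order_date` is stored as text. As an alternative sketch, the column can be parsed into timestamps and filtered by year and month. The sample frame below is made up for illustration and only mirrors the column names used in this notebook.

```python
import pandas as pd

# Small, made-up sample with the same date format as the orders data
sample = pd.DataFrame({
    'order_id': ['1', '1297', '1298', '25876'],
    'order_date': [
        '2013-07-25 00:00:00.0',
        '2013-08-01 00:00:00.0',
        '2013-08-01 00:00:00.0',
        '2014-01-01 00:00:00.0',
    ],
})

# Parse the text column once, then compare the date parts
dates = pd.to_datetime(sample['order_date'])
august_2013 = sample[(dates.dt.year == 2013) & (dates.dt.month == 8)]
print(august_2013)
```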


### Get all the orders which belong to the months of August, September and October in 2013.

In [115]:
orders.query(
    '''order_date.str.startswith("2013-08") or order_date.str.startswith("2013-09") 
    or order_date.str.startswith("2013-10")'''
, engine='python')

Unnamed: 0,order_id,order_date,order_customer_id,order_status
1296,1297,2013-08-01 00:00:00.0,11607,COMPLETE\n
1297,1298,2013-08-01 00:00:00.0,5105,CLOSED\n
1298,1299,2013-08-01 00:00:00.0,7802,COMPLETE\n
1299,1300,2013-08-01 00:00:00.0,553,PENDING_PAYMENT\n
1300,1301,2013-08-01 00:00:00.0,1604,PENDING_PAYMENT\n
...,...,...,...,...
68737,68738,2013-10-27 00:00:00.0,1100,COMPLETE\n
68738,68739,2013-10-28 00:00:00.0,2528,PENDING\n
68739,68740,2013-10-29 00:00:00.0,10691,ON_HOLD\n
68740,68741,2013-10-30 00:00:00.0,5974,PENDING_PAYMENT\n
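
Chaining several `startswith` conditions gets long as more months are added. One possible shortcut, sketched here on made-up sample rows, is to slice the `YYYY-MM` prefix and test it against a list with `isin`.

```python
import pandas as pd

# Made-up sample rows in the same date format as the orders data
sample = pd.DataFrame({
    'order_id': ['1', '2', '3', '4', '5'],
    'order_date': [
        '2013-07-25 00:00:00.0',
        '2013-08-01 00:00:00.0',
        '2013-09-15 00:00:00.0',
        '2013-10-30 00:00:00.0',
        '2013-11-01 00:00:00.0',
    ],
})

months = ['2013-08', '2013-09', '2013-10']
# The first 7 characters of each date string are the year and month
in_range = sample[sample['order_date'].str[:7].isin(months)]
print(in_range)
```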


### Get count of orders by status for the month of 2014 January

In [116]:
data = orders.query('order_date.str.startswith("2014-01")')
data.groupby('order_status')['order_id'].count()

order_status
CANCELED\n            110
CLOSED\n              633
COMPLETE\n           1911
ON_HOLD\n             365
PAYMENT_REVIEW\n       77
PENDING\n             635
PENDING_PAYMENT\n    1334
PROCESSING\n          712
SUSPECTED_FRAUD\n     131
Name: order_id, dtype: int64
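
The status labels above still carry the trailing `\n` from the raw files. A small sketch of the same count using `value_counts`, with the labels stripped first, is shown below. The sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample rows; statuses keep the trailing newline seen in the output above
orders_demo = pd.DataFrame({
    'order_id': ['1', '2', '3', '4'],
    'order_date': [
        '2014-01-01 00:00:00.0',
        '2014-01-02 00:00:00.0',
        '2014-01-15 00:00:00.0',
        '2013-12-31 00:00:00.0',
    ],
    'order_status': ['COMPLETE\n', 'CLOSED\n', 'COMPLETE\n', 'COMPLETE\n'],
})

january = orders_demo[orders_demo['order_date'].str.startswith('2014-01')]
# Strip whitespace from the labels, count them, and sort by status name
counts = january['order_status'].str.strip().value_counts().sort_index()
print(counts)
```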

### Get all the records from orders where there are no corresponding records in order_items

In [117]:
orders_with_index = orders.set_index('order_id') ## set the order_id column as index

In [118]:
order_items_with_index = order_items.set_index('order_item_order_id') ## set the order_item_order_id as index column

In [119]:
result = orders_with_index.join(order_items_with_index) ## Join on index values; DataFrame.join defaults to a left join

In [120]:
## Keep the records where order_item_id is NaN. NaN in order_item_id means the order has no matching order items.
records = result.query('order_item_id.isna()') 

In [121]:
## Drop all the order_items columns and return only the columns of the orders dataframe

records.drop(order_items_with_index.columns, axis=1) 

Unnamed: 0,order_date,order_customer_id,order_status
10015,2013-09-25 00:00:00.0,3112,COMPLETE\n
10016,2013-09-25 00:00:00.0,1214,PROCESSING\n
10022,2013-09-25 00:00:00.0,2697,PENDING_PAYMENT\n
10031,2013-09-25 00:00:00.0,9968,COMPLETE\n
10035,2013-09-25 00:00:00.0,2570,PENDING_PAYMENT\n
...,...,...,...
9978,2013-09-25 00:00:00.0,3100,CANCELED\n
9980,2013-09-25 00:00:00.0,8261,PENDING_PAYMENT\n
9982,2013-09-25 00:00:00.0,4860,COMPLETE\n
9994,2013-09-25 00:00:00.0,9585,PENDING_PAYMENT\n


In [122]:
## This also returns the same orders (with order_id as a regular column instead of the index)
##orders[~orders.order_id.isin(order_items.order_item_order_id)]
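
A third way to express the same anti-join, sketched here on made-up frames, is a left `pd.merge` with `indicator=True`, keeping only the rows marked `left_only`. The frame names ending in `_demo` are hypothetical.

```python
import pandas as pd

# Made-up frames: order 3 has no order items
orders_demo = pd.DataFrame({
    'order_id': ['1', '2', '3'],
    'order_status': ['CLOSED', 'PENDING_PAYMENT', 'COMPLETE'],
})
order_items_demo = pd.DataFrame({
    'order_item_id': ['1', '2', '3'],
    'order_item_order_id': ['1', '2', '2'],
})

merged = pd.merge(
    orders_demo,
    order_items_demo,
    how='left',
    left_on='order_id',
    right_on='order_item_order_id',
    indicator=True,  # adds a _merge column describing where each row came from
)
# Rows found only on the left side are orders without order items
orders_without_items = merged.loc[merged['_merge'] == 'left_only', orders_demo.columns]
print(orders_without_items)
```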

### Get all the customers who have not placed any orders

In [124]:

cust = customers[~customers.customer_id.isin(orders.order_customer_id)]
cust

Unnamed: 0,customer_id,customer_fname,customer_lname,customer_email,customer_password,customer_street,customer_city,customer_state,customer_zipcode
218,219,Mary,Harrell,XXXXXXXXX,XXXXXXXXX,9016 Foggy Robin Expressway,Denver,CO,80219\n
338,339,Mary,Greene,XXXXXXXXX,XXXXXXXXX,4271 Hazy Close,Long Beach,CA,90805\n
468,469,Randy,Smith,XXXXXXXXX,XXXXXXXXX,252 Golden Goose Loop,South San Francisco,CA,94080\n
1186,1187,Dorothy,Vazquez,XXXXXXXXX,XXXXXXXXX,363 Green Goose Run,Danbury,CT,06810\n
1480,1481,Grace,Smith,XXXXXXXXX,XXXXXXXXX,2171 Clear Lake Isle,Caguas,PR,00725\n
1807,1808,Albert,Ellison,XXXXXXXXX,XXXXXXXXX,9795 Heather Wynd,Billings,MT,59102\n
2072,2073,Donna,Stephens,XXXXXXXXX,XXXXXXXXX,9792 Cozy Corners,Sunnyvale,CA,94087\n
2095,2096,Jose,Tanner,XXXXXXXXX,XXXXXXXXX,8976 Old Hickory Landing,Bronx,NY,10467\n
2449,2450,James,Smith,XXXXXXXXX,XXXXXXXXX,4063 Little Creek Court,Newark,DE,19702\n
4554,4555,Mary,Smith,XXXXXXXXX,XXXXXXXXX,5455 Red Lagoon Maze,Caguas,PR,00725\n
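
The `isin` filter relies on both id columns holding the same type, which is the case here because every column was read as a string. The sketch below, on invented frames, shows the filter and what happens if only one side is converted to integers.

```python
import pandas as pd

# Invented frames with string ids, as produced by the loader in this notebook
customers_demo = pd.DataFrame({
    'customer_id': ['1', '2', '3'],
    'customer_fname': ['Richard', 'Mary', 'Ann'],
})
orders_demo = pd.DataFrame({
    'order_id': ['10', '11'],
    'order_customer_id': ['1', '3'],
})

# Customers whose id never appears in orders
no_orders = customers_demo[
    ~customers_demo['customer_id'].isin(orders_demo['order_customer_id'])
]
print(no_orders)

# Comparing string ids against integer ids finds no matches at all
mismatched = customers_demo['customer_id'].isin(
    orders_demo['order_customer_id'].astype(int)
)
print(mismatched.any())
```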


### Get the revenue by status

In [125]:
result = pd.merge(left=orders, right=order_items, how="inner", left_on="order_id", 
                  right_on="order_item_order_id", indicator=True)

In [126]:
result['order_item_subtotal']=result['order_item_subtotal'].astype('float')

In [127]:
result.groupby('order_status')['order_item_subtotal'].sum().round(2)

order_status
CANCELED\n             696030.99
CLOSED\n              3736048.79
COMPLETE\n           11276933.69
ON_HOLD\n             1864731.24
PAYMENT_REVIEW\n       357841.45
PENDING\n             3851881.28
PENDING_PAYMENT\n     7581671.05
PROCESSING\n          4190636.76
SUSPECTED_FRAUD\n      766844.68
Name: order_item_subtotal, dtype: float64
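
The steps for this exercise can be followed end to end on a small example. The sketch below uses invented rows, and additionally strips the trailing newline from `order_status` before grouping, which the cells above do not do.

```python
import pandas as pd

# Invented rows; all values are strings, as in the loaded data
orders_demo = pd.DataFrame({
    'order_id': ['1', '2'],
    'order_status': ['CLOSED\n', 'PENDING_PAYMENT\n'],
})
order_items_demo = pd.DataFrame({
    'order_item_order_id': ['1', '2', '2', '2'],
    'order_item_subtotal': ['299.98', '199.99', '250.0', '129.99'],
})

# Inner join: only orders that have at least one order item contribute revenue
joined = pd.merge(
    orders_demo,
    order_items_demo,
    how='inner',
    left_on='order_id',
    right_on='order_item_order_id',
)
# Subtotals were read as text, so convert them before summing
joined['order_item_subtotal'] = joined['order_item_subtotal'].astype('float')
joined['order_status'] = joined['order_status'].str.strip()

revenue = joined.groupby('order_status')['order_item_subtotal'].sum().round(2)
print(revenue)
```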