# Data Processing using Pandas Dataframe APIs

* Create Dataframes using dynamic column list on CSV Data
* Perform Inner Join between Pandas Dataframes
* Perform Aggregations on Join results
* Sort Data in Pandas Dataframes
* Overview of Writing Pandas Dataframes to Files
* Write Pandas Dataframes to JSON Files

* Create Dataframes using dynamic column list on CSV Data

In [None]:
import json

In [None]:
import pandas as pd

In [None]:
def get_column_names(schemas, ds_name, sorting_key='column_position'):
    column_details = schemas[ds_name]
    columns = sorted(column_details, key=lambda col: col[sorting_key])
    return [col['column_name'] for col in columns]

In [None]:
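# Use a context manager so the file handle is closed after loading
with open('data/retail_db/schemas.json') as schemas_file:
    schemas = json.load(schemas_file)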

In [None]:
orders_columns = get_column_names(schemas, 'orders')

In [None]:
orders = pd.read_csv(
    'data/retail_db/orders/part-00000',
    names=orders_columns
)

In [None]:
customers_columns = get_column_names(schemas, 'customers')

In [None]:
customers_columns

In [None]:
customers = pd.read_csv(
    'data/retail_db/customers/part-00000',
    names=customers_columns
)

In [None]:
orders

In [None]:
customers
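
The helper above implies that `schemas.json` maps each dataset name to a list of column descriptors carrying `column_name` and `column_position` keys. The following is a minimal sketch under that assumption. The schema entries and CSV rows are hypothetical sample values, not the contents of the actual files, and the helper is repeated only so the block runs on its own.

```python
import io

import pandas as pd


def get_column_names(schemas, ds_name, sorting_key='column_position'):
    column_details = schemas[ds_name]
    columns = sorted(column_details, key=lambda col: col[sorting_key])
    return [col['column_name'] for col in columns]


# Hypothetical schema, deliberately listed out of order
sample_schemas = {
    'orders': [
        {'column_name': 'order_date', 'column_position': 2},
        {'column_name': 'order_id', 'column_position': 1},
        {'column_name': 'order_customer_id', 'column_position': 3},
    ]
}

sample_columns = get_column_names(sample_schemas, 'orders')

# Header-less CSV text standing in for a part file (made-up rows)
csv_text = "1,2013-07-25,11599\n2,2013-07-25,256\n"
sample_orders = pd.read_csv(io.StringIO(csv_text), names=sample_columns)

print(sample_columns)
print(sample_orders)
```

Because the files have no header row, `names=` is what attaches the column labels, and sorting by `column_position` keeps those labels aligned with the order of the fields in each line.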

* Perform Inner Join between Pandas Dataframes

In [None]:
customers.join?

In [None]:
customers.set_index('customer_id')

In [None]:
customers = customers.set_index('customer_id')

In [None]:
orders = orders.set_index('order_customer_id')

In [None]:
customer_orders = customers.join(orders, how='inner')

In [None]:
customer_orders

In [None]:
customer_orders.shape
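
To see what the index-based inner join keeps and drops, here is a small sketch on made-up data. The frames and the `customer_fname` values are illustrative assumptions; only the join pattern mirrors the cells above.

```python
import pandas as pd

# Hypothetical customers, indexed by customer_id
sample_customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'customer_fname': ['Ann', 'Bob', 'Cara'],
}).set_index('customer_id')

# Hypothetical orders, indexed by the customer they belong to
sample_orders = pd.DataFrame({
    'order_id': [10, 11, 12, 13],
    'order_customer_id': [1, 1, 3, 4],
}).set_index('order_customer_id')

# join matches rows on the index of both frames
joined = sample_customers.join(sample_orders, how='inner')

# The same result expressed with merge
merged = sample_customers.merge(
    sample_orders, left_index=True, right_index=True, how='inner'
)

print(joined)
```

Customer 2 has no orders and order 13 refers to a customer that does not exist, so both are dropped. Customer 1 appears twice, once per matching order.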

* Perform Aggregations on Join results

In [None]:
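(
    customer_orders
    # Turn the join index back into a customer_id column (names= needs pandas 1.5+)
    .reset_index(names='customer_id')
    # Count the joined rows (one per order) for each customer
    .groupby('customer_id')['customer_id']
    .agg(order_count='count')
    .reset_index()
    # Keep only customers with at least 10 orders
    .query('order_count >= 10')
)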
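
The same count-then-filter pattern can be checked by hand on a tiny frame. This is a sketch with made-up rows and a lower threshold of 2, chosen only so the sample produces a visible result.

```python
import pandas as pd

# Hypothetical join output: one row per order
sample = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 3, 3],
    'order_id': [10, 11, 12, 13, 14, 15],
})

order_counts = (
    sample
    .groupby('customer_id')['order_id']
    # Named aggregation: the keyword becomes the output column name
    .agg(order_count='count')
    .reset_index()
    .query('order_count >= 2')
)

print(order_counts)
```

Customer 1 has three orders and customer 3 has two, so both pass the filter. Customer 2 has a single order and is removed.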

* Sort Data in Pandas Dataframes

In [None]:
orders

In [None]:
help(orders.sort_values)

In [None]:
orders.sort_values('order_customer_id')

In [None]:
orders.sort_values('order_customer_id', ascending=False)

In [None]:
orders.sort_values(['order_customer_id', 'order_date'])

In [None]:
orders.sort_values(
    ['order_customer_id', 'order_date'],
    ascending=False
)

In [None]:
orders.sort_values(
    ['order_customer_id', 'order_date'],
    ascending=[True, False]
)
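
A short sketch of the mixed-direction sort on made-up rows, which makes the effect of `ascending=[True, False]` easy to verify. It also sorts a copy indexed by `order_customer_id`, since `sort_values` accepts index level names in `by` as well as column labels. The sample values are assumptions for illustration.

```python
import pandas as pd

sample_orders = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_date': ['2013-07-25', '2013-07-26', '2013-07-25', '2013-07-27'],
    'order_customer_id': [20, 10, 10, 20],
})

# Customer id ascending, then most recent order first within each customer
sorted_orders = sample_orders.sort_values(
    ['order_customer_id', 'order_date'],
    ascending=[True, False]
)

# Same sort when order_customer_id is the index rather than a column
indexed_sorted = sample_orders.set_index('order_customer_id').sort_values(
    ['order_customer_id', 'order_date'],
    ascending=[True, False]
)

print(sorted_orders)
```

The dates here are ISO-formatted strings, so sorting them as text gives chronological order.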

* Overview of Writing Pandas Dataframes to Files

In [None]:
help(orders.to_csv)
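
As a quick illustration of the options that `help` lists, the sketch below writes a made-up frame to CSV text without the index or a header row, matching the layout of the header-less part files read earlier. The `order_status` values are hypothetical sample data.

```python
import pandas as pd

sample_orders = pd.DataFrame({
    'order_id': [1, 2],
    'order_status': ['CLOSED', 'PENDING_PAYMENT'],
})

# With no path given, to_csv returns the CSV content as a string
csv_text = sample_orders.to_csv(index=False, header=False)

print(csv_text)
```

Passing a file path as the first argument writes the same content to disk instead of returning it.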

* Write Pandas Dataframes to JSON Files

In [None]:
import os
os.makedirs('data/retail_db/orders_json', exist_ok=True)

In [None]:
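# Stores data in column-oriented format (the default orient='columns')
# Review the data in the file to understand how it is formatted
# orders is indexed by order_customer_id, which repeats across orders;
# orient='columns' requires a unique index, so reset the index first
orders.reset_index().to_json('data/retail_db/orders_json/part-00000')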

In [None]:
orders.to_json?

In [None]:
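# Save data in row format: one JSON document per line
# orient='records' does not write the index, so reset it to keep order_customer_id
orders.reset_index().to_json(
    'data/retail_db/orders_json/part-00000',
    orient='records',
    lines=True
)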

In [None]:
pd.read_json(
    'data/retail_db/orders_json/part-00000',
    lines=True
)
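
To compare the two layouts without opening the files, this sketch serializes a small made-up frame to strings and reads the row format back. The sample values are assumptions for illustration.

```python
import io
import json

import pandas as pd

sample_orders = pd.DataFrame({
    'order_id': [1, 2],
    'order_customer_id': [11599, 256],
})

# Default orient='columns': one JSON object keyed by column, then by index label
columnar = sample_orders.to_json()

# orient='records' with lines=True: one JSON object per row, one row per line
row_wise = sample_orders.to_json(orient='records', lines=True)

# Read the line-delimited form back into a DataFrame
round_trip = pd.read_json(io.StringIO(row_wise), lines=True)

print(json.loads(columnar))
print(row_wise)
print(round_trip)
```

The line-delimited form is usually easier to inspect and to process one record at a time, because each line is a complete JSON document.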