# Data Processing using Pandas Dataframe APIs

* Overview of Pandas for Data Processing
* Overview of Reading CSV Data using Pandas
* Read Data from CSV Files to Pandas Dataframes
* Filter Data in Pandas Dataframe using query
* Get Count by Status using Pandas Dataframe APIs
* Get Count by Month and Status using Pandas Dataframe APIs

* Overview of Pandas for Data Processing

Let us get an overview of the Python Pandas library.
* Pandas is one of the most popular Python libraries.
* Usage: Data Processing and Data Analysis (including Visualization).
* Robust APIs to read data from different sources, process data, and write data to different targets.
* Integrates with File Formats, Databases, REST APIs, etc.
* Easy to integrate with additional File Systems or Databases by installing additional libraries.
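
As a minimal sketch of the read, process, and write cycle described above, the following uses a small in-memory CSV instead of a real file. The sample rows and the names `csv_text`, `sample`, `complete`, and `output` are assumptions made purely for illustration.

```python
import io

import pandas as pd

# Hypothetical sample rows, shaped like the orders data used later
csv_text = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
    "3,2013-07-25 00:00:00.0,12111,COMPLETE\n"
)

# Read: parse the CSV text into a DataFrame
sample = pd.read_csv(
    io.StringIO(csv_text),
    names=['order_id', 'order_date', 'order_customer_id', 'order_status']
)

# Process: keep only the completed orders
complete = sample[sample['order_status'] == 'COMPLETE']

# Write: serialize the result back to CSV text (no target path given)
output = complete.to_csv(index=False)
print(output)
```

Passing a file path to `to_csv` instead of omitting it would write the result to disk.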

In [None]:
!pip install pandas

In [None]:
import pandas as pd

* Overview of Reading CSV Data using Pandas

In [None]:
help(pd.read_csv)

* Read Data from CSV Files to Pandas Dataframes

In [None]:
orders_columns = [
    'order_id', 'order_date',
    'order_customer_id', 'order_status'
]

In [None]:
pd.read_csv(
    'data/retail_db/orders/part-00000', 
    names=orders_columns
)
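
Passing `names` suggests that the file has no header row. The sketch below illustrates the difference on an in-memory stand-in for the file; the two sample rows and the variable names are assumptions, not the actual contents of `part-00000`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for a headerless CSV file
csv_text = (
    "1,2013-07-25 00:00:00.0,11599,CLOSED\n"
    "2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n"
)
columns = ['order_id', 'order_date', 'order_customer_id', 'order_status']

# Without names, the first data row is consumed as the header
without_names = pd.read_csv(io.StringIO(csv_text))

# With names, every row is treated as data
with_names = pd.read_csv(io.StringIO(csv_text), names=columns)

print(len(without_names), len(with_names))
print(list(with_names.columns))
```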

* Filter Data in Pandas Dataframe using query

In [None]:
orders = pd.read_csv(
    'data/retail_db/orders/part-00000', 
    names=orders_columns
)

In [None]:
orders

In [None]:
orders.columns

In [None]:
orders.query?

In [None]:
orders['order_status'].unique()

In [None]:
orders.query('order_status == "COMPLETE"')

In [None]:
orders.query('order_status == "COMPLETE" and order_date == "2014-01-01 00:00:00.0"')

In [None]:
orders.query('order_status == "COMPLETE" or order_status == "CLOSED"')

In [None]:
orders.query('order_status == ("COMPLETE", "CLOSED")')
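
The same kind of filter can also be written as a boolean mask, or with a Python variable referenced inside the query string through `@`. The following is a small sketch on made-up data; `sample` and `statuses` are illustrative names only.

```python
import pandas as pd

# Hypothetical sample data with only the columns needed for the filter
sample = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_status': ['CLOSED', 'PENDING_PAYMENT', 'COMPLETE', 'CLOSED']
})

# Filter using a query string
by_query = sample.query('order_status == "COMPLETE" or order_status == "CLOSED"')

# Equivalent filter using a boolean mask built with isin
by_mask = sample[sample['order_status'].isin(['COMPLETE', 'CLOSED'])]

# Equivalent filter referencing a local Python variable with @
statuses = ['COMPLETE', 'CLOSED']
by_variable = sample.query('order_status in @statuses')

print(by_query.equals(by_mask), by_query.equals(by_variable))
```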

* Get Count by Status using Pandas Dataframe APIs

Here are the tasks related to aggregations.
* Get count by order status.
* Get count by order month and then by order status. The order month is the first 7 characters of the order date (`YYYY-MM`), which we can derive using `apply` on the dataframe.

```python
orders.apply(lambda order: order.order_date[:7], axis=1)
```
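
As an alternative sketch, the same month value can be derived with the vectorized `.str` accessor on the column, which avoids calling a Python function once per row. The sample dates below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample with order dates stored as strings
sample = pd.DataFrame({
    'order_date': ['2013-07-25 00:00:00.0', '2014-01-01 00:00:00.0']
})

# Row-wise approach, as in the example above
by_apply = sample.apply(lambda order: order.order_date[:7], axis=1)

# Vectorized approach: slice the first 7 characters of each string
by_str = sample['order_date'].str[:7]

print(by_str.tolist())
```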

In [None]:
orders.columns

In [None]:
help(orders.groupby)

In [None]:
orders. \
    groupby('order_status')['order_id']. \
    agg(order_count='count')
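
For a simple count per value, `value_counts` on the column is a shorter alternative. The sketch below compares both on made-up data. Note that `count` skips rows where `order_id` is null, whereas `value_counts` counts the non-null `order_status` values, so the two agree only when neither column has missing values.

```python
import pandas as pd

# Hypothetical sample data
sample = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_status': ['CLOSED', 'PENDING_PAYMENT', 'COMPLETE', 'CLOSED']
})

# Named aggregation: returns a DataFrame with an order_count column
by_groupby = sample. \
    groupby('order_status')['order_id']. \
    agg(order_count='count')

# value_counts: returns a Series indexed by status
by_value_counts = sample['order_status'].value_counts()

print(by_groupby)
print(by_value_counts)
```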

In [None]:
orders

* Get Count by Month and Status using Pandas Dataframe APIs

In [None]:
orders['order_month'] = orders.apply(lambda order: order.order_date[:7], axis=1)

In [None]:
orders

In [None]:
orders. \
    groupby(['order_month', 'order_status'])['order_id']. \
    agg(order_count='count')

In [None]:
orders. \
    groupby(['order_month', 'order_status'])['order_id']. \
    agg(order_count='count'). \
    reset_index()
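
To make the effect of `reset_index` easier to see, here is a self-contained sketch of the whole flow on a few made-up rows. The data and the name `counts` are assumptions. The group keys start out as a two-level index and become ordinary columns after `reset_index`.

```python
import pandas as pd

# Hypothetical sample orders spanning two months
sample = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'order_date': [
        '2013-07-25 00:00:00.0', '2013-07-26 00:00:00.0',
        '2013-08-01 00:00:00.0', '2013-08-02 00:00:00.0'
    ],
    'order_status': ['CLOSED', 'CLOSED', 'COMPLETE', 'CLOSED']
})

# Derive the month (YYYY-MM) from the date string
sample['order_month'] = sample['order_date'].str[:7]

# Count orders per month and status, then flatten the index into columns
counts = sample. \
    groupby(['order_month', 'order_status'])['order_id']. \
    agg(order_count='count'). \
    reset_index()

print(counts)
```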