**Coursebook: Exploratory Data Analysis**
- Part 2 of Practical Data Analysis with Python and SQL
- Last Updated: January 2019

___

- Author: [Samuel Chan](https://github.com/onlyphantom)
- Developed by [Algoritma](https://algorit.ma)'s product division and instructors team

# Background

## Top-Down Approach 

The coursebook is part of the **Practical Data Analysis with Python and SQL** Specialization offered by [Algoritma](https://algorit.ma). It takes a more accessible approach than Algoritma's core educational products by getting participants over the "how" barrier first, rather than starting with a detailed breakdown of the "why".

This translates to an overall easier learning curve, one where the reader is prompted to write short snippets of code at frequent intervals before being offered an explanation of the underlying theoretical frameworks. Instead of mastering the syntactic design of the Python programming language, then moving on to data structures, then the `pandas` library, then the mathematical details of an imputation algorithm and its code implementation, we do the opposite: implement the imputation first, then give a succinct explanation of why it works and its practical considerations (what to look out for, what assumptions it makes, when _not_ to use it, etc.).

## Learn-by-Building

This coursebook is intended for participants who have completed the preceding courses offered in the **Practical Data Analysis with Python and SQL** Specialization. This is the second course, **Exploratory Data Analysis**.

The coursebook focuses on:
- Why and What: Exploratory Data Analysis
- Categorical Data Types  
- Date Time objects
- Group By Operations

The final part of this course is a Graded Assignment, where you are expected to apply all that you've learned to a new dataset and attempt the given questions.


# Data Preparation and Exploration

About 60 years ago, John Tukey defined data analysis as the "procedures for analyzing data, techniques for interpreting the results of such procedures ... and all the machinery of mathematical statistics which apply to analyzing data". His championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs (which later inspired R).

He wrote a book titled _Exploratory Data Analysis_ arguing that too much emphasis in statistics was placed on hypothesis testing (confirmatory data analysis) while not enough was placed on the discovery of the unexpected. 

> Exploratory data analysis isolates patterns and features of the data and reveals these forcefully to the analyst.

This course aims to present a selection of EDA techniques -- some developed by John Tukey himself -- but with a special emphasis on their application to modern business analytics.

In the previous course, we've got our hands on a few common techniques:

- `.head()` and `.tail()`
- `.describe()`
- `.shape` and `.size`
- `.axes`
- `.dtypes`

In the following chapters, we'll expand our EDA toolset with the following additions:  

- Tables
- Cross-Tables and Aggregates

In [7]:
import pandas as pd

## Tables

One of the simplest tools in the EDA toolkit is the frequency table, along with its two-way counterpart, the cross-tabulation (contingency) table. Both are familiar, convenient, and practical for a wide array of statistical tasks. The simplest form of a table displays the counts of a `categorical` column. Let's start by reading our dataset in; create a new cell and peek at the first few rows of the data.

In [13]:
household = pd.read_csv("data_input/household.csv")
household.shape

(72000, 10)

In [29]:
## Your code below


## -- Solution code

In `pandas`, each column of a `DataFrame` is a `Series`. To get the count of each unique level in a categorical column, we can use `.value_counts()`. The resulting object is a `Series` sorted in descending order, so the most frequent element is on top.

Try and perform `.value_counts()` on the `format` column, adding either:

- `sort=False` as a parameter to prevent any sorting of elements, or
- `ascending=True` as a parameter to sort in ascending order instead
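
The effect of these two parameters can be sketched on a small toy `Series` (hypothetical data, not the `household` dataset), so the exercise above is still left for you to try:

```python
import pandas as pd

# Toy data (hypothetical), not the household dataset
fruits = pd.Series(["apple", "banana", "apple", "cherry", "banana", "apple"])

# Default behaviour: counts sorted in descending order
default_counts = fruits.value_counts()

# ascending=True: the least frequent level comes first
ascending_counts = fruits.value_counts(ascending=True)

# sort=False: counts are not sorted by frequency
unsorted_counts = fruits.value_counts(sort=False)

print(default_counts)
print(ascending_counts)
print(unsorted_counts)
```

All three results contain the same counts; only the ordering of the rows differs.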

In [40]:
household.sub_category.value_counts()

Detergent    36000
Sugar        24000
Rice         12000
Name: sub_category, dtype: int64

In [41]:
## Your code below


## -- Solution code

`crosstab` is a very versatile solution to producing frequency tables on a `DataFrame` object. Its utility really goes further than that but we'll start with a simple use-case.

Consider the following code: we use `pd.crosstab()` passing in the values to group by in the rows (`index`) and columns (`columns`) respectively. 

In [69]:
pd.crosstab(index=household['sub_category'], columns="count")

| sub_category | count |
|---|---|
| Detergent | 36000 |
| Rice | 12000 |
| Sugar | 24000 |


Notice that in the code above, we set the rows (index) to be `sub_category`; by default, the function computes a frequency table.

In [42]:
pd.crosstab(index=household['sub_category'], columns="count", normalize='columns')

| sub_category | count |
|---|---|
| Detergent | 0.5 |
| Rice | 0.166667 |
| Sugar | 0.333333 |


In the cell above, we set `normalize='columns'` so that the values are normalized over each column: every value is divided by the sum of its column. This is equivalent to a manual calculation:

In [70]:
catego = pd.crosstab(index=household['sub_category'], columns="count")
catego / catego.sum()

| sub_category | count |
|---|---|
| Detergent | 0.5 |
| Rice | 0.166667 |
| Sugar | 0.333333 |
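
Besides `'columns'`, the `normalize` parameter of `pd.crosstab` also accepts `'index'` and `'all'`. The sketch below compares the three on a toy two-factor table; the `toy` frame and its `product` and `store` columns are made up for illustration:

```python
import pandas as pd

# Toy data (hypothetical), not the household dataset
toy = pd.DataFrame({
    "product": ["tea", "tea", "tea", "coffee"],
    "store": ["north", "south", "south", "north"],
})

# normalize='index': proportions within each row (each row sums to 1)
by_row = pd.crosstab(index=toy["product"], columns=toy["store"], normalize="index")

# normalize='columns': proportions within each column (each column sums to 1)
by_col = pd.crosstab(index=toy["product"], columns=toy["store"], normalize="columns")

# normalize='all': proportions of the grand total (whole table sums to 1)
overall = pd.crosstab(index=toy["product"], columns=toy["store"], normalize="all")

print(by_row)
print(by_col)
print(overall)
```

Which option to use depends on the question being asked: row proportions answer "where is each product sold?", while column proportions answer "what does each store sell?".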


We can also use the same `crosstab` method to compute a cross-tabulation of two factors. In the following cell, the `index` references the sub-category column while the `columns` references the format column:

In [28]:
pd.crosstab(index=household['sub_category'], columns=household['format'])

| sub_category | hypermarket | minimarket | supermarket |
|---|---|---|---|
| Detergent | 2611 | 24345 | 9044 |
| Rice | 999 | 7088 | 3913 |
| Sugar | 1761 | 15370 | 6869 |


When we add `margins=True` to our method call, then an extra row and column of margins (subtotals) will be included in the output:

In [81]:
pd.crosstab(index=household['sub_category'], 
            columns=household['format'], 
            margins=True)

| sub_category | hypermarket | minimarket | supermarket | All |
|---|---|---|---|---|
| Detergent | 2611 | 24345 | 9044 | 36000 |
| Rice | 999 | 7088 | 3913 | 12000 |
| Sugar | 1761 | 15370 | 6869 | 24000 |
| All | 5371 | 46803 | 19826 | 72000 |


In [73]:
household.head()

| | receipt_id | receipts_item_id | purchase_time | category | sub_category | format | unit_price | discount | quantity | yearmonth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9622257 | 32369294 | 7/22/2018 21:19 | Rice | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 |
| 1 | 9446359 | 31885876 | 7/15/2018 16:17 | Rice | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 |
| 2 | 9470290 | 31930241 | 7/15/2018 12:12 | Rice | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 |
| 3 | 9643416 | 32418582 | 7/24/2018 8:27 | Rice | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 |
| 4 | 9692093 | 32561236 | 7/26/2018 11:28 | Rice | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 |


### Aggregation Table

In the following cell, we introduce another parameter to perform aggregation on our table. The `aggfunc` parameter, when present, requires the `values` parameter to be specified as well. `values` holds the values to aggregate according to the factors in our index and columns:

In [88]:
pd.crosstab(index=household['sub_category'], 
            columns='mean', 
            values=household['unit_price'],
            aggfunc='mean')

| sub_category | mean |
|---|---|
| Detergent | 17893.793214 |
| Rice | 70013.146313 |
| Sugar | 12645.066024 |


#### Knowledge Check

Create a cross-tab using `sub_category` as the index (row) and `format` as the column. Fill the values with the median of `unit_price` across each row and column. Add a subtotal to both the row and column.

1. On average, Sugar is cheapest at...?
2. On average, Detergent is most expensive at...?

Create a new cell for your code and answer the questions above.

In [101]:
## Your code below


## -- Solution code

Reference answer:

```
pd.crosstab(index=household['sub_category'], 
            columns=household['format'], 
            values=household['unit_price'],
            aggfunc='median', margins=True)
```

In [92]:
household.yearmonth.unique()

array(['2018-07', '2018-08', '2018-09', '2018-01', '2018-02', '2018-03',
       '2018-04', '2018-05', '2018-06', '2017-10', '2017-11', '2017-12'],
      dtype=object)

### Higher-dimensional Tables

If we need to inspect our data in higher resolution, we can create cross-tabulation using more than two factors. This allows us to yield insights on a more granular level yet have our output remain relatively compact and structured:

In [97]:
pd.crosstab(index=household['yearmonth'], 
            columns=[household['format'], household['sub_category']], 
            values=household['unit_price'],
            aggfunc='median')

| yearmonth | hypermarket Detergent | hypermarket Rice | hypermarket Sugar | minimarket Detergent | minimarket Rice | minimarket Sugar | supermarket Detergent | supermarket Rice | supermarket Sugar |
|---|---|---|---|---|---|---|---|---|---|
| 2017-10 | 17400.0 | 64000.0 | 12500.0 | 16800.0 | 62500.0 | 12500.0 | 16925.0 | 64000.0 | 12500.0 |
| 2017-11 | 16770.0 | 64000.0 | 12400.0 | 16800.0 | 62500.0 | 12500.0 | 16500.0 | 64000.0 | 12400.0 |
| 2017-12 | 17500.0 | 64000.0 | 12000.0 | 16600.0 | 62500.0 | 12500.0 | 16600.0 | 64000.0 | 12400.0 |
| 2018-01 | 16800.0 | 64000.0 | 12275.0 | 16200.0 | 62500.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-02 | 17500.0 | 64000.0 | 11990.0 | 17000.0 | 63500.0 | 12500.0 | 16200.0 | 64000.0 | 12290.0 |
| 2018-03 | 16900.0 | 64000.0 | 12000.0 | 16300.0 | 63500.0 | 12500.0 | 15680.0 | 64000.0 | 12400.0 |
| 2018-04 | 16815.0 | 64000.0 | 11990.0 | 16800.0 | 63500.0 | 12500.0 | 15700.0 | 64000.0 | 12400.0 |
| 2018-05 | 16950.0 | 64000.0 | 12000.0 | 16800.0 | 63000.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-06 | 16550.0 | 64000.0 | 12300.0 | 17300.0 | 63500.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-07 | 16550.0 | 64000.0 | 12325.0 | 16800.0 | 63500.0 | 12500.0 | 16600.0 | 64000.0 | 12300.0 |


## Pivot Tables

If our data is already in a `DataFrame` format, using `pd.pivot_table` can sometimes be more convenient compared to a `pd.crosstab`. We create a `pivot_table` by passing in the following:
- `data`: our `DataFrame`
- `index`: the column to be used as rows
- `columns`: the column to be used as columns
- `values`: the values used to fill in the table
- `aggfunc`: the aggregation function

In [120]:
pd.pivot_table(
    data=household,
    index='yearmonth',
    columns=['format','sub_category'],
    values='unit_price',
    aggfunc='median'
)

| yearmonth | hypermarket Detergent | hypermarket Rice | hypermarket Sugar | minimarket Detergent | minimarket Rice | minimarket Sugar | supermarket Detergent | supermarket Rice | supermarket Sugar |
|---|---|---|---|---|---|---|---|---|---|
| 2017-10 | 17400.0 | 64000.0 | 12500.0 | 16800.0 | 62500.0 | 12500.0 | 16925.0 | 64000.0 | 12500.0 |
| 2017-11 | 16770.0 | 64000.0 | 12400.0 | 16800.0 | 62500.0 | 12500.0 | 16500.0 | 64000.0 | 12400.0 |
| 2017-12 | 17500.0 | 64000.0 | 12000.0 | 16600.0 | 62500.0 | 12500.0 | 16600.0 | 64000.0 | 12400.0 |
| 2018-01 | 16800.0 | 64000.0 | 12275.0 | 16200.0 | 62500.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-02 | 17500.0 | 64000.0 | 11990.0 | 17000.0 | 63500.0 | 12500.0 | 16200.0 | 64000.0 | 12290.0 |
| 2018-03 | 16900.0 | 64000.0 | 12000.0 | 16300.0 | 63500.0 | 12500.0 | 15680.0 | 64000.0 | 12400.0 |
| 2018-04 | 16815.0 | 64000.0 | 11990.0 | 16800.0 | 63500.0 | 12500.0 | 15700.0 | 64000.0 | 12400.0 |
| 2018-05 | 16950.0 | 64000.0 | 12000.0 | 16800.0 | 63000.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-06 | 16550.0 | 64000.0 | 12300.0 | 17300.0 | 63500.0 | 12500.0 | 16700.0 | 64000.0 | 12400.0 |
| 2018-07 | 16550.0 | 64000.0 | 12325.0 | 16800.0 | 63500.0 | 12500.0 | 16600.0 | 64000.0 | 12300.0 |


In [122]:
pd.pivot_table(
    data=household, 
    index='sub_category',
    columns='yearmonth',
    values='quantity'
)

| sub_category | 2017-10 | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 | 2018-05 | 2018-06 | 2018-07 | 2018-08 | 2018-09 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Detergent | 1.355667 | 1.292333 | 1.323333 | 1.413667 | 1.426667 | 1.539667 | 1.471667 | 1.384333 | 1.345667 | 1.274667 | 1.366667 | 1.359 |
| Rice | 1.422 | 1.298 | 1.277 | 1.323 | 1.353 | 1.413 | 1.363 | 1.359 | 1.358 | 1.256 | 1.269 | 1.304 |
| Sugar | 1.5855 | 1.5905 | 1.65 | 1.606 | 1.6955 | 1.849 | 1.791 | 1.92 | 1.9475 | 1.581 | 1.638 | 1.7015 |


A key difference between `crosstab` and `pivot_table` is that `crosstab` uses `len` as the default aggregation function (it counts occurrences) while `pivot_table` uses the mean. Copy the code from the cell above and make a change: use `sum` as the aggregation function instead:

In [102]:
## Your code below


## -- Solution code
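
The difference in defaults between `crosstab` and `pivot_table` can be seen on a tiny frame where the numbers are easy to check by hand. This is a minimal sketch; the `toy` frame and its values are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical), not the household dataset
toy = pd.DataFrame({
    "product": ["tea", "tea", "coffee", "coffee", "coffee"],
    "store": ["north", "north", "north", "south", "south"],
    "quantity": [1, 3, 2, 4, 6],
})

# crosstab without values/aggfunc counts the rows in each combination
counts = pd.crosstab(index=toy["product"], columns=toy["store"])

# pivot_table without aggfunc averages the values column
means = pd.pivot_table(data=toy, index="product", columns="store", values="quantity")

# An explicit aggfunc overrides the default
totals = pd.pivot_table(data=toy, index="product", columns="store",
                        values="quantity", aggfunc="sum")

print(counts)
print(means)
print(totals)
```

Note that a combination with no rows (here, tea sold in the south store) shows up as `0` in the count table but as `NaN` in the pivot tables, since there is nothing to aggregate.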

## Working with Datetime

Given the program's special emphasis on business-driven analytics, one data type of particular interest to us is the `datetime`. In the first part of this coursebook, we've seen an example of `datetime` in the section introducing data types (`employees.joined`).

A large portion of the data science work performed by business executives involves time series and/or dates (compare the kind of work done by computer vision researchers with that done by credit rating analysts, and this special relationship between business and datetime data becomes apparent), so building familiarity with this format will serve you well in the long run.

As a start, let's take a look at the data types of our `DataFrame` again:

In [124]:
household.dtypes

receipt_id            int64
receipts_item_id      int64
purchase_time        object
category             object
sub_category         object
format               object
unit_price          float64
discount              int64
quantity              int64
yearmonth            object
dtype: object

Notice that all columns are in the right data types, except for `purchase_time`. The correct data type for this column would have to be a `datetime`.

To convert a column `x` to a datetime, we would use:

    `x = pd.to_datetime(x)`
    

In [130]:
household['purchase_time'] = pd.to_datetime(household['purchase_time'])
household.head()

| | receipt_id | receipts_item_id | purchase_time | category | sub_category | format | unit_price | discount | quantity | yearmonth |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9622257 | 32369294 | 2018-07-22 21:19:00 | Rice | Rice | supermarket | 128000.0 | 0 | 1 | 2018-07 |
| 1 | 9446359 | 31885876 | 2018-07-15 16:17:00 | Rice | Rice | minimarket | 102750.0 | 0 | 1 | 2018-07 |
| 2 | 9470290 | 31930241 | 2018-07-15 12:12:00 | Rice | Rice | supermarket | 64000.0 | 0 | 3 | 2018-07 |
| 3 | 9643416 | 32418582 | 2018-07-24 08:27:00 | Rice | Rice | minimarket | 65000.0 | 0 | 1 | 2018-07 |
| 4 | 9692093 | 32561236 | 2018-07-26 11:28:00 | Rice | Rice | supermarket | 124500.0 | 0 | 1 | 2018-07 |
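
When the layout of the date strings is known in advance, `pd.to_datetime` also accepts an explicit `format` argument. The sketch below assumes the month/day/year layout seen in the raw `purchase_time` values and uses a hand-typed sample rather than the full dataset:

```python
import pandas as pd

# A few strings in the same layout as the raw purchase_time column
raw = pd.Series(["7/22/2018 21:19", "7/15/2018 16:17", "7/24/2018 8:27"])

# Spell out the layout explicitly: month/day/year hour:minute
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M")

print(parsed)
print(parsed.dtype)
```

Stating the format removes any ambiguity between month-first and day-first dates, which matters for values such as `7/8/2018`.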


In fact, `pandas` has a number of tools for working with `datetime` objects. These are convenient when we need to extract the month, year, or weekday name from a `datetime`. Some common applications in business analysis include:

- `household['purchase_time'].dt.month`
- `household['purchase_time'].dt.year`
- `household['purchase_time'].dt.day`
- `household['purchase_time'].dt.dayofweek`
- `household['purchase_time'].dt.hour`
- `household['purchase_time'].dt.day_name()` (older `pandas` releases also exposed this as `.dt.weekday_name`, which was removed in `pandas` 1.0)
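
A minimal sketch of these accessors on two hand-typed timestamps (not the full dataset). It uses `.dt.day_name()`, the method that recent `pandas` versions provide for weekday names:

```python
import pandas as pd

# Two sample timestamps taken from the first rows shown earlier
stamps = pd.Series(pd.to_datetime(["2018-07-22 21:19", "2018-07-24 08:27"]))

# Collect several date components side by side
parts = pd.DataFrame({
    "year": stamps.dt.year,
    "month": stamps.dt.month,
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
    "dayofweek": stamps.dt.dayofweek,   # Monday=0 ... Sunday=6
    "day_name": stamps.dt.day_name(),
})

print(parts)
```

Each accessor returns a `Series` aligned with the original column, so the results can be assigned straight back to the `DataFrame` as new columns.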

### Knowledge Check
_Est. Time required: 20 minutes_

1. In the following cell, start again by reading in the `household.csv` dataset 
2. Convert `purchase_time` to `datetime`. Use `pd.to_datetime()` for this.
3. Use `x.dt.day_name()`, assuming `x` is a datetime `Series`, to get the day of the week and store it in a new `weekday` column
4. Print the first 5 rows of your data to verify that your preprocessing steps are correct

In [155]:
## Your code below


## -- Solution code

Tips: In the cell above, start from:

`household = pd.read_csv("data_input/household.csv")`

Inspect the first 5 rows of your data and pay close attention to the `weekday` column. 

Bonus challenge: How many transactions happen on each day of the week? Use `pd.crosstab` or `x.value_counts()`.

In [162]:
## Your code below


## -- Solution code

There are also other methods that can be helpful in certain situations. Suppose we want to transform the existing `datetime` column into periods; we can use the `.to_period` method:

- `household['purchase_time'].dt.to_period('D')`
- `household['purchase_time'].dt.to_period('W')`
- `household['purchase_time'].dt.to_period('M')`
- `household['purchase_time'].dt.to_period('Q')`
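
As a small illustration of what these conversions return, the sketch below applies the monthly and quarterly aliases to two hand-typed timestamps (hypothetical values chosen to fall in different quarters):

```python
import pandas as pd

# Two sample timestamps in different months and quarters
stamps = pd.Series(pd.to_datetime(["2018-07-22 21:19", "2017-11-03 10:05"]))

# Truncate each timestamp to the month it falls in
monthly = stamps.dt.to_period("M")

# Truncate each timestamp to the quarter it falls in
quarterly = stamps.dt.to_period("Q")

print(monthly)
print(quarterly)
```

Period values are convenient as grouping keys, for example as the `index` of a `crosstab` or `pivot_table`.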

If you've managed the above exercises, well done! Run the following cell anyway to make sure we're at the same starting point as we go into the next chapter of working with categorical data (factors). 

In [242]:
# Reference answer for Knowledge Check
household = pd.read_csv("data_input/household.csv", index_col=1, parse_dates=['purchase_time'])
household.drop(['receipt_id', 'yearmonth', 'sub_category'], axis=1, inplace=True)
household['weekday'] = household['purchase_time'].dt.day_name()
pd.crosstab(index=household['weekday'], columns='count')

| weekday | count |
|---|---|
| Friday | 10778 |
| Monday | 9050 |
| Saturday | 11828 |
| Sunday | 12573 |
| Thursday | 9138 |
| Tuesday | 9427 |
| Wednesday | 9206 |


## Working with Categories

The official `pandas` documentation describes the `category` data type as a tool to "represent a categorical variable in classic R fashion".

When working with categories, it is recommended, from both a business point of view and a technical one, to use the `pandas` categorical data type. From a business perspective, this adds clarity to the analyst's mind about the type of data he/she is working with. This informs and guides the analysis on questions such as which statistical methods or plot types to use.

From a technical viewpoint, the memory savings -- and in turn, the gains in computation speed and computational resources -- can be quite significant. Specifically, the docs remark:

> The memory usage of a `Categorical` is proportional to the number of categories plus the length of the data. In contrast, an `object` dtype is a constant times the length of the data
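
The memory difference can be checked with `Series.memory_usage(deep=True)`. The sketch below uses a made-up column of three store formats repeated many times; the exact byte counts depend on the `pandas` version and platform, so only the comparison is meaningful:

```python
import pandas as pd

# Hypothetical column: 3 distinct labels repeated 10,000 times each
labels = pd.Series(["minimarket", "supermarket", "hypermarket"] * 10_000)

as_object = labels.astype("object")
as_category = labels.astype("category")

# deep=True also counts the memory held by the Python string objects
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct label once and keeping only a small integer code per row. It shrinks as the number of distinct categories approaches the number of rows.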

In [243]:
household.dtypes

purchase_time    datetime64[ns]
category                 object
format                   object
unit_price              float64
discount                  int64
quantity                  int64
weekday                  object
dtype: object

From the output of `dtypes`, we see that there are three variables currently stored as `object` type where a `category` is more appropriate. This is a common diagnostic step, and one that you will employ in almost every data analysis project. 

We'll convert the `weekday` column to a categorical type using `.astype()`. Just as `.astype('int64')` converts a Series to an integer type, `.astype('category')` converts a Series to a categorical.

By default, `.astype()` raises an error (an "exception") if the conversion is not successful. In an analysis-driven environment, this is what we usually prefer. However, in certain production settings, you may not want the exception to be raised and would rather have the original object returned.

In [234]:
household['weekday'] = household['weekday'].astype('category', errors='raise')
household.dtypes

purchase_time    datetime64[ns]
category                 object
format                   object
unit_price              float64
discount                  int64
quantity                  int64
weekday                category
dtype: object
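
To see the default `errors='raise'` behaviour in action, the sketch below attempts a conversion that cannot succeed and falls back to the original object inside a `try`/`except` block. The `mixed` Series is a made-up example, and the fallback is one possible way to handle the failure, not the only one:

```python
import pandas as pd

# Hypothetical column where one value cannot be read as an integer
mixed = pd.Series(["1", "2", "three"], dtype="object")

failed = False
try:
    converted = mixed.astype("int64")   # errors='raise' is the default
except ValueError:
    # The conversion failed: keep the original object instead
    converted = mixed
    failed = True

print(failed)
print(converted.dtype)
```

Catching the exception explicitly keeps the failure visible in the code, which is usually easier to debug than a conversion that silently does nothing.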

Go ahead and perform the other conversions in the following cell. When you're done, use `dtypes` to check that you have the categorical columns stored as `category`.

In [233]:
## Your code below


## -- Solution code

### Alternative Solutions (optional)

In [235]:
household.select_dtypes(exclude='object').head()

| receipts_item_id | purchase_time | unit_price | discount | quantity | weekday |
|---|---|---|---|---|---|
| 32369294 | 2018-07-22 21:19:00 | 128000.0 | 0 | 1 | Sunday |
| 31885876 | 2018-07-15 16:17:00 | 102750.0 | 0 | 1 | Sunday |
| 31930241 | 2018-07-15 12:12:00 | 64000.0 | 0 | 3 | Sunday |
| 32418582 | 2018-07-24 08:27:00 | 65000.0 | 0 | 1 | Tuesday |
| 32561236 | 2018-07-26 11:28:00 | 124500.0 | 0 | 1 | Thursday |


In [222]:
pd.concat([
    household.select_dtypes(exclude='object'),
    household.select_dtypes(include='object').apply(
        pd.Series.astype, dtype='category'
    )
], axis=1).dtypes

purchase_time    datetime64[ns]
unit_price              float64
discount                  int64
quantity                  int64
category               category
format                 category
weekday                category
dtype: object

In [244]:
objectcols = household.select_dtypes(include='object')
household[objectcols.columns] = objectcols.apply(lambda x: x.astype('category'))
household.head()

| receipts_item_id | purchase_time | category | format | unit_price | discount | quantity | weekday |
|---|---|---|---|---|---|---|---|
| 32369294 | 2018-07-22 21:19:00 | Rice | supermarket | 128000.0 | 0 | 1 | Sunday |
| 31885876 | 2018-07-15 16:17:00 | Rice | minimarket | 102750.0 | 0 | 1 | Sunday |
| 31930241 | 2018-07-15 12:12:00 | Rice | supermarket | 64000.0 | 0 | 3 | Sunday |
| 32418582 | 2018-07-24 08:27:00 | Rice | minimarket | 65000.0 | 0 | 1 | Tuesday |
| 32561236 | 2018-07-26 11:28:00 | Rice | supermarket | 124500.0 | 0 | 1 | Thursday |


In [245]:
household.dtypes

purchase_time    datetime64[ns]
category               category
format                 category
unit_price              float64
discount                  int64
quantity                  int64
weekday                category
dtype: object

## Missing Values and Duplicates

In [240]:
household.shape

(72000, 7)

In [241]:
household.drop_duplicates().shape

(71475, 7)
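
The two shapes above differ by 72000 - 71475 = 525 rows, which is the number of duplicated rows that `drop_duplicates()` removes. The sketch below shows the same check on a toy frame (hypothetical values), together with a per-column count of missing values using `isna()`:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one repeated row and one missing price
toy = pd.DataFrame({
    "item": ["rice", "rice", "sugar", "sugar"],
    "unit_price": [64000.0, 64000.0, 12500.0, np.nan],
})

# duplicated() flags rows that are identical to an earlier row
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each distinct row
deduplicated = toy.drop_duplicates()

# isna().sum() counts the missing values in each column
missing_per_column = toy.isna().sum()

print(n_duplicates)
print(deduplicated.shape)
print(missing_per_column)
```

Whether duplicated rows should be dropped depends on the data: in a transaction log, two identical rows may be a genuine repeat purchase rather than an error.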

## Renaming Index or Columns

# Learn 