**Inclass material for Week 2: Exploratory Data Analysis**

This notebook is based on the main material `2_Exploratory_Data_Analysis.ipynb`.

Version: Newton - February 2021

___

# Exploratory Data Analysis (EDA)

## Training Objectives

The coursebook focuses on:
- Why and What: Exploratory Data Analysis
- Date Time objects
- Categorical data types
- Cross Tabulation and Pivot Table
- Treating Duplicates and Missing Values 

## Introduction to EDA

About 60 years ago, John Tukey defined data analysis as the "procedures for analyzing data, techniques for interpreting the results of such procedures ... and all the machinery of mathematical statistics which apply to analyzing data". His championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs (which later inspired R).

He wrote a book titled _Exploratory Data Analysis_ arguing that too much emphasis in statistics was placed on hypothesis testing (confirmatory data analysis) while not enough was placed on the discovery of the unexpected. 

> Exploratory data analysis isolates patterns and features of the data and reveals these forcefully to the analyst.

This course aims to present a selection of EDA techniques -- some developed by John Tukey himself -- but with a special emphasis on its application to modern business analytics.

## Data Loading

Don't forget to **import** the required library before the analysis:

In [None]:
import pandas as pd

Load `household` data from the directory `data_input/household.csv`:

In [None]:
household = pd.read_csv('data_input/household.csv')

In [None]:
household.head()

Unnamed: 0,receipt_id,receipts_item_id,purchase_time,category,sub_category,format,unit_price,discount,quantity,yearmonth
0,9622257,32369294,7/22/2018 21:19,Rice,Rice,supermarket,128000.0,0,1,2018-07
1,9446359,31885876,7/15/2018 16:17,Rice,Rice,minimarket,102750.0,0,1,2018-07
2,9470290,31930241,7/15/2018 12:12,Rice,Rice,supermarket,64000.0,0,3,2018-07
3,9643416,32418582,7/24/2018 8:27,Rice,Rice,minimarket,65000.0,0,1,2018-07
4,9692093,32561236,7/26/2018 11:28,Rice,Rice,supermarket,124500.0,0,1,2018-07


The data covers transactions for Rice, Detergent (Fabric Care), and Sugar.

In [None]:
household['category'].unique()

array(['Rice', 'Fabric Care', 'Sugar/Flavored Syrup'], dtype=object)

💡 **Tips**: Using `.info()` method, we can check complete **information** of our DataFrame:

- Number of rows and columns (`.shape`)
- Column names and number of non-null values (`.columns`)
- Data types of each column (`.dtypes`)
- Memory usage

In [None]:
household.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72000 entries, 0 to 71999
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   receipt_id        72000 non-null  int64  
 1   receipts_item_id  72000 non-null  int64  
 2   purchase_time     72000 non-null  object 
 3   category          72000 non-null  object 
 4   sub_category      72000 non-null  object 
 5   format            72000 non-null  object 
 6   unit_price        72000 non-null  float64
 7   discount          72000 non-null  int64  
 8   quantity          72000 non-null  int64  
 9   yearmonth         72000 non-null  object 
dtypes: float64(1), int64(4), object(5)
memory usage: 5.5+ MB


Note: 72000 non-null for every column means the `household` data has no missing values.
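
Each piece of information reported by `.info()` can also be inspected on its own. A minimal sketch on a small made-up DataFrame (the name `toy` and its values are illustrative assumptions, not rows guaranteed to be in `household.csv`):

```python
import pandas as pd

# Hypothetical miniature table, for illustration only
toy = pd.DataFrame({
    'receipt_id': [9622257, 9446359, 9470290],
    'category': ['Rice', 'Rice', 'Sugar/Flavored Syrup'],
    'unit_price': [128000.0, None, 64000.0],   # one value deliberately missing
})

n_rows, n_cols = toy.shape               # (number of rows, number of columns)
column_names = list(toy.columns)         # column labels
missing_per_column = toy.isna().sum()    # number of missing values in each column

print(n_rows, n_cols)
print(column_names)
print(toy.dtypes)
print(missing_per_column)
```

Counting missing values with `.isna().sum()` gives the same conclusion as comparing the non-null counts in `.info()` against the number of rows.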

___

# Working with Datetime

Let's focus on column data types of `household`:

In [None]:
household.dtypes

receipt_id            int64
receipts_item_id      int64
purchase_time        object
category             object
sub_category         object
format               object
unit_price          float64
discount              int64
quantity              int64
yearmonth            object
dtype: object

Which column should be stored in `datetime64` data type?

> Answer: `purchase_time`

In [None]:
# should yearmonth be converted to datetime64?
household['yearmonth']
# answer: no, use category, because it has no day component (only year and month) and its values repeat
# datetime64 -> yyyy-mm-dd

0        2018-07
1        2018-07
2        2018-07
3        2018-07
4        2018-07
          ...   
71995    2017-12
71996    2017-12
71997    2017-12
71998    2017-12
71999    2017-12
Name: yearmonth, Length: 72000, dtype: object

## Convert to Datetime

There are three common ways for us to convert data type into `datetime64`:

### First: Method `.astype()`

Let's create a copy of `household` so that the original data remains unchanged.

In [None]:
df_1 = household.copy()
df_1.dtypes

receipt_id            int64
receipts_item_id      int64
purchase_time        object
category             object
sub_category         object
format               object
unit_price          float64
discount              int64
quantity              int64
yearmonth            object
dtype: object

Convert data type using `.astype()` method:

In [None]:
df_1['purchase_time'] = df_1['purchase_time'].astype('datetime64[ns]')

In [None]:
df_1.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

⚠️ **Warning**: Don't forget to assign the result to its original column.
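
A minimal sketch of why the assignment matters: `.astype()` returns a new Series and leaves the original untouched. The Series `raw` below is a made-up example, not part of the course data.

```python
import pandas as pd

# Illustrative text values stored with the object dtype
raw = pd.Series(['2018-07-22 21:19', '2018-07-15 16:17'], dtype=object)

# Without assignment, the converted result is simply discarded
raw.astype('datetime64[ns]')
dtype_without_assignment = raw.dtype     # still object

# With assignment, the variable holds the converted Series
converted = raw.astype('datetime64[ns]')
print(dtype_without_assignment, converted.dtype)
```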

### Second: Parameter `parse_dates`

Used during data loading, assuming we already know which columns are supposed to be `datetime64`.

In [None]:
df_2 = pd.read_csv('data_input/household.csv', parse_dates=['purchase_time'])
df_2.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

In [None]:
df_2['purchase_time'].head()
# datetime64 values are displayed as yyyy-mm-dd hh:mm:ss

0   2018-07-22 21:19:00
1   2018-07-15 16:17:00
2   2018-07-15 12:12:00
3   2018-07-24 08:27:00
4   2018-07-26 11:28:00
Name: purchase_time, dtype: datetime64[ns]
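
To try `parse_dates` without the course file, the sketch below reads a small in-memory CSV instead. The text in `csv_text` is a made-up stand-in that only mimics the layout of `purchase_time`.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file (illustrative rows)
csv_text = io.StringIO(
    "receipt_id,purchase_time,quantity\n"
    "9622257,7/22/2018 21:19,1\n"
    "9446359,7/15/2018 16:17,1\n"
)

# Columns listed in parse_dates are converted to datetime while loading
sample = pd.read_csv(csv_text, parse_dates=['purchase_time'])
print(sample.dtypes)
```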

### Third: Method `pd.to_datetime()`

In [None]:
df_3 = household.copy()
df_3.dtypes

receipt_id            int64
receipts_item_id      int64
purchase_time        object
category             object
sub_category         object
format               object
unit_price          float64
discount              int64
quantity              int64
yearmonth            object
dtype: object

Convert data type using `pd.to_datetime()` method:

In [None]:
df_3['purchase_time'] = pd.to_datetime(df_3['purchase_time'])

In [None]:
df_3.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

So, what is the difference between using `.astype()` and `pd.to_datetime()`?

Unlike `.astype()`, `pd.to_datetime()` lets us specify **parameters** for the datetime conversion, which gives us more **flexibility**.

Suppose we have a column that stores daily sales dates from the end of January to the beginning of February:

In [None]:
sales_date = pd.Series(['30-01-2021', '31-01-2021', '01-02-2021', '02-02-2021'])
sales_date

0    30-01-2021
1    31-01-2021
2    01-02-2021
3    02-02-2021
dtype: object

The example above shows how Indonesians usually write dates, using **dd-mm-yyyy** format. Let's see what will happen when we convert `sales_date` data type to `datetime64`:

In [None]:
sales_date.astype('datetime64') # yyyy-mm-dd

0   2021-01-30
1   2021-01-31
2   2021-01-02
3   2021-02-02
dtype: datetime64[ns]

Take a look at the third observation (index 2):

- Expectation: 1 February
- Reality: 2 January

⚠️ **Warning**: By default, `pandas` will infer date as **month first** format.

#### Parameter `dayfirst`

Solution 1: Use parameter `dayfirst=True` to specify that our `sales_date` starts with day, not month.

In [None]:
pd.to_datetime(sales_date, dayfirst=True)

0   2021-01-30
1   2021-01-31
2   2021-02-01
3   2021-02-02
dtype: datetime64[ns]

#### Parameter `format`

Solution 2: Use parameter `format` to explicitly specify the date formatting of our `sales_date`.

In [None]:
pd.to_datetime(sales_date, format='%d-%m-%Y')

0   2021-01-30
1   2021-01-31
2   2021-02-01
3   2021-02-02
dtype: datetime64[ns]

- `%d`: day of the month, two digits
- `%m`: month, two digits
- `%Y`: year, four digits

The codes passed to `format` are Python's `strftime` directives.

`strftime` means "string from time": the directives describe how a date or time is laid out as text, so we can format it in whichever way we need.

[Documentation of `strftime`](https://strftime.org/)
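
The same directives also work in the other direction. As a small sketch (the Series below is made up), `.dt.strftime()` turns `datetime64` values back into text in a layout of our choice:

```python
import pandas as pd

# Parse day-first text into datetime64 values
dates = pd.to_datetime(pd.Series(['30-01-2021', '01-02-2021']), format='%d-%m-%Y')

# Write the dates back out as text, day first, separated by slashes
as_text = dates.dt.strftime('%d/%m/%Y')
print(as_text.tolist())
```

The result is text again, so it no longer supports the `.dt` accessor.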

Suppose the dates are now written as `dd/mm-yyyy`; the format string simply follows the same pattern: `format='%d/%m-%Y'`

In [None]:
sales_date2 = pd.Series(['30/01-2021', '31/01-2021', '01/02-2021', '02/02-2021'])
pd.to_datetime(sales_date2, format='%d/%m-%Y')

0   2021-01-30
1   2021-01-31
2   2021-02-01
3   2021-02-02
dtype: datetime64[ns]

Note:

- A `datetime64` value is always displayed as yyyy-mm-dd hh:mm:ss
- `ns`: nanosecond

If the day and month are written without a leading zero, the same format string still works:

In [None]:
sales_date3 = pd.Series(['1-1-2021', '2-1-2021', '3-1-2021', '4-1-2021'])
pd.to_datetime(sales_date3, format='%d-%m-%Y')

0   2021-01-01
1   2021-01-02
2   2021-01-03
3   2021-01-04
dtype: datetime64[ns]

## Extract Datetime Component

The main reason we convert a column to the `datetime64` data type is that `pandas` has a number of tools for working with `datetime` objects. These are convenient when we need to extract a datetime component.

**Date component (numeric)**
- `.dt.year`
- `.dt.month`
- `.dt.day`
- `.dt.dayofweek`: index of day, Monday = 0 to Sunday = 6

**Date component (string)**
- `.dt.month_name()`
- `.dt.day_name()`

**Time component**
- `.dt.hour`
- `.dt.minute`
- `.dt.second`

[Documentation of datetime properties](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetimelike-properties)

Let's extract datetime component from `household`, but first let's convert `purchase_time` column to `datetime64`:

In [None]:
# your code here
household['purchase_time'] = household['purchase_time'].astype('datetime64[ns]')
household['purchase_time']

0       2018-07-22 21:19:00
1       2018-07-15 16:17:00
2       2018-07-15 12:12:00
3       2018-07-24 08:27:00
4       2018-07-26 11:28:00
                ...        
71995   2017-12-27 09:20:00
71996   2017-12-13 19:52:00
71997   2017-12-27 08:03:00
71998   2017-12-07 12:29:00
71999   2017-12-19 18:59:00
Name: purchase_time, Length: 72000, dtype: datetime64[ns]

In [None]:
household.dtypes

receipt_id                   int64
receipts_item_id             int64
purchase_time       datetime64[ns]
category                    object
sub_category                object
format                      object
unit_price                 float64
discount                     int64
quantity                     int64
yearmonth                   object
dtype: object

Check the range of `purchase_time`:

In [None]:
household['purchase_time'].describe()

count                   72000
unique                  62072
top       2017-10-22 12:00:00
freq                       12
first     2017-10-01 00:00:00
last      2018-09-30 23:57:00
Name: purchase_time, dtype: object

Note: `dtype: object` is not the data type of the `purchase_time` column; it is the data type of the result returned by `.describe()`.

### Datetime attribute

To extract a datetime component as a numeric value, we use an **attribute** (without parentheses):

In [None]:
household['purchase_time']

0       2018-07-22 21:19:00
1       2018-07-15 16:17:00
2       2018-07-15 12:12:00
3       2018-07-24 08:27:00
4       2018-07-26 11:28:00
                ...        
71995   2017-12-27 09:20:00
71996   2017-12-13 19:52:00
71997   2017-12-27 08:03:00
71998   2017-12-07 12:29:00
71999   2017-12-19 18:59:00
Name: purchase_time, Length: 72000, dtype: datetime64[ns]

In [None]:
household['purchase_time'].dt.month

0         7
1         7
2         7
3         7
4         7
         ..
71995    12
71996    12
71997    12
71998    12
71999    12
Name: purchase_time, Length: 72000, dtype: int64

In [None]:
household['purchase_time'].dt.hour

0        21
1        16
2        12
3         8
4        11
         ..
71995     9
71996    19
71997     8
71998    12
71999    18
Name: purchase_time, Length: 72000, dtype: int64

### Datetime method

To extract the name (text) of a datetime component, we use a **method** (with parentheses):

In [None]:
household['purchase_time'].dt.day_name()

0           Sunday
1           Sunday
2           Sunday
3          Tuesday
4         Thursday
           ...    
71995    Wednesday
71996    Wednesday
71997    Wednesday
71998     Thursday
71999      Tuesday
Name: purchase_time, Length: 72000, dtype: object

In [None]:
household['purchase_time'].dt.month_name()

0            July
1            July
2            July
3            July
4            July
           ...   
71995    December
71996    December
71997    December
71998    December
71999    December
Name: purchase_time, Length: 72000, dtype: object
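
`.dt.dayofweek` from the component list is not shown in the cells above. A minimal sketch on two timestamps taken from the first rows of `purchase_time` (the helper column `is_weekend` is an illustrative assumption, not part of the course material):

```python
import pandas as pd

# 2018-07-22 is a Sunday, 2018-07-24 is a Tuesday
purchase = pd.to_datetime(pd.Series(['2018-07-22 21:19', '2018-07-24 08:27']))

summary = pd.DataFrame({
    'day_name': purchase.dt.day_name(),          # text label
    'dayofweek': purchase.dt.dayofweek,          # Monday = 0 ... Sunday = 6
    'is_weekend': purchase.dt.dayofweek >= 5,    # Saturday (5) or Sunday (6)
})
print(summary)
```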

## Datetime Transformation

Suppose we want to transform an existing `datetime64` column into period values. We can use the `.to_period()` method:

- `.dt.to_period('D')`: Daily (yyyy-mm-dd)
- `.dt.to_period('W')`: Weekly (Monday to Sunday period)
- `.dt.to_period('M')`: Monthly (year-month)
- `.dt.to_period('Q')`: Quarterly (year-quarter)

[Documentation of offset aliases](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries-offset-aliases)

In [None]:
household['purchase_time'].dt.to_period('D') # Daily

0        2018-07-22
1        2018-07-15
2        2018-07-15
3        2018-07-24
4        2018-07-26
            ...    
71995    2017-12-27
71996    2017-12-13
71997    2017-12-27
71998    2017-12-07
71999    2017-12-19
Name: purchase_time, Length: 72000, dtype: period[D]

In [None]:
household['purchase_time'].dt.to_period('W') # Weekly: Monday/Sunday

0        2018-07-16/2018-07-22
1        2018-07-09/2018-07-15
2        2018-07-09/2018-07-15
3        2018-07-23/2018-07-29
4        2018-07-23/2018-07-29
                 ...          
71995    2017-12-25/2017-12-31
71996    2017-12-11/2017-12-17
71997    2017-12-25/2017-12-31
71998    2017-12-04/2017-12-10
71999    2017-12-18/2017-12-24
Name: purchase_time, Length: 72000, dtype: period[W-SUN]

In [None]:
household['purchase_time'].dt.to_period('W-SAT') # Weekly Saturday: Sunday/Saturday

0        2018-07-22/2018-07-28
1        2018-07-15/2018-07-21
2        2018-07-15/2018-07-21
3        2018-07-22/2018-07-28
4        2018-07-22/2018-07-28
                 ...          
71995    2017-12-24/2017-12-30
71996    2017-12-10/2017-12-16
71997    2017-12-24/2017-12-30
71998    2017-12-03/2017-12-09
71999    2017-12-17/2017-12-23
Name: purchase_time, Length: 72000, dtype: period[W-SAT]

In [None]:
household['purchase_time'].dt.to_period('M') # Monthly: year-month

0        2018-07
1        2018-07
2        2018-07
3        2018-07
4        2018-07
          ...   
71995    2017-12
71996    2017-12
71997    2017-12
71998    2017-12
71999    2017-12
Name: purchase_time, Length: 72000, dtype: period[M]

In [None]:
household['purchase_time'].dt.to_period('Q') # Quarterly: yearQ[1/2/3/4]

0        2018Q3
1        2018Q3
2        2018Q3
3        2018Q3
4        2018Q3
          ...  
71995    2017Q4
71996    2017Q4
71997    2017Q4
71998    2017Q4
71999    2017Q4
Name: purchase_time, Length: 72000, dtype: period[Q-DEC]

**In class Question 1**: How do we combine several date component columns into one?

For example, using the `companies.csv` data, we combine the `Month`, `Day`, and `Year` columns:

In [None]:
companies = pd.read_csv("data_input/companies.csv")
companies

Unnamed: 0,ID,Customer Name,Consulting Sales,Software Sales,Forecasted Growth,Returns,Month,Day,Year,Location,Account
0,30940,New Media Group,IDR7125000,IDR5500000,30.00%,"IDR1,500,000",1,10,2017,Jakarta,Enterprise
1,82391,Li and Partners,IDR420000,IDR820000,10.00%,"IDR400,000",6,15,2016,Jakarta,Startup
2,18374,PT. Kreasi Metrik Solusi,0,IDR550403,25.00%,0,3,29,2012,Surabaya,Enterprise
3,57531,PT. Algoritma Data Indonesia,IDR850000,IDR395500,4.00%,0,7,17,2017,Jakarta,Startup
4,19002,Palembang Konsultansi,IDR2115000,0,-15.00%,0,2,24,2018,Bandung,Startup
5,31142,PT. Surya Citra Manajemen,IDR960000,IDR503000,19.00%,0,1,19,2019,Jakarta,Enterprise


Steps:

1. Concatenate the columns with the `+` operator, making sure each one is of type **str** (string) so that they can be joined
2. Convert the result to the `datetime64` data type
3. Extract the datetime components as needed

In [None]:
companies['date'] = companies['Month'].astype(str) + '-' + companies['Day'].astype(str) + '-' + companies['Year'].astype(str)
companies['date'] = companies['date'].astype('datetime64[ns]')
companies['date'].dt.day_name()

0      Tuesday
1    Wednesday
2     Thursday
3       Monday
4     Saturday
5     Saturday
Name: date, dtype: object
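
As an alternative to string concatenation, `pd.to_datetime()` can also assemble dates from a DataFrame whose columns are named `year`, `month`, and `day`. A minimal sketch on a made-up subset of the `companies` columns (`companies_toy` and `parts` are illustrative names):

```python
import pandas as pd

# First three Month/Day/Year combinations shown in the table above
companies_toy = pd.DataFrame({
    'Month': [1, 6, 3],
    'Day': [10, 15, 29],
    'Year': [2017, 2016, 2012],
})

# Rename to lowercase year/month/day, the names pd.to_datetime() looks for
parts = companies_toy[['Year', 'Month', 'Day']].rename(columns=str.lower)
companies_toy['date'] = pd.to_datetime(parts)

print(companies_toy['date'].dt.day_name().tolist())
```

This avoids any ambiguity about which number is the day and which is the month, because each component is passed by name.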

**In class Question 2**: Is str = object?

The object data type usually holds strings (text) when the data comes from a text/CSV file. However, object is not always a string: we can even put a DataFrame inside a DataFrame, and the inner DataFrame is stored as object. This is rarely done in data analysis, though.

In [None]:
tabel1 = pd.DataFrame({'kolom1': [1, 2, 3],
                       'kolom2': ['a', 'b', 'c']})
tabel2 = tabel1.copy()

tabel = pd.DataFrame({'kolom1': [tabel1, tabel2]})
tabel

Unnamed: 0,kolom1
0,kolom1 kolom2 0 1 a 1 2 ...
1,kolom1 kolom2 0 1 a 1 2 ...


In [None]:
tabel.dtypes

kolom1    object
dtype: object

## Knowledge Check: Datetime

_Estimated time required: 15 minutes_

1. In the following cell, start again by reading in the `household.csv` dataset.
2. Make sure the `purchase_time` column has been converted to a datetime object.
3. Use `x.dt.day_name()`, assuming `x` is a datetime object, to get the day of the week. Assign this to a new column in your `household` DataFrame and name it `weekday`.
4. The `yearmonth` column stores the information of year and month of the `purchase_time`. Using `dt.to_period()`, how will you recreate the column if you needed the same information?
5. Print the first 5 rows of your data to verify that your preprocessing steps are correct

Alternatives for step 2:

- Use `household['purchase_time'].astype('datetime64[ns]')`
- Add the parameter `parse_dates=['purchase_time']` to `pd.read_csv()`

In [None]:
# your code here
household = pd.read_csv('data_input/household.csv')
household['purchase_time'] = pd.to_datetime(household['purchase_time'])
household['weekday'] = household['purchase_time'].dt.day_name()
household['yearmonth_recreate'] = household['purchase_time'].dt.to_period('M') # Monthly: year-month
household.head()

Unnamed: 0,receipt_id,receipts_item_id,purchase_time,category,sub_category,format,unit_price,discount,quantity,yearmonth,weekday,yearmonth_recreate
0,9622257,32369294,2018-07-22 21:19:00,Rice,Rice,supermarket,128000.0,0,1,2018-07,Sunday,2018-07
1,9446359,31885876,2018-07-15 16:17:00,Rice,Rice,minimarket,102750.0,0,1,2018-07,Sunday,2018-07
2,9470290,31930241,2018-07-15 12:12:00,Rice,Rice,supermarket,64000.0,0,3,2018-07,Sunday,2018-07
3,9643416,32418582,2018-07-24 08:27:00,Rice,Rice,minimarket,65000.0,0,1,2018-07,Tuesday,2018-07
4,9692093,32561236,2018-07-26 11:28:00,Rice,Rice,supermarket,124500.0,0,1,2018-07,Thursday,2018-07
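
One way to confirm that the recreated column carries the same information as `yearmonth` is to compare their text forms. A minimal sketch on a two-row made-up table (the name `toy` is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    'purchase_time': pd.to_datetime(['2018-07-22 21:19', '2017-12-27 09:20']),
    'yearmonth': ['2018-07', '2017-12'],
})
toy['yearmonth_recreate'] = toy['purchase_time'].dt.to_period('M')

# A monthly period prints as yyyy-mm, so its text form can be compared with the original column
matches = toy['yearmonth_recreate'].astype(str) == toy['yearmonth']
print(matches.all())
```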


💭 **Bonus challenge:**

Suppose that the estimated delivery time will take exactly 2 days after the product is purchased. How do we obtain the result by using `purchase_time` column?

In [None]:
household['purchase_time']

0       2018-07-22 21:19:00
1       2018-07-15 16:17:00
2       2018-07-15 12:12:00
3       2018-07-24 08:27:00
4       2018-07-26 11:28:00
                ...        
71995   2017-12-27 09:20:00
71996   2017-12-13 19:52:00
71997   2017-12-27 08:03:00
71998   2017-12-07 12:29:00
71999   2017-12-19 18:59:00
Name: purchase_time, Length: 72000, dtype: datetime64[ns]

In [None]:
# Pak Prasetyo, Pak Zainul, Bu Rahmah
from datetime import timedelta
household['purchase_time'] + timedelta(days=2)

0       2018-07-24 21:19:00
1       2018-07-17 16:17:00
2       2018-07-17 12:12:00
3       2018-07-26 08:27:00
4       2018-07-28 11:28:00
                ...        
71995   2017-12-29 09:20:00
71996   2017-12-15 19:52:00
71997   2017-12-29 08:03:00
71998   2017-12-09 12:29:00
71999   2017-12-21 18:59:00
Name: purchase_time, Length: 72000, dtype: datetime64[ns]

In [None]:
# Pak Privera
household['purchase_time'] + pd.Timedelta(2, 'D') # 2 Days

0       2018-07-24 21:19:00
1       2018-07-17 16:17:00
2       2018-07-17 12:12:00
3       2018-07-26 08:27:00
4       2018-07-28 11:28:00
                ...        
71995   2017-12-29 09:20:00
71996   2017-12-15 19:52:00
71997   2017-12-29 08:03:00
71998   2017-12-09 12:29:00
71999   2017-12-21 18:59:00
Name: purchase_time, Length: 72000, dtype: datetime64[ns]

In [None]:
# Pak Endi: this works, but the time component is lost
household['purchase_time'].dt.to_period('D') + 2

0        2018-07-24
1        2018-07-17
2        2018-07-17
3        2018-07-26
4        2018-07-28
            ...    
71995    2017-12-29
71996    2017-12-15
71997    2017-12-29
71998    2017-12-09
71999    2017-12-21
Name: purchase_time, Length: 72000, dtype: period[D]

## Timedelta Object

`pandas` has the `pd.Timedelta` object, which represents a duration: the difference between two dates or times.

[Documentation of Timedelta](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Timedelta.html#pandas-timedelta)

In [None]:
# example: difference between two datetimes
contoh = pd.DataFrame({
    'id': [1, 2, 3],
    'start': ['2021-01-01 01:00:00', '2021-01-02 01:00:00', '2021-01-03 01:00:00'],
    'end': ['2021/01/04 15:00:00', '2021/01/07 16:00:00', '2021/01/08 17:00:00']
})

contoh[['start', 'end']] = contoh[['start', 'end']].astype('datetime64[ns]')
contoh['end'] - contoh['start']

0   3 days 14:00:00
1   5 days 15:00:00
2   5 days 16:00:00
dtype: timedelta64[ns]

In [None]:
(contoh['end'] - contoh['start']).dt.days

0    3
1    5
2    5
dtype: int64
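
Note that `.dt.days` keeps only the whole days and drops the leftover hours. When the full duration matters, one option is `.dt.total_seconds()`, sketched below on the first two rows of the example above (the variable names are illustrative):

```python
import pandas as pd

start = pd.to_datetime(pd.Series(['2021-01-01 01:00:00', '2021-01-02 01:00:00']))
end = pd.to_datetime(pd.Series(['2021-01-04 15:00:00', '2021-01-07 16:00:00']))

duration = end - start                              # timedelta64 Series
whole_days = duration.dt.days                       # 3 days 14 hours -> 3
total_hours = duration.dt.total_seconds() / 3600    # 3 days 14 hours -> 86.0

print(whole_days.tolist())
print(total_hours.tolist())
```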

**END OF DAY 1**

___

**START OF DAY 2**

# Working with Category

## Characteristic: Unique Values

Characteristic of `category` data type: repeated values, which can be categorized into several groups. We can use:

- `.unique()` to see unique values of a Series
- `.nunique()` to see number of unique values of a Series or DataFrame

to identify which columns are better converted to `category`.

In [None]:
household.nunique() # number of unique value

receipt_id            69776
receipts_item_id      72000
purchase_time         62072
category                  3
sub_category              3
format                    3
unit_price             3884
discount               1329
quantity                 19
yearmonth                12
weekday                   7
yearmonth_recreate       12
dtype: int64

In [None]:
household['format'].unique()

array(['supermarket', 'minimarket', 'hypermarket'], dtype=object)
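
One rough way to shortlist candidate columns is the share of unique values per column. The sketch below uses a made-up DataFrame and an arbitrary cut-off of 0.5; both are assumptions for illustration, not a rule from the course material.

```python
import pandas as pd

# Hypothetical six-row sample
toy = pd.DataFrame({
    'receipt_id': [1, 2, 3, 4, 5, 6],
    'format': ['minimarket', 'supermarket', 'minimarket',
               'hypermarket', 'minimarket', 'supermarket'],
    'unit_price': [128000.0, 102750.0, 64000.0, 65000.0, 124500.0, 99000.0],
})

# Share of distinct values in each column: low share = many repeated values
unique_ratio = toy.nunique() / len(toy)

# Arbitrary threshold, chosen only for this example
candidates = unique_ratio[unique_ratio <= 0.5].index.tolist()
print(unique_ratio)
print(candidates)
```

A low ratio is only a hint: numeric columns such as `quantity` can also have few unique values without being categorical in meaning.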

## Convert to Category

Let's create a copy of `household` so that the original data remains unchanged.

In [None]:
household_cat = household.copy()
household_cat.dtypes

receipt_id                     int64
receipts_item_id               int64
purchase_time         datetime64[ns]
category                      object
sub_category                  object
format                        object
unit_price                   float64
discount                       int64
quantity                       int64
yearmonth                     object
weekday                       object
yearmonth_recreate         period[M]
dtype: object

Let's convert the columns to `category` data type:

- `category`
- `sub_category`
- `format`
- `yearmonth`
- `weekday`

In [None]:
cat_col = ['category', 'sub_category', 'format', 'yearmonth', 'weekday']
household_cat[cat_col] = household_cat[cat_col].astype('category')

In [None]:
household_cat.dtypes

receipt_id                     int64
receipts_item_id               int64
purchase_time         datetime64[ns]
category                    category
sub_category                category
format                      category
unit_price                   float64
discount                       int64
quantity                       int64
yearmonth                   category
weekday                     category
yearmonth_recreate         period[M]
dtype: object

## Advantages

There are two main advantages of converting to `category`:

### First: Memory Efficient

We can compare the two DataFrames **before and after** the columns are converted to the `category` data type:

- `household` (before): 6.6+ MB
- `household_cat` (after): 4.2 MB

In [None]:
household.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72000 entries, 0 to 71999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   receipt_id          72000 non-null  int64         
 1   receipts_item_id    72000 non-null  int64         
 2   purchase_time       72000 non-null  datetime64[ns]
 3   category            72000 non-null  object        
 4   sub_category        72000 non-null  object        
 5   format              72000 non-null  object        
 6   unit_price          72000 non-null  float64       
 7   discount            72000 non-null  int64         
 8   quantity            72000 non-null  int64         
 9   yearmonth           72000 non-null  object        
 10  weekday             72000 non-null  object        
 11  yearmonth_recreate  72000 non-null  period[M]     
dtypes: datetime64[ns](1), float64(1), int64(4), object(5), period[M](1)
memory usage: 6.6+ MB


In [None]:
household_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72000 entries, 0 to 71999
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   receipt_id          72000 non-null  int64         
 1   receipts_item_id    72000 non-null  int64         
 2   purchase_time       72000 non-null  datetime64[ns]
 3   category            72000 non-null  category      
 4   sub_category        72000 non-null  category      
 5   format              72000 non-null  category      
 6   unit_price          72000 non-null  float64       
 7   discount            72000 non-null  int64         
 8   quantity            72000 non-null  int64         
 9   yearmonth           72000 non-null  category      
 10  weekday             72000 non-null  category      
 11  yearmonth_recreate  72000 non-null  period[M]     
dtypes: category(5), datetime64[ns](1), float64(1), int64(4), period[M](1)
memory usage: 4.2 MB
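
The `+` in the `.info()` output means the figure is a lower bound, because the contents of `object` columns are not measured in full. For an exact comparison, `.memory_usage(deep=True)` can be used. A minimal sketch on made-up data (3,000 rows with only three distinct labels):

```python
import pandas as pd

# Illustrative column: many rows, few distinct values
as_object = pd.Series(['minimarket', 'supermarket', 'hypermarket'] * 1000, dtype=object)
as_category = as_object.astype('category')

# deep=True measures the actual string contents, not just the references to them
bytes_object = as_object.memory_usage(deep=True)
bytes_category = as_category.memory_usage(deep=True)

print(bytes_object, bytes_category)
```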


### Second: Categorical Accessor `.cat`

Just like the `datetime64` data type which has `.dt` accessor, the `category` data type has `.cat` accessor.

In [None]:
household_cat['format'].cat.categories

Index(['hypermarket', 'minimarket', 'supermarket'], dtype='object')

You can explore more functionality by referring to the [documentation of categorical accessor](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#categorical-accessor) for the complete list.
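
As a small sketch of what the accessor exposes, `.cat.codes` shows the integer code stored for each row, which is also why the `category` data type is memory efficient. The Series `fmt` is a made-up example.

```python
import pandas as pd

fmt = pd.Series(['supermarket', 'minimarket', 'supermarket', 'hypermarket']).astype('category')

# Distinct labels, sorted alphabetically by default
print(fmt.cat.categories.tolist())

# Position of each row's label within the categories
print(fmt.cat.codes.tolist())
```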

___

# Contingency Table

One of the simplest EDA tools is the contingency table, which displays the **counts** of a categorical column:

Contingency = Cross-tabulation = Frequency tables

In [None]:
household = household_cat.copy()
household.dtypes

receipt_id                     int64
receipts_item_id               int64
purchase_time         datetime64[ns]
category                    category
sub_category                category
format                      category
unit_price                   float64
discount                       int64
quantity                       int64
yearmonth                   category
weekday                     category
yearmonth_recreate         period[M]
dtype: object

## Method `.value_counts()`

Usage: Get the count of each unique level in one categorical column, sorted in descending order.

Parameter:

- `sort=False`: do not sort by counts; for a `category` column the result follows the category order (the **index**) instead
- `ascending=True`: **sort values** in ascending order

<b id='q1'>Business Question 1</b>

There are three categories of `format` (market type), how many total transactions occurred in each `format`?

In [None]:
household['format'].value_counts()
# by default: sorted in descending order

minimarket     46803
supermarket    19826
hypermarket     5371
Name: format, dtype: int64

In [None]:
46803+19826+5371

72000

- Input: Series
- Output: Series

Let's say we don't need sort by values:

In [None]:
household['format'].value_counts(sort=False)
# ordered by index (alphabetical category order)

hypermarket     5371
minimarket     46803
supermarket    19826
Name: format, dtype: int64

Sort by ascending (smallest to largest):

In [None]:
household['format'].value_counts(ascending=True)

hypermarket     5371
supermarket    19826
minimarket     46803
Name: format, dtype: int64

<b id='q2'>Business Question 2</b>

Which `weekday` has the **largest** and **smallest** transaction volume?

In [None]:
household['weekday'].value_counts()

Sunday       12573
Saturday     11828
Friday       10778
Tuesday       9427
Wednesday     9206
Thursday      9138
Monday        9050
Name: weekday, dtype: int64

In [None]:
# largest transaction volume
household['weekday'].value_counts().head(1)

Sunday    12573
Name: weekday, dtype: int64

In [None]:
# smallest transaction volume
household['weekday'].value_counts().tail(1)

Monday    9050
Name: weekday, dtype: int64
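
Instead of relying on the sort order with `.head(1)` and `.tail(1)`, the labels can also be looked up directly with `.idxmax()` and `.idxmin()`. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Illustrative data: Sunday x3, Saturday x2, Monday x1
weekday = pd.Series(['Sunday', 'Monday', 'Sunday', 'Saturday', 'Sunday', 'Saturday'])
counts = weekday.value_counts()

busiest = counts.idxmax()    # label with the highest count
quietest = counts.idxmin()   # label with the lowest count
print(busiest, quietest)
```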

## Cross-Tabulation

A more versatile way to produce a frequency table is `pd.crosstab()`. Syntax:

    pd.crosstab(
        index=...,
        columns=...
    )
                
**Required Parameter**:

- `index`: Values to group by in the index (rows)
- `columns`: Values to group by in the columns

Let's re-create [Business Question 1](#q1) using `pd.crosstab()`:

How many total transactions occurred in each `format`?

In [None]:
pd.crosstab(
    index=household['format'],
    columns='Total')

col_0,Total
format,Unnamed: 1_level_1
hypermarket,5371
minimarket,46803
supermarket,19826


Note: The result of `pd.crosstab` is a DataFrame.

Let's re-create [Business Question 2](#q2) using `pd.crosstab()`:

Which `weekday` has the **largest** and **smallest** transaction volume?

In [None]:
pd.crosstab(
    index=household['weekday'],
    columns='Total')

# by default: crosstab is ordered by index (alphabetical category order)

col_0,Total
weekday,Unnamed: 1_level_1
Friday,10778
Monday,9050
Saturday,11828
Sunday,12573
Thursday,9138
Tuesday,9427
Wednesday,9206


To read the result more easily, we can sort the DataFrame with the `.sort_values()` method. Let's sort the table above in **descending** order:

In [None]:
pd.crosstab(
    index=household['weekday'],
    columns='Total').sort_values(by='Total', ascending=False)

col_0,Total
weekday,Unnamed: 1_level_1
Sunday,12573
Saturday,11828
Friday,10778
Tuesday,9427
Wednesday,9206
Thursday,9138
Monday,9050


Optional: To order the rows from Monday to Sunday instead of alphabetically, reorder the categories using the `.cat` accessor:

In [None]:
household['weekday'] = household['weekday'].cat.reorder_categories(
    ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'],
    ordered=True
)

In [None]:
pd.crosstab(
    index=household['weekday'],
    columns='Total')

col_0,Total
weekday,Unnamed: 1_level_1
Monday,9050
Tuesday,9427
Wednesday,9206
Thursday,9138
Friday,10778
Saturday,11828
Sunday,12573


Two types of categorical data:

- Nominal: categories without an inherent order -> no reordering needed
- Ordinal: categories with an inherent order -> reordering needed, e.g. day names, month names, education levels
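
The difference can be sketched on a small made-up Series (the `education` values below are hypothetical and not part of `household`). Once the categories are declared as ordered, sorting and comparisons follow the declared order rather than the alphabet:

```python
import pandas as pd

# Hypothetical ordinal data: education levels have a natural order
education = pd.Series(['Bachelor', 'High School', 'Master', 'High School'])

# Declare the categories in their logical order and mark them as ordered
education = education.astype(
    pd.CategoricalDtype(
        categories=['High School', 'Bachelor', 'Master'],
        ordered=True
    )
)

# Sorting now follows the declared order instead of the alphabet
sorted_levels = education.sort_values().tolist()

# Ordered categories also support comparisons such as min/max
lowest, highest = education.min(), education.max()
```

For nominal data such as `format`, no such order exists, so the default alphabetical display is usually good enough.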

<b id='q3'>Business Question 3</b>

From [Business Question 1](#q1) we know that minimarket has the most transactions. Suppose we want to know which `category` of items has the highest total transactions in minimarket.

In [None]:
pd.crosstab(index=household['category'],
            columns=household['format'])

format,hypermarket,minimarket,supermarket
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fabric Care,2611,24345,9044
Rice,999,7088,3913
Sugar/Flavored Syrup,1761,15370,6869


In [None]:
pd.crosstab(index=household['category'],
            columns=household['format']).sort_values(by='minimarket', ascending=False)

format,hypermarket,minimarket,supermarket
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fabric Care,2611,24345,9044
Sugar/Flavored Syrup,1761,15370,6869
Rice,999,7088,3913


In [None]:
# select only the minimarket column
category_format = pd.crosstab(index=household['category'],
                              columns=household['format'])
category_format[['minimarket']]

format,minimarket
category,Unnamed: 1_level_1
Fabric Care,24345
Rice,7088
Sugar/Flavored Syrup,15370


In [None]:
# alternative: apply conditional subsetting first, then crosstab
mini = household[household['format'] == 'minimarket']

pd.crosstab(
    index=mini['category'],
    columns='Total Transaction in Minimarket'
)

col_0,Total Transaction in Minimarket
category,Unnamed: 1_level_1
Fabric Care,24345
Rice,7088
Sugar/Flavored Syrup,15370


### Additional Parameters of `pd.crosstab()`

#### Margins

Usage: Calculate subtotals

Parameters:

- `margins`: Add row and column margins for subtotals
- `margins_name`: Name of the row and column that will contain the totals when `margins=True`

In [None]:
pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True
)

format,hypermarket,minimarket,supermarket,All
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,2611,24345,9044,36000
Rice,999,7088,3913,12000
Sugar/Flavored Syrup,1761,15370,6869,24000
All,5371,46803,19826,72000


In [None]:
# change margins_name from All to Total Transaction
pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True,
    margins_name='Total Transaction'
)

format,hypermarket,minimarket,supermarket,Total Transaction
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,2611,24345,9044,36000
Rice,999,7088,3913,12000
Sugar/Flavored Syrup,1761,15370,6869,24000
Total Transaction,5371,46803,19826,72000


#### Normalize

Usage: Calculate percentages/proportions instead of the frequency

- `normalize='all'` or `normalize=True`: normalize over all values
- `normalize='index'`: normalize over each row
- `normalize='columns'`: normalize over each column

**Normalize by all**

In [None]:
# before normalize -> frequency/count
pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True
)

format,hypermarket,minimarket,supermarket,All
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,2611,24345,9044,36000
Rice,999,7088,3913,12000
Sugar/Flavored Syrup,1761,15370,6869,24000
All,5371,46803,19826,72000


In [None]:
# after normalize='all' -> proportion/percentage
pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True,
    normalize='all'
)

format,hypermarket,minimarket,supermarket,All
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,0.036264,0.338125,0.125611,0.5
Rice,0.013875,0.098444,0.054347,0.166667
Sugar/Flavored Syrup,0.024458,0.213472,0.095403,0.333333
All,0.074597,0.650042,0.275361,1.0


Out of **all** transactions, Fabric Care transactions that took place in hypermarket account for 0.036264 (3.6264%).

**Normalize by index**

In [None]:
pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True,
    normalize='index'
)

format,hypermarket,minimarket,supermarket
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Fabric Care,0.072528,0.67625,0.251222
Rice,0.08325,0.590667,0.326083
Sugar/Flavored Syrup,0.073375,0.640417,0.286208
All,0.074597,0.650042,0.275361


- Each row (`index`) sums to 100%
- Interpretation: 0.072528 (7.2528%) is the share of Fabric Care transactions that took place in hypermarket, compared with the other `format` values

**Normalize by column**

In [None]:
pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True,
    normalize='columns'
)

format,hypermarket,minimarket,supermarket,All
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,0.486129,0.520159,0.456169,0.5
Rice,0.185999,0.151443,0.197367,0.166667
Sugar/Flavored Syrup,0.327872,0.328398,0.346464,0.333333


- Each column (`columns`) sums to 100%
- Interpretation: Of the transactions that took place in hypermarket, 0.486129 (48.6129%) were Fabric Care

In [None]:
# convert to percentages by multiplying by 100
category_format_percent = pd.crosstab(
    index=household['category'],
    columns=household['format'],
    margins=True,
    normalize='columns'
)*100

category_format_percent

format,hypermarket,minimarket,supermarket,All
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,48.612921,52.015896,45.616867,50.0
Rice,18.599888,15.144328,19.736709,16.666667
Sugar/Flavored Syrup,32.78719,32.839775,34.646424,33.333333


In [None]:
# round to two decimal places
category_format_percent.round(2)

format,hypermarket,minimarket,supermarket,All
category,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Fabric Care,48.61,52.02,45.62,50.0
Rice,18.6,15.14,19.74,16.67
Sugar/Flavored Syrup,32.79,32.84,34.65,33.33


## Knowledge Check: Contingency Table

_Estimated time required: 15 minutes_

1. In which period (`yearmonth`) does the hypermarket achieve the highest total transactions?

In [None]:
# your code here

# Pak Lutfi
pd.crosstab(
    index=household['yearmonth'],
    columns=household['format']
).sort_values('hypermarket', ascending=False).head(1)

# Answer: 2018-03, with a total of 521 transactions

format,hypermarket,minimarket,supermarket
yearmonth,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2018-03,521,3540,1939


2. In the second quarter of 2018, what percentage of total transactions came from the supermarket, compared to the other `format` values? (Hint: extract the quarterly period from `purchase_time`)

In [None]:
# your code here

# Bu Rahmah
household['quarter']=household['purchase_time'].dt.to_period('Q')

pd.crosstab(
    index=household['quarter'],
    columns=household['format'],
    normalize='index'
)*100

# Answer: 28.427778%

format,hypermarket,minimarket,supermarket
quarter,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2017Q4,7.305556,65.272222,27.422222
2018Q1,8.238889,62.472222,29.288889
2018Q2,7.338889,64.233333,28.427778
2018Q3,6.955556,68.038889,25.005556


💭 **Bonus challenge:**

Produce a frequency table that breaks down the total transactions for each `yearmonth` period by product `category` and market (`format`).

In [None]:
# Pak Musthofa

pd.crosstab(
    index=household["yearmonth"],
    columns=[household["category"], household["format"]]
)

category,Fabric Care,Fabric Care,Fabric Care,Rice,Rice,Rice,Sugar/Flavored Syrup,Sugar/Flavored Syrup,Sugar/Flavored Syrup
format,hypermarket,minimarket,supermarket,hypermarket,minimarket,supermarket,hypermarket,minimarket,supermarket
yearmonth,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
2017-10,213,2009,778,81,600,319,152,1289,559
2017-11,218,2069,713,76,581,343,148,1262,590
2017-12,206,2111,683,68,583,349,153,1245,602
2018-01,229,2094,677,68,597,335,152,1269,579
2018-02,212,1989,799,117,549,334,184,1207,609
2018-03,249,1853,898,97,559,344,175,1128,697
2018-04,176,2031,793,79,578,343,121,1300,579
2018-05,236,1884,880,82,585,333,169,1235,596
2018-06,230,1996,774,81,622,297,147,1331,522
2018-07,169,2289,542,65,626,309,122,1378,500


In `pandas`, a higher-dimensional table like this is called a Multi-Index DataFrame. We will cover it in more depth in the Week 3: Data Wrangling and Visualization course, so stay tuned.

**END OF DAY 2**

___

**START OF DAY 3**

# Aggregation Table

## `pd.crosstab()`

Besides creating a frequency table, `crosstab` can also perform aggregation when we pass two more parameters:

- `values`: Values to be aggregated according to the `index` and `columns`
- `aggfunc`: Aggregation function to use. The most common ones are 'count', 'mean', 'median', 'sum', 'max', and 'min'


    pd.crosstab(
        index=...,
        columns=...,
        values=...,
        aggfunc=...
    )

<b id='q4'>Business Question 4</b>

Find out the most expensive product `category`, by its mean of `unit_price`.

In other words, let's calculate the mean of `unit_price` of each product `category`.

In [None]:
pd.crosstab(
    index=household['category'], 
    columns='mean_unit_price', 
    values=household['unit_price'],
    aggfunc='mean'
)

<b id='q5'>Business Question 5</b>

Just like [Business Question 4](#q4), but also break down the mean for each market (`format`).

Let's say we want to compare the mean and median side by side:
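
One possible way to do this, sketched on a small made-up DataFrame (`toy` is hypothetical and only reuses the column names of `household`; `price_table` is a helper introduced here for illustration): build one `crosstab` per aggregation function and place the results next to each other with `pd.concat`.

```python
import pandas as pd

# Hypothetical toy data reusing the column names of `household`
toy = pd.DataFrame({
    'category': ['Rice', 'Rice', 'Rice', 'Sugar', 'Sugar', 'Sugar'],
    'format': ['minimarket', 'minimarket', 'minimarket',
               'minimarket', 'minimarket', 'supermarket'],
    'unit_price': [60000, 65000, 130000, 12000, 14000, 12500],
})

def price_table(aggfunc):
    # One aggregation table of unit_price per category and format
    return pd.crosstab(
        index=toy['category'],
        columns=toy['format'],
        values=toy['unit_price'],
        aggfunc=aggfunc
    )

# Place the mean and median tables next to each other
side_by_side = pd.concat(
    [price_table('mean'), price_table('median')],
    axis=1,
    keys=['mean', 'median']
)
side_by_side
```

In this toy data, the single large Rice price pulls the mean well above the median, which is the kind of gap worth looking for in the real data.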

📝 Notes on mean and median:

- Both are measures of central tendency
- Mean = average values of data, Median = middle value (50%) of data
- Median is preferred when the data is skewed (for example, when it contains extremely large values)

<b id='q6'>Business Question 6</b>

Construct an aggregation table to show the sum of `quantity` across each product `category` and `format`. What percentage of **Rice** is sold in **hypermarket**?

Note: Don't forget that we can use the `margins` and `normalize` parameters too.
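
A minimal sketch of the idea on made-up numbers (`toy` is hypothetical, not the real `household` data): combining `values`, `aggfunc='sum'` and `normalize='index'` turns the summed quantities into a share per `category`.

```python
import pandas as pd

# Hypothetical toy data reusing the column names of `household`
toy = pd.DataFrame({
    'category': ['Rice', 'Rice', 'Sugar', 'Sugar'],
    'format': ['hypermarket', 'minimarket', 'hypermarket', 'minimarket'],
    'quantity': [1, 3, 2, 2],
})

# Sum of quantity per category and format, as a share of each row (category)
quantity_share = pd.crosstab(
    index=toy['category'],
    columns=toy['format'],
    values=toy['quantity'],
    aggfunc='sum',
    normalize='index'
)

# Share of Rice quantity sold in hypermarket, as a percentage
rice_hyper_pct = quantity_share.loc['Rice', 'hypermarket'] * 100
```

Here `normalize='index'` is chosen because the question fixes the category (Rice) and asks how its quantity is split across formats.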

## Knowledge Check: Aggregation Table

_Estimated time required: 20 minutes_

1. Create a cross-tab showing the median of `unit_price` across each `sub_category` and `format`. Add a subtotal to both the row and column. Answer the following questions:

    - Overall by median, Sugar is cheapest at ...
    - Overall by median, Detergent is most expensive at ...

In [None]:
# your code here


2. In which quarter (`quarterly`) does the hypermarket achieve its highest total sales (sum of `subtotal`)? **Hint**: Create a new column `subtotal` that holds the total price of each item in a transaction, calculated by multiplying `unit_price` by `quantity`.


<!--
# float formatting, round to 3 decimal places
pd.options.display.float_format = '{:.3f}'.format
-->

In [None]:
# your code here


💭 **Bonus challenge:**

Produce a cross-tabulation table where we calculate the sum of `quantity` using:

- `quarterly` and `weekday` as the `index`
- `format` and `category` as the `columns`

Note: this will return a Multi-Index DataFrame.

## Pivot Table

If our data is already a `DataFrame` object, `pd.pivot_table()` can be more convenient than `pd.crosstab()`, because we pass column names instead of the columns themselves.

Parameters:

- `data`: the `DataFrame` object (**not available in `pd.crosstab()`**)
- `index`: the column to be used as rows
- `columns`: the column to be used as columns
- `values`: the values used to fill in the table
- `aggfunc`: the aggregation function (**default: 'mean'**)

Syntax:

```
pd.pivot_table(
    data=...,
    index=...,
    columns=...,
    values=...,
    aggfunc=...
)
```

OR

```
data.pivot_table(
    index=...,
    columns=...,
    values=...,
    aggfunc=...
)
```

Let's compare `pivot_table` and `crosstab`. Consider the following aggregation table created using `crosstab`:

In [None]:
pd.crosstab(
    index=household['weekday'], 
    columns=[household['format'], household['sub_category']], 
    values=household['unit_price'],
    aggfunc='mean'
)

Let's re-create the table above with `pd.pivot_table()`.

Note: Default `aggfunc='mean'`

In [None]:
pd.pivot_table(
    data=household,
    index='weekday',
    columns=['format', 'sub_category'],
    values='unit_price'
)

If we want to summarize the table by `max` of `unit_price`, just specify the `aggfunc` parameter:

In [None]:
pd.pivot_table(
    data=household,
    index='weekday', 
    columns=['format', 'sub_category'], 
    values='unit_price',
    aggfunc='max'
)

Let's recreate a frequency table below which shows the number of transactions, using `pd.pivot_table()`:

| category                 |   hypermarket |   minimarket |   supermarket |   Total |
|:-------------------------|--------------:|-------------:|--------------:|--------:|
| **Fabric Care**          |          2611 |        24345 |          9044 |   36000 |
| **Rice**                 |           999 |         7088 |          3913 |   12000 |
| **Sugar/Flavored Syrup** |          1761 |        15370 |          6869 |   24000 |
| **Total**                |          5371 |        46803 |         19826 |   72000 |

- index: ...
- columns: ...
- aggfunc: ...
- values: ...
- "Total" in row and column using ...

**Note:** The `margins` and `margins_name` parameters are available in both `crosstab` and `pivot_table`, but the `normalize` parameter is only available in `crosstab`.
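
One possible approach, sketched on a small made-up DataFrame (`toy` is hypothetical; counting `receipt_id` is an assumption that works for any column without missing values): use `aggfunc='count'` together with `margins` and `margins_name`, and compare the result against `crosstab`.

```python
import pandas as pd

# Hypothetical toy data reusing the column names of `household`
toy = pd.DataFrame({
    'receipt_id': [1, 2, 3, 4, 5, 6],
    'category': ['Rice', 'Rice', 'Rice', 'Sugar', 'Sugar', 'Sugar'],
    'format': ['hypermarket', 'minimarket', 'minimarket',
               'hypermarket', 'minimarket', 'minimarket'],
})

# Counting any column without missing values gives the number of rows
freq = pd.pivot_table(
    data=toy,
    index='category',
    columns='format',
    values='receipt_id',
    aggfunc='count',
    margins=True,
    margins_name='Total'
)

# The same table built with crosstab, for comparison
freq_crosstab = pd.crosstab(
    index=toy['category'],
    columns=toy['format'],
    margins=True,
    margins_name='Total'
)
```

Both tables hold the same counts; `crosstab` simply needs fewer arguments for a plain frequency table.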

___

# Summary: Tables in `pandas` 

## Frequency Tables

Usage: count the number of rows for each value of a categorical column

Methods:

- For one column, use `.value_counts()` (Series)
- For one or more columns:
    - `pd.crosstab(index, columns)`
    - `pd.pivot_table(data, index, columns, values, aggfunc='count')`, but this is rarely used

## Aggregation Tables

Usage: aggregate a numerical column, broken down by one or more categorical columns

Methods:

- `pd.crosstab(index, columns, values, aggfunc)`
- `pd.pivot_table(data, index, columns, values, aggfunc)`

## `crosstab` vs `pivot_table`

The main differences between `crosstab` and `pivot_table` can be summarized as follows:

|                                                                                    | `pd.crosstab()` | `pd.pivot_table()` |
|------------------------------------------------------------------------------------|-----------------|--------------------|
|                                                                          **Input** | Array of values |          DataFrame |
|                                                              **Default `aggfunc`** |       `'count'` |           `'mean'` |
|                                                          **Parameter `normalize`** |       Available |      Not Available |
| [**Computation Time**](https://ramiro.org/notebook/pandas-crosstab-groupby-pivot/) | Relatively Slower |  Relatively Faster |

In [None]:
# crosstab expects a list/array of values, so the input doesn't have to be a DataFrame
import numpy as np

pd.crosstab(
    index=np.array(['Sugar', 'Sugar', 'Rice', 'Rice']),
    columns=np.array(['hypermarket', 'hypermarket', 'minimarket', 'hypermarket'])
)

# Missing Values and Duplicates

During the data exploration and preparation phase, we are likely to come across some problematic details in our data, such as:

- Value of _-1_ for the _age_ column
- Value of _blank_ for the _customer segment_ column
- Value of _None_ for the _loan duration_ column
- etc

All of these are examples of "untidy" data, which is rather common depending on the data collection and recording process in a company.

## Missing Values

Let's import `household_untidy.csv`, which is a manipulated version of `household` DataFrame. 

**Note**: If you're curious about how the missing values were injected, check out Section 3 of our main materials.

In [None]:
household_untidy = pd.read_csv("data_input/household_untidy.csv", index_col='receipts_item_id')
# pd.to_datetime also works on pandas 2.x, where .astype('datetime64') without a unit raises an error
household_untidy['purchase_time'] = pd.to_datetime(household_untidy['purchase_time'])
household_untidy.head()

Missing Values:

- `NaN`: Not a Number, for object and category
- `nan`: not a number, for numeric
- `NaT`: Not a Time, for datetime64
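
A quick sketch (values are made up) showing that pandas treats all of these markers as missing, and that a gap in a datetime column is stored as `NaT`:

```python
import numpy as np
import pandas as pd

# All three markers are recognised as missing by pandas
markers = [np.nan, pd.NaT, None]
is_missing = [pd.isna(value) for value in markers]

# A missing value in a datetime column is stored as NaT
times = pd.to_datetime(pd.Series(['2018-07-22 21:19', None]))
```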

### Check Missing Values

Common methods to detect missing values:

- `.isna()`: returns `True` if the values are missing.
- `.notna()`: returns `True` if the values are **not** missing.

In [None]:
household_untidy.isna()

Count the **number** of missing values across each column:

- `True` will be treated as 1
- `False` will be treated as 0

In [None]:
household_untidy.isna().sum()

Usually, we count the **percentages** of missing values across each column:
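
A minimal sketch on made-up data (`toy` is hypothetical): because `True` counts as 1 and `False` as 0, the mean of `.isna()` is the proportion of missing values, and multiplying by 100 gives a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing values
toy = pd.DataFrame({
    'category': ['Rice', None, 'Sugar', 'Rice'],
    'unit_price': [64000.0, np.nan, np.nan, 65000.0],
    'quantity': [1, 2, 3, 1],
})

# The mean of a boolean column is the proportion of True values
missing_pct = toy.isna().mean() * 100
missing_pct
```

The same expression should work on `household_untidy`, replacing `toy` with that DataFrame.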

### Treatment for Missing Value

The three most common ways to deal with missing values:

1. Delete column
2. Delete row
3. Imputation

#### Delete Column

Use the `.drop(columns=...)` method:

In [None]:
household_untidy = household_untidy.drop(columns='Unnamed: 8')
household_untidy.head()

⚠️ **Warning**: Before removing a column, consider the business case: will discarding the column eliminate or reduce the information in our data?

#### Delete Row

When we are certain that the rows with missing value can be safely dropped, we can use `.dropna()` method.

- `.dropna(how='any')`: drops a row if **at least one column** contains a missing value

- `.dropna(how='all')`: drops a row only if **all of its values** are missing

- `.dropna(thresh=...)`: drops a row if its number of non-missing values is less than `thresh`

In [None]:
household_untidy.dropna(how='any')#.shape[0]

In [None]:
household_untidy.dropna(how='all')#.shape[0]

In [None]:
# example: thresh=1 keeps a row only if it has at least 1 non-NA value
household_untidy.dropna(thresh=1)
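
To see how the three options differ, here is a small sketch on made-up data (`toy` is hypothetical), where each row has a different number of missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each row has a different number of missing values
toy = pd.DataFrame({
    'a': [1.0, np.nan, np.nan, np.nan],
    'b': [2.0, 2.0, np.nan, np.nan],
    'c': [3.0, 3.0, np.nan, 3.0],
})

rows_any = toy.dropna(how='any').shape[0]      # only fully complete rows remain
rows_all = toy.dropna(how='all').shape[0]      # only the fully empty row is dropped
rows_thresh = toy.dropna(thresh=2).shape[0]    # rows with at least 2 non-missing values
```

Out of the four rows, `how='any'` keeps 1, `how='all'` keeps 3, and `thresh=2` keeps 2.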

#### Imputation

Fill missing values with a chosen value, using the `.fillna()` method.

First:

- Impute `category`, `format`, and `discount` with the value 'Missing'
- Impute `unit_price` with value 0
- Impute `quantity` with value 0

In [None]:
household_untidy[['category', 'format', 'discount']] = household_untidy[['category', 'format', 'discount']].fillna('Missing')
household_untidy['unit_price'] = household_untidy['unit_price'].fillna(0)
household_untidy['quantity'] = household_untidy['quantity'].replace(np.nan, 0) #.fillna(0)
household_untidy.head()

Note: these two calls are functionally equivalent:

- `.fillna(0)`
- `.replace(np.nan, 0)`
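
A quick check of this equivalence on a made-up Series (the values are hypothetical):

```python
import numpy as np
import pandas as pd

quantity = pd.Series([1.0, np.nan, 3.0])  # hypothetical values

filled = quantity.fillna(0)
replaced = quantity.replace(np.nan, 0)

# Both approaches produce the same Series
same_result = filled.equals(replaced)
```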

Second: impute `purchase_time` with the next value. Use the `.bfill()` method to **backward fill** the missing values (older pandas versions also accept `.fillna(method='bfill')`, which is deprecated in recent releases).

In [None]:
household_untidy['purchase_time'] = household_untidy['purchase_time'].bfill()
household_untidy.head(7)
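
How backward fill behaves can be sketched on a short made-up Series (the timestamps are hypothetical). The `.bfill()` method used here is the method form of backward filling. Note that a gap at the very end stays missing, because there is no later value to copy from:

```python
import pandas as pd

# Hypothetical purchase times with two gaps (NaT)
times = pd.Series(pd.to_datetime(
    ['2018-07-22 21:19', None, '2018-07-15 12:12', None]
))

# Backward fill: each gap takes the next available value
filled = times.bfill()
```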

💡 **Tips on Imputation**:

For a categorical column, do one of the following:
- Make NA as one category
- Fill using a central value (mode as the most frequent category)
- Fill using predictive/machine learning model (Classification problem)

For a numerical column, do one of the following:
- Fill using a central value (mean or median)
- Fill using predictive/machine learning model (Regression problem)

Case: In a dataframe where `salary` is missing but the bank has data about the customer's occupation / profession, years of experience, years of education, seniority level, age, and industry, then a machine learning model can offer a viable alternative to the mean imputation approach.
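
For the simpler central-value options in the tips above, a minimal sketch on made-up data (`toy` is hypothetical and only reuses two column names from `household`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in both column types
toy = pd.DataFrame({
    'format': ['minimarket', None, 'minimarket', 'supermarket'],
    'unit_price': [10000.0, 12000.0, np.nan, 50000.0],
})

# Categorical column: fill with the mode (most frequent category)
most_frequent = toy['format'].mode()[0]
toy['format'] = toy['format'].fillna(most_frequent)

# Numerical column: fill with the median, which is robust to extreme values
median_price = toy['unit_price'].median()
toy['unit_price'] = toy['unit_price'].fillna(median_price)
```

The median is used here instead of the mean because the one large price would otherwise pull the imputed value upward.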

## Duplicated Data

### Check Duplicates

Use the `.duplicated()` method and combine it with a subsetting operation.

Parameter `keep`: 

- `keep='first'` (default): Mark duplicates as True except for the **first occurrence**.
- `keep='last'`: Mark duplicates as True except for the **last occurrence**.
- `keep=False`: Mark all duplicates as True.

In [None]:
cond = household_untidy.duplicated(keep=False)
household_untidy[cond]
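
The effect of each `keep` option can be sketched on a tiny made-up DataFrame (`toy` is hypothetical), where the first two rows are exact duplicates:

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact duplicates
toy = pd.DataFrame({
    'receipt_id': [1, 1, 2],
    'category': ['Rice', 'Rice', 'Sugar'],
})

first = toy.duplicated(keep='first').tolist()   # [False, True, False]
last = toy.duplicated(keep='last').tolist()     # [True, False, False]
every = toy.duplicated(keep=False).tolist()     # [True, True, False]
```

Using `keep=False` is handy for inspection, since it shows every copy of a duplicated row side by side.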

### Treatment for Duplicated Data

Drop the duplicated row by using `.drop_duplicates()` method.

Parameter `keep`: 

- `keep='first'` (default): Drop duplicates except for the **first occurrence**.
- `keep='last'`: Drop duplicates except for the **last occurrence**.
- `keep=False`: Drop all duplicates.

In [None]:
household_untidy.drop_duplicates(keep='last')

Suppose we want to check the uniqueness of `receipt_id`. The code below shows the rows whose `receipt_id` repeats an earlier row:

In [None]:
household[household['receipt_id'].duplicated()]
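
To keep one row per `receipt_id` instead, one option is the `subset` parameter of `.drop_duplicates()`. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data: one receipt can contain several items
toy = pd.DataFrame({
    'receipt_id': [101, 101, 102, 103, 103],
    'category': ['Rice', 'Sugar', 'Rice', 'Rice', 'Fabric Care'],
})

# Rows whose receipt_id already appeared earlier
repeated = toy[toy['receipt_id'].duplicated()]

# One row per receipt_id: keep only the first occurrence
unique_receipts = toy.drop_duplicates(subset='receipt_id', keep='first')
```

In this toy data the repeated rows are different items on the same receipt, so dropping them would lose information.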

⚠️ **Warning**: Duplicates may mean different things from a data point of view and from a business analyst's point of view. Be extra careful about whether the duplicates are an intended characteristic of your data or a violation of the business logic.

## Knowledge Check: Missing Values and Duplicates

_Estimated time required: 10 minutes_

Should we drop the duplicates in each of the following cases, and why?

1. A medical center collects anonymized heart rate monitoring data from patients. It has duplicate observations collected across a span of 3 months
2. An insurance company uses machine learning to deliver dynamic pricing to its customers. Each row contains the customer's name, occupation / profession and historical health data. It has duplicate observations collected across a span of 3 months
3. On our original `household` data, check for duplicate observations. Would you drop the duplicated rows?

In [None]:
# illustration data for question 1
heart = pd.DataFrame({
    'rate': [70, 80, 90, 75, 95, 70, 85, 90]
})

heart

In [None]:
# illustration data for question 2
insurance = pd.DataFrame({
    'cust_id': ['C1', 'C2', 'C3', 'C4', 'C1'],
    'occupation': ['Employee', 'Employee', 'Student', 'Student', 'Employee'],
    'health': ['Good', 'Ok', 'Good', 'Ok', 'Ok']
})

insurance['cust_id'].duplicated()

In [None]:
# reference data for question 3
household = pd.read_csv('data_input/household.csv', index_col='receipts_item_id')
household[household.duplicated(keep=False)]

**Answer**:

1. Should (drop/not drop) the duplicates, because ...

2. Should (drop/not drop) the duplicates, because ...

3. Should (drop/not drop) the duplicates, because ...