# [Learning Objectives]
- The code below shows how to use pandas.cut and pandas.qcut in Python to compute discretized (binned) labels for numeric data

# [Key Points]
- Equal-width binning with pandas.cut (In [11] and In [12])
- Equal-frequency binning with pandas.qcut (In [13] and In [14])

In [9]:
# Load packages
import os
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [10]:
# Set up the initial ages data
ages = pd.DataFrame({"age": [18,22,25,27,7,21,23,37,30,61,45,41,9,18,80,100]})
ages

    age
0    18
1    22
2    25
3    27
4     7
5    21
6    23
7    37
8    30
9    61
10   45
11   41
12    9
13   18
14   80
15  100


#### Equal-width binning

In [11]:
# Add the column "equal_width_age": split age into 4 equal-width bins
ages["equal_width_age"] = pd.cut(ages["age"], 4)
ages["equal_width_age"]

0     (6.907, 30.25]
1     (6.907, 30.25]
2     (6.907, 30.25]
3     (6.907, 30.25]
4     (6.907, 30.25]
5     (6.907, 30.25]
6     (6.907, 30.25]
7      (30.25, 53.5]
8     (6.907, 30.25]
9      (53.5, 76.75]
10     (30.25, 53.5]
11     (30.25, 53.5]
12    (6.907, 30.25]
13    (6.907, 30.25]
14    (76.75, 100.0]
15    (76.75, 100.0]
Name: equal_width_age, dtype: category
Categories (4, interval[float64]): [(6.907, 30.25] < (30.25, 53.5] < (53.5, 76.75] < (76.75, 100.0]]

In [12]:
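# Count how many rows fall into each equal-width bin.
# Every bin spans the same range of values; pandas only lowers the first edge
# by 0.1% of the data range (7 becomes 6.907) so that the minimum is included.
ages["equal_width_age"].value_counts()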

(6.907, 30.25]    10
(30.25, 53.5]      3
(76.75, 100.0]     2
(53.5, 76.75]      1
Name: equal_width_age, dtype: int64
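
To check the equal-width claim numerically, the bin edges can be pulled out with the `retbins=True` argument of `pd.cut`. The following is a minimal sketch that rebuilds the same `ages` frame; the variable names `binned`, `edges`, `widths` and `counts` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"age": [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100]})

# retbins=True returns the array of bin edges alongside the binned labels
binned, edges = pd.cut(ages["age"], 4, retbins=True)

# Width of each bin: the last three are exactly (100 - 7) / 4 = 23.25;
# the first is slightly wider because its left edge is lowered to include 7
widths = np.diff(edges)
print(edges)
print(widths)

# sort_index() lists the bins in interval order instead of by count
counts = binned.value_counts().sort_index()
print(counts)
```

Listing the counts in interval order makes the skew easier to see: most ages pile up in the lowest bin, while the upper bins are nearly empty.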

#### Equal-frequency binning

In [13]:
# Add the column "equal_freq_age": split age into 4 equal-frequency bins (quartiles)
ages["equal_freq_age"] = pd.qcut(ages["age"], 4)
ages["equal_freq_age"] 

0     (6.999, 20.25]
1      (20.25, 26.0]
2      (20.25, 26.0]
3       (26.0, 42.0]
4     (6.999, 20.25]
5      (20.25, 26.0]
6      (20.25, 26.0]
7       (26.0, 42.0]
8       (26.0, 42.0]
9      (42.0, 100.0]
10     (42.0, 100.0]
11      (26.0, 42.0]
12    (6.999, 20.25]
13    (6.999, 20.25]
14     (42.0, 100.0]
15     (42.0, 100.0]
Name: equal_freq_age, dtype: category
Categories (4, interval[float64]): [(6.999, 20.25] < (20.25, 26.0] < (26.0, 42.0] < (42.0, 100.0]]

In [14]:
# Count how many rows fall into each equal-frequency bin
ages["equal_freq_age"].value_counts() # every bin holds the same number of rows (4 each here)

(42.0, 100.0]     4
(26.0, 42.0]      4
(20.25, 26.0]     4
(6.999, 20.25]    4
Name: equal_freq_age, dtype: int64
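
The edges chosen by `pd.qcut` are the sample quantiles of the column, which is why the bins hold equal counts but have very different widths. The sketch below compares the edges against `Series.quantile` under that assumption; the names `binned`, `edges` and `quartiles` are illustrative only.

```python
import numpy as np
import pandas as pd

ages = pd.DataFrame({"age": [18, 22, 25, 27, 7, 21, 23, 37, 30, 61, 45, 41, 9, 18, 80, 100]})

# retbins=True returns the quantile-based bin edges alongside the labels
binned, edges = pd.qcut(ages["age"], 4, retbins=True)

# The same edges computed directly as the 0%, 25%, 50%, 75% and 100% quantiles
quartiles = ages["age"].quantile([0, 0.25, 0.5, 0.75, 1]).to_numpy()

print(edges)
print(quartiles)
print(np.diff(edges))  # bin widths differ even though the counts are equal
```

Equal counts are not guaranteed in general: when many rows share a value that sits on a quantile boundary, the bins can end up uneven.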

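### Homework
Add a new column `customized_age_grp` that splits `age` into five groups: (0, 10], (10, 20], (20, 30], (30, 50], (50, 100]. Here '(' means the endpoint is excluded and ']' means it is included.

Hint: run `pd.cut??` (or `??pd.cut`) to read how the `bins` parameter is used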
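
As a pointer for the hint, here is a small sketch of passing explicit edges to `bins`. It uses a hypothetical `scores` series rather than the homework data, and the edge values are made up for illustration.

```python
import pandas as pd

# Hypothetical data, unrelated to the ages frame above
scores = pd.Series([35, 60, 61, 79, 80, 100])
edges = [0, 60, 80, 100]

# A list passed as bins is used as the exact bin edges.
# With the default right=True, each bin is (left, right].
grades = pd.cut(scores, bins=edges)
print(grades)

# right=False flips the bins to [left, right); 100 then falls outside
# the last bin [80, 100) and is labelled NaN
grades_left = pd.cut(scores, bins=edges, right=False)
print(grades_left)
```

Values outside the outermost edges become NaN, so the first and last edges need to cover the full range of the data.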