# [Learning Objectives]
- Learn how to inspect how many columns of each dtype a DataFrame has and the dtype of each column, and how to write Label Encoding / One Hot Encoding.

# [Example Highlights]
- Inspect the data types of a DataFrame (In[3], In[4])
- Understand how to write Label Encoding (In[6])
- Understand how to write One Hot Encoding (In[7])

In [1]:
import os
import numpy as np
import pandas as pd

In [3]:
!dir

 Volume in drive C is OS
 Volume Serial Number is 06B4-EB6B

 Directory of C:\Users\kelly\Documents\GitHub\3rd-ML100Days\homework

2019/09/01  09:10 AM    <DIR>          .
2019/09/01  09:10 AM    <DIR>          ..
2019/08/31  04:05 PM    <DIR>          .ipynb_checkpoints
2019/08/30  12:10 AM           136,566 Day_001_HW-checkpoint.ipynb
2019/08/30  12:10 AM           136,566 Day_001_HW.ipynb
2019/08/29  11:04 PM            36,855 Day_001_HW1-checkpoint.ipynb
2019/08/29  11:04 PM            36,855 Day_001_HW1.ipynb
2019/08/30  10:57 AM             1,516 Day_002_HW.ipynb
2019/08/30  12:11 PM             3,381 Day_003_HW.ipynb
2019/08/30  09:39 PM            80,946 Day_004_HW.ipynb
2019/08/31  04:01 PM             8,440 Day_005-1_build_dataframe_from_scratch.ipynb
2019/08/30  11:03 PM             5,225 Day_005-1_HW.ipynb
2019/08/31  04:06 PM         1,106,217 Day_005-2_HW.ipynb
2019/08/30  10:06 PM            12,958 Day_005-2_read_and_write_files.ipynb
2019/08/31  04:05 PM         1,094,442 Day_005-3_read_and_write_files.

Count how many columns of each dtype the data contains

In [3]:
app_train.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64
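
To make the check above concrete, here is a minimal sketch on a toy frame. The frame `toy` and its column names are made up for illustration and are not part of `app_train`. `dtypes` gives one dtype per column, and `value_counts()` on that result tallies how many columns share each dtype.

```python
import pandas as pd

# Toy stand-in for app_train; names and values are hypothetical
toy = pd.DataFrame({
    "income": [1000.0, 2500.5, 1800.0],         # float column
    "children": [0, 2, 1],                      # integer column
    "gender": ["F", "M", "F"],                  # text column
    "contract": ["Cash", "Revolving", "Cash"],  # text column
})

# One dtype per column
print(toy.dtypes)

# Number of columns sharing each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)
```

The counts always sum to the number of columns, which is a quick sanity check when the frame is wide.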

Count the number of distinct categories in each categorical (object) column

In [4]:
app_train.select_dtypes(include=["object"]).nunique()

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
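
One detail worth keeping in mind when reading these counts: `nunique()` ignores missing values by default, while `unique()` returns NaN as one of the values. The sketch below shows the difference on hypothetical toy data (the frame and column names are assumptions, not taken from `app_train`). In the output above, four columns report 2 categories, yet the label-encoding cell further down reports only 3 encoded columns. A plausible explanation, not verified here, is that one of those columns contains missing values.

```python
import numpy as np
import pandas as pd

# Toy data; names and values are hypothetical.
# dtype=object is set explicitly so select_dtypes behaves the same across pandas versions.
toy = pd.DataFrame({
    "flag": pd.Series(["Y", "N", "Y", "N"], dtype=object),
    "state": pd.Series(["Yes", "No", np.nan, "No"], dtype=object),
    "amount": [1.0, 2.0, 3.0, 4.0],
})

# Keep only the object (categorical-like) columns
cat_cols = toy.select_dtypes(include=["object"])

n_without_nan = cat_cols.nunique()           # NaN is ignored by default
n_with_nan = cat_cols.nunique(dropna=False)  # NaN is counted as its own value

print(n_without_nan)
print(n_with_nan)
```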

#### Label encoding
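Readers who went through the [reference](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621) carefully will have noticed that Label encoding imposes an ordinal relationship (0 < 1 < 2 < ...) among the categories of a column. For that reason, this example applies Label encoding only to categorical columns with 2 or fewer categories. This is not necessarily the best treatment; which representation fits depends on what the column actually means.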

In [5]:
from sklearn.preprocessing import LabelEncoder

In [6]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique values; NaN counts as a value here,
        # so two-category columns with missing values are skipped
        if app_train[col].nunique(dropna=False) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)

3 columns were label encoded.
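
The ordering concern raised in the Label encoding note can be seen directly on a small example. The sketch below uses hypothetical category values (`small`, `medium`, `large`) that are not from the dataset. `LabelEncoder` sorts the classes and numbers them in that order, so the integers reflect alphabetical order rather than any meaning of the categories.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values, for illustration only
sizes = np.array(["small", "large", "medium", "small"])

le = LabelEncoder()
codes = le.fit_transform(sizes)

# classes_ holds the sorted categories; the position is the assigned code
print(le.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
restored = le.inverse_transform(codes)
print(restored)
```

Here `large` gets 0 and `small` gets 2, which is the opposite of the natural size order. A model that treats the codes as numbers would pick up that arbitrary ordering, which is why the example above limits Label encoding to columns with at most 2 values.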


#### One Hot encoding
One hot encoding is very convenient in pandas: a single line of code does it.

In [7]:
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print(app_train['CODE_GENDER_F'].head())
print(app_train['CODE_GENDER_M'].head())
print(app_train['NAME_EDUCATION_TYPE_Academic degree'].head())

0    0
1    1
2    0
3    1
4    0
Name: CODE_GENDER_F, dtype: uint8
0    1
1    0
2    1
3    0
4    1
Name: CODE_GENDER_M, dtype: uint8
0    0
1    0
2    0
3    0
4    0
Name: NAME_EDUCATION_TYPE_Academic degree, dtype: uint8


We can see that the original categorical columns have all been converted to 0/1 columns.
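
Because the training and test frames are encoded separately, a category that appears in only one of them produces a column the other lacks. One common way to handle this, shown here as a sketch on made-up frames rather than as the notebook's own method, is to keep only the shared columns with `DataFrame.align`. The names `train` and `test` and their contents are hypothetical.

```python
import pandas as pd

# Hypothetical frames: "WED" appears only in the training data
train = pd.DataFrame({"day": ["MON", "TUE", "WED"], "x": [1, 2, 3]})
test = pd.DataFrame({"day": ["MON", "TUE", "TUE"], "x": [4, 5, 6]})

train_ohe = pd.get_dummies(train)
test_ohe = pd.get_dummies(test)
print(train_ohe.shape, test_ohe.shape)  # the column counts differ

# Keep only the columns present in both frames
train_aligned, test_aligned = train_ohe.align(test_ohe, join="inner", axis=1)
print(train_aligned.columns.tolist())
print(test_aligned.columns.tolist())
```

In pandas 2.0 and later, `get_dummies` returns `bool` columns by default; passing `dtype="uint8"` reproduces the 0/1 output shown above.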

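## Homework
Apply One Hot encoding to the data slice sub_train below, and observe how the number of columns (using shape) and the column names (using head) change before and after the conversion.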

In [8]:
app_train = pd.read_csv(f_app_train)
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()

(307511, 1)


   WEEKDAY_APPR_PROCESS_START
0                   WEDNESDAY
1                      MONDAY
2                      MONDAY
3                   WEDNESDAY
4                    THURSDAY


In [None]:
"""
Your Code Here
"""