# [Learning Objectives]
- Learn how to inspect how many columns of each dtype a DataFrame has and what dtype each column is, and how to write Label Encoding and One Hot Encoding.

# [Example Highlights]
- Inspect the data types of a DataFrame (In [9], In [11])
- Understand how to write Label Encoding (In [86])
- Understand how to write One Hot Encoding (In [98])

In [34]:
import os
import numpy as np
import pandas as pd

In [41]:
# Set data_path; the working directory is whichever folder IPython was started in
dir_data = './data/'
print(os.getcwd())
f_app_train = os.path.join(dir_data, 'application_train.csv')
f_app_test = os.path.join(dir_data, 'application_test.csv')
app_train = pd.read_csv(f_app_train)
app_test = pd.read_csv(f_app_test)


D:\AI馬拉松\Day06\D6


In [57]:
app_train[0:3]

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


Inspect how many columns there are of each dtype

In [9]:
app_train.dtypes.value_counts()
#app_train.dtypes

float64    65
int64      41
object     16
dtype: int64
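
As a small illustration of what `dtypes.value_counts()` reports, the sketch below builds a toy DataFrame and counts how many columns fall under each dtype. The column names and values are made up for this example and are not taken from `application_train.csv`; the dtypes are set explicitly so the result does not depend on the pandas version.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "income": pd.Series([202500.0, 270000.0, 67500.0], dtype="float64"),
    "children": pd.Series([0, 0, 1], dtype="int64"),
    "gender": pd.Series(["M", "F", "M"], dtype="object"),
    "own_car": pd.Series(["N", "N", "Y"], dtype="object"),
})

# dtypes is a Series indexed by column name; value_counts() tallies each dtype
dtype_counts = toy.dtypes.value_counts()
print(dtype_counts)

# Names of the columns that select_dtypes treats as "object"
object_cols = toy.select_dtypes(include=["object"]).columns.tolist()
print(object_cols)
```

The `object` columns are the ones the encoding steps later in this notebook operate on.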

Inspect the number of categories in each categorical column (how many distinct categories each column contains)

In [11]:
app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis = 0)

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
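
One detail worth keeping in mind: `nunique` ignores missing values by default, while `unique()` (used in the Label encoding loop later) keeps `NaN` in its result. A column reported as having 2 categories here can therefore have 3 entries in `unique()` if it contains missing values. The sketch below shows the difference on a hypothetical column; whether this affects any column in this dataset is not visible from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two real categories plus a missing value
s = pd.Series(["No", "Yes", np.nan, "No"], dtype="object", name="toy_flag")

n_without_nan = s.nunique()              # NaN is ignored by default
n_with_nan = s.nunique(dropna=False)     # NaN counted as its own value
n_via_unique = len(s.unique())           # unique() keeps NaN in its result

print(n_without_nan, n_with_nan, n_via_unique)
```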

#### Label encoding
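As readers of the [reference article](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621) will have noticed, Label encoding imposes an ordering on the categories within a column (0 < 1 < 2 < ...). For that reason we demonstrate Label encoding here only on categorical columns with 2 or fewer categories. This is not necessarily the best treatment; which representation is appropriate depends on what the column itself means.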
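
To make the ordering issue concrete, here is a minimal sketch with scikit-learn's `LabelEncoder` on a hypothetical list of education levels (the values are invented for this example). `LabelEncoder` sorts the classes, so the integer codes follow alphabetical order rather than any meaningful ranking.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical education levels, for illustration only
levels = ["Secondary", "Higher education", "Lower secondary", "Secondary"]

le = LabelEncoder()
codes = le.fit_transform(levels)

# classes_ is sorted, so each class maps to its position in that sorted list
mapping = {str(label): int(code)
           for label, code in zip(le.classes_, le.transform(le.classes_))}
print(mapping)
print(codes.tolist())
```

Here "Higher education" receives a smaller code than "Lower secondary", which a model that treats the column as numeric could misread as a ranking. With only two categories the codes are just 0 and 1, so the issue does not arise.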

In [36]:
from sklearn.preprocessing import LabelEncoder
#from sklearn import preprocessing

In [37]:
print(app_train["NAME_CONTRACT_TYPE"].unique())  # unique() returns the distinct values in the column
print(pd.read_csv(f_app_train)["NAME_CONTRACT_TYPE"].unique())
len(list(app_train["NAME_CONTRACT_TYPE"].unique()))


['Cash loans' 'Revolving loans']
['Cash loans' 'Revolving loans']


2

In [63]:
#for col in app_train:
#    print( app_train[])

In [86]:
# Create a label encoder object
le = LabelEncoder()  # instantiate a LabelEncoder object
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2: 
            # Train on the training data
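            le.fit(app_train[col])  # learn the categories present in this column
            print(le.classes_)  # print the categories found in the column
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])  # replace each category with its integer label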
            app_test[col] = le.transform(app_test[col])
            
            # Keep track of how many columns were label encoded
            le_count += 1
            
print('%d columns were label encoded.' % le_count)
app_train.head(2)
#app_test.head(2)
#app_train.shape[0]

0 columns were label encoded.


Unnamed: 0,SK_ID_CURR,TARGET,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,AMT_GOODS_PRICE,REGION_POPULATION_RELATIVE,DAYS_BIRTH,DAYS_EMPLOYED,...,HOUSETYPE_MODE_terraced house,WALLSMATERIAL_MODE_Block,WALLSMATERIAL_MODE_Mixed,WALLSMATERIAL_MODE_Monolithic,WALLSMATERIAL_MODE_Others,WALLSMATERIAL_MODE_Panel,"WALLSMATERIAL_MODE_Stone, brick",WALLSMATERIAL_MODE_Wooden,EMERGENCYSTATE_MODE_No,EMERGENCYSTATE_MODE_Yes
0,100002,1,0,202500.0,406597.5,24700.5,351000.0,0.018801,-9461,-637,...,0,0,0,0,0,0,1,0,1,0
1,100003,0,0,270000.0,1293502.5,35698.5,1129500.0,0.003541,-16765,-1188,...,0,1,0,0,0,0,0,0,1,0
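
A count of 0 is what this loop reports whenever the frame holds no `object` columns with 2 or fewer unique values, for example when the cell is re-run after the columns were already encoded. The one-hot column names in the preview above suggest that may be what happened here, though that is an inference and not something the output states. A minimal sketch on hypothetical toy frames (the helper `label_encode_binary` and the column names are invented for this illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def label_encode_binary(train, test):
    """Label-encode object columns with at most 2 unique values; return the count."""
    le = LabelEncoder()
    count = 0
    for col in train:
        if train[col].dtype == "object" and len(train[col].unique()) <= 2:
            le.fit(train[col])                    # learn categories from train only
            train[col] = le.transform(train[col])
            test[col] = le.transform(test[col])
            count += 1
    return count

# Hypothetical toy frames, for illustration only
train = pd.DataFrame({
    "contract": pd.Series(["Cash", "Revolving", "Cash"], dtype="object"),
    "weekday": pd.Series(["MON", "TUE", "WED"], dtype="object"),
})
test = pd.DataFrame({
    "contract": pd.Series(["Revolving", "Cash"], dtype="object"),
    "weekday": pd.Series(["MON", "WED"], dtype="object"),
})

first_run = label_encode_binary(train, test)    # "contract" qualifies
second_run = label_encode_binary(train, test)   # "contract" is now numeric
print(first_run, second_run)
```

Reloading the CSV files before running the loop gives it the original `object` columns to work on.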


#### One Hot encoding
One hot encoding is very convenient in pandas; a single line of code does it.

In [98]:
# get_dummies: when `columns` is not given, every object, string, or category dtype column is one-hot encoded; numeric columns are left as-is.
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

print(app_train['CODE_GENDER_F'].head())
print(app_train['CODE_GENDER_M'].head())
print(app_train['NAME_EDUCATION_TYPE_Academic degree'].head())

0    0
1    1
2    0
3    1
4    0
Name: CODE_GENDER_F, dtype: uint8
0    1
1    0
2    1
3    0
4    1
Name: CODE_GENDER_M, dtype: uint8
0    0
1    0
2    0
3    0
4    0
Name: NAME_EDUCATION_TYPE_Academic degree, dtype: uint8


In [99]:
app_train.shape
#app_train[['CODE_GENDER_F', 'CODE_GENDER_M']].head()
#app_train.columns
#app_train[0:5]

(307511, 246)

We can see that the original categorical columns have all been converted to 0/1.
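
Because `get_dummies` is applied to the training and test frames separately, a category that appears in only one of them produces a column the other frame lacks. One common way to reconcile the two is `DataFrame.align` with an inner join on columns. The sketch below uses hypothetical toy frames and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy frames: category "C" appears only in train
train = pd.DataFrame({
    "amt": [1.0, 2.0, 3.0],
    "grade": pd.Series(["A", "B", "C"], dtype="object"),
})
test = pd.DataFrame({
    "amt": [4.0, 5.0],
    "grade": pd.Series(["A", "B"], dtype="object"),
})

train_oh = pd.get_dummies(train)
test_oh = pd.get_dummies(test)
print(train_oh.shape, test_oh.shape)

# Keep only the columns present in both frames; rows are left untouched
train_al, test_al = train_oh.align(test_oh, join="inner", axis=1)
print(train_al.shape, test_al.shape)
```

Note that an inner join also drops any column that exists in only one frame for other reasons, such as a label column present only in the training data, so such a column would need to be set aside first and re-attached afterwards.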

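## Homework
Apply One Hot encoding to the data fragment sub_train below, and observe how the number of columns (using shape) and the column names (using head) change before and after the conversion.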

In [88]:
app_train = pd.read_csv(f_app_train)
sub_train = pd.DataFrame(app_train['WEEKDAY_APPR_PROCESS_START'])
print(sub_train.shape)
sub_train.head()

(307511, 1)


Unnamed: 0,WEEKDAY_APPR_PROCESS_START
0,WEDNESDAY
1,MONDAY
2,MONDAY
3,WEDNESDAY
4,THURSDAY


In [101]:
sub_train = pd.get_dummies(sub_train)
print(sub_train.shape)
sub_train.head()

(307511, 7)


Unnamed: 0,WEEKDAY_APPR_PROCESS_START_FRIDAY,WEEKDAY_APPR_PROCESS_START_MONDAY,WEEKDAY_APPR_PROCESS_START_SATURDAY,WEEKDAY_APPR_PROCESS_START_SUNDAY,WEEKDAY_APPR_PROCESS_START_THURSDAY,WEEKDAY_APPR_PROCESS_START_TUESDAY,WEEKDAY_APPR_PROCESS_START_WEDNESDAY
0,0,0,0,0,0,0,1
1,0,1,0,0,0,0,0
2,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1
4,0,0,0,0,1,0,0
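
The outputs in this notebook show 0/1 indicator columns of dtype `uint8`. Newer pandas releases (2.0 and later) return boolean indicator columns by default, so reproducing the 0/1 form there requires the `dtype` argument of `get_dummies`. A minimal sketch on a hypothetical miniature of the homework column (the frame `sub` and its values are invented for this illustration):

```python
import pandas as pd

# Hypothetical miniature of the homework column, for illustration only
sub = pd.DataFrame({
    "WEEKDAY": pd.Series(["WEDNESDAY", "MONDAY", "MONDAY"], dtype="object"),
})

# dtype sets the type of the generated indicator columns
sub_oh = pd.get_dummies(sub, dtype="uint8")

print(sub.shape, "->", sub_oh.shape)   # one column per distinct category
print(sub_oh.columns.tolist())         # names are "<column>_<category>"
print(sub_oh)
```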
