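1. Variables in data columns generally fall into two kinds
    * Discrete variables: can only take countable values, typically counted in whole units
        * Gender, country
    * Continuous variables: can take any value within a given interval
        * Time an aircraft needs to take off or land, vehicle speed
2. Common column data types
    * float64
    * int64
    * object
    * Others: dates, boolean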
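
As a small illustration of how these data types show up in pandas, the sketch below builds a toy DataFrame and inspects each column's dtype. The column names and values are made up for illustration and are not taken from the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only
toy = pd.DataFrame({
    "speed_kmh": [62.5, 80.0, 45.3],    # continuous measurement -> float64
    "n_children": [0, 2, 1],            # whole-number counts -> integer dtype
    "country": ["TW", "JP", "US"],      # text categories (object in older pandas)
    "is_member": [True, False, True],   # boolean flag -> bool
    "joined": pd.to_datetime(["2020-01-01", "2020-06-15", "2021-03-09"]),  # dates
})

toy_dtypes = toy.dtypes
print(toy_dtypes)
```

Text columns are reported as `object` in the pandas version this notebook appears to use; newer pandas releases may report a dedicated string dtype instead.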

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
dir_data="../data"
f_app_train = os.path.join(dir_data,"application_train.csv")
f_app_test = os.path.join(dir_data,"application_test.csv")

app_train = pd.read_csv(f_app_train)
app_test = pd.read_csv(f_app_test)

# Inspect the data type (dtype) of each column
app_train.dtypes

SK_ID_CURR                      int64
TARGET                          int64
NAME_CONTRACT_TYPE             object
CODE_GENDER                    object
FLAG_OWN_CAR                   object
FLAG_OWN_REALTY                object
CNT_CHILDREN                    int64
AMT_INCOME_TOTAL              float64
AMT_CREDIT                    float64
AMT_ANNUITY                   float64
AMT_GOODS_PRICE               float64
NAME_TYPE_SUITE                object
NAME_INCOME_TYPE               object
NAME_EDUCATION_TYPE            object
NAME_FAMILY_STATUS             object
NAME_HOUSING_TYPE              object
REGION_POPULATION_RELATIVE    float64
DAYS_BIRTH                      int64
DAYS_EMPLOYED                   int64
DAYS_REGISTRATION             float64
DAYS_ID_PUBLISH                 int64
OWN_CAR_AGE                   float64
FLAG_MOBIL                      int64
FLAG_EMP_PHONE                  int64
FLAG_WORK_PHONE                 int64
FLAG_CONT_MOBILE                int64
FLAG_PHONE  

1. Count how many columns of each data type the dataset contains

In [3]:
# app_train.get_dtype_counts() was removed in pandas 1.0; dtypes.value_counts() gives the same counts
app_train.dtypes.value_counts()

float64    65
int64      41
object     16
dtype: int64

2. For the categorical columns, count how many distinct categories each one has

In [4]:
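# Select the categorical (object) columns
app_train.select_dtypes(include=["object"])
# Show the distinct values of a single column
pd.unique(app_train.NAME_CONTRACT_TYPE)
# select_dtypes picks columns by dtype
# pd.Series.nunique returns the number of distinct values in a Series; axis=0 applies it to each column
app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis=0)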

NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64
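
One detail worth keeping in mind when reading these counts: `nunique()` skips missing values by default, while `unique()` keeps `NaN` in its result. The sketch below shows the difference on a made-up Series (the values are hypothetical, not from the dataset). This may be why a column reported with 2 categories above can still fail a `len(unique()) <= 2` check like the one used later for label encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two real categories and one missing value
s = pd.Series(["Yes", "No", np.nan, "No"], dtype="object")

n_distinct = s.nunique()                        # NaN is excluded by default
n_distinct_with_nan = s.nunique(dropna=False)   # NaN counted as its own value
n_unique_array = len(s.unique())                # unique() keeps NaN in the result

print(n_distinct, n_distinct_with_nan, n_unique_array)
```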

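3. How do models handle categorical data? How can it be represented?
#### Label Encoder vs. One Hot Encoder
Most models cannot work directly with text or categorical values, so these two methods are used to convert them into numeric data.

* Label Encoding

Suppose a column contains three categories: staple food, dessert, and fruit. Label encoding assigns each category an integer (for example 0: staple food, 1: fruit, 2: dessert). The numbers carry no inherent meaning (scikit-learn's `LabelEncoder` simply numbers the classes in sorted order), so a model may later mistake them for an ordering or a magnitude. The implementation follows.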

In [5]:
from sklearn.preprocessing import LabelEncoder

In [6]:
# Create a label encoder object
le = LabelEncoder()
le_count = 0

for col in app_train:
    if app_train[col].dtype == "object":
        # Only encode columns with at most two distinct values
        if len(list(app_train[col].unique())) <= 2:
            print(app_train[col].name,list(app_train[col].unique()))
            # Fit on the training data only
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            # Keep track of how many columns were label encoded
            le_count += 1
print("%d columns were label encoded." % le_count)

NAME_CONTRACT_TYPE ['Cash loans', 'Revolving loans']
FLAG_OWN_CAR ['N', 'Y']
FLAG_OWN_REALTY ['Y', 'N']
3 columns were label encoded.


In [12]:
# The column has now been converted to integer codes
app_train["NAME_CONTRACT_TYPE"].head()

0    0
1    0
2    1
3    0
4    0
Name: NAME_CONTRACT_TYPE, dtype: int64
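
To make the label-to-integer mapping visible, here is a minimal sketch on the food example from the text. The category names and the variable `le_demo` are hypothetical. It also shows that transforming a label the encoder never saw during `fit` raises a `ValueError`, which matters when the same encoder is applied to both training and test data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical food categories, used only for illustration
foods = ["staple", "dessert", "fruit", "staple", "fruit"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(foods)

# classes_ holds the labels in sorted order; each label's position is its code
mapping = {label: code for code, label in enumerate(le_demo.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
roundtrip = le_demo.inverse_transform(codes).tolist()

# A label that was not present during fit cannot be transformed
try:
    le_demo.transform(["snack"])
    unseen_ok = True
except ValueError:
    unseen_ok = False
print(unseen_ok)
```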

* [One Hot Encoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
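Suppose a column (TYPE) contains three categories: staple food, dessert, and fruit. One-hot encoding splits it into three columns (TYPE_staple, TYPE_dessert, TYPE_fruit) whose values are only 0 or 1, where 1 means the row belongs to that category. When a row's TYPE is staple food, the TYPE_staple column shows 1 while TYPE_dessert and TYPE_fruit show 0. The implementation follows, using pandas' `get_dummies`.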

In [8]:
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

In [9]:
# The values in these columns are now all 0 or 1
print(app_train['CODE_GENDER_F'].head())
print(app_train['CODE_GENDER_M'].head())
print(app_train['NAME_EDUCATION_TYPE_Academic degree'].head())
# The number of columns has grown to 243
print(app_train.columns)

0    0
1    1
2    0
3    1
4    0
Name: CODE_GENDER_F, dtype: uint8
0    1
1    0
2    1
3    0
4    1
Name: CODE_GENDER_M, dtype: uint8
0    0
1    0
2    0
3    0
4    0
Name: NAME_EDUCATION_TYPE_Academic degree, dtype: uint8
Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'FLAG_OWN_CAR',
       'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT',
       'AMT_ANNUITY', 'AMT_GOODS_PRICE',
       ...
       'HOUSETYPE_MODE_terraced house', 'WALLSMATERIAL_MODE_Block',
       'WALLSMATERIAL_MODE_Mixed', 'WALLSMATERIAL_MODE_Monolithic',
       'WALLSMATERIAL_MODE_Others', 'WALLSMATERIAL_MODE_Panel',
       'WALLSMATERIAL_MODE_Stone, brick', 'WALLSMATERIAL_MODE_Wooden',
       'EMERGENCYSTATE_MODE_No', 'EMERGENCYSTATE_MODE_Yes'],
      dtype='object', length=243)
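
The same idea can be sketched on the food example from the text. The toy frames below are hypothetical. Because `get_dummies` is called separately on the training and test data, a category that appears in only one of them produces a column the other lacks; one possible way to reconcile this (an assumption, not something this notebook does) is `DataFrame.align` with `join="inner"`.

```python
import pandas as pd

# Hypothetical toy frames; "snack" appears only in the training data
train_demo = pd.DataFrame({"TYPE": ["staple", "dessert", "fruit", "snack"],
                           "price": [50, 30, 20, 10]})
test_demo = pd.DataFrame({"TYPE": ["fruit", "staple", "dessert"],
                          "price": [25, 55, 35]})

# dtype=int forces 0/1 output; recent pandas versions otherwise return booleans
train_oh = pd.get_dummies(train_demo, dtype=int)
test_oh = pd.get_dummies(test_demo, dtype=int)
print(train_oh.shape, test_oh.shape)

# Keep only the columns present in both frames so they line up
train_aligned, test_aligned = train_oh.align(test_oh, join="inner", axis=1)
print(train_aligned.columns.tolist())
```

An inner join drops `TYPE_snack` from the training frame; whether that is acceptable depends on the model and on how rare the category is.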


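## Homework
Apply one-hot encoding to the data fragment sub_train below, and observe how the number of columns (using shape) and the column names (using head) change before and after the conversion.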

In [10]:
dir_data = '../data/'
f_app_train = os.path.join(dir_data, 'application_train.csv')
app_train_2 = pd.read_csv(f_app_train)

sub_train = pd.DataFrame(app_train_2["WEEKDAY_APPR_PROCESS_START"])
print(sub_train.shape)
sub_train.head()

(307511, 1)


   WEEKDAY_APPR_PROCESS_START
0                   WEDNESDAY
1                      MONDAY
2                      MONDAY
3                   WEDNESDAY
4                    THURSDAY


In [11]:
sub_train = pd.get_dummies(sub_train)
print(sub_train.shape)
print(sub_train.head())

(307511, 7)
   WEEKDAY_APPR_PROCESS_START_FRIDAY  WEEKDAY_APPR_PROCESS_START_MONDAY  \
0                                  0                                  0   
1                                  0                                  1   
2                                  0                                  1   
3                                  0                                  0   
4                                  0                                  0   

   WEEKDAY_APPR_PROCESS_START_SATURDAY  WEEKDAY_APPR_PROCESS_START_SUNDAY  \
0                                    0                                  0   
1                                    0                                  0   
2                                    0                                  0   
3                                    0                                  0   
4                                    0                                  0   

   WEEKDAY_APPR_PROCESS_START_THURSDAY  WEEKDAY_APPR_PROCESS_START_TUESDAY