# Feature Engineering for Categorical Variables

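## Assignment Code
In this assignment, students are asked to complete the following:
1. Find a dataset you want to explore on Kaggle and use it for this assignment.
2. Pick two suitable categorical variables and apply Label Encoding to each, using two different implementations.
3. Pick two suitable categorical variables and apply One-Hot Encoding to each, using two different implementations.
4. Pick two suitable categorical variables and apply Ordinal Encoding to each, using two of the three implementations.
5. Pick two suitable categorical variables, then write and run Frequency Encoding.
6. Pick two suitable categorical variables, then write and run Feature Combination.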

In [49]:
import numpy as np
import pandas as pd

## Load the Data

In [50]:
# Load the data
raw_data = pd.read_csv("C:/Users/Orianna/Desktop/marathon/house-prices-advanced-regression-techniques/test.csv")  # fill in the path to your data here
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

In [51]:
print(raw_data.shape)

(1459, 80)


In [52]:
# To keep the exercise simple, drop every column that contains missing values
raw_data = raw_data.dropna(axis=1)
print(raw_data.shape)

(1459, 47)
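
The shape change above (80 columns down to 47, with all 1459 rows kept) follows from `axis=1`, which removes columns rather than rows. A minimal sketch on a hypothetical toy frame (the column names `a`, `b`, `c` are made up for illustration) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: only column "b" contains a missing value.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", "y", "z"],
})

dropped_columns = toy.dropna(axis=1)  # removes every column that has a NaN
dropped_rows = toy.dropna(axis=0)     # removes every row that has a NaN

print(dropped_columns.shape)  # (3, 2)
print(dropped_rows.shape)     # (2, 3)
```

Dropping columns keeps every sample but can discard useful features, so in a real project imputation may be a better choice than either option.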


## Split the Raw Data into Training and Test Sets

In [53]:
from sklearn.model_selection import train_test_split

trainData, testData = train_test_split(raw_data, test_size = 0.25, random_state = 214)

In [54]:
print(trainData.columns.tolist())

['Id', 'MSSubClass', 'LotArea', 'Street', 'LotShape', 'LandContour', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'ExterQual', 'ExterCond', 'Foundation', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SaleCondition']


## Label Encoding

This section introduces two ways to perform Label Encoding:
1. Using the LabelEncoder class from sklearn
2. Using a Python dictionary

Example: apply Label Encoding to LotConfig
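
Before running it on the real data, it may help to see which integer each category receives. The sketch below uses a hypothetical three-category toy column; `LabelEncoder` stores the learned categories in sorted order in `classes_`, and the position of a category in that array is its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column; the values mimic LotConfig categories.
toy = pd.Series(["Inside", "Corner", "Inside", "CulDSac"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ is sorted, so the index of each category is its integer code.
mapping = {category: code for code, category in enumerate(encoder.classes_)}

print(mapping)         # {'Corner': 0, 'CulDSac': 1, 'Inside': 2}
print(codes.tolist())  # [2, 0, 2, 1]
print(encoder.inverse_transform(codes).tolist())  # back to the original strings
```

Because the codes follow alphabetical order, they carry no meaning beyond identifying the category.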

In [55]:
# Method 1: use the LabelEncoder class from sklearn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_labelencoding = le.fit_transform(trainData['LotConfig'])  # fit_transform: learn every category in the training data, then map each one to an integer
test_labelencoding = le.transform(testData['LotConfig'])  # transform: apply the mapping learned above to new data

In [56]:
for before_labelencoding,after_labelencoding in zip(trainData["LotConfig"], train_labelencoding):
    print("Before Label Encoding: {}, After Label Encoding: {}".format(before_labelencoding, after_labelencoding))

Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Corner, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Corner, After Label Encoding: 0
Before Label Encoding: Corner, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Label Encoding: Inside, After Label Encoding: 4
Before Lab

In [57]:
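# Method 2: use a Python dictionary
# Codes follow the order in which categories first appear in the training data
label_dict = {
    content: index for index, content in enumerate(trainData['LotConfig'].unique())
}

# Note: a category that appears only in testData raises a KeyError here
train_labelencoding = trainData["LotConfig"].apply(lambda x: label_dict[x])
test_labelencoding = testData["LotConfig"].apply(lambda x: label_dict[x])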
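
One weakness of the dictionary lookup is that a category present only in the test set is not in the dictionary. A possible workaround, sketched here on hypothetical toy data, is `Series.map`, which yields NaN for missing keys instead of raising; the `-1` fallback code is an assumption chosen for illustration, not part of the original assignment.

```python
import pandas as pd

# Hypothetical toy data: "FR2" never appears in the training column.
train = pd.Series(["Inside", "Corner", "Inside"])
test = pd.Series(["Corner", "FR2"])

label_dict = {category: index for index, category in enumerate(train.unique())}

# map() returns NaN for unseen categories; replace NaN with a sentinel code.
test_codes = test.map(label_dict).fillna(-1).astype(int)

print(label_dict)           # {'Inside': 0, 'Corner': 1}
print(test_codes.tolist())  # [1, -1]
```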

In [58]:
for before, after in zip(trainData["LotConfig"], train_labelencoding):
    print("Before Label Encoding: {}, After Label Encoding: {}".format(before, after))

Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Corner, After Label Encoding: 1
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Corner, After Label Encoding: 1
Before Label Encoding: Corner, After Label Encoding: 1
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Label Encoding: Inside, After Label Encoding: 0
Before Lab

## One-Hot Encoding

This section introduces two ways to perform One-Hot Encoding:
1. Using the OneHotEncoder class from sklearn
2. Using a Python dictionary

Example: apply One-Hot Encoding to the LotConfig variable
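
As a small illustration of what the encoder learns, the sketch below fits `OneHotEncoder` on a hypothetical toy frame. It also sets `handle_unknown="ignore"`, an optional setting the notebook itself does not use, so that a category unseen during fitting becomes an all-zero row instead of an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "FR2" appears only in the test frame.
train = pd.DataFrame({"LotConfig": ["Inside", "Corner", "CulDSac"]})
test = pd.DataFrame({"LotConfig": ["Corner", "FR2"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# One output column per learned category, in sorted order.
learned = encoder.categories_[0].tolist()
encoded_test = encoder.transform(test).toarray().tolist()

print(learned)       # ['Corner', 'CulDSac', 'Inside']
print(encoded_test)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```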

In [59]:
# Method 1: use the OneHotEncoder class from sklearn

from sklearn.preprocessing import OneHotEncoder

one = OneHotEncoder()
trainData1 = one.fit_transform(trainData['LotConfig'].values.reshape((-1, 1))).toarray()
testData1 = one.transform(testData['LotConfig'].values.reshape((-1, 1))).toarray()


In [60]:
for before, after in zip(trainData['LotConfig'], trainData1.tolist()):
    print("Before One-Hot Encoding: {}, After One-Hot Encoding: {}".format(before, after))

Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Corner, After One-Hot Encoding: [1.0, 0.0, 0.0, 0.0, 0.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [0.0, 0.0, 0.0, 0.0, 1.0]
Before One-Hot Encoding: Corner, After One-Hot Encoding: [1.0, 0.0, 0.0, 0.0, 0.0]
Before One-Hot Encoding: Corner, After One-Hot Encoding: [1.0, 0.0, 0.0, 0.0, 0.0]
Befo

> Note: OneHotEncoder expects two-dimensional input, so the column must be reshaped to 2-D before it is transformed.
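
The shapes involved can be checked directly. In this sketch on a hypothetical toy frame, selecting a column with single brackets gives a 1-D Series, while either `reshape(-1, 1)` or double-bracket selection gives the 2-D shape the encoder expects.

```python
import pandas as pd

# Hypothetical toy frame with a single categorical column.
toy = pd.DataFrame({"LotConfig": ["Inside", "Corner", "Inside"]})

as_series = toy["LotConfig"]                       # 1-D Series
reshaped = toy["LotConfig"].values.reshape(-1, 1)  # 2-D NumPy array
as_frame = toy[["LotConfig"]]                      # 2-D DataFrame (double brackets)

print(as_series.shape)  # (3,)
print(reshaped.shape)   # (3, 1)
print(as_frame.shape)   # (3, 1)
```

Double-bracket selection is often the shorter option and keeps the column name attached to the data.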

In [61]:
# Method 2: use a Python dictionary
lotconfig_categories = trainData['LotConfig'].unique().tolist()
onehot_dict = {
    content: [1 if index == onehot_index else 0 for onehot_index in range(len(lotconfig_categories))]
    for index, content in enumerate(lotconfig_categories)
}

train_onehot = trainData['LotConfig'].apply(lambda x: onehot_dict[x])
test_onehot = testData['LotConfig'].apply(lambda x: onehot_dict[x])

In [62]:
for before_onehot, after_onehot in zip(trainData["LotConfig"], train_onehot):
    print("Before One-Hot Encoding: {}, After One-Hot Encoding: {}".format(before_onehot, after_onehot))

Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Corner, After One-Hot Encoding: [0, 1, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Corner, After One-Hot Encoding: [0, 1, 0, 0, 0]
Before One-Hot Encoding: Corner, After One-Hot Encoding: [0, 1, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Encoding: [1, 0, 0, 0, 0]
Before One-Hot Encoding: Inside, After One-Hot Enco

## Ordinal Encoding

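This section introduces three ways to perform Ordinal Encoding:
1. Using the LabelEncoder class from sklearn with a custom category order (less recommended)
2. Using the OrdinalEncoder class from sklearn with a custom category order (recommended)
3. Using a Python dictionary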
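
The reason a custom order matters can be seen on a variable whose alphabetical order differs from its real ranking. The sketch below uses a hypothetical toy column of quality grades; the worst-to-best ordering `Po < Fa < TA < Gd < Ex` is an assumption based on the usual meaning of these abbreviations and should be checked against the dataset's data description.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column of quality grades.
toy = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa"]})

# Assumed ranking from worst to best.
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]

# Without categories=, the encoder falls back to alphabetical order.
default_codes = OrdinalEncoder().fit_transform(toy).ravel().tolist()

# With categories=, each code is the position in the supplied list.
explicit_codes = (
    OrdinalEncoder(categories=[quality_order]).fit_transform(toy).ravel().tolist()
)

print(default_codes)   # [3.0, 2.0, 0.0, 1.0]  (Ex=0, Fa=1, Gd=2, TA=3)
print(explicit_codes)  # [2.0, 3.0, 4.0, 1.0]  (Fa=1, TA=2, Gd=3, Ex=4)
```

With the default order, "Ex" would receive the lowest code, which reverses the intended ranking.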


In [63]:
# Method 1: use the LabelEncoder class from sklearn
# LabelEncoder always orders its classes alphabetically; for LandSlope the
# alphabetical order (Gtl, Mod, Sev) happens to match the intended order
from sklearn.preprocessing import LabelEncoder
categories = np.sort(raw_data['LandSlope'].unique())
le = LabelEncoder()
le.fit(categories)
train_ordinal = le.transform(trainData['LandSlope'])
test_ordinal = le.transform(testData['LandSlope'])

In [64]:
print("Categories and their codes will be:", categories)

Categories and their codes will be: ['Gtl' 'Mod' 'Sev']


In [65]:
for before_ordinal, after_ordinal in zip(trainData['LandSlope'], train_ordinal):
    print("Before Ordinal Encoding: {}, After Ordinal Encoding: {}".format(before_ordinal, after_ordinal))

Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Enco

In [66]:
# Method 2: use the OrdinalEncoder class from sklearn with an explicit category order
from sklearn.preprocessing import OrdinalEncoder
ordinal = OrdinalEncoder(categories=[['Gtl', 'Mod', 'Sev']])  # codes follow this order: Gtl=0, Mod=1, Sev=2
ordinal.fit(raw_data['LandSlope'].values.reshape((-1, 1)))

train_ordinalencoding = ordinal.transform(trainData['LandSlope'].values.reshape((-1, 1)))
test_ordinalencoding = ordinal.transform(testData['LandSlope'].values.reshape((-1, 1)))

In [67]:
for before_ordinalencoding, after_ordinalencoding in zip(trainData["LandSlope"], train_ordinalencoding):
    print("Before Ordinal Encoding: {}, After Ordinal Encoding: {}".format(before_ordinalencoding, after_ordinalencoding))

Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0.]
Before Ordinal Encoding: Gtl, After Ordinal Encoding: [0

In [68]:
# Method 3: use a Python dictionary
ordinal_dict = {
    content: index for index, content in enumerate(np.sort(raw_data['LandSlope'].unique()))
}

train_ordinal = trainData['LandSlope'].apply(lambda x: ordinal_dict[x])
test_ordinal = testData['LandSlope'].apply(lambda x: ordinal_dict[x])  # encode the test set, not trainData

In [69]:
for before_ordinalencoding, after_ordinalencoding in zip(trainData["LandSlope"], train_ordinal):
    print("Before Ordinal Encoding: {}, After Ordinal Encoding: {}".format(before_ordinalencoding, after_ordinalencoding))

Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Encoding: 0
Before Ordinal Encoding: Gtl, After Ordinal Enco