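# Data Preprocessing (Label Encoding, One-Hot Encoding)
Both encodings convert categorical or text data into numbers so that a program can process and compute on it.
> Label encoding: maps each category to an integer; no new columns are added.

> One-hot encoding: adds one column per category and uses 0/1 to indicate whether a row belongs to it.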

![](images/Encoder.PNG)
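
To make the contrast concrete, here is a minimal pandas-only sketch. The `colors` column is a made-up toy example, not data from this notebook.

```python
import pandas as pd

# Toy column, assumed for illustration only.
colors = pd.Series(["red", "green", "blue", "green"], name="color")

# Label encoding: one integer per category (categories are sorted, so blue=0, green=1, red=2).
codes = colors.astype("category").cat.codes

# One-hot encoding: one 0/1 column per category.
onehot = pd.get_dummies(colors, dtype=int)

print(codes.tolist())  # [2, 1, 0, 1]
print(onehot)
```

Label encoding keeps a single column but introduces an artificial ordering (blue < green < red); one-hot encoding avoids that ordering at the cost of extra columns.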


## Encoding Categorical features (or label)
![](images/Encoding.PNG)


In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#from keras.utils import np_utils
 
from sklearn.preprocessing import LabelEncoder
from sklearn import datasets
from sklearn.model_selection import train_test_split

In [6]:
df = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});
df

|   | blood | Y | Z |
|---|---|---|---|
| 0 | A | high | NaN |
| 1 | B | low | NaN |
| 2 | AB | high | -1196.0 |
| 3 | O | mid | 72.0 |
| 4 | B | mid | 83.0 |


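# Method 1: sklearn - label encoder + onehot encoder
>OneHotEncoder requires a 2-D array, so a 1-D column must first be reshaped with reshape(-1,1).<br>
>Before scikit-learn 0.20, OneHotEncoder accepted only numbers, so text data had to be converted with LabelEncoder first; current versions can encode strings directly.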
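
The two notes above can be checked with a short sketch. It assumes scikit-learn 0.20 or later (where string input is accepted); the `blood` array simply repeats the blood types used in this notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

blood = np.array(["A", "B", "AB", "O", "B"])  # 1-D, shape (5,)
blood_2d = blood.reshape(-1, 1)               # 2-D, shape (5, 1): one column, five rows

enc = OneHotEncoder()
dense = enc.fit_transform(blood_2d).toarray()  # strings are encoded without a LabelEncoder step

print(blood.shape, blood_2d.shape)
print(enc.categories_[0])  # column order of the one-hot matrix
print(dense)
```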

In [11]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoded_Y = encoder.fit_transform(df.blood)
#print(encoded_Y, '\n', df.blood)
df['blood'] = encoded_Y
print(df)

   blood     Y       Z
0      0  high     NaN
1      2   low     NaN
2      1  high -1196.0
3      3   mid    72.0
4      2   mid    83.0
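
The integers in the `blood` column are only meaningful together with the encoder that produced them. As a small sketch (the variable names `mapping` and `restored` are illustrative), the fitted `LabelEncoder` exposes the mapping through `classes_` and can undo it with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

blood = ["A", "B", "AB", "O", "B"]
encoder = LabelEncoder()
codes = encoder.fit_transform(blood)

# classes_ holds the sorted labels; the position of a label is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original labels.
restored = encoder.inverse_transform(codes)
print(list(restored))
```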


## One hot encoding
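One-hot encoding splits a categorical feature into several columns, one per category. In each row, the column that matches the row's category holds 1 and all the others hold 0.

In scikit-learn versions before 0.20, where the columns to encode were selected with categorical_features, OneHotEncoder <b>could not encode strings directly; the strings first had to be replaced by integers via label encoding, and only then one-hot encoded.</b> Current versions accept string columns as-is, and column selection is done with ColumnTransformer.

> categorical_features = [0]: one-hot encode the column at index 0 of data (older versions only; this parameter has since been removed in favour of ColumnTransformer)

> data_le: the label-encoded data (note: OneHotEncoder expects a 2-D array, whereas LabelEncoder returns a 1-D array)


OneHotEncoder returns a SciPy sparse matrix (scipy.sparse.csr_matrix); call .toarray() to turn it into a NumPy array.
Because LabelEncoder numbers the classes in sorted order, column 0 of the result stands for A, column 1 for AB, column 2 for B, and column 3 for O.
One-hot encoding is not limited to strings; it can encode numeric columns too, in which case the data does not need to go through label encoding first.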
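
As a small illustration of the last two points, the sketch below one-hot encodes a purely numeric column and shows that the raw result is sparse. The `sizes` values are made up for the example.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# A numeric column (hypothetical values), already shaped as 2-D.
sizes = np.array([10, 30, 10, 20]).reshape(-1, 1)

enc = OneHotEncoder()
encoded = enc.fit_transform(sizes)

print(sparse.issparse(encoded))  # the default output is a SciPy sparse matrix
print(enc.categories_[0])        # distinct values, in sorted order
print(encoded.toarray())         # dense view: one column per distinct value
```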

```python
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 
   
# creating one hot encoder object with categorical feature 0 
# indicating the first column 
columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])], 
                                      remainder='passthrough') 
  
data = np.array(columnTransformer.fit_transform(data), dtype = str) 
```

In [30]:
# Option 1: call OneHotEncoder directly on the label-encoded column
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 
onehot = OneHotEncoder()
d = np.array(df['blood'])

# fit_transform returns a SciPy sparse matrix; .toarray() converts it to a dense NumPy array
onehot_df = onehot.fit_transform(d.reshape(-1,1)).toarray()
print(d,'\n', onehot_df)
print(type(onehot_df))


[0 2 1 3 2] 
 [[1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [0. 0. 1. 0.]]
<class 'numpy.ndarray'>


In [49]:
# Option 2: ColumnTransformer applies OneHotEncoder to column 0 and passes the other columns through
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 
onehot = OneHotEncoder()
df2 = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});



# creating one hot encoder object with categorical feature 0 
# indicating the first column 

columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [0])], 
                                      remainder='passthrough') 

array2_onehot = np.array(columnTransformer.fit_transform(df2), dtype = str)
print('data df2:','\n', df2)
print('df2_oneshot:', '\n',array2_onehot)
type(array2_onehot)

# Convert the whole array to a DataFrame
df2_DF = pd.DataFrame(array2_onehot, columns = ['Bld A', 'Bld AB', 'Bld B', 'Bld O', 'Y', 'Z'])
df2_DF

data df2: 
   blood     Y       Z
0     A  high     NaN
1     B   low     NaN
2    AB  high -1196.0
3     O   mid    72.0
4     B   mid    83.0
df2_oneshot: 
 [['1.0' '0.0' '0.0' '0.0' 'high' 'nan']
 ['0.0' '0.0' '1.0' '0.0' 'low' 'nan']
 ['0.0' '1.0' '0.0' '0.0' 'high' '-1196.0']
 ['0.0' '0.0' '0.0' '1.0' 'mid' '72.0']
 ['0.0' '0.0' '1.0' '0.0' 'mid' '83.0']]


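|   | Bld A | Bld AB | Bld B | Bld O | Y | Z |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 0.0 | high | nan |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | low | nan |
| 2 | 0.0 | 1.0 | 0.0 | 0.0 | high | -1196.0 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | mid | 72.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 | mid | 83.0 |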


In [6]:
# Option 2, own experiment: encode the second feature instead ([0] --> [1])
# importing one hot encoder from sklearn 
# There are changes in OneHotEncoder class 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 
onehot = OneHotEncoder()
df3 = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});



# creating one hot encoder object with categorical feature 1 
# indicating the second column (Y) 

columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [1])], 
                                      remainder='passthrough') 

df3_onehot = np.array(columnTransformer.fit_transform(df3), dtype = str)
print('data df3:','\n', df3)
print('df3_oneshot:', '\n',df3_onehot)
type(df3_onehot)

 

data df3: 
   blood     Y       Z
0     A  high     NaN
1     B   low     NaN
2    AB  high -1196.0
3     O   mid    72.0
4     B   mid    83.0
df3_oneshot: 
 [['1.0' '0.0' '0.0' 'A' 'nan']
 ['0.0' '1.0' '0.0' 'B' 'nan']
 ['1.0' '0.0' '0.0' 'AB' '-1196.0']
 ['0.0' '0.0' '1.0' 'O' '72.0']
 ['0.0' '0.0' '1.0' 'B' '83.0']]


numpy.ndarray

# Method 2: Keras - label encoder + to_categorical
>to_categorical expects integer class indices; if the data is text, convert it to integers with LabelEncoder first.

In [5]:
from sklearn.preprocessing import LabelEncoder
# Note: recent Keras releases no longer ship keras.utils.np_utils; there, import to_categorical from keras.utils directly
from keras.utils import np_utils 

# label encoder 
encoder = LabelEncoder()

# Build the data
df4 = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]});



encoded_Y = encoder.fit_transform(df4.blood)
#print(encoded_Y, '\n', df.blood)
df4['blood'] = encoded_Y

# convert integers to one hot encoding
keras_oneshot = np_utils.to_categorical(df4.blood)
keras_oneshot



array([[1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 1., 0.]], dtype=float32)
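
For readers without Keras installed, the same matrix can be produced with NumPy alone. This is a sketch of the idea rather than Keras's actual implementation: row `i` of an identity matrix is exactly the one-hot vector for class `i`, so indexing the identity matrix with the label codes yields the encoding.

```python
import numpy as np

codes = np.array([0, 2, 1, 3, 2])  # label-encoded blood types, as produced by LabelEncoder above
num_classes = int(codes.max()) + 1

# Pick one identity-matrix row per sample.
onehot = np.eye(num_classes, dtype="float32")[codes]
print(onehot)
```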

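## Method 3: pd.get_dummies
![](images/Encoding_pd.PNG)
`pd.get_dummies(df)`
>get_dummies encodes string columns directly; numeric columns are left untouched unless they are listed in columns.<br>
>When columns is not specified, every string (object/category) column is encoded.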

In [8]:
df5 = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]})

df5_dummy = pd.get_dummies(df5)
print("All dummies:", '\n', df5_dummy)

df6 = pd.DataFrame({'blood':['A','B','AB','O','B'], 
                   'Y':['high','low','high','mid','mid'],
                   'Z':[np.nan,np.nan,-1196,72,83]})

df6_dummy = pd.get_dummies(df6.blood)
print("Blood dummy:", '\n', df6_dummy)

All dummies: 
         Z  blood_A  blood_AB  blood_B  blood_O  Y_high  Y_low  Y_mid
0     NaN        1         0        0        0       1      0      0
1     NaN        0         0        1        0       0      1      0
2 -1196.0        0         1        0        0       1      0      0
3    72.0        0         0        0        1       0      0      1
4    83.0        0         0        1        0       0      0      1
Blood dummy: 
    A  AB  B  O
0  1   0  0  0
1  0   0  1  0
2  0   1  0  0
3  0   0  0  1
4  0   0  1  0
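
The default behaviour and the effect of `columns` can be compared side by side. The sketch below uses a made-up frame with a numeric `grade` column; `dtype=int` is passed so the result shows 0/1 regardless of pandas version (newer releases default to booleans).

```python
import pandas as pd

# Hypothetical frame: one string column, one numeric column.
df = pd.DataFrame({"blood": ["A", "B", "AB"], "grade": [1, 2, 1]})

# Default: only the string column is encoded; "grade" is passed through as-is.
default = pd.get_dummies(df, dtype=int)

# Listing a numeric column in `columns` forces it to be encoded as well.
forced = pd.get_dummies(df, columns=["blood", "grade"], dtype=int)

print(list(default.columns))
print(list(forced.columns))
```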


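## Exercise 1: sklearn - label encoder + onehot encoder
In the data below, every value in the Country column is a string. Most models are built on mathematical operations and cannot take strings as input,<br>
so the column has to be encoded first. The original recipe imports the LabelEncoder class from sklearn, then fits and transforms the first column and replaces it with the result; the cell below uses ColumnTransformer with OneHotEncoder instead. Note that it passes column index [1], so it is Age (six distinct values, hence six indicator columns) that gets one-hot encoded while Country passes through unchanged; encoding Country requires index [0].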

In [14]:
import numpy as np
import pandas as pd
# importing one hot encoder from sklearn 
from sklearn.preprocessing import OneHotEncoder 
from sklearn.compose import ColumnTransformer 
 

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
print(data)


columnTransformer = ColumnTransformer([('encoder', 
                                        OneHotEncoder(), 
                                        [1])], 
                                      remainder='passthrough') 

array2_onehot = np.array(columnTransformer.fit_transform(data), dtype = str)
print('data df2:','\n', data)
print('df2_oneshot:', '\n',array2_onehot)
type(array2_onehot)

# Convert the whole array to a DataFrame
df2_DF = pd.DataFrame(array2_onehot)
print(df2_DF)
 

     Country  Age  Salary
0     Taiwan   25   20000
1  Australia   30   32000
2    Ireland   45   59000
3  Australia   35   60000
4    Ireland   22   43000
5     Taiwan   36   52000
data df2: 
      Country  Age  Salary
0     Taiwan   25   20000
1  Australia   30   32000
2    Ireland   45   59000
3  Australia   35   60000
4    Ireland   22   43000
5     Taiwan   36   52000
df2_oneshot: 
 [['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' 'Taiwan' '20000']
 ['0.0' '0.0' '1.0' '0.0' '0.0' '0.0' 'Australia' '32000']
 ['0.0' '0.0' '0.0' '0.0' '0.0' '1.0' 'Ireland' '59000']
 ['0.0' '0.0' '0.0' '1.0' '0.0' '0.0' 'Australia' '60000']
 ['1.0' '0.0' '0.0' '0.0' '0.0' '0.0' 'Ireland' '43000']
 ['0.0' '0.0' '0.0' '0.0' '1.0' '0.0' 'Taiwan' '52000']]
     0    1    2    3    4    5          6      7
0  0.0  1.0  0.0  0.0  0.0  0.0     Taiwan  20000
1  0.0  0.0  1.0  0.0  0.0  0.0  Australia  32000
2  0.0  0.0  0.0  0.0  0.0  1.0    Ireland  59000
3  0.0  0.0  0.0  1.0  0.0  0.0  Australia  60000
4  1.0  0.0  0.0  0.0  0.0  0.0    Ireland  43000
5  0.0  0.0  0.0  0.0  1.0  0.0     Taiwan  52000
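
The cell above encodes column index `[1]` (Age). A minimal sketch of the same pipeline pointed at Country (index `[0]`) is shown here for comparison; `sparse_threshold=0` is used so the result is always a dense array, and the name `ct` is illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({'Country': ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan'],
                     'Age': [25, 30, 45, 35, 22, 36],
                     'Salary': [20000, 32000, 59000, 60000, 43000, 52000]})

# Encode column 0 (Country); Age and Salary are passed through unchanged.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [0])],
                       remainder='passthrough',
                       sparse_threshold=0)

encoded = np.asarray(ct.fit_transform(data), dtype=float)

# Column order of the three indicator columns.
print(ct.named_transformers_['encoder'].categories_[0])
print(encoded)
```

Three countries give three indicator columns, followed by Age and Salary.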

## Exercise 2: Keras - label encoder + to_categorical

In [16]:
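from sklearn.preprocessing import LabelEncoder
# Note: recent Keras releases no longer ship keras.utils.np_utils; there, import to_categorical from keras.utils directly
from keras.utils import np_utils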

country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data


# label encoder 
encoder = LabelEncoder()


encoded_Y = encoder.fit_transform(data.Country)
data['Country'] = encoded_Y

# convert integers to one hot encoding
keras_oneshot = np_utils.to_categorical(data.Country)
keras_oneshot


array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)

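## Exercise 3: Pandas.get_dummies
> get_dummies: by default only string (object/category) columns are converted to one-hot form; when columns is not specified, all such columns are converted.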

In [17]:
country=['Taiwan','Australia','Ireland','Australia','Ireland','Taiwan']
age=[25,30,45,35,22,36]
salary=[20000,32000,59000,60000,43000,52000]
dic={'Country':country,'Age':age,'Salary':salary}
data=pd.DataFrame(dic)
data


df6_dummy = pd.get_dummies(data.Country)
print(df6_dummy)

   Australia  Ireland  Taiwan
0          0        0       1
1          1        0       0
2          0        1       0
3          1        0       0
4          0        1       0
5          0        0       1
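
Calling `get_dummies` on a single column returns only the indicator columns. To keep Age and Salary alongside them, one option (a sketch, with `dtype=int` chosen so the output shows 0/1) is to pass the whole frame and name the column to encode:

```python
import pandas as pd

data = pd.DataFrame({'Country': ['Taiwan', 'Australia', 'Ireland', 'Australia', 'Ireland', 'Taiwan'],
                     'Age': [25, 30, 45, 35, 22, 36],
                     'Salary': [20000, 32000, 59000, 60000, 43000, 52000]})

# Only Country is encoded; the new columns are prefixed with the original column name.
encoded = pd.get_dummies(data, columns=['Country'], dtype=int)
print(encoded)
```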
