## Data Pre-processing with Scikit-Learn


Most scikit-learn estimators cannot work directly on string values.
All string-valued raw data must therefore be converted to numeric values first.

The related classes are located in sklearn.preprocessing:

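- Label Encoding: converts each string category to an integer
- One-Hot Encoding: converts each category to its own 0/1 indicator column (returned as a sparse matrix)
- Feature Scaling and Normalization
  - StandardScaler: rescales each feature to mean 0 and variance 1 (the values are not bounded to a fixed range)
  - MinMaxScaler: rescales each feature to the range 0 ~ 1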
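
For reference, the two scalers listed above can be written as the following per-feature formulas. This is a sketch under the default settings; the symbols are introduced here for illustration and are not from the original notes: $\mu$ and $\sigma$ are the mean and (population) standard deviation of a feature, and $x_{\min}$, $x_{\max}$ are its minimum and maximum.

$$
z = \frac{x - \mu}{\sigma}
\qquad\qquad
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
$$

The first maps a feature to mean 0 and variance 1; the second maps it to the interval from 0 to 1.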

#### Label Encoding

LabelEncoder in sklearn.preprocessing assigns an integer label to each class, numbered in sorted (alphabetical) order.

In [1]:
from sklearn.preprocessing import LabelEncoder

items = ["TV", "Refridgerator", "Microwave", "Computer", "Electric Fan", "Electric Fan", "Mixer", "Mixer"]

# define class instance
encoder = LabelEncoder()

# fit the label
encoder.fit(items)

# transform the data into integer labels and store the result
labels = encoder.transform(items)
print(f"Encoded Result: {labels}")

# show encoded order
print(f"Encoded Class: {encoder.classes_}")

# Decoded Order
print(f"Decoded: {encoder.inverse_transform(labels)}")

Encoded Result: [5 4 2 0 1 1 3 3]
Encoded Class: ['Computer' 'Electric Fan' 'Microwave' 'Mixer' 'Refridgerator' 'TV']
Decoded: ['TV' 'Refridgerator' 'Microwave' 'Computer' 'Electric Fan' 'Electric Fan'
 'Mixer' 'Mixer']


Label encoding has a drawback: every category is represented by an integer, and integers carry an order and a magnitude.
An ML model may therefore treat a larger code as more important, even though the categories have no natural ranking.

=> Use One-Hot Encoding
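
The ordering problem can be made concrete with a small sketch. It uses a shortened item list (an assumption made here for brevity), and the names `code_of`, `gap_tv_mixer` and `gap_tv_computer` are illustrative only.

```python
from sklearn.preprocessing import LabelEncoder

# Shortened, hypothetical item list (not the full list used in the cells above)
items = ["TV", "Microwave", "Computer", "Mixer"]

# Encode and keep a plain item -> integer lookup
codes = LabelEncoder().fit_transform(items)
code_of = {item: int(code) for item, code in zip(items, codes)}
print(code_of)

# The integers allow comparisons and distances that mean nothing for appliances
print("TV > Computer:", code_of["TV"] > code_of["Computer"])
gap_tv_mixer = abs(code_of["TV"] - code_of["Mixer"])
gap_tv_computer = abs(code_of["TV"] - code_of["Computer"])
print("TV-Mixer gap:", gap_tv_mixer, "TV-Computer gap:", gap_tv_computer)
```

A model that reads these codes as numbers would see TV as "closer" to Mixer than to Computer, which is purely an effect of alphabetical sorting.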

#### One-Hot-Encoding

1. Convert the raw data to a 2-dimensional ndarray (one column, one row per sample)
2. Apply OneHotEncoder to get a sparse matrix
3. Convert the sparse matrix to a dense matrix

In [2]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

items = ["TV", "Refridgerator", "Microwave", "Computer", "Electric Fan", "Electric Fan", "Mixer", "Mixer"]
data = np.array(items)
print(data)
print("==")

# reshape into a 2-dimensional array with a single column
data = data.reshape(-1, 1)
print(data)
print("==")

# Apply One-Hot-Encoding and get Sparse Matrix
encoder = OneHotEncoder()
encoder.fit(data)
labels = encoder.transform(data)
print("SPARSE MATRIX")
print(labels)
print("==")

# Change Sparse Matrix to Dense Matrix
print("DENSE MATRIX")
print(labels.toarray())

['TV' 'Refridgerator' 'Microwave' 'Computer' 'Electric Fan' 'Electric Fan'
 'Mixer' 'Mixer']
==
[['TV']
 ['Refridgerator']
 ['Microwave']
 ['Computer']
 ['Electric Fan']
 ['Electric Fan']
 ['Mixer']
 ['Mixer']]
==
SPARSE MATRIX
  (0, 5)	1.0
  (1, 4)	1.0
  (2, 2)	1.0
  (3, 0)	1.0
  (4, 1)	1.0
  (5, 1)	1.0
  (6, 3)	1.0
  (7, 3)	1.0
==
DENSE MATRIX
[[0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0. 0.]]
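
To read the dense matrix, it helps to know which column belongs to which category. The sketch below, on a shortened item list (an assumption for brevity), looks this up through the fitted encoder's `categories_` attribute and maps the rows back with `inverse_transform`. The variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Shortened, hypothetical item list reshaped to one column
items = np.array(["TV", "Microwave", "Computer", "Mixer", "Mixer"]).reshape(-1, 1)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(items).toarray()

# categories_ holds one array per input column; column i of the
# one-hot matrix corresponds to categories[i]
categories = encoder.categories_[0]
print(categories)

# Position of the single 1 in each row
column_index = onehot.argmax(axis=1)
print(column_index)

# Recover the original strings from the one-hot rows
decoded = encoder.inverse_transform(onehot)
print(decoded.ravel())
```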


##### Pandas provides an easier way to get one-hot encoded labels: pd.get_dummies(DATAFRAME_DATA)

In [3]:
import pandas as pd

df = pd.DataFrame({"items":["TV", "Refridgerator", "Microwave", "Computer", "Electric Fan", "Electric Fan", "Mixer", "Mixer"]})
pd.get_dummies(df)

,items_Computer,items_Electric Fan,items_Microwave,items_Mixer,items_Refridgerator,items_TV
0,False,False,False,False,False,True
1,False,False,False,False,True,False
2,False,False,True,False,False,False
3,True,False,False,False,False,False
4,False,True,False,False,False,False
5,False,True,False,False,False,False
6,False,False,False,True,False,False
7,False,False,False,True,False,False
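
The result above contains Boolean columns. If 0/1 integers are preferred, to match the OneHotEncoder output, one option is the `dtype` argument of `pd.get_dummies`. This is a minimal sketch on a shortened item list (an assumption for brevity).

```python
import pandas as pd

# Shortened, hypothetical item list
df = pd.DataFrame({"items": ["TV", "Microwave", "Computer", "Mixer", "Mixer"]})

# dtype=int asks for integer indicator columns instead of booleans
dummies = pd.get_dummies(df, dtype=int)
print(dummies)

# Each row has exactly one indicator set
print(dummies.sum(axis=1).tolist())
```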


## Feature Scaling: StandardScaler

In [4]:
from sklearn.datasets import load_iris

raw_data = load_iris()
pd_data = pd.DataFrame(data=raw_data.data, columns=raw_data.feature_names)
print(pd_data)

# Mean and var for each feature
print(pd_data.mean())
print(pd_data.var())

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                  5.1               3.5                1.4               0.2
1                  4.9               3.0                1.4               0.2
2                  4.7               3.2                1.3               0.2
3                  4.6               3.1                1.5               0.2
4                  5.0               3.6                1.4               0.2
..                 ...               ...                ...               ...
145                6.7               3.0                5.2               2.3
146                6.3               2.5                5.0               1.9
147                6.5               3.0                5.2               2.0
148                6.2               3.4                5.4               2.3
149                5.9               3.0                5.1               1.8

[150 rows x 4 columns]
sepal length (cm)    5.843333
sepal widt

The features have different ranges, so they should be rescaled to a common scale.

In [5]:
from sklearn.preprocessing import StandardScaler

# define scaler
scaler = StandardScaler()

# fit the scaler on the DataFrame
scaler.fit(pd_data)

# transform data
iris_scaled = scaler.transform(pd_data)
result = pd.DataFrame(data=iris_scaled, columns=raw_data.feature_names)

# mean and var
print(result.mean())
print(result.var())

sepal length (cm)   -1.690315e-15
sepal width (cm)    -1.842970e-15
petal length (cm)   -1.698641e-15
petal width (cm)    -1.409243e-15
dtype: float64
sepal length (cm)    1.006711
sepal width (cm)     1.006711
petal length (cm)    1.006711
petal width (cm)     1.006711
dtype: float64
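
The variance printed above is 1.006711 rather than exactly 1. The explanation sketched here is that StandardScaler divides by the population standard deviation (ddof=0), while pandas `var()` defaults to the sample variance (ddof=1), so the result is larger by a factor of 150/149. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=iris.feature_names)

n = len(scaled)
sample_var = scaled.var()            # pandas default: ddof=1
population_var = scaled.var(ddof=0)  # same convention as StandardScaler

print(sample_var)       # close to n / (n - 1)
print(population_var)   # close to 1
print(n / (n - 1))

# Standardizing by hand with the population standard deviation
# reproduces the scaler's output
manual = (df - df.mean()) / df.std(ddof=0)
print(np.allclose(manual.values, scaled.values))
```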


## Feature Scaling: MinMaxScaler

In [6]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(pd_data)
iris_mmscaled = scaler.transform(pd_data)
result = pd.DataFrame(data=iris_mmscaled, columns=raw_data.feature_names)
print(result)

# mean and var
print(result.mean())
print(result.var())

# min and max
print(result.min())
print(result.max())

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0             0.222222          0.625000           0.067797          0.041667
1             0.166667          0.416667           0.067797          0.041667
2             0.111111          0.500000           0.050847          0.041667
3             0.083333          0.458333           0.084746          0.041667
4             0.194444          0.666667           0.067797          0.041667
..                 ...               ...                ...               ...
145           0.666667          0.416667           0.711864          0.916667
146           0.555556          0.208333           0.677966          0.750000
147           0.611111          0.416667           0.711864          0.791667
148           0.527778          0.583333           0.745763          0.916667
149           0.444444          0.416667           0.694915          0.708333

[150 rows x 4 columns]
sepal length (cm)    0.428704
sepal widt
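
With its default settings, MinMaxScaler should correspond to subtracting each column's minimum and dividing by the column's range. The sketch below recomputes that by hand and compares it with the scaler. The variable names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Scale with scikit-learn
scaled = MinMaxScaler().fit_transform(df)

# Scale by hand: (x - min) / (max - min), column by column
manual = (df - df.min()) / (df.max() - df.min())

print(np.allclose(manual.values, scaled))
print(manual.min().tolist())
print(manual.max().tolist())
```

Every column ends up with a minimum of 0 and a maximum of 1, which is what the `min()` and `max()` calls in the cell above are checking.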

### Use fit_transform()
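
`fit_transform()` fits the scaler and transforms the same data in one call. As a quick sanity check, the sketch below compares it with the two-step `fit()` then `transform()` form used earlier. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data

# Two steps: learn min/max, then rescale
two_step = MinMaxScaler().fit(X).transform(X)

# One step: both at once on the same data
one_step = MinMaxScaler().fit_transform(X)

same = np.allclose(two_step, one_step)
print(same)
```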


In [10]:
data_scaled = pd.DataFrame(data=scaler.fit_transform(pd_data), columns=raw_data.feature_names)
data_scaled

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,0.222222,0.625000,0.067797,0.041667
1,0.166667,0.416667,0.067797,0.041667
2,0.111111,0.500000,0.050847,0.041667
3,0.083333,0.458333,0.084746,0.041667
4,0.194444,0.666667,0.067797,0.041667
...,...,...,...,...
145,0.666667,0.416667,0.711864,0.916667
146,0.555556,0.208333,0.677966,0.750000
147,0.611111,0.416667,0.711864,0.791667
148,0.527778,0.583333,0.745763,0.916667
