# Data Preprocessing

# 1. Importing the required Libraries

These two libraries are essential and are imported every time.  
* NumPy is a library that contains mathematical functions.  
* Pandas is the library used to import and manage data sets.

In [1]:
import numpy as np
import pandas as pd

# 2. Importing the Data Set

Data sets are generally available in .csv format. A CSV file stores tabular data in plain text, and each line of the file is a data record. We use the read_csv method of the pandas library to read a local CSV file into a dataframe. We then extract a matrix of independent variables and a vector of the dependent variable from the dataframe.
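
As a minimal sketch of this step, the following reads a few rows from an in-memory string instead of a file, then separates the feature matrix from the target vector. The `csv_text` variable and its three rows are illustrative stand-ins for `Data.csv`, not the real file.

```python
import io
import pandas as pd

# Illustrative stand-in for a CSV file on disk (assumption: same columns as Data.csv)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,40,,Yes
"""

df = pd.read_csv(io.StringIO(csv_text))

# Every column except the last holds the independent variables
features = df.iloc[:, :-1].values
# The last column holds the dependent variable
target = df.iloc[:, -1].values

print(features.shape)  # (3, 3)
print(target)          # ['No' 'Yes' 'Yes']
```

The empty Salary field in the last row is read as NaN, which is the kind of missing value handled in the next step.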

In [2]:
dataset  = pd.read_csv('./datasets/Data.csv')
dataset.head()

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes


In [3]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
Country      10 non-null object
Age          9 non-null float64
Salary       9 non-null float64
Purchased    10 non-null object
dtypes: float64(2), object(2)
memory usage: 400.0+ bytes


In [4]:
dataset

,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [5]:
X = dataset.iloc[:, : -1].values
X[:10]
# After .values, X is a NumPy ndarray rather than a DataFrame, so .head() is unavailable; slice by position instead

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [6]:
type(X)

numpy.ndarray

In [7]:
Y = dataset.iloc[:, 3].values
Y[:10]

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [8]:
type(Y)

numpy.ndarray

# 3. Handling the missing data

Real-world data is rarely complete. Values can be missing for various reasons, and they need to be handled so that they do not reduce the performance of our machine learning model. We can replace missing values with the mean or median of the entire column. This task is handled by an imputer class from scikit-learn: Imputer in sklearn.preprocessing for versions before 0.22, and SimpleImputer in sklearn.impute from version 0.20 onward.
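
To make the idea concrete, here is a rough NumPy-only sketch of mean imputation. It is not how scikit-learn implements its imputer, and the `values` matrix is made-up data; it only illustrates replacing each NaN with the mean of its column.

```python
import numpy as np

# Small numeric matrix with one missing value per column (illustrative data)
values = np.array([[44.0, 72000.0],
                   [27.0, np.nan],
                   [np.nan, 54000.0]])

# Column means computed while ignoring NaN entries
col_means = np.nanmean(values, axis=0)

# Replace each NaN with the mean of its own column
filled = np.where(np.isnan(values), col_means, values)

print(col_means)
print(filled)
```

Swapping `np.nanmean` for `np.nanmedian` would give median imputation instead.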

In [9]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [10]:
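# Imputer was removed in scikit-learn 0.22; SimpleImputer in sklearn.impute replaces it
from sklearn.impute import SimpleImputer

In [11]:
# SimpleImputer always works column-wise, so the old axis = 0 argument is not needed
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')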



In [12]:
imputer = imputer.fit(X[:, 1:3])

In [13]:
X[:, 1:3] = imputer.transform(X[:, 1:3])

In [14]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

# 4. Encoding categorical data

Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set. Values such as 'Yes' and 'No' cannot be used in the mathematical equations of a model, so we need to encode these variables as numbers. To achieve this we import the LabelEncoder class from the sklearn.preprocessing library.
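
As a small illustration of what label encoding produces, the sketch below maps each distinct label to an integer using plain NumPy. This is only an approximation of the idea behind LabelEncoder, and the `answers` array is made-up data.

```python
import numpy as np

answers = np.array(['No', 'Yes', 'No', 'No', 'Yes'])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the index of each original value within that sorted list
classes, codes = np.unique(answers, return_inverse=True)

print(classes)  # ['No' 'Yes']
print(codes)    # [0 1 0 0 1]
```

Indexing `classes[codes]` recovers the original labels, so the mapping is reversible.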

In [15]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [16]:
labelencoder_X = LabelEncoder()

In [17]:
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

In [18]:
X[:, 0]

array([0, 2, 1, 2, 1, 0, 2, 0, 1, 0], dtype=object)

### Creating a dummy variable
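
Integer codes such as 0, 1, 2 suggest an order between countries that does not exist. Dummy (one-hot) variables avoid this by giving each category its own 0/1 column. A minimal NumPy sketch of the idea, with illustrative codes rather than the notebook's data:

```python
import numpy as np

# Integer codes such as those produced by label encoding: 0=France, 1=Germany, 2=Spain
country_codes = np.array([0, 2, 1, 2])

n_categories = country_codes.max() + 1

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(n_categories)[country_codes]

print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its category.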

In [19]:
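# categorical_features was removed in scikit-learn 0.22; ColumnTransformer selects column 0 instead
from sklearn.compose import ColumnTransformer
onehotencoder = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)

In [20]:
X = np.asarray(onehotencoder.fit_transform(X), dtype = float)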

Note: OneHotEncoder in scikit-learn 0.20 and later accepts string categories directly, so the LabelEncoder step on X is optional before one-hot encoding.


In [21]:
X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

In [22]:
labelencoder_Y = LabelEncoder()

In [23]:
Y = labelencoder_Y.fit_transform(Y)

In [24]:
Y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# 5. Splitting the datasets into training sets and test sets

We make two partitions of the dataset: a training set for training the model and a test set for testing the performance of the trained model. The split is generally 80/20. We import the train_test_split() function from sklearn.model_selection (sklearn.cross_validation in older versions).
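
The idea behind an 80/20 split can be sketched with NumPy alone: shuffle the row indices, then cut them at the 80% mark. This is a simplified illustration, not the internals of train_test_split(), and `data` and `labels` are made-up arrays.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the shuffle is repeatable

n_samples = 10
data = np.arange(n_samples * 2).reshape(n_samples, 2)  # illustrative feature matrix
labels = np.arange(n_samples) % 2                      # illustrative target vector

# Shuffle the row indices, then reserve 20% of them for the test set
indices = rng.permutation(n_samples)
n_test = int(round(n_samples * 0.2))
test_idx, train_idx = indices[:n_test], indices[n_test:]

data_train, data_test = data[train_idx], data[test_idx]
labels_train, labels_test = labels[train_idx], labels[test_idx]

print(data_train.shape, data_test.shape)  # (8, 2) (2, 2)
```

Using the same index arrays for features and labels keeps each row paired with its own target value.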

In [25]:
from sklearn.model_selection import train_test_split

# The original code used from sklearn.cross_validation import train_test_split; newer versions of sklearn moved it to model_selection

In [26]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

In [27]:
X_train.shape

(8, 5)

In [28]:
X_train

array([[0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04]])

In [29]:
X_test.shape

(2, 5)

In [30]:
X_test

array([[0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04]])

# 6. Feature Scaling

Many machine learning algorithms use the Euclidean distance between two data points in their computations, so features that vary widely in magnitude, units, and range pose problems: high-magnitude features weigh more in the distance calculation than low-magnitude features. This is addressed by feature standardization, also called Z-score normalization. We import StandardScaler from sklearn.preprocessing.
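
The effect can be seen in a short NumPy sketch. It applies the Z-score formula (subtract the column mean, divide by the column standard deviation) to made-up age and salary values and compares the Euclidean distance before and after scaling.

```python
import numpy as np

# Two features on very different scales: age in years, salary in currency units
samples = np.array([[27.0, 48000.0],
                    [38.0, 61000.0],
                    [50.0, 83000.0]])

# Euclidean distance between the first two rows is dominated by salary
raw_distance = np.linalg.norm(samples[0] - samples[1])

# Z-score standardization: subtract the column mean, divide by the column standard deviation
mean = samples.mean(axis=0)
std = samples.std(axis=0)  # population standard deviation (ddof=0)
scaled = (samples - mean) / std

scaled_distance = np.linalg.norm(scaled[0] - scaled[1])

print(raw_distance)
print(scaled.mean(axis=0), scaled.std(axis=0))
print(scaled_distance)
```

After scaling, every column has mean 0 and standard deviation 1, so age and salary contribute to the distance on comparable terms.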

In [31]:
from sklearn.preprocessing import StandardScaler

In [32]:
sc_X = StandardScaler()

In [33]:
X_train = sc_X.fit_transform(X_train)

In [34]:
X_train

array([[-1.        ,  2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [ 1.        , -0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-1.        , -0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-1.        , -0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [ 1.        , -0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-1.        , -0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [ 1.        , -0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [ 1.        , -0.37796447, -0.77459667, -0.59788085, -0.48214934]])

In [35]:
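# Reuse the mean and standard deviation fitted on X_train; fit_transform would refit the scaler on the two test rows
X_test = sc_X.transform(X_test)

In [36]:
X_test.round(2)

array([[-1.  ,  2.65, -0.77, -1.46, -0.9 ],
       [-1.  ,  2.65, -0.77,  1.98,  2.14]])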
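
Scaling the test set deserves care: the scaler should reuse the mean and standard deviation learned from the training set. The sketch below, using made-up single-column data, contrasts `transform` with a fresh `fit_transform` on two test rows. Refitting on two distinct values always maps them to -1 and 1, which discards their real position relative to the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[40.0], [37.0], [27.0], [48.0]])  # illustrative ages
test = np.array([[30.0], [50.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training rows

# Reusing the training statistics keeps both sets on the same scale
test_scaled = scaler.transform(test)

# Refitting on two rows: each value sits exactly one standard deviation from their midpoint
test_refit = StandardScaler().fit_transform(test)

print(test_scaled.ravel())
print(test_refit.ravel())  # [-1.  1.]
```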