# Importing the Dataset

In [1]:
path = '/Users/sun/Desktop/Data_Preprocessing/'
dataset = read.csv(file.path(path,'Data.csv'))

In [2]:
head(dataset)

Country,Age,Salary,Purchased
France,44,72000.0,No
Spain,27,48000.0,Yes
Germany,30,54000.0,No
Spain,38,61000.0,No
Germany,40,,Yes
France,35,58000.0,Yes
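
The fifth row above has an empty Salary field, which `read.csv` reads as `NA`. One way to count missing values per column is sketched below. The inline text simply copies the six rows shown, and the names `csv_text`, `preview`, and `na_counts` are illustrative, so the snippet runs without `Data.csv`.

```r
# Rebuild the six rows shown above from inline text (no file needed)
csv_text <- "Country,Age,Salary,Purchased
France,44,72000.0,No
Spain,27,48000.0,Yes
Germany,30,54000.0,No
Spain,38,61000.0,No
Germany,40,,Yes
France,35,58000.0,Yes"

preview <- read.csv(text = csv_text)

# is.na() marks every missing cell; colSums() counts them per column
na_counts <- colSums(is.na(preview))
print(na_counts)
```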


# Missing Data

Original variables, before imputation

In [3]:
dataset$Age

In [4]:
dataset$Salary

Replace each missing value with the column mean

In [5]:
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
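
To make the imputation step concrete, here is a minimal sketch on a single vector. The values are the six salaries shown in the preview, and the names `salary` and `salary_filled` are illustrative. It also shows that `ave()` without a grouping variable just repeats the overall mean for every position.

```r
salary <- c(72000, 48000, 54000, 61000, NA, 58000)

# Mean of the observed values only (NA is ignored)
salary_mean <- mean(salary, na.rm = TRUE)  # 58600

# ave() with no grouping returns that mean once per element
repeated_mean <- ave(salary, FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values, fill the gaps with the mean
salary_filled <- ifelse(is.na(salary), repeated_mean, salary)
print(salary_filled)
```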

After imputation, the columns no longer contain missing values

In [6]:
dataset$Age

In [7]:
dataset$Salary

# Categorical Data

Convert the categorical variables to factors

In [8]:
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))

In [9]:
dataset$Country

In [10]:
dataset$Purchased

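> [Note] <br>
> 1. In R, converting a variable to a factor lets text categories be relabelled with numeric codes, and the factor levels have no order relation (`factor()` creates an unordered factor by default).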
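
A small sketch of what the conversion produces, using a short illustrative vector (`country` and `country_f` are made-up names). Note that the labels `1, 2, 3` are stored as text labels of the levels, so the result is still a factor rather than a numeric vector.

```r
country <- c("France", "Spain", "Germany", "Spain")

country_f <- factor(country,
                    levels = c("France", "Spain", "Germany"),
                    labels = c(1, 2, 3))

levels(country_f)      # "1" "2" "3" (character labels)
is.ordered(country_f)  # FALSE: no order among the levels
is.numeric(country_f)  # FALSE: a factor is not numeric

# If real numbers are needed, convert via character first
country_num <- as.integer(as.character(country_f))
print(country_num)
```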

# Splitting the dataset into the Training set and Test set

In [11]:
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

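> [Note] <br>
> 1. This uses the `caTools` package. <br>
> 2. `sample.split(` <font color='red'> Y </font>`, SplitRatio = 0.8)` splits according to Y, keeping the class proportions of Y similar in both sets; `SplitRatio` is the proportion of the data assigned to the training set.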
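
To illustrate what "splitting according to Y" means, the following base R sketch draws the training rows separately within each class. It is not the `caTools` implementation: `stratified_split` is a hypothetical helper, and the `purchased` vector is made-up example data.

```r
# Hypothetical helper: returns TRUE for rows assigned to the training set
stratified_split <- function(y, ratio) {
  in_train <- rep(FALSE, length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)                    # rows belonging to this class
    n_train <- round(length(idx) * ratio)     # how many of them go to training
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

set.seed(123)
purchased <- c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes")
split <- stratified_split(purchased, ratio = 0.8)

# Each class contributes the same share of rows to the training set
print(table(purchased, split))
```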

# Feature Scaling

<img src='course_2.png' width='50%'>

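> [Note] <br>
> 1. Many machine learning methods are based on Euclidean distance, so variables on very different scales have a large effect: the variable with the larger scale dominates the distance. <br>
> e.g. Salary_A: $79000$, Salary_B: $48000$; squared Salary difference: $(79000-48000)^2=961000000$ <br>
> e.g. Age_A: $48$, Age_B: $27$; squared Age difference: $(48-27)^2=441$ <br>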
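
The arithmetic in the note can be reproduced directly. In this sketch, `a` and `b` are two illustrative observations holding the Age and Salary values from the example.

```r
a <- c(age = 48, salary = 79000)
b <- c(age = 27, salary = 48000)

# Squared difference per variable: age 441, salary 961000000
sq_diff <- (a - b)^2
print(sq_diff)

# Euclidean distance over both variables, and the share contributed by salary
euclid <- sqrt(sum(sq_diff))
salary_share <- sq_diff[["salary"]] / sum(sq_diff)
print(euclid)
print(salary_share)
```

On the raw scale, almost the entire distance comes from Salary, which is the motivation for scaling.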

|<h4>Standardisation</h4>|<h4>Normalisation</h4>|
|---|---|
|$X_{Std} = \frac{X - mean(X)}{std(X)}$|$X_{norm} = \frac{X - min(X)}{max(X) - min(X)}$|
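
Both formulas can be written out by hand in R. The sketch below uses an illustrative vector `x` and checks that the manual standardisation agrees with `scale()`, which uses the sample standard deviation (`sd`, with an n - 1 denominator).

```r
x <- c(27, 30, 35, 38, 44)

# Standardisation: subtract the mean, divide by the standard deviation
x_std <- (x - mean(x)) / sd(x)

# Normalisation: rescale to the interval [0, 1]
x_norm <- (x - min(x)) / (max(x) - min(x))

# scale() returns a one-column matrix, so flatten it before comparing
same_as_scale <- isTRUE(all.equal(x_std, as.vector(scale(x))))
print(same_as_scale)
print(range(x_norm))
```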

In [12]:
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
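
The cell above scales the test set with its own mean and standard deviation. A common alternative, sketched here as an option rather than as this notebook's method, is to reuse the training-set statistics so that both sets are on the same scale. The data frames `train` and `test` hold made-up values for illustration.

```r
train <- data.frame(Age = c(44, 27, 30, 38),
                    Salary = c(72000, 48000, 54000, 61000))
test <- data.frame(Age = c(35, 40),
                   Salary = c(58000, 52000))

# scale() stores the statistics it used as attributes of the result
train_scaled <- scale(train)
train_center <- attr(train_scaled, "scaled:center")
train_sd <- attr(train_scaled, "scaled:scale")

# Apply the training mean and sd to the test set
test_scaled <- scale(test, center = train_center, scale = train_sd)
print(test_scaled)
```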

In [13]:
head(training_set)

Unnamed: 0,Country,Age,Salary,Purchased
1,1,0.90101716,0.9392746,0
2,2,-1.58847494,-1.337116,1
3,3,-1.14915281,-0.7680183,0
4,2,0.02237289,-0.1040711,0
5,3,0.31525431,0.1594,1
7,2,0.13627122,-0.9577176,0


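> [Note] <br>
> 1. Do variables that have already been One-Hot Encoded need Scaling? <br>
>   (1) You can scale them, which may help the Model fit better because every variable is on the same Scale. <br>
>   (2) You can leave them unscaled, which preserves the original meaning of the variable and keeps it interpretable. <br>
> 2. Does the Dependent Variable need Scaling? Not here: the Dependent Variable is categorical, and the Model already treats it as a category. <br>
> 3. Do Models that are not based on Euclidean distance need Scaling? <br>
> e.g. Decision Trees do not strictly need it, because their splits depend only on the ordering of values within each variable; the faster convergence usually attributed to Scaling applies mainly to gradient-based optimisation. <br>
> 4. A factor is not numeric, so `scale` cannot be applied to it; this is why only columns 2 and 3 (Age and Salary) are scaled.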

# How to Set Up Working Directory

To set a working directory in RStudio: 
1. Use the "Files" tab in the lower right pane of the RStudio interface to navigate to your chosen directory, then go to Session -> Set Working Directory -> To Files Pane Location.
2. Alternatively, use the Choose Directory... option under Session -> Set Working Directory.
3. The small gear icon in the lower right pane offers the same option.
4. From the console: use `getwd()` to see the current working directory and `setwd()` to change it:

In [14]:
getwd()
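
A minimal sketch of changing the working directory from code and restoring it afterwards. It uses `tempdir()` as a stand-in target so that it runs on any machine; in practice you would pass your own project folder to `setwd()`.

```r
# Remember where we started
old_dir <- getwd()

# Switch to another directory (tempdir() is only a placeholder target)
setwd(tempdir())
print(getwd())

# Restore the original working directory
setwd(old_dir)
print(getwd())
```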