# Machine Learning A-Z™: Hands-On Python & R In Data Science

## Part 1 - Data Preprocessing

#### Importing libraries

Some packages are already loaded by default in RStudio. Any other package has to be loaded with the `library` function before it can be used.

In [41]:
library(ggplot2) # e.g. Importing ggplot2
library(caTools)

#### Importing the dataset

_CSV files_ can be read with the function `read.csv`.

In [46]:
dataset <- read.csv("Data.csv")
dataset

Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,,Yes
France,35.0,58000.0,Yes
Spain,,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes
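For readers who do not have `Data.csv` at hand, the following is a minimal sketch that rebuilds the table above from an inline string and counts the missing values per column. It relies on the `text` argument of `read.csv`; the names `csv_text` and `missing_per_column` are made up for this illustration.

```r
# Inline copy of the table above; `csv_text` is a hypothetical name used for illustration
csv_text <- "Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,,Yes
France,35.0,58000.0,Yes
Spain,,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes"

# read.csv accepts a string through the `text` argument instead of a file path
dataset <- read.csv(text = csv_text)

# Empty fields in numeric columns are read as NA; count them per column
missing_per_column <- colSums(is.na(dataset))
print(missing_per_column)
```

Checking the `NA` counts right after loading is a cheap way to see which columns need attention before any further preprocessing.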


#### Missing Data

There are two common ways to handle missing data in the dataset:

 1. **Removing rows** that have missing data in one or more columns. However, this is not very useful if we only have a small number of rows or if many columns contain missing data.
 2. Filling missing data with the **mean, minimum, maximum ...** of the other rows.
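As a rough illustration of the first option, the sketch below drops incomplete rows with `na.omit` and reports how much of the data is lost. The data frame `toy` and its values are invented for this example and are not part of the course dataset.

```r
# Small stand-in data frame with gaps; `toy` is a hypothetical name used for illustration
toy <- data.frame(
  Age    = c(44, 27, NA, 38),
  Salary = c(72000, NA, 54000, 61000)
)

# Option 1: drop every row that has at least one NA
complete_rows <- na.omit(toy)

# Share of rows lost by dropping
dropped_share <- 1 - nrow(complete_rows) / nrow(toy)
print(complete_rows)
print(dropped_share)
```

With only two missing values spread over two rows, half of this tiny data frame disappears, which is why row removal is usually reserved for larger datasets.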
 
Here we use the function `ifelse` to check whether a value is missing (*NA*). If so, we replace it with the column **mean**, which we calculate with the function `ave`.

In [47]:
# Replace NA values in Age and Salary with the mean of the non-missing values
dataset$Age <- ifelse(is.na(dataset$Age),
    ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
    dataset$Age)
dataset$Salary <- ifelse(is.na(dataset$Salary),
    ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
    dataset$Salary)

dataset

Country,Age,Salary,Purchased
France,44.0,72000.0,No
Spain,27.0,48000.0,Yes
Germany,30.0,54000.0,No
Spain,38.0,61000.0,No
Germany,40.0,63777.78,Yes
France,35.0,58000.0,Yes
Spain,38.77778,52000.0,No
France,48.0,79000.0,Yes
Germany,50.0,83000.0,No
France,37.0,67000.0,Yes
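The imputed age of 38.77778 can be checked by hand: the nine known ages sum to 349, and 349 / 9 ≈ 38.77778. The sketch below repeats the `ifelse`/`ave` approach on a plain vector and compares it with a more direct form that overwrites only the `NA` positions. The vector names are made up for this illustration, and the direct form is an alternative, not the method used in the cell above.

```r
# Age column from the original table, with the missing value in position 7
age <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)

# Approach from the cell above: without grouping variables, ave() applies FUN
# to the whole vector and returns the result recycled to the original length
via_ave <- ifelse(is.na(age),
                  ave(age, FUN = function(x) mean(x, na.rm = TRUE)),
                  age)

# Alternative direct form: overwrite only the NA positions
direct <- age
direct[is.na(direct)] <- mean(age, na.rm = TRUE)

print(all.equal(via_ave, direct))
print(direct[7])
```

Both forms give the same vector here. `ave` becomes more useful once grouping variables are passed, for example to fill gaps with a per-country mean.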


#### Categorical Data

Categorical data contains a fixed number of categories. Machine learning algorithms are based on mathematical equations and therefore can only handle *numbers*. This means that if the categorical data are *text*, we have to encode the categories by replacing them with *numbers*.

With the function `factor` we can encode our _text_ categories as _numbers_: each category listed in `levels` is replaced by the corresponding entry in `labels`.

In this dataset, **Country** and **Purchased** are categorical and will be encoded.

In [48]:
# Encoding categorical data
dataset$Country <- factor(dataset$Country,
    levels = c("France", "Spain", "Germany"),
    labels = c(1, 2, 3))
dataset$Purchased <- factor(dataset$Purchased,
    levels = c("No", "Yes"),
    labels = c(0, 1))
dataset

Country,Age,Salary,Purchased
1,44.0,72000.0,0
2,27.0,48000.0,1
3,30.0,54000.0,0
2,38.0,61000.0,0
3,40.0,63777.78,1
1,35.0,58000.0,1
2,38.77778,52000.0,0
1,48.0,79000.0,1
3,50.0,83000.0,0
1,37.0,67000.0,1
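One detail worth knowing about the encoded columns: a factor stores its labels as character strings and keeps separate integer codes underneath, and the two do not have to agree. The sketch below shows this for a short stand-in of the **Purchased** column. The vector names are hypothetical.

```r
# Short stand-in for the Purchased column; names are hypothetical
purchased <- c("No", "Yes", "No")

# Same call shape as in the cell above
encoded <- factor(purchased, levels = c("No", "Yes"), labels = c(0, 1))

# The labels are stored as character strings
print(levels(encoded))

# The internal integer codes start at 1, so they differ from the labels 0 and 1
print(as.integer(encoded))

# To get the label values as numbers, convert through character first
label_values <- as.numeric(as.character(encoded))
print(label_values)
```

Calling `as.numeric` directly on the factor would return the internal codes rather than the labels, so the detour through `as.character` matters whenever the label values themselves are needed as numbers.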
