# Machine Learning A-Z™: Hands-On Python & R In Data Science

## Part 1 - Data Preprocessing

#### Importing libraries

Libraries are imported in Python with the `import` statement. By default the library is loaded into a namespace of the same name; an alias can be assigned with `as`. It is possible to import the whole library or only parts of it.


In [61]:
import numpy # Importing the whole numpy library
import numpy as np # Importing numpy in the namespace np
from numpy import array, arange # Importing only array and arange from numpy

import numpy as np
import matplotlib.pyplot as plt


#### Importing Datasets

There is a very popular library called `pandas` for tabular data in Python. Several data formats can be read into a `pandas.Series` or `pandas.DataFrame`.

In [62]:
import pandas as pd
dataset = pd.read_csv('Data.csv') # Reading Data.csv
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
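
Before preprocessing, it can help to check the column types and count the missing values. The following is a minimal sketch, assuming the table above is recreated in memory with `io.StringIO`; the inline text and the names `csv_text` and `missing_per_column` are illustrative and not part of the original notebook.

```python
import io
import pandas as pd

# Inline copy of the table shown above (assumption: Data.csv holds exactly these rows)
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# read_csv accepts file-like objects as well as file names
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)    # (number of rows, number of columns)
print(dataset.dtypes)   # Age and Salary become float64 because they contain gaps

# Count the missing values in every column
missing_per_column = dataset.isna().sum()
print(missing_per_column)
```

Empty CSV fields are read as `NaN`, which is why the two incomplete columns are stored as floating point numbers.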


Depending on the type of machine learning algorithm (supervised or unsupervised), the dataset has to be separated into the independent variables (_features_) and, if supervised learning is used, the dependent variables (_labels_).

In this example the _features_ are the columns __Country, Age__ and __Salary__ and __Purchased__ is the _label_.

In [63]:
X = dataset.iloc[:, :-1] # Country, Age and Salary
y = dataset.iloc[:, 3] # Purchased
X

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,
5,France,35.0,58000.0
6,Spain,,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0


In [64]:
y

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object
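
Selecting by position with `iloc` breaks silently if the column order changes. As an alternative, the split can be written with column names. This is only a sketch on a three-row stand-in for the dataset, assuming the same column names as in `Data.csv`.

```python
import pandas as pd

# Small stand-in for the dataset (assumption: same column names as Data.csv)
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X = dataset.drop(columns="Purchased")  # every column except the label
y = dataset["Purchased"]               # the label as a pandas Series

print(list(X.columns))
print(y.name)
```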

#### Missing Data

There are two common ways to handle missing data in a dataset:

 1. **Removing rows** that contain missing data in one or more columns. However, this is not very useful if we only have a small number of rows or if many columns contain missing data.
 2. **Filling in missing data** with a statistic such as the **mean, median or most frequent value** of the column, computed from the other rows.
 
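The second option can be done with the `Imputer` class from `sklearn.preprocessing`. Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22; its replacement is `SimpleImputer` from `sklearn.impute`.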

In [65]:
from sklearn.preprocessing import Imputer

# Taking care of missing data

imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
imputer = imputer.fit(X.iloc[:, 1:3]) # Only columns 2 and 3 (Age and Salary, index 1:3) contain missing data
X.iloc[:, 1:3] = imputer.transform(X.iloc[:, 1:3])
X

Unnamed: 0,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,63777.777778
5,France,35.0,58000.0
6,Spain,38.777778,52000.0
7,France,48.0,79000.0
8,Germany,50.0,83000.0
9,France,37.0,67000.0
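
On current scikit-learn releases, both strategies can be sketched as follows. This is an illustration that assumes `SimpleImputer` is available (scikit-learn 0.20 or later); the frame `X_num` is a hand-typed copy of the Age and Salary columns, not the notebook's own `X`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Numeric columns of the example dataset, including the two gaps
X_num = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# Option 1: drop every row that has at least one missing value
X_dropped = X_num.dropna()

# Option 2: replace each missing value with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_imputed = pd.DataFrame(imputer.fit_transform(X_num), columns=X_num.columns)

print(len(X_dropped))      # rows left after dropping
print(X_imputed.round(2))  # gaps filled with the column means
```

Dropping rows loses two of the ten observations here, while mean imputation keeps all rows and fills the gaps with the same values as in the output above.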


#### Categorical Data

Categorical data contains a fixed number of categories. Machine learning algorithms are based on mathematical equations and therefore can only handle *numbers*. This means that if the categorical data are *text*, we have to encode the categories by replacing them with *numbers*.

In `sklearn.preprocessing` there are several classes for encoding, like `LabelEncoder` and `OneHotEncoder`.

 - `LabelEncoder`: This encoder replaces every category (_text_) by a _number_.
 - `OneHotEncoder`: This encoder adds a dummy variable (column) for each category. In each of these columns the value is *1* for rows that belong to the category and *0* for all other rows.
 
As `LabelEncoder` replaces _text_ with _numbers_, this encoding on its own is not suitable if there is **no logical order** within the categories, because the numbers would suggest one. In that case, use `OneHotEncoder` in addition.

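_Since scikit-learn 0.20, `OneHotEncoder` can encode text categories directly and the new `OrdinalEncoder` provides the ordinal form. The `CategoricalEncoder` class that was planned during development was split into these two encoders before the release._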

In this dataset, **Country** in the _features_ and **Purchased** in the _labels_ are categorical data.

In [66]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encoding the Independent Variable

labelencoder_X = LabelEncoder()
X.iloc[:, 0] = labelencoder_X.fit_transform(X.iloc[:, 0])

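# No logical order within the categories? Use OneHotEncoder in addition.
# Note: categorical_features was removed in scikit-learn 0.22 and sparse was renamed
# to sparse_output in 1.2, so this call requires an older scikit-learn release.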
onehotencoder = OneHotEncoder(categorical_features=[0], sparse=False)
X = onehotencoder.fit_transform(X)

# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

pd.DataFrame(X)


Unnamed: 0,0,1,2,3,4
0,1.0,0.0,0.0,44.0,72000.0
1,0.0,0.0,1.0,27.0,48000.0
2,0.0,1.0,0.0,30.0,54000.0
3,0.0,0.0,1.0,38.0,61000.0
4,0.0,1.0,0.0,40.0,63777.777778
5,1.0,0.0,0.0,35.0,58000.0
6,0.0,0.0,1.0,38.777778,52000.0
7,1.0,0.0,0.0,48.0,79000.0
8,0.0,1.0,0.0,50.0,83000.0
9,1.0,0.0,0.0,37.0,67000.0
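
On newer scikit-learn releases the same encoding is usually written with `ColumnTransformer`. The following sketch assumes scikit-learn 0.20 or later and uses a four-row stand-in for the features and labels; the transformer name `"country"` is an arbitrary label chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Stand-in for the features and labels (assumption: same columns as Data.csv)
X = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})
y = pd.Series(["No", "Yes", "No", "No"], name="Purchased")

# One-hot encode Country and pass Age and Salary through unchanged.
# sparse_threshold=0 makes the transformer return a dense NumPy array.
ct = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X)

# The labels only need integer codes; classes are sorted, so No -> 0 and Yes -> 1
y_encoded = LabelEncoder().fit_transform(y)

print(X_encoded)
print(y_encoded)
```

The dummy columns come first, in the sorted order France, Germany, Spain, followed by Age and Salary. This matches the column layout of the output above.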


#### Splitting the Dataset

It is useful to split the dataset into a _training set_ and a _test set_ to check whether the machine learning model generalizes to data it has not seen during training.

If the model performance on the _training set_ is much better than the performance on the _test set_, the model did not generalize well. Instead, it learned the relationship between the features and the labels of the training set including the noise. This is typically called __Overfitting__.

The function `train_test_split` from `sklearn.model_selection` can be used for splitting datasets.

In [67]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
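
The gap between training and test performance can be made visible with a small experiment. This sketch is not part of the original notebook: it uses synthetic data and an unrestricted `DecisionTreeClassifier`, both chosen here only because they overfit easily.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration): 200 samples, 5 features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
# Only the first feature carries signal; the added noise makes the labels partly random
y = (X[:, 0] + rng.normal(scale=1.0, size=200) > 0).astype(int)

# 80 % of the rows go to the training set, 20 % to the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# A tree without depth limit can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The tree reproduces the training labels perfectly, but its accuracy on the held-out rows is lower, which is the signature of overfitting described above.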

#### Feature Scaling

Many machine learning models are based on the _Euclidean distance d_. The _Euclidean distance_ between point 1 with coordinates $(x_1, y_1)$ and point 2 with coordinates $(x_2, y_2)$ is:

$$d = \sqrt{(x_2-x_1)^2+(y_2-y_1)^2}$$

This means that if the features are not on the same scale, _d_ is dominated by the feature with the largest range. Therefore, the _features_ should be scaled before training a machine learning model.
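
A short calculation illustrates the effect. It takes Age and Salary of the first two rows of the dataset as the two points; the variable names are chosen for this sketch only.

```python
import numpy as np

# Two customers described by (Age, Salary), taken from the first two rows
a = np.array([44.0, 72000.0])
b = np.array([27.0, 48000.0])

d = np.sqrt(np.sum((b - a) ** 2))  # Euclidean distance on the raw values
age_gap = abs(b[0] - a[0])         # difference in Age
salary_gap = abs(b[1] - a[1])      # difference in Salary

print(d, age_gap, salary_gap)
```

The distance is practically equal to the salary difference: an age gap of 17 years changes it by less than 0.01, so Age has almost no influence on _d_ without scaling.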

There are several ways of scaling the data. The two most common are:
 - Standardization: The mean of the feature is subtracted from every observation, and the result is divided by the standard deviation of the feature.
 
$$x_{stand} = \frac{x-\bar{x}}{\sigma(x)}$$


 - Normalization: The minimum of the feature is subtracted from the observation, and the result is divided by the difference between the maximum and the minimum of the feature.
 
$$x_{norm} = \frac{x-x_{min}}{x_{max}-x_{min}}$$
 
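In `sklearn.preprocessing` there are many classes for feature scaling, e.g. `StandardScaler` for standardization and `MinMaxScaler` for the normalization defined above. Note that `Normalizer` does something different: it rescales each sample (row) to unit norm.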

In [71]:
from sklearn.preprocessing import StandardScaler, Normalizer

sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train) # The scaler is fitted to the training set, then the training set is transformed
X_test = sc_X.transform(X_test) # The test set is transformed with the scaler fitted to the training set
sc_y = StandardScaler()
# The labels are fitted and transformed using another scaler. Scaling the target is only
# meaningful for regression; for a class label such as Purchased it is normally skipped.
y_train = sc_y.fit_transform(y_train.reshape(-1, 1))
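
The two scaling formulas can be checked by hand against scikit-learn. The sketch below assumes that `MinMaxScaler` is the class matching the normalization formula and uses Age and Salary of the first four rows as sample data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and Salary of the first four rows of the dataset
X = np.array([[44.0, 72000.0],
              [27.0, 48000.0],
              [30.0, 54000.0],
              [38.0, 61000.0]])

# Standardization: subtract the column mean, divide by the column standard deviation
X_stand = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: map every column onto the interval [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Compare the manual results with the scikit-learn scalers
same_stand = np.allclose(X_stand, StandardScaler().fit_transform(X))
same_norm = np.allclose(X_norm, MinMaxScaler().fit_transform(X))
print(same_stand, same_norm)
```

`StandardScaler` uses the population standard deviation, which is also the default of `numpy.std`, so both comparisons agree.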

In [72]:
'''Trains a simple convnet on the MNIST dataset.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.
'''

from __future__ import print_function
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K

batch_size = 128
num_classes = 10
epochs = 12

# input image dimensions
img_rows, img_cols = 28, 28

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)
else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
Epoch 2/12
Epoch 3/12
Epoch 4/12
Epoch 5/12
Epoch 6/12
Epoch 7/12
Epoch 8/12
Epoch 9/12
Epoch 10/12
Epoch 11/12
Epoch 12/12
Test loss: 0.025606131626059143
Test accuracy: 0.9918
