# Notebook for the exercises from the Machine Learning A-Z™ course
### Hands-On Python & R In Data Science
https://www.udemy.com/machinelearning/

**Part 1** - Data Preprocessing  
**Part 2** - Regression: Simple Linear Regression, Multiple Linear Regression, Polynomial Regression, SVR, Decision Tree Regression, Random Forest Regression  
**Part 3** - Classification: Logistic Regression, K-NN, SVM, Kernel SVM, Naive Bayes, Decision Tree Classification, Random Forest Classification  
**Part 4** - Clustering: K-Means, Hierarchical Clustering  
**Part 5** - Association Rule Learning: Apriori, Eclat  
**Part 6** - Reinforcement Learning: Upper Confidence Bound, Thompson Sampling  
**Part 7** - Natural Language Processing: Bag-of-words model and algorithms for NLP  
**Part 8** - Deep Learning: Artificial Neural Networks, Convolutional Neural Networks  
**Part 9** - Dimensionality Reduction: PCA, LDA, Kernel PCA  
**Part 10** - Model Selection & Boosting: k-fold Cross Validation, Parameter Tuning, Grid Search, XGBoost  

## Section 1 - Welcome to the course!
### Class 1 - Applications for Machine Learning

1. <strong>Facebook facial recognition</strong>: an algorithm tags people automatically
2. <strong>Kinect</strong>: you can play games (uses Random Forest)
3. <strong>Virtual reality headsets</strong>: ML monitors your actions; when you turn your head, the picture moves
4. <strong>Voice recognition or speech-to-text</strong>
5. <strong>Robodog</strong>: with reinforcement learning, the robot dogs learn how to walk on their own
6. <strong>Ads</strong>: Facebook
7. <strong>Recommender systems</strong>: Amazon, Netflix
8. <strong>Medicine</strong>: to save lives
9. <strong>Space</strong>: to recognize certain areas of the world
10. <strong>Mars</strong>: to explore new territories

# Part 1 - Data Pre-processing

Data preprocessing is the **preparation of the dataset** for any machine learning model.  

**It is a crucial step in the journey of making an ML model** - without data preprocessing, the model won't work properly.

In [81]:
# Data Preprocessing Template
# Importing the essential libraries

# NumPy contains mathematical tools
import numpy as np 
# Library to plot nice charts
import matplotlib.pyplot as plt 
# Library to import and manage datasets
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Data.csv')
# X is matrix of the independent variables (country, age, and salary) -- all lines, all columns except the last one
X = dataset.iloc[:, :-1].values
# y is the dependent variable vector (purchased) -- take only the last column (index 3)
y = dataset.iloc[:, 3].values


In [82]:
dataset

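| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |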


In [83]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [84]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

## Missing data

First option: remove the observations that have missing data? Dangerous, because we may throw away valuable information.  
**Better option: replace each missing value with the mean of its column!** Other strategies, such as the median or the most frequent value, can also be used.

In [85]:
# Taking care of missing data
# scikit-learn (sklearn) is an ML library; its preprocessing tools include an imputer
from sklearn.preprocessing.imputation import Imputer
# Replace each missing value ('NaN') with the mean of its column (axis=0)
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

# Fitting the imputer to the numeric columns (Age and Salary) of our matrix of features X
imputer = imputer.fit(X[:,1:3])
# Applying the transform method to replace the missing data
X[:,1:3] = imputer.transform(X[:,1:3])

X



array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
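
The `Imputer` class used above belongs to older scikit-learn releases. As a minimal sketch, assuming a newer release where `sklearn.impute.SimpleImputer` is available, the same mean imputation could look like this. The array `numeric` is simply the Age and Salary columns of the dataset retyped by hand, and the variable names are illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns of the dataset above, including the two missing values
numeric = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
    [40.0, np.nan],
    [35.0, 58000.0],
    [np.nan, 52000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [37.0, 67000.0],
])

# Replace every NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(numeric)

print(filled[4, 1])  # imputed salary for row 4
print(filled[6, 0])  # imputed age for row 6
```

The two imputed entries should agree with the values shown in the output above (about 63777.78 for the salary and 38.78 for the age), since both approaches use the column mean.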

## Categorical data 

We need to encode the **text values into numbers**

In [87]:
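# The overall strategy is to convert each category value into a new column and assign a 1 or 0 (True/False) value to the column.
# As a first step, LabelEncoder turns each category into an integer code.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encoding the two categorical variables, Country and Purchased, starting with Country
labelEncoder_X = LabelEncoder() 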
# Applying the label encoder object on our column
# and replacing the first column of our matrix X
X[:,0] = labelEncoder_X.fit_transform(X[:,0])

X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

Is Spain greater than France, and Germany greater than Spain? No!!   
There is no relational order between the 3 categories!
But with integer codes, the model thinks there is.

Solution? To avoid expressing any order in this variable, use **dummy variables**!

We'll have three columns -- one for each country

<img src="images/dummy_encoding.png">

In [88]:
onehotencoder = OneHotEncoder(categorical_features=[0])
# Fitting the object to our matrix X
X = onehotencoder.fit_transform(X).toarray()

X


array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])
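
The `categorical_features` argument used above comes from older scikit-learn releases. As a hedged sketch, assuming a newer release, the same dummy encoding can be expressed with `ColumnTransformer`, which applies `OneHotEncoder` to the Country column only and passes the other columns through. The small DataFrame `features` holds the first four rows of the dataset retyped by hand, and the names `features`, `transformer` and `encoded` are illustrative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# First four rows of the feature matrix, as a DataFrame
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, 30.0, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0],
})

# One-hot encode only the Country column; keep Age and Salary unchanged.
# sparse_threshold=0 asks for a dense NumPy array as the result.
transformer = ColumnTransformer(
    transformers=[("country", OneHotEncoder(), ["Country"])],
    remainder="passthrough",
    sparse_threshold=0,
)
encoded = transformer.fit_transform(features)

print(encoded)
```

The categories are sorted alphabetically, so the three dummy columns correspond to France, Germany and Spain, in the same order as in the output above. No separate `LabelEncoder` step is needed here because the encoder accepts the string values directly.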

In [89]:
# Encoding the dependent variable; no dummy variables are needed because the model treats it as a category label, so no order is implied

labelEncoder_y = LabelEncoder() 
y = labelEncoder_y.fit_transform(y)

y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

## Splitting the dataset into training and test set

You need to evaluate the model on a different set from the one on which you built it.

(1) **Training set** - the set on which we build the machine learning model, so that it learns the correlations  
(2) **Test set** - the set on which we test the performance of the model

The **performance on the training set and on the test set should not differ much** - that means the model has learned correlations that carry over to new situations.

The better the model learns the underlying correlations in the training set, the better it will predict the results in the test set.

If the model instead **learns by heart** (memorizes) the training set, it will not be able to predict the results in the test set. This is known as overfitting.

In [90]:
# Importing the sklearn library
from sklearn.model_selection import train_test_split 

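# 20% of the observations go to the test set; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)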
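
As a quick check of what `train_test_split` returns, here is a minimal sketch on made-up data with the same number of observations as the dataset (10 rows). The arrays `X_demo` and `y_demo` and the `_demo` variable names are illustrative and are not part of the notebook. The function returns four pieces in a fixed order: training features, test features, training labels, test labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 observations with 2 features each, plus 10 labels
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# Order of the returned values: X train, X test, y train, y test
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_train_demo.shape, X_test_demo.shape)  # (8, 2) (2, 2)
print(len(y_train_demo), len(y_test_demo))    # 8 2
```

With `test_size=0.2` and 10 observations, 8 rows end up in the training set and 2 in the test set.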


## Feature scaling

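We need to put the variables on the same scale - in the same range. Otherwise a variable with large values (such as salary) would dominate a variable with small values (such as age).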

Ways of scaling features: 

<img src="images/feature_scaling.png">

Do we need to scale the dummy variables? It depends on the context!!  
**If we scale**, everything will be on the same scale -- good for predictions, but we lose interpretability (which observation belongs to which country?)  
**If we do not scale**, we keep interpretability




In [91]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
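
What the scaler does can be sketched by hand in NumPy. The example below assumes that the two common ways of scaling are standardisation (subtract the mean, divide by the standard deviation) and normalisation (rescale to the range 0 to 1); the arrays `train` and `test` are a few Age and Salary values picked from the dataset for illustration. The statistics are computed on the training rows only and then reused on the test rows, which mirrors calling `fit_transform` on the training set and `transform` on the test set.

```python
import numpy as np

# A few (Age, Salary) rows used for illustration
train = np.array([
    [27.0, 48000.0],
    [38.0, 61000.0],
    [44.0, 72000.0],
    [50.0, 83000.0],
])
test = np.array([[30.0, 54000.0]])

# Standardisation: (x - mean) / std, with statistics taken from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
train_standardised = (train - mean) / std
test_standardised = (test - mean) / std

# Normalisation: (x - min) / (max - min), again based on the training set
col_min = train.min(axis=0)
col_max = train.max(axis=0)
train_normalised = (train - col_min) / (col_max - col_min)

print(train_standardised)
print(test_standardised)
print(train_normalised)
```

After standardisation each training column has mean 0 and standard deviation 1; after normalisation each training column runs from 0 to 1. The test row is transformed with the training statistics so that no information from the test set leaks into the preprocessing.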

# Part 2 - Regression

Regression models are used for predicting a real value - like salary.

Most important ML Regression models: 

1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression
4. Support Vector Regression (SVR)
5. Decision Tree Regression
6. Random Forest Regression