
# Exercise 5 Machine Learning models

The goal of this exercise is to have an overview of the existing Machine Learning models and to learn to call them from scikit learn.
We will focus on:

- SVM/ SVC
- Decision Tree
- Random Forest (Ensemble learning)
- Gradient Boosting (Ensemble learning, Boosting techniques)

All these algorithms exist in two versions: regression and classification. Even if the logic is similar in both classification and regression, the loss function is specific to each case.

It is really easy to get lost among all the existing algorithms. This article is very useful to have a clear overview of the models and to understand which algorithm use and when. https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9

Preliminary:

- Import California Housing data set and split it in a train set and a test set (10%). Fit a linear regression on the data set. *The goal is to focus on the metrics, that is why the code to fit the Linear Regression is given.*

```python
#imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
#data
housing = fetch_california_housing()
X, y = housing['data'], housing['target']
#split data train test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)
#pipeline 
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
#fit
pipe.fit(X_train, y_train)

```

1. Create 5 pipelines with 5 different models as final estimator (keep the imputer and scaler unchanged):
    1. Linear Regression
    2. SVM
    3. Decision Tree (set `random_state=43`)
    4. Random Forest (set `random_state=43`)
    5. Gradient Boosting  (set `random_state=43`)

Take time to have basic understanding of the role of the basic hyperparameter and their default value.

- For each algorithm, print the R2, MSE and MAE on both train set and test set.


In [28]:
#imports
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import tree, svm
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor

housing = fetch_california_housing()
X, y = housing['data'], housing['target']
#split data train test 
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.1,
                                                    shuffle=True,
                                                    random_state=43)
#pipeline 
pipeline = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('lr', LinearRegression())]
pipe = Pipeline(pipeline)
#fit
pipe.fit(X_train, y_train)

# predict
y_pred_train = pipe.predict(X_train) 
y_pred_test = pipe.predict(X_test)

r2_train = r2_score(y_pred_train, y_train)
mse_train = mean_squared_error(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)

print('r2 on the train set:', r2_train)
print('MAE on the train set:', mae_train)
print('MSE on the train set:', mse_train)

r2_test = r2_score(y_pred_test, y_test)
mse_test = mean_squared_error(y_test, y_pred_test)
mae_test = mean_absolute_error(y_test, y_pred_test)

print('----------------')
print('r2 on the test set:', r2_test)
print('MAE on the test set:', mae_test)
print('MSE on the test set:', mse_test)


r2 on the train set: 0.34823544284172525
MAE on the train set: 0.5330920012614552
MSE on the train set: 0.5273648371379568
----------------
r2 on the test set: 0.3551785428138904
MAE on the test set: 0.5196420310323714
MSE on the test set: 0.49761195027083815


In [26]:

print('------------svm----------')
# from sklearn import preprocessing

# lab_enc = preprocessing.LabelEncoder()
# y_encoded = lab_enc.fit_transform(y_train)

# pipeline 
pipeline_svc = [('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('svr', SVR())]

pipe_svc = Pipeline(pipeline_svc)
# fit
pipe_svc.fit(X_train, y_train)

y_pred_train_svm = pipe_svc.predict(X_train) 
y_pred_test_svm = pipe_svc.predict(X_test)

r2_train_svm = r2_score(y_pred_train_svm, y_train)
mse_train_svm = mean_squared_error(y_train, y_pred_train_svm)
mae_train_svm = mean_absolute_error(y_train, y_pred_train_svm)

print('r2 on the train set:', r2_train_svm)
print('MAE on the train set:', mae_train_svm)
print('MSE on the train set:', mse_train_svm)

r2_test_svm = r2_score(y_pred_test_svm, y_test)
mse_test_svm = mean_squared_error(y_test, y_pred_test_svm)
mae_test_svm = mean_absolute_error(y_test, y_pred_test_svm)

print('----------------')
print('r2 on the test set:', r2_test_svm)
print('MAE on the test set:', mae_test_svm)
print('MSE on the test set:', mse_test_svm)


------------svm----------
r2 on the train set: 0.6462366150965996
MAE on the train set: 0.3835645163325987
MSE on the train set: 0.3346447867133917
----------------
r2 on the test set: 0.6162644671183827
MAE on the test set: 0.3897680598426786
MSE on the test set: 0.3477101776543003


In [29]:

print('------------decision tree----------')

# pipeline
pipe_tree = DecisionTreeRegressor(random_state=43)
# fit
pipe_tree.fit(X_train, y_train)

y_pred_train_tree = pipe_tree.predict(X_train) 
y_pred_test_tree = pipe_tree.predict(X_test)

r2_train_tree = r2_score(y_pred_train_tree, y_train)
mse_train_tree = mean_squared_error(y_train, y_pred_train)
mae_train_tree = mean_absolute_error(y_train, y_pred_train)

print('r2 on the train set:', r2_train_tree)
print('MAE on the train set:', mae_train_tree)
print('MSE on the train set:', mse_train_tree)

r2_test_tree = r2_score(y_pred_test_tree, y_test)
mse_test_tree = mean_squared_error(y_test, y_pred_test_tree)
mae_test_tree = mean_absolute_error(y_test, y_pred_test_tree)

print('----------------')
print('r2 on the test set:', r2_test_tree)
print('MAE on the test set:', mae_test_tree)
print('MSE on the test set:', mse_test_tree)


------------decision tree----------
r2 on the train set: 1.0
MAE on the train set: 0.5330920012614552
MSE on the train set: 0.5273648371379568
----------------
r2 on the test set: 0.6107615716040615
MAE on the test set: 0.43781382267441854
MSE on the test set: 0.48602743476128873


In [34]:
print('-------------random forest------------')
pipe_rand = RandomForestRegressor(random_state=43)
# fit
pipe_rand.fit(X_train, y_train)

y_pred_train_rand = pipe_rand.predict(X_train) 
y_pred_test_rand = pipe_rand.predict(X_test)

r2_train_rand = r2_score(y_pred_train_rand, y_train)
mse_train_rand = mean_squared_error(y_train, y_pred_train_rand)
mae_train_rand = mean_absolute_error(y_train, y_pred_train_rand)

print('r2 on the train set:', r2_train_rand)
print('MAE on the train set:', mae_train_rand)
print('MSE on the train set:', mse_train_rand)

r2_test_rand = r2_score(y_pred_test_rand, y_test)
mse_test_rand = mean_squared_error(y_test, y_pred_test_rand)
mae_test_rand = mean_absolute_error(y_test, y_pred_test_rand)

print('----------------')
print('r2 on the test set:', r2_test_rand)
print('MAE on the test set:', mae_test_rand)
print('MSE on the test set:', mse_test_rand)


-------------random forest------------
r2 on the train set: 0.9705828237104931
MAE on the train set: 0.11990197445628809
MSE on the train set: 0.034488233047398974
----------------
r2 on the test set: 0.7486758098300759
MAE on the test set: 0.32004346957364355
MSE on the test set: 0.24226863716491928


In [36]:
print('-------------gradient boosting------------')
pipe_grad = GradientBoostingRegressor(random_state=43)
# fit
pipe_grad.fit(X_train, y_train)

y_pred_train_grad = pipe_grad.predict(X_train) 
y_pred_test_grad = pipe_grad.predict(X_test)

r2_train_grad = r2_score(y_pred_train_grad, y_train)
mse_train_grad = mean_squared_error(y_train, y_pred_train_grad)
mae_train_grad = mean_absolute_error(y_train, y_pred_train_grad)

print('r2 on the train set:', r2_train_grad)
print('MAE on the train set:', mae_train_grad)
print('MSE on the train set:', mse_train_grad)

r2_test_grad = r2_score(y_pred_test_grad, y_test)
mse_test_grad = mean_squared_error(y_test, y_pred_test_grad)
mae_test_grad = mean_absolute_error(y_test, y_pred_test_grad)

print('----------------')
print('r2 on the test set:', r2_test_grad)
print('MAE on the test set:', mae_test_grad)
print('MSE on the test set:', mse_test_grad)

-------------gradient boosting------------
r2 on the train set: 0.7395782392433273
MAE on the train set: 0.35656543036682264
MSE on the train set: 0.26167490389525294
----------------
r2 on the test set: 0.7157292841624554
MAE on the test set: 0.36457603328113886
MSE on the test set: 0.27058520173725403


In [37]:
# It is important to notice that the Decision Tree over fits very easily. It learns easily the training data but is not able to extrapolate on the test set. This algorithm is not used a lot.

# However, Random Forest and Gradient Boosting propose a solid approach to correct the over fitting (in that case the parameters `max_depth` is set to None that is why the Random Forest over fits the data). These two algorithms are used intensively in Machine Learning Projects.


SyntaxError: invalid syntax (<ipython-input-37-0ef7bbf4246d>, line 2)