# Master's in Applied Artificial Intelligence
## Machine Learning Algorithms Course

Notebooks for the MLA course

by [*lufer*](mailto:lufer@ipca.pt)

---



# ML Modelling - Part V-I - Regression Models

**Contents**:

1.  **Create a Regression ML Model**
2.  **Ridge Regression Algorithm**
3.  **RandomForest Algorithm**


This notebook explores the creation of Machine Learning models for Regression Supervised Learning.

# Environment preparation


**Importing necessary Libraries**

In [23]:
import pandas as pd
import numpy as np
import seaborn as sns

#import libraries for training
from sklearn.model_selection import train_test_split


**Mounting Drive**

In [24]:

from google.colab import drive

# it will ask for your Google Drive credentials
drive.mount('/content/gDrive/', force_remount=True)

Mounted at /content/gDrive/



# Review of concepts


### Types of ML Algorithms

There are essentially four types of ML Algorithms:

*   Supervised ML Algorithms
*   Unsupervised ML Algorithms
*   Semi-Supervised ML Algorithms
*   Reinforcement ML Algorithms

### ML algorithm selection

The selection of an ML algorithm can follow these phases:

1.   **Understand Your Problem**

> Understand clearly the problem to solve. What is the goal? What kind of problem is it: classification, regression, clustering, or something else? What kind of data do you have to work with?

2.  **Process the Data**

> Ensure that your data is in the right format for your chosen algorithm. Process and prepare your data through steps such as cleaning, clustering, and regression.

3.  **Exploration of Data**

>  Conduct data analysis to gain insights into your data. Visualizations and statistics help you understand the relationships within your data.

4.  **Metrics Evaluation**

>  Decide on the metrics that will measure the success of the model. Choose a metric that aligns with your problem.

5.  **Start with a Simple Model**

> Begin with simple, easy-to-learn algorithms. For classification, try logistic regression or a decision tree. A simple model provides a baseline for comparison.

6.  **Use Multiple Algorithms**

> Trying multiple algorithms lets you check which one performs best on the dataset.

7.  **Hyperparameter Tuning**

> Grid Search and Random Search help tune the hyperparameters of the chosen algorithm by searching for the best combination.

8.  **Cross-Validation**

> Cross-validation lets you assess the performance of your models more reliably. It helps to detect overfitting or underfitting.

9.  **Comparing Results**

> Evaluate the models' performance using the chosen evaluation metrics. Compare their performance and choose the one that best aligns with the problem's goal.

10.  **Consider Model Complexity**

> Balance the complexity of the models against their performance. Prefer the algorithm that generalizes best to unseen data.
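
Phases 5 to 8 above can be sketched in a few lines. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_regression`), and the candidate models and the `alpha` grid are arbitrary choices, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption, used only for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Phases 5 and 6: a trivial baseline plus a few candidate algorithms
candidates = {
    "baseline": DummyRegressor(strategy="mean"),  # always predicts the training mean
    "ridge": Ridge(),
    "tree": DecisionTreeRegressor(random_state=42),
}

# Phase 8: 5-fold cross-validation on the training data
# (for regressors the default score is R^2)
cv_scores = {
    name: cross_val_score(estimator, X_train, y_train, cv=5).mean()
    for name, estimator in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: mean CV R^2 = {score:.3f}")

# Phase 7: grid search over the Ridge regularisation strength
search = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)
print("best alpha:", search.best_params_["alpha"])
print("test R^2 of tuned model:", search.score(X_test, y_test))
```

The baseline gives a reference point: a candidate that cannot beat it is not learning anything useful from the features.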



---
see more in books:
*   *Machine Learning*, Tom M. Mitchell
*   *Mastering Machine Learning with Python in Six Steps*, Manohar Swamynathan

---




# 1 - Choosing a ML Algorithm

Choosing a Machine Learning algorithm depends on many factors, such as the size of the dataset, the type of data in it, the goal of the model, and others.

Scikit-learn offers a flowchart that facilitates this selection.

![picture](https://scikit-learn.org/stable/_static/ml_map.png)

Let's consider a Supervised Learning process, using Regression Algorithms, where the goal is to predict a number.

# 2 - Regression Problems

A Regression problem is one that aims to predict a continuous number. It is addressed with Regression Algorithms.

## 2.1 - Performance Metrics

Regression Problems have as performance metrics:

*    [Root-Mean-Square Deviation (RMSD) or Root-Mean-Square Error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation?ref=mrdbourke.com)
*    [Mean Absolute Error (MAE)](https://en.wikipedia.org/wiki/Mean_absolute_error)
*    [Precision @ k](https://en.wikipedia.org/wiki/Evaluation_measures_%28information_retrieval%29?ref=mrdbourke.com#Precision_at_K) (a ranking metric, relevant when the predictions are used to order items)

\
see more in [Mean-Squared Error (MSE) or R-squared (R^2)?](https://vitalflux.com/mean-square-error-r-squared-which-one-to-use/)
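
As a small worked example of the first two metrics, the sketch below computes MAE and RMSE by hand with NumPy. The four values in `y_true` and `y_pred` are made up for illustration and do not come from any dataset.

```python
import numpy as np

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 3.0])

errors = y_pred - y_true              # [-1, 0, 2, -4]

mae = np.mean(np.abs(errors))         # (1 + 0 + 2 + 4) / 4 = 1.75
mse = np.mean(errors ** 2)            # (1 + 0 + 4 + 16) / 4 = 5.25
rmse = np.sqrt(mse)                   # square root of the MSE

print("MAE :", mae)
print("RMSE:", rmse)
```

Because RMSE squares the errors before averaging, the single large error (-4) weighs more heavily in RMSE than in MAE.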




---

***Download Dataset***

This notebook will explore the California Housing dataset, which is prepared for Regression Analysis. See [California Housing Dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html#)

\
**Attribute Information:**

*   `MedInc` median income in block group
*   `HouseAge` median house age in block group
*   `AveRooms` average number of rooms per household
*   `AveBedrms` average number of bedrooms per household
*   `Population` block group population
*   `AveOccup` average number of household members
*   `Latitude` block group latitude
*   `Longitude` block group longitude

In [25]:
#Importing a real-world dataset prepared for Regression

from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

#answer: a dictionary-like object (a scikit-learn Bunch)

In [26]:
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

We want to use Features to predict the Target

## 2.2 - Prepare the dataset

In [27]:
#because it is a dictionary, convert it to a dataframe
df=pd.DataFrame(housing['data'],columns=housing["feature_names"])
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |


Include the target column

In [28]:
#df['MedHouseVal']= housing['target']
#df.drop('MedHouseVal',axis=1, inplace=True)

df['target']= housing['target']
df

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


Now let's try to use an algorithm to predict the "target"


### **Split the dataset**

In [29]:
# fix the random seed so that the split is reproducible
np.random.seed(42)

#Create X and y data
# Note different names for:
# X = features, feature variables, data, independent variables
# y = target,  labels, target variables, dependent variables

X = df.drop('target', axis=1)
y = df['target']  #median house value

#split dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

#X_train.head()
#y.head()
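
As a side note on the split, the sketch below uses a tiny made-up array instead of the housing data to show what `train_test_split` returns. It passes `random_state` directly to the function, which is an alternative to calling `np.random.seed` beforehand; the value `42` is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 20 samples, 2 features; row i is [2*i, 2*i + 1]
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# test_size=0.2 keeps 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("train shape:", X_train.shape)
print("test shape :", X_test.shape)

# The same random_state always produces the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print("reproducible:", np.array_equal(X_test, X_test_again))

# Features and targets are shuffled together, so rows stay aligned
print("rows aligned:", np.array_equal(X_test[:, 0], 2 * y_test))
```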

## **2.3 - Ridge Regression Algorithm**

Since we have more than 50 but fewer than 100k data records, and we want to predict a quantity, the next step is to consider whether only a few features are expected to be important. Assuming that this is not the case, a Ridge Regression algorithm can be explored (see [sklearn selection ML schema](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html#)).

*Ridge regression* is a statistical regularization technique. It corrects for overfitting on training data (see [What is ridge regression](https://www.ibm.com/topics/ridge-regression)).
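
To make the regularization idea concrete: scikit-learn's `Ridge` minimizes the squared error plus `alpha` times the squared size of the coefficients, so a larger `alpha` pulls the coefficients towards zero. The sketch below shows this on synthetic data; the data, the true weights, and the `alpha` values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration): 3 features with known weights
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Unregularised least squares as the reference point
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)
print(f"no penalty: coefficient norm = {ols_norm:.4f}")

# A larger alpha means a stronger penalty and smaller coefficients
alphas = [0.1, 10.0, 1000.0]
ridge_norms = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X, y)
    ridge_norms.append(np.linalg.norm(ridge.coef_))
    print(f"alpha={alpha}: coefficient norm = {ridge_norms[-1]:.4f}")
```

`Ridge()` without arguments uses `alpha=1.0`.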

### **Instantiate and fit the model (algorithm)**

In [30]:
#import RidgeRegression algorithm

#from sklearn import linear_model
#reg = linear_model.Ridge(alpha=.5)

#or
from sklearn.linear_model import Ridge

# Instantiate
model = Ridge()

#fit the model
model.fit(X_train,y_train)

### **What does *fit the model* mean?**

Fitting the model is the process of learning. The model goes through each row of X (the independent variables) and tries to figure out the corresponding expected label (y). In practice, it tries to find patterns.
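
For a linear model such as Ridge, the "patterns" are simply a set of coefficients and an intercept. The sketch below is an illustration on synthetic, noise-free data (an assumption, not the housing data): the target is built from known weights, and after `fit` the model's `coef_` and `intercept_` come out close to those weights.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with a known relationship: y = 4*x0 - 1.5*x1 + 7
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 2))
y = 4.0 * X[:, 0] - 1.5 * X[:, 1] + 7.0

model = Ridge(alpha=1.0)
model.fit(X, y)

# What the model learned: one coefficient per feature plus an intercept
print("coefficients:", model.coef_)
print("intercept   :", model.intercept_)
```

The learned values are not exactly 4, -1.5 and 7 because the Ridge penalty shrinks the coefficients slightly.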

### **What does the evaluation phase mean?**

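During the evaluation phase (when checking the score), the model uses the learned patterns to predict the labels of data it has not seen, and those predictions are compared with the true labels.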

### **Check the score ("quality") of the model**


In [31]:
# Check the score of the model (on the test set)
model.score(X_test,y_test)

#answer: Coefficient of Determination (R^2)...at most 1 (it can be negative for very poor models)...the bigger, the better!

0.5758549611440126

Can we improve this coefficient?

*   With more data?
*   With a different model?



Can we try combining multiple models?

Yes! That's what "ensemble" means! So, let's try an ensemble method!

see [scikit-learn Ensemble methods](https://scikit-learn.org/stable/modules/ensemble.html)




**Note:**

`Ensemble methods combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability/robustness over a single estimator.`

`Examples of ensemble methods are gradient-boosted trees and random forests.`

## **2.4 - Random Forest Regressor Algorithm**

\
<p align=center><b>Random Forest algorithms are based on Decision Trees.</b></p>
<br/>
<table width="100%">
<tr align="center"><td style="text-align: center; vertical-align: middle;"><img src="https://www.researchgate.net/profile/Jessica-Pickles-2/publication/339279807/figure/fig1/AS:891889284829201@1589654384981/Random-forests-are-collections-of-randomised-decision-trees-A-A-single-decision-tree.ppm" width="400" height="250"></td>
</tr>
<tr align="center"><td>
[image credits to:](https://www.researchgate.net/profile/Jessica-Pickles-2/publication/339279807/figure/fig1/AS:891889284829201@1589654384981/Random-forests-are-collections-of-randomised-decision-trees-A-A-single-decision-tree.ppm)</td></tr>
</table>


`Random forests are collections of randomised decision trees. (A) A single decision tree is built from all variables in a dataset and as a result classification can be vulnerable to the order in which variables appear in the tree. (B) Random forests avoid this by randomly selecting variables from the dataset to build many trees (e.g. a forest) with combinations of variables
(DOI:10.1002/path.5397)`

see [Random Forest Simple Explanation](https://williamkoehrsen.medium.com/random-forest-simple-explanation-377895a60d2d)

Import the *RandomForestRegressor* model class from the scikit-learn ensemble module.

There is also a *RandomForestClassifier*.
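
To illustrate how a forest combines its trees, the sketch below fits a small `RandomForestRegressor` on synthetic data (an assumption for illustration; `n_estimators=10` is chosen only to keep it fast) and checks that the forest's prediction is the average of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=10, random_state=0)
forest.fit(X, y)

# The fitted decision trees are available in forest.estimators_
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest.estimators_])
manual_average = tree_predictions.mean(axis=0)

# The forest prediction for the same rows
forest_prediction = forest.predict(X[:5])

print("average of trees :", manual_average)
print("forest prediction:", forest_prediction)
print("equal:", np.allclose(manual_average, forest_prediction))
```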



### **Create Instance, Fit and Evaluate**

In [32]:
#import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor

#set random seed
np.random.seed(42)

#Create X and y data
X = df.drop('target', axis=1)
y = df['target']

#Split into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Create Random Forest Model instance
model = RandomForestRegressor()       #by default it uses n_estimators=100 decision trees

#fit the model to the train data
model.fit(X_train,y_train)

#check the score (on the test data)
model.score(X_test,y_test)

#answer: better Coefficient of Determination!!!

0.8066196804802649

### **Make Predictions**

Predictions should be made on the test dataset.

In [33]:
df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | target |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


In [34]:
#predicting using predict() function...on test dataset
y_predicted = model.predict(X_test)

In [35]:
y_predicted[:10]


array([0.49384  , 0.75494  , 4.9285964, 2.54029  , 2.33176  , 1.6549701,
       2.34323  , 1.66182  , 2.47489  , 4.8344779])

In [36]:
y_test[:10]

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
13311    1.58700
7113     1.98200
7668     1.57500
18246    3.40000
5723     4.46600
Name: target, dtype: float64

Convert *y_test* into an array, to allow comparing with *y_predicted*.

In [37]:
y_test_array= np.array(y_test)

In [38]:
y_test_array[:10]

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   , 1.587  , 1.982  ,
       1.575  , 3.4    , 4.466  ])

Check both sizes

In [39]:
len(y_test_array)

4128

In [40]:
len(y_predicted)

4128

Since the sizes are equal, there is a predicted value for each test value!

In [41]:
finalResult = pd.DataFrame({"Truth":y_test,"Predicted":y_predicted})
finalResult

| | Truth | Predicted |
|---|---|---|
| 20046 | 0.47700 | 0.493840 |
| 3024 | 0.45800 | 0.754940 |
| 15663 | 5.00001 | 4.928596 |
| 20484 | 2.18600 | 2.540290 |
| 9814 | 2.78000 | 2.331760 |
| ... | ... | ... |
| 15362 | 2.63300 | 2.220380 |
| 16623 | 2.66800 | 1.947760 |
| 18086 | 5.00001 | 4.836378 |
| 2144 | 0.72300 | 0.717820 |


Let's now evaluate the model results. For that we'll use **Regression Metrics**.

see [Metrics and scoring](https://scikit-learn.org/stable/modules/model_evaluation.html)

## **2.5 - Evaluating the Regression model**

The goal is to compare the predicted values with the actual values (Truth).

Three approaches can be used to evaluate the quality of the predictions:

*   Estimator score method
*   Scoring Parameter
*   Metric Functions

see https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics
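
The three approaches listed above can be put side by side in a short sketch. It uses synthetic data and a `Ridge` model purely as stand-ins (assumptions for illustration); the same calls apply to any scikit-learn regressor.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression data (illustrative only)
X, y = make_regression(n_samples=200, n_features=3, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
model = Ridge().fit(X_train, y_train)

# 1. Estimator score method: R^2 for regressors
r2 = model.score(X_test, y_test)
print("score():", r2)

# 2. Scoring parameter: passed to tools such as cross_val_score.
#    Error metrics are negated so that higher is always better.
neg_mae_folds = cross_val_score(
    Ridge(), X_train, y_train, cv=5, scoring="neg_mean_absolute_error"
)
print("cross-validated MAE:", -neg_mae_folds.mean())

# 3. Metric function: called directly on true and predicted values
mae = mean_absolute_error(y_test, model.predict(X_test))
print("mean_absolute_error():", mae)
```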


We will explore:

*   R^2 (r-squared) or coefficient of determination
*   Mean Absolute Error (MAE)
*   Mean Squared Error (MSE)

### **Using the `score()` method**

The default `score()` evaluation metric is `r-squared` for regression algorithms.

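The best possible value is 1.0. A model that always predicts the mean of the target scores 0.0, and a model that does worse than that gets a negative score.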



In [42]:
#check the score (on the test data)
#this is the Regression metrics by default (r-squared)
model.score(X_test,y_test)

0.8066196804802649

### **Using R^2**

*   *R^2 is similar to Accuracy.*
*   **The closer the R^2 value is to 1, the better the model!**
*   But it doesn't tell how wrong the model is!
*   Behaves like the `score()` function

In [43]:
from sklearn.metrics import r2_score

r2_score(y_test, y_predicted)
#answer: same coefficient as score(). Check it!

0.8066196804802649
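
As a worked example of what `r2_score` computes, the sketch below derives R^2 by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The small arrays are made up for illustration and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 6.0])

# Residual sum of squares: how far predictions are from the truth
ss_res = np.sum((y_true - y_pred) ** 2)           # 2.25

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 14.75

r2_manual = 1 - ss_res / ss_tot
print("manual R^2 :", r2_manual)
print("r2_score   :", r2_score(y_true, y_pred))

# Predictions that are worse than always guessing the mean give a negative R^2
bad_pred = np.array([7.0, 2.0, 7.0, 2.0])
r2_bad = r2_score(y_true, bad_pred)
print("R^2 of bad predictions:", r2_bad)
```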

### **Using Mean Absolute Error (MAE)**

MAE is the average of the absolute differences between the predictions and their actual values.

It gives an idea of how wrong the model's predictions are!

In [44]:
#compare the predictions (y_predicted) with the truth values (y_test)
from sklearn.metrics import mean_absolute_error

#calculate the mean absolute difference between the values of the two arrays
#(conventional argument order: true values first, then predictions; the result is the same)
mean_absolute_error(y_test_array, y_predicted)

0.3265721842781009

A closer look at the results

In [45]:
dfaux = pd.DataFrame(data={"real values":y_test,
                           "predicted values":y_predicted})
dfaux

| | real values | predicted values |
|---|---|---|
| 20046 | 0.47700 | 0.493840 |
| 3024 | 0.45800 | 0.754940 |
| 15663 | 5.00001 | 4.928596 |
| 20484 | 2.18600 | 2.540290 |
| 9814 | 2.78000 | 2.331760 |
| ... | ... | ... |
| 15362 | 2.63300 | 2.220380 |
| 16623 | 2.66800 | 1.947760 |
| 18086 | 5.00001 | 4.836378 |
| 2144 | 0.72300 | 0.717820 |


In [46]:
dfaux["Diff"] = dfaux['predicted values']-dfaux['real values']
dfaux.head()

| | real values | predicted values | Diff |
|---|---|---|---|
| 20046 | 0.477 | 0.49384 | 0.01684 |
| 3024 | 0.458 | 0.75494 | 0.29694 |
| 15663 | 5.00001 | 4.928596 | -0.071414 |
| 20484 | 2.186 | 2.54029 | 0.35429 |
| 9814 | 2.78 | 2.33176 | -0.44824 |


In [47]:
dfaux['Diff'].mean()
#note: different from MAE. Why?

0.0121069218749996

In [48]:
np.abs(dfaux['Diff']).mean()
#answer: equal to MAE

0.3265721842781009
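
The "Why?" in the cell above has a short answer: positive and negative differences cancel each other when averaged, whereas MAE takes absolute values first. A tiny made-up example (the numbers are illustrative only):

```python
import numpy as np

# Made-up prediction errors: two overestimates and two underestimates
diff = np.array([0.5, -0.5, 1.0, -1.0])

mean_error = diff.mean()        # signed errors cancel out
mae = np.abs(diff).mean()       # absolute errors do not

print("mean of differences:", mean_error)
print("mean absolute error:", mae)
```

A mean difference close to zero therefore only says that the model does not systematically over- or under-predict; it says nothing about the size of the individual errors.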

### **Using Mean Squared Error (MSE)**

MSE is the mean of the square of the errors between actual and predicted values.

In [49]:
#import
from sklearn.metrics import mean_squared_error

#remember: model = RandomForestRegressor()
y_preds = model.predict(X_test)
mse=mean_squared_error(y_test,y_preds)
mse

0.2534073069137548

In [50]:
#fill an array with y_test mean()
y_test_mean = np.full(len(y_test),y_test.mean())
#note: the MSE above (0.2534) is less than the MAE (0.3266)
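
An array filled with the mean of `y_test`, as built above, can serve as a baseline "prediction". The sketch below illustrates this on the first five `y_test` values shown earlier (a small subset, used only to keep the example self-contained): always predicting the mean gives an R^2 of 0, and its MSE equals the variance of the true values.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# First five y_test values from the output shown earlier
y_test = np.array([0.477, 0.458, 5.00001, 2.186, 2.78])

# Baseline: predict the mean for every record
y_test_mean = np.full(len(y_test), y_test.mean())

r2_baseline = r2_score(y_test, y_test_mean)
mse_baseline = mean_squared_error(y_test, y_test_mean)

print("R^2 of mean predictor:", r2_baseline)
print("MSE of mean predictor:", mse_baseline)
print("variance of y_test   :", np.var(y_test))
```

Any useful model should reach an R^2 above 0 and an MSE below this baseline.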

Let's see the squared differences

In [51]:
dfaux['SquaredDiff'] = np.square(dfaux['Diff'])
dfaux

| | real values | predicted values | Diff | SquaredDiff |
|---|---|---|---|---|
| 20046 | 0.47700 | 0.493840 | 0.016840 | 0.000284 |
| 3024 | 0.45800 | 0.754940 | 0.296940 | 0.088173 |
| 15663 | 5.00001 | 4.928596 | -0.071414 | 0.005100 |
| 20484 | 2.18600 | 2.540290 | 0.354290 | 0.125521 |
| 9814 | 2.78000 | 2.331760 | -0.448240 | 0.200919 |
| ... | ... | ... | ... | ... |
| 15362 | 2.63300 | 2.220380 | -0.412620 | 0.170255 |
| 16623 | 2.66800 | 1.947760 | -0.720240 | 0.518746 |
| 18086 | 5.00001 | 4.836378 | -0.163632 | 0.026775 |
| 2144 | 0.72300 | 0.717820 | -0.005180 | 0.000027 |


In [52]:
#analyse the first record
dfaux.iloc[0]

real values         0.477000
predicted values    0.493840
Diff                0.016840
SquaredDiff         0.000284
Name: 20046, dtype: float64