# Introduction to Machine Learning - Exercise 4
The goal of this exercise is to learn how to use the Scikit-learn library for regression tasks with various linear regression models, and to evaluate the performance of those models.

![meme01](https://github.com/lubsar/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_10_meme_01.jpeg?raw=true)

## 📌 Useful URLs

### Models
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html#sklearn.linear_model.Ridge
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html#sklearn.linear_model.Lasso
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet

### Preprocessing
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler
- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PowerTransformer.html#sklearn.preprocessing.PowerTransformer

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math

from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PowerTransformer

In [2]:
def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Compute MAPE in percent; the result is undefined if y_true contains zeros."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def compute_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the metrics from a DataFrame with 'y_true' and 'y_pred' columns."""
    y_true, y_pred = df['y_true'].values, df['y_pred'].values
    return compute_metrics_raw(y_true, y_pred)

def compute_metrics_raw(y_true: np.ndarray, y_pred: np.ndarray) -> pd.DataFrame:
    """Return MAE, MSE, RMSE and MAPE as a one-row DataFrame."""
    mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)
    mse = mean_squared_error(y_true=y_true, y_pred=y_pred)
    rmse = np.sqrt(mse)  # reuse the MSE instead of computing it twice
    mape = mean_absolute_percentage_error(y_true=y_true, y_pred=y_pred)
    return pd.DataFrame.from_records([{'MAE': mae, 'MSE': mse, 'RMSE': rmse, 'MAPE': mape}], index=[0])

## Petrol Consumption Dataset
https://www.kaggle.com/datasets/harinir/petrol-consumption

### 🎯 Our goal is to build a regression model that predicts petrol consumption in 48 US states.

In [3]:
df = pd.read_csv('https://raw.githubusercontent.com/lubsar/EFREI-Introduction-to-Machine-Learning/main/datasets/petrol_consumption.csv')
df.head()

| | Petrol_tax | Average_income | Paved_Highways | Population_Driver_licence(%) | Petrol_Consumption |
|---|---|---|---|---|---|
| 0 | 9.0 | 3571 | 1976 | 0.525 | 541 |
| 1 | 9.0 | 4092 | 1250 | 0.572 | 524 |
| 2 | 9.0 | 3865 | 1586 | 0.58 | 561 |
| 3 | 7.5 | 4870 | 2351 | 0.529 | 414 |
| 4 | 8.0 | 4399 | 431 | 0.544 | 410 |


## Is each column numerical?

## Do we have any missing data?
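Both questions can be checked directly with pandas. The following is a minimal sketch that assumes a small stand-in frame called `sample` (a hypothetical name) built from the five rows shown by `df.head()`; on the real data the same two calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "Petrol_tax": [9.0, 9.0, 9.0, 7.5, 8.0],
    "Average_income": [3571, 4092, 3865, 4870, 4399],
    "Paved_Highways": [1976, 1250, 1586, 2351, 431],
    "Population_Driver_licence(%)": [0.525, 0.572, 0.580, 0.529, 0.544],
    "Petrol_Consumption": [541, 524, 561, 414, 410],
})

# True for every column whose dtype is numeric (int or float).
is_numeric = sample.dtypes.apply(pd.api.types.is_numeric_dtype)
# Number of missing values per column.
missing_per_column = sample.isna().sum()

print(is_numeric)
print(missing_per_column)
```

`df.info()` gives a similar overview (dtypes and non-null counts) in a single call.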

# 📊 Let's start with a simple EDA

* 🔎 Can you see any relationships among the features in the pairplot?
    * What should we look for?
* 🔎 Do you think that the features are normally distributed?

## Always look for simple trend-like patterns first 🙂
> ## **Trend is your friend** 😀

![meme02](https://github.com/rasvob/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_10_meme_02.png?raw=true)

## What about the correlation coefficients?
* 🔎 Which row/column of the heatmap is the most important?
    * Why?
* 🔎 Are correlations among **independent variables** good or bad?
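As a rough illustration of reading a correlation matrix, the sketch below uses synthetic data, not the petrol dataset. The column names `x1`, `x2` and `target` are hypothetical; `target` is constructed to depend on `x1` only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Synthetic, hypothetical data: 'target' depends strongly on 'x1' and not on 'x2'.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
target = 3.0 * x1 + rng.normal(scale=0.5, size=n)
demo = pd.DataFrame({"x1": x1, "x2": x2, "target": target})

corr = demo.corr()  # Pearson correlation by default
# The target column shows how strongly each feature is linearly related to the target.
target_corr = corr["target"].drop("target").sort_values(key=abs, ascending=False)
print(target_corr)
```

The full matrix can be drawn with `sns.heatmap(corr, annot=True)`; the remaining cells describe correlations among the features themselves.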

## Can you see any outliers in the data?
* What about skewness or variance differences?
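Besides box plots, skewness and outliers can be quantified numerically. This is a minimal sketch on a synthetic column with one injected extreme value; the 1.5 × IQR rule used here is one common convention, not the only possible definition of an outlier.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, hypothetical column with one injected extreme value (200).
values = pd.Series(np.append(rng.normal(loc=50, scale=5, size=99), 200.0))

skewness = values.skew()  # a value > 0 indicates a long right tail

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(skewness, outliers.tolist())
```

On a DataFrame, `df.skew()` and `df.var()` return the same statistics per column, which makes differences in variance between features easy to compare.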

# 🚀 Let's build our first simple regression models with just 2 variables and compare them
* We will split the data into train and test sets
* Then we can build the models and evaluate them

### There are many metrics used for performance evaluation
* MAE, RMSE, MAPE, R2, etc.
    * Do you know what these abbreviations mean?
* 🔎 **Do we want these metrics to go lower or higher?**
    * Is it the same direction as in classification tasks, e.g. F1-Score, or the other way around?
* 💡 You can take a look at these blog posts:
    * [this](https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914)
    * or [this](https://www.analyticsvidhya.com/blog/2021/05/know-the-best-evaluation-metrics-for-your-regression-model/) for more details

## Create `X` and `y` dataframes

## Split the data in an 80:20 ratio
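A minimal sketch of both steps follows. It assumes a synthetic frame named `demo` with hypothetical columns `feature_a`, `feature_b` and `target`; on the real data, `target` would correspond to `Petrol_Consumption`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic, hypothetical frame with 48 rows, mirroring the dataset size.
demo = pd.DataFrame({
    "feature_a": rng.normal(size=48),
    "feature_b": rng.normal(size=48),
})
demo["target"] = 2.0 * demo["feature_a"] + rng.normal(scale=0.1, size=48)

X = demo.drop(columns="target")  # features
y = demo["target"]               # dependent variable

# test_size=0.2 gives the 80:20 split; random_state makes it reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)
```

The value `random_state=42` is an arbitrary choice; any fixed integer keeps the split identical between runs.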

# ⚡ The 1st model will be the simplest one
* We will choose only one feature for the model - *Population_Driver_licence(%)*
    * 🔎 Why did we choose this specific feature?

## 🔎 What would the regression line formula look like?
* 💡 What is the general equation of a straight line in 2D? And in nD?
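With a single feature the fitted model is a straight line, y = b0 + b1 * x; with n features it becomes y = b0 + b1 * x1 + ... + bn * xn. The sketch below fits such a line on synthetic data (not the petrol dataset) and shows that `intercept_` and `coef_` are exactly the b0 and b1 of that equation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic, hypothetical data following y = 4 + 2.5 * x plus small noise.
x = rng.uniform(0, 10, size=100)
y = 4.0 + 2.5 * x + rng.normal(scale=0.1, size=100)

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)  # scikit-learn expects a 2D feature matrix

intercept, slope = model.intercept_, model.coef_[0]
# A prediction is just the line equation evaluated at x.
x_new = np.array([[3.0]])
manual = intercept + slope * x_new[0, 0]
print(intercept, slope, model.predict(x_new)[0], manual)
```

When the feature is selected from a DataFrame, `df[['column_name']]` (double brackets) already yields the required 2D shape.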

# 💡 A very simple visual check of prediction quality is the `y_test vs. y_pred` scatter plot
* What is an ideal result?

# However, it is always better to quantify the errors 😊
* 💡 MAPE and SMAPE are expressed as percentages, so they may be easier for a non-expert audience to understand
* 💡 MAE and RMSE are in the same units as the predicted variable
    * Always take a look at basic statistical properties (typical value range, variance, or a box plot) to judge the size of the error relative to the range of the variable
    * 📌 e.g., MAE = 10 can be low for a variable in the <1000, 5000> range but very high for a variable in the <0, 50> range
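To make the units concrete, here is a small worked example with three hand-picked values (hypothetical numbers, not taken from the dataset), computed with plain NumPy.

```python
import numpy as np

y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

errors = y_true - y_pred                       # -10, 20, 0
mae = np.mean(np.abs(errors))                  # same units as y
rmse = np.sqrt(np.mean(errors ** 2))           # same units as y, penalizes large errors more
mape = np.mean(np.abs(errors / y_true)) * 100  # percent, relative to the true value

print(f"MAE={mae:.2f}, RMSE={rmse:.2f}, MAPE={mape:.2f}%")
# prints: MAE=10.00, RMSE=12.91, MAPE=6.67%
```

RMSE is larger than MAE here because squaring gives the single error of 20 more weight than the error of 10.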

## So what do you think about the error?
* 💡 Is the model completely off or is it roughly right?

# ⚡ The 2nd model will use just one variable again, but this time an uncorrelated one
* 🎯 We want to compare this model with the 1st one

## Let's take a look at the scatterplot of y_test vs. y_pred now

## 🔎 Which one of the two models is better and why?

# The obvious next step is using more than one feature in the model, so let's get to it! 👊
* The API is the same, we will just include every feature in the model instead of just one

## What would the regression line formula look like now?

## 📊 For MLR we usually also want to take a look at the coefficient values so we can "explain" the decisions made by the model
* 🔎 Are all the features used?
* 🔎 Is there any feature much more important than the other features?
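One convenient way to inspect the coefficients is to label them with the feature names. This sketch uses synthetic data from `make_regression` with hypothetical column names `f0` to `f3`, of which only two carry signal by construction.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# Synthetic, hypothetical data: 4 features, only 2 of which carry signal.
X, y = make_regression(n_samples=200, n_features=4, n_informative=2, noise=1.0, random_state=0)
X = pd.DataFrame(X, columns=["f0", "f1", "f2", "f3"])

model = LinearRegression().fit(X, y)
# One coefficient per feature, sorted by absolute size.
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
print("intercept:", model.intercept_)
```

Keep in mind that raw coefficients are only comparable as "importance" when the features are on similar scales.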

## 🔎 Is the model better than the 1st one with just one feature?
* How different are the results?

## 🔎 Is it wise to have a model in which some coefficients are several orders of magnitude larger than the others?
* What can go wrong?
* What is **collinearity**?
    * Why may it become an issue for regression models?

![meme03](https://github.com/lubsar/EFREI-Introduction-to-Machine-Learning/blob/main/images/fml_10_meme_03.jpg?raw=true)

# There is a method for dealing with these issues
* It is called regularization
    * We have two types of it - **L1 (Lasso)** and **L2 (Ridge)**
    * What is the difference between them?
* How is the regularization used?
    * What do we change in the model?

* A very nice comparison of both methods is at https://www.datacamp.com/tutorial/tutorial-lasso-ridge-regression

# Let's try L1 - Lasso first
* 💡 The most important parameter is the `alpha` value
* A higher alpha means that the regularization will be stronger

## 💡 Notice the values of the coefficients
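The effect of `alpha` on the coefficients can be sketched on synthetic data (not the petrol dataset). The alpha values below are illustrative; `1e6` is chosen only to show the extreme case where the penalty dominates.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic, hypothetical data: 5 features, only 2 of which carry signal.
X, y = make_regression(n_samples=200, n_features=5, n_informative=2, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)
# A very large alpha pushes every coefficient to exactly zero.
lasso_strong = Lasso(alpha=1e6).fit(X, y)

print("OLS:        ", np.round(ols.coef_, 2))
print("Lasso a=1:  ", np.round(lasso.coef_, 2))
print("Lasso a=1e6:", np.round(lasso_strong.coef_, 2))
```

The L1 penalty can set coefficients to exactly zero, which is why Lasso is sometimes used as a simple form of feature selection.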

# We can use L2 - Ridge in the same way

### 🔎 Regardless of the model used, how do the coefficient values differ with regularization enabled and disabled?
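A minimal sketch of the comparison for Ridge, again on synthetic data with an illustrative `alpha=10.0`: the L2 penalty shrinks the overall size of the coefficient vector relative to the unregularized fit.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic, hypothetical data; all 5 features carry signal here.
X, y = make_regression(n_samples=50, n_features=5, noise=10.0, random_state=0)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Euclidean (L2) norm of each coefficient vector.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print(ols_norm, ridge_norm)
```

Unlike the L1 penalty, the L2 penalty shrinks coefficients towards zero but typically does not make them exactly zero.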

# 💡 The variables usually have different ranges
* This may cause difficulties in the coefficient optimization process
* 💡 If the ranges are similar, the optimization process should be a lot easier
    * Why?
* Because of that, we usually use `MinMaxScaler` or `StandardScaler` before we try to fit a linear regression model
    * We are not limited to these two preprocessing methods

# 🚀 We will try to fit the Lasso model again, but this time with scaled features

### 🔎 Why do we fit the scaler only on the training part of the data?

### Try 0.1, 1 and 10 for alpha parameter
* What is different for each run?
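The whole procedure can be sketched as follows, assuming synthetic data whose features are deliberately put on very different ranges. The scaler is fitted on the training part only, so no information from the test part leaks into the preprocessing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data; the multipliers give the features very different ranges.
X, y = make_regression(n_samples=200, n_features=5, n_informative=3, noise=5.0, random_state=0)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 0.01])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training part only, then reuse its statistics for the test part.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

results = {}
for alpha in (0.1, 1, 10):
    model = Lasso(alpha=alpha).fit(X_train_s, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test_s))
    # Store the number of non-zero coefficients and the test MAE for each alpha.
    results[alpha] = (int(np.sum(model.coef_ != 0)), mae)
    print(alpha, results[alpha])
```

Comparing the number of non-zero coefficients and the MAE across the three runs shows how the regularization strength trades model simplicity against error.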

## 🔎 How are the models and results different?

# ✅ Task
* We are obviously not limited to linear regression models for regression tasks
    * Usually there is a regression alternative for most of the classification models in Scikit-learn
* Use [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) on the data and compare it to the linear regression model
* Use the [ElasticNet](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html#sklearn.linear_model.ElasticNet) model
    * The model combines the L1 and L2 regularization
    * Study how the model works
    * The model has 2 important parameters, `alpha` and `l1_ratio`
    * Try to tune them, i.e. try various combinations and plot the results (you can plot MAE or MSE for different `alpha` and `l1_ratio` values)
* Compare the `KNeighborsRegressor` and `ElasticNet` models - which of them was more accurate?
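One possible way to structure the tuning loop is sketched below on synthetic data, so it is a starting point rather than a solution for the petrol dataset. The grid values and `n_neighbors=5` are arbitrary assumptions.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Synthetic, hypothetical data standing in for the real features and target.
X, y = make_regression(n_samples=200, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

rows = []
for alpha in (0.01, 0.1, 1.0):
    for l1_ratio in (0.1, 0.5, 0.9):
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(X_train_s, y_train)
        rows.append({"alpha": alpha, "l1_ratio": l1_ratio,
                     "MAE": mean_absolute_error(y_test, model.predict(X_test_s))})
grid = pd.DataFrame(rows)
# Rows = alpha, columns = l1_ratio; this table can be passed to a heatmap or line plot.
mae_table = grid.pivot(index="alpha", columns="l1_ratio", values="MAE")

knn = KNeighborsRegressor(n_neighbors=5).fit(X_train_s, y_train)
knn_mae = mean_absolute_error(y_test, knn.predict(X_test_s))
print(mae_table)
print("KNN MAE:", knn_mae)
```

With only 48 rows in the real dataset, a single test split is noisy, so cross-validation (e.g. `KFold`) may give a more reliable comparison between the two models.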