# Linear Regression with Regularization

## Overview

Supervised machine learning problems fall into two broad categories: **`Classification`** and **`Regression`**.
* **`Classification`** is the task of assigning labels *`(yes/no, True/False, Ham/Spam)`* to data samples that belong to different classes. 

It can be either a two-class classification problem (typically using the **sigmoid function**) or a multi-class classification problem (typically using the **softmax function**). 

* **`Regression`**, on the other hand, is the task of predicting continuous (scalar) values by modeling the relationship between a dependent variable and one or more independent features.

`Linear Regression`, therefore, predicts the value of a dependent variable (y) from a given independent variable (x) by fitting a straight line. To find that line, the **Ordinary Least Squares (OLS)** method is widely used: it chooses the line that minimizes the sum of squared residuals. 
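
As a minimal sketch of what OLS does, the snippet below fits a line to a small set of made-up points with `np.linalg.lstsq`. The toy data and the variable names are assumptions for illustration only and are not part of the Boston analysis.

```python
import numpy as np

# Hypothetical toy data lying exactly on y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: a column of ones (intercept) next to the feature column.
A = np.column_stack([np.ones_like(x), x])

# lstsq returns the coefficients that minimize the sum of squared residuals.
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta
print(intercept, slope)
```

Since the points lie on a line, the recovered intercept and slope should match the generating values up to floating-point error.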

For more complex problems, **`multivariate regression / polynomial regression / regularization terms`** are used to improve performance while keeping the loss under control. 

In this notebook, we will first look at some basic assumptions behind a robust linear regression model.

We will discuss what those assumptions are and, when they are not satisfied, possible ways to transform our data.

---

## Notebook Setting

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

Here we will use the famous Boston Housing Dataset.

In [2]:
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
from sklearn.datasets import load_boston

boston_dataset = load_boston()

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston['Target'] = boston_dataset.target
boston = boston[boston.columns[-1:].append(boston.columns[:-1])]
boston.head()

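| | Target | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |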


In [3]:
boston.isnull().values.any()

False

**No null values are detected. Good to go!**

---

## Preparation

In [4]:
from sklearn.linear_model import LinearRegression, LassoCV, RidgeCV, ElasticNetCV
from sklearn.model_selection import train_test_split

In [5]:
X = boston.drop(columns=['Target'])
y = boston['Target']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

## Benchmark Regression Model

In [7]:
# Note: the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2.
lr = LinearRegression(normalize=True)

In [8]:
lr.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=True)

In [9]:
lr_pred = lr.predict(X_test)

---

## Loss Functions for Evaluating Model Performances

Machines learn by means of a **`loss function`**: a method of evaluating how well a specific algorithm models the given data. 

To evaluate the performance of a regression model, various loss functions have been introduced, and different ones are preferred in different scenarios.

There’s no one-size-fits-all loss function for all machine learning problems. Each loss function works well in particular use cases and highlights a different aspect of model performance, so the choice should be considered carefully before implementing one.

<img src='loss_all.png' width="400" height="150">

In the following sections, we are going to dive deep into some of the most common loss functions, listed below.

* **`MSE`** : Mean Squared Error
* **`RMSE`**: Root Mean Squared Error
* **`MAE`** : Mean Absolute Error
* **`MBE`** : Mean Bias Error
* **`MAPE`**: Mean Absolute Percentage Error
* **`R²`** : R-Squared and Adjusted R-Squared
* **`RMSLE`** : Root Mean Squared Logarithmic Error 

### Mean Squared Error (MSE / Quadratic Loss / L2 Loss)

<img src="mse.png" width="300" height="400">

**`Mean Squared Error (MSE)`** is one of the simplest and most common loss functions in regression analysis. For each data point, it calculates the squared difference between the prediction `ŷ` and the original data point `y`, and then takes the mean of those values.

Because the errors are squared, predictions that are far from the actual values are penalized much more heavily than predictions that deviate only slightly, which can overstate how bad a model is when a few outliers are present. 

`MSE` is often preferred over other metrics such as `MAE` because it is **differentiable** everywhere and is therefore easier to optimize.

**Recommended Use Cases** :
* **`When Large Errors are undesirable`**:
Since the errors are squared before they are averaged, `MSE` gives disproportionately high weight to large errors. Therefore, if large errors are particularly undesirable and should be cared about, use `MSE` to expose them!


* **`Further Calculation`**:
`MSE` is also widely used because it is differentiable, which makes it convenient for gradient-based optimization.

**Disadvantages** : 
* **`Overestimates how bad a model is`** : `MSE` is easily affected by outliers. A very large `MSE` often indicates that outliers are present.


* **`A low score may imply overfitting`** : A low `MSE` does not necessarily imply a good model; it may come from an overfitted model that simply fits every training data point.


* **`Noisiness`** : If the data is noisy (that is, it includes some randomness or is otherwise not entirely reliable), even a “perfect” model may have a high `MSE`, so it becomes hard to judge how well the model is performing.

In [10]:
def mse(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        err = pred[i] - y[i]
        sum_err += (err**2)
    return(sum_err / float(len(pred)))

In [11]:
mse(lr_pred, y_test)

28.940057594323065
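
The explicit loop can also be expressed as a single vectorized NumPy expression. The following is a minimal sketch; the name `mse_vectorized` and the sample values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def mse_vectorized(pred, actual):
    """Mean of the squared differences, computed without an explicit loop."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.mean((pred - actual) ** 2))

# Hypothetical values: errors are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3.
example = mse_vectorized([3.0, 5.0, 2.0], [2.0, 5.0, 4.0])
print(example)
```

Passing a pandas Series works as well, because `np.asarray` converts it to an array before the subtraction.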

### Root Mean Squared Error (RMSE)

<img src="rmse.png" width="300" height="400">

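Root Mean Squared Error is simply the square root of the `MSE`. Then why bother to create another loss function?

Because `MSE` is expressed in squared units of the target, it is hard to interpret directly. Taking the square root brings the error back to the same units as the target variable, so `RMSE` can be read on the scale of `y`.

The square root is a monotonically increasing function, so `RMSE` preserves the ordering given by `MSE`: a model that has a higher `MSE` than another model will also have a higher `RMSE`.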

In [12]:
# Calculate root mean squared error
def rmse(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        err = pred[i] - y[i]
        sum_err += (err**2)
    return(np.sqrt(sum_err / float(len(pred))))

In [13]:
rmse(lr_pred, y_test)

5.379596415561586
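
A small sketch with hypothetical numbers: because the square root is monotonic, ranking two models by `RMSE` gives the same order as ranking them by `MSE`. The arrays `pred_a` and `pred_b` stand in for two imaginary models and are assumptions for illustration.

```python
import numpy as np

actual = np.array([10.0, 20.0, 30.0])
pred_a = np.array([12.0, 18.0, 33.0])  # errors: 2, -2, 3
pred_b = np.array([15.0, 14.0, 36.0])  # errors: 5, -6, 6

mse_a = np.mean((pred_a - actual) ** 2)  # (4 + 4 + 9) / 3
mse_b = np.mean((pred_b - actual) ** 2)  # (25 + 36 + 36) / 3
rmse_a, rmse_b = np.sqrt(mse_a), np.sqrt(mse_b)

# The model with the lower MSE also has the lower RMSE.
print(mse_a < mse_b, rmse_a < rmse_b)
```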

### Mean Absolute Error (MAE / L1 Loss)

<img src="mae.png" width="300" height="400">

**`Mean Absolute Error (MAE)`** : `MAE` measures the average magnitude of the absolute differences between each prediction `ŷ` and its paired data point `y`, without considering their direction. Unlike `RMSE` and `MSE`, `MAE` is more robust to extreme values since it has no squared term to amplify large errors. That is, all the individual differences are weighted equally.

**`Mean Absolute Error`** is widely used in fields like finance, where a `$10` error is usually considered exactly twice as bad as a `$5` error. 

The `MSE` metric, on the other hand, treats a `$10` error as four times worse than a `$5` error. 
Therefore, `MAE` is often easier to justify than `MSE`.
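
The `$5` versus `$10` comparison can be checked with a couple of lines of arithmetic. This is only an illustrative sketch; the variable names are assumptions.

```python
errors = [5.0, 10.0]

abs_penalty = [abs(e) for e in errors]  # per-point contribution to MAE
sq_penalty = [e ** 2 for e in errors]   # per-point contribution to MSE

abs_ratio = abs_penalty[1] / abs_penalty[0]
sq_ratio = sq_penalty[1] / sq_penalty[0]
print(abs_ratio, sq_ratio)  # 2.0 4.0
```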

**Recommended Use Cases** :
* **`Pay equal attention to all data points`**:
Unlike `MSE`, which places higher emphasis on outliers, `MAE` has no squared term, so it is **suitable** for applications where you want to **pay equal attention to all the data points**. Therefore, if you **have outliers and want to treat them like any other point**, use `MAE`!


* **`Easy comparison between differences`**:
`MAE` is a linear score, which means that all the individual differences are weighted equally in the average. For example, the difference between `10` and `0` counts exactly twice as much as the difference between `5` and `0`.

**Disadvantage** : 
* **`Harder to optimize`** : `MAE` is harder to work with analytically because of its absolute value function. 

Its gradient with respect to the prediction is always either `-1` or `1`, regardless of how large the error is. Worse, at `ŷ == y` it is not differentiable at all!
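
To make the gradient behaviour concrete, the sketch below compares the per-point derivative of the squared error with a subgradient of the absolute error. The function names are hypothetical, and returning `0` at `ŷ == y` is one conventional choice of subgradient (it is what `np.sign` produces), not the only valid one.

```python
import numpy as np

def mse_grad(pred, actual):
    """Derivative of (pred - actual)**2 w.r.t. pred: scales with the error."""
    return 2.0 * (pred - actual)

def mae_subgrad(pred, actual):
    """A subgradient of |pred - actual| w.r.t. pred: only the sign of the error."""
    return np.sign(pred - actual)

actual = np.zeros(4)
pred = np.array([-3.0, -0.1, 0.0, 3.0])

print(mse_grad(pred, actual))     # shrinks as the prediction approaches the target
print(mae_subgrad(pred, actual))  # constant magnitude away from zero
```

The constant magnitude is why gradient-based optimizers tend to need a decaying step size to settle on a minimum of `MAE`.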

In [14]:
def mae(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        sum_err += abs(pred[i] - y[i])
    return(sum_err / float(len(pred)))

In [15]:
mae(lr_pred, y_test)

3.538685478441911

Note: I have not fully understood this article yet: https://medium.com/human-in-a-machine-world/mae-and-rmse-which-metric-is-better-e60ac3bde13d

### Mean Bias Error (MBE)

<img src="mbe.png" width="300" height="400">

**`Mean Bias Error`** is especially useful for detecting the average bias of a model. 

Note that since the absolute value is not taken, positive and negative errors cancel out, so the metric captures only the systematic bias and not the spread of the errors. 

Although less informative as a measure of accuracy in practice, **`MBE`** can be used to determine whether the model is on target on average or has a positive or negative bias. With the `pred - y` convention used below, a positive value means that the model over-predicts on average. 

In [16]:
def mbe(pred, y_test):
    sum_err = 0.0
    y = y_test.values
    for i in range(len(y_test)):
        sum_err += pred[i] - y[i]
    return(sum_err / float(len(pred)))

In [18]:
mbe(lr_pred, y_test)

0.35401600164261304
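
The cancellation effect is easy to see on a toy example. In this sketch (hypothetical values, chosen so that the errors are symmetric), `MBE` is zero even though every single prediction is off.

```python
import numpy as np

actual = np.array([10.0, 10.0, 10.0, 10.0])
pred = np.array([12.0, 8.0, 13.0, 7.0])  # errors: +2, -2, +3, -3

mbe_value = np.mean(pred - actual)          # signed errors cancel out
mae_value = np.mean(np.abs(pred - actual))  # magnitudes do not cancel
print(mbe_value, mae_value)  # 0.0 2.5
```

This is why `MBE` is best read together with a magnitude-based metric such as `MAE` or `RMSE`.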

### Mean Absolute Percentage Error (MAPE)
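
A minimal sketch of `MAPE`, assuming the usual definition: the mean of `|y - ŷ| / |y|`, expressed as a percentage. The function name, the zero check, and the sample values are assumptions for illustration.

```python
import numpy as np

def mape(pred, actual):
    """Mean absolute percentage error, in percent."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    if np.any(actual == 0):
        # The ratio is undefined when an actual value is zero.
        raise ValueError("MAPE is undefined when an actual value is zero")
    return float(np.mean(np.abs((actual - pred) / actual)) * 100.0)

# Hypothetical values: both predictions are off by 10% of the actual value.
example = mape([90.0, 220.0], [100.0, 200.0])
print(example)
```

Because each error is divided by the actual value, the metric is scale-free, but it becomes unstable when actual values are close to zero.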

### Root Mean Squared Logarithmic Error ( RMSLE )

<img src="rmsle2.png" width="500" height="400">

A little logarithm algebra shows that the main differences between `RMSLE` and `RMSE` are the following:

* `RMSLE` only considers the **relative error between the predicted and the actual value**. **The absolute difference (the scale of the error) is not significant**. `RMSE`, on the other hand, increases in magnitude as the scale of the error increases.


* `RMSLE` penalizes predictions that are lower than the actual values more heavily than predictions that are higher than the actual values by the same amount. For example, assuming the common `log(1 + x)` form of the metric and an actual value of `100`, predicting `50` gives a log error of `|ln(51) - ln(101)| ≈ 0.683`, whereas predicting `150` gives only `|ln(151) - ln(101)| ≈ 0.402`.

<img src="rmsleviz.png" width="500" height="400">

**Recommended Use Cases**:

* **`When biased penalty is acceptable`** : `RMSLE` incurs a larger penalty for `under-estimation` of the actual values than for `over-estimation`.


* **`When underestimation is not acceptable but overestimation can be allowed`** : `RMSLE` is especially useful for business cases where underestimating the target variable is not acceptable but overestimating it can be tolerated.
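
Both properties can be checked numerically. The sketch below assumes the common `log(1 + x)` form of `RMSLE` (via `np.log1p`); the function name and the sample values are illustrative assumptions.

```python
import numpy as np

def rmsle(pred, actual):
    """Root mean squared logarithmic error using log(1 + x); inputs must exceed -1."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(pred) - np.log1p(actual)) ** 2)))

under = rmsle([50.0], [100.0])      # under-prediction by 50
over = rmsle([150.0], [100.0])      # over-prediction by 50
scaled = rmsle([1500.0], [1000.0])  # same 1.5x ratio at ten times the scale

print(under > over)      # under-prediction is penalized more
print(round(over, 2), round(scaled, 2))
```

The last two values are nearly identical because the metric depends on the ratio between prediction and actual value rather than on their absolute difference.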

### R² ( R-Squared / Coefficient of Determination )

Imagine that you get an `MSE` score of `5.21`. What should you do? Is this a decent value?

To tackle this ambiguity, **`R²`** is a useful metric for evaluating how well the model fits the data points.

<img src="r2.png" width="250" height="400">

<img src="r2-1.png" width="300" height="400">

**`MSE(model)`** : the mean squared error of whatever regression model we implemented (proportional to the residual sum of squares).

**`MSE(baseline)`** : the mean squared error of a constant baseline model (proportional to the total sum of squares). This baseline can be interpreted as the **`simplest model`** we can derive: always predict the average of all samples.

**`y̅`** : the mean of the observed yᵢ.

**Simple Explanations**:

1. **`R²`** is a scale-free metric and is widely used to judge how much room for improvement a model has.
2. **`R²`** is based on the ratio between **how good our model is** and **how good the naive mean model is**.

This metric compares the fit of the chosen model with that of a horizontal straight line (the null hypothesis).

A value close to `1` indicates that the model fits the data almost perfectly (with close to zero error), and a value close to `0` indicates a model that performs about the same as the baseline.

Note that **`R²` can actually be negative**. This happens when the chosen model does not follow the trend of the data, so that it **fits worse than a horizontal line**. 

Defined this way, `R²` is not the square of any quantity, so it can take a negative value without violating any rules of math. 
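
A short sketch of the computation, assuming the standard form `1 - SS_res / SS_tot` (equivalent to `1 - MSE(model) / MSE(baseline)`, since both sums are divided by the same sample count). The function name and the toy values are assumptions for illustration.

```python
import numpy as np

def r2_manual(pred, actual):
    """R² as one minus the ratio of residual to total sum of squares."""
    pred = np.asarray(pred, dtype=float)
    actual = np.asarray(actual, dtype=float)
    ss_res = np.sum((actual - pred) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

actual = [1.0, 2.0, 3.0, 4.0, 5.0]

baseline = r2_manual([3.0] * 5, actual)                  # always predict the mean
reversed_trend = r2_manual([5.0, 4.0, 3.0, 2.0, 1.0], actual)  # opposite trend
print(baseline, reversed_trend)  # 0.0 -3.0
```

Predicting the mean scores exactly `0`, while a model that goes against the trend scores well below it.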

<img src="r2neg.png" width="350" height="400">

### Adjusted R² ( Adj. R-Squared )

What is the problem with plain `R²`?

`R²` can reward overfitting. Imagine that you have five data points and you obtain an `R²` score of `0.70`. If you want to improve your `R²`, one simple way is to add one more variable to the model. 

However, on the training data the score never decreases as terms are added, even when the model is not actually improving. This overfitting problem may mislead the researcher. In extreme cases, you can even get an `R²` of `1` when the model has as many parameters as there are data points.
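
The extreme case can be reproduced with a hedged toy experiment: a degree-4 polynomial has five coefficients, so it passes through any five points with distinct `x` values, even if `y` is pure noise. The random data and the use of `np.polyfit` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(5, dtype=float)
y = rng.normal(size=5)  # pure noise: there is no real relationship to learn

# Five coefficients for five data points: the fit interpolates every point.
coeffs = np.polyfit(x, y, deg=4)
pred = np.polyval(coeffs, x)

ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(r2)  # essentially 1, up to floating-point error
```

A near-perfect score here says nothing about predictive power, because the model has simply memorized the noise.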

<img src="adjr2.png" width="400" height="400">

**`p`** : Number of independent variables


**`N`** : Number of observations (Sample Size)

**`Adjusted R²`** is thus introduced to solve this problem.

`Adjusted R²` never exceeds `R²`, because it penalizes each additional predictor, and it increases only when a new predictor improves the model by more than would be expected by chance.
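
As a sketch, assuming the standard formula `1 - (1 - R²)(N - 1) / (N - p - 1)` with `p` and `N` as defined above, the five-point example with `R² = 0.70` works out as follows. The function name is hypothetical.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R² for a model with n_features predictors fitted on n_samples points."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

one_predictor = adjusted_r2(0.70, n_samples=5, n_features=1)   # 1 - 0.3 * 4 / 3
two_predictors = adjusted_r2(0.70, n_samples=5, n_features=2)  # 1 - 0.3 * 4 / 2
print(round(one_predictor, 3), round(two_predictors, 3))  # 0.6 0.4
```

With the same raw `R²`, the model that uses more predictors receives the lower adjusted score.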

### End Notes

Keep in mind that there’s no one-size-fits-all loss function for all machine learning problems. Each loss function works well in particular use cases and highlights a different aspect of model performance, so the choice should be considered carefully before implementing one. 

---

## References

https://towardsdatascience.com/regression-an-explanation-of-regression-metrics-and-what-can-go-wrong-a39a9793d914

https://medium.com/@george.drakos62/how-to-select-the-right-evaluation-metric-for-machine-learning-models-part-1-regrression-metrics-3606e25beae0

https://stats.stackexchange.com/questions/12900/when-is-r-squared-negative