# Evaluation Metrics for Regression Tutorial #
This notebook is based on the tutorial at: https://youtu.be/TrzUlo4BImM?si=ubypDBivjnEHJ-jb <br>
It covers the common evaluation metrics for regression and explains when and why to use each one.  <br>
<br>
The tutorial talks about:
  - Explaining Regression
  - R-Squared 
  - Adjusted R-Squared
  - MSE & RMSE (Mean Squared Error & Root Mean Squared Error)
  - MAE (Mean Absolute Error)
  - MAPE (Mean Absolute Percentage Error)
  - Honorable Mentions
<br>

In [1]:
from sklearn.datasets import fetch_california_housing

In [None]:
# Retrieve the data; it returns a dictionary-like Bunch object.
data = fetch_california_housing()
data

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [None]:
# Extract the data from the dictionary returned
X = data['data']
y = data['target']

In [None]:
# Split the data into training and test sets.
# No random_state is set, so the split (and every score below) changes from run to run.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
# Set up two models, one Linear Regression, one RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

reg = LinearRegression()
forest = RandomForestRegressor(n_jobs=-1)

In [None]:
# Scale the data, common preprocessing
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [8]:
reg.fit(X_train, y_train)

In [9]:
forest.fit(X_train, y_train)

### R² Score ###
R² Score, also known as the **coefficient of determination**, describes how much of the variance in the target can be explained by the model.<br>
<br>
R² Score usually falls between 0 and 1, but it has no lower bound and can drop below 0 when the model is really bad.
  - 1 means the model perfectly predicts the data — it explains 100% of the variability.
  - 0 means the model's predictions are no better than just guessing the average of all the data points.
  - Negative values mean the model is worse than just guessing the average.


Imagine you're trying to predict the height of a plant based on how much sunlight it gets. The R² score tells you how much of the plant's height is explained by the sunlight.<br>
<br>
If R² = 0.8, it means 80% of the variation in the plant's height is explained by sunlight, and the remaining 20% might be due to other factors (like water or soil quality) or random noise.<br>
<br>
<br>
The Math behind the R² score is:<br>
$$ R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} $$

Characteristics of the R² score:
  - For a least-squares linear model scored on its training data, adding variables will either do nothing or explain more, so the R² score stays the same or improves and never gets worse. On held-out data, extra variables can still lower it.
  - Easy to interpret as a percentage of explained variance
  - Does not by itself tell us whether the model is good or bad: a model can have a high R² score and still have large residuals, explaining most of the variance while its predictions remain far off in absolute terms.
  - More suited for linear models, not so much for random forests, neural networks, etc.
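
The formula above can be checked by hand. Below is a minimal NumPy sketch; `r2_by_hand` and the toy arrays are made-up names and values for illustration, not part of scikit-learn or the housing data. It reproduces the three reference points listed earlier: a perfect model, a model that always predicts the mean, and a model that does worse than the mean.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R² = 1 - SS_res / SS_tot, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

# Made-up toy targets (hypothetical values).
y_true = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_by_hand(y_true, y_true)                       # predictions equal the targets
r2_mean = r2_by_hand(y_true, np.full(4, y_true.mean()))       # always predict the average
r2_reversed = r2_by_hand(y_true, y_true[::-1])                # predictions in reverse order

print(r2_perfect, r2_mean, r2_reversed)
```

The reversed predictions give a negative score because their squared errors are larger than those of simply guessing the average.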

In [14]:
# By default, the score() method of a regressor returns the R² score (coefficient of determination)
R2_reg = reg.score(X_test, y_test)
R2_reg

0.5999625972136192

In [17]:
# The score() method of the Random Forest model also returns the R² score by default
R2_forest = forest.score(X_test, y_test)
R2_forest

0.8137081550298793

### Adjusted R² score ### 

The adjusted R² penalizes the score for adding unnecessary predictors that don't improve the model significantly.<br>
Adjusted R² Score:
  - Penalizes complexity (extra predictors)
  - Better than plain R² for comparing models with different numbers of predictors
  - Should be used in combination with other metrics
  - Also more suited for linear models
<br>


The equation for adjusted R² score is:
$$
R_{\text{adj}}^2 = 1 - \left(1 - R^2\right) \frac{n - 1}{n - p - 1}
$$
where:<br>
*n*: the number of data points<br>
*p*: the number of independent variables
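
To see how strongly the penalty depends on *n* and *p*, here is a small sketch in plain Python. The helper name `adjusted_r2` and the small-sample numbers are assumptions made for illustration. The first call plugs in the linear-regression R² and test-set shape reported in this notebook.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R² for n data points and p predictors, per the formula above."""
    if n - p - 1 <= 0:
        raise ValueError("adjusted R² needs n > p + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Values from this notebook's run: R² of the linear model, 4128 test rows, 8 features.
adj_notebook = adjusted_r2(0.5999625972136192, n=4128, p=8)

# Hypothetical small sample with the same number of predictors.
adj_small = adjusted_r2(0.60, n=30, p=8)

print(adj_notebook, adj_small)
```

With thousands of rows and only 8 predictors the adjustment is tiny, while with 30 rows the same 8 predictors pull the score down noticeably.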

In [13]:
import numpy as np
X_test.shape


(4128, 8)

In [15]:
n = X_test.shape[0]  # Datapoints
p = X_test.shape[1]  # Columns, or predictors

In [16]:

R2_adj_reg = 1 - ((1 - R2_reg) * (n - 1)) / (n - p - 1)
R2_adj_reg

0.5991856369751412

In [18]:
R2_adj_forest = 1 - ((1 - R2_forest) * (n - 1)) / (n - p - 1)
R2_adj_forest

0.8133463354717921

### MSE (Mean Squared Error) ###
The MSE measures the average squared difference between actual and predicted values, giving greater weight to larger errors.<br>
MSE is:
  - Not very intuitive because of the squaring
  - Emphasizes larger errors due to the squaring, useful when large errors are unacceptable
  - Differentiable, useful in optimization algorithms
  - Good when trying to minimize overall error in the model
  - Highly sensitive to outliers
  - Scale-dependent

The equation for MSE is:

$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2
$$
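
The effect of the squaring is easiest to see on a toy example. The sketch below uses NumPy only; `mse_by_hand` and the arrays are made-up for illustration. It compares many small misses against one large miss.

```python
import numpy as np

def mse_by_hand(y_true, y_pred):
    """Mean of the squared errors, following the formula above."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(errors ** 2))

# Hypothetical targets and two sets of predictions.
y_true = np.array([2.0, 3.0, 4.0, 5.0])
small_misses = np.array([2.5, 2.5, 4.5, 4.5])  # every prediction is off by 0.5
one_big_miss = np.array([2.0, 3.0, 4.0, 9.0])  # three exact, one off by 4

mse_small = mse_by_hand(y_true, small_misses)
mse_big = mse_by_hand(y_true, one_big_miss)

print(mse_small, mse_big)
```

The total absolute error only doubles between the two cases (2.0 versus 4.0), but the MSE grows by a factor of 16, which is the outlier sensitivity described in the list above.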

In [19]:
from sklearn.metrics import mean_squared_error

y_pred = reg.predict(X_test)
mse_reg = mean_squared_error(y_test, y_pred)
mse_reg


0.5131746781041621

In [20]:

y_pred = forest.predict(X_test)
mse_forest = mean_squared_error(y_test, y_pred)
mse_forest

0.23897829780437432

### RMSE (Root Mean Squared Error) ###
RMSE takes the square root of the MSE, converting the error back to the original unit of measurement and making it easier to interpret.  

The Root Mean Squared Error (RMSE) is calculated as:

$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
$$
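
Because RMSE is just the square root of MSE, it can be computed without a dedicated function. This matters on older installations: as far as we know, `root_mean_squared_error` is only available from scikit-learn 1.4 onward. The sketch below is a NumPy-only fallback; `rmse_by_hand` and the toy arrays are hypothetical, and the last line checks the relationship against the linear-regression MSE reported in this notebook.

```python
import math
import numpy as np

def rmse_by_hand(y_true, y_pred):
    """Square root of the mean squared error."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical toy data: three exact predictions and one miss of 4.
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.0, 3.0, 4.0, 9.0])
rmse_toy = rmse_by_hand(y_true, y_pred)

# Square root of the MSE value reported for the linear model in this notebook's run.
rmse_from_mse = math.sqrt(0.5131746781041621)

print(rmse_toy, rmse_from_mse)
```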

In [21]:
from sklearn.metrics import root_mean_squared_error

y_pred = reg.predict(X_test)
rmse_reg = root_mean_squared_error(y_test, y_pred)
rmse_reg

0.7163621138112778

In [22]:
y_pred = forest.predict(X_test)
rmse_forest = root_mean_squared_error(y_test, y_pred)
rmse_forest

0.4888540659587218

### MAE (Mean Absolute Error) ###
MAE measures the average absolute difference between the actual and predicted values.  Unlike MSE or RMSE, it doesn't square the errors, so it treats all errors equally without emphasizing larger ones.  It's easy to interpret since it's in the same units as the target variable.<br>
<br>
MAE Key Points:
  - Measures how wrong predictions were on average, in original units
  - Reads as: "on average, our model missed the correct value by X"
  - Easy to understand
  - Less sensitive to outliers than MSE or RMSE
  - Does not penalize large errors as heavily
  - Not differentiable at zero, which can affect some optimization methods.

The Mean Absolute Error (MAE) is calculated as:

$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^n \lvert y_i - \hat{y}_i \rvert
$$
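
A quick way to see that MAE penalizes large errors less heavily is to compute it next to RMSE on the same data. This is a minimal NumPy sketch with made-up arrays and hypothetical helper names.

```python
import numpy as np

def mae_by_hand(y_true, y_pred):
    """Mean of the absolute errors, following the formula above."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(errors)))

def rmse_by_hand(y_true, y_pred):
    """Square root of the mean squared error, for comparison."""
    errors = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Hypothetical toy data: three exact predictions and one miss of 4.
y_true = np.array([2.0, 3.0, 4.0, 5.0])
y_pred = np.array([2.0, 3.0, 4.0, 9.0])

mae_toy = mae_by_hand(y_true, y_pred)
rmse_toy = rmse_by_hand(y_true, y_pred)

print(mae_toy, rmse_toy)
```

Both numbers are in the units of the target, but the single large miss counts twice as much in the RMSE as in the MAE here.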

In [24]:
from sklearn.metrics import mean_absolute_error

y_pred = reg.predict(X_test)
mae_reg = mean_absolute_error(y_test, y_pred)
mae_reg

0.5249127231490586

In [25]:
y_pred = forest.predict(X_test)
mae_forest = mean_absolute_error(y_test, y_pred)
mae_forest

0.3199667818313955

### MAPE (Mean Absolute Percentage Error) ###
MAPE expresses the prediction error as a percentage of the actual value, making it unitless and easier to compare across datasets. It’s sensitive to very small actual values which can lead to high percentage errors even for small differences.<br>
<br>
MAPE Key Points:<br>
  - Expresses errors as percentages, which makes them easy to interpret
  - Useful when comparing forecast accuracy between datasets
  - Undefined when actual values are zero, and unstable (very large) when they are near zero

The Mean Absolute Percentage Error (MAPE) is calculated as:

$$
\text{MAPE} = \frac{1}{n} \sum_{i=1}^n \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert \times 100
$$
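
The sketch below computes MAPE by hand with NumPy to illustrate two points from the list above. `mape_by_hand` and all values are made-up for illustration. Like scikit-learn's `mean_absolute_percentage_error`, this helper returns a fraction, so it is multiplied by 100 to match the formula.

```python
import numpy as np

def mape_by_hand(y_true, y_pred):
    """Mean absolute error relative to the actual value, as a fraction (not x100)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))

# Hypothetical values: two predictions off by 10%, one exact.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

mape_fraction = mape_by_hand(y_true, y_pred)
mape_percent = 100.0 * mape_fraction

# A miss of only 0.5 on a tiny actual value of 0.01.
mape_tiny = mape_by_hand(np.array([0.01]), np.array([0.51]))

print(mape_fraction, mape_percent, mape_tiny)
```

The last case is an absolute error of only 0.5, yet it produces a MAPE of about 50 as a fraction, i.e. roughly 5000%.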

In [26]:
from sklearn.metrics import mean_absolute_percentage_error

y_pred = reg.predict(X_test)
# scikit-learn returns a fraction, without the x100 in the formula above (0.32 means 32%)
mape_reg = mean_absolute_percentage_error(y_test, y_pred)
mape_reg

0.31925823782399426

In [27]:
y_pred = forest.predict(X_test)
mape_forest = mean_absolute_percentage_error(y_test, y_pred)
mape_forest

0.18112797390957516

### MedAE (Median Absolute Error) ###
MedAE is less sensitive to large outliers compared to MAE because it uses the median instead of the mean.  It’s ideal when your data contains extreme errors that could distort the average in MAE.<br>
<br>
MedAE Key Points:<br>
  - Robust to outliers
  - Reflects typical magnitude of errors
  - May not capture effect of large errors
  - Less sensitive to overall error distribution

The Median Absolute Error (MedAE) is calculated as:

$$
\text{MedAE} = \text{median} \left( \lvert y_i - \hat{y}_i \rvert \right)
$$
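
To illustrate the robustness to outliers, the sketch below compares the median and the mean of the same absolute errors. It uses NumPy only, and the arrays are made-up values for illustration.

```python
import numpy as np

# Hypothetical toy data: four close predictions and one extreme miss.
y_true = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y_pred = np.array([2.1, 2.8, 4.3, 5.1, 16.0])

abs_errors = np.abs(y_true - y_pred)

medae_toy = float(np.median(abs_errors))  # typical error, ignores the extreme miss
mae_toy = float(np.mean(abs_errors))      # pulled up by the extreme miss

print(medae_toy, mae_toy)
```

The median stays at the size of a typical miss, while the mean is dominated by the single error of 10. This is also why MedAE may not capture the effect of large errors.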

In [28]:
from sklearn.metrics import median_absolute_error

y_pred = reg.predict(X_test)
medae_reg = median_absolute_error(y_test, y_pred)
medae_reg

0.4128123531686718

In [29]:
y_pred = forest.predict(X_test)
medae_forest = median_absolute_error(y_test, y_pred)
medae_forest

0.20048010000000072

### Honorable Mentions ###
  - MSLE (Mean Squared Logarithmic Error)
  - RMSLE (Root Mean Squared Logarithmic Error)
  - Explained Variance Score
  - Symmetric MAPE
  - Huber Loss
  - AIC and BIC (Akaike and Bayesian Information Criteria)
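
As a rough illustration of two of these, here is a NumPy sketch of MSLE/RMSLE and the explained variance score, written from their usual textbook definitions. The helper names and toy values are assumptions, and the remaining metrics are not sketched here. MSLE is taken on the log(1 + y) scale and therefore needs non-negative values.

```python
import numpy as np

def msle_by_hand(y_true, y_pred):
    """Mean squared difference of log(1 + y); requires non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

def explained_variance_by_hand(y_true, y_pred):
    """1 - Var(residuals) / Var(y); unlike R², it ignores a constant bias."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(1.0 - np.var(y_true - y_pred) / np.var(y_true))

# Hypothetical values: both predictions are too high by a factor of 10 in (1 + y).
rmsle_toy = float(np.sqrt(msle_by_hand([9.0, 99.0], [99.0, 999.0])))

# Hypothetical predictions that are always exactly 1.0 too high.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
ev_biased = explained_variance_by_hand(y_true, y_true + 1.0)

print(rmsle_toy, ev_biased)
```

RMSLE treats the two misses the same because it looks at relative rather than absolute differences. The explained variance of the constantly biased predictions is still 1.0, whereas their R² would be well below 1.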