# **Evaluating Regression Models**

There are multiple evaluation metrics we can use. Some of them are:
- **Mean/Median**
- **Standard Deviation**
- **R Square/Adjusted R Square**
- **Mean Squared Error *(MSE)* / Root Mean Squared Error *(RMSE)***
- **Mean Absolute Error *(MAE)***

---

## **Preparing Data**

### Importing Libraries

In [1]:
import numpy as np
from sklearn.datasets import fetch_california_housing

### Importing Data

In [2]:
X, y = fetch_california_housing(as_frame=True,
                               return_X_y=True)

Here the *data* (`X`) and *target* (`y`) values are loaded separately. No merging is needed, because `train_test_split` splits both into *Train* and *Test* sets together, using the same rows for each.

### Splitting the data
***Please Note:** I will not be focusing on splitting the data in this notebook*

In [3]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

---

## **Preprocessing the data**

Let's take a quick look at our data to see what kind of values we are dealing with.

In [4]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   MedInc      20640 non-null  float64
 1   HouseAge    20640 non-null  float64
 2   AveRooms    20640 non-null  float64
 3   AveBedrms   20640 non-null  float64
 4   Population  20640 non-null  float64
 5   AveOccup    20640 non-null  float64
 6   Latitude    20640 non-null  float64
 7   Longitude   20640 non-null  float64
dtypes: float64(8)
memory usage: 1.3 MB


As we can see, there are no categorical values, so we only need to prepare our data with some numerical transformations. We could combine our model and all the preprocessing steps into a single `Pipeline`, but for the sake of simplicity let's create just a pre-processing pipeline.
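For reference, the combined approach mentioned above could look roughly like the sketch below. It is not what the rest of this notebook uses. The synthetic `X_demo` / `y_demo` arrays and the step name `'model'` are assumptions made only to keep the example self-contained.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: 200 rows, 3 numeric features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Preprocessing steps and the model chained into one estimator
full_pipeline = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])

# fit() runs fit_transform on each preprocessing step, then fits the model;
# predict() applies the already-fitted transformers before predicting
full_pipeline.fit(X_demo, y_demo)
demo_pred = full_pipeline.predict(X_demo)
print(demo_pred.shape)
```

The practical benefit of this form is that the same fitted transformations are applied automatically whenever `predict` is called on new data.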

### Importing Libraries

In [5]:
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

### Creating a pre-processing pipeline

In [6]:
preprocessing = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scaler', StandardScaler())
])

### Pre-processing the data

In [7]:
X_train_prepared = preprocessing.fit_transform(X_train)

---

## **Preparing a simple regression model**
*(For the purpose of this notebook, we will use a simple Linear Regression Model)*

### Importing Library

In [8]:
from sklearn.linear_model import LinearRegression

### Initializing the model

In [9]:
lin_reg = LinearRegression()

### Fitting the model to our pre-processed data

In [10]:
lin_reg.fit(X_train_prepared, y_train)

LinearRegression()

---

## **Let's get our predictions**

In [11]:
y_pred = lin_reg.predict(X_train_prepared)
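Note that these predictions are made on the training data, so every metric below describes the fit on the training set rather than performance on unseen data. A hedged sketch of scoring on a held-out set follows. The synthetic data and the names `prep`, `X_tr`, `X_te`, `train_r2` and `test_r2` are assumptions for illustration. The key point is that the test set goes through `transform`, not `fit_transform`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (assumption: 500 rows, 4 features)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

prep = Pipeline(steps=[
    ('impute', SimpleImputer()),
    ('scaler', StandardScaler())
])

# Learn imputation values and scaling statistics from the training set only
X_tr_prepared = prep.fit_transform(X_tr)
# Reuse those statistics on the test set; refitting here would leak test information
X_te_prepared = prep.transform(X_te)

model = LinearRegression().fit(X_tr_prepared, y_tr)

train_r2 = r2_score(y_tr, model.predict(X_tr_prepared))
test_r2 = r2_score(y_te, model.predict(X_te_prepared))
print(f'Train R2: {train_r2:.3f}, Test R2: {test_r2:.3f}')
```

In this notebook the equivalent would be `preprocessing.transform(X_test)` followed by `lin_reg.predict(...)`, scored against `y_test`.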

---

## **Evaluating the model**

### **Mean/Median**
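- The mean (or median) of the predictions is not an error measure by itself. Compared with the mean (or median) of the actual values, it gives a rough idea of whether our predictions are centred in the right place
- The mean is strongly affected by outliers, so if your data has outliers, look at the *median* instead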

In [12]:
predicted_mean = np.mean(y_pred)
predicted_median = np.median(y_pred)

print(f'Mean of predicted values: {predicted_mean}')
print(f'Median of predicted values: {predicted_median}')

Mean of predicted values: 2.0719469373788764
Median of predicted values: 2.0221271798382108


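Here we can see that a typical predicted value is about 2. The target, *MedHouseVal*, is expressed in units of $100,000, so this corresponds to roughly $200,000. These numbers describe the predictions themselves, not the error. To judge how far off the predictions are, compare them with the mean and median of `y_train`.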
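To turn these summary statistics into a check on the model, they can be set against the same statistics of the actual values. The sketch below uses small toy arrays (assumed values, not taken from the housing data) and also shows how a single outlier moves the mean far more than the median.

```python
import numpy as np

# Toy values (assumption: illustrative numbers only)
y_true_demo = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_pred_demo = np.array([1.2, 1.4, 2.1, 2.4, 3.4])

# How far the centre of the predictions sits from the centre of the actual values
mean_gap = np.mean(y_pred_demo) - np.mean(y_true_demo)
median_gap = np.median(y_pred_demo) - np.median(y_true_demo)
print(f'Mean gap: {mean_gap:.2f}, Median gap: {median_gap:.2f}')

# One extreme value drags the mean a long way but barely moves the median
y_with_outlier = np.append(y_true_demo, 50.0)
print(f'Mean with outlier: {np.mean(y_with_outlier):.2f}')
print(f'Median with outlier: {np.median(y_with_outlier):.2f}')
```

In this notebook the comparison would be between `np.mean(y_pred)` and `np.mean(y_train)`, and likewise for the medians.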

### **Standard Deviation**
- It is a measure of how much the values vary around the mean
- It helps in understanding the dispersion of the values

In [13]:
predicted_std = np.std(y_pred)
print(f'Std. Deviation of predicted values: {predicted_std}')

Std. Deviation of predicted values: 0.9049005946869291
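As with the mean, the standard deviation of the predictions says more when it is placed next to the standard deviation of the actual values. A minimal sketch with toy arrays (assumed values): a ratio well below 1 suggests the predictions are squeezed towards the mean. Note that `np.std` uses the population formula (`ddof=0`) by default.

```python
import numpy as np

# Toy values (assumption: illustrative numbers only)
y_true_demo = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
y_pred_demo = np.array([1.6, 1.8, 2.0, 2.2, 2.4])

std_true = np.std(y_true_demo)   # spread of the actual values
std_pred = np.std(y_pred_demo)   # spread of the predicted values

# Ratio of predicted spread to actual spread
spread_ratio = std_pred / std_true
print(f'Std. of actual: {std_true:.4f}')
print(f'Std. of predicted: {std_pred:.4f}')
print(f'Spread ratio: {spread_ratio:.2f}')
```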


### **R Square/Adjusted R Square**
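- R Square measures how much of the variability in the dependent variable can be explained by the model
    - For a least-squares linear regression with an intercept, evaluated on its training data, it equals the square of the correlation coefficient between the actual and predicted values
    - It's a good measure of how well the model fits the dependent variable
    - It doesn't take overfitting into account: for least-squares regression, adding more independent variables never lowers it on the training data
    - Best possible score is 1, and it can be negative when the model does worse than always predicting the mean
- Adjusted R Square penalises additional independent variables added to the model, adjusting the metric to guard against overfitting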

In [14]:
from sklearn.metrics import r2_score

predicted_r2_score = r2_score(y_train, y_pred)
print(f'R2 score of predicted values: {predicted_r2_score}')

R2 score of predicted values: 0.6125511913966952


As we can see, about 61% of the variance in the target can be explained by the model *(on the training data)*.
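Adjusted R Square is described above but not computed. A minimal sketch follows, using the standard formula `1 - (1 - R2) * (n - 1) / (n - p - 1)`. The helper `adjusted_r2` is a hypothetical function written for this example, not part of scikit-learn. The sample count of 16512 is an assumption derived from the 20640 rows and `test_size=0.2` used earlier.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# R2 value copied from the output above
r2_from_notebook = 0.6125511913966952
# Assumption: 80% of 20640 rows ended up in the training set
n_train = 16512
# The 8 feature columns listed by X.info()
n_features = 8

adj_r2 = adjusted_r2(r2_from_notebook, n_train, n_features)
print(f'Adjusted R2 score: {adj_r2}')
```

With this many samples and only 8 features, the adjustment is tiny. It matters more when the number of features is large relative to the number of rows.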

### **Mean Squared Error *(MSE)* / Root Mean Squared Error *(RMSE)***
- Mean Squared Error
    - It is the sum of the squared prediction errors *(real output minus predicted output)* divided by the number of data points
    - It gives a single non-negative number for how much the predictions deviate from the actual values, in squared units of the target
    - It doesn't provide much insight on its own, but is a good metric for comparing different models
    - It penalises large prediction errors more heavily, because the errors are squared
- Root Mean Squared Error
    - It's the square root of MSE
    - More commonly used than MSE
    - Taking the square root brings it back to the same units as the target, so it is on the same scale as the prediction error *(it is smaller than MSE only when MSE is greater than 1)*

In [15]:
from sklearn.metrics import mean_squared_error

predicted_mse = mean_squared_error(y_train, y_pred)
print(f'Predicted MSE: {predicted_mse}')

predicted_rmse = np.sqrt(predicted_mse)
print(f'Predicted RMSE: {predicted_rmse}')

Predicted MSE: 0.5179331255246699
Predicted RMSE: 0.7196757085831575
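To make the definitions concrete, here is a small hand computation with NumPy on toy arrays (assumed values, not from the housing data). It also shows how much of the total squared error comes from the single largest error.

```python
import numpy as np

# Toy values (assumption: illustrative numbers only)
y_true_demo = np.array([3.0, 1.5, 2.0, 4.0])
y_pred_demo = np.array([2.5, 1.5, 2.5, 2.0])

# Prediction error: real output minus predicted output
errors = y_true_demo - y_pred_demo

# MSE: mean of the squared errors
mse = np.mean(errors ** 2)
# RMSE: square root of MSE, back in the units of the target
rmse = np.sqrt(mse)

# Share of the total squared error contributed by the largest single error
largest_error_share = np.max(errors ** 2) / np.sum(errors ** 2)

print(f'MSE: {mse}')
print(f'RMSE: {rmse:.4f}')
print(f'Share from largest error: {largest_error_share:.2%}')
```

One error of 2.0 among four predictions accounts for most of the MSE here, which is what "larger penalisation for big errors" means in practice.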


### **Mean Absolute Error *(MAE)***
- It is similar to MSE. The only difference is that instead of averaging the squared errors *(like in MSE)*, it averages the absolute values of the errors
- Compared to MSE or RMSE, it is a more direct representation of the average size of the errors
- It weights every error in proportion to its size, so large errors get no extra penalty

In [16]:
from sklearn.metrics import mean_absolute_error

predicted_mae = mean_absolute_error(y_train, y_pred)
print(f'Predicted MAE: {predicted_mae}')

Predicted MAE: 0.5286283596581934
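The contrast with RMSE can be seen on the same kind of toy data (assumed values). In this sketch MAE is computed by hand and compared with RMSE. RMSE is never smaller than MAE, and the gap widens when a few errors are much larger than the rest.

```python
import numpy as np

# Toy values (assumption: illustrative numbers only)
y_true_demo = np.array([3.0, 1.5, 2.0, 4.0])
y_pred_demo = np.array([2.5, 1.5, 2.5, 2.0])

errors = y_true_demo - y_pred_demo

# MAE: mean of the absolute errors
mae = np.mean(np.abs(errors))
# RMSE on the same errors, for comparison
rmse = np.sqrt(np.mean(errors ** 2))

# Share of the total absolute error contributed by the largest single error
largest_abs_share = np.max(np.abs(errors)) / np.sum(np.abs(errors))

print(f'MAE: {mae}')
print(f'RMSE: {rmse:.4f}')
print(f'Share from largest error: {largest_abs_share:.2%}')
```

The same pattern appears in the notebook's own results, where RMSE (about 0.72) is larger than MAE (about 0.53).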


---