### 3 Best Metrics to Evaluate a Regression Model

There are 3 main metrics for evaluating a regression model:

1. R-squared / Adjusted R-squared
2. Mean Square Error (MSE) / Root Mean Square Error (RMSE)
3. Mean Absolute Error (MAE)

#### R-squared

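R-squared (r2) measures how much of the variability in the output can be explained by the model. For a simple linear regression with an intercept, it equals the square of the correlation coefficient (R) between the predictor and the target, which is where the name R-squared comes from.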

<img src='r2.JPG' width=300 height=50>

- For a least-squares fit with an intercept, evaluated on the training data, R-squared lies between 0 and 1.
- The ideal value of R-squared is 1.
- A larger value indicates a better fit between the predictions and the actual values.
- However, it does not account for overfitting: a model with many independent variables can score a high R-squared on the training data and still fail to generalize.
- That is why adjusted R-squared is needed in that case.

R-squared compares the residual sum of squares (SSres) with the total sum of squares (SStot).

The total sum of squares is the sum of the squared vertical distances between the data points and the horizontal line at the average of the target.

<img src='r2 graph.JPG' width=300 height=300 >

Similarly, the residual sum of squares is the sum of the squared vertical distances between the data points and the best-fit line.

Note: R-squared can also be negative when the fitted model is worse than simply predicting the average of the target.
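The definitions above can be sketched in a few lines of NumPy. This is only an illustration: the toy arrays and the helper name `r_squared` are made up for this example and do not come from the notebook's data.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SSres / SStot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared vertical distances to the predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared vertical distances to the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Simple linear fit with an intercept
slope, intercept = np.polyfit(x, y, deg=1)
r2_fit = r_squared(y, slope * x + intercept)

# For this kind of fit, R^2 matches the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Always predicting the mean gives R^2 = 0
r2_mean = r_squared(y, np.full_like(y, y.mean()))

# Predictions worse than the mean (here: the targets in reverse order) give a negative R^2
r2_bad = r_squared(y, y[::-1])

print(r2_fit, r ** 2, r2_mean, r2_bad)
```

The last two values show why the 0-to-1 range is not guaranteed: it holds for a least-squares fit scored on its own training data, not for arbitrary predictions.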

Limitation of R-squared:

- r2 always increases or stays the same when new variables are added to the model, regardless of whether the added feature is significant.
- Because r2 never decreases when attributes are added, non-significant attributes can be added to the model and still raise the R-squared value. With a large number of variables, this can lead to overfitting.


This is because SStot is constant for a given data set, while a least-squares fit can only keep SSres the same or lower it by picking up some correlation, possibly spurious, with the new attribute. The overall value of R-squared therefore goes up, even when the resulting regression model is poor.
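A small experiment can make this concrete. The sketch below fits ordinary least squares with `np.linalg.lstsq` on synthetic data (the data, the seed, and the helper `ols_r2` are assumptions made for illustration) and then refits after adding five columns of pure noise.

```python
import numpy as np

def ols_r2(X, y):
    """Fit least squares with an intercept and return the in-sample R^2."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)               # one genuinely useful predictor
y = 3.0 * x1 + rng.normal(size=n)     # target depends only on x1
noise = rng.normal(size=(n, 5))       # five predictors unrelated to y

r2_base = ols_r2(x1.reshape(-1, 1), y)
r2_more = ols_r2(np.column_stack([x1, noise]), y)

# In-sample R^2 with the noise columns is never lower than without them
print(r2_base, r2_more)
```

The model with noise columns contains the smaller model as a special case (set the extra coefficients to zero), so its SSres cannot be larger.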

#### Adjusted R-squared

In [6]:
from sklearn.metrics import r2_score
# sklearn has no built-in function for adjusted r2; statsmodels reports it in its results

<img src='adjr2.jpg' width=300 height=300 >

img source: geeksforgeeks.com

Adjusted R-squared is a modified form of R-squared that penalizes the number of predictors: it increases only when a new predictor improves the model more than would be expected by chance, and decreases otherwise.
- In this way, adjusted R-squared discourages overfitting from adding unhelpful predictors.
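The penalty can be sketched with the usual textbook formula, adjusted R-squared = 1 - (1 - r2) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors, not counting the intercept. The function name `adjusted_r2` and the example numbers are illustrative assumptions, not taken from a library.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n_predictors counts the features only, not the intercept.
    """
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_predictors + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Same R^2 and sample size, different number of predictors
few = adjusted_r2(0.80, n_samples=100, n_predictors=2)
many = adjusted_r2(0.80, n_samples=100, n_predictors=20)

print(few, many)  # the model with 20 predictors gets the lower adjusted value
```

For the same R-squared, the model that needed more predictors to reach it is scored lower, which is the correction plain R-squared lacks.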

As a reminder:
        
        SStot = sum((Yi - Yavg)**2)
        SSres = sum((Yi - Ypred)**2)
        
        r2 = 1 - (SSres / SStot)

In [5]:
# Why does the value of r2 always increase?

SStot is fixed for a given set of data points no matter how many predictors are added to the model, since it depends only on the difference between the actual values and their average. SSres, however, decreases (or at best stays the same) as the model picks up correlations from the added predictors. Hence the value of R-squared never decreases.

In [3]:
# Why use adjusted R-squared:

- to account for the complexity of the model
- to account for overfitting of the model

#### Mean Square Error (MSE)/Root Mean Square Error (RMSE)

<img src='mse.jpg' width=200 height=150>

- MSE is the mean of the squared differences between the actual and predicted values; Root Mean Square Error (RMSE) is the square root of MSE.
- RMSE is used more often than MSE for two reasons. First, MSE values can be too large to compare easily. Second, MSE is computed from squared errors, so taking the square root brings it back to the same units as the target and makes it easier to interpret.
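Written out by hand, the two quantities look like this. The arrays are made-up values for illustration only, not the notebook's `y_test` and `y_pred`.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

errors = y_true - y_hat          # 0.5, -0.5, 0.0, -1.0
mse = np.mean(errors ** 2)       # mean of squared errors, in squared target units
rmse = np.sqrt(mse)              # back in the units of the target

print(mse, rmse)
```

If the target were a price in dollars, MSE would be in squared dollars while RMSE would be in dollars, which is why RMSE is easier to read next to the data.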

In [2]:
from sklearn.metrics import mean_squared_error
import math
# print(mean_squared_error(y_test, y_pred))
# print(math.sqrt(mean_squared_error(y_test, y_pred)))
# MSE: 2017904593.23
# RMSE: 44921.092965684235

#### Mean Absolute Error (MAE)

- Mean Absolute Error (MAE) is similar to Mean Square Error (MSE).
- However, where MSE averages the squared errors, MAE averages the absolute values of the errors between the actual and predicted values.

<img src='mae.jpg' width=200 height=150>

Compared to MSE or RMSE, MAE is a more direct representation of the average error size. MSE penalizes large prediction errors more heavily by squaring them, while MAE weights every error in proportion to its magnitude.
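A minimal sketch of that difference, using two made-up error vectors with the same total absolute error: one spreads the error evenly, the other concentrates it in a single large miss.

```python
import numpy as np

def mae(errors):
    """Mean of absolute errors."""
    return np.mean(np.abs(errors))

def mse(errors):
    """Mean of squared errors."""
    return np.mean(errors ** 2)

# Hypothetical prediction errors (actual minus predicted)
errors_even = np.array([1.0, 1.0, 1.0, 1.0])    # four small misses
errors_spiky = np.array([0.0, 0.0, 0.0, 4.0])   # one large miss

print(mae(errors_even), mae(errors_spiky))   # MAE is the same for both
print(mse(errors_even), mse(errors_spiky))   # MSE is four times larger for the spiky case
```

MAE cannot tell the two cases apart, while MSE (and therefore RMSE) flags the single large error.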

In [3]:
from sklearn.metrics import mean_absolute_error
# print(mean_absolute_error(y_test, y_pred))
#MAE: 26745.1109986

#### Overall Recommendation/Conclusion

R-squared/adjusted R-squared is better suited to explaining the model to other people, because the number can be read as the percentage of output variability that the model explains.

MSE, RMSE, or MAE is better suited to comparing performance between different regression models.

Personally, I prefer RMSE, and I think Kaggle also uses it to assess submissions.

However, it makes sense to use MSE if the values are not too large, and MAE if you do not want to penalize large prediction errors disproportionately.

Adjusted R-squared is the only metric here that accounts for the number of predictors, and so guards against that source of overfitting.

sklearn has a function to calculate R-squared directly (r2_score), but I did not find one for adjusted R-squared; the statsmodels results do report it.

If you really want adjusted R-squared, you can use statsmodels or apply its mathematical formula directly.

In [4]:
# how to calculate r2 and adjusted r2

In [11]:
import numpy as np

In [8]:
import seaborn as sns
df=sns.load_dataset('tips')

In [9]:
df.shape

(244, 7)

In [13]:
df.head(2)

| | extra | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 1.0 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |


In [12]:
# Add a column of ones so the statsmodels OLS fit includes an intercept term
f = np.ones(len(df))
df.insert(0, 'extra', f)

In [16]:
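# Fit OLS with statsmodels; the summary reports both r-square and adjusted r-square
import pandas as pd
import statsmodels.api as sm  # array-based API (statsmodels.formula.api expects a formula string)

# Predictors: the constant column plus every feature except the target (tip)
x = df.iloc[:, [0, 1, 3, 4, 5, 6, 7]]
# OLS needs numeric inputs, so one-hot encode the categorical columns
x = pd.get_dummies(x, drop_first=True).astype(float)
y = df.iloc[:, 2]

regressor_OLS = sm.OLS(endog=y, exog=x).fit()
regressor_OLS.summary()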