# Choosing the right metric for evaluating Machine Learning Models

**Recall or Sensitivity or TPR (True Positive Rate):**
Number of items correctly identified as positive out of total actual positives: TP/(TP+FN)

**Specificity or TNR (True Negative Rate):** Number of items correctly identified as negative out of total actual negatives: TN/(TN+FP)

**Precision:** Number of items correctly identified as positive out of total items identified as positive: TP/(TP+FP)

**False Positive Rate or Type I Error:** Number of items wrongly identified as positive out of total actual negatives: FP/(FP+TN)

**False Negative Rate or Type II Error:** Number of items wrongly identified as negative out of total actual positives: FN/(FN+TP)

**F1 Score:** The harmonic mean of precision and recall, given by $F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$

**Accuracy:** Percentage of total items classified correctly:
(TP+TN)/(P+N), where P and N are the total numbers of actual positives and actual negatives
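
The definitions above can be checked with a small sketch that computes every metric from the four confusion-matrix counts. The function name `confusion_metrics` and the example counts are illustrative assumptions, and the sketch assumes no denominator is zero.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute basic binary-classification metrics from confusion-matrix counts.

    Assumes every denominator is non-zero (hypothetical helper, not a library API).
    """
    recall = tp / (tp + fn)            # sensitivity / TPR
    specificity = tn / (tn + fp)       # TNR
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)               # equals 1 - specificity
    fnr = fn / (fn + tp)               # equals 1 - recall
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "fpr": fpr,
        "fnr": fnr,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts: 45 actual positives, 55 actual negatives.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

With these counts, precision is 0.8 and accuracy is 0.85, and FPR and specificity sum to 1 as the definitions require.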

## ROC-AUC Score 

Receiver Operating Characteristic, Area Under the Curve

Mathematically, it is calculated as the area under the curve of sensitivity (TPR) vs. FPR (1 - specificity), traced out as the classification threshold varies. Ideally, we would like to have both high sensitivity and high specificity, but in real-world scenarios there is always a tradeoff between the two.

Some important characteristics of ROC-AUC are:

* The value can range from 0 to 1. However, the AUC score of a random classifier is 0.5, so a useful model should score above 0.5.

* The ROC-AUC score is independent of the threshold set for classification because it only considers the rank of each prediction and not its absolute value. The same is not true for the F1 score, which needs a threshold value when the model outputs probabilities.
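
The rank-only property can be illustrated with a minimal sketch that computes AUC as the fraction of (positive, negative) pairs in which the positive item gets the higher score. This pairwise formulation is equivalent to the area under the ROC curve; the function name `roc_auc` and the example scores are assumptions made for illustration.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Quadratic in the number of samples,
    so this is only meant for small illustrative inputs.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted scores.
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc_raw = roc_auc(y_true, scores)
# Cubing is a monotonic transform on positive scores: values change, ranks do not.
auc_cubed = roc_auc(y_true, [s ** 3 for s in scores])
```

Both calls return the same value (8 of 9 pairs are ordered correctly), even though the cubed scores would give different class predictions under a fixed threshold such as 0.5.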

## Log-Loss

Log-loss is a measure of accuracy that incorporates the idea of probabilistic confidence. For binary classification, the loss for a single observation is given by the following expression:

$-(y \log(p) + (1-y) \log(1-p))$

It takes into account the uncertainty of your prediction based on how much it varies from the actual label. As an uninformative baseline, let's say you predicted 0.5 for all the observations. The log-loss then becomes $-\log(0.5) \approx 0.69$. Hence, on balanced data, a log-loss close to or above 0.69 indicates a very poor model, since it is no better than this baseline. Confident but wrong predictions are penalized even more heavily, so the loss has no upper bound.
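
A minimal sketch of the binary log-loss, averaged over observations, makes the 0.5 baseline concrete. The function name `binary_log_loss`, the clipping constant `eps`, and the example labels are assumptions for illustration; clipping is only there to avoid `log(0)`.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average binary log-loss; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


y_true = [1, 0, 1, 1, 0]

# Predicting 0.5 everywhere gives -log(0.5), about 0.693.
uninformative = binary_log_loss(y_true, [0.5] * 5)
# Confident and correct predictions give a much lower loss, about 0.105.
confident_right = binary_log_loss(y_true, [0.9, 0.1, 0.9, 0.9, 0.1])
# Confident and wrong predictions are punished harder than the 0.5 baseline, about 2.303.
confident_wrong = binary_log_loss(y_true, [0.1, 0.9, 0.1, 0.1, 0.9])
```

The ordering `confident_right < uninformative < confident_wrong` shows that always predicting 0.5 is a baseline to beat, while confident mistakes can push the loss far above it.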

## Case 1

**Consider Balanced Data**

* If you care about absolute probabilistic differences, go with log-loss.

* If you care only about the final class prediction and you don't want to tune the threshold, go with the AUC score.

* The F1 score is sensitive to the threshold, so you would want to tune it first before comparing models.
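
The threshold sensitivity of F1 can be seen in a short sketch that scores the same probabilities at several cut-offs. The helper `f1_at_threshold` and the labels and probabilities are made up for illustration.

```python
def f1_at_threshold(y_true, probs, threshold):
    """F1 score after turning probabilities into classes at the given threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(y_true, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(y_true, preds) if y == 1 and yhat == 0)
    if tp == 0:
        return 0.0
    # Equivalent to 2 * precision * recall / (precision + recall).
    return 2 * tp / (2 * tp + fp + fn)


# Made-up labels and predicted probabilities.
y_true = [0, 0, 1, 1, 0, 1, 1, 0]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.6]

f1_by_threshold = {t: f1_at_threshold(y_true, probs, t) for t in (0.3, 0.5, 0.7)}
```

On this toy data the F1 score is 0.8 at a threshold of 0.3, 0.75 at 0.5, and about 0.67 at 0.7, so two models compared at an arbitrary default threshold may be ranked differently than after tuning.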

## Case 2

**Consider imbalanced data**

* Log-loss can fail in this case: two models that handle the minority class very differently can still receive nearly the same log-loss. This is because the log-loss function is symmetric and does not differentiate between classes.

* If you care about the class that is smaller in number, regardless of whether it is positive or negative, go for the ROC-AUC score.

* When you have a small positive class, the F1 score makes more sense.

## Which metric should you use for multi-class classification

The most commonly used metrics for multi-class problems are F1 score, average accuracy, and log-loss. ROC-AUC has no single standard definition for multi-class problems; it is usually extended by averaging one-vs-rest or one-vs-one binary scores.

**log-loss for multi-class is defined as:**

$logloss = - \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}y_{ij}\log(p_{ij})$

where,<br>
    N: Number of rows in the test set <br>
    M: Number of classes <br>
    $y_{ij}$: 1 if observation i belongs to class j; else 0 <br>
    $p_{ij}$: Predicted probability that observation i belongs to class j
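
The multi-class formula can be sketched directly from its definition. The function name `multiclass_log_loss`, the clipping constant `eps`, and the three-class example are assumptions for illustration; labels are given as one-hot rows to match $y_{ij}$.

```python
import math


def multiclass_log_loss(y_onehot, p_pred, eps=1e-15):
    """Multi-class log-loss: -(1/N) * sum_i sum_j y_ij * log(p_ij)."""
    n = len(y_onehot)
    total = 0.0
    for y_row, p_row in zip(y_onehot, p_pred):
        for y_ij, p_ij in zip(y_row, p_row):
            if y_ij:
                # Only the true class contributes; clip to avoid log(0).
                total += math.log(max(p_ij, eps))
    return -total / n


# Three observations, three classes (made-up values).
y_onehot = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]

loss = multiclass_log_loss(y_onehot, p_pred)                  # about 0.424
uniform = multiclass_log_loss(y_onehot, [[1 / 3] * 3] * 3)    # log(3), about 1.099
```

Predicting a uniform distribution over M classes gives a loss of $\log(M)$, which plays the same role as the 0.69 baseline in the binary case.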
    

* In the micro-average method, you sum up the individual true positives, false positives, and false negatives across all classes and then compute the statistic from those totals.
* In the macro-average method, you compute the precision and recall separately for each class and then take their unweighted average.

**Micro-average is preferable if there is a class imbalance problem.**
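
A small sketch shows how the two averaging schemes differ for precision when one class is much larger than the other. The function `micro_macro_precision` and the per-class counts are hypothetical values chosen for illustration.

```python
def micro_macro_precision(per_class_counts):
    """Return (micro, macro) precision from a list of (tp, fp) pairs, one per class."""
    total_tp = sum(tp for tp, _ in per_class_counts)
    total_fp = sum(fp for _, fp in per_class_counts)
    # Micro: pool the counts first, then compute one precision.
    micro = total_tp / (total_tp + total_fp)
    # Macro: compute precision per class, then take the unweighted mean.
    macro = sum(tp / (tp + fp) for tp, fp in per_class_counts) / len(per_class_counts)
    return micro, macro


# A large class with precision 0.9 and a small class with precision 0.5.
counts = [(90, 10), (5, 5)]
micro, macro = micro_macro_precision(counts)
```

Here the micro-average is about 0.86 because the large class dominates the pooled counts, while the macro-average is 0.7 because both classes count equally.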

# Metrics for Regression Problems

* Mean Squared Error(MSE)
* Root Mean Squared Error(RMSE)
* Mean Absolute Error(MAE)
* R squared ($R^2$)
* Adjusted R Squared($R^2$)
* Mean Square Percentage Error(MSPE)
* Mean Absolute Percentage Error(MAPE)
* Root Mean Squared Logarithmic Error(RMSLE)

## Mean Squared Error(MSE)

$MSE = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y_i})^2$

MSE measures the average squared error of our predictions. For each point, it calculates the squared difference between the prediction and the target, and then averages those values.

**Advantage:** Useful if we have unexpected values that we should care about, i.e. very high or very low values that deserve attention.

**Disadvantage:** If we make a single very bad prediction, the squaring makes the error even larger, which may skew the metric towards overestimating how bad the model is.
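
The effect of a single bad prediction can be illustrated with a minimal sketch. The function name `mse` and the target and prediction values are made up for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared differences."""
    return sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)) / len(y_true)


y_true = [10, 12, 11, 13, 12]

# Every prediction is off by exactly 1.
slightly_off = [11, 11, 12, 12, 13]
# Four predictions are exact, one is off by 20.
one_bad = [10, 12, 11, 13, 32]

mse_slightly_off = mse(y_true, slightly_off)  # 1.0
mse_one_bad = mse(y_true, one_bad)            # 400 / 5 = 80.0
```

Although the second model is exact on four of five points, its MSE is 80 times larger because the single error of 20 is squared before averaging.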

## Root Mean Squared Error(RMSE)

RMSE is just the square root of MSE. The square root is introduced to bring the scale of the error back to the scale of the targets.

$RMSE = \sqrt{MSE}$

First, they are similar in terms of their minimizers: every minimizer of MSE is also a minimizer of RMSE and vice versa, since the square root is a strictly increasing function.

$MSE(A)>MSE(B)\Longleftrightarrow RMSE(A)>RMSE(B)$

It means that, if the target metric is RMSE, we can still compare our models using MSE, since MSE will order the models in the same way as RMSE. Thus we can optimize MSE instead of RMSE.

In fact, MSE is a little bit easier to work with, so MSE is commonly used instead of RMSE. There is, however, a small difference between the two for gradient-based models:

$\frac{\partial RMSE}{\partial \hat{y_i}} = \frac{1}{2\sqrt{MSE}}\frac{\partial MSE}{\partial \hat{y_i}}$

It means that traveling along the MSE gradient is equivalent to traveling along the RMSE gradient but with a different learning rate, and that learning rate depends on the MSE score itself.

So even though RMSE and MSE are really similar in terms of model scoring, they are not immediately interchangeable for gradient-based methods, because the effective step size differs.
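
The gradient relationship can be checked numerically. The sketch below, under the assumption that MSE is non-zero, compares the analytic RMSE gradient (the MSE gradient scaled by $\frac{1}{2\sqrt{MSE}}$) against a central finite-difference estimate. All function names and data values are illustrative.

```python
import math


def mse(y_true, y_pred):
    return sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)) / len(y_true)


def mse_grad(y_true, y_pred):
    """Analytic gradient of MSE with respect to each prediction."""
    n = len(y_true)
    return [2 * (yhat - y) / n for y, yhat in zip(y_true, y_pred)]


def rmse_grad(y_true, y_pred):
    """RMSE gradient via the chain rule: MSE gradient scaled by 1 / (2 * sqrt(MSE))."""
    scale = 1 / (2 * math.sqrt(mse(y_true, y_pred)))
    return [scale * g for g in mse_grad(y_true, y_pred)]


def numeric_rmse_grad(y_true, y_pred, h=1e-6):
    """Central finite-difference estimate of the RMSE gradient."""
    grads = []
    for i in range(len(y_pred)):
        up = list(y_pred)
        down = list(y_pred)
        up[i] += h
        down[i] -= h
        diff = math.sqrt(mse(y_true, up)) - math.sqrt(mse(y_true, down))
        grads.append(diff / (2 * h))
    return grads


y_true = [3.0, 5.0, 2.0]
y_pred = [2.5, 5.5, 4.0]

analytic = rmse_grad(y_true, y_pred)
numeric = numeric_rmse_grad(y_true, y_pred)
```

The two gradients agree to within numerical error, and both point in the same direction as the MSE gradient. Only the length of the step changes, which is why a learning rate tuned for MSE may need adjusting for RMSE.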

## Mean Absolute Error(MAE) 

In MAE the error is calculated as the average of absolute differences between the target values and the predictions.<br>
The MAE is a linear score, which means that **all the individual differences are weighted equally** in the average. For example, an error of 10 contributes exactly twice as much as an error of 5. However, the same is not true for RMSE.

$MAE = \frac{1}{N}\sum_{i=1}^{N}|y_i - \hat{y_i}|$

An important property of this metric is that it **does not penalize huge errors as badly as MSE does.** Thus, it is not as sensitive to outliers as mean squared error.

MAE is widely used in finance, where a \$10 error is usually exactly two times worse than a \$5 error. On the other hand, the MSE metric treats a \$10 error as four times worse than a \$5 error. MAE is therefore easier to justify than MSE.

Another important thing about MAE is its gradient with respect to the predictions. The gradient is a step function: it takes the value -1 when $\hat{y_i}$ is smaller than the target and +1 when it is larger. It is not defined when the prediction exactly equals the target.

**Note that:** if we want a constant prediction, the best one is the **median of the target values**. It can be found by setting the derivative of the total error with respect to that constant to zero and solving the resulting equation.
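
The median claim can be verified with a brute-force sketch that tries every integer constant on a small grid. The helper `mae_of_constant` and the target values, which include one outlier, are made up for illustration.

```python
import statistics


def mae_of_constant(y_true, c):
    """MAE obtained when predicting the same constant c for every target."""
    return sum(abs(y - c) for y in y_true) / len(y_true)


# Targets with one large outlier.
y_true = [1, 2, 3, 4, 100]

median_c = statistics.median(y_true)  # 3
mean_c = statistics.mean(y_true)      # 22

mae_at_median = mae_of_constant(y_true, median_c)  # 101 / 5 = 20.2
mae_at_mean = mae_of_constant(y_true, mean_c)      # 156 / 5 = 31.2

# Brute-force search over integer constants from 0 to 100.
best_on_grid = min(range(0, 101), key=lambda c: mae_of_constant(y_true, c))
```

The grid search lands on the median, and the mean, which is pulled towards the outlier, gives a clearly worse MAE.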

## R Squared($R^2$)

The coefficient of determination, or R2, is another metric used to evaluate a model. It is closely related to MSE, but has the advantage of being **scale-free**: it doesn't matter if the output values are very large or very small, **the R2 is always going to be between $-\infty$ and 1**.

When R2 is negative it means that the model is worse than predicting the mean

$R^2 = 1 - \frac{MSE(model)}{MSE(baseline)}$

The MSE of the baseline is defined as:<br><br>
$MSE(baseline) = \frac{1}{N}\sum_{i=1}^{N}(y_i - \bar{y})^2$

where $\bar{y}$ is the mean of the observed $y_i$

To make it more clear, this baseline MSE can be thought of as the MSE that the simplest possible model would get. The simplest possible model would be to always predict the average of all samples. A value close to 1 indicates a model with close to zero error, and a value close to zero indicates a model very close to the baseline.

**In conclusion, R2 compares how good our model is with how good the naive mean model is.**
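
A minimal sketch of the formula $R^2 = 1 - \frac{MSE(model)}{MSE(baseline)}$ shows the three characteristic values: 1 for a perfect model, 0 for the mean baseline, and a negative value for a model worse than the baseline. Function names and data are illustrative assumptions, and the sketch assumes the targets are not all identical.

```python
def mse(y_true, y_pred):
    return sum((y - yhat) ** 2 for y, yhat in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """R^2 = 1 - MSE(model) / MSE(baseline), where the baseline predicts the mean."""
    mean_y = sum(y_true) / len(y_true)
    baseline_mse = mse(y_true, [mean_y] * len(y_true))
    return 1 - mse(y_true, y_pred) / baseline_mse


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]

r2_perfect = r_squared(y_true, y_true)                       # 1.0
r2_mean = r_squared(y_true, [3.0] * 5)                       # 0.0
r2_reversed = r_squared(y_true, [5.0, 4.0, 3.0, 2.0, 1.0])   # 1 - 8 / 2 = -3.0
```

The reversed predictions have four times the baseline's MSE, so R2 drops to -3, which illustrates why the lower bound is $-\infty$ rather than 0.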

## MAE vs MSE

* Do you have outliers in the data?
    - Use MAE
* Are you sure they are outliers?
    - Use MAE
* Or they are just unexpected values we should still care about?
    - Use MSE

## What is the difference between RMSE and RMSLE (logarithmic error)

MSE incorporates both the variance and the bias of the predictor.

RMSE is the square root of MSE. In the case of an unbiased estimator, RMSE is just the square root of the variance, which is the standard deviation.

In the case of RMSLE, you take the log of the predictions and of the actual values. So basically, what changes is the variance that you are measuring. RMSLE is usually used when you don't want to penalize huge differences between the predicted and the actual values when both are huge numbers.
1. If both predicted and actual values are small: RMSE and RMSLE are about the same.
2. If either the predicted or the actual value is big: RMSE > RMSLE
3. If both predicted and actual values are big: