# Evaluation Metrics for Machine Learning

One of the core tasks in building any machine learning model is to evaluate its performance. It’s fundamental, and it’s also really hard. So how would one measure the success of a machine learning model? How would we know when to stop the training and evaluation and call it good?

While data preparation and model training are key steps in the machine learning pipeline, it's equally important to measure the performance of the trained model. How well the model generalizes to unseen data is what distinguishes adaptive from non-adaptive machine learning models.

By using different metrics for performance evaluation, we are in a better position to improve the overall predictive power of our model before we roll it out to production. Relying only on accuracy, without a proper evaluation using other metrics, can result in poor predictions once the model is deployed on unseen data.

### Model accuracy 


**Confusion Matrix**


![](./fig/confu.jpg)



**True Positive (TP)**: A true positive is an outcome where the model correctly predicts the positive class.  
**True Negative (TN)**: A true negative is an outcome where the model correctly predicts the negative class.  
**False Positive (FP)**: A false positive is an outcome where the model incorrectly predicts the positive class.  
**False Negative (FN)**: A false negative is an outcome where the model incorrectly predicts the negative class.  

$$Accuracy  = \frac{Number\ of\ correct\ predictions}{Total\ Number\ of\ predictions} = \frac{TP + TN}{TP + TN + FP + FN}$$
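The four counts and the accuracy formula can be checked on a small example. The following is a minimal sketch in plain Python; the function names, the `positive` parameter, and the toy labels are assumptions made for illustration, not part of any particular library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, TN, FP, FN) for binary labels."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels and predictions (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)            # 3 3 1 1
print(accuracy(tp, tn, fp, fn))  # 0.75
```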


Though highly accurate models are what we aim to achieve, accuracy alone may not be sufficient to ensure the model’s performance on unseen data.

Note: Accuracy is not a good metric when we are dealing with a class-imbalanced dataset.
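To see why, consider a made-up dataset with 95 negatives and 5 positives, and a "model" that always predicts the negative class. The numbers below are hypothetical and chosen only to illustrate the point.

```python
# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the negative class
y_pred = [0] * 100

correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
accuracy = correct / len(y_true)
true_positives = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)

print(accuracy)        # 0.95
print(true_positives)  # 0
```

The accuracy looks high even though the model never identifies a single positive example.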


**Precision and Recall**

$$Precision = \frac{True\ Positive}{Total\ Predicted\ Positive} = \frac{TP}{TP + FP}$$

$$Recall = \frac{True\ Positive}{Total\ Actual\ Positive} = \frac{TP}{TP + FN}$$

For example, in recommender systems precision is more critical, because we want to recommend content of interest to the user with little disturbance. In medical uses such as identifying cancer or any other disease, we don't want to miss any cases, so recall is more important.

P-R curves are used to calculate break-even points, where precision equals recall. It is more common, however, to calculate the F1 score.
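The trade-off between the two can be illustrated with a short sketch. The counts for the two classifiers below are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not a universal rule.

```python
def precision(tp, fp):
    # Assumed convention: 0.0 when nothing was predicted positive
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Assumed convention: 0.0 when there are no actual positives
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts on the same data, which contains 10 actual positives.
# "cautious" flags few items; "eager" flags many.
cautious = {"tp": 4, "fp": 0, "fn": 6}
eager = {"tp": 9, "fp": 9, "fn": 1}

for name, c in (("cautious", cautious), ("eager", eager)):
    print(name, precision(c["tp"], c["fp"]), recall(c["tp"], c["fn"]))
# cautious: precision 1.0, recall 0.4
# eager:    precision 0.5, recall 0.9
```

A cautious classifier of this kind would suit the recommender case, while the eager one is closer to what a screening test needs.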
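As a rough sketch of the break-even idea (the threshold at which precision and recall are equal), the snippet below computes one P-R point per candidate threshold and picks the point where the two values are closest. The labels, scores, and function name are hypothetical, and treating precision as 1.0 when nothing is predicted positive is an assumed convention.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if (tp + fp) else 1.0  # assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# One (threshold, precision, recall) row per distinct score
curve = [(t, *precision_recall(y_true, scores, t)) for t in sorted(set(scores))]

# Approximate break-even point: the row where precision and recall are closest
break_even = min(curve, key=lambda row: abs(row[1] - row[2]))
print(break_even[0])  # 0.6, where precision and recall are both 2/3
```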


**F1 Score**

The F1 score is the harmonic mean of precision and recall, so it punishes extreme values more than a simple average would. An F1 score can be an effective evaluation metric in the following classification scenarios:
+ When FP and FN are equally costly, meaning that missing a true positive and raising a false positive impact the outcome in almost the same way
+ When adding more data doesn’t effectively change the outcome
+ When TN is high (as with flood predictions, cancer predictions, etc.), since the F1 score does not take true negatives into account

$$F1\ Score = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
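A quick numerical check shows how the formula treats lopsided precision and recall compared with a plain arithmetic mean. This is a minimal sketch; the function name and the example values are assumptions for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Balanced case: F1 equals the common value
print(f1_score(0.5, 0.5))  # 0.5

# Lopsided case: the arithmetic mean would be 0.55,
# but F1 is pulled toward the smaller value (2/11, about 0.18)
print(f1_score(1.0, 0.1))
```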

**ROC and AUC**

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

+ True Positive Rate
+ False Positive Rate

True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:

$$TPR = \frac{TP}{TP + FN}$$

False Positive Rate (FPR) is defined as follows:

$$FPR = \frac{FP}{TN + FP}$$

An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives. The following figure shows a typical ROC curve.

![](./fig/ROC.png)
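The effect of lowering the threshold can be checked on a toy example. The sketch below uses made-up labels and scores and assumes both classes are present (otherwise a denominator would be zero); the function name is hypothetical.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR) when scores >= threshold are classified positive."""
    tp = fp = fn = tn = 0
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and actual == 1:
            tp += 1
        elif predicted == 1 and actual == 0:
            fp += 1
        elif predicted == 0 and actual == 1:
            fn += 1
        else:
            tn += 1
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# Lowering the threshold raises both rates:
# 0.8 -> TPR 1/3, FPR 0;  0.5 -> TPR 2/3, FPR 1/3;  0.2 -> TPR 1, FPR 2/3
for threshold in (0.8, 0.5, 0.2):
    print(threshold, rates_at_threshold(y_true, scores, threshold))
```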

To compute the points in an ROC curve, we could evaluate a logistic regression model many times with different classification thresholds, but this would be inefficient. Fortunately, an efficient, sorting-based algorithm can provide this information for us, and the resulting curve is commonly summarized by a single number called AUC.
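One possible sorting-based approach is sketched below: sort the examples once by descending score, then sweep through them while keeping running counts of true and false positives. This is an illustrative sketch, not necessarily the exact algorithm the text has in mind; it assumes labels are 0/1 and that both classes are present, and the function name is hypothetical.

```python
def roc_points(y_true, scores):
    """Return ROC points as (FPR, TPR) pairs, from (0, 0) to (1, 1)."""
    n_pos = sum(y_true)              # labels assumed to be 0 or 1
    n_neg = len(y_true) - n_pos
    # Sort once by descending score; each distinct score is a threshold.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a group of tied scores.
        last_of_group = i == len(ranked) - 1 or ranked[i + 1][0] != score
        if last_of_group:
            points.append((fp / n_neg, tp / n_pos))
    return points


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(round(fpr, 3), round(tpr, 3))
```

The single sort dominates the cost, so the whole curve is obtained in one pass instead of re-evaluating the model for every threshold.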

**AUC: Area Under the ROC Curve**

An ROC curve is a two-dimensional depiction of classifier performance. To compare classifiers, we may want to reduce ROC performance to a single scalar value representing expected performance. A common method is to calculate the area under the ROC curve, abbreviated AUC. Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1. However, because random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5, no realistic classifier should have an AUC less than 0.5.
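Given a list of ROC points, the area can be approximated with the trapezoidal rule. The sketch below is a minimal illustration; the function name and the three example curves are assumptions, and the points are expected to be sorted by increasing FPR.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve of (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between adjacent points
    return area


# Hypothetical ROC curves
diagonal = [(0.0, 0.0), (1.0, 1.0)]             # random guessing
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # all positives ranked first
example = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]

print(auc_trapezoid(diagonal))  # 0.5
print(auc_trapezoid(perfect))   # 1.0
print(auc_trapezoid(example))   # 0.75
```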

AUC is desirable for the following two reasons:

+ AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
+ AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
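Both properties follow from the fact that AUC depends only on how examples are ranked. The sketch below uses the pairwise view of AUC (the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half) and shows that an order-preserving rescaling of the scores leaves it unchanged. The data, the rescaling, and the function name are hypothetical.

```python
def auc_pairwise(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# An order-preserving transform (valid here because scores are non-negative)
rescaled = [s * s / 10 for s in scores]

print(auc_pairwise(y_true, scores))    # 8 of 9 pairs ranked correctly
print(auc_pairwise(y_true, rescaled))  # same value: ranking is unchanged
```

No threshold appears anywhere in the computation, which is the sense in which the metric is threshold-invariant.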

However, both these reasons come with caveats, which may limit the usefulness of AUC in certain use cases:

+ Scale invariance is not always desirable. For example, sometimes we really do need well-calibrated probability outputs, and AUC won’t tell us about that.

+ Classification-threshold invariance is not always desirable. In cases where there are wide disparities in the cost of false negatives vs. false positives, it may be critical to minimize one type of classification error. For example, when doing email spam detection, you likely want to prioritize minimizing false positives (even if that results in a significant increase of false negatives). AUC isn't a useful metric for this type of optimization.
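When error costs are unequal, the threshold has to be chosen explicitly, which is something AUC cannot do for us. The sketch below picks the cheapest threshold from a few candidates under two cost settings; the scores, candidate thresholds, and cost values are all hypothetical.

```python
def total_cost(y_true, scores, threshold, fp_cost, fn_cost):
    """Total cost of errors when scores >= threshold are classified positive."""
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return fp * fp_cost + fn * fn_cost


# Hypothetical labels (1 = spam) and model scores
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]
candidates = [0.2, 0.5, 0.8]

# False positives ten times as costly (e.g. a legitimate email marked as spam)
best_fp_costly = min(
    candidates, key=lambda t: total_cost(y_true, scores, t, fp_cost=10, fn_cost=1)
)

# False negatives ten times as costly
best_fn_costly = min(
    candidates, key=lambda t: total_cost(y_true, scores, t, fp_cost=1, fn_cost=10)
)

print(best_fp_costly)  # 0.8
print(best_fn_costly)  # 0.2
```

The scores, and therefore the AUC, are identical in both settings; only the preferred threshold changes.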