### What Are Evaluation Metrics or Error Metrics?

Building machine learning, artificial intelligence, or deep learning models works on a constructive feedback principle. You build a model, get feedback from metrics, make improvements, and continue until you achieve the desired level of performance. Evaluation metrics explain the performance of the model. An important aspect of evaluation metrics is their capability to discriminate among model results.

Evaluation is a way of checking how robust the model is. Once you have finished building a model, do not rush to generate predictions on unseen data; that is the wrong approach. Building a predictive model is not the goal in itself. The goal is to create and select a model that gives high accuracy on out-of-sample data. Hence, it is crucial to check the accuracy of your model before computing predicted values.

We consider different kinds of metrics to evaluate our ML models. The choice of evaluation metric depends on the type of model and on how the model will be used. These metrics will help you evaluate your model's performance.




### Types of Predictive Models

When we talk about predictive models, we are talking either about a regression model (continuous output) or a classification model (nominal or binary output). The evaluation metrics used in each of these models are different.

### Confusion Matrix

A confusion matrix is an N x N matrix, where N is the number of classes being predicted. For the problem at hand, N = 2, so we get a 2 x 2 matrix. It is a performance measurement for machine learning classification problems where the output can be two or more classes. In the binary case, it is a table with four combinations of predicted and actual values. It is extremely useful for deriving precision, recall, specificity, and accuracy, and the same counts underlie AUC-ROC curves.

Here are a few definitions you need to remember for a confusion matrix:

* **True Positive:** You predicted positive, and it’s true.
* **True Negative:** You predicted negative, and it’s true.
* **False Positive (Type I Error):** You predicted positive, and it’s false.
* **False Negative (Type II Error):** You predicted negative, and it’s false.
* **Accuracy:** the proportion of all predictions that were correct.
* **Positive Predictive Value or Precision:** the proportion of predicted positive cases that are actually positive.
* **Negative Predictive Value:** the proportion of predicted negative cases that are actually negative.
* **Sensitivity or Recall:** the proportion of actual positive cases which are correctly identified.
* **Specificity:** the proportion of actual negative cases which are correctly identified.
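The definitions above can be sketched as a small helper that turns the four cell counts into the derived metrics. The function name `confusion_metrics` and the counts are hypothetical and chosen only for illustration; they are not the figures from the tables in this section.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Return the derived metrics from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),                  # positive predictive value
        "negative_predictive_value": tn / (tn + fn),
        "sensitivity": tp / (tp + fn),                # recall
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts, for illustration only
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that precision and negative predictive value divide by the number of *predicted* cases, while sensitivity and specificity divide by the number of *actual* cases.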

![image.png](attachment:image.png)

![image-2.png](attachment:image-2.png)

The accuracy for the problem in hand comes out to be 88%. As you can see from the above two tables, the Positive Predictive Value is high, but the negative predictive value is quite low. The same holds for Sensitivity and Specificity. This is primarily driven by the threshold value we have chosen. If we decrease our threshold value, the two pairs of starkly different numbers will come closer.
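To see how the threshold drives these numbers, here is a minimal sketch that recomputes sensitivity and specificity at several thresholds. The scores and labels are hypothetical, not the data behind the tables above, and the helper `sensitivity_specificity` is an illustrative name.

```python
# Hypothetical predicted probabilities and true labels (1 = positive class)
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]


def sensitivity_specificity(scores, labels, threshold):
    """Classify scores >= threshold as positive; return (sensitivity, specificity)."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return tp / (tp + fn), tn / (tn + fp)


for threshold in (0.25, 0.50, 0.75):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")
```

On this toy data, lowering the threshold raises sensitivity and lowers specificity, and raising it does the opposite.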

In general, we are concerned with one of the above-defined metrics. For instance, a pharmaceutical company will be more concerned with minimizing false positive diagnoses. Hence, it will care more about high Specificity. On the other hand, an attrition model will be more concerned with Sensitivity. Confusion matrices are generally used only with class output models.

### F1 Score

In the last section, we discussed precision and recall for classification problems and highlighted the importance of choosing between them based on our use case. What if, for a use case, we are trying to get the best precision and recall at the same time? The F1 score is the harmonic mean of the precision and recall values for a classification problem. The formula for the F1 score is as follows:

![image.png](attachment:image.png)

**Precision:** It is the ratio of the number of True Positives to the sum of True Positives and False Positives. It represents how many of the cases the model labeled positive are actually positive.

* **Precision = True Positive (TP) / (True Positive (TP) + False Positive (FP))**

**Recall:** It is the ratio of the number of True Positives to the sum of True Positives and False Negatives. It represents the model’s ability to find all the relevant cases in a dataset.

* **Recall = True Positive (TP) / (True Positive (TP) + False Negative (FN))**



#### Question:

Consider the following confusion matrix, and find the related f1 score.

![image.png](attachment:image.png)


#### Answer

To calculate the f1 score, firstly we will calculate the values of Precision and Recall.

Precision = TP / (TP + FP) = 560 / (560 + 60) = 0.903

Recall = TP / (TP + FN) = 560 / (560 + 50) = 0.918

Now, the f1 score will be:

![image.png](attachment:image.png)
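As a quick check of the arithmetic, the calculation can be reproduced in a few lines of Python using the counts from the worked answer (TP = 560, FP = 60, FN = 50). This is a sketch of the harmonic-mean formula, F1 = 2 × Precision × Recall / (Precision + Recall).

```python
# Counts taken from the worked example
tp, fp, fn = 560, 60, 50

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

With these counts the F1 score comes out at roughly 0.911, which sits between the precision and recall values, as a harmonic mean must.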

### Regression Predictive Modeling

Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

Regression predictive modeling is the task of approximating a mapping function (f) from input variables (X) to a continuous output variable (y).

Regression is different from classification, which involves predicting a category or class label.

A continuous output variable is a real value, such as an integer or floating-point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of 100,000 to 200,000.

* A regression problem requires the prediction of a quantity.
* A regression can have real-valued or discrete input variables.
* A problem with multiple input variables is often called a multiple regression problem (sometimes loosely referred to as multivariate regression).
* A regression problem where input variables are ordered by time is called a time series forecasting problem.

Now that we are familiar with regression predictive modeling, let’s look at how we might evaluate a regression model.

#### Evaluating Regression Models

A common question by beginners to regression predictive modeling projects is:

#### How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

**We cannot calculate accuracy for a regression model.**

The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know if the model predicted the value exactly (this might be intractably difficult in practice); instead, we want to know how close the predictions were to the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.
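A small sketch makes the point concrete. The prices below are hypothetical: every prediction is close to its expected value, yet an exact-match "accuracy" is zero, while the average absolute difference describes how close the predictions are.

```python
# Hypothetical sale prices and predictions, for illustration only
expected = [250000.0, 180000.0, 320000.0]
predicted = [249500.0, 181200.0, 318000.0]

# Exact-match "accuracy": fraction of predictions equal to the expected value
exact_matches = sum(1 for p, e in zip(predicted, expected) if p == e)
accuracy = exact_matches / len(expected)

# Average absolute difference between predictions and expected values
abs_errors = [abs(p - e) for p, e in zip(predicted, expected)]
mean_abs_error = sum(abs_errors) / len(abs_errors)

print("exact-match accuracy:", accuracy)
print("average absolute error:", round(mean_abs_error, 2))
```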

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

* Mean Squared Error (MSE).
* Root Mean Squared Error (RMSE).
* Mean Absolute Percentage Error (MAPE).

#### Mean Squared Error

Mean Squared Error, or MSE for short, is a popular error metric for regression problems.

It is also an important loss function for algorithms fit or optimized using the least squares framing of a regression problem. Here “least squares” refers to minimizing the mean squared error between predictions and expected values.

The MSE is calculated as the mean or average of the squared differences between predicted and expected target values in a dataset.

![image.png](attachment:image.png)

The squaring also has the effect of inflating or magnifying large errors. That is, the larger the difference between the predicted and expected values, the larger the resulting squared positive error. This has the effect of “punishing” models more for larger errors when MSE is used as a loss function. 
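The magnifying effect can be illustrated with a hand-written MSE on hypothetical values. Both sets of predictions below have the same total absolute error (4), but the set with one large error has an MSE four times higher.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between expected and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 10.0, 10.0, 10.0]
small_errors = [11.0, 9.0, 11.0, 9.0]       # four errors of size 1
one_large_error = [10.0, 10.0, 10.0, 14.0]  # a single error of size 4

print(mse(y_true, small_errors))     # 1.0
print(mse(y_true, one_large_error))  # 4.0
```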

```python
from sklearn.metrics import mean_squared_error

# Given values
Y_true = [1, 1, 2, 2, 4]  # Y_true = Y (original values)

# Calculated values
Y_pred = [0.6, 1.29, 1.99, 2.69, 3.4]  # Y_pred = Y'

# Calculation of Mean Squared Error (MSE)
mean_squared_error(Y_true, Y_pred)
```

```
0.21606
```

A perfect mean squared error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MSE is relative to your specific dataset.

It is a good idea to first establish a baseline MSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an MSE better than the MSE for the naive model has skill.
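A baseline comparison along these lines might look like the following sketch. All values are hypothetical, and `model_pred` stands in for the predictions of whatever model you have trained.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between expected and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical training targets, test targets, and model predictions
y_train = [3.0, 5.0, 7.0, 9.0]
y_test = [4.0, 6.0, 8.0]
model_pred = [4.5, 5.5, 8.5]

# Naive baseline: always predict the mean target value from the training set
baseline_value = sum(y_train) / len(y_train)
baseline_pred = [baseline_value] * len(y_test)

baseline_mse = mse(y_test, baseline_pred)
model_mse = mse(y_test, model_pred)

print("baseline MSE:", round(baseline_mse, 3))
print("model MSE:", round(model_mse, 3))
print("model has skill:", model_mse < baseline_mse)
```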

#### Root Mean Squared Error

The Root Mean Squared Error, or RMSE, is an extension of the mean squared error.

Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target value that is being predicted.

For example, if your target variable has the units “dollars,” then the RMSE error score will also have the unit “dollars” and not “squared dollars” like the MSE.

As such, it may be common to use MSE loss to train a regression predictive model, and to use RMSE to evaluate and report its performance.

![image.png](attachment:image.png)


#### RMSE = sqrt(MSE)

Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response. It's the most important criterion for fit if the main purpose of the model is prediction.

```python
import math
from sklearn.metrics import mean_squared_error, root_mean_squared_error

# RMSE as the square root of the MSE computed by scikit-learn
rmse1 = math.sqrt(mean_squared_error(Y_true, Y_pred))
print('Root mean square error', rmse1)

# Alternatively, scikit-learn 1.4+ provides root_mean_squared_error directly
# (older releases used mean_squared_error(..., squared=False), since removed)
rmse2 = root_mean_squared_error(Y_true, Y_pred)
print('Root mean square error', rmse2)
```

```
Root mean square error 0.4648225467853297
Root mean square error 0.4648225467853297
```


#### Mean Absolute Percentage Error

Mean Absolute Percentage Error (MAPE) is a statistical measure of how accurate a machine learning model's predictions are on a particular dataset.

MAPE can be treated as a loss function that quantifies the error found during model evaluation. It expresses accuracy in terms of the relative differences between the actual and estimated values.

![image.png](attachment:image.png)

As seen above, in MAPE, we first calculate the absolute difference between the Actual Value (A) and the Estimated/Forecast value (F) and divide it by the actual value. We then take the mean of these ratios to get the MAPE value.

MAPE is usually expressed as a percentage by multiplying the result by 100. The lower the MAPE, the better the model fits.
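The calculation can be written out by hand as a minimal sketch, assuming the standard definition of MAPE as the mean of |A − F| / |A|, multiplied by 100. The function name `mape` and the values are illustrative.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, returned as a percentage."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical values: errors of 10%, 10%, and 0%
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

print(round(mape(actual, forecast), 2))  # 6.67
```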

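```python
from sklearn.metrics import mean_absolute_percentage_error

Y_actual = [1, 2, 3, 4, 5]
Y_Predicted = [1, 2.5, 3, 4.1, 4.9]

# scikit-learn returns MAPE as a fraction, so multiply by 100 for a percentage
mape = mean_absolute_percentage_error(Y_actual, Y_Predicted) * 100
print(round(mape, 2))
```

```
5.9
```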


#### RMSE VS MAPE

A benefit of using RMSE is that the metric it produces is in terms of the unit being predicted. For example, using RMSE in a house price prediction model would give the error in terms of house price, which can help end users easily understand model performance.

MAPE is a popular metric to use as it returns the error as a percentage, making it both easy for end users to understand and simpler to compare model accuracy across use cases and datasets.

#### Similarities between RMSE and MAPE

* Both are used for regression models
* Both are all-round metrics which give a good indication of general model performance
* Both are easily implemented in Python using the scikit-learn package

#### Differences between RMSE and MAPE

* RMSE is more sensitive to outliers than MAPE
* MAPE returns the error as a percentage whilst RMSE is an absolute measure in the same scale as the target
* MAPE is much more understandable for end users than RMSE due to it being a percentage
* RMSE can be used on any regression dataset, whilst MAPE can’t be used when actual values are at or close to 0, because dividing by them makes the result undefined or extremely large
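The last difference can be demonstrated with hand-written versions of both metrics. This is a sketch on hypothetical data: RMSE is well defined when an actual value is 0, while the MAPE calculation fails on the division.

```python
import math


def rmse(actual, forecast):
    """Root of the mean squared difference, in the units of the target."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


def mape(actual, forecast):
    """Mean absolute percentage error, returned as a percentage."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)


# Hypothetical data containing an actual value of exactly 0
actual = [0.0, 10.0, 20.0]
forecast = [1.0, 11.0, 19.0]

print("RMSE:", rmse(actual, forecast))

mape_failed = False
try:
    mape(actual, forecast)
except ZeroDivisionError:
    mape_failed = True
    print("MAPE is undefined when an actual value is 0")
```

Library implementations may handle this case differently, for example by substituting a very small denominator, which produces an extremely large value rather than an exception.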

#### When should you use MAPE or RMSE

The main factors that determine whether you should use MAPE or RMSE relate to the model you are training, the dataset you have created, and to what extent end users are involved in the process.

**The Model**

Machine learning models use an error metric to guide their optimisation during the training process. It is common to track the same metric that is being used for this optimisation to better understand model development over time.

A common metric to use for this optimisation is RMSE, whilst MAPE is rarely used for this situation. Therefore, if your model is optimising for RMSE then it would make sense to track this rather than MAPE.

**The Dataset**

We saw previously that MAPE can suffer from a division by 0 error. Therefore, if your dataset has actual values at or around 0, MAPE is not a usable metric and you would need to use RMSE instead.

**The End-users**

MAPE is much more understandable than RMSE. Therefore, if you need to convey model performance to end users, especially to those who aren’t data professionals, then MAPE would be the better choice, as it is expressed as an easy-to-understand percentage.