# Table of contents

* [1. Introduction](#1.-Introduction)
* [2. Types of model](#2.-Types-of-model)
* [3. Cross validation set](#3.-Cross-validation-set)
  * [k- Fold Cross Validation](#k--Fold-Cross-Validation)
  * [But how do we choose k?](#But-how-do-we-choose-k?)
* [4. How do I measure the performance of my classification model?](#4.-How-do-I-measure-the-performance-of-my-classification-model?)
  * [Confusion Matrix](#Confusion-Matrix)
  * [Log Loss or Logarithmic Loss or Cross Entropy Loss](#Log-Loss-or-Logarithmic-Loss-or-Cross-Entropy-Loss)
  * [Receiver Operating characteristics(ROC) and Area under ROC Curve(AUC)](#Receiver-Operating-characteristics%28ROC%29-and-Area-under-ROC-Curve%28AUC%29)
* [5. How do I measure the performance of my regression model?](#5.-How-do-I-measure-the-performance-of-my-regression-model?)
  * [Root Mean Square Error (RMSE)](#Root-Mean-Square-Error-%28RMSE%29)
  * [$R^2$](#$R^2$)

# 1. Introduction

>**Rule 1:**  
Never use the testing data for training purposes. Use it only at the very end to evaluate the performance of the model.

A well-fitting model is one where the difference between the actual (observed) values and the predicted values is small and unbiased on the train, validation and test data sets. Predictive modeling works on the principle of constructive feedback: you build a model, get feedback from metrics, make improvements, and continue until you achieve the desired accuracy. Evaluation metrics explain the performance of a model. An important aspect of evaluation metrics is their capability to discriminate among model results.

Building a predictive model is not, by itself, the goal. The goal is to create and select a model that gives high accuracy on out-of-sample data. Hence, it is crucial to check the accuracy of the model before relying on its predictions. In our industry, we consider different kinds of metrics to evaluate our models. The choice of metric depends on the type of model and on the implementation plan for the model.

# 2. Types of model

Supervised Machine learning models are of two types:

* **Regression**: The output is a continuous variable (e.g. predicting housing prices).
* **Classification**: The output is a discrete variable (e.g. cat vs dog). Classification can be binary or multi-class.

The evaluation metrics used in each of these models are different.

# 3. Cross validation set

* Cross validation is used in machine learning to estimate the skill of a model on unseen data.
* It generally results in a less biased estimate of model performance, and in k-fold cross validation every data point is used for validation exactly once.
* It has a parameter (k) which specifies the number of groups the training dataset is split into.
* A model can have error due to bias (underfitting) and error due to variance (overfitting). So, multiple models of varying complexity are trained (for example, regression models of order 1, 2, 3, 4, ... can be fitted to the dataset).
* To find which model performs best, we divide the dataset into a training set, a cross validation set and a test set.
* We train models of various complexities on the training set and evaluate each of them on the cross validation set.
* We plot the training set error and cross validation set error for all models.
* The optimum model is the one with the minimum CV set error. Once it has been selected and trained, it is tested on the test set.
* **In summary, we use train set to train multiple model, cross validation set to select best model and test set to evaluate overall performance of the model.**

![variation of train set error and cross validation error with complexity](cross%20validation%20error.png)

A drawback of this simple hold-out cross validation set is that we lose a good amount of data that could have been used to train the model. The model is trained on less data, so it tends to have higher bias, and this won’t give the best estimate for the coefficients.

## k- Fold Cross Validation
Now, we will try to visualize how k-fold cross validation works.
![k-fold cross validation](k_fold.png "k-fold cross validation")


This is a 7-fold cross validation.

Here’s what goes on behind the scenes:
1. We divide the entire population into 7 equal samples.
2. We train a model on 6 samples (green boxes) and validate it on the remaining sample (grey box).
3. At the second iteration, we train the model with a different sample held out for validation.
4. After 7 iterations, we have built a model on each combination of 6 samples and held out each sample once for validation.
5. This is a way to reduce the selection bias and reduce the variance in prediction power.
6. Once we have all 7 models, we take the average of their error terms to judge which of the candidate models is best.
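
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `k_fold_indices` and `cross_validate` are hypothetical, the data is made up, and the "model" is a toy that always predicts the mean of its training samples, so that the splitting and averaging logic stays in focus.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def cross_validate(y, k):
    """Return the validation MSE of a mean-only toy model, averaged over k folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(y), k):
        held_out_set = set(held_out)
        train = [y[i] for i in range(len(y)) if i not in held_out_set]
        # "Training" the toy model: it always predicts the training mean.
        prediction = sum(train) / len(train)
        mse = sum((y[i] - prediction) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Made-up target values, 14 observations so that 7 folds hold 2 samples each.
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
folds = k_fold_indices(len(y), k=7)
cv_error = cross_validate(y, k=7)
print([len(fold) for fold in folds], cv_error)
```

To compare candidate models, one would run the same loop for each candidate and keep the one with the lowest averaged validation error.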

## But how do we choose k?
This is the tricky part. Choosing k involves a trade-off.  
For a small k, we have a higher selection bias but low variance in the performance estimates.  
For a large k, we have a small selection bias but high variance in the performance estimates.  

# 4. How do I measure the performance of my classification model?

For binary classification we expect an output of 0 or 1. The model typically produces a predictive score which conveys the probability of the output being 1. If the score is above a certain threshold value, we set the output to 1; otherwise the output is 0. This threshold value is usually 0.5 but can vary.
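
As a small illustration of the thresholding step, the sketch below converts scores into class labels. The function name `apply_threshold` and the scores are made up for this example.

```python
def apply_threshold(scores, threshold=0.5):
    """Map each predicted score to class 1 if it is above the threshold, else 0."""
    return [1 if score > threshold else 0 for score in scores]

# Hypothetical predicted probabilities for five observations.
scores = [0.92, 0.35, 0.58, 0.07, 0.45]
default_labels = apply_threshold(scores)                # threshold of 0.5
strict_labels = apply_threshold(scores, threshold=0.6)  # a stricter threshold
print(default_labels)  # [1, 0, 1, 0, 0]
print(strict_labels)   # [1, 0, 0, 0, 0]
```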

>To evaluate the performance of a binary classification model we use Confusion matrix, Log Loss/Cross Entropy, or AUC.

## Confusion Matrix

A confusion matrix is an N × N matrix, where N is the number of classes being predicted. It is a table that shows, for each class, how the actual values compare with the values predicted by the model. It is also referred to as an **error matrix**.

![confusion matrix](confusion_matrix.png)
![confusion matrix](confusion_matrix1.png)

* **True Positive :** When the prediction is **True** (correct) and the actual class is **positive**.
* **True Negative :** When the prediction is **True** (correct) and the actual class is **negative**.
* **False Positive :** When the prediction is **False** (incorrect): the model predicted **positive** but the actual class is **negative**. This is a Type 1 error.
* **False Negative :** When the prediction is **False** (incorrect): the model predicted **negative** but the actual class is **positive**. This is also called a Type 2 error.
* **Accuracy :** The proportion of the total number of predictions that were correct.


$$
Accuracy=\frac{TP+TN}{TP+TN+FP+FN}
$$


* **Positive Predictive Value or Precision :** The proportion of predicted positive cases that are actually positive.


$$
Precision=\frac{TP}{TP+FP}
$$


* **Negative Predictive Value :** The proportion of predicted negative cases that are actually negative, $\frac{TN}{TN+FN}$.
* **Sensitivity or Recall:** The proportion of actual positive cases which are correctly identified.


$$
Recall=\frac{TP}{TP+FN}
$$


* **Specificity :** The proportion of actual negative cases which are correctly identified.


$$
Specificity=\frac{TN}{FP+TN}
$$

* **F1 score :** It combines the precision and recall of the model into a single score that summarises the model's performance. It is the harmonic mean of precision and recall. The F1 score ranges between 0 and 1, and a larger value means better prediction.

$$
\text{F1 score}=\frac{2*Precision*Recall}{Precision+Recall}
$$

* **$F_\beta$ score :** The smaller the value of $\beta$, the more weight is given to precision; the larger the value of $\beta$, the more weight is given to recall. The F1 score is the $F_\beta$ score with $\beta=1$.

$$
F_\beta=\frac{(1+\beta^2)*Precision*Recall}{(\beta^2*Precision)+Recall}
$$
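
The metrics in this section can be computed directly from the four confusion-matrix counts. The following is a minimal sketch using made-up labels; `confusion_counts` and `f_beta` are hypothetical helper names, and class 1 is assumed to be the positive class.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta < 1 favours precision, beta > 1 favours recall."""
    b2 = beta ** 2
    denominator = b2 * precision + recall
    return (1 + b2) * precision * recall / denominator if denominator else 0.0

# Made-up data: 12 actual positives followed by 8 actual negatives.
actual = [1] * 12 + [0] * 8
predicted = [1] * 8 + [0] * 4 + [1] * 2 + [0] * 6

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # sensitivity
specificity = tn / (fp + tn)
npv = tn / (tn + fn)           # negative predictive value
f1 = f_beta(precision, recall)
f2 = f_beta(precision, recall, beta=2.0)
print(tp, tn, fp, fn)  # 8 6 2 4
print(accuracy, precision, specificity, npv)
print(recall, f1, f2)
```

With these counts, precision (0.8) is higher than recall (about 0.67), so the recall-weighted $F_2$ comes out lower than $F_1$.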

## Log Loss or Logarithmic Loss or Cross Entropy Loss

Log loss should be close to 0 for a good binary classification model. Log loss increases as the predicted probability diverges from the actual value. A perfect model will have a log loss of 0. For a single observation it is calculated as:
$$
\text{Log Loss}=-(y*\log{p}+(1-y)*\log{(1-p)})
$$
In the above equation  
$y$ : the actual output (0 or 1)  
$p$ : the predicted probability that the output is 1  
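
A minimal sketch of this calculation is shown below. It assumes the natural logarithm, averages the per-observation loss over the data set, and clips probabilities away from exactly 0 and 1 so that the logarithm stays defined; the function name, the `eps` parameter and the data are choices made for this example.

```python
import math

def log_loss(actual, probabilities, eps=1e-15):
    """Average binary log loss over all observations (natural logarithm)."""
    total = 0.0
    for y, p in zip(actual, probabilities):
        # Clip p so that log(0) never occurs.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(actual)

actual = [1, 0, 1, 0]
# Probabilities close to the actual labels give a small loss.
good_loss = log_loss(actual, [0.9, 0.1, 0.8, 0.2])
# Confident but wrong probabilities are penalised heavily.
bad_loss = log_loss(actual, [0.1, 0.9, 0.2, 0.8])
print(round(good_loss, 4), round(bad_loss, 4))
```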

By contrast, **accuracy** is the proportion of correct predictions for the positive and negative class over the total number of observations, so it does not reflect how confident each prediction was.

## Receiver Operating characteristics(ROC) and Area under ROC Curve(AUC)

The ROC curve is a graph showing the performance of a classification model at different classification thresholds. It plots the true positive rate (TPR) against the false positive rate (FPR), and each threshold results in a different pair of true positive and false positive rates.

**True Positive Rate(TPR)** is the proportion of the positive data points correctly predicted to all the actual positive data points.

**False Positive Rate(FPR)** is the proportion of the negative data points falsely predicted as positive to all the actual negative data points.

$$
\text{Sensitivity / Recall / True Positive Rate}=\frac{\text{True Positive (TP)}}{\text{Actual Positive (TP+FN)}}\\
\text{False Positive Rate}=\frac{\text{False Positive (FP)}}{\text{Actual Negative (TN+FP)}}
$$

When we lower the classification threshold, say from 0.5 to 0.3, we classify more data points as positive, thus increasing both the true positives and the false positives.
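
The effect of lowering the threshold can be checked on a small made-up example. The sketch below computes one ROC point (FPR, TPR) per threshold; `tpr_fpr` is a hypothetical helper name and the scores are invented for illustration.

```python
def tpr_fpr(actual, scores, threshold):
    """Return (TPR, FPR) when scores above the threshold are predicted positive."""
    predicted = [1 if s > threshold else 0 for s in scores]
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Four actual positives followed by four actual negatives.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.25, 0.1]

tpr_05, fpr_05 = tpr_fpr(actual, scores, threshold=0.5)
tpr_03, fpr_03 = tpr_fpr(actual, scores, threshold=0.3)
print(tpr_05, fpr_05)  # 0.5 0.25
print(tpr_03, fpr_03)  # 1.0 0.5
```

Both rates rise when the threshold drops from 0.5 to 0.3, which moves the operating point up and to the right along the ROC curve.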

**We can compute the points on an ROC curve by evaluating the model at many different classification thresholds. To summarise performance across all thresholds in a single number, we use AUC, the Area Under the ROC Curve.**

![AUC](auc.png)

The dashed line represents a random classifier, for which the true positive rate equals the false positive rate at every threshold.

We can think of AUC as the probability that the model ranks a randomly chosen positive example more highly than a randomly chosen negative example.
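
This ranking interpretation can be turned into a direct, if inefficient, calculation: compare every positive example with every negative example and count how often the positive one gets the higher score. The sketch below assumes ties count as half a win; the function name `auc_by_ranking` and the data are made up for illustration.

```python
def auc_by_ranking(actual, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # assumption: a tie counts as half a win
    return wins / (len(positives) * len(negatives))

actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.25, 0.1]
auc = auc_by_ranking(actual, scores)
print(auc)  # 0.8125 (13 of the 16 pairs are ranked correctly)
```

The pairwise loop is quadratic in the number of examples, so it is only meant to make the definition concrete.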

AUC ranges between 0 and 1.

A value of 0 means that 100% of the model's predictions are incorrect. A value of 1 means that 100% of the model's predictions are correct. A value of 0.5 corresponds to the random classifier.

# 5. How do I measure the performance of my regression model?

>A few statistical tools, such as the coefficient of determination (also called $R^2$), Adjusted $R^2$ and Root Mean Square Error (RMSE), are commonly used to evaluate the performance of a regression model.

## Root Mean Square Error (RMSE)

RMSE measures the typical size of the difference between the predicted and the actual values. Because these differences can be positive or negative, they could cancel each other out when summed. To prevent that, we take the square of the difference between the predicted and actual value.

1. Find the difference between the predicted and actual value for every observation, square each difference and add the squares up.
2. Divide the sum by the number of observations.
3. Take the square root of the value from step 2.

$$
RMSE=\sqrt{\frac{\sum_{i=1}^N(\text{Predicted}_i-\text{Actual}_i)^2}{N}}
$$
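
The three steps translate directly into code. This is a minimal sketch with made-up values; the function name `rmse` is chosen for this example.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equally long lists of numbers."""
    # Step 1: square the difference for every observation and add them up.
    sum_of_squares = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Step 2: divide the sum by the number of observations.
    mean_square = sum_of_squares / len(actual)
    # Step 3: take the square root.
    return math.sqrt(mean_square)

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
error = rmse(actual, predicted)
print(round(error, 4))  # 1.2247, the square root of (1 + 0 + 1 + 4) / 4
```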

## $R^2$

It is also called the coefficient of determination.

R² gives us a measure of how well the actual outcomes are replicated by the model or the regression line. It is based on the proportion of the total variation in the outcomes that is explained by the model.

$$
R^2=\frac{\text{Variance explained by the model}}{\text{Total variance}}
$$

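For a least-squares model with an intercept, evaluated on the data it was fitted to, $R^2$ lies between 0 and 1 (0% to 100%). On other data, a model that predicts worse than the mean gives a negative $R^2$. A value of 1 means that the model explains all the variation in the target variable around its mean.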

**Sum of squared errors (SSE)**, or residual sum of squares: how far the predicted values are from the actual values.
$$
SSE = \sum_{i=1}^N(\text{Actual}_i - \text{Predicted}_i)^2
$$

**Total sum of squares (SST)**: how far the actual values are from their mean.
$$
SST = \sum_{i=1}^N(\text{Actual}_i - \text{Mean})^2
$$

**Regression sum of squares (SSR)**: how far the predicted values are from the mean of the actual values.
$$
SSR = \sum_{i=1}^N(\text{Predicted}_i - \text{Mean})^2
$$

So $R^2=\frac{SSR}{SST}$, which for a least-squares fit with an intercept (where $SST=SSR+SSE$) can be written as,
$$
R^2=1-\frac{SSE}{SST}
$$

If the error in prediction is low then SSE will be low and R² will be close to 1.
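
As a worked sketch of $R^2=1-\frac{SSE}{SST}$, the code below sums the squared residuals and the squared deviations from the mean over all observations. The function name `r_squared` and the values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination computed as 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    # SSE: squared distance between actual and predicted values.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared distance between actual values and their mean.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst

actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]
score = r_squared(actual, predicted)            # SSE = 6, SST = 20
mean_only = r_squared(actual, [6.0, 6.0, 6.0, 6.0])
print(round(score, 4), mean_only)  # 0.7 0.0
```

Always predicting the mean gives $SSE=SST$ and therefore $R^2=0$, which is why the mean is the usual baseline for this metric.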