# Evaluation Metrics


Evaluation metrics **quantify the performance** of a Machine Learning model. There are different metrics for classification, regression, clustering, and other tasks. This notebook focuses on evaluating Supervised Learning techniques.


## Classification Problems


In order to demonstrate some metrics from scikit-learn we're going to use the digits dataset, and build a digit 9 classifier.

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets

digits = datasets.load_digits()
X = digits.data
y = digits.target.copy()

y[digits.target == 9] = 1 # Set 1 where the digit is "9"
y[digits.target != 9] = 0 # Set 0 otherwise

Now we split the dataset into train and test, and run a linear model.

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 6)

In [4]:
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X_train,y_train) # Run the algorithm

# Run prediction on X test dataset for further analysis 
y_log_predict = log_reg.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


### Model Accuracy :  
  Classification accuracy is the fraction of predictions that are correct, and can be obtained with the following formula:   
  &nbsp;   
\begin{equation*}
Accuracy = \frac{Correct Predictions}{Total Predictions}
\end{equation*}    
  

In [5]:
# Run model accuracy passing the test X and y datasets
log_reg.score(X_test, y_test)

0.9844444444444445

Our example got a 98% accuracy. Now, let's see other metrics on the same model.
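
A high accuracy is easier to reach when one class dominates, as it does here (only one digit in ten is a "9"). The sketch below applies the accuracy formula by hand to a small hypothetical label set, not the digits data, and shows that a "model" that always predicts the majority class already scores 90%.

```python
# Hypothetical toy labels: 9 negatives and 1 positive.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# A baseline that ignores its input and always predicts class 0.
always_zero = [0] * len(y_true)

print(accuracy(y_true, always_zero))  # 0.9
```

The baseline never finds the single positive case, yet its accuracy looks good.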


### Precision and Recall :
  In some cases accuracy alone is not enough to measure how well a model generalizes to the data. This becomes evident on imbalanced datasets, and the following metrics were created to address it. To fully evaluate the effectiveness of a model, you need to examine both precision and recall. Unfortunately, improving precision typically reduces recall and vice versa.
  * Precision is the number of true positives divided by the total number of positive predictions. It measures how precise the model is among the cases it predicted as positive, and is a good metric when the cost of a False Positive is high.
  * Recall is the number of true positives divided by the number of actual positive cases. It is a good metric when there is a high cost associated with a False Negative.
  
    \begin{equation*}
      Precision = \frac{TruePositive}{TruePositive + FalsePositive}
    \end{equation*} 
 &nbsp;
    \begin{equation*}
      Recall = \frac{TruePositive}{TruePositive + FalseNegative}
    \end{equation*} 

&nbsp;
```
True Positive:          The prediction is correct and the predicted value is positive
True Negative:          The prediction is correct and the predicted value is negative
False Positive:         The prediction is wrong and the predicted value is positive
False Negative:         The prediction is wrong and the predicted value is negative
```
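
As a minimal sketch of these definitions, the four outcomes can be counted directly from the true and predicted labels, and precision and recall follow from the counts. The label lists and the helper name `confusion_counts` are hypothetical, chosen only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1  # predicted positive, actually positive
        elif p == 0 and t == 0:
            tn += 1  # predicted negative, actually negative
        elif p == 1 and t == 0:
            fp += 1  # predicted positive, actually negative
        else:
            fn += 1  # predicted negative, actually positive
    return tp, tn, fp, fn


# Hypothetical labels, not taken from the digits dataset.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(tp, tn, fp, fn)  # 2 4 1 1
print(precision, recall)
```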


In [8]:
from sklearn.metrics import precision_score

precision_score(y_test, y_log_predict)

0.8888888888888888

In [9]:
from sklearn.metrics import recall_score

recall_score(y_test, y_log_predict)

0.9142857142857143

   
### Confusion Matrix : 
This is a common way to see all the metrics above at a glance: it shows the counts of correct and wrong predictions (True Positive, True Negative, False Positive, False Negative).     
![image.png](https://miro.medium.com/max/356/1*g5zpskPaxO8uSl0OWT4NTQ.png)



In [10]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_log_predict)

array([[411,   4],
       [  3,  32]])
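
The matrix ties the earlier numbers together. For binary labels 0 and 1, scikit-learn puts actual classes in rows and predicted classes in columns, so the layout is `[[TN, FP], [FN, TP]]`. The sketch below copies the counts from the output above and recomputes accuracy, precision and recall by hand.

```python
# Counts copied from the confusion matrix output above.
# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = [[411, 4],
      [3, 32]]

(tn, fp), (fn, tp) = cm

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(accuracy, 4))   # 0.9844
print(round(precision, 4))  # 0.8889
print(round(recall, 4))     # 0.9143
```

These match the values returned by `score`, `precision_score` and `recall_score` earlier in the notebook.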


### Area under the ROC Curve:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate against the false positive rate as the decision threshold varies, and the AUC (Area Under the Curve) measures separability: how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. An excellent model has an AUC near 1, which means it separates the classes well. A model with an AUC near 0.5 has no ability to separate the classes, and an AUC near 0 means the model is ranking the classes the wrong way round (predicting 0s as 1s and 1s as 0s).
* TPR (True Positive Rate) / Recall  &nbsp;
\begin{equation*}
      TPR = \frac{TruePositive}{TruePositive + FalseNegative}
\end{equation*} 
* Specificity &nbsp;
\begin{equation*}
      Specificity = \frac{TrueNegative}{TrueNegative + FalsePositive}
\end{equation*}
* FPR   &nbsp;
\begin{equation*}
      FPR = 1 - Specificity
\end{equation*} 
&nbsp;
\begin{equation*}
      FPR = \frac{FalsePositive}{TrueNegative + FalsePositive}
\end{equation*} 

    ![image.png](https://miro.medium.com/max/361/1*pk05QGzoWhCgRiiFbz-oKQ.png)
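
A ROC curve is traced by sweeping the decision threshold and recording one (FPR, TPR) point per threshold. The sketch below does this on four hypothetical scores. The helper name `roc_point` and the data are assumptions for illustration, not part of scikit-learn.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold is predicted positive."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 0:
            tn += 1
        else:
            fn += 1
    tpr = tp / (tp + fn)  # recall
    fpr = fp / (fp + tn)  # 1 - specificity
    return fpr, tpr


# Hypothetical labels and predicted probabilities for class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

for threshold in (0.9, 0.5, 0.3, 0.0):
    print(threshold, roc_point(y_true, scores, threshold))
# 0.9 (0.0, 0.0)
# 0.5 (0.0, 0.5)
# 0.3 (0.5, 1.0)
# 0.0 (1.0, 1.0)
```

Lowering the threshold moves the point from the bottom-left corner (nothing predicted positive) to the top-right corner (everything predicted positive).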

In [11]:
# Predicted probability of the positive class (digit "9") for each test sample
decision_score = log_reg.predict_proba(X_test)[:,1]

from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, decision_score)

0.9825817555938039
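
One way to read this number: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way on hypothetical scores. The helper `pairwise_auc` is an illustration, not how scikit-learn implements `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities for class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print(pairwise_auc(y_true, scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Reversing the scores would give 0.25, and a perfect ranking gives 1.0.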

## Regression Problems

Regression problems are those where you want to predict a continuous value, for example the selling price of a house. To study this topic we are going to use [the Boston house-prices dataset](https://scikit-learn.org/stable/datasets/index.html#boston-dataset). It is available through the sklearn.datasets module, which includes utilities to load and fetch popular reference datasets. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells below require an older version.

### Preparing the data


First, let's take a look at some information about the dataset.

In [14]:
from sklearn.datasets import load_boston

boston = load_boston()
print(boston.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu

Now we load the data into a dataframe and add the target variable to it.

In [15]:
# Converting to dataframe
dataset = pd.DataFrame(boston.data, columns = boston.feature_names)

# Adding the target variable to the dataframe
dataset['PRICE'] = boston.target
dataset.head()

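|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | PRICE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |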


To keep things simple, we will study the regression metrics with a machine learning model that uses only the RM (average number of rooms per dwelling) and PRICE columns. Once you understand the steps, you can try applying them to all the features.

In [16]:
# Reshaping
X = np.array(dataset['RM']).reshape(-1,1)
Y = np.array(dataset['PRICE']).reshape(-1,1)

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= 0.4, random_state=11)

# Training the model
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)

# Getting predictions
predictions = lm.predict(X_test)

### Mean Absolute Error (MAE)
This is the simplest of all the metrics. It is the average of the absolute differences between the actual values and the predictions, so it indicates how far the predictions are from the actual output. However, it gives no idea of the direction of the error, i.e. whether we are under-predicting or over-predicting the data. The lower the MAE, the better the performance of your model.
![image](https://i.imgur.com/BmBC8VW.jpg)

In [None]:
from sklearn.metrics import mean_absolute_error
print('MAE:', mean_absolute_error(y_test, predictions))

MAE: 4.488594176693228
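
As a minimal sketch of the definition, MAE can be computed by hand as the mean of the absolute errors. The prices below are hypothetical values, not taken from the Boston data. The signed mean error is printed next to it to show that MAE hides the direction of the errors.

```python
def mae_manual(y_true, y_pred):
    """Mean of the absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical actual and predicted prices.
y_true = [24.0, 21.5, 34.5, 33.0]
y_pred = [25.0, 20.5, 30.5, 35.0]

# Signed errors are -1, 1, 4 and -2; over- and under-predictions partly cancel.
mean_signed_error = sum(t - p for t, p in zip(y_true, y_pred)) / len(y_true)

print(mae_manual(y_true, y_pred))  # 2.0
print(mean_signed_error)           # 0.5
```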


### Mean Squared Error (MSE)

MSE is similar to MAE, but it squares the difference between actual and predicted output values before summing them, instead of using the absolute value. Because the error is squared, larger errors weigh more than smaller ones, so a model evaluated with this metric is pushed to focus on the larger errors. As with MAE, the larger the number, the larger the error.

![image](https://cdn-media-1.freecodecamp.org/images/hmZydSW9YegiMVPWq2JBpOpai3CejzQpGkNG)

In [None]:
from sklearn.metrics import mean_squared_error
print('MSE:', mean_squared_error(y_test, predictions))

MSE: 43.60590161966949
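
The effect of squaring is easiest to see on two hypothetical sets of errors with the same MAE: one spreads the error evenly, the other concentrates it in a single large miss. The sketch below uses made-up error values and helper names chosen for illustration.

```python
def mae_from_errors(errors):
    """Mean absolute error given a list of (actual - predicted) errors."""
    return sum(abs(e) for e in errors) / len(errors)


def mse_from_errors(errors):
    """Mean squared error given a list of (actual - predicted) errors."""
    return sum(e * e for e in errors) / len(errors)


# Hypothetical errors: same total absolute error, distributed differently.
even_errors = [2.0, 2.0, 2.0, 2.0]
one_big_miss = [0.0, 0.0, 0.0, 8.0]

print(mae_from_errors(even_errors), mse_from_errors(even_errors))    # 2.0 4.0
print(mae_from_errors(one_big_miss), mse_from_errors(one_big_miss))  # 2.0 16.0
```

Both sets have an MAE of 2.0, but the single large miss quadruples the MSE.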


### R2 Score

R2, also known as the coefficient of determination, measures the degree to which the variance in the dependent target can be explained by the features. If the R2 value for a particular model is 0.7, this means that 70% of the variation in the dependent target can be explained by the variation in the features.    

  

* **SS ERROR**: the **sum of squares of residuals** is the sum of the squared residuals (deviations of the predicted values $\hat{y}_{i}$ from the actual empirical values $y_{i}$). It is a measure of the discrepancy between the data and an estimation model.  
$\sum (y_{i} - \hat{y}_{i})^2$
* **SS TOTAL**: the **total sum of squares** is the sum of the squared differences between the observed dependent variable and its mean $\bar{y}$. You can think of this as the dispersion of the observed values around the mean, much like the variance in descriptive statistics.    
$\sum (y_{i} - \bar{y})^2$

* **R 2**:  **R squared** is calculated by dividing **SS ERROR** by **SS TOTAL** and then subtracting the result from 1.

\begin{equation*}
      R^2 = 1 - \frac{\sum (y_{i} - \hat{y}_{i})^2}{\sum (y_{i} - \bar{y})^2}
\end{equation*} 

In [None]:
from sklearn.metrics import r2_score
print('R2:', r2_score(y_test, predictions))

R2: 0.48350092178106796
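
As a minimal sketch, R2 can be computed by hand as 1 minus the ratio of the sum of squared residuals (actual minus predicted) to the total sum of squares (actual minus the mean of the actual values). The data and the helper name `r2_manual` are hypothetical, used only for illustration.

```python
def r2_manual(y_true, y_pred):
    """R squared: 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_error = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_error / ss_total


# Hypothetical actual and predicted values.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

print(round(r2_manual(y_true, y_pred), 4))  # 0.9486

# A model that always predicts the mean of y_true explains none of the variance.
mean_only = [sum(y_true) / len(y_true)] * len(y_true)
print(r2_manual(y_true, mean_only))  # 0.0
```

The second case gives a useful reference point: an R2 of 0 means the model does no better than predicting the mean, and a perfect fit gives 1.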
