In [None]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

## Metrics for regression models

Regression models are built to analyse and predict on continuous variables.

Mean Absolute Error (MAE) is a simple and intuitive metric: the average absolute difference between observations and predictions.
Because the errors are not squared, MAE is less sensitive to outliers than MSE.

Mean Squared Error (MSE) is the average squared difference between observations and predictions.

- most widely used regression metric
- squares each error, so larger errors have a disproportionately larger impact on the score
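
To make the difference in outlier sensitivity concrete, here is a minimal sketch with hand-made numbers. The arrays `y_true`, `y_pred`, and `y_pred_outlier` are hypothetical values invented for illustration; they do not come from any dataset used in this notebook.

```python
import numpy as np

# Hypothetical observations and predictions (invented for illustration).
y_true = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
y_pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])  # every prediction is off by 1

mae = np.mean(np.abs(y_true - y_pred))
mse = np.mean((y_true - y_pred) ** 2)

# Same predictions, except the last one now misses by 10.
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] = 28.0

mae_outlier = np.mean(np.abs(y_true - y_pred_outlier))
mse_outlier = np.mean((y_true - y_pred_outlier) ** 2)

print(f"MAE: {mae:.1f} -> {mae_outlier:.1f}")
print(f"MSE: {mse:.1f} -> {mse_outlier:.1f}")
```

With these numbers a single large miss raises MAE from 1.0 to 2.8, while MSE jumps from 1.0 to 20.8, because the error of 10 contributes 100 once squared.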



In [None]:
# This code is not meant to be run, because the supporting dataset is not loaded
# (X_train, X_test, y_train and y_test are undefined); it is for illustration only.

# calculating MAE manually
rfr = RandomForestRegressor(n_estimators = 500,
                            random_state = 1111)

rfr.fit(X_train, y_train)
test_predictions = rfr.predict(X_test)
sum(abs(y_test - test_predictions))/len(test_predictions)

# calculating MAE using sklearn metrics
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, test_predictions)


# calculating MSE manually (squaring already removes the sign, so abs() is not needed)
sum((y_test - test_predictions)**2)/len(test_predictions)

# calculating MSE using sklearn metrics
from sklearn.metrics import mean_squared_error
mean_squared_error(y_test, test_predictions)
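
Since the cell above cannot run without the original data, here is a self-contained sketch of the same comparison. It assumes a synthetic dataset from `make_regression` as a stand-in for the missing one, and it uses fewer trees (`n_estimators=50`) only to keep it quick; neither choice comes from the original analysis.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the missing dataset (an assumption, not the original data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1111
)

rfr = RandomForestRegressor(n_estimators=50, random_state=1111)
rfr.fit(X_train, y_train)
test_predictions = rfr.predict(X_test)

# Manual calculations, vectorised with NumPy.
manual_mae = np.mean(np.abs(y_test - test_predictions))
manual_mse = np.mean((y_test - test_predictions) ** 2)

# The manual values and the sklearn values should agree.
print("MAE:", manual_mae, mean_absolute_error(y_test, test_predictions))
print("MSE:", manual_mse, mean_squared_error(y_test, test_predictions))
```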

## Metrics for classification models: accuracy, precision, recall (also called sensitivity), specificity, F1 score, and others

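We can use scikit-learn's `confusion_matrix` function to build a confusion matrix, which counts the true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP). The various metrics are calculated from these four counts.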
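
As a minimal sketch of that step, the snippet below builds a confusion matrix for a binary problem and unpacks the four counts. The label lists `y_true` and `y_pred` are hypothetical values made up for illustration.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions (1 = positive, 0 = negative).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, rows are the true classes and columns the predicted
# classes, so the layout is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Unpacking with `ravel()` in the order `tn, fp, fn, tp` only applies to the two-class case; with more classes the matrix has one row and one column per class.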

### Accuracy
Represents the overall ability of a model to make correct classifications, whether positive or negative.

(TN + TP) / (TN + TP + FN + FP)
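
The formula can be checked against scikit-learn's `accuracy_score`. This is a small sketch on hypothetical labels invented for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy from the four confusion-matrix counts.
manual_accuracy = (tn + tp) / (tn + tp + fn + fp)

print(manual_accuracy, accuracy_score(y_true, y_pred))
```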

### Precision
This is the proportion of predicted positives that are true positives.

TP / (TP + FP)

Precision is used when we don't want to overpredict positive values. If it costs $2000 to fly someone in for an interview, a company may only want to fly in candidates who have a good chance of actually joining the company. Precision could also be a good choice for fraud detection when false alarms are costly.
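
Here is a rough sketch of the interview example. The labels are hypothetical (1 means the candidate joins, or is predicted to join), and the only figure taken from the text is the $2000 cost per fly-in.

```python
from sklearn.metrics import precision_score

# Hypothetical outcomes: 1 = joins the company, 0 = does not join.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
# Hypothetical predictions: 1 = model says "fly this candidate in".
y_pred = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

manual_precision = tp / (tp + fp)

# Each false positive is a fly-in that did not lead to a hire.
wasted_cost = fp * 2000

print(manual_precision, precision_score(y_true, y_pred))
print(f"Spent on candidates who did not join: ${wasted_cost}")
```

Higher precision means fewer false positives, so less money goes to candidates who do not join.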

### Recall
This is the proportion of actual positives that the model finds.

TP / (TP + FN)

Recall is used when we can't afford to miss any positive values. Even if a person has a small chance of having cancer, we might want to administer additional tests. This could also be a good choice of metric when we have regulatory or compliance cases where the cost of missing a positive could be quite damaging.
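
A comparable sketch for recall, again on hypothetical labels (1 means the condition is present, or is flagged by the model):

```python
from sklearn.metrics import recall_score

# Hypothetical ground truth: 1 = condition present, 0 = absent.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
# Hypothetical predictions: 1 = flagged for additional tests.
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

manual_recall = tp / (tp + fn)

print(manual_recall, recall_score(y_true, y_pred))
print(f"Positive cases missed: {fn}")
```

Here the model flags two negatives unnecessarily, which recall ignores; the metric only penalises the one positive case that was missed.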