# Model Evaluations

## Cross-Validation

We can split the dataset into k folds, training the model on k-1 folds and testing on the remaining fold. Repeating this procedure k times, with each fold held out exactly once, gives k estimates of the test error, which can then be summarized (e.g. by taking the mean or median). When comparing models, we prefer the one with the lowest estimated test error.
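
The procedure above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`k_fold_indices`, `fit_line`, `cross_val_mse`), the toy data, and the choice of a one-feature least squares line as the model are all assumptions made for the example.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Least squares fit of y = a + b*x (illustrative model only)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_val_mse(xs, ys, k=5, seed=0):
    """Return one test MSE per fold: train on k-1 folds, test on the rest."""
    fold_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = statistics.fmean(
            (ys[i] - (a + b * xs[i])) ** 2 for i in held_out
        )
        fold_errors.append(mse)
    return fold_errors

# Toy data (assumed for illustration): a noisy line y = 2 + 3x.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.5) for x in xs]

fold_errors = cross_val_mse(xs, ys, k=5)
cv_estimate = statistics.fmean(fold_errors)  # summarize the k estimates
print(fold_errors, cv_estimate)
```

Shuffling before splitting is a design choice here; it avoids folds that follow the original ordering of the data.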

Cross-validation estimates the performance of a model on unseen data, which helps with:
- Detecting overfitting
- Tuning hyperparameters
- Model selection

The error estimate from cross-validation can overestimate the true test error, since each model is trained on only k-1 of the k folds and therefore sees less training data than a model fit on the full dataset. Common choices for k are 5 or 10, depending on the dataset size.

## Regression Evaluation

Most common:
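- Root Mean Square Error (RMSE): $\sqrt{RSS/n}$, the square root of the residual sum of squares (RSS) divided by the number of observations n
- Mean Absolute Error (MAE): The mean of the absolute prediction errors; penalizes large prediction errors less than RMSE does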

Old school:
- Residual Standard Error (RSE): $\sqrt{RSS/(n-p-1)}$ for a model with p features; an expression of RSS that accounts for the degrees of freedom, penalizing the number of features
- $R^2$: The fraction of variance explained by the model; for least squares fits it never decreases on the training data when features are added, so it always favors more complex models
- Adjusted $R^2$: Unlike the $R^2$ statistic, the adjusted $R^2$ statistic pays a price for the inclusion of unnecessary variables in the model
- $AIC$, $BIC$, $C_p$: Relative measures for comparing models; they penalize model complexity to account for how well the model generalizes
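
The RSS-based metrics above can be computed directly from their definitions. The sketch below assumes the usual textbook forms (RSE with n-p-1 degrees of freedom, adjusted $R^2$ as 1 - (RSS/(n-p-1)) / (TSS/(n-1))); the function name `regression_metrics`, the parameter `p` (number of features), and the toy numbers are made up for illustration.

```python
import math

def regression_metrics(y_true, y_pred, p):
    """Compute RMSE, MAE, RSE, R^2 and adjusted R^2.

    p is the number of features in the model (an assumption of this
    sketch); the formulas require n > p + 1 and non-constant y_true.
    """
    n = len(y_true)
    y_bar = sum(y_true) / n
    rss = sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred))  # residual SS
    tss = sum((y - y_bar) ** 2 for y in y_true)                # total SS
    return {
        "rmse": math.sqrt(rss / n),
        "mae": sum(abs(y - yh) for y, yh in zip(y_true, y_pred)) / n,
        "rse": math.sqrt(rss / (n - p - 1)),
        "r2": 1 - rss / tss,
        "adj_r2": 1 - (rss / (n - p - 1)) / (tss / (n - 1)),
    }

# Toy values, assumed for illustration (one feature, so p = 1).
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.5, 5.5, 7.0, 8.0, 11.0]
metrics = regression_metrics(y_true, y_pred, p=1)
print(metrics)
```

In this example the single large residual (1.0) pulls RMSE above MAE, which is the sense in which MAE penalizes big errors less.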

Others:
- Compare to a baseline model
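
One simple way to compare against a baseline, sketched under the assumption that the baseline always predicts the mean of the training targets. The data and the `model_pred` values are hypothetical stand-ins for the output of a real model.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    n = len(y_true)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_true, y_pred)) / n)

# Hypothetical data: training targets, test targets, and model predictions.
y_train = [2.0, 4.0, 6.0, 8.0]
y_test = [3.0, 5.0, 7.0]
model_pred = [3.5, 4.5, 7.5]

# Baseline: always predict the mean of the training targets.
train_mean = sum(y_train) / len(y_train)
baseline_pred = [train_mean] * len(y_test)

model_rmse = rmse(y_test, model_pred)
baseline_rmse = rmse(y_test, baseline_pred)

# Relative reduction in RMSE versus the baseline (0 means no better).
improvement = 1 - model_rmse / baseline_rmse
print(model_rmse, baseline_rmse, improvement)
```

A model that cannot beat this kind of baseline on held-out data is adding little beyond the average of the target.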

## Classification Evaluation

- Accuracy: 
    - (TP+TN)/(TP+FP+FN+TN)
    - The proportion of true results among the total number of cases examined
    - Valid for classification problems with well-balanced classes (no class imbalance)
- Positive Predictive Value (PPV) (Precision):
    - (TP)/(TP+FP)
    - The proportion of correct predictions among positive predictions
    - Valid when we want to be very sure of our positive prediction
- Negative Predictive Value (NPV):
    - TN/(TN+FN)
    - The proportion of correct predictions among negative predictions
    - Valid when a false negative prediction is costly to the problem
- True Positive Rate (TPR) (Recall, Sensitivity):
    - (TP)/(TP+FN)
    - The proportion of positives that are correctly predicted
    - Valid when we want to capture as many positives as possible
- True Negative Rate (TNR) (Specificity, Selectivity):
    - (TN)/(TN+FP)
    - The proportion of negatives that are correctly predicted
    - Valid when we want to capture as many negatives as possible
- F1 Score:
    - 2*((Precision*Recall) / (Precision+Recall))
    - A number between 0 and 1; the harmonic mean of precision and recall
    - Valid when we want a model with both good precision and good recall (the two typically trade off against each other)
- ROC Curve and AUC:
    - The ROC curve is a graph showing the performance of a classification model at all classification thresholds. It plots two parameters:
        - True Positive Rate: TP/(TP+FN)
        - False Positive Rate: FP/(FP+TN), i.e. 1 - TNR
    - AUC, the area under the ROC curve, provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.
    - AUC is useful for assessing the overall performance of the model because
        - AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute values.
        - AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.
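
The formulas in this section can be checked on a small example. The sketch below computes the confusion-matrix metrics at one threshold and computes AUC through its ranking interpretation (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as one half). The function names, the labels, the scores, and the 0.5 threshold are all assumptions for illustration, and the code assumes every denominator is non-zero.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels (1 = positive, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Metrics from the list above; assumes no denominator is zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "npv": tn / (tn + fn),
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * (precision * recall) / (precision + recall),
    }

def auc_by_ranking(y_true, scores):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1, 0.6]

# Threshold-dependent metrics at an assumed threshold of 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]
counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)

# AUC depends only on the ranking, so rescaling the scores leaves it unchanged.
auc = auc_by_ranking(y_true, scores)
auc_rescaled = auc_by_ranking(y_true, [10 * s for s in scores])
print(counts, metrics, auc, auc_rescaled)
```

The pairwise computation is quadratic in the number of examples, which is fine for a demonstration but not how AUC would normally be computed on large datasets.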