# ICS 434: DATA SCIENCE FUNDAMENTALS

## Cross-Validation

---

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics

# from sklearn.metrics import mean_squared_error
```

### Overfitting and RMSE

* In regression or classification, an RMSE $\approx$ 0 or an error rate $\approx$ 0 on the training data very rarely means that the model is perfect

* Such results are often due to the model overfitting the training data
  * The model "memorizes" the training data and predicts each observation exactly as it saw it
  
* Recall that models that overfit the data tend to have poor generalization power
  * Computing the RMSE on the test data reveals the overfitting: the test RMSE is much higher than the training RMSE
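
The gap between training and test RMSE can be illustrated with a small sketch. Everything here is an assumption made for illustration: the synthetic sine data, the noise level, the 20/10 split, and the two polynomial degrees are not taken from the course material.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of a sine curve
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Use the first 20 points for training and hold out the last 10 for testing
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


results = {}
for degree in (3, 12):
    # Least-squares polynomial fit on the training points only
    model = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (rmse(y_train, model(x_train)), rmse(y_test, model(x_test)))
    print(f"degree={degree:2d}  train RMSE={results[degree][0]:.3f}  "
          f"test RMSE={results[degree][1]:.3f}")
```

The higher-degree fit always achieves a training RMSE at least as low as the lower-degree one, because it has more freedom to follow the training points. Whether its test RMSE is worse depends on the data, which is why the test RMSE has to be computed rather than assumed.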

### Overfitting and RMSE -- Cont'd

<img src="https://www.dropbox.com/scl/fi/d6ob11stfawcddhdsi3ox/overfitting.png?rlkey=ypaf0zepwh6s1n166u13oaue7&dl=1" alt="drawing" width="900"/>

### Assessing a Model's Generalization Power

* In real modeling/ML applications, we are interested in how models generalize to unseen data
  * There is no merit in being able to "regurgitate" previously seen data

* Therefore, the most appropriate statistical learning method or parameters are selected based on the results observed on previously unseen data
  * You cannot use the test set itself to mitigate overfitting, since repeatedly tuning against it makes the model learn to fit the test set


### Non-linear Regression and Train-Test Split


* Non-linear regression and classification models can be complex
  * They have many parameters that can affect the RMSE

* The test error rate can be highly variable, depending on which observations end up in which set (train or test)
  * Different train-test splits will yield different results
  
* How do we compare the performance of different parameter values while accounting for the randomness of the data splits?
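
To see how much the test error can move with the split alone, the sketch below refits the same model on ten different random train-test splits of one dataset. The synthetic data, the degree-3 polynomial, the 25% test fraction, and the helper name `split_rmse` are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data: noisy samples of a sine curve
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)


def split_rmse(seed, test_fraction=0.25, degree=3):
    """Fit on one random train-test split and return the test RMSE."""
    order = np.random.default_rng(seed).permutation(x.size)
    n_test = int(test_fraction * x.size)
    test_idx, train_idx = order[:n_test], order[n_test:]
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    residuals = y[test_idx] - model(x[test_idx])
    return float(np.sqrt(np.mean(residuals ** 2)))


# Same data, same model, ten different random splits
test_rmses = [split_rmse(seed) for seed in range(10)]
print(f"min={min(test_rmses):.3f}  max={max(test_rmses):.3f}  "
      f"std={np.std(test_rmses):.3f}")
```

The spread between the smallest and largest test RMSE comes only from which observations landed in the test set, not from any change to the model.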


### Assessing a Model's Generalization Power -- Cont'd

* Do not touch the test data until you have finished tuning the model's parameters
  * You cannot use the test set to reduce overfitting, since the model would then learn to fit the test set

* For instance, to explore which polynomial degree is best, we can split the training set into:
  * a smaller training set: used to train the model
  * a validation set: used after training (and before testing) to compare parameter settings and detect overfitting
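
One way to carve a validation set out of the training set is to shuffle the training indices and set a fraction aside. This is a minimal sketch; the function name `train_validation_split` and the 20% validation fraction are assumptions, not something prescribed by the course.

```python
import numpy as np


def train_validation_split(n_train, validation_fraction=0.2, seed=0):
    """Return two disjoint index arrays over range(n_train):
    (smaller training set indices, validation set indices)."""
    order = np.random.default_rng(seed).permutation(n_train)
    n_val = int(round(validation_fraction * n_train))
    return order[n_val:], order[:n_val]


# Example: a training set of 100 observations
sub_train_idx, val_idx = train_validation_split(100)
print(len(sub_train_idx), len(val_idx))
```

The test set is not involved here at all: only the indices of the original training set are divided.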

### The Train / Validation / Test  Approach

* Train the model on the new, smaller training set

* Explore different parameters if needed (e.g., the polynomial degree)
  * Choose those that perform best on the validation set
    
* Only use the test set to compare the decision tree regressor against other models
  * Choose the model with the smallest generalization error
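
The steps above can be sketched end to end as follows. The synthetic data, the 60/20/20 split, and the candidate degrees 1 to 10 are illustrative assumptions. A polynomial model stands in for whichever model is being tuned.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data: noisy samples of a sine curve
rng = np.random.default_rng(2)
x = rng.uniform(0, 1, size=120)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

# Assumed split: 60% training, 20% validation, 20% test
order = rng.permutation(x.size)
train_idx, val_idx, test_idx = order[:72], order[72:96], order[96:]


def rmse(model, idx):
    """RMSE of `model` on the observations selected by `idx`."""
    return float(np.sqrt(np.mean((y[idx] - model(x[idx])) ** 2)))


# Train each candidate on the smaller training set, score it on the validation set
candidate_degrees = range(1, 11)
val_rmse = {}
for degree in candidate_degrees:
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
    val_rmse[degree] = rmse(model, val_idx)

best_degree = min(val_rmse, key=val_rmse.get)

# The test set is used once, only after the degree has been chosen
final_model = Polynomial.fit(x[train_idx], y[train_idx], deg=best_degree)
test_rmse = rmse(final_model, test_idx)
print(f"best degree={best_degree}  validation RMSE={val_rmse[best_degree]:.3f}  "
      f"test RMSE={test_rmse:.3f}")
```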

### Shortcomings of the Train / Validation / Test  Approach


* Statistical methods tend to perform worse when they must learn a complex model from fewer observations
  * Holding out a validation set leaves less data to train the model
  * As a result, the validation set error rate tends to overestimate the test error rate of a model trained on all the training data

* It also sets aside another chunk of data (the validation set) that cannot be used for training

* A good alternative to the train/validation/test approach is a method called $K$-fold cross-validation

### K-Fold Cross-Validation

<img src="https://www.dropbox.com/scl/fi/re1v0br0vvmgb5gxpw2p1/cross_validation.png?rlkey=axb8gr9i0z0x2sx1z3rdhi4u3&dl=1" alt="drawing" style="width:800px;"/>

### K-Fold Cross-Validation -- Cont'd

* Cross-validation is applied to the training set

* We use the following algorithm to train/validate using $K$-fold cross-validation
  * The training set is split into $K$ non-overlapping chunks (folds) of data
  * We use $K-1$ chunks for training and the remaining $1$ chunk for validation
  * We repeat the training/validation $K$ times, so that each chunk serves as the validation chunk exactly once, and average the estimates (e.g., RMSE) into a single cross-validation estimate

* Once we find the best model parameters, we use them to retrain the model on the full training set

* It's common to use $K=10$ for cross-validation
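
The algorithm above can be written out by hand in a few lines. This is a minimal sketch rather than a reference implementation: the helper name `kfold_rmse`, the synthetic training data, and the polynomial model are assumptions for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial


def kfold_rmse(x, y, degree, k=10, seed=0):
    """Average validation RMSE of a degree-`degree` polynomial over k folds."""
    order = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(order, k)  # k non-overlapping chunks of indices
    fold_rmses = []
    for i, val_idx in enumerate(folds):
        # Train on the other k-1 chunks, validate on chunk i
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=degree)
        residuals = y[val_idx] - model(x[val_idx])
        fold_rmses.append(np.sqrt(np.mean(residuals ** 2)))
    return float(np.mean(fold_rmses))


# Hypothetical training set: noisy samples of a sine curve
rng = np.random.default_rng(3)
x_train = rng.uniform(0, 1, size=100)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=x_train.size)

# One cross-validation estimate per candidate degree
cv_rmse = {degree: kfold_rmse(x_train, y_train, degree) for degree in range(1, 9)}
best_degree = min(cv_rmse, key=cv_rmse.get)

# Retrain on the full training set with the selected degree
final_model = Polynomial.fit(x_train, y_train, deg=best_degree)
print(f"best degree={best_degree}  CV RMSE={cv_rmse[best_degree]:.3f}")
```

In practice, scikit-learn's `sklearn.model_selection.KFold` and `cross_val_score` provide the same fold splitting and scoring loop, so the manual version is mainly useful for seeing what happens in each fold.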

### Testing

* Once we have tuned all the parameters for all models, we can use the test data to compare the models' generalization performance
  * For each model, we run the prediction on the test data and compare the predicted values with the observed values
  * We select the model that makes the smallest error
* We determine whether the generalization performance is sufficient for our application
  * E.g., the performance requirements for a patient-facing application are probably not the same as those for an application that predicts algae blooms
* Note that in this approach, we reward models that generalize well even if they are more complex
  * We don't penalize complex models if they achieve high generalization performance
  * Between models with similar generalization performance, we should prefer the one with fewer parameters
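
The selection rule described above can be sketched as a small function. The notes do not define what "similar" means, so the 5% tolerance is an assumption, as is the function name `select_model`. The test RMSE values and parameter counts are made-up placeholders that only exercise the logic. They are not results.

```python
def select_model(candidates, tolerance=0.05):
    """Pick a model from `candidates`, a dict of name -> (test_rmse, n_parameters).

    Models whose test RMSE is within `tolerance` (relative) of the best one are
    treated as similar; among those, the one with the fewest parameters wins.
    """
    best_rmse = min(rmse for rmse, _ in candidates.values())
    similar = {
        name: (rmse, n_params)
        for name, (rmse, n_params) in candidates.items()
        if rmse <= best_rmse * (1 + tolerance)
    }
    # Fewest parameters first; ties broken by the lower test RMSE
    return min(similar, key=lambda name: (similar[name][1], similar[name][0]))


# Placeholder values for illustration only
candidates = {
    "linear": (0.52, 2),
    "cubic": (0.31, 4),
    "degree_9": (0.30, 10),
}
chosen = select_model(candidates)
print(chosen)
```

With a tolerance of zero the rule reduces to picking the smallest test error regardless of complexity.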