# Model Selection and Assessment

# Definition

*Model Selection*

    Estimating the performance of different models in order to choose the best one.
    
*Model Assessment*

    Estimating the generalization error of a given model.

# Performance Metrics

A performance metric summarizes the performance of a model in a real-valued score (higher is better) or error (lower is better).

Most algorithms try to minimize a loss function; the loss and the evaluation error don't have to be the same.
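
To illustrate the difference, here is a minimal sketch comparing log-loss (a typical training loss) with the error rate (a typical evaluation metric). The two prediction vectors are made-up examples, not output from a real model.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood for binary labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def error_rate(y_true, p_pred, threshold=0.5):
    """Fraction of misclassified examples after thresholding (lower is better)."""
    wrong = sum((p >= threshold) != bool(y) for y, p in zip(y_true, p_pred))
    return wrong / len(y_true)

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # hypothetical model A
hesitant = [0.6, 0.4, 0.55, 0.45]   # hypothetical model B

# Both models classify every example correctly ...
print(error_rate(y_true, confident), error_rate(y_true, hesitant))  # 0.0 0.0
# ... but their log-loss differs (roughly 0.16 vs. 0.55).
print(log_loss(y_true, confident), log_loss(y_true, hesitant))
```

The error rate cannot tell the two models apart, while log-loss rewards the better-calibrated probabilities.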

# How to choose a Performance Metric?

*Business goal*
  
      The performance metric should be a good proxy for the business goal. 
      Do you need to make decisions, rank items, or provide accurate probabilities?
      E.g.: ad placement needs well-calibrated click probabilities -> log-loss.
      
*Response distribution*
  
      Do you expect outliers in the responses? How do you want them to be handled?
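
As a rough illustration of how the metric determines the treatment of outliers, the sketch below compares mean absolute error (MAE) and root mean squared error (RMSE) on made-up numbers. The choice of these two metrics is an assumption for the example; the same reasoning applies to other pairs.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: large errors are penalized disproportionately."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

y_pred = [10.0, 12.0, 11.0, 13.0]
y_clean = [11.0, 11.0, 12.0, 12.0]
y_outlier = [11.0, 11.0, 12.0, 52.0]  # last response is an outlier

print(mae(y_clean, y_pred), rmse(y_clean, y_pred))      # 1.0 1.0
print(mae(y_outlier, y_pred), rmse(y_outlier, y_pred))  # 10.5 and about 19.5
```

A single outlier moves RMSE much further than MAE, so RMSE is the stricter choice when large misses are costly and MAE the more robust one when they are noise.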

# Out-of-sample error

*Out-of-sample data*

    Data that you've not used to fit your model parameters.
    
    
*Train-Validation-Holdout*

    It's common practice to partition your data into three disjoint sets:
    
  * Train: Used to fit the model parameters.
  * Validation: Used for model selection.
  * Holdout: Used to estimate the generalization error.
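
A minimal sketch of such a three-way partition over row indices. The function name `three_way_split` and the 60/20/20 default proportions are assumptions chosen for illustration.

```python
import random

def three_way_split(n_samples, val_frac=0.2, holdout_frac=0.2, seed=0):
    """Return disjoint (train, validation, holdout) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle so row order cannot leak into the split
    n_holdout = int(n_samples * holdout_frac)
    n_val = int(n_samples * val_frac)
    holdout = indices[:n_holdout]
    val = indices[n_holdout:n_holdout + n_val]
    train = indices[n_holdout + n_val:]
    return train, val, holdout

train, val, holdout = three_way_split(100)
print(len(train), len(val), len(holdout))  # 60 20 20
```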
    

# Partitioning

*Random (stratified) partitioning*

    Split sets randomly. Pitfall: make sure you shuffle!
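
A stratified split shuffles within each class so that both sets keep roughly the original class proportions. The helper below is a hypothetical sketch, not a library function; scikit-learn offers `train_test_split(..., stratify=y)` for the same purpose.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.25, seed=0):
    """Split indices so each class contributes about test_frac of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train, test = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # shuffle within the class before cutting
        n_test = round(len(members) * test_frac)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

labels = [0] * 8 + [1] * 4          # skewed toy labels
train_idx, test_idx = stratified_split(labels)
print(sorted(labels[i] for i in test_idx))  # [0, 0, 1]
```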

*Grouped partitioning*

    Split sets by mutually exclusive group membership (e.g. user IDs for recommendations)
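
A sketch of grouped partitioning, assuming each row carries a group key such as a user ID. `grouped_split` is a hypothetical helper; scikit-learn's `GroupKFold` and `GroupShuffleSplit` cover the same idea.

```python
import random

def grouped_split(groups, test_frac=0.25, seed=0):
    """Assign whole groups to either train or test, never both."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    n_test = max(1, int(len(unique) * test_frac))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

user_ids = ["u1", "u1", "u2", "u3", "u3", "u3", "u4", "u4"]
train_idx, test_idx = grouped_split(user_ids)
# No user appears on both sides of the split.
print({user_ids[i] for i in train_idx} & {user_ids[i] for i in test_idx})  # set()
```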

*Out-of-Time partitioning*

    Split sets by chronological order. Note: this tests extrapolation, whereas random splits test interpolation.
    Mind the gap: when forecasting multiple steps ahead, leave a gap between the sets.
    When estimating the generalization error, refit the model on train + validation.
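
A minimal sketch of an out-of-time split with a gap, assuming the rows are already sorted chronologically. The function name and its parameters are illustrative.

```python
def out_of_time_split(n_samples, n_test, gap=0):
    """Train on the earliest rows, test on the latest, skipping `gap` rows in between."""
    test_start = n_samples - n_test
    train_end = test_start - gap
    if train_end <= 0:
        raise ValueError("not enough samples for the requested test size and gap")
    train = list(range(train_end))
    test = list(range(test_start, n_samples))
    return train, test

train, test = out_of_time_split(n_samples=10, n_test=3, gap=2)
print(train)  # [0, 1, 2, 3, 4]
print(test)   # [7, 8, 9]
```

Rows 5 and 6 are dropped, which mimics a forecast that has to be issued a few steps before the target period starts.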

# Sample Efficiency

If your dataset is small or your class labels are highly skewed:

*k-fold cross-validation*

    Split your data into k disjoint sets; train on k-1 sets and validate on the remaining one; iterate over all k sets and average the error.
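
The procedure can be sketched in a few lines. The "model" here is deliberately trivial (it predicts the mean of the training targets) so that the focus stays on the fold logic; both function names are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, near-equal folds
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cv_mean_abs_error(y, k=5, seed=0):
    """Average validation MAE of a mean predictor over k folds."""
    fold_errors = []
    for train, val in k_fold_indices(len(y), k, seed):
        prediction = sum(y[i] for i in train) / len(train)  # "fit" on k-1 folds
        fold_errors.append(sum(abs(y[i] - prediction) for i in val) / len(val))
    return sum(fold_errors) / k

print(cv_mean_abs_error([3.0, 5.0, 4.0, 6.0, 5.0, 4.0, 3.0, 5.0, 6.0, 4.0]))
```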
    
*Time-series cross-validation*

<img src="img/hyndman-tscv.png" width="600">
<div style="text-align: right; padding-right: 220px">Source: R. Hyndman https://robjhyndman.com/hyndsight/tscv/</div>
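
A sketch of the rolling-origin scheme shown in the figure: the training window grows by one observation per split and the test point lies `horizon` steps after the end of the training data. The name `rolling_origin_splits` and the `min_train` default are assumptions; scikit-learn's `TimeSeriesSplit` implements a related scheme.

```python
def rolling_origin_splits(n_samples, min_train=3, horizon=1):
    """Yield (train, test) index lists with an expanding training window."""
    for train_end in range(min_train, n_samples - horizon + 1):
        train = list(range(train_end))
        test = [train_end + horizon - 1]  # the observation `horizon` steps ahead
        yield train, test

for train, test in rolling_origin_splits(6):
    print(train, test)
# [0, 1, 2] [3]
# [0, 1, 2, 3] [4]
# [0, 1, 2, 3, 4] [5]
```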

*Nested cross-validation*

    Multiple layers of cross-validation for model tuning, selection and assessment.
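
A compact sketch of two-layer nested cross-validation. The inner loop selects a hyperparameter (here the number of neighbors of a toy 1-D k-nearest-neighbor regressor, chosen purely for illustration) and the outer loop assesses the selected configuration on data the inner loop never saw. All function names and the synthetic data are assumptions.

```python
import random

def k_fold(indices, k):
    """Split a list of indices into k (train, validation) pairs."""
    folds = [indices[i::k] for i in range(k)]
    return [([idx for j, f in enumerate(folds) if j != i for idx in f], folds[i])
            for i in range(k)]

def knn_predict(x_train, y_train, x, n_neighbors):
    """Average the targets of the n_neighbors closest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x))[:n_neighbors]
    return sum(y_train[i] for i in nearest) / len(nearest)

def mse(x, y, train, test, n_neighbors):
    """Mean squared error on `test` of a k-NN model built from `train`."""
    x_tr = [x[i] for i in train]
    y_tr = [y[i] for i in train]
    return sum((y[i] - knn_predict(x_tr, y_tr, x[i], n_neighbors)) ** 2
               for i in test) / len(test)

def nested_cv(x, y, candidates=(1, 3, 5), outer_k=5, inner_k=3, seed=0):
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    outer_scores = []
    for outer_train, outer_test in k_fold(indices, outer_k):
        # Inner loop: model selection using only the outer training data.
        def inner_error(n_neighbors):
            splits = k_fold(outer_train, inner_k)
            return sum(mse(x, y, tr, va, n_neighbors) for tr, va in splits) / inner_k
        best = min(candidates, key=inner_error)
        # Outer loop: assess the selected configuration on untouched data.
        outer_scores.append(mse(x, y, outer_train, outer_test, best))
    return sum(outer_scores) / outer_k

rng = random.Random(1)
x = [i / 10 for i in range(30)]
y = [2 * xi + rng.gauss(0, 0.1) for xi in x]  # synthetic noisy line
print(nested_cv(x, y))  # estimate of the generalization MSE
```

Because selection happens inside each outer fold, the outer score is not biased by the hyperparameter search.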
    
Links:

  * [scikit-learn user guide](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation-evaluating-estimator-performance)
  * [R. Hyndman TSCV](https://robjhyndman.com/hyndsight/tscv/)
  * [Towards Data Science Article](https://towardsdatascience.com/time-series-nested-cross-validation-76adba623eb9)

# No-free-lunch Theorem

Why do we need to do model selection? Aren't Deep Neural Networks the only learning algorithm we need?

NFL: *whenever a learning algorithm performs well on some function, as measured by out-of-training-set generalization, it must perform poorly on some other(s).*

<img src="img/sklearn-model-comp.png">
<div style="text-align: right">Source: Scikit-learn User Guide</div>

# NFL revisited

There is no silver bullet...
<img src="img/rolson17-ranking.png" width=500>
<div style="text-align: right">Source: R. Olson et al. (2017) "Data-driven Advice for Applying Machine Learning to Bioinformatics Problems."</div>
... but Gradient Boosted Trees are pretty good.