# Hyper-parameter Tuning

# Definition

Hyper-parameters control how an estimator is fit. They are set before fitting and are not learned from the data.

```
estimator.get_params()
```

If you think of your estimator as a black box, hyper-parameters are the knobs on the outside of the box.
The goal of *hyper-parameter tuning* is to set the knobs to get optimal performance.
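
To make the knob metaphor concrete, here is a minimal sketch of reading and setting hyper-parameters through the standard `get_params()` / `set_params()` estimator interface. The choice of `SVC` and the values of `C` and `gamma` are arbitrary, picked only for illustration.

```python
from sklearn.svm import SVC

# Hyper-parameters are passed as constructor arguments.
estimator = SVC(C=1.0, gamma=0.1)

# get_params() returns the current knob settings as a dict.
params = estimator.get_params()
print(params["C"], params["gamma"])

# set_params() turns a knob on the existing estimator.
estimator.set_params(C=10.0)
print(estimator.get_params()["C"])
```

Nothing here touches any data: the knobs exist before `fit` is ever called.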

<img src="img/hp-tuning-two-knobs.jpg">
<div style="text-align: right">Source: Wikipedia</div>

# Why hyper-parameters?

Hyper-parameters control the fitting behavior, so they "guide" the model search. You can think of this guidance as injecting *bias* into the model.

<img src="img/eslii-mdl-search.png" style="width:400px;">
<div style="text-align: right">Source: T. Hastie et al. (2017) "Elements of Statistical Learning (Ed. 2)"</div>

# Examples

`sklearn.svm.SVC`
  * C ... inverse regularization strength, higher C means more variance can be captured.
  * gamma ... inverse width of the RBF kernel, higher gamma means a narrower kernel and less smoothness bias.
  
`sklearn.ensemble.RandomForestClassifier`
  * max_depth ... the deeper the trees, the more variance we can capture.
  * max_features ... the fewer features considered per split, the more de-correlated the trees and the more variance reduction (but the more trees needed).
  
`sklearn.linear_model.Ridge`
  * alpha ... penalty on the L2 norm of the model coefficients, higher alpha means more bias.
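
As a rough illustration of the `Ridge` entry, the sketch below fits the same synthetic data with three values of `alpha` and prints the L2 norm of the coefficients. The dataset and the `alpha` values are assumptions chosen for the demo, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data, used purely for illustration.
X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

coef_norms = []
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    coef_norms.append(np.linalg.norm(model.coef_))
    print(f"alpha={alpha:>6}: ||coef||_2 = {coef_norms[-1]:.2f}")
```

The norm shrinks as `alpha` grows: the penalty pulls the coefficients toward zero, which is the added bias.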

# Hyper-parameter Tuning

*Grid Search* 

    De facto standard method for tuning hyper-parameters over the past decades.
    Evaluate every combination on a predefined grid of values.
    
    
*Random Search*

    Explore the hyper-parameter space randomly by drawing samples.
    Good for high-dimensional spaces (e.g. deep neural networks).
    
    
<img src="img/bergstra12-grid-vs-rand.png" style="width:600px;">
<div style="text-align: right">Source: J. Bergstra and Y. Bengio (2012) "Random Search for Hyper-Parameter Optimization"</div>
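
The difference between the two strategies can be sketched with scikit-learn's `ParameterGrid` and `ParameterSampler` helpers. The ranges for `C` and `gamma` and the log-uniform distributions (`scipy.stats.loguniform`) are assumptions made for this example.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Grid search: every combination of a fixed set of values per knob.
grid = list(ParameterGrid({"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}))

# Random search: a fixed budget of candidates drawn from distributions.
distributions = {"C": loguniform(1e-1, 1e1), "gamma": loguniform(1e-2, 1e0)}
samples = list(ParameterSampler(distributions, n_iter=9, random_state=0))

# Same budget of 9 candidates; count the distinct values tried for C.
distinct_grid_C = len({p["C"] for p in grid})
distinct_random_C = len({p["C"] for p in samples})
print(len(grid), len(samples), distinct_grid_C, distinct_random_C)
```

With the same budget, the grid probes only three distinct values per knob, while the random sample probes a different value for each candidate. This matters when only a few of the knobs are important.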

# When is Grid Search not a good fit?
<img src="img/hp-tuning-many-knobs.jpg" >
<div style="text-align: right">Source: Wikipedia</div>
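
One way to read the picture: the number of grid candidates multiplies with every knob added. A small sketch, assuming a hypothetical search space with five candidate values per knob (the `knob_i` names are made up):

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical search space: 5 candidate values for each knob.
grid_sizes = {}
for n_knobs in [1, 2, 4, 8]:
    space = {f"knob_{i}": list(range(5)) for i in range(n_knobs)}
    grid_sizes[n_knobs] = len(ParameterGrid(space))
    print(n_knobs, "knobs ->", grid_sizes[n_knobs], "candidates")
```

Each candidate costs one full cross-validated fit, so the grid quickly becomes unaffordable.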

# Hyper-parameter Tuning in Scikit-learn

A search consists of:

  * an estimator (regressor or classifier such as `sklearn.svm.SVC()`);
  * a parameter space (e.g. `{'gamma': [0.01, 0.1, 1.0]}`);
  * a method for searching or sampling candidates;
  * a cross-validation scheme; and
  * a score function (e.g. `sklearn.metrics.accuracy_score`).
  
## Classes

  * `GridSearchCV`
  * `RandomizedSearchCV`
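
A minimal sketch of both classes, wiring together the five ingredients listed above. The iris dataset, the parameter ranges, the 5-fold CV and the log-uniform distributions are assumptions chosen to keep the example small.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Estimator + parameter space + exhaustive search + 5-fold CV + accuracy.
grid_search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]},
    cv=5,
    scoring="accuracy",
)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)

# Same estimator, CV and score, but candidates are sampled from distributions.
random_search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-1, 1e1), "gamma": loguniform(1e-2, 1e0)},
    n_iter=9,
    cv=5,
    scoring="accuracy",
    random_state=0,
)
random_search.fit(X, y)
print(random_search.best_params_, random_search.best_score_)
```

After `fit`, both objects behave like the best estimator found (refit on all the data by default) and expose every candidate's scores in `cv_results_`.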


# Tuning Shortcuts

### Fit-once-evaluate-many

Some models allow us to evaluate many hyper-parameter settings in a single fit. 
Examples: `n_estimators` in `RandomForest` and `GradientBoosting`; the "regularization path" in linear models.
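
For gradient boosting, one possible sketch uses `staged_predict`, which yields predictions after each boosting stage. The synthetic dataset, the single train/validation split and the cap of 100 stages are assumptions for the demo.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Fit once with the largest n_estimators we are willing to consider.
model = GradientBoostingClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# staged_predict yields predictions after 1, 2, ..., 100 boosting stages,
# so a single fit scores every smaller value of n_estimators.
val_scores = [accuracy_score(y_val, y_pred) for y_pred in model.staged_predict(X_val)]
best_n_estimators = max(range(len(val_scores)), key=val_scores.__getitem__) + 1
print(best_n_estimators, val_scores[best_n_estimators - 1])
```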
    
### Warm-starts

Some models converge faster when warm started from a previous solution (with different HP settings). See [warm_start](https://scikit-learn.org/stable/glossary.html#term-warm-start) in sklearn.
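
A minimal sketch of warm-starting along a sequence of `alpha` values, assuming `Lasso` as the model. The dataset and the `alpha` values are arbitrary, and the iteration counts are printed rather than promised.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=50, noise=5.0, random_state=0)

# With warm_start=True, each fit starts from the previous coefficients
# instead of starting from scratch.
model = Lasso(warm_start=True, max_iter=10_000)
n_iters = {}
for alpha in [10.0, 1.0, 0.1]:  # from strong to weak regularization
    model.set_params(alpha=alpha)
    model.fit(X, y)
    n_iters[alpha] = int(model.n_iter_)
print(n_iters)
```

To check the effect on your own data, rerun the loop with `warm_start=False` and compare `n_iter_`.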
    
### Heuristics

For some hyper-parameters, good values or ranges can be computed via heuristics.
Example: `gamma='auto'` or `gamma='scale'` for the RBF kernel in `SVC`.
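
As a sketch of what such a heuristic computes: recent scikit-learn versions document `gamma='auto'` as `1 / n_features` and `gamma='scale'` as `1 / (n_features * X.var())`. The snippet assumes those documented formulas and checks one of them against the named option on the iris data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

# Formulas as stated in the SVC documentation (assumed, recent versions).
gamma_auto = 1.0 / n_features
gamma_scale = 1.0 / (n_features * X.var())

# Passing the number explicitly should match the named heuristic.
by_name = SVC(gamma="scale").fit(X, y).decision_function(X)
by_value = SVC(gamma=gamma_scale).fit(X, y).decision_function(X)
print(gamma_auto, gamma_scale, np.allclose(by_name, by_value))
```

Either value also makes a reasonable center for a search range when `gamma` is tuned further.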
    
### Sub-sampling

For some hyper-parameters, we can probe for good values on a subset of the data. Be cautious though: the best value on a subset does not always transfer to the full data set.
Example: `learning_rate` in SGD.
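
A rough sketch of probing on a subset, assuming `SGDClassifier` with a constant schedule so that `eta0` plays the role of the learning rate. The 10% fraction, the candidate values and the synthetic data are arbitrary choices for the demo.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Probe on a random 10% subset (the fraction is an arbitrary choice).
rng = np.random.default_rng(0)
subset = rng.choice(len(X), size=len(X) // 10, replace=False)
X_sub, y_sub = X[subset], y[subset]

subset_scores = {}
for eta0 in [1e-4, 1e-3, 1e-2, 1e-1]:
    model = SGDClassifier(learning_rate="constant", eta0=eta0, random_state=0)
    subset_scores[eta0] = cross_val_score(model, X_sub, y_sub, cv=3).mean()
print(subset_scores)
```

Treat the result as a way to narrow the range, then confirm the shortlisted values on the full data.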