# Experimental Design in Machine Learning



* How can we estimate the expected performance of a machine learning algorithm for a particular application? 

* How do we select between machine learning algorithms (or parameter settings)? 





* Commonly, we split a data set into three groups: training, validation, and test. (The split is only for evaluation purposes; the final system is often trained on ALL of the data.)
* The training and validation sets may be rotated (e.g., using cross-validation)
* As previously discussed, we cannot rely on training (or validation) performance to answer the questions above, because of the potential for overfitting
* Our performance estimate is only as good as the test set is representative of the data the system will actually see in application
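
A minimal sketch of the three-way split described above, using only the Python standard library. The function name `train_val_test_split`, the 60/20/20 proportions, and the fixed seed are illustrative assumptions, not recommendations from these notes.

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once, then cut into disjoint training, validation and test lists."""
    if not 0 <= val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = int(round(test_frac * len(shuffled)))
    n_val = int(round(val_frac * len(shuffled)))
    test = shuffled[:n_test]                  # held out until the very end
    val = shuffled[n_test:n_test + n_val]     # used for model/parameter selection
    train = shuffled[n_test + n_val:]         # used to fit the model
    return train, val, test

# Hypothetical data set: the integers 0..99 stand in for 100 labeled samples.
train, val, test = train_val_test_split(range(100))
```

A random shuffle only gives a representative test set if the samples at hand resemble the data seen in application; if they do not, no splitting scheme fixes that.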

* We also cannot rely on one training run of the algorithm:
  - variations in training/validation sets
  - random factors during training (e.g., random initialization, convergence to different local optima)
        
* The No Free Lunch Theorem: There is no universally best algorithm. For any machine learning algorithm there are sets of data where it works well and sets where it works poorly

* Performance of an algorithm can be determined using any of the measures we discussed previously (e.g., error rate, accuracy, ROC curves, Precision-Recall curves, etc.) but also in terms of:
    - Risk
    - Running time
    - Training time
    - Run time storage/memory
    - Train time storage/memory
    - Computational complexity
    - Interpretability
    - etc...
    


        

* The output of a trained learning system depends on:
    - Controllable parameters: hyperparameters/settings of the algorithm/algorithm design choices
    - Uncontrollable parameters: noise in data, any randomness in training 
        
* To fully test a system, you want to try to evaluate each of these parameters separately.  However, this is often not easily done.
* Various strategies:
    - Best guess
    - Varying one factor at a time
    - Full/Partial Factorial design
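
To make the difference between these strategies concrete, the sketch below enumerates the runs for a full factorial design and for varying one factor at a time. The factor names, their levels, and the baseline setting are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical controllable factors and the levels to try for each.
factors = {
    "learning_rate": [0.01, 0.1],
    "hidden_units": [16, 64, 256],
    "weight_decay": [0.0, 1e-4],
}

def full_factorial(factors):
    """Every combination of levels: the run count is the product of the level counts."""
    names = list(factors)
    return [dict(zip(names, levels))
            for levels in product(*(factors[n] for n in names))]

def one_factor_at_a_time(factors, baseline):
    """Start from a baseline and change a single factor per run."""
    runs = [dict(baseline)]
    for name, levels in factors.items():
        for level in levels:
            if level != baseline[name]:
                runs.append({**baseline, name: level})
    return runs

baseline = {"learning_rate": 0.01, "hidden_units": 64, "weight_decay": 0.0}
full_runs = full_factorial(factors)                  # 2 * 3 * 2 = 12 runs
ofat_runs = one_factor_at_a_time(factors, baseline)  # 1 + 1 + 2 + 1 = 5 runs
```

Varying one factor at a time needs far fewer runs, but it cannot reveal interactions between factors. A partial factorial design runs a chosen subset of `full_runs` as a compromise.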

* If there is randomization in your experiment, you need to run it multiple times (replicate experiments) to estimate the expected result (e.g., mean and variance of performance)
* If there are real-world effects on your runs (e.g., machine warming up, time of day, etc.), you need to randomize the order of your trials across factors
* We need to ensure we are comparing the parameters and algorithms we are interested in and not any confounding factors (e.g., if you want to compare settings of one parameter, ensure everything else is fixed)
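
The replication and randomization points above can be sketched as follows. Here `run_trial` is a hypothetical stand-in for one train-and-evaluate run (it just returns a noisy score), and the two settings "A" and "B" with their base scores are made up for illustration.

```python
import random
import statistics

def run_trial(setting, seed):
    """Hypothetical stand-in for one train-and-evaluate run; returns a noisy score."""
    rng = random.Random(seed)  # all randomness in the run is driven by the seed
    return setting["base_score"] + rng.gauss(0.0, 0.02)

# Two hypothetical settings to compare; everything else is held fixed.
settings = {"A": {"base_score": 0.80}, "B": {"base_score": 0.85}}
n_replicates = 10

# One (setting, seed) pair per run. Shuffling the order spreads nuisance effects
# such as machine warm-up or time of day across both settings.
schedule = [(name, seed) for name in settings for seed in range(n_replicates)]
random.Random(0).shuffle(schedule)

scores = {name: [] for name in settings}
for name, seed in schedule:
    scores[name].append(run_trial(settings[name], seed))

# Report mean and sample standard deviation over replicates, not a single run.
summary = {name: (statistics.mean(v), statistics.stdev(v))
           for name, v in scores.items()}
```

Both settings reuse the same seeds here, so the random factors are matched across the comparison; that is one possible design choice, not the only valid one.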

* When conducting experiments:
    - Understand the goal of your study
    - Determine your evaluation measure(s)
    - Determine what factors to vary and how to vary them
    - Design your experiment (and estimate how long it will take using a couple of trial runs on a subset)
    - Perform the experiment
    - Analyze results

# Cross Validation and Training, Validation, Testing Data Sets

* Since it is often easy to overfit, our error or prediction performance on a training data set is not a good indication of performance on unknown test data.
* One way to estimate the performance of a system on unknown test data is to use some of the training data for training and some for validation (to act like unknown test data).
* If you are repeatedly changing your model/adjusting parameters/tweaking your algorithm, you may even overfit the hold-out validation set. So, you can hold out yet another set for testing.
* However, in general, we only have a limited amount of training data. So, we want to use as much of it as possible for training.  One strategy to balance the tradeoff between needing training data and validation data is to use cross-validation.
* Cross-validation can also give an indication of stability/robustness of your method. 
* However, there are downsides to cross-validation: we need to train many times (which can be very computationally expensive), and we end up with several models, so how do we pick the final one to use?
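
A minimal sketch of k-fold cross-validation, assuming the caller supplies `fit` and `score` functions. The helper names, the toy "predict the mean" model, and the mean-squared-error score are illustrative assumptions, not part of any particular library.

```python
import random
import statistics

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is in validation exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, val_idx

def cross_validate(xs, ys, fit, score, k=5):
    """Train k times, each time validating on the held-out fold."""
    fold_scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_scores.append(
            score(model, [xs[i] for i in val_idx], [ys[i] for i in val_idx]))
    # The spread across folds gives a rough indication of stability.
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)

# Toy example: the "model" is just the mean of the training targets.
def fit_mean(xs, ys):
    return statistics.mean(ys)

def mse(model, xs, ys):
    return statistics.mean((y - model) ** 2 for y in ys)

xs = list(range(20))
ys = [2.0 * x for x in xs]
mean_score, std_score = cross_validate(xs, ys, fit_mean, mse, k=5)
```

The k models are discarded in this sketch. One common way to get a final model is to use cross-validation only to choose the algorithm and settings, then retrain once on all of the data.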
 
* For further reading and reference: Simon Haykin, *Neural Networks: A Comprehensive Foundation*.