## Not looking at your data

> There is a growing realization that statistically significant claims in scientific publications are
routinely mistaken. A dataset can be analyzed in so many different ways (with the choices being
not just what statistical test to perform but also decisions on what data to include or exclude, what
measures to study, what interactions to consider, etc.), that very little information is provided by
the statement that a study came up with a p < .05 result. The short version is that it’s easy to find a
p < .05 comparison even if nothing is going on, if you look hard enough—and good scientists
are skilled at looking hard enough and subsequently coming up with good stories (plausible even to
themselves, as well as to their colleagues and peer reviewers) to back up any statistically-significant
comparisons they happen to come up with.

[A. Gelman and E. Loken, *The garden of forking paths: Why multiple comparisons can be a problem,
even when there is no “fishing expedition” or “p-hacking” and the research
hypothesis was posited ahead of time*](http://www.stat.columbia.edu/~gelman/research/unpublished/p_hacking.pdf)
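
The simplest version of this effect can be simulated directly. The sketch below is only an illustration of explicit multiple comparisons, not of the subtler forking-paths argument in the paper: it assumes a hypothetical study that measures several unrelated outcomes on two groups drawn from the same distribution, and uses a z-test with known variance to keep the code short. All function names and parameters are made up for this example.

```python
import random
from statistics import NormalDist, mean

def p_value_two_groups(group_a, group_b, sigma=1.0):
    """Two-sided z-test p-value for a difference in means, with known sigma."""
    n_a, n_b = len(group_a), len(group_b)
    z = (mean(group_a) - mean(group_b)) / (sigma * (1 / n_a + 1 / n_b) ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def fraction_with_a_finding(n_studies=1000, n_measures=10, n_per_group=20, seed=0):
    """Fraction of pure-noise studies where at least one measure gives p < .05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = []
        for _ in range(n_measures):
            # Both groups come from the same distribution: nothing is going on.
            group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
            group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
            p_values.append(p_value_two_groups(group_a, group_b))
        if min(p_values) < 0.05:
            hits += 1
    return hits / n_studies

print(fraction_with_a_finding(n_measures=1))   # close to 0.05
print(fraction_with_a_finding(n_measures=10))  # close to 1 - 0.95**10, about 0.40
```

With a single pre-specified measure, roughly one noise-only study in twenty produces a "significant" result; with ten measures to choose from, the rate is several times higher.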


Having a model that fits the data is not enough if it has no predictive power. Consequently, it is never enough to test your model on the data used to build it.
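
A deliberately extreme sketch of why this matters: the "model" below simply memorizes its data and answers with the value of the nearest point it has seen. The data-generating relationship (y = 2x plus noise) and all names are assumptions chosen for illustration.

```python
import random

def make_data(n, rng):
    """Noisy samples of the hypothetical relationship y = 2x + noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + rng.gauss(0, 1)))
    return data

def memorizing_model(training_data):
    """Return a predictor that answers with the y of the nearest training x."""
    def predict(x):
        nearest = min(training_data, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def mean_squared_error(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
build_data = make_data(50, rng)  # data used to build the model
new_data = make_data(50, rng)    # fresh data from the same process

model = memorizing_model(build_data)
print("error on the data used to build the model:", mean_squared_error(model, build_data))
print("error on new data:", mean_squared_error(model, new_data))
```

The error on the data used to build the model is zero by construction, so it says nothing about how the model behaves on data it has not seen.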

## The simple way to test your model

Gather new data and test your model on it. Note that you're still not completely safe from questionable research practices (and the bad statistics that result from them). If possible, fix every step of the analysis before seeing the data!

But that's not always practical, so what can you do instead?

## Splitting data

Data splitting is often used when new data is difficult to collect, and is especially common in machine learning.

**The idea:**
Randomly split the dataset into multiple non-overlapping subsets:

- Development data
    * Used to select and create the model
    
- Test data
    * Used to test the model on unseen data after the parameters have been selected. 
    * **Once you touch this, you are done!**
    
Usually the development dataset is further split into *training* and *validation* datasets in order to prevent *overfitting*. The model is fit on the training set, and its error on the validation set is used to see whether the model generalizes to data it was not fit on.
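
A minimal sketch of such a split, using only the Python standard library. The function name `split_dataset`, the 20% fractions and the fixed seed are assumptions for illustration, not recommendations; libraries such as scikit-learn offer ready-made helpers (for example `train_test_split`) for the same job.

```python
import random

def split_dataset(samples, test_fraction=0.2, validation_fraction=0.2, seed=0):
    """Randomly split samples into training, validation and test lists.

    test_fraction is taken from the whole dataset; validation_fraction is
    taken from the remaining development data.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    # Set the test data aside first; it is not touched during development.
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    development = shuffled[n_test:]

    # Split the development data into validation and training parts.
    n_validation = int(len(development) * validation_fraction)
    validation = development[:n_validation]
    training = development[n_validation:]
    return training, validation, test

samples = list(range(100))  # stand-in for real observations
training, validation, test = split_dataset(samples)
print(len(training), len(validation), len(test))  # 64 16 20
```

Fixing the seed makes the split reproducible, so the same samples stay in the test set every time the analysis is rerun.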

**Note:**
>Some texts/software will refer to the development data as training data, and may not be very clear on the difference between validation and test data. Be vigilant and know the reason every time you split your data.

## Cross-validation

To further reduce the danger of overfitting, the splitting/fitting procedure can be repeated multiple times on the development dataset, with the errors on the different validation sets used to determine model quality. Different types of cross-validation methods can be used depending on the constraints of the application:

*Exhaustive:*
- Leave-p-out (hold out each possible set of p samples in turn as the validation set, and train on the rest)
    * Special case: Leave-one-out
    * Time- and computation-wise usually very expensive
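
The cost of the exhaustive methods can be made concrete with a short sketch. The generator below is a hypothetical helper written for this example; it enumerates every leave-p-out split by index, and `math.comb` counts how many model fits that implies.

```python
from itertools import combinations
from math import comb

def leave_p_out_splits(n_samples, p):
    """Yield (training_indices, validation_indices) for every subset of size p."""
    all_indices = range(n_samples)
    for held_out in combinations(all_indices, p):
        held_out_set = set(held_out)
        training = [i for i in all_indices if i not in held_out_set]
        yield training, list(held_out)

splits = list(leave_p_out_splits(n_samples=5, p=2))
print(len(splits))        # 10, i.e. comb(5, 2)
print(splits[0])          # ([2, 3, 4], [0, 1])

# The number of model fits grows quickly with the dataset size:
print(comb(100, 1))       # 100 fits for leave-one-out
print(comb(100, 2))       # 4950 fits for leave-2-out
print(comb(100, 5))       # 75287520 fits for leave-5-out
```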
    
*Non-exhaustive:*
- k-fold (randomly split into k subsets, using each as the validation set in turn, with the k-1 others as the training set)
- Repeated random sampling
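
A minimal k-fold sketch in plain Python, assuming a toy "model" that always predicts the mean of its training values. The helper names, the choice of k = 5 and the synthetic data are illustrative assumptions; in practice the fold logic usually comes from a library (scikit-learn's `KFold`, for instance).

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (training_indices, validation_indices) for each of k random folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized subsets
    for i, validation in enumerate(folds):
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

def cross_validated_error(ys, k=5):
    """Mean validation error of a toy model that always predicts the training mean."""
    fold_errors = []
    for training, validation in k_fold_splits(len(ys), k):
        prediction = sum(ys[i] for i in training) / len(training)
        errors = [(ys[i] - prediction) ** 2 for i in validation]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / len(fold_errors)

rng = random.Random(1)
ys = [rng.gauss(10, 2) for _ in range(100)]  # synthetic development data
print(cross_validated_error(ys, k=5))
```

Every sample is used for validation exactly once and for training k-1 times, which is why k-fold is much cheaper than the exhaustive methods while still using all of the development data.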

**Q:**
> Random splitting isn't always appropriate. What cases can you think of?