# Cross-validation using training and test sets

An important insight we already introduced in the last section is that an estimator will almost always perform better when evaluated on the same data it was trained on than when evaluated on an entirely new dataset. Since our estimators are usually not much use to us unless they can generalize to new data, we should probably care much more about how an estimator performs on new data than on data it's already seen.

The most straightforward way to obtain what's known as an *out-of-sample* performance estimate is to ensure that we always train and evaluate our estimator on independent datasets. The performance estimate obtained from the training dataset will typically be inflated by overfitting to some degree; the estimate from the test dataset will not, so long as the errors in the test dataset are independent of those in the training dataset.

In practice, an easy way to construct training and test datasets with independent errors is to randomly split a dataset in two. We can make use of scikit-learn's `train_test_split` utility, found in the [model selection](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection) module, to do the work for us.

In [3]:
# a helpful utility that splits an arbitrary number of
# array-like objects into training and testing subsets
from sklearn.model_selection import train_test_split

# Get item scores and age for a "full" sample of 1,000
items, age = get_features(data, 'items', 'AGE', n=1000)

# for every array we pass to train_test_split, we get back
# two: a training set, and a test set. the train_size
# parameter controls the proportion of all cases assigned
# to the training set (the remainder are assigned to test).
split_vars = train_test_split(items, age, train_size=0.5)

# Python supports parallel assignment: if the number of
# variables on the left side matches the number of
# elements in a list, the list elements will be mapped
# one-to-one onto the variables.
items_train, items_test, age_train, age_test = split_vars

# Verify shape...
items_train.shape

(500, 300)
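
To see what `train_test_split` is doing without relying on this tutorial's `data` and `get_features`, here is a minimal, self-contained sketch. The arrays are random stand-ins (not the real item scores), and the `random_state=0` argument is our own addition so that the split is reproducible from run to run.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1,000 cases x 300 "items", plus an "age"
# vector. The values are random; only the shapes mirror the example above.
rng = np.random.default_rng(0)
fake_items = rng.normal(size=(1000, 300))
fake_age = rng.integers(18, 80, size=1000)

# random_state fixes the shuffle, so repeated calls give the same split.
X_train, X_test, y_train, y_test = train_test_split(
    fake_items, fake_age, train_size=0.5, random_state=0
)
X_train2, _, y_train2, _ = train_test_split(
    fake_items, fake_age, train_size=0.5, random_state=0
)

print(X_train.shape, X_test.shape)        # (500, 300) (500, 300)
print(y_train.shape, y_test.shape)        # (500,) (500,)
print(np.array_equal(X_train, X_train2))  # True: same seed, same split
```

If `random_state` is omitted, as in the tutorial code above, each call produces a different random split.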

Now we can fit our estimator using the training data, and evaluate its performance using both the training and test data. The difference between the two will tell us how badly we're overfitting to the training data. This practice is called *cross-validation*, and it's ubiquitous in machine learning. In most applications, if you report performance estimates from your training dataset without also reporting a corresponding cross-validated estimate, there's a good chance someone will (not unreasonably) yell at you.

In [4]:
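# LinearRegression lives in scikit-learn's linear_model module
from sklearn.linear_model import LinearRegression

est = LinearRegression()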

est.fit(items_train, age_train)

# Estimate R^2 separately for the training and test samples
r2_train = est.score(items_train, age_train)
r2_test = est.score(items_test, age_test)

print(f"R^2 in training sample: {round(r2_train, 2)}")
print(f"R^2 in test sample: {round(r2_test, 2)}")

R^2 in training sample: 0.79
R^2 in test sample: -0.32


The difference here is pretty striking. In the training sample, the fitted model explains a majority of the variance. In the test sample, it explains... well, none. Actually, the value is negative!
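
This kind of gap isn't unique to our data. As a rough illustration, the sketch below simulates a dataset in which only a handful of 300 predictors carry any signal, then fits and scores a linear regression the same way. Every number here (coefficients, noise level, seeds) is an arbitrary assumption chosen for the demonstration, not a value taken from the tutorial's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 300

# Simulated predictors: only the first 10 columns affect the outcome.
X = rng.normal(size=(n_samples, n_features))
coefs = np.zeros(n_features)
coefs[:10] = 0.5

# Outcome = weak signal + a lot of noise (assumed values).
y = X @ coefs + rng.normal(scale=5.0, size=n_samples)

# Same 50/50 split as in the tutorial, with a fixed seed for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.5, random_state=0
)

est = LinearRegression().fit(X_train, y_train)
r2_train = est.score(X_train, y_train)
r2_test = est.score(X_test, y_test)

print(f"R^2 in training sample: {r2_train:.2f}")
print(f"R^2 in test sample: {r2_test:.2f}")
```

With 300 predictors and only 500 training cases, ordinary least squares has enough flexibility to fit a lot of noise, so the training $R^2$ should come out well above the test $R^2$.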

If you're used to computing $R^2$ by taking the square of a correlation coefficient, you might be thinking that there must be an error somewhere. Correlations range from -1 to 1, and $R^2$ is the square of $R$, so how could we have a negative $R^2$ value?

The answer is that the [standard definition of $R^2$](https://en.wikipedia.org/wiki/Coefficient_of_determination) is $1 - \mathrm{RSS}/\mathrm{TSS}$, where RSS is the residual sum-of-squares and TSS is the total sum-of-squares. This definition allows arbitrarily large negative values, because it's possible for RSS to be larger than TSS. Intuitively, we can have an estimator that's *so* bad at predicting new scores that we would have been better off just using the mean of the new data as our prediction. In fact, that's exactly what's happening in this case.
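
To make this concrete, here is a small sketch that computes $R^2$ directly from RSS and TSS. The helper `r_squared` and the four "ages" are made up for illustration; the formula is the same one scikit-learn uses for a regressor's `score` method.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - RSS / TSS, with TSS taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rss = np.sum((y_true - y_pred) ** 2)          # residual sum-of-squares
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total sum-of-squares
    return 1.0 - rss / tss

# Made-up "new" ages and three sets of predictions for them.
y_new = np.array([20.0, 30.0, 40.0, 50.0])
pred_perfect = y_new.copy()
pred_mean = np.full(y_new.shape, y_new.mean())    # always predict the mean
pred_bad = np.array([50.0, 40.0, 30.0, 20.0])     # exactly backwards

r2_perfect = r_squared(y_new, pred_perfect)
r2_mean = r_squared(y_new, pred_mean)
r2_bad = r_squared(y_new, pred_bad)

# Squared correlation between the backwards predictions and the truth.
corr_sq = np.corrcoef(y_new, pred_bad)[0, 1] ** 2

print(r2_perfect)         # 1.0
print(r2_mean)            # 0.0
print(r2_bad)             # -3.0  (RSS = 2000, TSS = 500)
print(round(corr_sq, 6))  # 1.0
```

The backwards predictions are perfectly (negatively) correlated with the true values, so the squared correlation is 1, yet $R^2$ is -3 because those predictions are much worse than simply guessing the mean.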