# Cross Validation

When we learn a model from training examples, we cannot use those same examples to estimate how well the model generalizes to new data, because the model has already been fitted to them. For this, we use cross validation: we (randomly) split the dataset into a training set and a validation set, learn the model on the training set, and use the validation set to estimate the generalization error.

# Data

We can use the `train_test_split` function in scikit-learn to randomly split the data. The parameter `test_size` controls which fraction of the data is used for validation (or testing); the remainder is used for training. The split is random by default, but when we wish to reproduce our results we can set the `random_state` parameter to a fixed number.
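
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a small made-up dataset (the arrays `X` and `y` below are hypothetical toy data, not the advertising data). It checks the sizes of the two parts and shows that repeating the call with the same `random_state` gives the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# test_size=0.2 holds out 20% of the rows (2 of 10) for validation.
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(train_X), len(valid_X))  # 8 2

# Repeating the call with the same random_state reproduces the same split.
_, valid_X_again, _, valid_y_again = train_test_split(
    X, y, test_size=0.2, random_state=0)
same_split = np.array_equal(valid_X, valid_X_again)
print(same_split)  # True
```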

Note that in this case we do not convert the data to NumPy arrays. Pandas stores the data of a DataFrame in NumPy arrays in the background, and we can apply NumPy operations to it. Therefore, scikit-learn can also learn a model directly from Pandas data.
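
As a small illustration of this point, the sketch below builds a hypothetical toy DataFrame (the values are made up; only the column names mirror the advertising data), applies a NumPy function to a column, and fits a model on the DataFrame without any conversion.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; the column names mirror the advertising example.
df = pd.DataFrame({
    'TV': [10.0, 20.0, 30.0, 40.0],
    'Radio': [1.0, 3.0, 2.0, 5.0],
})
# Sales is constructed as an exact linear function of the two features.
df['Sales'] = 2.0 * df['TV'] + 3.0 * df['Radio'] + 1.0

# NumPy functions can be applied to a column directly; the result is a Series.
log_tv = np.log(df['TV'])

# to_numpy() exposes the values as a plain ndarray when one is needed.
features = df[['TV', 'Radio']].to_numpy()
print(type(features), features.shape)

# scikit-learn accepts the DataFrame and Series as they are.
model = LinearRegression()
model.fit(df[['TV', 'Radio']], df['Sales'])
print(model.coef_, model.intercept_)
```

Because the toy target is an exact linear function of the features, the fitted coefficients should come out close to 2 and 3, with an intercept close to 1.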

In [1]:
from ml import advertising_pd

In [2]:
df = advertising_pd()
y = df['Sales']
X = df[['TV', 'Radio']]

In [3]:
from sklearn.model_selection import train_test_split
train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

In [4]:
valid_X.head()

|     | TV    | Radio |
|-----|-------|-------|
| 18  | 69.2  | 20.5  |
| 170 | 50.0  | 11.6  |
| 107 | 90.4  | 0.3   |
| 98  | 289.7 | 42.3  |
| 177 | 170.2 | 7.8   |


# Model

We learn a Linear Regression model on the training set and report the validation error, measured as the root mean squared error (RMSE) of the predictions on the validation set.

In [6]:
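from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from math import sqrt

# Fit the model on the training set only.
model = LinearRegression()
model.fit(train_X, train_y)

# Predict on the held-out validation set and compute the RMSE.
# mean_squared_error expects (y_true, y_pred) in that order.
pred_y = model.predict(valid_X)
validation_error = sqrt(mean_squared_error(valid_y, pred_y))
validation_error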

2.1158493250248576
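
To see why the validation set is needed, it can help to compute the error on both parts of the split and compare them. The sketch below does this on hypothetical synthetic data (a made-up linear signal plus noise, not the advertising data), and the helper `rmse` is introduced here for illustration only. On any single split either number may come out larger, but the training error tends to be the optimistic one, since the model was fitted to exactly those examples.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 3.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 2.0, size=200)

train_X, valid_X, train_y, valid_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit on the training part only.
model = LinearRegression().fit(train_X, train_y)

def rmse(true_y, pred_y):
    """Root mean squared error, in the same units as the target."""
    return np.sqrt(mean_squared_error(true_y, pred_y))

# Error on the data the model was fitted to versus held-out data.
train_error = rmse(train_y, model.predict(train_X))
validation_error = rmse(valid_y, model.predict(valid_X))
print(f"training RMSE:   {train_error:.3f}")
print(f"validation RMSE: {validation_error:.3f}")
```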