## Introduction
The goal of any model building process is to increase the predictive performance of a statistical model. Today you'll learn the difference between a training set, a validation set, and a test set, along with a process known as cross-validation, and how these methods directly affect your model's predictive performance on unseen data.

Since we normally only have access to a fixed data set, it's common to split the original data set into two portions, a **train** set and a **test** set, and to use the test portion to simulate 'unseen data.'

How we deal with the **train** portion of the original data set will be the focus of this morning's sprint. Below are four training strategies of increasing complexity, leading up to the best process for training and validating our models.

- **Worst Option** - Train model with the original data set without splitting it into a train and test set. We are unable to score our model and determine its predictive performance, since we don't have a test set.
- **Bad Option** - Only perform a train-test split. Train model with entire training set and score against test set.
- **Better Option** - Further split training set into **one** smaller training set and **one** validation set. Use validation set score to guide our model choice and then score best model against test set. *Below is an image of this option.*
- **Even Better Option** - Cross validate training set by splitting into **many** training sets and **many** validation sets. Rotate through each portion to build and validate model and average results. Best model is used to score against test set.

- **Training Set** - Used to train one, or more, models.
- **Validation Set** - Used to tune hyperparameters of different models and choose the best performing model.
- **Test Set** - Used to test the predictive performance of the best scoring model.
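
The "Better Option" above can be sketched by calling `train_test_split` twice. This is a minimal illustration, not the notebook's own solution: the synthetic data, the 25% split sizes, and the `random_state` values are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption, used only for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + rng.normal(scale=0.5, size=500)

# First split: hold out 25% of the data as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Second split: carve a validation set out of what remains
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

# Fit on the smaller training set only
model = LinearRegression().fit(X_train, y_train)

# The validation score guides model choice; the test score is checked once at the end
val_rmse = np.sqrt(np.mean((y_val - model.predict(X_val)) ** 2))
test_rmse = np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2))
print("validation RMSE:", val_rmse)
print("test RMSE:", test_rmse)
```

In practice you would compare several candidate models on `val_rmse` and report `test_rmse` only for the one you finally select.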


In [26]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np
# Note: load_boston was removed in scikit-learn 1.2; this notebook needs an earlier version
from sklearn.datasets import load_boston


## Part 2: Train and Test Split Only (Bad Option)

This option is considered a poor choice for two reasons:

- **High variance** - The random split could produce a training set that is not representative of the test set, or a test set that is unlike other data we will see. Either way, our estimate of the model's performance could be wrong.
- **No validation set for model tuning** - If we want to iterate on our model (for hyper-parameter optimization or variable selection), the only data we can score it against is the test set. Once we do that, we can no longer get an estimate of how our model will perform on truly unseen data, because the model has been able to "see" the test set. As a result, estimates of the model's performance are likely to be _optimistic_ when calculated on the same data that was used to select the model.
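
The high-variance point can be illustrated by repeating the train-test split with different random seeds and watching the test RMSE move. This is a rough sketch on synthetic data. The data, the noise level, and the choice of 20 seeds are assumptions, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Small synthetic data set (an assumption, used only for illustration)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + rng.normal(scale=2.0, size=200)

test_rmses = []
for seed in range(20):
    # Each seed gives a different random partition of the same data
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    residuals = y_te - model.predict(X_te)
    test_rmses.append(np.sqrt(np.mean(residuals ** 2)))

test_rmses = np.array(test_rmses)
print(f"min {test_rmses.min():.3f}  max {test_rmses.max():.3f}  "
      f"std {test_rmses.std():.3f}")
```

The spread between the minimum and maximum shows how much a single split's score depends on which rows happened to land in the test set.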


1. Since we already split the original data set in Part 1, we'll train our model on the training set only.


2. Write a function `rmse(true, predicted)` that takes your `true` and `predicted` values and calculates
   the RMSE. You should use `sklearn.metrics.mean_squared_error()` to confirm your
   results.

In [67]:
def rmse(targets, predictions):
    # Root mean squared error: square root of the mean squared residual
    return np.sqrt(((targets - predictions) ** 2).mean())
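
For reference, the quantity computed by `rmse` can be written as follows. The notation is assumed here rather than taken from the exercise: $y_i$ is the true value, $\hat{y}_i$ the predicted value, and $n$ the number of observations.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
$$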


3. Use `LinearRegression()` in scikit-learn to build a model with your training data.

   Note that there is multicollinearity and other issues in the data. Do not worry
   about this for now. We will learn about Lasso and Ridge regularization this
   afternoon (an alternative to the methods you learned yesterday) to
   deal with some of these issues.

In [47]:
boston = load_boston()
X = boston.data # housing features
y = boston.target # housing prices

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

In [48]:
print(X_train.shape, X_test.shape)

print(y_train.shape, y_test.shape)

(379, 13) (127, 13)
(379,) (127,)


In [68]:
# Fit your model using the training set
linear = LinearRegression()
linear.fit(X_train, y_train)

# Call predict to get the predicted values for training and test set
train_predicted = linear.predict(X_train)
test_predicted = linear.predict(X_test)



# Calculate RMSE for training and test set
# (the training score must use the training targets and predictions)
print( 'RMSE for training set ', rmse(y_train, train_predicted) )
print( 'RMSE for test set ', rmse(y_test, test_predicted) )

RMSE for test set  4.77717611421783


In [66]:
from sklearn.metrics import mean_squared_error
np.sqrt(mean_squared_error(y_test, test_predicted))

4.77717611421783

4. Which RMSE did you expect to be higher?


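I expected the test RMSE to be higher. Both sets come from the same population, so the two errors should be similar, but the model's coefficients were fit to minimize error on the training set, so the training error is usually the more optimistic of the two.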

5. Explain the value of evaluating RMSE on a separate test set (instead of fitting a
   model and calculating RMSE on the entire data set).


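Scoring on a separate test set estimates how the model performs on data it was not fit to. RMSE calculated on the same data used for fitting is optimistically biased, so it cannot reveal overfitting. A large gap between training and test RMSE suggests the model does not generalize well.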

## Part 3: K-fold Cross Validation (Even Better Option)

In K-fold cross validation, we split our training set into **k** groups (folds), usually 5 or 10. One of the k folds acts as our validation set, and the remaining **k-1** folds form the training set. We iterate until each **fold** has acted as the validation set exactly once. At each iteration an error metric (RMSE in this case) is calculated on the validation fold, and the k scores are then averaged.
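
The rotation described above can be sketched with `KFold`, first as an explicit loop and then with `cross_val_score`. This is a minimal example under stated assumptions: the synthetic `X_train` and `y_train`, the choice of `n_splits=5`, and the shuffling seed are illustrative and not taken from the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the training set (an assumption for illustration)
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 5))
y_train = X_train @ np.array([1.5, -2.0, 0.0, 0.7, 3.0]) + rng.normal(
    scale=0.5, size=300
)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual loop: each fold takes one turn as the validation set
fold_rmses = []
for fit_idx, val_idx in kf.split(X_train):
    model = LinearRegression().fit(X_train[fit_idx], y_train[fit_idx])
    residuals = y_train[val_idx] - model.predict(X_train[val_idx])
    fold_rmses.append(np.sqrt(np.mean(residuals ** 2)))

mean_rmse = np.mean(fold_rmses)
print("per-fold RMSE:", np.round(fold_rmses, 3))
print("mean RMSE:", round(mean_rmse, 3))

# Same folds via cross_val_score, which returns negated MSE per fold
neg_mse = cross_val_score(
    LinearRegression(), X_train, y_train,
    cv=kf, scoring="neg_mean_squared_error",
)
cv_rmses = np.sqrt(-neg_mse)
print("cross_val_score RMSE:", np.round(cv_rmses, 3))
```

scikit-learn scorers follow a "higher is better" convention, which is why the MSE comes back negated and has to be flipped before taking the square root.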