# Train Test Split

> Best practice is to train-test split your data right away. We will be moving towards treating the test sample as a holdout sample that is not touched until you want to test your final model.

> To achieve that, test different variations of your model only on your train set. Doing so requires some careful coding. It may also include K-Folds, which is shown at the bottom of this notebook.

> Since comfort with train-test split is a prerequisite for more thorough cross-validation, a single train-test split (TTS) is acceptable for this project. Fit the model on the train set, then validate on the test set to see whether it is overfitting.

> In the back of your mind, however, know that if you are tuning your model by predicting on your test set, you are creating a biased model which has, in a way, seen the test data before final validation.  
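
One way to avoid tuning against the test set is to carve a validation set out of the train set. The sketch below is a minimal illustration on synthetic data, not the Carseats data. The 60/20/20 proportions and the names `X_rest`, `X_val`, and `y_val` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 rows, 3 numeric features
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# First split: set aside 20% as the holdout test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

With this layout, model variations are compared on `X_val`, and `X_test` is scored once at the very end.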



In [40]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Binarizer, LabelBinarizer
from sklearn.linear_model import LinearRegression

In [41]:
df = pd.read_csv('Carseats.csv')
y = df.Sales
X = df.drop('Sales', axis=1)

Because the dataset is small, I will use a test size on the small side (0.2 instead of 0.3).


In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

> Use a standard scaler to put all numeric values on the same scale. Fit the scaler on the train set only, then apply the fitted scaler to the test set, so that no information from the test set leaks into preprocessing.

In [43]:
ss = StandardScaler()

X_train_num = X_train.select_dtypes(exclude='object')

# Standard scaler will strip our column names and index
# Save them here to re-apply in the future

X_train_num_columns = X_train_num.columns
X_train_num_index = X_train_num.index

X_train_num = pd.DataFrame(ss.fit_transform(X_train_num))
X_train_num.columns = X_train_num_columns
X_train_num.index = X_train_num_index
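
The "fit on train, apply to test" rule comes down to which rows the mean and standard deviation are computed from. The sketch below reproduces the mechanics by hand with NumPy on invented toy values. It is meant to show the idea, not to replace `StandardScaler`.

```python
import numpy as np

# Toy train and test columns (assumption: values invented for illustration)
train = np.array([[1.0], [2.0], [3.0], [4.0]])
test = np.array([[2.5], [10.0]])

# "fit": learn the statistics from the train set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), the same convention StandardScaler uses

# "transform": apply the train statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled train column has mean 0 and standard deviation 1 by construction. The scaled test column generally does not, and that is expected: it is measured on the train set's ruler.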

In [44]:
# .copy() gives an independent frame, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
X_train_obj = X_train.select_dtypes(include='object').copy()

In [45]:
# Convert Urban and US to binary
# LabelBinarizer returns an (n, 1) array, so flatten it before assigning
urb_bin = LabelBinarizer()
X_train_obj['Urban'] = urb_bin.fit_transform(X_train_obj['Urban']).ravel()

us_bin = LabelBinarizer()
X_train_obj['US'] = us_bin.fit_transform(X_train_obj['US']).ravel()


In [46]:
shelve_dum = pd.get_dummies(X_train_obj['ShelveLoc'])
shelve_dum.drop('Bad', axis=1, inplace=True)

X_train_processed = X_train_num.join(shelve_dum)
X_train_processed = X_train_processed.join(X_train_obj[['Urban', 'US']])
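
The manual steps so far (scale the numeric columns, binarize `Urban` and `US`, dummy-encode `ShelveLoc`, join everything back together) can also be bundled into one scikit-learn object. The sketch below is one possible arrangement using `ColumnTransformer` and `Pipeline`, not the approach this notebook takes. The data is a random stand-in that borrows a few Carseats column names, and `Price` and `Advertising` are assumed to be numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the Carseats data (assumption: values are random)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Price": rng.normal(115, 20, n),
    "Advertising": rng.integers(0, 30, n).astype(float),
    "ShelveLoc": rng.choice(["Bad", "Medium", "Good"], n),
    "Urban": rng.choice(["Yes", "No"], n),
    "US": rng.choice(["Yes", "No"], n),
})
y = 15 - 0.05 * X["Price"] + 0.1 * X["Advertising"] + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

num_cols = ["Price", "Advertising"]
cat_cols = ["ShelveLoc", "Urban", "US"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    # drop="first" drops the alphabetically first level of each column,
    # e.g. 'Bad' for ShelveLoc and 'No' for Urban and US
    ("cat", OneHotEncoder(drop="first"), cat_cols),
])

model = Pipeline([("preprocess", preprocess), ("lr", LinearRegression())])

# fit() learns scaling and encoding from the train set only;
# score() on the test set reuses those fitted transformers
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
```

The appeal of this design is that the test set can only ever pass through `transform`, so the train-only fitting rule is enforced by the object rather than by careful copy-pasting.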

> We want to fit our linear regression on the train set. Do not fit on the test set.  
> Once we fit, we get our score on the train set.<br>
> If the train R² is low, the model underfits: it has high bias.<br>
> If the train R² is high, the model has low bias.<br>
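
For reference, the value that `LinearRegression.score` reports is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on invented numbers. The helper name `r_squared` is hypothetical and exists only for this illustration.

```python
def r_squared(y_true, y_pred):
    """Return R^2 = 1 - SS_res / SS_tot for two equal-length sequences."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Toy values (assumption: invented for illustration)
y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 7.0, 9.5]
mean_pred = [6.0, 6.0, 6.0, 6.0]

print(r_squared(y_true, good_pred))
print(r_squared(y_true, mean_pred))
```

A model that always predicts the mean of `y` scores 0, so a low train R² says the model explains little more than the mean does, which is the high-bias case.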


In [47]:
lr = LinearRegression()
lr.fit(X_train_processed, y_train)
print(lr.score(X_train_processed, y_train)) 

0.8658420026331486


In [48]:
# Now repeat for the test set, reusing the transformers fitted on the train set
X_test_num = X_test.select_dtypes(exclude='object')
X_test_num_columns = X_test_num.columns
X_test_num_index = X_test_num.index

# transform only: the scaler keeps the mean and std learned from the train set
X_test_num = pd.DataFrame(ss.transform(X_test_num))
X_test_num.columns = X_test_num_columns
X_test_num.index = X_test_num_index

# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
X_test_obj = X_test.select_dtypes(include='object').copy()

# Convert Urban and US to binary with the already-fitted binarizers
X_test_obj['Urban'] = urb_bin.transform(X_test_obj['Urban']).ravel()
X_test_obj['US'] = us_bin.transform(X_test_obj['US']).ravel()

shelve_dum = pd.get_dummies(X_test_obj['ShelveLoc'])
shelve_dum.drop('Bad', axis=1, inplace=True)

X_test_processed = X_test_num.join(shelve_dum)
X_test_processed = X_test_processed.join(X_test_obj[['Urban', 'US']])
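
One caveat with calling `pd.get_dummies` separately on train and test: if a category happens to be missing from the test set, the test frame ends up with fewer columns than the model was fit on. A possible guard, sketched here on invented toy frames, is to reindex the test dummies to the train dummy columns. This step is an addition for illustration, not something the cells above do.

```python
import pandas as pd

# Toy frames (assumption): the test set happens to contain no 'Good' rows
train = pd.DataFrame({"ShelveLoc": ["Bad", "Good", "Medium", "Good"]})
test = pd.DataFrame({"ShelveLoc": ["Medium", "Bad"]})

train_dum = pd.get_dummies(train["ShelveLoc"]).drop("Bad", axis=1)

# Align the test dummies to the train columns:
# missing categories become all-zero columns, extra ones (here 'Bad') are dropped
test_dum = (
    pd.get_dummies(test["ShelveLoc"])
    .reindex(columns=train_dum.columns, fill_value=0)
    .astype(int)
)

print(test_dum)
```

After the reindex, the test dummies have the same columns in the same order as the train dummies.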


> We compare our train score to our test score.  If there is a significant decline between train and test, that means the model is **overfit**.  <br>
> In other words, our model has **high variance**.
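
To see what a high-variance gap can look like, the sketch below fits a linear regression with almost as many features as training rows. The data is synthetic and deliberately constructed to overfit: only one of the 40 features carries signal. The sizes and the seed are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features = 60, 40
X = rng.normal(size=(n_samples, n_features))
# Only the first feature carries signal; the other 39 are pure noise
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

lr = LinearRegression().fit(X_train, y_train)
train_r2 = lr.score(X_train, y_train)
test_r2 = lr.score(X_test, y_test)

# A large positive gap (train much higher than test) is the overfitting signal
print(f"train R2: {train_r2:.3f}, test R2: {test_r2:.3f}, gap: {train_r2 - test_r2:.3f}")
```

With 45 training rows and 40 features, the model can chase the noise in the train set, which is why the train score overstates how well it generalizes.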

In [49]:
lr = LinearRegression()
lr.fit(X_train_processed, y_train)
print(lr.score(X_train_processed, y_train)) 
print(lr.score(X_test_processed, y_test))

0.8658420026331486
0.8892712759554209


# Bonus: K-Folds
_Do not attempt until you have train_test_split down._

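> For an even more thorough validation, we use K-Folds on the train set. When you are building a model, you do not want to tune it with knowledge of the test set.

> But how do you guard against overfitting if the way you detect it is by comparing train and test scores? K-Folds answers this by splitting the train set into k folds: each fold takes one turn as the validation set while the model is fit on the remaining folds, and the validation scores are averaged.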
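
To make the fold mechanics concrete, here is a small pure-Python sketch of how contiguous, unshuffled folds can be generated. It follows the documented behavior of `KFold` without shuffling (the first `n_samples % n_splits` folds get one extra sample), but the function `kfold_indices` itself is hypothetical and written only for illustration.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Everything outside the validation block is used for fitting
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


for train_idx, val_idx in kfold_indices(10, 3):
    print("train:", train_idx, "val:", val_idx)
```

Every row lands in exactly one validation fold, so each observation is used for validation once and for fitting in the other rounds.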


In [50]:
kf = KFold(n_splits=3)

# R2 on each held-out validation fold (the test set is not touched here)
val_r2 = []
for train_ind, val_ind in kf.split(X_train_processed, y_train):
    X_tt, y_tt = X_train_processed.iloc[train_ind], y_train.iloc[train_ind]
    X_val, y_val = X_train_processed.iloc[val_ind], y_train.iloc[val_ind]
    lr.fit(X_tt, y_tt)
    val_r2.append(lr.score(X_val, y_val))

# Average validation R2 across the folds
print(sum(val_r2)/len(val_r2))

0.852491264515875
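
The loop above can also be written with `cross_val_score`. The sketch below is one possible variant on synthetic data: it wraps the scaler and the regression in a pipeline so that the scaler is refit inside each fold, which differs from the cell above, where the scaler was fit once on the whole train set. The data, coefficients, and seed are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed train set (assumption)
rng = np.random.default_rng(0)
X = rng.normal(loc=50, scale=10, size=(120, 4))
y = X @ np.array([0.5, -0.2, 0.0, 0.1]) + rng.normal(scale=1.0, size=120)

# Scaler and model travel together, so each fold fits the scaler
# on its own training portion only
model = make_pipeline(StandardScaler(), LinearRegression())

# For a regressor, the default score is R2, one value per fold
scores = cross_val_score(model, X, y, cv=KFold(n_splits=3))

print(scores)
print(scores.mean())
```

Looking at the individual fold scores as well as the mean gives a rough sense of how much the estimate varies from fold to fold.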
