# Cross-Validation

One way to evaluate the Decision Tree model is to use the train_test_split() function to split the training set into a smaller training set and a validation set, then train the model on the smaller training set and evaluate it on the validation set.
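As a rough illustration of that hold-out approach, here is a minimal sketch. It uses synthetic data from make_regression as a stand-in (an assumption, so the snippet runs without the housing data), and the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the California housing set)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Hold out 20% of the training data as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train on the smaller training set only
tree = DecisionTreeRegressor(random_state=42)
tree.fit(X_train, y_train)

# Evaluate on the rows the model has never seen
val_rmse = np.sqrt(mean_squared_error(y_val, tree.predict(X_val)))
print('Validation RMSE: ', val_rmse)
```

The drawback of this approach is that it gives a single score from a single split, so there is no indication of how much that score would vary with a different split.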

A great alternative is to use Scikit-Learn's K-fold cross-validation feature. 

In [23]:
# Stratified sampling: split train/test while preserving the income distribution
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import StratifiedShuffleSplit

data_housing = fetch_california_housing()
df = pd.DataFrame(data_housing.data, columns = data_housing.feature_names)
df['AvgHouseVal'] = data_housing.target
# Bucket median income into 5 categories to stratify on
df['income_categorical'] = pd.cut(df['MedInc'], bins=[0., 1.5, 3.0, 4.5, 6., np.inf], labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# n_splits=1, so take the single (train, test) index pair directly
train_index, test_index = next(split.split(df, df['income_categorical']))
# Drop the helper column on new frames instead of in place on slices of df
strat_train_set = df.loc[train_index].drop(columns='income_categorical')
strat_test_set = df.loc[test_index].drop(columns='income_categorical')

In [24]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Separate the predictors from the target
housing = strat_train_set.drop('AvgHouseVal', axis=1)
housing_label = strat_train_set['AvgHouseVal'].copy()

# Impute missing values with the median, then standardize each feature
pipe = Pipeline([
    ('imp', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())
])
housing_prepared = pipe.fit_transform(housing)

In [25]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()

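The following code splits the training set into 10 distinct subsets called "folds", then trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation each time and training on the other 9 folds. Note that cross_val_score() with an integer cv does not shuffle the rows itself; the folds are consecutive blocks of the training set, which here is already in shuffled order from the stratified split. The result is an array containing the 10 evaluation scores: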

In [26]:
from sklearn.model_selection import cross_val_score

score_tree_reg = cross_val_score(tree_reg, housing_prepared, housing_label,
                         scoring='neg_mean_squared_error',cv=10)
print('RMSE: ', np.sqrt(-score_tree_reg))
print('Mean: ', np.sqrt(-score_tree_reg).mean())
print('Standard deviation: ', np.sqrt(-score_tree_reg).std())

RMSE:  [0.70454858 0.71450523 0.73217492 0.7347886  0.71205754 0.73100768
 0.75472167 0.70973704 0.71566895 0.75309291]
Mean:  0.7262303109285979
Standard deviation:  0.0169019981522679


Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the negative of the MSE. This is why the preceding code negates the scores before taking the square root. Notice that cross-validation gives you not only an estimate of your model's performance, but also a measure of how precise this estimate is (its standard deviation): here the Decision Tree has an RMSE of about 0.726 with a standard deviation of about 0.017.
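To make the procedure and the sign convention concrete, the sketch below reproduces a 10-fold run by hand with KFold and compares it to cross_val_score(). It again uses synthetic make_regression data as a stand-in (an assumption), and fixes random_state so that both runs build identical trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption: not the California housing set)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# Manual K-fold: for a regressor, cv=10 corresponds to KFold without shuffling
manual_scores = []
for train_idx, val_idx in KFold(n_splits=10).split(X):
    fold_model = DecisionTreeRegressor(random_state=0)
    fold_model.fit(X[train_idx], y[train_idx])            # train on 9 folds
    mse = mean_squared_error(y[val_idx], fold_model.predict(X[val_idx]))
    manual_scores.append(-mse)                            # utility = negative cost
manual_scores = np.array(manual_scores)

# The same evaluation through cross_val_score
model = DecisionTreeRegressor(random_state=0)
auto_scores = cross_val_score(model, X, y,
                              scoring='neg_mean_squared_error', cv=10)

# Scores are negative MSEs, so negate before taking the square root
rmse = np.sqrt(-auto_scores)
print('Scores match: ', np.allclose(manual_scores, auto_scores))
print('Mean RMSE: ', rmse.mean())
print('Standard deviation: ', rmse.std())
```

Every entry of auto_scores is zero or negative, which is why taking the square root without negating first would produce NaN values.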

In [27]:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

In [28]:
score_lin_reg = cross_val_score(lin_reg, housing_prepared, housing_label,
                         scoring='neg_mean_squared_error',cv=10)
print('RMSE: ', np.sqrt(-score_lin_reg))
print('Mean: ', np.sqrt(-score_lin_reg).mean())
print('Standard deviation: ', np.sqrt(-score_lin_reg).std())

RMSE:  [0.7036911  0.72431503 0.72680092 0.73050519 0.75637811 0.74775988
 0.68872509 0.73301589 0.76535004 0.72080567]
Mean:  0.7297346932386132
Standard deviation:  0.021891450608441578


In [29]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()

In [30]:
score_forest_reg = cross_val_score(forest_reg, housing_prepared, housing_label,
                         scoring='neg_mean_squared_error',cv=10)
print('RMSE: ', np.sqrt(-score_forest_reg))
print('Mean: ', np.sqrt(-score_forest_reg).mean())
print('Standard deviation: ', np.sqrt(-score_forest_reg).std())

RMSE:  [0.49643739 0.48169674 0.5011856  0.5344223  0.49843895 0.54235484
 0.49470303 0.49194622 0.54107729 0.51841869]
Mean:  0.510068104450364
Standard deviation:  0.021062328065974228
