Open the dataset ames_housing_no_missing.csv with the following command:

In [1]:
import pandas as pd

ames_housing = pd.read_csv("../datasets/ames_housing_no_missing.csv")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

ames_housing is a pandas dataframe. The column "SalePrice" contains the target variable.

To simplify this exercise, we will only use the numerical features defined below:

In [2]:
numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

data_numerical = data[numerical_features]

We will compare the generalization performance of a decision tree and a linear regression. For this purpose, we will create two separate predictive models and evaluate them by 10-fold cross-validation.
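As a rough illustration of what 10-fold cross-validation does with the rows, the sketch below splits sample indices into contiguous folds in plain Python. It mirrors the unshuffled splitting that scikit-learn's KFold applies by default when cv is an integer and the estimator is a regressor. The function name kfold_indices and the sample count of 25 are made up for the example.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train, test) index lists for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Hypothetical sample count, chosen only to keep the output small.
folds = list(kfold_indices(n_samples=25, n_splits=10))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

Each sample lands in exactly one test fold, so every row is used once for evaluation and nine times for training.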

Thus, use sklearn.linear_model.LinearRegression and sklearn.tree.DecisionTreeRegressor to create the models. Use the default parameters for both models.

Be aware that a linear model requires numerical features to be scaled. Please use sklearn.preprocessing.StandardScaler so that your linear regression model behaves the same way as the quiz author intended ;)
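For intuition, here is a minimal plain-Python sketch of the transformation StandardScaler applies to each column: subtract the mean and divide by the population standard deviation. The helper name standardize and the lot areas are invented for the example, and it is not the library's implementation.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = fmean(values)
    std = pstdev(values)
    return [(x - mean) / std for x in values]


# Hypothetical lot areas in square feet, made up for the example.
lot_area = [8000.0, 9500.0, 11000.0, 9000.0, 14000.0]
scaled = standardize(lot_area)
print([round(z, 2) for z in scaled])
```

After scaling, the column has mean 0 and unit variance, so features measured in very different units end up on a comparable scale.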

In [3]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate

linear_model = make_pipeline(StandardScaler(), LinearRegression())

cv_results = cross_validate(linear_model,
                            data_numerical, target,
                            return_estimator=True, cv=10)


In [4]:
cv_results["test_score"]

array([0.76129977, 0.80635105, 0.81358636, 0.66592199, 0.79964891,
       0.76868787, 0.75635094, 0.71822127, 0.31479306, 0.78635221])

In [5]:
print(f'Linear regression model mean R² score: {cv_results["test_score"].mean():.2f} +/- {cv_results["test_score"].std():.2f}')

Linear regression model mean R² score: 0.72 +/- 0.14
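
Because no scoring argument is passed, cross_validate falls back on the estimator's score method, which for scikit-learn regressors returns the coefficient of determination R² and not a classification accuracy. The plain-Python sketch below shows that formula on made-up prices; the function name r2_score_manual is an assumption for the example.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices, made up for the example.
y_true = [200_000.0, 150_000.0, 320_000.0, 180_000.0]
mean_price = sum(y_true) / len(y_true)

perfect = r2_score_manual(y_true, y_true)
baseline = r2_score_manual(y_true, [mean_price] * len(y_true))
print(perfect, baseline)
```

A perfect prediction scores 1.0, always predicting the mean scores 0.0, and a model worse than that baseline gets a negative score.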


In [6]:
from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor()

tree_cv_results = cross_validate(tree_model,
                                 data_numerical, target,
                                 cv=10)

In [7]:
tree_cv_results["test_score"]

array([0.57255068, 0.68686058, 0.69413218, 0.57931539, 0.73917407,
       0.63367046, 0.54579584, 0.67169015, 0.64038312, 0.6417573 ])

In [8]:
print(f'Decision tree regression model mean R² score: {tree_cv_results["test_score"].mean():.2f} +/- {tree_cv_results["test_score"].std():.2f}')

Decision tree regression model mean R² score: 0.64 +/- 0.06


In [9]:
print(
    'Linear regression is better than decision tree for '
    f'{sum(cv_results["test_score"] > tree_cv_results["test_score"])} CV iterations out of 10 folds.'
)

Linear regression is better than decision tree for 9 CV iterations out of 10 folds.
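
To see which fold goes against the trend, the per-fold scores printed above can be compared directly. The sketch below copies them into plain lists; the tree scores come from a run without a fixed random_state, so a rerun may give slightly different numbers.

```python
# Scores copied from the outputs of cells 4 and 7 above.
linear_scores = [0.76129977, 0.80635105, 0.81358636, 0.66592199, 0.79964891,
                 0.76868787, 0.75635094, 0.71822127, 0.31479306, 0.78635221]
tree_scores = [0.57255068, 0.68686058, 0.69413218, 0.57931539, 0.73917407,
               0.63367046, 0.54579584, 0.67169015, 0.64038312, 0.6417573]

# Positive difference: the linear model wins on that fold.
differences = [lin - tree for lin, tree in zip(linear_scores, tree_scores)]
losing_folds = [i for i, diff in enumerate(differences) if diff <= 0]
print(losing_folds)
```

The only fold where the tree wins is the one at index 8, the same fold where the linear model's score drops to about 0.31.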


Instead of using the default parameters for the decision tree regressor, we will optimize the max_depth of the tree. Vary the max_depth from 1 level up to 15 levels. Use nested cross-validation to evaluate a grid-search (sklearn.model_selection.GridSearchCV). Set cv=10 for both the inner and outer cross-validations, then answer the questions below.

In [10]:
import numpy as np
from sklearn.model_selection import GridSearchCV

max_depth = np.arange(1, 16, 1)
param_grid = {'max_depth': max_depth}

inner_cv = GridSearchCV(DecisionTreeRegressor(),
                        param_grid=param_grid,
                        cv=10)

gs_cv_results = cross_validate(inner_cv,
                               data_numerical, target,
                               return_estimator=True,
                               cv=10)

In [11]:
for search_cv in gs_cv_results['estimator']:
    print(search_cv.best_params_)

{'max_depth': 6}
{'max_depth': 5}
{'max_depth': 6}
{'max_depth': 5}
{'max_depth': 6}
{'max_depth': 6}
{'max_depth': 5}
{'max_depth': 6}
{'max_depth': 8}
{'max_depth': 13}


In 9 of the 10 outer folds, the selected max_depth was in the range 5 to 8; the remaining fold selected 13.
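
The selected depths printed above can be tallied with a few lines of standard-library Python. The list below is copied by hand from that output; the variable names are chosen for the example.

```python
from collections import Counter

# Best max_depth per outer fold, copied from the output above.
best_depths = [6, 5, 6, 5, 6, 6, 5, 6, 8, 13]

depth_counts = Counter(best_depths)
in_range = sum(5 <= depth <= 8 for depth in best_depths)

print(depth_counts.most_common())
print(f"{in_range} of {len(best_depths)} folds selected a depth between 5 and 8")
```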

Now, we want to evaluate the generalization performance of the decision tree while taking into account the fact that we tune the depth for this specific dataset. Use the grid-search as an estimator inside a cross_validate to automatically tune the max_depth parameter on each cross-validation fold.

In [12]:
gs_cv_results['test_score']

array([0.69038582, 0.76720246, 0.71227376, 0.4745161 , 0.76417411,
       0.73820259, 0.66928018, 0.7666175 , 0.47670107, 0.69585703])

In [13]:
print(f'Decision tree regression model with max_depth tuned by GridSearchCV, mean R² score: {gs_cv_results["test_score"].mean():.2f} +/- {gs_cv_results["test_score"].std():.2f}')

Decision tree regression model with max_depth tuned by GridSearchCV, mean R² score: 0.68 +/- 0.11


In [14]:
print(
    'Linear regression is better than decision tree for '
    f'{sum(cv_results["test_score"] > gs_cv_results["test_score"])} CV iterations out of 10 folds.'
)

Linear regression is better than decision tree for 8 CV iterations out of 10 folds.


Instead of using only the numerical features, you will now use the entire dataset available in the variable data.

Create a preprocessor by dealing separately with the numerical and categorical columns. 

In [15]:
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_processor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)

preprocessor = make_column_transformer(
    (categorical_processor, selector(dtype_include=object)),
    ("passthrough", selector(dtype_exclude=object))
)
tree = make_pipeline(preprocessor,
                     DecisionTreeRegressor(max_depth=7, random_state=0))
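
The handle_unknown="use_encoded_value" setting matters under cross-validation because a test fold can contain a category that never appeared in the corresponding training fold. The plain-Python sketch below mimics the idea (sorted categories mapped to integer codes, unseen ones mapped to -1). It is an illustration, not scikit-learn's implementation; the helper names and category labels are invented.

```python
def fit_ordinal_mapping(train_values):
    """Map each category seen during training to an integer code."""
    categories = sorted(set(train_values))
    return {category: code for code, category in enumerate(categories)}


def encode(values, mapping, unknown_value=-1):
    """Encode values, sending categories unseen in training to unknown_value."""
    return [mapping.get(value, unknown_value) for value in values]


# Hypothetical category labels, made up for the example.
train_column = ["Pave", "Grvl", "Pave"]
mapping = fit_ordinal_mapping(train_column)

# "Dirt" was not present during fitting, so it receives the code -1.
encoded = encode(["Grvl", "Pave", "Dirt"], mapping)
print(encoded)
```

A tree can split on the resulting integer codes directly, which is why no scaling step is applied to the numerical columns in this pipeline.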

In [16]:
cv_results = cross_validate(
    tree, data, target, cv=10, return_estimator=True, n_jobs=2
)



In [17]:
cv_results['test_score']

array([0.72939283, 0.78302269, 0.82629515, 0.74948593, 0.8330028 ,
       0.85093205, 0.78903061, 0.75170173, 0.60072763, 0.75856634])

In [18]:
print(f'Decision tree regression model with both categorical and numerical features mean R² score: {cv_results["test_score"].mean():.2f} +/- {cv_results["test_score"].std():.2f}')

Decision tree regression model with both categorical and numerical features mean R² score: 0.77 +/- 0.07


In [19]:
print(
    'A tree model using both numerical and categorical features is better than a '
    'tree with optimal depth using only numerical features for '
    f'{sum(cv_results["test_score"] > gs_cv_results["test_score"])} CV '
    'iterations out of 10 folds.'
)

A tree model using both numerical and categorical features is better than a tree with optimal depth using only numerical features for 9 CV iterations out of 10 folds.
