Open the dataset ames_housing_no_missing.csv with the following command:

In [1]:
import pandas as pd

ames_housing = pd.read_csv("../datasets/ames_housing_no_missing.csv")
target_name = "SalePrice"
data = ames_housing.drop(columns=target_name)
target = ames_housing[target_name]

ames_housing is a pandas dataframe. The column "SalePrice" contains the target variable.

To simplify this exercise, we will only use the numerical features defined below:

In [2]:
numerical_features = [
    "LotFrontage", "LotArea", "MasVnrArea", "BsmtFinSF1", "BsmtFinSF2",
    "BsmtUnfSF", "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF",
    "GrLivArea", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces",
    "GarageCars", "GarageArea", "WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
    "3SsnPorch", "ScreenPorch", "PoolArea", "MiscVal",
]

data_numerical = data[numerical_features]

We will compare the generalization performance of a decision tree and a linear regression. For this purpose, we will create two separate predictive models and evaluate them by 10-fold cross-validation.

Thus, use sklearn.linear_model.LinearRegression and sklearn.tree.DecisionTreeRegressor to create the models. Use the default parameters for both models.

Be aware that a linear model requires the numerical features to be scaled. Please use sklearn.preprocessing.StandardScaler so that your linear regression model behaves the same way as the quiz author intended ;)
<h1>Question 1</h1> (1 point possible)

By comparing the cross-validation test scores for both models fold-to-fold, count the number of times the linear model has a better test score than the decision tree model. Select the range which this number belongs to:
a) [0, 3]: the linear model is substantially worse than the decision tree
b) [4, 6]: both models are almost equivalent
c) [7, 10]: the linear model is substantially better than the decision tree 

In [38]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

model_lin1 = make_pipeline(StandardScaler(), LinearRegression())
cv_lin1 = cross_validate(model_lin1, data_numerical, target, cv=10, return_estimator=True)

model_tree1 = DecisionTreeRegressor(random_state=2)
cv_tree1 = cross_validate(model_tree1, data_numerical, target, cv=10, return_estimator=True)

In [4]:
cv_lin1['test_score'] > cv_tree1['test_score']

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
        True])
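
The question asks for a count, so the boolean array can be reduced to a single number. A minimal sketch, with the array above copied in by hand so that the snippet runs on its own (in the notebook one would sum the comparison directly):

```python
import numpy as np

# Fold-by-fold comparison copied from the output above:
# True means the linear model scored higher than the tree on that fold.
lin_is_better = np.array(
    [True, True, True, True, True, True, True, True, False, True]
)

# Booleans sum as 0/1, so the sum is the number of folds won by the linear model.
n_lin_wins = int(lin_is_better.sum())
print(f"Linear model is better on {n_lin_wins} of {lin_is_better.size} folds")
```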

<h1>Question 2</h1> (1 point possible)

Instead of using the default parameters for the decision tree regressor, we will optimize the max_depth of the tree. Vary the max_depth from 1 level up to 15 levels. Use nested cross-validation to evaluate a grid-search (sklearn.model_selection.GridSearchCV). Set cv=10 for both the inner and outer cross-validations, then answer the questions below.

What is the optimal tree depth for the current problem?
a) The optimal depth ranges from 3 to 5
b) The optimal depth ranges from 5 to 8
c) The optimal depth ranges from 8 to 11
d) The optimal depth ranges from 11 to 15

In [8]:
import numpy as np
np.arange(1, 16, 1)

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [39]:
from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {"max_depth": np.arange(1, 16, 1)}
model_tree_GS1 = GridSearchCV(model_tree1, param_grid=param_grid, cv=10)
cv_tree2 = cross_validate(model_tree_GS1, data_numerical, target, cv=10, return_estimator=True)

In [26]:
model_tree_GS1.best_params_['max_depth']
# fails: cross_validate fits clones of the estimator, so model_tree_GS1 itself is never fitted

AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
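
The error arises because cross_validate fits clones of the estimator it receives and leaves the original object unfitted; the fitted copies are only reachable through return_estimator=True. A small sketch on synthetic data illustrates this. The make_regression dataset, the 3-fold splits and the short depth grid are assumptions chosen to keep the run fast, not the quiz setup:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for data_numerical / target (assumption, not the Ames data).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=3,
)
cv_results = cross_validate(search, X, y, cv=3, return_estimator=True)

# The object passed to cross_validate was never fitted, so it has no best_params_.
original_is_fitted = hasattr(search, "best_params_")
print("original search fitted:", original_is_fitted)

# Each outer fold keeps its own fitted clone, with its own best_params_.
best_depths = [est.best_params_["max_depth"] for est in cv_results["estimator"]]
print("best depth per outer fold:", best_depths)
```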

In [27]:
cv_tree2

{'fit_time': array([1.4733088 , 1.40304089, 1.45135307, 1.38454986, 1.39814806,
        1.38455296, 1.3834269 , 1.43239379, 1.49580407, 1.45897436]),
 'score_time': array([0.00152302, 0.00158119, 0.00152898, 0.00146508, 0.00141168,
        0.00139904, 0.001472  , 0.00132322, 0.001755  , 0.00135398]),
 'estimator': [GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
               param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])}),
  GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
               param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])}),
  GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
               param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])}),
  GridSearchCV(cv=10, estimator=DecisionTreeRegressor(),
               param_grid={'max_depth': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])}),
  GridSearchC

In [30]:
# random_state = default
best_params = []
for model in cv_tree2['estimator']:
    print(model.best_params_['max_depth'])
    best_params.append(model.best_params_['max_depth'])

8
7
9
7
8
6
7
6
5
7


In [35]:
# random_state = 1
best_params = []
for model in cv_tree2['estimator']:
    print(model.best_params_['max_depth'])
    best_params.append(model.best_params_['max_depth'])

7
7
11
7
5
6
6
6
7
8


In [40]:
# random_state = 2
best_params = []
for model in cv_tree2['estimator']:
    print(model.best_params_['max_depth'])
    best_params.append(model.best_params_['max_depth'])

5
7
14
5
6
7
6
5
5
6
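
To see which answer range the per-fold depths fall into, the selected values can be tallied. A sketch using the depths printed above for random_state=2, copied in by hand; the same tally applies to the other two runs:

```python
from collections import Counter
from statistics import median

# Depths selected on the 10 outer folds with random_state=2 (copied from the output above).
best_params = [5, 7, 14, 5, 6, 7, 6, 5, 5, 6]

# How often each depth was chosen across the outer folds.
depth_counts = Counter(best_params)
for depth, count in sorted(depth_counts.items()):
    print(f"max_depth={depth}: selected on {count} fold(s)")

# The median is not pulled upward by the single fold that selected a depth of 14.
median_depth = median(best_params)
print("median depth:", median_depth)
```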


In [29]:
cv_tree2['test_score']

array([0.59113507, 0.73749073, 0.74271795, 0.59819214, 0.75290865,
       0.71933625, 0.66292062, 0.76053576, 0.74639639, 0.68114069])

In [31]:
# random_state = default
pd.DataFrame({'Best params': best_params, 'Test scores': cv_tree2['test_score']})

   Best params  Test scores
0            8     0.591135
1            7     0.737491
2            9     0.742718
3            7     0.598192
4            8     0.752909
5            6     0.719336
6            7     0.662921
7            6     0.760536
8            5     0.746396
9            7     0.681141


In [36]:
# random_state = 1
pd.DataFrame({'Best params': best_params, 'Test scores': cv_tree2['test_score']})

   Best params  Test scores
0            7      0.61255
1            7     0.705582
2           11     0.721236
3            7     0.598192
4            5     0.738052
5            6     0.724811
6            6     0.679088
7            6     0.760536
8            7      0.52936
9            8      0.72107


In [41]:
# random_state = 2
pd.DataFrame({'Best params': best_params, 'Test scores': cv_tree2['test_score']})

   Best params  Test scores
0            5     0.578937
1            7       0.7177
2           14     0.753309
3            5     0.474516
4            6     0.762698
5            7     0.746467
6            6     0.655302
7            5     0.756933
8            5     0.746396
9            6     0.717594


<h1>Question 3</h1> (1 point possible)

Now, we want to evaluate the generalization performance of the decision tree while taking into account the fact that we tune the depth for this specific dataset. Use the grid-search as an estimator inside a cross_validate to automatically tune the max_depth parameter on each cross-validation fold.

A tree with tuned depth
a) is always worse than the linear model on all CV folds
b) is often but not always worse than the linear model
c) is often but not always better than the linear model
d) is always better than the linear model on all CV folds


Note: Try to set the random_state of the decision tree to different values e.g. random_state=1 or random_state=2 and re-run the nested cross-validation to check that your answer is stable enough.
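
The stability check suggested in the note can be scripted rather than re-run by hand. A hedged sketch on synthetic data: the make_regression dataset, the 5-fold splits and the depth grid 1 to 5 are assumptions made to keep it quick, and the name wins_per_seed is introduced here for illustration only. With an integer cv and a regressor, the folds are not shuffled, so both models are scored on the same splits and a fold-to-fold comparison is meaningful:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for data_numerical / target (assumption, not the Ames data).
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)

linear = make_pipeline(StandardScaler(), LinearRegression())
lin_scores = cross_validate(linear, X, y, cv=5)["test_score"]

wins_per_seed = {}
for seed in [0, 1, 2]:
    # Inner grid-search tunes max_depth; the outer cross_validate evaluates it.
    tuned_tree = GridSearchCV(
        DecisionTreeRegressor(random_state=seed),
        param_grid={"max_depth": np.arange(1, 6)},
        cv=5,
    )
    tree_scores = cross_validate(tuned_tree, X, y, cv=5)["test_score"]
    # Number of outer folds on which the linear model scores higher.
    wins_per_seed[seed] = int(np.sum(lin_scores > tree_scores))

print(wins_per_seed)
```

If the counts barely change from one seed to the next, the answer can be considered stable.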

In [32]:
# random_state = default
cv_lin1['test_score'] > cv_tree2['test_score']

array([ True,  True,  True,  True,  True,  True,  True, False, False,
        True])

In [37]:
# random_state = 1
cv_lin1['test_score'] > cv_tree2['test_score']

array([ True,  True,  True,  True,  True,  True,  True, False, False,
        True])

In [42]:
# random_state = 2
cv_lin1['test_score'] > cv_tree2['test_score']

array([ True,  True,  True,  True,  True,  True,  True, False, False,
        True])


<h1>Question 4</h1> (1 point possible)

Instead of using only the numerical features you will now use the entire dataset available in the variable data.

Create a preprocessor by dealing separately with the numerical and categorical columns. For the sake of simplicity, we will assume the following:

    categorical columns can be selected if they have an object data type;
    use an OrdinalEncoder to encode the categorical columns;
    numerical columns can be selected if they do not have an object data type; they are the complement of the categorical columns.
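
The two selection rules above split the columns into two disjoint groups. A quick check on a toy dataframe (the values are made up for illustration, and the object dtype is set explicitly so that the sketch does not depend on the default string dtype of the installed pandas version):

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with two numerical columns and two object (string) columns.
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": pd.Series(["Pave", "Pave", "Grvl"], dtype=object),
    "GarageCars": [2, 2, 1],
    "CentralAir": pd.Series(["Y", "Y", "N"], dtype=object),
})

# Calling a selector on a dataframe returns the list of matching column names.
categorical_columns = make_column_selector(dtype_include=object)(toy)
numerical_columns = make_column_selector(dtype_exclude=object)(toy)
print("categorical:", categorical_columns)
print("numerical:", numerical_columns)

# Every column belongs to exactly one of the two groups.
is_partition = sorted(categorical_columns + numerical_columns) == sorted(toy.columns)
print("columns fully partitioned:", is_partition)
```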

In addition, set the max_depth of the decision tree to 7 (fixed, no need to tune it with a grid-search).

Evaluate this model using cross_validate as in the previous questions.

A tree model trained with both numerical and categorical features
a) is most often worse than the tree model using only the numerical features
b) is most often better than the tree model using only the numerical features

Note: Try to set the random_state of the decision tree to different values e.g. random_state=1 or random_state=2 and re-run the (this time single) cross-validation to check that your answer is stable enough.

In [58]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer, make_column_selector

categorical_columns = make_column_selector(dtype_include=object)(data)
numerical_columns = make_column_selector(dtype_exclude=object)(data)
# categorical_processor = OneHotEncoder(handle_unknown='ignore')
categorical_processor = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
# numerical_preprocessor = StandardScaler() # Not needed for trees

preprocessor = ColumnTransformer([("Ordinal", categorical_processor, categorical_columns)], remainder="passthrough")
model_tree3 = make_pipeline(preprocessor, DecisionTreeRegressor(max_depth=7, random_state=2))
cv_tree3 = cross_validate(model_tree3, data, target, cv=10, return_estimator=True)

In [55]:
# random_state = default
cv_tree2['test_score'] < cv_tree3['test_score']

array([ True,  True,  True,  True,  True,  True, False, False, False,
        True])

In [57]:
# random_state = 1
cv_tree2['test_score'] < cv_tree3['test_score']

array([ True,  True,  True,  True,  True,  True,  True, False, False,
        True])

In [59]:
# random_state = 2
cv_tree2['test_score'] < cv_tree3['test_score']

array([ True,  True,  True,  True,  True,  True,  True, False, False,
        True])