The data files contain pre-processed training and test data from the Ames housing dataset. Train a decision tree (DT) model predicting "SalePrice".

Build a pipeline for the Ames data that includes a feature selection step using Pearson's correlation and a DT step. Create a hyperparameter (HP) grid that tunes *k* in the feature selection step and *max_depth* and *min_samples_leaf* in the DT model, choosing a range for each HP. Determine the best HP settings using Bayesian optimisation.

Replace the step that selects features based on Pearson's *r* with feature selection based on recursive feature elimination (RFE). Comment on the results obtained.

Some parts of the solution are already provided. Write code in the empty cells and in places indicated with "???".

Hint: use "Sklearn pipeline.ipynb" and "RFE.ipynb" as examples.

In [38]:
!pip install scikit-optimize



In [39]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set_theme(palette="Set2")

# execution time
from timeit import default_timer as timer
from datetime import timedelta

# increase column width
pd.set_option('display.max_colwidth', 250)

# silence warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [40]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Load the data

In [41]:
trainset = pd.read_csv("trainset-ames-housing.csv")
testset = pd.read_csv("testset-ames-housing.csv")

# separate predictors and target
ytrain = trainset["SalePrice"].copy()
Xtrain = trainset.drop("SalePrice", axis=1)
ytest = testset["SalePrice"].copy()
Xtest = testset.drop("SalePrice", axis=1)

# Model development

## Feature Selection with Pearson's r

In [42]:
from skopt import BayesSearchCV

# import relevant modules
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import r_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

In [43]:
start = timer()

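pipe = Pipeline([
    # note: r_regression returns the signed Pearson's r, so SelectKBest keeps the
    # k most positively correlated features; strongly negative ones rank last
    ('fsel', SelectKBest(r_regression)),
    ('dt', DecisionTreeRegressor(random_state=7))
])

# skopt reads a two-element list of integers as an inclusive integer range (low, high)
hp_grid = {
    'fsel__k': [3, 21],
    'dt__max_depth': [2, 25],
    'dt__min_samples_leaf': [1, 50],
}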

opt_grid_search = BayesSearchCV(
     pipe,
     hp_grid,
     n_iter=30,
     random_state=7,
     scoring='neg_root_mean_squared_error',
     return_train_score=True,
     cv=10
)

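# workaround: older scikit-optimize releases still reference np.int,
# an alias that NumPy removed in version 1.24
np.int = int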
opt_grid_search.fit(Xtrain, ytrain)

print("Execution time HH:MM:SS:", timedelta(seconds=timer() - start))

Execution time HH:MM:SS: 0:00:43.566588


In [44]:
cv_results = pd.DataFrame(opt_grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]
# scores are negative RMSE; flip the sign to get RMSE (lower is better)
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
# relative gap between training and validation RMSE, as a percentage of training RMSE
cv_results["diff, %"] = 100 * (
    cv_results["mean_train_score"] - cv_results["mean_test_score"]
) / cv_results["mean_train_score"]

cv_results.sort_values('mean_test_score')

| | params | mean_train_score | mean_test_score | diff, % |
|---|---|---|---|---|
| 29 | {'dt__max_depth': 25, 'dt__min_samples_leaf': 17, 'fsel__k': 21} | 32423.069608 | 38306.583111 | -18.146072 |
| 17 | {'dt__max_depth': 25, 'dt__min_samples_leaf': 19, 'fsel__k': 21} | 33912.597352 | 38879.629763 | -14.64657 |
| 23 | {'dt__max_depth': 21, 'dt__min_samples_leaf': 19, 'fsel__k': 21} | 33912.597352 | 38879.629763 | -14.64657 |
| 15 | {'dt__max_depth': 25, 'dt__min_samples_leaf': 18, 'fsel__k': 21} | 33062.633695 | 39125.920111 | -18.338788 |
| 18 | {'dt__max_depth': 25, 'dt__min_samples_leaf': 20, 'fsel__k': 21} | 34332.083434 | 39266.006288 | -14.371172 |
| 25 | {'dt__max_depth': 23, 'dt__min_samples_leaf': 20, 'fsel__k': 21} | 34332.083434 | 39266.006288 | -14.371172 |
| 26 | {'dt__max_depth': 22, 'dt__min_samples_leaf': 20, 'fsel__k': 21} | 34332.083434 | 39266.006288 | -14.371172 |
| 20 | {'dt__max_depth': 25, 'dt__min_samples_leaf': 25, 'fsel__k': 21} | 35480.635347 | 39361.439135 | -10.937808 |
| 27 | {'dt__max_depth': 23, 'dt__min_samples_leaf': 21, 'fsel__k': 21} | 34534.446797 | 39572.633485 | -14.588873 |
| 16 | {'dt__max_depth': 25, 'dt__min_samples_leaf': 9, 'fsel__k': 21} | 28454.538357 | 40012.756667 | -40.619947 |
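
The "diff, %" column can be checked by hand. As a small worked example, the block below recomputes it for the first row of the table above, using the two RMSE values exactly as printed there. The variable names are illustrative only.

```python
# Values copied from the first row of the results table (RMSE, in units of SalePrice)
mean_train_rmse = 32423.069608
mean_test_rmse = 38306.583111

# Same formula as the "diff, %" column: the train/validation gap relative to training RMSE
diff_pct = 100 * (mean_train_rmse - mean_test_rmse) / mean_train_rmse
print(round(diff_pct, 2))  # -18.15
```

A negative value means the validation RMSE is larger than the training RMSE. The larger the magnitude, the wider the gap between the two.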


## Feature Selection with RFE

In [46]:
from imblearn.pipeline import Pipeline
from sklearn.feature_selection import RFE

start = timer()

pipe = Pipeline([
    ('fsel', RFE(DecisionTreeRegressor(random_state=7, max_depth=10), step=1)),
    ('dt', DecisionTreeRegressor(random_state=7))
])

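# skopt reads a two-element list of integers as an inclusive integer range (low, high)
# note: this grid tunes min_samples_split, whereas the task statement and the
# Pearson's r run tune min_samples_leaf
hp_grid = {
    'fsel__n_features_to_select': [3, 21],
    'dt__max_depth': [2, 25],
    'dt__min_samples_split': [2, 50],
}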

opt_grid_search = BayesSearchCV(
     pipe,
     hp_grid,
     n_iter=30,
     random_state=7,
     scoring='neg_root_mean_squared_error',
     return_train_score=True,
     cv=10
)

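# workaround: older scikit-optimize releases still reference np.int,
# an alias that NumPy removed in version 1.24
np.int = int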
opt_grid_search.fit(Xtrain, ytrain)

print("Execution time HH:MM:SS:", timedelta(seconds=timer() - start))

Execution time HH:MM:SS: 0:01:13.098465
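
To see what the RFE step does inside the pipeline, it can help to run it on its own and inspect the fitted attributes. The block below is a minimal sketch on synthetic data from `make_regression` (an assumption; it stands in for the Ames predictors), using the same selector settings as the cell above and a fixed `n_features_to_select=3` chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (assumption, for illustration only)
X, y = make_regression(n_samples=300, n_features=10, n_informative=3,
                       noise=5.0, random_state=7)

# A depth-limited tree ranks features by importance; with step=1 the least
# important feature is dropped at each iteration until 3 remain
rfe = RFE(DecisionTreeRegressor(random_state=7, max_depth=10),
          n_features_to_select=3, step=1)
rfe.fit(X, y)

selected = np.flatnonzero(rfe.support_)  # column indices kept by RFE
print("selected columns:", selected)
print("ranking (1 = kept):", rfe.ranking_)
```

Each elimination round refits the inner tree, which is a plausible reason why the RFE search takes longer than the Pearson's *r* search (0:01:13 versus 0:00:43 in the outputs recorded here).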


In [47]:
cv_results = pd.DataFrame(opt_grid_search.cv_results_)[['params', 'mean_train_score', 'mean_test_score']]
# scores are negative RMSE; flip the sign to get RMSE (lower is better)
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
# relative gap between training and validation RMSE, as a percentage of training RMSE
cv_results["diff, %"] = 100 * (
    cv_results["mean_train_score"] - cv_results["mean_test_score"]
) / cv_results["mean_train_score"]

cv_results.sort_values('mean_test_score')

| | params | mean_train_score | mean_test_score | diff, % |
|---|---|---|---|---|
| 16 | {'dt__max_depth': 11, 'dt__min_samples_split': 29, 'fsel__n_features_to_select': 21} | 27416.298781 | 40474.524497 | -47.629426 |
| 13 | {'dt__max_depth': 10, 'dt__min_samples_split': 28, 'fsel__n_features_to_select': 21} | 27318.979253 | 40491.689826 | -48.218165 |
| 2 | {'dt__max_depth': 10, 'dt__min_samples_split': 28, 'fsel__n_features_to_select': 20} | 27318.979253 | 40510.703754 | -48.287765 |
| 18 | {'dt__max_depth': 11, 'dt__min_samples_split': 30, 'fsel__n_features_to_select': 21} | 27593.357481 | 40621.16217 | -47.213554 |
| 25 | {'dt__max_depth': 23, 'dt__min_samples_split': 27, 'fsel__n_features_to_select': 21} | 26550.719357 | 40687.820367 | -53.245642 |
| 21 | {'dt__max_depth': 14, 'dt__min_samples_split': 29, 'fsel__n_features_to_select': 21} | 27317.091395 | 40698.864694 | -48.986816 |
| 24 | {'dt__max_depth': 12, 'dt__min_samples_split': 28, 'fsel__n_features_to_select': 21} | 27096.300419 | 40722.807278 | -50.289178 |
| 19 | {'dt__max_depth': 10, 'dt__min_samples_split': 30, 'fsel__n_features_to_select': 20} | 27725.630275 | 40755.164953 | -46.994548 |
| 10 | {'dt__max_depth': 25, 'dt__min_samples_split': 30, 'fsel__n_features_to_select': 21} | 27496.995327 | 40807.569098 | -48.407375 |
| 26 | {'dt__max_depth': 10, 'dt__min_samples_split': 31, 'fsel__n_features_to_select': 21} | 27933.077555 | 40826.468144 | -46.158146 |
