# Ames Housing - Stepwise Feature Selection
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages. Set up workspace, e.g., set theme for plotting and initialize the random number generator.

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_squared_error

In [None]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## Load data

Load training data from CSV file.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/ames_housing/data/train.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.columns

## Prepare data

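Let us first focus on a few easy-to-understand variables: the sale price (our target) and seven numeric features describing lot size, living area, rooms, and overall quality and condition.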

In [None]:
data = data[["SalePrice", "LotArea", "GrLivArea", "FullBath", "BedroomAbvGr", "KitchenAbvGr", "OverallQual", "OverallCond"]]

In [None]:
data.head()

Finally, we split the data into features (*X*) and labels (*y*), and then into training data (80%) and test data (20%).

In [None]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Forward/Backward feature selection

### Feature selection with pre-defined number of features

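Let us first use the *SequentialFeatureSelector* to select a pre-defined number of features. Its *direction* parameter switches between *forward* selection (start with no features and add the most helpful one at each step) and *backward* selection (start with all features and remove the least helpful one at each step). In both cases, candidates are compared by their cross-validated score on the training data.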

In [None]:
lm = LinearRegression()
sfs_fwd = SequentialFeatureSelector(lm, n_features_to_select=3, direction='forward', scoring="neg_root_mean_squared_error", cv=5)
sfs_fwd.fit(X_train, y_train)

Show the selected features. Note: The features are NOT listed in the order of their importance, but in the order they appear in the dataset!

In [None]:
sfs_fwd.get_feature_names_out()
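
Because the selector reports features in column order, the order in which they were added is not visible from its output. The following is a minimal sketch of the greedy forward loop that records this order explicitly. It is an illustration under assumptions, not the library's implementation: the data is synthetic (generated with `make_regression` so the cell runs without the Ames file), and the names `forward_selection_order`, `X_demo`, and `y_demo` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 6 numeric features, 3 of them informative
X_arr, y_demo = make_regression(n_samples=200, n_features=6, n_informative=3,
                                noise=10.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"x{i}" for i in range(6)])


def forward_selection_order(X, y, n_select, cv=5):
    """Greedy forward selection; returns features in the order they were added."""
    selected = []
    remaining = list(X.columns)
    while len(selected) < n_select:
        # Score every candidate when added to the features chosen so far
        scores = {}
        for col in remaining:
            cv_scores = cross_val_score(LinearRegression(), X[selected + [col]], y,
                                        scoring="neg_root_mean_squared_error", cv=cv)
            scores[col] = cv_scores.mean()
        # Keep the candidate with the best (least negative) mean score
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected


order = forward_selection_order(X_demo, y_demo, n_select=3)
print(order)
```

In the notebook itself, the same function could be called with `X_train` and `y_train`; the first entry of the returned list is the single feature with the best cross-validated RMSE on its own.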

To see how well a model with the selected features performs, we will train it on the full training data and evaluate it on the test data.

In [None]:
selected = sfs_fwd.get_feature_names_out()
mod_selected_features = LinearRegression().fit(X_train[selected], y_train)
preds_selected_features = mod_selected_features.predict(X_test[selected])
# Test-set RMSE; np.sqrt is used because the `squared=False` argument is deprecated in recent scikit-learn releases
print(np.sqrt(mean_squared_error(y_test, preds_selected_features)))


## Compare models with different numbers of features

Now it's your turn! Compare the performance of models with different numbers of features (from 1 to 6; the selector must pick fewer features than the 7 available). A for loop might be helpful here...

In [None]:
log = []
for i in range(1, 7):  # yields 1..6; the selector must pick fewer than the 7 available features
    # Select i features by cross-validated RMSE on the training data
    sfs_fwd = SequentialFeatureSelector(lm, n_features_to_select=i, direction='forward', scoring="neg_root_mean_squared_error", cv=5)
    sfs_fwd.fit(X_train, y_train)
    selected = sfs_fwd.get_feature_names_out()
    # Refit on the full training data and evaluate on the test data
    mod_selected_features = LinearRegression().fit(X_train[selected], y_train)
    preds_selected_features = mod_selected_features.predict(X_test[selected])
    # np.sqrt is used because the `squared=False` argument is deprecated in recent scikit-learn releases
    rmse = np.sqrt(mean_squared_error(y_test, preds_selected_features))
    log.append({"n_features": i, "features": selected, "rmse": rmse})

In [None]:
log_df = pd.DataFrame(log)
log_df

In [None]:
sns.relplot(data=log_df, x="n_features", y="rmse", kind="line")