# Data Science for Business - Subset Selection on Ames Housing

## Initialize notebook
Load the required packages and set up the workspace, e.g., set the plotting theme. Randomness is controlled later via the `random_state` argument.

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
import missingno as msno

from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
import statsmodels.formula.api as smf

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

In [None]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## Load data

Load training data from CSV file.

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/olivermueller/ds4b-2024/refs/heads/main/Session_05/ameshousing.csv')

In [None]:
data.head()

In [None]:
data.shape

In [None]:
data.columns

## Prepare data

First, we will remove some columns that are not useful for our task.

In [None]:
data = data.drop(['house_id', 'YrSold', 'MoSold', 'SaleCondition', 'SaleType'], axis=1)

Next, we will split the data into features (*X*) and labels (*y*) and into training (*X_train, y_train*) and test (*X_test, y_test*) sets.

In [None]:
X = data.drop("SalePrice", axis=1)
y = data["SalePrice"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Finally, we will do some feature engineering. It is important to use only information from the training set when fitting the feature engineering steps, and then to mechanically repeat these steps on the test set.

Typically, feature engineering depends strongly on the data type of the variables. Hence, we will first determine which variables are categorical and which are numerical. Subsequently, we will transform these two groups of variables separately.

In [None]:
categorical_features = X_train.select_dtypes(include='object').columns
numerical_features = X_train.select_dtypes(exclude='object').columns

In [None]:
categorical_features

In [None]:
numerical_features

The categorical variables must be transformed into numerical representations, e.g., by one-hot encoding them.

In [None]:
enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
enc.fit(X_train[categorical_features])

X_train_cat = enc.transform(X_train[categorical_features])
X_test_cat = enc.transform(X_test[categorical_features])

X_train_cat = pd.DataFrame(X_train_cat, columns=enc.get_feature_names_out(categorical_features))
X_test_cat = pd.DataFrame(X_test_cat, columns=enc.get_feature_names_out(categorical_features))

In [None]:
X_train_cat.head()

The numerical variables will be standardized, that is, we will subtract the mean and divide by the standard deviation.

In [None]:
scaler = StandardScaler()
scaler.fit(X_train[numerical_features]) 

X_train_num = scaler.transform(X_train[numerical_features])
X_test_num = scaler.transform(X_test[numerical_features])

X_train_num = pd.DataFrame(X_train_num, columns=numerical_features)
X_test_num = pd.DataFrame(X_test_num, columns=numerical_features)

In [None]:
X_train_num.head()
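As a quick sanity check of what standardization does, the following sketch compares `StandardScaler` with a manual computation on made-up toy numbers (not taken from the Ames data). Note that the mean and standard deviation come from the training data and are then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made-up numbers, for illustration only)
num_train = np.array([[1.0], [2.0], [3.0], [6.0]])
num_test = np.array([[3.0], [10.0]])

# Fit on the training data only
toy_scaler = StandardScaler().fit(num_train)

# Manual version: subtract the training mean and divide by the training
# standard deviation (population standard deviation, i.e. ddof=0)
train_mean = num_train.mean(axis=0)
train_std = num_train.std(axis=0)
manual_test = (num_test - train_mean) / train_std

print(toy_scaler.transform(num_test))
print(manual_test)
```

Both printouts should agree. The test value 3.0 maps to 0 because it equals the training mean, even though it is not the mean of the test data.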

Join categorical and numerical features.

In [None]:
X_train = pd.concat([X_train_num, X_train_cat], axis=1)
X_test = pd.concat([X_test_num, X_test_cat], axis=1)

In [33]:
X_train.head()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | SaleType_New | SaleType_Oth | SaleType_VWD | SaleType_WD | SaleCondition_Abnorml | SaleCondition_AdjLand | SaleCondition_Alloca | SaleCondition_Family | SaleCondition_Normal | SaleCondition_Partial |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.871817 | 0.667826 | 0.03381 | 0.673941 | -0.526415 | 0.181084 | -0.381277 | 0.531409 | -0.978369 | -0.293998 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.062906 | -1.717704 | 2.307082 | -0.76675 | -0.526415 | -0.115603 | -0.814347 | -0.569155 | -0.427633 | 4.19148 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.763949 | 0.369635 | -0.035514 | -1.487095 | -0.526415 | -0.28043 | -1.054941 | -0.569155 | -0.978369 | -0.293998 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.763949 | 0.071444 | -0.363746 | -1.487095 | -0.526415 | -0.708978 | -1.632368 | -0.569155 | -0.978369 | -0.293998 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 3.100756 | 0.160901 | -0.310697 | -1.487095 | 0.378216 | -1.664971 | -1.632368 | -0.569155 | -0.978369 | -0.293998 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |


## Stepwise (forward/backward) feature selection

### Feature selection with pre-defined number of features

Let us first use the `SequentialFeatureSelector` to select a pre-defined number of features. We can use the *forward* and *backward* selection methods.

In [None]:
lm = LinearRegression()
sfs_fwd = SequentialFeatureSelector(lm, n_features_to_select=3, direction='forward', scoring="neg_root_mean_squared_error", cv=5)
sfs_fwd.fit(X_train, y_train)

Show the selected features. Note: The features are NOT listed in the order of their importance, but in the order they appear in the dataset!

In [None]:
sfs_fwd.get_feature_names_out()

To see how well a model with the selected features performs, we will train it on the full training data and evaluate it on the test data.

In [None]:
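selected_features = sfs_fwd.get_feature_names_out()
mod_selected_features = LinearRegression().fit(X_train[selected_features], y_train)
preds_selected_features = mod_selected_features.predict(X_test[selected_features])
# RMSE on the test set; taking the square root of the MSE avoids the
# `squared=False` argument, which is deprecated since scikit-learn 1.4
print(np.sqrt(mean_squared_error(y_test, preds_selected_features)))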
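The *backward* direction mentioned at the start of this section works the same way, except that the search starts with all features and removes one per step. Below is a minimal sketch on synthetic data from `make_regression`. The synthetic data is an assumption made purely to keep the example fast, since backward elimination from all one-hot encoded columns would require many more model fits.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data: 8 features, of which only 3 carry signal
bwd_X, bwd_y = make_regression(
    n_samples=200, n_features=8, n_informative=3, noise=5.0, random_state=42
)

# Start from all 8 features and drop one per step until 3 remain
sfs_bwd = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=3,
    direction="backward",
    scoring="neg_root_mean_squared_error",
    cv=5,
)
sfs_bwd.fit(bwd_X, bwd_y)

# Column indices of the features that survived the elimination
print(sfs_bwd.get_support(indices=True))
```

Forward and backward selection are both greedy searches, so they do not necessarily end up with the same subset of features.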


### Compare models with different numbers of features

Now it's your turn! Compare the performance of models with different numbers of features (from 1 to 20). Hint: A for loop might be helpful here...

In [None]:
# YOUR CODE HERE

In [None]:
log_df = pd.DataFrame(log)
log_df

In [None]:
sns.relplot(data=log_df, x="n_features", y="rmse", kind="line")