# Summary of previous data cleaning:

From **Ames_Data_Cleaning.ipynb**, the following modifications were made to the Ames dataset:

**Data Types**
- OverallQual and OverallCond reclassified as ordinal datatypes (per data dictionary)
- MSSubClass reclassified as categorical (nominal) datatype
- MoSold and YrSold redefined as categorical datatypes for modeling options
- Other date features remain numeric types, to prevent dimensionality problems when ordinal encoding
- GarageYrBlt changed from 'float' to 'int', after imputation of missing values, to match other date features.
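
The dtype changes above can be sketched on a toy frame. This is only an illustration under assumptions: the values are made up, and casting to `str` is one plausible way to make these columns non-numeric (the cleaning notebook itself is not shown here).

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a few Ames columns (illustrative values only).
toy = pd.DataFrame({
    'MSSubClass': [30, 120, 20],
    'MoSold': [3, 2, 11],
    'YrSold': [2010, 2009, 2008],
    'YearBuilt': [1939, 1984, 2001],
    'GarageYrBlt': [1939.0, np.nan, 2001.0],
})

# Treat coded or calendar columns as categories by casting them to strings.
for col in ['MSSubClass', 'MoSold', 'YrSold']:
    toy[col] = toy[col].astype(str)

# Fill the missing garage year first; only then can the column be cast to int.
toy['GarageYrBlt'] = toy['GarageYrBlt'].fillna(toy['YearBuilt']).astype(int)

# Only the true numeric columns remain numeric.
numeric_cols = toy.select_dtypes(include='number').columns.to_list()
print(numeric_cols)
```

The cast to `int` has to come after imputation because a column containing `NaN` cannot be stored as a plain integer dtype.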


**Missing Data and Imputation**
- Numerical Features:
    - After exploring relationship between GarageYrBlt and YearBuilt, imputed missing values with YearBuilt
    - After evaluating relationship between LotArea and LotFrontage, imputed LotFrontage with median
    - Imputed remaining numerical features with median values (median values are the same as the mode values for discrete features)
- Categorical Features:
    - Imputed missing Electrical value with mode value
    - After evaluating Garage features, imputed missing values with 'None'
    - Imputed remaining categorical features with 'None'
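
A minimal sketch of the three imputation styles listed above (median, mode, and the literal string 'None'), assuming a toy frame with made-up values; it is not the code from the cleaning notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per imputation style (illustrative values).
toy = pd.DataFrame({
    'LotFrontage': [68.0, np.nan, 60.0, 80.0],
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA'],
    'GarageType': ['Attchd', None, 'Detchd', None],
})

# Numerical feature: fill with the column median.
toy['LotFrontage'] = toy['LotFrontage'].fillna(toy['LotFrontage'].median())

# Electrical: fill with the most frequent value (the mode).
toy['Electrical'] = toy['Electrical'].fillna(toy['Electrical'].mode()[0])

# Other categorical features: a missing value means the feature is absent.
toy['GarageType'] = toy['GarageType'].fillna('None')

print(toy)
```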


**Outliers**
 - Compared GrLivArea vs. SalePrice for `ames` with and without the 2 extreme GrLivArea values. I did not see any significant impact on the two models, so I did not consider the points worth removing. **This is contrary to what De Cock suggests in his paper.**

**ames_clean.pkl** includes these changes.

# Contents

[Numerical Models](#Numerical-Models)
    
No categorical features ("mc" = multicollinear).
    
- num linear-linear model, including mc features
- num linear-linear model, no mc features
- num log-linear model, no mc features
- num log-log (partial) model, log transforming most highly skewed features: 
    - LotArea
    - MasVnrArea
    - WoodDeckSF
    - OpenPorchSF
    - EnclosedPorch
    - 3SsnPorch
    - ScreenPorch


[Numerical and Categorical Models](#Numerical-and-Categorical-Models)

- num-cat linear-linear model, all df features
- num-cat log-linear model, without mc, composite features

**Model summaries are stored in `notebook_results`**

[Resources](#Resources)

In [3]:
# libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OrdinalEncoder, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import KFold

from sklearn.linear_model import LinearRegression

In [4]:
# setting to enable full viewing of cell output
pd.set_option('display.max_colwidth', None)

### Import raw data, if needed for comparison.

In [6]:
# read the raw file and discard the saved index column in one step
ames_raw = pd.read_csv('Ames_Housing_Price_Data.csv', index_col=0).reset_index(drop=True)
ames_raw.head(2)

Unnamed: 0,PID,GrLivArea,SalePrice,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,909176150,856,126000,30,RL,,7890,Pave,,Reg,...,166,0,,,,0,3,2010,WD,Normal
1,905476230,1049,139500,120,RL,42.0,4235,Pave,,Reg,...,0,0,,,,0,2,2009,WD,Normal


### Import cleaned data: ames_clean.pkl

In [8]:
ames = pd.read_pickle('ames_clean.pkl')

In [9]:
ames.isnull().sum().sum()

0

In [10]:
ames.columns

Index(['GrLivArea', 'SalePrice', 'MSSubClass', 'MSZoning', 'LotFrontage',
       'LotArea', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt',
       'YearRemodAdd', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd',
       'MasVnrType', 'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'Pav

## Define numerical features

In [12]:
numerical_features = ames.select_dtypes(include=['float64', 'int64'])
numerical_features

Unnamed: 0,GrLivArea,SalePrice,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal
0,856,126000,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,...,1939,2.0,399.0,0,0,0,0,166,0,0
1,1049,139500,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,...,1984,1.0,266.0,0,105,0,0,0,0,0
2,1001,124900,60.0,6060,1930,2007,0.0,737.0,0.0,100.0,...,1930,1.0,216.0,154,0,42,86,0,0,0
3,1039,114000,80.0,8146,1900,2003,0.0,0.0,0.0,405.0,...,1940,1.0,281.0,0,0,168,0,111,0,0
4,1665,227000,70.0,8400,2001,2001,0.0,643.0,0.0,167.0,...,2001,2.0,528.0,0,45,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2575,952,121000,68.0,8854,1916,1950,0.0,0.0,0.0,952.0,...,1916,1.0,192.0,0,98,0,0,40,0,0
2576,1733,139600,68.0,13680,1955,1955,0.0,0.0,0.0,0.0,...,1955,2.0,452.0,0,0,0,0,0,0,0
2577,2002,145000,82.0,6270,1949,1950,0.0,284.0,0.0,717.0,...,1949,3.0,871.0,0,0,0,0,0,0,0
2578,1842,217500,68.0,8826,2000,2000,144.0,841.0,0.0,144.0,...,2000,2.0,486.0,193,96,0,0,0,0,0


In [13]:
numerical_features_list = ames.select_dtypes(include=['float64', 'int64']).columns
numerical_features_list

Index(['GrLivArea', 'SalePrice', 'LotFrontage', 'LotArea', 'YearBuilt',
       'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
       'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'BsmtFullBath',
       'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr',
       'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars', 'GarageArea',
       'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'MiscVal'],
      dtype='object')

**Note:** These are the correct 32 numerical columns: 31 features plus the target, SalePrice.

# Numerical Models
[Contents](#Contents)

no categorical features

### numerical model with all numerical features, including multicollinear features

In [16]:
# define features and target 
X = numerical_features.drop('SalePrice', axis = 1)
y = ames['SalePrice']

In [17]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', X.columns.to_list())
])

# fit the model
model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

model.fit(X, y)

In [18]:
# evaluate the model
cv_scores = cross_val_score(model, X, y)

mean_cv_score = float(round(cv_scores.mean(), 4))
print(f'numerical linear-linear model, including mc features, no categorical')
print(f'mean cv score: {mean_cv_score}')
print(f'cv scores: {cv_scores}')

# Store model data for summary
# Initialize once to store results
notebook_results = []

# append notebook_results
notebook_results.append({
    'Scenario': 'num linear-linear model, including mc features',
    'Mean CV Score': mean_cv_score,
    'CV Scores': cv_scores.tolist()  # Convert to list for better display
})

numerical linear-linear model, including mc features, no categorical
mean cv score: 0.8277
cv scores: [0.79384382 0.8336217  0.85686165 0.82131624 0.83287847]
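
For context on the five scores above: with no `cv` or `scoring` argument, `cross_val_score` on a regressor uses 5 unshuffled folds and the estimator's own `score`, which is R² for `LinearRegression`. The sketch below, on synthetic data (an assumption, not the Ames data), makes those defaults explicit and shows a shuffled alternative using the already imported `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Defaults: 5 folds, no shuffling, R^2 scoring.
default_scores = cross_val_score(LinearRegression(), X_demo, y_demo)

# The same configuration written out explicitly.
explicit_cv = KFold(n_splits=5, shuffle=False)
explicit_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=explicit_cv, scoring='r2')

# Shuffled folds, useful when row order might carry structure.
shuffled_cv = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_scores = cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=shuffled_cv, scoring='r2')

print(default_scores.mean(), shuffled_scores.mean())
```

If the first fold's lower score in this notebook is an ordering effect, shuffled folds would be one way to check; that is a suggestion, not something the results here establish.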


In [19]:
# notebook_results

### drop composite, multicollinear features

In [21]:
# drop target and define features for base model
base_features = numerical_features.drop(columns=['SalePrice', 'MiscVal', 'TotalBsmtSF', 'GrLivArea'])
base_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1,1939,2.0,399.0,0,0,0,0,166,0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,0,1984,1.0,266.0,0,105,0,0,0,0


In [22]:
# Define base features list
base_features_list = base_features.columns.to_list()
print(f'number of base features (without target and 3 multicollinear features): {len(base_features_list)}')

number of base features (without target and 3 multicollinear features): 28
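
One way to see why GrLivArea and TotalBsmtSF are redundant is to check whether each equals the sum of its component columns. The sketch below uses only the rows displayed in this notebook (five for living area, two for basement), so it is a spot check rather than proof for the full dataset; `is_sum_of` is a hypothetical helper. The reason for dropping MiscVal is not covered here.

```python
import pandas as pd

def is_sum_of(df, total, parts):
    """Return True when column `total` equals the row-wise sum of `parts`."""
    return bool((df[parts].sum(axis=1) == df[total]).all())

# First five rows as displayed in this notebook.
living = pd.DataFrame({
    'GrLivArea': [856, 1049, 1001, 1039, 1665],
    '1stFlrSF': [856, 1049, 1001, 717, 810],
    '2ndFlrSF': [0, 0, 0, 322, 855],
    'LowQualFinSF': [0, 0, 0, 0, 0],
})

# First two rows as displayed in this notebook.
basement = pd.DataFrame({
    'TotalBsmtSF': [856.0, 1049.0],
    'BsmtFinSF1': [238.0, 552.0],
    'BsmtFinSF2': [0.0, 393.0],
    'BsmtUnfSF': [618.0, 104.0],
})

living_ok = is_sum_of(living, 'GrLivArea', ['1stFlrSF', '2ndFlrSF', 'LowQualFinSF'])
basement_ok = is_sum_of(basement, 'TotalBsmtSF', ['BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF'])
print(living_ok, basement_ok)
```

A column that is an exact linear combination of others adds no information to an ordinary least squares fit, which is consistent with the nearly unchanged CV score after dropping it.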


In [23]:
# define features and target for modeling
X = base_features
y = ames['SalePrice']

In [24]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', base_features_list)
])

# fit the model
model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

model.fit(X, y)

In [25]:
# evaluate the model
cv_scores = cross_val_score(model, X, y)

In [26]:
mean_cv_score = float(round(cv_scores.mean(), 4))
print(f'numerical linear-linear model without MiscVal, TotalBsmtSF, and GrLivArea')
print(f'mean cv score: {mean_cv_score}')
print(f'cv scores: {cv_scores}')

numerical linear-linear model without MiscVal, TotalBsmtSF, and GrLivArea
mean cv score: 0.8278
cv scores: [0.79418863 0.83366325 0.85697319 0.82128648 0.83308856]


In [27]:
# append notebook_results
notebook_results.append({
    'Scenario': 'num linear-linear model, no mc features (MiscVal, TotalBsmtSF, and GrLivArea)',
    'Mean CV Score': mean_cv_score,
    'CV Scores': cv_scores.tolist()  # Convert to list for better display
})

In [28]:
# notebook_results

## log-linear base model

In [30]:
# define features and target for modeling
X = base_features
y = ames['SalePrice']

In [31]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', base_features_list)
])

In [32]:
# fit the model
model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

In [33]:
model.fit(X, np.log10(y))

In [34]:
# evaluate the model
cv_scores_log = cross_val_score(model, X, np.log10(y))

In [35]:
mean_cv_score_log = float(round(cv_scores_log.mean(), 4))
print(f'numerical log-linear model without MiscVal, TotalBsmtSF, and GrLivArea')
print(f'mean cv score: {mean_cv_score_log}')
print(f'cv scores: {cv_scores_log}')

numerical log-linear model without MiscVal, TotalBsmtSF, and GrLivArea
mean cv score: 0.8471
cv scores: [0.78969666 0.85385931 0.87261997 0.85933034 0.86020114]
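
Note that the scores above are R² values on the log10 scale, so they are not strictly comparable to the R² of the linear-linear model, which is measured in dollars. One option, sketched here on synthetic data as an assumption rather than as this notebook's method, is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target but predicts and scores on the original scale. `pow10` is a hypothetical helper name.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def pow10(z):
    """Inverse of np.log10."""
    return np.power(10.0, z)

# Synthetic positive target that is multiplicative in the features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
log_y = 5 + 0.1 * X_demo[:, 0] - 0.05 * X_demo[:, 1] + rng.normal(scale=0.02, size=300)
y_demo = 10 ** log_y

# Fits LinearRegression on log10(y); predictions are mapped back with pow10.
log_target_model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log10, inverse_func=pow10)

# R^2 on the original scale vs. R^2 on the log10 scale.
scores_original_scale = cross_val_score(log_target_model, X_demo, y_demo)
scores_log_scale = cross_val_score(LinearRegression(), X_demo, np.log10(y_demo))

print(scores_original_scale.mean(), scores_log_scale.mean())
```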


In [36]:
# append notebook_results
notebook_results.append({
    'Scenario': 'num log-linear model, no mc features (MiscVal, TotalBsmtSF, and GrLivArea)',
    'Mean CV Score': mean_cv_score_log,
    'CV Scores': cv_scores_log.tolist()  # Convert to list for better display
})

In [37]:
# notebook_results

### log-log model
Numerical model performance with a log transformation of the most highly skewed features:

- LotArea
- MasVnrArea
- WoodDeckSF
- OpenPorchSF
- EnclosedPorch
- 3SsnPorch
- ScreenPorch
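
How these features were ranked by skewness is not shown in this notebook. A minimal sketch of one common approach, assuming pandas' `DataFrame.skew()` and synthetic columns (not the Ames data):

```python
import numpy as np
import pandas as pd

# Synthetic columns: one symmetric, one right-skewed, one mostly zeros.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'symmetric': rng.normal(loc=50, scale=5, size=1000),
    'right_skewed': rng.lognormal(mean=0, sigma=1, size=1000),
    'zero_inflated': np.where(
        rng.random(1000) < 0.8, 0.0, rng.exponential(100, size=1000)),
})

# Rank columns from most to least skewed.
skewness = demo.skew().sort_values(ascending=False)
print(skewness)

# Skewness after a log10(1 + x) transform of the skewed columns.
after_log = np.log10(1 + demo[['right_skewed', 'zero_inflated']]).skew()
print(after_log)
```

Columns dominated by zeros, such as the porch areas, typically stay skewed even after the transform because the zeros remain a spike at 0.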

In [39]:
log_features = base_features[['LotArea', 'MasVnrArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch']].copy()
non_log_features = base_features[['LotFrontage', 'YearBuilt', 'YearRemodAdd',
       'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'PoolArea']].copy()
non_log_features

Unnamed: 0,LotFrontage,YearBuilt,YearRemodAdd,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,LowQualFinSF,BsmtFullBath,...,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,PoolArea
0,68.0,1939,1950,238.0,0.0,618.0,856,0,0,1.0,...,1,0,2,1,4,1,1939,2.0,399.0,0
1,42.0,1984,1984,552.0,393.0,104.0,1049,0,0,1.0,...,2,0,2,1,5,0,1984,1.0,266.0,0
2,60.0,1930,2007,737.0,0.0,100.0,1001,0,0,0.0,...,1,0,2,1,5,0,1930,1.0,216.0,0
3,80.0,1900,2003,0.0,0.0,405.0,717,322,0,0.0,...,1,0,2,1,6,0,1940,1.0,281.0,0
4,70.0,2001,2001,643.0,0.0,167.0,810,855,0,1.0,...,2,1,3,1,6,0,2001,2.0,528.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2575,68.0,1916,1950,0.0,0.0,952.0,952,0,0,0.0,...,1,0,2,1,4,1,1916,1.0,192.0,0
2576,68.0,1955,1955,0.0,0.0,0.0,1733,0,0,0.0,...,2,0,4,1,8,1,1955,2.0,452.0,0
2577,82.0,1949,1950,284.0,0.0,717.0,1001,1001,0,0.0,...,2,0,4,2,8,0,1949,3.0,871.0,0
2578,68.0,2000,2000,841.0,0.0,144.0,985,857,0,1.0,...,2,1,3,1,7,1,2000,2.0,486.0,0


In [40]:
log_features_list = log_features.columns.to_list()
non_log_features_list = non_log_features.columns.to_list()
log_features_list

['LotArea',
 'MasVnrArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch']

In [41]:
X1 = ames[log_features_list].join(ames[non_log_features_list])
X1

Unnamed: 0,LotArea,MasVnrArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,LotFrontage,YearBuilt,YearRemodAdd,...,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,PoolArea
0,7890,0.0,0,0,0,0,166,68.0,1939,1950,...,1,0,2,1,4,1,1939,2.0,399.0,0
1,4235,149.0,0,105,0,0,0,42.0,1984,1984,...,2,0,2,1,5,0,1984,1.0,266.0,0
2,6060,0.0,154,0,42,86,0,60.0,1930,2007,...,1,0,2,1,5,0,1930,1.0,216.0,0
3,8146,0.0,0,0,168,0,111,80.0,1900,2003,...,1,0,2,1,6,0,1940,1.0,281.0,0
4,8400,0.0,0,45,0,0,0,70.0,2001,2001,...,2,1,3,1,6,0,2001,2.0,528.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2575,8854,0.0,0,98,0,0,40,68.0,1916,1950,...,1,0,2,1,4,1,1916,1.0,192.0,0
2576,13680,0.0,0,0,0,0,0,68.0,1955,1955,...,2,0,4,1,8,1,1955,2.0,452.0,0
2577,6270,0.0,0,0,0,0,0,82.0,1949,1950,...,2,0,4,2,8,0,1949,3.0,871.0,0
2578,8826,144.0,193,96,0,0,0,68.0,2000,2000,...,2,1,3,1,7,1,2000,2.0,486.0,0


In [42]:
from sklearn.preprocessing import FunctionTransformer

In [43]:
# helper function for handling log10 of zero values
def log10_1p(x):
    return np.log10(1 + x)
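
A quick check of what this helper does, written as a standalone sketch: zeros map to zero (so zero-area porches stay at 0), and the transform can be undone with `10**z - 1`. `inverse_log10_1p` is a hypothetical name added for illustration.

```python
import numpy as np

def log10_1p(x):
    """log10(1 + x): defined at x = 0, unlike log10(x)."""
    return np.log10(1 + x)

def inverse_log10_1p(z):
    """Undo log10_1p."""
    return np.power(10.0, z) - 1

values = np.array([0.0, 9.0, 99.0, 7890.0])
transformed = log10_1p(values)
print(transformed)

# Equivalent formulation using NumPy's built-in log1p.
via_log1p = np.log1p(values) / np.log(10)

# Round trip back to the original values.
restored = inverse_log10_1p(transformed)
```

The last value, 7890, is the LotArea of the first row; its transform agrees with the 3.897132 shown in the transformed output further down.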

In [44]:
# Set up preprocessor
preprocessor1 = ColumnTransformer(
    transformers=[
        ('log', FunctionTransformer(log10_1p, validate=False), log_features_list),
        ('no_log', 'passthrough', non_log_features_list)
])

In [45]:
# fit the model
model1 = Pipeline(
    steps=[
        ('preprocessor1', preprocessor1),
        ('regressor', LinearRegression())
    ])

In [46]:
model1.fit(X1, np.log10(y))

In [47]:
X1[log_features_list]

Unnamed: 0,LotArea,MasVnrArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch
0,7890,0.0,0,0,0,0,166
1,4235,149.0,0,105,0,0,0
2,6060,0.0,154,0,42,86,0
3,8146,0.0,0,0,168,0,111
4,8400,0.0,0,45,0,0,0
...,...,...,...,...,...,...,...
2575,8854,0.0,0,98,0,0,40
2576,13680,0.0,0,0,0,0,0
2577,6270,0.0,0,0,0,0,0
2578,8826,144.0,193,96,0,0,0


In [48]:
log_transformed_data = \
    model1.named_steps['preprocessor1'].named_transformers_['log'].transform(X1[log_features_list])
log_transformed_data

Unnamed: 0,LotArea,MasVnrArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch
0,3.897132,0.000000,0.000000,0.000000,0.000000,0.000000,2.222716
1,3.626956,2.176091,0.000000,2.025306,0.000000,0.000000,0.000000
2,3.782544,0.000000,2.190332,0.000000,1.633468,1.939519,0.000000
3,3.910998,0.000000,0.000000,0.000000,2.227887,0.000000,2.049218
4,3.924331,0.000000,0.000000,1.662758,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...
2575,3.947189,0.000000,0.000000,1.995635,0.000000,0.000000,1.612784
2576,4.136118,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2577,3.797337,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2578,3.945813,2.161368,2.287802,1.986772,0.000000,0.000000,0.000000


In [49]:
# evaluate the model
cv1_scores_log = cross_val_score(model1, X1, np.log10(y))
cv1_scores_log

array([0.79006354, 0.85502   , 0.87826501, 0.86494615, 0.8651429 ])

In [50]:
mean_cv1_score_log = float(round(cv1_scores_log.mean(), 4))
print(f'numerical log-log (partial: most highly skewed features log transformed) model without MiscVal, TotalBsmtSF, and GrLivArea')
print(f'mean cv score log-log(partial) model: {mean_cv1_score_log}')
print(f'cv scores log-log(partial) model: {cv1_scores_log}')

numerical log-log (partial: most highly skewed features log transformed) model without MiscVal, TotalBsmtSF, and GrLivArea
mean cv score log-log(partial) model: 0.8507
cv scores log-log(partial) model: [0.79006354 0.85502    0.87826501 0.86494615 0.8651429 ]


In [51]:
# append notebook_results
notebook_results.append({
    'Scenario': 'num log-log (partial) model',
    'Mean CV Score': mean_cv1_score_log,
    'CV Scores': cv1_scores_log.tolist()  # Convert to list for better display
})

In [52]:
# notebook_results

## Compare model scores with component and composite porch features

In [54]:
# create a composite porch area feature
total_porch_features = base_features.copy()
total_porch_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1,1939,2.0,399.0,0,0,0,0,166,0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,0,1984,1.0,266.0,0,105,0,0,0,0


In [55]:
total_porch_features['TotalPorch'] = (
    total_porch_features['OpenPorchSF'] + 
    total_porch_features['EnclosedPorch'] +
    total_porch_features['3SsnPorch'] + 
    total_porch_features['ScreenPorch']
)
total_porch_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,TotalPorch
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1939,2.0,399.0,0,0,0,0,166,0,166
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,1984,1.0,266.0,0,105,0,0,0,0,105


In [56]:
total_porch_features.drop(columns=['OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch'], inplace=True)
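
The add-then-drop pattern used here recurs for the bathroom features below. As a sketch, it could be wrapped in a small helper; `add_composite` is a hypothetical name, and the rows are the first five rows displayed in this notebook.

```python
import pandas as pd

def add_composite(df, name, parts):
    """Return a copy of df with `name` = row-wise sum of `parts`, and `parts` dropped."""
    out = df.copy()
    out[name] = out[parts].sum(axis=1)
    return out.drop(columns=parts)

porch_parts = ['OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']

# First five rows as displayed in this notebook.
demo = pd.DataFrame({
    'WoodDeckSF': [0, 0, 154, 0, 0],
    'OpenPorchSF': [0, 105, 0, 0, 45],
    'EnclosedPorch': [0, 0, 42, 168, 0],
    '3SsnPorch': [0, 0, 86, 0, 0],
    'ScreenPorch': [166, 0, 0, 111, 0],
})

result = add_composite(demo, 'TotalPorch', porch_parts)
print(result['TotalPorch'].tolist())  # [166, 105, 128, 279, 45]
```

Returning a copy instead of using `inplace=True` keeps the source frame untouched, so the same base frame can feed several scenarios.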

In [57]:
total_porch_features

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,BedroomAbvGr,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,PoolArea,TotalPorch
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,2,1,4,1,1939,2.0,399.0,0,0,166
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,2,1,5,0,1984,1.0,266.0,0,0,105
2,60.0,6060,1930,2007,0.0,737.0,0.0,100.0,1001,0,...,2,1,5,0,1930,1.0,216.0,154,0,128
3,80.0,8146,1900,2003,0.0,0.0,0.0,405.0,717,322,...,2,1,6,0,1940,1.0,281.0,0,0,279
4,70.0,8400,2001,2001,0.0,643.0,0.0,167.0,810,855,...,3,1,6,0,2001,2.0,528.0,0,0,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2575,68.0,8854,1916,1950,0.0,0.0,0.0,952.0,952,0,...,2,1,4,1,1916,1.0,192.0,0,0,138
2576,68.0,13680,1955,1955,0.0,0.0,0.0,0.0,1733,0,...,4,1,8,1,1955,2.0,452.0,0,0,0
2577,82.0,6270,1949,1950,0.0,284.0,0.0,717.0,1001,1001,...,4,2,8,0,1949,3.0,871.0,0,0,0
2578,68.0,8826,2000,2000,144.0,841.0,0.0,144.0,985,857,...,3,1,7,1,2000,2.0,486.0,193,0,96


In [58]:
# Define base features list
total_porch_features_list = total_porch_features.columns.to_list()
total_porch_features_list

['LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'PoolArea',
 'TotalPorch']

In [59]:
# define features and target for modeling
X_totpor = total_porch_features
y = ames['SalePrice']

In [60]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', total_porch_features_list)
])

In [61]:
# fit the model
model_totpor = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

In [62]:
model_totpor.fit(X_totpor, y)

In [63]:
# evaluate the model
cv_scores = cross_val_score(model_totpor, X_totpor, y)

In [64]:
mean_cv_score = float(round(cv_scores.mean(), 4))
print(f'Model with composite TotalPorch feature, dropping 4 component porch features:')
print(f'mean cv score: {mean_cv_score}')
print(f'cv scores: {cv_scores}')

Model with composite TotalPorch feature, dropping 4 component porch features:
mean cv score: 0.8278
cv scores: [0.79252561 0.8328325  0.85868476 0.82228316 0.83279752]


In [65]:
# # append notebook_results
# notebook_results.append({
#     'Scenario': 'Model with composite TotalPorch feature, dropping 4 component porch features',
#     'Mean CV Score': mean_cv_score,
#     'CV Scores': cv_scores.tolist()  # Convert to list for better display
# })

# notebook_results

## Compare model scores with component and composite bathroom features

In [67]:
base_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1,1939,2.0,399.0,0,0,0,0,166,0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,0,1984,1.0,266.0,0,105,0,0,0,0


In [68]:
bath_features = base_features.copy()
bath_features['TotalBath'] = (
    bath_features['BsmtFullBath'] + 
    bath_features['BsmtHalfBath'] +
    bath_features['FullBath'] + 
    bath_features['HalfBath']
)
bath_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,TotalBath
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1939,2.0,399.0,0,0,0,0,166,0,2.0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,1984,1.0,266.0,0,105,0,0,0,0,3.0


In [69]:
bath_features.drop(columns=['BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath'], inplace=True)
bath_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,TotalBath
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1939,2.0,399.0,0,0,0,0,166,0,2.0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,1984,1.0,266.0,0,105,0,0,0,0,3.0


In [70]:
bath_features_list = bath_features.columns.to_list()
bath_features_list

['LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'TotalBath']

In [71]:
# define features and target for modeling
X_totbath = bath_features
y = ames['SalePrice']

In [72]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', bath_features_list)
])

In [73]:
# fit the model
model_totbath = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

In [74]:
model_totbath.fit(X_totbath, y)

In [75]:
# evaluate the model
cv_scores = cross_val_score(model_totbath, X_totbath, y)

In [76]:
mean_cv_score = float(round(cv_scores.mean(), 4))
print(f'Model with composite TotalBath feature, dropping 4 component bath features:')
print(f'mean cv score: {mean_cv_score}')
print(f'cv scores: {cv_scores}')

Model with composite TotalBath feature, dropping 4 component bath features:
mean cv score: 0.8283
cv scores: [0.79415271 0.83364377 0.85682397 0.82442745 0.83268379]


In [77]:
# # append notebook_results
# notebook_results.append({
#     'Scenario': 'Model with composite TotalBath feature, dropping 4 component bath features',
#     'Mean CV Score': mean_cv_score,
#     'CV Scores': cv_scores.tolist()  # Convert to list for better display
# })

# notebook_results

### Replacing all component features with composite features (bath and porch)

In [79]:
composite_base_features = base_features.copy()
composite_base_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1,1939,2.0,399.0,0,0,0,0,166,0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,0,1984,1.0,266.0,0,105,0,0,0,0


In [80]:
composite_base_features['TotalPorch'] = (
    composite_base_features['OpenPorchSF'] + 
    composite_base_features['EnclosedPorch'] +
    composite_base_features['3SsnPorch'] + 
    composite_base_features['ScreenPorch']
)

composite_base_features['TotalBath'] = (
    composite_base_features['BsmtFullBath'] + 
    composite_base_features['BsmtHalfBath'] +
    composite_base_features['FullBath'] + 
    composite_base_features['HalfBath']
)
composite_base_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,TotalPorch,TotalBath
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,2.0,399.0,0,0,0,0,166,0,166,2.0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,1.0,266.0,0,105,0,0,0,0,105,3.0


In [81]:
composite_base_features.drop(
    columns=[
        'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 
        'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath'
    ], inplace=True)

In [82]:
composite_base_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,PoolArea,TotalPorch,TotalBath
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1,4,1,1939,2.0,399.0,0,0,166,2.0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,1,5,0,1984,1.0,266.0,0,0,105,3.0


In [83]:
# define features and target for modeling
X_composite = composite_base_features.copy()
y = ames['SalePrice']

In [84]:
composite_base_features_list = composite_base_features.columns.to_list()

In [85]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', composite_base_features_list)
])

In [86]:
# fit the model
model_comp = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

In [87]:
model_comp.fit(X_composite, y)

In [88]:
# evaluate the model
cv_scores = cross_val_score(model_comp, X_composite, y)

In [89]:
mean_cv_score = float(round(cv_scores.mean(), 4))
print(f'Model with both composite Bath and Porch features, dropping 8 component features:')
print(f'mean cv score: {mean_cv_score}')
print(f'cv scores: {cv_scores}')

Model with both composite Bath and Porch features, dropping 8 component features:
mean cv score: 0.8284
cv scores: [0.79250939 0.83286663 0.85876069 0.82534545 0.83239619]
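
The numerical scenarios above all repeat the same steps: build a passthrough preprocessor, wrap it in a pipeline, cross-validate, and record the result. A sketch of a helper that bundles those steps follows; `evaluate_scenario` is a hypothetical name and the data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

def evaluate_scenario(scenario, X, y, results):
    """Cross-validate a passthrough + LinearRegression pipeline and record it."""
    preprocessor = ColumnTransformer(
        transformers=[('numerical', 'passthrough', X.columns.to_list())])
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression()),
    ])
    cv_scores = cross_val_score(model, X, y)
    record = {
        'Scenario': scenario,
        'Mean CV Score': float(round(cv_scores.mean(), 4)),
        'CV Scores': cv_scores.tolist(),
    }
    results.append(record)
    return record

# Synthetic demonstration data (illustrative only).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 2)), columns=['a', 'b'])
y_demo = 3 * X_demo['a'] - X_demo['b'] + rng.normal(scale=0.1, size=100)

demo_results = []
record = evaluate_scenario('synthetic demo', X_demo, y_demo, demo_results)
print(record['Mean CV Score'])
```

With a helper like this, each scenario becomes a single call, and a scenario can be kept out of the summary by passing a throwaway list instead of commenting out the append.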


In [90]:
# # append notebook_results
# notebook_results.append({
#     'Scenario': 'Model with both composite Bath and Porch features, dropping 8 component features',
#     'Mean CV Score': mean_cv_score,
#     'CV Scores': cv_scores.tolist()  # Convert to list for better display
# })

# notebook_results

# Numerical and Categorical Models
[Contents](#Contents)

### Define categorical features

In [93]:
print(f'number of categorical features: {len(ames.select_dtypes(include=["object"]).columns)}')
categorical_features_list = ames.select_dtypes(include=['object']).columns.to_list()
ames.select_dtypes(include=['object']).columns

number of categorical features: 48


Index(['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour',
       'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
       'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu',
       'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
       'PoolQC', 'Fence', 'MiscFeature', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition'],
      dtype='object')

In [94]:
categorical_features = ames[categorical_features_list].copy()
categorical_features.head(2)

Unnamed: 0,MSSubClass,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,...,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,MoSold,YrSold,SaleType,SaleCondition
0,30,RL,Pave,,Reg,Lvl,AllPub,Corner,Gtl,SWISU,...,TA,TA,Y,,,,3,2010,WD,Normal
1,120,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,Edwards,...,TA,TA,Y,,,,2,2009,WD,Normal


### Base model with everything

numerical and categorical features, including mc features

In [96]:
X = numerical_features.drop('SalePrice', axis=1).join(categorical_features)
X.head(2)

Unnamed: 0,GrLivArea,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,MoSold,YrSold,SaleType,SaleCondition
0,856,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856.0,...,TA,TA,Y,,,,3,2010,WD,Normal
1,1049,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049.0,...,TA,TA,Y,,,,2,2009,WD,Normal


In [97]:
y = ames['SalePrice']

In [98]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', numerical_features.drop('SalePrice', axis=1).columns.to_list()),
        ('categorical', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features_list)
])

In [99]:
# fit the model
model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

In [100]:
model.fit(X, y)

In [101]:
cv_scores = cross_val_score(model, X, y)



In [102]:
mean_cv_score = float(round(cv_scores.mean(), 4))
print(f'num-cat linear-linear model with everything')
print(f'mean cv score: {mean_cv_score}')
print(f'cv scores: {cv_scores}')

num-cat linear-linear model with everything
mean cv score: 0.914
cv scores: [0.86743571 0.92265047 0.93065861 0.92360722 0.92584001]


In [103]:
# append notebook_results
notebook_results.append({
    'Scenario': 'num-cat linear-linear model, all features',
    'Mean CV Score': mean_cv_score,
    'CV Scores': cv_scores.tolist()  # Convert to list for better display
})

In [104]:
# notebook_results

# base model on train test data

In [106]:
X.head(2)

Unnamed: 0,GrLivArea,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,MoSold,YrSold,SaleType,SaleCondition
0,856,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856.0,...,TA,TA,Y,,,,3,2010,WD,Normal
1,1049,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049.0,...,TA,TA,Y,,,,2,2009,WD,Normal


In [107]:
numerical_features.drop('SalePrice', axis=1).columns.to_list()

['GrLivArea',
 'LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 'TotalBsmtSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea',
 'MiscVal']

In [108]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 0)

In [109]:
# Set up preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', numerical_features.drop('SalePrice', axis=1).columns.to_list()),
        ('categorical', OneHotEncoder(drop='first', handle_unknown='ignore', sparse_output=False), categorical_features_list)
])

# build the pipeline (fitting happens in the next cell)
model = Pipeline(
    steps=[
        ('preprocessor', preprocessor),
        ('regressor', LinearRegression())
    ])

In [110]:
# fit model on training data
model.fit(X_train, y_train)

In [111]:
# R^2 on the training data
print(f'score for training data: {model.score(X_train, y_train)}')

score for training data: 0.9443870958025232


In [112]:
print(f'score for test data: {model.score(X_test, y_test)}')

score for test data: 0.9106981825104055




# log-linear model, without mc, composite features

In [114]:
# numerical features for base model, without multicollinear features
base_features.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,Fireplaces,GarageYrBlt,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,1,1939,2.0,399.0,0,0,0,0,166,0
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,0,1984,1.0,266.0,0,105,0,0,0,0


In [115]:
base_features_list

['LotFrontage',
 'LotArea',
 'YearBuilt',
 'YearRemodAdd',
 'MasVnrArea',
 'BsmtFinSF1',
 'BsmtFinSF2',
 'BsmtUnfSF',
 '1stFlrSF',
 '2ndFlrSF',
 'LowQualFinSF',
 'BsmtFullBath',
 'BsmtHalfBath',
 'FullBath',
 'HalfBath',
 'BedroomAbvGr',
 'KitchenAbvGr',
 'TotRmsAbvGrd',
 'Fireplaces',
 'GarageYrBlt',
 'GarageCars',
 'GarageArea',
 'WoodDeckSF',
 'OpenPorchSF',
 'EnclosedPorch',
 '3SsnPorch',
 'ScreenPorch',
 'PoolArea']

In [116]:
# define features and target for modeling, numeric and categorical
X2 = ames[base_features_list].join(ames[categorical_features_list])
y = ames['SalePrice']

In [117]:
X2.head(2)

Unnamed: 0,LotFrontage,LotArea,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,1stFlrSF,2ndFlrSF,...,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,MoSold,YrSold,SaleType,SaleCondition
0,68.0,7890,1939,1950,0.0,238.0,0.0,618.0,856,0,...,TA,TA,Y,,,,3,2010,WD,Normal
1,42.0,4235,1984,1984,149.0,552.0,393.0,104.0,1049,0,...,TA,TA,Y,,,,2,2009,WD,Normal


In [118]:
# Set up preprocessor
preprocessor2 = ColumnTransformer(
    transformers=[
        ('numerical', 'passthrough', base_features_list),
        ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'), categorical_features_list)
])

In [119]:
# build the pipeline (fitting happens in the next cell)
model2 = Pipeline(
    steps=[
        ('preprocessor2', preprocessor2),
        ('regressor', LinearRegression())
    ])

In [120]:
model2.fit(X2, np.log10(y))

In [121]:
# evaluate the model
cv_scores_with_cat = cross_val_score(model2, X2, np.log10(y))



In [122]:
mean_cv_score_with_cat = float(round(cv_scores_with_cat.mean(), 4))
print(f'numerical-categorical log-linear base model (no composite mc features)')
print(f'mean cv score log model: {mean_cv_score_with_cat}')
print(f'cv scores log model: {cv_scores_with_cat}')

numerical-categorical log-linear base model (no composite mc features)
mean cv score log model: 0.9156
cv scores log model: [0.87677974 0.91924517 0.92318485 0.93136373 0.92744137]
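
These scores are R² on the log10(SalePrice) scale, so they are not directly comparable with the linear-linear scores, which are R² in dollars. One option (an assumption on my part, not something the notebook does) is scikit-learn's `TransformedTargetRegressor`, which fits on the transformed target and inverts predictions before scoring. Here is a minimal sketch on synthetic data; `toy_X`, `toy_price`, and `inverse_log10` are hypothetical names.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def inverse_log10(values):
    """Undo np.log10 so predictions come back on the original scale."""
    return np.power(10.0, values)


# Synthetic target that is linear in log10 space (not the Ames data)
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 1, size=(200, 2))
toy_price = 10 ** (5 + 0.4 * toy_X[:, 0] + 0.2 * toy_X[:, 1] + rng.normal(0, 0.02, 200))

# Fits LinearRegression on log10(y), predicts back on the original scale
log_target_model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log10,
    inverse_func=inverse_log10,
)

# R^2 measured on the original scale
original_scale_scores = cross_val_score(log_target_model, toy_X, toy_price)

# R^2 measured on the log10 scale, as in the cell above
log_scale_scores = cross_val_score(LinearRegression(), toy_X, np.log10(toy_price))

print(f'original-scale mean R^2: {original_scale_scores.mean():.4f}')
print(f'log-scale mean R^2:      {log_scale_scores.mean():.4f}')
```

In this notebook the `regressor` argument could presumably be the whole `model2` pipeline, with `X2` and the untransformed `y` passed to `cross_val_score`.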


In [123]:
# append notebook_results
notebook_results.append({
    'Scenario': 'num-cat log-linear model, without mc, composite features',
    'Mean CV Score': mean_cv_score_with_cat,
    'CV Scores': cv_scores_with_cat.tolist()  # Convert to list for better display
})

In [124]:
# notebook_results

In [125]:
# Create summary table
summary_df = pd.DataFrame(notebook_results)

# view summary of models
print("Summary of All Scenarios:")
summary_df

Summary of All Scenarios:


Unnamed: 0,Scenario,Mean CV Score,CV Scores
0,"num linear-linear model, including mc features",0.8277,"[0.793843816346853, 0.8336217009930801, 0.8568616477472124, 0.8213162382700148, 0.8328784733922474]"
1,"num linear-linear model, no mc features (MiscVal, TotalBsmtSF, and GrLivArea)",0.8278,"[0.7941886310042306, 0.8336632492234609, 0.8569731930524078, 0.8212864751095819, 0.8330885584671347]"
2,"num log-linear model, no mc features (MiscVal, TotalBsmtSF, and GrLivArea)",0.8471,"[0.7896966581825936, 0.8538593096400977, 0.872619972201618, 0.859330336412344, 0.8602011438920767]"
3,num log-log (partial) model,0.8507,"[0.7900635444096901, 0.8550199958295697, 0.878265009992304, 0.864946145859404, 0.8651428985520748]"
4,"num-cat linear-linear model, all features",0.914,"[0.8674357122978216, 0.9226504668303502, 0.9306586050446611, 0.9236072194536997, 0.9258400058128113]"
5,"num-cat log-linear model, without mc, composite features",0.9156,"[0.8767797355898818, 0.9192451722630199, 0.9231848490817174, 0.931363730545358, 0.9274413692737092]"
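
The top two mean scores differ by less than 0.002, while individual folds within a scenario vary by several points, so the fold-to-fold spread is worth showing next to the mean. A minimal sketch that adds a standard-deviation column and sorts by mean score follows. It uses three rows copied from the table above; `results_subset` and the `CV Std` column are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Three rows copied from the summary table; notebook_results has the same structure
results_subset = [
    {'Scenario': 'num linear-linear model, including mc features',
     'Mean CV Score': 0.8277,
     'CV Scores': [0.793843816346853, 0.8336217009930801, 0.8568616477472124,
                   0.8213162382700148, 0.8328784733922474]},
    {'Scenario': 'num-cat linear-linear model, all features',
     'Mean CV Score': 0.914,
     'CV Scores': [0.8674357122978216, 0.9226504668303502, 0.9306586050446611,
                   0.9236072194536997, 0.9258400058128113]},
    {'Scenario': 'num-cat log-linear model, without mc, composite features',
     'Mean CV Score': 0.9156,
     'CV Scores': [0.8767797355898818, 0.9192451722630199, 0.9231848490817174,
                   0.931363730545358, 0.9274413692737092]},
]

summary = pd.DataFrame(results_subset)

# Spread across the five folds of each scenario
summary['CV Std'] = summary['CV Scores'].apply(lambda scores: float(np.std(scores))).round(4)

# Best mean score first
summary = summary.sort_values('Mean CV Score', ascending=False).reset_index(drop=True)

print(summary[['Scenario', 'Mean CV Score', 'CV Std']])
```

The same two lines could be applied to `summary_df` directly, since it already has the `CV Scores` and `Mean CV Score` columns.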


Need more num-cat models?

# Resources
[Return To Top](#Contents)

**Dean De Cock paper and original data:**

- [Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester Regression Project](https://jse.amstat.org/v19n3/decock.pdf)

- [DataDocumentation.txt](https://jse.amstat.org/v19n3/decock/DataDocumentation.txt)

- [Ames Data Dictionary on Github](https://github.com/Padre-Media/dataset/blob/main/Ames%20Data%20Dictionary.txt)