## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
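
One way to try several baselines side by side is to loop over a dictionary of untuned estimators and collect the scores in one table. The sketch below is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the housing data, and the names `baselines` and `baseline_results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): swap in the real train/test frames.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Untuned models; random_state is fixed only so the run is repeatable.
# Inputs are scaled for the models that are sensitive to feature scale.
baselines = {
    "linear_regression": make_pipeline(StandardScaler(), LinearRegression()),
    "svr": make_pipeline(StandardScaler(), SVR()),
    "random_forest": RandomForestRegressor(random_state=0),
}

rows = []
for name, model in baselines.items():
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)  # predict once, reuse for every metric
    rows.append({
        "model": name,
        "mae": mean_absolute_error(y_te, pred),
        "r2": r2_score(y_te, pred),
    })

# One row per model, best R^2 first.
baseline_results = pd.DataFrame(rows).set_index("model").sort_values("r2", ascending=False)
print(baseline_results)
```

Keeping every baseline in one table makes it easier to see which model families deserve tuning later.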

In [25]:
# import models and fit
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
# load data
train_df = pd.read_csv('../preprocessed/train_df.csv')
test_df = pd.read_csv('../preprocessed/test_df.csv')


In [26]:
x_train = train_df.drop(columns=['description.sold_price'])
y_train = train_df['description.sold_price']
x_test = test_df.drop(columns=['description.sold_price'])
y_test = test_df['description.sold_price']
numerical_features = x_train.select_dtypes(include=['float64', 'int64']).columns
categorical_features = x_train.select_dtypes(include=['object']).columns
boolean_features = x_train.select_dtypes(include=['bool']).columns

print(numerical_features)
print(categorical_features)
print(boolean_features)

Index(['list_price', 'price_reduced_amount', 'description.year_built',
       'description.baths_3qtr', 'description.baths_full',
       'description.baths_half', 'description.lot_sqft', 'description.sqft',
       'description.baths', 'description.garage', 'description.stories',
       'description.beds', 'central_air', 'dishwasher', 'fireplace',
       'forced_air', 'hardwood_floors', 'washer_dryer', 'basement',
       'single_story', 'garage_1_or_more', 'garage_2_or_more', 'dining_room',
       'two_or_more_stories', 'shopping', 'family_room', 'central_heat',
       'laundry_room', 'recreation_facilities', 'view',
       'community_outdoor_space', 'city_view', 'community_security_features',
       'city_mean_price', 'state_mean_price'],
      dtype='object')
Index(['description.sub_type', 'description.type'], dtype='object')
Index([], dtype='object')


In [27]:
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(sparse_output=False,handle_unknown='ignore')
boolean_transformer = OneHotEncoder(drop='if_binary', handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features)
    ])

lr = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=0.95)),  # Retain 95% of variance
    ('model', LinearRegression())
])

lr.fit(x_train, y_train)

In [28]:
y_pred_lr = lr.predict(x_test)  # predict once and reuse for every metric
print(mean_squared_error(y_test, y_pred_lr))
print(mean_absolute_error(y_test, y_pred_lr))
print(r2_score(y_test, y_pred_lr))


580663960687.1451
110806.13679255564
0.12765361149169596


In [7]:
svm_pca = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('pca', PCA(n_components=0.95)),  # Retain 95% of variance
    ('model', SVR())
])

# Fit the pipeline
svm_pca.fit(x_train, y_train)

# Evaluate the model
y_pred = svm_pca.predict(x_test)
print(svm_pca.score(x_test, y_test))
print(mean_squared_error(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
print(r2_score(y_test, y_pred))

-0.03598146730111651
116747246944.63228
204337.18007100347
-0.03598146730111651


In [8]:
tree_based_preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features)
    ])

In [11]:
rf = Pipeline(steps=[('model', RandomForestRegressor())])
rf.fit(x_train, y_train)
y_pred_rf = rf.predict(x_test)  # predict once and reuse for every metric
print(mean_squared_error(y_test, y_pred_rf))
print(mean_absolute_error(y_test, y_pred_rf))
print(r2_score(y_test, y_pred_rf))


4282580390.934271
24378.20427159555
0.9619976142192099


In [12]:
xgb = Pipeline(steps=[('model', XGBRegressor())])
xgb.fit(x_train, y_train)
y_pred_xgb = xgb.predict(x_test)  # predict once and reuse for every metric
print(mean_squared_error(y_test, y_pred_xgb))
print(mean_absolute_error(y_test, y_pred_xgb))
print(r2_score(y_test, y_pred_xgb))


4061674497.779681
30978.783349153036
0.9639578695341322


Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars. Can you actually relate to that amount of error?
- Try root mean squared error (RMSE) so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
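
To make the trade-offs between these metrics concrete, the sketch below computes RMSE, MAE, R^2, and adjusted R^2 for two sets of made-up predictions. The helper `regression_report` and the toy prices are hypothetical and not taken from the dataset. Adjusted R^2 uses the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))  # back in dollars
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    # Adjusted R^2 penalises R^2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}


# Toy sale prices in dollars (made-up numbers).
y_true = [200_000, 310_000, 450_000, 520_000, 1_000_000]
y_even = [210_000, 300_000, 440_000, 530_000, 990_000]     # every prediction off by $10k
y_outlier = [200_000, 310_000, 450_000, 520_000, 950_000]  # one prediction off by $50k

report_even = regression_report(y_true, y_even, n_features=2)
report_outlier = regression_report(y_true, y_outlier, n_features=2)
print(report_even)
print(report_outlier)
```

Both prediction sets have an MAE of $10,000, but the RMSE rises from $10,000 to about $22,361 when the same total error is concentrated in one house. This is the sense in which RMSE weights outliers more heavily than MAE.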

In [6]:
# gather evaluation metrics and compare results

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended that you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
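
As a rough illustration of two of the approaches named above, the sketch below applies Lasso-based selection (through `SelectFromModel`) and RFE to synthetic data in which only 5 of 20 features carry signal. The data, the `alpha` value, and the variable names are assumptions chosen for the example, not settings tuned for the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): 20 features, only 5 of which drive the target.
X, y = make_regression(n_samples=400, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Lasso shrinks weak coefficients to exactly zero; keep the non-zero ones.
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X, y)
lasso_mask = lasso_selector.get_support()

# RFE repeatedly drops the weakest feature until 5 remain.
rfe_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X, y)
rfe_mask = rfe_selector.get_support()

# Compare cross-validated R^2 with all features and with the Lasso subset.
# For brevity the selector was fit on all rows; in a real workflow, put it
# inside the Pipeline so it is refit on each training fold.
full_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
reduced_r2 = cross_val_score(LinearRegression(), X[:, lasso_mask], y,
                             cv=5, scoring="r2").mean()

print("Lasso kept", int(lasso_mask.sum()), "features; RFE kept", int(rfe_mask.sum()))
print("R^2 all features:", full_r2, "| R^2 reduced:", reduced_r2)
```

If the reduced score is about the same as the full score, that is an argument for keeping the selection step and the simpler model.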



In [17]:
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
# Define the pipeline
poly_features = PolynomialFeatures(degree=3)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features),
        ('bool', boolean_transformer, boolean_features),
        ('poly', poly_features, numerical_features)
    ])
lr_fs = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('fs', SelectKBest(score_func=f_regression)),
    ('model', Lasso())  # Note the step name 'model'
])

# Define the parameter grid with the correct step name
param_grid = { 
    'model__fit_intercept': [True, False],  # Use 'model__fit_intercept' to match the step name
}

# Initialize GridSearchCV
grid_search = GridSearchCV(lr_fs, param_grid, cv=5, scoring='r2')

# Fit the GridSearchCV
grid_search.fit(x_train, y_train)

# Evaluate the model
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

# Access the best estimator
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(x_test)

# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))


  model = cd_fast.enet_coordinate_descent(


Best parameters found:  {'model__fit_intercept': True}
Best cross-validation score:  0.5741210057940705
Mean Squared Error: 42463500132.923225
Mean Absolute Error: 122810.70849220925
R^2 Score: 0.6231911216261056




In [20]:
svm_fs = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('fs', SelectKBest(score_func=f_regression)),
    ('model', SVR())
])
svm_param_grid = {
    'fs__k': [90,100,110,120,130,140],  # Number of features to select
    'model__C': [True, False]  # NOTE: C should be a positive float; True/False behave like 1/0, and C=0 is not valid, which is presumably why half the fits fail below
}
grid_search = GridSearchCV(svm_fs, svm_param_grid, cv=5, scoring='r2')
grid_search.fit(x_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_
predictions = best_model.predict(x_test)
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))


5 fits failed out of a total of 10.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/LHL/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/envs/LHL/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/LHL/lib/python3.11/site-packages/sklearn/pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/opt/anaconda3/envs/LHL/lib/python3

Best parameters found:  {'model__C': True}
Best cross-validation score:  -0.04481682284185062
Mean Squared Error: 116785084350.10847
Mean Absolute Error: 204341.20200574267
R^2 Score: -0.03631722554698502


In [22]:
rf_fs = Pipeline(steps=[('model', RandomForestRegressor())])
rf_param_grid = {
    'model__n_estimators': [100,200,300,400,500],
    'model__max_depth': [None, 10, 20, 30, 40, 50]
}
grid_search = GridSearchCV(rf_fs, rf_param_grid, cv=5, scoring='r2',n_jobs=-1)
grid_search.fit(x_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_
predictions = best_model.predict(x_test)
print(mean_squared_error(y_test, predictions))
print(mean_absolute_error(y_test, predictions))
print(r2_score(y_test, predictions))


Best parameters found:  {'model__max_depth': 50, 'model__n_estimators': 500}
Best cross-validation score:  0.9265245800609984
4002373145.6827397
23714.91653108996
0.9644840926645797


In [24]:
xgb_fs = Pipeline(steps=[('model', XGBRegressor())])
xgb_param_grid = {
    'model__n_estimators': [100,200,300,400,500],
    'model__max_depth': [None, 10, 20, 30, 40, 50],
    'model__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}
grid_search = GridSearchCV(xgb_fs, xgb_param_grid, cv=5, scoring='r2',n_jobs=-1)
grid_search.fit(x_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_
predictions = best_model.predict(x_test)
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))



Best parameters found:  {'model__learning_rate': 0.2, 'model__max_depth': 20, 'model__n_estimators': 500}
Best cross-validation score:  0.9353939750552271
Mean Squared Error: 3495682132.863482
Mean Absolute Error: 13414.526209066516
R^2 Score: 0.9689803228769952
