## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [1]:
# import models and fit
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
# load data
train_df = pd.read_csv('../preprocessed/train_df.csv')
test_df = pd.read_csv('../preprocessed/test_df.csv')


In [2]:
x_train = train_df.drop(columns=['description.sold_price'])
y_train = train_df['description.sold_price']
x_test = test_df.drop(columns=['description.sold_price'])
y_test = test_df['description.sold_price']
numerical_features = x_train.select_dtypes(include=['float64', 'int64']).columns
categorical_features = x_train.select_dtypes(include=['object']).columns



In [3]:
# check the columns between train and test
print(set(x_train.columns) - set(x_test.columns))
print(set(x_test.columns) - set(x_train.columns))


set()
set()


In [4]:
numerical_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

lr = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', LinearRegression())])

lr.fit(x_train, y_train)

In [8]:
lr.score(x_test, y_test)
print(mean_squared_error(y_test, lr.predict(x_test)))
print(mean_absolute_error(y_test, lr.predict(x_test)))
print(r2_score(y_test, lr.predict(x_test)))


2.024120436180816e+29
24513166090421.414
-3.040888141772492e+17


In [9]:
svm = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', SVR())])
svm.fit(x_train, y_train)
svm.score(x_test, y_test)
print(mean_squared_error(y_test, svm.predict(x_test)))
print(mean_absolute_error(y_test, svm.predict(x_test)))
print(r2_score(y_test, svm.predict(x_test)))

676881446690.8846
236579.82106384833
-0.016896390074414125


In [10]:
rf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', RandomForestRegressor())])
rf.fit(x_train, y_train)
rf.score(x_test, y_test)
print(mean_squared_error(y_test, rf.predict(x_test)))
print(mean_absolute_error(y_test, rf.predict(x_test)))
print(r2_score(y_test, rf.predict(x_test)))


538196026901.375
27095.87866775572
0.1914542796467955


In [11]:
xgb = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', XGBRegressor())])
xgb.fit(x_train, y_train)
xgb.score(x_test, y_test)
print(mean_squared_error(y_test, xgb.predict(x_test)))
print(mean_absolute_error(y_test, xgb.predict(x_test)))
print(r2_score(y_test, xgb.predict(x_test)))


540091825116.29596
32841.05340771593
0.18860617327530704


Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared dollars. Can we actually relate to that amount of error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
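
The questions above can be explored with a small, self-contained sketch. It uses made-up prices (not the notebook's data) and plain-Python helper functions whose names are chosen here for illustration; `sklearn.metrics` offers equivalents for MSE, MAE, and R^2. The two prediction sets have the same MAE, but the one with a single large miss has a higher RMSE, which is how RMSE penalizes outliers.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target (dollars)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def mae(y_true, y_pred):
    """Mean absolute error: every dollar of error counts equally."""
    absolute = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute) / len(absolute)

def r2(y_true, y_pred):
    """Share of target variance explained; negative means worse than predicting the mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2_value, n_samples, n_features):
    """R^2 penalized for the number of features used."""
    return 1 - (1 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical sale prices, for illustration only.
y_true = [200_000, 300_000, 400_000, 500_000]
evenly_off = [210_000, 290_000, 410_000, 490_000]    # every prediction off by 10k
one_big_miss = [200_000, 300_000, 400_000, 460_000]  # three exact, one off by 40k

for label, y_pred in [("evenly off", evenly_off), ("one big miss", one_big_miss)]:
    print(
        f"{label}: MAE={mae(y_true, y_pred):,.0f}  "
        f"RMSE={rmse(y_true, y_pred):,.0f}  R^2={r2(y_true, y_pred):.3f}"
    )
```

With one-hot encoded data the feature count can be large relative to the sample count, so adjusted R^2 can sit well below plain R^2; `adjusted_r2` makes that gap easy to check.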

In [6]:
# gather evaluation metrics and compare results
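
One possible way to fill in this cell is sketched below. It assumes each model exposes a `predict` method, as the fitted pipelines above do. The helper names `evaluate` and `compare_models` and the `ConstantModel` stand-in are hypothetical and exist only so the sketch runs without the notebook's data.

```python
import math

def evaluate(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for one set of predictions."""
    n = len(y_true)
    errors = [float(t) - float(p) for t, p in zip(y_true, y_pred)]
    mean_true = sum(float(t) for t in y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((float(t) - mean_true) ** 2 for t in y_true)
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }

def compare_models(models, x_test, y_test):
    """Score every fitted model on the same test set, best (lowest RMSE) first."""
    rows = []
    for name, model in models.items():
        row = {"model": name}
        row.update(evaluate(y_test, model.predict(x_test)))
        rows.append(row)
    return sorted(rows, key=lambda row: row["rmse"])

class ConstantModel:
    """Hypothetical stand-in that always predicts the same value."""
    def __init__(self, value):
        self.value = value

    def predict(self, x):
        return [self.value for _ in x]

# Toy data, for illustration only.
x_demo = [[1], [2], [3], [4]]
y_demo = [100.0, 200.0, 300.0, 400.0]
demo_models = {"predict_zero": ConstantModel(0.0), "predict_mean": ConstantModel(250.0)}

results = compare_models(demo_models, x_demo, y_demo)
for row in results:
    print(f"{row['model']:>12}  RMSE={row['rmse']:.1f}  MAE={row['mae']:.1f}  R^2={row['r2']:.2f}")
```

In the notebook, the same function could be called as `compare_models({'lr': lr, 'svm': svm, 'rf': rf, 'xgb': xgb}, x_test, y_test)`, and the returned rows passed to `pd.DataFrame` for display.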

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
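
As a rough illustration of univariate filtering, the sketch below ranks features by the absolute value of their Pearson correlation with the target and keeps the top `k`. For a single feature this gives the same ordering as the F-statistic that `f_regression` computes, since that statistic grows with the squared correlation. The feature names and values are made up, and the helpers `pearson_r` and `top_k_features` are hypothetical, not part of scikit-learn.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def top_k_features(columns, target, k):
    """Rank features by |correlation with target| and keep the k strongest."""
    scores = {name: abs(pearson_r(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Made-up columns, for illustration only.
columns = {
    "sqft": [1000, 1500, 2000, 2500, 3000],
    "beds": [2, 3, 3, 4, 5],
    "noise": [7, 1, 9, 3, 5],
}
price = [200_000, 290_000, 410_000, 500_000, 610_000]

selected, scores = top_k_features(columns, price, k=2)
print("kept:", selected)
for name, score in scores.items():
    print(f"{name}: |r| = {score:.3f}")
```

A filter like this only sees linear, one-feature-at-a-time relationships, so it can discard features that matter to tree-based models through interactions. That is one reason to compare refit performance before committing to it in the final pipeline.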



In [27]:

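# Imports needed for feature selection and grid search
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.model_selection import GridSearchCV

# Define the pipeline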
lr_fs = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('fs', SelectKBest(score_func=f_regression)),
    ('model', LinearRegression())  # Note the step name 'model'
])

# Define the parameter grid with the correct step name
param_grid = {
    'fs__k': [130,140,150,160,170,180,190,200],  # Number of features to select
    'model__fit_intercept': [True, False]  # Use 'model__fit_intercept' to match the step name
}

# Initialize GridSearchCV
grid_search = GridSearchCV(lr_fs, param_grid, cv=5, scoring='r2')

# Fit the GridSearchCV
grid_search.fit(x_train, y_train)

# Report the best hyperparameters and their mean cross-validation score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)

# Access the best estimator
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(x_test)

# Evaluate the model
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))


Best parameters found:  {'fs__k': 170, 'model__fit_intercept': True}
Best cross-validation score:  0.9084188227877016
Mean Squared Error: 548928843925.9778
Mean Absolute Error: 61421.32482373883
R^2 Score: 0.1753300928471654


In [33]:
svm_fs = Pipeline(steps=[('preprocessor', preprocessor),
                        ('fs', SelectKBest(score_func=f_regression)),
                      ('model', SVR())])
svm_param_grid = {
    'fs__k': [90,100,110,120,130,140],  # Number of features to select
    'model__C': [True, False]  # NOTE: C must be a positive float; True acts as 1.0, but False (0) is invalid, so half the fits fail
}
grid_search = GridSearchCV(svm_fs, svm_param_grid, cv=5, scoring='r2',n_jobs=-1)
grid_search.fit(x_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_
predictions = best_model.predict(x_test)
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))


30 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/envs/LHL/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/envs/LHL/lib/python3.11/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/LHL/lib/python3.11/site-packages/sklearn/pipeline.py", line 473, in fit
    self._final_estimator.fit(Xt, y, **last_step_params["fit"])
  File "/opt/anaconda3/envs/LHL/lib/pytho

Best parameters found:  {'fs__k': 90, 'model__C': True}
Best cross-validation score:  -0.04602343008841343
Mean Squared Error: 676841446287.0402
Mean Absolute Error: 236513.0231623813
R^2 Score: -0.016836296439893816


In [34]:
rf_fs = Pipeline(steps=[('preprocessor', preprocessor),
                        ('fs', SelectKBest(f_regression, k=10)),
                      ('model', RandomForestRegressor())])
rf_param_grid = {
    'fs__k': [130,140,150,160,170,180,190,200],  # Number of features to select
    'model__n_estimators': [100,200,300,400,500],
    'model__max_depth': [None, 10, 20, 30, 40, 50]
}
grid_search = GridSearchCV(rf_fs, rf_param_grid, cv=5, scoring='r2',n_jobs=-1)
grid_search.fit(x_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_
predictions = best_model.predict(x_test)
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))




Best parameters found:  {'fs__k': 140, 'model__max_depth': 20, 'model__n_estimators': 300}
Best cross-validation score:  0.9914964843736593


NotFittedError: This SelectKBest instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

In [35]:
xgb_fs = Pipeline(steps=[('preprocessor', preprocessor),
                        ('fs', SelectKBest(f_regression, k=10)),
                      ('model', XGBRegressor())])
xgb_param_grid = {
    'fs__k': [130,140,150,160,170,180,190,200],  # Number of features to select
    'model__n_estimators': [100,200,300,400,500],
    'model__max_depth': [None, 10, 20, 30, 40, 50],
    'model__learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3]
}
grid_search = GridSearchCV(xgb_fs, xgb_param_grid, cv=5, scoring='r2',n_jobs=-1)
grid_search.fit(x_train, y_train)
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: ", grid_search.best_score_)
best_model = grid_search.best_estimator_
predictions = best_model.predict(x_test)
print("Mean Squared Error:", mean_squared_error(y_test, predictions))
print("Mean Absolute Error:", mean_absolute_error(y_test, predictions))
print("R^2 Score:", r2_score(y_test, predictions))



Best parameters found:  {'fs__k': 140, 'model__learning_rate': 0.3, 'model__max_depth': None, 'model__n_estimators': 400}
Best cross-validation score:  0.9942706930202807
Mean Squared Error: 543208806492.71295
Mean Absolute Error: 25996.22631106515
R^2 Score: 0.18392345206156535
