# 2 - Model Selection

## Import Train-Test Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
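
A baseline comparison along these lines can be sketched as a loop over untuned estimators. This is a minimal sketch on synthetic data from `make_regression`, not the project data; the names `X_demo`, `baselines`, and `baseline_r2` are illustrative assumptions, and XGBoost is left out so the example depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption); swap in the real train/test split.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)
Xa, Xb, ya, yb = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Untuned baseline models, all with default hyperparameters
# (apart from a smaller forest to keep the example fast).
baselines = {
    "linear": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

baseline_r2 = {}
for name, est in baselines.items():
    est.fit(Xa, ya)
    baseline_r2[name] = est.score(Xb, yb)  # R^2 on the held-out split
    print(f"{name}: R^2 = {baseline_r2[name]:.3f}")
```

Keeping the models in a dictionary makes it easy to add or drop candidates without touching the evaluation loop.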

In [3]:
# import models and fit

Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units. Can we actually relate that to the amount of error?
- Try root mean squared error so that the error is in the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
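
The metrics listed above can be computed directly with NumPy. The helper below is a minimal sketch; `regression_metrics` and its arguments are hypothetical names, and the prices are made-up numbers chosen so the arithmetic is easy to follow. The second call moves all of the error onto a single prediction, which leaves MAE unchanged but doubles RMSE.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 (hypothetical helper)."""
    y_true = np.ravel(y_true).astype(float)
    y_pred = np.ravel(y_pred).astype(float)
    resid = y_true - y_pred
    n = y_true.size
    ss_res = np.sum(resid ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 for the number of predictors used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "rmse": float(np.sqrt(np.mean(resid ** 2))),
        "mae": float(np.mean(np.abs(resid))),
        "r2": float(r2),
        "adj_r2": float(adj_r2),
    }

y_true = [100, 200, 300, 400]

# Every prediction is off by 10: RMSE and MAE agree.
metrics_clean = regression_metrics(y_true, [110, 190, 310, 390], n_features=1)

# Same total absolute error, concentrated in one prediction (an outlier).
metrics_outlier = regression_metrics(y_true, [100, 200, 300, 440], n_features=1)

print(metrics_clean)
print(metrics_outlier)
```

Because RMSE squares each residual before averaging, a few large misses dominate it, whereas MAE weights all misses linearly.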

In [4]:
# gather evaluation metrics and compare results

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
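
One way to run this comparison is recursive feature elimination (RFE) wrapped around a linear model. This is a minimal sketch on synthetic data in which only 5 of 20 features are informative; the `*_fs` variable names and the choice of 5 features to keep are assumptions for illustration, not values taken from this project.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 20 features, 5 of them informative.
X_fs, y_fs = make_regression(
    n_samples=400, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_fs_train, X_fs_test, y_fs_train, y_fs_test = train_test_split(
    X_fs, y_fs, test_size=0.25, random_state=0
)

# Baseline: fit on the full feature set.
full_model = LinearRegression().fit(X_fs_train, y_fs_train)
r2_full = full_model.score(X_fs_test, y_fs_test)

# RFE repeatedly drops the feature with the smallest coefficient magnitude.
rfe = RFE(LinearRegression(), n_features_to_select=5).fit(X_fs_train, y_fs_train)

# Refit on the reduced feature set and score on the same test split.
reduced_model = LinearRegression().fit(rfe.transform(X_fs_train), y_fs_train)
r2_reduced = reduced_model.score(rfe.transform(X_fs_test), y_fs_test)

print(f"R^2 full: {r2_full:.4f}, R^2 reduced: {r2_reduced:.4f}")
print("kept feature indices:", rfe.get_support(indices=True))
```

If the two scores are close, the reduced model gives comparable accuracy with a quarter of the inputs, which is the case for keeping feature selection in the pipeline.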



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

## Machine Learning

### Linear Regression

In [6]:
from sklearn.preprocessing import (
    MinMaxScaler, RobustScaler, PowerTransformer, PolynomialFeatures
)
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from utils import run_regression

Apply a scaler to the feature data. Only the continuous features need scaling; the Boolean features are left as they are. We tried a few different scalers: `MinMaxScaler()` and `RobustScaler()` yielded similar results, whereas the results from `PowerTransformer()` were slightly worse.

In [7]:
# Get a list of columns to be scaled.
columns = X_train.columns.to_list()
central_air_idx = columns.index('central_air')
features_to_scale = columns[:central_air_idx]
features_to_scale.append('median_by_pc')
print(features_to_scale)

['price_reduced_amount', 'year_built', 'lot_sqft', 'sqft', 'baths', 'garage', 'stories', 'beds', 'median_by_pc']


In [8]:
X_train_fts = X_train[features_to_scale]
X_train_other = X_train.drop(columns=features_to_scale)

X_test_fts = X_test[features_to_scale]
X_test_other = X_test.drop(columns=features_to_scale)

In [9]:
# Choose a scaler.
# scaler = MinMaxScaler()
scaler = RobustScaler()
# scaler = PowerTransformer(method='yeo-johnson')

In [10]:
# Scale the columns that need scaling. Then, recombine with the other columns. 
X_train_sc = np.hstack([
    scaler.fit_transform(X_train_fts), 
    X_train_other.to_numpy()
])

X_test_sc = np.hstack([
    scaler.transform(X_test_fts), 
    X_test_other.to_numpy()
])
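
The split, scale, and recombine steps above could also be expressed with scikit-learn's `ColumnTransformer`, which scales the listed columns and passes the rest through untouched. This is a minimal sketch on a toy frame; the values in `toy_train` and `toy_test` are made up, and only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import RobustScaler

# Toy data (hypothetical values) with two continuous columns and one Boolean.
toy_train = pd.DataFrame({
    "sqft": [1000.0, 1500.0, 2000.0, 2500.0, 3000.0],
    "beds": [2.0, 3.0, 3.0, 4.0, 5.0],
    "central_air": [1, 0, 1, 1, 0],
})
toy_test = pd.DataFrame({
    "sqft": [1800.0, 5000.0],
    "beds": [3.0, 6.0],
    "central_air": [0, 1],
})

numeric_cols = ["sqft", "beds"]

# Scale only the numeric columns; remainder="passthrough" keeps the others.
# Output column order: scaled columns first, then the passthrough columns.
ct = ColumnTransformer(
    [("scale", RobustScaler(), numeric_cols)],
    remainder="passthrough",
)

toy_train_sc = ct.fit_transform(toy_train)  # fit on training data only
toy_test_sc = ct.transform(toy_test)        # reuse the training statistics

print(toy_train_sc)
print(toy_test_sc)
```

As in the manual version, the scaler is fitted on the training data only, so no information from the test set leaks into the transformation.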

Select the k best features. To begin with, there are 50 features. However, many have low correlations with the target and can safely be dropped, leading to simpler models. Some trial and error indicates that k=8 is a good choice.

In [11]:
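# Note: f_classif is the ANOVA F-test intended for categorical targets. For a
# continuous target such as price, f_regression is the matching score function.
# The results below were obtained with f_classif.
skb = SelectKBest(f_classif, k=8)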

X_train_sc_skb = skb.fit_transform(
    X_train_sc, 
    np.ravel(y_train.to_numpy())
)

X_test_sc_skb = skb.transform(X_test_sc)
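
The score function passed to `SelectKBest` determines how features are ranked. For a continuous target, scikit-learn provides `f_regression`, which scores each feature with a univariate linear F-test. This is a minimal sketch on synthetic data in which only columns 0 and 3 drive the target; the names `X_syn`, `y_syn`, and `selector` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)

# Six random features; the target depends on columns 0 and 3 only.
X_syn = rng.normal(size=(200, 6))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 3] + rng.normal(scale=0.1, size=200)

# Keep the two features with the highest univariate F-scores.
selector = SelectKBest(f_regression, k=2)
X_syn_k = selector.fit_transform(X_syn, y_syn)

kept = selector.get_support(indices=True)
print("kept feature indices:", kept)
print("F-scores:", np.round(selector.scores_, 1))
```

Inspecting `scores_` alongside the kept indices is a quick way to see how sharply the ranking separates useful features from the rest when choosing `k`.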

Investigate polynomial features. Trial and error suggested no benefit from including polynomial features here, so we set `degree=1` in the following, which leaves the selected features unchanged apart from adding a constant bias column:

In [12]:
poly = PolynomialFeatures(degree=1)

X_train_sc_skb_poly = poly.fit_transform(X_train_sc_skb)
X_test_sc_skb_poly = poly.transform(X_test_sc_skb)
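
To see what the `degree` setting changes, it helps to transform a single small row. This is a minimal sketch with made-up input values; `X_small`, `deg1`, and `deg2` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_small = np.array([[2.0, 3.0]])  # one sample with features a=2, b=3

# degree=1: a bias column of ones plus the original features [1, a, b].
deg1 = PolynomialFeatures(degree=1).fit_transform(X_small)

# degree=2: adds the squares and the interaction [1, a, b, a^2, a*b, b^2].
deg2 = PolynomialFeatures(degree=2).fit_transform(X_small)

print(deg1)
print(deg2)
```

With 8 selected features, `degree=2` would expand the design matrix to 45 columns, so higher degrees add model complexity quickly.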

#### Ordinary Least Squares

In [13]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    LinearRegression()
)

RMSE train: 89251.61664716937
RMSE test: 86062.98812899395
MAE train: 61213.46283121494
MAE test: 61046.30914102518
R**2 train: 0.765630007488411
R**2 test: 0.7843355595723416
Adj R**2 train: 0.765130403051508
Adj R**2 test: 0.7829501349015001


#### Ridge

In [14]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    Ridge(alpha=10)
)

RMSE train: 89263.91730220409
RMSE test: 86195.83894571947
MAE train: 61165.628526310684
MAE test: 61053.90533074585
R**2 train: 0.7655654013065272
R**2 test: 0.7836692264288947
Adj R**2 train: 0.7650656591491986
Adj R**2 test: 0.7822795212453544


#### Lasso

In [15]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    Lasso(alpha=1)
)

RMSE train: 89251.61721756474
RMSE test: 86063.55819614459
MAE train: 61212.943755002125
MAE test: 61045.868023732415
R**2 train: 0.7656300044927554
R**2 test: 0.784332702510952
Adj R**2 train: 0.7651304000494666
Adj R**2 test: 0.7829472594863971


### Support Vector Machine

In [16]:
from sklearn import svm

In [17]:
# model = run_regression(
#     [X_train_sc_skb_poly, y_train],
#     [X_test_sc_skb_poly, y_test],
#     svm.SVR()
# )

### Random Forest

### XGBoost