# 2 - Model Selection

## Import Train-Test Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [3]:
# import models and fit

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
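
To make the outlier question concrete, here is a minimal sketch on made-up prices (not this notebook's data) comparing how RMSE and MAE react to one large miss. The helper names `rmse` and `mae` and all numbers are illustrative assumptions.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: squares residuals before averaging."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: averages absolute residuals."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Made-up sale prices in dollars (illustrative only).
y_true = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0, 400_000.0, 450_000.0])

# Every prediction is off by exactly $10,000.
y_close = y_true + np.array([10_000.0, -10_000.0, 10_000.0, -10_000.0, 10_000.0, -10_000.0])

# Same predictions, except the last one misses by $100,000.
y_outlier = y_close.copy()
y_outlier[-1] = y_true[-1] - 100_000.0

print("no outlier :", rmse(y_true, y_close), mae(y_true, y_close))
print("one outlier:", rmse(y_true, y_outlier), mae(y_true, y_outlier))
```

Because RMSE squares each residual before averaging, the single large miss raises RMSE proportionally more than MAE. Both metrics stay in dollars, which is one argument for reporting them side by side.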

In [4]:
# gather evaluation metrics and compare results

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

## Machine Learning

### Linear Regression

In [6]:
from sklearn.preprocessing import MinMaxScaler, RobustScaler, PowerTransformer,\
    PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from utils import run_regression
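
The source of `utils.run_regression` is not shown in this notebook. Judging only from how it is called and what it prints below, it might look roughly like the following sketch. The name `run_regression_sketch`, the adjusted R**2 formula, and the use of the number of input columns as the feature count are all assumptions, not the actual helper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def run_regression_sketch(train, test, model):
    """Fit `model`, print train/test metrics, and return the fitted model.

    Hypothetical stand-in for `utils.run_regression`.
    """
    X_train, y_train = train
    X_test, y_test = test
    y_train, y_test = np.ravel(y_train), np.ravel(y_test)
    model.fit(X_train, y_train)

    def adjusted_r2(r2, n_samples, n_features):
        # Assumed formula: penalize R**2 by the number of predictors.
        return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

    results = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X)
        r2 = r2_score(y, pred)
        results[name] = {
            "RMSE": np.sqrt(mean_squared_error(y, pred)),
            "MAE": mean_absolute_error(y, pred),
            "R**2": r2,
            "Adj R**2": adjusted_r2(r2, X.shape[0], X.shape[1]),
        }
    for metric in ["RMSE", "MAE", "R**2", "Adj R**2"]:
        for name in ["train", "test"]:
            print(f"{metric} {name}: {results[name][metric]}")
    return model

# Usage on synthetic data (not the housing data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)
fitted = run_regression_sketch([X[:150], y[:150]], [X[150:], y[150:]], LinearRegression())
```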

Apply a scaler to the feature data. Only the numeric features need scaling; the Boolean features are left as they are. We tried a few different scalers: `MinMaxScaler()` and `RobustScaler()` yielded similar results, whereas the results from `PowerTransformer()` were slightly worse than the other two.

In [7]:
# Get a list of columns to be scaled.
columns = X_train.columns.to_list()
central_air_idx = columns.index('central_air')
features_to_scale = columns[:central_air_idx]
features_to_scale.append('median_by_pc')
print(features_to_scale)

['price_reduced_amount', 'year_built', 'lot_sqft', 'sqft', 'baths', 'garage', 'stories', 'beds', 'median_by_pc']


In [8]:
X_train_fts = X_train[features_to_scale]
X_train_other = X_train.drop(columns=features_to_scale)

X_test_fts = X_test[features_to_scale]
X_test_other = X_test.drop(columns=features_to_scale)

In [9]:
# Choose a scaler.
scaler = MinMaxScaler()
# scaler = RobustScaler()
# scaler = PowerTransformer(method='yeo-johnson')

In [10]:
# Scale the columns that need scaling. Then, recombine with the other columns. 
X_train_sc = np.hstack([
    scaler.fit_transform(X_train_fts), 
    X_train_other.to_numpy()
])

X_test_sc = np.hstack([
    scaler.transform(X_test_fts), 
    X_test_other.to_numpy()
])
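
The split, scale, and recombine steps above can also be written with scikit-learn's `ColumnTransformer`. The sketch below uses a tiny made-up frame rather than `X_train`, and its column names are only placeholders. With `remainder="passthrough"`, the scaled columns come first and the untouched columns follow, which matches the `np.hstack` ordering used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up frames standing in for X_train / X_test.
train = pd.DataFrame({
    "sqft": [1000.0, 2000.0, 3000.0],
    "beds": [2.0, 3.0, 4.0],
    "central_air": [0, 1, 1],
})
test = pd.DataFrame({
    "sqft": [1500.0, 3500.0],
    "beds": [3.0, 5.0],
    "central_air": [1, 0],
})
numeric_cols = ["sqft", "beds"]

# Scale only the numeric columns; pass every other column through unchanged.
ct = ColumnTransformer(
    [("scale", MinMaxScaler(), numeric_cols)],
    remainder="passthrough",
)
train_sc = ct.fit_transform(train)  # fit the scaler on training data only
test_sc = ct.transform(test)        # reuse the training min/max on test data
```

Test values can fall outside [0, 1] because the scaler reuses the minimum and maximum learned from the training rows.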

Select the k best features. The dataset starts with 50 features, but many correlate only weakly with the target and can be dropped, which leads to simpler models. Trial and error indicates that k=8 is a good choice.

In [11]:
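# Caveat: f_classif is an ANOVA F-test that treats the target as class labels.
# For a continuous target such as price, f_regression is the usual score function.
# The results below were produced with f_classif.
skb = SelectKBest(f_classif, k=8)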

X_train_sc_skb = skb.fit_transform(
    X_train_sc, 
    np.ravel(y_train.to_numpy())
)

X_test_sc_skb = skb.transform(X_test_sc)
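
The cell above scores features with `f_classif`, which is designed for categorical targets. As a hedged sketch on synthetic data (not the housing data), the following shows `SelectKBest` with `f_regression`, the score function intended for continuous targets, and uses `get_support()` to check which columns were kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only columns 0 and 3 drive this made-up continuous target.
y = 5.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

selector = SelectKBest(f_regression, k=2)
X_sel = selector.fit_transform(X, y)

# Indices of the columns the selector kept.
kept = np.flatnonzero(selector.get_support())
print(kept)
```

Applied to this notebook, the same `get_support()` call on `skb` would reveal which 8 of the 50 features survive, which makes the selection easier to interpret.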

Investigate polynomial features. Trial and error gave no indication that polynomial terms help here; hence, we've set `degree=1` in the following, which leaves the selected features unchanged apart from adding a constant bias column:

In [12]:
poly = PolynomialFeatures(degree=1)

X_train_sc_skb_poly = poly.fit_transform(X_train_sc_skb)
X_test_sc_skb_poly = poly.transform(X_test_sc_skb)
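
For reference, a small sketch of what `PolynomialFeatures` produces at `degree=1` versus `degree=2`, using a made-up two-column array:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# degree=1: a bias column of ones followed by the original features.
deg1 = PolynomialFeatures(degree=1).fit_transform(X)

# degree=2: bias, x0, x1, x0^2, x0*x1, x1^2.
deg2 = PolynomialFeatures(degree=2).fit_transform(X)

print(deg1.shape, deg2.shape)
```

With 8 selected features, `degree=1` therefore yields 9 columns, while `degree=2` would grow the column count quadratically.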

#### Ordinary Least Squares

In [13]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    LinearRegression()
)

RMSE train: 89251.61664716937
RMSE test: 86062.98812899395
MAE train: 61213.46283121494
MAE test: 61046.30914102518
R**2 train: 0.765630007488411
R**2 test: 0.7843355595723416
Adj R**2 train: 0.765130403051508
Adj R**2 test: 0.7829501349015001


#### Ridge

We loop over a few values of the hyperparameter alpha. Doing so has little effect on the results: across the range tested, the test RMSE changes by less than 0.2%.

In [14]:
alphas = [0.01, 0.1, 1, 10]
for alpha in alphas:
    print(f'alpha = {alpha}')
    model = run_regression(
        [X_train_sc_skb_poly, y_train],
        [X_test_sc_skb_poly, y_test],
        Ridge(alpha=alpha)
    )
    print()

alpha = 0.01
RMSE train: 89251.61666170432
RMSE test: 86063.11627149125
MAE train: 61213.408181275925
MAE test: 61046.306809453046
R**2 train: 0.7656300074120751
R**2 test: 0.7843349173496823
Adj R**2 train: 0.7651304029750094
Adj R**2 test: 0.7829494885532134

alpha = 0.1
RMSE train: 89251.61809835208
RMSE test: 86064.27013338474
MAE train: 61212.916932103646
MAE test: 61046.2861523519
R**2 train: 0.765629999866954
R**2 test: 0.7843291343987603
Adj R**2 train: 0.7651303954138045
Adj R**2 test: 0.7829436684527138

alpha = 1
RMSE train: 89251.75948763241
RMSE test: 86075.8643164402
MAE train: 61208.06326392328
MAE test: 61046.11153113789
R**2 train: 0.7656292573049119
R**2 test: 0.7842710221084586
Adj R**2 train: 0.7651296512688495
Adj R**2 test: 0.7828851828500547

alpha = 10
RMSE train: 89263.91730220409
RMSE test: 86195.83894571947
MAE train: 61165.628526310684
MAE test: 61053.90533074585
R**2 train: 0.7655654013065272
R**2 test: 0.7836692264288947
Adj R**2 train: 0.7650656591491986
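
One way to check how strongly a given alpha regularizes is to look at the fitted coefficients rather than only the error metrics. The sketch below does this on synthetic data with features in [0, 1] and a dollar-scale target. The data, the coefficient values, and the extra alpha of 1000 are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 4))  # features already in [0, 1]
true_coef = np.array([300_000.0, 150_000.0, -80_000.0, 0.0])
y = 100_000 + X @ true_coef + rng.normal(scale=50_000, size=500)

alphas = [0.01, 0.1, 1, 10, 1000]
# L2 norm of the fitted coefficients for each alpha.
coef_norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_) for a in alphas]

for a, norm in zip(alphas, coef_norms):
    print(f"alpha = {a}: ||coef|| = {norm:,.0f}")
```

If the coefficient norm barely moves between alphas, the penalty is small relative to the data term, and a wider alpha range may be needed before regularization shows any effect.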


#### Lasso

Again, we loop through several values of alpha to assess the parameter's impact on the analysis. As with the Ridge regression analysis above, changing alpha has little effect on the results.

In [15]:
alphas = [0.1, 1, 10]
for alpha in alphas:
    print(f'alpha = {alpha}')
    model = run_regression(
        [X_train_sc_skb_poly, y_train],
        [X_test_sc_skb_poly, y_test],
        Lasso(alpha=alpha)
    )
    print()

alpha = 0.1
RMSE train: 89251.61665290635
RMSE test: 86063.04462270497
MAE train: 61213.41380193851
MAE test: 61046.267419976146
R**2 train: 0.7656300074582811
R**2 test: 0.78433527643812
Adj R**2 train: 0.765130403021314
Adj R**2 test: 0.7829498499484291

alpha = 1
RMSE train: 89251.61721756474
RMSE test: 86063.55819614459
MAE train: 61212.943755002125
MAE test: 61045.868023732415
R**2 train: 0.7656300044927554
R**2 test: 0.784332702510952
Adj R**2 train: 0.7651304000494666
Adj R**2 test: 0.7829472594863971

alpha = 10
RMSE train: 89251.67369078356
RMSE test: 86068.8199203835
MAE train: 61208.077184139045
MAE test: 61041.774336419076
R**2 train: 0.765629707901329
R**2 test: 0.7843063309186202
Adj R**2 train: 0.7651301028257989
Adj R**2 test: 0.7829207184834079



### Support Vector Machine

In [16]:
from sklearn import svm

**For our SVM investigations, we continue to use the dataset from the Linear Regression section above. Recall that this dataset has been scaled. Also, from this dataset, we work with the k=8 best features.**

Among SVMs, we first look at a linear kernel. Trial and error suggests that a good value for the hyperparameter C is about 350, which gives the lowest test RMSE of the values shown below.

In [17]:
C_values = list(np.arange(300, 500, 50))
for C in C_values:
    print(f'C = {C}')
    model = run_regression(
        [X_train_sc_skb_poly, np.ravel(y_train)],
        [X_test_sc_skb_poly, np.ravel(y_test)],
        svm.SVR(kernel='linear', C=C)
    )
    print()

C = 300
RMSE train: 92209.48718993028
RMSE test: 90426.64161958164
MAE train: 59607.367412719854
MAE test: 59455.536016781094
R**2 train: 0.7498381748012491
R**2 test: 0.7619114529348443
Adj R**2 train: 0.7493049070544967
Adj R**2 test: 0.7603819761871025

C = 350
RMSE train: 92245.25422686318
RMSE test: 90323.22653533611
MAE train: 59175.919932360746
MAE test: 58930.07049544549
R**2 train: 0.7496440671700073
R**2 test: 0.7624557143925299
Adj R**2 train: 0.7491103856457368
Adj R**2 test: 0.7609297339710687

C = 400
RMSE train: 92316.32861722662
RMSE test: 90356.26632702498
MAE train: 59035.72708948502
MAE test: 58741.02061375538
R**2 train: 0.7492581231885771
R**2 test: 0.7622818975467414
Adj R**2 train: 0.7487236189509402
Adj R**2 test: 0.7607548005288404

C = 450
RMSE train: 92511.56197034537
RMSE test: 90515.91693536108
MAE train: 58900.084984945
MAE test: 58529.13531836098
R**2 train: 0.7481964487927597
R**2 test: 0.761441106535968
Adj R**2 train: 0.7476596813932179
Adj R**2 test: 

Second, we consider the 'rbf' kernel. Through trial and error, we found good values for the hyperparameters: C=100,000 and gamma=2. The hyperparameter epsilon appears to have little effect on the results.
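
A plausible reason epsilon matters so little here is scale. SVR uses an epsilon-insensitive loss, which ignores residuals smaller than epsilon, and an epsilon of at most 10 dollars is tiny next to errors of tens of thousands of dollars. The sketch below illustrates this with made-up residuals; the function name `epsilon_insensitive_loss` is a label chosen for this example.

```python
import numpy as np

def epsilon_insensitive_loss(residuals, epsilon):
    """Per-sample SVR loss: zero inside the epsilon tube, linear outside it."""
    return np.maximum(0.0, np.abs(residuals) - epsilon)

# Made-up residuals in dollars, on the order of the errors reported above.
residuals = np.array([-60_000.0, -5_000.0, 20_000.0, 45_000.0])

mean_losses = {
    eps: float(epsilon_insensitive_loss(residuals, eps).mean())
    for eps in [0.01, 0.1, 1, 10]
}
for eps, loss in mean_losses.items():
    print(f"eps = {eps}: mean loss = {loss:,.2f}")
```

To see epsilon make a difference on a dollar-scale target, one would likely need values in the thousands, or a rescaled target.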

In [18]:
eps_values = [0.01, 0.1, 1, 10]
for eps in eps_values:
    print(f'eps = {eps}')
    model = run_regression(
        [X_train_sc_skb_poly, np.ravel(y_train)],
        [X_test_sc_skb_poly, np.ravel(y_test)],
        svm.SVR(kernel='rbf', C=100_000, gamma=2, epsilon=eps)
    )
    print()

eps = 0.01
RMSE train: 75891.06462011661
RMSE test: 79054.05813524779
MAE train: 43384.79660196193
MAE test: 50021.992379172465
R**2 train: 0.8305462774233964
R**2 test: 0.8180324118839486
Adj R**2 train: 0.8301850544240621
Adj R**2 test: 0.8168634552151088

eps = 0.1
RMSE train: 75891.0727791279
RMSE test: 79054.07108678637
MAE train: 43384.80866476328
MAE test: 50021.99440336805
R**2 train: 0.8305462409876205
R**2 test: 0.8180323522599272
Adj R**2 train: 0.8301850179106164
Adj R**2 test: 0.8168633952080637

eps = 1
RMSE train: 75891.15434895335
RMSE test: 79054.20053226083
MAE train: 43384.96000516488
MAE test: 50022.02164180411
R**2 train: 0.8305458767202445
R**2 test: 0.818031756341021
Adj R**2 train: 0.8301846528667348
Adj R**2 test: 0.8168627954609847

eps = 10
RMSE train: 75892.06109790191
RMSE test: 79055.80534323858
MAE train: 43386.48389773113
MAE test: 50022.163204172015
R**2 train: 0.8305418274139096
R**2 test: 0.8180243683060494
Adj R**2 train: 0.8301805949285295
Adj R**2 

### Random Forest

### XGBoost