# 2 - Model Selection

## Import Train-Test Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [3]:
# import models and fit
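
One way to set up the baseline comparison described above is to keep the candidate models in a dictionary and fit them in a loop. The sketch below is only illustrative: it uses synthetic data from `make_regression` as a stand-in for the processed housing data, and the names `X_demo`, `baselines`, and `baseline_r2` are hypothetical. XGBoost is left out so the example needs only scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption, not the housing dataset).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)
X_demo_tr, X_demo_te, y_demo_tr, y_demo_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Untuned baseline models, keyed by a short label.
baselines = {
    "linear": LinearRegression(),
    "svr": SVR(kernel="linear"),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its R**2 on the held-out split.
baseline_r2 = {}
for name, estimator in baselines.items():
    estimator.fit(X_demo_tr, y_demo_tr)
    baseline_r2[name] = estimator.score(X_demo_te, y_demo_te)
    print(f"{name}: R**2 test = {baseline_r2[name]:.3f}")
```

Keeping the models in one dictionary makes it easy to add or drop a candidate without touching the fitting loop.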

Consider which metrics you want to use to evaluate success.
- Mean squared error is measured in squared units (dollars squared). Can we relate that to an actual amount of error?
- Try root mean squared error, which puts the error back in the original units (dollars).
- How does RMSE treat outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use.
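
The metrics listed above can be computed with `sklearn.metrics` in a few lines. The helper below is a minimal sketch, not the `print_scores` function from `utils` (which is not shown in this notebook); `regression_metrics` and the toy prices are hypothetical. Adjusted R^2 uses the usual formula `1 - (1 - R^2) * (n - 1) / (n - p - 1)` with `n` samples and `p` features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R**2 and adjusted R**2 as a dict."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n_samples = y_true.shape[0]

    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    # Penalise R**2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
    return {"rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}


# Toy prices in dollars: three small misses and one large one.
y_true = [100_000, 200_000, 300_000, 400_000]
y_pred = [110_000, 190_000, 310_000, 300_000]
scores = regression_metrics(y_true, y_pred, n_features=1)
print(scores)
```

In this toy example the single 100,000-dollar miss pulls RMSE well above MAE, which is the outlier sensitivity the questions above ask about.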

In [4]:
# gather evaluation metrics and compare results

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features.
- Refit your models on this reduced feature set. How does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain.

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
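
As a rough illustration of the workflow above, the sketch below runs recursive feature elimination (RFE) and compares test R^2 before and after. It uses synthetic data where only 5 of 20 features are informative, which is an assumption made for the demo and says nothing about the real dataset; the names `X_fs`, `selector`, `r2_full`, and `r2_reduced` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only 5 of which drive the target (assumption).
X_fs, y_fs = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)
X_fs_tr, X_fs_te, y_fs_tr, y_fs_te = train_test_split(
    X_fs, y_fs, test_size=0.25, random_state=0
)

# Baseline: all 20 features.
full_model = LinearRegression().fit(X_fs_tr, y_fs_tr)
r2_full = full_model.score(X_fs_te, y_fs_te)

# RFE repeatedly drops the feature with the smallest coefficient magnitude
# until only the requested number remains.
selector = RFE(LinearRegression(), n_features_to_select=5).fit(X_fs_tr, y_fs_tr)
reduced_model = LinearRegression().fit(selector.transform(X_fs_tr), y_fs_tr)
r2_reduced = reduced_model.score(selector.transform(X_fs_te), y_fs_te)

print(f"R**2 test, all features: {r2_full:.3f}")
print(f"R**2 test, {int(selector.support_.sum())} features: {r2_reduced:.3f}")
```

Note that the selector is fitted on the training split only and then applied to the test split, so the test data does not influence which features are kept.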



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

## Machine Learning

### Linear Regression

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from utils import print_scores

In [7]:
scaler = MinMaxScaler()

X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

In [8]:
def run_regression(
    train: list,
    test: list,
    regressor
):
    """
    Return a fitted regressor.

    Fit the regressor and print out a variety of train and test scores.

    :param train: [X_train, y_train]
    :param test: [X_test, y_test]
    :param regressor: an unfitted regression model instance such as LinearRegression()

    :return: the fitted regression model
    """
    X_train, y_train = train
    X_test, y_test = test

    regressor.fit(X_train, y_train)

    y_train_pred = regressor.predict(X_train)
    y_test_pred = regressor.predict(X_test)

    print_scores(
        [X_train, y_train, y_train_pred], 
        [X_test, y_test, y_test_pred]
    )

    return regressor

#### Ordinary Least Squares

In [9]:
model = run_regression(
    [X_train_sc, y_train],
    [X_test_sc, y_test],
    LinearRegression()
)

RMSE train: 84306.072648115
RMSE test: 82383.26915252862
MAE train: 58852.89461247637
MAE test: 60265.88377037562
R**2 train: 0.7908838665720899
R**2 test: 0.8023832535989004
Adj R**2 train: 0.7883830756915838
Adj R**2 test: 0.7951179320400364


In [10]:
print(f'No. features: {model.n_features_in_}')
print(f'Coefficients: {model.coef_}')

No. features: 50
Coefficients: [[ 1.02695840e+05  3.71788522e+04 -6.98590054e+04  1.63272237e+05
   4.00223399e+04  5.09517251e+04  1.75747967e+04  2.03306731e+04
  -1.04707997e+04  6.72310064e+03  9.86289479e+03  2.17896008e+03
   2.43425499e+03  1.28767530e+04 -1.42656053e+04  8.27453448e+03
   4.30742772e+03 -8.72488367e+03  2.51463648e+03  6.22532878e+03
  -9.58794596e+03  1.92105606e+04  1.48725137e+04  2.48228655e+04
  -7.79316072e+03 -2.95062168e+03  6.90489924e+03  2.39859773e+04
  -2.01134596e+03 -1.31082433e+04  6.23679452e+04 -6.03458311e+04
   7.06829023e+03 -1.30130562e+03  1.16279779e+04  6.73778898e+03
  -1.13792422e+04 -6.28795896e+03  1.46516708e+04 -6.13628047e+16
  -6.13628047e+16 -6.13628047e+16 -6.13628047e+16 -6.13628047e+16
  -6.13628047e+16 -6.13628047e+16 -1.42041686e+17 -1.42041686e+17
  -1.42041686e+17  6.85006914e+05]]
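
The repeated, very large coefficients above (around -6.1e16 and -1.4e17) are a pattern that often appears when some columns are perfectly collinear, for example a complete set of one-hot dummy columns alongside an intercept. Whether that is the cause here is an assumption, since the feature names are not shown. The sketch below uses a hypothetical `zone` column to show how to check for this with a rank test, and how dropping one dummy level restores full rank.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical feature (assumption: stands in for a one-hot group).
toy = pd.DataFrame({"zone": ["a", "b", "c", "a", "b", "c"]})

# Full one-hot encoding vs. encoding with the first level dropped.
full_dummies = pd.get_dummies(toy["zone"], dtype=float)
reduced_dummies = pd.get_dummies(toy["zone"], drop_first=True, dtype=float)


def design_rank(dummies):
    """Return (rank, n_columns) of the design matrix [intercept | dummies]."""
    design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(design)), design.shape[1]


rank_full, cols_full = design_rank(full_dummies)
rank_reduced, cols_reduced = design_rank(reduced_dummies)

# Rank below the column count means the columns are linearly dependent,
# so the least-squares coefficients are not uniquely determined.
print(f"full dummies:    rank {rank_full} of {cols_full} columns")
print(f"drop_first=True: rank {rank_reduced} of {cols_reduced} columns")
```

With a rank-deficient design the fitted predictions can still be reasonable, as the scores above suggest, but individual coefficients should not be interpreted.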


#### Ridge

In [11]:
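# NOTE: as written, this cell passes the unscaled X_train / X_test, whereas the
# OLS cell used X_train_sc / X_test_sc. The Ridge penalty is scale-sensitive,
# so pass the scaled arrays for a like-for-like comparison.
model = run_regression(
    [X_train, y_train],
    [X_test, y_test],
    Ridge(alpha=10)
)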

RMSE train: 84384.22424118425
RMSE test: 82460.57424595708
MAE train: 58826.50023791573
MAE test: 60247.919375112724
R**2 train: 0.79049598623783
R**2 test: 0.8020122086342046
Adj R**2 train: 0.7879905567501216
Adj R**2 test: 0.7947332457163445


In [12]:
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

#### Lasso

In [13]:
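# NOTE: as written, this cell also passes the unscaled X_train / X_test, so the
# coefficients below are on the raw feature scale and are not directly
# comparable with the OLS coefficients fitted on X_train_sc.
model = run_regression(
    [X_train, y_train],
    [X_test, y_test],
    Lasso(alpha=1)
)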

RMSE train: 84305.76752980656
RMSE test: 82382.50237832572
MAE train: 58853.41065050963
MAE test: 60268.808200501015
R**2 train: 0.7908853802245097
R**2 test: 0.8023869321787771
Adj R**2 train: 0.7883846074455634
Adj R**2 test: 0.7951217458618204


In [14]:
print(f'No. features: {model.n_features_in_}')
print(f'Coefficients: {model.coef_}')

No. features: 50
Coefficients: [ 5.38929320e-01  1.99382855e+02 -5.02534831e-04  5.77938934e+01
  6.67472956e+03  1.01795851e+04  5.83804302e+03  6.78503042e+03
 -1.04535826e+04  6.67912366e+03  9.86583387e+03  2.18792610e+03
  2.46760912e+03  1.28426460e+04 -1.42543622e+04  8.28510391e+03
  4.27206207e+03 -8.73967697e+03  2.49450596e+03  6.19657678e+03
 -9.58917067e+03  1.92138438e+04  1.48385879e+04  2.48583235e+04
 -7.78904039e+03 -2.92948185e+03  6.85609590e+03  2.39927229e+04
 -1.98803261e+03 -1.31295857e+04  6.20413630e+04 -5.99862411e+04
  7.03327635e+03 -1.29327824e+03  1.16007676e+04  6.74727895e+03
 -1.13668145e+04 -6.25545815e+03  1.46490381e+04  0.00000000e+00
 -2.83349348e+04 -6.99726648e+04  2.12744818e+04  1.07474933e+05
  1.88081915e+04 -2.91023911e+04  2.38448304e+01  1.46053731e+04
 -3.47941017e+03  7.78961139e-01]


### Support Vector Machines

### Random Forest

### XGBoost