# 2 - Model Selection

## Import Train-Test Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
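
A minimal sketch of such a baseline comparison, assuming scikit-learn is installed. The synthetic `X_demo`/`y_demo` arrays and the `baselines` dictionary are illustrative names that do not appear elsewhere in this notebook. XGBoost is left out to avoid the extra dependency, but its regressor could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): swap in X_train, y_train, etc.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
X_demo_train, X_demo_test = X_demo[:150], X_demo[150:]
y_demo_train, y_demo_test = y_demo[:150], y_demo[150:]

# Default hyperparameters only: no tuning or cross validation at this stage.
baselines = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

baseline_rmse = {}
for name, estimator in baselines.items():
    estimator.fit(X_demo_train, y_demo_train)
    mse = mean_squared_error(y_demo_test, estimator.predict(X_demo_test))
    baseline_rmse[name] = float(np.sqrt(mse))
    print(f"{name}: RMSE test = {baseline_rmse[name]:.3f}")
```

Keeping the estimators in a dictionary makes it cheap to add or drop a model without touching the fitting loop.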

In [3]:
# import models and fit

Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars. Can we actually relate to that amount of error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use.
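
The metrics listed above can be sketched in a few lines of NumPy. The `regression_scores` helper below is hypothetical: it is not the notebook's `print_scores` from `utils`, whose implementation is not shown here. The prices are made up to illustrate how one large miss affects RMSE and MAE differently.

```python
import numpy as np

def regression_scores(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 as a dict (illustrative helper)."""
    y_true = np.ravel(np.asarray(y_true, dtype=float))
    y_pred = np.ravel(np.asarray(y_pred, dtype=float))
    n = y_true.size
    residuals = y_true - y_pred

    rmse = np.sqrt(np.mean(residuals ** 2))   # same units as the target
    mae = np.mean(np.abs(residuals))          # same units as the target
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R^2 penalises R^2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "mae": mae, "r2": r2, "adj_r2": adj_r2}

# Made-up prices in dollars: every prediction is off by 10,000.
y_true = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0, 500_000.0])
y_pred = y_true + 10_000.0
clean = regression_scores(y_true, y_pred, n_features=1)

# Same predictions, but the last one now misses by 210,000.
y_pred_outlier = y_pred.copy()
y_pred_outlier[-1] += 200_000.0
outlier = regression_scores(y_true, y_pred_outlier, n_features=1)

print(f"clean:   RMSE={clean['rmse']:.0f}  MAE={clean['mae']:.0f}")
print(f"outlier: RMSE={outlier['rmse']:.0f}  MAE={outlier['mae']:.0f}")
```

Because residuals are squared before averaging, the single large miss inflates RMSE more than MAE. This is one reason to report both.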

In [4]:
# gather evaluation metrics and compare results

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
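
As a rough illustration of two of the approaches named above, the sketch below runs RFE and Lasso on synthetic data in which only columns 0 and 2 carry signal. The data, the variable names, and `alpha=0.5` are assumptions chosen for the demo, not recommendations for the housing data.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data (assumption): 6 features, only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 5.0 * X_demo[:, 0] - 4.0 * X_demo[:, 2] + rng.normal(scale=0.5, size=300)

# RFE: repeatedly drop the feature with the smallest coefficient magnitude.
rfe = RFE(LinearRegression(), n_features_to_select=2).fit(X_demo, y_demo)
rfe_kept = np.flatnonzero(rfe.support_)

# Lasso: the L1 penalty drives uninformative coefficients to exactly zero.
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)
lasso_kept = np.flatnonzero(lasso.coef_ != 0)

print("RFE kept columns:  ", rfe_kept)
print("Lasso kept columns:", lasso_kept)
```

After selecting a subset like this, refit each model on the reduced columns and compare the same metrics as before.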



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

## Machine Learning

### Linear Regression

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from utils import print_scores

Scale each feature to the [0, 1] range. Fit the scaler on the training data only, then apply the same transformation to the test data.

In [7]:
scaler = MinMaxScaler()

X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
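
To make the scaling step concrete, here is a minimal NumPy sketch of min-max scaling with the default (0, 1) range. The helper names and toy arrays are hypothetical. The point is that the minimum and range come from the training data only, so test values can land outside [0, 1].

```python
import numpy as np

def min_max_fit(X):
    """Learn per-column minimum and range from the training data."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # avoid dividing by zero for constant columns
    return col_min, col_range

def min_max_transform(X, col_min, col_range):
    """Apply (x - min) / (max - min) using previously learned statistics."""
    return (np.asarray(X, dtype=float) - col_min) / col_range

toy_train = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
toy_test = np.array([[2.0, 600.0]])  # 600 exceeds the training maximum

col_min, col_range = min_max_fit(toy_train)
toy_train_sc = min_max_transform(toy_train, col_min, col_range)
toy_test_sc = min_max_transform(toy_test, col_min, col_range)

print(toy_train_sc)
print(toy_test_sc)
```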

Select the k best features (k=5) by univariate F-test score.

In [None]:
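# NOTE: f_classif is the ANOVA F-test for categorical targets. For a continuous
# target, sklearn's f_regression is the matching score function. The results
# reported below were produced with f_classif.
skb = SelectKBest(f_classif, k=5)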

X_train_sc_skb = skb.fit_transform(
    X_train_sc, 
    np.ravel(y_train.to_numpy())
)

X_test_sc_skb = skb.transform(X_test_sc)
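
For comparison, the sketch below runs `SelectKBest` with `f_regression`, the univariate score function scikit-learn provides for continuous targets. It uses synthetic data in which only columns 1 and 4 drive the target. The data and variable names are assumptions, and the results reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption): 6 features, only columns 1 and 4 matter.
rng = np.random.default_rng(0)
X_demo = rng.uniform(size=(200, 6))
y_demo = 4.0 * X_demo[:, 1] + 3.0 * X_demo[:, 4] + rng.normal(scale=0.1, size=200)

# Score each feature against the continuous target and keep the top two.
selector = SelectKBest(f_regression, k=2).fit(X_demo, y_demo)
kept = selector.get_support(indices=True)

print("Kept columns:", kept)
print("F-scores:", np.round(selector.scores_, 1))
```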

In [10]:
def run_regression(
    train: [np.array, np.array],
    test: [np.array, np.array],
    regressor
):
    """
    Return a fitted regressor.

    Fit the regressor and print out a variety of train and test scores.

    :param train: [X_train, y_train]
    :param test: [X_test, y_test]
    :param regressor: a regression model instance such as LinearRegression()

    :return: A fitted regression model
    """
    X_train, y_train = train
    X_test, y_test = test

    regressor.fit(X_train, y_train)

    y_train_pred = regressor.predict(X_train)
    y_test_pred = regressor.predict(X_test)

    print_scores(
        [X_train, y_train, y_train_pred], 
        [X_test, y_test, y_test_pred]
    )

    return regressor

#### Ordinary Least Squares

In [11]:
model = run_regression(
    [X_train_sc_skb, y_train],
    [X_test_sc_skb, y_test],
    LinearRegression()
)

RMSE train: 100679.25895530147
RMSE test: 97592.40727636927
MAE train: 65083.039461415436
MAE test: 62743.96514556088
R**2 train: 0.7017710194590668
R**2 test: 0.7226821894989526
Adj R**2 train: 0.7014181692691226
Adj R**2 test: 0.721695293376173


In [12]:
print(f'No. features: {model.n_features_in_}')
print(f'Coefficients: {model.coef_}')

No. features: 5
Coefficients: [[ -85321.3953897  -102526.8771943    16141.3107541     6723.54063453
   792832.60401293]]


#### Ridge

In [13]:
model = run_regression(
    [X_train_sc_skb, y_train],
    [X_test_sc_skb, y_test],
    Ridge(alpha=10)
)

RMSE train: 101124.86896744039
RMSE test: 98601.55836232286
MAE train: 67542.41903160974
MAE test: 65789.79370211065
R**2 train: 0.6991252328804254
R**2 test: 0.71691734577513
Adj R**2 train: 0.6987692523230193
Adj R**2 test: 0.7159099341942585


In [14]:
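# Re-fit the scaler (repeats cell 7). X_train_sc_skb and X_test_sc_skb are not
# recomputed here, so the models below still use the earlier feature selection.
X_train_sc = scaler.fit_transform(X_train)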
X_test_sc = scaler.transform(X_test)

#### Lasso

In [15]:
model = run_regression(
    [X_train_sc_skb, y_train],
    [X_test_sc_skb, y_test],
    Lasso(alpha=1)
)

RMSE train: 100679.25975472081
RMSE test: 97592.99014440672
MAE train: 65083.291426962904
MAE test: 62744.27165503323
R**2 train: 0.7017710147230364
R**2 test: 0.7226788769426755
Adj R**2 train: 0.7014181645274886
Adj R**2 test: 0.7216919690314394


In [16]:
print(f'No. features: {model.n_features_in_}')
print(f'Coefficients: {model.coef_}')

No. features: 5
Coefficients: [ -85301.85394224 -102485.06478651   16093.41904181    6702.66215753
  792802.9298412 ]


### Support Vector Machines

### Random Forest

### XGBoost