# 2 - Model Selection

## Import Train-Test Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
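
A minimal sketch of how such a baseline comparison could be organized is shown below. It uses synthetic data from `make_regression` as a stand-in for the processed CSV files, and the names `baselines` and `baseline_rmse` are illustrative assumptions. XGBoost is left out only to keep the sketch limited to scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (an assumption; the notebook loads its own CSV files).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Untuned baseline models with default (or near-default) settings.
baselines = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its test RMSE for a side-by-side comparison.
baseline_rmse = {}
for name, estimator in baselines.items():
    estimator.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, estimator.predict(X_te)))
    baseline_rmse[name] = rmse
    print(f"{name}: RMSE test = {rmse:.2f}")
```

Keeping the models in a dictionary makes it easy to add or remove candidates without touching the evaluation loop.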

In [3]:
# import models and fit

Consider what metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units (dollars squared). Can you actually relate that to an amount of error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
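
To make the question about outliers concrete, here is a small sketch with made-up dollar values (the arrays are illustrative assumptions, not data from this project). Both sets of predictions have the same mean absolute error, but the set with a single large miss has a much higher RMSE.

```python
import numpy as np

# Made-up target values in dollars (illustrative only).
y_true = np.array([200_000.0, 250_000.0, 300_000.0, 350_000.0, 400_000.0])

# Case 1: every prediction is off by $10,000.
y_pred_close = y_true + 10_000.0

# Case 2: four predictions are exact and one is off by $50,000.
y_pred_outlier = y_true.copy()
y_pred_outlier[-1] += 50_000.0


def rmse(y, y_hat):
    """Root mean squared error, in the units of y."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def mae(y, y_hat):
    """Mean absolute error, in the units of y."""
    return float(np.mean(np.abs(y - y_hat)))


for label, y_hat in [("close", y_pred_close), ("outlier", y_pred_outlier)]:
    print(f"{label}: MAE = {mae(y_true, y_hat):.0f}, RMSE = {rmse(y_true, y_hat):.0f}")
```

Because the errors are squared before averaging, RMSE weights the single $50,000 miss far more heavily than MAE does.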

In [4]:
# gather evaluation metrics and compare results

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
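
One way to run this comparison is sketched below using recursive feature elimination (RFE). The synthetic data, the choice of 5 features, and names such as `r2_full` and `r2_reduced` are assumptions made for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 30 features, only 5 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=400, n_features=30, n_informative=5, noise=5.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Baseline: fit on the full feature set.
full_model = LinearRegression().fit(X_tr, y_tr)
r2_full = r2_score(y_te, full_model.predict(X_te))

# RFE repeatedly drops the weakest feature until 5 remain.
selector = RFE(LinearRegression(), n_features_to_select=5).fit(X_tr, y_tr)
X_tr_sel = selector.transform(X_tr)
X_te_sel = selector.transform(X_te)

# Refit on the reduced feature set and compare on the same metric.
reduced_model = LinearRegression().fit(X_tr_sel, y_tr)
r2_reduced = r2_score(y_te, reduced_model.predict(X_te_sel))

print(f"R**2 test, all features: {r2_full:.4f}")
print(f"R**2 test, 5 features:   {r2_reduced:.4f}")
```

If the two scores are close, the reduced model is the simpler choice for the same performance.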



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

## Machine Learning

### Linear Regression

In [6]:
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LinearRegression, Ridge, Lasso

from utils import print_scores

Apply a scaler to the feature data.

In [7]:
scaler = MinMaxScaler()

X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
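
As a small illustration of why the scaler is fitted on the training data only, the sketch below uses a made-up single column (the values are assumptions). The test data is transformed with the training minimum and maximum, so test values outside the training range fall outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data (illustrative only).
train_col = np.array([[100.0], [200.0], [300.0]])
test_col = np.array([[150.0], [400.0]])

# Learn min and max from the training column only.
demo_scaler = MinMaxScaler().fit(train_col)

# Apply the same (x - min) / (max - min) mapping to both sets.
train_scaled = demo_scaler.transform(train_col)
test_scaled = demo_scaler.transform(test_col)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Here the training column maps to 0, 0.5 and 1, while the test values map to 0.25 and 1.5.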

Select the k best features. To begin with, there are 50 features. However, many have low correlations with the target and can safely be dropped, leading to simpler models. Some trial and error indicates that k=8 is a good choice.

In [19]:
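# Note: f_classif is an ANOVA F-test intended for categorical targets;
# f_regression is scikit-learn's counterpart for a continuous target.
skb = SelectKBest(f_classif, k=8)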

X_train_sc_skb = skb.fit_transform(
    X_train_sc, 
    np.ravel(y_train.to_numpy())
)

X_test_sc_skb = skb.transform(X_test_sc)
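
Since the target here is a continuous dollar amount, a variant worth trying is `SelectKBest` with `f_regression`, which scores each feature with a univariate linear regression F-test. The sketch below runs on synthetic data, and the names `skb_reg` and `selected_idx` are assumptions. It is not the configuration that produced the results in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in: 50 features, 8 of them informative.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=50, n_informative=8, noise=10.0, random_state=0
)

# Score every feature against the continuous target and keep the top 8.
skb_reg = SelectKBest(f_regression, k=8)
X_demo_skb = skb_reg.fit_transform(X_demo, y_demo)

# Column indices of the selected features in the original matrix.
selected_idx = skb_reg.get_support(indices=True)
print(f"Selected columns: {selected_idx}")
print(f"Reduced shape: {X_demo_skb.shape}")
```

`get_support(indices=True)` is a convenient way to map the selected columns back to the original feature names.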

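Investigate polynomial features. Some trial and error seems to indicate that there is no reason to include polynomial features here; hence, we've left `degree=1`, which passes the selected features through unchanged apart from adding a constant bias column.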

In [9]:
poly = PolynomialFeatures(degree=1)

X_train_sc_skb_poly = poly.fit_transform(X_train_sc_skb)
X_test_sc_skb_poly = poly.transform(X_test_sc_skb)
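
To see what `PolynomialFeatures` produces at different degrees, here is a tiny sketch on a made-up 2x2 array (the values are illustrative assumptions). With `degree=1` the output is the input plus a leading column of ones; with `degree=2` the squares and the cross term are appended.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two samples with two features each (illustrative only).
X_small = np.array([[2.0, 3.0], [4.0, 5.0]])

# degree=1: columns are [1, x0, x1].
deg1 = PolynomialFeatures(degree=1).fit_transform(X_small)

# degree=2: columns are [1, x0, x1, x0^2, x0*x1, x1^2].
deg2 = PolynomialFeatures(degree=2).fit_transform(X_small)

print(deg1)
print(deg2)
```

Applied to 8 selected features, `degree=1` therefore yields 9 columns, which is consistent with the feature count the fitted models report.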

In [10]:
def run_regression(
    train: list,
    test: list,
    regressor
):
    """
    Return a fitted regressor.

    Fit the regressor and print a variety of train and test scores.

    :param train: [X_train, y_train]
    :param test: [X_test, y_test]
    :param regressor: a regression model instance such as LinearRegression()

    :return: A fitted regression model
    """
    X_train, y_train = train
    X_test, y_test = test

    regressor.fit(X_train, y_train)

    y_train_pred = regressor.predict(X_train)
    y_test_pred = regressor.predict(X_test)

    print_scores(
        [X_train, y_train, y_train_pred], 
        [X_test, y_test, y_test_pred]
    )

    return regressor
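
The helper `print_scores` lives in `utils` and is not shown in this notebook. The sketch below is a hypothetical stand-in, named `print_scores_sketch`, that prints the same kinds of scores. Its internals are assumptions, including the adjusted R² formula 1 - (1 - R²)(n - 1)/(n - p - 1) with n samples and p feature columns.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def print_scores_sketch(train, test):
    """Hypothetical stand-in for utils.print_scores (the real helper is not shown).

    :param train: [X_train, y_train, y_train_pred]
    :param test: [X_test, y_test, y_test_pred]
    :return: nested dict of scores keyed by split and metric name
    """
    scores = {}
    for label, (X, y, y_pred) in (("train", train), ("test", test)):
        y = np.ravel(y)
        y_pred = np.ravel(y_pred)
        n, p = np.shape(X)  # number of samples and feature columns
        r2 = r2_score(y, y_pred)
        scores[label] = {
            "RMSE": float(np.sqrt(mean_squared_error(y, y_pred))),
            "MAE": float(mean_absolute_error(y, y_pred)),
            "R**2": float(r2),
            "Adj R**2": float(1 - (1 - r2) * (n - 1) / (n - p - 1)),
        }
    # Print metric by metric, train before test.
    for metric in ("RMSE", "MAE", "R**2", "Adj R**2"):
        for label in ("train", "test"):
            print(f"{metric} {label}: {scores[label][metric]}")
    return scores


# Tiny made-up example: perfect "train" predictions, slightly noisy "test" ones.
X_d = np.arange(5, dtype=float).reshape(-1, 1)
y_d = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
demo_scores = print_scores_sketch([X_d, y_d, y_d], [X_d, y_d, y_hat])
```

Adjusted R² penalizes extra feature columns, so it is the more useful of the two when comparing models fitted on different numbers of features.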

#### Ordinary Least Squares

In [11]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    LinearRegression()
)

RMSE train: 89254.29011925084
RMSE test: 86076.32198255186
MAE train: 61171.33790170132
MAE test: 61005.00779588944
R**2 train: 0.7656159664874821
R**2 test: 0.7842687280366942
Adj R**2 train: 0.7651163321195018
Adj R**2 test: 0.7828828740412126


In [12]:
print(f'No. features: {model.n_features_in_}')
print(f'Coefficients: {model.coef_}')

No. features: 9
Coefficients: [[ 0.00000000e+00  2.07690707e+05  8.51663351e+04 -5.82135617e+04
  -8.79517087e+04 -4.19336571e+18 -4.19336571e+18 -4.19336571e+18
   7.25821433e+05]]
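
Several identical coefficients of magnitude around 1e18 are a common symptom of linearly dependent columns, although this notebook does not confirm that this is the cause here. The sketch below, on made-up data, shows one way such a dependency can arise (one-hot columns that sum to the bias column) and how a rank check exposes it. All names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Three one-hot columns for a made-up categorical feature; each row sums to 1.
category = rng.integers(0, 3, size=n)
dummies = np.eye(3)[category]

# A constant bias column and one unrelated numeric feature.
bias = np.ones((n, 1))
other = rng.normal(size=(n, 1))

# The bias column equals the sum of the three dummy columns,
# so the design matrix is rank deficient.
X_collinear = np.hstack([bias, other, dummies])
rank = np.linalg.matrix_rank(X_collinear)

print(f"Columns: {X_collinear.shape[1]}, rank: {rank}")
```

When the rank is lower than the number of columns, ordinary least squares has no unique solution, and dropping one of the dependent columns or adding regularization is a common remedy.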


#### Ridge

In [13]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    Ridge(alpha=10)
)

RMSE train: 89677.89496762241
RMSE test: 86991.59909339725
MAE train: 61978.9517333725
MAE test: 62135.71752462699
R**2 train: 0.7633858928127033
R**2 test: 0.7796564563685622
Adj R**2 train: 0.7628815046164252
Adj R**2 test: 0.7782409732188955



#### Lasso

In [15]:
model = run_regression(
    [X_train_sc_skb_poly, y_train],
    [X_test_sc_skb_poly, y_test],
    Lasso(alpha=1)
)

RMSE train: 89251.61748105274
RMSE test: 86063.53998478205
MAE train: 61212.58438737709
MAE test: 61045.37161907897
R**2 train: 0.7656300031089445
R**2 test: 0.7843327937829144
Adj R**2 train: 0.7651303986627058
Adj R**2 test: 0.7829473513446891


In [16]:
print(f'No. features: {model.n_features_in_}')
print(f'Coefficients: {model.coef_}')

No. features: 9
Coefficients: [     0.         208684.28911058  85469.11127912 -57334.05184123
 -88799.75199856  -1001.1963041   14137.77705301     -0.
 725819.13662042]


### Support Vector Machines

### Random Forest

### XGBoost