# 2 - Model Selection

## Import Train-Test Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
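
One way to try several baselines at once is to keep them in a dictionary and loop over it. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_regression` in place of the project's CSV files, the names `X_demo`, `baseline_models`, and `baseline_rmse` are hypothetical, and xgboost is left out so the example depends only on scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real train/test data (assumption).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Untuned baseline models with default (or near-default) settings.
baseline_models = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Fit each model and record its RMSE on the held-out split.
baseline_rmse = {}
for name, model in baseline_models.items():
    model.fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
    baseline_rmse[name] = float(rmse)
    print(f"{name}: test RMSE = {rmse:.2f}")
```

Adding another model is then a one-line change to the dictionary, which keeps the comparison consistent across models.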

In [3]:
# import models and fit

Consider which metrics you want to use to evaluate success.
- Mean squared error is expressed in squared units (dollars squared). Can we actually relate that to the size of a typical error?
- Try root mean squared error so that the error is back in the original units (dollars).
- How do outliers affect RMSE?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use.
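
The metrics above can be computed in a few lines. This is a minimal sketch, not the project's `print_scores` helper: the function name `regression_metrics` and the toy numbers are assumptions for illustration, and adjusted R^2 uses the standard formula `1 - (1 - R^2) * (n - 1) / (n - p - 1)` with `n` samples and `p` features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_metrics(y_true, y_pred, n_features):
    """Return RMSE, MAE, R^2 and adjusted R^2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n_samples = y_true.shape[0]
    r2 = r2_score(y_true, y_pred)
    # Adjusted R^2 penalizes R^2 for the number of features used.
    adj_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)
    return {
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "MAE": float(mean_absolute_error(y_true, y_pred)),
        "R2": float(r2),
        "Adj R2": float(adj_r2),
    }


# Toy example (made-up values): three small errors and one large one.
toy_true = [100.0, 200.0, 300.0, 400.0]
toy_pred = [110.0, 190.0, 310.0, 300.0]
toy_scores = regression_metrics(toy_true, toy_pred, n_features=1)
print(toy_scores)
```

In the toy example the single large error pulls RMSE well above MAE, which is the outlier sensitivity the questions above ask about.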

In [4]:
# gather evaluation metrics and compare results

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
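
Two of the approaches named above, Lasso-based selection and RFE, might look roughly like this in scikit-learn. The sketch assumes synthetic data from `make_regression`; the variable names, `alpha=1.0`, and `n_features_to_select=5` are arbitrary illustrative choices, not recommendations for this dataset.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data where only 5 of the 20 features carry signal (assumption).
X_fs, y_fs = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Lasso shrinks some coefficients to exactly zero; SelectFromModel keeps
# the features whose coefficients stay (effectively) nonzero.
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_fs, y_fs)
lasso_mask = lasso_selector.get_support()

# RFE repeatedly refits the estimator and drops the weakest feature
# until the requested number of features remains.
rfe_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X_fs, y_fs)
rfe_mask = rfe_selector.get_support()

# Reduced feature matrix to refit models on.
X_fs_reduced = rfe_selector.transform(X_fs)
print("Lasso kept:", int(lasso_mask.sum()), "features")
print("RFE kept:", int(rfe_mask.sum()), "features")
```

On real data the selector should be fit on the training split only and then applied to the test split with `transform`, the same way the scaler is handled.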



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)

## Machine Learning

### Linear Regression

In [6]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import Ridge, Lasso

from utils import print_scores

#### Ordinary Least Squares

In [7]:
scaler = MinMaxScaler()
# Ridge with alpha=0 applies no penalty, so this is ordinary least squares
reg = Ridge(alpha=0)
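
As a quick sanity check on using `Ridge(alpha=0)` for ordinary least squares, the sketch below compares its coefficients with `LinearRegression` on synthetic, well-conditioned data. The data and the names `X_ols`, `ridge_zero`, `ols`, and `coef_gap` are assumptions for illustration. The scikit-learn documentation advises `LinearRegression` for the unpenalized case for numerical reasons, so agreement on harder data is not guaranteed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with few, uncorrelated features (assumption).
X_ols, y_ols = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

ridge_zero = Ridge(alpha=0).fit(X_ols, y_ols)  # no penalty term
ols = LinearRegression().fit(X_ols, y_ols)     # plain least squares

# Largest absolute difference between the two coefficient vectors.
coef_gap = float(np.max(np.abs(ridge_zero.coef_ - ols.coef_)))
print(f"max coefficient difference: {coef_gap:.2e}")
```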

In [8]:
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

In [9]:
reg.fit(X_train_sc, y_train)

y_train_pred = reg.predict(X_train_sc)
y_test_pred = reg.predict(X_test_sc)

print_scores(
    [X_train_sc, y_train, y_train_pred], 
    [X_test_sc, y_test, y_test_pred]
)

RMSE train: 84511.8694924404
RMSE test: 82051.68937397089
MAE train: 59144.353024574666
MAE test: 60121.05882352941
R**2 train: 0.7898616871437589
R**2 test: 0.8039708053506173
Adj R**2 train: 0.7873486721610247
Adj R**2 test: 0.7967638496649782


#### Ridge

#### Lasso

### Support Vector Machines

### Random Forest

### XGBoost