# 2 - Model Selection

## Import Train-Test Data

In [58]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# R-squared metric
from sklearn.metrics import r2_score
# Ridge regression
from sklearn.linear_model import Ridge

In [59]:
dirname = '../data/processed/'
X_train = pd.read_csv(dirname + 'X_train_trimmed.csv', sep=',')
X_test = pd.read_csv(dirname + 'X_test_trimmed.csv', sep=',')
y_train = pd.read_csv(dirname + 'y_train_trimmed.csv', sep=',')
y_test = pd.read_csv(dirname + 'y_test_trimmed.csv', sep=',')

#X_train.info() 4232 entries, 0 to 4231 Data columns (total 50 columns)
#X_test.info() 1411 entries, 0 to 1410 Data columns (total 50 columns)
#y_train.info() 4232 entries, 0 to 4231 Data columns (total 1 columns)
#y_test.info() 1411 entries, 0 to 1410 Data columns (total 1 columns)


## Instructions (Delete Later)

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [3]:
# import models and fit
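
As a rough sketch of the baseline step described above, the loop below fits several untuned regressors and records a test RMSE for each. It runs on synthetic data from `make_regression` so that it is self-contained; the names `X_demo`, `baseline_models`, and `fit_baselines` are illustrative assumptions, not part of this project. In the notebook itself, the processed `X_train` / `y_train` frames would take the place of the synthetic arrays, and `xgb.XGBRegressor()` could be added to the dictionary in the same way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the real data (assumption: the notebook would use
# the frames loaded from ../data/processed/ instead).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, noise=10.0, random_state=42
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# Default hyperparameters only: this is a baseline, not a tuned comparison.
baseline_models = {
    "linear_regression": LinearRegression(),
    "svr": SVR(),
    "random_forest": RandomForestRegressor(random_state=42),
}


def fit_baselines(models, X_tr, y_tr, X_te, y_te):
    """Fit each model and return a {name: test RMSE} dictionary."""
    results = {}
    for name, estimator in models.items():
        estimator.fit(X_tr, y_tr)
        predictions = estimator.predict(X_te)
        results[name] = float(np.sqrt(mean_squared_error(y_te, predictions)))
    return results


baseline_rmse = fit_baselines(baseline_models, Xd_train, yd_train, Xd_test, yd_test)

# Print models from best (lowest RMSE) to worst.
for name, rmse in sorted(baseline_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: test RMSE = {rmse:.2f}")
```

Keeping the models in a dictionary makes it easy to add or drop candidates without touching the fitting loop.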

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [4]:
# gather evaluation metrics and compare results
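
The questions above can be explored with a small helper that computes all of the candidate metrics at once. This is a minimal sketch: `regression_report` and the toy prices are hypothetical, and adjusted R-squared is computed by hand from the standard formula because scikit-learn does not provide it directly.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R-squared and adjusted R-squared as a dict."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    n_samples = y_true.shape[0]

    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R-squared penalises R-squared for the number of predictors.
    adjusted_r2 = 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),  # back in the target's units (dollars)
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2),
        "adjusted_r2": float(adjusted_r2),
    }


# Toy prices in dollars: four small misses and one large miss (an outlier).
toy_true = [100_000, 150_000, 200_000, 250_000, 300_000]
toy_pred = [110_000, 140_000, 210_000, 240_000, 400_000]

toy_report = regression_report(toy_true, toy_pred, n_features=2)
for metric, value in toy_report.items():
    print(f"{metric}: {value:,.3f}")
```

In this toy example the single large miss pushes RMSE (about 45,607) well above MAE (28,000), which illustrates how RMSE weights outliers more heavily than MAE does.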

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.



In [5]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)
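
Two of the feature selection approaches mentioned above, Lasso and RFE, could be sketched as follows. The data is synthetic, and the values `alpha=1.0` and `n_features_to_select=5` are arbitrary assumptions chosen for illustration; they would need to be chosen for the real dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, of which only 5 carry signal.
X_fs, y_fs = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=5.0, random_state=0
)
# Scale features so that the L1 penalty treats them comparably.
X_fs = StandardScaler().fit_transform(X_fs)

# Lasso: the L1 penalty can shrink coefficients to exactly zero,
# so the features with non-zero coefficients form the selected subset.
lasso = Lasso(alpha=1.0).fit(X_fs, y_fs)
lasso_selected = np.flatnonzero(lasso.coef_ != 0)

# RFE: repeatedly fit the estimator and drop the weakest feature
# until the requested number of features remains.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X_fs, y_fs)
rfe_selected = np.flatnonzero(rfe.support_)

# Reduced feature matrix that the models could be refit on.
X_reduced = X_fs[:, rfe_selected]

print("Lasso kept feature indices:", lasso_selected.tolist())
print("RFE kept feature indices:  ", rfe_selected.tolist())
print("Reduced shape:", X_reduced.shape)
```

Refitting the baseline models on `X_reduced` and comparing the chosen metrics against the full feature set would show whether the simpler model holds up.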

## Machine Learning

### Linear Regression

### Support Vector Machines

### XGBoost

In [53]:
# Concatenate the train and test sets to form one common dataset
data_x = pd.concat([X_train, X_test], ignore_index=True)
data_y = pd.concat([y_train, y_test], ignore_index=True)

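# Create X and y; combine_first fills any missing value in a row
# with the value from the first row of the same dataset
X = data_x.apply(lambda row: row.combine_first(data_x.iloc[0]), axis=1)
y = data_y.apply(lambda row: row.combine_first(data_y.iloc[0]), axis=1)

# Re-split X and y into train (80%) and test (20%) sets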
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training and fitting
model = xgb.XGBRegressor()
model.fit(X_train, y_train)

# Predictions
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

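# Model evaluation: RMSE is the square root of the mean squared error
# (np.sqrt is used because squared=False is deprecated in newer scikit-learn releases)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))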

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

Train RMSE: 17777.76151529174
Test RMSE: 28168.875238882905
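
The row-wise `combine_first` calls in the cell above are easy to misread, so here is a tiny illustration of what they do on a hypothetical three-row frame (`demo` is made up for this sketch). Each missing value is replaced with the value in the same column of the first row, which should match the vectorised `fillna` shown alongside it.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
demo = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": [10.0, 20.0, np.nan],
    }
)

# Same pattern as in the notebook: fill gaps in each row from the first row.
filled = demo.apply(lambda row: row.combine_first(demo.iloc[0]), axis=1)

# Vectorised alternative: fillna with a Series fills column by column,
# matching the Series index against the column names.
filled_fast = demo.fillna(demo.iloc[0])

print(filled)
```

Note that this only removes missing values if the first row itself has none, and filling from a single arbitrary row is a strong assumption about the data.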


In [57]:
# Add R-squared as a second metric
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

# Evaluating XGBoost using R-squared
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)

Train R-squared: 0.9907349517208001
Test R-squared: 0.9766310132530434


### Random Forest

In [66]:
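# Using the same dataset as above, we can train a different model
model = RandomForestRegressor(random_state=42)
# y_train is a one-column DataFrame; flatten it to 1-D to avoid a DataConversionWarning
model.fit(X_train, y_train.values.ravel())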

# Predict on the train and test sets
y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

# Evaluate the model on RMSE and R-squared
# (np.sqrt is used because squared=False is deprecated in newer scikit-learn releases)
train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print("Train RMSE:", train_rmse)
print("Test RMSE:", test_rmse)

# Add R-squared as a second metric
train_r2 = r2_score(y_train, y_pred_train)
test_r2 = r2_score(y_test, y_pred_test)

# Evaluating Random Forest using R-squared
print("Train R-squared:", train_r2)
print("Test R-squared:", test_r2)


Train RMSE: 18785.15520230696
Test RMSE: 51995.015334661904
Train R-squared: 0.9896175607544512
Test R-squared: 0.9212828933806827


In [None]:
# Comparing XGBoost and Random Forest: on the test set XGBoost performs better,
# with a lower RMSE (about 28,169 vs. 51,995) and a higher R-squared (about 0.977 vs. 0.921)
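
One way to make the comparison explicit is to collect the metrics reported in the cell outputs above into a single table. The sketch below hard-codes those printed values; the `rmse_gap_ratio` column (test RMSE divided by train RMSE) is an added, illustrative indicator of how much worse each model does on unseen data.

```python
import pandas as pd

# Metrics copied from the XGBoost and Random Forest cell outputs above.
reported = pd.DataFrame(
    {
        "train_rmse": [17777.76151529174, 18785.15520230696],
        "test_rmse": [28168.875238882905, 51995.015334661904],
        "train_r2": [0.9907349517208001, 0.9896175607544512],
        "test_r2": [0.9766310132530434, 0.9212828933806827],
    },
    index=["XGBoost", "Random Forest"],
)

# Ratio of test to train RMSE: values well above 1 suggest overfitting.
reported["rmse_gap_ratio"] = reported["test_rmse"] / reported["train_rmse"]

# Model with the lowest error on the held-out test set.
best_by_test_rmse = reported["test_rmse"].idxmin()

print(reported.round(3))
print("Lowest test RMSE:", best_by_test_rmse)
```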