This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost

In [50]:
#Imports 
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
import xgboost as xgb

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

from sklearn.preprocessing import OneHotEncoder  # StandardScaler is already imported above
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

# Disable all warnings to keep the output readable (this also hides convergence and deprecation warnings)
import warnings
warnings.filterwarnings('ignore')

#Import from functions_variables.py
from functions_variables import *

In [11]:
# X = pd.read_csv('../data/dataframes/X_train_encoded.csv')
# y = pd.read_csv('../data/dataframes/Y_train_encoded.csv')
# Not using the encoded dataframes for now

In [33]:
X = pd.read_csv('../data/dataframes/X_train.csv')
y = pd.read_csv('../data/dataframes/Y_train.csv')

In [34]:
# Drop identifier, media, location, and date columns:
# permalink, listing, primary_photo, state, postal_code, city, sold_date
X.drop(columns=['permalink', 'listing', 'primary_photo', 'state', 'postal_code', 'city', 'sold_date'], inplace=True)

In [35]:
#Save X to csv again 
X.to_csv('../data/dataframes/X_train_final.csv', index=False)

In [36]:

# Split the (unencoded) data into training and validation sets (80/20)
X_train, X_val, Y_train, Y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In [37]:
print(X_train.shape)
print(Y_val.shape)

(3204, 23)
(801, 1)


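Best not to encode anything for the MVP; try encoding later if there is time.

Open questions:
- Use state as a feature?
- Encode with a static dict, or dynamically with an encoder fitted on the data?

No information from Y should be present in the K-fold training features.
That would be data leakage: the model should not have access to future prices
(it would be like me knowing in advance what I am going to do in the future).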
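One way to act on the leakage concern noted above is to put all preprocessing inside a `Pipeline`, so that imputation, scaling, and encoding are re-fitted on the training folds only. The sketch below is an illustration under assumptions, not the notebook's actual preprocessing: the data is synthetic, and the column names (`sqft`, `beds`, `state`) and the choice of `Ridge` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data; column names are hypothetical.
rng = np.random.default_rng(42)
n = 200
demo_X = pd.DataFrame({
    "sqft": rng.normal(1800, 400, n),
    "beds": rng.integers(1, 6, n).astype(float),
    "state": rng.choice(["TX", "CA", "NY"], n),
})
demo_y = 150 * demo_X["sqft"] + 20000 * demo_X["beds"] + rng.normal(0, 10000, n)
demo_X.loc[::25, "sqft"] = np.nan  # a few missing values to impute

numeric_cols = ["sqft", "beds"]
categorical_cols = ["state"]

preprocess = ColumnTransformer([
    # Numeric columns: fill gaps with the median, then standardize
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    # "Dynamic" encoding: categories are learned from the training fold;
    # handle_unknown="ignore" avoids errors on categories unseen during fit
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

pipe = Pipeline([("preprocess", preprocess), ("model", Ridge())])

# Within each fold, the whole pipeline is fitted on the training part only,
# so no statistics from the held-out part leak into preprocessing.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, demo_X, demo_y, cv=cv, scoring="r2")
print(scores.mean())
```

Fitting a scaler or encoder on the full dataset before splitting would let validation rows influence the training features, which is the kind of leakage the note warns about.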



In [38]:
#Save again 
X_train.to_csv('../data/dataframes/X_train_final2.csv', index=False)
X_val.to_csv('../data/dataframes/X_val_final.csv', index=False)
Y_train.to_csv('../data/dataframes/Y_train_final.csv', index=False)
Y_val.to_csv('../data/dataframes/Y_val_final.csv', index=False)

In [11]:
#Models 
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'Random Forest': RandomForestRegressor(random_state=42),
    'K-Nearest Neighbors': KNeighborsRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'AdaBoost': AdaBoostRegressor(random_state=42),
    'Elastic Net': ElasticNet(random_state=42),
    'Ridge': Ridge(random_state=42),
    'Lasso': Lasso(random_state=42),
    'Support Vector Machine': SVR(),
    'XGBoost': xgb.XGBRegressor()
}

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

Mean Squared Error (MSE): This is the average of the squared differences between the predicted and actual values. It's a popular metric for regression problems, but its value can be hard to interpret because it's in squared units.

Root Mean Squared Error (RMSE): This is the square root of the MSE. It's in the same units as the target variable, which can make it easier to interpret than MSE. It gives a higher weight to larger errors, which can be useful if large errors are particularly undesirable.

Mean Absolute Error (MAE): This is the average of the absolute differences between the predicted and actual values. It's less sensitive to outliers than MSE and RMSE.

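R^2: This is the coefficient of determination, the proportion of variance in the target that the model explains. A value of 1 means a perfect fit, 0 means the model does no better than always predicting the mean, and the value can be negative on held-out data. It's a popular metric for regression problems, but a high R^2 on the training data can be misleading if the model is overfitting, so it should be judged on the validation set.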

Adjusted R^2: This is a modified version of R^2 that takes into account the number of predictors in the model. It's generally a better choice than R^2 when comparing models with different numbers of predictors.

If your data has many outliers, MAE might be a good choice because it's less sensitive to outliers than MSE and RMSE. If you care more about large errors, RMSE might be a good choice. If you want a balance between these two, you might consider using both RMSE and MAE.

R^2 or adjusted R^2 can be useful for getting a sense of how well the model fits the data overall, but they should be used in conjunction with other metrics, not in isolation.

Remember to consider the nature of your problem and your specific needs when choosing metrics.
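To make the RMSE versus MAE trade-off concrete, here is a small sketch on made-up prices. The helper name `regression_metrics` and the numbers are illustrative assumptions, not part of the notebook; adjusted R^2 is computed with the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_metrics(y_true, y_pred, n_features):
    """Return MSE, RMSE, MAE, R2 and adjusted R2 (illustrative helper)."""
    mse = mean_squared_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    n = len(y_true)
    # Adjusted R2 penalizes R2 for the number of predictors p = n_features
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2,
        "Adjusted R2": adj_r2,
    }

# Made-up sale prices in dollars
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000], dtype=float)

y_close = y_true + 10_000      # every prediction is off by $10k
y_outlier = y_true.copy()
y_outlier[-1] += 50_000        # one prediction is off by $50k, the rest are exact

m_close = regression_metrics(y_true, y_close, n_features=2)
m_outlier = regression_metrics(y_true, y_outlier, n_features=2)

print(m_close)
print(m_outlier)
```

Both sets of predictions have the same MAE ($10,000), but the RMSE of the second set is about $22,361 because the single large error is squared before averaging. This is why RMSE is the more outlier-sensitive of the two.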



In [51]:
results = pd.DataFrame(columns=['Model', 'R2', 'MSE', 'RMSE', 'MAE']) #initialize dataframe
for i, (name, model) in enumerate(models.items()):
    result = train_models(name, model, X_train, Y_train, X_val, Y_val) #Function from functions_variables.py
    results.loc[i] = result #Add results to dataframe

Linear Regression R2: 0.29348543043156283
Linear Regression MSE: 216346981985.07208
Linear Regression RMSE: 465131.14493126783
Linear Regression MAE: 195268.97186896513


Decision Tree R2: 0.9935170009893199
Decision Tree MSE: 1985206435.3458178
Decision Tree RMSE: 44555.65548104772
Decision Tree MAE: 7110.639200998751


Random Forest R2: 0.9719021853398747
Random Forest MSE: 8604036864.812567
Random Forest RMSE: 92757.94771777008
Random Forest MAE: 23025.069375780276


K-Nearest Neighbors R2: 0.759734546103638
K-Nearest Neighbors MSE: 73573437922.87675
K-Nearest Neighbors RMSE: 271244.24034968327
K-Nearest Neighbors MAE: 137401.28164794008


Gradient Boosting R2: 0.9212704295458354
Gradient Boosting MSE: 24108356280.81085
Gradient Boosting RMSE: 155268.65839830926
Gradient Boosting MAE: 94444.86698563586


AdaBoost R2: 0.7059626175984804
AdaBoost MSE: 90039332539.4789
AdaBoost RMSE: 300065.54707176716
AdaBoost MAE: 252481.59546373616


Elastic Net R2: 0.2883285232603424
Elastic Net MS
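
The actual `train_models` helper lives in `functions_variables.py` and is not shown in this notebook. As a rough guide to what such a helper might do, here is a hypothetical stand-in, `train_models_sketch`, which assumes the helper fits the model, scores it on the validation set, prints each metric, and returns a row matching the `results` columns. The demo data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def train_models_sketch(name, model, X_train, Y_train, X_val, Y_val):
    """Hypothetical stand-in for train_models; the real one may differ."""
    # ravel() turns a single-column target frame into the 1-D array sklearn expects
    model.fit(X_train, np.ravel(Y_train))
    preds = model.predict(X_val)

    r2 = r2_score(Y_val, preds)
    mse = mean_squared_error(Y_val, preds)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(Y_val, preds)

    for label, value in [("R2", r2), ("MSE", mse), ("RMSE", rmse), ("MAE", mae)]:
        print(f"{name} {label}: {value}")

    # Same order as the results columns: Model, R2, MSE, RMSE, MAE
    return [name, r2, mse, rmse, mae]

# Synthetic demo data, only to show the call pattern
demo_X, demo_y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
dX_train, dX_val, dy_train, dy_val = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42
)

demo_results = pd.DataFrame(columns=["Model", "R2", "MSE", "RMSE", "MAE"])
row = train_models_sketch("Linear Regression", LinearRegression(),
                          dX_train, dy_train, dX_val, dy_val)
demo_results.loc[0] = row
```

All metrics here are computed on the validation split, so they describe generalization rather than fit to the training data.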

In [52]:
results.sort_values('R2', ascending=False)


Index,Model,R2,MSE,RMSE,MAE
1,Decision Tree,0.993517,1985206000.0,44555.655481,7110.639201
10,XGBoost,0.989807,3121348000.0,55869.029567,13922.39931
2,Random Forest,0.971902,8604037000.0,92757.947718,23025.069376
4,Gradient Boosting,0.92127,24108360000.0,155268.658398,94444.866986
3,K-Nearest Neighbors,0.759735,73573440000.0,271244.24035,137401.281648
5,AdaBoost,0.705963,90039330000.0,300065.547072,252481.595464
7,Ridge,0.293531,216333000000.0,465116.075828,195194.910954
8,Lasso,0.293488,216346300000.0,465130.456062,195265.928302
0,Linear Regression,0.293485,216347000000.0,465131.144931,195268.971869
6,Elastic Net,0.288329,217926100000.0,466825.573594,178525.508365


In [None]:
# gather evaluation metrics and compare results

Decision Trees

A decision tree is a supervised learning algorithm that can be used for both classification and regression; this notebook uses the regression variant (DecisionTreeRegressor). It works for both categorical and continuous input and output variables. The algorithm recursively splits the sample into two or more increasingly homogeneous subsets, choosing at each step the most significant splitter/differentiator among the input variables.

Decision trees can, in principle, handle both numerical and categorical data, which is common in housing data. For example, a house's neighborhood is categorical, while its size is numerical. Note that scikit-learn's implementation expects numeric input, so categorical columns still need to be encoded first. Decision trees can also model nonlinear relationships, which can be common in housing data. However, decision trees are prone to overfitting, especially if they are allowed to become very deep, and this can lead to poor generalization performance on new data.

XGBoost

XGBoost stands for eXtreme Gradient Boosting. Rather than training all the models in isolation from one another, boosting trains models in succession, with each new model trained to correct the errors made by the previous ones.

XGBoost is a powerful, flexible, and efficient implementation of the gradient boosting algorithm. It can handle numerical data and, with suitable encoding, categorical data, and it can model complex nonlinear relationships. XGBoost also has built-in regularization (L1 and L2 penalties on the leaf weights) to limit overfitting, which can lead to better generalization performance.
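
Since overfitting is the main risk raised for tree models above, one quick check is to compare training and validation R^2 for the same model. The sketch below uses synthetic data and an arbitrary depth limit of 4, both of which are assumptions chosen for illustration, not values from this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data with noise, standing in for the housing features
X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=25.0, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

tree_scores = {}
for label, depth in [("unrestricted", None), ("max_depth=4", 4)]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    # score() returns R2; store (train R2, validation R2)
    tree_scores[label] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))

for label, (train_r2, val_r2) in tree_scores.items():
    print(f"{label}: train R2 = {train_r2:.3f}, validation R2 = {val_r2:.3f}")
```

An unrestricted tree can memorize the training set, so its training R^2 sits at or near 1. A large gap between the training and validation scores is the signal to look for when deciding whether to limit depth or switch to an ensemble.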





**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)
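
As a starting point for the steps listed in this cell, here is a minimal sketch of recursive feature elimination (RFE) followed by a refit and comparison. It runs on synthetic data with 20 features of which 5 are informative; the dataset, the choice of `LinearRegression` as the estimator, and `n_features_to_select=5` are all assumptions for illustration, and the same pattern would apply to `X_train` and `X_val` once they are fully numeric.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 20 features, only 5 of which actually drive the target
X_fs, y_fs = make_regression(n_samples=400, n_features=20, n_informative=5,
                             noise=10.0, random_state=42)
Xfs_tr, Xfs_va, yfs_tr, yfs_va = train_test_split(X_fs, y_fs, test_size=0.2, random_state=42)

# Fit the selector on the training split only, to avoid leaking validation data
selector = RFE(LinearRegression(), n_features_to_select=5).fit(Xfs_tr, yfs_tr)

# Refit on the full and the reduced feature sets
full_model = LinearRegression().fit(Xfs_tr, yfs_tr)
reduced_model = LinearRegression().fit(selector.transform(Xfs_tr), yfs_tr)

# Compare validation R2 for both feature sets
r2_full = full_model.score(Xfs_va, yfs_va)
r2_reduced = reduced_model.score(selector.transform(Xfs_va), yfs_va)

print("Selected feature indices:", list(selector.get_support(indices=True)))
print(f"Validation R2, all 20 features: {r2_full:.4f}")
print(f"Validation R2, 5 selected features: {r2_reduced:.4f}")
```

If the reduced model scores about the same as the full one on the validation set, the simpler feature set is a reasonable candidate for the final pipeline.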