This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
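
One way to try many untuned models without repeating code is to keep them in a dictionary and loop over it. The sketch below is illustrative only: the synthetic data, the `baselines` dictionary, and the `baseline_scores` name are assumptions, not part of the housing dataset or the cells that follow. XGBoost is left out so the snippet depends on scikit-learn alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Untuned baseline models keyed by a display name.
baselines = {
    "linear regression": LinearRegression(),
    "support vector machine": SVR(),
    "random forest": RandomForestRegressor(random_state=0),
}

# Fit each model and record its R^2 on the data it was fitted on.
baseline_scores = {}
for name, model in baselines.items():
    model.fit(X_demo, y_demo)
    baseline_scores[name] = r2_score(y_demo, model.predict(X_demo))
    print(f"{name}: R^2 = {baseline_scores[name]:.3f}")
```

Adding another candidate then only requires one more dictionary entry.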

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
#import lightgbm as lgb
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso
from sklearn.linear_model import BayesianRidge
from sklearn.svm import SVR
import xgboost as xgb
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import warnings

# Disable all warnings
warnings.filterwarnings('ignore')

# import models and fit

In [10]:
X_train = pd.read_csv('../data/training/X_train.csv')
Y_train = pd.read_csv('../data/training/Y_train.csv')

one_hot_encoded_tags = pd.get_dummies(X_train[['type', 'is_foreclosure', 'city']], dtype=int)
X_train.drop(columns=['type', 'city', 'is_foreclosure'], inplace=True)
X_train = pd.concat([X_train, one_hot_encoded_tags], axis=1)

X_train = X_train.values
Y_train = Y_train.values

In [12]:
linear_regression = LinearRegression()
decision_tree = DecisionTreeRegressor()
random_forest = RandomForestRegressor()
k_neighbors = KNeighborsRegressor()
ridge = Ridge()
poly = PolynomialFeatures()
lasso = Lasso()
bayes = BayesianRidge()
svr = SVR()
xgboost = xgb.XGBRegressor()

In [13]:
linear_regression.fit(X_train, Y_train)
decision_tree.fit(X_train, Y_train)
random_forest.fit(X_train, Y_train)
k_neighbors.fit(X_train, Y_train)
ridge.fit(X_train, Y_train)
X_poly = poly.fit_transform(X_train)
poly_regression = LinearRegression()
poly_regression.fit(X_poly, Y_train)
lasso.fit(X_train, Y_train)
bayes.fit(X_train, Y_train)
svr.fit(X_train, Y_train)
xgboost.fit(X_train, Y_train)

In [14]:
linear_regression_pred = linear_regression.predict(X_train)
decision_tree_pred = decision_tree.predict(X_train)
random_forest_pred = random_forest.predict(X_train)
k_neighbors_pred = k_neighbors.predict(X_train)
ridge_pred = ridge.predict(X_train)
poly__pred = poly_regression.predict(X_poly)
lasso_pred = lasso.predict(X_train)
bayes_pred = bayes.predict(X_train)
svr_pred = svr.predict(X_train)
xgboost_pred = xgboost.predict(X_train)

In [15]:
print(f'Linear Regression, R2 Score : {r2_score(Y_train, linear_regression_pred)}')
print(f'Linear Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, linear_regression_pred)}')
print(f'Linear Regression, Mean Squared Error Score: {mean_squared_error(Y_train, linear_regression_pred)}')
print(f'Decision Tree, R2 Score : {r2_score(Y_train, decision_tree_pred)}')
print(f'Decision Tree, Mean Absolute Error Score: {mean_absolute_error(Y_train, decision_tree_pred)}')
print(f'Decision Tree, Mean Squared Error Score: {mean_squared_error(Y_train, decision_tree_pred)}')
print(f'Random Forest, R2 Score : {r2_score(Y_train, random_forest_pred)}')
print(f'Random Forest, Mean Absolute Error Score: {mean_absolute_error(Y_train, random_forest_pred)}')
print(f'Random Forest, Mean Squared Error Score: {mean_squared_error(Y_train, random_forest_pred)}')
print(f'K Neighbours, R2 Score : {r2_score(Y_train, k_neighbors_pred)}')
print(f'K Neighbours, Mean Absolute Error Score: {mean_absolute_error(Y_train, k_neighbors_pred)}')
print(f'K Neighbours, Mean Squared Error Score: {mean_squared_error(Y_train, k_neighbors_pred)}')
print(f'Ridge Regression, R2 Score : {r2_score(Y_train, ridge_pred)}')
print(f'Ridge Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, ridge_pred)}')
print(f'Ridge Regression, Mean Squared Error Score: {mean_squared_error(Y_train, ridge_pred)}')
print(f'Polynomial Regression, R2 Score : {r2_score(Y_train, poly__pred)}')
print(f'Polynomial Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, poly__pred)}')
print(f'Polynomial Regression, Mean Squared Error Score: {mean_squared_error(Y_train, poly__pred)}')
print(f'Lasso Regression, R2 Score : {r2_score(Y_train, lasso_pred)}')
print(f'Lasso Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, lasso_pred)}')
print(f'Lasso Regression, Mean Squared Error Score: {mean_squared_error(Y_train, lasso_pred)}')
print(f'Bayesian Ridge, R2 Score : {r2_score(Y_train, bayes_pred)}')
print(f'Bayesian Ridge, Mean Absolute Error Score: {mean_absolute_error(Y_train, bayes_pred)}')
print(f'Bayesian Ridge, Mean Squared Error Score: {mean_squared_error(Y_train, bayes_pred)}')
print(f'SVR, R2 Score : {r2_score(Y_train, svr_pred)}')
print(f'SVR, Mean Absolute Error Score: {mean_absolute_error(Y_train, svr_pred)}')
print(f'SVR, Mean Squared Error Score: {mean_squared_error(Y_train, svr_pred)}')
print(f'XGboost, R2 Score : {r2_score(Y_train, xgboost_pred)}')
print(f'XGboost, Mean Absolute Error Score: {mean_absolute_error(Y_train, xgboost_pred)}')
print(f'XGboost, Mean Squared Error Score: {mean_squared_error(Y_train, xgboost_pred)}')


Linear Regression, R2 Score : 0.4642206763381491
Linear Regression, Mean Absolute Error Score: 172670.28044776496
Linear Regression, Mean Squared Error Score: 178303546867.79645
Decision Tree, R2 Score : 0.9999223547303954
Decision Tree, Mean Absolute Error Score: 258.3979328165375
Decision Tree, Mean Squared Error Score: 25839793.281653747
Random Forest, R2 Score : 0.9984882596839305
Random Forest, Mean Absolute Error Score: 5150.633001795793
Random Forest, Mean Squared Error Score: 503096421.2205377
K Neighbours, R2 Score : 0.9580280730625375
K Neighbours, Mean Absolute Error Score: 34945.73643410853
K Neighbours, Mean Squared Error Score: 13967958656.33075
Ridge Regression, R2 Score : 0.4641243106807146
Ridge Regression, Mean Absolute Error Score: 172830.19204442468
Ridge Regression, Mean Squared Error Score: 178335616672.95316
Polynomial Regression, R2 Score : 0.9999223547303798
Polynomial Regression, Mean Absolute Error Score: 258.4507794315655
Polynomial Regression, Mean Squared 
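
The scores printed above are computed on the same rows the models were fitted on, so a near-perfect R^2 (as for the decision tree) may reflect memorization rather than predictive skill. A minimal sketch on synthetic data shows how training and held-out scores can diverge. The data, the split sizes, and every variable name below are assumptions for illustration, not the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): one useful feature, one noise feature, noisy target.
rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=5.0, size=300)

# Hold out a quarter of the rows that the model never sees during fitting.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

# An unconstrained tree can memorize the rows it was fitted on.
tree = DecisionTreeRegressor(random_state=42).fit(X_fit, y_fit)
r2_fit = r2_score(y_fit, tree.predict(X_fit))
r2_hold = r2_score(y_hold, tree.predict(X_hold))

print(f"R^2 on fitted rows:   {r2_fit:.3f}")
print(f"R^2 on held-out rows: {r2_hold:.3f}")
```

A large gap between the two numbers is a sign of overfitting, which is why the later cells score on `X_test` instead.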

In [16]:
X_train = pd.read_csv('../data/training/X_train.csv')
Y_train = pd.read_csv('../data/training/Y_train.csv')

one_hot_encoded_tags = pd.get_dummies(X_train[['type', 'is_foreclosure', 'city']], dtype=int)
X_train.drop(columns=['type', 'city', 'is_foreclosure'], inplace=True)
X_train = pd.concat([X_train, one_hot_encoded_tags], axis=1)

# Keep only two numeric features; the one-hot columns built above are dropped by this selection
X_train = X_train[['year_built', 'sqft']]

X_train = X_train.values
Y_train = Y_train.values


In [17]:
linear_regression.fit(X_train, Y_train)
decision_tree.fit(X_train, Y_train)
random_forest.fit(X_train, Y_train)
k_neighbors.fit(X_train, Y_train)
ridge.fit(X_train, Y_train)
X_poly = poly.fit_transform(X_train)
poly_regression = LinearRegression()
poly_regression.fit(X_poly, Y_train)
lasso.fit(X_train, Y_train)
bayes.fit(X_train, Y_train)
svr.fit(X_train, Y_train)
xgboost.fit(X_train, Y_train)

In [18]:
linear_regression_pred = linear_regression.predict(X_train)
decision_tree_pred = decision_tree.predict(X_train)
random_forest_pred = random_forest.predict(X_train)
k_neighbors_pred = k_neighbors.predict(X_train)
ridge_pred = ridge.predict(X_train)
poly__pred = poly_regression.predict(X_poly)
lasso_pred = lasso.predict(X_train)
bayes_pred = bayes.predict(X_train)
svr_pred = svr.predict(X_train)
xgboost_pred = xgboost.predict(X_train)

In [20]:
print(f'Linear Regression, R2 Score : {r2_score(Y_train, linear_regression_pred)}')
print(f'Linear Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, linear_regression_pred)}')
print(f'Linear Regression, Mean Squared Error Score: {mean_squared_error(Y_train, linear_regression_pred)}')
print(f'Decision Tree, R2 Score : {r2_score(Y_train, decision_tree_pred)}')
print(f'Decision Tree, Mean Absolute Error Score: {mean_absolute_error(Y_train, decision_tree_pred)}')
print(f'Decision Tree, Mean Squared Error Score: {mean_squared_error(Y_train, decision_tree_pred)}')
print(f'Random Forest, R2 Score : {r2_score(Y_train, random_forest_pred)}')
print(f'Random Forest, Mean Absolute Error Score: {mean_absolute_error(Y_train, random_forest_pred)}')
print(f'Random Forest, Mean Squared Error Score: {mean_squared_error(Y_train, random_forest_pred)}')
print(f'K Neighbours, R2 Score : {r2_score(Y_train, k_neighbors_pred)}')
print(f'K Neighbours, Mean Absolute Error Score: {mean_absolute_error(Y_train, k_neighbors_pred)}')
print(f'K Neighbours, Mean Squared Error Score: {mean_squared_error(Y_train, k_neighbors_pred)}')
print(f'Ridge Regression, R2 Score : {r2_score(Y_train, ridge_pred)}')
print(f'Ridge Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, ridge_pred)}')
print(f'Ridge Regression, Mean Squared Error Score: {mean_squared_error(Y_train, ridge_pred)}')
print(f'Polynomial Regression, R2 Score : {r2_score(Y_train, poly__pred)}')
print(f'Polynomial Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, poly__pred)}')
print(f'Polynomial Regression, Mean Squared Error Score: {mean_squared_error(Y_train, poly__pred)}')
print(f'Lasso Regression, R2 Score : {r2_score(Y_train, lasso_pred)}')
print(f'Lasso Regression, Mean Absolute Error Score: {mean_absolute_error(Y_train, lasso_pred)}')
print(f'Lasso Regression, Mean Squared Error Score: {mean_squared_error(Y_train, lasso_pred)}')
print(f'Bayesian Ridge, R2 Score : {r2_score(Y_train, bayes_pred)}')
print(f'Bayesian Ridge, Mean Absolute Error Score: {mean_absolute_error(Y_train, bayes_pred)}')
print(f'Bayesian Ridge, Mean Squared Error Score: {mean_squared_error(Y_train, bayes_pred)}')
print(f'SVR, R2 Score : {r2_score(Y_train, svr_pred)}')
print(f'SVR, Mean Absolute Error Score: {mean_absolute_error(Y_train, svr_pred)}')
print(f'SVR, Mean Squared Error Score: {mean_squared_error(Y_train, svr_pred)}')
print(f'XGboost, R2 Score : {r2_score(Y_train, xgboost_pred)}')
print(f'XGboost, Mean Absolute Error Score: {mean_absolute_error(Y_train, xgboost_pred)}')
print(f'XGboost, Mean Squared Error Score: {mean_squared_error(Y_train, xgboost_pred)}')

Linear Regression, R2 Score : 0.2537864258620597
Linear Regression, Mean Absolute Error Score: 202915.21777563344
Linear Regression, Mean Squared Error Score: 248334568195.58844
Decision Tree, R2 Score : 0.41578407971725173
Decision Tree, Mean Absolute Error Score: 173031.23719522022
Decision Tree, Mean Squared Error Score: 194422901598.9273
Random Forest, R2 Score : 0.415251088371168
Random Forest, Mean Absolute Error Score: 174158.6900669032
Random Forest, Mean Squared Error Score: 194600277326.69345
K Neighbours, R2 Score : 0.33433033230152986
K Neighbours, Mean Absolute Error Score: 179948.32041343668
K Neighbours, Mean Squared Error Score: 221530129198.9664
Ridge Regression, R2 Score : 0.2537864258620597
Ridge Regression, Mean Absolute Error Score: 202915.21777063428
Ridge Regression, Mean Squared Error Score: 248334568195.58844
Polynomial Regression, R2 Score : 0.27836638362880994
Polynomial Regression, Mean Absolute Error Score: 195060.55656192612
Polynomial Regression, Mean Squ

### Models

##### Linear Regression

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

In [22]:
X_train = pd.read_csv('../data/training/X_train.csv')
y_train = pd.read_csv('../data/training/y_train.csv')
X_test = pd.read_csv('../data/testing/X_test.csv')
y_test = pd.read_csv('../data/testing/y_test.csv')

In [23]:
# List of categorical features
categorical_features = ['type', 'city', 'is_foreclosure']

In [24]:

# Preprocess the data
# handle_unknown='ignore' is set because the first run raised
# "ValueError: Found unknown categories ['Tolleson'] in column 1 during transform".
# That error means a city ('Tolleson') appears in the data being transformed but not in the
# data the encoder was fitted on; with 'ignore', such rows get all-zero city columns.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), [
         col for col in X_train.columns if col not in categorical_features]),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])


# Create a pipeline that first transforms the data and then fits the model
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', LinearRegression())])


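# No train/test split is needed here: the data is already split into the CSV files loaded above.
# The call below raised errors, most likely because X and y are never defined in this notebook.
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)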

# Fit the model
pipeline.fit(X_train, y_train)
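
The unknown-category error mentioned in the preprocessing comments can be reproduced on a toy frame. This is a sketch under assumptions: the `train_demo` and `test_demo` frames and the city names other than 'Tolleson' are made up for illustration. It compares the encoder's default `handle_unknown='error'` with `handle_unknown='ignore'`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (assumption): the test frame contains a city absent from training.
train_demo = pd.DataFrame({"city": ["Phoenix", "Mesa", "Phoenix"]})
test_demo = pd.DataFrame({"city": ["Mesa", "Tolleson"]})

# Default behaviour: an unseen category raises ValueError at transform time.
strict = OneHotEncoder(handle_unknown="error").fit(train_demo)
try:
    strict.transform(test_demo)
    strict_failed = False
except ValueError as err:
    strict_failed = True
    print(f"strict encoder: {err}")

# With 'ignore', the unseen category is encoded as a row of zeros.
lenient = OneHotEncoder(handle_unknown="ignore").fit(train_demo)
encoded = lenient.transform(test_demo).toarray()
print(lenient.categories_)
print(encoded)
```

Ignoring unknown categories keeps the pipeline running, but the model then has no city information for those rows, so it is worth checking how many test rows are affected.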

In [25]:
# Predict on the test set
y_pred = pipeline.predict(X_test)

In [26]:
# Evaluate the model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r2}")

Mean Absolute Error: 183278.2075664484
Mean Squared Error: 185504913290.16785
R^2 Score: 0.4392068230097774


##### Ridge Regression

In [27]:
from sklearn.linear_model import Ridge

# Initialize the model
ridge_reg = Ridge(alpha=1.0)

# Create a pipeline
pipeline_ridge = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', ridge_reg)])

# Fit the model
pipeline_ridge.fit(X_train, y_train)

# Predict on the test set
y_pred_ridge = pipeline_ridge.predict(X_test)

# Evaluate the model
mae_ridge = mean_absolute_error(y_test, y_pred_ridge)
mse_ridge = mean_squared_error(y_test, y_pred_ridge)
r2_ridge = r2_score(y_test, y_pred_ridge)

print(f"Ridge Regression - Mean Absolute Error: {mae_ridge}")
print(f"Ridge Regression - Mean Squared Error: {mse_ridge}")
print(f"Ridge Regression - R^2 Score: {r2_ridge}")

Ridge Regression - Mean Absolute Error: 183468.74722104022
Ridge Regression - Mean Squared Error: 185531199995.48767
Ridge Regression - R^2 Score: 0.43912735662407665


##### Lasso Regression

In [28]:
from sklearn.linear_model import Lasso

# Initialize the model
lasso_reg = Lasso(alpha=0.1)

# Create a pipeline
pipeline_lasso = Pipeline(steps=[('preprocessor', preprocessor),
                                 ('model', lasso_reg)])

# Fit the model
pipeline_lasso.fit(X_train, y_train)

# Predict on the test set
y_pred_lasso = pipeline_lasso.predict(X_test)

# Evaluate the model
mae_lasso = mean_absolute_error(y_test, y_pred_lasso)
mse_lasso = mean_squared_error(y_test, y_pred_lasso)
r2_lasso = r2_score(y_test, y_pred_lasso)

print(f"Lasso Regression - Mean Absolute Error: {mae_lasso}")
print(f"Lasso Regression - Mean Squared Error: {mse_lasso}")
print(f"Lasso Regression - R^2 Score: {r2_lasso}")

Lasso Regression - Mean Absolute Error: 183295.19894890508
Lasso Regression - Mean Squared Error: 185506709544.05652
Lasso Regression - R^2 Score: 0.43920139281977844


##### XGBoost Regression

In [9]:
import xgboost as xgb

# Initialize the model
xgb_reg = xgb.XGBRegressor(
    objective='reg:squarederror', n_estimators=100, random_state=42)

# Create a pipeline
pipeline_xgb = Pipeline(steps=[('preprocessor', preprocessor),
                               ('model', xgb_reg)])

# Fit the model
pipeline_xgb.fit(X_train, y_train)

# Predict on the test set
y_pred_xgb = pipeline_xgb.predict(X_test)

# Evaluate the model
mae_xgb = mean_absolute_error(y_test, y_pred_xgb)
mse_xgb = mean_squared_error(y_test, y_pred_xgb)
r2_xgb = r2_score(y_test, y_pred_xgb)

print(f"XGBoost Regressor - Mean Absolute Error: {mae_xgb}")
print(f"XGBoost Regressor - Mean Squared Error: {mse_xgb}")
print(f"XGBoost Regressor - R^2 Score: {r2_xgb}")

XGBoost Regressor - Mean Absolute Error: 30133.225119850853
XGBoost Regressor - Mean Squared Error: 2504522904.0866137
XGBoost Regressor - R^2 Score: 0.9924286676222395


Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
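
To make the questions above concrete, the sketch below computes MAE, MSE, RMSE, R^2, and adjusted R^2 on a handful of made-up prices. The five prices, the single large miss, and the feature count `n_features = 2` are all assumptions chosen for illustration. Adjusted R^2 uses the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of rows and p the number of features.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up prices in dollars (assumption); the last prediction misses by 300,000.
y_true = np.array([200_000.0, 350_000.0, 500_000.0, 275_000.0, 1_200_000.0])
y_hat = np.array([210_000.0, 340_000.0, 480_000.0, 290_000.0, 900_000.0])

mae = mean_absolute_error(y_true, y_hat)   # average miss, in dollars
mse = mean_squared_error(y_true, y_hat)    # in squared dollars, hard to interpret
rmse = np.sqrt(mse)                        # back in dollars
r2 = r2_score(y_true, y_hat)

# Adjusted R^2 penalizes R^2 for the number of features used.
n_samples = len(y_true)
n_features = 2  # hypothetical feature count for this toy example
adj_r2 = 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

print(f"MAE:  {mae:,.2f}")
print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"R^2:  {r2:.4f}")
print(f"Adjusted R^2: {adj_r2:.4f}")
```

Because squaring amplifies the one large miss, RMSE comes out well above MAE here, which is the sense in which RMSE is more sensitive to outliers.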

In [64]:
# gather evaluation metrics and compare results

Based on the test-set metrics printed above, the **XGBoost Regressor** performs best among the models evaluated. The figures for each model are listed below, followed by our conclusion.

##### Linear Regression
- **Mean Absolute Error (MAE)**: 183,278.21
- **Mean Squared Error (MSE)**: 185,504,913,290.17
- **R^2 Score**: 0.43921

##### Ridge Regression
- **Mean Absolute Error (MAE)**: 183,468.75
- **Mean Squared Error (MSE)**: 185,531,199,995.49
- **R^2 Score**: 0.43913

##### Lasso Regression
- **Mean Absolute Error (MAE)**: 183,295.20
- **Mean Squared Error (MSE)**: 185,506,709,544.06
- **R^2 Score**: 0.43920

##### XGBoost Regressor
- **Mean Absolute Error (MAE)**: 30,133.23
- **Mean Squared Error (MSE)**: 2,504,522,904.09
- **R^2 Score**: 0.99243

##### Our Thoughts:
- **Mean Absolute Error (MAE)**: Lower is better. XGBoost has the lowest MAE by a large margin (about 30,000 dollars versus about 183,000 dollars for the linear models), meaning its predictions are closest to the actual prices on average.
- **Mean Squared Error (MSE)**: Lower is better. XGBoost also has the lowest MSE. Because MSE squares each error, it penalizes large misses more heavily than MAE does, so a low MSE suggests XGBoost makes fewer very large errors.
- **R^2 Score**: Higher is better. XGBoost's R^2 of 0.992 is far closer to 1 than the roughly 0.44 of the other models, meaning it explains about 99.2% of the variance in the test-set prices.

##### Conclusion:
**XGBoost Regressor** clearly outperforms the Linear, Ridge, and Lasso Regression models on all three evaluation metrics (MAE, MSE, and R^2 Score). Thus, the XGBoost Regressor is the best-performing model for predicting house prices using the given data.

**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
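
As a starting point, the sketch below applies recursive feature elimination (RFE, one of the algorithms listed above) and compares held-out R^2 before and after. It runs on synthetic data from `make_regression`. The dataset, the choice of three features to keep, and all variable names are assumptions, not results from the housing data. RFE with a linear model ranks features by coefficient size, so it assumes the features are on comparable scales.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 10 features, only 3 of which drive the target.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=10, n_informative=3, noise=5.0, random_state=0
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Baseline: fit on every feature.
full_model = LinearRegression().fit(X_fit, y_fit)
r2_full = r2_score(y_hold, full_model.predict(X_hold))

# RFE repeatedly drops the weakest feature until 3 remain.
selector = RFE(LinearRegression(), n_features_to_select=3).fit(X_fit, y_fit)
X_fit_small = selector.transform(X_fit)
X_hold_small = selector.transform(X_hold)

# Refit on the reduced feature set and score on the same held-out rows.
reduced_model = LinearRegression().fit(X_fit_small, y_fit)
r2_reduced = r2_score(y_hold, reduced_model.predict(X_hold_small))

print(f"kept feature mask: {selector.support_}")
print(f"R^2 with all features: {r2_full:.4f}")
print(f"R^2 with 3 features:   {r2_reduced:.4f}")
```

If the two scores are close on the real data, the smaller feature set is the simpler choice for the final pipeline.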



In [65]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)