***Imports***

In [1]:
import pandas as pd
import warnings

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.svm import SVR
import xgboost as xgb

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV

# Disable all warnings
warnings.filterwarnings('ignore')

# **Label Encoding**

Label encoding was chosen for the `city` and `type` columns instead of one-hot encoding to reduce the number of columns in our dataset.
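
As a rough illustration of the trade-off, the sketch below compares the two encodings on a toy frame. The city and type values are made up for this example and are not taken from our data. Note that label encoding gives the categories an arbitrary integer order, which tree-based models usually tolerate better than linear models do.

```python
import pandas as pd

# Toy data; these city and type values are hypothetical.
df = pd.DataFrame({
    "city": ["Austin", "Dallas", "Austin", "Houston"],
    "type": ["condo", "single_family", "condo", "townhome"],
})

# Label encoding: one integer column per feature.
# Codes follow the sorted order of the unique values.
label_encoded = pd.DataFrame({
    "city_encoded": df["city"].astype("category").cat.codes,
    "type_encoded": df["type"].astype("category").cat.codes,
})

# One-hot encoding: one column per distinct value.
one_hot = pd.get_dummies(df, columns=["city", "type"])

print(label_encoded.shape[1])  # 2 columns
print(one_hot.shape[1])        # 6 columns (3 cities + 3 types)
```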

In [2]:
X = pd.read_csv('../data/training/X_train.csv')
y = pd.read_csv('../data/training/y_train.csv')

In [3]:
X['city_encoded'] = X['city'].astype('category').cat.codes
X['type_encoded'] = X['type'].astype('category').cat.codes

X = X.drop(columns=['city', 'type'])


# Converting column datatypes into integers

X['is_foreclosure'] = X['is_foreclosure'].astype(int)

# **Model Selection**

***Train Test Split for Validation Data***

In [4]:
X_train, X_validate, y_train, y_validate = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape)
print(y_validate.shape)

(3096, 29)
(774, 1)


In [5]:
# Exporting the training/validation splits into CSV

X_train.to_csv('../data/training/X_train_v3.csv', index=False)
X_validate.to_csv('../data/training/X_validation.csv', index=False)
y_train.to_csv('../data/training/y_train_v3.csv', index=False)
y_validate.to_csv('../data/training/y_validation.csv', index=False)

In [6]:
models = {
    'Linear Regression': LinearRegression(),
    'Support Vector Machine': SVR(),
    'Random Forest': RandomForestRegressor(random_state=42),
    'XGBoost': xgb.XGBRegressor(),
    'Decision Tree': DecisionTreeRegressor(random_state=42),
    'K-Nearest Neighbors': KNeighborsRegressor(),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'AdaBoost': AdaBoostRegressor(random_state=42),
    'Ridge Regression': Ridge(random_state=42),
    'Lasso Regression': Lasso(random_state=42),
    'ElasticNet Regression': ElasticNet(random_state=42)
}

In [7]:
def evaluate_model(name, model, X_train, X_test, y_train, y_test):
    """Fit `model` on the training split and print its metrics on the held-out split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Lower is better for MSE and MAE; higher is better for R^2
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f'{name} Performance:')
    print(f'Mean Squared Error: {mse}')
    print(f'Mean Absolute Error: {mae}')
    print(f'R^2 Score: {r2}')
    print('')


# Evaluate each model

for name, model in models.items():
    evaluate_model(name, model, X_train, X_validate, y_train, y_validate)

Linear Regression Performance:
Mean Squared Error: 274501793270.35184
Mean Absolute Error: 204038.20322228255
R^2 Score: 0.3966762919721759

Support Vector Machine Performance:
Mean Squared Error: 483900366034.21
Mean Absolute Error: 250867.79209560188
R^2 Score: -0.06355794500856349

Random Forest Performance:
Mean Squared Error: 6612175921.268344
Mean Absolute Error: 24107.056817715733
R^2 Score: 0.9854671896768887

XGBoost Performance:
Mean Squared Error: 4439562534.9116125
Mean Absolute Error: 21193.024010296016
R^2 Score: 0.9902423466940842

Decision Tree Performance:
Mean Squared Error: 6193112904.076359
Mean Absolute Error: 8933.185677371725
R^2 Score: 0.9863882425065167

K-Nearest Neighbors Performance:
Mean Squared Error: 77581395348.8372
Mean Absolute Error: 97467.70025839793
R^2 Score: 0.8294849204510151

Gradient Boosting Performance:
Mean Squared Error: 34976229429.468475
Mean Absolute Error: 109245.70462906508
R^2 Score: 0.9231262274070615

AdaBoost Performance:
Mean Squa

### **Model Performance Comparison**

Based on Mean Absolute Error, the **Decision Tree Regressor** performed best of all the models tested, although **XGBoost** has the lowest Mean Squared Error and the highest R² Score. See the detailed results and comparison below, along with a summary and next steps for Part 3 of the project:

##### ***Best Performing Model: Decision Tree Regressor***
- **Mean Squared Error (MSE)**: 6,193,112,904.08
- **Mean Absolute Error (MAE)**: 8,933.19
- **R² Score**: 0.9864
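
For reference, here is a minimal pure-Python sketch of how these three metrics are defined. The prices and predictions are made-up numbers, not values from our dataset. Because MSE squares each error, a few large misses can dominate it, which is why a model can have the lowest MAE without having the lowest MSE.

```python
def mse(y_true, y_pred):
    # Mean of squared errors; penalizes large misses heavily.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    # Mean of absolute errors; in the same units as the target.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    # 1 minus the ratio of residual variation to total variation.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices and predictions, for illustration only.
y_true = [200_000, 300_000, 400_000]
y_pred = [210_000, 290_000, 430_000]

print(mse(y_true, y_pred))
print(mae(y_true, y_pred))
print(r2(y_true, y_pred))
```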

##### ***Comparison with Other Models***

| Model                         | Mean Squared Error (MSE)  | Mean Absolute Error (MAE)  | R² Score  |
|-------------------------------|---------------------------|----------------------------|-----------|
| **Decision Tree**             | 6,193,112,904.08          | 8,933.19                   | 0.9864    |
| **Random Forest**             | 6,612,175,921.27          | 24,107.06                  | 0.9855    |
| **XGBoost**                   | 4,439,562,534.91          | 21,193.02                  | 0.9902    |
| **Linear Regression**         | 274,501,793,270.35        | 204,038.20                 | 0.3967    |
| **Support Vector Machine**    | 483,900,366,034.21        | 250,867.79                 | -0.0636   |
| **K-Nearest Neighbors**       | 77,581,395,348.84         | 97,467.70                  | 0.8295    |
| **Gradient Boosting**         | 34,976,229,429.47         | 109,245.70                 | 0.9231    |
| **AdaBoost**                  | 87,353,813,837.12         | 247,798.41                 | 0.8080    |
| **Ridge Regression**          | 274,532,623,656.19        | 203,986.27                 | 0.3966    |
| **Lasso Regression**          | 274,502,594,826.23        | 204,035.61                 | 0.3967    |
| **ElasticNet Regression**     | 291,398,512,534.51        | 198,455.73                 | 0.3595    |
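
To check which model leads on each metric, the values can be ranked programmatically. The sketch below copies the four tree-based rows from the table above into a dictionary; the `results` name and its layout are our own choice for illustration.

```python
# Metric values copied from the comparison table (tree-based models only).
results = {
    "Decision Tree": {"mse": 6_193_112_904.08, "mae": 8_933.19, "r2": 0.9864},
    "Random Forest": {"mse": 6_612_175_921.27, "mae": 24_107.06, "r2": 0.9855},
    "XGBoost": {"mse": 4_439_562_534.91, "mae": 21_193.02, "r2": 0.9902},
    "Gradient Boosting": {"mse": 34_976_229_429.47, "mae": 109_245.70, "r2": 0.9231},
}

# Lower is better for MSE and MAE; higher is better for R².
best_mse = min(results, key=lambda name: results[name]["mse"])
best_mae = min(results, key=lambda name: results[name]["mae"])
best_r2 = max(results, key=lambda name: results[name]["r2"])

print(f"Lowest MSE: {best_mse}")
print(f"Lowest MAE: {best_mae}")
print(f"Highest R²: {best_r2}")
```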

##### ***Summary***

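The **Decision Tree Regressor** achieved the lowest Mean Absolute Error by a wide margin (8,933.19, versus 21,193.02 for the next-best model, XGBoost), and its Mean Squared Error and R² Score are close to the best values in the table. **XGBoost** has the lowest Mean Squared Error (4,439,562,534.91) and the highest R² Score (0.9902).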

The **Random Forest Regressor** also performed very well, closely following the Decision Tree Regressor on Mean Squared Error and R² Score. The linear models (Linear Regression, Ridge, Lasso, and ElasticNet) performed much worse, with R² Scores below 0.40, which suggests that the dataset benefits from more complex, non-linear models.

# **Next Steps for Part 3**

Based on these results, we'll focus on the Decision Tree and Random Forest models for further tuning and optimization.
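
One possible way to approach the tuning is a grid search with cross-validation. The sketch below is only a starting point under stated assumptions: it runs on synthetic data from `make_regression` rather than our training set, and the parameter grid and the choice of MAE as the scoring metric are illustrative guesses, not settings we have validated.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; replace with X_train / y_train in practice.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Hypothetical grid; the values are placeholders to be adjusted.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # scikit-learn maximizes, so MAE is negated
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated MAE of the best setting
```

The same pattern should carry over to `RandomForestRegressor` by swapping the estimator and extending the grid with parameters such as `n_estimators`.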