_**Note: This is a supporting notebook that collects the additional modeling attempts for our project.**_

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline  # no resampling steps are used, so scikit-learn's Pipeline is sufficient
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import KNNImputer, SimpleImputer

## Loading data

In [2]:
listings = pd.read_csv("combined_listings_clean.csv")

## Data Splitting: Train, Validation, Test

In [3]:
#creating train/test split
x = listings.drop(columns = ['Occupancy_Rate'])
y = listings['Occupancy_Rate']

full_train_x, test_x, full_train_y, test_y = train_test_split(x, y, test_size = 0.2, random_state = 11)

#creating train/validation split
train_x, val_x, train_y, val_y = train_test_split(full_train_x, full_train_y, test_size = 0.2, random_state = 11)
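
Applying `test_size = 0.2` twice does not give an 80/20 split overall: the validation set takes 20% of the remaining 80%. The sketch below, which uses a hypothetical 1,000-row random dataset rather than the listings data, shows the shares that result.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the listings data: 1,000 rows, 3 features.
rng = np.random.default_rng(11)
X_demo = rng.normal(size=(1000, 3))
y_demo = rng.uniform(size=1000)

# Same two-stage split as above: 20% test, then 20% of the remainder for validation.
full_X, test_X, full_y, test_y_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=11
)
tr_X, val_X, tr_y, val_y_demo = train_test_split(
    full_X, full_y, test_size=0.2, random_state=11
)

n_rows = len(X_demo)
split_shares = {
    "train": len(tr_X) / n_rows,
    "validation": len(val_X) / n_rows,
    "test": len(test_X) / n_rows,
}
print(split_shares)
```

On the real data the shares should be close to 64% train, 16% validation, and 20% test, up to rounding.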

## Processing Pipeline

In this step we first split the feature columns into numeric and categorical sets, then create a separate pipeline for each type. The preprocessor is fit on the training split only and then applied to the validation and test splits.
- For numeric variables we use a Simple Imputer with the mean strategy, followed by a Standard Scaler.
- For categorical variables we use a Simple Imputer with the most-frequent strategy, followed by a One Hot Encoder (e.g. for Neighborhood).

In [4]:
#creating variables sets, num and cat
num_vars = train_x.select_dtypes(include = ['float64','int64']).columns
cat_vars = train_x.select_dtypes(include = 'object').columns

#creating numeric_pipeline
num_pipeline = Pipeline([('imputer', SimpleImputer(strategy='mean')), ('scaler', StandardScaler())])

#creating cat_pipeline
cat_pipeline = Pipeline([('imputer', SimpleImputer(strategy= 'most_frequent')),('encoder', OneHotEncoder(handle_unknown='ignore'))])

#creating preprocessor
preprocessing = ColumnTransformer([('num', num_pipeline, num_vars),('cat', cat_pipeline, cat_vars)])

#fitting and transforming the train to the preprocessing
train_x_process = preprocessing.fit_transform(train_x)

#transforming the validation data
val_x_process = preprocessing.transform(val_x)

#transforming the test data
test_x_process = preprocessing.transform(test_x)
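
To make the behaviour of this preprocessor concrete, here is a minimal sketch on a toy frame. The column names `price` and `neighborhood` and all values are hypothetical, not the project's schema. It shows what happens to a row that has a missing numeric value and a category never seen during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; names and values are for illustration only.
toy_train = pd.DataFrame({
    "price": [100.0, 200.0, np.nan, 300.0],
    "neighborhood": ["A", "B", "A", np.nan],
})
toy_new = pd.DataFrame({
    "price": [np.nan],        # missing numeric value
    "neighborhood": ["C"],    # category never seen during fit
})

toy_prep = ColumnTransformer([
    ("num", Pipeline([("imputer", SimpleImputer(strategy="mean")),
                      ("scaler", StandardScaler())]), ["price"]),
    ("cat", Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                      ("encoder", OneHotEncoder(handle_unknown="ignore"))]), ["neighborhood"]),
])

# Fit on the training frame only, then transform the unseen row.
toy_prep.fit(toy_train)
toy_out = toy_prep.transform(toy_new)
# ColumnTransformer may return a sparse matrix; convert for display.
toy_out = toy_out.toarray() if hasattr(toy_out, "toarray") else np.asarray(toy_out)
print(toy_out)
```

The missing price is filled with the training mean, which scales to 0, and the unseen category becomes an all-zero one-hot row instead of raising an error.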

## Baseline Advanced Models

We train four baseline regression models (LinearSVR, XGBoost, LightGBM, and Random Forest) on the processed training data, using default hyperparameters.

In [7]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from lightgbm import LGBMRegressor
from sklearn.svm import LinearSVR

In [9]:
#Base SVM
base_SVM = LinearSVR(random_state= 11)
base_SVM.fit(train_x_process, train_y)

#Base Xgboost
base_xgboost = XGBRegressor()
base_xgboost.fit(train_x_process, train_y)

#Base lightgbm
base_lightgbm = LGBMRegressor(random_state= 11)
base_lightgbm.fit(train_x_process, train_y)

#base Random Forest
base_rf = RandomForestRegressor(random_state=11)
base_rf.fit(train_x_process, train_y)

### Baseline Results

In [10]:
models = {
    "Random Forest": base_rf,
    "LightGBM": base_lightgbm,
    "XGBoost": base_xgboost,
    "LinearSVR": base_SVM
}

for name, model in models.items():
    y_pred = model.predict(val_x_process)
    mse = mean_squared_error(val_y, y_pred)
    mae = mean_absolute_error(val_y, y_pred)
    r2 = r2_score(val_y, y_pred)
    
    print(f"\n{name}")
    print(f"R²: {r2:.4f}")
    print(f"MAE: {mae:.4f}")
    print(f"MSE: {mse:.4f}")


Random Forest
R²: 0.5710
MAE: 0.0892
MSE: 0.0150

LightGBM
R²: 0.5900
MAE: 0.0861
MSE: 0.0143

XGBoost
R²: 0.5837
MAE: 0.0872
MSE: 0.0145

LinearSVR
R²: 0.2689
MAE: 0.1184
MSE: 0.0255


From the results, the tree-based models clearly outperformed _LinearSVR_ on the validation set. _LightGBM_ achieved the best overall performance, with the highest R² (**0.5900**) and the lowest MAE (0.0861) and MSE (0.0143).
In contrast, _LinearSVR_ was the weakest on every metric (R² of 0.2689), which suggests that a linear model captures the patterns in this data less well than the ensemble models. Tree-based approaches therefore appear better suited to this dataset.
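
As a quick consistency check on the numbers above, recall that R² = 1 − MSE / Var(y) when both are computed on the same validation targets. The sketch below uses only the R² and MSE values printed above and recovers the implied variance of `val_y` from each model, along with the RMSE, which is in the same units as the occupancy rate.

```python
import math

# Validation metrics as printed above: (R², MSE).
reported = {
    "Random Forest": (0.5710, 0.0150),
    "LightGBM": (0.5900, 0.0143),
    "XGBoost": (0.5837, 0.0145),
    "LinearSVR": (0.2689, 0.0255),
}

implied_var = {}
rmse = {}
for name, (r2, mse) in reported.items():
    rmse[name] = math.sqrt(mse)
    implied_var[name] = mse / (1 - r2)  # rearranged from R² = 1 - MSE / Var(y)
    print(f"{name}: RMSE={rmse[name]:.3f}, implied Var(val_y)={implied_var[name]:.4f}")
```

All four models imply a variance of roughly 0.035, as expected since they are scored on the same validation targets. The small differences come from rounding in the printed metrics.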

## Hyperparameter Tuning

We tune all four baseline models (Random Forest, LightGBM, XGBoost, and LinearSVR) using randomized search with 3-fold cross-validation on the training split. The goal is to improve each model's performance by finding a better combination of hyperparameters; the tuned models are then compared on the validation set.
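
Before the full searches, here is a minimal sketch of how `RandomizedSearchCV` behaves. The data is synthetic and the search space (`toy_space`) is deliberately tiny; neither is taken from the project. Note that with a `neg_*` scorer, `best_score_` is the negated error, so it is at most zero and must be sign-flipped to read as an MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(11)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy[:, 0] ** 2 + 0.1 * rng.normal(size=200)

# A deliberately small, hypothetical search space (2 * 3 * 3 = 18 combinations).
toy_space = {
    "n_estimators": [20, 50],
    "max_depth": [None, 3, 5],
    "min_samples_leaf": [1, 2, 4],
}

toy_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=11),
    toy_space,
    n_iter=5,                          # evaluate 5 of the 18 combinations
    cv=3,                              # 3-fold cross-validation per candidate
    scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
    random_state=11,
)
toy_search.fit(X_toy, y_toy)

print("Sampled candidates:", len(toy_search.cv_results_["params"]))
print("Best params:", toy_search.best_params_)
print("Best CV MSE:", -toy_search.best_score_)
```

By default the best candidate is refit on all the data passed to `fit`, which is why `best_estimator_` can be used for prediction directly.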

In [11]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

### Random Forest

In [12]:
rf_hyper = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None,10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4], 
    'max_features': [1.0, 'sqrt', 'log2']  # 1.0 replaces 'auto', which was removed in scikit-learn 1.3
}

hyper_rf = RandomForestRegressor(random_state=11)

rf_search = RandomizedSearchCV(hyper_rf, rf_hyper, cv = 3, n_iter= 75, scoring = 'neg_root_mean_squared_error', n_jobs=-1, verbose = 3, random_state=11)

rf_search.fit(train_x_process, train_y)

best_rf = rf_search.best_estimator_

rf_hyper_pred = best_rf.predict(val_x_process)

In [13]:
print(f"R² Score: {r2_score(val_y, rf_hyper_pred):.4f}")
print(f"MAE: {mean_absolute_error(val_y, rf_hyper_pred):.4f}")
print(f"MSE: {mean_squared_error(val_y, rf_hyper_pred):.4f}")

R² Score: 0.5802
MAE: 0.0900
MSE: 0.0146


In [14]:
best_rf

After hyperparameter tuning, the Random Forest model changed only marginally relative to the baseline, and not uniformly for the better:

- R² Score improved slightly from 0.5710 to 0.5802 

- MSE decreased from 0.0150 to 0.0146

- MAE increased slightly from 0.0892 to 0.0900

### LightGBM

In [15]:
base_lightgbm.get_params()

{'boosting_type': 'gbdt',
 'class_weight': None,
 'colsample_bytree': 1.0,
 'importance_type': 'split',
 'learning_rate': 0.1,
 'max_depth': -1,
 'min_child_samples': 20,
 'min_child_weight': 0.001,
 'min_split_gain': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'num_leaves': 31,
 'objective': None,
 'random_state': 11,
 'reg_alpha': 0.0,
 'reg_lambda': 0.0,
 'subsample': 1.0,
 'subsample_for_bin': 200000,
 'subsample_freq': 0}

In [16]:
hyper_lgbm = LGBMRegressor(random_state=11)

lgbm_params = {
    'n_estimators': [300, 500, 700, 1000],
    'learning_rate': [0.01, 0.03, 0.05, 0.07],
    'num_leaves': [70, 100, 130, 160],
    'max_depth': [15, 20, 25, -1],
    'min_child_samples': [5, 10, 15, 20],
    'subsample': [0.7, 0.8, 0.9],  # only takes effect when subsample_freq > 0 (default is 0)
    'colsample_bytree': [0.6, 0.7, 0.8],
    'reg_alpha': [0.0, 0.01, 0.1],
    'reg_lambda': [0.3, 0.5, 0.7]               
}
lgbm_search = RandomizedSearchCV(
    hyper_lgbm,
    lgbm_params,
    cv=3,
    n_iter=150,                          
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=3,
    random_state=11
)

lgbm_search.fit(train_x_process, train_y)

In [17]:
best_lgbm = lgbm_search.best_estimator_
print("Best Parameters:", lgbm_search.best_params_)

lgbm_hyper_pred = best_lgbm.predict(val_x_process)

print(f"R² Score: {r2_score(val_y, lgbm_hyper_pred):.4f}")
print(f"MAE: {mean_absolute_error(val_y, lgbm_hyper_pred):.4f}")
print(f"MSE: {mean_squared_error(val_y, lgbm_hyper_pred):.4f}")

Best Parameters: {'subsample': 0.7, 'reg_lambda': 0.5, 'reg_alpha': 0.0, 'num_leaves': 70, 'n_estimators': 1000, 'min_child_samples': 15, 'max_depth': 15, 'learning_rate': 0.03, 'colsample_bytree': 0.6}
R² Score: 0.6191
MAE: 0.0822
MSE: 0.0133


After applying hyperparameter tuning to the LightGBM model, we observed a modest improvement in performance over the baseline:

- R² Score increased from 0.5900 to 0.6191

- MSE decreased from 0.0143 to 0.0133

- MAE decreased from 0.0861 to 0.0822

### XGBoost

In [18]:
xgb_params = {
    'n_estimators': [250, 300, 350],
    'learning_rate': [0.07, 0.1, 0.13],
    'max_depth': [8, 10, 12],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
    'reg_alpha': [0.05, 0.1, 0.2],
    'reg_lambda': [1.0, 1.5, 2.0] 
}

hyper_xgb = XGBRegressor(objective='reg:squarederror', random_state=11, verbosity = 2)

xgb_search = RandomizedSearchCV(
    hyper_xgb,
    xgb_params,
    cv=3,
    n_iter=50,                          
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=3,
    random_state=11
)

xgb_search.fit(train_x_process, train_y)

In [19]:
best_xgb = xgb_search.best_estimator_
print("Best Estimator:", best_xgb)

xgb_hyper_pred = best_xgb.predict(val_x_process)

print(f"R² Score: {r2_score(val_y, xgb_hyper_pred):.4f}")
print(f"MAE: {mean_absolute_error(val_y, xgb_hyper_pred):.4f}")
print(f"MSE: {mean_squared_error(val_y, xgb_hyper_pred):.4f}")

Best Estimator: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=0.7, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=0.07, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=8, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=350, n_jobs=None,
             num_parallel_tree=None, random_state=11, ...)
R² Score: 0.6115
MAE: 0.0830
MSE: 0.0135


After tuning the hyperparameters for the XGBoost model, the performance improved moderately compared to the baseline:

- R² Score increased from 0.5837 to 0.6115

- MSE decreased from 0.0145 to 0.0135

- MAE decreased from 0.0872 to 0.0830

### SVM

In [20]:
svm_params = {
    'C': [0.01, 0.1, 1, 10, 100], 
    'epsilon': [0.01, 0.1, 0.2, 0.5] 
}

hyper_SVM = LinearSVR(random_state= 11)


svm_search = RandomizedSearchCV(
    hyper_SVM,
    svm_params,
    cv=3,
    n_iter=25,  # the grid has only 5 * 4 = 20 combinations, so all 20 are evaluated
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=3,
    random_state=11,
    error_score='raise'
)

svm_search.fit(train_x_process, train_y)

In [21]:
best_svm = svm_search.best_estimator_
print("Best Parameters:", svm_search.best_params_)

svm_hyper_pred = best_svm.predict(val_x_process)

print(f"R² Score: {r2_score(val_y, svm_hyper_pred):.4f}")
print(f"MAE: {mean_absolute_error(val_y, svm_hyper_pred):.4f}")
print(f"MSE: {mean_squared_error(val_y, svm_hyper_pred):.4f}")

Best Parameters: {'epsilon': 0.1, 'C': 0.01}
R² Score: 0.3589
MAE: 0.1163
MSE: 0.0224


After tuning the hyperparameters for the LinearSVR model, the performance improved noticeably compared to the baseline, although it remains well below the tree-based models:

- R² Score increased from 0.2689 to 0.3589

- MSE decreased from 0.0255 to 0.0224

- MAE decreased from 0.1184 to 0.1163
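
The four searches cover very different fractions of their search spaces. The sketch below counts the candidate values per hyperparameter in each dictionary above and compares the grid size with the `n_iter` passed to `RandomizedSearchCV`. When every entry is a list, scikit-learn samples without replacement and caps the number of candidates at the grid size.

```python
import math

# Candidate values per hyperparameter (counted from the dictionaries above),
# paired with the n_iter passed to each RandomizedSearchCV.
search_spaces = {
    "Random Forest": ([3, 4, 3, 3, 3], 75),
    "LightGBM": ([4, 4, 4, 4, 4, 3, 3, 3, 3], 150),
    "XGBoost": ([3, 3, 3, 3, 3, 3, 3], 50),
    "LinearSVR": ([5, 4], 25),
}

coverage = {}
for name, (sizes, n_iter) in search_spaces.items():
    total = math.prod(sizes)           # full grid size
    evaluated = min(n_iter, total)     # candidates actually evaluated
    coverage[name] = (total, evaluated)
    print(f"{name}: {evaluated} of {total} combinations ({evaluated / total:.1%})")
```

The LinearSVR search is effectively exhaustive, while the LightGBM search samples well under 1% of its grid, so further gains from a wider or more targeted search cannot be ruled out.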

### Conclusion

After evaluating baseline and tuned versions of all four regression models on the validation set, we found that the tuned LightGBM model performed best overall, with the highest R² (0.6191) and the lowest MSE (0.0133) and MAE (0.0822). It is therefore the most accurate of these models for predicting occupancy rate, with tuned XGBoost close behind (R² of 0.6115).
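
For reference, the validation metrics reported throughout this notebook can be collected into a single table. The sketch below uses only the values printed above; the column names are our own choice.

```python
import pandas as pd

# Validation metrics copied from the outputs above.
results = pd.DataFrame(
    {
        "model": ["Random Forest", "LightGBM", "XGBoost", "LinearSVR"],
        "r2_base": [0.5710, 0.5900, 0.5837, 0.2689],
        "r2_tuned": [0.5802, 0.6191, 0.6115, 0.3589],
        "mae_base": [0.0892, 0.0861, 0.0872, 0.1184],
        "mae_tuned": [0.0900, 0.0822, 0.0830, 0.1163],
        "mse_base": [0.0150, 0.0143, 0.0145, 0.0255],
        "mse_tuned": [0.0146, 0.0133, 0.0135, 0.0224],
    }
).set_index("model")

# Change from baseline to tuned (positive R² gain and negative error change are improvements).
results["r2_gain"] = results["r2_tuned"] - results["r2_base"]
results["mae_change"] = results["mae_tuned"] - results["mae_base"]
results["mse_change"] = results["mse_tuned"] - results["mse_base"]

best_model = results["r2_tuned"].idxmax()
print(results.sort_values("r2_tuned", ascending=False).round(4))
print("Best tuned model by R²:", best_model)
```

The table also shows that LinearSVR gained the most R² from tuning (about 0.09) while remaining the weakest model, and that Random Forest is the only model whose MAE got slightly worse.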