# XGBoost Regression Model (Version 2)

Regression counterpart of the Gradient Boosted Tree model, implemented with the `XGBRegressor` class of the XGBoost library, which follows the scikit-learn estimator API.

### Summary

| Techniques                     | Used / Description           |
| ------------------------------ | ---------------------------- |
| Handling Unknown Variables     | Drop Rows                    |
| Handling Categorical Variables | Drop Columns (Drop Features) |
| Handling Class Imbalance       | Not Applied                  |
| Handling Outliers              | Not Applied                  |

### Results

| Metric                 | Value   |
| ---------------------- | ------- |
| RMSE (Lower is better) | 0.7731  |
| R2 (Higher is better)  | 0.5396  |
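
Both metrics in the table can be computed directly from the predictions. The following is a minimal sketch on made-up numbers; the helper names `rmse` and `r2` are illustrative and not part of the notebook. It mirrors the NumPy expression used for RMSE further down and the definition behind `r2_score` (one minus the ratio of residual to total sum of squares).

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return float(1.0 - ss_res / ss_tot)

# Made-up values, for illustration only
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]

print(f"RMSE: {rmse(y_true, y_pred):.4f}")
print(f"R2:   {r2(y_true, y_pred):.4f}")
```

An R2 of about 0.54 therefore means the model explains roughly half of the variance of the target around its mean on the evaluated data.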


### Preprocessing Stage

In [1]:
import numpy as np
import pandas as pd
import optuna

from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from xgboost import XGBRegressor


In [2]:
X_train = pd.read_csv('../../cleaned-data/X_train.csv')
y_train = pd.read_csv('../../cleaned-data/y_train.csv')

X_test = pd.read_csv('../../cleaned-data/X_test.csv')
y_test = pd.read_csv('../../cleaned-data/y_test.csv')

In [3]:
X_train.head()

| Unnamed: 0 | latitude | longitude | land_use_label | distance_to_waterbody | distance_to_open_space | subzone | planning_area | region | elevation | temp_2024_04_07_min | ... | built-up | bare / sparse vegetation | snow and ice | permanent water bodies | herbaceous wetland | mangroves | moss and lichen | min_ndvi | mean_ndvi | max_ndvi |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.327345 | 103.776261 | ROAD | 0.005491 | 0.000305 | HOLLAND ROAD | BUKIT TIMAH | CENTRAL REGION | 34 | 28.880736 | ... | 128 | 1 | 0 | 1 | 0 | 0 | 0 | 0.1176063463 | 0.2107233339 | 0.3355351585 |
| 1 | 1.36231 | 103.885041 | RESIDENTIAL | 0.002163 | 0.002288 | KOVAN | HOUGANG | NORTH-EAST REGION | 14 | 33.603571 | ... | 183 | 1 | 0 | 0 | 0 | 0 | 0 | 0.06873453002 | 0.1237388913 | 0.1772913102 |
| 2 | 1.304792 | 103.740678 | BUSINESS 2 | 0.00166 | 0.001437 | PENJURU CRESCENT | JURONG EAST | WEST REGION | 10 | 28.880736 | ... | 251 | 8 | 0 | 33 | 0 | 0 | 0 | 0.03399855502 | 0.07334574643 | 0.1149060753 |
| 3 | 1.432131 | 103.793028 | ROAD | 0.002688 | 0.002472 | WOODLANDS SOUTH | WOODLANDS | NORTH REGION | 32 | 30.168782 | ... | - | - | - | - | - | - | - | - | - | - |
| 4 | 1.30353 | 103.820861 | CIVIC & COMMUNITY INSTITUTION | 0.011124 | 0.004127 | RIDOUT | TANGLIN | CENTRAL REGION | 17 | 30.168782 | ... | 63 | 1 | 0 | 0 | 0 | 0 | 0 | 0.09017470784 | 0.2076336658 | 0.3255961435 |


In [4]:
# Combine X and y so that rows removed below are dropped from the features and the target together
X_train = pd.concat([X_train, y_train], axis=1)
X_test = pd.concat([X_test, y_test], axis=1)

- Drop the subzone, planning area, and region columns
- Drop the land use label column (categorical features are dropped rather than one-hot encoded)
- Drop the temperature columns, since they are not independent variables
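
A minimal sketch of dropping the categorical and temperature columns, as the following cells do. Instead of listing every temperature column by hand, this version selects them by their `temp_` prefix. The toy frame and its values are made up for illustration; only the column names mirror the dataset.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset
df = pd.DataFrame({
    "latitude": [1.33, 1.36],
    "land_use_label": ["ROAD", "RESIDENTIAL"],
    "subzone": ["HOLLAND ROAD", "KOVAN"],
    "planning_area": ["BUKIT TIMAH", "HOUGANG"],
    "region": ["CENTRAL REGION", "NORTH-EAST REGION"],
    "temp_2024_04_07_min": [28.9, 33.6],
    "temp_2024_04_07_max": [30.1, 34.2],
    "elevation": [34, 14],
})

# Categorical columns are listed explicitly; temperature columns are found by prefix
categorical_cols = ["land_use_label", "subzone", "planning_area", "region"]
temp_cols = [c for c in df.columns if c.startswith("temp_")]

reduced = df.drop(columns=categorical_cols + temp_cols)
print(list(reduced.columns))
```

Selecting by prefix keeps the drop list correct if more dates are added, at the cost of relying on the naming convention.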

In [5]:
X_train.columns

Index(['latitude', 'longitude', 'land_use_label', 'distance_to_waterbody',
       'distance_to_open_space', 'subzone', 'planning_area', 'region',
       'elevation', 'temp_2024_04_07_min', 'temp_2024_04_07_max',
       'temp_2024_04_07_median', 'temp_2024_04_08_min', 'temp_2024_04_08_max',
       'temp_2024_04_08_median', 'temp_2024_04_09_min', 'temp_2024_04_09_max',
       'temp_2024_04_09_median', 'temp_2024_04_10_min', 'temp_2024_04_10_max',
       'temp_2024_04_10_median', 'Total_x', 'HDB Total',
       'Condominiums & Other Apartments', 'Landed Properties_x',
       'Other Dwellings_x', 'Floor_below_60', 'Floor_60-80', 'Floor_80-100',
       'Floor_100-120', 'Floor_above_120', 'Below $1,000', '$1,000 - $1,999',
       '$2,000 - $2,999', '$3,000 - $3,999', '$4,000 - $4,999',
       '$5,000 - $5,999', '$6,000 - $6,999', '$7,000 - $7,999',
       '$8,000 - $8,999', '$9,000 - $9,999', '$10,000 - 10,999',
       '$11,000 - 11,999', '$12,000 - $14,999', '$15,000 & Over', 'tree cover',
 

In [6]:
columns_to_drop = ['land_use_label', 'subzone', 'planning_area', 'region',
       'temp_2024_04_07_min', 'temp_2024_04_07_max',
       'temp_2024_04_07_median', 'temp_2024_04_08_min', 'temp_2024_04_08_max',
       'temp_2024_04_08_median', 'temp_2024_04_09_min', 'temp_2024_04_09_max',
       'temp_2024_04_09_median', 'temp_2024_04_10_min', 'temp_2024_04_10_max',
       'temp_2024_04_10_median']

X_train = X_train.drop(columns=columns_to_drop)
X_test = X_test.drop(columns=columns_to_drop)

In [7]:
# Remove rows where min_ndvi is the placeholder '-' (missing value)
X_train = X_train[X_train['min_ndvi'] != '-']
X_test = X_test[X_test['min_ndvi'] != '-']
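
Filtering on the string `'-'` removes the affected rows but leaves the column stored as text, which is why an explicit cast is needed afterwards. As a hedged alternative (not what this notebook does), the placeholder can be declared as missing at load time via the `na_values` parameter of `pd.read_csv`. The inline CSV below is made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV in which '-' marks a missing value, as in the dataset
csv_text = """latitude,min_ndvi,mean_ndvi
1.327345,0.1176,0.2107
1.432131,-,-
1.303530,0.0902,0.2076
"""

# Approach used in the notebook: load as-is, then filter on the placeholder string
raw = pd.read_csv(io.StringIO(csv_text))
filtered = raw[raw["min_ndvi"] != "-"]  # rows removed, but the column is still text

# Alternative: treat '-' as NaN while parsing, then drop incomplete rows
parsed = pd.read_csv(io.StringIO(csv_text), na_values="-")
clean = parsed.dropna(subset=["min_ndvi"])  # column is already float

print(filtered["min_ndvi"].dtype, clean["min_ndvi"].dtype)
```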

In [8]:
# Split X and y
y_train = X_train['avg_temp']
X_train = X_train.drop(columns=['avg_temp'])

y_test = X_test['avg_temp']
X_test = X_test.drop(columns=['avg_temp'])

In [9]:
def set_data_types(X):
    # Cast the land-cover and NDVI columns to their intended dtypes (astype returns a copy)
    typed = X.astype({
        'tree cover': 'int',
        'grassland': 'category',
        'shrubland': 'category',
        'cropland': 'category',
        'built-up': 'int',
        'permanent water bodies': 'int',
        'herbaceous wetland': 'category',
        'bare / sparse vegetation': 'int',
        'min_ndvi': 'float',
        'mean_ndvi': 'float',
        'max_ndvi': 'float',
    })
    # Drop the snow and ice, mangroves, and moss and lichen columns
    return typed.drop(columns=['snow and ice', 'mangroves', 'moss and lichen'])

In [10]:
new_X_train = set_data_types(X_train)
new_X_test = set_data_types(X_test)

### Model Training

In [11]:
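# Note: the scaler is fitted on X_train / X_test, not on the new_X_train / new_X_test
# frames built above, so the dtype casts and column drops from set_data_types
# are not applied to the model input.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)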

def objective(trial):
    max_depth = trial.suggest_int('max_depth', 2, 10)
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    gamma = trial.suggest_float('gamma', 0, 1)
    subsample = trial.suggest_float('subsample', 0.5, 1)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.5, 1)

    # Create the XGBoost regressor with the suggested hyperparameters
    regressor = XGBRegressor(max_depth=max_depth,
                            n_estimators=n_estimators,
                            learning_rate=learning_rate,
                            min_child_weight=min_child_weight,
                            gamma=gamma,
                            subsample=subsample,
                            colsample_bytree=colsample_bytree,
                            random_state=42)
    regressor.fit(X_train_scaled, y_train)

    # Predict and calculate the R2 score
    y_pred = regressor.predict(X_test_scaled)
    score = r2_score(y_test, y_pred)
    return score

# Create a study object and optimize the objective function
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=1000)

print("Best trial:")
trial = study.best_trial
print(f"  R2 score: {trial.value}")
print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

[I 2024-04-19 17:08:48,078] A new study created in memory with name: no-name-e61dcd12-6183-4e5e-9721-ea18b5ce67d4


[I 2024-04-19 17:08:48,333] Trial 0 finished with value: 0.49423863575661964 and parameters: {'max_depth': 4, 'n_estimators': 184, 'learning_rate': 0.18963807692691895, 'min_child_weight': 5, 'gamma': 0.699412153387575, 'subsample': 0.9442404533950274, 'colsample_bytree': 0.761405667242048}. Best is trial 0 with value: 0.49423863575661964.
[I 2024-04-19 17:08:48,637] Trial 1 finished with value: 0.3553432061245161 and parameters: {'max_depth': 7, 'n_estimators': 247, 'learning_rate': 0.23999720755454942, 'min_child_weight': 4, 'gamma': 0.4167234801053451, 'subsample': 0.5643910858537098, 'colsample_bytree': 0.8253430190591786}. Best is trial 0 with value: 0.49423863575661964.
[I 2024-04-19 17:08:48,910] Trial 2 finished with value: 0.29694417125645367 and parameters: {'max_depth': 5, 'n_estimators': 163, 'learning_rate': 0.1894291011001, 'min_child_weight': 3, 'gamma': 0.28447893865509966, 'subsample': 0.5365360454069088, 'colsample_bytree': 0.751838460699954}. Best is trial 0 with val

Best trial:
  R2 score: 0.5384807017450977
  Params: 
    max_depth: 3
    n_estimators: 105
    learning_rate: 0.054188652276254946
    min_child_weight: 8
    gamma: 0.8979889084986685
    subsample: 0.6703343884211945
    colsample_bytree: 0.6211053204195089
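
The objective above scores every trial on the test set, so the best R2 is selected on the same data that is later used to report it and is likely optimistic as an estimate of generalisation. A common alternative is to score each trial by cross-validation on the training data only. The sketch below is a minimal illustration under stated assumptions: `cv_r2` is a hypothetical helper, the data is synthetic, and scikit-learn's `GradientBoostingRegressor` stands in for `XGBRegressor` to keep the example light.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

def cv_r2(estimator, X, y, n_splits=5, seed=42):
    """Mean R2 over shuffled K-fold splits of the training data (hypothetical helper)."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(estimator, X, y, cv=cv, scoring="r2")
    return float(scores.mean())

# Synthetic data and a stand-in estimator, for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
score = cv_r2(GradientBoostingRegressor(random_state=42), X_demo, y_demo)
print(f"Mean cross-validated R2: {score:.3f}")

# Inside the Optuna objective, the same helper could replace the test-set score,
# where trial_params is a hypothetical dict of the suggested hyperparameters:
#     return cv_r2(XGBRegressor(**trial_params), X_train_scaled, y_train)
```

With this setup the test set is touched only once, after the study has finished, to report the final RMSE and R2.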


In [12]:
regressor = XGBRegressor(max_depth=3,
                            n_estimators=105,
                            learning_rate=0.054188652276254946,
                            min_child_weight=8,
                            gamma=0.8979889084986685,
                            subsample=0.6703343884211945,
                            colsample_bytree=0.6211053204195089,
                            random_state=42)
regressor.fit(X_train_scaled, y_train)

y_pred = regressor.predict(X_test_scaled)
score = r2_score(y_test, y_pred)

# Calculate the RMSE
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
print(f"RMSE: {rmse}")

# Calculate the R2
r2 = r2_score(y_test, y_pred)
print(f"R2: {r2}")

RMSE: 0.7739713740605249
R2: 0.5384807017450977


In [13]:
# Scale the features (not strictly required for tree-based models such as XGBoost)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

def objective(trial):
    max_depth = trial.suggest_int('max_depth', 2, 10)
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    gamma = trial.suggest_float('gamma', 0, 1)
    subsample = trial.suggest_float('subsample', 0.5, 1)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.5, 1)

    # Create the XGBoost regressor with the suggested hyperparameters
    regressor = XGBRegressor(max_depth=max_depth,
                            n_estimators=n_estimators,
                            learning_rate=learning_rate,
                            min_child_weight=min_child_weight,
                            gamma=gamma,
                            subsample=subsample,
                            colsample_bytree=colsample_bytree,
                            random_state=42)
    regressor.fit(X_train_scaled, y_train)

    # Predict and calculate the rmse score
    y_pred = regressor.predict(X_test_scaled)
    score = np.sqrt(np.mean((y_test - y_pred)**2))
    return score

# Create a study object and optimize the objective function
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=1000)

print("Best trial:")
trial = study.best_trial
print(f"  Rmse score: {trial.value}")
print("  Params: ")
for key, value in trial.params.items():
    print(f"    {key}: {value}")

[I 2024-04-19 17:14:55,158] A new study created in memory with name: no-name-37b27b55-e9cb-4950-82a8-418b215e0288


[I 2024-04-19 17:14:55,483] Trial 0 finished with value: 0.7942087450951101 and parameters: {'max_depth': 3, 'n_estimators': 217, 'learning_rate': 0.03941240917165297, 'min_child_weight': 8, 'gamma': 0.8854224428430415, 'subsample': 0.9975962335665556, 'colsample_bytree': 0.958648985968584}. Best is trial 0 with value: 0.7942087450951101.
[I 2024-04-19 17:14:55,784] Trial 1 finished with value: 0.9332071640923036 and parameters: {'max_depth': 4, 'n_estimators': 253, 'learning_rate': 0.17163194986934838, 'min_child_weight': 2, 'gamma': 0.23311871006579254, 'subsample': 0.6950168427107208, 'colsample_bytree': 0.5831132469944307}. Best is trial 0 with value: 0.7942087450951101.
[I 2024-04-19 17:14:56,200] Trial 2 finished with value: 0.9646121858457842 and parameters: {'max_depth': 9, 'n_estimators': 315, 'learning_rate': 0.25933019877561836, 'min_child_weight': 5, 'gamma': 0.041888670795614535, 'subsample': 0.65666408535684, 'colsample_bytree': 0.763330018758683}. Best is trial 0 with va

Best trial:
  Rmse score: 0.7730690828813059
  Params: 
    max_depth: 2
    n_estimators: 158
    learning_rate: 0.06945942365265755
    min_child_weight: 5
    gamma: 0.3338678571732243
    subsample: 0.9583095899756601
    colsample_bytree: 0.8149947994070121


In [14]:
regressor = XGBRegressor(max_depth=2,
                            n_estimators=158,
                            learning_rate=0.06945942365265755,
                            min_child_weight=5,
                            gamma=0.3338678571732243,
                            subsample=0.9583095899756601,
                            colsample_bytree=0.8149947994070121,
                            random_state=42)
regressor.fit(X_train_scaled, y_train)

y_pred = regressor.predict(X_test_scaled)
score = r2_score(y_test, y_pred)

# Calculate the RMSE
rmse = np.sqrt(np.mean((y_test - y_pred)**2))
print(f"RMSE: {rmse}")

# Calculate the R2
r2 = r2_score(y_test, y_pred)
print(f"R2: {r2}")

RMSE: 0.7730690828813059
R2: 0.5395561473572801
