# Gaussian Process Regression Model (Version 1)

Built with the [`sklearn.gaussian_process.GaussianProcessRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.gaussian_process.GaussianProcessRegressor.html) class from scikit-learn.

### Summary

| Techniques                     | Used / Description           |
| ------------------------------ | ---------------------------- |
| Handling Unknown Variables     | Drop Rows                    |
| Handling Categorical Variables | Drop Columns (Drop Features) |
| Handling Class Imbalance       | Not Applied                  |
| Handling Outliers              | Not Applied                  |

### Results

| Metric                 | Value   |
| ---------------------- | ------- |
| RMSE (Lower is better) | 0.85619 |
| R2 (Higher is better)  | 0.43520 |
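
For reference, the two metrics in the table can be reproduced by hand. The snippet below is a minimal sketch using made-up toy arrays (not the project data); the helper names `rmse_manual` and `r2_manual` are illustrative and do not appear in the notebook.

```python
import numpy as np


def rmse_manual(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Toy values chosen only so the arithmetic is easy to follow.
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.0, 2.0, 3.0, 6.0])

rmse_toy = rmse_manual(y_true_toy, y_pred_toy)
r2_toy = r2_manual(y_true_toy, y_pred_toy)
print(f"RMSE: {rmse_toy:.3f}, R2: {r2_toy:.3f}")
```

RMSE is expressed in the units of the target, while R2 compares the model's squared error against that of always predicting the mean of the target.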


In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF,
    WhiteKernel,
    ConstantKernel,
    DotProduct,
    ExpSineSquared,
    RationalQuadratic,
    Matern,
)
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.pipeline import Pipeline

In [143]:
X_train = pd.read_csv("../../cleaned-data/X_train.csv")
y_train = pd.read_csv("../../cleaned-data/y_train.csv")

X_test = pd.read_csv("../../cleaned-data/X_test.csv")
y_test = pd.read_csv("../../cleaned-data/y_test.csv")

In [144]:
columns_to_drop = ['land_use_label', 'subzone', 'planning_area', 'region',
                   'temp_2024_04_07_min', 'temp_2024_04_07_max', 'temp_2024_04_07_median',
                   'temp_2024_04_08_min', 'temp_2024_04_08_max', 'temp_2024_04_08_median',
                   'temp_2024_04_09_min', 'temp_2024_04_09_max', 'temp_2024_04_09_median',
                   'temp_2024_04_10_min', 'temp_2024_04_10_max', 'temp_2024_04_10_median']

X_train.drop(columns_to_drop, axis=1, inplace=True)
X_test.drop(columns_to_drop, axis=1, inplace=True)

In [132]:
# Join features and target column-wise so each row stays one observation
pd.concat((X_train, y_train), axis=1).describe()

|  | latitude | longitude | distance_to_waterbody | distance_to_open_space | elevation | Total_x | HDB Total | Condominiums & Other Apartments | Landed Properties_x | Other Dwellings_x | ... | $5,000 - $5,999 | $6,000 - $6,999 | $7,000 - $7,999 | $8,000 - $8,999 | $9,000 - $9,999 | $10,000 - 10,999 | $11,000 - 11,999 | $12,000 - $14,999 | $15,000 & Over | avg_temp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | ... | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 | 611.0 |
| mean | 1.333808 | 103.825289 | 0.00345 | 0.003158 | 18.790507 | 15881.963993 | 11266.759411 | 2835.204583 | 1632.978723 | 144.140753 | ... | 5721.201309 | 4160.641571 | 3418.569558 | 2979.972177 | 2363.749591 | 2133.001637 | 1689.85761 | 3186.022913 | 5837.605565 | 30.065189 |
| std | 0.033638 | 0.087219 | 0.002659 | 0.005375 | 11.746306 | 22527.915515 | 20687.51055 | 3817.916875 | 3707.929779 | 243.301823 | ... | 4539.120772 | 3265.876008 | 2578.008588 | 2219.346693 | 1717.486356 | 1466.083238 | 1246.052864 | 2228.565349 | 4383.833809 | 1.264064 |
| min | 1.260529 | 103.62403 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 27.776378 |
| 25% | 1.311093 | 103.761419 | 0.001412 | 0.000544 | 11.0 | 80.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1799.0 | 1239.0 | 1138.0 | 1250.0 | 867.0 | 1032.0 | 594.0 | 1465.0 | 4025.0 | 28.967618 |
| 50% | 1.329272 | 103.83276 | 0.003 | 0.001407 | 16.0 | 8040.0 | 0.0 | 1110.0 | 10.0 | 60.0 | ... | 4562.0 | 3488.0 | 2753.0 | 2367.0 | 2038.0 | 2030.0 | 1422.0 | 3180.0 | 5440.0 | 29.98497 |
| 75% | 1.35552 | 103.893404 | 0.004961 | 0.002993 | 25.0 | 24320.0 | 16685.0 | 4910.0 | 1085.0 | 210.0 | ... | 10695.0 | 7552.0 | 5855.0 | 5104.0 | 3872.0 | 3255.0 | 2385.0 | 4404.0 | 7842.0 | 30.942114 |
| max | 1.429386 | 103.989652 | 0.013994 | 0.041106 | 91.0 | 130980.0 | 125230.0 | 16920.0 | 18850.0 | 2730.0 | ... | 14433.0 | 11198.0 | 8471.0 | 7626.0 | 5497.0 | 5482.0 | 4067.0 | 7528.0 | 16726.0 | 32.836497 |


In [133]:
# Rows where min_ndvi holds the placeholder "-" have no NDVI reading; drop them
valid_rows_train = X_train.min_ndvi != "-"
valid_rows_test = X_test.min_ndvi != "-"

X_train = X_train[valid_rows_train]
y_train = y_train[valid_rows_train]

X_test = X_test[valid_rows_test]
y_test = y_test[valid_rows_test]

In [142]:
kernel = (
    ConstantKernel()
    # * RBF(length_scale=1e-1, length_scale_bounds=(1e-5, 1e5))
    # * Matern(length_scale=1e-1, length_scale_bounds=(1e-5, 1e5), nu=1.5)
    * RationalQuadratic(length_scale=1.0, alpha=1e-1, alpha_bounds=(1e-5, 1e2))
    + WhiteKernel(noise_level=1e-2, noise_level_bounds=(1e-5, 1e1))
)

pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("gpr", GaussianProcessRegressor(kernel=kernel, n_restarts_optimizer=10)),
    ]
)

pipeline.fit(X_train, y_train)
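
After fitting, it can be useful to check which hyperparameters the optimizer settled on. The following is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are invented for illustration, not the project data); it reads the fitted estimator's `kernel_` and `log_marginal_likelihood_value_` attributes through `named_steps`, using the same kernel family as the cell above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    ConstantKernel,
    RationalQuadratic,
    WhiteKernel,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: two features, noisy sine target).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(40, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.normal(size=40)

demo_kernel = (
    ConstantKernel()
    * RationalQuadratic(length_scale=1.0, alpha=1e-1)
    + WhiteKernel(noise_level=1e-2)
)

demo_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        (
            "gpr",
            GaussianProcessRegressor(
                kernel=demo_kernel, n_restarts_optimizer=2, random_state=0
            ),
        ),
    ]
)
demo_pipeline.fit(X_demo, y_demo)

# kernel_ is the kernel with optimized hyperparameters; the initial
# `kernel` argument is left unchanged by fitting.
fitted_gpr = demo_pipeline.named_steps["gpr"]
print(fitted_gpr.kernel_)
print(fitted_gpr.log_marginal_likelihood_value_)
```

A hyperparameter that ends up at one of its bounds is usually a hint that the bounds, or the kernel choice itself, deserve another look.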

In [141]:
y_mean, y_std = pipeline.predict(X_test, return_std=True)

# `squared=False` is deprecated in recent scikit-learn releases; take the root explicitly
rmse = np.sqrt(mean_squared_error(y_test, y_mean))
r2 = r2_score(y_test, y_mean)

print(f"STD: {np.mean(y_std)}")
print(f"RMSE: {rmse}")
print(f"R2: {r2}")

STD: 0.9924402808904845
RMSE: 0.8561996709616964
R2: 0.4352058597107318
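
The predictive standard deviation returned by `return_std=True` can also be turned into an approximate 95% interval and checked against held-out targets. The sketch below runs on synthetic data with a simpler `RBF() + WhiteKernel()` kernel (both are assumptions for illustration, not the notebook's setup), and all variable names are hypothetical.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one feature, noisy sine target.
rng = np.random.default_rng(42)
X_syn = rng.uniform(-3, 3, size=(80, 1))
y_syn = np.sin(X_syn[:, 0]) + 0.2 * rng.normal(size=80)

X_fit, X_eval = X_syn[:60], X_syn[60:]
y_fit, y_eval = y_syn[:60], y_syn[60:]

interval_pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("gpr", GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), random_state=0)),
    ]
)
interval_pipeline.fit(X_fit, y_fit)

# Extra keyword arguments to Pipeline.predict are passed to the final step.
mean_eval, std_eval = interval_pipeline.predict(X_eval, return_std=True)

# Approximate 95% interval under a Gaussian predictive distribution.
lower = mean_eval - 1.96 * std_eval
upper = mean_eval + 1.96 * std_eval

# Fraction of held-out targets inside the interval. Targets are 1-D here;
# a single-column DataFrame target would need .to_numpy().ravel() first.
coverage = float(np.mean((y_eval >= lower) & (y_eval <= upper)))
print(f"Empirical 95% coverage: {coverage:.2f}")
```

Coverage far from 0.95 would suggest the predictive uncertainty is over- or under-stated, which is worth checking when the mean predictive standard deviation is larger than the RMSE, as in the output above.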
