# Urban Heat Island (UHI) Benchmark Notebook 

<b>Challenge Overview: </b><p align="justify"><p>
<p align="justify">Welcome to the EY Open Science AI & Data Challenge 2025! The objective of this challenge is to build a machine learning model to predict urban heat island (UHI) hotspots in a city. By the end of the challenge, you will have developed a regression model capable of predicting the intensity of the UHI effect.

Participants will be given ground-level air temperature data in an index format, which was collected on 24th July 2021 at traverse points in the Bronx and Manhattan regions of New York City. This dataset consists of traverse points (latitude and longitude) and their corresponding UHI (Urban Heat Island) index values. Participants will use this dataset to build a regression model to predict UHI index values for a given set of locations. It is important to understand that the UHI index at any given location indicates the relative temperature elevation at that point compared to the city's average temperature.

This challenge is designed for participants with varying skill levels in data science and programming, offering a great opportunity to apply your knowledge and enhance your capabilities in the field.</p>

<b>Challenge Aim: </b><p align="justify"><p>
<p align="justify">In this notebook, we will demonstrate a basic model workflow that can serve as a starting point for the challenge. The basic model has been constructed to predict the Urban Heat Island (UHI) index using features from the Sentinel-2 satellite dataset as predictor variables. In this demonstration, we utilized three features from the Sentinel-2 dataset: band B01 (Coastal Aerosol), band B06 (Red Edge), and NDVI (Normalized Difference Vegetation Index) derived from bands B04 (Red) and B08 (Near Infrared). A random forest regression model was then trained using these features.
    
These features were extracted from a GeoTIFF image created by the Sentinel-2 sample notebook. For the sample model shown in this notebook, data from a single day (24th July 2021) was considered, assuming that the values of bands B01, B04, B06, and B08 for this specific date are representative of the UHI index behavior at any location. Participants should review the details of the Sentinel-2 sample notebook to gain an understanding of the data and options for modifying the output product. 
    
</p>

<p align="justify">Most of the functions presented in this notebook were adapted from the <a href="https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a#Example-Notebook">Sentinel-2-Level-2A notebook</a> found in the Planetary Computer portal.</p>

<p align="justify">Please note that this notebook is just a starting point. We have made many assumptions in this notebook that you may think are not best for solving the challenge effectively. You are encouraged to modify these functions, rewrite them, or try an entirely new approach.</p>

## 1. Load In Dependencies

To run this demonstration notebook, the packages imported below must be installed. Installing them may take some time.

In [1]:
# Suppress Warnings
import warnings
warnings.filterwarnings('ignore')

# Data Manipulation
import numpy as np
import pandas as pd

# Visualization
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import plotly.graph_objects as go

# Multi-dimensional arrays and datasets
import xarray as xr

# Geospatial raster data handling
import rioxarray as rxr

# Geospatial data analysis
import geopandas as gpd

# Geospatial operations
import rasterio
from rasterio import warp
from rasterio import windows
from rasterio import features 
from rasterio.windows import from_bounds
from rasterio.warp import transform_bounds

# Image Processing
from PIL import Image

# Coordinate transformations
from pyproj import Proj, Transformer, CRS

# Feature Engineering
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split

# Machine Learning
from sklearn.metrics import r2_score
from sklearn.feature_selection import RFECV
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, HistGradientBoostingRegressor

# Planetary Computer Tools
import pystac_client
import planetary_computer
from odc.stac import stac_load
from pystac.extensions.eo import EOExtension as eo

# Others
import os
from tqdm import tqdm
from datetime import datetime

## 2. Load Data

In [2]:
# Load the training data from the csv file and display it to inspect the data
ground_df = pd.read_csv("../data/Training_data_uhi_index_2025-02-18.csv")
ground_df

Unnamed: 0,Longitude,Latitude,datetime,UHI Index
0,-73.909167,40.813107,24-07-2021 15:53,1.030289
1,-73.909187,40.813045,24-07-2021 15:53,1.030289
2,-73.909215,40.812978,24-07-2021 15:53,1.023798
3,-73.909242,40.812908,24-07-2021 15:53,1.023798
4,-73.909257,40.812845,24-07-2021 15:53,1.021634
...,...,...,...,...
11224,-73.957050,40.790333,24-07-2021 15:57,0.972470
11225,-73.957063,40.790308,24-07-2021 15:57,0.972470
11226,-73.957093,40.790270,24-07-2021 15:57,0.981124
11227,-73.957112,40.790253,24-07-2021 15:59,0.981245
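The `datetime` column is stored as day-first strings (for example `24-07-2021 15:53`). If time-of-day information is of interest, the strings can be parsed with an explicit format. The sketch below uses a two-row stand-in for `ground_df`; the names `sample_df` and `minute_of_day` are hypothetical and not part of the benchmark.

```python
import pandas as pd

# Two rows copied from the table above, standing in for ground_df
sample_df = pd.DataFrame({
    "Longitude": [-73.909167, -73.957112],
    "Latitude": [40.813107, 40.790253],
    "datetime": ["24-07-2021 15:53", "24-07-2021 15:59"],
    "UHI Index": [1.030289, 0.981245],
})

# Parse with an explicit day-first format so 24-07-2021 is read as 24 July 2021
sample_df["datetime"] = pd.to_datetime(sample_df["datetime"], format="%d-%m-%Y %H:%M")

# Hypothetical derived feature: minutes elapsed since midnight
sample_df["minute_of_day"] = (
    sample_df["datetime"].dt.hour * 60 + sample_df["datetime"].dt.minute
)
print(sample_df[["datetime", "minute_of_day"]])
```

Giving the format explicitly avoids any ambiguity between day-first and month-first parsing.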


### 2.1 Load Sentinel-2 Data

In [3]:
# Extracts the mean Sentinel-2 band values from a GeoTIFF in a box around each point of a DataFrame
# and returns them in a new DataFrame (one row per point, one column per band).
def map_sentinel_2_data(tiff_path, df, box_size_deg):

    # Load the GeoTIFF data and read its CRS
    data = rxr.open_rasterio(tiff_path)
    tiff_crs = data.rio.crs

    # Read the point coordinates from the DataFrame
    latitudes = df['Latitude'].values
    longitudes = df['Longitude'].values

    # Transformer from EPSG:4326 (WGS84 lat/long) to the GeoTIFF's CRS.
    # always_xy=True fixes the axis order to (x=longitude, y=latitude) on both sides.
    transformer = Transformer.from_crs("EPSG:4326", tiff_crs, always_xy=True)

    # Band names in the order they are stored in the GeoTIFF (band 1 = B01, ..., band 11 = B12)
    band_names = ['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11', 'B12']
    band_values = {name: [] for name in band_names}

    # Iterate over the points and extract the mean value of each band inside the box
    for lat, lon in tqdm(zip(latitudes, longitudes), total=len(latitudes), desc="Mapping values"):

        # Transform the longitude and latitude to the GeoTIFF's CRS
        x, y = transformer.transform(lon, lat)

        # Define the bounding box; box_size_deg is the full side length in CRS units
        # (degrees when the GeoTIFF is stored in lat/long)
        x_min, x_max = x - box_size_deg / 2, x + box_size_deg / 2
        y_min, y_max = y - box_size_deg / 2, y + box_size_deg / 2

        # y is sliced from max to min, which assumes the raster's y coordinates are in descending order
        for band_index, name in enumerate(band_names, start=1):
            band_data = data.sel(x=slice(x_min, x_max), y=slice(y_max, y_min), band=band_index).values
            band_values[name].append(band_data.mean())

    # Create a DataFrame with one column per band
    return pd.DataFrame(band_values)
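To make the box averaging concrete, here is a minimal NumPy-only sketch of roughly what each label-based `data.sel(x=slice(...), y=slice(...))` call followed by `.mean()` does. The 4x4 grid, its coordinates, and the helper `box_mean` are made up for illustration; the real function works on the rioxarray object and its actual coordinates.

```python
import numpy as np

# Synthetic 4x4 single-band raster: x ascends, y descends (a common north-up layout)
x_coords = np.array([0.0, 1.0, 2.0, 3.0])
y_coords = np.array([3.0, 2.0, 1.0, 0.0])
band = np.arange(16, dtype=float).reshape(4, 4)  # rows follow y, columns follow x

def box_mean(band, x_coords, y_coords, x, y, box_size):
    """Mean of the pixels whose centres fall inside a square box centred on (x, y)."""
    half = box_size / 2
    x_mask = (x_coords >= x - half) & (x_coords <= x + half)
    y_mask = (y_coords >= y - half) & (y_coords <= y + half)
    window = band[np.ix_(y_mask, x_mask)]
    # An empty window has no mean, so return NaN explicitly
    return window.mean() if window.size else np.nan

inside = box_mean(band, x_coords, y_coords, x=1.5, y=1.5, box_size=1.0)
outside = box_mean(band, x_coords, y_coords, x=10.0, y=10.0, box_size=1.0)
print(inside, outside)
```

A point whose box falls outside the raster produces `NaN`, so it is worth checking the extracted columns for missing values before training.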

In [4]:
sentinel_2_data = map_sentinel_2_data('../data/S2_median_2025-03-15_v1.tiff', ground_df, 9040/111320.0) # box side of about 9040 meters, converted to degrees (1 degree is about 111,320 meters)
sentinel_2_data

Mapping values: 100%|██████████| 11229/11229 [04:53<00:00, 38.32it/s]


Unnamed: 0,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B11,B12
0,952.568307,1143.988376,1741.951470,1882.093033,1129.856803,1804.947448,1000.253359,1349.151588,1904.062503,1949.556412,1527.531476
1,953.030288,1144.601424,1742.376938,1882.433379,1130.356545,1805.633125,1000.713106,1349.779036,1904.413154,1949.893339,1528.229636
2,953.246997,1144.774241,1742.405201,1882.364107,1130.502264,1805.756397,1000.854321,1349.934798,1904.393419,1949.856158,1528.375205
3,953.419715,1144.978706,1742.310635,1882.146704,1130.677990,1805.856581,1001.028434,1350.140561,1904.184853,1949.637293,1528.573465
4,953.423710,1144.959963,1742.102912,1881.881197,1130.666825,1805.723987,1001.016320,1350.107137,1903.908753,1949.356750,1528.492977
...,...,...,...,...,...,...,...,...,...,...,...
11224,879.495500,1061.263875,1643.931551,1787.334517,1053.615009,1664.755482,926.876241,1252.043240,1810.865678,1859.083371,1382.233011
11225,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,926.867078,1251.982648,1810.624999,1858.838585,1382.093697
11226,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,926.867078,1251.982648,1810.624999,1858.838585,1382.093697
11227,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,926.867078,1251.982648,1810.624999,1858.838585,1382.093697
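The box size passed above is a length in meters divided by 111,320, the approximate number of meters in one degree of latitude. The sketch below spells out that conversion. The helper names are hypothetical, and the longitude variant is only an illustration of how the factor changes with latitude; the notebook itself uses the same factor for both axes.

```python
import math

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude, as used above

def meters_to_degrees_lat(meters):
    """Convert a north-south distance in meters to degrees of latitude."""
    return meters / METERS_PER_DEGREE

def meters_to_degrees_lon(meters, latitude_deg):
    """Convert an east-west distance in meters to degrees of longitude at a given latitude."""
    # One degree of longitude shrinks with the cosine of the latitude
    return meters / (METERS_PER_DEGREE * math.cos(math.radians(latitude_deg)))

box_lat = meters_to_degrees_lat(9040)
box_lon = meters_to_degrees_lon(9040, 40.8)  # 40.8 is roughly the latitude of the traverse points
print(box_lat, box_lon)
```

Because the same degree value is used in both directions, the box in the notebook is narrower in meters east-west than north-south.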


In [5]:
sentinel_2_full_data = pd.concat([ground_df, sentinel_2_data], axis=1)
sentinel_2_full_data

Unnamed: 0,Longitude,Latitude,datetime,UHI Index,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B11,B12
0,-73.909167,40.813107,24-07-2021 15:53,1.030289,952.568307,1143.988376,1741.951470,1882.093033,1129.856803,1804.947448,1000.253359,1349.151588,1904.062503,1949.556412,1527.531476
1,-73.909187,40.813045,24-07-2021 15:53,1.030289,953.030288,1144.601424,1742.376938,1882.433379,1130.356545,1805.633125,1000.713106,1349.779036,1904.413154,1949.893339,1528.229636
2,-73.909215,40.812978,24-07-2021 15:53,1.023798,953.246997,1144.774241,1742.405201,1882.364107,1130.502264,1805.756397,1000.854321,1349.934798,1904.393419,1949.856158,1528.375205
3,-73.909242,40.812908,24-07-2021 15:53,1.023798,953.419715,1144.978706,1742.310635,1882.146704,1130.677990,1805.856581,1001.028434,1350.140561,1904.184853,1949.637293,1528.573465
4,-73.909257,40.812845,24-07-2021 15:53,1.021634,953.423710,1144.959963,1742.102912,1881.881197,1130.666825,1805.723987,1001.016320,1350.107137,1903.908753,1949.356750,1528.492977
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11224,-73.957050,40.790333,24-07-2021 15:57,0.972470,879.495500,1061.263875,1643.931551,1787.334517,1053.615009,1664.755482,926.876241,1252.043240,1810.865678,1859.083371,1382.233011
11225,-73.957063,40.790308,24-07-2021 15:57,0.972470,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,926.867078,1251.982648,1810.624999,1858.838585,1382.093697
11226,-73.957093,40.790270,24-07-2021 15:57,0.981124,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,926.867078,1251.982648,1810.624999,1858.838585,1382.093697
11227,-73.957112,40.790253,24-07-2021 15:59,0.981245,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,926.867078,1251.982648,1810.624999,1858.838585,1382.093697


In [6]:
duplicated_rows = sentinel_2_full_data[sentinel_2_full_data.duplicated()]
print(f"Number of duplicated rows: {sentinel_2_full_data.duplicated().sum()}")
duplicated_rows

Number of duplicated rows: 0


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,B01,B02,B03,B04,B05,B06,B07,B08,B8A,B11,B12


### 2.2 Load Landsat Data

In [7]:
# Extracts the mean Landsat band values from a GeoTIFF in a box around each point of a DataFrame
# and returns them in a new DataFrame (one row per point, one column per band).
def map_landsat_data(tiff_path, df, box_size_deg):

    # Load the GeoTIFF data and read its CRS
    data = rxr.open_rasterio(tiff_path)
    tiff_crs = data.rio.crs

    # Read the point coordinates from the DataFrame
    latitudes = df['Latitude'].values
    longitudes = df['Longitude'].values

    # Transformer from EPSG:4326 (WGS84 lat/long) to the GeoTIFF's CRS.
    # always_xy=True fixes the axis order to (x=longitude, y=latitude) on both sides.
    transformer = Transformer.from_crs("EPSG:4326", tiff_crs, always_xy=True)

    # Band names in the order they are stored in the GeoTIFF (band 1 = lwir11, ..., band 8 = coastal)
    band_names = ['lwir11', 'emis', 'drad', 'urad', 'atran', 'swir16', 'swir22', 'coastal']
    band_values = {name: [] for name in band_names}

    # Iterate over the points and extract the mean value of each band inside the box
    for lat, lon in tqdm(zip(latitudes, longitudes), total=len(latitudes), desc="Mapping values"):

        # Transform the longitude and latitude to the GeoTIFF's CRS
        x, y = transformer.transform(lon, lat)

        # Define the bounding box; box_size_deg is the full side length in CRS units
        x_min, x_max = x - box_size_deg / 2, x + box_size_deg / 2
        y_min, y_max = y - box_size_deg / 2, y + box_size_deg / 2

        # y is sliced from max to min, which assumes the raster's y coordinates are in descending order
        for band_index, name in enumerate(band_names, start=1):
            band_data = data.sel(x=slice(x_min, x_max), y=slice(y_max, y_min), band=band_index).values
            band_values[name].append(band_data.mean())

    # Create a DataFrame with one column per band
    return pd.DataFrame(band_values)

In [8]:
landsat_data = map_landsat_data('../data/Landsat_median_2025-03-19_v1.tiff', ground_df, 0.001) # 111.139 meter, 1300 meter
landsat_data

Mapping values: 100%|██████████| 11229/11229 [00:30<00:00, 368.86it/s]


Unnamed: 0,lwir11,emis,drad,urad,atran,swir16,swir22,coastal
0,41.908947,0.968475,1.484,3.231937,0.616556,195.317375,191.207847,183.852695
1,41.908947,0.968475,1.484,3.231937,0.616556,195.317375,191.207847,183.852695
2,41.665556,0.968917,1.484,3.231917,0.616575,193.790304,189.355138,182.645565
3,41.511175,0.969137,1.484,3.231688,0.616581,195.218894,190.158301,183.502776
4,41.511175,0.969137,1.484,3.231688,0.616581,195.218894,190.158301,183.502776
...,...,...,...,...,...,...,...,...
11224,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816
11225,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816
11226,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816
11227,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816


In [9]:
landsat_full_data = pd.concat([ground_df, landsat_data], axis=1)
landsat_full_data

Unnamed: 0,Longitude,Latitude,datetime,UHI Index,lwir11,emis,drad,urad,atran,swir16,swir22,coastal
0,-73.909167,40.813107,24-07-2021 15:53,1.030289,41.908947,0.968475,1.484,3.231937,0.616556,195.317375,191.207847,183.852695
1,-73.909187,40.813045,24-07-2021 15:53,1.030289,41.908947,0.968475,1.484,3.231937,0.616556,195.317375,191.207847,183.852695
2,-73.909215,40.812978,24-07-2021 15:53,1.023798,41.665556,0.968917,1.484,3.231917,0.616575,193.790304,189.355138,182.645565
3,-73.909242,40.812908,24-07-2021 15:53,1.023798,41.511175,0.969137,1.484,3.231688,0.616581,195.218894,190.158301,183.502776
4,-73.909257,40.812845,24-07-2021 15:53,1.021634,41.511175,0.969137,1.484,3.231688,0.616581,195.218894,190.158301,183.502776
...,...,...,...,...,...,...,...,...,...,...,...,...
11224,-73.957050,40.790333,24-07-2021 15:57,0.972470,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816
11225,-73.957063,40.790308,24-07-2021 15:57,0.972470,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816
11226,-73.957093,40.790270,24-07-2021 15:57,0.981124,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816
11227,-73.957112,40.790253,24-07-2021 15:59,0.981245,34.882353,0.983631,1.475,3.208750,0.619225,194.520122,184.141518,179.856816


In [10]:
duplicated_rows = landsat_full_data[landsat_full_data.duplicated()]
print(f"Number of duplicated rows: {landsat_full_data.duplicated().sum()}")
duplicated_rows

Number of duplicated rows: 0


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,lwir11,emis,drad,urad,atran,swir16,swir22,coastal


### 2.3 Load Sentinel-1 Data

In [11]:
# Extracts the mean Sentinel-1 band values (VV and VH) from a GeoTIFF in a box around each point of a DataFrame
# and returns them in a new DataFrame (one row per point, one column per band).
def map_sentinel_1_data(tiff_path, df, box_size_deg):

    # Load the GeoTIFF data and read its CRS
    data = rxr.open_rasterio(tiff_path)
    tiff_crs = data.rio.crs

    # Read the point coordinates from the DataFrame
    latitudes = df['Latitude'].values
    longitudes = df['Longitude'].values

    # Transformer from EPSG:4326 (WGS84 lat/long) to the GeoTIFF's CRS.
    # always_xy=True fixes the axis order to (x=longitude, y=latitude) on both sides.
    transformer = Transformer.from_crs("EPSG:4326", tiff_crs, always_xy=True)

    # Band names in the order they are stored in the GeoTIFF (band 1 = vv, band 2 = vh)
    band_names = ['vv', 'vh']
    band_values = {name: [] for name in band_names}

    # Iterate over the points and extract the mean value of each band inside the box
    for lat, lon in tqdm(zip(latitudes, longitudes), total=len(latitudes), desc="Mapping values"):

        # Transform the longitude and latitude to the GeoTIFF's CRS
        x, y = transformer.transform(lon, lat)

        # Define the bounding box; box_size_deg is the full side length in CRS units
        x_min, x_max = x - box_size_deg / 2, x + box_size_deg / 2
        y_min, y_max = y - box_size_deg / 2, y + box_size_deg / 2

        # y is sliced from max to min, which assumes the raster's y coordinates are in descending order
        for band_index, name in enumerate(band_names, start=1):
            band_data = data.sel(x=slice(x_min, x_max), y=slice(y_max, y_min), band=band_index).values
            band_values[name].append(band_data.mean())

    # Create a DataFrame with one column per band
    return pd.DataFrame(band_values)

In [12]:
sentinel_1_data = map_sentinel_1_data('../data/S1_median_2025-03-15_v1.tiff', ground_df, 5800/111320.0) # box side of about 5800 meters, converted to degrees

Mapping values: 100%|██████████| 11229/11229 [00:29<00:00, 384.03it/s]


In [13]:
sentinel_1_full_data = pd.concat([ground_df, sentinel_1_data], axis=1)
sentinel_1_full_data

Unnamed: 0,Longitude,Latitude,datetime,UHI Index,vv,vh
0,-73.909167,40.813107,24-07-2021 15:53,1.030289,1.011585,0.137795
1,-73.909187,40.813045,24-07-2021 15:53,1.030289,1.011365,0.137733
2,-73.909215,40.812978,24-07-2021 15:53,1.023798,1.010216,0.137650
3,-73.909242,40.812908,24-07-2021 15:53,1.023798,1.010636,0.138434
4,-73.909257,40.812845,24-07-2021 15:53,1.021634,1.007889,0.138498
...,...,...,...,...,...,...
11224,-73.957050,40.790333,24-07-2021 15:57,0.972470,0.978522,0.350995
11225,-73.957063,40.790308,24-07-2021 15:57,0.972470,0.978522,0.350995
11226,-73.957093,40.790270,24-07-2021 15:57,0.981124,0.987513,0.356130
11227,-73.957112,40.790253,24-07-2021 15:59,0.981245,0.987950,0.356467


In [14]:
duplicated_rows = sentinel_1_full_data[sentinel_1_full_data.duplicated()]
print(f"Number of duplicated rows: {sentinel_1_full_data.duplicated().sum()}")
duplicated_rows

Number of duplicated rows: 0


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,vv,vh


### 2.4 Load Footprint Data

In [15]:
footprint_df = pd.read_csv("../data/building_footprint_data_with_roof_2025-03-19_v1_0.004.csv")
footprint_df

Unnamed: 0,Longitude,Latitude,datetime,UHI Index,Density,Building Area Covered,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var
0,-73.909167,40.813107,24-07-2021 15:53,1.030289,108,0.000004,3.302316e-08,1.855104e-08,1.958780e-15,4622.322123,42.799279,36.125,980.284349
1,-73.909187,40.813045,24-07-2021 15:53,1.030289,106,0.000004,3.472510e-08,1.880488e-08,2.044352e-15,4613.519580,43.523770,36.600,979.373418
2,-73.909215,40.812978,24-07-2021 15:53,1.023798,104,0.000004,3.508403e-08,1.883252e-08,2.077283e-15,4517.969580,43.442015,36.600,1004.844554
3,-73.909242,40.812908,24-07-2021 15:53,1.023798,101,0.000004,3.466292e-08,1.883848e-08,2.055836e-15,4369.359580,43.260986,36.520,1029.721643
4,-73.909257,40.812845,24-07-2021 15:53,1.021634,96,0.000003,3.477614e-08,1.894471e-08,2.104002e-15,4135.599580,43.079162,36.125,1070.326838
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11224,-73.957050,40.790333,24-07-2021 15:57,0.972470,0,0.000000,,,,0.000000,,,
11225,-73.957063,40.790308,24-07-2021 15:57,0.972470,0,0.000000,,,,0.000000,,,
11226,-73.957093,40.790270,24-07-2021 15:57,0.981124,0,0.000000,,,,0.000000,,,
11227,-73.957112,40.790253,24-07-2021 15:59,0.981245,0,0.000000,,,,0.000000,,,
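In the table above, rows with a `Density` of 0 have no area or roof-height statistics, and elsewhere in this dataset rows with a single building have an undefined variance. A quick way to see how many values are missing per column is sketched below on a three-row stand-in; `sample_fp` is a hypothetical name and only a few of the footprint columns are reproduced.

```python
import numpy as np
import pandas as pd

# Three rows modelled on the footprint data: many buildings, no buildings, one building
sample_fp = pd.DataFrame({
    "Density": [108, 0, 1],
    "Mean Area": [3.302316e-08, np.nan, 1.923966e-09],
    "Variance Area": [1.958780e-15, np.nan, np.nan],
})

# Count missing values per column
missing_per_column = sample_fp.isna().sum()
print(missing_per_column)

# Check that every row without buildings also lacks a mean area
no_buildings = sample_fp["Density"] == 0
all_missing = bool(sample_fp.loc[no_buildings, "Mean Area"].isna().all())
print(all_missing)
```

How to treat these gaps (dropping, filling, or imputing) is a modelling choice that this sketch does not settle.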


In [16]:
duplicated_rows = footprint_df[footprint_df.duplicated()]
print(f"Number of duplicated rows: {footprint_df.duplicated().sum()}")
duplicated_rows

Number of duplicated rows: 0


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,Density,Building Area Covered,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var


In [17]:
# Checking if the ground data (first four columns) are identical across the four datasets
sentinel_1_full_data.iloc[:, 0:4].equals(landsat_full_data.iloc[:, 0:4]) and \
sentinel_1_full_data.iloc[:, 0:4].equals(sentinel_2_full_data.iloc[:, 0:4]) and \
sentinel_1_full_data.iloc[:, 0:4].equals(footprint_df.iloc[:, 0:4])

True

In [18]:
# Checking if all dataframes have the same number of rows
sentinel_1_full_data.shape[0] == landsat_full_data.shape[0] == sentinel_2_full_data.shape[0] == footprint_df.shape[0]

True

## 3. Concatenate Loaded Datasets

### 3.1 Merge Datasets

In [19]:
# The first 4 columns (Longitude, Latitude, datetime, UHI Index) are identical in all dataframes (checked above).
# Keep them from sentinel_2_full_data and take only the remaining columns from the others.
uhi_data = pd.concat([
    sentinel_2_full_data,
    landsat_full_data.iloc[:, 4:], 
    sentinel_1_full_data.iloc[:, 4:],
    footprint_df.iloc[:, 4:]
], axis=1)

uhi_data

Unnamed: 0,Longitude,Latitude,datetime,UHI Index,B01,B02,B03,B04,B05,B06,...,vh,Density,Building Area Covered,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var
0,-73.909167,40.813107,24-07-2021 15:53,1.030289,952.568307,1143.988376,1741.951470,1882.093033,1129.856803,1804.947448,...,0.137795,108,0.000004,3.302316e-08,1.855104e-08,1.958780e-15,4622.322123,42.799279,36.125,980.284349
1,-73.909187,40.813045,24-07-2021 15:53,1.030289,953.030288,1144.601424,1742.376938,1882.433379,1130.356545,1805.633125,...,0.137733,106,0.000004,3.472510e-08,1.880488e-08,2.044352e-15,4613.519580,43.523770,36.600,979.373418
2,-73.909215,40.812978,24-07-2021 15:53,1.023798,953.246997,1144.774241,1742.405201,1882.364107,1130.502264,1805.756397,...,0.137650,104,0.000004,3.508403e-08,1.883252e-08,2.077283e-15,4517.969580,43.442015,36.600,1004.844554
3,-73.909242,40.812908,24-07-2021 15:53,1.023798,953.419715,1144.978706,1742.310635,1882.146704,1130.677990,1805.856581,...,0.138434,101,0.000004,3.466292e-08,1.883848e-08,2.055836e-15,4369.359580,43.260986,36.520,1029.721643
4,-73.909257,40.812845,24-07-2021 15:53,1.021634,953.423710,1144.959963,1742.102912,1881.881197,1130.666825,1805.723987,...,0.138498,96,0.000003,3.477614e-08,1.894471e-08,2.104002e-15,4135.599580,43.079162,36.125,1070.326838
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11224,-73.957050,40.790333,24-07-2021 15:57,0.972470,879.495500,1061.263875,1643.931551,1787.334517,1053.615009,1664.755482,...,0.350995,0,0.000000,,,,0.000000,,,
11225,-73.957063,40.790308,24-07-2021 15:57,0.972470,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,...,0.350995,0,0.000000,,,,0.000000,,,
11226,-73.957093,40.790270,24-07-2021 15:57,0.981124,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,...,0.356130,0,0.000000,,,,0.000000,,,
11227,-73.957112,40.790253,24-07-2021 15:59,0.981245,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,...,0.356467,0,0.000000,,,,0.000000,,,
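The concatenation above is positional: it relies on every dataframe having the same rows in the same order, and on the ground-data columns occupying the first four positions. A slightly more defensive variant, sketched here on two-row stand-ins, asserts the alignment and drops the shared columns by name. The names `key_cols`, `left`, and `right` are hypothetical.

```python
import pandas as pd

key_cols = ["Longitude", "Latitude", "datetime", "UHI Index"]

# Stand-ins for two of the per-source dataframes
left = pd.DataFrame({
    "Longitude": [-73.909167, -73.909187],
    "Latitude": [40.813107, 40.813045],
    "datetime": ["24-07-2021 15:53", "24-07-2021 15:53"],
    "UHI Index": [1.030289, 1.030289],
    "B01": [952.568307, 953.030288],
})
right = left[key_cols].assign(vv=[1.011585, 1.011365])

# Positional concatenation is only safe when the key columns match row for row
assert left[key_cols].equals(right[key_cols])

# Drop the shared columns by name rather than by position before concatenating
merged = pd.concat([left, right.drop(columns=key_cols)], axis=1)
print(merged.columns.tolist())
```

Dropping by name keeps the merge correct even if the column order of one source changes.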


### 3.2 Check and Remove Duplicated Rows

In [20]:
# Flag rows whose feature columns (everything after the first four ground-data columns) are identical,
# and keep only the first occurrence
columns_to_check = uhi_data.columns.tolist()[4:]

for col in columns_to_check:
    # Convert any numpy-array cell with at least one dimension to a tuple so that it is hashable for duplicated()
    uhi_data[col] = uhi_data[col].apply(lambda x: tuple(x) if isinstance(x, np.ndarray) and x.ndim > 0 else x)

# Check for duplicates
duplicated_rows = uhi_data[uhi_data.duplicated(subset=columns_to_check)]
print(f"Number of duplicated rows: {uhi_data.duplicated(subset=columns_to_check).sum()}")
display(duplicated_rows)

# Drop duplicates only if any exist
if not duplicated_rows.empty:
    uhi_data = uhi_data.drop_duplicates(subset=columns_to_check, keep='first').reset_index(drop=True)
    print("Removing duplicated rows...\n")
    print("Updated Dataset after removing duplicates:")
    display(uhi_data)
else:
    print("Congratulations, no duplicated rows found!")

Number of duplicated rows: 61


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,B01,B02,B03,B04,B05,B06,...,vh,Density,Building Area Covered,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var
810,-73.920328,40.822262,24-07-2021 15:22,1.004805,933.211348,1116.593739,1716.889957,1860.852292,1105.716079,1767.117251,...,0.187229,16,2.046629e-06,1.279143e-07,8.082930e-08,2.181178e-14,1569.25,98.078125,100.21,5775.893216
1409,-73.994123,40.772040,24-07-2021 15:47,1.004204,793.634775,965.293168,1519.123191,1665.762626,972.426761,1484.186736,...,1.029326,4,7.413440e-07,1.853360e-07,9.802319e-08,4.937232e-14,598.00,149.500000,72.50,36036.333333
1411,-73.994160,40.771997,24-07-2021 15:47,1.004204,793.574148,965.272946,1519.127790,1665.772032,972.405640,1484.188566,...,1.030994,4,7.413440e-07,1.853360e-07,9.802319e-08,4.937232e-14,598.00,149.500000,72.50,36036.333333
1417,-73.994350,40.771688,24-07-2021 15:47,1.012859,792.787061,964.511042,1518.216030,1664.976256,971.720195,1482.936558,...,1.038165,5,1.131658e-06,2.263317e-07,1.039576e-07,5.212261e-14,294.67,58.934000,41.00,1238.231780
1419,-73.994383,40.771610,24-07-2021 15:47,1.015023,792.435508,964.260026,1517.856553,1664.696227,971.509264,1482.351583,...,1.040902,6,1.207601e-06,2.012669e-07,8.995037e-08,4.546754e-14,373.67,62.278333,60.00,1057.692817
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11134,-73.955107,40.793258,24-07-2021 15:55,1.000358,878.037755,1057.325876,1658.178958,1807.079048,1051.959975,1674.866106,...,0.268276,0,0.000000e+00,,,,0.00,,,
11143,-73.955148,40.792895,24-07-2021 15:56,1.000358,878.692904,1058.482280,1657.585265,1805.854316,1052.725190,1675.254662,...,0.275554,0,0.000000e+00,,,,0.00,,,
11159,-73.955115,40.792277,24-07-2021 15:56,0.989539,880.752709,1060.992341,1656.420198,1803.292331,1054.398378,1676.378149,...,0.288911,1,1.923966e-09,1.923966e-09,1.923966e-09,,12.00,12.000000,12.00,
11169,-73.955300,40.792003,24-07-2021 15:56,0.987375,881.021512,1061.566808,1655.057812,1801.334647,1054.707875,1675.575372,...,0.297289,1,1.923966e-09,1.923966e-09,1.923966e-09,,12.00,12.000000,12.00,


Removing duplicated rows...

Updated Dataset after removing duplicates:


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,B01,B02,B03,B04,B05,B06,...,vh,Density,Building Area Covered,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var
0,-73.909167,40.813107,24-07-2021 15:53,1.030289,952.568307,1143.988376,1741.951470,1882.093033,1129.856803,1804.947448,...,0.137795,108,0.000004,3.302316e-08,1.855104e-08,1.958780e-15,4622.322123,42.799279,36.125,980.284349
1,-73.909187,40.813045,24-07-2021 15:53,1.030289,953.030288,1144.601424,1742.376938,1882.433379,1130.356545,1805.633125,...,0.137733,106,0.000004,3.472510e-08,1.880488e-08,2.044352e-15,4613.519580,43.523770,36.600,979.373418
2,-73.909215,40.812978,24-07-2021 15:53,1.023798,953.246997,1144.774241,1742.405201,1882.364107,1130.502264,1805.756397,...,0.137650,104,0.000004,3.508403e-08,1.883252e-08,2.077283e-15,4517.969580,43.442015,36.600,1004.844554
3,-73.909242,40.812908,24-07-2021 15:53,1.023798,953.419715,1144.978706,1742.310635,1882.146704,1130.677990,1805.856581,...,0.138434,101,0.000004,3.466292e-08,1.883848e-08,2.055836e-15,4369.359580,43.260986,36.520,1029.721643
4,-73.909257,40.812845,24-07-2021 15:53,1.021634,953.423710,1144.959963,1742.102912,1881.881197,1130.666825,1805.723987,...,0.138498,96,0.000003,3.477614e-08,1.894471e-08,2.104002e-15,4135.599580,43.079162,36.125,1070.326838
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11163,-73.957050,40.790333,24-07-2021 15:57,0.972470,879.495500,1061.263875,1643.931551,1787.334517,1053.615009,1664.755482,...,0.350995,0,0.000000,,,,0.000000,,,
11164,-73.957063,40.790308,24-07-2021 15:57,0.972470,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,...,0.350995,0,0.000000,,,,0.000000,,,
11165,-73.957093,40.790270,24-07-2021 15:57,0.981124,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,...,0.356130,0,0.000000,,,,0.000000,,,
11166,-73.957112,40.790253,24-07-2021 15:59,0.981245,879.518179,1061.254101,1643.735251,1787.093975,1053.597269,1664.579154,...,0.356467,0,0.000000,,,,0.000000,,,
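The earlier per-dataset checks called `duplicated()` on all columns, including the coordinates, and found nothing. The check in this section restricts the comparison to the feature columns, so two different traverse points with identical extracted features count as duplicates. A toy example with made-up values shows the difference:

```python
import pandas as pd

# Made-up rows: the first two points differ in location but share identical features
toy = pd.DataFrame({
    "Longitude": [-73.9943, -73.9944, -73.9945],
    "Latitude": [40.7716, 40.7717, 40.7718],
    "B01": [792.4, 792.4, 800.0],
    "vv": [1.04, 1.04, 1.10],
})
feature_cols = ["B01", "vv"]

# Comparing all columns finds no duplicates because the coordinates differ
full_mask = toy.duplicated()

# Comparing only the features flags the second row; the first occurrence is kept
feature_mask = toy.duplicated(subset=feature_cols)
print(full_mask.tolist(), feature_mask.tolist())

deduplicated = toy.drop_duplicates(subset=feature_cols, keep="first").reset_index(drop=True)
print(len(deduplicated))
```

Note that the target column is not part of the comparison, so rows with equal features but different UHI Index values are also collapsed to the first one.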


In [21]:
# # Optionally save the merged training data to a csv file.
# uhi_data.to_csv("training_data_2025-03-15_v1_concat.csv", index = False)

## 4. Features Construction

### 4.1 NDVI Calculation
1. Calculate **NDVI (Normalized Difference Vegetation Index)** and ***handle division by zero*** by ***replacing infinities with `NaN`***.
2. See the **Sentinel-2 sample notebook** for more information about the NDVI index.

In [22]:
uhi_data['NDVI'] = (uhi_data['B08'] - uhi_data['B04']) / (uhi_data['B08'] + uhi_data['B04'])
uhi_data['NDVI'] = uhi_data['NDVI'].replace([np.inf, -np.inf], np.nan)

### 4.2 NDBI Calculation
1. Calculate **NDBI (Normalized Difference Built-up Index)** and ***handle division by zero*** by ***replacing infinities with `NaN`***.
2. See the **Sentinel-2 sample notebook** for more information about the NDBI index.

In [23]:
uhi_data['NDBI'] = (uhi_data['B11'] - uhi_data['B08']) / (uhi_data['B11'] + uhi_data['B08'])
uhi_data['NDBI'] = uhi_data['NDBI'].replace([np.inf, -np.inf], np.nan)

### 4.3 NDWI Calculation
1. Calculate **NDWI (Normalized Difference Water Index)** and ***handle division by zero*** by ***replacing infinities with `NaN`***.
2. See the **Sentinel-2 sample notebook** for more information about the NDWI index.

In [24]:
uhi_data['NDWI'] = (uhi_data['B03'] - uhi_data['B08']) / (uhi_data['B03'] + uhi_data['B08'])
uhi_data['NDWI'] = uhi_data['NDWI'].replace([np.inf, -np.inf], np.nan)
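
NDVI, NDBI, and NDWI all share the form (A - B) / (A + B), so the three calculations could be folded into one helper. The following is a minimal sketch rather than part of the original workflow; the function name `normalized_difference` and the toy values are assumptions for illustration. It masks zero denominators before dividing, which yields `NaN` directly instead of replacing infinities afterwards.

```python
import numpy as np
import pandas as pd

def normalized_difference(a: pd.Series, b: pd.Series) -> pd.Series:
    """Return (a - b) / (a + b), with NaN wherever a + b == 0."""
    denom = (a + b).astype(float)
    # Mask zero denominators so the division yields NaN rather than +/-inf
    return (a - b) / denom.where(denom != 0)

# Toy reflectance values (hypothetical, for illustration only)
bands = pd.DataFrame({"B08": [0.5, 0.0, 0.3], "B04": [0.1, 0.0, 0.3]})
bands["NDVI"] = normalized_difference(bands["B08"], bands["B04"])
print(bands)
```

With such a helper, NDBI and NDWI would be `normalized_difference(df['B11'], df['B08'])` and `normalized_difference(df['B03'], df['B08'])`, matching the band pairs used in this section.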

### 4.4 Check & Remove Duplicates

In [25]:
# Select the feature columns (everything after 'UHI Index') used to detect duplicate rows
columns_to_check = uhi_data.columns[4:]
columns_to_check

Index(['B01', 'B02', 'B03', 'B04', 'B05', 'B06', 'B07', 'B08', 'B8A', 'B11',
       'B12', 'lwir11', 'emis', 'drad', 'urad', 'atran', 'swir16', 'swir22',
       'coastal', 'vv', 'vh', 'Density', 'Building Area Covered', 'Mean Area',
       'Median Area', 'Variance Area', 'Roof Height Sum', 'Roof Height Avg',
       'Roof Height Median', 'Roof Height Var', 'NDVI', 'NDBI', 'NDWI'],
      dtype='object')

In [26]:
for col in columns_to_check:
    # Convert array-valued cells (NumPy arrays with at least one dimension) to tuples so they are hashable
    uhi_data[col] = uhi_data[col].apply(lambda x: tuple(x) if isinstance(x, np.ndarray) and x.ndim > 0 else x)

# Check for duplicates
duplicated_rows = uhi_data[uhi_data.duplicated(subset=columns_to_check)]
print(f"Number of duplicated rows: {uhi_data.duplicated(subset=columns_to_check).sum()}")
display(duplicated_rows)

# Drop duplicates only if any exist
if not duplicated_rows.empty:
    uhi_data = uhi_data.drop_duplicates(subset=columns_to_check, keep='first').reset_index(drop=True)
    print("Removing duplicated rows...\n")
    print("Updated Dataset after removing duplicates:")
    display(uhi_data)
else:
    print("Congratulations, no duplicated rows found!")

Number of duplicated rows: 0


Unnamed: 0,Longitude,Latitude,datetime,UHI Index,B01,B02,B03,B04,B05,B06,...,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var,NDVI,NDBI,NDWI


Congratulations, no duplicated rows found!
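
To make the behaviour of the duplicate check concrete, here is a small sketch on a toy DataFrame (the column names are borrowed from the dataset, the values are made up). `duplicated(subset=...)` flags every repeat after the first occurrence, and `drop_duplicates(..., keep='first')` removes exactly those rows.

```python
import pandas as pd

# Toy data: rows 0 and 1 share identical feature values but differ in location
toy = pd.DataFrame({
    "Longitude": [-73.90, -73.91, -73.92],
    "B01": [950.0, 950.0, 880.0],
    "B02": [1140.0, 1140.0, 1060.0],
})
feature_cols = ["B01", "B02"]

# True for each row whose feature values already appeared in an earlier row
dup_mask = toy.duplicated(subset=feature_cols)

# Keep the first occurrence of each feature combination
deduped = toy.drop_duplicates(subset=feature_cols, keep="first").reset_index(drop=True)

print(dup_mask.tolist())  # [False, True, False]
print(len(deduped))       # 2
```

Because only the feature columns are compared, two points at different coordinates count as duplicates when all of their feature values coincide.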


In [27]:
uhi_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11168 entries, 0 to 11167
Data columns (total 37 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Longitude              11168 non-null  float64
 1   Latitude               11168 non-null  float64
 2   datetime               11168 non-null  object 
 3   UHI Index              11168 non-null  float64
 4   B01                    11168 non-null  float64
 5   B02                    11168 non-null  float64
 6   B03                    11168 non-null  float64
 7   B04                    11168 non-null  float64
 8   B05                    11168 non-null  float64
 9   B06                    11168 non-null  float64
 10  B07                    11168 non-null  float64
 11  B08                    11168 non-null  float64
 12  B8A                    11168 non-null  float64
 13  B11                    11168 non-null  float64
 14  B12                    11168 non-null  float64
 15  lw

## 5. Final Data Validation

### 5.1 Features Selection
We remove the columns that are not suitable as predictors for model training: `Longitude`, `Latitude`, and `datetime`.

In [28]:
final_data = uhi_data.drop(columns=['Longitude', 'Latitude', 'datetime'])
print(f"Rows: {final_data.shape[0]}, Columns: {final_data.shape[1]}")
final_data.head()

Rows: 11168, Columns: 34


Unnamed: 0,UHI Index,B01,B02,B03,B04,B05,B06,B07,B08,B8A,...,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var,NDVI,NDBI,NDWI
0,1.030289,952.568307,1143.988376,1741.95147,1882.093033,1129.856803,1804.947448,1000.253359,1349.151588,1904.062503,...,3.302316e-08,1.855104e-08,1.95878e-15,4622.322123,42.799279,36.125,980.284349,-0.164934,0.182012,0.127074
1,1.030289,953.030288,1144.601424,1742.376938,1882.433379,1130.356545,1805.633125,1000.713106,1349.779036,1904.413154,...,3.47251e-08,1.880488e-08,2.044352e-15,4613.51958,43.52377,36.6,979.373418,-0.164796,0.181871,0.126966
2,1.023798,953.246997,1144.774241,1742.405201,1882.364107,1130.502264,1805.756397,1000.854321,1349.934798,1904.393419,...,3.508403e-08,1.883252e-08,2.077283e-15,4517.96958,43.442015,36.6,1004.844554,-0.164722,0.181806,0.126917
3,1.023798,953.419715,1144.978706,1742.310635,1882.146704,1130.67799,1805.856581,1001.028434,1350.140561,1904.184853,...,3.466292e-08,1.883848e-08,2.055836e-15,4369.35958,43.260986,36.52,1029.721643,-0.164591,0.181678,0.126815
4,1.021634,953.42371,1144.959963,1742.102912,1881.881197,1130.666825,1805.723987,1001.01632,1350.107137,1903.908753,...,3.477614e-08,1.894471e-08,2.104002e-15,4135.59958,43.079162,36.125,1070.326838,-0.164535,0.18162,0.126769


### 5.2 Data Overview

In [29]:
final_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11168 entries, 0 to 11167
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   UHI Index              11168 non-null  float64
 1   B01                    11168 non-null  float64
 2   B02                    11168 non-null  float64
 3   B03                    11168 non-null  float64
 4   B04                    11168 non-null  float64
 5   B05                    11168 non-null  float64
 6   B06                    11168 non-null  float64
 7   B07                    11168 non-null  float64
 8   B08                    11168 non-null  float64
 9   B8A                    11168 non-null  float64
 10  B11                    11168 non-null  float64
 11  B12                    11168 non-null  float64
 12  lwir11                 11168 non-null  float64
 13  emis                   11168 non-null  float64
 14  drad                   11168 non-null  float64
 15  ur

### 5.3 Missing Values Checking

In [30]:
final_data.isna().sum()[final_data.isna().sum() > 0]

Mean Area             144
Median Area           144
Variance Area         298
Roof Height Avg       144
Roof Height Median    144
Roof Height Var       298
dtype: int64
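
The counts above follow a pattern that is consistent with how pandas aggregations behave. One plausible reading (an assumption, since the per-row building counts are not listed here) is that the 144 rows with a missing mean and median had no buildings inside their box, while the additional rows with a missing variance had exactly one building, for which the sample variance is undefined. The sketch below illustrates the pandas behaviour on toy Series.

```python
import pandas as pd

no_buildings = pd.Series([], dtype=float)  # box containing no buildings
one_building = pd.Series([36.1])           # box containing a single building

print(no_buildings.sum())   # 0.0: the sum of an empty Series is zero, so "Sum" columns are never missing
print(no_buildings.mean())  # nan: the mean of an empty Series is undefined
print(one_building.mean())  # 36.1
print(one_building.var())   # nan: the sample variance (ddof=1) needs at least two values
```

Under that reading, filling these gaps with `0` encodes "no buildings nearby" for the mean and median columns and "no spread" for the variance columns.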

### 5.4 Multicollinearity Checking
Identify which independent features are highly correlated with each other.
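
The same kind of check can be written compactly by masking the upper triangle of the correlation matrix. The sketch here is an alternative formulation, not the notebook's own code; the helper name `high_corr_pairs` and the toy data are assumptions. It uses absolute correlations, so strongly negative pairs are reported as well.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.9) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is >= threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # NaN entries (masked cells) never satisfy the comparison, so they drop out here
    return pairs[pairs >= threshold].sort_values(ascending=False)

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],      # perfectly correlated with a
    "c": [2.0, 5.0, 1.0, 4.0, 3.0],       # weakly correlated with a
    "d": [-1.0, -2.0, -3.0, -4.0, -5.0],  # perfectly anti-correlated with a
})
result = high_corr_pairs(toy)
print(result)
```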

In [31]:
# Compute the correlation matrix
corr_matrix = final_data.fillna(0).corr()

# List to store pairs of highly correlated features
high_corr_pairs = []

# Loop through the upper triangle of the correlation matrix
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)):
        col1 = corr_matrix.columns[i]
        col2 = corr_matrix.columns[j]
        corr_value = corr_matrix.iloc[i, j]
        if corr_value >= 0.9:
            high_corr_pairs.append((col1, col2, corr_value))

# Print the highly correlated pairs
print("Highly positively correlated feature pairs (>90%):")
for pair in high_corr_pairs:
    print(f"{pair[0]} and {pair[1]} with correlation: {pair[2]:.2f}")

Highly positively correlated feature pairs (>90%):
B01 and B02 with correlation: 1.00
B01 and B05 with correlation: 1.00
B01 and B06 with correlation: 0.94
B01 and B07 with correlation: 1.00
B01 and B08 with correlation: 0.99
B01 and B12 with correlation: 0.98
B02 and B05 with correlation: 1.00
B02 and B06 with correlation: 0.94
B02 and B07 with correlation: 1.00
B02 and B08 with correlation: 0.99
B02 and B12 with correlation: 0.98
B03 and B04 with correlation: 0.98
B03 and B06 with correlation: 0.95
B03 and B8A with correlation: 0.99
B03 and B11 with correlation: 0.98
B04 and B8A with correlation: 1.00
B04 and B11 with correlation: 1.00
B05 and B06 with correlation: 0.95
B05 and B07 with correlation: 1.00
B05 and B08 with correlation: 0.99
B05 and B12 with correlation: 0.98
B06 and B07 with correlation: 0.92
B06 and B08 with correlation: 0.98
B06 and B12 with correlation: 0.99
B07 and B08 with correlation: 0.98
B07 and B12 with correlation: 0.97
B08 and B12 with correlation: 1.00
B8A 

## 6. Model Training

### 6.1 PyCaret AutoML
In Section 5, we found **missing values** and many **highly correlated features** in our dataset. Both issues can **degrade the machine learning model's performance** if the dataset is used for training without proper preprocessing.
- **Missing values:** Can lead to **biased** or **incomplete learning**, and many estimators cannot be trained on them at all without imputation.
- **Highly correlated features:** Introduce **multicollinearity**, which inflates the variance of coefficient estimates in linear models and, for any model, adds redundant inputs that make feature importance harder to interpret.

In [32]:
from pycaret.regression import RegressionExperiment
s = RegressionExperiment()

Therefore, to address the issues of **missing values** and **multicollinearity**, we decided to implement the following strategies:
1. **Missing values:**
    - *Replace them with `0` to indicate the absence of building data (area and roof height statistics), which suggests no surrounding structures and may influence the UHI Index*
2. **Multicollinearity:**
    - *Remove features with a correlation **greater than or equal** to **0.9***

In [33]:
# init setup
s.setup(
    data=final_data, 
    target='UHI Index', 
    session_id=123, 
    train_size=0.8, 
    imputation_type='simple', 
    numeric_imputation=0,          # Fill missing values with 0
    remove_multicollinearity=True, # Remove features with correlation above 0.9
    fold=10,                       # 10-fold cross-validation
    fold_shuffle=True
)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,UHI Index
2,Target type,Regression
3,Original data shape,"(11168, 34)"
4,Transformed data shape,"(11168, 18)"
5,Transformed train set shape,"(8934, 18)"
6,Transformed test set shape,"(2234, 18)"
7,Numeric features,33
8,Rows with missing values,2.7%
9,Preprocess,True


<pycaret.regression.oop.RegressionExperiment at 0x165aec02010>

#### 6.1.1 Model Training with **PyCaret AutoML**
Using this strategy will allow us to quickly identify the **best-performing machine learning model** for our dataset. Additionally, it enables us to **view feature importance** values, revealing which features contribute the most during model training.

In [34]:
# model training and selection
best = s.compare_models()

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,0.0019,0.0,0.0027,0.9715,0.0014,0.0019,0.336
rf,Random Forest Regressor,0.0023,0.0,0.0032,0.9619,0.0016,0.0023,1.215
dt,Decision Tree Regressor,0.0025,0.0,0.0043,0.9278,0.0022,0.0024,0.033
lightgbm,Light Gradient Boosting Machine,0.0037,0.0,0.0048,0.9117,0.0024,0.0037,0.117
gbr,Gradient Boosting Regressor,0.0065,0.0001,0.0082,0.7436,0.0041,0.0065,0.538
ada,AdaBoost Regressor,0.0085,0.0001,0.0103,0.5967,0.0052,0.0085,0.181
knn,K Neighbors Regressor,0.008,0.0001,0.011,0.5429,0.0055,0.008,0.035
lr,Linear Regression,0.0106,0.0002,0.0131,0.3492,0.0066,0.0106,0.651
br,Bayesian Ridge,0.0107,0.0002,0.0132,0.3406,0.0066,0.0107,0.019
ridge,Ridge Regression,0.011,0.0002,0.0136,0.2989,0.0068,0.011,0.016


In [35]:
# Print best model parameters
best.get_params()

{'bootstrap': False,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': -1,
 'oob_score': False,
 'random_state': 123,
 'verbose': 0,
 'warm_start': False}

In [None]:
# Evaluate trained model (click the Feature Importance tab towards the right to see the full feature importance)
s.evaluate_model(best)


In [37]:
# Predict on hold-out/test set
pred_holdout = s.predict_model(best)

Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE
0,Extra Trees Regressor,0.0018,0.0,0.0026,0.9744,0.0013,0.0018


In [38]:
# Retrieve the features used for model training
best.feature_names_in_

array(['B04', 'lwir11', 'emis', 'drad', 'atran', 'coastal', 'vv', 'vh',
       'Density', 'Building Area Covered', 'Median Area', 'Variance Area',
       'Roof Height Sum', 'Roof Height Median', 'Roof Height Var', 'NDVI',
       'NDWI'], dtype=object)

### 6.2 Scikit-Learn

#### 6.2.1 Feature Selection with Recursive Feature Elimination Cross-Validation (RFECV)
RFECV is a feature selection technique that recursively removes the least important features, refits the model on the remaining ones, and uses cross-validation to choose how many features to keep.

How RFE Works:
1. Train an initial model using all available features
2. Rank features based on importance (coefficients for linear models, feature importance for tree-based models)
3. Eliminate the least important feature(s)
4. Retrain the model with the remaining features
5. Repeat steps 2-4 until reaching the desired number of features
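
The loop in steps 1 to 5 can be sketched by hand. The example here is a simplified illustration of plain RFE (without cross-validation) that uses the magnitude of least-squares coefficients as the importance measure. It is not scikit-learn's implementation, and the function name `manual_rfe` and the synthetic data are assumptions. RFECV adds an outer cross-validation loop that scores each candidate number of features and keeps the best one.

```python
import numpy as np

def manual_rfe(X: np.ndarray, y: np.ndarray, n_keep: int) -> list:
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Steps 1 and 4: fit a linear model on the features still in play
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Steps 2 and 3: rank by absolute coefficient and remove the weakest feature
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

# Synthetic data: only columns 0 and 3 influence the target
rng = np.random.default_rng(123)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected = manual_rfe(X, y, n_keep=2)
print(selected)  # [0, 3]
```

Ranking by coefficient magnitude only makes sense when the features are on comparable scales, as they are in this synthetic example; tree-based estimators such as Extra Trees rank by feature importance instead.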

In [39]:
# Define a cross-validation strategy
cv = KFold(n_splits=10, shuffle=True, random_state=123)

# -----------------------------------
# Extra Trees Model Feature Selection
# -----------------------------------
et_estimator = ExtraTreesRegressor(random_state=123)
et_selector = RFECV(estimator=et_estimator, step=1, cv=cv, scoring='r2')
et_selector.fit(final_data[best.feature_names_in_.tolist()].fillna(0), final_data['UHI Index'])

et_optimal_features = final_data[best.feature_names_in_.tolist()].fillna(0).columns[et_selector.support_]
et_best_score = et_selector.grid_scores_.max() if hasattr(et_selector, "grid_scores_") else max(et_selector.cv_results_['mean_test_score'])
print("Number of optimal features (ET):", et_selector.n_features_)
print("Selected features (ET):", list(et_optimal_features))
print("Best R2 Score (ET):", et_best_score)

Number of optimal features (ET): 12
Selected features (ET): ['B04', 'lwir11', 'emis', 'drad', 'atran', 'vv', 'vh', 'Density', 'Building Area Covered', 'Roof Height Median', 'NDVI', 'NDWI']
Best R2 Score (ET): 0.9756166348745327


In [42]:
# Define a cross-validation strategy
cv = KFold(n_splits=10, shuffle=True, random_state=123)

# -----------------------------------
# Extra Trees Model Feature Selection
# -----------------------------------
et_estimator = ExtraTreesRegressor(random_state=123)
et_selector = RFECV(estimator=et_estimator, step=1, cv=cv, scoring='r2')
et_selector.fit(final_data[best.feature_names_in_.tolist()].fillna(0), final_data['UHI Index'])

et_optimal_features = final_data[best.feature_names_in_.tolist()].fillna(0).columns[et_selector.support_]
et_best_score = et_selector.grid_scores_.max() if hasattr(et_selector, "grid_scores_") else max(et_selector.cv_results_['mean_test_score'])
print("Number of optimal features (ET):", et_selector.n_features_)
print("Selected features (ET):", list(et_optimal_features))
print("Best R2 Score (ET):", et_best_score)

Number of optimal features (ET): 16
Selected features (ET): ['B04', 'lwir11', 'emis', 'drad', 'atran', 'coastal', 'vv', 'vh', 'Density', 'Building Area Covered', 'Median Area', 'Roof Height Sum', 'Roof Height Median', 'Roof Height Var', 'NDVI', 'NDWI']
Best R2 Score (ET): 0.9755595499543157


In [46]:
# Separate features and target variable
X = final_data[list(et_optimal_features)].fillna(0)
y = final_data['UHI Index']

# Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Define the model
etr_model = ExtraTreesRegressor(n_estimators=1500, random_state=123, n_jobs=-1)

# Train the model
etr_model.fit(X_train, y_train)

In [47]:
# Model Evaluation
print("Evaluation on Training Data:")
print("---------------------------------------------------")
print("Model R2 Score (ET):", r2_score(y_train, etr_model.predict(X_train)))

print("\nEvaluation on Testing Data:")
print("---------------------------------------------------")
print("Model R2 Score (ET):", r2_score(y_test, etr_model.predict(X_test)))

Evaluation on Training Data:
---------------------------------------------------
Model R2 Score (ET): 0.9999883605414133

Evaluation on Testing Data:
---------------------------------------------------
Model R2 Score (ET): 0.9750894063035737
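
For reference, the R2 values printed above measure the fraction of target variance explained by the model: R2 = 1 - SS_res / SS_tot. A minimal sketch of that formula with toy numbers follows; the helper name `r2_manual` is an assumption, and scikit-learn's `r2_score` computes the same quantity for a single target.

```python
import numpy as np

def r2_manual(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Toy UHI-like values (hypothetical)
y_true = np.array([0.96, 0.98, 1.00, 1.02, 1.04])

perfect = r2_manual(y_true, y_true)                                # exact predictions
baseline = r2_manual(y_true, np.full_like(y_true, y_true.mean()))  # always predict the mean

print(perfect, baseline)
```

A training score very close to 1 is typical for fully grown Extra Trees without bootstrapping, so the score on held-out data is the more informative of the two.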


In [48]:
# Save the model
import joblib
joblib.dump(etr_model, "../model/extra_trees_regressor_model.pkl")

['../model/extra_trees_regressor_model.pkl']

## 7. Submission

### 7.1 Load Submission Template Data

In [53]:
# Reading the coordinates for the submission
test_file_df = pd.read_csv('../submissions/Submission_template.csv')
test_file_df

Unnamed: 0,Longitude,Latitude,UHI Index
0,-73.971665,40.788763,
1,-73.971928,40.788875,
2,-73.967080,40.789080,
3,-73.972550,40.789082,
4,-73.969697,40.787953,
...,...,...,...
1035,-73.919388,40.813803,
1036,-73.931033,40.833178,
1037,-73.934647,40.854542,
1038,-73.917223,40.815413,


### 7.2 Load Building Footprint Data
An additional dataset was sourced from [NYC Open Data](https://data.cityofnewyork.us/City-Government/Building-Footprints/5zhs-2jue/about_data). This dataset was selected because we hypothesized that the `HEIGHTROOF` attribute (indicative of building height) could contribute towards our model performance. The dataset filename is **`Building_Footprints_20250319.csv`**.

In [54]:
from shapely.geometry import box
from shapely import wkt

In [56]:
def map_building_elevation_data(tiff_path, filtered_elevation_df, df, box_size_deg):
    
    latitudes = df['Latitude'].values
    longitudes = df['Longitude'].values
 
    data = rxr.open_rasterio(tiff_path)
    tiff_crs = data.rio.crs
    proj_wgs84 = Proj(init='epsg:4326')  # EPSG:4326 is the common lat/long CRS
    proj_tiff = Proj(tiff_crs)
   
    # Create a transformer object
    transformer = Transformer.from_proj(proj_wgs84, proj_tiff)
    
    Building_Density_values = []
    Building_Area_Covered_values = []
    Building_mean_area_values = []
    Building_median_area_values = []
    Building_variance_area_values = []
    Buidling_Roof_Height_Sum_values = []
    Buidling_Roof_Height_Avg_values = []
    Buidling_Roof_Height_Median_values = []
    Buidling_Roof_Height_Var_values = []

    # Iterate over the points and compute building statistics within a box around each one
    for lat, lon in tqdm(zip(latitudes, longitudes), total=len(latitudes), desc="Mapping values"):
 
        # Transform the latitude and longitude to the GeoTIFF's CRS
        x, y = transformer.transform(lat, lon)

        # Define a square bounding box with side length box_size_deg centered on the point
        x_min, x_max = x - box_size_deg / 2, x + box_size_deg / 2
        y_min, y_max = y - box_size_deg / 2, y + box_size_deg / 2

        bbox = box(x_min, y_min, x_max, y_max)
        # Keep only the buildings fully contained in the box; copy so a column can be added safely
        clipped_gdf = filtered_elevation_df[filtered_elevation_df['geometry'].apply(lambda geom: bbox.contains(geom))].copy()

        num_buildings = len(clipped_gdf)
        
        clipped_gdf['area'] = clipped_gdf['geometry'].apply(lambda geom: geom.area)
        area_covered = clipped_gdf['area'].sum()
        mean_area = clipped_gdf['area'].mean()
        median_area = clipped_gdf['area'].median()
        variance_area = clipped_gdf['area'].var()

        roof_sum = clipped_gdf['HEIGHTROOF'].sum()
        roof_mean = clipped_gdf['HEIGHTROOF'].mean()
        roof_median = clipped_gdf['HEIGHTROOF'].median()
        roof_var = clipped_gdf['HEIGHTROOF'].var()

        Building_Density_values.append(num_buildings)
        Building_Area_Covered_values.append(area_covered)
        Buidling_Roof_Height_Sum_values.append(roof_sum)
        Buidling_Roof_Height_Avg_values.append(roof_mean)
        Building_mean_area_values.append(mean_area)
        Building_median_area_values.append(median_area)
        Building_variance_area_values.append(variance_area)
        Buidling_Roof_Height_Median_values.append(roof_median)
        Buidling_Roof_Height_Var_values.append(roof_var)
    
    # Create a DataFrame to store the building features
    df = pd.DataFrame()
    df['Density'] = Building_Density_values
    df['Building Area Covered'] = Building_Area_Covered_values
    df['Mean Area'] = Building_mean_area_values
    df['Median Area'] = Building_median_area_values
    df['Variance Area'] = Building_variance_area_values
    df['Roof Height Sum'] = Buidling_Roof_Height_Sum_values
    df['Roof Height Avg'] = Buidling_Roof_Height_Avg_values
    df['Roof Height Median'] = Buidling_Roof_Height_Median_values
    df['Roof Height Var'] = Buidling_Roof_Height_Var_values

    return df

In [57]:
# Load the building footprint data from the CSV file and parse the WKT geometries
elevation_df = pd.read_csv("../data/Building_Footprints_20250319.csv")
elevation_df['geometry'] = elevation_df['the_geom'].apply(wkt.loads)
elevation_df.head()
 
# Define the coordinates for the bounding box
lower_left = (40.75, -74.01)
upper_right = (40.88, -73.86)

# Create a Shapely box (polygon) using the bounding box coordinates
bounding_box = box(lower_left[1], lower_left[0], upper_right[1], upper_right[0])
 
# Filter rows where the geometries are contained within the bounding box
filtered_df = elevation_df[elevation_df['geometry'].apply(lambda x: bounding_box.contains(x))]
 
# Print the number of buildings inside the bounding box
print(len(filtered_df))

109420


### 7.3 Map Satellite Data

In [58]:
# Mapping Sentinel-2 data for submission
val_sentinel_2_data = map_sentinel_2_data('../data/S2_median_2025-03-15_v1.tiff', test_file_df, 9040/111320.0) # 111139.0

# Mapping Landsat data for submission
val_landsat_data = map_landsat_data('../data/Landsat_median_2025-03-19_v1.tiff', test_file_df, 0.001)

# Mapping Sentinel 1 data for submission
val_sentinel_1_data = map_sentinel_1_data('../data/S1_median_2025-03-15_v1.tiff', test_file_df, 5800/111320.0) # 111139.0

Mapping values: 100%|██████████| 1040/1040 [00:33<00:00, 30.70it/s]
Mapping values: 100%|██████████| 1040/1040 [00:02<00:00, 375.21it/s]
Mapping values: 100%|██████████| 1040/1040 [00:03<00:00, 287.83it/s]
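
The buffer arguments such as `9040/111320.0` appear to convert a distance in meters into degrees, using roughly 111,320 m per degree. This reading is an assumption, since the mapping functions are not shown in this section. A small sketch of that conversion follows; the helper names are hypothetical. Note that a degree of longitude is shorter than a degree of latitude away from the equator, by a factor of cos(latitude).

```python
import math

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree at the equator

def meters_to_degrees_lat(meters: float) -> float:
    """Convert a north-south distance in meters to degrees of latitude."""
    return meters / METERS_PER_DEGREE

def meters_to_degrees_lon(meters: float, latitude_deg: float) -> float:
    """Convert an east-west distance in meters to degrees of longitude at a given latitude."""
    return meters / (METERS_PER_DEGREE * math.cos(math.radians(latitude_deg)))

lat_deg = meters_to_degrees_lat(9040)        # same value as 9040/111320.0
lon_deg = meters_to_degrees_lon(9040, 40.8)  # larger, because longitude degrees are shorter near New York
box_m = 0.004 * METERS_PER_DEGREE            # north-south extent of a 0.004-degree box, in meters

print(lat_deg, lon_deg, box_m)
```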


In [59]:
# Mapping Building Footprint data for submission
val_building_elevation_data = map_building_elevation_data('../data/S2_median_2025-03-15_v1.tiff', filtered_df, test_file_df, 0.004)

Mapping values: 100%|██████████| 1040/1040 [11:02<00:00,  1.57it/s]


Computing `val_building_elevation_data` can take a long time (about 11 minutes in the run above) owing to the large size of `filtered_df`. Hence, after running this mapping once, we saved the result as a CSV file so it can be reloaded quickly when required.
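
One way to make this save-and-reload step automatic is a small caching helper that only runs the slow computation when the CSV file does not exist yet. This is a sketch under assumptions: `load_or_compute` and `expensive_mapping` are hypothetical names, and the toy DataFrame stands in for the real mapping result.

```python
from pathlib import Path
import tempfile

import pandas as pd

def load_or_compute(cache_path, compute_fn) -> pd.DataFrame:
    """Load a cached CSV if it exists; otherwise compute, save, and return the result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return pd.read_csv(cache_path)
    result = compute_fn()
    result.to_csv(cache_path, index=False)
    return result

calls = []  # records how many times the slow function actually runs

def expensive_mapping() -> pd.DataFrame:
    # Stand-in for a slow call such as map_building_elevation_data(...)
    calls.append(1)
    return pd.DataFrame({"Density": [214, 205], "Roof Height Median": [55.895, 50.48]})

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "val_building_features.csv"
    first = load_or_compute(cache_file, expensive_mapping)   # computes and writes the cache
    second = load_or_compute(cache_file, expensive_mapping)  # reads the cache instead

print(len(calls))  # 1
```

The cache file name should encode the parameters that affect the result (here the box size, as in `..._0.004.csv`), so a stale file is not reused after a parameter change.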

In [60]:
# Save the building footprint data as a csv file
val_building_elevation_data.to_csv("../data/val_building_elevation_data_0.004.csv", index=False)

In [61]:
# # Load the building footprint data (when you didn't run the "Mapping Building Footprint data for submission" cell)
# val_building_elevation_data = pd.read_csv("../data/val_building_elevation_data_0.004.csv")
# val_building_elevation_data.head()

In [62]:
# Combine the submission coordinates with the satellite and building features into a single dataset.
val_data = pd.concat([test_file_df, val_sentinel_2_data, val_landsat_data, val_sentinel_1_data, val_building_elevation_data], axis=1)
val_data

Unnamed: 0,Longitude,Latitude,UHI Index,B01,B02,B03,B04,B05,B06,B07,...,vh,Density,Building Area Covered,Mean Area,Median Area,Variance Area,Roof Height Sum,Roof Height Avg,Roof Height Median,Roof Height Var
0,-73.971665,40.788763,,838.666942,1015.107741,1591.583381,1737.205111,1016.271483,1581.883152,891.948992,...,0.568413,214,0.000005,2.481178e-08,1.187726e-08,1.331756e-15,12656.003462,59.140203,55.895000,885.525719
1,-73.971928,40.788875,,838.056638,1014.175836,1591.108294,1736.892228,1015.517123,1580.700400,891.251965,...,0.565218,205,0.000006,2.700714e-08,1.166441e-08,1.483107e-15,12403.695235,60.505830,50.480000,1225.549645
2,-73.967080,40.789080,,847.023855,1025.091513,1601.905192,1746.772520,1024.400834,1599.707510,899.095812,...,0.567900,117,0.000004,3.299255e-08,1.202216e-08,2.699617e-15,8203.772085,70.117710,58.996033,2047.672247
3,-73.972550,40.789082,,836.825712,1012.240380,1589.969090,1736.056275,1013.988018,1578.073451,889.843283,...,0.558536,182,0.000005,2.754555e-08,1.112336e-08,1.345706e-15,11378.956027,62.521736,47.560000,1597.029516
4,-73.969697,40.787953,,843.161759,1020.658594,1591.672675,1735.265007,1020.543564,1586.937352,896.169853,...,0.586979,255,0.000005,2.084626e-08,1.298559e-08,7.674012e-16,16750.100101,65.686667,59.530000,906.581157
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035,-73.919388,40.813803,,946.359631,1133.260694,1712.598032,1847.304914,1118.790370,1778.569538,990.473252,...,0.175039,225,0.000005,2.205283e-08,1.000466e-08,1.243554e-15,7752.923228,34.457437,30.290000,337.840157
1036,-73.931033,40.833178,,886.234319,1056.406433,1694.098408,1856.210292,1056.020054,1696.346258,924.976323,...,0.183024,114,0.000003,2.663084e-08,1.325982e-08,8.769004e-16,4669.089816,40.956928,35.931063,368.428645
1037,-73.934647,40.854542,,826.698267,982.174596,1664.640910,1847.819441,995.521861,1612.003537,864.033822,...,0.110375,60,0.000005,7.577360e-08,7.859022e-08,2.280503e-15,3043.288884,50.721481,62.656648,582.293902
1038,-73.917223,40.815413,,948.225483,1136.377671,1721.820201,1857.920908,1121.746391,1787.607709,993.098894,...,0.169633,135,0.000005,4.061618e-08,2.212027e-08,2.572704e-15,5869.508445,43.477840,35.000000,717.110719


In [63]:
# Print features with missing values
cols_with_ms_val = [col for col in val_data.columns if col != 'UHI Index']
val_data[cols_with_ms_val].isnull().sum()[val_data[cols_with_ms_val].isnull().sum() > 0]

Mean Area             12
Median Area           12
Variance Area         29
Roof Height Avg       12
Roof Height Median    12
Roof Height Var       29
dtype: int64

In [64]:
# Replace missing values with 0
val_data[cols_with_ms_val] = val_data[cols_with_ms_val].fillna(0)
val_data[cols_with_ms_val].isnull().sum()[val_data[cols_with_ms_val].isnull().sum() > 0]

Series([], dtype: int64)

In [65]:
# Convert array-valued cells to tuples so the feature columns hold hashable values
columns_to_check = val_data.columns[3:]

for col in columns_to_check:
    # Convert any NumPy array with at least one dimension to a tuple; leave scalars unchanged
    val_data[col] = val_data[col].apply(lambda x: tuple(x) if isinstance(x, np.ndarray) and x.ndim > 0 else x)

In [66]:
val_data['NDVI'] = (val_data['B08'] - val_data['B04']) / (val_data['B08'] + val_data['B04'])
val_data['NDVI'] = val_data['NDVI'].replace([np.inf, -np.inf], np.nan)

val_data['NDBI'] = (val_data['B11'] - val_data['B08']) / (val_data['B11'] + val_data['B08'])
val_data['NDBI'] = val_data['NDBI'].replace([np.inf, -np.inf], np.nan)

val_data['NDWI'] = (val_data['B03'] - val_data['B08']) / (val_data['B03'] + val_data['B08'])
val_data['NDWI'] = val_data['NDWI'].replace([np.inf, -np.inf], np.nan)

In [67]:
# Repeat the array-to-tuple conversion now that the NDVI, NDBI, and NDWI columns have been added
columns_to_check = val_data.columns[3:]

for col in columns_to_check:
    # Convert any NumPy array with at least one dimension to a tuple; leave scalars unchanged
    val_data[col] = val_data[col].apply(lambda x: tuple(x) if isinstance(x, np.ndarray) and x.ndim > 0 else x)

In [68]:
# Select the optimal features 
val_data = val_data[et_optimal_features.tolist()+['UHI Index']]
val_data

Unnamed: 0,B04,lwir11,emis,drad,atran,vv,vh,Density,Building Area Covered,Roof Height Median,NDVI,NDWI,UHI Index
0,1737.205111,40.042851,0.955700,1.474000,0.619500,1.262525,0.568413,214,0.000005,55.895000,-0.183246,0.140628,
1,1736.892228,39.844891,0.959175,1.474000,0.619475,1.257067,0.565218,205,0.000006,50.480000,-0.183536,0.140863,
2,1746.772520,37.682423,0.977375,1.474333,0.619433,1.150323,0.567900,117,0.000004,58.996033,-0.181306,0.139136,
3,1736.056275,39.905560,0.960333,1.474000,0.619492,1.236686,0.558536,182,0.000005,47.560000,-0.184097,0.141317,
4,1735.265007,38.040817,0.968975,1.474000,0.619550,1.204202,0.586979,255,0.000005,59.530000,-0.180731,0.138652,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035,1847.304914,43.824605,0.962017,1.483750,0.616650,0.812626,0.175039,225,0.000005,30.290000,-0.162009,0.124935,
1036,1856.210292,39.996992,0.965512,1.489000,0.615075,0.717932,0.183024,114,0.000003,35.931063,-0.192859,0.148505,
1037,1847.819441,40.921994,0.963950,1.495000,0.613287,0.479162,0.110375,60,0.000005,62.656648,-0.219633,0.169422,
1038,1857.920908,43.642596,0.966300,1.484000,0.616467,0.835503,0.169633,135,0.000005,35.000000,-0.162951,0.125710,


In [69]:
val_data.columns

Index(['B04', 'lwir11', 'emis', 'drad', 'atran', 'vv', 'vh', 'Density',
       'Building Area Covered', 'Roof Height Median', 'NDVI', 'NDWI',
       'UHI Index'],
      dtype='object')
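
scikit-learn estimators fitted on a DataFrame record the training column names (as seen in `feature_names_in_` earlier), and recent versions check them at prediction time, so it is safest for the submission features to carry the same columns in the same order as the training features. The sketch below shows one way to enforce that; `align_features` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd

def align_features(df: pd.DataFrame, expected: list) -> pd.DataFrame:
    """Return df restricted to the expected feature columns, in the expected order."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise KeyError(f"Missing feature columns: {missing}")
    return df[expected]

# Toy example: the columns arrive in a different order and include the empty target
expected_features = ["B04", "NDVI", "Density"]
submission_features = pd.DataFrame({
    "Density": [214, 205],
    "NDVI": [-0.183, -0.184],
    "B04": [1737.2, 1736.9],
    "UHI Index": [float("nan"), float("nan")],
})

X_sub = align_features(submission_features, expected_features)
print(list(X_sub.columns))  # ['B04', 'NDVI', 'Density']
```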

In [70]:
submission_val_data = val_data.copy()
submission_val_data

Unnamed: 0,B04,lwir11,emis,drad,atran,vv,vh,Density,Building Area Covered,Roof Height Median,NDVI,NDWI,UHI Index
0,1737.205111,40.042851,0.955700,1.474000,0.619500,1.262525,0.568413,214,0.000005,55.895000,-0.183246,0.140628,
1,1736.892228,39.844891,0.959175,1.474000,0.619475,1.257067,0.565218,205,0.000006,50.480000,-0.183536,0.140863,
2,1746.772520,37.682423,0.977375,1.474333,0.619433,1.150323,0.567900,117,0.000004,58.996033,-0.181306,0.139136,
3,1736.056275,39.905560,0.960333,1.474000,0.619492,1.236686,0.558536,182,0.000005,47.560000,-0.184097,0.141317,
4,1735.265007,38.040817,0.968975,1.474000,0.619550,1.204202,0.586979,255,0.000005,59.530000,-0.180731,0.138652,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035,1847.304914,43.824605,0.962017,1.483750,0.616650,0.812626,0.175039,225,0.000005,30.290000,-0.162009,0.124935,
1036,1856.210292,39.996992,0.965512,1.489000,0.615075,0.717932,0.183024,114,0.000003,35.931063,-0.192859,0.148505,
1037,1847.819441,40.921994,0.963950,1.495000,0.613287,0.479162,0.110375,60,0.000005,62.656648,-0.219633,0.169422,
1038,1857.920908,43.642596,0.966300,1.484000,0.616467,0.835503,0.169633,135,0.000005,35.000000,-0.162951,0.125710,


In [71]:
# Make predictions on the submission data (scikit-learn Extra Trees model)
predictions = etr_model.predict(submission_val_data.drop('UHI Index', axis=1))
submission_val_data['UHI Index'] = predictions
submission_val_data

Unnamed: 0,B04,lwir11,emis,drad,atran,vv,vh,Density,Building Area Covered,Roof Height Median,NDVI,NDWI,UHI Index
0,1737.205111,40.042851,0.955700,1.474000,0.619500,1.262525,0.568413,214,0.000005,55.895000,-0.183246,0.140628,0.964262
1,1736.892228,39.844891,0.959175,1.474000,0.619475,1.257067,0.565218,205,0.000006,50.480000,-0.183536,0.140863,0.963343
2,1746.772520,37.682423,0.977375,1.474333,0.619433,1.150323,0.567900,117,0.000004,58.996033,-0.181306,0.139136,0.963228
3,1736.056275,39.905560,0.960333,1.474000,0.619492,1.236686,0.558536,182,0.000005,47.560000,-0.184097,0.141317,0.961985
4,1735.265007,38.040817,0.968975,1.474000,0.619550,1.204202,0.586979,255,0.000005,59.530000,-0.180731,0.138652,0.959371
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1035,1847.304914,43.824605,0.962017,1.483750,0.616650,0.812626,0.175039,225,0.000005,30.290000,-0.162009,0.124935,1.038882
1036,1856.210292,39.996992,0.965512,1.489000,0.615075,0.717932,0.183024,114,0.000003,35.931063,-0.192859,0.148505,1.042203
1037,1847.819441,40.921994,0.963950,1.495000,0.613287,0.479162,0.110375,60,0.000005,62.656648,-0.219633,0.169422,1.041188
1038,1857.920908,43.642596,0.966300,1.484000,0.616467,0.835503,0.169633,135,0.000005,35.000000,-0.162951,0.125710,1.035870


In [72]:
submission_df = pd.DataFrame({
    'Longitude':test_file_df['Longitude'].values, 
    'Latitude':test_file_df['Latitude'].values, 
    'UHI Index':predictions}
)
submission_df

Unnamed: 0,Longitude,Latitude,UHI Index
0,-73.971665,40.788763,0.964262
1,-73.971928,40.788875,0.963343
2,-73.967080,40.789080,0.963228
3,-73.972550,40.789082,0.961985
4,-73.969697,40.787953,0.959371
...,...,...,...
1035,-73.919388,40.813803,1.038882
1036,-73.931033,40.833178,1.042203
1037,-73.934647,40.854542,1.041188
1038,-73.917223,40.815413,1.035870


In [73]:
# Dumping the predictions into a csv file.
submission_df.to_csv("../submissions/submission_for_validation.csv", index=False)
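
Before uploading, a few sanity checks on the submission file can catch common mistakes. The sketch below is an assumption about what is worth checking (column names, row count, no missing predictions, coordinates matching the template) and not an official validation procedure; `validate_submission` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def validate_submission(template: pd.DataFrame, submission: pd.DataFrame) -> None:
    """Raise ValueError if the submission does not line up with the template."""
    if list(submission.columns) != ["Longitude", "Latitude", "UHI Index"]:
        raise ValueError("Unexpected columns in submission")
    if len(submission) != len(template):
        raise ValueError("Row count differs from the template")
    if submission["UHI Index"].isna().any():
        raise ValueError("Submission contains missing predictions")
    for col in ("Longitude", "Latitude"):
        if not np.allclose(submission[col], template[col]):
            raise ValueError(f"{col} values differ from the template")

# Toy template and submission (values are illustrative)
template = pd.DataFrame({
    "Longitude": [-73.971665, -73.971928],
    "Latitude": [40.788763, 40.788875],
    "UHI Index": [np.nan, np.nan],
})
good = template.assign(**{"UHI Index": [0.964262, 0.963343]})
validate_submission(template, good)  # passes silently
```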