# 2025 EY Open Science AI & Data Challenge - Team Her In Venture

## 1. Sentinel-2 Satellite Data Extraction
We extracted Sentinel-2 satellite data to GeoTIFF file that is suitable for further analysis and can also be used to generate spectral index products using mathematical combinations of bands, such as NDVI. The baseline data is [Sentinel-2 Level-2A](https://planetarycomputer.microsoft.com/dataset/sentinel-2-l2a) data from the MS Planetary Computer catalog.

### 1.1 Load Dependencies

In [1]:
# Supress Warnings 
import warnings
warnings.filterwarnings('ignore')

# Import common GIS tools
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import rioxarray as rio
import rasterio
from matplotlib.cm import RdYlGn,jet,RdBu

# Import Planetary Computer tools
import stackstac
import pystac_client
import planetary_computer 
from odc.stac import stac_load

### 1.2 Discover and load the data for analysis

First, we defined our area of interest using latitude and longitude coordinates. 

In [2]:
# Define the bounding box for the entire data region using (Latitude, Longitude)
# This is the region of New York City that contains our temperature dataset
lower_left = (40.75, -74.01)
upper_right = (40.88, -73.86)

# Calculate the bounds for doing an archive data search
# bounds = (min_lon, min_lat, max_lon, max_lat)
bounds = (lower_left[1], lower_left[0], upper_right[1], upper_right[0])

# Define the time window
time_window = "2021-06-01/2021-09-01"

Using the `pystac_client`, we searched the Planetary Computer's STAC endpoint for items matching our query parameters. We used a period of 3 months as a representative dataset for the region. The query searched for "low cloud" scenes with overall cloud cover <30%. The result is the number of scenes matching our search criteria that touch our area of interest. Some of these may be partial scenes or contain clouds.

In [7]:
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bounds, 
    datetime=time_window,
    collections=["sentinel-2-l2a"],
    query={"eo:cloud_cover": {"lt": 30}},
)

items = list(search.get_items())
print('This is the number of scenes that touch our region:',len(items))

This is the number of scenes that touch our region: 10


In [8]:
signed_items = [planetary_computer.sign(item).to_dict() for item in items]

# Define the pixel resolution for the final product
# Define the scale according to our selected crs, so we will use degrees
resolution = 10  # meters per pixel 
scale = resolution / 111320.0 # degrees per pixel for crs=4326 

Next, we loaded the data into an [xarray](https://xarray.pydata.org/en/stable/) DataArray using [stackstac](https://stackstac.readthedocs.io/). We only kept the commonly used spectral bands (Red, Green, Blue, NIR, SWIR). There are also several other <b>important settings for the data</b>: We have changed the projection to epsg=4326 which is standard latitude-longitude in degrees. We have specified the spatial resolution of each pixel to be 10-meters. 

#### Sentinel-2 Bands Summary 
The following list of common bands can be loaded by the Open Data Cube (ODC) stac command.<br><br>
B01 = Coastal Aerosol = 60m <br>
B02 = Blue = 10m <br>
B03 = Green = 10m <br>
B04 = Red = 10m <br>
B05 = Red Edge (704 nm) = 20m <br>
B06 = Red Edge (740 nm) = 20m <br>
B07 = Red Edge (780 nm) = 20m <br>
B08 = NIR (833 nm) = 10m <br>
B8A = NIR (narrow 864 nm) = 20m <br>
B11 = SWIR (1.6 um) = 20m <br>
B12 = SWIR (2.2 um) = 20m

In [10]:
data = stac_load(
    items,
    bands=["B01", "B02", "B03", "B04", "B05", "B06", "B07", "B08", "B8A", "B11", "B12"],
    crs="EPSG:4326", # Latitude-Longitude
    resolution=scale, # Degrees
    chunks={"x": 2048, "y": 2048},
    dtype="uint16",
    patch_url=planetary_computer.sign,
    bbox=bounds
)

# View the dimensions of our XARRAY and the loaded variables
# This insures we have the right coordinates and spectral bands in our xarray
display(data)

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 46.15 MiB 4.62 MiB Shape (10, 1448, 1671) (1, 1448, 1671) Dask graph 10 chunks in 3 graph layers Data type uint16 numpy.ndarray",1671  1448  10,

Unnamed: 0,Array,Chunk
Bytes,46.15 MiB,4.62 MiB
Shape,"(10, 1448, 1671)","(1, 1448, 1671)"
Dask graph,10 chunks in 3 graph layers,10 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray


### 1.3 Save the output data in a GeoTIFF file
We selected a single date (July 24, 2021) to create a GeoTIFF output product. This date is the same as the ground temperature data collection date. Though this image contains some clouds, it will be used as the baseline for the benchmark notebook. Participants in the data challenge may desire to use other single scenes with less cloud cover or create a median mosaic that statistically filters the clouds over a time series stack of data (see the median dataset above).
<br><br>The output product below only contains 7 selected bands that will be used in our model building.

In [12]:
filename = "S2.tiff"

# We will pick a single time slice from the time series (time=7) 
# This time slice is the date of July 24, 2021
data_slice = data.isel(time=7)

# Calculate the dimensions of the file
# height = median.dims["latitude"]
# width = median.dims["longitude"]
height = data_slice.dims["latitude"]
width = data_slice.dims["longitude"]

# Define the Coordinate Reference System (CRS) to be common Lat-Lon coordinates
# Define the tranformation using our bounding box so the Lat-Lon information is written to the GeoTIFF
gt = rasterio.transform.from_bounds(lower_left[1],lower_left[0],upper_right[1],upper_right[0],width,height)
data_slice.rio.write_crs("epsg:4326", inplace=True)
data_slice.rio.write_transform(transform=gt, inplace=True);

# Create the GeoTIFF output file using the defined parameters 
with rasterio.open(filename,'w',driver='GTiff',width=width,height=height,
                   crs='epsg:4326',transform=gt,count=7,compress='lzw',dtype='float64') as dst:
    dst.write(data_slice.B01,1)
    dst.write(data_slice.B04,2)
    dst.write(data_slice.B06,3) 
    dst.write(data_slice.B08,4)
    dst.write(data_slice.B02,5)
    dst.write(data_slice.B03,6)
    dst.write(data_slice.B11,7)
    dst.close()

## 2. Landsat Land Surface Temperature (LST) Data Extraction

We extracted Landsat Land Surface Temperature (LST) data to GeoTIFF file product. The baseline data is [Landsat Collection-2 Level-2](https://www.usgs.gov/landsat-missions/landsat-collection-2) data from the MS Planetary Computer catalog.

### 2.1 Discover and load the data for analysis

We used our area of interest previously defined when processing Sentinel-2 Satellite Data

In [3]:
stac = pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

search = stac.search(
    bbox=bounds, 
    datetime=time_window,
    collections=["landsat-c2-l2"],
    query={"eo:cloud_cover": {"lt": 50},"platform": {"in": ["landsat-8"]}},
)

items = list(search.get_items())
print('This is the number of scenes that touch our region:',len(items))

This is the number of scenes that touch our region: 8


In [4]:
signed_items = [planetary_computer.sign(item).to_dict() for item in items]

# Define the pixel resolution for the final product
# Define the scale according to our selected crs, so we will use degrees
resolution = 30  # meters per pixel 
scale = resolution / 111320.0 # degrees per pixel for crs=4326 

Next, we loaded the data into an [xarray](https://xarray.pydata.org/en/stable/) DataArray using [stackstac](https://stackstac.readthedocs.io/). We only kept the commonly used spectral bands (Red, Green, Blue, NIR, Surface Temperature). There are also several other <b>important settings for the data</b>: We have changed the projection to epsg=4326 which is standard latitude-longitude in degrees. We have specified the spatial resolution of each pixel to be 30-meters. 

#### Landsat Band Summary 
The following list of bands will be loaded by the Open Data Cube (ODC) stac command:<br>
We will use two load commands to separate the RGB data from the Surface Temperature data.<br><br>
Band 2 = blue = 30m<br>
Band 3 = green = 30m<br>
Band 4 = red = 30m<br>
Band 5 = nir08 (near infrared) = 30m<br>
Band 11 = Surface Temperature = lwir11 = 100m

In [5]:
data1 = stac_load(
    items,
    bands=["red", "green", "blue", "nir08"],
    crs="EPSG:4326", # Latitude-Longitude
    resolution=scale, # Degrees
    chunks={"x": 2048, "y": 2048},
    dtype="uint16",
    patch_url=planetary_computer.sign,
    bbox=bounds
)

data2 = stac_load(
    items,
    bands=["lwir11"],
    crs="EPSG:4326", # Latitude-Longitude
    resolution=scale, # Degrees
    chunks={"x": 2048, "y": 2048},
    dtype="uint16",
    patch_url=planetary_computer.sign,
    bbox=bounds
)

# View the dimensions of our XARRAY and the loaded variables
# This insures we have the right coordinates and spectral bands in our xarray
display(data1)
display(data2)

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 4.12 MiB 527.48 kiB Shape (8, 484, 558) (1, 484, 558) Dask graph 8 chunks in 3 graph layers Data type uint16 numpy.ndarray",558  484  8,

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 4.12 MiB 527.48 kiB Shape (8, 484, 558) (1, 484, 558) Dask graph 8 chunks in 3 graph layers Data type uint16 numpy.ndarray",558  484  8,

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 4.12 MiB 527.48 kiB Shape (8, 484, 558) (1, 484, 558) Dask graph 8 chunks in 3 graph layers Data type uint16 numpy.ndarray",558  484  8,

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 4.12 MiB 527.48 kiB Shape (8, 484, 558) (1, 484, 558) Dask graph 8 chunks in 3 graph layers Data type uint16 numpy.ndarray",558  484  8,

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray


Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray
"Array Chunk Bytes 4.12 MiB 527.48 kiB Shape (8, 484, 558) (1, 484, 558) Dask graph 8 chunks in 3 graph layers Data type uint16 numpy.ndarray",558  484  8,

Unnamed: 0,Array,Chunk
Bytes,4.12 MiB,527.48 kiB
Shape,"(8, 484, 558)","(1, 484, 558)"
Dask graph,8 chunks in 3 graph layers,8 chunks in 3 graph layers
Data type,uint16 numpy.ndarray,uint16 numpy.ndarray


In [6]:
# Scale Factors for the RGB and NIR bands 
scale1 = 0.0000275 
offset1 = -0.2 
data1 = data1.astype(float) * scale1 + offset1

# Scale Factors for the Surface Temperature band
scale2 = 0.00341802 
offset2 = 149.0 
kelvin_celsius = 273.15 # convert from Kelvin to Celsius
data2 = data2.astype(float) * scale2 + offset2 - kelvin_celsius

### 2.2 Save the output data in a GeoTIFF file

In [7]:
# Pick one of the scenes above (numbering starts with 0)
scene = 2

filename = "Landsat_LST.tiff"

# Only select one of the time slices to output
data3 = data2.isel(time=scene)

# Calculate the dimensions of the file
height = data3.dims["latitude"]
width = data3.dims["longitude"]

# Define the Coordinate Reference System (CRS) to be common Lat-Lon coordinates
# Define the tranformation using our bounding box so the Lat-Lon information is written to the GeoTIFF
gt = rasterio.transform.from_bounds(lower_left[1],lower_left[0],upper_right[1],upper_right[0],width,height)
data3.rio.write_crs("epsg:4326", inplace=True)
data3.rio.write_transform(transform=gt, inplace=True);

# Create the GeoTIFF output file using the defined parameters 
with rasterio.open(filename,'w',driver='GTiff',width=width,height=height,
                   crs='epsg:4326',transform=gt,count=1,compress='lzw',dtype='float64') as dst:
    dst.write(data3.lwir11,1)
    dst.close()

In [8]:
# Calculate NDVI for the median mosaic
ndvi_data = (data1.isel(time=scene).nir08 - data1.isel(time=scene).red) / \
            (data1.isel(time=scene).nir08 + data1.isel(time=scene).red)

filename = "Landsat_NDVI.tiff"

# Use .sizes to get the dimension sizes
height = ndvi_data.sizes["latitude"]
width = ndvi_data.sizes["longitude"]

# Define the Coordinate Reference System (CRS) to be common Lat-Lon coordinates
# Define the transformation using our bounding box so the Lat-Lon information is written to the GeoTIFF
gt = rasterio.transform.from_bounds(
    lower_left[1], lower_left[0],
    upper_right[1], upper_right[0],
    width, height
)

# Write CRS and transform information to the xarray DataArray
ndvi_data.rio.write_crs("epsg:4326", inplace=True)
ndvi_data.rio.write_transform(transform=gt, inplace=True)

# Create the GeoTIFF output file using the defined parameters 
with rasterio.open(
    filename, 'w', driver='GTiff', width=width, height=height,
    crs='epsg:4326', transform=gt, count=1, compress='lzw', dtype='float64'
) as dst:
    dst.write(ndvi_data.values, 1)


In [9]:
# Show the location and size of the new output file
!ls *.tiff

Landsat_LST.tiff  Landsat_NDVI.tiff S2.tiff           S2_sample.tiff


## 3. Consolidating Training Data w/ S2 Data

### 3.1 Load In Dependencies

In [1]:
# Supress Warnings
import warnings
warnings.filterwarnings('ignore')


# Data Science
import numpy as np
import pandas as pd

# Multi-dimensional arrays and datasets
import xarray as xr

# Geospatial raster data handling
import rioxarray as rxr

# Geospatial operations
import rasterio
from rasterio import windows  
from rasterio import features  
from rasterio import warp
from rasterio.warp import transform_bounds 
from rasterio.windows import from_bounds 

# Image Processing
from PIL import Image

# Coordinate transformations
from pyproj import Proj, Transformer, CRS

# Feature Engineering
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Machine Learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Planetary Computer Tools
import pystac_client
import planetary_computer as pc
from pystac.extensions.eo import EOExtension as eo

# Others
import os
from tqdm import tqdm

### 3.2 Response Variable
Before building the model, we need to load in the Urban Heat Island (UHI) index training dataset. We have curated data for the New York region. The dataset consists of geo-locations (Longitude and Latitude), with additional fields including date & time of data collection and the UHI index for each location. 

In [23]:
# Load the training data from csv file and display the first few rows to inspect the data
ground_df = pd.read_csv("Training_data_uhi_index.csv")
ground_df.head()

Unnamed: 0,Longitude,Latitude,datetime,UHI Index
0,-73.909167,40.813107,24-07-2021 15:53,1.030289
1,-73.909187,40.813045,24-07-2021 15:53,1.030289
2,-73.909215,40.812978,24-07-2021 15:53,1.023798
3,-73.909242,40.812908,24-07-2021 15:53,1.023798
4,-73.909257,40.812845,24-07-2021 15:53,1.021634


### 3.3 Predictor Variables
We gathered the predictor variables from the Sentinel-2 dataset. Sentinel-2 optical data provides high-resolution imagery that is sensitive to land surface characteristics, which are crucial for understanding urban heat dynamics. Band values such as B01 (Coastal aerosol), B06 (Red Edge), and NDVI (Normalized Difference Vegetation Index) derived from B04 (Red) and B08 (Near Infrared) could help us in estimating the UHI index. Hence, we are choosing B01, B06, and NDVI as predictor variables for this experiment.</p>

<ul> 
<li>B01 - Reflectance values from the Coastal aerosol band, which help in assessing aerosol presence and improving atmospheric correction.</li>

<li>B06 - Reflectance values from the Red Edge band, which provide useful information for detecting vegetation, water bodies, and urban surfaces.</li>

<li>NDVI - Derived from B04 (Red) and B08 (Near Infrared), NDVI is an important indicator for vegetation health and land cover.</li>
</ul>


### 3.3.1 Extracting S2 data from GeoTIFF file and integrate with training dataset
Wen used the GeoTIFF file (S2.tiff) that we previously prepared to extract the band values for the geo-locations given in the training dataset to create the features.

In [36]:
import rasterio
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Define file paths
geotiff_path = "S2.tiff"
uhi_data_path = "Training_data_uhi_index.csv"
uhi_updated_path = "UHI_updated_S2_indices1.csv"

# Read the GeoTIFF
with rasterio.open(geotiff_path) as src:
    if src.count < 7:
        raise ValueError(f"The GeoTIFF file must have at least 7 bands, but it has {src.count}.")
    
    B01 = src.read(1)
    B04 = src.read(2)
    B06 = src.read(3)
    B08 = src.read(4)
    B02 = src.read(5)
    B03 = src.read(6)
    B11 = src.read(7)
    
    # Calculate Spectral Indices
    NDVI = np.divide((B08 - B04), (B08 + B04), out=np.zeros_like(B04, dtype=float), where=(B08 + B04) != 0)
    EVI = np.divide(2.5 * (B08 - B04), (B08 + 6 * B04 - 7.5 * B02 + 1), out=np.zeros_like(B04, dtype=float), where=(B08 + 6 * B04 - 7.5 * B02 + 1) != 0)
    GNDVI = np.divide((B08 - B03), (B08 + B03), out=np.zeros_like(B03, dtype=float), where=(B08 + B03) != 0)
    SAVI = np.divide((B08 - B04) * (1.5), (B08 + B04 + 0.5), out=np.zeros_like(B04, dtype=float), where=(B08 + B04 + 0.5) != 0)
    NDBI = np.divide((B11 - B08), (B11 + B08), out=np.zeros_like(B08, dtype=float), where=(B11 + B08) != 0)
    MNDWI = np.divide((B03 - B11), (B03 + B11), out=np.zeros_like(B03, dtype=float), where=(B03 + B11) != 0)
    NDWI = np.divide((B03 - B08), (B03 + B08), out=np.zeros_like(B03, dtype=float), where=(B03 + B08) != 0)
    LSWI = np.divide((B08 - B11), (B08 + B11), out=np.zeros_like(B08, dtype=float), where=(B08 + B11) != 0)
    BI = np.sqrt(B11**2 + B04**2)
    NBAI = np.divide((B11 - B08), (B11 + B08), out=np.zeros_like(B08, dtype=float), where=(B11 + B08) != 0)

    
    # Albedo Estimation (Approximation using Red, NIR, and SWIR bands)
    Albedo = (B02 * 0.3 + B03 * 0.3 + B04 * 0.1 + B08 * 0.2 + B11 * 0.1) / 5.0
    
    # Index-Based Built-Up Index (IBI)
    IBI = np.divide(NDBI - (NDVI + MNDWI), NDBI + (NDVI + MNDWI), out=np.zeros_like(NDBI, dtype=float), where=(NDBI + (NDVI + MNDWI)) != 0)

    # Get metadata for georeferencing
    transform = src.transform

# Flatten arrays and associate them with spatial coordinates
rows, cols = np.where(~np.isnan(NDVI))  # Exclude NaN values
lon, lat = rasterio.transform.xy(transform, rows, cols, offset='center')

# Create DataFrame with spectral indices
indices_df = pd.DataFrame({
    "Longitude": lon,
    "Latitude": lat,
    "NDVI": NDVI[rows, cols],
    "EVI": EVI[rows, cols],
    "GNDVI": GNDVI[rows, cols],
    "SAVI": SAVI[rows, cols],
    "NDBI": NDBI[rows, cols],
    "MNDWI": MNDWI[rows, cols],
    "NDWI": NDWI[rows, cols],
    "LSWI": LSWI[rows, cols],
    "BI": BI[rows, cols],
    "Albedo": Albedo[rows, cols],
    "IBI": IBI[rows, cols],
    "NBAI": NBAI[rows, cols]
})

# Load UHI dataset
uhi_df = pd.read_csv(uhi_data_path)

# Merge indices with UHI dataset based on spatial proximity
indices_tree = cKDTree(indices_df[["Longitude", "Latitude"]].values)
distances, indices = indices_tree.query(uhi_df[["Longitude", "Latitude"]].values)

# Remove Longitude and Latitude from indices_df before merging to avoid conflicts
indices_df_cleaned = indices_df.drop(columns=["Longitude", "Latitude"])

# Concatenate UHI dataset with spectral indices
uhi_df = pd.concat([uhi_df, indices_df_cleaned.iloc[indices].reset_index(drop=True)], axis=1)

# Save updated UHI dataset
uhi_df.to_csv(uhi_updated_path, index=False)
print(f"Updated UHI dataset saved to {uhi_updated_path}")


Updated UHI dataset saved to UHI_updated_S2_indices1.csv


## 4. Consolidating Training Data w/ Landsat Data

In [37]:
import pandas as pd
import rasterio
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import r2_score

# File paths
uhi_updated_path = "UHI_updated_S2_indices1.csv"
lst_tiff_path = "Landsat_LST.tiff"
uhi_with_lst_path = "UHI_updated_S2_indices_LST1.csv"

# Load the UHI dataset with NDVI
uhi_df = pd.read_csv(uhi_updated_path)

# Extract LST from the GeoTIFF
with rasterio.open(lst_tiff_path) as src:
    lst = src.read(1)
    transform = src.transform
    rows, cols = lst.shape
    lon, lat = rasterio.transform.xy(transform, *np.meshgrid(range(cols), range(rows)))
    lon = lon.flatten()
    lat = lat.flatten()
    lst_flat = lst.flatten()

    # Combine into a DataFrame
    lst_df = pd.DataFrame({
        "Longitude": lon,
        "Latitude": lat,
        "LST": lst_flat
    }).dropna()

# Merge LST with UHI dataset using spatial proximity
lst_tree = cKDTree(lst_df[["Longitude", "Latitude"]].values)
distances, indices = lst_tree.query(uhi_df[["Longitude", "Latitude"]].values)
uhi_df["LST"] = lst_df.iloc[indices]["LST"].values

# Save the updated UHI dataset with NDVI and LST
uhi_df.to_csv(uhi_with_lst_path, index=False)
print(f"Updated UHI dataset with S2 and LST saved to {uhi_with_lst_path}")

Updated UHI dataset with S2 and LST saved to UHI_updated_S2_indices_LST1.csv


## 5. Consolidating Training Data w/ Building Data

In [38]:
import xml.etree.ElementTree as ET
import pandas as pd
from shapely.geometry import Point, Polygon
from scipy.spatial import cKDTree
import numpy as np

# Load the UHI dataset
uhi_path = "UHI_updated_S2_indices_LST1.csv"
uhi_df = pd.read_csv(uhi_path)

# Convert UHI points to Shapely Point objects
uhi_df["geometry"] = uhi_df.apply(lambda row: Point(row["Longitude"], row["Latitude"]), axis=1)

# Load and Parse Building Footprint KML
building_kml_path = "Building_Footprint.kml"
tree = ET.parse(building_kml_path)
root = tree.getroot()

# Namespace for KML
ns = {"kml": "http://www.opengis.net/kml/2.2"}

# Extract building footprint polygons and metadata
building_data = []

for placemark in root.findall(".//kml:Placemark", ns):
    polygon = placemark.find(".//kml:Polygon", ns)
    if polygon is not None:
        coordinates = polygon.find(".//kml:coordinates", ns).text.strip()
        coord_list = [tuple(map(float, coord.split(",")[:2])) for coord in coordinates.split()]
        if len(coord_list) > 2:  # Valid polygon
            poly = Polygon(coord_list)
            centroid = poly.centroid  # Get centroid for nearest neighbor search
            building_data.append({"geometry": centroid, "area": poly.area, "perimeter": poly.length})

# Convert building data to DataFrame
buildings_df = pd.DataFrame(building_data)

# Create KDTree for nearest building search
building_coords = np.array([(geom.x, geom.y) for geom in buildings_df["geometry"]])
building_tree = cKDTree(building_coords)

# Find the nearest building for each UHI point
uhi_coords = np.array([(geom.x, geom.y) for geom in uhi_df["geometry"]])
distances, indices = building_tree.query(uhi_coords)

# Assign building features to UHI dataset
uhi_df["nearest_building_area"] = buildings_df.iloc[indices]["area"].values
uhi_df["nearest_building_perimeter"] = buildings_df.iloc[indices]["perimeter"].values

# Compute building density using KDTree (efficiently count buildings within 100m radius)
buffer_radius = 0.05  # ~5000 meters in degrees
counts = building_tree.query_ball_point(uhi_coords, buffer_radius)
uhi_df["building_density"] = [len(c) for c in counts]

# Save updated dataset
output_path = "UHI_updated_S2_indices_LST_building_features_final.csv"
uhi_df.drop(columns=["geometry"]).to_csv(output_path, index=False)
print(f"Updated UHI dataset saved to {output_path}")

Updated UHI dataset saved to UHI_updated_S2_indices_LST_building_features_final.csv


In [6]:
import xml.etree.ElementTree as ET
import pandas as pd
from shapely.geometry import Point, Polygon
from scipy.spatial import cKDTree
import numpy as np

# Load the UHI dataset
uhi_path = "Merged_UHI_HHI_HVI_GreenRoof_SVI_UHII_Data.csv"
uhi_df = pd.read_csv(uhi_path)

# Convert UHI points to Shapely Point objects
uhi_df["geometry"] = uhi_df.apply(lambda row: Point(row["longitude"], row["latitude"]), axis=1)

# Load and Parse Building Footprint KML
building_kml_path = "Building_Footprint.kml"
tree = ET.parse(building_kml_path)
root = tree.getroot()

# Namespace for KML
ns = {"kml": "http://www.opengis.net/kml/2.2"}

# Extract building footprint polygons and metadata
building_data = []

for placemark in root.findall(".//kml:Placemark", ns):
    polygon = placemark.find(".//kml:Polygon", ns)
    if polygon is not None:
        coordinates = polygon.find(".//kml:coordinates", ns).text.strip()
        coord_list = [tuple(map(float, coord.split(",")[:2])) for coord in coordinates.split()]
        if len(coord_list) > 2:  # Valid polygon
            poly = Polygon(coord_list)
            centroid = poly.centroid  # Get centroid for nearest neighbor search
            building_data.append({"geometry": centroid, "area": poly.area, "perimeter": poly.length})

# Convert building data to DataFrame
buildings_df = pd.DataFrame(building_data)

# Create KDTree for nearest building search
building_coords = np.array([(geom.x, geom.y) for geom in buildings_df["geometry"]])
building_tree = cKDTree(building_coords)

# Find the nearest building for each UHI point
uhi_coords = np.array([(geom.x, geom.y) for geom in uhi_df["geometry"]])
distances, indices = building_tree.query(uhi_coords)

# Assign building features to UHI dataset
uhi_df["nearest_building_area"] = buildings_df.iloc[indices]["area"].values
uhi_df["nearest_building_perimeter"] = buildings_df.iloc[indices]["perimeter"].values

# Compute building density using KDTree (efficiently count buildings within 100m radius)
buffer_radius = 0.005  # ~5000 meters in degrees
counts = building_tree.query_ball_point(uhi_coords, buffer_radius)
uhi_df["building_density"] = [len(c) for c in counts]

# Save updated dataset
output_path = "building1.csv"
uhi_df.drop(columns=["geometry"]).to_csv(output_path, index=False)
print(f"Updated UHI dataset saved to {output_path}")

  uhi_df = pd.read_csv(uhi_path)


Updated UHI dataset saved to building1.csv


## 6. Consolidating Training Data w/ Weather Data

In [39]:
# Import necessary libraries
import pandas as pd
import numpy as np
from dateutil import parser
from scipy.spatial import cKDTree

# -------------------------
# Load the Datasets
# -------------------------
weather_data_path = "new weather data.csv"  # Path to weather dataset
uhi_data_path = "UHI_updated_S2_indices_LST_building_features_final.csv"  # Path to UHI dataset

# Load the datasets
weather_df = pd.read_csv(weather_data_path)
uhi_df = pd.read_csv(uhi_data_path)

# -------------------------
# Preprocess Weather Dataset
# -------------------------
# Drop unnecessary columns
weather_df.drop(columns=["station"], inplace=True, errors="ignore")

# Rename columns to match standard format
weather_df.rename(columns={
    "latitude [degrees_north]": "Latitude",
    "longitude [degrees_east]": "Longitude",
    "time": "datetime"
}, inplace=True)

# -------------------------
# Handle Datetime Parsing Issues
# -------------------------
def parse_datetime_column(df, column_name):
    """Parses datetime column, removes timezone information, and handles mixed formats."""
    df[column_name] = df[column_name].astype(str)  # Convert to string if necessary

    # Attempt parsing with multiple formats
    def safe_parse_datetime(value):
        try:
            dt = parser.parse(value)  # Let dateutil handle flexible parsing
            return dt.replace(tzinfo=None)  # Remove timezone
        except Exception:
            return pd.NaT  # Return NaT for unparseable values

    df[column_name] = df[column_name].apply(safe_parse_datetime)  # Apply parsing
    df.dropna(subset=[column_name], inplace=True)  # Remove invalid rows
    df[column_name] = df[column_name].dt.strftime("%d-%m-%Y %H:%M")  # Convert to uniform format
    return df

# Apply datetime parsing
weather_df = parse_datetime_column(weather_df, "datetime")
uhi_df = parse_datetime_column(uhi_df, "datetime")

# -------------------------
# Find Nearest Datetime Matches
# -------------------------
weather_df["datetime"] = pd.to_datetime(weather_df["datetime"], format="%d-%m-%Y %H:%M")
uhi_df["datetime"] = pd.to_datetime(uhi_df["datetime"], format="%d-%m-%Y %H:%M")

def find_nearest_time(row, weather_times):
    """Finds the nearest datetime in the weather dataset for a given UHI datetime."""
    return weather_times[np.abs(weather_times - row).argmin()]

# Convert weather datetime column to numpy array for fast searching
weather_times = weather_df["datetime"].values

# Apply the function to find nearest timestamps
uhi_df["nearest_datetime"] = uhi_df["datetime"].apply(lambda x: find_nearest_time(x.to_numpy(), weather_times))

# -------------------------
# Find Nearest Latitude and Longitude Matches
# -------------------------
# Build KDTree for fast nearest-neighbor lookup on Latitude & Longitude
weather_tree = cKDTree(weather_df[["Latitude", "Longitude"]].values)

# Find the nearest Latitude & Longitude match for each UHI record
distances, indices = weather_tree.query(uhi_df[["Latitude", "Longitude"]].values, k=1)

# Debugging: Ensure indices are assigned correctly
if len(indices) != len(uhi_df):
    raise ValueError("Nearest neighbor search failed. Mismatched indices.")

# Assign nearest Latitude and Longitude from the weather dataset
uhi_df["nearest_Latitude"] = weather_df.iloc[indices]["Latitude"].values
uhi_df["nearest_Longitude"] = weather_df.iloc[indices]["Longitude"].values

# -------------------------
# Debugging Check Before Merge
# -------------------------
print("Columns in UHI DataFrame before merge:")
print(uhi_df.columns)

print("Columns in Weather DataFrame before merge:")
print(weather_df.columns)

# -------------------------
# Ensure Consistent Column Names Before Merging
# -------------------------
# Rename datetime in weather_df
weather_df.rename(columns={"datetime": "nearest_datetime"}, inplace=True)

# Also rename Latitude and Longitude in weather_df to match the merging columns
weather_df.rename(columns={"Latitude": "nearest_Latitude", "Longitude": "nearest_Longitude"}, inplace=True)

# -------------------------
# Merge Datasets on Nearest Matches
# -------------------------
merged_df = pd.merge(
    uhi_df.drop(columns=["datetime"]), 
    weather_df, 
    on=["nearest_Latitude", "nearest_Longitude", "nearest_datetime"], 
    how="inner"
)

# -------------------------
# Handle Missing Values
# -------------------------
# Drop rows where Latitude, Longitude, nearest_Latitude, or nearest_Longitude are missing
cleaned_df = merged_df.dropna(subset=["nearest_Latitude", "nearest_Longitude"])

# Fill missing weather-related values with column mean
weather_columns = ["temp_2m [degF]", "relative_humidity [percent]", "solar_insolation [W/m^2]"]
cleaned_df[weather_columns] = cleaned_df[weather_columns].fillna(cleaned_df[weather_columns].mean())

# Fill missing UHI-related values with column median
uhi_columns = ["UHI Index", "NDVI", "EVI", "GNDVI", "SAVI", "NDBI", "MNDWI", "NDWI", "LSWI", "BI", "Albedo", "IBI", "LST", 
               "nearest_building_area", "nearest_building_perimeter", "building_density"]

cleaned_df[uhi_columns] = cleaned_df[uhi_columns].fillna(cleaned_df[uhi_columns].median())

# -------------------------
# Rename Columns for Consistency
# -------------------------
cleaned_df.rename(columns={"nearest_Latitude": "Latitude", "nearest_Longitude": "Longitude"}, inplace=True)
cleaned_df.drop(columns=["Latitude_x", "Longitude_x"], errors="ignore", inplace=True)

# -------------------------
# Save the Cleaned Dataset
# -------------------------
cleaned_file_path = "final_merged_weather_uhi_cleaned2.csv"
cleaned_df.to_csv(cleaned_file_path, index=False)

print(f"Cleaned dataset saved successfully: {cleaned_file_path}")


Columns in UHI DataFrame before merge:
Index(['Longitude', 'Latitude', 'datetime', 'UHI Index', 'NDVI', 'EVI',
       'GNDVI', 'SAVI', 'NDBI', 'MNDWI', 'NDWI', 'LSWI', 'BI', 'Albedo', 'IBI',
       'NBAI', 'LST', 'nearest_building_area', 'nearest_building_perimeter',
       'building_density', 'nearest_datetime', 'nearest_Latitude',
       'nearest_Longitude'],
      dtype='object')
Columns in Weather DataFrame before merge:
Index(['datetime', 'Latitude', 'Longitude', 'elevation [feet]',
       'temp_2m [degF]', 'relative_humidity [percent]',
       'avg_wind_speed_merge [mile/hr]', 'max_wind_speed_merge [mile/hr]',
       'wind_speed_stddev_merge [mile/hr]', 'wind_direction_merge [degrees]',
       'wind_direction_stddev_merge [degrees]', 'solar_insolation [W/m^2]'],
      dtype='object')
Cleaned dataset saved successfully: final_merged_weather_uhi_cleaned2.csv


## 7. Consolidating w/ heatmap data
https://github.com/NewYorkCityCouncil/heat_map/tree/main

In [40]:
import rasterio
import pandas as pd
import numpy as np
from rasterio.transform import rowcol

# Paths to your files
training_data_path = "final_merged_weather_uhi_cleaned2.csv"
heatmap_files = {
    "mean_temp": "f_mean_temp.tif",
    "temp_deviation": "f_deviation.tif",
    "temp_deviation_smooth": "f_deviation_smooth.tif",
}

# Load the training dataset
training_data = pd.read_csv(training_data_path)

# Ensure the dataset has 'Longitude' and 'Latitude' instead of 'latitude' and 'longitude'
training_data.rename(columns={"Longitude": "longitude", "Latitude": "latitude"}, inplace=True)

# Optimized function for extracting raster values
def extract_raster_values_optimized(raster_path, latitudes, longitudes):
    """
    Extracts values from a raster file (GeoTIFF) based on given latitude and longitude points.

    Parameters:
        raster_path (str): Path to the raster file.
        latitudes (list): List of latitude values.
        longitudes (list): List of longitude values.

    Returns:
        np.array: Extracted raster values, with NaN for missing values.
    """
    with rasterio.open(raster_path) as src:
        transform = src.transform
        band = src.read(1)  # Read entire raster band for faster access
        nodata_value = src.nodata  # Get nodata value from metadata

        # Convert lat/lon to row/col indices
        rows, cols = zip(*[rowcol(transform, lon, lat) for lat, lon in zip(latitudes, longitudes)])

        # Extract values using numpy indexing
        values = np.full(len(latitudes), np.nan)  # Initialize with NaNs for missing data handling
        for i, (row, col) in enumerate(zip(rows, cols)):
            try:
                value = band[row, col]
                if value != nodata_value:  # Handle nodata values
                    values[i] = value
            except IndexError:
                continue  # Skip out-of-bounds points

    return values

# Extract values from each heatmap file
for feature_name, file_path in heatmap_files.items():
    print(f"Extracting {feature_name} from {file_path}...")
    training_data[feature_name] = extract_raster_values_optimized(
        file_path, training_data["latitude"], training_data["longitude"]
    )

# Save the enriched dataset
output_path = "final_merged_weather_uhi_cleaned3.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated dataset saved to: {output_path}")


Extracting mean_temp from f_mean_temp.tif...
Extracting temp_deviation from f_deviation.tif...
Extracting temp_deviation_smooth from f_deviation_smooth.tif...
Updated dataset saved to: final_merged_weather_uhi_cleaned3.csv


## 8. Intergrating with Hyperlocal data
https://data.cityofnewyork.us/dataset/Hyperlocal-Temperature-Monitoring/qdq3-9eqn/about_data

In [41]:
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree

# File paths
UHI_FILE = "final_merged_weather_uhi_cleaned3.csv"
HYPERLOCAL_FILE = "Hyperlocal_Temperature_Monitoring_20250312.csv"
OUTPUT_FILE = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"

# Load full UHI dataset (keep all columns including UHI Index)
print("Loading UHI dataset...")
uhi_data = pd.read_csv(UHI_FILE)

# Select only required columns for faster processing
use_cols = ["Latitude", "Longitude", "AirTemp", "Year"]

# Load Hyperlocal Temperature dataset in chunks
chunk_size = 100000  # Adjust based on available memory
filtered_data = []

print("Processing Hyperlocal dataset in chunks...")
for chunk in pd.read_csv(HYPERLOCAL_FILE, usecols=use_cols, chunksize=chunk_size):
    # Convert temperature from Fahrenheit to Celsius
    chunk["AirTemp_C"] = (chunk["AirTemp"] - 32) * 5/9
    
    # Filter for the year 2019 (proxy for 2021)
    chunk = chunk[chunk["Year"] == 2019]
    
    # Append filtered chunk
    filtered_data.append(chunk)

# Combine filtered chunks
hyperlocal_data = pd.concat(filtered_data, ignore_index=True)

# Reduce size for efficient processing (sample 10,000 points)
print("Sampling 10,000 points for spatial matching...")
hyperlocal_sample = hyperlocal_data.sample(n=10000, random_state=42)

# Build KDTree for fast nearest-neighbor search
print("Building KDTree for spatial lookup...")
tree = cKDTree(hyperlocal_sample[["Latitude", "Longitude"]].values)

# Find nearest temperature values for UHI dataset
print("Finding nearest temperature matches...")
uhi_coords = uhi_data[["latitude", "longitude"]].values
_, nearest_idx = tree.query(uhi_coords, k=1)

# Assign nearest temperature values
uhi_data["Nearest_AirTemp_C"] = hyperlocal_sample.iloc[nearest_idx]["AirTemp_C"].values

# Compute Temperature Anomaly (deviation from mean temperature)
mean_temp_c = uhi_data["Nearest_AirTemp_C"].mean()
uhi_data["Temp_Anomaly"] = uhi_data["Nearest_AirTemp_C"] - mean_temp_c

# Save integrated dataset (keeping all original columns + new features)
print(f"Saving integrated dataset to {OUTPUT_FILE}...")
uhi_data.to_csv(OUTPUT_FILE, index=False)

print("Processing complete! Integrated dataset saved successfully.")


Loading UHI dataset...
Processing Hyperlocal dataset in chunks...
Sampling 10,000 points for spatial matching...
Building KDTree for spatial lookup...
Finding nearest temperature matches...
Saving integrated dataset to final_merged_weather_uhi_cleaned3_hyperlocal.csv...
Processing complete! Integrated dataset saved successfully.


## 9. Integrating morning heat index geotiff raster with fahrenheit values
https://osf.io/j6eqr/?view_only=
https://github.com/OpenStoryMap/geodata/blob/main/nyc-heat-watch-2021/air-quality.geojson

### af_hi_f

In [31]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update path
training_data = pd.read_csv(training_data_path)

# Load the raster file
raster_path = "af_hi_f.tif"  # Update path
with rasterio.open(raster_path) as src:
    raster_crs = src.crs  # Get CRS of raster
    raster_transform = src.transform  # Get affine transform

    # Transformer to convert lat/lon (WGS84) to raster CRS
    transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

    # Function to get raster values at lat/lon points
    def get_raster_values(lat, lon):
        try:
            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(raster_transform, x, y)

            # Check if coordinates are within raster bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan  # Out of bounds
        except Exception:
            return np.nan

    # Apply raster value extraction for each row in the dataset
    training_data["af_hi_f_value"] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"]), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_afhi.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")

Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_afhi.csv


### am_hi_f

In [32]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update path
training_data = pd.read_csv(training_data_path)

# Load the raster file
raster_path = "am_hi_f.tif"  # Update path
with rasterio.open(raster_path) as src:
    raster_crs = src.crs  # Get CRS of raster
    raster_transform = src.transform  # Get affine transform

    # Transformer to convert lat/lon (WGS84) to raster CRS
    transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

    # Function to get raster values at lat/lon points
    def get_raster_values(lat, lon):
        try:
            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(raster_transform, x, y)

            # Check if coordinates are within raster bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan  # Out of bounds
        except Exception:
            return np.nan

    # Apply raster value extraction for each row in the dataset
    training_data["am_hi_f_value"] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"]), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_amhi.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")

Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_amhi.csv


### pm_hi_f

In [35]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update path
training_data = pd.read_csv(training_data_path)

# Load the raster file
raster_path = "pm_hi_f.tif"  # Update path
with rasterio.open(raster_path) as src:
    raster_crs = src.crs  # Get CRS of raster
    raster_transform = src.transform  # Get affine transform

    # Transformer to convert lat/lon (WGS84) to raster CRS
    transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

    # Function to get raster values at lat/lon points
    def get_raster_values(lat, lon):
        try:
            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(raster_transform, x, y)

            # Check if coordinates are within raster bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan  # Out of bounds
        except Exception:
            return np.nan

    # Apply raster value extraction for each row in the dataset
    training_data["pm_hi_f_value"] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"]), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_pmhi.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")

Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_pmhi.csv


### af_t_f

In [38]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update path
training_data = pd.read_csv(training_data_path)

# Load the raster file
raster_path = "af_t_f.tif"  # Update path
with rasterio.open(raster_path) as src:
    raster_crs = src.crs  # Get CRS of raster
    raster_transform = src.transform  # Get affine transform

    # Transformer to convert lat/lon (WGS84) to raster CRS
    transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

    # Function to get raster values at lat/lon points
    def get_raster_values(lat, lon):
        try:
            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(raster_transform, x, y)

            # Check if coordinates are within raster bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan  # Out of bounds
        except Exception:
            return np.nan

    # Apply raster value extraction for each row in the dataset
    training_data["af_t_f_value"] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"]), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_aft.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")

Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_aft.csv


### am_t_f

In [41]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update path
training_data = pd.read_csv(training_data_path)

# Load the raster file
raster_path = "am_t_f.tif"  # Update path
with rasterio.open(raster_path) as src:
    raster_crs = src.crs  # Get CRS of raster
    raster_transform = src.transform  # Get affine transform

    # Transformer to convert lat/lon (WGS84) to raster CRS
    transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

    # Function to get raster values at lat/lon points
    def get_raster_values(lat, lon):
        try:
            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(raster_transform, x, y)

            # Check if coordinates are within raster bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan  # Out of bounds
        except Exception:
            return np.nan

    # Apply raster value extraction for each row in the dataset
    training_data["am_t_f_value"] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"]), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_amt.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")

Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_amt.csv


### pm_t_f

In [44]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update path
training_data = pd.read_csv(training_data_path)

# Load the raster file
raster_path = "pm_t_f.tif"  # Update path
with rasterio.open(raster_path) as src:
    raster_crs = src.crs  # Get CRS of raster
    raster_transform = src.transform  # Get affine transform

    # Transformer to convert lat/lon (WGS84) to raster CRS
    transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

    # Function to get raster values at lat/lon points
    def get_raster_values(lat, lon):
        try:
            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(raster_transform, x, y)

            # Check if coordinates are within raster bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan  # Out of bounds
        except Exception:
            return np.nan

    # Apply raster value extraction for each row in the dataset
    training_data["pm_t_f_value"] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"]), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_pmt.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")

Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_pmt.csv


### Consolidated

In [42]:
import rasterio
import numpy as np
import pandas as pd
from pyproj import Transformer

# Define paths to all raster (TIFF) files
tif_files = {
    "af_hi_f": "af_hi_f.tif",
    "af_t_f": "af_t_f.tif",
    "am_hi_f": "am_hi_f.tif",
    "am_t_f": "am_t_f.tif",
    "pm_hi_f": "pm_hi_f.tif",
    "pm_t_f": "pm_t_f.tif",
}

# Load the training dataset
training_data_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"  # Update if necessary
training_data = pd.read_csv(training_data_path)

# Initialize storage for raster values
for key in tif_files.keys():
    training_data[key] = np.nan

# Function to extract raster values at lat/lon points
def get_raster_values(lat, lon, raster_path):
    try:
        with rasterio.open(raster_path) as src:
            raster_crs = src.crs
            transformer = Transformer.from_crs("EPSG:4326", raster_crs, always_xy=True)

            # Convert lat/lon to raster CRS
            x, y = transformer.transform(lon, lat)
            row, col = rasterio.transform.rowcol(src.transform, x, y)

            # Validate coordinates within bounds
            if 0 <= row < src.height and 0 <= col < src.width:
                return src.read(1)[row, col]  # Extract raster value
            else:
                return np.nan
    except Exception:
        return np.nan

# Iterate through TIFF files and extract raster values
for key, raster_path in tif_files.items():
    print(f"Processing: {raster_path}")
    training_data[key] = training_data.apply(
        lambda row: get_raster_values(row["latitude"], row["longitude"], raster_path), axis=1
    )

# Save the updated dataset to a new CSV file
output_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
training_data.to_csv(output_path, index=False)

print(f"Updated training data with raster features saved to: {output_path}")


Processing: af_hi_f.tif
Processing: af_t_f.tif
Processing: am_hi_f.tif
Processing: am_t_f.tif
Processing: pm_hi_f.tif
Processing: pm_t_f.tif
Updated training data with raster features saved to: final_merged_weather_uhi_cleaned3_hyperlocal_all.csv


## 10. Integrating PLUTO Data 
Extensive land use and geographic data at the tax lot level in comma–separated values
https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page

In [43]:
import pandas as pd
import numpy as np
from scipy.spatial import cKDTree

# File paths (Update these to match your local setup)
UUHI_FILE = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
PLUTO_FILE = "pluto_24v4_1.csv"
OUTPUT_FILE = "uhi_pluto.csv"

# Load the UHI dataset
uhi_data = pd.read_csv(UUHI_FILE)

# Load the PLUTO dataset
pluto_data = pd.read_csv(PLUTO_FILE, low_memory=False)

# Print column names to debug
print("UHI Dataset Columns:", uhi_data.columns)
print("PLUTO Dataset Columns:", pluto_data.columns)

# Create a new DataFrame with just latitude and longitude from PLUTO
# Based on the output, use the 'latitude' and 'longitude' columns directly
pluto_coords_df = pd.DataFrame()
pluto_coords_df['latitude'] = pluto_data['latitude']
pluto_coords_df['longitude'] = pluto_data['longitude']
pluto_coords_df = pluto_coords_df.dropna()

# Print debugging information
print("PLUTO coordinates DataFrame shape after cleaning:", pluto_coords_df.shape)
print("PLUTO coordinates sample:", pluto_coords_df.head())

# Convert to 2D numpy arrays
uhi_coords = uhi_data[['latitude', 'longitude']].values
pluto_coords = pluto_coords_df[['latitude', 'longitude']].values

# Verify shapes before KDTree
print("UHI Coordinates Shape:", uhi_coords.shape)
print("PLUTO Coordinates Shape:", pluto_coords.shape)

# Build a KDTree for fast spatial lookup
pluto_tree = cKDTree(pluto_coords)

# Find the nearest PLUTO tax lot for each UHI point
distances, indices = pluto_tree.query(uhi_coords, k=1)

# Attach the closest PLUTO lot data to the UHI dataset
uhi_data['nearest_pluto_index'] = indices
uhi_data['pluto_distance'] = distances

# Save the indices of pluto_coords_df for later merging
pluto_coords_df = pluto_coords_df.reset_index()

# Merge UHI data with corresponding PLUTO tax lot attributes
result = pd.merge(
    uhi_data,
    pluto_coords_df,
    left_on='nearest_pluto_index',
    right_on='index',
    how='left',
    suffixes=('', '_pluto')
)

# Create a temporary index in the original pluto_data for merging
pluto_data['temp_index'] = range(len(pluto_data))

# Merge with the full PLUTO dataset to get all attributes
final_result = pd.merge(
    result,
    pluto_data,
    left_on=['latitude_pluto', 'longitude_pluto'],
    right_on=['latitude', 'longitude'],
    how='left',
    suffixes=('', '_full_pluto')
)

# Clean up temporary columns
columns_to_drop = ['index', 'latitude_pluto', 'longitude_pluto']
final_result = final_result.drop(columns=columns_to_drop, errors='ignore')

# Save the output
final_result.to_csv(OUTPUT_FILE, index=False)
print(f"Spatial join completed. Output saved to: {OUTPUT_FILE}")

UHI Dataset Columns: Index(['longitude', 'latitude', 'UHI Index', 'NDVI', 'EVI', 'GNDVI', 'SAVI',
       'NDBI', 'MNDWI', 'NDWI', 'LSWI', 'BI', 'Albedo', 'IBI', 'NBAI', 'LST',
       'nearest_building_area', 'nearest_building_perimeter',
       'building_density', 'nearest_datetime', 'Latitude.1', 'Longitude.1',
       'elevation [feet]', 'temp_2m [degF]', 'relative_humidity [percent]',
       'avg_wind_speed_merge [mile/hr]', 'max_wind_speed_merge [mile/hr]',
       'wind_speed_stddev_merge [mile/hr]', 'wind_direction_merge [degrees]',
       'wind_direction_stddev_merge [degrees]', 'solar_insolation [W/m^2]',
       'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
       'Nearest_AirTemp_C', 'Temp_Anomaly', 'af_hi_f', 'af_t_f', 'am_hi_f',
       'am_t_f', 'pm_hi_f', 'pm_t_f'],
      dtype='object')
PLUTO Dataset Columns: Index(['borough', 'block', 'lot', 'cd', 'bct2020', 'bctcb2020', 'ct2010',
       'cb2010', 'schooldist', 'council', 'zipcode', 'firecomp', 'policeprct',
     

In [44]:
import pandas as pd
import numpy as np

# Load the merged dataset
file_path = "uhi_pluto.csv"
data = pd.read_csv(file_path)

# Print original shape
print(f"Original dataset shape: {data.shape}")

# Method 1: Identify completely empty columns (all NaN or empty strings)
empty_columns = []
for column in data.columns:
    # Check if column is all NaN
    if data[column].isna().all():
        empty_columns.append(column)
    # Check if column contains strings and all are empty
    elif data[column].dtype == 'object' and (data[column].fillna('') == '').all():
        empty_columns.append(column)

print(f"Found {len(empty_columns)} completely empty columns:")
print(empty_columns)

# Method 2: Identify columns with missing values above a threshold
threshold = 0.99  # 99% missing values
high_missing_columns = [column for column in data.columns 
                       if data[column].isna().mean() > threshold]

print(f"Found {len(high_missing_columns)} columns with more than {threshold*100}% missing values:")
print(high_missing_columns)

# Remove completely empty columns
data_cleaned = data.drop(columns=empty_columns)
print(f"Shape after removing completely empty columns: {data_cleaned.shape}")

# Optional: Also remove columns with high percentage of missing values
data_cleaned_strict = data.drop(columns=list(set(empty_columns + high_missing_columns)))
print(f"Shape after removing both empty and high-missing columns: {data_cleaned_strict.shape}")

# Save the cleaned dataset
output_file = "uhi_pluto_cleaned.csv"
data_cleaned.to_csv(output_file, index=False)
print(f"Cleaned dataset saved to: {output_file}")

# Optional: Save the stricter cleaned dataset 
output_file_strict = "uhi_pluto_cleaned_strict.csv"
data_cleaned_strict.to_csv(output_file_strict, index=False)
print(f"Stricter cleaned dataset saved to: {output_file_strict}")

  data = pd.read_csv(file_path)


Original dataset shape: (30749, 137)
Found 3 completely empty columns:
['zonedist4', 'spdist3', 'notes']
Found 10 columns with more than 99.0% missing values:
['zonedist3', 'zonedist4', 'overlay2', 'spdist2', 'spdist3', 'ltdheight', 'landmark', 'zmcode', 'edesignum', 'notes']
Shape after removing completely empty columns: (30749, 134)
Shape after removing both empty and high-missing columns: (30749, 127)
Cleaned dataset saved to: uhi_pluto_cleaned.csv
Stricter cleaned dataset saved to: uhi_pluto_cleaned_strict.csv


In [51]:
import pandas as pd

# Load the dataset
file_path = "uhi_pluto_cleaned_strict.csv"
df = pd.read_csv(file_path)

# Define the required columns
required_columns = [
    'longitude', 'latitude', 'UHI Index', 
    'NDVI', 'EVI', 'GNDVI', 'SAVI', 'NDBI', 'MNDWI', 'NDWI', 'LSWI', 'BI', 'Albedo', 'IBI', 'NBAI', 'LST', 
    'nearest_building_area', 'nearest_building_perimeter', 'building_density', 'nearest_datetime', 
    'Latitude.1', 'Longitude.1', 'elevation [feet]', 'temp_2m [degF]', 'relative_humidity [percent]', 
    'avg_wind_speed_merge [mile/hr]', 'max_wind_speed_merge [mile/hr]', 'wind_speed_stddev_merge [mile/hr]', 
    'wind_direction_merge [degrees]', 'wind_direction_stddev_merge [degrees]', 'solar_insolation [W/m^2]', 
    'mean_temp','temp_deviation', 'temp_deviation_smooth', 'Nearest_AirTemp_C', 'Temp_Anomaly', 
    'af_hi_f', 'af_t_f', 'am_hi_f', 'am_t_f', 'pm_hi_f', 'pm_t_f', 
    "bldgarea", "numfloors", "unitsres", "unitstotal", "bldgfront", "bldgdepth",
    "lotarea", "residfar", "commfar", "facilfar", "garagearea", "strgearea", "factryarea",
    "assessland", "yearbuilt", "yearalter1", "yearalter2", "temp_index"
]

# Keep only the required columns that exist in the dataset
df_cleaned = df[[col for col in required_columns if col in df.columns]]

# Save the cleaned dataset
cleaned_file_path = "uhi_pluto_cleaned_filtered.csv"
df_cleaned.to_csv(cleaned_file_path, index=False)


## 11. Heat & Health Index (HHI) Data
https://ephtracking.cdc.gov/Applications/heatTracker/?page=detail3

https://catalog.data.gov/dataset/modified-zip-code-tabulation-areas-modzcta

https://data.cityofnewyork.us/Health/Modified-Zip-Code-Tabulation-Areas-MODZCTA-/pri4-ifjk/about_data

In [52]:
import pandas as pd
from shapely.wkt import loads
from shapely.geometry import Point
from scipy.spatial import cKDTree
import numpy as np

# Load datasets
uhi_data_path = "uhi_pluto_cleaned_filtered.csv"
hhi_data_path = "HHI Data 2024 United States.xlsx"
zip_code_data_path = "Modified_Zip_Code_Tabulation_Areas__MODZCTA_.csv"

uhi_data = pd.read_csv(uhi_data_path)
hhi_data = pd.read_excel(hhi_data_path, sheet_name="Sheet1")
zip_code_data = pd.read_csv(zip_code_data_path)

# Convert ZIP geometries to centroids for easier spatial join
zip_code_data["centroid"] = zip_code_data["the_geom"].apply(lambda x: loads(x).centroid if pd.notnull(x) else None)

# Extract latitude and longitude for ZIP centroids
zip_code_data["latitude"] = zip_code_data["centroid"].apply(lambda x: x.y if x else None)
zip_code_data["longitude"] = zip_code_data["centroid"].apply(lambda x: x.x if x else None)

# Remove invalid ZIP centroids
zip_code_data.dropna(subset=["latitude", "longitude"], inplace=True)

# Use KDTree to find the nearest ZIP centroid for each UHI point
uhi_coords = np.array(list(zip(uhi_data["latitude"], uhi_data["longitude"])))
zip_coords = np.array(list(zip(zip_code_data["latitude"], zip_code_data["longitude"])))
zip_tree = cKDTree(zip_coords)
_, nearest_zip_idx = zip_tree.query(uhi_coords)

# Assign the nearest ZIP code to each UHI point
uhi_data["ZCTA"] = zip_code_data.iloc[nearest_zip_idx]["ZCTA"].values

# Convert ZIP code column types to string for merging
uhi_data["ZCTA"] = uhi_data["ZCTA"].astype(str)
hhi_data["ZCTA"] = hhi_data["ZCTA"].astype(str)

# Merge UHI data with HHI data
merged_data = uhi_data.merge(hhi_data, on="ZCTA", how="left")

# Save the merged dataset
merged_data.to_csv("Merged_UHI_HHI_Data.csv", index=False)

print("Merged dataset saved as Merged_UHI_HHI_Data.csv")

Merged dataset saved as Merged_UHI_HHI_Data.csv


## 12. Heat Vulnerability Index Rankings (HVI)
https://data.cityofnewyork.us/Health/Heat-Vulnerability-Index-Rankings/4mhf-duep/about_data

In [53]:
import pandas as pd

# Load datasets
uhi_hhi_data_path = "Merged_UHI_HHI_Data.csv"
hvi_data_path = "Heat_Vulnerability_Index_Rankings_20250315.csv"

uhi_hhi_data = pd.read_csv(uhi_hhi_data_path)
hvi_data = pd.read_csv(hvi_data_path)

# Rename HVI columns for consistency
hvi_data.rename(columns={"ZIP Code Tabulation Area (ZCTA) 2020": "ZCTA", 
                         "Heat Vulnerability Index (HVI)": "HVI"}, inplace=True)

# Convert ZCTA to string for consistent merging
uhi_hhi_data["ZCTA"] = uhi_hhi_data["ZCTA"].astype(str)
hvi_data["ZCTA"] = hvi_data["ZCTA"].astype(str)

# Merge HVI data into the existing dataset using ZCTA
final_merged_data = uhi_hhi_data.merge(hvi_data, on="ZCTA", how="left")

# Save the updated dataset
final_merged_data_path = "Merged_UHI_HHI_HVI_Data.csv"
final_merged_data.to_csv(final_merged_data_path, index=False)

print("Merged dataset saved as Merged_UHI_HHI_HVI_Data.csv")


  uhi_hhi_data = pd.read_csv(uhi_hhi_data_path)


Merged dataset saved as Merged_UHI_HHI_HVI_Data.csv


## 13. Greenroof Data
https://zenodo.org/records/1469674
https://github.com/tnc-ny-science/NYC_GreenRoofMapping/tree/master/greenroof_gisdata/CurrentDatasets
https://github.com/CityOfNewYork/nyc-geo-metadata/blob/main/Metadata/Metadata_BuildingFootprints.md

In [54]:
import pandas as pd
from shapely.geometry import Point
from scipy.spatial import cKDTree
import numpy as np

# File paths
merged_data_file = "Merged_UHI_HHI_HVI_Data.csv"
green_roof_file = "GreenRoofData2016_20180917.csv"

# Load merged UHI-HHI-HVI data
merged_df = pd.read_csv(merged_data_file)

# Load Green Roof data
green_roof_df = pd.read_csv(green_roof_file)

# Create Point objects for spatial operations
merged_df["geometry"] = merged_df.apply(lambda row: Point(row["longitude"], row["latitude"]), axis=1)
green_roof_df["geometry"] = green_roof_df.apply(lambda row: Point(row["xcoord"], row["ycoord"]), axis=1)

# Convert coordinates to NumPy arrays for KDTree
merged_coords = np.array([[point.x, point.y] for point in merged_df["geometry"]])
green_roof_coords = np.array([[point.x, point.y] for point in green_roof_df["geometry"]])

# Build KDTree for Green Roof Data
tree = cKDTree(green_roof_coords)

# Query the nearest green roof for each merged data point
distances, indices = tree.query(merged_coords, k=1)

# Assign the nearest green roof data to the merged dataset
merged_df["distance_to_green_roof"] = distances
nearest_green_roofs = green_roof_df.iloc[indices].reset_index()

# Select relevant columns from Green Roof data
green_roof_features = ["gr_area", "bldg_area", "prop_gr", "heightroof", "groundelev"]
merged_df = pd.concat([merged_df.reset_index(drop=True), nearest_green_roofs[green_roof_features]], axis=1)

# Save the updated dataset
output_file = "Merged_UHI_HHI_HVI_GreenRoof_Data.csv"
merged_df.to_csv(output_file, index=False)

print(f"Updated dataset saved as: {output_file}")


  merged_df = pd.read_csv(merged_data_file)


Updated dataset saved as: Merged_UHI_HHI_HVI_GreenRoof_Data.csv


## 14. Model Building
Our model has been constructed to predict the Urban Heat Island (UHI) index using features from the Sentinel-2 satellite dataset, Landset dataset, and building footprint dataset as predictor variables. In the best model, we utilized ten features: band B01 (Coastal Aerosol), band B06 (Red Edge), and NDVI (Normalized Difference Vegetation Index) derived from bands B04 (Red) and B08 (Near Infrared). A random forest regression model was then trained using these features.
    
These features were extracted from a GeoTIFF image created by the Sentinel-2 sample notebook. For the sample model shown in this notebook, data from a single day (24th July 2021) was considered, assuming that the values of bands B01, B04, B06, and B08 for this specific date are representative of the UHI index behavior at any location. Participants should review the details of the Sentinel-2 sample notebook to gain an understanding of the data and options for modifying the output product. 

In [16]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission224.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9881
LightGBM R² Score: 0.9895

Model Performance Metrics:
               Metric     Score
0           R-squared  0.983970
1    Out-of-Bag Score  0.971188
2   Mean CV R-squared -0.067227
3  Ensemble R-squared  0.983295

Submission file saved to Submission224.csv


In [23]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_afhi.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission225.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'af_hi_f_value'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["af_hi_f_value"] = uhi_df.iloc[indices]["af_hi_f_value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'af_hi_f_value'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9906
LightGBM R² Score: 0.9885

Model Performance Metrics:
               Metric     Score
0           R-squared  0.985202
1    Out-of-Bag Score  0.979669
2   Mean CV R-squared  0.463351
3  Ensemble R-squared  0.980776

Submission file saved to Submission225.csv


In [24]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_hi_f_value                       0.368574
building_density_ratio_squared      0.078918
building_density_ratio              0.076779
log_building_density_ratio          0.064036
building_density                    0.063138
Wind_Speed_x_Building_Density       0.047465
temp_deviation_smooth               0.041792
building_density_LST_interaction    0.039870
Nearest_AirTemp_C                   0.023659
Temp_Anomaly                        0.021733
temp_deviation                      0.019746
temp_2m_                            0.019133
mean_temp                           0.019016
LST                                 0.018299
relative_humidity_                  0.017526
log_LST                             0.017161
SAVI_LST_sqrt_diff                  0.014619
solar_insolation_                   0.013876
wind_direction_merge_               0.012292
nearest_building_area               0.008599
nearest_building_perimeter          0.006938
log_building_perimet

In [27]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_amhi.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission226.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_hi_f_value'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["am_hi_f_value"] = uhi_df.iloc[indices]["am_hi_f_value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_hi_f_value'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9766
LightGBM R² Score: 0.9673

Model Performance Metrics:
               Metric     Score
0           R-squared  0.983288
1    Out-of-Bag Score  0.971649
2   Mean CV R-squared -0.026794
3  Ensemble R-squared  0.977200

Submission file saved to Submission226.csv


In [28]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
building_density_ratio_squared      0.104017
building_density_ratio              0.103868
log_building_density_ratio          0.093085
building_density                    0.092655
Wind_Speed_x_Building_Density       0.072445
am_hi_f_value                       0.063629
building_density_LST_interaction    0.063430
temp_deviation_smooth               0.051936
Nearest_AirTemp_C                   0.034641
Temp_Anomaly                        0.033689
LST                                 0.030794
temp_2m_                            0.029904
log_LST                             0.028773
temp_deviation                      0.027235
relative_humidity_                  0.026864
mean_temp                           0.026789
SAVI_LST_sqrt_diff                  0.026046
solar_insolation_                   0.024173
wind_direction_merge_               0.020773
nearest_building_area               0.017057
nearest_building_perimeter          0.014107
log_building_perimet

In [36]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_pmhi.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission227.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f_value'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_hi_f_value"] = uhi_df.iloc[indices]["pm_hi_f_value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f_value'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9842
LightGBM R² Score: 0.9822

Model Performance Metrics:
               Metric     Score
0           R-squared  0.979415
1    Out-of-Bag Score  0.971538
2   Mean CV R-squared  0.050022
3  Ensemble R-squared  0.977135

Submission file saved to Submission227.csv


In [37]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_hi_f_value                       0.127367
building_density_ratio              0.100041
building_density_ratio_squared      0.098921
building_density                    0.087816
log_building_density_ratio          0.085378
Wind_Speed_x_Building_Density       0.068481
building_density_LST_interaction    0.053777
temp_deviation_smooth               0.048219
Nearest_AirTemp_C                   0.031381
Temp_Anomaly                        0.030903
LST                                 0.028647
log_LST                             0.027949
relative_humidity_                  0.026740
temp_2m_                            0.025893
temp_deviation                      0.025745
mean_temp                           0.025080
SAVI_LST_sqrt_diff                  0.024658
solar_insolation_                   0.023510
wind_direction_merge_               0.018274
nearest_building_area               0.015134
nearest_building_perimeter          0.013105
log_building_perimet

In [39]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_aft.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission228.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'af_t_f_value'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["af_t_f_value"] = uhi_df.iloc[indices]["af_t_f_value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'af_t_f_value'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9827
LightGBM R² Score: 0.9768

Model Performance Metrics:
               Metric     Score
0           R-squared  0.976947
1    Out-of-Bag Score  0.981224
2   Mean CV R-squared  0.767943
3  Ensemble R-squared  0.971740

Submission file saved to Submission228.csv


In [40]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f_value                        0.470651
building_density_ratio_squared      0.073062
building_density_ratio              0.069529
log_building_density_ratio          0.060127
building_density                    0.058082
Wind_Speed_x_Building_Density       0.039668
building_density_LST_interaction    0.037388
temp_deviation_smooth               0.025940
Nearest_AirTemp_C                   0.017730
Temp_Anomaly                        0.016580
temp_2m_                            0.014458
LST                                 0.014408
relative_humidity_                  0.013903
mean_temp                           0.013536
temp_deviation                      0.013531
log_LST                             0.013261
SAVI_LST_sqrt_diff                  0.010881
solar_insolation_                   0.010277
wind_direction_merge_               0.009048
nearest_building_area               0.007058
log_building_perimeter              0.005503
nearest_building_per

In [42]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_amt.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission229.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_t_f_value'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["am_t_f_value"] = uhi_df.iloc[indices]["am_t_f_value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_t_f_value'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
41 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9803
LightGBM R² Score: 0.9784

Model Performance Metrics:
               Metric     Score
0           R-squared  0.982636
1    Out-of-Bag Score  0.971908
2   Mean CV R-squared -0.021364
3  Ensemble R-squared  0.979036

Submission file saved to Submission229.csv


In [43]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
building_density_ratio_squared      0.105577
building_density_ratio              0.104256
log_building_density_ratio          0.091817
building_density                    0.091026
Wind_Speed_x_Building_Density       0.070997
building_density_LST_interaction    0.064213
am_t_f_value                        0.064086
temp_deviation_smooth               0.054301
Temp_Anomaly                        0.033814
Nearest_AirTemp_C                   0.033700
LST                                 0.030494
temp_2m_                            0.030043
log_LST                             0.028774
relative_humidity_                  0.027739
mean_temp                           0.027495
temp_deviation                      0.027240
SAVI_LST_sqrt_diff                  0.025275
solar_insolation_                   0.022316
wind_direction_merge_               0.021673
nearest_building_area               0.016731
log_building_perimeter              0.014332
nearest_building_per

In [45]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_pmt.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission230.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f_value'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f_value"] = uhi_df.iloc[indices]["pm_t_f_value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f_value'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9708
LightGBM R² Score: 0.9704

Model Performance Metrics:
               Metric     Score
0           R-squared  0.977006
1    Out-of-Bag Score  0.972023
2   Mean CV R-squared  0.052964
3  Ensemble R-squared  0.972006

Submission file saved to Submission230.csv


In [48]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f_value                        0.145858
building_density_ratio              0.102414
building_density_ratio_squared      0.099705
log_building_density_ratio          0.088337
building_density                    0.088213
Wind_Speed_x_Building_Density       0.065342
building_density_LST_interaction    0.050847
temp_deviation_smooth               0.045839
Nearest_AirTemp_C                   0.029143
Temp_Anomaly                        0.028085
LST                                 0.027304
log_LST                             0.026512
temp_deviation                      0.025018
mean_temp                           0.024435
temp_2m_                            0.024404
relative_humidity_                  0.024324
SAVI_LST_sqrt_diff                  0.023797
solar_insolation_                   0.022840
wind_direction_merge_               0.016948
nearest_building_area               0.014770
log_building_perimeter              0.012954
nearest_building_per

In [52]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission231.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
42 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9875
LightGBM R² Score: 0.9824

Model Performance Metrics:
               Metric     Score
0           R-squared  0.977455
1    Out-of-Bag Score  0.981895
2   Mean CV R-squared  0.879511
3  Ensemble R-squared  0.974475

Submission file saved to Submission231.csv


In [53]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.374413
af_hi_f                             0.216662
building_density_ratio_squared      0.051028
building_density_ratio              0.050045
pm_t_f                              0.044583
building_density                    0.040932
log_building_density_ratio          0.037554
pm_hi_f                             0.029575
Wind_Speed_x_Building_Density       0.021034
building_density_LST_interaction    0.016682
temp_deviation_smooth               0.015772
am_t_f                              0.014889
am_hi_f                             0.013671
Temp_Anomaly                        0.007508
log_LST                             0.006903
Nearest_AirTemp_C                   0.006892
temp_deviation                      0.006507
mean_temp                           0.006454
LST                                 0.006372
SAVI_LST_sqrt_diff                  0.006194
relative_humidity_                  0.005889
temp_2m_            

In [54]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission232.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'af_hi_f', 'af_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'af_hi_f', 'af_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9880
LightGBM R² Score: 0.9809

Model Performance Metrics:
               Metric     Score
0           R-squared  0.979351
1    Out-of-Bag Score  0.982047
2   Mean CV R-squared  0.853949
3  Ensemble R-squared  0.974210

Submission file saved to Submission232.csv


In [55]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.372638
af_hi_f                             0.232059
building_density_ratio_squared      0.058220
building_density_ratio              0.055607
building_density                    0.047498
log_building_density_ratio          0.044476
Wind_Speed_x_Building_Density       0.032102
building_density_LST_interaction    0.024578
temp_deviation_smooth               0.022522
Nearest_AirTemp_C                   0.012338
Temp_Anomaly                        0.012044
mean_temp                           0.010007
temp_2m_                            0.009731
temp_deviation                      0.009180
log_LST                             0.009047
relative_humidity_                  0.008920
LST                                 0.008632
SAVI_LST_sqrt_diff                  0.007449
solar_insolation_                   0.006593
wind_direction_merge_               0.005890
nearest_building_area               0.004103
nearest_building_per

In [56]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission233.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_hi_f', 'am_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_hi_f', 'am_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
10 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9818
LightGBM R² Score: 0.9690

Model Performance Metrics:
               Metric     Score
0           R-squared  0.983066
1    Out-of-Bag Score  0.971642
2   Mean CV R-squared -0.007526
3  Ensemble R-squared  0.976697

Submission file saved to Submission233.csv


In [57]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
building_density_ratio_squared      0.105615
building_density_ratio              0.103947
building_density                    0.087945
log_building_density_ratio          0.086241
Wind_Speed_x_Building_Density       0.070316
building_density_LST_interaction    0.061729
am_hi_f                             0.053530
am_t_f                              0.050030
temp_deviation_smooth               0.049448
Nearest_AirTemp_C                   0.031889
Temp_Anomaly                        0.031721
LST                                 0.028203
temp_2m_                            0.028160
log_LST                             0.028111
relative_humidity_                  0.025727
SAVI_LST_sqrt_diff                  0.025044
mean_temp                           0.024714
temp_deviation                      0.023490
solar_insolation_                   0.021723
wind_direction_merge_               0.020073
nearest_building_area               0.016022
nearest_building_per

In [58]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission234.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'pm_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'pm_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9775
LightGBM R² Score: 0.9684

Model Performance Metrics:
               Metric     Score
0           R-squared  0.974426
1    Out-of-Bag Score  0.971532
2   Mean CV R-squared  0.073370
3  Ensemble R-squared  0.972429

Submission file saved to Submission234.csv


In [59]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.107645
building_density_ratio_squared      0.102393
building_density_ratio              0.099444
pm_hi_f                             0.089087
building_density                    0.081901
log_building_density_ratio          0.080126
Wind_Speed_x_Building_Density       0.063154
building_density_LST_interaction    0.044128
temp_deviation_smooth               0.042695
Nearest_AirTemp_C                   0.027034
Temp_Anomaly                        0.026531
LST                                 0.025717
log_LST                             0.025027
relative_humidity_                  0.023447
SAVI_LST_sqrt_diff                  0.022949
mean_temp                           0.022131
temp_2m_                            0.021890
solar_insolation_                   0.020871
temp_deviation                      0.020678
wind_direction_merge_               0.015885
nearest_building_area               0.013204
nearest_building_per

In [60]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission235.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
27 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9848
LightGBM R² Score: 0.9892

Model Performance Metrics:
               Metric     Score
0           R-squared  0.985371
1    Out-of-Bag Score  0.979422
2   Mean CV R-squared  0.483797
3  Ensemble R-squared  0.979101

Submission file saved to Submission235.csv


In [61]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_hi_f                             0.318654
pm_hi_f                             0.073757
building_density_ratio_squared      0.071540
building_density_ratio              0.069850
log_building_density_ratio          0.065565
building_density                    0.058002
Wind_Speed_x_Building_Density       0.047134
temp_deviation_smooth               0.035007
building_density_LST_interaction    0.034749
am_hi_f                             0.034592
Nearest_AirTemp_C                   0.018819
Temp_Anomaly                        0.018793
log_LST                             0.016971
LST                                 0.016732
mean_temp                           0.016174
temp_deviation                      0.015773
relative_humidity_                  0.015723
temp_2m_                            0.014898
SAVI_LST_sqrt_diff                  0.014468
solar_insolation_                   0.012899
wind_direction_merge_               0.010100
nearest_building_are

In [62]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission236.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_t_f', 'af_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_t_f', 'af_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
65 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9820
LightGBM R² Score: 0.9784

Model Performance Metrics:
               Metric     Score
0           R-squared  0.975205
1    Out-of-Bag Score  0.981096
2   Mean CV R-squared  0.763708
3  Ensemble R-squared  0.972453

Submission file saved to Submission236.csv


In [63]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.407346
pm_t_f                              0.083147
building_density_ratio_squared      0.066769
building_density_ratio              0.066149
log_building_density_ratio          0.058660
building_density                    0.055757
Wind_Speed_x_Building_Density       0.040894
building_density_LST_interaction    0.030752
am_t_f                              0.025865
temp_deviation_smooth               0.022617
Temp_Anomaly                        0.014455
Nearest_AirTemp_C                   0.014218
log_LST                             0.012678
LST                                 0.012615
mean_temp                           0.011644
temp_deviation                      0.011594
SAVI_LST_sqrt_diff                  0.011316
relative_humidity_                  0.010938
temp_2m_                            0.010680
solar_insolation_                   0.008487
wind_direction_merge_               0.007560
nearest_building_are

In [64]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission237.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
35 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9712
LightGBM R² Score: 0.9612

Model Performance Metrics:
               Metric     Score
0           R-squared  0.977161
1    Out-of-Bag Score  0.972615
2   Mean CV R-squared  0.097822
3  Ensemble R-squared  0.970752

Submission file saved to Submission237.csv


In [65]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.137621
building_density_ratio_squared      0.102162
building_density_ratio              0.100631
building_density                    0.084591
log_building_density_ratio          0.081949
Wind_Speed_x_Building_Density       0.063914
am_t_f                              0.049576
building_density_LST_interaction    0.049059
temp_deviation_smooth               0.042615
Temp_Anomaly                        0.027083
Nearest_AirTemp_C                   0.027017
LST                                 0.025169
log_LST                             0.025022
SAVI_LST_sqrt_diff                  0.022570
relative_humidity_                  0.022554
mean_temp                           0.022461
temp_2m_                            0.022323
temp_deviation                      0.021007
solar_insolation_                   0.018600
wind_direction_merge_               0.016469
nearest_building_area               0.013690
log_building_perimet

In [66]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission238.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
31 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9770
LightGBM R² Score: 0.9760

Model Performance Metrics:
               Metric     Score
0           R-squared  0.979364
1    Out-of-Bag Score  0.972000
2   Mean CV R-squared  0.095323
3  Ensemble R-squared  0.975673

Submission file saved to Submission238.csv


In [67]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_hi_f                             0.120378
building_density_ratio_squared      0.103411
building_density_ratio              0.097799
building_density                    0.083346
log_building_density_ratio          0.081816
Wind_Speed_x_Building_Density       0.065140
am_hi_f                             0.052909
building_density_LST_interaction    0.050113
temp_deviation_smooth               0.043548
Nearest_AirTemp_C                   0.029788
Temp_Anomaly                        0.027791
log_LST                             0.026554
LST                                 0.025969
temp_2m_                            0.023990
relative_humidity_                  0.023781
SAVI_LST_sqrt_diff                  0.023141
mean_temp                           0.022613
solar_insolation_                   0.021728
temp_deviation                      0.021176
wind_direction_merge_               0.016732
nearest_building_area               0.013836
log_building_perimet

In [13]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission239.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f',
     'pm_t_f', 'am_t_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f',
     'pm_t_f', 'am_t_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9747
LightGBM R² Score: 0.9730

Model Performance Metrics:
               Metric     Score
0           R-squared  0.973921
1    Out-of-Bag Score  0.972099
2   Mean CV R-squared  0.129846
3  Ensemble R-squared  0.970811

Submission file saved to Submission239.csv


In [14]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.109659
building_density_ratio_squared      0.091511
building_density_ratio              0.089435
log_building_density_ratio          0.079321
pm_hi_f                             0.078861
building_density                    0.076687
Wind_Speed_x_Building_Density       0.059132
building_density_LST_interaction    0.043896
am_t_f                              0.040812
am_hi_f                             0.036909
temp_deviation_smooth               0.036291
LST                                 0.024083
log_LST                             0.023738
Nearest_AirTemp_C                   0.022383
Temp_Anomaly                        0.022184
SAVI_LST_sqrt_diff                  0.021629
mean_temp                           0.020670
relative_humidity_                  0.019512
temp_deviation                      0.019402
temp_2m_                            0.018714
solar_insolation_                   0.017423
wind_direction_merge

In [15]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "final_merged_weather_uhi_cleaned3_hyperlocal_all.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission240.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9752
LightGBM R² Score: 0.9723

Model Performance Metrics:
               Metric     Score
0           R-squared  0.976228
1    Out-of-Bag Score  0.972509
2   Mean CV R-squared  0.083838
3  Ensemble R-squared  0.972708

Submission file saved to Submission240.csv


In [17]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "uhi_pluto_cleaned_filtered.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission241.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     "bldgarea", "numfloors", "unitsres", "unitstotal", "bldgfront", "bldgdepth",
     "lotarea", "residfar", "commfar", "facilfar", "garagearea", "strgearea", "factryarea",
     "assessland", "yearbuilt", "yearalter1", "yearalter2", "temp_index"
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     "bldgarea", "numfloors", "unitsres", "unitstotal", "bldgfront", "bldgdepth",
     "lotarea", "residfar", "commfar", "facilfar", "garagearea", "strgearea", "factryarea",
     "assessland", "yearbuilt", "yearalter1", "yearalter2", "temp_index"
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9709
LightGBM R² Score: 0.9790

Model Performance Metrics:
               Metric     Score
0           R-squared  0.966824
1    Out-of-Bag Score  0.978930
2   Mean CV R-squared  0.296213
3  Ensemble R-squared  0.945207

Submission file saved to Submission241.csv


In [18]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.119197
Wind_Speed_x_Building_Density       0.111653
building_density_ratio              0.076816
building_density_ratio_squared      0.076434
log_building_density_ratio          0.072667
building_density                    0.067788
building_density_LST_interaction    0.053016
temp_deviation_smooth               0.048554
am_hi_f                             0.046984
wind_direction_merge_               0.027610
log_LST                             0.024376
Temp_Anomaly                        0.022945
LST                                 0.022631
Nearest_AirTemp_C                   0.022249
relative_humidity_                  0.022042
SAVI_LST_sqrt_diff                  0.021494
temp_2m_                            0.019538
solar_insolation_                   0.017373
mean_temp                           0.015244
temp_deviation                      0.015166
temp_index                          0.014659
nearest_building_are

In [19]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "uhi_pluto_cleaned_filtered.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission242.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     "bldgarea", "numfloors", "bldgfront", "bldgdepth",
     "lotarea", "residfar", "commfar", "facilfar", "temp_index"
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     "bldgarea", "numfloors", "bldgfront", "bldgdepth",
     "lotarea", "residfar", "commfar", "facilfar", "temp_index"
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9723
LightGBM R² Score: 0.9694

Model Performance Metrics:
               Metric     Score
0           R-squared  0.965480
1    Out-of-Bag Score  0.979798
2   Mean CV R-squared  0.317806
3  Ensemble R-squared  0.945357


ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- assessland
- factryarea
- garagearea
- strgearea
- unitsres
- ...


In [21]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.132610
Wind_Speed_x_Building_Density       0.131537
building_density_ratio              0.083182
building_density_ratio_squared      0.077356
log_building_density_ratio          0.074333
building_density                    0.071703
am_hi_f                             0.048650
building_density_LST_interaction    0.046128
temp_deviation_smooth               0.044095
LST                                 0.023518
log_LST                             0.023459
wind_direction_merge_               0.023054
Nearest_AirTemp_C                   0.021729
Temp_Anomaly                        0.021451
SAVI_LST_sqrt_diff                  0.020204
relative_humidity_                  0.019669
temp_2m_                            0.019554
solar_insolation_                   0.015247
temp_deviation                      0.014517
mean_temp                           0.013938
temp_index                          0.013377
nearest_building_are

In [22]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "uhi_pluto_cleaned_filtered.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission243.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     'temp_index'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     'temp_index'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9755
LightGBM R² Score: 0.9652

Model Performance Metrics:
               Metric     Score
0           R-squared  0.962705
1    Out-of-Bag Score  0.980743
2   Mean CV R-squared  0.270970
3  Ensemble R-squared  0.947151

Submission file saved to Submission243.csv


In [23]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.136600
Wind_Speed_x_Building_Density       0.131712
building_density_ratio_squared      0.083885
log_building_density_ratio          0.078859
building_density_ratio              0.077590
building_density                    0.071371
temp_deviation_smooth               0.052861
building_density_LST_interaction    0.050750
am_hi_f                             0.048259
log_LST                             0.025276
LST                                 0.024281
Nearest_AirTemp_C                   0.023020
wind_direction_merge_               0.022507
Temp_Anomaly                        0.021179
SAVI_LST_sqrt_diff                  0.020538
relative_humidity_                  0.020226
temp_2m_                            0.019189
solar_insolation_                   0.015804
temp_deviation                      0.014907
temp_index                          0.014467
mean_temp                           0.014319
nearest_building_are

In [24]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "uhi_pluto_cleaned_filtered.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission244.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f',
     'temp_index'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f',
     'temp_index'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
65 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9745
LightGBM R² Score: 0.9716

Model Performance Metrics:
               Metric     Score
0           R-squared  0.965370
1    Out-of-Bag Score  0.980505
2   Mean CV R-squared  0.286599
3  Ensemble R-squared  0.948894

Submission file saved to Submission244.csv


In [25]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.153939
Wind_Speed_x_Building_Density       0.126996
building_density_ratio_squared      0.090196
building_density_ratio              0.086027
building_density                    0.079408
log_building_density_ratio          0.075928
temp_deviation_smooth               0.053572
building_density_LST_interaction    0.049401
Temp_Anomaly                        0.026418
log_LST                             0.024997
LST                                 0.024757
Nearest_AirTemp_C                   0.023409
relative_humidity_                  0.021964
temp_2m_                            0.020904
SAVI_LST_sqrt_diff                  0.020677
wind_direction_merge_               0.020543
mean_temp                           0.016970
solar_insolation_                   0.016276
temp_deviation                      0.016081
temp_index                          0.015448
nearest_building_area               0.012580
log_building_perimet

In [1]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "uhi_pluto_cleaned_filtered.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission245.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_t_f',
     'temp_index'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'am_t_f',
     'temp_index'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
27 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9542


Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



LightGBM R² Score: 0.9653

Model Performance Metrics:
               Metric     Score
0           R-squared  0.962119
1    Out-of-Bag Score  0.980358
2   Mean CV R-squared  0.158215
3  Ensemble R-squared  0.947831

Submission file saved to Submission245.csv


In [2]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
Wind_Speed_x_Building_Density       0.131212
building_density_ratio_squared      0.099937
building_density_ratio              0.093924
building_density                    0.085671
log_building_density_ratio          0.085205
temp_deviation_smooth               0.063744
am_t_f                              0.061310
building_density_LST_interaction    0.056750
LST                                 0.030682
log_LST                             0.030450
Temp_Anomaly                        0.029469
Nearest_AirTemp_C                   0.026414
wind_direction_merge_               0.026284
SAVI_LST_sqrt_diff                  0.025299
relative_humidity_                  0.024567
temp_2m_                            0.024567
solar_insolation_                   0.018383
mean_temp                           0.017207
temp_deviation                      0.017185
temp_index                          0.014671
nearest_building_area               0.013040
nearest_building_per

In [3]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "uhi_pluto_cleaned_filtered.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission246.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 98,

XGBoost R² Score: 0.9940
LightGBM R² Score: 0.9947

Model Performance Metrics:
               Metric     Score
0           R-squared  0.983469
1    Out-of-Bag Score  0.986838
2   Mean CV R-squared  0.866963
3  Ensemble R-squared  0.960388

Submission file saved to Submission246.csv


In [4]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.310481
af_hi_f                             0.186626
pm_hi_f                             0.080385
Wind_Speed_x_Building_Density       0.069956
pm_t_f                              0.057956
building_density_ratio_squared      0.038004
log_building_density_ratio          0.034050
building_density_ratio              0.032901
building_density                    0.029995
building_density_LST_interaction    0.022832
temp_deviation_smooth               0.020528
am_hi_f                             0.019427
am_t_f                              0.017371
Nearest_AirTemp_C                   0.008376
log_LST                             0.007846
Temp_Anomaly                        0.007832
LST                                 0.006910
wind_direction_merge_               0.006339
SAVI_LST_sqrt_diff                  0.005913
temp_index                          0.005265
temp_2m_                            0.005219
temp_deviation      

In [5]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "Merged_UHI_HHI_Data.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission247.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     'temp_index',
     'PR_RENT', 'P_RENT', 'OVERALL_RANK', 'OVERALL_SCORE', 'P_OZONE','PR_OZONE',
     'PR_PM25', 'P_PM25', 'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI', 'F_HRI'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values
submission_df["PR_RENT"] = uhi_df.iloc[indices]["PR_RENT"].values
submission_df["P_RENT"] = uhi_df.iloc[indices]["P_RENT"].values
submission_df["OVERALL_RANK"] = uhi_df.iloc[indices]["OVERALL_RANK"].values
submission_df["OVERALL_SCORE"] = uhi_df.iloc[indices]["OVERALL_SCORE"].values
submission_df["P_OZONE"] = uhi_df.iloc[indices]["P_OZONE"].values
submission_df["PR_OZONE"] = uhi_df.iloc[indices]["PR_OZONE"].values
submission_df["PR_PM25"] = uhi_df.iloc[indices]["PR_PM25"].values
submission_df["P_PM25"] = uhi_df.iloc[indices]["P_PM25"].values
submission_df["NBE_SCORE"] = uhi_df.iloc[indices]["NBE_SCORE"].values
submission_df["NBE_RANK"] = uhi_df.iloc[indices]["NBE_RANK"].values
submission_df["POP"] = uhi_df.iloc[indices]["POP"].values
submission_df["PR_HRI"] = uhi_df.iloc[indices]["PR_HRI"].values
submission_df["F_HRI"] = uhi_df.iloc[indices]["F_HRI"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_t_f', 'am_hi_f', 
     'temp_index',
     'PR_RENT', 'P_RENT', 'OVERALL_RANK', 'OVERALL_SCORE', 'P_OZONE','PR_OZONE',
     'PR_PM25', 'P_PM25', 'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI', 'F_HRI'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

  uhi_df = pd.read_csv(uhi_updated_path)
65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
55 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/skle

XGBoost R² Score: 0.9670
LightGBM R² Score: 0.9649

Model Performance Metrics:
               Metric     Score
0           R-squared  0.961652
1    Out-of-Bag Score  0.981334
2   Mean CV R-squared  0.345611
3  Ensemble R-squared  0.947127

Submission file saved to Submission247.csv


In [6]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
pm_t_f                              0.110948
Wind_Speed_x_Building_Density       0.103453
building_density_ratio              0.072428
building_density_ratio_squared      0.069889
building_density                    0.069213
log_building_density_ratio          0.068802
am_hi_f                             0.044326
temp_deviation_smooth               0.042932
building_density_LST_interaction    0.042255
POP                                 0.025591
wind_direction_merge_               0.022628
log_LST                             0.020190
LST                                 0.019944
P_RENT                              0.019219
NBE_RANK                            0.018366
SAVI_LST_sqrt_diff                  0.017668
relative_humidity_                  0.017350
OVERALL_SCORE                       0.017265
OVERALL_RANK                        0.016906
NBE_SCORE                           0.016640
PR_RENT                             0.015824
temp_2m_            

In [7]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "Merged_UHI_HHI_Data.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission248.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'PR_RENT', 'P_RENT', 'OVERALL_RANK', 'OVERALL_SCORE', 'P_OZONE','PR_OZONE',
     'PR_PM25', 'P_PM25', 'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI', 'F_HRI'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values
submission_df["PR_RENT"] = uhi_df.iloc[indices]["PR_RENT"].values
submission_df["P_RENT"] = uhi_df.iloc[indices]["P_RENT"].values
submission_df["OVERALL_RANK"] = uhi_df.iloc[indices]["OVERALL_RANK"].values
submission_df["OVERALL_SCORE"] = uhi_df.iloc[indices]["OVERALL_SCORE"].values
submission_df["P_OZONE"] = uhi_df.iloc[indices]["P_OZONE"].values
submission_df["PR_OZONE"] = uhi_df.iloc[indices]["PR_OZONE"].values
submission_df["PR_PM25"] = uhi_df.iloc[indices]["PR_PM25"].values
submission_df["P_PM25"] = uhi_df.iloc[indices]["P_PM25"].values
submission_df["NBE_SCORE"] = uhi_df.iloc[indices]["NBE_SCORE"].values
submission_df["NBE_RANK"] = uhi_df.iloc[indices]["NBE_RANK"].values
submission_df["POP"] = uhi_df.iloc[indices]["POP"].values
submission_df["PR_HRI"] = uhi_df.iloc[indices]["PR_HRI"].values
submission_df["F_HRI"] = uhi_df.iloc[indices]["F_HRI"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'PR_RENT', 'P_RENT', 'OVERALL_RANK', 'OVERALL_SCORE', 'P_OZONE','PR_OZONE',
     'PR_PM25', 'P_PM25', 'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI', 'F_HRI'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

  uhi_df = pd.read_csv(uhi_updated_path)
65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/skle

XGBoost R² Score: 0.9939
LightGBM R² Score: 0.9937

Model Performance Metrics:
               Metric     Score
0           R-squared  0.982893
1    Out-of-Bag Score  0.986704
2   Mean CV R-squared  0.840600
3  Ensemble R-squared  0.958379

Submission file saved to Submission248.csv


In [8]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.258617
af_hi_f                             0.195860
pm_hi_f                             0.067207
Wind_Speed_x_Building_Density       0.061415
pm_t_f                              0.053525
building_density_ratio_squared      0.036664
building_density_ratio              0.036294
log_building_density_ratio          0.035734
building_density                    0.029044
building_density_LST_interaction    0.022264
temp_deviation_smooth               0.021050
am_t_f                              0.016276
am_hi_f                             0.016197
POP                                 0.016118
NBE_SCORE                           0.012608
P_RENT                              0.010595
wind_direction_merge_               0.008778
NBE_RANK                            0.008716
OVERALL_SCORE                       0.007622
OVERALL_RANK                        0.007077
PR_RENT                             0.006995
Nearest_AirTemp_C   

In [9]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "Merged_UHI_HHI_Data.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission249.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'PR_RENT', 'P_RENT', 'OVERALL_RANK', 'OVERALL_SCORE',
     'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values
submission_df["PR_RENT"] = uhi_df.iloc[indices]["PR_RENT"].values
submission_df["P_RENT"] = uhi_df.iloc[indices]["P_RENT"].values
submission_df["OVERALL_RANK"] = uhi_df.iloc[indices]["OVERALL_RANK"].values
submission_df["OVERALL_SCORE"] = uhi_df.iloc[indices]["OVERALL_SCORE"].values
submission_df["P_OZONE"] = uhi_df.iloc[indices]["P_OZONE"].values
submission_df["PR_OZONE"] = uhi_df.iloc[indices]["PR_OZONE"].values
submission_df["PR_PM25"] = uhi_df.iloc[indices]["PR_PM25"].values
submission_df["P_PM25"] = uhi_df.iloc[indices]["P_PM25"].values
submission_df["NBE_SCORE"] = uhi_df.iloc[indices]["NBE_SCORE"].values
submission_df["NBE_RANK"] = uhi_df.iloc[indices]["NBE_RANK"].values
submission_df["POP"] = uhi_df.iloc[indices]["POP"].values
submission_df["PR_HRI"] = uhi_df.iloc[indices]["PR_HRI"].values
submission_df["F_HRI"] = uhi_df.iloc[indices]["F_HRI"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'PR_RENT', 'P_RENT', 'OVERALL_RANK', 'OVERALL_SCORE',
     'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

  uhi_df = pd.read_csv(uhi_updated_path)
65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/skle

XGBoost R² Score: 0.9932
LightGBM R² Score: 0.9946

Model Performance Metrics:
               Metric     Score
0           R-squared  0.983026
1    Out-of-Bag Score  0.986885
2   Mean CV R-squared  0.862830
3  Ensemble R-squared  0.959652

Submission file saved to Submission249.csv


In [10]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.300913
af_hi_f                             0.185218
pm_hi_f                             0.067789
Wind_Speed_x_Building_Density       0.062703
pm_t_f                              0.055280
building_density_ratio_squared      0.037056
building_density_ratio              0.031122
building_density                    0.030138
log_building_density_ratio          0.029116
building_density_LST_interaction    0.018041
temp_deviation_smooth               0.017531
am_hi_f                             0.015837
POP                                 0.015567
am_t_f                              0.013882
NBE_SCORE                           0.011133
NBE_RANK                            0.010898
P_RENT                              0.008541
PR_RENT                             0.008278
wind_direction_merge_               0.008065
OVERALL_SCORE                       0.007118
OVERALL_RANK                        0.006964
relative_humidity_  

In [11]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "Merged_UHI_HHI_Data.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission250.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'PR_RENT', 'P_RENT',
     'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values
submission_df["PR_RENT"] = uhi_df.iloc[indices]["PR_RENT"].values
submission_df["P_RENT"] = uhi_df.iloc[indices]["P_RENT"].values
submission_df["OVERALL_RANK"] = uhi_df.iloc[indices]["OVERALL_RANK"].values
submission_df["OVERALL_SCORE"] = uhi_df.iloc[indices]["OVERALL_SCORE"].values
submission_df["P_OZONE"] = uhi_df.iloc[indices]["P_OZONE"].values
submission_df["PR_OZONE"] = uhi_df.iloc[indices]["PR_OZONE"].values
submission_df["PR_PM25"] = uhi_df.iloc[indices]["PR_PM25"].values
submission_df["P_PM25"] = uhi_df.iloc[indices]["P_PM25"].values
submission_df["NBE_SCORE"] = uhi_df.iloc[indices]["NBE_SCORE"].values
submission_df["NBE_RANK"] = uhi_df.iloc[indices]["NBE_RANK"].values
submission_df["POP"] = uhi_df.iloc[indices]["POP"].values
submission_df["PR_HRI"] = uhi_df.iloc[indices]["PR_HRI"].values
submission_df["F_HRI"] = uhi_df.iloc[indices]["F_HRI"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]


# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'nearest_building_area',
     'nearest_building_perimeter',
     'building_density',
     'temp_2m_',
     'relative_humidity_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_building_perimeter',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'PR_RENT', 'P_RENT',
     'NBE_SCORE', 'NBE_RANK', 'POP', 'PR_HRI'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

  uhi_df = pd.read_csv(uhi_updated_path)
65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
55 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/skle

XGBoost R² Score: 0.9912
LightGBM R² Score: 0.9942

Model Performance Metrics:
               Metric     Score
0           R-squared  0.968005
1    Out-of-Bag Score  0.986922
2   Mean CV R-squared  0.846916
3  Ensemble R-squared  0.953136

Submission file saved to Submission250.csv


In [12]:
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

feature_names = X.columns
rf_importances = pd.Series(importances_rf, index=feature_names).sort_values(ascending=False)
et_importances = pd.Series(importances_et, index=feature_names).sort_values(ascending=False)

print("Random Forest Feature Importances:")
print(rf_importances)

print("\nExtra Trees Feature Importances:")
print(et_importances)

Random Forest Feature Importances:
af_t_f                              0.261223
af_hi_f                             0.185089
pm_hi_f                             0.073968
Wind_Speed_x_Building_Density       0.062903
pm_t_f                              0.061207
building_density_ratio_squared      0.036833
building_density_ratio              0.032626
building_density                    0.031207
log_building_density_ratio          0.030670
building_density_LST_interaction    0.022622
temp_deviation_smooth               0.021095
am_hi_f                             0.017681
POP                                 0.017144
am_t_f                              0.014726
NBE_RANK                            0.012450
NBE_SCORE                           0.011845
P_RENT                              0.011800
PR_RENT                             0.011275
wind_direction_merge_               0.008713
Temp_Anomaly                        0.006721
Nearest_AirTemp_C                   0.006578
LST                 

In [2]:
import pandas as pd
import numpy as np
import joblib
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.metrics import r2_score

# -------------------------
# Load the Updated Dataset (Excluding Latitude & Longitude as Features)
# -------------------------
uhi_updated_path = "Merged_UHI_HHI_HVI_GreenRoof_SVI_UHII_Data.csv"
submission_path = "Submission_template.csv"
submission_updated_path = "Submission269.csv"

uhi_df = pd.read_csv(uhi_updated_path)

# Fix column names (remove special characters)
uhi_df.columns = (
    uhi_df.columns.str.replace(r"\[.*?\]", "", regex=True)  # Remove content in brackets
    .str.replace(" ", "_")  # Replace spaces with underscores
)

# -------------------------
# Feature Engineering: Adding Interactions & Transformations
# -------------------------
uhi_df["building_density_ratio"] = uhi_df["building_density"] / (uhi_df["nearest_building_area"] + 1)
uhi_df["log_building_perimeter"] = np.log1p(uhi_df["nearest_building_perimeter"])
uhi_df["log_LST"] = np.log1p(uhi_df["LST"])  # log(LST + 1) to avoid log(0)
uhi_df["log_building_density_ratio"] = np.log1p(uhi_df["building_density_ratio"])
uhi_df["building_density_LST_interaction"] = uhi_df["building_density"] * uhi_df["LST"]
uhi_df["building_density_ratio_squared"] = uhi_df["building_density_ratio"] ** 2
uhi_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(uhi_df["SAVI"] - uhi_df["LST"]))
uhi_df["Wind_Speed_x_Building_Density"] = uhi_df["avg_wind_speed_merge_"] * uhi_df["building_density"]

uhi_df["temp_range"] = uhi_df["af_hi_f"] - uhi_df["af_t_f"]  
uhi_df["am_pm_temp_diff"] = uhi_df["pm_t_f"] - uhi_df["am_t_f"]  
uhi_df["hi_temp_diff"] = uhi_df["af_hi_f"] - uhi_df["am_hi_f"] 
uhi_df["weighted_temp"] = (0.6 * uhi_df["af_t_f"]) + (0.4 * uhi_df["pm_t_f"])
uhi_df["temp_rate_change"] = (uhi_df["af_t_f"] - uhi_df["am_t_f"]) / 12  

# -------------------------
# Feature Selection (Excludes Latitude & Longitude)
# -------------------------
X = uhi_df[
    ['LST',
     'building_density',
     'temp_2m_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'temp_range', 'am_pm_temp_diff', 'hi_temp_diff', 'weighted_temp', 'temp_rate_change'
     ]
]
y = uhi_df["UHI_Index"]

# -------------------------
# Train-Test Split
# -------------------------
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.001, random_state=42)

# -------------------------
# Hyperparameter Tuning with RandomizedSearchCV
# -------------------------
param_dist = {
    "n_estimators": [100, 200, 500, 1000],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["auto", "sqrt", "log2"],
    "bootstrap": [True]
}

rf = RandomForestRegressor(random_state=42, oob_score=True)
random_search = RandomizedSearchCV(
    estimator=rf, param_distributions=param_dist, cv=5, n_iter=50, 
    scoring="r2", n_jobs=-1, random_state=42
)
random_search.fit(X_train, y_train)

# -------------------------
# Use the Best Model
# -------------------------
best_rf = random_search.best_estimator_

# -------------------------
# Model Evaluation
# -------------------------
y_pred = best_rf.predict(X_test)
r2 = r2_score(y_test, y_pred)
oob_score = best_rf.oob_score_

# Cross-Validation Scores
cv_scores = cross_val_score(best_rf, X, y, cv=5, scoring="r2")
mean_cv_score = cv_scores.mean()

# -------------------------
# Ensemble Learning (Extra Trees)
# -------------------------
extra_trees = ExtraTreesRegressor(n_estimators=500, random_state=42)
extra_trees.fit(X_train, y_train)

from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42)
xgb_model.fit(X_train, y_train)
y_pred_xgb = xgb_model.predict(X_test)
r2_xgb = r2_score(y_test, y_pred_xgb)
print(f"XGBoost R² Score: {r2_xgb:.4f}")

import lightgbm as lgb

lgb_model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.1, max_depth=6, random_state=42, verbose=-1)
lgb_model.fit(X_train, y_train)
y_pred_lgb = lgb_model.predict(X_test)
r2_lgb = r2_score(y_test, y_pred_lgb)
print(f"LightGBM R² Score: {r2_lgb:.4f}")

# Combine Predictions (Averaging Random Forest & Extra Trees)
y_pred_ensemble_test = (
    0.2 * best_rf.predict(X_test) +
    0.6 * extra_trees.predict(X_test) +
    0.1 * xgb_model.predict(X_test) +
    0.1 * lgb_model.predict(X_test)
)

r2_ensemble = r2_score(y_test, y_pred_ensemble_test)

# -------------------------
# Display Model Performance
# -------------------------
model_performance = pd.DataFrame({
    "Metric": ["R-squared", "Out-of-Bag Score", "Mean CV R-squared", "Ensemble R-squared"],
    "Score": [r2, oob_score, mean_cv_score, r2_ensemble]
})
print("\nModel Performance Metrics:")
print(model_performance)

# -------------------------
# Update Submission File with Predictions (Excluding Lat/Lon as Features)
# -------------------------
submission_df = pd.read_csv(submission_path)

# Extract coordinates using the correct column names
uhi_coords = uhi_df[['longitude', 'latitude']].values  # UHI dataset uses lowercase
submission_coords = submission_df[['Longitude', 'Latitude']].values  # Submission dataset uses uppercase

# Build a KDTree using UHI dataset
uhi_tree = cKDTree(uhi_coords)

# Query the KDTree for nearest neighbors
_, indices = uhi_tree.query(submission_coords, k=1)  # k=1 ensures the nearest point is found

# Assign nearest features from UHI dataset to submission file
submission_df["NDVI"] = uhi_df.iloc[indices]["NDVI"].values
submission_df["EVI"] = uhi_df.iloc[indices]["EVI"].values
submission_df["GNDVI"] = uhi_df.iloc[indices]["GNDVI"].values
submission_df["SAVI"] = uhi_df.iloc[indices]["SAVI"].values
submission_df["NDBI"] = uhi_df.iloc[indices]["NDBI"].values
submission_df["MNDWI"] = uhi_df.iloc[indices]["MNDWI"].values
submission_df["NDWI"] = uhi_df.iloc[indices]["NDWI"].values
submission_df["LSWI"] = uhi_df.iloc[indices]["LSWI"].values
submission_df["BI"] = uhi_df.iloc[indices]["BI"].values
submission_df["Albedo"] = uhi_df.iloc[indices]["Albedo"].values
submission_df["IBI"] = uhi_df.iloc[indices]["IBI"].values
submission_df["LST"] = uhi_df.iloc[indices]["LST"].values
submission_df["nearest_building_area"] = uhi_df.iloc[indices]["nearest_building_area"].values
submission_df["nearest_building_perimeter"] = uhi_df.iloc[indices]["nearest_building_perimeter"].values
submission_df["building_density"] = uhi_df.iloc[indices]["building_density"].values
submission_df["elevation_"] = uhi_df.iloc[indices]["elevation_"].values
submission_df["temp_2m_"] = uhi_df.iloc[indices]["temp_2m_"].values
submission_df["relative_humidity_"] = uhi_df.iloc[indices]["relative_humidity_"].values
submission_df["avg_wind_speed_merge_"] = uhi_df.iloc[indices]["avg_wind_speed_merge_"].values
submission_df["max_wind_speed_merge_"] = uhi_df.iloc[indices]["max_wind_speed_merge_"].values
submission_df["wind_speed_stddev_merge_"] = uhi_df.iloc[indices]["wind_speed_stddev_merge_"].values
submission_df["wind_direction_merge_"] = uhi_df.iloc[indices]["wind_direction_merge_"].values
submission_df["wind_direction_stddev_merge_"] = uhi_df.iloc[indices]["wind_direction_stddev_merge_"].values
submission_df["solar_insolation_"] = uhi_df.iloc[indices]["solar_insolation_"].values
submission_df["mean_temp"] = uhi_df.iloc[indices]["mean_temp"].values
submission_df["temp_deviation"] = uhi_df.iloc[indices]["temp_deviation"].values
submission_df["temp_deviation_smooth"] = uhi_df.iloc[indices]["temp_deviation_smooth"].values
submission_df["Nearest_AirTemp_C"] = uhi_df.iloc[indices]["Nearest_AirTemp_C"].values
submission_df["Temp_Anomaly"] = uhi_df.iloc[indices]["Temp_Anomaly"].values
submission_df["pm_t_f"] = uhi_df.iloc[indices]["pm_t_f"].values
submission_df["am_t_f"] = uhi_df.iloc[indices]["am_t_f"].values
submission_df["af_t_f"] = uhi_df.iloc[indices]["af_t_f"].values
submission_df["pm_hi_f"] = uhi_df.iloc[indices]["pm_hi_f"].values
submission_df["am_hi_f"] = uhi_df.iloc[indices]["am_hi_f"].values
submission_df["af_hi_f"] = uhi_df.iloc[indices]["af_hi_f"].values
submission_df["bldgarea"] = uhi_df.iloc[indices]["bldgarea"].values
submission_df["numfloors"] = uhi_df.iloc[indices]["numfloors"].values
submission_df["unitsres"] = uhi_df.iloc[indices]["unitsres"].values
submission_df["unitstotal"] = uhi_df.iloc[indices]["unitstotal"].values
submission_df["bldgfront"] = uhi_df.iloc[indices]["bldgfront"].values
submission_df["bldgdepth"] = uhi_df.iloc[indices]["bldgdepth"].values
submission_df["lotarea"] = uhi_df.iloc[indices]["lotarea"].values
submission_df["residfar"] = uhi_df.iloc[indices]["residfar"].values
submission_df["commfar"] = uhi_df.iloc[indices]["commfar"].values
submission_df["facilfar"] = uhi_df.iloc[indices]["facilfar"].values
submission_df["garagearea"] = uhi_df.iloc[indices]["garagearea"].values
submission_df["strgearea"] = uhi_df.iloc[indices]["strgearea"].values
submission_df["factryarea"] = uhi_df.iloc[indices]["factryarea"].values
submission_df["assessland"] = uhi_df.iloc[indices]["assessland"].values
submission_df["yearbuilt"] = uhi_df.iloc[indices]["yearbuilt"].values
submission_df["yearalter1"] = uhi_df.iloc[indices]["yearalter1"].values
submission_df["yearalter2"] = uhi_df.iloc[indices]["yearalter2"].values
submission_df["temp_index"] = uhi_df.iloc[indices]["temp_index"].values
submission_df["PR_RENT"] = uhi_df.iloc[indices]["PR_RENT"].values
submission_df["P_RENT"] = uhi_df.iloc[indices]["P_RENT"].values
submission_df["OVERALL_RANK"] = uhi_df.iloc[indices]["OVERALL_RANK"].values
submission_df["OVERALL_SCORE"] = uhi_df.iloc[indices]["OVERALL_SCORE"].values
submission_df["P_OZONE"] = uhi_df.iloc[indices]["P_OZONE"].values
submission_df["PR_OZONE"] = uhi_df.iloc[indices]["PR_OZONE"].values
submission_df["PR_PM25"] = uhi_df.iloc[indices]["PR_PM25"].values
submission_df["P_PM25"] = uhi_df.iloc[indices]["P_PM25"].values
submission_df["NBE_SCORE"] = uhi_df.iloc[indices]["NBE_SCORE"].values
submission_df["NBE_RANK"] = uhi_df.iloc[indices]["NBE_RANK"].values
submission_df["POP"] = uhi_df.iloc[indices]["POP"].values
submission_df["PR_HRI"] = uhi_df.iloc[indices]["PR_HRI"].values
submission_df["F_HRI"] = uhi_df.iloc[indices]["F_HRI"].values
submission_df["HVI"] = uhi_df.iloc[indices]["HVI"].values
submission_df["gr_area"] = uhi_df.iloc[indices]["gr_area"].values
submission_df["bldg_area"] = uhi_df.iloc[indices]["bldg_area"].values
submission_df["prop_gr"] = uhi_df.iloc[indices]["prop_gr"].values
submission_df["heightroof"] = uhi_df.iloc[indices]["heightroof"].values
submission_df["groundelev"] = uhi_df.iloc[indices]["groundelev"].values
submission_df["NBAI"] = uhi_df.iloc[indices]["NBAI"].values
submission_df["UHII_Value"] = uhi_df.iloc[indices]["UHII_Value"].values


# Feature Engineering for Submission Data
submission_df["building_density_ratio"] = submission_df["building_density"] / (submission_df["nearest_building_area"] + 1)
submission_df["log_building_perimeter"] = np.log1p(submission_df["nearest_building_perimeter"])
submission_df["log_LST"] = np.log1p(submission_df["LST"])  # log(LST + 1) to avoid log(0)
submission_df["log_building_density_ratio"] = np.log1p(submission_df["building_density_ratio"])
submission_df["building_density_LST_interaction"] = submission_df["building_density"] * submission_df["LST"]
submission_df["building_density_ratio_squared"] = submission_df["building_density_ratio"] ** 2
submission_df["SAVI_LST_sqrt_diff"] = np.sqrt(np.abs(submission_df["SAVI"] - submission_df["LST"]))
submission_df["Wind_Speed_x_Building_Density"] = submission_df["avg_wind_speed_merge_"] * submission_df["building_density"]
submission_df["temp_range"] = submission_df["af_hi_f"] - submission_df["af_t_f"]  
submission_df["am_pm_temp_diff"] = submission_df["pm_t_f"] - submission_df["am_t_f"]  
submission_df["hi_temp_diff"] = submission_df["af_hi_f"] - submission_df["am_hi_f"] 
submission_df["weighted_temp"] = (0.6 * submission_df["af_t_f"]) + (0.4 * submission_df["pm_t_f"])
submission_df["temp_rate_change"] = (submission_df["af_t_f"] - submission_df["am_t_f"]) / 12  

# Select Features for Prediction (Excluding Lat/Lon)
X_submission = submission_df[
    ['LST',
     'building_density',
     'temp_2m_',
     'wind_direction_merge_',
     'solar_insolation_',
     'building_density_ratio',
     'log_LST',
     'log_building_density_ratio',
     'building_density_LST_interaction',
     'building_density_ratio_squared',
     'SAVI_LST_sqrt_diff',
     'Wind_Speed_x_Building_Density', 
     'mean_temp', 'temp_deviation', 'temp_deviation_smooth',
     'Nearest_AirTemp_C', 'Temp_Anomaly',
     'pm_hi_f', 'am_hi_f', 'af_hi_f',
     'pm_t_f', 'am_t_f', 'af_t_f',
     'temp_index',
     'temp_range', 'am_pm_temp_diff', 'hi_temp_diff', 'weighted_temp', 'temp_rate_change'
     ]
]

# Predict UHI Index for Submission File
submission_df["UHI Index"] = (
    0.2*best_rf.predict(X_submission) + 0.6*extra_trees.predict(X_submission) + 0.1*xgb_model.predict(X_submission) 
    + 0.1*lgb_model.predict(X_submission)
)

# Save the Updated Submission File
submission_df[['Longitude', 'Latitude', 'UHI Index']].to_csv(submission_updated_path, index=False)
print(f"\nSubmission file saved to {submission_updated_path}")

# Extract feature importances
importances_rf = best_rf.feature_importances_
importances_et = extra_trees.feature_importances_

# Create a DataFrame to store feature importances
feature_names = X.columns
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Random Forest Importance': importances_rf,
    'Extra Trees Importance': importances_et
})

# Sort by Random Forest importance
feature_importance_df = feature_importance_df.sort_values(by="Random Forest Importance", ascending=False)

# Save feature importances to CSV
feature_importance_df.to_csv("feature_importances269.csv", index=False)
print("\nFeature importances saved to feature_importances.csv")

  uhi_df = pd.read_csv(uhi_updated_path)
65 fits failed out of a total of 250.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
50 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/Users/helenhsu/opt/anaconda3/lib/python3.9/site-packages/skle

XGBoost R² Score: 0.9921
LightGBM R² Score: 0.9949

Model Performance Metrics:
               Metric     Score
0           R-squared  0.988833
1    Out-of-Bag Score  0.987113
2   Mean CV R-squared  0.926308
3  Ensemble R-squared  0.960562

Submission file saved to Submission269.csv

Feature importances saved to feature_importances.csv
