# **Sentinel-2 Pre-Processing**

ML-based Local Climate Zone (LCZ) classification requires all input datasets to be homogenized in terms of extent, spatial resolution, and projection. This notebook prepares the Sentinel-2 data downloaded in the 01_Data_Aquisition notebook for ML model training. The steps are:

1. Project Setup
2. Merge tiles from each band
3. Clip merged tiles to extent of study area
4. Resample to 30 m resolution, as recommended by Absaraori et al. 2024


### **1. Project Setup**

#### 1.1 Import Libraries

In [3]:
%load_ext autoreload
%autoreload 

import sys
import os

# Add the module's parent directory to sys.path
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)


## Import local LCZ Classification libraries
from lcz_classification.config import *
from lcz_classification.util import merge_rasters, resample_da, clip_raster, tiles_from_bbox
from lcz_classification.dataset import fetch_metadata
## Import required libraries
import rioxarray as rio
import pandas as pd
import geopandas as gpd
import xarray as xr


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


#### 1.2 Load Study Area

In [None]:
study_area = fetch_metadata('STUDY_AREA')
bounds=study_area.total_bounds
UTM_CRS=study_area.estimate_utm_crs()

tiles=tiles_from_bbox(bounds, tile_dims=(4,4))
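
The cell above derives a local UTM CRS and a grid of tiles from the study area bounds. As a rough illustration of the arithmetic behind both steps, the sketch below uses only the standard library. Both function names are hypothetical: `utm_epsg_from_lonlat` applies the standard 6-degree zone formula for WGS 84 and ignores the Norway and Svalbard exceptions, whereas geopandas delegates the real lookup to pyproj. `split_bbox` is only one plausible reading of `tiles_from_bbox`, and it assumes `tile_dims` means (columns, rows).

```python
import math


def utm_epsg_from_lonlat(lon: float, lat: float) -> int:
    """Return the WGS 84 / UTM EPSG code for a point given in degrees.

    Assumption: plain 6-degree zones, no special-case zones.
    """
    zone = int(math.floor((lon + 180.0) / 6.0)) % 60 + 1  # zones run 1..60
    base = 32600 if lat >= 0 else 32700  # 326xx = north, 327xx = south
    return base + zone


def split_bbox(bounds, tile_dims):
    """Split (minx, miny, maxx, maxy) into a grid of equally sized sub-boxes.

    Assumption: tile_dims is (n_cols, n_rows); tiles are returned row by row
    starting from the lower-left corner.
    """
    minx, miny, maxx, maxy = bounds
    n_cols, n_rows = tile_dims
    dx = (maxx - minx) / n_cols
    dy = (maxy - miny) / n_rows
    tiles = []
    for row in range(n_rows):
        for col in range(n_cols):
            tiles.append((
                minx + col * dx,
                miny + row * dy,
                minx + (col + 1) * dx,
                miny + (row + 1) * dy,
            ))
    return tiles
```

Under these assumptions, a `tile_dims` of `(4, 4)` yields 16 sub-boxes that together cover the original bounding box.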


### **2. Merge Tiles from Each Band**
    
This section reads the band files from each scene in the Sentinel-2 data directory and merges them into a single raster per band using the merge_rasters() function. The merged band rasters are clipped to the bounding box of the study area loaded during project setup (Section 1).
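
To make the merge step more concrete, the sketch below computes the grid a mosaic would occupy: the union of the input tile bounds and the resulting raster dimensions at a given cell size. It is a simplified illustration, not the implementation of merge_rasters(), which is not shown in this notebook. The function name `merged_grid` and the example bounds are hypothetical.

```python
import math


def merged_grid(tile_bounds, resolution):
    """Return the union bounds and (height, width) of a mosaic of tiles.

    tile_bounds: iterable of (minx, miny, maxx, maxy) in the same CRS.
    resolution: cell size in CRS units (assumed square pixels).
    """
    tile_bounds = list(tile_bounds)
    minx = min(b[0] for b in tile_bounds)
    miny = min(b[1] for b in tile_bounds)
    maxx = max(b[2] for b in tile_bounds)
    maxy = max(b[3] for b in tile_bounds)
    width = math.ceil((maxx - minx) / resolution)
    height = math.ceil((maxy - miny) / resolution)
    return (minx, miny, maxx, maxy), (height, width)


# Two overlapping, illustrative tiles at a 10-unit cell size
bounds, shape = merged_grid([(0, 0, 100, 100), (90, 0, 190, 100)], 10)
```

Overlapping areas count only once toward the extent, so the two 100-unit-wide tiles above produce a mosaic 190 units wide.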

#### 2.1. Prepare DataFrame of Available Scenes

In [3]:


sent2_dict=dict() # Create an empty dictionary

# Retrieve Sentinel-2 scene directories from S2_RAW, skipping GeoJSON files
s2_tiles = [f"{S2_RAW}/{scene}" for scene in os.listdir(S2_RAW) if ".geojson" not in scene] # prepare file paths of Sentinel-2 scene directories

# Create a Pandas DataFrame of the available Sentinel-2 Scenes
s2_dfs=list()
for tile_path in s2_tiles:
    band_files = sorted(os.listdir(tile_path)) # list the directory once so band names and file paths stay aligned

    scene_df=pd.DataFrame(
        data=dict(
            band = [x.split(".")[0] for x in band_files], # band name = file name without extension
            file_path=[f"{tile_path}/{x}" for x in band_files]

        )
    )
    tile_id=tile_path.split("/")[-1] # the directory name identifies the tile
    # scene_id=tile_path.split("/")[-1]
    scene_df["tile_id"] = tile_id
    # scene_df["date"] = scene_id.split("_")[2]
    s2_dfs.append(scene_df)

# Concatenate the per-scene DataFrames into one table holding the metadata needed to filter and read tiles in the next steps.
s2_df=pd.concat(s2_dfs)
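
The DataFrame built above is essentially a mapping from band name to the files that carry that band in each scene. A dependency-free sketch of the same idea follows. It assumes a hypothetical layout of one sub-directory per scene containing one `.tif` file per band; the name `index_bands` and the demo directory are illustrative only.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def index_bands(raw_dir):
    """Map each band name (file stem) to the sorted list of its file paths."""
    band_paths = defaultdict(list)
    for scene_dir in sorted(Path(raw_dir).iterdir()):
        if not scene_dir.is_dir():
            continue  # skip loose files such as a .geojson footprint
        for tif in sorted(scene_dir.glob("*.tif")):
            band_paths[tif.stem].append(str(tif))
    return dict(band_paths)


# Demo on a throwaway directory that mimics the assumed layout
with tempfile.TemporaryDirectory() as tmp:
    for scene in ("scene_a", "scene_b"):
        for band in ("B02", "B8A"):
            path = Path(tmp, scene, f"{band}.tif")
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch()
    Path(tmp, "footprint.geojson").touch()
    demo_index = index_bands(tmp)
```

Each key of `demo_index` then plays the role of one group in a `groupby("band")` on the DataFrame.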


#### 2.2. Merge tiles from all scenes into single bands and clip

In [4]:

# Group scene DataFrame by band and iterate over each band.
sent2_grouped=s2_df.groupby("band") 
for band_name, band_df in sent2_grouped:
    raster_paths = band_df.file_path.tolist() # Get GeoTIFF file paths of all tiles under this band

    out_path=f"{S2_MERGED}/{band_name}.tif"  # Configure output file path for merged band raster

    if os.path.exists(out_path):
        print(f"Already exists: {out_path}")
    else:
        merge_rasters(raster_paths,out_path, None) # Merge all band tiles into a single raster, pass raster file paths as a list
        print(f"Exported merged raster for {band_name}.tif")
    

Exported merged raster for B02.tif
Exported merged raster for B03.tif
Exported merged raster for B04.tif
Exported merged raster for B05.tif
Exported merged raster for B06.tif
Exported merged raster for B07.tif
Exported merged raster for B11.tif
Exported merged raster for B12.tif
Exported merged raster for B8A.tif


### **3. Clip Band Tiles**

In [5]:

# band_tiles_fp = [f"{S2_MERGED}/{band}" for band in os.listdir(S2_MERGED)] # prepare file paths of merged band rasters

# # Iterate over each band raster
# for band_tile_fp in band_tiles_fp:

#     clipped_path=S2_CLIPPED + "/" + band_tile_fp.split("/")[-1] # Configure output path of clipped raster (per band)
#     if os.path.exists(clipped_path):
#         print(f"Already exists: {clipped_path}")
#     else:
#         # Clip raster using clip_raster() method
#         clip_raster(raster_path=band_tile_fp,
#                     gdf=study_area,
#                     bbox=None,
#                     out_path=clipped_path
#                     )
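
Clipping a raster to a bounding box comes down to converting map coordinates into a pixel window. The sketch below shows that conversion for a north-up raster with square pixels. It is an illustration only: the function name `bbox_to_window` is hypothetical, clip_raster() may work differently, and the window is not clamped to the raster extent.

```python
import math


def bbox_to_window(bbox, origin_x, origin_y, resolution):
    """Convert map bounds to (row_start, row_stop, col_start, col_stop).

    origin_x, origin_y: map coordinates of the raster's top-left corner.
    The window is expanded outward so it fully covers the bounding box.
    """
    minx, miny, maxx, maxy = bbox
    col_start = math.floor((minx - origin_x) / resolution)
    col_stop = math.ceil((maxx - origin_x) / resolution)
    row_start = math.floor((origin_y - maxy) / resolution)  # rows count down from the top edge
    row_stop = math.ceil((origin_y - miny) / resolution)
    return row_start, row_stop, col_start, col_stop


# Raster with top-left corner at (0, 100) and 10-unit cells
window = bbox_to_window((25, 35, 65, 85), origin_x=0, origin_y=100, resolution=10)
```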
   


### **4. Resample to 30 m and Stack into a Single GeoTIFF**

#### 4.1 Resample Band Rasters to 30 m 

In [8]:
from lcz_classification.util import get_target_shape
from rasterio.enums import Resampling
# Read merged band rasters

band_tiles_fp = [f"{S2_MERGED}/{band}" for band in os.listdir(S2_MERGED)] # prepare file paths of merged band rasters
band_tiles=[rio.open_rasterio(band_tile_fp).sel(band=1) for band_tile_fp in band_tiles_fp] # Read all band rasters into a list of xarray DataArrays

# reproj=[band_tile.rio.reproject(dst_crs=UTM_CRS) for band_tile in band_tiles]

# Stack all bands into a single DataArray along the "band" dimension
s2=xr.concat(band_tiles, dim="band")
band_names = [x.split("/")[-1][:3] for x in band_tiles_fp] # band name = first three characters of the file name, e.g. B02
s2['band'] = band_names # update band names
s2.attrs["bands"] = band_names
s2 = s2.rio.reproject(dst_crs=UTM_CRS) # reproject to project CRS - local UTM zone derived from GeoDataFrame.estimate_utm_crs()

# Compute the output shape for the target cell size
target_shape = get_target_shape(s2.isel(band=0), CELL_RESOLUTION)

# Resample to the target resolution in the same CRS, averaging the contributing source pixels
s2_resampled = s2.rio.reproject(
        s2.rio.crs,
        shape=target_shape,
        resampling=Resampling.average,
    )


# Export Resampled Multiband Raster 
s2_resampled.rio.to_raster(S2_FP.replace(".tif","_1.tif")) # Write to GeoTIFF
print("Exported multiband sentinel-2 data")

Resampling input raster to 30 m resolution
Exported multiband sentinel-2 data
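
The resampling step relies on two ideas: working out how many rows and columns the coarser grid needs, and averaging the fine pixels that fall into each coarse cell. The sketch below illustrates both with plain Python lists. The names `target_shape_for` and `block_average` are hypothetical, and the implementation of get_target_shape() is not shown in this notebook, so this is only one way it could work. The averaging assumes the coarse cell size is an integer multiple of the source cell size and ignores nodata values, which real average resampling handles.

```python
import math


def target_shape_for(height, width, src_res, dst_res):
    """Rows and columns needed to cover the same extent at a new cell size."""
    return (
        math.ceil(height * src_res / dst_res),
        math.ceil(width * src_res / dst_res),
    )


def block_average(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2D list.

    Trailing rows or columns that do not fill a whole block are dropped.
    """
    rows = len(grid) // factor
    cols = len(grid[0]) // factor
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            block = [
                grid[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 10980 x 10980 grid of 10 m pixels covers the same area as 3660 x 3660 at 30 m
shape_30m = target_shape_for(10980, 10980, 10, 30)

# Nine 10 m pixels collapse into one 30 m pixel holding their mean
coarse = block_average([[1, 2, 3], [4, 5, 6], [7, 8, 9]], 3)
```

Averaging, rather than nearest-neighbour sampling, keeps the 30 m value representative of all the source pixels it replaces.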
