# 0. Setup

## 0.1. Libraries

In [1]:
# General utilities
import os
import pandas as pd

# Image processing
from PIL import Image

# Custom library
from library.utilities import print_directory_structure
from library.preprocessing import process_labelled_dataset

## 0.2. Data paths and structure description

In [2]:
input_labelled_dir = '/media/pablo/Shared files/data/Satellite_burned_area_dataset'  # Folder containing the labelled dataset
input_inference_dir = os.path.join('..', 'catalunya_fire_imgs')

Let's check the structure of both directories.

In [5]:
print("=== Labelled Dataset Structure ===")
print_directory_structure(input_labelled_dir)

=== Labelled Dataset Structure ===
Satellite_burned_area_dataset/
  Satellite_burned_area_dataset_part1/
    EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector/
      EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector_mask.png
      EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector_mask.tiff
      sentinel1_2017-05-07.png
      sentinel1_2017-05-07.tiff
      sentinel1_2017-05-07_coverage.png
      sentinel1_2017-07-01.png
      sentinel1_2017-07-01.tiff
      sentinel1_2017-07-01_coverage.png
      sentinel2_2017-05-25.png
      sentinel2_2017-05-25.tiff
      sentinel2_2017-05-25_coverage.png
      sentinel2_2017-07-04.png
      sentinel2_2017-07-04.tiff
      sentinel2_2017-07-04_coverage.png
    EMSR207_02LOUSA_02GRADING_MAP_v2_vector/
      EMSR207_02LOUSA_02GRADING_MAP_v2_vector_mask.png
      EMSR207_02LOUSA_02GRADING_MAP_v2_vector_mask.tiff
      sentinel1_2017-05-07.png
      sentinel1_2017-05-07.tiff
      sentinel1_2017-05-07_coverage.png
      sentinel1_2017-05-08.png
      sent

- The whole dataset is split into 5 parts for convenience (due to the size of the images).
- Each part contains one folder per fire.
- Each fire folder contains the following images:
  - Pre-fire and post-fire images (approximately one month before and one month after the fire, respectively). They include Sentinel-1 GRD (IW mode with VV+VH polarization) and Sentinel-2 L2A images, both in `.png` and in `.tiff` (lossless format). Each file is named after the product (`sentinel1` or `sentinel2`) followed by the acquisition date of the image.
  - The coverage for each of the images, "a binary coverage mask, indicating invalid pixel values." Invalid pixels are black and valid ones are white.
  - Finally, the mask with the segmentation of the area affected by the wildfire and its severity, both in `.png` and in `.tiff` (lossless format). "Annotations released by Copernicus specify 5 different severity levels [0 = undamaged to 4 = fully destroyed], which were encoded in a greyscale image in range [0 − 255], from no damage to completely destroyed area."

> For our task, we are only interested in the post-fire images in the lossless `.tiff` format (the actual image, its coverage and the severity mask), as we want to train a model that assesses fire severity.
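
To make the severity encoding concrete, the sketch below maps greyscale mask values back to the 5 severity levels. It assumes the levels are spread evenly over [0, 255]; the exact grey values used in the dataset are not stated here, so this should be checked against the unique values found in the real masks. `grey_to_severity` is a hypothetical helper, not part of the dataset's tooling or of our `library`.

```python
def grey_to_severity(value, n_levels=5, max_grey=255):
    """Map a greyscale mask value to a severity level in [0, n_levels - 1].

    ASSUMPTION: the levels are evenly spaced over [0, max_grey]. Verify this
    against the unique values of the real masks before relying on it.
    """
    if not 0 <= value <= max_grey:
        raise ValueError(f"grey value out of range: {value}")
    # Rescale to [0, n_levels - 1] and snap to the nearest level
    return round(value / max_grey * (n_levels - 1))


# Example: one row of mask pixels (illustrative values, not read from the dataset)
row = [0, 64, 128, 191, 255]
levels = [grey_to_severity(v) for v in row]
print(levels)  # → [0, 1, 2, 3, 4]
```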

In [5]:
print("\n=== Catalunya Fire Images Structure ===")
print_directory_structure(input_inference_dir)


=== Catalunya Fire Images Structure ===
2015/
  images/
    sentinel2_2015-07-13_lon3-161_lat42-432.tiff
    sentinel2_2015-07-26_lon1-719_lat41-635.tiff
    sentinel2_2015-08-03_lon0-708_lat42-826.tiff
    sentinel2_2015-10-22_lon0-496_lat41-004.tiff
    sentinel2_2015-11-05_lon0-743_lat41-402.tiff
  metadata/
    sentinel2_2015-07-13_lon3-161_lat42-432.json
    sentinel2_2015-07-26_lon1-719_lat41-635.json
    sentinel2_2015-08-03_lon0-708_lat42-826.json
    sentinel2_2015-10-22_lon0-496_lat41-004.json
    sentinel2_2015-11-05_lon0-743_lat41-402.json
2016/
  images/
    sentinel2_2016-05-09_lon1-355_lat42-340.tiff
    sentinel2_2016-06-01_lon1-471_lat41-229.tiff
    sentinel2_2016-06-08_lon2-992_lat42-118.tiff
    sentinel2_2016-06-23_lon1-128_lat41-518.tiff
    sentinel2_2016-07-18_lon2-017_lat41-928.tiff
    sentinel2_2016-07-19_lon1-122_lat41-540.tiff
    sentinel2_2016-07-19_lon1-360_lat41-622.tiff
    sentinel2_2016-07-19_lon2-709_lat41-869.tiff
    sentinel2_2016-07-22_lon1-259

- One folder per year of fires.
- Each year folder has 2 subfolders: `images` and `metadata`.
- The `images` folder contains the images for each retrieved wildfire in Catalunya, for Sentinel-1 GRD (IW mode and VV+VH polarization) and Sentinel-2 (L2A) data, in lossless format only (`.tiff`).
  - All of these images correspond to the post-fire period: each one is the earliest satellite image with less than 10% cloud cover (for Sentinel-2) one month after the fire activation date.
  - The minimum dimension of each of these images is 512x512 pixels.
- The `metadata` folder contains useful metadata associated with each image (e.g., cloud cover as provided by Copernicus, the date when the image was taken, etc.).
- The file names consist of: the product (Sentinel-1 or Sentinel-2), the activation date of the fire, the longitude of the fire centroid (with 3 decimals, written after a dash that replaces the decimal point) and the latitude of the fire centroid (same format as the longitude).
- Note that there are no labels here (this dataset is just for inference).
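
As an illustration of this naming scheme, here is a minimal sketch that parses one of the file names listed above. `parse_inference_filename` is a hypothetical helper (it is not part of our `library`), and it assumes non-negative coordinates, which holds for the files shown but is not handled in general.

```python
import re
from datetime import date

# Matches names such as "sentinel2_2015-07-13_lon3-161_lat42-432.tiff"
FILENAME_RE = re.compile(
    r"^(?P<product>sentinel[12])_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_"
    r"lon(?P<lon_int>\d+)-(?P<lon_dec>\d+)_"
    r"lat(?P<lat_int>\d+)-(?P<lat_dec>\d+)"
    r"\.(?:tiff|json)$"
)


def parse_inference_filename(filename):
    """Split an inference file name into product, date and centroid.

    ASSUMPTION: longitude and latitude are non-negative (no sign handling).
    """
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "product": match["product"],
        "activation_date": date.fromisoformat(match["date"]),
        # The dash stands in for the decimal point
        "longitude": float(f"{match['lon_int']}.{match['lon_dec']}"),
        "latitude": float(f"{match['lat_int']}.{match['lat_dec']}"),
    }


info = parse_inference_filename("sentinel2_2015-07-13_lon3-161_lat42-432.tiff")
print(info)
```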

## 0.3. Notebook description

In this notebook we preprocess the image data to prepare it for training/validation/test and for inference. To that end, we pre-process 2 different sources of data:
1) The labelled dataset from *Luca Colomba, Alessandro Farasin, Simone Monaco, Salvatore Greco, Paolo Garza, Daniele Apiletti, Elena Baralis, and Tania Cerquitelli. 2022. A Dataset for Burned Area Delineation and Severity Estimation from Satellite Imagery. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22). Association for Computing Machinery, New York, NY, USA, 3893–3897. https://doi.org/10.1145/3511808.3557528*. The dataset can be downloaded from here: https://zenodo.org/records/6597139 (click on "Download all", which will approximately download 14GB worth of images and labels).
2) The inference dataset for the fires in Catalunya, obtained by our own means through the APIs of [Sentinel Hub](https://dataspace.copernicus.eu/analyse/apis/sentinel-hub) (see the notebooks for part 1). 

To that end, in this notebook we follow the pre-processing steps outlined in the paper (for consistency):
  1) Since our research question focuses on assessing the severity of fires in Catalunya (inference), we will only be using the post-fire satellite imagery (like the authors do for their binary segmentation task).
  2) Generate 512x512 pixel tiles for the labelled train, validation, test and inference data. For images whose size is not a multiple of 512, overlap of tiles is allowed (in order to cover all of the area of the image). 
  3) Additionally, during the training phase, we will randomly shuffle the tiles and apply data augmentation techniques (see the corresponding training notebook).
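
The overlapping tiling in step 2 can be sketched as follows. This is one possible way to place the tiles (the last tile is shifted back so that it ends exactly at the image border); it is not necessarily what `process_labelled_dataset` or the paper's code does. `tile_offsets` and `tile_boxes` are hypothetical names.

```python
import math


def tile_offsets(length, tile_size):
    """Start positions of tiles covering `length` pixels along one axis.

    When `length` is not a multiple of `tile_size`, the last tile is shifted
    back so it ends at the border, overlapping the previous tile.
    """
    if length < tile_size:
        raise ValueError("image is smaller than the tile size")
    n_tiles = math.ceil(length / tile_size)
    return [min(i * tile_size, length - tile_size) for i in range(n_tiles)]


def tile_boxes(width, height, tile_size):
    """(left, top, right, bottom) boxes covering the whole image."""
    return [
        (x, y, x + tile_size, y + tile_size)
        for y in tile_offsets(height, tile_size)
        for x in tile_offsets(width, tile_size)
    ]


# Example: a 615x512 image (width x height) cut into 512x512 tiles
boxes = tile_boxes(width=615, height=512, tile_size=512)
print(boxes)  # → [(0, 0, 512, 512), (103, 0, 615, 512)]
```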

> The final goal is to train a model that estimates the severity of a wildfire using post-wildfire (1 month after) imagery, and then to run inference with the satellite images from wildfires in Catalunya in the last decade. Severity estimation will be done at the pixel level.

In the paper, the authors of the dataset also provide additional metadata and suggest how to organize the folds for training/validation (see the `fold` column): "*[d]ata were split into 7 folds according to geographical location, such that data belonging to the same fold maintains similar morphological features.*"

This will be taken into account for training-validation.
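
One way to use the suggested folds is a leave-one-fold-out split, sketched below. This is an assumption about how the folds could be used, not the protocol from the paper; `leave_one_fold_out` is a hypothetical helper and the example mapping only contains a few folder/fold pairs.

```python
from collections import defaultdict


def leave_one_fold_out(fold_by_folder):
    """Yield (validation_fold, train_folders, validation_folders) per fold."""
    folders_by_fold = defaultdict(list)
    for folder, fold in fold_by_folder.items():
        folders_by_fold[fold].append(folder)

    for val_fold in sorted(folders_by_fold):
        val_folders = sorted(folders_by_fold[val_fold])
        train_folders = sorted(
            folder
            for fold, folders in folders_by_fold.items()
            if fold != val_fold
            for folder in folders
        )
        yield val_fold, train_folders, val_folders


# Illustrative subset of (folder -> fold) pairs
fold_by_folder = {
    "EMSR214_05LELAVANDOU_02GRADING_MAP_v1_vector": "purple",
    "EMSR207_04AVELAR_02GRADING_MAP_v2_vector": "coral",
    "EMSR209_01MOGUER_02GRADING_MAP_v2_vector": "pink",
    "EMSR298_02HAMMARSTRAND_02GRADING_MAP_v1_vector": "grey",
}

splits = list(leave_one_fold_out(fold_by_folder))
for val_fold, train_folders, val_folders in splits:
    print(val_fold, len(train_folders), len(val_folders))
```

With the metadata loaded into a DataFrame, the mapping could be built with `dict(zip(df['folder'], df['fold']))`.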

In [6]:
df = pd.read_csv(os.path.join(input_labelled_dir, 'satellite_data.csv'))

print("Number of rows (different fires) in the DataFrame:", len(df))
print("Columns in the DataFrame:", df.columns.tolist())

df.head()

Number of rows (different fires) in the DataFrame: 73
Columns in the DataFrame: ['folder', 'top_left_long', 'top_left_lat', 'bottom_right_long', 'bottom_right_lat', 'activation_date', 'suggested_pre_date_from', 'suggested_pre_date_to', 'suggested_post_date_from', 'suggested_post_date_to', 'download_period_start', 'download_period_end', 'height', 'width', 'pre_date', 'post_date', 'situation_as_of', 'longitude', 'latitude', 'fold']


Unnamed: 0,folder,top_left_long,top_left_lat,bottom_right_long,bottom_right_lat,activation_date,suggested_pre_date_from,suggested_pre_date_to,suggested_post_date_from,suggested_post_date_to,download_period_start,download_period_end,height,width,pre_date,post_date,situation_as_of,longitude,latitude,fold
0,EMSR214_05LELAVANDOU_02GRADING_MAP_v1_vector,6.104806,43.342188,6.452223,43.065393,24/07/2017,2017-06-13,2017-07-14,2017-07-24,2017-08-24,24/06/2017,24/08/2017,522,512,14/07/2017 10:30,18/08/2017 10:34,28/07/2017,6.278515,43.20379,purple
1,EMSR207_04AVELAR_02GRADING_MAP_v2_vector,-8.499468,40.026117,-8.307374,39.904662,17/06/2017,2017-05-07,2017-06-07,2017-06-17,2017-07-18,17/05/2017,17/07/2017,512,615,,,20/06/2017,-8.403421,39.965389,coral
2,EMSR207_08CERNACHEDOBONJARDIM_02GRADING_MAP_v2...,-8.307667,39.904662,-8.116087,39.783514,17/06/2017,2017-05-07,2017-06-07,2017-06-17,2017-07-18,17/05/2017,17/07/2017,512,613,,,20/06/2017,-8.211877,39.844088,coral
3,EMSR209_01MOGUER_02GRADING_MAP_v2_vector,-6.84764,37.291966,-6.702477,37.166775,24/06/2017,2017-05-14,2017-06-14,2017-06-24,2017-07-25,24/05/2017,24/07/2017,579,512,01/06/2017 11:12,01/07/2017 11:17,,-6.775058,37.229371,pink
4,EMSR298_02HAMMARSTRAND_02GRADING_MAP_v1_vector,16.361547,63.14044,16.447381,63.099674,16/07/2018,2018-06-05,2018-07-06,2018-07-16,2018-08-16,16/06/2018,16/08/2018,561,512,14/07/2018 10:30,27/07/2018 10:40,,16.404464,63.120057,grey


# 1. Pre-processing the labelled data

Structure of the new folder:

```
tiled_fire_dataset/
  FIRE_EVENT_NAME/
    images/
      sentinel1_YYYY-MM-DD_tile_i_j.tiff
      sentinel2_YYYY-MM-DD_tile_i_j.tiff
    coverages/
      sentinel1_YYYY-MM-DD_coverage_tile_i_j.tiff
      sentinel2_YYYY-MM-DD_coverage_tile_i_j.tiff
    masks/
      FIRE_EVENT_NAME_mask_tile_i_j.tiff
```
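
The layout above can be expressed as a small path-building helper. This is only a sketch: `tile_path` is a hypothetical function (not part of `library.preprocessing`), and it assumes that `i` and `j` are the tile's row and column indices, which the layout itself does not specify.

```python
from pathlib import PurePosixPath

# Subfolder and file-name suffix for each kind of tile
SUBFOLDERS = {"image": "images", "coverage": "coverages", "mask": "masks"}
SUFFIXES = {"image": "", "coverage": "_coverage", "mask": "_mask"}


def tile_path(root, fire_event, kind, stem, i, j):
    """Build the output path of one tile.

    `stem` is e.g. "sentinel2_2017-07-04" for images and coverages, and the
    fire event name for masks. ASSUMPTION: i, j are tile row/column indices.
    """
    if kind not in SUBFOLDERS:
        raise ValueError(f"unknown tile kind: {kind}")
    name = f"{stem}{SUFFIXES[kind]}_tile_{i}_{j}.tiff"
    return PurePosixPath(root) / fire_event / SUBFOLDERS[kind] / name


event = "EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector"
print(tile_path("tiled_fire_dataset", event, "image", "sentinel2_2017-07-04", 0, 1))
print(tile_path("tiled_fire_dataset", event, "mask", event, 0, 1))
```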

In [5]:
# First, let's get the minimum dimension of the image - this will be used to 
# determine the size of the tiles we will create
from library.preprocessing.labelled_dataset.find_fire_event_folders import find_fire_event_folders
import glob
from PIL import Image

# Find all individual fire event folders
all_fire_folders = find_fire_event_folders(input_labelled_dir)

min_dimension = float('inf')
min_dim_file = None

# Iterate through each folder to find the minimum image dimension
for folder in all_fire_folders:
    # Find all .tiff files in the current folder
    tiff_files = glob.glob(os.path.join(folder, "*.tiff"))
    
    for tiff_file in tiff_files:
        try:
            with Image.open(tiff_file) as img:
                width, height = img.size
                # Find the smaller dimension of the current image
                current_min = min(width, height)
                # If it's the smallest we've seen so far, update the overall minimum
                if current_min < min_dimension:
                    min_dimension = current_min
                    min_dim_file = tiff_file
        except Exception as e:
            print(f"Could not process file {tiff_file}: {e}")

if min_dimension != float('inf'):
    print(f"The minimum dimension found across all .tiff images is: {min_dimension} pixels.")
    print(f"File with minimum dimension: {min_dim_file}")
else:
    print("No .tiff images were found.")

More samples per pixel than can be decoded: 13

Searching for dataset parts in: /media/pablo/Shared files/data/Satellite_burned_area_dataset
Found part 1 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1
Found part 2 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part2
Found part 3 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part3
Found part 4 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part4
Found part 5 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5
Found 73 fire event folders
Example folder: /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1/EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector
Could not process file /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1/EMSR207_01MIRANDADOCORVO_02GRADI



Could not process file /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5/EMSR221_01OGLIASTRO_02GRADING_MAP_v2_vector/sentinel2_2017-07-03.tiff: cannot identify image file '/media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5/EMSR221_01OGLIASTRO_02GRADING_MAP_v2_vector/sentinel2_2017-07-03.tiff'
Could not process file /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5/EMSR221_01OGLIASTRO_02GRADING_MAP_v2_vector/sentinel2_2017-07-06.tiff: cannot identify image file '/media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5/EMSR221_01OGLIASTRO_02GRADING_MAP_v2_vector/sentinel2_2017-07-06.tiff'
Could not process file /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5/EMSR221_01OGLIASTRO_02GRADING_MAP_v2_vector/sentinel2_2017-07-06_cloud_coverage.tiff: cannot identify imag

In [3]:
# First, let's get the minimum dimension of the image - this will be used to 
# determine the size of the tiles we will create
from library.preprocessing.labelled_dataset.find_fire_event_folders import find_fire_event_folders
import glob
import os
import rasterio

# Find all individual fire event folders
# Make sure input_labelled_dir is defined before this line
all_fire_folders = find_fire_event_folders(input_labelled_dir)

min_dimension = float('inf')
min_dim_file = None

# Iterate through each folder to find the minimum image dimension
for folder in all_fire_folders:
    # Find all .tiff files in the current folder
    tiff_files = glob.glob(os.path.join(folder, "*.tiff"))
    
    for tiff_file in tiff_files:
        try:
            with rasterio.open(tiff_file) as src:
                width, height = src.width, src.height
                # Find the smaller dimension of the current image
                current_min = min(width, height)
                # If it's the smallest we've seen so far, update the overall minimum
                if current_min < min_dimension:
                    min_dimension = current_min
                    min_dim_file = tiff_file
        except Exception as e:
            print(f"Could not process file {tiff_file}: {e}")

if min_dimension != float('inf'):
    print(f"The minimum dimension found across all .tiff images is: {min_dimension} pixels.")
    print(f"File with minimum dimension: {min_dim_file}")
else:
    print("No .tiff images were found.")

Searching for dataset parts in: /media/pablo/Shared files/data/Satellite_burned_area_dataset
Found part 1 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1
Found part 2 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part2
Found part 3 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part3
Found part 4 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part4
Found part 5 at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part5
Found 73 fire event folders
Example folder: /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1/EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector
The minimum dimension found across all .tiff images is: 261 pixels.
File with minimum dimension: /media/pablo/Shared files/data/Satellite_burned_area_da

  dataset = DatasetReader(path, driver=driver, sharing=sharing, **kwargs)


The minimum dimension across all labelled `.tiff` images is 261 pixels, so 512x512 tiles would not fit in every image. We therefore create 256x256 tiles: 256 is the largest power of 2 below 261, and half the 512x512 tile size used in the paper.
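
The choice of 256 can be checked with a short sketch. `largest_power_of_two_at_most` is a hypothetical helper, used here only to illustrate the reasoning.

```python
def largest_power_of_two_at_most(n):
    """Largest power of 2 that is less than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # bit_length() gives the position of the highest set bit
    return 1 << (n.bit_length() - 1)


tile_size = largest_power_of_two_at_most(261)
print(tile_size)  # → 256
```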

In [4]:
# Set this path according to your environment (absolute paths avoid confusion)
output_directory = '/media/pablo/Shared files/data/tiled_fire_dataset'  # Base output directory for saving tiled images

# Create the output directory once, if it doesn't exist
os.makedirs(output_directory, exist_ok=True)

# Loop over the different dataset parts (5 parts overall)
for part in range(1, 6):
    base_dataset_path = os.path.join(input_labelled_dir, f'Satellite_burned_area_dataset_part{part}')

    # Run the processing function for each part
    process_labelled_dataset(base_dataset_path, output_directory, tile_size=256)

Starting to process dataset at /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1
Output will be saved to /media/pablo/Shared files/data/tiled_fire_dataset
Searching for fire folders in single part: /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1
Found 17 fire event folders
Example folder: /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1/EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector

Processing folder: /media/pablo/Shared files/data/Satellite_burned_area_dataset/Satellite_burned_area_dataset_part1/EMSR207_01MIRANDADOCORVO_02GRADING_MAP_v2_vector


AttributeError: 'function' object has no attribute 'glob'
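
The traceback is cut off, so the failing line is not visible here. A common cause of this exact message is importing the function with `from glob import glob` and then calling `glob.glob(...)` as if `glob` were still the module. Whether that is what happens inside `process_labelled_dataset` is an assumption; the sketch below only reproduces the message and shows an unambiguous alternative (`list_tiffs` is a hypothetical helper).

```python
import glob as glob_module  # the module, under an unambiguous name
import os
from glob import glob       # the name `glob` now refers to the function


def reproduce_error():
    """Call `.glob` on the function instead of the module."""
    try:
        glob.glob("*.tiff")
    except AttributeError as exc:
        return str(exc)
    return None


def list_tiffs(folder):
    """List the .tiff files of a folder using the module explicitly."""
    return sorted(glob_module.glob(os.path.join(folder, "*.tiff")))


print(reproduce_error())  # → 'function' object has no attribute 'glob'
```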