# 0. Setup

## 0.1. Libraries

In [1]:
# General utilities
import os
import pandas as pd

# Image processing
from PIL import Image

## 0.2. Data paths

In [3]:
input_labelled_dir = '/media/pablo/Shared files/data/Satellite_burned_area_dataset'  # Adjust this path as needed
input_inference_dir = os.path.join('..', 'catalunya_fire_imgs')

## 0.3. Notebook description

In this notebook we pre-process the image data to prepare it for training, validation, testing and inference. We work with two sources of data:
1) The labelled dataset from *Luca Colomba, Alessandro Farasin, Simone Monaco, Salvatore Greco, Paolo Garza, Daniele Apiletti, Elena Baralis, and Tania Cerquitelli. 2022. A Dataset for Burned Area Delineation and Severity Estimation from Satellite Imagery. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management (CIKM '22). Association for Computing Machinery, New York, NY, USA, 3893–3897. https://doi.org/10.1145/3511808.3557528*. The dataset can be downloaded from https://zenodo.org/records/6597139 (click on "Download all", which downloads approximately 14 GB of images and labels).
2) The inference dataset for the fires in Catalunya, obtained by our own means through the APIs of [Sentinel Hub](https://dataspace.copernicus.eu/analyse/apis/sentinel-hub) (see the notebooks for part 1). 

For consistency, we follow the pre-processing steps outlined in the paper:
  1) Since our research question focuses on assessing the severity of fires in Catalunya (inference), we will only be using the post-fire satellite imagery (like the authors do for their binary segmentation task).
  2) Generate 512x512 pixel tiles for the labelled train, validation, test and inference data. For images whose size is not a multiple of 512, overlap of tiles is allowed (in order to cover all of the area of the image). 
  3) Additionally, during the training phase, we will randomly shuffle the tiles and apply data augmentation techniques (see the corresponding training notebook).
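
The tiling rule in step 2 can be sketched as follows. This is a minimal illustration, not necessarily the exact scheme used in the paper: it assumes that tiles are laid out on a regular 512-pixel grid and that, when a remainder is left, one extra tile is placed flush with the image border (so it overlaps its neighbour). The helper names `tile_starts` and `tile_boxes` are hypothetical.

```python
def tile_starts(length, tile=512):
    """Return start offsets of tiles of size `tile` that cover [0, length)."""
    if length < tile:
        # Assumption: images smaller than one tile are out of scope here.
        raise ValueError("image side is smaller than the tile size")
    starts = list(range(0, length - tile + 1, tile))
    # If the regular grid leaves a remainder, add one last tile flush with
    # the border; it overlaps the previous tile.
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts


def tile_boxes(height, width, tile=512):
    """Return (top, left, bottom, right) boxes covering a height x width image."""
    return [
        (top, left, top + tile, left + tile)
        for top in tile_starts(height, tile)
        for left in tile_starts(width, tile)
    ]


# A 522 x 512 image needs two tiles that overlap by 502 rows.
boxes = tile_boxes(522, 512)
print(boxes)  # [(0, 0, 512, 512), (10, 0, 522, 512)]
```

Note that `PIL.Image.crop` expects boxes as `(left, upper, right, lower)`, so the tuples above would need reordering before being passed to it.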

> The final goal is to train a model that estimates the severity of a wildfire using post-wildfire (1 month after) imagery, and then to run inference with the satellite images from wildfires in Catalunya in the last decade. Severity estimation will be done at the pixel level.

The authors of the dataset also provide metadata for each fire and a suggested assignment of fires to folds for training/validation:

In [2]:
input_dir = '/media/pablo/Shared files/data/Satellite_burned_area_dataset'  # Adjust this path as needed

df = pd.read_csv(os.path.join(input_dir, 'satellite_data.csv'))

print("Number of rows (different fires) in the DataFrame:", len(df))
print("Columns in the DataFrame:", df.columns.tolist())

df.head()

Number of rows (different fires) in the DataFrame: 73
Columns in the DataFrame: ['folder', 'top_left_long', 'top_left_lat', 'bottom_right_long', 'bottom_right_lat', 'activation_date', 'suggested_pre_date_from', 'suggested_pre_date_to', 'suggested_post_date_from', 'suggested_post_date_to', 'download_period_start', 'download_period_end', 'height', 'width', 'pre_date', 'post_date', 'situation_as_of', 'longitude', 'latitude', 'fold']


,folder,top_left_long,top_left_lat,bottom_right_long,bottom_right_lat,activation_date,suggested_pre_date_from,suggested_pre_date_to,suggested_post_date_from,suggested_post_date_to,download_period_start,download_period_end,height,width,pre_date,post_date,situation_as_of,longitude,latitude,fold
0,EMSR214_05LELAVANDOU_02GRADING_MAP_v1_vector,6.104806,43.342188,6.452223,43.065393,24/07/2017,2017-06-13,2017-07-14,2017-07-24,2017-08-24,24/06/2017,24/08/2017,522,512,14/07/2017 10:30,18/08/2017 10:34,28/07/2017,6.278515,43.20379,purple
1,EMSR207_04AVELAR_02GRADING_MAP_v2_vector,-8.499468,40.026117,-8.307374,39.904662,17/06/2017,2017-05-07,2017-06-07,2017-06-17,2017-07-18,17/05/2017,17/07/2017,512,615,,,20/06/2017,-8.403421,39.965389,coral
2,EMSR207_08CERNACHEDOBONJARDIM_02GRADING_MAP_v2...,-8.307667,39.904662,-8.116087,39.783514,17/06/2017,2017-05-07,2017-06-07,2017-06-17,2017-07-18,17/05/2017,17/07/2017,512,613,,,20/06/2017,-8.211877,39.844088,coral
3,EMSR209_01MOGUER_02GRADING_MAP_v2_vector,-6.84764,37.291966,-6.702477,37.166775,24/06/2017,2017-05-14,2017-06-14,2017-06-24,2017-07-25,24/05/2017,24/07/2017,579,512,01/06/2017 11:12,01/07/2017 11:17,,-6.775058,37.229371,pink
4,EMSR298_02HAMMARSTRAND_02GRADING_MAP_v1_vector,16.361547,63.14044,16.447381,63.099674,16/07/2018,2018-06-05,2018-07-06,2018-07-16,2018-08-16,16/06/2018,16/08/2018,561,512,14/07/2018 10:30,27/07/2018 10:40,,16.404464,63.120057,grey
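
As a rough sketch of how the `fold` column might be used, the snippet below splits fires into training, validation and test sets by fold name. It runs on a toy DataFrame rather than on `satellite_data.csv`: the `folder` values are placeholders, and the choice of which folds to hold out (`pink` and `grey` here) is an assumption made only for illustration. The helper `split_by_fold` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the metadata table: placeholder folder names, with the
# fold values of the five rows shown in the preview.
df_demo = pd.DataFrame({
    "folder": ["fire_a", "fire_b", "fire_c", "fire_d", "fire_e"],
    "fold": ["purple", "coral", "coral", "pink", "grey"],
})


def split_by_fold(df, val_folds, test_folds):
    """Split rows into (train, val, test) according to their fold name."""
    is_val = df["fold"].isin(val_folds)
    is_test = df["fold"].isin(test_folds)
    return df[~is_val & ~is_test], df[is_val], df[is_test]


# Assumption: hold out one fold for validation and one for testing.
train_df, val_df, test_df = split_by_fold(
    df_demo, val_folds=["pink"], test_folds=["grey"]
)
print(len(train_df), len(val_df), len(test_df))  # 3 1 1
```

Splitting by fold keeps all tiles from one fire in the same subset, which avoids leaking near-identical tiles between training and evaluation.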


# 1. Pre-processing the labelled data

"Annotations released by Copernicus specify 5 different severity levels, which were encoded in a greyscale image in range [0 − 255], from no damage to completely destroyed area." Labels have 5 levels of severity (0 = undamaged to 4 = fully destroyed).
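
A possible way to turn the greyscale annotation into class indices is sketched below. It assumes that the 5 severity levels are spread evenly over the [0, 255] range (roughly 0, 64, 128, 191 and 255); this should be checked against the actual mask values before relying on it. The function name `grey_to_severity` is hypothetical.

```python
import numpy as np


def grey_to_severity(mask, n_levels=5):
    """Map a greyscale mask in [0, 255] to class indices 0..n_levels-1.

    Assumption: grey levels are evenly spaced across the range.
    """
    mask = np.asarray(mask, dtype=np.float32)
    return np.rint(mask / 255.0 * (n_levels - 1)).astype(np.uint8)


# Small synthetic mask, not taken from the dataset.
demo = np.array([[0, 64, 128], [191, 255, 255]], dtype=np.uint8)
classes = grey_to_severity(demo)
print(classes.tolist())  # [[0, 1, 2], [3, 4, 4]]
```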

These classes are highly imbalanced: unburnt areas account for 91.9% of all pixels. Among burnt areas only, class 1 covers around 12%, while classes 2 to 4 each cover around 30% (1%, 2.25%, 2.35%, and 2.5% of the total for classes 1 to 4, respectively).
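
The relationship between the two sets of percentages can be verified with a short calculation, using only the figures quoted above:

```python
# Share of all pixels per class, as quoted above (in percent).
unburnt_pct = 91.9
burnt_pct_of_total = {1: 1.0, 2: 2.25, 3: 2.35, 4: 2.5}

# The burnt classes together make up the remaining share of the pixels.
burnt_total = sum(burnt_pct_of_total.values())

# Share of each class relative to burnt pixels only.
share_of_burnt = {
    cls: 100 * pct / burnt_total for cls, pct in burnt_pct_of_total.items()
}

for cls, share in share_of_burnt.items():
    print(f"class {cls}: {share:.1f}% of burnt pixels")
```

Class 1 comes out at about 12.3% of burnt pixels and classes 2 to 4 at roughly 28% to 31%, consistent with the rounded figures in the text.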