### Author: Md Fahim Hasan
### Work Email: mdfahim.hasan@bayer.com

# 1. Folder Structure

The `ARD_compilation` folder consists of scripts to compile the weather, soil, elevation, and satellite datasets into an `Analytical-Ready Dataset (ARD)` format.

There are five (05) scripts in the folder:
1. __ARD_utils.ipynb :__ Contains functions to process and compile the datasets into ARD.
2. __weather_ARD_4km.ipynb :__ Contains the workflow to compile 4km weather datasets into a 4km ARD.
3. __soil_elevation_ARD_4km.ipynb :__ Contains the workflow to compile 4km soil + elevation datasets into a 4km ARD.
4. __weather_satellite_ARD_100m_Woodland.ipynb :__ Contains the workflow to compile 100m weather and satellite datasets into a 100m ARD.
5. __soil_elevation_ARD_100m_Woodland.ipynb :__ Contains the workflow to compile 100m soil + elevation datasets into a 100m ARD.

----------------

# 2. Sequence for Script Running 

- `ARD_utils` is a standalone script that supports all the ARD compilation scripts. There is no need to run it separately. The ARD compilation scripts import functions from it and use them as needed.
<br />

- The ARD compilation scripts share many common operations, such as copying, resampling, and masking. I have commented out these similar operations to speed up ARD compilation. Run the scripts in the following sequence:
    - __4km ARD__
        - weather_ARD_4km.ipynb
        - soil_elevation_ARD_4km.ipynb
    
  <br />  
    
    - __100m ARD__
        - weather_satellite_ARD_100m_Woodland.ipynb
        - soil_elevation_ARD_100m_Woodland.ipynb
    
    

---------------

# 3. 4km ARD `General` Framework

The following workflow shows the overall framework for creating 4km ARD. Here, ERA5 and TWC are weather datasets. 


__Note:__ This is a general framework and doesn't show some fine-tuned steps, such as gap filling the soil250 datasets or replacing ML/DL model interpolated weather datasets with TWC data after 2015.

![image.png](attachment:image.png)



------------

# 4. 100m ARD `General` Framework

The following workflow shows the overall framework for creating 100m ARD. Here, ERA5 and TWC are weather datasets. 

__Note:__ This is a general framework and doesn't show some fine-tuned steps, such as gap filling the soil250 datasets or replacing ML/DL model interpolated weather datasets with TWC data after 2015.

![image.png](attachment:image.png)

--------------

# 5. General Discussion

### Two (02) versions of ARD for different resolutions
Two versions of ARD were compiled for different resolutions. 

__4km ARD:__ The 4km ARD consisted of weather datasets, soil250 datasets, elevation, and slope data. We made separate ARDs for time series (weather datasets) and static (soil, elevation, slope) datasets. 4km resolution is the target resolution because we trained the ML/DL models with TWC weather datasets of 4km resolution.

__100m ARD:__ Like the 4km ARD, the 100m ARD consists of weather datasets, soil250 datasets, elevation, and slope data. In addition, it includes `satellite land surface temperature (LST)` data. We made separate ARDs for time series (weather + satellite datasets) and static (soil, elevation, slope) datasets. 100m resolution is the target resolution because the satellite LST dataset has a 100m resolution.
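
Because the time series and static ARDs are kept separately, they may need to be joined before analysis. The following is a minimal sketch of such a join with pandas, not code taken from the ARD scripts. The column names (`lat`, `lon`, `date`, `max_temp`, `elevation`, `slope`), the sample values, and the helper `join_static_to_timeseries` are hypothetical.

```python
import pandas as pd

# Hypothetical time series ARD: one row per pixel per date.
ts_ard = pd.DataFrame({
    "lat": [38.50, 38.50, 38.54, 38.54],
    "lon": [-121.80, -121.80, -121.80, -121.80],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"] * 2),
    "max_temp": [15.1, 16.3, 14.8, 15.9],
})

# Hypothetical static ARD: one row per pixel.
static_ard = pd.DataFrame({
    "lat": [38.50, 38.54],
    "lon": [-121.80, -121.80],
    "elevation": [21.0, 24.5],
    "slope": [0.4, 0.7],
})


def join_static_to_timeseries(ts_df, static_df, keys=("lat", "lon"), decimals=6):
    """Attach static pixel attributes to every time step of the same pixel."""
    ts = ts_df.copy()
    st = static_df.copy()
    # Round the coordinates so tiny floating point differences do not break the join.
    for key in keys:
        ts[key] = ts[key].round(decimals)
        st[key] = st[key].round(decimals)
    # many_to_one: many dates per pixel on the left, exactly one static row on the right.
    return ts.merge(st, on=list(keys), how="left", validate="many_to_one")


combined = join_static_to_timeseries(ts_ard, static_ard)
print(combined)
```

The `validate="many_to_one"` argument makes pandas raise an error if the static table contains duplicate pixels, which is a cheap sanity check for this kind of join.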

### Region of Interest (ROI) 4km ARD
The 4km ARD was compiled for our whole ROI, a part of the Central Valley, California.

### Region of Interest (ROI) 100m ARD
The 100m ARD is processed only for the `Woodland` site because satellite Land Surface Temperature (LST) data is currently available only for that site. As datasets become available for other regions of the ROI, we can apply our framework to extend the ARD to the whole ROI.

`The following figure shows a summary of the ARD versions for different resolutions and the datasets.`

![image.png](attachment:image.png)

### Replacing ML/DL model interpolated weather datasets with TWC data after 2015
We have TWC high-resolution data records starting from 2015-06-30 (2015-12-01 for precipitation) for the weather variables `max temperature`, `min temperature`, `avg wind speed`, `avg relative humidity`, and `total precipitation`. We have also modeled high-resolution data from 2002 to 2023 using the ML and DL models. Even so, from 2015 onward we use the TWC high-resolution 4km data for these variables. Therefore, before compiling all datasets into the ARD, we replace the ML/DL modeled weather datasets for these variables with the TWC datasets from 2015 onward.
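
The replacement step can be sketched as follows. This is a minimal illustration with pandas, not the code used in the ARD scripts. The cutoff dates come from the text above; the column names (`lat`, `lon`, `date`, `max_temp`, `total_precip`), the helper `replace_with_twc`, and the choice to fall back to modeled values where a TWC record is missing are assumptions.

```python
import pandas as pd

# Start dates of the TWC records, as stated in the text.
TWC_START = pd.Timestamp("2015-06-30")
TWC_PRECIP_START = pd.Timestamp("2015-12-01")


def replace_with_twc(modeled, twc, variable, key_cols=("lat", "lon", "date")):
    """Overwrite modeled values with TWC values on and after the TWC start date."""
    # Hypothetical column name for precipitation; it has a later start date.
    start = TWC_PRECIP_START if variable == "total_precip" else TWC_START
    keys = list(key_cols)
    twc_col = variable + "_twc"
    # Keep only TWC rows within the period where TWC data is used.
    twc_sub = twc.loc[twc["date"] >= start, keys + [variable]]
    twc_sub = twc_sub.rename(columns={variable: twc_col})
    merged = modeled.merge(twc_sub, on=keys, how="left")
    # Prefer TWC where present; otherwise keep the modeled value (assumption).
    merged[variable] = merged[twc_col].combine_first(merged[variable])
    return merged.drop(columns=twc_col)


# Hypothetical single-pixel example around the cutoff date.
dates = pd.to_datetime(["2015-06-29", "2015-06-30", "2015-07-01"])
modeled = pd.DataFrame({"lat": 38.5, "lon": -121.8, "date": dates,
                        "max_temp": [30.0, 31.0, 32.0]})
twc = pd.DataFrame({"lat": 38.5, "lon": -121.8, "date": dates,
                    "max_temp": [99.0, 33.5, 34.5]})

result = replace_with_twc(modeled, twc, "max_temp")
print(result)
```

In this example the row for 2015-06-29 keeps its modeled value because it falls before the TWC start date, while the two later rows take the TWC values.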

### File Format of the Analytical-Ready Datasets (ARD)
The ARDs are saved in the __`../../datasets/ARD`__ folder. All ARDs are saved as dataframes in `.parquet` format. The `V1`-tagged ARDs contain ML/DL model interpolated data, while the `V2` ARDs contain ERA5 resampled precipitation data. All ARDs include latitude and longitude information to spatially locate each pixel.
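
As an illustration of how the latitude and longitude columns can be used, the sketch below selects all records of the pixel closest to a point of interest. It is a hedged example rather than part of the ARD scripts: the column names (`lat`, `lon`, `date`, `max_temp`), the sample values, and the helper `nearest_pixel` are hypothetical. A real ARD would first be loaded with `pd.read_parquet`, which needs a parquet engine such as `pyarrow` to be installed.

```python
import pandas as pd

# For a real ARD, load the file first, for example:
# ard = pd.read_parquet("<path to an ARD .parquet file>")
# A small in-memory stand-in with hypothetical columns is used here.
ard = pd.DataFrame({
    "lat": [38.51, 38.51, 38.55, 38.55],
    "lon": [-121.79, -121.79, -121.79, -121.79],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02"] * 2),
    "max_temp": [15.1, 16.3, 14.8, 15.9],
})


def nearest_pixel(df, lat, lon, lat_col="lat", lon_col="lon"):
    """Return all rows of the pixel whose center is closest to (lat, lon)."""
    # Squared distance in degrees; adequate for ranking pixels in a small region.
    dist2 = (df[lat_col] - lat) ** 2 + (df[lon_col] - lon) ** 2
    idx = dist2.idxmin()
    pix_lat = df.at[idx, lat_col]
    pix_lon = df.at[idx, lon_col]
    return df[(df[lat_col] == pix_lat) & (df[lon_col] == pix_lon)]


pixel_records = nearest_pixel(ard, lat=38.50, lon=-121.80)
print(pixel_records)
```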