## RQ1 Data Preparation

[Add description]

Steps:
1. **Manually** download all datasets (5 forest definitions and Natura 2000) 
2. Filter Natura 2000 for areas in Germany
3. Mosaic data which comes in tiles (Hansen & JAXA)
4. Threshold & update (Hansen)
5. Reproject to the most common projection - EPSG 3035
6. Rasterise or Upsample (WITHOUT INTERPOLATION) to 5m 
7. Clip data to German Natura 2000 areas 

### Initial Setup

Used this for help with directory setup: 
https://www.freecodecamp.org/news/creating-a-directory-in-python-how-to-create-a-folder/

In [40]:
# Import packages
import geopandas as gpd
import os

# Create required directories if they don't already exist
# Note: these directories are ignored in git
path_list = ("./rawdata", "./processing", "./outputs")

for path in path_list:
  if not os.path.exists(path):
    os.mkdir(path)
    print("Folder %s created!" % path)
  else:
    print("Folder %s already exists" % path)

Folder ./rawdata already exists
Folder ./processing already exists
Folder ./outputs already exists


### Step 1: Manually download datasets

Several of the datasets require login credentials and are not available through an API, so I downloaded all the required data manually. Everything is stored in a "rawdata" folder within the working directory, which is ignored by git because the files are too big to host on my GitHub repo.

**So for the first step: manually download all datasets using the notes below and save to the "rawdata" folder.** 

Note: For the forest definition layers I have downloaded the 2018 datasets, as 2018 is the most recent year available across all datasets.

**1. UMD (Hansen) / Global Forest Watch**
- Download date: 15 Jan 2025
- Downloaded from: https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/download.html
    - Download treecover2000, gain & lossyear for the 4 granules with top-left corner at: 60N, 0E; 60N, 10E; 50N, 0E and 50N, 10E 
- File: rawdata/Hansen_GFC-2023-v1.11.zip (contains 12 tifs - 4 each for cover, gain and loss)
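
Because the download is manual, a quick check that the archive holds all 12 expected tiles can catch a missed granule. The sketch below assumes the files keep the naming pattern used on the download page (`Hansen_GFC-2023-v1.11_<layer>_<lat>_<lon>.tif`); the helper names are made up for illustration.

```python
import os
import zipfile

LAYERS = ("treecover2000", "gain", "lossyear")
GRANULES = ("60N_000E", "60N_010E", "50N_000E", "50N_010E")  # top-left corners


def expected_hansen_tiles(version="GFC-2023-v1.11"):
    """Return the 12 file names expected for the four granules (assumed pattern)."""
    return [f"Hansen_{version}_{layer}_{granule}.tif"
            for layer in LAYERS for granule in GRANULES]


def missing_hansen_tiles(names_in_zip):
    """Return expected tile names that are absent from the archive listing."""
    # Zip members always use "/" as separator; drop any folder prefix
    present = {name.rsplit("/", 1)[-1] for name in names_in_zip}
    return [tile for tile in expected_hansen_tiles() if tile not in present]


zip_path = "rawdata/Hansen_GFC-2023-v1.11.zip"
if os.path.exists(zip_path):
    with zipfile.ZipFile(zip_path) as zf:
        print("Missing tiles:", missing_hansen_tiles(zf.namelist()))
```

An empty list means all four granules are present for cover, gain and loss.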

**2. ESA Land Cover**
- Download date: 14 Jan 2025
- Downloaded from: https://cds.climate.copernicus.eu/datasets/satellite-land-cover?tab=download 
    - Login credentials required 
    - Only downloaded sub-region for Germany bbox (N:56, W:1, E:16, S:46)
- File: rawdata/C3S-LC-L4-LCCS-Map-300m-P1Y-2018-v2.1.1.area-subset.56.1.46.16.nc

**3. JAXA FNF** 
- Download date: 15 Jan 2025
- Downloaded from: https://www.eorc.jaxa.jp/ALOS/en/palsar_fnf/data/index.htm
    - Login credentials required (https://www.eorc.jaxa.jp/ALOS/en/palsar_fnf/registration.htm)
    - 73 tiles needed for Germany Natura areas (see other/jaxa_tile_list.txt)
- File: rawdata/jaxa_2018_fnf_ger.zip (contains 73 tifs)  

**4. CORINE Land Use** 
- Download date: 16 Jan 2025
- Downloaded from: https://land.copernicus.eu/en/products/corine-land-cover/clc2018#download
    - Login credentials required 
    - Opted for vector so that I can rasterise at a common resolution that makes sense with the other data
- File: rawdata/

**5. German Land Use**
- Download date: 14 Jan 2025
- Downloaded from: https://gdz.bkg.bund.de/index.php/default/corine-land-cover-5-ha-stand-2018-clc5-2018.html  
    - This will download all 5 classes, which are required for producing the FAO map
- File: rawdata/clc5_classXxx.zip (contains: 5 shp & their components)

**6. Natura 2000 protected areas**
- Download date: 15 Jan 2025
- Downloaded from: https://www.eea.europa.eu/en/datahub/datahubitem-view/6fc8ad2d-195d-40f4-bdec-576e7d1268e4
    - Downloaded most recent date available (2022)
- File: rawdata/Natura2000_end2022_epsg3035.zip (contains: 1 shp & its components)

### Step 2: Filter Natura 2000 

Use the attributes of the Natura shapefile to filter the "MS" field (i.e. "Member States") to only include "DE" (i.e. Germany). 

I also save the results as a shapefile in the outputs folder, as this may be useful for visualisations at the end.

In [41]:
# Load the Natura 2000 shapefile as a geodataframe
# You can do this directly from the zipped file
natura_gdf = gpd.read_file("rawdata/Natura2000_end2022_epsg3035.zip")

#print(natura_gdf[1:20])

# Extract only the German areas
natura_de_gdf = natura_gdf.loc[natura_gdf["MS"] == "DE"]

# Check - there should be 5200 areas
# (print the row count explicitly; a bare expression mid-cell shows nothing)
print(len(natura_de_gdf))

# Save the file to outputs folder
natura_de_gdf.to_file('outputs/natura_de.shp')



### Step 3: Mosaic tiled data

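The JAXA and Hansen datasets come in tiles, so each is mosaicked into a single raster before any further processing.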
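
One way to mosaic the tiles is `rasterio.merge.merge`. This is only a sketch: it assumes `rasterio` is installed and that the zipped tiles have been extracted to a folder such as `processing/hansen` (a hypothetical location), and the function names are illustrative. Hansen needs one mosaic per layer, so the tiles are grouped by layer name first.

```python
import glob
import os


def group_hansen_tiles(paths, layers=("treecover2000", "gain", "lossyear")):
    """Group tile paths by the layer name that appears in the file name."""
    groups = {layer: [] for layer in layers}
    for path in sorted(paths):
        name = os.path.basename(path)
        for layer in layers:
            if f"_{layer}_" in name:
                groups[layer].append(path)
    return groups


def mosaic_tiles(tile_paths, out_path):
    """Merge tiles into one GeoTIFF (the whole mosaic is held in memory)."""
    import rasterio                    # imported here so the helper above has no dependencies
    from rasterio.merge import merge

    sources = [rasterio.open(p) for p in tile_paths]
    try:
        mosaic, transform = merge(sources)
        profile = sources[0].profile.copy()
        profile.update(driver="GTiff", height=mosaic.shape[1],
                       width=mosaic.shape[2], transform=transform)
        with rasterio.open(out_path, "w", **profile) as dst:
            dst.write(mosaic)
    finally:
        for src in sources:
            src.close()


tiles = glob.glob("processing/hansen/*.tif")   # hypothetical extraction folder
for layer, paths in group_hansen_tiles(tiles).items():
    if paths:
        mosaic_tiles(paths, f"processing/hansen_{layer}_mosaic.tif")
```

The JAXA tiles hold a single layer, so their paths can go straight into `mosaic_tiles`. Since the mosaic is built in memory, the four Hansen granules may need a machine with plenty of RAM.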

### Step 4: Threshold & update for gain/loss 

This step applies to the Hansen data only.

Threshold: more than 60% tree cover. This provides a good range with the other datasets (which use lower thresholds), and it is also the threshold used by the International Geosphere-Biosphere Programme (IGBP) definition.
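
A minimal NumPy sketch of the threshold-and-update step. The update rule is an assumption, as it is not spelled out above: a pixel counts as forest in 2018 if its 2000 tree cover exceeds 60% and no loss was recorded up to 2018, or if it is flagged as gain. It relies on the Hansen encoding where `lossyear` is 0 for no loss and 1 to 23 for loss in 2001 to 2023, and `gain` is 1 where gain was detected.

```python
import numpy as np


def hansen_forest_mask(treecover2000, lossyear, gain, threshold=60, target_year=2018):
    """Return a uint8 array: 1 = forest at target_year, 0 = non-forest."""
    last_code = target_year - 2000                 # lossyear code for 2018 is 18
    dense_in_2000 = treecover2000 > threshold
    lost_by_target = (lossyear >= 1) & (lossyear <= last_code)
    gained = gain == 1
    # Assumed rule: keep dense cover that was not lost, and add gain pixels
    forest = (dense_in_2000 & ~lost_by_target) | gained
    return forest.astype(np.uint8)


# Tiny made-up example (not real data)
cover = np.array([[80, 80, 30], [60, 90, 10]], dtype=np.uint8)
loss = np.array([[0, 5, 0], [0, 20, 0]], dtype=np.uint8)
gain = np.array([[0, 0, 1], [0, 0, 0]], dtype=np.uint8)
print(hansen_forest_mask(cover, loss, gain))
```

In the example, the pixel with loss in 2005 drops out, the pixel with loss in 2020 stays (the loss happened after 2018), and a pixel with exactly 60% cover does not pass the "more than 60%" test.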

### Step 5: Reproject

Reproject all layers to the most common projection among the datasets: EPSG 3035 (ETRS89-extended / LAEA Europe).
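
For the raster layers, one option is GDAL's `gdalwarp` command-line tool called from Python. The sketch below only assembles the command; it assumes GDAL is installed and on the PATH, and the function names and file paths are placeholders. Nearest-neighbour resampling (`-r near`) keeps the class values unchanged.

```python
import subprocess


def build_gdalwarp_cmd(src, dst, target_srs="EPSG:3035", resolution=None):
    """Assemble a gdalwarp call that reprojects src to target_srs without interpolation."""
    cmd = ["gdalwarp", "-t_srs", target_srs, "-r", "near", "-of", "GTiff"]
    if resolution is not None:
        # -tr sets the output pixel size; -tap aligns the grid to multiples of it
        cmd += ["-tr", str(resolution), str(resolution), "-tap"]
    return cmd + [src, dst]


def reproject(src, dst, **kwargs):
    """Run gdalwarp; raises CalledProcessError if GDAL reports a failure."""
    subprocess.run(build_gdalwarp_cmd(src, dst, **kwargs), check=True)


# Placeholder file names; the command is printed here rather than executed
print(" ".join(build_gdalwarp_cmd("processing/jaxa_mosaic.tif",
                                  "processing/jaxa_3035.tif", resolution=5)))
```

Passing `resolution=5` would combine the reprojection with the 5 m upsampling in a single pass, which avoids resampling the data twice.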

### Step 6: Rasterise or Upsample

Vector layers are rasterised and raster layers are upsampled to 5 m, in both cases WITHOUT INTERPOLATION.

5 m was selected as it is a common divisor of the pixel sizes across all datasets: every pixel size can be approximately divided by 5, so as little transformation as possible is needed. It also means that a lot of the detail of the shapefiles can be retained during rasterisation.
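
To illustrate what upsampling without interpolation means: every coarse pixel is split into a block of identical 5 m pixels, so no new values are invented. A small NumPy sketch, assuming the source pixel size is an exact multiple of 5 m (a 25 m pixel becomes a 5 x 5 block); the function name is illustrative.

```python
import numpy as np


def upsample_nearest(array, factor):
    """Repeat each pixel `factor` times along rows and columns (no interpolation)."""
    return np.repeat(np.repeat(array, factor, axis=0), factor, axis=1)


coarse = np.array([[1, 0],
                   [0, 1]], dtype=np.uint8)   # made-up 2 x 2 forest mask at 25 m
fine = upsample_nearest(coarse, 5)            # 25 m / 5 m = factor 5
print(fine.shape)
print(np.unique(fine))                        # only the original class values remain
```

In practice the same effect comes from nearest-neighbour resampling in GDAL when the target resolution is set to 5 m.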

In [4]:
# for German LULC:
# convert CLC18 column to integer value and then:
# gdal_rasterize -l clc5_class3xx -a clc18_int -tr 50.0 50.0 -a_nodata 0.0 -ot Float32 -of GTiff C:/Users/ninam/Documents/UZH/04_Thesis/code/qgis_comparison/clc5_classXxx/clc5_class3xx.shp C:/Users/ninam/Documents/UZH/04_Thesis/code/qgis_comparison/clc5_class3xx_raster_test_50m.tif
# ran pretty quick in QGIS (less than 1 min)
# also not too bad for 25m

# gdal for python: https://gdal.org/en/stable/api/python_bindings.html

### Step 7: Clip to German Natura areas