## RQ1 Data Preparation

[Add description]

Steps:
1. **Manually** download all datasets (5 forest definitions and Natura 2000) 
2. Filter Natura 2000 for areas in Germany
3. Mosaic data which comes in tiles (Hansen & JAXA)
4. Threshold & update (Hansen)
5. Extract layer from netcdf (ESA)
6. Reproject to the most common projection - EPSG 3035
7. Rasterise or Upsample (WITHOUT INTERPOLATION) to 5m 
8. Clip data to German Natura 2000 areas 

### Initial Setup

Used this for help with directory setup: 
https://www.freecodecamp.org/news/creating-a-directory-in-python-how-to-create-a-folder/

In [1]:
# SETUP

# Import packages
import os
import warnings
import glob
import math

import geopandas as gpd
import rasterio
import rasterio.features  # explicit import for rasterio.features.rasterize (Step 7)
from rasterio.merge import merge
from rasterio.crs import CRS 
import xarray as xr 
import rioxarray as rio

from osgeo import gdal 

# Create required directories if they don't already exist
# Note: these directories are ignored in git
path_list = ("./rawdata", "./processing", "./outputs")

for path in path_list:
    if not os.path.exists(path):
        os.mkdir(path)
        print("Folder %s created!" % path)
    else:
        print("Folder %s already exists" % path)

Folder ./rawdata already exists
Folder ./processing already exists
Folder ./outputs already exists


### Step 1: Manually download datasets

As several of the datasets require login credentials and are not available through an API, I decided to manually download all the required data. I have stored everything in the "rawdata" folder. This folder is set to be ignored by git because the files are too big to push onto the GitHub repo.

**So for the first step: manually download all datasets using the notes below and save to the "rawdata" folder.** 

Note: For the forest definition layers I have downloaded the 2018 datasets, as 2018 is the most recent year available across all datasets. 

**1. UMD (Hansen) / Global Forest Watch**
- Download from: https://storage.googleapis.com/earthenginepartners-hansen/GFC-2023-v1.11/download.html
    - Using the map interface, download the treecover2000, gain & lossyear layers for the 4 granules with top-left corner at: (60N, 0E), (60N, 10E), (50N, 0E) and (50N, 10E). 
    - The files will be used in combination with each other to generate a dataset that corresponds (roughly) to forest cover in 2018.
- My info:
    - Download date: 15 Jan 2025
    - File: rawdata/Hansen_GFC-2023-v1.11 (folder contains 12 tifs - 4 each for cover, gain and loss)

**2. ESA Land Cover**
- Download from: https://cds.climate.copernicus.eu/datasets/satellite-land-cover?tab=download 
    - Login credentials required (there is a prompt in the page to sign up/login).
    - Select 2018 map and v2.1.1.
    - Only download the sub-region for ~Germany bounding box (N:56, W:1, E:16, S:46).
- My Info:
    - Download date: 14 Jan 2025
    - File: rawdata/C3S-LC-L4-LCCS-Map-300m-P1Y-2018-v2.1.1.area-subset.56.1.46.16.nc

**3. JAXA FNF** 
- Download from: https://www.eorc.jaxa.jp/ALOS/en/palsar_fnf/data/index.htm
    - Login credentials required (To register, go here: https://www.eorc.jaxa.jp/ALOS/en/palsar_fnf/registration.htm).
    - Under the heading "PALSAR/PALSAR-2 mosaic and forest/non-forest (FNF) map", select the 2018 data.
    - Use the map interface to click through until you can download tiles. I opted to download four 5 x 5 tiles using the link above the map (for N55E005, N55E010, N50E005, N50E010), but also had to supplement with some individual tiles. I used QGIS to sort through the tiles and figure out which ones were needed. In total, 73 tiles are needed for the Germany Natura areas - a list of the required tile names is available in: other/jaxa_tile_list.txt
- My Info:
    - Download date: 15 Jan 2025
    - File: rawdata/jaxa_2018_fnf_ger (folder contains 73 tifs)  

**4. CORINE Land Use** 
- Download from: https://land.copernicus.eu/en/products/corine-land-cover/clc2018#download
    - Login credentials required (there is a prompt in the page to sign up/login).
    - Click on “Go to download by area”. Then select the CORINE land cover 2018 layer, use the area selection tool to click on Germany and then click on the download icon beside the layer name. 
    - From the cart, select the dataset and choose "vector" and "shapefile". I opted for vector so that I can rasterise at a common resolution that makes sense with the other data. Click the "Process Download Request" button.
    - NOTE: At this point the download request enters a queue which can take a long time. When it is ready to download, an email will be sent so you don't have to keep checking it.
- My Info:
    - Download date: 16 Jan 2025 (request date - ready for download on 18 Jan 2025)
    - File: U2018_CLC2018_V2020_20u1.zip  (contains: 1 shp & its components)

**5. German Land Use**
- Download from: https://gdz.bkg.bund.de/index.php/default/corine-land-cover-5-ha-stand-2018-clc5-2018.html  
    - Click on the “Direktdownload” tab, and then click on "Georeferenzierung: UTM32s, Format: Shape (ZIP, 1,24 GB)". This will download 5 shapefiles which represent the 5 main land cover classes (also used in CORINE) - individual features within these files have their more precise class as an attribute. Class 3 contains the classes related to forests, but other classes may be required for producing the FAO map (so all 5 are retained for now).
- My Info:
    - Download date: 14 Jan 2025
    - Files: rawdata/clc5_class1xx.zip, rawdata/clc5_class2xx.zip, rawdata/clc5_class3xx.zip, rawdata/clc5_class4xx.zip, rawdata/clc5_class5xx.zip (each contains: 1 shp & its components)

**6. Natura 2000 protected areas**
- Download from: https://www.eea.europa.eu/en/datahub/datahubitem-view/6fc8ad2d-195d-40f4-bdec-576e7d1268e4
    - Download the most recent date available (in my case: 2022 - direct link: https://sdi.eea.europa.eu/data/95e717d4-81dc-415d-a8f0-fecdf7e686b0).
- My Info:
    - Download date: 15 Jan 2025
    - File: rawdata/Natura2000_end2022_epsg3035.zip (contains: 1 shp & its components)

### Step 2: Filter Natura 2000 

Use the attributes of the Natura shapefile to filter the "MS" field (i.e. "Member States") to only include "DE" (i.e. Germany). 

I also save the results as a shapefile in the outputs folder as this may be useful for visualisations at the end. 

In [2]:
# FILTER NATURA2000

# Load the Natura 2000 shapefile as a geodataframe
# You can do this directly from the zipped file
natura_gdf = gpd.read_file("./rawdata/Natura2000_end2022_epsg3035.zip")

#print(natura_gdf[1:20])

# Extract only the German areas
natura_de_gdf = natura_gdf.loc[natura_gdf["MS"] == "DE"]

# Check - there should be 5200 areas
print(len(natura_de_gdf))

# Save the file to the outputs folder (warnings are turned off because of a warning about a datetime column)
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    natura_de_gdf.to_file('./outputs/natura_DE.shp')



### Step 3: Mosaic tiled data

This step applies to the JAXA and Hansen datasets, which are delivered as tiles.

Help with mosaicking using rasterio: https://automating-gis-processes.github.io/CSC18/lessons/L6/raster-mosaic.html

In [3]:
# MOSAIC TILES

# Store paths for tiles in list
jaxa_paths = glob.glob('./rawdata/jaxa_2018_fnf_ger/*.tif')
hansen_cover_paths = glob.glob('./rawdata/Hansen_GFC-2023-v1.11/Hansen_GFC-2023-v1.11_tree*.tif')
hansen_gain_paths = glob.glob('./rawdata/Hansen_GFC-2023-v1.11/Hansen_GFC-2023-v1.11_gain*.tif')
hansen_loss_paths = glob.glob('./rawdata/Hansen_GFC-2023-v1.11/Hansen_GFC-2023-v1.11_loss*.tif')

# Store the path for the output mosaics
jaxa_mosaic = "./processing/jaxa_FNF_4326_DE.tif"
hansen_cover_mosaic = "./processing/hansen_treecover2000_4326_DE.tif"
hansen_gain_mosaic = "./processing/hansen_gain_4326_DE.tif"
hansen_loss_mosaic = "./processing/hansen_lossyear_4326_DE.tif"

# Create a function which mosaics the tiles
def mosaic_rasters(input_paths, output_path):
    # Open every tile and store the opened tiles in a list
    tiles_to_mosaic = [rasterio.open(path) for path in input_paths]
    try:
        # Create the mosaic and store the transform information
        mosaic, transform = merge(tiles_to_mosaic)
        # Copy the metadata for the mosaic from the first tile
        mosaic_meta = tiles_to_mosaic[0].meta.copy()
    finally:
        # Close the tiles again so the files are not left open
        for tile in tiles_to_mosaic:
            tile.close()
    # Update the metadata with the new information for the mosaic
    mosaic_meta.update({"driver": "GTiff",
                        "height": mosaic.shape[1],
                        "width": mosaic.shape[2],
                        "transform": transform,
                        # crs is not included, as it is copied from the tiles
                        }
                        )
    # Write the mosaic with its metadata to a tif file
    with rasterio.open(output_path, "w", **mosaic_meta, compress="LZW") as dest:
        dest.write(mosaic)

# Run the function to mosaic the tiles
# If mosaic already exists, make sure it's not open in QGIS :) you'll get permission error if so!
mosaic_rasters(jaxa_paths, jaxa_mosaic)
mosaic_rasters(hansen_cover_paths, hansen_cover_mosaic)
mosaic_rasters(hansen_gain_paths, hansen_gain_mosaic)
mosaic_rasters(hansen_loss_paths, hansen_loss_mosaic)

### Step 4: Threshold & update for gain/loss 

This step applies to the Hansen data only.

The tree cover layer is thresholded at more than 60% cover. This provides a good range with the other datasets (which are lower), and it is also the threshold used by the International Geosphere-Biosphere Programme (IGBP) definition.

Planned approach:
- Reclassify the tree cover values so that 61-100 are forest and everything else is non-forest.
- Reclassify the loss layer so that values 1-18 (i.e. loss in 2001-2018) have a value of 1 and everything else is 0.
- Subtract the loss from, and add the gain to, the thresholded tree cover layer.
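
A minimal sketch of this logic on toy arrays, assuming the three Hansen mosaics have been read into NumPy arrays of identical shape. The array values are invented, and the order of operations (loss removes forest first, gain then adds it back) is an assumption rather than a settled choice. Note also that the Hansen gain layer only covers 2000-2012, so the result is only a rough proxy for 2018.

```python
import numpy as np

# Toy 2 x 3 arrays standing in for the mosaicked Hansen layers (values are invented)
treecover2000 = np.array([[80, 61, 60], [10, 95, 0]], dtype=np.uint8)  # % tree cover in 2000
lossyear = np.array([[0, 5, 0], [0, 19, 0]], dtype=np.uint8)           # 0 = no loss, 1 = 2001, ...
gain = np.array([[0, 0, 0], [1, 0, 0]], dtype=np.uint8)                # 1 = gain

# Threshold: more than 60% cover counts as forest
forest_2000 = treecover2000 > 60

# Loss up to and including 2018 (values 1-18); later losses are ignored
loss_to_2018 = (lossyear >= 1) & (lossyear <= 18)

# Assumed order: remove the loss first, then add the gain back in
forest_2018 = ((forest_2000 & ~loss_to_2018) | (gain == 1)).astype(np.uint8)

print(forest_2018)
```

Working on boolean masks avoids negative or >1 values that a literal subtract/add on integer arrays could produce where loss and gain overlap.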

### Step 5: Extract layer from netcdf

This is required for the ESA dataset only. The netcdf file includes several different layers of information - the one I want to use is called "lccs_class". The aim here is to simply extract the single layer and save it as a geotiff.

Help with saving netcdf layer as geotiff: https://help.marine.copernicus.eu/en/articles/5029956-how-to-convert-netcdf-to-geotiff 

In [8]:
# EXTRACT & CONVERT NETCDF DATA

# Open the ESA netcdf for conversion
esa_netcdf = xr.open_dataset("./rawdata/C3S-LC-L4-LCCS-Map-300m-P1Y-2018-v2.1.1.area-subset.56.1.46.16.nc", engine="netcdf4")
#esa_netcdf

# Extract the lccs_class variable
esa_lccs_class = esa_netcdf['lccs_class']

# Define the spatial dimensions & the CRS
esa_lccs_class = esa_lccs_class.rio.set_spatial_dims(x_dim='lon', y_dim='lat')
esa_lccs_class = esa_lccs_class.rio.write_crs("EPSG:4326")

# Save the geotiff
esa_lccs_class.rio.to_raster("./processing/esa_lccs_class_4326_DE.tif")

### Step 6: Reproject

The data need to be in a projected CRS with units of metres so that areas can be calculated. Three datasets (Hansen, JAXA and ESA) are not in a projected CRS: they use WGS 1984 (EPSG:4326). The most common projection among the datasets is ETRS89-extended / LAEA Europe (EPSG:3035), which is used by the Natura 2000 areas and the CORINE dataset. 

So for this step, all datasets which are not already in this projection will be (re)projected to EPSG:3035.

Help with reprojecting using rioxarray: https://www.earthdatascience.org/courses/use-data-open-source-python/intro-raster-data-python/raster-data-processing/reproject-raster/

In [9]:
# REPROJECT RASTERS

# A quick function for reprojecting individual rasters to EPSG:3035
def reproject_raster_3035(input_path, output_path):
    # Open the raster (NOTE: need to use rio.open_rasterio() here!)
    src_raster = rio.open_rasterio(input_path)
    # Run the reprojection (rioxarray resamples with nearest neighbour by default,
    # so no new class values are created)
    reprojected = src_raster.rio.reproject("EPSG:3035")
    # Write the reprojected output raster
    reprojected.rio.to_raster(output_path, compress="LZW")

# Run this per file 
reproject_raster_3035("./processing/jaxa_FNF_4326_DE.tif", "./processing/jaxa_FNF_3035_DE.tif")
reproject_raster_3035("./processing/esa_lccs_class_4326_DE.tif", "./processing/esa_lccs_class_3035_DE.tif")

# TO DO: HANSEN?

In [10]:
# REPROJECT SHPS

# Store paths for shp zips in list
ger_lulc_paths = glob.glob('./rawdata/clc5_class*.zip')

# Create a function which reprojects the shp to 3035 (and saves to processing folder)
def reproj_shp_3035(input_paths):
    # Iterate through the shp paths 
    for path in input_paths:
        # Open the shp for each path (excludes extra cols as they cause problems & are not needed)
        shp = gpd.read_file(path, columns = ["CLC18"])
        # Reprojects to 3035
        shp_3035  = shp.to_crs("EPSG:3035")

        # For output file naming: extract the input file name (with extension)
        name_w_ext = os.path.split(path)[1] 
        # For output file naming: remove extension from input file name 
        name_wo_ext = os.path.splitext(name_w_ext)[0]
        # For output file naming: create the new name for reprojected shp
        new_name = name_wo_ext + "_3035_DE.shp"

        # Write the reprojected shp to the processing folder
        shp_3035.to_file('./processing/' + new_name)

# Run the function for the German LULC zipped shps
reproj_shp_3035(ger_lulc_paths)


In [11]:
# CLEAN UP

# Create a list of the data paths for deletion (files with "4326" in their name)
old_data =  glob.glob('./processing/*4326*')

# Create a function which deletes the input paths
def clean_up(input_paths):
    # glob returns an empty list if no files match
    if not input_paths:
        print("Nothing to clean!")
    for path in input_paths:
        # Check that the path still exists before deleting it
        if os.path.exists(path):
            os.remove(path)

# Run the function to remove any data with "4326" in the file name
clean_up(old_data)

### Step 7: Rasterise or Upsample

In this step, I rasterise the vector files (German LULC - Class 3 only & CORINE) and upsample the already existing rasters to 5 m. 5 m was selected because it is a common divisor across all datasets: every pixel size can be (approximately) divided by 5, so as little transformation as possible is needed. It also means that a lot of the detail of the shapefiles can be retained during rasterisation. Importantly, the upsampling needs to happen WITHOUT INTERPOLATION so that no "new" information is created.

Help for rasterising vectors: https://py.geocompx.org/05-raster-vector & https://pygis.io/docs/e_raster_rasterize.html 
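
To illustrate what upsampling without interpolation means, here is a small NumPy sketch: every coarse cell is simply split into identical copies of itself, so no value appears in the output that was not in the input. The helper name `upsample_nearest`, the 25 m cell size and the factor of 5 are only illustrative assumptions; for a real raster the transform would also need to be updated to the new cell size.

```python
import numpy as np

def upsample_nearest(array, factor):
    """Split every cell into factor x factor copies of itself (no new values)."""
    return np.repeat(np.repeat(array, factor, axis=0), factor, axis=1)

# Toy 2 x 2 class raster, imagined here as 25 m cells
coarse = np.array([[1, 2],
                   [3, 4]], dtype=np.uint8)

# 25 m / 5 = 5 m cells: each coarse cell becomes a 5 x 5 block
fine = upsample_nearest(coarse, 5)

print(fine.shape)
print(np.unique(fine))
```

With an interpolating method such as bilinear resampling, values between the class codes could appear, which would be meaningless for categorical data.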

In [39]:
# RASTERISE VECTORS (WITH ATTRIBUTE VALUE)

# Testing
ger_lulc_3_test = gpd.read_file("./processing/clc5_class3xx_3035_DE.shp")

bounds = ger_lulc_3_test.total_bounds
res = 5
transform = rasterio.transform.from_origin(
    west=bounds[0], 
    north=bounds[3], 
    xsize=res, 
    ysize=res
)
#transform

rows = math.ceil((bounds[3] - bounds[1]) / res)
cols = math.ceil((bounds[2] - bounds[0]) / res)
shape = (rows, cols)
#shape

#ger_lulc_3_test_borders = ger_lulc_3_test.boundary
#ger_lulc_3_test_borders

# Give every feature a numeric id to burn into the raster
ger_lulc_3_test['id'] = range(0, len(ger_lulc_3_test))

# Pair each geometry with its id: (geometry, value)
geom_value = ((geom, value) for geom, value in zip(ger_lulc_3_test.geometry, ger_lulc_3_test['id']))

# NOTE: at 5 m the output grid is very large, so this call can run out of memory
rasterized = rasterio.features.rasterize(geom_value,
                                out_shape = shape,
                                transform = transform,
                                all_touched = True,
                                fill = -5,   # background value
                                )



MemoryError: Unable to allocate 165. GiB for an array with shape (173477, 127979) and data type int64
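
The size in this error message follows directly from the grid dimensions: rows x columns x bytes per cell. A rough back-of-the-envelope check (the helper name `array_size_gib` is only illustrative) shows how the requirement changes with the data type:

```python
def array_size_gib(rows, cols, bytes_per_cell):
    """Size of a rows x cols array in GiB for a given cell size in bytes."""
    return rows * cols * bytes_per_cell / 1024**3

# Shape reported in the MemoryError
rows, cols = 173477, 127979

# Compare int64 with smaller integer types
sizes = {name: array_size_gib(rows, cols, nbytes)
         for name, nbytes in [("int64", 8), ("int32", 4), ("uint16", 2), ("uint8", 1)]}

for name, gib in sizes.items():
    print(f"{name}: {gib:.1f} GiB")
```

Even a 1-byte data type would need roughly 21 GiB for the full grid, so a smaller dtype alone is unlikely to be enough on a typical machine; the raster probably has to be produced piece by piece or written straight to disk.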

In [38]:
with rasterio.open(
        "./processing/rasterized_vector_test.tif", "w",
        driver = "GTiff",
        transform = transform,
        crs = "EPSG:3035",
        count = 1,
        dtype = rasterized.dtype,
        # rasterized is a 2D array with shape (rows, cols)
        height = rasterized.shape[0],
        width = rasterized.shape[1],
        compress="LZW") as dst:
    # Write the 2D array to band 1
    dst.write(rasterized, 1)


NameError: name 'rasterized' is not defined

In [4]:
# for German LULC:
# convert CLC18 column to integer value and then:

# gdal_rasterize -l clc5_class3xx -a clc18_int -tr 5.0 5.0 -a_nodata 0.0 -ot Float32 -of GTiff C:/Users/ninam/Documents/UZH/04_Thesis/code/qgis_comparison/clc5_classXxx/clc5_class3xx.shp C:/Users/ninam/Documents/UZH/04_Thesis/code/qgis_comparison/clc5_class3xx_raster_test_50m.tif

# something similar in rasterio?

### Step 8: Clip to German Natura areas