# Quality control of Geo-harmonizer datasets

The `eumap` library provides a set of functions to check the quality of all spatial datasets produced throughout the Geo-harmonizer project. 
These are the same functions used by the developers, adapted so that users can run quality checks not only on 
an entire raster layer (which may be too computationally intensive without proper infrastructure) but also on a subset of it.

These functions are contained in the `qc` module of the `eumap` package and can be used to check the **accessibility**, **completeness** and
**consistency** of the raster layers. The main component of the `qc` module is the `Test` class (full documentation can be found [here](https://eumap.readthedocs.io/en/latest/_autosummary/eumap.qc.Test.html#eumap.qc.Test)).

Let's import the module, define a bounding box for the area of interest and create a `Test` instance:

In [5]:
from eumap import qc

bounds = (
    4751935,
    2420238,
    4772117,
    2444223,
)

test = qc.Test(
    bounds=bounds,
    crs='EPSG:3035', # optional
    verbose=True,    # optional, defaults to False
)

test

ImportError: cannot import name 'qc' from 'eumap' (unknown location)
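
Before running checks on a subset, it can help to see how large the bounding box actually is. The sketch below is not part of `eumap`. It assumes the tuple is ordered `(xmin, ymin, xmax, ymax)` in metres (EPSG:3035) and that the layer has a 30 m pixel size; the helper name `describe_bounds` is hypothetical.

```python
from math import ceil

# Bounding box from the cell above, in EPSG:3035 (units: metres).
# Assumption: the order is (xmin, ymin, xmax, ymax).
bounds = (4751935, 2420238, 4772117, 2444223)


def describe_bounds(bounds, pixel_size=30):
    """Return the box extent in metres and an approximate pixel grid size."""
    xmin, ymin, xmax, ymax = bounds
    if xmin >= xmax or ymin >= ymax:
        raise ValueError("expected (xmin, ymin, xmax, ymax) with min < max")
    width_m = xmax - xmin
    height_m = ymax - ymin
    # Round up: partially covered pixels still have to be read.
    cols = ceil(width_m / pixel_size)
    rows = ceil(height_m / pixel_size)
    return {
        "width_m": width_m,
        "height_m": height_m,
        "cols": cols,
        "rows": rows,
        "pixels": cols * rows,
    }


info = describe_bounds(bounds)
print(info)
```

For this box the extent is roughly 20 km by 24 km, or about 673 by 800 pixels at 30 m. The exact window read from a raster may differ by a pixel depending on how the box aligns with the grid.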

## Accessibility test

First we check whether the datasets we are interested in are **accessible** (a simple check on the URL through which users access or
download the files). We import the `Catalogue` object and search through our [GeoNetwork](https://data.opendatascience.eu) for the 
*potential natural vegetation* ("*pnv*") dataset. 

For more information on the `Catalogue` object, refer to the previous tutorial [7. Access to Geo-harmonizer datasets](https://eumap.readthedocs.io/en/latest/notebooks/07_catalogue.html).

In [2]:
from eumap.datasets import Catalogue

cat = Catalogue()

asset = cat.search('pnv')[0]

asset.meta

title:    PNV - Probability distribution for Abies alba
abstract: Overview:
Potential Natural Vegetation (PNV): potential probability of occurrence for the Silver fir from 2018 to 2020

Traceability (lineage):
    This is an original dataset produced with a machine learning framework which used a combination of point datasets and raster datasets as inputs. Point dataset is a harmonized collection of tree occurrence data, comprising observations from National Forest Inventories (EU-Forest), GBIF and LUCAS. The complete dataset is available on Zenodo. Raster datasets used as input are: monthly time series air and surface temperature and precipitation from a reprocessed version of the Copernicus ERA5 dataset; long term averages of bioclimatic variables from CHELSA; elevation, slope and other elevation-derived metrics and long term monthly averages snow probability. For a more comprehensive list refer to Bonannella et al. (2022) (in review, preprint available at: https://doi.org/10.21203/r

From the "*pnv*" search results we extract the URL of the first raster layer of the dataset, the potential distribution map of silver fir for
the period 2018 - 2020:

In [3]:
str(asset) # assets are just strings with metadata so we can use them as a url string

'https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_abies.alba_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif'

The raster URL and the previously defined bounding box are all the information needed to run the quality control checks. We can now run
the **accessibility** check using the method of the same name:

In [4]:
accessible = test.accessibility(asset)

accessible

Dataset accessible:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_abies.alba_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif


True

As we can see, the test result is `True`, which means the file is available.
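
To illustrate what a simple URL check can look like, here is a minimal standard-library sketch. It is an assumption about the general technique (an HTTP `HEAD` request), not the actual implementation of `Test.accessibility`; the function name `is_accessible` is hypothetical.

```python
import urllib.error
import urllib.request


def is_accessible(url, timeout=10):
    """Return True if a HEAD request to `url` succeeds with a status below 400."""
    try:
        # HEAD asks for the headers only, so the raster itself is not downloaded.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors (4xx/5xx), DNS failures, refused connections,
        # timeouts and malformed URLs all count as "not accessible".
        return False
```

Since an asset behaves like a string, it could be passed as `is_accessible(str(asset))`. A `HEAD` request only confirms that the server answers for that URL, not that the file content is a valid raster.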

## Completeness test

The second test checks the **completeness** of the raster layer: every pixel of the selected region of interest in the raster layer is 
compared with the landmask used for all the layers produced in the Geo-harmonizer project. The main landmask (30 m spatial resolution) is derived from [Pflugmacher et al. (2019)](https://doi.pangaea.de/10.1594/PANGAEA.896282). We use the `raster_land_coverage` method, whose output
is a number between `0` and `1`: the fraction of landmask pixels in the tested raster layer that hold valid data (that is, are not `nodata`), so `1.0` corresponds to 100% completeness.

In [5]:
coverage = test.raster_land_coverage(asset)

coverage

reader using 3 threads
Completeness 100.0% for dataset:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_abies.alba_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif


1.0
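
Judging by the output above (`Completeness 100.0%` alongside `1.0`), the returned value is the share of landmask pixels that hold valid data. The NumPy sketch below shows that computation on a toy array. It is an illustration under that assumption, not the `eumap` implementation; `land_coverage`, the `255` nodata value and the arrays are made up for the example.

```python
import numpy as np


def land_coverage(data, landmask, nodata):
    """Fraction of land pixels (landmask is True) that hold valid data."""
    land = np.asarray(landmask, dtype=bool)
    n_land = int(land.sum())
    if n_land == 0:
        raise ValueError("landmask contains no land pixels")
    valid = np.asarray(data) != nodata
    # Only pixels that are both on land and valid count towards coverage.
    return float((valid & land).sum() / n_land)


# Toy 3x3 raster; 255 marks nodata (hypothetical value).
data = np.array([
    [10, 20, 255],
    [30, 255, 255],
    [40, 50, 60],
])
# True marks land; the two upper-right pixels are treated as sea.
landmask = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 1, 1],
], dtype=bool)

toy_coverage = land_coverage(data, landmask, nodata=255)
print(toy_coverage)
```

Here six of the seven land pixels are valid, so the result is 6/7. The nodata pixels over sea do not lower the score because they fall outside the landmask.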

By default, the landmask excludes all pixels falling in permanent ice/snow and wetlands. If we are interested
in these specific areas, the method lets us include them in the quality control check:

In [6]:
coverage = test.raster_land_coverage(
    asset,
    include_ice=True, # include snow and ice in coverage check
)

coverage

reader using 5 threads
Completeness 100.0% for dataset:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_abies.alba_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif


1.0

In [7]:
coverage = test.raster_land_coverage(
    asset,
    include_ice=True,
    include_wetlands=True, # include wetlands in coverage check
)

coverage

reader using 5 threads
Completeness 100.0% for dataset:
https://s3.eu-central-1.wasabisys.com/eumap/veg/veg_abies.alba_pnv.eml_p_30m_0..0cm_2018..2020_eumap_epsg3035_v0.2.tif


1.0
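
The effect of the `include_ice` and `include_wetlands` flags can be pictured as widening the mask before coverage is computed. The sketch below assumes a class-coded mask with made-up codes (`LAND`, `ICE`, `WETLAND`, `SEA`) and hypothetical helper names; the real landmask encoding and the internals of `raster_land_coverage` may differ.

```python
import numpy as np

# Hypothetical class codes, chosen only for this sketch.
SEA, LAND, ICE, WETLAND = 0, 1, 2, 3


def build_mask(classes, include_ice=False, include_wetlands=False):
    """Boolean mask of the pixels that take part in the coverage check."""
    keep = [LAND]
    if include_ice:
        keep.append(ICE)
    if include_wetlands:
        keep.append(WETLAND)
    return np.isin(classes, keep)


def masked_coverage(data, classes, nodata, **include):
    """Fraction of masked pixels holding valid data (mask must be non-empty)."""
    mask = build_mask(classes, **include)
    valid = np.asarray(data) != nodata
    return float((valid & mask).sum() / mask.sum())


classes = np.array([
    [LAND, LAND, ICE],
    [LAND, WETLAND, SEA],
])
# 255 marks nodata (hypothetical value): the ice and sea pixels are empty.
data = np.array([
    [5, 6, 255],
    [7, 8, 255],
])

default = masked_coverage(data, classes, 255)
with_ice = masked_coverage(data, classes, 255, include_ice=True)
with_both = masked_coverage(
    data, classes, 255, include_ice=True, include_wetlands=True
)
print(default, with_ice, with_both)
```

In this toy case the default check reports full coverage, while including the empty ice pixel lowers it to 3/4 and adding the valid wetland pixel brings it to 4/5. The layer tested above scores `1.0` in all three configurations, so it also holds valid data over ice and wetlands within the bounding box.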