# PyEumap - Overlay Demonstration

In this tutorial, we will use the pyeumap package to overlay all the points of a vector layer (*geopackage file*) on several raster layers (*geotiff files*), using the **SpaceOverlay** and **SpaceTimeOverlay** classes to handle timeless and temporal layers, respectively.

In our dataset, the elevation and slope layers, derived from a digital terrain model, are timeless, while the Landsat composites (7 spectral bands, 4 seasons and 3 percentiles) and the night light layers (VIIRS Night Band) are temporal (from 2000 to 2020).

First, let's import the necessary modules:

In [1]:
import os
from pathlib import Path
import sys

import pandas as pd
import geopandas as gpd

# Add the repository root to the path
# (needed if pyeumap isn't installed)
sys.path.append('../../')

from pyeumap.overlay import SpaceOverlay, SpaceTimeOverlay
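
The `sys.path.append('../../')` line matters only when the notebook runs from a repository checkout without the package installed. A minimal sketch of making that step conditional, using only the standard library, could look like this. The helper name `ensure_importable` is hypothetical and not part of pyeumap.

```python
import importlib.util
import sys

def ensure_importable(module_name, fallback_path):
    """Return True if module_name is already importable; otherwise add
    fallback_path to sys.path and return False.
    (Hypothetical helper, not part of pyeumap.)"""
    if importlib.util.find_spec(module_name) is not None:
        return True
    if fallback_path not in sys.path:
        sys.path.append(fallback_path)
    return False

# Only extend the path when pyeumap is not installed in the environment.
already_installed = ensure_importable('pyeumap', '../../')
```

With this approach an installed copy of the package takes precedence, and the relative path is used only as a fallback.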

## Dataset 

Our dataset covers one tile, located in Sweden, extracted from a tiling system created for the European Union (7,042 tiles) by the [GeoHarmonizer Project](https://opendatascience.eu).

In [2]:
from pyeumap import datasets

tile = datasets.TILES[0]
datasets.get_data(tile+'_rasters_gapfilled.zip')

data_root = datasets.DATA_ROOT_NAME
tile_dir = Path(os.getcwd()).joinpath(data_root,tile)

644.77 MB downloaded, unpacking...                    
Download complete.


For this tile we have a **geopackage** file containing the points

In [3]:
datasets.get_data(tile+'_landcover_samples.gpkg')

fn_points = Path(os.getcwd()).joinpath(data_root, tile, tile+'_landcover_samples.gpkg')

points = gpd.read_file(fn_points)
points

0.17 MB downloaded, unpacking...                    
Download complete.


| | lucas | survey_date | confidence | lc_class | tile_id | geometry |
|---|---|---|---|---|---|---|
| 0 | True | 2012-05-29 | 100 | 312 | 22497 | POINT (4650000.166 4483999.711) |
| 1 | True | 2012-05-16 | 100 | 312 | 22497 | POINT (4650000.255 4471999.472) |
| 2 | False | 2000-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) |
| 3 | False | 2012-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) |
| 4 | False | 2018-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) |
| ... | ... | ... | ... | ... | ... | ... |
| 675 | True | 2018-07-23 | 50 | 124 | 22497 | POINT (4666000.000 4490000.000) |
| 676 | True | 2012-05-21 | 50 | 122 | 22497 | POINT (4678000.201 4472000.124) |
| 677 | True | 2012-05-21 | 50 | 124 | 22497 | POINT (4678000.201 4472000.124) |
| 678 | True | 2018-09-13 | 50 | 122 | 22497 | POINT (4678000.000 4472000.000) |


... some **timeless** raster layers 

In [4]:
dir_timeless_layers = os.path.join(tile_dir, 'timeless')
fn_timeless_layers = list(Path(dir_timeless_layers).glob('**/*.tif'))

print(f'Number of timeless layers: {len(fn_timeless_layers)}')

Number of timeless layers: 2


... and several **temporal** layers.

In [5]:
dir_temporal_layers = os.path.join(tile_dir)
fn_temporal_layers = list(Path(dir_temporal_layers).glob('????/*.tif'))

print(f'{len(fn_temporal_layers)} temporal layers from 2000 to 2020')

1743 temporal layers from 2000 to 2020
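
The two glob patterns used above behave differently: `**/*.tif` recurses into every subdirectory, while `????/*.tif` only looks inside directories whose names have exactly four characters, such as the yearly folders. The sketch below illustrates this on a temporary, made-up directory layout. The file names are hypothetical stand-ins, not the actual dataset files.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Hypothetical miniature of the tile directory layout.
    for rel in ['timeless/dtm_elevation.tif',
                '2000/night_lights.tif',
                '2001/night_lights.tif']:
        fn = root / rel
        fn.parent.mkdir(parents=True, exist_ok=True)
        fn.touch()

    # '**/*.tif' recurses into every subdirectory.
    all_tifs = sorted(p.relative_to(root).as_posix()
                      for p in root.glob('**/*.tif'))
    # '????' matches only directory names with exactly four characters.
    yearly_tifs = sorted(p.relative_to(root).as_posix()
                         for p in root.glob('????/*.tif'))

print(all_tifs)     # includes the file under 'timeless'
print(yearly_tifs)  # only the files under '2000' and '2001'
```

Note that `????` matches any four characters, not only digits, so this pattern relies on the tile directory containing no other four-character folder names.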


The association between the points and the temporal layers is based on the **survey_date** column

In [6]:
col_date = 'survey_date'

print('Number of samples per year:')
pd.to_datetime(points[col_date]).dt.year.value_counts()

Number of samples per year:


2012    190
2018    175
2006    152
2000    152
2015     10
2016      1
Name: survey_date, dtype: int64

... and the names of the **temporal** directories.

In [7]:
dirs = list(Path(dir_temporal_layers).glob('????'))
dirs.sort()

print('Temporal directories:')
# `dirs` already holds full paths, so each one can be globbed directly
# (this also avoids shadowing the built-in `dir`)
for year_dir in dirs:
    n_layers = len(list(year_dir.glob('*.tif')))
    print(f' - {year_dir.name}: {n_layers} layers')

Temporal directories:
 - 2000: 85 layers
 - 2001: 85 layers
 - 2002: 85 layers
 - 2003: 85 layers
 - 2004: 85 layers
 - 2005: 85 layers
 - 2006: 85 layers
 - 2007: 85 layers
 - 2008: 85 layers
 - 2009: 85 layers
 - 2010: 85 layers
 - 2011: 85 layers
 - 2012: 85 layers
 - 2013: 85 layers
 - 2014: 85 layers
 - 2015: 85 layers
 - 2016: 85 layers
 - 2017: 85 layers
 - 2018: 85 layers
 - 2019: 85 layers
 - 2020: 43 layers
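
The count of 85 layers per year is consistent with the dataset description: 7 bands × 4 seasons × 3 percentiles gives 84 Landsat composites, and one more layer, presumably the night light layer, brings the total to 85. The sketch below checks this arithmetic. The naming pattern follows the columns of the overlay output (for example `landsat_ard_fall_blue_p75`), but the `green` and `red` band names and the `spring` and `summer` season names are assumptions.

```python
from itertools import product

# Band and season names are partly assumptions: only some of them appear
# in the overlay output columns (e.g. landsat_ard_fall_blue_p75).
bands = ['blue', 'green', 'red', 'nir', 'swir1', 'swir2', 'thermal']
seasons = ['winter', 'spring', 'summer', 'fall']
percentiles = ['p25', 'p50', 'p75']

landsat_layers = [
    f'landsat_ard_{season}_{band}_{pct}'
    for season, band, pct in product(seasons, bands, percentiles)
]
layers_per_year = len(landsat_layers) + 1  # plus one night light layer

# 20 complete years (2000-2019) plus the 43 layers available for 2020.
total_layers = 20 * layers_per_year + 43
print(layers_per_year, total_layers)  # 85 1743
```

The total matches the 1743 temporal layers found by the glob above, with 2020 being the only incomplete year.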


## Space Overlay

The points should be overlaid on all timeless layers, regardless of the date information stored in the survey_date column. In this case, we will use the **SpaceOverlay** class, passing the arguments:
- *fn_points*: the geopackage filepath
- *dir_timeless_layers*: the directory containing the timeless raster files

In [8]:
spc_overlay = SpaceOverlay(fn_points, dir_timeless_layers, verbose=False)
timeless_data = spc_overlay.run()
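
Conceptually, overlaying a point on a raster means finding the pixel that contains the point and reading its value. The sketch below illustrates the idea in plain Python for a north-up raster with square pixels. It is not how pyeumap implements the overlay, and the function name, grid values and coordinates are all made up for illustration.

```python
import math

def sample_raster(grid, origin_x, origin_y, pixel_size, x, y):
    """Return the value of the pixel containing (x, y), or None if the
    point falls outside the raster.

    grid is a list of rows; (origin_x, origin_y) is the upper-left corner
    of a north-up raster with square pixels.
    """
    col = math.floor((x - origin_x) / pixel_size)
    row = math.floor((origin_y - y) / pixel_size)
    if 0 <= row < len(grid) and 0 <= col < len(grid[0]):
        return grid[row][col]
    return None

# Toy 2x3 "elevation" raster with 30 m pixels (made-up values).
elevation = [
    [10.0, 11.0, 12.0],
    [20.0, 21.0, 22.0],
]
toy_points = [(15.0, 45.0), (75.0, 5.0), (200.0, 5.0)]
values = [sample_raster(elevation, 0.0, 60.0, 30.0, x, y)
          for x, y in toy_points]
print(values)  # [10.0, 22.0, None]
```

Repeating this lookup for every raster layer yields one new column per layer, which is the shape of the result returned by `spc_overlay.run()`.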

Now we have the elevation and slope information for each point:

In [9]:
timeless_data

| | lucas | survey_date | confidence | lc_class | tile_id | geometry | overlay_id | dtm_elevation | dtm_slope |
|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 2012-05-29 | 100 | 312 | 22497 | POINT (4650000.166 4483999.711) | 1 | 239.0 | 2.946278 |
| 1 | True | 2012-05-16 | 100 | 312 | 22497 | POINT (4650000.255 4471999.472) | 2 | 391.0 | 5.559027 |
| 2 | False | 2000-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) | 3 | 416.0 | 1.666667 |
| 3 | False | 2012-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) | 4 | 416.0 | 1.666667 |
| 4 | False | 2018-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) | 5 | 416.0 | 1.666667 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 675 | True | 2018-07-23 | 50 | 124 | 22497 | POINT (4666000.000 4490000.000) | 676 | 215.0 | 3.726780 |
| 676 | True | 2012-05-21 | 50 | 122 | 22497 | POINT (4678000.201 4472000.124) | 677 | 102.0 | 0.000000 |
| 677 | True | 2012-05-21 | 50 | 124 | 22497 | POINT (4678000.201 4472000.124) | 678 | 102.0 | 0.000000 |
| 678 | True | 2018-09-13 | 50 | 122 | 22497 | POINT (4678000.000 4472000.000) | 679 | 102.0 | 0.000000 |


## Space-Time Overlay

For the temporal layers, the points should be filtered by year and overlaid on the matching raster files. The **SpaceTimeOverlay** class implements this approach using the parameters:
* *timeless_data*: The result of SpaceOverlay (a GeoPandas GeoDataFrame)
* *col_date*: The column that contains the date information (e.g., 2018-09-13)
* *dir_temporal_layers*: The directory where the temporal raster files are stored, organized by year

In [10]:
spc_time_Overlay = SpaceTimeOverlay(timeless_data, col_date, dir_temporal_layers, verbose=False)
overlayed_data = spc_time_Overlay.run()
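
The year-based filtering described above can be pictured as grouping the samples by the year of their survey date and pairing each group with the directory of the same name. The sketch below shows that grouping with the standard library only. It is a conceptual illustration under that assumption, not the actual `SpaceTimeOverlay` implementation, and the sample ids and helper name are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical samples: (sample id, survey date in ISO format).
samples = [
    (0, '2012-05-29'),
    (2, '2000-06-30'),
    (3, '2012-06-30'),
    (4, '2018-06-30'),
]

def group_by_year(samples):
    """Map each survey year (as a 'YYYY' directory name) to its sample ids."""
    groups = defaultdict(list)
    for sample_id, survey_date in samples:
        year = date.fromisoformat(survey_date).year
        groups[f'{year:04d}'].append(sample_id)
    return dict(groups)

samples_per_dir = group_by_year(samples)
# Each group would then be overlaid only on the rasters of its own year,
# e.g. the files matched by '<dir_temporal_layers>/2012/*.tif'.
print(samples_per_dir)  # {'2012': [0, 3], '2000': [2], '2018': [4]}
```

Because each point is sampled only from the layers of its own year, the resulting columns carry no year in their names, as the output below shows.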

Now we have the elevation, slope, Landsat and night light data for each point:

In [11]:
overlayed_data

| | lucas | survey_date | confidence | lc_class | tile_id | geometry | overlay_id | dtm_elevation | dtm_slope | landsat_ard_fall_blue_p75 | ... | landsat_ard_winter_thermal_p25 | landsat_ard_winter_swir1_p50 | landsat_ard_winter_swir2_p25 | landsat_ard_winter_swir2_p50 | landsat_ard_winter_swir1_p25 | landsat_ard_winter_swir2_p75 | landsat_ard_winter_nir_p50 | landsat_ard_winter_thermal_p50 | landsat_ard_winter_thermal_p75 | night_lights |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | True | 2012-05-29 | 100 | 312 | 22497 | POINT (4650000.166 4483999.711) | 1 | 239.0 | 2.946278 | 3.0 | ... | 183.0 | 24.0 | 8.0 | 9.0 | 23.0 | 9.0 | 48.0 | 183.0 | 183.0 | 0.059711 |
| 1 | True | 2012-05-16 | 100 | 312 | 22497 | POINT (4650000.255 4471999.472) | 2 | 391.0 | 5.559027 | 2.0 | ... | 182.0 | 23.0 | 8.0 | 8.0 | 23.0 | 8.0 | 55.0 | 183.0 | 183.0 | 0.009072 |
| 3 | False | 2012-06-30 | 85 | 411 | 22497 | POINT (4650097.582 4470351.405) | 3 | 416.0 | 1.666667 | 6.0 | ... | 184.0 | 49.0 | 23.0 | 23.0 | 49.0 | 24.0 | 68.0 | 184.0 | 184.0 | 0.030172 |
| 14 | False | 2012-06-30 | 85 | 312 | 22497 | POINT (4651001.339 4472046.880) | 4 | 357.0 | 8.700255 | 2.0 | ... | 182.0 | 21.0 | 7.0 | 8.0 | 20.0 | 8.0 | 39.0 | 182.0 | 182.0 | -0.012674 |
| 16 | False | 2012-06-30 | 85 | 312 | 22497 | POINT (4651217.720 4488931.650) | 5 | 133.0 | 23.109041 | 3.0 | ... | 183.0 | 24.0 | 8.0 | 8.0 | 23.0 | 9.0 | 47.0 | 183.0 | 183.0 | 0.200448 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 211 | True | 2015-10-29 | 100 | 312 | 22497 | POINT (4668000.057 4493999.655) | 6 | 207.0 | 17.834112 | 4.0 | ... | 183.0 | 20.0 | 8.0 | 8.0 | 20.0 | 8.0 | 44.0 | 183.0 | 183.0 | 0.859678 |
| 270 | True | 2015-10-29 | 100 | 312 | 22497 | POINT (4670000.156 4491999.887) | 7 | 172.0 | 21.122986 | 2.0 | ... | 183.0 | 19.0 | 6.0 | 6.0 | 18.0 | 7.0 | 46.0 | 183.0 | 183.0 | 0.925818 |
| 280 | True | 2015-07-27 | 100 | 511 | 22497 | POINT (4662000.000 4478000.000) | 8 | 113.0 | 0.000000 | 3.0 | ... | 182.0 | 14.0 | 4.0 | 4.0 | 12.0 | 5.0 | 19.0 | 182.0 | 183.0 | 0.900600 |
| 353 | True | 2015-10-29 | 100 | 311 | 22497 | POINT (4676000.000 4488000.000) | 9 | 40.0 | 15.023130 | 3.0 | ... | 182.0 | 9.0 | 2.0 | 3.0 | 8.0 | 4.0 | 14.0 | 182.0 | 183.0 | 2.580281 |


## Save to CSV and GeoPackage files

Finally, we save the overlaid points so they can be accessed in other software (e.g., QGIS) and in other eumap tutorials:

In [12]:
csv_output = os.path.join(tile_dir, tile + '_landcover_samples_overlayed.csv.gz')

print(f"Saving {csv_output}")
overlayed_data.to_csv(csv_output, compression='gzip')

Saving /home/leandro/Code/eumap/demo/python/eumap_data/22497_sweden/22497_sweden_landcover_samples_overlayed.csv.gz
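
To load the compressed CSV again later, `pd.read_csv` can infer the gzip compression from the `.gz` extension. The sketch below shows a round trip on a small made-up table written to a temporary file. The column values and file name are placeholders, not the tutorial output.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the overlaid samples.
toy = pd.DataFrame(
    {'lc_class': [312, 411], 'dtm_elevation': [239.0, 416.0]},
    index=[0, 3],
)

with tempfile.TemporaryDirectory() as tmp:
    toy_csv = os.path.join(tmp, 'toy_samples.csv.gz')
    toy.to_csv(toy_csv, compression='gzip')
    # The '.gz' extension lets pandas infer the compression when reading;
    # index_col=0 restores the index that was written as the first column.
    restored = pd.read_csv(toy_csv, index_col=0)

print(restored)
```

When a GeoDataFrame is written to CSV, the geometry column is typically stored as text, so the GeoPackage written in the next cell is likely the better choice when the geometries are needed again.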


In [13]:
gpkg_output =  os.path.join(tile_dir, tile + '_landcover_samples_overlayed.gpkg')

print(f"Saving {gpkg_output}")
overlayed_data.to_file(gpkg_output,  driver="GPKG")

Saving /home/leandro/Code/eumap/demo/python/eumap_data/22497_sweden/22497_sweden_landcover_samples_overlayed.gpkg


## Overlay Benchmarks

Here we compare the performance of `pyeumap`'s overlay method with classic raster sampling using `rasterio`. First, let's time the overlay executions on the same dataset as in the tutorial above.

In [None]:
from pathlib import Path
import geopandas as gpd
import rasterio as rio
import numpy as np
import multiprocessing as mp

import warnings
warnings.filterwarnings('ignore')

from pyeumap.overlay import SpaceOverlay

max_workers = 8

points = gpd.read_file(fn_points)
print('Sample size:', points.index.size)

Serial sampling with `rasterio`:

In [None]:
def serial_sampling(points, layers_dir):
    coordinates = np.c_[
        points.geometry.x,
        points.geometry.y,
    ]

    results = points.copy()
    for fn in sorted(layers_dir.glob('**/*.tif')):
        # Open each raster in a context manager so its file handle is closed
        with rio.open(fn) as src:
            results[fn.stem] = np.stack(src.sample(coordinates)).ravel()

%timeit -n 1 -r 1 serial_sampling(points, tile_dir)
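
The `%timeit -n 1 -r 1` magic runs the statement exactly once, but it is only available in IPython and Jupyter. When the benchmark is run as a plain Python script, a rough equivalent can be sketched with `time.perf_counter`. The helper name `time_once` is hypothetical, and the workload below is only a stand-in.

```python
import time

def time_once(func, *args, **kwargs):
    """Run func a single time and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the benchmark this would be, for example,
# time_once(serial_sampling, points, tile_dir).
result, elapsed = time_once(sum, range(1_000_000))
print(f'{elapsed:.4f} s')
```

A single run is sensitive to disk caching and other background activity, so repeating the measurement gives a more reliable comparison.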

Parallel sampling with `rasterio`:

In [None]:
def sample_one_layer(args):
    fn, coordinates = args
    layer_name = fn.stem
    with rio.open(fn) as src:
        data = np.stack(src.sample(coordinates)).ravel()
    return layer_name, data

def parallel_sampling(points, layers_dir):
    files = sorted(layers_dir.glob('**/*.tif'))

    coordinates = np.c_[
        points.geometry.x,
        points.geometry.y,
    ]

    results = points.copy()

    arg_gen = (
        (fn, coordinates)
        for fn in files
    )

    with mp.Pool(max_workers) as pool:
        for layer_name, data in pool.map(
            sample_one_layer,
            arg_gen,
        ):
            results[layer_name] = data

%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)

Parallel sampling with `pyeumap.overlay.SpaceOverlay`:

In [None]:
def pyeumap_sampling(points, layers_dir):
    ovr = SpaceOverlay(
        points,
        layers_dir,
        max_workers=max_workers,
        verbose=False,
    )
    data = ovr.run()

%timeit -n 1 -r 1 pyeumap_sampling(points, tile_dir)

Sampling optimizations done in `pyeumap` generate some overhead which outweighs the speedup when used on smaller datasets. However, if we quadruple the sample size:

In [None]:
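import pandas as pd

# DataFrame.append was removed in pandas 2.0; concatenating the frame
# with itself twice quadruples the sample size
for i in range(2):
    points = pd.concat([points, points], ignore_index=True)

print('Sample size:', points.index.size)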

Parallel sampling with `rasterio`:

In [None]:
%timeit -n 1 -r 1 parallel_sampling(points, tile_dir)

Parallel sampling with `pyeumap`:

In [None]:
%timeit -n 1 -r 1 pyeumap_sampling(points, tile_dir)

As seen above, the execution time of the optimized overlay scales much more favorably with dataset size than plain parallelization with `rasterio`.