# Aggregation (spatial downsampling)

### Continuous aggregation

The core aggregation code is written in Cython, in raster_utilities.aggregation.spatial.core.continuous.pyx. 

A helper class raster_utilities.aggregation.spatial.SpatialAggregator is provided to manage calling the Cython code.

This notebook demonstrates using the helper class to aggregate a series of continuous-type raster files.

The code can read input rasters of theoretically unlimited size: inputs are read in tiles to build up the coarser (smaller) output grids, so memory use is determined by the size of the output files and by the number of statistics requested (i.e. the number of output files created).
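The tiled approach described above can be illustrated with a small NumPy sketch. This is not the Cython implementation: the function name `aggregate_mean_tiled` and the `read_rows` callback are hypothetical, it only computes the mean, and it assumes an integer aggregation factor. It shows why only the output-sized grid has to stay in memory while the input is read a strip at a time.

```python
import numpy as np

def aggregate_mean_tiled(read_rows, in_shape, factor, ndv_in, ndv_out,
                         out_rows_per_tile=1):
    """Block-mean downsample, reading the input one strip of rows at a time.

    read_rows(start, stop) is a hypothetical callback that returns input rows
    [start, stop) as a 2D array; in practice it might wrap a raster reader.
    """
    in_h, in_w = in_shape
    out_h, out_w = in_h // factor, in_w // factor
    # Only this output-sized array is held in memory for the whole run
    out = np.full((out_h, out_w), ndv_out, dtype=np.float64)

    for out_r0 in range(0, out_h, out_rows_per_tile):
        out_r1 = min(out_r0 + out_rows_per_tile, out_h)
        n = out_r1 - out_r0
        # Read just the input rows that feed output rows out_r0..out_r1
        tile = np.asarray(read_rows(out_r0 * factor, out_r1 * factor),
                          dtype=np.float64)
        tile = tile[:, :out_w * factor]
        valid = tile != ndv_in
        # Reshape so each output cell's inputs lie along axes 1 and 3
        sums = np.where(valid, tile, 0.0).reshape(n, factor, out_w, factor).sum(axis=(1, 3))
        counts = valid.reshape(n, factor, out_w, factor).sum(axis=(1, 3))
        # Cells with no valid inputs keep the output nodata value
        np.divide(sums, counts, out=out[out_r0:out_r1], where=counts > 0)
    return out

# Small in-memory example standing in for a large raster on disk
data = np.arange(64, dtype=np.float64).reshape(8, 8)
data[0, 0] = -9999  # one nodata cell in the input
result = aggregate_mean_tiled(lambda a, b: data[a:b], data.shape,
                              factor=2, ndv_in=-9999, ndv_out=-9999)
```

The nodata cell is excluded from the mean of its output cell rather than contaminating it, which is the behaviour one would normally want from a continuous aggregation.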

It has been used to aggregate a global 7-metre resolution grid to mastergrids 1k, reading from a .vrt file to avoid the need to ever generate the mosaiced high-resolution grid.

In [17]:
# The helper class
from raster_utilities.aggregation.spatial.SpatialAggregator import *

In [2]:
# Enumerations to provide acceptable values for the aggregation parameters,
# avoid having to remember strings
from raster_utilities.aggregation.aggregation_values import *

In [3]:
import glob

### Run a continuous aggregation across a series of files in a folder

In [14]:
# The files to be aggregated should be provided as a list of filepaths. 
# (Just make a single-item list for one file)
inContFiles = glob.glob(r'H:\*\1km\Monthly\*.mean.*.tif')

# Also provide the output folder
outDir = r'E:\Data\Harry\Documents\dataprep\MODIS'

Specify the output nodata value. It doesn't have to be the same as the input: the incoming NDV is read from the files, so it must be set properly in them.

In [5]:
ndvOut = -9999

Specify the aggregation statistics to create. This must be a list of items from the ContinuousAggregationStats enumeration, or their string representations.

In [6]:
# e.g.
stats = [ContinuousAggregationStats.MEAN, ContinuousAggregationStats.MAX,
         ContinuousAggregationStats.MIN, ContinuousAggregationStats.SD]
#stats = [ContinuousAggregationStats.MIN]
# or to do all of them, use this convenience:
#stats = ContinuousAggregationStats.ALL.value

Finally, configure the aggregation. The final parameter for the SpatialAggregator constructor should be a dictionary that configures how the aggregation will run.

* This should have a key that is a member of the AggregationTypes enumeration, i.e. AggregationTypes.RESOLUTION, AggregationTypes.FACTOR, or AggregationTypes.SIZE. This key determines the resolution of the output files in one of three ways.
* The value of this key should be as follows:
    * AggregationTypes.RESOLUTION: Float value, or string "1km", "5km" or "10km"
    * AggregationTypes.FACTOR: Int value (e.g. 5 to go from 1k rasters to 5k rasters)
    * AggregationTypes.SIZE: 2-tuple specifying the (height,width) of the output rasters

* A key "resolution_name" may be provided; it gives the name of the output resolution, which is used as the fifth token of the 6-token output filenames (e.g. "5km")

* A key "mem_limit_gb" may be provided to limit the memory use (if not, 30GB will be the default). Note that the limit is not applied very accurately, so be conservative!
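To make the three options concrete, here is a rough sketch of how each one could translate into an output grid shape. The function `output_shape` and the plain string mode names are hypothetical stand-ins for the AggregationTypes members, the rounding behaviour is an assumption, and the named resolutions ("1km" etc.) are not handled; the library's own logic may differ.

```python
def output_shape(in_shape, in_cell_size, mode, value):
    """Return the (height, width) of the output grid for one aggregation mode.

    mode is a hypothetical stand-in for the AggregationTypes enumeration:
    "resolution", "factor" or "size".
    """
    in_h, in_w = in_shape
    if mode == "factor":
        # An integer number of input cells per output cell, in each direction
        factor = int(value)
        return (in_h // factor, in_w // factor)
    if mode == "resolution":
        # Output cell size in the same units as the input cell size
        ratio = float(value) / in_cell_size
        return (int(round(in_h / ratio)), int(round(in_w / ratio)))
    if mode == "size":
        # The output shape is given directly
        h, w = value
        return (int(h), int(w))
    raise ValueError("unknown aggregation mode: %r" % (mode,))

# Illustration: a global 30 arc-second grid (21600 x 43200 cells)
global_1k = (21600, 43200)
cell = 1.0 / 120.0  # degrees per cell
by_factor = output_shape(global_1k, cell, "factor", 5)
by_resolution = output_shape(global_1k, cell, "resolution", 5.0 / 120.0)
by_size = output_shape(global_1k, cell, "size", (4320, 8640))
```

In this illustration all three calls describe the same output grid, which is the sense in which the key selects one of three equivalent ways of stating the output resolution.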


In [7]:
# e.g.
# Resolution can be a floating point number, or a string representing 
# one of the core mastergrid resolutions "1km", "5km", or "10km".
aggArgs = {AggregationTypes.RESOLUTION:"5km", "resolution_name":"5km"}

In [None]:
inContFiles

### Running - Single-process

Now just instantiate and run the aggregation:

In [15]:
agg = SpatialAggregator(inContFiles, outDir, ndvOut, stats, aggArgs)

In [None]:
agg.RunAggregation()

### Running - multiprocessing

Alternatively, use multiprocessing to process several files at once. The continuous aggregation algorithm is single-threaded, so running several processes is the way to make gains. Pick a pool size that corresponds to the number of cores to use at once, but keep an eye on disk utilisation, as this will become the bottleneck: if the disk is pegged at 100% then the run will end up slower, and it will be better to use fewer processes. (The compression algorithm is multithreaded when saving, but there is no real need to worry about that.)
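As a rough aid for picking the pool size, the sketch below caps the number of processes by both the core count and a memory budget. Both helper functions are hypothetical, and the per-process estimate (one 8-byte value per output cell per statistic) is an assumption rather than a measurement of the real code, which may hold additional working arrays. Treat the result as an upper bound and reduce it if the disk saturates.

```python
import os

def estimate_output_gb(out_shape, n_stats, bytes_per_cell=8):
    """Very rough memory estimate: one output-sized array per statistic.

    bytes_per_cell=8 assumes 64-bit values; this is an assumption.
    """
    h, w = out_shape
    return h * w * n_stats * bytes_per_cell / 1024.0 ** 3

def suggest_pool_size(out_shape, n_stats, mem_budget_gb, n_cores=None):
    """Largest pool size that fits within both the cores and the memory budget."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    per_process_gb = estimate_output_gb(out_shape, n_stats)
    by_memory = int(mem_budget_gb // per_process_gb)
    # Always allow at least one process
    return max(1, min(n_cores, by_memory))

# Illustration: four statistics on a 4320 x 8640 output grid
per_file_gb = estimate_output_gb((4320, 8640), 4)
pool_size = suggest_pool_size((4320, 8640), 4, mem_budget_gb=4, n_cores=8)
```

Leaving a generous margin below the available memory also leaves room for OS-level write caching, which helps keep the disk from becoming the bottleneck.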

In [None]:
from multiprocessing import Pool

def callAgg(f):
    try:
        agg = SpatialAggregator([f], outDir, ndvOut, stats, aggArgs)
        agg.RunAggregation()
    except KeyboardInterrupt:
        # swallow the interrupt in the worker; the parent process handles it
        pass

# now we can just do this:
# p = Pool(10)
# p.map(callAgg, inContFiles)
# but it is impossible to interrupt if we need to! To allow that, we need to do this:
# https://bryceboe.com/2010/08/26/python-multiprocessing-and-keyboardinterrupt/

def runMulti():
    # choose a number not greater than the number of cores, but also one that won't use more than 
    # the available memory and preferably substantially less so that OS-level write-caching can 
    # help prevent the disk becoming a bottleneck (ensure you are writing to a disk with write
    # caching enabled: it isn't by default on external drives)
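    pool = Pool(7)
    p = pool.map_async(callAgg, inContFiles)
    try:
        # waiting with a timeout lets the parent receive KeyboardInterrupt
        # (see the post linked above)
        r = p.get(0xFFFF)
    except KeyboardInterrupt:
        print("parent received interrupt")
        # stop the workers rather than leaving them running
        pool.terminate()
        pool.join()
        return
    # normal completion: release the worker processes
    pool.close()
    pool.join()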


In [None]:
# call it
runMulti()