RFC: file and dataset access in Rasterio tools and plugins #1300

Closed
sgillies opened this issue Mar 23, 2018 · 3 comments

Comments

@sgillies
Member

sgillies commented Mar 23, 2018

There's a discussion we're having in several places about high-level vs low-level ways of accessing paths and datasets in Rasterio tools and plugins. This is a request for comments on a pattern that I’ve found. We could use it, modify it, or come up with some better pattern or patterns. Let’s discuss.

Developers want path-based utilities

We all want power, right? We want to do more with less code. High-level file operations like https://docs.python.org/3/library/shutil.html#shutil.copy allow users to do bash-like scripting in Python, using files (or paths) as the primary things.

shutil in the standard library is a good model

There is precedent for this kind of thing in the Python standard library. See https://github.com/python/cpython/blob/master/Lib/shutil.py.

shutil.copyfile() is largely concerned with avoiding copying a file onto itself, determining whether files exist, and deciding whether symbolic links should be resolved. The actual copying is done by shutil.copyfileobj(), which operates on opened Python file objects and doesn’t need to know that there is a filesystem at all.
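To make that delegation concrete, here is shutil.copyfileobj() working on in-memory file objects, with no filesystem involved at all (a stdlib-only illustration, not Rasterio code):

```python
import io
import shutil

# copyfileobj() only needs a readable and a writable file-like object;
# it never touches paths, so in-memory buffers work just as well as files.
src = io.BytesIO(b"pretend these are GeoTIFF bytes")
dst = io.BytesIO()
shutil.copyfileobj(src, dst)
assert dst.getvalue() == b"pretend these are GeoTIFF bytes"
```

The path-handling policy lives in copyfile(); the byte-moving mechanism lives in copyfileobj(). That split is the pattern worth imitating.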

High-level utilities or tools for Rasterio should work like this, too. They should mostly be concerned with the dataset files (or identifiers) and delegate to methods that operate on opened dataset objects.

Here’s a sketch.

import json

import rasterio


def dataset_features(dsrc):
    """Yields GeoJSON features extracted from the dsrc dataset."""
    # For example.
    yield {"type": "Feature"}


def dataset_features_tool(src, dst):
    """Writes GeoJSON features extracted from src file to dst file

    This is a higher level abstraction than dataset_features().

    Parameters
    ----------
    src : str
        A raster dataset filename or identifier.
    dst : str
        An output filename.

    Returns
    -------
    None

    There's no need to return anything since the intent is to write to
    disk and we'll be careful to always raise an exception if that
    doesn't go exactly right.
    """
    # First, validate parameters.
    # Next, open the output file and the input dataset.
    with open(dst, 'w') as fdst, rasterio.open(src) as dsrc:
        # Write out a sequence of GeoJSON texts, one per line.
        for feat in dataset_features(dsrc):
            fdst.write(json.dumps(feat) + '\n')

But geospatial files aren’t the same as filesystem files

Operations like shutil.copyfile() are easy to describe and implement because they operate on a standardized operating system file – an array of bytes with associated metadata – where the content is irrelevant.

Geospatial datasets, and more specifically GDAL datasets, come in many flavors: they aren’t always a single file, and sometimes they are application protocols (like WFS “files”). Sometimes they aren’t even compatible with each other. What does it mean to convert a many-layered NetCDF file to a JPEG? In our tools, it’s not going to be as simple as with shutil.copyfile().

Say we want to open the source dataset with some format-specific conditions (like number of threads for a GeoTIFF or block sizing for JP2) or specify a compact JSON encoding for the output. This requires us to be able to pass options through dataset_features_tool() into rasterio.open() and json.dumps().
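On the output side, json.dumps() already accepts keyword options this way. For example, passing separators through a dst_opts dict switches to a compact encoding:

```python
import json

feat = {"type": "Feature", "geometry": None}

# Default encoding pads separators with spaces.
default = json.dumps(feat)
# The same options a caller could supply via dst_opts.
compact = json.dumps(feat, separators=(",", ":"))

assert default == '{"type": "Feature", "geometry": null}'
assert compact == '{"type":"Feature","geometry":null}'
```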

I think it’s a good idea to have one pattern for doing this in Rasterio and its plugins.

Here’s a sketch of a pattern.

def dataset_features_tool(src, dst, src_opts=None, dst_opts=None):
    # Because a mutable default value doesn't work.
    src_opts = src_opts or {}
    dst_opts = dst_opts or {}

    with open(dst, 'w') as fdst, rasterio.open(src, **src_opts) as dsrc:
        # Write out a sequence of GeoJSON texts, one per line.
        for feat in dataset_features(dsrc):
            fdst.write(json.dumps(feat, **dst_opts) + '\n')
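The `src_opts = src_opts or {}` idiom exists because a mutable default argument is created once at function definition time and shared across calls. A minimal illustration of the pitfall and the fix:

```python
def with_mutable_default(opts={}):
    # The same dict object is reused on every call, so state leaks.
    opts["calls"] = opts.get("calls", 0) + 1
    return opts["calls"]

def with_none_default(opts=None):
    opts = opts or {}  # a fresh dict on every call
    opts["calls"] = opts.get("calls", 0) + 1
    return opts["calls"]

assert [with_mutable_default(), with_mutable_default()] == [1, 2]
assert [with_none_default(), with_none_default()] == [1, 1]
```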

It’s not as transparent as the following would be

def dataset_features_tool(src, dst, src_num_threads=None, dst_separators=None, ...):

but I don’t think it’s feasible to list all the possible opening options as keyword arguments. There are dozens across all the different format drivers and the code would have to be updated every time a new driver was added.

Tool configuration environment

Say we want the tool to execute in the context of a set of GDAL configuration options like CPL_DEBUG=ON and GDAL_DISABLE_READDIR_ON_OPEN=TRUE. It’s possible to call the tool within a rasterio Env block as shown below.

with rasterio.Env(CPL_DEBUG=True, GDAL_DISABLE_READDIR_ON_OPEN=True):
    dataset_features_tool('example.tif', 'output.json')
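The save-and-restore shape of an Env block can be sketched with a stdlib context manager over os.environ. Here `temp_env` is a hypothetical stand-in for illustration, not how rasterio.Env is actually implemented (GDAL config options are not plain environment variables once a process is running):

```python
import os
from contextlib import contextmanager

@contextmanager
def temp_env(**config):
    # Hypothetical stand-in for rasterio.Env: set variables on entry,
    # restore the previous state on exit, even if an exception is raised.
    saved = {key: os.environ.get(key) for key in config}
    os.environ.update({key: str(val) for key, val in config.items()})
    try:
        yield
    finally:
        for key, old in saved.items():
            if old is None:
                os.environ.pop(key, None)
            else:
                os.environ[key] = old

with temp_env(DEMO_CPL_DEBUG="ON"):
    assert os.environ["DEMO_CPL_DEBUG"] == "ON"
assert "DEMO_CPL_DEBUG" not in os.environ
```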

But a higher-level abstraction that moves the Env into the tool may make it easier to use. We could add a third keyword argument, sketched below.

def dataset_features_tool(src, dst, src_opts=None, dst_opts=None, config=None):
    # Because a mutable default value doesn't work.
    src_opts = src_opts or {}
    dst_opts = dst_opts or {}
    config = config or {}

    with rasterio.Env(**config):
        with open(dst, 'w') as fdst, rasterio.open(src, **src_opts) as dsrc:
            # Write out a sequence of GeoJSON texts, one per line.
            for feat in dataset_features(dsrc):
                fdst.write(json.dumps(feat, **dst_opts) + '\n')

A tool class?

If we find this pattern useful, we could consider taking the next step and extracting a callable class from it.

I admit I’m beginning to wave my hands a bit in this next sketch. It’s an example of how the pattern could be turned into reusable and maintainable code, not a concrete proposal to write classes like this for Rasterio 1.0.

class JSONSequenceTool(object):
    """A tool which extracts data from a dataset and saves a JSON sequence to a file
    """

    def __init__(self, func):
        """Initialize tool

        Parameters
        ----------
        func : callable
            A function that takes a dataset and yields JSON serializable objects
        """
        self.func = func

    def __call__(self, src, dst, src_opts=None, dst_opts=None, config=None):
        src_opts = src_opts or {}
        dst_opts = dst_opts or {}
        config = config or {}

        with rasterio.Env(**config):
            with open(dst, 'w') as fdst, rasterio.open(src, **src_opts) as dsrc:
                for obj in self.func(dsrc):
                    fdst.write(json.dumps(obj, **dst_opts) + '\n')

dataset_features_tool = JSONSequenceTool(dataset_features)
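The callable-class idea can be demonstrated without Rasterio at all. Here is a stdlib-only analogue (JSONSequenceWriter and line_records are hypothetical names for illustration): the class owns the file plumbing, the wrapped function owns the extraction.

```python
import json

def line_records(path):
    """Example record source: yield one JSON-serializable dict per input line."""
    with open(path) as f:
        for i, line in enumerate(f):
            yield {"line": i, "text": line.rstrip("\n")}

class JSONSequenceWriter:
    """Stdlib analogue of the JSONSequenceTool sketch above."""

    def __init__(self, func):
        self.func = func

    def __call__(self, src, dst, dst_opts=None):
        dst_opts = dst_opts or {}
        with open(dst, "w") as fdst:
            for obj in self.func(src):
                fdst.write(json.dumps(obj, **dst_opts) + "\n")

# A concrete tool is just the class applied to an extraction function.
lines_tool = JSONSequenceWriter(line_records)
```

Swapping line_records for a dataset-reading function recovers the Rasterio version; nothing in the class itself knows about rasters.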

Thanks in advance for your comments. I'm eager to hear them.

cc: especially @choldgraf @lwasser who brought up the issue of standardizing high-level utilities in #1273 and @eseglem, @jdmcbr, @perrygeo, @brendan-ward, @geowurster, @vincentsarago who have worked on tool-like functions in Rasterio.

@eseglem
Contributor

eseglem commented Mar 24, 2018

I think everything under 'Tool configuration environment' looks like a nice option. Definitely worth having the Env in there if you are going that route. A higher level of abstraction like that would be very handy, and could provide an easy starting point for new analysis.

Not sure the callable classes would be necessary though. I have no argument against them, just not sure they provide anything more than the other option.

vincentsarago pushed a commit to cogeotiff/rio-cogeo that referenced this issue Mar 27, 2018
@jdmcbr
Contributor

jdmcbr commented Mar 28, 2018

@sgillies Thanks for putting this together! For background, I've never used any rio CLI tools except for rio mbtiles, and do a lot of work in ipython consoles that winds up wrapping rasterio calls so that at some point it takes an input file, or list of files, and generates an output file, or list of files.

I'm curious how narrow/broad the pattern sketched in dataset_features_tool is; in particular, the tool class suggests the possibility of a fairly specific type of pattern, and I'm having trouble imagining how that would work more generally. Might there be a (new) tools module with separate functions like merge and mask, wrapping the respective functions that operate on rasters? I ask about those, of course, because those two are the originally-rio-only functions that I was interested in using outside of rio.

My filesystem wrapper for merge takes a list of input files and an output filename, and the only real logic in the function is to take the outputs from rasterio.merge.merge to create the appropriate profile for the output raster (transform, height, width). My filesystem wrapper for mask takes a raster filename and a vector filename, and the (minimal) logic just gets the records from the shapefile, passes them to rasterio.mask.mask along with the open raster, and uses the outputs to create the appropriate profile for the output raster (transform, height, width); this is essentially the doc in https://rasterio.readthedocs.io/en/latest/topics/masking-by-shapefile.html, with a couple of file opens at the start. Most of the logic in my merge/mask wrappers is the same. But given the rather different inputs and outputs to merging and masking, and the need to call distinct rasterio functions within them, I'm not seeing immediately how they both might inherit from a tool class.

I'm also curious how, if at all, another common tool in my personal toolbox (which with minor modification looks fairly similar to dataset_features_tool) might fit in with this proposal:

def apply_function_tool(src, dst, func, src_opts=None, dst_opts=None,
                        config=None, func_kwargs=None):
    src_opts = src_opts or {}
    dst_opts = dst_opts or {}
    config = config or {}
    func_kwargs = func_kwargs or {}

    with rasterio.Env(**config):
        with rasterio.open(src, **src_opts) as dsrc:
            src_data = dsrc.read()
            dst_data = func(src_data, **func_kwargs)
            # a fair fraction of my operations will change the band count and dtype,
            # but all remaining relevant attributes are unchanged
            dst_opts['count'] = dst_data.shape[0]
            dst_opts['dtype'] = dst_data.dtype.name
            with rasterio.open(dst, 'w', **dst_opts) as fdst:
                fdst.write(dst_data)

@sgillies
Member Author

sgillies commented Apr 3, 2018

@jdmcbr since we seem to have some consensus that the pattern up until I introduce the tool class (which is an implementation detail) is okay, I think it's time to discuss whether tools should be gathered together in one module (which affects the API and should be dealt with before 1.0) and how open the project should be to accepting new tools. I'm usually inclined to minimize the number of tools. Our project is smaller than GDAL and won't be able to maintain 40+ tools like that project does.

Now, the apply_function_tool above: if generalized to multiple inputs, I could see wrapping the existing rio-calc CLI command around it. It's a good example of one that could be in the Rasterio core tools.
