RFC: file and dataset access in Rasterio tools and plugins #1300
Comments
I think everything under 'Tool configuration environment' looks like a nice option. Definitely worth having the Env in there if you are going that route. A higher level of abstraction like that would be very handy, and could provide an easy starting point for new analysis. Not sure the callable classes would be necessary though. Have no argument against it, just not sure it provides anything more than the other option.
@sgillies Thanks for putting this together! For background, I've never used any I'm curious how narrow/broad the pattern sketched in I'm also curious how, if at all, another common tool in my personal toolbox (which with minor modification looks fairly similar to
import rasterio

def apply_function_tool(src, dst, func, src_opts=None, dst_opts=None,
                        config=None, func_kwargs=None):
    src_opts = src_opts or {}
    dst_opts = dst_opts or {}
    config = config or {}
    func_kwargs = func_kwargs or {}
    with rasterio.Env(**config):
        with rasterio.open(src, **src_opts) as dsrc:
            src_data = dsrc.read()
            dst_data = func(src_data, **func_kwargs)
            # a fair fraction of my operations will change the band count and dtype,
            # but all remaining relevant attributes are unchanged
            # (dst_opts is assumed to already carry them: driver, width, height, ...)
            dst_opts['count'] = dst_data.shape[0]  # band count is the first axis
            dst_opts['dtype'] = dst_data.dtype.name
            # rasterio.open() takes mode 'w' for writing; 'wb' is not a valid mode
            with rasterio.open(dst, 'w', **dst_opts) as fdst:
                fdst.write(dst_data)
@jdmcbr since we seem to have some consensus that the pattern up until I introduce the tool class (which is an implementation detail) is okay, I think it's time to discuss whether tools should be gathered together in one module (which affects the API and should be dealt with before 1.0) and how open the project should be to accepting new tools. I'm usually inclined to minimize the number of tools. Our project is smaller than GDAL and won't be able to maintain 40+ tools like that project does. Now, the
There's a discussion we're having in several places about high-level vs low-level ways of accessing paths and datasets in Rasterio tools and plugins. This is a request for comments on a pattern that I’ve found. We could use it, modify it, or come up with some better pattern or patterns. Let’s discuss.
Developers want path-based utilities
We all want power, right? We want to do more with less code. High-level file operations like https://docs.python.org/3/library/shutil.html#shutil.copy allow users to do bash-like scripting in Python, using files (or paths) as the primary things.
shutil in the standard library is a good model
There is precedent for this kind of thing in the Python standard library. See https://github.com/python/cpython/blob/master/Lib/shutil.py.
shutil.copyfile() is largely concerned with avoiding copying to the same file, determining whether files exist, and if symbolic links should be resolved. The actual copying is done by shutil.copyfileobj(), which operates on opened Python file objects and doesn’t need to know that there is a filesystem at all.
High-level utilities or tools for Rasterio should work like this, too. They should mostly be concerned with the dataset files (or identifiers) and delegate to methods that operate on opened dataset objects.
Here’s a sketch.
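The sketch that originally followed is not preserved in this copy of the issue. What follows is a minimal reconstruction of the layering only, under stated assumptions: bounds_feature() is a hypothetical dataset-level helper invented for illustration (it is not Rasterio API), and dataset_features_tool() borrows the tool name used later in this issue. The only Rasterio features relied on are rasterio.open() and the bounds attribute of an opened dataset.

```python
import json


def bounds_feature(dataset):
    """Hypothetical dataset-level helper: the dataset extent as a GeoJSON-like Feature.

    It works on any opened object exposing ``bounds`` as
    (left, bottom, right, top) and never touches a path or the filesystem.
    Coordinates stay in the dataset's own CRS; no reprojection is attempted.
    """
    left, bottom, right, top = dataset.bounds
    ring = [[left, bottom], [right, bottom], [right, top], [left, top], [left, bottom]]
    return {
        "type": "Feature",
        "properties": {},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }


def dataset_features_tool(src_path, dst_path):
    """Path-level tool: deals with files, delegates the real work to the helper."""
    # Imported lazily so the helper above stays usable without GDAL installed.
    import rasterio

    with rasterio.open(src_path) as dataset:
        collection = {
            "type": "FeatureCollection",
            "features": [bounds_feature(dataset)],
        }
    with open(dst_path, "w") as f:
        json.dump(collection, f)
```

The split mirrors shutil.copyfile() and shutil.copyfileobj(): the helper can be tested with any stand-in object, while the tool owns everything path-related.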
But geospatial files aren’t the same as filesystem files
Operations like shutil.copyfile() are easy to describe and implement because they operate on a standardized operating system file – an array of bytes with associated metadata – where the content is irrelevant.
Geospatial datasets, and more specifically GDAL files, come in many flavors. They aren’t always a single file, and sometimes they are application protocols (like WFS “files”). Sometimes they aren’t even compatible with each other. What does it mean to convert a many-layered netCDF file to a JPEG? In our tools, it’s not going to be as simple as with shutil.copyfile().
Say we want to open the source dataset with some format-specific conditions (like number of threads for a GeoTIFF or block sizing for JP2) or specify a compact JSON encoding for the output. This requires us to be able to pass options through dataset_features_tool() into rasterio.open() and json.dumps().
I think it’s a good idea to have one pattern for doing this in Rasterio and its plugins.
Here’s a sketch of a pattern.
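The original sketch is missing from this copy, so the following is a hedged reconstruction of the idea rather than the author's code. It assumes two dictionary keyword arguments, src_opts and dst_opts (the names used in the comment thread), forwarded to rasterio.open() and json.dumps() respectively. The output written here is a deliberately trivial placeholder; a real tool would delegate to a dataset-level function.

```python
import json


def dataset_features_tool(src_path, dst_path, src_opts=None, dst_opts=None):
    """Write a JSON summary of the dataset at src_path to dst_path.

    src_opts: dict of keyword arguments forwarded to rasterio.open()
    dst_opts: dict of keyword arguments forwarded to json.dumps()
    """
    # Imported lazily; only this path-level layer needs Rasterio.
    import rasterio

    src_opts = src_opts or {}
    dst_opts = dst_opts or {}

    with rasterio.open(src_path, **src_opts) as dataset:
        # Placeholder payload (an assumption for illustration).
        collection = {
            "type": "FeatureCollection",
            "features": [],
            "bbox": list(dataset.bounds),
        }

    with open(dst_path, "w") as f:
        f.write(json.dumps(collection, **dst_opts))


# Possible call, assuming a GeoTIFF named example.tif exists:
# dataset_features_tool(
#     "example.tif",
#     "features.json",
#     src_opts={"num_threads": 2},            # driver opening option
#     dst_opts={"separators": (",", ":")},    # compact JSON encoding
# )
```

The tool never needs to know which options exist; it only routes each dictionary to the function that understands it.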
It’s not as transparent as the following would be
but I don’t think it’s feasible to list all the possible opening options as keyword arguments. There are dozens across all the different format drivers and the code would have to be updated every time a new driver was added.
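For contrast, the more transparent alternative set aside above might look something like the sketch below. This is an illustration, not the author's code: the particular keyword arguments (num_threads, indent, separators) are assumptions chosen as examples, and the payload is a placeholder.

```python
import json


def dataset_features_tool(src_path, dst_path, num_threads=None,
                          indent=None, separators=None):
    """Variant with explicit keyword arguments instead of option dicts."""
    import rasterio  # imported lazily; only needed at call time

    # Every opening option has to be named in the signature and routed
    # by hand, so each new driver option means another parameter here.
    src_opts = {}
    if num_threads is not None:
        src_opts["num_threads"] = num_threads

    with rasterio.open(src_path, **src_opts) as dataset:
        collection = {
            "type": "FeatureCollection",
            "features": [],
            "bbox": list(dataset.bounds),
        }

    with open(dst_path, "w") as f:
        f.write(json.dumps(collection, indent=indent, separators=separators))
```

The signature documents itself, but it covers only the options someone remembered to add, which is the maintenance cost described above.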
Tool configuration environment
Say we want the tool to execute in the context of a set of GDAL configuration options like CPL_DEBUG=ON and GDAL_DISABLE_READDIR_ON_OPEN=TRUE. It’s possible to call the tool within a rasterio Env block as shown below.
But a higher-level abstraction that moves the Env into the tool may make it easier to use. We could add a third keyword argument, sketched below.
A tool class?
If we find this pattern useful, we could consider taking the next step and extract a callable class from it.
I admit I’m beginning to wave my hands a bit in this next sketch. It’s an example of how the pattern could be turned into reusable and maintainable code, not a concrete proposal to write classes like this for Rasterio 1.0.
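The class sketch itself did not survive in this copy. Below is one hedged guess at what such a callable class could look like. It also folds in the third keyword argument from the previous section, here assumed to be named config as in the comment thread, which is passed to rasterio.Env(). The class name DatasetTool and the describe() helper are hypothetical and exist only for this illustration.

```python
import json


class DatasetTool:
    """Hypothetical callable that lifts a dataset-level function to a path-level tool."""

    def __init__(self, func):
        # func takes an opened dataset and returns a JSON-serializable object.
        self.func = func

    def __call__(self, src_path, dst_path, src_opts=None, dst_opts=None,
                 config=None):
        import rasterio  # imported lazily; only needed at call time

        src_opts = src_opts or {}
        dst_opts = dst_opts or {}
        config = config or {}

        # GDAL configuration options apply only inside this Env block.
        with rasterio.Env(**config):
            with rasterio.open(src_path, **src_opts) as dataset:
                result = self.func(dataset)

        with open(dst_path, "w") as f:
            f.write(json.dumps(result, **dst_opts))


def describe(dataset):
    """Dataset-level function: works on any opened dataset-like object."""
    return {"count": dataset.count, "width": dataset.width, "height": dataset.height}


dataset_info_tool = DatasetTool(describe)

# Possible call, assuming example.tif exists:
# dataset_info_tool(
#     "example.tif",
#     "info.json",
#     config={"CPL_DEBUG": True, "GDAL_DISABLE_READDIR_ON_OPEN": True},
# )
```

The opening, configuration, and writing logic lives in one place, and each new tool only has to supply a function of an opened dataset.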
Thanks in advance for your comments. I'm eager to hear them.
cc: especially @choldgraf @lwasser who brought up the issue of standardizing high-level utilities in #1273 and @eseglem, @jdmcbr, @perrygeo, @brendan-ward, @geowurster, @vincentsarago who have worked on tool-like functions in Rasterio.