RasterDataset.from_files #1427
This makes curation of datasets more flexible. We may find inspiration in how Lightning Flash implements these constructors like |
Another motivation for this is for reading files within archives (e.g. zip). Instead of implementing logic for listing and matching files also within archives, the user can construct these themselves, e.g.:
which rasterio can read directly without unzipping. |
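A minimal sketch of how such a file list might be built, assuming rasterio's `zip://<archive>!/<member>` path scheme. The helper name `list_tifs_in_zip` is hypothetical and only illustrates the idea; it is not part of torchgeo or rasterio.

```python
import zipfile


def list_tifs_in_zip(archive_path: str, suffix: str = ".tif") -> list[str]:
    """Return 'zip://' style paths for archive members ending in `suffix`.

    Hypothetical helper: it only builds path strings and opens no rasters.
    """
    with zipfile.ZipFile(archive_path) as archive:
        members = [
            name for name in archive.namelist() if name.lower().endswith(suffix)
        ]
    # Scheme assumed here: zip://<path to archive>!/<path inside archive>
    return [f"zip://{archive_path}!/{name}" for name in sorted(members)]
```

The resulting list could then be handed to whichever constructor ends up accepting explicit file paths.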
So, my suggestion flips the logic a little as I made [...] If this makes sense, I could try to write common/analogous logic for VectorDataset. What do you think?

    class RasterDataset(GeoDataset):
        def __init__(
            self,
            filepaths: list[str],
            ...,
            transforms=None,
        ) -> None:
            super().__init__(transforms)
            self.bands = bands or self.all_bands
            self.cache = cache
            # Populate the dataset index
            i = 0
            filename_regex = re.compile(self.filename_regex, re.VERBOSE)
            for filepath in filepaths:
                match = re.match(filename_regex, os.path.basename(filepath))
                if match is not None:
                    # rest of logic from existing __init__
                    ...

        @classmethod
        def from_root(cls, root: str = "data", *args, **kwargs):
            # Recursively list files under root, then delegate to __init__
            pathname = os.path.join(root, "**", cls.filename_glob)
            filepaths = [filepath for filepath in glob.iglob(pathname, recursive=True)]
            return cls(filepaths, *args, **kwargs)

        @classmethod
        def from_root_vsi(cls, root: str, *args, **kwargs):
            # Same idea for a virtual file system root
            filepaths = listdir_vsi_recursive(root)
            return cls(filepaths, *args, **kwargs)
|
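The classmethod pattern proposed above can be sketched in runnable form with a toy class. `FileListDataset` is a hypothetical stand-in, not torchgeo's RasterDataset, and it keeps only the file-matching part of the logic.

```python
import glob
import os
import re


class FileListDataset:
    """Toy stand-in for RasterDataset (hypothetical, for illustration only)."""

    filename_glob = "*.tif"
    filename_regex = r".*"

    def __init__(self, filepaths: list[str]) -> None:
        # Primary constructor: keep only files whose basename matches the regex.
        pattern = re.compile(self.filename_regex, re.VERBOSE)
        self.files = [
            path for path in filepaths if pattern.match(os.path.basename(path))
        ]

    @classmethod
    def from_root(cls, root: str = "data") -> "FileListDataset":
        # Secondary constructor: glob recursively, then delegate to __init__.
        pathname = os.path.join(root, "**", cls.filename_glob)
        filepaths = sorted(glob.iglob(pathname, recursive=True))
        # The instance must be returned, otherwise callers receive None.
        return cls(filepaths)
```

Because `from_root` calls `cls(...)`, a subclass that overrides `filename_glob` or `filename_regex` gets an instance of itself without redefining the constructor.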
I would prefer to default to a root path, as that should be much more common |
I see. Classes that inherit from RasterDataset would need to be modified by moving root-logic from [...] So we keep [...] I also considered singledispatch to override __init__ based on the first argument type, but we may in the future need multiple constructors for the same type.

    from functools import singledispatchmethod

    class RasterDataset(GeoDataset):
        @singledispatchmethod
        def __init__(self, input: list[str], ...):
            # Logic from existing __init__ with root-listing factored out
            ...

        @__init__.register(str)
        def _from_root_directory(self, input: str = "data", ...):
            pathname = os.path.join(input, "**", self.filename_glob)
            filepaths = [filepath for filepath in glob.iglob(pathname, recursive=True)]
            return self.__init__(filepaths, ...)

        # NB! This will not work as str is already overridden
        @__init__.register(str)
        def _from_root_vsi(self, input: str, ...):
            filepaths = listdir_vsi_recursive(input)
            return self.__init__(filepaths, ...)

    # Usage (singledispatchmethod dispatches on the first positional argument)
    d1 = RasterDataset('my_root')
    d2 = RasterDataset(['my_file1.tif', 'my_file2.tif'])
|
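The limitation flagged in the "NB" comment can be reproduced with a small self-contained class. `ToyDataset` and its placeholder file names are hypothetical; the sketch only shows that a second registration for `str` replaces the first.

```python
from functools import singledispatchmethod


class ToyDataset:
    """Hypothetical stand-in used only to show the dispatch behaviour."""

    @singledispatchmethod
    def __init__(self, paths):
        # Fallback for any type without a registered implementation (e.g. list).
        self.files = list(paths)
        self.source = "files"

    @__init__.register(str)
    def _from_root_directory(self, root):
        self.files = [f"{root}/placeholder.tif"]
        self.source = "directory"

    @__init__.register(str)  # same type again: replaces the entry above
    def _from_root_vsi(self, root):
        self.files = [f"/vsizip/{root}/placeholder.tif"]
        self.source = "vsi"


from_list = ToyDataset(["my_file1.tif", "my_file2.tif"])
from_str = ToyDataset("my_root")
print(from_list.source, from_str.source)
```

Only one implementation per argument type survives, so two string-based constructors (a local root and a virtual file system root) cannot coexist with this approach. Dispatch also happens on the first positional argument, so the argument has to be passed positionally.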
Ah, I see the issue now. This is more annoying than I thought. Maybe we should just accept both a string and a list of strings. If the input is one or more files, we use them as is. If the input is one or more roots, we recursively search them. Then we don't need any classmethods. Thoughts? |
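One way this could look, as a rough sketch rather than the implementation that was eventually merged: a helper that accepts a string or an iterable of strings, searches directories recursively, and passes everything else through as a file path. The name `resolve_files` and its `filename_glob` parameter are hypothetical.

```python
import glob
import os
from collections.abc import Iterable
from typing import Union


def resolve_files(
    paths: Union[str, Iterable[str]], filename_glob: str = "*.tif"
) -> list[str]:
    """Expand directories recursively; pass anything else through unchanged."""
    if isinstance(paths, str):
        paths = [paths]
    files: set[str] = set()
    for path in paths:
        if os.path.isdir(path):
            # A root directory: search it recursively for matching files.
            pathname = os.path.join(path, "**", filename_glob)
            files.update(glob.iglob(pathname, recursive=True))
        else:
            # Treated as a file; virtual paths (e.g. inside archives) pass through.
            files.add(path)
    return sorted(files)
```

With a helper like this, `__init__` can keep a single signature and no classmethods are needed, at the cost of the argument meaning either "roots" or "files" depending on what is on disk.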
It feels a little dirty, but I agree that it is the easiest. Edit: In retrospect I think it is ok. |
The backwards-incompatible tag in #1442 made me shudder for a second, so I wanted to come back to the design briefly to make sure the conclusions we came to here were correct. I also read through https://realpython.com/python-multiple-constructors/ which provided a great tutorial of the various options available to us:
Simulating Multiple Constructors in Your Classes
Using Optional Argument Values in
|
Thank you for an excellent summary. I too am happy with our choice as it is both simple and flexible. Backwards compatibility may of course be an issue if users do make use of kwarg |
Summary
We should add a from_files classmethod constructor to RasterDataset and VectorDataset.
Rationale
Our current implementation allows the user to specify a root directory, which the class will recursively search for geospatial files. However, there are a few issues with this:
A secondary class constructor would alleviate these issues and allow users to specify a list of files any way they want to.
Implementation
The idea would be to add @classmethod constructors to RasterDataset and VectorDataset. These methods will largely share the same code as __init__, so we should move the shared code to a common helper function. The only difference will be replacing the glob with a user-supplied list of files.
Alternatives
We could optionally allow root to be a list and assume that this is a list of files, not a list of directories to search. We could even support both if we first check whether the contents of the list are directories or files.
Additional information
@calebrob6 @adriantre