Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RasterDataset.from_files #1427

Closed
adamjstewart opened this issue Jun 19, 2023 · 9 comments · Fixed by #1442
Closed

RasterDataset.from_files #1427

adamjstewart opened this issue Jun 19, 2023 · 9 comments · Fixed by #1442
Labels
datasets Geospatial or benchmark datasets
Milestone

Comments

@adamjstewart
Copy link
Collaborator

Summary

We should add a from_files classmethod constructor to RasterDataset and VectorDataset.

Rationale

Our current implementation allows the user to specify a root directory, which the class will recursively search for geospatial files. However, there are a few issues with this:

  • It doesn't work well for remote files (Add kwarg vsi to RasterDataset to support GDAL VSI #1399)
  • It doesn't work well for drivers that split each scene into multiple auxiliary files
  • It doesn't allow users to select a subset of data unless it is separated geographically or in separate directories

A secondary class constructor would alleviate these issues and allow users to specify a list of files any way they want to.

Implementation

The idea would be to add @classmethod constructors to RasterDataset and VectorDataset. These methods will largely share the same code as __init__ so we should move those to a common helper function. The only difference will be replacing the glob with a user-supplied list of files.

Alternatives

We could optionally allow root to be a list and assume that this is a list of files, not a list of directories to search. We could even support both if we first check if the contents of the list are directories or files.

Additional information

@calebrob6 @adriantre

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Jun 19, 2023
@adamjstewart adamjstewart added this to the 0.5.0 milestone Jun 19, 2023
@adriantre
Copy link
Contributor

This makes curation of datasets more flexible. We may find inspiration in how Lightning Flash implements these constructors like from_folders and from_files

https://lightning-flash.readthedocs.io/en/stable/api/generated/flash.image.detection.data.ObjectDetectionData.html?highlight=From_folders#flash.image.detection.data.ObjectDetectionData.from_folders

@adriantre
Copy link
Contributor

adriantre commented Jun 20, 2023

Another motivation for this is for reading files within archives (e.g. zip). Instead of implementing logic for listing and matching files also within archives, the user can construct these themselves, e.g.:

zip:///path/to/file.zip!/folder/file.tif'

which rasterio can read directly without unzipping.

@adriantre
Copy link
Contributor

adriantre commented Jun 21, 2023

So, my suggestion flips the logic a little as I made from_files the default in __init__. Then from_root just become a special case of how to find these files. And similarly for vsi or other special cases.

If this makes sense I could try to write a common/analogous logic for VectorDataset. filename_glob could now also be passed as input arg to from_root instead of being class argument.

What do you think?

class RasterDataset(GeoDataset):
    def __init__(
            self,
            filepaths: list[str],
            ...,
            transforms: = None,
    ) -> None:
        super().__init__(transforms)

        self.bands = bands or self.all_bands
        self.cache = cache

        # Populate the dataset index
        i = 0
        filename_regex = re.compile(self.filename_regex, re.VERBOSE)
        for filepath in filepaths:
            match = re.match(filename_regex, os.path.basename(filepath))
            if match is not None:
                # rest of logic from existing __init__
                ...

    @classmethod
    def from_root(cls, root: str = "data", *args, **kwargs):
        pathname = os.path.join(root, "**", cls.filename_glob)
        filepaths = [filepath for filepath in glob.iglob(pathname, recursive=True)]
        cls(filepaths, *args, **kwargs)

    @classmethod
    def from_root_vsi(cls, root: str, *args, **kwargs):
        filepaths = listdir_vsi_recursive(root)
        cls(filepaths, *args, **kwargs)

@adamjstewart
Copy link
Collaborator Author

I would prefer to default to a root path, as that should be much more common

@adriantre
Copy link
Contributor

adriantre commented Jun 21, 2023

I see. Classes that inherit from RasterDataset would need to be modified by moving root-logic from __init__ to their own override of from_root, and usage (initialisation) must be changed.

So we keep init to expect arg root. from_files will then call init, and thus need to pass root and filepaths. So should we then allow root to be both str and list[str]?

I also considered singledispatch to override init based on the first argument type, but we may in the future need multiple constructors for the same type.

from functools import singledispatchmethod

class RasterDataset(GeoDataset):
    @singledispatchmethod
    def __init__(self, input: list[str], ...):
        # Logic from existing __init__ with root-listing factored out

    @__init__.register(str)
    def _from_root_directory(self, input: str = "data", ...):
        pathname = os.path.join(input, "**", self.filename_glob)
        filepaths = [filepath for filepath in glob.iglob(pathname, recursive=True)]
        return self.__init__(filepaths, ...)

    # NB! This will not work as str is already overridden
    @__init__.register(str)
    def _from_root_vsi(self, input: str, ...):
        filepaths = listdir_vsi_recursive(root)
        return self.__init__(filepaths, ...)

# Usage
d1 = RasterDataset(input='my_root')
d2 = RasterDataset(input=['my_file1.tif', 'my_file2.tif'])

@adamjstewart
Copy link
Collaborator Author

Ah, I see the issue now. This is more annoying than I thought. Maybe we should just accept both a string and a list of strings. If the input is one or more files, we use them as is. If the input is one or more roots, we recursively search them. Then we don't need any classmethods. Thoughts?

@adriantre
Copy link
Contributor

adriantre commented Jun 22, 2023

It feels a little dirty, but I agree that it is the easiest.

Edit: In retrospect I think it is ok.

@adamjstewart
Copy link
Collaborator Author

The backwards incompatible tag in #1442 made me shudder for a second so I wanted to come back to the design briefly just to make sure the conclusions we came to here were correct. I also read through https://realpython.com/python-multiple-constructors/ which provided a great tutorial of the various options available to us:

Simulating Multiple Constructors in Your Classes

Using Optional Argument Values in .__init__()

Why did we never consider this? Couldn't we keep root: str and add filepaths: list[str] to __init__? The main advantage of this is backwards compatibility, but I don't care much for backwards compatibility at this point in development, so we should only do this if all other options are considered equally (hint, they aren't).

Checking Argument Types in .__init__()

This is the solution we went with in #1442. In hindsight, I still think this is still the best decision because it allows for the greatest flexibility. For example, the following options are only possible with this option (or at least much easier):

  • list of dirs: ['dir1', 'dir2', ...]
  • single path: 'file1'
  • combination: ['dir1', 'file1', 'dir2', 'file2']

Providing Multiple Constructors With @classmethod in Python

This is the initial idea I had in mind when I opened this issue. We basically decided that this would work fine if __init__ accepted a list of files and classmethod from_directories called it, but the reverse wouldn't work. Since passing a root is (probably?) more common than passing a list of files, this is probably the worst option.

Providing Multiple Constructors With @singledispatchmethod

I've never used this before but it looks really interesting. Above you considered this option but discounted it because

we may in the future need multiple constructors for the same type.

I would say this actually isn't that big of a deal. The other options have the exact same limitation, and can be solved by adding classmethods for special cases: from_vsi, from_s3, etc. But this doesn't support the same options as type checking, so it's equal to optional argument values in my ranking.

Alternatives

Subclasses for each source

We could use RasterDirectoryDataset, RasterFileDataset, RasterVSIDataset, etc. subclasses. However, we would need subclasses for everything: Landsat9DirectoryDataset, Landsat9FileDataset, Landsat9VSIDatast, etc. This isn't sustainable, making this the worst possible option.

TorchData pipelines for each source

My original plan was to add support for a data pipeline that could either start from directory, file, or S3 bucket. However, TorchData is no longer under active development and may die off entirely: pytorch/data#1196.

Conclusion

In conclusion, I think we made the right decision (unless someone finds fault with my above statements or has better alternative ideas). This was primarily an exercise designed to give me something to come back to when asking myself "why did we decide to do it this way again?". I'll now go back to reviewing #1442.

@adriantre
Copy link
Contributor

Thank you for an excellent summary. I too am happy with our choice as it is both simple and flexible. Backwards compatibility may of course be an issue if users do make use of kwarg root instead of passing first argument. This could be solved by supporting a kwarg root which maps to paths in the init, and yield a deprecation warning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants