Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow RasterDataset to accept list of files #1442

Merged

Conversation

adriantre
Copy link
Contributor

Fix #1427

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jun 22, 2023
@adriantre
Copy link
Contributor Author

adriantre commented Jun 22, 2023

Here is the inital proposal. If root is a string, add it to a list and handle it. But we should allow root to be e.g. a pathlib.Path too.

Are there any collection of types that are appropriate for checking either "path-like" or "not list-like"?

Note that we cannot check os.path.isdir or os.path.isfile as these expect the path to exist in the file system, which is not the case for e.g. cloud bucket blobs (vsi).

Also, mypy does not like that self.root now can be str or list because children may use this expecting a str (actual root).

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For directory vs. file, what if we use os.path.exists to determine whether or not the path is on the local filesystem. If it's local, use os.path.isdir or os.path.isfile. If it isn't, assume it's a path.

Alternative is to only support str directories and list files. I would also be satisfied with that solution if you want to keep it simple.

torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
@adamjstewart adamjstewart added this to the 0.5.0 milestone Jun 22, 2023
@github-actions github-actions bot added the testing Continuous integration testing label Jun 23, 2023
@adamjstewart
Copy link
Collaborator

I think this is sufficient for a proof of concept. Wondering what @calebrob6 thinks as he's the one who's been asking for this feature for the longest.

tests/datasets/test_geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
@adriantre
Copy link
Contributor Author

@adamjstewart I am moving this discussion over here.

If we remove os.path.exists it will add non-existing directories to the list. In my opinion, since we are only implementing support for local files at this point, we should let the user know if they are providing non-existing paths. The current implementation does this by raising, and at the same time make it easy for the users to handle non-local files by giving them the possibility to override handle_nonlocal_path.

We could overcome this by checking both if os.path.exists() and os.path.isdir():, but then users that want to handle non-local files must override the whole list_files method instead of only the relevant part of the code.

@calebrob6
Copy link
Member

Hey all -- first, sorry it has taken me so long to get to this!

I just tested this and it behaves exactly as I would expect, so that's awesome.

My only concern is the situation in which someone overrides handle_nonlocal_path(dir_or_file) and then that file is matched to the filename_regex. We are using os.path.basename(filepath) to get the filename, then match that against filename_regex:

for filepath in self.list_files():
    match = re.match(filename_regex, os.path.basename(filepath))
    ...

This might produce unexpected behavior (e.g. if I make a RemoteRasterDataset that takes a list of URLs to Azure blob containers that have SAS tokens appended to the end of them, then a string passed to re.match(...) would be something like "m_4512227_se_10_060_20200719.tif?st=2023-07-18T18%.....").

This seems minor and something that we could handle if it comes up.

@calebrob6 calebrob6 force-pushed the feature/add_support_for_list_of_files branch from 1b7ed03 to 93738fb Compare July 19, 2023 19:51
calebrob6
calebrob6 previously approved these changes Jul 19, 2023
@adamjstewart
Copy link
Collaborator

I'll try to take a closer look at this later this week. I definitely want to get this into the 0.5.0 release in August.

@adamjstewart adamjstewart added the backwards-incompatible Changes that are not backwards compatible label Sep 7, 2023
@adamjstewart adamjstewart changed the title Make RasterDataset accept list of files Allow RasterDataset to accept list of files Sep 7, 2023
@adamjstewart adamjstewart changed the title Allow RasterDataset to accept list of files Allow GeoDataset to accept list of files Sep 27, 2023
Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for taking so long to review this, I graduated and moved to Germany so I've been very preoccupied.

I still really want to sneak this into 0.5.0, which I still want to finish by tomorrow. I know this is very last minute, so if you need me to I'm happy to take over at any point to finish up this PR. Depending on which time zone you're in this may work very well.

Last request: can we do the same thing for VectorDataset? The code will be basically identical, and we only have 2 of them, so shouldn't take long.

tests/conf/l7irish.yaml Outdated Show resolved Hide resolved
tests/datasets/test_agb_live_woody_density.py Outdated Show resolved Hide resolved
tests/datasets/test_eudem.py Outdated Show resolved Hide resolved
tests/datasets/test_geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/agb_live_woody_density.py Outdated Show resolved Hide resolved
torchgeo/datasets/landcoverai.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
@adriantre
Copy link
Contributor Author

adriantre commented Sep 28, 2023

What is not yet fixed is:

  • Decide to merge this and implement for VectorDataset in new PR?
  • How to specify directory to download to, as self.paths may be a list.
  • Where to unpack zip-files when we dont have a root? Current behaviour is in the same folder as zip-file.
  • How to find zip-files if we don't want to allow to override in filename_glob
  • (If above is resolved) Rename list_files to files and make @property + propagate to usage
  • Resolve red tests
    (seems like paths are passed to some non-RasterDatasets. I think tests/conf where ChesapeakeCVPRDataModule is used is the place to start.)
  • Add tests and codecov

@adamjstewart if you want to you can take over now. I hope I did not leave too big of a mess 😅

@adamjstewart
Copy link
Collaborator

How to find zip-files if we don't want to allow to override in filename_glob

Same way we used to, with a call to glob.glob

@adamjstewart
Copy link
Collaborator

Too tired to finish this tonight, will work on this tomorrow.

@adriantre adriantre force-pushed the feature/add_support_for_list_of_files branch from 7141d0d to 628801a Compare September 29, 2023 08:17
@adamjstewart adamjstewart changed the title Allow GeoDataset to accept list of files Allow RasterDataset to accept list of files Sep 29, 2023
@@ -80,14 +80,17 @@ def __init__(
cache: if True, cache file handle to speed up repeated sampling
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamjstewart can you take a look at the docstring for `download? just above this line?

download: if True, download dataset and store it in the root directory

It may be hard to know what that root is

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the majority of users will still use a single string in paths, and the docs do say that paths is one or more root directories, so hopefully they can figure it out. If users report confusion we can make this more clear or add a better error message in a future patch release.

Copy link
Collaborator

@adamjstewart adamjstewart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to merge now!

At the moment, we don't support passing a list if you also want to download/extract a dataset. I think this limitation makes sense, as people won't have an easy way to get a list of files if they don't already have them extracted anyway. But if someone asks for this feature, we can think about how to do it in a future patch release.

Thanks for a great first contribution, I think this is worth highlighting in the release notes as it unlocks a lot of potential for remote files or file filtering.

@adriantre
Copy link
Contributor Author

adriantre commented Sep 29, 2023

Looks good to me! Good job! Hopefully not my last contrib! :)

I'll let you update the branch and push the button 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backwards-incompatible Changes that are not backwards compatible datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RasterDataset.from_files
3 participants