Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check if path is vsi #1612

Merged

Conversation

adriantre
Copy link
Contributor

Fix #1605

In 0.5 we included property files in GeoDatasets. files tries to list all files matching the filename_glob. It also tries to support virtual file systems. In doing so, it accidentally accepts non-directories as valid files.

The proposed solution will do a basic check of the path prefix to see if it looks like a vsi. Paths not matching these criteria will be ignored (to match behaviour pre version 0.5). Could raise FileNotFound instead. Then _verify-methods of Datasets could try-except to decide if files need downloading.

Current limitations:

  • Does not check if the vsi file exists, or is a 'directory'
  • Does not list files in vsi directories.

@adamjstewart
Copy link
Collaborator

Why not check for all files containing ://? Do os.path.isdir or glob.glob work for file://?

@adamjstewart adamjstewart added this to the 0.5.1 milestone Oct 3, 2023
@adriantre
Copy link
Contributor Author

adriantre commented Oct 3, 2023

I forgot to push the isfile in elif os.path.isfile(path) or path_is_vsi(path):

Do os.path.isdir or glob.glob work for file://?

I don't think it does. But it is a valid vsi scheme.

Why not check for all files containing ://?

Should be sufficient, but then we only support the apache/rasterio syntax az://bucket/blob.tif. This is equivalent to /vsiaz/bucket/blob.tif which is the gdal syntacx. Both syntax is supported by rasterio. In fact, rasterio converts it to the latter before calling the underlying gdal methods.

Edit: Also I think it is good practice to verify the vsi syntax against the ones supported by rasterio. Then we can yield meaningful error messages if there are typos. But I guess rasterio would complain that the file is not found. The SCHEME dict that I check against may also become outdated if rasterio does not keep it updated. Is that a risk with protected modules, like this one rasterio._paths?

@adamjstewart
Copy link
Collaborator

Is that a risk with protected modules, like this one rasterio._paths?

Yes, I usually try to avoid using these. I wonder if there's a different way we can access that info...

@adriantre
Copy link
Contributor Author

Could create a feature request in rasterio for non-private methods/classes for doing this verification?

@adriantre
Copy link
Contributor Author

adriantre commented Oct 3, 2023

This could be useful, but it will be deprecated and will be removed in rasterio 1.4. Cannot find any information pointing to why or what will replace it. I'll ask in the rasterio discussion board.

from rasterio.path import ParsedPath
p = ParsedPath.from_uri("zip://*.tif::files.zip")
p.scheme  # 'zip'
p.is_local  # True
p.is_remote  # False

I see that rasterio are working on api-changes for opening vsi. Pasting here for reference. I do not yet know if this means will need to change anything when upgrading rasterio.

docs:
https://rasterio.readthedocs.io/en/latest/topics/vsi.html#python-file-openers
usage in tests:
https://github.com/rasterio/rasterio/blob/d966440c06f3324aca1fa761d490cc780a9f619c/tests/test_pyopener.py#L20

@adriantre
Copy link
Contributor Author

FYI, I started this 'Q&A' over at rasterio
rasterio/rasterio#2930

@adriantre
Copy link
Contributor Author

Rasterio don't want to expose the path-validation, and say this API is prone to change (as we expected). They suggest copying in the relevant parts of this validation, as I do in this PR.

What do we think. Are we on the right track here? I do think it would be practical to give the users the possibility to implement their own logic for files that are not visible to os and glob packages, somethink like: else: handle_non_local_path.

@calebrob6
Copy link
Member

I'm fine with keeping the check on our side as the changes should be "someone wants a new VSI in GeoDataset" vs. an update to rasterio breaks us.

@adamjstewart
Copy link
Collaborator

My concern is that if the list of supported schemes changes according to rasterio version, it will be annoying to validate in torchgeo. What's the downside of treating all paths with a : in them as remote files and letting rasterio issue it's own error msg if something doesn't work?

@adriantre adriantre force-pushed the bugfix/fix_invalid_path_in_paths branch from 4b5d067 to 2b9f13b Compare October 16, 2023 08:31
@adriantre
Copy link
Contributor Author

adriantre commented Oct 16, 2023

I agree. If the user inputs invalid paths, then this line will complain.
https://github.com/microsoft/torchgeo/blob/main/torchgeo/datasets/geo.py#L434

files property will then:

  • for local files: validate existence and recursively list files matching glob
  • for vsi: return without further ado
    (requires users to perform listing and validation themselves, and provide as paths)

Should we somehow inform users about this difference?

This is outside the scope of this bug-fix and is something we can fix when the use of vsi shows/grows. In the mean time users can implement the listing of files themselves e.g. by overriding property files and add this logic.

When rasterio settles on how they treat vsi then we can take a new look at the possibilities it may bring.

@github-actions github-actions bot added the testing Continuous integration testing label Oct 16, 2023
torchgeo/datasets/utils.py Show resolved Hide resolved
torchgeo/datasets/utils.py Outdated Show resolved Hide resolved
torchgeo/datasets/utils.py Outdated Show resolved Hide resolved
tests/datasets/test_geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/geo.py Outdated Show resolved Hide resolved
@adriantre adriantre force-pushed the bugfix/fix_invalid_path_in_paths branch from 4bf52dd to 28899cb Compare October 23, 2023 16:33
tests/datasets/test_geo.py Outdated Show resolved Hide resolved
tests/datasets/test_geo.py Outdated Show resolved Hide resolved
tests/datasets/test_geo.py Outdated Show resolved Hide resolved
torchgeo/datasets/utils.py Show resolved Hide resolved
Wait with pathlib syntax

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@adamjstewart adamjstewart enabled auto-merge (squash) October 31, 2023 14:02
@adamjstewart adamjstewart merged commit 32a7307 into microsoft:main Oct 31, 2023
21 checks passed
@adamjstewart
Copy link
Collaborator

When I try to download AbovegroundLiveWoodyBiomassDensity, it complains that my directory doesn't exist and will be ignored. This is because we check self.files in _verify. Wonder if there's a way we could ignore that warning...

nilsleh pushed a commit that referenced this pull request Nov 6, 2023
* Add check if path is vsi

* Add url to reference for apache vsi syntax

* Add missing check to if

* Copy rasterio SCHEMES definition into torchgeo

* Check all schemes, not only last

* Simplify method path_is_vsi

* Add tests

* Remove print

* Update test names

* Add missing comma in list

* Update torchgeo/datasets/utils.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Update torchgeo/datasets/utils.py

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

* Use pytest tmp_path for test

* Warn if some of input paths are invalid

* Update docstring for mocked class

* Handle tests failing due to UserWarning

* Remove unnecessary filterwarning

* Test CustomGeoDataset instead of MockRasterDataset

* Merge two similar tests

* str instead of as_posix

Wait with pathlib syntax

Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>

---------

Co-authored-by: Adrian Tofting <adrian@vake.ai>
Co-authored-by: Adam J. Stewart <ajstewart426@gmail.com>
@adriantre
Copy link
Contributor Author

Right. Would it make sense to create a common verify-method and ignore the warning there? This way the ignore is only "required" one place in the code.

@adamjstewart
Copy link
Collaborator

I would love a common _verify method but haven't had the time/energy to do it. This relates to #99.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Download support broken for some GeoDatasets if directory does not exist
3 participants