Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download support broken for some GeoDatasets if directory does not exist #1605

Closed
adamjstewart opened this issue Sep 30, 2023 · 8 comments · Fixed by #1612
Closed

Download support broken for some GeoDatasets if directory does not exist #1605

adamjstewart opened this issue Sep 30, 2023 · 8 comments · Fixed by #1612
Labels
datasets Geospatial or benchmark datasets
Milestone

Comments

@adamjstewart
Copy link
Collaborator

Description

@adriantre this is a bug we (mostly I) introduced in #1442/#1597. The issue is that if os.path.isdir(file) is False, we automatically add the file to self.files, and then self.files evaluates to True in _verify. I think the safest solution is to not use self.files in _verify, but you might be able to think of a better solution.

I first noticed this during 8ec2e93 but didn't realize the impact of it until fe546bf. I still want to get this release out so I figured I would give us something to do in 0.5.1. I don't know how many other datasets are impacted.

Current workaround is to create the directory before downloading. Not hard, just annoying.

Steps to reproduce

> python3 
>>> from torchgeo.datasets import ChesapeakeDE
>>> ChesapeakeDE('fake/path', download=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/Adam/torchgeo/torchgeo/datasets/chesapeake.py", line 138, in __init__
    super().__init__(paths, crs, res, transforms=transforms, cache=cache)
  File "/Users/Adam/torchgeo/torchgeo/datasets/geo.py", line 434, in __init__
    raise FileNotFoundError(msg)
FileNotFoundError: No ChesapeakeDE data was found in `paths='fake/path''`

Version

0.5.0

@adamjstewart adamjstewart added the datasets Geospatial or benchmark datasets label Sep 30, 2023
@adamjstewart adamjstewart added this to the 0.5.1 milestone Sep 30, 2023
@adriantre
Copy link
Contributor

adriantre commented Oct 2, 2023

Before 0.5.0, each (individual) method _verify returned successfully if the file / all files exist locally. This check could be moved into self.files. I propose that self.files returns the set of files that exists locally; if none is found returns the empty set. I believe this is the same behaviour as this line before 0.5.0.

# GeoDataset
@property
def files(self) -> set[str]:
    ...  # unchanged code removed

    files: set[str] = set()
    for path in paths:
        if os.path.exists(path):
            if os.path.isfile(path):
                files.add(path)
            else:  # meaning os.path.isdir(path) == True, since exists is checked
                pathname = os.path.join(path, "**", self.filename_glob)
                files |= set(glob.iglob(pathname, recursive=True))
        else:
            files |= self.handle_nonlocal_path(path)

    return files

@staticmethod
def handle_nonlocal_path(path: str) -> set:
    return set()

With this setup the user must make sure the files are available (as before 0.5.0). They can download files, unpack zips etc. Or they can implement their own logic by overriding handle_nonlocal_path to, for instance, point to a file within a zip-archive, point to a file in azure blob etc.

Proposed docstring:

A set of all available files in the dataset.

For all paths, finds local files that match filename_glob.
Recursively iterates any directory in paths.
Paths that do not match filename_glob in the local file system is by
default ignored. This behaviour may be overridden in
method handle_nonlocal_path.

Running ChesapeakeDE('fake/path', download=True) works with the above changes. I have not yet tested anything else.

I have a branch reflecting these changes. I'll wait with the PR until we have discussed the approach.

@adamjstewart
Copy link
Collaborator Author

I propose that self.files returns the set of files that exists locally; if none is found returns the empty set.

I don't like this because it makes it impossible to use remote/zip files without the user overriding handle_nonlocal_path. I'm under the impression that these things "just work" at the moment. As long as GDAL knows how to read remote/zip files, we don't need to confirm that they exist.

I think you already looked into this and found that it wasn't possible to confirm whether a file was remote/zip vs. local? Like the path may not always contain vsi or something like that. But will the path always contain a colon? What if we check both os.path.isfile and os.path.isdir for paths without a colon and trust the user for paths with a colon?

I agree that files should not return anything for an obviously local files/dirs that do not exist. Honestly, I would go further and turn it into an error message.

P.S. Would love to have better support for glob chars. Something like paths="*.tif" should work, but I don't think it does at the moment. So maybe keep that in mind during refactoring.

@adriantre
Copy link
Contributor

I did not realise until now the the if dir else add-logic was motivated by supporting vsi out of the box 😅 that is indeed a good feature.

There are two ways of specifying vsi-path, one without any commas. But I have looked at this before, something similar may be feasible.

I'll be back.

@calebrob6
Copy link
Member

FYI I was able to use v0.5 to do COG reading with RasterDataset -- https://gist.github.com/calebrob6/df765911d86df6c648e99060222b1e0b

@adriantre
Copy link
Contributor

adriantre commented Oct 3, 2023

Drafted up a proposition in #1612 .

@adriantre
Copy link
Contributor

P.S. Would love to have better support for glob chars. Something like paths="*.tif" should work, but I don't think it does at the moment. So maybe keep that in mind during refactoring.

Would this replace the filename_glob?

(btw, looks like something similar are coming to vsi (?) )

@adamjstewart
Copy link
Collaborator Author

Nope, it wouldn't replace it. We still need the ability to set a default filename glob for subclasses of RasterDataset. But this would allow people to override it without creating a subclass. This is a feature people frequently request: #1524

@adriantre
Copy link
Contributor

Yes, I meant override, not replace 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants