Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Python file-like VSI Plugin #2141

Merged
merged 24 commits into from Nov 13, 2021
Merged

Conversation

djhoese
Copy link
Contributor

@djhoese djhoese commented Apr 2, 2021

The Goal

See corteva/rioxarray#246 for details on the origins of this feature.

The goal is to make it as efficient as possible for rasterio to open a Python file-like object. The biggest benefit of a feature like this comes when a library like fsspec is combined with rasterio. fsspec provides a unified interface for accessing files including remote file systems like S3 or GCS and has the ability to cache files locally. It can provide these files as python file-like objects. An example of using fsspec and providing its file to rasterio looks like:

import rasterio
import fsspec

with fsspec.open('https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/S/DV/2020/4/S2B_13SDV_20200428_0_L2A/TCI.tif') as fobj:
    with rasterio.open(fobj) as src:
        print(src.profile)

The Current Issue

Currently, rasterio uses a MemoryFile, a GDAL supported feature that reads all the necessary data into memory and accesses the data from there instead of disk. This is great, except that it requires reading all of the data from the file before we need it. This means a lot of time (especially for remote resources) and a lot of memory. With the above example and using the rasterio master branch it takes about ~21s on my laptop to do the above.

The Solution - Attempt 2

The current version of this pull request uses GDAL's plugin interfaces to add all necessary hooks (callbacks) for GDAL to communicate with rasterio and the file-like objects that you provide to it. This comes as two parts: the virtual filesystem and the individual file handling. Right now the filesystem is all kept in a single python dictionary (virtual filename -> file wrapper object). The file handling is fairly basic and doesn't need to be too complicated as the basic FILE* operations map to file-like objects (ex read, seek, tell, etc).

I haven't done anything for writing or for multi-range reads.

TODO

  • Switch to GDAL Plugin system. I didn't know about this when I started and it could mean moving all the .cpp stuff into Cython which would be great for maintainability. This will be "Attempt 2".
  • Figure out multi-threading behavior and multi-process behavior. This is mostly related to how the file system treats a file when rasterio is using it. Also, determines if/when we need locking. This includes the GIL. For example, if GDAL is making its own threads (outside of python), then we may need to explicitly create the thread-local GIL.
  • Add tests
  • Add documentation
  • Add docstrings
  • Answer below questions

Questions

  1. Will this replace all rasterio handling of non-string inputs (objects)? Probably not, but in that case when is MemoryFile used instead? Do users get a kwarg to choose? Answer: It will replace reading of file-like objects.
  2. Should this be limited to fsspec objects or any file-like object (handling any missing methods gracefully)? Limiting to fsspec objects may let us assume some things. Answer: All file-like objects.
  3. Are users allowed to provide a filename for the open file object to help GDAL know how to refer to it? If so, what happens if a user provides two different objects with the same name? Answer: Yes, but only to match the existing interface of MemoryFile. Conflicts have an undefined behavior.
  4. Can we support multi-range reads in a semi-standard way? Answer: No. Ignore it for now.

@vincentsarago
Copy link
Member

FYI I've worked on a similar thing last year #1972

@djhoese
Copy link
Contributor Author

djhoese commented Apr 2, 2021

@vincentsarago That's very interesting. Why was work stopped on that?

I thought I had to do the .cpp stuff to subclass the VSIVirtualHandle. As far as I could tell there isn't a reliable way to do this all inside Cython. However, the newer plugin interfaces in GDAL may make this irrelevant.

@vincentsarago
Copy link
Member

@djhoese you can see the discussion over https://rasterio.groups.io/g/dev/topic/add_reader_for_direct_file/75989193
basically we agreed that we didn't really need to add a new abstraction layer.

That's said I still believe that it would be nice to have a direct access to the underlying file.

I'm not a C or Cython expert either 🤷‍♂️

@djhoese
Copy link
Contributor Author

djhoese commented Apr 12, 2021

Ok, these latest commits have all the logic moved to a Cython module and use the GDAL Plugin API. This is so much cleaner and should remain clean once it is done. That may just be my opinion of C++ though. A couple things to note for historical purposes:

  1. I had to define each callback function with with gil or else I got a core dump when accessing the python objects. If I added with gil: inside the function (as a context manager), Cython would complain that it will already have the GIL in this function so that doesn't make sense.
  2. The pUserData that I'm passing around and the way I'm creating the callbacks struct is not scalable. I'm currently passing the wrapper class itself as the filesystem's "data", but the file system should not be tied to the life cycle of a single file wrapper (in my opinion). This may require some global or shared state (maybe a python dictionary) that I use to represent the file system. Without doing this, we wouldn't be able to support reading more than one file at a time.
  3. I found out that Cython is smart enough to do <char*>some_python_bytes_obj but not <void*>some_python_bytes_obj. I ended up doing <void*><char*>some_python_bytes_obj.

I've now deleted my original .cpp and .h files. This should clean up this PR a lot. Now to work on the file system stuff.

@sgillies
Copy link
Member

@djhoese I love this idea. I'm going to schedule some time to give this a good read and see about getting some serious consultation from the VSI system maintainer.

@djhoese
Copy link
Contributor Author

djhoese commented Apr 21, 2021

@sgillies It looks like travis is failing at compilation. Probably because this PR involves pulling in some of GDAL's C++ specific interfaces? Or are these failures happening for all of rasterio?

I was hoping to get the tests passing and then add a nice big docstring to the top of the module explaining how things work. Then I'll probably continue to clean up the code and we can talk about specifics of how this could/should work and missing features.

Comment on lines 29 to 32
cdef extern from "cpl_vsi_virtual.h":
cdef cppclass VSIFileManager:
@staticmethod
void* GetHandler(const char*)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note this is the only reason this has to be compiled as C++ from my basic tests. Without this GetHandler function though, I'd have to re-install the plugin every time the PyVSIFileBase class was created. I use GetHandler to check if the plugin is already installed. Looking at GDAL's source it looks like it creates new objects and copies the struct every time this install is done. Not a huge deal, but not the best thing either.

If we can make the plugin installation happen at import time then maybe this wouldn't matter.

rasterio/_pyvsi.pyx Outdated Show resolved Hide resolved
@sgillies
Copy link
Member

sgillies commented Apr 22, 2021

It lives! Congratulations, @djhoese.

Screenshot from 2021-04-22 17-07-01

@djhoese
Copy link
Contributor Author

djhoese commented Apr 23, 2021

@sgillies That's great! And not all the tests are failing 🎉

My original post has a ton of TODOs and questions that we'll need to consider. To fix the tests I realized that __len__ for my new class wasn't able to get the size of io.BufferedReader object returned by open in the tests (no __len__, no .size, and I'm not doing anything smarter). This isn't a problem for the old method as it just read the entire string as input. There are some smarter estimates I could make, but wanted to bring it up for the future.

@martindurant
Copy link

How are things going here?

@djhoese
Copy link
Contributor Author

djhoese commented May 25, 2021

@martindurant I'm mostly waiting on guidance and decisions from people who know rasterio better than I do. As it sits, it works. But how well do we want it to work? I have my questions in my original PR description that I was hoping people could guide me on. I'm not sure how "all in" we want to go with this implementation. I mean, is this going to be the "go to" interface?

If we can answer some of these questions and the ones above then I can continue working on tests and documentation.

@martindurant
Copy link

Since no one else has commented for a while, I would say that this is the Right Way to go and should proceed. If it works, that's good enough, since nothing like this is possible otherwise. Of course, I don't have insight into how the internals of rasterio/gdal actually work, so I can't review the implementation itself.

@sgillies
Copy link
Member

sgillies commented Jun 5, 2021

@djhoese I have comments and attempts at answers for some of your questions. And an apology for being slow to reply. I'm stretched a little thin at work lately.

  1. Will this replace all rasterio handling of non-string inputs (objects)? Probably not, but in that case when is MemoryFile used instead? Do users get a kwarg to choose?

Yes, I think we could swap PythonVSIFile for MemoryFile in rasterio.open(). What users want is to be able to access a dataset within a Python file object. They don't necessarily want the entire file read into memory. And if they do, they can explicitly use MemoryFile.

Tangentially, GDAL has a feature that I don't see any support for (yet) in fsspec: a raster file and a VRT XML which references it together as siblings in the same zipfile. When I tried (as in https://github.com/mapbox/rasterio/blob/master/tests/test_memoryfile.py#L236)

    >>> with fsspec.open("zip://*.vrt::tests/data/white-gemini-iv.zip") as fobj, rasterio.open(fobj) as dst:
    ...     print(dst.profile)

I see

rasterio._err.CPLE_OpenFailedError: /vsipythonfilelike/bfaa8d65-b599-4dae-ba65-6a581d6524b8/389225main_sw_1965_1024.jpg: Success

because GDAL's VRT driver doesn't find a file at /vsipythonfilelike/bfaa8d65-b599-4dae-ba65-6a581d6524b8/389225main_sw_1965_1024.jpg.

That doesn't impact this PR because zipfiles are not supported as input to rasterio.open.

  1. Should this be limited to fsspec objects or any file-like object (handling any missing methods gracefully)? Limiting to fsspec objects may let us assume some things.

Let's do any file-like object if we can.

  1. Are users allowed to provide a filename for the open file object to help GDAL know how to refer to it? If so, what happens if a user provides two different objects with the same name?

I'm not sure that what happens when names conflict is well defined. That's why I've been using uuids. I'm not opposed to names if they help, but our design principle is that rasterio users should be able to make programs using Python objects only and shouldn't have to think about the names of things within GDAL's virtual filesystems.

  1. Can we support multi-range reads in a semi-standard way?

Is that a question about whether fsspec can support multi-range reads?

@djhoese
Copy link
Contributor Author

djhoese commented Jun 7, 2021

Yes, I think we could swap PythonVSIFile for MemoryFile in rasterio.open()

I think this means that we'll (I'll) need to consider adding write support then, right?

a raster file and a VRT XML which references it together as siblings in the same zipfile

This would be difficult. I think this would require file system methods on the file-like object which I don't (think) is the goal of fsspec's interface...I suppose you could pass an fsspec file system object (@martindurant thoughts?)

I don't see it being impossible for this interface to support both, but obviously this PR becomes much simpler if we stick to just file-like objects instead of any python object.

I'm not opposed to names if they help, but our design principle is that rasterio users should be able to make programs using Python objects only and shouldn't have to think about the names of things within GDAL's virtual filesystems.

I asked because there was/is some minimal support for it with MemoryFile in rasterio where you can tell GDAL how it should refer to this "thing".

Is that a question about whether fsspec can support multi-range reads?

Yes, I suppose this is a question for @martindurant. It seems GDAL allow for VSI plugins to include multi-range reads, but I'm not sure if fsspec is following some "standard" method names for doing this and if so whether or not I should use them (how public are the interfaces). I'm not sure how GDAL decides whether or not multi-range can be used and I'll have to figure that out. I mean, does it say "the multi-range methods are NULL pointers so I won't use them" or does it have some return code for these methods that tells it of the support.

@martindurant
Copy link

a raster file and a VRT XML which references it together as siblings in the same zipfile.

I suppose open could make a available multiple files on given paths for GDAL to consume, but yes it gets a little complex. fsspec is happy to have multiple files open at one, but I don't think you'd want to go down the route where fsspec provides a handler for the GDAL directory structure and browsing, some sore of mount or FUSE thing.

file system methods on the file-like object which I don't (think) is the goal of fsspec's interface...

Files o maintain a reference to their parent file system instance, but no, I don't think this model is used anywhere.

Let's do any file-like object if we can.

Agree with this, but I don't see what writing is necessary. fsspec's files try to match the standard python interface (io.IOBase).

multi-range gets

I don't actually understand what is being talked about here. This is HTTP fetches with Range:bytes=x1-y1,x2-y2? This is not directly supported.
You can pass arbitrary header keys when using cat_file, but fsspec will generate bytes range headers when reading via the file-like interface. The file-like interface is not async/concurrent (because there is internal state), but cat_file is, so you could either construct your own header, or issue multiple cat_file requests for the same file and different start/end values, and be sure they are processed concurrently.

fsspec also doesn't join multiple cat_file requests that happen to have similar ranges, where fewer requests might be better. This has been considered in the context of ReferenceFileSystem, but nothing has been done yet, and getting the heuristics right would be tricky (unless the ranges form a contiguous block).

By default, an fsspec file provides a range buffering options, of which "bytes" (fetch blocks, fill holes, readahead), "readahead" (read a bit more than requested, good for streaming pattern) and "first" (keep first block around, good for header metadata) are probably of interest here.

@djhoese
Copy link
Contributor Author

djhoese commented Jun 7, 2021

Agree with this, but I don't see what writing is necessary.

I brought up writing because rasterio's open can be used to read or write files. So if we are completely replacing MemoryFile with PythonVSIFile in open then it would need to support reading and writing. But looking at the code again, it seems we don't have to do both. We could always implement it in the future too.

https://github.com/mapbox/rasterio/blob/69305c72b58b15a96330d371ad90ef31c209e981/rasterio/__init__.py#L192-L199

I don't actually understand what is being talked about here. This is HTTP fetches with Range:bytes=x1-y1,x2-y2? This is not directly supported.

I'm wondering if fsspec does or expects to be able to do those types of multi-range HTTP fetches from its File classes. When I was looking through fsspec's source I found:

https://github.com/intake/filesystem_spec/blob/61ec18fb0a77d3a524a6b2d72340af3748c5bf2a/fsspec/spec.py#L1445-L1447

but realize now that is just for a single range which is where a lot of this confusion probably started. If fsspec's objects had a way to request multiple ranges at once, then I could implement the callback for:

https://gdal.org/api/cpl.html#_CPPv441VSIFilesystemPluginReadMultiRangeCallback

Which would go here in this PR:

https://github.com/djhoese/rasterio/blob/ab340b3f2e4727eadc0acb1096f3274788b87972/rasterio/_pyvsi.pyx#L213-L220

@sgillies
Copy link
Member

sgillies commented Oct 15, 2021

@djhoese I'm comfortable including this in next week's 1.3a2 release as a provisional feature. I'd like to improve the name of the class, "Python" being superfluous, and "VSI" having an uncertain meaning, even to Even 😄 I can't think of anything better at the moment, though.

What do you think? Any reservations? Caveats?

@djhoese
Copy link
Contributor Author

djhoese commented Oct 15, 2021

@sgillies I'm fine with changing the name and agree that VSI and Python are maybe not the most descriptive. Ideas for the module name (currently _pyvsi.pyx) and the class name (currently PythonVSIFile)?

The "Python" was supposed to refer to the python file-like object versus file-like object from some other library or language. 🤷 Doing something like "OpenFile" seems too generic.

Edit: FileLike?

@martindurant
Copy link

Sounds like FileLikeObject :)

@tbonfort
Copy link

One thing that continues to appeal to me about the plugin is that it seems to offer more visibility into network errors.

Under the hood, you are still using a c++ VSIVirtualHandle that is wrapping your C plugin functions, so using the plugin infrastructure won't bring you more or less visibility compared to implementing your own VSIVirtualHandle. c.f. https://github.com/OSGeo/gdal/blob/713c4a7f7e631ea3b88efd5ed293baeb2fc08f36/gdal/port/cpl_vsil_plugin.cpp#L121 . And implementing your own VSIVirtualHandle gives you more flexibility, e.g. when adding buffering in OSGeo/gdal#2901 we had to wait for a new gdal release before being able to use the functionality, whereas now we can release new godal versions independently.

For example, HTTP 50x server errors are practically unavoidable when reading from COGs on a large scale. But when they occur inside rasterio's use of e.g., GDAL's warper, we never see the original errors, they are inevitably hidden by a generic "read block error".

We handle this case by making our vsi handlers return a textual error that we can then emit as a CPLError: https://github.com/airbusgeo/godal/blob/689926741dbb6871dae23c4ea5220895791dfbdf/godal.cpp#L1096 . In order for that error to bubble up to the original caller instead of just being printed out to stderr, we install a thread-local error handler each time we enter gdal code: https://github.com/airbusgeo/godal/blob/689926741dbb6871dae23c4ea5220895791dfbdf/godal.cpp#L68

@djhoese
Copy link
Contributor Author

djhoese commented Oct 16, 2021

@sgillies You have the final say. What will it be? FileLikeObject (_filelike.pyx)? Something else?

@sgillies
Copy link
Member

sgillies commented Nov 10, 2021

I'm finally through enough of the transition to my new day job to get back to this. Sorry for the delay!

First of all, I need to explain some undocumented parts of rasterio's design. And then I will propose a name for the new class.

Rasterio's open function takes a string or path-like object (or Python file, of course, which we're going to be changing) and returns a dataset object. This reflects GDAL's API. We also have the concept of an object from which you can get a dataset object by calling an open method. We have an implicit "Openable" interface. The MemoryFile class implements this. We came to this by analogy to pathlib.Path. I think of MemoryFile as being both a kind of "path", a path that only has meaning to GDAL, and an openable object. It's the latter that's important to users of the API, but I feel like "path" is still a useful term for developers. GDAL/rasterio can open "paths", whether the path is based on the VSI system or not.

So, I'd like to propose FilePath for the name of a class that adapts a Python file or file-like object to a "Path" interface. What do you think @djhoese? I went to the brink of overthinking this, but I think the analogy with pathlib.Path is helpful, and we already have some Path classes in rasterio (for internal use) with a related purpose.

Next question: what would you think about eliminating the pass-through methods of the new class? We can always call those methods on the original file-like object, yes? In the MemoryFile case, they are needed to get data out, but the new class doesn't have to be file-like itself to be useful.

@djhoese
Copy link
Contributor Author

djhoese commented Nov 11, 2021

Regarding FilePath: Hey, it isn't my project. And based on the context you've provided I can see how some of our other suggestions would be confusing. I think FilePath would be less than great in any other project, but this is GDAL we're talking about. @martindurant based on @sgillies additional information do any other ideas come to your mind?

what would you think about eliminating the pass-through methods of the new class? We can always call those methods on the original file-like object, yes?

Do you specifically mean the seek, tell, and read methods? Yeah it should be fine to remove those. I think I was just matching MemoryFile assuming it was also matching some "standard" interface for rasterio. I don't think close method can be removed unless we want to remove all possible file-like objects. I think I'm OK with assuming these classes are not necessarily user facing though too. You kind of mentioned it earlier, but I kind of see this PR as a "first draft" and we see how it breaks when users actually start using it.

I've been out sick most of this week, but I'll try to get to this some time before next week.

@sgillies
Copy link
Member

Design notes from above have been added to https://github.com/rasterio/rasterio/blob/master/CONTRIBUTING.rst#path-objects.

@martindurant
Copy link

do any other ideas come to your mind?

For the name? I don't really mind, indeed this i pretty much an internal detail from users' point of view.

@djhoese
Copy link
Contributor Author

djhoese commented Nov 13, 2021

Ok I think I've done all the renaming. @sgillies let me know what you think. I also removed those methods I mentioned.

@sgillies
Copy link
Member

@djhoese the tests pass with a couple minor changes. I'm going to merge this and then will commit those changes. Then I'm going to get the CI up and running again.

@sgillies sgillies merged commit 80a67d6 into rasterio:master Nov 13, 2021
sgillies added a commit that referenced this pull request Nov 13, 2021
@vincentsarago
Copy link
Member

🥳 woot woot, all this looks great.

Quick question, If the reading/fetching is deferred to fsspec, could this mean there is a way to allow async reading? I guess neither rasterio and gdal VSI are ready for this but I'm trying to figure if using fsspec async API would be the easiest way 🤷‍♂️

@djhoese djhoese deleted the feature-pyvsi branch November 15, 2021 12:52
@martindurant
Copy link

For the very specific case that you wish to read many small sections of a large file, you could do what is done in the new fsspec.paquet module: async-fetch all of the binary sections concurrently (you need to know these ranges) and then construct a file-lie object which has a special cache containing these chunks. I don't know if this is a valid use case here.

@jhamman
Copy link

jhamman commented Nov 15, 2021

Huge! Thanks all for getting this merged, especially @djhoese!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants