Add Python file-like VSI Plugin #2141
FYI I've worked on a similar thing last year #1972
@vincentsarago That's very interesting. Why was work stopped on that? I thought I had to do the |
@djhoese you can see the discussion over at https://rasterio.groups.io/g/dev/topic/add_reader_for_direct_file/75989193. That said, I still believe that it would be nice to have direct access to the underlying file. I'm not a C or Cython expert either 🤷♂️
Ok, these latest commits have all the logic moved to a Cython module and use the GDAL Plugin API. This is so much cleaner and should remain clean once it is done. That may just be my opinion of C++ though. A couple things to note for historical purposes:
I've now deleted my original .cpp and .h files. This should clean up this PR a lot. Now to work on the file system stuff. |
@djhoese I love this idea. I'm going to schedule some time to give this a good read and see about getting some serious consultation from the VSI system maintainer. |
@sgillies It looks like travis is failing at compilation. Probably because this PR involves pulling in some of GDAL's C++ specific interfaces? Or are these failures happening for all of rasterio? I was hoping to get the tests passing and then add a nice big docstring to the top of the module explaining how things work. Then I'll probably continue to clean up the code and we can talk about specifics of how this could/should work and missing features. |
rasterio/_pyvsi.pyx
Outdated
cdef extern from "cpl_vsi_virtual.h":
    cdef cppclass VSIFileManager:
        @staticmethod
        void* GetHandler(const char*)
Note this is the only reason this has to be compiled as C++ from my basic tests. Without this GetHandler function though, I'd have to re-install the plugin every time the PyVSIFileBase class was created. I use GetHandler to check if the plugin is already installed. Looking at GDAL's source it looks like it creates new objects and copies the struct every time this install is done. Not a huge deal, but not the best thing either.
If we can make the plugin installation happen at import time then maybe this wouldn't matter.
It lives! Congratulations, @djhoese.
@sgillies That's great! And not all the tests are failing 🎉 My original post has a ton of TODOs and questions that we'll need to consider. To fix the tests I realized that |
How are things going here?
@martindurant I'm mostly waiting on guidance and decisions from people who know rasterio better than I do. As it sits, it works. But how well do we want it to work? I have my questions in my original PR description that I was hoping people could guide me on. I'm not sure how "all in" we want to go with this implementation. I mean, is this going to be the "go to" interface? If we can answer some of these questions and the ones above then I can continue working on tests and documentation. |
Since no one else has commented for a while, I would say that this is the Right Way to go and should proceed. If it works, that's good enough, since nothing like this is possible otherwise. Of course, I don't have insight into how the internals of rasterio/gdal actually work, so I can't review the implementation itself. |
@djhoese I have comments and attempts at answers for some of your questions. And an apology for being slow to reply. I'm stretched a little thin at work lately.
Yes, I think we could swap PythonVSIFile for MemoryFile in rasterio.open(). What users want is to be able to access a dataset within a Python file object. They don't necessarily want the entire file read into memory. And if they do, they can explicitly use MemoryFile.
Tangentially, GDAL has a feature that I don't see any support for (yet) in fsspec: a raster file and a VRT XML which references it together as siblings in the same zipfile. When I tried (as in https://github.com/mapbox/rasterio/blob/master/tests/test_memoryfile.py#L236)
>>> with fsspec.open("zip://*.vrt::tests/data/white-gemini-iv.zip") as fobj, rasterio.open(fobj) as dst:
...     print(dst.profile)
I see an error because GDAL's VRT driver doesn't find a file at /vsipythonfilelike/bfaa8d65-b599-4dae-ba65-6a581d6524b8/389225main_sw_1965_1024.jpg. That doesn't impact this PR because zipfiles are not supported as input to rasterio.open.
Let's do any file-like object if we can.
I'm not sure that what happens when names conflict is well defined. That's why I've been using uuids. I'm not opposed to names if they help, but our design principle is that rasterio users should be able to make programs using Python objects only and shouldn't have to think about the names of things within GDAL's virtual filesystems.
Is that a question about whether fsspec can support multi-range reads? |
I think this means that we'll (I'll) need to consider adding write support then, right?
This would be difficult. I think this would require file system methods on the file-like object, which I don't think is the goal of fsspec's interface... I suppose you could pass an fsspec file system object (@martindurant thoughts?). I don't see it being impossible for this interface to support both, but obviously this PR becomes much simpler if we stick to just file-like objects instead of any Python object.
I asked because there was/is some minimal support for it with MemoryFile in rasterio where you can tell GDAL how it should refer to this "thing".
Yes, I suppose this is a question for @martindurant. It seems GDAL allows VSI plugins to include multi-range reads, but I'm not sure if fsspec is following some "standard" method names for doing this and, if so, whether or not I should use them (how public are the interfaces). I'm not sure how GDAL decides whether or not multi-range can be used and I'll have to figure that out. I mean, does it say "the multi-range methods are NULL pointers so I won't use them" or does it have some return code for these methods that tells it of the support?
I suppose
Files do maintain a reference to their parent file system instance, but no, I don't think this model is used anywhere.
Agree with this, but I don't see what writing is necessary. fsspec's files try to match the standard python interface (
I don't actually understand what is being talked about here. This is HTTP fetches with fsspec also doesn't join multiple cat_file requests that happen to have similar ranges, where fewer requests might be better. This has been considered in the context of ReferenceFileSystem, but nothing has been done yet, and getting the heuristics right would be tricky (unless the ranges form a contiguous block). By default, an fsspec file provides a range of buffering options, of which "bytes" (fetch blocks, fill holes, readahead), "readahead" (read a bit more than requested, good for streaming patterns) and "first" (keep first block around, good for header metadata) are probably of interest here.
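For illustration only, the idea of joining requests that have nearby ranges could be sketched as below. This is not fsspec code: `coalesce_ranges` and its `max_gap` parameter are hypothetical names, and picking a good `max_gap` is exactly the heuristic described above as tricky. With `max_gap=0` only overlapping or contiguous ranges are merged.

```python
def coalesce_ranges(ranges, max_gap=0):
    """Merge (offset, length) byte ranges separated by at most max_gap bytes.

    Hypothetical helper, not part of fsspec, GDAL or rasterio.
    """
    merged = []  # list of [start, end) pairs, kept sorted by start
    for start, length in sorted(ranges):
        end = start + length
        if merged and start - merged[-1][1] <= max_gap:
            # Overlaps, touches, or sits within max_gap of the previous
            # range: extend that range instead of issuing a new request.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Convert back to (offset, length) tuples.
    return [(start, end - start) for start, end in merged]


# Two contiguous ranges collapse into one request; the distant one stays separate.
requests = coalesce_ranges([(0, 10), (10, 5), (100, 20)])
```

A larger `max_gap` trades extra downloaded bytes for fewer round trips, which is the trade-off a real implementation would have to tune.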
I brought up writing because rasterio's
I'm wondering if fsspec does or expects to be able to do those types of multi-range HTTP fetches from its File classes. When I was looking through fsspec's source I found: but realize now that is just for a single range which is where a lot of this confusion probably started. If fsspec's objects had a way to request multiple ranges at once, then I could implement the callback for: https://gdal.org/api/cpl.html#_CPPv441VSIFilesystemPluginReadMultiRangeCallback Which would go here in this PR: |
@djhoese I'm comfortable including this in next week's 1.3a2 release as a provisional feature. I'd like to improve the name of the class, "Python" being superfluous, and "VSI" having an uncertain meaning, even to Even 😄 I can't think of anything better at the moment, though. What do you think? Any reservations? Caveats? |
@sgillies I'm fine with changing the name and agree that VSI and Python are maybe not the most descriptive. Ideas for the module name (currently The "Python" was supposed to refer to the python file-like object versus file-like object from some other library or language. 🤷 Doing something like "OpenFile" seems too generic. Edit: |
Sounds like |
Under the hood, you are still using a c++ VSIVirtualHandle that is wrapping your C plugin functions, so using the plugin infrastructure won't bring you more or less visibility compared to implementing your own VSIVirtualHandle. c.f. https://github.com/OSGeo/gdal/blob/713c4a7f7e631ea3b88efd5ed293baeb2fc08f36/gdal/port/cpl_vsil_plugin.cpp#L121 . And implementing your own VSIVirtualHandle gives you more flexibility, e.g. when adding buffering in OSGeo/gdal#2901 we had to wait for a new gdal release before being able to use the functionality, whereas now we can release new godal versions independently.
We handle this case by making our vsi handlers return a textual error that we can then emit as a CPLError: https://github.com/airbusgeo/godal/blob/689926741dbb6871dae23c4ea5220895791dfbdf/godal.cpp#L1096 . In order for that error to bubble up to the original caller instead of just being printed out to stderr, we install a thread-local error handler each time we enter gdal code: https://github.com/airbusgeo/godal/blob/689926741dbb6871dae23c4ea5220895791dfbdf/godal.cpp#L68 |
@sgillies You have the final say. What will it be?
I'm finally through enough of the transition to my new day job to get back to this. Sorry for the delay! First of all, I need to explain some undocumented parts of rasterio's design. And then I will propose a name for the new class. Rasterio's open function takes a string or path-like object (or Python file, of course, which we're going to be changing) and returns a dataset object. This reflects GDAL's API. We also have the concept of an object from which you can get a dataset object by calling an open method. We have an implicit "Openable" interface. The MemoryFile class implements this. We came to this by analogy to pathlib.Path. I think of MemoryFile as being both a kind of "path", a path that only has meaning to GDAL, and an openable object. It's the latter that's important to users of the API, but I feel like "path" is still a useful term for developers. GDAL/rasterio can open "paths", whether the path is based on the VSI system or not. So, I'd like to propose Next question: what would you think about eliminating the pass-through methods of the new class? We can always call those methods on the original file-like object, yes? In the MemoryFile case, they are needed to get data out, but the new class doesn't have to be file-like itself to be useful. |
Regarding
Do you specifically mean the I've been out sick most of this week, but I'll try to get to this some time before next week. |
Design notes from above have been added to https://github.com/rasterio/rasterio/blob/master/CONTRIBUTING.rst#path-objects.
For the name? I don't really mind, indeed this is pretty much an internal detail from users' point of view.
Ok I think I've done all the renaming. @sgillies let me know what you think. I also removed those methods I mentioned.
@djhoese the tests pass with a couple minor changes. I'm going to merge this and then will commit those changes. Then I'm going to get the CI up and running again.
🥳 woot woot, all this looks great. Quick question, If the |
For the very specific case that you wish to read many small sections of a large file, you could do what is done in the new fsspec.parquet module: async-fetch all of the binary sections concurrently (you need to know these ranges) and then construct a file-like object which has a special cache containing these chunks. I don't know if this is a valid use case here.
Huge! Thanks all for getting this merged, especially @djhoese!
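A rough sketch of the "file-like object backed by a cache of known chunks" idea follows. It is not the fsspec.parquet implementation: `ChunkCacheFile` is a hypothetical class, the chunks are assumed to have been fetched already (the concurrent fetch step is left out), and a read outside the cached ranges simply raises instead of falling back to the network.

```python
import io


class ChunkCacheFile:
    """Read-only file-like object served from pre-fetched byte chunks.

    Hypothetical illustration; `chunks` maps a start offset to the bytes
    fetched for that range, and `size` is the total size of the remote file.
    """

    def __init__(self, chunks, size):
        self._chunks = dict(chunks)
        self._size = size
        self._pos = 0

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return self._pos

    def read(self, size=-1):
        if self._pos >= self._size:
            return b""  # at or past end of file
        if size < 0:
            size = self._size - self._pos
        end = min(self._pos + size, self._size)
        # Serve the request only if one cached chunk fully covers it.
        for start, data in self._chunks.items():
            if start <= self._pos and end <= start + len(data):
                out = data[self._pos - start:end - start]
                self._pos = end
                return out
        raise OSError(f"bytes {self._pos}-{end} were not pre-fetched")


# Pretend two ranges of a 200-byte file were fetched ahead of time.
fobj = ChunkCacheFile({0: b"HEAD", 100: b"tile-data"}, size=200)
header = fobj.read(4)
fobj.seek(104)
tile_part = fobj.read(4)
```

Because the object only needs `read`, `seek` and `tell`, it has the same shape as the file-like objects discussed in this PR.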
The Goal
See corteva/rioxarray#246 for details on the origins of this feature.
The goal is to make it as efficient as possible for rasterio to open a Python file-like object. The biggest benefit of a feature like this comes when a library like fsspec is combined with rasterio. fsspec provides a unified interface for accessing files including remote file systems like S3 or GCS and has the ability to cache files locally. It can provide these files as Python file-like objects. An example of using fsspec and providing its file to rasterio looks like:
The Current Issue
Currently, rasterio uses a MemoryFile, a GDAL supported feature that reads all the necessary data into memory and accesses the data from there instead of disk. This is great, except that it requires reading all of the data from the file before we need it. This means a lot of time (especially for remote resources) and a lot of memory. With the above example and using the rasterio master branch it takes about 21s on my laptop to do the above.
The Solution - Attempt 2
The current version of this pull request uses GDAL's plugin interfaces to add all necessary hooks (callbacks) for GDAL to communicate with rasterio and the file-like objects that you provide to it. This comes as two parts: the virtual filesystem and the individual file handling. Right now the filesystem is all kept in a single Python dictionary (virtual filename -> file wrapper object). The file handling is fairly basic and doesn't need to be too complicated as the basic FILE* operations map to file-like object methods (e.g. read, seek, tell).
I haven't done anything for writing or for multi-range reads.
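To make the description above more concrete, here is a minimal pure-Python sketch of the "virtual filename -> file wrapper object" dictionary. It is not the Cython code in this PR: the `FileRegistry` class and its method names are hypothetical, the uuid-based naming follows the convention mentioned in this thread, and the `/vsipythonfilelike/` prefix is taken from the path quoted in the discussion. In the PR the equivalent lookups happen inside the GDAL plugin callbacks.

```python
import io
import uuid


class FileRegistry:
    """Toy virtual filesystem: maps virtual filenames to file-like objects.

    Hypothetical stand-in for the dictionary described above; the real
    callbacks are invoked by GDAL rather than called directly from Python.
    """

    PREFIX = "/vsipythonfilelike/"  # prefix as seen in the thread above

    def __init__(self):
        self._files = {}

    def register(self, fobj):
        # A uuid keeps names unique so users never have to pick them.
        name = f"{self.PREFIX}{uuid.uuid4()}"
        self._files[name] = fobj
        return name

    def unregister(self, name):
        self._files.pop(name, None)

    # The FILE*-style operations map directly onto file-like methods.
    def read(self, name, nbytes):
        return self._files[name].read(nbytes)

    def seek(self, name, offset, whence=io.SEEK_SET):
        return self._files[name].seek(offset, whence)

    def tell(self, name):
        return self._files[name].tell()


registry = FileRegistry()
vsi_name = registry.register(io.BytesIO(b"abcdef"))
registry.seek(vsi_name, 2)
data = registry.read(vsi_name, 3)
```

Any object with `read`, `seek` and `tell` could be registered this way, which is why the file handling side stays simple.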
TODO
.cpp stuff into Cython which would be great for maintainability. This will be "Attempt 2".
Questions