Filesystem abstraction library #39

mosra · 2017-12-22T21:22:29Z

A "one line" marketing pitch (everything subject to change):

PluginManager::Manager<Fs::Filesystem> manager;

std::unique_ptr<Fs::Filesystem> fs = manager.loadAndInstantiate("AnyFilesystem");

fs->openPrefix("http://your.domain/assets/");            // web 
fs->openPrefix("bundle://assets/");                      // android
fs->openPrefix("assets.zip");                            // local filesystem
fs->openPrefix("://assets.zip");                         // compiled-in resource
fs->openPrefix("http://your.domain/assets/assets.zip");  // or combined

// map the file, if supported
Containers::Array<char, Fs::MapDeleter> data = fs->map("file.jpg");

// or just copy-read
Containers::Array<char> data = fs->read("file.jpg");

Further info: https://gist.github.com/mosra/d64d4388d6a3bef80c6226ea6b479d6d

Things to do, based on the comments below and further research on my side:

Further ideas:

A wrapper over win32 resources (EnumResources, FindResource, LoadResource, LockResource + GetResourceSize), exposing them as a filesystem
A wrapper over Utility::Resource, exposing it as a filesystem
Hugepages support for faster memory reservation when memory-mapping -- https://chrisgreendevelopmentblog.wordpress.com/2021/10/14/fun-with-large-memory-allocations-well-not-really-fun-at-all/
Anything to take from https://github.com/cginternals/cppfs ? A bit of a kitchen-sink library that also includes tree diffing or SHA1 calculation, which doesn't really have a place in a pure FS abstraction lib...
Decompression in-place, overwriting the compressed data https://twitter.com/ZoidCTF/status/1590137517072056320

coveralls · 2017-12-22T22:08:23Z

Coverage remained the same at 96.167% when pulling 9f34998 on filesystem into e94609d on master.

iboB · 2018-12-04T15:37:59Z

Some suggestions (keeping the associated gist in mind):

Consider custom URIs (not necessarily platform-specific ones) like "assets://" or "gen://". This allows for fancy features like adding procedurally generated content to the asset pipeline.
Add the capability for filesystem notifications. Not necessarily implemented, but implementable. This allows custom file watches and asset hotswapping if one chooses to implement it.
I agree that having the option for blocking i/o is a nice feature, but since it's such a hassle to implement with emscripten why not make the... well the synchronicity of the filesystem an optional feature. Thus the non-preloaded assets file system layer will have no option for blocking (synchronous) i/o. If you want blocking in the browser, preload everything.
Speaking of layers: it could be that the intention is for plugins to work like that, but if not, do consider having layers. This basically means that the Filesystem class has potentially multiple layers underneath. When you try to fetch an asset, it searches all layers in order of priority (which could simply be the inverse of the order in which they were added). This allows for seamless multiple options when dealing with assets. Here are some use cases:
- On the browser having some preloaded files and some downloaded. Browser cache can also make this a good choice.
- On Android having some files bundled in the apk and some downloaded separately (for example when the maximum size of apk is exceeded or as a dlc)
- Overriding assets: either for debugging purposes or for released software expansions
- Having layers allows the user to make the decision about how to split assets later and with minimal effort.
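
A minimal sketch of the layered lookup described above, assuming a hypothetical `Layer` interface and an in-memory `MemoryLayer` standing in for preloaded, bundled or downloaded assets. None of these names come from the gist or from Corrade; they only illustrate the "search in inverse order of addition" rule.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal interface; not the actual Fs::Filesystem API.
struct Layer {
    virtual ~Layer() = default;
    // Returns the file contents, or std::nullopt if this layer lacks the file.
    virtual std::optional<std::string> read(const std::string& path) const = 0;
};

// In-memory layer, used here only to keep the sketch self-contained.
struct MemoryLayer: Layer {
    std::map<std::string, std::string> files;

    explicit MemoryLayer(std::map<std::string, std::string> f): files(std::move(f)) {}

    std::optional<std::string> read(const std::string& path) const override {
        const auto it = files.find(path);
        if(it == files.end()) return std::nullopt;
        return it->second;
    }
};

// Searches the layers in inverse order of addition, so a layer added later
// overrides the ones added before it.
class LayeredFilesystem {
    public:
        void addLayer(std::shared_ptr<const Layer> layer) {
            assert(layer);
            _layers.push_back(std::move(layer));
        }

        std::optional<std::string> read(const std::string& path) const {
            for(auto it = _layers.rbegin(); it != _layers.rend(); ++it)
                if(auto data = (*it)->read(path)) return data;
            return std::nullopt;
        }

    private:
        std::vector<std::shared_ptr<const Layer>> _layers;
};
```

With a base layer and an override layer added in that order, a file present in both is served from the override layer, while a file present only in the base layer is still found there.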

mosra · 2018-12-04T17:00:07Z

@iboB thanks for the valuable feedback!

Custom URIs: yes, this is quite integral to the whole system; based on the "scheme", it will delegate to a concrete implementation. It doesn't make sense to restrict these just to "real schemes", so having custom schemes is absolutely possible. Also things like zip:// -- if you ever worked with KIO (KDE, Linux), this is a very nice thing to have.
Filesystem notifications: yup, definitely! 👍 (For physical filesystems, there's Utility::FileWatcher already, though the implementation is quite rudimentary at the moment.)
What I definitely do not want with sync/async I/O is hardcoding threading directly in the plugins themselves, since that'll impose restrictions on how the APIs can and can't be used. I'm thinking that every filesystem plugin implementation should be as direct as possible and so:
- for APIs that are already async (emscripten, HTTP fetch APIs), expose them as async (poll for results once a frame, e.g., or some callback/future/promise things)
- for APIs that are synchronous (reading of in-memory data, reading files from disk), expose them as synchronous as well (and so you have to explicitly offload them to some worker thread if you don't want them to block), but make it possible to use the sync APIs the "fake-async" way as well (so you can write just one code path for both the sync and async case)
- maybe have some "layer" / "adaptor plugin" that converts the sync plugins to async (by creating a worker thread and executing them there)
Layers: didn't think about it in this way yet, but I think this could be handled better by a separate dedicated functionality as the whole fallback thing is quite orthogonal to filesystem access, in my opinion. There's the ResourceManager class that handles fallbacks, asset overriding etc. and works on any type of resource, not just files. What I meant by "layers" above and in the gist is chaining the filesystems together -- e.g. get a file from a ZIP archive downloaded from HTTP in a single (potentially async) call.
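
To illustrate the "fake-async" point above, here is a rough sketch under stated assumptions: a hypothetical poll-based interface (`request()` / `poll()` / `data()`) wrapped around a hypothetical synchronous in-memory backend. The names are made up for this example and are not part of any existing plugin interface. The wrapper does the read immediately, so a request is never observed as `Pending`, but calling code is written the same way it would be for a truly asynchronous backend.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical synchronous backend (in-memory data, files on disk, ...).
struct SyncFilesystem {
    std::map<std::string, std::string> files;

    std::optional<std::string> read(const std::string& path) const {
        const auto it = files.find(path);
        if(it == files.end()) return std::nullopt;
        return it->second;
    }
};

enum class Status { Pending, Done, Failed };

// Hypothetical poll-based facade over a synchronous backend. A real async
// backend could expose the same three functions and return Pending until
// its data arrives.
class FakeAsyncFilesystem {
    public:
        explicit FakeAsyncFilesystem(SyncFilesystem fs): _fs(std::move(fs)) {}

        // Performs the read right away and returns a handle for polling.
        std::size_t request(const std::string& path) {
            _results.push_back(_fs.read(path));
            return _results.size() - 1;
        }

        // Never returns Pending here, because the read already happened.
        Status poll(std::size_t handle) const {
            assert(handle < _results.size());
            return _results[handle] ? Status::Done : Status::Failed;
        }

        // Only valid once poll() reported Done.
        const std::string& data(std::size_t handle) const {
            assert(poll(handle) == Status::Done);
            return *_results[handle];
        }

    private:
        SyncFilesystem _fs;
        std::vector<std::optional<std::string>> _results;
};
```

An adaptor that moves the read to a worker thread, as in the last bullet, could keep the same interface and simply report `Pending` until the worker finishes.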

hsdk123 · 2020-01-22T03:25:38Z

Looking forward to this!

hsdk123 · 2020-01-23T01:57:02Z

I see the thread started in 2017 and hasn't had updates since 2018, is this still being considered? The marketing pitch looks great.

LB-- · 2020-01-23T08:05:31Z

Just saw this, and wanted to comment that a common mistake I see with filesystem abstractions (in my opinion) is using a single string to represent a file path. Google Drive is very much a filesystem, but it allows nearly all characters in filenames including forward slashes and backslashes, and has no concept of single-string-path. It also allows multiple identically-named items to exist in the same place. Nearly all filesystem APIs are incapable of handling this without doing weird workarounds, because they forget to abstract the concept of a path to a filesystem object at all. It's more than just changing slash directions. Sure, traditional filepaths can be the input to a convenience function that spits out the real path abstraction, but otherwise I feel that single-string paths do more harm than good since they often need processing anyway.

I am also very much not convinced that web addresses / URLs should be involved here - you can't iterate over the content of a directory if it's a web URL because that's not how the web works. "Opening" a URL for "reading" may have side effects. It's an entirely different beast. Instead I think it'd make more sense to have a virtual filesystem that knows the directory layout already and can just perform HTTP requests on the fly under the hood while giving the appearance of an ordinary filesystem, specifically for the case of a web app.

IMO the colon and everything before it should be an intrinsic property of the filesystem, and shouldn't be exposed to the code that uses that filesystem. You shouldn't have one API that can accept both C:\... and ftp://... as input - maybe it does access both under the hood for whatever reason, but on the surface the point of abstracting filesystems is to help an app read its resources and write its user data. I assume maybe you want a centralized way to plug in any path into some API and have it delegate to the correct filesystem abstraction object under the hood but that just seems like playing with fire to me. It would allow completely disparate parts of the program to potentially interfere with each other, effectively just being a fancy global variable. And why not allow for encoding into the type system the type of filesystem that is expected?

Consider games - they'd need one read-only filesystem with the root as their install directory, and a read-write filesystem with the root as the place the player or OS has designated save data should go (e.g. the Saved Games folder on Windows, FOLDERID_SavedGames). Giving either of these filesystems a path like D:\Program Files (x86)\Steam\steamapps\common\My Game\ or C:\Users\mosra\Saved Games\My Game\ would be nonsensical, even if that is exactly the paths they use under the hood. Paths like save:/save_data.dat are needlessly complex when you already know you're working with the save filesystem, you could just say save_data.dat and use the save filesystem directly. Maybe late in development someone decides they want to allow local-co-op and save file sharing via Steam cloud with both users signed in, but they suddenly have to go through and change everywhere they had typed save:/ to something else, instead of just adapting the code to associate a filesystem with each player. Level loading code could just get a filesystem view of a subdirectory of the parent filesystem, scoped to its own little realm, and then multiple levels could be loaded at once without potentially interfering with each other. Using colon prefixes doesn't really make sense in this respect.

Maybe something like this would work - a path as an array of path fragments, and each fragment must have either a name or unique ID (or both), and it can only be used with a filesystem directly, not with a centralized API:

using corrade::filesystem_literals::_pf; //path fragment
auto level_path = u8"level1"_pf/u8"geometry"/u8"main"; //slash operator like std::filesystem
Containers::Array<char> data = embedded_filesystem->read(level_path);

auto case_insensitive_path = u8"WiNdOwS iS wEiRd.TxT"_pf;
auto normalized_path = windows_filesystem->normalize(case_insensitive_path);
assert(normalized_path[0].name() == u8"Windows is Weird.txt");

auto gdrive_path = u8"directory of files w/o unique names \\o/"/PathFragment(u8"dummy filename, the ID takes precedence", unique_id);
std::u8string actual_filename = gdrive_filesystem->normalize(gdrive_path).crbegin()->name();

In the case of Google Drive I imagine that a single path fragment with an ID is enough to find the file anywhere and get its full path via normalization. A similar thing is possible in Windows, with GetFileInformationByHandle and GetFinalPathNameByHandleW. This can be useful for locating where computer-illiterate players may have moved save files to, by storing the ID of the save folder somewhere else like the registry. You can also open files by their ID, though I have no idea if it's any faster than by path. Seems like similar things are possible in Linux too. So, having unique IDs as part of paths isn't just useful with the Google Drive case. A filesystem, when given a fully normalized path that has both names and IDs in each fragment, could resolve file-not-found issues by finding the closest parent directory by ID and then going down it by name. If the IDs aren't relevant to the filesystem, no harm done. And, with the terse syntax, you can completely ignore that the API even supports IDs in the first place. In cases where the underlying filesystem doesn't support or use IDs, it's no problem there either.
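
The fallback described above (find the closest parent directory by ID, then go down by name) could look roughly like the following. This is a toy in-memory model, not a wrapper over any real API: `ToyFilesystem`, `Fragment`, integer IDs and the linear child lookup are all assumptions made for the sake of the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using Id = int;

// A path fragment with a name and an optional unique ID.
struct Fragment { std::string name; std::optional<Id> id; };

struct Node { std::string name; Id parent; };

class ToyFilesystem {
    public:
        static constexpr Id Root = 0;

        ToyFilesystem() { _nodes[Root] = {"", Root}; }

        // Adds a node, or renames / moves it if the ID already exists.
        void add(Id id, std::string name, Id parent) {
            _nodes[id] = {std::move(name), parent};
        }

        // Resolves by name first; if that fails, retries from the deepest
        // fragment whose ID still exists and descends by name from there.
        std::optional<Id> resolve(const std::vector<Fragment>& path) const {
            if(const auto found = descend(Root, path, 0)) return found;
            for(std::size_t i = path.size(); i != 0; --i) {
                const Fragment& f = path[i - 1];
                if(f.id && _nodes.count(*f.id)) {
                    if(const auto found = descend(*f.id, path, i)) return found;
                }
            }
            return std::nullopt;
        }

    private:
        std::optional<Id> child(Id parent, const std::string& name) const {
            for(const auto& [id, node]: _nodes)
                if(id != Root && node.parent == parent && node.name == name)
                    return id;
            return std::nullopt;
        }

        // Walks fragments [first, path.size()) by name, starting at `start`.
        std::optional<Id> descend(Id start, const std::vector<Fragment>& path,
                                  std::size_t first) const {
            Id current = start;
            for(std::size_t i = first; i != path.size(); ++i) {
                const auto next = child(current, path[i].name);
                if(!next) return std::nullopt;
                current = *next;
            }
            return current;
        }

        std::map<Id, Node> _nodes;
};
```

If a save folder is renamed or moved, resolving the stored path by name fails, but the folder's ID still matches, so the file underneath it is found again by name from there.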

...This took more time to write than I realized, and may be overkill. Maybe simpler is better in most cases. I'd just like to vote for a little less simplicity for the sake of a lot more flexibility. I don't want a repeat of std::filesystem, and I don't want filesystem-aware paths. But I only know how I would use a filesystem API in Corrade, not how you expect to use it.

mosra · 2020-01-23T10:34:17Z

@LB-- thanks a lot for stopping by, you've made a bunch of great points 👍

I was not aware of Google Drive allowing / in filenames (why, Google, why?). I only did very cursory research of what Google Drive is doing there and I have to admit I don't fully understand the semantics -- one file can be in multiple directories (well, which is the case with Unix hardlinks too) and there isn't any real (browseable) concept of a directory tree? So the only real way to access a file is to know its ID beforehand?

Regarding the "single path string" -- good point. Instead of overloading operator/ (which I'm not a fan of, at all), I think all APIs that accept a single string could accept a "list of path components", e.g.:

fs->open("path/to/a/file.dat");
fs->open({"path", "to", "a", "file.dat"});

This will solve the problem of / in filenames (in the second case, any occurrence of / will be treated as being a part of the filename, if allowed by the backend), while still providing the single-string shorthand for convenience (where one can assume that / is indeed separating directories). Also I think the second way could make more sense in cases where a path is generated programmatically, to avoid costly joining in user code.
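
A small sketch of how the two overloads could relate, assuming a hypothetical `RecordingFilesystem` that only stores the components a backend would receive. The class, `splitPath()` and `lastOpened()` are invented for this example; only the two call forms come from the comment above.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <string_view>
#include <vector>

using Path = std::vector<std::string>;

// Splits on '/' and skips empty components.
Path splitPath(std::string_view path) {
    Path out;
    std::size_t begin = 0;
    while(begin <= path.size()) {
        std::size_t end = path.find('/', begin);
        if(end == std::string_view::npos) end = path.size();
        if(end != begin) out.emplace_back(path.substr(begin, end - begin));
        begin = end + 1;
    }
    return out;
}

// Hypothetical filesystem that just records what it was asked to open.
class RecordingFilesystem {
    public:
        // Single-string shorthand: '/' is assumed to separate directories.
        void open(std::string_view path) { _last = splitPath(path); }

        // Component list: every component is taken verbatim, so a '/' inside
        // one of them stays part of the name.
        void open(std::initializer_list<std::string_view> components) {
            _last.clear();
            for(std::string_view c: components) _last.emplace_back(c);
        }

        const Path& lastOpened() const { return _last; }

    private:
        Path _last;
};
```

Calling `open("path/to/a/file.dat")` and `open({"path", "to", "a", "file.dat"})` ends up with the same four components, while `open({"dir", "w/o", "file.dat"})` keeps `w/o` as a single name.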

Allowing the use of IDs (GUID, inode ID...) for opening files is also a very good idea -- that nicely sidesteps all string processing and could be extended further (for example, referencing files by their SHA, like Git does). A caveat is that, as far as I can tell, both inode ID and GUID might get reassigned to a different file, so it's not failproof.

> I am also very much not convinced that web addresses / URLs should be involved here

The main use case of the scheme prefixes is for convenient opening of arbitrary URLs by the user, similar to the KIO framework in KDE. For example, you could use magnum-player to open a file from a URL, a file from inside a ZIP file, etc.:

magnum-player https://a.path.to/file.glb
magnum-player fish://192.168.1.105/home/mosra/models/chair.blend
magnum-player zip://backup.zip/cube.obj
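
For illustration, scheme-based delegation along these lines might be sketched as follows. `SchemeRegistry`, `splitScheme()` and the handler signature are hypothetical and not taken from the gist. Input without `://` is treated as having an empty scheme, and the `://assets.zip` form from the pitch is not distinguished from a plain local path in this simplified version.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <utility>

struct SplitUrl { std::string scheme; std::string rest; };

// Splits "scheme://rest". Input without "://" gets an empty scheme.
SplitUrl splitScheme(std::string_view url) {
    const std::size_t pos = url.find("://");
    if(pos == std::string_view::npos) return {"", std::string{url}};
    return {std::string{url.substr(0, pos)}, std::string{url.substr(pos + 3)}};
}

// Hypothetical registry mapping a scheme to a concrete implementation.
class SchemeRegistry {
    public:
        // A handler gets everything after "://" and returns the contents.
        using Handler = std::function<std::string(const std::string& rest)>;

        void registerScheme(std::string scheme, Handler handler) {
            _handlers[std::move(scheme)] = std::move(handler);
        }

        // Delegates to the handler for the URL's scheme, or returns
        // std::nullopt if no handler is registered for it.
        std::optional<std::string> open(std::string_view url) const {
            const SplitUrl split = splitScheme(url);
            const auto it = _handlers.find(split.scheme);
            if(it == _handlers.end()) return std::nullopt;
            return it->second(split.rest);
        }

    private:
        std::map<std::string, Handler> _handlers;
};
```

Custom schemes such as "assets" or "gen" would be registered the same way as "zip" or "https", since the registry does not care whether a scheme is a "real" one.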

For Saved Games locations, this is not really where it could be used, except maybe if you'd want to give the users an ability to override the location (and then allow them to store the saves on a remote location, e.g.).

I'll update the PR description with new TODO items for the above.

@hsdk123 yes, if it weren't, it would be closed already ;) I did some design/research work related to on-the-fly decompression back in August and have a bunch of new things locally, nothing pushed here tho.

LB-- · 2020-01-23T19:24:06Z

> and there isn't any real (browseable) concept of a directory tree? So the only real way to access a file is to know its ID beforehand?

From what I can tell the API is more search-oriented (makes sense given the company in question), you search for files and you get a list of all the parent folders each file has. Also, folders are just files with a special MIME type, and root is a special ID for the root folder. Since you can search for files with a query based on their parents, it's easy enough to just recursively search starting from root and then each subsequent folder found. You get back the name and ID of each entry in the results, enough to make a semblance of a filesystem out of.
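
The recursive search described above can be sketched against plain in-memory data. The `Entry` struct and `collectPaths()` are stand-ins invented for this example, with no real Drive API calls involved, and the sketch assumes the folder structure contains no cycles. Paths are kept as lists of components, since names may contain slashes.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Stand-in for one search result: its ID, its name, all of its parent
// folders, and whether it is itself a folder.
struct Entry {
    std::string id, name;
    std::vector<std::string> parents;
    bool isFolder;
};

using Path = std::vector<std::string>;

// Appends the path of every entry reachable from `folderId` to `out`.
// The linear scan stands in for a "files whose parents include X" query.
void collectPaths(const std::vector<Entry>& all, const std::string& folderId,
                  Path& prefix, std::vector<Path>& out) {
    for(const Entry& e: all) {
        const bool isChild = std::find(e.parents.begin(), e.parents.end(),
                                       folderId) != e.parents.end();
        if(!isChild) continue;

        prefix.push_back(e.name);
        out.push_back(prefix);
        if(e.isFolder) collectPaths(all, e.id, prefix, out);
        prefix.pop_back();
    }
}
```

An entry with several parents shows up once under each of them, which matches a file being in multiple directories at the same time.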

> I think all APIs that accept a single string could accept a "list of path components"

That was what I was thinking originally too, I just went with the operator overload in my example to mirror std::filesystem even though it is a bit weird. Either way works, as long as paths can be composed of fragments instead of splitting on slashes.

> The main use case of the scheme prefixes is for convenient opening of arbitrary URLs by the user, similar to the KIO framework in KDE.

After sleeping on this I realized I had a few misconceptions, since I was worried about conflicts with OS-registered protocol handlers. Then I realized, duh, https://... in Corrade wouldn't call out to the user's default web browser and download the data through it. So, what you're suggesting is having an option for a centralized API that takes paths and returns file contents, that filesystems can be optionally registered into but are otherwise not necessarily related to? That makes more sense to me, and although I personally wouldn't use it I can certainly see it being useful.

mosra · 2020-02-23T15:55:52Z

Google developed a Linux FS that works akin to the custom pagefault callbacks linked above: https://www.osnews.com/story/131383/googles-new-incremental-file-system-may-let-you-play-big-android-games-before-theyre-fully-downloaded/

Not sure yet if I like it or not.

[wip] New filesystem abstraction library.

9f34998

mosra self-assigned this Dec 22, 2017

mosra added this to TODO in Project management via automation Dec 22, 2017

mosra moved this from TODO to In progress in Project management Jan 9, 2018

mosra added this to In Progress in Filesystem Jul 3, 2019

mosra mentioned this pull request Sep 25, 2019

[esp/bindings_js] add back support for semantic information facebookresearch/habitat-sim#245

Merged

11 tasks

mosra mentioned this pull request Feb 25, 2020

dox: Fix vcpkg install everything command mosra/magnum#368

Closed

mosra marked this pull request as draft November 2, 2020 07:19

mosra changed the title ~~[WIP] Filesystem abstraction library~~ Filesystem abstraction library Nov 2, 2020

mosra mentioned this pull request Oct 15, 2021

KTX2 + Basis Universal mosra/magnum-plugins#110

Closed
