Reconsider the filesystem status querying design #18

ncoghlan · 2018-06-01T12:14:54Z

(Getting some thoughts on this topic out of my head and into written form, as part of handing over the lead maintainer and API designer role to @palaviv)

One of the annoying aspects of walkdir's symlink loop detection is that it needs to call os.path.islink on each directory. Prior to the introduction of os.scandir(), this made the number of stat calls during a walk equal to the number of files yielded (as the underlying walk decided whether it was a directory or not) plus twice the number of directories (once in the underlying walk, once in walkdir's symlink loop detection).

Since Python 3.5 though, that hasn't been the case: os.scandir() means the underlying walk doesn't need to make any stat calls at all, since the information it needs is right there on the os.DirEntry objects. Those objects also typically contain the symlink-or-not info that the symlink loop detection needs, but the entries aren't made visible to the rest of the filter stack, so each filter that needs filesystem status information needs to make its own call to check the status.

Designing a custom walk that exposes the underlying os.DirEntry objects just to save a stat call in the symlink loop detector probably isn't worth the hassle, but it would be nice to be able to filter on other stat_result attributes like the mode, size, uid, gid, access time, modification time, and creation time without every such filter needing to make its own stat call.

The challenge is finding a way to store a transient stat cache such that:

it can be used even with immutable dir_entry objects (like the tuples produced by os.walk() and os.fwalk())
stale entries don't accumulate over time
stale entries don't persist for extended periods after the relevant iterators terminate
multiple iterators can be running in multiple threads without interfering with each other
multiple iterators can be interleaved within a single thread without interfering with each other

I initially thought keying by id(walk_iter) might work, but the problem there is that each layer of filtering gets a different value for that (since they see the next filter layer down, not the underlying walk).

Keying by id(dir_entry) seems more promising (since the filters are expected to pass that through unmodified since 0.4), but the challenge there is deciding when to clear the previously cached entries.

The final possibility that comes to mind is to instead use a naive time based cache, where rather than aiming for "Only one stat call per path per walk iteration", we instead aim for "No more than one stat call per path per N milliseconds". To enable that, you'd calculate the currently active cache ID as int(time.monotonic() * (1000 / N)). Then whenever the cache ID changed, you'd discard all previously cached values and start making fresh stat calls.

However, a problem with only using time, and not considering id(dir_entry) is that once you "escape" from the filter pipeline, then the loop body may mutate the filesystem state in a way that it expects the next iteration to detect. This means caching by dir_entry may still make the most sense, and time only comes into play as a way of limiting the amount of no longer needed stale data that gets kept around, without limiting the number of concurrently active filter pipelines.

P.S. It's conceivable this is a topic I've been intermittently thinking about since first creating walkdir, and I never actually did anything about it because nothing leapt out at me as "Yes, that's the right way to handle multiple stat-aware filter stages without large numbers of redundant stat calls" ;)

The text was updated successfully, but these errors were encountered:

palaviv · 2018-06-02T08:33:10Z

I think that the big problem with your suggestion is the cache. What will happen if something has changed in the filesystem between walks?

ncoghlan · 2018-06-02T15:28:43Z

Same thing that happens with any query-before-use filtering: the criterion may no longer be true by the time you act on it. Even using "find" in a pipeline is vulnerable to that problem though, since the filter condition may no longer be true by the time the next stage in the pipeline does something with the emitted path. That said, that concern is also why I think "id(dir_entry)" has a part to play in any cache management: it should make it possible to ensure that any potential cache invalidation problems remain along the same lines as those faced by "find" pipelines.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reconsider the filesystem status querying design #18

Reconsider the filesystem status querying design #18

ncoghlan commented Jun 1, 2018

palaviv commented Jun 2, 2018

ncoghlan commented Jun 2, 2018 via email

Reconsider the filesystem status querying design #18

Reconsider the filesystem status querying design #18

Comments

ncoghlan commented Jun 1, 2018

palaviv commented Jun 2, 2018

ncoghlan commented Jun 2, 2018 via email