Skip to content
This repository has been archived by the owner on Jun 19, 2024. It is now read-only.

Reconsider the filesystem status querying design #18

Open
ncoghlan opened this issue Jun 1, 2018 · 2 comments
Open

Reconsider the filesystem status querying design #18

ncoghlan opened this issue Jun 1, 2018 · 2 comments

Comments

@ncoghlan
Copy link
Owner

ncoghlan commented Jun 1, 2018

(Getting some thoughts on this topic out of my head and into written form, as part of handing over the lead maintainer and API designer role to @palaviv)

One of the annoying aspects of walkdir's symlink loop detection is that it needs to call os.path.islink on each directory. Prior to the introduction of os.scandir(), this made the number of stat calls during a walk equal to the number of files yielded (as the underlying walk decided whether it was a directory or not) plus twice the number of directories (once in the underlying walk, once in walkdir's symlink loop detection).

Since Python 3.5 though, that hasn't been the case: os.scandir() means the underlying walk doesn't need to make any stat calls at all, since the information it needs is right there on the os.DirEntry objects. Those objects also typically contain the symlink-or-not info that the symlink loop detection needs, but the entries aren't made visible to the rest of the filter stack, so each filter that needs filesystem status information needs to make its own call to check the status.

Designing a custom walk that exposes the underlying os.DirEntry objects just to save a stat call in the symlink loop detector probably isn't worth the hassle, but it would be nice to be able to filter on other stat_result attributes like the mode, size, uid, gid, access time, modification time, and creation time without every such filter needing to make its own stat call.

The challenge is finding a way to store a transient stat cache such that:

  • it can be used even with immutable dir_entry objects (like the tuples produced by os.walk() and os.fwalk())
  • stale entries don't accumulate over time
  • stale entries don't persist for extended periods after the relevant iterators terminate
  • multiple iterators can be running in multiple threads without interfering with each other
  • multiple iterators can be interleaved within a single thread without interfering with each other

I initially thought keying by id(walk_iter) might work, but the problem there is that each layer of filtering gets a different value for that (since they see the next filter layer down, not the underlying walk).

Keying by id(dir_entry) seems more promising (since the filters are expected to pass that through unmodified since 0.4), but the challenge there is deciding when to clear the previously cached entries.

The final possibility that comes to mind is to instead use a naive time based cache, where rather than aiming for "Only one stat call per path per walk iteration", we instead aim for "No more than one stat call per path per N milliseconds". To enable that, you'd calculate the currently active cache ID as int(time.monotonic() * (1000 / N)). Then whenever the cache ID changed, you'd discard all previously cached values and start making fresh stat calls.

However, a problem with only using time, and not considering id(dir_entry) is that once you "escape" from the filter pipeline, then the loop body may mutate the filesystem state in a way that it expects the next iteration to detect. This means caching by dir_entry may still make the most sense, and time only comes into play as a way of limiting the amount of no longer needed stale data that gets kept around, without limiting the number of concurrently active filter pipelines.

P.S. It's conceivable this is a topic I've been intermittently thinking about since first creating walkdir, and I never actually did anything about it because nothing leapt out at me as "Yes, that's the right way to handle multiple stat-aware filter stages without large numbers of redundant stat calls" ;)

@palaviv
Copy link
Collaborator

palaviv commented Jun 2, 2018

I think that the big problem with your suggestion is the cache. What will happen if something has changed in the filesystem between walks?

@ncoghlan
Copy link
Owner Author

ncoghlan commented Jun 2, 2018 via email

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants