RecursiveDirectoryIterator::hasChildren is slow #11573
Comments
If I understand correctly, the problem is that calling hasChildren() multiple times back to back does duplicate work and so hurts performance? Note that caching the flags of directories/files in a directory may be undesired in some cases, e.g. when the filesystem structure can change in between.
Yes. On Linux, see https://man7.org/linux/man-pages/man3/readdir.3.html
On Windows, see https://github.com/php/php-src/blob/php-8.2.7/win32/readdir.c#L73
So on both Linux and Windows, we should use the already fetched file entry info and never invoke an expensive stat if we know the file entry is a regular file.
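To make the idea concrete, here is a minimal sketch in plain C of skipping stat when readdir has already reported a regular file. It is not the php-src code: `entry_has_children`, `count_subdirs` and the `stat_calls` counter are hypothetical names made up for illustration, and it assumes a platform whose `struct dirent` exposes `d_type` (for example glibc on Linux), falling back to stat everywhere else.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* asks glibc to expose d_type and the DT_* constants */
#endif
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical helper, not php-src code.
 * Returns 1 if the entry is a directory, 0 if it is not, -1 on error.
 * *stat_calls is incremented each time the slow path calls stat(). */
static int entry_has_children(const char *dir_path, const struct dirent *entry,
                              long *stat_calls)
{
    char full[4096];
    struct stat st;
    int n;

#ifdef DT_REG
    /* Fast path: readdir() already told us this is a regular file. */
    if (entry->d_type == DT_REG) {
        return 0;
    }
#endif
    /* Slow path: DT_UNKNOWN, symlinks, directories, or no d_type support. */
    n = snprintf(full, sizeof full, "%s/%s", dir_path, entry->d_name);
    if (n < 0 || (size_t)n >= sizeof full) {
        return -1;
    }
    (*stat_calls)++;
    if (stat(full, &st) != 0) {
        return -1;
    }
    return S_ISDIR(st.st_mode) ? 1 : 0;
}

/* Counts the subdirectories of dir_path, skipping "." and "..".
 * Returns -1 if the directory cannot be opened. */
static long count_subdirs(const char *dir_path, long *stat_calls)
{
    DIR *dir;
    struct dirent *entry;
    long count = 0;

    assert(dir_path != NULL && stat_calls != NULL);
    dir = opendir(dir_path);
    if (dir == NULL) {
        return -1;
    }
    while ((entry = readdir(dir)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0) {
            continue;
        }
        if (entry_has_children(dir_path, entry, stat_calls) == 1) {
            count++;
        }
    }
    closedir(dir);
    return count;
}
```

Calling `count_subdirs(path, &calls)` on a directory with many regular files should leave `calls` close to the number of non-regular entries on filesystems that fill in `d_type`. Filesystems that report DT_UNKNOWN still go through stat, so the result stays correct and only the speedup is lost.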
I implemented something that caches d_type. See: nielsdos#28. FWIW, here are some benchmark results in number of instructions:
Results may vary across benchmarks and systems of course, but this is what I got on Linux.
@mvorisek The tests pass for my branch in nielsdos#28. Could you try it out and see how much it improves your workload?
Does the instruction count also reflect that less IO should speed it up? Or does it only reflect time spent on the CPU (so no IO included)?
The instruction count only reflects work done on the CPU, not IO.
I do not have a C compiler on my Windows machine to test, but your PR seems to do exactly what I had in mind and what the Linux man page advises. Thank you for looking into this topic so quickly ❤️
Hmm OK. I can do a quick test tonight on a Windows VM I have set up on my desktop for PHP bugfixing, hopefully it shows good results.
Also, on rewind, no action should be taken as long as next has not been called. Based on https://www.php.net/manual/en/iterator.rewind.php, rewind is called at the start of every foreach.
I guess in that case we can check if the current index is 0, and if so we don't need to rewind. This is a little bit iffy though, because changes to the filesystem between an open and a rewind (for example, if used outside a foreach loop) will get ignored for the first entry.
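As a rough illustration of that index check, here is a sketch in C on top of POSIX opendir/readdir/rewinddir. The `dir_iter` struct and its functions are hypothetical and only model the idea; they are not the php-src iterator. The sketch assumes, as the comment above implies, that the first entry is read when the directory is opened.

```c
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/* Hypothetical iterator state, for illustration only. */
typedef struct {
    DIR *dir;
    struct dirent *current; /* entry at the current position, NULL at the end */
    long index;             /* how many times next() ran since open/rewind */
    long real_rewinds;      /* how often rewinddir() was actually called */
} dir_iter;

/* Opens the directory and eagerly reads the first entry. Returns 0 or -1. */
static int dir_iter_open(dir_iter *it, const char *path)
{
    assert(it != NULL && path != NULL);
    it->dir = opendir(path);
    it->current = NULL;
    it->index = 0;
    it->real_rewinds = 0;
    if (it->dir == NULL) {
        return -1;
    }
    it->current = readdir(it->dir);
    return 0;
}

/* Advances to the following entry. */
static void dir_iter_next(dir_iter *it)
{
    it->current = readdir(it->dir);
    it->index++;
}

/* Skips the rewind while still at index 0. The cached first entry is then
 * reused, so filesystem changes since open are not seen for that entry. */
static void dir_iter_rewind(dir_iter *it)
{
    if (it->index == 0) {
        return;
    }
    rewinddir(it->dir);
    it->current = readdir(it->dir);
    it->index = 0;
    it->real_rewinds++;
}

static void dir_iter_close(dir_iter *it)
{
    if (it->dir != NULL) {
        closedir(it->dir);
        it->dir = NULL;
    }
}
```

With this shape, the rewind issued by the first foreach over a freshly opened iterator costs nothing, while any rewind after at least one next still re-reads the directory.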
I just checked the performance gain on my Windows VM too. I get about an 8-9x performance win (note: on an HDD, because my VM is on an HDD) for a simple iteration of
I checked this. Preventing the double read made no measurable impact on performance for me. Since people might rely on the reset behaviour and keep these instances around, I won't be changing that behaviour. I'll clean up my changes and make a PR now.
Could the GitHub Action upload the compiled artifact so people can reuse the already compiled binaries for local testing?
It shouldn't be too hard to compile it yourself, but I understand it can be cumbersome on Windows, especially if libraries are involved.
Description
I believe the current php-src implementation checks whether the iterated item has children on each RecursiveDirectoryIterator::hasChildren call. This is slow.
I propose to reuse the current item's file entry type flag (d_type = DT_REG on Linux, dwFileAttributes on Windows). If the current item is a file, it cannot have children, so this optimizes the typical use case: a directory with many regular files.
repro iterator code:
Measured on Windows; it may be faster on Linux.