Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecursiveDirectoryIterator::hasChildren is slow #11573

Closed
mvorisek opened this issue Jul 1, 2023 · 13 comments
Closed

RecursiveDirectoryIterator::hasChildren is slow #11573

mvorisek opened this issue Jul 1, 2023 · 13 comments

Comments

@mvorisek
Copy link
Contributor

mvorisek commented Jul 1, 2023

Description

image

I belive the currently php-src impl. checks if any iterated item has children on each RecursiveDirectoryIterator::hasChildren call. This is slow.

I propose to reuse current item file entry file flag (d_type = DT_REG on linux, dwFileAttributes on Windows). If a current item is a file, it cannot have children and it will optimize typical usecase - directory with many regular files.

repro iterator code:

$iter = new RecursiveDirectoryIterator(
    $path,
    \RecursiveDirectoryIterator::SKIP_DOTS | \RecursiveDirectoryIterator::FOLLOW_SYMLINKS
);

Measured on Windows, maybe on linux it is faster.

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

If I understand correctly, the problem is calling hasChildren() multiple times back to back causing a performance problem because of duplicate work?

Note that caching the flags of directory/files in a directory may be undesired in some cases, e.g. when the filesystem structure can change in between.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jul 2, 2023

Yes. Currently RecursiveDirectoryIterator::hasChildren tries to stat each file entry in https://github.com/php/php-src/blob/php-8.2.7/ext/spl/spl_directory.c#L1501 but the stat is not needed as long as the currentl entry is a regular file. Regular file is guaranteed to not have any children.

In linux, https://man7.org/linux/man-pages/man3/readdir.3.html d_type holds the type info and it is explicitly documented there:

This field contains a value indicating the file type, making it possible to avoid the expense of calling lstat(2) if further actions depend on the type of the file.

In Windows, https://github.com/php/php-src/blob/php-8.2.7/win32/readdir.c#L73 dp->fileinfo holds the type info - https://learn.microsoft.com/en-us/windows/win32/api/minwinbase/ns-minwinbase-win32_find_dataa - dwFileAttributes.

so in both linux/Windows, we should use the already fetched file entry info and never invoke expensive stat if we know the file entry is a regular file.

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

I implemented something that caches d_type. See: nielsdos#28
It does help noticeably, even for a normal foreach loop over a RecursiveDirectoryIterator, and this case doesn't even call that much into hasChildren.

FWIW here's some benchmark results in nr of instructions:

start: 124,985,330
after removal of is_recursive: 124,947,575
after propagating d_type: 110,506,487
after reducing d_name size: 102,133,320

Results may vary across benchmarks and systems ofc, but this is what I got on Linux.
I'm currently running that WIP PR through CI on my fork, because I don't have access to a Windows machine. So I may need to do some tweaks based on the CI results. I'll report back when I know more.

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

@mvorisek The tests pass for my branch in nielsdos#28. Could you try it out and see how much it improves your workload?

@staabm
Copy link
Contributor

staabm commented Jul 2, 2023

FWIW here's some benchmark results in nr of instructions:

Does instruction count also reflect that less IO should speed it up?

Or will instruction count only reflect time spent on CPU (so no IO included?)

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

FWIW here's some benchmark results in nr of instructions:

Does instruction count also reflect that less IO should speed it up?

Or will instruction count only reflect time spent on CPU (so no IO included?)

The instruction count only includes the time spent in CPU, no IO.
Even without measuring the improvement in IO, we see a ~17% instruction decrease in my testcase. If for Michael's testcase the stat calls comprise a large part of the time spent, then this should give a big speedup. If not, and there's something else dominating the time spent, then there likely won't be a big difference. My expectations are that the PR will make a big difference, especially given that Windows IO has to go through our win32 layer and the fact that the Windows APIs for IO are fairly slow.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jul 2, 2023

Could you try it out and see how much it improves your workload?

On my Windows machine I do not have C compiler to test, but your PR seems to do exactly what I thought and what linux man advises.

Thank you for looking into this topic so quickly ❤️

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

Hmm OK. I can do a quick test tonight on a Windows VM I have set up on my desktop for PHP bugfixing, hopefully it shows good results.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jul 2, 2023

Also spl_filesystem_dir_read seems to be called twice - on open and rewind:

on rewind, no action should be done, as long as no next was called. Based on https://www.php.net/manual/en/iterator.rewind.php the rewind is called before every foreach start.

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

I guess in that case we can check if the current index is 0, if true then we don't need to rewind. This is a little bit iffy though because changes to the filesystem between an open and rewind (for example, if used outside a foreach loop), will get ignored for the first entry.
EDIT: it probably also won't make a big difference in performance, but I can check

@nielsdos
Copy link
Member

nielsdos commented Jul 2, 2023

I just checked the performance gain on my Windows VM too. I get about a 8-9x performance win (note: on a HDD because my VM is on a HDD) for a simple iteration of new RecursiveIteratorIterator(new RecursiveDirectoryIterator('ext')); in php-src. And about a similar gain on a smaller subfolder.
On Linux I get (with SSD) a performance gain of around 17%. I think the big win for Windows is because it's a HDD and because Windows IO is generally slower than Linux IO.

Also spl_filesystem_dir_read seems to be called twice - on open and rewind:

I checked this. Preventing the double read made no measurable impact on performance for me. Since people might rely on the reset behaviour and keep these instances around, I won't be changing the behaviour of that.

I'll clean up my changes and make a PR now.

@staabm
Copy link
Contributor

staabm commented Jul 3, 2023

On my Windows machine I do not have C compiler to test, but your PR seems to do exactly what I thought and what linux man advises.

could the github action upload the compiled artifact so people could re-use the already compiled binaries for local testing?

@nielsdos
Copy link
Member

nielsdos commented Jul 3, 2023

could the github action upload the compiled artifact so people could re-use the already compiled binaries for local testing?

It shouldn't be too hard to compile it yourself, but I understand it can be cumbersome on Windows, especially if libraries are involved.
I'm not sure what the consequences of exposing and storing those artifacts would be though. You'd better ping someone who maintains the CI setup.

@nielsdos nielsdos closed this as completed Jul 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants