-
Notifications
You must be signed in to change notification settings - Fork 7.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GH-11573: RecursiveDirectoryIterator::hasChildren is slow #11577
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SPL change makes sense, the second commit does too.
Not sure about the third because I can't seem to find anything in the POSIX spec about that field?
It's not in POSIX, hence the ifdef. In practice it's available on Linux and BSD. |
Right, well I think this is OK, but streams are still somewhat nebulous to me. |
The failing CI test looks related. It looks like I missed some other places where a php_stream_dirent is created, so I'll check tonight for the missed places and correct them. |
Okay that is fixed now. Now Windows CI fails are unrelated: they are because current master fails too. |
@@ -1487,6 +1485,11 @@ PHP_METHOD(RecursiveDirectoryIterator, hasChildren) | |||
if (spl_filesystem_is_invalid_or_dot(intern->u.dir.entry.d_name)) { | |||
RETURN_FALSE; | |||
} else { | |||
if (intern->u.dir.entry.d_type == DT_DIR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the https://github.com/php/php-src/blob/php-8.2.7/ext/spl/spl_directory.c#L1247 code it seems functions like DirectoryIterator::isDir()
currently involve stat
syscall too.
There are used a lot in Symfony like in https://github.com/symfony/symfony/blob/v6.3.1/src/Symfony/Component/Finder/Iterator/ExcludeDirectoryFilterIterator.php#L62
Can DirectoryIterator::isDir()
, DirectoryIterator::isFile()
and maybe also isLink
/getType
be optimized the same way? Ie. if it is known the file entry is file or directory, return cached result early.
UPDATE:
DirectoryIterator::isDir()
is cached using file status cache.
So this PR should be reworked to populate file status cache (if file/directory type is known) instead of storing d_type
in iterator.
Then DirectoryIterator::isDir()
will be no-syscall and the cache will be clearable via clearstatcache()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this PR should be reworked to populate file status cache (if file/directory type is known) instead of storing d_type in iterator.
The stat cache was already used prior to this PR. The problem is that there were stat calls in between for other files/folders, such that the cache entry couldn't be reused and the stat call had to happen again. So this comment does not make sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean to pupulate the PHP stat cache here:
https://github.com/php/php-src/blob/05626ba83c/main/streams/plain_wrapper.c#L1035-L1039
based on the readdir
result.
Then the stat
in RecursiveDirectoryIterator::hasChildren
should be resolved thru the PHP stat cache.
Currently I belive this is not done, as the file type result from readdir
was never honored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The directory entries don't give the same information as stat does. As such, the stat cache cannot be filled in from there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How feasible is partial PHP stat cache?
It will help https://github.com/symfony/symfony/blob/v6.2.12/src/Symfony/Component/Finder/Iterator/ExcludeDirectoryFilterIterator.php#L62.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did benchmark on Windows 10 and NVMe SSD on php-src (21640 files, 449 directories).
/wo ->files()
:
- 1.3 secs
- 24 secs /w open_basedir! (due slow realpath)
- 0.15 secs (/w this PR, I emulated the hasChildren optimization by a precomputed list of directories)
now /w ->files()
:
- 1.3 secs
- 24 secs
- 1.3 secs (/w this PR)
- 24 secs (/w this PR and open_basedir)
In "now /w ->files()
" each iterated path is tested twice, once for is_dir for hasChildren, once for is_file(). The reason the times are the same is because the 2nd stat is cached. This is why I propose to transfer the knowledge from read_dir d_type to cache, even if the cache will be for last iterated entry will help these file discovery scenarios (time reduced from 1.3 (or 23 /w open_basedir) secs to 0.15 secs). Ideally cache the d_type of the last/current dir_read entry in some hashmap, so cache will be present for possibly concurrent iterators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first thought is that the benchmarking is at least a little flawed because you emulated the PR instead of applying it.
Also: 24 secs /w open_basedir! (due slow realpath)
WOW, I guess we need to look into that ...
Ideally cache the d_type of the last/current dir_read entry in some hashmap, so cache will be present for possibly concurrent iterators.
But then we spend time managing a cache for a very specific use-case. I just experimented with transferring the knowledge of the d_type to SplFileInfo objects in spl_filesystem_object_create_type()
. This is called for example if ->current()
is called on a FileSystemIterator
. Since Symfony constructs the file info manually instead from within the iterator, this gave a performance degradation of 5% because we paid the cost of the caching without getting any benefit. This demonstrates that even for lightweight caching, we must be careful with adding these kinds of optimizations: they can actually result in a slowdown.
This is why I propose (...)
Talk is cheap, show me the code ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My first thought is that the benchmarking is at least a little flawed because you emulated the PR instead of applying it.
I am confident here, this PR avoids IO syscalls for hasChildren and that is what I did...
Also: 24 secs /w open_basedir! (due slow realpath) WOW, I guess we need to look into that ...
That is known issue with slow realpath on Windows - https://bugs.php.net/bug.php?id=53263 However, any improvement can imply big performance boost, as the realpath is really slow, there may be unneeded allocation, unoptimal syscalls etc..
Talk is cheap, show me the code ;)
This PR is perfect. When no is_dir/is_file is called during the iteration, for regular dirs/files, the travesing is much faster than before. But when is_dir/is_file is called (or OOP equivalents like SplFileInfo::isDir()), the performance is as before, because originally the hasChildren called is_dir internally. And it was cached, thus any additional is_dir was fast, as it was cached thru stat cache.
So I propose, to address the basic file discovery usecase, to transfer the knowledge from the readdir d_type to the single entry file stat cache as well. As the knowledge has not the full data as stat has, the cache will be valid for is_file and is_dir only.
This will then make code like, even without OOP iterator, very fast:
$p = '/path/to/files';
$handle = opendir($p);
$dirs = [];
while (false !== ($entry = readdir($handle))) {
if (is_dir($p . '/' . $entry)) { // should be resolved thru stat cache from last readdir d_type result (when regular dir/file)
$dirs[] = $entry;
}
}
closedir($handle);
As partial stat file cache entry might be not expected by other C extensions, I think it is feasible to implement it as another single entry per thread cache. And in is_file/is_dir simply compare if the cache entry path is the same, if yes, use the cached entry. This cache should be cleared like the current stat cache is, ie. if file is moved/deleted or using clearstatcache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this PR is ok in the current state. Further improvements can be done later of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just experimented with transferring the knowledge of the d_type to SplFileInfo objects in
spl_filesystem_object_create_type()
. This is called for example if->current()
is called on aFileSystemIterator
. Since Symfony constructs the file info manually instead from within the iterator
@nielsdos I agree the transfering into SplFileInfo
might be too edge usecase. But can the iterator fastpath optimization be used for isFile
/isDir
on the same iterator object as the hasChildren
does? This will allow the fastpath knowledge to be accessible to userland. It should have close to zero performance penalty.
This only takes up space and time.
On POSIX systems, we can use the maximum file length instead of maximum path length. This saves space and time for clearing the memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems reasonable to me.
This PR does three things:
is_recursive
See #11573 (comment) for results.