Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix GH-11573: RecursiveDirectoryIterator::hasChildren is slow #11577

Closed
wants to merge 4 commits into from

Conversation

nielsdos
Copy link
Member

@nielsdos nielsdos commented Jul 2, 2023

This PR does three things:

  • Gets rid of an unused field is_recursive
  • Reduces the size of directory entries when NAME_MAX is defined. Previously, we used MAXPATHLEN, which is the total path length, i.e. path + file name. Directory entries only contain a file name, so we can use a smaller size. By doing so we spend less time allocating and clearing memory.
  • We can avoid extra stat() calls by caching the d_type field from the directory entry. The last commit adds a field d_type to _php_stream_dirent that will be filled in from the dirent. It also does the necessary translation on Windows.

See #11573 (comment) for results.

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SPL change makes sense, the second commit does too.

Not sure about the third because I can't seem to find anything in the POSIX spec about that field?

@nielsdos
Copy link
Member Author

nielsdos commented Jul 3, 2023

Not sure about the third because I can't seem to find anything in the POSIX spec about that field?

It's not in POSIX, hence the ifdef. In practice it's available on Linux and BSD.

@Girgias
Copy link
Member

Girgias commented Jul 3, 2023

Right, well I think this is OK, but streams are still somewhat nebulous to me.

@nielsdos
Copy link
Member Author

nielsdos commented Jul 3, 2023

The failing CI test looks related. It looks like I missed some other places where a php_stream_dirent is created, so I'll check tonight for the missed places and correct them.

@nielsdos
Copy link
Member Author

nielsdos commented Jul 3, 2023

Okay that is fixed now. Now Windows CI fails are unrelated: they are because current master fails too.

@@ -1487,6 +1485,11 @@ PHP_METHOD(RecursiveDirectoryIterator, hasChildren)
if (spl_filesystem_is_invalid_or_dot(intern->u.dir.entry.d_name)) {
RETURN_FALSE;
} else {
if (intern->u.dir.entry.d_type == DT_DIR) {
Copy link
Contributor

@mvorisek mvorisek Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the https://github.com/php/php-src/blob/php-8.2.7/ext/spl/spl_directory.c#L1247 code it seems functions like DirectoryIterator::isDir() currently involve stat syscall too.

There are used a lot in Symfony like in https://github.com/symfony/symfony/blob/v6.3.1/src/Symfony/Component/Finder/Iterator/ExcludeDirectoryFilterIterator.php#L62

Can DirectoryIterator::isDir(), DirectoryIterator::isFile() and maybe also isLink/getType be optimized the same way? Ie. if it is known the file entry is file or directory, return cached result early.

UPDATE:

DirectoryIterator::isDir() is cached using file status cache.

So this PR should be reworked to populate file status cache (if file/directory type is known) instead of storing d_type in iterator.

Then DirectoryIterator::isDir() will be no-syscall and the cache will be clearable via clearstatcache().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this PR should be reworked to populate file status cache (if file/directory type is known) instead of storing d_type in iterator.

The stat cache was already used prior to this PR. The problem is that there were stat calls in between for other files/folders, such that the cache entry couldn't be reused and the stat call had to happen again. So this comment does not make sense to me.

Copy link
Contributor

@mvorisek mvorisek Jul 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean to pupulate the PHP stat cache here:

https://github.com/php/php-src/blob/05626ba83c/main/streams/plain_wrapper.c#L1035-L1039

based on the readdir result.

Then the stat in RecursiveDirectoryIterator::hasChildren should be resolved thru the PHP stat cache.

Currently I belive this is not done, as the file type result from readdir was never honored.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The directory entries don't give the same information as stat does. As such, the stat cache cannot be filled in from there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@mvorisek mvorisek Jul 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did benchmark on Windows 10 and NVMe SSD on php-src (21640 files, 449 directories).

/wo ->files():

  • 1.3 secs
  • 24 secs /w open_basedir! (due slow realpath)
  • 0.15 secs (/w this PR, I emulated the hasChildren optimization by a precomputed list of directories)

now /w ->files():

  • 1.3 secs
  • 24 secs
  • 1.3 secs (/w this PR)
  • 24 secs (/w this PR and open_basedir)

In "now /w ->files()" each iterated path is tested twice, once for is_dir for hasChildren, once for is_file(). The reason the times are the same is because the 2nd stat is cached. This is why I propose to transfer the knowledge from read_dir d_type to cache, even if the cache will be for last iterated entry will help these file discovery scenarios (time reduced from 1.3 (or 23 /w open_basedir) secs to 0.15 secs). Ideally cache the d_type of the last/current dir_read entry in some hashmap, so cache will be present for possibly concurrent iterators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought is that the benchmarking is at least a little flawed because you emulated the PR instead of applying it.
Also: 24 secs /w open_basedir! (due slow realpath) WOW, I guess we need to look into that ...

Ideally cache the d_type of the last/current dir_read entry in some hashmap, so cache will be present for possibly concurrent iterators.

But then we spend time managing a cache for a very specific use-case. I just experimented with transferring the knowledge of the d_type to SplFileInfo objects in spl_filesystem_object_create_type(). This is called for example if ->current() is called on a FileSystemIterator. Since Symfony constructs the file info manually instead from within the iterator, this gave a performance degradation of 5% because we paid the cost of the caching without getting any benefit. This demonstrates that even for lightweight caching, we must be careful with adding these kinds of optimizations: they can actually result in a slowdown.

This is why I propose (...)

Talk is cheap, show me the code ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first thought is that the benchmarking is at least a little flawed because you emulated the PR instead of applying it.

I am confident here, this PR avoids IO syscalls for hasChildren and that is what I did...

Also: 24 secs /w open_basedir! (due slow realpath) WOW, I guess we need to look into that ...

That is known issue with slow realpath on Windows - https://bugs.php.net/bug.php?id=53263 However, any improvement can imply big performance boost, as the realpath is really slow, there may be unneeded allocation, unoptimal syscalls etc..

Talk is cheap, show me the code ;)

This PR is perfect. When no is_dir/is_file is called during the iteration, for regular dirs/files, the travesing is much faster than before. But when is_dir/is_file is called (or OOP equivalents like SplFileInfo::isDir()), the performance is as before, because originally the hasChildren called is_dir internally. And it was cached, thus any additional is_dir was fast, as it was cached thru stat cache.

So I propose, to address the basic file discovery usecase, to transfer the knowledge from the readdir d_type to the single entry file stat cache as well. As the knowledge has not the full data as stat has, the cache will be valid for is_file and is_dir only.

This will then make code like, even without OOP iterator, very fast:

$p = '/path/to/files';
$handle = opendir($p);
$dirs = [];
while (false !== ($entry = readdir($handle))) {
    if (is_dir($p . '/' . $entry)) { // should be resolved thru stat cache from last readdir d_type result (when regular dir/file)
        $dirs[] = $entry;
    }
}
closedir($handle);

As partial stat file cache entry might be not expected by other C extensions, I think it is feasible to implement it as another single entry per thread cache. And in is_file/is_dir simply compare if the cache entry path is the same, if yes, use the cached entry. This cache should be cleared like the current stat cache is, ie. if file is moved/deleted or using clearstatcache.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR is ok in the current state. Further improvements can be done later of course.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just experimented with transferring the knowledge of the d_type to SplFileInfo objects in spl_filesystem_object_create_type(). This is called for example if ->current() is called on a FileSystemIterator. Since Symfony constructs the file info manually instead from within the iterator

@nielsdos I agree the transfering into SplFileInfo might be too edge usecase. But can the iterator fastpath optimization be used for isFile/isDir on the same iterator object as the hasChildren does? This will allow the fastpath knowledge to be accessible to userland. It should have close to zero performance penalty.

This only takes up space and time.
On POSIX systems, we can use the maximum file length instead of maximum
path length. This saves space and time for clearing the memory.
Copy link
Member

@bukka bukka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems reasonable to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RecursiveDirectoryIterator::hasChildren is slow
4 participants