Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improving listDirectoriesInDirectory by using std::fs #7974

Merged
merged 5 commits into from
May 23, 2023

Conversation

marcosd4h
Copy link
Contributor

This relates to #6679 and Fleet PR #8243

There was a 10x performance increase when using this logic (see osqueryi_custom.exe results below)

PS C:\tools\osquery> Measure-Command { .\osqueryi_5.8.2.exe "select COUNT(*) from python_packages;" | Out-Default }
+----------+
| COUNT(*) |
+----------+
| 228      |
+----------+
TotalSeconds      : 19.1131021
TotalMilliseconds : 19113.1021



PS C:\tools\osquery> Measure-Command { .\osqueryi_custom.exe "select COUNT(*) from python_packages;" | Out-Default }
+----------+
| COUNT(*) |
+----------+
| 228      |
+----------+
TotalSeconds      : 1.907992
TotalMilliseconds : 1907.992

@marcosd4h marcosd4h requested review from a team as code owners March 29, 2023 13:36
@marcosd4h marcosd4h changed the title Improving listDirectoriesInDirectory by using STD fs::recursive_directory_iterator Improving listDirectoriesInDirectory by using std::fs Mar 29, 2023
Copy link
Member

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally sure I understand -- is this mostly changing how the allocations work or is there something more nuanced?

if (!listDirectoriesInDirectory(site, directories, true).ok()) {
return;

if (isPlatform(PlatformType::TYPE_WINDOWS)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a change that we should apply on all platforms? Is there something windows specific about it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the listDirectoriesinDirectory helper takes a long time to complete on Windows. This is a field issue reported by Fleet customers - see here. I decided to play safe and just use the new implementation on Windows only.

@@ -84,12 +84,43 @@ void genPackage(const std::string& path, Row& r, Logger& logger) {
}
}

Status stdListDirectoriesInDirectory(const fs::path& path,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this useful enough we should put in filesystem.cpp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was to release this new logic only and then wait a couple of releases to see how it behaves. I can benchmark it on different platforms before deciding on using it on filesystems.cpp

@marcosd4h
Copy link
Contributor Author

I'm not totally sure I understand -- is this mostly changing how the allocations work or is there something more nuanced?

Thanks @directionless for looking into this! The overall goal is to fix a field issue found by fleet customers using python_packages tables on Windows environment - see here.

@directionless
Copy link
Member

My gut sense is that we should see if we need this everywhere, and not just here. But I can be persuaded to just approve it here.

@alessandrogario or @Smjert either of you have an opinion?

@directionless directionless added this to the 5.9.0 milestone Mar 30, 2023
@Smjert
Copy link
Member

Smjert commented Apr 6, 2023

My gut sense is that we should see if we need this everywhere, and not just here. But I can be persuaded to just approve it here.

@alessandrogario or @Smjert either of you have an opinion?

I think that as long as the logic has to just return a list of directories without the need to parse any regex from SQL, it should definitely not use the glob function, which on Windows particularly is hand rolled and quite slow apparently.
Since it seems to me that listDirectoriesInDirectory did not have to support regex, I would swap the internal representation with the faster one.

The other thing is to make sure that the behavior is the same (with tests), particularly around symlinks/junctions.

…tem helpers. Adding testcases to cover different directories listing scenarios
@marcosd4h
Copy link
Contributor Author

@Smjert I've replaced the platform's listDirectoriesInDirectory implementation with the new logic and added testcases to check different directory listing scenarios

directionless
directionless previously approved these changes Apr 25, 2023
@directionless
Copy link
Member

directionless commented Apr 25, 2023

Potential behavioral changes:

  • probably lists dotfiles now
  • may chance symlink traversal

@mike-myers-tob mike-myers-tob changed the title Improving listDirectoriesInDirectory by using std::fs Improving listDirectoriesInDirectory by using std::fs May 9, 2023
@marcosd4h
Copy link
Contributor Author

Potential behavioral changes:

  • probably lists dotfiles now
  • may chance symlink traversal

I have just run a quick test on this and found the following:

  • The old implementation of listDirectoriesInDirectory, which uses a glob-based approach, does not follow directories inside symlink directories.
  • The new implementation of listDirectoriesInDirectory, which uses std::fs, lists directories inside symlink directories, which ends up producing more data (this is better as more directories get listed).

I also found that the two implementations behave in the same way when it comes to listing dot-directories

@directionless
Copy link
Member

As discussed in office hours, the symlink listing is a little unclear. Is there a risk of looping? Like if A is a junction to A, what happens?

…s to test this scenario and to compare the previous logic with the current one
@marcosd4h
Copy link
Contributor Author

Is there a risk of looping? Like if A is a junction to A, what happens?
This scenario turned out to be problematic because boost::filesystem::is_directory() couldn't handle it properly. I have added code and tests to fix this issue and ensure that it works as expected.

I have also included a test case to compare the previous logic with the new one and ensure that both list the same number of regular and junctions directories.

@marcosd4h
Copy link
Contributor Author

@Smjert @directionless This PR is something we can consider for inclusion in the upcoming release

@marcosd4h marcosd4h merged commit 9f09b7c into osquery:master May 23, 2023
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants