Index: Delay RAW file format check to improve performance #2683
Unfortunately, I can't sign this yet; I need to run it through my company's legal team first. I might just file this as a bug and hope somebody else is able to move the 4 lines up by 4 lines... :)
indexed. The measured performance improvement is several orders of magnitude. (See photoprism#2683 for more information.) Specifically, the following improvements are made:
- Move the IsRaw() check after checking if the file is already indexed. IsRaw() causes the file to be opened and the first 261 bytes to be read to determine the file type. This is obviously very expensive.
- godirwalk can inform us whether the file currently being processed is a symlink (this is gathered without extra stat syscalls). Use this information to skip resolving the symlink to the absolute path (which is necessary to get the stat info of the image file instead of the symlink to it).
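The reordering described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not PhotoPrism's actual code: `seenIndex`, `isIndexed`, `sniffHeader`, `scan`, and `demo` are hypothetical names, and the in-memory map stands in for the real "already indexed" lookup. The point is only that the cheap check (name and modification time) runs before the expensive one (opening the file and reading its first 261 bytes).

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"time"
)

// headerSize is the number of leading bytes read to detect the file type
// (261 bytes, the figure quoted in this PR).
const headerSize = 261

// seenIndex is a hypothetical stand-in for the indexer's database lookup:
// it maps a file name to the modification time recorded at indexing.
type seenIndex map[string]time.Time

// isIndexed is the cheap check: it needs only the name and the mod time.
func (s seenIndex) isIndexed(name string, mod time.Time) bool {
	t, ok := s[name]
	return ok && t.Equal(mod)
}

// sniffHeader is the expensive check: it opens the file and reads its header.
func sniffHeader(name string) ([]byte, error) {
	f, err := os.Open(name)
	if err != nil {
		return nil, err
	}
	defer f.Close()
	buf := make([]byte, headerSize)
	n, err := io.ReadFull(f, buf)
	// Files shorter than headerSize are fine; return what was read.
	if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
		return nil, err
	}
	return buf[:n], nil
}

// scan stats each file once, skips unchanged files, and only then sniffs
// the header. It returns how many files had to be opened.
func scan(names []string, seen seenIndex) (int, error) {
	sniffed := 0
	for _, name := range names {
		info, err := os.Stat(name)
		if err != nil {
			return sniffed, err
		}
		if seen.isIndexed(name, info.ModTime()) {
			continue // unchanged: no open, no read
		}
		if _, err := sniffHeader(name); err != nil {
			return sniffed, err
		}
		sniffed++
		seen[name] = info.ModTime()
	}
	return sniffed, nil
}

// demo scans three temporary files twice and reports how many files were
// opened during each pass.
func demo() (int, int, error) {
	dir, err := os.MkdirTemp("", "scan-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	var names []string
	for _, n := range []string{"a.jpg", "b.dng", "c.png"} {
		p := filepath.Join(dir, n)
		if err := os.WriteFile(p, []byte("placeholder"), 0o600); err != nil {
			return 0, 0, err
		}
		names = append(names, p)
	}
	seen := seenIndex{}
	first, err := scan(names, seen)
	if err != nil {
		return first, 0, err
	}
	second, err := scan(names, seen)
	return first, second, err
}

func main() {
	first, second, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("files opened: first scan %d, rescan %d\n", first, second)
}
```

In this sketch the first pass opens all three files, while the rescan opens none, because every file is rejected by the cheap check before any header is read.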
b99e34b to 846ea10
Updated this PR with some extra scan performance improvements. I got rid of all unnecessary syscalls and managed to get a 100x speedup scanning my library for new files. Also still trying to get the CLA figured out...
Thank you very much! The preview build now includes the delayed RAW file format check (assuming this provides most of the performance gain), but not the "fileNameResolved" optimization. I first want to make sure this has no unintended side effects and is easy to understand and maintain (I might change the naming of variables or functions...).
I'm very grateful you found this, because we noticed something was wrong but couldn't figure out what was causing the performance degradation. Overall, it was still pretty fast even with 100k files, and in the end I wasn't sure whether it was just my imagination.
846ea10 to 4eb65b2
I updated the PR and signed the CLA.
…mlink. godirwalk can inform us whether the file currently being processed is a symlink (this is gathered without extra stat syscalls). Using this information to skip resolving the symlink to the absolute path (which is necessary to get the stat info of the image file instead of the symlink to it) saves a lot of syscalls. Resolve causes a Stat syscall for each level in the path, which is very expensive and slows down scanning.
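The idea in this commit message can be sketched with the Go standard library. Note the substitution: this example uses `filepath.WalkDir` and `fs.DirEntry.Type()` rather than godirwalk, since the directory entry type is typically available from the directory listing itself without a separate stat call. `resolveIfSymlink` and `demo` are hypothetical names, and `filepath.EvalSymlinks` stands in for the indexer's own resolve step.

```go
package main

import (
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// resolveIfSymlink returns the path that should be stat'ed. Regular files
// are returned unchanged; only symlinks pay for path resolution.
// resolves counts how often the expensive resolution actually ran.
func resolveIfSymlink(path string, d fs.DirEntry, resolves *int) (string, error) {
	if d.Type()&fs.ModeSymlink == 0 {
		return path, nil // not a symlink: nothing to resolve
	}
	*resolves++
	return filepath.EvalSymlinks(path)
}

// demo walks a temporary directory containing two regular files and one
// symlink, and reports how many files were visited and how many of them
// needed symlink resolution.
func demo() (int, int, error) {
	dir, err := os.MkdirTemp("", "walk-demo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	for _, n := range []string{"a.jpg", "b.jpg"} {
		if err := os.WriteFile(filepath.Join(dir, n), []byte("x"), 0o600); err != nil {
			return 0, 0, err
		}
	}
	if err := os.Symlink(filepath.Join(dir, "a.jpg"), filepath.Join(dir, "link.jpg")); err != nil {
		return 0, 0, err
	}

	files, resolves := 0, 0
	err = filepath.WalkDir(dir, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		if d.IsDir() {
			return nil
		}
		target, err := resolveIfSymlink(path, d, &resolves)
		if err != nil {
			return err
		}
		// A single stat on the resolved path yields the image file's
		// info (for example, its modification time).
		if _, err := os.Stat(target); err != nil {
			return err
		}
		files++
		return nil
	})
	return files, resolves, err
}

func main() {
	files, resolves, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("visited %d files, resolved %d symlinks\n", files, resolves)
}
```

In this sketch only the one symlink is resolved; the two regular files skip resolution entirely, which is where the saving comes from when most of a library consists of regular files.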
4eb65b2 to f4a8d6f
Great news! I had expected that this would not be possible due to Google's AGPL policy: https://opensource.google/documentation/reference/using/agpl-policy Since we had hoped for some sort of collaboration with Google, PhotoPrism was initially licensed under Apache 2.0, which is much more permissive. However, no one seemed interested, although I talked to quite a few personal acquaintances at Google. That made it easy for our community to convince us to use AGPL instead. When I have a little time, I'll move on with merging your additional symlink handling improvements! 👍 |
It's complicated... Because I am doing this as a personal project (no affiliation to the company in any way) I can put my contributions under whatever license I desire. (If a Googler is contributing to an open source project as part of their work the
Cool, thanks!
I'm going to merge this now! Really helpful. We should have users test thoroughly after integration in case there are unexpected side effects we didn't think of. The next stable release might take a bit longer anyway, as I just pushed more than 10,000 lines of code with completely reworked session management and ACLs.
Signed-off-by: Michael Mayer <michael@photoprism.app>
Scanning is unnecessarily slow because the code does extra processing for files that are unchanged and skipped later.
With a bit of reshuffling, I was able to achieve a 100x scan speedup (from about 900s down to 9s) for my library of 33000 photos (see specs below) when I trigger a rescan without any changes to any of the files.
Discovered using strace: during indexing, each file is opened, exactly 261 bytes are read, and stat is called many times for the same file:
After the changes, each file is only stat'ed once (which we need to get the modification time for the isIndexed check):
"Benchmarked" using around 33000 images stored on an HDD-backed ZFS pool on a Ryzen 3900X with 64 GB RAM; SQLite/Cache/sidecar/... on an SSD-backed ZFS pool.
Acceptance Criteria: