Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Index: Delay RAW file format check to improve performance #2683

Merged
merged 1 commit into from
Sep 28, 2022

Conversation

ReneHollander
Copy link
Contributor

@ReneHollander ReneHollander commented Sep 5, 2022

Scanning is unnecesserily slow because the code does extra processing for files that are unchanged and skipped later.

With a bit of reshuffeling I was able to achieve a 100x scan speedup (from about 900s down to 9s) for my library of 33000 photos (see specs below) if I trigger a rescan without any changes to any of the files.

Discovered using strace and noticing how during index each file is opened and exactly 261 ones bytes are read and many stat calls for the same file:

[pid 3359850] newfstatat(AT_FDCWD, "/photoprism", {st_mode=S_IFDIR|0755, st_size=5, ...}, AT_SYMLINK_NOFOLLOW) = 0                                             
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals", {st_mode=S_IFDIR|0755, st_size=8, ...}, AT_SYMLINK_NOFOLLOW) = 0                                   
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX", {st_mode=S_IFDIR|0755, st_size=3045, ...}, AT_SYMLINK_NOFOLLOW) = 0                        
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX", {st_mode=S_IFDIR|0755, st_size=10, ...}, AT_SYMLINK_NOFOLLOW) = 0                  
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/XXX.jpg", {st_mode=S_IFREG|0644, st_size=156377, ...}, AT_SYMLINK_NOFOLLOW) = 0                                                   
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/XXX.jpg", {st_mode=S_IFREG|0644, st_size=156377, ...}, 0) = 0                                                                                                                                                     
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism", {st_mode=S_IFDIR|0755, st_size=5, ...}, AT_SYMLINK_NOFOLLOW) = 0          
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals", {st_mode=S_IFDIR|0755, st_size=8, ...}, AT_SYMLINK_NOFOLLOW) = 0                                   
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX", {st_mode=S_IFDIR|0755, st_size=3045, ...}, AT_SYMLINK_NOFOLLOW) = 0             
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX", {st_mode=S_IFDIR|0755, st_size=10, ...}, AT_SYMLINK_NOFOLLOW) = 0                  
[pid 3359850] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/XXX.jpg", {st_mode=S_IFREG|0644, st_size=156377, ...}, AT_SYMLINK_NOFOLLOW) = 0                                                                                                                                   
[pid 3359850] openat(AT_FDCWD, "/photoprism/originals/XXX/XXX/XXX.jpg", O_RDONLY|O_CLOEXEC) = 10                                                                         
[pid 3359850] read(10, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\341\vjhttp://n"..., 261) = 261                               
[pid 3359850] close(10 <unfinished ...>                                                                                              

After the changes each file is only stat'ed once (which we need to get the modification time for the isIndexed check):

[pid 1410222] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/1.jpg", {st_mode=S_IFREG|0644, st_size=1706545, ...}, 0) = 0
[pid 1410222] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/2.jpg", {st_mode=S_IFREG|0644, st_size=971489, ...}, 0) = 0
[pid 1410222] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/3.jpg", {st_mode=S_IFREG|0644, st_size=1722545, ...}, 0) = 0
[pid 1410222] newfstatat(AT_FDCWD, "/photoprism/originals/XXX/XXX/4.jpg", {st_mode=S_IFREG|0644, st_size=980666, ...}, 0) = 0

"Benchmarked" using around 33000 images stored on a HDD backed ZFS pool on a Ryzen 3900X, 64G RAM, SQLite/Cache/sidecar/... on a SSD backed ZFS pool

  • Before: around 850s for a scan of originals (not complete rescan)
  • After: around 9s for a scan of originals (not complete rescan)

Acceptance Criteria:

  • Features and enhancements are fully implemented so that they can be released at any time without additional work
  • Automated unit and/or acceptance tests have been added to ensure the changes work as expected and to reduce repetitive manual work
  • User interface changes are fully responsive and have been tested on all major browsers and various devices
  • Database-related changes are compatible with SQLite and MariaDB
  • Translations have been / will be updated (specify if needed)
  • Documentation has been / will be updated (specify if needed)
  • Contributor License Agreement (CLA) has been signed

@CLAassistant
Copy link

CLAassistant commented Sep 5, 2022

CLA assistant check
All committers have signed the CLA.

@ReneHollander
Copy link
Contributor Author

Contributor License Agreement (CLA) has been signed

Unfortunately I can't sign this and I need to run this through my companys legal team first. I might just file this as a bug and hope somebody else is able to move the 4 lines up by 4 lines... :)

ReneHollander added a commit to ReneHollander/photoprism that referenced this pull request Sep 6, 2022
indexed.

The performance improvement measured is several magnitudes. (See photoprism#2683
fore more information)

Specifically the following improvements are made:
- Move the IsRaw() check after checking if the file is already indexed.
  IsRaw() causes the file to be opened and the first 261 bytes to be
read to determine file type. This is obviously very expensive.
- godirwalk can inform us if the file currently processed is a symlink
  or not (which is gathered without extra stat syscalls). Use this
information to skip resolving the symlink to the absolute path (which is
necessary to get the stat info of the image file instead of the symlink
to it).
@ReneHollander
Copy link
Contributor Author

Updates this PR with some extra scan performance improvements. I got rid of all unnecessary syscalls and managed to get a 100x speedup scanning my library for new files.

Also still trying to get the CLA figured out...

@lastzero
Copy link
Member

lastzero commented Sep 15, 2022

Thank you very much! The preview build now includes the delayed RAW file format check (assuming this provides most of the performance gain), but not the "fileNameResolved" optimization. I first want to make sure this has no unintended side effects and is well understandable/maintainable (might change naming of variables or functions...).

@lastzero lastzero added please-test Ready for acceptance test enhancement Refactoring, improvement or maintenance task performance Performance Optimization labels Sep 15, 2022
@lastzero lastzero changed the title Move the skip check above the IsRaw check during index. Index: Delay RAW file format check to improve performance Sep 15, 2022
@lastzero
Copy link
Member

I'm very grateful you found this, because we noticed something was wrong but couldn't figure out what was causing the performance degradation. Overall, it was still pretty fast even with 100k files and I eventually wasn't sure if it was just imagination.

@ReneHollander
Copy link
Contributor Author

The preview build now includes the delayed RAW file format check (assuming this provides most of the performance gain)
This made indexing about 10x quicker. Thanks for including it already!

but not the "fileNameResolved" optimization. I first want to make sure this has no unintended side effects and is well understandable/maintainable (might change naming of variables or functions...).
This optimization is another 10x speedup if put on top of the IsRaw optimization.
Not my proudest work to get this working, but it does the job. If you have better ideas on how to implement this in a nicer way I am all ears :)

I updated the PR and signed the CLA.

…mlink.

godirwalk can inform us if the file currently processed is a symlink or not (which is gathered without extra stat syscalls).Using this information to skip resolving the symlink to the absolute path (which is necessary to get the stat info of the image file instead of the symlink to it) saves on a lot of syscalls. Resolve causes a Stat syscall for each level in the path, which is very expensive and slows down scanning.
@lastzero
Copy link
Member

I updated the PR and signed the CLA.

Great news! I had expected that this would not be possible due to Google's AGPL policy: https://opensource.google/documentation/reference/using/agpl-policy

Since we had hoped for some sort of collaboration with Google, PhotoPrism was initially licensed under Apache 2.0, which is much more permissive. However, no one seemed interested, although I talked to quite a few personal acquaintances at Google. That made it easy for our community to convince us to use AGPL instead.

When I have a little time, I'll move on with merging your additional symlink handling improvements! 👍

@lastzero lastzero added the in-progress Somebody is working on this label Sep 17, 2022
@lastzero lastzero self-assigned this Sep 17, 2022
@ReneHollander
Copy link
Contributor Author

Great news! I had expected that this would not be possible due to Google's AGPL policy: https://opensource.google/documentation/reference/using/agpl-policy

It's complicated... Because I am doing this as a personal project (no affiliation to the company in any way) I can put my contributions under whatever license I desire. (If a Googler is contributing to an open source project as part of their work the @google.com would be used as commit author email, which I am not doing here)

When I have a little time, I'll move on with merging your additional symlink handling improvements! +1
Thanks!

Cool, thanks!
FWIW: I did only a basic manual test of creating a symlink to a file and verifying the modify timestamps match (and not the modify timestamp of the symlink being indexed). The recursive stat syscalls to resolve the symlink also show up in strace for the symlinked file.

@lastzero
Copy link
Member

I'm going to merge this now! Really helpful. We should have users test well after integration in case there are unexpected side effects we didn't think of. The next stable release might take a bit longer anyway, as I just pushed more than 10,000 lines of code with completely reworked session management and ACLs.

@lastzero lastzero merged commit add115c into photoprism:develop Sep 28, 2022
@lastzero lastzero added merged Changes should be tested again after they have been integrated and removed in-progress Somebody is working on this labels Sep 28, 2022
lastzero added a commit that referenced this pull request Sep 28, 2022
Signed-off-by: Michael Mayer <michael@photoprism.app>
@lastzero lastzero added released Available in the stable release and removed please-test Ready for acceptance test labels Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Refactoring, improvement or maintenance task merged Changes should be tested again after they have been integrated performance Performance Optimization released Available in the stable release
Projects
Status: Release 🌈
Development

Successfully merging this pull request may close these issues.

3 participants