
LibraryScanner: Improve hashing of directory contents #2497

Merged
merged 7 commits from libraryscannerhash into mixxxdj:master on Feb 18, 2020

Conversation

@uklotzde (Contributor) commented Feb 11, 2020

  • Use a cryptographic SHA256 hash that can be calculated incrementally instead of concatenating all file path strings (requiring many intermediate dynamic memory allocations) and finally hashing the resulting string with qHash()
  • Use 64-bit integers (by mangling/truncating the SHA256 bytes) instead of 32-bit integers from qHash()
  • The hash code 0 is reserved and considered invalid. Tests verify the correct behavior.
  • We don't need a database migration. The stored hash values will be replaced with the next library rescan.

Remark: The new cache_key_t typedef (a primitive type) and utility functions could be reused for caching artwork images. The aoide branch already contains the code for hashing images. Unfortunately this migration will require discarding and recalculating the hashes of all artwork images. But we need to do it at some point if we want to get rid of the crippled 16-bit integer hashes that cause many hash collisions and wrong artwork display.
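
For illustration, a minimal sketch of the approach described above: hash the directory entries incrementally with SHA-256 and truncate the digest to a 64-bit cache key, with 0 reserved as invalid. This is not the actual PR code; the function name and the truncation/remapping details are assumptions based on the description.

#include <QCryptographicHash>
#include <QString>
#include <QStringList>
#include <QtGlobal>

// Assumed to mirror the new typedef; 0 is reserved as the invalid key.
typedef quint64 cache_key_t;
const cache_key_t kInvalidCacheKey = 0;

cache_key_t hashDirectoryEntries(const QStringList& filePaths) {
    // Feed each path into the hasher incrementally instead of building
    // one large concatenated string and hashing it with qHash().
    QCryptographicHash hasher(QCryptographicHash::Sha256);
    for (const QString& filePath : filePaths) {
        hasher.addData(filePath.toUtf8());
    }
    const QByteArray digest = hasher.result(); // 32 bytes
    // Truncate the digest to its first 8 bytes to form a 64-bit key.
    cache_key_t key = 0;
    for (int i = 0; i < 8; ++i) {
        key = (key << 8) | static_cast<quint8>(digest.at(i));
    }
    // Never return the reserved invalid value for valid input.
    return (key == kInvalidCacheKey) ? ~kInvalidCacheKey : key;
}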

Resolved review threads (outdated): src/library/scanner/libraryscanner.cpp, src/util/cache.cpp
@uklotzde (Contributor Author)

macOS SCons build failures are unrelated. We need to get rid of SCons asap!

@Holzhaus (Member)

#2498 should fix the SCons/OSX errors btw.

@Holzhaus (Member) left a comment

Code looks good, tests pass, library rescan didn't cause any issues. LGTM.

@uklotzde added this to the 2.3.0 milestone Feb 15, 2020
@@ -54,7 +56,7 @@ void RecursiveScanDirectoryTask::run() {
         if (currentFileInfo.isFile()) {
             const QString& fileName = currentFileInfo.fileName();
             if (supportedExtensionsRegex.indexIn(fileName) != -1) {
-                newHashStr.append(currentFile);
+                hasher.addData(currentFile.toUtf8());
Member

Did you do a performance check for toUtf8() compared to using the raw QString bytes?
I can imagine that the latter is faster.

Contributor Author

Raw bytes of QString are not platform-independent when considering endianness, but UTF-8 is. Does this matter?

@uklotzde (Contributor Author) commented Feb 16, 2020

There is no function to obtain the raw bytes of a QString. Using toLatin1() may lose characters, and toLocal8Bit() is platform-dependent while the database is portable. Please clarify which option you are considering here.

Member

It was just the idea to check whether something like
hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()))
is faster or not.
This way there is twice the amount of data to hash, but no UTF-8 conversion.

I don't care about endianness, because that would only apply if the DB is on an external HD accessed from the same OS running on different CPU architectures.
Thinking about it, an endianness-independent solution is not bad.
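
For illustration, a sketch of what that call might look like, assuming Qt 5's addData(const char*, int) overload. Note the explicit length argument: the raw UTF-16 byte stream contains embedded 0x00 bytes for ASCII characters, so it must not be treated as a NUL-terminated C string.

// Hypothetical variant: hash the raw UTF-16 representation without conversion.
hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()),
        currentFile.size() * sizeof(QChar));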

@uklotzde (Contributor Author) commented Feb 17, 2020

I don't think that the additional allocations really matter here and won't invest hours to figure out how to do a realistic performance test to prove this. The optimized solution would also produce non-portable data, in contrast to all other data in the database.

@Holzhaus (Member) commented Feb 18, 2020

> It was just the idea to check whether something like
> hasher.addData(reinterpret_cast<const char*>(currentFile.utf16()))
> is faster or not.
> This way there is twice the amount of data to hash, but no UTF-8 conversion.

Double the data should not be an issue. SHA-256 runtime does not increase 1:1 with the amount of input data. I'm on my phone, but you should see the same effect on desktop PCs:

$ openssl speed sha256
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes  16384 bytes
sha256          180714.92k   532643.80k  1088833.54k  1466255.70k  1633716.91k  1647209.13k

Contributor Author

The actual number of bytes that we feed into QCryptographicHash should not be an issue. But QString might need to allocate a temporary QByteArray for the UTF-8 representation, and the conversion itself also takes some time.

Yes, utf16() would be faster, but it is non-portable. If someone is able to prove that it is much faster and worth the drawback, please go ahead ;) I will not do it.

Member

Just out of curiosity I did the test. Here are the results:

10 x 30 characters
  6040 nsec hasher.addData(reinterpret_cast<const char*>(lorem.unicode()), lorem.size() * sizeof(QChar));
  8683 nsec hasher.addData(lorem.toUtf8());
 12836 nsec hasher.addData(lorem.toUtf8()); "ööööö.."

10 x 888 characters 
149882 nsec hasher.addData(reinterpret_cast<const char*>(lorem.unicode()), lorem.size() * sizeof(QChar));
 82302 nsec hasher.addData(lorem.toUtf8());
257479 nsec hasher.addData(lorem.toUtf8()); "ööööö.."

Conclusion:
As expected, there is only a performance gain for ASCII characters. Except in non-Latin character sets, two-byte UTF-8 characters are a minority.
For a typical file name of 30 characters, the unicode() version is way faster. If we append all strings first, the UTF-8 version catches up at some point.
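
Not the exact harness that produced the numbers above, but a rough sketch of how such a comparison could be set up with QElapsedTimer; names, inputs, and iteration counts are illustrative.

#include <QCryptographicHash>
#include <QDebug>
#include <QElapsedTimer>
#include <QString>

void compareHashingVariants(const QString& lorem, int iterations) {
    QElapsedTimer timer;

    timer.start();
    for (int i = 0; i < iterations; ++i) {
        QCryptographicHash hasher(QCryptographicHash::Sha256);
        // Variant 1: raw UTF-16 bytes, twice the data but no conversion.
        hasher.addData(reinterpret_cast<const char*>(lorem.unicode()),
                lorem.size() * sizeof(QChar));
        hasher.result();
    }
    qDebug() << "unicode():" << timer.nsecsElapsed() << "nsec";

    timer.restart();
    for (int i = 0; i < iterations; ++i) {
        QCryptographicHash hasher(QCryptographicHash::Sha256);
        // Variant 2: convert to UTF-8 first (temporary QByteArray allocation).
        hasher.addData(lorem.toUtf8());
        hasher.result();
    }
    qDebug() << "toUtf8(): " << timer.nsecsElapsed() << "nsec";
}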

Contributor Author

Those numbers also need to be put in relation to the file system access. I guess that the differences in overall performance are negligible. That's what I had in mind with "realistic performance test".

Member

Yes, of course. I was just interested in whether we have low-hanging fruit here.

@daschuer (Member)

LGTM, Thank you.

@daschuer merged commit 030e792 into mixxxdj:master Feb 18, 2020
@uklotzde deleted the libraryscannerhash branch February 21, 2020 10:33
@Holzhaus added this to Done in 2.3 release Mar 14, 2020