
Take file mtimes into account when calculating freshness hashes during scanning. #6187

Open
mixxxbot opened this issue Aug 22, 2022 · 14 comments

Comments

@mixxxbot
Collaborator

Reported by: hile
Date: 2011-12-17T11:44:50Z
Status: Triaged
Importance: Low
Launchpad Issue: lp905669
Tags: easy, weekend
Attachments: 905669-3.patch, [Update the status of the directory and the tracks inside a transaction](https://bugs.launchpad.net/bugs/905669/+attachment/2920472/+files/Update the status of the directory and the tracks inside a transaction)


I noticed that a library rescan takes quite a long time even for files which have not been modified.

I did not check the code, but I did check the database schema and couldn't find anywhere the file modification timestamp is stored. I would suggest modifying the schema and the default file scanner to store it in the database, and only checking files whose st_mtime is more recent on Linux/OS X (I don't remember how this works on Windows!).

Just remember the common gotcha: on Unix, modifying a file in a directory does NOT update the directory's mtime unless a filename changes, so you need to check file mtimes, not album directory mtimes.

@mixxxbot
Collaborator Author

Commented by: daschuer
Date: 2012-01-04T06:13:04Z


This could be achieved as mentioned in bug #898487, together with bug #893242:
"Add configuration path command-line option" + "(Relative) Library Path"

@mixxxbot
Collaborator Author

Commented by: daschuer
Date: 2012-01-04T06:21:23Z


Please ignore #1; this is a comment for bug #905669. Sorry!

@mixxxbot
Collaborator Author

Commented by: rryan
Date: 2012-01-07T08:15:06Z


Cool -- I thought we did use mtimes, but it turns out we don't! This is some low-hanging fruit. All that has to be done is to update the folder-hash calculation in the recursiveScan() method in src/library/libraryscanner.cpp. We decide whether to rescan a directory by concatenating all the filenames of the files in it, hashing that string, and comparing the result against the previously stored hash. If you just append each file's mtime to its filename, that would be enough to cause the directory to be rescanned.

@mixxxbot
Collaborator Author

Commented by: bencoder
Date: 2012-03-20T08:15:53Z
Attachments: 905669-3.patch


I believe the original bug was intended to store the mtime with the file info in the database (probably in the track_locations table), so that once we've determined via the hashing method that a directory has changed, we can rescan it more quickly by not loading and checking the ID3 tags of each track whose mtime is unchanged.

The fix suggested by Ryan in #3 is for a slightly different issue: we don't check the mtime of files when deciding whether to rescan a directory, so if a file changes (for example, its ID3 tag) but its filename doesn't, the directory won't be rescanned.

I have attached a patch which fixes this issue, but it doesn't fix the original issue of slow rescan times. Can someone confirm whether or not we want to fix that as well?

@mixxxbot
Collaborator Author

Commented by: bencoder
Date: 2012-03-20T09:05:53Z


I will also work on the original issue to speed up library rescan times.

@mixxxbot
Collaborator Author

Commented by: bencoder
Date: 2012-03-20T11:36:35Z


The process is roughly:

```
scan directories recursively:
    for each filename in directory:
        newHashStr += filename
    if hash(newHashStr) != hash_in_database(dir):
        for each filename in directory:
            if exists_in_db(filename):
                mark as verified
            else:
                add file to tracksToAdd list

add tracksToAdd list to the database
```

Simply adding the mtime to the directory hash doesn't cause the files to be updated: if a file exists in the database at all, it is considered unchanged and is not updated. So the patch I created in #4 won't do anything in this case, and changing that will likely require a bit of refactoring. Given that the tracks shouldn't actually be updated, I'm not sure what is causing the slowdown.

@mixxxbot
Collaborator Author

Commented by: hile
Date: 2012-03-20T13:14:47Z


My Python tool 'musa-db' does something like this, and it's quite fast:

```
db_files = 'SELECT path, mtime FROM songs'
for each directory in library:
    dirfiles = listdir()
    for each file in dirfiles:
        if file not in db_files:
            append(file)
        else if file.mtime != db_song.mtime:
            db_reload_tags(file)
    db_dir_files = filter db_files where directory == directory
    for db_file in db_dir_files:
        if db_file not in dirfiles:
            db_mark_file_removed(db_file)
```

I keep no hashes; I haven't seen any use for them (though I might add a SHA for each file to the DB).

Here are the timings for a normal run of this process against an 80,000-song m4a database, after not having run it for a while:

```
time musa-db --cleanup --update
real    1m21.403s
user    0m32.343s
sys     0m16.223s
```

And a second run, after the inodes are in the fs cache (no changes to the files):

```
time musa-db --cleanup --update
real    0m45.790s
user    0m29.514s
sys     0m12.966s
```

It still takes considerable time, but it is looking at about 7,000 folders after all, so I'm OK with the numbers.


@mixxxbot
Collaborator Author

Commented by: daschuer
Date: 2012-03-20T18:35:19Z


Hi Ben,

If you like, you can test my uncommitted patches from Bug #239883 and Bug #801700.
I would be happy if you could work on top of them.

Thank you.

@mixxxbot
Collaborator Author

Commented by: maxime-bochon
Date: 2012-03-22T12:39:19Z


As a user, it would be nice to be able to choose the accuracy of the rescan algorithm; that way we could trade scanning speed against thoroughness depending on our needs.

I don't have a thorough idea of how the algorithm works, but I suppose some comparisons are done on the file name, size, and signature. The Library pane in the Preferences dialog could then provide options to tweak which level of information is used for these comparisons during a rescan.

@mixxxbot
Collaborator Author

Commented by: asantoni
Date: 2012-03-22T13:17:31Z


Hi guys,

Like Ben Clark said, modified files in directories don't get their metadata
reloaded upon rescan. This is by design. The reason the hashing only looks
at filenames is to detect three cases:

  1. New files added to a directory
  2. Files deleted from a directory (or moved)
  3. Files renamed in a directory

Again, we specifically did not want to detect modifications to files in
those directories because then you have to deal with metadata conflicts.
For example, if the user adds BPM to a song in the Mixxx sqlite db, and
then they add a different BPM externally to the ID3, we can't assume which
one the user actually wants as the BPM in Mixxx. Mixxx now has a "reload
metadata from disk" option (I think?) which is a reasonable way to solve
this for now.

Thanks,
Albert


@mixxxbot
Collaborator Author

Commented by: bencoder
Date: 2012-03-23T07:16:59Z
Attachments: [Update the status of the directory and the tracks inside a transaction](https://bugs.launchpad.net/mixxx/+bug/905669/+attachment/2920472/+files/Update the status of the directory and the tracks inside a transaction)


Yes, that is what I had determined: taking the modification time into account was not necessary, and in fact could only slow things down. However, since the original complaint was that library rescans are slower than they should be, I've been investigating where that can be improved.

The primary delay when rescanning with no changes is where we mark the existing directory and tracks as verified, in these two calls:

```
m_libraryHashDao.updateDirectoryStatus(dirPath, false, true);
m_trackDao.markTracksInDirectoryAsVerified(dirPath);
```

On my machine these calls take about 100 ms each. I believe this is primarily due to the disk writes, not to indexes or query preparation (I investigated both preparing the queries up front instead of on demand and adding an index on track_locations.directory; neither had much effect, and LibraryHashes.directory_path is a primary key, so it is already indexed).

A very simple way to significantly improve the speed is to wrap these two calls in a transaction, so that we do a single write instead of two. That sped up my own rescan by more than a third, from 95 s to 60 s. I've attached the small patch for this.
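The transaction idea can be illustrated in Python with SQLite (table and column names below are simplified stand-ins, not Mixxx's actual schema): wrapping both UPDATEs in one transaction means a single commit, and therefore one synchronous disk write instead of two.

```python
import sqlite3

def mark_directory_verified(conn, dir_path):
    """Flag a directory and all its tracks as verified in one commit."""
    # The sqlite3 connection context manager wraps the statements in a
    # single transaction (BEGIN ... COMMIT, or ROLLBACK on error), so
    # SQLite hits the disk once rather than once per statement.
    with conn:
        conn.execute(
            "UPDATE library_hashes SET needs_verification = 0 "
            "WHERE directory_path = ?", (dir_path,))
        conn.execute(
            "UPDATE track_locations SET needs_verification = 0 "
            "WHERE directory = ?", (dir_path,))
```

Since the cost of a commit is dominated by the fsync, batching per-directory writes like this scales the saving across every directory visited during a rescan.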

@mixxxbot
Collaborator Author

Commented by: bencoder
Date: 2012-03-28T09:34:19Z


Since the investigation and the patch I made have strayed quite far from the original suggestion, and we are not fixing the original issue of taking file modification times into account, I'm marking this as Incomplete and have opened a new Bug #966963 to track other ways of improving the rescan speed.

@mixxxbot
Collaborator Author

Commented by: rryan
Date: 2012-03-28T14:13:36Z


Thanks Ben -- "Incomplete" means "waiting for more information from the submitter", though, and the Launchpad janitor will auto-delete the bug after a month or two of inactivity. Let's leave it as Triaged.

@mixxxbot
Collaborator Author

Commented by: vedu
Date: 2013-05-05T09:04:10Z


So then nothing has to be done here?

@mixxxbot mixxxbot transferred this issue from another repository Aug 24, 2022