
RClone should support a cache of checksums for local files to avoid redundant work #949

Open
jediry opened this Issue Dec 13, 2016 · 16 comments


jediry commented Dec 13, 2016

I think a nice feature would be the ability to generate a cache of checksums for local files, which could be used to accelerate subsequent calls to rclone. This would make the following sequence of operations much faster, by avoiding redundant checksum computations on the local side:

rclone check --checksum cloud:/path /local/path
[user thinks about why the files might be different, and decides what to do]
rclone copy some files
[repeat as necessary]
rclone check --checksum cloud:/path /local/path
[see where we stand now]

I'm thinking that this might be accomplished with a single command-line argument, something like --local-file-checksum-cache=path/to/database/file. If the file is missing, it's created; either way, it's written out when rclone terminates. The checksum DB would cache the file's size and mod-time along with the checksum, and when the checksum is needed for a local file, the cached one would be used only if the size and mod-time still match.

Note that the cache file would potentially need to cache both MD5 and SHA1 checksums, and there should also be a way to disable reading the cache (e.g., if I believe my filesystem may have been damaged, and want to re-download pristine versions of any corrupted files, then I don't want to trust the local checksum cache).
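
For concreteness, here's a minimal sketch of what a cache entry and lookup might look like (the names checksumEntry and cachedChecksum are purely illustrative, and an in-memory map stands in for whatever on-disk format is chosen):

```go
package checksumcache

import (
	"os"
	"time"
)

// checksumEntry is what the cache would store per file (illustrative only).
type checksumEntry struct {
	Size    int64     // file size when the checksums were computed
	ModTime time.Time // mod-time when the checksums were computed
	MD5     string    // hex-encoded; may be empty if never computed
	SHA1    string    // hex-encoded; may be empty if never computed
}

// cachedChecksum returns the cached entry only if the file still matches the
// recorded size and mod-time; otherwise the caller falls back to recomputing.
func cachedChecksum(cache map[string]checksumEntry, path string) (checksumEntry, bool) {
	entry, ok := cache[path]
	if !ok {
		return checksumEntry{}, false
	}
	fi, err := os.Stat(path)
	if err != nil || fi.Size() != entry.Size || !fi.ModTime().Equal(entry.ModTime) {
		return checksumEntry{}, false // stale or unreadable: don't trust the cache
	}
	return entry, true
}
```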

See also related issues #725, #626 and #157.

monroe-74 commented Dec 13, 2016

This is a great idea, thank you for bringing this up.

there should also be a way to disable reading the cache

I would think that it would be sufficient for the user to refrain from using the command you suggested ("local-file-checksum-cache"). If the user refrains from using that command, then rclone will behave as it does currently. That is, it will recalculate all checksums, which is the same thing as "disable reading the cache," right?

then I don't want to trust the local checksum cache

In any situation where the user feels this way, then aside from not using the command, they are also free to trash the checksum file. Then use the command again, and the local checksum cache will be recreated.

Your basic design for how this works makes perfect sense, and it's probably all that's needed. Unless there's something I misunderstand about this part of your comment.

jediry (Author) commented Dec 13, 2016

Yes, you're correct...so long as the cache file is opt-in, no extra option is needed to disable it. I only mention this scenario in case the person who implements this chooses to instead make this an on-by-default, opt-out thing. The point is, there are cases where one doesn't want the cache.

monroe-74 commented Dec 13, 2016

I only mention this scenario in case the person who implements this chooses to instead make this an on-by-default, opt-out thing.

OK, thanks for explaining, I see what you mean. But even in that case, any user who is worried about a corrupted cache file is always free to trash it, and that would always force rclone to recreate it, right? Then again, if a user worries that the local file is always subject to corruption, remembering to delete it every time would be a major nuisance.

Anyway, I like your original suggestion of making this feature happen only when explicitly invoked. In my opinion, it should not be on-by-default. People upgrading to a new rclone version should not suddenly see a new cache file they might not expect or need.

jediry (Author) commented Dec 13, 2016

I agree that my original opt-in design seems best, as it avoids surprising users.

ncw (Owner) commented Dec 14, 2016

I'd propose implementing it like this (a rough sketch follows the lists below).

Use a simple Key/Value database - bolt looks like the leading contender here. Using an actual DB rather than a text file will mean we don't have to load the entire database into memory and we can update it as we go along.

Use the absolute value of the paths as a key - this will mean that you can use the cache for any local file accesses, not just ones from a specific directory.

As the values, we'd need to store:

  • modification time
  • sha1sum
  • md5sum

I'd serialize these into JSON probably.

We'd use this DB:

  • when reading a Hash we'd look in the DB first
  • when we've calculated a hash we'd stuff it in the DB

Potential issues

  • garbage collection - if you move a lot of files about there will be files in the DB which don't have a corresponding file - I haven't thought of a mechanism for removing these. It might not be a big problem though.
  • concurrent use - bolt (and most key/value databases) only allows one writer at a time to the db, so you'd only be able to run one rclone process at once.
  • is bolt available on all the platforms rclone builds for?
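
A rough sketch of how the bolt-backed cache could look (the bucket name, key layout and JSON value encoding below are assumptions for illustration, not a committed design):

```go
package hashcache

import (
	"encoding/json"
	"time"

	"github.com/boltdb/bolt"
)

var bucketName = []byte("hashes")

// entry is the JSON-serialized value stored per absolute path.
type entry struct {
	ModTime time.Time `json:"modtime"`
	SHA1    string    `json:"sha1"`
	MD5     string    `json:"md5"`
}

// put records freshly calculated hashes, updating the DB as we go along.
func put(db *bolt.DB, absPath string, e entry) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists(bucketName)
		if err != nil {
			return err
		}
		v, err := json.Marshal(e)
		if err != nil {
			return err
		}
		return b.Put([]byte(absPath), v)
	})
}

// get is consulted before hashing a local file; ok is false on a cache miss.
func get(db *bolt.DB, absPath string) (e entry, ok bool, err error) {
	err = db.View(func(tx *bolt.Tx) error {
		b := tx.Bucket(bucketName)
		if b == nil {
			return nil
		}
		v := b.Get([]byte(absPath))
		if v == nil {
			return nil
		}
		ok = true
		return json.Unmarshal(v, &e)
	})
	return e, ok, err
}
```

The DB itself would be opened once per run with bolt.Open(path, 0600, nil) and closed on exit.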

@ncw ncw added the enhancement label Dec 14, 2016

@ncw ncw added this to the v1.36 milestone Dec 14, 2016

monroe-74 commented Dec 14, 2016

Hi Nick, thank you for the work you are putting into this.

garbage collection

A suggestion: at least for now, just let the user worry about this. As a user, I'm well-aware of the existence of this file, because you won't create it unless I explicitly ask you to do so. It's easy enough for the user to pay attention to where it is, and how big/old/stale it's getting. Periodically, I should trash that file and let a new one be built, but the optimum schedule for that is very dependent on the specifics of my situation. For example, I might know my data well enough to be sure that it's rare for files to be moved or deleted.

Later on, if eventually there are many people using this feature, then you will be in a position to notice the presence (or absence) of many people reporting problems that are a result of garbage accumulating in the file. Then enhancements can be made, and the nature of the problems that are reported (if any) will be a helpful framework for designing those enhancements. Like you said, it might not be a big problem, ever, so no point worrying about it right now.

For now, how about you just put a statement like this in the documentation for this new command: "If you have many files that are moved or deleted, this checksum file will eventually contain obsolete information, so it could be a good idea to periodically throw this file away. Then you can let rclone build a new one which will contain only current information."

concurrent use … you'd only be able to use one rclone processes at once

rclone currently allows me to create multiple log files with unique names. Then I can run multiple jobs concurrently, and each log file will reflect only the specific job which owns that file. Could checksum files work the same way? Then as a user it's up to me to decide how many of these files should exist, and what they should be called, and where they should live.

The documentation could say something like this: "You can run multiple rclone instances concurrently, but if you are using checksum files, then each instance must have its own."

When a job starts, if I ask you to use a checksum file, you check and see if the file I specified is currently in use. If it's in use, you stop the job with an error. A helpful message would be something like this: "The checksum file you specified is currently in use."

is bolt available on all the platforms rclone builds for

My understanding is that "Bolt currently works on Windows, Mac OS X, and Linux." (https://godoc.org/github.com/boltdb/bolt) That should cover most rclone users, right? If you are on some other platform, then maybe you just have to live without this feature, at least for now?

jediry (Author) commented Dec 14, 2016

Yes, thanks for taking this on. Excited to see the improvement!

garbage collection

I agree that the user-visibility of this file makes it less of a problem. Still, I could imagine a user with many terabytes of data being unhappy with the "just delete your cache" solution.

If it does become a problem, two possible solutions (the first is sketched after this list):

  1. Compact-on-shutdown: when terminating rclone after a copy/sync/check, scan the DB for any entries for which you didn't look up (or generate) a checksum (for any that you did, you know these are still valid), and check for the existence and mod-time match of the corresponding file. Pro: very automatic, and should be quite cheap when (a) the DB is already well-compacted and (b) most/all the files in it were already examined by the preceding "rclone copy/sync/check" that just finished. Con: potentially significant overhead during shutdown with a large DB, if your "rclone copy/sync/check" operation only examined a small fraction of the whole collection. One way to mitigate this would be to restrict the pruning based on the directory prefix (e.g., if I "rclone" from /mnt/disk1/Media/Movies, then you'd only prune DB entries contained under "/mnt/disk1/Media/Movies"); I believe this minimizes the number of files that need to be examined during compaction.
  2. Manual compaction: add "rclone compactdb" (or similar) command, so that the user can compact the cache file following any major directory restructuring. Pro: no overhead in normal usage, can be limited to only running when the user wants it. Con: very manual.
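
To make the compaction idea concrete, here's an illustrative pruning pass, assuming a bolt bucket keyed by absolute path with a JSON value carrying the mod-time (as sketched earlier in the thread); the prefix argument models the directory restriction from option 1:

```go
package hashcache

import (
	"encoding/json"
	"os"
	"strings"
	"time"

	"github.com/boltdb/bolt"
)

type pruneEntry struct {
	ModTime time.Time `json:"modtime"`
}

// prune drops cache entries under prefix whose file is gone or has changed.
func prune(db *bolt.DB, prefix string) error {
	return db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("hashes"))
		if b == nil {
			return nil // nothing cached yet
		}
		c := b.Cursor()
		for k, v := c.First(); k != nil; k, v = c.Next() {
			path := string(k)
			if !strings.HasPrefix(path, prefix) {
				continue // outside the tree we just synced
			}
			var e pruneEntry
			if err := json.Unmarshal(v, &e); err != nil {
				_ = c.Delete() // unreadable entry: drop it
				continue
			}
			if fi, err := os.Stat(path); err != nil || !fi.ModTime().Equal(e.ModTime) {
				if err := c.Delete(); err != nil { // file missing or modified
					return err
				}
			}
		}
		return nil
	})
}
```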

concurrent use

So long as rclone detects this situation and aborts (rather than continuing and corrupting the DB), I think this is fine. I agree with @monroe-74 that this won't be a huge problem in practice...if I'm routinely processing different directories in parallel, I'll have separate cache DBs for those tasks. The only users likely to be significantly hampered by this are people trying to sync the same local directory into multiple cloud locations...and they can work around this by doing an "rclone check" to bring the DB (mostly) up-to-date, and then making copies of the DB for each "rclone sync" instance they plan to spawn.

sha1sum, md5sum

Is it fairly cheap to simultaneously compute the SHA1 and MD5 (vs. computing just one)? If it's a significant extra expense to compute both, you might consider making them "optional" and only computing them on demand. In my case, I'm using OneDrive exclusively, so computing the MD5 sum doesn't matter. I imagine other people's usage is similar. But if it's cheap to do both simultaneously, then it's probably best to do that, to avoid reading the same file twice to generate two different checksums.

Use the absolute value of the paths as a key

Depending on what you mean by "absolute value", I probably agree. Specifically, in addition to being a non-relative path, I think the path needs to be canonicalized to remove this-directory (./) and parent-directory (../) sequences, superfluous '/' sequences (e.g., /mnt///path -> /mnt/path), and also should resolve symbolic links.
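
For reference, the Go standard library covers most of that normalization; a hypothetical helper might look like this (note filepath.EvalSymlinks requires the path to exist):

```go
package hashcache

import "path/filepath"

// cacheKey normalizes a path for use as a cache key: absolute, cleaned
// (collapsing "./", "../" and "//" runs), with symbolic links resolved.
func cacheKey(path string) (string, error) {
	abs, err := filepath.Abs(path) // filepath.Abs also applies filepath.Clean
	if err != nil {
		return "", err
	}
	return filepath.EvalSymlinks(abs) // resolve symlinks in the absolute path
}
```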

monroe-74 commented Dec 14, 2016

Hi jediry, I just want to chime in and say that everything you said makes a lot of sense to me. One minor comment:

The only users likely to be significantly hampered by this are people trying to sync the same local directory into multiple cloud locations

If I were doing backups to multiple cloud locations, I think I would probably do them on some kind of rolling or alternating schedule, instead of making them simultaneous. I assume a rolling schedule would be a common practice, because then I have more options if I want to roll back to a previous state of my data? So I'm inclined to guess that the exact case you described would be rare. Anyway, you suggested a good way for the user to handle that situation.

jediry (Author) commented Dec 14, 2016

If I were doing backups to multiple cloud locations, I think I would probably do them on some kind of rolling or alternating schedule, instead of making them simultaneous.

Yeah, I would do that too. The difficulty is that it's hard to know when a backup task will finish, so unless you're certain that the duration of one backup will always be "short" compared to the gap between launching them, then you have to account for the possibility that the previous backup could still be running when the next scheduled "start" rolls around. Of course, it's possible to code around this, but it's more complicated.

monroe-74 commented Dec 14, 2016

Good point, I see what you mean. I was assuming no overlap, but you're right that this is not a safe assumption.

The new feature we're discussing might radically improve the speed of certain backups, so that might minimize the overlap issue you described. But it will still happen in some situations, so it's good that this problem is getting some thought ahead of time, and I like your suggestions.

ncw (Owner) commented Dec 16, 2016

@jediry @monroe-74 thanks for thinking about this.

I think the garbage collection would be solved for the time being with some extra docs. If necessary rclone could grow an extra command to help deal with it. The compact on shutdown is a reasonable idea, but even scanning a huge directory tree can take some time. We'll have to see how big the DBs become, but pessimistically say 1k per file, so for 1M files the DB becomes 1GB - and 100k stale files would then only account for about 100MB, which is not going to be much of a problem.

As for concurrent access, rclone could wait until the db was available rather than stop with an error. It could print a message every 60s while it was waiting. This would mean that backups would proceed in an orderly fashion. boltdb takes a lot of care to not allow concurrent access so you won't get corruption.

I could solve the concurrent access to the db problem like this if necessary (rough sketch below):

  • the file you pass in as the cache rclone only reads from - multiple readers are allowed if read only
  • rclone writes to a temporary db
  • when rclone has finished the sync it
    • waits for any other users of the main db to finish
    • opens the main db r/w
    • updates any changes

Using multiple DBs is a good idea. You could fill it up from one run then copy it and use it in another.
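
A rough sketch of that read-only-main-DB plus scratch-DB idea (illustrative only - the function names and merge-at-the-end flow are assumptions, and the caller would close its read-only handle before merging back):

```go
package hashcache

import (
	"time"

	"github.com/boltdb/bolt"
)

var hashBucket = []byte("hashes")

// openForRun opens the shared cache read-only (bolt allows several read-only
// openers at once) plus a private scratch DB that this run writes to.
func openForRun(mainPath, scratchPath string) (main, scratch *bolt.DB, err error) {
	main, err = bolt.Open(mainPath, 0600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return nil, nil, err
	}
	scratch, err = bolt.Open(scratchPath, 0600, nil)
	if err != nil {
		main.Close()
		return nil, nil, err
	}
	return main, scratch, nil
}

// mergeBack copies this run's new entries into the main DB once the sync has
// finished; the Timeout makes bolt wait for other writers rather than fail.
func mergeBack(mainPath string, scratch *bolt.DB) error {
	main, err := bolt.Open(mainPath, 0600, &bolt.Options{Timeout: 10 * time.Minute})
	if err != nil {
		return err
	}
	defer main.Close()
	return scratch.View(func(stx *bolt.Tx) error {
		sb := stx.Bucket(hashBucket)
		if sb == nil {
			return nil // this run produced no new hashes
		}
		return main.Update(func(mtx *bolt.Tx) error {
			mb, err := mtx.CreateBucketIfNotExists(hashBucket)
			if err != nil {
				return err
			}
			return sb.ForEach(func(k, v []byte) error {
				return mb.Put(k, v) // this run's entries win
			})
		})
	})
}
```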

Is it fairly cheap to simultaneously compute the SHA1 and MD5

The main cost is IO so it is pretty much free computing the SHA1 if you are already computing the MD5. That is what rclone does at the moment anyway.
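
As a generic illustration of why that holds (not rclone's actual code): a single read of the file can feed both digests at once, so the second hash costs only CPU, not a second pass of IO.

```go
package hashcache

import (
	"crypto/md5"
	"crypto/sha1"
	"encoding/hex"
	"io"
	"os"
)

// hashBoth reads the file once and feeds both digests via io.MultiWriter.
func hashBoth(path string) (md5sum, sha1sum string, err error) {
	f, err := os.Open(path)
	if err != nil {
		return "", "", err
	}
	defer f.Close()

	m := md5.New()
	s := sha1.New()
	if _, err := io.Copy(io.MultiWriter(m, s), f); err != nil {
		return "", "", err
	}
	return hex.EncodeToString(m.Sum(nil)), hex.EncodeToString(s.Sum(nil)), nil
}
```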

Specifically, in addition to being a non-relative path, I think the path needs to be canonicalized to remove this-directory (./) and parent-directory (../) sequences, superfluous '/' sequences (e.g., /mnt///path -> /mnt/path), and also should resolve symbolic links.

Absolutely (pun intended ;-)

ncw (Owner) commented Dec 16, 2016

and also should resolve symbolic links.

rclone ignores symbolic links at the moment - that needs a little bit of extra thought...

monroe-74 commented Dec 16, 2016

I think the garbage collection would be solved for the time being with some extra docs.

Sounds good to me.

As for concurrent access, rclone could wait until the db was available rather than stop with an error.

Good point. I suggested you just stop, but waiting instead would be much friendlier. In my situation it wouldn't matter, but I could picture another user being annoyed if you stop instead of waiting.

I could solve the concurrent access to the db problem like this if necessary

Deferring writes this way is a clever solution. However, speaking for myself, this is not needed, and I imagine this could be a noticeable chunk of extra work for you. So I suppose this could be a later enhancement, if many users adopt the new feature and express a need for this enhancement.

rclone ignores symbolic links at the moment

Speaking again strictly for myself, you ignoring symbolic links does not cause any trouble for me.

jediry (Author) commented Dec 16, 2016

Ignoring symlinks doesn't bother me either. My comment was based on the incorrect assumption that rclone traverses links.

lickdragon commented Jan 13, 2017

pinging this issue to become a participant and get updates on conversation (is there a better way?)

@ncw ncw modified the milestones: v1.36, v1.37 Feb 14, 2017

@ncw ncw modified the milestones: v1.37, v1.38 Jul 19, 2017

robbat2 commented Sep 17, 2017

Please also capture the inode number & size, to use for verification.

For a given filename, if ANY of mtime, size, (device_number, inode) are different, don't use it from the cache.
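
A sketch of what that stricter validation could look like on Unix (the type and helpers are hypothetical; Windows has no (device, inode) pair, so it would need a different identity check there):

```go
package hashcache

import (
	"os"
	"syscall"
	"time"
)

// fileIdentity holds the attributes to verify before trusting a cached checksum.
type fileIdentity struct {
	Size    int64
	ModTime time.Time
	Dev     uint64 // device number (Unix-specific)
	Ino     uint64 // inode number (Unix-specific)
}

// identity captures the current on-disk identity of path.
func identity(path string) (fileIdentity, error) {
	fi, err := os.Stat(path)
	if err != nil {
		return fileIdentity{}, err
	}
	st := fi.Sys().(*syscall.Stat_t) // Unix-only type assertion
	return fileIdentity{
		Size:    fi.Size(),
		ModTime: fi.ModTime(),
		Dev:     uint64(st.Dev),
		Ino:     uint64(st.Ino),
	}, nil
}

// usable reports whether the cached identity still matches the file on disk;
// if ANY field differs, the cached checksum must not be used.
func usable(cached fileIdentity, path string) bool {
	now, err := identity(path)
	if err != nil {
		return false
	}
	return now.Size == cached.Size &&
		now.ModTime.Equal(cached.ModTime) &&
		now.Dev == cached.Dev &&
		now.Ino == cached.Ino
}
```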

@ncw ncw removed this from the v1.38 milestone Sep 30, 2017

@ncw ncw added this to the v1.39 milestone Sep 30, 2017

@ncw ncw modified the milestones: v1.39, v1.40 Jan 11, 2018

@ncw ncw modified the milestones: v1.40, v1.41 Mar 19, 2018

@ncw ncw modified the milestones: v1.41, Soon Apr 21, 2018
