Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid calculating checksum if size differs (performance optimization opportunity) #49

Closed
tajacobsen opened this issue Jan 20, 2015 · 8 comments
Milestone

Comments

@tajacobsen
Copy link

Seems function "checksumDiffers" in types.go calculates the checksum of a local file even if it differs in size from the remote file. This seem unnecessary and increases resolving time if many files have been changed.

Edit: Enhancement

@odeke-em
Copy link
Owner

checksumDiffers just checks if the DifferMd5Checksum bit is set in a mask. However you do have a point that the precomputation of differences if the size is not the same, then naturally the checksum is different. Before the code to skip this check was in but it was a little messy and got replaced, just need to revert it. I'll jump on this when I get back home in about 9 hours and ping you up. Thank you for the suggestion.

@odeke-em odeke-em added this to the Release 6 milestone Jan 21, 2015
@odeke-em
Copy link
Owner

Thank you @tkjacobsen for bringing this up. It is modest of you to call this a minor enhancement, it is actually a major one if you consider that for the common case, mostly when sizes are equal the files are mostly the same. Please take a look at c281591

@tajacobsen
Copy link
Author

Well the file size difference check itself is minor - the ignore-checksum is not :).

Comment to the change:
The point about setting the checksum mask if size differs seems alright.

However, when setting the ignoreChecksum flag introduces one (possibly unintended) side effect: If a change is just a modTime difference, the DifferentMd5Checksum mask will not be set causing a re-download of a potentially identical file. Not sure what right solution would be. Potential solution might be to do a checksum only if date differes (when the flag is set).

@odeke-em
Copy link
Owner

I see. However, ModTime changes on their own should cause just the receipient's time to be modified, no need for re download/upload of a file. Also if DifferMd5Checksum is not set, this cuts out directly the need to even think about downloading the file.

@tajacobsen
Copy link
Author

Good point. This would only cause erroneous updates if a file changes but keeps same size. In this case modTime will be updated, but if checksum is not calculated, file will not be updated.

My point is (I think) that if modTime has changed (even if file size is the same) there is an increased likelihood that the file content has changed as well (see use case below). (If neither modTime nor size changed, there is a little chance that content - and thus checksum - changed -- unless someone actively updated the file and went back to set modTime to original value).

I would therefore suggest one of the following:

  1. calculate checksum even if -ignore-checksum is set if modTime is set
  2. create another flag -lazy-checksum that does above

Use case:

  • I have a google spreadsheet that fetches stock prices when opened. All data in the sheet is static except the price. As the price is stored in a "double" (or equivalent) the file size will not change upon updating the file - only the modTime (and checksum).

@odeke-em
Copy link
Owner

Good point. This would only cause erroneous updates if a file changes but keeps same size. In this case modTime will be updated, but if checksum is not calculated, file will not be updated.

In deed, however the point of -ignore-checksum is that a user is explicitly making a risk for example as highlighted in the use case that prompted that flag where computing the checksum might be quite expensive take a look at issue #47 .

My point is (I think) that if modTime has changed (even if file size is the same) there is an increased likelihood that the file content has changed as well (see use case below). (If neither modTime nor size changed, there is a little chance that content - and thus checksum - changed -- unless someone actively updated the file and went back to set modTime to original value).

I can give you two cases for which this is may not be true:

  1. A user runs drive touch [paths ...].
  2. A user while going through their drive accidentally renames a file, then reverts the name because it might clash with another file in the path. I see a whole lot of folks doing this for photos.
I would therefore suggest one of the following:
1) calculate checksum even if -ignore-checksum is set if modTime is set
2) create another flag -lazy-checksum that does above

Hmm, see suggestion 1 is a bit misleading because -ignore-checksum is meant to handle the case where the user does not need to perform full checksum verification, if you want checksum verification don't use -ignore-checksum. Suggestion 2 introduces a bit of ambiguity in the flag name itself.

Use case:

I have a google spreadsheet that fetches stock prices when opened. All data in the sheet is static except the price. As the price is stored in a "double" (or equivalent) the file size will not change upon updating the file - only the modTime (and checksum).

Unfortunately this use case cannot work for drive because Google Docs + Spreadsheets cannot be downloaded raw but only exported, actually even when you pull the file it will just touch a file locally for the sake of consistency later on when you are uploading.

@tajacobsen
Copy link
Author

yea, probably not worth it to introduce this complexity and ambiguity for such a corner case.

Anyway, thanks a bunch for the original patch! It works really well on on pull requests containing large files (esp on a low powered device as my Raspberry Pi):

[18:01] user@alarmpi Large Files $ time drive pull && time drive pull -ignore-checksum
Resolving...
md5Checksum: `[snip].zip` (407.53MB)
might take time to checksum.
md5Checksum: `[snip].zip` (462.36MB)
might take time to checksum.
Everything is up-to-date.

real    4m2.485s
user    1m39.450s
sys     1m55.420s

Resolving...
Everything is up-to-date.

real    0m4.350s
user    0m1.150s
sys     0m0.090s

I'll make sure to learn GO and send a working patch next time :)

@odeke-em
Copy link
Owner

Good to know that this improves things. Thank you for the great suggestion [ keep 'em honest ;) ] as well as for the thought experiment. For sure, your suggestions, feedback and ideas are always welcome :)

@odeke-em odeke-em changed the title Avoid calculating checksum if size differs (minor performance optimization opportunity) Avoid calculating checksum if size differs (performance optimization opportunity) Apr 26, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants