checksumming by default is too aggressive #117

Closed
l3iggs opened this issue Mar 22, 2015 · 24 comments

Comments

@l3iggs

l3iggs commented Mar 22, 2015

It seems that currently whenever a push or pull action encounters local files having the same name as those in the cloud, a checksum (MD5) is calculated over the local copy and compared to that stored for the file on Google's server to decide if the files are actually identical. This is certainly the safest thing to do, but it's my opinion that the default mode of operation of calculating an MD5 over every single identical file every time the user calls a push or pull is unacceptable performance wise.

I'd like to suggest that drive takes a similar approach to how rsync works and turn checksum comparison off by default. Correct me if I'm wrong, but I think rsync looks at file names, file sizes and modified times (by default) to decide if two files are identical which is probably enough for 99.999% of the time and is extremely fast. Only when the user explicitly asks for a checksum comparison (by using the -c or --checksum option) does rsync compare files by calculating checksums.
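To make that concrete, here's a minimal Go sketch (hypothetical names, not drive's actual code) of the quick check I have in mind: trust size plus modification time, and only compute an MD5 when the caller explicitly asks for it.

    // quickcheck is a sketch of rsync-style change detection: size and
    // modification time decide by default, MD5 only on request.
    package quickcheck

    import (
        "crypto/md5"
        "fmt"
        "io"
        "os"
        "time"
    )

    // remoteFile holds the metadata the server already reports for free.
    type remoteFile struct {
        Size    int64
        ModTime time.Time
        Md5     string
    }

    func md5Of(path string) (string, error) {
        f, err := os.Open(path)
        if err != nil {
            return "", err
        }
        defer f.Close()
        h := md5.New()
        if _, err := io.Copy(h, f); err != nil {
            return "", err
        }
        return fmt.Sprintf("%x", h.Sum(nil)), nil
    }

    // unchanged reports whether the local file can be treated as identical to
    // the remote copy. The expensive md5Of call only happens when the user
    // passes forceChecksum (the equivalent of rsync's -c/--checksum).
    func unchanged(localPath string, remote remoteFile, forceChecksum bool) (bool, error) {
        fi, err := os.Stat(localPath)
        if err != nil {
            return false, err
        }
        if fi.Size() != remote.Size {
            return false, nil
        }
        if !forceChecksum {
            // Quick check: same size and same mtime is good enough.
            return fi.ModTime().Equal(remote.ModTime), nil
        }
        sum, err := md5Of(localPath)
        if err != nil {
            return false, err
        }
        return sum == remote.Md5, nil
    }

With something like that, a no-op push would cost a handful of stat calls instead of re-reading gigabytes from disk.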

@odeke-em
Owner

@l3iggs please take a look at the screen shot:
[screenshot: screen shot 2015-03-22 at 11:09 AM]

You could actually try testing whether checksum-ing is turned on for every file. As far as I know this only happens when the size differs, and only when the name is the same, which is what you are suggesting.

@l3iggs
Author

l3iggs commented Mar 22, 2015

If you intend the checksum to only be calculated when some special option is set by the user, there is a bug.

I'm suggesting in this issue that ignoreChecksum should be true by default (which is not the behavior I'm seeing with v0.1.5). What's its default value?

@odeke-em
Owner

What do you mean by this?

If you intend the checksum to only be calculated when some special option is set by the user, there is a bug.

My view is to checksum by default, because most users do not even invest in finding out the internals of how differences are detected. Given the number of issues reported for even the basics, like 'duplicate files present but I didn't read the manual', I would rather build for the common case and let the user turn checksumming off only when they want to.

@kcwu

kcwu commented Mar 22, 2015

IIUC, the comment is incorrect

                // Only compute the checksum if the size differs
                if sizeDiffers(difference) || md5Checksum(src) != md5Checksum(dest) {

It actually computes the checksum when the sizes are equal.
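A self-contained sketch (hypothetical helpers, not the actual drive source) of the evaluation order I mean:

    // differ sketches the short-circuit above: when the sizes differ the
    // right-hand side is never evaluated, so checksums are only computed
    // for files whose sizes are equal.
    package differ

    type pair struct {
        srcSize, destSize int64
        srcMd5, destMd5   func() string // lazily computed checksums
    }

    func sizeDiffers(p pair) bool { return p.srcSize != p.destSize }

    // modified mirrors the condition in the snippet: a size mismatch alone is
    // enough; otherwise fall through to the (expensive) MD5 comparison.
    func modified(p pair) bool {
        return sizeDiffers(p) || p.srcMd5() != p.destMd5()
    }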

@odeke-em
Owner

It is actually a legacy comment (which is why I am not a fan of commenting), but that isn't the point. The code speaks for itself, as requested by @l3iggs: the checksum is only computed if the size is the same, which should be the case.

@kcwu

kcwu commented Mar 22, 2015

I think the question is: if mod time and size are both unchanged, should we still checksum? rsync and many sync tools by default trust the last modified time and treat those files as identical.

@odeke-em
Owner

I am up for getting it in for the common case. However, I want to be sure that file integrity is something users can count on drive for by default and turn off on their own. In fact, checksum-ing originally was not in: changes in size and last modified date flagged modifications.

P.S: Can I ping y'all back in about 4 hours? -- I barely slept.

@l3iggs
Author

l3iggs commented Mar 22, 2015

Actually, I'm advocating here that a checksum should never be calculated unless the user explicitly asks for it with a command line switch. Comparing files with mod time, file name and file size is enough to catch 99.999% of all changes. It's incredibly fast and it's how tools like rsync are designed (and it's hard to argue against the robustness and user-friendliness of rsync). Try and think of a scenario in real life when mod time, file size and file name wouldn't catch a legitimate change. Pretty much the only scenario I can think of involves an evil genius.

The bug I'm seeing in my day-to-day use of v0.1.5 is that I'm staring at MD5s being calculated over all of my many GBs when I know my local and server files are identical.

Here is a very simple test illustrating the undesirable behavior:

$ dd if=/dev/urandom of=random.bin bs=10M count=200
200+0 records in
200+0 records out
2097152000 bytes (2.1 GB) copied, 246.178 s, 8.5 MB/s
$ drive push random.bin
Resolving...
+ /random.bin
Addition count 1 src: 1.95GB
Proceed with the changes? [Y/n]: y
1 / 1 [===========] 100.00 % 1m15s
$ drive push random.bin # now I'll push the same exact file again
Resolving...
md5Checksum: `random.bin` (1.95GB)
might take time to checksum.
Everything is up-to-date.

(lol, notice how it takes me ~3.3 times longer to create a random 2GB file than it does to upload it!)

@l3iggs
Author

l3iggs commented Mar 22, 2015

My view is to checksum by default, because most users do not even invest in finding out the internals of how differences are detected. Given the number of issues reported for even the basics, like 'duplicate files present but I didn't read the manual', I would rather build for the common case and let the user turn checksumming off only when they want to.

@odeke-em OK, well consider this issue a vote in opposition to liberal checksumming and feel free to close it if you like the current design. I just wanted to raise it because I thought checksums were being calculated when they shouldn't be. I'm genuinely intrigued to learn more about users who have found that the mod time/file size/file name trio is not enough to flag their files as being different.

@odeke-em
Owner

I am back up.
Thank you for the explanation. I think the voice of more than one user, as well as good advocacy, is enough for me to reconsider. rsync and other tools have been around for far longer than drive, and since, as you say, they cover the common case for the user, my case fades away to a rare one. The folks covered by my advocacy are those who have spreadsheets lying around, say of stock market values. Thinking about it now, there is no reason to build for the strict cases by default. In fact, @l3iggs' request for drive stat to include the checksum covers the case where they need to verify that the checksum is the same.
Thus you now have my vote. Checksum-ing is going to become optional, so drive will trust size and last modified date.
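Roughly, what I have in mind (a sketch only, reusing the ignoreChecksum name mentioned above, not the final code):

    // differ sketches checksum-ing as an opt-in: with ignoreChecksum set (the
    // proposed default), size and last modified date decide on their own.
    package differ

    import "time"

    type fileMeta struct {
        Size    int64
        ModTime time.Time
        Md5     func() string // computed lazily, may be expensive
    }

    func changed(src, dest fileMeta, ignoreChecksum bool) bool {
        if src.Size != dest.Size {
            return true
        }
        if !src.ModTime.Equal(dest.ModTime) {
            return true
        }
        if ignoreChecksum {
            return false // trust size + mtime, skip the MD5 entirely
        }
        return src.Md5() != dest.Md5()
    }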

@kcwu

kcwu commented Mar 23, 2015

BTW, do you know what the precision of the Google Drive API's last modified time is?

@odeke-em
Owner

To the millisecond.
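Which means local mtimes (often nanosecond precision) would need rounding before comparison. A small sketch of that, as an assumption about how the comparison could be done rather than what drive currently does:

    // sameModTime compares a local mtime against Drive's millisecond-precision
    // timestamp by truncating both sides to the millisecond first.
    package differ

    import "time"

    func sameModTime(local, remote time.Time) bool {
        return local.Truncate(time.Millisecond).Equal(remote.Truncate(time.Millisecond))
    }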

@l3iggs
Author

l3iggs commented Mar 23, 2015

Can you ask the API what the current time is?

@odeke-em
Owner

Off the top of my head, not directly, at least when using the Google Drive API. The reason (that I can think of) is that the data might be sharded and stored in various locations. However, after touching a file you get back the time the server registered that touch, in UTC. So not directly, but you can get the registered time after touching a file. I could be wrong.

@l3iggs
Author

l3iggs commented Mar 23, 2015

The reason I ask is that I think it would be wise to add a check that the local system time is set correctly whenever a push or pull occurs; and probably abort the push/pull if the local system time is wrong unless the user adds some --i-know-what-im-doing flag.

@odeke-em
Owner

Why would the local system time matter? The only time I could think of such a need is for example if you are doing AWS S3 multipart uploads and you need to synchronize your NTP clock time. For Google Drive, your UTC time offset gets resolved and matched up with the one on the Google Drive server.

@l3iggs
Author

l3iggs commented Mar 23, 2015

When you do the mod time comparison between what Google has and what you have locally, the local file's mod time comes from the system time, no?

@odeke-em
Owner

Well, the modTime depends on what the file system returns (the inode's modTime), not directly on the current time.

@l3iggs
Author

l3iggs commented Mar 23, 2015

Right, but if drive can detect that the local system time is misconfigured, that's evidence that all the local files have an incorrect mod time, in which case drive should probably not attempt to sync these files because it's too dangerous. Does that make sense?

@odeke-em
Owner

The farthest I could imagine getting involved with clock checks is if the times are in the future, but I think that is up to the user. What if the machine they got the files off, say a Docker instance, has clock skew? What happens when they move those files to another file system? What if someone uploads files right after daylight savings (before clocks have synced, which happened to me this month) for files they created previously? Flagging every file as having an incorrect mod time seems too broad to me right now, at least because I do not fully understand the problem, nor do I see why a file-system-dependent detail should make updating just the modTime dangerous. Also, if only the modTime differs, only the modTime gets updated, not the entire file.

@kcwu

kcwu commented Mar 23, 2015

+1. I feel clock offset is not an issue since we only care whether the timestamps are equal or not (and they have enough precision). We don't care whether the clock is in the future or in the past.

@l3iggs
Author

l3iggs commented Mar 23, 2015

Cool. This discussion is off topic here anyway. I'll keep thinking about the problems that could arise from a misconfigured system time and open a new issue if I feel like it's needed.

odeke-em pushed a commit that referenced this issue Mar 27, 2015
This is because checksum verification by default was deemed too
vigorous and unnecessary. This discussion stems from issue #117.
@odeke-em
Owner

Please take a look at PR #128

@odeke-em
Owner

odeke-em commented May 2, 2015

Addressed by #128. Please re-open if it persists.

@odeke-em odeke-em closed this as completed May 2, 2015