
Large files uploaded via crypt remote to B2 are missing SHA1SUMs #1767

Closed
tjanez opened this issue Oct 22, 2017 · 28 comments
Comments

@tjanez

tjanez commented Oct 22, 2017

What is your rclone version (eg output from rclone -V)

rclone v1.38
- os/arch: linux/amd64
- go version: go1.9

Which OS you are using and how many bits (eg Windows 7, 64 bit)

Fedora 26 (Workstation Edition), 64 bit

Which cloud storage system are you using? (eg Google Drive)

B2 Backblaze

The command you were trying to run (eg rclone copy /tmp remote:tmp)

An example of a large file uploaded via crypt remote to B2 that is missing its SHA1SUM:

$ rclone mkdir b2:crypted-missing-sha1sum-test
$ rclone config
Configure new crypted remote named crypted-missing-sha1sum-test:

Remote config
--------------------
[crypted-missing-sha1sum-test]
remote = b2:crypted-missing-sha1sum-test
filename_encryption = off
password = *** ENCRYPTED ***
password2 = *** ENCRYPTED ***
--------------------

$ ls -lh ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso 
-rw-rw-r--. 1 tadej tadej 1.5G Jul  6 00:34 /home/tadej/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso
$ rclone copy -v ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso crypted-missing-sha1sum-test:
2017/10/22 15:06:20 INFO  : Encrypted drive 'crypted-missing-sha1sum-test:': Modify window is 1ms
2017/10/22 15:06:20 INFO  : Encrypted drive 'crypted-missing-sha1sum-test:': Waiting for checks to finish
2017/10/22 15:06:20 INFO  : Encrypted drive 'crypted-missing-sha1sum-test:': Waiting for transfers to finish
2017/10/22 15:07:20 INFO  : 
Transferred:   383.938 MBytes (6.041 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            0
Elapsed time:      1m3.5s
Transferring:
 *     Fedora-Workstation-Live-x86_64-26-1.5.iso: 25% done, 8.578 MBytes/s, ETA: 2m9s

[... output trimmed... ]

2017/10/22 15:23:43 INFO  : Fedora-Workstation-Live-x86_64-26-1.5.iso: Copied (new)
2017/10/22 15:23:43 INFO  : 
Transferred:   1.456 GBytes (1.424 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            1
Elapsed time:    17m26.8s
$ rclone sha1sum b2:crypted-missing-sha1sum-test
                                          Fedora-Workstation-Live-x86_64-26-1.5.iso.bin
$ rclone sha1sum crypted-missing-sha1sum-test:
                                          Fedora-Workstation-Live-x86_64-26-1.5.iso

An example of the same large file uploaded directly to the B2 remote, which does have a SHA1SUM:

$ rclone mkdir b2:missing-sha1sum-test
$ ls -lh ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso 
-rw-rw-r--. 1 tadej tadej 1.5G Jul  6 00:34 /home/tadej/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso
$ rclone copy -v ~/Dist/Fedora-Workstation-Live-x86_64-26-1.5.iso b2:missing-sha1sum-test/
2017/10/22 13:37:22 INFO  : B2 bucket missing-sha1sum-test: Modify window is 1ms
2017/10/22 13:37:23 INFO  : B2 bucket missing-sha1sum-test: Waiting for checks to finish
2017/10/22 13:37:23 INFO  : B2 bucket missing-sha1sum-test: Waiting for transfers to finish
2017/10/22 13:37:37 INFO  : 
Transferred:   7.346 MBytes (467.081 kBytes/s)
Errors:                 0
Checks:                 0
Transferred:            0
Elapsed time:       16.1s
Transferring:
 *     Fedora-Workstation-Live-x86_64-26-1.5.iso:  0% done, 1.299 MBytes/s, ETA: 19m2s

[... output trimmed... ]

2017/10/22 13:54:56 INFO  : Fedora-Workstation-Live-x86_64-26-1.5.iso: Copied (new)
2017/10/22 13:54:56 INFO  : 
Transferred:   1.456 GBytes (1.413 MBytes/s)
Errors:                 0
Checks:                 0
Transferred:            1
Elapsed time:    17m35.3s

$ rclone sha1sum b2:missing-sha1sum-test
65880b7e61f995df0009ecf556d736552526d2e0  Fedora-Workstation-Live-x86_64-26-1.5.iso
@ncw
Member

ncw commented Oct 23, 2017

I'm not sure this is possible with the B2 API.

Storing an SHA1 on a large file requires us to store it in the file metadata, and we need to know it when we create the upload session, before we've read any data.

However, we can't know the SHA1 until the end of the session, once we've actually read all the file bytes, and unfortunately I don't think b2 has an API for changing the metadata.

Is that correct @breunigs or have I missed something?

@tjanez
Author

tjanez commented Oct 23, 2017

Storing an SHA1 on a large file requires us to store it in the file metadata, and we need to know it when we create the upload session, before we've read any data.

However, we can't know the SHA1 until the end of the session, once we've actually read all the file bytes, and unfortunately I don't think b2 has an API for changing the metadata.

@ncw, thanks for the explanation.

Can you elaborate a bit more on how the crypt remote comes into play?
To be more precise, how is the SHA1 computed when a large file is uploaded directly to the B2 remote, and how is it computed when a large file is uploaded via a crypt remote (to the B2 remote)?

@ncw
Member

ncw commented Oct 24, 2017

For a large file upload to b2 we

  1. query the source object to see if it has an SHA1
  2. if it does then put it in the metadata for the create upload session
  3. upload the file in chunks
  4. finalise the upload session

Now the trouble comes in step 1. For a local object, querying the SHA1 causes the file to be read and the SHA1 to be computed. This means we read the file again in step 3 which is unfortunate but not the end of the world.

However when uploading a crypted file, the object does not know its SHA1. To find it crypt would have to read the entire file, encrypt it and SHA1 it. The encryption would then be repeated in step 3. The encryption would have to use the same nonce (which will drive any cryptographers nuts!), and the crypt module would have to persist objects (which it doesn't at the moment).

So to fix this we would need to do one of

  • spool large files to disk
  • encrypt large files twice (and complicate the internals of crypt even more)
  • get b2 to add an update-the-metadata API call (would be useful for setting the mod time also)
  • get b2 to make the large file upload finalize call accept the total SHA1 for the metadata

I'll email my contact at b2 about the last two options, see what he says!
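For reference, here is a rough curl sketch of the native B2 calls behind steps 2 to 4 above (the API version, URLs, tokens, IDs and file names are placeholders, and the exact fields should be checked against the B2 docs):

# Start the large file. The only place the whole-file SHA1 can be recorded
# is the fileInfo metadata supplied here, as large_file_sha1.
curl "$API_URL/b2api/v2/b2_start_large_file" \
  -H "Authorization: $AUTH_TOKEN" \
  -d '{"bucketId": "'$BUCKET_ID'",
       "fileName": "example.iso.bin",
       "contentType": "application/octet-stream",
       "fileInfo": {"large_file_sha1": "'$WHOLE_FILE_SHA1'"}}'

# Upload each part to an upload URL obtained from b2_get_upload_part_url.
# Every part carries its own SHA1 header, which is why per-part corruption
# is still caught even when the whole-file SHA1 is missing.
curl "$UPLOAD_PART_URL" \
  -H "Authorization: $UPLOAD_AUTH_TOKEN" \
  -H "X-Bz-Part-Number: 1" \
  -H "X-Bz-Content-Sha1: $PART1_SHA1" \
  --data-binary @part1.bin

# Finish the session. This call only takes the per-part SHA1s, so the
# whole-file SHA1 cannot be added or corrected at this point.
curl "$API_URL/b2api/v2/b2_finish_large_file" \
  -H "Authorization: $AUTH_TOKEN" \
  -d '{"fileId": "'$FILE_ID'", "partSha1Array": ["'$PART1_SHA1'"]}'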

@tjanez
Author

tjanez commented Oct 24, 2017

For a large file upload to b2 we

  1. query the source object to see if it has an SHA1
  2. if it does then put it in the metadata for the create upload session
  3. upload the file in chunks
  4. finalise the upload session

Now the trouble comes in step 1. For a local object, querying the SHA1 causes the file to be read and the SHA1 to be computed. This means we read the file again in step 3 which is unfortunate but not the end of the world.

However when uploading a crypted file, the object does not know its SHA1. To find it crypt would have to read the entire file, encrypt it and SHA1 it. The encryption would then be repeated in step 3. The encryption would have to use the same nonce (which will drive any cryptographers nuts!), and the crypt module would have to persist objects (which it doesn't at the moment).

@ncw, thanks for this thorough explanation. I understand the difficulty of the problem now.

So to fix this we would need to do one of

  • spool large files to disk
  • encrypt large files twice (and complicate the internals of crypt even more)
  • get b2 to add an update-the-metadata API call (would be useful for setting the mod time also)
  • get b2 to make the large file upload finalize call accept the total SHA1 for the metadata

I'll email my contact at b2 about the last two options, see what he says!

Thanks, that's much appreciated!

Indeed, the last two options would make it easy to support this on rclone's side (and probably in other tools using B2's API).

@breunigs
Collaborator

Sorry for my late answer, but yes, I have the same understanding as you do @ncw. That being said, we do calculate the SHA1s for each part being uploaded, which offers at least some safeguard against corruption. This should also happen through crypt, even if it's not visible from the APIs (I believe).

@ncw
Member

ncw commented Nov 1, 2017

@breunigs thanks for the confirmation

I'll email my contact at b2 about the last two options, see what he says!

He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so cross fingers they will implement a fix. I've always found Backblaze very helpful in this regard.

@tjanez
Author

tjanez commented Nov 1, 2017

He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so cross fingers they will implement a fix. I've always found Backblaze very helpful in this regard.

@ncw, thanks for pushing this.

I concur, it's great to hear Backblaze is very helpful and willing to discuss extending their API to support such a use case. Fingers crossed!

@ya-isakov

@ncw, I've read from https://rclone.org/b2/ that "When using B2 with crypt files are encrypted into a temporary location and streamed from there. This is required to calculate the encrypted file’s checksum before beginning the upload."
So why do you need to encrypt twice to calculate the SHA1? Why can't the file be fully encrypted first in /tmp, with the SHA1 sum calculated in the process, and then split into chunks and uploaded?

@ncw
Member

ncw commented Dec 12, 2017

@ya-isakov those docs are for 1.39 - the docs for the latest beta don't include that; they say this instead:

Sources which don't support SHA1, in particular `crypt` will upload
large files without SHA1 checksums.  This may be fixed in the future
(see [#1767](https://github.com/ncw/rclone/issues/1767)).

@tjanez
Author

tjanez commented Jul 15, 2018

I'll email my contact at b2 about the last two options, see what he says!

He said that they are aware of the gap in the protocol and he is going to discuss it with the engineers, so cross fingers they will implement a fix. I've always found Backblaze very helpful in this regard.

If I'm reading B2's Large Files docs correctly, they still don't support setting the SHA1 checksum after uploading all the parts, only at the call to b2_start_large_file:

If the caller knows the SHA1 of the entire large file being uploaded, Backblaze recommends specifying the SHA1 in the fileInfo during the call to b2_start_large_file. Inside the fileInfo specify one of the keys as large_file_sha1 and for the value use a 40 byte hex string representing the SHA1.

@ncw, could you ping your B2 contact to see if there has been any internal progress on the issue?

@ncw
Member

ncw commented Jul 15, 2018

I asked him a few months ago and he said it was in progress, I think. Unfortunately it isn't as easy as you might think, as the whole of b2 is built on immutable objects...

@tjanez
Author

tjanez commented Jul 15, 2018

I asked him a few months ago and he said it was in progress, I think.

Great, nice to hear it's in progress.

Unfortunately it isn't as easy as you might think as the whole of b2 is built on immutable objects...

Yes, I can imagine...

@tjanez
Author

tjanez commented May 29, 2019

@ncw, following #3210, I saw that you've implemented SetModTime using server side copy:

This creates a new version using a server side copy while updating the metadata.

Could we also leverage server side copy to set the SHA1 checksum after uploading the whole large file via the crypt remote?

As far as I understand it, we could do something like:

  • perform upload as we do it now
  • then create a server-side copy by:
    • specifying the correct SHA in the call to b2_start_large_file
    • repeating b2_copy_part
    • finishing with b2_finish_large_file

@ncw
Member

ncw commented Jun 6, 2019

@ncw, following #3210, I saw that you've implemented SetModTime using server side copy:

This creates a new version using a server side copy while updating the metadata.

Could we also leverage server side copy to set the SHA1 checksum after uploading the whole large file via the crypt remote?

Yes this would be possible.

As far as I understand it, we could do something like:

  • perform upload as we do it now

  • then create a server-side copy by:

    • specifying the correct SHA in the call to b2_start_large_file
    • repeating b2_copy_part
    • finishing with b2_finish_large_file

Yes, though I'd just use b2_copy_file as it doesn't have any size limits.

I think there may be one disadvantage to doing this. Doing a server side copy creates another version which doubles the storage.

I suppose rclone could delete the source version once the copy is complete... Maybe rclone should be doing this in the SetModTime call too?
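A rough sketch of that copy-then-replace idea using the native API (the IDs, token, file name and the use of the REPLACE metadata directive are my assumptions for illustration, not rclone's actual implementation):

# Re-create the object server-side with the whole-file SHA1 now recorded in
# its fileInfo; metadataDirective REPLACE means the metadata given here is
# used instead of being copied from the source object.
curl "$API_URL/b2api/v2/b2_copy_file" \
  -H "Authorization: $AUTH_TOKEN" \
  -d '{"sourceFileId": "'$UPLOADED_FILE_ID'",
       "fileName": "example.iso.bin",
       "metadataDirective": "REPLACE",
       "contentType": "application/octet-stream",
       "fileInfo": {"large_file_sha1": "'$WHOLE_FILE_SHA1'"}}'

# Then delete the original version so the storage is not doubled.
curl "$API_URL/b2api/v2/b2_delete_file_version" \
  -H "Authorization: $AUTH_TOKEN" \
  -d '{"fileName": "example.iso.bin", "fileId": "'$UPLOADED_FILE_ID'"}'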

@ya-isakov

ya-isakov commented Jun 6, 2019

Maybe it would be better to use b2_copy_part, as it returns the SHA1 sum. Without the range option, it should probably return the checksum of the whole file. And I'm voting for deleting the old version.

@tjanez
Author

tjanez commented Jun 7, 2019

Yes this would be possible.

This is great news.

Yes, though I'd just use b2_copy_file as it doesn't have any size limits.

I couldn't find any size limits mentioned in b2_copy_part's documentation, but I haven't actually used this myself.

I think there may be one disadvantage to doing this. Doing a server side copy creates another version which doubles the storage.
I suppose rclone could delete the source version once the copy is complete...

Yes, I agree with @ya-isakov, I think the source version should be deleted afterwards.

Maybe rclone should be doing this in the SetModTime call too?

I haven't looked at the source code but if it only uses B2's copy API to set a different modification time, then it would make sense to delete the original afterwards.

@scruloose

What is the largest file size I can upload to B2 via crypt without encountering this issue?

@ncw
Member

ncw commented Nov 9, 2020

What is the largest file size I can upload to B2 via crypt without encountering this issue?

Up to the limit set by --b2-upload-cutoff; you can set this up to 5G, I think.
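For example, to keep files up to roughly 4G as single-part uploads (the cutoff value is just a sample, and the remote name is the one from the original report):

$ rclone copy --b2-upload-cutoff 4G somefile.iso crypted-missing-sha1sum-test: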

@domyd
Contributor

domyd commented Jan 28, 2021

I'm not quite clear on the practical impact of this issue. If I understand correctly, uploads are still checksummed and guaranteed to be correct, but corruption to the files at rest on B2, and when downloading, wouldn't be detected? But in that case they wouldn't decrypt (properly), so the user - and presumably crypt too - would notice anyway, no?

@ncw
Member

ncw commented Jan 28, 2021

I'm not quite clear on the practical impact of this issue. If I understand correctly, uploads are still checksummed and guaranteed to be correct,

Correct

but corruption to the files at rest on B2, and when downloading, wouldn't be detected?

Also correct

But in that case they wouldn't decrypt (properly), so the user - and presumably crypt too - would notice anyway, no?

Yes crypt will notice any corruption in files.

So the actual effect is probably quite small.

@ivandeex
Member

ivandeex commented Feb 9, 2021

@tjanez Does the problem reproduce with rclone 1.54?

@ivandeex
Member

Sorry for noise!

@ya-isakov

@domyd These server-stored checksums are great for checking against the local system. You could use them to find out whether any of your local files are damaged or changed, using cryptcheck.
@ncw Am I right that the sync command does not support comparing files by checksums if they're on a crypt remote?

@ncw
Member

ncw commented Mar 23, 2021

@ya-isakov

Am I right that sync command is not supporting comparing files by checksums, if they're on crypt remote?

You need to use rclone cryptcheck for that.
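For example, to verify local files against the crypt remote from the original report (the paths are only illustrative):

$ rclone cryptcheck ~/Dist crypted-missing-sha1sum-test: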

@Snackhole

@ncw Quick question: if the SHA1 is missing on large file uploads to B2, what would be the expected output of rclone cryptcheck for those large files? I've got 30GB+ files that show as identical locally and on the crypted remote. What has it checked to determine that, if not the SHA1 on the remote (which shouldn't exist according to the discussion in this issue)?

@ncw
Member

ncw commented Oct 10, 2022

Try rclone sha1sum on the underlying files to see whether they have sha1sums or not.

@Snackhole

Snackhole commented Oct 10, 2022

@ncw

Try rclone sha1sum on the underlying files to see whether they have sha1sums or not.

I deleted my previous response because I figured out I was confused about what you were actually asking me to do. To confirm, you wanted me to run rclone sha1sum on the encrypted files as found in B2, right? On the underlying remote that the crypt remote wraps?

If so, I checked with both large and small files. In the case of small files, the SHA1 is reported as found on B2's file browser. In the case of large files, the SHA1 is also reported as found on B2's file browser, but there are two values listed in their interface: one labeled SHA1 with a value of "none", and one labeled large_file_sha1, which is the value reported by rclone sha1sum.

So, does this mean that this issue is actually solved? Is the large_file_sha1 value present on the encrypted B2 file accurate, and used by cryptcheck to validate the file's integrity? I was under the impression from reading this issue and the docs that Rclone can't actually add a SHA1 to a large file uploaded to a crypted B2 remote, but it seems like that may not be true?


Edit: I have downloaded a large encrypted file from my B2 remote. When I use rclone sha1sum on the remote file, it reports exactly the same value as when used on the local download of the encrypted file, which is found on the large_file_sha1 field on the remote. It looks like this issue is just straight-up solved, unless I'm sorely mistaken. Does this also mean that cryptcheck is functioning exactly as intended with crypted B2 remotes, and I can trust it when it says the files are identical?


Edit 2: It occurs to me that I updated Rclone between uploading these files to the crypted B2 and running cryptcheck, since I wanted the use of the --combined flag with cryptcheck. That might be relevant? This comment suggests there was an older version of Rclone that actually did upload SHA1 values for large files to crypted B2 remotes. However, I think it's very unlikely I was using that older version, since I only started using Rclone at all in 2020, three years after you said this was outdated information; it was installed with curl https://rclone.org/install.sh | sudo bash, so presumably not a version from 2017.


Edit 3: Out of curiosity, I tracked down a large file that was uploaded after I updated Rclone to 1.59.2. It's about 30GB, and it does have large_file_sha1 defined, so updating the version between uploading and running cryptcheck does not seem to have been the cause of any of what I've observed.

@ncw
Member

ncw commented Oct 11, 2022

Looking at the code, I think this issue was fixed in 1648c1a which was first released in v1.52.0.

This allows large file uploads to b2 wrapped in crypt to have large_file_sha1 set, though only when the source is the local disk.

So I think this issue is fixed now and I will close it.
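With v1.52.0 or later, re-running part of the original reproduction should confirm this: the encrypted object in the underlying bucket should now report a checksum (taken from its large_file_sha1 file info) instead of a blank column:

$ rclone sha1sum b2:crypted-missing-sha1sum-test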

@ncw ncw closed this as completed Oct 11, 2022