
EOF issue with zfsbackup-go #59

Open
lhoward opened this issue Jan 26, 2019 · 24 comments

@lhoward

lhoward commented Jan 26, 2019

Let me preface this by saying that I am not a Go programmer, nor an expert on B2; just a humble punter trying to get zfsbackup-go working on my Solaris machine. I don't know enough to know whether this issue is in zfsbackup-go's usage of Blazer, or in Blazer itself.

cf. someone1/zfsbackup-go#17

ListObjects() in Writer.getLargeFile() could return EOF, which was causing zfsbackup-go to hang when starting a new upload. Treating EOF the same as len(objs) < 1 (i.e. restarting with w.Resume = false) seemed to fix it, but I'm not sure whether this is the correct fix.

diff --git a/b2/writer.go b/b2/writer.go
index 209cfce..bedf626 100644
--- a/b2/writer.go
+++ b/b2/writer.go
@@ -311,7 +311,12 @@ func (w *Writer) getLargeFile() (beLargeFileInterface, error) {
                cur := &Cursor{name: w.name}
                objs, _, err := w.o.b.ListObjects(w.ctx, 1, cur)
                if err != nil {
-                       return nil, err
+                       if err == io.EOF {
+                               w.Resume = false
+                               return w.getLargeFile()
+                       } else {
+                               return nil, err
+                       }
                }
                if len(objs) < 1 || objs[0].name != w.name {
                        w.Resume = false
@kurin
Owner

kurin commented Jan 26, 2019

This feels like an issue in Blazer but let me see if I can get a test case failing. This code is possibly more complicated than it ought to be 🙄

@kurin
Owner

kurin commented Jan 26, 2019

I suspect you've surfaced a bug where if Resume is true but there's nothing to resume it just blows up.

kurin closed this as completed in bfa7981 Jan 26, 2019
@kurin
Owner

kurin commented Jan 26, 2019

Thanks for catching this! Try with the latest commit, you should be good.

@someone1

Apologies for chiming in late - I added that bit since retries after a network failure would result in incomplete, open large file uploads. When I made the change I thought blazer would know to start a new file if there was nothing to resume (I may have just read the code wrong at the time!)

Anyway, thanks for all the debugging @lhoward and for the quick fix @kurin !

@lhoward
Author

lhoward commented Jan 26, 2019

This fix unfortunately gets it stuck in an endless loop with

2019/01/26 15:48:19 b2: Opening volume tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Entering tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Uploading tank/Git|20190126.zstream.gz.pgp.vol1
2019/01/26 15:48:19 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 15:48:19 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled

kurin reopened this Jan 26, 2019
@kurin
Owner

kurin commented Jan 26, 2019

Does this happen with any invocation? I have a zfs box around here somewhere I can try to test this on, but if you can reproduce it with the environment variable B2_LOG_LEVEL=2 set, you'll get a lot of debug lines that you could add to the bug (be sure to purge them of secret data).

@lhoward
Author

lhoward commented Jan 26, 2019

I wasn't exactly sure what data to consider secret, so for now I'll just paste in a summary – happy to send you the complete details privately if you mail me (lukeh at padl.com).

  • b2_list_unfinished_large_files (returns a fileId, nextFileId == null)
  • b2_list_parts (returns a bunch)
  • b2_get_upload_part_url (returns upload URLs for a bunch of fileIDs)

The key appears to be this ("chunks don't match"):

2019/01/26 18:21:44 error writing tank/Git|20190126.zstream.gz.pgp.vol1: resumable upload was requested, but chunks don't match
2019/01/26 18:21:44 << b2_get_upload_part_url (14) 200  {Cache-Control: max-age=0, no-cache, no-store; Content-Type: application/json;charset=utf-8; Content-Length: 376; Date: Sat, 26 Jan 2019 07:31:33 GMT} ({
  "authorizationToken": "[redacted]",
  "fileId": "some-fileid-redacted",
  "uploadUrl": "https://pod-000-1118-16.backblaze.com/b2api/v1/b2_upload_part/some-fileid-redacted/0055"
}
)
2019/01/26 18:21:44 >> b2_finish_large_file uri: https://api002.backblazeb2.com/b2api/v1/b2_finish_large_file headers: {X-Blazer-Request-Id: 17;User-Agent: blazer/0.5.1} args: ({"fileId":"some-fileid-redacted","partSha1Array":["67a935b099a061116bed27ba1ea8403fc3fe50e4","71dc3437d1dfa17cf36e1da1d9373fc875bd37ce","b5c1c32096584dce8837ee26c0d371d6a69b6bd5","826363c88715be8e4aa2190c91e0c0bfed4b7941","a36bdad6cc27bfa4e26a69a3501ad409a3e999f2","780c49397e0f2d64d7012eeeffe293308a2376c1","a6a79531a551b1ca2f0e9c0287da78d2fd456ed6","cb8ae6c70cd4f7b7e7593aa6c011c966a6224c69","fb3ddd9a9e2d32bb8c676025f43e3c8133a9fc5e","53ce7c6a8f6da7a9d6fdefdac4e95066b38a8cea"]})
2019/01/26 18:21:44 b2 backend: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled
2019/01/26 18:21:44 b2: Error while uploading volume tank/Git|20190126.zstream.gz.pgp.vol1 - context canceled

@kurin
Owner

kurin commented Jan 26, 2019

Oooh okay, so what's happening is that to resume the large file, the checksums of the already-uploaded chunks have to match the checksums of the corresponding chunks of the current file; if not, you're really uploading a different file.

The reason your checksums don't match, I'm pretty sure, is because the snapshot you're uploading is encrypted, and the encryption is of course resulting in a different data stream each time.

This library is designed so that clients who receive errors will by default simply retry forever; the only permanent failure should be if the process itself dies. @someone1 do you have examples of the kinds of failures that led you to set w.Resume = true?

I need to think about whether there's a better way to handle non-matching resumes. I think it might be a good idea to add an option, when there's a conflict, to remove the partially uploaded file and start from scratch.

@lhoward
Author

lhoward commented Jan 26, 2019

So setting resume to false should fix it? There is a resume command line option to zfsbackup BTW but I’m not setting that. And yes I am encrypting.

@lhoward
Author

lhoward commented Jan 26, 2019

Is this resuming within or across zfsbackup sessions, I wonder? In the latter case it seems like it should be disabled when encryption is on (and even with encryption off, unless the volumes are cached locally, an unencrypted backup may differ between runs?). Just speculating.

@kurin
Owner

kurin commented Jan 26, 2019

I'm not familiar with zfsbackup-go but it sounds to me like it's starting the upload from the beginning with w.Resume = true, which works when the stream is the same, but fails when the stream is encrypted, because the same plaintext becomes different ciphertext each time.

I'm wondering what's happening to cause zfsbackup-go to restart from the beginning, however, because that shouldn't be necessary. If the backup is interrupted (e.g. if the network drops), the client should just hang until service is restored.

@lhoward
Author

lhoward commented Jan 27, 2019

If in zfsbackup-go's B2Backend.Upload(), I set w.Resume to false, the backup completes. @someone1 are there any risks in making this change?

@someone1

@kurin - I think it was a mistake to add Resume=true as the default option in zfsbackup-go, precisely for the reason you described. I had a user complain that they saw multiple versions of files in their B2 buckets, some of which were not accessible, and assumed this was due to incomplete large file uploads. I don't actually see a way in blazer to cancel a large file upload, and my goal was to just resume or start a new upload.

I think it would make sense to provide the following in blazer:

  • If a large file upload fails for whatever reason and an option such as AbortOnFailure=true is set, then blazer should automatically cancel the large file upload.
  • Add an option as you described to clear out any partially uploaded files when starting an upload (this should obviously be mutually exclusive with the Resume option).

I'd love to have my code better tested against B2 - the bonfire package you have seems interesting in this regard - can it be used to set up a local B2 emulator to write tests against? If so, mind sharing a quick example of how?

@kurin
Owner

kurin commented Jan 29, 2019

I don't believe B2 has a way to automatically cancel large files.

The only permanent failure for the writer should be a cancelled context. Any other failure ought to be recoverable, and so wouldn't benefit from an abort-on-failure option, and when the context is cancelled we can't use it to clean up after ourselves.

It's possible to list unfinished files (with the b2.ListUnfinished() option) and then delete them. Thinking about it, I don't think I want to add an option that lets callers set Resume but also removes the existing large file on chunk mismatch. When a caller requests Resume, I think it's better for it to fail permanently than to silently remove data.
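
For reference, a minimal sketch of that manual cleanup, assuming blazer's iterator-style listing API (Bucket.List with the b2.ListUnfinished option and Object.Delete); the helper name is hypothetical:

package main

import (
	"context"

	"github.com/kurin/blazer/b2"
)

// clearUnfinished lists a bucket's unfinished large-file uploads and deletes
// each one. This is the manual cleanup described above; it assumes blazer's
// iterator-style List API, and the helper name is made up for this sketch.
func clearUnfinished(ctx context.Context, bucket *b2.Bucket) error {
	iter := bucket.List(ctx, b2.ListUnfinished())
	for iter.Next() {
		// Deleting an unfinished object discards its already-uploaded parts.
		if err := iter.Object().Delete(ctx); err != nil {
			return err
		}
	}
	return iter.Err()
}

A caller could run something like this before starting a fresh upload instead of setting Resume.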

Bonfire is an experimental reimplementation of the B2 server API that I haven't had time to work on in a while.

@someone1

B2 does not automatically cancel large files - but it does have a b2_cancel_large_file call, which is not exposed via blazer. The Writer interface already handles chunking a file, initiating a large file upload, and completing it for you; can it be updated to also cancel the upload if any failures occur (or expose a call that does)? In addition to a canceled context, wouldn't a permanent network failure or internal server errors from Backblaze eventually exhaust whatever retry/backoff attempts blazer makes, or does it operate in such a way that it will NEVER return an error? I know the GCS/S3 Go SDKs will eventually fail outside of a canceled context. Automatic cleanup would be great for situations where the power is pulled and large files are left behind, but that's a Backblaze feature, not blazer's (S3 uses lifecycle rules for this).

The AWS SDK for Go's s3manager will start a multipart upload and cancel it should it exhaust retries (network failures, cancelled context, etc.). This is helpful as a high-level API where the details of running and managing a multipart upload are hidden from the developer. I thought it'd make sense as an option to the blazer Writer.

The options I was suggesting are (naming aside):

  • CancelOnError - this would simply call b2_cancel_large_file if the Writer errors out
  • ClearUnfinished - this would check whether an unfinished upload exists for this object before uploading (similar to Resume) and, if so, cancel/delete it before uploading. This option is mutually exclusive with Resume (you can't have both)

If you think this is better left outside the Writer interface, I understand - I like the CancelOnError feature the S3 SDK provides, but I can work around it. The second option was suggested since backblaze doesn't have a way to automatically clear unfinished uploads after a time period. Is the proper way to cancel an unfinished large upload to b2, without knowledge of the large file session id, to list the unfinished object(s) and call Delete on them?

@kurin
Owner

kurin commented Jan 30, 2019

The B2 integration checklist specifies that some errors (including network errors) should be handled with exponentially backed-off retries. So in that sense blazer will retry forever even given network failures until a context is cancelled or the process exits.

It turns out some blazer clients would actually prefer just to receive and handle errors themselves, and I've kicked around some ideas for how to implement that (#55) but nothing's been done. It probably makes sense to add the ability to say "if NO progress has been made in <some amount of time>, then kill the upload".

At the very least I should expose the b2_cancel_large_file request. I can't think of a way to gracefully handle the dead-context situation; it might be best to make an unexported cancel that doesn't rely on a context at all.

@lhoward
Author

lhoward commented Jan 30, 2019

In the interim, backups with w.Resume = false are working for me. Thank you for the detailed analysis above, I'm certainly learning something!

@someone1

I agree there should be some way to call b2_cancel_large_file - looking forward to an implementation!

As for error handling/retries - given the b2 integration checklist, I think the current behavior is to spec and therefore fine (assuming it errors out on any 400-level error codes other than 408; it seems like everything else should be retried). I can cancel requests using the Context I pass in.

One thing might be worth adjusting though: instead of using time.Sleep here, to be more context-sensitive it should probably be:

select {
case <-w.ctx.Done():
	err = w.ctx.Err()
case <-time.After(sleep):
	// Remainder of code from 169-182
}
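
Expanded into a self-contained form, the suggested change might look roughly like this (the surrounding function and names are hypothetical; in blazer the real logic lives around the writer.go lines referenced above):

package main

import (
	"context"
	"time"
)

// waitOrCancel sketches a context-sensitive backoff wait: it returns early
// with the context's error if the caller cancels, and returns nil once the
// backoff interval has elapsed so the caller can retry.
func waitOrCancel(ctx context.Context, sleep time.Duration) error {
	select {
	case <-ctx.Done():
		return ctx.Err()
	case <-time.After(sleep):
		return nil
	}
}

The caller would then double sleep on each retry, in line with the exponentially backed-off retries the integration checklist calls for.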

@kurin
Owner

kurin commented Feb 1, 2019

Yeah that's probably a good idea.

@kurin
Owner

kurin commented Feb 3, 2019

So I've added a thing for this:

https://godoc.org/github.com/kurin/blazer/b2#WithCancelOnError

I'm not super happy about the interface though. The idea with ctxf is that you probably want a new context if the old one is cancelled, and probably with a short deadline on it so that your writer doesn't hang forever even after you cancel the writer's context. Play around with this and see how it fits your needs.
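
A rough usage sketch, assuming WithCancelOnError takes a context factory and an error callback as described above and is passed to NewWriter as a writer option (see the godoc link for the authoritative signature); the function and variable names here are hypothetical:

package main

import (
	"context"
	"io"
	"log"

	"github.com/kurin/blazer/b2"
)

// uploadWithCancel writes src to obj and asks blazer to cancel the large
// file on a permanent error. The exact option signature is assumed from the
// discussion above; treat this as illustrative rather than authoritative.
func uploadWithCancel(ctx context.Context, obj *b2.Object, src io.Reader) error {
	ctxf := func() context.Context {
		// A fresh context for the cleanup call, so b2_cancel_large_file can
		// still run after the upload's own context has been cancelled. In
		// real code you would probably attach a short deadline here.
		return context.Background()
	}
	errf := func(err error) {
		if err != nil {
			log.Printf("cancelling unfinished large file: %v", err)
		}
	}
	w := obj.NewWriter(ctx, b2.WithCancelOnError(ctxf, errf))
	if _, err := io.Copy(w, src); err != nil {
		w.Close()
		return err
	}
	return w.Close()
}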

kurin closed this as completed Mar 27, 2020
@someone1

Sorry to come back to this issue so late - I'm finally updating zfsbackup-go, and while reworking my backend code I noticed how the Google Cloud Storage client handles all sorts of situations using a cancelled context, and I wonder if it would be a good fit here: https://pkg.go.dev/cloud.google.com/go/storage?tab=doc#Writer.CloseWithError (read the deprecation notice)

Basically, instead of using an option or anything of the sort, if the context.Context passed to NewWriter is cancelled, then the upload is cancelled. In the case of b2 it would call b2_cancel_large_file if it was a large file upload, but even if it was partway through uploading a small file, it would cancel it so the object is never created.

I'm using the new option you added, cancelling the context manually on error, and just opting not to call Close if I encounter any issues. If this isn't how I should be using the new options/client, please let me know!

kurin reopened this Apr 19, 2020
@kurin
Owner

kurin commented Apr 19, 2020

Automatically calling b2_cancel_large_file is fraught if, for example, the user doesn't want the large file cancelled, which is a reasonable thing to expect.

You should always call Close irrespective of whether you've cancelled the context or otherwise encountered an error.

@someone1

Hmmm - the problem I'm experiencing is that if I start an upload and there's an error while writing it to b2, calling Close will save the partially written file - my preference is for the file not to be written in this situation.

For the GCS client, I just cancel the context.Context I passed into NewWriter and the object is never written to the bucket. Internally, the library is listening for the context to be cancelled and aborts the upload, cleaning up as required.

My test is somewhat contrived, but I basically try to io.Copy into the Writer from a source that will always error on Read - instead of calling w.Close() in the error handling code as recommended, I just return, cancelling the context.Context, which seems to prevent the file from being written to b2. However, I can't tell if blazer is expecting this type of interaction and will abort the write as required if Close is never called.

Maybe there's an edge case here since io.Copy will never have started the upload, so when I call Close, blazer calls init and creates an empty file in b2?

@kurin
Owner

kurin commented May 1, 2020

I'd need to think about whether we can infer from a cancelled context whether the user does or does not want to clean up the underlying file. On the one hand it would suck to get partway through a huge file just to have a ctrl-c or something undo all your work, but on the other hand as you say we don't want to leave partials lying around.

Today, if you cancel the context and call Close(), the file should be left in an unwritten state; it can't complete simpleWriteFile() or finishLargeFile() with a cancelled context. So you will have a partially written large file, if you had completed any chunks, but you shouldn't have a "completed" large or small file. Are you seeing this behavior instead?
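
To make that interaction concrete, a minimal sketch (the helper name is hypothetical): cancel the writer's context when the source fails, but still call Close and check its error.

package main

import (
	"context"
	"io"

	"github.com/kurin/blazer/b2"
)

// copyToB2 cancels the writer's context on a source error but still calls
// Close, per the guidance above. With a cancelled context blazer can't
// complete the file, so no "completed" small or large file should appear.
func copyToB2(parent context.Context, obj *b2.Object, src io.Reader) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	w := obj.NewWriter(ctx)
	if _, err := io.Copy(w, src); err != nil {
		cancel()  // abort the upload; it can no longer be finished
		w.Close() // still close the writer to release its resources
		return err
	}
	return w.Close() // the object only becomes visible once Close returns nil
}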
