
Fix wharf's resumable upload client behavior + collect error traces in case of service degradation #71

Closed · quyse opened this issue Aug 17, 2016 · 91 comments · 9 participants
quyse (Member) commented Aug 17, 2016

I'm getting the following error while trying to push a new version of a file:

$ butler push linux quyse/flaw:model-editor-linux
> For channel `model-editor-linux`: last build is 3183, downloading its signature

> Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
*errors.errorString itch.io API error: uploaded file does not exist in storage (build/4513/13440-signature-default: 404)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)

What could this mean? I've recently upgraded the Docker container that pushes builds during CI, but I don't think I've changed anything about butler.

quyse (Member) commented Aug 17, 2016

Hmm, it looks like it's happening randomly. I tried again and was able to push the build successfully, first from the normal system and then from the same Docker container (typing the command manually). Then I tried one more time and got the same error. Here it is again with the -v argument for longer logs; the file has already been pushed before, so this should be a trivial, near-zero-sized patch:

[gitlab-runner@5565939828f6 flaw]$ butler -v push linux quyse/flaw:model-editor-linux
> For channel `model-editor-linux`: last build is 4515, downloading its signature
Created signature build file: {ID:13452 UploadURL:https://storage.googleapis.com/itchio/build/4518/13452-signature-default?upload_id=AEnB2Up52INajDXc6Su6gI1kzyfh0uEjsU_KvWdsteFu2xnTdgkG_DISVfEcF2H6ZupLtOd5rLFJrW2_7PVAr05a3ROIkUKoIw UploadParams:map[]}
Created patch build file: {ID:13453 UploadURL:https://storage.googleapis.com/itchio/build/4518/13453-patch-default?upload_id=AEnB2Up8mty9AnguponC-0fiULr8tPoCVZGUbSBwpBoVSOsjkuhcrdBE8OlAhIXxJztZHaQEsxxG_FMDmyJb_5PvJ10ZTDFnpQ UploadParams:map[]}
Launching patch & signature channels
Waiting for source container
Got sourceContainer!

> Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
Building diff context
|                   |   0.00%                                                  [upload 1] kicking off sender
[upload 1] sender blocking receive
[upload 0] kicking off sender
[upload 0] sender blocking receive
[upload 1] flushing buffered writer, 3044 written
[upload 0] flushing buffered writer, 95 written
[upload 1] closing pipe writer
[upload 1] closed pipe writer
[upload 0] closing pipe writer
[upload 1] everything closed! uploadedbytes = 0, totalbytes = 3044
>>>>>>>>>>> woo, got a done
[upload 0] scanner done
[upload 1] scanner done
[upload 0] closed pipe writer
[upload 0] everything closed! uploadedbytes = 0, totalbytes = 95
[upload 0] uploading chunk of 95 bytes
[upload 0] uploading 0-94, last? true
[upload 1] uploading chunk of 3044 bytes
[upload 1] uploading 0-3043, last? true
>>>>>>>>>>> woo, got a done
|####################| 100.00%       0 - almost there                          [upload 1] 3.0 kB uploaded, at 200 OK
[upload 1] sender done
[upload 1] done sent!
>>>>>>>>>>> woo, got a done
|####################| 100.00%       0 - almost there                          [upload 0] 95 B uploaded, at 308 
[upload 0] sender done
[upload 0] done sent!
>>>>>>>>>>> woo, got a done
*errors.errorString itch.io API error: uploaded file does not exist in storage (build/4518/13453-patch-default: 404)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)
leafo (Member) commented Aug 17, 2016

This means that the actual upload (coming from your computer) was never completed, but butler thinks it was and tried to continue to the next step. Looking at our storage bucket, I only see a successful upload for the signature file. The patch file is stuck in the "uploading" state.

Anything related to your internet by any chance that might cause it to interrupt the upload?

fasterthanlime (Member) commented Aug 17, 2016

I have no theory about where the error comes from atm — if it was a logic error in butler, it sounds like it would've popped up before (over 4500 builds pushed for many different games so far).

Maybe some kind of Google Cloud Storage problem? Those happen, and transparency about them isn't always ideal. I'd be happy if it were a butler bug, because that would mean we could solve the issue once and for all, but for now I don't know what would actually cause this.

quyse (Member) commented Aug 18, 2016

Anything related to your internet by any chance that might cause it to interrupt the upload?

Well, I don't know; my internet connection is usually very stable, since I use a VPN with a DigitalOcean machine as the exit node (so my crappy Russian ISP doesn't interfere with the traffic), and I don't see anything suspicious in other networked programs, e.g. the browser or the package manager. It's still happening consistently, both in the Docker container and on the host system.

GitLab pipeline link: https://gitlab.com/quyse/flaw/pipelines/3971019 - see the deploys named itch; not a single retry succeeded. Before yesterday, I hadn't seen a single butler failure in months.

I just got a different error on the host machine (and found the same error in yesterday's Docker logs too):

$ ~/.config/itch/bin/butler -v push linux quyse/flaw:model-editor-linux
• For channel `model-editor-linux`: last build is 4537, downloading its signature
Created signature build file: {ID:13514 UploadURL:https://storage.googleapis.com/itchio/build/4539/13514-signature-default?upload_id=AEnB2UorCYSC-Etm7BGT9QBx1w4y8-aE-wYql6YNR6n2Y8QUNLeFkxMXOKKGbBtcqxWuOBcQ0NLoxzlI8SemEP3n9jdcBa6OoA UploadParams:map[]}
Created patch build file: {ID:13515 UploadURL:https://storage.googleapis.com/itchio/build/4539/13515-patch-default?upload_id=AEnB2Uq9HzQDROTkWjfGMaf7wpCIXmoHv5OjO02QtK8CJkRj1GmcGaegwueHvnyX6DEqmDbY1dzrOUoNsefzsiJcmRH7Y-MZMg UploadParams:map[]}
Launching patch & signature channels
Waiting for source container
Got sourceContainer!

• Pushing 8.0 MB (1 files, 0 dirs, 0 symlinks)
Building diff context
▐                   ▌   0.00%         - network idle, 8.0 MB left              [upload 0] kicking off sender
[upload 1] kicking off sender
[upload 0] sender blocking receive
[upload 1] sender blocking receive
[upload 1] flushing buffered writer, 3044 written
[upload 1] closing pipe writer
[upload 0] flushing buffered writer, 95 written
[upload 0] closing pipe writer
[upload 0] closed pipe writer
[upload 0] everything closed! uploadedbytes = 0, totalbytes = 95
>>>>>>>>>>> woo, got a done
[upload 1] closed pipe writer
[upload 1] everything closed! uploadedbytes = 0, totalbytes = 3044
>>>>>>>>>>> woo, got a done
[upload 1] scanner done
[upload 1] uploading chunk of 3044 bytes
[upload 0] scanner done
[upload 1] uploading 0-3043, last? true
[upload 0] uploading chunk of 95 bytes
[upload 0] uploading 0-94, last? true
▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▌ 100.00%       0 - almost there                          [upload 1] 3.0 kB uploaded, at 308 
[upload 1] sender done
[upload 1] done sent!
>>>>>>>>>>> woo, got a done
▐▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▌ 100.00%       0 - almost there                          [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said 
[upload 0] send error, bailing out
[upload 0] got sub error: HTTP 503 while uploading, bailing
*errors.errorString HTTP 503 while uploading                                    
/home/amos/builds/630f4286/1/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:283 (0x6571cd)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x490431)

A 503 error can't be explained by a connectivity problem on my side, can it?

fasterthanlime (Member) commented Aug 18, 2016

Ah, 503 is definitely a response from the server, and it's something the docs tell us how to deal with: https://cloud.google.com/storage/docs/xml-api/resumable-upload#step_4wzxhzdk17query_title_for_the_upload_status

So I can indeed fix that (and afaict it is indeed GCS being unstable; probably some reverse proxy noticing that the backend isn't available).
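For reference, the status check those docs describe boils down to an empty PUT with a `Content-Range: bytes */<total>` header; the Range header of the resulting 308 then says how much has been committed. A minimal sketch in Go (placeholder names, not wharf's actual code):

```go
package gcsupload

import (
	"fmt"
	"net/http"
)

// queryUploadStatus asks GCS how much of a resumable upload has been
// committed so far: an empty PUT with "Content-Range: bytes */<total>".
// 200/201 means the upload is complete; 308 means it's incomplete, and
// its Range header (e.g. "bytes=0-16777215") says how far we got.
func queryUploadStatus(client *http.Client, uploadURL string, totalSize int64) (string, error) {
	req, err := http.NewRequest("PUT", uploadURL, nil) // nil body => Content-Length: 0
	if err != nil {
		return "", err
	}
	req.Header.Set("Content-Range", fmt.Sprintf("bytes */%d", totalSize))

	res, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()

	switch res.StatusCode {
	case http.StatusOK, http.StatusCreated:
		return "complete", nil
	case 308: // "Resume Incomplete"
		// An absent Range header means nothing has been committed yet.
		return "incomplete, committed: " + res.Header.Get("Range"), nil
	default:
		return "", fmt.Errorf("unexpected HTTP %d while querying upload status", res.StatusCode)
	}
}
```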

fasterthanlime (Member) commented Aug 18, 2016

Hmm interesting, this one is failing because it can't chmod: https://gitlab.com/quyse/flaw/builds/3176071

quyse (Member) commented Aug 18, 2016

Hmm interesting, this one is failing because it can't chmod: https://gitlab.com/quyse/flaw/builds/3176071

Yep, that's my fault. As I said, it's a freshly rebuilt Docker container; I put butler_creds in there manually as root, while butler runs as a normal user. I've fixed that :)

fasterthanlime (Member) commented Aug 18, 2016

Right, but I still think butler should only warn about not being able to chmod, not straight up fail :) (if it has read access, that is)
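A minimal sketch of what that could look like (hypothetical helper, not butler's actual code): if chmod fails but the file is still readable, log a warning instead of aborting.

```go
package creds

import (
	"fmt"
	"os"
)

// ensureCredsReadable tries to tighten permissions on the credentials file,
// but only warns (instead of failing) when chmod doesn't work and the file
// is still readable. Hypothetical helper, for illustration only.
func ensureCredsReadable(path string) error {
	if err := os.Chmod(path, 0600); err != nil {
		if f, openErr := os.Open(path); openErr == nil {
			f.Close()
			fmt.Fprintf(os.Stderr, "warning: could not chmod %s: %v\n", path, err)
			return nil // readable, so carry on with a warning
		}
		return fmt.Errorf("credentials file %s is not readable: %v", path, err)
	}
	return nil
}
```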

@fasterthanlime changed the title from "API error: uploaded file does not exist in storage" to "Gracefully handle HTTP 503 from GCS" on Aug 18, 2016

mattiascibien commented Aug 18, 2016

I've had the same 503 error over and over now... Seems pretty systematic.

HoratioZEDU commented Aug 19, 2016

So far this is completely blocking me from pushing almost any builds. For larger files, I've been getting 503 errors that interrupt the upload after a period of "network idle"; for smaller files that actually reach 100%, I get the error described above. I hope this is fixed soon! The -v log I got for the former case is here: https://gist.github.com/HoratioZEDU/7a53c3637ddc0c9d9b4624c7ecbbdfea

mattiascibien commented Aug 19, 2016

@HoratioZEDU that's the same thing that happened to me...

fasterthanlime (Member) commented Aug 19, 2016

Looking at the code right now: we do have built-in retry logic for things like socket hangups, spurious DNS failures, etc., but not for this particular case. I'll add the status codes listed in the GCS code to the retry logic.
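For illustration, a retry predicate along those lines might look like the sketch below; the status-code list mirrors what the GCS documentation treats as transient, and this is not the actual wharf code.

```go
package retry

import "net/http"

// isRetriable reports whether a response status is worth retrying; the list
// mirrors the codes GCS documents as transient for resumable uploads.
func isRetriable(status int) bool {
	switch status {
	case http.StatusTooManyRequests, // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway, // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout: // 504
		return true
	default:
		return false
	}
}
```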

fasterthanlime added a commit to itchio/wharf that referenced this issue Aug 19, 2016

VinditCo commented Aug 19, 2016

Hello, also getting this 503 issue - can't push anything. Any suggestions?

fasterthanlime (Member) commented Aug 19, 2016

I've written a fix; I'm currently upgrading our build machines to Go 1.7 for macOS Sierra compatibility, smaller binaries, and new optimizations. A new version will be up in a bit.

fasterthanlime (Member) commented Aug 19, 2016

A new bleeding-edge version of butler is now available - can you check that the problem is resolved?

See diff here: itchio/wharf@dfdeafe

quyse (Member) commented Aug 19, 2016

That doesn't seem to help. I just upgraded to the latest butler:

$ butler --version
head, built on Aug 19 2016 @ 14:40:20

Then I retried the same CI build twice, and both times got the same uploaded file does not exist in storage error as in my first post.

VinditCo commented Aug 19, 2016

version : head, built on Aug 19 2016 @ 15:40:26

From verbose mode:

```
• Pushing 4.2 GB (803 files, 304 dirs, 0 symlinks)
▐▓                  ▌   1.63%   2h22m11s - network idle, 4.0 GB left
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x41fc0b9]

goroutine 70 [running]:
panic(0x446ea20, 0xc420012090)
	/usr/local/Cellar/go/1.7/libexec/src/runtime/panic.go:500 +0x1a1
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*netError).Error(0xc42083a9f0, 0x0, 0x0)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:146 +0x29
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks.func2(0xc4230a0000, 0x1000000, 0x1000000, 0x0, 0x0, 0x0)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:224 +0x157
github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks.func3(0xc420466000, 0xc420266a80, 0xc42042e0c0, 0xc42042e120, 0xc420314060, 0xc4203d21e0, 0xc4203d2120)
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:294 +0x5d7
created by github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader.(*ResumableUpload).uploadChunks
	/Users/amos/builds/94518b86/0/itchio/butler/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:305 +0x4c4
```

fasterthanlime (Member) commented Aug 19, 2016

@VinditCo fixing that now.

@quyse I'm still not sure why that particular one happens. The -v log you posted earlier doesn't show any sign of an upload failing :(

fasterthanlime (Member) commented Aug 19, 2016

(Sorry about the slow fixes, I'm fighting against seemingly random CI builder failures..)

fasterthanlime (Member) commented Aug 19, 2016

@VinditCo can you try again with the latest bleeding-edge? (butler update --head should do the trick)

edit: actually the bleeding-edge deploy job got skipped in the CI, let me try that again...

VinditCo commented Aug 19, 2016

@fasterthanlime sure thing, pulling and pushing now. Also having a dev hell day, so feel your pain ;)

fasterthanlime (Member) commented Aug 19, 2016

@VinditCo see edit above - latest 'head' doesn't have the fix for the last error you reported, 0.15.5 does, pushing a new head now 📦

fasterthanlime (Member) commented Aug 19, 2016

Alright, latest head deployed be3391a

VinditCo commented Aug 19, 2016

I got the latest version; it seems to stick at the same percentage and not increase, even when it resumes:

```
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                  ▌   0.81%   4h5m40s - network idle, 4.1 GB left    [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said

network error: HTTP 503 while uploading
Sleeping 1 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                  ▌   0.81%   6h32m19s - network idle, 4.1 GB left   [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said

network error: HTTP 503 while uploading
Sleeping 4 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                  ▌   0.81%   9h3m51s - network idle, 4.1 GB left    [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said

network error: HTTP 503 while uploading
Sleeping 9 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                  ▌   0.81%   12h36m29s - network idle, 4.1 GB left  [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said

network error: HTTP 503 while uploading
Sleeping 16 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                  ▌   0.81%   16h4m39s - network idle, 4.1 GB left   [upload 0] uh oh, got HTTP 503 Service Unavailable
[upload 0] server said

network error: HTTP 503 while uploading
Sleeping 25 seconds then retrying
[upload 0] uploading chunk of 16777216 bytes
[upload 0] uploading 16777216-33554431, last? false
▐▓                  ▌   0.81%   19h11m20s - network idle, 4.1 GB left
```

HoratioZEDU commented Aug 19, 2016

@VinditCo Can confirm that this appears to be the case.

fasterthanlime (Member) commented Aug 19, 2016

I'm running some tests on my side, and I'm definitely noticing things that weren't there before:

  • Storage responds very slowly, even on really small uploads
  • There doesn't seem to be a pattern to the 503 errors - the Range header is correctly set, it does retry; I just had an upload that got a 503 once, retried, and then worked.

Looking at the HTTP headers, still investigating...

edit: I'm timing the requests - I get 200 or 308 responses in about 7-12s, whereas 503s only come after a whopping 1m34s

fasterthanlime (Member) commented Aug 29, 2016

@IBwWG unfortunately, network errors seem to have come back in full swing. On the plus side, this makes the client-side fixes that much easier to test ⛄️

edit: just as I'm posting that, I'm not getting any errors on my test uploads. Conditions seem to be changing every hour.

fasterthanlime (Member) commented Aug 29, 2016

Looks like defensive programming is paying off; I just saw a 308 response without a Range header:

[upload 1] uploading 0-16778, last? true
[upload 1] server replied in 22.533891661s, with status 308 
[upload 1] [Reply header] Alternate-Protocol: [443:quic]
[upload 1] [Reply header] Alt-Svc: [quic=":443"; ma=2592000; v="35,34,33,32,31,30"]
[upload 1] [Reply header] X-Guploader-Uploadid: [AEnB2UpBkgmmDbB0MF4C7MF6TPCaMqIy9Rv3-HYfGS9XvBmmtatQL2p5LNKiOeMwv4X28tou-OjQ0D8gaeYtWGbq51rpdlW6FA]
[upload 1] [Reply header] Content-Length: [0]
[upload 1] [Reply header] Date: [Mon, 29 Aug 2016 10:43:16 GMT]
[upload 1] [Reply header] Server: [UploadServer]
[upload 1] [Reply header] Content-Type: [text/html; charset=UTF-8]
[upload 1] sender: send error, bailing out
[upload 1] got sub error: invalid range header, expected key=val, got , bailing
*errors.errorString invalid range header, expected key=val, got                 
/home/amos/go/src/github.com/itchio/butler/vendor/github.com/itchio/wharf/uploader/resumable.go:323 (0x655c3d)
        (*ResumableUpload).uploadChunks.func2: err := sendBytes(sendBuf, isLast)
/usr/local/go/src/runtime/asm_amd64.s:1998 (0x48e101)
        goexit: BYTE    $0x90   // NOP

I'm just assuming this means the Range was null (i.e. 0-0) and nothing was committed. Moving along.
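A sketch of that interpretation (hypothetical helper, not the actual wharf code): treat a 308 with no Range header as "zero bytes committed" instead of bailing out.

```go
package uploader

import (
	"fmt"
	"strconv"
	"strings"
)

// committedBytes returns how many bytes GCS reports as committed, given the
// Range header of a 308 response. An empty header means nothing was stored.
func committedBytes(rangeHeader string) (int64, error) {
	if rangeHeader == "" {
		return 0, nil // no Range header: no bytes committed yet
	}
	// Expected form: "bytes=0-16777215"
	parts := strings.SplitN(rangeHeader, "=", 2)
	if len(parts) != 2 || parts[0] != "bytes" {
		return 0, fmt.Errorf("invalid range header %q", rangeHeader)
	}
	bounds := strings.SplitN(parts[1], "-", 2)
	if len(bounds) != 2 {
		return 0, fmt.Errorf("invalid range header %q", rangeHeader)
	}
	last, err := strconv.ParseInt(bounds[1], 10, 64)
	if err != nil {
		return 0, err
	}
	return last + 1, nil // the range is inclusive, so committed = last index + 1
}
```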

IBwWG commented Aug 29, 2016

OK, just after all that, my first 32-bit push went through without errors, and my second after one retry over a 308. Guess it helps to get the latest HEAD butler again (don't settle for these underbutlers, I tell ya).

fasterthanlime (Member) commented Aug 29, 2016

Testing the new code: I'm not seeing any 5xx errors any more, but I'm still seeing quite a few 308s with unexpected ranges - which the new code handles beautifully. I wish there were a way to trigger other error conditions, to make sure all the code paths work as intended.

fasterthanlime (Member) commented Aug 29, 2016

Trying to push about a gig's worth of data on my slow connection: still getting 308 responses indicating that some blocks didn't get sent properly - but interestingly, it seems to always either fully commit or fully fail. For these tests, I've brought maxChunkGroup down to 4 (resulting in 4 × 256 KB = 1 MB PUT requests), as opposed to the default value of 64 in the client everyone has been using so far.

I'll try to run that longer test again with the 64 value (16 MB PUT requests).
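For context, the chunk sizing in question is just a multiple of GCS's 256 KiB block size; the values below mirror the ones mentioned above (illustrative only).

```go
package uploader

// GCS resumable uploads commit data in 256 KiB blocks; each PUT should carry
// a multiple of that size (except for the final chunk).
const gcsBlockSize = 256 * 1024

// chunkSize is the payload of a single PUT: 4 * 256 KiB = 1 MiB during these
// tests, 64 * 256 KiB = 16 MiB with the default maxChunkGroup of 64.
func chunkSize(maxChunkGroup int) int {
	return maxChunkGroup * gcsBlockSize
}
```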

fasterthanlime (Member) commented Aug 29, 2016

Results with maxChunkGroup = 64: only getting 308s now. I'm guessing the 503s we saw before happened because the previous code took a 308 to mean "the request you just sent got committed properly", then sent the next block, which was an invalid range according to GCS.

Now that the client processes 308 responses correctly, it still looks like each request either completely succeeds (64 blocks stored = 16 MB) or completely fails (64 blocks thrown away). That's quite a bit of data to re-upload, but the docs recommend using a larger multiple of 256 KB to amortize the cost of RTT.

Any ideas, @Capstan ? Here's an example session: https://gist.github.com/fasterthanlime/7b2089bd9ec79235b557b806a25e8010 (includes uploader ID)

fasterthanlime (Member) commented Aug 29, 2016

The good news: I just managed to push 1.2GB using the latest bleeding-edge butler (head)

The bad news: a good number of requests still keep failing, although now with 308, since butler isn't getting confused about ranges anymore. It does succeed eventually, although slower than usual.

Here's a complete session log for @Capstan & anyone else at Google following this: https://gist.github.com/anonymous/cd1ed7f0af461d0c9fca4494e25e89b1 - it contains 26 failures and 36 successes. No partial commits were seen, and only HTTP 308 or HTTP 200 responses. This was uploaded from France on an ADSL2+ connection, to the same US bucket as usual, with 054e64f

The current version of the client code can be found here: https://github.com/itchio/wharf/blob/2beaf74bc214e512514731d66c4c06567ebb2e11/uploader/resumable.go (and a few other files in the same folder). That code hasn't been released to a stable version of butler yet, since it needs more testing first.

Capstan commented Aug 29, 2016

There are several oddities:

  • The resumable session creation landed in us-west. All the subsequent chunk PUTs landed in eu-west. This can create performance issues, since every PUT has to load & commit the session information trans-atlantic. Are you creating the resumable session in some other data center and handing the URL to a client? If not, it might be strange ingress into Google or an unlucky load-balancing issue.
  • The recorded data about the session still shows a number of requests after the initial session creation where you are supplying Content-Length: 0 with a content-range: bytes 0-16777215/* and getting back the current status that the server has zero bytes. Your first request at 11:38am in the log specifically says Content-Length: 0.
    • I don't understand why we don't return a Range header, but my suspicion is that it's a boundary case since Range: 0-0 would mean we had a byte, when we do not. I've asked an engineer closer to that code for more detail, but if so, the code may have to treat an omitted Range as having no stored bytes.
    • I don't understand why you're calculating a content-range with some bytes, but your body is content-length: 0. It looks like you're trying to send content, but disagree about its size internally. Is it possible that the bytes.Reader.Read is returning 0 under the call to net.http.NewRequest? Perhaps the calculation should not be on the buffer you get passed, but on req.Header.Get("Content-Length")? I'm not sure when that's guaranteed to be set though, so it might involve some experimentation.
fasterthanlime (Member) commented Aug 29, 2016

The resumable session creation landed in us-west. All the subsequent chunk PUTs landed in eu-west. This can create performance issues, since every PUT has to load & commit the session information trans-atlantic. Are you creating the resumable session in some other data center and handing the URL to a client? If not, it might be strange ingress into Google or an unlucky load-balancing issue.

Yes, this is by design (this is what I meant by "uploading to a US bucket from Europe"). Ideally, in the future, we'd have our system let users upload to a nearby bucket and then transfer it to another bucket in our backend - but for now we're doing the simplest thing that could possibly work.

I don't understand why we don't return a Range header, but my suspicion is that it's a boundary case since Range: 0-0 would mean we had a byte, when we do not. I've asked an engineer closer to that code for more detail, but if so, the code may have to treat an omitted Range as having no stored bytes.

That makes sense (I also figured that Range: 0-0 would be problematic) - the latest version of the code handles the lack of Range header as "no stored bytes".

I don't understand why you're calculating a content-range with some bytes, but your body is content-length: 0. It looks like you're trying to send content, but disagree about its size internally. Is it possible that the bytes.Reader.Read is returning 0 under the call to net.http.NewRequest? Perhaps the calculation should not be on the buffer you get passed, but on req.Header.Get("Content-Length")? I'm not sure when that's guaranteed to be set though, so it might involve some experimentation.

I don't understand either! (and request times consistently being 2m30 to send 16MB suggest that there is, in fact, data being sent across the network - that it arrives correctly is another question though!).

My suspicion was that some proxy along the way (from eu-west to us-west) might be dropping the payload, and the Content-Length: 0 header would be coming from there. However, I'll also try running an upload through a local debug proxy and keep an eye on the headers my debug proxy actually sees.

Or - as I'm re-reading your comment, do you mean that there's actually data in the HTTP request, and it's just the Content-Length header that's set incorrectly?

Here's the current code sending the PUT request: https://github.com/itchio/wharf/blob/master/uploader/resumable.go#L169 - the data flow goes like this:

  • http.Request gets passed a counting.Reader
  • the counting.Reader is reading from a bytes.NewReader()
  • the bytes reader is reading from a fixed bytes buffer that doesn't get touched throughout the upload

The io.Reader interface doesn't have any function to query the size, so if the standard golang http implementation wanted to set Content-Length appropriately, I suppose it would have to read the entire reader into another buffer, just to know its length.
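For what it's worth, Go's net/http only infers ContentLength automatically for a few concrete body types (*bytes.Reader, *bytes.Buffer, *strings.Reader); wrapping the body in another reader hides the size, so it has to be set by hand. A minimal sketch, with the wrapping function as a stand-in for the counting reader described above (not wharf's actual code):

```go
package uploader

import (
	"bytes"
	"io"
	"net/http"
)

// newChunkRequest builds a PUT for one chunk, setting ContentLength explicitly
// because a wrapped reader hides the size from net/http (sketch only).
func newChunkRequest(uploadURL string, chunk []byte, wrap func(io.Reader) io.Reader) (*http.Request, error) {
	body := io.Reader(bytes.NewReader(chunk))
	if wrap != nil {
		body = wrap(body) // e.g. a counting/progress reader
	}
	req, err := http.NewRequest("PUT", uploadURL, body)
	if err != nil {
		return nil, err
	}
	// Without this, net/http falls back to chunked transfer encoding.
	req.ContentLength = int64(len(chunk))
	return req, nil
}
```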

@Capstan are you seeing the Content-Length: 0 issue in the session with upload ID AEnB2Uou6kKzhnSoosVeLnBFbabyEQ9ABCtwDvVsnrM_GmNM5ocXZryOxWdvpsj0H4UsM2fCOW8g__Q1cKdEKs52M3VLwa0_ZR3K-UewRcPTc--Mop1aBIM ? I'm not sure which timezone the 11:38AM time you gave is in, but I was testing code in the middle of a refactoring earlier today - that session is the only complete one I have with the latest code.

fasterthanlime (Member) commented Aug 29, 2016

Looking through Fiddler logs right now, butler's PUT requests don't seem to set the Content-Length header at all. I'm changing the code so that it's set explicitly now.

Capstan commented Aug 29, 2016

Yes, this is by design (this is what I meant by "uploading to a US bucket from Europe"). Ideally, in the future, we'd have our system let users upload to a nearby bucket and then transfer it to another bucket in our backend - but for now we're doing the simplest thing that could possibly work.

Currently, the upload session and the bucket/object metadata/data storage are in separate persistent stores, because the upload session needs to be close to the uploader, not to the final destination of the bytes. If you're creating a US upload session for an upload from the EU, then I still suggest you make the resumable session request from the EU to reduce latency.

I don't understand either! (and request times consistently being 2m30 to send 16MB suggest that there is, in fact, data being sent across the network - that it arrives correctly is another question though!).

My suspicion was that some proxy along the way (from eu-west to us-west) might be dropping the payload, and the Content-Length: 0 header would be coming from there. However, I'll also try running an upload through a local debug proxy and keep an eye on the headers my debug proxy actually sees.

Or - as I'm re-reading your comment, do you mean that there's actually data in the HTTP request, and it's just the Content-Length header that's set incorrectly?

I don't have data for that. If you send Content-Length: 0, which we do record and see, I'm not sure we ever expect any further data and certainly don't record it, at least at my service's level.

The io.Reader interface doesn't have any function to query the size, so if the standard golang http implementation wanted to set Content-Length appropriately, I suppose it would have to read the entire reader into another buffer, just to know its length.

That's my suspicion as well, but I didn't have time to look. In any case, it looks like the request thinks it's transmitting 0 bytes, so there's disagreement.

@Capstan are you seeing the Content-Length: 0 issue in the session with upload ID AEnB2Uou6kKzhnSoosVeLnBFbabyEQ9ABCtwDvVsnrM_GmNM5ocXZryOxWdvpsj0H4UsM2fCOW8g__Q1cKdEKs52M3VLwa0_ZR3K-UewRcPTc--Mop1aBIM ? I'm not sure which timezone the 11:38AM time you gave is in, but I was testing code in the middle of a refactoring earlier today - that session is the only complete one I have with the latest code.

Yes. The timezone is GMT, which is the timezone of the gist - this is from that first request (4:38 AM PT). From the comments, it looks like it's in the chunk-sending code, but it's sending Content-Length: 0, and thus the request is being treated like a query-status.

fasterthanlime (Member) commented Aug 29, 2016

Well, I feel like I'm taking crazy pills :(

As I suspected, butler was never setting Content-Length - Go's http.Request uses chunked transfer encoding unless the ContentLength field of http.Request is set explicitly.

I'm now setting it explicitly, which makes http.Request actually send a Content-Length header with the correct number, and now some requests are failing with Content-Length=(a multiple of 256KB) but Body length=0 - except I'm logging the reads, so I know they're happening.

I'm currently trying to simplify the code as much as possible to isolate the problem (remove custom transport, remove counting reader, etc.). I'll update as soon as I find out more.

fasterthanlime (Member) commented Aug 29, 2016

Found one more piece of the puzzle, the .Do method I use to send PUT requests has its own 503 handling logic, which blindly retries (and reads from the same reader, hence the 0 body length).

My only consolation is that, well, we're still getting 503, so I'm not completely crazy and it's not just client-side errors afaict.
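A sketch of the general fix for that class of bug: rebuild the request body from the original buffer on every attempt, so a retry never re-reads an already-drained reader (names and retry policy here are illustrative, not wharf's actual code).

```go
package uploader

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// putWithRetry re-creates the request body on every attempt, so a retry never
// re-reads a reader that has already been consumed (illustrative sketch).
func putWithRetry(client *http.Client, uploadURL string, chunk []byte, contentRange string, tries int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < tries; attempt++ {
		req, err := http.NewRequest("PUT", uploadURL, bytes.NewReader(chunk))
		if err != nil {
			return nil, err
		}
		req.ContentLength = int64(len(chunk))
		req.Header.Set("Content-Range", contentRange)

		res, err := client.Do(req)
		if err == nil && res.StatusCode != http.StatusServiceUnavailable {
			return res, nil // caller inspects 200/308/etc.
		}
		if err == nil {
			res.Body.Close()
			lastErr = fmt.Errorf("HTTP %d while uploading", res.StatusCode)
		} else {
			lastErr = err
		}
		time.Sleep(time.Duration(attempt+1) * time.Second)
	}
	return nil, lastErr
}
```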

fasterthanlime (Member) commented Aug 29, 2016

@Capstan little status update:

Here's what used to happen:

  • GCS was often responding with 503, but the code path used to send PUT requests was retrying blindly - with the same reader, which explains the Content-Length: 0 requests you saw
  • Setting Content-Length explicitly helped me find out what was happening, and I fixed it
  • The following screenshot shows what we have now (the same network issue we've had for 12 days now, unrelated to how correct butler's error handling was)

[screenshot: gcs-errors]

  • request 10 is the first try: it's sending 4 blocks (1MB). it gets a 503 back.
  • request 12 is a "query status" request (new logic implemented today) - it gets a 308 with a blank "Range" header (no bytes were committed)
  • request 13 is the second try: it's sending the same 4 blocks - it gets a 503 back
  • requests 15, 17, 19, 21, and 23 are all "query status" - 15 through 21 get a 503 back, 23 finally gets a 308, which has a blank "Range" header again
  • request 24 is the third try: still sending the same 4 blocks
  • request 26 and 28 are "query status", the second one gets a 308 with a blank "Range" header again
  • request 29 is the fourth try, which finally succeeds in sending the first 4 blocks.

The error rate here is much higher than what I was seeing this morning (it's 8 PM here now), but it's a good showcase of the kind of performance our users have been seeing. I'll leave that upload session running; the upload ID is AEnB2Uq9QHXVdYmK_Td1nF9iqtG1QOZNXfHCUgT-2FFUvsbrbBSgo-cHlXyX_AQub6ntM5WAI8AFWoxXUg3BY63IKrM3H9Ci7w

Note: I have no doubt that this is a side-effect of "generate upload URL in US, then hand it over to EU clients". However, just two weeks ago (and for several months before that), we had never seen that many requests fail - which leads me to keep believing that, despite my trial and error with GCS error codes, there's a routing issue happening between the EU and US somewhere.

When it's solved, I'll be happy to look into generating upload URLs from the right continent - speaking of that, I'd be super happy if there were a way to do it that didn't imply either 1) giving any user full write access to the bucket (i.e. not being able to enforce quotas/limits on our own) or 2) having servers in every zone (so that they can start the resumable upload session).

Capstan commented Aug 29, 2016

Found one more piece of the puzzle, the .Do method I use to send PUT requests has its own 503 handling logic, which blindly retries (and reads from the same reader, hence the 0 body length).

Do you have a pointer to this? This seems like a straight-up golang framework bug. Rereading from a reader will never work the way you want.

Note: I have no doubt that this is a side-effect of "generate upload URL in US, then hand it over to EU clients". However, just two weeks ago (and for several months before that), we had never seen that many requests fail - which leads me to keep believing that, despite my trial and error with GCS error codes, there's a routing issue happening between the EU and US somewhere.

This behavior should result in slow uploads, not failing uploads. I'm going to see if one of us can track down the cause of the 503s you're seeing. Do you / can you record the body of that error message to see if there's extra indication as to what is going on?

fasterthanlime (Member) commented Aug 29, 2016

Do you have a pointer to this? This seems like a straight-up golang framework bug. Rereading from a reader will never work the way you want.

That one is 100% my fault - the Go http stack is still as solid as ever, thankfully :)

This behavior should result in slow uploads, not failing uploads. I'm going to see if one of us can track down the cause of the 503s you're seeing. Do you / can you record the body of that error message to see if there's extra indication as to what is going on?

Now that the error-handling logic is (hopefully) correct on the client side, it does indeed result in (very) slow uploads, but they eventually complete (provided it doesn't fail 10 times in a row). I haven't recorded every 503, but from a cursory look at my debug proxy, all I can see is "Service Unavailable". Here's a sample raw response:

HTTP/1.1 503 Service Unavailable
X-GUploader-UploadID: AEnB2Urat-wVR64GdMPEH-HL9giFKO4hfQsA499696qLv8N5EbCuT-nhGGVQk4NjnWABs4FPgZT8prEFVzbaD78DMthAui91p2gJwl0D3bJER5ZN-Muet3g
Content-Length: 19
Date: Mon, 29 Aug 2016 18:26:29 GMT
Server: UploadServer
Content-Type: text/html; charset=UTF-8
Alternate-Protocol: 443:quic
Alt-Svc: quic=":443"; ma=2592000; v="35,34,33,32,31,30"

Service Unavailable

I suspect this is the issue we were both looking for all along - butler's mishandling of GCS errors (blind retries on 503 with an empty body, or treating a 308 as "move on to the next block" when it was actually a query response indicating that the previous block wasn't committed) only exacerbated the problem - but it's still there.

Capstan commented Aug 29, 2016

It looks like I erred. The service can yield 503s when it takes longer than an internally-set deadline to read or write the upload session metadata across the pond. I don't know yet why it's happening more as of recently. When did you first start seeing the issue? Does it directly correlate to adding EU writers to the US upload sessions?

What proxy(ies) are you using for your normal upload flows (not counting fiddler)?

How, if at all, are you arranging to have upload sessions start in the US? Are there multiple machines involved? Or proxy servers?

fasterthanlime (Member) commented Aug 29, 2016

It looks like I erred. The service can yield 503s when it takes longer than an internally-set deadline to read or write the upload session metadata across the pond. I don't know yet why it's happening more as of recently.

Ah, that's promising!

When did you first start seeing the issue?

The first report was this very GitHub issue; see the timestamp of the opening comment (Aug 17).

Does it directly correlate to adding EU writers to the US upload sessions?

Yes - our users are mainly from the US and EU, and everyone who has complained about performance/increased error rate is uploading from the EU (France, Holland, Switzerland, England).

We've gotten no complaints from US users, and we have a set of background jobs that download from and upload to GCS running on a US GCE instance; they haven't had a single error either.

What proxy(ies) are you using for your normal upload flows (not counting fiddler)?
How, if at all, are you arranging to have upload sessions start in the US? Are there multiple machines involved? Or proxy servers?

No proxies on our side. The itch.io API server (in the US) creates a resumable upload session, sends it back as part of an API response, and then our client doesn't talk to our API server until after the upload is finished. (The hope was that GCS would scale better than whatever proxy we could come up with). Conditionally generating resumable upload session URLs allows us to enforce quotas/limits on our side.

fasterthanlime (Member) commented Aug 29, 2016

I'm running an experiment right now - setting storage.googleapis.com to its US IP in my /etc/hosts and seeing if I get any errors. I'm not getting any so far. (Obviously not a fix, but it does add credibility to the internally-set deadline theory).

fasterthanlime (Member) commented Aug 29, 2016

News from the long-running upload (without the /etc/hosts trick) - there are periods of calm, where most PUTs succeed - then errors come back:

[screenshot: gcs-there-are-better-periods]

(Still upload ID AEnB2UoaJTsJOX8hFB9DXgQ5oHIgm37aNQYT1wm-_8lEgCODlvUpBiMDKQNRpVDnuFLomVMyABCe58jZKqRpfO-ApvsmT2D9fQ)

Capstan commented Aug 29, 2016

@fasterthanlime That trick didn't work as intended. The upload session is still in us-west.

We're engaging the uploader team about internal timeouts.

While that's progressing, is it possible that you could have the US-based server generate a SignedUrl to create the upload session and have the EU-based client execute the signed URL and then continue from there? Or do your clients also have the necessary credentials, in which case they could just create the upload session themselves?
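For anyone following along, the client side of that suggestion would look roughly like the sketch below, assuming the server hands out a signed URL whose signature covers the `x-goog-resumable: start` header; the session URI then comes back in the Location header (illustrative, not butler's actual implementation).

```go
package uploader

import (
	"fmt"
	"net/http"
)

// startResumableSession POSTs to a signed URL (generated server-side) with
// "x-goog-resumable: start", so the upload session is created from the
// client's own region. The session URI comes back in the Location header.
func startResumableSession(client *http.Client, signedURL string) (string, error) {
	req, err := http.NewRequest("POST", signedURL, nil)
	if err != nil {
		return "", err
	}
	req.Header.Set("x-goog-resumable", "start")

	res, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer res.Body.Close()

	if res.StatusCode != http.StatusCreated {
		return "", fmt.Errorf("expected 201 Created, got HTTP %d", res.StatusCode)
	}
	sessionURL := res.Header.Get("Location")
	if sessionURL == "" {
		return "", fmt.Errorf("no Location header in response")
	}
	return sessionURL, nil
}
```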

fasterthanlime (Member) commented Aug 29, 2016

@fasterthanlime That trick didn't work as intended. The upload session is still in us-west.

Just to be clear: the "trick" session is AEnB2Ur_2VXtlSopRf-AZq_qJ5ocLfrlJiGzdc52OM12U67bU8fz3gWmW37sqzhgpf-6MoQubxw70OB6ALAfBEqBkNaC6_yIlw (I didn't post that ID before) - it's had no commit failures so far (no 503) - just one "connection reset by peer" which seems completely unrelated.

The no-trick session (AEnB2UoaJTsJOX8hFB9DXgQ5oHIgm37aNQYT1wm-_8lEgCODlvUpBiMDKQNRpVDnuFLomVMyABCe58jZKqRpfO-ApvsmT2D9fQ) is still getting regular 503 errors.

We're engaging the uploader team about internal timeouts.

Good to hear, thanks for that!

While that's progressing, is it possible that you could have the US-based server generate a SignedUrl to create the upload session and have the EU-based client execute the signed URL and then continue from there? Or do your clients also have the necessary credentials, in which case they could just create the upload session themselves?

I actually hadn't thought of using signed URLs that way (took me a while to get what you meant!). We'll try that and I'll get back to you with results.

fasterthanlime (Member) commented Aug 29, 2016

Following up on the "trick" session mentioned above (AEnB2Ur_2VXtlSopRf-AZq_qJ5ocLfrlJiGzdc52OM12U67bU8fz3gWmW37sqzhgpf-6MoQubxw70OB6ALAfBEqBkNaC6_yIlw): that upload just completed without a single 503.

fasterthanlime (Member) commented Aug 30, 2016

It's around 1 PM Paris time, and the error rate is the highest I've seen yet:

[screenshot: gcs-still-going]

fasterthanlime (Member) commented Aug 30, 2016

butler 0.17.0 is out and implements all known fixes, except for this one:

While that's progressing, is it possible that you could have the US-based server generate a SignedUrl to create the upload session and have the EU-based client execute the signed URL and then continue from there?

Unfortunately, I didn't have time to tackle that today (I set up monitoring of our various services via Stackdriver instead).

fasterthanlime (Member) commented Aug 30, 2016

@Capstan I just added a "deferred_resumable" upload type to our client, which is your "signed URL" suggestion, and the results are just beautiful:

[screenshot: gcs-suggestion]

We'll be deploying that and urging our users to upgrade asap. (Figuring out the original issue would be nice too!)

fasterthanlime (Member) commented Sep 2, 2016

butler 0.18.0 is out and uses @Capstan's suggested approach.

Everyone is encouraged to upgrade as soon as possible. I've posted a full post-mortem if anyone's curious!
