Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File uploads time out after 5 minutes (Jetstream2) #232

Closed
2 tasks
eecavanna opened this issue Jul 4, 2024 · 6 comments
Closed
2 tasks

File uploads time out after 5 minutes (Jetstream2) #232

eecavanna opened this issue Jul 4, 2024 · 6 comments
Assignees
Labels
bug Something isn't working jetstream2 ✈️ Issue related to deploying NMDC EDGE to the Jetstream2 platform

Comments

@eecavanna
Copy link
Collaborator

History

In #218, @mflynn-lanl reported that users could not upload large files. The root cause was that Cloudflare — which, at the time, was configured in "Proxied" mode — was imposing its 500 MiB (Mebibyte) request size limit. Uploads of files smaller than 500 MiB would succeed, while uploads of files larger than 500 MiB would fail. To fix that, we reconfigured Cloudflare to be in "DNS Only" mode instead of "Proxied" mode. As a result, we were able to upload files larger than 500 MiB.

Problem

Although we can now upload files larger than 500 MiB; if the upload takes longer than 5 minutes, the upload fails and the client receives an HTTP 408 Request Timeout response.

image

image

Task 📋

  • Determine what is imposing the 5-minute limit (e.g. Caddy, Express)
  • Reconfigure/modify that thing so it no longer imposes that limit

Appendix

You can convert a number of Mebibytes into Bytes at: https://www.dr-lex.be/info-stuff/bytecalc.html

You can generate "dummy" files (of a given size in Bytes) by running this command:

$ head -c 523239424 /dev/urandom > my-fake-499-mebibyte.fastq.gz
$ head -c 524288000 /dev/urandom > my-fake-500-mebibyte.fastq.gz
$ head -c 525336576 /dev/urandom > my-fake-501-mebibyte.fastq.gz
@eecavanna eecavanna added bug Something isn't working jetstream2 ✈️ Issue related to deploying NMDC EDGE to the Jetstream2 platform labels Jul 4, 2024
@eecavanna eecavanna self-assigned this Jul 4, 2024
@eecavanna
Copy link
Collaborator Author

At the time the upload fails (i.e. approximately 5 minutes after starting the upload), I see these lines in the caddy container's log:

exouser-caddy-1  | {"level":"error","ts":1720044809.4940464,"logger":"http.handlers.reverse_proxy","msg":"reading from backend","error":"read tcp 192.168.0.2:60702->192.168.0.4:5000: read: connection reset by peer"}
exouser-caddy-1  | {"level":"error","ts":1720044809.4941583,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","upstream":"app:5000","duration":309.991915694,"request":{"remote_ip":"__REDACTED__","remote_port":"61441","client_ip":"__REDACTED__","proto":"HTTP/2.0","method":"POST","host":"edge-dev.microbiomedata.org","uri":"/auth-api/user/upload/add","headers":{"Priority":["u=1, i"],"Accept-Language":["en-US,en;q=0.9"],"X-Forwarded-Proto":["https"],"Cookie":[],"Accept":["application/json, text/plain, */*"],"Sec-Ch-Ua-Mobile":["?0"],"Content-Length":["1000000531"],"Cache-Control":["no-cache"],"Sec-Ch-Ua":["\"Brave\";v=\"125\", \"Chromium\";v=\"125\", \"Not.A/Brand\";v=\"24\""],"Referer":["https://edge-dev.microbiomedata.org/user/files"],"Sec-Fetch-Mode":["cors"],"Sec-Fetch-Dest":["empty"],"Sec-Gpc":["1"],"Sec-Fetch-Site":["same-origin"],"Pragma":["no-cache"],"X-Forwarded-Host":["edge-dev.microbiomedata.org"],"Authorization":[],"Accept-Encoding":["gzip, deflate, br, zstd"],"X-Forwarded-For":["__REDACTED__"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"],"Origin":["https://edge-dev.microbiomedata.org"],"Content-Type":["multipart/form-data; boundary=----WebKitFormBoundaryzCs3N3kdNMcOfvdy"],"Sec-Ch-Ua-Platform":["\"macOS\""]},"tls":{"resumed":true,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"edge-dev.microbiomedata.org"}},"error":"reading: read tcp 192.168.0.2:60702->192.168.0.4:5000: read: connection reset by peer"}
exouser-caddy-1  | {"level":"error","ts":1720044809.4942892,"logger":"http.log.access","msg":"handled request","request":{"remote_ip":"__REDACTED__","remote_port":"61441","client_ip":"__REDACTED__","proto":"HTTP/2.0","method":"POST","host":"edge-dev.microbiomedata.org","uri":"/auth-api/user/upload/add","headers":{"Content-Length":["1000000531"],"Authorization":[],"Referer":["https://edge-dev.microbiomedata.org/user/files"],"Accept-Language":["en-US,en;q=0.9"],"Sec-Fetch-Mode":["cors"],"Accept-Encoding":["gzip, deflate, br, zstd"],"Accept":["application/json, text/plain, */*"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36"],"Origin":["https://edge-dev.microbiomedata.org"],"Sec-Fetch-Dest":["empty"],"Cookie":[],"Sec-Fetch-Site":["same-origin"],"Pragma":["no-cache"],"Cache-Control":["no-cache"],"Sec-Ch-Ua":["\"Brave\";v=\"125\", \"Chromium\";v=\"125\", \"Not.A/Brand\";v=\"24\""],"Content-Type":["multipart/form-data; boundary=----WebKitFormBoundaryzCs3N3kdNMcOfvdy"],"Sec-Ch-Ua-Mobile":["?0"],"Sec-Ch-Ua-Platform":["\"macOS\""],"Sec-Gpc":["1"],"Priority":["u=1, i"]},"tls":{"resumed":true,"version":772,"cipher_suite":4865,"proto":"h2","server_name":"edge-dev.microbiomedata.org"}},"bytes_read":878797125,"user_id":"","duration":0,"size":0,"status":408,"resp_headers":{"Server":["Caddy"],"Alt-Svc":["h3=\":443\"; ma=2592000"]}}

Highlights (from those lines):

  • "msg": "aborting with incomplete response","upstream":"app:5000","duration":309.991915694
  • "Content-Length":["1000000531"]
  • "error":"reading: read tcp 192.168.0.2:60702->192.168.0.4:5000: read: connection reset by peer"
  • "bytes_read":878797125

The timestamp (1720044809.4940464) translates to:

  • Wednesday, July 3, 2024 10:13:29.494 PM (GMT) (UTC)
  • Wednesday, July 3, 2024 3:13:29.494 PM GMT-07:00 DST (PDT)

I also see this in the app container's log, but the timestamps seem to be 1 minute off from the ones above:

exouser-app-1    | 2024-07-03 22:14:29 debug: 	/auth-api/user/upload/add: {}
exouser-app-1    | 2024-07-03 22:14:29 error: 	[object Object]

@eecavanna
Copy link
Collaborator Author

I do see one timeout mentioned in the Caddy documentation, whose default duration is 5 minutes. It's the idle timeout. I don't know whether it affects us, since we're using Caddy in reverse-proxy mode; but here it is just in case.

  • idle is a duration value that sets the maximum time to wait for the next request when keep-alives are enabled. Defaults to 5 minutes to help avoid resource exhaustion.

@eecavanna
Copy link
Collaborator Author

I don't see any 5-minute (by default) timeouts in the documentation of the express-fileupload NPM package used by the React app, although this React app is using old versions of lots of packages, so the docs might not be accurate for the version of the package in use.

@ssarrafan
Copy link

@eecavanna @mflynn-lanl is this slated for this sprint? Can this be slotted for the next sprint?

@eecavanna
Copy link
Collaborator Author

eecavanna commented Jul 22, 2024

I think @mflynn-lanl and I will discuss this at the squad meeting on Tuesday (I missed last week's meeting) and then update this ticket based on that discussion (with one potential update being to move it to next sprint).

@eecavanna
Copy link
Collaborator Author

Closing in favor of #289, which mentions this symptom and links to this ticket. I think resolving #289 will eliminate this symptom (and the underlying issue) also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working jetstream2 ✈️ Issue related to deploying NMDC EDGE to the Jetstream2 platform
Projects
Development

No branches or pull requests

2 participants