add some retry logic for pulling from remotes #878
Comments
This is likely to be a bug of some sort caused by libsoup; that error message only occurs in GLib when a socket timeout is set (and they aren't set by default). libsoup, however, does enable socket timeouts - looks like 60 seconds by default. That's a pretty long time, but likely shorter than the operating system timeouts. The code path is going to be entirely different here for the libcurl backend.
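For context, a minimal sketch (not ostree's actual code) of where that timeout comes from: libsoup 2.x exposes the per-socket timeout as the SoupSession "timeout" property, with the ~60-second default mentioned above.

```c
#include <libsoup/soup.h>

int
main (void)
{
  /* Illustration only: a SoupSession carries a "timeout" property
   * (in seconds) that libsoup applies to its sockets; this is what
   * surfaces as the GLib socket timeout error discussed here. */
  SoupSession *session = soup_session_new ();
  guint timeout = 0;

  g_object_get (session, "timeout", &timeout, NULL);
  g_print ("socket timeout: %u seconds\n", timeout);

  /* Setting it to 0 would disable the socket timeout entirely. */
  g_object_set (session, "timeout", (guint) 0, NULL);

  g_object_unref (session);
  return 0;
}
```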
Happened again last night for rawhide: task
@dustymabe The task link there is self-referential 😄
Sigh - I have to work on copy/paste skills - and sleep - https://koji.fedoraproject.org/koji/taskinfo?taskID=19781624
That one is a … Do we want to retry in that scenario? And for how long?
Do you think it would be reasonable to retry? I'd like to say yes, but don't want to seem unreasonable. Other than …
Another case where maybe something like this would help: https://lists.projectatomic.io/projectatomic-archives/atomic-devel/2017-June/msg00030.html I know in this case it is probably more about anaconda than ostree, but it's possible that whatever we would do for this PR could prevent the user's reported problem from happening.
Hit this again - pulling to a local machine running Fedora 26 AH.
Hit this again last night for the F25AH compose: https://koji.fedoraproject.org/koji/taskinfo?taskID=20110591
I feel the pain here, but I still feel the correct fix is in Fedora infra: make the webservers more reliable. I don't want to trivialize running infrastructure, but on the other hand we're just talking about serving static files. Also, ostree should have similar behavior to librepo (libdnf), and for that matter docker/skopeo; otherwise we aren't solving the problem fully. And AFAICS, there are no such retries in librepo.
Now as far as the Cisco issue, Anaconda does do retries for rpm installs; see e.g. https://github.com/rhinstaller/anaconda/blob/cff864a08fc78ae30bcb299cc83762b12da78c8b/pyanaconda/payload/__init__.py#L631 and, in RHEL7 with the yum payload, https://github.com/rhinstaller/anaconda/blob/2377e2ed95f4c0caccec578fa906ec22a94839b3/pyanaconda/packaging/yumpayload.py#L691
Yep, just saying that if there is anything we can do, we should do it. Assuming ostree becomes more popular and gets used in places other than Atomic Host, there is going to be crappy server infra in the mix. Do we want those issues opened here, or handled by the software? If we can't do anything reasonable, then let's close this issue. If there is something we can do, then let's do it and save ourselves future headaches.
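To make the "retry" idea concrete, here is a minimal sketch of what retry-with-backoff around a pull could look like. pull_once() is a hypothetical stand-in for whatever fetch operation would actually be wrapped, not an existing ostree function.

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical helper: performs one pull attempt, returns true on success. */
static bool pull_once (void);

/* Sketch: retry a pull up to max_attempts times with capped exponential backoff. */
static bool
pull_with_retries (int max_attempts)
{
  unsigned int delay = 1; /* seconds */

  for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
      if (pull_once ())
        return true;
      if (attempt == max_attempts)
        break;
      sleep (delay);   /* wait before the next attempt */
      if (delay < 32)
        delay *= 2;    /* back off: 1s, 2s, 4s, ... capped at 32s */
    }
  return false;
}
```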
Don't know if this is related or not, but since we are now on libcurl, I noticed this error once:
Also just saw this:
I've also seen
Several times with that (large) ref pull in particular.
I can reproduce this with at least two people. Can we do something to help debug it?
Setting a breakpoint in timer_cb I get this trace:
Doesn't look too suspicious, though.
What specific remotes and refs are people hitting? In #878 (comment) Alex mentioned flathub; are other people hitting this with e.g.:
Also, libostree logs some information to the journal; try e.g.:
I finally figured out how to reproduce this. First, download eos-select-bandwidth.py from https://gist.github.com/ramcq/fcfc6cc2d192d8f391fb9fb6e606509b, then run it like … Then:
Output:
I discovered OSTREE_DEBUG_HTTP, so here is a new log with that:
So, here is a completely random guess. Maybe we're starting too many parallel requests on the same http2 connection, and the server has some limit on how many are actually executed in parallel while the others sit in a queue. Now, if the initial operations take > 30 secs, then the not-yet-handled requests time out because they have gotten no response yet.
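If that guess is right, one mitigation would be to cap how much work the libcurl multi handle has in flight at once, so queued requests don't sit unanswered long enough to hit a timeout. A sketch with illustrative values (not ostree's actual settings):

```c
#include <curl/curl.h>

/* Sketch: limit how many transfers libcurl runs against one host at a time.
 * The numeric limits are made up for illustration. */
static CURLM *
make_multi_handle (void)
{
  CURLM *multi = curl_multi_init ();

  /* Allow HTTP/2 multiplexing rather than HTTP/1.1 pipelining. */
  curl_multi_setopt (multi, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX);

  /* Cap connections per host and overall. */
  curl_multi_setopt (multi, CURLMOPT_MAX_HOST_CONNECTIONS, 8L);
  curl_multi_setopt (multi, CURLMOPT_MAX_TOTAL_CONNECTIONS, 8L);

  /* Newer libcurl (>= 7.67) can additionally cap streams per HTTP/2
   * connection via CURLMOPT_MAX_CONCURRENT_STREAMS. */
  return multi;
}
```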
I did some more testing, and what seems to happen is that at some point a delta fetch finishes and we start a new one, and then we don't get any data back on it for 30 sec (probably because we have a lot of other outstanding requests in the pipeline and the transfer rate is slow), which triggers these options that ostree sets:
If I remove these two then everything works fine.
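For reference, the two options being described appear to be libcurl's "low speed" abort settings (a later comment calls them the low speed option). A sketch of what setting them looks like; the exact values are assumptions matching the 30 sec mentioned above, not copied from ostree's source:

```c
#include <curl/curl.h>

/* Sketch: abort a transfer that stays below 1 byte/sec for 30 seconds. */
static void
set_low_speed_abort (CURL *handle)
{
  curl_easy_setopt (handle, CURLOPT_LOW_SPEED_LIMIT, 1L);  /* bytes per second */
  curl_easy_setopt (handle, CURLOPT_LOW_SPEED_TIME, 30L);  /* seconds below the limit */
}
```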
I wonder if we should just remove these two lines for now, including in the Fedora packages. We can research the details later, but let's make it work first.
I have a tool that checks to make sure refs are signed in our repos. I happened to see this in the tool this morning.
Not sure if anyone has seen the …
Got this error from some students too while installing GNOME Builder from Flathub, not sure if related:
Saw another
Ah yes, the low speed option conflicting with HTTP2 feels like a libcurl bug, but I'm totally in favor of just removing that for now.
They don't play nicely currently with HTTP2 where we may have lots of requests queued. ostreedev#878 (comment) In practice anyways I think issues here are better solved on a higher level - e.g. apps today can use an overall timeout on pulls and if they exceed the limit set the cancellable.
PR in #1349
They don't play nicely currently with HTTP2 where we may have lots of requests queued. #878 (comment) In practice anyways I think issues here are better solved on a higher level - e.g. apps today can use an overall timeout on pulls and if they exceed the limit set the cancellable. Closes: #1349 Approved by: alexlarsson
They don't play nicely currently with HTTP2 where we may have lots of requests queued. #878 (comment) In practice anyways I think issues here are better solved on a higher level - e.g. apps today can use an overall timeout on pulls and if they exceed the limit set the cancellable. Closes: #1349 Approved by: jlebon
A quick web search for "curl HTTP 2 framing error" turns up …
One thing we could probably do easily enough is add a remote config option to turn off http2 to allow people to work around this.
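Under the hood, such an option would presumably just force HTTP/1.1 on the easy handle. A minimal sketch of the libcurl side (the remote-config key name itself is whatever the eventual PR chose, not shown here):

```c
#include <curl/curl.h>

/* Sketch: opt a handle out of HTTP/2 entirely so nothing gets multiplexed. */
static void
force_http1 (CURL *handle)
{
  curl_easy_setopt (handle, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
}
```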
Just for information, I have done further investigation and reported my findings on curl/curl#1618 (comment) (tl;dr: curl keeps trying to read after the server asks it to cancel this connection and open a new one).
Was experiencing this issue, or something similar.
We already support disabling HTTP/2 since #1373. Yep, agree we can close this for now; we can always reopen more targeted issues for further enhancements. Thanks a ton for implementing this!
NOT FIXED. The current (new) retry logic fails to consider congestion as a cause of dropped packets. This especially happens with TLS handshake packets on certain routers and ISPs. Recommended solutions: limit parallel connections to any single remote host to a sane value based on protocol. Furthermore, in the case of dropped connections, on the third retry to a host with parallel connections the local client should wait until a parallel connection finishes its download before restarting, locking out all new connections to the host at this point. At that point it should either hand off that connection to the new download (preferable), or close the connection and open a new one after a fixed grace period. As a final note, the best long-term option is to redo the network code to be request based instead of connection based, opening a sane number of parallel connections and feeding requests through from a queue, just like a modern web browser would when downloading the contents of a web page. This would be a much simpler model, friendlier to multi-threading, and more adaptable in all aspects, including network congestion and unreliable networks.
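As a rough illustration of the request-based model described above (purely a sketch: struct request, next_request(), and fetch_request() are hypothetical stand-ins, not ostree APIs), a fixed pool of connection slots could drain a shared queue so congestion delays requests instead of multiplying connections:

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

#define MAX_CONNECTIONS_PER_HOST 6

struct request;                            /* hypothetical request descriptor */
struct request *next_request (void);       /* hypothetical: pop the next item from a shared queue */
void fetch_request (struct request *req);  /* hypothetical: download one object, with its own retries */

/* Initialized once at startup with sem_init (&connection_slots, 0, MAX_CONNECTIONS_PER_HOST). */
static sem_t connection_slots;

static void *
worker (void *arg)
{
  struct request *req;
  (void) arg;

  while ((req = next_request ()) != NULL)
    {
      sem_wait (&connection_slots);  /* block until a connection slot frees up */
      fetch_request (req);
      sem_post (&connection_slots);  /* release the slot for the next queued request */
    }
  return NULL;
}
```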
Probably best to open a new issue with your suggestions rather than try to resurrect an issue from half a year ago, thanks.
Thanks
There have been a lot of intermittent network failures lately when pulling from our ostree remotes. While much of this is probably related to the backend Fedora infra, I'd like to at least brainstorm how we could make our code more robust to poor networks and/or poor remote servers.
Here are a few examples, but I've seen it myself quite a bit:
- this screenshot from within this task
- this log from within this task