Support retrying downloads if they transiently fail #1594
Cool! How about we split those first 6 commits into a prep PR to start? They're all pretty straightforward.
src/libostree/ostree-repo-pull.c (outdated)
if (fetcher_queue_is_full (pull_data))
  {
    g_debug ("queuing fetch of static delta %s-%s part %u",
             fetch_data->from_revision ? fetch_data->from_revision : "empty",
Minor: you can use the GNU C extension ?: for this: https://gcc.gnu.org/onlinedocs/gcc/Conditionals.html#Conditionals
Done. I’ve also added b4ea90b at the beginning of the branch to convert other conditionals in the file too.
Done as #1599. I’ll rebase this one on that one shortly.
I’ll check that out tomorrow, thanks. Given the positive reaction to this so far, I’ll also work on queueing static delta superblocks, but I’ll do that as a separate PR so as not to block this one.
Force-pushed from 433259a to 754931a.
src/libostree/ostree-repo-pull.c (outdated)
check_outstanding_requests_handle_error (pull_data, &error);
scan_object_queue_data_free (scan_data);

if (should_retry_request (error, scan_data->n_retries_remaining--))
Why do we need to check for network errors here? scan_one_metadata_object
either enqueues a request or just scans a local object, right?
Hmm, indeed. I blame over-enthusiastic grepping for calls to check_outstanding_requests_handle_error().
    enqueue_one_object_request_s (pull_data, g_steal_pointer (&fetch_data));
  else
    check_outstanding_requests_handle_error (pull_data, &local_error);

if (free_fetch_data)
Not new here, though I wonder why we need these bools instead of using the g_steal_pointer pattern higher up. Anyway, we can look into that later!
I’m not sure it applies in this particular instance, but there are other similar patterns when enqueuing other structs, where g_steal_pointer (&the_struct) can’t be used because the_struct needs to be referred to again lower down, before it’s eventually potentially freed.

In any case, the code could be restructured to keep a const pointer to whatever it needs to refer to later, or something.
src/libostree/ostree-repo-pull.c (outdated)
g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_NOT_FOUND) ||
g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE) ||
g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_NOT_FOUND) ||
g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_TEMPORARY_FAILURE))
Wait, I'm confused now how this actually works. I don't see anywhere in either the libsoup or the libcurl fetchers where we translate curl/soup specific transport errors into these. We need to expand handling in those functions to tease out these transient issues, right? E.g. in the libcurl fetcher, I would expect to see something like:
else if (curlres != CURLE_OK)
  {
    switch (curlres)
      {
      case CURLE_COULDNT_RESOLVE_HOST:
        /* map to G_RESOLVER_ERROR_NOT_FOUND */
        break;
      case CURLE_COULDNT_CONNECT:
        /* map to G_IO_ERROR_HOST_UNREACHABLE */
        break;
      ...
      }
  }
or something similar...
In the libsoup case, I guess I’m getting the GSocket errors propagated back up through libsoup. I could definitely improve on_request_sent() in ostree-fetcher-soup.c to propagate some of the higher-level errors a bit more successfully though.

I haven’t tested with the libcurl backend.
I'm not sure it's worth trying to maintain a full mapping between the error domains; we could just squash the ones we want to retry from curl to G_IO_ERROR_HOST_UNREACHABLE or whatever.
I’ve updated my branch with some tests based on this, which caught a few more places where network retries were needed. While making these further changes I did wonder about whether I should be putting the retry support in at a lower level (e.g. in
Force-pushed from a4aa19d to 9ae3f08.
I rebased this onto master!
src/libostree/ostree-repo-pull.c (outdated)
@@ -195,6 +204,7 @@ typedef struct {
  OstreeObjectType objtype;
  guint recursion_depth; /* NB: not used anymore, though might be nice to print */
  OstreeCollectionRef *requested_ref; /* (nullable) */
  guint n_retries_remaining;
We don't actually need this one, right?
Yes we do, it’s used in queue_scan_one_metadata_object_c().
Yes, though it doesn't look like it's actually used for anything. I just compiled this branch with that line and the assignment in queue_scan_one_metadata_object_c() removed. Did you mean to make use of it elsewhere?
You are absolutely right, sorry. 🙁
Fixup pushed.
@@ -1072,11 +1072,24 @@ on_request_sent (GObject *object,
  GIOErrorEnum code;
One thing that feels odd to me is that the way this is structured, we'll be retrying the whole set of mirrors if the last one on the list returns a transient error. So e.g. the first N-1 mirrors could return NOT_FOUND and if the Nth mirror times out, we'll be retrying all N mirrors, right? Whereas I think what we want is to retry only the Nth mirror. It seems like fixing this properly would probably require indeed teaching each backend-specific fetcher about retries.

Anyway, I think it's definitely worth getting the current approach in and iterating on it, since AFAIK mirrorlists are not widely used yet.
I’d definitely prefer to get the current approach in first. That said, I’m not sure whether I will have time afterwards to work on improvements to mirror handling (I’m working on this to fix a problem with our Endless downloads timing out, and mirrors don’t really impact on that).
However, I’m not entirely convinced about the situation you’re considering. Are we ever expecting the majority of mirrors to return NOT_FOUND? Sounds like an unlikely corner case to me. Even so, OstreeFetcherSoup maintains state about its mirrors, so I think the problem could be addressed by storing a bit more state there, rather than by rearchitecting things. Specifically, instead of storing a single mirrorlist_idx pointing to the best mirror to try, store some flags for each mirror, and track whether each mirror has returned NOT_FOUND, or returned a transient error, at all. Prioritise requesting from mirrors which haven’t. (This could also be implemented by storing a priority for each mirror, and demoting it on NOT_FOUND or transient errors.)
To clarify, I wasn't suggesting you would fix those up. Just something that'd need to be done down the road. As I said, I'm not aware of any big users of mirrorlists yet, so it's not a pressing concern.
> Are we ever expecting the majority of mirrors to return NOT_FOUND? Sounds like an unlikely corner case to me.

Heh, sure. The NOT_FOUND was an example; anything "non-transient" would work. Anyway, I do agree with you that it could be done without lowering the retry logic down a level.
I don’t understand why
Allow network requests to be re-queued if they failed with a transient error, such as a socket timeout. Retry each request up to a limit (default: 5), and only then fail the entire pull and propagate the error to the caller. Add a new ostree_repo_pull_with_options() option, n-network-retries, to control the number of retries (including setting it back to the old default of 0, if the caller wants). Currently, retries are not supported for FetchDeltaSuperData requests, as they are not queued. Once they are queued, adding support for retries should be trivial. A FIXME comment has been left for this. Signed-off-by: Philip Withnall <withnall@endlessm.com>
This allows the retry code in ostree-repo-pull.c to recover from (for example) timeouts at the libsoup layer in the stack, as well as from the GSocket layer in the stack. Signed-off-by: Philip Withnall <withnall@endlessm.com>
This is exactly like the --random-500s option, except that it will cause error 408 (request timeout) to be returned, rather than error 500 (internal server error). This will be used in a following commit to test pull behaviour when timeouts occur. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Extend test-pull-repeated.sh to test error 408 as well as error 500, to ensure that the new retry-on-network-timeout code in ostree-repo-pull.c correctly retries. Rather than the 200 iterations needed for the error 500 tests, only do 5 iterations. The pull code internally does 5 retries (by default), which means a full iteration count of 25. That seems to be sufficient to make the tests reliably pass, in my testing — we can always bump it up to 200 / 5 = 40 in future if needed (to put it in parity with the error 500 tests). Signed-off-by: Philip Withnall <withnall@endlessm.com>
I rebased the PR and squashed all the fixups, since they make individual commits harder to review.
This is looking in good shape! Just a few minor comments.
    g_debug ("Should retry request (remaining: %u retries), due to transient error: %s",
             n_retries_remaining, error->message);
    return TRUE;
  }
Minor: while we're at it, maybe let's also add a g_debug on the else branch when we don't decide to retry, to sanity-check what kind of error it was?
That’s already covered by one of the g_debug()s at the top of the function, no?
g_debug ("%s: error: %u:%u %s, n_retries_remaining: %u",
G_STRFUNC, error->domain, error->code, error->message,
n_retries_remaining);
src/libostree/ostree-repo-pull.c (outdated)
@@ -1286,9 +1310,14 @@ meta_fetch_on_complete (GObject *object,
  g_assert (pull_data->n_outstanding_metadata_fetches > 0);
  pull_data->n_outstanding_metadata_fetches--;
  pull_data->n_fetched_metadata++;
There's a subtle accounting issue here. We should only be doing n_fetched_metadata++ if we actually fetched the metadata object successfully. The fact that we did this unconditionally before, even if we did have an error, didn't make a big difference, since we would shortly afterwards error out and stop the pull operation. But now that we support retries, we need to make sure we're not incrementing this counter when retrying, or we'd be counting the same object at least twice.
src/libostree/ostree-repo-pull.c (outdated)
@@ -1366,9 +1396,14 @@ static_deltapart_fetch_on_complete (GObject *object,
  g_assert (pull_data->n_outstanding_deltapart_fetches > 0);
  pull_data->n_outstanding_deltapart_fetches--;
  pull_data->n_fetched_deltaparts++;
I think the same applies here as well.
@@ -511,6 +524,13 @@ run (int argc, char **argv, GCancellable *cancellable, GError **error)
      goto out;
    }

if (!(opt_random_408s_percentage >= 0 && opt_random_408s_percentage <= 99))
Minor: the way I like to write these range type checks usually is:
if (!(0 <= opt_random_408s_percentage && opt_random_408s_percentage <= 99))
Ahh that's OK, I see you're just following the existing formatting of the 500s one.
Various of the counters already have assertions like this; add some more for total paranoia. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Ah, the CI test failures are when building with libcurl. I’ll see if I can do anything about that.
Use the same G_IO_ERROR_* values for HTTP status codes in both fetchers. The libsoup fetcher still handles a few more internal error codes than the libcurl one; this could be built on in future. Signed-off-by: Philip Withnall <withnall@endlessm.com>
Force-pushed from 7c65002 to ba8a265.
Fixed, I think, although some of the CI tests are hitting an internal error.
src/libostree/ostree-fetcher-soup.c (outdated)
@@ -1092,10 +1086,11 @@ on_request_sent (GObject *object,
  code = G_IO_ERROR_BROKEN_PIPE;
#else
  code = G_IO_ERROR_CONNECTION_CLOSED;
#endif
Did you mean to move this #endif?
argh, no, the problems of squashing two diffs
bot, retest this please
So here it only needed 2 iterations, which feels pretty low for a test with 40 iterations. But I think it checks out. We're fetching 12 objects, and the chance that one of those returns a 408 5 times in a row is... 4%? Hmm, maybe we should add a

Anyway, we can easily fix up and tweak these things later.
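The arithmetic behind that estimate can be sanity-checked. This assumes, purely for illustration, a 50% 408 rate per request and 5 consecutive failures to exhaust an object's retries; the percentage the test actually passes to --random-408s may differ.

```c
#include <assert.h>
#include <stdio.h>

int
main (void)
{
  double p = 0.5;  /* assumed per-request 408 probability */

  /* Probability that a single object hits a 408 five times in a row. */
  double per_object = 1.0;
  for (int i = 0; i < 5; i++)
    per_object *= p;

  /* Probability that at least one of the 12 fetched objects does so. */
  double none = 1.0;
  for (int i = 0; i < 12; i++)
    none *= (1.0 - per_object);
  double any_object = 1.0 - none;

  printf ("per-object: %.4f, any of 12: %.4f\n", per_object, any_object);

  assert (per_object > 0.031 && per_object < 0.032);  /* ~3.1%, i.e. the "4%" above */
  assert (any_object > 0.31 && any_object < 0.32);    /* ~32% per pull, per run */
  return 0;
}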
⚡ Test exempted: merge already tested.
Thanks! I’ve updated the follow-up #1600 to queue static delta superblock requests.
PR ostreedev#1594 added logic to retry downloads when they fail due to network errors, or due to HTTP 408 Request Timeout. In practice, pulling from Flathub frequently fails with 503 Service Unavailable. This is a transient error, so it makes sense to retry when it's encountered. There is no good code in GIOErrorEnum for this so this patch takes the lazy route of mapping it to G_IO_ERROR_TIMED_OUT, which is already treated as a transient error in _ostree_fetcher_should_retry_request(). While we're here, also map 504 Gateway Timeout to G_IO_ERROR_TIMED_OUT (which does seem correct, if lossy).
Here’s a branch which adds support for retrying each network request in a pull operation if it fails with a transient error (like a socket timeout).
Static delta superblocks are currently not supported because they aren’t queued (that’s a separate issue). If this PR is accepted, I’ll put together another one to add queuing support for them, and remove the FIXME added in this one.
One issue which remains in my testing (using qdisc to increase latency and randomly drop packets) is that DNS queries fail a lot. That’s #894, and isn’t addressed in this PR. In real life, I would expect DNS queries to fail quite a lot less often than HTTP requests (since they’re tiny), but this does warrant further investigation.
I haven’t added tests to the branch because I have no idea how to write a unit test which reliably triggers this error handling path, short of writing a mock GSocket implementation and plumbing that in (which would result in a huge test harness). Suggestions welcome. #878