
Support retrying downloads if they transiently fail #1594

Closed
pwithnall wants to merge 8 commits from the retry-downloads branch

Conversation

@pwithnall (Member) commented May 22, 2018

Here’s a branch which adds support for retrying each network request in a pull operation if it fails with a transient error (like a socket timeout).

Static delta superblocks are currently not supported because they aren’t queued (that’s a separate issue). If this PR is accepted, I’ll put together another one to add queuing support for them, and remove the FIXME added in this one.

One issue which remains in my testing (using qdisc to increase latency and randomly drop packets) is that DNS queries fail a lot. That’s #894, and isn’t addressed in this PR. In real life, I would expect DNS queries to fail quite a lot less often than HTTP requests (since they’re tiny), but this does warrant further investigation.

I haven’t added tests to the branch because I have no idea how to write a unit test which reliably triggers this error handling path, short of writing a mock GSocket implementation and plumbing that in (which would result in a huge test harness). Suggestions welcome.

#878

@pwithnall self-assigned this May 22, 2018
@cgwalters (Member) commented:
ostree-trivial-httpd already has --random-500s; it might actually work to just remove the for ... seq loop there in the test-pull-repeated.sh? Though we might need slightly more sophisticated logic like --random-500s-between=X and use "number of objects" for X or something.

@jlebon (Member) left a review comment:

Cool! How about we split those first 6 commits into a prep PR to start? They're all pretty straightforward.

  if (fetcher_queue_is_full (pull_data))
    {
      g_debug ("queuing fetch of static delta %s-%s part %u",
               fetch_data->from_revision ? fetch_data->from_revision : "empty",
@jlebon (Member):

Minor: you can use the GNU C extension ?: for this: https://gcc.gnu.org/onlinedocs/gcc/Conditionals.html#Conditionals
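
For example (a minimal sketch; rev_or_empty is a made-up helper purely to show the shape, not code from this PR):

    /* GNU C extension: `a ?: b` behaves like `a ? a : b`, but evaluates `a` only once. */
    static const char *
    rev_or_empty (const char *from_revision)
    {
      /* Equivalent to: from_revision ? from_revision : "empty" */
      return from_revision ?: "empty";
    }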

@pwithnall (Member, Author):

Done. I’ve also added b4ea90b at the beginning of the branch to convert other conditionals in the file too.

@pwithnall (Member, Author) commented May 25, 2018

> Cool! How about we split those first 6 commits into a prep PR to start? They're all pretty straightforward.

Done as #1599. I’ll rebase this one on that one shortly.

> ostree-trivial-httpd already has --random-500s; …

I’ll check that out tomorrow, thanks. Given the positive reaction to this so far, I’ll also work on queueing static delta superblocks, but I’ll do that as a separate PR so as not to block this one.

  check_outstanding_requests_handle_error (pull_data, &error);
  scan_object_queue_data_free (scan_data);

  if (should_retry_request (error, scan_data->n_retries_remaining--))
@jlebon (Member):

Why do we need to check for network errors here? scan_one_metadata_object either enqueues a request or just scans a local object, right?

@pwithnall (Member, Author):

Hmm, indeed. I blame over-enthusiastic grepping for calls to check_outstanding_requests_handle_error().

    enqueue_one_object_request_s (pull_data, g_steal_pointer (&fetch_data));
  else
    check_outstanding_requests_handle_error (pull_data, &local_error);

  if (free_fetch_data)
@jlebon (Member):

Not new here, though I wonder why we need these bools instead of using the g_steal_pointer pattern higher up. Anyway, we can look into that later!

@pwithnall (Member, Author):

I’m not sure it applies in this particular instance, but there are other similar patterns when enqueuing other structs, where g_steal_pointer (&the_struct) can’t be used because the_struct needs to be referred to again lower down, before it’s eventually potentially freed.

In any case, the code could be restructured to keep a const pointer to whatever it needs to refer to later, or something.
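
For example, something like this (purely a sketch to illustrate the idea; no claim this is how the real code should end up looking):

    /* Sketch only: borrow a pointer before handing ownership to the queue,
     * instead of tracking a free_fetch_data bool afterwards. */
    FetchObjectData *borrowed = fetch_data;   /* borrowed, not owned */

    enqueue_one_object_request_s (pull_data, g_steal_pointer (&fetch_data));

    /* fetch_data is now NULL; anything still needed below is read via
     * `borrowed`, which the queue owns and will eventually free. */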

      g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_NOT_FOUND) ||
      g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE) ||
      g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_NOT_FOUND) ||
      g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_TEMPORARY_FAILURE))
@jlebon (Member) commented May 25, 2018:

Wait, I'm confused now how this actually works. I don't see anywhere in either the libsoup or the libcurl fetchers where we translate curl/soup specific transport errors into these. We need to expand handling in those functions to tease out these transient issues, right? E.g. in the libcurl fetcher, I would expect to see something like:

      else if (curlres != CURLE_OK)
        {
           switch (curlres)
             {
             case CURLE_COULDNT_RESOLVE_HOST:
                /* map to G_RESOLVER_ERROR_NOT_FOUND */
                break;
             case CURLE_COULDNT_CONNECT:
                /* map to G_IO_ERROR_HOST_UNREACHABLE */
                break;
             ...
             }
        }

or something similar...

@pwithnall (Member, Author):

In the libsoup case, I guess I’m getting the GSocket errors propagated back up through libsoup. I could definitely improve on_request_sent() in ostree-fetcher-soup.c to propagate some of the higher level errors a bit more successfully though.

I haven’t tested with the libcurl backend.

@jlebon (Member):

I'm not sure it's worth trying to maintain a full mapping between the error domains, could just squash the ones we want to retry from curl to G_IO_ERROR_HOST_UNREACHABLE or whatever.
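
E.g. something like this (a rough sketch, not the merged code; the helper is made up and the choice of which CURLcodes count as transient is a judgement call):

    static void
    set_transport_error (GError **error, CURLcode curlres)
    {
      switch (curlres)
        {
        case CURLE_COULDNT_RESOLVE_HOST:
        case CURLE_COULDNT_CONNECT:
        case CURLE_OPERATION_TIMEDOUT:
        case CURLE_SEND_ERROR:
        case CURLE_RECV_ERROR:
          /* Squash the transient transport failures into one retryable code. */
          g_set_error (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE,
                       "While fetching: %s", curl_easy_strerror (curlres));
          break;
        default:
          g_set_error (error, G_IO_ERROR, G_IO_ERROR_FAILED,
                       "While fetching: %s", curl_easy_strerror (curlres));
          break;
        }
    }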

@pwithnall (Member, Author) commented:
> ostree-trivial-httpd already has --random-500s; …

> I’ll check that out tomorrow, thanks.

I’ve updated my branch with some tests based on this, which caught a few more places where network retries were needed.

While making these further changes I did wonder about whether I should be putting the retry support in at a lower level (e.g. in ostree-fetcher-{soup,curl}.c). However, there is an advantage to having it implemented in ostree-fetcher.c: in future, the queue code could use a timeout from one host as a hint to lower the priority of that host and prioritise trying other mirrors/peers when downloading a particular object. I haven’t implemented that here (and have no immediate plans to), but we have that flexibility in future if needed.

@jlebon (Member) left a review comment:

I rebased this onto master!

@@ -195,6 +204,7 @@ typedef struct {
  OstreeObjectType objtype;
  guint recursion_depth; /* NB: not used anymore, though might be nice to print */
  OstreeCollectionRef *requested_ref; /* (nullable) */
  guint n_retries_remaining;
@jlebon (Member):

We don't actually need this one, right?

@pwithnall (Member, Author):

Yes we do, it’s used in queue_scan_one_metadata_object_c().

@jlebon (Member):

Yes, though it doesn't look like it's actually used for anything. I just compiled this branch with that line and the assignment in queue_scan_one_metadata_object_c() removed. Did you mean to make use of it elsewhere?

@pwithnall (Member, Author):

You are absolutely right, sorry. 🙁

Fixup pushed.

@@ -1072,11 +1072,24 @@ on_request_sent (GObject *object,
  GIOErrorEnum code;
@jlebon (Member):

One thing that feels odd to me is that the way this is structured, we'll be retrying the whole set of mirrors if the last one on the list returns a transient error. So e.g. the first N-1 mirrors could return NOT_FOUND and if the Nth mirror times out, we'll be retrying all N mirrors, right? Whereas I think what we want is to retry only the Nth mirror. It seems like fixing this properly would indeed require teaching each backend-specific fetcher about retries.

Anyway, I think it's definitely worth getting the current approach in and iterating on it, since AFAIK mirrorlists are not widely used yet.

@pwithnall (Member, Author):

I’d definitely prefer to get the current approach in first. That said, I’m not sure whether I will have time afterwards to work on improvements to mirror handling (I’m working on this to fix a problem with our Endless downloads timing out, and mirrors don’t really have an impact on that).

However, I’m not entirely convinced about the situation you’re considering. Are we ever expecting the majority of mirrors to return NOT_FOUND? Sounds like an unlikely corner case to me. Even so, OstreeFetcherSoup maintains state about its mirrors, so I think the problem could be addressed by storing a bit more state there, rather than by rearchitecting things. Specifically, instead of storing a single mirrorlist_idx pointing to the best mirror to try, store some flags for each mirror, and track whether each mirror has returned NOT_FOUND, or returned a transient error, at all. Prioritise requesting from mirrors which haven’t. (This could also be implemented by storing a priority for each mirror, and demoting it on NOT_FOUND or transient errors.)
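
Roughly, the extra per-mirror state I have in mind would be something like this (just a sketch; the type and field names are made up):

    /* Sketch only: per-mirror state instead of a single mirrorlist_idx. */
    typedef struct
    {
      char     *uri;            /* base URI of this mirror */
      gboolean  saw_not_found;  /* has returned 404 for some object */
      gboolean  saw_transient;  /* has timed out or hit another transient error */
      gint      priority;       /* demoted on failures; highest priority tried first */
    } MirrorState;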

@jlebon (Member):

To clarify, I wasn't suggesting you would fix those up. Just something that'd need to be done down the road. As I said, I'm not aware of any big users of mirrorlists yet, so it's not a pressing concern.

> Are we ever expecting the majority of mirrors to return NOT_FOUND? Sounds like an unlikely corner case to me.

Heh, sure. The NOT_FOUND was an example; anything "non-transient" would work. Anyway, I do agree with you that it could be done without lowering the retry logic down a level.

@pwithnall (Member, Author) commented:
I don’t understand why test-pull-repeated.sh is failing on Travis. It reliably works fine for me locally. 😢

Commit messages pushed to the branch:

Allow network requests to be re-queued if they failed with a transient
error, such as a socket timeout. Retry each request up to a limit
(default: 5), and only then fail the entire pull and propagate the error
to the caller.

Add a new ostree_repo_pull_with_options() option, n-network-retries, to
control the number of retries (including setting it back to the old
default of 0, if the caller wants).

Currently, retries are not supported for FetchDeltaSuperData requests,
as they are not queued. Once they are queued, adding support for retries
should be trivial. A FIXME comment has been left for this.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

This allows the retry code in ostree-repo-pull.c to recover from (for
example) timeouts at the libsoup layer in the stack, as well as from the
GSocket layer in the stack.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

This is exactly like the --random-500s option, except that it will cause
error 408 (request timeout) to be returned, rather than error 500
(internal server error).

This will be used in a following commit to test pull behaviour when
timeouts occur.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Extend test-pull-repeated.sh to test error 408 as well as error 500, to
ensure that the new retry-on-network-timeout code in ostree-repo-pull.c
correctly retries.

Rather than the 200 iterations needed for the error 500 tests, only do 5
iterations. The pull code internally does 5 retries (by default), which
means a full iteration count of 25. That seems to be sufficient to make
the tests reliably pass, in my testing — we can always bump it up to 200
/ 5 = 40 in future if needed (to put it in parity with the error 500
tests).

Signed-off-by: Philip Withnall <withnall@endlessm.com>
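
(As an aside, here is a hedged usage sketch of the n-network-retries option described in the first commit message above. The a{sv} options pattern and the function signature are the existing ostree_repo_pull_with_options() API; the value type is assumed to be a uint32, and repo, cancellable, error and the "origin" remote name are just example variables.)

    g_autoptr(GVariantBuilder) builder = g_variant_builder_new (G_VARIANT_TYPE ("a{sv}"));
    /* Lower the retry limit from the default of 5 to 2. */
    g_variant_builder_add (builder, "{s@v}", "n-network-retries",
                           g_variant_new_variant (g_variant_new_uint32 (2)));
    g_autoptr(GVariant) options = g_variant_ref_sink (g_variant_builder_end (builder));

    if (!ostree_repo_pull_with_options (repo, "origin", options,
                                        NULL /* progress */, cancellable, &error))
      return FALSE;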
@jlebon (Member) commented May 29, 2018

I rebased the PR and squashed all the fixups, since they make individual commits harder to review.

@jlebon (Member) left a review comment:

This is looking in good shape! Just a few minor comments.

g_debug ("Should retry request (remaining: %u retries), due to transient error: %s",
n_retries_remaining, error->message);
return TRUE;
}
@jlebon (Member):

Minor: while we're at it, maybe let's also add a g_debug on the else branch when we don't decide to retry to sanity check what kind of error it was?

@pwithnall (Member, Author):

That’s already covered by one of the g_debug()s at the top of the function, no?

    g_debug ("%s: error: %u:%u %s, n_retries_remaining: %u",
             G_STRFUNC, error->domain, error->code, error->message,
             n_retries_remaining);
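
(For reference, the overall shape of the predicate as discussed in this thread. This is a sketch assembled from the snippets quoted above; the merged _ostree_fetcher_should_retry_request() may differ in the exact set of errors it matches.)

    static gboolean
    _ostree_fetcher_should_retry_request (const GError *error,
                                          guint         n_retries_remaining)
    {
      if (error == NULL || n_retries_remaining == 0)
        return FALSE;

      g_debug ("%s: error: %u:%u %s, n_retries_remaining: %u",
               G_STRFUNC, error->domain, error->code, error->message,
               n_retries_remaining);

      if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_TIMED_OUT) ||
          g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_NOT_FOUND) ||
          g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE) ||
          g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_NOT_FOUND) ||
          g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_TEMPORARY_FAILURE))
        {
          g_debug ("Should retry request (remaining: %u retries), due to transient error: %s",
                   n_retries_remaining, error->message);
          return TRUE;
        }

      return FALSE;
    }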

@@ -1286,9 +1310,14 @@ meta_fetch_on_complete (GObject *object,
  g_assert (pull_data->n_outstanding_metadata_fetches > 0);
  pull_data->n_outstanding_metadata_fetches--;
  pull_data->n_fetched_metadata++;
@jlebon (Member):

There's a subtle accounting issue here. We should only be doing n_fetched_metadata++ if we actually fetched the metadata object successfully. The fact that we did this unconditionally before, even if we did have an error, didn't make a big difference since we would shortly afterwards error out and stop the pull operation. But now that we support retries, we need to make sure we're not incrementing this counter if we're retrying or we'd be counting it at least twice.
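
I.e. roughly this (a sketch, not the exact diff):

    g_assert (pull_data->n_outstanding_metadata_fetches > 0);
    pull_data->n_outstanding_metadata_fetches--;

    if (local_error == NULL)
      pull_data->n_fetched_metadata++;   /* count only fetches that succeeded */
    /* ...otherwise fall through to the retry/error handling below without
     * touching n_fetched_metadata, so a retried fetch is only counted once. */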

@@ -1366,9 +1396,14 @@ static_deltapart_fetch_on_complete (GObject *object,
  g_assert (pull_data->n_outstanding_deltapart_fetches > 0);
  pull_data->n_outstanding_deltapart_fetches--;
  pull_data->n_fetched_deltaparts++;
@jlebon (Member):

I think the same applies here as well.

@@ -511,6 +524,13 @@ run (int argc, char **argv, GCancellable *cancellable, GError **error)
      goto out;
    }

  if (!(opt_random_408s_percentage >= 0 && opt_random_408s_percentage <= 99))
@jlebon (Member):

Minor: the way I like to write these range type checks usually is:

if (!(0 <= opt_random_408s_percentage && opt_random_408s_percentage <= 99))

@jlebon (Member):

Ahh that's OK, I see you're just following the existing formatting of the 500s one.

A further commit was pushed to the branch:

Various of the counters already have assertions like this; add some more
for total paranoia.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
@pwithnall (Member, Author) commented:
Ah, the CI test failures are when building with libcurl. I’ll see if I can do anything about that.

A further commit was pushed to the branch:

Use the same G_IO_ERROR_* values for HTTP status codes in both fetchers.
The libsoup fetcher still handles a few more internal error codes than
the libcurl one; this could be built on in future.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
@pwithnall (Member, Author) commented:
> Ah, the CI test failures are when building with libcurl. I’ll see if I can do anything about that.

Fixed, I think, although some of the CI tests are hitting an internal error.

@@ -1092,10 +1086,11 @@ on_request_sent (GObject *object,
          code = G_IO_ERROR_BROKEN_PIPE;
#else
          code = G_IO_ERROR_CONNECTION_CLOSED;
#endif
@jlebon (Member):

Did you mean to move this #endif?

@pwithnall (Member, Author):

Argh, no: the problems of squashing two diffs.

@jlebon (Member) commented May 30, 2018

bot, retest this please

@jlebon (Member) commented May 30, 2018

    /tmp/test-tmp-libostree_test-pull-repeated.sh.test-HFQKJZ /tmp/test-tmp-libostree_test-pull-repeated.sh.test-HFQKJZ
    7 metadata, 5 content objects fetched; 1 KiB transferred in 0 seconds
    Success on iteration 2

So here it only needed 2 iterations, which feels pretty low for a test with 40 iterations. But I think it checks out. We're fetching 12 objects, and the chance that one of those returns a 408 5 times in a row is... 4%? Hmm, maybe we should add an ostree pull --network-retries=N switch to lower retries to 2 or something. That'd be useful for general use too. The fact that it's greater than 1 at least is comforting.
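
(Back-of-the-envelope, under assumptions I'm making up: if each request independently gets a 408 with probability p, an object only fails after 5 consecutive 408s, and a pull fetches 12 objects, then P(pull fails) = 1 - (1 - p^5)^12, which is roughly 32% at p = 0.5 and about 1.2% at p = 0.25.)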

Anyway, we can easily fix up and tweak these things later.
This looks great overall! Thanks a lot for working on this.

@rh-atomic-bot r+ e283a05

@rh-atomic-bot:
⚡ Test exempted: merge already tested.

rh-atomic-bot pushed the commits listed above on May 30, 2018, each referencing this pull request (Closes: #1594; Approved by: jlebon).
@pwithnall deleted the retry-downloads branch May 30, 2018 16:30
@pwithnall (Member, Author) commented:

Thanks! I’ve updated the follow-up #1600 to queue static delta superblock requests.

wjt added a commit to wjt/ostree that referenced this pull request May 1, 2019
PR ostreedev#1594 added logic to retry downloads when they fail due to network
errors, or due to HTTP 408 Request Timeout.

In practice, pulling from Flathub frequently fails with 503 Service
Unavailable. This is a transient error, so it makes sense to retry when
it's encountered.

There is no good code in GIOErrorEnum for this so this patch takes the
lazy route of mapping it to G_IO_ERROR_TIMED_OUT, which is already
treated as a transient error in _ostree_fetcher_should_retry_request().

While we're here, also map 504 Gateway Timeout to G_IO_ERROR_TIMED_OUT
(which does seem correct, if lossy).
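
(A hedged sketch of the mapping that commit message describes; the helper name and the default branch are assumptions, not the actual patch.)

    static GIOErrorEnum
    io_error_for_http_status (guint status_code)
    {
      switch (status_code)
        {
        case 408: /* Request Timeout */
        case 504: /* Gateway Timeout: a genuine timeout, if a little lossy */
          return G_IO_ERROR_TIMED_OUT;
        case 503: /* Service Unavailable: transient, but GIOErrorEnum has no
                   * better fit, so reuse TIMED_OUT, which is already retried */
          return G_IO_ERROR_TIMED_OUT;
        default:
          return G_IO_ERROR_FAILED;
        }
    }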