
Support retrying downloads if they transiently fail #1594

Closed
pwithnall wants to merge 8 commits from the retry-downloads branch

Conversation

@pwithnall (Member) commented May 22, 2018

Here’s a branch which adds support for retrying each network request in a pull operation if it fails with a transient error (like a socket timeout).

Static delta superblocks are currently not supported because they aren’t queued (that’s a separate issue). If this PR is accepted, I’ll put together another one to add queuing support for them, and remove the FIXME added in this one.

One issue which remains in my testing (using qdisc to increase latency and randomly drop packets) is that DNS queries fail a lot. That’s #894, and isn’t addressed in this PR. In real life, I would expect DNS queries to fail quite a lot less often than HTTP requests (since they’re tiny), but this does warrant further investigation.

I haven’t added tests to the branch because I have no idea how to write a unit test which reliably triggers this error handling path, short of writing a mock GSocket implementation and plumbing that in (which would result in a huge test harness). Suggestions welcome.

#878

@pwithnall self-assigned this May 22, 2018
@cgwalters (Member) commented:
ostree-trivial-httpd already has --random-500s; it might actually work to just remove the for ... seq loop there in the test-pull-repeated.sh? Though we might need slightly more sophisticated logic like --random-500s-between=X and use "number of objects" for X or something.

@jlebon (Member) left a review comment:

Cool! How about we split those first 6 commits into a prep PR to start? They're all pretty straightforward.

  if (fetcher_queue_is_full (pull_data))
    {
      g_debug ("queuing fetch of static delta %s-%s part %u",
               fetch_data->from_revision ? fetch_data->from_revision : "empty",
@jlebon (Member):

Minor: you can use the GNU C extension ?: for this: https://gcc.gnu.org/onlinedocs/gcc/Conditionals.html#Conditionals
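
For example (a minimal sketch; rev_or_empty is a made-up helper purely to show the shape, not code from this PR):

    /* GNU C extension: `a ?: b` behaves like `a ? a : b`, but evaluates `a` only once. */
    static const char *
    rev_or_empty (const char *from_revision)
    {
      /* Equivalent to: from_revision ? from_revision : "empty" */
      return from_revision ?: "empty";
    }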

@pwithnall (Member, Author):

Done. I’ve also added b4ea90b at the beginning of the branch to convert other conditionals in the file too.

@pwithnall (Member, Author) commented May 25, 2018

> Cool! How about we split those first 6 commits into a prep PR to start? They're all pretty straightforward.

Done as #1599. I’ll rebase this one on that one shortly.

> ostree-trivial-httpd already has --random-500s; …

I’ll check that out tomorrow, thanks. Given the positive reaction to this so far, I’ll also work on queueing static delta superblocks, but I’ll do that as a separate PR so as not to block this one.

  check_outstanding_requests_handle_error (pull_data, &error);
  scan_object_queue_data_free (scan_data);

  if (should_retry_request (error, scan_data->n_retries_remaining--))
@jlebon (Member):

Why do we need to check for network errors here? scan_one_metadata_object either enqueues a request or just scans a local object, right?

@pwithnall (Member, Author):

Hmm, indeed. I blame over-enthusiastic grepping for calls to check_outstanding_requests_handle_error().

    enqueue_one_object_request_s (pull_data, g_steal_pointer (&fetch_data));
  else
    check_outstanding_requests_handle_error (pull_data, &local_error);

  if (free_fetch_data)
@jlebon (Member):

Not new here, though I wonder why we need these bools instead of using the g_steal_pointer pattern higher up. Anyway, we can look into that later!

@pwithnall (Member, Author):

I’m not sure it applies in this particular instance, but there are other similar patterns when enqueuing other structs, where g_steal_pointer (&the_struct) can’t be used because the_struct needs to be referred to again lower down, before it’s eventually potentially freed.

In any case, the code could be restructured to keep a const pointer to whatever it needs to refer to later, or something.
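
For example, something like this (purely a sketch to illustrate the idea; no claim this is how the real code should end up looking):

    /* Sketch only: borrow a pointer before handing ownership to the queue,
     * instead of tracking a free_fetch_data bool afterwards. */
    FetchObjectData *borrowed = fetch_data;   /* borrowed, not owned */

    enqueue_one_object_request_s (pull_data, g_steal_pointer (&fetch_data));

    /* fetch_data is now NULL; anything still needed below is read via
     * `borrowed`, which the queue owns and will eventually free. */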

      g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_NOT_FOUND) ||
      g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE) ||
      g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_NOT_FOUND) ||
      g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_TEMPORARY_FAILURE))
@jlebon (Member) commented May 25, 2018:

Wait, I'm confused now how this actually works. I don't see anywhere in either the libsoup or the libcurl fetchers where we translate curl/soup specific transport errors into these. We need to expand handling in those functions to tease out these transient issues, right? E.g. in the libcurl fetcher, I would expect to see something like:

      else if (curlres != CURLE_OK)
        {
           switch (curlres)
             {
             case CURLE_COULDNT_RESOLVE_HOST:
                /* map to G_RESOLVER_ERROR_NOT_FOUND */
                break;
             case CURLE_COULDNT_CONNECT:
                /* map to G_IO_ERROR_HOST_UNREACHABLE */
                break;
             ...
             }
        }

or something similar...

@pwithnall (Member, Author):

In the libsoup case, I guess I’m getting the GSocket errors propagated back up through libsoup. I could definitely improve on_request_sent() in ostree-fetcher-soup.c to propagate some of the higher level errors a bit more successfully though.

I haven’t tested with the libcurl backend.

@jlebon (Member):

I'm not sure it's worth trying to maintain a full mapping between the error domains, could just squash the ones we want to retry from curl to G_IO_ERROR_HOST_UNREACHABLE or whatever.
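
E.g. something like this (a rough sketch, not the merged code; the helper is made up and the choice of which CURLcodes count as transient is a judgement call):

    static void
    set_transport_error (GError **error, CURLcode curlres)
    {
      switch (curlres)
        {
        case CURLE_COULDNT_RESOLVE_HOST:
        case CURLE_COULDNT_CONNECT:
        case CURLE_OPERATION_TIMEDOUT:
        case CURLE_SEND_ERROR:
        case CURLE_RECV_ERROR:
          /* Squash the transient transport failures into one retryable code. */
          g_set_error (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE,
                       "While fetching: %s", curl_easy_strerror (curlres));
          break;
        default:
          g_set_error (error, G_IO_ERROR, G_IO_ERROR_FAILED,
                       "While fetching: %s", curl_easy_strerror (curlres));
          break;
        }
    }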

@pwithnall (Member, Author) commented:
> ostree-trivial-httpd already has --random-500s; …

> I’ll check that out tomorrow, thanks.

I’ve updated my branch with some tests based on this, which caught a few more places where network retries were needed.

While making these further changes I did wonder about whether I should be putting the retry support in at a lower level (e.g. in ostree-fetcher-{soup,curl}.c). However, there is an advantage to having it implemented in ostree-fetcher.c: in future, the queue code could use a timeout from one host as a hint to lower the priority of that host and prioritise trying other mirrors/peers when downloading a particular object. I haven’t implemented that here (and have no immediate plans to), but we have that flexibility in future if needed.

@jlebon (Member) left a review comment:

I rebased this onto master!

@@ -195,6 +204,7 @@ typedef struct {
  OstreeObjectType objtype;
  guint recursion_depth; /* NB: not used anymore, though might be nice to print */
  OstreeCollectionRef *requested_ref; /* (nullable) */
  guint n_retries_remaining;
@jlebon (Member):

We don't actually need this one, right?

@pwithnall (Member, Author):

Yes we do, it’s used in queue_scan_one_metadata_object_c().

@jlebon (Member):

Yes, though it doesn't look like it's actually used for anything. I just compiled this branch with that line and the assignment in queue_scan_one_metadata_object_c() removed. Did you mean to make use of it elsewhere?

@pwithnall (Member, Author):

You are absolutely right, sorry. 🙁

Fixup pushed.

@@ -1072,11 +1072,24 @@ on_request_sent (GObject *object,
  GIOErrorEnum code;
@jlebon (Member):

One thing that feels odd to me is that the way this is structured, we'll be retrying the whole set of mirrors if the last one on the list returns a transient error. So e.g. the first N-1 mirrors could return NOT_FOUND and if the Nth mirror times out, we'll be retrying all N mirrors, right? Whereas I think what we want is to retry only the Nth mirror. It seems like fixing this properly would indeed require teaching each backend-specific fetcher about retries.

Anyway, I think it's definitely worth getting the current approach in and iterating on it, since AFAIK mirrorlists are not widely used yet.

@pwithnall (Member, Author):

I’d definitely prefer to get the current approach in first. That said, I’m not sure whether I will have time afterwards to work on improvements to mirror handling (I’m working on this to fix a problem with our Endless downloads timing out, and mirrors don’t really have an impact on that).

However, I’m not entirely convinced about the situation you’re considering. Are we ever expecting the majority of mirrors to return NOT_FOUND? Sounds like an unlikely corner case to me. Even so, OstreeFetcherSoup maintains state about its mirrors, so I think the problem could be addressed by storing a bit more state there, rather than by rearchitecting things. Specifically, instead of storing a single mirrorlist_idx pointing to the best mirror to try, store some flags for each mirror, and track whether each mirror has returned NOT_FOUND, or returned a transient error, at all. Prioritise requesting from mirrors which haven’t. (This could also be implemented by storing a priority for each mirror, and demoting it on NOT_FOUND or transient errors.)
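
Roughly, the extra per-mirror state I have in mind would be something like this (just a sketch; the type and field names are made up):

    /* Sketch only: per-mirror state instead of a single mirrorlist_idx. */
    typedef struct
    {
      char     *uri;            /* base URI of this mirror */
      gboolean  saw_not_found;  /* has returned 404 for some object */
      gboolean  saw_transient;  /* has timed out or hit another transient error */
      gint      priority;       /* demoted on failures; highest priority tried first */
    } MirrorState;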

@jlebon (Member):

To clarify, I wasn't suggesting you would fix those up. Just something that'd need to be done down the road. As I said, I'm not aware of any big users of mirrorlists yet, so it's not a pressing concern.

> Are we ever expecting the majority of mirrors to return NOT_FOUND? Sounds like an unlikely corner case to me.

Heh, sure. The NOT_FOUND was an example; anything "non-transient" would work. Anyway, I do agree with you that it could be done without lowering the retry logic down a level.

@pwithnall (Member, Author) commented:
I don’t understand why test-pull-repeated.sh is failing on Travis. It reliably works fine for me locally. 😢

Commit messages pushed to the branch:

Allow network requests to be re-queued if they failed with a transient
error, such as a socket timeout. Retry each request up to a limit
(default: 5), and only then fail the entire pull and propagate the error
to the caller.

Add a new ostree_repo_pull_with_options() option, n-network-retries, to
control the number of retries (including setting it back to the old
default of 0, if the caller wants).

Currently, retries are not supported for FetchDeltaSuperData requests,
as they are not queued. Once they are queued, adding support for retries
should be trivial. A FIXME comment has been left for this.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

This allows the retry code in ostree-repo-pull.c to recover from (for
example) timeouts at the libsoup layer in the stack, as well as from the
GSocket layer in the stack.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

This is exactly like the --random-500s option, except that it will cause
error 408 (request timeout) to be returned, rather than error 500
(internal server error).

This will be used in a following commit to test pull behaviour when
timeouts occur.

Signed-off-by: Philip Withnall <withnall@endlessm.com>

Extend test-pull-repeated.sh to test error 408 as well as error 500, to
ensure that the new retry-on-network-timeout code in ostree-repo-pull.c
correctly retries.

Rather than the 200 iterations needed for the error 500 tests, only do 5
iterations. The pull code internally does 5 retries (by default), which
means a full iteration count of 25. That seems to be sufficient to make
the tests reliably pass, in my testing — we can always bump it up to 200
/ 5 = 40 in future if needed (to put it in parity with the error 500
tests).

Signed-off-by: Philip Withnall <withnall@endlessm.com>
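
(As an aside, here is a hedged usage sketch of the n-network-retries option described in the first commit message above. The a{sv} options pattern and the function signature are the existing ostree_repo_pull_with_options() API; the value type is assumed to be a uint32, and repo, cancellable, error and the "origin" remote name are just example variables.)

    g_autoptr(GVariantBuilder) builder = g_variant_builder_new (G_VARIANT_TYPE ("a{sv}"));
    /* Lower the retry limit from the default of 5 to 2. */
    g_variant_builder_add (builder, "{s@v}", "n-network-retries",
                           g_variant_new_variant (g_variant_new_uint32 (2)));
    g_autoptr(GVariant) options = g_variant_ref_sink (g_variant_builder_end (builder));

    if (!ostree_repo_pull_with_options (repo, "origin", options,
                                        NULL /* progress */, cancellable, &error))
      return FALSE;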
@jlebon (Member) commented May 29, 2018

I rebased the PR and squashed all the fixups, since they make individual commits harder to review.

@jlebon (Member) left a review comment:

This is looking in good shape! Just a few minor comments.

g_debug ("Should retry request (remaining: %u retries), due to transient error: %s",
n_retries_remaining, error->message);
return TRUE;
}
@jlebon (Member):

Minor: while we're at it, maybe let's also add a g_debug on the else branch when we don't decide to retry to sanity check what kind of error it was?

@pwithnall (Member, Author):

That’s already covered by one of the g_debug()s at the top of the function, no?

    g_debug ("%s: error: %u:%u %s, n_retries_remaining: %u",
             G_STRFUNC, error->domain, error->code, error->message,
             n_retries_remaining);
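
(For reference, the overall shape of the predicate as discussed in this thread. This is a sketch assembled from the snippets quoted above; the merged _ostree_fetcher_should_retry_request() may differ in the exact set of errors it matches.)

    static gboolean
    _ostree_fetcher_should_retry_request (const GError *error,
                                          guint         n_retries_remaining)
    {
      if (error == NULL || n_retries_remaining == 0)
        return FALSE;

      g_debug ("%s: error: %u:%u %s, n_retries_remaining: %u",
               G_STRFUNC, error->domain, error->code, error->message,
               n_retries_remaining);

      if (g_error_matches (error, G_IO_ERROR, G_IO_ERROR_TIMED_OUT) ||
          g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_NOT_FOUND) ||
          g_error_matches (error, G_IO_ERROR, G_IO_ERROR_HOST_UNREACHABLE) ||
          g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_NOT_FOUND) ||
          g_error_matches (error, G_RESOLVER_ERROR, G_RESOLVER_ERROR_TEMPORARY_FAILURE))
        {
          g_debug ("Should retry request (remaining: %u retries), due to transient error: %s",
                   n_retries_remaining, error->message);
          return TRUE;
        }

      return FALSE;
    }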

@@ -1286,9 +1310,14 @@ meta_fetch_on_complete (GObject *object,
  g_assert (pull_data->n_outstanding_metadata_fetches > 0);
  pull_data->n_outstanding_metadata_fetches--;
  pull_data->n_fetched_metadata++;
@jlebon (Member):

There's a subtle accounting issue here. We should only be doing n_fetched_metadata++ if we actually fetched the metadata object successfully. The fact that we did this unconditionally before, even if we did have an error, didn't make a big difference since we would shortly afterwards error out and stop the pull operation. But now that we support retries, we need to make sure we're not incrementing this counter if we're retrying or we'd be counting it at least twice.
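
I.e. roughly this (a sketch, not the exact diff):

    g_assert (pull_data->n_outstanding_metadata_fetches > 0);
    pull_data->n_outstanding_metadata_fetches--;

    if (local_error == NULL)
      pull_data->n_fetched_metadata++;   /* count only fetches that succeeded */
    /* ...otherwise fall through to the retry/error handling below without
     * touching n_fetched_metadata, so a retried fetch is only counted once. */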

@@ -1366,9 +1396,14 @@ static_deltapart_fetch_on_complete (GObject *object,
  g_assert (pull_data->n_outstanding_deltapart_fetches > 0);
  pull_data->n_outstanding_deltapart_fetches--;
  pull_data->n_fetched_deltaparts++;
@jlebon (Member):

I think the same applies here as well.

@@ -511,6 +524,13 @@ run (int argc, char **argv, GCancellable *cancellable, GError **error)
      goto out;
    }

  if (!(opt_random_408s_percentage >= 0 && opt_random_408s_percentage <= 99))
@jlebon (Member):

Minor: the way I like to write these range type checks usually is:

if (!(0 <= opt_random_408s_percentage && opt_random_408s_percentage <= 99))

@jlebon (Member):

Ahh that's OK, I see you're just following the existing formatting of the 500s one.

A further commit was pushed to the branch:

Various of the counters already have assertions like this; add some more
for total paranoia.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
@pwithnall (Member, Author) commented:
Ah, the CI test failures are when building with libcurl. I’ll see if I can do anything about that.

A further commit was pushed to the branch:

Use the same G_IO_ERROR_* values for HTTP status codes in both fetchers.
The libsoup fetcher still handles a few more internal error codes than
the libcurl one; this could be built on in future.

Signed-off-by: Philip Withnall <withnall@endlessm.com>
@pwithnall (Member, Author) commented:
> Ah, the CI test failures are when building with libcurl. I’ll see if I can do anything about that.

Fixed, I think, although some of the CI tests are hitting an internal error.

@@ -1092,10 +1086,11 @@ on_request_sent (GObject *object,
          code = G_IO_ERROR_BROKEN_PIPE;
#else
          code = G_IO_ERROR_CONNECTION_CLOSED;
#endif
@jlebon (Member):

Did you mean to move this #endif?

@pwithnall (Member, Author):

Argh, no: the problems of squashing two diffs.

@jlebon (Member) commented May 30, 2018

bot, retest this please

@jlebon (Member) commented May 30, 2018

    /tmp/test-tmp-libostree_test-pull-repeated.sh.test-HFQKJZ /tmp/test-tmp-libostree_test-pull-repeated.sh.test-HFQKJZ
    7 metadata, 5 content objects fetched; 1 KiB transferred in 0 seconds
    Success on iteration 2

So here it only needed 2 iterations, which feels pretty low for a test with 40 iterations. But I think it checks out. We're fetching 12 objects, and the chance that one of those returns a 408 5 times in a row is... 4%? Hmm, maybe we should add an ostree pull --network-retries=N switch to lower retries to 2 or something. That'd be useful for general use too. The fact that it's greater than 1 at least is comforting.
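
(Back-of-the-envelope, under assumptions I'm making up: if each request independently gets a 408 with probability p, an object only fails after 5 consecutive 408s, and a pull fetches 12 objects, then P(pull fails) = 1 - (1 - p^5)^12, which is roughly 32% at p = 0.5 and about 1.2% at p = 0.25.)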

Anyway, we can easily fix up and tweak these things later.
This looks great overall! Thanks a lot for working on this.

@rh-atomic-bot r+ e283a05

@rh-atomic-bot:
⚡ Test exempted: merge already tested.

rh-atomic-bot pushed the commits listed above on May 30, 2018, each referencing this pull request (Closes: #1594; Approved by: jlebon).
@pwithnall deleted the retry-downloads branch May 30, 2018 16:30
@pwithnall (Member, Author) commented:

Thanks! I’ve updated the follow-up #1600 to queue static delta superblock requests.

wjt added a commit to wjt/ostree that referenced this pull request May 1, 2019
PR ostreedev#1594 added logic to retry downloads when they fail due to network
errors, or due to HTTP 408 Request Timeout.

In practice, pulling from Flathub frequently fails with 503 Service
Unavailable. This is a transient error, so it makes sense to retry when
it's encountered.

There is no good code in GIOErrorEnum for this so this patch takes the
lazy route of mapping it to G_IO_ERROR_TIMED_OUT, which is already
treated as a transient error in _ostree_fetcher_should_retry_request().

While we're here, also map 504 Gateway Timeout to G_IO_ERROR_TIMED_OUT
(which does seem correct, if lossy).
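
(A hedged sketch of the mapping that commit message describes; the helper name and the default branch are assumptions, not the actual patch.)

    static GIOErrorEnum
    io_error_for_http_status (guint status_code)
    {
      switch (status_code)
        {
        case 408: /* Request Timeout */
        case 504: /* Gateway Timeout: a genuine timeout, if a little lossy */
          return G_IO_ERROR_TIMED_OUT;
        case 503: /* Service Unavailable: transient, but GIOErrorEnum has no
                   * better fit, so reuse TIMED_OUT, which is already retried */
          return G_IO_ERROR_TIMED_OUT;
        default:
          return G_IO_ERROR_FAILED;
        }
    }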