This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

Backoff not sufficient #425

Closed
peterbe opened this issue Apr 24, 2018 · 7 comments

Comments

@peterbe
Contributor

peterbe commented Apr 24, 2018

See https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/1594248/

I wrote in my comment:

https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-24-01-36-04-mozilla-central-l10n/ does indeed exist. Now. That means we didn't back off long enough.

This is a lambda event that depends on fetching https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-24-01-36-04-mozilla-central-l10n/ and it tried 3 times and eventually had to give up. Since the URL now 200 OKs, it means it didn't wait long enough.

We recently changed latest-inventory-to-kinto so that fetch_json no longer does its own backoff; only the lambda function does. We can now increase the backoff configuration to either retry more times or use longer pauses.
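The retry-with-exponential-backoff shape being discussed can be sketched in a few lines. This is a hypothetical illustration, not Buildhub's actual code: the function name, parameters, and defaults are assumptions.

```python
import time


def fetch_with_backoff(fetch, url, max_tries=5, base=3, factor=0.4, sleep=time.sleep):
    """Call fetch(url), retrying on failure with exponential backoff.

    Sleeps factor * base**n seconds between attempts (n = 0, 1, 2, ...),
    so max_tries attempts means at most max_tries - 1 sleeps.
    Re-raises the last error if every attempt fails.
    """
    for attempt in range(max_tries):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries; give up
            sleep(factor * base ** attempt)
```

Passing `sleep` in makes the wait schedule easy to test without actually sleeping.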

@peterbe
Contributor Author

peterbe commented Apr 24, 2018

@leplatrem Can you sanity check this issue? It seems we could simply increase the number of retries and/or the backoff intervals, since this only matters for the lambda function.
We have a cap of 5 minutes (I think), which should be sufficiently long to retry more patiently.

@peterbe
Contributor Author

peterbe commented Apr 25, 2018

Here's another example. At the time of writing, https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-25-10-01-22-mozilla-central/firefox-61.0a1.en-US.linux-i686.json is a perfectly fine 200 OK. But in https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4284392/ it failed with a ClientResponseError.

@leplatrem
Collaborator

Indeed, we could retry more times, or switch to some exponential interval (https://github.com/litl/backoff/blob/master/backoff/_wait_gen.py).

5 min will give us a pretty good margin. However, we have to be prepared for the fact that sometimes the *-l10n/ folder may take more time to appear.
The code was supposed to handle it like this:

except ValueError:
    files = []  # No -l10n/ folder published yet.

Which is different from the case in https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4284392/ where we can see in the logs:

- Processing firefox nightly metadata: pub/firefox/nightly/2018/04/2018-04-26-10-00-55-mozilla-central/firefox-61.0a1.en-US.win32.json
- Fetch new nightly metadata
- GET 'https://archive.mozilla.org/pub/firefox/nightly/2018/04/2018-04-26-10-00-55-mozilla-central/firefox-61.0a1.en-US.win32.json'
- Backing off fetch_json(...) for 0.4s (aiohttp.client_exceptions.ClientResponseError: 404, message='Not Found')

Honestly, in this case I find it very weird that it takes so much time from the S3 event to the appearance on the JSON API via the bucket lister. Maybe oremj has ideas...

Otherwise, using an AWS client we might have a better chance of fetching it immediately...

@peterbe
Contributor Author

peterbe commented Apr 27, 2018

In the Sentry entry it said it first failed, then backed off for 0.4s, then backed off a second time for 1.2s, and ultimately gave up when the 3rd attempt failed.

I honestly don't know where that 0.4 comes from. Why 0.4 and not 0.5 or 0.1 or 123.456?
Anyway, if you use backoff._wait_gen.expo with (base=3, factor=0.4) you get this series:

0.4
1.2
3.6
10.8
32.4
97.2
291.6
874.8
2624.4
7873.2
23619.6

Meaning, if we change from 3 max. retries to 5 max. retries we'll get:

*attempt*
sleep(0.4)
*attempt*
sleep(1.2)
*attempt*
sleep(3.6)
*attempt*
sleep(10.8)
*attempt*
**give up!**

Total of 16 seconds.

Change it to 6 and you get a total max. sleep of 48.4 seconds. Surely that should be enough; that's almost a whole minute.
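The arithmetic above is easy to verify. The sketch below reimplements the same factor * base**n formula as backoff's expo wait generator (standalone, rather than importing the library):

```python
def expo(base=3, factor=0.4):
    """Yield exponentially growing wait times: factor * base**n for n = 0, 1, 2, ..."""
    n = 0
    while True:
        yield factor * base ** n
        n += 1


def total_sleep(max_tries, base=3, factor=0.4):
    """Total time slept across all retries: max_tries attempts
    means max_tries - 1 sleeps between them."""
    waits = expo(base, factor)
    return sum(next(waits) for _ in range(max_tries - 1))
```

With base=3 and factor=0.4 this reproduces the series above: 5 tries sleeps 0.4 + 1.2 + 3.6 + 10.8 = 16 seconds in total, and 6 tries adds another 32.4 seconds for 48.4 in total.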

@peterbe
Contributor Author

peterbe commented Apr 27, 2018

#432 just sets it to 6. I'll check with Wei that this isn't overwritten in the env of Stage or Prod.

I'm not excited to dwell on this much more. 48 seconds is well under 5 minutes (the AWS Lambda max) and, God forbid, if it takes longer than 48 seconds the scraper will have to fix it later. Also, once we get the new buildhub.json we won't have to read any JSON listings at all anymore.

@peterbe
Contributor Author

peterbe commented May 14, 2018

It happened again :(
Even though we have retry_on_notfound=True and plenty of backoff, it happened.
https://sentry.prod.mozaws.net/operations/buildhub-stage/issues/4326570/
I'm just going to increase the backoff a bit more.
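One way such a retry_on_notfound flag could be wired up is as a giveup predicate that treats a 404 as transient (the file may simply not be published yet) while still aborting immediately on other 4xx responses. This is an illustrative sketch with a stand-in exception class, not Buildhub's actual implementation:

```python
class ClientResponseError(Exception):
    """Stand-in for aiohttp.client_exceptions.ClientResponseError."""

    def __init__(self, status):
        super().__init__(str(status))
        self.status = status


def give_up(error, retry_on_notfound=True):
    """Return True when an error should NOT be retried.

    With retry_on_notfound=True, a 404 is retried like any other
    transient failure; all other 4xx client errors abort immediately.
    Non-HTTP errors (timeouts, connection resets, ...) are always retried.
    """
    if isinstance(error, ClientResponseError) and 400 <= error.status < 500:
        return not (error.status == 404 and retry_on_notfound)
    return False
```

A predicate of this shape is what the backoff library's on_exception decorator accepts via its giveup parameter.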

@peterbe peterbe reopened this May 14, 2018
@peterbe
Contributor Author

peterbe commented May 15, 2018

Oops. In #466 I forgot to use the "fixes #425" suffix.

@peterbe peterbe closed this as completed May 15, 2018