bug-1809500: make crashmover run synchronously#963
Conversation
2c7a813 to
0c4ccbf
Compare
0c4ccbf to
43afaf1
Compare
willkg
left a comment
There was a problem hiding this comment.
I have some things that should get changed. After that, I'll go through and look at whether this emits the same metrics and verify some other things.
This is awesome! This removes a ton of complexity from the application and makes it a lot more straight-forward and easier to reason about. I'm excited you're working through this.
antenna/crashmover.py
Outdated
| # After MAX_ATTEMPTS, we give up on this crash | ||
| LOGGER.error(f"{crash_id}: too many errors trying to publish; dropped") | ||
| MYMETRICS.incr("publish_crash_dropped.count") | ||
| return False |
There was a problem hiding this comment.
If we've gotten here, then the crash report data is in S3 and all we need to do is publish the crash id to the processing queue. If this fails, Socorro will notice. It's got a daily task that looks for raw crash files for which there's no corresponding processed crash file and add an entry to the missing processed crashes table. You can see that here:
https://crash-stats.mozilla.org/siteadmin/crashstats/missingprocessedcrash/
That page lets you reprocess crash ids where there isn't a processed crash report.
I thought Socorro also self-healed in that it would automatically reprocessed crash reports, but it doesn't. I forget what I was thinking in that regard. I must have been fixing it by hand for a time. There's a MissingProcessedCrashes API and a corresponding fetch_missing command, so that seems likely.
- https://crash-stats.mozilla.org/api/#MissingProcessedCrash
- https://github.com/mozilla-services/socorro/blob/main/socorro/scripts/fetch_missing.py
We should fix that to self-heal--it should put crash reports in the queue to get processed. If processing fails, it'll send Sentry events, so we'll always have some signal of crash reports failing to process. I'll write up a bug for that. It should be a straight-forward fix to verifyprocessed.py.
Anyhow, that's a long train of thought to get to what Antenna should do when it succeeds in saving the crash data to S3, but fails at publishing the crash id for processing. In this case, I think handle_crashreport should return True and Antenna should return an HTTP 200 with the crash id. Socorro should catch the problem and self-heal.
Also, the way this loop is written, it'll retry 20 times in rapid succession. I think it should retry 5 times waiting 2 seconds in between attempts and then give up and return True.
c181908 to
4a727e2
Compare
29dd0f5 to
e311f6c
Compare
d122abf to
0a69bdd
Compare
willkg
left a comment
There was a problem hiding this comment.
If I do this:
make run
it kicks up this error:
antenna-web-1 | + PORT=8000
antenna-web-1 | + GUNICORN_WORKERS=1
antenna-web-1 | + GUNICORN_WORKER_CONNECTIONS=4
antenna-web-1 | + GUNICORN_WORKER_CLASS=gevent
antenna-web-1 | + GUNICORN_MAX_REQUESTS=0
antenna-web-1 | + GUNICORN_MAX_REQUESTS_JITTER=0
antenna-web-1 | + CMD_PREFIX=
antenna-web-1 | + gunicorn --workers=1 --worker-connections=4 --worker-class=gevent --max-requests=0 --max-requests-jitter=0 --config=antenna/gunicornhooks.py --log-file=- --error-logfile=- --access-logfile=- --bind 0.0.0.0:8000 antenna.wsgi:application
antenna-web-1 |
antenna-web-1 | Error: 'antenna/gunicornhooks.py' doesn't exist
We need to change bin/run_web.sh and take out the line that looks like this:
--config=antenna/gunicornhooks.py \
One way to test Antenna is to run it in your local dev environment and then in another shell run the systemtests.
https://github.com/mozilla-services/antenna/tree/main/systemtest
I ran Antenna from main and this branch and ran the system tests and logged the metrics and compared the two runs. Everything looks good. It was a little confusing because the order is different but that's what you get when switching from background processing to doing everything in the HTTP request handling cycle.
A few more things to fix, but otherwise this looks good!
|
|
||
| :arg wait_time_generator: | ||
| Generator function that returns wait times until a maximum number of | ||
| Function that returns a generator of wait times until a maximum number of |
Co-authored-by: Will Kahn-Greene <willkg@users.noreply.github.com>
776c810 to
feba71b
Compare
ab6b9bd to
ed8c04d
Compare
willkg
left a comment
There was a problem hiding this comment.
Looks good!
When we land this and it auto-deploys to stage, we can watch the stage dashboard:
https://earthangel-b40313e5.influxcloud.net/d/mTjlP_8Zz/socorro-stage-app-metrics?orgId=1&refresh=1m
It might make sense to go over that dashboard to see if it makes sense before landing things. I haven't maintained it very well over the years.
No description provided.