Skip to content

bug-1809500: make crashmover run synchronously#963

Merged
relud merged 9 commits intomainfrom
crashmover-synchronous
Feb 1, 2024
Merged

bug-1809500: make crashmover run synchronously#963
relud merged 9 commits intomainfrom
crashmover-synchronous

Conversation

@relud
Copy link
Copy Markdown
Contributor

@relud relud commented Jan 24, 2024

No description provided.

@relud relud requested a review from a team as a code owner January 24, 2024 18:24
@relud relud force-pushed the crashmover-synchronous branch 2 times, most recently from 2c7a813 to 0c4ccbf Compare January 24, 2024 18:44
@relud relud force-pushed the crashmover-synchronous branch from 0c4ccbf to 43afaf1 Compare January 24, 2024 18:49
Copy link
Copy Markdown
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some things that should get changed. After that, I'll go through and look at whether this emits the same metrics and verify some other things.

This is awesome! This removes a ton of complexity from the application and makes it a lot more straight-forward and easier to reason about. I'm excited you're working through this.

# After MAX_ATTEMPTS, we give up on this crash
LOGGER.error(f"{crash_id}: too many errors trying to publish; dropped")
MYMETRICS.incr("publish_crash_dropped.count")
return False
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we've gotten here, then the crash report data is in S3 and all we need to do is publish the crash id to the processing queue. If this fails, Socorro will notice. It's got a daily task that looks for raw crash files for which there's no corresponding processed crash file and add an entry to the missing processed crashes table. You can see that here:

https://crash-stats.mozilla.org/siteadmin/crashstats/missingprocessedcrash/

That page lets you reprocess crash ids where there isn't a processed crash report.

I thought Socorro also self-healed in that it would automatically reprocessed crash reports, but it doesn't. I forget what I was thinking in that regard. I must have been fixing it by hand for a time. There's a MissingProcessedCrashes API and a corresponding fetch_missing command, so that seems likely.

We should fix that to self-heal--it should put crash reports in the queue to get processed. If processing fails, it'll send Sentry events, so we'll always have some signal of crash reports failing to process. I'll write up a bug for that. It should be a straight-forward fix to verifyprocessed.py.

Anyhow, that's a long train of thought to get to what Antenna should do when it succeeds in saving the crash data to S3, but fails at publishing the crash id for processing. In this case, I think handle_crashreport should return True and Antenna should return an HTTP 200 with the crash id. Socorro should catch the problem and self-heal.

Also, the way this loop is written, it'll retry 20 times in rapid succession. I think it should retry 5 times waiting 2 seconds in between attempts and then give up and return True.

@relud relud force-pushed the crashmover-synchronous branch from c181908 to 4a727e2 Compare January 31, 2024 00:16
@relud relud requested a review from willkg January 31, 2024 00:18
@relud relud force-pushed the crashmover-synchronous branch from 29dd0f5 to e311f6c Compare January 31, 2024 17:11
@relud relud force-pushed the crashmover-synchronous branch from d122abf to 0a69bdd Compare January 31, 2024 17:15
Copy link
Copy Markdown
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I do this:

make run

it kicks up this error:

antenna-web-1         | + PORT=8000
antenna-web-1         | + GUNICORN_WORKERS=1
antenna-web-1         | + GUNICORN_WORKER_CONNECTIONS=4
antenna-web-1         | + GUNICORN_WORKER_CLASS=gevent
antenna-web-1         | + GUNICORN_MAX_REQUESTS=0
antenna-web-1         | + GUNICORN_MAX_REQUESTS_JITTER=0
antenna-web-1         | + CMD_PREFIX=
antenna-web-1         | + gunicorn --workers=1 --worker-connections=4 --worker-class=gevent --max-requests=0 --max-requests-jitter=0 --config=antenna/gunicornhooks.py --log-file=- --error-logfile=- --access-logfile=- --bind 0.0.0.0:8000 antenna.wsgi:application
antenna-web-1         | 
antenna-web-1         | Error: 'antenna/gunicornhooks.py' doesn't exist

We need to change bin/run_web.sh and take out the line that looks like this:

  --config=antenna/gunicornhooks.py \

One way to test Antenna is to run it in your local dev environment and then in another shell run the systemtests.

https://github.com/mozilla-services/antenna/tree/main/systemtest

I ran Antenna from main and this branch and ran the system tests and logged the metrics and compared the two runs. Everything looks good. It was a little confusing because the order is different but that's what you get when switching from background processing to doing everything in the HTTP request handling cycle.

A few more things to fix, but otherwise this looks good!


:arg wait_time_generator:
Generator function that returns wait times until a maximum number of
Function that returns a generator of wait times until a maximum number of
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much clearer. Thank you!

relud and others added 2 commits January 31, 2024 13:43
Co-authored-by: Will Kahn-Greene <willkg@users.noreply.github.com>
@relud relud force-pushed the crashmover-synchronous branch from 776c810 to feba71b Compare January 31, 2024 22:06
@relud relud force-pushed the crashmover-synchronous branch 2 times, most recently from ab6b9bd to ed8c04d Compare January 31, 2024 22:26
@relud relud requested a review from willkg January 31, 2024 22:27
Copy link
Copy Markdown
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

When we land this and it auto-deploys to stage, we can watch the stage dashboard:

https://earthangel-b40313e5.influxcloud.net/d/mTjlP_8Zz/socorro-stage-app-metrics?orgId=1&refresh=1m

It might make sense to go over that dashboard to see if it makes sense before landing things. I haven't maintained it very well over the years.

@willkg willkg self-assigned this Feb 1, 2024
@relud relud merged commit 8107c88 into main Feb 1, 2024
@relud relud deleted the crashmover-synchronous branch February 1, 2024 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants