Add a script to generate signatures from crash pings #15
Conversation
Also add some more detailed error logging on errors returned by tecken, so that we can diagnose errors and make fixes accordingly.
normalized_channel="nightly"
AND DATE(submission_timestamp)="{date}"
AND application.build_id > FORMAT_DATE("%Y%m%d", DATE_SUB(DATE "{date}", INTERVAL 1 WEEK))
This query might need a bit of refining, I guess - ideally we want to generate signatures for all the crash pings, not restricted to the nightly channel and not restricting the build IDs. Assuming the Airflow task runs once a day on the previous day's submissions and no new submissions come in after it runs, we should just need the submission_timestamp filter here.
Hmm, not sure-- from a BigQuery perspective this is reasonable, but the number of crashes on release is very large (1.2-1.7 million a day) so might be overwhelming for tecken to process. Would a 1% sample of release be acceptable? That would bring the quantity down to 17,000 a day, which seems more manageable. @willkg wdyt?
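A deterministic sample like the 1% suggested above could be drawn by hashing each document ID, which keeps the same documents in the sample across reruns. This is a hypothetical Python sketch of the idea (in BigQuery itself one would typically hash with a fingerprint function in the WHERE clause); the `in_sample` helper and the 1% default are illustrative, not part of the PR:

```python
import hashlib

def in_sample(doc_id: str, percent: int = 1) -> bool:
    """Deterministically keep roughly `percent`% of documents by hashing the ID."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

# Over many IDs, roughly 1% fall in the sample.
ids = [f"doc-{i}" for i in range(100_000)]
kept = sum(in_sample(d) for d in ids)
```

Because the sample is keyed on the document ID rather than drawn randomly per run, re-running the job for the same day selects the same crashes.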
I need to spend some time looking at how Tecken handled that symbolication batch on 8/20. Did it cache a lot? Did it cause problems for other parts of the service?
Short-to-medium term, I want to redo how Tecken does symbolication so it's doing it on a dedicated cluster and thus symbolication outages don't affect symbol uploads.
So my current gut feeling is that we should:
- pick a small sample for now that Tecken can handle
- increase the sample in the future after I make Tecken changes and/or we figure out a better way to do symbolication
I'll look into how Tecken performed today and get back to you.
Ok, in that case maybe we should start with just nightly crashes as it's valuable to see changes in the top crashers on a daily basis. And the volume is low enough that hopefully it shouldn't be too much of a problem.
I had a long talk about Tecken today with Brian.
Here's what we're thinking:
- I'm in the process of redoing symbolication in Tecken. When that's done, it should work better for bigquery-etl. I'm planning to use this job to load test. I don't have an ETA--I'll probably know more next week as I flesh things out. The work is being done in this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
- This bigquery-etl job does a burst of symbolication requests which doesn't give Tecken enough time to scale up. It'd be better if somehow it could ramp up the rate of requests over 5 minutes.
- We should keep the sample rate small for now. We can probably play with that number to see how high we can get it, if you want.
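The ramp-up suggestion in the second bullet could be implemented client-side by linearly increasing the request rate over the first five minutes. This is only a sketch under assumed parameters - the `start_rps`/`target_rps` numbers are placeholders, not values agreed on in this thread:

```python
def ramped_delay(elapsed_s: float, ramp_s: float = 300.0,
                 start_rps: float = 1.0, target_rps: float = 20.0) -> float:
    """Return the inter-request delay in seconds, linearly ramping the
    request rate from start_rps to target_rps over ramp_s seconds."""
    frac = min(elapsed_s / ramp_s, 1.0)
    rate = start_rps + (target_rps - start_rps) * frac
    return 1.0 / rate

# At t=0 we wait 1s between requests; after 5 minutes, 0.05s.
```

The caller would sleep for `ramped_delay(time.monotonic() - start)` between symbolication requests, giving Tecken's autoscaling time to react before the full burst arrives.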
I can review this tomorrow--idiomatic Python, signature generation, and symbolication.
if sig is None or len(sig.signature) == 0:
    print(f"Error computing signature for {doc_id}", file=sys.stderr)
    continue
print(f'{doc_id},"{sig.signature}"')
This is going to produce excessively long output if we process anything like a large number of crashes. We should probably use the logging module here instead of print. I can take care of this later, though.
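The switch from `print` to `logging` suggested above might look like the following sketch. The `emit_row` wrapper is a hypothetical refactor for illustration; the original code inlines this in a loop:

```python
import logging
import sys

logging.basicConfig(stream=sys.stderr, level=logging.INFO)
logger = logging.getLogger(__name__)

def emit_row(doc_id, sig):
    """Return a CSV row for a computed signature, or None on failure.

    Failures go through logging rather than print, so they can be
    filtered or rate-limited by log level without touching the code.
    """
    if sig is None or len(sig.signature) == 0:
        logger.warning("Error computing signature for %s", doc_id)
        return None
    return f'{doc_id},"{sig.signature}"'
```

Using `%s`-style lazy formatting in `logger.warning` also avoids building the message string when the log level suppresses it.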
👍 I was halfway through a review, so I finished it up; I may have missed some things, though.
This is likely from the Python version migration.
This script generates a CSV file of the format "document_id, crash_signature" by pulling from the `telemetry.crash` table in BigQuery. The script parallelizes operations internally to reduce total wall-clock time.
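The parallelized CSV generation described above could be structured roughly as below. This is a simplified sketch, not the script itself: `compute_signature` is a stand-in for the real per-crash signature generation and symbolication call, and the thread-pool size is an arbitrary example:

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

def compute_signature(row):
    # Placeholder for the real signature-generation / symbolication call.
    return row["document_id"], "OOM | small"

def rows_to_csv(rows, max_workers=8):
    """Compute signatures in parallel and emit "document_id, crash_signature" CSV."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so output rows match input rows.
        for doc_id, sig in pool.map(compute_signature, rows):
            writer.writerow([doc_id, sig])
    return buf.getvalue()
```

Since the real work is network-bound (waiting on symbolication responses), threads are enough to overlap requests and cut total wall-clock time; no process pool is needed.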
Updated patches to address review comments thus far.
Thanks @staktrace! Let's land this as-is; we can make follow-ups as needed.
This is the cleaned-up work we were discussing in #11