Add a script to generate signatures from crash pings #15

Merged
merged 6 commits into mozilla-services:master on Aug 26, 2020

Conversation

staktrace (Contributor, Author)

This is the cleaned-up work we were discussing in #11

wlach and others added 4 commits August 24, 2020 10:36
Also add more detailed logging for errors returned by tecken, so that we can
diagnose them and make fixes accordingly.
Comment on lines +12 to +15
normalized_channel="nightly"
AND DATE(submission_timestamp)="{date}"
AND application.build_id > FORMAT_DATE("%Y%m%d", DATE_SUB(DATE "{date}", INTERVAL 1 WEEK))
staktrace (Contributor, Author)

This query might need a bit of refining, I guess - ideally we want to generate signatures for all the crash pings, not just the ones on the nightly channel, and without restricting the buildids. Assuming the airflow task runs once a day on the previous day's submissions, and no new submissions come in after it runs, we should just need the submission_timestamp filter here.
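
A minimal sketch of what that simplified filter could look like, assuming the script keeps building its query as a Python f-string with a `{date}` placeholder as in the fragment above (`where_clause` is a hypothetical helper, not code from this PR):

```python
# Hypothetical: if the airflow task always runs after a day's submissions are
# complete, the channel and buildid restrictions can be dropped and only the
# submission date needs to be filtered.
def where_clause(date: str) -> str:
    return f'WHERE DATE(submission_timestamp) = "{date}"'
```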

wlach (Contributor)

Hmm, not sure-- from a BigQuery perspective this is reasonable, but the number of crashes on release is very large (1.2-1.7 million a day) so might be overwhelming for tecken to process. Would a 1% sample of release be acceptable? That would bring the quantity down to 17,000 a day, which seems more manageable. @willkg wdyt?
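
If a sampled release query ends up being the compromise, a 1% sample could look roughly like this (a sketch only; it assumes the crash ping table exposes the usual `sample_id` field with values 0-99, and `release_sample_filter` is a hypothetical helper):

```python
# Hypothetical 1% sample of release crashes; sample_id is assumed to be
# present on the table and uniformly distributed over 0-99.
def release_sample_filter(date: str) -> str:
    return (
        f'WHERE DATE(submission_timestamp) = "{date}" '
        'AND normalized_channel = "release" '
        'AND sample_id = 0'
    )
```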

willkg (Collaborator)

I need to spend some time looking at how Tecken handled that symbolication batch on 8/20. Did it cache a lot? Did it cause problems for other parts of the service?

Short-to-medium term, I want to redo how Tecken does symbolication so it's doing it on a dedicated cluster and thus symbolication outages don't affect symbol uploads.

So my current gut feeling is that we should:

  1. pick a small sample for now that Tecken can handle
  2. increase the sample in the future after I make Tecken changes and/or we figure out a better way to do symbolication

I'll look into how Tecken performed today and get back to you.

staktrace (Contributor, Author)

Ok, in that case maybe we should start with just nightly crashes as it's valuable to see changes in the top crashers on a daily basis. And the volume is low enough that hopefully it shouldn't be too much of a problem.

willkg (Collaborator)

I had a long talk about Tecken today with Brian.

Here's what we're thinking:

  1. I'm in the process of redoing symbolication in Tecken. When that's done, it should work better for bigquery-etl. I'm planning to use this job to load test. I don't have an ETA--I'll probably know more next week as I flesh things out. The work is being done in this bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1636210
  2. This bigquery-etl job does a burst of symbolication requests, which doesn't give Tecken enough time to scale up. It'd be better if somehow it could ramp up the rate of requests over 5 minutes (see the sketch after this list).
  3. We should keep the sample rate small for now. We can probably play with that number to see how high we can get it, if you want.
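
One possible shape for that ramp-up (a rough sketch; `symbolicate_batch` is a hypothetical stand-in for whatever function sends one chunk of stacks to Tecken): start with a noticeable delay between requests and shrink it linearly over a five-minute warm-up window so Tecken's autoscaling has time to react.

```python
import time

WARMUP_SECONDS = 5 * 60  # the five-minute ramp-up window suggested above
MAX_DELAY = 2.0          # seconds between requests at the start of the run
MIN_DELAY = 0.0          # full speed once warmed up

def ramped_delay(elapsed: float) -> float:
    """Linearly decrease the per-request delay over the warm-up window."""
    if elapsed >= WARMUP_SECONDS:
        return MIN_DELAY
    fraction = elapsed / WARMUP_SECONDS
    return MAX_DELAY - fraction * (MAX_DELAY - MIN_DELAY)

def symbolicate_all(batches, symbolicate_batch):
    """Send batches to Tecken, easing in the request rate so it can scale up."""
    start = time.monotonic()
    results = []
    for batch in batches:
        results.append(symbolicate_batch(batch))
        time.sleep(ramped_delay(time.monotonic() - start))
    return results
```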

willkg (Collaborator) commented on Aug 24, 2020

I can review this tomorrow--idiomatic Python, signature generation, and symbolication.

bigquery-etl.py: 4 resolved review comments (outdated)

if sig is None or len(sig.signature) == 0:
    print(f"Error computing signature for {doc_id}", file=sys.stderr)
    continue
print(f'{doc_id},"{sig.signature}"')
wlach (Contributor)

This is going to be excessively long if we process anything like a large number of crashes. We should probably use the logging module here instead of print. I can take care of this later, though.
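
A sketch of what the logging-based version might look like (the `(doc_id, sig)` pairs and the `emit_signatures` helper are assumptions for illustration, not the script's actual structure): one debug line per failure plus a single summary warning, so a large run doesn't produce excessively long output.

```python
import logging

logger = logging.getLogger(__name__)

def emit_signatures(results):
    """Print doc_id,signature CSV rows; report failures via logging, not print."""
    errors = 0
    for doc_id, sig in results:
        if sig is None or len(sig.signature) == 0:
            logger.debug("Error computing signature for %s", doc_id)
            errors += 1
            continue
        print(f'{doc_id},"{sig.signature}"')
    if errors:
        logger.warning("Failed to compute signatures for %d documents", errors)
```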

fx_crash_sig/symbolicate.py: 1 resolved review comment (outdated)
wlach (Contributor) commented on Aug 24, 2020

> I can review this tomorrow--idiomatic Python, signature generation, and symbolication.

👍 I was halfway through a review, so I finished it up; I may have missed some things, though.

staktrace and others added 2 commits August 25, 2020 13:56
This is likely from the Python version migration.
This script generates a CSV file of the format "document_id, crash_signature"
by pulling from the `telemetry.crash` table in BigQuery.

The script parallelizes operations internally to reduce total wall-clock time.
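
A rough sketch of what that internal parallelism could look like, assuming a thread pool over per-ping symbolication calls (`get_signature_for_ping` is a hypothetical stand-in; the actual script may organize this differently):

```python
from concurrent.futures import ThreadPoolExecutor

def generate_signatures(pings, get_signature_for_ping, max_workers=8):
    """Map pings to (document_id, signature) pairs using a worker thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(get_signature_for_ping, pings))
```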
staktrace (Contributor, Author)

Updated patches to address review comments thus far.

wlach (Contributor) left a comment

Thanks @staktrace! Let's land this as-is; we can make follow-ups as needed.

wlach merged commit ef9db02 into mozilla-services:master on Aug 26, 2020