New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug 1543097: Convert crash reporting from raven to sentry_sdk #4917
Bug 1543097: Convert crash reporting from raven to sentry_sdk #4917
Conversation
9bc0108
to
e5c5c23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To test this, you could add a DSN for stage to the dev environment, and maybe set debug=True
in sentry_sdk.init
. A similar process to the last PR may work, where we put it in stage for 24 hours to get some cron errors, and raise an exception to test the Django side.
socorro/app/socorro_app.py
Outdated
@@ -206,6 +216,12 @@ def fix_exit_code(code): | |||
|
|||
config_manager.log_config(mylogger) | |||
|
|||
# Add version to crash reports | |||
version = (revision_data.get('version') or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a copy of socorro.lib.revision_data.get_version
that doesn't read the JSON from disk again. This may be a mis-optimization. I'm not sure if get_version
should take an optional revision_data
, or I should just close my eyes and call get_version
🙈.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went with the optional revision_data
.
@@ -22,8 +22,9 @@ def test_instantiation(self): | |||
with pytest.raises(NotImplementedError): | |||
sa.main() | |||
|
|||
@mock.patch('socorro.app.socorro_app.setup_crash_reporting') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mock avoidssentry_sdk
choking when processing a MagicMock
as a DSN string.
|
||
@patch('socorro.lib.sentry_client.get_hub', side_effect=Exception('fail')) | ||
@patch('socorro.lib.sentry_client.is_enabled', return_value=False) | ||
def test_rule_error_sentry_disabled(self, is_enabled, mock_get_hub): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_rule_error
used to test both with a blank DSN and a fake DSN. When converting to a mocked is_enabled
, it made sense to split the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
return self.cmd_run_one(options['job'], options['force'], cmd_args) | ||
else: | ||
return self.cmd_run_all() | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped this try
block because the default handler does what the except
block does - log the uncaught exception to Sentry on exit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should either filter the crash data or get an ok from @g-k to skip it.
These changes have ramification if they're not correct--we'll miss errors. Given that, I would write a rough test plan to show what we've thought through and surface holes. Then when this deploys to stage, we have a plan we can use to verify it.
socorro/cron/crontabber_app.py
Outdated
default='', | ||
reference_value_from='secrets.sentry', | ||
secret=True | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job lifting this up from the apps to the App.
dsn=SENTRY_DSN, | ||
release=SOCORRO_REVISION, | ||
send_default_pii=True, | ||
integrations=[DjangoIntegration()]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we talked about this a bit. I think I'm going to change my mind and suggest we filter the event before sending it along.
This seems pretty straight-forward:
https://docs.sentry.io/error-reporting/configuration/filtering/?platform=python
@g-k is our security person. It's worth having him eyeball this and 👍 / 👎 what we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try to maintain raven's client side filtering to avoid accidentally collecting PII.
We have audited Sentry for storing PII (part of the reason we self-host it), can filter it server side, and can probably purge PII if we do collect it, but not collecting it in the first place would be best.
Looking at getsentry/sentry-python#211 it looks like there isn't an equivalent for the raven filters and processors and we'd need to handle that in a before_send
hook (looks like the raven filters are matching on field names and value regexes https://github.com/getsentry/raven-python/blob/master/raven/processors.py#L165).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
jwhitlock pointed out some additional nuance, so I'm going to run this by the rest of the secops team.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect timing on bringing this up right before the secops team meeting! So our consensus was that if we can filter out more PII client side we should do that, but it's also OK to continue collecting the same user fields since crash reports already contain user submitted PII and we need to associate users with their crashes.
And if the new sdk does less filtering we should try to preserve the behavior of raven's filters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just chatted with @g-k and he pointed out that web users don't opt-in to webapp crash reports to sentry and so they're not really the same thing as firefox crash reports to socorro.
I've been on Socorro for like 3 years and I don't think I've looked at any of those fields before. Sure, past is not prologue, but I think that's probably a good indicator.
I vote we go the easy route and do set_default_pii=False
. If that turns out to be a problem in the future, we can revisit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK some things from further discussion with willkg:
-
"webapp crashes -> sentry are really different than firefox crashes -> socorro" (especially in that they aren't opt-in)
-
socorro usually goes out of its way to not associate Fx crashes with users (as in it's a legal or data requirement), so that's not a reason to collect PII
-
willkg can't recall when he last used the PII fields so we can set that flag to false to not collect them
-
we should re-implement the sanitizers from raven to preserve existing behavior (and upstream the change if sentry is open to it)
return self.cmd_run_one(options['job'], options['force'], cmd_args) | ||
else: | ||
return self.cmd_run_all() | ||
except Exception: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
@patch('socorro.lib.sentry_client.get_hub', side_effect=Exception('fail')) | ||
@patch('socorro.lib.sentry_client.is_enabled', return_value=False) | ||
def test_rule_error_sentry_disabled(self, is_enabled, mock_get_hub): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Nice work! |
I'm guessing another day to write and test a |
:returns: string | ||
|
||
""" | ||
revision_data = get_revision_data() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the result of get_version
will ever change during a run, so we could cache the result in the module and then return that. We do module-level caching in other places, too, so there's some precedent.
It's up to you.
b66e8f6
to
1820724
Compare
I'd feel better if this baked on stage for a bit. I want to push the stuff that's on stage now to prod before we land this. I'm thinking we'll do a prod deploy tomorrow. I'm going to add the DO NOT MERGE label so I don't forget. |
Today was research and a little code. Setting
We can turn off The raven processors work by trying to automatically filter sensitive data by keywords across data, HTTP headers, and stacktraces. We could potentially port this, but the Sentry dev says:
This makes me think that copying the raven logic to our |
1820724
to
910d893
Compare
I'll need to get a few more examples of filters before I can get a good API. I thought plain functions would be the simplest thing that could work, but since there is a little configuration and registration, classes with a Filters to write:
There are probably a few more. I'm getting the low-hanging fruit in the dev environment, and then looking at existing reports for more ideas. I'll push my work in progress for the curious, but no need to review the PR yet. |
👍 I was assuming we had legal or compliance requirements to use the old SDK filters, but if they are broken or incomplete then we don't have to port that behavior over (especially since we can set |
8d9e7d2
to
7a4b7a7
Compare
I think @Osmose did a careful job of validating the raven-provided filters in PR #4357, and I want to make sure we're not adding data that was previously filtered. The event format has changed with sentry_sdk, probably to standardize formatting across languages and platforms, which means I can't use existing raven-generated events to determine how to write my filters. The production stack (such as AWS load balancers) also adds some headers that I can't test locally, because I'm not sure how they will show up in the sentry_sdk event format. That's OK, we'll be mostly covered by That leaves a couple more tasks that I can do in development:
I hope to open for review by Monday. |
abb1a89
to
0a93e47
Compare
This is ready for review again. I developed the sanitizers by turning up debugging in
That fake DSN ensured that Sentry events were generated but went nowhere. I then added some The code is overly flexible:
The code is also too specific:
I'm not sure if I should tighten up the code, or refactor into some more general functions, or let it be for the first version. As I mentioned earlier, the event format changed from raven to sentry_sdk, and the existing raven data is different than this data. If the gunicorn request details appear in staging, there may be more filtering to do, to omit IP address data from gunicorn header logs. I suspect there will be one or more rounds of changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After sleeping on it, I want to see if the hint for query breadcrumbs has some options for more targeted sanitation. There may be a parsed SQL available, since Django 2.2 now requires sqlparse, or maybe there is a way to identify sensitive columns in Django itself.
Update: The hint
is empty for SQL breadcrumbs, no help there. sqlparse
could be used to write a very cool sanitizer, but it is probably too risky for v1 of a crash handler.
Then I want a YAGNI pass, to remove bits of the code that is there for flexibility and future expansion. We can worry about n=2 later, let's get n=1 right.
from socorro.lib.sentry_client import SentryFilter, SentryProcessor, SENTRY_LOG_NAME | ||
|
||
|
||
class CrumbTruncateQuery(SentryFilter): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For SQL queries, I opted to truncate the SQL statement when a sensitive column is mentioned, rather than replace the whole thing with [filtered]
, or trying to replace the values with [filtered]
.
""" | ||
log_name = SENTRY_LOG_NAME if debug else None | ||
filters = ( | ||
EventProcessCrumbs(( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the important part, which sets up the actual filtering used in the webapp.
class TestCrumbTruncateQueryForPrivacy: | ||
"""Tests for CrumbTruncateQueryForPrivacy.""" | ||
|
||
CASES = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This CASES
pattern is used a lot to take data I saw during development and make test cases out of them. They are used here in the tests for the filter class, and later in the tests for get_before_send
to test the webapp configuration.
self.keywords = keywords | ||
self.reason = reason | ||
|
||
def __repr__(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used __repr__
because I was listening to Raymond Hettinger's talk while I was writing this: https://youtu.be/wf-BqAjZb8M?t=1980
I want to think through this, but don't have the head space for it right now. Hopefully later this week. |
I'm going to take another refactoring pass on this:
|
e827c02
to
b7eb920
Compare
2bddd72
to
394eace
Compare
I'm glad I had a chance to refactor this code, I'm a lot more comfortable with it. I've also edited the initial PR comment to cover the additional scope of filtering events for PII and secure information, so that future devs don't have to read all the 40+ comments. We discussed in the meeting that this code is not urgent, and requires a couple of days to bake in stage. We can defer review and merging until we have that time. The code is in a good state to hang out on the shelf for a while, and I can rebase when needed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a bunch of comments, but I like this approach and I appreciate that it's really easy to extract and turn into a general purpose library.
If you want to hop on zoom and talk about any of the issues, that might get us to a resolution faster than going back and forth with PR comments. Woot!
@@ -62,9 +62,6 @@ def get_standard_config(self, sentry_dsn=None): | |||
mocked_companion_process = mock.Mock() | |||
config.companion_process.companion_class = mock.Mock(return_value=mocked_companion_process) | |||
|
|||
config.sentry = mock.MagicMock() | |||
config.sentry.dsn = sentry_dsn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config
is a DotDict, so you need to add a namespace to it. You can do that like this:
config.sentry = DotDict()
config.sentry.dsn = sentry_dsn
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was removed in the PR. With raven
, the sentry DSN was passed at the time of the exception. With sentry_sdk
, Sentry is initialized from this value at run
, which is not run for these tests.
My strategy was to mock is_enabled
to avoid initialization, but alternatively the tests could call pa.setup_crash_reporting(config, 'fake_version')
around the time they call pa._setup_source_and_destination()
.
filtered at creation, but most crumbs are discarded since an event isn't usually generated, so | ||
it makes sense to wait until event sanitization to process it. The SanitizeBreadcrumbs filter | ||
runs breadcrumb filters (that don't need data from the hint) at event processing rather than | ||
at crumb creation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These comments are super helpful. 👍
else: | ||
data_out[key] = value | ||
if modified: | ||
glom.glom(event, glom.Assign(self.section_path, data_out)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of funny. They added Assign
because I asked for it, but I think this is the first use we've had of it.
Oh, I had one other thought. This adds a lot of code and while there are tests, they're tests for known scenarios and don't account for change in sentry-sdk. What happens when this code fails? How would we find out? Maybe it makes sense to have the |
Use the standard Django integration of sentry_sdk to replace raven. We're using the send_default_pii=False (the default), to explicitly not collect user data like email addresses, and connection data like IP addresses and cookies.
Switch to the sentry_sdk initialized in settings: * Use the default exception handler for handle * Use sentry_sdk.capture_error() in _run_once
Switch management command depcheck to use the settings-initialized sentry_sdk.
394eace
to
2552a15
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a number of changes, and there's some more work to do
Exceptions in before_send
The sentry code wraps the before_send
call:
I tested this by raising an exception. There is:
- A multi-line
ERROR
-level log from sentry "Internal error in sentry_sdk", followed by the original traceback and the new exception. - A single-line
INFO
-level log that prints the event - Django's
ERROR
-level log of the original traceback and the new exception.
The event is not sent to Sentry if before_send
raises an exception.
Log level of before_send
loggers
The suggestion is to change the before_send
loggers to log at DEBUG
instead of INFO
. The default log level is INFO
, so DEBUG
messages are swallowed. A few ways forward:
- Continue logging at
INFO
, if theSENTRY_DEBUG
is set - Always log at
DEBUG
, and require the developer to switch toDEBUG
to see them - Always log at
DEBUG
, and dynamically add a handler forDEBUG
messages ifSENTRY_DEBUG
is set
Next set of work
- Convert to filters that just return the event, not the tuple of the event and if it was modified
- Redo logging configuration so that
debug_logger.debug()
works - Switch the query processing to filter rather than truncate
- Indent multi-line SQL test cases
- Determine if querystring parsing can be used without corrupting data
- Try "production-like" mode, and see if I can get the extended
gunicorn
breadcrumbs from production
@@ -62,9 +62,6 @@ def get_standard_config(self, sentry_dsn=None): | |||
mocked_companion_process = mock.Mock() | |||
config.companion_process.companion_class = mock.Mock(return_value=mocked_companion_process) | |||
|
|||
config.sentry = mock.MagicMock() | |||
config.sentry.dsn = sentry_dsn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code was removed in the PR. With raven
, the sentry DSN was passed at the time of the exception. With sentry_sdk
, Sentry is initialized from this value at run
, which is not run for these tests.
My strategy was to mock is_enabled
to avoid initialization, but alternatively the tests could call pa.setup_crash_reporting(config, 'fake_version')
around the time they call pa._setup_source_and_destination()
.
What if we wrap the
If markus is configured at that point, it'll send something and we can see it in datadog. Does that sound like a good idea? I'm not sure how else we can get signals out.
I suggested we switch it to debug because I thought we were looking at debug level in the local dev environment. Now that I know it's not logging at debug level, I take back my comment--let's leave it as what you have since if someone has debug mode enabled and it seems onerous and confusing to require multiple actions to get debugging to work. |
It's more readable to me if we adjust where the parens are. It's a tuple of three elements with the second element being a really long string. I think this is more readable:
When I was reviewing that code, I had to count parens to figure out where the three elements were demarcated. That seems really prone to error. |
I think the logging is a little cleaner now. Instead of passing around a logger, there's a module-level logger that gets called when needed, and the logic to "turn on" the messages moves to settings. Take a look and see if you agree. Incrementing a counter with I'm continuing with migrating the return from Instead, it seems better to raise an exception when then event (or breadcrumb) should be dropped instead of processed, catch the exception in |
ids=tuple(case[0] for case in CASES)) | ||
def test_filtered_queries(self, keyword, message, expected): | ||
'keyword,sql', CASES.items(), | ||
ids=tuple(case for case in CASES)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this line up right? Does CASES.items()
return the same as iter(CASES)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my experience, yes, and that's my understanding from the dictionary view docs: https://docs.python.org/3.6/library/stdtypes.html#dictionary-view-objects
My remaining task is to run in production mode, and see if I can get the sensitive breadcrumbs I'm seeing in the raven-generated events in stage. That may add another sanitizer, but I don't expect the current sanitizers to change, so this may be another good place to review. Update: The event is fundamentally the same in production mode, so no further changes. I hope to squash all these review commits into the "Filer events for pii/security" commit, because I don't think there is a long-term benefit to all these follow-on commits. If you want me to squash before you review, let me know. Update 2: I've rebased and squashed the commits |
Convert ProcessorApp to use sentry_sdk: * Move crash reporting initialization to the base socorro_app.App * Assume sentry is initialized in sentry_client helpers * Remove now unused Raven library
* Change sentry_client.get_client to get_hub, to better match sentry_sdk, and update the test mocks * Add sentry_client.is_enabled() to check if Sentry has an initialized hub and dsn, and mock to return True in tests * Drop the sentry_dsn parameter from both get_hub and capture_error * Drop local handling of sentry_dsn in each app, defer to App, and simplify test config setup accordingly * Split test_rule_error from test_processor_2015.py into two tests, one with Sentry disabled, one enabled
26070fb
to
9d63a75
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The markus bits need to get fixed and I'd make the CASES
declarations more readable. This looks good though. After those fixes I think we can land this and test it out on stage.
Do you have a test plan for this?
Fixes applied in follow-on commits. I'd like to squash these as well before merging. My test plan:
There's a good chance that there will be addition scrubbing needed in the deployed environments, based on crumbs that appear in the deployed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! 👍
Sorry this was such a slog--I was kind of learning about the bits as we went along. 😞
Implement before_send on the webapp to clear personal information and security data from the event (exception or message) before sending to Sentry: * Truncate SQL queries when a sensitive column is mentioned * Filter HTTP headers to mask Auth-Token * Filter POST data to mask CSRF tokens * Filter query strings to mask OAuth parameters
a6cb593
to
1162b62
Compare
I was learning on this one too. Thanks for the feedback on coding styles, |
Switch our crash reporting library from raven to sentry_sdk 0.7.14. This is a follow-on for PR #4912, which didn't avoid huge test diffs like I planned 😢.
Switching from Raven to Sentry SDK
webapp-django/crashstats/settings/base.py
, and is very close to the documentation for Django integration.setup_crash_reporting()
is now called in the top-levelApp
insocorro/app/socorro_app.py
, in therun
method. This allows droppingsentry.sdk
handling from various derived apps.socorro/lib/sentry_client.py
has a few changes:Hub
to keep track of state, data scopes, and the client, andinit
sets up a globalHub
that is used by default. The functionget_client
, which initialized a raven client with a DSN, is replaced byget_hub
, which expectedsentry_sdk.init()
has been called, and does much less, but is still a good point for test mocking.is_enabled
checks ifsentry_sdk.init()
has been run with a valid DSN. Tests mock this to returnTrue
rather than initialize with a fake DSN.capture_error
dropssentry_dsnn
as the first argument. It usesis_enabled
to decide whether to send the exception to Sentry or log it.captureException
iscapture_exception
and takes a named parametererror
,captureMessage
iscapture_message
, etc.Filtering sensitive data from events
The previous Raven filters, added in PR #4357, sanitized the Sentry events. Similar filtering is a requirement for switching to the new code. However, it is hard to verify the filtering implementation because there are no tests for the existing filtering (it is assumed Raven has those tests), the event format has changed significantly with the new SDK (apparently to standardize events across languages), and some collected data varies from development to deployment environments. This PR makes an effort to replicate the Raven filters, and I expect that there will be some additional filtering needed once we see events in the stage deployment environment.
In
sentry_sdk
, a lot of sensitive data collection is covered by theinit
parametersend_default_pii
. It is disabled by default, and when enabled it selectively adds data:wsgi
integration, included in the Django integration, includes the user's IP address, headers with IP address info, and headers for cookies or auth. It will collect cookies for easier display.send_default_pii
.This PR explicitly sets
send_default_pii=False
, so future devs will know we decided to omit this data rather than unknowingly accepted the default. Skipping cookies does the same work as the Raven filters for keywordssessionid
,csrftoken
,anoncsrf
, andsc
.The SanitizePasswordProcessor added some other keys that look like authorization data, such as
password
andapi_key
. Additional code is needed to filter similar data withsentry_sdk
, which take ainit
parambefore_send
to filter events and breadcrumbs.When analyzing events in the development environment, only the web app appeared to have PII associated with the visitor, so the processing code lives in the web app. This PR adds a
SentryProcessor
class that runs a sequence of filters against an event, and includes some optional debug logging for development. The code could be trivially expanded to sanitize events from the processor app, or to sanitize breadcrumbs (data gathered during execution that may be useful if an exception occurs).The current filters are:
email
,username
,session_key
ortokens_token.key
, to avoid exposing PII or security data.Auth-Token
in HTTP headerscsrfmiddlewaretoken
in POST datacode
andstate
, used in the OAuth flow, from query stringsFuture work
I was unable to get enough data in the development environment to write some planned filters. I hope to return to these with data from stage or production:
This code should be merged at a time when we'll have a couple of working days to gather data in stage and create new filters as needed.
Testing this PR
For my local testing, I added these to
my.env
:The
SENTRY_DSN
could be set to send development events to our Sentry collector for stage, which was done in testing for PR #4357, but I found the Sentry debug logging to be sufficient for development of the code.