Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1543097: Convert crash reporting from raven to sentry_sdk #4917

Merged
merged 6 commits into from May 16, 2019

Conversation

jwhitlock
Copy link
Member

@jwhitlock jwhitlock commented Apr 29, 2019

Switch our crash reporting library from raven to sentry_sdk 0.7.14. This is a follow-on for PR #4912, which didn't avoid huge test diffs like I planned 😢.

Switching from Raven to Sentry SDK

  • The new SDK works best when initialized with sentry_sdk.init() near the start of execution.
    • In the webapp, this is done in webapp-django/crashstats/settings/base.py, and is very close to the documentation for Django integration.
    • For configman-based apps, setup_crash_reporting() is now called in the top-level App in socorro/app/socorro_app.py, in the run method. This allows dropping sentry.sdk handling from various derived apps.
  • The helper socorro/lib/sentry_client.py has a few changes:
    • The SDK uses a Hub to keep track of state, data scopes, and the client, and init sets up a global Hub that is used by default. The function get_client, which initialized a raven client with a DSN, is replaced by get_hub, which expected sentry_sdk.init() has been called, and does much less, but is still a good point for test mocking.
    • A new function is_enabled checks if sentry_sdk.init() has been run with a valid DSN. Tests mock this to return True rather than initialize with a fake DSN.
    • The function capture_error drops sentry_dsnn as the first argument. It uses is_enabled to decide whether to send the exception to Sentry or log it.
  • The SDK methods have also changed from raven to sentry_sdk. captureException is capture_exception and takes a named parameter error, captureMessage is capture_message, etc.

Filtering sensitive data from events

The previous Raven filters, added in PR #4357, sanitized the Sentry events. Similar filtering is a requirement for switching to the new code. However, it is hard to verify the filtering implementation because there are no tests for the existing filtering (it is assumed Raven has those tests), the event format has changed significantly with the new SDK (apparently to standardize events across languages), and some collected data varies from development to deployment environments. This PR makes an effort to replicate the Raven filters, and I expect that there will be some additional filtering needed once we see events in the stage deployment environment.

In sentry_sdk, a lot of sensitive data collection is covered by the init parameter send_default_pii. It is disabled by default, and when enabled it selectively adds data:

This PR explicitly sets send_default_pii=False, so future devs will know we decided to omit this data rather than unknowingly accepted the default. Skipping cookies does the same work as the Raven filters for keywords sessionid, csrftoken, anoncsrf, and sc.

The SanitizePasswordProcessor added some other keys that look like authorization data, such as password and api_key. Additional code is needed to filter similar data with sentry_sdk, which take a init param before_send to filter events and breadcrumbs.

When analyzing events in the development environment, only the web app appeared to have PII associated with the visitor, so the processing code lives in the web app. This PR adds a SentryProcessor class that runs a sequence of filters against an event, and includes some optional debug logging for development. The code could be trivially expanded to sanitize events from the processor app, or to sanitize breadcrumbs (data gathered during execution that may be useful if an exception occurs).

The current filters are:

  • Filter SQL query breadcrumbs that include a sensitive column, such as email, username, session_key or tokens_token.key, to avoid exposing PII or security data.
  • Mask Auth-Token in HTTP headers
  • Mask csrfmiddlewaretoken in POST data
  • Mask code and state, used in the OAuth flow, from query strings

Future work

I was unable to get enough data in the development environment to write some planned filters. I hope to return to these with data from stage or production:

  • Masking sensitive variables in call stacks (none found in dev)
  • Masking sensitive environment from gunicorn breadcrumbs (not recorded in dev, may be Raven-specific)

This code should be merged at a time when we'll have a couple of working days to gather data in stage and create new filters as needed.

Testing this PR

For my local testing, I added these to my.env:

SENTRY_DSN=https://bad-9c7b6b6bcb7a45d7ae01fe8f60594083@sentry.127.0.0.1.nip.io/0
SENTRY_DEBUG=True
resource.sentry.debug=True

The SENTRY_DSN could be set to send development events to our Sentry collector for stage, which was done in testing for PR #4357, but I found the Sentry debug logging to be sufficient for development of the code.

Copy link
Member Author

@jwhitlock jwhitlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To test this, you could add a DSN for stage to the dev environment, and maybe set debug=True in sentry_sdk.init. A similar process to the last PR may work, where we put it in stage for 24 hours to get some cron errors, and raise an exception to test the Django side.

@@ -206,6 +216,12 @@ def fix_exit_code(code):

config_manager.log_config(mylogger)

# Add version to crash reports
version = (revision_data.get('version') or
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a copy of socorro.lib.revision_data.get_version that doesn't read the JSON from disk again. This may be a mis-optimization. I'm not sure if get_version should take an optional revision_data, or I should just close my eyes and call get_version 🙈.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the optional revision_data.

@@ -22,8 +22,9 @@ def test_instantiation(self):
with pytest.raises(NotImplementedError):
sa.main()

@mock.patch('socorro.app.socorro_app.setup_crash_reporting')
Copy link
Member Author

@jwhitlock jwhitlock Apr 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mock avoidssentry_sdk choking when processing a MagicMock as a DSN string.


@patch('socorro.lib.sentry_client.get_hub', side_effect=Exception('fail'))
@patch('socorro.lib.sentry_client.is_enabled', return_value=False)
def test_rule_error_sentry_disabled(self, is_enabled, mock_get_hub):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_rule_error used to test both with a blank DSN and a fake DSN. When converting to a mocked is_enabled, it made sense to split the test.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return self.cmd_run_one(options['job'], options['force'], cmd_args)
else:
return self.cmd_run_all()
except Exception:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dropped this try block because the default handler does what the except block does - log the uncaught exception to Sentry on exit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@jwhitlock jwhitlock requested a review from willkg April 29, 2019 21:31
Copy link
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should either filter the crash data or get an ok from @g-k to skip it.

These changes have ramification if they're not correct--we'll miss errors. Given that, I would write a rough test plan to show what we've thought through and surface holes. Then when this deploys to stage, we have a plan we can use to verify it.

socorro/app/socorro_app.py Show resolved Hide resolved
default='',
reference_value_from='secrets.sentry',
secret=True
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job lifting this up from the apps to the App.

dsn=SENTRY_DSN,
release=SOCORRO_REVISION,
send_default_pii=True,
integrations=[DjangoIntegration()])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know we talked about this a bit. I think I'm going to change my mind and suggest we filter the event before sending it along.

This seems pretty straight-forward:

https://docs.sentry.io/error-reporting/configuration/filtering/?platform=python

@g-k is our security person. It's worth having him eyeball this and 👍 / 👎 what we do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try to maintain raven's client side filtering to avoid accidentally collecting PII.

We have audited Sentry for storing PII (part of the reason we self-host it), can filter it server side, and can probably purge PII if we do collect it, but not collecting it in the first place would be best.

Looking at getsentry/sentry-python#211 it looks like there isn't an equivalent for the raven filters and processors and we'd need to handle that in a before_send hook (looks like the raven filters are matching on field names and value regexes https://github.com/getsentry/raven-python/blob/master/raven/processors.py#L165).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jwhitlock pointed out some additional nuance, so I'm going to run this by the rest of the secops team.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect timing on bringing this up right before the secops team meeting! So our consensus was that if we can filter out more PII client side we should do that, but it's also OK to continue collecting the same user fields since crash reports already contain user submitted PII and we need to associate users with their crashes.

And if the new sdk does less filtering we should try to preserve the behavior of raven's filters.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just chatted with @g-k and he pointed out that web users don't opt-in to webapp crash reports to sentry and so they're not really the same thing as firefox crash reports to socorro.

I've been on Socorro for like 3 years and I don't think I've looked at any of those fields before. Sure, past is not prologue, but I think that's probably a good indicator.

I vote we go the easy route and do set_default_pii=False. If that turns out to be a problem in the future, we can revisit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK some things from further discussion with willkg:

  • "webapp crashes -> sentry are really different than firefox crashes -> socorro" (especially in that they aren't opt-in)

  • socorro usually goes out of its way to not associate Fx crashes with users (as in it's a legal or data requirement), so that's not a reason to collect PII

  • willkg can't recall when he last used the PII fields so we can set that flag to false to not collect them

  • we should re-implement the sanitizers from raven to preserve existing behavior (and upstream the change if sentry is open to it)

return self.cmd_run_one(options['job'], options['force'], cmd_args)
else:
return self.cmd_run_all()
except Exception:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


@patch('socorro.lib.sentry_client.get_hub', side_effect=Exception('fail'))
@patch('socorro.lib.sentry_client.is_enabled', return_value=False)
def test_rule_error_sentry_disabled(self, is_enabled, mock_get_hub):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@willkg
Copy link
Collaborator

willkg commented Apr 29, 2019

Nice work!

@jwhitlock
Copy link
Member Author

I'm guessing another day to write and test a before_send function. To get an idea what we're currently collecting, see the ZeroDivisionError from stage. I may need to send some data from my developer environment to stage with this library for comparison.

:returns: string

"""
revision_data = get_revision_data()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the result of get_version will ever change during a run, so we could cache the result in the module and then return that. We do module-level caching in other places, too, so there's some precedent.

It's up to you.

@willkg
Copy link
Collaborator

willkg commented Apr 30, 2019

I'd feel better if this baked on stage for a bit. I want to push the stuff that's on stage now to prod before we land this. I'm thinking we'll do a prod deploy tomorrow. I'm going to add the DO NOT MERGE label so I don't forget.

@jwhitlock
Copy link
Member Author

Today was research and a little code. Setting send_default_pii = True adds some data:

We can turn off send_default_pii without losing useful debugging data ✅ .

The raven processors work by trying to automatically filter sensitive data by keywords across data, HTTP headers, and stacktraces. We could potentially port this, but the Sentry dev says:

We don’t have a built-in password processor (mostly because it was never working perfectly for all usecases), but you can modify your events using before_send

This makes me think that copying the raven logic to our before_send and before_breadcrumb methods will potentially send sensitive data. Breadcrumbs, which contain SQL queries and gunicorn data, seem to be the likely candidates in testing. I think that we should add targeted filtering, and use some of the testing techniques from PR #4357, including some time in stage, and reading the event JSON closely.

@jwhitlock jwhitlock added the Not ready Not ready for code review label May 1, 2019
@jwhitlock
Copy link
Member Author

jwhitlock commented May 1, 2019

I'll need to get a few more examples of filters before I can get a good API. I thought plain functions would be the simplest thing that could work, but since there is a little configuration and registration, classes with a __call__ method might be better.

Filters to write:

  • Call stack: Look for sensitive variable names like password, csrf, etc. Might be in locals or calling arguments
  • Gunicorn logs: Anonomize sensitive headers like REMOTE_ADDR
  • URL query parameters, such as those passed in OAuth requests

There are probably a few more. I'm getting the low-hanging fruit in the dev environment, and then looking at existing reports for more ideas. I'll push my work in progress for the curious, but no need to review the PR yet.

@g-k
Copy link
Contributor

g-k commented May 2, 2019

👍 I was assuming we had legal or compliance requirements to use the old SDK filters, but if they are broken or incomplete then we don't have to port that behavior over (especially since we can set send_default_pii=False now and filter server side).

@jwhitlock jwhitlock force-pushed the sentry-sdk-1543097 branch 2 times, most recently from 8d9e7d2 to 7a4b7a7 Compare May 2, 2019 22:34
@jwhitlock
Copy link
Member Author

I think @Osmose did a careful job of validating the raven-provided filters in PR #4357, and I want to make sure we're not adding data that was previously filtered.

The event format has changed with sentry_sdk, probably to standardize formatting across languages and platforms, which means I can't use existing raven-generated events to determine how to write my filters. The production stack (such as AWS load balancers) also adds some headers that I can't test locally, because I'm not sure how they will show up in the sentry_sdk event format. That's OK, we'll be mostly covered by send_default_pii=False, and I'll be able to quickly add some more rules as needed after capturing an event in staging. But it means I won't be able to filter gunicorn headers in this PR, since I can't see them in development (which means they may not even be present in staging).

That leaves a couple more tasks that I can do in development:

  • Sanitize POSTs, along with CSRF in form data and headers
  • Sanitize API token usage
  • Sanitize the call stack variables

I hope to open for review by Monday.

@jwhitlock jwhitlock force-pushed the sentry-sdk-1543097 branch 2 times, most recently from abb1a89 to 0a93e47 Compare May 3, 2019 18:39
@jwhitlock jwhitlock added Do not merge and removed Not ready Not ready for code review labels May 3, 2019
@jwhitlock
Copy link
Member Author

This is ready for review again.

I developed the sanitizers by turning up debugging in my.env:

SENTRY_DSN=https://bad-9c7b6b6bcb7a45d7ae01fe8f60594083@sentry.127.0.0.1.nip.io/0
SENTRY_DEBUG=True
resource.sentry.debug=True

That fake DSN ensured that Sentry events were generated but went nowhere. I then added some raise Exception('here') lines, generated a Sentry event, formatted with jq, looked for trouble, and wrote some code. Rinse and repeat.

The code is overly flexible:

  • I added the ability to process breadcrumbs, which are added in response to SQL queries, logs, HTTP requests incoming and outgoing, etc. However, you can wait to sanitize these until the event is raised. It is possible that all the code to implement before_breadcrumb should be removed, and the code should focus on before_send.
  • I added before_send and before_breadcrumb handlers to the processor, but they just have optional debugging, and no filtering. I couldn't find any sensitive data outside of the crash data, which we're not filtering. That could mean all the filtering code belongs in the user-facing webapp, and shouldn't be in socorro/lib/sentry_client.py.

The code is also too specific:

  • I wrote a filter for each type of data: Headers, query strings, POST data, etc. A more generic filter could take a path in the event object using glom, and then process it in a more generic way. However, each content type has wrinkles, and it was developed incrementally, so I went with a more targeted approach.

I'm not sure if I should tighten up the code, or refactor into some more general functions, or let it be for the first version.

As I mentioned earlier, the event format changed from raven to sentry_sdk, and the existing raven data is different than this data. If the gunicorn request details appear in staging, there may be more filtering to do, to omit IP address data from gunicorn header logs. I suspect there will be one or more rounds of changes.

Copy link
Member Author

@jwhitlock jwhitlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After sleeping on it, I want to see if the hint for query breadcrumbs has some options for more targeted sanitation. There may be a parsed SQL available, since Django 2.2 now requires sqlparse, or maybe there is a way to identify sensitive columns in Django itself.

Update: The hint is empty for SQL breadcrumbs, no help there. sqlparse could be used to write a very cool sanitizer, but it is probably too risky for v1 of a crash handler.

Then I want a YAGNI pass, to remove bits of the code that is there for flexibility and future expansion. We can worry about n=2 later, let's get n=1 right.

socorro/app/socorro_app.py Outdated Show resolved Hide resolved
socorro/lib/sentry_client.py Outdated Show resolved Hide resolved
socorro/lib/sentry_client.py Outdated Show resolved Hide resolved
from socorro.lib.sentry_client import SentryFilter, SentryProcessor, SENTRY_LOG_NAME


class CrumbTruncateQuery(SentryFilter):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For SQL queries, I opted to truncate the SQL statement when a sensitive column is mentioned, rather than replace the whole thing with [filtered], or trying to replace the values with [filtered].

webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
"""
log_name = SENTRY_LOG_NAME if debug else None
filters = (
EventProcessCrumbs((
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the important part, which sets up the actual filtering used in the webapp.

class TestCrumbTruncateQueryForPrivacy:
"""Tests for CrumbTruncateQueryForPrivacy."""

CASES = (
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This CASES pattern is used a lot to take data I saw during development and make test cases out of them. They are used here in the tests for the filter class, and later in the tests for get_before_send to test the webapp configuration.

webapp-django/crashstats/settings/base.py Outdated Show resolved Hide resolved
self.keywords = keywords
self.reason = reason

def __repr__(self):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used __repr__ because I was listening to Raymond Hettinger's talk while I was writing this: https://youtu.be/wf-BqAjZb8M?t=1980

@willkg
Copy link
Collaborator

willkg commented May 7, 2019

I want to think through this, but don't have the head space for it right now. Hopefully later this week.

@jwhitlock jwhitlock added Not ready Not ready for code review and removed Do not merge labels May 8, 2019
@jwhitlock
Copy link
Member Author

I'm going to take another refactoring pass on this:

  • Drop breadcrumb parsing, because it appears we won't need it
  • Generalize the sanitizing algorithm, so that each data type isn't as much of a unique codepath

@jwhitlock
Copy link
Member Author

I'm glad I had a chance to refactor this code, I'm a lot more comfortable with it. I've also edited the initial PR comment to cover the additional scope of filtering events for PII and secure information, so that future devs don't have to read all the 40+ comments.

We discussed in the meeting that this code is not urgent, and requires a couple of days to bake in stage. We can defer review and merging until we have that time. The code is in a good state to hang out on the shelf for a while, and I can rebase when needed.

@jwhitlock jwhitlock added Do not merge No rush and removed Not ready Not ready for code review labels May 10, 2019
Copy link
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a bunch of comments, but I like this approach and I appreciate that it's really easy to extract and turn into a general purpose library.

If you want to hop on zoom and talk about any of the issues, that might get us to a resolution faster than going back and forth with PR comments. Woot!

socorro/app/socorro_app.py Show resolved Hide resolved
@@ -62,9 +62,6 @@ def get_standard_config(self, sentry_dsn=None):
mocked_companion_process = mock.Mock()
config.companion_process.companion_class = mock.Mock(return_value=mocked_companion_process)

config.sentry = mock.MagicMock()
config.sentry.dsn = sentry_dsn
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config is a DotDict, so you need to add a namespace to it. You can do that like this:

config.sentry = DotDict()
config.sentry.dsn = sentry_dsn

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was removed in the PR. With raven, the sentry DSN was passed at the time of the exception. With sentry_sdk, Sentry is initialized from this value at run, which is not run for these tests.

My strategy was to mock is_enabled to avoid initialization, but alternatively the tests could call pa.setup_crash_reporting(config, 'fake_version') around the time they call pa._setup_source_and_destination().

webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
filtered at creation, but most crumbs are discarded since an event isn't usually generated, so
it makes sense to wait until event sanitization to process it. The SanitizeBreadcrumbs filter
runs breadcrumb filters (that don't need data from the hint) at event processing rather than
at crumb creation.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These comments are super helpful. 👍

webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
else:
data_out[key] = value
if modified:
glom.glom(event, glom.Assign(self.section_path, data_out))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of funny. They added Assign because I asked for it, but I think this is the first use we've had of it.

webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
webapp-django/crashstats/crashstats/tests/test_sentry.py Outdated Show resolved Hide resolved
@willkg
Copy link
Collaborator

willkg commented May 14, 2019

Oh, I had one other thought. This adds a lot of code and while there are tests, they're tests for known scenarios and don't account for change in sentry-sdk. What happens when this code fails? How would we find out?

Maybe it makes sense to have the before_send in a big try/except block that logs the exception if anything happens? That gets us something in the logs that we can watch after deploys that might affect sentry things. It doesn't pro-actively notify us, though.

Use the standard Django integration of sentry_sdk to replace raven.
We're using the send_default_pii=False (the default), to explicitly
not collect user data like email addresses, and connection data like IP
addresses and cookies.
Switch to the sentry_sdk initialized in settings:

* Use the default exception handler for handle
* Use sentry_sdk.capture_error() in _run_once
Switch management command depcheck to use the settings-initialized
sentry_sdk.
Copy link
Member Author

@jwhitlock jwhitlock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a number of changes, and there's some more work to do

Exceptions in before_send

The sentry code wraps the before_send call:

https://github.com/getsentry/sentry-python/blob/78d17d1847f885fff9164ee21051df2a1c2c65be/sentry_sdk/client.py#L153-L160

I tested this by raising an exception. There is:

  • A multi-line ERROR-level log from sentry "Internal error in sentry_sdk", followed by the original traceback and the new exception.
  • A single-line INFO-level log that prints the event
  • Django's ERROR-level log of the original traceback and the new exception.

The event is not sent to Sentry if before_send raises an exception.

Log level of before_send loggers

The suggestion is to change the before_send loggers to log at DEBUG instead of INFO. The default log level is INFO, so DEBUG messages are swallowed. A few ways forward:

  • Continue logging at INFO, if the SENTRY_DEBUG is set
  • Always log at DEBUG, and require the developer to switch to DEBUG to see them
  • Always log at DEBUG, and dynamically add a handler for DEBUG messages if SENTRY_DEBUG is set

Next set of work

  • Convert to filters that just return the event, not the tuple of the event and if it was modified
  • Redo logging configuration so that debug_logger.debug() works
  • Switch the query processing to filter rather than truncate
  • Indent multi-line SQL test cases
  • Determine if querystring parsing can be used without corrupting data
  • Try "production-like" mode, and see if I can get the extended gunicorn breadcrumbs from production

@@ -62,9 +62,6 @@ def get_standard_config(self, sentry_dsn=None):
mocked_companion_process = mock.Mock()
config.companion_process.companion_class = mock.Mock(return_value=mocked_companion_process)

config.sentry = mock.MagicMock()
config.sentry.dsn = sentry_dsn
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code was removed in the PR. With raven, the sentry DSN was passed at the time of the exception. With sentry_sdk, Sentry is initialized from this value at run, which is not run for these tests.

My strategy was to mock is_enabled to avoid initialization, but alternatively the tests could call pa.setup_crash_reporting(config, 'fake_version') around the time they call pa._setup_source_and_destination().

webapp-django/crashstats/crashstats/sentry.py Outdated Show resolved Hide resolved
webapp-django/crashstats/crashstats/tests/test_sentry.py Outdated Show resolved Hide resolved
@willkg
Copy link
Collaborator

willkg commented May 14, 2019

Exceptions in before_send

What if we wrap the before_send in a try/except that uses markus to send metrics to datadog whenever it fails and then have it re-raise? Something like this:

metrics = markus.get_metrics('sentry')

...

def before_send(whatever):
    try:
        whatever
    except Exception:
        metrics.incr('before_send_exception')
        raise

If markus is configured at that point, it'll send something and we can see it in datadog.

Does that sound like a good idea? I'm not sure how else we can get signals out.

Log level of before_send loggers

I suggested we switch it to debug because I thought we were looking at debug level in the local dev environment. Now that I know it's not logging at debug level, I take back my comment--let's leave it as what you have since if someone has debug mode enabled and it seems onerous and confusing to require multiple actions to get debugging to work.

@willkg
Copy link
Collaborator

willkg commented May 14, 2019

#4917 (comment)

It's more readable to me if we adjust where the parens are. It's a tuple of three elements with the second element being a really long string. I think this is more readable:

        # Select a user by email
        (
            'email',
            (
                'SELECT "auth_user"."is_active" FROM "auth_user" WHERE'
                ' UPPER("auth_user"."email"::text) = UPPER(\'username@example.com\')'
            ),
            (
                '...'
            )
        )

When I was reviewing that code, I had to count parens to figure out where the three elements were demarcated. That seems really prone to error.

@jwhitlock
Copy link
Member Author

I think the logging is a little cleaner now. Instead of passing around a logger, there's a module-level logger that gets called when needed, and the logic to "turn on" the messages moves to settings. Take a look and see if you agree.

Incrementing a counter with markus will work to track exceptions. When we make it a library, I don't know if we want to require markus as a dependency. But, I think your code works for v1, and we can worry about the library later.

I'm continuing with migrating the return from (event, modified?) to just modified. Working through this change suggests another. The event is modified in place, and returned to signify that it wasn't dropped. None of our code drops events however, and it seems a misleading that we're modifying the input as well as returning it.

Instead, it seems better to raise an exception when then event (or breadcrumb) should be dropped instead of processed, catch the exception in SentryProcessor, and return None. But, none of the current filters drop events for crumbs, so I think I'm just going to drop the return value, and adjust docstrings to make it clear that the event is modified in place.

ids=tuple(case[0] for case in CASES))
def test_filtered_queries(self, keyword, message, expected):
'keyword,sql', CASES.items(),
ids=tuple(case for case in CASES))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this line up right? Does CASES.items() return the same as iter(CASES)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience, yes, and that's my understanding from the dictionary view docs: https://docs.python.org/3.6/library/stdtypes.html#dictionary-view-objects

@jwhitlock
Copy link
Member Author

jwhitlock commented May 15, 2019

My remaining task is to run in production mode, and see if I can get the sensitive breadcrumbs I'm seeing in the raven-generated events in stage. That may add another sanitizer, but I don't expect the current sanitizers to change, so this may be another good place to review.

Update: The event is fundamentally the same in production mode, so no further changes.

I hope to squash all these review commits into the "Filer events for pii/security" commit, because I don't think there is a long-term benefit to all these follow-on commits. If you want me to squash before you review, let me know.

Update 2: I've rebased and squashed the commits

Convert ProcessorApp to use sentry_sdk:

* Move crash reporting initialization to the base socorro_app.App
* Assume sentry is initialized in sentry_client helpers
* Remove now unused Raven library
* Change sentry_client.get_client to get_hub, to better match
  sentry_sdk, and update the test mocks
* Add sentry_client.is_enabled() to check if Sentry has an initialized hub
  and dsn, and mock to return True in tests
* Drop the sentry_dsn parameter from both get_hub and capture_error
* Drop local handling of sentry_dsn in each app, defer to App, and
  simplify test config setup accordingly
* Split test_rule_error from test_processor_2015.py into two tests, one
  with Sentry disabled, one enabled
Copy link
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The markus bits need to get fixed and I'd make the CASES declarations more readable. This looks good though. After those fixes I think we can land this and test it out on stage.

Do you have a test plan for this?

webapp-django/crashstats/sentrylib.py Outdated Show resolved Hide resolved
webapp-django/crashstats/tests/test_sentrylib.py Outdated Show resolved Hide resolved
webapp-django/crashstats/tests/test_sentrylib.py Outdated Show resolved Hide resolved
webapp-django/crashstats/tests/test_sentrylib.py Outdated Show resolved Hide resolved
webapp-django/crashstats/tests/test_sentrylib.py Outdated Show resolved Hide resolved
@jwhitlock
Copy link
Member Author

Fixes applied in follow-on commits. I'd like to squash these as well before merging.

My test plan:

  • Use "Crash me now" to generate a webapp report in staging, look at the generated event
  • Wait for a KeyError: 'java_stack_trace_' exception (2 in last 48 hours, might be possible to induce with re-submit), look at the generated event.

There's a good chance that there will be addition scrubbing needed in the deployed environments, based on crumbs that appear in the deployed raven events but not in the development environment with sentry_sdk.

Copy link
Collaborator

@willkg willkg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! 👍

Sorry this was such a slog--I was kind of learning about the bits as we went along. 😞

Implement before_send on the webapp to clear personal information and
security data from the event (exception or message) before sending to
Sentry:

* Truncate SQL queries when a sensitive column is mentioned
* Filter HTTP headers to mask Auth-Token
* Filter POST data to mask CSRF tokens
* Filter query strings to mask OAuth parameters
@jwhitlock
Copy link
Member Author

I was learning on this one too. Thanks for the feedback on coding styles, markus, and the rest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants