fix bug 1505233: clean up Rule error handling by willkg · Pull Request #4696 · mozilla-services/socorro

willkg · 2018-11-07T15:10:44Z

This moves Rule from socorro.lib.transform_rules to socorro.processor.rules.base, guts it removing the error handling and some unused functionality, and then updates all the subclasses accordingly.

This moves socorro.lib.transform_rules to socorro.processor.rules.base. Rules are only used by the processor, so it's easier if the code is centralized there.

Previously, Rule was implemented so that a rule could stop processing for a crash report. We don't use that feature, so we can nix that and stop checking the return of action and act. Rule had action/_action and predicate/_predicate where action and predicate had a bunch of error handling. Instead of handling errors in Rule, we're going to let them get handled by the thing running the rules. Rules should be defensive and thoughtful in how they execute and not throw errors willy-nilly. Rules used to have versions. I took all that out since it's not used anywhere. After simplifying Rule, there wasn't anything to test, so I removed the Rule tests. There were some other subclasses of Rule, but those weren't used, so I removed those and their tests.

willkg · 2018-11-07T15:12:23Z

+    def action(self, raw_crash, raw_dumps, processed_crash, processor_meta):
+        if 'uuid' in raw_crash:
+            processed_crash['crash_id'] = raw_crash['uuid']
+            processed_crash['uuid'] = raw_crash['uuid']


The uuid gets added by the collector to the raw crash. We should never be processing crashes that don't have a uuid, so ... I'm not really sure why we would end up here ever.

willkg · 2018-11-07T15:15:04Z

+
+    def close(self):
+        # FIXME(willkg): see if any of the rules use .close()
+        self.config.logger.debug('null close on rule %s', self.__class__)


I can nix this FIXME. There is a rule that has a .close() in it.

willkg · 2018-11-07T15:21:45Z

The tests pass, linting passes, and I ran some crashes through it and it seems fine. There are a few differences here:

action and predicate may have been swallowing errors letting processing continue which are no longer swallowed
because of that, it's possible that bugs in the code will throw errors that weren't previously being thrown
because of that, it's possible that when this gets to stage, it'll start throwing a lot of errors and the processing queue will back up and things might go down

It'd be super if that didn't happen in production, so once this lands in stage, I'm going to let it hang out for a couple of days and fix anything that pops up. It's entirely possible that not much happens because @peterbe and @adngdb did a pass at fixing errors in rules a couple of years ago and that code has been in prod for ages. So hopefully this is a whole lot of worry for nothing.

We pulled all the exception handling out of Rule. This puts it in process_crash. It'll surface errors in sentry and also add a processor note so we can reprocess crashes.

willkg · 2018-11-07T16:46:17Z

The code changes look ok. This is a lot of code removal, but it's mostly things that we don't use at all. There's some cosmetic changes in here, too.

The big change here is that Rule no longer catches and surfaces errors--that's now done by Processor2015.process_crash. It'll also add notes to the processed crash making it easier for us to identify and reprocess partially processed crashes.

willkg · 2018-11-07T17:09:16Z

+                # NOTE(willkg): notes are public, so we can't put exception
+                # messages in them
+                processor_meta_data.processor_notes.append(
+                    'rule %s failed: %s' % (rule.__class__.__name__, exc.__class__.__name__)


This creates a line like "rule OSInfoRule failed: KeyError". That gives me something to search for to reprocess crashes that had issues but shouldn't reveal any PII in a public field.

The thing that calls process_crash also has error handling, but that error handling is tied in with processing the crash and saving it. I didn't want to further complicate that. Further, if we handled errors there, processing would halt on the first error thrown. That'd be great if we had a system where crashes that couldn't get through processing were put aside in a separate bin. But we don't have that. Our options are:

continue processing despite errors and end up with a partially processed crash

throw it back in the rabbitmq queue

drop it altogether and never process it again

We can't do option 3 because then we have crashes we've saved to S3 but don't have indexed and don't know anything about. Option 2 could lead to the queue filled with bad crashes and the processor grinding to a halt. Option 1 is unenthusing and could lead to confusion during analysis.

I'm going for option 1. The hope is that we'll catch the errors in sentry and fix the problems and push the fixes out. Since we're adding notes to the crashes in question, we can reprocess them once the problem is fixed.

We'll see how that goes.

willkg · 2018-11-07T17:15:58Z

-        )
-        assert processed_crash.processor_notes == expected
-        assert processed_crash.signature.startswith('shutdownhang')
-        assert len(processed_crash.signature) == 255


I have no idea what this was testing. Sure seemed like it was testing SigTrunc in signature generation. That's tested (better) elsewhere, so I nixed this.

willkg · 2018-11-07T18:21:32Z

I'm going to land this now and watch stage for a while. self-r+

lonnen · 2018-11-07T20:50:51Z

post-landing r+, this looks great. excited for it.

willkg · 2018-11-08T21:07:43Z

Thank you!

willkg added 2 commits November 6, 2018 20:38

bug 1505233: move Rule to processor

c782708

This moves socorro.lib.transform_rules to socorro.processor.rules.base. Rules are only used by the processor, so it's easier if the code is centralized there.

willkg commented Nov 7, 2018

View reviewed changes

Remove FIXME

5253f74

willkg commented Nov 7, 2018

View reviewed changes

fix bug 1505233: push exception handling to process_crash

d0ae0f3

We pulled all the exception handling out of Rule. This puts it in process_crash. It'll surface errors in sentry and also add a processor note so we can reprocess crashes.

Remove unused import

224e320

willkg commented Nov 7, 2018

View reviewed changes

willkg merged commit 19dd7b5 into mozilla-services:master Nov 7, 2018

willkg deleted the 1505233-rules branch November 7, 2018 19:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix bug 1505233: clean up Rule error handling#4696

fix bug 1505233: clean up Rule error handling#4696
willkg merged 5 commits into
mozilla-services:masterfrom
willkg:1505233-rules

willkg commented Nov 7, 2018

Uh oh!

willkg Nov 7, 2018

Uh oh!

willkg Nov 7, 2018 •

edited

Loading

Uh oh!

willkg commented Nov 7, 2018

Uh oh!

willkg commented Nov 7, 2018

Uh oh!

willkg Nov 7, 2018

Uh oh!

willkg Nov 7, 2018

Uh oh!

willkg Nov 7, 2018

Uh oh!

willkg commented Nov 7, 2018

Uh oh!

lonnen commented Nov 7, 2018

Uh oh!

willkg commented Nov 8, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

willkg commented Nov 7, 2018

Uh oh!

willkg Nov 7, 2018

Choose a reason for hiding this comment

Uh oh!

willkg Nov 7, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

willkg commented Nov 7, 2018

Uh oh!

willkg commented Nov 7, 2018

Uh oh!

willkg Nov 7, 2018

Choose a reason for hiding this comment

Uh oh!

willkg Nov 7, 2018

Choose a reason for hiding this comment

Uh oh!

willkg Nov 7, 2018

Choose a reason for hiding this comment

Uh oh!

willkg commented Nov 7, 2018

Uh oh!

lonnen commented Nov 7, 2018

Uh oh!

willkg commented Nov 8, 2018

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

willkg Nov 7, 2018 •

edited

Loading