fix bug 1505233: clean up Rule error handling #4696
This moves socorro.lib.transform_rules to socorro.processor.rules.base. Rules are only used by the processor, so it's easier if the code is centralized there.
- Previously, Rule was implemented so that a rule could stop processing for a crash report. We don't use that feature, so we can nix that and stop checking the return of action and act.
- Rule had action/_action and predicate/_predicate where action and predicate had a bunch of error handling. Instead of handling errors in Rule, we're going to let them get handled by the thing running the rules. Rules should be defensive and thoughtful in how they execute and not throw errors willy-nilly.
- Rules used to have versions. I took all that out since it's not used anywhere.
- After simplifying Rule, there wasn't anything to test, so I removed the Rule tests.
- There were some other subclasses of Rule, but those weren't used, so I removed those and their tests.
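To make the simplification concrete, here is a minimal sketch of what a Rule base class looks like once the error handling and return-value checks are gone. This is an illustration written from the description above, not the code in this PR; `CopyUuidRule` is a hypothetical subclass used only for the example.

```python
class Rule:
    """Simplified base class sketch (an assumption, not the actual Socorro code)."""

    def predicate(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        # Run the action by default; subclasses override this to opt out.
        return True

    def action(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        # Subclasses mutate processed_crash in place; the return value is ignored.
        pass

    def act(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        # No try/except here: errors propagate to whatever is running the rules.
        if self.predicate(raw_crash, raw_dumps, processed_crash, processor_meta):
            self.action(raw_crash, raw_dumps, processed_crash, processor_meta)


class CopyUuidRule(Rule):
    """Hypothetical rule that is defensive about a missing key."""

    def predicate(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        return "uuid" in raw_crash

    def action(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        processed_crash["crash_id"] = raw_crash["uuid"]


processed = {}
CopyUuidRule().act({"uuid": "abc"}, {}, processed, None)

skipped = {}
CopyUuidRule().act({}, {}, skipped, None)
```

With this shape, a rule that cannot apply simply does nothing, and a rule that hits an unexpected error raises it to the caller instead of swallowing it.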
    def action(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        if 'uuid' in raw_crash:
            processed_crash['crash_id'] = raw_crash['uuid']
            processed_crash['uuid'] = raw_crash['uuid']
The uuid gets added by the collector to the raw crash. We should never be processing crashes that don't have a uuid, so ... I'm not really sure why we would end up here ever.
    def close(self):
        # FIXME(willkg): see if any of the rules use .close()
        self.config.logger.debug('null close on rule %s', self.__class__)
I can nix this FIXME. There is a rule that has a .close() in it.
The tests pass, linting passes, and I ran some crashes through it and it seems fine. There are a few differences here:
It'd be super if that didn't happen in production, so once this lands in stage, I'm going to let it hang out for a couple of days and fix anything that pops up. It's entirely possible that not much happens because @peterbe and @adngdb did a pass at fixing errors in rules a couple of years ago and that code has been in prod for ages. So hopefully this is a whole lot of worry for nothing.
We pulled all the exception handling out of Rule. This puts it in process_crash. It'll surface errors in sentry and also add a processor note so we can reprocess crashes.
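A rough sketch of the idea: the loop that runs the rules catches each exception, reports it, and records a note containing only the rule and exception class names. The names `run_rules`, `BrokenRule`, and `ProductRule` are hypothetical, and `logger.exception` stands in for the Sentry reporting, which is not shown here.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger("processor_sketch")


def run_rules(rules, raw_crash, raw_dumps, processed_crash, processor_meta):
    """Run every rule; a failing rule is noted and the remaining rules still run."""
    for rule in rules:
        try:
            rule.act(raw_crash, raw_dumps, processed_crash, processor_meta)
        except Exception as exc:
            # Stand-in for error reporting (assumption: the real code reports to Sentry).
            logger.exception("rule %s failed", rule.__class__.__name__)
            # Notes are public, so record only class names, never the exception message.
            processor_meta.processor_notes.append(
                "rule %s failed: %s" % (rule.__class__.__name__, exc.__class__.__name__)
            )
    return processed_crash


class BrokenRule:
    """Hypothetical rule that raises KeyError when the field is missing."""

    def act(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        processed_crash["os"] = raw_crash["OSName"]


class ProductRule:
    """Hypothetical rule that handles a missing field defensively."""

    def act(self, raw_crash, raw_dumps, processed_crash, processor_meta):
        processed_crash["product"] = raw_crash.get("ProductName", "unknown")


meta = SimpleNamespace(processor_notes=[])
result = run_rules([BrokenRule(), ProductRule()], {"ProductName": "Firefox"}, {}, {}, meta)
```

In this sketch the crash ends up partially processed: `ProductRule` still runs after `BrokenRule` fails, and the note records which rule failed.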
The code changes look ok. This is a lot of code removal, but it's mostly things that we don't use at all. There's some cosmetic changes in here, too. The big change here is that
    # NOTE(willkg): notes are public, so we can't put exception
    # messages in them
    processor_meta_data.processor_notes.append(
        'rule %s failed: %s' % (rule.__class__.__name__, exc.__class__.__name__)
This creates a line like "rule OSInfoRule failed: KeyError". That gives me something to search for to reprocess crashes that had issues but shouldn't reveal any PII in a public field.
The thing that calls process_crash also has error handling, but that error handling is tied in with processing the crash and saving it. I didn't want to further complicate that. Further, if we handled errors there, processing would halt on the first error thrown. That'd be great if we had a system where crashes that couldn't get through processing were put aside in a separate bin. But we don't have that. Our options are:
1. continue processing despite errors and end up with a partially processed crash
2. throw it back in the rabbitmq queue
3. drop it altogether and never process it again
We can't do option 3 because then we have crashes we've saved to S3 but don't have indexed and don't know anything about. Option 2 could lead to the queue filled with bad crashes and the processor grinding to a halt. Option 1 is unenthusing and could lead to confusion during analysis.
I'm going for option 1. The hope is that we'll catch the errors in sentry and fix the problems and push the fixes out. Since we're adding notes to the crashes in question, we can reprocess them once the problem is fixed.
We'll see how that goes.
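As a sketch of how the notes could drive reprocessing later: once a rule is fixed, the "rule X failed: Y" line gives something to match on. The helper names and the shape of the input (a dict mapping crash ids to lists of note strings) are assumptions for illustration; how notes are actually stored and searched is not shown in this PR.

```python
import re

# Matches notes of the form "rule OSInfoRule failed: KeyError".
RULE_FAILURE_RE = re.compile(r"^rule (\w+) failed: (\w+)$")


def failed_rules(processor_notes):
    """Return (rule_name, exception_name) pairs found in a list of notes."""
    failures = []
    for note in processor_notes:
        match = RULE_FAILURE_RE.match(note)
        if match:
            failures.append((match.group(1), match.group(2)))
    return failures


def crashes_to_reprocess(notes_by_crash_id, rule_name):
    """Return ids of crashes where the given rule failed (hypothetical helper)."""
    return [
        crash_id
        for crash_id, notes in notes_by_crash_id.items()
        if any(name == rule_name for name, _ in failed_rules(notes))
    ]


example_notes = {
    "crash-a": ["rule OSInfoRule failed: KeyError"],
    "crash-b": ["some unrelated note"],
    "crash-c": ["rule OSInfoRule failed: TypeError", "rule OtherRule failed: KeyError"],
}
to_reprocess = crashes_to_reprocess(example_notes, "OSInfoRule")
```

Because the note contains only class names, the match stays stable and does not depend on exception messages that could include private data.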
    )
    assert processed_crash.processor_notes == expected
    assert processed_crash.signature.startswith('shutdownhang')
    assert len(processed_crash.signature) == 255
I have no idea what this was testing. Sure seemed like it was testing SigTrunc in signature generation. That's tested (better) elsewhere, so I nixed this.
I'm going to land this now and watch stage for a while. self-r+
post-landing r+, this looks great. excited for it.
Thank you!
This moves Rule from socorro.lib.transform_rules to socorro.processor.rules.base, guts it by removing the error handling and some unused functionality, and then updates all the subclasses accordingly.