Skip to content

fix bug 1505233: clean up Rule error handling#4696

Merged
willkg merged 5 commits into
mozilla-services:masterfrom
willkg:1505233-rules
Nov 7, 2018
Merged

fix bug 1505233: clean up Rule error handling#4696
willkg merged 5 commits into
mozilla-services:masterfrom
willkg:1505233-rules

Conversation

@willkg
Copy link
Copy Markdown
Collaborator

@willkg willkg commented Nov 7, 2018

This moves Rule from socorro.lib.transform_rules to socorro.processor.rules.base, guts it removing the error handling and some unused functionality, and then updates all the subclasses accordingly.

This moves socorro.lib.transform_rules to socorro.processor.rules.base.
Rules are only used by the processor, so it's easier if the code is
centralized there.
Previously, Rule was implemented so that a rule could stop processing for
a crash report. We don't use that feature, so we can nix that and stop
checking the return of action and act.

Rule had action/_action and predicate/_predicate where action and predicate
had a bunch of error handling. Instead of handling errors in Rule, we're
going to let them get handled by the thing running the rules. Rules should
be defensive and thoughtful in how they execute and not throw errors
willy-nilly.

Rules used to have versions. I took all that out since it's not used anywhere.

After simplifying Rule, there wasn't anything to test, so I removed the
Rule tests.

There were some other subclasses of Rule, but those weren't used, so I
removed those and their tests.
def action(self, raw_crash, raw_dumps, processed_crash, processor_meta):
if 'uuid' in raw_crash:
processed_crash['crash_id'] = raw_crash['uuid']
processed_crash['uuid'] = raw_crash['uuid']
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The uuid gets added by the collector to the raw crash. We should never be processing crashes that don't have a uuid, so ... I'm not really sure why we would end up here ever.

Comment thread socorro/processor/rules/base.py Outdated

def close(self):
# FIXME(willkg): see if any of the rules use .close()
self.config.logger.debug('null close on rule %s', self.__class__)
Copy link
Copy Markdown
Collaborator Author

@willkg willkg Nov 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can nix this FIXME. There is a rule that has a .close() in it.

@willkg
Copy link
Copy Markdown
Collaborator Author

willkg commented Nov 7, 2018

The tests pass, linting passes, and I ran some crashes through it and it seems fine. There are a few differences here:

  1. action and predicate may have been swallowing errors letting processing continue which are no longer swallowed
  2. because of that, it's possible that bugs in the code will throw errors that weren't previously being thrown
  3. because of that, it's possible that when this gets to stage, it'll start throwing a lot of errors and the processing queue will back up and things might go down

It'd be super if that didn't happen in production, so once this lands in stage, I'm going to let it hang out for a couple of days and fix anything that pops up. It's entirely possible that not much happens because @peterbe and @adngdb did a pass at fixing errors in rules a couple of years ago and that code has been in prod for ages. So hopefully this is a whole lot of worry for nothing.

We pulled all the exception handling out of Rule. This puts it in
process_crash. It'll surface errors in sentry and also add a processor
note so we can reprocess crashes.
@willkg
Copy link
Copy Markdown
Collaborator Author

willkg commented Nov 7, 2018

The code changes look ok. This is a lot of code removal, but it's mostly things that we don't use at all. There's some cosmetic changes in here, too.

The big change here is that Rule no longer catches and surfaces errors--that's now done by Processor2015.process_crash. It'll also add notes to the processed crash making it easier for us to identify and reprocess partially processed crashes.

# NOTE(willkg): notes are public, so we can't put exception
# messages in them
processor_meta_data.processor_notes.append(
'rule %s failed: %s' % (rule.__class__.__name__, exc.__class__.__name__)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This creates a line like "rule OSInfoRule failed: KeyError". That gives me something to search for to reprocess crashes that had issues but shouldn't reveal any PII in a public field.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing that calls process_crash also has error handling, but that error handling is tied in with processing the crash and saving it. I didn't want to further complicate that. Further, if we handled errors there, processing would halt on the first error thrown. That'd be great if we had a system where crashes that couldn't get through processing were put aside in a separate bin. But we don't have that. Our options are:

  1. continue processing despite errors and end up with a partially processed crash
  2. throw it back in the rabbitmq queue
  3. drop it altogether and never process it again

We can't do option 3 because then we have crashes we've saved to S3 but don't have indexed and don't know anything about. Option 2 could lead to the queue filled with bad crashes and the processor grinding to a halt. Option 1 is unenthusing and could lead to confusion during analysis.

I'm going for option 1. The hope is that we'll catch the errors in sentry and fix the problems and push the fixes out. Since we're adding notes to the crashes in question, we can reprocess them once the problem is fixed.

We'll see how that goes.

)
assert processed_crash.processor_notes == expected
assert processed_crash.signature.startswith('shutdownhang')
assert len(processed_crash.signature) == 255
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what this was testing. Sure seemed like it was testing SigTrunc in signature generation. That's tested (better) elsewhere, so I nixed this.

@willkg
Copy link
Copy Markdown
Collaborator Author

willkg commented Nov 7, 2018

I'm going to land this now and watch stage for a while. self-r+

@willkg willkg merged commit 19dd7b5 into mozilla-services:master Nov 7, 2018
@willkg willkg deleted the 1505233-rules branch November 7, 2018 19:02
@lonnen
Copy link
Copy Markdown
Contributor

lonnen commented Nov 7, 2018

post-landing r+, this looks great. excited for it.

@willkg
Copy link
Copy Markdown
Collaborator Author

willkg commented Nov 8, 2018

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants