Skip to content

Conversation

@Gankra
Copy link
Collaborator

@Gankra Gankra commented Sep 23, 2021

This seems to get the implementation working well on some examples in socorro I tested.

Notable problems I found and fixed:

  • I was too aggressively rejecting overlapping STACK WIN address ranges. It seems very common that the length of a STACK WIN record just covers the entirety of the function(?) and subsequent entries mask over it. So we have to truncate the lengths as we parse them to clean this up.
  • It seems that looking up STACK WIN entries based on the callee's raw $eip is incorrect, and that you need to subtract 1? (Which is what adjust_instruction does, so really we should just be using that value instead of eip?)
  • I was deriving grand_callee_parameter_size from the value provided in FUNC records, but it appears that STACK WIN records have the more accurate value.

Also I added in a quick and dirty more fancy version of the address_seems_valid that actually looks up the function record for the module, but it's not quite right because it doesn't have a case for "it's in the module but we couldn't load any function records". I should probably delete the code, I only wrote it while I was desperately trying to figure out what was breaking.

Surprisingly, I now don't have any examples which seem to need the accursed scanning or other hacks that breakpad is stuffed with for STACK WIN cfi!

TODO:

  • test more socorro cases
  • write tests
  • don't forward callee's registers as aggressively (doesn't seem to cause problems, but breakpad doesn't say those registers are valid anymore)

@Gankra
Copy link
Collaborator Author

Gankra commented Sep 23, 2021

reminder if anyone wants to test this out on some examples that I have instructions on how to use this with tecken here: https://github.com/luser/rust-minidump/tree/master/minidump-stackwalk#analyzing-firefox-minidumps

@Gankra
Copy link
Collaborator Author

Gankra commented Sep 23, 2021

Hmm the extra -1 offset I applied makes a ton of the breakpad x86 unittests break (both STACK WIN and STACK CFI). Seems there's something more subtle going on that I need to work out.

The reason I convinced myself I needed the -1 was that I was seeing a few cases where we were one-past-the-end of the record we were supposed to be using. Perhaps I'm still not computing/handling the length right?

@luser
Copy link
Collaborator

luser commented Sep 23, 2021

It seems that looking up STACK WIN entries based on the callee's raw $eip is incorrect, and that you need to subtract 1? (Which is what adjust_instruction does, so really we should just be using that value instead of eip?)

I think this is actually not unreasonable, because the return address on the stack is the instruction after the call, right?

@luser
Copy link
Collaborator

luser commented Sep 23, 2021

Surprisingly, I now don't have any examples which seem to need the accursed scanning or other hacks that breakpad is stuffed with for STACK WIN cfi!

Hah, amazing! I'm sure you could grovel through blame and bugzilla for examples, but all the actual crash reports are assuredly lost to the mists of time. Given that this is all 32-bit Windows-specific, maybe it's not worth the effort?

@Gankra
Copy link
Collaborator Author

Gankra commented Sep 23, 2021

Ah right, having slept on it I realized of course that we don't adjust the context frame's instruction, and that's the logic I need to make the -1 work properly in all cases!

As it turns out, most of my implementation was actually correct! The main issues
seemed to stem from the breakpad symbol file being a nasty thing full of Lies.
Most notably:

* The byte length of a STACK WIN entry may actually cover the whole function(?)
  and just expect later STACK WIN entries that overlap with it to mask it out.
  To handle this, we now truncate the previous STACK WIN entry to the starting
  address of the next one during parsing.

* The parameter_size field in FUNC entries is actually a trap, and the STACK WIN
  values are the ones that should be used. It's not 100% clear to me if the FUNC
  entries are *ever* reasonable, or if we should refuse to evaluate CFI if there
  is no STACK WIN entry to provide that value. For now I'm maximally permissive
  and will both use the FUNC value as a fallback, and further use 0 as a fallback
  if even that isn't present. (Defaulting to 0 is necessary when unwinding the
  context frame since there is no grand_callee yet, but it's not clear to me
  if that default should apply otherwise.)

I also realized that I should be using the callee's adjusted "instruction" value
rather than the raw value of `$eip` to lookup modules/symbols. This is more
universally applied in a refactor in the next patch.

This also adds a few "basic" tests from the breakpad testsuite. Notably I
have omitted the tests for all the weird corner-case heuristics that breakpad
uses like scanning from .raSearchStart. So far I haven't found an actual firefox
minidump (in my limited testing) that needs these, so I'll punt on them until
proven otherwise.
cfi unwining should be using callee.instruction instead of the pc/ip in the context. Rather than
duplicate the logic to specially handle the context frame everywhere, just use the fact that we
keep the right value in the callee and pass it down. To avoid having a million things passed down
I also remove several other arguments that are just in the callee frame already. This adds some
boilerplate for pulling out the callee validity, but I think it's fine.
also a little tweak to prevent register forwarding when using STACK WIN that I just
absentmindedly added while reviewing the documentation
@Gankra
Copy link
Collaborator Author

Gankra commented Sep 30, 2021

PR complete, I think?

@Gankra Gankra changed the title WIP STACK WIN unwinding STACK WIN unwinding Sep 30, 2021
@gabrielesvelto
Copy link
Collaborator

I'd say yes! I'm going to throw some particularly nasty minidumps at it to look at the results but the few I've already tried all looked very good already.

@Gankra
Copy link
Collaborator Author

Gankra commented Sep 30, 2021

Ok I got one (1) socorro crash report confidently verified to have the same unwind for all threads using rust-minindump, so I'm happy landing this now. Can iterate on this more later.

(Verified using my new little script: https://github.com/Gankra/socc-pair/)

@Gankra Gankra merged commit ec75e45 into rust-minidump:master Sep 30, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants