Win 10 opt/Lin Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] #2806
Comments
|
I've pinged @luser about this, as he seems to have a fair bit of breakpad experience. I'm hoping he can provide some thoughts on how best to approach this... |
|
Looking at that first log link, you can see the first stack trace has this frame: mozcrash should really do a better job of providing info here, but this is a dead giveaway that this stack is from an intentionally-created pair of minidumps, and this is the chrome process side. We invoke this method when a content process hangs, so that we can get stacks from both processes before killing it. If you scroll down a bit you'll find the second stack, at the |
|
Thanks for that clue! Knowing to just jump to the second crash stack is tremendously helpful. Going through the various log links posted above with that in mind shows that almost all of the real stacks appear quite different from one another. I suppose it's possible that they're all related if something is spraying crap across the address space and just hitting different spots. I'll continue to poke... |
|
The originally linked logs are of various platforms and tests. Here are the 8 most recent pushes to ^ https://treeherder.mozilla.org/logviewer.html#?job_id=112109325&repo=pine&lineNumber=2436 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=112063441&repo=pine&lineNumber=2242 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=112042587&repo=pine&lineNumber=4835 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=111380263&repo=pine&lineNumber=4828 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=111379855&repo=pine&lineNumber=2436 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=111293368&repo=pine&lineNumber=4333 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=111292356&repo=pine&lineNumber=4318 ^ https://treeherder.mozilla.org/logviewer.html#?job_id=111010778&repo=pine&lineNumber=2417 |
|
dmajor did take an initial look and thought |
|
So, I've noticed that a bunch (haven't yet looked through all) of these things are dying with EXCEPTION_BREAKPOINT on Windows, and DUMP_REQUESTED on Linux. After poking around bugzilla a bit, I think this is probably one of the hang monitors (we have more than one!) deciding (quite possibly incorrectly) that something is hung, and sending that signal to (presumably) the child process. |
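For anyone triaging these logs by hand, here is a minimal sketch of the check described above. It assumes you have already pulled the platform and crash reason out of the log yourself; the function name and the platform strings are made up for illustration, and only the two reason strings come from this thread.

```python
# Crash reasons that, per this thread, tend to appear when a dump was
# requested (e.g. by a hang monitor) rather than produced by a real fault.
# The platform keys are hypothetical labels, not something the logs emit.
REQUESTED_DUMP_REASONS = {
    "windows": "EXCEPTION_BREAKPOINT",
    "linux": "DUMP_REQUESTED",
}


def looks_like_requested_dump(platform: str, crash_reason: str) -> bool:
    """Return True if the crash reason matches the requested-dump pattern."""
    expected = REQUESTED_DUMP_REASONS.get(platform.lower())
    return expected is not None and crash_reason.strip() == expected
```

A match only suggests the dump may have been requested; it says nothing about why the process was considered hung.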
|
All the recent builds are only seeing this on Win10 x64 opt, so I've changed the title. |
|
That's because |
|
Ah, fair enough. wlach and I have been speculating about possible resource exhaustion, and it turns out there is some resource usage data from these test runs. I'm trying to see what I can figure out by peering at it. |
We don't actually signal the child process, we just write a dump of it, and fake the exception reason. |
OK, that would seem to make the hang-reporter hypothesis at least a bit less likely. I did notice that in debug builds, the hang reports use MOZ_CRASH, which spits a message about the hang reporter to standard error, but not in opt builds, where we're seeing the crashes. @luser if it were one of the hang reporters, would we be seeing the hang reporter stack trace in the crashing thread?
|
On the resource exhaustion front: it turns out that each test has a resource_usage.json file on the log page that includes a bunch of interesting and useful data about CPU, swap, memory, and I/O. There is a "mach resource-usage" command which will plot some of that info on a nice HTML graph. gps said he'd happily accept patches to that code, and I suspect patching it to show swap and memory info wouldn't be hard. That said, even if it confirmed that suspicion, it wouldn't tell us what was causing the crash. So I'm going to set that aside for the moment. @Mardak and I just spent a bunch of time chatting, and he's been doing a bunch of try-server work to try and sort out when/how this got introduced. He's going to keep on that path for now, and I'm going to get a Linux VM and run some tests and see if I can force failures just by resource-constraining stuff. Good times! |
|
It looks like removing the The parent of that commit has nearly 100% crashes: |
|
Wow, nice detective work! How did you find that? I guess the next questions are:
|
|
Theory time: Is it possible that activating Activity Stream is causing a network access attempt? Our testing infrastructure runs in a mode where network access is expressly forbidden, and attempts at opening a connection result in a crash: It dumps out a very helpful crash message when that network access attempt is made in the parent process. But, at least on my Macbook, if a content process attempts to make a network access attempt while in that mode, we get a tab crash with no helpful message in the console. Maybe the sandbox ate it, I dunno. Is that possible? Is Activity Stream attempting to access the network somehow for Talos, and not for standard mochitests? |
|
I would think not re: network connection. Sorry for the distraction as we didn't land this patch (from #2743) with the previous uplift: https://github.com/mozilla/activity-stream/blob/master/mozilla-central-patches/disable-as-default-sites.diff FYI, this is what I've been pushing to try/pine when testing: https://hg.mozilla.org/try/rev/24228804e8419521aa094d62e7bf693218929790 (set prefs, add tests) and https://hg.mozilla.org/try/rev/bef092981b44c7321fefa24668910be102721b4d (wip patches that fixes other test failures) |
|
Re: @dmose's suggestion of deferring differently, I did try I don't know if that issue is related to these Talos crashes, but suspiciously these both do involve session restoring and problems on Linux. @k88hudson how does about:newtab end up getting a top level |
|
Looking through a bunch of those crash stack traces, I have the strong suspicion that somehow the stack is getting corrupted. @mikeconley, have a look at the second stack in a bunch of those crashes (eg https://treeherder.mozilla.org/#/jobs?repo=try&revision=22bc4684290a4a9f58bbdb7d59537ffec9c3da86 ). I suspect you have more recent experience with stack scribbling; do you have any thoughts here? I'm working on poking at buildconfig to try and make linux64-asan builds run the talos tests on try in the hopes that ASan might catch whatever the problem is... |
|
To be fair, lots of those stacks seem to be inside NT system DLLs. But one of the examples that crops up semi-regularly is https://treeherder.mozilla.org/logviewer.html#?job_id=112476974&repo=try&lineNumber=4490 and if you look at the code for HasLookupRuleWithGlyphByScript you see that the |
|
Well, we're actually running using ASan now, we'll see if we hit any of the crashes... https://treeherder.mozilla.org/#/jobs?repo=try&revision=4929a5523e9879de1a077ed1d9e2b330c09dcc7f |
|
If that doesn't work, another thing to try (thanks to @mikeconley for the suggestion) is https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_a_loaner |
|
Sadly, the ASan talos runs didn't trigger any crashes (though these were all Linux, and there does appear to be an underdocumented, maybe-not-supported Win10 ASan configuration in the tree). Further, ASan builds should exit when they detect an error, and we don't appear to override that behavior in the tree, so they haven't detected any pre-crash errors either, it seems. :-/ |
|
@mikeconley has kindly filed https://bugzilla.mozilla.org/show_bug.cgi?id=1380047 to get us loaner machine access to try and reproduce on... |
|
The pine pushes from Monday have a much lower crash rate for unknown reasons: Crash rate from Friday was around 39% whereas on Monday the rate is about 4% |
|
That's very mysterious. Either something happened in this push to improve things, or something changed in automation. If the latter... I have no idea how we're going to track that down. If the former, perhaps we can bisect on try with this changeset range? |
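A rough back-of-the-envelope sketch of why bisecting an intermittent like this gets expensive. It assumes each talos run crashes independently at a fixed rate, which is a simplification and may not hold here; `runs_needed` is a hypothetical helper, not anything in the tree.

```python
import math


def runs_needed(crash_rate: float, confidence: float = 0.95) -> int:
    """Clean runs needed before trusting that a changeset is really good.

    Assumes independent runs that each crash with probability crash_rate.
    """
    if not 0.0 < crash_rate < 1.0:
        raise ValueError("crash_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # A bad changeset passes n runs in a row with probability (1 - crash_rate) ** n.
    # Find the smallest n that pushes that probability below 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - crash_rate))
```

Under those assumptions, a crash rate of about 39% needs 7 clean runs per changeset for 95% confidence, while a rate of about 4% needs 74, so each bisection step gets far more retrigger-heavy once the rate drops.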
|
I can try bisecting, but it'll be a bit intensive via We were tracking that for Windows crashes in #2802 Although I did include that patch in a pine push last week and windows was still failing: https://treeherder.mozilla.org/#/jobs?repo=pine&revision=8b9a8484dacaf9f0783944972721bd771908a36b&filter-searchStr=talos |
|
I have so far been unsuccessful in getting the build from this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&filter-searchStr=talos&group_state=expanded to crash on the loaner hardware... |
|
Looking at one of the .extra files for one of the crashes, I notice this annotation: This annotation is only ever set if the ContentChild in the content process has told the parent "I'm ready to shut down now.". So the parent has told the content process to shut down. Looking deeper within the .extra file, I also see this: And that's only set here. So a new hypothesis: Activity Stream causes content to request storage information from the parent off of the main thread... and then the thread that this code runs on: Maybe decides that the DB has already shut down by the time it's requesting it. Returning IPC_FAIL_NO_REASON, I believe, could result in the types of crashes we're seeing here - though perhaps kanru or somebody (billm?) who knows our IPC code better might be able to say for sure. |
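If someone wants to check these annotations across many crashes rather than eyeballing each file, a minimal sketch follows. It assumes the .extra file is plain text with one key=value annotation per line; that format is an assumption on my part and may not match every build. `parse_extra` is a hypothetical helper, and the keys in the usage note are placeholders, not the real annotation names.

```python
from typing import Dict


def parse_extra(text: str) -> Dict[str, str]:
    """Parse key=value annotation lines into a dict (format is assumed)."""
    annotations: Dict[str, str] = {}
    for line in text.splitlines():
        if "=" not in line:
            # Skip blank or malformed lines rather than failing.
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = line.partition("=")
        annotations[key.strip()] = value
    return annotations
```

Usage would look something like `parse_extra(open(path).read()).get("SomeAnnotation")`, where `path` and `SomeAnnotation` stand in for the real file and key.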
|
Also note that this patch in the changes that landed since Friday might have reduced the probability of this crash occurring. |
|
It looks like the crashes were already reduced from before The next group of commits do seem potentially likely as well: |
|
OOP Extensions should only apply for WebExtensions though... I wonder how much of that impacts bootstrapped add-ons. |
|
There's several commits for Part 0f has 50% crashes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b93b9a30b42c487ac7caa68946982e144eaebe79&filter-searchStr=talos So it does seem like it's the enabling OOP extensions. Why it reduces crashes is not really clear although why activity-stream is triggering crashes here is also unclear. And how this affects Talos, I believe screenshots is a webextension system-addon.. not sure what other webextensions might be active during talos tests.. (For own reference: |
|
So.. umm… I pushed the latest activity-stream with latest mozilla-central to get a baseline for removing the I also added talos jobs for the latest …resolved? |
|
¯_(ツ)_/¯ ᕕ( ᐛ )ᕗ |
|
OMG; amazing. I'm resolving; we can always re-open if it comes back. |
|
Fwiw, I just pushed a mozilla-central from today to try, and saw the crash. So I think Activity Stream is likely in the clear, but I don't think we've seen the last of this thing. :/ |
|
Fair enough. It's not ideal, but at least it's a start... |
|
So for future reference, if you're suspicious of the stacks you're seeing in a log, and the crash is from a Windows machine, you can always download the minidump + symbols and load them in a Microsoft debugger. The Treeherder log viewer has a list of info in the left pane, including |

Talos has been failing intermittently since at least last week, with crashes across various talos tests and platforms. All of them have a similar structure: the crash reporter crashes when trying to pair the main and content processes for a stack, potentially alongside another actual crash:
Win10 x64: https://treeherder.mozilla.org/logviewer.html#?job_id=110403282&repo=try&lineNumber=7663
Win10 x64 without startup optimizations: https://treeherder.mozilla.org/logviewer.html#?job_id=110363488&repo=try&lineNumber=1848
Win7: https://treeherder.mozilla.org/logviewer.html#?job_id=110384424&repo=try&lineNumber=1677
Linux 64: https://treeherder.mozilla.org/logviewer.html#?job_id=110376189&repo=try&lineNumber=7176
These are still happening even with dom.ipc.processPrelaunch.enabled set to false:
Linux x64: https://treeherder.mozilla.org/logviewer.html#?job_id=110560527&repo=try&lineNumber=2804
Win7: https://treeherder.mozilla.org/logviewer.html#?job_id=110574993&repo=try&lineNumber=1804
Potentially crashreporter itself is crashing because it doesn't know to look at the preloaded newtab content process? But unclear why it's needing to create a crash report in the first place…