Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Win 10 opt/Lin Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] #2806

Closed
Mardak opened this issue Jun 29, 2017 · 45 comments

Comments

@Mardak
Copy link
Member

commented Jun 29, 2017

Talos seems to be failing from at least last week intermittently with crashes on various talos tests on various platforms and all have a similar structure of the crash reporter crashing when trying to pair the main and content processes for a stack as well as potentially another actual crash:

PROCESS-CRASH | tpaint | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…] 
PROCESS-CRASH | tp5o | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump] [log…] 
PROCESS-CRASH | damp | application crashed [@ nsACString::Capacity] [log…]
PROCESS-CRASH | damp | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump] [log…] 
PROCESS-CRASH | dromaeo_css | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump(std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > const &,bool (*)(wchar_t const *,wchar_t const *,void *,_EXCEPTION_POINTERS *,MDRawAssertionInfo *,bool),void *,_MINIDUMP_TYPE)] [log…] 
PROCESS-CRASH | dromaeo_css | application crashed [@ KiFastSystemCallRet + 0x0] [log…] 

Win10 x64: https://treeherder.mozilla.org/logviewer.html#?job_id=110403282&repo=try&lineNumber=7663
Win10 x64 without startup optimizations: https://treeherder.mozilla.org/logviewer.html#?job_id=110363488&repo=try&lineNumber=1848
Win7: https://treeherder.mozilla.org/logviewer.html#?job_id=110384424&repo=try&lineNumber=1677
Linux 64: https://treeherder.mozilla.org/logviewer.html#?job_id=110376189&repo=try&lineNumber=7176

These are still happening even with dom.ipc.processPrelaunch.enabled set to false:
Linux x64: https://treeherder.mozilla.org/logviewer.html#?job_id=110560527&repo=try&lineNumber=2804
Win7: https://treeherder.mozilla.org/logviewer.html#?job_id=110574993&repo=try&lineNumber=1804

Potentially crashreporter itself is crashing because it doesn't know to look at the preloaded newtab content process? But unclear why it's needing to create a crash report in the first place…

@Mardak Mardak added this to the City of God (July 9) milestone Jun 29, 2017

@k88hudson k88hudson added the Tests label Jun 29, 2017

@Mardak Mardak changed the title Intermittent Talos crashes [@ google_breakpad::ExceptionHandler::WriteMinidump] Lin/Win Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] Jun 30, 2017

@dmose dmose self-assigned this Jul 5, 2017

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 5, 2017

I've pinged @luser about this, as he seems to have a fair bit of breakpad experience. I'm hoping he can provide some thoughts on how best to approach this...

@luser

This comment has been minimized.

Copy link

commented Jul 6, 2017

Looking at that first log link, you can see the first stack trace has this frame:

2  xul.dll!CrashReporter::CreateMinidumpsAndPairInternal(...)

mozcrash should really do a better job of providing info here, but this is a dead giveaway that this stack is from an intentionally-created pair of minidumps, and this is the chrome process side. We invoke this method when a content process hangs, so that we can get stacks from both processes before killing it. If you scroll down a bit you'll find the second stack, at the PROCESS-CRASH | tps | application crashed [@ 0x2c47f35a4b4] marker. I honestly have no idea what's going on in that stack, you might try to rope in someone like dmajor or aklotz to help you figure that one out.

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 6, 2017

Thanks for that clue! Knowing to just jump to the second crash stack is tremendously helpful. Going through the various log links posted above with that in mind shows that almost all of the real stacks appear quite different from one another.

I suppose it's possible that they're all related if something is spraying crap across the address space and just hitting different spots.

I'll continue to poke...

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 6, 2017

The originally linked logs are of various platforms and tests. Here are the 8 most recent pushes to pine, and they all crashed for "Windows 10 x64 opt T-e10s s" with various signatures:

PROCESS-CRASH | tsvgr_opacity | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…]
PROCESS-CRASH | tsvgr_opacity | application crashed [@ NtWaitForSingleObject + 0x14] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=112109325&repo=pine&lineNumber=2436

PROCESS-CRASH | tsvgx | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…]
PROCESS-CRASH | tsvgx | application crashed [@ NtWaitForMultipleObjects + 0x14] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=112063441&repo=pine&lineNumber=2242

PROCESS-CRASH | tscrollx | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…]
PROCESS-CRASH | tscrollx | application crashed [@ ResCDirectoryGetEntryIndexEx + 0x2c] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=112042587&repo=pine&lineNumber=4835

PROCESS-CRASH | tscrollx | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…]
PROCESS-CRASH | tscrollx | application crashed [@ NtWaitForAlertByThreadId + 0x14] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=111380263&repo=pine&lineNumber=4828

PROCESS-CRASH | tsvgr_opacity | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…]
PROCESS-CRASH | tsvgr_opacity | application crashed [@ nsCSSFrameConstructor::FrameConstructionItem::FrameConstructionItem(nsCSSFrameConstructor::FrameConstructionData const *,nsIContent *,nsIAtom *,int,PendingBinding *,already_AddRefed<nsStyleContext> &,bool,nsTArray<nsIAnonymousContentCreator::ContentInfo> *)] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=111379855&repo=pine&lineNumber=2436

PROCESS-CRASH | tart | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…] 
PROCESS-CRASH | tart | application crashed [@ RtlpAllocateHeap + 0xa9d] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=111293368&repo=pine&lineNumber=4333

 PROCESS-CRASH | tart | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…] 
 PROCESS-CRASH | tart | application crashed [@ input_s + 0x3b4] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=111292356&repo=pine&lineNumber=4318

PROCESS-CRASH | tsvgr_opacity | application crashed [@ google_breakpad::ExceptionHandler::WriteMinidump()] [log…]
PROCESS-CRASH | tsvgr_opacity | application crashed [@ LdrInitSecurityCookie + 0x7a] [log…] 

^ https://treeherder.mozilla.org/logviewer.html#?job_id=111010778&repo=pine&lineNumber=2417

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 6, 2017

dmajor did take an initial look and thought ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv was forcing the crash dump, but there have been successful runs of Talos svg even with that, e.g., https://treeherder.mozilla.org/logviewer.html#?job_id=110250217&repo=pine&lineNumber=2220

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 6, 2017

So, I've noticed that a bunch (haven't yet looked through all) of these things are dying with EXCEPTION_BREAKPOINT on Windows, and DUMP_REQUESTED on Linux. After poking around bugzilla a bit, I think this is probably one of the hang monitors (we have more than one!) deciding (quite possibly incorrectly) that something is hung, and sending that signal to (presumably) the child process.

@dmose dmose changed the title Lin/Win Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] Win 10 x64 opt Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] Jul 6, 2017

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 6, 2017

All the recent builds are only seeing this on Win10 x64 opt, so I've changed the title.

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 6, 2017

That's because pine Linux talos aren't automatically run. They do still fail on try.

@dmose dmose changed the title Win 10 x64 opt Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] Win 10 opt/Lin Intermittent Talos CRASH [@ google_breakpad::ExceptionHandler::WriteMinidump] Jul 6, 2017

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 6, 2017

Ah, fair enough. wlach and I have been speculating about possible resource exhaustion, and it turns out there is some resource usage data from these test runs. i'm trying to see what I can figure out by peering at it.

@luser

This comment has been minimized.

Copy link

commented Jul 6, 2017

So, I've noticed that a bunch (haven't yet looked through all) of these things are dying with EXCEPTION_BREAKPOINT on Windows, and DUMP_REQUESTED on Linux. After poking around bugzilla a bit, I think this is probably one of the hang monitors (we have more than one!) deciding (quite possibly incorrectly) that something is hung, and sending that signal to (presumably) the child process.

We don't actually signal the child process, we just write a dump of it, and fake the exception reason.

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 6, 2017

We don't actually signal the child process, we just write a dump of it, and fake the exception reason.

OK, that would seem to make the hang-reporter hypothesis at least a bit less likely. I did notice that in debug builds, the hang reports use MOZ_CRASH, which spits a message about the hang reporter to standard error, but not in opt builds, where we're seeing the crashes.

@luser if it were the one of the hang reporters, would we be seeing the hang reporter stack trace in the crashing thread?

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 6, 2017

On the resource exhaustion front; it turns out that each test has a resource_usage.json file on the log page that includes a bunch of interesting and useful data about CPU, swap, memory, and I/O. There is a "mach resource-usage" command which will plot some of that info on a nice HTML graph. gps said he'd happily accept patches to that code, and I suspect patching it to show swamp and memory info wouldn't be hard. That said, even if it confirmed that suspicion, it wouldn't tell us what was causing the crash. So I'm going to set that aside for the moment.

@Mardak and I just spent a bunch of time chatting, and he's been doing a bunch of try-server work to try and sort out when/how this got introduced. He's going to keep on that path for now, and I'm going to get a Linux VM and run some tests and see if I can force failures just by resource-constraining stuff.

Good times!

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 7, 2017

It looks like removing the setTimeout from bootstrap.js to defer to just after sessionstore completely avoids the crash at least on windows:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=9ea1afe416c0d875ed68fa158e10859b73c94d72

The parent of that commit has nearly 100% crashes:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=22bc4684290a4a9f58bbdb7d59537ffec9c3da86

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 7, 2017

Wow, nice detective work! How did you find that?

I guess the next questions are:

  • why does this fix the crashes?

  • does this fix regress overall Talos times at all? (And I see you've pushed that patch to a pine run which should at least add a data point there).

  • Another possible way we workaround the crashes is by using idleDispatch rather than setTimeout. maybe. but it's not clear whether that's a good idea, since presumably we want this stuff loading relatively early, not just "whenever".

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

Theory time:

Is it possible that activating Activity Stream is causing a network access attempt? Our testing infrastructure runs in a mode where network access is expressly forbidden, and attempts at opening a connection result in a crash:

http://searchfox.org/mozilla-central/rev/238406d4c1b3f147522ce0a45a4c6f84a8115781/netwerk/base/nsSocketTransport2.cpp#1315-1330

It dumps out a very helpful crash message when that network access attempt is made in the parent process. But, at least on my Macbook, if a content process attempts to make a network access attempt while in that mode, we get a tab crash with no helpful message in the console. Maybe the sandbox ate it, I dunno.

Is that possible? Is Activity Stream attempting to access the network somehow for Talos, and not for standard mochitests?

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 9, 2017

I would think not re: network connection. Sorry for the distraction as we didn't land this patch (from #2743) with the previous uplift: https://github.com/mozilla/activity-stream/blob/master/mozilla-central-patches/disable-as-default-sites.diff

FYI, this is what I've been pushing to try/pine when testing: https://hg.mozilla.org/try/rev/24228804e8419521aa094d62e7bf693218929790 (set prefs, add tests) and https://hg.mozilla.org/try/rev/bef092981b44c7321fefa24668910be102721b4d (wip patches that fixes other test failures)

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 9, 2017

@mikeconley what are your thoughts on potentially regressing sessionrestore if we can't figure out how to fix these crashes? TL;DR: Moving activity-stream init earlier during startup avoids crashes but regresses sessionrestore Talos.

Here's the baseline try push that goes back to activity-stream's first commit that delays its startup to a setTimeout immediately after observing "sessionstore-windows-restored":
https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&filter-searchStr=talos&group_state=expanded
(note the diff is confusing to look at as it's reverting more recent uplifts to mozilla-central)

For this baseline, the various "other" and "svgr" Talos e10s tests on Linux64 and Windows10 succeeded 35 of 110 runs (32%).

From that baseline, I pushed a number of commits playing around with startup timing. This first one moves init to async after "browser-delayed-startup-finished": https://treeherder.mozilla.org/#/jobs?repo=try&revision=67acf15ba0d2973bc29437298d929baac84a0e96&filter-searchStr=talos&group_state=expanded

Here, Talos success rate is 32 / 85 (38%) but regresses both sessionrestore + no_auto about 6%: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&newProject=try&newRevision=67acf15ba0d2973bc29437298d929baac84a0e96&framework=1

If instead initing synchronously on "sessionstore-windows-restored": https://treeherder.mozilla.org/#/jobs?repo=try&revision=a785664cc13e824447bb083a2dd35bf28659901c&filter-searchStr=talos&group_state=expanded

Talos success goes way up to 60 / 84 (71%) and "only" regressing sessionrestore_no_auto by 5%: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&newProject=try&newRevision=a785664cc13e824447bb083a2dd35bf28659901c&framework=1

Combining the two of those changes reverts to what the code was before, i.e., synchronous init on "browser-delayed-startup-finished": https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5d89aa15547396f02c86e0054eb62e128d9c416&filter-searchStr=talos&group_state=expanded

This has an even higher Talos success 57 / 72 (79%) but regressing both sessionstore + no_auto by 5%: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&newProject=try&newRevision=f5d89aa15547396f02c86e0054eb62e128d9c416&framework=1

I also tried pushing init even earlier to async "final-ui-startup": https://treeherder.mozilla.org/#/jobs?repo=try&revision=944492fbe6174d6dd723fbaf433251cc715d2752&filter-searchStr=talos&group_state=expanded

This is the clear winner in not crashing 61 / 62 (98%) but regresses all the sessionrestore tests (plain, many, no_auto): https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&newProject=try&newRevision=944492fbe6174d6dd723fbaf433251cc715d2752&framework=1

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 9, 2017

Re: @dmose's suggestion of deferring differently, I did try setTimeout with various non-undefined delay amounts, idle via idleDispatchToMainThread and I started running into issues when about:newtab is loaded too soon, so the activity-stream part of about:newtab is broken/blank/never loads. This sounds a bit like #2819 where window.sendAsyncMessage doesn't exist on Linux when trying to restore about:newtab tabs.

I don't know if that issue is related to these Talos crashes, but suspiciously these both do involve session restoring and problems on Linux. @k88hudson how does about:newtab end up getting a top level sendAsyncMessage? I'm guessing this.channel = new RemotePages(this.pageURL) from ActivityStreamMessageChannel.jsm ? Would not doing this soon enough cause crashes though?

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 9, 2017

Looking through a bunch of those crash stack traces, I have the strong suspicion that somehow the stack is getting corrupted. @mikeconley, have a look at the second stack in a bunch of those crashes (eg https://treeherder.mozilla.org/#/jobs?repo=try&revision=22bc4684290a4a9f58bbdb7d59537ffec9c3da86 ). I suspect you have more recent experience with stack scribbling; do you have any thoughts here?

I'm working poking at buildconfig to try and make linux64-asan builds run the talos tests on try in the hopes that ASAN might catch whatever the problem is...

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 9, 2017

To be fair lots of those stacks seem to be inside NT system DLLs. But one of the examples the crops up semi-regularly is https://treeherder.mozilla.org/logviewer.html#?job_id=112476974&repo=try&lineNumber=4490 and if you look at the code for HasLookupRuleWithGlyphByScript you see that the index variable (presumably a stack automatic) gets passed by address (presumably for writing) down into the crashing code, which is one thing that makes me suspect this...

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 10, 2017

Well, we're actually running using ASan now, we'll see if we hit any of the crashes... https://treeherder.mozilla.org/#/jobs?repo=try&revision=4929a5523e9879de1a077ed1d9e2b330c09dcc7f

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 10, 2017

If that doesn't work, another thing to try is (thanks to @mikeconley for the suggestion) is https://wiki.mozilla.org/ReleaseEngineering/How_To/Request_a_loaner

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 11, 2017

Sadly, the ASan talos runs didn't trigger any crashes (though these were all Linux, and there does appear to be a underdocument, maybe-not-supported Win10 asan configuration in the tree).

Further, Asan builds should exit when they detect an error, and we don't appear to override that behavior in the tree, so they haven't detected any pre-crash errors either, it seems. :-/

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 11, 2017

@mikeconley has kindly filed https://bugzilla.mozilla.org/show_bug.cgi?id=1380047 to get us loaner machine access to try and reproduce on...

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 12, 2017

The pine pushes from Monday have a much lower crash rate for unknown reasons:
https://treeherder.mozilla.org/#/jobs?repo=pine&filter-searchStr=talos

Crash rate from Friday was around 39% whereas on Monday the rate is about 4%

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

That's very mysterious. Either something happened in this push to improve things, or something changed in automation. If the latter... I have no idea how we're going to track that down.

If the former, perhaps we can bisect on try with this changeset range?

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 12, 2017

I can try bisecting, but it'll be a bit intensive via try pushes across 245 commits. That pushlog does include Wei-Cheng Pan — Bug 1376760 - Fix race condition for ICustomDestinationList usage. r=jimm

We were tracking that for Windows crashes in #2802

Although I did include that patch in a pine push last week and windows was still failing: https://treeherder.mozilla.org/#/jobs?repo=pine&revision=8b9a8484dacaf9f0783944972721bd771908a36b&filter-searchStr=talos

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

I have so far been unsuccessful in getting the build from this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=dcffab05e99c0a20aaf33b136162a4d5fe108fcb&filter-searchStr=talos&group_state=expanded

to crash on the loaner hardware...

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 12, 2017

I'm "bisecting" by pushing the various commits just before the autoland/inbound merges within that pushlog:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6cfb46e866ab907f8c2cc8e3a454f9acb7edbd0a
https://treeherder.mozilla.org/#/jobs?repo=try&revision=5e9b98dae690c75a2866be11c0645919c7ddbca3
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e2d543ad55692b65cb301e9875120c4a8b13ee89
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c1e7442004cce684129f4581106329cbaee76d1
https://treeherder.mozilla.org/#/jobs?repo=try&revision=03475ab044901fb6991f70660a8582e937cf537b
https://treeherder.mozilla.org/#/jobs?repo=try&revision=2172d983e80ccd6e2fdff7fc253e3bc8601e0833
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d50137d0f65f210fa04a0e793d1bd51bcf76466c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a6444ddff3ef9ddb782593faf30935b68ce4e3b6
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a0a718a8da03e043400ee14075601e5974a381a0
https://treeherder.mozilla.org/#/jobs?repo=try&revision=004761adf7ebc91f72543127736bb1449edc3277
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c6ea07c64232938b984f4dd6ff5f7b34c5e73f22
https://treeherder.mozilla.org/#/jobs?repo=try&revision=941ae15bd9f20127eea1c9b656091ee06d438f12
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d3d87865b629dd55a7f63b9eac515c39d4d1faf1

(For own reference: git co 733b44b59105 && pushd ../activity-stream && npm run buildmc && popd && patch -p1 < ../activity-stream/mozilla-central-patches/disable-as-default-sites.diff && patch -p1 < ../activity-stream/mozilla-central-patches/disable-search-tests.diff && patch -p1 < ../activity-stream/mozilla-central-patches/enable-mc-as.diff && git add browser/extensions/activity-stream/ && git cia -C head && ./try-talos-windows.sh and git lg cfe7bf5ae1b4...733b44b59105)

@tspurway tspurway assigned Mardak and unassigned dmose Jul 12, 2017

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

Looking at one of the .extra files for one of the crashes, I notice this annotation:

IPCShutdownState=SendFinishShutdown

This annotation is only ever set if the ContentChild in the content process has told the parent "I'm ready to shut down now.".

So the parent has told the content process to shut down.

Looking deeper within the .extra file, I also see this:

ipc_channel_error=PStorageParent::RecvAsyncPreload

And that's only set here.

So a new hypothesis: Activity Stream causes content to request storage information from the parent off of the main thread... and then the thread that this code runs on:

http://searchfox.org/mozilla-central/rev/31311070d9860b24fe4a7a36976c14b328c16208/dom/storage/StorageIPC.cpp#404-407

Maybe decides that the DB has already shut down by the time it's requesting it. Returning IPC_FAIL_NO_REASON, I believe, could result in the types of crashes we're seeing here - though perhaps kanru or somebody (billm?) who knows our IPC code better might be able to say for sure.

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

Also note that this patch in the changes that landed since Friday might have reduced the probability of this crash occurring.

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 12, 2017

It looks like the crashes were already reduced from before Kris Maglione — Bug 1357486: Follow-up: Wait for extension shutdown before starting storage shutdown. r=rhelmer

The next group of commits do seem potentially likely as well: Kris Maglione — Bug 1357486: Enable OOP extensions by default on Windows. r=aswan

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 12, 2017

OOP Extensions should only apply for WebExtensions though... I wonder how much of that impacts bootstrapped add-ons.

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 13, 2017

There's several commits for Bug 1357486 - Turn on OOP extensions by default on Windows:

Bug 1357486: Part 0a - Fix permissions tests with OOP extensions. r=aswan
Bug 1357486: Part 0b - Fix inline options browser tests with OOP extensions. r=aswan
Bug 1357486: Part 0c - Propagate addonId to parent process with console messages. r=aswan
Bug 1357486: Part 0d - Propagate clonable console message args to the parent process. r=aswan
Bug 1357486: Part 0e - Support legacy extensions in OOP mode. r=aswan
Bug 1357486: Part 0f - Run some chrome tests in in-process mode. r=aswan
Bug 1357486: Part 0g - Run remote debugger host browser in same TabGroup as extension pages. r=me
Bug 1357486: Enable OOP extensions by default on Windows. r=aswan
Bug 1357486: Follow-up - Run flaky update tests in in-process mode on win32 debug.

Part 0f has 50% crashes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b93b9a30b42c487ac7caa68946982e144eaebe79&filter-searchStr=talos
Part 0g has 60% crashes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=408240b797d55f3b58db0e449df610949fbe0c20&filter-searchStr=talos
Enable OOP has 9% crashes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab179352a05cb4b9bf03b404086cbd0261f84ebe&filter-searchStr=talos
Follow-up has 12% crashes: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0db35e86b538f12a844ef66574f84044e4ed1f7&filter-searchStr=talos

So it does seem like it's the enabling OOP extensions. Why it reduces crashes is not really clear although why activity-stream is triggering crashes here is also unclear. And how this affects Talos, I believe screenshots is a webextension system-addon.. not sure what other webextensions might be active during talos tests..

(For own reference: c=t=>document.querySelectorAll(`[title~='Talos'][title~='${t}']`).length;b=c("busted");s=c("success");b/(b+s))

@Mardak

This comment has been minimized.

Copy link
Member Author

commented Jul 13, 2017

So.. umm… I pushed the latest activity-stream with latest mozilla-central to get a baseline for removing the setTimeout and there were no crashes in over 100 Talos runs, and performance looks good relative to last 2 days of mozilla-central:

screen shot 2017-07-13 at 4 01 51 am

I also added talos jobs for the latest pine push, and initial perfherder results seem to match up with the try.

…resolved?

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 13, 2017

¯_(ツ)_/¯

ᕕ( ᐛ )ᕗ 🚪

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 13, 2017

OMG; amazing. I'm resolving; we can always re-open if it comes back.

@dmose dmose closed this Jul 13, 2017

@Mardak Mardak moved this from Crashes to Done in Pref-on Test Failures Jul 13, 2017

@mikeconley

This comment has been minimized.

Copy link
Contributor

commented Jul 14, 2017

Fwiw, I just pushed a mozilla-central from today to try, and saw the crash.

So I think Activity Stream is likely in the clear, but I don't think we've seen the last of this thing. :/

@dmose

This comment has been minimized.

Copy link
Member

commented Jul 14, 2017

Fair enough. It's not ideal, but at least it's a start...

@luser

This comment has been minimized.

Copy link

commented Jul 18, 2017

So for future reference, if you're suspicious of the stacks you're seeing in a log, and the crash is from a Windows machine, you can always download the minidump + symbols and load them in a Microsoft debugger. The Treeherder log viewer has a list of info in the left pane, including artifact uploaded: dmp for each minidump the test harness encountered. You can download those to a Windows machine, download the crashreporter-symbols-full.zip from the build it was testing, and see if the stack from a debugger matches. There are certainly occasions where Breakpad doesn't get it right!

@Mardak Mardak moved this from Assigned / Milestone 1 to Complete in Land in Nightly / Graduate Jul 24, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.