
thread-suspend-suspended.exe and thread-suspend-selfsuspended.exe fail a lot on Windows i386 #10508

Closed
akoeplinger opened this issue Sep 7, 2018 · 17 comments


@akoeplinger akoeplinger commented Sep 7, 2018

Steps to Reproduce

  1. Run thread-suspend-suspended.exe and thread-suspend-selfsuspended.exe tests on 32bit Windows (it doesn't happen on 64bit!)
  2. The tests time out very often.

Might be related to the recent switch to hybrid suspend.

Current Behavior

The tests time out after 5 minutes. Examples:

Expected Behavior

No failure.

On which platforms did you notice this

[ ] macOS
[ ] Linux
[ X ] Windows i386 (not x64!)

Version Used: master

/cc @lateralusX

@akoeplinger akoeplinger added this to Bugs Pool in Bugs Week via automation Sep 7, 2018
@lateralusX lateralusX self-assigned this Sep 7, 2018

@lambdageek lambdageek commented Sep 7, 2018

Is this relatively recent (last month or 2)? Could be hybrid-suspend related.

/cc @lambdageek


@lateralusX lateralusX commented Sep 10, 2018

Looks like the first failure of one of these tests occurred here (might not be directly related though):

https://jenkins.mono-project.com/job/z/label=w32/7657/

Will try to reproduce this in the debugger during the week.

@lateralusX lateralusX moved this from Bugs Pool to In Progress in Bugs Week Sep 11, 2018

@lateralusX lateralusX commented Sep 14, 2018

Still no local repro, but I investigated the code heavily under the debugger and could force a couple of scenarios that could (at least in theory) happen under wow64 when thread suspend fails. That scenario is not handled by sgen (it's very uncommon and only observed under wow64) and could get us into either a deadlock or asserts when doing the GC in scenarios like the ones hit by the test. I will need to spend some more time investigating this a little deeper in the debugger, but it looks like it boils down to a potential case where mono_threads_suspend_begin_async_suspend returns FALSE (which could happen for various reasons under wow64).


@lateralusX lateralusX commented Sep 18, 2018

@lambdageek One thing I noticed while digging into this issue: on the MSVC-compiled mono runtime (the one that is actually run and tested on CI), hybrid suspend is NOT the default, so it's still using preemptive suspend on both 32 and 64 bit Windows MSVC-built runtimes. The reason is that winconfig.h doesn't have the ENABLE_HYBRID_SUSPEND flag set. Setting that flag in winconfig.h exposes another problem with an unimplemented feature on MSVC that will cause failures when trying to use hybrid suspend and coop suspend on Windows: SAVE_REGS_ON_STACK is not defined for MSVC, mono-threads-coop.c:39. I could (hopefully) implement something similar on MSVC. So __builtin_unwind_init dumps all registers on the stack so the GC will find references in registers when scanning the stack, instead of what it previously did, scanning the captured MonoContext object for a thread, correct?


@jaykrell jaykrell commented Sep 18, 2018

RtlCaptureContext. Use LoadLibrary(ntdll.dll)/GetProcAddress for systems w/o kernel32.dll.
It is also just a few lines of assembly, so you can write it yourself.
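
Roughly, a sketch of how that could be wired up, assuming RtlCaptureContext is used on the MSVC side (the GCC branch mirrors the existing __builtin_unwind_init usage in mono-threads-coop.c; the MSVC branch is an assumption, not the actual patch):

```c
/* Hedged sketch, not the actual Mono change: one way SAVE_REGS_ON_STACK could
 * be provided for MSVC.  The point of both branches is to spill register
 * contents into stack memory so the conservative stack scan sees any object
 * references that only live in registers. */
#ifdef _MSC_VER
#include <windows.h>
#define SAVE_REGS_ON_STACK do { CONTEXT __ctx; RtlCaptureContext (&__ctx); } while (0)
#else
#define SAVE_REGS_ON_STACK do { __builtin_unwind_init (); } while (0)
#endif
```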


@lambdageek lambdageek commented Sep 20, 2018

I debugged thread-suspend-suspended a little today. Apparently, is_thread_in_critical_region in mono/utils/mono-threads.c returns TRUE from the mono_threads_platform_in_critical_region() branch when sgen-stw.c:321 is looking at thread t1 (see the testcase source).

That means we end up resuming t1 back to BLOCKING, and then I incorrectly run only the MOPUP phase of hybrid suspend, which doesn't expect to see any blocking threads - it expects all of them to have been preemptively suspended the first time through.

I think the correct plan is for me to do two-phase suspend after a restart.

It makes sense that we only saw this fail on Windows i386 on CI because that's the only platform whose mono_threads_platform_in_critical_region ever returns TRUE.
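
A rough sketch of the control flow being described, with placeholder function names (this is not the actual sgen-stw.c code):

```c
/* Placeholder declarations; the real logic lives in sgen-stw.c / mono-threads.c. */
extern int suspend_phase_one (void);    /* two-phase suspend; returns number of threads restarted */
extern void suspend_phase_mopup (void); /* preemptively suspend any remaining threads */

/* Sketch of the proposed fix: if phase one had to restart any thread (e.g. it
 * was in a critical region and went back to BLOCKING), redo the whole
 * two-phase suspend instead of running only the mop-up phase, which assumes
 * every remaining thread was already preemptively suspended. */
static void
stop_world_sketch (void)
{
	for (;;) {
		if (suspend_phase_one () == 0)
			break;  /* no thread was resumed; safe to mop up */
		/* some thread was resumed back to BLOCKING; retry from phase one */
	}
	suspend_phase_mopup ();
}
```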

@lambdageek lambdageek self-assigned this Sep 20, 2018

@lateralusX lateralusX commented Sep 20, 2018

On CI we run the MSVC-built runtime for tests, which currently only runs preemptive suspend and recovers without errors when mono_threads_platform_in_critical_region returns TRUE. Regarding the assert seen in STW together with hybrid suspend: that case triggers resume-thread logic that (as you say above) will hit the assert in the STW suspend logic. I easily hit that on the MSVC-built runtime since async_suspend_critical, the callback used in mono_thread_info_safe_suspend_and_run, never sees the thread as currently in managed code (probably something else that is missing in MSVC hybrid suspend), so it triggers an interrupt request and returns MonoResumeThread, which restarts the thread and races with the STW suspend loop as well. Same end result, but two different ways of hitting it.


@lateralusX lateralusX commented Sep 20, 2018

As said, by default the MSVC-built runtime is still using preemptive suspend, and that is what's running on CI. Only the MinGW-built runtime follows the new hybrid suspend, but that is not the runtime running the tests on CI. I have #10684 for fixing hybrid suspend on MSVC so we can switch the default for that version of the runtime as well.


@marek-safar marek-safar commented Sep 20, 2018

@lateralusX could you disable the test on Windows for now to reduce PR build failures?


@lateralusX lateralusX commented Sep 20, 2018


@lambdageek lambdageek commented Sep 20, 2018

Okay, so we need to fix two separate things:

  1. with preemptive suspend, the test hangs sometimes - this issue.
  2. with hybrid suspend, the test asserts (Thread 00006484 in unexpected state RUNNING with PEEK_BLOCKING_SUSPEND_REQUESTED) - new issue #10722

@lateralusX lateralusX commented Sep 21, 2018

A couple of findings and observations around the preemptive suspend issue. Running the test on CI with some additional diagnostics shows that the tests don't deadlock but take a long time to complete (tipping over the 300 s timeout) for x86, while x64 always completes much more quickly on the same hardware. The Windows builders only have 2 virtual CPUs, and I can emulate this problem by setting the thread affinity on my local machine to 1 core. I also tried on a Windows 2016 Azure server image with 2 virtual cores and got similar results: the x86 test takes a long time to complete while x64 completes much more quickly.

The design of these tests includes one thread that spins in a very tight loop consuming 100% CPU time on one core, while the other threads do a lot of context switching, consistently yielding time to the spinning thread. On a (virtualized) machine with a low core count this can significantly increase the execution time of the test. The wow64 layer present in the x86 version of the test also increases kernel mode transition times compared to the x64 version, which in turn could increase the cost of the context switches. When I looked at the test running slowly on a one-core machine, the time spent on each thread seems correct, and each thread (when debugged) still did what's expected and didn't end up doing extra work. I will debug this further and also try to sort out how much time the threads spend just sitting in the ready state waiting to be scheduled. This scenario with a low core count and a thread hogging one CPU could still hit a bug related to wow64, but nothing I have identified so far. Reducing the CPU hogging in thread t1 makes the test complete much more quickly on x86 as well as x64.
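
For illustration, a minimal native sketch (assumed structure, not the actual managed test) of the pattern described above: one thread hogging a core while the main thread suspends it, yields, and resumes it:

```c
#include <windows.h>

static volatile LONG stop;

static DWORD WINAPI spin_thread (LPVOID arg)
{
	(void)arg;
	while (!stop)
		;	/* hog one core at 100% */
	return 0;
}

int main (void)
{
	HANDLE t1 = CreateThread (NULL, 0, spin_thread, NULL, 0, NULL);
	int i;

	for (i = 0; i < 1000; ++i) {
		SuspendThread (t1);
		SwitchToThread ();	/* "Yield": only runs a thread already ready on this core */
		/* the real test runs a GC here against the suspended t1 */
		ResumeThread (t1);	/* this is the call where the wow64 slowdown shows up */
	}

	InterlockedExchange (&stop, 1);
	WaitForSingleObject (t1, INFINITE);
	CloseHandle (t1);
	return 0;
}
```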


@jaykrell jaykrell commented Sep 21, 2018

Why would wow64 increase kernel time? That rarely, but sometimes, makes sense. It is usually an extremely thin layer, all in user mode.


@lateralusX lateralusX commented Sep 21, 2018

Correct, I was referring to the additional user mode work needed to transition from 32 -> 64 and back (even if it should still be a thin wrapper) compared to an x64 process, not directly increased kernel mode execution time.


@jaykrell jaykrell commented Sep 21, 2018

That transition is so efficient I don't think you could measure it.
It is just a far call, copying a few bytes around, and a far ret.
Sometimes it is more significant, in some special cases where one syscall becomes more than one.

lateralusX added a commit to lateralusX/mono that referenced this issue Oct 1, 2018
…-suspend-selfsuspended.

The thread-suspend-suspended and thread-suspend-selfsuspended tests have shown long
execution times on low-core Windows machines running wow64. It looks like an OS
related issue where the ResumeThread Win32 API takes a lot longer to complete
when running a test scenario like thread-suspend-suspended, and this additional
time adds up over many iterations, pushing the tests over the timeout. NOTE: the same
behavior has been identified running on .NET Framework, so it's not a mono specific
issue, and it only reproduces under wow64 on low-core machines (probably when one core
is busy, leaving us with just one core for most of the test).

Since the purpose of the test is to run the GC when t1 is suspended, doing this more
deterministically will fix the ResumeThread issue under wow64 on low core counts as
well as making the tests more deterministic, always having t1 in a suspended state
when running the GC (which is what tests the suspend-suspended scenario).

Full investigation can be found in mono#10508.

@lateralusX lateralusX commented Oct 2, 2018

Did some additional interesting investigation and observation around the tests. First, the same behavior can be reproduced using the regular .NET Framework runtime, so this is indeed not a mono related problem but a behavior the test triggers on machines with a low core count (or when setting the affinity of the mono process to schedule on just one core).

I did an analysis of thread scheduling when hitting the repro of a slowly executing test under wow64. Looking at thread execution times revealed some interesting information. The first dump of user mode time looked like this (unrelated threads excluded):

Thread Time
4:2354 0 days 0:00:19.234 (t1 hogging the core)
0:a4e4 0 days 0:00:00.296 (main thread)
5:9568 0 days 0:00:00.046 (t2 periodically running GC)

Letting the test run for some more time gives the following result (unrelated threads excluded):

Thread Time
4:2354 0 days 0:00:46.250 (t1 hogging the core)
0:a4e4 0 days 0:00:00.296 (main thread)
5:9568 0 days 0:00:00.109 (t2 periodically running GC)

This shows that the hogging thread gets plenty of scheduled CPU time while the other threads get very little. The main thread hasn't gotten any measurable amount of time at all during this period, and since it doesn't run that frequently it won't progress the test and won't trigger any additional GC collections for t2, which will just spin and yield. Continuing to look at the thread schedule times for the test shows the same pattern: a lot of time given to t1 (which should be frequently suspended) while the other two threads get a small amount of time, causing very slow progression of the test under wow64.

Doing the same thread analysis under x64 shows different behavior. Under x64 the time is much more evenly distributed between the threads and the hogging thread doesn't get the same amount of "exclusivity", meaning the test also progresses much more quickly since both the GC thread and the main thread get more scheduled CPU time.

Analyzing where the additional time goes (mainly wait time in this case) shows that ResumeThread takes a long time to complete for the main thread when resuming t1; but since t1 still gets a lot of execution time, the wait within the Win32 ResumeThread call seems to occur after t1 has been resumed but before returning from the ResumeThread call. Looking at the numbers shows that all the additional time the test takes under wow64 with 1 core ends up in ResumeThread (called from the main thread). A normal ResumeThread call not hitting this issue takes < 10 microseconds, but when hitting this issue, ResumeThread takes between 45,000 and 90,000 microseconds, and the accumulation of that time accounts for the long test runs. NOTE: this behavior is not reproducible on a machine with additional free cores.
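
The kind of per-call measurement described can be sketched like this (assumed instrumentation, not necessarily what was used):

```c
#include <stdio.h>
#include <windows.h>

/* Time a single ResumeThread call in microseconds. */
static void resume_timed (HANDLE thread)
{
	LARGE_INTEGER freq, start, end;

	QueryPerformanceFrequency (&freq);
	QueryPerformanceCounter (&start);
	ResumeThread (thread);
	QueryPerformanceCounter (&end);

	printf ("ResumeThread took %lld us\n",
		(long long)((end.QuadPart - start.QuadPart) * 1000000 / freq.QuadPart));
}
```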

One thing that mitigates the problem is to switch from Yield to Sleep(1) in the main thread. Doing this instead of Yield puts the thread into a different state (not ready and not directly rescheduled even if the core is still free), which makes the scenario behave differently and doesn't trigger the longer execution times of ResumeThread. That said, I believe the best approach would be to rewrite the test into a more stable version that consistently tests the suspend-suspended and suspend-selfsuspended scenarios. That version of the test will behave more stably on systems with a low to high core count. The current way of "synchronizing" between the threads using atomics and Yield does not produce deterministic execution behavior, especially on Windows in scenarios where threads get different cores. Since Yield will only give up the time slice on the current core if there are threads in the ready queue to be rescheduled on that core, there is a risk that the main thread passes its Yield call without stopping and moves directly into Thread.Resume before the GC thread had its chance to run. Then we are no longer testing suspend-suspended, since t1 might actually be running when the GC runs.
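
One possible shape of such a rewrite, as a hedged sketch (the event names and structure are my assumptions, not necessarily what #10646 does): the main thread signals the GC thread after suspending t1 and waits for the collection to finish before resuming, so every iteration really exercises suspend-suspended regardless of core count:

```c
#include <windows.h>

/* Hypothetical handshake events, created at startup with
 * CreateEvent (NULL, FALSE, FALSE, NULL) (auto-reset). */
static HANDLE do_gc_event;    /* main -> GC thread: "t1 is suspended, collect now" */
static HANDLE gc_done_event;  /* GC thread -> main: "collection finished" */

static DWORD WINAPI gc_thread (LPVOID arg)
{
	(void)arg;
	for (;;) {
		WaitForSingleObject (do_gc_event, INFINITE);
		/* run a GC here while t1 is known to be suspended */
		SetEvent (gc_done_event);
	}
}

/* One iteration of the main thread's loop: no Yield involved, so scheduling
 * quirks on low-core wow64 machines no longer decide whether the GC actually
 * runs while t1 is suspended. */
static void suspend_iteration (HANDLE t1)
{
	SuspendThread (t1);
	SetEvent (do_gc_event);
	WaitForSingleObject (gc_done_event, INFINITE);
	ResumeThread (t1);
}
```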

A fix of the tests that appears more stable on all platforms is available in #10646.


@lateralusX lateralusX commented Oct 4, 2018

Mitigated by #10646.

@lateralusX lateralusX closed this Oct 4, 2018
Bugs Week automation moved this from In Progress to Done Oct 4, 2018
@marek-safar marek-safar moved this from Done to Archived in Bugs Week Oct 13, 2018