Skip to content
This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

Intermittent failures on test-regress-GH-14576.js #25619

Closed
mhdawson opened this issue Jul 2, 2015 · 10 comments
Closed

Intermittent failures on test-regress-GH-14576.js #25619

mhdawson opened this issue Jul 2, 2015 · 10 comments

Comments

@mhdawson
Copy link
Member

mhdawson commented Jul 2, 2015

Two failures in recent CI history

http://jenkins.nodejs.org/job/node-test-commit-windows/DESTCPU=ia32,label=windows-2k8r2/121/
http://jenkins.nodejs.org/job/node-test-commit-windows/DESTCPU=ia32,label=windows-2k8r2/110/

Need to either investigate/fix or mark as flaky test until we do.

@misterdjules
Copy link

The goal of this test is to test a bug when, sometimes, node would stall. So it could be that the bug was actually not fixed, just that it happens less often. If that is the case, we would need to reopen the issue too.

@misterdjules
Copy link

Adding as P-1 since the longer we wait to investigate this, the less context we'll have around what had been done initially to fix the problem and write the test. Adding to 0.12.7 milestone since a v0.12.6 release should be coming up very soon.

Also, /cc @tunniclm @dmelikyan

@mhdawson
Copy link
Member Author

mhdawson commented Jul 6, 2015

@misterdjules can you add the original issue number ?

@misterdjules
Copy link

@mhdawson You mean #14576?

@misterdjules misterdjules modified the milestones: 0.12.7, 0.12.8 Jul 6, 2015
@mhdawson
Copy link
Member Author

mhdawson commented Jul 8, 2015

Yes thanks

@misterdjules misterdjules modified the milestones: 0.12.8, 0.12.9 Jul 10, 2015
@tunniclm
Copy link

My initial thought when looking through this a bit, is that on Windows SuspendThread() and ResumeThread() are used instead of signals/SignalHandler and so the fix here is to code that is not used on Windows.

It is possible that the suspended thread holds some important locks (such as the process_wide_mutex_) at the time it is suspended to take a sample, however, so far I have not found an obvious code that takes locks in the sampler.

It would be good to recreate this hang so we can capture some stack traces to see where the threads are stuck.

@tunniclm
Copy link

Reproduced this on 32-bit windows and took a memory dump.

Key thread stacks:

Thread: 4db8
ChildEBP RetAddr  
0524eff4 76f69e5e ntdll_76f30000!ZwWaitForSingleObject+0x15
0524f058 76f69d42 ntdll_76f30000!RtlpWaitOnCriticalSection+0x13e
0524f080 7464759b ntdll_76f30000!RtlEnterCriticalSection+0x150
WARNING: Stack unwind information not available. Following frames may be wrong.
0524f29c 74648ad0 sysfer!FirstHookFunc+0xacb
0524f340 74648ccd sysfer!FirstHookFunc+0x2000
0524f3b8 74648dc6 sysfer!FirstHookFunc+0x21fd
0524f400 694e1592 sysfer!FirstHookFunc+0x22f6
0524f45c 694e41cc UMEngx86+0x1592
0524f4d0 694e42b0 UMEngx86+0x41cc
0524f534 75393b5b UMEngx86+0x42b0
0524f544 0051871e KERNELBASE!ResumeThread+0x12
0524f838 005182e3 node!v8::internal::Sampler::DoSample+0x8e [c:\node\node\deps\v8\src\sampler.cc @ 750]
0524f854 00662062 node!v8::internal::SamplerThread::Run+0x53 [c:\node\node\deps\v8\src\sampler.cc @ 554]
0524f85c 0066bcf3 node!v8::base::ThreadEntry+0x22 [c:\node\node\deps\v8\src\base\platform\platform-win32.cc @ 1340]
0524f894 0066be1b node!_callthreadstartex+0x1b [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 376]
0524f8a0 749b337a node!_threadstartex+0x7c [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 354]
0524f8ac 76f692e2 kernel32!BaseThreadInitThunk+0xe
0524f8ec 76f692b5 ntdll_76f30000!__RtlUserThreadStart+0x70
0524f904 00000000 ntdll_76f30000!_RtlUserThreadStart+0x1b

Thread: 4db4
ChildEBP RetAddr  
050bfa40 753914ab ntdll_76f30000!ZwWaitForSingleObject+0x15
050bfaac 749b1194 KERNELBASE!WaitForSingleObjectEx+0x98
050bfac4 749b1148 kernel32!WaitForSingleObjectExImplementation+0x75
050bfad8 004650b7 kernel32!WaitForSingleObject+0x12
050bff1c 00662062 node!v8::internal::Profiler::Run+0xd7 [c:\node\node\deps\v8\src\log.cc @ 755]
050bff24 0066bcf3 node!v8::base::ThreadEntry+0x22 [c:\node\node\deps\v8\src\base\platform\platform-win32.cc @ 1340]
050bff5c 0066be1b node!_callthreadstartex+0x1b [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 376]
050bff68 749b337a node!_threadstartex+0x7c [f:\dd\vctools\crt\crtw32\startup\threadex.c @ 354]
050bff74 76f692e2 kernel32!BaseThreadInitThunk+0xe
050bffb4 76f692b5 ntdll_76f30000!__RtlUserThreadStart+0x70
050bffcc 00000000 ntdll_76f30000!_RtlUserThreadStart+0x1b

Thread: 4da4
ChildEBP RetAddr  
00acf864 7463547c ntdll_76f30000!RtlLeaveCriticalSection+0x2
WARNING: Stack unwind information not available. Following frames may be wrong.
00acf88c 74648aed sysfer+0x1547c
00acf92c 74648ccd sysfer!FirstHookFunc+0x201d
00acf9a4 74648dc6 sysfer!FirstHookFunc+0x21fd
00acf9ec 0f931364 sysfer!FirstHookFunc+0x22f6
00acfa3c 0f931853 pcwum!PcwpSendIoctl+0x5c
00acfa88 0f9339af pcwum!PcwRegisterCounterSet+0x49
00acfb34 0f933b3f pcwum!PerfpRegisterCounterSet+0x198
00acfb58 0f9320f0 pcwum!PerfpAddCounterSet+0x166
00acfb6c 002fe906 pcwum!PerfSetCounterSetInfo+0x40
00acfbe8 002fec72 node!node::InitPerfCountersWin32+0x176 [c:\node\node\src\node_win32_perfctr_provider.cc @ 184]
00acfc24 002b376c node!node::InitPerfCounters+0x82 [c:\node\node\src\node_counters.cc @ 135]
00acfc84 002cd2ff node!node::LoadEnvironment+0x13c [c:\node\node\src\node.cc @ 2872]
00acfccc 00666625 node!wmain+0x2df [c:\node\node\src\node_main.cc @ 60]
00acfd14 749b337a node!__tmainCRTStartup+0xfe [f:\dd\vctools\crt\crtw32\startup\crt0.c @ 255]
00acfd20 76f692e2 kernel32!BaseThreadInitThunk+0xe
00acfd60 76f692b5 ntdll_76f30000!__RtlUserThreadStart+0x70
00acfd78 00000000 ntdll_76f30000!_RtlUserThreadStart+0x1b

@tunniclm
Copy link

Thread 4db8
         ChildEBP RetAddr  Args to Child    
Frame 3: 0524f080 7464759b 746859f4 7d7db3a3 00000000 ntdll_76f30000!RtlEnterCriticalSection+0x150

First argument is a CriticalSection:
0:005:x86> dt RTL_CRITICAL_SECTION 0x746859f4
node!RTL_CRITICAL_SECTION
   +0x000 DebugInfo        : 0x00df9230 _RTL_CRITICAL_SECTION_DEBUG
   +0x004 LockCount        : 0n-6
   +0x008 RecursionCount   : 0n1
   +0x00c OwningThread     : 0x00004da4 Void
   +0x010 LockSemaphore    : 0x000001e4 Void
   +0x014 SpinCount        : 0

So Thread 4db8 is waiting on a CiritcalSection owned by Thread 4da4

@tunniclm
Copy link

Thread 4da4
         ChildEBP RetAddr  Args to Child              
Frame 1: 00acf864 7463547c 746859f4 78f5bbbf 000000cb ntdll_76f30000!RtlLeaveCriticalSection+0x2

Thread 4da4 is trying to call LeaveCriticalSection() on the same CriticalSection (0x746859f4).

I took a second memory dump some minutes after the first and Thread 4da4 still appears to be in the same place.

So it looks like the suspended thread holds a CriticalSection that is required to resume it.

@tunniclm
Copy link

The sysfer dll appears to be related to Symantec Endpoint Protection, so my recreate is possibly different from the one on Jenkins. However, the same sort of scenario could occur for any lock that the main thread could take that the sampler code also takes.

@Trott Trott closed this as completed Apr 22, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants