Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pacman: sometimes hangs validating GPG signatures #2752

Closed
jeremyd2019 opened this issue Dec 19, 2021 · 20 comments · Fixed by #2771
Closed

pacman: sometimes hangs validating GPG signatures #2752

jeremyd2019 opened this issue Dec 19, 2021 · 20 comments · Fixed by #2771

Comments

@jeremyd2019
Copy link
Member

I first saw this on Windows on ARM, but it seems to have shown up on msys2-autobuild recently too (https://github.com/msys2/msys2-autobuild/runs/4552699834?check_suite_focus=true and https://github.com/msys2/msys2-autobuild/runs/4575182521?check_suite_focus=true).

What I saw on ARM was that pstree showed two pacman processes, one a child of the other. If I killed the child, via its winpid and taskkill, pacman would complain about a GPGME error and things would move on (with a failed invocation of pacman). If I killed the parent, the child would still hang around as an apparent zombie.

I tried to debug it once a while back, but I seem to recall that attempting to attach a debugger to it caused it to 'unblock' and exit, which made trying to figure out what was going on pretty difficult.

@jeremyd2019
Copy link
Member Author

In case the job logs expire, they look something like:

2021-12-19T16:14:37.5863759Z checking keyring...
2021-12-19T16:14:37.7745146Z checking package integrity...
2021-12-19T21:28:27.2917878Z ##[error]The operation was canceled.

@jeremyd2019
Copy link
Member Author

Looks like another one, this time in MINGW-packages CI: https://github.com/msys2/MINGW-packages/runs/4586853393?check_suite_focus=true

Mon, 20 Dec 2021 19:40:41 GMT
  checking package integrity...
Mon, 20 Dec 2021 20:12:25 GMT
  Error: The operation was canceled.

@jeremyd2019
Copy link
Member Author

@jeremyd2019
Copy link
Member Author

Repro idea: run pacman -Su in a loop and see if it eventually hangs. Each invocation, even with nothing to do, would validate signatures on the databases. I don't know if there's anything else you can do with pacman to make it validate a bunch of packages for instance over and over.

@jeremyd2019
Copy link
Member Author

It looks like pacman -Sw validates packages. so something like

mkdir foo
pacman -Sy -b foo
while pacman --noconfirm -Sw -b foo base base-devel mingw-w64-{,clang-}{i686,x86_64}-toolchain; do true; done

?

@jeremyd2019
Copy link
Member Author

I've started this up on my local Win10 21H2 machine, it is validating 356 packages per invocation.

@jeremyd2019
Copy link
Member Author

It's hung up this morning, but not like I used to see it on ARM. There's one pacman.exe process, which is using CPU, and gpg.exe is running.

$ pstree -lp
?(1)─┬─bash(4741)───pstree(4771)
     ├─bash(437)───pacman(3406)
     └─gpg(4740)

@jeremyd2019
Copy link
Member Author

pacman stack traces
(gdb) thread apply all bt

Thread 4 (Thread 10348.0x19d0):
#0  0x00007ff8abd30861 in ntdll!DbgBreakPoint () from /c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff8abd5c98e in ntdll!DbgUiRemoteBreakin () from /c/Windows/SYSTEM32/ntdll.dll
#2  0x00007ff8ab817034 in KERNEL32!BaseThreadInitThunk () from /c/Windows/System32/KERNEL32.DLL
#3  0x00007ff8abce2651 in ntdll!RtlUserThreadStart () from /c/Windows/SYSTEM32/ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 10348.0x14ec):
#0  0x00007ff8abd2cf54 in ntdll!ZwClose () from /c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff8a964a395 in KERNELBASE!CloseHandle () from /c/Windows/System32/KERNELBASE.dll
#2  0x0000000180133463 in sig_send (p=<optimized out>, p@entry=0x0, si=..., tls=tls@entry=0x14cce00) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/sigproc.cc:771
#3  0x0000000180062f13 in exception::handle (e=0x14cc4f0, frame=<optimized out>, in=<optimized out>, dispatch=<optimized out>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/exceptions.cc:835
--Type <RET> for more, q to quit, c to continue without paging--c
#4  0x00007ff8abd320cf in ntdll!.chkstk () from /c/Windows/SYSTEM32/ntdll.dll
#5  0x00007ff8abce1454 in ntdll!RtlRaiseException () from /c/Windows/SYSTEM32/ntdll.dll
#6  0x00007ff8abd30bfe in ntdll!KiUserExceptionDispatcher () from /c/Windows/SYSTEM32/ntdll.dll
#7  0x0000000180092687 in fhandler_pipe::get_query_hdl_per_process (this=this@entry=0x1803700f8, name=name@entry=0x14cc820 L"\\Device\\NamedPipe\\dd50a72ab4668b33-10348-pipe-nt-0x6E6", ntfn=ntfn@entry=0x8000c2ce0) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/fhandler_pipe.cc:1281
#8  0x0000000180092bdb in fhandler_pipe::temporary_query_hdl (this=this@entry=0x1803700f8) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/fhandler_pipe.cc:1190
#9  0x0000000180125b71 in pipe_data_available (writing=true, h=<optimized out>, fh=0x1803700f8, fd=15) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:645
#10 pipe_data_available (fd=15, fh=0x1803700f8, h=<optimized out>, writing=<optimized out>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:588
#11 0x0000000180126c23 in peek_pipe (s=s@entry=0x80012b080, from_select=from_select@entry=true) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:762
#12 0x0000000180126ed4 in thread_pipe (arg=0x80011af30) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:785
#13 0x0000000180046583 in cygthread::callfunc (this=this@entry=0x180235838 <threads+88>, issimplestub=issimplestub@entry=false) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:48
#14 0x0000000180046b46 in cygthread::stub (arg=arg@entry=0x180235838 <threads+88>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:91
#15 0x0000000180047716 in _cygtls::call2 (this=0x14cce00, func=0x180046ac0 <cygthread::stub(void*)>, arg=0x180235838 <threads+88>, buf=buf@entry=0x14ccd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#16 0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#17 0x00007ff8ab817034 in KERNEL32!BaseThreadInitThunk () from /c/Windows/System32/KERNEL32.DLL
#18 0x00007ff8abce2651 in ntdll!RtlUserThreadStart () from /c/Windows/SYSTEM32/ntdll.dll
#19 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 10348.0x1aec):
#0  0x00007ff8abd2ce34 in ntdll!ZwReadFile () from /c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff8a964aeb3 in ReadFile () from /c/Windows/System32/KERNELBASE.dll
#2  0x0000000180132639 in wait_sig () at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/sigproc.cc:1359
#3  0x0000000180046583 in cygthread::callfunc (this=this@entry=0x1802357e0 <threads>, issimplestub=issimplestub@entry=false) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:48
#4  0x0000000180046b46 in cygthread::stub (arg=arg@entry=0x1802357e0 <threads>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:91
#5  0x0000000180047716 in _cygtls::call2 (this=0x10cce00, func=0x180046ac0 <cygthread::stub(void*)>, arg=0x1802357e0 <threads>, buf=buf@entry=0x10ccd20) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#6  0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#7  0x00007ff8ab817034 in KERNEL32!BaseThreadInitThunk () from /c/Windows/System32/KERNEL32.DLL
#8  0x00007ff8abce2651 in ntdll!RtlUserThreadStart () from /c/Windows/SYSTEM32/ntdll.dll
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 10348.0x1e88):
#0  0x00007ff8abd2cdf4 in ntdll!ZwWaitForSingleObject () from /c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff8a9641a5e in WaitForSingleObjectEx () from /c/Windows/System32/KERNELBASE.dll
#2  0x0000000180046f79 in cygthread::detach (this=0x180235838 <threads+88>, sigwait=sigwait@entry=0x0) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygthread.cc:270
#3  0x0000000180124de4 in pipe_cleanup (stuff=0xffffc260) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:840
#4  0x0000000180127cf1 in select_stuff::cleanup (this=this@entry=0xffffc260) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:255
#5  0x0000000180128a43 in select (maxfds=maxfds@entry=16, readfds=0xffffc340, writefds=0x0, exceptfds=0x0, us=4096, us@entry=1000000) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:212
#6  0x0000000180128ef5 in pselect (maxfds=16, readfds=0xffffc538, writefds=0xffffc540, exceptfds=0x0, to=<optimized out>, to@entry=0xffffc460, set=<optimized out>, set@entry=0x0) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:120
#7  0x000000018012928b in cygwin_select (maxfds=<optimized out>, readfds=<optimized out>, writefds=<optimized out>, exceptfds=<optimized out>, to=0xffffc550) at
 /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/select.cc:147
#8  0x000000018019258b in _sigfe () at sigfe.s:35
#9  0x000000010084dd06 in ?? ()
#10 0x0000000100847630 in ?? ()
#11 0x000000010084fa7a in ?? ()
#12 0x000000010082d554 in ?? ()
#13 0x000000010042a755 in ?? ()
#14 0x000000010042b338 in ?? ()
#15 0x0000000100417b3f in ?? ()
#16 0x000000010042d95c in ?? ()
#17 0x000000010042f03b in ?? ()
#18 0x000000010040a1ea in ?? ()
#19 0x00000001009f19a9 in ?? ()
#20 0x0000000180049b91 in dll_crt0_1 () at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/dcrt0.cc:1053
#21 0x0000000180047716 in _cygtls::call2 (this=0xffffce00, func=0x180048a40 <dll_crt0_1(void*)>, arg=0x0, buf=buf@entry=0xffffcdf0) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:40
#22 0x00000001800477c4 in _cygtls::call (func=<optimized out>, arg=<optimized out>) at /c/S/msys2-runtime/src/msys2-runtime/winsup/cygwin/cygtls.cc:27
#23 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

I took a Windows dump of pacman and gpg, and also used dumper.exe from msys2 to take dumps, but gdb doesn't seem to want to use the core file fron dumper: "/home/adminuser/gpg.core.core": Core file format not supported. I had already killed gpg.exe to see if that broke loose pacman.exe, but it didn't.

@jeremyd2019
Copy link
Member Author

jeremyd2019 commented Dec 23, 2021

This is interesting. It seems that https://github.com/msys2/msys2-runtime/blob/2480c0e3242665890e1dac01891c5a7834fdd138/winsup/cygwin/fhandler_pipe.cc#L1281 is resulting in an exception. The only thing I can think of at this point is that somehow phi->NumberOfHandles is greater than n_handle. Maybe there should be a sanity check, or change the for condition to something like j < min(phi->NumberOfHandles, n_handle)?

@jeremyd2019
Copy link
Member Author

jeremyd2019 commented Dec 23, 2021

Wish I could open that core I took with dumper - anyone know why gdb claims not to support it? (#2756)

jeremyd2019 added a commit to jeremyd2019/msys2-runtime that referenced this issue Dec 23, 2021
Attempt to avoid an exception I caught once in gdb, trying to debug msys2/MSYS2-packages#2752
@jeremyd2019
Copy link
Member Author

My machine is still running like 15 hours in the loop validating 356 packages per invocation, so I haven't verified in the debugger what j/phi->NumberOfHandles/n_handle look like. But I also made a test change ☝️ and I'm thinking I'll try that on ARM where pacman hangs up calling GPG a lot more often.

@jeremyd2019
Copy link
Member Author

So far so good on ARM. Sent patch to cygwin-patches: https://cygwin.com/pipermail/cygwin-patches/2021q4/011611.html

@jeremyd2019
Copy link
Member Author

Patch seems to have done a world of good on ARM, have been building packages for hours and haven't had to manually intervene and kill anything yet!

@jeremyd2019
Copy link
Member Author

https://github.com/msys2/msys2-autobuild/runs/4624180675?check_suite_focus=true seems to have been one validating the sync dbs.

@jeremyd2019
Copy link
Member Author

Can I get some help with the requests from https://cygwin.com/pipermail/cygwin-patches/2021q4/011616.html:

So can you test your diagnosis by removing your patch and adding an assertion?

Would I need to set some special configure args to get an assertion to do anything?

Also, it seems to have started cropping up in msys2's CI when the GHA
runner was switched from "windows-2019" to "windows-2022".

And does your patch help here too?

What would be the best way to do this? Would I need to add a patch to msys2-runtime repo, and update msys2-runtime package, to get runners to exercise this?

/cc @dscho for msys2-runtime expertise

@jeremyd2019
Copy link
Member Author

I got the assertion in while pacman -Su; do true; done:

:: Starting core system upgrade...
 there is nothing to do
:: Starting full system upgrade...
 there is nothing to do
assertion "phi->NumberOfHandles < n_handle" failed: file "../../.././winsup/cygwin/fhandler_pipe.cc", line 1275, function: void* fhandler_pipe::get_query_hdl_per_process(WCHAR*, OBJECT_NAME_INFORMATION*)
Aborted

@jeremyd2019
Copy link
Member Author

I set up a windows server 2022 VM last night and went nuts stressing
pacman/GPGME. I was able to reproduce the issue there:

status = 0x00000000, phi->NumberOfHandles = 8261392, n_handle = 256
[#####----------------------------------]  14%
assertion "phi->NumberOfHandles <= n_handle" failed: file
"../../.././winsup/cygwin/fhandler_pipe.cc", line 1281, function: void*
fhandler_pipe::get_query_hdl_per_process(WCHAR*, OBJECT_NAME_INFORMATION*)

So it is not something inherent in the x86_64-on-ARM64 emulation but can happen on native x86_64 also.

@jeremyd2019
Copy link
Member Author

I think the current state of this is https://cygwin.com/pipermail/cygwin-patches/2021q4/011637.html: namely, waiting on somebody to sign off on the patch.

@jeremyd2019
Copy link
Member Author

msys2/msys2-runtime@0ce992c

@jeremyd2019
Copy link
Member Author

The hang that I could reproduce on x64 is now fixed (pending upload to repo). There is still a hang that I've only seen on ARM64, when validating sync dbs, that I have had no luck debugging with windbgx, gdb, or lldb. None of them can get a valid thread context or stack trace for the main thread. Maybe that's part of whatever the problem is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant