Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Investigation] The MSYS2 + meson + Python crash issue #17415

Closed
lazka opened this issue Jun 2, 2023 · 17 comments · Fixed by mesonbuild/wrapdb#1123
Closed

[Investigation] The MSYS2 + meson + Python crash issue #17415

lazka opened this issue Jun 2, 2023 · 17 comments · Fixed by mesonbuild/wrapdb#1123

Comments

@lazka
Copy link
Member

lazka commented Jun 2, 2023

Another issue (old one: #11864) to collect some information and what was tried so far.

I've created a small repo for reproducing the issue: https://github.com/lazka/python-crash-test

Any ideas regarding what we could try welcome.

The issue

meson fails with STATUS_ACCESS_VIOLATION sometimes like this when being called from ninja, and we haven't found a way to reproduce it locally:

FAILED: libmy-shared-lib6.dll.p/libmy-shared-lib6.dll.symbols 
"D:/a/_temp/msys64/ucrt64/bin/meson" "--internal" "symbolextractor" "D:/a/python-crash-test/python-crash-test/project/_build" libmy-shared-lib6.dll "libmy-shared-lib6.dll.a" libmy-shared-lib6.dll.p/libmy-shared-lib6.dll.symbols 

This below is now out of date, see the followup answers for the cause

When has it started

  • When we updated from Python 3.9 to 3.10

Where it failed so far

Where it hasn't failed so far

  • In MINGW32 / CLANG32
  • Locally, outside of CI

What makes the error go away

  • Running everything via powershell (and just setting PATH), and not from bash
  • Setting MSYS=winjitdebug
  • Downgrading to our last Python 3.9 mingw version (I created a rebuild for the last version we had)
  • Using the official CPython 3.9 installed via setup-python
  • Using the official CPython 3.11 installed via setup-python

What doesn't make the error go away

Bisecting:

Testing with official CPython builds:

  • NO-ERROR: 3.9.13, 3.10.0-alpha.1, 3.10.0-alpha.2, 3.10.0-alpha.3 <-> 3.11.0-alpha.4, 3.11.0-alpha.5, 3.11.0-alpha.6, 3.11.0-alpha.7, 3.11.3
  • ERROR: 3.10.0-alpha.4, 3.10.0-alpha.5, 3.10.0-alpha.7, 3.10.11, 3.11.0-alpha.1, 3.11.0-alpha.3

Error started: v3.10.0a3...v3.10.0a4
Fixed: v3.11.0a3...v3.11.0a4

Bisect, see next post.

@lazka lazka changed the title the MSYS2 + meson + Python crash issue The MSYS2 + meson + Python crash issue Jun 2, 2023
@lazka lazka changed the title The MSYS2 + meson + Python crash issue [Investigation] The MSYS2 + meson + Python crash issue Jun 2, 2023
@lazka
Copy link
Member Author

lazka commented Jun 5, 2023

Short summary of bisecting the issue with upstream CPython in GHA:

@lazka
Copy link
Member Author

lazka commented Jun 6, 2023

I've reduced the reproducer to no longer include MSYS2 now: https://github.com/lazka/python-crash-test/blob/70349faa496cdf19c4bd5152435542542eb7a14f/.github/workflows/main.yml

@lazka
Copy link
Member Author

lazka commented Jun 6, 2023

I've filed a bug report: python/cpython#105400 but I don't expect much from it, but who knows.

I'm also not sure how to proceed on the MSYS2 side of things. Prioritizing a 3.11 update would be one option.

@sskras
Copy link

sskras commented Jun 6, 2023

Wow, impressive. I tried to reproduce this in cmd by adding MINGW64 tools to the PATH with no success:
image

Did I miss something?

@Biswa96
Copy link
Member

Biswa96 commented Jun 6, 2023

Did I miss something?

Yes, this issue does not happen "Locally, outside of CI".

@sskras
Copy link

sskras commented Jun 6, 2023

OK. Is it possible to know the actual CI setup (VMM name/version, the host CPUID, the host OS) ?

@eli-schwartz
Copy link

eli-schwartz commented Jun 6, 2023

@sskras
Copy link

sskras commented Jun 6, 2023

@eli-schwartz, sure, I saw the defined action. Well, this says thing about the Guest:
https://github.com/actions/runner-images/releases/tag/win22%2F20230517.1

But no configuration of the lower levels – the VMM/hypervisor, the Host OS/hardware.

EDIT: Anyway, I do not even understand it it's possible to download/build the exact disk image of the Guest OS to try it at locally.

@eli-schwartz
Copy link

I have no idea how people generally find out information about the host hardware and hypervisor used on Github's servers, and would not assume that someone who ran a workflow could find that out.

@jeremyd2019
Copy link
Member

Interesting that it's specifically SEM_NOGPFAULTERRORBOX... the other relevant flag that we are losing by setting winjitdebug is SEM_FAILCRITICALERRORS and that's the one whose absence results in message boxes for missing exports. I don't know if there's a reasonable way to only set one and not the other for child processes though, as a workaround to keep CI from hanging on missing exports while still working around this issue. It's probably not worth a new MSYS option and additional patching of msys2-runtime over...

@kasper93
Copy link
Contributor

kasper93 commented Jun 9, 2023

Small update. I've reviewed meson's symbolextractor.py code. While there are few things that could be improved, it is not related to this issue.

One important thing I noticed that the issue happens only with llvm-nm which is preferred by meson, it does not happen with mingw's nm. With this python-crash-test reproducer.

I think next step would be to remove meson and ninja (if possible) from the reproducer. I tested with script that only runs llvm-nm on the library. The reproducer is pretty minimal, good job on that, but removing components like meson would really help to focus on important parts.

But my fear is that since it doesn't happen with Python 3.11 there will not be much interest (if at all) to find the root cause of this. I think during process exit some resources/fd are not in a valid state which causes the issue.

@anarazel
Copy link

Interesting that it's specifically SEM_NOGPFAULTERRORBOX...

There's a lot tied into that one these days... Windows error reporting, JIT debugging, etc only happen when SEM_NOGPFAULTERRORBOX is not set.

I don't know if there's a reasonable way to only set one and not the other for child processes though

I hit this issue without mingw or llvm-nm being involved (meson running C binaries compiled with msvc, via msbuild, not ninja). SEM_NOGPFAULTERRORBOX was the relevant factor, SEM_FAILCRITICALERRORS did not matter. Unfortunately the failure rates in my case are too low to make for good test case, and I hadn't been able to increase them meaningfully.

From writing this up for postgres:

The problem:

Occasionally ecpg test files would fail to compile, exiting with -1073741819:
C:\BuildTools\MSBuild\Microsoft\VC\v160\Microsoft.CppCommon.targets(241,5): error MSB8066: Custom build for 'C:\cirrus\build\meson-private\custom_target.rule' exited with code -1073741819. [c:\cirrus\build\src\interfaces\ecpg\test\sql\3701597@@twophase(dot)c(at)cus(dot)vcxproj]

-1073741819 is 0xc0000005, which in turn is STATUS_ACCESS_VIOLATION, i.e. a
segfault. This happens in roughly 1/3 of the builds, but with "streaks" of not
happening and more frequently happening.
...
The problem even persisted when replacing meson's sys.exit() with os._exit(),
which indeed just calls _exit().

Which to me strongly hints that the issue is somewhere in the windows runtime.

Possibly the explanation for why the issue is visible in CI but not, so far, locally, is due to different version of the windows runtime being used.

@lb90
Copy link
Collaborator

lb90 commented Jun 13, 2023

Hi! Writing here since that question may also help with this issue

I am experiencing a deadlock in python when running meson test --gdb. So I tried building mingw-w64-x86_64-python with options=('debug' '!strip'), but after installing the package python doesn't work anymore. Actually even the PKGBUILD check() fails. Anybody knows why? EDIT: solved by removing --with-pydebug

Regarding this issue, I'd try installing a vectored exception handler which simply calls Sleep() with a super long timeout. Then, when python hangs due to the exception handler, attach gdb (or lldb) to the process via SSH and gather a stacktrace. Vectored exception handlers always run on top of the normal stack, no unwind is done, so the backtrack would give us all the infos we need. Not entirely sure if handlers are also run during process teardown (ExitProcess), but most probably yes.

@kasper93
Copy link
Contributor

Would it be possible to update Python package to 3.11? It is already at 3.11.4 and 3.10 is in "security fixes" only stage. Arch also provides 3.11 by default, but of course they have also other version available in repo.

@lazka
Copy link
Member Author

lazka commented Jun 25, 2023

I can confirm that a mingw build of 3.11 also doesn't crash: msys2-contrib/cpython-mingw#139 (comment)

kasper93 added a commit to kasper93/mpv that referenced this issue Jun 25, 2023
kasper93 added a commit to kasper93/mpv that referenced this issue Jun 25, 2023
kasper93 added a commit to kasper93/mpv that referenced this issue Jun 25, 2023
sfan5 pushed a commit to mpv-player/mpv that referenced this issue Jun 26, 2023
dyphire pushed a commit to dyphire/mpv that referenced this issue Jul 3, 2023
dyphire pushed a commit to dyphire/mpv that referenced this issue Jul 8, 2023
lazka added a commit to lazka/meson that referenced this issue Jul 29, 2023
This reverts commit e945f35.

With MSYS2 udpating to Python 3.11, this should no longer be needed.
See msys2/MINGW-packages#17415 (comment)
eli-schwartz pushed a commit to lazka/meson that referenced this issue Jul 31, 2023
This reverts commit e945f35.

With MSYS2 udpating to Python 3.11, this should no longer be needed.
See msys2/MINGW-packages#17415 (comment)
eli-schwartz pushed a commit to mesonbuild/meson that referenced this issue Aug 6, 2023
This reverts commit e945f35.

With MSYS2 udpating to Python 3.11, this should no longer be needed.
See msys2/MINGW-packages#17415 (comment)

(cherry picked from commit 3752041)
eli-schwartz pushed a commit to mesonbuild/meson that referenced this issue Aug 6, 2023
This reverts commit e945f35.

With MSYS2 udpating to Python 3.11, this should no longer be needed.
See msys2/MINGW-packages#17415 (comment)

(cherry picked from commit 3752041)
eli-schwartz pushed a commit to mesonbuild/meson that referenced this issue Aug 6, 2023
This reverts commit e945f35.

With MSYS2 udpating to Python 3.11, this should no longer be needed.
See msys2/MINGW-packages#17415 (comment)

(cherry picked from commit 3752041)
benoit-pierre added a commit to benoit-pierre/wrapdb that referenced this issue Aug 24, 2023
Make sure Python is updated to >=3.11 (fix msys2/MINGW-packages#17415).
eli-schwartz pushed a commit to mesonbuild/wrapdb that referenced this issue Aug 24, 2023
Make sure Python is updated to >=3.11 (fix msys2/MINGW-packages#17415).
bruchar1 pushed a commit to bruchar1/meson that referenced this issue Aug 25, 2023
This reverts commit e945f35.

With MSYS2 udpating to Python 3.11, this should no longer be needed.
See msys2/MINGW-packages#17415 (comment)
gnomesysadmins pushed a commit to GNOME/glib that referenced this issue Oct 12, 2023
It was added in 13fe2e0, but it's now unnecessary as the issue
has been fixed.

See msys2/MINGW-packages#17415
gnomesysadmins pushed a commit to GNOME/glib that referenced this issue Oct 16, 2023
It was added in 13fe2e0, but it's now unnecessary since
the issue has been fixed.

See msys2/MINGW-packages#17415
@Biswa96
Copy link
Member

Biswa96 commented Oct 21, 2023

Should this be closed?

@lazka
Copy link
Member Author

lazka commented Oct 21, 2023

Let's hope it doesn't come back :)

@lazka lazka closed this as completed Oct 21, 2023
willwray pushed a commit to willwray/wrapdb that referenced this issue Apr 22, 2024
Make sure Python is updated to >=3.11 (fix msys2/MINGW-packages#17415).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants