segfault on Ubuntu 20.04 when in combination with LightGBM #2453
|
without |
works fine in Ubuntu 18.04
^ means it didn't segfault, so good |
The crash occurs at It is not clear why importing datatable up-front causes any change in behavior, since lightgbm itself tries to import datatable at startup... |
also, when I compile LightGBM on Ubuntu 20.04, it works fine |
If I set a breakpoint for
which is perfectly reasonable, and what I would normally expect. Then setting a breakpoint for
The backtrace is no longer valid:
|
I suspect that the issue is actually in the order of module loading. If I rearrange imports as
then there is no longer a crash |
Yes, confirmed: it works in DAI too if pandas is imported first. |
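A minimal sketch of the import-order workaround discussed above. The exact rearranged imports are not shown in this thread; the only confirmed detail is that importing pandas first avoids the crash. The helper name `import_in_order` and the list `PREFERRED_ORDER` are hypothetical, and the order after pandas is an assumption based on the issue title.

```python
import importlib


def import_in_order(names):
    """Import modules one by one in the given order; return the names that loaded."""
    loaded = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Package not installed: skip it so the sketch still runs anywhere.
            continue
        loaded.append(name)
    return loaded


# Assumption: pandas goes first (confirmed above); the remaining order is a guess.
PREFERRED_ORDER = ["pandas", "datatable", "lightgbm"]
```

Calling `import_in_order(PREFERRED_ORDER)` once at program start, before any other module pulls in datatable, keeps the load order explicit. This only works around the symptom; it does not explain the crash.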
I tried doing
--- a/dtpd
+++ b/pddt
@@ -6,34 +6,26 @@
/lib/x86_64-linux-gnu/libz.so.1
/lib/x86_64-linux-gnu/libm.so.6
/lib/x86_64-linux-gnu/libc.so.6
-/tmp/blah/lib/python3.6/site-packages/datatable/lib/_datatable.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libstdc++.so.6
-/lib/x86_64-linux-gnu/libgcc_s.so.1
-/usr/lib/python3.6/lib-dynload/_bz2.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libbz2.so.1.0
-/usr/lib/python3.6/lib-dynload/_lzma.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/liblzma.so.5
-/usr/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libcrypto.so.1.1
-/usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libffi.so.7
-/usr/lib/python3.6/lib-dynload/_opcode.cpython-36m-x86_64-linux-gnu.so
-/usr/lib/python3.6/lib-dynload/_curses.cpython-36m-x86_64-linux-gnu.so
-/lib/x86_64-linux-gnu/libncursesw.so.6
-/lib/x86_64-linux-gnu/libtinfo.so.6
-/usr/lib/python3.6/lib-dynload/termios.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/core/_multiarray_umath.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/core/../../numpy.libs/libopenblasp-r0-34a18dc3.3.7.so
/tmp/blah/lib/python3.6/site-packages/numpy/core/../../numpy.libs/libgfortran-ed201abd.so.3.0.0
/tmp/blah/lib/python3.6/site-packages/numpy/core/_multiarray_tests.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_ctypes.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libffi.so.7
/tmp/blah/lib/python3.6/site-packages/numpy/linalg/lapack_lite.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/linalg/_umath_linalg.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_bz2.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libbz2.so.1.0
+/usr/lib/python3.6/lib-dynload/_lzma.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/liblzma.so.5
/usr/lib/python3.6/lib-dynload/_decimal.cpython-36m-x86_64-linux-gnu.so
/lib/x86_64-linux-gnu/libmpdec.so.2
/tmp/blah/lib/python3.6/site-packages/numpy/fft/_pocketfft_internal.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/random/mtrand.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/random/_bit_generator.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/random/_common.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_hashlib.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libcrypto.so.1.1
/tmp/blah/lib/python3.6/site-packages/numpy/random/_bounded_integers.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/random/_mt19937.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/numpy/random/_philox.cpython-36m-x86_64-linux-gnu.so
@@ -63,6 +55,7 @@
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/tslib.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/interval.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/algos.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_opcode.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/properties.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/hashing.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/ops.cpython-36m-x86_64-linux-gnu.so
@@ -76,6 +69,8 @@
/usr/lib/python3.6/lib-dynload/mmap.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/reshape.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/window/aggregations.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libstdc++.so.6
+/lib/x86_64-linux-gnu/libgcc_s.so.1
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/window/indexers.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/groupby.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/reduction.cpython-36m-x86_64-linux-gnu.so
@@ -83,6 +78,11 @@
/usr/lib/python3.6/lib-dynload/_csv.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/json.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/pandas/_libs/testing.cpython-36m-x86_64-linux-gnu.so
+/tmp/blah/lib/python3.6/site-packages/datatable/lib/_datatable.cpython-36m-x86_64-linux-gnu.so
+/usr/lib/python3.6/lib-dynload/_curses.cpython-36m-x86_64-linux-gnu.so
+/lib/x86_64-linux-gnu/libncursesw.so.6
+/lib/x86_64-linux-gnu/libtinfo.so.6
+/usr/lib/python3.6/lib-dynload/termios.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/scipy/_lib/_ccallback_c.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/scipy/_lib/_uarray/_uarray.cpython-36m-x86_64-linux-gnu.so
/tmp/blah/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so
I can surmise that perhaps it could be a name clash with one of the globally defined symbols? Not sure how to proceed at this point... |
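A comparison like the one above can be sketched with Python's standard `difflib`, assuming the two library load orders have already been captured as lists of paths. How the lists were captured is not shown in this thread. The function names `diff_load_order` and `position_shift` are hypothetical; the default labels `dtpd` and `pddt` are taken from the diff header.

```python
import difflib


def diff_load_order(first, second, from_name="dtpd", to_name="pddt"):
    """Return a unified diff (as a list of lines) between two library load orders."""
    return list(
        difflib.unified_diff(
            first, second, fromfile=from_name, tofile=to_name, lineterm=""
        )
    )


def position_shift(first, second):
    """Map each library found in both lists to how far its index moved.

    Positive values mean the library appears later in `second`; libraries
    whose index did not change are omitted.
    """
    index_in_second = {lib: i for i, lib in enumerate(second)}
    shifts = {}
    for i, lib in enumerate(first):
        j = index_in_second.get(lib)
        if j is not None and j != i:
            shifts[lib] = j - i
    return shifts
```

`position_shift` gives a quick view of which libraries, such as `libstdc++.so.6` in the listing above, end up loaded earlier or later when the import order changes.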
@arnocandel
At this point it is unclear how to debug this problem any further, nor whether it is even possible to fix it within datatable. |
We could at least compile lightgbm “the normal way” with debug symbols, and bring it into Ubuntu 20. |
I got this in valgrind (DAI):
|
and this from core file (DAI)
|
https://gcc.gnu.org/onlinedocs/gcc-4.8.2/libstdc++/api/a01452_source.html L1620 shows that we're in |
shows same thing in Docker:
|
But the /dev/random value is probably expected to be uninitialized, so it can be ignored. It's just interesting that it shows up (almost) at the end of the stack trace from the core file. |
According to https://en.cppreference.com/w/cpp/numeric/random/random_device, the function of the As such, it is perfectly normal for a random device to use some uninitialized value to further scramble bits of whatever is read from /dev/random. The random device's job is to be as unpredictable as possible. So I think we shouldn't be worrying too much about this report from valgrind. |
On the other hand, this part of the core dump I find suspicious:
As we already know, line 1620 of random.h simply calls |
yes, that's consistent with another segfault I just got, this time from TF:
|
Based on our investigation, the problem appears to be rooted in Ubuntu 20's system libraries, specifically the Since there seems to be nothing we can do about it within datatable, I'm closing this issue. It should probably be re-raised with either the Ubuntu or GCC teams. |
@st-pasha In a totally separate project I am encountering the same |
I'm facing it on mojo when datatable is installed from PyPI (0.11.1). I'm reopening the issue, as the current release of datatable seems incompatible with TF and mojo.
|
Let me know if you need any help to reproduce it. |
@sh1ng We will be making a new release shortly, but other than that, there is nothing that we can fix within datatable to address this problem. |
For anyone else who happens to stumble upon this, I ran into a similar/same issue that occurred because tensorflow was binding certain More details: |
@aws-taylor I'm also getting a similar error, for PyTorch, on Ubuntu 22.04 (rwth-i6/returnn#1339). Can you provide some details on how to interpret the data from |
fails with: