
[BUG] Segfaults on "select count(*) from test" with tables on top of cuDF DataFrames #415

Closed
randerzander opened this issue Mar 7, 2022 · 8 comments
Labels: bug (Something isn't working), needs triage (Awaiting triage by a dask-sql maintainer)

Comments


randerzander commented Mar 7, 2022

test.py:

if __name__ == "__main__":
    from dask.distributed import Client
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster(protocol="tcp")
    client = Client(cluster)
    print(client)

    from dask_sql import Context
    import cudf

    c = Context()

    test_df = cudf.DataFrame({'id': [0, 1, 2]})
    c.create_table("test", test_df)

    # segfault
    print(c.sql("select count(*) from test").compute())

EDIT: Leaving the below UCX snippet and trace for historical purposes, but the issue seems entirely unrelated to UCX.

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask_sql import Context
import pandas as pd

cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)

c = Context()

test_df = pd.DataFrame({'id': [0, 1, 2]})
c.create_table("test", test_df)

# segfault
c.sql("select count(*) from test")

trace:

/home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/distributed-2022.2.1+8.g39c5e885-py3.9.egg/distributed/comm/ucx.py:83: UserWarning: A CUDA context for device 0 already exists on process ID 1251168. This is often the result of a CUDA-enabled library calling a CUDA runtime function before Dask-CUDA can spawn worker processes. Please make sure any such function calls don't happen at import time or in the global scope of a program.
  warnings.warn(
distributed.preloading - INFO - Import preload module: dask_cuda.initialize
...
[rl-dgx2-r13-u7-rapids-dgx201:1232380:0:1232380] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x8)
==== backtrace (tid:1232380) ====
 0  /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(ucs_handle_error+0x155) [0x7f921c5883f5]
 1  /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d791) [0x7f921c588791]
 2  /home/rgelhausen/conda/envs/dsql-3-07/lib/python3.9/site-packages/ucp/_libs/../../../../libucs.so.0(+0x2d962) [0x7f921c588962]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x430c0) [0x7f976d27b0c0]
 4  [0x7f93a78e6b58]
=================================
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f93a78e6b58, pid=1232380, tid=1232380
#
# JRE version: OpenJDK Runtime Environment (11.0.1+13) (build 11.0.1+13-LTS)
# Java VM: OpenJDK 64-Bit Server VM (11.0.1+13-LTS, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# J 1791 c2 java.util.Arrays.hashCode([Ljava/lang/Object;)I java.base@11.0.1 (56 bytes) @ 0x00007f93a78e6b58 [0x00007f93a78e6b20+0x0000000000000038]
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to /home/nfs/rgelhausen/notebooks/core.1232380)
#
# An error report file with more information is saved as:
# /home/nfs/rgelhausen/notebooks/hs_err_pid1232380.log
Compiled method (c2)   17616 1791       4       java.util.Arrays::hashCode (56 bytes)
 total in heap  [0x00007f93a78e6990,0x00007f93a78e6d80] = 1008
 relocation     [0x00007f93a78e6b08,0x00007f93a78e6b20] = 24
 main code      [0x00007f93a78e6b20,0x00007f93a78e6c60] = 320
 stub code      [0x00007f93a78e6c60,0x00007f93a78e6c78] = 24
 metadata       [0x00007f93a78e6c78,0x00007f93a78e6c80] = 8
 scopes data    [0x00007f93a78e6c80,0x00007f93a78e6ce8] = 104
 scopes pcs     [0x00007f93a78e6ce8,0x00007f93a78e6d48] = 96
 dependencies   [0x00007f93a78e6d48,0x00007f93a78e6d50] = 8
 handler table  [0x00007f93a78e6d50,0x00007f93a78e6d68] = 24
 nul chk table  [0x00007f93a78e6d68,0x00007f93a78e6d80] = 24
Could not load hsdis-amd64.so; library not loadable; PrintAssembly is disabled
#
# If you would like to submit a bug report, please visit:
randerzander added the "bug" and "needs triage" labels on Mar 7, 2022
@randerzander

cc @jdye64, @pentschev


jdye64 commented Mar 7, 2022

Conda environment to reproduce

conda create -n dsql -c rapidsai-nightly -c nvidia -c conda-forge cudatoolkit=11.5 cudf=22.04 dask-cuda dask-cudf dask/label/dev::dask-sql dask-ml ipykernel pyngrok plotly s3fs requests nbformat cuml

@pentschev

Seems like the same issue as #297, which was fixed by #294. Any chance that fix has been reverted?


jdye64 commented Mar 8, 2022

@pentschev I went back and checked the offending files from issue #297; they have not been accidentally reverted, but that was a good thing to check.

@pentschev

The point here nevertheless seems to be that the JVM produces segmentation faults as part of its garbage collection, and these are subsequently caught by the UCX error handling. As per my #297 (comment), I don't see any way around it other than disabling UCX error handling. As noted in that comment, we may lose the ability to debug UCX as easily, but we can't really do anything else from outside the JVM's signal handler, since raising segmentation faults is inherent to the way the JVM does garbage collection.
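
For reference, a minimal sketch of that workaround, assuming the standard UCX_HANDLE_ERRORS environment variable is the right knob for the UCX build in use (the exact mechanism is not spelled out elsewhere in this thread):

import os

# Turn off UCX's signal-based error handling before any workers spawn,
# so the JVM's intentional SIGSEGVs are not intercepted by UCX.
# Assumes the standard UCX_HANDLE_ERRORS variable is honored here;
# its default value is "bt" (print a backtrace on error).
os.environ["UCX_HANDLE_ERRORS"] = "none"

from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Worker processes inherit this environment, so their UCX instances
# also skip installing the segfault handler.
cluster = LocalCUDACluster(protocol="ucx")
client = Client(cluster)

Setting the variable in the parent process before the cluster starts lets the spawned workers inherit it, at the cost of losing UCX's own backtraces when a genuine UCX fault occurs.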

@randerzander

FYI, the second snippet above fails regardless of UCX. Updating the issue title to reflect that.

Also, I notice that the only conda environments in which these don't fail are ones where I'm building dask-sql, dask, and distributed from source. I'm going to keep peeling back environment customizations to narrow this down further.

randerzander changed the title from "[BUG] segfaults when using dask-sql with a UCX enabled dask-cuda cluster" to "[BUG] Segfaults on "select count(*) from test" with tables on top of cuDF DataFrames" on Mar 9, 2022

ayushdg commented Mar 10, 2022

I was able to triage this down to environments where cuml resolves to older dask versions. The issue only comes up when the newer cudf/dask-cudf nightlies are used with an older version of dask (2021.11.2).

Here is a minimal reproducer outside dask-sql:

import cudf
import dask_cudf

# single-partition dask_cudf frame built from a small cuDF DataFrame
df = cudf.DataFrame({"id": [1, 2, 3]})
ddf = dask_cudf.from_cudf(df, npartitions=1)

# adding a column and running a grouped count triggers the failure
ddf["tmp"] = 1
ddf.groupby(["id"]).agg({"tmp": "count"}).compute()

We should be good to close this on the dask-sql repo.

Side note: the reproducer outside dask-sql surfaces the actual error (Python recursion depth exceeded), but with the JVM spun up it gets captured as a segfault with no additional information.
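
For anyone checking their own environment against this diagnosis, a minimal version check (a sketch; the only problematic version named in this thread is dask 2021.11.2):

# Print the versions that resolved in the current environment; per the
# diagnosis above, the failure pairs newer cudf/dask-cudf nightlies
# with dask 2021.11.2.
import dask
import distributed
import cudf
import dask_cudf

for name, mod in [("dask", dask), ("distributed", distributed),
                  ("cudf", cudf), ("dask_cudf", dask_cudf)]:
    print(name, mod.__version__)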

@randerzander

Thank you for digging into this, Ayush. This explains why my working environments were the ones where I removed older versions of Dask and then built/installed dask and distributed from main.
