[BUG] Segfaults on "select count(*) from test" with tables on top of cuDF DataFrames #415
Comments
cc @jdye64, @pentschev
Conda environment to reproduce
@pentschev I went back and checked the offending files from issue #297. They have not been accidentally reverted, but that was a good thing to check.
The point here nevertheless seems to be that the JVM raises segmentation faults as part of its garbage collection, and these are subsequently caught by the UCX error handling. As per my #297 (comment), I don't see any way around it other than disabling UCX error handling. As noted in that comment, we may lose the ability to debug UCX as easily, but we can't really do anything else from outside the JVM's signal handler, because raising segmentation faults is inherent to the way the JVM does garbage collection.
FYI, the second snippet above fails regardless of UCX. Updating the issue title to reflect that. Also, I notice that the only conda environments in which these don't fail are ones in which I'm building dask-sql, dask, and distributed from source. I'm going to keep peeling back environment customizations to narrow this down further.
I was able to triage this to environments with
Here is a minimal reproducer outside dask-sql:
import cudf
import dask_cudf
df = cudf.DataFrame({"id": [1, 2, 3]})
ddf = df
ddf = dask_cudf.from_cudf(df, 1)
ddf["tmp"] = 1
ddf.groupby(["id"]).agg({"tmp": "count"}).compute()
We should be good to close on this repo.
Side note: The reproducer outside dask-sql shows the actual error (Python recursion depth exceeded), but with the JVM spun up it gets captured as a segfault with no additional information.
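For anyone else chasing a similar failure, here is a rough sketch of how a recursion depth error can be surfaced explicitly in plain Python, before any JVM is started. It is an illustration only and not part of dask-sql, dask, or cuDF: `run_and_report` and `runaway` are hypothetical names, and `runaway` is just a stand-in for whatever call is actually failing.

```python
import sys


def run_and_report(fn, *args, **kwargs):
    """Call fn and return (ok, result_or_message).

    Hypothetical helper: it catches RecursionError so the real cause is
    reported as text instead of being lost further up the stack.
    """
    try:
        return True, fn(*args, **kwargs)
    except RecursionError as exc:
        limit = sys.getrecursionlimit()
        return False, f"RecursionError: {exc} (recursion limit is {limit})"


def runaway(n=0):
    # Stand-in for the failing call; recurses until Python raises RecursionError.
    return runaway(n + 1)


ok, info = run_and_report(runaway)
print(ok, info)
```

With the reproducer above, the same wrapper could be given a function that performs the `.compute()` call, assuming the error still surfaces as a Python `RecursionError` in that environment.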
Thank you for digging into this, Ayush. This explains my working environments (which were removing older versions of Dask, then building/installing from Dask & Distributed main).
test.py:
EDIT: Leaving the below UCX snippet and trace for historical purposes, but the issue seems entirely unrelated to UCX.
trace: