
Fix segfault with printing dataframe #47097

Conversation

@roberthdevries (Contributor) commented May 23, 2022

GH 46848

@roberthdevries (Contributor, Author)

I have a fix that prevents the segfault, but I am not sure how a regression test could be made, as the failure depends on the width of the terminal screen. That does not make for a workable regression test, and I am at a loss how to trigger the issue otherwise.
As far as I can determine, the crash only happens when printing a dataframe requires dots to indicate omitted columns because the screen is not wide enough.
Apparently there is some logic to detect this situation and then take another code path. That other code path takes care of injecting the dots, but also triggers the code that produces the segfault.
Ideally we would have a regression test that provokes the segfaulting code path without going through the print logic.
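For what it's worth, a regression test may not need a real terminal: pandas takes the display width from its options, so `pd.option_context` can force the truncated repr deterministically. A minimal sketch, assuming the reproduction from the linked issue (a column assigned from `np.empty_like`, whose object slots are NULL rather than None); the shapes and width values here are illustrative, not taken from the report:

```python
import numpy as np
import pandas as pd

# Hypothetical regression-test sketch; sizes chosen only so that the
# repr must be truncated.
df = pd.DataFrame({f"col{i}": list("abc") for i in range(20)}, dtype=object)
for col in df.columns:
    # np.empty_like leaves the object slots as NULL pointers, not None
    df[col] = np.empty_like(df[col])

# Force the truncated ("...") formatting path regardless of terminal size
with pd.option_context("display.width", 40, "display.max_columns", 5):
    text = repr(df)  # segfaulted before the fix
```

This avoids the terminal-width dependency entirely, since the formatter only consults the display options.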

@roberthdevries roberthdevries marked this pull request as draft May 23, 2022 17:35
@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from eb6301c to 9981fb1 Compare May 23, 2022 17:41
@rhshadrach (Member)

Does the segfault arise in one of the lines of the diff of this PR? Or is it only when this result is used later on?

@simonjayhawkins simonjayhawkins added this to the 1.4.3 milestone May 26, 2022
@simonjayhawkins simonjayhawkins added Regression Functionality that used to work in a prior pandas version Output-Formatting __repr__ of pandas objects, to_string labels May 26, 2022
@@ -183,8 +183,15 @@ def take_2d_axis1_{{name}}_{{dest}}(ndarray[{{c_type_in}}, ndim=2] values,
{{if c_type_in == "uint8_t" and c_type_out == "object"}}
out[i, j] = True if values[i, idx] > 0 else False
{{else}}
{{if c_type_in == "object"}} # GH46848
if values[i][idx] is None:
out[i, j] = None
Member

any idea why this makes a difference? this looks deeply weird

Member

@jbrockmendel i've posted the arguments that cause take_2d_axis1_object_object to segfault in #46848 (comment) if that helps.

Member

perhaps should explicitly check for NULL for clarity. see #46848 (comment)

Member

perhaps should explicitly check for NULL for clarity. see #46848 (comment)

Member

If we fix here, probably should also patch take_2d_axis0_, see #46848 (comment)

@roberthdevries would changing L159 like done in https://github.com/pandas-dev/pandas/pull/13764/files also work?

Contributor

Hmmmpff, it seems empty_like does not fill the object array with None, but just zeroes the memory, leaving NULLs. NumPy (mostly?) supports this: an array filled with NULLs is treated just as if it were filled with None.

It is probably best if you also add support for that in pandas. That said, it would also be better to explicitly fill the array with None in NumPy. (And I do think we should define that as the right thing: even if there are branches where NULLs must be expected, a fully initialized array filled with NULLs should never happen!)

@seberg (Contributor) commented Jun 12, 2022

To prove the point:

In [7]: arr = np.empty(3, dtype=object)

In [8]: arr.tobytes()
Out[8]: b'`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00`\xd1\x91\x00\x00\x00\x00\x00'

In [9]: np.empty_like(arr).tobytes()
Out[9]: b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'

Contributor (Author)

@seberg How would you effectively detect NULL values in a numpy array? I assume that @simonjayhawkins prefers a fix somewhere in the setitem method of DataFrame, but I am at a loss how to intercept those values and convert them to None.
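One Python-level workaround (just a sketch, not the setitem fix asked about here) is to never let a NULL-filled array reach pandas in the first place, by overwriting every slot with a real None reference right after allocation:

```python
import numpy as np

src = np.array(["a", "b", "c"], dtype=object)

out = np.empty_like(src)  # object slots may be NULL rather than None
out[...] = None           # store an actual None reference in every slot

# all elements are now genuine None objects that can safely be increfed
assert all(x is None for x in out)
```

The element-wise assignment goes through NumPy's object setitem, which tolerates NULL old values, so this is safe even on a freshly allocated array.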

Contributor

Honestly, it seems to me that Cython may not know about the possibility of NULLs in object-typed buffers (either plain buffers or typed memoryviews); the generated code doesn't look like it does. So I am not sure what a decent way forward would be.

@seberg (Contributor) commented Jun 22, 2022

So, a work-around that is relatively quick may be the following:

    cdef cnp.intp_t[2] idx
    idx[0] = i
    idx[1] = j

    # load the PyObject* stored in the slot; it may be NULL
    cdef void *data = (<void **>cnp.PyArray_GetPtr(values, idx))[0]
    if data == NULL:
        data = <void *>None
    out[i, j] = <object>data

But maybe it's nicer to just drop that weird PyArray_GetPtr:

    cdef void *data = (<void **>(arr.data + x * arr.strides[0] + y * arr.strides[1]))[0]
    if data == NULL:
        data = <void *>None
    return <object>data

which will be about as fast as possible without any micro-optimizations (not quite sure about Cython 3, but maybe it doesn't matter).

I have been planning on plugging this hole in NumPy (numpy/numpy#21817), but I am not sure whether it is better handled there or in Cython.
(IMO it is clearly right to fix this in either NumPy or Cython, although if we fix it in NumPy, pandas still needs the work-around, since you can rely on a new Cython but not on a new NumPy.)

@roberthdevries (Contributor, Author)

The segfault is caused on the original line 186.
The code generated by Cython tries to increment the refcount of the object on the rhs of the assignment expression; however, that is a NULL pointer.
The question is where that NULL value comes from.
The extra check for None avoids the call that increments the refcount.

@jreback (Contributor) commented May 27, 2022

can you add a test which replicates

@simonjayhawkins (Member)

can you add a test which replicates

@roberthdevries can you also add a release note to doc/source/whatsnew/v1.4.3.rst and resolve conflicts with main

@simonjayhawkins (Member)

I have a fix that prevents the segfault, but I am not sure how a regression test could be made as the failure depends on the width of the terminal screen.

use MRE from #46848 (comment)

@simonjayhawkins (Member)

@jbrockmendel thoughts on catching the uninitialised numpy array in setitem instead? A change to setitem was the cause of the regression. Either way I guess there will be a performance regression.

according to numpy/numpy#7854 it is a Cython issue; maybe we should also not fix it in pandas.

@jbrockmendel (Member)

thoughts on catching the uninitialised numpy array in setitem instead?

It's not clear to me how we check for this. Is there something in arr.flags that would have what we need?

@simonjayhawkins simonjayhawkins mentioned this pull request Jun 8, 2022
@simonjayhawkins (Member)

thoughts on catching the uninitialised numpy array in setitem instead?

It's not clear to me how we check for this. Is there something in arr.flags that would have what we need?

I was thinking along the lines that it may be more performant to do a check once on setitem rather than on every subsequent take operation. But thinking some more, we should probably take this suggestion off the table and be consistent with the fix for the similar issue in DataFrame construction, #13717.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch 3 times, most recently from aa49b89 to 6f51d22 Compare June 10, 2022 12:51
@simonjayhawkins (Member)

according to numpy/numpy#7854 it is a Cython issue; maybe we should also not fix it in pandas.

in the issue, the OP is using

for col in df.columns:
    df[col] = np.empty_like(df[col])

print(df)

which results in the segfault.

instead

for col in df.columns:
    df.loc[:, col] = np.empty_like(df[col])

print(df)

works fine.

#46267 is reporting a significant slowdown for repeated assignment via __setitem__

this sort of supports the "won't fix" (in pandas) argument, as the workaround is more performant.

@roberthdevries can you run the asvs to see if the current proposed fix affects performance.

@seberg (Contributor) commented Jun 21, 2022

The currently proposed fix should slow things down significantly. You would probably need to push the code closer to C to fix that, though (unless Cython is "fixed").
(That should not be particularly difficult, but it is not a typical code pattern here, I guess.)

Maybe we should prioritize NumPy avoiding this. There are some cases where NumPy probably cannot, but I guess those cases do not really affect normal Cython use cases. Of course, that will only fix the issue on newer NumPy versions.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from 6f51d22 to d8da853 Compare June 21, 2022 17:54
@simonjayhawkins (Member)

moving off 1.4.3 milestone.

@simonjayhawkins simonjayhawkins modified the milestones: 1.4.3, 1.4.4 Jun 21, 2022
@roberthdevries (Contributor, Author)

@roberthdevries can you run the asvs to see if the current proposed fix affects performance.

I have run the asvs, but the output is huge and I cannot find anything obvious in it that would indicate a slowdown. Maybe you could help me by pointing me in the direction where I should look, or telling me what you are looking for.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from d8da853 to 1bb002b Compare June 22, 2022 10:18
@simonjayhawkins (Member)

there's a benchmark using take_nd called parallel_take1d, which I assume is not relevant, and there's a time_take in asv_bench/benchmarks/indexing.py which is for a Series. So it may be that we don't have any benchmarks specific to the changed code path.

So if we don't see any other slowdowns, maybe that's good, or maybe we should add an explicit benchmark to be sure.

@roberthdevries (Contributor, Author)

My nightly run did not report any significant performance degradations:

[99.81%] ··· Setting up timedelta:14                                                                                                                                  ok
[99.81%] ··· timedelta.DatetimeAccessor.time_dt_accessor                                                                                                         178±2ns
[99.83%] ··· timedelta.DatetimeAccessor.time_timedelta_days                                                                                                     1.16±0ms
[99.86%] ··· timedelta.DatetimeAccessor.time_timedelta_microseconds                                                                                             1.17±0ms
[99.88%] ··· timedelta.DatetimeAccessor.time_timedelta_nanoseconds                                                                                              1.16±0ms
[99.90%] ··· timedelta.DatetimeAccessor.time_timedelta_seconds                                                                                               1.14±0.01ms
[99.93%] ··· Setting up tslibs.timedelta:55                                                                                                                           ok
[99.93%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_days                                                                                            167±2ns
[99.95%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_microseconds                                                                                    165±1ns
[99.98%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_nanoseconds                                                                                   167±0.6ns
[100.00%] ··· tslibs.timedelta.TimedeltaProperties.time_timedelta_seconds                                                                                         166±1ns

@simonjayhawkins (Member)

@jbrockmendel wdyt here?

The underlying issue this is fixing has been around since at least pandas-0.25.3 (the oldest I checked, using your code sample in #46848 (comment)).

There have been issues with np.empty_like before, but this current issue has only surfaced for DataFrame printing due to the __setitem__ change in 1.4.

The fix here is probably the right place (if fixing in pandas), but the take code is, I assume, performance critical, although benchmarks haven't turned anything up. (Do you know if the 2d take path is heavily used?)

@jbrockmendel (Member)

The fix here is probably the right place (if fixing in pandas) but the take code,

Agreed this is the best option currently available. Maybe @scoder can weigh in on future possibilities.

I assume, is performance critical, although benchmarks haven't turned anything up. (do you know if the 2d take path is heavily used?)

It's object-dtype, so I'm inclined to just not worry about it.

@roberthdevries roberthdevries force-pushed the 46848-fix-segfault-when-printing-dataframe branch from 1bb002b to 450523f Compare June 23, 2022 13:28
@roberthdevries roberthdevries marked this pull request as ready for review June 23, 2022 13:28
@seberg (Contributor) commented Jun 23, 2022

Btw. I have opened a cython PR that should fix this: cython/cython#4859

Even if we say that the NumPy fix is the right path, that might be the best way forward to avoid lingering issues until we can rely on a "corrected" NumPy.

@github-actions (bot)

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 24, 2022
@simonjayhawkins (Member)

Btw. I have opened a cython PR that should fix this: cython/cython#4859

That's great. Thanks @seberg

I've not yet tested the fix, but now that it is in a released Cython, we may be able to close the issue by just merging the test from this PR.

Probably best not to add the release note, just in case we have issues at wheel build time, but we can certainly backport the test to 1.4.x.

@seberg (Contributor) commented Aug 3, 2022

Yeah, now that Cython has a new release, the fix is to pin that version and only add the test, if anything. I guess there might be a few other places where hacks are in place that can now be removed.

@simonjayhawkins simonjayhawkins mentioned this pull request Aug 5, 2022
2 tasks
@simonjayhawkins (Member)

Thanks @roberthdevries for the PR. Closing in favor of #47979.

Tested with @jbrockmendel's minimal example and Cython 0.29.32.

Labels
Output-Formatting __repr__ of pandas objects, to_string Regression Functionality that used to work in a prior pandas version
Development

Successfully merging this pull request may close these issues.

BUG: Segfault when printing dataframe
6 participants