
corrupted data and segfault in groupby cumsum/cumprod/cummin/cummax with absent categories #16771

Closed
adbull opened this issue Jun 26, 2017 · 17 comments


@adbull (Contributor) commented Jun 26, 2017

Code Sample, a copy-pastable example if possible

Requires the contents of bug.pkl.zip

>>> import pandas as pd
>>> pd.read_pickle('bug.pkl').groupby('x').y.cummax()
*** Error in `python': double free or corruption (out): 0x00000000035784c0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x7570b)[0x7f86d533f70b]
/lib64/libc.so.6(+0x7deaa)[0x7f86d5347eaa]
/lib64/libc.so.6(cfree+0x4c)[0x7f86d534b40c]
/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so(+0x5c2ed)[0x7f86ce5f12ed]
/lib/python3.6/site-packages/numpy/core/multiarray.cpython-36m-x86_64-linux-gnu.so(+0x2013e)[0x7f86ce5b513e]
/lib/python3.6/site-packages/pandas/_libs/groupby.cpython-36m-x86_64-linux-gnu.so(+0x69a76)[0x7f86bb8bba76]
/lib/python3.6/site-packages/pandas/_libs/groupby.cpython-36m-x86_64-linux-gnu.so(+0x6ad39)[0x7f86bb8bcd39]
...

Problem description

Calling groupby().cummax() on the attached dataframe in a new Python process results in a segfault on my machine. I'm not sure why this dataframe specifically triggers it; I couldn't find a simple test case that reproduced it. I'm also not sure whether it's a numpy or pandas issue, or how machine-specific it is.

Expected Output

Not a segfault.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.10-100.fc24.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C
LANG: C
LOCALE: None.None

pandas: 0.20.2
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.0
scipy: 0.19.0
xarray: None
IPython: 4.2.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 1.5.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@adbull (Contributor, Author) commented Jun 26, 2017

This appears to be a regression in v0.20, when the groupby().cummax() implementation moved to cython. Storing the dataframe in HDF5 for compatibility across pandas versions, I can reproduce the same issue in v0.20.x, but not in v0.19.2.

@jreback (Contributor) commented Jun 26, 2017

Please show the actual frame, with its construction and df.info(), exactly as it looks, even if it doesn't reproduce the issue.

@adbull (Contributor, Author) commented Jun 26, 2017

See below for a simpler test case. The issue appears to arise when grouping by categories that don't appear in the frame.

import pandas as pd
import numpy as np

x_vals = np.arange(2) + 2
x_cats = np.arange(4)
y = np.arange(2.0)
df = pd.DataFrame(dict(x=pd.Categorical(x_vals, x_cats), y=y))

df.groupby('x').y.cummax()
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 2 columns):
x    2 non-null category
y    2 non-null float64
dtypes: category(1), float64(1)
memory usage: 290.0 bytes

@jreback (Contributor) commented Jun 26, 2017

Yeah, it shouldn't segfault, but this is also not implemented (shouldn't be hard, though).

@adbull adbull changed the title segfault in groupby().cummax() segfault in groupby().cummax() with absent categories Jun 26, 2017

@chris-b1 chris-b1 added this to the 0.20.3 milestone Jun 26, 2017

@jreback jreback modified the milestones: 0.20.3, 0.21.0 Jul 6, 2017

@jreback jreback modified the milestones: 0.21.0, Next Major Release Sep 23, 2017

@adbull (Contributor, Author) commented Sep 26, 2017

Note that the same test case also fails on cumsum() -- the problem might affect all grouped cumulative operations.

@adbull adbull changed the title segfault in groupby().cummax() with absent categories segfault in groupby().cummax() and groupby().cumsum() with absent categories Sep 26, 2017

@mroeschke (Member) commented Feb 8, 2019

Works on master now. Could use a test

In [4]: pd.__version__
Out[4]: '0.25.0.dev0+85.g0eddba883'

In [5]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: x_vals = np.arange(2) + 2
   ...: x_cats = np.arange(4)
   ...: y = np.arange(2.0)
   ...: df = pd.DataFrame(dict(x=pd.Categorical(x_vals, x_cats), y=y))
   ...:
   ...: df.groupby('x').y.cumsum()
Out[5]:
0    1.730601e-77
1    1.000000e+00
Name: y, dtype: float64

@TrigonaMinima (Contributor) commented Feb 13, 2019

@mroeschke
When I run your code it completes without a segfault, but when the process (the Python interpreter, or a script) exits, it ends with a segfault. Even the original code segment with bug.pkl gives a segfault when executed as

import pandas as pd
a = pd.read_pickle('bug.pkl')
a.groupby('x').y.cummax()

My pandas version is 0.24.0. I'll try the master version tonight and see if the issue persists.

@TrigonaMinima (Contributor) commented Feb 13, 2019

So I ran the above code on the current master. The pandas version is:

>>> pd.__version__
'0.25.0.dev0+113.gb8306f19d'

The interpreter crashed with the following error:

invalid fastbin entry (free)
Aborted

@mroeschke (Member) commented Feb 13, 2019

Which interpreter are you using? It works fine in IPython:

Python 3.6.7 |Anaconda, Inc.| (default, Oct 23 2018, 14:01:38)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.1.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: pd.__version__
Out[1]: '0.25.0.dev0+118.gb4913dc92'

In [2]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: x_vals = np.arange(2) + 2
   ...: x_cats = np.arange(4)
   ...: y = np.arange(2.0)
   ...: df = pd.DataFrame(dict(x=pd.Categorical(x_vals, x_cats), y=y))
   ...:
   ...: df.groupby('x').y.cumsum()
Out[2]:
0    9.881313e-324
1     1.000000e+00
Name: y, dtype: float64

@TrigonaMinima (Contributor) commented Feb 13, 2019

Here's the complete thing.

$ python
Python 3.7.2 (default, Dec 29 2018, 06:19:36) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import pandas as pd
>>> a = pd.read_pickle('~/Downloads/bug.pkl')
>>> a.groupby('x').y.cummax()
invalid fastbin entry (free)
Aborted

Same thing when I tried your code

$ python
Python 3.7.2 (default, Dec 29 2018, 06:19:36) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
>>> import pandas as pd
>>> import numpy as np
>>> 
>>> x_vals = np.arange(2) + 2
>>> x_cats = np.arange(4)
>>> y = np.arange(2.0)
>>> df = pd.DataFrame(dict(x=pd.Categorical(x_vals, x_cats), y=y))
>>> 
>>> df.groupby('x').y.cumsum()
invalid fastbin entry (free)
Aborted

@mroeschke (Member) commented Feb 14, 2019

Maybe a 3.6 vs 3.7 issue? Could you create a 3.6 environment and see whether this still segfaults?

@TrigonaMinima (Contributor) commented Feb 14, 2019

Python 3.6 gives a similar issue. The different segfault errors I got:

invalid fastbin entry (free)
aborted

or

malloc(): memory corruption
aborted

or

malloc_consolidate(): invalid chunk size
Aborted

or

corrupted size vs. prev_size
Aborted

Python version

Python 3.6.0 |Continuum Analytics, Inc.| (default, Dec 23 2016, 12:22:00) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux

@mroeschke (Member) commented Feb 14, 2019

Okay, this may still be segfaulting on Linux then (I'm running OSX).

@TrigonaMinima (Contributor) commented Feb 14, 2019

Any idea which parts of the source code I should look into for this issue?

@mroeschke (Member) commented Feb 14, 2019

def group_cummax(groupby_t[:, :] out,

def group_cummin(groupby_t[:, :] out,

@TrigonaMinima (Contributor) commented Feb 15, 2019

I don't understand what change I have to make here.

  1. The code doesn't segfault every time.
  2. values and labels only receive what they should, i.e. the data without the missing categories. What check should I write here?
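
It may be that values and labels contain only the present data, but with unused categories a group code in labels can be >= len(values), so any scratch array allocated with len(values) rows is too small. A small diagnostic illustrating this (a sketch mirroring the earlier test case, not code from pandas):

```python
import numpy as np
import pandas as pd

# Same construction as the earlier test case: 2 rows, 4 categories.
x_vals = np.arange(2) + 2
x_cats = np.arange(4)
cat = pd.Categorical(x_vals, categories=x_cats)

print(len(cat))    # 2 rows
print(cat.codes)   # [2 3] -- the group codes used as labels
# An accumulator allocated with len(values) == 2 rows only has valid
# indices 0 and 1, so writing to accum[code] for codes 2 and 3 is an
# out-of-bounds write: heap corruption, and sometimes a segfault.
```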

@adbull (Contributor, Author) commented Apr 18, 2019

The issue is caused by various Cython helper functions incorrectly allocating temporary arrays with shape (rows, columns) when they should have shape (groups, columns). This fails when groups > rows, which is possible for categorical groupers with unused categories.

The bug can also corrupt data silently rather than segfault, e.g.:

import pandas as pd

x_vals = [1]
x_cats = range(2)
y = [1]
df = pd.DataFrame(dict(x=pd.Categorical(x_vals, x_cats), y=y))

print(df.groupby('x').y.cumsum())

This should print 1, but instead it prints a random value, at least on Linux.
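
The allocation fix can be sketched in pure Python/NumPy (a hypothetical reimplementation for illustration only, not the actual Cython code): size the accumulator by the number of groups, not by the number of input rows.

```python
import numpy as np
import pandas as pd

def grouped_cumsum(values, codes, ngroups):
    """Cumulative sum within each group, given integer group codes.

    The accumulator must have one slot per *group* (ngroups), not one
    per input value: sizing it as len(values) is the bug described
    above, since ngroups > len(values) is possible when a categorical
    grouper has unused categories.
    """
    accum = np.zeros(ngroups, dtype=float)   # shape (ngroups,) -- the fix
    out = np.empty(len(values), dtype=float)
    for i, (code, val) in enumerate(zip(codes, values)):
        accum[code] += val
        out[i] = accum[code]
    return out

# The one-row frame from the example above: 1 row, 2 categories.
df = pd.DataFrame(dict(x=pd.Categorical([1], categories=range(2)), y=[1]))
codes = df['x'].cat.codes.to_numpy()      # [1]
ngroups = len(df['x'].cat.categories)     # 2, which is > len(df) == 1
print(grouped_cumsum(df['y'].to_numpy(), codes, ngroups))  # [1.]
```

With a buffer of len(df) == 1 rows, the write to slot 1 would be out of bounds; sized by ngroups == 2, it prints the expected 1.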

@adbull adbull changed the title segfault in groupby().cummax() and groupby().cumsum() with absent categories corrupted data and segfault in groupby cumsum/cumprod/cummin/cummax with absent categories Apr 18, 2019

adbull pushed a commit to adbull/pandas that referenced this issue Apr 18, 2019

@jreback jreback modified the milestones: Contributions Welcome, 0.25.0 Apr 19, 2019

adbull pushed a commit to adbull/pandas that referenced this issue Apr 19, 2019

adbull pushed a commit to adbull/pandas that referenced this issue Apr 20, 2019

jreback added a commit that referenced this issue Apr 20, 2019
