
Groupby and shift causing coredump #13813

Closed
mynci opened this issue Jul 27, 2016 · 3 comments
mynci commented Jul 27, 2016

Hi,

I have reduced this to the smallest example that exhibits the issue, rather than a realistic use case. The problem is that the operation causes Python to core dump.

In the original case where I discovered this, the core dump only occurred sometimes (and when I put it in a loop it occurred on different iterations). The code below seems to core dump on the third iteration every time I have run it.

Code example:

import os
import pandas as pd

df = pd.read_csv(os.path.join(os.getcwd(), 'error_report.txt'), sep='\t')

for i in range(0, 1000):
    print "Pre shift {}".format(i)
    df['shift_F'] = df.groupby(['B', 'C'])['F'].shift(-1)
    print "Post shift {}".format(i)

With the attached data file (tab-separated); the code assumes it is in the same directory:
error_report.txt

This is the output I get:

python pandas_test.py 
Pre shift 0
Post shift 0
Pre shift 1
Post shift 1
Pre shift 2
Segmentation fault (core dumped)

If I modify the code to use apply and add the shifted column inside the apply function, there is no error. Similarly, if I use .shift(0) I do not get the error.
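The apply-based workaround can be sketched as follows. This is a minimal sketch with made-up data; only the column names B, C, and F are taken from the example above:

```python
import pandas as pd

# Hypothetical data; only the column names match the report's example.
df = pd.DataFrame({
    'B': [1, 1, 2, 2],
    'C': ['x', 'x', 'y', 'y'],
    'F': [10.0, 20.0, 30.0, 40.0],
})

def add_shift(g):
    # Copy so the group frame passed in by groupby is not mutated.
    g = g.copy()
    g['shift_F'] = g['F'].shift(-1)
    return g

# group_keys=False keeps the original index, so the result aligns
# back onto the frame by index.
out = df.groupby(['B', 'C'], group_keys=False)[['F']].apply(add_shift)
df['shift_F'] = out['shift_F']
```

Doing the shift inside apply runs a plain Series.shift per group, which sidesteps the grouped-shift fast path where the crash occurs.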

Version info:

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.1
pip: 8.1.2
setuptools: 20.2.2
Cython: None
numpy: 1.11.1
scipy: 0.13.3
statsmodels: 0.5.0
xarray: None
IPython: 1.2.1
sphinx: None
patsy: 0.2.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.1
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: 0.7.2.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Regards
Stephen

jreback commented Jul 27, 2016

ok, will take a volunteer to debug this.

ivannz commented Jul 27, 2016

Using exception stack traces I managed to pinpoint the problem. I believe it is within the group_shift_indexer procedure.

When I reintroduced the Cython array boundary check (@cython.boundscheck(True)), your use case did not crash but instead raised a boundary-violation error. The core of the problem is that the labels array, obtained from the groupby's grouper property, may contain so-called null keys (value -1) in addition to the proper integer-coded group labels.

Lines L1358-L1359 do not check for this corner case. When I apply this patch:

...
                lab = labels[ii]

                # Skip null keys
                if lab == -1:
                    continue

                label_seen[lab] += 1
...

the problem goes away.

ivannz commented Jul 27, 2016

-1 values occur in the labels array when a groupby-key contains a missing value.
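The sentinel can be seen directly with pd.factorize, which performs the same integer coding of keys that the grouper uses (a small illustration, not taken from the original report):

```python
import numpy as np
import pandas as pd

# A groupby key containing a missing value: factorize() encodes NaN
# with the sentinel -1 -- the "null key" the patch above skips.
key = pd.Series(['a', np.nan, 'b', 'a'])
codes, uniques = pd.factorize(key)
# codes -> [0, -1, 1, 0]; uniques -> ['a', 'b']
```

groupby silently drops the rows with null keys from its output, but the Cython shift routine still iterates over the full labels array, so it must skip the -1 entries explicitly.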
