
Groupby and shift causing coredump #13813

Closed
mynci opened this issue Jul 27, 2016 · 3 comments
mynci commented Jul 27, 2016

Hi,

I have reduced this to the smallest example that exhibits the issue, rather than a realistic use case. The problem is that the operation causes Python to core dump.

In the original case where I discovered this, the core dump only occurred sometimes (and when I put it in a loop it occurred on different iterations). The code below seems to core dump on the third iteration every time I have run it.

Code example:

import os
import pandas as pd

df = pd.read_csv(os.path.join(os.getcwd(), 'error_report.txt'), sep='\t')

for i in range(0, 1000):
    print "Pre shift {}".format(i)
    df['shift_F'] = df.groupby(['B', 'C'])['F'].shift(-1)
    print "Post shift {}".format(i)

With the attached data file (tab-separated); the code assumes it is in the same directory:
error_report.txt

This is the output I get:

python pandas_test.py 
Pre shift 0
Post shift 0
Pre shift 1
Post shift 1
Pre shift 2
Segmentation fault (core dumped)

If I modify the code to use apply and add the shifted column inside the apply function, there is no error. Similarly, if I use .shift(0) I do not get the error.
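The apply-based workaround can be sketched as follows. This is a minimal sketch with made-up data; only the column names B, C, and F are taken from the example above:

```python
import pandas as pd

# Hypothetical data; only the column names match the report's example.
df = pd.DataFrame({
    'B': [1, 1, 2, 2],
    'C': ['x', 'x', 'y', 'y'],
    'F': [10.0, 20.0, 30.0, 40.0],
})

def add_shift(g):
    # Copy so the group frame passed in by groupby is not mutated.
    g = g.copy()
    g['shift_F'] = g['F'].shift(-1)
    return g

# group_keys=False keeps the original index, so the result aligns
# back onto the frame by index.
out = df.groupby(['B', 'C'], group_keys=False)[['F']].apply(add_shift)
df['shift_F'] = out['shift_F']
```

Doing the shift inside apply runs a plain Series.shift per group, which sidesteps the grouped-shift fast path where the crash occurs.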

Version info:

>>> pandas.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.18.1
nose: 1.3.1
pip: 8.1.2
setuptools: 20.2.2
Cython: None
numpy: 1.11.1
scipy: 0.13.3
statsmodels: 0.5.0
xarray: None
IPython: 1.2.1
sphinx: None
patsy: 0.2.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.6.1
matplotlib: 1.3.1
openpyxl: 1.7.0
xlrd: 0.9.2
xlwt: 0.7.5
xlsxwriter: None
lxml: None
bs4: 4.2.1
html5lib: 0.999
httplib2: 0.8
apiclient: None
sqlalchemy: None
pymysql: 0.7.2.None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None

Regards
Stephen

jreback commented Jul 27, 2016

ok, will take a volunteer to debug this.

ivannz commented Jul 27, 2016

Using exception stack traces I managed to pinpoint the problem. I believe it is within the group_shift_indexer procedure.

When I reintroduced the Cython array boundary check (@cython.boundscheck(True)), your use case did not crash but instead raised a boundary-violation error. The core of the problem is that the labels array, obtained from the groupby's grouper property, may contain so-called null keys (value -1) in addition to the proper integer-coded group labels.

Lines L1358-L1359 do not check for this corner case. When I apply this patch:

...
                lab = labels[ii]

                # Skip null keys
                if lab == -1:
                    continue

                label_seen[lab] += 1
...

the problem goes away.

ivannz commented Jul 27, 2016

-1 values occur in the labels array when a groupby-key contains a missing value.
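The sentinel can be seen directly with pd.factorize, which performs the same integer coding of keys that the grouper uses (a small illustration, not taken from the original report):

```python
import numpy as np
import pandas as pd

# A groupby key containing a missing value: factorize() encodes NaN
# with the sentinel -1 -- the "null key" the patch above skips.
key = pd.Series(['a', np.nan, 'b', 'a'])
codes, uniques = pd.factorize(key)
# codes -> [0, -1, 1, 0]; uniques -> ['a', 'b']
```

groupby silently drops the rows with null keys from its output, but the Cython shift routine still iterates over the full labels array, so it must skip the -1 entries explicitly.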
