BUG: segfault with .loc indexing #5553

Closed
TomAugspurger opened this issue Nov 20, 2013 · 11 comments · Fixed by #5555
Labels: Bug, Indexing (related to indexing on series/frames, not to indexes themselves)

@TomAugspurger
Contributor

I found that segfault that messed up my HDFStore earlier.

You'll need two csv files:

https://www.dropbox.com/s/55gcpkp3b8byfka/df1.csv
https://www.dropbox.com/s/mqqhfgocdg4qggp/df2.csv

To reproduce:

In [1]: df1 = pd.read_csv('df1.csv', index_col=[0,1,2])

In [2]: df2 = pd.read_csv('df2.csv', index_col=[0,1,2])

In [3]: df2.loc[df1.index]
Segmentation fault: 11
(pandas-dev)dhcp80fff527:Desktop tom$ 

I've never debugged a segfault before, but I gather there are some methods. Can anyone reproduce this?
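On debugging the segfault itself: one low-effort option on Python 3.3+ is the standard-library `faulthandler` module, which dumps the Python-level traceback when the interpreter receives a fatal signal such as SIGSEGV. The sketch below is not from this thread (which used Python 2.7 and gdb); the helper name `enable_crash_traceback` and the log path are made up for illustration.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks to log_path if the interpreter crashes.

    Hypothetical helper: the name and the log-file approach are
    illustrative, not part of pandas or of this issue.
    """
    # faulthandler writes through the file descriptor, so the file must
    # stay open for as long as the handler is enabled. Return it so the
    # caller can keep a reference.
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Call it once at the top of the script, before the `.loc` call. Running the script as `python -X faulthandler seg.py` enables the same handler (writing to stderr) without any code changes.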

Also, that's a valid use of .loc right?

In [1]: pd.__version__
Out[1]: '0.12.0-1084-ged64416'

In [2]: np.__version__
Out[2]: '1.7.1'

@TomAugspurger
Contributor Author

Interesting... maybe...

(gdb) bt
#0  0x0000000103408e07 in __pyx_pw_6pandas_5algos_333take_2d_axis1_float64_float64
(__pyx_self=0x1, __pyx_args=<value temporarily unavailable, due to optimizations>,
__pyx_kwds=<value temporarily unavailable, due to optimizations>) at pandas/algos.c:81901

...

@TomAugspurger
Contributor Author

A potential hint: my df2.index is not unique. I'm doing the same operation on a bunch of files; for this one I have to mess with the index (all of the other indices are unique and none of those have caused any problems).

@TomAugspurger
Contributor Author

Dropping the duplicates fixed the problem:

In [34]: dupes = df2.index.get_duplicates()

In [35]: df3 = df2.drop(dupes)

In [36]: df3.loc[df1.index]
Out[36]: (not a segfault)
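A sketch of the same workaround for newer pandas, assuming `Index.get_duplicates` is unavailable there (it was deprecated and later removed, as far as I know). The small `df2` below is a made-up stand-in for the real file; `Index.duplicated(keep=False)` flags every occurrence of a repeated key, which matches what `df2.drop(dupes)` does above.

```python
import pandas as pd

# Hypothetical stand-in for df2: a three-level MultiIndex with one repeated key.
idx = pd.MultiIndex.from_tuples(
    [(1, 1, 1.0), (1, 1, 1.0), (1, 1, 2.0), (2, 1, 1.0)],
    names=["id", "a", "b"],
)
df2 = pd.DataFrame({"value": [10, 11, 12, 13]}, index=idx)

# True for every row whose index key occurs more than once.
is_dupe = df2.index.duplicated(keep=False)

# Drop every occurrence of a duplicated key, leaving a unique index.
df3 = df2[~is_dupe]
```

Passing `keep="first"` instead would keep one row per duplicated key rather than dropping them all; which is right depends on what the duplicates mean in the data.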

@jreback
Contributor

jreback commented Nov 20, 2013

if one of these is not unique, this probably won't work

could do this via inner join I think
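A minimal sketch of the inner-join approach, using small made-up frames in place of the real `df1` and `df2` (the level names and the columns `x` and `y` are illustrative). Joining on the index keeps only keys present in both frames, and a duplicated key in `df2` simply produces repeated rows.

```python
import pandas as pd

names = ["id", "line"]

# Hypothetical df1: unique MultiIndex.
df1 = pd.DataFrame(
    {"x": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples([(1, 1), (2, 1), (3, 1)], names=names),
)

# Hypothetical df2: non-unique MultiIndex, key (1, 1) appears twice.
df2 = pd.DataFrame(
    {"y": [100, 101, 300, 400]},
    index=pd.MultiIndex.from_tuples(
        [(1, 1), (1, 1), (3, 1), (4, 1)], names=names
    ),
)

# Inner join on the index: only keys found in both frames survive.
matched = df1.join(df2, how="inner")
```

Here `matched` has three rows: two for the duplicated key `(1, 1)` and one for `(3, 1)`. Keys that exist in only one frame are dropped instead of showing up as NaN rows.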

@TomAugspurger
Contributor Author

I probably should have been doing that originally.

Would it be a big perf hit to check for uniqueness here?


@jreback
Contributor

jreback commented Nov 20, 2013

The unique check is cheap; loc dispatches based on it, in fact.

Non-unique loc returns duplicate matches in the order they are found. It's possible the MultiIndex blows it up if it's not sorted (just a guess at why it's faulting). Can you step through and see where the fault happens? Use pdb before the loc and step through.
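To illustrate the two points above with a tiny made-up frame (not the data from this issue): `Index.is_unique` is the cheap check, and on a non-unique index a list passed to `.loc` returns every match for each label, in the order the labels were requested.

```python
import pandas as pd

# Illustrative frame with a duplicated label "A".
df = pd.DataFrame({"test": [5, 7, 9, 11]}, index=["A", "A", "B", "C"])

# Cheap check that tells the unique and non-unique cases apart.
unique = df.index.is_unique

# All requested labels exist; "A" matches two rows, returned in the
# order they appear in the frame.
rows = df.loc[["B", "A"]]
```

In this sketch `unique` is `False` and `rows` contains one `B` row followed by both `A` rows.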

@TomAugspurger
Contributor Author

> /Users/tom/Desktop/seg.py(6)<module>()
      5 import ipdb; ipdb.set_trace()
----> 6 df2.loc[df1.index]
      7 

ipdb> s
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(924)_indexer()
    923 
--> 924         def _indexer(self):
    925             if getattr(self, iname, None) is None:

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(925)_indexer()
    924         def _indexer(self):
--> 925             if getattr(self, iname, None) is None:
    926                 setattr(self, iname, indexer(self, name))

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(926)_indexer()
    925             if getattr(self, iname, None) is None:
--> 926                 setattr(self, iname, indexer(self, name))
    927             return getattr(self, iname)

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(927)_indexer()
    926                 setattr(self, iname, indexer(self, name))
--> 927             return getattr(self, iname)
    928 

ipdb> n
--Return--
<pandas....04850fd0>
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(927)_indexer()
    926                 setattr(self, iname, indexer(self, name))
--> 927             return getattr(self, iname)
    928 

ipdb> n
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(961)__getitem__()
    960 
--> 961     def __getitem__(self, key):
    962         if type(key) is tuple:

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(962)__getitem__()
    961     def __getitem__(self, key):
--> 962         if type(key) is tuple:
    963             return self._getitem_tuple(key)

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(965)__getitem__()
    964         else:
--> 965             return self._getitem_axis(key, axis=0)
    966 

fish: Job 1, 'python seg.py' terminated by signal SIGSEGV (Address boundary error)

Stepping into that last one:

ipdb> l 965
    960 
    961     def __getitem__(self, key):
    962         if type(key) is tuple:
    963             return self._getitem_tuple(key)
    964         else:
--> 965             return self._getitem_axis(key, axis=0)
    966 
    967     def _getitem_axis(self, key, axis=0):
    968         raise NotImplementedError()
    969 
    970     def _getbool_axis(self, key, axis=0):

ipdb> s
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1058)_getitem_axis()
   1057 
-> 1058     def _getitem_axis(self, key, axis=0):
   1059         labels = self.obj._get_axis(axis)

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1059)_getitem_axis()
   1058     def _getitem_axis(self, key, axis=0):
-> 1059         labels = self.obj._get_axis(axis)
   1060 

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1061)_getitem_axis()
   1060 
-> 1061         if isinstance(key, slice):
   1062             self._has_valid_type(key,axis)

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1064)_getitem_axis()
   1063             return self._get_slice_axis(key, axis=axis)
-> 1064         elif com._is_bool_indexer(key):
   1065             return self._getbool_axis(key, axis=axis)

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1066)_getitem_axis()
   1065             return self._getbool_axis(key, axis=axis)
-> 1066         elif _is_list_like(key) and not (isinstance(key, tuple) and
   1067                                          isinstance(labels, MultiIndex)):

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1069)_getitem_axis()
   1068 
-> 1069             if hasattr(key, 'ndim') and key.ndim > 1:
   1070                 raise ValueError('Cannot index with multidimensional key')

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1072)_getitem_axis()
   1071 
-> 1072             return self._getitem_iterable(key, axis=axis)
   1073         else:

ipdb> n
fish: Job 1, 'python seg.py' terminated by signal SIGSEGV (Address boundary error)

A ways further along:

ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(806)_getitem_iterable()
    805 
--> 806                     result = result._reindex_with_indexers({ axis : [ new_labels, new_indexer ] }, copy=True, allow_dups=True)
    807 

ipdb> n
fish: Job 1, 'python seg.py' terminated by signal SIGSEGV (Address boundary error)

@TomAugspurger
Contributor Author

a bit further:

> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(806)_getitem_iterable()
    805 
--> 806                     result = result._reindex_with_indexers({ axis : [ new_labels, new_indexer ] }, copy=True, allow_dups=True)
    807 

ipdb> s
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(1400)_reindex_with_indexers()
   1399 
-> 1400     def _reindex_with_indexers(self, reindexers, method=None, fill_value=np.nan, limit=None, copy=False, allow_dups=False):
   1401         """ allow_dups indicates an internal call here """

ipdb> args
self = <class 'pandas.core.frame.DataFrame'>
MultiIndex: 13343 entries, (61002006171, 1, 1.0) to (999990398540913, 1, 2.0)
Data columns (total 71 columns):
PUHROFF2     504  non-null values
PWSSWGT      13343  non-null values
PUHROFF1     5838  non-null values
PEERNHRO     0  non-null values
PEERNHRY     0  non-null values
PENLFRET     183  non-null values
PROLDRRP     13343  non-null values
HRLONGLK     13343  non-null values
PTDTRACE     13343  non-null values
PRFAMREL     13343  non-null values
PUWK         10729  non-null values
PUERN2       0  non-null values
PUIODP1      5978  non-null values
PUIODP2      5769  non-null values
PUIODP3      5713  non-null values
PWCMPWGT     13343  non-null values
PEMLR        10766  non-null values
PESEX        13343  non-null values
PREMPNOT     10766  non-null values
PRHRUSL      6638  non-null values
PEHRFTPT     601  non-null values
PRDTIND1     6889  non-null values
PUDIS2       19  non-null values
PUDIS1       70  non-null values
PRTAGE       13343  non-null values
PWORWGT      13343  non-null values
PEHRWANT     1195  non-null values
PEERNLAB     0  non-null values
PEEDUCA      10800  non-null values
PEHRACT2     380  non-null values
GEREG        13343  non-null values
PUERNH1C     0  non-null values
HWHHWGT      13343  non-null values
GTCBSA       13343  non-null values
PUCHINHH     13343  non-null values
HRMONTH      13343  non-null values
HRMIS        13343  non-null values
GESTFIPS     13343  non-null values
HEFAMINC     13343  non-null values
PUDIS        420  non-null values
PUHROT1      5838  non-null values
PEHRACT1     6450  non-null values
PEHRUSL2     383  non-null values
PEHRUSL1     6638  non-null values
HRINTSTA     13343  non-null values
PUABSOT      1962  non-null values
PEERN        0  non-null values
GTINDVPC     13343  non-null values
PUHROT2      492  non-null values
PULAY        1653  non-null values
PEERNPER     0  non-null values
HUFINAL      13343  non-null values
PTHR         13343  non-null values
PRDISC       97  non-null values
PTWK         13343  non-null values
PWFMWGT      13343  non-null values
GESTCEN      13343  non-null values
PERET1       1935  non-null values
PURETOT      1850  non-null values
PEMJOT       6638  non-null values
PWLGWGT      13343  non-null values
HRYEAR4      13343  non-null values
PEHRUSLT     6638  non-null values
PEMARITL     10800  non-null values
PRPERTYP     13343  non-null values
PEHRACTT     6450  non-null values
PRERNWA      0  non-null values
PTOT         13343  non-null values
PEMJNUM      383  non-null values
PELKDUR      251  non-null values
timestamp    13343  non-null values
dtypes: float64(56), int64(14), object(1)
reindexers = {0: [array([(61002006171, 1, 1.0), (61002006171, 1, 2.0), (61002006171, 1, 3.0),
       ..., None, (999990398540913, 1, 1.0), (999990398540913, 1, 2.0)], dtype=object), array([    0,     1,     2, ...,    -1, 19716, 19717])]}
method = None
fill_value = nan
limit = None
copy = True
allow_dups = True

I'll take a closer look tomorrow. It's gotten down to pandas/core/internals.py(3015)reindex_indexer()

@jreback
Contributor

jreback commented Nov 20, 2013

@TomAugspurger can you post the file links again? I think they are the same

@TomAugspurger
Contributor Author

Sorry about that. Here's the other one. I'll edit the original post.

https://www.dropbox.com/s/mqqhfgocdg4qggp/df2.csv

@jreback
Contributor

jreback commented Nov 20, 2013

Fixed in #5555. I didn't have the correct test case, which is basically selecting a unique index from a non-unique index where lots of the elements are not found.

In [1]:  df = DataFrame({'test': [5,7,9,11], 'test1': [4.,5,6,7], 'other': list('abcd') }, index=['A', 'A', 'B', 'C'])

In [2]: rows = ['F','G','H','C','B','E']

In [3]: df.loc[rows]
Out[3]: 
  other  test  test1
F   NaN   NaN    NaN
G   NaN   NaN    NaN
H   NaN   NaN    NaN
C     d    11      7
B     c     9      6
E   NaN   NaN    NaN
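Note for readers on newer pandas: as far as I know, since pandas 1.0 passing a list with missing labels to `.loc` raises `KeyError` instead of filling NaN rows, so the output above will not reproduce there. A hedged sketch of one way to select only the labels that exist, reusing the frame from the test case (the names `raised`, `present` and `found` are just for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"test": [5, 7, 9, 11], "test1": [4.0, 5, 6, 7], "other": list("abcd")},
    index=["A", "A", "B", "C"],
)
rows = ["F", "G", "H", "C", "B", "E"]

# On recent pandas this is expected to raise KeyError because some
# labels are missing; on the version in this issue it returned NaN rows.
try:
    df.loc[rows]
    raised = False
except KeyError:
    raised = True

# Keep only the requested labels that are actually in the index,
# preserving the requested order.
present = [label for label in rows if label in df.index]
found = df.loc[present]
```

`found` then holds just the `C` and `B` rows, in that order.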
