BUG: segfault with .loc indexing #5553
Comments
Interesting... maybe...
|
A potential hint: my df2.index is not unique. I'm doing the same operation on a bunch of files; for this one I have to mess with the index (all of the other indices are unique and none of those have caused any problems). |
Dropping the duplicates fixed the problem:
In [34]: dupes = df2.index.get_duplicates()
In [35]: df3 = df2.drop(dupes)
In [36]: df3.loc[df1.index]
Out[36]: (not a segfault) |
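For readers on a pandas version that no longer ships `Index.get_duplicates`, the same workaround can be sketched with `Index.duplicated`. The tiny frames below are hypothetical stand-ins for df1 and df2 (the real ones come from the CSV files and use a MultiIndex), and the lookup goes through `Index.intersection` on the assumption that labels absent from df3 should simply be skipped.

```python
import pandas as pd

# Hypothetical stand-ins for df1 and df2; not the data from the issue.
df1 = pd.DataFrame({"y": [10, 20]}, index=pd.Index(["a", "b"], name="key"))
df2 = pd.DataFrame(
    {"x": [0, 1, 2, 3]},
    index=pd.Index(["a", "b", "b", "c"], name="key"),
)

# Mark every row whose label occurs more than once (keep=False flags all copies).
dupe_mask = df2.index.duplicated(keep=False)

# Keep only rows whose label appears exactly once, like df2.drop(dupes).
df3 = df2[~dupe_mask]

# Look up only the labels that exist in both indexes, so nothing is "not found".
common = df3.index.intersection(df1.index)
result = df3.loc[common]
print(result)
```

Note that this discards every copy of a duplicated label, which matches `df2.drop(dupes)` but may not be what you want if one of the copies should survive.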
If one of these is not unique, this probably won't work. You could do this via an inner join, I think. |
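A minimal sketch of the inner-join idea, using hypothetical toy frames rather than the data from the issue. It assumes the goal is "rows of df2 whose labels also appear in df1", and uses `DataFrame.join` with `how="inner"`.

```python
import pandas as pd

# Hypothetical stand-ins: df2 has a duplicated label "b", df1 has a label "z"
# that does not exist in df2.
df1 = pd.DataFrame({"y": [10, 20, 30]}, index=["a", "b", "z"])
df2 = pd.DataFrame({"x": [0, 1, 2, 3]}, index=["a", "b", "b", "c"])

# Inner join on the index: only labels present in both frames survive.
# A duplicated label in df2 yields one output row per matching row.
joined = df2.join(df1, how="inner")
print(joined)
```

Unlike label-based selection, the join never has to represent a "not found" label, which is the case that triggered the crash here.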
I probably should have been doing that originally. Would it be a big perf hit to check for uniqueness here? |
The unique check is cheap. Non-unique loc returns duplicate matches in the order in which they are found. |
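Both points can be illustrated on a toy frame. This is only a sketch on hypothetical data, written against a current pandas API rather than the 0.12 development build discussed in this thread.

```python
import pandas as pd

# Hypothetical frame with a duplicated label "a".
df = pd.DataFrame({"x": [0, 1, 2]}, index=["a", "b", "a"])

# The uniqueness check is a property lookup on the index.
if not df.index.is_unique:
    # Labels that occur more than once, each listed a single time.
    dupes = df.index[df.index.duplicated()].unique()
    print("duplicated labels:", list(dupes))

# With a non-unique index, .loc returns every match for each requested label.
matches = df.loc[["a", "b"]]
print(matches)
```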
> /Users/tom/Desktop/seg.py(6)<module>()
5 import ipdb; ipdb.set_trace()
----> 6 df2.loc[df1.index]
7
ipdb> s
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(924)_indexer()
923
--> 924 def _indexer(self):
925 if getattr(self, iname, None) is None:
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(925)_indexer()
924 def _indexer(self):
--> 925 if getattr(self, iname, None) is None:
926 setattr(self, iname, indexer(self, name))
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(926)_indexer()
925 if getattr(self, iname, None) is None:
--> 926 setattr(self, iname, indexer(self, name))
927 return getattr(self, iname)
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(927)_indexer()
926 setattr(self, iname, indexer(self, name))
--> 927 return getattr(self, iname)
928
ipdb> n
--Return--
<pandas....04850fd0>
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(927)_indexer()
926 setattr(self, iname, indexer(self, name))
--> 927 return getattr(self, iname)
928
ipdb> n
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(961)__getitem__()
960
--> 961 def __getitem__(self, key):
962 if type(key) is tuple:
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(962)__getitem__()
961 def __getitem__(self, key):
--> 962 if type(key) is tuple:
963 return self._getitem_tuple(key)
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(965)__getitem__()
964 else:
--> 965 return self._getitem_axis(key, axis=0)
966
fish: Job 1, 'python seg.py' terminated by signal SIGSEGV (Address boundary error)
Stepping into that last one:
ipdb> l 965
960
961 def __getitem__(self, key):
962 if type(key) is tuple:
963 return self._getitem_tuple(key)
964 else:
--> 965 return self._getitem_axis(key, axis=0)
966
967 def _getitem_axis(self, key, axis=0):
968 raise NotImplementedError()
969
970 def _getbool_axis(self, key, axis=0):
ipdb> s
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1058)_getitem_axis()
1057
-> 1058 def _getitem_axis(self, key, axis=0):
1059 labels = self.obj._get_axis(axis)
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1059)_getitem_axis()
1058 def _getitem_axis(self, key, axis=0):
-> 1059 labels = self.obj._get_axis(axis)
1060
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1061)_getitem_axis()
1060
-> 1061 if isinstance(key, slice):
1062 self._has_valid_type(key,axis)
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1064)_getitem_axis()
1063 return self._get_slice_axis(key, axis=axis)
-> 1064 elif com._is_bool_indexer(key):
1065 return self._getbool_axis(key, axis=axis)
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1066)_getitem_axis()
1065 return self._getbool_axis(key, axis=axis)
-> 1066 elif _is_list_like(key) and not (isinstance(key, tuple) and
1067 isinstance(labels, MultiIndex)):
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1069)_getitem_axis()
1068
-> 1069 if hasattr(key, 'ndim') and key.ndim > 1:
1070 raise ValueError('Cannot index with multidimensional key')
ipdb> n
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(1072)_getitem_axis()
1071
-> 1072 return self._getitem_iterable(key, axis=axis)
1073 else:
ipdb> n
fish: Job 1, 'python seg.py' terminated by signal SIGSEGV (Address boundary error)
A ways further along:
|
A bit further:
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/indexing.py(806)_getitem_iterable()
805
--> 806 result = result._reindex_with_indexers({ axis : [ new_labels, new_indexer ] }, copy=True, allow_dups=True)
807
ipdb> s
--Call--
> /Users/tom/Envs/pandas-dev/lib/python2.7/site-packages/pandas-0.12.0_1084_ged64416-py2.7-macosx-10.8-x86_64.egg/pandas/core/generic.py(1400)_reindex_with_indexers()
1399
-> 1400 def _reindex_with_indexers(self, reindexers, method=None, fill_value=np.nan, limit=None, copy=False, allow_dups=False):
1401 """ allow_dups indicates an internal call here """
ipdb> args
self = <class 'pandas.core.frame.DataFrame'>
MultiIndex: 13343 entries, (61002006171, 1, 1.0) to (999990398540913, 1, 2.0)
Data columns (total 71 columns):
PUHROFF2 504 non-null values
PWSSWGT 13343 non-null values
PUHROFF1 5838 non-null values
PEERNHRO 0 non-null values
PEERNHRY 0 non-null values
PENLFRET 183 non-null values
PROLDRRP 13343 non-null values
HRLONGLK 13343 non-null values
PTDTRACE 13343 non-null values
PRFAMREL 13343 non-null values
PUWK 10729 non-null values
PUERN2 0 non-null values
PUIODP1 5978 non-null values
PUIODP2 5769 non-null values
PUIODP3 5713 non-null values
PWCMPWGT 13343 non-null values
PEMLR 10766 non-null values
PESEX 13343 non-null values
PREMPNOT 10766 non-null values
PRHRUSL 6638 non-null values
PEHRFTPT 601 non-null values
PRDTIND1 6889 non-null values
PUDIS2 19 non-null values
PUDIS1 70 non-null values
PRTAGE 13343 non-null values
PWORWGT 13343 non-null values
PEHRWANT 1195 non-null values
PEERNLAB 0 non-null values
PEEDUCA 10800 non-null values
PEHRACT2 380 non-null values
GEREG 13343 non-null values
PUERNH1C 0 non-null values
HWHHWGT 13343 non-null values
GTCBSA 13343 non-null values
PUCHINHH 13343 non-null values
HRMONTH 13343 non-null values
HRMIS 13343 non-null values
GESTFIPS 13343 non-null values
HEFAMINC 13343 non-null values
PUDIS 420 non-null values
PUHROT1 5838 non-null values
PEHRACT1 6450 non-null values
PEHRUSL2 383 non-null values
PEHRUSL1 6638 non-null values
HRINTSTA 13343 non-null values
PUABSOT 1962 non-null values
PEERN 0 non-null values
GTINDVPC 13343 non-null values
PUHROT2 492 non-null values
PULAY 1653 non-null values
PEERNPER 0 non-null values
HUFINAL 13343 non-null values
PTHR 13343 non-null values
PRDISC 97 non-null values
PTWK 13343 non-null values
PWFMWGT 13343 non-null values
GESTCEN 13343 non-null values
PERET1 1935 non-null values
PURETOT 1850 non-null values
PEMJOT 6638 non-null values
PWLGWGT 13343 non-null values
HRYEAR4 13343 non-null values
PEHRUSLT 6638 non-null values
PEMARITL 10800 non-null values
PRPERTYP 13343 non-null values
PEHRACTT 6450 non-null values
PRERNWA 0 non-null values
PTOT 13343 non-null values
PEMJNUM 383 non-null values
PELKDUR 251 non-null values
timestamp 13343 non-null values
dtypes: float64(56), int64(14), object(1)
reindexers = {0: [array([(61002006171, 1, 1.0), (61002006171, 1, 2.0), (61002006171, 1, 3.0),
..., None, (999990398540913, 1, 1.0), (999990398540913, 1, 2.0)], dtype=object), array([ 0, 1, 2, ..., -1, 19716, 19717])]}
method = None
fill_value = nan
limit = None
copy = True
allow_dups = True
I'll take a closer look tomorrow. It's gotten down to |
@TomAugspurger can you post the file links again? I think they are the same |
Sorry about that. Here's the other one. I'll edit the original post. |
Fixed in #5555. We didn't have the correct test case (which is basically selecting a unique index from a non-unique index, where you have lots of not-found elements).
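The shape of that case can be sketched with the public `Index.get_indexer_non_unique` method. This is not the test added in #5555; the flat integer indexes below are hypothetical stand-ins for the MultiIndexes in the issue, chosen only to show how not-found labels appear as -1 in the indexer (compare the -1 entry in the `reindexers` dump above).

```python
import pandas as pd

# Hypothetical stand-ins: a non-unique index being selected from,
# and a unique target where most labels are not present.
non_unique = pd.Index([1, 1, 2, 3])
unique_target = pd.Index([1, 5, 6, 2, 7])

# indexer: positions in non_unique for each target label, -1 where not found.
# missing: positions in unique_target of the labels that were not found.
indexer, missing = non_unique.get_indexer_non_unique(unique_target)
print(indexer, missing)

# Target labels that do exist in the non-unique index.
found = unique_target.delete(missing)
print(found)
```

The indexer is longer than the target whenever a label matches more than one row, so any code sizing a buffer from the target length alone has to account for both the extra matches and the -1 entries.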
|
I found that segfault that messed up my HDFStore earlier.
You'll need two csv files:
https://www.dropbox.com/s/55gcpkp3b8byfka/df1.csv
https://www.dropbox.com/s/mqqhfgocdg4qggp/df2.csv
To reproduce:
I've never debugged a segfault before, but I gather there are some methods. Can anyone reproduce this?
Also, that's a valid use of .loc, right?
In [1]: pd.__version__
Out[1]: '0.12.0-1084-ged64416'
In [2]: np.__version__
Out[2]: '1.7.1'