MultiIndex.get_loc misbehaves on NaNs #18485

Closed
toobaz opened this Issue Nov 25, 2017 · 4 comments

@toobaz
Member

toobaz commented Nov 25, 2017

Code Sample, a copy-pastable example if possible

In [2]: mi = pd.MultiIndex(levels=[[1, 2, 3, 5], [4, 6]], labels=[[3, 1, 2, 0], [1, -1, 0, -1]])

In [3]: mi.get_loc((2, np.nan))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-d77b20d4d7a2> in <module>()
----> 1 mi.get_loc((2, np.nan))

/home/pietro/nobackup/repo/pandas/pandas/core/indexes/multi.py in get_loc(self, key, method)
   2119             key = _values_from_object(key)
   2120             key = tuple(map(_maybe_str_to_time_stamp, key, self.levels))
-> 2121             return self._engine.get_loc(key)
   2122 
   2123         # -- partial selection or non-unique index

/home/pietro/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14965)()
    616         return super(MultiIndexObjectEngine, self).get_indexer(values)
    617 
--> 618     cpdef get_loc(self, object val):
    619 
    620         # convert a MI to an ndarray

/home/pietro/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.MultiIndexObjectEngine.get_loc (pandas/_libs/index.c:14886)()
    621         if hasattr(val, 'values'):
    622             val = val.values
--> 623         return super(MultiIndexObjectEngine, self).get_loc(val)
    624 
    625 

/home/pietro/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5832)()
    137             util.set_value_at(arr, loc, value)
    138 
--> 139     cpdef get_loc(self, object val):
    140         if is_definitely_invalid_key(val):
    141             raise TypeError("'{val}' is an invalid key".format(val=val))

/home/pietro/nobackup/repo/pandas/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5678)()
    159 
    160         try:
--> 161             return self.mapping.get_item(val)
    162         except (TypeError, ValueError):
    163             raise KeyError(val)

/home/pietro/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:21018)()
   1263                                        sizeof(uint32_t)) # flags
   1264 
-> 1265     cpdef get_item(self, object val):
   1266         cdef khiter_t k
   1267         if val != val or val is None:

/home/pietro/nobackup/repo/pandas/pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20972)()
   1271             return self.table.vals[k]
   1272         else:
-> 1273             raise KeyError(val)
   1274 
   1275     cpdef set_item(self, object key, Py_ssize_t val):

KeyError: (2, nan)

In [4]: mi.get_indexer(mi.copy())
Out[4]: array([ 0, -1,  2, -1])

In [5]: mi == mi.copy()
Out[5]: array([ True, False,  True, False], dtype=bool)

Problem description

I think this is actually the cause of this example, which is different from the one reported at the top of #18455.
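A plain-Python sketch of one plausible mechanism behind the failed lookup. This is an assumption for illustration, not something confirmed in this thread: the table is keyed by tuples of values, and the NaN stored in the index is a different object from the NaN in the caller's key. The names `stored_nan`, `query_nan`, `table` and `normalise` are hypothetical and do not come from pandas.

```python
import math

# Two distinct NaN objects: one standing in for the NaN held by the index,
# one for the NaN in the key passed by the caller (both hypothetical).
stored_nan = float("nan")
query_nan = float("nan")

# A dict keyed by tuples, standing in for a tuple-based hash table.
table = {(2, stored_nan): 1}

# NaN != NaN, so the two tuples compare unequal and the lookup misses.
print((2, query_nan) == (2, stored_nan))
print((2, query_nan) in table)

# Reusing the very same NaN object succeeds, but only because tuple
# comparison short-circuits on object identity.
print((2, stored_nan) in table)


def normalise(key):
    """Replace every NaN in a tuple key with None so that equal keys match."""
    return tuple(
        None if isinstance(v, float) and math.isnan(v) else v for v in key
    )


# With NaN normalised on both sides, the lookup no longer depends on identity.
safe_table = {normalise(k): pos for k, pos in table.items()}
print(safe_table[normalise((2, query_nan))])
```

The scalar check `val != val` visible in the traceback only catches a bare NaN; a tuple that merely contains a NaN is equal to itself and would pass straight through it.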

Expected Output

array([ 0, 1, 2, 3])

Output of pd.show_versions()

INSTALLED VERSIONS

commit: b45325e
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-4-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.22.0.dev0+201.gb45325e28
pytest: 3.0.6
pip: 9.0.1
setuptools: 33.1.1
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.18.1
pyarrow: None
xarray: None
IPython: 5.2.2
sphinx: None
patsy: 0.4.1+dev
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: 1.2.0
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: 2.3.0
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: None
lxml: 3.7.1
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@toobaz toobaz changed the title from MultiIndex.get_indexer(MultiIndex) misbehaves on NaNs to MultiIndex.get_loc misbehaves on NaNs Nov 25, 2017

@toobaz

Member

toobaz commented Nov 25, 2017

Notice that

In [2]: mi = pd.MultiIndex(levels=[[1, 2, 3, 5], [4, 6]], labels=[[3, 1, 2, 0], [1, -1, 0, -1]])

In [3]: flat = pd.Index(list(mi), tupleize_cols=False)

In [4]: flat.get_indexer(flat)
Out[4]: array([0, 1, 2, 3])

In [5]: flat.get_indexer(mi)
Out[5]: array([0, 1, 2, 3])

but

In [6]: mi.get_indexer(flat)
Out[6]: array([ 0, -1,  2, -1])

@jreback

Contributor

jreback commented Nov 25, 2017

btw, these are going to totally blow up if you have more than 1 nan because you can then have multiple matches. I think I did this internally in the block manager, IOW, 1 nan on indexing is ok, more than 1 we raise.

@jreback jreback added this to the Next Major Release milestone Nov 25, 2017

@toobaz

Member

toobaz commented Nov 25, 2017

btw, these are going to totally blow up if you have more than 1 nan because you can then have multiple matches

Not sure I understand the difference with ordinary values

toobaz added a commit to toobaz/pandas that referenced this issue Nov 26, 2017

@toobaz

Member

toobaz commented Nov 27, 2017

This only affects small (< 10000 elements) indexes:

In [4]: mi = pd.MultiIndex.from_product([[1, np.nan], range(1, 10000)])

In [5]: (np.nan, 3) in mi
Out[5]: True

In [6]: mi = pd.MultiIndex.from_product([[1, np.nan], range(1, 10)])

In [7]: (np.nan, 3) in mi
Out[7]: False

see #18519.
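For illustration only, here is a minimal sketch of a lookup that compares integer codes instead of tuples of values, so a NaN in the key never has to compare equal to anything. It reuses the `levels` and `labels` from the original report and assumes a unique index. The helpers `is_nan`, `encode` and `get_loc` are hypothetical names, and this is not pandas' actual engine.

```python
import math


def is_nan(value):
    """True for a float NaN, False for anything else."""
    return isinstance(value, float) and math.isnan(value)


def encode(key, levels):
    """Map each element of a tuple key to its integer code; NaN becomes -1."""
    codes = []
    for value, level in zip(key, levels):
        if is_nan(value):
            codes.append(-1)
        else:
            # list.index raises ValueError when the value is not in the level
            codes.append(level.index(value))
    return tuple(codes)


# Levels and codes taken from the example at the top of this issue.
levels = [[1, 2, 3, 5], [4, 6]]
codes = [[3, 1, 2, 0], [1, -1, 0, -1]]

# Position lookup keyed by tuples of integer codes (assumes unique rows).
positions = {code_tuple: pos for pos, code_tuple in enumerate(zip(*codes))}


def get_loc(key):
    """Return the position of a full tuple key, or raise KeyError."""
    try:
        return positions[encode(key, levels)]
    except (ValueError, KeyError):
        raise KeyError(key) from None


nan = float("nan")
print(get_loc((2, nan)))
print([get_loc(k) for k in [(5, 6), (2, nan), (3, 4), (1, nan)]])
```

Under these assumptions the four rows of the example index resolve to positions 0 to 3, matching the expected output in the report.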

@toobaz toobaz referenced this issue Jan 4, 2018

Merged

REF: codes-based MultiIndex engine #19074

3 of 3 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 17, 2018