Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiIndex.get_level_values() replaces NA by another value #5074

Closed
goyodiaz opened this issue Oct 1, 2013 · 12 comments · Fixed by #5090

Comments

@goyodiaz
Copy link
Contributor

commented Oct 1, 2013

Test case:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: index = pd.MultiIndex.from_arrays([
    ['a', 'b', 'b'],
    [1, np.nan, 2]
])

In [4]: index.get_level_values(1)
Out[4]: Float64Index([1.0, 2.0, 2.0], dtype=object)

The expected output is
Float64Index([1.0, nan, 2.0], dtype=object)

This happens because NA values are not stored in the MultiIndex levels and the corresponding label is set to -1. Then when labels are used as indexes to values in get_level_values() that -1 points to the last (not null) value.

I tried to fix this by appending a NA to the values if -1 is in levels.
goyodiaz@f028513
It needs to be improved in order to return the proper NA value (NaN, None, maybe NaT?) depending on the index type. Does this approach makes sense?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 1, 2013

not a bad idea (this actually makes take deal with this correctly).

you can add a method _fill_value and have Index return np.nan and override in DatetimeIndex return NaT

need some more tests, edge cases, e.g. empty index, multiple nan (datetime w/NaT), -1 at the end, beginning (you have in the middle case)

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 1, 2013

Better fix is to create a mask based on value == -1 and then fill with nan
afterwards. Otherwise you get weird behavior with setting levels and this
fix would complicate any future function that would remove extraneous
labels. Plus you'd always have to check for and remove null when outputting
levels. Might create a flag that checks for nan values so you use dtype
float instead. (or just check mask.any())

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 1, 2013

@goyodiaz something like (using @jtratner method)

mask = labels == -1
values = unique_values.take(labels)
values[mask] = np.nan
@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 1, 2013

@goyodiaz do you have Travis set up? Interested if your solution passes. I didn't realize you were saying to just do it temporarily. I still think it'll create issues if you wanted to shorten levels later on, but I'm less convinced than in my previous comment :)

@jreback does append create a copy of the underlying memory? I guess it's happening in either case, but bool certainly smaller

slight tweak:

mask = labels == -1
values = unique_values.take(labels)
if mask.any():
    values = values.astype(float)
    values[mask] = np.nan
@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 2, 2013

yes append copies

@goyodiaz

This comment has been minimized.

Copy link
Contributor Author

commented Oct 2, 2013

@jtratner yes, travis builds passed.
https://travis-ci.org/goyodiaz/pandas

Can you think of any possible side effect which should be tested? I did not understand well your concerns.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Oct 2, 2013

go ahead an open a pull-request, this will submit as a patch to the devs

@goyodiaz

This comment has been minimized.

Copy link
Contributor Author

commented Oct 2, 2013

Will do it in a while.

BTW there is nothing to fix with DatetimeIndex:

In [1]: index = pd.MultiIndex.from_arrays([
    pd.DatetimeIndex([pd.NaT, 0, 1]),
    ['a', 'b', 'b']
])
In [2]: index.get_level_values(0)[0]
Out[2]: NaT

But I guess this could change when numpy get proper integer nan support, if ever.

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 2, 2013

@goyodiaz - it's totally fine, I appreciate you figuring out what was going
on! I was thinking about this from an internals perspective (i.e., space
and time complexity for masking vs. appending a value to the end of a list)
and also what I'm looking to do going forward.

That said, figuring out that the issue was that it was taking the wrong
value was very helpful - thanks for that!

@jtratner

This comment has been minimized.

Copy link
Contributor

commented Oct 2, 2013

Behavior with NaT actually could be a bug, not sure.

@goyodiaz

This comment has been minimized.

Copy link
Contributor Author

commented Oct 2, 2013

@jratner I think you are right, it's a bug in factorize() if it is supposed to return unique values without missing values and NaT is to be treated as a missing value.

@goyodiaz

This comment has been minimized.

Copy link
Contributor Author

commented Oct 2, 2013

I guess I should link the PR: #5090

jtratner added a commit that referenced this issue Oct 7, 2013
Merge pull request #5090 from goyodiaz/multiindex-nan
BUG: MultiIndex.get_level_values() replaces NA by another value (#5074)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.