API: Remove nan-likes from MultiIndex levels #29111

Open
topper-123 opened this issue Oct 20, 2019 · 10 comments

Comments

@topper-123
Contributor

topper-123 commented Oct 20, 2019

Working on #27138, I've found that MultiIndex keeps nan-likes in the levels, but encodes them all to -1:

>>> import numpy as np
>>> import pandas as pd
>>> levels, codes = [[np.nan, None, pd.NaT, 128, 2]], [[0, -1, 1, 2, 3, 4]]
>>> mi = pd.MultiIndex(levels, codes)
>>> mi.codes[0]
[-1, -1, -1, -1, 3, 4]
>>> mi.levels[0]
Index([nan, None, NaT, 128, 2], dtype='object')

All the MultiIndex nan-likes are encoded to -1, so it's not possible to decode them back to their constituent values. It is therefore not possible to get more than one kind of nan-like value out of the MultiIndex, and in this case None and NaT disappear when converting:

>>> mi.to_frame()[0].array
<PandasArray>
[nan, nan, nan, nan, 128, 2]
Length: 6, dtype: object
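
To make the information loss concrete, here is a minimal sketch of the decoding step in plain Python. It is a simplified stand-in, not the actual pandas internals: it assumes every -1 code is filled with NaN, which matches the output shown above.

```python
import math

import numpy as np
import pandas as pd

# Level values and stored codes as reported in the example above.
level = [np.nan, None, pd.NaT, 128, 2]
stored_codes = [-1, -1, -1, -1, 3, 4]

# Simplified stand-in for decoding (an assumption, not pandas source code):
# a code of -1 always becomes NaN, whichever nan-like sat in the level.
decoded = [np.nan if code == -1 else level[code] for code in stored_codes]

# Count how many decoded values are float NaN.
n_missing = sum(1 for v in decoded if isinstance(v, float) and math.isnan(v))
print(decoded)
print(n_missing)
```

Since None and NaT are never referenced by any code, keeping them in the level adds nothing that can be recovered later.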

I think if nan-likes are all encoded to -1, it'd be more consistent to not have them in the levels, similarly to how Categorical does it already.

>>> c = pd.Categorical(levels[0])
>>> c.codes
array([-1, -1, -1,  1,  0], dtype=int8)
>>> c.categories
Int64Index([2, 128], dtype='int64')

Is there acceptance to change the MultiIndex API so we get nan-likes out of the levels? That would give MultiIndex an API more similar to Categorical.
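
A rough sketch of what the proposed normalization could look like for a single level. The helper name `strip_nan_likes` is hypothetical and works on plain lists; it only illustrates the idea and is not how pandas would necessarily implement it.

```python
import numpy as np
import pandas as pd


def strip_nan_likes(level, codes):
    """Hypothetical helper: drop nan-likes from one level and remap its codes."""
    new_level = []
    remap = {}  # old position in the level -> new position (-1 for nan-likes)
    for old_pos, value in enumerate(level):
        if pd.isna(value):
            remap[old_pos] = -1
        else:
            remap[old_pos] = len(new_level)
            new_level.append(value)
    # Codes that were already -1 stay -1; everything else is remapped.
    new_codes = [-1 if code == -1 else remap[code] for code in codes]
    return new_level, new_codes


new_level, new_codes = strip_nan_likes(
    [np.nan, None, pd.NaT, 128, 2], [0, -1, 1, 2, 3, 4]
)
print(new_level)  # [128, 2]
print(new_codes)  # [-1, -1, -1, -1, 0, 1]
```

Unlike Categorical, this sketch keeps the remaining level values in their original order rather than sorting them.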

@pandas-dev/pandas-core.

topper-123 changed the title from "API: MultiIndex keeps nan-likes in levels" to "API: Remove nan-likes from MultiIndex levels" on Oct 20, 2019
@topper-123
Contributor Author

@jreback, @TomAugspurger, @jorisvandenbossche, any comments?

All of the nan-likes are encoded to -1 already, so no information will really be lost from this, and this will unite Categorical.categories and MultiIndex.levels.

@jorisvandenbossche
Member

In principle, it would certainly be nice to clean that up I think, as that doesn't look good.

Can we think of potential changes that can impact users?
Eg what is the impact on indexing a series/dataframe with such an index? Does that change how those different missing indicators are treated?

Is this also only applicable to object dtype levels?

@topper-123
Contributor Author

topper-123 commented Oct 23, 2019

Yes, I think this should only affect object dtype levels, because other dtypes can only have one nan-like value.

I can't think how it could affect indexing, because indexing works using the codes, so all of NaN, NaT, None etc. already translate to the same code (-1).

I could start working on it, and if I'm missing some effect that has unexpected implications, we could discuss it again.

@jorisvandenbossche
Member

I can't think how it could affect indexing, because indexing works using the codes, so all of NaN, NaT, None etc. already translate to the same code (-1).

Can you try some examples with the index you show above?

@jorisvandenbossche
Member

E.g. pd.Series(range(len(mi)), index=mi)[np.nan] does not actually work for me. But I suppose there is some way to index with missing values into a MultiIndex?

@topper-123
Contributor Author

topper-123 commented Oct 23, 2019

Yes, I agree, seems like indexing MultiIndex by nans doesn't work currently.
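
One way that should work regardless is a boolean mask on the level values instead of a label lookup. This is only a sketch of a possible workaround, not something established in this thread, and it selects all missing entries together without distinguishing between the nan-likes.

```python
import numpy as np
import pandas as pd

levels, codes = [[np.nan, None, pd.NaT, 128, 2]], [[0, -1, 1, 2, 3, 4]]
mi = pd.MultiIndex(levels=levels, codes=codes)
s = pd.Series(range(len(mi)), index=mi)

# Boolean mask of rows whose first-level value is missing (any nan-like).
missing = mi.get_level_values(0).isna()
selected = s[missing]
print(selected.tolist())
```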

Also, given that all the nan-likes are encoded to -1 in .codes, even if indexing with nans did work, there AFAICS couldn't be any way to differentiate between nan and None anyway, so it wouldn't be useful.

Also, object dtype seems to be an anomaly, as other dtypes actually don't keep nan-likes in the levels:

>>> mi = pd.MultiIndex.from_product([[10, np.nan]])
>>> mi.levels[0]
Int64Index([10], dtype='int64')
>>> mi.codes[0]
FrozenNDArray([0, -1], dtype='int8')

@TomAugspurger
Contributor

TomAugspurger commented Oct 24, 2019 via email

@jorisvandenbossche
Member

cc @toobaz in case you have any experience with NaNs in object MultiIndexes

Also, given that all the nan-likes are encoded to -1 in .codes, even if indexing with nans did work, there AFAICS couldn't be any way to differentiate between nan and None anyway,

If it had worked, we might have needed to ensure it keeps working (even when that specific indicator is no longer in the values), or to add a deprecation for it. But ok, since it seems not to work, no need to argue for this ;)

@jorisvandenbossche
Member

Some random other observations / thoughts:

I was wondering what happens if you convert such a MI into a normal index by eg dropping index levels:

In [54]: levels, codes = [[np.nan, None, pd.NaT, 128, 2], [0]], [[0, -1, 1, 2, 3, 4], [0]*6] 
    ...: mi = pd.MultiIndex(levels, codes)  

In [55]: df = pd.DataFrame({'a': range(len(mi))}, index=mi).reset_index(level=1, drop=True)  

In [56]: df.index       
Out[56]: Index([nan, nan, nan, nan, 128, 2], dtype='object')

In [57]: levels, codes = [[pd.NaT, None, np.nan, 128, 2], [0]], [[0, -1, 1, 2, 3, 4], [0]*6] 
    ...: mi = pd.MultiIndex(levels, codes)  

In [58]: df = pd.DataFrame({'a': range(len(mi))}, index=mi).reset_index(level=1, drop=True)

In [59]: df.index   
Out[59]: Index([nan, nan, nan, nan, 128, 2], dtype='object')

So it seems you only get NaNs, and that also does not depend on the order of the missing value indicators in the levels (so if NaN is not the first, you still get NaN).


If you create a MultiIndex in a more typical way (not by constructing it with the MI constructor from levels and codes, but eg by setting columns as the index), you get "properly" constructed MIs:

In [67]: df = pd.DataFrame({'l1': [1, 'a', np.nan, None, pd.NaT], 'l2': range(5), 'a': range(5)})  

In [68]: df  
Out[68]: 
     l1  l2  a
0     1   0  0
1     a   1  1
2   NaN   2  2
3  None   3  3
4   NaT   4  4

In [69]: df.set_index(['l1', 'l2']).index 
Out[69]: 
MultiIndex([(  1, 0),
            ('a', 1),
            (nan, 2),
            (nan, 3),
            (nan, 4)],
           names=['l1', 'l2'])

In [70]: df.set_index(['l1', 'l2']).index.levels 
Out[70]: FrozenList([[1, 'a'], [0, 1, 2, 3, 4]])

In [71]: df.set_index(['l1', 'l2']).index.codes  
Out[71]: FrozenList([[0, 1, -1, -1, -1], [0, 1, 2, 3, 4]])

The same is true for pd.MultiIndex.from_tuples or pd.MultiIndex.from_product:

In [77]: pd.MultiIndex.from_product([pd.Index([1, 'a', None, pd.NaT, np.nan], dtype=object), [1, 2]])  
Out[77]: 
MultiIndex([(  1, 1),
            (  1, 2),
            ('a', 1),
            ('a', 2),
            (nan, 1),
            (nan, 2),
            (nan, 1),
            (nan, 2),
            (nan, 1),
            (nan, 2)],
           )

In [78]: pd.MultiIndex.from_product([pd.Index([1, 'a', None, pd.NaT, np.nan], dtype=object), [1, 2]]).levels
Out[78]: FrozenList([[1, 'a'], [1, 2]])
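
For completeness, a small sketch of the from_tuples case mentioned above, with the same mix of missing-value indicators in the first level. The names `first_level` and `first_codes` are just local variables for illustration.

```python
import numpy as np
import pandas as pd

tuples = [(1, 0), ("a", 1), (np.nan, 2), (None, 3), (pd.NaT, 4)]
mi = pd.MultiIndex.from_tuples(tuples, names=["l1", "l2"])

# Pull out the first level and its codes as plain Python lists.
first_level = list(mi.levels[0])
first_codes = [int(code) for code in mi.codes[0]]

print(first_level)
print(first_codes)
print(any(pd.isna(value) for value in first_level))
```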

All more reasons to fix this inconsistency. However, on:

Also, object dtype seems to be an anomaly as other dtypes actually don't keep nan-likes in the level

It doesn't seem to be unique to object dtype. You used from_product instead of the main constructor, and from_product sanitizes the missing values for object dtype as well:

In [85]: levels, codes = [[np.nan, 128, 2]], [[0, -1, 1, 2]] 

In [86]: mi = pd.MultiIndex(levels, codes)

In [87]: mi  
Out[87]: 
MultiIndex([(  nan,),
            (  nan,),
            (128.0,),
            (  2.0,)],
           )

In [88]: mi.levels 
Out[88]: FrozenList([[nan, 128.0, 2.0]])

In [89]: mi.codes
Out[89]: FrozenList([[-1, -1, 1, 2]])

So it seems this is a general issue with the MultiIndex constructor.
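
Until the constructor is changed, one possible workaround is to rebuild the index from its level values so that they are factorized again. This is a sketch under that assumption; `rebuild_without_nan_levels` is a hypothetical helper name, not a pandas API.

```python
import numpy as np
import pandas as pd


def rebuild_without_nan_levels(mi):
    """Hypothetical helper: re-factorize each level so nan-likes leave the levels."""
    arrays = [mi.get_level_values(i) for i in range(mi.nlevels)]
    return pd.MultiIndex.from_arrays(arrays, names=mi.names)


mi = pd.MultiIndex(levels=[[np.nan, 128, 2]], codes=[[0, -1, 1, 2]])
cleaned = rebuild_without_nan_levels(mi)

print(cleaned.levels[0].hasnans)  # False
print(list(cleaned.codes[0]))
```

Note that the rebuilt levels may come back in a different order than the original ones, so code that relies on specific code values should not assume they are preserved.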

@mvashishtha

Can we mark this as a bug? It seems that there is agreement that levels should consistently exclude nan-like values.
