All-NaN MultiIndex level has different dtype than all-NaN flat Index #17929

Closed
toobaz opened this Issue Oct 20, 2017 · 5 comments

@toobaz
Member

toobaz commented Oct 20, 2017

Code Sample, a copy-pastable example if possible

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: values = [np.nan, np.nan]

In [4]: pd.Index(values).dtype
Out[4]: dtype('float64')

In [5]: pd.MultiIndex.from_arrays([values]).levels[0].dtype
Out[5]: dtype('O')

In [6]: pd.MultiIndex.from_arrays([values, [2, 3]]).levels[0].dtype
Out[6]: dtype('O')

Problem description

Yes, I know, "who cares?". But this is biting me in fixing #17924.

Expected Output

The same dtype in all three cases. I tend to think dtype('float64') is both preferable and more backwards-compatible.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 51c5f4d
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.0-3-amd64
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: it_IT.UTF-8
LOCALE: it_IT.UTF-8

pandas: 0.21.0rc1+26.g51c5f4d2a.dirty
pytest: 3.0.6
pip: 9.0.1
setuptools: None
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
pyarrow: None
xarray: None
IPython: 5.1.0.dev
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.3.0
numexpr: 2.6.1
feather: 0.3.1
matplotlib: 2.0.0
openpyxl: None
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.6
lxml: None
bs4: 4.5.3
html5lib: 0.999999999
sqlalchemy: 1.0.15
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.2.1

@jorisvandenbossche

Member

jorisvandenbossche commented Oct 20, 2017

I am not sure this is possible to solve. NaNs are not really first class citizens in MIs, which means the level is actually empty, and the dtype of something empty is object.
(at least this is the case for Index, not yet for Series #17261)
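
To illustrate the point about NaNs not being stored in the levels, here is a minimal sketch. It assumes a pandas version that exposes `MultiIndex.codes` (0.24 or later; at the time of this issue the attribute was called `labels`). It deliberately does not print the dtype of the empty level, since that is exactly what differs between versions.

```python
import numpy as np
import pandas as pd

# Same construction as in the report: first level all-NaN, second level real values.
mi = pd.MultiIndex.from_arrays([[np.nan, np.nan], [2, 3]])

# The index still has two entries ...
print(len(mi))                  # 2

# ... but the missing values are not stored in the level itself,
print(len(mi.levels[0]))        # 0

# they are encoded with the sentinel code -1 instead.
print(mi.codes[0].tolist())     # [-1, -1]

# The second level holds its values as usual.
print(mi.levels[1].tolist())    # [2, 3]
```

So the level really is empty, and its dtype can only come from whatever was known about the input before the NaNs were dropped.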

@toobaz

Member

toobaz commented Oct 21, 2017

NaNs are not really first class citizens in MIs, which means the level is actually empty,

Here I follow you

and the dtype of something empty is object.

... here I don't: I think we should be able to have pd.MultiIndex.from_product([[], []]) (0-length level) stored as an empty object array, but my example above (>0-length level with only missing values) stored as an empty float array.
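
The two cases contrasted here can be put side by side. This is only a sketch of the distinction, not of any fix: the variable names are illustrative, and the dtype of the all-NaN level is left out because it depends on the pandas version.

```python
import numpy as np
import pandas as pd

# A genuinely empty MultiIndex: no entries, 0-length levels.
empty_mi = pd.MultiIndex.from_product([[], []])

# An all-NaN MultiIndex: two entries, but still a 0-length level.
nan_mi = pd.MultiIndex.from_arrays([[np.nan, np.nan]])

print(len(empty_mi), len(empty_mi.levels[0]))   # 0 0
print(len(nan_mi), len(nan_mi.levels[0]))       # 2 0
```

The levels have the same length in both cases; only the input that produced them differs.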

@jorisvandenbossche

Member

jorisvandenbossche commented Oct 21, 2017

But how do you distinguish a 'real' empty MI from an MI with only NaNs? The actual level is the same for both: an empty index.

toobaz added a commit to toobaz/pandas that referenced this issue Oct 21, 2017

@toobaz toobaz referenced this issue Oct 21, 2017

Merged

BUG: fix dtype of all-NaN MultiIndex level #17934

4 of 4 tasks complete
@toobaz

Member

toobaz commented Oct 21, 2017

But how do you distinguish a 'real' empty MI from an MI with only NaNs?

I would do it at initialization - see #17934
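
As a rough illustration of what "at initialization" could mean (this is not the code from #17934, and `empty_level_like` is a hypothetical helper invented for this sketch): the dtype is taken from the input values while they are still available, before the missing entries are dropped. The sketch only covers the case where every value is missing.

```python
import numpy as np
import pandas as pd


def empty_level_like(values):
    """Hypothetical helper: return a 0-length level that keeps the dtype
    a flat Index would infer from `values` (assumed to be all-missing)."""
    flat = pd.Index(values)
    # Slicing to zero length keeps the dtype of the flat Index.
    return flat[:0]


all_nan_level = empty_level_like([np.nan, np.nan])
truly_empty_level = empty_level_like([])

print(all_nan_level.dtype)       # float64
print(truly_empty_level.dtype)   # object
```

Both results are empty, yet they carry different dtypes because the decision was made from the input rather than from the already-empty level.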

@jreback

Contributor

jreback commented Oct 21, 2017

In [13]: pd.MultiIndex.from_arrays([[pd.NaT, pd.NaT]]).levels[0]
Out[13]: DatetimeIndex([], dtype='datetime64[ns]', freq=None)

In [14]: pd.MultiIndex.from_arrays([[np.nan, np.nan]]).levels[0]
Out[14]: Index([], dtype='object')

In [15]: pd.MultiIndex.from_arrays([[None, None]]).levels[0]
Out[15]: Index([], dtype='object')

There is some ambiguity here, e.g. whether [14] and [15] should match (since we use np.nan generically). I would be OK with making [14] float.

@jreback jreback added this to the 0.22.0 milestone Oct 28, 2017
