Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

Closed
jschendel opened this issue Dec 13, 2017 · 5 comments · Fixed by #18757
Closed

BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

jschendel opened this issue Dec 13, 2017 · 5 comments · Fixed by #18757
Labels
Bug Interval Interval data type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Milestone

Comments

@jschendel
Copy link
Member

Code Sample, a copy-pastable example if possible

In [2]: idx = pd.interval_range(0, 3).insert(0, np.nan)
   ...: idx
   ...:
Out[2]:
IntervalIndex([nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0]]
              closed='right',
              dtype='interval[float64]')

In [3]: idx.to_tuples()
Out[3]: Index([(nan, nan), (0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], dtype='object')

Problem description

A tuple of np.nan is returned instead of just np.nan. Note that values and tolist both yield np.nan:

In [4]: idx.values
Out[4]:
array([nan, Interval(0.0, 1.0, closed='right'),
       Interval(1.0, 2.0, closed='right'),
       Interval(2.0, 3.0, closed='right')], dtype=object)

In [5]: idx.tolist()
Out[5]:
[nan,
 Interval(0.0, 1.0, closed='right'),
 Interval(1.0, 2.0, closed='right'),
 Interval(2.0, 3.0, closed='right')]

Expected Output

np.nan instead of a tuple of np.nan:

Index([nan, (0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], dtype='object')

Output of pd.show_versions()

INSTALLED VERSIONS

commit: d2fd22e
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0.dev0+353.gd2fd22e
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.6.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.0
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Dec 13, 2017

hmm I don't really know why we have this method, its not semantically correct because you lose the closed value. I think .values (or list(IntervalIndex) is enough here). I would just deprecate this.

cc @shoyer

@jreback jreback added Deprecate Functionality to remove in pandas Interval Interval data type labels Dec 13, 2017
@shoyer
Copy link
Member

shoyer commented Dec 13, 2017

My original reasoning for this method was to make it easier to use other APIs that need tuples, e.g., to convert an IntervalIndex into levels of a MultiIndex with pd.MultiIndex(interval_index.to_tuples()) or put it in a (NumPy) structured array. Including closed would be redundant for most such representations.

Part of the need for this is that it surprisingly difficult to create a NumPy array of tuples, and this method makes that easy. For example, consider:

In [7]: np.array([(1.0, 2.0), (3.0, 4.0)], dtype=object)
Out[7]:
array([[1.0, 2.0],
       [3.0, 4.0]], dtype=object)

That said, this doesn't come up that often and the pandas.Series() constructor does the right thing.

Given that the behavior is marginally useful, unambiguous (aside from handling missing values) and self-contained, I think we should lean towards keeping it around.

With regards to the handling of NaN, I did choose this existing behavior intentionally. This ensures that the result has a homogeneous type (a list of all tuples), with values given by tuples of length 2. On the other hand, I do see why returning scalars could make more sense in some cases. Maybe a keyword argument na_tuple=False could be used request NaN instead of (NaN, NaN) for missing values?

@jreback jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Deprecate Functionality to remove in pandas labels Dec 15, 2017
@jreback jreback added this to the 0.22.0 milestone Dec 15, 2017
@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

ok @shoyer makes sense about the rationale for this method.

@jschendel
Copy link
Member Author

jschendel commented Dec 15, 2017

@shoyer : Thanks, that makes sense.

I like the na_tuple keyword argument idea and implemented it, but I'm having some trouble testing it due to the NA tuples. Consider the following setup:

import numpy as np
import pandas as pd
import pandas.util.testing as tm
from pandas.core.common import _asarray_tuplesafe

data1 = zip(np.array([1, 3, np.nan]), np.array([2, 4, np.nan]))
data2 = zip(np.array([1, 3, np.nan]), np.array([2, 4, np.nan]))

idx1 = pd.Index(_asarray_tuplesafe(data1))
idx2 = pd.Index(_asarray_tuplesafe(data2))

Then == returns False for the position with the NA tuples (not especially surprising due to the nature of NA):

In [2]: idx1
Out[2]: Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')

In [3]: idx2
Out[3]: Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')

In [4]: idx1 == idx2
Out[4]: array([ True,  True, False], dtype=bool)

But the problem is equals returns False, and tm.assert_index_equals raises:

In [5]: idx1.equals(idx2)
Out[5]: False

In [6]: tm.assert_index_equal(idx1, idx2)
---------------------------------------------------------------------------
AssertionError: Index are different

Index values are different (33.33333 %)
[left]:  Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')
[right]: Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')

So I'm not quite sure what should be done for testing. Looks like fixing tm.assert_index_equal would involve writing a special case in Index.equals to handle tuples with NA, which feels like a very bad idea.

I guess the other workaround would be to use tm.assert_index_equal on the non-NA portion of the index, then use some assertions on the NA portion to make sure they match what na_tuple specifies?

@jreback
Copy link
Contributor

jreback commented Dec 15, 2017

this might be a shortcoming of array_equivalent which does this heavy lifting; IOW it might need to detect tuples and compare them directly (rather than rely on np.array_equal which prob borks)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Interval Interval data type Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants