BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

jschendel · 2017-12-13T03:50:54Z

Code Sample, a copy-pastable example if possible

In [2]: idx = pd.interval_range(0, 3).insert(0, np.nan)
   ...: idx
   ...:
Out[2]:
IntervalIndex([nan, (0.0, 1.0], (1.0, 2.0], (2.0, 3.0]]
              closed='right',
              dtype='interval[float64]')

In [3]: idx.to_tuples()
Out[3]: Index([(nan, nan), (0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], dtype='object')

Problem description

A tuple of np.nan is returned instead of just np.nan. Note that values and tolist both yield np.nan:

In [4]: idx.values
Out[4]:
array([nan, Interval(0.0, 1.0, closed='right'),
       Interval(1.0, 2.0, closed='right'),
       Interval(2.0, 3.0, closed='right')], dtype=object)

In [5]: idx.tolist()
Out[5]:
[nan,
 Interval(0.0, 1.0, closed='right'),
 Interval(1.0, 2.0, closed='right'),
 Interval(2.0, 3.0, closed='right')]

Expected Output

np.nan instead of a tuple of np.nan:

Index([nan, (0.0, 1.0), (1.0, 2.0), (2.0, 3.0)], dtype='object')

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: d2fd22e
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0.dev0+353.gd2fd22e
pytest: 3.1.2
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.13.1
scipy: 0.19.1
pyarrow: 0.6.0
xarray: 0.9.6
IPython: 6.1.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.2
feather: 0.4.0
matplotlib: 2.0.2
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 0.9.8
lxml: 3.8.0
bs4: None
html5lib: 0.999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: 0.1.0
pandas_gbq: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

jreback · 2017-12-13T14:03:31Z

hmm I don't really know why we have this method, its not semantically correct because you lose the closed value. I think .values (or list(IntervalIndex) is enough here). I would just deprecate this.

cc @shoyer

shoyer · 2017-12-13T16:48:57Z

My original reasoning for this method was to make it easier to use other APIs that need tuples, e.g., to convert an IntervalIndex into levels of a MultiIndex with pd.MultiIndex(interval_index.to_tuples()) or put it in a (NumPy) structured array. Including closed would be redundant for most such representations.

Part of the need for this is that it surprisingly difficult to create a NumPy array of tuples, and this method makes that easy. For example, consider:

In [7]: np.array([(1.0, 2.0), (3.0, 4.0)], dtype=object)
Out[7]:
array([[1.0, 2.0],
       [3.0, 4.0]], dtype=object)

That said, this doesn't come up that often and the pandas.Series() constructor does the right thing.

Given that the behavior is marginally useful, unambiguous (aside from handling missing values) and self-contained, I think we should lean towards keeping it around.

With regards to the handling of NaN, I did choose this existing behavior intentionally. This ensures that the result has a homogeneous type (a list of all tuples), with values given by tuples of length 2. On the other hand, I do see why returning scalars could make more sense in some cases. Maybe a keyword argument na_tuple=False could be used request NaN instead of (NaN, NaN) for missing values?

jreback · 2017-12-15T00:53:30Z

ok @shoyer makes sense about the rationale for this method.

jschendel · 2017-12-15T00:55:06Z

@shoyer : Thanks, that makes sense.

I like the na_tuple keyword argument idea and implemented it, but I'm having some trouble testing it due to the NA tuples. Consider the following setup:

import numpy as np
import pandas as pd
import pandas.util.testing as tm
from pandas.core.common import _asarray_tuplesafe

data1 = zip(np.array([1, 3, np.nan]), np.array([2, 4, np.nan]))
data2 = zip(np.array([1, 3, np.nan]), np.array([2, 4, np.nan]))

idx1 = pd.Index(_asarray_tuplesafe(data1))
idx2 = pd.Index(_asarray_tuplesafe(data2))

Then == returns False for the position with the NA tuples (not especially surprising due to the nature of NA):

In [2]: idx1
Out[2]: Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')

In [3]: idx2
Out[3]: Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')

In [4]: idx1 == idx2
Out[4]: array([ True,  True, False], dtype=bool)

But the problem is equals returns False, and tm.assert_index_equals raises:

In [5]: idx1.equals(idx2)
Out[5]: False

In [6]: tm.assert_index_equal(idx1, idx2)
---------------------------------------------------------------------------
AssertionError: Index are different

Index values are different (33.33333 %)
[left]:  Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')
[right]: Index([(1.0, 2.0), (3.0, 4.0), (nan, nan)], dtype='object')

So I'm not quite sure what should be done for testing. Looks like fixing tm.assert_index_equal would involve writing a special case in Index.equals to handle tuples with NA, which feels like a very bad idea.

I guess the other workaround would be to use tm.assert_index_equal on the non-NA portion of the index, then use some assertions on the NA portion to make sure they match what na_tuple specifies?

jreback · 2017-12-15T01:30:01Z

this might be a shortcoming of array_equivalent which does this heavy lifting; IOW it might need to detect tuples and compare them directly (rather than rely on np.array_equal which prob borks)

jschendel mentioned this issue Dec 13, 2017

BUG: Fix IntervalIndex.to_tuples() with NA values #18757

Merged

4 tasks

jreback added Deprecate Functionality to remove in pandas Interval Interval data type labels Dec 13, 2017

jreback added Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate and removed Deprecate Functionality to remove in pandas labels Dec 15, 2017

jreback added this to the 0.22.0 milestone Dec 15, 2017

jreback closed this as completed in #18757 Dec 15, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

jschendel commented Dec 13, 2017

INSTALLED VERSIONS

jreback commented Dec 13, 2017

shoyer commented Dec 13, 2017

jreback commented Dec 15, 2017

jschendel commented Dec 15, 2017 •

edited

Loading

jreback commented Dec 15, 2017

BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

BUG: IntervalIndex.to_tuples doesn't handle NA correctly #18756

Comments

jschendel commented Dec 13, 2017

Code Sample, a copy-pastable example if possible

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

jreback commented Dec 13, 2017

shoyer commented Dec 13, 2017

jreback commented Dec 15, 2017

jschendel commented Dec 15, 2017 • edited Loading

jreback commented Dec 15, 2017

Output of `pd.show_versions()`

jschendel commented Dec 15, 2017 •

edited

Loading