[WIP] Add basic ExtensionIndex class #23223

jorisvandenbossche · 2018-10-18T14:38:49Z

I explored this a bit two weeks ago, so I thought I would open it as a WIP PR in case it helps the discussion.

For me, the main question is how "complete" we want this to be before we consider merging it. This WIP can already preserve the EA in the Index (without converting to object dtype), and basic indexing works (although I only tested the basics, not all combinations of unique/duplicated, sorted/non-sorted, ... indexes).

Closes #22861

pep8speaks · 2018-10-18T14:38:55Z

Hello @jorisvandenbossche! Thanks for updating the PR.

There are no PEP8 issues in the file pandas/core/dtypes/generic.py !
There are no PEP8 issues in the file pandas/core/indexes/base.py !
There are no PEP8 issues in the file pandas/core/indexes/extension.py !
There are no PEP8 issues in the file pandas/tests/extension/base/__init__.py !
In the file pandas/tests/extension/base/index.py, following are the PEP8 issues :

Line 64:27: E261 at least two spaces before inline comment
Line 64:28: E262 inline comment should start with '# '

There are no PEP8 issues in the file pandas/tests/extension/decimal/test_decimal.py !
There are no PEP8 issues in the file pandas/tests/extension/test_integer.py !
There are no PEP8 issues in the file pandas/tests/indexes/test_extension.py !

Comment last updated on October 19, 2018 at 13:41 Hours UTC

TomAugspurger · 2018-10-19T12:16:19Z

Interesting, this is less work than I expected. If I were to do df.groupby('extension_array').sum(), would we get back an extension index? Or do we need to update the Index.__new__ to make an ExtensionIndex in that case?

TomAugspurger · 2018-10-19T12:12:14Z

pandas/core/indexes/extension.py

+
+    @property
+    def values(self):
+        """ return the underlying data as an ndarray """


ndarray -> extension array.

jorisvandenbossche · 2018-10-19T12:26:35Z

> Interesting, this is less work than I expected.

That might also be because I didn't yet test that much :)

> If I were to do df.groupby('extension_array').sum(), would we get back an extension index?

Indeed:

In [1]: df = pd.DataFrame({'key': pd.core.arrays.integer_array([1, 2, 1, 3, 2]), 'val': range(5)})

In [2]: df.groupby('key').mean()
Out[2]: 
     val
key     
1    1.0
2    2.5
3    3.0

In [3]: df.groupby('key').mean().index
Out[3]: ExtensionIndex([1, 2, 3], dtype='Int64', name='key')

That comes from the small change in Index.__new__, which constructs an ExtensionIndex if an ExtensionArray is passed that is not one of the built-in ones.
So it at least already preserves the array dtype in such cases.
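
The dispatch described above can be sketched in plain Python, without pandas. The names `ToyIndex`, `ToyExtensionIndex`, and `ToyExtensionArray` are hypothetical stand-ins, not the classes in this PR; the sketch only shows how a base-class `__new__` can hand back a subclass when it receives an extension-array-like object.

```python
class ToyExtensionArray:
    """Hypothetical stand-in for a third-party extension array."""

    def __init__(self, values, dtype_name):
        self.values = list(values)
        self.dtype_name = dtype_name


class ToyIndex:
    """Hypothetical stand-in for Index; not the pandas implementation."""

    def __new__(cls, data, name=None):
        # Redirect only when the base class is called directly, so that
        # subclasses keep control over their own construction.
        if cls is ToyIndex and isinstance(data, ToyExtensionArray):
            cls = ToyExtensionIndex
        obj = object.__new__(cls)
        obj._data = data
        obj.name = name
        return obj

    @property
    def dtype(self):
        # The toy base class reports everything as object dtype.
        return "object"


class ToyExtensionIndex(ToyIndex):
    """Keeps the dtype of the wrapped array instead of falling back to object."""

    @property
    def dtype(self):
        return self._data.dtype_name


ext_idx = ToyIndex(ToyExtensionArray([1, 2, 3], "Int64"), name="key")
plain_idx = ToyIndex([1, 2, 3])
print(type(ext_idx).__name__, ext_idx.dtype)
print(type(plain_idx).__name__, plain_idx.dtype)
```

Doing the redirect in `__new__` means callers keep writing `ToyIndex(...)` and still get the dtype-preserving subclass, which is the behaviour the groupby example relies on.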

TomAugspurger · 2018-10-19T12:28:25Z

Ah, I completely overlooked that.

jorisvandenbossche · 2018-11-08T12:55:57Z

Posting a comment here that I wrote a while ago, although the actual discussion can maybe go in #23565.

The main question here is: which values to use for indexing?

Currently, in this PR I tried to create the correct indexing Engine based on the values of EA._values_for_factorize, because that seemed a natural fit (for indexing, they need to be hashable like for factorize).

However, in many cases we are using _ndarray_values internally. And for IntegerArray, for example, those two are actually different: _values_for_factorize is an object array (to hold the NaNs), but _ndarray_values is the integers. This raises errors in the engine, which is not expecting an integer array. Of course, we could change _ndarray_values for IntegerArray to also return object, but that would also make indexing quite expensive in general (using the object dtype path).

TomAugspurger · 2018-11-08T14:21:48Z

Just thinking out loud here: For IntegerArray, one could in principle write an engine that uses _ndarray_values, correct? You would need to take care to mask the "missing values" (1s) in every operation, but it should be doable.

Regardless, I'm ok with a default "ExtensionIndex" (an Index with the extension dtype preserved) for now, with the possibility of more customizable extension index classes later.
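
To make the masking idea concrete, here is a minimal pure-Python sketch of a lookup engine over a values/mask pair. `ToyMaskedEngine` and its `get_loc` signature are hypothetical and are not pandas' engine API; the only point is that the filler stored in masked slots must never take part in a lookup.

```python
class ToyMaskedEngine:
    """Hypothetical lookup engine over a (values, mask) pair."""

    def __init__(self, values, mask):
        if len(values) != len(mask):
            raise ValueError("values and mask must have the same length")
        self._positions = {}
        self._missing_positions = []
        for pos, (value, is_missing) in enumerate(zip(values, mask)):
            if is_missing:
                # Ignore whatever filler is stored in `values` at masked slots.
                self._missing_positions.append(pos)
            else:
                self._positions.setdefault(value, []).append(pos)

    def get_loc(self, key):
        """Return all positions of `key`; pass None to look up missing values."""
        if key is None:
            found = self._missing_positions
        else:
            found = self._positions.get(key, [])
        if not found:
            raise KeyError(key)
        return list(found)


# Position 2 is missing; its stored filler happens to be 1.
engine = ToyMaskedEngine(
    values=[1, 2, 1, 3, 1],
    mask=[False, False, True, False, False],
)
print(engine.get_loc(1))     # [0, 4]
print(engine.get_loc(None))  # [2]
```

Without the mask check, looking up `1` would also return position 2, which is the kind of error the masking has to prevent.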

jorisvandenbossche · 2018-11-08T14:33:57Z

> Just thinking out loud here: For IntegerArray, one could in principle write an engine that uses _ndarray_values, correct? You would need to take care to mask the "missing values" (1s) in every operation, but it should be doable.

But that would mean writing a custom engine?

Anyway (also future wise), an engine that works with values/mask combination might be useful in general.
But that is not for this PR :-)

I suppose short term the easiest is to use the object dtype values for IntegerArray for all indexing related things.

TomAugspurger · 2018-11-08T14:54:09Z

> But that would mean writing a custom engine?

Right, it would push more work onto the EA author (which is us for IntegerArray). Pandas would provide the hooks somewhere in pd.Index.__new__:

engine = getattr(ExtensionArray, '_index_engine', default_engine)
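
A self-contained sketch of that hook, in plain Python: `_index_engine` is the attribute name proposed in the line above and does not exist in pandas, and `DefaultEngine`, `CustomEngine`, `PlainArray`, `CustomArray`, and `select_engine` are hypothetical names used only for illustration.

```python
class DefaultEngine:
    """Hypothetical fallback engine: a plain value-to-position lookup."""

    def __init__(self, values):
        self._lookup = {value: pos for pos, value in enumerate(values)}

    def get_loc(self, key):
        return self._lookup[key]


class CustomEngine(DefaultEngine):
    """Hypothetical engine supplied by an extension array author."""


class PlainArray:
    """Array type that does not define the hook."""


class CustomArray:
    """Array type that opts in by pointing the hook at its own engine."""

    _index_engine = CustomEngine


def select_engine(array_type):
    # Use the array's own engine class if it defines one, else the default.
    return getattr(array_type, "_index_engine", DefaultEngine)


print(select_engine(PlainArray).__name__)
print(select_engine(CustomArray).__name__)
```

Because the lookup falls back to a default, array authors who do nothing get working (if slower) indexing, and only those who want a faster path need to write an engine.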

> I suppose short term the easiest is to use the object dtype values for IntegerArray for all indexing related things.

Agreed. Ensuring that Index.dtype is Int64Dtype, and not object, seems the most important thing. Then we can give speed improvements without breaking API.

jreback · 2019-01-14T00:18:02Z

nice idea for 0.25 :>

jreback · 2019-06-08T20:25:54Z

good idea, but needs a reboot.

Detry322 · 2019-11-06T21:53:12Z

Still hoping for this feature!

Delengowski · 2020-10-03T22:02:46Z

I'm wondering what the current thoughts on this issue are with respect to #30001.

Is it being considered a non-issue, or is a different workaround being put in place?

jreback · 2020-10-03T22:16:46Z

see #22861

jorisvandenbossche added 3 commits October 2, 2018 23:26

Add basic ExtensionIndex class

015e4b2

clean-up

9e282c9

Merge remote-tracking branch 'upstream/master' into EAindex

24fc7fd

jorisvandenbossche changed the title ~~Add basic ExtensionIndex class~~ [WIP] Add basic ExtensionIndex class Oct 18, 2018

TomAugspurger reviewed Oct 19, 2018

Merge remote-tracking branch 'upstream/master' into EAindex

da23c1f

jorisvandenbossche added 2 commits October 19, 2018 15:20

more robust constructor + add tests

6c1d798

add common tests

00d4a16

jorisvandenbossche mentioned this pull request Nov 8, 2018

API / internals: exact semantics of _ndarray_values #23565

Closed

jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 14, 2019

TomAugspurger mentioned this pull request Jan 28, 2019

Concatenating rows with Int64 datatype coerces to object #24768

Closed

jreback closed this Jun 8, 2019

jreback mentioned this pull request Dec 3, 2019

Nullable Int64 datatype is not preserved when used as index #30001

Closed
