Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add basic ExtensionIndex class #23223

Closed

Conversation

jorisvandenbossche
Copy link
Member

@jorisvandenbossche jorisvandenbossche commented Oct 18, 2018

Explored this a bit 2 weeks ago, so thought could open it as a WIP PR in case it might serve discussion.

For me, the main question is how "complete" we want this to be, before we consider merging it. This WIP certainly already is able to preserve the EA in the Index (not convert to objects), and basic indexing works (but only tested the basics, and not all combinations of uniques/duplicated, sorted/non-sorted ... indexes)

Closes #22861

@pep8speaks
Copy link

pep8speaks commented Oct 18, 2018

Hello @jorisvandenbossche! Thanks for updating the PR.

Line 64:27: E261 at least two spaces before inline comment
Line 64:28: E262 inline comment should start with '# '

Comment last updated on October 19, 2018 at 13:41 Hours UTC

@jorisvandenbossche jorisvandenbossche changed the title Add basic ExtensionIndex class [WIP] Add basic ExtensionIndex class Oct 18, 2018
@TomAugspurger
Copy link
Contributor

Interesting, this is less work than I expected. If I were to do df.groupby('extension_array').sum(), would we get back an extension index? Or do we need to update the Index.__new__ to make an ExtensionIndex in that case?


@property
def values(self):
""" return the underlying data as an ndarray """
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ndarray -> extension array.

@jorisvandenbossche
Copy link
Member Author

Interesting, this is less work than I expected.

That might also be because I didn't yet test that much :)

If I were to do df.groupby('extension_array').sum(), would we get back an extension index?

Indeed:

In [1]: df = pd.DataFrame({'key': pd.core.arrays.integer_array([1, 2, 1, 3, 2]), 'val': range(5)})

In [2]: df.groupby('key').mean()
Out[2]: 
     val
key     
1    1.0
2    2.5
3    3.0

In [3]: df.groupby('key').mean().index
Out[3]: ExtensionIndex([1, 2, 3], dtype='Int64', name='key')

That's the small change in Index.__new__ to construct an ExtensionIndex, if an ExtensionArray is passed (that is not a built-in one).
So it at least already preserves the array dtype in such cases.

@TomAugspurger
Copy link
Contributor

Ah, I completely overlooked that.

@jorisvandenbossche
Copy link
Member Author

Putting here a comment I already wrote a time ago, but the actual discussion can maybe go in #23565

The main question here is: which values to use for indexing?

Currently, in this PR I tried to create the correct indexing Engine based on the values of EA._values_for_factorize, because that seemed a natural fit (for indexing, they need to be hashable like for factorize).

However, in many cases, we are using _ndarray_values internally. And eg for IntegerArray, those two are actually different: _values_for_factorize is an object array (to have the NaNs), but _ndarray_values is the integers. This raises errors in the engine not expecting an integer array. Of course, we might need to change _ndarray_values for IntegerArray to also return object, but that also means that indexing would be quite expensive in general (using the object dtype path)

@TomAugspurger
Copy link
Contributor

Just thinking out loud here: For IntegerArray, one could in principle write an engine that uses _ndarray_values, correct? You would need to take care to mask the "missing values" (1s) in every operation, but it should be doable.

Regardless, I'm ok with a default "ExtensionIndex" (an Index with the extension dtype preserved) for now, with the possibility of more customizable extension index classes later.

@jorisvandenbossche
Copy link
Member Author

Just thinking out loud here: For IntegerArray, one could in principle write an engine that uses _ndarray_values, correct? You would need to take care to mask the "missing values" (1s) in every operation, but it should be doable.

But that would mean writing a custom engine?

Anyway (also future wise), an engine that works with values/mask combination might be useful in general.
But that is not for this PR :-)

I suppose short term the easiest is to use the object dtype values for IntegerArray for all indexing related things.

@TomAugspurger
Copy link
Contributor

But that would mean writing a custom engine?

Right, it would push more work onto the EA author (which is us for IntegerArray). Pandas would provide the hooks somewhere in pd.Index.__new__

engine = getattr(ExtensionArray, '_index_engine', default_engine)

I suppose short term the easiest is to use the object dtype values for IntegerArray for all indexing related things.

Agreed. Ensuring that Index.dtype is Int64Dtype, and not object, seems the most important thing. Then we can give speed improvements without breaking API.

@jreback jreback added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 14, 2019
@jreback
Copy link
Contributor

jreback commented Jan 14, 2019

nice idea for 0.25 :>

@jreback
Copy link
Contributor

jreback commented Jun 8, 2019

good idea, but needs a reboot.

@jreback jreback closed this Jun 8, 2019
@Detry322
Copy link

Detry322 commented Nov 6, 2019

Still hoping for this feature!

@Delengowski
Copy link
Contributor

I'm wondering what the current thoughts on this issue are wrt to

#30001

Is it being considered a non issue or a different work around being put in place?

@jreback
Copy link
Contributor

jreback commented Oct 3, 2020

see #22861

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add interface for defining an ExtensionIndex
6 participants