ENH: Extending EAs #51471

jbrockmendel · 2023-02-17T23:57:11Z

On a call yesterday with some of the cuDF maintainers, the question came up of why they haven't implemented an ExtensionArray. They pointed to operations where we convert* to numpy (which is very expensive for their hypothetical EA), in particular groupby construction and merge.

* Not actually doing EA.to_numpy(), but having EA.factorize or EA.argsort return ndarrays in these cases means moving everything from a GPU to CPU. Potential modin or dask distributed EAs would have analogous pain points.
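A minimal sketch of the point in this footnote, using the masked `Int64` array that ships with pandas as a stand-in for a GPU-backed EA (an assumption for illustration only; no GPU is involved). Both calls hand back plain NumPy arrays, which is where a device-resident EA would have to copy to host memory:

```python
import numpy as np
import pandas as pd

# Stand-in for a third-party EA: pandas' own masked integer array.
arr = pd.array([3, 1, 2, 1], dtype="Int64")

# factorize returns (codes, uniques): codes is a NumPy array, uniques is an EA.
codes, uniques = arr.factorize()

# argsort also returns a NumPy array of positions.
order = arr.argsort()

print(type(codes).__name__, type(uniques).__name__, type(order).__name__)
```

The uniques stay in the EA's own type; only the integer codes and positions are forced into NumPy.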

I said something to the effect of: "if you implemented EAs, the pandas team would be very-much on-board with helping make sure it worked". In retrospect I should have spoken only for myself, so want to ask: how do folks feel about extending the EA interface in order to make GPU/Distributed EAs viable? cc @pandas-dev/pandas-core

Some thoughts on what this might entail:

- groupby construction produces an ndarray[intp] of labels assigning each row (focusing only on axis=0) to a group.
  - That construction has roughly one zillion cases.
  - In the simple case of df.groupby(col), it is mostly df[col].factorize(sort=False).
  - We'd need to let .factorize return something other than an ndarray, probably another EA.
  - That ndarray[intp] currently lives in BaseGrouper.group_info[0], so any place that currently uses it would need to be adapted, e.g. the ids keyword in _groupby_op introduced in REF: let EAs override WrappedCythonOp groupby implementations #51166.
  - Some of that adaptation we're going to need to do anyway if we want to support pyarrow dtypes without converting to numpy (again xref REF: let EAs override WrappedCythonOp groupby implementations #51166).
- merge code, which I haven't looked into as closely.
  - In a lot of it we convert to numpy and then call our libjoin functions.
  - So we could plausibly let EAs specify something other than those libjoin functions to use.
- IndexEngine: we have a non-performant EA engine and a performant MaskedEngine. In principle we could allow EAs to bring their own.
- Window: no idea what this would take.
  - I don't know this code that well, but my best vibes-based guess is that it mostly works with numpy dtypes. If we're going to support pyarrow dtypes, can we support general EAs?
  - xref ENH: Add masked support for rolling operations #50449 for masked dtypes.
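To make the groupby point concrete, here is a small sketch of the simple df.groupby(col) case. It uses only public API: `ngroup()` serves as a public proxy for the per-row group labels, and the private BaseGrouper.group_info is deliberately not touched. The `Int64` column is just a convenient EA-backed example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "key": pd.array([3, 1, 3, 1, 2], dtype="Int64"),  # EA-backed column
        "val": [10, 20, 30, 40, 50],
    }
)

# Simple case from the discussion: labels come from factorizing the key column.
codes, uniques = df["key"].factorize(sort=False)

# Public view of the per-row group labels that groupby computes.
labels = df.groupby("key", sort=False).ngroup().to_numpy()

print(type(codes).__name__, codes)
print(type(labels).__name__, labels)
```

In this simple case the two label arrays agree, and both are NumPy arrays even though the key column is an EA.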

Some potential reasons not to do this:

- In the groupby case in particular, the data-locality (either for GPU or distributed) needs to be the same for your group labels and each of your columns if you want to be performant. i.e. your columns need to be all-GPU or all-distributed. Maybe EAs aren't the right abstraction for that?
- Do we draw the line somewhere? Plotting? I/O?
- Early on we wanted to keep the EA namespace limited. This could make it significantly larger.


mroeschke · 2023-02-18T00:23:16Z

Although a large undertaking, I am generally sensing too that an extension "thing that manages EAs" is necessary if we want to encourage other libraries that want to leverage the pandas API but have a different execution paradigm to use pandas extension mechanism and not reinvent the API.

I mention an extension "thing that manages EAs" as potentially a separate object that may be necessary, as I'm skeptical that an array abstraction is enough to satisfy the different execution paradigms that other libraries are after.

Dr-Irv · 2023-02-19T19:10:53Z

What if we created a SimpleExtensionArray that could not be used for groupby operations, merge, etc., i.e., it could only be used for basic calculations (which might support the cuDF use case, though I'm not sure)? Our current ExtensionArray would be a subclass of SimpleExtensionArray. If someone created a subclass of SimpleExtensionArray, then the non-implemented methods like take, factorize, etc. would just raise.

In the docs for ExtensionArray, we list a bunch of methods that people can choose to re-implement for performance reasons, with the paragraph ahead of that saying "Some methods require casting the ExtensionArray to an ndarray of Python objects with self.astype(object)", so if someone were to register a SimpleExtensionArray, which did not have those methods, then those methods would fail. But a lot of other methods (like adding two series backed by the SimpleExtensionArray) would probably work.
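A toy sketch of how such a two-level hierarchy could look. Everything here is hypothetical: neither class is pandas' actual ExtensionArray, the list-backed storage is only for illustration, and `FullExtensionArray` is a made-up name for the role the current ExtensionArray would play.

```python
class SimpleExtensionArray:
    """Hypothetical minimal interface: elementwise arithmetic only."""

    def __init__(self, values):
        self._values = list(values)

    def __len__(self):
        return len(self._values)

    def __add__(self, other):
        # Basic calculation that every array supports.
        return type(self)(a + b for a, b in zip(self._values, other._values))

    def take(self, indices):
        raise NotImplementedError("take is not supported by this array")

    def factorize(self):
        raise NotImplementedError("factorize is not supported by this array")


class FullExtensionArray(SimpleExtensionArray):
    """Hypothetical full interface: adds what groupby and merge would need."""

    def take(self, indices):
        return type(self)(self._values[i] for i in indices)

    def factorize(self):
        # Assign each distinct value a code in order of first appearance.
        seen = {}
        codes = [seen.setdefault(v, len(seen)) for v in self._values]
        return codes, type(self)(seen)


simple = SimpleExtensionArray([1, 2, 3])
full = FullExtensionArray([5, 7, 5])
total = simple + simple          # works on the limited array
codes, uniques = full.factorize()  # only works on the full array
```

Under this sketch, arithmetic on `simple` succeeds while `simple.factorize()` raises NotImplementedError, which matches the "would just raise" behaviour described above.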

jbrockmendel · 2023-02-19T19:48:22Z

> What if we created a SimpleExtensionArray that could not be used for groupby operations, merge, etc

I'm not clear on what problem this is aimed at solving.

> extension "thing that manages EAs" as potentially a separate object

I'm not opposed to this, but want to squeeze as much mileage as we reasonably can out of EA before going down that path.

Dr-Irv · 2023-02-19T20:12:17Z

> > What if we created a SimpleExtensionArray that could not be used for groupby operations, merge, etc
>
> I'm not clear on what problem this is aimed at solving.

The idea here is that the cuDF maintainers could use this on GPUs, because we'd have a "limited function" EA that wouldn't support all of pandas, but "enough" of pandas for their library. Again, not sure of this, because I don't have familiarity with cuDF.

phofl · 2023-02-19T21:46:17Z

I think expanding EAs to handle these cases is the logical thing to do, otherwise we won't get as much out of it as we could and it won't be really useful to third party packages.

MarcoGorelli · 2023-02-24T18:16:31Z

Generally good idea

I can't comment on whether a GPUEA or DistributedEA are feasible, but if they are, then it looks to me like this is the way to go.

There's already an intrinsic motivation within pandas to improve support for EAs, so it should be a safe bet for other libraries to spend time implementing extensions if they want to leverage the pandas API.

rgommers · 2023-03-02T16:48:06Z

> ... leverage the pandas API but have a different execution paradigm to use pandas extension mechanism and not reinvent the API.

This seems to me like a good formulation of a key goal that other dataframe libraries have. There's a significant issue with ExtensionArrays as far as I can see though. They are too low-level, meaning you get this kind of code execution:

1. Call to a dataframe method, df.meth(...)
2. A number of Python calls inside def meth()
3. Hitting an EA API, at which point the external (GPU, distributed, whatever) library is called

Going through the pandas internals (step 2), which are optimized for performance on CPU and pandas itself, may not work for other libraries - algorithms may be very slow and need different logic to run on GPU for example. Inside pandas internals, you really don't want to have constraints that can affect the performance characteristics of another library.

We had similar conversations in NumPy around dispatching on the public API. Separating the API from the implementation was a key concept, exactly to avoid dealing with the above type of thing (internals "leaking out"). I made some slides about that, this is the key diagram that hopefully explains it:

ExtensionArrays seem like a nice abstraction for extending pandas itself, in particular with new dtypes and domain-specific things like IP addresses (CyberPandas). But they seem to me like the wrong abstraction when one wants to keep the API and semantics unchanged but change the performance characteristics. And cuDF, Dask et al. are about the latter. They need to be in control of everything that follows right after the user calls the public API.
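The difference between reaching the external library late (through an EA-style hook) and handing over control at the public API can be sketched with toy code. None of these names exist in pandas or NumPy; `ToyBackendArray`, `BACKENDS`, and both `frame_sum_*` functions are hypothetical and exist only to show where control passes to the backend.

```python
# Toy illustration only: none of these names exist in pandas.
calls = []


class ToyBackendArray:
    """Stands in for a GPU/distributed array reached via an EA-style hook."""

    def __init__(self, values):
        self.values = list(values)

    def _reduce_sum(self):
        calls.append("backend: _reduce_sum")
        return sum(self.values)


def frame_sum_ea_style(columns):
    """Library-owned logic runs first; the backend is reached late, per column."""
    calls.append("library: validate arguments")
    calls.append("library: loop over columns")
    return {name: col._reduce_sum() for name, col in columns.items()}


BACKENDS = {}


def toy_backend_sum(columns):
    """Backend-owned implementation of the whole operation."""
    calls.append("backend: whole-frame sum")
    return {name: sum(col.values) for name, col in columns.items()}


BACKENDS["toy"] = toy_backend_sum


def frame_sum_api_dispatch(columns, backend):
    """Hand the whole call to the backend right at the public entry point."""
    return BACKENDS[backend](columns)


cols = {"a": ToyBackendArray([1, 2, 3]), "b": ToyBackendArray([4, 5])}
result_ea = frame_sum_ea_style(cols)
result_api = frame_sum_api_dispatch(cols, "toy")
print(calls)
```

Both paths return the same result, but in the first the backend only sees one column at a time after library-owned steps have already run, while in the second it controls the whole operation.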

WillAyd · 2023-03-02T23:35:16Z

I think right now we are too tightly bound to numpy to really allow extension array authors to provide their own implementation. I think we need to go through a full iteration of using Arrow as a complete backend irrespective of NumPy to learn how we can really make that generic and extensible to third parties.

jbrockmendel added the Bug and Needs Triage labels on Feb 17, 2023

mroeschke added the Needs Discussion and ExtensionArray labels and removed the Bug and Needs Triage labels on Feb 18, 2023

This was referenced Feb 25, 2023:

- REF: de-duplicate NDFrame.take, remove Manager.take keyword #51482 (merged)
- REF: let EAs override WrappedCythonOp groupby implementations #51166 (merged)
