Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Extending EAs #51471

Open
jbrockmendel opened this issue Feb 17, 2023 · 8 comments
Open

ENH: Extending EAs #51471

jbrockmendel opened this issue Feb 17, 2023 · 8 comments
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Feb 17, 2023

On a call yesterday with some of the cuDF maintainers, the question came up of why they haven't implemented an ExtensionArray. They pointed to operations where we convert* to numpy (which is very expensive for their hypothetical EA), in particular groupby construction and merge.

* Not actually doing EA.to_numpy(), but having EA.factorize or EA.argsort return ndarrays in these cases means moving everything from a GPU to CPU. Potential modin or dask distributed EAs would have analogous pain points.

I said something to the effect of: "if you implemented EAs, the pandas team would be very-much on-board with helping make sure it worked". In retrospect I should have spoken only for myself, so want to ask: how do folks feel about extending the EA interface in order to make GPU/Distributed EAs viable? cc @pandas-dev/pandas-core

Some thoughts on what this might entail:

  1. groupby construction produces an ndarray[intp] of labels assigning each row (focusing only on axis=0) to a group.

  2. merge code I haven't looked into as closely

    • in a lot of it we convert to numpy and then call our libjoin functions.
    • so we could plausibly let EAs specify something other than those libjoin functions to use.
  3. IndexEngine - we have a non-performant EA engine and a performant MaskedEngine. In principle we could allow EAs to bring their own.

  4. Window - no idea what this would take.

Some potential reasons not to do this:

  1. In the groupby case in particular, the data-locality (either for GPU or distributed) needs to be the same for your group labels and each of your columns if you want to be performant. i.e. your columns need to be all-GPU or all-distributed. Maybe EAs aren't the right abstraction for that?

  2. Do we draw the line somewhere? Plotting? I/O?

  3. Early on we wanted to keep the EA namespace limited. This could make it significantly larger.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 17, 2023
@mroeschke
Copy link
Member

Although a large undertaking, I am generally sensing too that an extension "thing that manages EAs" is necessary if we want to encourage other libraries that want to leverage the pandas API but have a different execution paradigm to use pandas extension mechanism and not reinvent the API.

I mention extension "thing that manages EAs" as potentially a separate object that may be necessary as I'm skeptical that and array abstraction is enough to satisfy the different execution paradigm that other libraries are after.

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 18, 2023
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 19, 2023

What if we created a SimpleExtensionArray that could not be used for groupby operations, merge, etc., i.e., it could only be used for basic calculations (which might support the cuDF use case - not sure). Our current ExtensionArray would be a subclass of SimpleExtensionArray . If someone created a subclass of SimpleExtensionArray, then the non-implemented methods like take, factorize, etc. would just raise.

In the docs for ExtensionArray, we list a bunch of methods that people can choose to re-implement for performance reasons, with the paragraph ahead of that saying "Some methods require casting the ExtensionArray to an ndarray of Python objects with self.astype(object)", so if someone were to register a SimpleExtensionArray, which did not have those methods, then those methods would fail. But a lot of other methods (like adding two series backed by the SimpleExtensionArray) would probably work.

@jbrockmendel
Copy link
Member Author

What if we created a SimpleExtensionArray that could not be used for groupby operations, merge, etc

I'm not clear on what problem this is aimed at solving.

extension "thing that manages EAs" as potentially a separate object

I'm not opposed to this, but want to squeeze as much mileage as we reasonably can out of EA before going down that path.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Feb 19, 2023

What if we created a SimpleExtensionArray that could not be used for groupby operations, merge, etc

I'm not clear on what problem this is aimed at solving.

The idea here is that the cuDF maintainers could use this on GPU's, because we'd have a "limited function" EA that wouldn't support all of pandas, but "enough" of pandas for their library. Again, not sure of this, because I don't have familiarity with cuDF.

@phofl
Copy link
Member

phofl commented Feb 19, 2023

I think expanding EAs to handle these cases is the logical thing to do, otherwise we won't get as much out of it as we could and it won't be really useful to third party packages.

@MarcoGorelli
Copy link
Member

Generally good idea

I can't comment on whether a GPUEA or DistributedEA are feasible, but if they are, then it looks to me like this is the way to go.

There's already an intrinsic motivation within pandas to improve support for EAs, so it should be a safe bet for other libraries to spend time implementing extensions if they want to leverage the pandas API

@rgommers
Copy link
Contributor

rgommers commented Mar 2, 2023

... leverage the pandas API but have a different execution paradigm to use pandas extension mechanism and not reinvent the API.

This seems to me like a good formulation of a key goal that other dataframe libraries have. There's a significant issue with ExtensionArrays as far as I can see though. They are too low-level, meaning you get this kind of code execution:

  1. Call to a dataframe method, df.meth(...)
  2. A number of Python calls inside def meth()
  3. Hitting an EA API, at which point the external (GPU, distributed, whatever) library is called

Going through the pandas internals (step 2), which are optimized for performance on CPU and pandas itself, may not work for other libraries - algorithms may be very slow and need different logic to run on GPU for example. Inside pandas internals, you really don't want to have constraints that can affect the performance characteristics of another library.

We had similar conversations in NumPy around dispatching on the public API. Separating the API from the implementation was a key concept, exactly to avoid dealing with the above type of thing (internals "leaking out"). I made some slides about that, this is the key diagram that hopefully explains it:

image

ExtensionArray seems like a nice abstraction to extend pandas itself, in particular with new dtypes and domain-specific things like IP addresses (CyberPandas). But they seem to me like the wrong abstraction when one wants the API and semantics unchanged but affect performance characteristics. And cuDF, Dask et al. are about the latter. They need to be in control of everything that follows right after the user calls the public API.

@WillAyd
Copy link
Member

WillAyd commented Mar 2, 2023

I think right now we are too tightly bound to numpy to really allow extension array authors to provide their own implementation. I think we need to go through a full iteration of using Arrow as a complete backend irrespective of NumPy to learn how we can really make that generic and extenisble to third parties

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

7 participants