API/DISC: Mock pandas DataFrame/Series/Index in numba agg/apply/transform #53933

@lithomas1

Overview

I was recently working on numba support for apply, and I found it hard to support apply
the same way as agg and transform (which take in the values and index and return the new values),
because apply is far more flexible than either of them:
the function can change the index as well as the values, and it can also change the shape of the values by adding or deleting columns.
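
For context, here is a minimal sketch of the values/index convention described above. The function names `group_sum` and `demean` are made up for illustration, and the functions are called directly on NumPy arrays (no JIT) so that the snippet runs without numba installed.

```python
import numpy as np


def group_sum(values, index):
    # agg-style UDF: reduce one group's values to a single scalar.
    # `index` is part of the calling convention but unused here.
    return values.sum()


def demean(values, index):
    # transform-style UDF: return new values with the same shape as the input.
    return values - values.mean()


# One group's data, passed as two separate arrays rather than as a Series/DataFrame.
values = np.array([1.0, 2.0, 3.0])
index = np.array([10, 11, 12])

total = group_sum(values, index)
centered = demean(values, index)
```

With pandas and numba installed, a function with this signature could be passed along the lines of `df.groupby("key")["col"].agg(group_sum, engine="numba")`. Note that both functions only return values; neither has a way to hand back a new index or a different set of columns, which is the part apply needs.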

This led me to write a ton of code to handle it (basically reimplementing the concat logic of groupby apply), so I decided to look at wrapping the DataFrame object using the numba extension API.

Pros

  • Same API for the numba and regular engines: you can reuse functions, since both would operate on the DF object
    • Small nit: we wouldn't support things that numba doesn't support in nopython mode, like non-string names
    • It's also more convenient for users not to have to keep track of the index/columns/values of the DataFrame separately
  • Reuse of internal logic (what goes in and out of the numba function is a DF/Series, because we do the boxing/unboxing)
    • Should let us de-dup a lot and save a lot of misery in keeping the numba/regular paths consistent

Cons

  • Annoying to develop
    • Hard to wrap classes
      • You are basically writing C code in Python via numba that tells numba how to box/unbox the Python DF to a C representation
    • Adding new methods is pretty easy though; those are just like regular numba functions
    • Easy to make mistakes

We should decide whether we want to go down this road and, if so, how much of the pandas API to implement
(e.g. which methods should exist, and what to wrap).

Some sample code (only works with int64 arrays ATM, which is hardcoded) wrapping a subset of DataFrame attributes/methods.

cc @mroeschke @jbrockmendel @rhshadrach

Labels: API Design, Needs Discussion (requires discussion from core team before further action), numba (numba-accelerated operations)
