Description
Overview
I have been working on numba support for apply recently, and I found it hard to support apply the same way as agg and transform (where the function takes in the values and index and returns the new values), because apply is far more flexible: the function can change the index as well as the values, and it can also change the shape of the values by adding or deleting columns.
As a result, I ended up writing a ton of code to handle this (basically reimplementing the concat logic that groupby apply uses), so I decided to take a look at wrapping the DataFrame object using the numba extension API instead.
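To make the difference concrete, here is a minimal sketch in plain pandas/NumPy (no numba involved). The names `group_mean_kernel` and `apply_style` are made up for illustration and are not part of pandas; the manual loop and `pd.concat` only stand in for the bookkeeping an engine would have to do, they are not the actual pandas internals.

```python
import numpy as np
import pandas as pd


def group_mean_kernel(values, index):
    """Kernel in the (values, index) style: plain arrays in, one scalar out."""
    return values.mean()


def apply_style(group):
    """apply-style function: free to drop rows and add columns."""
    out = group[group["x"] > 1].copy()  # rows dropped, so the index changes
    out["x_squared"] = out["x"] ** 2    # column added, so the shape changes
    return out


df = pd.DataFrame({"key": ["a", "a", "b", "b"], "x": [1, 2, 3, 4]})

# Kernel style: the caller splits each group into arrays and knows in advance
# that exactly one value per group comes back.
kernel_result = {
    key: group_mean_kernel(group["x"].to_numpy(), group.index.to_numpy())
    for key, group in df.groupby("key")
}

# apply style: every group can come back with a different index and different
# columns, so the pieces have to be stitched together with concat-like logic.
applied = pd.concat([apply_style(group) for _, group in df.groupby("key")])

print(kernel_result)
print(applied)
```

With the kernel style the output layout is fixed before the function runs, whereas the apply-style result can only be assembled after inspecting what each call returned. That second part is the logic that would otherwise need to be reimplemented for a numba engine.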
Pros
- Same API for the numba and regular engines; you can reuse functions since you would use the DF object for both
- Small nit: We wouldn't support stuff that numba doesn't support in nopython mode, like non-string names
- It's also more convenient for the user not to have to keep track of the index/columns/values of the DataFrame separately.
- Reuse internal logic (what goes in and out of the numba function is a DF/Series, because we do the boxing/unboxing)
- Should be able to de-dup a lot, and save a lot of misery in keeping the numba/regular paths consistent
Cons
- Annoying to develop
- Hard to wrap classes
- You are basically writing C code in Python via numba that tells numba how to box/unbox the Python DF to a C representation
- Adding new methods is pretty easy though; those are just like regular numba functions
- Easy to make mistakes
We should decide whether we want to go down this road and, if so, how much of the pandas API to implement (e.g. which methods should exist, and what to wrap).
Some sample code wrapping a subset of DataFrame attributes/methods (it only works with int64 arrays at the moment, which is hardcoded).