ENH: Limited window function support for pandas #1083

Closed
cpcloud wants to merge 1 commit into base: master from cpcloud:window-funcs

2 participants
@cpcloud
Member

cpcloud commented Jul 31, 2017

No description provided.

@cpcloud cpcloud self-assigned this Jul 31, 2017

@cpcloud cpcloud added this to the 0.11.3 milestone Jul 31, 2017

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch 2 times, most recently from 46e09db to f5f5837 Aug 3, 2017

@cpcloud cpcloud changed the title from WIP: Pandas window functions to ENH: Limited window function support for pandas Aug 4, 2017

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch 3 times, most recently from 7953f4c to 9b41812 Aug 4, 2017

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch 5 times, most recently from 0b97191 to 935ff51 Aug 5, 2017

ops.LastValue: fixed_arity(sa.func.last_value, 1),
ops.NthValue: fixed_arity(sa.func.nth_value, 2),
ops.Lead: fixed_arity(sa.func.lead, 1),
ops.Lag: fixed_arity(sa.func.lag, 1),

@cpcloud

cpcloud Aug 5, 2017

Member

I'm going to revert these changes because there are no tests. I'll add in a follow up PR.

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch 13 times, most recently from 1b2b77c to 77a446b Aug 5, 2017

@cpcloud

Member

cpcloud commented Aug 8, 2017

@wesm can you review this when you get a chance? most important things here are the changes in ibis/pandas/execution.py and ibis/pandas/tests/test_operations.py.

@execute_node.register(ops.Selection, pd.DataFrame)
-def execute_selection_dataframe(op, data, scope=None):
+def execute_selection_dataframe(op, data, scope=None, **kwargs):

@cpcloud

cpcloud Aug 8, 2017

Member

I need to refactor this function a bit after this PR is merged. It's getting a bit too big for its britches.

@wesm

Member

wesm commented Aug 8, 2017

Wow, big PR! I will take some time to review, am going to try to make a little more progress on the remaining Arrow 0.6.0 issues, and then I can get to this

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch from 77a446b to 5bfe3c5 Aug 8, 2017

@cpcloud

Member

cpcloud commented Aug 9, 2017

One thing to note here is that for the groupby.transform style operations (OVER (PARTITION BY) in SQL parlance), we execute this in the fastest way possible: by executing each transform separately, rather than executing the operation per group.

Take z-score for example. There are two ways we could implement this.

1. Call execute in a lambda that executes the operation on each group.

df.groupby('foo').bar.transform(lambda x: (x - x.mean()) / x.std())

This will be very slow for high cardinality grouping keys, since pandas can't know that our lambda is composed of a bunch of function calls that are very fast when done individually. Also, the left operand of the subtraction in this case can be taken from the parent DataFrame. This avoids any indexing that happens when doing groupby.apply.

2. Execute each aggregation in a separate transform.

gb = df.groupby('foo').bar
result = (df.bar - gb.transform('mean')) / gb.transform('std')

This can be up to 500x (!) faster than the former for certain operations.

This patch always uses the latter when executing any non-trivial operation in a t.groupby().mutate() expression.
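A minimal sketch of the two approaches side by side, assuming a toy DataFrame with a grouping column `foo` and a value column `bar` (the data is made up for illustration and is not from the PR's test suite). It only checks that both forms give the same z-scores; the speedup quoted above is not measured here.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration); column names follow the
# snippets above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'foo': rng.integers(0, 100, size=10_000),
    'bar': rng.normal(size=10_000),
})

# Approach 1: one Python-level lambda call per group.
per_group = df.groupby('foo').bar.transform(
    lambda x: (x - x.mean()) / x.std()
)

# Approach 2: one transform per aggregation, combined with ordinary
# vectorized arithmetic on the parent column.
gb = df.groupby('foo').bar
per_transform = (df.bar - gb.transform('mean')) / gb.transform('std')

# Both approaches compute the same z-scores.
assert np.allclose(per_group, per_transform)
```

Timing the two expressions (for example with `%timeit`) on data with many distinct keys is one way to see how the gap grows with cardinality.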

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch 2 times, most recently from b625bdf to 1404f92 Aug 9, 2017

@cpcloud

Member

cpcloud commented Aug 10, 2017

GCS keeps intermittently failing, rebuilding.

@cpcloud

Member

cpcloud commented Aug 10, 2017

I'm going to pluck out the scope containing op instead of expr changes here

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch 7 times, most recently from aacb757 to 0831a4f Aug 10, 2017

def batting_df():
    path = os.environ.get('BATTING_CSV', 'batting.csv')
    df = pd.read_csv(path, index_col=None, sep=',')
    five_percent = int(0.01 * len(df))

@cpcloud

cpcloud Aug 15, 2017

Member

This should be renamed

@@ -0,0 +1,69 @@
CREATE EXTENSION IF NOT EXISTS file_fdw;

@cpcloud

cpcloud Aug 15, 2017

Member

This whole file should be removed

raise ValueError(
'More than one operation name found in {} class'.format(typename)
)
return getattr(data.expanding(), operation_name.lower())()

@cpcloud

cpcloud Aug 15, 2017

Member

We can use a context here to handle this attribute stuff

@execute_node.register(ops.StandardDev, SeriesGroupBy, SeriesGroupBy)
def execute_reduction_series_groupby_mask_std(op, data, mask, **kwargs):
    return data.apply(lambda x, mask=mask.obj: x[mask[x.index]].std())

@cpcloud

cpcloud Aug 15, 2017

Member

These both should use context.

@execute_node.register(ops.CumulativeMax, pd.Series)
def execute_series_cummax(op, data, **kwargs):
    return data.cummax()

@cpcloud

cpcloud Aug 15, 2017

Member

Need to add a test to make sure these do or don't work on SeriesGroupBy as well.
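A rough sketch of what such a test might compare on the pandas side, assuming a made-up frame with columns `key` and `value`. pandas itself provides `cummax` on both `Series` and `SeriesGroupBy`; whether the ibis dispatch registered above reaches the grouped case is exactly what the proposed test would have to establish, and this snippet does not show it.

```python
import pandas as pd

# Hypothetical toy data, not taken from the PR.
df = pd.DataFrame({
    'key': ['a', 'a', 'b', 'b', 'a'],
    'value': [1, 3, 2, 1, 2],
})

# Plain Series: running maximum over all rows.
overall = df.value.cummax()

# SeriesGroupBy: the running maximum restarts within each group, and the
# result keeps the original row index.
grouped = df.groupby('key').value.cummax()
```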

@wesm

Member

wesm commented Aug 15, 2017

Reviewing this now. Sorry about the delay

@six.add_metaclass(abc.ABCMeta)
class Context(object):

@wesm

wesm Aug 15, 2017

Member

I'm a bit confused about terminology. What is Context, maybe WindowType?

@cpcloud

cpcloud Aug 15, 2017

Member

Yeah this isn't named very well.

The idea this encapsulates is really something like AggregationContext.

For any particular aggregation such as sum, mean, etc., we need to decide, based on the presence or absence of other expressions like group_by and order_by, whether to call groupby(...).transform, groupby(...).apply, groupby(...).expanding().<method>, groupby(...).rolling().<method> or just groupby(...).<method> on the data.
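The decision described above can be sketched roughly as follows. The `aggregate` helper and its parameters are hypothetical and are not the PR's `Context` class; the sketch covers only the transform, expanding, rolling and plain-reduction cases, leaves out `groupby(...).apply`, and assumes a single grouping column.

```python
import pandas as pd


def aggregate(df, column, how, group_by=None, order_by=None, window=None):
    """Hypothetical helper illustrating the dispatch; not the PR's code.

    how:      name of a pandas reduction such as 'sum' or 'mean'.
    group_by: optional column to partition by.
    order_by: optional column to sort by before windowing.
    window:   None -> cumulative (expanding) window when ordered,
              int  -> trailing rolling window of that many rows.
    """
    if order_by is not None:
        df = df.sort_values(order_by)
    data = df[column] if group_by is None else df.groupby(group_by)[column]

    if order_by is None:
        if group_by is None:
            return getattr(data, how)()  # plain reduction, returns a scalar
        return data.transform(how)       # group value broadcast to each row

    windowed = data.expanding() if window is None else data.rolling(window)
    result = getattr(windowed, how)()
    if group_by is not None:
        # Grouped windows come back indexed by (group key, original index);
        # drop the group level to realign with the input rows.
        result = result.reset_index(level=0, drop=True)
    return result.sort_index()
```

For example, `aggregate(df, 'G', 'sum', group_by='playerID', order_by='yearID')` would give a per-player running total, assuming the frame has those columns.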

@wesm

wesm Aug 15, 2017

Member

Gotcha. The name doesn't matter too much for now as long as it's explained what the thing is.

@cpcloud

cpcloud Aug 15, 2017

Member

I'll add documentation to explain this module and its classes.

@wesm

wesm Aug 15, 2017

Member

👍

result = expr.execute()
columns = ['G', 'yearID']
more_values = batting_df[columns].sort_values('yearID').G.rolling(5).sum()

@wesm

wesm Aug 15, 2017

Member

Do you think this will be able to easily handle temporal expressions like "5 days" here?

@cpcloud

cpcloud Aug 15, 2017

Member

I think we can extend ibis.window to accept time intervals (probably implemented as an interval type), which would be translated to pandas objects and passed into rolling, which accepts offsets as of pandas 0.19.0.
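For reference, a small sketch of the pandas side of that idea: `rolling` with an offset string on a datetime-indexed Series. The data and dates are invented for illustration, and how `ibis.window` would expose such an interval is left open here.

```python
import pandas as pd

# Toy data with a datetime index (an assumption; the test above orders by
# the integer yearID column instead).
s = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(
        ['2017-01-01', '2017-01-02', '2017-01-08', '2017-01-09']
    ),
)

# An offset string gives a variable-size window covering the trailing
# 5 days of each row, instead of a fixed number of rows.
result = s.rolling('5D').sum()
```

Unlike `rolling(5)`, the number of rows in each window depends on how the timestamps are spaced, so the index must be monotonic and datetime-like.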

@wesm

Member

wesm commented Aug 15, 2017

LGTM modulo comment about internal terminology

@cpcloud cpcloud force-pushed the cpcloud:window-funcs branch from 02ee5a0 to 10cf761 Aug 15, 2017

@cpcloud

Member

cpcloud commented Aug 15, 2017

Merging on green.

@cpcloud cpcloud closed this in e7d6353 Aug 15, 2017

@cpcloud cpcloud deleted the cpcloud:window-funcs branch Aug 15, 2017
