ENH: Limited window function support for pandas #1083

Closed
wants to merge 1 commit into from

Member

@cpcloud cpcloud commented Jul 31, 2017

No description provided.

@cpcloud cpcloud self-assigned this Jul 31, 2017
@cpcloud cpcloud added feature Features or general enhancements pandas The pandas backend labels Jul 31, 2017
@cpcloud cpcloud added this to the 0.11.3 milestone Jul 31, 2017
@cpcloud cpcloud force-pushed the window-funcs branch 2 times, most recently from 46e09db to f5f5837 Compare August 4, 2017 18:38
@cpcloud cpcloud changed the title WIP: Pandas window functions ENH: Limited window function support for pandas Aug 4, 2017
@cpcloud cpcloud force-pushed the window-funcs branch 3 times, most recently from 7953f4c to 9b41812 Compare August 5, 2017 02:42
@cpcloud cpcloud force-pushed the window-funcs branch 5 times, most recently from 0b97191 to 935ff51 Compare August 5, 2017 20:53
ops.LastValue: fixed_arity(sa.func.first_value, 1),
ops.NthValue: fixed_arity(sa.func.nth_value, 2),
ops.Lead: fixed_arity(sa.func.lead, 1),
ops.Lag: fixed_arity(sa.func.lag, 1),
Member Author

@cpcloud cpcloud Aug 5, 2017

I'm going to revert these changes because there are no tests. I'll add in a follow up PR.

@cpcloud cpcloud force-pushed the window-funcs branch 13 times, most recently from 1b2b77c to 77a446b Compare August 7, 2017 23:40
@cpcloud
Member Author

cpcloud commented Aug 8, 2017

@wesm can you review this when you get a chance? The most important changes here are in ibis/pandas/execution.py and ibis/pandas/tests/test_operations.py.

@execute_node.register(ops.Selection, pd.DataFrame)
-def execute_selection_dataframe(op, data, scope=None):
+def execute_selection_dataframe(op, data, scope=None, **kwargs):
Member Author

I need to refactor this function a bit after this PR is merged. It's getting a bit too big for its britches.

@wesm
Member

wesm commented Aug 8, 2017

Wow, big PR! I will take some time to review. I'm going to try to make a little more progress on the remaining Arrow 0.6.0 issues first, and then I can get to this.

@cpcloud
Member Author

cpcloud commented Aug 9, 2017

One thing to note here is that for the groupby.transform style operations (OVER (PARTITION BY) in SQL parlance), we execute this in the fastest way possible: by executing each transform separately, rather than executing the operation per group.

Take z-score for example. There are two ways we could implement this.

1. Call execute in a lambda that executes the operation on each group.

df.groupby('foo').bar.transform(lambda x: (x - x.mean()) / x.std())

This will be very slow for high cardinality grouping keys, since pandas can't know that our lambda is composed of a bunch of function calls that are very fast when done individually. Also, the left operand of the subtraction in this case can be taken from the parent DataFrame. This avoids any indexing that happens when doing groupby.apply.

2. Execute each aggregation in a separate transform.

gb = df.groupby('foo').bar
result = (df.bar - gb.transform('mean')) / gb.transform('std')

This can be up to 500x (!) faster than the former for certain operations.

This patch always uses the latter when executing any nontrivial operations using t.groupby().mutate() expressions.
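A minimal sketch of the two approaches on a toy frame, assuming the z-score is taken over column bar grouped by foo. The data is made up for illustration, and the snippet only checks that both forms produce the same values; it does not measure the speedup.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration, not taken from the PR).
df = pd.DataFrame({
    'foo': ['a', 'a', 'a', 'b', 'b', 'b'],
    'bar': [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Approach 1: one Python-level lambda call per group.
per_group = df.groupby('foo').bar.transform(
    lambda x: (x - x.mean()) / x.std()
)

# Approach 2: one transform per aggregation, combined with
# ordinary vectorized arithmetic on the parent column.
gb = df.groupby('foo').bar
vectorized = (df.bar - gb.transform('mean')) / gb.transform('std')

# Both approaches agree; equal_nan covers groups with a single row,
# where the sample standard deviation is NaN in both forms.
same = np.allclose(per_group, vectorized, equal_nan=True)
```

The second form keeps the subtraction and division outside of the group-by machinery, so pandas only runs its built-in mean and std once each over all groups.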

@cpcloud cpcloud force-pushed the window-funcs branch 2 times, most recently from b625bdf to 1404f92 Compare August 10, 2017 06:25
@cpcloud
Member Author

cpcloud commented Aug 10, 2017

GCS keeps intermittently failing, rebuilding.

@cpcloud
Member Author

cpcloud commented Aug 10, 2017

I'm going to pull the changes that make scope contain op instead of expr out of this PR.

def batting_df():
    path = os.environ.get('BATTING_CSV', 'batting.csv')
    df = pd.read_csv(path, index_col=None, sep=',')
    five_percent = int(0.01 * len(df))
Member Author

This should be renamed: it is computed as 0.01 * len(df), which is one percent of the rows, not five.

ci/pgload.sql Outdated
@@ -0,0 +1,69 @@
CREATE EXTENSION IF NOT EXISTS file_fdw;
Member Author

This whole file should be removed

    raise ValueError(
        'More than one operation name found in {} class'.format(typename)
    )
return getattr(data.expanding(), operation_name.lower())()
Member Author

We can use a context here to handle this attribute stuff


@execute_node.register(ops.StandardDev, SeriesGroupBy, SeriesGroupBy)
def execute_reduction_series_groupby_mask_std(op, data, mask, **kwargs):
    return data.apply(lambda x, mask=mask.obj: x[mask[x.index]].std())
Member Author

These both should use context.


@execute_node.register(ops.CumulativeMax, pd.Series)
def execute_series_cummax(op, data, **kwargs):
    return data.cummax()
Member Author

Need to add a test to make sure these do or don't work on SeriesGroupBy as well.
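For reference, the pandas-level difference such a test would exercise can be sketched as below. This is not an ibis test and says nothing about how execute_node dispatches; the frame and column names are hypothetical, and the snippet only shows that cummax exists on both pd.Series and SeriesGroupBy and gives different results.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b'],
    'value': [1, 5, 3, 2],
})

# Running maximum over the whole column.
overall = df.value.cummax()

# Running maximum restarted within each group; the result keeps
# the original row order and index.
grouped = df.groupby('key').value.cummax()
```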

@wesm
Member

wesm commented Aug 15, 2017

Reviewing this now. Sorry about the delay



@six.add_metaclass(abc.ABCMeta)
class Context(object):
Member

I'm a bit confused about the terminology. What is Context? Maybe WindowType?

Member Author

Yeah this isn't named very well.

The idea this encapsulates is really something like AggregationContext.

For any particular aggregation such as sum, mean, etc., we need to decide, based on the presence or absence of other expressions like group_by and order_by, whether we should call groupby(...).transform, groupby(...).apply, groupby(...).expanding().<method>, groupby(...).rolling().<method>, or just groupby(...).<method> on the data.
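A rough sketch of the kind of decision described here, written against plain pandas. The function aggregate and its parameters how, group_by, order_by, and window are hypothetical names chosen for illustration; this is not the Context class from the PR, and the exact rules ibis applies may differ.

```python
import pandas as pd


def aggregate(df, column, how, group_by=None, order_by=None, window=None):
    """Hypothetical dispatcher; NOT the ibis implementation."""
    if order_by is not None:
        # A stable sort keeps tied rows in their original order.
        df = df.sort_values(order_by, kind='mergesort')

    data = df[column] if group_by is None else df.groupby(group_by)[column]

    if order_by is None:
        if group_by is None:
            # Plain reduction to a single value.
            return getattr(data, how)()
        # One value per row, broadcast back from each group.
        return data.transform(how)

    # Ordered: cumulative (expanding) unless a window size is given.
    windowed = data.expanding() if window is None else data.rolling(window)
    return getattr(windowed, how)()


# Made-up data to exercise each branch.
demo = pd.DataFrame({
    'k': ['a', 'a', 'b', 'b'],
    't': [2, 1, 2, 1],
    'v': [1.0, 2.0, 3.0, 4.0],
})

total = aggregate(demo, 'v', 'sum')
per_row = aggregate(demo, 'v', 'sum', group_by='k')
running = aggregate(demo, 'v', 'sum', order_by='t')
running_by_group = aggregate(demo, 'v', 'sum', group_by='k', order_by='t')
```

Keeping the choice in one place means individual aggregations such as sum or mean only need to supply a method name.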

Member

Gotcha. The name doesn't matter too much for now as long as it's explained what the thing is.

Member Author

I'll add documentation to explain this module and its classes.

Member

👍

result = expr.execute()

columns = ['G', 'yearID']
more_values = batting_df[columns].sort_values('yearID').G.rolling(5).sum()
Member

Do you think this will be able to easily handle temporal expressions like "5 days" here?

Member Author

@cpcloud cpcloud Aug 15, 2017

I think we can extend ibis.window to accept time intervals (probably implemented as an interval type), which would be translated to pandas objects and passed into rolling, which accepts offsets as of 0.19.0.
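On the pandas side, an offset-based window looks roughly like the sketch below. Only the pandas call is shown; how ibis.window would expose intervals is not settled in this thread, and the series here is made up for illustration.

```python
import pandas as pd

# Hypothetical daily-ish observations with a gap in the middle.
ts = pd.Series(
    [1.0, 2.0, 3.0, 4.0],
    index=pd.to_datetime(
        ['2017-08-01', '2017-08-02', '2017-08-10', '2017-08-11']
    ),
)

# A "5 days" window: each row sums the observations whose timestamps
# fall within the five days up to and including that row.
rolled = ts.rolling('5D').sum()
```

Unlike rolling(5), which always spans five rows, the offset form spans a fixed amount of time, so the number of rows in each window varies with the gaps in the index.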

@wesm
Member

wesm commented Aug 15, 2017

LGTM modulo comment about internal terminology

@cpcloud
Member Author

cpcloud commented Aug 15, 2017

Merging on green.

@cpcloud cpcloud closed this in e7d6353 Aug 15, 2017
@cpcloud cpcloud deleted the window-funcs branch August 15, 2017 19:58