-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Limited window function support for pandas #1083
Conversation
46e09db
to
f5f5837
Compare
7953f4c
to
9b41812
Compare
0b97191
to
935ff51
Compare
ibis/sql/postgres/compiler.py
Outdated
| ops.LastValue: fixed_arity(sa.func.first_value, 1), | ||
| ops.NthValue: fixed_arity(sa.func.nth_value, 2), | ||
| ops.Lead: fixed_arity(sa.func.lead, 1), | ||
| ops.Lag: fixed_arity(sa.func.lag, 1), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to revert these changes because there are no tests. I'll add in a follow up PR.
1b2b77c
to
77a446b
Compare
|
@wesm can you review this when you get a chance? most important things here are the changes in |
ibis/pandas/execution.py
Outdated
| @execute_node.register(ops.Selection, pd.DataFrame) | ||
| def execute_selection_dataframe(op, data, scope=None): | ||
| def execute_selection_dataframe(op, data, scope=None, **kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to refactor this function a bit after this PR is merged. It's getting a bit too big for its britches.
|
Wow, big PR! I will take some time to review, am going to try to make a little more progress on the remaining Arrow 0.6.0 issues, and then I can get to this |
|
One thing to note here is that for the Take z-score for example. There are two ways we could implement this. 1. Call
|
b625bdf
to
1404f92
Compare
|
GCS keeps intermittently failing, rebuilding. |
|
I'm going to pluck out the |
aacb757
to
0831a4f
Compare
ibis/pandas/tests/conftest.py
Outdated
| def batting_df(): | ||
| path = os.environ.get('BATTING_CSV', 'batting.csv') | ||
| df = pd.read_csv(path, index_col=None, sep=',') | ||
| five_percent = int(0.01 * len(df)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be renamed
ci/pgload.sql
Outdated
| @@ -0,0 +1,69 @@ | |||
| CREATE EXTENSION IF NOT EXISTS file_fdw; | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole file should be removed
ibis/pandas/execution.py
Outdated
| raise ValueError( | ||
| 'More than one operation name found in {} class'.format(typename) | ||
| ) | ||
| return getattr(data.expanding(), operation_name.lower())() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can use a context here to handle this attribute stuff
|
|
||
| @execute_node.register(ops.StandardDev, SeriesGroupBy, SeriesGroupBy) | ||
| def execute_reduction_series_groupby_mask_std(op, data, mask, **kwargs): | ||
| return data.apply(lambda x, mask=mask.obj: x[mask[x.index]].std()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These both should use context.
|
|
||
| @execute_node.register(ops.CumulativeMax, pd.Series) | ||
| def execute_series_cummax(op, data, **kwargs): | ||
| return data.cummax() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add a test to make sure these do or don't work on SeriesGroupBy as well.
|
Reviewing this now. Sorry about the delay |
ibis/pandas/context.py
Outdated
|
|
||
|
|
||
| @six.add_metaclass(abc.ABCMeta) | ||
| class Context(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit confused about terminology. What is Context, maybe WindowType?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this isn't named very well.
The idea this encapsulates is really something like AggregationContext.
For any particular aggregation such as sum, mean, etc we need to decide based on the presence or absence of other expressions like group_by and order_by whether we should call groupby(...).transform, groupby(...).apply, groupby(...).expanding().<method>, groupby(...).rolling().<method> or just groupby(...).<method> on the data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. The name doesn't matter too much for now as long as it's explained what the thing is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add documentation to explain this module and it's classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| result = expr.execute() | ||
|
|
||
| columns = ['G', 'yearID'] | ||
| more_values = batting_df[columns].sort_values('yearID').G.rolling(5).sum() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this will be able to easily handle temporal expressions like "5 days" here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can extend ibis.window to accept time intervals (probably implemented as an interval type) which would be translated to pandas objects and passed into rolling which accepts offsets as of 0.19.0
|
LGTM modulo comment about internal terminology |
|
Merging on green. |
No description provided.