-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Faster grouped rolling and expanding operations in the pandas backend #1549
Conversation
|
@cpcloud should I review? |
|
@kszucs Please do, whenever you have a chance. |
| @@ -731,6 +731,9 @@ class SortExpr(Expr): | |||
| def _type_display(self): | |||
| return 'array-sort' | |||
|
|
|||
| def get_name(self): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this name propagation only used for Sort expressions? It suggests that somewhere We don't have a clear distinction between expressions and operations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, get_name is a method on Expr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.op().expr.get_name() in this case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll change to not use resolve_name and make an issue to move resolve_name into Node (instead of just in ValueOp where it is now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created #1552 to track this.
| return lambda data, function=function, args=args, kwargs=kwargs: ( | ||
| function(data, *args, **kwargs) | ||
| ) | ||
|
|
||
|
|
||
| class Summarize(AggregationContext): | ||
|
|
||
| __slots__ = () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK __slots__ are not inherited, must be defined in each child class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I believe that is the case in this module.
ibis/pandas/aggcontext.py
Outdated
| type(self).__name__ | ||
| group_by = self.group_by | ||
|
|
||
| if not group_by: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could You use a couple of one liner comments? There are four, not necessarily straightforward agg branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
ibis/pandas/aggcontext.py
Outdated
| keys = group_by + order_by | ||
| frame = self.parent.obj | ||
| name = grouped_data.obj.name | ||
| indexed_series = frame[keys + [name]].set_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block is pretty terse, comments? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep
| @@ -360,7 +360,13 @@ def execute(self, query, params=None, limit='default', async=False): | |||
| ) | |||
|
|
|||
| assert isinstance(query, ir.Expr) | |||
| return execute(query, params=params) | |||
| result = execute(query, params=params) | |||
| if isinstance(result, pd.DataFrame): | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment about removing the indices?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| result = execute(query, params=params) | ||
| if isinstance(result, pd.DataFrame): | ||
| schema = query.schema() | ||
| return result.reset_index()[schema.names] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema.apply_to()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I reverted this commit because it introduce a whole host of other issues related to casting and nullability of integer data (pandas casts integers with null values to float64)
ibis/pandas/decimal.py
Outdated
| @@ -52,7 +52,7 @@ def execute_decimal_log2(op, data, **kwargs): | |||
| return decimal.Decimal('NaN') | |||
|
|
|||
|
|
|||
| @execute_node.register(ops.UnaryOp, decimal.Decimal) | |||
| @execute_node.register((ops.UnaryOp, ops.Negate), decimal.Decimal) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't Negate a UnaryOp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I was playing around with some performance improvements related to multipledispatch 0.5.0, but they aren't strictly necessary. I'll remove this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean We should prefer the dict lookups, just could mention it in a comment or at the top of the modules. It's more explicit too.
| @@ -139,17 +150,33 @@ def execute_cast_series_date(op, data, type, **kwargs): | |||
| raise TypeError("Don't know how to cast {} to {}".format(from_type, type)) | |||
|
|
|||
|
|
|||
| @execute_node.register(ops.SortKey, pd.Series, bool) | |||
| def execute_sort_key_series_bool(op, data, ascending, **kwargs): | |||
| return data | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it supposed to actually sort the series?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. This dispatch exists so that I don't have to special case in ibis/pandas/execution/util.py.
| @execute_node.register( | ||
| (ops.Comparison, ops.Add, ops.Multiply), six.string_types, pd.Series) | ||
| @execute_node.register( | ||
| (ops.Comparison, ops.Add), six.string_types, six.string_types) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice set of rules!
| @@ -73,31 +74,31 @@ def convert_to_offset(n): | |||
| return data.apply(convert_to_offset) | |||
|
|
|||
|
|
|||
| @execute_node.register(ops.TimestampAdd, datetime.datetime, datetime.timedelta) | |||
| @execute_node.register(ops.TimestampAdd, timestamp_types, timedelta_types) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the timestamp and timedelta literals are properly converted to pd.Timestamp and pd.Timedelta beforehand, then defining an execute_node for BinaryOp (parent of all temporal binary ops) might be enough - not sure though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, this was related to the perf optimization. I can remove it.
| expected = df[['plain_datetimes_naive', 'dup_strings']].set_index( | ||
| 'plain_datetimes_naive').squeeze().tshift( | ||
| freq=execute(-range_offset)).reindex( | ||
| df.plain_datetimes_naive).reset_index(drop=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple expressions?
| try: | ||
| if isinstance(by, six.string_types): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can it be a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it's a group by key. Sorting group by and order by keys before is the key thing here that makes this much faster.
| @@ -104,28 +122,51 @@ def execute_window_op(op, data, window, scope=None, context=None, **kwargs): | |||
| factory=OrderedDict, | |||
| ) | |||
|
|
|||
| # figure out what the dtype of the operand is | |||
| operand_type = operand.type() | |||
| if isinstance(operand_type, dt.Integer) and operand_type.nullable: | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps it's the responsibility of to_pandas()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem there is that the dtypes would have to know about the operations. One option is to convert all nullable integer columns to float64, but that seems really disgusting.
This reverts commit b63936d.
that need them get them, but users never have to interact with them.
CallsThis is complicated by the use ofapply_toon results in the pandas backendnanas a missing value. Operations whose output type is nullable integral, will produce afloat64array in the presence ofnans. We have to think about to handle this in a reasonable way.