ENH: Faster grouped rolling and expanding operations in the pandas backend #1549

Closed
cpcloud wants to merge 37 commits into master from cpcloud:faster-moving

Conversation

3 participants
@cpcloud
Member

cpcloud commented Jul 24, 2018

  • Also improves support for time series based rolling and expanding operations
  • Includes behind the scenes handling of indexes, so that operations
    that need them get them, but users never have to interact with them.
  • Includes a performance improvement for grouped and ordered operations
  • Forces joins to be materialized (this is a bugfix)
  • Calls apply_to on results in the pandas backend. This is complicated by the use of NaN as a missing value: operations whose output type is a nullable integer will produce a float64 array in the presence of NaNs. We have to think about how to handle this in a reasonable way.
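
The nullability problem mentioned in the last bullet can be reproduced with plain pandas. The snippet below is a minimal sketch using made-up data, not code from this PR; it only shows the default upcast that makes a strict integer cast on results awkward.

```python
import numpy as np
import pandas as pd

# Integer data without missing values keeps an integer dtype.
complete = pd.Series([1, 2, 3])

# A single missing value forces the default upcast to float64,
# because NaN is a floating point value.
with_missing = pd.Series([1, 2, np.nan])

# Casting back to a plain integer dtype fails while NaNs are present.
try:
    with_missing.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

print(complete.dtype, with_missing.dtype, cast_failed)
```

Any code that casts results to the declared schema type would have to decide what to do with such columns, which is the open question raised above.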

cpcloud added some commits Jul 20, 2018

@kszucs

Member

kszucs commented Jul 24, 2018

@cpcloud should I review?

@cpcloud cpcloud added this to the 0.14 milestone Jul 24, 2018

@cpcloud cpcloud added this to To do in Pandas via automation Jul 24, 2018

@cpcloud cpcloud added this to To do in Refactoring via automation Jul 24, 2018

@cpcloud

Member

cpcloud commented Jul 24, 2018

@kszucs Please do, whenever you have a chance.

cpcloud added some commits Jul 24, 2018

@@ -731,6 +731,9 @@ class SortExpr(Expr):
    def _type_display(self):
        return 'array-sort'
    def get_name(self):

@kszucs

kszucs Jul 24, 2018

Member

Is this name propagation only used for Sort expressions? It suggests that somewhere we don't have a clear distinction between expressions and operations.

@cpcloud

cpcloud Jul 24, 2018

Member

No, get_name is a method on Expr.

@kszucs

kszucs Jul 24, 2018

Member

self.op().expr.get_name() in this case

@cpcloud

cpcloud Jul 25, 2018

Member

I'll change this to not use resolve_name and open an issue to move resolve_name into Node (instead of only in ValueOp, where it is now).

@cpcloud

cpcloud Jul 25, 2018

Member

Created #1552 to track this.

    return lambda data, function=function, args=args, kwargs=kwargs: (
        function(data, *args, **kwargs)
    )
class Summarize(AggregationContext):
    __slots__ = ()

@kszucs

kszucs Jul 24, 2018

Member

AFAIK __slots__ are not inherited; they must be defined in each child class.

@cpcloud

cpcloud Jul 24, 2018

Member

Yep, I believe that is the case in this module.
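
For readers following the __slots__ point: a subclass that omits __slots__ gets a per-instance __dict__ again, which is why each child class declares its own (possibly empty) tuple. The sketch below uses hypothetical class names, not the classes from this module.

```python
class Context:
    # The parent reserves storage for exactly one attribute.
    __slots__ = ("parent",)

    def __init__(self, parent=None):
        self.parent = parent


class SlottedChild(Context):
    # Declaring an empty __slots__ keeps instances free of __dict__.
    __slots__ = ()


class PlainChild(Context):
    # No __slots__ here, so instances get a __dict__ again.
    pass


slotted_has_dict = hasattr(SlottedChild(), "__dict__")
plain_has_dict = hasattr(PlainChild(), "__dict__")

# Arbitrary attributes are rejected only on the fully slotted class.
try:
    SlottedChild().extra = 1
    slotted_rejects_extra = False
except AttributeError:
    slotted_rejects_extra = True

print(slotted_has_dict, plain_has_dict, slotted_rejects_extra)
```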

type(self).__name__
group_by = self.group_by
if not group_by:

@kszucs

kszucs Jul 24, 2018

Member

Could you add a couple of one-line comments? There are four agg branches, and they are not necessarily straightforward.

keys = group_by + order_by
frame = self.parent.obj
name = grouped_data.obj.name
indexed_series = frame[keys + [name]].set_index(

@kszucs

kszucs Jul 24, 2018

Member

This block is pretty terse, comments? :)
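
To make the terse block easier to follow, here is a rough sketch of the general pattern it appears to use: move the grouping and ordering keys into the index, then run the window operation per group. The set_index call in the hunk is cut off, so its arguments and everything after it are assumptions, and the frame and column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical input: one group column, one ordering column, one value column.
frame = pd.DataFrame({
    "g": ["a", "a", "a", "b", "b"],
    "t": [1, 2, 3, 1, 2],
    "v": [1.0, 2.0, 3.0, 10.0, 20.0],
})
group_by = ["g"]
order_by = ["t"]
keys = group_by + order_by
name = "v"

# Keep only the columns that are needed and move the keys into the index,
# leaving a Series of values indexed by (group, order).
indexed_series = frame[keys + [name]].set_index(keys)[name]

# Group on the index level(s) and compute a rolling mean within each group.
rolled = indexed_series.groupby(level=group_by).rolling(2).mean()

print(rolled)
```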

@@ -360,7 +360,13 @@ def execute(self, query, params=None, limit='default', async=False):
         )
         assert isinstance(query, ir.Expr)
-        return execute(query, params=params)
+        result = execute(query, params=params)
+        if isinstance(result, pd.DataFrame):

@kszucs

kszucs Jul 24, 2018

Member

A comment about removing the indices?
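
As a possible illustration of what removing the indices does here: reset_index turns index levels back into ordinary columns, and selecting by the schema's names restores the expected column order. The list of names below is a hypothetical stand-in for query.schema().names, and the data is made up.

```python
import pandas as pd

# Hypothetical stand-in for query.schema().names.
schema_names = ["key", "value", "total"]

# An intermediate result whose grouping key ended up in the index.
result = pd.DataFrame(
    {"total": [3, 7], "value": [1, 2]},
    index=pd.Index(["a", "b"], name="key"),
)

# Move the index back into a column, then reorder to match the schema.
flat = result.reset_index()[schema_names]

print(flat)
```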

result = execute(query, params=params)
if isinstance(result, pd.DataFrame):
    schema = query.schema()
    return result.reset_index()[schema.names]

@kszucs

kszucs Jul 24, 2018

Member

Schema.apply_to()?

@cpcloud

cpcloud Jul 25, 2018

Member

I reverted this commit because it introduced a whole host of other issues related to casting and nullability of integer data (pandas casts integers with null values to float64).

@@ -52,7 +52,7 @@ def execute_decimal_log2(op, data, **kwargs):
     return decimal.Decimal('NaN')
-@execute_node.register(ops.UnaryOp, decimal.Decimal)
+@execute_node.register((ops.UnaryOp, ops.Negate), decimal.Decimal)

@kszucs

kszucs Jul 24, 2018

Member

Isn't Negate a UnaryOp?

@cpcloud

cpcloud Jul 24, 2018

Member

Yep. I was playing around with some performance improvements related to multipledispatch 0.5.0, but they aren't strictly necessary. I'll remove this.

@kszucs

kszucs Jul 25, 2018

Member

I mean we should prefer the dict lookups; it could just be mentioned in a comment or at the top of the modules. It's more explicit too.
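
The difference between relying on subclass resolution and registering a type explicitly can be sketched with the standard library's functools.singledispatch. This is a swapped-in stand-in, not multipledispatch, and the classes below are hypothetical; it only illustrates the general idea of an exact registry hit versus a fallback through the class hierarchy.

```python
from functools import singledispatch


class UnaryOp:
    pass


class Negate(UnaryOp):
    pass


@singledispatch
def execute_node(op):
    raise NotImplementedError(type(op).__name__)


@execute_node.register(UnaryOp)
def execute_unary(op):
    return "unary"


# Negate has no entry of its own in the registry...
registered_before = Negate in execute_node.registry

# ...but dispatch still resolves it through its base class.
resolved = execute_node(Negate())

# Registering Negate explicitly gives it an exact entry in the registry.
execute_node.register(Negate, execute_unary)
registered_after = Negate in execute_node.registry

print(registered_before, resolved, registered_after)
```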

@@ -139,17 +150,33 @@ def execute_cast_series_date(op, data, type, **kwargs):
    raise TypeError("Don't know how to cast {} to {}".format(from_type, type))
@execute_node.register(ops.SortKey, pd.Series, bool)
def execute_sort_key_series_bool(op, data, ascending, **kwargs):
    return data

@kszucs

kszucs Jul 24, 2018

Member

Is it supposed to actually sort the series?

@cpcloud

cpcloud Jul 24, 2018

Member

No. This dispatch exists so that I don't have to special-case it in ibis/pandas/execution/util.py.

@execute_node.register(
    (ops.Comparison, ops.Add, ops.Multiply), six.string_types, pd.Series)
@execute_node.register(
    (ops.Comparison, ops.Add), six.string_types, six.string_types)

@kszucs

kszucs Jul 24, 2018

Member

Nice set of rules!

@@ -73,31 +74,31 @@ def convert_to_offset(n):
     return data.apply(convert_to_offset)
-@execute_node.register(ops.TimestampAdd, datetime.datetime, datetime.timedelta)
+@execute_node.register(ops.TimestampAdd, timestamp_types, timedelta_types)

@kszucs

kszucs Jul 24, 2018

Member

If the timestamp and timedelta literals are properly converted to pd.Timestamp and pd.Timedelta beforehand, then defining an execute_node for BinaryOp (parent of all temporal binary ops) might be enough - not sure though.

@cpcloud

cpcloud Jul 24, 2018

Member

Yep, this was related to the perf optimization. I can remove it.
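
One relevant fact for this thread: the pandas scalar types subclass the standard library ones, so a rule written against datetime.datetime and datetime.timedelta also matches pd.Timestamp and pd.Timedelta under isinstance-style resolution. The snippet is a small check of that relationship with made-up values; it makes no claim about how ibis converts literals.

```python
import datetime

import pandas as pd

ts = pd.Timestamp("2018-07-24")
delta = pd.Timedelta(days=1)

# pandas scalars are subclasses of the standard library types.
is_datetime = isinstance(ts, datetime.datetime)
is_timedelta = isinstance(delta, datetime.timedelta)

# Addition gives the same instant with either pair of types.
py_result = datetime.datetime(2018, 7, 24) + datetime.timedelta(days=1)
pd_result = ts + delta

print(is_datetime, is_timedelta, pd_result == py_result)
```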

expected = df[['plain_datetimes_naive', 'dup_strings']].set_index(
'plain_datetimes_naive').squeeze().tshift(
freq=execute(-range_offset)).reindex(
df.plain_datetimes_naive).reset_index(drop=True)

@kszucs

kszucs Jul 24, 2018

Member

Multiple expressions?
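
Splitting the chained expression into named steps might look like the sketch below. The frame, column names, and offset are invented for illustration, and Series.shift(freq=...) is used in place of tshift, which has been removed from recent pandas releases; treat it as one possible reading of the test, not the test itself.

```python
import pandas as pd

# Hypothetical stand-ins for the test frame and the executed range offset.
df = pd.DataFrame({
    "ts": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
    "val": ["a", "b", "c"],
})
offset = pd.Timedelta(days=1)

# Step 1: index the values by timestamp and reduce to a Series.
indexed = df[["ts", "val"]].set_index("ts").squeeze()

# Step 2: shift the index (not the values) by the offset.
shifted = indexed.shift(freq=offset)

# Step 3: look up each original timestamp in the shifted series.
aligned = shifted.reindex(df["ts"])

# Step 4: drop the timestamp index so the result lines up positionally.
expected = aligned.reset_index(drop=True)

print(expected)
```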

cpcloud added some commits Jul 25, 2018

Revert "Apply to with schema"
This reverts commit b63936d.

@cpcloud cpcloud force-pushed the cpcloud:faster-moving branch from 04a107d to 285a589 Jul 27, 2018

cpcloud added some commits Jul 27, 2018

@cpcloud cpcloud closed this in 9624af8 Jul 29, 2018

Refactoring automation moved this from To do to Done Jul 29, 2018

Pandas automation moved this from To do to Done Jul 29, 2018

@cpcloud cpcloud deleted the cpcloud:faster-moving branch Jul 29, 2018