ENH: Faster grouped rolling and expanding operations in the pandas backend #1549

Closed
wants to merge 37 commits into from

Member

@cpcloud cpcloud commented Jul 24, 2018

  • Also improves support for time series based rolling and expanding operations
  • Includes behind the scenes handling of indexes, so that operations
    that need them get them, but users never have to interact with them.
  • Includes a performance improvement for grouped and ordered operations
  • Forces joins to be materialized (this is a bugfix)
  • Calls apply_to on results in the pandas backend. This is complicated by the use of NaN as a missing value: operations whose output type is a nullable integer will produce a float64 array in the presence of NaNs. We have to think about how to handle this in a reasonable way.
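
The NaN complication in the last bullet can be reproduced with plain pandas. This is a minimal sketch with made-up data, not code from this PR: an integer Series is reindexed onto a label that does not exist, which introduces a missing value and changes the dtype.

```python
import pandas as pd

# An integer column with no missing values keeps its integer dtype.
ints = pd.Series([1, 2, 3], dtype="int64", name="x")

# Reindexing onto a label that does not exist (5) introduces a missing
# value. NaN is a float, so the whole column is upcast to float64.
with_missing = ints.reindex([0, 1, 5])

print(ints.dtype)
print(with_missing.dtype)
```

Any operation that can produce missing rows (shifts, window functions with a minimum period, outer joins) can trigger the same upcast, which is why applying an integer schema to the result is not straightforward.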

@kszucs
Member

kszucs commented Jul 24, 2018

@cpcloud should I review?

@cpcloud cpcloud added this to the 0.14 milestone Jul 24, 2018
@cpcloud cpcloud added labels Jul 24, 2018: feature (Features or general enhancements), expressions (Issues or PRs related to the expression API), pandas (The pandas backend), performance (Issues related to ibis's performance)
@cpcloud
Member Author

cpcloud commented Jul 24, 2018

@kszucs Please do, whenever you have a chance.

@@ -731,6 +731,9 @@ class SortExpr(Expr):
    def _type_display(self):
        return 'array-sort'

    def get_name(self):
Member

Is this name propagation only used for Sort expressions? It suggests that somewhere we don't have a clear distinction between expressions and operations.

Member Author

No, get_name is a method on Expr.

Member

self.op().expr.get_name() in this case

Member Author

I'll change this to not use resolve_name and open an issue to move resolve_name into Node (instead of only in ValueOp, where it is now).

Member Author

Created #1552 to track this.

    return lambda data, function=function, args=args, kwargs=kwargs: (
        function(data, *args, **kwargs)
    )


class Summarize(AggregationContext):

    __slots__ = ()
Member

AFAIK __slots__ are not inherited; they must be defined in each child class.

Member Author

Yep, I believe that is the case in this module.
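
The point about `__slots__` can be checked with a small standalone example. The class names below are hypothetical and unrelated to the ibis classes. Strictly speaking, the slot descriptors of a parent are inherited, but a subclass that omits `__slots__` gets a per-instance `__dict__` again, so the declaration has to be repeated to keep the memory savings.

```python
class Base:
    __slots__ = ("x",)


class WithSlots(Base):
    # No new attributes, and still no per-instance __dict__.
    __slots__ = ()


class WithoutSlots(Base):
    # Omitting __slots__ silently brings back a per-instance __dict__.
    pass


has_dict_with = hasattr(WithSlots(), "__dict__")
has_dict_without = hasattr(WithoutSlots(), "__dict__")

print(has_dict_with)     # False
print(has_dict_without)  # True
```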

type(self).__name__
group_by = self.group_by

if not group_by:
Member

@kszucs kszucs Jul 24, 2018

Could you add a couple of one-line comments? There are four agg branches, and they are not necessarily straightforward.

Member Author

Yes

keys = group_by + order_by
frame = self.parent.obj
name = grouped_data.obj.name
indexed_series = frame[keys + [name]].set_index(
Member

This block is pretty terse, comments? :)

Member Author

Yep

@@ -360,7 +360,13 @@ def execute(self, query, params=None, limit='default', async=False):
         )

         assert isinstance(query, ir.Expr)
-        return execute(query, params=params)
+        result = execute(query, params=params)
+        if isinstance(result, pd.DataFrame):
Member

A comment about removing the indices?

Member Author

Done

result = execute(query, params=params)
if isinstance(result, pd.DataFrame):
    schema = query.schema()
    return result.reset_index()[schema.names]
Member

Schema.apply_to()?

Member Author

Done

Member Author

I reverted this commit because it introduced a whole host of other issues related to casting and nullability of integer data (pandas casts integers with null values to float64).
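
The index-dropping step reviewed in this thread can be illustrated with plain pandas. This is a minimal sketch under stated assumptions: `schema_names` is a hypothetical stand-in for `query.schema().names`, and the frame is made-up data with a key in the index and an extra helper column.

```python
import pandas as pd

# Hypothetical stand-in for the column names of the expression's schema.
schema_names = ["key", "value"]

# A result whose grouping key ended up in the index, plus a helper column
# that is not part of the schema.
result = pd.DataFrame(
    {"value": [1.0, 2.0], "helper": [0, 1]},
    index=pd.Index(["a", "b"], name="key"),
)

# reset_index() turns the index back into an ordinary column; selecting by
# the schema's names then drops the helper and fixes the column order.
cleaned = result.reset_index()[schema_names]
print(cleaned)
```

The selection only reorders and drops columns. It does not touch dtypes, which is where the casting and nullability issues mentioned above come in.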

@@ -52,7 +52,7 @@ def execute_decimal_log2(op, data, **kwargs):
     return decimal.Decimal('NaN')


-@execute_node.register(ops.UnaryOp, decimal.Decimal)
+@execute_node.register((ops.UnaryOp, ops.Negate), decimal.Decimal)
Member

Isn't Negate a UnaryOp?

Member Author

Yep. I was playing around with some performance improvements related to multipledispatch 0.5.0, but they aren't strictly necessary. I'll remove this.

Member

@kszucs kszucs Jul 25, 2018

I mean we should prefer the dict lookups; you could just mention it in a comment or at the top of the module. It's more explicit too.
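
One way to read the "dict lookups" remark is sketched below. This is a hand-rolled toy registry, not multipledispatch's implementation, and all names in it are hypothetical. It only shows why listing a subclass explicitly can let a dispatcher resolve it with a direct dictionary hit instead of walking the class hierarchy.

```python
class UnaryOp:
    pass


class Negate(UnaryOp):
    pass


class Log2(UnaryOp):
    pass


registry = {}


def register(*types):
    """Register one implementation under every listed type."""
    def decorator(func):
        for t in types:
            registry[t] = func
        return func
    return decorator


def lookup(op_type):
    # Fast path: the exact type was registered, so one dict lookup suffices.
    if op_type in registry:
        return registry[op_type], "exact"
    # Slow path: walk the MRO until a registered base class is found.
    for base in op_type.__mro__[1:]:
        if base in registry:
            return registry[base], "mro"
    raise TypeError("no implementation for {}".format(op_type.__name__))


# Negate is listed explicitly even though it is already a UnaryOp.
@register(UnaryOp, Negate)
def execute_unary(op):
    return type(op).__name__


print(lookup(Negate)[1])  # exact
print(lookup(Log2)[1])    # mro
```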

@@ -139,17 +150,33 @@ def execute_cast_series_date(op, data, type, **kwargs):
    raise TypeError("Don't know how to cast {} to {}".format(from_type, type))


@execute_node.register(ops.SortKey, pd.Series, bool)
def execute_sort_key_series_bool(op, data, ascending, **kwargs):
    return data
Member

Is it supposed to actually sort the series?

Member Author

No. This dispatch exists so that I don't have to add a special case in ibis/pandas/execution/util.py.

@execute_node.register(
(ops.Comparison, ops.Add, ops.Multiply), six.string_types, pd.Series)
@execute_node.register(
(ops.Comparison, ops.Add), six.string_types, six.string_types)
Member

Nice set of rules!

@@ -73,31 +74,31 @@ def convert_to_offset(n):
     return data.apply(convert_to_offset)


-@execute_node.register(ops.TimestampAdd, datetime.datetime, datetime.timedelta)
+@execute_node.register(ops.TimestampAdd, timestamp_types, timedelta_types)
Member

If the timestamp and timedelta literals are properly converted to pd.Timestamp and pd.Timedelta beforehand, then defining an execute_node for BinaryOp (parent of all temporal binary ops) might be enough - not sure though.

Member Author

Yep, this was related to the perf optimization. I can remove it.
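
The conversion suggested in this thread can be sketched with plain pandas. This is an illustration with made-up literals, not code from the PR: standard-library datetime and timedelta values are converted to their pandas counterparts up front, so that a single implementation on the pandas types would cover them.

```python
import datetime

import pandas as pd

# Standard-library literals, as they might appear in an expression.
ts_literal = datetime.datetime(2018, 7, 24, 12, 0)
td_literal = datetime.timedelta(hours=3)

# Convert once, up front, to the pandas scalar types.
ts = pd.Timestamp(ts_literal)
td = pd.Timedelta(td_literal)

# Arithmetic now stays within pandas types.
result = ts + td
print(result)
```

`pd.Timestamp` subclasses `datetime.datetime` and `pd.Timedelta` subclasses `datetime.timedelta`, so code registered for the standard-library types would still accept the converted values.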

expected = df[['plain_datetimes_naive', 'dup_strings']].set_index(
'plain_datetimes_naive').squeeze().tshift(
freq=execute(-range_offset)).reindex(
df.plain_datetimes_naive).reset_index(drop=True)
Member

Multiple expressions?

try:
if isinstance(by, six.string_types):
Member

How can it be a string?

Member Author

It's a string if it's a group-by key. Sorting by the group-by and order-by keys beforehand is the key thing here that makes this much faster.
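
The shape of that idea can be sketched with plain pandas. This is an illustration on made-up data with hypothetical column names (`g` as the group key, `t` as the order key), not the PR's implementation: the frame is sorted once by both keys, after which each group's rows are contiguous and already in order for the window computation.

```python
import pandas as pd

df = pd.DataFrame({
    "g": ["b", "a", "b", "a", "a"],       # group-by key
    "t": [2, 3, 1, 1, 2],                 # order-by key
    "v": [10.0, 20.0, 30.0, 40.0, 50.0],  # values to aggregate
})

# Sort once by the group key followed by the order key.
sorted_df = df.sort_values(["g", "t"])

# Each group is now contiguous and ordered, so the rolling window can be
# applied per group without any further per-group sorting.
rolled = sorted_df.groupby("g")["v"].rolling(2, min_periods=1).sum()
print(rolled)
```

The result is indexed by the group key and the original row label, so it can be aligned back to the input frame if the original order is needed.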

@@ -104,28 +122,51 @@ def execute_window_op(op, data, window, scope=None, context=None, **kwargs):
        factory=OrderedDict,
    )

    # figure out what the dtype of the operand is
    operand_type = operand.type()
    if isinstance(operand_type, dt.Integer) and operand_type.nullable:
Member

Perhaps it's the responsibility of to_pandas()

Member Author

The problem there is that the dtypes would have to know about the operations. One option is to convert all nullable integer columns to float64, but that seems really disgusting.

3 participants