Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix analytic ops over ungrouped and unordered windows on Pandas backend #2376

Merged

Conversation

timothydijamco
Copy link
Contributor

Overview

Using analytic ops (like lead, lag, first, last, or analytic UDFs) over a window where no group_by and no order_by are defined produces an incorrect result.

Example

Analytic op lag over an ungrouped+unordered window

import pandas as pd
import ibis

client = ibis.pandas.connect({'df': pd.DataFrame({'id': [1, 2, 3], 'value_col': [16, 92, 51]})})
table = client.table('df')

window = ibis.window()
expr = table.value_col.lag().over(window)  # lag operation over window
df = expr.execute()

df.to_markdown()

Before this PR

The result of the analytic op is a Series. That Series result is strangely copied n times, where n is the number of rows of the original column.

0
0 0 NaN
1 16.0
2 92.0
Name: value_col, dtype: float64
1 0 NaN
1 16.0
2 92.0
Name: value_col, dtype: float64
2 0 NaN
1 16 .0
2 92.0
Name: value_col, dtype: float64

After this PR

The Series result of the analytic op is treated as the end result of using the analytic op over the window.

value_col
0 nan
1 16
2 92

As additional verification that this is the correct result: the above table is what we get when we use the exact same code but with window = ibis.window(order_by=[table.id]) (both before and after this PR).

Testing

New tests in test_window.py to exercise the four kinds of windows:

  1. Window with group_by and order_by
  2. Window with group_by only
  3. Window with order_by only
  4. Window with neither group_by nor order_by

New test in test_vectorized_udf.py to test using an analytic op and adding the result to a table using mutate. This implicitly executes the analytic op over a window. (Also a few additional general UDF tests for completeness)

@timothydijamco
Copy link
Contributor Author

Going to need to investigate a bit into why the tests are failing on the Spark backend

@jreback jreback added pandas The pandas backend window functions Issues or PRs related to window functions labels Sep 12, 2020
@jreback jreback added this to the Next Bugfix Release milestone Sep 12, 2020
@timothydijamco
Copy link
Contributor Author

There are some differences in how backends view a window spec like this:

ibis.window(order_by=order_by)

Backends that treat this as a completely unbounded window (applies the function over ALL rows in the table):

  • Pandas
  • PySpark
  • Postgres
  • MySQL
  • SQLite
  • Clickhouse

Backends that treats this as an expanding window (apply the function over all rows up until the current row):

  • Spark
  • Impala

This is sort of orthogonal to this PR (this is hit by a test I added in this PR for completeness but not for testing the change being done) so this could be addressed in a separate PR.

But to confirm, behavior across backends should be consistent regardless of how the backend itself treats windows in its non-Ibis API? If not, (if we want to accept the difference) I can just change the tests here to reflect the difference.

@icexelloss
Copy link
Contributor

Can you please add a release note?

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. Some minor comments.

@timothydijamco
Copy link
Contributor Author

timothydijamco commented Sep 14, 2020

About the behavior difference I mentioned above: Created #2378 to track making ungrouped, ordered windows consistent across backends. I will x-fail that test for Impala and Spark for this PR.

@timothydijamco
Copy link
Contributor Author

Still have a couple of tests I'm still troubleshooting: lag and lead over empty window spec on Spark backend leads to results that don't match other backends

I need to confirm this is a problem with the Spark backend and not something that should be changed with the test (strangely, Spark inherits from Impala for this, yet Impala seems to pass for these tests)

@timothydijamco
Copy link
Contributor Author

I think I know the reason behind the failing tests on the Spark backend, but I haven't been able to figure out why the Impala backend doesn't fail in the same way, given that the Spark backend inherits from the Impala backend. I'm going to stop investigating further since this is orthogonal to the main part of the PR anyways. I opened #2381 to look into this further

Copy link
Contributor

@icexelloss icexelloss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. +1

@@ -30,12 +30,19 @@
from ibis.pandas.execution import util


def _post_process_empty(scalar, parent, order_by, group_by):
def _post_process_empty(result, parent, order_by, group_by):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you type these args (and return type of this function)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

result.index = index
return result
if isinstance(result, pd.Series):
# `result` is a Series e.g. when an analytic operation is being
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elabort when this happens (e.g. what casuses this result to be a Series vs a scalar). Is this better handled in a dispatch rule though? e.g. to differentiate these cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll make the comments a little clearer

I think if all the _post_process_xxxx functions could be made into a dispatch rule that would be pretty nice, but some of them exactly share the same types of input parameters

In terms of a dispatch rule for just _post_process_empty, I don't think it'd be any clearer because I think the dispatch rule for the case that result is a scalar would have to be @_post_process_empty.register(Any), which doesn't clearly indicate that that is the scalar case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK this ^ has been done

ibis/tests/all/test_vectorized_udf.py Show resolved Hide resolved
@@ -323,25 +329,120 @@ def test_bounded_preceding_windows(backend, alltypes, df, con, window_fn):
),
],
)
@pytest.mark.parametrize(
('ordered'), [param(True, id='orderered'), param(False, id='unordered')],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these ids different for a reason? why are you using param here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this actually doesn't need param or ids. I'll change it to just [True, False]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this makes the test output a bit nicer, are you OK with keeping this as is?

# Without ids
XFAIL ibis/tests/all/test_window.py::test_unbounded_window_grouped[Pandas-True-mean]

vs

# With ids
XFAIL ibis/tests/all/test_window.py::test_unbounded_window_grouped[Pandas-orderered-mean]

ibis/tests/all/test_window.py Show resolved Hide resolved
@jreback jreback merged commit 21565e3 into ibis-project:master Sep 16, 2020
@jreback
Copy link
Contributor

jreback commented Sep 16, 2020

thanks @timothydijamco

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pandas The pandas backend window functions Issues or PRs related to window functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants