
window expression not allowed in aggregation: allow chained .over() and .rolling() aggregations #12051

Open
raoulj opened this issue Oct 26, 2023 · 13 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@raoulj

raoulj commented Oct 26, 2023

Description

Consider the following DataFrame:

dates = [
    "2020-01-01 13:45:48",
    "2020-01-01 16:42:13",
    "2020-01-01 16:45:09",
    "2020-01-02 18:12:48",
    "2020-01-03 19:45:32",
    "2020-01-08 23:16:43",
]
df = pl.DataFrame({
    "departure_time": dates,
    "train_line": ["a", "b", "a", "a", "b", "a"],
    "num_passengers": [3, 7, 5, 9, 2, 1],
}).with_columns(
    pl.col("departure_time").str.strptime(pl.Datetime).set_sorted()
)
print(df)
shape: (6, 3)
┌─────────────────────┬────────────┬────────────────┐
│ departure_time      ┆ train_line ┆ num_passengers │
│ ---                 ┆ ---        ┆ ---            │
│ datetime[μs]        ┆ str        ┆ i64            │
╞═════════════════════╪════════════╪════════════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              │
└─────────────────────┴────────────┴────────────────┘

Suppose I want to get the rolling average, at each departure, of the last 2 days' worth of departures for each train line. Here's how I would expect to do that:

df.with_columns(pl.col('num_passengers').mean().over('train_line').rolling(index_column='departure_time', period='2d'))

But, doing this, I currently get InvalidOperationError: window expression not allowed in aggregation

@raoulj raoulj added the enhancement New feature or an improvement of an existing feature label Oct 26, 2023
@cmdlineluser
Contributor

There is a dedicated Expr.rolling_mean(), but there is currently an issue with it: #11225

df.rolling(by="train_line", index_column="departure_time", period="2d").agg(
   pl.col("num_passengers").mean()
)

# shape: (6, 3)
# ┌────────────┬─────────────────────┬────────────────┐
# │ train_line ┆ departure_time      ┆ num_passengers │
# │ ---        ┆ ---                 ┆ ---            │
# │ str        ┆ datetime[μs]        ┆ f64            │
# ╞════════════╪═════════════════════╪════════════════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ 3.0            │
# │ a          ┆ 2020-01-01 16:45:09 ┆ 4.0            │
# │ a          ┆ 2020-01-02 18:12:48 ┆ 5.666667       │
# │ a          ┆ 2020-01-08 23:16:43 ┆ 1.0            │
# │ b          ┆ 2020-01-01 16:42:13 ┆ 7.0            │
# │ b          ┆ 2020-01-03 19:45:32 ┆ 2.0            │
# └────────────┴─────────────────────┴────────────────┘

@raoulj
Author

raoulj commented Oct 26, 2023

@cmdlineluser appreciate the pointer! I may have oversimplified my example. The motivating case I encountered is a dynamic threshold with an .all() aggregation. In the spirit of the provided example:

Say we have:

┌─────────────────────┬────────────┬────────────────┬───────────┐
│ departure_time      ┆ train_line ┆ num_passengers ┆ threshold │
│ ---                 ┆ ---        ┆ ---            ┆ ---       │
│ datetime[μs]        ┆ str        ┆ i64            ┆ i64       │
╞═════════════════════╪════════════╪════════════════╪═══════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              ┆ 2         │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              ┆ 4         │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              ┆ 1         │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              ┆ 3         │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              ┆ 2         │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              ┆ 4         │
└─────────────────────┴────────────┴────────────────┴───────────┘

How would I filter to train departures that had more than the threshold number of riders for every ride in the last 2 days? I can't use rolling_min because the threshold changes every departure.

@xyk2000

xyk2000 commented Oct 26, 2023

It would be especially helpful if we could use expressions like this:

pl.col().any_method().rolling().over()

@raoulj
Author

raoulj commented Oct 26, 2023

@xyk2000 messaging here so as not to clutter that other thread.

I'm not sure #12049 would help here? That issue is about breaking up the rolling API, which is different from allowing nested window functions like this.

@cmdlineluser
Contributor

Do you mean something like this @raoulj ?

df.rolling("departure_time", by="train_line", period="2d").agg(
   pl.exclude("departure_time"),
   all = (pl.col("num_passengers") > pl.col("threshold")).all()
)

# shape: (6, 5)
# ┌────────────┬─────────────────────┬────────────────┬───────────┬───────┐
# │ train_line ┆ departure_time      ┆ num_passengers ┆ threshold ┆ all   │
# │ ---        ┆ ---                 ┆ ---            ┆ ---       ┆ ---   │
# │ str        ┆ datetime[μs]        ┆ list[i64]      ┆ list[i64] ┆ bool  │
# ╞════════════╪═════════════════════╪════════════════╪═══════════╪═══════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ [3]            ┆ [2]       ┆ true  │
# │ a          ┆ 2020-01-01 16:45:09 ┆ [3, 5]         ┆ [2, 1]    ┆ true  │
# │ a          ┆ 2020-01-02 18:12:48 ┆ [3, 5, 9]      ┆ [2, 1, 3] ┆ true  │
# │ a          ┆ 2020-01-08 23:16:43 ┆ [1]            ┆ [4]       ┆ false │
# │ b          ┆ 2020-01-01 16:42:13 ┆ [7]            ┆ [4]       ┆ true  │
# │ b          ┆ 2020-01-03 19:45:32 ┆ [2]            ┆ [2]       ┆ false │
# └────────────┴─────────────────────┴────────────────┴───────────┴───────┘

Apologies if I've misunderstood.

@raoulj
Author

raoulj commented Oct 26, 2023

No, that is exactly what I wanted! I didn't know about DataFrame.rolling(). Thank you for pointing me in the right direction.

Is this by= parameter available on Expr.rolling()? I ask because the parameter is currently mentioned in the check_sorted docs. Looking at the src (which I am looking at for the first time), it looks like there's no by implementation even though check_sorted is an arg.

@cmdlineluser
Contributor

cmdlineluser commented Oct 26, 2023

#11445 (comment)

No.. because the by argument would need to reorder the other columns or this output. For the by argument case we need groupby_rolling (soon the rolling) context.

(DataFrame.group_by_rolling() was recently renamed to DataFrame.rolling())

@raoulj
Author

raoulj commented Oct 27, 2023

Okay, so the Expr.rolling() docs are temporarily incorrect while the new rolling context is developed. Thanks for the context.

I do like the .rolling().over() syntax. I'm unsure if that's in scope for the rolling context mentioned in the linked PR. Is there a public-facing roadmap anywhere where I could follow this effort?

@mkleinbort-ic

mkleinbort-ic commented Nov 22, 2023

Not sure if related, but today I was trying to do a complex operation:

pl.col('value').pct_change().over('entityId').rank().over('date')

On a table a bit like:

date        entityId  value
2020-01-01  "K"       7
2020-01-02  "K"       8
2020-01-03  "K"       9
2020-01-01  "G"       5
2020-01-02  "G"       12
2020-01-03  "G"       7

The expression is invalid due to

InvalidOperationError: window expression not allowed in aggregation

But it's not clear how to achieve the same thing at the pl.Expr level.

@kszlim
Contributor

kszlim commented Jan 11, 2024

I've run into this issue as well. As a workaround, I make sure the expression that does the rolling happens in an earlier with_columns, and instead of directly depending on that expression, I reference the output of the first rolling expression by name/alias.

It's a bit of a footgun, would love to see this limitation removed.

@t-ded
Contributor

t-ded commented May 16, 2024

I would like to add some traffic to this feature request.
It would be very convenient to have either pl.Expr.rolling().over() or a group_by parameter within the pl.Expr.rolling() function.
Currently, pl.DataFrame.rolling() is the only way to get the desired output with the group_by parameter for ops such as n_unique. This, however, forces the user to then join the result(s) back to the original frame when a new column is desired (and possibly raises the need to keep some form of index to join back on in specific cases). The inconvenience is even more prevalent when one wants to create multiple rolling-window-based features, either for different groupings or for different time settings of the rolling window.

@lorentzenchr
Contributor

May I ask: is it just a matter of someone implementing it, or are there open discussion points to address first?

@cmdlineluser
Contributor

@lorentzenchr I don't know the answer, but one of the comments in the rolling rewrite proposal did suggest it would also close this issue.

From what I've read, I think it's just that the devs are currently busy implementing "Polars Cloud" and the new streaming engine, which have since taken higher priority.
