
window expression not allowed in aggregation: allow chained .over() and .rolling() aggregations #12051

Open
raoulj opened this issue Oct 26, 2023 · 13 comments
Labels
accepted Ready for implementation enhancement New feature or an improvement of an existing feature

Comments

@raoulj

raoulj commented Oct 26, 2023

Description

Consider the following DataFrame:

dates = [
    "2020-01-01 13:45:48",
    "2020-01-01 16:42:13",
    "2020-01-01 16:45:09",
    "2020-01-02 18:12:48",
    "2020-01-03 19:45:32",
    "2020-01-08 23:16:43",
]
df = pl.DataFrame({
    "departure_time": dates,
    "train_line": ["a", "b", "a", "a", "b", "a"],
    "num_passengers": [3, 7, 5, 9, 2, 1],
}).with_columns(
    pl.col("departure_time").str.strptime(pl.Datetime).set_sorted()
)
print(df)
shape: (6, 3)
┌─────────────────────┬────────────┬────────────────┐
│ departure_time      ┆ train_line ┆ num_passengers │
│ ---                 ┆ ---        ┆ ---            │
│ datetime[μs]        ┆ str        ┆ i64            │
╞═════════════════════╪════════════╪════════════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              │
└─────────────────────┴────────────┴────────────────┘

Suppose I want to get the rolling average, at each departure, of the last 2 days' worth of departures for each train line. Here's how I would expect to do that:

df.with_columns(pl.col('num_passengers').mean().over('train_line').rolling(index_column='departure_time', period='2d'))

But, doing this, I currently get InvalidOperationError: window expression not allowed in aggregation

@raoulj raoulj added the enhancement New feature or an improvement of an existing feature label Oct 26, 2023
@cmdlineluser
Contributor

There is a dedicated Expr.rolling_mean(), but there is currently an issue with it: #11225

df.rolling(by="train_line", index_column="departure_time", period="2d").agg(
   pl.col("num_passengers").mean()
)

# shape: (6, 3)
# ┌────────────┬─────────────────────┬────────────────┐
# │ train_line ┆ departure_time      ┆ num_passengers │
# │ ---        ┆ ---                 ┆ ---            │
# │ str        ┆ datetime[μs]        ┆ f64            │
# ╞════════════╪═════════════════════╪════════════════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ 3.0            │
# │ a          ┆ 2020-01-01 16:45:09 ┆ 4.0            │
# │ a          ┆ 2020-01-02 18:12:48 ┆ 5.666667       │
# │ a          ┆ 2020-01-08 23:16:43 ┆ 1.0            │
# │ b          ┆ 2020-01-01 16:42:13 ┆ 7.0            │
# │ b          ┆ 2020-01-03 19:45:32 ┆ 2.0            │
# └────────────┴─────────────────────┴────────────────┘

@raoulj
Author

raoulj commented Oct 26, 2023

@cmdlineluser appreciate the pointer! I may have oversimplified my example. The motivating case I encountered is a dynamic threshold with an .all() aggregation. In the spirit of the provided example:

Say we have:

┌─────────────────────┬────────────┬────────────────┬───────────┐
│ departure_time      ┆ train_line ┆ num_passengers ┆ threshold │
│ ---                 ┆ ---        ┆ ---            ┆ ---       │
│ datetime[μs]        ┆ str        ┆ i64            ┆ i64       │
╞═════════════════════╪════════════╪════════════════╪═══════════╡
│ 2020-01-01 13:45:48 ┆ a          ┆ 3              ┆ 2         │
│ 2020-01-01 16:42:13 ┆ b          ┆ 7              ┆ 4         │
│ 2020-01-01 16:45:09 ┆ a          ┆ 5              ┆ 1         │
│ 2020-01-02 18:12:48 ┆ a          ┆ 9              ┆ 3         │
│ 2020-01-03 19:45:32 ┆ b          ┆ 2              ┆ 2         │
│ 2020-01-08 23:16:43 ┆ a          ┆ 1              ┆ 4         │
└─────────────────────┴────────────┴────────────────┴───────────┘

How would I filter to train departures that had more than the threshold number of riders for every ride in the last 2 days? I can't use rolling_min because the threshold changes every departure.

@xyk2000

xyk2000 commented Oct 26, 2023

It would be especially helpful if we could use expressions like this:

pl.col().any_method().rolling().over()

@raoulj
Author

raoulj commented Oct 26, 2023

@xyk2000 messaging here so as not to clutter that other thread.

I'm not sure #12049 would help here? That issue is about breaking up the rolling API, which is different from allowing nested window functions like this.

@cmdlineluser
Contributor

Do you mean something like this @raoulj ?

df.rolling("departure_time", by="train_line", period="2d").agg(
   pl.exclude("departure_time"),
   all = (pl.col("num_passengers") > pl.col("threshold")).all()
)

# shape: (6, 5)
# ┌────────────┬─────────────────────┬────────────────┬───────────┬───────┐
# │ train_line ┆ departure_time      ┆ num_passengers ┆ threshold ┆ all   │
# │ ---        ┆ ---                 ┆ ---            ┆ ---       ┆ ---   │
# │ str        ┆ datetime[μs]        ┆ list[i64]      ┆ list[i64] ┆ bool  │
# ╞════════════╪═════════════════════╪════════════════╪═══════════╪═══════╡
# │ a          ┆ 2020-01-01 13:45:48 ┆ [3]            ┆ [2]       ┆ true  │
# │ a          ┆ 2020-01-01 16:45:09 ┆ [3, 5]         ┆ [2, 1]    ┆ true  │
# │ a          ┆ 2020-01-02 18:12:48 ┆ [3, 5, 9]      ┆ [2, 1, 3] ┆ true  │
# │ a          ┆ 2020-01-08 23:16:43 ┆ [1]            ┆ [4]       ┆ false │
# │ b          ┆ 2020-01-01 16:42:13 ┆ [7]            ┆ [4]       ┆ true  │
# │ b          ┆ 2020-01-03 19:45:32 ┆ [2]            ┆ [2]       ┆ false │
# └────────────┴─────────────────────┴────────────────┴───────────┴───────┘

Apologies if I've misunderstood.

@raoulj
Author

raoulj commented Oct 26, 2023

No, that is exactly what I wanted! I didn't know about DataFrame.rolling(). Thank you for pointing me in the right direction.

Is this by= parameter available on Expr.rolling()? I ask because the parameter is currently mentioned in the check_sorted docs. Looking at the src (which I am looking at for the first time), it looks like there's no by implementation even though check_sorted is an arg.

@cmdlineluser
Contributor

cmdlineluser commented Oct 26, 2023

#11445 (comment)

No.. because the by argument would need to reorder the other columns or this output. For the by argument case we need groupby_rolling (soon the rolling) context.

(DataFrame.group_by_rolling() was recently renamed to DataFrame.rolling())

@raoulj
Author

raoulj commented Oct 27, 2023

Okay, so the Expr.rolling() docs are temporarily incorrect while the new rolling context is developed. Thanks for the context.

I do like the .rolling().over() syntax. I'm unsure if that's in scope for the rolling context mentioned in the linked PR. Is there a public-facing roadmap anywhere where I could follow this effort?

@mkleinbort-ic

mkleinbort-ic commented Nov 22, 2023

Not sure if related, but today I was trying to do a complex operation:

pl.col('value').pct_change().over('entityId').rank().over('date')

On a table a bit like:

date        entityId  value
2020-01-01  "K"       7
2020-01-02  "K"       8
2020-01-03  "K"       9
2020-01-01  "G"       5
2020-01-02  "G"       12
2020-01-03  "G"       7

The expression is invalid due to

InvalidOperationError: window expression not allowed in aggregation

But it's not clear how to achieve the same thing at the pl.Expr level.

@kszlim
Contributor

kszlim commented Jan 11, 2024

I've run into this issue as well. As a workaround, I make sure the expression that does the rolling happens in an earlier with_columns, and instead of directly depending on that expression, I reference the output of the first rolling expression by name/alias.

It's a bit of a footgun, would love to see this limitation removed.

@t-ded
Contributor

t-ded commented May 16, 2024

I would like to add some traffic to this feature request.
It would be very convenient to have either pl.Expr.rolling().over() or a group_by parameter within the pl.Expr.rolling() function.
Currently, pl.DataFrame.rolling() is the only way to get the desired output with the group_by parameter for ops such as n_unique. This, however, forces the user to then join the result(s) back to the original frame when a new column is desired (and possibly raises the need to keep some form of index to join back on in specific cases). The inconvenience is even more prevalent when one wants to create multiple rolling-window-based features, either for different groupings or for different time settings of the rolling window.

@lorentzenchr
Contributor

May I ask: is it just a matter of someone implementing it, or are there open discussion points to address first?

@cmdlineluser
Contributor

@lorentzenchr I don't know the answer, but one of the comments in the rolling rewrite proposal did suggest it would also close this issue.

From what I've read, I think it's just that the devs are currently busy implementing "Polars Cloud" and the new streaming engine, which have since taken higher priority.
