window expression not allowed in aggregation: allow chained .over() and .rolling() aggregations #12051
Comments
There is a dedicated df.rolling(by="train_line", index_column="dt", period="2d").agg(
pl.col("num_passengers").mean()
)
# shape: (6, 3)
# ┌────────────┬─────────────────────┬────────────────┐
# │ train_line ┆ dt ┆ num_passengers │
# │ --- ┆ --- ┆ --- │
# │ str ┆ datetime[μs] ┆ f64 │
# ╞════════════╪═════════════════════╪════════════════╡
# │ a ┆ 2020-01-01 13:45:48 ┆ 3.0 │
# │ a ┆ 2020-01-01 16:45:09 ┆ 4.0 │
# │ a ┆ 2020-01-02 18:12:48 ┆ 5.666667 │
# │ a ┆ 2020-01-08 23:16:43 ┆ 1.0 │
# │ b ┆ 2020-01-01 16:42:13 ┆ 7.0 │
# │ b ┆ 2020-01-03 19:45:32 ┆ 2.0 │
# └────────────┴─────────────────────┴────────────────┘ |
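To make the semantics of the output above concrete, here is a plain-Python sketch of a grouped 2-day rolling mean. It is not how Polars implements `rolling`. The rows are reconstructed from the outputs printed in this thread, and the window is assumed to be `(dt - period, dt]` (left edge excluded, current row included). The helper name `rolling_mean_by_group` is made up for illustration.

```python
from datetime import datetime, timedelta

# Rows reconstructed from the outputs in this thread:
# (train_line, dt, num_passengers)
rows = [
    ("a", datetime(2020, 1, 1, 13, 45, 48), 3),
    ("a", datetime(2020, 1, 1, 16, 45, 9), 5),
    ("a", datetime(2020, 1, 2, 18, 12, 48), 9),
    ("a", datetime(2020, 1, 8, 23, 16, 43), 1),
    ("b", datetime(2020, 1, 1, 16, 42, 13), 7),
    ("b", datetime(2020, 1, 3, 19, 45, 32), 2),
]


def rolling_mean_by_group(rows, period):
    """Hypothetical helper: for each row, average the values of the same
    group whose timestamp falls in the assumed window (dt - period, dt]."""
    out = []
    for line, dt, _ in rows:
        window = [
            n
            for other_line, other_dt, n in rows
            if other_line == line and dt - period < other_dt <= dt
        ]
        # The window always contains the current row, so it is never empty.
        out.append((line, dt, sum(window) / len(window)))
    return out


result = rolling_mean_by_group(rows, timedelta(days=2))
for line, dt, mean in result:
    print(line, dt, round(mean, 6))
```

Under these assumptions the means come out as 3.0, 4.0, 5.666667, 1.0 for line `a` and 7.0, 2.0 for line `b`, matching the table above.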
@cmdlineluser appreciate the pointer! I may have oversimplified my example. The motivating case I encountered is a dynamic threshold with an Say we have:
How would I filter to train departures that had more than the threshold number of riders for every ride in the last 2 days? I can't use |
That would be especially helpful if we can use expressions like that
|
Do you mean something like this, @raoulj?
df.rolling("departure_time", by="train_line", period="2d").agg(
pl.exclude("departure_time"),
all = (pl.col("num_passengers") > pl.col("threshold")).all()
)
# shape: (6, 5)
# ┌────────────┬─────────────────────┬────────────────┬───────────┬───────┐
# │ train_line ┆ departure_time ┆ num_passengers ┆ threshold ┆ all │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ datetime[μs] ┆ list[i64] ┆ list[i64] ┆ bool │
# ╞════════════╪═════════════════════╪════════════════╪═══════════╪═══════╡
# │ a ┆ 2020-01-01 13:45:48 ┆ [3] ┆ [2] ┆ true │
# │ a ┆ 2020-01-01 16:45:09 ┆ [3, 5] ┆ [2, 1] ┆ true │
# │ a ┆ 2020-01-02 18:12:48 ┆ [3, 5, 9] ┆ [2, 1, 3] ┆ true │
# │ a ┆ 2020-01-08 23:16:43 ┆ [1] ┆ [4] ┆ false │
# │ b ┆ 2020-01-01 16:42:13 ┆ [7] ┆ [4] ┆ true │
# │ b ┆ 2020-01-03 19:45:32 ┆ [2] ┆ [2] ┆ false │
# └────────────┴─────────────────────┴────────────────┴───────────┴───────┘
Apologies if I've misunderstood. |
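For readers who want to check the `all` column by hand, here is a plain-Python sketch of the same idea: for each departure, test whether every ride on the same line in the preceding 2 days exceeded its own threshold. The rows are reconstructed from the table above. The `(dt - period, dt]` window and the helper name `all_above_threshold` are assumptions for illustration, not part of Polars.

```python
from datetime import datetime, timedelta

# Rows reconstructed from the table above:
# (train_line, departure_time, num_passengers, threshold)
rows = [
    ("a", datetime(2020, 1, 1, 13, 45, 48), 3, 2),
    ("a", datetime(2020, 1, 1, 16, 45, 9), 5, 1),
    ("a", datetime(2020, 1, 2, 18, 12, 48), 9, 3),
    ("a", datetime(2020, 1, 8, 23, 16, 43), 1, 4),
    ("b", datetime(2020, 1, 1, 16, 42, 13), 7, 4),
    ("b", datetime(2020, 1, 3, 19, 45, 32), 2, 2),
]


def all_above_threshold(rows, period):
    """Hypothetical helper: one flag per row, True when every ride of the
    same line in the assumed window (dt - period, dt] beat its threshold."""
    flags = []
    for line, dt, _, _ in rows:
        flags.append(
            all(
                n > thr
                for other_line, other_dt, n, thr in rows
                if other_line == line and dt - period < other_dt <= dt
            )
        )
    return flags


flags = all_above_threshold(rows, timedelta(days=2))

# Keep only the departures whose whole window passed the check.
kept = [(line, dt) for (line, dt, _, _), ok in zip(rows, flags) if ok]
print(flags)
```

The flags agree with the `all` column above, and filtering on them answers the original question about which departures to keep.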
No that is exactly what I wanted! Didn't know about DataFrame.rolling(). Thank you for pointing me in the right direction. Is this |
( |
Okay. So the I do like the |
Not sure if related, but today I was trying to do a complex operation: pl.col('value').pct_change().over('entityId').rank().over('date') On a table a bit like:
The expression is invalid due to
But not clear how to do the same at the |
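One way to see what the chained expression is meant to compute is to run the two window steps separately, materializing the first result before the second. The plain-Python sketch below does that on hypothetical data (the table from the comment did not survive, so the values here are invented for illustration). Ties get ordinal ranks here, which may differ from the library's tie handling. The same split presumably carries over to Polars by storing the first window result as a column before applying the second, though nobody confirms that in this thread.

```python
from collections import defaultdict

# Hypothetical data, invented for illustration: (entityId, date, value),
# already sorted by date within each entity.
rows = [
    ("x", 1, 10.0), ("x", 2, 11.0), ("x", 3, 12.1),
    ("y", 1, 20.0), ("y", 2, 25.0), ("y", 3, 20.0),
    ("z", 1, 5.0), ("z", 2, 5.25), ("z", 3, 10.5),
]

# Step 1: percent change within each entity. The first row of an entity
# has no predecessor, so its change is None.
previous = {}
pct = []
for entity, _, value in rows:
    last = previous.get(entity)
    pct.append(None if last is None else (value - last) / last)
    previous[entity] = value

# Step 2: rank the materialized percent changes within each date
# (1 = smallest). Rows without a percent change stay unranked.
by_date = defaultdict(list)
for i, (_, date, _) in enumerate(rows):
    if pct[i] is not None:
        by_date[date].append(i)

rank = [None] * len(rows)
for indices in by_date.values():
    for position, i in enumerate(sorted(indices, key=lambda j: pct[j]), start=1):
        rank[i] = position

for (entity, date, _), p, r in zip(rows, pct, rank):
    print(entity, date, p, r)
```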
I've run into this issue as well. As a workaround, what I do is ensure my expression that does the It's a bit of a footgun; I would love to see this limitation removed. |
I would like to add some traffic to this feature request. |
May I ask: is it just a matter of someone implementing it, or are there open discussion points to address first? |
@lorentzenchr I don't know the answer, but one of the comments in the rolling rewrite proposal did suggest it would also close this issue. From what I've read, I think it's just that the devs are currently busy implementing "Polars Cloud" and the new streaming engine, which have since taken higher priority. |
Description
Consider the following DataFrame:
I want to get the rolling average, at each departure, of the last 2 days' worth of departures for each train line. Here's how I thought I would do that:
But, doing this, I currently get
InvalidOperationError: window expression not allowed in aggregation