group_by_dynamic with offset computes wrong window starts when a DST time change happens just before the 1st window #15966
Comments
thanks for the report, will take a look |
Looking at this more closely, I think the current behaviour is correct. Or at least, it's behaving as documented. The docs say
And indeed:
In [16]: pl.Series([datetime(2023, 3, 26, 16, 56)], dtype=pl.Datetime('us', 'Europe/Paris')).dt.truncate('1d').dt.offset_by('6h')
Out[16]:
shape: (1,)
Series: '' [datetime[μs, Europe/Paris]]
[
2023-03-26 07:00:00 CEST
] |
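For readers without Polars at hand, the arithmetic behind that output can be sketched with the standard library alone. This is only an illustration of the two possible readings of "midnight plus 6h" on a spring-forward day, not Polars' implementation; the variable names are made up for the example.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

paris = ZoneInfo("Europe/Paris")

# Same timestamp as in the session above: 2023-03-26 is the spring-forward day.
ts = datetime(2023, 3, 26, 16, 56, tzinfo=paris)

# Truncate to the start of the local day (00:00 CET, before the clock change).
midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)

# Reading 1: add 6 hours of elapsed time (do the addition in UTC).
# The 02:00 -> 03:00 jump is crossed, so the wall clock shows 07:00.
absolute = (midnight.astimezone(timezone.utc) + timedelta(hours=6)).astimezone(paris)

# Reading 2: add 6 hours on the wall clock (Python's same-tzinfo arithmetic).
wall = midnight + timedelta(hours=6)

print(absolute)  # 2023-03-26 07:00:00+02:00
print(wall)      # 2023-03-26 06:00:00+02:00
```

The first reading matches the 07:00 CEST shown above; the second is what someone expecting a "06:00 local time" window start would have in mind.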
closing then as this looks expected, but thanks for the report! please do reach out if anything else trips you up |
Hi @MarcoGorelli, thanks for your analysis. I was on holiday, so I didn't answer in time. However, here are a few arguments you might want to consider. Let's look at this example:
from datetime import datetime, timedelta, UTC
import polars as pl
print(
pl.DataFrame(
data={
"t": pl.Series(
[
datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
]
)
.dt.cast_time_unit("ms")
.dt.convert_time_zone("Europe/Paris"),
"q": [10, 11, 12],
}
)
.set_sorted("t")
.group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
.agg([pl.sum("q").alias("q")])
)
This produces the following result:
Now, I mischievously add a single data point 28 years back in time, with the value 0:
from datetime import datetime, timedelta, UTC
import polars as pl
print(
pl.DataFrame(
data={
"t": pl.Series(
[
datetime(1997, 3, 30, 14, 56, tzinfo=UTC),
datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
]
)
.dt.cast_time_unit("ms")
.dt.convert_time_zone("Europe/Paris"),
"q": [0, 10, 11, 12],
}
)
.set_sorted("t")
.group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
.agg([pl.sum("q").alias("q")])
)
Now the results are all messed up:
As we can see, how the data is aggregated depends not only on the parameters but also on the data itself: a single data point affects the whole result. I cannot think of any use case where that is desired behavior. Moreover, I believe it conflicts with the expectation that the windows are fully defined by the arguments.
Furthermore, I don't fully agree that this is documented behavior. Let's follow the documented recipe:
I also think that the consequences are hard to foresee (I deployed that code in production for months, fully unaware of this edge case.) If you still think that this behavior is suitable in some circumstances, then adding a warning in the documentation would at least prevent users from making wrong assumptions. |
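To make the expectation "windows are fully defined by the arguments" concrete, here is a rough standard-library sketch, assuming each row's window start is computed on the local wall clock and independently of every other row. It is not how Polars computes windows; `window_start` and `aggregate` are hypothetical helpers written only for this illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

PARIS = ZoneInfo("Europe/Paris")


def window_start(ts, offset_hours=6):
    """Start of the daily window containing ts, anchored at 06:00 local time."""
    local = ts.astimezone(PARIS)
    start = local.replace(hour=offset_hours, minute=0, second=0, microsecond=0)
    if local < start:
        # Before 06:00 local: the row belongs to the previous day's window.
        start -= timedelta(days=1)
    return start


def aggregate(rows):
    """Sum q per window; each row is handled independently of the others."""
    totals = defaultdict(int)
    for ts, q in rows:
        totals[window_start(ts)] += q
    return dict(totals)


rows_2025 = [
    (datetime(2025, 1, 1, 5, 56, tzinfo=timezone.utc), 10),
    (datetime(2025, 1, 1, 10, 6, tzinfo=timezone.utc), 11),
    (datetime(2025, 1, 2, 5, 45, tzinfo=timezone.utc), 12),
]
old_row = (datetime(1997, 3, 30, 14, 56, tzinfo=timezone.utc), 0)

without_old = aggregate(rows_2025)
with_old = aggregate([old_row] + rows_2025)

for start, q in sorted(with_old.items()):
    print(start, q)
```

Under this assumption the 2025 windows start at 06:00 local time and keep the same sums whether or not the 1997 row is present, which is the data-independence argued for above.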
thanks for providing more context
|
To be honest, the default of It may be a good idea to redesign this a bit - it may need to be a breaking change as part of the 1.0 release, but if it's done right, it'll be for the better |
Checks
Reproducible example
Log output
Issue description
When a DST time change ("spring forward" or "fall back") happens between midnight and the first window start, all the window starts are offset incorrectly.
Expected behavior
Installed versions