Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

group_by_dynamic with offset computes wrong window starts when a DST time change happens just before the 1st window #15966

Open
2 tasks done
michelbl opened this issue Apr 30, 2024 · 6 comments
Assignees
Labels
A-timeseries Area: date/time functionality documentation Improvements or additions to documentation python Related to Python Polars

Comments

@michelbl
Copy link

michelbl commented Apr 30, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

from datetime import datetime, timedelta, UTC

import polars as pl

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2023, 3, 26, 14, 56, tzinfo=UTC),
                    datetime(2023, 3, 27, 14, 56, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [4, 5],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2023, 10, 29, 14, 56, tzinfo=UTC),
                    datetime(2023, 10, 30, 14, 56, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [4, 5],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

Log output

No output

Issue description

When a DST time change ("spring forward" or "fall back") happens between midnight and the first window start, then all the window starts are not offset correctly.

shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-03-26 07:00:00 CEST   ┆ 4   │
│ 2023-03-27 07:00:00 CEST   ┆ 5   │
└────────────────────────────┴─────┘
shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-10-29 05:00:00 CET    ┆ 4   │
│ 2023-10-30 05:00:00 CET    ┆ 5   │
└────────────────────────────┴─────┘

Expected behavior

shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-03-26 06:00:00 CEST   ┆ 4   │
│ 2023-03-27 06:00:00 CEST   ┆ 5   │
└────────────────────────────┴─────┘
shape: (2, 2)
┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2023-10-29 06:00:00 CET    ┆ 4   │
│ 2023-10-30 06:00:00 CET    ┆ 5   │
└────────────────────────────┴─────┘

Installed versions

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python:               3.11.4 (main, Jun 26 2023, 15:13:33) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.10.0
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.2
nest_asyncio:         1.5.8
numpy:                1.26.2
openpyxl:             3.1.2
pandas:               2.1.3
pyarrow:              11.0.0
pydantic:             2.5.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           2.0.23
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@michelbl michelbl added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Apr 30, 2024
@MarcoGorelli MarcoGorelli added the A-timeseries Area: date/time functionality label Apr 30, 2024
@MarcoGorelli
Copy link
Collaborator

thanks for the report, will take a look

@MarcoGorelli MarcoGorelli self-assigned this Apr 30, 2024
@MarcoGorelli MarcoGorelli added the P-medium Priority: medium label Apr 30, 2024
@MarcoGorelli
Copy link
Collaborator

Looking at this more closely, I think the current behaviour is correct

Or at least, it's behaving as documented. The docs say

‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.

And indeed:

 [16]: pl.Series([datetime(2023, 3, 26, 16, 56)], dtype=pl.Datetime('us', 'Europe/Paris')).dt.truncate('1d').dt.offset_by('6h')
Out[16]:
shape: (1,)
Series: '' [datetime[μs, Europe/Paris]]
[
        2023-03-26 07:00:00 CEST
]

@MarcoGorelli MarcoGorelli removed bug Something isn't working needs triage Awaiting prioritization by a maintainer P-medium Priority: medium labels May 1, 2024
@MarcoGorelli
Copy link
Collaborator

closing then as this looks expected, but thanks for the report! please do reach out if anything else trips you up

@michelbl
Copy link
Author

michelbl commented May 7, 2024

Hi @MarcoGorelli , thanks for you analysis. I was in holidays so I didn't answer in time. However, here are a few arguments you might want to consider.

Let's look at that example:

from datetime import datetime, timedelta, UTC

import polars as pl

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
                    datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [10, 11, 12],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

that produces the following result:

┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 2025-01-01 06:00:00 CET    ┆ 21  │
│ 2025-01-02 06:00:00 CET    ┆ 12  │
└────────────────────────────┴─────┘

Now, I mischievously add a single data point 28 years back in time with the value 0:

from datetime import datetime, timedelta, UTC

import polars as pl

print(
    pl.DataFrame(
        data={
            "t": pl.Series(
                [
                    datetime(1997, 3, 30, 14, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 5, 56, tzinfo=UTC),
                    datetime(2025, 1, 1, 10, 6, tzinfo=UTC),
                    datetime(2025, 1, 2, 5, 45, tzinfo=UTC),
                ]
            )
            .dt.cast_time_unit("ms")
            .dt.convert_time_zone("Europe/Paris"),
            "q": [0, 10, 11, 12],
        }
    )
    .set_sorted("t")
    .group_by_dynamic(index_column="t", every="1d", offset=timedelta(hours=6))
    .agg([pl.sum("q").alias("q")])
)

and now the results are all messed up:

┌────────────────────────────┬─────┐
│ t                          ┆ q   │
│ ---                        ┆ --- │
│ datetime[ms, Europe/Paris] ┆ i64 │
╞════════════════════════════╪═════╡
│ 1997-03-30 07:00:00 CEST   ┆ 0   │
│ 2024-12-31 07:00:00 CET    ┆ 10  │
│ 2025-01-01 07:00:00 CET    ┆ 23  │
└────────────────────────────┴─────┘

As we can see, how the data is aggregated depends not only on the parameters, but also on the data itself. A single data point affects the whole result. I cannot think of any use-case where that is a desired behavior. Moreover, I believe it conflicts with the expectation that the windows are fully defined by the arguments offset and every and would not vary depending on the data to aggregate.

Furthermore, I don't fully agree that this is a documented behavior. Let's follow the documented recipe:

‘window’: Start by taking the earliest timestamp, truncating it with every, and then adding offset. Note that weekly windows start on Monday.

  • Our first timestamp is 1997-03-30 16:56 Europe/Paris (which append to be in summer time)
  • Truncated by day: 1997-03-30 00:00 Europe/Paris (now we are in winter time)
  • Adding an offset of 6 hours. Do you add 6 "wall time" hours or 6 "absolute time" hours? In the former case, the result is 1997-03-30 06:00 Europe/Paris (summer time), in the later case the result is 1997-03-30 07:00 Europe/Paris. Since the two semantics are completely valid on their own, there is now way to tell which one is used.
  • Since the computations of all the subsequent time windows use the "wall time" semantics (DST hour gets added/removed), it is reasonable to assume that the first window start is also using the "wall time" semantics. But this is not the case.

I also think that the consequences are hard to foresee (I deployed that code in production for months, fully unaware of this edge case.)

If you still think that this behavior is suitable in some circumstances, then adding a warning in the documentation would at least prevent users from making wrong assumptions.

@MarcoGorelli
Copy link
Collaborator

thanks for providing more context

.dt.offset_by mentions which durations are "calendar" ones, and which not: https://docs.pola.rs/py-polars/html/reference/expressions/api/polars.Expr.dt.offset_by.html#polars.Expr.dt.offset_by . That should probably be linked

@MarcoGorelli MarcoGorelli reopened this May 7, 2024
@MarcoGorelli MarcoGorelli added the documentation Improvements or additions to documentation label May 7, 2024
@MarcoGorelli
Copy link
Collaborator

To be honest, the default of -every for offset has been annoying me for some time

It may be a good idea to redesign this a bit - it may need to be a breaking change as part of the 1.0 release, but if it's done right, it'll be for the better

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-timeseries Area: date/time functionality documentation Improvements or additions to documentation python Related to Python Polars
Projects
Archived in project
Development

No branches or pull requests

2 participants