Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pl.Expr.get(pl.Expr.arg_min()) does not respect rolling window #15078

Closed
2 tasks done
dalejung opened this issue Mar 15, 2024 · 3 comments
Closed
2 tasks done

pl.Expr.get(pl.Expr.arg_min()) does not respect rolling window #15078

dalejung opened this issue Mar 15, 2024 · 3 comments
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

Comments

@dalejung
Copy link

dalejung commented Mar 15, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

df = pl.from_repr(
    """
    ┌─────────────────────┬───────┬─────┐
    │ date                ┆ alpha ┆ num │
    │ ---                 ┆ ---   ┆ --- │
    │ datetime[μs]        ┆ str   ┆ i64 │
    ╞═════════════════════╪═══════╪═════╡
    │ 2022-01-01 11:00:00 ┆ A     ┆ 0   │
    │ 2022-01-01 11:01:00 ┆ B     ┆ 1   │
    │ 2022-01-01 11:02:00 ┆ C     ┆ 2   │
    │ 2022-01-01 11:03:00 ┆ D     ┆ 3   │
    │ 2022-01-01 11:04:00 ┆ E     ┆ 4   │
    │ 2022-01-01 11:05:00 ┆ F     ┆ 5   │
    │ 2022-01-01 11:06:00 ┆ G     ┆ 6   │
    │ 2022-01-01 11:07:00 ┆ H     ┆ 7   │
    │ 2022-01-01 11:08:00 ┆ I     ┆ 8   │
    │ 2022-01-01 11:09:01 ┆ J     ┆ 9   │
    └─────────────────────┴───────┴─────┘
    """
).with_columns(pl.col('date').set_sorted())

assert isinstance(df, pl.DataFrame)

Using rolling and get(arg_min)

rolling_report = df.rolling(
    'date',
    period='3m',
    offset="0",
    closed='left',
).agg(
    pl.col('num').alias('num_list'),
    pl.col('num').first().alias('first_num'),
    pl.col('num').arg_min().alias('arg_min'),
    pl.col('num').get(
        pl.col('num').arg_min()
    ).alias('min_num_get'),
    pl.col('num').get(
        0,
    ).alias('min_num_explicit'),
)

# this fails even though the smallest number of each window should be the first.
assert rolling_report['min_num_get'].eq(rolling_report['first_num']).all()

# NOTE: passing in an explict 0 to pl.Expr.get returns the correct
assert rolling_report['min_num_explicit'].eq(rolling_report['first_num']).all()

group_by_dynamic doesn't have this issue.

groupby_report = df.group_by_dynamic(
    'date',
    every='3m',
).agg(
    pl.col('num').alias('num_list'),
    pl.col('num').first().alias('first_num'),
    pl.col('num').arg_min().alias('arg_min'),
    pl.col('num').get(
        pl.col('num').arg_min()
    ).alias('min_num_get'),
    pl.col('num').get(
        0,
    ).alias('min_num_explicit'),
)

# GroupBy works as expected
assert groupby_report['min_num_get'].eq(groupby_report['first_num']).all()
assert groupby_report['min_num_explicit'].eq(groupby_report['first_num']).all()

Log output

No response

Issue description

When using a rolling window, I would expect

pl.col('num').get(
            pl.col('num').arg_min()
        )

To return the smallest num in each window when aggregated. I set up a toy example where the num column is ordered so the smallest number will always be the first in the window. However even though I verified arg_min returns all 0s, I'm not getting the correct value from get. Passing in an explicit 0 to get gives me what I expect.

In [62]: rolling_report
Out[62]:
                  date num_list  first_num  arg_min  min_num_get  min_num_explicit
0  2022-01-01 11:00:00  [0 1 2]          0        0            0                 0
1  2022-01-01 11:01:00  [1 2 3]          1        0            1                 1
2  2022-01-01 11:02:00  [2 3 4]          2        0            2                 2
3  2022-01-01 11:03:00  [3 4 5]          3        0            1                 3
4  2022-01-01 11:04:00  [4 5 6]          4        0            2                 4
5  2022-01-01 11:05:00  [5 6 7]          5        0            3                 5
6  2022-01-01 11:06:00  [6 7 8]          6        0            2                 6
7  2022-01-01 11:07:00  [7 8 9]          7        0            3                 7
8  2022-01-01 11:08:00    [8 9]          8        0            4                 8
9  2022-01-01 11:09:01      [9]          9        0            3                 9

group_by_dynamic windows work correctly and return the first/smallest num.

In [63]: groupby_report
Out[63]:
                  date num_list  first_num  arg_min  min_num_get  min_num_explicit
0  2022-01-01 11:00:00  [0 1 2]          0        0            0                 0
1  2022-01-01 11:03:00  [3 4 5]          3        0            3                 3
2  2022-01-01 11:06:00  [6 7 8]          6        0            6                 6
3  2022-01-01 11:09:00      [9]          9        0            9                 9

Expected behavior

pl.Expr.get should get the indexed value from the current rolling window only.

Installed versions

--------Version info---------
Polars:               0.20.14
Index type:           UInt32
Platform:             Linux-6.7.8-arch1-1-x86_64-with-glibc2.39
Python:               3.12.2 | packaged by conda-forge | (main, Feb 16 2024, 20:50:58) [GCC 12.3.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.2.0
gevent:               <not installed>
hvplot:               0.9.2.post8+g4cb29ba
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             3.1.2
pandas:               3.0.0.dev0+432.g5bcc7b7077
pyarrow:              16.0.0.dev248+g120a1dbec
pydantic:             1.10.13
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@dalejung dalejung added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Mar 15, 2024
@dalejung dalejung changed the title pl.Expr.get does not respect rolling window pl.Expr.get(pl.Expr.arg_min()) does not respect rolling window Mar 15, 2024
@dalejung
Copy link
Author

This also appears to be an issue for group_by_dynamic when defining different every and period.

dates = pl.date_range(
        pl.datetime(2023, 1, 1),
        pl.datetime(2024, 1, 1),
        eager=True,
    )
    df = pl.DataFrame({
        'date': dates,
    }).with_columns(
        num=pl.int_range(pl.len())
    )

    assert isinstance(df, pl.DataFrame)

    grouped = df.group_by_dynamic(
        'date',
        every='1mo',
        period='2mo',
    )

    report = grouped.agg(
        pl.col('num').first().alias('num_first'),
        pl.col('num').get(
            pl.col('num').arg_min(),
        ).alias('num_get'),
    )

In this example num_first and num_get should be the same but they are not. Removing period correctly shows them being the same.

@dalejung
Copy link
Author

Possible that internally it is the same bug as:

#15091

I thought group_by_dyamic behaved better but now seeing that offset will return incorrect results like rolling, it feels similar.

@dalejung
Copy link
Author

This looks to be fixed now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars
Projects
None yet
Development

No branches or pull requests

1 participant