Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Backfill based on rows #16501

Open
zacqed opened this issue May 26, 2024 · 1 comment
Open

Backfill based on rows #16501

zacqed opened this issue May 26, 2024 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@zacqed
Copy link

zacqed commented May 26, 2024

Description

On discord usage-question, the replication for following Pandas code seems to be over-complicated and does not scale for the purpose of performance. Can this be considered as a feature that Polars should have?

temp_df = pd.DataFrame({
    "instu_id": [9000098, 9000099],
    "COLL_END_DATE": ["2023-10-31","2023-10-31"],
    "Period": [1,3],
    "outstanding": [1.188779e+08, 1231231]
})

series = [0, 1, 2, 3]

temp_df1 = (
        (
            temp_df.groupby(["instu_id", "COLL_END_DATE", "Period"], sort=False, dropna=False)["outstanding"]
            .sum()
            .unstack()
        )
        .reindex(columns=series)
        .bfill(axis=1)
    ).reset_index()
    
Output
Period  instu_id COLL_END_DATE            0            1          2          3
0        9000098    2023-10-31  118877900.0  118877900.0        NaN        NaN
1        9000099    2023-10-31    1231231.0    1231231.0  1231231.0  1231231.0

Polars:


df = pl.DataFrame({
    "instu_id": [9000098, 9000099],
    "COLL_END_DATE": ["2023-10-31","2023-10-31"],
    "Period": [1,3],
    "outstanding": [1.188779e+08, 1231231]
})
groups = ["instu_id", "COLL_END_DATE"]
period = pl.Series("Period", [0, 1, 2, 3])

(df.unique(groups)
   .select(groups)
   .join(period.to_frame(), how="cross")
   .join(df, on=groups + ["Period"], how="left")
   .with_columns(
      pl.when(pl.col.outstanding.is_not_null().any())
        .then(pl.col.outstanding.sum())
        .over(groups + ["Period"])
   )
   .with_columns(pl.col.outstanding.backward_fill().over(groups))
   .pivot(
      index = groups,
      columns = "Period",
      values = "outstanding",
      aggregate_function = "first"
   )
)
# shape: (2, 6)
# ┌──────────┬───────────────┬────────────┬────────────┬────────────┬────────────┐
# │ instu_id ┆ COLL_END_DATE ┆ 0          ┆ 1          ┆ 2          ┆ 3          │
# │ ---      ┆ ---           ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
# │ i64      ┆ str           ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
# ╞══════════╪═══════════════╪════════════╪════════════╪════════════╪════════════╡
# │ 9000099  ┆ 2023-10-31    ┆ 1.231231e6 ┆ 1.231231e6 ┆ 1.231231e6 ┆ 1.231231e6 │
# │ 9000098  ┆ 2023-10-31    ┆ 1.188779e8 ┆ 1.188779e8 ┆ null       ┆ null       │
# └──────────┴───────────────┴────────────┴────────────┴────────────┴────────────┘

Or is there any simpler way?

@zacqed zacqed added the enhancement New feature or an improvement of an existing feature label May 26, 2024
@cmdlineluser
Copy link
Contributor

Just to be clear, the performance remark was about a "horizontal fill" with cum_reduce (and not the approach shown above)

# This does not "scale"
backward_fill_horizontal = lambda cols: (
    pl.cum_reduce(function=lambda l, r: pl.coalesce(l, r), exprs=map(str, reversed(cols)))
      .struct.field("*")
)

I believe another way to do this in pandas could be to use Categoricals? i.e. without horizontal bfill() / reindex:

import pandas as pd

df = pd.DataFrame({
    "instu_id": [9000098, 9000099],
    "COLL_END_DATE": ["2023-10-31","2023-10-31"],
    "Period": [1,3],
    "outstanding": [1.188779e+08, 1231231]
})

df["Period"] = pd.Categorical(temp_df["Period"], categories=[0, 1, 2, 3], ordered=True)

(df.groupby(["instu_id", "COLL_END_DATE", "Period"], dropna=False, observed=False)
   .agg(["any", "sum"])
) 

#                               outstanding
#                                       any          sum
# instu_id COLL_END_DATE Period
# 9000098  2023-10-31    0              NaN          0.0
#                        1             True  118877900.0
#                        2              NaN          0.0
#                        3              NaN          0.0
# 9000099  2023-10-31    0              NaN          0.0
#                        1              NaN          0.0
#                        2              NaN          0.0
#                        3             True    1231231.0

Which has popped up in a few issues:

tidyr::complete() functionality is also sort of similar:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature
Projects
None yet
Development

No branches or pull requests

2 participants