Backfill based on rows #16501

zacqed · 2024-05-26T15:03:56Z

Description

On discord usage-question, the replication for following Pandas code seems to be over-complicated and does not scale for the purpose of performance. Can this be considered as a feature that Polars should have?

temp_df = pd.DataFrame({
    "instu_id": [9000098, 9000099],
    "COLL_END_DATE": ["2023-10-31","2023-10-31"],
    "Period": [1,3],
    "outstanding": [1.188779e+08, 1231231]
})

series = [0, 1, 2, 3]

temp_df1 = (
        (
            temp_df.groupby(["instu_id", "COLL_END_DATE", "Period"], sort=False, dropna=False)["outstanding"]
            .sum()
            .unstack()
        )
        .reindex(columns=series)
        .bfill(axis=1)
    ).reset_index()

Output
Period  instu_id COLL_END_DATE            0            1          2          3
0        9000098    2023-10-31  118877900.0  118877900.0        NaN        NaN
1        9000099    2023-10-31    1231231.0    1231231.0  1231231.0  1231231.0

Polars:


df = pl.DataFrame({
    "instu_id": [9000098, 9000099],
    "COLL_END_DATE": ["2023-10-31","2023-10-31"],
    "Period": [1,3],
    "outstanding": [1.188779e+08, 1231231]
})
groups = ["instu_id", "COLL_END_DATE"]
period = pl.Series("Period", [0, 1, 2, 3])


(df.unique(groups)
   .select(groups)
   .join(period.to_frame(), how="cross")
   .join(df, on=groups + ["Period"], how="left")
   .with_columns(
      pl.when(pl.col.outstanding.is_not_null().any())
        .then(pl.col.outstanding.sum())
        .over(groups + ["Period"])
   )
   .with_columns(pl.col.outstanding.backward_fill().over(groups))
   .pivot(
      index = groups,
      columns = "Period",
      values = "outstanding",
      aggregate_function = "first"
   )
)

# shape: (2, 6)
# ┌──────────┬───────────────┬────────────┬────────────┬────────────┬────────────┐
# │ instu_id ┆ COLL_END_DATE ┆ 0          ┆ 1          ┆ 2          ┆ 3          │
# │ ---      ┆ ---           ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
# │ i64      ┆ str           ┆ f64        ┆ f64        ┆ f64        ┆ f64        │
# ╞══════════╪═══════════════╪════════════╪════════════╪════════════╪════════════╡
# │ 9000099  ┆ 2023-10-31    ┆ 1.231231e6 ┆ 1.231231e6 ┆ 1.231231e6 ┆ 1.231231e6 │
# │ 9000098  ┆ 2023-10-31    ┆ 1.188779e8 ┆ 1.188779e8 ┆ null       ┆ null       │
# └──────────┴───────────────┴────────────┴────────────┴────────────┴────────────┘

Or is there any simpler way?

The text was updated successfully, but these errors were encountered:

cmdlineluser · 2024-05-26T19:01:25Z

Just to be clear, the performance remark was about a "horizontal fill" with cum_reduce (and not the approach shown above)

# This does not "scale"
backward_fill_horizontal = lambda cols: (
    pl.cum_reduce(function=lambda l, r: pl.coalesce(l, r), exprs=map(str, reversed(cols)))
      .struct.field("*")
)

I believe another way to do this in pandas could be to use Categoricals? i.e. without horizontal bfill() / reindex:

import pandas as pd

df = pd.DataFrame({
    "instu_id": [9000098, 9000099],
    "COLL_END_DATE": ["2023-10-31","2023-10-31"],
    "Period": [1,3],
    "outstanding": [1.188779e+08, 1231231]
})

df["Period"] = pd.Categorical(temp_df["Period"], categories=[0, 1, 2, 3], ordered=True)

(df.groupby(["instu_id", "COLL_END_DATE", "Period"], dropna=False, observed=False)
   .agg(["any", "sum"])
) 

#                               outstanding
#                                       any          sum
# instu_id COLL_END_DATE Period
# 9000098  2023-10-31    0              NaN          0.0
#                        1             True  118877900.0
#                        2              NaN          0.0
#                        3              NaN          0.0
# 9000099  2023-10-31    0              NaN          0.0
#                        1              NaN          0.0
#                        2              NaN          0.0
#                        3             True    1231231.0

Which has popped up in a few issues:

Option to report zero-length groups when applying .group_by().len() #15997

tidyr::complete() functionality is also sort of similar:

Create missing combinations of variables (like tidyr::complete() in R) #9722

zacqed added the enhancement New feature or an improvement of an existing feature label May 26, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Backfill based on rows #16501

Backfill based on rows #16501

zacqed commented May 26, 2024

cmdlineluser commented May 26, 2024

Backfill based on rows #16501

Backfill based on rows #16501

Comments

zacqed commented May 26, 2024

Description

cmdlineluser commented May 26, 2024