Option to report zero-length groups when applying .group_by().len() #15997

tzeitim · 2024-05-01T17:01:32Z

Description

Given a set of groups in a group_by after applying .len(), is there a way to keep track of groups of size zero?

For some longitudinal data it would be very useful to get zeros for those groups in which a combination of factors has no data points, especially for plotting with lines.

Note that on the bar plot, there is a range of missing coordinates.

Libraries in other languages support an option for this, as explained in this example in R.

I have seen some examples on SO of how one can achieve this using enums and joins, but it can become a bit hard to adapt to more complicated, real-life scenarios.

Thanks!!

mcrumiller · 2024-05-01T18:56:33Z

Are you referring to what is essentially the equivalent of pandas' observed=False (which applies to categorical groupers), whereby the resulting grouped index contains all possible combinations of the grouping variables, and not just the observed ones?

If so, you can create the cartesian product of your grouping keys and left-join that against your groups. See the following example:

import polars as pl

df = pl.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # note (2,3) missing from key
    "value": [1, 2, 3, 4, 5, 6],
})

print(df.group_by(["key1", "key2"]).len())
# shape: (5, 3)
# ┌──────┬──────┬─────┐
# │ key1 ┆ key2 ┆ len │
# │ ---  ┆ ---  ┆ --- │
# │ i64  ┆ i64  ┆ u32 │
# ╞══════╪══════╪═════╡
# │ 1    ┆ 1    ┆ 1   │
# │ 1    ┆ 2    ┆ 1   │
# │ 1    ┆ 3    ┆ 1   │
# │ 2    ┆ 1    ┆ 1   │
# │ 2    ┆ 2    ┆ 2   │
# └──────┴──────┴─────┘  # note missing (2, 3)

# Create all combinations of keys.
# You may want to generalize this to multiple variables with a function
keys = df.select("key1").unique().join(df.select("key2").unique(), how="cross")

# Join with group by.
# You could easily function-ify this by passing the join args inward.
print(
    keys.join(
        # our join
        df.group_by(["key1", "key2"], maintain_order=True).len(),
        on=["key1", "key2"],
        how="left"
    )
    .with_columns(pl.col("len").fill_null(0))
)
# shape: (6, 3)
# ┌──────┬──────┬─────┐
# │ key1 ┆ key2 ┆ len │
# │ ---  ┆ ---  ┆ --- │
# │ i64  ┆ i64  ┆ i64 │
# ╞══════╪══════╪═════╡
# │ 2    ┆ 3    ┆ 0   │  # new result!
# │ 2    ┆ 2    ┆ 2   │
# │ 2    ┆ 1    ┆ 1   │
# │ 1    ┆ 3    ┆ 1   │
# │ 1    ┆ 2    ┆ 1   │
# │ 1    ┆ 1    ┆ 1   │
# └──────┴──────┴─────┘

tzeitim · 2024-05-01T23:52:46Z

Yes, that's the operation I am referring to. I didn't know its counterpart in pandas.

The solution you suggest is similar to the ones I referred to: nice demonstrations of the approach on a toy example with two grouping fields. In my current data frame, the grouping involves several fields (8 to 10).

What would an implementation of your suggested solution look like with more than two grouping fields?

I hope I am not missing some obvious no-no (e.g. insane combinatorics of elements), but I would still think a simple flag on the .len() function would take care of it.

I will try converting to pandas and test whether the flag observed=False solves the issue without a prohibitive performance cost.
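For reference, here is a minimal sketch of the pandas behaviour being discussed, using the toy data from above. It assumes the key columns are first cast to categorical, because `observed` only affects categorical groupers; with `observed=False`, unobserved combinations are kept and `size()` reports them as 0.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # (2, 3) never occurs
    "value": [1, 2, 3, 4, 5, 6],
})

keys = ["key1", "key2"]

# `observed` is only meaningful for categorical groupers, so cast the keys.
df = df.astype({k: "category" for k in keys})

# observed=False keeps every combination of categories, including empty ones.
counts = df.groupby(keys, observed=False).size().reset_index(name="len")
print(counts)
```

The result has one row per combination of categories, so the same combinatorial growth applies here as with an explicit cross join.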

mcrumiller · 2024-05-02T01:05:13Z

Well yes, you'll hit a combinatorial explosion if you have lots of variables and lots of options. The number of combinations is the product of the number of unique values in each variable, so it can blow up fairly quickly.

This is not currently an option because the detected groups are groups that exist in the data, and they are referenced by row index. In the example above, the combination (2, 3) does not exist in the data, so there are no rows for such a group to point to. I think your best bet is to simply use a workaround like the above. I can help you a bit:

import polars as pl
from polars import col


def all_keys(df_keys):
    """Create dataframe of all unique combinations of values."""
    columns = df_keys.columns
    df = df_keys.select(columns[0]).unique()
    for column in columns[1:]:
        df = df.join(df_keys.select(column).unique(), how="cross")
    return df


def len_all_combos(df, key, agg_ops=pl.all().len(), fill_value=None):
    """Perform group-by with op and include all possible key combinations."""
    keys = all_keys(df.select(key))
    df_grouped = df.group_by(key).agg(agg_ops)
    df_out = keys.join(df_grouped, on=key, how="left")
    if fill_value is not None:
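        # Fill only the aggregated (non-key) columns of the joined result.
        other_columns = [x for x in df_out.columns if x not in key]
        df_out = df_out.with_columns(col(other_columns).fill_null(fill_value))
    # Sort by the key column names (`keys` is a DataFrame, not a list of names).
    return df_out.sort(key)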


# let's try it out
df = pl.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # note (2,3) missing from key
    "value": [1, 2, 3, 4, 5, 6],
})

print(
    len_all_combos(df, ["key1", "key2"], fill_value=0)
)
# shape: (6, 3)
# ┌──────┬──────┬───────┐
# │ key1 ┆ key2 ┆ value │
# │ ---  ┆ ---  ┆ ---   │
# │ i64  ┆ i64  ┆ i64   │
# ╞══════╪══════╪═══════╡
# │ 1    ┆ 1    ┆ 1     │
# │ 1    ┆ 2    ┆ 1     │
# │ 1    ┆ 3    ┆ 1     │
# │ 2    ┆ 1    ┆ 1     │
# │ 2    ┆ 2    ┆ 2     │
# │ 2    ┆ 3    ┆ 0     │
# └──────┴──────┴───────┘

tzeitim · 2024-05-02T01:17:37Z

Thanks for the help. I will test your suggestions and continue digesting the problem.

tzeitim added the enhancement label (New feature or an improvement of an existing feature) on May 1, 2024

tzeitim closed this as completed May 3, 2024

cmdlineluser mentioned this issue May 26, 2024

Backfill based on rows #16501

Open
