Option to report zero-length groups when applying .group_by().len() #15997

Closed
tzeitim opened this issue May 1, 2024 · 4 comments
Labels
enhancement New feature or an improvement of an existing feature

Comments

tzeitim commented May 1, 2024

Description

Given a set of groups in a group_by after applying .len(), is there a way to keep track of groups of size zero?

For some longitudinal data it would be very useful to get zeros for those groups in which a combination of factors has no data points, especially when plotting with lines.

[Two images attached in the original issue]

Note that on the bar plot, there is a range of missing coordinates.

Libraries in other languages support an option for this, as explained in this example in R.

I have seen some examples on Stack Overflow of how to achieve this using enums and joins, but they become hard to adapt to more complicated, real-life scenarios.

Thanks!!

Contributor

mcrumiller commented May 1, 2024

Are you referring to what is essentially the equivalent of pandas' observed=False (which applies to categorical groupers), whereby the resulting grouped index contains all possible combinations of the grouping variables, and not just the observed ones?

If so, you can create the cartesian product of your grouping keys and left-join that against your groups. See the following example:

import polars as pl

df = pl.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # note (2,3) missing from key
    "value": [1, 2, 3, 4, 5, 6],
})

print(df.group_by(["key1", "key2"]).len())
# shape: (5, 3)
# ┌──────┬──────┬─────┐
# │ key1 ┆ key2 ┆ len │
# │ ---  ┆ ---  ┆ --- │
# │ i64  ┆ i64  ┆ u32 │
# ╞══════╪══════╪═════╡
# │ 1    ┆ 1    ┆ 1   │
# │ 1    ┆ 2    ┆ 1   │
# │ 1    ┆ 3    ┆ 1   │
# │ 2    ┆ 1    ┆ 1   │
# │ 2    ┆ 2    ┆ 2   │
# └──────┴──────┴─────┘  # note missing (2, 3)

# Create all combinations of keys.
# You may want to generalize this to multiple variables with a function
keys = df.select("key1").unique().join(df.select("key2").unique(), how="cross")

# Join with group by.
# You could easily function-ify this by passing the join args inward.
print(
    keys.join(
        # our group-by result
        df.group_by(["key1", "key2"], maintain_order=True).len(),
        on=["key1", "key2"],
        how="left"
    )
    .with_columns(pl.col("len").fill_null(0))
)
# shape: (6, 3)
# ┌──────┬──────┬─────┐
# │ key1 ┆ key2 ┆ len │
# │ ---  ┆ ---  ┆ --- │
# │ i64  ┆ i64  ┆ u32 │
# ╞══════╪══════╪═════╡
# │ 2    ┆ 3    ┆ 0   │  # new result!
# │ 2    ┆ 2    ┆ 2   │
# │ 2    ┆ 1    ┆ 1   │
# │ 1    ┆ 3    ┆ 1   │
# │ 1    ┆ 2    ┆ 1   │
# │ 1    ┆ 1    ┆ 1   │
# └──────┴──────┴─────┘

Author

tzeitim commented May 1, 2024

Yes, that's the operation I am referring to. I didn't know its pandas equivalent.

The solution you suggest is of the same nature as the ones I referred to: nice demonstrations of the approach on a toy example with two grouping fields. In my current data frame the grouping involves several fields (8 to 10).

What would an implementation of your suggested solution look like in a scenario with more than 2 grouping fields?

I hope I am not missing some obvious no-no (e.g. insane combinatorics of elements), but I would still think a simple flag on the .len() function could take care of it.

I will try converting to pandas and test whether the observed=False flag solves the issue without a prohibitive performance cost.
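For reference, a minimal sketch of the pandas behaviour discussed here, using the toy data from the example above. It assumes the key columns are first cast to categorical dtype, because `observed` only affects categorical groupers; the names `df_cat` and `counts` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # the (2, 3) combination never occurs
    "value": [1, 2, 3, 4, 5, 6],
})

# `observed` only matters for categorical groupers, so cast the keys first.
df_cat = df.astype({"key1": "category", "key2": "category"})

# observed=False keeps every combination of categories, including empty ones.
counts = (
    df_cat.groupby(["key1", "key2"], observed=False)
    .size()
    .reset_index(name="len")
)
print(counts)
```

With this data the result has six rows, and the (2, 3) row carries a count of 0. The same combinatorial caveat applies: the output size is the product of the category counts.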

Contributor

mcrumiller commented May 2, 2024

Well yes, you'll hit a combinatorial explosion if you have lots of variables and lots of options. The number of combinations is the product of the number of unique values in each variable, so it can blow up fairly quickly.

This is not currently an option because the detected groups are the groups that exist in the data, and they are referenced by row index. In the example above, the (2, 3) combination does not exist in the data. I think your best bet is to simply use a workaround like the above. I can help you a bit:

import polars as pl
from polars import col


def all_keys(df_keys):
    """Create dataframe of all unique combinations of values."""
    columns = df_keys.columns
    df = df_keys.select(columns[0]).unique()
    for column in columns[1:]:
        df = df.join(df_keys.select(column).unique(), how="cross")
    return df


def len_all_combos(df, key, agg_ops=pl.all().len(), fill_value=None):
    """Perform group-by with op and include all possible key combinations."""
    keys = all_keys(df.select(key))
    df_grouped = df.group_by(key).agg(agg_ops)
    # Left-join so that key combinations absent from the data appear as nulls.
    df_out = keys.join(df_grouped, on=key, how="left")
    if fill_value is not None:
        # Fill only the aggregated columns, never the key columns.
        other_columns = [x for x in df_grouped.columns if x not in key]
        df_out = df_out.with_columns(col(other_columns).fill_null(fill_value))
    # Sort by the key column names (not by the `keys` DataFrame).
    return df_out.sort(key)


# let's try it out
df = pl.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # note (2,3) missing from key
    "value": [1, 2, 3, 4, 5, 6],
})

print(
    len_all_combos(df, ["key1", "key2"], fill_value=0)
)
# shape: (6, 3)
# ┌──────┬──────┬───────┐
# │ key1 ┆ key2 ┆ value │
# │ ---  ┆ ---  ┆ ---   │
# │ i64  ┆ i64  ┆ u32   │
# ╞══════╪══════╪═══════╡
# │ 1    ┆ 1    ┆ 1     │
# │ 1    ┆ 2    ┆ 1     │
# │ 1    ┆ 3    ┆ 1     │
# │ 2    ┆ 1    ┆ 1     │
# │ 2    ┆ 2    ┆ 2     │
# │ 2    ┆ 3    ┆ 0     │
# └──────┴──────┴───────┘
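To illustrate the point about groups being referenced by row index, here is a conceptual sketch in plain Python. It is not how Polars implements grouping; it only assumes that a group is a key mapped to the row indices where that key occurs, which is why a combination with no rows never appears.

```python
from collections import defaultdict

# (key1, key2) for each row of the toy frame, in row order.
rows = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 2)]

# Map each key seen in the data to the row indices where it occurs.
groups = defaultdict(list)
for idx, key in enumerate(rows):
    groups[key].append(idx)

# Group lengths can only be computed for keys that have at least one row.
lengths = {key: len(indices) for key, indices in groups.items()}

print(lengths)
print((2, 3) in lengths)
```

Because (2, 3) has no row index to point at, a zero can only come from an explicit list of expected keys, which is what the cross join in `all_keys` provides.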

Author

tzeitim commented May 2, 2024

Thanks for the help. I will test your suggestions and continue digesting the problem.
