Option to report zero-length groups when applying .group_by().len() #15997
Comments
Are you referring to what is essentially the equivalent of the pandas behavior? If so, you can create the Cartesian product of your grouping keys and left-join that against your grouped result. See the following example:
import polars as pl
df = pl.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # note (2, 3) missing from keys
    "value": [1, 2, 3, 4, 5, 6],
})
print(df.group_by(["key1", "key2"]).len())
# shape: (5, 3)
# ┌──────┬──────┬─────┐
# │ key1 ┆ key2 ┆ len │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ u32 │
# ╞══════╪══════╪═════╡
# │ 1 ┆ 1 ┆ 1 │
# │ 1 ┆ 2 ┆ 1 │
# │ 1 ┆ 3 ┆ 1 │
# │ 2 ┆ 1 ┆ 1 │
# │ 2 ┆ 2 ┆ 2 │
# └──────┴──────┴─────┘ # note missing (2, 3)
# Create all combinations of keys.
# You may want to generalize this to multiple variables with a function
keys = df.select("key1").unique().join(df.select("key2").unique(), how="cross")
# Join with group by.
# You could easily function-ify this by passing the join args inward.
print(
    keys.join(
        # the grouped counts
        df.group_by(["key1", "key2"], maintain_order=True).len(),
        on=["key1", "key2"],
        how="left",
    )
    .with_columns(pl.col("len").fill_null(0))
)
# shape: (6, 3)
# ┌──────┬──────┬─────┐
# │ key1 ┆ key2 ┆ len │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞══════╪══════╪═════╡
# │ 2 ┆ 3 ┆ 0 │ # new result!
# │ 2 ┆ 2 ┆ 2 │
# │ 2 ┆ 1 ┆ 1 │
# │ 1 ┆ 3 ┆ 1 │
# │ 1 ┆ 2 ┆ 1 │
# │ 1 ┆ 1 ┆ 1 │
# └──────┴──────┴─────┘
Yes, that's the operation I am referring to; I didn't know its pandas counterpart. The solution you suggest is of the same nature as the ones I've referred to as nice examples of how to go about it with a toy example with two groups. In my current data frame the grouping involves several fields (8 to 10). What would an implementation of your suggested solution look like in a scenario with more than two grouping fields?
I hope I am not missing some obvious no-no (e.g. insane combinatorics of elements), but I would still think a simple flag on the group_by call would be useful. I will try to convert to pandas and test whether the flag there does what I need.
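For reference, pandas reports empty groups only when the grouping keys are categorical and `observed=False` is passed to `groupby`; with plain integer or string keys it drops empty groups just like polars does. A minimal sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # (2, 3) never occurs
    "value": [1, 2, 3, 4, 5, 6],
})

# Zero-size groups only appear when the keys are categorical dtypes.
df["key1"] = df["key1"].astype("category")
df["key2"] = df["key2"].astype("category")

# observed=False keeps every category combination, even empty ones.
counts = df.groupby(["key1", "key2"], observed=False).size()
print(counts.loc[(2, 3)])  # → 0
```

Note the same combinatorial caveat applies: all 2 × 3 = 6 combinations are materialized.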
Well yes, you'll hit a combinatorial explosion if you have lots of variables and lots of options: the number of combinations is the product of the number of unique values in each variable, so it can blow up fairly quickly.
This is not a current option because the detected groups are groups that exist in the data, and they are referenced by row index; a combination that never occurs has no rows to reference. Generalizing the example above to an arbitrary number of grouping fields:
import polars as pl
from polars import col
def all_keys(df_keys):
    """Create a dataframe of all unique combinations of key values."""
    columns = df_keys.columns
    df = df_keys.select(columns[0]).unique()
    for column in columns[1:]:
        df = df.join(df_keys.select(column).unique(), how="cross")
    return df
def len_all_combos(df, key, agg_ops=pl.all().len(), fill_value=None):
    """Perform a group-by with agg_ops and include all possible key combinations."""
    keys = all_keys(df.select(key))
    df_grouped = df.group_by(key).agg(agg_ops)
    df_out = keys.join(df_grouped, on=key, how="left")
    if fill_value is not None:
        other_columns = [x for x in df.columns if x not in key]
        df_out = df_out.with_columns(col(other_columns).fill_null(fill_value))
    return df_out.sort(key)
# let's try it out
df = pl.DataFrame({
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],  # note (2, 3) missing from keys
    "value": [1, 2, 3, 4, 5, 6],
})
print(
    len_all_combos(df, ["key1", "key2"], fill_value=0)
)
# shape: (6, 3)
# ┌──────┬──────┬───────┐
# │ key1 ┆ key2 ┆ value │
# │ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 │
# ╞══════╪══════╪═══════╡
# │ 1 ┆ 1 ┆ 1 │
# │ 1 ┆ 2 ┆ 1 │
# │ 1 ┆ 3 ┆ 1 │
# │ 2 ┆ 1 ┆ 1 │
# │ 2 ┆ 2 ┆ 2 │
# │ 2 ┆ 3 ┆ 0 │
# └──────┴──────┴───────┘
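The combinatorial blow-up warned about above can be estimated up front, since the size of the key product is just the product of the per-column unique counts. A stdlib-only sketch (the data and the 10_000 threshold are illustrative assumptions):

```python
from itertools import product
from math import prod

data = {
    "key1": [1, 1, 1, 2, 2, 2],
    "key2": [1, 2, 3, 1, 2, 2],
}

# Size of the full key product: product of unique counts per column.
uniques = {name: sorted(set(vals)) for name, vals in data.items()}
n_combos = prod(len(v) for v in uniques.values())
print(n_combos)  # → 6 (that is, 2 * 3)

# Guard against a combinatorial explosion before materializing anything.
if n_combos <= 10_000:
    all_key_tuples = list(product(*uniques.values()))
    observed = set(zip(*data.values()))
    missing = [k for k in all_key_tuples if k not in observed]
    print(missing)  # → [(2, 3)]
```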
Thanks for the help. I will test your suggestions and continue digesting the problem.
Description
Given a set of groups in a group_by, after applying .len(), is there a way to keep track of groups of size zero? For some longitudinal data it would be very useful to get zeros for those groups in which a combination of factors has no data points, especially for plotting with lines.
Note that on the bar plot, there is a range of missing coordinates.
Libraries in other languages support an option for such a capability, as explained in this example in R.
I have seen some examples on SO of how one can achieve this using enums and joins, but it can become hard to adapt to more complicated, real-life scenarios.
Thanks!!