
ENH: Make non-aggregating methods on groupby groups optionally return groups #36380

Open
kaiogu opened this issue Sep 15, 2020 · 5 comments

kaiogu commented Sep 15, 2020

Is your feature request related to a problem?

I want to apply chained operations to the same groupby groups without having to make identical, costly groupby calls before each operation. The related problem is described in the following Stack Exchange question:

https://codereview.stackexchange.com/questions/249222/get-exactly-n-unique-randomly-sampled-rows-per-category-in-a-dataframe

Describe the solution you'd like

Add an optional argument that changes the behavior of non-aggregating methods on groupby groups to return groups instead of DataFrames.

This is useful if you want to apply multiple chained operations to the groups without having to do a groupby each time (split once, apply multiple times, combine once). This would also have the benefit of actually decoupling the apply and combine parts of the split-apply-combine paradigm when possible. This does not make sense for aggregating methods, but for something like groupby.DataFrameGroupBy.filter or groupby.DataFrameGroupBy.sample (maybe anything that does not reduce the groups to a single value), returning groups was actually the behavior I expected, as did at least the person who answered the Stack Exchange question above.

API breaking implications

This would not break the API as far as I know, since the current behavior could remain the default, i.e. return_groups=False.

Describe alternatives you've considered

The alternatives that I could come up with all involved multiple groupby calls (I think there is an implicit call in value_counts). For my specific use case (see the Stack Exchange question), an option to drop groups with insufficient rows would actually suffice, but I think the proposed solution is more general and decoupled.

Additional context

Here is the code to randomly sample exactly n rows per group, dropping groups with insufficient rows:

import pandas as pd

n = 4
df = pd.DataFrame({'category': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1],
                   'value': range(12)})

# sample exactly n rows per category, dropping categories with fewer than n rows
df = df.groupby('category').filter(lambda x: len(x) >= n).groupby('category').sample(n)

# proposed alternative solution (return_groups does not exist today):
df = df.groupby('category').filter(lambda x: len(x) >= n, return_groups=True).sample(n)

I also expected filter to actually pass groups instead of DataFrames to the lambda.

kaiogu added the Enhancement and Needs Triage labels on Sep 15, 2020
jreback (Contributor) commented Sep 15, 2020

-1 on adding any keywords to groupby as it's already very complicated

the groupby step is very cheap - it's processing the function which might be expensive

you can also simply iterate over the groups themselves if you want (in a comprehension, for example)
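
For illustration, a minimal sketch (not jreback's code) of iterating over the groups, reusing the df and n defined in the issue description; a groupby object yields (key, sub-DataFrame) pairs, so the split step runs only once:

import pandas as pd

groups = df.groupby('category')

kept = []
for key, sub in groups:             # one pass over the already-split groups; key is the group label
    if len(sub) >= n:               # the "filter" step
        kept.append(sub.sample(n))  # the "apply" step; further per-group steps could be chained here

result = pd.concat(kept)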

dsaxton removed the Needs Triage label on Sep 15, 2020
erfannariman (Member) commented Sep 15, 2020

If your actual problem is sampling when n is greater than the group size, why not use replace=True?

df.groupby("category").sample(n, replace=True)

    category  value
9          1      9
9          1      9
11         1     11
3          1      3
7          2      7
1          2      1
10         2     10
7          2      7
2          3      2
5          3      5
8          3      8
2          3      2

You can chain it with drop_duplicates to get rid of the duplication.
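
Roughly what that chain could look like (a sketch reusing the df and n from the issue description; note that dropping duplicates can leave fewer than n rows in a group):

# sample with replacement so groups smaller than n don't raise,
# then drop the duplicate rows that replacement may have introduced
unique_sampled = (
    df.groupby("category")
      .sample(n, replace=True)
      .drop_duplicates()
)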


Besides that, there are ways to avoid calling groupby twice, for example:

df.groupby("category").apply(lambda x: x.sample(n) if len(x) >= n else x).reset_index(drop=True)

    category  value
0          1     11
1          1      0
2          1      6
3          1      3
4          2     10
5          2      4
6          2      1
7          2      7
8          3      2
9          3      5
10         3      8

Or (this one might not be very efficient):

pd.concat([d.sample(n) for _, d in df.groupby("category") if len(d) >= n])

kaiogu (Author) commented Sep 16, 2020

@jreback
The keywords wouldn't be added to the groupby call, but to the non-aggregating methods that follow it.

Are you saying it is cheap because nothing actually gets computed when groupby is called, or is the computation that groupby itself does cheap?

I was trying to avoid it stylistically and because "don't write your own for loops in pandas" is ingrained in me by now. I will try the suggestions out though.

Lastly, I like the separation between the apply and combine steps, where it makes sense. It obviously does not apply to Aggregation functions. But for methods that fall under Transformation or Filtering (or anything, really, in which each group is not reduced to a single value), it would make sense to have the option to return groups for further group-wise processing.
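
For reference, a minimal illustration of the three categories of groupby methods the pandas user guide distinguishes, reusing the df and n from the issue description:

grouped = df.groupby('category')

agg = grouped['value'].sum()                   # Aggregation: one value per group
trans = grouped['value'].transform('mean')     # Transformation: same length as the input
filt = grouped.filter(lambda g: len(g) >= n)   # Filtration: a subset of the original rows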


@erfannariman
My bad, it is specified in the Stack Exchange question but I didn't specify it here: I need the rows to be unique and the groups should have the same number of rows.

I liked your first solution (I just had to change else x to else None to actually drop the rare categories). But surprisingly, your second solution was much faster!

import numpy as np
import pandas as pd

n_rows = 10_000_000
n_groups = 4
threshold = n_rows // n_groups

df = pd.DataFrame({'category': np.random.default_rng(seed=0).integers(n_groups, size=n_rows),
                   'value': np.arange(n_rows)})
Your first solution
%%timeit
df.groupby('category').apply(lambda x: x.sample(threshold) if len(x) >= threshold else None).reset_index(0, drop=True)

3.02 s ± 9.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Your second solution
%%timeit
pd.concat([d.sample(threshold) for _, d in df.groupby("category") if len(d) >= threshold])

655 ms ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


I will go with iterating over the groups + concat for now, as I want to do even more operations on the same groups. But the proposal to have an option for non-aggregating groupby methods to return groups still stands.
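
A sketch of how additional per-group operations could be chained inside that comprehension, reusing the df and threshold from the benchmark above (the sort step is only a placeholder for the further operations mentioned):

def process(group):
    # per-group pipeline: sample, then any further group-wise steps
    sampled = group.sample(threshold)
    return sampled.sort_values('value')   # placeholder for more operations

result = pd.concat(
    process(group)
    for _, group in df.groupby('category')
    if len(group) >= threshold
)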

jreback (Contributor) commented Sep 16, 2020

not against making .filter better but would have to be comprehensive


adam-kral commented Nov 26, 2020

I also thought that filter would not evaluate eagerly, but rather return a GroupBy object. I found the question https://stackoverflow.com/questions/49831784/filter-groups-after-groupby-in-pandas-while-keeping-the-groups, then I searched for and found this issue.

My case:

import pandas as pd

df = pd.DataFrame({'runner': [1, 1, 2, 2], 'monday': [10, 5, 6, 12], 'tuesday': [10, 5, 10, 10]})

runners = df.groupby('runner')

# compute expensive statistics, save for subsequent multiple usage
runner_total_dst = runners.apply(lambda g: g.monday.sum() + g.tuesday.sum())

avid_runners = runners.filter(lambda g: runner_total_dst.loc[g.name] > 30)  # possibly chain more .filters

avid_runners.apply(lambda group: print(group.columns))  # error: no attribute columns on Series -- group is not a dataframe :(
# avid_runners was eagerly evaluated and is now a dataframe, rather than a groupby object

Current method:

avid_runners = runners.filter(lambda g: runner_total_dst.loc[g.name] > 30).groupby('runner')

The current method is very expensive: filter copies the whole dataframe AND groupby has to sort it (right?). Plus it's not readable (why call groupby('runner') again on runners?) and there's no possibility to chain filters.

Edit:
In particular, runner_total_dst can be computed without apply, so that numpy probably does the heavy lifting (vectorized):

runner_total_dst = runners.monday.sum() + runners.tuesday.sum()

(So when a group [Series/DataFrame] is passed to apply, is it actually created with new data, or is the DataFrame/Series just a wrapper holding row indices into the original DataFrame? Probably a copy of the subset, right? But I suspect the faster solution in this edit uses a masking array/index to np.sum the values of each group directly on the original underlying ndarray, right?)
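
In the same vectorized spirit, one alternative not raised in the thread (an editorial sketch, not an existing or proposed pandas API) is to map the precomputed per-runner totals back onto the grouping column and boolean-index the original frame, which avoids the Python-level filter lambda for the filtering step:

import pandas as pd

df = pd.DataFrame({'runner': [1, 1, 2, 2],
                   'monday': [10, 5, 6, 12],
                   'tuesday': [10, 5, 10, 10]})

runners = df.groupby('runner')
runner_total_dst = runners.monday.sum() + runners.tuesday.sum()

# align the per-runner totals back to the rows and keep only the avid runners
avid_runners = df[df['runner'].map(runner_total_dst) > 30]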
