
ENH: Make non-aggregating methods on groupby groups optionally return groups #36380

Open
kaiogu opened this issue Sep 15, 2020 · 5 comments

kaiogu commented Sep 15, 2020

Is your feature request related to a problem?

I want to apply chained operations to the same groupby groups without having to make identical, costly groupby calls before each operation. The related problem is described in the following Stack Exchange question:

https://codereview.stackexchange.com/questions/249222/get-exactly-n-unique-randomly-sampled-rows-per-category-in-a-dataframe

Describe the solution you'd like

Add an optional argument that changes the behavior of non-aggregating methods on groupby groups to return groups instead of DataFrames.

This is useful if you want to apply multiple chained operations to the groups without having to do a groupby each time (split once, apply multiple times, combine once). This would also have the benefit of actually decoupling the apply and combine parts of the split-apply-combine paradigm when possible. This does not make sense for aggregating methods, but for something like groupby.DataFrameGroupBy.filter or groupby.DataFrameGroupBy.sample (maybe anything that does not reduce the groups to a single value), returning groups was actually the behavior I expected, as did at least the person who answered the Stack Exchange question above.

API breaking implications

This would not break the API as far as I know, since the current behavior could remain the default, i.e. return_groups=False.

Describe alternatives you've considered

The alternatives that I could come up with all involved multiple groupby calls (I think there is an implicit call in value_counts). For my specific use case (see the Stack Exchange question), an option to drop groups with insufficient rows would actually suffice, but I think the proposed solution is more general and decoupled.

Additional context

Here is the code to randomly sample exactly n rows per group, dropping groups with insufficient rows:

import pandas as pd

n = 4
df = pd.DataFrame({'category': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1],
                   'value': range(12)})

# sample exactly n rows per category, dropping categories with fewer than n rows
df = df.groupby('category').filter(lambda x: len(x) >= n).groupby('category').sample(n)

# proposed alternative solution (return_groups does not exist today):
df = df.groupby('category').filter(lambda x: len(x) >= n, return_groups=True).sample(n)

I also expected filter to actually pass groups instead of DataFrames to the lambda.

kaiogu added the Enhancement and Needs Triage labels on Sep 15, 2020
jreback (Contributor) commented Sep 15, 2020

-1 on adding any keywords to groupby as it's already very complicated

the groupby step is very cheap - it's processing the function which might be expensive

you can also simply iterate over the groups themselves if you want (in a comprehension, for example)
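
For illustration, a minimal sketch (not jreback's code) of iterating over the groups, reusing the df and n defined in the issue description; a groupby object yields (key, sub-DataFrame) pairs, so the split step runs only once:

import pandas as pd

groups = df.groupby('category')

kept = []
for key, sub in groups:             # one pass over the already-split groups; key is the group label
    if len(sub) >= n:               # the "filter" step
        kept.append(sub.sample(n))  # the "apply" step; further per-group steps could be chained here

result = pd.concat(kept)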

dsaxton removed the Needs Triage label on Sep 15, 2020
erfannariman (Member) commented Sep 15, 2020

If your actual problem is sampling when n is greater than the group size, why not use replace=True?

df.groupby("category").sample(n, replace=True)

    category  value
9          1      9
9          1      9
11         1     11
3          1      3
7          2      7
1          2      1
10         2     10
7          2      7
2          3      2
5          3      5
8          3      8
2          3      2

You can chain it with drop_duplicates to get rid of the duplication.
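
Roughly what that chain could look like (a sketch reusing the df and n from the issue description; note that dropping duplicates can leave fewer than n rows in a group):

# sample with replacement so groups smaller than n don't raise,
# then drop the duplicate rows that replacement may have introduced
unique_sampled = (
    df.groupby("category")
      .sample(n, replace=True)
      .drop_duplicates()
)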


Besides that, there are ways to avoid calling groupby twice, for example:

df.groupby("category").apply(lambda x: x.sample(n) if len(x) >= n else x).reset_index(drop=True)

    category  value
0          1     11
1          1      0
2          1      6
3          1      3
4          2     10
5          2      4
6          2      1
7          2      7
8          3      2
9          3      5
10         3      8

Or (this one might not be very efficient):

pd.concat([d.sample(n) for _, d in df.groupby("category") if len(d) >= n])

kaiogu (Author) commented Sep 16, 2020

@jreback
The keywords wouldn't be added to the groupby call, but to the non-aggregating methods that follow it.

Are you saying it is cheap because nothing actually gets computed when groupby is called, or is the computation that groupby itself does cheap?

I was trying to avoid it stylistically and because "don't write your own for loops in pandas" is ingrained in me by now. I will try the suggestions out though.

Lastly, I like the separation between the apply and combine steps, where it makes sense. It obviously does not apply to Aggregation functions. But for methods that fall under Transformation or Filtering (or anything, really, in which each group is not reduced to a single value), it would make sense to have the option to return groups for further group-wise processing.
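
For reference, a minimal illustration of the three categories of groupby methods the pandas user guide distinguishes, reusing the df and n from the issue description:

grouped = df.groupby('category')

agg = grouped['value'].sum()                   # Aggregation: one value per group
trans = grouped['value'].transform('mean')     # Transformation: same length as the input
filt = grouped.filter(lambda g: len(g) >= n)   # Filtration: a subset of the original rows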


@erfannariman
My bad, it is specified in the Stack Exchange question but I didn't specify it here: I need the rows to be unique and the groups should have the same number of rows.

I liked your first solution (I just had to change else x to else None to actually drop the rare categories). But surprisingly, your second solution was much faster!

import numpy as np
import pandas as pd

n_rows = 10_000_000
n_groups = 4
threshold = n_rows // n_groups

df = pd.DataFrame({'category': np.random.default_rng(seed=0).integers(n_groups, size=n_rows),
                   'value': np.arange(n_rows)})
Your first solution
%%timeit
df.groupby('category').apply(lambda x: x.sample(threshold) if len(x) >= threshold else None).reset_index(0, drop=True)

3.02 s ± 9.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Your second solution
%%timeit
pd.concat([d.sample(threshold) for _, d in df.groupby("category") if len(d) >= threshold])

655 ms ± 5.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


I will go with iterating over the groups + concat for now, as I want to do even more operations on the same groups. But the proposal to have an option for non-aggregating groupby methods to return groups still stands.
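
A sketch of how additional per-group operations could be chained inside that comprehension, reusing the df and threshold from the benchmark above (the sort step is only a placeholder for the further operations mentioned):

def process(group):
    # per-group pipeline: sample, then any further group-wise steps
    sampled = group.sample(threshold)
    return sampled.sort_values('value')   # placeholder for more operations

result = pd.concat(
    process(group)
    for _, group in df.groupby('category')
    if len(group) >= threshold
)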

jreback (Contributor) commented Sep 16, 2020

not against making .filter better but would have to be comprehensive


adam-kral commented Nov 26, 2020

I also thought that filter would not evaluate eagerly, but rather return a GroupBy object. I found the question https://stackoverflow.com/questions/49831784/filter-groups-after-groupby-in-pandas-while-keeping-the-groups, then I searched for and found this issue.

My case:

import pandas as pd

df = pd.DataFrame({'runner': [1, 1, 2, 2], 'monday': [10, 5, 6, 12], 'tuesday': [10, 5, 10, 10]})

runners = df.groupby('runner')

# compute expensive statistics, save for subsequent multiple usage
runner_total_dst = runners.apply(lambda g: g.monday.sum() + g.tuesday.sum())

avid_runners = runners.filter(lambda g: runner_total_dst.loc[g.name] > 30)  # possibly chain more .filters

avid_runners.apply(lambda group: print(group.columns))  # error: no attribute columns on Series -- group is not a dataframe :(
# avid_runners was eagerly evaluated and is now a dataframe, rather than a groupby object

Current method:

avid_runners = runners.filter(lambda g: runner_total_dst.loc[g.name] > 30).groupby('runner')

The current method is very expensive: filter copies the whole dataframe AND groupby has to sort it (right?). Plus it's not readable (why call groupby('runner') again on runners?) and there's no possibility to chain filters.

Edit:
In particular, runner_total_dst can be computed without apply, so that numpy probably does the heavy lifting (vectorized):

runner_total_dst = runners.monday.sum() + runners.tuesday.sum()

(So when a group [Series/DataFrame] is passed to apply, is it actually created with new data, or is the DataFrame/Series just a wrapper holding row indices into the original DataFrame? Probably a copy of the subset, right? But I suspect the faster solution in this edit uses a masking array/index to np.sum the values of each group directly on the original underlying ndarray, right?)
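
In the same vectorized spirit, one alternative not raised in the thread (an editorial sketch, not an existing or proposed pandas API) is to map the precomputed per-runner totals back onto the grouping column and boolean-index the original frame, which avoids the Python-level filter lambda for the filtering step:

import pandas as pd

df = pd.DataFrame({'runner': [1, 1, 2, 2],
                   'monday': [10, 5, 6, 12],
                   'tuesday': [10, 5, 10, 10]})

runners = df.groupby('runner')
runner_total_dst = runners.monday.sum() + runners.tuesday.sum()

# align the per-runner totals back to the rows and keep only the avid runners
avid_runners = df[df['runner'].map(runner_total_dst) > 30]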
