ENH: enumerate groups #11642

Closed
dsm054 opened this Issue Nov 18, 2015 · 5 comments

@dsm054
Contributor

dsm054 commented Nov 18, 2015

Sometimes it's handy to have access to a distinct integer for each group. For example, using the (internal) grouper:

>>> df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab"*3), "c": range(6)})
>>> df["group_id"] = df.groupby(["a","b"]).grouper.group_info[0]
>>> df
   a  b  c  group_id
0  x  a  0         0
1  y  b  1         2
2  y  a  2         1
3  z  b  3         3
4  x  a  4         0
5  y  b  5         2

This can be achieved in a number of ways but none of them are particularly elegant, esp. if we're grouping on multiple keys and/or Series. Accordingly, after a brief discussion on gitter, I propose a new method transform("enumerate") which returns a Series of integers from 0 to ngroups-1 matching the order the groups will be iterated in. In other words, we'll simply be applying the following map:

>>> m = {k: i for i, (k,g) in enumerate(df.groupby(["a","b"]))}
>>> m
{('x', 'a'): 0, ('y', 'b'): 2, ('y', 'a'): 1, ('z', 'b'): 3}

(Note that this only shows the desired behaviour; it isn't how it would be implemented!)
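
For illustration only, here is a minimal sketch of applying that map row by row to produce the group_id column from the first example. The names key_to_id and row_keys are made up for this sketch, and a real implementation would presumably reuse the grouper's integer codes rather than a Python dict.

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})
keys = ["a", "b"]

# Map each group key to its position in groupby iteration order.
key_to_id = {k: i for i, (k, _) in enumerate(df.groupby(keys))}

# Look up every row's key tuple in that map to get a row-aligned Series.
row_keys = zip(*(df[col] for col in keys))
group_id = pd.Series([key_to_id[k] for k in row_keys], index=df.index)

print(group_id.tolist())  # [0, 2, 1, 3, 0, 2]
```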

@jreback jreback added this to the Next Major Release milestone Nov 18, 2015

@jreback

Contributor

jreback commented Nov 18, 2015

can you show an example of its utility?

also note that this is really only useful as a .transform method (a reduction is kind of silly, as it's just range(len(df.groupby(...))))
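
A small sketch of that point, reusing the frame from the opening post: taken as a reduction, the result carries one integer per group and nothing else, so it adds no information beyond the group count. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})
grouped = df.groupby(["a", "b"])

# As a reduction: one integer per group, which is just a range.
per_group = list(range(len(grouped)))

print(per_group)        # [0, 1, 2, 3]
print(grouped.ngroups)  # 4
```

The row-aligned (transform-style) result, by contrast, has len(df) entries and repeats an id for every row in the same group.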

@shoyer

Member

shoyer commented Nov 19, 2015

Note that this is essentially the same information provided by pandas.factorize:

In [1]: import pandas as pd

In [2]: pd.factorize(['a', 'a', 'b', 'c'])
Out[2]: (array([0, 0, 1, 2]), array(['a', 'b', 'c'], dtype=object))
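
One caveat, sketched below under the assumption of default settings: factorize numbers values in order of first appearance, whereas groupby iterates groups in sorted key order, so sort=True is needed for the codes to line up with the numbering proposed above. The example data is made up.

```python
import pandas as pd

values = pd.Series(["b", "a", "b", "c"])

# Default: codes follow order of first appearance (b, a, c).
codes_seen, uniques_seen = pd.factorize(values)

# sort=True: codes follow sorted order of the unique values (a, b, c).
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

print(codes_seen.tolist())    # [0, 1, 0, 2]
print(codes_sorted.tolist())  # [1, 0, 1, 2]

# groupby (sort=True by default) iterates keys in the same sorted order.
group_order = [k for k, _ in values.groupby(values)]
print(group_order)  # ['a', 'b', 'c']
```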
@dsm054

Contributor

dsm054 commented Nov 19, 2015

I couldn't think of a clean way to get factorize to handle the same inputs as groupby, though (both the multiple-series case and the mixed column-name/list input). Might have missed something obvious, of course, as is my wont. But if I needed to write a few lines to get it to work, then those lines would more naturally fit as a groupby method, or so it seemed to me.
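
To make the "few lines" concrete, here is one possible workaround for the two-key case. It is only a sketch, not something proposed in this thread: it factorizes each key separately, packs the two codes into a single integer, and factorizes again. It assumes no missing values in the keys (factorize codes them as -1, which this arithmetic would not handle).

```python
import pandas as pd

df = pd.DataFrame({"a": list("xyyzxy"), "b": list("ab" * 3), "c": range(6)})

# Factorize each key on its own, in sorted order.
codes_a, uniques_a = pd.factorize(df["a"], sort=True)
codes_b, uniques_b = pd.factorize(df["b"], sort=True)

# Pack both codes into one integer that preserves (a, b) lexicographic order.
combined = codes_a * len(uniques_b) + codes_b

# Factorize the packed integers to get dense ids from 0 to ngroups-1.
group_id, _ = pd.factorize(combined, sort=True)

print(group_id.tolist())  # [0, 2, 1, 3, 0, 2]
```

This reproduces the group_id column from the opening post, but every extra key or Series input needs more of this bookkeeping, which is the argument for doing it inside groupby.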

@dsm054

Contributor

dsm054 commented Nov 21, 2015

As I went to implement this, I started to wonder whether it makes more sense to use df.groupby("a").enumerate() instead of df.groupby("a").transform("enumerate"), to parallel df.groupby("a").cumcount() rather than df.groupby("a").transform("cumcount") (which doesn't work). This would give us something like:

>>> df = pd.DataFrame({"A": [1,2,2,2,1]})
>>> df["group_id"] = df.groupby("A").enumerate()
>>> df["group_index"] = df.groupby("A").cumcount()
>>> df
   A  group_id  group_index
0  1         0            0
1  2         1            0
2  2         1            1
3  2         1            2
4  1         0            1
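
For readers trying this on a released pandas: enumerate() is the name as proposed in this comment. To my knowledge, the method that eventually shipped for this purpose is GroupBy.ngroup() (pandas 0.20.2 and later), so the sketch below assumes a version that has it.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2, 2, 1]})

# ngroup(): one integer per group, in groupby iteration order.
# Assumption: pandas >= 0.20.2, where ngroup() is available.
df["group_id"] = df.groupby("A").ngroup()

# cumcount(): position of each row within its own group.
df["group_index"] = df.groupby("A").cumcount()

print(df)
```

The resulting frame should match the table above.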
@jreback

Contributor

jreback commented Nov 21, 2015

that looks reasonable

dsm054 added a commit to dsm054/pandas that referenced this issue Aug 18, 2016

dsm054 added a commit to dsm054/pandas that referenced this issue Mar 22, 2017

dsm054 added a commit to dsm054/pandas that referenced this issue Mar 22, 2017

@jreback jreback modified the milestones: 0.20.2, Next Major Release May 31, 2017

@jreback jreback closed this in #14026 Jun 1, 2017
