get_group sometimes throws an exception when using an index of tuples with different lengths #8121

Closed
dwiel opened this Issue Aug 27, 2014 · 5 comments

Contributor

dwiel commented Aug 27, 2014

Here is a simple test case that exposes the problem:

    import pandas as pd

    df = pd.DataFrame(pd.Series([(1,), (1, 2), (1,), (1, 2)]), columns=['ids'])
    gb = df.groupby('ids')
    for i in gb.size().index:
        print(i)
        gb.get_group(i)  # raises for keys whose length differs from the sampled key

The issue is that in _get_index of GroupBy, these lines assume that if there is a tuple in the index, then the index is a multi-index, which isn't true in the above test case. Maybe there is some other way to detect that values are from a multi-index, or should pandas explicitly not support tuples in this situation (in an index of a groupby)?

    pandas/core/groupby.py:

    sample = next(iter(self.indices))
    if isinstance(sample, tuple):
        if not isinstance(name, tuple):
            raise ValueError("must supply a tuple to get_group with multiple grouping keys")
        if not len(name) == len(sample):
            raise ValueError("must supply a a same-length tuple to get_group with multiple grouping keys")
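
The failure mode can be reproduced without pandas. The sketch below is a simplified stand-in for the check quoted above, not the pandas implementation: `check_name` and the hand-written `indices` dict are hypothetical names used only for illustration, and the error messages are shortened. It assumes Python 3.7+, where dicts keep insertion order, so the sampled key is always `(1,)`.

```python
def check_name(indices, name):
    """Simplified stand-in for the tuple check quoted above (hypothetical helper)."""
    sample = next(iter(indices))  # only the first key is inspected
    if isinstance(sample, tuple):
        if not isinstance(name, tuple):
            raise ValueError("must supply a tuple")
        if len(name) != len(sample):
            raise ValueError("must supply a same-length tuple")


# Hand-written mapping of group key -> row positions for the test case.
# Both keys are plain tuple values from a single column, not multi-index keys.
indices = {(1,): [0, 2], (1, 2): [1, 3]}

rejected = []
for key in indices:
    try:
        check_name(indices, key)
    except ValueError:
        rejected.append(key)

print(rejected)  # valid group keys that still fail the length check
```

Because the check compares every requested name against one sampled key, any key with a different length is rejected even though it is a legitimate group. Which key fails depends on which one happens to be sampled, which would explain why the exception only shows up sometimes.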
Contributor

TomAugspurger commented Aug 27, 2014

I'll take a look. While you're probably right that this shouldn't throw an exception, storing containers in DataFrames is usually frowned upon. Something like

In [14]: gr = df.groupby(pd.factorize(df.ids)[0])

In [15]: for i in gr.size().index:
   ....:     print(i)
   ....:     gr.get_group(i)
   ....:     
0
1

is usually better (faster and I think clearer). pd.factorize also returns the labels if you need those.
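
To make that last point concrete, here is a minimal sketch, assuming a recent pandas in which `pd.factorize` returns a `(codes, uniques)` pair. The variable names are illustrative, and the frame is built from a dict rather than a Series purely for brevity.

```python
import pandas as pd

df = pd.DataFrame({'ids': [(1,), (1, 2), (1,), (1, 2)]})

# codes: one integer per row; uniques: the distinct tuples in order of first appearance
codes, uniques = pd.factorize(df['ids'])

# Group on the integer codes instead of the tuple-valued column
gr = df.groupby(codes)

for code in gr.size().index:
    original_key = uniques[code]  # map the integer code back to the original tuple
    print(code, original_key)
    print(gr.get_group(code))
```

Grouping on plain integers sidesteps the tuple check entirely, and `uniques` keeps the mapping back to the original values when they are needed.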

Contributor

TomAugspurger commented Aug 27, 2014

@dwiel This is what you're expecting, right?

In [1]: good = pd.DataFrame([[1, 1, 1, 1], ['a', 'b', 'a', 'b']]).T

In [2]: bad = pd.DataFrame(pd.Series([(1,), (1,2), (1,), (1, 2)]), columns = ['ids'])

In [3]: gg = good.groupby([0, 1])

In [4]: gb = bad.groupby('ids')

In [5]: good
Out[5]: 
   0  1
0  1  a
1  1  b
2  1  a
3  1  b

In [6]: bad
Out[6]: 
      ids
0    (1,)
1  (1, 2)
2    (1,)
3  (1, 2)

In [9]: def run(gr):
   ...:     for i in gr.size().index:
   ...:         print(i)
   ...:         print(gr.get_group(i))
   ...:

In [10]: run(gg)
(1, 'a')
   0  1
0  1  a
2  1  a
(1, 'b')
   0  1
1  1  b
3  1  b

In [11]: run(gb)
(1,)
    ids
0  (1,)
2  (1,)
(1, 2)
      ids
1  (1, 2)
3  (1, 2)
Contributor

dwiel commented Aug 27, 2014

The factorize code does appear to do what I want.

As for your second comment, that does look like how I would expect it to work.

Contributor

TomAugspurger commented Aug 28, 2014

Should be fixed now. Like I said, you're probably better off factorizing and then grouping in this case.

Thanks for the report!

Contributor

dwiel commented Aug 28, 2014

Thanks!

