Bug in Fancy/Boolean Indexing with nested lists #2702

Closed
jim22k opened this issue Jan 15, 2013 · 9 comments · Fixed by #4756

@jim22k
commented Jan 15, 2013

Fancy or Boolean indexing on a Series has two strange behaviors. My examples only show the behavior with Fancy indexing, but it's the same for Boolean indexing.

LHS vs RHS length

    >>> import numpy as np
    >>> import pandas as pd
    >>> s = pd.Series(list('abc'))
    >>> s[[0,1,2]] = range(27)
    >>> list(s)
    [0, 1, 2]

I would have expected an error, similar to what I get with slice indexing:

    >>> s = pd.Series(list('abc'))
    >>> s[0:3] = range(27)
    ValueError: cannot copy sequence with size 27 to array axis with dimension 3

An even odder behavior occurs when the RHS has too few items:

    >>> s = pd.Series(list('abc'))
    >>> s[[0,1,2]] = range(2)
    >>> list(s)
    [0, 1, 0]

It seems to be using something like itertools.cycle, which seems very arbitrary to me.
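To make the comparison concrete, here is a small standard-library model that reproduces both outputs above. It only mimics the observed results and is not taken from the pandas or NumPy source; `target_len`, `cycled`, and `truncated` are names made up for this illustration.

```python
import itertools

target_len = 3  # number of positions selected by s[[0, 1, 2]]

# Too few items on the RHS: repeat them until every position is filled.
short_rhs = range(2)
cycled = list(itertools.islice(itertools.cycle(short_rhs), target_len))
print(cycled)  # [0, 1, 0]

# Too many items on the RHS: keep only the first target_len values.
long_rhs = range(27)
truncated = list(itertools.islice(long_rhs, target_len))
print(truncated)  # [0, 1, 2]
```

Either way the mismatch is silently absorbed, whereas the slice assignment raises.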

Nested RHS

This may seem like a strange use of pandas, but I need to store Python lists as the elements of a Series:

    >>> s = pd.Series(list('abc'))
    >>> s[[0,1,2]] = [[100,200], [300,400], [500,600]]
    >>> list(s)
    [100, 200, 300]

Very strange. It's like it flattens the input first.
But this flattening only happens if the nested levels are all the same size.

    >>> s = pd.Series(list('abc'))
    >>> s[[0,1,2]] = [[100,200], [300,400], [500,600, 601, 602]]
    >>> list(s)
    [[100, 200], [300, 400], [500, 600, 601, 602]]

I know in numpy the array constructor would make a distinction between these two inputs, so maybe that's the reason for the difference, but I still don't see why ndarrays are being flattened.
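The constructor difference can be seen directly. This is only a sketch: `dtype=object` is passed explicitly as an assumption on my part, since newer NumPy releases no longer infer it for ragged input, and the variable names are illustrative.

```python
import numpy as np

regular = [[100, 200], [300, 400], [500, 600]]
ragged = [[100, 200], [300, 400], [500, 600, 601, 602]]

# Equal-length sublists are unpacked into a second axis.
regular_arr = np.array(regular, dtype=object)
print(regular_arr.shape)  # (3, 2)

# Unequal-length sublists cannot form a rectangle, so each list stays one element.
ragged_arr = np.array(ragged, dtype=object)
print(ragged_arr.shape)  # (3,)
```

A 2-D intermediate array in the first case would be consistent with the flattened values that end up in the Series.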

I can work around the issue by converting the RHS to a 1-D array and passing that in.

    >>> s = pd.Series(list('abc'))
    >>> rhs = np.empty(3, dtype=object)
    >>> rhs[:] = [[100, 200], [300, 400], [500, 600]]
    >>> s[[0,1,2]] = rhs
    >>> list(s)
    [[100, 200], [300, 400], [500, 600]]
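The same workaround can be wrapped in a small helper. `as_object_1d` is a hypothetical name, not a pandas or NumPy function, and it fills the array one slot at a time so that no nested list is ever unpacked.

```python
import numpy as np


def as_object_1d(items):
    """Pack each element of `items` into one slot of a 1-D object array."""
    items = list(items)
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        # Assigning to a single position stores the list itself as one object.
        out[i] = item
    return out


rhs = as_object_1d([[100, 200], [300, 400], [500, 600]])
print(rhs.shape)  # (3,)
print(list(rhs))  # [[100, 200], [300, 400], [500, 600]]
```

The result can then be used as the RHS of the fancy-indexed assignment, as in the session above.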

Slice indexing doesn't have this problem at all

    >>> s = pd.Series(list('abc'))
    >>> s[0:3] = [[100,200], [300,400], [500,600]]
    >>> list(s)
    [[100, 200], [300, 400], [500, 600]]

My Question: Are these behaviors a bug or a "feature"? I think Fancy/Boolean indexing should operate the same as slice indexing -- i.e., check for matching lengths and don't auto-convert the RHS to a NumPy array.

@ghost assigned wesm Jan 20, 2013

@wesm

Member

commented Jan 20, 2013

Oh boy. Hitting a bunch of buggy/underspecified NumPy stuff here. I'm having a look but may kick this can down the road

@wesm

Member

commented Jan 20, 2013

This is all NumPy behavior. It's going to be too much work for me to fix this anytime soon. I'm already completely fed up with the NumPy library, so I would like to overhaul all this mess to make it consistent at some point in the future.

@jim22k

Author

commented Jan 21, 2013

You're right. I just validated the same bugs on a plain ndarray. Do you think there is any value in raising this issue on a NumPy forum?

Thanks for looking into these corner cases. Pandas just keeps getting better and I find myself using it more and more when dealing with any non-trivial dataset.

@jtratner

Contributor

commented Sep 5, 2013

@jreback is this resolved for pandas now that Series isn't an ndarray anymore?

@cpcloud

Member

commented Sep 5, 2013

Did I miss something? Series is no longer an NDFrame?

@jreback

Contributor

commented Sep 5, 2013

I will take a look - haven't seen this issue before

@jtratner

Contributor

commented Sep 5, 2013

@cpcloud whoops! miswrote - meant no longer an ndarray

@cpcloud

Member

commented Sep 5, 2013

@jtratner No worries! Figured it was something like that... just wanted to stay in the loop!

@jreback

Contributor

commented Sep 5, 2013

It is easy to make all of these act the same; it's just an extension in where. Right now, for ndim==1 we basically handle a single element and a single list element on the rhs, as well as a boolean indexer that matches the rhs.

so this is good (#2745)

    In [3]: s = Series([1, 2])

    In [4]: s[[True, False]] = [0, 1]

    In [5]: s
    Out[5]:
    0    0
    1    2
    dtype: int64

Otherwise it is converted to an ndarray. So we just need to deal with shorter/longer ones and raise a ValueError.
https://github.com/pydata/pandas/blob/master/pandas/core/generic.py#L2285
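A rough sketch of what such a length check could look like, in plain Python. This is not the pandas implementation: `check_setitem_length` is a hypothetical helper, it only covers list-like right-hand sides, and the rule for boolean masks (accept one value per mask entry, as in the session above, or one value per True entry) is an assumption.

```python
def check_setitem_length(indexer, value):
    """Return `value` as a list, or raise ValueError on a length mismatch.

    Hypothetical helper for illustration; not part of pandas.
    """
    indexer = list(indexer)
    value = list(value)
    is_mask = bool(indexer) and all(isinstance(i, bool) for i in indexer)
    if is_mask:
        # Assumption: allow one value per mask entry or one per True entry.
        allowed = {len(indexer), sum(indexer)}
    else:
        # Integer positions: exactly one value per position.
        allowed = {len(indexer)}
    if len(value) not in allowed:
        expected = " or ".join(str(n) for n in sorted(allowed))
        raise ValueError(
            "cannot set %d values using an indexer that expects %s"
            % (len(value), expected)
        )
    return value


print(check_setitem_length([0, 1, 2], range(3)))    # [0, 1, 2]
print(check_setitem_length([True, False], [0, 1]))  # [0, 1]
try:
    check_setitem_length([0, 1, 2], range(27))
except ValueError as exc:
    print("ValueError:", exc)
```

With a check like this in front of the assignment, both the too-long and the too-short RHS from the original report would raise instead of being truncated or cycled.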
