DataFrame.ix losing row ordering when index has duplicates #3561

Closed
dalejung opened this Issue May 10, 2013 · 5 comments

Contributor

dalejung commented May 10, 2013

import pandas as pd

ind = ['A', 'A', 'B', 'C']
df = pd.DataFrame({'test':range(len(ind))}, index=ind)

rows = ['C', 'B']
res = df.ix[rows]
assert rows == list(res.index) # fails

The problem is that the resulting DataFrame keeps the ordering of df.index rather than the ordering of the rows key. Note that the rows key doesn't even reference a duplicate value.

Contributor

jreback commented May 10, 2013

Thanks for the catch. This is a case that worked, but it was using a set-like indexer, so the order was not guaranteed. It provided an opportunity to refactor a bit... PR coming soon.
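To illustrate the difference between a set-like lookup and an order-preserving one, here is a minimal pure-Python sketch. It is not the pandas implementation; the helper names `positions_in_index_order` and `positions_in_key_order` are hypothetical and exist only for this example.

```python
from collections import defaultdict


def positions_in_index_order(index, key):
    """Set-like lookup: matches come back in the index's order, not the key's."""
    wanted = set(key)
    return [pos for pos, label in enumerate(index) if label in wanted]


def positions_in_key_order(index, key):
    """Order-preserving lookup: walk the key and emit every matching position."""
    by_label = defaultdict(list)
    for pos, label in enumerate(index):
        by_label[label].append(pos)
    # .get() avoids creating entries for labels that are absent from the index
    return [pos for label in key for pos in by_label.get(label, [])]


index = ['A', 'A', 'B', 'C']
rows = ['C', 'B']

old = positions_in_index_order(index, rows)
new = positions_in_key_order(index, rows)

print([index[p] for p in old])  # ['B', 'C'] -> follows the index, as in the report
print([index[p] for p in new])  # ['C', 'B'] -> follows the key
```

The second helper also handles a key that names a duplicated label: every matching row is returned, grouped under the position of that label in the key.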

Contributor

jreback commented May 10, 2013

Unique

In [1]: df=DataFrame(randn(5,3),index=list('ABCDE'))

In [2]: df.ix[['A']]
Out[2]: 
          0         1        2
A -1.048431 -0.435366  0.33573

In [3]: df.ix[['A','G']]
Out[3]: 
          0         1        2
A -1.048431 -0.435366  0.33573
G       NaN       NaN      NaN

Duplicate

In [4]: dfnu=DataFrame(randn(5,3),index=list('AABCD'))

In [5]: dfnu.ix[['A']]
Out[5]: 
          0         1         2
A  0.039932  1.049630 -2.647776
A -0.213537  0.747972 -0.830574

In [7]: dfnu.ix[['B','A','E']]
Out[7]: 
          0         1         2
B  0.292704 -1.396854 -0.414920
A  0.039932  1.049630 -2.647776
A -0.213537  0.747972 -0.830574

@dalejung @y-p @wesm
ok...behavior fixed, but what do you think about the last case,
e.g. selecting something that doesn't exist (when at least 1 value does exist)?
In the unique case you get the equivalent of reindexing; should I fix the duplicate case to do
the same?

Contributor

y-p commented May 10, 2013

re:

In [41]: dfnu=DataFrame(randn(4,3),index=list('ABCD'))

In [42]: dfnu.ix[['E']]
Out[42]: 
    0   1   2
E NaN NaN NaN

In [43]: dfnu=DataFrame(randn(5,3),index=list('AABCD'))

In [44]: dfnu.ix[['E']]
Out[44]: 
Empty DataFrame
Columns: [0, 1, 2]
Index: []

yeah, that is inconsistent.
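For reference, the consistent behavior under discussion (a missing label yields a placeholder row whether or not the index has duplicates) could be sketched in pure Python as follows. This is only an illustration under stated assumptions: `take_like_reindex` is a hypothetical helper, and `None` stands in for the NaN row that pandas shows.

```python
def take_like_reindex(index, values, key):
    """Select rows by label in key order; missing labels yield a None row."""
    by_label = {}
    for pos, label in enumerate(index):
        by_label.setdefault(label, []).append(pos)

    out_index, out_values = [], []
    for label in key:
        positions = by_label.get(label)
        if positions is None:
            # Label not present: emit a placeholder row, as reindexing would
            out_index.append(label)
            out_values.append(None)
        else:
            # Label present: emit every matching row, duplicates included
            for pos in positions:
                out_index.append(label)
                out_values.append(values[pos])
    return out_index, out_values


unique = take_like_reindex(list('ABCD'), [0, 1, 2, 3], ['E'])
dupes = take_like_reindex(list('AABCD'), [0, 1, 2, 3, 4], ['E'])
mixed = take_like_reindex(list('AABCD'), [0, 1, 2, 3, 4], ['B', 'A', 'E'])

print(unique)  # (['E'], [None])
print(dupes)   # (['E'], [None])
print(mixed)   # (['B', 'A', 'A', 'E'], [2, 0, 1, None])
```

With this rule the unique and duplicate cases give the same result for `['E']`, which is the consistency being asked about.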

Contributor

dalejung commented May 10, 2013

I think for consistency's sake it should be the same. To be honest, I don't have a use case for indexing a non-existent label or an iterable key that contains a duplicate. I came across the bug when a source file upstream had a duplicate row.

Thanks for the quick patch.

Contributor

jreback commented May 10, 2013

np...we have been fixing duplicate indices lately (again, not that there is that much use for them), but they should work....will be merged soon

jreback closed this in #3563 May 14, 2013
