New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Partial Selection on Hierarichal Index Not Working as Expected Beyond 2 Levels #2995

Closed
dragoljub opened this Issue Mar 9, 2013 · 11 comments

Comments

Projects
None yet
3 participants
@dragoljub

dragoljub commented Mar 9, 2013

The Pandas documentation states that a hierarchically indexed DataFrame can be partially selected from by passing a partial list of index values. For example if we have a 5-level index, then df.ix['A','B','C','D'] should return a DataFrame with the subset of rows where levels[0,1,2,3] match values ['A','B','C','D'] indexed with by the remaining lowest index level 5. I have found that this only works when partially selecting on the first two levels, anything beyond the first 2 levels requires the user to append a null slice at the end for .ix to work. This becomes cumbersome especially if you do not know exactly how many levels are available in a DataFrame after several processing steps. The same behavior persists regardless if your MultiIndex is int64 or string or mixed types.

It would be best if any partial tuple of index values passed to the .ix method returns a partial selection matching that tuple without requiring users to explicitly append the empty slice for the remaining levels. Let me know if this is how the .ix method is intended to work. It might be a simple fix of by checking if the tuple length == # levels and if not appending the empty slice within the .ix method.

Example Showing the Issue:

Take an example data set with 5 index levels, group by the first 4 levels and count the group sizes. Select the group with the largest size.

Results: (Clean Code Pasted Below IPython Output)

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: print pd.__version__
0.11.0.dev-80945b6

In [4]: # Generate Test DataFrame
   ...: NUM_ROWS = 100000
   ...:

In [5]: NUM_COLS = 10

In [6]: col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]

In [7]: index_cols = col_names[:5]

In [8]: # Set DataFrame to have 5 level Hierarchical Index.
   ...: # The dtype does not matter try str or np.int64 same results.
   ...: df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
   ...:

In [9]: df = df.set_index(index_cols)

In [10]: df
Out[10]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (2, 2, 4, 0, 0) to (3, 2, 3, 1, 1)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [11]: # Group by first 4 index columns.
   ....: grp = df.groupby(level=index_cols[:-1])
   ....:

In [12]: # Compute group size and find index of largest group.
   ....: grp_size = grp.size()
   ....:

In [13]: grp_size.sort()

In [14]: grp_size[::-1]
Out[14]: A0  A1  A2  A3
2   1   2   1     200
1   4   1   3     200
3   1   4   1     199
4   1   1   1     196
2   4   2   0     190
4   4   0   1     190
3   1   0   0     189
0   0   4   3     189
4   1   2   1     188
0   1   4   2     187
2   2   3   0     187
3   2   3   2     187
    3   4   0     186
    1   3   0     186
    3   3   4     185
...
0   1   1   4     136
2   4   4   3     135
4   4   1   2     135
2   4   3   2     135
3   3   4   1     134
4   2   3   1     133
3   2   2   4     133
4   0   1   3     133
0   4   3   4     131
1   2   1   0     130
4   4   1   4     130
    2   1   4     128
0   2   3   2     127
3   3   0   1     121
1   0   2   3     106
Length: 625, dtype: int64

In [15]: loc = grp.size().idxmax()

In [16]: loc
Out[16]: (1, 4, 1, 3)

In [17]: # The following line does not work.
   ....: # Can't use partial selection to select largest group.
   ....: df.ix[loc]
   ....:
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-17-e26047d119d7> in <module>()
      1 # The following line does not work.
      2 # Can't use partial selection to select largest group.
----> 3 df.ix[loc]
      4

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
     44                 pass
     45
---> 46             return self._getitem_tuple(key)
     47         else:
     48             return self._getitem_axis(key, axis=0)

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
    235         for i, key in enumerate(tup):
    236             if i >= self.obj.ndim:
--> 237                 raise IndexingError('Too many indexers')
    238
    239             if _is_null_slice(key):

IndexingError: Too many indexers

In [18]: # Appending the null slice at the end works!
   ....: # I was hoping for partial selection to work above.
   ....: df.ix[loc+(slice(None),)]
   ....:
Out[18]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 0
Data columns:
A5    200  non-null values
A6    200  non-null values
A7    200  non-null values
A8    200  non-null values
A9    200  non-null values
dtypes: int64(5)

In [19]: # Selecting on first level works.
   ....: df.ix[loc[0]]
   ....:
Out[19]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 19796 entries, (3, 1, 1, 1) to (4, 1, 3, 0)
Data columns:
A5    19796  non-null values
A6    19796  non-null values
A7    19796  non-null values
A8    19796  non-null values
A9    19796  non-null values
dtypes: int64(5)

In [20]: # Selecting on first two levels works.
   ....: df.ix[loc[0], loc[1]]
   ....:
Out[20]: A1  A2  A3  A4
3   1   1   1     3
0   1   1   0     2
    4   0   1     3
4   3   1   4     1
1   4   4   4     4
0   0   3   2     4
1   1   2   3     1
4   3   4   3     4
2   1   2   0     3
1   2   0   2     1
0   0   3   0     3
    2   4   2     2
4   4   0   4     1
    2   2   4     4
1   3   0   3     4
...
0   2   4   0     0
1   4   2   1     4
    0   2   1     4
3   4   2   3     2
1   1   3   2     3
3   3   0   3     1
2   1   2   4     3
4   4   2   2     0
    0   2   3     2
    4   4   1     3
3   1   3   4     2
2   2   4   1     4
3   1   1   1     1
1   0   1   3     1
4   1   3   0     3
Name: A9, Length: 19796, dtype: int64

In [21]: # Selecting on first 3 levels fails.
   ....: df.ix[loc[0], loc[1], loc[2]]
   ....:
---------------------------------------------------------------------------
IndexingError                             Traceback (most recent call last)
<ipython-input-21-38eb049081e0> in <module>()
      1 # Selecting on first 3 levels fails.
----> 2 df.ix[loc[0], loc[1], loc[2]]
      3

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in __getitem__(self, key)
     44                 pass
     45
---> 46             return self._getitem_tuple(key)
     47         else:
     48             return self._getitem_axis(key, axis=0)

C:\Python27\lib\site-packages\pandas\core\indexing.pyc in _getitem_tuple(self, tup)
    235         for i, key in enumerate(tup):
    236             if i >= self.obj.ndim:
--> 237                 raise IndexingError('Too many indexers')
    238
    239             if _is_null_slice(key):

IndexingError: Too many indexers

In [22]: # Adding the null slice and it works.
   ....: df.ix[loc[0], loc[1], loc[2], slice(None)]
Out[22]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 848 entries, (2, 1) to (3, 0)
Data columns:
A5    848  non-null values
A6    848  non-null values
A7    848  non-null values
A8    848  non-null values
A9    848  non-null values
dtypes: int64(5)

Code:

import numpy as np
import pandas as pd

print pd.__version__

# Generate Test DataFrame
NUM_ROWS = 100000
NUM_COLS = 10
col_names = ['A'+num for num in map(str,np.arange(NUM_COLS).tolist())]
index_cols = col_names[:5]

# Set DataFrame to have 5 level Hierarchical Index.
# The dtype does not matter try str or np.int64 same results.
df = pd.DataFrame(np.random.randint(5, size=(NUM_ROWS,NUM_COLS)), dtype=np.int64, columns=col_names)
df = df.set_index(index_cols)
df

# Group by first 4 index columns.
grp = df.groupby(level=index_cols[:-1])

# Compute group size and find index of largest group.
grp_size = grp.size()
grp_size.sort()
grp_size[::-1]
loc = grp.size().idxmax()
loc

# The following line does not work.
# Can't use partial selection to select largest group.
df.ix[loc]

# Appending the null slice at the end works!
# I was hoping for partial selection to work above.
df.ix[loc+(slice(None),)]

# Selecting on first level works.
df.ix[loc[0]]

# Selecting on first two levels works.
df.ix[loc[0], loc[1]]

# Selecting on first 3 levels fails.
df.ix[loc[0], loc[1], loc[2]]

# Adding the null slice and it works.
df.ix[loc[0], loc[1], loc[2], slice(None)]
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Mar 9, 2013

Contributor

Here is a workaround for solving your particular issue.

The problem is passing a tuple of indexers is tantamount to this:

df.ix[row_indexer,col_indexer] generates this tuple(row_indexer,col_indexer)

so passing a tuple is ambiguous; this is handled for a len 2 tuple however (there is currently no method to 'decide' if you are passing a tuple to index axis=0 with a tuple on a multi-index or index axis0 and axis1)

This is ambiguous (and is indexing on axis0 for loc0 but axis1 for loc[1])
In [14]: df.ix[loc[0],loc[1]]
Out[14]: 
A1  A2  A3  A4
0   0   2   0     3
2   0   1   3     1
3   1   3   3     0
0   1   3   3     1
2   1   1   4     0
0   2   2   3     2
1   2   4   1     2
2   3   2   2     3
4   4   4   3     2
3   3   0   0     2
2   4   0   1     1
    3   3   3     0
4   3   0   1     0
2   3   3   4     1
    0   2   4     0
...
4   3   1   3     2
0   0   4   1     2
2   0   4   3     1
    1   4   4     2
    4   3   2     4
0   2   2   3     2
3   4   1   1     2
    1   3   3     4
2   1   0   1     0
            3     2
3   4   4   2     0
1   4   2   2     0
2   0   1   1     0
1   2   2   3     4
3   4   1   4     3
Name: A9, Length: 19888, dtype: int64

This is what I think you actually want
In [8]: df.ix[loc[0]].ix[loc[1]]
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4009 entries, (4, 4, 3) to (3, 1, 3)
Data columns:
A5    4009  non-null values
A6    4009  non-null values
A7    4009  non-null values
A8    4009  non-null values
A9    4009  non-null values
dtypes: int64(5)

Define this function

def partial_indexer(df, tup, axis=0):
            if not isinstance(tup, (tuple,list)):
                raise Exception("must pass a tuple or a list")
            if len(tup) > df.index.levels:
                raise Exception("too many indexing levels")

            for i, key in enumerate(tup):
                df = df.xs(key, axis=axis)
            return df

In [9]: partial_indexer(df,loc[0:2])
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4009 entries, (4, 4, 3) to (3, 1, 3)
Data columns:
A5    4009  non-null values
A6    4009  non-null values
A7    4009  non-null values
A8    4009  non-null values
A9    4009  non-null values
dtypes: int64(5)

In [10]: partial_indexer(df,loc)
Out[10]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 3 to 1
Data columns:
A5    210  non-null values
A6    210  non-null values
A7    210  non-null values
A8    210  non-null values
A9    210  non-null values
dtypes: int64(5)
Contributor

jreback commented Mar 9, 2013

Here is a workaround for solving your particular issue.

The problem is passing a tuple of indexers is tantamount to this:

df.ix[row_indexer,col_indexer] generates this tuple(row_indexer,col_indexer)

so passing a tuple is ambiguous; this is handled for a len 2 tuple however (there is currently no method to 'decide' if you are passing a tuple to index axis=0 with a tuple on a multi-index or index axis0 and axis1)

This is ambiguous (and is indexing on axis0 for loc0 but axis1 for loc[1])
In [14]: df.ix[loc[0],loc[1]]
Out[14]: 
A1  A2  A3  A4
0   0   2   0     3
2   0   1   3     1
3   1   3   3     0
0   1   3   3     1
2   1   1   4     0
0   2   2   3     2
1   2   4   1     2
2   3   2   2     3
4   4   4   3     2
3   3   0   0     2
2   4   0   1     1
    3   3   3     0
4   3   0   1     0
2   3   3   4     1
    0   2   4     0
...
4   3   1   3     2
0   0   4   1     2
2   0   4   3     1
    1   4   4     2
    4   3   2     4
0   2   2   3     2
3   4   1   1     2
    1   3   3     4
2   1   0   1     0
            3     2
3   4   4   2     0
1   4   2   2     0
2   0   1   1     0
1   2   2   3     4
3   4   1   4     3
Name: A9, Length: 19888, dtype: int64

This is what I think you actually want
In [8]: df.ix[loc[0]].ix[loc[1]]
Out[8]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4009 entries, (4, 4, 3) to (3, 1, 3)
Data columns:
A5    4009  non-null values
A6    4009  non-null values
A7    4009  non-null values
A8    4009  non-null values
A9    4009  non-null values
dtypes: int64(5)

Define this function

def partial_indexer(df, tup, axis=0):
            if not isinstance(tup, (tuple,list)):
                raise Exception("must pass a tuple or a list")
            if len(tup) > df.index.levels:
                raise Exception("too many indexing levels")

            for i, key in enumerate(tup):
                df = df.xs(key, axis=axis)
            return df

In [9]: partial_indexer(df,loc[0:2])
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4009 entries, (4, 4, 3) to (3, 1, 3)
Data columns:
A5    4009  non-null values
A6    4009  non-null values
A7    4009  non-null values
A8    4009  non-null values
A9    4009  non-null values
dtypes: int64(5)

In [10]: partial_indexer(df,loc)
Out[10]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 210 entries, 3 to 1
Data columns:
A5    210  non-null values
A6    210  non-null values
A7    210  non-null values
A8    210  non-null values
A9    210  non-null values
dtypes: int64(5)
@dragoljub

This comment has been minimized.

Show comment
Hide comment
@dragoljub

dragoljub Mar 9, 2013

Good catch on the df.ix[loc[0],loc[1]] not correctly selecting what I intended. Indeed I meant to select df.ix[loc[0]].ix[loc[1]]. Thanks for the partial_indexer function.

If the df.ix[row, col] method converts the index arguments to a tuple(row, col) are the following correctly interpreted as I have shown:

df.ix[1, 4, 1, 3, :] correctly partially selects the data.
(1,4,1,3, slice(None, None, None))

df.ix[(1, 4, 1, 3), :] does not work Key Error
((1, 4, 1, 3), slice(None, None, None))

df.ix[1, 4, 1, 3, 1] does not work --> Too many Indexers. Not sure how to partially select using all indices.
(1, 4, 1, 3, 1)

df.ix[1, 4, 1, 3, 1, :] does not work either --> Too many Indexers

but this works... df.ix[df[0:1].index]
Selecting by one MultiIndex and does not drop the index columns and correctly returns the partial data.

Perhaps it would make sense to accept a tuple of tuples to partition the row/col selections.

((ordered tuple of row indexers),(ordered tuple of col indexers))

then the outer tuple can be unpacked and you have two separate objects for partially selecting from both rows and cols.

-Gagi

dragoljub commented Mar 9, 2013

Good catch on the df.ix[loc[0],loc[1]] not correctly selecting what I intended. Indeed I meant to select df.ix[loc[0]].ix[loc[1]]. Thanks for the partial_indexer function.

If the df.ix[row, col] method converts the index arguments to a tuple(row, col) are the following correctly interpreted as I have shown:

df.ix[1, 4, 1, 3, :] correctly partially selects the data.
(1,4,1,3, slice(None, None, None))

df.ix[(1, 4, 1, 3), :] does not work Key Error
((1, 4, 1, 3), slice(None, None, None))

df.ix[1, 4, 1, 3, 1] does not work --> Too many Indexers. Not sure how to partially select using all indices.
(1, 4, 1, 3, 1)

df.ix[1, 4, 1, 3, 1, :] does not work either --> Too many Indexers

but this works... df.ix[df[0:1].index]
Selecting by one MultiIndex and does not drop the index columns and correctly returns the partial data.

Perhaps it would make sense to accept a tuple of tuples to partition the row/col selections.

((ordered tuple of row indexers),(ordered tuple of col indexers))

then the outer tuple can be unpacked and you have two separate objects for partially selecting from both rows and cols.

-Gagi

@lodagro

This comment has been minimized.

Show comment
Hide comment
@lodagro

lodagro Mar 9, 2013

Contributor

These are scenarios where i prefer xs over ix. Also with xs user has control of copy versus view.

In [44]: df.xs([loc[0], loc[1]], level=[0, 1])
Out[44]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4003 entries, (0, 2, 3) to (2, 4, 4)
Data columns:
A5    4003  non-null values
A6    4003  non-null values
A7    4003  non-null values
A8    4003  non-null values
A9    4003  non-null values
dtypes: int64(5)
In [45]: df.xs([loc[1], loc[3]], level=[1, 3])
Out[45]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4055 entries, (2, 2, 3) to (3, 1, 1)
Data columns:
A5    4055  non-null values
A6    4055  non-null values
A7    4055  non-null values
A8    4055  non-null values
A9    4055  non-null values
dtypes: int64(5)
Contributor

lodagro commented Mar 9, 2013

These are scenarios where i prefer xs over ix. Also with xs user has control of copy versus view.

In [44]: df.xs([loc[0], loc[1]], level=[0, 1])
Out[44]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4003 entries, (0, 2, 3) to (2, 4, 4)
Data columns:
A5    4003  non-null values
A6    4003  non-null values
A7    4003  non-null values
A8    4003  non-null values
A9    4003  non-null values
dtypes: int64(5)
In [45]: df.xs([loc[1], loc[3]], level=[1, 3])
Out[45]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4055 entries, (2, 2, 3) to (3, 1, 1)
Data columns:
A5    4055  non-null values
A6    4055  non-null values
A7    4055  non-null values
A8    4055  non-null values
A9    4055  non-null values
dtypes: int64(5)
@dragoljub

This comment has been minimized.

Show comment
Hide comment
@dragoljub

dragoljub Mar 11, 2013

I see that df.xs does seem to support partial indexing as expected. Is it possible to call df.xs without explicitly passing the level? I level is not passed could it default to a list from 1 to the number of dimensions passed?

dragoljub commented Mar 11, 2013

I see that df.xs does seem to support partial indexing as expected. Is it possible to call df.xs without explicitly passing the level? I level is not passed could it default to a list from 1 to the number of dimensions passed?

@lodagro

This comment has been minimized.

Show comment
Hide comment
@lodagro

lodagro Mar 11, 2013

Contributor

It should default to the first n levels (n=length of the key), but it does not (at least not on master any more, see also #1796).

In [28]: df.xs([loc[0], loc[1]])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'MultiIndex lexsort depth 0, key was length 2'

In [29]: df.xs([loc[0], loc[1]], level=[0, 1])
Out[29]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4047 entries, (2, 1, 3) to (1, 1, 0)
Data columns:
A5    4047  non-null values
A6    4047  non-null values
A7    4047  non-null values
A8    4047  non-null values
A9    4047  non-null values
dtypes: int64(5)
Contributor

lodagro commented Mar 11, 2013

It should default to the first n levels (n=length of the key), but it does not (at least not on master any more, see also #1796).

In [28]: df.xs([loc[0], loc[1]])
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'MultiIndex lexsort depth 0, key was length 2'

In [29]: df.xs([loc[0], loc[1]], level=[0, 1])
Out[29]:
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 4047 entries, (2, 1, 3) to (1, 1, 0)
Data columns:
A5    4047  non-null values
A6    4047  non-null values
A7    4047  non-null values
A8    4047  non-null values
A9    4047  non-null values
dtypes: int64(5)
@dragoljub

This comment has been minimized.

Show comment
Hide comment
@dragoljub

dragoljub Mar 11, 2013

With a similar DataFrame as above, df.xs can't partially select rows where the passed tuple length equals the total number of levels in the MultIndex. For example below we know (0, 1, 0, 4, 3) will return at least the first row of the data set. Using df.xs cant access rows indexed by (0, 1, 0, 4, 3). However using df.ix[df[0:1].index] to directly access rows that match the first rows index works.

In [35]: df
Out[35]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 1, 0, 4, 3) to (2, 0, 4, 1, 1)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [36]: loc
Out[36]: (2, 3, 2, 4)

In [37]: df.xs(loc, level=[0,1,2,3])
Out[37]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 1 to 3
Data columns:
A5    201  non-null values
A6    201  non-null values
A7    201  non-null values
A8    201  non-null values
A9    201  non-null values
dtypes: int64(5)

In [38]: df.xs((0, 1, 0, 4, 3), level=[0,1,2,3,4])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-38-d8180b507e6b> in <module>()
----> 1 df.xs((0, 1, 0, 4, 3), level=[0,1,2,3,4])

C:\Python27\lib\site-packages\pandas\core\frame.py in xs(self, key, axis, level, copy)
   2291         labels = self._get_axis(axis)
   2292         if level is not None:
-> 2293             loc, new_ax = labels.get_loc_level(key, level=level)
   2294 
   2295             if not copy and not isinstance(loc, slice):

C:\Python27\lib\site-packages\pandas\core\index.py in get_loc_level(self, key, level)
   2243 
   2244                 result = loc if result is None else result & loc
-> 2245             return result, _drop_levels(result, level)
   2246 
   2247         level = self._get_level_number(level)

C:\Python27\lib\site-packages\pandas\core\index.py in _drop_levels(indexer, levels)
   2227             levels = [self._get_level_number(i) for i in levels]
   2228             for i in sorted(levels, reverse=True):
-> 2229                 new_index = new_index.droplevel(i)
   2230             return new_index
   2231 

AttributeError: 'Int64Index' object has no attribute 'droplevel'

In [39]: df.ix[df[0:1].index]
Out[39]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  0  4  3    1   1   0   4   2
            3    1   1   3   1   0
            3    3   2   3   0   0
            3    4   3   1   1   1
            3    2   0   1   2   1
            3    4   1   1   2   4
            3    0   4   2   1   0
            3    2   0   2   0   4
            3    0   0   3   0   3
            3    1   0   1   2   3
            3    2   0   3   0   3
            3    1   3   4   2   3
            3    3   1   0   1   2
            3    0   0   1   4   0
            3    3   1   3   1   4
            3    4   0   2   2   1
            3    2   1   0   1   2
            3    0   1   4   3   2
            3    4   0   4   4   3
            3    1   4   0   4   0
            3    2   1   3   3   0
            3    3   2   1   0   4
            3    1   2   0   1   3
            3    4   1   0   1   4
            3    3   2   0   0   0
            3    3   3   3   1   0
            3    0   0   0   0   3
            3    1   0   1   0   2
            3    2   3   2   1   0
            3    3   0   3   2   4
            3    3   0   3   3   3
            3    1   3   4   0   1
            3    4   4   0   1   4
            3    0   0   4   3   2
            3    1   1   2   4   2

dragoljub commented Mar 11, 2013

With a similar DataFrame as above, df.xs can't partially select rows where the passed tuple length equals the total number of levels in the MultIndex. For example below we know (0, 1, 0, 4, 3) will return at least the first row of the data set. Using df.xs cant access rows indexed by (0, 1, 0, 4, 3). However using df.ix[df[0:1].index] to directly access rows that match the first rows index works.

In [35]: df
Out[35]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 1, 0, 4, 3) to (2, 0, 4, 1, 1)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [36]: loc
Out[36]: (2, 3, 2, 4)

In [37]: df.xs(loc, level=[0,1,2,3])
Out[37]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 1 to 3
Data columns:
A5    201  non-null values
A6    201  non-null values
A7    201  non-null values
A8    201  non-null values
A9    201  non-null values
dtypes: int64(5)

In [38]: df.xs((0, 1, 0, 4, 3), level=[0,1,2,3,4])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-38-d8180b507e6b> in <module>()
----> 1 df.xs((0, 1, 0, 4, 3), level=[0,1,2,3,4])

C:\Python27\lib\site-packages\pandas\core\frame.py in xs(self, key, axis, level, copy)
   2291         labels = self._get_axis(axis)
   2292         if level is not None:
-> 2293             loc, new_ax = labels.get_loc_level(key, level=level)
   2294 
   2295             if not copy and not isinstance(loc, slice):

C:\Python27\lib\site-packages\pandas\core\index.py in get_loc_level(self, key, level)
   2243 
   2244                 result = loc if result is None else result & loc
-> 2245             return result, _drop_levels(result, level)
   2246 
   2247         level = self._get_level_number(level)

C:\Python27\lib\site-packages\pandas\core\index.py in _drop_levels(indexer, levels)
   2227             levels = [self._get_level_number(i) for i in levels]
   2228             for i in sorted(levels, reverse=True):
-> 2229                 new_index = new_index.droplevel(i)
   2230             return new_index
   2231 

AttributeError: 'Int64Index' object has no attribute 'droplevel'

In [39]: df.ix[df[0:1].index]
Out[39]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  0  4  3    1   1   0   4   2
            3    1   1   3   1   0
            3    3   2   3   0   0
            3    4   3   1   1   1
            3    2   0   1   2   1
            3    4   1   1   2   4
            3    0   4   2   1   0
            3    2   0   2   0   4
            3    0   0   3   0   3
            3    1   0   1   2   3
            3    2   0   3   0   3
            3    1   3   4   2   3
            3    3   1   0   1   2
            3    0   0   1   4   0
            3    3   1   3   1   4
            3    4   0   2   2   1
            3    2   1   0   1   2
            3    0   1   4   3   2
            3    4   0   4   4   3
            3    1   4   0   4   0
            3    2   1   3   3   0
            3    3   2   1   0   4
            3    1   2   0   1   3
            3    4   1   0   1   4
            3    3   2   0   0   0
            3    3   3   3   1   0
            3    0   0   0   0   3
            3    1   0   1   0   2
            3    2   3   2   1   0
            3    3   0   3   2   4
            3    3   0   3   3   3
            3    1   3   4   0   1
            3    4   4   0   1   4
            3    0   0   4   3   2
            3    1   1   2   4   2
@lodagro

This comment has been minimized.

Show comment
Hide comment
@lodagro

lodagro Mar 12, 2013

Contributor

Note that your DataFrame has an unsorted index (i somehow did not notice this before). This is important when slicing/indexing on a MultiIndex. For details see "The need for sortedness"

In [156]: df
Out[156]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (4, 0, 0, 2, 2) to (1, 4, 1, 2, 0)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [157]: df.index.lexsort_depth
Out[157]: 0

In [158]: df.xs((0,0,0,0,0))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'MultiIndex lexsort depth 0, key was length 5'

In [159]: dfs = df.sort_index()  # sort the index

In [160]: dfs.index.lexsort_depth
Out[160]: 5

In [161]: dfs.xs((0,0,0,0,0))
Out[161]: 
                A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  0  0  0  0    4   1   0   1   3
            0    3   0   3   2   4
            0    4   0   2   2   0
            0    1   4   4   1   3
            0    4   4   2   3   0
            0    4   1   0   3   4
            0    3   4   3   3   1
            0    1   3   2   4   2
            0    1   1   3   1   2
            0    2   3   2   3   4
            0    3   3   2   2   2
            0    4   0   1   3   0
            0    2   1   4   1   0
            0    4   1   3   2   1
            0    3   4   2   0   2
            0    3   4   2   4   1
            0    4   4   3   2   3
            0    3   4   4   0   0
            0    3   1   4   1   1
            0    2   4   0   1   4
            0    4   4   2   3   2
            0    4   2   0   1   0
            0    1   4   4   0   3
            0    0   0   1   0   3
            0    3   1   1   1   0
            0    4   3   1   4   2
            0    0   4   1   2   3
            0    1   4   3   4   3

Going back to your original examples, the failing parts work fine after sorting

In [168]: dfs.ix[loc]
Out[168]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 164 entries, 0 to 4
Data columns:
A5    164  non-null values
A6    164  non-null values
A7    164  non-null values
A8    164  non-null values
A9    164  non-null values
dtypes: int64(5)

In [169]: dfs.ix[loc[0], loc[1], loc[2]]
Out[169]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 785 entries, (0, 0) to (4, 4)
Data columns:
A5    785  non-null values
A6    785  non-null values
A7    785  non-null values
A8    785  non-null values
A9    785  non-null values
dtypes: int64(5)
Contributor

lodagro commented Mar 12, 2013

Note that your DataFrame has an unsorted index (i somehow did not notice this before). This is important when slicing/indexing on a MultiIndex. For details see "The need for sortedness"

In [156]: df
Out[156]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (4, 0, 0, 2, 2) to (1, 4, 1, 2, 0)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [157]: df.index.lexsort_depth
Out[157]: 0

In [158]: df.xs((0,0,0,0,0))
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
...
KeyError: 'MultiIndex lexsort depth 0, key was length 5'

In [159]: dfs = df.sort_index()  # sort the index

In [160]: dfs.index.lexsort_depth
Out[160]: 5

In [161]: dfs.xs((0,0,0,0,0))
Out[161]: 
                A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  0  0  0  0    4   1   0   1   3
            0    3   0   3   2   4
            0    4   0   2   2   0
            0    1   4   4   1   3
            0    4   4   2   3   0
            0    4   1   0   3   4
            0    3   4   3   3   1
            0    1   3   2   4   2
            0    1   1   3   1   2
            0    2   3   2   3   4
            0    3   3   2   2   2
            0    4   0   1   3   0
            0    2   1   4   1   0
            0    4   1   3   2   1
            0    3   4   2   0   2
            0    3   4   2   4   1
            0    4   4   3   2   3
            0    3   4   4   0   0
            0    3   1   4   1   1
            0    2   4   0   1   4
            0    4   4   2   3   2
            0    4   2   0   1   0
            0    1   4   4   0   3
            0    0   0   1   0   3
            0    3   1   1   1   0
            0    4   3   1   4   2
            0    0   4   1   2   3
            0    1   4   3   4   3

Going back to your original examples, the failing parts work fine after sorting

In [168]: dfs.ix[loc]
Out[168]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 164 entries, 0 to 4
Data columns:
A5    164  non-null values
A6    164  non-null values
A7    164  non-null values
A8    164  non-null values
A9    164  non-null values
dtypes: int64(5)

In [169]: dfs.ix[loc[0], loc[1], loc[2]]
Out[169]: 
<class 'pandas.core.frame.DataFrame'>
MultiIndex: 785 entries, (0, 0) to (4, 4)
Data columns:
A5    785  non-null values
A6    785  non-null values
A7    785  non-null values
A8    785  non-null values
A9    785  non-null values
dtypes: int64(5)
@dragoljub

This comment has been minimized.

Show comment
Hide comment
@dragoljub

dragoljub Mar 12, 2013

Great catch lodagro! Sorting the hierarchical index did fix the df.ix and df.xs methods. The fluke indexing working when appending the null slice was just serving to confuse me further.

For a future release it might make sense to have a default bool sort_index parameter when calling df.set_index() ensuring that a sorted index is returned.

Issue closed.

dragoljub commented Mar 12, 2013

Great catch lodagro! Sorting the hierarchical index did fix the df.ix and df.xs methods. The fluke indexing working when appending the null slice was just serving to confuse me further.

For a future release it might make sense to have a default bool sort_index parameter when calling df.set_index() ensuring that a sorted index is returned.

Issue closed.

@dragoljub dragoljub closed this Mar 12, 2013

@dragoljub

This comment has been minimized.

Show comment
Hide comment
@dragoljub

dragoljub Mar 13, 2013

I was playing around with the partial selection and I ran into an inconsistency with how partially selected data frames are returned. When selecting using a subset of levels you get a df is returned with the fixed/selected levels dropped. When you partially select using a tuple on all levels you get a data frame with all indices returned.

Is there a way to return all indices when sub-selecting on a subset of levels? Is there a way to not return the entire hierarchical index when selecting across all levels?

With a df similar to the one above, indexed using 5 integer columns and with the hierarchical indices sorted.

In [79]: df
Out[79]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [80]: df.ix[(0)]
Out[80]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 20011 entries, (0, 0, 0, 0) to (4, 4, 4, 4)
Data columns:
A5    20011  non-null values
A6    20011  non-null values
A7    20011  non-null values
A8    20011  non-null values
A9    20011  non-null values
dtypes: int64(5)

In [81]: df.ix[(0,1)]
Out[81]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 4007 entries, (0, 0, 0) to (4, 4, 4)
Data columns:
A5    4007  non-null values
A6    4007  non-null values
A7    4007  non-null values
A8    4007  non-null values
A9    4007  non-null values
dtypes: int64(5)

In [82]: df.ix[(0,1,2)]
Out[82]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 817 entries, (0, 0) to (4, 4)
Data columns:
A5    817  non-null values
A6    817  non-null values
A7    817  non-null values
A8    817  non-null values
A9    817  non-null values
dtypes: int64(5)

In [83]: df.ix[(0,1,2,3)]
Out[83]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 162 entries, 0 to 4
Data columns:
A5    162  non-null values
A6    162  non-null values
A7    162  non-null values
A8    162  non-null values
A9    162  non-null values
dtypes: int64(5)

In [84]: df.ix[(0,1,2,3,4)]
Out[84]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   2   2   4   2
            4    1   4   4   1   0
            4    2   1   4   1   3
            4    2   4   2   1   1
            4    1   1   2   1   4
            4    0   0   2   1   1
            4    2   0   0   3   1
            4    2   2   3   3   1
            4    3   0   3   4   1
            4    1   1   0   0   1
            4    2   1   0   2   4
            4    3   4   1   2   3
            4    0   4   3   1   0
            4    4   1   4   1   2
            4    1   3   4   3   3
            4    0   1   1   3   1
            4    2   2   2   0   3
            4    0   0   1   4   0
            4    1   0   1   4   2
            4    1   4   2   2   0
            4    4   2   0   3   1
            4    2   1   2   3   2
            4    4   2   0   1   4
            4    1   4   1   1   4
            4    1   0   1   2   4
            4    2   3   0   1   3
            4    2   1   3   3   3
            4    1   2   0   4   2
            4    3   0   4   4   0
            4    4   4   2   3   0
            4    0   0   1   3   2
            4    4   0   0   0   3
            4    2   0   3   4   2
            4    3   3   3   0   2
            4    4   2   2   0   1
            4    2   1   3   4   0

In [86]: df.index.lexsort_depth
Out[86]: 5

dragoljub commented Mar 13, 2013

I was playing around with the partial selection and I ran into an inconsistency with how partially selected data frames are returned. When selecting using a subset of levels you get a df is returned with the fixed/selected levels dropped. When you partially select using a tuple on all levels you get a data frame with all indices returned.

Is there a way to return all indices when sub-selecting on a subset of levels? Is there a way to not return the entire hierarchical index when selecting across all levels?

With a df similar to the one above, indexed using 5 integer columns and with the hierarchical indices sorted.

In [79]: df
Out[79]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 100000 entries, (0, 0, 0, 0, 0) to (4, 4, 4, 4, 4)
Data columns:
A5    100000  non-null values
A6    100000  non-null values
A7    100000  non-null values
A8    100000  non-null values
A9    100000  non-null values
dtypes: int64(5)

In [80]: df.ix[(0)]
Out[80]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 20011 entries, (0, 0, 0, 0) to (4, 4, 4, 4)
Data columns:
A5    20011  non-null values
A6    20011  non-null values
A7    20011  non-null values
A8    20011  non-null values
A9    20011  non-null values
dtypes: int64(5)

In [81]: df.ix[(0,1)]
Out[81]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 4007 entries, (0, 0, 0) to (4, 4, 4)
Data columns:
A5    4007  non-null values
A6    4007  non-null values
A7    4007  non-null values
A8    4007  non-null values
A9    4007  non-null values
dtypes: int64(5)

In [82]: df.ix[(0,1,2)]
Out[82]: <class 'pandas.core.frame.DataFrame'>
MultiIndex: 817 entries, (0, 0) to (4, 4)
Data columns:
A5    817  non-null values
A6    817  non-null values
A7    817  non-null values
A8    817  non-null values
A9    817  non-null values
dtypes: int64(5)

In [83]: df.ix[(0,1,2,3)]
Out[83]: <class 'pandas.core.frame.DataFrame'>
Int64Index: 162 entries, 0 to 4
Data columns:
A5    162  non-null values
A6    162  non-null values
A7    162  non-null values
A8    162  non-null values
A9    162  non-null values
dtypes: int64(5)

In [84]: df.ix[(0,1,2,3,4)]
Out[84]:                 A5  A6  A7  A8  A9
A0 A1 A2 A3 A4                    
0  1  2  3  4    1   2   2   4   2
            4    1   4   4   1   0
            4    2   1   4   1   3
            4    2   4   2   1   1
            4    1   1   2   1   4
            4    0   0   2   1   1
            4    2   0   0   3   1
            4    2   2   3   3   1
            4    3   0   3   4   1
            4    1   1   0   0   1
            4    2   1   0   2   4
            4    3   4   1   2   3
            4    0   4   3   1   0
            4    4   1   4   1   2
            4    1   3   4   3   3
            4    0   1   1   3   1
            4    2   2   2   0   3
            4    0   0   1   4   0
            4    1   0   1   4   2
            4    1   4   2   2   0
            4    4   2   0   3   1
            4    2   1   2   3   2
            4    4   2   0   1   4
            4    1   4   1   1   4
            4    1   0   1   2   4
            4    2   3   0   1   3
            4    2   1   3   3   3
            4    1   2   0   4   2
            4    3   0   4   4   0
            4    4   4   2   3   0
            4    0   0   1   3   2
            4    4   0   0   0   3
            4    2   0   3   4   2
            4    3   3   3   0   2
            4    4   2   2   0   1
            4    2   1   3   4   0

In [86]: df.index.lexsort_depth
Out[86]: 5

@dragoljub dragoljub reopened this Mar 13, 2013

@lodagro

This comment has been minimized.

Show comment
Hide comment
@lodagro

lodagro Mar 14, 2013

Contributor

There is no way to control how the index is returned with xs, either full partial or not.

fyi to remove the index you can do df.reset_index(drop=True)

Since the discussion has moved away from the original issue (which is resolved i think), would it be ok for you to close this issue and if needed open a new one?

Contributor

lodagro commented Mar 14, 2013

There is no way to control how the index is returned with xs, either full partial or not.

fyi to remove the index you can do df.reset_index(drop=True)

Since the discussion has moved away from the original issue (which is resolved i think), would it be ok for you to close this issue and if needed open a new one?

@dragoljub

This comment has been minimized.

Show comment
Hide comment
@dragoljub

dragoljub Mar 15, 2013

Closing this issue and opened new issue #3057 for controlling what index levels are returned when sub-selecting.

dragoljub commented Mar 15, 2013

Closing this issue and opened new issue #3057 for controlling what index levels are returned when sub-selecting.

@dragoljub dragoljub closed this Mar 15, 2013

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment