Option to suppress automatic conversion of tuples to MultiIndex #11799

htzh · 2015-12-09T01:33:49Z

Right now we have:

>>> pd.DataFrame({(1,2):pd.Series([2.3])}).columns
MultiIndex(levels=[[1], [2]],
           labels=[[0], [0]])
>>> pd.Index([(1,2)])
MultiIndex(levels=[[1], [2]],
           labels=[[0], [0]])

Could we have an option to suppress this behavior? One problem this causes is that the .rename method is not uniform and does not work if a tuple is silently converted:

>>> pd.DataFrame({(1,2):pd.Series([2.3])}).rename(columns={(1,2):(1,3)})
     1
     2
0  2.3

To see why such behavior is problematic, consider the following unintuitive example:

>>> a = ('a', 1)

>>> b = ('b', 2)

>>> pd.DataFrame({a:pd.Series([2,3])})
   a
   1
0  2
1  3

>>> pd.DataFrame({a:pd.Series([2,3])}).rename(columns={a:b})
   a
   1
0  2
1  3


jreback · 2015-12-09T14:20:58Z

You can do this, but using tuples as an index is VERY awkward and barely supported. These are much more naturally represented (and more performant) as MultiIndexes. I have never seen a case where this is actually a good idea.

In [6]: df.columns = pd.Index([(1,2)], tupleize_cols=False)

In [7]: df
Out[7]: 
   (1, 2)
0     2.3
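
The effect of tupleize_cols on the Index constructor can be sketched as follows. This is a minimal illustration, assuming a reasonably recent pandas; the frame df is rebuilt here as an assumption, because the session above does not show how it was created.

```python
import pandas as pd

# By default, a list of tuples is promoted to a MultiIndex.
tupleized = pd.Index([(1, 2)])

# With tupleize_cols=False the tuples stay as opaque labels in a flat Index.
flat = pd.Index([(1, 2)], tupleize_cols=False)

# Assumed reconstruction of the frame used in the session above.
df = pd.DataFrame({(1, 2): pd.Series([2.3])})
df.columns = flat

print(type(tupleized).__name__)
print(type(flat).__name__, flat.nlevels)
```

The flat Index has a single level whose one label is the tuple (1, 2), which is why the column header prints as (1, 2) rather than as two stacked levels.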

htzh · 2015-12-09T20:01:30Z

@jreback Thanks for the response. The problem is that the DataFrame constructor does not expose the tupleize_cols option.

>>> pd.DataFrame({(1,2):pd.Series([2.3])}, tupleize_cols=False).columns
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'tupleize_cols'

read_csv and friends do expose it, though. While I agree MultiIndex may be superior, the problem is that rename is then inconsistent, as I showed earlier, even though the in operator behaves correctly:

>>> (1, 2) in pd.DataFrame({(1,2):pd.Series([2.3])}).columns
True

So there seem to be API inconsistencies here (if not bugs).
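
Since the constructor has no such option, one possible workaround is to flatten the columns right after construction. The sketch below is only an illustration: with_flat_tuple_columns is a hypothetical helper, not part of pandas, and it relies only on pd.Index accepting tupleize_cols=False.

```python
import pandas as pd


def with_flat_tuple_columns(frame):
    """Return a copy whose columns are a flat Index of tuple labels.

    Hypothetical helper for illustration; not a pandas API.
    """
    out = frame.copy()
    # Iterating a MultiIndex yields one tuple per column.
    out.columns = pd.Index(list(frame.columns), tupleize_cols=False)
    return out


df = pd.DataFrame({(1, 2): pd.Series([2.3])})
flat_df = with_flat_tuple_columns(df)

print(type(df.columns).__name__)
print(type(flat_df.columns).__name__)
print((1, 2) in flat_df.columns)
```

Membership tests with the tuple work on both forms, so code that only uses the in operator does not need to care which one it received.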

jreback · 2015-12-09T20:11:04Z

@htzh it's not exposed because it's not recommended in any way to do this.

To be honest, we should completely ban tuples as columns; this should never have been allowed IMHO, but we are living with it. MultiIndexes are much, much better supported. (And tupleize_cols in read_csv should be deprecated as well; it just hasn't been bothering anyone, so it was left in.)

not sure what you mean by:

So there seem to be API inconsistencies here (if not bugs).

not in the main APIs

htzh · 2015-12-10T02:03:19Z

I accept that tuples are not desirable. My problem is that during the data cleaning phase I need to rename index labels (column or row) in a few places. However, the conversion of tuples to MultiIndex is not consistent for the row index:

>>> pd.DataFrame({'a':pd.Series([2, 3], index=[(1,1), (2,2)])})
        a
(1, 1)  2
(2, 2)  3

And since rename semantics are not uniform between Index and MultiIndex, it would be nice to have consistent behavior here:

- a tuple row index converts to MultiIndex
- the rename method (for DataFrame or Series) accepts a tuple as specifying a MultiIndex entry instead of a tuple index label, just like the in operator does.

jreback · 2015-12-10T02:05:40Z

#4160 is waiting for you !

that's the fastest / best way to get a change as that is an actual bug. What you want is marginal behavior which mucks with some long defined semantics, so very low likelihood of change.

htzh · 2015-12-10T03:14:50Z

Thanks for the reference.

What about the behavior of row index? Is there a reason why row uses tuple index by default while the column uses MultiIndex by default?

jreback · 2015-12-10T03:51:34Z

you are creating a list of tuples, not an Index (if you wrapped this in Index it would be the same)
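
The distinction can be sketched as below. This is a minimal illustration, assuming a reasonably recent pandas; the printed type for the plain-list case is left to the reader's installed version rather than asserted here.

```python
import pandas as pd

labels = [(1, 1), (2, 2)]

# Passed as a plain list: the thread shows these staying as tuple labels.
from_list = pd.Series([2, 3], index=labels)

# Wrapped in pd.Index first: the tuples are promoted to a MultiIndex.
from_index = pd.Series([2, 3], index=pd.Index(labels))

print(type(from_list.index).__name__)
print(type(from_index.index).__name__)
```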

htzh · 2015-12-12T02:21:00Z

Sorry for bringing this up one more time. I wonder if you think the following behavior is expected:

>>> pd.Series([2, 3], index=[(1,1), (2,2)]).index
Index([(1, 1), (2, 2)], dtype='object')

>>> pd.Series([2, 3], index=[(1,1), (2,2)]).rename({(1,1):(5,6)}).index
MultiIndex(levels=[[2, 5], [2, 6]],
           labels=[[1, 0], [1, 0]])

I know you don't like tuple indexes. I use them as a quick way to specify properties without going through the boilerplate of defining a class. During the data cleaning phase I need to change some names depending on the data. It looks like my alternatives are:

- not use pandas during data cleaning, which kind of defeats the purpose
- wrap the tuple properties in a class to prevent pandas from messing with them, which creates more work during the experimentation phase

max-sixty · 2015-12-12T04:25:11Z

@htzh why don't you use a MultiIndex?

htzh · 2015-12-12T07:18:45Z

@MaximilianR The problem is that I need to occasionally rename some row names during data cleaning and that needs to be done before MultiIndex is created. Let's say I want to merge two tables. The two tables agree on most row names but it is also possible that they name a particular row differently. They have some overlapping columns so I can infer for which rows the tables differ in row names from the data cells in the shared column. The real problem is more complicated but this is the gist. So in my case the data schema is not completely known a priori and needs some data dependent inference based on logical relationships. Using MultiIndex during the cleaning phase would only make things more complicated (for example I may not know how many levels I will have a priori either).

max-sixty · 2015-12-12T08:19:20Z

@htzh If you need a container-like object, use something like argparse.Namespace - pandas will recognize it as an object. Example below.
If you can cope with the enforced structure, MultiIndexes are really worth considering though.

In [22]: from argparse import Namespace

In [24]: n=Namespace(a=3, b=4)

In [25]: n
Out[25]: Namespace(a=3, b=4)

In [26]: n.a
Out[26]: 3
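
A sketch of using such a container as an index label follows. Note the swap: argparse.Namespace instances are not hashable on Python 3, so this example uses a frozen dataclass instead, which is an assumption of this sketch and not something proposed in the thread. The Key class and its fields are hypothetical.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class Key:
    """Hypothetical label type; frozen dataclasses are hashable."""

    name: str
    version: int


# The labels are not tuples, so pandas keeps them as plain objects.
idx = pd.Index([Key("a", 1), Key("b", 2)])
s = pd.Series([2, 3], index=idx)

print(type(s.index).__name__)
print(s.index.get_loc(Key("b", 2)))
```

Because equality and hashing are derived from the fields, a freshly constructed Key("b", 2) finds the existing label.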

jreback · 2015-12-12T14:08:05Z

@htzh as I said what you are describing is a bug similar to #4160

best way to fix is to submit a PR for .rename.

htzh · 2015-12-13T02:39:26Z

Thanks to all for the help. I think I will try to grok MultiIndex better and incorporate it into the processing pipeline sooner and more effectively. I also realized that one way to achieve what I wanted is to first convert the row names into a data column and change the names in that column. Pandas offers convenient ways to then use the changed column as an Index or MultiIndex.
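
That approach might look roughly like the sketch below. The table, the column names key and value, the level names, and the fixes mapping are all made up for illustration.

```python
import pandas as pd

# Hypothetical table: row names are kept in an ordinary column during cleaning.
df = pd.DataFrame({
    "key": [("a", 1), ("b", 2)],
    "value": [2, 3],
})

# Data-dependent corrections, applied to the column rather than to an index.
fixes = {("a", 1): ("a", 9)}
df["key"] = df["key"].map(lambda k: fixes.get(k, k))

# Once the names are settled, promote the column to a MultiIndex.
index = pd.MultiIndex.from_tuples(df["key"].tolist(), names=["name", "version"])
cleaned = df.drop(columns="key")
cleaned.index = index

print(cleaned)
```

Keeping the names in a column sidesteps rename entirely, so the result does not depend on how rename treats tuples.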

@jreback I read the #4160 thread but it is not completely clear to me what the proposed API is. Is the proposal to make the following statement work for MultiIndex:

df.rename(columns={('abspx','red') : ('foo','orange')})

presumably moving the particular column from one part of the hierarchy to another? If that is the proposal, that is also what I requested earlier: "the rename method (for DataFrame or Series) accepts a tuple as specifying a MultiIndex entry instead of a tuple index label, just like the in operator does."
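
Independently of what rename itself ends up supporting, the whole-tuple replacement being asked about can be sketched by rebuilding the MultiIndex. rename_tuples below is a hypothetical helper, not a pandas API, and the column labels are taken from the example above.

```python
import pandas as pd


def rename_tuples(index, mapping):
    """Replace whole tuples in a MultiIndex.

    Hypothetical helper for illustration; not a pandas API.
    """
    return pd.MultiIndex.from_tuples(
        [mapping.get(label, label) for label in index],
        names=index.names,
    )


df = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([("abspx", "red"), ("abspx", "blue")]),
)
df.columns = rename_tuples(df.columns, {("abspx", "red"): ("foo", "orange")})

print(list(df.columns))
```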

I am not yet familiar with pandas source but I will keep that in mind.

jreback · 2015-12-13T20:53:46Z

@htzh I think what you proposed above should work. There are quite a few cases that need working out though, e.g. what if you only have a partial level rename:

df.rename(columns={'foo': 'bar'})

where 'foo' is in level=0

I think this should work too
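
For reference, later pandas releases accept a level argument on DataFrame.rename, which covers this partial-level case. The sketch below assumes such a version is installed; the argument did not exist when this thread was written, and the column labels are made up.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2]],
    columns=pd.MultiIndex.from_tuples([("foo", "red"), ("baz", "blue")]),
)

# Only labels in level 0 are looked up in the mapping.
renamed = df.rename(columns={"foo": "bar"}, level=0)

print(list(renamed.columns))
```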

jreback closed this as completed Dec 9, 2015

jreback added Usage Question MultiIndex labels Dec 9, 2015
