Option to suppress automatic conversion of tuples to MultiIndex #11799

Closed
htzh opened this issue Dec 9, 2015 · 14 comments

Comments

@htzh

htzh commented Dec 9, 2015

Right now we have:

>>> pd.DataFrame({(1,2):pd.Series([2.3])}).columns
MultiIndex(levels=[[1], [2]],
           labels=[[0], [0]])
>>> pd.Index([(1,2)])
MultiIndex(levels=[[1], [2]],
           labels=[[0], [0]])

Could we have an option to suppress this behavior? One problem this causes is that the .rename method is not uniform and does not work if a tuple is silently converted:

>>> pd.DataFrame({(1,2):pd.Series([2.3])}).rename(columns={(1,2):(1,3)})
     1
     2
0  2.3

To see why such behavior is problematic consider the following unintuitive example:

>>> a = ('a', 1)

>>> b = ('b', 2)

>>> pd.DataFrame({a:pd.Series([2,3])})
   a
   1
0  2
1  3

>>> pd.DataFrame({a:pd.Series([2,3])}).rename(columns={a:b})
   a
   1
0  2
1  3
@jreback
Contributor

jreback commented Dec 9, 2015

You can do this, but using tuples as index is VERY awkward and barely supported. These are much more naturally represented (and performant) as MultiIndexes. I have never seen a case where this is actually a good idea.

In [6]: df.columns = Index([(1,2)],tupleize_cols=False)

In [7]: df
Out[7]: 
   (1, 2)
0     2.3

@htzh
Author

htzh commented Dec 9, 2015

@jreback Thanks for the response. The problem is that the DataFrame constructor does not expose a tupleize_cols option.

>>> pd.DataFrame({(1,2):pd.Series([2.3])}, tupleize_cols=False).columns
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __init__() got an unexpected keyword argument 'tupleize_cols'

read_csv and friends do expose it, though. While I agree that MultiIndex may be superior, the problem is that rename is then inconsistent, as I showed earlier, even though the in operator behaves correctly:

>>> (1, 2) in pd.DataFrame({(1,2):pd.Series([2.3])}).columns
True

So there seem to be API inconsistencies here (if not bugs).
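One workaround, given that the constructor takes no such option, is to let it build the MultiIndex and then replace the columns with a flat Index. This is a minimal sketch and not a recommended pattern; it uses only pd.Index with tupleize_cols=False, the same call shown earlier in the thread.

```python
import pandas as pd

# The dict key (1, 2) is turned into a two-level MultiIndex by the constructor.
df = pd.DataFrame({(1, 2): pd.Series([2.3])})
assert isinstance(df.columns, pd.MultiIndex)

# Rebuild the labels as a flat object Index; tupleize_cols=False keeps each
# tuple as a single label instead of splitting it into levels.
df.columns = pd.Index(df.columns.tolist(), tupleize_cols=False)

print(type(df.columns).__name__)
print(df.columns.tolist())
```

The cost is that the columns are now plain Python objects, so MultiIndex features such as per-level selection no longer apply.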

@jreback
Contributor

jreback commented Dec 9, 2015

@htzh it's not exposed because doing this is not recommended in any way.

to be honest we should completely ban tuples as columns; this should never have been allowed IMHO, but we are living with it. MultiIndexes are much, much better supported. (and tupleize_cols in read_csv should be deprecated as well; it just hasn't been bothering anyone, so it was left in).

not sure what you mean by:

"So there seem to be API inconsistencies here (if not bugs)."

not in the main APIs

@htzh
Author

htzh commented Dec 10, 2015

I accept that a tuple index is not desirable. My problem is that during the data cleaning phase I need to rename index labels (column or row) in a few places. However, the conversion of tuples to MultiIndex is not consistent for the row index:

>>> pd.DataFrame({'a':pd.Series([2, 3], index=[(1,1), (2,2)])})
        a
(1, 1)  2
(2, 2)  3

And since rename semantics are not uniform between Index and MultiIndex, it would be nice to have consistent behavior here:

  • tuple row index converts to MultiIndex
  • rename method (for DataFrame or Series) accepts tuple as specifying MultiIndex instead of tuple index just like in operator does.

@jreback
Contributor

jreback commented Dec 10, 2015

#4160 is waiting for you!

that's the fastest / best way to get a change, as that is an actual bug. What you want is marginal behavior which mucks with some long-defined semantics, so very low likelihood of change.

@htzh
Author

htzh commented Dec 10, 2015

Thanks for the reference.

What about the behavior of row index? Is there a reason why row uses tuple index by default while the column uses MultiIndex by default?

@jreback
Contributor

jreback commented Dec 10, 2015

you are creating a list of tuples, not an Index (if you wrapped this in Index it would be the same)
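To make the distinction concrete, here is a small sketch that builds the same Series three ways. The variable names are illustrative. How a bare list of tuples is handled may differ between pandas versions, so the script prints that type rather than assuming it.

```python
import pandas as pd

labels = [(1, 1), (2, 2)]

# A bare list of tuples passed as index=, as in the example above.
from_list = pd.Series([2, 3], index=labels)

# The same labels wrapped in pd.Index, which tupleizes by default.
from_index = pd.Series([2, 3], index=pd.Index(labels))

# Being explicit avoids relying on either default.
explicit = pd.Series([2, 3], index=pd.MultiIndex.from_tuples(labels))

print(type(from_list.index).__name__)
print(type(from_index.index).__name__)
print(explicit.index.equals(from_index.index))
```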

@htzh
Author

htzh commented Dec 12, 2015

Sorry for bringing this up one more time. I wonder if you think the following behavior is expected:

>>> pd.Series([2, 3], index=[(1,1), (2,2)]).index
Index([(1, 1), (2, 2)], dtype='object')

>>> pd.Series([2, 3], index=[(1,1), (2,2)]).rename({(1,1):(5,6)}).index
MultiIndex(levels=[[2, 5], [2, 6]],
           labels=[[1, 0], [1, 0]])

I know you don't like tuple indexes. I use them as a quick way to specify properties without going through the boilerplate of defining a class. During the data cleaning phase I need to change some names depending on the data. It looks like my alternatives are:

  • not use pandas during data cleaning, which kind of defeats the purpose
  • wrap the tuple properties in a class to prevent pandas from messing with it, which creates more work during the experimentation phase
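The second alternative can be fairly light. The sketch below assumes a frozen dataclass is an acceptable stand-in for the tuple; Key and its field names are hypothetical. Because Key is hashable but is not a tuple, pandas has nothing to convert into levels.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)  # frozen + eq makes instances hashable and immutable
class Key:
    # Hypothetical label type; the field names are illustrative only.
    group: int
    item: int


s = pd.Series([2, 3], index=[Key(1, 1), Key(2, 2)])

# Labels are opaque objects, so rename maps them one-to-one.
renamed = s.rename({Key(1, 1): Key(5, 6)})

print(type(renamed.index).__name__)
print(list(renamed.index))
```

Object labels like these give up the speed and per-level operations of a MultiIndex, so this suits a short cleaning phase better than a long-lived structure.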

@max-sixty
Contributor

@htzh why don't you use a MultiIndex?

@htzh
Author

htzh commented Dec 12, 2015

@MaximilianR The problem is that I occasionally need to rename some rows during data cleaning, and that needs to be done before the MultiIndex is created. Let's say I want to merge two tables. The two tables agree on most row names, but it is also possible that they name a particular row differently. They have some overlapping columns, so I can infer which rows are named differently from the data cells in the shared columns. The real problem is more complicated, but this is the gist. So in my case the data schema is not completely known a priori and needs some data-dependent inference based on logical relationships. Using MultiIndex during the cleaning phase would only make things more complicated (for example, I may not know in advance how many levels I will have either).

@max-sixty
Contributor

@htzh If you need a container-like object, use something like argparse.Namespace - pandas will recognize it as an object. Example below.
If you can cope with the enforced structure, MultiIndexes are really worth considering though

In [22]: from argparse import Namespace

In [24]: n=Namespace(a=3, b=4)

In [25]: n
Out[25]: Namespace(a=3, b=4)

In [26]: n.a
Out[26]: 3

@jreback
Contributor

jreback commented Dec 12, 2015

@htzh as I said what you are describing is a bug similar to #4164

best way to fix is to submit a PR for .rename.

@htzh
Author

htzh commented Dec 13, 2015

Thanks to all for the help. I think I will try to grok MultiIndex better and incorporate it into the processing pipeline sooner and more effectively. I also realized that one way to achieve what I wanted is to first convert the row names into a data column and change the names there. Pandas offers convenient ways to then use the changed column as an Index or MultiIndex.
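A rough sketch of that approach, with made-up column names (k1, k2, value) and data: the would-be index levels stay in ordinary columns while names are being fixed, and the MultiIndex is built only at the end.

```python
import pandas as pd

# Hypothetical table: the would-be index levels live in ordinary columns.
df = pd.DataFrame(
    {
        "k1": ["a", "a", "b"],
        "k2": [1, 2, 2],
        "value": [10.0, 20.0, 30.0],
    }
)

# Fix a mislabelled row with a normal boolean mask on the key columns.
mask = (df["k1"] == "b") & (df["k2"] == 2)
df.loc[mask, "k1"] = "c"
df.loc[mask, "k2"] = 3

# Only build the MultiIndex once the names are settled.
cleaned = df.set_index(["k1", "k2"])
print(cleaned)
```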

@jreback I read the #4160 thread but it is not completely clear to me what the proposed API is. Is the proposal to make the following statement work for MultiIndex:

df.rename(columns={('abspx','red') : ('foo','orange')})

presumably moving the particular column from one part of the hierarchy to another? If that is the proposal that is also what I requested earlier: "rename method (for DataFrame or Series) accepts tuple as specifying MultiIndex instead of tuple index just like in operator does."

I am not yet familiar with pandas source but I will keep that in mind.
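Until rename handles whole tuples, the same effect can be had by rebuilding the column labels directly. This is a sketch of a workaround, not the API proposed in #4160; the second column and the mapping variable are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({("abspx", "red"): [1, 2], ("abspx", "blue"): [3, 4]})

# Hypothetical mapping from whole column tuples to their replacements.
mapping = {("abspx", "red"): ("foo", "orange")}

# Rebuild the MultiIndex label by label; unmapped tuples pass through unchanged.
df.columns = pd.MultiIndex.from_tuples(
    [mapping.get(col, col) for col in df.columns],
    names=df.columns.names,
)

print(df.columns.tolist())
```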

@jreback
Contributor

jreback commented Dec 13, 2015

@htzh I think what you proposed above should work. There are quite a few cases that need working out though. E.g. what if you only have a partial level rename

df.rename(columns={'foo' : 'bar'})

where 'foo' is in level=0

I think this should work too
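For reference, pandas versions whose DataFrame.rename accepts a level argument can express the partial-level case explicitly. A minimal sketch with made-up column labels; it assumes such a version is installed and says nothing about what the version discussed in this thread does.

```python
import pandas as pd

df = pd.DataFrame({("foo", "red"): [1], ("baz", "blue"): [2]})

# level=0 restricts the mapping to the outer level of the column MultiIndex.
renamed = df.rename(columns={"foo": "bar"}, level=0)

print(renamed.columns.tolist())
```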
