ENH/API: Allow dicts with tuple keys in DataFrame constructor #3323

Closed
cpcloud opened this issue Apr 11, 2013 · 15 comments
Labels: Dtype Conversions (Unexpected or buggy dtype conversions), Enhancement

Comments

@cpcloud
Member

cpcloud commented Apr 11, 2013

related #4805

It would be nice to allow automatic conversion to a MultiIndex when a dict with tuple keys is passed into the DataFrame constructor. Here's what currently happens:

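import itertools as itools
from numpy.random import rand
from pandas import Index, DataFrame

# dict whose keys are 2-tuples and whose values are 1-D arrays
d = {(i, j): rand(10) for i, j in itools.product(range(3), repeat=2)}
df = DataFrame(d)
# at the time of this report the columns are a flat Index of tuples, not a MultiIndex
assert type(df.columns) == Index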

The same issue shows up in pd.concat when you pass in a dict of sequences (lists, ndarrays and friends) with axis=1. However, if you pass a dict of DataFrames, the column keys are converted to a MultiIndex. E.g.,

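import pandas as pd
from pandas import MultiIndex
# same tuple keys, but each value is now a DataFrame
d = {(i, j): DataFrame(rand(10, 2), columns=['a', 'b']) for i, j in itools.product(range(3), repeat=2)}
df = pd.concat(d, axis=1)
# here the tuple keys do become levels of a MultiIndex
assert type(df.columns) == MultiIndex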
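
Until the constructor does this automatically, one way to get the two-level columns is to build them explicitly from the tuple labels. This is only a minimal sketch of that workaround, not a proposal for how the constructor itself should behave; it relies on nothing beyond MultiIndex.from_tuples.

```python
import itertools

import numpy as np
import pandas as pd

# Same kind of input as in the report: tuple keys mapping to 1-D arrays.
d = {(i, j): np.random.rand(10) for i, j in itertools.product(range(3), repeat=2)}
df = pd.DataFrame(d)

# Build the two-level column index explicitly from the tuple labels.
df.columns = pd.MultiIndex.from_tuples(list(df.columns))

# Selecting on the first level now returns the three matching sub-columns.
first_group = df[0]
```
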
@ghost

ghost commented Apr 12, 2013

This looks like a natural extension to me, marked for consideration in 0.12.

@cpcloud
Member Author

cpcloud commented Apr 12, 2013

I haven't fully tested it yet, but it looks like changing the call to Index in the _init_dict method to MultiIndex should do the trick since MultiIndex seems like it will construct the appropriate 1D index when necessary.

@cpcloud
Member Author

cpcloud commented Apr 12, 2013

Whoops that's not entirely precise: it should be a call to one of the class method helpers.

@ghost

ghost commented Apr 12, 2013

Just need to consider the case where users want tuple labels for some reason.
Don't think that's a common case, but someone might be doing it, and this change
would make that behaviour impossible.

pd.cut()-style bin labels, like those in Categorical, are a related example though.

@cpcloud
Member Author

cpcloud commented Apr 12, 2013

I'm not sure how to address that case without something annoying like a flag in the constructor. I guess my point in raising the issue was for cases like that. It seems like an Index of tuples and a MultiIndex are equivalent in the sense that all(index.values == multiindex.values) and thus the user-facing API should be as similar as possible. What can one do with an index of tuples that one can't do with a multiindex?
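
To make the comparison concrete, here is a small sketch that builds both kinds of index from the same tuples and checks that they carry the same labels. It assumes a later pandas release in which the Index constructor accepts tupleize_cols; that parameter is not something proposed in this thread, and it is used here only to force a flat index of tuples.

```python
import pandas as pd

tuples = [(1, 3), (2, 4)]

# tupleize_cols=False asks for a flat, object-dtype Index whose labels are tuples.
flat = pd.Index(tuples, tupleize_cols=False)
# from_tuples interprets each tuple position as a separate level.
multi = pd.MultiIndex.from_tuples(tuples)

# Both indexes yield the same tuple labels when iterated.
same_labels = list(flat) == list(multi)
```

The labels agree, but the two objects differ in the number of levels, which is what level-based selection depends on.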

@cpcloud
Member Author

cpcloud commented Apr 12, 2013

One difference is that the size attribute of MultiIndex is broken. An Index of tuples returns the correct size while a MultiIndex always returns 0 when the size attribute is queried, but that's easy to fix.

@ghost

ghost commented Apr 12, 2013

It's a question of backcompat mostly: tuples are valid labels right now, and
interpreting tuples as levels would be a breaking change. Maybe worth it.

There's nothing inherent in tuples that makes them mean levels. It's
just the semantics pandas can adopt, or not. Obviously, MultiIndex representing
its level labels as label tuples in some cases makes it reasonable to do that.

@cpcloud
Member Author

cpcloud commented Apr 12, 2013

To implement this behavior for both the DataFrame constructor and concat, the most parsimonious solution looks like having the __new__ method of Index call the MultiIndex.from_tuples class method for sequences of sequences of equal length, since concat uses merge under the hood and merge makes many calls to Index. This would require the fewest changes to the code and (I think) only in one place.
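
The dispatch rule described above can be sketched outside of pandas internals as a standalone function. make_index is a hypothetical helper written only for illustration; it is not how Index.__new__ is implemented. It also narrows "sequences of sequences" to tuples, which is an assumption made here to keep lists of lists out of scope, and it assumes a pandas release whose Index constructor accepts tupleize_cols.

```python
import pandas as pd


def make_index(labels):
    """Hypothetical helper: return a MultiIndex when every label is a tuple
    of the same length, and a flat Index otherwise."""
    labels = list(labels)
    all_tuples = bool(labels) and all(isinstance(x, tuple) for x in labels)
    if all_tuples and len({len(x) for x in labels}) == 1:
        # Equal-length tuples: treat each position as a level.
        return pd.MultiIndex.from_tuples(labels)
    # Anything else stays a flat index; tupleize_cols=False prevents any
    # automatic conversion of tuple labels.
    return pd.Index(labels, tupleize_cols=False)


levels = make_index([("a", 1), ("a", 2), ("b", 1)])
flat = make_index(["x", "y", "z"])
```

Keeping the check in one place is the point of the suggestion: every caller that builds an index from labels would get the same behaviour.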

@cpcloud
Member Author

cpcloud commented Apr 12, 2013

This would also solve the issue that when you call Index on a MultiIndex it returns an Index of tuples.

@cpcloud
Member Author

cpcloud commented May 4, 2013

Is this worth keeping open? I'll hack on it if it is, but as @y-p said, there's no reason tuples must be multilevel indices; they just happen to be implemented that way. I'm just not sure if this is too big of a breaking change.

@ghost

ghost commented May 4, 2013

Making this possible would be good, but the constructor is overloaded to
the point of bursting, and I can't think of a reasonable way to do this without breaking
back-compat. Leave it open, it's worth figuring out.

@hayd
Contributor

hayd commented Jul 10, 2013

Same for MultiIndex to DataFrame (not working atm):

m = pd.MultiIndex.from_arrays([[1,2], [3,4]])

In [11]: pd.DataFrame(m)
Out[11]:
Empty DataFrame
Columns: [0]
Index: []

As @cpcloud mentions, m.values doesn't work "as expected" (especially shape):

In [20]: m.values
Out[20]: array([(1, 3), (2, 4)], dtype=object)

In [21]: pd.DataFrame(m.values)
Out[21]:
        0
0  (1, 3)
1  (2, 4)

In [22]: m.shape
Out[22]: (0,)

Possibly related #4187
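
As a workaround sketch for the case above, the index can be turned into a list of tuples first, so the DataFrame constructor sees ordinary row data with one column per level. This only illustrates the expected result; it says nothing about how DataFrame(m) itself should be fixed.

```python
import pandas as pd

m = pd.MultiIndex.from_arrays([[1, 2], [3, 4]])

# Iterating a MultiIndex yields one tuple per entry; passing those tuples
# as rows gives one column per level.
frame = pd.DataFrame(list(m))
```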

@cpcloud
Member Author

cpcloud commented Aug 8, 2013

I think the numpy attribute consistency can be added in a separate PR.

@jreback
Contributor

jreback commented Oct 11, 2013

pushing to 0.14

@jreback
Contributor

jreback commented Apr 9, 2014

closed via #4805
