Add to_dataframe() method to MultiIndex #12397

Closed
relativistic opened this Issue Feb 19, 2016 · 8 comments

Comments

Projects
None yet
3 participants

I find the Index.to_series method is a convenient way to allow indices to act as columns of a dataframe where desired. However, the behavior of MultiIndex.to_series, which gives a Series of tuples, is less useful.

Would it be convenient to provide a to_dataframe method for index classes? This would be a natural extension of the utility of to_series, and more useful for MultiIndex objects I would think.

I'm something equivalent to:

def to_dataframe(self):
   DataFrame(self.tolist(), columns=self.names, index=self)
Contributor

jreback commented Feb 22, 2016

can you give an actual usecase?

normally you would simply do .reset_index()

Contributor

toobaz commented Mar 4, 2016

I just searched for this. I have a DF filled with boolean values (an adjacency matrix). I want to store the pairs of labels for which the value is True. I start by doing a simple

df = df.unstack()
df = df[df == True]

now I would like to do

df.to_dataframe().to_csv(path)

instead I need to do

df.reset_index()[['indexname1', 'indexname2']].to_csv(path)

(admittedly, not a huge problem, but on the other hand to_series() itself could easily be replaced by .reset_index()['indexname'] I think; to_dataframe seems to me at least as useful)

Sorry, missed the notification with @jreback 's comment.

One of the nice things about to_series() is that it maintains the original index, which is useful if you need to interact with the original dataframe. reset_index() does not do this. To get what I want with reset_index(), I would need to do something like:

index_names = df.index.names
df_indx = df.reset_index()[index_names]
df_indx.index = df.index

This is a bit verbose, but admittedly, not bad. The main problem is that this is fragile, and will not work on all DataFrames. It would fail if:

  • any string in index_names has a duplicate in df.columns .
  • The index is unnamed

Any one of these problems is relatively easy to get around, if you know a priori the construction of the dataframe. However, its a bit more effort if you want to write a general-purpose function that is guarantied to work in all conditions. And even in the case where you have some idea of your input, it would simply be nice if pandas could deal with this for you so you can focus data analysis, rather than how to wrangle it into the right format.

Contributor

jreback commented Mar 29, 2016

@relativistic again a fully worked out example here would be instructive. .reset_index() DOES preserve the index as its now in a column(s). So not sure what you mean.

@jreback : Okay, sure. I'll try to pull together an example some evening soon. Can't show the code I was working on unfortunately, so I'll have to come up with another example.

Contributor

jreback commented Nov 30, 2016

here is an external impl (e.g. could prob be faster / better if we did this inside of a .to_dataframe()) method

In [24]: i = pd.MultiIndex.from_product([np.arange(1000),np.arange(1000)],names=['one','two'])

In [25]: len(i)
Out[25]: 1000000

In [26]: %timeit DataFrame(i.tolist(), columns=i.names)
1 loop, best of 3: 612 ms per loop

In [27]: %timeit DataFrame([], index=i).reset_index()
10 loops, best of 3: 38.5 ms per loop
Contributor

toobaz commented Nov 30, 2016

Just for reference: at the time, I ended up doing:

In [4]: %timeit pd.DataFrame({i.names[n] or n : i.levels[n][i.labels[n]]
                              for n in range(len(i.names))})
100 loops, best of 3: 16.4 ms per loop

which performs slightly better than (same as above, just to compare on the same CPU):

In [5]: %timeit pd.DataFrame([], index=i).reset_index()
10 loops, best of 3: 21.5 ms per loop
Contributor

jreback commented Nov 30, 2016

@toobaz that's only true for a relatively small frame

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 24, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
ENH: allow hashing of MultiIndex

closes #12397
90b8588

jreback added this to the 0.20.0 milestone Jan 24, 2017

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 24, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
ENH: allow hashing of MultiIndex

closes #12397
396b9db

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 24, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
ENH: allow hashing of MultiIndex

closes #12397
7193277

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
closes #12397
4a151c6

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
closes #12397
c10cad4

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
closes #12397
595e5e8

@jreback jreback added a commit to jreback/pandas that referenced this issue Jan 25, 2017

@jreback jreback ENH: add MultiIndex.to_dataframe
closes #12397
b744fb5

jreback closed this in 7277459 Jan 25, 2017

@AnkurDedania AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

@jreback @AnkurDedania jreback + AnkurDedania ENH: add MultiIndex.to_dataframe
closes #12397

Author: Jeff Reback <jeff@reback.net>

Closes #15216 from jreback/to_dataframe and squashes the following commits:

b744fb5 [Jeff Reback] ENH: add MultiIndex.to_dataframe
4888ded
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment