Stacking MultiIndex DataFrame columns with Timestamps levels fails #8039

Closed
ldkge opened this Issue Aug 15, 2014 · 14 comments

ldkge commented Aug 15, 2014

You can see the bug in the following code:

import pandas as pd
import datetime as dt

key = pd.MultiIndex.from_tuples([(
                            dt.datetime(2014,8,1,0,0,0),
                            'SomeColumnName',
                            'AnotherOne')])

data = {
    '1' : 34204,
    '2' : 43580,
    '3' : 84329,
    '5' : 23485
}


ts = pd.Series(data=data)
df = pd.DataFrame(data=ts, columns=key)

stacked = df.stack()

print(stacked)

We would expect the data to be unchanged; however, the returned DataFrame is empty.

The Pandas version used was 0.11.0

TomAugspurger added this to the 0.15.0 milestone Aug 15, 2014

Contributor

jreback commented Aug 15, 2014

When you pass columns together with a Series, the constructor reindexes by the passed columns. Since the data has the name '0' (the column assigned), it disappears. This is undocumented (and doesn't work at all > 0.13.0).

Use this to create a frame

In [23]: result = ts.to_frame()

# if you are < 0.13.0
In [31]: result = DataFrame(ts)

In [33]: result
Out[33]: 
       0
1  34204
2  43580
3  84329
5  23485

And simply set the columns.

In [26]: key
Out[26]: 
MultiIndex(levels=[[2014-08-01 00:00:00], [u'SomeColumnName'], [u'AnotherOne']],
           labels=[[0], [0], [0]])

In [28]: result.columns = key

In [29]: result
Out[29]: 
       2014-08-01
   SomeColumnName
       AnotherOne
1           34204
2           43580
3           84329
5           23485

In [30]: result.unstack()
Out[30]: 
2014-08-01  SomeColumnName  AnotherOne  1    34204
                                        2    43580
                                        3    84329
                                        5    23485
dtype: int64
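
For reference, the workaround above can be sketched as one self-contained script. This is only a sketch, assuming a recent pandas release; the variable name `flat` is just illustrative and does not appear in the thread.

```python
import datetime as dt

import pandas as pd

# Same three-level column key as in the original report.
key = pd.MultiIndex.from_tuples(
    [(dt.datetime(2014, 8, 1), "SomeColumnName", "AnotherOne")]
)
ts = pd.Series({"1": 34204, "2": 43580, "3": 84329, "5": 23485})

# Build the frame first, then assign the MultiIndex as the columns,
# instead of passing columns= to the DataFrame constructor.
result = ts.to_frame()
result.columns = key

# Unstacking a frame with a single-level row index yields a Series
# indexed by the column levels followed by the row labels.
flat = result.unstack()
print(flat)
```

Assigning `result.columns` after construction avoids the reindexing step that drops the data when `columns=` is passed alongside a Series.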

jreback closed this Aug 15, 2014

jreback added Usage Question and removed Bug Reshaping labels Aug 15, 2014

Contributor

jreback commented Aug 15, 2014

@TomAugspurger not a bug, but a usage issue.

Contributor

TomAugspurger commented Aug 15, 2014

yeah just read your comment.

Contributor

TomAugspurger commented Aug 15, 2014

@ldkge's problem was with stack though. Not sure why

In [67]: result.stack(0)
Out[67]: 
              SomeColumnName
                  AnotherOne
1 2014-08-01           34204
2 2014-08-01           43580
3 2014-08-01           84329
5 2014-08-01           23485

In [68]: result.stack(1)
Out[68]: 
Empty DataFrame
Columns: [(2014-08-01 00:00:00, AnotherOne)]
Index: []

would be different.

Contributor

jreback commented Aug 15, 2014

This works in master (recently added feature).

In [54]: result.stack([0,1])
Out[54]: 
                             AnotherOne
1 2014-08-01 SomeColumnName       34204
2 2014-08-01 SomeColumnName       43580
3 2014-08-01 SomeColumnName       84329
5 2014-08-01 SomeColumnName       23485
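
A minimal sketch of the multi-level stack shown above, assuming a pandas release that accepts a list of levels in `stack`. The warning filter is there only because some releases emit a warning about the stack implementation; it is not part of the technique.

```python
import datetime as dt
import warnings

import pandas as pd

key = pd.MultiIndex.from_tuples(
    [(dt.datetime(2014, 8, 1), "SomeColumnName", "AnotherOne")]
)
result = pd.Series({"1": 34204, "2": 43580, "3": 84329, "5": 23485}).to_frame()
result.columns = key

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Move column levels 0 and 1 into the row index in one call.
    stacked = result.stack([0, 1])

print(stacked)
```

Only the innermost column level remains as columns; the two stacked levels become the second and third levels of the row index.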

I am not sure what stack(1) would/should actually do.

What would you expect?

Contributor

TomAugspurger commented Aug 15, 2014

I thought it should shift level 1 of the columns' MultiIndex down to the row labels, so the expected result would be

>>> df.stack(1)
                      2014-08-01
                      AnotherOne
1 SomeColumnName           34204
2 SomeColumnName           43580
3 SomeColumnName           84329
5 SomeColumnName           23485

Contributor

jreback commented Aug 15, 2014

cc @onesandzeroes what do you think?

jreback reopened this Aug 15, 2014

Contributor

jreback commented Aug 15, 2014

ok, I think I agree; this could be a bug

Contributor

TomAugspurger commented Aug 15, 2014

I'll submit a PR once I figure out what's wrong.

Contributor

TomAugspurger commented Aug 15, 2014

@jreback it has to do with how the MultiIndex is storing the timestamp.

Any idea offhand why with

In [6]: idx = pd.MultiIndex.from_tuples([(pd.datetime(2014, 1, 1), 'A', 'B')])

these two aren't equal?

In [10]: idx.values[0][0]
Out[10]: Timestamp('2014-01-01 00:00:00')

In [8]: idx.levels[0].values
Out[8]: array(['2013-12-31T18:00:00.000000000-0600'], dtype='datetime64[ns]')

edit:

or even clearer, why isn't

In [34]: idx.levels[0].values[0]
Out[34]: numpy.datetime64('2013-12-31T18:00:00.000000000-0600')

equal to

In [33]: idx.levels[0][0]
Out[33]: Timestamp('2014-01-01 00:00:00')

I'm going to go digging in index.py
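
To make the comparison concrete, here is a small sketch, assuming a recent pandas and NumPy. It uses `datetime.datetime` in place of `pd.datetime`, and the names `raw`, `boxed`, and `same_instant` are only illustrative. Note that 2013-12-31T18:00 at a -0600 offset is the same instant as 2014-01-01 00:00 UTC, so the two printed values above differ in type and display rather than in the moment they denote.

```python
import datetime as dt

import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(dt.datetime(2014, 1, 1), "A", "B")])

# .values exposes the underlying NumPy array, so its elements are
# numpy.datetime64 scalars.
raw = idx.levels[0].values[0]

# Indexing the level directly returns a boxed pandas Timestamp.
boxed = idx.levels[0][0]

# Boxing the raw value shows both refer to the same instant.
same_instant = pd.Timestamp(raw) == boxed

print(type(raw).__name__, type(boxed).__name__, same_instant)
```

The type difference matters whenever the raw value is used as a key or label that is later matched against Timestamp labels.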

TomAugspurger added MultiIndex and removed Reshaping labels Aug 15, 2014

Contributor

jreback commented Aug 15, 2014

.values on an index returns the underlying data (it's a DatetimeIndex).

Where does this type of comparison happen?

Contributor

TomAugspurger commented Aug 15, 2014

(I think) they're compared when constructing the new dataframe in core/reshape.py(661)_stack_multi_columns

ipdb> new_data
{(numpy.datetime64('2013-12-31T18:00:00.000000000-0600'), 'B'): array([1, 2, 3, 4])}
ipdb> new_columns
MultiIndex(levels=[[2014-01-01 00:00:00], ['B']],
           labels=[[0], [0]])

ipdb> result = DataFrame(new_data, index=new_index, columns=new_columns)
ipdb> result
    2014-01-01
             B
0 C        NaN
1 C        NaN
2 C        NaN
3 C        NaN

I'll see why new_data is a dict instead of an array.

Contributor

jreback commented Aug 15, 2014

I can't see exactly where you are pointing to...

Levels should be using .equals for comparisons (an Index method), so maybe they need to be wrapped.
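
A rough sketch of what wrapping could look like, assuming a recent pandas. This is not the code in core/reshape.py; it only illustrates comparing a level through `Index.equals` after wrapping raw values back into an Index, and the names `level`, `wrapped`, and `levels_match` are hypothetical.

```python
import datetime as dt

import pandas as pd

idx = pd.MultiIndex.from_tuples([(dt.datetime(2014, 1, 1), "A", "B")])
level = idx.levels[0]

# level.values is a plain NumPy datetime64 array; wrapping it in
# pd.Index turns it back into a DatetimeIndex.
wrapped = pd.Index(level.values)

# Index.equals compares the two indexes as a whole, so there is no
# element-by-element mix of numpy.datetime64 and Timestamp objects.
levels_match = wrapped.equals(level)

print(type(wrapped).__name__, levels_match)
```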

Contributor

onesandzeroes commented Aug 16, 2014

@jreback I agree with TomAugspurger about what the expected behaviour of df.stack(1) should be, so if that's not happening at the moment I think it's a bug.
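
The expected behaviour can be checked with a short script. This is a sketch, assuming a pandas release in which the issue has been fixed; the warning filter is only there because some releases emit a warning about the stack implementation.

```python
import datetime as dt
import warnings

import pandas as pd

key = pd.MultiIndex.from_tuples(
    [(dt.datetime(2014, 8, 1), "SomeColumnName", "AnotherOne")]
)
df = pd.Series({"1": 34204, "2": 43580, "3": 84329, "5": 23485}).to_frame()
df.columns = key

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Level 1 of the columns should move down into the row labels.
    stacked = df.stack(1)

print(stacked)
```

If the behaviour matches the expectation described in this thread, every row label gains 'SomeColumnName' as a second level, the columns keep the timestamp and 'AnotherOne' levels, and no values are lost.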
