tz info lost when creating multiindex #3950

hayd · 2013-06-18T23:46:13Z

see http://stackoverflow.com/questions/17159207/change-timezone-of-date-time-column-in-pandas-and-add-as-hierarchical-index/17159276#17159276

dat = pd.DataFrame({'label':['a', 'a', 'a', 'b', 'b', 'b'], 'datetime':['2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00', '2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00'], 'value':range(6)})
dat.index = pd.to_datetime(dat.pop('datetime'))
dat.index = dat.index.tz_localize('UTC').tz_convert('US/Pacific')

dat
                          label  value
datetime
2011-07-19 00:00:00-07:00     a      0
2011-07-19 01:00:00-07:00     a      1
2011-07-19 02:00:00-07:00     a      2
2011-07-19 00:00:00-07:00     b      3
2011-07-19 01:00:00-07:00     b      4
2011-07-19 02:00:00-07:00     b      5

If we add another index (we lose the tz):

In [14]: dat.set_index('label', append=True).swaplevel(0, 1)
Out[14]:
                           value
label datetime
a     2011-07-19 07:00:00      0
      2011-07-19 08:00:00      1
      2011-07-19 09:00:00      2
b     2011-07-19 07:00:00      3
      2011-07-19 08:00:00      4
      2011-07-19 09:00:00      5
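
A quick way to check this without eyeballing the repr is to compare the `tz` attribute before and after building the MultiIndex. This is only a sketch: the variable names `multi`, `tz_before` and `tz_after` are mine, not from the report, and it assumes a pandas release that includes the fixes referenced later in this thread, in which case both values should name the same zone.

```python
import pandas as pd

# Same data as the report, built with a tz-aware DatetimeIndex named 'datetime'.
stamps = ['2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00'] * 2
dat = pd.DataFrame({'label': list('aaabbb'), 'value': range(6)})
dat.index = (pd.DatetimeIndex(stamps, name='datetime')
             .tz_localize('UTC')
             .tz_convert('US/Pacific'))

# Append 'label' as a second level and put it first, as in the report.
multi = dat.set_index('label', append=True).swaplevel(0, 1)

# Compare the time zone on the plain index with the one on the MultiIndex level.
tz_before = dat.index.tz
tz_after = multi.index.get_level_values('datetime').tz
print(tz_before, tz_after)
```

On an affected version the second value would presumably come back as `None`.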

cpcloud · 2013-06-19T12:57:43Z

That is a gnarly one-liner there. I get an exception on the index assignment.

hayd · 2013-06-19T13:08:32Z

Ah, you're right. Tried to be too clever there. (fixed)

ciamac · 2014-03-01T02:30:04Z

This problem also occurs with groupby operations that create multiple indexes with time zones. For example:

dat = pd.DataFrame({'label':['a', 'a', 'a', 'b', 'b', 'b'], 'datetime':['2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00', '2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00'], 'value':range(6)})
dat['datetime'] = dat['datetime'].apply(lambda d: pd.Timestamp(d, tz='US/Pacific'))

We start out with timezones:

In [386]: dat.head()
Out[386]: 
                    datetime label  value
0  2011-07-19 07:00:00-07:00     a      0
1  2011-07-19 08:00:00-07:00     a      1
2  2011-07-19 09:00:00-07:00     a      2
3  2011-07-19 07:00:00-07:00     b      3
4  2011-07-19 08:00:00-07:00     b      4

However, if we do a groupby on multiple columns, we will lose the timezones:

In [387]: dat.groupby(['datetime', 'label'])['value'].sum()
Out[387]: 
datetime             label
2011-07-19 14:00:00  a        0
                     b        3
2011-07-19 15:00:00  a        1
                     b        4
2011-07-19 16:00:00  a        2
                     b        5
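
For anyone who wants to verify the groupby case programmatically, here is a minimal sketch. It localizes the column with `tz_localize` rather than the `apply` above (my substitution, the resulting values should be the same), and the names `summed` and `level` are made up for the example. It assumes a pandas release where the groupby fix has landed.

```python
import pandas as pd

stamps = ['2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00'] * 2
dat = pd.DataFrame({
    'label': list('aaabbb'),
    # Interpret the wall-clock times as US/Pacific.
    'datetime': pd.to_datetime(stamps).tz_localize('US/Pacific'),
    'value': range(6),
})

# Group on the tz-aware column plus the label; the result has a MultiIndex.
summed = dat.groupby(['datetime', 'label'])['value'].sum()

# Inspect the time zone carried by the 'datetime' level of the result.
level = summed.index.get_level_values('datetime')
print(level.tz)
```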

andrewchou34 · 2014-06-14T22:49:34Z

We just upgraded to pandas 0.14, and the groupby/pivot bugs related to losing tz-info have been fixed. However, reset_index of a MultiIndex still loses tz-info.

In [7]: ts = pd.date_range('1/1/2011', periods=5, freq='10s', tz = 'US/Eastern')

In [8]: foo = pd.DataFrame({'c' : range(5)}, index = pd.MultiIndex([range(5), ts], [range(5), range(5)], names = [ 'a' , 'b' ]))

In [9]: print foo                                                                                                
                             c
a b                           
0 2011-01-01 00:00:00-05:00  0
1 2011-01-01 00:00:10-05:00  1
2 2011-01-01 00:00:20-05:00  2
3 2011-01-01 00:00:30-05:00  3
4 2011-01-01 00:00:40-05:00  4

In [10]: foo2 = foo.reset_index()                                                                                                

In [11]: print foo2                                                                                                              
   a                   b  c
0  0 2011-01-01 05:00:00  0
1  1 2011-01-01 05:00:10  1
2  2 2011-01-01 05:00:20  2
3  3 2011-01-01 05:00:30  3
4  4 2011-01-01 05:00:40  4
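
A sketch of the same round trip that checks the column's time zone directly. Note that it builds the index with `pd.MultiIndex.from_arrays` instead of the positional constructor used above; that is my substitution to keep the example short, not what the original session ran. It assumes a pandas release where `reset_index` keeps tz-info.

```python
import pandas as pd

ts = pd.date_range('1/1/2011', periods=5, freq='10s', tz='US/Eastern')

# Two-level index: an integer level 'a' and a tz-aware datetime level 'b'.
index = pd.MultiIndex.from_arrays([list(range(5)), ts], names=['a', 'b'])
foo = pd.DataFrame({'c': range(5)}, index=index)

# Move both levels back into columns and look at the dtype of 'b'.
foo2 = foo.reset_index()
print(foo2.dtypes)
```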

And, column assignment still loses tz-info.

In [13]: bar = pd.DataFrame( {'a' : ts, 'b':range(5)} )

In [14]: print bar
                    a  b
0 2011-01-01 05:00:00  0
1 2011-01-01 05:00:10  1
2 2011-01-01 05:00:20  2
3 2011-01-01 05:00:30  3
4 2011-01-01 05:00:40  4

jreback · 2014-06-16T12:54:24Z

cc @sinhrks take a look?

sinhrks · 2014-06-21T07:31:44Z

Yeah, I think reset_index was left untouched by the previous PRs. The fix doesn't look difficult.

andrewchou34 · 2014-07-23T15:14:29Z

We just upgraded to pandas 0.14.1, and the reset_index no longer loses tz-info. Thanks for fixing that!

However, it still seems that column assignment loses tz-info:

In [16]: ts = pd.date_range('1/1/2011', periods=5, freq='10s', tz = 'US/Eastern')

In [17]: ts
Out[17]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00-05:00, ..., 2011-01-01 00:00:40-05:00]
Length: 5, Freq: 10S, Timezone: US/Eastern

In [18]: pd.DataFrame( {'a' : ts } )
Out[18]:
                    a
0 2011-01-01 05:00:00
1 2011-01-01 05:00:10
2 2011-01-01 05:00:20
3 2011-01-01 05:00:30
4 2011-01-01 05:00:40

jreback · 2014-07-23T15:19:54Z

This is by definition: you are passing a dictionary, where things are coerced. I suppose one might consider this a bug. I'll create a separate issue. There are several ways to do this in any event.

pd.DataFrame({'a' : ts.to_series(keep_tz=True) })

or

df['a'] = ts
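
To make the two workarounds concrete, here is a small self-contained sketch. The frame `df` with a column `b` is invented so the assignment has something to attach to, and `pd.Series(ts)` stands in for the `to_series(keep_tz=True)` call above on the assumption that a current pandas keeps the time zone when wrapping a tz-aware index in a Series.

```python
import pandas as pd

ts = pd.date_range('1/1/2011', periods=5, freq='10s', tz='US/Eastern')

# Workaround 1: assign the tz-aware index to a column of an existing frame.
df = pd.DataFrame({'b': range(5)})
df['a'] = ts

# Workaround 2: wrap the index in a Series before handing it to the constructor.
wrapped = pd.DataFrame({'a': pd.Series(ts)})

# Both columns should report the original zone rather than naive UTC values.
print(df['a'].dt.tz, wrapped['a'].dt.tz)
```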

andrewchou34 · 2014-07-23T22:07:57Z

Thanks for creating the new issue and giving other solutions.

When testing 0.14.1 on old code, I ran into the following crash when using reset_index with tz-aware data in a multi-index.

In [1]: import pandas as pd                                                                                                                                                                                                                   

In [2]: ts = pd.date_range('1/1/2011', periods = 2, freq = '10s', tz = 'US/Eastern')                                                                                                                                                         

In [3]: df = pd.DataFrame({'a' : [ts[0], ts[0], ts[1], ts[1]], 'b': [1,2,1,2], 'c':[1,2,3,4]})                                                                  

In [4]: df_gb_sum = df.groupby(['a','b']).sum()                         

In [5]: df_gb_sum
Out[5]: 
                             c
a                         b   
2011-01-01 00:00:00-05:00 1  1
                          2  2
2011-01-01 00:00:10-05:00 1  3
                          2  4

In [6]: df_gb_sum.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-47b6fb3071a5> in <module>()
----> 1 df_gb_sum.reset_index()

/Users/andrewchou/dev/bourbakitech/pyenv/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2481                     level_values = _maybe_casted_values(lev, lab)
   2482                     if level is None or i in level:
-> 2483                         new_obj.insert(0, col_name, level_values)
   2484 
   2485         elif not drop:

/Users/andrewchou/dev/bourbakitech/pyenv/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2107         """
   2108         self._ensure_valid_index(value)
-> 2109         value = self._sanitize_column(column, value)
   2110         self._data.insert(
   2111             loc, column, value, allow_duplicates=allow_duplicates)

/Users/andrewchou/dev/bourbakitech/pyenv/lib/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value)
   2139         elif isinstance(value, Index) or _is_sequence(value):
   2140             if len(value) != len(self.index):
-> 2141                 raise ValueError('Length of values does not match length of '
   2142                                  'index')
   2143 

ValueError: Length of values does not match length of index

Note that if we don't set a tz, it seems to work fine.

In [7]: ts = pd.date_range('1/1/2011', periods = 2, freq = '10s')

In [8]: df = pd.DataFrame({'a' : [ts[0], ts[0], ts[1], ts[1]], 'b': [1,2,1,2], 'c':[1,2,3,4]})

In [9]: df_gb_sum = df.groupby(['a','b']).sum()

In [10]: df_gb_sum.reset_index()
Out[10]: 
                    a  b  c
0 2011-01-01 00:00:00  1  1
1 2011-01-01 00:00:00  2  2
2 2011-01-01 00:00:10  1  3
3 2011-01-01 00:00:10  2  4
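
Here is the tz-aware case again as a script that checks the result instead of printing it. This is a sketch assuming a pandas release where the crash is fixed; the name `flat` is mine. On an affected version the `reset_index` call is where I would expect the ValueError above.

```python
import pandas as pd

ts = pd.date_range('1/1/2011', periods=2, freq='10s', tz='US/Eastern')
df = pd.DataFrame({'a': [ts[0], ts[0], ts[1], ts[1]],
                   'b': [1, 2, 1, 2],
                   'c': [1, 2, 3, 4]})

# Group on the tz-aware column and 'b', then flatten the MultiIndex again.
flat = df.groupby(['a', 'b']).sum().reset_index()

# One row per (a, b) pair, with 'a' still carrying its time zone.
print(flat.dtypes)
```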

jreback · 2014-07-23T22:09:51Z

That last one is already fixed here: #7533

jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014

sleibman mentioned this issue Mar 11, 2014: tzinfo lost when concatenating multiindex arrays #6606 (Closed)

This was referenced May 10, 2014:
ENH/CLN: Add factorize to IndexOpsMixin #7090 (Merged)
BUG: tz info lost by set_index and reindex #7092 (Merged)
BUG: GroupBy doesn't preserve timezone #7099 (Merged)

jreback modified the milestones: 0.14.1, 0.15.0, 0.14.0 May 12, 2014

jreback closed this as completed in #7099 May 13, 2014

hayd modified the milestones: 0.14.1, 0.14.0 Jun 14, 2014

hayd reopened this Jun 14, 2014

sinhrks mentioned this issue Jun 21, 2014: BUG: df.reset_index loses tz #7533 (Merged)

jreback closed this as completed in #7533 Jun 21, 2014

jreback mentioned this issue Jul 23, 2014: API: preserver tz on created series from Index when possible #7822 (Closed)
