tz info lost when creating multiindex #3950

Closed
hayd opened this issue Jun 18, 2013 · 10 comments · Fixed by #7099 or #7533
@hayd
Contributor

hayd commented Jun 18, 2013

see http://stackoverflow.com/questions/17159207/change-timezone-of-date-time-column-in-pandas-and-add-as-hierarchical-index/17159276#17159276

dat = pd.DataFrame({'label':['a', 'a', 'a', 'b', 'b', 'b'], 'datetime':['2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00', '2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00'], 'value':range(6)})
dat.index = pd.to_datetime(dat.pop('datetime'), utc=True)
dat.index = dat.index.tz_localize('UTC').tz_convert('US/Pacific')

dat
                          label  value
datetime
2011-07-19 00:00:00-07:00     a      0
2011-07-19 01:00:00-07:00     a      1
2011-07-19 02:00:00-07:00     a      2
2011-07-19 00:00:00-07:00     b      3
2011-07-19 01:00:00-07:00     b      4
2011-07-19 02:00:00-07:00     b      5

If we add another index level, we lose the tz:

In [14]: dat.set_index('label', append=True).swaplevel(0, 1)
Out[14]:
                           value
label datetime
a     2011-07-19 07:00:00      0
      2011-07-19 08:00:00      1
      2011-07-19 09:00:00      2
b     2011-07-19 07:00:00      3
      2011-07-19 08:00:00      4
      2011-07-19 09:00:00      5
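
The report above can be turned into a quick check for whichever pandas version is installed. This is only a sketch: it builds the index from a plain list of strings instead of `dat.pop('datetime')`, and the names `stamps`, `multi`, `datetime_level` and `tz_preserved` are illustrative, not from the original report.

```python
import pandas as pd

# Same data as in the report, with the timestamps kept in a separate list.
stamps = ["2011-07-19 07:00:00", "2011-07-19 08:00:00", "2011-07-19 09:00:00"] * 2
dat = pd.DataFrame({"label": ["a"] * 3 + ["b"] * 3, "value": range(6)})

# Parse as UTC, then convert to the target zone.
dat.index = pd.to_datetime(stamps, utc=True).tz_convert("US/Pacific")
dat.index.name = "datetime"

# Append 'label' as a second index level and move it to the front.
multi = dat.set_index("label", append=True).swaplevel(0, 1)

# The bug is present if the datetime level has lost its tz here.
datetime_level = multi.index.get_level_values("datetime")
tz_preserved = datetime_level.tz is not None
print(datetime_level.tz, tz_preserved)
```

If `tz_preserved` is `False`, the installed version still shows the behaviour described in this issue.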
@cpcloud
Member

cpcloud commented Jun 19, 2013

that is a gnarly one-liner there. i get an exception on the index assignment

@hayd
Contributor Author

hayd commented Jun 19, 2013

Ah, you're right. Tried to be too clever there. (fixed)

jreback modified the milestones: 0.15.0, 0.14.0 Feb 18, 2014

@ciamac

ciamac commented Mar 1, 2014

This problem also occurs with groupby operations that create multiple indexes with time zones. For example:

dat = pd.DataFrame({'label':['a', 'a', 'a', 'b', 'b', 'b'], 'datetime':['2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00', '2011-07-19 07:00:00', '2011-07-19 08:00:00', '2011-07-19 09:00:00'], 'value':range(6)})
dat['datetime'] = dat['datetime'].apply(lambda d: pd.Timestamp(d, tz='US/Pacific'))

We start out with timezones:

In [386]: dat.head()
Out[386]: 
                    datetime label  value
0  2011-07-19 07:00:00-07:00     a      0
1  2011-07-19 08:00:00-07:00     a      1
2  2011-07-19 09:00:00-07:00     a      2
3  2011-07-19 07:00:00-07:00     b      3
4  2011-07-19 08:00:00-07:00     b      4

However, if we do a groupby on multiple columns, we will lose the timezones:

In [387]: dat.groupby(['datetime', 'label'])['value'].sum()
Out[387]: 
datetime             label
2011-07-19 14:00:00  a        0
                     b        3
2011-07-19 15:00:00  a        1
                     b        4
2011-07-19 16:00:00  a        2
                     b        5
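
The groupby case can be checked the same way. The sketch below assumes a recent pandas release and localizes the column with `tz_localize` rather than the per-row `apply` used above. The variable names `summed` and `level` are illustrative.

```python
import pandas as pd

stamps = ["2011-07-19 07:00:00", "2011-07-19 08:00:00", "2011-07-19 09:00:00"] * 2
dat = pd.DataFrame({
    "label": ["a"] * 3 + ["b"] * 3,
    # Localize the naive timestamps directly to US/Pacific.
    "datetime": pd.to_datetime(stamps).tz_localize("US/Pacific"),
    "value": range(6),
})

# Grouping on two columns produces a MultiIndex on the result.
summed = dat.groupby(["datetime", "label"])["value"].sum()

# Inspect the datetime level of that MultiIndex.
level = summed.index.get_level_values("datetime")
print(level.tz)
```

A `None` printed here would mean the timezone was dropped during the groupby, as in the output shown above.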

@andrewchou34

We just upgraded to pandas 0.14, and the groupby/pivot bugs related to losing tz-info have been fixed. However, reset_index of a MultiIndex still loses tz-info.

In [7]: ts = pd.date_range('1/1/2011', periods=5, freq='10s', tz = 'US/Eastern')

In [8]: foo = pd.DataFrame({'c' : range(5)}, index = pd.MultiIndex([range(5), ts], [range(5), range(5)], names = [ 'a' , 'b' ]))

In [9]: print foo                                                                                                
                             c
a b                           
0 2011-01-01 00:00:00-05:00  0
1 2011-01-01 00:00:10-05:00  1
2 2011-01-01 00:00:20-05:00  2
3 2011-01-01 00:00:30-05:00  3
4 2011-01-01 00:00:40-05:00  4

In [10]: foo2 = foo.reset_index()                                                                                                

In [11]: print foo2                                                                                                              
   a                   b  c
0  0 2011-01-01 05:00:00  0
1  1 2011-01-01 05:00:10  1
2  2 2011-01-01 05:00:20  2
3  3 2011-01-01 05:00:30  3
4  4 2011-01-01 05:00:40  4
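
For anyone re-testing the `reset_index` case on a newer pandas, here is a minimal sketch. It uses `pd.MultiIndex.from_arrays` instead of the positional `pd.MultiIndex(levels, labels)` call in the session above, which is an assumption about what newer releases accept most readily. `tz_after_reset` is an illustrative name.

```python
import pandas as pd

ts = pd.date_range("2011-01-01", periods=5, freq="10s", tz="US/Eastern")

# Build the two-level index from parallel arrays.
index = pd.MultiIndex.from_arrays([list(range(5)), ts], names=["a", "b"])
foo = pd.DataFrame({"c": range(5)}, index=index)

# Move both index levels back into columns.
foo2 = foo.reset_index()

# The tz of column 'b' should match the tz of the original level.
tz_after_reset = foo2["b"].dt.tz
print(tz_after_reset)
```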

Column assignment also still loses tz-info.

In [13]: bar = pd.DataFrame( {'a' : ts, 'b':range(5)} )

In [14]: print bar
                    a  b
0 2011-01-01 05:00:00  0
1 2011-01-01 05:00:10  1
2 2011-01-01 05:00:20  2
3 2011-01-01 05:00:30  3
4 2011-01-01 05:00:40  4

hayd modified the milestones: 0.14.1, 0.14.0 Jun 14, 2014
hayd reopened this Jun 14, 2014

@jreback
Contributor

jreback commented Jun 16, 2014

cc @sinhrks take a look?

@sinhrks
Member

sinhrks commented Jun 21, 2014

Yeah, I think reset_index was untouched by the previous PRs. The fix doesn't look difficult.

@andrewchou34

We just upgraded to pandas 0.14.1, and the reset_index no longer loses tz-info. Thanks for fixing that!

However, it still seems that column assignment loses tz-info:

In [16]: ts = pd.date_range('1/1/2011', periods=5, freq='10s', tz = 'US/Eastern')

In [17]: ts
Out[17]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-01-01 00:00:00-05:00, ..., 2011-01-01 00:00:40-05:00]
Length: 5, Freq: 10S, Timezone: US/Eastern

In [18]: pd.DataFrame( {'a' : ts } )
Out[18]:
                    a
0 2011-01-01 05:00:00
1 2011-01-01 05:00:10
2 2011-01-01 05:00:20
3 2011-01-01 05:00:30
4 2011-01-01 05:00:40

@jreback
Contributor

jreback commented Jul 23, 2014

This is by definition: you are passing a dictionary, where things are coerced. I suppose one might consider this a bug. I'll create a separate issue. There are several ways to do this in any event.

pd.DataFrame({'a' : ts.to_series(keep_tz=True) })

or

df['a'] = ts
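
The three routes discussed here (dict constructor, `to_series`, and direct column assignment) can be compared side by side. This is a sketch under the assumption of a recent pandas release. It calls `to_series()` without `keep_tz`, since that argument is believed to have been dropped in later versions. The names `from_dict`, `from_series`, `assigned` and `tz_by_route` are illustrative.

```python
import pandas as pd

ts = pd.date_range("2011-01-01", periods=5, freq="10s", tz="US/Eastern")

# Route 1: pass the tz-aware index straight into the dict constructor.
from_dict = pd.DataFrame({"a": ts})

# Route 2: convert to a Series first (the resulting frame is indexed by ts).
from_series = pd.DataFrame({"a": ts.to_series()})

# Route 3: assign the index to a column of an existing frame.
assigned = pd.DataFrame({"b": range(5)})
assigned["a"] = ts

# Collect the timezone each route ends up with.
tz_by_route = {
    "dict": from_dict["a"].dt.tz,
    "to_series": from_series["a"].dt.tz,
    "assignment": assigned["a"].dt.tz,
}
print(tz_by_route)
```

Any `None` in `tz_by_route` marks a route that still coerces to naive timestamps on the installed version.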

@andrewchou34

Thanks for creating the new issue and giving other solutions.

When testing 0.14.1 on old code, I ran into the following crash when using reset_index with tz-aware data in a MultiIndex.

In [1]: import pandas as pd

In [2]: ts = pd.date_range('1/1/2011', periods = 2, freq = '10s', tz = 'US/Eastern')

In [3]: df = pd.DataFrame({'a' : [ts[0], ts[0], ts[1], ts[1]], 'b': [1,2,1,2], 'c':[1,2,3,4]})

In [4]: df_gb_sum = df.groupby(['a','b']).sum()

In [5]: df_gb_sum
Out[5]: 
                             c
a                         b   
2011-01-01 00:00:00-05:00 1  1
                          2  2
2011-01-01 00:00:10-05:00 1  3
                          2  4

In [6]: df_gb_sum.reset_index()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-47b6fb3071a5> in <module>()
----> 1 df_gb_sum.reset_index()

/Users/andrewchou/dev/bourbakitech/pyenv/lib/python2.7/site-packages/pandas/core/frame.pyc in reset_index(self, level, drop, inplace, col_level, col_fill)
   2481                     level_values = _maybe_casted_values(lev, lab)
   2482                     if level is None or i in level:
-> 2483                         new_obj.insert(0, col_name, level_values)
   2484 
   2485         elif not drop:

/Users/andrewchou/dev/bourbakitech/pyenv/lib/python2.7/site-packages/pandas/core/frame.pyc in insert(self, loc, column, value, allow_duplicates)
   2107         """
   2108         self._ensure_valid_index(value)
-> 2109         value = self._sanitize_column(column, value)
   2110         self._data.insert(
   2111             loc, column, value, allow_duplicates=allow_duplicates)

/Users/andrewchou/dev/bourbakitech/pyenv/lib/python2.7/site-packages/pandas/core/frame.pyc in _sanitize_column(self, key, value)
   2139         elif isinstance(value, Index) or _is_sequence(value):
   2140             if len(value) != len(self.index):
-> 2141                 raise ValueError('Length of values does not match length of '
   2142                                  'index')
   2143 

ValueError: Length of values does not match length of index

Note that if we don't set a tz, it seems to work fine.

In [7]: ts = pd.date_range('1/1/2011', periods = 2, freq = '10s')

In [8]: df = pd.DataFrame({'a' : [ts[0], ts[0], ts[1], ts[1]], 'b': [1,2,1,2], 'c':[1,2,3,4]})

In [9]: df_gb_sum = df.groupby(['a','b']).sum()

In [10]: df_gb_sum.reset_index()
Out[10]: 
                    a  b  c
0 2011-01-01 00:00:00  1  1
1 2011-01-01 00:00:00  2  2
2 2011-01-01 00:00:10  1  3
3 2011-01-01 00:00:10  2  4
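
The tz-aware and naive variants of this reproduction can be run through one helper to compare them directly. The sketch below assumes a recent pandas release. `groupby_roundtrip` is a hypothetical helper written for this comparison, not a pandas function.

```python
import pandas as pd


def groupby_roundtrip(tz=None):
    """Group on a datetime column plus 'b', sum, then reset the index."""
    ts = pd.date_range("2011-01-01", periods=2, freq="10s", tz=tz)
    df = pd.DataFrame({
        "a": [ts[0], ts[0], ts[1], ts[1]],
        "b": [1, 2, 1, 2],
        "c": [1, 2, 3, 4],
    })
    return df.groupby(["a", "b"]).sum().reset_index()


# tz-aware input: the case that raised ValueError in the traceback above.
aware = groupby_roundtrip("US/Eastern")

# Naive input: the case that already worked.
naive = groupby_roundtrip()

print(aware["a"].dt.tz, naive["a"].dt.tz)
```

On a version containing the fix, both calls should return four rows, and only the tz-aware frame should report a timezone on column `a`.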

@jreback
Contributor

jreback commented Jul 23, 2014

That last one is already fixed here: #7533

6 participants