BUG: select_column not preserving a UTC timezone #7777

Closed
alorenzo175 opened this Issue Jul 17, 2014 · 7 comments

alorenzo175 commented Jul 17, 2014

I was losing tz info when retrieving a DatetimeIndex from an HDF store using store.select_column('data', 'index'). I tracked the problem down to the Index._to_embed method in tseries/index.py, which reads:

    def _to_embed(self, keep_tz=False):
        """ return an array repr of this object, potentially casting to object """
        if keep_tz and self.tz is not None and str(self.tz) != 'UTC':
            return self.asobject.values
        return self.values

It looks like it explicitly rejects UTC timezones. Is there a good reason for this?

The below code reproduces the problem for me.

import pandas as pd

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

print(drange._to_embed(keep_tz=True))
print(drange_utc._to_embed(keep_tz=True))
print(drange_mst._to_embed(keep_tz=True))

I'm using Python 2.7.6 with the following packages:

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.17.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.2
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 1.8.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None

jreback commented Jul 17, 2014

This is an internal method.

Show a complete example of what you are actually doing.


alorenzo175 commented Jul 17, 2014

Making a Series of a DatetimeIndex also illustrates the problem.

import pandas as pd

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

print(pd.Series(drange))
print(pd.Series(drange_utc))
print(pd.Series(drange_mst))

alorenzo175 commented Jul 17, 2014

And here is an example of my original problem, getting the index from an HDF store:

import pandas as pd
import numpy as np

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

data = np.ones((drange.size, 3))
df_utc = pd.DataFrame(data, index=drange_utc)
df_mst = pd.DataFrame(data, index=drange_mst)

store_path = 'timezone_test.h5'
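# Write both frames in table format so select_column can be used on them
with pd.get_store(store_path) as store:
    store.put('utc', df_utc, 'table')
    store.put('mst', df_mst, 'table')

# Read back only the index column; the UTC tz is lost, the MST tz is kept
with pd.get_store(store_path) as store:
    print(store.select_column('utc', 'index'))
    print(store.select_column('mst', 'index'))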

jreback commented Jul 17, 2014

@alorenzo175 By definition a datetime64[ns] is UTC (though technically it doesn't have a tz attached). I guess what you want is to keep a tz-attached (but UTC) series. You can do this, but you will pretty much have to specify dtype=object a lot.

That is because a UTC series is de facto equivalent to a plain datetime64[ns] series and is indistinguishable from it. (The index itself, however, IS distinguishable.)
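The equivalence described above can be checked directly. This is a minimal sketch that reuses the ranges from the report and compares the underlying arrays returned by `.values`; it assumes a pandas installation with time zone data available and does not touch the HDF store at all.

```python
import numpy as np
import pandas as pd

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

# .values drops the tz and exposes the underlying UTC instants, so the
# UTC-localized index and the naive index hold identical arrays.
same_as_naive = bool((drange_utc.values == drange.values).all())

# The MST index holds different instants: midnight MST is 07:00 UTC.
offset = drange_mst.values - drange.values
mst_shifted = bool((offset == np.timedelta64(7, 'h')).all())

print(same_as_naive, mst_shifted)
```

Only the tz attribute on the index tells the UTC and naive cases apart, which is why that information is easy to lose once the values are placed in a Series.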

I guess this could be a bit confusing. A possible workaround is to store the 'UTC' data as 'GMT', which will be treated as a regular timezone.
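A rough sketch of the GMT idea, under the assumption that the only thing that matters is the `str(self.tz) != 'UTC'` check quoted from `_to_embed` earlier in this thread. It shows that converting to 'GMT' keeps the same instants while changing the tz name; whether the round trip through the store then preserves the tz is not demonstrated here.

```python
import pandas as pd

drange_utc = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00',
                           freq='1h', tz='UTC')

# Relabel the index as GMT; the instants themselves are unchanged.
drange_gmt = drange_utc.tz_convert('GMT')

tz_name = str(drange_gmt.tz)  # no longer the special-cased 'UTC' string
same_instants = bool((drange_gmt == drange_utc).all())

print(tz_name, same_instants)
```

The GMT-labelled index would then be used as the DataFrame index before calling `store.put`.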

Selecting this as a full table DOES seem to work though (e.g. store.select('utc')), so maybe there is a bug in select_column, as it's not doing the proper conversion (when the result is put into the Series).

OK, will mark that as a bug.

Interested in doing a pull request to fix it (it will be in io/pytables/Table/read_column)?
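Until a fix lands, one option on the caller's side is to re-attach UTC to the returned column. The sketch below is an illustration, not the eventual fix: `naive_col` is a stand-in for the tz-naive Series that `store.select_column('utc', 'index')` returned in the report, and the `.dt` accessor it relies on is only available in pandas versions newer than the 0.14.1 used above.

```python
import pandas as pd

drange_utc = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00',
                           freq='1h', tz='UTC')

# Stand-in for the select_column result: same instants, tz stripped.
naive_col = pd.Series(drange_utc.tz_convert(None))

# The naive values are already UTC instants, so localizing (not converting)
# restores the original tz-aware data.
restored = naive_col.dt.tz_localize('UTC')

restored_tz = str(restored.dt.tz)
round_trip_ok = bool((pd.DatetimeIndex(restored) == drange_utc).all())
print(restored_tz, round_trip_ok)
```

This only works because the stored values are known to be UTC; for any other zone the tz would have to be recorded separately.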

jreback added this to the 0.15.0 milestone Jul 17, 2014

jreback changed the title from UTC timezone information lost from DatetimeIndex on _to_embed to BUG: select_column not preserving a UTC timezone Jul 17, 2014


alorenzo175 commented Jul 18, 2014

After messing around a little with the Index._to_embed function, I agree that this should just be a fix for pytables.

I'll try to make a fix, but this will be my first PR.


jreback commented Jul 18, 2014

@alorenzo175 No problem. I don't think it's that tricky, but you'll have to get to know the code. Let me know.


jreback commented Jul 18, 2014

jreback closed this in #7790 Jul 22, 2014
