Problem with datetime and HDFStore #809

Closed
mattias-lundell opened this issue Feb 21, 2012 · 8 comments

Comments

@mattias-lundell

commented Feb 21, 2012

When storing a DataFrame using HDFStore the datetime information is altered. My guess is that there is some problem with daylight saving and time zones when the DataFrame is loaded from the h5 file. An example:

In [1]: from pandas import *

In [2]: df = DataFrame([0,1], [datetime(2011, 3, 27, 2, 2, 2),datetime(2011, 3, 27, 3, 2, 2)])

In [3]: s = HDFStore("test.h5")

In [4]: s["test"] = df

In [5]: df
Out[5]: 
                     0
2011-03-27 02:02:02  0
2011-03-27 03:02:02  1

In [6]: s["test"]
Out[6]: 
                     0
2011-03-27 03:02:02  0
2011-03-27 03:02:02  1
@adamklein

Contributor

commented Feb 22, 2012

I get a different result. It is also odd that DST started on 3/13 in 2011, not 3/27.

What pandas.version do you have, and what OS and Python? If Linux, what are your LC_ variables? (i.e., run set | grep "LC" in bash)


In [2]: df
Out[2]: 
                     0
2011-03-27 02:02:02  0
2011-03-27 03:02:02  1

In [3]: s["test"]
Out[3]: 
                     0
2011-03-27 02:02:02  0
2011-03-27 03:02:02  1
@mattias-lundell

Author

commented Feb 22, 2012

Aha, I thought that it was the same date across all countries but that was not the case. According to http://www.timeanddate.com/time/dst/2011.html there are several different dates.

Running locale says "en_US.utf8", but I am located in Sweden, and in Sweden DST started on the 27th (probably something is broken in my locale). By the way, what is the correct behavior? Maybe it's just me handling the data wrong.

I'm running Ubuntu 11.04, Python 2.7.1 and pandas-0.7.0rc1-py2.7-linux-x86_64.egg.

@mattias-lundell

Author

commented Feb 22, 2012

My use case is that I read data stored in CSV files and load them into DataFrames. So far, so good. The problem occurs when persisting the DataFrame in an h5 file. When I load the data back from the h5 file, I receive a DataFrame whose index contains duplicate entries.

@adamklein

Contributor

commented Feb 22, 2012

You are right about DST being different where you are :)

Since there is no timezone information attached, the principle of least surprise would suggest it should return exactly what you stored.

However, looking into pandas/io/pytables.py, going into storage, it does:

time.mktime(v.timetuple())
Docstring: Convert a time tuple in local time to seconds since the Epoch.

And coming out,

[datetime.fromtimestamp(v) for v in data]
Docstring:  timestamp[, tz] -> tz's local time from POSIX timestamp.

OK, so the problem is this: 2:02 on 3/27 is actually a non-existent local time where you are, because the clocks jump from 2:00 straight to 3:00, so 2:02 ends up as 3:02. How your locale knows you are in Sweden, and how your POSIX API takes advantage of this, I have no idea. Can you prefilter the data before storing?

Even stranger, I cannot reproduce the behavior on 3/13 in my own timezone (EST5EDT), where I should see exactly the same thing.
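The round trip described above can be reproduced outside pandas. The following is a minimal sketch, not the actual pandas/io/pytables.py code: the `roundtrip` helper is a hypothetical name, and the POSIX TZ rule `CET-1CEST,M3.5.0,M10.5.0/3` is an assumption used to approximate Swedish time without relying on the machine's configured zone. `time.tzset()` is only available on Unix.

```python
import os
import time
from datetime import datetime

# Assumption: a POSIX TZ rule approximating Swedish time (CET/CEST),
# so the sketch does not depend on the machine's configured zone.
os.environ["TZ"] = "CET-1CEST,M3.5.0,M10.5.0/3"
time.tzset()  # Unix only


def roundtrip(dt):
    """Mimic the store/load conversion quoted above for a naive datetime."""
    stored = time.mktime(dt.timetuple())   # naive value interpreted as local time
    return datetime.fromtimestamp(stored)  # converted back through local time


for dt in (datetime(2011, 3, 27, 2, 2, 2), datetime(2011, 3, 27, 3, 2, 2)):
    print(dt, "->", roundtrip(dt))
```

A wall-clock time that exists in the zone, such as 03:02:02, comes back unchanged. The 02:02:02 value cannot come back unchanged, because `datetime.fromtimestamp` only produces local times that exist. Which neighbouring time it is mapped to depends on the platform's `mktime`, which may explain why the behavior is not reproducible everywhere.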

@adamklein

Contributor

commented Feb 22, 2012

I assume one would want the option to save data in non-standard form (i.e., no daylight saving time even during the daylight saving time period). I'll see whether this is easy.

@mattias-lundell

Author

commented Feb 22, 2012

Thank you for looking into this.

I would definitely benefit from having the option of storing and reading data in time zones other than my local one.

@adamklein

Contributor

commented Feb 22, 2012

I think the best way to deal with this for now is to set the timezone information on your original dates. I.e., if x is a datetime, x = x.replace(tzinfo=pytz.UTC). When it comes out the other side, it should conform to your local time properly. We should have improved time zone handling in 0.8 along with the datetime64 type.
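To illustrate why pinning the values to UTC sidesteps the DST gap, here is a minimal sketch of a UTC-based round trip. It is not what HDFStore does internally. The helper names `to_epoch_utc` and `from_epoch_utc` are hypothetical, and the standard library's `timezone.utc` is used in place of `pytz.UTC` so the example has no third-party dependency.

```python
import calendar
from datetime import datetime, timezone


def to_epoch_utc(dt):
    """Attach UTC to a naive datetime and convert it to whole epoch seconds."""
    aware = dt.replace(tzinfo=timezone.utc)
    # timegm treats the tuple as UTC, so no local DST rule is consulted.
    # Note: microseconds are dropped by this conversion.
    return calendar.timegm(aware.utctimetuple())


def from_epoch_utc(ts):
    """Convert epoch seconds back to a naive datetime expressed in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).replace(tzinfo=None)


index = [datetime(2011, 3, 27, 2, 2, 2), datetime(2011, 3, 27, 3, 2, 2)]
restored = [from_epoch_utc(to_epoch_utc(dt)) for dt in index]
print(restored == index)
```

Because UTC has no daylight saving transitions, every wall-clock value maps to a distinct timestamp, so the two index entries stay one hour apart and no duplicates appear.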

@wesm

Member

commented May 13, 2012

Timestamp data is all represented internally as UTC (even though it may appear to be in one time zone vs. another) and should not have any locale issues in pandas 0.8.0. See #1232 re storing time zones in HDFStore; that will be done soon.

@wesm wesm closed this May 13, 2012
