Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

to_excel incorrectly sets time to midnight for exported datetime column #9139

Closed
scls19fr opened this issue Dec 23, 2014 · 17 comments · Fixed by #9257
Closed

to_excel incorrectly sets time to midnight for exported datetime column #9139

scls19fr opened this issue Dec 23, 2014 · 17 comments · Fixed by #9257
Labels
Bug IO Excel read_excel, to_excel
Milestone

Comments

@scls19fr
Copy link
Contributor

Hello,

I try this:

import requests
from pandas.io.json import json_normalize
import json
import pandas as pd
response = requests.get("http://api.openweathermap.org/data/2.5/history/station?start=1356220800&end=1356307200&type=hour&id=5530")
df = json_normalize(json.loads(response.text)['list'])
df['dt'] = pd.to_datetime(df['dt'], unit='s')
#df = df.set_index('dt')
print(df)
df.to_excel("out.xls")

It displays:

                    dt  humidity.c      ...       pressure.v  temp.c  temp.ma  \
0  2012-12-23 00:00:00           2      ...           1017.0       2   286.15
1  2012-12-23 01:00:00           2      ...           1017.0       2   286.15
2  2012-12-23 02:00:00           2      ...           1017.0       2   285.15
3  2012-12-23 03:00:00           2      ...           1017.0       2   284.15
4  2012-12-23 04:00:00           2      ...           1017.0       2   283.15
...
19 2012-12-23 19:00:00           1      ...           1016.0       1   282.15
20 2012-12-23 20:00:00           1      ...           1015.0       1   281.15
21 2012-12-23 21:00:00           1      ...           1015.0       1   282.15
22 2012-12-23 22:00:00           1      ...           1015.0       1   281.15
23 2012-12-23 23:00:00           1      ...           1005.0       1   286.15
24 2012-12-24 00:00:00           2      ...           1013.5       2   280.15

but if we look at Excel file dt column looks like

dt
2012-12-23 00:00:00
2012-12-23 00:00:00
2012-12-23 00:00:00
2012-12-23 00:00:00
2012-12-23 00:00:00
...
2012-12-23 00:00:00
2012-12-23 00:00:00
2012-12-23 00:00:00
2012-12-23 00:00:00
2012-12-24 00:00:00

to_csv method is ok

In [1]: pd.__version__
Out[1]: '0.15.2'

Kind regards

@shoyer shoyer changed the title to_excel doesn't export Timestamp correctly to_excel incorrectly sets time to midnight for exported datetime column Dec 24, 2014
@shoyer shoyer added Bug IO Excel read_excel, to_excel labels Dec 24, 2014
@shoyer shoyer added this to the 0.16.0 milestone Dec 24, 2014
@shoyer
Copy link
Member

shoyer commented Dec 24, 2014

@scls19fr Thanks for the report!

Pandas supports a couple of external modules for writing Excel modules, so this might also depend on which one you're using.

Here is where you can find the relevant source code if you're interested in trying your hand at a fix: https://github.com/pydata/pandas/blob/master/pandas/io/excel.py

@jorisvandenbossche
Copy link
Member

This was also reported here: http://stackoverflow.com/questions/27548833/pandas-to-excel-in-version-0-15-2-not-working-for-datetime-objects

But, as I posted over there, I cannot reproduce this. @scls19fr Can you show pd.show_versions()? @shoyer can you reproduce this?

@scls19fr
Copy link
Contributor Author

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 14.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: fr_FR.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.1
numpy: 1.9.1
scipy: 0.14.1rc1
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.3
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.4.2
openpyxl: 2.0.3
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.6.4
lxml: 3.4.1
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: 0.6.2.None
psycopg2: None

@scls19fr
Copy link
Contributor Author

I did

import requests
from pandas.io.json import json_normalize
import json
import pandas as pd
response = requests.get("http://api.openweathermap.org/data/2.5/history/station?start=1356220800&end=1356307200&type=hour&id=5530")
df = json_normalize(json.loads(response.text)['list'])
df['dt'] = pd.to_datetime(df['dt'], unit='s')
#df = df.set_index('dt')
df = df[['dt', 'main.temp.ma']]
print(df)


for (ext, engine) in [('xlsx', 'openpyxl'), ('xlsx', 'xlsxwriter'), ('xls', 'xlwt')]:
    with pd.ExcelWriter("out_{}.{}".format(engine, ext), engine=engine) as writer:
        df.to_excel(writer)

out_openpyxl.xlsx is ok but out_xlsxwriter.xlsx and out_xlwt.xls are bad (midnight issue).

@jorisvandenbossche
Copy link
Member

@scls19fr Thanks, I can now confirm the same: working for openpyxl, wrong for xlsxwriter and xlwt

@jorisvandenbossche
Copy link
Member

cc @jtratner @neirbowj

@jmcnamara
Copy link
Contributor

I am the author of XlsxWriter. Here is my analysis of what is happening in this issue starting with the smaller example above:

import requests
from pandas.io.json import json_normalize
import json
import pandas as pd
response = requests.get("http://api.openweathermap.org/data/2.5/history/station?start=1356220800&end=1356307200&type=hour&id=5530")

df = json_normalize(json.loads(response.text)['list'])
df['dt'] = pd.to_datetime(df['dt'], unit='s')
df = df[['dt', 'main.temp.ma']]

If we take one of these values as an example:

>>> timestamp = df['dt'][23]
>>> timestamp
Timestamp('2012-12-23 23:00:00')

An Excel date is stored as a delta of days from an epoch (the integer part of the number) and a percentage of the day for the time in milliseconds (the float part). XlsxWriter and afaik Xlwt use a calculation like the following:

>>> import datetime
>>> epoch = datetime.datetime(1899, 12, 31, 0, 0, 0)
>>> excel_datetime = timestamp - epoch
>>> excel_datetime
Timedelta('41265 days 23:00:00')
>>> excel_datetime.days()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not callable
>>> excel_datetime.days
41265
>>> excel_datetime.seconds
0
>>> excel_datetime.microseconds
0

As you can see the seconds part of the Timedelta() doesn't look right and the seconds and microseconds are 0. Hence the 0 times in the Excel files.

For comparison here is how a datetime.datatime object behaves:

>>> mydate = datetime.datetime(2012, 12, 23, 23, 0, 0)
>>> excel_datetime = mydate - epoch
>>> excel_datetime
datetime.timedelta(41265, 82800)
>>> excel_datetime.days
41265
>>> excel_datetime.seconds
82800
>>> excel_datetime.microseconds
0

XlsxWriter performs this epoch calculation because the Timestamp object meets the following criteria:

>>> isinstance(timestamp, datetime.datetime)
True

So for me it looks like the issue here is that the pandas Timestamp() object doesn't handle the epoch calculation correctly.

There are a few workable solutions but I'll hold off until someone familiar with the Timestamp class has a chance to look at this and respond.

John

@jreback
Copy link
Contributor

jreback commented Dec 29, 2014

So the problem is the conversion is relying upon an implementation detail of timedelta, meaning that .days gets you the days and .seconds the seconds as the remainder. This was asof 0.15.0.
I noted this specifically here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#new-features.

There are many ways to do this.

In [26]: ts = Timestamp('2012-12-23 23:00:00')

In [27]: epoch = datetime.datetime(1899, 12, 31, 0, 0, 0)

In [28]: td = ts - epoch

In [29]: td
Out[29]: Timedelta('41265 days 23:00:00')

In [32]: (td-Timedelta(days,unit='d')).total_seconds()
Out[32]: 82800.0

In [33]: td.components
Out[33]: Components(days=41265, hours=23, minutes=0, seconds=0, milliseconds=0, microseconds=0, nanoseconds=0)


In [35]: td.hours*3600+td.minutes*60+td.seconds
Out[35]: 82800

In [36]: td.to_pytimedelta().days
Out[36]: 41265

In [37]: td.to_pytimedelta().seconds
Out[37]: 82800

@shoyer
Copy link
Member

shoyer commented Dec 29, 2014

@jmcnamara This is very helpful, thanks!

For an explanation of how pandas.Timedelta works see this section of the docs (especially the warning):
http://pandas.pydata.org/pandas-docs/stable/timedeltas.html#attributes

This is certainly an unfortunate side-effect of overriding super-class behavior in a subclass. I recall raising this issue in the pull request adding Timedelta: #8184. It looks like @jreback convinced me that the overriding the super-class behavior was not so bad, but now I'm not so sure (since it clearly is relied in in external packages).

The simplest fix would be to use to_pydatetime() for coercing datetime64 columns (prior to passing off to the various Excel writing backends)

@shoyer
Copy link
Member

shoyer commented Dec 29, 2014

FWIW, I think pandas should fix this issue on our side (since we broke it with our strange timedelta subclass) rather than asking xlsxwriter and xlwt to fix this upstream.

@jorisvandenbossche
Copy link
Member

@shoyer @jreback Open a new issue to discuss this behaviour of the timedelta attributes again?

@jreback
Copy link
Contributor

jreback commented Dec 29, 2014

this is not an upstream issue
rather a bug in the pandss openpyxl engine

@neirbowj
Copy link
Contributor

@jreback wait... wat?

@jmcnamara
Copy link
Contributor

I've added a PR for this based on coercing the Timestamp object to a datetime object for xlwt and xlsxwriter with a test.

Even if you choose a different solution you can use the testcase.

I was surprised that the existing testcases didn't pick up this issue but none of them actually test a roundtrip of a date and time. o_O But I am not in a position to throw the first stone on that one.

@jreback
Copy link
Contributor

jreback commented Jan 2, 2015

@neirbowj sorry, I meant xlwt/xlsxwriter

@drorata
Copy link
Contributor

drorata commented Feb 17, 2015

Was it actually fixed? It seems like I still get this issue. For example:

ts = pd.Timestamp('2015-02-04 12:45:50.240999')
df = pd.DataFrame([('foo', ts),])
writer = ExcelWriter('timestamp-test.xlsx', engine = 'xlsxwriter')
df.to_excel(writer, 'Test')

and the result is:

selection_003

As you can see the exported timestamp is different than the given one.

BTW: using openpyxl as suggested here solves the issue.

@jreback
Copy link
Contributor

jreback commented Feb 17, 2015

well if you use master it is. This will be in the forthcoming 0.16.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO Excel read_excel, to_excel
Projects
None yet
7 participants