to_datetime parsing bug when using format #4152

Closed
michaelaye opened this Issue Jul 7, 2013 · 3 comments

michaelaye commented Jul 7, 2013

import datetime as dt
import pandas as pd
val = '01-Apr-2011 00:00:01.978'
print 'pandas version:',pd.__version__
print 'Value to parse:',val
format = '%d-%b-%Y %H:%M:%S.%f'
print 'datetime.strptime        :',dt.datetime.strptime(val, format)
print 'to_datetime, w/out format:',pd.to_datetime(val)
print 'to_datetime, w/ format   :', pd.to_datetime(val, format=format)

pandas version: 0.12.0.dev-1101391
Value to parse: 01-Apr-2011 00:00:01.978
datetime.strptime        : 2011-04-01 00:00:01.978000
to_datetime, w/out format: 2011-04-01 00:00:01.978000
to_datetime, w/ format   : 2011-03-31 23:24:13.516352
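The comparison above can be turned into a small regression check. This is only a sketch: it assumes a pandas release that already includes the fix for this issue and an English locale (for '%b'), and the variable names are illustrative rather than taken from the pandas test suite.

```python
from datetime import datetime

import pandas as pd

val = '01-Apr-2011 00:00:01.978'
fmt = '%d-%b-%Y %H:%M:%S.%f'

# The standard library result serves as the reference value.
reference = datetime.strptime(val, fmt)

# With an explicit format, pandas should return the same instant.
parsed = pd.to_datetime(val, format=fmt)

matches = bool(parsed == reference)
print('strptime   :', reference)
print('to_datetime:', parsed)
print('match      :', matches)
```

On the development version reported here, the last line would have printed False.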

michaelaye commented Jul 8, 2013

FYI, the reason why I want to use to_datetime() with a given format string is speed.


hayd commented Jul 8, 2013

perhaps #3669 was prematurely closed (cc #2213 and #3890...)

@hayd I don't think so. #3669 was about the "format" argument being ignored completely. This issue is about what looks like a bug in the parser that is used when a format is given. For example, it doesn't matter whether you pass a single string or an array; both give a wrong result:

In [12]:  pd.to_datetime('01-Apr-2011 00:00:01.978', format= '%d-%b-%Y %H:%M:%S.%f')
Out[12]: Timestamp('2011-03-31 23:24:13.516352', tz=None)

In [14]:  pd.to_datetime(np.array(['01-Apr-2011 00:00:01.978']), format= '%d-%b-%Y %H:%M:%S.%f')
Out[14]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2011-03-31 23:24:13.516352]
Length: 1, Freq: None, Timezone: None

And it seems that it has something to do with the parsing of the microseconds:

In [17]:  pd.to_datetime('01-Apr-2011 00:00:01.000', format= '%d-%b-%Y %H:%M:%S.
Out[17]: Timestamp('2011-04-01 00:00:01', tz=None)

In [18]:  pd.to_datetime('01-Apr-2011 00:00:01.001', format= '%d-%b-%Y %H:%M:%S.
Out[18]: Timestamp('2011-04-01 00:16:41', tz=None)
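To make the size of the error concrete, the wrong results quoted in this thread can be compared against the whole-second part of the input. The sketch below uses only the standard library and the values reported above; the helper name applied_microseconds is made up for illustration. The interpretation in the comments (a fraction scaled by an extra factor of 10**6, and a 32-bit limit for the larger value) is an inference from the numbers, not something confirmed in the thread.

```python
from datetime import datetime

# Whole-second part shared by both inputs ('01-Apr-2011 00:00:01').
base = datetime(2011, 4, 1, 0, 0, 1)

# (true fraction in microseconds, wrong result reported in the thread)
reported = [
    (1000, datetime(2011, 4, 1, 0, 16, 41)),              # input ended in .001
    (978000, datetime(2011, 3, 31, 23, 24, 13, 516352)),  # input ended in .978
]


def applied_microseconds(wrong):
    """Microseconds that were effectively added to the whole-second base."""
    delta = wrong - base
    return (delta.days * 86400 + delta.seconds) * 10**6 + delta.microseconds


for fraction_us, wrong in reported:
    print(fraction_us, '->', applied_microseconds(wrong))

# .001: 1000 microseconds became 1000 * 10**6 microseconds (1000 seconds).
# .978: the applied offset equals -2**31 microseconds, which would be
#       consistent with a value that no longer fits in a signed 32-bit integer.
```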

hayd closed this in #4166 Jul 9, 2013

hayd added a commit that referenced this issue Jul 9, 2013

Merge pull request #4166 from jorisvandenbossche/bug-microsecond-parsing
BUG: wrong parsing of microseconds with format arg (#4152)
1a6b967
