Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: inconsistent date parsing by to_datetime #7348

Closed
aschilling opened this issue Jun 5, 2014 · 11 comments
Closed

Bug: inconsistent date parsing by to_datetime #7348

aschilling opened this issue Jun 5, 2014 · 11 comments
Labels
Duplicate Report Duplicate issue or pull request Timeseries

Comments

@aschilling
Copy link

Hi everybody,

lets take the following dates:
dates = ['04/07/2013 21:40','22/06/2013 02:40']

if I call pd.to_datetime on these two date strings, it produces incorrect results. For the first date '7' is interpreted as day-value while for the second date '22' is interpreted as the day-value.

Please get me right I have no problem with the wrong parsed date value in the first string, but with the missing propagation of the corrent interpretation scheme after reading the second string.

Thanks

@jorisvandenbossche
Copy link
Member

@aschilling If you want to have dates parsed with the day first, you can use pd.to_datetime(dates, dayfirst=True) or specify the format explicitly with pd.to_datetime(dates, format="%d/%m/%Y %H:%M").

This is a problem with dateutil not being strict (see eg also #3341 as a 'known bug'), as it will parse each date as it thinks is best (with a preference on the day last) (as such it can parse dates with mixed formats, ). If you certainly want the same parsing for the full array of dates, best use format

@jorisvandenbossche jorisvandenbossche changed the title Bub: inconsistent date paring by to_datetime Bub: inconsistent date parsing by to_datetime Jun 5, 2014
@aschilling aschilling changed the title Bub: inconsistent date parsing by to_datetime Bug: inconsistent date parsing by to_datetime Jun 5, 2014
@aschilling
Copy link
Author

I understand, but don't you think that pandas should raise then an Exception as suggested by Wes? Especially, if it is filed for more than a year?

@jorisvandenbossche
Copy link
Member

I think you mean Andy :-) (@hayd)
It is not that simple, as this is handled by dateutil, which is a third party library. And the issue I linked to is still a bit different (although related). That is about when using dayfirst=True, it can still parse a date with day not first. Here it is just that not the same format is applied to all elements of the list. But the underlying reason is the same, namely that dateutil is not strict about its parsing, and only has a precendence behaviour (an order of formats to try).
(the upside of this is that you can still parse lists of dates with messy formats that vary)

But maybe this could be better stated in the docs, that if you want strict parsing, you should use format.

@hayd
Copy link
Contributor

hayd commented Jun 5, 2014

This is mentioned in the docs: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#converting-to-timestamps

I agree, it sucks but solution is non-trivial, I've looked into fixing this part of dateutil in the past (so many if blocks) alternatively it has been discussed using another (C) library for date parsing (not sure issue number but I think that one was Wes). Would love to see this solved!

Perhaps it would mitigate some of this to use default dayfirst based on your locale? :s

closing as dupe of #3341

@hayd hayd closed this as completed Jun 5, 2014
@jorisvandenbossche
Copy link
Member

@hayd It is not really mentioned (although shown in the example) that you can have mixed formats and so each date is parsed individually what would seem best, and if you want uniform parsing for full column you should pass format.

Is it correct what I am saying? I can add a clarification.

@hayd
Copy link
Contributor

hayd commented Jun 5, 2014

I was linking to the Warning and example, it Notes format just below. Do you mean adding an example like ['1/12/2012', '1st Dec 2012'] ?

I've actually come up with in spreadsheets where an inconsistent format was used (potentially with inconsistent datefirsts - but best to ignore that!) here format doesn't really help.

@jorisvandenbossche
Copy link
Member

No, the example has already a mixed format: ['Jul 31, 2009', '2010-01-10']. So that is already in there.
But maybe more explicitely state that being able to parse such mixed format is a 'feature', but can have side effects (example in this issue) and if you want to impose a strict uniform parsing you should use format (as if this is not specified you have no guarantee that all your dates are parsed following the same format).

Do I make myself more clear now?

@hayd
Copy link
Contributor

hayd commented Jun 5, 2014

Not quite sure I understand what you're saying!

The lack of strictness is definitely a bug, you should be able to parse mixed formats provided they are dayfirst (and potentially ambiguous*). I've looked at this before: the dateutil code is IMO super hairy, although IIRC there is some flag for dayfirst which looked like it was supposed to control this (but it falls through when it shouldn't), I couldn't understand how it worked (or why it didn't), but potentially this could be fixed...

__That is, 1st Feb and Feb 1st should both parse in either (as they're not ambiguous) but 1/14 should not parse if dayfirst (as it's ambiguously/incorrectly ordered).*

@jorisvandenbossche
Copy link
Member

Hmm, it can be difficult to put it in words sometimes .. :-) Another attempt:

  • With dayfirst, I think we completely understand each other. When dayfirst=True it will first try to parse with day-first, but when this fails it falls back to non-dayfirst although you specified it. Consequence: the dayfirst keyword arg is not strict and will not fail if you encounter an invalid date (invalid for a dayfirst date).
  • However, in this issue, it was actually not about using the dayfirst=... But about the fact that if you have eg ['22-07-2014', '11-07-2014'], the first will be parse to 2014-07-22 (so in practice with dayfirst), but the second one to 2014-11-07. So in this case, the dates will be parsed using to different 'schemes', although they are in the same array passed to to_datetime. This can be confusing (as the OP understandably argued). Of course in this case, it could be solved with passing dayfirst=True, but still it is confusing when you are not aware of that kwarg and see that it does parse the first date correctly. So that is what I mean with "if you want to impose a strict uniform parsing you should use format", as we can never guarantee that with the dateutil parser.
  • And actually the original issue asked even something else, namely: if one of the dates is parsed as dayfirst (deduced from the date itself, without passing dayfirst=True), then all dates in the array should be parsed like that even without passing dayfirst=True. But this is out of the question I think.

@hayd
Copy link
Contributor

hayd commented Jun 5, 2014

I still think this is 100% solved if dayfirst were strict, as it would raise here... so we're in agreement? Or do you something in mind here (without tweaking dateutil)?

ps I really like the flexible schema approach, I often see messy "date" cols.

@jorisvandenbossche
Copy link
Member

See #12585

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Timeseries
Projects
None yet
Development

No branches or pull requests

4 participants