ENH: Try both dayfirst and monthfirst in pd.to_datetime #52508

LeoGrin · 2023-04-07T10:03:55Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

If you run pd.to_datetime on the following Series:

    "11-12-2029",
    "02-12-2012",
    "11-09-2012",
    "13-02-2000",
    "10-11-2001"

pandas (>= 2.0) infers the datetime format from the first non-missing example (%m-%d-%Y), tries to apply that format to the whole Series, fails on 13-02-2000, and raises an error (before version 2.0, it would silently parse the rows with mixed formats). I wish pandas could infer the right format for such a Series, where only one format works for all rows.
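
To make the failure concrete, here is a minimal sketch that uses the standard library's datetime.strptime as a stand-in for pandas' parser (an assumption for illustration only; pandas' internal parsing is different). The helper name failing_rows is hypothetical. It checks each of the two candidate formats against every row of the example above.

```python
from datetime import datetime

dates = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]


def failing_rows(values, fmt):
    """Return the values that cannot be parsed with fmt (hypothetical helper)."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


# Month-first fits the first row, but there is no month 13 in the fourth row.
print(failing_rows(dates, "%m-%d-%Y"))  # ['13-02-2000']
# Day-first parses every row.
print(failing_rows(dates, "%d-%m-%Y"))  # []
```

Only the day-first format is consistent with all five rows, which is the format the request would like to see inferred.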

Feature Description

Pseudo code

If dayfirst=True and dayfirst=False give different formats from guess_datetime_format on the first non-missing example (i.e. both work on that example):
Try both formats on the Series (probably on a random subset for speed).
If exactly one works for all rows, return that format.
If both work, trust the dayfirst parameter (and maybe raise a warning).
If neither works and errors="raise", raise an error. If errors="coerce" or errors="ignore", one could either trust the dayfirst parameter or pick the dayfirst value that leaves the fewest unparsed values.
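
The steps above can be sketched roughly as follows. This is not pandas code: it uses datetime.strptime from the standard library instead of guess_datetime_format, hard-codes the two candidate formats for "NN-NN-YYYY" strings, and only covers the errors="raise" branch. The names parses_all and choose_format, and the seed parameter, are made up for illustration.

```python
import random
import warnings
from datetime import datetime


def parses_all(values, fmt):
    """True if every value parses strictly with fmt (hypothetical helper)."""
    try:
        for value in values:
            datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


def choose_format(values, dayfirst=False, sample_size=100, seed=0):
    """Pick day-first or month-first for 'NN-NN-YYYY' strings (illustrative only)."""
    day_fmt, month_fmt = "%d-%m-%Y", "%m-%d-%Y"
    non_null = [v for v in values if v is not None]
    # Check a random subset for speed; small inputs are checked in full.
    if len(non_null) > sample_size:
        non_null = random.Random(seed).sample(non_null, sample_size)
    day_ok = parses_all(non_null, day_fmt)
    month_ok = parses_all(non_null, month_fmt)
    if day_ok and month_ok:
        # Ambiguous data: trust the dayfirst parameter, but tell the user.
        warnings.warn("Both formats fit the sample; using the dayfirst parameter.")
        return day_fmt if dayfirst else month_fmt
    if day_ok:
        return day_fmt
    if month_ok:
        return month_fmt
    raise ValueError("Neither day-first nor month-first parses every sampled row.")
```

On the Series from the problem description, choose_format returns "%d-%m-%Y" even with the default dayfirst=False, because month-first fails on 13-02-2000.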

Implementation

Change the function _guess_datetime_format_for_array (in pandas.core.tools.datetimes) so that it tries both dayfirst=True and dayfirst=False on the first non-null example. In the same function, if the two options give different formats, try array_strptime with both formats on a random subset of the array (of size 100?) with strict error handling, and check whether one of the attempts succeeds.

Alternative Solutions

I don't know.

Additional Context

No response


MarcoGorelli · 2023-04-07T11:04:57Z

thanks @LeoGrin for the suggestion

yeah I think we could improve the inference, for example by trying the first n non-null rows or taking a random sample, and then taking a majority vote

would you be interested in trying this out and submitting a PR?
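
For reference, the majority-vote idea mentioned in this comment could look something like the sketch below. It is only an illustration under stated assumptions: it uses the standard library's datetime.strptime rather than pandas internals, the caller supplies the candidate formats, and the function name vote_format is hypothetical.

```python
from datetime import datetime


def vote_format(values, candidates, n=100):
    """Return the candidate that parses the most of the first n non-null values."""
    sample = [v for v in values if v is not None][:n]
    votes = {fmt: 0 for fmt in candidates}
    for value in sample:
        for fmt in candidates:
            try:
                datetime.strptime(value, fmt)
            except ValueError:
                continue  # this format does not fit this row
            votes[fmt] += 1
    # max() keeps the first maximal item, so ties go to the earlier candidate.
    return max(candidates, key=lambda fmt: votes[fmt])


dates = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]
print(vote_format(dates, ["%m-%d-%Y", "%d-%m-%Y"]))  # %d-%m-%Y
```

Unlike a strict "one format must fit every row" check, a vote still returns an answer when a few rows are malformed, which may or may not be desirable depending on the errors setting.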

LeoGrin · 2023-04-07T14:21:52Z

Thanks for the feedback! Yes I would :)

LeoGrin · 2023-04-07T14:21:59Z

take

LeoGrin added the Enhancement and Needs Triage labels Apr 7, 2023

MarcoGorelli added the Timeseries label and removed the Needs Triage label Apr 7, 2023

github-actions bot assigned LeoGrin Apr 7, 2023

LeoGrin mentioned this issue Apr 12, 2023

ENH: Infer best datetime format from a sample #52626

Closed

