ENH: Try both dayfirst and monthfirst in pd.to_datetime #52508

Open
1 of 3 tasks
LeoGrin opened this issue Apr 7, 2023 · 3 comments

LeoGrin commented Apr 7, 2023

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

If you run pd.to_datetime on the following Series:

    "11-12-2029",
    "02-12-2012",
    "11-09-2012",
    "13-02-2000",
    "10-11-2001"

pandas (>= 2.0) will infer the datetime format from the first non-missing example (%m-%d-%Y), try to apply this format to the whole Series, fail on 13-02-2000, and raise an error (before version 2.0, this would silently parse the rows with mixed formats). I wish pandas could infer the right format for such a Series, where only one format works for all rows.

Feature Description

Pseudo code

If guess_datetime_format returns different formats for dayfirst=True and dayfirst=False on the first non-missing example (i.e. the example is ambiguous and both interpretations parse):
  • Try both formats on the Series (probably on a random subset for speed).
  • If exactly one works for all rows, return this format.
  • If both work, trust the dayfirst parameter (and maybe raise a warning).
  • If neither works and errors="raise", raise an error. If errors="coerce" or errors="ignore", one could either trust the dayfirst parameter or pick the dayfirst value that leaves the fewest unparsed values.
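
The pseudo code above can be sketched with the standard library alone. This is only an illustration of the decision logic, not pandas code: the helper names `parses_all` and `pick_format` are hypothetical, the two candidate formats are hard-coded for the `NN-NN-YYYY` layout used in the example, and only the errors="raise" branch is covered.

```python
from datetime import datetime

# Assumed candidate formats for the ambiguous "NN-NN-YYYY" layout.
MONTH_FIRST = "%m-%d-%Y"
DAY_FIRST = "%d-%m-%Y"


def parses_all(values, fmt):
    """Return True if every non-missing string in `values` matches `fmt`."""
    for value in values:
        if value is None:
            continue
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            return False
    return True


def pick_format(values, dayfirst=False):
    """Hypothetical helper: choose a format following the pseudo code."""
    if dayfirst:
        preferred, other = DAY_FIRST, MONTH_FIRST
    else:
        preferred, other = MONTH_FIRST, DAY_FIRST
    if parses_all(values, preferred):
        # Also covers "both work": the dayfirst parameter is trusted.
        return preferred
    if parses_all(values, other):
        return other
    # Corresponds to errors="raise" in the proposal.
    raise ValueError("neither day-first nor month-first parses every value")


dates = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]
chosen = pick_format(dates)  # month-first fails on "13-02-2000"
```

With the Series from the problem description, month-first is rejected because 13 is not a valid month, so the day-first format is the only one left.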

Implementation

Change the function _guess_datetime_format_for_array (in pandas.core.tools.datetimes) so that it tries both dayfirst=True and dayfirst=False on the first non-null example. In the same function, if the two options give different formats, try array_strptime with both formats on a random subset of the array (100 elements?) with strict error handling, and check that at least one of the attempts doesn't fail.
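
A rough sketch of the random-subset check, assuming `datetime.strptime` as a stand-in for array_strptime and a hypothetical helper name `formats_surviving_sample`; the real pandas internals are not used here.

```python
import random
from datetime import datetime


def formats_surviving_sample(values, formats, k=100, seed=None):
    """Return the formats that strictly parse every string in a random subset.

    Hypothetical helper: `k` is the subset size (100 in the proposal).
    """
    non_null = [v for v in values if v is not None]
    rng = random.Random(seed)
    subset = rng.sample(non_null, min(k, len(non_null)))
    surviving = []
    for fmt in formats:
        try:
            for value in subset:
                datetime.strptime(value, fmt)  # strict: any mismatch raises
        except ValueError:
            continue
        surviving.append(fmt)
    return surviving


dates = ["11-12-2029", "02-12-2012", "11-09-2012", "13-02-2000", "10-11-2001"]
surviving = formats_surviving_sample(dates, ["%m-%d-%Y", "%d-%m-%Y"])
```

One caveat with sampling: if the subset happens to miss every disambiguating row (such as 13-02-2000), both formats survive, so the result on the subset does not guarantee that the chosen format parses the full array.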

Alternative Solutions

I don't know.

Additional Context

No response

LeoGrin added the Enhancement and Needs Triage labels Apr 7, 2023
@MarcoGorelli
Member

thanks @LeoGrin for the suggestion

yeah I think we could improve the inference, for example by trying the first n non-null rows or taking a random sample, and then taking a majority vote
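
One possible reading of the majority-vote idea, sketched with the standard library: count how many of the first n non-null rows each candidate format parses and keep the winner. The names `count_matches` and `vote_format` are hypothetical, and the candidate formats are assumed to be the two from the example above.

```python
from datetime import datetime


def count_matches(values, fmt):
    """Count how many strings in `values` parse with `fmt`."""
    count = 0
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue
        count += 1
    return count


def vote_format(values, n=100, dayfirst=False):
    """Hypothetical helper: pick the format matching most of the first n non-null rows."""
    sample = [v for v in values if v is not None][:n]
    if dayfirst:
        candidates = ["%d-%m-%Y", "%m-%d-%Y"]
    else:
        candidates = ["%m-%d-%Y", "%d-%m-%Y"]
    # max() returns the first maximal item, so ties go to the dayfirst preference.
    return max(candidates, key=lambda fmt: count_matches(sample, fmt))


winner = vote_format([None, "11-12-2029", "13-02-2000", "10-11-2001"])
```

Unlike a strict all-rows check, a vote still returns a format when a few rows are malformed, which may or may not be desirable with errors="raise".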

would you be interested in trying this out and submitting a PR?

MarcoGorelli added the Timeseries label and removed the Needs Triage label Apr 7, 2023
@LeoGrin
Author

LeoGrin commented Apr 7, 2023

Thanks for the feedback! Yes I would :)

@LeoGrin
Author

LeoGrin commented Apr 7, 2023

take
