BUG: UserWarning about lacking infer_datetime_format with pd.to_datetime #46210

Closed
3 tasks done
moojen opened this issue Mar 3, 2022 · 12 comments · Fixed by #47528
Labels: Bug, Datetime, Warnings

Comments

@moojen

moojen commented Mar 3, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True, utc=True, errors='ignore')

Issue Description

When I run this code, I get the following warning:

"UserWarning: Parsing '15/09/1979' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing."

Which is strange, because I have already set infer_datetime_format=True.

Expected Behavior

You would expect no warning at all, since "infer_datetime_format=True" is already provided as an argument.

Installed Versions

pymysql : None
html5lib : None
lxml.etree : 4.8.0
xlsxwriter : None
feather : None
blosc : None
sphinx : None
hypothesis : None
pytest : 6.2.5
Cython : None
setuptools : 58.1.0
pip : 22.0.3
dateutil : 2.8.2
pytz : 2021.3
numpy : 1.22.2
pandas : 1.4.1
moojen added the Bug and Needs Triage labels on Mar 3, 2022
moojen changed the title from "BUG:" to "BUG: UserWarning about lacking infer_datetime_format with pd.to_datetime" on Mar 3, 2022
@rhshadrach
Member

Can you provide a small DataFrame that reproduces this warning? I attempted with

df = pd.DataFrame({'date': ['15/09/1979']})

on main and did not get a warning.

rhshadrach added the datetime.date and Needs Info labels and removed the Needs Triage label on Mar 5, 2022
mroeschke added the Datetime label and removed the datetime.date label on Mar 6, 2022
@MarcoGorelli
Member

closing for now, will reopen if you provide a reproducible example @moojen

@benoit9126
Contributor

benoit9126 commented Apr 19, 2022

@MarcoGorelli Here is a small example to reproduce this warning:

pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)

It leads to this output

<input>:1: UserWarning: Parsing '31/05/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
<input>:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

The problem is the following: the infer_datetime_format option takes the first non-NaN element in the array and uses it to guess the date format for the whole array. Here, with "01/01/2000", the guessed format is MM/DD/YYYY.

Later in the conversion function, there is an internal error because '31/05/2000' cannot be converted using the guessed format, so the format is changed locally (per element, not for the entire array) to the correct one, DD/MM/YYYY. This leads to one warning per unique date in the array for which day and month must be swapped to convert the date.

As you can see in my example, the last date in the array, "01/02/2000", is converted using the guessed format into '2000-01-02' (without a warning), whereas I expected 1 February 2000.
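The behaviour described above can be imitated with the standard library alone. This is a minimal sketch, not pandas' actual implementation: `guess_format` and `parse_with_fallback` are hypothetical helpers, and only the two slash-separated formats are considered.

```python
import warnings
from datetime import datetime

MONTH_FIRST = "%m/%d/%Y"
DAY_FIRST = "%d/%m/%Y"


def guess_format(sample):
    """Hypothetical guesser: prefer month-first, as with dayfirst=False."""
    for fmt in (MONTH_FIRST, DAY_FIRST):
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    raise ValueError(f"no known format matches {sample!r}")


def parse_with_fallback(values):
    """Guess once from the first value, then fall back per element."""
    guessed = guess_format(values[0])
    fallback = DAY_FIRST if guessed == MONTH_FIRST else MONTH_FIRST
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, guessed))
        except ValueError:
            # The guess does not fit this element: swap day and month
            # for this element only, and warn about it.
            warnings.warn(f"Parsing {value!r} with {fallback}", UserWarning)
            parsed.append(datetime.strptime(value, fallback))
    return parsed


dates = parse_with_fallback(["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"])
print([d.date().isoformat() for d in dates])
```

In this sketch, the last value fits the guessed month-first format, so it is parsed as 2 January without any warning, which mirrors the surprise described above.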

The user warning is annoying because it is emitted per element in the array. With a large array, it leads to thousands of warning lines: one for each value that can only be parsed by abandoning the initial guess.
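One plausible contributing factor to the per-value count: under Python's "default" warning filter action, a warning raised repeatedly from the same line is shown once per distinct message text, and this message embeds the date string. The sketch below uses only the standard library; `noisy_parse` is a hypothetical stand-in that parses nothing.

```python
import warnings


def noisy_parse(values):
    """Hypothetical stand-in: warn for every value, embedding it in the text."""
    for value in values:
        warnings.warn(f"Parsing {value!r} in DD/MM/YYYY format.", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("default")  # show each (text, category, line) once
    noisy_parse(["31/05/2000", "31/05/2000", "31/05/2001"])

# The repeated value is deduplicated; the distinct one is not.
print(len(caught))
```

A message with fixed text would be deduplicated by the same filter, which is one reason to keep per-value detail out of the warning text.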

This issue seems to be related to #12585

@rhshadrach
Member

Agreed that the warning should only be emitted once, and the message saying to pass infer_datetime_format=True is confusing. The trickier issue of how to infer better seems to be well captured by #12585, so I'd recommend scoping this issue to just the warning message and count.
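Emitting the warning once per call could look roughly like the sketch below. It is an illustration with the standard library, not the change that was eventually merged into pandas; `parse_warn_once` and its default formats are assumptions.

```python
import warnings
from datetime import datetime


def parse_warn_once(values, guessed="%m/%d/%Y", fallback="%d/%m/%Y"):
    """Parse with a guessed format; warn at most once if any fallback occurs."""
    used_fallback = False
    parsed = []
    for value in values:
        try:
            parsed.append(datetime.strptime(value, guessed))
        except ValueError:
            parsed.append(datetime.strptime(value, fallback))
            used_fallback = True
    if used_fallback:
        # A fixed message with no per-value detail, raised after the loop.
        warnings.warn(
            "Some values did not match the guessed format; "
            "specify a format to ensure consistent parsing.",
            UserWarning,
            stacklevel=2,
        )
    return parsed
```

Recording the fallback in a flag and warning after the loop keeps the count at one per call, however many elements needed the fallback.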

rhshadrach reopened this on Apr 20, 2022
rhshadrach added the Warnings label and removed the Needs Info label on Apr 20, 2022
rhshadrach added this to the Contributions Welcome milestone on Apr 20, 2022
@DrNickBailey

infer_datetime_format seems flawed. Is it assuming the silly American MM-DD-YYYY format? (why is it not taking my locality into account?).

But I'm not sure what it's doing as when I tried with the above sample and just switched the order of the first two dates this happens:

pd.to_datetime(['15/02/2006','01/01/2000','31/05/2001','01/02/2000'], infer_datetime_format=True)

DatetimeIndex(['2006-02-15', '2000-01-01', '2001-05-31', '2000-02-01'], dtype='datetime64[ns]', freq=None)

but.....

pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)

/var/folders/95/4q5tdld13b72xd3t35759k140000gq/T/ipykernel_39130/1894679187.py:1: UserWarning: Parsing '15/02/2006' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
/var/folders/95/4q5tdld13b72xd3t35759k140000gq/T/ipykernel_39130/1894679187.py:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
pd.to_datetime(['01/01/2000','15/02/2006','31/05/2001','01/02/2000'], infer_datetime_format=True)
DatetimeIndex(['2000-01-01', '2006-02-15', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

What's going on? It's got the right date format for the first three but has oddly switched the 4th date around?

@MarcoGorelli
Member

01/01/2000 is ambiguous (is it DD/MM/YYYY or MM/DD/YYYY?) and the format can't be inferred from it. If it was, say, 14/01/2000, then you wouldn't get the warning:

>>> pd.to_datetime(['14/01/2000', '15/01/2000'], infer_datetime_format=True)
DatetimeIndex(['2000-01-14', '2000-01-15'], dtype='datetime64[ns]', freq=None)
>>> pd.to_datetime(['11/01/2000', '15/01/2000'], infer_datetime_format=True)
<stdin>:1: UserWarning: Parsing '15/01/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
DatetimeIndex(['2000-11-01', '2000-01-15'], dtype='datetime64[ns]', freq=None)
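To see which strings are ambiguous in this sense, one can check how many candidate formats accept them. This is a standard-library sketch; `matching_formats` and the two-format candidate list are assumptions, not how pandas infers formats.

```python
from datetime import datetime

CANDIDATES = ("%m/%d/%Y", "%d/%m/%Y")


def matching_formats(text):
    """Return every candidate format that parses `text` without error."""
    matches = []
    for fmt in CANDIDATES:
        try:
            datetime.strptime(text, fmt)
        except ValueError:
            continue
        matches.append(fmt)
    return matches


for sample in ("14/01/2000", "11/01/2000", "01/01/2000"):
    print(sample, matching_formats(sample))
```

Only '14/01/2000' matches a single candidate here; the other two match both, so the first element alone cannot settle the format.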

I'm tempted to suggest changing the warning to:

Parsing '15/02/2006' in DD/MM/YYYY format. Provide format to ensure consistent parsing.

That's because infer_datetime_format isn't always able to infer the format (as the docstring says, it only tries to). And yes, the warning should also be emitted only once. I'll take another look at this if I get a chance.

> Is it assuming the silly American MM-DD-YYYY format? (why is it not taking my locality into account?).

By default, dayfirst is False, so if you pass an ambiguous string without specifying a format, it is parsed month-first (MM/DD/YYYY).
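The effect of such a flag can be sketched as a preference order that is tried first, with the other order as a fallback. `parse_ambiguous` is a hypothetical helper written with the standard library and is not the pandas or dateutil code path.

```python
from datetime import datetime


def parse_ambiguous(text, dayfirst=False):
    """Try the preferred field order first, then the other one."""
    if dayfirst:
        preferred, other = "%d/%m/%Y", "%m/%d/%Y"
    else:
        preferred, other = "%m/%d/%Y", "%d/%m/%Y"
    try:
        return datetime.strptime(text, preferred)
    except ValueError:
        return datetime.strptime(text, other)


print(parse_ambiguous("01/02/2000").date())                 # month-first reading
print(parse_ambiguous("01/02/2000", dayfirst=True).date())  # day-first reading
print(parse_ambiguous("31/05/2000").date())                 # only day-first fits
```

The flag only decides ambiguous strings; a value like '31/05/2000' ends up day-first either way, which is how mixed results can arise within one array.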

MarcoGorelli self-assigned this on Jun 28, 2022
@DrNickBailey

In your example, the wording "Parsing '15/01/2006' in DD/MM/YYYY format. Provide format to ensure consistent parsing." still makes no sense to me as a (British/international) user, as I already thought '11/01/2000' was in DD/MM/YYYY, so the “warning” is confusing.

I’d also wager that dayfirst as an argument name is unclear – what day is first? dayfirst could mean a number of things to the inexperienced pandas user. monthfirst would be more understandable to the international audience, or perhaps datetime_dayfirst. And while I'm at it, format should be datetime_format for clarity.

Sorry for my hate of MM-DD. ISO 8601 is a standard for a reason and solves all ambiguity with dates!
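For comparison, an ISO 8601 date string has a fixed year-month-day order, so no guessing is needed. A short standard-library illustration:

```python
from datetime import date, datetime

# ISO 8601 calendar dates are always year-month-day.
d = date.fromisoformat("2000-02-01")
print(d.year, d.month, d.day)

# A day-first string can be converted to ISO form once its format is known.
iso_text = datetime.strptime("01/02/2000", "%d/%m/%Y").date().isoformat()
print(iso_text)
```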

@MarcoGorelli
Member

Reckon this would be clearer?

>>> import pandas as pd
>>> pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
<stdin>:1: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently-parsed dates! Specify a format to ensure consistent parsing.
DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)

That way, the warning would be emitted only once and (I think) would be a bit clearer.
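Specifying a format means every element is parsed the same way. The sketch below shows the idea with the standard library; with pandas, the counterpart is passing the `format` argument, for example `pd.to_datetime(values, format="%d/%m/%Y")`. The choice of day-first here is an assumption about what the data means.

```python
from datetime import datetime

values = ["01/01/2000", "31/05/2000", "31/05/2001", "01/02/2000"]

# One explicit format for every element: nothing is guessed per value.
parsed = [datetime.strptime(v, "%d/%m/%Y") for v in values]
print([d.date().isoformat() for d in parsed])
```

With the format fixed, the last value is read as 1 February, consistently with the rest of the array.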

> And while I'm at it format should be datetime_format for clarity.

This'd have to go through a deprecation cycle; I'm not sure it'd be worth it.

@TheSwallowCoder

> @MarcoGorelli Here is a small example to reproduce this warning:
>
> pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)
>
> It leads to this output
>
> <input>:1: UserWarning: Parsing '31/05/2000' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
> <input>:1: UserWarning: Parsing '31/05/2001' in DD/MM/YYYY format. Provide format or specify infer_datetime_format=True for consistent parsing.
> DatetimeIndex(['2000-01-01', '2000-05-31', '2001-05-31', '2000-01-02'], dtype='datetime64[ns]', freq=None)
>
> The problem is the following: the infer_datetime_format option takes the first non-NaN element in the array and uses it to guess the date format for the whole array. Here, with "01/01/2000", the guessed format is MM/DD/YYYY.
>
> Later in the conversion function, there is an internal error because '31/05/2000' cannot be converted using the guessed format, so the format is changed locally (per element, not for the entire array) to the correct one, DD/MM/YYYY. This leads to one warning per unique date in the array for which day and month must be swapped to convert the date.
>
> As you can see in my example, the last date in the array, "01/02/2000", is converted using the guessed format into '2000-01-02' (without a warning), whereas I expected 1 February 2000.
>
> The user warning is annoying because it is emitted per element in the array. With a large array, it leads to thousands of warning lines: one for each value that can only be parsed by abandoning the initial guess.
>
> This issue seems to be related to #12585

Solution

@MarcoGorelli
Member

@TheSwallowCoder can you check on upstream/main (or on the pandas 1.5.0 release candidate)?

@TheSwallowCoder

@MarcoGorelli
[Screenshot: pandas_1 5rc]

@MarcoGorelli
Member

MarcoGorelli commented Sep 10, 2022

Can't reproduce, here's the output I get from 1.5.0rc:

(.venv) marco@marco-Predator-PH315-52:~/tmp$ cat t.py 
import pandas as pd
pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)

(.venv) marco@marco-Predator-PH315-52:~/tmp$ python -c 'import pandas; print(pandas.__version__)'
1.5.0rc0
(.venv) marco@marco-Predator-PH315-52:~/tmp$ python t.py 
t.py:2: UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
  pd.to_datetime(['01/01/2000','31/05/2000','31/05/2001', '01/02/2000'], infer_datetime_format=True)

7 participants