Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determine datetime_format from more than one element #15075

Closed
Tracked by #12585
j-abi opened this issue Jan 6, 2017 · 6 comments
Closed
Tracked by #12585

Determine datetime_format from more than one element #15075

j-abi opened this issue Jan 6, 2017 · 6 comments

Comments

@j-abi
Copy link

j-abi commented Jan 6, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd

series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000'])  # dates in European format
series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000'])  # same dates, reordered

converted_1 = pd.to_datetime(series_1, infer_datetime_format=True)
converted_2 = pd.to_datetime(series_2, infer_datetime_format=True)

print sorted(series_1) == sorted(series_2)  # True
print sorted(converted_1) == sorted(converted_2)  # False

Problem description

One would hope to obtain the same result when applying pd.to_datetime to the same series twice, but shuffled.

Expected Output

My understanding is that the format determined when setting infer_datetime_format=True is obtained from the first non-null value of the Series (see function _guess_datetime_format_for_array in tseries.tools), which explains the result above. I understand this logic in terms of optimizing the operation, and it does work as expected most of the time.
However, I feel like the example provided above is fairly generic. Ideally, the function would find the best datetime_format for the entire Series. Any ideas on how to implement this ?

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.4.0-57-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.utf8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: 1.3.7
pip: 9.0.1
setuptools: 28.8.0
Cython: None
numpy: 1.11.3
scipy: 0.18.0
statsmodels: None
xarray: None
IPython: 2.4.1
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.5.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Jan 6, 2017

xref #3341

you need to be explicit. with ambiguous dayfirst dates.

In [19]: series_1 = pd.Series(['13/03/2000', '01/03/2000', '02/03/2000'])  # dates in European format
    ...: series_2 = pd.Series(['01/03/2000', '13/03/2000', '02/03/2000'])  # same dates, reordered
    ...:
    ...: converted_1 = pd.to_datetime(series_1, dayfirst=True)
    ...: converted_2 = pd.to_datetime(series_2, dayfirst=True)
    ...:
    ...:

In [20]: converted_1
Out[20]:
0   2000-03-13
1   2000-03-01
2   2000-03-02
dtype: datetime64[ns]

In [21]: converted_2
Out[21]:
0   2000-03-01
1   2000-03-13
2   2000-03-02
dtype: datetime64[ns]

@jreback jreback closed this as completed Jan 6, 2017
@jreback jreback added this to the No action milestone Jan 6, 2017
@j-abi
Copy link
Author

j-abi commented Jan 6, 2017

I understand that dayfirst=True solves my issue ... if I already know whether my dates are in European format.
The point of this feature request was not to report a bug but, more generally, whether we would be able to determine this automatically based on the input Series: the dates individually are ambiguous, but the Series as a whole is not.
Perhaps this is not feasible within reasonable computation cost, but as far as I am concerned it would be a big improvement.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2017

see this issue: #12585

we have explicity discussed this. its is just waiting for someone to come along and do it. its not very difficult actually, pull -requests are welcome.

@jorisvandenbossche
Copy link
Member

@jreback It is not completely related to issue #12585.
I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585

But here, there are also other issues:

  • what @abiteboul asks: can the inferring of the datetime be based on more than the first value? (so the format inference would not fail in this case). Should look it up in the original issues/PRs discussing this feature, but probably performance is a good reason to not do this.
  • The failure to infer the datetime format is silent in this case, and for this reason the inconsistent parsing is maybe even more suprising, because the user thinks a format has been inferred.

@jreback
Copy link
Contributor

jreback commented Jan 6, 2017

yeah I suppose the format could be guess better here, but (maybe take up to three elements) and see if they match. BUT you still can contrive an example where this fails, so I would not be in favor of changing this, rather having the parser raise if the day/year first are violated, which is #12585

@j-abi
Copy link
Author

j-abi commented Jan 6, 2017

I suspect that in pd.to_datetime(series_2, infer_datetime_format=True), the inferring of the format fails (or at least, you get an error with the inferred format based on the first date), and as a result it falls back to default pd.to_datetime(series_2) parsing with dateutil, and this has indeed the known inconsistent parsing -> #12585

That is correct, I tried checking with timeit and it does seem that with the second series the inferring fails (on bigger series than in my example, setting infer_datetime_format=True leads to a ~50x speedup for the first series, and no speedup for the second).
Since this option is mainly there for speedup, it probably makes sense not to infer from more than the first value, and maybe some kind of warning would be appropriate when the inferring fails.

If we agree on this point, then I guess my issue is indeed basically the same as #12585. Sorry for mixing up two different problems, and thanks for your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants