Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pd.to_datetime/pd.DatetimeIndex unexpected behavior #12501

Closed
marcomayer opened this issue Mar 1, 2016 · 6 comments
Closed

pd.to_datetime/pd.DatetimeIndex unexpected behavior #12501

marcomayer opened this issue Mar 1, 2016 · 6 comments
Labels
Duplicate Report Duplicate issue or pull request Timeseries

Comments

@marcomayer
Copy link

I don't know if this is a bug or not (I guess not), but I think this behavior might be very confusing to many users who like me until now might not be aware of this.

When converting a string column to Date using pd.to_datetime or pd.DatetimeIndex the way the date is interpreted might change from row to row. I experienced this today where March 1st was interpreted as January 3rd. Probably pandas does this simply on a per row set considering only one sample at a time. One way to improve this would be to first go through the whole

At least I'd suggest to implement a warning if none of the format/dayfirst parameters are used.

Here's a screenshot to clarify:

http://snag.gy/mXT7g.jpg

output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.1.final.0
python-bits: 64
OS: Darwin
OS-release: 15.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: de_DE.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.0.2
setuptools: 19.6.2
Cython: 0.23.4
numpy: 1.10.4
scipy: 0.17.0
statsmodels: 0.6.1
IPython: 4.1.1
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.5.0
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: None
psycopg2: None
Jinja2: 2.8

@jreback
Copy link
Contributor

jreback commented Mar 1, 2016

pls show a copy-pastable example, not a screen shot.

@marcomayer
Copy link
Author

Sure, you can use this:


data = pd.Series(['29.02.2016 03:30:00', '29.02.2016 04:00:00', '01.03.2016 03:30:00', '01.03.2016 04:00:00'])

pd.to_datetime(data)

will make difference choices regarding day or month first:


0   2016-02-29 03:30:00
1   2016-02-29 04:00:00
2   2016-01-03 03:30:00
3   2016-01-03 04:00:00
dtype: datetime64[ns]

@jorisvandenbossche
Copy link
Member

This is a known issue with to_datetime, which is caused by the (too much) flexibility of dateutil. You can 'solve' it by providing a format argument.

See also #7348

@jreback
Copy link
Contributor

jreback commented Mar 1, 2016

Well you have an option to control this, dayfirst. Having mixed day-first, not day-first data will require a pre-processing step.

In [18]: pd.to_datetime(data, dayfirst=True)
Out[18]: 
0   2016-02-29 03:30:00
1   2016-02-29 04:00:00
2   2016-03-01 03:30:00
3   2016-03-01 04:00:00
dtype: datetime64[ns]

In [19]: pd.to_datetime(data, dayfirst=False)
Out[19]: 
0   2016-02-29 03:30:00
1   2016-02-29 04:00:00
2   2016-01-03 03:30:00
3   2016-01-03 04:00:00
dtype: datetime64[ns]

@jorisvandenbossche
Copy link
Member

@marcomayer But I agree that this can be very confusing!

Note that using dayfirst is also not strict: #3341

@marcomayer
Copy link
Author

yes, a warning or so might be something to consider I think.

The way I work is that I read in millions of records, then convert the dates etc. and usually just do a head() to check if things are fine. My guess is that others do the same thing...

In a perfect world, it would check if there is any ambiguity and if so at least warn or just stop and require format.

@jorisvandenbossche jorisvandenbossche added the Duplicate Report Duplicate issue or pull request label Mar 10, 2016
@jorisvandenbossche jorisvandenbossche added this to the No action milestone Mar 10, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Duplicate Report Duplicate issue or pull request Timeseries
Projects
None yet
Development

No branches or pull requests

3 participants