auto convert from string to datetime64 in iterrows. #19671

Closed
xiaoluffy opened this Issue Feb 13, 2018 · 4 comments

xiaoluffy commented Feb 13, 2018

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '0.20.1'

In [4]: pd.DataFrame({'symbol': ['M1609', 'M1701'], 'date': pd.to_datetime(['2016-09-01', '2017-01-01'])})
Out[4]:
        date symbol
0 2016-09-01  M1609
1 2017-01-01  M1701

In [5]: df = pd.DataFrame({'symbol': ['M1609', 'M1701'], 'date': pd.to_datetime(['2016-09-01', '2017-01-01'])})

In [6]: for i, row in df.iterrows():
   ...:     print(row)
   ...:
date      2016-09-01 00:00:00
symbol                  M1609
Name: 0, dtype: object
date     2017-01-01
symbol   1701-01-01
Name: 1, dtype: datetime64[ns]

Problem description

Hi, I found an automatic conversion issue when using iterrows or indexing: the string 'M1701' is converted to the date '1701-01-01', which is not supposed to happen. Is this a bug?

Thanks.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.27.3
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

Contributor

jreback commented Feb 13, 2018

We had an old issue about this, but I can't seem to find it. So

In [1]: pd.to_datetime('M1701')
Out[1]: Timestamp('1701-01-01 00:00:00')

is a legit (though maybe weird) parse by dateutil. When iterrows runs, a Series of the appropriate type is constructed for each row. Since all of the elements are combined into a single dtype, they are upcast; here they should be upcast to object dtype.

Would love for you to take a look.
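Since the unwanted parse happens when iterrows builds a Series for each row, one way to sidestep it is to iterate without creating a per-row Series at all. The following is a minimal sketch of that workaround using `DataFrame.itertuples`, reusing the frame from the report; it is not a fix for the inference itself.

```python
import pandas as pd

df = pd.DataFrame({
    'symbol': ['M1609', 'M1701'],
    'date': pd.to_datetime(['2016-09-01', '2017-01-01']),
})

# itertuples() yields namedtuples built from the column values directly,
# so no per-row Series is constructed and the strings are left alone.
symbols = []
dates = []
for row in df.itertuples(index=False):
    symbols.append(row.symbol)
    dates.append(row.date)

print(symbols)
print(dates)
```

Each column keeps its own type this way: `symbol` stays a plain string and `date` stays a Timestamp.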

jreback added this to the Next Major Release milestone Feb 13, 2018

Contributor

minggli commented Feb 14, 2018

Hope you don't mind if I jump into this issue. Since a Series is constructed for each row of the frame, the dtype is inferred case by case. As @jreback rightly put it, 'M1701' is inferred as datetime whereas 'M1609' is object. So I think the issue here is that the dtype upcast for each Series is not always consistent, as it's evaluated individually in Series > _sanitize_array > _try_cast > maybe_cast_to_datetime. Quite interesting, but it goes quite deep in terms of when to cast to datetime and when not to.

A potential fix could be to pass a numpy dtype (self.values.dtype in iterrows) when creating each Series, so that all Series (i.e. rows of the DataFrame) generated by iterrows get a consistent upcast across columns. Although numpy seems to upcast numerics (e.g. int to float), it doesn't handle datetime inference.
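The idea of passing the frame's common dtype to each row can be sketched outside of pandas internals. The helper name `iter_rows_fixed_dtype` below is hypothetical and is not the actual patch; it only assumes that `frame.values` is an object array for mixed columns and that an explicit `dtype` on the Series constructor skips datetime inference.

```python
import pandas as pd


def iter_rows_fixed_dtype(frame):
    """Hypothetical stand-in for iterrows that fixes the row dtype up front."""
    values = frame.values  # object-dtype 2D array when columns are mixed
    columns = frame.columns
    for label, row_values in zip(frame.index, values):
        # Passing dtype explicitly means the constructor does not try to
        # infer datetimes from the individual row.
        yield label, pd.Series(row_values, index=columns, name=label,
                               dtype=values.dtype)


df = pd.DataFrame({
    'symbol': ['M1609', 'M1701'],
    'date': pd.to_datetime(['2016-09-01', '2017-01-01']),
})

rows = [row for _, row in iter_rows_fixed_dtype(df)]
for row in rows:
    print(row.dtype, row['symbol'])
```

Every row then comes back with the same object dtype, and 'M1701' stays a string.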

Contributor

jreback commented Feb 14, 2018

Here's the basic problem. We do inference on mixed strings & datetimes upon construction. This allows one to pass 'NaT' and/or mixed datetimelikes (e.g. you can also pass datetime.date, datetime.datetime and np.datetime64). However, we should make this much more strict, so that we don't fully parse strings. It still needs to hit to_datetime (which is where the conversion happens).

In [1]: pd.Series(['M1701', pd.Timestamp('20130101')])
Out[1]: 
0   1701-01-01
1   2013-01-01
dtype: datetime64[ns]

So @minggli, the logic is actually in to_datetime(arr, errors='raise'). I think we need to expose require_iso8601 and pass it through as True.
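To illustrate what stricter parsing buys, here is a rough sketch using only the public `format` and `errors` arguments of `pd.to_datetime`. It does not use `require_iso8601`, and the `'%Y-%m-%d'` format is an assumption standing in for "ISO 8601 only".

```python
import pandas as pd

raw = pd.Series(['2017-01-01', 'M1701'])

# With an explicit format, a string that does not match is not handed to a
# lenient parser; errors='coerce' turns it into NaT instead of a bogus date.
strict = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')

print(strict)
print(strict.isna().tolist())
```

Under this stricter rule 'M1701' is rejected rather than read as the year 1701.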

Contributor

minggli commented Feb 19, 2018

Sorry for the delay. I raised a PR exposing require_iso8601 in DataFrame.iterrows(), to_datetime, and the Series APIs.

jreback modified the milestones: Next Major Release, 0.23.0 Feb 19, 2018
