BUG: coercion of mixed dt-aware data in Series constructor #13051

Closed
frankcleary opened this issue May 1, 2016 · 4 comments

@frankcleary
Contributor

commented May 1, 2016

I ran into an issue where the behavior of .apply() changed from 0.16 to 0.17, causing different results on tz-aware data. Extracting the hour of day from datetimes gives different results for Series(x).apply(func) and func(x). Below is a minimal example of the issue on 0.17. The behavior is the same on 0.18, and different again (though still not matching map) on master, also shown below.

On 0.17.1 and 0.18.0:

>>> import pandas as pd
>>> def hour_of_day(dt):
...     return dt.hour
...
>>> dt = pd.to_datetime(1462068217, unit='s')
>>> dt_localized = dt.tz_localize('UTC').tz_convert('US/Pacific')
>>> dt_list = [dt, dt_localized]
>>> apply_series = pd.Series(dt_list).apply(hour_of_day)
>>> map_series = pd.Series(map(hour_of_day, dt_list))
>>> print dt_list
[Timestamp('2016-05-01 02:03:37'), Timestamp('2016-04-30 19:03:37-0700', tz='US/Pacific')]
>>> print apply_series
0    2
1    2
dtype: int64
>>> print map_series
0     2
1    19
dtype: int64
>>> print apply_series - map_series
0     0
1   -17
dtype: int64
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.24
numpy: 1.11.0
scipy: 0.17.0
statsmodels: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: 0.4.1
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.5
lxml: 3.6.0
bs4: 4.3.2
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: 0.6.7.None
psycopg2: None
Jinja2: None

On 0.16.2 (what I expected):

# Same setup as above
>>> print apply_series
0     2
1    19
dtype: int64
>>> print map_series
0     2
1    19
dtype: int64
>>> print apply_series - map_series
0    0
1    0
dtype: int64
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.16.2
nose: None
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.2
pytz: 2016.3
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None

On master:

# Same setup as above
>>> print apply_series
0    19
1    19
dtype: int32
>>> print map_series
0     2
1    19
dtype: int64
>>> print apply_series - map_series
0    17
1     0
dtype: int64
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 05e734ab171be0fda838c6b12839c38fa588da2c
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 15.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0+203.g05e734a
nose: None
pip: 8.1.1
setuptools: 20.7.0
Cython: None
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

Expected Output

I would expect the output to be [2, 19], as in 0.16, and matching map(f, data).
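
The expectation above can be written as a small check. This is only a minimal sketch: `apply_matches_map` is a hypothetical helper name, not a pandas function, and its result for the mixed list depends on how the installed pandas version constructs the Series (on the 0.17.1, 0.18.0, and master builds reported above it would be False).

```python
import pandas as pd


def hour_of_day(ts):
    return ts.hour


def apply_matches_map(values, func):
    """Hypothetical helper: True if Series(values).apply(func) agrees
    element-wise with calling func on each value directly."""
    applied = list(pd.Series(values).apply(func))
    mapped = [func(v) for v in values]
    return applied == mapped


dt = pd.to_datetime(1462068217, unit="s")                      # tz-naive
dt_localized = dt.tz_localize("UTC").tz_convert("US/Pacific")  # tz-aware

# The outcome depends on how the Series constructor handles the mixed list.
print(apply_matches_map([dt, dt_localized], hour_of_day))
```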

@jreback

Contributor

commented May 1, 2016

So this is correct if you construct the Series correctly.

In [17]: dt_list
Out[17]: 
[Timestamp('2016-05-01 02:03:37'),
 Timestamp('2016-04-30 19:03:37-0700', tz='US/Pacific')]

In [18]: Series(dt_list,dtype=object)
Out[18]: 
0          2016-05-01 02:03:37
1    2016-04-30 19:03:37-07:00
dtype: object

In [19]: Series(dt_list,dtype=object).apply(lambda x: x.hour)
Out[19]: 
0     2
1    19
dtype: int64

I suppose this is a bug: we should not be coercing a mix of tz-naive and tz-aware values.

In [20]: Series(dt_list)
Out[20]: 
0   2016-04-30 19:03:37-07:00
1   2016-04-30 19:03:37-07:00
dtype: datetime64[ns, US/Pacific]

However, as a user you have to be aware that putting tz-aware and tz-naive data together is basically meaningless.
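
Two ways to avoid relying on constructor coercion, as a sketch under stated assumptions: either keep the mixed values as Python objects with `dtype=object`, or decide what the naive values mean and convert everything to one time zone first. The helper `to_pacific` and its `assumed_tz="UTC"` default are hypothetical choices made for this example; whether naive values really represent UTC depends on where the data came from.

```python
import pandas as pd

dt = pd.to_datetime(1462068217, unit="s")                      # tz-naive
dt_localized = dt.tz_localize("UTC").tz_convert("US/Pacific")  # tz-aware


def to_pacific(ts, assumed_tz="UTC"):
    """Hypothetical helper: treat naive timestamps as `assumed_tz`,
    then express every timestamp in US/Pacific."""
    if ts.tzinfo is None:
        ts = ts.tz_localize(assumed_tz)
    return ts.tz_convert("US/Pacific")


# Option 1: keep the mixed values as Python objects (no coercion).
mixed_hours = pd.Series([dt, dt_localized], dtype=object).apply(lambda ts: ts.hour)

# Option 2: make the data homogeneous before building the Series.
uniform = pd.Series([to_pacific(ts) for ts in [dt, dt_localized]])
uniform_hours = uniform.dt.hour

print(mixed_hours.tolist())    # [2, 19]
print(uniform_hours.tolist())  # [19, 19]
```

Option 2 gives a single tz-aware dtype, so both entries report the same local hour because they are the same instant.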

@jreback

Contributor

commented May 1, 2016

@jreback jreback changed the title Series(dt_data).apply(f) != Series(map(f, dt_data)) for tz aware datetime data BUG: coercion of mixed dt-ware data in Series constructor May 1, 2016

@jreback jreback added this to the 0.18.2 milestone May 1, 2016

@frankcleary

Contributor Author

commented May 1, 2016

Good to keep in mind, and sorry my example wasn't the best. I don't think the breakage in my client code was due to mixing tz-aware and tz-naive data, though of course it is possible I wasn't using the API correctly. Here's some more testing:

In 0.18.0 this is the case:

>>> pd.Series(dt_localized).apply(hour_of_day)
0    2
dtype: int64
# Adding dtype results in expected behavior.
>>> pd.Series(dt_localized, dtype=object).apply(hour_of_day)
0    19
dtype: int64

However in 0.16 and in master it does give the result I expect:

>>> pd.Series(dt_localized).apply(hour_of_day)
0    19
dtype: int32

@jreback jreback changed the title BUG: coercion of mixed dt-ware data in Series constructor BUG: coercion of mixed dt-aware data in Series constructor May 1, 2016

@jreback

Contributor

commented May 1, 2016

@frankcleary what you are doing is quite inefficient; don't use apply except as a last resort. A Series holds a single dtype, and using object is a further inefficiency.

In [1]: s =  Series(pd.date_range('20130101',periods=3,tz='US/Pacific', freq='2H'))

In [2]: s
Out[2]: 
0   2013-01-01 00:00:00-08:00
1   2013-01-01 02:00:00-08:00
2   2013-01-01 04:00:00-08:00
dtype: datetime64[ns, US/Pacific]

In [3]: s.dt.hour
Out[3]: 
0    0
1    2
2    4
dtype: int64
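
A self-contained version of the same idea, as a rough sketch: the `.dt` accessor reads the hour in the Series' own time zone, and `.dt.tz_convert` gives the hour of the same instants in another zone, all without `apply`. The frequency is passed as a `pd.Timedelta` here only to sidestep differences in frequency-string spelling across pandas versions.

```python
import pandas as pd

s = pd.Series(
    pd.date_range(
        "2013-01-01", periods=3, tz="US/Pacific", freq=pd.Timedelta(hours=2)
    )
)

local_hours = s.dt.hour                     # hour in the Series' own time zone
utc_hours = s.dt.tz_convert("UTC").dt.hour  # same instants, viewed in UTC

print(local_hours.tolist())  # [0, 2, 4]
print(utc_hours.tolist())    # [8, 10, 12]
```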

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.20.0, 0.19.0 Sep 1, 2016

@jreback jreback modified the milestones: 0.20.0, Next Major Release Mar 23, 2017

@mroeschke mroeschke referenced this issue Jun 23, 2018

Merged

TST: Clean old timezone issues PT2 #21612

10 of 10 tasks complete

@jreback jreback modified the milestones: Next Major Release, 0.24.0 Jun 25, 2018
