Timezone lost on DataFrame assignments with realignment #12981

Closed
ajenkins-cargometrics opened this issue Apr 25, 2016 · 3 comments
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Timezones (timezone data dtype)
@ajenkins-cargometrics
Contributor

Starting from pandas 0.17, certain assignments to DataFrames cause offset-aware datetime columns to be converted to offset-naive columns. Specifically, it seems that if any data realignment is required when assigning the RHS to a slice of the DataFrame, the timezone info is lost. Here's an example:

from __future__ import print_function
import pandas

print("Pandas version:", pandas.__version__)

start = pandas.Timestamp('2015-01-01', tz='utc')
df = pandas.DataFrame({'dates': pandas.date_range(start, periods=3)})

print("Before assignment")
print(df['dates'])

# Shuffle column and reassign, causing RHS to need to be realigned on assignment
df['dates'] = df['dates'][[1,0,2]]

print("\nAfter assignment")
print(df['dates'])

The output I'd expect, which is what I get from pandas 0.16.2, is:

Pandas version: 0.16.2
Before assignment
0    2015-01-01 00:00:00+00:00
1    2015-01-02 00:00:00+00:00
2    2015-01-03 00:00:00+00:00
Name: dates, dtype: object

After assignment
0    2015-01-01 00:00:00+00:00
1    2015-01-02 00:00:00+00:00
2    2015-01-03 00:00:00+00:00
Name: dates, dtype: object

However, when I run this with pandas 0.18.0, the timezone info is lost after the assignment:

Pandas version: 0.18.0
Before assignment
0   2015-01-01 00:00:00+00:00
1   2015-01-02 00:00:00+00:00
2   2015-01-03 00:00:00+00:00
Name: dates, dtype: datetime64[ns, UTC]

After assignment
0   2015-01-01
1   2015-01-02
2   2015-01-03
Name: dates, dtype: datetime64[ns]

It seems the custom timezone-aware dtype that pandas started using for timezone-aware time series in 0.17.x doesn't get correctly propagated in this operation.
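The expected behavior can be written as a small regression check. This is only a sketch: it assumes a pandas version in which the realigned assignment keeps the timezone-aware dtype, and it uses `.iloc` (rather than the label-based lookup in the example above) to make the shuffle explicit.

```python
import pandas as pd

start = pd.Timestamp("2015-01-01", tz="UTC")
df = pd.DataFrame({"dates": pd.date_range(start, periods=3)})
original_dtype = df["dates"].dtype

# Shuffle the column so the RHS index no longer matches df.index;
# the assignment then has to realign the values by index label.
shuffled = df["dates"].iloc[[1, 0, 2]]
df["dates"] = shuffled

# The column should come back in index order with its timezone intact.
assert df["dates"].dtype == original_dtype
assert df["dates"].dt.tz is not None
```

On an affected version (0.17.x or 0.18.0, per the output above), the first assertion would be expected to fail because the column comes back as datetime64[ns].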

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 15.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.0
nose: 1.3.7
pip: 8.1.1
setuptools: 20.9.0
Cython: None
numpy: 1.11.0
scipy: 0.15.1
statsmodels: None
xarray: None
IPython: 3.1.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.3
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.4.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: None
boto: None
@jreback added the Bug, Indexing, and Timezones labels Apr 25, 2016
@jreback added this to the 0.18.2 milestone Apr 25, 2016
@jreback
Contributor

jreback commented Apr 25, 2016

yep looks like a bug. pull-requests are welcome!

@ajenkins-cargometrics
Contributor Author

After a little digging, I believe I've found the fix. In DataFrame._sanitize_column, there is a statement that accesses the values property where it should access _values. This statement:

value = value.reindex(self.index).values

should be

value = value.reindex(self.index)._values

The values property returns a NumPy array, which loses the custom dtype, whereas _values returns a DatetimeIndex, which preserves the dtype. I'll submit a PR.
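To illustrate the difference, here is a minimal sketch of the same effect on a standalone Series. It assumes a more recent pandas than 0.18, where the public `.array` accessor is available; it is used here as a stand-in for the private `_values` attribute, whose exact return type may differ between versions.

```python
import numpy as np
import pandas as pd

dates = pd.Series(pd.date_range("2015-01-01", periods=3, tz="UTC"))

# Reindex to force a realignment, mirroring the statement quoted above.
realigned = dates.reindex([1, 0, 2])

as_numpy = realigned.values   # plain NumPy datetime64 array, timezone dropped
as_pandas = realigned.array   # pandas-backed array, timezone-aware dtype kept

print(type(as_numpy).__name__, as_numpy.dtype)
print(type(as_pandas).__name__, as_pandas.dtype)
```

The NumPy dtype has no way to carry a timezone, so anything built from `.values` comes back offset-naive.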

@jreback
Contributor

jreback commented Apr 25, 2016

@ajenkins-cargometrics great!
