set_index(DatetimeIndex) unexpectedly shifts tz-aware datetime #12358

Closed
wavexx opened this Issue Feb 16, 2016 · 9 comments


wavexx commented Feb 16, 2016

This is another issue I've found in code that used to work:

import pandas as pd
tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]), tz='UTC').tz_convert('Europe/Rome')
df = pd.DataFrame({'tm': tm})
df.set_index(df.tm, inplace=True)
print(df.tm[0].hour)
print(df.index[0].hour)

writes:

11
10

It's unclear to me why the time is shifted. If we take a pd.DatetimeIndex which is not directly contained in the df, it works as it should:

tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]), tz='UTC').tz_convert('Europe/Rome')
df = pd.DataFrame({'tm': tm})
df.set_index(tm, inplace=True)
print(df.tm[0].hour)
print(df.index[0].hour)

11
11

Contributor

jreback commented Feb 16, 2016

a couple of things:

  1. your syntax is incorrect (yes, this did work, but it is completely misleading), as it's not clear that you actually mean to localize

so construct the index like this. IOW, you have to say, hey, this is a UTC time, THEN convert it.

tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"])).tz_localize('UTC').tz_convert('Europe/Rome')

In [4]: tm
Out[4]: DatetimeIndex(['2014-01-01 11:10:10+01:00'], dtype='datetime64[ns, Europe/Rome]', freq=None)
  2. df.set_index(df.tm, inplace=True)

This is a nonsensical operation, what do you think this should do?

you probably mean
df.index = df.tm

You are effectively setting the index with a 'key' from the array; this technically works as you only have 1 element (otherwise it would raise), but as I said, it doesn't make any sense.

In [30]: df.set_index?
Signature: df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
Docstring:
Set the DataFrame index (row labels) using one or more existing
columns. By default yields a new object.

Parameters
----------
keys : column label or list of column labels / arrays
drop : boolean, default True
    Delete columns to be used as the new index
append : boolean, default False
    Whether to append columns to existing index
inplace : boolean, default False
    Modify the DataFrame in place (do not create a new object)
verify_integrity : boolean, default False
    Check the new index for duplicates. Otherwise defer the check until
    necessary. Setting to False will improve the performance of this
    method

Examples
--------
>>> indexed_df = df.set_index(['A', 'B'])
>>> indexed_df2 = df.set_index(['A', [0, 1, 2, 0, 1, 2]])
>>> indexed_df3 = df.set_index([[0, 1, 2, 0, 1, 2]])

Returns
-------
dataframe : DataFrame
File:      ~/pandas/pandas/core/frame.py
Type:      instancemethod
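
The two unambiguous spellings mentioned in this comment can be sketched as follows. This is a minimal illustration, assuming a reasonably recent pandas; the variable names `by_label` and `by_assignment` are made up for the example and do not appear in the thread.

```python
import pandas as pd

# Parse as naive, declare the values to be UTC, then convert to Europe/Rome.
tm = (
    pd.to_datetime(["2014-01-01 10:10:10"])
    .tz_localize("UTC")
    .tz_convert("Europe/Rome")
)
df = pd.DataFrame({"tm": tm})

# Spelling 1: pass the column label; set_index looks the column up
# and, with the default drop=True, removes it from the columns.
by_label = df.set_index("tm")

# Spelling 2: assign the column to .index directly; the column is kept.
by_assignment = df.copy()
by_assignment.index = by_assignment["tm"]

print(by_label.index[0].hour, by_assignment.index[0].hour)
```

Both spellings should leave the index tz-aware, so the hour printed is the Europe/Rome wall-clock hour rather than the UTC one.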

jreback closed this Feb 16, 2016

wavexx commented Feb 17, 2016

It seems clear enough to me that if I know the tz of the series, there's no point to "localize" it later.
In fact, I always start from UTC. pd.to_datetime has a utc keyword which I would have expected to make the DatetimeIndex UTC and tz-aware, which would be what I need 99% of the time, but it doesn't (what the point of this argument is remains unclear to me!?).

As for setting the index, yes, it's dodgy. It's a reduced test-case from some convoluted code.
However, why does it shift time? I see no reason why in this explicit case it should.

Contributor

jreback commented Feb 17, 2016

passing it rather than explicitly localizing leads to a lot of ambiguity; what should I be doing here?

In [1]: DatetimeIndex(['2014-01-01 11:10:10+01:00'],tz='UTC')
Out[1]: DatetimeIndex(['2014-01-01 10:10:10+00:00'], dtype='datetime64[ns, UTC]', freq=None)

as to your second point, it is converted to a numpy array, thus the tz is lost. the first arg only accepts a list or np.array, NOT a Series, exactly for this reason.
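
A minimal sketch of the mechanism described here, assuming a recent pandas: turning a tz-aware Series into a plain NumPy datetime64 array keeps the UTC instant and drops the timezone, which would account for the 11 versus 10 in the report. The sketch uses `Series.to_numpy(dtype="datetime64[ns]")` as one documented way to get such an array; whether `set_index` took exactly this path internally is an assumption, not something the thread confirms.

```python
import pandas as pd

tm = (
    pd.to_datetime(["2014-01-01 10:10:10"])
    .tz_localize("UTC")
    .tz_convert("Europe/Rome")
)
s = pd.Series(tm)

# Native datetime64 values: converted to UTC, timezone information dropped.
raw = s.to_numpy(dtype="datetime64[ns]")

# Rebuilding an index from the raw array gives a tz-naive index
# whose wall-clock time is the UTC one.
naive = pd.DatetimeIndex(raw)

print(s.iloc[0].hour)   # hour as seen in Europe/Rome
print(naive[0].hour)    # hour of the same instant in UTC
print(naive.tz)
```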

wavexx commented Feb 17, 2016

tz_localize() converts the timezone, I explicitly don't want it to do any conversion as my dates do not contain any.

In fact, if I could bug you one more time about this, what's the more efficient way to start from a unix timestamp (obviously in UTC) and get to a localized series?

Contributor

jreback commented Feb 17, 2016

NO, tz_localize SETS the timezone; tz_convert converts it!

Here are some examples.

You CAN use the utc=True flag on pd.to_datetime; this WILL return it localized to UTC (just don't do this directly with DatetimeIndex). All will be well if you use pd.to_datetime for all conversion needs, then operate on the resulting objects.

In [2]: v = Timestamp('20130101').value

In [3]: v
Out[3]: 1356998400000000000

In [4]: pd.to_datetime(v,unit='ns')
Out[4]: Timestamp('2013-01-01 00:00:00')

In [5]: pd.to_datetime(v/1000000,unit='ms')
Out[5]: Timestamp('2013-01-01 00:00:00')

In [6]: pd.to_datetime(v/1000000,unit='ms').tz_localize('UTC')
Out[6]: Timestamp('2013-01-01 00:00:00+0000', tz='UTC')

In [7]: pd.to_datetime(v/1000000,unit='ms',utc=True)
Out[7]: Timestamp('2013-01-01 00:00:00+0000', tz='UTC')

In [8]: pd.to_datetime(v/1000000,unit='ms').tz_localize('UTC')
Out[8]: Timestamp('2013-01-01 00:00:00+0000', tz='UTC')

In [9]: Series(pd.to_datetime(v/1000000,unit='ms').tz_localize('UTC'))
Out[9]: 
0   2013-01-01 00:00:00+00:00
dtype: datetime64[ns, UTC]

In [10]: Series(pd.to_datetime(v/1000000,unit='ms')).dt.tz_localize('UTC').dt.tz_convert('US/Eastern')
Out[10]: 
0   2012-12-31 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
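
For the question about going from Unix timestamps to a localized series, the session above can be condensed into the following sketch. It assumes a recent pandas, uses whole seconds (`unit="s"`) instead of the millisecond values above, and targets Europe/Rome to match the original report; the sample values and variable names are illustrative only.

```python
import pandas as pd

# Seconds since the epoch; the first value is 2013-01-01 00:00:00 UTC,
# the same instant as the Timestamp used in the session above.
epoch_seconds = pd.Series([1356998400, 1357002000])

# utc=True makes the result tz-aware in UTC in one step.
utc = pd.to_datetime(epoch_seconds, unit="s", utc=True)

# tz_convert changes the displayed timezone without changing the instant.
rome = utc.dt.tz_convert("Europe/Rome")

print(utc)
print(rome)
```

Parsing with `utc=True` and converting afterwards avoids a separate `tz_localize` step, since epoch values are UTC by definition.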

wavexx commented Feb 17, 2016

On Wed, Feb 17 2016, Jeff Reback notifications@github.com wrote:

You CAN use the utc=True flag on pd.to_datetime; this WILL return it localized
to UTC (just don't do this directly with DatetimeIndex). All will be well if
you use pd.to_datetime for all conversion needs, then operate on the resulting
objects.

Ok, this made things a little bit clearer regarding the tz.
Point understood.

I'm still not super-happy about the set_index behavior. I've given it
some extra thought, but I don't see where and why the tz would be lost.

Where exactly does this conversion happen?

import pandas as pd
tm = pd.DatetimeIndex(pd.to_datetime(["2014-01-01 10:10:10"]), tz='UTC').tz_convert('Europe/Rome')
df = pd.DataFrame({'tm': tm})
print(df.set_index(tm).index[0].hour)
print(pd.DatetimeIndex(pd.Series(df.tm))[0].hour)
print(df.set_index(df.tm).index[0].hour)

=> 11 11 10

Ignore the fact that I could assign to index for a moment.

I'm supplying a type to set_index that should be equivalent to the
first or second print statement.

@jreback jreback added a commit to jreback/pandas that referenced this issue Feb 17, 2016

@jreback jreback BUG: Bug in DataFrame.set_index() with tz-aware Series
closes #12358
46d5c9d

@jreback jreback added Bug and removed Usage Question labels Feb 17, 2016

jreback added this to the 0.18.0 milestone Feb 17, 2016

jreback reopened this Feb 17, 2016

Contributor

jreback commented Feb 17, 2016

looks like a bug after all!

fixed by #12365
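
A small check along the lines of the original report, assuming a pandas release that includes this fix (0.18.0 or later, per the milestone). It is only a sketch for verifying the behavior locally, not the test added in the fix itself.

```python
import pandas as pd

tm = (
    pd.to_datetime(["2014-01-01 10:10:10"])
    .tz_localize("UTC")
    .tz_convert("Europe/Rome")
)
df = pd.DataFrame({"tm": tm})

# Passing the tz-aware Series itself, as in the original report.
indexed = df.set_index(df["tm"])

# With the fix, the index keeps the timezone and the local hour.
print(indexed.index[0].hour, indexed.index.tz)
```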

wavexx commented Feb 17, 2016

On Wed, Feb 17 2016, Jeff Reback notifications@github.com wrote:

looks like a bug after all!

fixed by #12365

Sorry for being pedantic!

Contributor

jreback commented Feb 17, 2016

no, persistence is good! you got me to actually step thru and see what was happening. always better to test.

jreback closed this in 69baf4c Feb 17, 2016

@rinoc rinoc added a commit to rinoc/pandas that referenced this issue Feb 17, 2016

@rinoc rinoc BUG: Init categorical series with scalar value (#12336)
Move and change tests.

BUG: Bug in DataFrame.set_index() with tz-aware Series

closes #12358

Author: Jeff Reback <jeff@reback.net>

Closes #12365 from jreback/set_index_tz and squashes the following commits:

46d5c9d [Jeff Reback] BUG: Bug in DataFrame.set_index() with tz-aware Series
555a76f