BUG: iterator of DatetimeIndex broken with tzoffset timezone #8890

Closed
Oxtay opened this issue Nov 24, 2014 · 7 comments

@Oxtay

commented Nov 24, 2014

Summary:

The trigger is the tzoffset timezone. This bug can be reproduced as follows:

In [86]: index = pd.date_range("2012-01-01", periods=3, freq='H', tz=dateutil.tz.tzoffset(None, -28800))

In [87]: index
Out[87]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00-08:00, ..., 2012-01-01 02:00:00-08:00]
Length: 3, Freq: H, Timezone: tzoffset(None, -28800)

In [88]: index[0]
Out[88]: Timestamp('2012-01-01 00:00:00-0800', tz='tzoffset(None, -28800)', offset='H')

In [90]: list(iter(index))[0]
Out[90]: Timestamp('2011-12-31 16:00:00-0800', tz='tzoffset(None, -28800)', offset='H')

In [91]: list(iter(index))[0] == index[0]
Out[91]: False

In 0.14 this last comparison gives True.

This appears when iterating over the index (for time in index: ...) or when using DataFrame.iterrows() (#8951).
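The comparison above can be turned into a small self-contained check. This is a sketch rather than the project's own test: the helper names `iteration_matches_indexing` and `iterrows_matches_indexing` and the `value` column are made up for illustration, and a `Timedelta` is passed as the frequency only to avoid depending on the spelling of the hourly alias. On a pandas version without the regression both checks should print True.

```python
import dateutil.tz
import pandas as pd

# Fixed-offset timezone from the report: UTC-08:00 (-28800 seconds).
tz = dateutil.tz.tzoffset(None, -28800)

# Hourly index equivalent to the one built in the session above.
index = pd.date_range("2012-01-01", periods=3, freq=pd.Timedelta(hours=1), tz=tz)


def iteration_matches_indexing(idx):
    """Return True if iterating over idx yields the same values as idx[i]."""
    by_position = [idx[i] for i in range(len(idx))]
    return list(idx) == by_position


def iterrows_matches_indexing(frame):
    """Return True if the labels yielded by iterrows() equal frame.index[i]."""
    labels = [label for label, _ in frame.iterrows()]
    return labels == [frame.index[i] for i in range(len(frame))]


# Hypothetical frame; only its index matters for the check.
df = pd.DataFrame({"value": range(len(index))}, index=index)

print(iteration_matches_indexing(index))
print(iterrows_matches_indexing(df))
```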


Original report:

I see there are a number of issues related to datetime and timeindex here in issues, and I suspect that mine has a lot in common with them. The cure for one of them will probably solve all of them. So here it goes.

My code ran without issue on pandas 0.13.1. I recently upgraded to 0.15.1, and this is where it now behaves unexpectedly:

time_points = df.index[df['candidate'] == 1]
for time in time_points:
    [...]

The index is in the US/Pacific timezone. When the for loop yields time, it still carries the -08:00 offset, but the value is shifted back by 8 hours. So, while the actual time is 2014-11-23 23:25:02.916000-08:00, time is set to 2014-11-23 15:25:02.916000-08:00.
I have confirmed that the index values are of type pandas.Timestamp, and I can't find an elegant workaround that runs correctly on both versions of pandas.

Any thoughts on this?
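One possible stopgap, sketched under an assumption: the reproduction in this thread shows that positional access (`index[0]`) returns the correct value while iteration does not, so looping over positions may sidestep the problem. This has only been shown for `index[0]` here, so treat it as something to verify on the affected version. The helper name `iter_by_position` is hypothetical, and the two sample timestamps are taken from the values reported in this thread.

```python
import dateutil.tz
import pandas as pd


def iter_by_position(index):
    """Yield index[i] for each position instead of relying on iter(index)."""
    for i in range(len(index)):
        yield index[i]


# Fixed-offset timezone matching the index in the report (UTC-08:00).
tz = dateutil.tz.tzoffset(None, -28800)

# Two of the reported wall-clock times, localized to that offset.
time_points = pd.to_datetime(
    ["2014-11-23 23:25:02.916", "2014-11-23 23:49:12.972"]
).tz_localize(tz)

for time in iter_by_position(time_points):
    print(time)
```

Because the helper only uses `len()` and integer indexing, it should behave the same on versions where plain iteration is already correct.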

@jorisvandenbossche

Member

commented Nov 24, 2014

Can you provide a small reproducible example that shows the problem?

Because I don't see a problem if I run the following:

In [5]: index = pd.date_range("2012-01-01", periods=3, freq='H', tz='US/Pacific')

In [6]: index
Out[6]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00-08:00, ..., 2012-01-01 02:00:00-08:00]
Length: 3, Freq: H, Timezone: US/Pacific

In [7]: for time in index:
   ...:     print time
   ...:
2012-01-01 00:00:00-08:00
2012-01-01 01:00:00-08:00
2012-01-01 02:00:00-08:00

And I get the same in 0.15 as 0.13.

@Oxtay

Author

commented Nov 24, 2014

@jorisvandenbossche Thanks for the reply. Here is what I have:
I have a variable called time_points, which is:

    <class 'pandas.tseries.index.DatetimeIndex'>
    [2014-11-23 23:25:02.916000-08:00, ..., 2014-11-24 02:58:06.510000-08:00]
    Length: 18, Freq: None, Timezone: tzoffset(None, -28800)

with values:

    ['2014-11-23T23:25:02.916000000-0800' '2014-11-23T23:49:12.972000000-0800'
     '2014-11-24T00:40:13.378000000-0800' '2014-11-24T00:49:05.325000000-0800'
     '2014-11-24T00:55:14.480000000-0800' '2014-11-24T01:13:11.850000000-0800'
     '2014-11-24T01:19:08.454000000-0800' '2014-11-24T01:22:14.071000000-0800'
     '2014-11-24T01:25:06.543000000-0800' '2014-11-24T01:28:09.774000000-0800'
     '2014-11-24T02:16:06.774000000-0800' '2014-11-24T02:19:09.741000000-0800'
     '2014-11-24T02:22:03.609000000-0800' '2014-11-24T02:25:07.005000000-0800'
     '2014-11-24T02:28:09.487000000-0800' '2014-11-24T02:34:07.675000000-0800'
     '2014-11-24T02:37:11.246000000-0800' '2014-11-24T02:58:06.510000000-0800']

And when I try to do:

    for time in time_points:
        print time

it returns:

    2014-11-23 15:25:02.916000-08:00
    2014-11-23 15:49:12.972000-08:00
    2014-11-23 16:40:13.378000-08:00
    2014-11-23 16:49:05.325000-08:00
    2014-11-23 16:55:14.480000-08:00
    2014-11-23 17:13:11.850000-08:00
    2014-11-23 17:19:08.454000-08:00
    2014-11-23 17:22:14.071000-08:00
    2014-11-23 17:25:06.543000-08:00
    2014-11-23 17:28:09.774000-08:00
    2014-11-23 18:16:06.774000-08:00
    2014-11-23 18:19:09.741000-08:00
    2014-11-23 18:22:03.609000-08:00
    2014-11-23 18:25:07.005000-08:00
    2014-11-23 18:28:09.487000-08:00
    2014-11-23 18:34:07.675000-08:00
    2014-11-23 18:37:11.246000-08:00
    2014-11-23 18:58:06.510000-08:00

This wasn't the case in 0.13 and I can still run it in that version without a problem. So the only thing I can attribute it to is a version change in pandas.
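To quantify the shift, the sketch below compares a few of the stored values with the corresponding printed values using only the standard library. It shows that the printed wall-clock times are exactly 8 hours earlier, which equals the magnitude of the -28800 second offset. It says nothing about where in pandas the shift is introduced.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"

# First, second and last wall-clock values of time_points listed above.
stored = [
    "2014-11-23 23:25:02.916000",
    "2014-11-23 23:49:12.972000",
    "2014-11-24 02:58:06.510000",
]
# The corresponding values printed by the for loop.
printed = [
    "2014-11-23 15:25:02.916000",
    "2014-11-23 15:49:12.972000",
    "2014-11-23 18:58:06.510000",
]

# Difference between what is stored and what iteration returned.
shifts = [
    datetime.strptime(s, FMT) - datetime.strptime(p, FMT)
    for s, p in zip(stored, printed)
]

# UTC offset of tzoffset(None, -28800).
utc_offset = timedelta(seconds=-28800)

print(shifts)
print(all(shift == -utc_offset for shift in shifts))
```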

@jreback

Contributor

commented Nov 24, 2014

Please post the output of pd.show_versions() for 0.13.1 (and 0.15.1).

Please also show how you constructed this index.

@Oxtay

Author

commented Nov 25, 2014

The line right before this, which creates time_points, is:

time_points = self.df.index[self.df['candidate'] == 1]

I have traced it line by line and everything up to that point is the same. Unfortunately, I can't post how the self.df is being made. That would require posting a lot of code here.
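Since the real frame cannot be shared, a stand-in along the following lines might be enough to exercise the same selection. Everything here is hypothetical apart from the selection expression itself: the `candidate` values are made up, and the three timestamps are simply the first three values reported earlier in the thread.

```python
import dateutil.tz
import pandas as pd

# Fixed-offset timezone matching the reported index (UTC-08:00).
tz = dateutil.tz.tzoffset(None, -28800)

# Hypothetical stand-in for self.df.
index = pd.to_datetime(
    [
        "2014-11-23 23:25:02.916",
        "2014-11-23 23:49:12.972",
        "2014-11-24 00:40:13.378",
    ]
).tz_localize(tz)
df = pd.DataFrame({"candidate": [1, 0, 1]}, index=index)

# Same selection as in the report: index labels of the candidate rows.
time_points = df.index[df["candidate"] == 1]

# Compare each iterated value with the value obtained by position.
for time, expected in zip(time_points, [index[0], index[2]]):
    print(time, time == expected)
```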

Here is the output of pd.show_versions() for the two pandas versions, first for 0.13.1:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 14.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.13.1
Cython: None
numpy: 1.8.1
scipy: 0.14.0
statsmodels: None
IPython: 2.0.0
sphinx: None
patsy: None
scikits.timeseries: None
dateutil: 2.2
pytz: 2013b
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
sqlalchemy: None
lxml: None
bs4: None
html5lib: None
bq: None
apiclient: None

and for Pandas 0.15.1:

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Darwin
OS-release: 14.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: 1.3.1
Cython: None
numpy: 1.9.1
scipy: 0.14.0
statsmodels: None
IPython: 2.0.0
sphinx: None
patsy: None
dateutil: 2.2
pytz: 2013b
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jorisvandenbossche

Member

commented Nov 25, 2014

The trigger is the tzoffset timezone. It can be reproduced as follows:

In [86]: index = pd.date_range("2012-01-01", periods=3, freq='H', tz=dateutil.tz.tzoffset(None, -28800))

In [87]: index
Out[87]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-01-01 00:00:00-08:00, ..., 2012-01-01 02:00:00-08:00]
Length: 3, Freq: H, Timezone: tzoffset(None, -28800)

In [88]: index[0]
Out[88]: Timestamp('2012-01-01 00:00:00-0800', tz='tzoffset(None, -28800)', offset='H')

In [89]: for time in index:
   ....:     print time
   ....:
2011-12-31 16:00:00-08:00
2011-12-31 17:00:00-08:00
2011-12-31 18:00:00-08:00

In [90]: list(iter(index))[0]
Out[90]: Timestamp('2011-12-31 16:00:00-0800', tz='tzoffset(None, -28800)', offset='H')

In [91]: list(iter(index))[0] == index[0]
Out[91]: False

@Oxtay

Author

commented Nov 25, 2014

Thank you. So it would seem that pandas treats the tzoffset as something in addition to -08:00 and applies it to the time again?

@jorisvandenbossche

Member

commented Dec 1, 2014

This is a regression from 0.14.
