Indexing into Series of tz-aware datetime64s fails using __getitem__ #12089

Closed
JackKelly opened this Issue Jan 19, 2016 · 8 comments

JackKelly commented Jan 19, 2016

I'm a huge fan of Pandas. Thanks for all the hard work!

I believe I have stumbled across a small bug in Pandas 0.17.1 which was not present in 0.16.2. Indexing into Series of timezone-aware datetime64s fails using __getitem__ but indexing succeeds if the datetime64s are timezone-naive. Here is a minimal code example and the exception produced by Pandas 0.17.1:

In [37]: dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")

In [46]: dates_with_tz
Out[46]: 
DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-02 00:00:00-05:00',
               '2011-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [38]: s_with_tz = pd.Series(dates_with_tz, index=['a', 'b', 'c'])

In [39]: s_with_tz
Out[39]: 
a   2011-01-01 00:00:00-05:00
b   2011-01-02 00:00:00-05:00
c   2011-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [40]: s_with_tz['a']
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-40-81d0bf655282> in <module>()
----> 1 s_with_tz['a']

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in __getitem__(self, key)
    555     def __getitem__(self, key):
    556         try:
--> 557             result = self.index.get_value(self, key)
    558 
    559             if not np.isscalar(result):

/usr/local/lib/python2.7/dist-packages/pandas/core/index.pyc in get_value(self, series, key)
   1778         s = getattr(series,'_values',None)
   1779         if isinstance(s, Index) and lib.isscalar(key):
-> 1780             return s[key]
   1781 
   1782         s = _values_from_object(series)

/usr/local/lib/python2.7/dist-packages/pandas/tseries/base.pyc in __getitem__(self, key)
     98         getitem = self._data.__getitem__
     99         if np.isscalar(key):
--> 100             val = getitem(key)
    101             return self._box_func(val)
    102         else:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

If the dates are timezone-aware then we can access them using loc but, as far as I'm aware, we should be able to use __getitem__ in this situation too:

In [41]: s_with_tz.loc['a']
Out[41]: Timestamp('2011-01-01 00:00:00-0500', tz='US/Eastern')

However, if the dates are timezone-naive then indexing using __getitem__ works as expected:

In [32]: dates_naive = pd.date_range("2011-01-01", periods=3)

In [33]: dates_naive
Out[33]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

In [34]: s = pd.Series(dates_naive, index=['a', 'b', 'c'])

In [35]: s
Out[35]: 
a   2011-01-01
b   2011-01-02
c   2011-01-03
dtype: datetime64[ns]

In [36]: s['a']
Out[36]: Timestamp('2011-01-01 00:00:00')

So indexing into a Series using __getitem__ works if the data is a list of timezone-naive datetime64s but indexing fails if the datetime64s are timezone-aware.
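For anyone wanting to check their own install, here is a minimal standalone version of the reproduction above. It is only a sketch: it assumes `.loc` returns the correct value (as it did on 0.17.1) and uses that as the reference, and the names `expected` and `actual` are just for illustration.

```python
import pandas as pd

print("pandas version:", pd.__version__)

dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")
s_with_tz = pd.Series(dates_with_tz, index=["a", "b", "c"])

# .loc worked on 0.17.1, so treat its result as the reference value.
expected = s_with_tz.loc["a"]

try:
    actual = s_with_tz["a"]
except IndexError as exc:
    # This is the failure reported in this issue for pandas 0.17.1.
    actual = None
    print("__getitem__ raised IndexError:", exc)
else:
    # On a version without the bug both lookups should agree,
    # including the timezone.
    print("same value:", actual == expected)
    print("same tz:", str(actual.tz) == str(expected.tz))
```

If `__getitem__` raises, the installed version is affected; otherwise both lines should print `True`.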

In [47]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 1.5.6
setuptools: 15.2
Cython: 0.23.1
numpy: 1.10.1
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)
Jinja2: None

jorisvandenbossche added this to the 0.18.0 milestone Jan 19, 2016

I can confirm this bug, also with current master.

jreback commented Jan 19, 2016

just need a try: except: around line 1780, catching IndexError and passing if it's caught.

@JackKelly want to do a PR?
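A rough sketch of the try/except idea suggested above, written against plain Python lists rather than pandas internals. The function name `get_value_sketch` and its arguments are hypothetical, and this is not the actual patch. One difference worth noting: a plain list raises TypeError for a string key, whereas the NumPy-backed data in the traceback raises IndexError, so the sketch catches both.

```python
def get_value_sketch(values, labels, key):
    """Hypothetical stand-in for a label lookup; not pandas code."""
    try:
        # Fast path: only succeeds when `key` is a valid positional index.
        return values[key]
    except (IndexError, TypeError):
        # Swallow the error and fall through to the label-based lookup.
        pass

    try:
        position = labels.index(key)
    except ValueError:
        raise KeyError(key) from None
    return values[position]


values = [10, 20, 30]
labels = ["a", "b", "c"]

print(get_value_sketch(values, labels, "a"))  # label lookup via the fallback
print(get_value_sketch(values, labels, 1))    # positional lookup via the fast path
```

The point of the pattern is that a failed positional lookup should not be fatal; it should hand over to the slower label-based path.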

jreback commented Jan 19, 2016

tests can go in the same place as in #12054

JackKelly commented Jan 19, 2016

sure, I'll give it a go now...

JackKelly added a commit to JackKelly/pandas that referenced this issue Jan 19, 2016

Attempt to fix #12089: Indexing into Series of tz-aware datetime64s fails using __getitem__ (420c926)

JackKelly commented Jan 19, 2016

OK, I've attempted the fix. Here's the relevant commit on my fork of Pandas.

However, this hasn't fixed the issue and I'm not sure what's best to do. My 'fix' has revealed a new issue. The problem appears to be that, now, when we do series['a'], we get back a tz-naive Timestamp (even though the series contains a bunch of tz-aware datetime64s):

In [5]: dates = pd.date_range("2011-01-01", periods=3, tz='utc')

In [6]: dates
Out[6]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns, UTC]', freq='D')

In [7]: series = pd.Series(dates, index=['a', 'b', 'c'])

# Note the lack of timezone:
In [8]: series['a']
Out[8]: Timestamp('2011-01-01 00:00:00')

# But using `loc` we do get the timezone:
In [9]: series.loc['a']
Out[9]: Timestamp('2011-01-01 00:00:00+0000', tz='UTC')

In [10]: series
Out[10]: 
a   2011-01-01 00:00:00+00:00
b   2011-01-02 00:00:00+00:00
c   2011-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

In [11]: series['a'] == series.loc['a']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-0de902e8919c> in <module>()
----> 1 series['a'] == series.loc['a']

/home/jack/workspace/python/pandas/pandas/tslib.pyx in pandas.tslib._Timestamp.__richcmp__ (pandas/tslib.c:19258)()
    971                                 (type(self).__name__, type(other).__name__))
    972 
--> 973         self._assert_tzawareness_compat(other)
    974         return _cmp_scalar(self.value, ots.value, op)
    975 

/home/jack/workspace/python/pandas/pandas/tslib.pyx in pandas.tslib._Timestamp._assert_tzawareness_compat (pandas/tslib.c:19638)()
   1000         if self.tzinfo is None:
   1001             if other.tzinfo is not None:
-> 1002                 raise TypeError('Cannot compare tz-naive and tz-aware '
   1003                                  'timestamps')
   1004         elif other.tzinfo is None:

TypeError: Cannot compare tz-naive and tz-aware timestamps
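One way to tell which of the two results has lost its timezone is to inspect `tzinfo` before comparing. The sketch below assumes the naive value really represents a UTC instant (as in the session above, where the series is in UTC); the helper `same_instant` and its `assumed_tz` parameter are hypothetical names for illustration. Note that how `==` behaves between tz-naive and tz-aware Timestamps may differ between pandas versions, so the helper avoids that comparison altogether.

```python
import pandas as pd

naive = pd.Timestamp("2011-01-01 00:00:00")
aware = pd.Timestamp("2011-01-01 00:00:00", tz="UTC")

print("naive.tzinfo:", naive.tzinfo)
print("aware.tzinfo:", aware.tzinfo)


def same_instant(left, right, assumed_tz="UTC"):
    """Hypothetical helper: localize any naive timestamp to `assumed_tz`,
    then compare the two as tz-aware values."""
    if left.tzinfo is None:
        left = left.tz_localize(assumed_tz)
    if right.tzinfo is None:
        right = right.tz_localize(assumed_tz)
    return left == right


print(same_instant(naive, aware))
```

This only works around the symptom for debugging purposes; the underlying problem is still that `series['a']` should not drop the timezone in the first place.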

jreback commented Jan 19, 2016

yeh, prob some issues down the path. lmk if you get stuck.

JackKelly commented Jan 19, 2016

Hmm, I think this is way over my head to be honest. I'm really not very familiar with Pandas' internals. I have had a quick shot at getting to the bottom of it. Not sure if I've found any bugs or not. Here are my notes:

Set up a debugging session in IPython like this:

dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")
s_with_tz = pd.Series(dates_with_tz, index=['a', 'b', 'c'])
%debug s_with_tz['a']

we find that:

In Index.get_value(), the line s = _values_from_object(series) sets s to be:

['2011-01-01T05:00:00.000000000+0000' '2011-01-02T05:00:00.000000000+0000'
 '2011-01-03T05:00:00.000000000+0000']

i.e. timezone is switched from "US/Eastern" to UTC. I've tried stepping into _values_from_object(series) to find where the timezone is switched but I'm not sure I understand what I'm looking at. My only hunch is that the following is broken (because the timezone should still be US/Eastern, surely?):

In [32]: s_with_tz._values._values
Out[32]: 
array(['2011-01-01T05:00:00.000000000+0000',
       '2011-01-02T05:00:00.000000000+0000',
       '2011-01-03T05:00:00.000000000+0000'], dtype='datetime64[ns]')

but I'm really not sure! Is s_with_tz._values._values supposed to return an array where the timezone is set to UTC instead of 'US/Eastern'? Here are the other values (which look correct to me):

In [33]: s_with_tz._values
Out[33]: 
DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-02 00:00:00-05:00',
               '2011-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [34]: s_with_tz
Out[34]: 
a   2011-01-01 00:00:00-05:00
b   2011-01-02 00:00:00-05:00
c   2011-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
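On the question in the notes above: as far as I understand it, tz-aware data in pandas is held as UTC instants, with the timezone kept as metadata on the dtype, so seeing 05:00 UTC in the raw array is not in itself wrong. The sketch below shows the same thing using public API only (no `_values`); the variable names are just for illustration.

```python
import pandas as pd

dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")
s_with_tz = pd.Series(dates_with_tz, index=["a", "b", "c"])

# Midnight US/Eastern on 2011-01-01 is the same instant as 05:00 UTC.
eastern = s_with_tz.loc["a"]
as_utc = eastern.tz_convert("UTC")
print(eastern, "->", as_utc)
print("same instant:", eastern == as_utc)

# Convert the whole Series to UTC, then drop the tz to get naive UTC values.
utc_naive = s_with_tz.dt.tz_convert("UTC").dt.tz_localize(None)
print(utc_naive)

# Asking for plain datetime64 values also converts to UTC and drops the tz.
raw = s_with_tz.to_numpy(dtype="datetime64[ns]")
print(raw)
```

So the bug is less about the raw array being in UTC and more about a code path returning values from that raw array without re-attaching the timezone.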

jreback added a commit to jreback/pandas that referenced this issue Jan 27, 2016

BUG: getitem and a series with a non-ndarray values, closes #12089 (824ddbe)

jreback closed this in 3152bdc Jan 27, 2016

JackKelly commented Jan 27, 2016

thank you @jreback :)
