Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KeyError when comparing DataFrame with tz-aware DatetimeIndex on columns with DST change #19970

Closed
Liam3851 opened this issue Mar 2, 2018 · 14 comments · Fixed by #21407
Closed
Labels
Regression Functionality that used to work in a prior pandas version Timezones Timezone data dtype
Milestone

Comments

@Liam3851
Copy link
Contributor

Liam3851 commented Mar 2, 2018

Code Sample, a copy-pastable example if possible

dr = pd.date_range('20160101', '20161130', freq='4H', tz='America/New_York')
df = pd.DataFrame({'a':np.arange(len(dr)), 'b':np.arange(len(dr))}, index=dr)
dfnov_view = df.loc['2016-11']
drnov = pd.date_range('20161101', '20161130', freq='4H', tz='America/New_York')
dfnov = pd.DataFrame({'a':np.arange(len(drnov)), 'b':np.arange(len(drnov))}, index=drnov)

df == df # works
dfnov_view == dfnov_view # works
dfnov == dfnov # works

dfnov.T == dfnov.T # works
df.T == df.T # raises KeyError on master; works on 0.22
dfnov_view.T == dfnov_view.T # raises KeyError on master; works on 0.22

# KeyError stacktrace:
In [7]: dfnov_view.T == dfnov_view.T # raises KeyError
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\projects\pandas-dk\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\projects\pandas-dk\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
    933
--> 934     cpdef get_item(self, int64_t val):
    935         cdef khiter_t k

C:\projects\pandas-dk\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
    939         else:
--> 940             raise KeyError(val)
    941

KeyError: 1478412000000000000

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\projects\pandas-dk\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2635             try:
-> 2636                 return self._engine.get_loc(key)
   2637             except KeyError:

C:\projects\pandas-dk\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):

C:\projects\pandas-dk\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:

KeyError: Timestamp('2016-11-06 01:00:00-0500', tz='America/New_York')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
C:\projects\pandas-dk\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
    457         try:
--> 458             return self.mapping.get_item(val.value)
    459         except KeyError:

C:\projects\pandas-dk\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
    933
--> 934     cpdef get_item(self, int64_t val):
    935         cdef khiter_t k

C:\projects\pandas-dk\pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
    939         else:
--> 940             raise KeyError(val)
    941

KeyError: 1478412000000000000

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-7-495715640ebe> in <module>()
----> 1 dfnov_view.T == dfnov_view.T # raises KeyError

C:\projects\pandas-dk\pandas\core\ops.py in f(self, other)
   1558                 raise ValueError('Can only compare identically-labeled '
   1559                                  'DataFrame objects')
-> 1560             return self._compare_frame(other, func, str_rep)
   1561
   1562         elif isinstance(other, ABCSeries):

C:\projects\pandas-dk\pandas\core\frame.py in _compare_frame(self, other, func, str_rep)
   4032                 return {col: func(a[col], b[col]) for col in a.columns}
   4033
-> 4034             new_data = expressions.evaluate(_compare, str_rep, self, other)
   4035             return self._constructor(data=new_data, index=self.index,
   4036                                      columns=self.columns, copy=False)

C:\projects\pandas-dk\pandas\core\computation\expressions.py in evaluate(op, op_str, a, b, use_numexpr, **eval_kwargs)
    203     use_numexpr = use_numexpr and _bool_arith_check(op_str, a, b)
    204     if use_numexpr:
--> 205         return _evaluate(op, op_str, a, b, **eval_kwargs)
    206     return _evaluate_standard(op, op_str, a, b)
    207

C:\projects\pandas-dk\pandas\core\computation\expressions.py in _evaluate_numexpr(op, op_str, a, b, truediv, reversed, **eval_kwargs)
    118
    119     if result is None:
--> 120         result = _evaluate_standard(op, op_str, a, b)
    121
    122     return result

C:\projects\pandas-dk\pandas\core\computation\expressions.py in _evaluate_standard(op, op_str, a, b, **eval_kwargs)
     63         _store_test_result(False)
     64     with np.errstate(all='ignore'):
---> 65         return op(a, b)
     66
     67

C:\projects\pandas-dk\pandas\core\frame.py in _compare(a, b)
   4030
   4031             def _compare(a, b):
-> 4032                 return {col: func(a[col], b[col]) for col in a.columns}
   4033
   4034             new_data = expressions.evaluate(_compare, str_rep, self, other)

C:\projects\pandas-dk\pandas\core\frame.py in <dictcomp>(.0)
   4030
   4031             def _compare(a, b):
-> 4032                 return {col: func(a[col], b[col]) for col in a.columns}
   4033
   4034             new_data = expressions.evaluate(_compare, str_rep, self, other)

C:\projects\pandas-dk\pandas\core\frame.py in __getitem__(self, key)
   2202             return self._getitem_multilevel(key)
   2203         else:
-> 2204             return self._getitem_column(key)
   2205
   2206     def _getitem_column(self, key):

C:\projects\pandas-dk\pandas\core\frame.py in _getitem_column(self, key)
   2209         # get column
   2210         if self.columns.is_unique:
-> 2211             return self._get_item_cache(key)
   2212
   2213         # duplicate columns & possible reduce dimensionality

C:\projects\pandas-dk\pandas\core\generic.py in _get_item_cache(self, item)
   2193         res = cache.get(item)
   2194         if res is None:
-> 2195             values = self._data.get(item)
   2196             res = self._box_item_values(item, values)
   2197             cache[item] = res

C:\projects\pandas-dk\pandas\core\internals.py in get(self, item, fastpath)
   4070
   4071             if not isna(item):
-> 4072                 loc = self.items.get_loc(item)
   4073             else:
   4074                 indexer = np.arange(len(self.items))[isna(self.items)]

C:\projects\pandas-dk\pandas\core\indexes\datetimes.py in get_loc(self, key, method, tolerance)
   1555             # needed to localize naive datetimes
   1556             key = Timestamp(key, tz=self.tz)
-> 1557             return Index.get_loc(self, key, method, tolerance)
   1558
   1559         if isinstance(key, time):

C:\projects\pandas-dk\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2636                 return self._engine.get_loc(key)
   2637             except KeyError:
-> 2638                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2639
   2640         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

C:\projects\pandas-dk\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
    428         return algos.is_monotonic_int64(values, timelike=True)
    429
--> 430     cpdef get_loc(self, object val):
    431         if is_definitely_invalid_key(val):
    432             raise TypeError

C:\projects\pandas-dk\pandas\_libs\index.pyx in pandas._libs.index.DatetimeEngine.get_loc()
    458             return self.mapping.get_item(val.value)
    459         except KeyError:
--> 460             raise KeyError(val)
    461         except AttributeError:
    462             pass

KeyError: Timestamp('2016-11-06 01:00:00-0500', tz='America/New_York')

Problem description

On current master, if you have a DataFrame with a tz-aware DatetimeIndex in the columns, comparison can fail with a KeyError. Oddly, it appears this occurs only if the DatetimeIndex is the columns and not on the index.

Based on the KeyError it appears that for some reason .loc is looking for the pre-DST change value when it should be looking at the post-DST value. Reproducing the bug requires at least 2 DST switches in the original DataFrame. DataFrames with just a single DST switch do not seem to exhibit the behavior unless they are views on a larger DataFrame with two switches.

This is new on master; behavior doesn't occur in 0.22.

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: e3b87c1
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 62 Stepping 4, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.0.dev0+422.ge3b87c1
pytest: 3.3.2
pip: 9.0.1
setuptools: 38.4.0
Cython: 0.27.3
numpy: 1.14.0
scipy: 1.0.0
pyarrow: 0.8.0
xarray: 0.10.0
IPython: 6.2.1
sphinx: 1.6.6
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: 2.4.10
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.1
pymysql: 0.7.11.None
psycopg2: None
jinja2: 2.10
s3fs: 0.1.2
fastparquet: 0.1.4
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Regression Functionality that used to work in a prior pandas version Timezones Timezone data dtype labels Mar 2, 2018
@gfyoung
Copy link
Member

gfyoung commented Mar 2, 2018

If you're familiar with git bisect, you can track down the exact commit that created this regression. Investigation and PR are welcome!

@Liam3851
Copy link
Contributor Author

Liam3851 commented Mar 5, 2018

Checked out git bisect-- it appears the issue is with #18618 (CC: @jbrockmendel)

[c0e3767] handle DST appropriately in Timestamp.replace (#18618)

I'm not familiar with Cython and it's unclear to me why this commit would have this effect.

This was the unit test I wrote for the git bisect, if that helps:

    def test_transpose_dst_changes(self):
        dstchgidx = date_range('20160101', '20161130', freq='4H',
                               tz='America/New_York')
        
        df = DataFrame({'a':np.arange(len(dstchgidx))},
                       dstchgidx)
        assert_frame_equal(df, df)
        assert_frame_equal(df.T, df.T)

@Liam3851
Copy link
Contributor Author

Liam3851 commented May 2, 2018

@jbrockmendel @TomAugspurger This regression from 0.22 is persisting in 0.23.0rc2.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented May 2, 2018 via email

@Liam3851
Copy link
Contributor Author

Liam3851 commented May 2, 2018

@TomAugspurger Mostly bumping because of the request for bug reports against the RC. I've never written any Cython at all... was sort of hoping someone who understood #18618 would have an idea what was going on here. That said, I'll take another look, since it might be fixable outside the Cython if we think the Cython logic is correct.

@jbrockmendel
Copy link
Member

I'm looking into this now. (df.T).equals(df.T) still returns True. It's col in df.T.columns that comes back False for col == Timestamp('2016-11-06 01:00:00-0400', tz='America/New_York', freq='4H')

It looks like the issue is in DatetimeIndex.get_loc. With key = col from above, get_loc casts key = Timestamp(key, tz=self.tz), which loses the freq attribute, then passes through to Index.get_loc. If we bypass the casting and go straight to Index.get_loc, it goes through correctly.

@jbrockmendel
Copy link
Member

I semi-take it back. The problem is this:

col = dstchgidx[1860]
col2 = pd.Timestamp(col, tz=dstchgidx.tz)

>>> col.value
1478408400000000000
>>> col2.value
1478412000000000000

1 AM on 2016-11-06 happens twice in America/New York, and calling the Timestamp constructor here is changing which one we're looking at.

@Liam3851
Copy link
Contributor Author

Liam3851 commented May 2, 2018

Thanks @jbrockmendel! In that case would this be a sufficient fix?

-            key = Timestamp(key, tz=self.tz)

+            if key.tzinfo is None and self.tz is not None:
+                key = Timestamp(key, tz=self.tz)
+            elif not isinstance(key, Timestamp):
+                key = Timestamp(key)

Idea here being that if the Timestamp comes in tz-localized we don't subsequently try to re-localize it (and lose the ambiguity resolution that has already taken place)?

I notice that pattern occurs in other places in datetimes.py-- get_value, get_value_maybe_box, etc. Those should presumably change as well?

@jbrockmendel
Copy link
Member

Tracking it down a little further, I'm now thinking the fix is in convert_datetime_to_tsobject (via Timestamp.__new__). This will take a little more tracking down.

In that case would this be a sufficient fix?

It looks like the suggested solution leaves key unchanged in the case where neither key.tzinfo nor self.tz is None. There are a handful of corner cases to worry about. For starters, what if they are both tz-aware but have different timezones?

@Liam3851
Copy link
Contributor Author

Liam3851 commented May 2, 2018

What if they are both tz-aware but have different timezones?

To be honest, I don't know why, but this does seem to work with the above fix:

In [15]: dt_tokyo = pytz.timezone('Asia/Tokyo').localize(datetime.datetime(2016, 11, 6, 14))

In [16]: df.index.get_loc(dt_tokyo)
Out[16]: 1860

In [17]: df.loc[dt_tokyo]
Out[17]:
a    1860
Name: 2016-11-06 01:00:00-04:00, dtype: int32

That said, my understanding of how any of this works is extremely hazy (I've never really looked at any of this code before) so I defer to you as to the best solution.

@mroeschke
Copy link
Member

mroeschke commented May 2, 2018

@jbrockmendel xref #20854 (comment)

I think the problem is in this branch:

if hasattr(tz, 'normalize') and hasattr(ts.tzinfo, '_utcoffset'):
ts = tz.normalize(ts)
obj.value = pydatetime_to_dt64(ts, &obj.dts)
obj.tzinfo = ts.tzinfo

I think tz.normalize(ts) where ts is a Timestamp with a tz might not be re-normalizing this correctly. (if a Timestamp with a tz hits this branch, probably doesn't need to be normalized again?)

@jbrockmendel
Copy link
Member

@mroeschke that's a part of the conversion code that I've been uncomfortable with (see above it the comment "# sort of a temporary hack"). In particular, the thing we may need @jreback to weigh in on is if the ts = tz.normalize(ts) call should always be ts.value-preserving (when ts is a Timestamp). If so, then line 353 is definitely a (possibly the) problem.

@jreback
Copy link
Contributor

jreback commented May 3, 2018

here's a repro

In [62]: t1 = pd.Timestamp('20161106 01:00:00').tz_localize('America/New_York',ambiguous=True)

In [63]: t2 = pd.Timestamp('20161106 01:00:00').tz_localize('America/New_York',ambiguous=False)

In [64]: t1
Out[64]: Timestamp('2016-11-06 01:00:00-0400', tz='America/New_York')

In [65]: t2
Out[65]: Timestamp('2016-11-06 01:00:00-0500', tz='America/New_York')

In [66]: pd.Timestamp(t1.tz.normalize(t1.to_pydatetime()))
Out[66]: Timestamp('2016-11-06 01:00:00-0400', tz='America/New_York')

In [67]: pd.Timestamp(t1.tz.normalize(t1))
Out[67]: Timestamp('2016-11-06 01:00:00-0500', tz='America/New_York')

In [68]: pd.Timestamp(t2.tz.normalize(t2.to_pydatetime()))
Out[68]: Timestamp('2016-11-06 01:00:00-0500', tz='America/New_York')

In [69]: pd.Timestamp(t2.tz.normalize(t2))
Out[69]: Timestamp('2016-11-06 01:00:00-0500', tz='America/New_York')

would expected 68==69 [True]
and 66==67 [False]

so the round-trip works if we have a datetime but not a Timestamp.

here's the doc-string for .normalize()In [76]: tz.normalize?

Signature: tz.normalize(dt)
Docstring:
Correct the timezone information on the given datetime

If date arithmetic crosses DST boundaries, the tzinfo
is not magically adjusted. This method normalizes the
tzinfo to the correct one.

I think that the 'hack' here (should remove that line!). is that you have to do this. I think this may have been working around a construction bug elsewhere in pandas, but not really sure.

Since this routine can take datetimes or Timestamps, then we prob could just coerce to datetimes at the beginning leave this logic.

@jreback
Copy link
Contributor

jreback commented May 3, 2018

so @jbrockmendel is correct, we need a .value preserving call here.

@jreback jreback added this to the 0.23.0 milestone May 3, 2018
@jreback jreback modified the milestones: 0.23.0, 0.23.1 May 14, 2018
@jreback jreback modified the milestones: 0.23.1, 0.23.2 Jun 7, 2018
mroeschke added a commit to mroeschke/pandas that referenced this issue Jun 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Regression Functionality that used to work in a prior pandas version Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants