Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AmbiguousTimeError merging two timezone-aware DataFrames with DST change #18885

Closed
tapia opened this issue Dec 20, 2017 · 2 comments · Fixed by #22825
Closed

AmbiguousTimeError merging two timezone-aware DataFrames with DST change #18885

tapia opened this issue Dec 20, 2017 · 2 comments · Fixed by #22825
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Milestone

Comments

@tapia
Copy link

tapia commented Dec 20, 2017

When merging two DataFrames by a timezone-aware datetime column, if the datetime values doesn't include a DST change, there's no problem:

df1 = pd.DataFrame([pd.to_datetime('2017-10-30 02:00:00+01:00'),
                    pd.to_datetime('2017-10-30 03:00:00+01:00'),
                    pd.to_datetime('2017-10-30 04:00:00+01:00')],columns=['date'])
df1['date'] = df1['date'].dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df1['value'] = 1

df2 = pd.DataFrame([pd.to_datetime('2017-10-30 04:00:00+01:00'),
                    pd.to_datetime('2017-10-30 05:00:00+01:00'),
                    pd.to_datetime('2017-10-30 06:00:00+01:00')],columns=['date'])
df2['date'] = df2['date'].dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df2['value'] = 2

pd.merge(df1, df2, how='outer', on='date')

Output

                       date  value_x  value_y
0 2017-10-30 02:00:00+01:00      1.0      NaN
1 2017-10-30 03:00:00+01:00      1.0      NaN
2 2017-10-30 04:00:00+01:00      1.0      2.0
3 2017-10-30 05:00:00+01:00      NaN      2.0
4 2017-10-30 06:00:00+01:00      NaN      2.0

This is correct. But if the datetime values include a date with DST change, we get an AmbiguousTimeError exception:

df1 = pd.DataFrame([pd.to_datetime('2017-10-29 02:00:00+02:00'),
                    pd.to_datetime('2017-10-29 02:00:00+01:00'),
                    pd.to_datetime('2017-10-29 03:00:00+01:00')],columns=['date'])
df1['date'] = df1['date'].dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df1['value'] = 1

df2 = pd.DataFrame([pd.to_datetime('2017-10-29 03:00:00+01:00'),
                    pd.to_datetime('2017-10-29 04:00:00+01:00'),
                    pd.to_datetime('2017-10-29 05:00:00+01:00')],columns=['date'])
df2['date'] = df2['date'].dt.tz_localize('UTC').dt.tz_convert('Europe/Madrid')
df2['value'] = 2

pd.merge(df1, df2, how='outer', on='date')

Expected output

                       date  value_x   value_y
0 2017-10-29 02:00:00+02:00      1.0       NaN
1 2017-10-29 02:00:00+01:00      1.0       NaN
2 2017-10-29 03:00:00+01:00      1.0       2.0
3 2017-10-29 04:00:00+01:00      NaN       2.0
4 2017-10-29 05:00:00+01:00      NaN       2.0

Actual output

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "(...)/venv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 58, in merge
    return op.get_result()
  File "(...)//venv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 604, in get_result
    self._maybe_add_join_keys(result, left_indexer, right_indexer)
  File "(...)//venv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 715, in _maybe_add_join_keys
    key_col = Index(lvals).where(~mask, rvals)
  File "(...)//venv/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py", line 809, in where
    result = self._ensure_localized(result)
  File "(...)//venv/lib/python3.6/site-packages/pandas/core/indexes/datetimelike.py", line 230, in _ensure_localized
    result = result.tz_localize(self.tz)
  File "(...)//venv/lib/python3.6/site-packages/pandas/util/_decorators.py", line 118, in wrapper
    return func(*args, **kwargs)
  File "(...)//venv/lib/python3.6/site-packages/pandas/core/indexes/datetimes.py", line 1858, in tz_localize
    errors=errors)
  File "pandas/_libs/tslib.pyx", line 3593, in pandas._libs.tslib.tz_localize_to_utc
pytz.exceptions.AmbiguousTimeError: Cannot infer dst time from Timestamp('2017-10-29 02:00:00'), try using the 'ambiguous' argument

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Darwin
OS-release: 17.3.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: es_ES.UTF-8
LOCALE: es_ES.UTF-8

pandas: 0.21.1
pytest: 3.2.5
pip: 9.0.1
setuptools: 36.8.0
Cython: None
numpy: 1.13.3
scipy: None
pyarrow: None
xarray: None
IPython: None
sphinx: 1.5.3
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.9.6
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Dec 21, 2017

yeah, we are joining indicies and do a .where() on them. we drop the tz, do the op in i8, then localize to the original zone.

what we need is an attribute for Timestamp and DatetimeIndex like is_ambiguous, then we could record the ambiguous so we can recreate properly.

interested in a PR?

cc @jbrockmendel

@jreback jreback added Bug Timezones Timezone data dtype Difficulty Intermediate Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 21, 2017
@jreback jreback added this to the Next Major Release milestone Dec 21, 2017
@tapia
Copy link
Author

tapia commented Dec 21, 2017

I'd love to, but I don't know the pandas/numpy internals, and merge() doesn't sound like an easy place to start :-) Maybe with some guidance...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Timezones Timezone data dtype
Projects
None yet
2 participants