Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

`join` on tz-aware column and tz-aware index wrongly fails (regression from 0.22) #23931

Closed
louispotok opened this issue Nov 26, 2018 · 5 comments

Comments

@louispotok
Copy link
Contributor

commented Nov 26, 2018

Code Sample, a copy-pastable example if possible

df1 = pd.DataFrame(
    {
        'date': pd.date_range(start='2018-01-01', periods=5,tz='America/Chicago'), 
        'vals':list('abcde')
    }
)

df2 = pd.DataFrame(
    {
        'date': pd.date_range(start='2018-01-03',periods=5, tz='America/Chicago'), 
        'vals_2': list('tuvwx')
    }
)
joined = df1.join(df2.set_index('date'),on='date')

Problem description

In 0.23.4, this raises:
ValueError: You are trying to merge on datetime64[ns, America/Chicago] and datetime64[ns] columns. If you wish to proceed you should use pd.concat

This is incorrect - the two columns are both tz-aware and indeed df2.set_index('date').index.dtype == df1.date.dtype returns True.

In 0.22.0, this does not raise, and returns the expected output. Haven't run git bisect yet but seems like the logical next step.

Expected Output

0  2018-01-01 00:00:00-06:00    a    NaN
1  2018-01-02 00:00:00-06:00    b    NaN
2  2018-01-03 00:00:00-06:00    c      t
3  2018-01-04 00:00:00-06:00    d      u
4  2018-01-05 00:00:00-06:00    e      v

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.6.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-1072-aws
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 39.2.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.9.0
xarray: None
IPython: 6.5.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: 0.4.0
matplotlib: 2.1.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: None
psycopg2: 2.7.6.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Nov 26, 2018

Somewhere in self._get_merge_keys we coerce right's index to a ndarray

ipdb> self.right_join_keys
[array(['2018-01-03T06:00:00.000000000', '2018-01-04T06:00:00.000000000',
       '2018-01-05T06:00:00.000000000', '2018-01-06T06:00:00.000000000',
       '2018-01-07T06:00:00.000000000'], dtype='datetime64[ns]')]
ipdb>

I'd recommend going through core/reshape/merge.py::_MergeOperation._get_merge_keys to see where things go wrong.

@mroeschke

This comment has been minimized.

Copy link
Member

commented Nov 26, 2018

I think #21184 may be affected by the same root cause as this issue.

@louispotok

This comment has been minimized.

Copy link
Contributor Author

commented Dec 4, 2018

Thanks @TomAugspurger ! I think this patch should fix the issue (and added tests), if that looks about right I'll submit a PR.

louispotok@b29c6bc

@mroeschke haven't checked if this also fixed #21184 but will take a look.

@zhuoqiang

This comment has been minimized.

Copy link

commented Jan 9, 2019

same bug here:

while the following works fine

a = pd.DataFrame({'a': range(5), 'x': pd.date_range('2018-11-01', periods=5, freq='1T', tz=None)})
b = pd.DataFrame({'b': range(5)}, index=pd.date_range('2018-11-01', periods=5, freq='1T', tz=None))
a.join(b, on='x', lsuffix='_')

if we change tz to a specific time zone, like following:

a = pd.DataFrame({'a': range(5), 'x': pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc')})
b = pd.DataFrame({'b': range(5)}, index=pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc'))
a.join(b, on='x', lsuffix='_')

it failed:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-84-30f4c345c06a> in <module>
      1 a = pd.DataFrame({'a': range(5), 'x': pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc')})
      2 b = pd.DataFrame({'b': range(5)}, index=pd.date_range('2018-11-01', periods=5, freq='1T', tz='utc'))
----> 3 a.join(b, on='x', lsuffix='_')

lib/python3.7/site-packages/pandas/core/frame.py in join(self, other, on, how, lsuffix, rsuffix, sort)
   6334         # For SparseDataFrame's benefit
   6335         return self._join_compat(other, on=on, how=how, lsuffix=lsuffix,
-> 6336                                  rsuffix=rsuffix, sort=sort)
   6337 
   6338     def _join_compat(self, other, on=None, how='left', lsuffix='', rsuffix='',

lib/python3.7/site-packages/pandas/core/frame.py in _join_compat(self, other, on, how, lsuffix, rsuffix, sort)
   6349             return merge(self, other, left_on=on, how=how,
   6350                          left_index=on is None, right_index=True,
-> 6351                          suffixes=(lsuffix, rsuffix), sort=sort)
   6352         else:
   6353             if on is not None:

lib/python3.7/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()
     63 

lib/python3.7/site-packages/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    553         # validate the merge keys dtypes. We may need to coerce
    554         # to avoid incompat dtypes
--> 555         self._maybe_coerce_merge_keys()
    556 
    557         # If argument passed to validate,

lib/python3.7/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    990                 raise ValueError(msg)
    991             elif is_datetime64tz_dtype(lk) and not is_datetime64tz_dtype(rk):
--> 992                 raise ValueError(msg)
    993             elif not is_datetime64tz_dtype(lk) and is_datetime64tz_dtype(rk):
    994                 raise ValueError(msg)

ValueError: You are trying to merge on datetime64[ns, UTC] and datetime64[ns] columns. If you wish to proceed you should use pd.concat

pandas.version == '0.23.4'

@TomAugspurger

This comment has been minimized.

Copy link
Contributor

commented Jan 9, 2019

FYI, pd.merge works fine for these. Something is buggy in join before it calls the merge code.

@mroeschke mroeschke added the Timezones label Jan 9, 2019

@jreback jreback added this to the 0.24.2 milestone Feb 11, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.