Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bug: merge_asof with tz-aware datetime "by" parameter raises #21184

Closed
ldmnt opened this issue May 23, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@ldmnt
Copy link

commented May 23, 2018

Code Sample

import pandas as pd

left = pd.DataFrame({'by_col': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
                     'on_col': [2], 'values': ['a']})
right = pd.DataFrame({'by_col': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
                      'on_col': [1], 'values': ['b']})

merged = pd.merge_asof(left, right, by='by_col', on='on_col')

Error traceback

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Hamb\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 478, in merge_asof
    return op.get_result()
  File "C:\Users\Hamb\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 1163, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "C:\Users\Hamb\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 776, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "C:\Users\Hamb\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\reshape\merge.py", line 1437, in _get_join_indexers
    tolerance)
TypeError: Argument 'left_by_values' has incorrect type (expected numpy.ndarray, got Index)

Problem description

Function pandas.merge_asof raises when "by" parameter is provided a column of tz-aware datetime type.
Note that the same code with tz-naive datetimes works :

import pandas as pd

left = pd.DataFrame({'by_col': pd.DatetimeIndex(['2018-01-01']),
                     'on_col': [2], 'values': ['a']})
right = pd.DataFrame({'by_col': pd.DatetimeIndex(['2018-01-01']),
                      'on_col': [1], 'values': ['b']})

merged = pd.merge_asof(left, right, by='by_col', on='on_col')
print(merged)

outputs :

    by_col  on_col values_x values_y
0 2018-01-01       2        a        b

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: FR
LOCALE: None.None
pandas: 0.23.0
pytest: None
pip: 10.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.1
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jschendel

This comment has been minimized.

Copy link
Member

commented May 23, 2018

Thanks, I can confirm that this bug is occurring on master. PR to fix is welcome!

@jschendel

This comment has been minimized.

Copy link
Member

commented May 23, 2018

xref #14844 : a similar merge_asof tz-aware issue that's been fixed, and could potentially be useful for determining a fix here (not certain though).

@ldmnt

This comment has been minimized.

Copy link
Author

commented May 23, 2018

I'd be glad to help, but have no experience in contributing to a big project. So I can try to find a fix when I have time to dive into it, but no promises yet ! :)

@mroeschke

This comment has been minimized.

Copy link
Member

commented May 24, 2018

When the _get_merge_keys function preps the keys in pandas/core/reshape/merge.py, essentially this happens:

In [8]: left
Out[8]:
                     by_col  on_col values
0 2018-01-01 00:00:00+00:00       2      a

In [9]: left_naive
Out[9]:
      by_col  on_col values
0 2018-01-01       2      a

In [10]: left._get_label_or_level_values('by_col')
Out[10]: DatetimeIndex(['2018-01-01'], dtype='datetime64[ns, UTC]', freq=None)

In [11]: left_naive._get_label_or_level_values('by_col')
Out[11]: array(['2018-01-01T00:00:00.000000000'], dtype='datetime64[ns]')

The results are cast to object dtype, but are passed to a cython function that expects a numpy array instead of an Index. A .values call is needed somewhere in the flow to cast timezone aware keys to a numpy array.

@jreback jreback added this to the 0.23.1 milestone May 25, 2018

@jreback

This comment has been minimized.

Copy link
Contributor

commented May 25, 2018

here's a patch to fix. prob could use some general refactoring in this routine (maybe), but can do in the future

diff --git a/pandas/core/reshape/merge.py b/pandas/core/reshape/merge.py
index 4d8897fb7..58454d0cf 100644
--- a/pandas/core/reshape/merge.py
+++ b/pandas/core/reshape/merge.py
@@ -1420,6 +1420,11 @@ class _AsOfMerge(_OrderedMerge):
                 left_by_values = flip(left_by_values)
                 right_by_values = flip(right_by_values)
 
+            # initial type conversion as needed
+            if needs_i8_conversion(left_by_values):
+                left_by_values = left_by_values.view('i8')
+                right_by_values = right_by_values.view('i8')
+
             # upcast 'by' parameter because HashTable is limited
             by_type = _get_cython_type_upcast(left_by_values.dtype)
             by_type_caster = _type_casters[by_type]
@ldmnt

This comment has been minimized.

Copy link
Author

commented May 25, 2018

I was actually looking into a fix too this morning, and in the process found out that the bug originated in a typing issue in the following function of pandas/core/generic.py (line 1327) :

 def _get_label_or_level_values(self, key, axis=0, stacklevel=1):
        """
        Return a 1-D array of values associated with `key`, a label or level
        from the given `axis`.

The expected return type is a numpy array, but the following (line 1375) :

values = self.xs(key, axis=other_axes[0])._values

produces a DateTimeIndex in the case when _values is accessed on a tz-aware datetime Series.
This is because _values is overridden in pandas/core/indexes/datetimes.py (line 675):

@property
    def _values(self):
        # tz-naive -> ndarray
        # tz-aware -> DatetimeIndex
        if self.tz is not None:
            return self
        else:
            return self.values

edit: The problem i'm pointing out also yields an error when doing things such as

left = pd.DataFrame({'on_col': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'), 'values': ['a']})
right = pd.DataFrame({'values': ['b']}, index=pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'))

merged = left.merge(right, left_on='on_col', right_index=True)

Sorry for the lack of details, i'm a bit out of time right now, i can elaborate later if needed.

@mroeschke

This comment has been minimized.

Copy link
Member

commented Jan 4, 2019

This looks to be fixed on master now. Could use a test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.