merge_asof with one tz-aware datetime "by" parameter and another parameter raises #26649

0x26res · 2019-06-04T18:08:00Z

Code Sample, a copy-pastable example if possible

import pandas as pd

left = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['HELLO'],
    'on_col': [2],
    'value': ['a']})
right = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['WORLD'],
    'on_col': [1],
    'value': ['b']})
pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')

Problem description

This is very similar to: #21184

The only difference is that the merge_asof by is made up of two columns (instead of one):

one is tz-aware
the other one is something else (string, number, etc.)

When running this, I get:

Traceback (most recent call last):
  File "test.py", line 13, in <module>
    pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 462, in merge_asof
    return op.get_result()
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1256, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 756, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1504, in _get_join_indexers
    left_by_values = flip(left_by_values)
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1457, in flip
    return np.array(lzip(*xs), labeled_dtypes)
  File "myenv/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 150, in __repr__
    return str(self)
  File "myenv/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 129, in __str__
    return self.__unicode__()
  File "myenv/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 704, in __unicode__
    return "datetime64[{unit}, {tz}]".format(unit=self.unit, tz=self.tz)
SystemError: PyEval_EvalFrameEx returned a result with an error set

Expected Output

I expect merge_asof to work and to match on both by columns accordingly.
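Until a fix is available, one possible workaround is sketched below. It assumes the tz-aware by column is in UTC (as in the report) and uses a hypothetical helper, `drop_tz`, that is not part of pandas: the column is converted to naive UTC before the merge and localized again afterwards. This has not been verified against 0.24.2 specifically.

```python
import pandas as pd

left = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['HELLO'],
    'on_col': [2],
    'value': ['a']})
right = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['WORLD'],
    'on_col': [1],
    'value': ['b']})


def drop_tz(df, col):
    """Hypothetical helper: return a copy with `col` as naive UTC datetimes."""
    out = df.copy()
    # tz_convert(None) converts to UTC and removes the timezone information.
    out[col] = out[col].dt.tz_convert(None)
    return out


merged = pd.merge_asof(
    drop_tz(left, 'by_col1'),
    drop_tz(right, 'by_col1'),
    by=['by_col1', 'by_col2'],
    on='on_col')

# Restore the timezone on the result (assumes the original column was UTC).
merged['by_col1'] = merged['by_col1'].dt.tz_localize('UTC')
```

Because `by_col2` differs between the two frames ('HELLO' vs 'WORLD'), no right-hand row matches, so the right-hand value column is expected to be missing in the result.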

Output of `0.24.2`

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.3.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 40.8.0
Cython: 0.28.5
numpy: 1.16.4
scipy: 1.1.0
pyarrow: 0.12.1
xarray: None
IPython: 7.3.0
sphinx: 1.4.6
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.1
bs4: None
html5lib: None
sqlalchemy: 1.2.18
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None

mroeschke · 2019-06-04T22:42:43Z

Thanks for the report. Here's the traceback I get on master. Investigations and PRs welcome!

In [4]: import pandas as pd
   ...:
   ...: left = pd.DataFrame({
   ...:     'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
   ...:     'by_col2': ['HELLO'],
   ...:     'on_col': [2],
   ...:     'value': ['a']})
   ...: right = pd.DataFrame({
   ...:     'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
   ...:     'by_col2': ['WORLD'],
   ...:     'on_col': [1],
   ...:     'value': ['b']})
   ...: pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')
<DatetimeTZDtype object at 0x1180be080>---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-450ebb8f2376> in <module>
     11     'on_col': [1],
     12     'value': ['b']})
---> 13 pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')

~/pandas-mroeschke/pandas/core/reshape/merge.py in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction)
    465                     allow_exact_matches=allow_exact_matches,
    466                     direction=direction)
--> 467     return op.get_result()
    468
    469

~/pandas-mroeschke/pandas/core/reshape/merge.py in get_result(self)
   1296
   1297     def get_result(self):
-> 1298         join_index, left_indexer, right_indexer = self._get_join_info()
   1299
   1300         # this is a bit kludgy

~/pandas-mroeschke/pandas/core/reshape/merge.py in _get_join_info(self)
    759         else:
    760             (left_indexer,
--> 761              right_indexer) = self._get_join_indexers()
    762
    763             if self.right_index:

~/pandas-mroeschke/pandas/core/reshape/merge.py in _get_join_indexers(self)
   1560                 right_by_values = right_by_values[0]
   1561             else:
-> 1562                 left_by_values = flip(left_by_values)
   1563                 right_by_values = flip(right_by_values)
   1564

~/pandas-mroeschke/pandas/core/reshape/merge.py in flip(xs)
   1513             dtypes = [x.dtype for x in xs]
   1514             labeled_dtypes = list(zip(labels, dtypes))
-> 1515             return np.array(list(zip(*xs)), labeled_dtypes)
   1516
   1517         # values to compare

TypeError: data type not understood

In [5]: pd.__version__
Out[5]: '0.25.0.dev0+657.gc07d71d13'

luckydenis · 2019-06-11T12:21:52Z

Good day. While debugging I found the following. There seems to be a problem with the type of the second column: either the dtypes should be [('a', datetime64[ns, UTC]), ('b', dtype('U'))], or we have to return an object array. Judging by the error message, that looks plausible.

pandas/pandas/core/reshape/merge.py

Lines 1510 to 1515 in ea06f8d

    
def flip(xs):
    """ unlike np.transpose, this returns an array of tuples """
    labels = list(string.ascii_lowercase[:len(xs)])
    dtypes = [x.dtype for x in xs]
    labeled_dtypes = list(zip(labels, dtypes))
    return np.array(list(zip(*xs)), labeled_dtypes)

(Pdb) xs
[<DatetimeArray>
['2018-01-01 00:00:00+00:00']
Length: 1, dtype: datetime64[ns, UTC], array(['HELLO'], dtype=object)]

(Pdb) lzip(*xs)
[(Timestamp('2018-01-01 00:00:00+0000', tz='UTC'), 'HELLO')]

(Pdb) labeled_dtypes
[('a', datetime64[ns, UTC]), ('b', dtype('O'))]

(Pdb)

TomAugspurger · 2019-06-11T14:31:16Z

DatetimeArray[ns, tz].__iter__ will return an ndarray of Timestamp objects. I'm not familiar with this section of the code, but can we use i8values rather than the datetimes at this point?
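A minimal sketch of the distinction being drawn here, assuming a reasonably recent pandas: iterating over tz-aware datetimes yields `Timestamp` objects, while `DatetimeIndex.asi8` exposes the underlying int64 epoch values (nanoseconds for `datetime64[ns, UTC]`).

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC')

# Iteration yields pandas Timestamp objects that carry the timezone.
first = next(iter(idx))

# asi8 gives the plain int64 representation, which NumPy understands natively.
i8 = idx.asi8

print(type(first).__name__, first.tzinfo)
print(i8.dtype)
```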

luckydenis · 2019-06-13T12:17:20Z

DatetimeArray[ns, tz].__iter__ will return an ndarray of Timestamp objects. I'm not familiar with this section of the code, but can we use i8values rather than the datetimes at this point?

pandas/pandas/core/reshape/merge.py

Lines 1510 to 1515 in ea06f8d

    
def flip(xs):
    """ unlike np.transpose, this returns an array of tuples """
    labels = list(string.ascii_lowercase[:len(xs)])
    dtypes = [x.dtype for x in xs]
    labeled_dtypes = list(zip(labels, dtypes))
    return np.array(list(zip(*xs)), labeled_dtypes)

I rewrote the dtypes line to convert to 'i8', like this:

dtypes = [x.view('i8') if needs_i8_conversion(x.dtype) else x.dtype for x in xs]

Error:

TypeError: data type not understood

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "TestStand/Main.py", line 16, in <module>
    pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 462, in merge_asof
    return op.get_result()
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1258, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 758, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1507, in _get_join_indexers
    left_by_values = flip(left_by_values)
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1459, in flip
    return np.array(buff, labeled_dtypes)
  File "venv/lib/python3.7/site-packages/numpy/core/arrayprint.py", line 1404, in _array_repr_implementation
    if type(arr) is not ndarray:
SystemError: <class 'type'> returned a result with an error set

And here is what gets printed at one of the steps when stepping deeper with pdb:

next
array([1514764800000000000])TypeError: data type not understood

It seems to me that there should be a float type here. Is this a bug, or is it expected?

(Pdb) dtypes
[array([1514764800000000000]), dtype('O')]
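The pdb output above shows that the rewritten line put the converted array itself into `dtypes`, where NumPy expects a dtype. One possible alternative, sketched below, converts the values first and then derives the dtypes from the converted arrays. `flip_i8` is a hypothetical name, and this is not necessarily what the eventual fix does.

```python
import string

import numpy as np
import pandas as pd


def flip_i8(xs):
    """Hypothetical variant of flip: tz-aware inputs become int64 first."""
    converted = []
    for x in xs:
        if isinstance(x.dtype, pd.DatetimeTZDtype):
            # Replace tz-aware datetimes by their int64 epoch values.
            x = pd.DatetimeIndex(x).asi8
        converted.append(np.asarray(x))
    labels = list(string.ascii_lowercase[:len(converted)])
    # Derive the dtypes from the converted arrays, not the original ones.
    labeled_dtypes = [(label, x.dtype) for label, x in zip(labels, converted)]
    return np.array(list(zip(*converted)), dtype=labeled_dtypes)


xs = [
    pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC').array,
    np.array(['HELLO'], dtype=object),
]
out = flip_i8(xs)
print(out.dtype)
```

Equal timestamps map to equal integers, so grouping on the integer field should behave the same as grouping on the original datetimes, provided both sides use the same resolution.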

TomAugspurger · 2019-06-13T17:11:14Z

I don't really understand what flip is doing, but we're making a numpy record array / structured dtype. We apparently can't pass a datetime64[ns, tz] array into flip.
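For reference, here is a small sketch of what `flip` produces when every input is a plain NumPy array: a one-dimensional structured array with one record per row and one labeled field per by column. The function body is copied from the snippet quoted earlier in the thread; the input data is made up for illustration.

```python
import string

import numpy as np


def flip(xs):
    """ unlike np.transpose, this returns an array of tuples """
    labels = list(string.ascii_lowercase[:len(xs)])
    dtypes = [x.dtype for x in xs]
    labeled_dtypes = list(zip(labels, dtypes))
    return np.array(list(zip(*xs)), labeled_dtypes)


# Two "by" columns with NumPy-native dtypes (illustrative data).
xs = [
    np.array([1, 2], dtype='int64'),
    np.array(['HELLO', 'WORLD'], dtype=object),
]
rec = flip(xs)

print(rec.dtype)      # structured dtype with fields 'a' and 'b'
print(rec[0]['b'])    # field access on a single record
```

A `datetime64[ns, tz]` column has a pandas extension dtype rather than a NumPy dtype, so it cannot serve as a field type in `labeled_dtypes`, which is consistent with the `TypeError: data type not understood` reported above.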

mroeschke added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Timezones (Timezone data dtype) labels Jun 4, 2019

mroeschke mentioned this issue Jul 5, 2019

BUG: merge_asof with multiple by columns with tz #27243

Merged

4 tasks

jreback added this to the 0.25.0 milestone Jul 5, 2019

jreback closed this as completed in #27243 Jul 5, 2019
