merge_asof with one tz-aware datetime "by" parameter and another parameter raises #26649

Closed
0x26res opened this issue Jun 4, 2019 · 5 comments · Fixed by #27243
Labels
Bug · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) · Timezones (Timezone data dtype)

Comments

@0x26res

0x26res commented Jun 4, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd

left = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['HELLO'],
    'on_col': [2],
    'value': ['a']})
right = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['WORLD'],
    'on_col': [1],
    'value': ['b']})
pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')

Problem description

This is very similar to: #21184

The only difference is that the by argument of merge_asof consists of two columns instead of one:

  • one is tz-aware
  • the other is of some other type (string, number, etc.)

When running this, I get:

Traceback (most recent call last):
  File "test.py", line 13, in <module>
    pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 462, in merge_asof
    return op.get_result()
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1256, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 756, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1504, in _get_join_indexers
    left_by_values = flip(left_by_values)
  File "myenv/lib/python3.6/site-packages/pandas/core/reshape/merge.py", line 1457, in flip
    return np.array(lzip(*xs), labeled_dtypes)
  File "myenv/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 150, in __repr__
    return str(self)
  File "myenv/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 129, in __str__
    return self.__unicode__()
  File "myenv/lib/python3.6/site-packages/pandas/core/dtypes/dtypes.py", line 704, in __unicode__
    return "datetime64[{unit}, {tz}]".format(unit=self.unit, tz=self.tz)
SystemError: PyEval_EvalFrameEx returned a result with an error set

Expected Output

I expect merge_asof to work and to match rows on both by columns.

Output of pd.show_versions() (pandas 0.24.2)

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.10.0-862.3.3.el7.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.2
pytest: 4.5.0
pip: 19.1.1
setuptools: 40.8.0
Cython: 0.28.5
numpy: 1.16.4
scipy: 1.1.0
pyarrow: 0.12.1
xarray: None
IPython: 7.3.0
sphinx: 1.4.6
patsy: 0.5.1
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.9
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.3
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.3.1
bs4: None
html5lib: None
sqlalchemy: 1.2.18
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: 0.7.0
gcsfs: None

@mroeschke
Member

Thanks for the report. Here's the traceback I get on master. Investigations and PRs welcome!

In [4]: import pandas as pd
   ...:
   ...: left = pd.DataFrame({
   ...:     'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
   ...:     'by_col2': ['HELLO'],
   ...:     'on_col': [2],
   ...:     'value': ['a']})
   ...: right = pd.DataFrame({
   ...:     'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
   ...:     'by_col2': ['WORLD'],
   ...:     'on_col': [1],
   ...:     'value': ['b']})
   ...: pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')
<DatetimeTZDtype object at 0x1180be080>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-450ebb8f2376> in <module>
     11     'on_col': [1],
     12     'value': ['b']})
---> 13 pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')

~/pandas-mroeschke/pandas/core/reshape/merge.py in merge_asof(left, right, on, left_on, right_on, left_index, right_index, by, left_by, right_by, suffixes, tolerance, allow_exact_matches, direction)
    465                     allow_exact_matches=allow_exact_matches,
    466                     direction=direction)
--> 467     return op.get_result()
    468
    469

~/pandas-mroeschke/pandas/core/reshape/merge.py in get_result(self)
   1296
   1297     def get_result(self):
-> 1298         join_index, left_indexer, right_indexer = self._get_join_info()
   1299
   1300         # this is a bit kludgy

~/pandas-mroeschke/pandas/core/reshape/merge.py in _get_join_info(self)
    759         else:
    760             (left_indexer,
--> 761              right_indexer) = self._get_join_indexers()
    762
    763             if self.right_index:

~/pandas-mroeschke/pandas/core/reshape/merge.py in _get_join_indexers(self)
   1560                 right_by_values = right_by_values[0]
   1561             else:
-> 1562                 left_by_values = flip(left_by_values)
   1563                 right_by_values = flip(right_by_values)
   1564

~/pandas-mroeschke/pandas/core/reshape/merge.py in flip(xs)
   1513             dtypes = [x.dtype for x in xs]
   1514             labeled_dtypes = list(zip(labels, dtypes))
-> 1515             return np.array(list(zip(*xs)), labeled_dtypes)
   1516
   1517         # values to compare

TypeError: data type not understood

In [5]: pd.__version__
Out[5]: '0.25.0.dev0+657.gc07d71d13'

mroeschke added the Bug, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Timezones (Timezone data dtype) labels on Jun 4, 2019

@luckydenis
Contributor

luckydenis commented Jun 11, 2019

Good day. While debugging I found the following. There seems to be a problem with the dtype of the second column: it should be [('a', datetime64[ns, UTC]), ('b', dtype('U'))], or else we have to return an object array. Judging by the error message, this looks plausible.

def flip(xs):
    """ unlike np.transpose, this returns an array of tuples """
    labels = list(string.ascii_lowercase[:len(xs)])
    dtypes = [x.dtype for x in xs]
    labeled_dtypes = list(zip(labels, dtypes))
    return np.array(list(zip(*xs)), labeled_dtypes)

(Pdb) xs
[<DatetimeArray>
['2018-01-01 00:00:00+00:00']
Length: 1, dtype: datetime64[ns, UTC], array(['HELLO'], dtype=object)]

(Pdb) lzip(*xs)
[(Timestamp('2018-01-01 00:00:00+0000', tz='UTC'), 'HELLO')]

(Pdb) labeled_dtypes
[('a', datetime64[ns, UTC]), ('b', dtype('O'))]

(Pdb) 
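
The labeled_dtypes value above mixes a pandas extension dtype with a NumPy dtype. A minimal sketch of why that fails, assuming a reasonably recent pandas and NumPy: NumPy is asked to build a structured dtype from a field type it does not know. The helper name try_structured_dtype is hypothetical and only serves the illustration; the exact error message may differ between NumPy versions.

```python
import numpy as np
import pandas as pd


def try_structured_dtype(fields):
    """Return (ok, error_type_name) for building a NumPy structured dtype.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    try:
        np.dtype(fields)
    except (TypeError, ValueError) as exc:
        return False, type(exc).__name__
    return True, None


# Same shape as labeled_dtypes in the pdb session: a tz-aware pandas dtype
# next to a plain NumPy object dtype.
tz_dtype = pd.DatetimeTZDtype(tz="UTC")
bad = try_structured_dtype([("a", tz_dtype), ("b", np.dtype("O"))])

# Using only NumPy dtypes (int64 for the timestamps) is accepted.
good = try_structured_dtype([("a", np.dtype("int64")), ("b", np.dtype("O"))])

print("tz-aware field:", bad)
print("int64 field:   ", good)
```

This suggests the object column is not the culprit; the tz-aware dtype in the first slot is what NumPy cannot interpret.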

@TomAugspurger
Contributor

DatetimeArray[ns, tz].__iter__ will return an ndarray of Timestamp objects. I'm not familiar with this section of the code, but can we use i8values rather than the datetimes at this point?

@luckydenis
Contributor

luckydenis commented Jun 13, 2019

> DatetimeArray[ns, tz].__iter__ will return an ndarray of Timestamp objects. I'm not familiar with this section of the code, but can we use i8values rather than the datetimes at this point?

def flip(xs):
    """ unlike np.transpose, this returns an array of tuples """
    labels = list(string.ascii_lowercase[:len(xs)])
    dtypes = [x.dtype for x in xs]
    labeled_dtypes = list(zip(labels, dtypes))
    return np.array(list(zip(*xs)), labeled_dtypes)

I rewrote the dtype computation to convert to 'i8', like this:

dtypes = [x.view('i8') if needs_i8_conversion(x.dtype) else x.dtype for x in xs]

Error:

TypeError: data type not understood

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "TestStand/Main.py", line 16, in <module>
    pd.merge_asof(left, right, by=['by_col1', 'by_col2'], on='on_col')
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 462, in merge_asof
    return op.get_result()
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1258, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 758, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1507, in _get_join_indexers
    left_by_values = flip(left_by_values)
  File "venv/lib/python3.7/site-packages/pandas/core/reshape/merge.py", line 1459, in flip
    return np.array(buff, labeled_dtypes)
  File "venv/lib/python3.7/site-packages/numpy/core/arrayprint.py", line 1404, in _array_repr_implementation
    if type(arr) is not ndarray:
SystemError: <class 'type'> returned a result with an error set

And here is what is printed at one of the steps when stepping deeper with pdb:

next
array([1514764800000000000])
TypeError: data type not understood

It seems to me that there should be a float type here. Is this a bug, or is it expected?

(Pdb) dtypes
[array([1514764800000000000]), dtype('O')]
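
The dtypes list above holds an array in its first slot rather than a dtype, because x.view('i8') returns converted values, not a type. A sketch of a variant that converts the values and then reads the dtypes from the converted arrays might look like the following. The name flip_i8 is hypothetical, and this is only an illustration of the i8 idea discussed in this thread, not the change that was eventually merged into pandas.

```python
import string

import numpy as np
import pandas as pd


def flip_i8(xs):
    """Like flip, but tz-aware datetimes are first converted to int64.

    Hypothetical sketch: returns a structured array of tuples whose
    fields all have plain NumPy dtypes.
    """
    arrays = [
        # DatetimeIndex.asi8 exposes the underlying int64 epoch values.
        pd.DatetimeIndex(x).asi8
        if isinstance(x.dtype, pd.DatetimeTZDtype)
        else np.asarray(x)
        for x in xs
    ]
    labels = list(string.ascii_lowercase[:len(arrays)])
    # Take the dtypes from the converted arrays, so only NumPy dtypes remain.
    labeled_dtypes = [(label, arr.dtype) for label, arr in zip(labels, arrays)]
    return np.array(list(zip(*arrays)), dtype=labeled_dtypes)


xs = [
    pd.DatetimeIndex(["2018-01-01"]).tz_localize("UTC").array,
    np.array(["HELLO"], dtype=object),
]
flipped = flip_i8(xs)
print(flipped.dtype)
```

The design point is that both the values and the dtypes must be converted together; converting only one of them reproduces the mismatch shown in the pdb output.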

@TomAugspurger
Contributor

I don't really understand what flip is doing, but we're making a numpy record array / structured dtype. We apparently can't pass a datetime64[ns, tz] array into flip.
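
Until flip can handle such input, a user-side workaround can be sketched as follows: replace the tz-aware by column with an int64 helper key before calling merge_asof, and drop the helper afterwards. The function name merge_asof_tz_by and the helper column name _tz_by_i8 are hypothetical, and the sketch assumes the tz-aware column has no missing values. It has not been checked against pandas 0.24.2 specifically.

```python
import pandas as pd


def merge_asof_tz_by(left, right, tz_by, other_by, on):
    """merge_asof on [tz_by] + other_by, matching tz_by via int64 values.

    Hypothetical helper; assumes tz_by is tz-aware in both frames.
    """
    key = "_tz_by_i8"  # hypothetical helper column name

    def to_i8(col):
        # Normalize to naive UTC nanoseconds, then expose the int64 values.
        return col.dt.tz_convert(None).astype("datetime64[ns]").astype("int64")

    left_tmp = left.assign(**{key: to_i8(left[tz_by])})
    # Drop the tz-aware column on the right so it does not come back
    # as a suffixed duplicate in the result.
    right_tmp = right.drop(columns=[tz_by]).assign(**{key: to_i8(right[tz_by])})
    merged = pd.merge_asof(left_tmp, right_tmp, on=on, by=[key] + list(other_by))
    return merged.drop(columns=[key])


left = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['HELLO'],
    'on_col': [2],
    'value': ['a']})
right = pd.DataFrame({
    'by_col1': pd.DatetimeIndex(['2018-01-01']).tz_localize('UTC'),
    'by_col2': ['WORLD'],
    'on_col': [1],
    'value': ['b']})

result = merge_asof_tz_by(left, right, 'by_col1', ['by_col2'], 'on_col')
print(result)
```

With the frames from the report, by_col2 differs between the two sides ('HELLO' vs 'WORLD'), so the single left row is kept and no right-hand value is matched.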

5 participants