BUG: pd.merge fail with numpy.intc on Windows #52451

Closed
2 of 3 tasks
hoxbro opened this issue Apr 5, 2023 · 8 comments · Fixed by #53175
Labels
Bug Regression Functionality that used to work in a prior pandas version Reshaping Concat, Merge/Join, Stack/Unstack, Explode

@hoxbro
Contributor

hoxbro commented Apr 5, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
from scipy.spatial import Delaunay

n_verts = 10
pts = np.random.randint(1, n_verts, (n_verts, 2))
tris = Delaunay(pts)

A = pd.DataFrame(tris.simplices)
B = pd.DataFrame(pts)
pd.merge(A, B, left_on=[0], right_on=[0])

Issue Description

The example raises a KeyError on Windows and Pandas 2.0.

The example works if I cast the simplices first: A = pd.DataFrame(tris.simplices.astype(int))
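
A minimal sketch of the same workaround without scipy, assuming the key column arrives as np.intc. The frame and column names (left, right, key) are made up for illustration. Casting the key to a fixed-width dtype such as np.int64 before merging sidesteps the failing lookup:

```python
import numpy as np
import pandas as pd

# Illustrative frames whose join key is stored as np.intc (C int).
left = pd.DataFrame({"key": np.array([1, 2, 3], dtype=np.intc), "l": ["a", "b", "c"]})
right = pd.DataFrame({"key": np.array([2, 3, 4], dtype=np.intc), "r": ["x", "y", "z"]})

# Cast only the join key to a fixed-width integer dtype before merging.
left_fixed = left.astype({"key": np.int64})
right_fixed = right.astype({"key": np.int64})

merged = pd.merge(left_fixed, right_fixed, on="key")
print(merged)
```

The cast copies the key column, so for very large frames it may be worth casting once where the data is created rather than at every merge.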

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[26], line 3
      1 A = pd.DataFrame(tris.simplices)
      2 B = pd.DataFrame(pts)
----> 3 pd.merge(A, B, left_on=[0], right_on=[0])

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:156, in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
    125 @Substitution("\nleft : DataFrame or named Series")
    126 @Appender(_merge_doc, indents=0)
    127 def merge(
   (...)
    140     validate: str | None = None,
    141 ) -> DataFrame:
    142     op = _MergeOperation(
    143         left,
    144         right,
   (...)
    154         validate=validate,
    155     )
--> 156     return op.get_result(copy=copy)

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:803, in _MergeOperation.get_result(self, copy)
    800 if self.indicator:
    801     self.left, self.right = self._indicator_pre_merge(self.left, self.right)
--> 803 join_index, left_indexer, right_indexer = self._get_join_info()
    805 result = self._reindex_and_concat(
    806     join_index, left_indexer, right_indexer, copy=copy
    807 )
    808 result = result.__finalize__(self, method=self._merge_type)

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:1051, in _MergeOperation._get_join_info(self)
   1047     join_index, right_indexer, left_indexer = _left_join_on_index(
   1048         right_ax, left_ax, self.right_join_keys, sort=self.sort
   1049     )
   1050 else:
-> 1051     (left_indexer, right_indexer) = self._get_join_indexers()
   1053     if self.right_index:
   1054         if len(self.left) > 0:

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:1024, in _MergeOperation._get_join_indexers(self)
   1022 def _get_join_indexers(self) -> tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]:
   1023     """return the join indexers"""
-> 1024     return get_join_indexers(
   1025         self.left_join_keys, self.right_join_keys, sort=self.sort, how=self.how
   1026     )

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:1645, in get_join_indexers(left_keys, right_keys, sort, how, **kwargs)
   1640 # get left & right join labels and num. of levels at each location
   1641 mapped = (
   1642     _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
   1643     for n in range(len(left_keys))
   1644 )
-> 1645 zipped = zip(*mapped)
   1646 llab, rlab, shape = (list(x) for x in zipped)
   1648 # get flat i8 keys from label lists

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:1642, in <genexpr>(.0)
   1638         return _get_no_sort_one_missing_indexer(left_n, False)
   1640 # get left & right join labels and num. of levels at each location
   1641 mapped = (
-> 1642     _factorize_keys(left_keys[n], right_keys[n], sort=sort, how=how)
   1643     for n in range(len(left_keys))
   1644 )
   1645 zipped = zip(*mapped)
   1646 llab, rlab, shape = (list(x) for x in zipped)

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:2382, in _factorize_keys(lk, rk, sort, how)
   2378         # error: Item "ndarray" of "Union[Any, ndarray]" has no attribute
   2379         # "_values_for_factorize"
   2380         rk, _ = rk._values_for_factorize()  # type: ignore[union-attr]
-> 2382 klass, lk, rk = _convert_arrays_and_get_rizer_klass(lk, rk)
   2384 rizer = klass(max(len(lk), len(rk)))
   2386 if isinstance(lk, BaseMaskedArray):

File ~\AppData\Local\mambaforge\envs\tmp\Lib\site-packages\pandas\core\reshape\merge.py:2449, in _convert_arrays_and_get_rizer_klass(lk, rk)
   2447         klass = _factorizers[lk.dtype.type]  # type: ignore[index]
   2448     else:
-> 2449         klass = _factorizers[lk.dtype.type]
   2451 else:
   2452     klass = libhashtable.ObjectFactorizer

KeyError: <class 'numpy.intc'>
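
The last frame looks up _factorizers[lk.dtype.type], so the lookup is keyed on the NumPy scalar class rather than on the bit width. A quick way to see which C-named integer types are separate classes from the fixed-width ones on a given machine is sketched below. The helper name describe is hypothetical, and the output differs by platform and NumPy version, so none is shown:

```python
import numpy as np

# Fixed-width names and C-named integer types exposed by NumPy.
SIZED = (np.int8, np.int16, np.int32, np.int64)
C_NAMED = (np.byte, np.short, np.intc, np.int_, np.longlong)


def describe(scalar_type):
    """Report the bit width of a scalar type and whether it is the very
    same class object as one of the fixed-width names."""
    dt = np.dtype(scalar_type)
    return {
        "name": scalar_type.__name__,
        "bits": dt.itemsize * 8,
        "is_sized_alias": any(scalar_type is sized for sized in SIZED),
    }


for c_type in C_NAMED:
    print(describe(c_type))
```

Any type reported with is_sized_alias set to False would miss a dict that only has the fixed-width classes as keys, even though its bit width matches one of them.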

Expected Behavior

No KeyError is raised and the merge succeeds.

Installed Versions

INSTALLED VERSIONS
------------------
commit           : 478d340667831908b5b4bf09a2787a11a14560c9
python           : 3.11.2.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.19045
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : English_United Kingdom.1252

pandas           : 2.0.0
numpy            : 1.24.2
pytz             : 2023.3
dateutil         : 2.8.2
setuptools       : 67.6.1
pip              : 23.0.1
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 3.1.2
IPython          : 8.12.0
pandas_datareader: None
bs4              : 4.12.0
bottleneck       : None
brotli           : 
fastparquet      : None
fsspec           : None
gcsfs            : None
matplotlib       : 3.7.1
numba            : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.10.1
snappy           : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2023.3
qtpy             : None
pyqt5            : None
@hoxbro hoxbro added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 5, 2023
@rjgildea

rjgildea commented Apr 5, 2023

Also stumbled across this issue when upgrading to pandas 2.0.0. Here is a slightly simpler reproducer that doesn't involve scipy:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': np.array([1, 2], dtype=np.intc)})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'b': np.array([3, 4], dtype=np.intc)})
df3 = df1.merge(df2, how='outer')
print(df3)

@rjgildea

rjgildea commented Apr 5, 2023

Adding np.intc: libhashtable.Int32Factorizer to the _factorizers dict in pandas/core/reshape/merge.py fixes the above code for me.
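
A sketch of that idea on a stand-in table. The dict below uses strings in place of the real libhashtable factorizer classes, and register_c_aliases is a hypothetical helper, not pandas code. It maps any C-named integer class missing from the table to the entry for its fixed-width equivalent, which would handle unsigned types such as np.uintc the same way:

```python
import numpy as np

# Stand-in for a table keyed by NumPy scalar class (values are placeholders).
factorizers = {
    np.int8: "Int8Factorizer",
    np.int16: "Int16Factorizer",
    np.int32: "Int32Factorizer",
    np.int64: "Int64Factorizer",
    np.uint8: "UInt8Factorizer",
    np.uint16: "UInt16Factorizer",
    np.uint32: "UInt32Factorizer",
    np.uint64: "UInt64Factorizer",
}

C_NAMED = (
    np.byte, np.ubyte, np.short, np.ushort, np.intc,
    np.uintc, np.int_, np.uint, np.longlong, np.ulonglong,
)


def register_c_aliases(table):
    """Add an entry for every C-named integer class that is not already a key,
    reusing the entry of the fixed-width type with the same kind and size."""
    for c_type in C_NAMED:
        dt = np.dtype(c_type)
        # e.g. kind "i" with itemsize 4 resolves to the class behind np.int32
        canonical = np.dtype(f"{dt.kind}{dt.itemsize}").type
        if c_type not in table and canonical in table:
            table[c_type] = table[canonical]


register_c_aliases(factorizers)
print(factorizers[np.intc], factorizers[np.uintc])
```

On platforms where np.intc is already the same class as np.int32, the helper adds nothing for it, so the table is unchanged there.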

@mroeschke
Member

Thanks for the report! Pull requests welcome

@mroeschke mroeschke added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 5, 2023
@mroeschke mroeschke added this to the 2.0.1 milestone Apr 5, 2023
ndevenish pushed a commit to dials/dials that referenced this issue Apr 6, 2023
Without this, on Windows, the underlying dtype.type is np.intc rather than
np.int32 or np.int64, which results in an error when calling pd.merge.

Work around pandas 2.0.0 bug on windows:
    pandas-dev/pandas#52451

Fixes #2382.
@reddyrg1
Contributor

reddyrg1 commented Apr 7, 2023

take

@datapythonista datapythonista modified the milestones: 2.0.1, 2.0.2 Apr 23, 2023
@hoxbro
Contributor Author

hoxbro commented May 11, 2023

I cannot reproduce it on Pandas 2.0.1.

@hoxbro hoxbro closed this as completed May 11, 2023
@hoxbro
Contributor Author

hoxbro commented May 11, 2023

I forgot the problem only happens on Windows...

@hoxbro hoxbro reopened this May 11, 2023
@hoxbro hoxbro changed the title BUG: pd.merge fail with numpy.intc BUG: pd.merge fail with numpy.intc on Windows May 11, 2023
@Pouyanpi

Hi @hoxbro,

I noticed that the same issue arises for np.uintc; a similar solution fixes it.
Thanks

@hoxbro
Contributor Author

hoxbro commented Sep 15, 2023

If you have already tested it, I suggest opening a PR with the fix and adding a test 🙂
