Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas 0.23 fails on column<->index boolean merge #21119

Closed
crepererum opened this Issue May 18, 2018 · 10 comments

Comments

Projects
None yet
3 participants
@crepererum
Copy link

crepererum commented May 18, 2018

Code Sample, a copy-pastable example if possible

import pandas as pd

df1 = pd.DataFrame({'x': pd.Series([True, False], dtype=bool)})
df2 = df1.set_index('x')

df1.merge(df2, left_on='x', right_index=True)

Problem description

On Pandas 0.23, I get the following Exception:

venv/local/lib/python2.7/site-packages/pandas/core/frame.pyc in merge(self, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
   6377                      right_on=right_on, left_index=left_index,
   6378                      right_index=right_index, sort=sort, suffixes=suffixes,
-> 6379                      copy=copy, indicator=indicator, validate=validate)
   6380
   6381     def round(self, decimals=0, *args, **kwargs):

venv/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     58                          right_index=right_index, sort=sort, suffixes=suffixes,
     59                          copy=copy, indicator=indicator,
---> 60                          validate=validate)
     61     return op.get_result()
     62

venv/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    552         # validate the merge keys dtypes. We may need to coerce
    553         # to avoid incompat dtypes
--> 554         self._maybe_coerce_merge_keys()
    555
    556         # If argument passed to validate,

venv/local/lib/python2.7/site-packages/pandas/core/reshape/merge.pyc in _maybe_coerce_merge_keys(self)
    976             # incompatible dtypes GH 9780, GH 15800
    977             elif is_numeric_dtype(lk) and not is_numeric_dtype(rk):
--> 978                 raise ValueError(msg)
    979             elif not is_numeric_dtype(lk) and is_numeric_dtype(rk):
    980                 raise ValueError(msg)

ValueError: You are trying to merge on bool and object columns. If you wish to proceed you should use pd.concat

Expected Output

       x
0   True
1  False

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.49-moby
machine: x86_64
processor:
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.23.0
pytest: 3.4.0
pip: 9.0.3
setuptools: 39.0.1
Cython: 0.28.1
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.5.0
sphinx: 1.6.5
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 2.1.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the Bug label May 21, 2018

@gfyoung

This comment has been minimized.

Copy link
Member

gfyoung commented May 21, 2018

@crepererum : Thanks for reporting this! Was this working on a previous version by any chance?

@crepererum

This comment has been minimized.

Copy link
Author

crepererum commented May 21, 2018

Yes, the expected output was produced by 0.22.0.

Does pandas, by any chance, have some infrastructure for bisecting the commit / merge that introduced the issue?

@gfyoung

This comment has been minimized.

Copy link
Member

gfyoung commented May 21, 2018

git bisect is what you're looking for. That's a pure git construct BTW.

@crepererum

This comment has been minimized.

Copy link
Author

crepererum commented May 21, 2018

I know GIT does the job, I just thought there may be already be some scripts to simplify the build + test process or even an archive of pre built wheels for every master commit. Anyway, no worries, I'll see what if I find the the commit this week.

@crepererum

This comment has been minimized.

Copy link
Author

crepererum commented May 23, 2018

1355df64af894836f86cbc75b0c3b1435331aeb4 is the first bad commit
commit 1355df64af894836f86cbc75b0c3b1435331aeb4
Author: Paul Reidy <paul_reidy@outlook.com>
Date:   Sun Dec 10 15:41:36 2017 +0000

    ERR: ValueError when merging on incompatible dtypes (#18674)

@jorisvandenbossche jorisvandenbossche added this to the 0.23.1 milestone May 23, 2018

@crepererum

This comment has been minimized.

Copy link
Author

crepererum commented May 23, 2018

The problem here is: the check introduced is in this commit does the correct thing. The index dtype is wrong (it's object, not bool), which can also be shown be this simple example (identical result for 0.22.0 and 0.23.0):

>>> pd.Index([True, False], dtype=bool)
Index([True, False], dtype='object')

Or in other words: the index dtype is wrong in both versions, the check that was introduced in-between just makes the problem visible.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

jorisvandenbossche commented May 23, 2018

@crepererum Thanks for figuring that out.
So the underlying cause (which now comes up due to that change), is that boolean values in the index are stored as object array (we don't have a BooleanIndex), so it things the types are not compatible (object vs bool).

So we will have to 'infer' the dtype of the index to cover those cases.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

jorisvandenbossche commented May 23, 2018

Yes, fully correct, we've come to the same conclusion :-)
The index dtype being object is not really "wrong", it is just how boolean values are currently stored in indexes. So the code to check if the dtypes are compatible in the merging code, needs to be made aware of this fact.

@crepererum

This comment has been minimized.

Copy link
Author

crepererum commented Jun 4, 2018

BTW: the problem is NOT specific to indices. This code also doesn't work:

import pandas as pd

df1 = pd.DataFrame({'x': [True, False]})
df2 = pd.DataFrame({'x': [True, False, None]})
df1.merge(df2)

This is especially bad because pandas uses None as NULL-placeholder in bool-columns.

@jorisvandenbossche

This comment has been minimized.

Copy link
Member

jorisvandenbossche commented Jun 4, 2018

@crepererum yes, that's indeed also a reason for having object boolean columns.
We will fix this in the upcooming 0.23.1 release, I started a PR at #21310

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.