BUG: breaking change in df.replace() from 1.0.5 to 1.1.0 #35931

Closed
3 tasks done
agnesbao opened this issue Aug 27, 2020 · 4 comments
Labels
Interval Interval data type Regression Functionality that used to work in a prior pandas version

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.



Code Sample, a copy-pastable example

import pandas as pd

df = pd.DataFrame({"age": range(1,100)})
df["age"] = pd.cut(df["age"], [0, 50, 65, 99], right=False)
df.replace(to_replace={
    "age":{
        pd.Interval(0, 50, closed='left'): "Ages < 50",
        pd.Interval(50, 65, closed='left'): "Ages 50-64",
        pd.Interval(65, 99, closed='left'): "Ages 65+"
    }
})

Problem description

The sample code above works in 1.0.5 but raises an error in 1.1.0 and later, with the message:

TypeError: Cannot compare types 'ndarray(dtype=object)' and 'Interval'

It only happens when the keys are pd.Interval objects; a nested dict with a string-to-string mapping works fine.
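
For contrast, here is a minimal sketch of the string-to-string case, which is not affected. The column name `grp` and its values are made up for illustration; only the nested-dict form of `to_replace` is taken from the report.

```python
import pandas as pd

# Hypothetical data: a plain string (object dtype) column.
df = pd.DataFrame({"grp": ["a", "b", "a"]})

# Same nested-dict form as in the report: {column: {old_value: new_value}}.
result = df.replace(to_replace={"grp": {"a": "A", "b": "B"}})

print(result["grp"].tolist())
```

The call returns a new DataFrame and leaves `df` unchanged, so the result has to be assigned.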

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit : e3bcf8d
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.6.0
Version : Darwin Kernel Version 19.6.0: Thu Jun 18 20:49:00 PDT 2020; root:xnu-6153.141.1~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.2.0.dev0+165.ge3bcf8d19
numpy : 1.19.1
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.2
setuptools : 49.6.0.post20200814
Cython : 0.29.21
pytest : 6.0.1
hypothesis : 5.29.0
sphinx : 2.4.0
blosc : None
feather : None
xlsxwriter : 1.3.3
lxml.etree : 4.5.2
html5lib : 1.1
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.2
IPython : 7.17.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fsspec : 0.8.0
fastparquet : 0.4.1
gcsfs : None
matplotlib : 3.3.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.5
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pyxlsb : None
s3fs : 0.2.0
scipy : 1.5.2
sqlalchemy : 1.3.19
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.16.0
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.50.1

@agnesbao agnesbao added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 27, 2020
@dsaxton dsaxton added Interval Interval data type Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 27, 2020
@asishm
Contributor

asishm commented Aug 27, 2020

#32107

c6e3e15 is the first bad commit
commit c6e3e15
Author: Daniel Saxton 2658661+dsaxton@users.noreply.github.com
Date: Sat Apr 25 18:58:51 2020 -0500

BUG: Allow addition of Timedelta to Timestamp interval (#32107)

@dsaxton
Member

dsaxton commented Aug 27, 2020

Hmm, this is apparently due to

__array_priority__ = 1000

which was added to allow np.timedelta64 + pd.Interval of timestamps in the linked PR: #32107 (comment). Not sure exactly what's happening to be honest but will take a look.

cc @jbrockmendel if any thoughts

@dsaxton
Member

dsaxton commented Aug 28, 2020

So evidently __array_priority__ = 1000 has the not-so-nice side effect that comparisons such as == between an ndarray and an Interval no longer broadcast elementwise (which is what causes the bug in the OP):

In [1]: import numpy as np
   ...: import pandas as pd
   ...:
   ...: arr = np.array([pd.Interval(0, 1), pd.Interval(1, 2)])
   ...: arr == pd.Interval(0, 1)
   ...:
Out[1]: False

Probably this behavior is worse than not being able to add np.timedelta64 + pd.Interval(pd.Timestamp, pd.Timestamp).
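
The same effect can be sketched without pandas. The classes `Tag` and `PriorityTag` below are hypothetical stand-ins, not pandas code, and the explanation in the comments is my reading of NumPy's legacy `__array_priority__` handling rather than something confirmed in this thread: when the right-hand operand advertises a higher priority, `ndarray.__eq__` appears to return `NotImplemented` instead of broadcasting, and Python falls back to the operand's own `__eq__`.

```python
import numpy as np


class Tag:
    """Hypothetical value object with ordinary equality."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Tag):
            return self.value == other.value
        # Unknown operand (e.g. an ndarray): let Python try something else.
        return NotImplemented


class PriorityTag(Tag):
    # Assumption: this mirrors the attribute that was added to Interval.
    __array_priority__ = 1000


plain = np.array([Tag(0), Tag(1)], dtype=object)
prio = np.array([PriorityTag(0), PriorityTag(1)], dtype=object)

# No priority attribute: NumPy handles == itself and compares per element.
plain_cmp = plain == Tag(0)

# Higher priority: NumPy defers, PriorityTag.__eq__ also gives up, and
# Python falls back to an identity check, which yields a single bool.
deferred_cmp = prio == PriorityTag(0)

# One way to sidestep the deferral: compare element by element explicitly.
target = PriorityTag(0)
elementwise = np.array([item == target for item in prio])

print(plain_cmp, deferred_cmp, elementwise)
```

The explicit loop is slower than a broadcast comparison, so it is only a stopgap for small object arrays.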

@simonjayhawkins
Member

fixed in #35938
