BUG: Weird behavior for comparison operations eq and ne #36377

Closed
2 of 3 tasks
YarShev opened this issue Sep 15, 2020 · 5 comments · Fixed by #36440
Labels
Bug Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version
Comments

@YarShev
Contributor

YarShev commented Sep 15, 2020

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

Code Sample, a copy-pastable example

import pandas as pd
import numpy as np
random_state = np.random.RandomState(seed=42)
ncols = 64
nrows = 156
test_data = {
    "col{}".format(int(i)): random_state.randint(
        0, 100, size=(nrows)
    )
    for i in range(ncols)
}
df = pd.DataFrame(test_data)
df.eq("a")  # this works
nrows = 157  # change the row count
test_data = {
    "col{}".format(int(i)): random_state.randint(
        0, 100, size=(nrows)
    )
    for i in range(ncols)
}
df = pd.DataFrame(test_data)
df.eq("a")  # this fails
# ValueError: unknown type str32
nrows = 156  # restore the working row count
ncols = 65 # change col count
test_data = {
    "col{}".format(int(i)): random_state.randint(
        0, 100, size=(nrows)
    )
    for i in range(ncols)
}
df = pd.DataFrame(test_data)
df.eq("a")  # this fails as well
# ValueError: unknown type str32

Problem description

It looks like the problem is related to the size of the operands in comparison operations. Could anyone explain this, please? Is this expected behavior?

Output of pd.show_versions()

pandas : 1.1.2
numpy : 1.18.4
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 41.2.0
Cython : None
pytest : 5.4.2
hypothesis : None
sphinx : None
blosc : None
feather : 0.4.1
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : None
fsspec : 0.7.3
fastparquet : None
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : None
xarray : 0.15.1
xlrd : 1.2.0
xlwt : None
numba : None

@YarShev YarShev added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 15, 2020
@YarShev
Contributor Author

YarShev commented Sep 15, 2020

Also, the exception changes for the comparison operations ge, gt, le, and lt. With the working DataFrame size from the example above, the exception is TypeError: '>' not supported between instances of 'numpy.ndarray' and 'str'. With the larger sizes from the example, however, the exception is the same as for eq and ne: ValueError: unknown type str32.
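The difference between equality and ordering comparisons can be sketched on a small frame that stays well below the failing sizes. This is only an illustration, not the reporter's code: the 3 x 2 frame and the names `small`, `eq_result`, and `error_type` are made up here, and the exact exception message varies between pandas versions, so only the exception type is inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical small integer frame (6 elements), far below the failing sizes.
small = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["col0", "col1"])

# Equality against a string is defined: no integer equals "a",
# so every cell of the result is False.
eq_result = small.eq("a")

# Ordering against a string is not defined, so pandas raises
# instead of returning a frame.
error_type = None
try:
    small.gt("a")
except TypeError as exc:
    error_type = type(exc).__name__

print(bool(eq_result.to_numpy().any()), error_type)
```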

@satrio-hw
Contributor

Hi @YarShev, from what I know, the problem seems to come from numpy, which has some issues when comparing different data types. A quick solution is to add df = df.astype(str) before comparing 'a' to the DataFrame.

As for the size problem, I'm not quite sure, but I tried looping nrows and ncols up to 300, and every size with nrows * ncols > 10000 raises a ValueError. As for the TypeError, I think it's related to the dtype comparison and can be solved by changing the DataFrame dtype.
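The astype(str) workaround suggested in this comment might look like the sketch below. It rebuilds one of the failing shapes from the report (157 rows x 64 columns); the names `rng`, `as_text`, and `result` are illustrative and not from the thread.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=42)

# 157 x 64 = 10048 elements, one of the failing shapes from the report.
df = pd.DataFrame(
    {"col{}".format(i): rng.randint(0, 100, size=157) for i in range(64)}
)

# Cast to strings first, so the comparison is string against string.
as_text = df.astype(str)
result = as_text.eq("a")

print(result.shape, bool(result.to_numpy().any()))
```

Note that the cast returns a copy holding strings, so the original `df` should be kept around for any numeric work.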

@dsaxton dsaxton removed the Needs Triage Issue that has not been reviewed by a pandas team member label Sep 16, 2020
@YarShev
Contributor Author

YarShev commented Sep 16, 2020

Hi @satrio-hw, perhaps the problem comes from numpy. However, the traceback ends up in numexpr. The issue does not occur in pandas==1.0.5 with numpy==1.18.3 and numexpr==2.7.1, but it does occur in pandas==1.1.1/2 with numpy==1.18.4 and numexpr==2.7.1.

@satrio-hw
Contributor

@YarShev I compared pandas v1.0.x with the current master. I can't find a significant difference in the functions def _evaluate_numexpr(op, op_str, a, b): and def _can_use_numexpr(op, op_str, a, b, dtype_check): between the two versions (pandas/core/computation/expressions.py). I looked at these functions because they are also mentioned in the traceback.

I also think this issue is not related to the size of the operands as such, since nrows*ncols > _MIN_ELEMENTS (where _MIN_ELEMENTS = 10000) only triggers the numexpr library to make the calculation faster.
To make sure, I tried comparing a DataFrame (180 rows x 64 cols) using df.eq(1), and there was no error.
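The arithmetic behind the threshold can be checked directly against the shapes used in this thread. This is a plain-Python sketch: `MIN_ELEMENTS` mirrors the value of 10000 quoted above and is taken as an assumption, and `shapes` and `above_threshold` are illustrative names.

```python
# Threshold quoted in this thread for pandas 1.1.x (assumed, not read from pandas).
MIN_ELEMENTS = 10000

# (nrows, ncols) pairs: the three shapes from the report plus the 180 x 64 check.
shapes = [(156, 64), (157, 64), (156, 65), (180, 64)]

above_threshold = {}
for nrows, ncols in shapes:
    size = nrows * ncols
    above_threshold[(nrows, ncols)] = size > MIN_ELEMENTS
    print("{} x {} = {} elements, above threshold: {}".format(
        nrows, ncols, size, size > MIN_ELEMENTS))
```

Only the 156 x 64 frame (9984 elements) stays under the threshold. The 180 x 64 frame (11520 elements) is above it as well, which fits the observation that size alone does not cause the error: it appears only when a frame over the threshold is compared with a string.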

The error is raised by numexpr because we try to compare two different dtypes, and I don't think changing any code in numexpr is best practice, since it's an external library.

I think I could make a PR in pandas/core/computation/expressions.py to handle this particular dtype problem, but I don't think that's best practice, since it could affect other functions in this project. Also, changing the DataFrame dtype (df.astype('str')) already resolves this numpy- and numexpr-related problem nicely. Let me know if I missed something.
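One way to check whether numexpr is involved, without changing either library, could be to switch the numexpr code path off for a single call. The sketch below relies on the pandas option `compute.use_numexpr`. It is a diagnostic idea added here, not something proposed in this thread, and it has not been verified against the affected pandas 1.1.x releases.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(seed=42)

# Same failing shape as in the report: 157 rows x 64 columns.
df = pd.DataFrame(
    {"col{}".format(i): rng.randint(0, 100, size=157) for i in range(64)}
)

# Temporarily disable the numexpr fast path; the previous setting
# is restored when the block exits.
with pd.option_context("compute.use_numexpr", False):
    result = df.eq("a")

print(bool(result.to_numpy().any()))
```

If the comparison succeeds inside the block but fails outside it, that would point at the numexpr path rather than at numpy.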

@YarShev
Contributor Author

YarShev commented Sep 17, 2020

@satrio-hw, thanks for the detailed explanation! Yes, I think just changing the DataFrame dtype (df.astype('str')) makes sense.

@jreback jreback added this to the 1.1.3 milestone Sep 18, 2020
@jreback jreback added Numeric Operations Arithmetic, Comparison, and Logical operations Regression Functionality that used to work in a prior pandas version labels Sep 18, 2020