BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call #57989

Open
3 tasks done
filip-komarzyniec opened this issue Mar 25, 2024 · 8 comments
Labels
Bug Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate

Comments

@filip-komarzyniec

filip-komarzyniec commented Mar 25, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.NA in [1,2,3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

Issue Description

Checking whether pd.NA exists in a list results in TypeError: boolean value of NA is ambiguous.
Why does performing an `in` operation call the __bool__ method of the pd.NAType class?

Seems a bit similar to the issue regarding incorrect implementation of some operators: #49828

Expected Behavior

Checking for existence of pd.NA type in any container should correctly return either True or False

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.2.0
Version : Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.1
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

filip-komarzyniec added the Bug and Needs Triage labels Mar 25, 2024
filip-komarzyniec changed the title from "BUG: CONTAINS_OP run on pd.NA results in pd.__bool__ call" to "BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call" Mar 25, 2024
@rhshadrach
Member

Thanks for the report - this is a consequence of having comparisons return pd.NA:

print(pd.NA == 1)
# <NA>

When Python evaluates "pd.NA == 1", the result is pd.NA. Python then evaluates the truthiness of that result, which raises the TypeError you reported. As long as comparisons return pd.NA, I do not believe anything can be done here.
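The mechanism described above can be sketched without pandas. The Python language reference states that, for lists, `x in y` is equivalent to `any(x is e or x == e for e in y)`. The class `AmbiguousNA` below is a hypothetical stand-in (not the pandas implementation) that assumes only the two behaviours mentioned in this thread: `==` returns the NA object itself, and `bool()` raises.

```python
class AmbiguousNA:
    """Hypothetical stand-in for pd.NA: comparisons propagate, truthiness raises."""

    def __eq__(self, other):
        # Mirrors `pd.NA == 1` evaluating to <NA> rather than True/False.
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


def contains(container, item):
    """Spell out the documented semantics of `item in container` for a list."""
    return any(item is element or item == element for element in container)


NA = AmbiguousNA()

# Identity is checked before equality, so an NA that is literally the first
# element of the list is found without any truthiness check.
found_by_identity = NA in [NA, 1]

# No identical element: `NA == 1` is evaluated, and the membership test then
# needs the truthiness of that result, which raises.
try:
    NA in [1, 2, 3]
    raised = False
except TypeError:
    raised = True
```

Under these assumptions the `in` operator never calls `__bool__` on the list elements or on the searched value directly. It calls it on the result of the `==` comparison.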

cc @jorisvandenbossche @phofl

rhshadrach added the Missing-data label Mar 25, 2024
@phofl
Member

phofl commented Mar 25, 2024

We intend to change this to return false (discussed in Basel), should probably get this into 3.0

rhshadrach removed the Needs Triage label Mar 25, 2024
@20revsined

take

@asishm
Contributor

asishm commented Apr 6, 2024

> We intend to change this to return false (discussed in Basel), should probably get this into 3.0

@phofl Would this change only apply to boolean ops, or do you anticipate changing the behavior of numerical ops like 1 + pd.NA as well?

@phofl
Member

phofl commented Apr 6, 2024

No, it's only bool(pd.NA) that we want to change.

@20revsined this is probably not a good issue for a beginner in pandas
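As a rough illustration of what the intended change could mean for the original report, here is a sketch using a hypothetical stand-in class, `FalseyNA`. It assumes only what is stated in this thread: comparisons keep returning NA, while `bool()` of NA returns False instead of raising. It is not the pandas implementation, and the final behaviour in pandas may differ.

```python
class FalseyNA:
    """Hypothetical stand-in: comparisons still propagate, but bool() is False."""

    def __eq__(self, other):
        # Comparison behaviour is unchanged in this sketch: the result is NA.
        return self

    def __bool__(self):
        # The only assumed change: truthiness is False instead of a TypeError.
        return False

    def __repr__(self):
        return "<NA>"


NA = FalseyNA()

# Each `NA == element` yields NA, which is now falsy, so no match is reported.
in_plain_list = NA in [1, 2, 3]

# The identity check still finds an NA that is actually stored in the list.
in_list_with_na = NA in [1, NA]
```

In this sketch, membership tests on plain containers return True or False rather than raising, which matches the expected behaviour in the report.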

20revsined removed their assignment May 5, 2024
@julia-pfarr

I don't know if my issue is related to this, please remove my comment if not!

I have a function which gives me the following output (pd df):

timestamp  duration  trial_type  blink  message
9199380    <NA>      NaN         <NA>   RECORD_START
9199345    392       fixation    0      NaN
etc...

column dtypes are:
timestamp Int64
duration Int64
trial_type object
blink Int64
message object
dtype: object

To be precise: timestamp and duration hold numbers plus missing values, trial_type holds strings plus missing values, blink holds numbers (0 and 1) plus missing values, and message holds strings plus missing values.

Now I wrote a unit test to test the output for the first row:

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "folder, expected",
    [
        ("emg", [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"]),
        # + other folders, removed for simplicity
    ],
)
def test_physioevents_value(folder, expected, eyelink_test_data_dir):
    input_dir = eyelink_test_data_dir / folder
    asc_file = asc_test_files(input_dir=input_dir, suffix="*_events")[0]
    events = _load_asc_file(asc_file)
    events_after_start = _df_events_after_start(events)
    physioevents_reordered = _df_physioevents(events_after_start)
    physioevents_eye1 = _physioevents_eye1(physioevents_reordered)
    assert physioevents_eye1.iloc[0].tolist() == expected

And the list obviously looks like this: [9199380, <NA>, nan, <NA>, 'RECORD_START']

I get the following error when running the test:

E AssertionError: assert [9199380, <NA>...CORD_START'] == [9199380, <NA>...CORD_START']
E
E (pytest_assertion plugin: representation of details failed: missing.pyx:392: TypeError: boolean value of NA is ambiguous.
E Probably an object has a faulty repr.)

tests/test_edf2bids.py:670: AssertionError

So I guess I cannot use pd.NA to check if the value in that field is <NA>. However, I also cannot check it using "<NA>", i.e. encoding it as a string.

How can I check whether pd.NA values exist in the dataframe?

I tried changing the dtypes so that every column has the dtype 'object'. However, that's not really what I want.
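One possible way to compare such a row without ever taking the truthiness of a comparison involving a missing value is sketched below. The helper names `is_missing` and `rows_match` are hypothetical and not part of pandas or pytest. The default `is_missing` is a pure-Python stand-in that recognises only None and float NaN. The sketch also assumes that any two missing values in the same position count as equal, which may be looser than the test needs.

```python
import math


def is_missing(value):
    """Stand-in scalar missing-value check (assumption, not a pandas function)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def rows_match(actual, expected, missing=is_missing):
    """Compare two flat lists, treating missing values in the same slot as equal."""
    if len(actual) != len(expected):
        return False
    for got, want in zip(actual, expected):
        if missing(got) or missing(want):
            # Do not evaluate `got == want` here: for NaN it is always False,
            # and for an NA-like value the result has no usable truthiness.
            if not (missing(got) and missing(want)):
                return False
        elif got != want:
            return False
    return True


nan = float("nan")
same = rows_match(
    [9199380, None, nan, "RECORD_START"],
    [9199380, None, nan, "RECORD_START"],
)
different = rows_match([9199380, nan], [9199380, 392])
```

With pandas installed, passing `missing=pd.isna` should make the same helper recognise `pd.NA` as well, since `pd.isna` returns True for `pd.NA`, `np.nan`, and None on scalars. The test assertion would then read `assert rows_match(physioevents_eye1.iloc[0].tolist(), expected, missing=pd.isna)`. For a whole frame, `df.isna()` marks the missing cells.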

@rhshadrach
Member

While somewhat related, this:

> How can I check whether pd.NA values exist in the dataframe?

is more of a usage question. Please try asking on StackOverflow first - if you don't get your question resolved in a few days, open a new issue here and link to your SO post. We do this as otherwise we fear our issue tracker would be flooded with usage questions.

@julia-pfarr

Great, thank you for your reply! I already asked on SO a couple of days ago. I'll wait a bit more and then do as you asked if I don't get it resolved otherwise :-)
