BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call #57989

filip-komarzyniec · 2024-03-25T00:52:11Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

pd.NA in [1,2,3]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "missing.pyx", line 392, in pandas._libs.missing.NAType.__bool__
TypeError: boolean value of NA is ambiguous

Issue Description

Checking whether pd.NA is in a list raises TypeError: boolean value of NA is ambiguous.
Why does the in operator call the __bool__ method of the pd.NAType class?

Seems a bit similar to the issue regarding incorrect implementation of some operators: #49828

Expected Behavior

Checking for the existence of pd.NA in any container should return either True or False.

Installed Versions

INSTALLED VERSIONS

commit : bdc79c1
python : 3.10.13.final.0
python-bits : 64
OS : Darwin
OS-release : 23.2.0
Version : Darwin Kernel Version 23.2.0: Wed Nov 15 21:55:06 PST 2023; root:xnu-10002.61.3~2/RELEASE_ARM64_T6020
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.2.1
numpy : 1.26.3
pytz : 2024.1
dateutil : 2.8.2
setuptools : 68.2.2
pip : 23.3.1
Cython : None
pytest : 8.0.0
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.12.0
sqlalchemy : None
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None


rhshadrach · 2024-03-25T01:24:47Z

Thanks for the report - this is a consequence of having comparisons return pd.NA:

print(pd.NA == 1)
# <NA>

When Python checks whether pd.NA == 1, the result is pd.NA. Python then evaluates the truthiness of that result, which raises the TypeError you reported. As long as comparisons return pd.NA, I do not believe anything can be done here.
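
The mechanism can be reproduced without pandas. The sketch below uses a hypothetical stand-in class, AmbiguousNA, which is not pandas's implementation; it only assumes the two behaviours described above (comparisons return the NA object, and asking for its truth value raises). It also shows that an identity check avoids the comparison entirely.

```python
class AmbiguousNA:
    """Hypothetical stand-in for pd.NA; not pandas's actual implementation."""

    def __eq__(self, other):
        # Comparisons propagate the missing value instead of returning a bool.
        return self

    # Defining __eq__ disables the default hash; restore identity hashing.
    __hash__ = object.__hash__

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")

    def __repr__(self):
        return "<NA>"


na = AmbiguousNA()

# `x in container` compares x with each element and needs a real bool,
# so the NA returned by the comparison is passed through bool().
error_message = None
try:
    na in [1, 2, 3]
except TypeError as exc:
    error_message = str(exc)

# Containment tests identity before equality, so an element that *is* the
# same object is found, provided no non-identical element is compared first.
found_by_identity = na in [na, 1]

# An explicit identity scan never calls __eq__, so it always yields a bool.
found_by_scan = any(item is na for item in [1, 2, 3])
```

Because the identity scan does not go through `==`, it is one way to test for a singleton missing-value marker in a plain list without triggering the error.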

cc @jorisvandenbossche @phofl

phofl · 2024-03-25T01:26:27Z

We intend to change this to return false (discussed in Basel), should probably get this into 3.0

20revsined · 2024-03-29T19:36:21Z

take

asishm · 2024-04-06T16:41:57Z

We intend to change this to return false (discussed in Basel), should probably get this into 3.0

@phofl Would this change only apply to boolean ops, or do you anticipate changing the behavior of numerical ops like 1 + pd.NA as well?

phofl · 2024-04-06T19:25:22Z

No, it's only bool(pd.NA) that we want to change.

@20revsined this is probably not a good issue for a beginner in pandas
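
As a rough illustration of what changing bool(pd.NA) to return False would mean for the original example, the sketch below uses a hypothetical stand-in class, FalsyNA. It is not pandas code and does not describe any released pandas behaviour; it only assumes that comparisons keep returning the NA object while its truth value becomes False.

```python
class FalsyNA:
    """Hypothetical stand-in: comparisons return NA, but bool(NA) is False."""

    def __eq__(self, other):
        # Comparisons still propagate the missing value.
        return self

    # Defining __eq__ disables the default hash; restore identity hashing.
    __hash__ = object.__hash__

    def __bool__(self):
        # The assumed change: NA is simply falsy instead of raising.
        return False

    def __repr__(self):
        return "<NA>"


na = FalsyNA()

# The comparison itself is unchanged and still yields the NA object.
comparison = na == 1

# Each element comparison is now falsy, so the membership test is False.
in_other_values = na in [1, 2, 3]

# The identity shortcut still finds the object itself.
in_with_itself = na in [1, na]
```

Under these assumptions only the truthiness step changes; arithmetic such as 1 + pd.NA is not modelled here at all.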

julia-pfarr · 2024-06-19T10:15:10Z

I don't know if my issue is related to this, please remove my comment if not!

I have a function which gives me the following output (pd df):

timestamp	duration	trial_type	blink	message
9199380	<NA>	NaN	<NA>	RECORD_START
`9199345`	392	fixation	0	NaN
etc...

column dtypes are:
timestamp Int64
duration Int64
trial_type object
blink Int64
message object
dtype: object

To be precise: timestamp and duration hold numerics plus NaNs, trial_type holds strings plus NaNs, blink holds numerics (0 and 1) plus NaNs, and message holds strings plus NaNs.

Now I wrote a unit test to test the output for the first row:

import numpy as np
import pandas as pd
import pytest

@pytest.mark.parametrize(
    "folder, expected",
    [("emg", [9199380, pd.NA, np.nan, pd.NA, "RECORD_START"])],
    # + other folders, removed for simplicity
)
def test_physioevents_value(folder, expected, eyelink_test_data_dir):
    input_dir = eyelink_test_data_dir / folder
    asc_file = asc_test_files(input_dir=input_dir, suffix="*_events")[0]
    events = _load_asc_file(asc_file)
    events_after_start = _df_events_after_start(events)
    physioevents_reordered = _df_physioevents(events_after_start)
    physioevents_eye1 = _physioevents_eye1(physioevents_reordered)
    assert physioevents_eye1.iloc[0].tolist() == expected

And the list obviously looks like this: [9199380, <NA>, nan, <NA>, 'RECORD_START']

I get the following error when running the test:

E AssertionError: assert [9199380, <NA>...CORD_START'] == [9199380, <NA>...CORD_START']
E
E (pytest_assertion plugin: representation of details failed: missing.pyx:392: TypeError: boolean value of NA is ambiguous.
E Probably an object has a faulty repr.)

tests/test_edf2bids.py:670: AssertionError

So I guess I cannot use pd.NA to check if the value in that field is <NA>. However, I also cannot check it using "<NA>", i.e. encoding it as a string.

How can I check whether pd.NA values exist in the dataframe?

I tried changing the dtypes so that every column has the dtype 'object'. However, that's not really what I want.
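
One possible way to compare such a row without relying on == for the missing entries is to check missingness first and only compare the remaining values. The sketch below is an illustration under assumptions, not a pandas API: rows_match and is_missing_stub are hypothetical names, and None and float NaN stand in for the missing values so that the example runs without pandas. With pandas, one would probably pass pd.isna as the predicate; that pairing is an assumption and is not exercised here. Note that this treats every kind of missing value as equivalent.

```python
import math


def rows_match(actual, expected, is_missing):
    """Compare two rows element-wise, checking missingness before equality."""
    if len(actual) != len(expected):
        return False
    for a, e in zip(actual, expected):
        a_missing, e_missing = is_missing(a), is_missing(e)
        if a_missing or e_missing:
            # Missing values match only other missing values.
            if not (a_missing and e_missing):
                return False
        elif a != e:
            # Ordinary values are compared with the usual equality.
            return False
    return True


def is_missing_stub(value):
    """Hypothetical predicate: None and float NaN count as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


row = [9199380, None, float("nan"), None, "RECORD_START"]
expected = [9199380, None, float("nan"), None, "RECORD_START"]

same = rows_match(row, expected, is_missing_stub)
different = rows_match(row, [9199380, 392, "fixation", 0, None], is_missing_stub)
```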

rhshadrach · 2024-06-19T20:40:41Z

While somewhat related, this:

How can I check whether pd.NA values exist in the dataframe?

is more of a usage question. Please try asking on StackOverflow first - if you don't get your question resolved in a few days, open a new issue here and link to your SO post. We do this as otherwise we fear our issue tracker would be flooded with usage questions.

julia-pfarr · 2024-06-20T09:01:39Z

Great, thank you for your reply! I already asked on SO a couple of days ago. I'll wait a bit more and then do as you asked if I don't get it resolved otherwise :-)

filip-komarzyniec added the Bug and Needs Triage labels Mar 25, 2024

filip-komarzyniec changed the title ~~BUG: CONTAINS_OP run on pd.NA results in pd.__bool__ call~~ BUG: CONTAINS_OP run on pd.NA results in pd.NAType.__bool__ call Mar 25, 2024

rhshadrach added the Missing-data label Mar 25, 2024

rhshadrach removed the Needs Triage label Mar 25, 2024

github-actions bot assigned 20revsined Mar 29, 2024

20revsined removed their assignment May 5, 2024
