
BUG: Replacing pd.NA by None has no effect #45601

Closed
2 of 3 tasks
treiher opened this issue Jan 24, 2022 · 16 comments · Fixed by #46404
Labels
Bug, NA - MaskedArrays (Related to pd.NA and nullable extension arrays), replace (replace method)
Milestone

Comments

@treiher

treiher commented Jan 24, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})

assert df.replace({pd.NA: None}).to_dict() == {"value": {0: 42, 1: None}}

Issue Description

Since version 1.4.0, .replace({pd.NA: None}) has no effect: pd.NA is no longer replaced by None.

Expected Behavior

The assertion in the example shown above holds in version 1.3.5.

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.13-arch1-1
Version : #1 SMP PREEMPT Wed, 05 Jan 2022 16:20:59 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.4.0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.1.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.29
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@treiher added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jan 24, 2022
@roib20
Contributor

roib20 commented Jan 27, 2022

I tried to investigate this issue. NDFrame.replace is found in /pandas/core/generic.py, and there have indeed been changes between the 1.3.x and 1.4.x branches. There have been no further changes to this method on the main branch.

In the updated method, there is a new comment:

# GH#36984 if the user explicitly passes value=None we want to

So there has been a change in response to issue #36984.

Further investigation is needed to determine whether this change is working as intended.

@roib20
Contributor

roib20 commented Jan 27, 2022

Change in behavior is from merged PR #45081

@roib20
Contributor

roib20 commented Jan 29, 2022

I did further debugging and found that PR #45081 is not what's causing this issue. The issue stems from a change in a different method: _replace_coerce in /pandas/core/internals/blocks.py. The relevant change is from PR #44940.

I did a basic test with your reproducible example; using the _replace_coerce method from 1.3.x under the 1.4.x code does achieve the expected behavior. However, PR #44940 was designed to fix multiple other bugs, so reverting the change isn't simple.

@roib20
Contributor

roib20 commented Jan 29, 2022

take

@lithomas1 added the NA - MaskedArrays (Related to pd.NA and nullable extension arrays) and replace (replace method) labels on Jan 30, 2022
@roib20
Contributor

roib20 commented Jan 30, 2022

I investigated this issue thoroughly. I considered sending a pull request at first, but now I'm starting to think this might be intended behavior after all. I need help determining whether or not this is really intended behavior. My explanation is below.

First, here's a way to make the Reproducible Example work on pandas 1.4.0 and above (also tested on the main branch):

import pandas as pd

df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})

assert df.astype(object).replace({pd.NA: None}).to_dict() == {"value": {0: 42, 1: None}}

The addition here is the explicit cast df.astype(object) before calling replace.
Note that this solution is not backwards-compatible with pandas 1.3.5.

To further explain: when using data types in pandas, a None value can only be kept under the object dtype, which is incompatible with Int64. In pandas 1.3.5, the replace method was able to implicitly convert types, which made every value in a converted df an object. I am not sure this was actually intended behavior; evidence for that is a quirk where calling replace twice can convert the type back into float64. So in v1.3.5, calling a replace similar to the reproducible example three times on the same DataFrame converts the value column's dtype from Int64 to object to float64. This is confusing behavior, especially for a column that was initially cast explicitly with astype({"value": "Int64"}).

It is worth noting, however, that if the value is not explicitly cast initially, the implicit conversion to object can work. The following example works on both pandas 1.3.5 and 1.4.0 (but not on the main branch):

import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [42, None]})
assert df.replace({np.nan: None}).to_dict() == {"value": {0: 42, 1: None}}

This requires using NumPy's NaN instead, because that is the default null value of float64. The solution fails on the main branch and the 1.5 milestone because of a different change. I need to open another issue addressing this (EDIT: opened a new issue, #45725).

@treiher
Author

treiher commented Feb 1, 2022

@roib20 Thanks for your thorough investigation and your explanation of how the code can be made compatible with pandas 1.4.0! I'm also interested to know whether this new behavior is intended and will be kept as it is.

@roib20
Contributor

roib20 commented Feb 1, 2022

@roib20 Thanks for your thorough investigation and your explanation of how the code can be made compatible with pandas 1.4.0! I'm also interested to know whether this new behavior is intended and will be kept as it is.

It turns out that it is not yet fully decided if this should be intended behavior or not. On a different but related issue I received this comment:

However I would like to know first whether these changes in behavior are regressions or actually intended. I'm still not sure

I don't think we have completely decided on this matter. @jorisvandenbossche has advocated non-casting behavior for __setitem__-like ops (#25288 (comment), #39584), and I've advocated having casting logic be consistent between methods (#45153).

I'd suggest weighing in on the discussion in #39584, #24020, #45153, #25288

Originally posted by @jbrockmendel in #45729 (comment)

For now, my recommendation would be to be explicit about the types you want to use, with astype. Implicit type conversions can cause many issues, so it's always better to be explicit (it's even in The Zen of Python). You can use the code sample in my previous reply.

I did, however, write a small fix for this bug that makes the behavior more consistent with 1.3.5. I wrote commit 0dd5a1d, which appears to solve this issue and #45729; however, I am waiting for more input on the desired behavior before making a pull request with more tests.

@jorisvandenbossche
Member

@roib20 thanks for the investigation!

I also wanted to highlight the solution / workaround that you already suggested to actually get None in both old and new pandas versions, which is explicitly casting to object dtype:

In [23]: df.astype(object).replace({pd.NA: None}).to_dict()
Out[23]: {'value': {0: 42, 1: None}}

This way you ensure that you have a data type that can hold any value exactly as you want it (in this case, None).

But the underlying question, as you mentioned above, is indeed: should operations like replace preserve the dtype (or try to preserve it if possible)?
I personally think that for setitem or fillna operations we should preserve the dtype (#39584, #25288). But replace is something different (e.g. you could use it to change a set of string values to numeric values with a given replacement dictionary), and we might want to be more flexible. But we should probably still try to preserve the dtype. And then the issue here is: is None a valid value for the Int64 dtype or not? (In constructors it is generally considered a valid value, but for replace we should maybe be stricter.)

I did, however, write a small fix for this bug that makes the behavior more consistent with 1.3.5. I wrote commit 0dd5a1d, which appears to solve this issue and #45729; however, I am waiting for more input on the desired behavior before making a pull request with more tests.

Small note on this: I don't think a potential solution can use to_native_types here, because that will convert to object dtype, which should be avoided if not needed.


Aside: we should maybe add a dtype keyword to replace() to be able to specify the target dtype of the values after replacement, in case pandas' inference doesn't do it correctly (as long as we try to preserve the dtype but fall back to another when needed, we will always do some dtype inference, and dtype inference can always produce something other than what you wanted).

@xmatthias

xmatthias commented Feb 7, 2022

@jorisvandenbossche the problem with explicit casting is that in more complex dataframes, the whole dataframe will become of type "object", not just "where necessary".

The usual reason to replace NaN / NaT (at least for my use case) is to get valid JSON after df.values.tolist().
For that, I don't need to know which columns I'm replacing values in, so a broad df.replace() will suffice; so far, it changed all necessary values to the correct type.

Now, using .astype(object) first forces a change in ALL dtypes (maybe across 20 columns), where only 3 columns contained NaN/NaT values.

Obviously we could loop through the columns, analyze "does this contain NaN", and only then change the type on that column - but that's quite some effort (and considerably slower, as it's a loop and an analysis of a potentially big dataframe, once per column) compared to how it worked in 1.3.5.

A very crude example of this would be the following:

import numpy as np
import pandas as pd

pd.DataFrame([[pd.NaT, 5], [pd.NaT, 3]]).replace({np.nan: None})

The dtypes after replacement are as follows:

pd.DataFrame([[pd.NaT, 5], [pd.NaT, 3]]).replace({np.nan: None}).dtypes
Out[10]: 
0    object
1     int64
dtype: object

Notice that only the necessary dtypes have changed. Using .astype(object) before the replacement forces all dtypes to object (that's what it's supposed to do).

Maybe a changeTypeWhereNecessary flag to the replace() call could be an option, which reverts to the prior behaviour and only changes the dtype where necessary (as was the case in 1.3.5).
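A minimal sketch of the per-column approach described above (the helper name replace_na_with_none is illustrative, not a pandas API): cast to object only the columns that actually contain missing values, then substitute None for each missing entry, leaving other columns' dtypes untouched.

```python
import pandas as pd

# Hypothetical helper (not a pandas API): cast to object only the columns
# that actually contain missing values, then substitute None for each
# missing entry. Columns without missing values keep their original dtype.
def replace_na_with_none(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            out[col] = pd.Series(
                [None if pd.isna(v) else v for v in out[col]],
                index=out.index,
                dtype=object,
            )
    return out

df = pd.DataFrame([[pd.NaT, 5], [pd.NaT, 3]])
res = replace_na_with_none(df)
print(res.dtypes)  # column 0 becomes object, column 1 stays int64
```

This should mirror the selective casting of the 1.3.5 behavior, at the cost of the per-column scan mentioned above.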

@jorisvandenbossche
Member

jorisvandenbossche commented Feb 7, 2022

the problem with explicit casting is that in more complex dataframes, the whole dataframe will become of type "object", not just "where necessary".

When replacing with None, and you actually want to have None in the values, then object dtype is the only dtype that can hold None.
But yes, as you explained clearly, it's true that replace will automatically only replace values in columns where such values occur (so columns without missing values can be left as-is, without converting to object dtype), while for a preceding astype you would need to manually specify which columns to convert to object dtype.

Writing this, another aside: we should maybe add a keyword to to_dict to be able to specify the "NA value" to use for missing values (so if you convert to a python dict of native python objects, as is the case for the default orient, you can specify in to_dict that you want to use None for missing values, without having to do the replace in the step before).

@jorisvandenbossche
Member

The best way forward here is maybe to be more "strict" about inferring the type from the replacement values. So if the replacement values include None, that doesn't match an Int64Dtype, and thus we return object dtype as the result. If you instead replace a value with pd.NA, that could preserve the nullable dtype.

@simonjayhawkins
Member

I did further debugging and found that PR #45081 is not what's causing this issue. The issue stems from a change in a different method: _replace_coerce in /pandas/core/internals/blocks.py. The relevant change is from PR #44940.

Can confirm the first bad commit: [9cd1c6f] BUG: nullable dtypes not preserved in Series.replace (#44940)

@mroeschke removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Feb 10, 2022
@jbrockmendel
Member

As mentioned in #45836, one option would be to check whether both to_replace and value are NA and, if so, not treat them as matching, regardless of is_valid_na_for_dtype
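A minimal sketch of that rule (the function name is illustrative, not a pandas internal): if both the value being replaced and the replacement are NA-like, skip the match instead of treating the various NA variants (pd.NA, None, np.nan, pd.NaT) as interchangeable.

```python
import pandas as pd

# Hypothetical sketch of the rule proposed above (illustrative name, not
# a pandas internal): if both the value being replaced and the replacement
# are NA-like, do not treat them as matching; otherwise fall back to the
# normal match logic.
def should_replace(to_replace, value) -> bool:
    both_na = pd.isna(to_replace) and pd.isna(value)
    return not both_na

print(should_replace(pd.NA, None))  # False: NA-to-NA is not a match
print(should_replace(pd.NA, 0))     # True
```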

@simonjayhawkins
Member

moving to 1.4.2

@simonjayhawkins
Member

simonjayhawkins commented Mar 17, 2022

I've opened a PR that reverts to the 1.3.5 behavior for now. Splitting the block so that only the columns where values are replaced are cast to object can be done as a follow-up on the main branch.

@matanox

matanox commented Jun 13, 2024

Should we upgrade to the latest pandas version to get this fix? I think I'm seeing the issue with pandas 2.2.2 as well.

Any advice for initializing a DataFrame with an integer dtype that avoids this?
I'm initializing with dtype = 'Int64', because dtype = int has no effect (the dtype will be float64 after initialization). With Int64, pandas fills the table with pd.NA, which then runs into the issue that is the topic of this discussion:

When assigning a list containing None values to a row with .loc, the NA values are not overridden by Nones where present in the list. This cascades to failures in downstream processing of the dataframe.

I'd probably just make pd.NA the only allowed null value in integer DataFrames in my code, departing from the usual Python syntax and functions that seamlessly club together the various null types. As I understand it, NaN is a float type, so that kind of makes sense (though a somewhat more elegant API surface could be envisioned).
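The coercion described above can be checked in miniature: when assigning into a nullable Int64 Series, None is coerced to pd.NA rather than stored as-is, so downstream code never sees a real None.

```python
import pandas as pd

# Assigning None into a nullable Int64 Series coerces it to pd.NA, the
# dtype's missing-value sentinel; the Python None object is not kept.
s = pd.Series([1, 2], dtype="Int64")
s.loc[1] = None
print(s.loc[1] is pd.NA)  # True
print(s.dtype)            # Int64
```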
