
BUG: Replacing pd.NA by None has no effect #45601

Closed
2 of 3 tasks
treiher opened this issue Jan 24, 2022 · 16 comments · Fixed by #46404
Labels
Bug, NA - MaskedArrays (Related to pd.NA and nullable extension arrays), replace (replace method)
Milestone

Comments

@treiher

treiher commented Jan 24, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})

assert df.replace({pd.NA: None}).to_dict() == {"value": {0: 42, 1: None}}

Issue Description

Since version 1.4.0, .replace({pd.NA: None}) has no effect: pd.NA is no longer replaced by None.

Expected Behavior

The assertion in the example shown above holds in version 1.3.5.

Installed Versions

INSTALLED VERSIONS

commit : bb1f651
python : 3.10.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.13-arch1-1
Version : #1 SMP PREEMPT Wed, 05 Jan 2022 16:20:59 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.4.0
numpy : 1.21.5
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.1.0
Cython : None
pytest : 6.2.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.5.1
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.29
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
zstandard : None

@treiher added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Jan 24, 2022
@roib20
Contributor

roib20 commented Jan 27, 2022

I tried to investigate this issue. NDFrame.replace is found in /pandas/core/generic.py, and there have indeed been changes between the 1.3.x and 1.4.x branches. There have been no further changes to this method on the main branch.

In the updated method, there is a new comment:

# GH#36984 if the user explicitly passes value=None we want to

So there has been a change in response to issue #36984.

Further investigation is needed to determine whether this change is working as intended.

@roib20
Contributor

roib20 commented Jan 27, 2022

Change in behavior is from merged PR #45081

@roib20
Contributor

roib20 commented Jan 29, 2022

I did further debugging and found that PR #45081 is not what's causing this issue. The issue stems from a change in a different method: _replace_coerce in /pandas/core/internals/blocks.py. The relevant change is from PR #44940.

I did a basic test with your reproducible example; using the _replace_coerce method from 1.3.x under the 1.4.x code does achieve the expected behavior. However, PR #44940 was designed to fix multiple other bugs, so reverting the change isn't simple.

@roib20
Contributor

roib20 commented Jan 29, 2022

take

@lithomas1 added the NA - MaskedArrays (Related to pd.NA and nullable extension arrays) and replace (replace method) labels on Jan 30, 2022
@roib20
Contributor

roib20 commented Jan 30, 2022

I investigated this issue thoroughly. I considered sending a pull request at first, but now I'm starting to think this might be intended behavior after all. I need help determining whether or not this is really intended behavior. My explanation is below.

First, here's a way to make the Reproducible Example work on pandas 1.4.0 and above (also tested on the main branch):

import pandas as pd

df = pd.DataFrame({"value": [42, None]}).astype({"value": "Int64"})

assert df.astype(object).replace({pd.NA: None}).to_dict() == {"value": {0: 42, 1: None}}

The addition here is the explicit cast df.astype(object) before calling replace.
Note that this solution is not backwards-compatible with pandas 1.3.5.

To further explain: when using data types in pandas, a None value can only be kept under the object dtype, which is incompatible with Int64. In pandas 1.3.5, the replace method was able to implicitly convert types, which made every value in a converted df an object. I am not sure this was actually intended behavior; evidence for that is a quirk where calling replace twice can convert the type back into float64. So in v1.3.5, calling a replace similar to the reproducible example three times on the same DataFrame converts the value column's dtype from Int64 to object to float64. This is confusing behavior, especially for a column that was initially cast explicitly with astype({"value": "Int64"}).

It is worth noting, however, that if the value is not explicitly cast initially, the implicit conversion to object can work. The following example works on both pandas 1.3.5 and 1.4.0 (but not on the main branch):

import pandas as pd
import numpy as np

df = pd.DataFrame({"value": [42, None]})
assert df.replace({np.nan: None}).to_dict() == {"value": {0: 42, 1: None}}

This requires using NumPy's NaN instead, because that is the default null value of float64. The solution fails on the main branch and the 1.5 milestone because of a different change. I need to open another issue addressing this (EDIT: opened a new issue, #45725).

@treiher
Author

treiher commented Feb 1, 2022

@roib20 Thanks for your thorough investigation and your explanation of how the code can be made compatible with pandas 1.4.0! I'm also interested to know whether this new behavior is intended and will be kept as it is.

@roib20
Contributor

roib20 commented Feb 1, 2022

@roib20 Thanks for your thorough investigation and your explanation of how the code can be made compatible with pandas 1.4.0! I'm also interested to know whether this new behavior is intended and will be kept as it is.

It turns out that it is not yet fully decided if this should be intended behavior or not. On a different but related issue I received this comment:

However I would like to know first whether these changes in behavior are regressions or actually intended. I'm still not sure

I don't think we have completely decided on this matter. @jorisvandenbossche has advocated non-casting behavior for __setitem__-like ops (#25288 (comment), #39584), and I've advocated having casting logic be consistent between methods (#45153).

I'd suggest weighing in on the discussion in #39584, #24020, #45153, #25288

Originally posted by @jbrockmendel in #45729 (comment)

For now, my recommendation would be to be explicit about the types you want to use, with astype. Implicit type conversions can cause many issues, so it's always better to be explicit (it's even in The Zen of Python). You can use the code sample in my previous reply.

I did, however, write a small fix for this bug that makes the behavior more consistent with 1.3.5. I wrote commit 0dd5a1d, which appears to solve this issue and #45729; however, I am waiting for more input on the desired behavior before making a pull request with more tests.

@jorisvandenbossche
Member

@roib20 thanks for the investigation!

I also wanted to highlight the solution / workaround that you already suggested to actually get None in both old and new pandas versions, which is explicitly casting to object dtype:

In [23]: df.astype(object).replace({pd.NA: None}).to_dict()
Out[23]: {'value': {0: 42, 1: None}}

This way you ensure that you have a data type that can hold any value exactly as you want it (in this case, None).

But the underlying question, as you mentioned above, is indeed: should operations like replace preserve the dtype (or try to preserve it if possible)?
I personally think that for setitem or fillna operations we should preserve the dtype (#39584, #25288). But replace is something different (e.g. you could use it to change a set of string values to numeric values with a given replacement dictionary), and we might want to be more flexible. But we should probably still try to preserve the dtype. And then the issue here is: is None a valid value for the Int64 dtype or not? (In constructors it is generally considered a valid value, but for replace we should maybe be stricter.)

I did, however, write a small fix for this bug that makes the behavior more consistent with 1.3.5. I wrote commit 0dd5a1d, which appears to solve this issue and #45729; however, I am waiting for more input on the desired behavior before making a pull request with more tests.

Small note on this: I don't think a potential solution can use to_native_types here, because that will convert to object dtype, which should be avoided if not needed.


Aside: we should maybe add a dtype keyword to replace() to be able to specify the target dtype of the values after replacement, in case pandas' inference doesn't do it correctly (as long as we try to preserve the dtype but fall back to another when needed, we will always do some dtype inference, and dtype inference can always produce something other than what you wanted).

@xmatthias

xmatthias commented Feb 7, 2022

@jorisvandenbossche the problem with explicit casting is that in more complex dataframes, the whole dataframe will become of type "object", not just "where necessary".

The usual reason to replace NaN / NaT (at least for my use case) is to get valid JSON after df.values.tolist().
For that, I don't need to know which columns I'm replacing values in, so a broad df.replace() will suffice; so far, it changed all necessary values to the correct type.

Now, using .astype(object) first forces a change in ALL dtypes (maybe across 20 columns), where only 3 columns contained NaN/NaT values.

Obviously we could loop through the columns, analyze "does this contain NaN", and only then change the type on that column - but that's quite some effort (and considerably slower, as it's a loop and an analysis of a potentially big dataframe, once per column) compared to how it worked in 1.3.5.

A very crude example of this would be the following:

import numpy as np
import pandas as pd

pd.DataFrame([[pd.NaT, 5], [pd.NaT, 3]]).replace({np.nan: None})

The dtypes after replacement are as follows:

pd.DataFrame([[pd.NaT, 5], [pd.NaT, 3]]).replace({np.nan: None}).dtypes
Out[10]: 
0    object
1     int64
dtype: object

Notice that only the necessary dtypes have changed. Using .astype(object) before the replacement forces all dtypes to object (that's what it's supposed to do).

Maybe a changeTypeWhereNecessary flag to the replace() call could be an option, which reverts to the prior behaviour and only changes the dtype where necessary (as was the case in 1.3.5).
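A minimal sketch of the per-column approach described above (the helper name replace_na_with_none is illustrative, not a pandas API): cast to object only the columns that actually contain missing values, then substitute None for each missing entry, leaving other columns' dtypes untouched.

```python
import pandas as pd

# Hypothetical helper (not a pandas API): cast to object only the columns
# that actually contain missing values, then substitute None for each
# missing entry. Columns without missing values keep their original dtype.
def replace_na_with_none(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if out[col].isna().any():
            out[col] = pd.Series(
                [None if pd.isna(v) else v for v in out[col]],
                index=out.index,
                dtype=object,
            )
    return out

df = pd.DataFrame([[pd.NaT, 5], [pd.NaT, 3]])
res = replace_na_with_none(df)
print(res.dtypes)  # column 0 becomes object, column 1 stays int64
```

This should mirror the selective casting of the 1.3.5 behavior, at the cost of the per-column scan mentioned above.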

@jorisvandenbossche
Member

jorisvandenbossche commented Feb 7, 2022

the problem with explicit casting is that in more complex dataframes, the whole dataframe will become of type "object", not just "where necessary".

When replacing with None, and you actually want to have None in the values, then object dtype is the only dtype that can hold None.
But yes, as you explained clearly, it's true that replace will automatically only replace values in columns where such values occur (so columns without missing values can be left as-is, without converting to object dtype), while for a preceding astype you would need to manually specify which columns to convert to object dtype.

Writing this, another aside: we should maybe add a keyword to to_dict to be able to specify the "NA value" to use for missing values (so if you convert to a python dict of native python objects, as is the case for the default orient, you can specify in to_dict that you want to use None for missing values, without having to do the replace in the step before).

@jorisvandenbossche
Member

The best way forward here is maybe to be more "strict" about inferring the type from the replacement values. So if the replacement values include None, that doesn't match an Int64Dtype, and thus we return object dtype as the result. If you instead replace a value with pd.NA, that could preserve the nullable dtype.

@simonjayhawkins
Member

I did further debugging and found that PR #45081 is not what's causing this issue. The issue stems from a change in a different method: _replace_coerce in /pandas/core/internals/blocks.py. The relevant change is from PR #44940.

Can confirm the first bad commit: [9cd1c6f] BUG: nullable dtypes not preserved in Series.replace (#44940)

@mroeschke removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on Feb 10, 2022
@jbrockmendel
Member

As mentioned in #45836, one option would be to check whether both to_replace and value are NA and, if so, not treat them as matching, regardless of is_valid_na_for_dtype
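A minimal sketch of that rule (the function name is illustrative, not a pandas internal): if both the value being replaced and the replacement are NA-like, skip the match instead of treating the various NA variants (pd.NA, None, np.nan, pd.NaT) as interchangeable.

```python
import pandas as pd

# Hypothetical sketch of the rule proposed above (illustrative name, not
# a pandas internal): if both the value being replaced and the replacement
# are NA-like, do not treat them as matching; otherwise fall back to the
# normal match logic.
def should_replace(to_replace, value) -> bool:
    both_na = pd.isna(to_replace) and pd.isna(value)
    return not both_na

print(should_replace(pd.NA, None))  # False: NA-to-NA is not a match
print(should_replace(pd.NA, 0))     # True
```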

@simonjayhawkins
Member

moving to 1.4.2

@simonjayhawkins
Member

simonjayhawkins commented Mar 17, 2022

I've opened a PR that reverts to the 1.3.5 behavior for now. Splitting the block so that only the columns where values are replaced are cast to object can be done as a follow-up on the main branch.

@matanox

matanox commented Jun 13, 2024

Should we upgrade to the latest pandas version to get this fix? I think I'm seeing the issue with pandas 2.2.2 as well.

Any advice for initializing a DataFrame with an integer dtype that avoids this?
I'm initializing with dtype = 'Int64', because dtype = int has no effect (the dtype will be float64 after initialization). With Int64, pandas fills the table with pd.NA, which then runs into the issue that is the topic of this discussion:

When assigning a list containing None values to a row with .loc, the NA values are not overridden by Nones where present in the list. This cascades to failures in downstream processing of the dataframe.

I'd probably just make pd.NA the only allowed null value in integer DataFrames in my code, departing from the usual Python syntax and functions that seamlessly club together the various null types. As I understand it, NaN is a float type, so that kind of makes sense (though a somewhat more elegant API surface could be envisioned).
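The coercion described above can be checked in miniature: when assigning into a nullable Int64 Series, None is coerced to pd.NA rather than stored as-is, so downstream code never sees a real None.

```python
import pandas as pd

# Assigning None into a nullable Int64 Series coerces it to pd.NA, the
# dtype's missing-value sentinel; the Python None object is not kept.
s = pd.Series([1, 2], dtype="Int64")
s.loc[1] = None
print(s.loc[1] is pd.NA)  # True
print(s.dtype)            # Int64
```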
