Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series.fillna raising with float32 dtype when using value arg #43424

Closed
3 tasks done
mzeitlin11 opened this issue Sep 6, 2021 · 8 comments · Fixed by #43455
Closed
3 tasks done

BUG: Series.fillna raising with float32 dtype when using value arg #43424

mzeitlin11 opened this issue Sep 6, 2021 · 8 comments · Fixed by #43455
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@mzeitlin11
Copy link
Member

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

ser = pd.Series([1.0, 2.0], dtype="float32")
val = {1: 1.01}
ser.fillna(val)

Problem description

The above code raises with
TypeError: Cannot cast array data from dtype('float64') to dtype('float32') according to the rule 'safe'

Expected Output

0    1.0
1    2.0
dtype: float32

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 343ac2a
python : 3.8.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.4.0
Version : Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.0.dev0+599.g343ac2a417.dirty
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.1
setuptools : 49.6.0.post20210108
Cython : 0.29.24
pytest : 6.2.4
hypothesis : 6.14.3
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.05.0
fastparquet : 0.6.3
gcsfs : 2021.05.0
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : 0.4.2
scipy : 1.7.0
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

@mzeitlin11 mzeitlin11 added Bug Needs Triage Issue that has not been reviewed by a pandas team member Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 6, 2021
@mzeitlin11
Copy link
Member Author

One potential fix is the following diff:

--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -6309,7 +6309,7 @@ class NDFrame(PandasObject, indexing.IndexingMixin):
             if self.ndim == 1:
                 if isinstance(value, (dict, ABCSeries)):
                     value = create_series_with_explicit_dtype(
-                        value, dtype_if_empty=object
+                        value, dtype_if_empty=object, dtype=self.dtype
                     )
                     value = value.reindex(self.index, copy=False)
                     value = value._values

However, this breaks pandas/tests/series/methods/test_fillna.py::TestSeriesFillNA::test_fillna_categorical_raises because of a behavior change in the following example:

data = ["a", np.nan, "b", np.nan, np.nan]
ser = pd.Series(pd.Categorical(data, categories=["a", "b"]))
print(ser.fillna({1: "d", 3: "a"}))

On master this raises TypeError, but under this patch the result would be

0      a
1    NaN
2      b
3      a
4    NaN
dtype: category
Categories (2, object): ['a', 'b']

This result seems reasonable to me since "d" would be a missing category and so becomes nan. Do others have thoughts on this change?

@mzeitlin11
Copy link
Member Author

Counterargument to this would be inconsistency with the following:

data = ["a", np.nan]
ser = pd.Series(pd.Categorical(data, categories=["a"]))
ser.fillna("d")

raises TypeError: Cannot setitem on a Categorical with a new category (d), set the categories first

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 7, 2021
@simonjayhawkins
Copy link
Member

The above code raises with
TypeError: Cannot cast array data from dtype('float64') to dtype('float32') according to the rule 'safe'

was giving the expected output in 1.1.5

first bad commit: [6d5da02] PERF: use np.putmask instead of ndarray.setitem (#37945)

cc @jbrockmendel

@simonjayhawkins simonjayhawkins added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 7, 2021
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone Sep 7, 2021
@mzeitlin11
Copy link
Member Author

General issue here seems to stem from a disagreement between _can_hold_element and np.putmask -> since _can_hold_element evaluates to True we end up in putmask_inplace, but np.putmask raises for placing float64 in float32 buffer

@mzeitlin11
Copy link
Member Author

A nice patch would seem to be

--- a/pandas/core/dtypes/cast.py
+++ b/pandas/core/dtypes/cast.py
@@ -2208,10 +2208,11 @@ def can_hold_element(arr: ArrayLike, element: Any) -> bool:
 
     elif dtype.kind == "f":
         if tipo is not None:
-            # TODO: itemsize check?
             if tipo.kind not in ["f", "i", "u"]:
                 # Anything other than float/integer we cannot hold
                 return False
+            elif dtype.itemsize < tipo.itemsize:
+                return False
             elif not isinstance(tipo, np.dtype):
                 # i.e. nullable IntegerDtype or FloatingDtype;
                 #  we can put this into an ndarray losslessly iff it has no NAs

since this would match the itemsize check for ints. But this breaks a test because of the following behavior change:

s = pd.Series(np.arange(10), dtype="float32")
mask = s < 5
s[mask] = range(2, 7)  
# s is float64 after change, float32 before

@simonjayhawkins
Copy link
Member

A nice patch would seem to be

my comment was more to indicate that the issue is a regression rather than suggesting that the old code is a better fix. Not all users upgrade on every pandas release, so if this is fixed, we could backport and may save some users some pain. (but have not milestoned 1.3.x since it is an "old" regression)

@jbrockmendel
Copy link
Member

since _can_hold_element evaluates to True

There is a longstanding problem with ExtensionBlock._can_hold_element always evaluating (incorrectly) to True. It could be amended (at least for CategoricalBlock) fairly easily, but that would in turn change some __setitem__ behavior (to cast instead of raise). I'd be OK with this as a it would be consistent with other dtypes.

@jbrockmendel
Copy link
Member

Not sure if I follow the whole thread, but if changing a np.putmask usage to a __setitem__ usage fixes it, thats fine. We made an effort to use np.putmask where possible a while back bc it improved perf in dtlike comparisons

@jreback jreback modified the milestones: Contributions Welcome, 1.3.3 Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants