BUG: Series.fillna raising with `float32` dtype when using value arg #43424

mzeitlin11 · 2021-09-06T00:54:25Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

ser = pd.Series([1.0, 2.0], dtype="float32")
val = {1: 1.01}
ser.fillna(val)

Problem description

The above code raises with
TypeError: Cannot cast array data from dtype('float64') to dtype('float32') according to the rule 'safe'

Expected Output

0    1.0
1    2.0
dtype: float32

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : 343ac2a
python : 3.8.10.final.0
python-bits : 64
OS : Darwin
OS-release : 20.4.0
Version : Darwin Kernel Version 20.4.0: Thu Apr 22 21:46:47 PDT 2021; root:xnu-7195.101.2~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 1.4.0.dev0+599.g343ac2a417.dirty
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.2.1
setuptools : 49.6.0.post20210108
Cython : 0.29.24
pytest : 6.2.4
hypothesis : 6.14.3
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : 1.4.4
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : 7.25.0
pandas_datareader: None
bs4 : 4.9.3
bottleneck : 1.3.2
fsspec : 2021.05.0
fastparquet : 0.6.3
gcsfs : 2021.05.0
matplotlib : 3.4.2
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : 4.0.1
pyxlsb : None
s3fs : 0.4.2
scipy : 1.7.0
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : 0.8.9
xarray : 0.19.0
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.53.1

The text was updated successfully, but these errors were encountered:

mzeitlin11 · 2021-09-06T01:01:17Z

One potential fix is the following diff:

--- a/pandas/core/generic.py
+++ b/pandas/core/generic.py
@@ -6309,7 +6309,7 @@ class NDFrame(PandasObject, indexing.IndexingMixin):
             if self.ndim == 1:
                 if isinstance(value, (dict, ABCSeries)):
                     value = create_series_with_explicit_dtype(
-                        value, dtype_if_empty=object
+                        value, dtype_if_empty=object, dtype=self.dtype
                     )
                     value = value.reindex(self.index, copy=False)
                     value = value._values

However, this breaks pandas/tests/series/methods/test_fillna.py::TestSeriesFillNA::test_fillna_categorical_raises because of a behavior change in the following example:

data = ["a", np.nan, "b", np.nan, np.nan]
ser = pd.Series(pd.Categorical(data, categories=["a", "b"]))
print(ser.fillna({1: "d", 3: "a"}))

On master this raises TypeError, but under this patch the result would be

0      a
1    NaN
2      b
3      a
4    NaN
dtype: category
Categories (2, object): ['a', 'b']

This result seems reasonable to me since "d" would be a missing category and so becomes nan. Do others have thoughts on this change?

mzeitlin11 · 2021-09-07T02:54:10Z

Counterargument to this would be inconsistency with the following:

data = ["a", np.nan]
ser = pd.Series(pd.Categorical(data, categories=["a"]))
ser.fillna("d")

raises TypeError: Cannot setitem on a Categorical with a new category (d), set the categories first

simonjayhawkins · 2021-09-07T15:37:12Z

The above code raises with
TypeError: Cannot cast array data from dtype('float64') to dtype('float32') according to the rule 'safe'

was giving the expected output in 1.1.5

first bad commit: [6d5da02] PERF: use np.putmask instead of ndarray.setitem (#37945)

cc @jbrockmendel

mzeitlin11 · 2021-09-07T16:37:21Z

General issue here seems to stem from a disagreement between _can_hold_element and np.putmask -> since _can_hold_element evaluates to True we end up in putmask_inplace, but np.putmask raises for placing float64 in float32 buffer

mzeitlin11 · 2021-09-07T17:30:00Z

A nice patch would seem to be

--- a/pandas/core/dtypes/cast.py
+++ b/pandas/core/dtypes/cast.py
@@ -2208,10 +2208,11 @@ def can_hold_element(arr: ArrayLike, element: Any) -> bool:
 
     elif dtype.kind == "f":
         if tipo is not None:
-            # TODO: itemsize check?
             if tipo.kind not in ["f", "i", "u"]:
                 # Anything other than float/integer we cannot hold
                 return False
+            elif dtype.itemsize < tipo.itemsize:
+                return False
             elif not isinstance(tipo, np.dtype):
                 # i.e. nullable IntegerDtype or FloatingDtype;
                 #  we can put this into an ndarray losslessly iff it has no NAs

since this would match the itemsize check for ints. But this breaks a test because of the following behavior change:

s = pd.Series(np.arange(10), dtype="float32")
mask = s < 5
s[mask] = range(2, 7)  
# s is float64 after change, float32 before

simonjayhawkins · 2021-09-07T17:41:05Z

A nice patch would seem to be

my comment was more to indicate that the issue is a regression rather than suggesting that the old code is a better fix. Not all users upgrade on every pandas release, so if this is fixed, we could backport and may save some users some pain. (but have not milestoned 1.3.x since it is an "old" regression)

jbrockmendel · 2021-09-07T17:51:29Z

since _can_hold_element evaluates to True

There is a longstanding problem with ExtensionBlock._can_hold_element always evaluating (incorrectly) to True. It could be amended (at least for CategoricalBlock) fairly easily, but that would in turn change some __setitem__ behavior (to cast instead of raise). I'd be OK with this as a it would be consistent with other dtypes.

jbrockmendel · 2021-09-07T18:04:04Z

Not sure if I follow the whole thread, but if changing a np.putmask usage to a __setitem__ usage fixes it, thats fine. We made an effort to use np.putmask where possible a while back bc it improved perf in dtlike comparisons

mzeitlin11 added Bug Needs Triage Issue that has not been reviewed by a pandas team member Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Sep 6, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Sep 7, 2021

code sample for pandas-dev#43424

eda7729

simonjayhawkins added Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 7, 2021

simonjayhawkins added this to the Contributions Welcome milestone Sep 7, 2021

mzeitlin11 mentioned this issue Sep 8, 2021

REGR: fillna on f32 column raising for f64 #43455

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.3.3 Sep 9, 2021

jreback closed this as completed in #43455 Sep 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Series.fillna raising with `float32` dtype when using value arg #43424

BUG: Series.fillna raising with `float32` dtype when using value arg #43424

mzeitlin11 commented Sep 6, 2021

INSTALLED VERSIONS

mzeitlin11 commented Sep 6, 2021

mzeitlin11 commented Sep 7, 2021

simonjayhawkins commented Sep 7, 2021

mzeitlin11 commented Sep 7, 2021

mzeitlin11 commented Sep 7, 2021

simonjayhawkins commented Sep 7, 2021

jbrockmendel commented Sep 7, 2021

jbrockmendel commented Sep 7, 2021

BUG: Series.fillna raising with float32 dtype when using value arg #43424

BUG: Series.fillna raising with float32 dtype when using value arg #43424

Comments

mzeitlin11 commented Sep 6, 2021

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

mzeitlin11 commented Sep 6, 2021

mzeitlin11 commented Sep 7, 2021

simonjayhawkins commented Sep 7, 2021

mzeitlin11 commented Sep 7, 2021

mzeitlin11 commented Sep 7, 2021

simonjayhawkins commented Sep 7, 2021

jbrockmendel commented Sep 7, 2021

jbrockmendel commented Sep 7, 2021

BUG: Series.fillna raising with `float32` dtype when using value arg #43424

BUG: Series.fillna raising with `float32` dtype when using value arg #43424

Output of `pd.show_versions()`