diff --git a/web/pandas/pdeps/0006-ban-upcasting.md b/web/pandas/pdeps/0006-ban-upcasting.md
index 325c25313af53..12e4c084cc826 100644
--- a/web/pandas/pdeps/0006-ban-upcasting.md
+++ b/web/pandas/pdeps/0006-ban-upcasting.md
@@ -9,7 +9,7 @@
 ## Abstract
 
 The suggestion is that setitem-like operations would
-not change a ``Series`` dtype (nor that of a ``DataFrame``'s column).
+not change a ``Series``' dtype (nor that of a ``DataFrame``'s column).
 
 Current behaviour:
 ```python
@@ -51,65 +51,71 @@ In[11]: ser[2] = "2000-01-04x"  # typo - but pandas does not error, it upcasts t
 ```
 
 The scope of this PDEP is limited to setitem-like operations on Series (and DataFrame columns).
-For example, starting with
+For example, starting with:
 ```python
 df = DataFrame({"a": [1, 2, np.nan], "b": [4, 5, 6]})
 ser = df["a"].copy()
 ```
 then the following would all raise:
-- setitem-like operations:
-  - ``ser.fillna('foo', inplace=True)``;
-  - ``ser.where(ser.isna(), 'foo', inplace=True)``
-  - ``ser.fillna('foo', inplace=False)``;
-  - ``ser.where(ser.isna(), 'foo', inplace=False)``
-- setitem indexing operations (where ``indexer`` could be a slice, a mask,
+* setitem-like operations:
+
+  - ``ser.fillna('foo', inplace=True)``
+  - ``ser.where(ser.isna(), 'foo', inplace=True)``
+  - ``ser.fillna('foo', inplace=False)``
+  - ``ser.where(ser.isna(), 'foo', inplace=False)``
+
+* setitem indexing operations (where ``indexer`` could be a slice, a mask,
   a single value, a list or array of values, or any other allowed indexer):
-  - ``ser.iloc[indexer] = 'foo'``
-  - ``ser.loc[indexer] = 'foo'``
-  - ``df.iloc[indexer, 0] = 'foo'``
-  - ``df.loc[indexer, 'a'] = 'foo'``
-  - ``ser[indexer] = 'foo'``
+
+  - ``ser.iloc[indexer] = 'foo'``
+  - ``ser.loc[indexer] = 'foo'``
+  - ``df.iloc[indexer, 0] = 'foo'``
+  - ``df.loc[indexer, 'a'] = 'foo'``
+  - ``ser[indexer] = 'foo'``
 
 It may be desirable to expand the top list to ``Series.replace`` and ``Series.update``,
 but to keep the scope of the PDEP down, they are excluded for now.
 
 Examples of operations which would not raise are:
-- ``ser.diff()``;
-- ``pd.concat([ser, ser.astype(object)])``;
-- ``ser.mean()``;
-- ``ser[0] = 3``;  # same dtype
-- ``ser[0] = 3.``;  # 3.0 is a 'round' float and so compatible with 'int64' dtype
-- ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``;
-- ``df.index.intersection(ser.index)``.
+
+- ``ser.diff()``
+- ``pd.concat([ser, ser.astype(object)])``
+- ``ser.mean()``
+- ``ser[0] = 3``  # same dtype
+- ``ser[0] = 3.``  # 3.0 is a 'round' float and so compatible with 'int64' dtype
+- ``df['a'] = pd.date_range(datetime(2020, 1, 1), periods=3)``
+- ``df.index.intersection(ser.index)``
 
 ## Detailed description
 
 Concretely, the suggestion is:
-- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype;
-- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.
+
+- If a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype.
+- If a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.
 
 For a start, this would involve:
 
-1. changing ``Block.setitem`` such that it does not have an ``except`` block in
-   ```python
-   value = extract_array(value, extract_numpy=True)
-   try:
-       casted = np_can_hold_element(values.dtype, value)
-   except LossySetitemError:
-       # current dtype cannot store value, coerce to common dtype
-       nb = self.coerce_to_target_dtype(value)
-       return nb.setitem(indexer, value)
-   else:
-   ```
+1. changing ``Block.setitem`` such that it does not have an ``except`` block in:
+
+   ```python
+   value = extract_array(value, extract_numpy=True)
+   try:
+       casted = np_can_hold_element(values.dtype, value)
+   except LossySetitemError:
+       # current dtype cannot store value, coerce to common dtype
+       nb = self.coerce_to_target_dtype(value)
+       return nb.setitem(indexer, value)
+   else:
+   ```
 
 2. making a similar change in:
-   - ``Block.where``;
-   - ``Block.putmask``;
-   - ``EABackedBlock.setitem``;
-   - ``EABackedBlock.where``;
-   - ``EABackedBlock.putmask``;
+
+   - ``Block.where``
+   - ``Block.putmask``
+   - ``EABackedBlock.setitem``
+   - ``EABackedBlock.where``
+   - ``EABackedBlock.putmask``
 
 The above would already require several hundreds of tests to be adjusted. Note that
 once implementation starts, the list of locations to change may turn out to be slightly
@@ -147,11 +153,13 @@ numeric (without much regard for ``int`` vs ``float``) - ``'int64'`` is just wha
 when constructing it.
 
 Possible options could be:
-1. only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``);
-2. convert the float value to ``int`` before setting it (i.e. silently round all float values);
-3. limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64) .
+
+1. Only accept round floats (e.g. ``1.0``) and raise on anything else (e.g. ``1.01``).
+2. Convert the float value to ``int`` before setting it (i.e. silently round all float values).
+3. Limit "banning upcasting" to when the upcasted dtype is ``object`` (i.e. preserve current behavior of upcasting the int64 Series to float64).
 
 Let us compare with what other libraries do:
+
 - ``numpy``: option 2
 - ``cudf``: option 2
 - ``polars``: option 2
@@ -165,12 +173,13 @@ if the objective of this PDEP is to prevent bugs, then this is also not desirabl
 someone might set ``1.5`` and later be surprised to learn that they actually set ``1``.
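As an editorial aside (an illustration, not part of the diff): the option-2 behaviour attributed to ``numpy`` in the comparison above is easy to verify; assigning a float into an integer array silently casts the value, truncating toward zero, and the dtype is unchanged.

```python
import numpy as np

# Option 2 in practice: NumPy silently casts the float on setitem,
# truncating toward zero; the array's dtype does not change.
arr = np.array([1, 2, 3])
arr[0] = 1.5
print(arr)        # [1 2 3]
print(arr.dtype)  # platform default integer dtype, e.g. int64
```

This is exactly the silent information loss (``1.5`` becoming ``1``) that the surrounding discussion argues against adopting for pandas.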
 There are several downsides to option ``3``:
-- it would be inconsistent with the nullable dtypes' behaviour;
-- it would also add complexity to the codebase and to tests;
-- it would be hard to teach, as instead of being able to teach a simple rule,
-  there would be a rule with exceptions;
-- there would be a risk of loss of precision and or overflow;
-- it opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``.
+
+- It would be inconsistent with the nullable dtypes' behaviour.
+- It would also add complexity to the codebase and to tests.
+- It would be hard to teach, as instead of being able to teach a simple rule,
+  there would be a rule with exceptions.
+- There would be a risk of loss of precision and/or overflow.
+- It opens the door to other exceptions, such as not upcasting ``'int8'`` to ``'int16'``.
 
 Option ``1`` is the maximally safe one in terms of protecting users from bugs,
 being consistent with the current behaviour of nullable dtypes, and in being simple to teach.
@@ -208,22 +217,25 @@ at all. To keep this proposal focused, it is intentionally excluded from the sco
 **A**: The current behavior would be to upcast to ``int32``. So under this PDEP, it would
 instead raise.
 
-**Q: What happens in setting ``16.000000000000001`` in an `int8`` Series?**
+**Q: What happens in setting ``16.000000000000001`` in an ``int8`` Series?**
 
 **A**: As far as Python is concerned, ``16.000000000000001`` and ``16.0`` are the same number.
 So, it would be inserted as ``16`` and the dtype would not change (just like
 what happens now, there would be no change here).
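The answer above rests on a floating-point fact worth making explicit (a plain-Python check, assuming standard IEEE 754 doubles): ``16.000000000000001`` rounds to the very same 64-bit float as ``16.0``, so pandas never even sees a fractional value.

```python
# The two literals denote the same IEEE 754 double: the spacing between
# consecutive doubles near 16 is 2**-48 (~3.6e-15), far larger than the
# 1e-15 difference the literal tries to express.
print(16.000000000000001 == 16.0)         # True
print((16.000000000000001).is_integer())  # True
```

In other words, the rounding happens in the Python parser, before any pandas setitem logic runs.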
-**Q: What if I want ``1.0000000001`` to be inserted as ``1.0`` in an `'int8'` Series?**
+**Q: What if I want ``1.0000000001`` to be inserted as ``1.0`` in an ``int8`` Series?**
 
-**A**: You may want to define your own helper function, such as
-  ```python
-  >>> def maybe_convert_to_int(x: int | float, tolerance: float):
-          if np.abs(x - round(x)) < tolerance:
-              return round(x)
-          return x
-  ```
-  which you could adapt according to your needs.
+**A**: You may want to define your own helper function, such as:
+
+```python
+def maybe_convert_to_int(x: int | float, tolerance: float):
+    if np.abs(x - round(x)) < tolerance:
+        return round(x)
+    return x
+```
+
+which you could adapt according to your needs.
 
 ## Timeline