DataFrame reduction methods broken #174

coroa · 2023-05-12T15:57:45Z

With the versions:

In [20]: pint_pandas.show_versions()
{'numpy': '1.24.2',
 'pandas': '1.5.3',
 'pint': '0.20.1',
 'pint_pandas': '0.4.dev32+gc58a7fc'}

I am unable to run any dataframe-wise aggregation functions along axis=0:

In [19]: pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()
/Users/coroa/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/blocks.py:369: UnitStrippedWarning: The unit of the quantity is stripped when downcasting to ndarray.
  res_values = np.array([[result]])
---------------------------------------------------------------------------
DimensionalityError                       Traceback (most recent call last)
File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pint/facets/plain/quantity.py:702, in PlainQuantity.__int__(self)
    701     return int(self._convert_magnitude_not_inplace(UnitsContainer()))
--> 702 raise DimensionalityError(self._units, "dimensionless")

DimensionalityError: Cannot convert from 'meter' to 'dimensionless'

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
Cell In[19], line 1
----> 1 pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11797, in NDFrame._add_numeric_operations.<locals>.sum(self, axis, skipna, level, numeric_only, min_count, **kwargs)
  11777 @doc(
  11778     _num_doc,
  11779     desc="Return the sum of the values over the requested axis.\n\n"
   (...)
  11795     **kwargs,
  11796 ):
> 11797     return NDFrame.sum(
  11798         self, axis, skipna, level, numeric_only, min_count, **kwargs
  11799     )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11501, in NDFrame.sum(self, axis, skipna, level, numeric_only, min_count, **kwargs)
  11492 def sum(
  11493     self,
  11494     axis: Axis | None = None,
   (...)
  11499     **kwargs,
  11500 ):
> 11501     return self._min_count_stat_function(
  11502         "sum", nanops.nansum, axis, skipna, level, numeric_only, min_count, **kwargs
  11503     )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/generic.py:11483, in NDFrame._min_count_stat_function(self, name, func, axis, skipna, level, numeric_only, min_count, **kwargs)
  11467     warnings.warn(
  11468         "Using the level keyword in DataFrame and Series aggregations is "
  11469         "deprecated and will be removed in a future version. Use groupby "
   (...)
  11472         stacklevel=find_stack_level(),
  11473     )
  11474     return self._agg_by_level(
  11475         name,
  11476         axis=axis,
   (...)
  11480         numeric_only=numeric_only,
  11481     )
> 11483 return self._reduce(
  11484     func,
  11485     name=name,
  11486     axis=axis,
  11487     skipna=skipna,
  11488     numeric_only=numeric_only,
  11489     min_count=min_count,
  11490 )

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/frame.py:10856, in DataFrame._reduce(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)
  10852 ignore_failures = numeric_only is None
  10854 # After possibly _get_data and transposing, we are now in the
  10855 #  simple case where we can use BlockManager.reduce
> 10856 res, _ = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
  10857 out = df._constructor(res).iloc[0]
  10858 if out_dtype is not None:

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/managers.py:1569, in BlockManager.reduce(self, func, ignore_failures)
   1567 res_blocks: list[Block] = []
   1568 for blk in self.blocks:
-> 1569     nbs = blk.reduce(func, ignore_failures)
   1570     res_blocks.extend(nbs)
   1572 index = Index([None])  # placeholder

File ~/.local/conda/envs/pandas-indexing/lib/python3.11/site-packages/pandas/core/internals/blocks.py:369, in Block.reduce(self, func, ignore_failures)
    365     raise
    367 if self.values.ndim == 1:
    368     # TODO(EA2D): special case not needed with 2D EAs
--> 369     res_values = np.array([[result]])
    370 else:
    371     res_values = result.reshape(-1, 1)

ValueError: setting an array element with a sequence.

Similarly df.min(), df.max() and so on are failing.

This might be connected with:
hgrecco/pint#1128 (comment)

The text was updated successfully, but these errors were encountered:

andrewgsavage · 2023-05-14T18:00:45Z

Can this be closed now? Or we could leave this open until df.min(), df.max() etc are tested.

coroa · 2023-05-15T12:28:55Z

No, this is still broken under pandas~=1.5, as well as pandas~=2.0. I don't think anything short of a change in the __array__ or __array_func__ implementation of pint would help here.

andrewgsavage · 2023-05-17T21:14:59Z

What versions was this last working with? I am unsure if it was a pandas change or a pint change that has caused this.
I'm leaning towards a pint change as I don't see anything obvious in pandas that would cause it. Could also trying implementing __array_func__ on PintArray

In Block.reduce, pandas 1.4.x there's a try, except clause https://github.com/pandas-dev/pandas/blob/1.4.x/pandas/core/internals/blocks.py#L413
Which is also in 1.5.x
https://github.com/pandas-dev/pandas/blob/1.5.x/pandas/core/internals/blocks.py#L360
but not in 2.0.
https://github.com/pandas-dev/pandas/blob/2.0.x/pandas/core/internals/blocks.py#L340

coroa · 2023-05-23T12:24:48Z

Note that it is failing outside the try-except block. result is produced successfully (as a scalar Quantity), but wrapping it in np.array([[result]]) then raises at https://github.com/pandas-dev/pandas/blob/778ab8234059059d77eb5d71442c959f8d15c1e0/pandas/core/internals/blocks.py#L369. I do not know how or if this worked previously.

andrewgsavage · 2023-06-06T20:14:13Z

I tried several versions which didn't work.

# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.21', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.3.0', 'pint': '0.21', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.19', 'pint_pandas': '0.4'} fails
# {'numpy': '1.24.3', 'pandas': '1.4.0', 'pint': '0.19', 'pint_pandas': '0.3'} fails

I note that running pd.DataFrame([[0, 1, 2]]).astype("pint[m]").sum() works

andrewgsavage · 2023-06-07T20:50:39Z

Looks like pandas is working on this already and should be in the next version!
pandas-dev/pandas#52788

MichaelTiemannOSC · 2023-07-23T20:13:06Z

On it...this has one extraneous change (preserve dtype when converting NA to array of Quantities), but I offer these as essentially the minimal changes needed to work with upcoming Pandas 2.1. (This work work any Pandas < 2.1)

I will be bundling this with a PR that's already slated to land in Pint 0.23 (and other PRs in Pandas, Pint-Pandas, and uncertainties). But if you want to apply and test by itself...

(itr_env) iMac-Pro:pint-pandas michael$ git diff
diff --git a/pint_pandas/pint_array.py b/pint_pandas/pint_array.py
index a445b77..4605704 100644
--- a/pint_pandas/pint_array.py
+++ b/pint_pandas/pint_array.py
@@ -208,8 +208,8 @@ dtypemap = {
     float: pd.Float64Dtype(),
     np.float64: pd.Float64Dtype(),
     np.float32: pd.Float32Dtype(),
-    np.complex128: pd.core.dtypes.dtypes.PandasDtype("complex128"),
-    np.complex64: pd.core.dtypes.dtypes.PandasDtype("complex64"),
+    np.complex128: pd.core.dtypes.dtypes.NumpyEADtype("complex128"),
+    np.complex64: pd.core.dtypes.dtypes.NumpyEADtype("complex64"),
     # np.float16: pd.Float16Dtype(),
 }
 dtypeunmap = {v: k for k, v in dtypemap.items()}
@@ -520,7 +520,10 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
                 # magnitude is in fact an array scalar, which will get rejected by pandas.
                 fill_value = fill_value[()]
 
-        result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore")
+            # Turn off warning that PandasArray is deprecated for ``take``
+            result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
 
         return PintArray(result, dtype=self.dtype)
 
@@ -990,7 +993,7 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
         qtys = [
             self._Q(item, self._dtype.units)
             if item is not self.dtype.na_value.m
-            else item
+            else self.dtype.na_value
             for item in self._data
         ]
         with warnings.catch_warnings(record=True):
@@ -1048,7 +1051,7 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
             value = [item.to(self.units).magnitude for item in value]
         return arr.searchsorted(value, side=side, sorter=sorter)
 
-    def _reduce(self, name, **kwds):
+    def _reduce(self, name, *, skipna: bool = True, keepdims: bool = False, **kwds):
         """
         Return a scalar result of performing the reduction operation.
 
@@ -1092,14 +1095,18 @@ class PintArray(ExtensionArray, ExtensionOpsMixin):
 
         if isinstance(self._data, ExtensionArray):
             try:
-                result = self._data._reduce(name, **kwds)
+                result = self._data._reduce(name, skipna=skipna, keepdims=keepdims, **kwds)
             except NotImplementedError:
                 result = functions[name](self.numpy_data, **kwds)
 
         if name in {"all", "any", "kurt", "skew"}:
             return result
         if name == "var":
+            if keepdims:
+                return PintArray(result, f"pint[({self.units})**2]")
             return self._Q(result, self.units**2)
+        if keepdims:
+            return PintArray(result, self.dtype)
         return self._Q(result, self.units)
 
     def _accumulate(self, name: str, *, skipna: bool = True, **kwds):
diff --git a/pint_pandas/testsuite/test_issues.py b/pint_pandas/testsuite/test_issues.py
index c288e71..384f84f 100644
--- a/pint_pandas/testsuite/test_issues.py
+++ b/pint_pandas/testsuite/test_issues.py
@@ -202,3 +202,16 @@ def test_issue_139():
     assert np.all(a_m[0:4] == a_cm[0:4])
     for x, y in zip(a_m[4:], a_cm[4:]):
         assert unp.isnan(x) == unp.isnan(y)
+
+class TestIssue174(BaseExtensionTests):
+    def test_sum(self):
+        a = pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]")
+        row_sum = a.sum(axis=0)
+        expected_1 = pd.Series([3, 5, 7], dtype="pint[m]")
+
+        self.assert_series_equal(row_sum, expected_1)
+
+        col_sum = a.sum(axis=1)
+        expected_2 = pd.Series([3, 12], dtype="pint[m]")
+
+        self.assert_series_equal(col_sum, expected_2)
diff --git a/pint_pandas/testsuite/test_pandas_extensiontests.py b/pint_pandas/testsuite/test_pandas_extensiontests.py
index 7c92751..a638c5b 100644
--- a/pint_pandas/testsuite/test_pandas_extensiontests.py
+++ b/pint_pandas/testsuite/test_pandas_extensiontests.py
@@ -523,7 +523,7 @@ class TestArithmeticOps(base.BaseArithmeticOpsTests):
                 divmod(s, other)
 
     def _get_exception(self, data, op_name):
-        if data.data.dtype == pd.core.dtypes.dtypes.PandasDtype("complex128"):
+        if data.data.dtype == pd.core.dtypes.dtypes.NumpyEADtype("complex128"):
             if op_name in ["__floordiv__", "__rfloordiv__", "__mod__", "__rmod__"]:
                 return op_name, TypeError
         if op_name in ["__pow__", "__rpow__"]:
@@ -627,6 +627,10 @@ class TestNumericReduce(base.BaseNumericReduceTests):
             expected = expected_m
         assert result == expected
 
+    @pytest.mark.skip("tests not written yet")
+    def check_reduce_frame(self, ser: pd.Series, op_name: str, skipna: bool):
+        pass
+
     @pytest.mark.parametrize("skipna", [True, False])
     def test_reduce_scaling(
         self, data, all_numeric_reductions, skipna, USE_UNCERTAINTIES

Issue hgrecco#174 reports that DataFrame reduction was broken by the latest Pint-Pandas changes. This commit adapts Pint-Pandas to work with upcoming Pandas 2.1, currently scheduled for release Aug 20, 2023. Signed-off-by: Michael Tiemann <72577720+MichaelTiemannOSC@users.noreply.github.com>

topper-123 · 2023-08-01T11:37:49Z

Just a small note that pandas-dev/pandas#52788 has been merged in pandas. I think that should help fix this issue in pint-pandas?

topper-123 · 2023-08-01T12:10:51Z

I agree this should be possible in v2.1 (not released yet), but pint_pandas still needs to implement a keepdims parameter to PintArray._reduce at your end to make it happen, though.

I could take a stab at a PR for this if I can get someone to review the PR. I assume the tests should be added to pint_pandas/testsuite/test_pandas_extensiontests.py in e.g. a new TestReduction class?

MichaelTiemannOSC · 2023-08-01T12:22:46Z

Hey, that was me 😄

PR#140 has the keepdims changes. You could split them out and merge them independently. The TestReductions (note 's' at the end) does need to be incorporated. It looks like the Pandas base test case makes up its own data, and that such data is not as trivial as data, so a TestReductions class in Pint-Pandas would take some work and would be welcome!

andrewgsavage · 2023-08-01T12:27:24Z

yea go for it.
it would be good to split Michaels changes out in seperate PRs to make it easier to review.

I was holding out for a rc (which doesn't look like is happening) before taking a look as it takes ages for the CI to build pandas otherwise.

MichaelTiemannOSC · 2023-08-07T14:46:56Z

I've split out the changes here: #196

MichaelTiemannOSC · 2023-08-12T09:11:19Z

Pandas 2.1.0rc0 is now available. Could be a good time to sync up with our own release candidates.

195: fix pandas v2.1 NumpyEADtype issue r=andrewgsavage a=topper-123 - [x] Closes #194 - [x] Executed `pre-commit run --all-files` with no errors - [x] The change is fully covered by automated unit tests - [ ] Documented in docs/ as appropriate - [ ] Added an entry to the CHANGES file Precursor to working on #174. Co-authored-by: Terji Petersen <terji78@gmail.com>

andrewgsavage · 2023-08-15T09:10:33Z

close by #195

pd.DataFrame([[0, 1, 2], [3, 4, 5]]).astype("pint[m]").sum()
0    3.0
1    5.0
2    7.0
dtype: pint[meter]

coroa · 2023-08-15T18:16:36Z

Amazing, thanks @topper-123 !

coroa mentioned this issue May 13, 2023

Remove __array__ interface hgrecco/pint#1778

Closed

5 tasks

MichaelTiemannOSC mentioned this issue Jul 23, 2023

Add support for UFloat in PintArray (#139) #140

Open

5 tasks

topper-123 mentioned this issue Aug 1, 2023

fix pandas v2.1 NumpyEADtype issue #195

Merged

5 tasks

andrewgsavage closed this as completed Aug 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame reduction methods broken #174

DataFrame reduction methods broken #174

coroa commented May 12, 2023

andrewgsavage commented May 14, 2023

coroa commented May 15, 2023

andrewgsavage commented May 17, 2023 •

edited

coroa commented May 23, 2023

andrewgsavage commented Jun 6, 2023

andrewgsavage commented Jun 7, 2023

MichaelTiemannOSC commented Jul 23, 2023 •

edited

topper-123 commented Aug 1, 2023

topper-123 commented Aug 1, 2023

MichaelTiemannOSC commented Aug 1, 2023

andrewgsavage commented Aug 1, 2023

MichaelTiemannOSC commented Aug 7, 2023

MichaelTiemannOSC commented Aug 12, 2023

andrewgsavage commented Aug 15, 2023

coroa commented Aug 15, 2023

DataFrame reduction methods broken #174

DataFrame reduction methods broken #174

Comments

coroa commented May 12, 2023

andrewgsavage commented May 14, 2023

coroa commented May 15, 2023

andrewgsavage commented May 17, 2023 • edited

coroa commented May 23, 2023

andrewgsavage commented Jun 6, 2023

andrewgsavage commented Jun 7, 2023

MichaelTiemannOSC commented Jul 23, 2023 • edited

topper-123 commented Aug 1, 2023

topper-123 commented Aug 1, 2023

MichaelTiemannOSC commented Aug 1, 2023

andrewgsavage commented Aug 1, 2023

MichaelTiemannOSC commented Aug 7, 2023

MichaelTiemannOSC commented Aug 12, 2023

andrewgsavage commented Aug 15, 2023

coroa commented Aug 15, 2023

andrewgsavage commented May 17, 2023 •

edited

MichaelTiemannOSC commented Jul 23, 2023 •

edited