BUG: pd.testing.assert_series_equal has problems with ExtensionArray input #45240

MichaelTiemannOSC · 2022-01-07T03:50:56Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import unittest
import pandas as pd
import numpy as np
from pandas._testing import *

from pint import set_application_registry
from pint_pandas import PintArray, PintType
from openscm_units import unit_registry
PintType.ureg = unit_registry
ureg = unit_registry
set_application_registry(ureg)
Q_ = ureg.Quantity

ureg.define("CO2e = CO2 = CO2eq = CO2_eq")

pd.show_versions()

def pandas_mult_acc(a, b):
    df = a.multiply(b)
    return df.sum(axis=1)

def pint_mult_acc(a, b):
    df = a.multiply(b)
    return df.sum(axis=1).astype('pint[g CO2]')

class TestBaseProvider(unittest.TestCase):
    """
    Test the Base provider
    """

    def setUp(self) -> None:
        pass

    # PASS: series are equal
    def test_pandas_series_equality_1(self):
        projected_ei = pd.DataFrame([[1.0, 2.0], [4.0, 2.0]])
        projected_production = pd.DataFrame([[1.0, 2.0], [1.0, 2.0]])
        expected_data = pd.Series([5.0, 8.0], index=[0, 1])
        result_data = pandas_mult_acc(projected_ei,projected_production)
        pd.testing.assert_series_equal(expected_data, result_data)

    # FAIL: series differ
    def test_pandas_series_equality_2(self):
        projected_ei = pd.DataFrame([[1.0, 2.0], [4.0, 2.0]])
        projected_production = pd.DataFrame([[1.0, 2.0], [1.0, 3.0]])
        expected_data = pd.Series([5.0, 8.0], index=[0, 1])
        result_data = pandas_mult_acc(projected_ei,projected_production)
        pd.testing.assert_series_equal(expected_data, result_data)

    # PASS: series are equal
    def test_pint_series_equality_1(self):
        projected_ei = pd.DataFrame([[Q_(1.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')], [Q_(4.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')]], dtype='pint[g CO2/Wh]')
        projected_production = pd.DataFrame([[Q_(1.0, 'Wh'), Q_(2.0, 'Wh')], [Q_(1.0, 'Wh'), Q_(2.0, 'Wh')]], dtype='pint[Wh]')
        expected_data = pd.Series([5.0, 8.0], index=[0, 1], dtype='pint[g CO2]')
        result_data = pint_mult_acc(projected_ei,projected_production)
        pd.testing.assert_series_equal(expected_data, result_data)

    # PASS: extension arrays are equal
    def test_pint_series_equality_2(self):
        projected_ei = pd.DataFrame([[Q_(1.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')], [Q_(4.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')]], dtype='pint[g CO2/Wh]')
        projected_production = pd.DataFrame([[Q_(1.0, 'Wh'), Q_(2.0, 'Wh')], [Q_(1.0, 'Wh'), Q_(2.0, 'Wh')]], dtype='pint[Wh]')
        expected_data = pd.Series([5.0, 8.0], index=[0, 1], dtype='pint[g CO2]')
        result_data = pint_mult_acc(projected_ei,projected_production)
        pd.testing.assert_extension_array_equal(expected_data.values, result_data.values)

    # Should FAIL, but ERROR instead
    def test_pint_series_equality_3(self):
        projected_ei = pd.DataFrame([[Q_(1.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')], [Q_(4.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')]], dtype='pint[g CO2/Wh]')
        projected_production = pd.DataFrame([[Q_(1.0, 'Wh'), Q_(2.0, 'Wh')], [Q_(1.0, 'Wh'), Q_(3.0, 'Wh')]], dtype='pint[Wh]')
        expected_data = pd.Series([5.0, 8.0], index=[0, 1], dtype='pint[g CO2]')
        result_data = pint_mult_acc(projected_ei,projected_production)
        # Expected to fail because expected data and result data differ
        pd._testing.assert_series_equal(expected_data, result_data)

    # Should FAIL, but ERROR instead
    def test_pint_series_equality_4(self):
        projected_ei = pd.DataFrame([[Q_(1.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')], [Q_(4.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')]], dtype='pint[g CO2/Wh]')
        projected_production = pd.DataFrame([[Q_(1.0, 'Wh'), Q_(2.0, 'Wh')], [Q_(1.0, 'Wh'), Q_(3.0, 'Wh')]], dtype='pint[Wh]')
        expected_data = pd.Series([5.0, 8.0], index=[0, 1], dtype='pint[g CO2]')
        result_data = pint_mult_acc(projected_ei,projected_production)
        # Expected to fail because expected data and result data differ
        pd._testing.assert_extension_array_equal(expected_data.values, result_data.values)

    # FAIL: numpy arrays differ
    def test_pint_series_equality_5(self):
        projected_ei = pd.DataFrame([[Q_(1.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')], [Q_(4.0, 'g CO2/Wh'), Q_(2.0, 'g CO2/Wh')]], dtype='pint[g CO2/Wh]')
        projected_production = pd.DataFrame([[Q_(1.0, 'Wh'), Q_(2.0, 'Wh')], [Q_(1.0, 'Wh'), Q_(3.0, 'Wh')]], dtype='pint[Wh]')
        expected_data = pd.Series([5.0, 8.0], index=[0, 1], dtype='pint[g CO2]')
        result_data = pint_mult_acc(projected_ei,projected_production)
        # Expected to fail because expected data and result data differ
        pd._testing.assert_numpy_array_equal(np.asarray(expected_data), np.asarray(result_data))

Issue Description

I am using Pint-Pandas, with a lot of working code. However, my Python unittest test suite doesn't work because pd.testing.assert_series_equal raises an error, rather than reporting a failure, when my pint-backed pd.Series objects differ.

In the above example there are two cases showing, respectively, a pandas test passing (expected) and failing (expected) based on whether the resulting series matches an expected series. That's the warm-up.

The test case then tests some pint functionality. If the two series are exactly the same, the function works, as shown by test_pint_series_equality_1.

test_pint_series_equality_2 shows that if we dig in past the series to the Extension Arrays, it still works with identical data.

test_pint_series_equality_5 shows that if we convert the arrays to numpy arrays, the test case fails as expected when the two different series (converted to numpy arrays) are compared.

The problems are with test_pint_series_equality_3 and test_pint_series_equality_4. Both attempt to compare different values, but both lead to ERROR, not FAIL, due to mis-handling of ExtensionArrays. Here's the error message from _3 (using Pandas 1.4.0rc0):

======================================================================
ERROR: test_pint_series_equality_3 (pint-pandas-problem.TestBaseProvider)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/app-root/src/ITR/examples/pint-pandas-problem.py", line 68, in test_pint_series_equality_3
    pd._testing.assert_series_equal(expected_data, result_data)
  File "/opt/app-root/lib64/python3.8/site-packages/pandas/_testing/asserters.py", line 1072, in assert_series_equal
    assert_extension_array_equal(
  File "/opt/app-root/lib64/python3.8/site-packages/pandas/_testing/asserters.py", line 858, in assert_extension_array_equal
    _testing.assert_almost_equal(
  File "pandas/_libs/testing.pyx", line 52, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 158, in pandas._libs.testing.assert_almost_equal
  File "pandas/_libs/testing.pyx", line 143, in pandas._libs.testing.assert_almost_equal
  File "/opt/app-root/lib64/python3.8/site-packages/pint/quantity.py", line 1864, in __len__
    return len(self._magnitude)
TypeError: object of type 'numpy.float64' has no len()
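
To make the last frame of the traceback concrete, here is a minimal sketch of the failure mode. `ScalarQuantity` and `length_or_error` are hypothetical stand-ins that I made up for illustration; this is not pint's actual class, and it only assumes what the traceback shows, namely that `__len__` delegates to the magnitude.

```python
class ScalarQuantity:
    """Hypothetical stand-in for a unit-carrying value; not pint's real class."""

    def __init__(self, magnitude, units):
        self.magnitude = magnitude
        self.units = units

    def __len__(self):
        # Delegates to the magnitude, mirroring the last traceback frame.
        return len(self.magnitude)


def length_or_error(obj):
    """Return len(obj), or a short description of the TypeError it raises."""
    try:
        return len(obj)
    except TypeError as exc:
        return f"TypeError: {exc}"


# An array-like magnitude has a length.
print(length_or_error(ScalarQuantity([1.0, 2.0], "g CO2")))
# A scalar magnitude does not, so len() raises instead of returning.
print(length_or_error(ScalarQuantity(8.0, "g CO2")))
```

Any code that calls len() on such a scalar element hits a TypeError before it ever gets to compare values.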

Expected Behavior

I expect Pandas to report that the different series are different without giving an ERROR.

Installed Versions

INSTALLED VERSIONS

commit : d023ba7
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.18.0-305.28.1.el8_4.x86_64
Version : #1 SMP Mon Nov 8 07:45:47 EST 2021
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.4.0rc0
numpy : 1.22.0
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1
setuptools : 56.0.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.22.0
pandas_datareader: None
bs4 : 4.6.3
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : 1.4.10
tables : None
tabulate : 0.8.9
xarray : None
xlrd : None
xlwt : None
numba : None
zstandard : None

I am also using Pint and Pint-Pandas:

$ pip list | grep Pint
Pint                              0.18
Pint-Pandas                       0.2


MichaelTiemannOSC · 2022-01-08T15:35:25Z

I am looking harder at the function assert_extension_array_equal in pandas/_testing/asserters.py, specifically these last few lines:

    left_valid = np.asarray(left[~left_na].astype(object))
    right_valid = np.asarray(right[~right_na].astype(object))
    if check_exact:
        assert_numpy_array_equal(
            left_valid, right_valid, obj="ExtensionArray", index_values=index_values
        )
    else:
        _testing.assert_almost_equal(
            left_valid,
            right_valid,
            check_dtype=check_dtype,
            rtol=rtol,
            atol=atol,
            obj="ExtensionArray",
            index_values=index_values,
        )

The quick-and-dirty solution for Pint's Quantity type would be to convert the arrays to float (which strips the units and preserves the magnitude) rather than to object (which triggers the problem downstream). However, there are at least two unsatisfying aspects:

Users get a warning message for a problem they did not create.
Some units can have complex type, so we probably need to check for that as well.

So now the question is whether the problem is really here in pandas, in hgrecco/pint-pandas#26, or somewhere else.
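
As a rough illustration of the float-conversion idea, here is a user-side sketch that compares magnitudes only. `assert_magnitudes_close` is a hypothetical helper, not part of pandas or pint, and it assumes the units have already been stripped so that the inputs are plain numeric array-likes; the default tolerances are my own choice.

```python
import numpy as np


def assert_magnitudes_close(left, right, rtol=1e-5, atol=1e-8):
    """Compare two array-likes of plain magnitudes (units already stripped)."""
    left_valid = np.asarray(left, dtype=float)
    right_valid = np.asarray(right, dtype=float)
    # Raises AssertionError on a mismatch, which unittest reports as FAIL.
    np.testing.assert_allclose(left_valid, right_valid, rtol=rtol, atol=atol)


# Equal magnitudes: passes silently.
assert_magnitudes_close([5.0, 8.0], [5.0, 8.0])

# Different magnitudes (the values from test_pint_series_equality_3).
try:
    assert_magnitudes_close([5.0, 8.0], [5.0, 10.0])
except AssertionError:
    print("mismatch reported as AssertionError")
```

This does not check that the units agree and does not handle complex magnitudes, which are the same concerns listed above.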

MichaelTiemannOSC · 2022-01-08T19:56:21Z

This hack does what I want, but surely needs work to make it pandas-friendly:

diff --git a/pandas/_testing/asserters.py b/pandas/_testing/asserters.py
index ea75af20bb..a37d8c5d08 100644
--- a/pandas/_testing/asserters.py
+++ b/pandas/_testing/asserters.py
@@ -848,13 +848,15 @@ def assert_extension_array_equal(
         left_na, right_na, obj="ExtensionArray NA mask", index_values=index_values
     )

-    left_valid = np.asarray(left[~left_na].astype(object))
-    right_valid = np.asarray(right[~right_na].astype(object))
     if check_exact:
+        left_valid = np.asarray(left[~left_na].astype(object))
+        right_valid = np.asarray(right[~right_na].astype(object))
         assert_numpy_array_equal(
             left_valid, right_valid, obj="ExtensionArray", index_values=index_values
         )
     else:
+        left_valid = np.asarray(left[~left_na].astype(float))
+        right_valid = np.asarray(right[~right_na].astype(float))
         _testing.assert_almost_equal(
             left_valid,
             right_valid,

MichaelTiemannOSC added the Bug and Needs Triage labels Jan 7, 2022

MichaelTiemannOSC mentioned this issue Jan 8, 2022

TypeError: object of type 'numpy.float64' has no len() hgrecco/pint-pandas#26

Closed

MichaelTiemannOSC mentioned this issue Jan 11, 2022

Better annotations support hgrecco/pint#1166

Open

mroeschke added the ExtensionArray and Testing labels and removed the Needs Triage label Jan 13, 2022
