Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: metadata does not always propagate in binary operations #49916

Open
3 tasks done
MarcoGorelli opened this issue Nov 26, 2022 · 6 comments
Open
3 tasks done

BUG: metadata does not always propagate in binary operations #49916

MarcoGorelli opened this issue Nov 26, 2022 · 6 comments
Labels
Unreliable Test Unit tests that occasionally fail

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Nov 26, 2022

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

pytest pandas/tests/generic/test_finalize.py  # works fine

pytest pandas/tests/generic/test_finalize.py -n 4  # fails

Issue Description

In the second case, I get:

======================================================= FAILURES =======================================================_____________________________________________ test_binops[pow-args7-right] _____________________________________________[gw1] linux -- Python 3.8.15 /home/marcogorelli/mambaforge/envs/pandas-dev/bin/python3.8

request = <FixtureRequest for <Function test_binops[pow-args7-right]>>, args = (   A
0  1, 0    1
dtype: int64)
annotate = 'right', all_binary_operators = <built-in function pow>

    @pytest.mark.filterwarnings(
        "ignore:Automatic reindexing on DataFrame vs Series:FutureWarning"
    )
    @pytest.mark.parametrize("annotate", ["left", "right", "both"])
    @pytest.mark.parametrize(
        "args",
        [
            (1, pd.Series([1])),
            (1, pd.DataFrame({"A": [1]})),
            (pd.Series([1]), 1),
            (pd.DataFrame({"A": [1]}), 1),
            (pd.Series([1]), pd.Series([1])),
            (pd.DataFrame({"A": [1]}), pd.DataFrame({"A": [1]})),
            (pd.Series([1]), pd.DataFrame({"A": [1]})),
            (pd.DataFrame({"A": [1]}), pd.Series([1])),
        ],
    )
    def test_binops(request, args, annotate, all_binary_operators):
        # This generates 624 tests... Is that needed?
        left, right = args
        if annotate == "both" and isinstance(left, int) or isinstance(right, int):
            return

        if annotate in {"left", "both"} and not isinstance(left, int):
            left.attrs = {"a": 1}
        if annotate in {"left", "both"} and not isinstance(right, int):
            right.attrs = {"a": 1}

        result = all_binary_operators(left, right)
>       assert result.attrs == {"a": 1}
E       AssertionError: assert {} == {'a': 1}
E         Right contains 1 more item:
E         {'a': 1}
E         Use -v to get more diff

pandas/tests/generic/test_finalize.py:508: AssertionError

Has anyone else come across this?

Expected Behavior

Tests should pass whether run in parallel or not

Has anyone else come across this / can reproduce?

Installed Versions

INSTALLED VERSIONS

commit : 3c72d6f
python : 3.8.15.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.102.1-microsoft-standard-WSL2
Version : #1 SMP Wed Mar 2 00:30:59 UTC 2022
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 2.0.0.dev0+762.g3c72d6f1c0
numpy : 1.23.5
pytz : 2022.6
dateutil : 2.8.2
setuptools : 65.5.1
pip : 22.3.1
Cython : 0.29.32
pytest : 7.2.0
hypothesis : 6.58.1
sphinx : 4.5.0
blosc : None
feather : None
xlsxwriter : 3.0.3
lxml.etree : 4.9.1
html5lib : 1.1
pymysql : 1.0.2
psycopg2 : 2.9.3
jinja2 : 3.0.3
IPython : 8.6.0
pandas_datareader: 0.10.0
bs4 : 4.11.1
bottleneck : 1.3.5
brotli :
fastparquet : 2022.11.0
fsspec : 2021.11.0
gcsfs : 2021.11.0
matplotlib : 3.6.2
numba : 0.56.4
numexpr : 2.8.3
odfpy : None
openpyxl : 3.0.10
pandas_gbq : 0.17.9
pyarrow : 9.0.0
pyreadstat : 1.2.0
pyxlsb : 1.0.10
s3fs : 2021.11.0
scipy : 1.9.3
snappy :
sqlalchemy : 1.4.44
tables : 3.7.0
tabulate : 0.9.0
xarray : 2022.11.0
xlrd : 2.0.1
zstandard : 0.19.0
tzdata : None
qtpy : None
pyqt5 : None
None

@MarcoGorelli MarcoGorelli added Bug Needs Triage Issue that has not been reviewed by a pandas team member Unreliable Test Unit tests that occasionally fail and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 26, 2022
@MarcoGorelli
Copy link
Member Author

Actually, just

pytest pandas/tests/generic/test_finalize.py::test_binops[ne-args1-right]

fails every time

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Nov 26, 2022

This line looks a bit odd to me:

if annotate in {"left", "both"} and not isinstance(right, int):
right.attrs = {"a": 1}

if annotate is 'left', then why would right.attrs be set?

I'll take a closer look later


Looks like

left, right = left.align(
right, join="outer", axis=axis, level=level, copy=False
)
right = _maybe_align_series_as_frame(left, right, axis)

might be the issue - prior to that line, right.attrs is set, but after it, it's unset


Looks like a real issue:

$ cat f.py
import pandas as pd
from pandas import *

left = Series([1])
right = DataFrame({'a': [1]})
left.attrs = {'a': 1}
result = left + right
print(result.attrs)
$ python f.py
{}

@phofl
Copy link
Member

phofl commented Nov 26, 2022

same on my machine.

@mroeschke
Copy link
Member

There might be vague connection to #34373

@MarcoGorelli MarcoGorelli changed the title BUG: pandas/tests/generic/test_finalize.py fails if run in parallel? BUG: metadata does not always propagate in binary operations Nov 27, 2022
@TomAugspurger
Copy link
Contributor

Just to confirm, the original test failures are fixed, and the issue is now described in #49916 (comment), about propagating attrs in binary operations?

I'd recommend we follow xarray here (docs). By default, attrs aren't propagated through binary operations:

In [1]: import xarray as xr

In [2]: a = xr.DataArray([1, 2], name='a', attrs={"foo": "bar"})

In [3]: b = xr.DataArray([1, 2], name='a')

In [4]: (a + a).attrs
Out[4]: {}

In [5]: (a + b).attrs
Out[5]: {}

But a flag

In [8]: xr.set_options(keep_attrs=True)
Out[8]: <xarray.core.options.set_options at 0x10fc70ad0>

In [9]: (a + b).attrs
Out[9]: {'foo': 'bar'}

In [10]: (a + b).attrs
Out[10]: {'foo': 'bar'}

And I believe there are other values to control how merging / conflicts are handled

In [11]: b = xr.DataArray([1, 2], name='a', attrs={"foo": "baz", "other": "other"})

In [12]: (a + b).attrs
Out[12]: {'foo': 'bar'}

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option, and there is a global flag, accessible with xarray.set_options(), for setting this to be always True or False). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.

@MarcoGorelli
Copy link
Member Author

the original test failures are fixed

that's right! several runs of pytest pandas/tests/generic/test_finalize.py -n 4 all pass for me locally

I'd recommend we follow xarray here

Do you mean to follow them in the sense of propagating through binary operations, or adding keep_attrs?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Unreliable Test Unit tests that occasionally fail
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants