Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

Open
3 tasks done
erik-hasse opened this issue Jul 15, 2021 · 20 comments
Open
3 tasks done

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

erik-hasse opened this issue Jul 15, 2021 · 20 comments
Labels
Bug Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).

Comments

@erik-hasse
Copy link
Contributor

erik-hasse commented Jul 15, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

This requires both pandas and grpcio-tools.

import subprocess
import sys

import pandas as pd

proto = """
syntax="proto3";

message Data {
  repeated float values = 1;
}
"""

with open('data.proto', 'w') as f:
    f.write(proto)

subprocess.run([
    sys.executable, '-m', 'grpc_tools.protoc', '-I.',
    '--python_out=.', '--grpc_python_out=.', 'data.proto'
])

from data_pb2 import Data

proto_data = Data(values=range(3))
print(proto_data.values)
print(type(proto_data.values))
df = pd.DataFrame(index=range(3), data={'a': proto_data.values})
print(df)

Problem description

On 1.3.0 and the master branch this code prints:

[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
                 a
0  [0.0, 1.0, 2.0]
1  [0.0, 1.0, 2.0]
2  [0.0, 1.0, 2.0]

The issue seems to arise from #41592. In 1.2.x this object was handled by _try_cast in this else clause. After that change, it's handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False. Note that RepeatedScalarContainer implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, so list(proto_data.values) works fine, but the hasattr(obj, "__iter__") check in is_list_like is False.

Based on all this, I suspect that this same issue will occur on any object which implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, however this protobuf object the only example I have right now so I can't test further.

I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "__iter__") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.

Expected Output

On 1.2.x this prints:

[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
     a
0  0.0
1  1.0
2  2.0

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : f00ed8f python : 3.8.6.final.0 python-bits : 64 OS : Darwin OS-release : 20.3.0 Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 53.0.0
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.3.4
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@MeisterP
Copy link

I see the same issue when using pandas 1.3 from within Goldencheetah
https://groups.google.com/g/golden-cheetah-developers/c/lpHNWKdNICY

@simonjayhawkins simonjayhawkins added this to the 1.3.2 milestone Jul 26, 2021
@simonjayhawkins
Copy link
Member

The issue seems to arise from #41592

cc @jbrockmendel

@jbrockmendel
Copy link
Member

I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "iter") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.

Would PySequence_Check(obj) return True for this object?

@erik-hasse
Copy link
Contributor Author

I can confirm that it does. The following prints True

from cpython.sequence cimport PySequence_Check
from data_pb2 import Data

proto_data = Data(values=range(3))

print(PySequence_Check(proto_data.values))

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@jbrockmendel
Copy link
Member

When I change the check from hasattr(obj, "__iter__") to PySequence_Check(obj) we fail to identify zip objects as listlike. We could make the check hasattr(obj, "__iter__") or PySequence_Check(obj). I don't know enough about PySequence_Check to be confident it won't produce false-positives. @erik-hasse any idea?

@mroeschke mroeschke added Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021
@aiudirog
Copy link
Contributor

aiudirog commented Sep 4, 2021

@jbrockmendel I came across this while looking for any issues that may be related to #43373. Modifying is_list_like to look like this should work:

cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
    cdef object iterfunc = getattr(obj, "__iter__", ...)
    if iterfunc is None:  # Explicitly not iterable
        return False
    return (
        (iterfunc is not ... or PySequence_Check(obj))
        and not isinstance(obj, type)
        # we do not count strings/unicode/bytes as list-like
        and not isinstance(obj, (str, bytes))
        # exclude zero-dimensional numpy arrays, effectively scalars
        and not cnp.PyArray_IsZeroDim(obj)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )

Edit: Realized I forgot to explain myself. Using hasattr(obj, "__iter__") or PySequence_Check(obj) or getattr(obj, "__iter__", None) is not None or PySequence_Check(obj) would result in false positives when __iter__ is explicitly set to None and __getitem__ is implemented. Therefore, we need to explicitly verify that __iter__ is not None before proceeding. Since we needed a new sentinel value, I chose ... but maybe that isn't the best way.

When using the builtin iter function, this is taken care of in slot_tp_iter when called (indirectly) by PyObject_GetIter.

@MeisterP
Copy link

MeisterP commented Oct 5, 2021

Modifying is_list_like to look like this should work

For what it's worth, the change proposed in #42549 (comment) does fix my Goldencheetah use case.

@jbrockmendel
Copy link
Member

is_listlike has been changed to use getattr(obj, "__iter__", None) is not None and ... does this solve the original issue @erik-hasse ?

@erik-hasse
Copy link
Contributor Author

Unfortunately no, because the object in the original issue does not have an __iter__ attribute. I still see the results in the issue on the current master branch 1.4.0.dev0+839.gab727bfee4.

As far as I can tell, PySequence_Check and try/except-ing list(obj) are the only ways to actually detect objects like this. But I definitely see the downsides of both.

@aiudirog
Copy link
Contributor

aiudirog commented Oct 6, 2021

@jbrockmendel That change is for when __iter__ is present but explicitly set to None. My usage for it is actually the opposite of @erik-hasse, where I had an object which implemented __getitem__ but was not iterable through indexing 0-N. His use case is an object that is only iterable through __getitem__ and doesn't have __iter__ implemented at all. The code I posted previously should solve both problems correctly.

@jbrockmendel
Copy link
Member

@erik-hasse we'll need to check the perf impact but it sounds like PySequence_Check is the way to go here (definitely not list(obj)). PR welcome.

@aiudirog it is not obvious to me that we should see an item as listlike if item.__iter__ is None

@aiudirog
Copy link
Contributor

aiudirog commented Oct 7, 2021

You shouldn't. It's explicitly not list like if __iter__ is None. That's how you explicitly disable iteration while still implementing __getitem__.

Here you can see the source code for collectiona.abc.Iterable which performs the check:
https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L265

https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L83

@simonjayhawkins
Copy link
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@jreback
Copy link
Contributor

jreback commented Nov 28, 2021

good to fix but not needed for 1.3.x

@simonjayhawkins
Copy link
Member

Code Sample, a copy-pastable example

This requires both pandas and grpcio-tools.

as an MRE maybe...

import pandas as pd

print(pd.__version__)


class MySequence:
    def __getitem__(self, key):
        return range(3)[key]

    def __len__(self):
        return 3


my_sequence = MySequence()

df = pd.DataFrame(index=range(3), data={"a": my_sequence})
print(df)

on 1.2.5

1.2.5
   a
0  0
1  1
2  2

and on pandas 1.3.x/master

1.4.0.dev0+1286.g9db55a7524
           a
0  (0, 1, 2)
1  (0, 1, 2)
2  (0, 1, 2)

@simonjayhawkins
Copy link
Member

simonjayhawkins commented Dec 3, 2021

Modifying is_list_like to look like this should work

For what it's worth, the change proposed in #42549 (comment) does fix my Goldencheetah use case.

tbc the code diff for the proposed change would be...

diff --git a/pandas/_libs/lib.pyx b/pandas/_libs/lib.pyx
index f527882a9d..ecd2906744 100644
--- a/pandas/_libs/lib.pyx
+++ b/pandas/_libs/lib.pyx
@@ -1098,9 +1098,12 @@ def is_list_like(obj: object, allow_sets: bool = True) -> bool:
 
 
 cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
+    cdef object iterfunc = getattr(obj, "__iter__", ...)
+    if iterfunc is None:  # Explicitly not iterable
+        return False
     return (
-        # equiv: `isinstance(obj, abc.Iterable)`
-        getattr(obj, "__iter__", None) is not None and not isinstance(obj, type)
+        (iterfunc is not ... or PySequence_Check(obj))
+        and not isinstance(obj, type)
         # we do not count strings/unicode/bytes as list-like
         and not isinstance(obj, (str, bytes))
         # exclude zero-dimensional numpy arrays, effectively scalars

but this causes an infinite loop when testing with 1 core and node down for multicore for test_pprint_pathological_object

All other tests pass.

@simonjayhawkins
Copy link
Member

good to fix but not needed for 1.3.x

moving off milestone

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.5, Contributions Welcome Dec 3, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 3, 2021
@simonjayhawkins
Copy link
Member

The issue seems to arise from #41592.

first bad commit: [cec2f5f] REF: handle non-list_like cases upfront in sanitize_array (#38563)

In 1.2.x this object was handled by _try_cast in this else clause. After that change, it's handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False.

so it appears that the regression could have been caused by the change from lib.is_scalar(data) and index is not None and dtype is not None: to elif not is_list_like(data): causing the change from handled by _try_cast to handled by construct_1d_arraylike_from_scalar as identified. (i've not yet confirmed this)

also tbc is_list_like(my_sequence) returned False for the MRE from #42549 (comment) in 1.2.5 also.

import pandas as pd

print(pd.__version__)


class MySequence:
    def __getitem__(self, key):
        return range(3)[key]

    def __len__(self):
        return 3


my_sequence = MySequence()

print(pd._libs.lib.is_list_like(my_sequence))
1.2.5
False

good to fix but not needed for 1.3.x

moving off milestone

tbc i've moved off the milestone since IMO the proposed change is not suitable for backport since not only is tested behavior changed (i.e. test_pprint_pathological_object) but also the change is not directly related to the change that caused the regression. The proposed change could, assuming the recursion issue is sorted, be included for 1.4.

If we want to fix this for 1.3.x, we could probably consider re-instating the data = list(data) materialization in sanitize_array from #41592 or reverting the is_list_like back to the is_scalar check, the direct cause of the regression and not make any changes to is_list_like in a patch release.

PRs welcome either way.

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@bkyryliuk
Copy link

What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?

@MeisterP
Copy link

What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?

It is still an issue with pandas 2.0.0. It was never fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.).
Projects
None yet
Development

No branches or pull requests

8 participants