BUG: list-like objects are broadcast to each row (1.3 regression) #42549

erik-hasse · 2021-07-15T14:47:32Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

This requires both pandas and grpcio-tools.

import subprocess
import sys

import pandas as pd

proto = """
syntax="proto3";

message Data {
  repeated float values = 1;
}
"""

with open('data.proto', 'w') as f:
    f.write(proto)

subprocess.run([
    sys.executable, '-m', 'grpc_tools.protoc', '-I.',
    '--python_out=.', '--grpc_python_out=.', 'data.proto'
])

from data_pb2 import Data

proto_data = Data(values=range(3))
print(proto_data.values)
print(type(proto_data.values))
df = pd.DataFrame(index=range(3), data={'a': proto_data.values})
print(df)

Problem description

On 1.3.0 and the master branch this code prints:

[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
                 a
0  [0.0, 1.0, 2.0]
1  [0.0, 1.0, 2.0]
2  [0.0, 1.0, 2.0]

The issue seems to arise from #41592. In 1.2.x this object was handled by _try_cast in this else clause. After that change, it's handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False. Note that RepeatedScalarContainer implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, so list(proto_data.values) works fine, but the hasattr(obj, "__iter__") check in is_list_like is False.

Based on all this, I suspect that this same issue will occur on any object which implements PyTypeObject.tp_as_sequence but not PyTypeObject.tp_iter, however this protobuf object the only example I have right now so I can't test further.

I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "__iter__") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.

Expected Output

On 1.2.x this prints:

[0.0, 1.0, 2.0]
<class 'google.protobuf.pyext._message.RepeatedScalarContainer'>
     a
0  0.0
1  1.0
2  2.0

Output of `pd.show_versions()`

INSTALLED VERSIONS ------------------ commit : f00ed8f python : 3.8.6.final.0 python-bits : 64 OS : Darwin OS-release : 20.3.0 Version : Darwin Kernel Version 20.3.0: Thu Jan 21 00:07:06 PST 2021; root:xnu-7195.81.3~1/RELEASE_X86_64 machine : x86_64 processor : i386 byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : en_US.UTF-8

pandas : 1.3.0
numpy : 1.20.3
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 53.0.0
Cython : 0.29.22
pytest : 6.2.2
hypothesis : 6.3.4
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.20.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : 3.3.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 3.0.0
pyxlsb : None
s3fs : None
scipy : 1.6.0
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

The text was updated successfully, but these errors were encountered:

MeisterP · 2021-07-26T13:14:31Z

I see the same issue when using pandas 1.3 from within Goldencheetah
https://groups.google.com/g/golden-cheetah-developers/c/lpHNWKdNICY

simonjayhawkins · 2021-07-28T09:07:20Z

The issue seems to arise from #41592

cc @jbrockmendel

jbrockmendel · 2021-07-28T17:50:26Z

I'm not familiar enough with Cython to provide a full fix, but is there some way to examine the struct fields of obj in is_list_like? If so, that function could be amended to check hasattr(obj, "iter") or hasattr(obj, "tp_as_sequence"). If not I think the logic of sanitize_array needs to be amended.

Would PySequence_Check(obj) return True for this object?

erik-hasse · 2021-08-02T20:20:04Z

I can confirm that it does. The following prints True

from cpython.sequence cimport PySequence_Check
from data_pb2 import Data

proto_data = Data(values=range(3))

print(PySequence_Check(proto_data.values))

jbrockmendel · 2021-08-19T21:16:13Z

When I change the check from hasattr(obj, "__iter__") to PySequence_Check(obj) we fail to identify zip objects as listlike. We could make the check hasattr(obj, "__iter__") or PySequence_Check(obj). I don't know enough about PySequence_Check to be confident it won't produce false-positives. @erik-hasse any idea?

aiudirog · 2021-09-04T17:49:12Z

@jbrockmendel I came across this while looking for any issues that may be related to #43373. Modifying is_list_like to look like this should work:

cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
    cdef object iterfunc = getattr(obj, "__iter__", ...)
    if iterfunc is None:  # Explicitly not iterable
        return False
    return (
        (iterfunc is not ... or PySequence_Check(obj))
        and not isinstance(obj, type)
        # we do not count strings/unicode/bytes as list-like
        and not isinstance(obj, (str, bytes))
        # exclude zero-dimensional numpy arrays, effectively scalars
        and not cnp.PyArray_IsZeroDim(obj)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )

Edit: Realized I forgot to explain myself. Using hasattr(obj, "__iter__") or PySequence_Check(obj) or getattr(obj, "__iter__", None) is not None or PySequence_Check(obj) would result in false positives when __iter__ is explicitly set to None and __getitem__ is implemented. Therefore, we need to explicitly verify that __iter__ is not None before proceeding. Since we needed a new sentinel value, I chose ... but maybe that isn't the best way.

When using the builtin iter function, this is taken care of in slot_tp_iter when called (indirectly) by PyObject_GetIter.

MeisterP · 2021-10-05T10:55:54Z

Modifying is_list_like to look like this should work

For what it's worth, the change proposed in #42549 (comment) does fix my Goldencheetah use case.

jbrockmendel · 2021-10-05T23:03:44Z

is_listlike has been changed to use getattr(obj, "__iter__", None) is not None and ... does this solve the original issue @erik-hasse ?

erik-hasse · 2021-10-06T03:49:56Z

Unfortunately no, because the object in the original issue does not have an __iter__ attribute. I still see the results in the issue on the current master branch 1.4.0.dev0+839.gab727bfee4.

As far as I can tell, PySequence_Check and try/except-ing list(obj) are the only ways to actually detect objects like this. But I definitely see the downsides of both.

aiudirog · 2021-10-06T16:06:51Z

@jbrockmendel That change is for when __iter__ is present but explicitly set to None. My usage for it is actually the opposite of @erik-hasse, where I had an object which implemented __getitem__ but was not iterable through indexing 0-N. His use case is an object that is only iterable through __getitem__ and doesn't have __iter__ implemented at all. The code I posted previously should solve both problems correctly.

jbrockmendel · 2021-10-06T23:33:34Z

@erik-hasse we'll need to check the perf impact but it sounds like PySequence_Check is the way to go here (definitely not list(obj)). PR welcome.

@aiudirog it is not obvious to me that we should see an item as listlike if item.__iter__ is None

aiudirog · 2021-10-07T00:07:45Z

You shouldn't. It's explicitly not list like if __iter__ is None. That's how you explicitly disable iteration while still implementing __getitem__.

Here you can see the source code for collectiona.abc.Iterable which performs the check:
https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L265

https://github.com/python/cpython/blob/v3.9.7/Lib/_collections_abc.py#L83

simonjayhawkins · 2021-10-16T19:25:26Z

changing milestone to 1.3.5

jreback · 2021-11-28T16:00:23Z

good to fix but not needed for 1.3.x

simonjayhawkins · 2021-12-02T13:19:00Z

Code Sample, a copy-pastable example

This requires both pandas and grpcio-tools.

as an MRE maybe...

import pandas as pd

print(pd.__version__)


class MySequence:
    def __getitem__(self, key):
        return range(3)[key]

    def __len__(self):
        return 3


my_sequence = MySequence()

df = pd.DataFrame(index=range(3), data={"a": my_sequence})
print(df)

on 1.2.5

and on pandas 1.3.x/master

1.4.0.dev0+1286.g9db55a7524
           a
0  (0, 1, 2)
1  (0, 1, 2)
2  (0, 1, 2)

simonjayhawkins · 2021-12-03T11:37:44Z

Modifying is_list_like to look like this should work

For what it's worth, the change proposed in #42549 (comment) does fix my Goldencheetah use case.

tbc the code diff for the proposed change would be...

diff --git a/pandas/_libs/lib.pyx b/pandas/_libs/lib.pyx
index f527882a9d..ecd2906744 100644
--- a/pandas/_libs/lib.pyx
+++ b/pandas/_libs/lib.pyx
@@ -1098,9 +1098,12 @@ def is_list_like(obj: object, allow_sets: bool = True) -> bool:
 
 
 cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
+    cdef object iterfunc = getattr(obj, "__iter__", ...)
+    if iterfunc is None:  # Explicitly not iterable
+        return False
     return (
-        # equiv: `isinstance(obj, abc.Iterable)`
-        getattr(obj, "__iter__", None) is not None and not isinstance(obj, type)
+        (iterfunc is not ... or PySequence_Check(obj))
+        and not isinstance(obj, type)
         # we do not count strings/unicode/bytes as list-like
         and not isinstance(obj, (str, bytes))
         # exclude zero-dimensional numpy arrays, effectively scalars

but this causes an infinite loop when testing with 1 core and node down for multicore for test_pprint_pathological_object

All other tests pass.

simonjayhawkins · 2021-12-03T11:38:02Z

good to fix but not needed for 1.3.x

moving off milestone

simonjayhawkins · 2021-12-03T13:21:11Z

The issue seems to arise from #41592.

first bad commit: [cec2f5f] REF: handle non-list_like cases upfront in sanitize_array (#38563)

In 1.2.x this object was handled by _try_cast in this else clause. After that change, it's handled by construct_1d_arraylike_from_scalar because is_list_like(data) returns False.

so it appears that the regression could have been caused by the change from lib.is_scalar(data) and index is not None and dtype is not None: to elif not is_list_like(data): causing the change from handled by _try_cast to handled by construct_1d_arraylike_from_scalar as identified. (i've not yet confirmed this)

also tbc is_list_like(my_sequence) returned False for the MRE from #42549 (comment) in 1.2.5 also.

import pandas as pd

print(pd.__version__)


class MySequence:
    def __getitem__(self, key):
        return range(3)[key]

    def __len__(self):
        return 3


my_sequence = MySequence()

print(pd._libs.lib.is_list_like(my_sequence))

1.2.5
False

good to fix but not needed for 1.3.x

moving off milestone

tbc i've moved off the milestone since IMO the proposed change is not suitable for backport since not only is tested behavior changed (i.e. test_pprint_pathological_object) but also the change is not directly related to the change that caused the regression. The proposed change could, assuming the recursion issue is sorted, be included for 1.4.

If we want to fix this for 1.3.x, we could probably consider re-instating the data = list(data) materialization in sanitize_array from #41592 or reverting the is_list_like back to the is_scalar check, the direct cause of the regression and not make any changes to is_list_like in a patch release.

PRs welcome either way.

bkyryliuk · 2023-04-28T09:48:55Z

What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?

MeisterP · 2023-04-28T12:04:43Z

What is a earliest version of pandas that has this issue fixed? Is it affecting 1.4.X ?

It is still an issue with pandas 2.0.0. It was never fixed.

erik-hasse added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 15, 2021

erik-hasse mentioned this issue Jul 15, 2021

Implement __iter__ in Python RepeatedScalarContainer protocolbuffers/protobuf#8816

Open

simonjayhawkins added this to the 1.3.2 milestone Jul 26, 2021

simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021

mroeschke added Nested Data Data where the values are collections (lists, sets, dicts, objects, etc.). and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 21, 2021

aiudirog mentioned this issue Sep 4, 2021

BUG: Fix regression in is_list_like #43373

Merged

3 tasks

simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021

simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021

simonjayhawkins mentioned this issue Nov 28, 2021

RLS: 1.3.5 #44080

Closed

simonjayhawkins modified the milestones: 1.3.5, Contributions Welcome Dec 3, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 3, 2021

code sample for pandas-dev#42549

d2d4ea7

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

bkyryliuk mentioned this issue May 3, 2023

Relax panda and alembic requirements databricks/databricks-sql-python#116

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

erik-hasse commented Jul 15, 2021 •

edited

MeisterP commented Jul 26, 2021

simonjayhawkins commented Jul 28, 2021

jbrockmendel commented Jul 28, 2021

erik-hasse commented Aug 2, 2021

jbrockmendel commented Aug 19, 2021

aiudirog commented Sep 4, 2021 •

edited

MeisterP commented Oct 5, 2021 •

edited

jbrockmendel commented Oct 5, 2021

erik-hasse commented Oct 6, 2021

aiudirog commented Oct 6, 2021

jbrockmendel commented Oct 6, 2021

aiudirog commented Oct 7, 2021

simonjayhawkins commented Oct 16, 2021

jreback commented Nov 28, 2021

simonjayhawkins commented Dec 2, 2021

Code Sample, a copy-pastable example

simonjayhawkins commented Dec 3, 2021 •

edited

simonjayhawkins commented Dec 3, 2021

simonjayhawkins commented Dec 3, 2021

bkyryliuk commented Apr 28, 2023

MeisterP commented Apr 28, 2023

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

BUG: list-like objects are broadcast to each row (1.3 regression) #42549

Comments

erik-hasse commented Jul 15, 2021 • edited

Code Sample, a copy-pastable example

Problem description

Expected Output

Output of pd.show_versions()

MeisterP commented Jul 26, 2021

simonjayhawkins commented Jul 28, 2021

jbrockmendel commented Jul 28, 2021

erik-hasse commented Aug 2, 2021

jbrockmendel commented Aug 19, 2021

aiudirog commented Sep 4, 2021 • edited

MeisterP commented Oct 5, 2021 • edited

jbrockmendel commented Oct 5, 2021

erik-hasse commented Oct 6, 2021

aiudirog commented Oct 6, 2021

jbrockmendel commented Oct 6, 2021

aiudirog commented Oct 7, 2021

simonjayhawkins commented Oct 16, 2021

jreback commented Nov 28, 2021

simonjayhawkins commented Dec 2, 2021

Code Sample, a copy-pastable example

simonjayhawkins commented Dec 3, 2021 • edited

simonjayhawkins commented Dec 3, 2021

simonjayhawkins commented Dec 3, 2021

bkyryliuk commented Apr 28, 2023

MeisterP commented Apr 28, 2023

erik-hasse commented Jul 15, 2021 •

edited

Output of `pd.show_versions()`

aiudirog commented Sep 4, 2021 •

edited

MeisterP commented Oct 5, 2021 •

edited

simonjayhawkins commented Dec 3, 2021 •

edited