ENH: generalize `init` on a `dict` to `abc.collections.Mapping` and `getitem` on a `list` to `abc.collections.Sequence` #58803

MilesCranmer · 2024-05-21T17:04:23Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I wish that DataFrame was more compatible with initializing from more general classes of dictionaries, and being indexed by more general classes of sequences.

Feature Description

Initialization

This is motivated by attempting to use Pandas from Julia via PythonCall.jl, where I found the strange behavior where:

using PythonCall

pd = pyimport("pandas")

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

would result in the following dataframe:

julia> df
Python:
   0
0  b
1  a

As @cjdoris discovered, this this is due to the fact "pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it becomes a juliacall.DictValue in Python)."

The way to fix this would be to have __init__ instead check for its input being an instance of abc.collections.Mapping instead, which includes both dict and abc.collections.Mapping.

Indexing

The second incompatibility here is that indexing using lists that are not exactly list does not work as expected. For example, if I try to index a pandas dataframe using a sequence of strings Julia, I get this error:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

As @cjdoris found, this is due to __getitem__ checking for list.

The solution would be to check for the more general abc.collections.Sequence, which includes both list and juliacall.VectorValue.

Alternative Solutions

Alternative solution 1

There is https://github.com/JuliaData/DataFrames.jl which implements a dataframe type from scratch, but pandas is more actively maintained, and I find it easier to use, so would like to call it from Julia.

Alternative solution 2

The current workaround is to use these operations as follows (thanks to @mrkn)

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

where we explicitly convert to a Python dict object. However, this is not obvious, and there is no error when you avoid pydict - it's only from checking the contents do you see it resulted in an unexpected dataframe.

Similarly, for indexing:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6

However, this is not immediately obvious, and is a bit verbose. My immediate thought is that passing a list-like object to pandas getitem should work for indexing, and passing a dict-like object should work for initialization. Perhaps this is subjective but I thought I'd see what others think.

Checking the more general class of types would fix this, and also be compatible with any other list-like or dict-like inputs.

Alternative solution 3

As brought up by @cjdoris in JuliaPy/PythonCall.jl#501, one option is to change PythonCall so that it automatically convert Julia Dict to Python dict whenever passed to a Python object.

However, this would prevent people from mutating Julia dicts from within Python code, so would be a massive change in behavior. And converting Julia Vectors to Python lists automatically would also prevent mutation from Python, so is basically a non-option.

Additional Context

Discussed in JuliaPy/PythonCall.jl#501

The text was updated successfully, but these errors were encountered:

jbrockmendel · 2024-05-22T00:12:09Z

I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do supporr

MilesCranmer · 2024-05-22T02:15:30Z

Ah, there would be a performance penalty? I would have thought it would just be changing an isinstance(i, list) to isinstance(i, Sequence)?

Aloqeely · 2024-05-22T03:40:54Z

Slightly unrelated, but have you tried the Julia pandas wrapper? I believe it automatically does the conversion from Julia Vector/Dict to a Python list/dict for you

MilesCranmer · 2024-05-22T06:05:29Z

Thanks, sadly that one looks to use the older PyCall.jl instead of the newer PythonCall.jl, so not (yet) compatible.

It's not a big deal if not possible. I guess it's a bit of a sharp edge, especially as it doesn't throw an error, but for users who google this, the workaround does work. I just thought it seemed like it might be more duck-typey/pythonic if any dict-like input was acceptable for initializing from a dict (and similar for sequences) rather than only explicit dicts – I could imagine other cases where it might be useful to have this. But I understand this is totally subjective!

mrkn · 2024-05-22T07:16:10Z

@jbrockmendel I checked the performance degradation by accepting Mapping. The asv benchmark said BENCHMARKS NOT SIGNIFICANTLY CHANGED. See mrkn#1 for more details and the patch I applied.

Note that I don't have much knowledge of the pandas internals so this patch can be insufficient to accept Mapping to create a dataframe.

cjdoris · 2024-05-22T16:54:53Z

I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do supporr

In the case of dict -> Mapping this isn't true - the dataframe constructor calls dict_to_mgr(data, ...) (

pandas/pandas/core/frame.py

Line 763 in 2aa155a

mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy)

) which internally just calls data.keys() and data[key] for each key and converts them to some other internal data structure - so it doesn't matter whether the source is a dict or any other Mapping.

AFAICT the only performance concern is in the actual type check isinstance(data, dict) -> isinstance(data, Mapping) which I presume is negligible (and backed up by mrkn's post).

I haven't looked at the indexing code to see if the same conclusions hold for list -> Sequence there.

mrkn · 2024-05-23T06:21:48Z

I made a pull-request to propose introduce Mapping support in DataFrame construction #58814.

Aloqeely · 2024-05-23T10:36:27Z

#58814 only fixes the issue of constructing a DataFrame from a Mapping, but using a Mapping can still work unexpectedly for other methods, take this example from the docs (using #58814's DictWrapper class):

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]}, index=["falcon", "dog"])

values = {"num_wings": [0, 3]}
my_dict = DictWrapper(values)  # <-- Mapping

print(df.isin(values))  # Correct result
print(df.isin(my_dict))  # Wrong result

A quick search shows 100+ results of isinstance(_, dict) checks and also 100+ isinstance(_, list) checks, so I think if we're going to support Mapping for DataFrame construction then we'd have to support it anywhere else.

mrkn · 2024-05-27T06:40:15Z

@MilesCranmer Regarding __getitem__, the error in this method is not due to checking for list. The real cause is that juliacall.VectorValue is callable. The traceback indicating that the error occurred in com.apply_if_callable is evidence of this.

This means we cannot support juliacall.VectorValue simply by accepting Sequence in addition to list.

mrkn · 2024-05-27T07:00:18Z

I guess the following change, skipping the call to the object if an error occurs, could be acceptable for improving pandas's interoperability with other libraries.

diff --git a/pandas/core/common.py b/pandas/core/common.py
index 9629199122..31442d9f40 100644
--- a/pandas/core/common.py
+++ b/pandas/core/common.py
@@ -385,7 +385,10 @@ def apply_if_callable(maybe_callable, obj, **kwargs):
     **kwargs
     """
     if callable(maybe_callable):
-        return maybe_callable(obj, **kwargs)
+        try:
+            return maybe_callable(obj, **kwargs)
+        except Exception:
+            pass

     return maybe_callable

After passing com.apply_if_callable, the key seems to be checked by the is_list_like function, and it accepts juliacall.VectorValue:

julia> using PythonCall

julia> pd = pyimport("pandas")
Python: <module 'pandas' from '/home/mrkn/.pyenv/versions/3.11.7/lib/python3.11/site-packages/pandas/__init__.py'>

julia> pd._libs.lib.is_list_like(["a", "b"])
Python: True

jbrockmendel · 2024-05-31T17:12:18Z

so I think if we're going to support Mapping for DataFrame construction then we'd have to support it anywhere else.

Definitely -1 on this. The request adds a lot of maintenance burden for a use case that isn't remotely supported.

Aloqeely · 2024-05-31T17:55:29Z

The request adds a lot of maintenance burden for a use case that isn't remotely supported.

I'm -1 on this entirely, the point I wanted to emphasize is that allowing Mapping in the DataFrame constructor does not fully solve this problem, so why bother fixing some parts only?

MilesCranmer · 2024-05-31T18:00:16Z

then we'd have to support it anywhere else

I think the easiest approach would just be to convert it to a dict or list internally, so you would not need to do those checks.

I would of course not expect you to support Mapping throughout the entire library, my apologies if that is what it came off like.

The request adds a lot of maintenance burden

I think if you simply convert incoming types to a dict if they are Mapping-like, then it should be fairly low-cost, and would improve Pandas extensibility to other libraries that inherit from Python's Mapping abstract type, which defines what interfaces they support: https://docs.python.org/3/library/collections.abc.html.

a use case that isn't remotely supported.

While I of course don't expect support for my use-case, I just want to gently suggest that this sort of change seems like the "right thing to do" at some level. The fact Julia is running into errors (despite implementing the abstract class correctly according to the Python docs) is moreso a signal of a general interface mismatch, rather than a specific library to support.

For example, if an incoming object inherits from Python's abstract Mapping class, then pandas __init__ may want to treat it like a mapping, rather than the current behavior seen below. Only checking dict means you end up treating dict-like objects as if they were something else:

import pandas as pd
from collections.abc import Mapping

class MyMapping(Mapping):
    def __init__(self, **kwargs):
        self.d = dict(**kwargs)
    def __getitem__(self, k):
        return self.d[k]
    def __iter__(self):
        return iter(self.d)
    def __len__(self):
        return len(self.d)

d = MyMapping(a=[1, 2], b=[3, 4])

df = pd.DataFrame(d)

print(df)

This results in:

   0
0  a
1  b

which is unexpected.

The Python docs describe this class as:

A container object that supports arbitrary key lookups and implements the methods specified in the collections.abc.Mapping or collections.abc.MutableMapping abstract base classes. Examples include dict, collections.defaultdict, collections.OrderedDict and collections.Counter.

So I think it's reasonable to check for this. Or at the very least throw an error instead of producing unexpected behavior like seen above.

Of course as the maintainers the decision is up to you.

MilesCranmer · 2024-06-01T00:27:45Z

Looks like there was also thread a while ago on this: #6792 (thanks for linking @mroeschke)

MilesCranmer added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 21, 2024

MilesCranmer mentioned this issue May 21, 2024

Pandas compatibility JuliaPy/PythonCall.jl#501

Open

2 tasks

mrkn mentioned this issue May 23, 2024

Support construction of a DataFrame from a Mapping #58814

Closed

5 tasks

Aloqeely added Needs Discussion Requires discussion from core team before further action API Design Constructors Series/DataFrame/Index/pd.array Constructors and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 23, 2024

mroeschke mentioned this issue May 31, 2024

isinstance check against mutable mappings rather than dicts #6792

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: generalize `init` on a `dict` to `abc.collections.Mapping` and `getitem` on a `list` to `abc.collections.Sequence` #58803

ENH: generalize `init` on a `dict` to `abc.collections.Mapping` and `getitem` on a `list` to `abc.collections.Sequence` #58803

MilesCranmer commented May 21, 2024 •

edited

Loading

jbrockmendel commented May 22, 2024

MilesCranmer commented May 22, 2024

Aloqeely commented May 22, 2024

MilesCranmer commented May 22, 2024

mrkn commented May 22, 2024

cjdoris commented May 22, 2024

mrkn commented May 23, 2024

Aloqeely commented May 23, 2024

mrkn commented May 27, 2024

mrkn commented May 27, 2024

jbrockmendel commented May 31, 2024

Aloqeely commented May 31, 2024

MilesCranmer commented May 31, 2024 •

edited

Loading

MilesCranmer commented Jun 1, 2024

ENH: generalize __init__ on a dict to abc.collections.Mapping and __getitem__ on a list to abc.collections.Sequence #58803

ENH: generalize __init__ on a dict to abc.collections.Mapping and __getitem__ on a list to abc.collections.Sequence #58803

Comments

MilesCranmer commented May 21, 2024 • edited Loading

Feature Type

Problem Description

Feature Description

Initialization

Indexing

Alternative Solutions

Alternative solution 1

Alternative solution 2

Alternative solution 3

Additional Context

jbrockmendel commented May 22, 2024

MilesCranmer commented May 22, 2024

Aloqeely commented May 22, 2024

MilesCranmer commented May 22, 2024

mrkn commented May 22, 2024

cjdoris commented May 22, 2024

mrkn commented May 23, 2024

Aloqeely commented May 23, 2024

mrkn commented May 27, 2024

mrkn commented May 27, 2024

jbrockmendel commented May 31, 2024

Aloqeely commented May 31, 2024

MilesCranmer commented May 31, 2024 • edited Loading

MilesCranmer commented Jun 1, 2024

ENH: generalize `init` on a `dict` to `abc.collections.Mapping` and `getitem` on a `list` to `abc.collections.Sequence` #58803

ENH: generalize `init` on a `dict` to `abc.collections.Mapping` and `getitem` on a `list` to `abc.collections.Sequence` #58803

MilesCranmer commented May 21, 2024 •

edited

Loading

MilesCranmer commented May 31, 2024 •

edited

Loading