Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: generalize __init__ on a dict to abc.collections.Mapping and __getitem__ on a list to abc.collections.Sequence #58803

Open
1 of 3 tasks
MilesCranmer opened this issue May 21, 2024 · 14 comments
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Needs Discussion Requires discussion from core team before further action

Comments

@MilesCranmer
Copy link
Contributor

MilesCranmer commented May 21, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I wish that DataFrame was more compatible with initializing from more general classes of dictionaries, and being indexed by more general classes of sequences.

Feature Description

Initialization

This is motivated by attempting to use Pandas from Julia via PythonCall.jl, where I found the strange behavior where:

using PythonCall

pd = pyimport("pandas")

df = pd.DataFrame(Dict([
    "a" => [1, 2, 3],
    "b" => [4, 5, 6]
]))

would result in the following dataframe:

julia> df
Python:
   0
0  b
1  a

As @cjdoris discovered, this this is due to the fact "pandas.DataFrame.__init__ explicitly checks if its argument is a dict and Py(::Dict) is not a dict (it becomes a juliacall.DictValue in Python)."

The way to fix this would be to have __init__ instead check for its input being an instance of abc.collections.Mapping instead, which includes both dict and abc.collections.Mapping.

Indexing

The second incompatibility here is that indexing using lists that are not exactly list does not work as expected. For example, if I try to index a pandas dataframe using a sequence of strings Julia, I get this error:

julia> df[["a", "b"]]
ERROR: Python: TypeError: Julia: MethodError: objects of type Vector{String} are not callable
Use square brackets [] for indexing an Array.
Python stacktrace:
 [1] __call__
   @ ~/.julia/packages/PythonCall/S5MOg/src/JlWrap/any.jl:223
 [2] apply_if_callable
   @ pandas.core.common ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/common.py:384
 [3] __getitem__
   @ pandas.core.frame ~/Documents/pysr_projects/arya/bigbench/.CondaPkg/env/lib/python3.12/site-packages/pandas/core/frame.py:4065
Stacktrace:
 [1] pythrow()
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:92
 [2] errcheck
   @ ~/.julia/packages/PythonCall/S5MOg/src/Core/err.jl:10 [inlined]
 [3] pygetitem(x::Py, k::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/builtins.jl:171
 [4] getindex(x::Py, i::Vector{String})
   @ PythonCall.Core ~/.julia/packages/PythonCall/S5MOg/src/Core/Py.jl:292
 [5] top-level scope
   @ REPL[18]:1

As @cjdoris found, this is due to __getitem__ checking for list.

The solution would be to check for the more general abc.collections.Sequence, which includes both list and juliacall.VectorValue.

Alternative Solutions

Alternative solution 1

There is https://github.com/JuliaData/DataFrames.jl which implements a dataframe type from scratch, but pandas is more actively maintained, and I find it easier to use, so would like to call it from Julia.

Alternative solution 2

The current workaround is to use these operations as follows (thanks to @mrkn)

julia> df = pd.DataFrame(pydict(Dict("a" => [1, 2, 3], "b" => [4, 5, 6])))
Python:
   b  a
0  4  1
1  5  2
2  6  3

where we explicitly convert to a Python dict object. However, this is not obvious, and there is no error when you avoid pydict - it's only from checking the contents do you see it resulted in an unexpected dataframe.

Similarly, for indexing:

julia> df[pylist(["a", "b"])]
Python:
   a  b
0  1  4
1  2  5
2  3  6

However, this is not immediately obvious, and is a bit verbose. My immediate thought is that passing a list-like object to pandas getitem should work for indexing, and passing a dict-like object should work for initialization. Perhaps this is subjective but I thought I'd see what others think.

Checking the more general class of types would fix this, and also be compatible with any other list-like or dict-like inputs.

Alternative solution 3

As brought up by @cjdoris in JuliaPy/PythonCall.jl#501, one option is to change PythonCall so that it automatically convert Julia Dict to Python dict whenever passed to a Python object.

However, this would prevent people from mutating Julia dicts from within Python code, so would be a massive change in behavior. And converting Julia Vectors to Python lists automatically would also prevent mutation from Python, so is basically a non-option.

Additional Context

Discussed in JuliaPy/PythonCall.jl#501

@MilesCranmer MilesCranmer added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 21, 2024
@jbrockmendel
Copy link
Member

I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do supporr

@MilesCranmer
Copy link
Contributor Author

Ah, there would be a performance penalty? I would have thought it would just be changing an isinstance(i, list) to isinstance(i, Sequence)?

@Aloqeely
Copy link
Member

Slightly unrelated, but have you tried the Julia pandas wrapper? I believe it automatically does the conversion from Julia Vector/Dict to a Python list/dict for you

@MilesCranmer
Copy link
Contributor Author

Thanks, sadly that one looks to use the older PyCall.jl instead of the newer PythonCall.jl, so not (yet) compatible.

It's not a big deal if not possible. I guess it's a bit of a sharp edge, especially as it doesn't throw an error, but for users who google this, the workaround does work. I just thought it seemed like it might be more duck-typey/pythonic if any dict-like input was acceptable for initializing from a dict (and similar for sequences) rather than only explicit dicts – I could imagine other cases where it might be useful to have this. But I understand this is totally subjective!

@mrkn
Copy link

mrkn commented May 22, 2024

@jbrockmendel I checked the performance degradation by accepting Mapping. The asv benchmark said BENCHMARKS NOT SIGNIFICANTLY CHANGED. See mrkn#1 for more details and the patch I applied.

Note that I don't have much knowledge of the pandas internals so this patch can be insufficient to accept Mapping to create a dataframe.

@cjdoris
Copy link

cjdoris commented May 22, 2024

I’m -0.5 on this. Internally we would just convert to dict/list anyway. I’d rather users do that where necessary and avoid the perf penalty in the cases we do supporr

In the case of dict -> Mapping this isn't true - the dataframe constructor calls dict_to_mgr(data, ...) (

mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy)
) which internally just calls data.keys() and data[key] for each key and converts them to some other internal data structure - so it doesn't matter whether the source is a dict or any other Mapping.

AFAICT the only performance concern is in the actual type check isinstance(data, dict) -> isinstance(data, Mapping) which I presume is negligible (and backed up by mrkn's post).

I haven't looked at the indexing code to see if the same conclusions hold for list -> Sequence there.

@mrkn
Copy link

mrkn commented May 23, 2024

I made a pull-request to propose introduce Mapping support in DataFrame construction #58814.

@Aloqeely
Copy link
Member

#58814 only fixes the issue of constructing a DataFrame from a Mapping, but using a Mapping can still work unexpectedly for other methods, take this example from the docs (using #58814's DictWrapper class):

df = pd.DataFrame({"num_legs": [2, 4], "num_wings": [2, 0]}, index=["falcon", "dog"])

values = {"num_wings": [0, 3]}
my_dict = DictWrapper(values)  # <-- Mapping

print(df.isin(values))  # Correct result
print(df.isin(my_dict))  # Wrong result

A quick search shows 100+ results of isinstance(_, dict) checks and also 100+ isinstance(_, list) checks, so I think if we're going to support Mapping for DataFrame construction then we'd have to support it anywhere else.

@Aloqeely Aloqeely added Needs Discussion Requires discussion from core team before further action API Design Constructors Series/DataFrame/Index/pd.array Constructors and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 23, 2024
@mrkn
Copy link

mrkn commented May 27, 2024

@MilesCranmer Regarding __getitem__, the error in this method is not due to checking for list. The real cause is that juliacall.VectorValue is callable. The traceback indicating that the error occurred in com.apply_if_callable is evidence of this.

This means we cannot support juliacall.VectorValue simply by accepting Sequence in addition to list.

@mrkn
Copy link

mrkn commented May 27, 2024

I guess the following change, skipping the call to the object if an error occurs, could be acceptable for improving pandas's interoperability with other libraries.

diff --git a/pandas/core/common.py b/pandas/core/common.py
index 9629199122..31442d9f40 100644
--- a/pandas/core/common.py
+++ b/pandas/core/common.py
@@ -385,7 +385,10 @@ def apply_if_callable(maybe_callable, obj, **kwargs):
     **kwargs
     """
     if callable(maybe_callable):
-        return maybe_callable(obj, **kwargs)
+        try:
+            return maybe_callable(obj, **kwargs)
+        except Exception:
+            pass

     return maybe_callable

After passing com.apply_if_callable, the key seems to be checked by the is_list_like function, and it accepts juliacall.VectorValue:

julia> using PythonCall

julia> pd = pyimport("pandas")
Python: <module 'pandas' from '/home/mrkn/.pyenv/versions/3.11.7/lib/python3.11/site-packages/pandas/__init__.py'>

julia> pd._libs.lib.is_list_like(["a", "b"])
Python: True

@jbrockmendel
Copy link
Member

so I think if we're going to support Mapping for DataFrame construction then we'd have to support it anywhere else.

Definitely -1 on this. The request adds a lot of maintenance burden for a use case that isn't remotely supported.

@Aloqeely
Copy link
Member

The request adds a lot of maintenance burden for a use case that isn't remotely supported.

I'm -1 on this entirely, the point I wanted to emphasize is that allowing Mapping in the DataFrame constructor does not fully solve this problem, so why bother fixing some parts only?

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented May 31, 2024

then we'd have to support it anywhere else

I think the easiest approach would just be to convert it to a dict or list internally, so you would not need to do those checks.

I would of course not expect you to support Mapping throughout the entire library, my apologies if that is what it came off like.

The request adds a lot of maintenance burden

I think if you simply convert incoming types to a dict if they are Mapping-like, then it should be fairly low-cost, and would improve Pandas extensibility to other libraries that inherit from Python's Mapping abstract type, which defines what interfaces they support: https://docs.python.org/3/library/collections.abc.html.

a use case that isn't remotely supported.

While I of course don't expect support for my use-case, I just want to gently suggest that this sort of change seems like the "right thing to do" at some level. The fact Julia is running into errors (despite implementing the abstract class correctly according to the Python docs) is moreso a signal of a general interface mismatch, rather than a specific library to support.

For example, if an incoming object inherits from Python's abstract Mapping class, then pandas __init__ may want to treat it like a mapping, rather than the current behavior seen below. Only checking dict means you end up treating dict-like objects as if they were something else:

import pandas as pd
from collections.abc import Mapping

class MyMapping(Mapping):
    def __init__(self, **kwargs):
        self.d = dict(**kwargs)
    def __getitem__(self, k):
        return self.d[k]
    def __iter__(self):
        return iter(self.d)
    def __len__(self):
        return len(self.d)

d = MyMapping(a=[1, 2], b=[3, 4])

df = pd.DataFrame(d)

print(df)

This results in:

   0
0  a
1  b

which is unexpected.

The Python docs describe this class as:

A container object that supports arbitrary key lookups and implements the methods specified in the collections.abc.Mapping or collections.abc.MutableMapping abstract base classes. Examples include dict, collections.defaultdict, collections.OrderedDict and collections.Counter.

So I think it's reasonable to check for this. Or at the very least throw an error instead of producing unexpected behavior like seen above.

Of course as the maintainers the decision is up to you.

@MilesCranmer
Copy link
Contributor Author

Looks like there was also thread a while ago on this: #6792 (thanks for linking @mroeschke)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Constructors Series/DataFrame/Index/pd.array Constructors Enhancement Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants