Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: require _constructor/_constructor_sliced to return a class #51772

Open
jbrockmendel opened this issue Mar 3, 2023 · 26 comments
Open

DEPR: require _constructor/_constructor_sliced to return a class #51772

jbrockmendel opened this issue Mar 3, 2023 · 26 comments
Labels
Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Mar 3, 2023

(DataFrame|Series)Subclass._constructor(_sliced|_expanddim)? does not currently need to return a class. It can return a callable. This means that we cannot write obj._constructor_sliced.some_method, which has caused bugs/confusion at times. Most recently this surprised a contributor in #51765 and makes the fix there less straightforward.

AFAIK the only/main subclass that relies on this is geopandas, which inspects the arguments to decide which subclass to use. IIUC this could be accomplished with a __new__ method just as easily. cc @jorisvandenbossche

We could even go as far as deprecating _constructor entirely in favor of type(self) and just having _constructor_sliced/_constructor_expanddim.

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects and removed Needs Triage Issue that has not been reviewed by a pandas team member Bug labels Mar 3, 2023
@jbrockmendel
Copy link
Member Author

@phofl since passing Managers to the constructors came up in another issue: my preferred solution is to do this deprecation, then implement something like NDFrame._from_mgr to pass Manager objects to instead of the constructors

@phofl
Copy link
Member

phofl commented Mar 16, 2023

+1

@jorisvandenbossche
Copy link
Member

With the idea that _from_mgr is also available to override for subclasses? (it would be needed to preserve the current subclassing functionality)

I think for internal purposes, more consistently using something like _from_mgr instead of _constructor makes sense for code clarity. But I am not sure it solves any of the issues from the top post about interaction with subclasses?

@jbrockmendel
Copy link
Member Author

Good catch on an implicit assumption worth making explicit! I was thinking we'd call self._constructor._from_mgr where relevant. For that usage an overridden self._from_mgr would work fine. For self._constructor_sliced._from_mgr (which im not sure how often it would be needed) we'd either need a _from_mgr_sliced or for _constructor_sliced to be a class.

@jorisvandenbossche
Copy link
Member

I was thinking we'd call self._constructor._from_mgr where relevant.

But then you can't deprecate _constructor? (and what would be the benefit of doing self._constructor._from_mgr compared to calling self._from_mgr?)

@jbrockmendel
Copy link
Member Author

_constructor._from_mgr would only work if _constructor reliably returned a class. So the deprecation is for allowing non-class _constructor

@jorisvandenbossche
Copy link
Member

Ah, yes, of course. Still, the second part of my question still holds I think? I.e. what would be the benefit of doing self._constructor._from_mgr compared to calling self._from_mgr?

@jbrockmendel
Copy link
Member Author

For the non-_sliced version either way would work

@jbrockmendel
Copy link
Member Author

Looks like we call _constructor_sliced with a Manager object in _box_col_values, _ixs, and xs

@jreback
Copy link
Contributor

jreback commented Mar 23, 2023

+1 in deprecation / change to require a class

@m-richards
Copy link
Contributor

m-richards commented Mar 25, 2023

(for context I worked implementing the _constructor behaviour in GeoPandas)

AFAIK the only/main subclass that relies on this is geopandas, which inspects the arguments to decide which subclass to use. IIUC this could be accomplished with a __new__ method just as easily. cc jorisvandenbossche

I've just opened geopandas/geopandas#2845 to test what this might look like on the geopandas side. Generally, it's a bit awkward because for GeoPandas we have subclass typecasting logic which we only introduce for _constructor/ _constructor_sliced/ _constructor_expanddim - the same rules don't apply to a user constructing a GeoSeries/ GeoDataFrame calling the constructor.

Seems like it's possible from a quick test creating dummy classes, but not the most elegant. For _constructor_sliced it's particularly unfortunate right now as we were defining the _constructor_sliced callable inside the method scope, and that means the corresponding class is getting defined at runtime.

Aside, GeoPandas is currently patching over the fact that self._constructor_expanddim._get_axis_number is used by pandas for concat by monkey patching that as an attribute onto the callable we define, so this proposal would remove the need for that.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 25, 2023

Aside, GeoPandas is currently patching over the fact that self._constructor_expanddim._get_axis_number is used by pandas for concat by monkey patching that as an attribute onto the callable we define, so this proposal would remove the need for that.

This has already been fixed in pandas more than a year ago (#46018). So we only still have that monkeypatch in geopandas for compat with older pandas versions.

@jorisvandenbossche
Copy link
Member

We chatted about this at the dev meeting this week as well. What is not yet fully clear to me is what is the exact motivation for wanting to change this.

  1. Because we want to make use of the fact that it returns a class (access class attributes, or methods, ..). Personally, I don't think that this currently blocks new developments where this would be needed? Is there a concrete use case where we would want to make use of this that would otherwise not be possible?
    I would not be in favor of actually making use of this, because that makes the story more complex for subclasses. In what situation would we make use of such a method? And should it then be the DataFrame method or the subclassed method? In @m-richards's POC to do this in geopandas, he used DataFrame and not GeoDataFrame, but will that always be correct? Limiting _constructor to just a constructor is conceptually simpler to reason about.

  2. Because it is confusing for developers / contributors (expecting that the properties return a class). But I think it is only confusing if they actually try to use it as being a class, and then we are back at point 1? Further, if we want to avoid this (contributors thinking they can use it as a class, reviewers having to catch that to avoid introducing a bug), we can also add a code check to catch such cases. I think that should be quite easy to do (eg checking for ._contructor. won't catch all cases, like where it's first assigned to a variable, but I think it could catch the majority of cases where someone tries to do this)

As I mentioned in the meeting as well, and as @m-richards shows, it's not that this is technically difficult for GeoPandas (we just need to create another subclass with only a custom __new__ method). While it's not as nice as the current code, it is certainly possible, but my main problem with this proposal is that I don't really see the benefits of this change, or if we start using methods of the returned class, that this will just introduce new questions/bugs.

@m-richards
Copy link
Contributor

We chatted about this at the dev meeting this week as well. What is not yet fully clear to me is what is the exact motivation for wanting to change this.

Sorry, I'm probably not adding much value to the discussion here then, I'm unfortunately not that up to date on pandas dev stuff.

But I will briefly comment on this

Limiting _constructor to just a constructor is conceptually simpler to reason about.

This is certainly a nice mental model for someone who isn't super familiar on pandas internals.

If pandas does have a need or desire to make this _constructor.some_method always work, it would be really nice to have a discrete list of methods that are required to be defined (perhaps as part of the subclassing documentation, or a typing protocol defined for _constructor), rather than any property on a DataFrame, as this certainly makes the story much more complex for subclasses as joris says. If the counterpoint to this that the subclass doesn't need to implement/ override these - then they shouldn't need to be accessed via _constructor, the parent implementation should suffice?

@jorisvandenbossche
Copy link
Member

Sorry, I'm probably not adding much value to the discussion here then, I'm unfortunately not that up to date on pandas dev stuff.

No, no, your feedback is useful and very much appreciated! (my mention of that this was discussed at a meeting was just to give some context to my comment, as it was partly responding to things were discussed in the meeting)

@jbrockmendel
Copy link
Member Author

Just noticed that we have cons = cast(Type["DataFrame"], self._constructor) in NDFrame.convert_dtypes. Not really a problem, but not great.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Apr 19, 2023

I've re-read through #51772. If this was never meant to work in the first place, and it was never documented to work, then it seems perfectly fine to deprecate. Especially if it simplifies the codebase and if it's not technically difficult to workaround in geopandas

Regarding the point that it'd be easy to write a code check to ensure that .constructor is only ever used assuming it's a class - I'm not sure I agree. At least, not statically, and anything involving inspecting annotations tends to be noticeably slower

[reposted from #52420]

@jorisvandenbossche
Copy link
Member

when we originally did this there was NO intention or reason to ever make these callable.

If this was never meant to work in the first place, and it was never documented to work, then it seems perfectly fine to deprecate

It's true that it is not explicitly documented that it can be a generic callable, but I think I can also say that it is not explicitly documented that it must be a class. We only speak about "constructor" (in prose docs and in the name of the methods), and IMO that doesn't have to indicate a class. I think "constructor" is generally understood as "a callable to construct an object".

Personally I don't know what the "intention" was when this was added 12 years ago (I could dig up a commit of that time that calls this "factory functions", but I don't want to put any intent in those words). But the fact is that this has been used for quite a while as a callable (and not just by geopandas).

Anyway, I have already said that it is technically possible for subclasses to adapt to this proposal (you just put your current callable in a __new__ method on some class). But it just makes it more complex for a subclass, for IMO hardly any benefit for pandas. I raised some concerns and questions above (and @m-richards as well), and I am not going to repeat those to not get accused of going in circles, but it would be nice to get some response to those.

@jorisvandenbossche
Copy link
Member

Regarding the point that it'd be easy to write a code check to ensure that .constructor is only ever used assuming it's a class - I'm not sure I agree. At least, not statically, and anything involving inspecting annotations tends to be noticeably slower

What I mean is a simple pre-commit check like the following:

diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 43b3699907..6384f94774 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -292,6 +292,13 @@ repos:
         files: ^pandas/tests/extension/base/
         exclude: ^pandas/tests/extension/base/base\.py$
         types_or: [python, cython, rst]
+    -   id: unwanted-patterns-constructor
+        name: Unwanted patterns for _constructor
+        language: pygrep
+        entry: |
+            (?x)
+            \._constructor\.
+            |\._constructor_sliced\.
+            |\._constructor_expanddim\.
+        types: [python]

This catches it if someone does self._constructor._some_method(..), and so can help making this mistake if we keep it as a generic callable.
That certainly doesn't catch all cases (eg where you first assigned self._constructor to a variable, and only in a second step call a method on that variable, like we do in groupby code), but it would already highlight the simple cases (and would for example have caught the occurrences that were fixed in #46018)

(sidenote: the above wouldn't actually pass right now, because we also have Index._constructor that is a class, but that's something we could rename).

@MarcoGorelli
Copy link
Member

Could you clarify which concerns/questions exactly you'd like a response to please? It looks to me like that overall conclusion is

it is technically possible for subclasses to adapt to this proposal (you just put your current callable in a new method on some class)

Relying on a regex (like the one you posted) to prevent bugs sounds dangerous

I kind of feel like flipping the question around - is there a strong enough reason to not do this? I may be misunderstanding, but moving the callable to a __new__ methods sounds alright?

@jorisvandenbossche
Copy link
Member

Relying on a regex (like the one you posted) to prevent bugs sounds dangerous

Can you explain why you think it would be dangerous? (We have plenty of similar pre-commit checks that do some regex)

@MarcoGorelli
Copy link
Member

sure - the other ones are more for linting issues, rather than for preventing bugs

@jorisvandenbossche
Copy link
Member

OK, I understand, it's indeed less about style or patterns we just don't like. Now, I would personally still not categorize it as dangerous, if it is only to help catch the pattern. In the end it is still up to the reviewers (and our test suite, and the test suites of subclasses) to catch those things. It also doesn't cause bugs in pandas itself, only potentially for subclasses. But we have plenty of ways that we can do something wrong for subclasses that now need to be catched by the human reviewer and which can easily be missed. For example, ensuring that we always use _constructor(..) for creating a final result in a method is also something that regularly got missed / fixed.

@jorisvandenbossche
Copy link
Member

Could you clarify which concerns/questions exactly you'd like a response to please?

It's essentially #51772 (comment) and #51772 (comment) (the latest comments here without any response, before you revived the discussion last week).
But let me try to rephrase / summarize (and repeat part of those comments above):

  • I would like to better understand why we want to change this: is it only because this can be confusing while working on the code base (with #51765 given as example in the top post), or is this actually blocking new developments? (i.e. is there a concrete use case where we would want to make use of this that would otherwise not be possible?)
    I assume it is the first: just a bit of convenience for us. But in this trade-off, I am personally not convinced the little benefits are worth the trouble of changing this (and the trouble being: the hassle of the deprecation and making the life of subclasses more difficult, see next point).
  • My main concern is that this makes the story more complex for subclasses. In what situation would pandas make use of such a class method? And should it then be the DataFrame method or the subclassed method? Limiting _constructor to just a constructor is conceptually simpler to reason about. Quoting Matt: "this is a nice mental model for someone who isn't super familiar on pandas internals."
    Related to this, Matt asked in his comment if pandas could then define a list of methods that it might call on this _constructor class (and thus need to be defined on the class in a compatible way). But I assume we would not want to do that, because it would fully defeat the purpose of simplifying things (because then we could use _constructor as a class but only in a subset of cases, instead of the current situation of just never using it as a class, and so that would only make it more complicated for working on the pandas code base).

Let's try to make both of those points more concrete with an example from #51765 (comment) (the case where Brock correctly commented that the initial code in the PR was wrong because we currently can't assume _constructor is a class). Simplified a bit, this was essentially doing:

# original code:
result = self.apply(Series.value_counts, **kwargs)
# the PR changed this to:
result = self.apply(self.obj._constructor.value_counts, **kwargs)

where self.apply in this case was a SeriesGroupby (and so self.obj is the original series on which we are calling _constructor), and so what this apply does is essentially calling the function on the series object of each group (just for understanding, this is not actually written in our code): result = series._constructor.value_counts(series_subset, **kwargs).

The above relies on calling the value_counts method on the Series class, and then pass a series instance as the first argument (self) of that class method (instead of calling the method on an instance).

The above code relies on _constructor being a class, and so if we want this to work correctly on subclasses, I suppose another way to write this would be (without having looked in detail at our groupby code, so this could be wrong):

result = self.apply(lambda ser: ser.value_counts(**kwargs))

This assumes that the object passed to the function by apply is already the correct Series (subclass) object, and AFAIK this is the case.

So first, I am personally not convinced that this is making the fix less straightforward (but of course this is only one example, and there are probably other examples where it might be more complicated to not rely on the class methods).

But this also illustrates the second point why this can complicate things for subclasses: assume a subclass has overridden value_counts. At that point you have Series.value_counts and SubclassedSeries.value_counts being two different methods with potentially different behaviour. Which of the two should be called? That should depend on the actual class of the subset, not determined by the parent object, because getting a subset from the parent object can change the class.

The exact example using Series.value_counts here is not perfect to illustrate this, because it is quite unlikely that selecting a subset of the series (for each group) would give a different class than the parent Series. But in the context of a DataFrame applying a method on each of its columns, that is certainly not unrealistic. For example, assume the following pseudo code:

func_to_apply = df._constructor_sliced.value_counts
for col in df.columns:
    ser = df[col]
    result = func_to_apply(ser)
    ...

Using this pattern of calling a method on one of the constructor attributes could in this case lead to wrong behaviour if not every column (ser in the example) gives the same class.
And when the argument is: just define a class with a __new__ to return from _constructor(_sliced), then that's exactly to do this: to not always return the same class (otherwise you don't need a __new__ to start with).

BTW, there are also other reasons for wanting to have a custom callable, apart from being able to be flexible in which class to return. Another reason is to have custom logic relying on characteristics of the calling self. theOehrly/Fast-F1#349 (comment) explains such a use case (also in geopandas we make use of checking the calling self in certain corner cases to determine which class to return). Of course this can also be done with a custom class init by passing self as extra argument. But I think the current way to do this with a constructor function is quite a bit simpler.


(sorry, in the end it turned out not to be a summary .., but rather a lengthy but hopefully somewhat illustrative post)

@jbrockmendel
Copy link
Member Author

The fact that this has caused bugs and development burden in the not-so-distant past is one concern. Another big one is that this is a blocker to implementing perf-improving _from_mgr method (#52132), which it itself a blocker for deprecating allowing managers in __init__ (#52419), which it itself a blocker for anything like ArrayManager ever being made fully functional.

Allowing this particular usage significantly impedes pandas development. And AFAICT the use case geopandas needs it for can be better supported by just registering an accessor, which is actually intentionally supported.

@jorisvandenbossche
Copy link
Member

Another big one is that this is a blocker to implementing perf-improving _from_mgr method (#52132)

I might have missed it, but I don't see it mentioned in that PR that it is being blocked by this issue (the only mention I see is about my alternative implementation proposal that would make use of a callable _constructor, but not about the current version of the PR needing this).
Anyway, this also triggered me to have another look at #52132 (see my comments there, had another possible idea to implement a transition). I would personally suggest to first focus on that, given that that's a feature we agree on being beneficial, and it might help clarify this discussion whether it would require this or not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants