Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Private-looking symbols in the public API? #56874

Open
1 task done
twoertwein opened this issue Jan 14, 2024 · 12 comments
Open
1 task done

DOC: Private-looking symbols in the public API? #56874

twoertwein opened this issue Jan 14, 2024 · 12 comments
Labels
Docs ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@twoertwein
Copy link
Member

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

doc/source/reference/extensions.rst

Documentation problem

doc/source/reference/extensions.rst (and probably some of the other references) documents methods to be public even though they start with a leading _ (which typically is used to indicate private symbols). I know pandas declares that the documentation dictates what is public (as opposed to typical Python conventions) but it seems very confusing to document private-looking symbols as being part of the public API.

xref pandas-dev/pandas-stubs#850

Suggested fix for documentation

It would be good to either

  • remove private-looking symbols from the documentation (if they are actually private), or
  • rename them and deprecate the private-looking name

After that, it would be great to have a script in pre-commit to check for private-looking symbols in the public documentation.

@twoertwein twoertwein added Docs Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 14, 2024
@mroeschke
Copy link
Member

I definitely think these should not be renamed since these methods should not be used directly by users because they are called internally instead.

So I guess I would support removing this from the API reference, but IMO it would be nice that these methods have exposure somewhere in the docs to signal that these are important for EA authors to implement.

@twoertwein
Copy link
Member Author

it would be nice that these methods have exposure somewhere in the docs to signal that these are important for EA authors to implement.

I think in other programming languages, this would correspond to a "protected" symbol. Not sure whether there is a PEP for public/protected/private in Python, but I understand now why it would be good to keep them documented.

these methods should not be used directly by users

Should they then be exposed in pandas-stubs? I guess it depends on whether we expect/want EA authors to use pandas or pandas-stubs @Dr-Irv

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 15, 2024

Should they then be exposed in pandas-stubs? I guess it depends on whether we expect/want EA authors to use pandas or pandas-stubs @Dr-Irv

I would expect that an EA author would use pandas-stubs. You'd want to make sure your EA code and tests of your EA were properly typed. This issue came from a report in pandas-stubs where someone is trying to write an EA using the stubs to support typing.

Having said that, one of the issues with the EA interface design is that EA authors have to implement a bunch of methods and some of them have "private" signatures, and others do not. See https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.extensions.ExtensionArray.html#pandas.api.extensions.ExtensionArray

Why do we tell EA authors to implement _from_sequence() and isna(), for example?

@rhshadrach
Copy link
Member

Having said that, one of the issues with the EA interface design is that EA authors have to implement a bunch of methods and some of them have "private" signatures, and others do not.

Because the public vs private is for users of the EAs, not the authors of the EA. Unfortunately, Python only provides public vs internal indications, but we need three: public to EA user, internal to EA user but "public" to EA author, internal to EA author.

Why do we tell EA authors to implement _from_sequence() and isna(), for example?

We don't want EA users to call _from_sequence so it is marked as internal. On the other hand, EA users need to be able to call isna. We could have EA authors implement _isna and just have isna call _isna, but that seems unnecessary.

@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Jan 16, 2024
@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Jan 16, 2024
@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 16, 2024

Because the public vs private is for users of the EAs, not the authors of the EA. Unfortunately, Python only provides public vs internal indications, but we need three: public to EA user, internal to EA user but "public" to EA author, internal to EA author.

Maybe I'm just missing something here, but let's say we have the following:

  • EA Author "Mary" authors her own EA called MaryEA. Mary makes this available to others.
  • A pandas user "Bob" writes code that uses MaryEA. He does this by creating Series, via pd.Series(MaryEA(arguments, ...)).

My question is then would there be any reason for Bob to call the methods on the MaryEA object. Wouldn't Bob just use the methods on Series ?

And, if that's the case, is there really an "EA user", especially since EA is documented as an abstract class, which requires you to be an EA Author (like Mary) to use it?

@jbrockmendel
Copy link
Member

@mroeschke
Copy link
Member

My question is then would there be any reason for Bob to call the methods on the MaryEA object. Wouldn't Bob just use the methods on Series ?

The common case would probably be that the user would just use the Series methods, yes. But there is the public Series.array that allows the user to access the underlying array which would be the MaryEA. A user may want to work with a similar Series object but without the index.

@rhshadrach
Copy link
Member

And the underlying EA could also expose additional functionality that isn't directly used by Series/DataFrame.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 16, 2024

And the underlying EA could also expose additional functionality that isn't directly used by Series/DataFrame.

Yes, but in my example, wouldn't that be part of the documentation of MaryEA, and not something we worry about?

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 16, 2024

See also https://mail.python.org/pipermail/pandas-dev/2023-January/001559.html

Copying that text here, because it is quite relevant (thanks @jbrockmendel ):

For historical reasons we've built up an EA namespace without much internal
logic in terms of what is public/private. While this isn't that big of a
deal, it'd be nice to make this more coherent. I see two useful options:

  1. Use the traditional "an underscore means this should only be called from
    within self". Very few methods on the base class satisfy that
    characteristic, including the constructor _from_sequence. One benefit of
    moving to this is it would make "official" that we shouldn't be using
    _values_for_foo from outside EA methods.

  2. Use underscores to signal to 3rd party authors whether or not there
    exists a working (not necessarily performant) implementation on the base
    class. In this scenario authors would have to implement private methods,
    while implementing public methods would be optional.

Thoughts?

@rhshadrach
Copy link
Member

rhshadrach commented Jan 16, 2024

And the underlying EA could also expose additional functionality that isn't directly used by Series/DataFrame.

Yes, but in my example, wouldn't that be part of the documentation of MaryEA, and not something we worry about?

I'm giving a reason why a user might need access to the underlying EA via .array as @mroeschke commented, in response to #56874 (comment):

My question is then would there be any reason for Bob to call the methods on the MaryEA object. Wouldn't Bob just use the methods on Series ?


Copying that text here, because it is quite relevant (thanks @jbrockmendel ):
...
Thoughts?

In both of the proposed conventions, EA authors need to implement _foo methods, and so we'd want to keep these documented.

@MichaelTiemannOSC
Copy link
Contributor

In both of the proposed conventions, EA authors need to implement _foo methods, and so we'd want to keep these documented.

...and typed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Docs ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

No branches or pull requests

8 participants