New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: pandas.api #13634

Closed
jreback opened this Issue Jul 13, 2016 · 66 comments

Comments

Projects
None yet
4 participants
@jreback
Contributor

jreback commented Jul 13, 2016

In xref #13147

established a bit of a public api in pandas.api; ATM this only contains the type introspection routines.

1st question

Some disagreement on whether we should not do this, and rather just expose pandas.types directly.

I think pandas.api is actually very important because:

  1. limits scope of what we choose to expose in the future; its not simply 'everything' that isnt nailed down (the case now). This does change pandas to make everything by default private, EXCEPT what is explicitly public.

  2. a single entry point for other packages to use the pandas public API that is known / stable / maintained (will with c-API as well)

  3. provide consistent naming externally (and allow us to fix / hide internally as needed)

  4. namespaced. I only import what I need as a user / other package, rather than everything

2nd question

as discussed here, should these deprecated API functions should be DeprecationWrarning rather than FutureWarning?
-> done in #13990

Ideally we should resolve this before the end of 0.19.0 (or remove it).

@jreback jreback added this to the 0.19.0 milestone Jul 13, 2016

@jreback

This comment has been minimized.

Contributor

jreback commented Jul 13, 2016

@jreback

This comment has been minimized.

Contributor

jreback commented Jul 13, 2016

@jorisvandenbossche updated the top section.

jreback added a commit that referenced this issue Aug 13, 2016

Use DeprecationWarning for core.common deprecations (GH13634)
Related to second question in #13634 (whether to use FutureWarning or
DeprecationWarning in deprecating the public pandas.core.common
functions).    As those functions are mostly used in library code, and
less directly by users in their own code, I think a DeprecationWarning
is more appropriate in this case.   For example, in our own docs, we
started to get warnings due to an example with a statsmodels
regression that uses patsy using one of those functions. Note that
recent IPython also shows DeprecationWarnings when using a deprecated
function interactively.

Author: Joris Van den Bossche <jorisvandenbossche@gmail.com>

Closes #13990 from jorisvandenbossche/common-depr-warning and squashes the following commits:

2de5d48 [Joris Van den Bossche] Use DeprecationWarning for core.common deprecations (GH13634)

@jreback jreback modified the milestones: 1.0, 0.19.0 Sep 28, 2016

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 5, 2017

We have discussed this on some several places, recently also in the PR about the errors module (#15541) and previously when pandas.types was created (#13147 (comment) and comments below).
For the exceptions, we now have a top-level pandas.errors module, for type-related we have a private pandas.types and public pandas.api.types.

With the risk of keeping a settled discussion alive, I would still like to see this discussed some more.

To repeat the comment of @shoyer:

Ultimately, I don't think this matters too much, but given a choice between:

  1. pandas.* indicates private namespace, pandas.api.* includes a public namespace, and
  2. pandas.* includes public namespace, pandas._internals (and so forth) indicates private namespace

I think the later (option 2) is more user friendly and consistent with the top level namespace pandas already being public facing.

On the other hand, it's true that option 2 is a little harder to transition to (for us), so maybe it's not worth the trouble.

We have recently added new submodules to the top-level namespace (pandas.types, pandas.indexes, pandas.formats, code that was before mainly in pandas.core. ..), but towards the user these should be regarded as private. IMO this is not the good direction. For new code / refactors, I think we should stick to: only what is public can go in the top-level namespace.

Let's take the example of types. We now have a private pandas.types and a public pandas.api.types, so the less nested location is private, which I don't find ideal.
If we use the approach of pandas.api.types, I think we should at least make the actual implementation in pandas._types, so you don't have the two publicly visible locations.
The other option is to make put now what is in pandas.api.types, in pandas.types (while the exact implementation and organization in the different submodules of pandas.types remains a private implementation detail).

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 5, 2017

Here's my rationale:

We have 3 stakeholders.

  • users: regular folks who do import pandas as pd
  • library builders / advanced users: folks who need more 'library' code
  • implementation

implementation: this may seem strange to mention, but pandas is a large library. We need the flexibility to move code around, expand modules, rename and generally get stuff done internally, with out having to constantly deprecate things. This would be extremely disruptive to the public (not to mention burdensome).

For example. pandas.types exists with several sub-modules that are logically defined, simply named and concise. There is not reason a for 'users' or 'library' writers to access this at all. It exists simply to organize the code. Sure they can reach in if they really want, but we want to discourage this. Making it pandas._types would surely do this, but that is an ugly name and since this is pure python code a bit misleading.

'users': they are served by the top-level pd namespace, supplemented by pandas.errors. Everything that one could need is here (or as a method).

'library': these folks need code generally for instrospection that is just too cumbersome to be in a 'user' namespace. Things like is_integer_dtype is a canonicial way to introspect things in pandas. We use these internally and provide them as an external API to library writers.

I recently added pandas.api.lib with infer_dtype, which is a useful routine, again to external library writers. This is (or another sub-module of pandas.api) can also be the home of a c-api / development api.

So, everything as public is just fine. If someone wants to reach in and use a routine from pandas.format.* or whatever they should have no expectations that this is stable across releases.

For simplicity from a user AND a library pov, its much better to have a single namespace pandas.api that say: hey they routines are what I should use, they are documented and won't change

@shoyer

This comment has been minimized.

Member

shoyer commented Apr 5, 2017

I don't see a strong division between "users" and "advanced users". There is a continuum of use cases. Certainly we should group more things into submodules: this makes it easier to find related functionality.

One choice that would make this super clear is move all internal stuff into a top level submodule called internals, e.g., pandas.internals.core, pandas.internals.indexes, etc. Then if you're importing from pandas.internals.something, you know it's an internal routine. Otherwise, as a user it's not obvious whether pandas.something exists to as a namespace to logically group together the something routines or if it's an internal implementation detail.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 5, 2017

For example. pandas.types exists with several sub-modules that are logically defined, simply named and concise. There is not reason a for 'users' or 'library' writers to access this at all. It exists simply to organize the code.

That is not fully what I meant. The pandas.types module could still have different sub-modules, but what I meant was that only the 'top-level' of pandas.types would be public, so we are still free to implement it as we like in the submodules.
And sure, we will change things in the future, but whether we expose the public names in pandas.api.types or in pandas.types (top-level, not submodules) does not really matter for that? In both cases the functions will not actually be defined in that location.

Making it pandas._types would surely do this, but that is an ugly name and since this is pure python code a bit misleading.

This is IMO not misleading at all, it is actually very explicit that it is internal.
And I agree it is a bit uglier, but is only about a few import lines in each pandas file. I don't think it is that problematic there is a bit ugliness there, and IMO worth the clarity (if we use the pandas.api approach).
The internals namespace is also a possiblity, but this is actually more or less what core is currently. So we could use it for that I think (although less explicit in name). But we moved types, formats, indexes out of core .., so this would be putting it back (but keeping the reorganization in submodules).

'users': they are served by the top-level pd namespace, supplemented by pandas.errors. Everything that one could need is here (or as a method).

This is already not fully the case. There are public methods in pandas.io.json, pandas.tseries.offsets, pandas.plotting (when that PR is merged), ..

I recently added pandas.api.lib with infer_dtype,

Slightly of topic, but I personally would rather put this in pandas.types, as it is type related. And the fact that it is in our lib is more an implemenation detail for the user.

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 6, 2017

ok so plan is to:

  • document that pandas.core is private

  • potentially search on github to see if we should be deprecating things more explicity

  • move out of pandas to pandas.core namspace

    • computation
    • formats
    • indexes
    • sparse
    • tools
    • tseries
    • types
    • util (maybe)?
  • going to leave stats (as when we remove the deprecations for things like pd.rolling_mean this will go away anyhow

cc @wesm
@jorisvandenbossche
@TomAugspurger

@jreback jreback modified the milestones: 0.20.0, 1.0 Apr 6, 2017

@jreback jreback self-assigned this Apr 6, 2017

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 6, 2017

I started some search on github, but most of the things on the first pages were just internal imports in people who embedded the full pandas codebase in their repo ... Does anybody know of more advanced github search methods for such things?

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 6, 2017

people who embedded the full pandas codebase in their repo

who the heck does that????

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 6, 2017

Regarding the above list, I am not sure we can easily move tseries, there are too many public functions in that module (offsets and frequencies). So we should think how to organize that one.

For tools, I would take the opportunity to reorganize that a bit. For example add a reshaping submodule for stack/unstack/pivot etc functionality? (just an idea, the current 'tools' name is not very descriptive)

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 14, 2017

I am shortly going to move a bunch of things around. But pandas.io is going to stay (pandas.formats will move).

Can you maybe first give some more details here, so we can discuss some things first before you do the work? (I also have some time this weekend to look at it)

See my questions above (tseries). Further, formats could maybe be moved to io instead of core ?
Regarding util (your questionmark), I would maybe leave util where it is (that feels more logically), but then state that this is also private (together with core) ?

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 14, 2017

quickly did #15997 (stil WIP), moving pandas.formats -> pandas.core.formats. (will rebase after @TomAugspurger #15954

I think easy to move .computation, .types. Let's do that (I can provide a proxy deprecation module if needed). Then go from there.

Further we will then have a big warning somewhere that pandas.core is now considered private.

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 14, 2017

.util yes could leave there where it is.

I think we should also state that .tseries is private as well.

.tools I will also move to .core

.formats belongs in .core its about printing core stuff (not really about io per se).

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 14, 2017

formats belongs in .core its about printing core stuff (not really about io per se).

But if we expose certain aspects of it in io or io.formats, it seems logical to just move it there. It is about printing core objects, yes, but the line between output and repr is not that clear (to_html, to_string are basically used for both)

@jreback

This comment has been minimized.

Contributor

jreback commented Apr 14, 2017

could move to io just as well. ok, i'll do that after @TomAugspurger merges then.

any problem with the others that I mentioned?

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented Apr 14, 2017

I think we should also state that .tseries is private as well.

Then we first need a assessment on what exactly is public in tseries (as there certainly are public objects there) and how we expose this.

@jreback

This comment has been minimized.

Contributor

jreback commented May 1, 2017

what is the purpose of pandas.plotting?

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 1, 2017

It holds the non-method plotting functions like scatter_matrix

@jreback

This comment has been minimized.

Contributor

jreback commented May 1, 2017

so what is the argument for having that as a top level module then?

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented May 1, 2017

so what is the argument for having that as a top level module then?

plotting or types? Assuming you mean types, Joris is saying pandas.types is more consistent with pandas.errors, pandas.plotting, pandas.testing.

@jreback

This comment has been minimized.

Contributor

jreback commented May 2, 2017

actually no, I mean I don't really see the purpose of the top-level pandas.plotting.

further pandas.api.* will be expanded in the future. I see no reason to change this.

@jreback

This comment has been minimized.

Contributor

jreback commented May 2, 2017

the more I look at this the more I don't think we need pandas.plotting at all. These are all methods on Series or DataFarme, with the exception of scatter_matrix.

Why do we need this again?

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented May 2, 2017

scatter_matrix, radviz, andrews_curves, parallel_coordinates, and autocorrelation_plot are all available in pandas.plotting and not on DataFrame.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 2, 2017

pandas.api.* will be expanded in the future.

What do we envisage that will end up in there, apart from type related things?

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 2, 2017

@jreback To come back to the hashing utilities, I didn't respond yet to this (#13634 (comment)):

I think b) is fine for now. These should really be internal with a couple of things exposed via pandas.api.lib (e.g. just hash_pandas_object and hash_array), so could do that as well, but b) allows us to do that later

We now exposed the public hashing functions in pandas.util.hashing, which means we should keep it (the public exposure) there, and not later move it somewhere else.
If we are not sure the implementation will be kept in util.hashing, we could also expose the public method just in pandas.util instead of pandas.util.hashing

@jreback

This comment has been minimized.

Contributor

jreback commented May 3, 2017

both of those are not great
this is purely a library exposed for other developers (dask) so it's public to them
if it's moved we can adjust dask but ituis is not public in any other way

@TomAugspurger

This comment has been minimized.

Contributor

TomAugspurger commented May 3, 2017

Thoughts from anyone else? I think the proposals on the table are

Option 1 (current master);

  • pd.api.types
  • pd.errors
  • pd.io (.formats, .json, etc.)
  • pd.plotting
  • pd.testing

Option 2 (move types up):

  • pd.types
  • pd.errors
  • pd.io (.formats, .json, etc.)
  • pd.plotting
  • pd.testing

Option 3 (maybe? Not sure if this is on the table, but for symmetry)

  • pd.api.types
  • pd.api.errors
  • pd.api.io (.formats, .json, etc.)
  • pd.api.plotting
  • pd.api.testing
@shoyer

This comment has been minimized.

Member

shoyer commented May 3, 2017

I like option 2.

@jreback

This comment has been minimized.

Contributor

jreback commented May 3, 2017

master is what it is
so -1 on any changes

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 3, 2017

@jreback Can you answer my comment above? #13634 (comment) (if you have an idea of course)

I am ok with option 1 / master (pandas.api.types), if we have an idea of what else we would put in pandas.api

@jreback

This comment has been minimized.

Contributor

jreback commented May 3, 2017

@jreback Can you answer my comment above? #13634 (comment) (if you have an idea of course)

yes I think we should move hashing to pandas.api.lib (or something like that) and import the 2 functions (and I'll update dask as well).

@jreback

This comment has been minimized.

Contributor

jreback commented May 3, 2017

pandas.util is also a bunch of random code. these should also not really be exposed and should really go in pandas.core.util

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 3, 2017

I think we should move hashing to pandas.api.lib

The objection I currently have against pandas.api.lib, is that I think that the fact that some functionality (eg in this case the hashing functions) are implemented in _lib is an implementation detail (if they were not in compiled code, they would not be there), and IMO this implementation detail is not need to 'leak' to the user API.

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 3, 2017

Another example is the cache_readonly decorator. I was thinking to use this one in geopandas, so it would be nice if there is a public exposure of it. It would also fit in util or pandas.api.lib, depending on this discussion.

Util is indeed a bunch of random things, but that is in its name, it is a bunch of various useful utility functions that don't really belong somewhere else. I think something like cache_readonly fits in that description.

@jreback

This comment has been minimized.

Contributor

jreback commented May 3, 2017

ok what i could do is make everything in until private but leave the top level namespace to export certain functions as a flat namespace

@jorisvandenbossche

This comment has been minimized.

Member

jorisvandenbossche commented May 3, 2017

make everything in until private but leave the top level namespace to export certain functions as a flat namespace

yes, that would be a nice idea to expose certain things (like eg the cache_readonly) in the util submodule

@jreback

This comment has been minimized.

Contributor

jreback commented May 3, 2017

ok have a look for anything else
maybe Appender, Substitution

jreback added a commit to jreback/pandas that referenced this issue May 4, 2017

jreback added a commit to jreback/pandas that referenced this issue May 4, 2017

CLN: make submodules of pandas.util private
xref pandas-dev#13634

CLN: move pandas.util.*

validators, depr_module, decorators, print_versions

to _ leading

jreback added a commit to jreback/pandas that referenced this issue May 4, 2017

CLN: make submodules of pandas.util private
xref pandas-dev#13634

CLN: move pandas.util.*

validators, depr_module, decorators, print_versions

to _ leading

jreback added a commit to jreback/pandas that referenced this issue May 4, 2017

CLN: make submodules of pandas.util private
xref pandas-dev#13634

CLN: move pandas.util.*

validators, depr_module, decorators, print_versions

to _ leading

jreback added a commit that referenced this issue May 4, 2017

CLN: make submodules of pandas.util private (#16223)
* CLN: make submodules of pandas.util private

xref #13634

CLN: move pandas.util.*

validators, depr_module, decorators, print_versions

to _ leading

* CLN: move remaning extensions to _libs

* pandas.tools.hashing FutureWarning -> DeprecationWarning

pcluo added a commit to pcluo/pandas that referenced this issue May 22, 2017

CLN: make submodules of pandas.util private (pandas-dev#16223)
* CLN: make submodules of pandas.util private

xref pandas-dev#13634

CLN: move pandas.util.*

validators, depr_module, decorators, print_versions

to _ leading

* CLN: move remaning extensions to _libs

* pandas.tools.hashing FutureWarning -> DeprecationWarning

stangirala added a commit to stangirala/pandas that referenced this issue Jun 11, 2017

CLN: make submodules of pandas.util private (pandas-dev#16223)
* CLN: make submodules of pandas.util private

xref pandas-dev#13634

CLN: move pandas.util.*

validators, depr_module, decorators, print_versions

to _ leading

* CLN: move remaning extensions to _libs

* pandas.tools.hashing FutureWarning -> DeprecationWarning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment