API: Define API for pandas plotting backends #26747

datapythonista · 2019-06-09T00:23:15Z

In #26414 we split the pandas plotting module into a general plotting framework able to call different backends, and the current matplotlib backend. The idea is that other backends can be implemented in a simpler way, and be used with a common API by pandas users.

The API defined by the current matplotlib backend includes the objects listed next, but this API can probably be simplified. Here is the list with questions/proposals:

Non-controversial methods to keep in the API (They provide the Series.plot(kind='line')... functionality):

LinePlot
BarPlot
BarhPlot
HistPlot
BoxPlot
KdePlot
AreaPlot
PiePlot
ScatterPlot
HexBinPlot

Plotting functions provided in pandas (e.g. pandas.plotting.andrews_curves(df))

andrews_curves
autocorrelation_plot
bootstrap_plot
lag_plot
parallel_coordinates
radviz
scatter_matrix
table

Should those be part of the API, so that other backends also implement them? Would it make sense to convert them to the .plot format (e.g. DataFrame.plot(kind='autocorrelation')...)? Or does it make sense to keep them out of the API, or move them to a third-party module?

Redundant methods that can possibly be removed:

hist_series
hist_frame
boxplot
boxplot_frame
boxplot_frame_groupby

In the case of boxplot, we currently have several ways of generating a plot (calling mainly the same code):

1. DataFrame.plot.box()
2. DataFrame.plot(kind='box')
3. DataFrame.boxplot()
4. pandas.plotting.boxplot(df)

Personally, I'd deprecate number 4, and for number 3, deprecate it or at least not require a separate boxplot_frame method in the backend, but try to reuse BoxPlot (the comments on number 3 also apply to hist).

For boxplot_frame_groupby, didn't check in detail, but not sure if BoxPlot could be reused for this?

Functions to register converters:

register
deregister

Do those make sense for other backends?

Deprecated in pandas 0.23, to be removed:

tsplot

To see what each of these functions does in practice, this notebook by @liirusuk may be useful: https://github.com/python-sprints/pandas_plotting_library/blob/master/AllPlottingExamples.ipynb

CC: @pandas-dev/pandas-core @tacaswell, @jakevdp, @philippjfr, @PatrikHlobil

TomAugspurger · 2019-06-09T12:01:20Z

I think keep things like autocorrelation out of the swappable backend API.

I think we’ve left things like df.boxplot and hist around because they have slightly different behavior than the .plot API. I wouldn’t recommend making them part of the backend API.

TomAugspurger · 2019-06-09T12:05:14Z

Here’s my start on a proposed backend API from a few months ago: TomAugspurger@b07aba2

datapythonista · 2019-06-09T12:17:20Z

I think it's worth mentioning that at least hvplot (didn't check the rest) does already provide the functions like andrews_curves, scatter_matrix, lag_plot,...

Maybe, if we don't want to force all backends to implement those, we can check if the selected backend implements them, and default to the matplotlib plots otherwise?
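A minimal sketch of that fallback idea, using stand-in objects rather than real backend modules. The names `resolve_plot_function`, `matplotlib_like` and `minimal_backend` are hypothetical and only illustrate the lookup order; this is not how pandas implements it.

```python
from types import SimpleNamespace


def resolve_plot_function(name, selected_backend, default_backend):
    """Return `name` from the selected backend, else from the default one."""
    func = getattr(selected_backend, name, None)
    if func is None:
        # The selected backend does not implement this plot: fall back.
        func = getattr(default_backend, name)
    return func


# Hypothetical stand-ins for real backend modules.
matplotlib_like = SimpleNamespace(
    plot=lambda data: "mpl plot",
    andrews_curves=lambda data: "mpl andrews_curves",
)
minimal_backend = SimpleNamespace(plot=lambda data: "minimal plot")

print(resolve_plot_function("plot", minimal_backend, matplotlib_like)([1, 2]))
print(resolve_plot_function("andrews_curves", minimal_backend, matplotlib_like)([1, 2]))
```

With this lookup order, a backend that only implements `plot` would still leave the less common plots usable, at the cost of silently mixing two plotting libraries.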

I assumed boxplot and hist behaved exactly the same, but just had shortcuts Series.hist() for Series.plot.hist(). The "shortcut" shows the plot grid, but other than that I haven't seen any difference.

TomAugspurger · 2019-06-10T03:45:09Z

IMO, the main value of this option is the `.plot` namespace. If users want hvplot's Andrew's curve plot, they should import the function from hvplot and pass the dataframe there.

datapythonista · 2019-06-10T09:21:03Z

I think that makes sense, but if we do that, I think we should move them to pandas.plotting.matplotlib.andrews_curves, instead of pandas.plotting.andrews_curves.

@TomAugspurger I need to check in more detail, but I think the API you implemented in TomAugspurger@b07aba2 is the one that makes the most sense. I'll work on it once I finish #26753. I'll also experiment with whether it's feasible to move andrews_curves, scatter_matrix... to the .plot() syntax; I think that will make things simpler and easier for everyone (us, third-party libraries, and users).

jakevdp · 2019-06-10T15:42:04Z

What's the intention here regarding extra kwargs passed to plotting functions? Should additional backends attempt to duplicate the functionality of all matplotlib-style plot customizations, or should they allow keywords to be passed that correspond to those used by the particular backend?

The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back).

The second option is not as nice from the perspective of interchangeability, but would allow other backends to be added with a more reasonable set of expectations.

TomAugspurger · 2019-06-10T15:51:51Z

I think that's up to the backend on what they do with them. Achieving 100% compatibility across backends isn't really feasible, since the return type isn't going to be a matplotlib Axes anymore. And if we aren't compatible on the return type, I don't think backends should bend over backwards to try to handle every possible keyword argument. So I think pandas should document that `**kwargs` will be passed through to the underlying plotting engine, and they can do whatever they please with them.
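A small sketch of what "pass `**kwargs` through and let the backend decide" could look like. Both functions are hypothetical: `plot` stands in for the pandas-side dispatcher and `toy_backend_plot` for a third-party backend that only understands a couple of keywords.

```python
def toy_backend_plot(data, kind, **kwargs):
    """Hypothetical backend: interprets only the keywords it knows about."""
    known = {"color", "title"}
    used = {k: v for k, v in kwargs.items() if k in known}
    ignored = sorted(set(kwargs) - known)
    return {"kind": kind, "n_points": len(data), "options": used, "ignored": ignored}


def plot(data, kind="line", backend=toy_backend_plot, **kwargs):
    """Pandas-side dispatcher (sketch): no validation of **kwargs here."""
    return backend(data, kind=kind, **kwargs)


result = plot([1, 2, 3], kind="bar", color="red", linestyle="--")
print(result)
```

Whether unknown keywords are ignored, warned about, or rejected would be the backend's decision under this approach.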

ghost · 2019-06-14T21:39:17Z

I'm sorry if this is a stupid question, but if you define a plotting "API" which is basically a group of canned plots, wouldn't every backend produce more or less the same output? What new capability is this meant to enable? Something like a pandas-to-vega exporter, perhaps?

jakevdp · 2019-06-14T23:34:27Z

I don't think it's correct to say that every backend produces more or less the same output.

For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.

On the other hand, bokeh, altair, et al. are great for interactive charts, but aren't quite as mature as matplotlib for static charts.

Being able to produce both with the same API would be a big win.

tacaswell · 2019-06-15T17:43:17Z

The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back).

and also pins Matplotlib down even more than we already are API wise. I think it makes sense for pandas to declare what style knobs it wants to expose and expect the backend implementations to sort out what that means. This may mean not blindly passing **kwargs through and instead ensuring that the returned objects are "the right thing" for the given backend to be able to do after-the-fact style customization.

ghost · 2019-06-15T21:28:42Z

For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.

Thanks @jakevdp, yes, supporting interactive charts is a good goal.

Before things go too far down this particular avenue, here's an alternative solution.

Instead of proclaiming the pandas plotting API to now be a specification, and asking viz packages to implement it specifically, why not generate an intermediate representation (like a vega JSON file) of the plot, and encourage backends to target that as their input.

Advantages include:

Not being tied to the expressive power of a reified pandas API, which wasn't designed as a specification.
The work done by plotting packages to support pandas, becomes available to other pydata packages which generate IR.
Promoting a common language for interchange visualization in the pydata space
Which makes new tool more powerful because more widely applicable
Which makes the effort of writing them more reasonable. Basically, improved incentives.

Vega/Vega-Lite, as a modern, established, open, JSON-based viz specification language, with several man-years put into its design and implementation and existing tools built around it, seems like it was created expressly for this purpose. (just please don't).

You know, frontend->IR->backend, like compilers are designed.
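As a rough illustration of the frontend->IR->backend idea, the sketch below turns a plain sequence of numbers into a Vega-Lite-style dictionary. The function name `series_to_vega_lite` and the field names `index` and `value` are assumptions made for this example; the `$schema` key is left out to avoid pinning a version, and nothing here is an existing pandas API.

```python
import json


def series_to_vega_lite(values, kind="line"):
    """Build a Vega-Lite-style spec (as a dict) for a sequence of numbers."""
    marks = {"line": "line", "bar": "bar", "scatter": "point", "area": "area"}
    if kind not in marks:
        raise ValueError(f"unsupported kind: {kind!r}")
    return {
        # Inline data: one record per observation.
        "data": {"values": [{"index": i, "value": v} for i, v in enumerate(values)]},
        "mark": marks[kind],
        "encoding": {
            "x": {"field": "index", "type": "quantitative"},
            "y": {"field": "value", "type": "quantitative"},
        },
    }


spec = series_to_vega_lite([1, 3, 2], kind="bar")
print(json.dumps(spec, indent=2))
```

Any renderer that understands the JSON could then draw the chart, which is the interchange property this proposal argues for.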

TomAugspurger · 2019-06-15T22:09:16Z

At least three packages already implement the API. All pandas needs to do is offer an option for changing the backend and document its use, which seems like a good bang for our buck.

datapythonista · 2019-06-21T15:02:32Z

We now merged #26753, and the plotting backend can be changed from pandas. When we split the matplotlib code we left the SeriesPlotMethods and FramePlotMethods in the pandas (not matplotlib) side. That was mainly to leave the docstrings in the pandas side.

But I see that what backends did was to reimplement those classes. So, currently we expect the backends to have one class per plot (e.g. LinePlot, BarPlot), but instead they implement a class with one method per plot (e.g. hvPlot, or the same names as pandas for pdvega).

What I think makes sense, at least as a first version, is that we implement the API as hvplot and pdvega did. I'd just create an abstract class in pandas, that backends inherit from.

If that makes sense for everyone, I'll start by creating the abstract class and adapting the matplotlib backend we have in pandas, and once this is done, we adapt hvplot and pdvega (the changes there should be quite small).
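A minimal sketch of the abstract-class idea, assuming a hypothetical base class named `BasePlotMethods` with one method per plot kind. Only two kinds are shown, and the class and method names are illustrative, not the pandas API.

```python
from abc import ABC, abstractmethod


class BasePlotMethods(ABC):
    """Hypothetical base class: one method per plot kind."""

    def __init__(self, data):
        self._data = data

    def __call__(self, kind="line", **kwargs):
        # Dispatch kind='line' to self.line(...), rejecting private names.
        method = None if kind.startswith("_") else getattr(self, kind, None)
        if not callable(method):
            raise ValueError(f"unsupported plot kind: {kind!r}")
        return method(**kwargs)

    @abstractmethod
    def line(self, **kwargs):
        ...

    @abstractmethod
    def bar(self, **kwargs):
        ...


class ToyPlotMethods(BasePlotMethods):
    """Stand-in for what a backend would provide."""

    def line(self, **kwargs):
        return ("line", list(self._data), kwargs)

    def bar(self, **kwargs):
        return ("bar", list(self._data), kwargs)


print(ToyPlotMethods([1, 2])(kind="bar", color="red"))
```

A backend that forgets to implement one of the abstract methods fails at instantiation time rather than when a user first requests that plot.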

Thoughts?

philippjfr · 2019-06-21T15:16:46Z

What I think makes sense, at least as a first version, is that we implement the API as hvplot and pdvega did. I'd just create an abstract class in pandas, that backends inherit from.

I think that on balance this approach will be cleaner. I can't speak to other plotting backends but at least in hvPlot different plot methods share quite a bit of code, e.g. scatter, line and area are largely analogous, and I'd prefer not to rely on subclassing to share code between them. Additionally, I think different backends should have the option to add additional plot types and exposing those as additional public methods seems like the simplest, most natural approach.

datapythonista · 2019-06-21T15:29:58Z

Just to make sure I understand, when you say "I'd prefer not to rely on subclassing to share code between them", you mean like in class LinePlot(MPLPlot), right? And not that you think it's a bad idea to inherit from an abstract base class?

I think I'm +1 on letting backends define plot types not in pandas. But I probably won't implement it right now. We're planning to release pandas in around one week. And I think this will require a bit more thinking than blindly calling the methods of backends if the user provides kind='foo' and the backend provides the method foo (for example, parameter validation, or the fact that some kinds would be in the documentation and some not).

philippjfr · 2019-06-21T15:36:23Z

Just to make sure I understand, when you say "I'd prefer not to rely on subclassing to share code between them", you mean like in class LinePlot(MPLPlot), right? And not that you think it's a bad idea to inherit from an abstract base class?

Yes, that's right. More concretely I'd prefer not to have to do this kind of thing:

class MPL1dPlot(MPLPlot):

    def _some_shared_method(self, ...):
        ...

class LinePlot(MPL1dPlot):
    ...

class AreaPlot(MPL1dPlot):
    ...

Sorry if that was not clear.
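For contrast, here is a sketch of the alternative being preferred: a single object with one public method per plot kind, where similar kinds share a private helper instead of a class hierarchy, and where a backend can expose extra kinds as extra methods. `ToyPlot`, `_xy_chart` and `violin` are hypothetical names and do not come from hvPlot.

```python
class ToyPlot:
    """Hypothetical backend object: one public method per plot kind."""

    def __init__(self, data):
        self._data = list(data)

    def _xy_chart(self, mark, **kwargs):
        # Shared implementation for the plot kinds that differ only by mark.
        return {"mark": mark, "points": list(enumerate(self._data)), "options": kwargs}

    def line(self, **kwargs):
        return self._xy_chart("line", **kwargs)

    def scatter(self, **kwargs):
        return self._xy_chart("point", **kwargs)

    def area(self, **kwargs):
        return self._xy_chart("area", **kwargs)

    def violin(self, **kwargs):
        # Extra, backend-specific plot type not defined by pandas.
        return {"mark": "violin", "values": sorted(self._data), "options": kwargs}


toy = ToyPlot([3, 1, 2])
print(toy.line(color="red"))
print(toy.violin())
```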

jorisvandenbossche · 2019-06-26T05:21:42Z

Very much in favor of a simpler API that is publicly exposed as the single function instead of the classes as now proposed in #27009.

General question/remark on how the backend option now works. Assume I am the pdvega developer and make this backend available. That means that if users do pd.options.plotting.backend = 'pdvega', that the pdvega library needs to have a top-level plot function?
1) As a library author, that's not necessarily the function you want to expose publicly (from the library's point of view, the top-level plot function is not necessarily the API that you want your users to call directly), and 2) in this case you might actually want to be able to do pd.options.plotting.backend = 'altair' (if the altair developers are fine with that).
So basically my question is: does there need to be an exact 1:1 mapping between the backend name and what is imported? (This is currently needed, since pandas simply imports the provided backend string.)

EDIT: I see that actually something similar was discussed in the PR #26753

datapythonista · 2019-06-26T08:43:47Z

If we make the decision that pandas doesn't know/limit which backends can be used (which I'm strongly in favor of making), we need to decide on how/what to call in the backends.

What has been implemented and proposed in the PR I'm working on is that the option plotting.backend is a module name (can be pdvega, altair, altair.pandas, or whatever), and that module must have a public plot function, which is what we will call.
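The mechanism described above can be sketched with the standard library alone. The module name `toy_pandas_backend` and the helper `get_plot_backend` are made up for this example, and the fake module is registered in `sys.modules` only so the snippet runs without installing anything.

```python
import importlib
import sys
import types

# Fake third-party backend module so the sketch runs without extra installs.
fake = types.ModuleType("toy_pandas_backend")
fake.plot = lambda data, kind, **kwargs: f"{kind} plot of {len(data)} points"
sys.modules["toy_pandas_backend"] = fake


def get_plot_backend(backend_name):
    """Import the module named by the option and check it exposes plot()."""
    module = importlib.import_module(backend_name)
    if not callable(getattr(module, "plot", None)):
        raise ValueError(f"{backend_name!r} does not define a public plot() function")
    return module


backend = get_plot_backend("toy_pandas_backend")
print(backend.plot([1, 2, 3], kind="line"))
```

Because the option value is just an importable module path, a dotted name such as a subpackage would work the same way under this scheme.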

We can consider other options, like if the option is pdvega, we import pdvega.pandas, or we can name the function plot_pandas or whatever. I think the proposed way is the simplest, but if there are other proposals that make more sense, I'm happy to change it.

Another discussion is if we want to force the users to import the backends manually:

import pandas
import hvplot

pandas.Series([1, 2, 3]).plot()

If we do that, the modules can register themselves, they can also register aliases (so set_option can understand other names than the name of the module). They can also implement custom functions or machinery (e.g. context managers) to plot with certain backends,... Personally I think the simpler we keep things the better.

And while it could be nice to do pandas.set_option('plotting.backend', 'bokeh') to plot in bokeh, I think that implies two things I personally don't like:

pandas.set_option('plotting.backend', 'bokeh') will only work if import pandas_bokeh has been called, and will be confusing for the users.
It also implies that there is only one module to plot in bokeh. Which doesn't need to be true, and gives the wrong impression to users that you're plotting directly with bokeh, and not with a pandas plotting backend for bokeh.
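The registration-on-import alternative, and the confusion it can cause, can be sketched as follows. The registry and the functions `register_backend` and `get_backend` are hypothetical; the `register_backend` call stands in for what a package such as pandas_bokeh might do when imported.

```python
_BACKENDS = {}  # hypothetical registry: name or alias -> plot function


def register_backend(name, plot_func, aliases=()):
    """Called by a backend package at import time (hypothetical API)."""
    for key in (name, *aliases):
        _BACKENDS[key] = plot_func


def get_backend(name):
    """Look up a registered backend, failing loudly if it was never imported."""
    try:
        return _BACKENDS[name]
    except KeyError:
        raise ValueError(
            f"backend {name!r} is not registered; has its package been imported?"
        ) from None


# Before the backend package is imported, the lookup fails.
try:
    get_backend("bokeh")
except ValueError as exc:
    print(exc)

# Simulates what `import pandas_bokeh` could do on import.
register_backend("pandas_bokeh", lambda data, kind: f"bokeh {kind}", aliases=("bokeh",))
print(get_backend("bokeh")([1, 2, 3], kind="line"))
```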

jorisvandenbossche · 2019-06-30T22:20:24Z

@datapythonista thanks for the detailed answer. I am fine with keeping it now as is for the initial release (possibility for alias can always be added later).

If users want hvplot's Andrew's curve plot, they should import the function from hvplot and pass the dataframe there.

+1, I would also not expose all the additional plotting functions through the backend.

But about moving them to pandas.plotting.matplotlib, that seems like an unnecessary backwards incompatible break to me (assuming you meant not only moving the implementation).

jakevdp · 2019-07-01T20:35:51Z

pandas.set_option('plotting.backend', 'bokeh') will only work if import pandas_bokeh has been called, and will be confusing for the users.

If we use entrypoints to register extensions, then this does not have to be the case: having the package installed on the system will register the entrypoint and make it visible to pandas. For example, this is what Altair uses to detect various renderers that the user might have installed.
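A sketch of entry-point discovery using `importlib.metadata` from the standard library. The group name `pandas_plotting_backends` is an assumption made for this example (the thread does not settle on one), and `find_backends` and `load_backend` are hypothetical helpers.

```python
from importlib.metadata import entry_points

GROUP = "pandas_plotting_backends"  # assumed group name, not fixed by this thread


def find_backends(group=GROUP):
    """Map entry point names to entry point objects for the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        selected = eps.select(group=group)
    else:  # Python 3.8 / 3.9 return a dict of lists
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_backend(name, group=GROUP):
    """Import the object behind an installed entry point, without a prior import."""
    backends = find_backends(group)
    if name not in backends:
        raise ValueError(f"no installed package registers backend {name!r}")
    return backends[name].load()


print(sorted(find_backends()))
```

A package would opt in by declaring an entry point in that group in its packaging metadata, so installing it is enough for the name to become visible.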

jakevdp · 2019-07-01T20:52:36Z

Also, for what it's worth, once this goes in I think I'd probably deprecate pdvega and move the relevant code over to a new package named pandas_altair or something similar.

datapythonista · 2019-07-06T20:45:09Z

Just to explain a bit why things are the way they are now. It's relevant because I'm not quite sure how to implement the changes you propose, or not exposing things in general. Not saying here that it can't be done in a different way, it's just to enrich the discussion.

The first decision was to move all the code using matplotlib to a separate module (pandas.plotting._matplotlib). By doing that, that module somehow became the matplotlib backend.

Everything that was public in pandas.plotting has been kept public there. And to make things as simple as possible, every one of these functions, once called, loads the backend (via a call to _get_plot_backend) and calls the function there.

The public API for the user has no change at all, users still have the same methods and functions available. We're not exposing anything new.

As I understand things, if we decide that an existing plot like andrews_curves is not delegated to the backend, this implies that instead of getting the backend selected by the user, we will still select the matplotlib backend. Given that at least hvplot already implements andrews_curves, I personally don't see the point. If the user wants an andrews_curves plot in matplotlib, it is as easy as not changing the backend (or setting it again if it's been changed). So, with the change we'd simply be making users' lives much harder, by adding extra complexity to pandas.

If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream (I guess that's one of the reasons?), maybe we can default to the matplotlib backend for anything that is missing in the selected backend?

About delegating any unknown kind of plot to the backend, I'm -1 on doing it right now. Surely it can make sense eventually. But I think having several plot kinds documented in pandas, and having extra ones that we don't document, feels a bit hacky. I think it can wait for the next version, after we have feedback on how having different backends works for users, and we have more time to discuss and analyze in detail.

jorisvandenbossche · 2019-07-16T20:28:22Z

If the user wants an andrews_curves plot in matplotlib, it is as easy as not changing the backend (or setting it again if it's been changed). So, with the change we'd simply be making users' lives much harder, by adding extra complexity to pandas.

I don't think we would be making the user's life harder. Instead of importing it from pandas.plotting, if they want hvplot's version, they can simply import it from there. That is not possible for the DataFrame.plot method, as it is defined on the object. For me that is the main reason for the plotting backend.

If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream

For me it is not about being nice or that implementing everything would be required (it is totally fine if a backend does not support all plotting types, IMO), but rather an unnecessary expansion of the plotting backend API, which also ties ourselves to it.
If we would restart pandas from scratch, I don't think those misc plotting types would be included. But with the plotting backend API we are in some way starting something new.

Any other opinions about this?

TomAugspurger · 2019-07-16T21:20:53Z

Agreed with @jorisvandenbossche.

Just to make sure this isn't lost, I think @jakevdp's suggestion to use setuptools' entry points is worth considering to solve the import order registration issue: #26747 (comment)

datapythonista · 2019-07-17T12:05:30Z

@jorisvandenbossche how would you change that in the code? Instead of getting the backend defined in the settings for those methods, get the matplotlib backend? I think this is wrong conceptually, but I'm ok with it if there is agreement. Anything that reverts the decoupling of the matplotlib code from the rest I'm -1.

Since you mention that in a pandas from scratch we wouldn't include those plots, should we deprecate them? I'm +1 on moving all the plots that are not methods of Series or DataFrame to a third-party package. Or if any is important enough to be kept, to move it to be called with .plot() as the others.

jreback · 2019-07-17T12:11:23Z

i would deprecate the non standard plots in pandas
and move to an external package

TomAugspurger · 2019-07-17T12:16:39Z

Joris is offline for a bit.

I think when we’ve discussed this in the past, his and my position on these is to just leave them untouched until they become a maintenance burden.

datapythonista · 2019-07-17T12:43:20Z

Just so we are on the same page, this is a summary of what we have, and my understanding of the state of the discussion:

Used as methods of Series and DataFrame (afaik we're all happy to keep them as they are, delegated to the selected backend):

PlotAccessor
boxplot_frame
boxplot_frame_groupby
hist_frame
hist_series

Other plots (under discussion whether they should be deprecated, delegated to the matplotlib backend, or delegated to the selected backend):

boxplot
scatter_matrix
radviz
andrews_curves
bootstrap_plot
parallel_coordinates
lag_plot
autocorrelation_plot
table

Other public stuff in pandas.plotting (under discussion too):

plot_params
register_matplotlib_converters
deregister_matplotlib_converters

For the Other plots section, I personally think they are a maintenance burden at this point, and I'm +1 on moving them out of pandas, and deprecating them in 0.25.

For the converters and the other stuff, what we have now is surely not correct, since register_matplotlib_converters delegates to the selected backend, which may not be matplotlib. The options that I guess we can consider are:

Rename them to register_converters/deregister_converters, deprecate the current ones, and keep delegating to the backend
Move them from pandas.plotting to pandas.plotting.matplotlib (which would imply making the matplotlib backend public, so I wouldn't)
Leave them as they are, and delegate to the matplotlib backend instead of the selected backend (I see this more as a hack than a good design decision, I'd prefer to keep pandas.plotting agnostic of which backends exist)

TomAugspurger · 2019-07-17T13:32:09Z

For the Other plots section, I personally think they are a maintenance burden at this point, and I'm +1 on moving them out of pandas, and deprecating them in 0.25.

How do you find the "other plots" to be a maintenance burden? Looking at the history for the "misc" plots: https://github.com/pandas-dev/pandas/commits/0.24.x/pandas/plotting/_misc.py, we have ~10-15 commits since 2017. The majority are global cleanups applied to the entire codebase (so a small marginal burden). I only see 1-2 commits changing docs, and no commits changing functionality.

Rename them to register_converters/deregister_converters, deprecate the current ones, and keep delegating to the backend

I don't think this would make sense. There are matplotlib-specific converters that we've written for matplotlib. Other backends won't have them. It probably shouldn't be part of the backend API.

datapythonista · 2019-07-17T13:44:50Z

I didn't mean those plots are a burden because of the amount of maintenance they've needed in the last months or years, but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us.

Regarding the converters, I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases. But it doesn't seem a problem if they don't, and those functions do nothing for some or all of the other backends. I'm also ok with option 2, but I don't find it as neat.

TomAugspurger · 2019-07-17T13:49:00Z

but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us.

They're already somewhat inconsistent with DataFrame.plot, though. The name "misc" implies that :) Does having a swappable backend make that any worse? To the extent that it's worth the churn on user code? I don't think so.

I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases.

I don't think so. The point of those converters is to teach matplotlib about pandas objects. Libraries implementing the backend won't have that problem, since they already depend on pandas.

datapythonista · 2019-07-17T14:13:06Z

Personally I think about it mainly in terms of managing complexity. Having a standard plotting API that is delegated to the backend via a single API is easy to understand, and to maintain. Users and maintainers just need to learn that there is a plot function with a kind argument, and that this will be executed in the selected backend.

Having a set of heterogeneous plots that, besides not following the same API, use a backend that is not the one selected for the other plots but the matplotlib one, adds too much complexity for everyone IMHO.

And the cost of moving them seems small to me, my guess is that not a big proportion of our users even know about those plots. And for the ones who do, they'll just need to install an extra conda package and use import pandas_plotting; pandas_plotting.andrews_curves(df) instead of pandas.plotting.andrews_curves(df).

To me it seems there is a lot to win, at a small cost, but of course it's just an opinion.

TomAugspurger · 2019-07-17T14:41:42Z

Can we document that the swappable backend is just for Series/DataFrame.plot? That seems like a pretty simple rule.

datapythonista · 2019-07-17T15:07:39Z

Feels like a hack that adds unnecessary complexity to me; I don't think explaining it in the documentation makes it less counter-intuitive.

But anyway, not a big deal. If that's the preferred option, this is how I'd implement it, at least the increase in code complexity is minimal: #27432

jakevdp · 2019-07-19T18:16:25Z

Looking more closely at this now: if I understand correctly, the way that the plotting backend will be set is using:

pd.set_option('plotting.backend', 'name_of_module')

My understanding, then, is that if I want to make the following work:

pd.set_option('plotting.backend', 'altair')

then I will need the top-level altair package to define all the functions in https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py. I would prefer not to pollute Altair's top-level namespace with all these additional APIs that are not meant to actually be used by Altair users. In fact, I would prefer for altair's pandas extension to live in a separate package, so it's not tied to the release cadence of Altair itself.

If I understand correctly, this means that there's no way for me to make pd.set_option('plotting.backend', 'altair') work correctly without hard-coding the altair package in pandas the way matplotlib is currently hard-coded, is that correct?

pandas/pandas/plotting/_core.py

Lines 1550 to 1551 in f1b9fc1

    
    if backend_str == "matplotlib":
        backend_str = "pandas.plotting._matplotlib"

If so, I would strongly advise rethinking the means by which this API is exposed in third-party packages.

My suggested solution would be to adopt an entrypoint-based framework that would let me, for example, create a package like altair_pandas that registers the altair entrypoint to implement the API. Otherwise users will forever be confused that pd.set_option('plotting.backend', 'altair') doesn't do what they expect.

TomAugspurger · 2019-07-19T18:22:26Z

Agreed. I think entry points are the way to go. I'll prototype something.

datapythonista · 2019-07-19T18:28:41Z

There was a point in time where what you say was mostly correct, but that's not the case anymore.

If you want pandas.options.plotting.backend = 'altair' to work, in 0.25 you just need to have a function altair.plot(). At some point I thought it would be better to call the function pandas_plot instead of simply plot, so that it would be clearly pandas-specific in a backend module that also contains other things, but in the end we didn't make the change.

If creating the plot function in the top level of altair is a problem, we can rename it in a future version, or you can also have altair.pandas.plot, but then users will have to set pandas.options.plotting.backend = 'altair.pandas'.
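To make the module-name mechanism concrete, here is a minimal sketch of what it amounts to: the option value is treated as an importable module path, and that module only has to expose a plot() function. The resolve_backend helper, the fake_viz_backend module, and the plot(data, kind, **kwargs) signature are all assumptions made for illustration; this is not the pandas implementation.

```python
import importlib
import sys
import types


def resolve_backend(option_value):
    """Import the module named by the option and check it exposes plot()."""
    module = importlib.import_module(option_value)
    if not callable(getattr(module, "plot", None)):
        raise ValueError(f"{option_value!r} does not define a plot() function")
    return module


# Hypothetical stand-in for a third-party backend module such as altair.pandas.
fake_backend = types.ModuleType("fake_viz_backend")


def _plot(data, kind="line", **kwargs):
    # A real backend would build and return a chart object here.
    return {"kind": kind, "n_points": len(data), "options": kwargs}


fake_backend.plot = _plot
sys.modules["fake_viz_backend"] = fake_backend  # make it importable by name

backend = resolve_backend("fake_viz_backend")
result = backend.plot([1, 2, 3], kind="bar", title="demo")
```

With this scheme the string users type is always the import path, which is why a backend living in altair.pandas would have to be selected as 'altair.pandas' rather than 'altair'.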

You can certainly set the option yourself when users run import altair. And we could implement a registry of backends. But I think it'd be confusing for users if they set pandas.options.plotting.backend = 'altair' and it fails because they forgot to import altair first.

One last thing to consider is that there could be more than one pandas backend implemented for altair (or any other visualization library). So, for me, the backend name not being altair is not necessarily a bad thing.

TomAugspurger · 2019-07-19T18:38:18Z

Here's an entry-points-based implementation:

diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py
index 0610780ed..c8ac12901 100644
--- a/pandas/plotting/_core.py
+++ b/pandas/plotting/_core.py
@@ -1532,8 +1532,10 @@ class PlotAccessor(PandasObject):
 
         return self(kind="hexbin", x=x, y=y, C=C, **kwargs)
 
+_backends = {}
 
-def _get_plot_backend(backend=None):
+
+def _get_plot_backend(backend="matplotlib"):
     """
     Return the plotting backend to use (e.g. `pandas.plotting._matplotlib`).
 
@@ -1546,7 +1548,14 @@ def _get_plot_backend(backend=None):
     The backend is imported lazily, as matplotlib is a soft dependency, and
     pandas can be used without it being installed.
     """
-    backend_str = backend or pandas.get_option("plotting.backend")
-    if backend_str == "matplotlib":
-        backend_str = "pandas.plotting._matplotlib"
-    return importlib.import_module(backend_str)
+    import pkg_resources  # slow import. Delay
+    if backend in _backends:
+        return _backends[backend]
+
+    for entry_point in pkg_resources.iter_entry_points("pandas_plotting_backends"):
+        _backends[entry_point.name] = entry_point.load()
+
+    try:
+        return _backends[backend]
+    except KeyError:
+        raise ValueError("No backend {}".format(backend))
diff --git a/setup.py b/setup.py
index 53e12da53..d2c6b18b8 100755
--- a/setup.py
+++ b/setup.py
@@ -830,5 +830,10 @@ setup(
             "hypothesis>=3.58",
         ]
     },
+    entry_points={
+        "pandas_plotting_backends": [
+            "matplotlib = pandas:plotting._matplotlib",
+        ],
+    },
     **setuptools_kwargs
 )

I think it's quite nice. Third-party packages will modify their setup.py (or pyproject.toml) to include something like:

entry_points={
    "pandas_plotting_backends": ["altair = pdvega._pandas_plotting_backend"]
}

I like that it breaks the tight coupling between naming and implementation.

datapythonista · 2019-07-19T19:44:45Z

I haven't worked with entry points. Are they like a global registry of the Python environment? Being new to them, I don't love the idea, but I guess that would be a reasonable way to do it.

I'd still like to have both options, so that if the user sets pandas.options.plotting.backend = 'my_own_project.my_custom_small_backend', it works without requiring them to create a package and set up entry points.

TomAugspurger · 2019-07-19T19:48:09Z

I haven't worked with entry points. Are they like a global registry of the Python environment?

I haven't used them either, but I think that's the idea. From what I understand, they're from setuptools (but packages like flit hook into them?). So they aren't part of the standard library, but setuptools is what everyone uses anyway.

I'd still like to have both options

Falling back to import_module(backend_name) seems reasonable.
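A rough sketch of how the two lookups could be combined: try registered entry points first, then fall back to importing the name as a module. To keep it runnable without installing any package, the entry-point scan is replaced by an injected list of objects with a name and a load() method; get_plot_backend, FakeEntryPoint, and the module names are hypothetical. Unlike the diff above, this version loads only the entry point whose name matches.

```python
import importlib
import sys
import types

_backends = {}  # cache: backend name -> loaded module


def get_plot_backend(name, entry_points=()):
    """Resolve `name` via entry points, else import it as a module path."""
    if name in _backends:
        return _backends[name]
    # First choice: a backend registered under this name.
    for entry_point in entry_points:
        if entry_point.name == name:
            _backends[name] = entry_point.load()
            return _backends[name]
    # Fallback: treat the name as an importable module.
    try:
        module = importlib.import_module(name)
    except ImportError:
        raise ValueError(f"No backend {name}") from None
    _backends[name] = module
    return module


class FakeEntryPoint:
    """Stand-in for an installed entry point (assumption for this demo)."""

    def __init__(self, name, module):
        self.name = name
        self._module = module

    def load(self):
        return self._module


registered = types.ModuleType("altair_pandas_demo")
adhoc = types.ModuleType("my_custom_small_backend")
sys.modules["my_custom_small_backend"] = adhoc  # importable, but not registered

eps = [FakeEntryPoint("altair", registered)]
via_entry_point = get_plot_backend("altair", eps)
via_import = get_plot_backend("my_custom_small_backend", eps)
```

Here 'altair' resolves to a module with a different name through the registration, while the unregistered module is still reachable by its import path, which covers both use cases raised in the thread.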

datapythonista added the Visualization, API Design, Clean, and Needs Discussion labels Jun 9, 2019

datapythonista mentioned this issue Jun 9, 2019

PLOT: Add option to specify the plotting backend #26753

Merged

4 tasks

datapythonista mentioned this issue Jun 23, 2019

PLT: Cleaner plotting backend API, and unify Series and DataFrame accessors #27009

Merged

4 tasks

DougBurke mentioned this issue Jun 26, 2019

Supporting different backends in Sherpa sherpa/sherpa#635

Open

jreback added the Blocker label Jun 28, 2019

jorisvandenbossche added this to the 0.25.0 milestone Jun 30, 2019

datapythonista mentioned this issue Jul 17, 2019

PLT: Delegating to plotting backend only plots of Series and DataFrame methods #27432

Merged

5 tasks

WillAyd mentioned this issue Jul 18, 2019

RLS: 0.25.0 #24950

Closed

WillAyd removed this from the 0.25.0 milestone Jul 18, 2019

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jul 20, 2019

API: Add entrypoint for plotting

853bd66

Libraries, including pandas, register backends via entrypoints. xref pandas-dev#26747

TomAugspurger mentioned this issue Jul 20, 2019

API: Add entrypoint for plotting #27488

Merged

DougBurke mentioned this issue Aug 22, 2019

Allow plotting options in Sherpa commands sherpa/sherpa#251

Closed

This was referenced Aug 27, 2019

BUG: do not hard-code matplotlib backend for boxplot #28159

Closed

DEPR: Clean up of pandas.plotting #28177

Open

Casyfill mentioned this issue Aug 30, 2019

scatter_matrix altair-viz/altair_pandas#24

Open

mroeschke removed the Clean label Jul 10, 2021

mattijn mentioned this issue Aug 2, 2024

feat(python!): Use Altair in DataFrame.plot pola-rs/polars#17995

Open
