API: Define API for pandas plotting backends #26747

Open
datapythonista opened this issue Jun 9, 2019 · 44 comments

@datapythonista
Member

commented Jun 9, 2019

In #26414 we split the pandas plotting module into a general plotting framework able to call different backends and the current matplotlib backend. The idea is that other backends can be implemented in a simpler way, and be used with a common API by pandas users.

The API defined by the current matplotlib backend includes the objects listed next, but this API can probably be simplified. Here is the list with questions/proposals:

Non-controversial methods to keep in the API (They provide the Series.plot(kind='line')... functionality):

  • LinePlot
  • BarPlot
  • BarhPlot
  • HistPlot
  • BoxPlot
  • KdePlot
  • AreaPlot
  • PiePlot
  • ScatterPlot
  • HexBinPlot

Plotting functions provided in pandas (e.g. pandas.plotting.andrews_curves(df))

  • andrews_curves
  • autocorrelation_plot
  • bootstrap_plot
  • lag_plot
  • parallel_coordinates
  • radviz
  • scatter_matrix
  • table

Should those be part of the API, so that other backends have to implement them too? Would it make sense to convert them to the .plot format (e.g. DataFrame.plot(kind='autocorrelation')...)? Or does it make sense to keep them out of the API, or to move them to a third-party module?

Redundant methods that can possibly be removed:

  • hist_series
  • hist_frame
  • boxplot
  • boxplot_frame
  • boxplot_frame_groupby

In the case of boxplot, we currently have several ways of generating a plot (calling mainly the same code):

  1. DataFrame.plot.boxplot()
  2. DataFrame.plot(kind='box')
  3. DataFrame.boxplot()
  4. pandas.plotting.boxplot(df)

Personally, I'd deprecate number 4. For number 3, I'd deprecate it or at least not require a separate boxplot_frame method in the backend, and try to reuse BoxPlot instead (the comments on number 3 also apply to hist).

I haven't checked boxplot_frame_groupby in detail, but I'm not sure whether BoxPlot could be reused for it.

Functions to register converters:

  • register
  • deregister

Do those make sense for other backends?

Deprecated in pandas 0.23, to be removed:

  • tsplot

To see what each of these functions does in practice, this notebook by @liirusuk may be useful: https://github.com/python-sprints/pandas_plotting_library/blob/master/AllPlottingExamples.ipynb

CC: @pandas-dev/pandas-core @tacaswell, @jakevdp, @philippjfr, @PatrikHlobil

@TomAugspurger

Contributor

commented Jun 9, 2019

I think we should keep things like autocorrelation out of the swappable backend API.

I think we’ve left things like df.boxplot and hist around because they have slightly different behavior than the .plot API. I wouldn’t recommend making them part of the backend API.

@TomAugspurger

Contributor

commented Jun 9, 2019

Here’s my start on a proposed backend API from a few months ago: TomAugspurger@b07aba2

@datapythonista

Member Author

commented Jun 9, 2019

I think it's worth mentioning that at least hvplot (didn't check the rest) does already provide the functions like andrews_curves, scatter_matrix, lag_plot,...

Maybe, if we don't want to force all backends to implement those, we can check whether the selected backend implements them, and default to the matplotlib plots otherwise?
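The fallback idea can be sketched as a simple attribute lookup. This is only an illustration: `resolve_plot_function` and the two stand-in backends are hypothetical names, and the backends are plain namespaces rather than real plotting modules.

```python
from types import SimpleNamespace


def resolve_plot_function(name, selected_backend, default_backend):
    """Return `name` from the selected backend, else from the default one."""
    func = getattr(selected_backend, name, None)
    if func is None:
        # The selected backend does not provide this plot: fall back.
        func = getattr(default_backend, name)
    return func


# Hypothetical stand-ins for real backend modules.
matplotlib_like = SimpleNamespace(
    plot=lambda data: "mpl plot",
    andrews_curves=lambda data: "mpl andrews_curves",
)
minimal_backend = SimpleNamespace(plot=lambda data: "minimal plot")

print(resolve_plot_function("plot", minimal_backend, matplotlib_like)([1, 2]))
print(resolve_plot_function("andrews_curves", minimal_backend, matplotlib_like)([1, 2]))
```

A side effect of this design is that a user can end up with plots from two different libraries in one session, which may or may not be what they expect.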

I assumed boxplot and hist behaved exactly the same, but just had shortcuts Series.hist() for Series.plot.hist(). The "shortcut" shows the plot grid, but other than that I haven't seen any difference.

@TomAugspurger

Contributor

commented Jun 10, 2019

@datapythonista

Member Author

commented Jun 10, 2019

I think that makes sense, but if we do that, I think we should move them to pandas.plotting.matplotlib.andrews_curves, instead of pandas.plotting.andrews_curves.

@TomAugspurger I need to check in more detail, but I think the API you implemented in TomAugspurger@b07aba2 is the one that makes more sense. I'll work on it once I finish #26753. I'll also experiment on whether it's feasible to move andrews_curves, scatter_matrix... to the .plot() syntax, I think that will make things simpler and easier for everyone (us, third-party libraries, and users).

@jakevdp

Contributor

commented Jun 10, 2019

What's the intention here regarding extra kwargs passed to plotting functions? Should additional backends attempt to duplicate the functionality of all matplotlib-style plot customizations, or should they allow keywords to be passed that correspond to those used by the particular backend?

The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back).

The second option is not as nice from the perspective of interchangeability, but would allow other backends to be added with a more reasonable set of expectations.

@TomAugspurger

Contributor

commented Jun 10, 2019

@pilkibun

Contributor

commented Jun 14, 2019

I'm sorry if this is a stupid question, but if you define a plotting "API" which is basically a group of canned plots, wouldn't every backend produce more or less the same output? What new capability is this meant to enable? Something like a pandas to vega exporter perhaps?

@jakevdp

Contributor

commented Jun 14, 2019

I don't think it's correct to say that every backend produces more or less the same output.

For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.

On the other hand, bokeh, altair, et al. are great for interactive charts, but aren't quite as mature as matplotlib for static charts.

Being able to produce both with the same API would be a big win.

@tacaswell

Contributor

commented Jun 15, 2019

The first option would be nice in theory, but would require every non-matplotlib plotting backend to essentially implement its own matplotlib conversion layer with a long tail of incompatibilities that would essentially never be complete (speaking from experience as someone who tried to create mpld3 some years back).

and also pins Matplotlib down even more than we already are API wise. I think it makes sense for pandas to declare what style knobs it wants to expose and expect the backend implementations to sort out what that means. This may mean not blindly passing **kwargs through and instead ensuring that the returned objects are "the right thing" for the given backend to be able to do after-the-fact style customization.
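As a rough sketch of what "declaring the style knobs" could look like, the set of accepted options can be made explicit and anything else rejected instead of being forwarded blindly. The option names and both functions below are hypothetical; pandas does not define such a list in this thread.

```python
# Hypothetical set of style options that pandas would declare; not an actual pandas list.
DECLARED_STYLE_OPTIONS = {"title", "color", "legend", "figsize"}


def validate_style_kwargs(kwargs):
    """Reject keyword arguments that are not declared style options."""
    unknown = sorted(set(kwargs) - DECLARED_STYLE_OPTIONS)
    if unknown:
        raise TypeError("unsupported style options: {}".format(", ".join(unknown)))
    return dict(kwargs)


def backend_plot(data, kind, **kwargs):
    """Toy backend that interprets the declared options in its own terms."""
    style = validate_style_kwargs(kwargs)
    return {"kind": kind, "n_points": len(data), "style": style}


print(backend_plot([1, 2, 3], "line", title="demo"))
```

Backend-specific customization would then happen on the returned object, after the fact, rather than through extra keyword arguments.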

@pilkibun

Contributor

commented Jun 15, 2019

For example, matplotlib is really good at static charts, but not great at producing portable interactive charts.

Thanks @jakevdp, yes, supporting interactive charts is a good goal.

Before things go too far down this particular avenue, here's an alternative solution.

Instead of proclaiming the pandas plotting API to now be a specification, and asking viz packages to implement it specifically, why not generate an intermediate representation (like a vega JSON file) of the plot, and encourage backends to target that as their input.

Advantages include:

  1. Not being tied to the expressive power of a reified pandas API, which wasn't designed as a specification.
  2. The work done by plotting packages to support pandas, becomes available to other pydata packages which generate IR.
  3. Promoting a common language for interchange visualization in the pydata space
  4. Which makes new tools more powerful because more widely applicable
  5. Which makes the effort of writing them more reasonable. Basically, improved incentives.

Vega/Vega-Lite, as a modern, established, open, and JSON-based viz specification language, with several man-years put into its design and implementation, and existing tools built around it, seems like it was created expressly for this purpose. (just please don't).

You know, frontend->IR->backend, like compilers are designed.
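To make the frontend->IR->backend idea concrete, here is a minimal sketch of a frontend that emits a Vega-Lite specification as a plain dict. The function name, the `kind` mapping and the choice of the v5 schema URL are assumptions for illustration; the rows are made-up sample data.

```python
import json

# Mapping from pandas-style `kind` names to Vega-Lite mark types (partial, illustrative).
KIND_TO_MARK = {"line": "line", "bar": "bar", "area": "area", "scatter": "point"}


def to_vega_lite(records, kind, x, y):
    """Build a Vega-Lite specification (a plain dict) from a list of row dicts."""
    if kind not in KIND_TO_MARK:
        raise ValueError("unsupported kind: {!r}".format(kind))
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "data": {"values": list(records)},
        "mark": KIND_TO_MARK[kind],
        "encoding": {
            "x": {"field": x, "type": "quantitative"},
            "y": {"field": y, "type": "quantitative"},
        },
    }


rows = [{"t": 0, "value": 1.0}, {"t": 1, "value": 2.5}, {"t": 2, "value": 2.0}]
spec = to_vega_lite(rows, kind="line", x="t", y="value")
print(json.dumps(spec, indent=2))
```

With pandas, `df.to_dict(orient="records")` would produce a list of rows in this shape; any renderer that understands Vega-Lite could then consume the JSON without knowing about pandas.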

@TomAugspurger

Contributor

commented Jun 15, 2019

@datapythonista

Member Author

commented Jun 21, 2019

We now merged #26753, and the plotting backend can be changed from pandas. When we split the matplotlib code we left the SeriesPlotMethods and FramePlotMethods in the pandas (not matplotlib) side. That was mainly to leave the docstrings in the pandas side.

But I see that what backends did was to reimplement those classes. So, currently we expect the backends to have one class per plot (e.g. LinePlot, BarPlot), but instead they implement a class with a plot per method (e.g. hvPlot, or the same names as pandas for pdvega).

What I think makes sense, at least as a first version, is that we implement the API as hvplot and pdvega did. I'd just create an abstract class in pandas, that backends inherit from.
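A minimal sketch of that shape, one class with one method per plot kind, could look like the following. `BasePlotBackend`, its two abstract methods and `TextBackend` are hypothetical names; a real base class would presumably cover all the kinds pandas documents.

```python
from abc import ABC, abstractmethod


class BasePlotBackend(ABC):
    """Hypothetical base class; the name and method set are assumptions."""

    def __init__(self, data):
        self.data = data

    def __call__(self, kind="line", **kwargs):
        # Dispatch `kind` to the public method of the same name.
        method = None if kind.startswith("_") else getattr(self, kind, None)
        if not callable(method):
            raise ValueError("{!r} is not a valid plot kind".format(kind))
        return method(**kwargs)

    @abstractmethod
    def line(self, **kwargs):
        ...

    @abstractmethod
    def bar(self, **kwargs):
        ...


class TextBackend(BasePlotBackend):
    """Toy backend that returns strings instead of drawing."""

    def line(self, **kwargs):
        return "line plot of {} points".format(len(self.data))

    def bar(self, **kwargs):
        return "bar plot of {} points".format(len(self.data))


print(TextBackend([1, 2, 3])(kind="bar"))
```

A backend that forgets one of the abstract methods fails at instantiation time with a TypeError, which gives backend authors early feedback about missing plot kinds.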

If that makes sense for everyone, I'll start by creating the abstract class and adapting the matplotlib backend we have in pandas, and once this is done, we adapt hvplot and pdvega (the changes there should be quite small).

Thoughts?

@philippjfr

commented Jun 21, 2019

What I think makes sense, at least as a first version, is that we implement the API as hvplot and pdvega did. I'd just create an abstract class in pandas, that backends inherit from.

I think that on balance this approach will be cleaner. I can't speak to other plotting backends but at least in hvPlot different plot methods share quite a bit of code, e.g. scatter, line and area are largely analogous, and I'd prefer not to rely on subclassing to share code between them. Additionally, I think different backends should have the option to add additional plot types and exposing those as additional public methods seems like the simplest, most natural approach.

@datapythonista

Member Author

commented Jun 21, 2019

Just to make sure I understand, when you say I'd prefer not to rely on subclassing to share code between them you mean like in class LinePlot(MPLPlot), right? And not that you think it's a bad idea to inherit from an abstract base class?

I think I'm +1 on letting backends define plot types not in pandas. But I probably won't implement it right now. We're planning to release pandas in around one week. And I think this will require a bit more thinking than blindly calling the methods of backends if the user provides kind='foo' and the backend provides the method foo (for example, parameter validation, or the fact that some kinds will be in the documentation and some not).

@philippjfr

commented Jun 21, 2019

Just to make sure I understand, when you say I'd prefer not to rely on subclassing to share code between them you mean like in class LinePlot(MPLPlot), right? And not that you think it's a bad idea to inherit from an abstract base class?

Yes, that's right. More concretely I'd prefer not to have to do this kind of thing:

class MPL1dPlot(MPLPlot):

    def _some_shared_method(self, *args, **kwargs):
        ...

class LinePlot(MPL1dPlot):
    ...

class AreaPlot(MPL1dPlot):
    ...

Sorry if that was not clear.

@jorisvandenbossche

Member

commented Jun 26, 2019

Very much in favor of a simpler API that is publicly exposed as the single function instead of the classes as now proposed in #27009.

General question/remark on how the backend option now works. Assume I am the pdvega developer and make this backend available. Does that mean that if users do pd.options.plotting.backend = 'pdvega', the pdvega library needs to have a top-level plot function?
1) As a library author, that's not necessarily the function you want to expose publicly (meaning that, from the library's point of view, the top-level plot function is not necessarily the API you want your users to call directly), and 2) for this case you might actually want to be able to do pd.options.plotting.backend = 'altair' (in case the altair developers are fine with that).
So basically my question is: does there need to be an exact 1:1 mapping between the backend name and what is imported? (That is needed now, since it simply imports the provided backend string.)

EDIT: I see that actually something similar was discussed in the PR #26753

@datapythonista

Member Author

commented Jun 26, 2019

If we make the decision that pandas doesn't know/limit which backends can be used (which I'm strongly in favor of making), we need to decide on how/what to call in the backends.

What has been implemented and proposed in the PR I'm working on is that the option plotting.backend is a module (can be pdvega, altair, altair.pandas, or whatever), and that module must have a public plot function, which is what we will call.
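The contract described here (the option names a module, and that module exposes a public plot function) can be sketched with the standard library alone. `get_plot_backend` and the in-memory `toy_backend` module are hypothetical, and the exact arguments pandas passes to `plot` are an assumption.

```python
import importlib
import sys
import types


def get_plot_backend(backend_name):
    """Import the module named by the option and check it exposes `plot`."""
    module = importlib.import_module(backend_name)
    if not callable(getattr(module, "plot", None)):
        raise ValueError(
            "{} does not define a public plot function".format(backend_name)
        )
    return module


# Hypothetical third-party backend, created in memory so the sketch is self-contained.
toy_backend = types.ModuleType("toy_backend")
toy_backend.plot = lambda data, kind="line", **kwargs: "{} plot of {} values".format(
    kind, len(data)
)
sys.modules["toy_backend"] = toy_backend

backend = get_plot_backend("toy_backend")
print(backend.plot([1, 2, 3], kind="bar"))
```

Because the lookup is just an import, a dotted path such as a submodule of a larger package works the same way as a top-level package name.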

We can consider other options, like if the option is pdvega, we import pdvega.pandas, or we can name the function plot_pandas or whatever. I think the proposed way is the simplest, but if there are other proposals that make more sense, I'm happy to change it.

Another discussion is if we want to force the users to import the backends manually:

import pandas
import hvplot

pandas.Series([1, 2, 3]).plot()

If we do that, the modules can register themselves, they can also register aliases (so set_option can understand other names than the name of the module). They can also implement custom functions or machinery (e.g. context managers) to plot with certain backends,... Personally I think the simpler we keep things the better.

And while it could be nice to do pandas.set_option('plotting.backend', 'bokeh') to plot in bokeh, I think that implies two things I personally don't like:

  • pandas.set_option('plotting.backend', 'bokeh') will only work if import pandas_bokeh has been called, and will be confusing for the users.
  • It also implies that there is only one module to plot in bokeh. Which doesn't need to be true, and gives the wrong impression to users that you're plotting directly with bokeh, and not with a pandas plotting backend for bokeh.
@jorisvandenbossche

Member

commented Jun 30, 2019

@datapythonista thanks for the detailed answer. I am fine with keeping it now as is for the initial release (possibility for alias can always be added later).

If users want hvplot's Andrew's curve plot, they should import the function from hvplot and pass the dataframe there.

+1, I would also not expose all the additional plotting functions through the backend.

But about moving them to pandas.plotting.matplotlib, that seems like an unnecessary backwards incompatible break to me (assuming you meant not only moving the implementation).

@jakevdp

Contributor

commented Jul 1, 2019

pandas.set_option('plotting.backend', 'bokeh') will only work if import pandas_bokeh has been called, and will be confusing for the users.

If we use entrypoints to register extensions, then this does not have to be the case: having the package installed on the system will register the entrypoint and make it visible to pandas. For example, this is what Altair uses to detect various renderers that the user might have installed.
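For readers unfamiliar with entry points, a minimal discovery sketch follows. It uses the standard library's `importlib.metadata` rather than setuptools' `pkg_resources`, and the group name passed in the example call is an assumption; nothing is loaded until `load()` is called on an entry point.

```python
from importlib.metadata import entry_points


def discover_backends(group):
    """Map entry point names to entry point objects for one group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ selection API.
        found = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a dict keyed by group name.
        found = eps.get(group, [])
    return {ep.name: ep for ep in found}


# The result depends on which packages are installed in the environment.
available = discover_backends("pandas_plotting_backends")
print(sorted(available))
```

Calling `available[name].load()` would import the registered object, so a package only has to be installed, not imported by the user, for its backend name to be visible.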

@jakevdp

Contributor

commented Jul 1, 2019

Also, for what it's worth, once this goes in I think I'd probably deprecate pdvega and move the relevant code over to a new package named pandas_altair or something similar.

@jorisvandenbossche

Member

commented Jul 6, 2019

For an initial release of the backend API, I would rather be more conservative in what we expose, rather than including everything. It is much easier to add things later, than to remove.

I would personally also not move all those misc plots to the accessor (there might be some exceptions, like scatter matrix); IMO andrews_curves, radviz, etc. are not "worth" a method.

That said: do we want to allow backends to implement additional "kinds" ? So we don't have to decide, as pandas, exactly which accessor methods can be available. If the user passes a certain kind or tries to access an attribute, we could still pass it to the backend plot with a custom __getattribute__.
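One way to picture this forwarding is the toy accessor below. It uses `__getattr__` (which only runs when normal lookup fails) instead of `__getattribute__`, and every name in it, including the extra "violin" kind, is hypothetical.

```python
import functools


class ToyAccessor:
    """Hypothetical stand-in for the .plot accessor; not the pandas class."""

    def __init__(self, data, backend_plot):
        self._data = data
        self._backend_plot = backend_plot

    def __call__(self, kind="line", **kwargs):
        return self._backend_plot(self._data, kind=kind, **kwargs)

    def __getattr__(self, name):
        # Only reached when normal lookup fails, so real attributes are unaffected.
        if name.startswith("_"):
            raise AttributeError(name)
        return functools.partial(self, kind=name)


def toy_backend_plot(data, kind, **kwargs):
    supported = {"line", "bar", "violin"}  # "violin" is an extra, backend-specific kind
    if kind not in supported:
        raise ValueError("unsupported kind: {!r}".format(kind))
    return "{} plot of {} values".format(kind, len(data))


plot = ToyAccessor([1, 2, 3], toy_backend_plot)
print(plot.violin())
```

The trade-off is that such forwarded kinds are invisible to documentation and tab completion, and any validation of `kind` moves into the backend.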

@datapythonista

Member Author

commented Jul 6, 2019

Just to explain a bit why things are the way they are now. It's relevant because I'm not quite sure how to implement the changes you propose, or not exposing things in general. Not saying here that it can't be done in a different way, it's just to enrich the discussion.

The first decision was to move all the code using matplotlib to a separate module (pandas.plotting._matplotlib). By doing that, that module somehow became the matplotlib backend.

Everything that was public in pandas.plotting has been kept as public there. And to make things as simple as possible, every one of these functions, once called, it loads the backend (call to _get_plot_backend) and it calls the function there.

The public API for the user has no change at all, users still have the same methods and functions available. We're not exposing anything new.

As I understand things, if we decide that an existing plot like andrews_curves is not delegated to the backend, what this implies is that instead of getting the backend selected by the user, we will still select the matplotlib backend. Given that at least hvplot is already implementing andrews_curves, I personally don't see the point. If the user wants an andrews_curves plot in matplotlib, it is as easy as not changing the backend (or setting it again if it's been changed). So, with the change what we'd do is simply make users' lives much harder, by adding extra complexity to pandas.

If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream (I guess that's one of the reasonings?), maybe we can default to the matplotlib backend for anything that is missing in the selected backend?

About delegating any unknown kind of plot to the backend, I'm -1 on doing it right now. Surely it can make sense eventually. But I think having several plot kinds documented in pandas, and having extra ones that we don't document, feels a bit hacky. I think it can wait for the next version, after we have feedback on how having different backends works for users, and we have more time to discuss and analyze in detail.

@jorisvandenbossche

Member

commented Jul 16, 2019

If the user wants an andrews_curves plot in matplotlib, it is as easy as not changing the backend (or setting it again if it's been changed). So, with the change what we'd do is simply make users' lives much harder, by adding extra complexity to pandas.

I don't think we would be making the user's life harder. Instead of importing it from pandas.plotting, if they want hvplot's version, they can simply import it from there. That is not possible for the DataFrame.plot method, as it is defined on the object. For me that is the main reason for the plotting backend.

If we want to be nice with backend developers and not force them to implement plots that may not be so mainstream

For me it is not about being nice or that implementing everything would be required (it is totally fine if a backend does not support all plotting types, IMO), but rather an unnecessary expansion of the plotting backend API, which also ties us to it.
If we restarted pandas from scratch, I don't think those misc plotting types would be included. But with the plotting backend API we are in some way starting something new.

Any other opinions about this?

@TomAugspurger

Contributor

commented Jul 16, 2019

Agreed with @jorisvandenbossche.


Just to make sure this isn't lost, I think @jakevdp's suggestion to use setuptools' entry points is worth considering to solve the import order registration issue: #26747 (comment)

@datapythonista

Member Author

commented Jul 17, 2019

@jorisvandenbossche how would you change that in the code? Instead of getting the backend defined in the settings for those methods, get the matplotlib backend? I think this is wrong conceptually, but I'm ok with it if there is agreement. I'm -1 on anything that reverts the decoupling of the matplotlib code from the rest.

Since you mention that in a pandas from scratch we wouldn't include those plots, should we deprecate them? I'm +1 on moving all the plots that are not methods of Series or DataFrame to a third-party package. Or if any is important enough to be kept, to move it to be called with .plot() as the others.

@jreback

Contributor

commented Jul 17, 2019

I would deprecate the non-standard plots in pandas and move them to an external package.

@TomAugspurger

Contributor

commented Jul 17, 2019

Joris is offline for a bit.

I think when we’ve discussed this in the past, his and my position on these is to just leave them untouched until they become a maintenance burden.

@datapythonista

Member Author

commented Jul 17, 2019

Just so we are on the same page, this is a summary of what we have, and my understanding of the state of the discussion:

Used as methods of Series and DataFrame (afaik we're all happy to keep them as they are, delegated to the selected backend):

  • PlotAccessor
  • boxplot_frame
  • boxplot_frame_groupby
  • hist_frame
  • hist_series

Other plots (under discussion whether they should be deprecated, delegated to the matplotlib backend, or delegated to the selected backend):

  • boxplot
  • scatter_matrix
  • radviz
  • andrews_curves
  • bootstrap_plot
  • parallel_coordinates
  • lag_plot
  • autocorrelation_plot
  • table

Other public stuff in pandas.plotting (under discussion too):

  • plot_params
  • register_matplotlib_converters
  • deregister_matplotlib_converters

For the Other plots section, I personally think they are a maintenance burden at this point, and I'm +1 on moving them out of pandas, and deprecate them in 0.25.

For the converters and the other stuff, what we have now is surely not correct, since register_matplotlib_converters delegates to the selected backend, which may not be matplotlib. The options that I guess we can consider are:

  • Rename them to register_converters/deregister_converters, deprecate the current ones, and keep delegating to the backend
  • Move them from pandas.plotting to pandas.plotting.matplotlib (which would imply making the matplotlib backend public, so I wouldn't)
  • Leave them as they are, and delegate to the matplotlib backend instead of the selected backend (I see this more as a hack than a good design decision, I'd prefer to keep pandas.plotting agnostic of which backends exist)
@TomAugspurger

Contributor

commented Jul 17, 2019

For the Other plots section, I personally think they are a maintenance burden at this point, and I'm +1 on moving them out of pandas, and deprecate them in 0.25.

How do you find the "other plots" to be a maintenance burden? Looking at the history for the "misc" plots: https://github.com/pandas-dev/pandas/commits/0.24.x/pandas/plotting/_misc.py, we have ~10-15 commits since 2017. The majority are global cleanups applied to the entire codebase (so a small marginal burden). I only see 1-2 commits changing docs, and no commits changing functionality.

Rename them to register_converters/deregister_converters, deprecate the current ones, and keep delegating to the backend

I don't think this would make sense. There are matplotlib-specific converters that we've written for matplotlib. Other backends won't have them. It probably shouldn't be part of the backend API.

@datapythonista

Member Author

commented Jul 17, 2019

I didn't mean those plots are a burden because of the amount of maintenance they've needed in the last months or years, but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us.

Regarding the converters, I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases. But it doesn't seem a problem if they don't, and those functions do nothing for some or all of the other backends. I'm also ok with option 2, but I don't find it as neat.

@TomAugspurger

Contributor

commented Jul 17, 2019

but because of the problem they now pose for having a consistent and intuitive API for users, and a good modular code design for us.

They're already somewhat inconsistent with DataFrame.plot, though. The name "misc" implies that :) Does having a swappable backend make that any worse? To the extent that it's worth the churn on user code? I don't think so.

I don't know if backend authors may want to implement the equivalent of those for matplotlib in some cases.

I don't think so. The point of those converters is to teach matplotlib about pandas objects. Libraries implementing the backend won't have that problem, since they already depend on pandas.

@datapythonista

Member Author

commented Jul 17, 2019

Personally I think about it mainly in terms of managing complexity. Having a standard plotting API that is delegated to the backend via a single API is easy to understand, and to maintain. Users and maintainers just need to learn that there is a plot function with a kind argument, and that this will be executed in the selected backend.

Having a set of heterogeneous plots that, besides not following the same API, use a backend that is not the one selected for the other plots but always the Matplotlib one, adds too much complexity for everyone IMHO.

And the cost of moving them seems small to me; my guess is that not a big proportion of our users even know about those plots. And the ones who do will just need to install an extra conda package and use import pandas_plotting; pandas_plotting.andrews_curves(df) instead of pandas.plotting.andrews_curves(df).

To me it seems there is a lot to win at a small cost, but of course it's just an opinion.

@TomAugspurger

Contributor

commented Jul 17, 2019

Can we document that the swappable backend is just for Series/DataFrame.plot? That seems like a pretty simple rule.

@datapythonista

Member Author

commented Jul 17, 2019

Feels like a hack that adds unnecessary complexity to me; I don't think explaining it in the documentation makes it less counter-intuitive.

But anyway, not a big deal. If that's the preferred option, this is how I'd implement it, at least the increase in code complexity is minimal: #27432

@WillAyd referenced this issue Jul 18, 2019

@WillAyd removed this from the 0.25.0 milestone Jul 18, 2019

@jakevdp

Contributor

commented Jul 19, 2019

Looking more closely at this now: if I understand correctly, the way that the plotting backend will be set is using:

pd.set_option('plotting.backend', 'name_of_module')

My understanding, then, is that if I want to make the following work:

pd.set_option('plotting.backend', 'altair')

then I will need the top-level altair package to define all the functions in https://github.com/pandas-dev/pandas/blob/master/pandas/plotting/_core.py. I would prefer not to pollute Altair's top-level namespace with all these additional APIs that are not meant to actually be used by Altair users. In fact, I would prefer for altair's pandas extension to live in a separate package, so it's not tied to the release cadence of Altair itself.

If I understand correctly, this means that there's no way for me to make pd.set_option('plotting.backend', 'altair') work correctly without hard-coding the altair package in pandas the way matplotlib is currently hard-coded, is that correct?

if backend_str == "matplotlib":
    backend_str = "pandas.plotting._matplotlib"

If so, I would strongly advise rethinking the means by which this API is exposed in third-party packages.

My suggested solution would be to adopt an entrypoint-based framework that would let me, for example, create a package like altair_pandas that registers the altair entrypoint to implement the API. Otherwise users will forever be confused that pd.set_option('plotting.backend', 'altair') doesn't do what they expect.

@TomAugspurger

Contributor

commented Jul 19, 2019

@datapythonista

Member Author

commented Jul 19, 2019

There was a point in time where what you say was mostly correct, but that's not the case anymore.

If you want pandas.options.plotting.backend = 'altair', in 0.25 you just need to have a function altair.plot(). At some point I thought it would be better to call the function pandas_plot instead of simply plot, so that it was specific in a backend that had other things, but in the end we didn't make the change.

If creating the plot function in the top level of altair is a problem, we can rename it in a future version, or you can also have altair.pandas.plot, but then users will have to set pandas.options.plotting.backend = 'altair.pandas'.

You can surely change the option yourself once users do an import altair. And we could implement a registry of backends. But I think it'd be confusing for users if they do the pandas.options.plotting.backend = 'altair' and it fails, because they forgot the import altair before.

One last thing is to consider that we could possibly have more than one pandas backend implemented for altair (or any other visualization library). So, for me, that the name of the backend is not altair, is not necessarily a bad thing.

@TomAugspurger

Contributor

commented Jul 19, 2019

Here's an entry-points based implementation

diff --git a/pandas/plotting/_core.py b/pandas/plotting/_core.py
index 0610780ed..c8ac12901 100644
--- a/pandas/plotting/_core.py
+++ b/pandas/plotting/_core.py
@@ -1532,8 +1532,10 @@ class PlotAccessor(PandasObject):
 
         return self(kind="hexbin", x=x, y=y, C=C, **kwargs)
 
+_backends = {}
 
-def _get_plot_backend(backend=None):
+
+def _get_plot_backend(backend="matplotlib"):
     """
     Return the plotting backend to use (e.g. `pandas.plotting._matplotlib`).
 
@@ -1546,7 +1548,14 @@ def _get_plot_backend(backend=None):
     The backend is imported lazily, as matplotlib is a soft dependency, and
     pandas can be used without it being installed.
     """
-    backend_str = backend or pandas.get_option("plotting.backend")
-    if backend_str == "matplotlib":
-        backend_str = "pandas.plotting._matplotlib"
-    return importlib.import_module(backend_str)
+    import pkg_resources  # slow import. Delay
+    if backend in _backends:
+        return _backends[backend]
+
+    for entry_point in pkg_resources.iter_entry_points("pandas_plotting_backends"):
+        _backends[entry_point.name] = entry_point.load()
+
+    try:
+        return _backends[backend]
+    except KeyError:
+        raise ValueError("No backend {}".format(backend))
diff --git a/setup.py b/setup.py
index 53e12da53..d2c6b18b8 100755
--- a/setup.py
+++ b/setup.py
@@ -830,5 +830,10 @@ setup(
             "hypothesis>=3.58",
         ]
     },
+    entry_points={
+        "pandas_plotting_backends": [
+            "matplotlib = pandas:plotting._matplotlib",
+        ],
+    },
     **setuptools_kwargs
 )

I think it's quite nice. 3rd party packages will modify their setup.py (or pyproject.toml) to include something like

entry_points={
    "pandas_plotting_backends": ["altair = pdvega._pandas_plotting_backend"]
}

I like that it breaks the tight coupling between naming and implementation.

@datapythonista

Member Author

commented Jul 19, 2019

I haven't worked with entry points; are they like a global registry of the Python environment? Being new to them I don't love the idea, but I guess that would be a reasonable way to do it then.

I'd still like to have both options, so if the user does pandas.options.plotting.backend = 'my_own_project.my_custom_small_backend' it works, and doesn't require creating a package and setting entry points.

@TomAugspurger

Contributor

commented Jul 19, 2019

I haven't worked with entry points; are they like a global registry of the Python environment?

I haven't used them either, but I think that's the idea. From what I understand, they're from setuptools (but packages like flit hook into them?). So they aren't part of the standard library, but setuptools is what everyone uses anyway.

I'd still like to have both options

Falling back to import_module(backend_name) seems reasonable.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Jul 20, 2019
API: Add entrypoint for plotting
Libraries, including pandas, register backends via entrypoints.

xref pandas-dev#26747