Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR,VIS: remove custom matplotlib code #50059

Open
MarcoGorelli opened this issue Dec 4, 2022 · 17 comments
Open

DEPR,VIS: remove custom matplotlib code #50059

MarcoGorelli opened this issue Dec 4, 2022 · 17 comments
Labels
Needs Discussion Requires discussion from core team before further action Visualization plotting

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 4, 2022

The good - definitely keep

Before talking about what I'd like to remove, here's what I'd like to keep.

I love the plotting backend. Good one @datapythonista ! What I'm suggesting is to only keep that and get rid of everything else.

The not so good - deprecate?

Visualisation code is hard to maintain and hard to test. It regularly causes issues due to insufficient testing. The worst example I've come across is #39522, in which the lines on the plot don't correspond with the legend. If someone had trusted pandas and had made a business decision using such a plot, they'd have made the wrong decision!

On the maintenance side, this PR #29944 took nearly 3 years (!) to get reviewed.

I'm not blaming anyone, I'm just pointing out that visualisation is really hard to maintain and test. How about we reduce the maintenance burden and get rid of most of it?

What to do instead

seaborn is a high-level wrapper around matplotlib (a bit like how plotly.express is to plotly). Instead of maintaining all this buggy custom matplotlib code ourselves, let's leave it to the experts.

Concretely, this would mean adding a seaborn plotting backend, and making that the default backend. So df.plot.scatter(x=x, y=y) would defer to seaborn.scatterplot(data=df, x=x, y=y), and similarly for other common methods.
This could live in seaborn itself, just like how the plotly backend code live in the plotly repo:

https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/__init__.py#L99

If the seaborn folks weren't up for that, then it could still live in pandas. It'd be way less to maintain than all the custom unmaintained matplotlib code we currently have.

Impact on users

For the vast, vast majority, it should be practically nothing. If before they called df.plot.scatter(x=x, y=y), then they will still get a matplotlib scatter plot with x on the x-axis and y on the y-axis, but it'll have been produced by seaborn instead of with the custom matplotlib code we currently have in pandas.

There may be some unusual plots which pandas currently produces but which seaborn doesn't - I don't know, I only tried this as a POC and it seemed to work well - and to be honest I'd be fine with deprecating them completely. If they're too unusual to be in a plotting library, they're too unusual to be in pandas.

Alternatives

Split the custom matplotlib code into a separate repo. If someone wants to step up and volunteer to maintain it, then sure, no objections. If nobody does, my suggestion is to defer to seaborn for matplotlib plots and stop hacking around with custom matplotlib code in pandas.

@MarcoGorelli MarcoGorelli added the Needs Discussion Requires discussion from core team before further action label Dec 4, 2022
@datapythonista
Copy link
Member

+1 for the general idea. From the pandas side, I'd simply provide the API, and let the users decide if they install pandas-matplotlib, pandas-seaborn (when implemented), pandas-bokeh, pandas-holoviews, pandas-altair, pandas-plotly (not sure if this one eas finally implemented, there were plans) or any of the present or future plotting backends.

So, while I think this is a good idea, and a seaborn backend probably makes more sense, personally I don't think the seaborn backend is something for the pandas core team to implement, external backends can be developed already and they have been for a while. Personally I think pandas as a team should only care about the plotting API and how we allow backends in general.

So, to me it'd make more sense to discuss if we move thr matplotlib one to an external project, and I'm surely +1 on it.

And I'm not against using pandas funding for the seaborn backend if needed, my point is that we don't need the core team to agree to be able to develop a seaborn or any other backend, since it should live in a separate repo.

@jreback
Copy link
Contributor

jreback commented Dec 5, 2022

-0.5 on this idea generally

adding even more complexity on installation of pandas just hurts the batteries included philosophy

yes generally we should limit core functionality and remove non core to other packages, but basic plotting using matplotlib is and has always been core

plotting already has hooks that enable pluggable backends. i suppose moving the matplotlib code to a package that is included by default would be ok but that would have to happen before a change to remove it is contemplated

@datapythonista
Copy link
Member

Having a separate package and for now depending on it seems reasonable.

I don't think the extra installation complexity is so big. Now we have the backend, but we don't have matplotlib installed with pandas. For many users, the only difference is that instead of getting an error asking to install matplotlib, they'll get an error asking to install pandas-matplotlib. Only for users who have matplotlib installed and don't know how, this change would make a difference I think.

A different discussion, but maybe eventually it's worth having a pnadas-core package and a pandas metapackage. The same trade-off of batteries included applies to all IO at least (sas, strata, gbq...), and possibly other features of pandas. This could get the best of both worlds and provide a batteries included experience to users, while not making this project keep growing with every plotting library, or data format. But this isbprobably worth a separate issue or PDEP.

@rhshadrach
Copy link
Member

rhshadrach commented Dec 5, 2022

Will a separate package create complexities in contributing? Is there a possibility what would be one PR into pandas now needs to be a PR into pandas-seaborn and then a separate PR into pandas? It would also need a separate CI to maintain.

@datapythonista
Copy link
Member

I think the pandas plotting backend API is quite stable. We have many other plotting backends, and I don't think we're breaking their code often, or at all.

I think contributing to pandas is hard, conributing to a plotting backend is easy and less intimidating. Starting by matplotlib devs, I think the new project could easily have better maintenance that the current code in pandas. Maybe I'm wrong. But since Python itself is batteries included, but supoorts third party packages to interact well with it, and that seems way better than Python supporting everything in the standard library (both for users and maintainers). I prefer to think this would be also the case for pandas, for some particular things, like plotting or IO.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Dec 5, 2022

Thanks all for your comments!

The main issues that have come up are:

  • that this is core functionality, and shouldn't go away
  • that installation should stay simple
  • CI

One the first one - I agree. This should only be once once the seaborn backend has been implemented. I did some quick work on this as a POC, and it shouldn't be that much effort. By comparison, the plotly backend is < 200 lines of code: https://github.com/plotly/plotly.py/blob/master/packages/python/plotly/plotly/__init__.py#L99 . There's no pandas-plotly, it's handled directly in plotly - I'm hoping the same might be possible in seaborn.

On installation - not sure I follow. Matplotlib already isn't a dependency of pandas, users need to install it separately. The message would just read "please install seaborn" instead of "please install matplotlib".

On CI / contributing: unfortunately we did break the backend recently, see #49732. The breakage actually came from having both plotting backends and custom matplotlib code. That's one reason I'm saying: let's get rid of the custom matplotlib code, and just keep the backends. A seaborn backend would do pretty much the same thing, but it'd be way better maintained and tested.

@mroeschke mroeschke added the Visualization plotting label Dec 5, 2022
@nicolaskruchten
Copy link

Hi all!

I maintain the plotly backend and I'm also a happy pandas user ... I just wanted to chime in with some quick thoughts:

  1. The current plotting API does seem like core functionality that shouldn't break for folks who rely on it, certainly not in the near future!
  2. If there is a desire for folks to eventually migrate to a different backend such that the current one can eventually be removed, I would recommend, as a first step, to more clearly indicate in the official Pandas docs that there are more backends out there (maybe with some links to the alternatives like plotly and hvplot etc) and that it's a choice to use the current one. By "more clearly" I would mean something like at the top of https://pandas.pydata.org/docs/user_guide/visualization.html rather than at the bottom :)
  3. Pulling the current one out into something like pandas-matplotlib would reinforce the message above IMO, especially if that package had its own docs.
  4. There is not currently a seaborn backend that I know of and I think we should probably ask @mwaskom how he'd feel about this proposal before investing a ton more time into that particular line of thinking :)
  5. Maybe hvplot would be a reasonable default instead, given that it does actually currently match the behaviour of the default backend and can, itself, retarget other backends like matplotlib, bokeh and plotly? Maybe @jbednar or @philippjfr have some thoughts?

@philippjfr
Copy link

Having implemented hvPlot as an alternative plotting backend I do have some thoughts. The idea of splitting out the current code into pandas-matplotlib sounds quite sensible to me. It decouples the projects, makes contributions simpler but you do have to find someone to maintain it and cut releases. I would not be too concerned about the coupling between the projects since changes in the plotting code are very unlikely to require upstream changes to pandas itself and pandas changes have not really caused much issue for us in hvPlot.

On the seaborn backend you definitely should check with @mwaskom to hear his thoughts about it. Personally I'd say such a backend would be quite welcome but I would be very wary of switching to it as a default any time soon. This would have to be telegraphed a long time in advance. Yes, implementing the basics is probably easy but the default pandas plotting backend is heavily used by many, many thousands of users and even minor changes will annoy a ton of people and there will be a very, very long tail of minor differences that will be a slog to address.

For that very same reason I'd be pretty wary of suggesting that hvPlot should be the default. Yes, we have taken care that we are compatible in many ways but there's also major differences that will catch users off guard.

Since I don't maintain the code I do not want to come off as prescriptive about the right course of action but I would strongly urge caution in anointing a new default plotting backend, particularly one that does not exist yet.

@nicolaskruchten
Copy link

(Maybe just a contextual note in case this isn't clear to everyone reading this: the hvplot backend is intended to be mostly arg-compatible with the current built-in one as far as I know, but the plotly backend is explicit not, and the hypothetical seaborn backend would not be either, as proposed above :)

@nicolaskruchten
Copy link

nicolaskruchten commented Dec 6, 2022

Basically I'm saying that this (quoted from the top) would not be correct:

Impact on users

For the vast, vast majority, it should be practically nothing. If before they called df.plot.scatter(x=x, y=y), then they will still get a matplotlib scatter plot with x on the x-axis and y on the y-axis, but it'll have been produced by seaborn instead of with the custom matplotlib code we currently have in pandas.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Dec 6, 2022

I brought this up with Michael, and it turns out it'd be quite a bit more involved that I'd first estimated

I still think the "deprecate most of the matplotlib code in pandas, defer as much as possible to seaborn" suggestion could hold. I'll try this out in a separate repo - perhaps it'll end up being 1000+ lines of code, but I still think it'd be an improvement on the status quo

It's true that many people use pandas plotting, and that's exactly what concerns me (especially in the face of bugs like #39522 - I get that all software has bugs, but this one troubled me more than most)

Ultimately, I would only try to push this forwards if the impact on users were indeed minimal. Alternatively, I'd suggest declaring the pandas plotting code "maintenance only" - we'll fix bugs, but won't accept new features. But that's for a separate issue

@nicolaskruchten
Copy link

OK cool. @MarcoGorelli feel free to reach out if you want to chat as you work through your 1000+ line prototype, I'd be happy to advise or help out if you like :) I spent a few days looking at what it would take to convert the pandas plot signatures into Plotly Express ones and decided that it would not be a maintainability win (PX itself is only like 2300 lines or so!). My fear is that in order to stop maintaining the pandas -> matplotlib code, you'd end up maintaining more, hairier pandas -> seaborn code!

I think that the idea of carving off the current code into its own separate (even if auto-installed!) library with separate docs is a valuable one in any case.

@MarcSkovMadsen
Copy link

MarcSkovMadsen commented Dec 6, 2022

For me hvPlot would be the obvious candidate.

it can output to Matplotlib, Bokeh and Plotly. And it extends to other backends like Dask, XArray etc. It provides a lot of functionality in addition the the .plot api.

its a chance to take the Pandas .plot api and really anchor it as a central api for quick, exploratory data analysis across much of the PyData ecosystem.

but as a contributor and user I am biased.

@jbednar
Copy link

jbednar commented Dec 6, 2022

Indeed, we are all biased. :-) Even though I am an hvPlot maintainer, I wouldn't argue for making hvPlot be the default backend, because the default plotting functionality is already perfectly adequate for many users, and there is so much documentation on the internet that would never get updated, so such users would perpetually be confused.

But I do think that new users could be better made aware that there is additional rich and powerful functionality that is easily accessible, beyond the basic Matplotlib implementation. I think a good way to do that would, as others have suggested above, be to move the current functionality into its own library with separate docs, taking care to preserve the API for it precisely, to avoid rocking the boat for current users. As already discussed, doing that would have benefits ranging from supporting a minimal pandas core package for installation, making viz-specific fixes and contributions easier, and reducing the complexity of the core package for maintenance.

Having made the default backend be its own package, it would be great if the Pandas docs could then present the available backends in a more balanced manner, instead of the current approach of jumping right into an exploration of every plot available in the default backend, and only then, after 65 screenfuls of such plotting, mentioning that there are any alternatives. I.e., the Pandas docs could show a few examples from the default backend, listing some good reasons for using it (should always be installable, few dependencies, widely documented in tutorials and examples, etc.), and pointing to that backend's docs for any further info. And then it could show a few examples from other backends, again listing why a user might want to select that backend (fully interactive JavaScript plotting, automatic widgets for exploring DataFrames, easily composable plots, look and feel for matching output from different plotting libraries, etc.), again pointing to that backend's docs. That way users would be better equipped to decide which approach fits their own needs best, which also helps avoid users having to put pressure on the default backend for any feature requests they have.

To me it seems like this approach would be more maintainable and provide users with better guidance than the current approach. And if someone then wanted to write a Seaborn backend and fight hard to make it option-for-option compatible, with a goal of eventually deprecating the current Matplotlib backend entirely, then good luck, but the community would get tangible benefits long before that would come through.

@MarcoGorelli
Copy link
Member Author

Thanks @jbednar for your considered thoughts

I think you're right - splitting pandas.plotting into a separate package would already deliver benefits, and would be lower-effort than writing a seaborn backend

Let's start with that then

Thanks everyone for chiming in, I'm impressed by how many people from different projects have shared their thoughts here!

@jbednar
Copy link

jbednar commented Dec 6, 2022

Pandas brings everyone together! :-)

@maximlt
Copy link

maximlt commented Dec 7, 2022

Hey all, I'm part of the HoloViz group together with Marc, Philipp and Jim, and spend quite some time maintaining hvPlot. I agree with the suggestion of splitting pandas.plotting and changing the docs so that users are more aware of the existing ecosystem of more or less compatible .plot APIs.

Having pandas.plotting in its own package will also have a nice side effect, it will make it easier for us to track potential evolution of the API, and possibly chime in as we did in this issue to provide feedback.

Happy to help pushing this, e.g. updating the docs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action Visualization plotting
Projects
None yet
Development

No branches or pull requests

10 participants