Adding a plot API to HoloViews #2446

Closed
philippjfr opened this issue Mar 14, 2018 · 42 comments

@philippjfr
Member

Over the past few months I have been working on HoloViews-based plotting APIs for a number of libraries including intake, streamz and pandas. In general I have borrowed heavily from the pandas DataFrame plotting API while mostly staying consistent with the HoloViews plot options. The API defines a plot namespace on the dataset objects of the respective libraries, which provides a wide array of plot types, including .area, .bars, .box, .heatmap, .histogram, .kde, .line, .scatter, .table and .violin. A fully fleshed out example for the intake library can be seen here.

All of these APIs are almost identical and maintaining them separately does not make much sense, since any divergences will become quite annoying. Therefore I've been wondering whether it might make more sense to introduce the API to HoloViews instead, providing an easy and familiar introduction to HoloViews and a powerful companion to the .to interface.

The .to interface is very powerful when dealing with tidy data, however we have long struggled to deal with wide data, where observations along some dimensions are grouped by column rather than row (see #2341, #2015, #2162). The plot interface provides a clean solution to this problem, automatically grouping and overlaying each column/variable. Adding this API to Dataset would I think be a good option, providing an easy, more familiar and consistent API to specify plots (while still constructing declarative HoloViews objects), which I think could be made highly consistent with the HoloViews API.

As a brief summary I'll outline the two main ways of using the API:

  • Declaring explicit x/y columns and an optional by kwarg to group the data by another variable (this spelling is very similar to the .to method)
  • Declaring use_index or an explicit index column and optionally a list of columns to plot, which will overlay the different columns (useful for wide datasets).
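
As a rough illustration of the two calling styles, here is a minimal pure-Python sketch. The helper names series_from_tidy and series_from_wide and the sample data are hypothetical and not part of HoloViews or any of the libraries mentioned; the only point is that both styles end up with the same set of labelled series to overlay.

```python
from collections import defaultdict

# Hypothetical helpers (not a real library API): each returns
# {label: [(x, y), ...]}, i.e. one series per overlaid curve.

def series_from_tidy(rows, x, y, by=None):
    """Tidy data: one observation per row; `by` names the grouping column."""
    groups = defaultdict(list)
    for row in rows:
        label = row[by] if by is not None else y
        groups[label].append((row[x], row[y]))
    return dict(groups)

def series_from_wide(rows, index, columns=None):
    """Wide data: one series per column, all sharing the `index` column."""
    if columns is None:
        columns = [c for c in rows[0] if c != index]
    return {col: [(row[index], row[col]) for row in rows] for col in columns}

# The same measurements, stored tidy (one row per observation)...
tidy = [
    {"day": 1, "city": "A", "temp": 10},
    {"day": 2, "city": "A", "temp": 12},
    {"day": 1, "city": "B", "temp": 20},
    {"day": 2, "city": "B", "temp": 21},
]
# ...and wide (one column per city).
wide = [
    {"day": 1, "A": 10, "B": 20},
    {"day": 2, "A": 12, "B": 21},
]

tidy_series = series_from_tidy(tidy, x="day", y="temp", by="city")
wide_series = series_from_wide(wide, index="day")
```

In the wide case no reshaping is needed beforehand: the columns themselves act as the grouping variable.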

So far the interfaces I've designed only work with pandas/dask datasets but I'll soon be working on extending it to also cover xarray and geopandas types.

I think this API would complement the explicit declarative approach to constructing HoloViews objects and would therefore be a very valuable addition to the core library. However, we could also consider creating a new library for this interface, which other libraries could use, but this way we would not get the benefit of adding the plot interface to our Datasets (by default at least).

@jbednar
Member

jbednar commented Mar 14, 2018

Sounds like an excellent idea! I'd prefer it to stay part of HoloViews itself so that we can use it whenever is convenient.

@jlstevens
Contributor

Frankly, I'm not the least bit enthusiastic about this style of API.

There should be one-- and preferably only one --obvious way to do it.

I think holoviews core should either support this idea or the .to interface but not both. If you want to replace the .to interface, that would be a goal for holoviews 2.0.

That doesn't mean there couldn't be a separate extension that would live along holoviews, holoviews-bokeh, holoviews-mpl and holoviews-plotly when those repos are split out and moved to pyviz.

I do agree that maintaining multiple redundant codebases is annoying so I am happy to see the common code live somewhere: the question is where.

@jbednar
Member

jbednar commented Mar 14, 2018

Supporting easy use of wide datasets is very important in the real world, where you can't control choices made by someone providing data to you, and you don't always want to tidy everything up just to get a plot. We can always make our own data we generate tidy, and our own examples tidy, but that's not the situation most people are in.

That said, would there be a way to provide the wide-data support as part of .to in a way that establishes the implementation needed for each of the data libraries, leaving a relatively simple and static job of mapping that onto the data library APIs?

@jlstevens
Contributor

That said, would there be a way to provide the wide-data support as part of .to in a way that establishes the implementation needed for each of the data libraries, leaving a relatively simple and static job of mapping that onto the data library APIs?

I would much prefer this approach.

@philippjfr
Member Author

Frankly, I'm not the least bit enthusiastic about this style of API.

The number of people who are familiar with this style of API vastly exceeds the HoloViews userbase, and if we want to reach a larger number of people, providing an easy and consistent API that addresses most users' needs is essential.

This also ignores the central issue in HoloViews this proposal addresses which is the lack of an API that allows users to explore wide datasets, a recurring limitation, which absolutely needs to be addressed in some form.

That said, would there be a way to provide the wide-data support as part of .to in a way that establishes the implementation needed for each of the data libraries, leaving a relatively simple and static job of mapping that onto the data library APIs?

I think that's an avenue that might be worth considering, there's two main things I'd do to make the .to interface more powerful:

  • Leave the signature for dataset.to(Element, kdims, vdims, groupby) largely intact but provide alternative index and columns/fields keyword arguments to handle the wide dataset cases.
  • Make the namespace methods on .to (e.g. dataset.to.histogram) abstract away the differences between different element types, e.g. Distribution, Histogram and BoxWhisker/Violin all have slightly different constructors, but .to.kde, .to.histogram and .to.violin could all have exactly the same signature, just as .to.line, .to.area, .to.bars and .to.scatter could have identical signatures. Thinking about other data types, specifically gridded and path data, there would likely be further equivalence classes for .to.image, .to.contours and .to.surface and .to.path/.to.polygons respectively. In other words these would let you express what to plot without worrying about how it maps onto specific HoloViews element types.
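
To make the second point concrete, here is a small sketch of what a uniform-signature namespace could look like. Everything in it is an assumption for illustration: the three element classes are stand-ins with invented, deliberately inconsistent constructors (they are not the real HoloViews signatures), and StatsNamespace is a hypothetical name.

```python
# Stand-in element classes: NOT the real HoloViews constructors, just
# three invented call conventions that differ from one another.
class Histogram:
    def __init__(self, edges_and_counts):
        self.data = edges_and_counts

class Distribution:
    def __init__(self, values):
        self.data = list(values)

class Violin:
    def __init__(self, values, group_label):
        self.data, self.group = list(values), group_label


class StatsNamespace:
    """Hypothetical .to-style namespace: every method is called as (column, **options)."""

    def __init__(self, table):
        self.table = table  # {column name: list of numbers}

    def histogram(self, column, **options):
        values = self.table[column]
        bins = options.get("bins", 2)
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data
        counts = [0] * bins
        for v in values:
            # Clamp so that the maximum value falls into the last bin.
            counts[min(int((v - lo) / width), bins - 1)] += 1
        edges = [lo + i * width for i in range(bins + 1)]
        return Histogram((edges, counts))

    def kde(self, column, **options):
        return Distribution(self.table[column])

    def violin(self, column, **options):
        return Violin(self.table[column], group_label=column)


to = StatsNamespace({"height": [1.0, 2.0, 2.5, 4.0]})
# The caller only chooses *what* to plot; the call shape never changes.
elements = [getattr(to, kind)("height") for kind in ("histogram", "kde", "violin")]
```

The namespace absorbs the constructor differences, so switching between plot types within one equivalence class is a one-word change.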

If we extended .to in these two ways then a separate plotting API could build on top of that and it would mostly come down to handling streaming datasets and providing a simpler way of setting the options.

Nonetheless even if the plotting API does not live in HoloViews itself, it seems a shame not to offer Dataset.plot as a high-level API, just like the other downstream libraries like pandas, streamz or intake would. For my personal usage I'd probably resort to monkey patching it, if you object to having any mention of it in HoloViews itself.

@jbednar
Member

jbednar commented Mar 14, 2018

Extending .to in that way sounds great to me, as does having Dataset.plot() that matches pandas.plot(), xarray.plot(), streamz.plot(), and intake.plot(). For one thing, Dataset.plot() then acts as the reference against which all the other .plot APIs would be measured; it's the one canonical expression of that interface and all the others will just be slight variants, making it easier for us to document and to build documentation for those other APIs in reference to this one. For that reason I do favor it being part of HoloViews proper, regardless of whether we end up recommending it or using it more widely.

@philippjfr
Member Author

I fully agree, but I suspect @jlstevens will not. The limitation of .to will always be that it provides no convenient mechanism to set plot/style options, and once you've created a nested data structure using .to it becomes awkward to set them because you have to explicitly declare the type of element to set them on.

@jlstevens
Contributor

The number of people who are familiar with this style of API vastly exceeds the HoloViews userbase and if we want to reach a larger number of people providing an easy and consistent API that addresses most users needs is essential.

If people are familiar with this style, why would they switch to holoviews instead of continuing to use the tools they are already using? What is the point of offering something they already have? It seems to make more work for us for no benefit.

As the .to API already exists, I am happy to see it improved/extended/replaced but I don't want three APIs: the element API, the .to API and then this new API.

@philippjfr
Member Author

philippjfr commented Mar 14, 2018

What is the point of offering something they already have?

Because what they don't have is easy composability (+ and *), easy parameter exploration (HoloMap/DynamicMap), interactive bokeh plots (hover/zoom/selection), streaming capabilities and datashading capabilities.
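
For readers unfamiliar with the + and * operators mentioned here, the following toy model shows the idea of composition by operator. It is only a sketch under stated assumptions: the Element, Overlay and Layout classes below are minimal stand-ins written for this example and are not the HoloViews implementation.

```python
class Element:
    """Toy stand-in for a composable plot object; not the real HoloViews class."""

    def __init__(self, name):
        self.name = name

    def __mul__(self, other):
        # a * b: overlay both objects on shared axes.
        return Overlay(_items(self, Overlay) + _items(other, Overlay))

    def __add__(self, other):
        # a + b: place both objects side by side.
        return Layout(_items(self, Layout) + _items(other, Layout))

    def __repr__(self):
        return self.name


class Overlay(Element):
    def __init__(self, items):
        self.items = items

    def __repr__(self):
        return "(" + " * ".join(map(repr, self.items)) + ")"


class Layout(Element):
    def __init__(self, items):
        self.items = items

    def __repr__(self):
        return " + ".join(map(repr, self.items))


def _items(obj, kind):
    """Flatten nested containers of the same kind so a * b * c stays one overlay."""
    return list(obj.items) if isinstance(obj, kind) else [obj]


curve, points, hist = Element("Curve"), Element("Scatter"), Element("Histogram")
figure = curve * points + hist  # * binds tighter than +, as in arithmetic
```

Because each call returns an object rather than drawing immediately, the pieces can be combined after the fact without touching figures, axes or subplots.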

@jlstevens
Contributor

jlstevens commented Mar 14, 2018

They will never know about any of those things as they will just be sticking to the API they are already familiar with. This API is catering for people who don't want to learn anything new...so they won't. In other words we will just be offering what they already have except with the additional burden of maintaining everything.

@philippjfr
Member Author

philippjfr commented Mar 14, 2018

In other words we will just be offering what they have already except with the additional burden of maintaining everything.

I'll be maintaining the API anyway for intake, streamz and pandas at least so this is basically a moot point.

This API is catering for people who don't want to learn anything new...so they won't

It's not about the precise incantation of this API, there will be differences in any case because I'm not copying 100 different (and inconsistent) matplotlib-based options that the pandas matplotlib API uses. It's simply about familiarity and consistency, learning the incantations for a wide range of elements is a lot of learning overhead and this API will smooth over those differences by providing APIs that are consistent within a few broad classes of plots starting with charts and statistical plots as shown in the intake example and in future for path/shape data and gridded data. A lot of the benefits they will immediately get for free, e.g. bokeh interactivity and composition, the only additional thing they will have to learn is parameter exploration through groupbys which is one extra argument and therefore not a huge leap to make.

In any case, I'm happy to start by extending .to in the way I suggested above and developing the API in a separate repo. Once that's in place we can still decide whether to add the Dataset.plot namespace.

@jbednar
Member

jbednar commented Mar 14, 2018

I respectfully but vehemently disagree with the idea that people won't learn something new and won't get anything out of this. The proposed API gives people starting with one of the dataset types an easy way in to discovering all the features of Bokeh and HoloViews, eliminating what is typically a terminally large band gap that most people, most of the time, will fail to jump over.

@jlstevens
Contributor

In any case, I'm happy to start by extending .to in the way I suggested above and developing the API in a separate repo.

Improving the .to interface is pretty uncontroversial so yes, that is a good place to start.

@philippjfr
Member Author

philippjfr commented Mar 14, 2018

You mentioned that this somehow doesn't conform to the vision outlined in the SciPy paper, but going through the core design principles 1-by-1 it seems to me none of that original vision is lost:

• It must be easy to assign a useful and understandable default representation to your data. The goal is to keep the initial barrier to productivity as low as possible -- data should simply reveal itself.

This still applies, the API simply provides a convenient and consistent way to assign your data a useful representation after which it reveals itself. I'd argue it's superior on the "must be easy" front due to improved consistency, something we probably can't address in HoloViews itself until version 3.0 maybe.

• These atomic data objects (elements) should be almost trivially simple wrappers around your data, acting as proxies for the contained arrays along with a small amount of semantic metadata (such as whether the user thinks of some particular set of data as a continuous curve or as a discrete set of points).

Also still applies, the API still outputs elements providing atomic wrappers around your data.

• Any metadata included in the element must address issues of content and not be concerned with display issues -- elements should hold essential information only.

The core signature which consists of x/y (for tidy data) or index/columns (for wide data) of the API expresses the metadata associated with the element, other keywords simply define the visual representation.

• There are always numerous aesthetic alternatives associated with rich visual representations, but such option settings should be stored and implemented entirely separately from the content elements, so that elements can be generated, archived, and distributed without any dependencies on the visualization code.

The visual options are optional and whether these are specified in a separate .options call or as arguments to a function does not make any difference to how the "option setting [are] stored and implemented" or that the elements can be "generated, archived, and distributed without any dependencies on the visualization code."

Overall the visual options handled by the API basically reduce to the plot and style options that are shared across backends: e.g. cmap, color, size, logx, logy, alpha, xticks and yticks and I'm working on increasing that consistency. The main differences are a few more plotting-centric options like xlim, ylim and title. xlim and ylim are things I'd like to add anyway once extents has been fully deprecated and we do have a title_format option already. The main issue is of course width/height vs. fig_size/aspect options which we need to find a better solution for anyway. Other options can be defined too but they would have to be listed separately on an options keyword and could be backend dependent.

• As the principles above force the atomic elements to be simple, they must then be compositional in order to build complex data structures that reflect the interrelated plots typical of publication figures.

All API methods return compositional objects so none of this is lost.


Personally I'm in love with the API because you get the ease of use and consistency of a pandas/xarray-like plot API with all the benefits of HoloViews - it's the best of both worlds:

  • The number of concepts to learn is small, no need to learn about Elements, NdOverlay, HoloMap/DynamicMap etc.
  • The API is consistent: you don't need to learn about the differences between Histogram, Distribution and Violin/BoxWhisker, or Curve, Area and Bars, the API stays the same.
  • You get all the benefits of HoloViews composability which is vastly superior to having to deal with subplots, figures, axes, especially in scenarios where this plotting API is sufficient.
  • You get the same reproducibility and archiveability benefits you get from HoloViews when compared to matplotlib figure objects
  • You can still apply HoloViews operations and methods unlike useless matplotlib objects
  • Datashading can happen automatically (or even be requested via intake Catalogue yaml specs)

Anyway, we can shop this API around with intake, pandas, dask, streamz, xarray and geopandas and if there's strong uptake we can make the decision at the HoloViews level. Therefore I'm happy to postpone this discussion.

@jlstevens
Contributor

I'll read through your response shortly but for now I'll just say that my biggest problem is having a method called Dataset.plot then explaining that what it returns is explicitly not a plot.

@philippjfr
Member Author

I'll read through your response shortly but for now I'll just say that my biggest problem is having a method called Dataset.plot then explaining that what it returns is explicitly not a plot.

That is true, in other libraries the name makes sense, in HoloViews not so much.

@jlstevens
Contributor

jlstevens commented Mar 14, 2018

I want to be clear that I'm not trying to put my foot down and say 'no' to this idea - I think it is inevitable in some form and I am also against the current code duplication.

I just want to find a way to make this API available and useful while not confusing the message about the separation between data/plotting/options that is at the core of the design. I think there are some ideas that make it more palatable to me, for instance documenting this on pyviz.org - a website explicitly about getting different tools to work together - which can point to holoviews.org (which of course can also point back).

One thing which would make me happier would be if it was something like Dataset.visualizable (too long and awkward I know!) in HoloViews but .plot in other libraries.

@jlstevens
Contributor

jlstevens commented Mar 14, 2018

Or maybe plottable? For holoviews it really should be a noun not a verb...

Edit: I do realize plot is also a noun but it is also a verb. The difference is that plottable conveys that an object is returned and not just some rendering to the screen.

@jlstevens
Contributor

I just checked..our abstract class is ViewableElement so it could be .viewable. As it is an abstract class, if you prefer plottable then I wouldn't mind it being renamed to PlottableElement for consistency...

@jlstevens
Contributor

I would be pretty happy calling it Dataset.viewable tbh.

@jbednar
Member

jbednar commented Mar 14, 2018

I think it makes things worse, not better, if the same API has a different name in different contexts. For better or worse, it's called .plot(), and I don't actually think there's anything wrong with that, because if you type X.plot() in a Jupyter notebook, you get a plot, in any of the scenarios considered here. So the user invokes an action plot(), and gets a plot (noun) back. We can't call it .annotate_this_data_so_that_it_is_an_instantly_visualizable_yet_still_composable_and_sampleable_object_and_if_this_is_the_right_context_then_also_actually_plot_it(), right? :-)

I vote for calling it the PyViz .plot() API, explaining (a) that it's supported across pandas, xarray, streamz, geopandas, intake, holoviews, and geoviews, and (b) that regardless of the context, it returns HoloViews objects, which you can find out more about at holoviews.org but which effectively work like plots that can be composed.

@jlstevens
Contributor

... X.plot() in a Jupyter notebook, you get a plot

That isn't true as this won't render anything:

foo = X.plot()

I vote for calling it the PyViz .plot() API, explaining (a) that it's supported across pandas, xarray, streamz, holoviews, and geoviews, and (b) that regardless of the context, it returns HoloViews objects, which you can find out more about at holoviews.org but which effectively work like plots that can be composed.

I can agree with all that, just saying that it returns a holoviews viewable object (and then in holoviews it is Dataset.viewable). With the pyviz docs suggestions, this is now the last sticking point for me. I really think Dataset.viewable is correct.

@jlstevens
Contributor

jlstevens commented Mar 14, 2018

At the very least, we can alias Dataset.plot to Dataset.viewable and strongly recommend the former in a holoviews context, explaining why.

@jbednar
Member

jbednar commented Mar 14, 2018

Any other name makes it worse, not better, because it becomes more confusing, not less.

And I didn't say foo = X.plot(), I said X.plot()! :-)
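
The distinction being argued over (a bare expression is displayed, an assignment is not) can be reproduced with the standard interactive display hook. This is a minimal sketch: the Viewable class and the run_cell helper are hypothetical names for this example, and a Jupyter notebook uses its own richer display machinery, so treat this as an analogy rather than a description of Jupyter internals.

```python
import sys


class Viewable:
    """Hypothetical stand-in for the object returned by a .plot() call."""

    def __repr__(self):
        return "<rendered plot>"


def run_cell(source, namespace):
    """Run one statement as an interactive prompt would; return what was displayed."""
    shown = []

    def hook(value):
        # The interactive display hook is only called for expression statements.
        if value is not None:
            shown.append(repr(value))

    previous = sys.displayhook
    sys.displayhook = hook
    try:
        # "single" mode compiles the source as an interactive statement.
        exec(compile(source, "<cell>", "single"), namespace)
    finally:
        sys.displayhook = previous
    return shown


ns = {"plot": Viewable}
displayed_bare = run_cell("plot()", ns)            # bare expression: displayed
displayed_assigned = run_cell("foo = plot()", ns)  # assignment: nothing displayed
```

After the second call nothing has been shown, but ns["foo"] holds the returned object, which is the handle needed for any later composition.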

@jlstevens
Contributor

jlstevens commented Mar 14, 2018

The difference is that in other libraries which plot, foo=X.plot() will render to the screen and foo will be None.

@philippjfr
Member Author

That gives me an idea for another option, we could subclass Dataset in pyviz (or wrap it in some other way). Much in the same way pyviz provides a global place to get the imports it could provide a universal Dataset object, providing a starting place to explore a Dataset of almost any type.

@jlstevens
Contributor

That gives me an idea for another option, we could subclass Dataset in pyviz (or wrap it in some other way). Much in the same way pyviz provides a global place to get the imports it could provide a universal Dataset object, providing a starting place to explore a Dataset of almost any type.

I have no objection to this approach if you are happy with it.

@jbednar
Member

jbednar commented Mar 14, 2018

The difference is that in other libraries which plot, foo=X.plot() will render to the screen and X will be None.

And thus no one is going to do that with the other libraries, so it's a moot point...

@jlstevens
Contributor

And thus no one is going to do that with the other libraries, so it's a moot point...

That is exactly the problem! They won't do that, which means they won't have a handle on the result and therefore won't use the compositionality that holoviews offers!

@jbednar
Member

jbednar commented Mar 14, 2018

We can worry about whether a subclassed Dataset goes in PyViz or HV itself later; no rush on that...

@philippjfr
Member Author

Okay, then the last thing for now: what to call the plotting API package, hvplot?

@jlstevens
Contributor

jlstevens commented Mar 14, 2018

Maybe just plot ... though, where is this living again? I assume if it needs a package then it won't be in holoviews itself, in which case plot is fine. It isn't right for holoviews itself though (it should live where Dataset is already)...

@jbednar
Member

jbednar commented Mar 14, 2018

I'm not sure what you mean; Dataset is in holoviews.core.data, which is in holoviews itself, and indeed in holoviews core?

@philippjfr
Member Author

Yeah, sorry, I meant the new repo and library I'll be developing the API in.

@jbednar
Member

jbednar commented Mar 14, 2018

Maybe pyviz/pvplot.

@jlstevens
Contributor

Why does it need a prefix? Why not pyviz/plot?

@philippjfr
Member Author

For the import, import plot doesn't seem sensible.

@jbednar
Member

jbednar commented Mar 14, 2018

When I say pyviz/pvplot, pyviz is the organization name, and pvplot is the repo name and module name. And so when it becomes a module, it's import pvplot, not the too generic import plot, which is surely going to clash with something.

@philippjfr
Member Author

philippjfr commented Mar 14, 2018

Not sure I like pv as an acronym, apart from giving me flashbacks to my PhD, I don't think pyviz needs an acronym and hvplot is more descriptive since the interface returns HoloViews objects not pyviz objects.

@rsignell-usgs

I vote for hvplot! 😺

@philippjfr
Member Author

@rsignell-usgs hvplot now exists (see https://hvplot.pyviz.org/). I'll try to finish a blog post to formally announce it this week.

@rsignell-usgs

@philippjfr, yes, I was following the scipy2018 pyviz tutorial and the first lesson was on hvplot!

It blows my mind that:

import xarray as xr
import hvplot.xarray  # registers the .hvplot accessor on xarray objects

url = 'http://thredds.ucar.edu/thredds/dodsC/grib/FNMOC/WW3/Global_1p0deg/FNMOC_WW3_Global_1p0deg_20180818_0000.grib1'
ds = xr.open_dataset(url)
ds['sig_wav_ht_surface'].hvplot(groupby='time1', clim=(0,5))

produces:
[Screenshot: 2018-08-19_14-35-04]

Here's the full notebook. Amazing! Thanks!
