Add pipeline property to track data lineage #3967

Merged · 25 commits · Sep 24, 2019
Conversation

@jonmmease (Collaborator)

Overview

This PR adds a new pipeline property to the Dataset class. This property holds a list of (function, args, kwargs) tuples that represent the sequence of operations needed to transform the Dataset stored in the dataset property into an element equal to current element.

It also adds a new execute_pipeline method that can evaluate this sequence of functions on an input dataset. This makes it possible to reproduce the same sequence of operations on a new Dataset.
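Conceptually, replaying such a list of (function, args, kwargs) tuples amounts to folding them over an input, with each step receiving the previous step's output as its first argument. A minimal sketch of the idea (the `replay_pipeline` helper name is hypothetical; the actual method is `Dataset.execute_pipeline` and differs in detail):

```python
# Hypothetical sketch of replaying a pipeline of (callable, args, kwargs)
# tuples; each step's output becomes the next step's first argument.
def replay_pipeline(pipeline, data):
    result = data
    for func, args, kwargs in pipeline:
        result = func(result, *args, **kwargs)
    return result
```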

Relationship to other PRs

dataset property

The dataset property was added to the LabelledData class in #3919. This PR moves the dataset property down to the Dataset class, so there is no longer a dataset property on, for example, the Layout class. This reduces the scope of where dataset and pipeline need to be correct / consistent.

Histogram _operation_kwargs

This PR removes all special cases associated with Histogram elements. So the Histogram._operation_kwargs property added in #3921 has been removed.

select all dims

In #3924, the select method is updated to consider all dimensions in the Dataset stored in the element's dataset property. This PR does not do this; instead it provides the execute_pipeline method as a more powerful alternative for achieving the same goal. See the examples below.

link_selections

This PR will become a more powerful foundation for the automatic linked selection support being added in #3951.

Example 1: Points

Create a sample 3-dimensional dataset. x and y are independently drawn from the standard normal distribution, and r is the distance of each point from the origin.

import numpy as np
import pandas as pd
import holoviews as hv
from holoviews import dim
from holoviews.operation.datashader import rasterize, datashade, dynspread 
hv.extension('bokeh')

np.random.seed(1)
df = pd.DataFrame(np.random.randn(100, 2), columns=['x', 'y'])

# Add radius column
df['r'] = (df.x ** 2 + df.y ** 2) ** 0.5

ds = hv.Dataset(df)
points = ds.to.points(kdims=['x', 'y'], groupby=[])
points

[image: bokeh_plot-1]

Display the pipeline for the new points element

points.pipeline
[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []})]

Next, create a new points element by running execute_pipeline on a subset of the dataset stored in points.dataset. Note that it would not be possible to compute this subset using points.select directly, because it involves the r dimension, which is not a key or value dimension of points.

points * points.execute_pipeline(points.dataset.select(x=(0, None), r=(0, 1.5))) 

[image: bokeh_plot-2]

Example 2: Datashade

Create an RGB image element from points using the datashade and dynspread operations with dynamic=False.

points_rgb = dynspread(datashade(points, dynamic=False), dynamic=False, threshold=0.9)
points_rgb

[image: bokeh_plot-3]

Display the pipeline for points_rgb

points_rgb.pipeline
[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (datashade(...),  [],  {'dynamic': False}),
 (dynspread(...),  [],  {'dynamic': False, 'threshold': 0.9})]

Next, compute a new RGB element by calling the execute_pipeline method with a subset of the original dataset. Note that this is a selection that was not possible using the approach in #3924.

points_rgb + points_rgb.execute_pipeline(points_rgb.dataset.select(x=(0, None), r=(0, 1.5)))

[image: Screenshot_20190918_054413]

Example 3: Histogram

Next, repeat the same process using a Histogram element created from points.

hist1 = hv.operation.histogram(points, num_bins=10, dynamic=False, normed=False)
hist1

[image: bokeh_plot-4]

Display pipeline

hist1.pipeline
[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (histogram(...),  [],  {'num_bins': 10, 'dynamic': False, 'normed': False})]

Create new Histogram element with execute_pipeline

hist2 = hist1.execute_pipeline(hist1.dataset.select(x=(0, None), r=(0, 1.5))) 
hist1 * hist2

[image: bokeh_plot-5]

Example 4: Custom aggregation

In this example, create a Bars element from the result of aggregating an original Dataset.

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [2, 1, 3, 0, 10, 4],
                   'c': [0, 0, 0, 1, 1, 1]
                  })
ds = hv.Dataset(df, kdims=['c'], vdims=['a', 'b'])
bars = ds.aggregate('c', function=np.sum).to(hv.Bars)
bars

[image: bokeh_plot-6]

Display the pipeline:

bars.pipeline
[(holoviews.core.data.Dataset,
  [],
  {'kdims': [Dimension('c')], 'vdims': [Dimension('a'), Dimension('b')]}),
 (<function holoviews.core.data.Dataset.aggregate(...)>,
  ['c'],
  {'function': <function numpy.sum(...)>}),
 (holoviews.element.chart.Bars,
  [],
  {'label': '',
   'kdims': [Dimension('c')],
   'vdims': [Dimension('a'), Dimension('b')]})]

Create a new Bars element on a subset of the original dataset using execute_pipeline

bars * bars.execute_pipeline(bars.dataset.select(b=(3, None)))

[image: bokeh_plot-7]

@philippjfr (Member)

This is pretty much exactly what I expected when we discussed this, so I'm very happy to see that it seems to have worked. The _in_method flag is also what I imagined, and it should handle nested method calls. But I'm wondering (I haven't yet spotted it in the code) how this works for .apply(operation, ...) calls, for instance: does the operation get added twice in that case?

philippjfr added the tag: API and type: feature labels Sep 18, 2019
@jonmmease (Collaborator, Author) commented Sep 18, 2019

Yeah, thanks for working through the design with me!

In terms of apply, since this is an accessor (not a method) it doesn't cause _in_method to be set.

points.apply(hv.operation.histogram).pipeline
[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,
  [],
  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (histogram(...),
  [],
  {'dynamic': False})]

But that does remind me that I should add some tests for apply.

And, there might be a hole here if the thing apply calls is not already an operation. I'll take a look.

@jonmmease (Collaborator, Author)

> And, there might be a hole here if the thing apply calls is not already an operation. I'll take a look.

No, I don't think this is a problem. We only need to update the pipeline if the function passed to apply returns a Dataset object, and to do this the function would call out to an operation or call a method on the object.

points.apply(lambda p: p.select(x=(0, None))).pipeline
[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,
  [],
  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (<function holoviews.core.data.Dataset.select(self, selection_expr=None, selection_specs=None, **selection)>,
  [],
  {'x': (0, None)})]

Hmm, there are also the opts and redim accessors. Do you think these should be captured in the pipeline?

@philippjfr (Member) commented Sep 18, 2019

> We only need to update the pipeline if the function passed to apply returns a Dataset object, and to do this the function would call out to an operation or call a method on the object.

I frequently write functions that take an object, compute something from it, and then repack a new Dataset. For example, here's an apply function I just wrote for a dashboard:

    def get_table(ds):
        arr = ds.array()
        weights = list(zip(stocks.columns, arr[0, 2:])) if len(arr) else []
        return hv.Table(weights, 'Stock', 'Weight').opts(editable=True)

> Hmm, there are also the opts and redim accessors

redim should definitely be captured since without it the pipeline might be invalid. I have no strong opinion on opts but for completeness sake I guess we should do it.

@jonmmease (Collaborator, Author)

Ok, yeah. That's a good point about the apply function constructing a brand new object, so this will need to be captured separately from the PipelineMeta metaclass. Which is fine.

I'll work on apply, redim, and opts, next. Let me know if any other cases come to mind. Right now the following are covered:

  • Dataset methods
  • Operation subclasses
  • iloc and ndloc accessors
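As an illustration only (this is not the actual PipelineMeta code, which differs in detail), the method-wrapping idea can be sketched with a metaclass that decorates public methods so that, whenever a call returns a new tracked object, the (method, args, kwargs) step is appended to that object's pipeline:

```python
import functools

def _record(method):
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        result = method(self, *args, **kwargs)
        # Only record the step when the call produced a new tracked object
        if isinstance(result, Tracked) and result is not self:
            result.pipeline = self.pipeline + [(method, list(args), kwargs)]
        return result
    return wrapper

class PipelineMetaSketch(type):
    def __new__(mcs, name, bases, namespace):
        # Wrap every public method so pipeline steps are recorded automatically
        for attr, value in list(namespace.items()):
            if callable(value) and not attr.startswith('_'):
                namespace[attr] = _record(value)
        return super().__new__(mcs, name, bases, namespace)

class Tracked(metaclass=PipelineMetaSketch):
    def __init__(self, data):
        self.data = data
        self.pipeline = []

    def double(self):
        return Tracked([x * 2 for x in self.data])
```

The `Tracked`, `PipelineMetaSketch`, and `_record` names are hypothetical; the sketch just shows why a metaclass lets every method participate in pipeline tracking without each method having to record itself.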

@jonmmease (Collaborator, Author)

In 702e531 I added a new metaclass to support pipelines in the apply, redim, and opts accessors.

points.apply(
    lambda p: hv.Points(p.select(x=(0, None)).data)
).redim.label(x="The X Dim").opts(color='green').pipeline
[(holoviews.core.data.Dataset, [], {}),
 (holoviews.element.geom.Points,  [],  {'label': '', 'kdims': [Dimension('x'), Dimension('y')], 'vdims': []}),
 (holoviews.core.accessors.Apply, [], {'mode': None}),
 (<function holoviews.core.accessors.Apply.__call__(...)>,  [<function __main__.<lambda>(p)>],  {}),
 (holoviews.core.accessors.Redim, [], {'mode': 'dataset'}),
 (<function holoviews.core.accessors.Redim.__call__(...)>,  [None],  {'x': {'label': 'The X Dim'}}),
 (holoviews.core.accessors.Opts, [], {'mode': None}),
 (<function holoviews.core.accessors.Opts.__call__(...)>,  [],  {'color': 'green'})]

@johnzzzzzzz

Jon,
This feature looks great!
I am personally interested in model-view-control links between different hvplot diagrams.
It looks like a view's (hv.Points) pipeline could be used to update the view when the model selection (Dataset.select) changes. Could you show an example of that working with the bokeh backend? For example 1, suppose the two Points elements were in a Layout instead of an Overlay. Then if the ds Dataset had rows selected, those rows would be selected in both elements: all of the points would be displayed in the first Points element, but only the rows selected and then passed through the pipeline would be displayed in the second.

@jonmmease (Collaborator, Author)

Hi @johnzzzzzzz,

Have you seen #3951? This is work towards creating a workflow to automatically link selections between HoloViews elements (including those produced by hvplot). The next iteration of that PR is going to build on top of this pipeline work.

@johnzzzzzzz

I am excited about #3951 and would like to help create test cases.
I will try to figure out how to clone a branch that includes both this PR and #3951.
This feature may also help with panel 604, which concerns linked Elements not being supported.

@jbednar (Member) commented Sep 18, 2019

I'm excited too. Would it be possible for obj.pipeline to work as it does above (returning the list) while obj.pipeline() does what is currently invoked with obj.execute_pipeline()? Having words like execute in a function call slightly annoys me, because every function call executes something, so it seems sufficient to convey "calling" with the standard Python () call syntax alone. But on a quick glance at the property I can't tell if overloading it in that way would work, so I'm just proposing it here if it's possible.

@jonmmease (Collaborator, Author)

> Would it be possible for obj.pipeline to work as it does above (returning the list) while obj.pipeline() does what is currently invoked with obj.execute_pipeline()?

I don't think so, unless the thing returned by obj.pipeline isn't a standard Python list. We could return some object of our own that both represents the pipeline and evaluates it with __call__. But I'm not sure how intuitive this would be for users.

I'm definitely open to renaming execute_pipeline though!
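For the record, the callable-list idea mentioned above could look something like this (`CallablePipeline` is a hypothetical name; this was not implemented in the PR):

```python
# Hypothetical sketch: a list subclass that prints and compares like a
# plain list but can also replay the pipeline when called, so
# obj.pipeline(data) would behave like obj.execute_pipeline(data).
class CallablePipeline(list):
    def __call__(self, data):
        result = data
        for func, args, kwargs in self:
            result = func(result, *args, **kwargs)
        return result
```

The trade-off discussed below is whether a list-that-is-also-callable is intuitive for users, versus simply keeping a plain list plus a separately named method.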

@jbednar (Member) commented Sep 18, 2019

That's what I suspected. It would be easy to have something that prints like a list while being callable, but I agree that it's nicer to have it simply be a list when it's returned as a value. I don't have any suggestions for a better name, then.

@philippjfr (Member)

Looks good! I don't actually much like the group handling of operations, though. I think we should set the group default to None in most cases and then skip setting the group if it is None, e.g. in chain:

return processed if self.p.group is None else processed.clone(group=self.p.group) 

@jonmmease (Collaborator, Author) commented Sep 23, 2019

> Looks good!

Thanks! It is nice for pipeline to be a standard chain operation.

> I don't actually much like the group handling of operations. I think we should set the group default to None in most cases and then skip setting the group if it is None

That sounds good to me. I made the change in the chain operation in 50bd22a.

I think this PR is in pretty good shape now. Thanks for taking a look, and let me know if anything else comes to mind that we should do before merging.

@philippjfr (Member)

Happy to see this merged. I'll give @jlstevens a chance to review though.

@philippjfr (Member)

Okay, since he's on PTO for the foreseeable future I'm going to go ahead and merge.

philippjfr merged commit a58216c into master on Sep 24, 2019
@jonmmease (Collaborator, Author)

Thanks!

philippjfr deleted the pipeline branch on October 2, 2019