
Optimization for heatmap aggregation with pandas #1174

Merged: 3 commits merged into master on Mar 5, 2017
Conversation

@philippjfr (Member) commented Mar 5, 2017

A 5-10x speedup for HeatMap aggregation when using a pandas/dask interface.

import numpy as np
import holoviews as hv

data = [(i, j, np.random.rand()) for i in range(500) for j in range(500)]

%%timeit
hv.HeatMap(data)

Before:

1 loop, best of 3: 8.61 s per loop

After:

1 loop, best of 3: 800 ms per loop
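The gain comes from handing the aggregation over to pandas rather than looping in Python. A minimal sketch of the idea (illustrative only, not the actual HoloViews code; `groupby` and `unstack` are standard pandas API):

```python
import numpy as np
import pandas as pd

# Sketch: aggregate (x, y, z) triples into a dense 2D grid, as a
# HeatMap must, using pandas' optimized groupby instead of Python loops.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'x': np.repeat(np.arange(50), 50),
    'y': np.tile(np.arange(50), 50),
    'z': rng.random(2500),
})

# Group duplicate (x, y) coordinates, reduce, then pivot to a grid.
grid = df.groupby(['x', 'y'])['z'].mean().unstack()
assert grid.shape == (50, 50)
```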

@philippjfr philippjfr force-pushed the heatmap_agg_speedup branch from 8ba8dd6 to 3bef914 Mar 5, 2017
@jbednar (Member) commented Mar 5, 2017

Looks good to me.

@jlstevens (Contributor) commented Mar 5, 2017

I find DF_INTERFACES a bit ugly but otherwise it looks good.

@philippjfr (Member, Author) commented Mar 5, 2017

Any naming you would prefer?

@jlstevens (Contributor) commented Mar 5, 2017

Could the interfaces not declare the sort of APIs they support?

@philippjfr (Member, Author) commented Mar 5, 2017

Could the interfaces not declare the sort of APIs they support?

This is bypassing the interface API, because pandas/dask have optimized implementations for this operation.

@jlstevens (Contributor) commented Mar 5, 2017

To clarify: shouldn't the interfaces declare the third party APIs (dataframe-like, array-like) that the data supports? This is also the API assumed by the interface class itself.

@jlstevens (Contributor) commented Mar 5, 2017

Something like:

if 'Dataframe' in reindexed.interface.interface_type:
    ...
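The suggestion could look something like the following minimal sketch (class names and the `interface_type` attribute are hypothetical, not the actual HoloViews API):

```python
# Hypothetical sketch: each interface class declares the third-party
# API family its underlying data exposes.
class Interface:
    interface_type = None

class PandasInterface(Interface):
    interface_type = 'dataframe'

class DaskInterface(Interface):
    interface_type = 'dataframe'  # dask mimics the pandas API

class ArrayInterface(Interface):
    interface_type = 'array'

def supports_dataframe_api(interface):
    # Dispatch on the declared API family instead of on a list of
    # interfaces built from successful/failing imports.
    return interface.interface_type == 'dataframe'

assert supports_dataframe_api(DaskInterface)
```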
@philippjfr (Member, Author) commented Mar 5, 2017

shouldn't the interfaces declare the third party APIs (dataframe-like, array-like) that the data supports?

Sure they could, although "dataframe-like" isn't a particularly solid guarantee of how similar and extensive the API is.

@jlstevens (Contributor) commented Mar 5, 2017

I think the same can be argued for the current approach. For instance, maybe only dask and pandas should claim to support the same interface type. I think such a mechanism is cleaner than building a list based on successful/failing imports.

@jlstevens (Contributor) commented Mar 5, 2017

In addition, you might want to always have 'dask' ahead of 'dataframe' in datatypes if the former can be considered a more highly optimized version of the latter. If dask isn't installed, it won't be used.

@philippjfr (Member, Author) commented Mar 5, 2017

In addition, you might want to always have 'dask' ahead of 'dataframe' in datatypes if the former can be considered a more highly optimized version of the latter.

That's not really a safe assumption: by default dask will simply use a single partition and may therefore actually be slower than pandas, and its lazy nature is probably a bit surprising/confusing for a user who doesn't know anything about it.

@jlstevens (Contributor) commented Mar 5, 2017

Ok, sure: that suggestion was orthogonal to the original one about declaring the type of data interface anyhow.

@philippjfr (Member, Author) commented Mar 5, 2017

Okay, I did two things in the end: I got rid of DF_INTERFACES and added a copy keyword argument to the Dataset.dframe method, which lets you avoid making copies when the data is already a dataframe.

@@ -516,13 +516,17 @@ def get_dimension_type(self, dim):
         return self.interface.dimension_type(self, dim_obj)
 
-    def dframe(self, dimensions=None):
+    def dframe(self, dimensions=None, copy=True):

@jlstevens (Contributor) commented Mar 5, 2017

I'm not sure the copy argument makes much sense if the element isn't already using a dataframe-based interface. For other interfaces, don't you always have to create a new dataframe, which would be the same as copy being fixed to True?

@philippjfr (Member, Author) commented Mar 5, 2017

That's true, it's more like avoid_copy, but I think providing a consistent API to get hold of a dataframe with the minimal amount of overhead is useful.

@philippjfr (Member, Author) commented Mar 5, 2017

That said, I'd also be fine having a utility for it instead.
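The behavior under discussion can be sketched roughly as follows (a hedged illustration of the proposed semantics; the standalone `dframe` function here is hypothetical, not the actual HoloViews implementation):

```python
import pandas as pd

def dframe(data, copy=True):
    # Sketch of the proposed copy keyword: dataframe-backed data can
    # skip the copy, everything else must build a new DataFrame anyway.
    if isinstance(data, pd.DataFrame):
        return data.copy() if copy else data
    # Non-dataframe interfaces always construct a fresh frame,
    # so copy is effectively fixed to True on this path.
    return pd.DataFrame(data)

df = pd.DataFrame({'x': [1, 2], 'z': [0.1, 0.2]})
assert dframe(df, copy=False) is df  # no-copy fast path
assert dframe(df) is not df          # default still copies
```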

@@ -134,7 +139,13 @@ def _aggregate_dataset(self, obj, xcoords, ycoords):
         dtype = 'dataframe' if pd else 'dictionary'
         dense_data = Dataset(data, kdims=obj.kdims, vdims=obj.vdims, datatype=[dtype])
         concat_data = obj.interface.concatenate([dense_data, obj], datatype=[dtype])
-        agg = concat_data.reindex([xdim, ydim], vdims).aggregate([xdim, ydim], reduce_fn)
+        reindexed = concat_data.reindex([xdim, ydim], vdims)
+        if pd:
if pd:

@jlstevens (Contributor) commented Mar 5, 2017

Why not use reindexed.interface.dframe(dimensions=None, copy=False) instead of exposing the copy keyword argument at the element level? For copy=False to work, you are already assuming a dataframe type interface is being used...

@jlstevens (Contributor) commented Mar 5, 2017

I suppose the other thing you could do is complain if copy=False is passed to the dframe method of any interface that isn't based on dataframes.

@philippjfr (Member, Author) commented Mar 5, 2017

For copy=False to work, you are already assuming a dataframe type interface is being used...

Because then I need conditional branches for the "is already a dataframe" and "convert to dataframe" paths again. I agree copy is confusing: you might assume that skipping the copy lets you mutate the dataframe and affect the original element, when the real point is to avoid making pointless copies.

@jlstevens (Contributor) commented Mar 5, 2017

Would there be any harm with the dataframe interfaces just avoiding pointless copies automatically? Then it doesn't have to be something the user needs to ever think about...

@philippjfr (Member, Author) commented Mar 5, 2017

In my usage of dframe I often create it and then assign to it, so that would be a bit of a pain.
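The trade-off being discussed can be seen in a small hypothetical example: if the dataframe interfaces skipped copies automatically, assigning to the returned frame would silently write through to the element's backing data.

```python
import pandas as pd

# A frame standing in for an element's internal data (illustrative).
backing = pd.DataFrame({'x': [0, 1], 'z': [1.0, 2.0]})

# No-copy path: the "dframe" handed back is the backing object itself.
df = backing
df['z'] = df['z'] * 10          # user-side assignment...
assert backing['z'].iloc[0] == 10.0  # ...mutates the element's data

# An explicit copy (the copy=True default) keeps the element untouched.
df2 = backing.copy()
df2['z'] = 0.0
assert backing['z'].iloc[0] == 10.0
```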

@philippjfr philippjfr force-pushed the heatmap_agg_speedup branch from cbb5b2e to fcb3504 Mar 5, 2017
@philippjfr philippjfr force-pushed the heatmap_agg_speedup branch from fcb3504 to 2f3ba87 Mar 5, 2017
@jlstevens (Contributor) commented Mar 5, 2017

I feel the new approach using a utility is much nicer, thanks!

Tests are passing now. Merging.

@jlstevens jlstevens merged commit 5d90c72 into master Mar 5, 2017
4 checks passed:
- continuous-integration/travis-ci/pr: The Travis CI build passed
- continuous-integration/travis-ci/push: The Travis CI build passed
- coverage/coveralls: Coverage increased (+0.004%) to 78.326%
- s3-reference-data-cache: Test data is cached.
@philippjfr philippjfr deleted the heatmap_agg_speedup branch Apr 11, 2017
3 participants