Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up NdMapping.groupby #349

Merged
merged 8 commits into from Dec 12, 2015
Merged

Speed up NdMapping.groupby #349

merged 8 commits into from Dec 12, 2015

Conversation

@philippjfr
Copy link
Member

@philippjfr philippjfr commented Dec 11, 2015

This PR refactors the NdMapping.groupby operation into a separate function and provides an alternative implementation based on Pandas, which is significantly faster for large datasets. You can see the linear performance scaling by the old implementation and might just make out the sublinear performance of pandas, which becomes very significant for large datasets >10000 items. This is a temporary workaround until we come up with a general solution data API for NdMapping types that's being discussed in #347.

image

@philippjfr philippjfr added this to the v1.4.1 milestone Dec 11, 2015
import pandas
ndmapping_groupby = ndmapping_groupby_pandas
except:
ndmapping_groupby = ndmapping_groupby_python

This comment has been minimized.

@jlstevens

jlstevens Dec 11, 2015
Contributor

I would make this a parameterized function (i.e a class) that uses one of two possible bothmethods in __call__. My only other comment is that we need some docstrings here...

yield [((), (v.ix[0],))]


def ndmapping_groupby_pandas(ndmapping, dimensions, container_type, group_type, **kwargs):

This comment has been minimized.

@jlstevens

jlstevens Dec 11, 2015
Contributor

From what I can see, this is actually used at the MultiDimensionalMapping level. Just a minor comment, as this name is certainly less unwieldy than multidimensional_groupby_pandas and I doubt this would cause much confusion.

@jlstevens
Copy link
Contributor

@jlstevens jlstevens commented Dec 11, 2015

Definitely a very valuable PR: it cleans up groupby on MultiDimensionalMapping, moves useful functionality into util and most importantly offers a major performance improvement (for people with pandas installed) with very little code.

I'm happy to implement my suggestion of turning ndmapping_groupby into a parameterized function once the tests are passing...

@philippjfr
Copy link
Member Author

@philippjfr philippjfr commented Dec 12, 2015

I'm now done with this, I'd be happy if you refactored it into ParameterizedFunction, then we can merge.

Just plotted the improvement factor as a function of samples, huge difference for large N:

image

@jlstevens
Copy link
Contributor

@jlstevens jlstevens commented Dec 12, 2015

Great! The key thing is that all the tests are passing now...

My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby?

Shouldn't take long to do and I am happy to change it if you are busy.

@philippjfr
Copy link
Member Author

@philippjfr philippjfr commented Dec 12, 2015

My only comment now is whether you are happy for me to make this into a parameterized function? Or do you object to having a single parameterized function for groupby?

Must have missed my comment somehow:

I'm now done with this, I'd be happy if you refactored it into [a] ParameterizedFunction, then we can merge.

@jlstevens
Copy link
Contributor

@jlstevens jlstevens commented Dec 12, 2015

Sorry yes - I skimmed your reply too quickly. I'll do the final refactor now.

philippjfr and others added 2 commits Dec 12, 2015
@jlstevens
Copy link
Contributor

@jlstevens jlstevens commented Dec 12, 2015

If the tests pass now, I'll go ahead and merge.

jlstevens added a commit that referenced this pull request Dec 12, 2015
Significant speed up of NdMapping.groupby
@jlstevens jlstevens merged commit 6b9fe08 into master Dec 12, 2015
3 checks passed
3 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage increased (+0.02%) to 71.238%
Details
@jlstevens jlstevens deleted the ndmapping_groupby branch Dec 12, 2015
@jbednar
Copy link
Member

@jbednar jbednar commented Dec 12, 2015

Excellent!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

3 participants