
Performance benchmarking #7

Closed
c42f opened this issue Nov 10, 2015 · 3 comments

Comments

@c42f
Contributor

c42f commented Nov 10, 2015

As discussed in JuliaGeometry/KDTrees.jl#20, nearest neighbor performance can depend rather strongly on the data distribution. For real data, there are various "interesting" cases which occur quite commonly, including at least the following (a rough sketch of data generators for these cases follows the list):

  • Query points can systematically be outliers relative to the data in the tree. This reduces the efficiency of early bailout during tree traversal.
  • Data in the tree can have outliers. This can affect the current tree-building heuristic by forcing many long, skinny rectangles.
  • Data in the tree can lie on lower-dimensional subspaces of the ambient space. It would be interesting to see how this affects query efficiency relative to the same data in lower dimensions.
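
To make the list concrete, here is a rough sketch of how such inputs could be generated for benchmarking; the dimensions, sizes, and variable names are illustrative assumptions, not anything taken from this package:

```julia
# Illustrative generators for the three cases above (sizes/dimensions arbitrary).

# Case 1: query points that are systematic outliers relative to the tree data.
tree_data       = rand(3, 10_000)            # tree points in the unit cube
outlier_queries = rand(3, 1_000) .+ 10.0     # queries far outside the cube

# Case 2: tree data containing a small fraction of distant outliers.
data_with_outliers = hcat(rand(3, 9_900), 100 .* randn(3, 100))

# Case 3: data lying on a low-dimensional subspace of a higher-dimensional space.
subspace_data = randn(10, 2) * rand(2, 10_000)   # 2D subspace embedded in 10D
```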

Is there a plan/preference for where benchmark code should go, or what form it should take? (#3 is obviously relevant.)

@KristofferC
Owner

There is no concrete plan right now, but here are some thoughts.

My suggestion is that we use the https://github.com/johnmyleswhite/Benchmarks.jl package, which seems to be becoming the de facto benchmarking package that others build their benchmarking infrastructure on top of.

The first step would just be to get some simple benchmarks going with a few different simple data distributions. In https://jakevdp.github.io/blog/2013/04/29/benchmarking-nearest-neighbor-searches-in-python/ the author tested a few types of trees with different data distributions. Getting something like that going (without any regression testing or other advanced features) would be a good first step.
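
For illustration, a minimal sketch of what such a first benchmark could look like, assuming this package's KDTree/knn API and using plain @elapsed timing in place of Benchmarks.jl; the particular distributions and sizes are placeholders:

```julia
using NearestNeighbors

# Rough first-pass benchmark over a few simple data distributions.
# Benchmarks.jl could be swapped in for more careful timing.
distributions = Dict(
    "uniform"   => rand(3, 100_000),
    "gaussian"  => randn(3, 100_000),
    "clustered" => hcat((randn(3, 10_000) .+ 10 .* rand(3) for _ in 1:10)...),
)

queries = rand(3, 1_000)

for (name, data) in distributions
    tree = KDTree(data)                  # build the tree
    knn(tree, queries[:, 1], 5)          # warm up / compile
    t = @elapsed knn(tree, queries, 5)   # time k = 5 nearest-neighbor queries
    println(rpad(name, 10), " ", t, " s")
end
```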

We could also see whether we get similar results to those in the link. The author mentions that the top-down approach of creating rectangles (which is also used in this package) can lead to bad performance in high dimensions, so that is something worth looking into. I also know that the BallTree is not as optimized as it could be in this package (for example, using the reduced distance when the metric is of Minkowski type).
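
As a side note on what "reduced distance" means here: for Minkowski metrics one can compare the p-th powers of distances during traversal and apply the root only once at the end, saving a root per comparison. A minimal illustration of the idea (not this package's internals):

```julia
# Compare squared Euclidean distances during the search and take
# the square root only once, for the final reported distance.
reduced_euclidean(a, b) = sum(abs2, a - b)    # squared distance, no sqrt

function nearest_index(points::AbstractMatrix, q::AbstractVector)
    best, best_rd = 0, Inf
    for (i, p) in enumerate(eachcol(points))
        rd = reduced_euclidean(p, q)          # cheap comparison quantity
        if rd < best_rd
            best, best_rd = i, rd
        end
    end
    return best, sqrt(best_rd)                # true distance recovered once
end
```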

A second step could be to get some representative real-life data that exhibits the problems mentioned above and see if there are any unexpected performance drops.

A third step would be to integrate it with BenchmarkTrackers.jl and get some regression testing going.

@KristofferC
Owner

The framework from #30 is now available, so just add benchmarks there as needed.

@c42f
Contributor Author

c42f commented Aug 24, 2016

Great, thanks for putting that together. As you can probably tell, I haven't had time to spend on this for ages :-/ Hopefully I'll have time again in the future.
