Skip to content
/ katy Public

A kd-tree C library. Maybe a Python extension one day.

Notifications You must be signed in to change notification settings

halfhorst/katy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Katy: A kd-tree C library and python C extension

This library is a personal project, conceived of for practice with a spatial data structure and digging into Cpython. There are far better kd-tree implementations in the wild, and this one is liable to be WIP on the master branch. For instance, the python extension doesn't exist, currently.

kd-trees are tree structures containing k-dimensional points that partition the points along one dimension at each level of the tree. There is considerable variety around this theme in different implementations that revolve around how best to select and split the points, and how to balance build time and query time. These trees are suitable for answering geometric queries like "give me the n closest points to this one" (k nearest neighbors) or "give me all the points that lie this close to this point" (range queries). Note that closeness here is ambiguous and requires definition.

The original paper by J.L Bentley [1] that proposed kd-trees simply cycled through axes in order and selected a partition value that was the median of the points along that axis. This was followed by a paper [2] that selected the axis with the greatest spread at each level of the tree. Later, a midpoint rule was proposed that selects the partition value based not on the median, but the midpoint. Significantly, this means that spread out data, after the first split, can lead to a subsequent split that is degenerate, e.g. all points lie on one side. This is ameliorated by sliding the partition so that each side captures at least one point, known as the sliding-midpoint rule [3]. There are many other variations out there.

Using the median vs. the midpoint has consequences on the build time and geometry of the tree structure. Finding the median requires a partial sort or median selection, whereas the sliding midpoint requires a full sort to determine if the split is degenerate. The midpoint results in trees that have regions of bounded aspect ratio, but the tree itself may be unbalanced, which could rear its head during some queries. The median, on the other hand, may generate extreme aspect ratios but the tree height is bounded. This is because the median is splitting the tree into equal regions by number of points whereas the midpoint is splitting the tree into equal regions by value along the axis.

This structure suffers from the curse of dimensionality -- as k increases, your queres will slowdown as the algorithm tends towards visiting all points, e.g. a brute force solution. Specifying a leaf size can help, which dictates a threshold number of points below which to stop splitting axes and create a leaf in the tree.

[1] Bentley, Jon Louis. "Multidimensional binary search trees used for associative searching." Communications of the ACM 18.9 (1975): 509-517.

[2] Friedman, Jerome H., Jon Louis Bentley, and Raphael Ari Finkel. "An algorithm for finding best matches in logarithmic expected time." ACM Transactions on Mathematical Software (TOMS) 3.3 (1977): 209-226.

[3] Maneewongvatana, Songrit, and David M. Mount. "It’s okay to be skinny, if your friends are fat." Center for Geometric Computing 4th Annual Workshop on Computational Geometry. Vol. 2. 1999.

Implementation

Katy splits using the median of the axis with the largest spread at each level. It finds the median using a quick select method similar to the partition routine of quicksort, inspired by sklearn's implementation. It stores indices into the data at the leaves, and also all indices in the subtree at each node.

Katy does not support insertion or deletion. Since no good balancing method exists (for a vanilla kd-tree), insertions and deletions lead to degenerate trees. If you want to change change the points in the tree, build a new tree.

Katy supports n nearest neighbor searches and range searches from a test point. The range searches are additionally specified with an array of radii for each axis in k.

Dependencies

Katy is written in c99 but the tests require googletest and it is not distributed with this project. It needs to be available to the linker during linking, so check out how to get a hold of it.

Building / Running

The Makefile's default target will build and run the tests. There is a test suite for the heap that is used internally and a test suite for the tree itself.

What's next

  • More extensive testing
  • The actual Python binding
  • Once in python, play around with benching and airspeed-velocity.

Other Spatial Things to Explore:

  • Ball tree, suitable for higher dimensions
  • Quad/octree
  • Bounding Volume Hierarchies and Axis-aligned Bounding Boxes as they pertain to rendering.
  • Spacefilling curves, Google's S2 geometry and Uber's H3.

About

A kd-tree C library. Maybe a Python extension one day.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages