
Experiment: Hash Grid (edges, vertices, stations) spatial index

Andrew Byrd edited this page Dec 17, 2014

Caveats

  • Envelope objects take a lot of memory (4 doubles) and are used heavily in the current JTS spatial index implementations (Quad-tree, R-tree based...). JTS spatial indexes are designed as general-purpose indexes that can store any object, so they have to store many envelopes to track the bounds of the items. Since we store edges and vertices, which both know their bounds implicitly, this produces redundancy: both implicit and explicit envelopes sit in memory at the same time.
  • The QuadTree uses fewer envelopes, but is slower than the R-tree. The R-tree relies on a non-reversible building phase to get its performance, with the drawback of being read-only once built. This prevents us from having a spatial index that could be re-used across graph builders (currently a new graph spatial index is re-created for each builder that needs it).
  • We've made some quick tests using the old custom OTP "HashGrid"; it seems to be almost as fast and probably consumes less overall memory. The HashGrid was written specifically with this goal of reducing memory consumption by envelopes. If it provides a significant improvement, it should be a viable replacement for spatial indexes throughout OTP. At the time, it was not considered a sufficiently tested replacement to be dropped into production systems, compared to the index from the widely used JTS library.
  • We could simply provide a new implementation of the "SpatialIndex" interface and let callers do the filtering themselves, using whatever knowledge they have of the stored data (be it geometry edges or point vertices). With this scheme we get a drop-in replacement for STRtree, with the possibility of switching implementations if needed (see the usage sketch after this list).
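
As an illustration of the last point, here is a small usage sketch (hypothetical code, not an existing OTP class; the com.vividsolutions JTS packages of the time are assumed) of programming against the JTS SpatialIndex interface and doing the precise filtering on the caller side. The cast to Geometry is an assumption about what was stored; a caller holding edges or vertices would filter with its own types instead.

```java
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.GeometryFactory;
import com.vividsolutions.jts.index.SpatialIndex;

import java.util.ArrayList;
import java.util.List;

public class IndexClient {

    /** Query any SpatialIndex implementation and filter the candidates precisely. */
    public static List<Geometry> findIntersecting(SpatialIndex index, Envelope queryEnv) {
        Geometry queryGeom = new GeometryFactory().toGeometry(queryEnv);
        List<Geometry> result = new ArrayList<>();
        // The index only returns envelope-level candidates (possibly false positives);
        // the caller filters them using its own knowledge of the stored objects.
        for (Object candidate : index.query(queryEnv)) {
            Geometry g = (Geometry) candidate; // assumes geometries were stored
            if (g.intersects(queryGeom)) {
                result.add(g);
            }
        }
        return result;
    }
}
```

Because both STRtree and a hash grid can implement SpatialIndex, the backing index can be swapped without changing this caller code.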

The hash grid spatial index

We are trying a new implementation based on an optimized long-keyed HashMap (Trove library) to store sparse bins, in a class implementing the JTS SpatialIndex interface. The advantage is to optimize memory use and to remove all the collisions present in the wrapped (original hash grid) version.
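
Below is a minimal sketch of what such a class could look like (not the actual OTP implementation; the class name, the Trove TLongObjectHashMap and the com.vividsolutions JTS packages are assumptions): bin coordinates are packed into a single long key, and only non-empty bins are ever allocated.

```java
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.index.ItemVisitor;
import com.vividsolutions.jts.index.SpatialIndex;
import gnu.trove.map.hash.TLongObjectHashMap;

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashGridSpatialIndex implements SpatialIndex {

    private final double binSize; // bin size in coordinate units
    private final TLongObjectHashMap<List<Object>> bins = new TLongObjectHashMap<>();

    public HashGridSpatialIndex(double binSize) {
        this.binSize = binSize;
    }

    /** Pack the (x, y) bin indexes into a single long key for the sparse map. */
    private static long key(long xBin, long yBin) {
        return (xBin << 32) | (yBin & 0xFFFFFFFFL);
    }

    private long bin(double coord) {
        return (long) Math.floor(coord / binSize);
    }

    @Override
    public void insert(Envelope itemEnv, Object item) {
        // Store the item in every bin intersecting its envelope;
        // bins are created lazily, so empty cells cost nothing.
        for (long x = bin(itemEnv.getMinX()); x <= bin(itemEnv.getMaxX()); x++) {
            for (long y = bin(itemEnv.getMinY()); y <= bin(itemEnv.getMaxY()); y++) {
                long k = key(x, y);
                List<Object> bin = bins.get(k);
                if (bin == null) {
                    bin = new ArrayList<>();
                    bins.put(k, bin);
                }
                bin.add(item);
            }
        }
    }

    @Override
    public List query(Envelope searchEnv) {
        // Items spanning several bins appear several times; de-duplicate with a set.
        Set<Object> result = new HashSet<>();
        for (long x = bin(searchEnv.getMinX()); x <= bin(searchEnv.getMaxX()); x++) {
            for (long y = bin(searchEnv.getMinY()); y <= bin(searchEnv.getMaxY()); y++) {
                List<Object> bin = bins.get(key(x, y));
                if (bin != null) result.addAll(bin);
            }
        }
        return new ArrayList<>(result);
    }

    @Override
    public void query(Envelope searchEnv, ItemVisitor visitor) {
        for (Object item : query(searchEnv)) visitor.visitItem(item);
    }

    @Override
    public boolean remove(Envelope itemEnv, Object item) {
        boolean removed = false;
        for (long x = bin(itemEnv.getMinX()); x <= bin(itemEnv.getMaxX()); x++) {
            for (long y = bin(itemEnv.getMinY()); y <= bin(itemEnv.getMaxY()); y++) {
                List<Object> bin = bins.get(key(x, y));
                if (bin != null) removed |= bin.remove(item);
            }
        }
        return removed;
    }
}
```

An item spanning several bins is stored once per bin, so the query de-duplicates with a set; this matches the duplicate-filtering discussion further down.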

Performance vs R-tree

(ms)                         Full indexing   10k queries         10k out-of-bounds
                                             edges   vertices    edges    vertices
------------------------------------------------------------------------------------
STRtree, original azimuth             1067    2920      3270        26        28
STRtree, optimized azimuth          (same)     806      2193      (same)    (same)
HashGrid                               498    9861     10604     58849     49399
HashGrid, no dups                      346    2391      2962      9646      8196
HashGridSI                             279    1464      1879        40        45
------------------------------------------------------------------------------------
  • HashGrid is the original HashGrid implementation (static wrapping array). "No dups" is a modification of it that removes duplicates from the query results (this improves performance by decreasing the envelope test count).
  • HashGridSI is the new implementation (Trove hash map), implementing the JTS SpatialIndex interface.
  • The optimized azimuth (see the other experiment result page) has the biggest impact in terms of performance. All tests (except the first one) are made with the optimized azimuth.
  • With an optimized hash grid version (simple rasterizing, 500 m bin size), we get comparable query performance.
  • Indexing is faster with all hash grid implementations; the difference is mainly due to the R-tree's "building" phase.
  • HashGrid spatial index is read-write; R-tree is read-only once built.
  • All hash grid tests use a 500 m bin size.

Memory improvements

In terms of memory, we again gain around 18%:

                              Graph heap, OSM only    Gain
-------------------------------------------------------------
Original                               123 060 000    ~1.9%
HashGrid (1 km x 1 km bins)            101 359 000     ~18%
-------------------------------------------------------------

On bin size

  • Various experiments show that a 500 m to 1 km bin size is optimal for both speed and memory (with smaller bins we gain a bit of speed, at the expense of memory).
  • 1 km is quite a large bin -- for a search in the denser part of a city you might have to iterate over and filter every street in the center. In the (old) HashGrid we eventually settled on small bins (200 m) but also a small grid, so that there was a lot of wrapping. Dense and sparse parts of the city then get overlaid, evening out the load across the grid. And it's quite fast to filter out distant objects (you don't have to do precise distance calculations).
  • We improved the envelope growing mechanism by taking the latitude projection factor into account, which reduces the number of false positives (false positives are now a bit more expensive to reject since we no longer cache the envelope); see the sketch after this list.
  • Items can belong to several bins, so we need to filter out duplicates. Reducing the number of false positives would help, however; that should be our main angle of attack.
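
To make the latitude projection point concrete, here is a minimal sketch (hypothetical helper, not the actual OTP code) of growing a lat/lon envelope by a distance in meters: the longitude delta is divided by the cosine of the latitude, so the expanded box stays tight at high latitudes and produces fewer false positives.

```java
import com.vividsolutions.jts.geom.Envelope;

public class EnvelopeGrower {

    private static final double METERS_PER_DEGREE_LAT = 111_111.0; // rough average

    /** Expand an envelope (in lat/lon degrees) by a buffer given in meters. */
    public static Envelope grow(Envelope env, double meters) {
        double dLat = meters / METERS_PER_DEGREE_LAT;
        // Use the envelope's central latitude to estimate the longitude scale factor.
        double cosLat = Math.cos(Math.toRadians((env.getMinY() + env.getMaxY()) / 2.0));
        double dLon = dLat / cosLat;
        Envelope grown = new Envelope(env);
        grown.expandBy(dLon, dLat); // expandBy(deltaX, deltaY) in JTS
        return grown;
    }
}
```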

Performance stats for 100k requests, comparing 1km and 500m bin sizes:

                              Index   Query               Out-of-bounds
                                      edges   vertices    edges   vertices
---------------------------------------------------------------------------
STRtree                         890    4091      8497       307      175
HashGridSpatialIndex:
  1000 m bins, no raster        279   14979     17669       478      438
  1000 m bins, raster           513   14546     16704       346      435
   500 m bins, no raster        389    7964     10470       492      751
   500 m bins, raster           683    6952      9822       721      599
---------------------------------------------------------------------------
  • With the "500 m raster" version the overhead compared to STRtree becomes rather small (6952 vs 4091 for edges, 9822 vs 8497 for vertices).

On static array vs dynamic map

Question: Is the main point of using the hash map for the bins to avoid allocating a lot of empty bins, as we would with a statically sized 2D array? The default load factor on the Trove hash maps is 0.5, so if more than 25 percent of the total grid cells are in use, the array storing them might actually be the same size as or larger than a plain 2D array. And I expect we'll be triggering rehashes as we add items to the index.

Answers:

  • Using a hash map is mostly about removing wrapping (and thus potential false positives) by removing any assumption about the covered area. Allocating a static array would require the client to provide hints about the covered region so the number of bins can be computed (not a big issue in our case, however, since we know the graph bounds before indexing).
  • The overhead of a map vs a static array is almost nil: accessing a bin is an array indirection in both cases (using Trove maps and good hashing). The computation on the load factor is right, but this should not add a lot of memory in the general case, and it allows us to handle sparse street networks (small clusters of streets around far-away train stations, for example).
  • Another important point is getting decent performance on out-of-bounds queries. With the "wrapped" implementation we get very bad performance for queries outside the covered bounds (lots of false positives, yet the query still has to be expanded). With the map implementation, out-of-bounds queries return mostly empty lists, so they are really fast.
  • Most of the memory is taken by the list instances of the bins (the number of objects per bin is rather large, so the list size is much higher than a single pointer). The memory impact of null bins is rather low.

On rasterizing

Currently objects inserted in the index are stored in all grid bins intersecting the bounding envelope. Envelopes of edges can sometimes be vastly larger than the edge itself. For example, imagine a long stretch of country highway running diagonally out on the edge of your graph. This can have a bounding box that extends all the way into the central city. That highway can be inserted into a large number of faraway cells, and might need to be filtered out of almost every search.

Most geometries are rather small (see the histogram of geometry length for various data). The average geometry length is around 100 m. Long stretches of road are cut by intersections anyway; really long segments are rare, even in rural areas.

The idea is to implement a simple "rasterizing" of LineString geometries (overriding insert) by looping over the segments and adding only the rectangular bounds of each individual segment. Even for long geometries, each line segment would not be too long, so that would greatly reduce the number of false bins. For an improved version we could also iteratively cut any segment whose length is larger than the bin size. (This overhead time would only be spent during indexing.) The advantage of this method is that the code is rather simple (simpler than a full exact rasterizer anyway).
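
A minimal sketch of that segment-level insertion (hypothetical helper; it works against any JTS SpatialIndex, including the hash grid sketch above):

```java
import com.vividsolutions.jts.geom.Coordinate;
import com.vividsolutions.jts.geom.Envelope;
import com.vividsolutions.jts.geom.LineString;
import com.vividsolutions.jts.index.SpatialIndex;

public class LineStringRasterizer {

    /** Insert an item once per segment of its geometry instead of once per whole envelope. */
    public static void insertBySegment(SpatialIndex index, LineString geom, Object item) {
        Coordinate[] coords = geom.getCoordinates();
        for (int i = 0; i < coords.length - 1; i++) {
            // The envelope of a single segment is much tighter than the envelope of the
            // whole geometry, so far fewer distant bins receive the item.
            Envelope segmentEnv = new Envelope(coords[i], coords[i + 1]);
            index.insert(segmentEnv, item);
        }
    }
}
```

Splitting segments longer than the bin size, as suggested above, would be a further refinement; the query side still has to de-duplicate since the same item can now come back from several segments.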

Load factor results (500m x 500m bins, edges index):

  • With raster: 7084 bins allocated, 200029 objs, 259217 entries (avg 36.59 entries/bin, 1.30 entries/object)
  • No raster: 7360 bins allocated, 200029 objs, 266534 entries (avg 36.21 entries/bin, 1.33 entries/object)

Rasterizing has a rather low impact. Since the average number of bins per object is already low by default (1.33), rasterizing only reduces it to 1.30. That makes sense: looking at the histogram of geometry sizes, most geometries are really small and fit in a single bin. Reducing the bin size makes sense given those values. The overhead of having the same object in several bins is rather low, both in terms of memory and speed.

Misc (notes on old HashGrid implementation)

  • The 'closest' method calculates the true distance to every object. It could just compare squared distances to avoid all the square roots, and could even filter target objects before squaring by comparing the distance on each axis (see the sketch after this list).
  • The query method copies the lists of target objects into a new list that is returned to the caller. This could be quite wasteful in dense regions. It should really just concatenate the cell lists on the fly, which can be done simply with Guava: http://guava-libraries.googlecode.com/svn-history/r46/trunk/javadoc/com/google/common/collect/Iterators.html#concat%28java.util.Iterator%29 (Note: We're still copying the elements including any number of false positives into a set. This is necessary as long as we have duplicates in the index.)
  • Natively it only supports pointlike objects, but I added some code to rasterize line segments into the grid cells so it should work for linear objects as well. Of course the results may then contain the same line segment more than once.
  • Note that it is possible to get an effective spatial index with only a 10 or 20km grid, even in a big city. The query results will include a large number of false positives, but they can be filtered down without doing any serious calculations (using trivial comparisons to the query bounding box).
  • Perhaps this filtering should be provided in special helper functions for edges and vertices -- the base hashgrid is generic and should be able to store references to objects that don't hold references to their own coordinates internally.
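
To illustrate the first note above, here is a minimal sketch (hypothetical helper, planar coordinates assumed) of finding the closest candidate with a cheap per-axis rejection test and squared-distance comparisons, so no square root is ever taken:

```java
import com.vividsolutions.jts.geom.Coordinate;

import java.util.Collection;

public class ClosestFinder {

    /** Find the candidate closest to 'target', ignoring anything farther than 'radius'. */
    public static Coordinate closest(Coordinate target, Collection<Coordinate> candidates, double radius) {
        Coordinate best = null;
        double bestSq = radius * radius;
        for (Coordinate c : candidates) {
            double dx = Math.abs(c.x - target.x);
            double dy = Math.abs(c.y - target.y);
            // Cheap axis test: if either axis distance already exceeds the radius,
            // the true distance must too, so skip without multiplying.
            if (dx > radius || dy > radius) continue;
            double distSq = dx * dx + dy * dy;
            if (distSq < bestSq) {
                bestSq = distSq;
                best = c;
            }
        }
        return best;
    }
}
```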

The documentation on this wiki is outdated and should not be used unless you are intentionally working with legacy versions of OpenTripPlanner. Please consult the current documentation at readthedocs.
