NPE when using very large STRtree with concurrent insertion #352

Open
samgurtman opened this Issue Dec 3, 2018 · 14 comments

samgurtman commented Dec 3, 2018

I'm trying to build an STRtree which contains 90,000,000 LineStrings. I'm getting a weird issue on the first query (when the tree builds). A quadtree-based version of this works fine. Is there anything to be aware of when creating a very large STRtree?

java.lang.NullPointerException: null
	at org.locationtech.jts.index.strtree.STRtree$1.compare(STRtree.java:83) ~[fastgeographic-1.0-SNAPSHOT-jar-with-dependencies.jar:?]
	at java.util.TimSort.binarySort(TimSort.java:296) ~[?:?]
	at java.util.TimSort.sort(TimSort.java:239) ~[?:?]
	at java.util.Arrays.sort(Arrays.java:1515) ~[?:?]
	at java.util.ArrayList.sort(ArrayList.java:1749) ~[?:?]
	at java.util.Collections.sort(Collections.java:177) ~[?:?]
	at org.locationtech.jts.index.strtree.STRtree.createParentBoundables(STRtree.java:123) ~[fastgeographic-1.0-SNAPSHOT-jar-with-dependencies.jar:?]
	at org.locationtech.jts.index.strtree.AbstractSTRtree.createHigherLevels(AbstractSTRtree.java:157) ~[fastgeographic-1.0-SNAPSHOT-jar-with-dependencies.jar:?]
	at org.locationtech.jts.index.strtree.AbstractSTRtree.build(AbstractSTRtree.java:107) ~[fastgeographic-1.0-SNAPSHOT-jar-with-dependencies.jar:?]
	at org.locationtech.jts.index.strtree.AbstractSTRtree.query(AbstractSTRtree.java:243) ~[fastgeographic-1.0-SNAPSHOT-jar-with-dependencies.jar:?]
	at org.locationtech.jts.index.strtree.STRtree.query(STRtree.java:205) ~[fastgeographic-1.0-SNAPSHOT-jar-with-dependencies.jar:?]

UPDATE: This only occurs when doing parallel loading of the STRtree. I believe there must be a thread-safety issue.

dr-jts commented Dec 5, 2018

I'm not aware of any issues, but have never tried loading that many items! How much memory do you have allocated?

If this is happening during concurrent access then perhaps it is a thread-safety issue. The build method is synchronized, though, so I'm not sure why that would happen.

Or could there be some issue with the sort code? A different kind of sort algorithm could be substituted to see if that makes a difference. (Maybe there's even a faster one?)


dr-jts commented Dec 5, 2018

Can you explain your comment about "parallel loading"? This error is occurring during the build phase, which should only happen after all loading (via STRtree.insert) is complete. If insert and build are running concurrently perhaps there is a race condition. You probably need to ensure this does not happen at the client code level, since even if the STRtree is made resilient to this, inserts after build will result in exceptions.
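The client-level coordination described here can be sketched with a plain executor barrier. This is not JTS code: `loadConcurrently` is a hypothetical method, and a synchronized list stands in for the STRtree's internal item list; the point is that `shutdown()`/`awaitTermination()` cleanly separates the parallel load phase from the build/query phase.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LoadThenBuild {
    // Insert items from many threads into a synchronized collection, then
    // wait for ALL insert tasks to finish before the build / first query.
    public static int loadConcurrently(int itemCount) {
        List<int[]> items = Collections.synchronizedList(new ArrayList<>());
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < itemCount; i++) {
            final int id = i;
            pool.submit(() -> items.add(new int[] { id, id }));
        }
        pool.shutdown(); // stop accepting new insert tasks
        try {
            // Barrier: build/query must not start until every insert is done.
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return items.size(); // safe to build the index now
    }
}
```

With this pattern the race between insert and build cannot occur, regardless of whether the index itself synchronizes its methods.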


samgurtman commented Dec 5, 2018

@dr-jts The issue only occurs if the inserting is done concurrently. I am doing all the inserts before I try to do any queries. I am also ensuring the first query is synchronized (which triggers the optimization). All subsequent queries are concurrent but only occur after the first query completes.

The issue does NOT occur if the inserts are also synchronized (one at a time).

Memory usage is around 300 gigabytes (Running on a high memory instance in AWS). Use case is a completely in-memory reverse geocoding service.


dr-jts commented Dec 5, 2018

Well, maybe that's the problem. I think ArrayList is not safe for concurrent access, and one is used to hold the items before the tree is built. The insert method could be made synchronized to fix this. If you can't modify JTS, can you wrap the tree in a class which is synchronized? And feel free to submit a PR for this fix.

Does synchronizing insert limit performance in your case? It's not really doing very much - all the work happens when the tree is built. If it is an issue possibly some sort of batch insert could be provided.
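The wrapper idea (and a possible batch insert) might look roughly like this. `SynchronizedIndex` is a hypothetical stand-in that uses a plain list in place of STRtree's internal item list; a real wrapper would delegate to `STRtree.insert` instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the wrapper approach: all access to a non-thread-safe index
// goes through synchronized methods, so concurrent loaders cannot corrupt
// the underlying ArrayList.
public class SynchronizedIndex {
    // Stand-in for STRtree's pre-build item list.
    private final List<double[]> index = new ArrayList<>();

    public synchronized void insert(double[] envelope) {
        index.add(envelope);
    }

    // A batch insert amortizes lock acquisition over many items.
    public synchronized void insertAll(List<double[]> envelopes) {
        index.addAll(envelopes);
    }

    public synchronized int size() {
        return index.size();
    }
}
```

Since insert does little work (the heavy lifting happens at build time), the lock here is held only briefly, which is why synchronizing it is unlikely to hurt load throughput much.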

Very cool about the JVM mem size! Good to know that JTS is useful (mostly) at these kinds of volumes.

Is the tree build time an issue? Probably there's some way to make that concurrent as well...


samgurtman commented Dec 5, 2018

@dr-jts

Synchronizing the insert doesn't seem to affect performance (although I haven't measured this). That bit is actually quite fast.

The tree build time is around 15 minutes. If that could be made concurrent that would be fantastic. Currently, I'm looking to work around the build time by seeing if I can serialize (Kryo) a fully built tree, and then de-serialize it at start up instead of rebuilding the tree every time (as the data only changes very occasionally).

I should mention that the actual number of objects in the tree is more like 800 million, as I'm breaking each linestring into its constituent segments before insertion.


dr-jts commented Dec 6, 2018

> Synchronizing the insert doesn't seem to affect performance (although I haven't measured this). That bit is actually quite fast.

Ok, then making the insert method synchronized sounds like a good thing to do.

> The tree build time is around 15 minutes. If that could be made concurrent that would be fantastic. Currently, I'm looking to work around the build time by seeing if I can serialize (Kryo) a fully built tree, and then de-serialize it at start up instead of rebuilding the tree every time (as the data only changes very occasionally).

Interesting to think about...

> I should mention that the actual number of objects in the tree is more like 800 million, as I'm breaking each linestring into its constituent segments before insertion.

wow^2...

@dr-jts dr-jts changed the title NPE when using very large stRTree. NPE when using very large STRTree. Dec 6, 2018

@dr-jts dr-jts changed the title NPE when using very large STRTree. NPE when using very large STRtree with concurrent insertion Dec 6, 2018


bogdartysh commented Dec 6, 2018

> I'm not aware of any issues, but have never tried loading that many items!

I have used the STRtree spatial index to sequentially load ~10M multipolygon objects on a daily basis for half a year, without an NPE. Building the index usually takes less than a minute on a "strong" server.


dr-jts commented Dec 7, 2018

@bogdartysh Impressive!

With this kind of load, it sure seems like there might be room for improvement. In the big picture there may well be better spatial indexes out there. And as for STRtree itself, perhaps there's some low-hanging fruit to pluck. Perhaps unroll the code to make fewer method calls?

And what about memory usage? Maybe there's a more compact representation possible?

Has anyone experimented with the nodeCapacity parameter to see if that makes a difference to query time, build time, and memory usage?


dr-jts commented Dec 7, 2018

So it seems like the action item here is to make insert synchronized?


jandam commented Dec 17, 2018

I don't agree with making STRtree.insert synchronized. ArrayList is not thread-safe either. STRtree is thread-safe after calling build, because it is a read-only structure. I think the user of STRtree should be responsible for correct use.
If you make STRtree.insert synchronized, then you also have to make STRtree.build synchronized, and every query must check (in a synchronized method/block) whether the tree is already built. That will have some impact on concurrent queries.
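The build-on-first-query check described above is usually implemented as double-checked locking on a volatile flag, which keeps the steady-state per-query cost down to a single volatile read once the tree is built. `BuildOnceIndex` is a hypothetical sketch of the pattern, not JTS's actual implementation:

```java
public class BuildOnceIndex {
    private volatile boolean built = false;
    private int buildCount = 0; // stand-in for the expensive STR build

    private void ensureBuilt() {
        if (!built) {                 // fast path: volatile read, no lock
            synchronized (this) {
                if (!built) {         // re-check under the lock
                    buildCount++;     // "build" runs exactly once
                    built = true;     // publish the built state
                }
            }
        }
    }

    public int query() {
        ensureBuilt();
        return buildCount;            // always 1 after the first query
    }
}
```

Even so, every query pays at least a volatile read, which is the concurrency impact being pointed out.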


jandam commented Dec 17, 2018

The slowest part of building an STRtree is sorting by the X or Y axis. This can be improved (in some cases) by using Arrays.parallelSort (Java 8 or later).
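A sketch of the parallelSort suggestion (`ParallelSortDemo` and `sortedCopy` are illustrative names, not JTS code):

```java
import java.util.Arrays;

// Arrays.parallelSort (Java 8+) is a drop-in, multi-threaded replacement
// for Arrays.sort on large primitive or Comparable arrays; for object
// arrays there is an overload taking a Comparator, which is what the
// STR build's envelope-centre sorts would need.
public class ParallelSortDemo {
    public static double[] sortedCopy(double[] xs) {
        double[] copy = Arrays.copyOf(xs, xs.length);
        Arrays.parallelSort(copy); // fork/join merge sort across cores
        return copy;
    }
}
```

For small arrays parallelSort falls back to a sequential sort internally, so it only helps at the scales discussed in this thread.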
STRtree can be made Serializable / Externalizable. Externalizable is faster and produces smaller data, but that mechanism may be removed in future Java versions.
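A minimal sketch of the Externalizable approach, assuming a node that only needs to persist its bounding envelope (`EnvelopeNode` is a hypothetical stand-in, not an actual STRtree class):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// With Externalizable the class controls its own wire format (here, four
// doubles per envelope), which is typically faster and smaller than the
// reflective default used by plain Serializable.
public class EnvelopeNode implements Externalizable {
    double minX, minY, maxX, maxY;

    public EnvelopeNode() {} // required public no-arg constructor

    public EnvelopeNode(double minX, double minY, double maxX, double maxY) {
        this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeDouble(minX); out.writeDouble(minY);
        out.writeDouble(maxX); out.writeDouble(maxY);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        minX = in.readDouble(); minY = in.readDouble();
        maxX = in.readDouble(); maxY = in.readDouble();
    }

    // Helper used only for demonstration: serialize then deserialize.
    public static EnvelopeNode roundTrip(EnvelopeNode n) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) {
                oos.writeObject(n);
            }
            try (ObjectInputStream ois =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (EnvelopeNode) ois.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

For a whole tree, the root node's writeExternal would recursively write its children, giving a compact, fast-loading snapshot of the built index.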
Another issue can be memory overhead for huge trees, so there is the possibility of using float or even short instead of double for storing the bounding envelopes. The STRtree can also be constructed and stored on disk, with only cached parts of the tree (several MB) remaining in memory. This is tested on 550M lines and polygons from OpenStreetMap.


samgurtman commented Dec 17, 2018

@jandam Can you point me to the OpenStreetMap example of the on-disk STRtree?

I’m currently looking at on disk indexes but I haven’t been able to find many.


jandam commented Dec 17, 2018

Unfortunately the code for the on-disk STRtree is not in a separate library. I can publish it on GitHub in raw form, but it will take some time. Currently I'm busy, so the nearest term is next year :-(


dsmiley commented Dec 18, 2018

Those wanting a high-performance disk-resident index for envelopes should strongly consider Apache Lucene. This capability has been there for years, and it improved substantially a couple of years ago with the addition of a KD-tree index to Lucene. See FloatRange (https://github.com/apache/lucene-solr/blob/branch_7_6/lucene/core/src/java/org/apache/lucene/document/FloatRange.java) or DoubleRange. It's really not much code and I think pretty easy to use, although the initial learning curve may be somewhat steep if you're completely new to Lucene, since there's a ton of functionality in it. If you need more help, just ask.
