Serialization of built STRTree #705
Comments
Tree building time is dominated by sorting the items, so we could speed it up by creating a tree from a list of pre-sorted items. The existing C API gets us most of the way there using the following steps:
I did a performance test (dbaston@f5ca63b) to see the potential gains, which look to be about 70%. Is this worth it?
|
Thanks for the fast response and benchmark! I must say I am not a frequent user of distributed geometry workloads. Maybe @jorisvandenbossche can judge whether the 70% improvement is worth the effort? Are the geometries themselves actually necessary for the tree, or only some simplified (bbox?) version of them? |
Yes, only the bounding boxes (envelopes) are needed. In our shapely code, we insert the actual geometry, but the C API directly gets the envelope and only inserts that in the tree: Lines 3541 to 3549 in 32348a6
But creating a tree by inserting the envelopes ourselves should be equivalent. So if we wanted to "mimic" serializing a tree, we could store the sorted envelopes and recreate a tree from them. |
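To make the idea of recreating a tree from stored, already sorted envelopes concrete, here is a minimal pure-Python sketch. It is not the GEOS implementation and uses no GEOS or Shapely API: the `Envelope` tuple layout and the `union` and `pack_presorted` helpers are hypothetical names invented for illustration. It only shows that, once the leaf order is known, the upper levels can be built by grouping consecutive entries, with no sorting step.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout for illustration: (minx, miny, maxx, maxy)
Envelope = Tuple[float, float, float, float]


def union(envs: Sequence[Envelope]) -> Envelope:
    """Smallest envelope that covers every envelope in envs."""
    return (
        min(e[0] for e in envs),
        min(e[1] for e in envs),
        max(e[2] for e in envs),
        max(e[3] for e in envs),
    )


def pack_presorted(envelopes: Sequence[Envelope],
                   node_capacity: int = 10) -> List[List[Envelope]]:
    """Build tree levels bottom-up from envelopes that are ALREADY in leaf order.

    levels[0] holds the leaf envelopes as given; each higher level holds the
    union of up to node_capacity consecutive entries of the level below.
    No sorting happens here, which is the point of the sketch.
    """
    if node_capacity < 2:
        raise ValueError("node_capacity must be at least 2")
    levels = [list(envelopes)]
    while len(levels[-1]) > 1:
        below = levels[-1]
        levels.append([
            union(below[i:i + node_capacity])
            for i in range(0, len(below), node_capacity)
        ])
    return levels
```

For example, `pack_presorted(envs, node_capacity=2)` on four leaf envelopes yields three levels: the four leaves, two parent envelopes, and one root envelope. Whether this matches the node layout a real build would produce depends on how the library packs each level, so treat it as an illustration of the principle only.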
In that case the serialised form of a tree could be a vector of envelope corner coordinates ( In the proposal by @dbaston that is possible through the My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup. |
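As a rough illustration of a serialised form that is just a vector of envelope corner coordinates, the sketch below packs envelopes into a flat little-endian buffer of doubles using only the Python standard library. The byte layout (an 8-byte count followed by four doubles per envelope) and the names `dump_envelopes` and `load_envelopes` are assumptions made up for this example, not an existing GEOS, Shapely, or PyGEOS format or function.

```python
import struct
from typing import List, Sequence, Tuple

# Hypothetical layout for illustration: (minx, miny, maxx, maxy)
Envelope = Tuple[float, float, float, float]


def dump_envelopes(envelopes: Sequence[Envelope]) -> bytes:
    """Pack envelopes as: uint64 count, then 4 float64 values per envelope."""
    flat = [coord for env in envelopes for coord in env]
    header = struct.pack("<Q", len(envelopes))
    body = struct.pack(f"<{len(flat)}d", *flat)
    return header + body


def load_envelopes(blob: bytes) -> List[Envelope]:
    """Inverse of dump_envelopes; the stored order is preserved."""
    (count,) = struct.unpack_from("<Q", blob, 0)
    flat = struct.unpack_from(f"<{4 * count}d", blob, 8)
    return [tuple(flat[i:i + 4]) for i in range(0, 4 * count, 4)]
```

Because the order of the envelopes survives the round trip, a worker that receives the buffer could insert them in the stored (already sorted) order. The buffer costs 32 bytes per envelope plus the 8-byte header.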
You could serialize the envelopes however you like. If you did use WKB, I would probably represent the entire tree using a single
I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time. |
I am not sure how to judge this. I am probably not the right person, as I do not personally need this feature. Maybe a good metric would be the ratio between the time to deserialize a tree of N geometries and the time of a common operation on those geometries? Regarding the serialization, I tried to write an implementation that gets the sorted envelopes out of a built tree. Sadly, I only got the items back from the iterate call, not the geometries or their envelopes. So there might still be some work to do on the GEOS side here? (caspervdw/Shapely@fd06e9b) |
Close and move on? |
Copied from https://trac.osgeo.org/geos/ticket/1130
For usage in cluster computing workloads, the possibility of dumping / loading a built STRTree to/from some (internal) binary format through the CAPI would be a major enhancement.
Currently, a tree needs to be rebuilt in every worker, which has considerable overhead.
It is unclear to me how much effort this would require. But some users would be very happy with this enhancement.
Related:
https://github.com/pygeos/pygeos/issues/274
https://github.com/Toblerity/Shapely/issues/1033