Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization of built STRTree #705

Open
caspervdw opened this issue Oct 17, 2022 · 7 comments
Open

Serialization of built STRTree #705

caspervdw opened this issue Oct 17, 2022 · 7 comments
Labels
Enhancement New feature or feature improvement.

Comments

@caspervdw
Copy link
Contributor

Copied from https://trac.osgeo.org/geos/ticket/1130

For usage in cluster computing workloads, the possibility of dumping / loading a built STRTree to/from some (internal) binary format through the CAPI would be a major enhancement.

Now, a tree needs to be rebuilt in every worker, which has considerable overhead.

It is unclear to me how much effort this would require. But some users would be very happy with this enhancement.

Related:

​https://github.com/pygeos/pygeos/issues/274
​https://github.com/Toblerity/Shapely/issues/1033

@pramsey pramsey added the Enhancement New feature or feature improvement. label Oct 17, 2022
@dbaston
Copy link
Member

dbaston commented Oct 17, 2022

Tree building time is dominated by sorting the items, so we could speed it up by creating a tree from a list of pre-sorted items. The existing C API gets us most of the way there using the following steps:

  • Create a tree
  • Query the tree so that it gets built
  • Retrieve the items in tree-sorted order using GEOSSTRtree_iterate
  • Build a tree from items already in tree-sorted order

This gives only a minor performance gain because std::sort is not much cheaper on a vector that is already sorted. To get better performance, we need a way to signal that no sorting is needed. Since we lack a GEOSSTRtree_build function, we could add one with an isSorted parameter.

I did a performance test (dbaston@f5ca63b) to see the potential gains, which look to be about 70%. Is this worth it?

----------------------------------------------------------------
Benchmark                      Time             CPU   Iterations
----------------------------------------------------------------
BM_BuildTree             1395533 ns      1395409 ns          470
BM_BuildTreePresorted     430600 ns       430564 ns         1660

@caspervdw
Copy link
Contributor Author

Thanks for the fast response and benchmark!

I must say I am no frequent user of distributed workloads of geometries. Maybe @jorisvandenbossche can judge if the 70% improvement is worth the effort?

Are the geometries actually necessary for the tree or only some simplified (bbox?) version of them?

@jorisvandenbossche
Copy link
Contributor

Are the geometries actually necessary for the tree or only some simplified (bbox?) version of them?

Yes, only the bounding boxes (envelopes) are needed. In our shapely code, we insert the actual geometry, but the C API directly gets the envelope and only inserts that in the tree:

geos/capi/geos_ts_c.cpp

Lines 3541 to 3549 in 32348a6

GEOSSTRtree_insert_r(GEOSContextHandle_t extHandle,
GEOSSTRtree* tree,
const geos::geom::Geometry* g,
void* item)
{
execute(extHandle, [&]() {
tree->insert(g->getEnvelopeInternal(), item);
});
}

But creating a tree with inserting the envelopes ourselves should be equivalent. So if we would "mimick" serializing a tree, we could store the sorted envelopes, and recreate a tree from that.

@caspervdw
Copy link
Contributor Author

In that case the serialised form of a tree could be a vector of envelope corner coordinates ( (x1, y1, x2, y2) * n). That would be a factor 2-3 smaller in size than serializing the envelope geometries to wkb.

In the proposal by @dbaston that is possible through the iterate CAPI, but then we get a superfluous conversion to Geometry. For deserializing we would also need to construct geometries, only to be converted back to envelopes by the tree constructor.

My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup.

@dbaston
Copy link
Member

dbaston commented Oct 19, 2022

In that case the serialised form of a tree could be a vector of envelope corner coordinates ( (x1, y1, x2, y2) * n). That would be a factor 2-3 smaller in size than serializing the envelope geometries to wkb.

You could serialize the envelopes however you like. If you did use WKB, I would probably represent the entire tree using a single MultiLineString with two-point LineStrings representing the envelopes. Then build the tree with

GEOSGeometry* g = GEOSGeomFromWKB(buf, size);
int ngeoms = GEOSGetNumGeometries(g);
for (int i = 0; i < ngeoms; i++) {
  GEOSSTRtree_insert(tree, GEOSGetGeometryN(g, i), items[i]);
}

My feeling is that a new pair of functions (serialize/deserialize or dump/load) would form a more significant speedup.

I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time.

@caspervdw
Copy link
Contributor Author

I don't doubt that it would be faster, but I'm not sure I understand the usage where no-sort tree construction is going to be a significant part of execution time.

I am not sure how to judge this. I am probably not the right person as I am not personally needing this feature. Maybe a good metric would be the ratio between tree deserialization of N geometries and a common operation on those geometries?

Regarding the serialization, I tried to make an implementation to get the sorted envelopes out of a built tree. Sadly I only got the items back from the iterate and not the geometry or geometry envelopes. So there might be still some work on the GEOS side here? (caspervdw/Shapely@fd06e9b)

@pramsey
Copy link
Member

pramsey commented Jun 2, 2023

Close and move on?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement New feature or feature improvement.
Projects
None yet
Development

No branches or pull requests

4 participants