Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Geometry:::Union() performance using clustering #692

Closed
wants to merge 3 commits into from

Conversation

dbaston
Copy link
Member

@dbaston dbaston commented Sep 27, 2022

This PR implements a pre-clustering step for Geometry::Union() to divide inputs into disjoint sets before performing the union operation. Performance testing shows a large benefit in datasets with many disjoint polygons and no significant penalty in datasets with no disjoint polygons.

I tried using envelope intersection, geometry intersection, and geometry distance as the basis for the clustering. The testing results show that geometry distance is preferred, since its performance is essentially equivalent to geometry intersection and is safe in the case where snap-rounding causes connected outputs from disjoint inputs. Timings are as follows, using geosop union with 5 repeats (-r 5) for all except the most complex dataset where only one iteration was performed.

Dataset Iterations Geometry Clusters main [s] Geom isect [s] Speedup Geom dist [s] Speedup Env isect [s] Speedup Env Clusters
world 5 8226 12.496 1.62 87.0% 1.86 85.1% 7.42 40.6% 2549
watersheds 5 1 11.273 11.1 1.5% 10.94 3.0% 11.33 -0.5% 1
watersheds_buf 5 1 15.414 15.87 -3.0% 15.83 -2.7% 15.18 1.5% 1
counties 5 98 107.776 105.7 1.9% 107.4 0.3% 102.065 5.3% 66
subdiv 1 175694 217 44.6 79.4% 53 75.6% 196 9.7% 7278
voronoi 50 1 2.5 2.8 -10.8%
vt parcels 1 1 90 110 -23.6% 94 -5.6% 1

The datasets are available in https://drive.google.com/drive/folders/1YNDsce_YiewgOiafPeJJOF4_AXlenX1C?usp=sharing

@pramsey
Copy link
Member

pramsey commented Sep 27, 2022

The thing I don't understand here is, isn't CascadedUnion already effectively doing this? The tree puts near-to-each-other things early in the union order. Is this on top of, or in replacement of, cascadedunion?

@dbaston
Copy link
Member Author

dbaston commented Sep 27, 2022

This is on top of cascaded union. This prevents the pieces that don't intersect from going into the overlay engine together.

@pramsey
Copy link
Member

pramsey commented Sep 27, 2022

Counter-intuitive. I'd have thought, working bottom-up, the disjoint parts would mostly come together near the end and get quickly no-opped with an envelope intersection test.

@dbaston
Copy link
Member Author

dbaston commented Sep 27, 2022

The first and last columns ("geometry clusters" and "envelope clusters") give some idea of this. Canadian subdivisions (last row) form 175k geometry clusters but only 7k envelope clusters.

@libgeos libgeos deleted a comment from dr-jts Sep 27, 2022
@libgeos libgeos deleted a comment from dr-jts Sep 27, 2022
@libgeos libgeos deleted a comment from dr-jts Sep 27, 2022
@libgeos libgeos deleted a comment from dr-jts Sep 27, 2022
@dbaston
Copy link
Member Author

dbaston commented Sep 28, 2022

Maybe it's possible to do a reasonable algorithm selection based on average number of vertices/polygon. Above a certain number of vertices/polygon the union operation can be expected to be expensive enough that the pre-clustering isn't going to slow things down too much even if it accomplishes nothing. Here are the vertex count distributions for the current test datasets. The datasets that see meaningful slowdown (voronoi, parcels) have ~2 orders of magnitude fewer vertices than the others.

image

@dbaston
Copy link
Member Author

dbaston commented Feb 21, 2023

Updated to revert changes to the default union strategy. Instead, GEOSDisjointSubsetUnion is exposed to the C API so clients can use it if they like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants