Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add clustering functions #688

Merged
merged 3 commits into from
Sep 26, 2022
Merged

Add clustering functions #688

merged 3 commits into from
Sep 26, 2022

Conversation

dbaston
Copy link
Member

@dbaston dbaston commented Sep 21, 2022

This PR is a rebase of #388 which died on the vine in 2021. It adds the DBSCAN, simple distance, and intersection clustering methods available in PostGIS. Also adds clustering based on envelope intersection and envelope distance.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 21, 2022

I have thought that a nice interface for clustering would be to return a list of cluster ids for the list of input geometries. That way, the client can (a) know which input geometry belongs to each cluster, and (b) choose whether or not to materialize the clusters as new geometries (and a utility function could support this). This would also allow the functionality to be used to back window functions in PostGIS.

Does this PR support this kind of interface?

@dr-jts
Copy link
Contributor

dr-jts commented Sep 21, 2022

What is the use case for the envelope clustering functions?

@dr-jts
Copy link
Contributor

dr-jts commented Sep 21, 2022

Is ClusterGeometryIntersecting just a special case of ClusterGeometryDistance (with distance = 0) ?

@dbaston
Copy link
Member Author

dbaston commented Sep 21, 2022

I wouldn't expect this code to be desirable for PostGIS because PostGIS already has its own implementation that does not require converting everything into GEOS types. But maybe it would be useful for other bindings or for algorithms within GEOS.

Does this PR support this kind of interface?

At the C++ level, it can and should be modified to support this. In the C API....do you want to post signatures for your ideal C interface? I think at the time I wrote this, I was trying to avoid introducing new types to the C API while also making the common use case of "assemble these geometries into clusters" relatively painless.

Assembling clusters from a list of IDs for each input is a actually a bit painful in C (You can see how I implemented it for PostGIS at https://github.com/postgis/postgis/blob/1dbae611ab087f5c78d94c4a8955196bf887fa9c/liblwgeom/lwgeom_geos_cluster.c#L542.). But maybe nobody is likely to be doing this; they'd be assembling the clusters in Python etc.

What is the use case for the envelope clustering functions?

Fast alternative to distance or intersection for coarse partitioning of data.

Is ClusterGeonmetryIntersecting just a special case of ClusterGeometryDistance (with distance = 0) ?

Like in PostGIS, it's implemented using the intersection predicate which is not always consistent with distance == 0.

@dbaston
Copy link
Member Author

dbaston commented Sep 21, 2022

I changed the C++ interface a bit and the following function is now public:

Clusters AbstractClusterFinder::cluster(const std::vector<const geom::Geometry*>& g);

this can be transformed into a vector of cluster IDs whose elements correspond to g using Clusters::getClusterIds()

@dbaston
Copy link
Member Author

dbaston commented Sep 21, 2022

which I now realize will not work for DBSCAN clustering because not all inputs are assigned a cluster. Would you want some placeholder cluster ID value used in this case? Or an array of booleans to accompany the array of IDs?

Edit: I guess the two-array method is how PostGIS does it internally.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 21, 2022

Like in PostGIS, it's implemented using the intersection predicate which is not always consistent with distance == 0.

That's not good. Do you have an example? Is this a true semantic difference, or just a robustness issue?

@dr-jts
Copy link
Contributor

dr-jts commented Sep 21, 2022

which I now realize will not work for DBSCAN clustering because not all inputs are assigned a cluster. Would you want some placeholder cluster ID value used in this case? Or an array of booleans to accompany the array of IDs?

An ID indicating "no cluster assigned" seems pretty simple. E.g. ID = -1.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 21, 2022

In the C API....do you want to post signatures for your ideal C interface?

How about an array of ints? I suppose the size is needed. Infer from the input size? Provide as an out parameter? Indicate via a sentinel value in the array?

@dbaston dbaston added the Enhancement New feature or feature improvement. label Sep 21, 2022
@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

That's not good. Do you have an example? Is this a true semantic difference, or just a robustness issue?

dan=# SELECT array_length(ST_ClusterIntersecting(geom), 1) FROM tl_2016_50017_linearwater;
 array_length 
--------------
          558
(1 row)

dan=# SELECT array_length(ST_ClusterWithin(geom, 1e-300), 1) FROM tl_2016_50017_linearwater;
 array_length 

--------------
          482
(1 row)

tl_2016_50017_linearwater.zip

That said, maybe distance == 0 and intersects are equivalent in GEOS.

How about an array of ints? I suppose the size is needed. Infer from the input size? Provide as an out parameter? Indicate via a sentinel value in the array?

I'd be fine with removing the C API for now. This interface would have been painful for my use case for the reasons I described above, but maybe I'm the only one.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 22, 2022

I'd be fine with removing the C API for now. This interface would have been painful for my use case for the reasons I described above, but maybe I'm the only one.

Not sure I understand. A lot of use of GEOS comes through the C API, so that seems important to provide?

@dr-jts
Copy link
Contributor

dr-jts commented Sep 22, 2022

I wonder whether the input should be provided as a array of geometries, rather than a GeometryCollection? Simpler to construct, and doesn't have the nested semantics of collections to deal with. That also nicely matches a potential return type of an array of cluster IDs.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 22, 2022

That said, maybe distance == 0 and intersects are equivalent in GEOS.

I guess you could confirm by testing using this PR? Would be interesting to know if there is a discrepancy. (Also why this happens in PostGIS - but I think the distance code and predicate code there are of different origins, so maybe have different contracts?)

@dr-jts
Copy link
Contributor

dr-jts commented Sep 22, 2022

What is the use case for the envelope clustering functions?

Fast alternative to distance or intersection for coarse partitioning of data.

Understood. I'm curious if there is any papers or discussion about using this technique. Envelopes are a pretty coarse approximation, so it seems like this might have some undesirable characteristics in practice.

@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

I wonder whether the input should be provided as a array of geometries, rather than a GeometryCollection? Simpler to construct, and doesn't have the nested semantics of collections to deal with. That also nicely matches a potential return type of an array of cluster IDs.

I'm really trying to follow the contours of the existing C API, which uses GeometryCollection to handle for groups of geometries. I don't think they are especially complex to construct or access and unlike an array they know their own size. A non-owning variant of GeometryCollection could be useful.

A lot of use of GEOS comes through the C API, so that seems important to provide?

It's ideal to expose it to the C API, but it doesn't have to all happen at once. If you don't like the API I've proposed and you don't have a detailed alternative, I don't think there's anything wrong with punting.

I'm curious if there is any papers or discussion about using this technique. Envelopes are a pretty coarse approximation, so it seems like this might have some undesirable characteristics in practice.

I used it to optimize raster reads from sparse polygons (think USA or France with overseas territories.) I'm not aware of any papers or discussion of it.

@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

Here's an example where envelope pre-clustering improves unaryUnion runttime by 5% despite incurring an extra copy of the input and only dividing the data into two clusters.

Geometry::Ptr
Geometry::Union() const
{
    using geos::operation::geounion::UnaryUnionOp;
#ifdef DISABLE_OVERLAYNG
    return UnaryUnionOp::Union(*this);
#else

    operation::cluster::EnvelopeIntersectsClusterFinder f;
    auto clustered = f.cluster(*this);

    for (auto& g : clustered) {
        g = operation::overlayng::OverlayNGRobust::Union(g.get());
    }

    return this->getFactory()->buildGeometry(std::move(clustered));
#endif
}

using test geosop unaryUnion -t -q -a ~/data/world.wkt

@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

Although swapping in a GeometryIntersectionClusterFinder comes up with 91 clusters and gives about a 15% performance benefit.

@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

Flattening the inputs, per the latest commit, brings the performance gain to 60% for this dataset.

@dbaston dbaston force-pushed the cluster branch 2 times, most recently from f8e74e3 to bf92b2b Compare September 22, 2022 14:02
@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

I've gone ahead and added pre-clustering by intersection to Geometry::Union(). I'm not seeing an adverse performance impact even in cases like Paul's watershed dataset where we end up with a single cluster. More testing would be great.

@dbaston
Copy link
Member Author

dbaston commented Sep 22, 2022

Also added pre-clustering to CoverageUnion which shows a 40% performance benefit on world.wkt (but remains 5x slower than coverageUnionNG)

src/geom/Geometry.cpp Outdated Show resolved Hide resolved
@dr-jts
Copy link
Contributor

dr-jts commented Sep 24, 2022

I wouldn't expect this code to be desirable for PostGIS because PostGIS already has its own implementation that does not require converting everything into GEOS types.

It would be nice if the PostGIS functions ST_ClusterIntersecting and ST_ClusterWithin could be converted to window functions (with new names of course). To do this, would it make sense to back them onto the GEOS code? Or just rework the PG-native code?

@dbaston
Copy link
Member Author

dbaston commented Sep 24, 2022

To do this, would it make sense to back them onto the GEOS code? Or just rework the PG-native code?

I would expect PostGIS to stick with its existing code which already provides this capability (look at ST_ClusterDBSCAN implementation; ST_ClusterWithin is implemented as a special case of ST_ClusterDBSCAN). Switching to GEOS would require converting geometries into GEOS and dropping support for curve types that are supported by PostGIS but not GEOS.

@dbaston
Copy link
Member Author

dbaston commented Sep 26, 2022

I've removed all changes to the C API and Union operations. Will post the union changes in a separate PR.

@dbaston dbaston merged commit 6ea4e94 into libgeos:main Sep 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement New feature or feature improvement.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants