Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep ST_Union intermediate state as GEOS geometries #394

Closed
wants to merge 7 commits into from

Conversation

@Komzpa
Copy link
Member

commented Apr 16, 2019

Komzpa added 3 commits Apr 16, 2019
postgis/lwgeom_geos.c Show resolved Hide resolved
postgis/lwgeom_geos.c Show resolved Hide resolved
postgis/lwgeom_geos.c Show resolved Hide resolved
Komzpa added 4 commits Apr 18, 2019
@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

Does it makes sense to use this for other aggregates?

@Komzpa

This comment has been minimized.

Copy link
Member Author

commented Apr 18, 2019

Failing on srid mismatch before reading terabytes of data is probably a good thing. Oher aggregates need their own specially tailored functions.

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

Aren't several of the other aggregates doing the same thing as this (accumulating in an array and then sending array to the finalizing function, where it is converted to GEOS)? ST_Polygonize, seems like it would see the same benefit, for example.

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

Another possible benefit of solving the more general problem is you could maybe get away with redefining pgis_geometry_accum_transfn instead of having to drop and recreate aggregates.

@strk strk closed this in 41e46ce Apr 18, 2019

@Komzpa Komzpa deleted the Komzpa:ticket-4340 branch Apr 18, 2019

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

@Komzpa are you interested in discussing or no?

@dr-jts

This comment has been minimized.

Copy link
Contributor

commented Apr 18, 2019

Aren't several of the other aggregates doing the same thing as this (accumulating in an array and then sending array to the finalizing function, where it is converted to GEOS)? ST_Polygonize, seems like it would see the same benefit, for example.

The difference is that Union can be run incrementally, but that is not the case for ST_Polygonize. Some other kinds of spatial aggregate functions could work that way though - e.g. Extent functions, or ConvexHull (if that was an aggregate, which should be built!)

@pramsey

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

Yes, changing pgis_geometry_accum_transfn seems like a reasonable approach that minimizes upgrade issues, if possible. Instead of using a pgsql array for the transfer state use a lwcollection or something.

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 18, 2019

The difference is that Union can be run incrementally, but that is not the case for ST_Polygonize

The incremental union approach is taken by ST_MemUnion, but ST_Union indeed collects everything beforehand and runs a cascaded union. As far as I know ST_MemUnion is the only aggregate that actually does its work incrementally.

@dr-jts

This comment has been minimized.

Copy link
Contributor

commented Apr 19, 2019

Right, I see that now. It sounded like the original design was to union incrementally in smaller batches.

@Komzpa

This comment has been minimized.

Copy link
Member Author

commented Apr 19, 2019

Following other aggregates have pgis_abs as thier transfer type, so we can only talk about them:
ST_Collect ST_ClusterIntersecting ST_ClusterWithin ST_Polygonize ST_MakeLine

ST_Collect does not convert to GEOS, so its intermediate state doesn't need GEOS conversion, so different. It could have just one gserialized there appended in place, but that's a special code. It won't benefit functionally as we're limited by 1GB postgres tuples: Union can get a lot of vertices removed so a bigger output fits into this limitation, Collect can't.

ST_ClusterIntersecting ST_ClusterWithin I believe @dbaston has plans to convert to window functions so out of scope. Failing at 1Gb of points with ClusterWithin is frustrating too, but it's still limited by its output format involving one tuple, so only a full rewrite.

ST_Polygonize ST_MakeLine - same, ST_Polygonize wants GEOS but can't eat vertices, ST_MakeLine may not even want GEOS.

In principle most of benefit of this rewrite can be added to those by moving srid check to transfn. Beyond that, Union is special case which benefits most of being able to collect more data than a Postgres array.

Extent aggregates already use better-fit intermediate storage. Convex Hull aggregate is not present in PostGIS.

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

ST_Polygonize ST_MakeLine - same, ST_Polygonize wants GEOS but can't eat vertices,

It depends on the inputs, doesn't it? Union won't reduce the size of disjoint inputs; polygonize will reduce the size of inputs that don't completely form polygons.

Either way, you highlight a potentially frustrating issue in which these changes allow us to send more than 1 GB of data to GEOS but we don't know if we can store the result until we have it in hand. So we have the possibility of hitting the 1GB limit only after having done much more work. I wonder if a warning should be emitted during accumulation but before beginning processing if the 1 GB number is passed.

@strk

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

@strk

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

@strk

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 19, 2019

I think the change is about raising not the size but the number of
elements an array can hold. Please correct me if I'm wrong.

It's about raising the size (bytes of memory)

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 22, 2019

It depends on the inputs, doesn't it? Union won't reduce the size of disjoint inputs; polygonize will reduce the size of inputs that don't completely form polygons.

@Komzpa any thoughts on the above? I'm still not understanding why this code path can't be applied to other GEOS aggregates.

@Komzpa

This comment has been minimized.

Copy link
Member Author

commented Apr 22, 2019

It's about usefullness. If you polygonize six gigs of lines and get back less than a gig, your data is broken and you shouldn't do it, and after you fix it you won't get anything back as it won't fit in a gig. If you union six gigs of data and get a gig back, it's probably okay. If you believe such refactoring will benefit your case feel free to perform it?

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 22, 2019

your data is broken and you shouldn't do it

It depends on what you're trying to do. For example, think about the use of polygonization to find the topological faces in a set of road network data. There's nothing "broken" about roads that don't form an edge in the topology.

If you believe such refactoring will benefit your case feel free to perform it?

I think when committing to a project like this, there's an obligation to think beyond your own use case and try to come up with solutions that are as general as possible.

@Komzpa

This comment has been minimized.

Copy link
Member Author

commented Apr 23, 2019

obligation to think beyond
come up with solutions that are as general as possible.

After some meditation on this I disagree. This thinking model is prone to getting stuck with perfection paralysis, and has potential to consume energy in unbounded fashion, as each iteration of widening the scope of solution makes "potential" improvement options even wider - something called "scope creep" in project management, a reason of many project failures. Similar thing was weaponized previously in Embrace, Extend & Extinguish tactics of fighting open source projects by some large companies.

A working alternative to this thinking is a number of ideas that are known under name "Kaizen" published in Toyota, embracing continuous improvement by "pull" model, demand-driven.

References:
https://en.wikipedia.org/wiki/Analysis_paralysis
https://en.wikipedia.org/wiki/Scope_creep
https://en.wikipedia.org/wiki/Toyota_Production_System#Principles

@dbaston

This comment has been minimized.

Copy link
Member

commented Apr 23, 2019

My suggestion was that you try and use the same code path in the set of functions that are doing the same thing. As you aware, there is large potential for subtle bugs with the aggregate memory context management, so there is an interest in having as small a number of functions that do it as possible. Your suggestion that I'm attempting to "weaponize" the code review process is offensive and absurd. I'm done here.

@pramsey

This comment has been minimized.

Copy link
Member

commented Apr 23, 2019

Honestly, if you think this is a tough audience, @Komzpa, go push patches through the PgSQL community. We're pretty cowboy here, within reason. But still.

Like, why carry out the POSTGIS2GEOS stage on each input? Why not just carry them through as LWGEOM in a collection? This would fit quite nicely with all the other implementations and be re-usable as they are. Also, though I see you clean up the GEOS objects in the case of a failed conversion, I was wondering in my head what happens to them if the process is just ctrl-c interrupted in mid-aggregate? All those GEOS objects are malloc'ed, not palloc'ed, so they could be orphaned. If they were LWGEOM and palloc'ed into the upper context they'd get cleaned up.

Also, as noted by @strk the upgrade of aggregates can be fraught, particularly if you are changing things like the signatures of transfn/finalfn. This is another perfectly valid reason to try and shim your new behaviour into the existing pattern.

I think we can all get behind your goal of trying to get around the 1GB limit, and in retrospect a PgSQL array, with it's flat storage, wasn't the best choice for a transfer object, but maybe hear that lots of people have discomfort with this implementation and propose some ways to make us comfortable.

@strk

This comment has been minimized.

Copy link
Member

commented Apr 24, 2019

@strk

This comment has been minimized.

Copy link
Member

commented Apr 29, 2019

I've created a pull request testing upgrades in presence of a view defined using ST_Union here:
https://git.osgeo.org/gitea/postgis/postgis/pulls/30

Unfortunately the current Dronie is broken so doesn't show the result of that testing.
Note that the only bots testing upgrades are Dronie and Debbie/Winnie, but Debbie/Winnie only test things committed to trunk...

@strk

This comment has been minimized.

Copy link
Member

commented Apr 29, 2019

https://trac.osgeo.org/postgis/ticket/4386 tackle the regression introduced by merging this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.