Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Convert CoordinateSequence from abstract to concrete class #674

Closed
wants to merge 1 commit into from

Conversation

dbaston
Copy link
Member

@dbaston dbaston commented Sep 8, 2022

This PR

  • removes the FixedSizeCoordinateSequence and CoordinateArraySequence class
  • converts CoordinateSequence class from an abstract to a concrete class, with the implementation cribbed from CoordinateArraySequence

I compared the performance to the current main branch using the suite of benchmarks in @pramsey's geos-performance. This shows:

  • some improvement on operations that involve a lot of coordinate access within large geometries (buffer, isValid, union), and
  • a slight degradation on operations that use Point geometry ("Prepared geometry", which is a point-in-polygon operation). This could be mitigated by providing dedicated C API functions for common point operations (e.g., GEOSPreparedContainsXY) and/or by using a union or std::variant to stash a single Coordinate within the concrete CoordinateSequence.

This PR does not implement any special algorithms to take advantage of the known structure of a CoordinateSequence, as discussed here. Some further performance gains may be possible by replacing std::unique_ptr<CoordinateSequence> with CoordinateSequence (e.g., in LineString) but I think a more fruitful track is to explore switching to std::shared_ptr<CoordinateSequence> instead.

As discussed elsewhere, I think the benefits of the abstract CoordinateSequence are minimal, so this change is probably worthwhile even for a modest performance benefit. But I'm curious what others think.

image

@pramsey
Copy link
Member

pramsey commented Sep 8, 2022

Very exciting. I think the preparedgeometryxy stuff is going to be required anyways, to really crank up point performance on the classic p-i-p case. A little surprised the gains were this small, but I guess C++ compilers are just really really good.

@dbaston
Copy link
Member Author

dbaston commented Sep 8, 2022

Very exciting. I think the preparedgeometryxy stuff is going to be required anyways

Yes, even with the FixedSizeCoordinateSequence the cost of creating and destroying points is about the same as the PIP test itself.

A little surprised the gains were this small,

They're not huge but these changes would enable more stuff down the line. A better algorithm for duplicate-point removal (or just delegate to std::unique). And a lot of our Coordinate access is through SegmentString::getCoordinate which is slow even if the underlying CoordinateSequence::getAt is not. Maybe that can be improved.

@pramsey
Copy link
Member

pramsey commented Sep 13, 2022

One thing I'd like to cogitate on is the extent to which this does/doesn't foreclose on a higher dimension strategy that splits the XY coordinates from the Z and M values, so a CoordinateSequence in storage terms ends up like

std::vector<Coordinate> xy;
std::vector<double> z;
std::vector<double> m;

I think this is doable, but I haven't spent enough time cataloguing the places that make use of Z and M coordinates (except, obviously, the readers and writers) and whether a move to having getAt(i) return an XY coordinate, and a getZ(i) return a double would fit.

@dbaston
Copy link
Member Author

dbaston commented Sep 13, 2022

I don't think the current implementation is the final implementation, but switching to a concrete class makes experimentation a bit easier, I think.

Here is one way to handle multiple dimensions with a single class: https://gist.github.com/dbaston/3902b785ed3d841333d2b1c7f3b4ed25

I'm not sure if it's a good way or not. I don't like the potential for slicing (pass a CoordinateZ to a function expecting a Coordinate and the z member is suddenly NaN.) But without going virtual that might be something we have to accept.

@dr-jts
Copy link
Contributor

dr-jts commented Sep 13, 2022

Here is one way to handle multiple dimensions with a single class:

Well that's slick! I guess the codebase would have to be changed to delegate creating new CoordinateSequences to the input ones, and maybe have different patterns for copying coordinates. But seems like a small price to pay.

Implementation is taken from CoordinateArraySequence.
FixedSizeCoordinateSequence is removed.
@dbaston
Copy link
Member Author

dbaston commented Oct 4, 2022

Backing CoordinateSequence with double* turns out to be pretty invasive, because we lose the ability to construct a CoordinateSequence without copies by moving in a std::vector<Coordinate>. The code is full of std::vector<Coordinate> (as a workaround to CoordinateSequence being slow!) so making this change will require us to adopt CoordinateSequence throughout the code.

@dr-jts
Copy link
Contributor

dr-jts commented Oct 5, 2022

Backing CoordinateSequence with double* turns out to be pretty invasive, because we lose the ability to construct a CoordinateSequence without copies by moving in a std::vector<Coordinate>. The code is full of std::vector<Coordinate> (as a workaround to CoordinateSequence being slow!) so making this change will require us to adopt CoordinateSequence throughout the code.

In JTS there is a clear distinction between CoordinateSequence, which is fixed-length, and CoordinateList, which is extendable. I suspect that a lot of the uses of std::vector<Coordinate> are just because it was possible to do this, rather than using a CoordinateList equivalent.

It seems to me that basing CoordinateSequence on std::vector<Coordinate> is limiting in several ways:

  • Ties the CoordinateSequence tightly to Coordinate
  • Prevents using alternate ordinate storage models (e.g. XYZM, XXYY, etc)

So how about reintroducing this distinction in GEOS? CoordinateSequence becomes fixed-length, with an opaque internal representation (almost certainly including an XY memory buffer option). And where extendable lists of coordinates are required, either retain the use of std::vector<Coordinate> or (better IMHO) create a CoordinateList equivalent with useful conversion methods.

@dbaston
Copy link
Member Author

dbaston commented Oct 5, 2022

My comment probably wasn't clear. I am working on a CoordinateSequence variant that is backed by a buffer of doubles (thus allowing different dimensionality, but not relaxing the interleaving requirement). I was just pointing out that it will be far-reaching because of the pervasive use of std::vector<Coordinate> throughout GEOS.

The distinction between extensible and non-extensible sequences seems like a different discussion. GEOS also has the CoordinateList type.

@dbaston
Copy link
Member Author

dbaston commented Oct 13, 2022

I've put together additional implementations of CoordinateSequence that are:

In the benchmarking I've done, the std::vector<Coordinate> backing is the fastest, but the library isn't taking advantage of potential gains from storing 2D coordinates. Will update this PR when I have a recommendation.
image
Implementations shown on the x axis are, from left-to-right:

  • Backed by std::vector<Coordinate>
  • Backed by std::vector<double>
  • Backed by the union described above
  • Current main branch

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

The distinction between extensible and non-extensible sequences seems like a different discussion. GEOS also has the CoordinateList type.

Doesn't it have a bearing on whether CoordinateSequences are better off using a vector rather than a fixed-size buffer?

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

Does the (slightly) lower performance of the union-based concrete-coordseq-multi reflect overhead needed to discriminate on the union? And what was the underlying representation used in the tests?

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

For the record, I'm in favour of keeping the CoordinateSequence implementation simple. The simplest proposal so far is a fixed-size buffer storing XY, with optional Z and M (either interleaved or as separate arrays - not sure which is best).

@dbaston
Copy link
Member Author

dbaston commented Oct 13, 2022

Doesn't it have a bearing on whether CoordinateSequences are better off using a vector rather than a fixed-size buffer?

What would be the advantage of a fixed-size buffer over a vector? You can initialize a vector to whatever size you want; there is no overhead beyond dynamically allocating a fixed-size buffer.

Does the (slightly) lower performance of the union-based concrete-coordseq-multi reflect overhead needed to discriminate on the union?

I think it comes from add operations being more expensive with the union. It might be possible to improve this. I am describing work in progress. This also assumes no client code would change, which is a bad assumption. For example, we would no longer implement repeated point removal using a CoordinateFilter, which is what is hitting add in the union case.

And what was the underlying representation used in the tests?

The tests are done with each of the underlying representations described here. Unless I'm misunderstanding the question.

The simplest proposal so far is a fixed-size buffer storing XY, with optional Z and M (either interleaved or as separate arrays - not sure which is best).

I don't understand how this would work or why it would be simple. How would you get a Coordinate from non-contiguous memory?

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

Doesn't it have a bearing on whether CoordinateSequences are better off using a vector rather than a fixed-size buffer?

What would be the advantage of a fixed-size buffer over a vector? You can initialize a vector to whatever size you want; there is no overhead beyond dynamically allocating a fixed-size buffer.

  • can utilize externally-allocated memory
  • makes CoordinateSequence immutable, thus eliminating potential bugs
  • might be more efficient?
  • easily represents different coordinate dimensions (interleaved in particular - not sure how this would work with a vector of Coordinate)

@pramsey
Copy link
Member

pramsey commented Oct 13, 2022

I don't think there's anything particularly magical about fixing the size of a CoordinateSequence. I look forward to seeing what happens when higher dimensionality comes into play. If one pre-supposes that "most" GEOS usage is 2D, then a 2D main Coordinate buffer and optional Z and M double buffers might trade a little complexity of implementation for (maybe?) some performance gains in the "common" use case. But that's total finger-to-the-wind guessing. Didn't you have a templated branch packing higher dimensional coordinates into a different coordinate buffer as an optional thing?

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

And what was the underlying representation used in the tests?

The tests are done with each of the underlying representations described here. Unless I'm misunderstanding the question.

Doesn't the concrete-coordseq-multi support different representations? If so, which is used in the tests? (I'm guessing it's using the vector rep - but would using the fixed-buffer rep make any difference?

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

The simplest proposal so far is a fixed-size buffer storing XY, with optional Z and M (either interleaved or as separate arrays - not sure which is best).

I don't understand how this would work or why it would be simple. How would you get a Coordinate from non-contiguous memory?

Via a method get(Coordinate), which copies the ordinate values into the provided Coordinate. I expect we'd wind up with get(CoordinateXY), get(CoordinateXYM) and getCoordinate(CoordinateXYZM) too, for situations in which they are needed explicitly.

@dbaston
Copy link
Member Author

dbaston commented Oct 13, 2022

Via a method get(Coordinate), which copies the ordinate values into the provided Coordinate. I expect we'd wind up with get(CoordinateXY), get(CoordinateXYM) and getCoordinate(CoordinateXYZM) too, for situations in which they are needed explicitly.

What is the benefit of introducing copy-on-access where we currently have none?

@dbaston
Copy link
Member Author

dbaston commented Oct 13, 2022

Didn't you have a templated branch packing higher dimensional coordinates into a different coordinate buffer as an optional thing?

Yes, that's the second implementation from left in the plots. https://github.com/dbaston/libgeos/tree/concrete-coordseq-buf

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

Via a method get(Coordinate), which copies the ordinate values into the provided Coordinate. I expect we'd wind up with get(CoordinateXY), get(CoordinateXYM) and getCoordinate(CoordinateXYZM) too, for situations in which they are needed explicitly.

What is the benefit of introducing copy-on-access where we currently have none?

Couldn't it allow return-by-reference of a CoordinateXY? Which is most of the use.

@dbaston
Copy link
Member Author

dbaston commented Oct 13, 2022

Couldn't it allow return-by-reference of a CoordinateXY? Which is most of the use.

Yes, you could avoid a copy for the 2D case.

To back up, we are fundamentally talking about

struct Seq {
  double* vals; // XY or XYZ or XYZM or XYM
};

This can return a reference for any Coordinate dimension, allows GEOS to retain a 3D default for now, and can be backed by an external buffer.

vs

struct Seq {
  double* xyvals;
  double* zmvals; // Z or ZM or M, depending
};

This returns a reference for 2D only, requires GEOS to universally switch to a 2D default or adopt copy-on-access by default, and can only be backed by an external buffer for the 2D case. What is the advantage?

@dr-jts
Copy link
Contributor

dr-jts commented Oct 13, 2022

I'm fine with the simple concept of a buffer of XY, XYZ/M, or XYZM.

Can't recall the reasons for the split ZM concept - @pramsey may know?

@pramsey
Copy link
Member

pramsey commented Oct 13, 2022

The basis for splitting off Z/M from X/Y is totally theoretical! Except it's also practically implemented in the Shape file. Also the OGR feature implementation. It packs the things we actually care about and iterate on and read closer together (we only access Z and M for relatively rare introduced intersection calculation). Locality! Why? Because locality! It makes Coordinate smaller, and we do shove Coordinate around a great deal. We can increase and decrease the dimensionality of a CoordinateSequence really cheaply (I don't think we care about this?).

All in all, a grab bag of not a lot. It might have made more sense way back when Z was first introduced as a much less invasive way to bring it in.

@dbaston
Copy link
Member Author

dbaston commented Oct 13, 2022

Yes, I think a future GEOS mostly uses CoordinateXY and those values are stored without a bunch of interspersed NaN values in a CoordinateSequence. The implementations I am working with here support that but also allow GEOS to continue with a 3D default for now. This is because I am trying to avoid one big pull request that Does Everything, and instead propose independent sets of changes (this is one, #682) that move us toward that goal while doing no harm along the way.

I have a separate branch (#701) that adds the CoordinateXY type and uses it throughout the majority of the algorithm namespace that it is 2D only. Maybe if that one is merged first it will make the tradeoffs in this PR more clear.

@dbaston
Copy link
Member Author

dbaston commented Nov 8, 2022

Superseded by #721

@dbaston dbaston closed this Nov 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants