Multi-dimensional CoordinateSequence #721

dbaston · 2022-11-01T17:57:18Z

This PR builds on #674 to implement CoordinateSequence as a concrete class capable of storing XY, XYZ, XYM, or XYZM coordinates. Like the liblwgeom POINTARRAY, it provides direct access to coordinates of a dimensionality compatible with the stored values, and copy access to higher-dimensionality coordinates padded with NaNs. For example, if a CoordinateSequence stores XYM Coordinates, you can access XY and XYM Coordinates directly and can read an XYZM Coordinate with copying.
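The two access modes described above can be illustrated with a toy model. This is only a sketch, not the code in this PR: `ToyXYMSequence`, `direct`, and `getXYZM` are hypothetical names, and direct access is modelled as a raw pointer into the buffer rather than a typed reference, to keep the example within well-defined C++.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical stand-in for a 4D coordinate value.
struct XYZM { double x, y, z, m; };

// Toy sequence storing XYM coordinates interleaved in one buffer (stride 3).
class ToyXYMSequence {
public:
    void add(double x, double y, double m) {
        m_vect.insert(m_vect.end(), {x, y, m});
    }

    std::size_t size() const { return m_vect.size() / kStride; }

    // Direct access: points into the buffer, nothing is copied.
    // p[0] and p[1] are X and Y; p[2] is M.
    const double* direct(std::size_t i) const {
        return m_vect.data() + i * kStride;
    }

    // Copy access: builds a higher-dimensionality value, padding Z with NaN.
    XYZM getXYZM(std::size_t i) const {
        const double* p = direct(i);
        return XYZM{p[0], p[1], std::numeric_limits<double>::quiet_NaN(), p[2]};
    }

private:
    static constexpr std::size_t kStride = 3;
    std::vector<double> m_vect;
};
```

Reading XY or XYM from such a sequence touches only the stored doubles, while reading XYZM has to produce a new value because there is no Z slot in the buffer to refer to.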

It still needs some polish but I wanted to give a chance for overall feedback.

I have been running a series of benchmarks comparing this implementation to the current main branch. These numbers represent a worst case: we pay the overhead of a multi-dimensional CoordinateSequence, but we continue to store XYZ Coordinates for 2D geometries because most of GEOS depends on direct access to XYZ Coordinates even when the Z values are not needed. As we transition to the CoordinateXY type, we should start to see benefits from not storing Z values.

Among these cases, the only degradation is in point-in-polygon (PIP) tests, since we no longer have the stack-only FixedSizeCoordinateSequence. This is mostly but not completely mitigated by the new GEOSPreparedContainsXY signature.

I can think of two alternative implementations:

Making Coordinate a semi-abstract class with direct access to X and Y values and virtual method access to Z and M values. I did not go this route because it would have required each Coordinate to store a pointer to its vtable, essentially taking up the space used by the Z value and negating the locality benefits that we expect for X and Y.
Storing interleaved XY values in a single array and Z/M values in separate arrays. This would require us to either make Coordinates copy-on-access or access Z/M values through some reference to the underlying CoordinateSequence. Relative to the implementation in this PR, this approach would give more locality benefits for XY algorithms only in the case where the user is storing XYZ Coordinates but does not want the Z values used. This seems uncommon.
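The vtable cost mentioned for the first alternative can be checked with a small sketch. `PlainXY` and `VirtualXY` are hypothetical types made up for illustration, and the exact sizes are implementation-defined; on typical 64-bit platforms the virtual version is 24 bytes against 16 for the plain one.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>

// Plain 2D coordinate: two doubles and nothing else.
struct PlainXY {
    double x;
    double y;
};

// Hypothetical "semi-abstract" coordinate: X and Y are direct members,
// while Z and M would be reached through virtual methods.
struct VirtualXY {
    double x;
    double y;
    virtual double getZ() const { return std::numeric_limits<double>::quiet_NaN(); }
    virtual double getM() const { return std::numeric_limits<double>::quiet_NaN(); }
    virtual ~VirtualXY() = default;
};

// Each VirtualXY object carries a hidden pointer to its vtable, so it is
// larger than the two doubles it exposes.
constexpr std::size_t kVirtualOverhead = sizeof(VirtualXY) - sizeof(PlainXY);
```

With an 8-byte pointer, `kVirtualOverhead` equals the size of one double, which is the space a stored Z value would have used.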

pramsey · 2022-11-01T18:11:56Z

I was never a huge fan of FixedSizeCoordinateSequence, so... I think for use cases that amount to "instantiate a whole geometry each time" we need to look for better performance paths, and GEOSPreparedContainsXY is a reasonable move in that direction.

dbaston · 2022-11-03T15:23:17Z

A follow-on is figuring out how to properly handle various coordinate types throughout GEOS. I think we have three types of situations:

2D algorithms: switch to CoordinateXY
2D non-performance-critical code (WKTReader, etc.): switch to CoordinateXYZM, copy-on-access
2D performance-critical code that should correctly handle different types: ???

This last one is tricky. For example, how should we best replace a block like this one, where we're performing an operation on two coordinates from two different sequences which may have different dimensionality? We don't want an if block with the 16 possible combinations of dimensions.

    const Coordinate& p00 = e0->getCoordinate(segIndex0);
    const Coordinate& p01 = e0->getCoordinate(segIndex0 + 1);
    const Coordinate& p10 = e1->getCoordinate(segIndex1);
    const Coordinate& p11 = e1->getCoordinate(segIndex1 + 1);

    li.computeIntersection(p00, p01, p10, p11);

This particular case comes from a noding intersection finder. Those are virtual anyway, so I guess we could template the class on the coordinate types of the input geometries.

If we didn't want to do that, the most flexible thing I've found so far is to move the block into a class with a templated method, like:

    class DoIntersect {
    public:
        DoIntersect(algorithm::LineIntersector& li,
                    const CoordinateSequence& seq0,
                    std::size_t i0,
                    const CoordinateSequence& seq1,
                    std::size_t i1) :
            m_li(li),
            m_seq0(seq0),
            m_i0(i0),
            m_seq1(seq1),
            m_i1(i1) {}

        template<typename T1, typename T2>
        void operator()() {
            const T1& p00 = m_seq0.getAt<T1>(m_i0);
            const T1& p01 = m_seq0.getAt<T1>(m_i0 + 1);
            const T2& p10 = m_seq1.getAt<T2>(m_i1);
            const T2& p11 = m_seq1.getAt<T2>(m_i1 + 1);

            m_li.computeIntersection(p00, p01, p10, p11);
        }

    private:
        algorithm::LineIntersector& m_li;
        const CoordinateSequence& m_seq0;
        std::size_t m_i0;
        const CoordinateSequence& m_seq1;
        std::size_t m_i1;
    };

and then pass it to some dispatching functions defined elsewhere, so the original block gets replaced by something like

    DoIntersect dis(li, *cs0, segIndex0, *cs1, segIndex1);
    binaryDispatch(*cs0, *cs1, dis);

Clearly it's a loss for readability. Other ideas welcome!
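For readers wondering what the dispatching function might look like, here is a minimal sketch. It is an assumption about the shape of such a helper, not the code in GEOS: `ToySequence`, the tag types, `dispatchSecond`, and `RecordTypes` are all hypothetical, and only the `binaryDispatch(seq0, seq1, functor)` call shape is taken from the comment above.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag types standing in for the four coordinate types.
struct XY   { static const char* name() { return "XY"; } };
struct XYZ  { static const char* name() { return "XYZ"; } };
struct XYM  { static const char* name() { return "XYM"; } };
struct XYZM { static const char* name() { return "XYZM"; } };

// Toy sequence that only knows which ordinates it stores.
struct ToySequence {
    bool hasZ;
    bool hasM;
};

// Second stage: T1 is already fixed, pick T2 from the second sequence.
template<typename T1, typename F>
void dispatchSecond(const ToySequence& seq1, F& fun) {
    if (seq1.hasZ) {
        if (seq1.hasM) fun.template operator()<T1, XYZM>();
        else           fun.template operator()<T1, XYZ>();
    } else {
        if (seq1.hasM) fun.template operator()<T1, XYM>();
        else           fun.template operator()<T1, XY>();
    }
}

// First stage: pick T1 from the first sequence. The 16 combinations are
// written once here instead of at every call site.
template<typename F>
void binaryDispatch(const ToySequence& seq0, const ToySequence& seq1, F& fun) {
    if (seq0.hasZ) {
        if (seq0.hasM) dispatchSecond<XYZM>(seq1, fun);
        else           dispatchSecond<XYZ>(seq1, fun);
    } else {
        if (seq0.hasM) dispatchSecond<XYM>(seq1, fun);
        else           dispatchSecond<XY>(seq1, fun);
    }
}

// Example functor: records which pair of types it was invoked with.
struct RecordTypes {
    std::string seen;

    template<typename T1, typename T2>
    void operator()() {
        seen = std::string(T1::name()) + "/" + T2::name();
    }
};
```

The runtime branching happens once per call, and the body of the functor is compiled separately for each type pair, so the inner code sees concrete coordinate types.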

abellgithub · 2022-11-03T21:47:59Z

Could you just ditch the idea of fetching coordinates of some specific type altogether? So, rather than getAt<CoordinateXY>(i), you just have getAt(i) and you always return a reference to an XYZM. If the algorithm wants to treat it as XY or XYZ, it's free to do that. You'd have to pad your array with (up to) 2 doubles to avoid an accvio.

But I probably don't understand. :)

dbaston · 2022-11-03T22:57:53Z

You'd have to pad your array with (up to) 2 doubles to avoid an accvio.

The thinking is that most usage of GEOS does not involve Z or M, so it would be nice to avoid a 100% storage penalty for the common XY case.

pramsey · 2022-11-03T23:05:28Z

He's not proposing a 4D underlying vector. He's proposing that reads always go through a Coordinate4D, and that the caller check the dimensionality of the CoordinateSequence to determine whether the Z/M values are garbage.

dbaston · 2022-11-08T00:45:04Z

I think this is ready to go. I've run an updated set of benchmarks against the current main branch. The only performance regression is in the GEOSPreparedContains test, as discussed above. The gains in the remaining tests are modest but consistent.

   name                                    this_pr    main speedup_pct
   <chr>                                     <dbl>   <dbl>       <dbl>
 1 australia buffer                        402.    401.         -0.174
 2 australia isvalid                         1.40    1.48        5.09 
 3 buffered watershed union                  2.81    3.02        7.14 
 4 cluster countries                         0.924   1.01        8.50 
 5 county buffer                            12.9    13.5         4.28 
 6 landcov isvalid                           0.132   0.153      14.2  
 7 pip watersheds (GEOSPreparedContains)   274.    241.        -13.5  
 8 pip watersheds (GEOSPreparedContainsXY) 196.    198.          1.47 
 9 watershed intersection                   21.5    22.8         5.76 
10 watershed union                           2.09    2.19        4.41 
11 watersheds isvalid                        0.269   0.316      14.8

Until the entire library is made to be more dimension-aware (if ever) we are still storing Z values in all cases to avoid a crash or incorrect result when accessing values by Coordinate&. This means that XY coordinates are stored as XYZ, and XYM coordinates are stored as XYZM. But at least we now have support for storage of M coordinates with no penalty in the lower-dimensionality cases. (You still can't do anything with them, except that the WKT and WKB readers/writers now handle them correctly.)
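The interim padding rule can be written down as a small function. This is a sketch of the rule as stated in the paragraph above, with hypothetical names (`paddedStride`, `packedStride`); it is not taken from the PR's source.

```cpp
#include <cassert>
#include <cstddef>

// Doubles stored per coordinate under the interim rule: a Z slot is always
// reserved, and an M slot is added only when the sequence has M.
constexpr std::size_t paddedStride(bool /*hasZ*/, bool hasM) {
    return hasM ? 4u : 3u;
}

// Doubles per coordinate if only the declared ordinates were stored,
// which is where a fully dimension-aware library could end up.
constexpr std::size_t packedStride(bool hasZ, bool hasM) {
    return 2u + (hasZ ? 1u : 0u) + (hasM ? 1u : 0u);
}
```

Under this rule an XY sequence uses 3 doubles per coordinate instead of 2, and an XYM sequence uses 4 instead of 3, while XYZ and XYZM sequences carry no padding.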

This PR does not include the ability to read coordinates directly from an external buffer. I think the best way to do this would be to replace the std::vector backing by our own vector type. I don't think that is a huge undertaking since we don't need anywhere near the full capabilities of std::vector, but I think it's best handled separately.
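To make the external-buffer idea concrete, here is one possible shape for such a vector replacement. It is purely illustrative: `DoubleBuffer` and its members are hypothetical, it has a fixed size, and it leaves out growth, moves, and everything else a real CoordinateSequence backing store would need.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical fixed-size buffer of doubles that either owns its storage
// or borrows memory supplied by the caller.
class DoubleBuffer {
public:
    // Owning mode: allocate n zero-initialized doubles.
    explicit DoubleBuffer(std::size_t n) :
        m_owned(new double[n]()),
        m_data(m_owned.get()),
        m_size(n) {}

    // Borrowing mode: wrap caller-owned memory without copying it.
    // The caller must keep `external` alive for the buffer's lifetime.
    DoubleBuffer(double* external, std::size_t n) :
        m_data(external),
        m_size(n) {}

    // Copying is disabled to keep ownership unambiguous in this sketch.
    DoubleBuffer(const DoubleBuffer&) = delete;
    DoubleBuffer& operator=(const DoubleBuffer&) = delete;

    bool ownsData() const { return m_owned != nullptr; }
    std::size_t size() const { return m_size; }
    double* data() { return m_data; }
    double& operator[](std::size_t i) { return m_data[i]; }
    const double& operator[](std::size_t i) const { return m_data[i]; }

private:
    std::unique_ptr<double[]> m_owned;  // null when borrowing
    double* m_data;
    std::size_t m_size;
};
```

In borrowing mode, writes through the buffer land directly in the caller's array, which is the property that reading coordinates from an external buffer relies on.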

dbaston · 2022-11-28T00:44:38Z

Any concerns with this one?

jorisvandenbossche · 2022-12-03T08:58:31Z

src/geom/CoordinateSequence.cpp

+CoordinateSequence::CoordinateSequence(std::size_t sz, std::size_t dim) :
+    m_vect(sz * 3),
+    m_stride(3u),
+    m_hasdim(dim > 0),
+    m_hasz(dim == 3),
+    m_hasm(false)


Running into another issue in Shapely.

If one calls GEOSCoordSeq_create(1, 4), which ends up in the CoordinateSequence(1, 4) constructor above, that creates a 3D coordseq with both hasz and hasm set to false.

I see that the documentation of both CoordinateSequence(std::size_t size, std::size_t dim = 0) and the GEOSCoordSeq_create C API says that this is only for creating XY or XYZ sequences (so dim can be 2 or 3). So passing a value of 4 is a user error on my part.
But just noting that before, this would raise an exception, while now you silently get wrong values if you subsequently set the ordinates with GEOSCoordSeq_setOrdinate_r / CoordinateSequence::setOrdinate (ordinate indices 2 and 3 (z and m) will overwrite each other).

This is totally fixable on the Shapely side (we just need to verify the dimension on our side instead of just passing it to GEOS). But I do wonder whether GEOSCoordSeq_create should be updated to allow passing dim=4?

dbaston · 2022-12-03

Yes, see #753
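The caller-side check mentioned above could be as small as the following. This is a sketch of the idea only; `checkedCoordSeqDims` is a hypothetical helper, not part of GEOS or Shapely, and it assumes the documented contract quoted in the comment (dim must be 2 or 3).

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical guard: returns dims unchanged when it is 2 or 3 and throws
// otherwise, so that an unsupported value never reaches the library.
inline unsigned int checkedCoordSeqDims(unsigned int dims) {
    if (dims != 2 && dims != 3) {
        throw std::invalid_argument(
            "coordinate sequence dims must be 2 or 3, got " + std::to_string(dims));
    }
    return dims;
}
```

A wrapper would then pass `checkedCoordSeqDims(dims)` as the second argument of GEOSCoordSeq_create instead of the raw value, which restores the old fail-loudly behavior for dim=4.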

jorisvandenbossche · 2022-12-11T14:24:47Z

capi/geos_ts_c.cpp

                for (std::size_t i = 0; i < size; i++) {
-                    coords[i] = { *buf, *(buf + 1) };
+                    coords->setAt(Coordinate{ *buf, *(buf + 1) }, i);


Another question: the above doesn't yet support creating an XYM coordinate sequence, right? (there would need to be another if (hasM))

dbaston · 2022-12-11

Thanks, this is fixed with #772
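The extra hasM branch being asked about can be sketched independently of the C API. The helper below is hypothetical (`readInterleaved` and the `XYZM` struct are made up for illustration) and is not the fix in #772; it assumes the buffer holds X, Y, then Z if present, then M if present, for each coordinate.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical 4D coordinate value.
struct XYZM { double x, y, z, m; };

// Unpack an interleaved buffer of `size` coordinates. Ordinates that the
// buffer does not contain are filled with NaN.
std::vector<XYZM> readInterleaved(const double* buf, std::size_t size,
                                  bool hasZ, bool hasM) {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    const std::size_t stride = 2u + (hasZ ? 1u : 0u) + (hasM ? 1u : 0u);

    std::vector<XYZM> out;
    out.reserve(size);
    for (std::size_t i = 0; i < size; i++) {
        const double* p = buf + i * stride;
        XYZM c{p[0], p[1], nan, nan};
        std::size_t k = 2;
        if (hasZ) c.z = p[k++];  // Z, when present, follows Y
        if (hasM) c.m = p[k];    // M is always the last ordinate
        out.push_back(c);
    }
    return out;
}
```

With hasZ false and hasM true, the third double of each coordinate is read as M rather than Z, which is the case the snippet under review did not yet cover.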

dbaston mentioned this pull request Nov 2, 2022

Convert more algorithms to CoordinateXY #723

Merged

dbaston force-pushed the concrete-coordseq-buf branch from 39b7354 to a09c72c on November 3, 2022 15:26

dbaston force-pushed the concrete-coordseq-buf branch 2 times, most recently from 1338046 to 2c87470, on November 4, 2022 11:41

dbaston added 10 commits November 5, 2022 12:10

Convert CoordinateSequence from abstract to concrete class

b1792fb

Implementation is taken from CoordinateArraySequence. FixedSizeCoordinateSequence is removed.

Implement CoordinateSequence using std::vector<double>

b46b8fd

Support XYZM in WKTReader, WKTWriter

5b1867a

Update WKTReader behavior and tests

71a9e4e

Pad XYM CoordinateSequences to XYZM, for now

db8c62b

Support M values in WKBReader

fac7913

Support M values in WKBWriter

dee26ac

Add some tests, resolve TODOs

6169b51

Handle M values in CAPI CoordSeq array and buffer functions

f326962

Improves performance of copy-to-buffer by about 75%

Remove CoordinateSequenceFactory

07cc61b

dbaston force-pushed the concrete-coordseq-buf branch from ed2d27c to 07cc61b on November 5, 2022 16:59

dbaston added 4 commits November 5, 2022 21:24

Optimize CoordinateSequence::initialize

5be9c0a

Fix MSVC linking error

e5bc852

Add tests

40737c5

Allow autogenerated CoordinateSequence move ctr

2a39673

dbaston mentioned this pull request Nov 8, 2022

Convert CoordinateSequence from abstract to concrete class #674

Closed

dbaston mentioned this pull request Nov 10, 2022

Pull CoordinateSequence into SegmentString base class #731

Merged

dbaston mentioned this pull request Nov 22, 2022

Optimizations to RingClipper implementation #519

Closed

Avoid changing LineString::normalize behavior for zero-length LineString

a68939d

dbaston added 3 commits November 29, 2022 11:39

Update NEWS

602ae76

Avoid change in Coordinate::toString (breaks PostGIS test)

a466931

Remove commented-out code

974a5de

dbaston force-pushed the concrete-coordseq-buf branch from 79bf831 to 974a5de on November 29, 2022 16:39

dbaston added 4 commits November 29, 2022 18:06

Avoid repetition in Coordinate isNull methods

e0f7b8c

Fix copyright

9e48383

Remove commented-out code

4c3ae75

Remove duplicated import

e4e3765

dbaston merged commit 60edf0e into libgeos:main Nov 30, 2022

jorisvandenbossche mentioned this pull request Dec 1, 2022

BUG: Setting precision on empty point segfaults (GEOS main) #748

Closed

jorisvandenbossche reviewed Dec 3, 2022

jorisvandenbossche mentioned this pull request Dec 3, 2022

TST: fix tests for GEOS changes in M handling shapely/shapely#1647

Merged

jorisvandenbossche reviewed Dec 11, 2022

idanmiara mentioned this pull request Jan 22, 2023

QST: Memory size: XY vs XYZ, Point vs MultiPoint/LineString? shapely/shapely#1731

Closed

jorisvandenbossche mentioned this pull request Jan 23, 2023

Geometry created as 3D with NaN as z coordinate is 3D but WKT shows it as 2D #808

Closed
