Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-dimensional CoordinateSequence #721

Merged
merged 22 commits into from
Nov 30, 2022

Conversation

dbaston
Copy link
Member

@dbaston dbaston commented Nov 1, 2022

This PR builds on #674 to implement CoordinateSequence as a concrete class capable of storing XY, XYZ, XYM, or XYZM coordinates. Like the liblwgeom POINTARRAY, it provides direct access to coordinates of a dimensionality compatible with the stored values, and copy access to higher-dimensionality coordinates padded with NaNs. For example, if a CoordinateSequence stores XYM Coordinates, you can access XY and XYM Coordinates directly and can read an XYZM Coordinate with copying.

It still needs some polish but I wanted to give a chance for overall feedback.

I have been running a series of benchmarks comparing this implementation to the current main branch. These numbers represent a worst case because we have the overhead of a multi-dimensional CoordinateSequence but we are continuing to store XYZ Coordinates for 2D geometries because most of GEOS depends on direct access to XYZ Coordinates even when the XY values are not needed. As we transition to the CoordinateXY type, we should start to see benefits from not storing Z values.

image

Among these cases, the only degradation is in PIP tests, since we no longer have the stack-only FixedSizeCoordinateSequence. This is mostly but not completely mitigated by the new GEOSPreparedContainsXY signature.

I can think of two alternative implementations:

  • Making Coordinate a semi-abstract class with direct access to X and Y values and virtual method access to Z and M values. I did not go this route because it would have required each Coordinate to store a pointer to its vtable, essentially taking up the space used by the Z value and negating the locality benefits that we expect for X and Y.
  • Storing interleaved XY values in a single array and Z/M values in separate arrays. This would require us to either make Coordinates copy-on-access or access Z/M values through some reference to the underlying CoordinateSequence. Relative to the implementation in this PR, this approach would give more locality benefits for XY algorithms only in the case where the user is storing XYZ Coordinates but does not want the Z values used. This seems uncommon.

@pramsey
Copy link
Member

pramsey commented Nov 1, 2022

I was never a huge fan of FixedSizeCoordinateSequence, so... I think for use cases where "instantiate a whole geometry each time" we need to look for better performance paths, and GEOSPreparedContainsXY is a reasonable move in that direction.

@dbaston
Copy link
Member Author

dbaston commented Nov 3, 2022

A follow-on is figuring out how to properly handle various coordinate types throughout GEOS. I think we have three types of situations:

  • 2D algorithms: switch to CoordinateXY
  • 2D non-performance-critical code (WKTReader, etc.): switch to CoordinateXYZM, copy-on-access

  • 2D performance-critical code that should correctly handle different types: ???

This last one is tricky. For example, how should we best replace a block like this one, where we're performing an operation on two coordinates from two difference sequences which may have different dimensionality? We don't want an if block with the 16 possible combinations of dimensions.

    const Coordinate& p00 = e0->getCoordinate(segIndex0);
    const Coordinate& p01 = e0->getCoordinate(segIndex0 + 1);
    const Coordinate& p10 = e1->getCoordinate(segIndex1);
    const Coordinate& p11 = e1->getCoordinate(segIndex1 + 1);

    li.computeIntersection(p00, p01, p10, p11);

This particular case comes from a noding intersection finder. Those are virtual anyway, so I guess we could template the class on the coordinate types of the input geometries.

If we didn't want to do that, the most flexible thing I've found so far is to move the block into a class with a templated method, like:

class DoIntersect {
public:
    DoIntersect(algorithm::LineIntersector& li,
                const CoordinateSequence& seq0,
                std::size_t i0,
                const CoordinateSequence& seq1,
                std::size_t i1) :
        m_li(li),
        m_seq0(seq0),
        m_i0(i0),
        m_seq1(seq1),
        m_i1(i1) {}

    template<typename T1, typename T2>
    void operator()() {
        const T1& p00 = m_seq0.getAt<T1>(m_i0);
        const T1& p01 = m_seq0.getAt<T1>(m_i0 + 1);
        const T2& p10 = m_seq1.getAt<T2>(m_i1);
        const T2& p11 = m_seq1.getAt<T2>(m_i1 + 1);

        m_li.computeIntersection(p00, p01, p10, p11);
    }

private:
    algorithm::LineIntersector& m_li;
    const CoordinateSequence& m_seq0;
    std::size_t m_i0;
    const CoordinateSequence& m_seq1;
    std::size_t m_i1;
};

and then pass it to some dispatching functions defined elsewhere, so the original block gets replaced by something like

    DoIntersect dis(li, *cs0, segIndex0, *cs1, segIndex1);
    binaryDispatch(*cs0, *cs1, dis);

Clearly it's a loss for readability. Other ideas welcome!

@abellgithub
Copy link
Contributor

abellgithub commented Nov 3, 2022

A follow-on is figuring out how to properly handle various coordinate types throughout GEOS. I think we have three types of situations:

  • 2D algorithms: switch to CoordinateXY
  • 2D non-performance-critical code (WKTReader, etc.): switch to CoordinateXYZM, copy-on-access

  • 2D performance-critical code that should correctly handle different types: ???

This last one is tricky. For example, how should we best replace a block like this one, where we're performing an operation on two coordinates from two difference sequences which may have different dimensionality? We don't want an if block with the 16 possible combinations of dimensions.

    const Coordinate& p00 = e0->getCoordinate(segIndex0);
    const Coordinate& p01 = e0->getCoordinate(segIndex0 + 1);
    const Coordinate& p10 = e1->getCoordinate(segIndex1);
    const Coordinate& p11 = e1->getCoordinate(segIndex1 + 1);

    li.computeIntersection(p00, p01, p10, p11);

This particular case comes from a noding intersection finder. Those are virtual anyway, so I guess we could template the class on the coordinate types of the input geometries.

Could you just ditch the idea of fetching coordinates of some specific type altogether? So, rather than getAt<CoordinateXY>(i), you just have getAt(i) and you always return a reference to a XYZM. If the algorithm wants to treat it as XY or XYZ, it's free to do that. You'd have to pad your array with (up to) 2 doubles to avoid an accvio.

But I probably don't understand. :)

@dbaston
Copy link
Member Author

dbaston commented Nov 3, 2022

You'd have to pad your array with (up to) 2 doubles to avoid an accvio.

The thinking is that most usage of GEOS does not involve Z or M, so it would be nice to avoid a 100% storage penalty for the common XY case.

@pramsey
Copy link
Member

pramsey commented Nov 3, 2022

He's not proposing a 4D underlying vector, he's proposing that the reads always be via a Coordinate4D, and that the caller check the dimensionality of the CoordinateSequence to determine whether or not the Z/M values are garbage or not.

@dbaston dbaston force-pushed the concrete-coordseq-buf branch 2 times, most recently from 1338046 to 2c87470 Compare November 4, 2022 11:41
@dbaston
Copy link
Member Author

dbaston commented Nov 8, 2022

I think this is ready to go. I've run an updated set of benchmarks against the current main branch. The only performance regression is in the GEOSPreparedContains test, as discussed above. The gains in the remaining tests are modest but consistent.

image

   name                                    this_pr    main speedup_pct
   <chr>                                     <dbl>   <dbl>       <dbl>
 1 australia buffer                        402.    401.         -0.174
 2 australia isvalid                         1.40    1.48        5.09 
 3 buffered watershed union                  2.81    3.02        7.14 
 4 cluster countries                         0.924   1.01        8.50 
 5 county buffer                            12.9    13.5         4.28 
 6 landcov isvalid                           0.132   0.153      14.2  
 7 pip watersheds (GEOSPreparedContains)   274.    241.        -13.5  
 8 pip watersheds (GEOSPreparedContainsXY) 196.    198.          1.47 
 9 watershed intersection                   21.5    22.8         5.76 
10 watershed union                           2.09    2.19        4.41 
11 watersheds isvalid                        0.269   0.316      14.8  

Until the entire library is made to be more dimension-aware (if ever) we are still storing Z values in all cases to avoid a crash or incorrect result when accessing values by Coordinate&. This means that XY coordinates are stored as XYZ, and XYM coordinates are stored as XYZM. But at least we now have support for storage of M coordinates with no penalty in the lower-dimensionality cases. (You still can't do anything with them, the WKT and WKB readers/writers now handle them correctly.)

This PR does not include the ability to read coordinates directly from an external buffer. I think the best way to do this would be to replace the std::vector backing by our own vector type. I don't think that is a huge undertaking since we don't need anywhere near the full capabilities of std::vector, but I think it's best handled separately.

@dbaston
Copy link
Member Author

dbaston commented Nov 28, 2022

Any concerns with this one?

Comment on lines +68 to +73
CoordinateSequence::CoordinateSequence(std::size_t sz, std::size_t dim) :
m_vect(sz * 3),
m_stride(3u),
m_hasdim(dim > 0),
m_hasz(dim == 3),
m_hasm(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running into another issue in Shapely.

If one would do GEOSCoordSeq_create(1, 4), which ends up above CoordinateSequence(1, 4), that would create a 3D coordseq with both hasz and hasm set to False.

I see that the documentation of both CoordinateSequence(std::size_t size, std::size_t dim = 0) and the GEOSCoordSeq_create C API says that this is only for creating XY or XYZ sequences (so dim can be 2 or 3). So that's a user error that I pass a value of 4.
But just noting that before, this would raise an exception, while now you silently get wrong values if you subsequently set the ordinates with GEOSCoordSeq_setOrdinate_r / CoordinateSequence::setOrdinate (the ordinate index 2 and 3 (z and m) will overwrite each other)

This is totally fixable on the Shapely side (we just need to verify the dimension on our side instead of just passing it to GEOS). But I do wonder if the GEOSCoordSeq_create should be updated to allow passing dim=4?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, see #753

for (std::size_t i = 0; i < size; i++) {
coords[i] = { *buf, *(buf + 1) };
coords->setAt(Coordinate{ *buf, *(buf + 1) }, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question: the above doesn't yet support creating a XYM coordinate sequence, right? (there would need to be another if (hasM))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is fixed with #772

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants