Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix partitioned spatial joins with small spatial index #14485

Merged
merged 2 commits into from
Oct 7, 2020

Conversation

jagill
Copy link
Contributor

@jagill jagill commented May 5, 2020

In a partitioned spatial join, rows with geometries are assigned 0, 1, or
more partition indexes via the KdbTree spatial partition. The geometry
is assigned one index for each leaf node rectangle it intersects.

Currently, if a geometry is outside the bounding box of a KdbTree, it is
dropped: it's assigned an empty partition index array, which is
unnested, resulting in the row being dropped. This can be an efficiency
measure: if one side of the join is much smaller than the other, then
the bounds will drop many rows before they are sent to the join worker.

However, if the bounds are less than both the build- and probe-side of
the join, then rows that would have matched in a non-partitioned join
will be dropped when you partition the join. This makes the correctness
of the partitioned join dependent on the partition chosen, which can
lead to some surprising output changes that could be reasonably viewed
as data loss.

This change makes the KdbTree "open": its outer boundaries extend to
infinity. This means all points in the plane belong to exactly one leaf node
of the KdbTree.

== RELEASE NOTES ==

Geospatial Changes
* Allow geometries outside of the spatial partitioning to match in a partitioned spatial join.

@jagill
Copy link
Contributor Author

jagill commented May 5, 2020

This is also a step towards spatial LEFT JOINs, since non-matching rows have to be handled more intentionally.

Copy link
Contributor

@aweisberg aweisberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One bit of rapid feedback. Can you make this a configuration option and session property?
If this breaks an existing query or ends up having a bug it would be good to be able to turn it off without restarting clusters.

@jagill
Copy link
Contributor Author

jagill commented May 6, 2020

I'm interpreting the session property gating is "leave most of the PR the same, but only add the OUTSIDE partition index if a session property is set." If so, sure!

* We assume that the right node is the build side. The CBO is disabled
* for spatial joins, so this should be a good assumption. Then we have:
* 1. INNER joins are OK.
* 2. LEFT joins when the left node is a Point are OK.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation doesn't actually allow this yet

@@ -101,7 +101,7 @@ public static void output(SpatialPartitioningState state, BlockBuilder out)
Rectangle envelope = state.getExtent();

// Add a small buffer on the right and upper sides
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is now wrong

Copy link
Contributor

@aweisberg aweisberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulus the notes and and adding a flag so we can disable this.

We talked about it, and while people can go and update their spatial partitioning my concern is that they might go and do that and then find out there is some other blocker that prevents them from doing it.

@jagill jagill force-pushed the spatial-partition-fix branch 2 times, most recently from 309b201 to bfb52c6 Compare June 2, 2020 15:22
@jagill jagill requested a review from aweisberg June 2, 2020 15:22
@jagill jagill requested a review from rschlussel June 21, 2020 18:29
@jagill jagill force-pushed the spatial-partition-fix branch 2 times, most recently from 1e250ef to 5e5cf8b Compare July 18, 2020 11:30
Copy link
Contributor

@aweisberg aweisberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me. Had to break out the graph paper, for the tests :-)

Currently, if a geometry is outside the bounding box of a KdbTree, it is
dropped: it's assigned an empty partition index array, which is
unnested, resulting in the row being dropped.  This can be an efficiency
measure: if one side of the join is much smaller than the other, then
the bounds will drop many rows before they are sent to the join worker.

However, if the bounds are less than both the build- and probe-side of
the join, then rows that would have matched in a non-partitioned join
will be dropped when you partition the join.  This makes the correctness
of the partitioned join dependent on the partition chosen, which can
lead to some surprising output changes that could be reasonably viewed
as data loss.

This commit changes the bounding box of the KdbTree to extend from
-Infinity to +Infinity, so that all (non-empty) geometries will get
at least one partition.
This commit adds more data to the spatial join tests, to test cases
where geometries may be outside the range of those geometries used
to build the spatial index.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants