-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix partitioned spatial joins with small spatial index #14485
Conversation
This is also a step towards spatial LEFT JOINs, since non-matching rows have to be handled more intentionally. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One bit of rapid feedback. Can you make this a configuration option and session property?
If this breaks an existing query or ends up having a bug it would be good to be able to turn it off without restarting clusters.
I'm interpreting the session property gating is "leave most of the PR the same, but only add the OUTSIDE partition index if a session property is set." If so, sure! |
* We assume that the right node is the build side. The CBO is disabled | ||
* for spatial joins, so this should be a good assumption. Then we have: | ||
* 1. INNER joins are OK. | ||
* 2. LEFT joins when the left node is a Point are OK. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation doesn't actually allow this yet
@@ -101,7 +101,7 @@ public static void output(SpatialPartitioningState state, BlockBuilder out) | |||
Rectangle envelope = state.getExtent(); | |||
|
|||
// Add a small buffer on the right and upper sides |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is now wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulus the notes and and adding a flag so we can disable this.
We talked about it, and while people can go and update their spatial partitioning my concern is that they might go and do that and then find out there is some other blocker that prevents them from doing it.
5867755
to
022bade
Compare
309b201
to
bfb52c6
Compare
bfb52c6
to
c9f8ba1
Compare
c9f8ba1
to
bd88ff1
Compare
1e250ef
to
5e5cf8b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me. Had to break out the graph paper, for the tests :-)
Currently, if a geometry is outside the bounding box of a KdbTree, it is dropped: it's assigned an empty partition index array, which is unnested, resulting in the row being dropped. This can be an efficiency measure: if one side of the join is much smaller than the other, then the bounds will drop many rows before they are sent to the join worker. However, if the bounds are less than both the build- and probe-side of the join, then rows that would have matched in a non-partitioned join will be dropped when you partition the join. This makes the correctness of the partitioned join dependent on the partition chosen, which can lead to some surprising output changes that could be reasonably viewed as data loss. This commit changes the bounding box of the KdbTree to extend from -Infinity to +Infinity, so that all (non-empty) geometries will get at least one partition.
This commit adds more data to the spatial join tests, to test cases where geometries may be outside the range of those geometries used to build the spatial index.
5e5cf8b
to
b0df9cd
Compare
In a partitioned spatial join, rows with geometries are assigned 0, 1, or
more partition indexes via the KdbTree spatial partition. The geometry
is assigned one index for each leaf node rectangle it intersects.
Currently, if a geometry is outside the bounding box of a KdbTree, it is
dropped: it's assigned an empty partition index array, which is
unnested, resulting in the row being dropped. This can be an efficiency
measure: if one side of the join is much smaller than the other, then
the bounds will drop many rows before they are sent to the join worker.
However, if the bounds are less than both the build- and probe-side of
the join, then rows that would have matched in a non-partitioned join
will be dropped when you partition the join. This makes the correctness
of the partitioned join dependent on the partition chosen, which can
lead to some surprising output changes that could be reasonably viewed
as data loss.
This change makes the KdbTree "open": its outer boundaries extend to
infinity. This means all points in the plane belong to exactly one leaf node
of the KdbTree.