Adjust joins cost estimation [HZ-2658, HZ-2494] #25249
In case of an outer join, one of the sides is broadcast. This requires more processing. Shouldn't we include that in the cost calculation (I don't mean network, although that would be nice too, but CPU)?
Yes, makes sense.
f65293f
This is not what I meant: in case of an outer non-equi join, the outer side is broadcast. So we do `rightCount*memberCount` lookups, which directly translates to the CPU cost of lookups. However, these lookups are also different than in the case of an equi-join. E.g. for a query like `select * from m1 left join m2 on m1.__key<>m2.__key`, the hashmap in `SqlHashJoinP` degenerates to a single key, and we iterate over all rows (this is a de facto nested loop with spooling of the left side in memory).

Additionally, those rows have to be sent over the network, but AFAIR we do not add network cost to CPU cost. We have a separate `network` cost, which has its own multiplier but is generally `0`.
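To make the arithmetic in this comment concrete, here is a minimal Java sketch. It is not the Hazelcast cost model: the class name, method names, and the weighted-sum form of the total cost are all assumptions made for illustration. It only shows the broadcast side contributing `rightCount*memberCount` lookups to CPU, and a network term that drops out when its multiplier is `0`.

```java
// Illustrative sketch only; names and the weighted-sum formula are hypothetical,
// not taken from the Hazelcast code base.
final class BroadcastJoinCostSketch {

    /** Lookups performed when the broadcast side is replicated to every member. */
    static double broadcastLookups(double rightCount, int memberCount) {
        return rightCount * memberCount;
    }

    /**
     * Combines CPU and network cost using separate multipliers (assumed form).
     * With networkMultiplier == 0 the network term has no effect on the total.
     */
    static double totalCost(double cpu, double network,
                            double cpuMultiplier, double networkMultiplier) {
        return cpu * cpuMultiplier + network * networkMultiplier;
    }

    public static void main(String[] args) {
        double cpu = broadcastLookups(1_000, 3);
        System.out.println("cpu lookups = " + cpu);
        System.out.println("total cost  = " + totalCost(cpu, 500, 1.0, 0.0));
    }
}
```

With a network multiplier of `0`, only the CPU term distinguishes plans, which is why the extra lookups caused by broadcasting would have to show up there.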
If I see correctly, in case of a non-equi hash join the hashmap is always degenerate. So we:
- build `memberCount` single-unique-key hash maps (multimaps to be precise), each containing `rightCount` items
- do `leftCount` lookups (in a single-key hash map) and iterate over all found rows (i.e. `rightCount` rows).

This gives `leftCount` lookups (are they constant time with regard to `rightCount`?) and `leftCount*rightCount` comparisons for iterating over the elements found in the multimap (i.e. all right side rows).
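The counting above can be checked with a small standalone simulation. This is a sketch under stated assumptions, not the `SqlHashJoinP` implementation: it runs on a single member, uses plain integers as rows, evaluates `<>` as the join predicate, and produces inner-join output only (no null padding for unmatched left rows). The class, the `SINGLE_KEY` constant, and the `Result` holder are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not the Hazelcast processor.
final class DegenerateHashJoinSketch {

    // With no equi-join columns there is nothing to hash on,
    // so every right row is stored under the same key.
    private static final Object SINGLE_KEY = new Object();

    /** Counters and output collected by the sketch. */
    static final class Result {
        final long lookups;
        final long comparisons;
        final List<int[]> pairs;

        Result(long lookups, long comparisons, List<int[]> pairs) {
            this.lookups = lookups;
            this.comparisons = comparisons;
            this.pairs = pairs;
        }
    }

    static Result joinNotEqual(List<Integer> left, List<Integer> right) {
        // Build phase: a multimap with one key holding all right rows.
        Map<Object, List<Integer>> multimap = new HashMap<>();
        for (Integer r : right) {
            multimap.computeIfAbsent(SINGLE_KEY, k -> new ArrayList<>()).add(r);
        }

        long lookups = 0;
        long comparisons = 0;
        List<int[]> pairs = new ArrayList<>();

        // Probe phase: one lookup per left row, then a scan of every right row.
        for (Integer l : left) {
            lookups++;
            List<Integer> bucket = multimap.getOrDefault(SINGLE_KEY, List.of());
            for (Integer r : bucket) {
                comparisons++;
                if (!l.equals(r)) {
                    pairs.add(new int[] {l, r});
                }
            }
        }
        return new Result(lookups, comparisons, pairs);
    }
}
```

In this sketch each lookup is a single hash probe regardless of `rightCount`; the dependence on `rightCount` comes entirely from scanning the bucket, which is what makes it behave like a nested loop.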
The equi-join case is somewhere on the continuum between "a hashmap lookup always gives a single right row" and "a hashmap lookup gives all right rows", depending on the selectivity of the equi-join columns (which we cannot estimate because we do not have column histograms).
Also, in an equi-join the hash table is partitioned, but it might be skewed or even degenerate for the same reason.
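The two ends of that continuum can be illustrated with a rough estimate. The sketch below assumes right rows are spread uniformly over `distinctKeys` join-key values, which is exactly the kind of information that is unavailable without histograms, so treat it as an illustration rather than a proposed estimator. The class and parameter names are hypothetical.

```java
// Illustrative sketch only; assumes a uniform key distribution.
final class EquiJoinContinuumSketch {

    /**
     * Expected number of right rows scanned across all probes.
     * distinctKeys == rightCount -> one right row per lookup.
     * distinctKeys == 1          -> every lookup returns all right rows.
     */
    static double expectedComparisons(double leftCount, double rightCount, double distinctKeys) {
        if (distinctKeys < 1 || distinctKeys > rightCount) {
            throw new IllegalArgumentException("distinctKeys must be in [1, rightCount]");
        }
        double rowsPerLookup = rightCount / distinctKeys;
        return leftCount * rowsPerLookup;
    }

    public static void main(String[] args) {
        System.out.println(expectedComparisons(100, 50, 50)); // unique keys
        System.out.println(expectedComparisons(100, 50, 1));  // degenerate, same as non-equi
    }
}
```

Skew breaks the uniformity assumption: a few hot keys push the real cost toward the degenerate end even when the number of distinct keys is large.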
We do not need `hz` anymore.