Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

planner: add an upper bound for estimated row count of inner side of index join #41996

Merged
merged 16 commits into from Mar 14, 2023

Conversation

time-and-fate
Copy link
Member

@time-and-fate time-and-fate commented Mar 7, 2023

What problem does this PR solve?

Issue Number: ref #31316

Problem Summary:

Because of our current implementation of index join, the row count for its inner side might be severely overestimated.

What is changed and how it works?

Add a reasonable upper bound for the inner children of index join to prevent very severe estimation errors.

Specifically, the average row count for the inner side IndexScan that corresponds to each row from the outer side should be no larger than (total row count / NDV of join key columns).

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Copy link
Member

ti-chi-bot commented Mar 7, 2023

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • AilinKid
  • winoros

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 7, 2023
@time-and-fate
Copy link
Member Author

/retest

@time-and-fate
Copy link
Member Author

/test unit-test

@ti-chi-bot ti-chi-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Mar 8, 2023
" │ └─IndexRangeScan 1000000.00 cop[tikv] table:t2, index:idx(b) range: decided by [eq(test.t.b, test.t.a)], keep order:false, stats:pseudo",
" └─Selection(Probe) 1000000.00 cop[tikv] eq(test.t.a, 0)",
" └─TableRowIDScan 1000000.00 cop[tikv] table:t2 keep order:false, stats:pseudo",
))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a comparison, on current master branch, the execution plan is:

IndexJoin 1000000.00 root  inner join, inner:IndexLookUp, outer key:test.t.a, inner key:test.t.b, equal cond:eq(test.t.a, test.t.b)
├─TableReader(Build) 1000.00 root  data:Selection
│ └─Selection 1000.00 cop[tikv]  lt(test.t.a, 1), not(isnull(test.t.a))
│   └─TableFullScan 500000.00 cop[tikv] table:t keep order:false, stats:pseudo
└─IndexLookUp(Probe) 1000000.00 root  
  ├─Selection(Build) 500000000.00 cop[tikv]  not(isnull(test.t.b))
  │ └─IndexRangeScan 500000000.00 cop[tikv] table:t2, index:idx(b) range: decided by [eq(test.t.b, test.t.a)], keep order:false, stats:pseudo
  └─Selection(Probe) 1000000.00 cop[tikv]  eq(test.t.a, 0)
    └─TableRowIDScan 500000000.00 cop[tikv] table:t2 keep order:false, stats:pseudo

return idxStats.NDV
}
}
return -1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return max{single col ndv} instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be min.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

Copy link
Contributor

@AilinKid AilinKid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rest LGTM

}
}

// 3. If we still haven't got an NDV, we use the minimal NDV in the column stats as a lower bound.
Copy link
Contributor

@AilinKid AilinKid Mar 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer a more case 2.5, once the index prefix columns satisfied the keys,we can use the index‘s DNV / the abscent column’s DNV (thought they are completely independent distributions, and result after division is quite small than the real value, leading a higher row count estimations(but not that highest from direct dividing lowest col-ndv) is which we wanna get from here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a clever one!
It's not that obvious but indeed a correct lower bound for the NDV. And I think it doesn't need to be prefix columns, it's enough if the index contains the columns we want.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if we add this, there needs to be a strategy for choosing from multiple possible indexes, and we need to make sure the result is stable, we also need to care about some edge cases like prefix index and generated columns, and we also need to construct corresponding test cases, that's too much for this sprint.
Besides, this PR would be cherry-picked for a customer, safety is more important, and less change is preferred.
So I tend to add a TODO comment here and leave it to the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a clever one! It's not that obvious but indeed a correct lower bound for the NDV. And I think it doesn't need to be prefix columns, it's enough if the index contains the columns we want.

make sense,subset of index column is enough

@time-and-fate
Copy link
Member Author

/retest

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 14, 2023
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 14, 2023
@time-and-fate
Copy link
Member Author

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 23f34df

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 14, 2023
@time-and-fate
Copy link
Member Author

/retest

@ti-chi-bot ti-chi-bot merged commit e264615 into pingcap:master Mar 14, 2023
time-and-fate added a commit to time-and-fate/tidb that referenced this pull request Mar 15, 2023
ti-chi-bot pushed a commit that referenced this pull request Mar 16, 2023
ti-chi-bot pushed a commit that referenced this pull request Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note-none size/L Denotes a PR that changes 100-499 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants