Introduce NDV based cardinality estimation for filters #405
Conversation
Force-pushed from ceab035 to f885aea
Looks good overall.
```
// conversion function
static
CStatsPredNDV *ConvertPredStats
```
Cast method name
```
(CStatsPred::EstatscmptEq == stats_cmp_type) &&
is_array_cmp_any)
{
// for large array constants greater than 100 elements, expanding the constants
```
Is 100 configurable? If so, just mention the default in brackets.
```
CDouble hist_num_distinct = hist_before->GetNumDistinct();

// consider the predicate foo.b IN (1,2,3) and the array constant is not
// expanded and the input has 100 NDV. Then the upper bound of scale factor is 100/3.
```
The scale factor is 100/3 (i.e., out of a hundred distinct values, at most 3 will match).
```
@@ -912,6 +915,9 @@ CTranslatorExprToDXLUtils::PdxlnListFilterScCmp

pmdidScCmp = CUtils::GetScCmpMdId(mp, md_accessor, pmdidTypeOther, pmdidTypePartKey, cmp_type);

// number of distinct values used in cardinality estimation, no needed at this stage of leaving ORCA-land
ULONG ulLengthArray = 0;
```
Why set it to 0? I didn't understand the comment here.
We don't care about this length, so why bother setting it. This field is only used during statistics gathering. Since this is during Expr2DXL translation, it is a useless field that will never be used.
```
@@ -0,0 +1,44 @@
//---------------------------------------------------------------------------
//	Greenplum Database
//	Copyright (C) 2013 EMC Corp.
```
Pivotal copyright?
```
// the number of distinct values in the result histogram and its corresponding frequency
// is based on the frequency of not null constants and the scale factor
CDouble distinct_remaining = std::min(filter_num_distinct, hist_num_distinct);
CDouble freq_remaining = (1 - hist_before->GetNullFreq()) / scale_factor;
```
Will this always be correct, both when the scale factor is `1 / CHistogram::DefaultSelectivity` and when it is `hist_num_distinct / filter_num_distinct`?
Force-pushed from 3ab51e0 to 2c45c45
The review has a few scattered comments about moving `ScalarArrayComp::m_array_length` down to the RHS const array child, but looking at the code, `(*scalar_cmp_expr)[1]->m_pdrgPconst->Size()` should give you that number at any time. Did you already try that approach?
```
@@ -812,11 +812,19 @@ CDXLOperatorFactory::MakeDXLArrayComp
	);
}

INT array_const_len = 0;

const XMLCh *length_xml = attrs.getValue(CDXLTokens::XmlstrToken(EdxltokenArrayConstantLength));
```
Perhaps it makes more sense to move this to CParseHandlerArray. ScalarArrayCmp need not always have a const as its RHS child.
```
@@ -854,7 +854,7 @@ select * from P,X where P.a=X.a and X.a not in (1,2);
       </dxl:ProjElem>
     </dxl:ProjList>
     <dxl:Filter>
-      <dxl:ArrayComp OperatorName="<>" OperatorMdid="0.518.1.0" OperatorType="All">
+      <dxl:ArrayComp OperatorName="<>" OperatorMdid="0.518.1.0" OperatorType="All" ArrayConstLength="2">
```
Why do we need to save this in the MDP? Can't we compute it on the fly?
No, not if you do not expand it; then it is a Const.
```
CScalarIdent *scalar_ident_op = CScalarIdent::PopConvert(expr_ident->Pop());
const CColRef *col_ref = scalar_ident_op->Pcr();

if (!is_cmp_to_const_and_scalar_idents)
```
`is_cmp_to_const_and_scalar_idents = CPredicateUtils::FCompareCastIdentToConstArray(predicate_expr) || CPredicateUtils::FCompareScalarIdentToConstAndScalarIdentArray(predicate_expr)`

Don't we want to use NDV if `CPredicateUtils::FCompareCastIdentToConstArray(predicate_expr)`?
```
// unsupported predicate for stats calculations
pred_stats_array->Append(GPOS_NEW(mp) CStatsPredUnsupported(
	gpos::ulong_max, CStatsPred::EstatscmptOther));
if ((0 != scalar_array_cmp_op->UlLength()) &&
```
`(0 != scalar_array_cmp_op->UlLength())` is only needed because we're putting the const length at the scalar_array_cmp level. Maybe it could go into the ScalarArray?
No. See the minidump test I added. The array is converted into a constant. Since this is only used by the comparison operator, it is okay to leave it here.
Ok. Let me delete all the comments like that. It would be nice to use a different name for `UlLength`, something like `UlArrayConstLengthForStats`, so that it's clear.
```
// the number of distinct values in the result histogram and its corresponding frequency
// is based on the frequency of not null constants and the scale factor
CDouble distinct_remaining = std::min(filter_num_distinct, hist_num_distinct);
CDouble freq_remaining = (1 - hist_before->GetNullFreq()) / std::max(CDouble(1), (hist_num_distinct / filter_num_distinct));
```
Any reason why you didn't use scale_factor in this expression? We don't want this capped by DefaultSelectivity?
That is done after the fact, over all the predicates.
Force-pushed from 2c45c45 to d7ff0a6
When the size of the array is more than `100`, our translator does not expand the arrays. This makes GPORCA's cardinality estimation model think it is an unsupported predicate, which in turn makes the cardinality estimate wrong:

```
create table foo(a int, b int);
insert into foo select i, i from generate_series(1,100) i;
```

Next, force GPORCA to not expand the array:

```
vraghavan=# set optimizer_array_expansion_threshold = 1;
SET
vraghavan=# explain select * from foo where b in (1, 2, 3);
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=40 width=8)
   ->  Table Scan on foo  (cost=0.00..431.00 rows=14 width=8)
         Filter: b = ANY ('{1,2,3}'::integer[])
 Settings:  optimizer_array_expansion_threshold=1
 Optimizer status: PQO version 2.75.0
(5 rows)
```

In the example:

* The table has 100 rows
* Column b has a unique value in each row
* The array in the IN clause has 3 values, so the cardinality must be at most 3
* Since the array is not expanded, we get a wrong cardinality estimate

In this change, we pass the size of the array constant so that GPORCA can try to do a better job estimating cardinality.

Associated GPORCA PR: greenplum-db/gporca#405
Force-pushed from d407995 to 391066d
Force-pushed from f097e91 to 3c72b59
Force-pushed from 3c72b59 to 1ec9f6d
Same description as above, but with an updated summary: In this change, we introduce an NDV-based cardinality estimation model that tries to make such unsupported predicates more selective.
Force-pushed from 1ec9f6d to 5f09f18
Closing PR since some of the TPC-DS queries regressed.