
Introduce NDV based cardinality estimation for filters #405

Closed
wants to merge 2 commits

Conversation

vraghavan78
Member

When the size of the array is more than 100, our translator does not expand the array.
This makes GPORCA's cardinality estimation model treat the predicate as unsupported,
which in turn makes the cardinality estimate wrong:

create table foo(a int, b int);
insert into foo select i, i from generate_series(1,100) i;

Next force GPORCA to not expand the array

vraghavan=# set optimizer_array_expansion_threshold = 1;
SET
vraghavan=# explain select * from foo where b in (1, 2, 3);
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=40 width=8)
   ->  Table Scan on foo  (cost=0.00..431.00 rows=14 width=8)
         Filter: b = ANY ('{1,2,3}'::integer[])
 Settings:  optimizer_array_expansion_threshold=1
 Optimizer status: PQO version 2.75.0
(5 rows)

In this example:

  • The table has 100 rows
  • Every value of column b is unique, so b has 100 distinct values
  • The array in the IN clause has 3 values, so the cardinality can be at most 3
  • Since the array is not expanded, we get a wrong cardinality estimate (40 rows at the Gather Motion instead of 3)

In this change, we introduce an NDV-based cardinality estimation model
that tries to make such unsupported predicates more selective.
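
For illustration, here is a minimal sketch of the idea in C++. The names (NdvScaleFactor, hist_num_distinct, array_length, input_rows) are hypothetical stand-ins, not the actual GPORCA members: when the array constant is left unexpanded, the column's NDV and the number of array elements still bound how selective the filter can be.

```cpp
#include <algorithm>

// Sketch only: NDV-based scale factor for an unexpanded "col IN (c1, ..., ck)".
// With 100 distinct values of b and a 3-element array, at most 3 of the 100
// values can match, so the scale factor is bounded by 100 / 3.
static double NdvScaleFactor(double hist_num_distinct, double array_length)
{
    // never scale the row count up; an IN list cannot make the filter less selective
    return std::max(1.0, hist_num_distinct / array_length);
}

// Estimated output cardinality: 100 / (100 / 3) = 3 rows for the example above,
// instead of the 40-row estimate shown in the plan.
static double EstimateRows(double input_rows, double hist_num_distinct, double array_length)
{
    return input_rows / NdvScaleFactor(hist_num_distinct, array_length);
}
```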

Contributor

@bhuvnesh2703 bhuvnesh2703 left a comment


Looks Good overall


// conversion function
static
CStatsPredNDV *ConvertPredStats
Contributor


Cast method name

(CStatsPred::EstatscmptEq == stats_cmp_type) &&
is_array_cmp_any)
{
// for large array constants greater than 100 elements, expanding the constants
Contributor


Is 100 configurable? If so, just mention the default in brackets.
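
(For context: the threshold discussed here is the optimizer_array_expansion_threshold GUC used in the example above; the PR description implies it defaults to 100, and arrays larger than that setting are left unexpanded.)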

CDouble hist_num_distinct = hist_before->GetNumDistinct();

// consider the predicate foo.b IN (1,2,3) and the array constant is not
// expanded and the input has 100 NDV. Then the upper bound of scale factor is 100/3.
Contributor


scale factor is 100/3 (i.e., out of a hundred distinct values, at most 3 will match)
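
Worked through with the numbers from the description: the table has 100 rows and 100 distinct values of b, and the array has 3 elements, so the scale factor is 100/3 ≈ 33.3 and the estimate becomes 100 / 33.3 ≈ 3 rows, rather than the 40 rows shown in the plan above.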

@@ -912,6 +915,9 @@ CTranslatorExprToDXLUtils::PdxlnListFilterScCmp

pmdidScCmp = CUtils::GetScCmpMdId(mp, md_accessor, pmdidTypeOther, pmdidTypePartKey, cmp_type);

// number of distinct values used in cardinality estimation; not needed at this stage, since we are leaving ORCA-land
ULONG ulLengthArray = 0;
Contributor


Why set it to 0? I didn't understand the comment here.

Member Author


We don't care about this length, so why bother setting it. This field is only used during statistics derivation; since this code runs during Expr2DXL translation, the field will never be read.

@@ -0,0 +1,44 @@
//---------------------------------------------------------------------------
// Greenplum Database
// Copyright (C) 2013 EMC Corp.
Contributor


pivotal copyright?

// the number of distinct values in the result histogram and its corresponding frequency
// is based on the frequency of not null constants and the scale factor
CDouble distinct_remaining = std::min(filter_num_distinct, hist_num_distinct);
CDouble freq_remaining = (1 - hist_before->GetNullFreq()) / scale_factor;
Contributor


Will this always move the estimate in the right direction, whether the scale factor is 1 / CHistogram::DefaultSelectivity or hist_num_distinct / filter_num_distinct?

@vraghavan78 vraghavan78 force-pushed the large-array branch 2 times, most recently from 3ab51e0 to 2c45c45 on September 26, 2018 23:30
Contributor

@hardikar hardikar left a comment


The review has a few scattered comments about moving ScalarArrayComp::m_array_length down to the RHS const-array child, but looking at the code, (*scalar_cmp_expr)[1]->m_pdrgPconst->Size() should give you that number at any time. Did you already try that approach?

@@ -812,11 +812,19 @@ CDXLOperatorFactory::MakeDXLArrayComp
);
}

INT array_const_len = 0;

const XMLCh *length_xml = attrs.getValue(CDXLTokens::XmlstrToken(EdxltokenArrayConstantLength));
Contributor


Perhaps it makes more sense to move this to CParseHandlerArray. ScalarArrayCmp need not always have a const as its RHS child.

@@ -854,7 +854,7 @@ select * from P,X where P.a=X.a and X.a not in (1,2);
</dxl:ProjElem>
</dxl:ProjList>
<dxl:Filter>
<dxl:ArrayComp OperatorName="&lt;&gt;" OperatorMdid="0.518.1.0" OperatorType="All">
<dxl:ArrayComp OperatorName="&lt;&gt;" OperatorMdid="0.518.1.0" OperatorType="All" ArrayConstLength="2">
Contributor


Why do we need to save this in the MDP? Can't we compute it on the fly?

Member Author


No. Not if you do not expand it; in that case it is a Const.


CScalarIdent *scalar_ident_op = CScalarIdent::PopConvert(expr_ident->Pop());
const CColRef *col_ref = scalar_ident_op->Pcr();

if (!is_cmp_to_const_and_scalar_idents)
Contributor


is_cmp_to_const_and_scalar_idents = CPredicateUtils::FCompareCastIdentToConstArray(predicate_expr) or CPredicateUtils::FCompareScalarIdentToConstAndScalarIdentArray(predicate_expr)

Don't we want to use NDV if CPredicateUtils::FCompareCastIdentToConstArray(predicate_expr)?

// unsupported predicate for stats calculations
pred_stats_array->Append(GPOS_NEW(mp) CStatsPredUnsupported(
gpos::ulong_max, CStatsPred::EstatscmptOther));
if ((0 != scalar_array_cmp_op->UlLength()) &&
Contributor


(0 != scalar_array_cmp_op->UlLength()) is only needed because we're putting the const length at the scalar_array_cmp level. Maybe it could go into the ScalarArray?

Member Author


No. See the minidump test I added. The array is converted into a constant. Since this is only used by the comparison operator, it is okay to leave it here.

Contributor


Ok. Let me delete all the comments like that. It would be nice to use a different name for UlLength, something like UlArrayConstLengthForStats, so that it's clear.

// the number of distinct values in the result histogram and its corresponding frequency
// is based on the frequency of not null constants and the scale factor
CDouble distinct_remaining = std::min(filter_num_distinct, hist_num_distinct);
CDouble freq_remaining = (1 - hist_before->GetNullFreq()) / std::max(CDouble(1), (hist_num_distinct / filter_num_distinct));
Contributor


Any reason why you didn't use scale_factor in this expression? We don't want this capped by DefaultSelectivity?

Member Author


That is done after the fact, over all the predicates.

bhuvnesh2703 pushed a commit to greenplum-db/gpdb that referenced this pull request Oct 16, 2018
In this change, we pass the size of the array constant so that GPORCA can try to do a
better job estimating cardinality.

Associated GPORCA PR: greenplum-db/gporca#405
@vraghavan78 vraghavan78 force-pushed the large-array branch 2 times, most recently from d407995 to 391066d on October 16, 2018 21:35
vraghavan78 added a commit to vraghavan78/gpdb that referenced this pull request Oct 24, 2018
vraghavan78 and others added 2 commits October 26, 2018 10:05
bhuvnesh2703 pushed a commit to bhuvnesh2703/gpdb that referenced this pull request Oct 26, 2018
@vraghavan78
Member Author

Closing PR since some of the TPC-DS queries regressed

@hardikar hardikar deleted the large-array branch December 5, 2018 23:56