
Introduce NDV based cardinality estimation for filters #405

Closed
wants to merge 2 commits

Conversation

vraghavan78
Member

When the size of the array is more than 100, our translator does not expand the array.
This makes GPORCA's cardinality estimation model treat the predicate as unsupported,
which in turn makes the cardinality estimate wrong:

create table foo(a int, b int);
insert into foo select i, i from generate_series(1,100) i;

Next force GPORCA to not expand the array

vraghavan=# set optimizer_array_expansion_threshold = 1;
SET
vraghavan=# explain select * from foo where b in (1, 2, 3);
                                   QUERY PLAN
-------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..431.00 rows=40 width=8)
   ->  Table Scan on foo  (cost=0.00..431.00 rows=14 width=8)
         Filter: b = ANY ('{1,2,3}'::integer[])
 Settings:  optimizer_array_expansion_threshold=1
 Optimizer status: PQO version 2.75.0
(5 rows)

In this example:

  • The table has 100 rows
  • Every value of column b is unique, so b has 100 distinct values
  • The array in the IN clause has 3 values, so the cardinality can be at most 3
  • Since the array is not expanded, we get a wrong cardinality estimate (40 rows at the Gather Motion instead of 3)

In this change, we introduce an NDV-based cardinality estimation model
that tries to make such unsupported predicates more selective.
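
For illustration, here is a minimal sketch of the idea in C++. The names (NdvScaleFactor, hist_num_distinct, array_length, input_rows) are hypothetical stand-ins, not the actual GPORCA members: when the array constant is left unexpanded, the column's NDV and the number of array elements still bound how selective the filter can be.

```cpp
#include <algorithm>

// Sketch only: NDV-based scale factor for an unexpanded "col IN (c1, ..., ck)".
// With 100 distinct values of b and a 3-element array, at most 3 of the 100
// values can match, so the scale factor is bounded by 100 / 3.
static double NdvScaleFactor(double hist_num_distinct, double array_length)
{
    // never scale the row count up; an IN list cannot make the filter less selective
    return std::max(1.0, hist_num_distinct / array_length);
}

// Estimated output cardinality: 100 / (100 / 3) = 3 rows for the example above,
// instead of the 40-row estimate shown in the plan.
static double EstimateRows(double input_rows, double hist_num_distinct, double array_length)
{
    return input_rows / NdvScaleFactor(hist_num_distinct, array_length);
}
```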

Contributor

@bhuvnesh2703 bhuvnesh2703 left a comment


Looks Good overall


// conversion function
static
CStatsPredNDV *ConvertPredStats
Contributor


Cast method name

(CStatsPred::EstatscmptEq == stats_cmp_type) &&
is_array_cmp_any)
{
// for large array constants greater than 100 elements, expanding the constants
Contributor


Is 100 configurable? If so, just mention the default in brackets.
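
(For context: the threshold discussed here is the optimizer_array_expansion_threshold GUC used in the example above; the PR description implies it defaults to 100, and arrays larger than that setting are left unexpanded.)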

CDouble hist_num_distinct = hist_before->GetNumDistinct();

// consider the predicate foo.b IN (1,2,3) and the array constant is not
// expanded and the input has 100 NDV. Then the upper bound of scale factor is 100/3.
Contributor


scale factor is 100/3 (i.e., out of a hundred distinct values, at most 3 will match)
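
Worked through with the numbers from the description: the table has 100 rows and 100 distinct values of b, and the array has 3 elements, so the scale factor is 100/3 ≈ 33.3 and the estimate becomes 100 / 33.3 ≈ 3 rows, rather than the 40 rows shown in the plan above.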

@@ -912,6 +915,9 @@ CTranslatorExprToDXLUtils::PdxlnListFilterScCmp

pmdidScCmp = CUtils::GetScCmpMdId(mp, md_accessor, pmdidTypeOther, pmdidTypePartKey, cmp_type);

// number of distinct values used in cardinality estimation; not needed at this stage, since we are leaving ORCA-land
ULONG ulLengthArray = 0;
Contributor


Why set it to 0? I didn't understand the comment here.

Member Author


We don't care about this length, so why bother setting it. This field is only used during statistics derivation; since this code runs during Expr2DXL translation, the field will never be read.

@@ -0,0 +1,44 @@
//---------------------------------------------------------------------------
// Greenplum Database
// Copyright (C) 2013 EMC Corp.
Contributor


pivotal copyright?

// the number of distinct values in the result histogram and its corresponding frequency
// is based on the frequency of not null constants and the scale factor
CDouble distinct_remaining = std::min(filter_num_distinct, hist_num_distinct);
CDouble freq_remaining = (1 - hist_before->GetNullFreq()) / scale_factor;
Contributor


Will this always move the estimate in the right direction, whether the scale factor is 1 / CHistogram::DefaultSelectivity or hist_num_distinct / filter_num_distinct?

@vraghavan78 vraghavan78 force-pushed the large-array branch 2 times, most recently from 3ab51e0 to 2c45c45 on September 26, 2018 23:30
Contributor

@hardikar hardikar left a comment


The review has a few scattered comments about moving ScalarArrayComp::m_array_length down to the RHS const-array child, but looking at the code, (*scalar_cmp_expr)[1]->m_pdrgPconst->Size() should give you that number at any time. Did you already try that approach?

@@ -812,11 +812,19 @@ CDXLOperatorFactory::MakeDXLArrayComp
);
}

INT array_const_len = 0;

const XMLCh *length_xml = attrs.getValue(CDXLTokens::XmlstrToken(EdxltokenArrayConstantLength));
Contributor


Perhaps it makes more sense to move this to CParseHandlerArray. ScalarArrayCmp need not always have a const as its RHS child.

@@ -854,7 +854,7 @@ select * from P,X where P.a=X.a and X.a not in (1,2);
</dxl:ProjElem>
</dxl:ProjList>
<dxl:Filter>
<dxl:ArrayComp OperatorName="&lt;&gt;" OperatorMdid="0.518.1.0" OperatorType="All">
<dxl:ArrayComp OperatorName="&lt;&gt;" OperatorMdid="0.518.1.0" OperatorType="All" ArrayConstLength="2">
Contributor


Why do we need to save this in the MDP? Can't we compute it on the fly?

Member Author


No. Not if you do not expand it; in that case it is a Const.


CScalarIdent *scalar_ident_op = CScalarIdent::PopConvert(expr_ident->Pop());
const CColRef *col_ref = scalar_ident_op->Pcr();

if (!is_cmp_to_const_and_scalar_idents)
Contributor


is_cmp_to_const_and_scalar_idents = CPredicateUtils::FCompareCastIdentToConstArray(predicate_expr) or CPredicateUtils::FCompareScalarIdentToConstAndScalarIdentArray(predicate_expr)

Don't we want to use NDV if CPredicateUtils::FCompareCastIdentToConstArray(predicate_expr)?

// unsupported predicate for stats calculations
pred_stats_array->Append(GPOS_NEW(mp) CStatsPredUnsupported(
gpos::ulong_max, CStatsPred::EstatscmptOther));
if ((0 != scalar_array_cmp_op->UlLength()) &&
Contributor


(0 != scalar_array_cmp_op->UlLength()) is only needed because we're putting the const length at the scalar_array_cmp level. Maybe it could go into the ScalarArray?

Member Author


No. See the minidump test I added. The array is converted into a constant. Since this is only used by the comparison operator, it is okay to leave it here.

Contributor


Ok. Let me delete all the comments like that. It would be nice to use a different name for UlLength, something like UlArrayConstLengthForStats, so that it's clear.

// the number of distinct values in the result histogram and its corresponding frequency
// is based on the frequency of not null constants and the scale factor
CDouble distinct_remaining = std::min(filter_num_distinct, hist_num_distinct);
CDouble freq_remaining = (1 - hist_before->GetNullFreq()) / std::max(CDouble(1), (hist_num_distinct / filter_num_distinct));
Contributor


Any reason why you didn't use scale_factor in this expression? We don't want this capped by DefaultSelectivity?

Member Author


That is done after the fact, over all the predicates.

bhuvnesh2703 pushed a commit to greenplum-db/gpdb that referenced this pull request Oct 16, 2018
In this change, we pass the size of the array constant so that GPORCA can try to do a
better job estimating cardinality.

Associated GPORCA PR: greenplum-db/gporca#405
@vraghavan78 vraghavan78 force-pushed the large-array branch 2 times, most recently from d407995 to 391066d on October 16, 2018 21:35
vraghavan78 added a commit to vraghavan78/gpdb that referenced this pull request Oct 24, 2018
vraghavan78 and others added 2 commits October 26, 2018 10:05
bhuvnesh2703 pushed a commit to bhuvnesh2703/gpdb that referenced this pull request Oct 26, 2018
@vraghavan78
Member Author

Closing PR since some of the TPC-DS queries regressed

@hardikar hardikar deleted the large-array branch December 5, 2018 23:56