
Quantify stats/cost calculator quality. Probabilistically adjust models as a longer term goal #11615

sopel39 opened this issue Oct 1, 2018 · 14 comments


sopel39 commented Oct 1, 2018

Currently we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.

Cumulative cost quality

Let's call the cumulative estimated cost of query q:
EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q).
Let's call the actual query cost:
CostCpu(q), CostMemory(q), CostNetwork(q).
All of these are random variables with respect to q.

Observations:

  • f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimate error.
  • We expect f to be a linear transformation (otherwise the sum of subplan estimates won't be an estimate of the whole plan).
  • E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates.

Let's say we gather queries qi and their estimates EstimatedXXX(qi) and costs CostXXX(qi). Thus we have data points composed of pairs:

<x1=EstimatedXXX(q1), y1=CostXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN=CostXXX(qN)>

for i = 1...N.

Those points will look similar to:
(figure omitted: scatter plot of the raw data points, cost vs. estimate)

Let's transform those points by dividing each yi by its corresponding xi. We now get:

<x1=EstimatedXXX(q1), y1'=CostXXX(q1)/EstimatedXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN'=CostXXX(qN)/EstimatedXXX(qN)>

Now those points will look like:
(figure omitted: scatter plot of the transformed data points)

We can now perform regression analysis, as we should have satisfied the conditions for OLS regression (errors come from the same distribution and have constant variance).

We can derive quantifiable model properties like:

  • OLS error magnitude
  • regression statistical significance test (if errors have a normal distribution)
  • confidence and prediction intervals (if errors have a normal distribution)

Such metrics allow us to compare the quality of cost estimates.
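
As a rough illustration of the analysis above, here is a minimal sketch of the transform-then-regress step. It is not existing Presto tooling; the helper name and the synthetic data are made up, and it assumes the (estimate, cost) pairs have already been extracted (e.g. from query info JSON).

```python
# Minimal sketch of the proposed OLS analysis on (estimate, cost) pairs.
import numpy as np
from scipy import stats

def cost_model_quality(estimated, actual):
    """Regress normalized cost (actual/estimated) against the estimate."""
    x = np.asarray(estimated, dtype=float)
    y_norm = np.asarray(actual, dtype=float) / x  # y_i' = CostXXX(q_i) / EstimatedXXX(q_i)

    # OLS on the transformed points; after the division the errors should
    # have roughly constant variance, as argued above.
    fit = stats.linregress(x, y_norm)
    return {
        "slope": fit.slope,          # ~0 if error is simply proportional to the estimate
        "intercept": fit.intercept,  # average cost/estimate ratio; ~1 means estimates match costs
        "p_value": fit.pvalue,       # significance test (assumes normally distributed errors)
        "stderr": fit.stderr,        # OLS error magnitude
    }

# Example with synthetic data: actual costs are ~1.3x the estimates plus noise.
rng = np.random.default_rng(0)
est = rng.uniform(1e6, 1e9, size=200)
cost = 1.3 * est * (1 + 0.1 * rng.standard_normal(200))
print(cost_model_quality(est, cost))
```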

Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will have accumulated error. To account for that, we could derive plan complexity (e.g. number of nodes, depth) as an additional input parameter to the model and perform the analysis for each complexity level.

Individual operator cost quality

Let's call the estimated cost of operator o:
OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o).
Let's call the actual operator cost:
OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o).
Let's call operator estimated input data size as:
EstimatedInputSize(o).
Let's call operator measured input data size as:
MeasuredInputSize(o).
All of those are random variables with respect to o.

Observations:

  • an operator's estimated input data size will most likely not match its actual input data size. This affects the values of OperatorEstimatedXXX(o) and OperatorCostXXX(o). Because of that, let's normalize them:
NormalizedOperatorEstimatedXXX(o) = OperatorEstimatedXXX(o)/EstimatedInputSize(o)
NormalizedOperatorCostXXX(o) = OperatorCostXXX(o)/MeasuredInputSize(o)

Intuitively, these tell us how much of a resource (e.g. memory) an operator needs per input byte. We can now define the operator estimation error as:

OperatorEstimationErrorXXX(o) = NormalizedOperatorEstimatedXXX(o) - NormalizedOperatorCostXXX(o)

It will be similar to:
(figure omitted: distribution of the operator estimation error)

We can now:

  • derive the variance of OperatorEstimationErrorXXX
  • derive the mean of OperatorEstimationErrorXXX, i.e. how much (per input byte) an operator under/overestimates cost.
  • if OperatorEstimationErrorXXX has a normal distribution, we can draw more sophisticated conclusions (e.g. if we added a constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by this much).

Note that we should perform such an analysis separately for each operator type, as they have different profiles.
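
A minimal sketch of this per-operator analysis, grouped by operator type as suggested. The record field names (operator_type, estimated_cpu, estimated_input_bytes, actual_cpu, measured_input_bytes) are assumptions about how the gathered data points could look once exposed, not existing fields.

```python
# Hedged sketch of the per-operator estimation-error analysis described above.
import numpy as np
from collections import defaultdict

def operator_error_stats(records):
    """Group records by operator type and summarize normalized errors."""
    errors = defaultdict(list)
    for r in records:
        normalized_estimate = r["estimated_cpu"] / r["estimated_input_bytes"]
        normalized_cost = r["actual_cpu"] / r["measured_input_bytes"]
        errors[r["operator_type"]].append(normalized_estimate - normalized_cost)

    summary = {}
    for op, errs in errors.items():
        errs = np.asarray(errs)
        summary[op] = {
            "mean": errs.mean(),       # average per-byte under/overestimation
            "variance": errs.var(),
            # if errors are roughly normal, the constant we would need to add to
            # OperatorEstimatedXXX so that ~90% of observed operators are overestimated
            "safety_constant": max(0.0, -np.percentile(errs, 10)),
        }
    return summary
```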

Stats quality

A similar approach can be applied to operator nodes that filter the data (e.g. actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and the rules that perform predicate estimation.
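
For illustration, a tiny sketch of the filtering-factor comparison; the record fields are hypothetical and stand in for whatever row-count stats the filter node ultimately exposes.

```python
# Compare estimated vs. actual filtering factor for a single filter node.
def filter_factor_error(record):
    actual_factor = record["output_rows"] / record["input_rows"]
    estimated_factor = record["estimated_output_rows"] / record["estimated_input_rows"]
    # >0 means the filter was estimated to pass a larger fraction of rows than it actually did
    return estimated_factor - actual_factor

print(filter_factor_error({
    "input_rows": 1_000_000, "output_rows": 150_000,
    "estimated_input_rows": 1_000_000, "estimated_output_rows": 200_000,
}))  # ~0.05: estimated to pass 20% of rows, actually passed 15%
```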

Adjusting model

The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.

However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, we could/should:

  • introduce variables that multiply estimated operator costs so that they match measured costs
  • specifically, choose the operator memory cost factor to be large so that for the majority of queries we overestimate memory (to be on the safe side)
  • multiply cumulative plan costs so that we slightly overestimate. Such a factor for cumulative cost should account for plan complexity, as simpler plans should have more accurate costs

With such variables we could adjust the model for different hardware, data, and workloads. Specifically, we could adjust the model per client so that the CBO produces better plans.
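
As a sketch of how such an adjustment variable could be derived from historical data (an illustration of the idea, not an existing Presto mechanism), the memory safety factor from the second bullet above could be chosen as a quantile of the observed actual/estimated ratios:

```python
# Pick the smallest multiplier m such that m * estimate >= actual memory
# for a target fraction of the observed queries (e.g. 90%).
import numpy as np

def memory_safety_factor(estimated_memory, actual_memory, target_coverage=0.9):
    ratios = np.asarray(actual_memory) / np.asarray(estimated_memory)
    return float(np.quantile(ratios, target_coverage))

# Example with made-up numbers; the multiplier covers 90% of these queries.
est = np.array([100.0, 200.0, 400.0, 800.0])
act = np.array([150.0, 250.0, 500.0, 1300.0])
print(memory_safety_factor(est, act))
```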

Plan Of Attack

The steps are as follows:

  • For each operator, expose cumulative and individual stats/costs in the query info (v1/query) JSON so that they can be gathered and analyzed. This would be used to evaluate the model on a predefined set of queries (e.g. TPCH/TPCDS) during benchmarks.

Note that operators don't match 1:1 to plan nodes. For instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.

  • Expose estimated cost/stats in EventListener so that we could gather and store historical data points.
  • Add variables to stats/cost models so that we can adjust model for particular hardware, data and workloads.

CC: @findepi @rschlussel @mbasmanova @arhimondr @martint


sopel39 commented Oct 1, 2018

FYI: @kbajda


dain commented Oct 1, 2018

In the future, would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others? Specifically, having an understanding of resource use over the life of the query will be important for bin packing a cluster.


sopel39 commented Oct 1, 2018

In the future, would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others?

This is more about the quality of the metrics and adjusting them against measured data.
For instance, peak memory should be derived as in @findepi's #11591, which accounts for operator timelines.
We could imagine more metrics that account for stage timelines, etc.

mbasmanova (Contributor) commented:

@sopel39 Karol, I like this proposal and have some follow-up questions.

CostCpu(q), CostMemory(q), CostNetwork(q)

Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network. Do you have any ideas about how we could measure/report these?

I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.

CC: @rschlussel @arhimondr


sopel39 commented Oct 2, 2018

Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network.

PlanNodeCostEstimate contains both memory and networkCost.


sopel39 commented Oct 2, 2018

I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.

Hopefully yes. There are more operators in the final plan than just those present at the join reordering phase (which is where the CBO is mostly used). For instance, partial aggregation stats are hard to estimate, see #11595.
However, even without a known cost for the entire plan, we should still be able to obtain the cumulative cost of plan subtrees and the individual cost of operators.

mbasmanova (Contributor) commented:

@sopel39

PlanNodeCostEstimate contains both memory and networkCost.

These are the estimates, but I'm looking for actual costs. Do we have these?


sopel39 commented Oct 2, 2018

These are the estimates, but I'm looking for actual costs. Do we have these?

I think so. We have measured exchange input (for network) and operator peak memory (we can compute actual subplan peak memory from that).


findepi commented Oct 2, 2018

and operator peak memory (we can compute actual subplan peak memory from that)

we cannot.

QueryStats contains observed peak memory for a query, so the query-level and operator-level comparison is covered.

mbasmanova (Contributor) commented:

@sopel39 @findepi Are we saying that memory cost equals peak memory? What about network cost? I don't believe we measure it today, do we?


sopel39 commented Oct 2, 2018

@mbasmanova actual network data size can be obtained via: StageStats#getOutputPositions (PlanPrinter uses those metrics).

Are we saying that memory cost equals peak memory?

As @findepi mentioned, QueryStats contains peak memory for a query, and with #11591 we want to make the memory cost estimate an estimate of subplan peak memory that accounts for operator execution timelines.

mbasmanova (Contributor) commented:

@sopel39 Karol, thanks for explaining. One more question. You are saying that #11591 makes memory cost = peak memory. What is memory cost today?


sopel39 commented Oct 2, 2018

What is memory cost today?

@mbasmanova Currently, the memory cost estimate is the sum of peak memory estimates for all operators. That doesn't account for operator execution timelines. I think the discussion under #11495 is a great summary of the issue.
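
To illustrate the difference being discussed (why summing per-operator peak memory estimates overestimates subplan peak memory when operators don't all run concurrently), here is a toy example with made-up operator timelines; it is not Presto code.

```python
# Toy illustration: sum of per-operator peaks vs. timeline-aware peak memory.
operators = [
    # (start_time, end_time, peak_memory_bytes) -- numbers are made up
    (0, 10, 500),   # e.g. a table scan
    (0, 10, 800),   # e.g. a hash build running concurrently with the scan
    (10, 20, 300),  # e.g. a probe that starts after the build finishes
]

naive_estimate = sum(mem for _, _, mem in operators)  # 1600

# Timeline-aware peak: maximum total memory of operators alive at the same time.
events = sorted({t for s, e, _ in operators for t in (s, e)})
timeline_peak = max(
    sum(mem for s, e, mem in operators if s <= t < e)
    for t in events[:-1]
)
print(naive_estimate, timeline_peak)  # 1600 vs. 1300
```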


stale bot commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

stale bot added the stale label Oct 10, 2020
stale bot closed this as completed Oct 17, 2020