
Quantify stats/cost calculator quality. Probabilistically adjust models as a longer term goal #11615

sopel39 opened this issue Oct 1, 2018 · 14 comments


sopel39 commented Oct 1, 2018

Currently we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.

Cumulative cost quality

Let's call the cumulative estimated cost of query q:
EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q).
Let's call the actual query cost:
CostCpu(q), CostMemory(q), CostNetwork(q).
All of these are random variables with respect to q.

Observations:

  • f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimate error.
  • We expect f to be a linear transformation (otherwise the sum of subplan estimates won't be an estimate of the whole plan).
  • E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates.

Let's say we gather queries qi and their estimates EstimatedXXX(qi) and costs CostXXX(qi). Thus we have data points composed of pairs:

<x1=EstimatedXXX(q1), y1=CostXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN=CostXXX(qN)>

for i = 1...N.

Those points will look similar to:
(figure omitted: scatter plot of the raw data points, cost vs. estimate)

Let's transform those points by dividing each yi by its corresponding xi. We now get:

<x1=EstimatedXXX(q1), y1'=CostXXX(q1)/EstimatedXXX(q1)>, ..., <xN=EstimatedXXX(qN), yN'=CostXXX(qN)/EstimatedXXX(qN)>

Now those points will look like:
(figure omitted: scatter plot of the transformed data points)

We can now perform regression analysis, as we should have satisfied the conditions for OLS regression (errors come from the same distribution and have constant variance).

We can derive quantifiable model properties like:

  • OLS error magnitude
  • regression statistical significance test (if errors have a normal distribution)
  • confidence and prediction intervals (if errors have a normal distribution)

Such metrics allow us to compare the quality of cost estimates.
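
As a rough illustration of the analysis above, here is a minimal sketch of the transform-then-regress step. It is not existing Presto tooling; the helper name and the synthetic data are made up, and it assumes the (estimate, cost) pairs have already been extracted (e.g. from query info JSON).

```python
# Minimal sketch of the proposed OLS analysis on (estimate, cost) pairs.
import numpy as np
from scipy import stats

def cost_model_quality(estimated, actual):
    """Regress normalized cost (actual/estimated) against the estimate."""
    x = np.asarray(estimated, dtype=float)
    y_norm = np.asarray(actual, dtype=float) / x  # y_i' = CostXXX(q_i) / EstimatedXXX(q_i)

    # OLS on the transformed points; after the division the errors should
    # have roughly constant variance, as argued above.
    fit = stats.linregress(x, y_norm)
    return {
        "slope": fit.slope,          # ~0 if error is simply proportional to the estimate
        "intercept": fit.intercept,  # average cost/estimate ratio; ~1 means estimates match costs
        "p_value": fit.pvalue,       # significance test (assumes normally distributed errors)
        "stderr": fit.stderr,        # OLS error magnitude
    }

# Example with synthetic data: actual costs are ~1.3x the estimates plus noise.
rng = np.random.default_rng(0)
est = rng.uniform(1e6, 1e9, size=200)
cost = 1.3 * est * (1 + 0.1 * rng.standard_normal(200))
print(cost_model_quality(est, cost))
```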

Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will have accumulated error. To account for that, we could derive plan complexity (e.g. number of nodes, depth) as an additional input parameter to the model and perform the analysis for each complexity level.

Individual operator cost quality

Let's call the estimated cost of operator o:
OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o).
Let's call the actual operator cost:
OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o).
Let's call operator estimated input data size as:
EstimatedInputSize(o).
Let's call operator measured input data size as:
MeasuredInputSize(o).
All of those are random variables with respect to o.

Observations:

  • an operator's estimated input data size will most likely not match its actual input data size. This affects the values of OperatorEstimatedXXX(o) and OperatorCostXXX(o). Because of that, let's normalize them:
NormalizedOperatorEstimatedXXX(o) = OperatorEstimatedXXX(o)/EstimatedInputSize(o)
NormalizedOperatorCostXXX(o) = OperatorCostXXX(o)/MeasuredInputSize(o)

Intuitively, these tell us how much of a resource (e.g. memory) an operator needs per input byte. We can now define the operator estimation error as:

OperatorEstimationErrorXXX(o) = NormalizedOperatorEstimatedXXX(o) - NormalizedOperatorCostXXX(o)

It will be similar to:
(figure omitted: distribution of the operator estimation error)

We can now:

  • derive the variance of OperatorEstimationErrorXXX
  • derive the mean of OperatorEstimationErrorXXX, i.e. how much (per input byte) an operator under/overestimates cost.
  • if OperatorEstimationErrorXXX has a normal distribution, we can draw more sophisticated conclusions (e.g. if we added a constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by this much).

Note that we should perform such an analysis separately for each operator type, as they have different profiles.
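
A minimal sketch of this per-operator analysis, grouped by operator type as suggested. The record field names (operator_type, estimated_cpu, estimated_input_bytes, actual_cpu, measured_input_bytes) are assumptions about how the gathered data points could look once exposed, not existing fields.

```python
# Hedged sketch of the per-operator estimation-error analysis described above.
import numpy as np
from collections import defaultdict

def operator_error_stats(records):
    """Group records by operator type and summarize normalized errors."""
    errors = defaultdict(list)
    for r in records:
        normalized_estimate = r["estimated_cpu"] / r["estimated_input_bytes"]
        normalized_cost = r["actual_cpu"] / r["measured_input_bytes"]
        errors[r["operator_type"]].append(normalized_estimate - normalized_cost)

    summary = {}
    for op, errs in errors.items():
        errs = np.asarray(errs)
        summary[op] = {
            "mean": errs.mean(),       # average per-byte under/overestimation
            "variance": errs.var(),
            # if errors are roughly normal, the constant we would need to add to
            # OperatorEstimatedXXX so that ~90% of observed operators are overestimated
            "safety_constant": max(0.0, -np.percentile(errs, 10)),
        }
    return summary
```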

Stats quality

A similar approach can be applied to operator nodes that filter the data (e.g. actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and the rules that perform predicate estimation.
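
For illustration, a tiny sketch of the filtering-factor comparison; the record fields are hypothetical and stand in for whatever row-count stats the filter node ultimately exposes.

```python
# Compare estimated vs. actual filtering factor for a single filter node.
def filter_factor_error(record):
    actual_factor = record["output_rows"] / record["input_rows"]
    estimated_factor = record["estimated_output_rows"] / record["estimated_input_rows"]
    # >0 means the filter was estimated to pass a larger fraction of rows than it actually did
    return estimated_factor - actual_factor

print(filter_factor_error({
    "input_rows": 1_000_000, "output_rows": 150_000,
    "estimated_input_rows": 1_000_000, "estimated_output_rows": 200_000,
}))  # ~0.05: estimated to pass 20% of rows, actually passed 15%
```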

Adjusting model

The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.

However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, we could/should:

  • introduce variables that multiply estimated operator costs so that they match measured costs
  • specifically, choose the operator memory cost factor to be large so that for the majority of queries we overestimate memory (to be on the safe side)
  • multiply cumulative plan costs so that we slightly overestimate. Such a factor for cumulative cost should account for plan complexity, as simpler plans should have more accurate costs

With such variables we could adjust the model for different hardware, data, and workloads. Specifically, we could adjust the model per client so that the CBO produces better plans.
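
As a sketch of how such an adjustment variable could be derived from historical data (an illustration of the idea, not an existing Presto mechanism), the memory safety factor from the second bullet above could be chosen as a quantile of the observed actual/estimated ratios:

```python
# Pick the smallest multiplier m such that m * estimate >= actual memory
# for a target fraction of the observed queries (e.g. 90%).
import numpy as np

def memory_safety_factor(estimated_memory, actual_memory, target_coverage=0.9):
    ratios = np.asarray(actual_memory) / np.asarray(estimated_memory)
    return float(np.quantile(ratios, target_coverage))

# Example with made-up numbers; the multiplier covers 90% of these queries.
est = np.array([100.0, 200.0, 400.0, 800.0])
act = np.array([150.0, 250.0, 500.0, 1300.0])
print(memory_safety_factor(est, act))
```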

Plan Of Attack

The steps are as follows:

  • For each operator, expose cumulative and individual stats/costs in the query info (v1/query) JSON so that they can be gathered and analyzed. This would be used to evaluate the model on a predefined set of queries (e.g. TPCH/TPCDS) during benchmarks.

Note that operators don't match 1:1 to plan nodes. For instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.

  • Expose estimated cost/stats in EventListener so that we could gather and store historical data points.
  • Add variables to stats/cost models so that we can adjust model for particular hardware, data and workloads.

CC: @findepi @rschlussel @mbasmanova @arhimondr @martint


sopel39 commented Oct 1, 2018

FYI: @kbajda


dain commented Oct 1, 2018

In the future, would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others? Specifically, having an understanding of resource use over the life of the query will be important for bin packing a cluster.


sopel39 commented Oct 1, 2018

In the future, would we want to perform this analysis on each phase of the query, since some phases are much more resource intensive than others?

This is more about the quality of the metrics and adjusting them against measured data.
For instance, peak memory should be derived as in @findepi's #11591, which accounts for operator timelines.
We could imagine more metrics that account for stage timelines, etc.

mbasmanova (Contributor) commented:

@sopel39 Karol, I like this proposal and have some follow-up questions.

CostCpu(q), CostMemory(q), CostNetwork(q)

Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network. Do you have any ideas about how we could measure/report these?

I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.

CC: @rschlussel @arhimondr


sopel39 commented Oct 2, 2018

Currently, we report total CPU for a query, which could be defined as CostCpu, but we don't have anything for memory or network.

PlanNodeCostEstimate contains both memory and networkCost.


sopel39 commented Oct 2, 2018

I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.

Hopefully yes. There are more operators in the final plan than just those present at the join reordering phase (which is where the CBO is mostly used). For instance, partial aggregation stats are hard to estimate, see #11595.
However, even without a known cost for the entire plan, we should still be able to obtain the cumulative cost of plan subtrees and the individual cost of operators.

mbasmanova (Contributor) commented:

@sopel39

PlanNodeCostEstimate contains both memory and networkCost.

These are the estimates, but I'm looking for actual costs. Do we have these?


sopel39 commented Oct 2, 2018

These are the estimates, but I'm looking for actual costs. Do we have these?

I think so. We have measured exchange input (for network) and operator peak memory (we can compute actual subplan peak memory from that).


findepi commented Oct 2, 2018

and operator peak memory (we can compute actual subplan peak memory from that)

we cannot.

QueryStats contains observed peak memory for a query, so the query-level and operator-level comparison is covered.

mbasmanova (Contributor) commented:

@sopel39 @findepi Are we saying that memory cost equals peak memory? What about network cost? I don't believe we measure it today, do we?


sopel39 commented Oct 2, 2018

@mbasmanova actual network data size can be obtained via: StageStats#getOutputPositions (PlanPrinter uses those metrics).

Are we saying that memory cost equals peak memory?

As @findepi mentioned, QueryStats contains peak memory for a query, and with #11591 we want to make the memory cost estimate an estimate of subplan peak memory that accounts for operator execution timelines.

mbasmanova (Contributor) commented:

@sopel39 Karol, thanks for explaining. One more question. You are saying that #11591 makes memory cost = peak memory. What is memory cost today?


sopel39 commented Oct 2, 2018

What is memory cost today?

@mbasmanova Currently, the memory cost estimate is the sum of peak memory estimates for all operators. That doesn't account for operator execution timelines. I think the discussion under #11495 is a great summary of the issue.
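
To illustrate the difference being discussed (why summing per-operator peak memory estimates overestimates subplan peak memory when operators don't all run concurrently), here is a toy example with made-up operator timelines; it is not Presto code.

```python
# Toy illustration: sum of per-operator peaks vs. timeline-aware peak memory.
operators = [
    # (start_time, end_time, peak_memory_bytes) -- numbers are made up
    (0, 10, 500),   # e.g. a table scan
    (0, 10, 800),   # e.g. a hash build running concurrently with the scan
    (10, 20, 300),  # e.g. a probe that starts after the build finishes
]

naive_estimate = sum(mem for _, _, mem in operators)  # 1600

# Timeline-aware peak: maximum total memory of operators alive at the same time.
events = sorted({t for s, e, _ in operators for t in (s, e)})
timeline_peak = max(
    sum(mem for s, e, mem in operators if s <= t < e)
    for t in events[:-1]
)
print(naive_estimate, timeline_peak)  # 1600 vs. 1300
```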


stale bot commented Oct 10, 2020

This issue has been automatically marked as stale because it has not had any activity in the last 2 years. If you feel that this issue is important, just comment and the stale tag will be removed; otherwise it will be closed in 7 days. This is an attempt to ensure that our open issues remain valuable and relevant so that we can keep track of what needs to be done and prioritize the right things.

stale bot added the stale label Oct 10, 2020
stale bot closed this as completed Oct 17, 2020