Quantify stats/cost calculator quality. Probabilistically adjust models as a longer term goal #11615
FYI: @kbajda
In the future, would we want to perform this analysis on each phase of the query, since some phases are much more resource-intensive than others? Specifically, having an understanding of resource usage over the life of the query will be important for bin packing a cluster.
This is more about the quality of the metrics and adjusting them against measured data.
@sopel39 Karol, I like this proposal and have some follow-up questions.
Currently, we report total CPU for a query. I expect Rebecca's #11511 to provide sufficient information (and, hopefully, it will be sufficiently parsable) to allow for basic analysis of estimated vs. actual CPU costs as well as row counts.
Hopefully yes. There are more operators in the final plan than just those present at the join-reordering phase (where the CBO is mostly used). For instance, partial aggregation stats are hard to estimate, see #11595.
These are the estimates, but I'm looking for actual costs. Do we have these?
I think so. We have measured exchange input (for network) and operator peak memory (we can compute actual subplan peak memory from that).
we cannot.
@mbasmanova As @findepi mentioned, actual network data size can be obtained via:
@mbasmanova Currently, the memory cost estimate is the sum of peak memory estimates for all operators. That doesn't account for operator execution timelines. I think the discussion under #11495 is a great summary of the issue.
Currently, we don't have any quantifiable means of measuring stats/cost model quality. It would be great to have one so that we can compare stats/cost calculator changes via A/B testing or by aggregating historical data.
Cumulative cost quality
Let's call the cumulative estimated cost of query q: EstimatedUserCpu(q), EstimatedMemory(q), EstimatedNetwork(q). Let's call the actual query cost: CostCpu(q), CostMemory(q), CostNetwork(q). All of those are random variables with respect to q.
Observations:
- f(EstimatedXXX(q)) = CostXXX(q) + E(q), where E(q) is the estimation error
- f should be a linear transformation (otherwise the sum of subplan estimates won't be the estimate of the whole plan)
- E(q) should be proportional to CostXXX(q) or EstimatedXXX(q), because we expect larger errors for bigger costs/estimates

Let's say we gather queries qi together with their estimates EstimatedXXX(qi) and costs CostXXX(qi). Thus we have data points composed of pairs (EstimatedXXX(qi), CostXXX(qi)) for i = 1...N. (Scatter plot omitted.)
Let's transform those points by dividing yN by xN. (Transformed scatter plot omitted.)
We can now perform regression analysis, as we should have satisfied the conditions for OLS regression (errors come from the same distribution and have constant variance).
We can derive quantifiable model properties like:
Such metrics allow us to compare the quality of cost estimates.
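To make the procedure above concrete, here is a minimal sketch using only the Python standard library. The cost data is simulated (the linear factor 1.25 and noise level 0.1 are hypothetical, not from any real workload): after dividing each actual cost by its estimate, the OLS fit of the ratios reduces to their mean.

```python
import random
import statistics

random.seed(42)

# Hypothetical data: estimated cost x_i and actual cost y_i for N queries,
# where the error grows proportionally with the cost, as assumed above.
N = 500
true_factor = 1.25  # the unknown linear transformation f
xs = [random.uniform(1e6, 1e9) for _ in range(N)]
ys = [true_factor * x + random.gauss(0.0, 0.1) * x for x in xs]

# Dividing y_i by x_i makes the error variance (roughly) constant,
# so a plain OLS fit of the ratios collapses to their sample mean.
ratios = [y / x for x, y in zip(xs, ys)]
f_hat = statistics.fmean(ratios)      # estimate of f; should be near 1.25
resid_std = statistics.stdev(ratios)  # residual spread; should be near 0.1

print(f"fitted linear factor f: {f_hat:.3f}")
print(f"residual std dev:       {resid_std:.3f}")
```

The fitted factor and residual spread are exactly the kind of quantifiable model properties that could be tracked across stats/cost calculator changes.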
Note: I think there will be different error distributions depending on how complex the plans are. For instance, sophisticated plans with many plan nodes will have accumulated error. To account for that, we could derive plan complexity (e.g. number of nodes, depth) as an additional input parameter to the model, and we could perform the analysis for each complexity level.
Individual operator cost quality
Let's call the estimated cost of operator o: OperatorEstimatedUserCpu(o), OperatorEstimatedMemory(o), OperatorEstimatedNetwork(o). Let's call the actual operator cost: OperatorCostCpu(o), OperatorCostMemory(o), OperatorCostNetwork(o). Let's call the operator's estimated input data size EstimatedInputSize(o), and its measured input data size MeasuredInputSize(o). All of those are random variables with respect to o.
Observations: input data size drives both OperatorEstimatedXXX(o) and OperatorCostXXX(o). Because of that, let's normalize them by input size:
- OperatorEstimatedXXX(o) / EstimatedInputSize(o)
- OperatorCostXXX(o) / MeasuredInputSize(o)

Intuitively, these tell how much memory (or CPU, or network) is needed by an operator per input byte. We can now define the operator estimation error as the difference of the normalized values. (Error distribution plot omitted.)
We can now:
- derive the distribution of OperatorEstimationErrorXXX
- compute the mean of OperatorEstimationErrorXXX, e.g. how much (per input byte) an operator is under/overestimating cost
- if OperatorEstimationErrorXXX has a normal distribution, draw more sophisticated conclusions (e.g. if we added Constant to OperatorEstimatedXXX, then for 90% of queries we would overestimate memory usage by this much)

Note that we should perform such analysis for each operator type, as they have different profiles.
Stats quality
A similar approach can be applied to plan nodes that filter data (e.g. actual filtering factor vs. estimated filtering factor). This way we can evaluate the quality of FilterStatsCalculator and of rules that perform predicate estimation.
Adjusting model
The initial goal is to provide quality metrics for stats/cost model so that we could test and quantify model changes.
However, having gathered historical query and model data, we could introduce additional variables to our stats/cost computations that we would adjust to achieve desired model properties.
For instance, we could/should:
Having such variables, we could adjust the model for different hardware, data, and workloads. Specifically, we could adjust the model for each client so that the CBO produces better plans.
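As a sketch of what per-client adjustment could look like, assuming a hypothetical history of (estimate, actual) cost pairs per client, a simple multiplicative correction could be fitted from the mean actual/estimate ratio (the client names and numbers are made up):

```python
# Hypothetical history: per client, pairs of (estimated cost, actual cost).
history = {
    "client_a": [(100.0, 150.0), (200.0, 310.0)],
    "client_b": [(100.0, 90.0), (400.0, 350.0)],
}

# Fit one multiplicative correction per client: the mean actual/estimate ratio.
corrections = {
    client: sum(act / est for est, act in points) / len(points)
    for client, points in history.items()
}

def adjusted_estimate(client: str, raw_estimate: float) -> float:
    """Scale the raw model estimate by the client's fitted correction;
    fall back to 1.0 (no adjustment) for unknown clients."""
    return corrections.get(client, 1.0) * raw_estimate

print(corrections)  # client_a > 1 (model underestimates), client_b < 1
```

A real implementation would presumably fit per operator type and per resource dimension rather than one scalar per client, but the shape of the adjustment is the same.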
Plan Of Attack
The steps are as follows:
1. Expose cost estimates and measured costs in the query (v1/query) JSON so that they can be gathered and analyzed. This would be used to evaluate the model on a predefined set of queries (e.g. TPCH/TPCDS) during benchmarks. Note that operators don't match 1-1 to plan nodes; for instance, to obtain HashAggregationOperator stats we would need to get stats for the join build side.
2. Expose the same data points via EventListener so that we could gather and store historical data points.

CC: @findepi @rschlussel @mbasmanova @arhimondr @martint