
CBO: Do not consider plans that exceed memory limits #11495

Conversation

cameronreynoldson

Check for the maximum amount of memory a query is allowed to have at any point in time in the CostComparator. If it is exceeded, choose the plan using less memory.
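
In rough outline, the check described above could look like this inside CostComparator (a sketch only, not the PR's exact code; the body of the if-branch is an assumption, since the diff shown later in this review is truncated at that point):

    public int compare(Session session, PlanNodeCostEstimate left, PlanNodeCostEstimate right)
    {
        long queryMaxMemory = getQueryMaxMemory(session).toBytes();
        if (left.getIndividualMemoryCost() > queryMaxMemory || right.getIndividualMemoryCost() > queryMaxMemory) {
            // at least one candidate is estimated to exceed the limit:
            // prefer the plan with the smaller memory estimate
            return Double.compare(left.getIndividualMemoryCost(), right.getIndividualMemoryCost());
        }
        // ... otherwise fall through to the usual weighted cpu/memory/network comparison
        return 0;
    }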

@mbasmanova
Contributor

CC: @findepi @rschlussel

cameronreynoldson changed the title from "Add getQueryMaxMemory check to CostComparator" to "CBO: Do not consider plans that exceed memory limits" on Sep 17, 2018
@rschlussel
Contributor

@cameronreynoldson thanks for the PR. This looks like a good start on how to compare costs. I don't see the constructor that has individual costs and cumulative cost being used, so you still need to have the costCalculator include the individual cost when it's creating cost estimates.

private final double cpuCost;
private final double memoryCost;
private final double networkCost;

-public PlanNodeCostEstimate(double cpuCost, double memoryCost, double networkCost)
+public PlanNodeCostEstimate(double individualCpuCost, double individualMemoryCost, double individualNetworkCost)
Contributor

this constructor may be useful for testing, but in general we should use the one below.

Author

I left this constructor here because it is used in several other places and seems to be used to pass in the initial costs. If this constructor were to disappear, we would be explicitly passing in the same costs to both the individual and cumulative parameters. Is that what you're getting at?

Contributor

Oh, I see: it becomes cumulative after the "add" method. I would add a comment making that clearer and also make the other constructor private. Additionally, instead of having two constructors that both initialize the fields, call the other constructor from here.
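
A sketch of the delegation being suggested (not the PR's code): the public three-argument constructor seeds the cumulative components from the individual ones, and the six-argument constructor becomes private:

    // public constructor: a node's cumulative cost starts out equal to its individual cost
    public PlanNodeCostEstimate(double cpuCost, double memoryCost, double networkCost)
    {
        this(cpuCost, memoryCost, networkCost, cpuCost, memoryCost, networkCost);
    }

    // full constructor kept private; the cumulative components diverge from the
    // individual ones only through add()
    private PlanNodeCostEstimate(
            double individualCpuCost, double individualMemoryCost, double individualNetworkCost,
            double cpuCost, double memoryCost, double networkCost)
    {
        this.individualCpuCost = individualCpuCost;
        this.individualMemoryCost = individualMemoryCost;
        this.individualNetworkCost = individualNetworkCost;
        this.cpuCost = cpuCost;
        this.memoryCost = memoryCost;
        this.networkCost = networkCost;
    }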

{
    return new PlanNodeCostEstimate(0, 0, individualNetworkCost);
}

public static PlanNodeCostEstimate cpuCost(double cpuCost)
Contributor

Where do these get used? Right now they are exactly the same as the individualCost methods. Figure out if that's the correct behavior, and if so, get rid of one set of methods.

@rschlussel
Contributor

Spoke to @mbasmanova. Looks good to both of us. Can you squash the commits together and rename the commit title to "Choose lower memory plan if a memory estimate exceeds limit"?

- * Returns CPU component of the cost. Unknown value is represented by {@link Double#NaN}
+ * Returns individual CPU component of the cost. Unknown value is represented by {@link Double#NaN}
*/
public double getIndividualCpuCost()
Contributor

What does "individual" mean in this context? Individual with respect to what?

Contributor

"individual" here refers to the cost of the node itself; before this change, the estimate was tracking the cumulative cost of the whole sub-tree rooted at that node.

Contributor

martint commented Sep 18, 2018

In that case, I'd suggest we rename the existing methods to getCumulativeCpuCost, etc. and make these look like getCpuCost. The term "individual" is confusing, whereas "cumulative" has a much more familiar connotation, especially in the context of a tree.

Contributor

I think "individual cost" (memory/cpu/network) of a single node is not the right abstraction.
If, for example, we have an identity projection over one of the Joins and no projection over the other Join, we would be comparing pears to apples -- the individual cost (memory/cpu/network) of the projection will be negligible.

What we need instead is something that estimates the cost of the whole plan in 4 dimensions (3 existing and 1 new):

  • (cumulative) CPU cost
  • (cumulative) network cost
  • (cumulative) memory cost (maybe this will be actually obsolete)
  • peak memory usage
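
A sketch of the shape such a plan-wide estimate could take; all names here are hypothetical, not from the PR:

    public class PlanCostEstimate
    {
        private final double cpuCost;      // cumulative CPU cost of the whole plan
        private final double networkCost;  // cumulative network cost
        private final double memoryCost;   // cumulative memory cost (possibly obsolete)
        private final double peakMemory;   // max memory held at any point of the execution timeline

        public PlanCostEstimate(double cpuCost, double networkCost, double memoryCost, double peakMemory)
        {
            this.cpuCost = cpuCost;
            this.networkCost = networkCost;
            this.memoryCost = memoryCost;
            this.peakMemory = peakMemory;
        }
    }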

@@ -85,13 +138,17 @@ public double getNetworkCost()
  */
 public boolean hasUnknownComponents()
 {
-    return isNaN(cpuCost) || isNaN(memoryCost) || isNaN(networkCost);
+    return isNaN(individualCpuCost) || isNaN(individualMemoryCost) || isNaN(individualNetworkCost) ||
+            isNaN(cpuCost) || isNaN(memoryCost) || isNaN(networkCost);
 }

@Override
public String toString()
Contributor

@rschlussel @cameronreynoldson Will this change affect the query plan? E.g. is this method used in PlanPrinter?

Author

This could probably be changed back to how it was, given that one of the cumulative costs would have to be unknown if any of the individual costs were unknown.

Contributor

It won't affect the query plan unless you have CBO on. PlanPrinter just gets the stats and costs of the final plan.

Contributor

There aren't any cases where the individual cost would be unknown and the cumulative cost wouldn't be.

         Double.compare(that.memoryCost, memoryCost) == 0 &&
         Double.compare(that.networkCost, networkCost) == 0;
 }

 @Override
 public int hashCode()
 {
-    return Objects.hash(cpuCost, memoryCost, networkCost);
+    return Objects.hash(individualCpuCost, individualMemoryCost, individualNetworkCost,
+            cpuCost, memoryCost, networkCost);
 }

public PlanNodeCostEstimate add(PlanNodeCostEstimate other)
Contributor

@rschlussel @cameronreynoldson Before this change, the add method was simply adding two estimates; i.e. there was no implicit assumption that the estimates being added are for the source nodes. In fact, this doesn't always seem to be the case.

com.facebook.presto.cost.CostCalculatorWithEstimatedExchanges.ExchangeCostEstimator#visitAggregation

        @Override
        public PlanNodeCostEstimate visitAggregation(AggregationNode node, Void context)
        {
            PlanNodeStatsEstimate sourceStats = getStats(node.getSource());
            List<Symbol> sourceSymbols = node.getSource().getOutputSymbols();

            PlanNodeCostEstimate remoteRepartitionCost = CostCalculatorUsingExchanges.calculateExchangeCost(
                    numberOfNodes,
                    sourceStats,
                    sourceSymbols,
                    REPARTITION,
                    REMOTE,
                    types);
            PlanNodeCostEstimate localRepartitionCost = CostCalculatorUsingExchanges.calculateExchangeCost(
                    numberOfNodes,
                    sourceStats,
                    sourceSymbols,
                    REPARTITION,
                    LOCAL,
                    types);

            // TODO consider cost of aggregation itself, not only exchanges, based on aggregation's properties
            return remoteRepartitionCost.add(localRepartitionCost);
        }

Author

Would it be worth our effort to change the uses of PlanNodeCostEstimate to follow our new pattern of initializing with the individual costs and using add() for cumulative costs?
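
For reference, a hypothetical call site following that pattern (names are illustrative, not from the PR):

    // construct an estimate from the node's own (individual) costs only,
    // then accumulate the subtree total via add()
    static PlanNodeCostEstimate filterCost(double filterCpu, PlanNodeCostEstimate sourceCumulativeCost)
    {
        PlanNodeCostEstimate individual = new PlanNodeCostEstimate(filterCpu, 0, 0);
        return individual.add(sourceCumulativeCost);
    }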

}

public PlanNodeCostEstimate add(PlanNodeCostEstimate other)
{
    return new PlanNodeCostEstimate(
            individualCpuCost,
            individualMemoryCost,
            individualNetworkCost,
Contributor

Why not other.individualNetworkCost? Until this point, add was commutative. Maybe Double.NaN would be better here.

@@ -85,13 +138,17 @@ public double getNetworkCost()
  */
 public boolean hasUnknownComponents()
 {
-    return isNaN(cpuCost) || isNaN(memoryCost) || isNaN(networkCost);
+    return isNaN(individualCpuCost) || isNaN(individualMemoryCost) || isNaN(individualNetworkCost) ||
+            isNaN(cpuCost) || isNaN(memoryCost) || isNaN(networkCost);
Contributor

Put each one on a separate line.

@@ -59,6 +60,11 @@ public int compare(Session session, PlanNodeCostEstimate left, PlanNodeCostEstimate right)
     requireNonNull(right, "right is null");
     checkArgument(!left.hasUnknownComponents() && !right.hasUnknownComponents(), "cannot compare unknown costs");

+    long queryMaxMemory = getQueryMaxMemory(session).toBytes();
Contributor

It would be nice to use a dedicated session property here, which could be set to query max memory by default. This is because the memory cost is not exactly the memory consumption.
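
For example (the property name is made up, not part of this PR; the getter follows the existing SystemSessionProperties pattern and would default to query max memory where the property is registered):

    // hypothetical session property for the CBO memory cutoff
    public static final String CBO_MEMORY_LIMIT = "cbo_memory_limit_for_plan_comparison";

    public static DataSize getCboMemoryLimit(Session session)
    {
        return session.getSystemProperty(CBO_MEMORY_LIMIT, DataSize.class);
    }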

new CostComparisonAssertion(1.0, 0.0, 0.0)
        .smaller(1000, 100, 100)
        .larger(100, 6e13, 100)
        .assertCompare();
Contributor

Can you also add a test covering the case where the individual memory costs are both higher than query max memory?
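
For instance, mirroring the assertion style above (values are illustrative and assume query max memory is below 6e13 bytes in the test session):

    // both individual memory estimates exceed the limit, so the comparator should
    // prefer the lower-memory plan even though its CPU cost is higher
    new CostComparisonAssertion(1.0, 0.0, 0.0)
            .smaller(1000, 6e13, 100)
            .larger(100, 7e13, 100)
            .assertCompare();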

@@ -59,6 +60,11 @@ public int compare(Session session, PlanNodeCostEstimate left, PlanNodeCostEstimate right)
     requireNonNull(right, "right is null");
     checkArgument(!left.hasUnknownComponents() && !right.hasUnknownComponents(), "cannot compare unknown costs");

+    long queryMaxMemory = getQueryMaxMemory(session).toBytes();
+    if (left.getIndividualMemoryCost() > queryMaxMemory || right.getIndividualMemoryCost() > queryMaxMemory) {
Contributor

To me it is a hack that you are changing the generic compare method. I would prefer to have a dedicated method to handle the case of plan nodes with high memory consumption.
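
i.e. something along these lines (the method name is hypothetical):

    // a dedicated check, so compare() keeps its generic weighted-cost semantics
    public boolean exceedsMemoryLimit(Session session, PlanNodeCostEstimate estimate)
    {
        long queryMaxMemory = getQueryMaxMemory(session).toBytes();
        return estimate.getIndividualMemoryCost() > queryMaxMemory;
    }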

@findepi
Contributor

findepi commented Sep 18, 2018

@cameronreynoldson @mbasmanova @rschlussel I don't think this problem is a matter of a simple patch.
To me, what we currently have is a quickfix which helps in some situations but not in others.

If we want a quickfix only, I would consider the following options (see the sketch after this list):

  • compare plans by memory usage first (i.e. assign memory cost a much higher weight)
  • compare plans by memory usage first if the memory cost exceeds a certain threshold (relative to query max memory) -- we had this initially, but we dropped it, since memory cost doesn't estimate peak memory usage
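
A sketch of the first option (names hypothetical): the comparator already combines the three components with configurable weights, cf. CostComparisonAssertion(1.0, 0.0, 0.0) in the tests, so the quickfix amounts to making the memory weight dominate:

    // weighted score as the comparator computes it; setting memoryWeight much higher
    // than cpuWeight and networkWeight makes memory differences decide first
    static double weightedCost(PlanNodeCostEstimate estimate, double cpuWeight, double memoryWeight, double networkWeight)
    {
        return cpuWeight * estimate.getCpuCost()
                + memoryWeight * estimate.getMemoryCost()
                + networkWeight * estimate.getNetworkCost();
    }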

If we want a proper solution, I think some design doc/discussion would be welcome.

@sopel39
Contributor

sopel39 commented Sep 18, 2018

Currently the CBO memory estimate has rather undefined semantics. I think it is supposed to measure peak memory, but it doesn't currently do that. What it does instead is sum the peak memory of each operator without taking the execution timeline into account. This makes the estimate pessimistic in plans with joins (the build side finishes, thus freeing the memory of that subplan).
Additionally, we should probably go operator by operator to verify that the memory estimate is reasonable.
I would suggest improving peak memory estimation before proceeding with this PR. Otherwise I think there should be some feature toggle to disable or parametrize this behavior.
As @findepi suggested, for now one can tweak cost weights to put more emphasis on memory.

@mbasmanova
Contributor

@findepi @sopel39 @kokosing These are all fair points. While evaluating CBO on production queries, some queries failed due to the memory limit after being rewritten to use a broadcast join. Hence, we were wondering if there would be a way to make CBO aware of the memory limit and avoid plans that exceed that limit.

It is indeed not clear how this can be achieved right now. In particular, it seems like there is no infrastructure to compare costs of the whole plans and not just the sub-plans rooted at the nodes being modified. Also, the costs are compared before all the optimizations have been applied.

> I would suggest improving peak memory estimation before proceeding with this PR.

Seems reasonable. Are there any specific places that are known to have issues?

> one can tweak cost weights to put more emphasis on memory

What we found is that broadcast joins tend to use about the same amount of CPU as distributed joins, but a lot more peak memory. At the same time, broadcast joins have significantly lower wall times (due to better handling of data skew). In practice, lower wall times are good, because users experience lower latency and memory gets released faster. Hence, we don't want to put higher weight on memory, unless the amount of memory needed for the query exceeds the limit.

@findepi
Contributor

findepi commented Sep 18, 2018

> It is indeed not clear how this can be achieved right now. In particular, it seems like there is no infrastructure to compare costs of the whole plans and not just the sub-plans rooted at the nodes being modified.

Correct. It was also apparent in one of the TPC-DS queries.
I would say a memo-based (exploratory) optimizer is the answer. Other than that, I don't have a better (quicker) answer.

Maybe if we solve all the problems that are definitely solvable within the current framework, production queries will already be happy?

> I would suggest improving peak memory estimation before proceeding with this PR.

> Seems reasonable. Are there any specific places that are known to have issues?

@mbasmanova You mean where peak memory deviates from what we calculate today?

Things to consider:

  • are we calculating memory usage correctly (e.g. does PagesIndex introduce some overhead that we need to care about? we used to disregard is-null arrays, which can add a noticeable percentage; there might be other cases)
  • agg
    • operators below the final agg and above it cannot run in parallel (but the final agg can run in parallel with those below and with those above)
  • join, sort -- something like the above
  • thus, for a plan, you need to calculate two things, "peak memory usage of a subtree" and "peak memory usage while the top node is executing/outputting" -- this allows you to calculate the metric bottom-up, just like any other cost (see the sketch below)

If you wish, I can try to formalize this a bit.
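
A rough sketch of that bottom-up recursion (all names are hypothetical, and the rule for which subtrees overlap in time is deliberately simplified; per-operator rules for agg/join/sort would refine it):

    static class PeakMemory
    {
        final double subtreePeak;       // max memory held at any point while executing the subtree
        final double whileOutputting;   // memory still held while the subtree's root produces output

        PeakMemory(double subtreePeak, double whileOutputting)
        {
            this.subtreePeak = subtreePeak;
            this.whileOutputting = whileOutputting;
        }
    }

    PeakMemory estimatePeakMemory(PlanNode node)
    {
        List<PeakMemory> children = node.getSources().stream()
                .map(this::estimatePeakMemory)
                .collect(toList());
        // memory held while this node runs: its own allocation plus whatever its
        // children still hold while streaming into it (a simplification)
        double whileOutputting = ownMemory(node)    // ownMemory() is a hypothetical per-node estimate
                + children.stream().mapToDouble(c -> c.whileOutputting).sum();
        // the subtree peak is reached either earlier, inside some child subtree,
        // or now, while this node is executing
        double subtreePeak = Math.max(
                whileOutputting,
                children.stream().mapToDouble(c -> c.subtreePeak).max().orElse(0));
        return new PeakMemory(subtreePeak, whileOutputting);
    }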

> What we found is that broadcast joins tend to use about the same amount of CPU as distributed joins, but a lot more peak memory.

@mbasmanova "use" -- do you mean estimates or actual resource usage on the cluster?
The cost model assumes there is a certain cost of building the LookupSource, which is multiplied in the case of a broadcast join (every node builds its own LS).
Or maybe the cost of building the LS is negligible compared to the cost of processing the probe table (a join of tables with very different sizes)?

It might be that the relative CPU weights between nodes/operators are assigned "suboptimally". Today building an LS costs the same as a Filter, a Projection, etc.

@findepi
Contributor

findepi commented Sep 19, 2018

I am eager to work on this.
Just not this week :/

@mbasmanova
Contributor

Superseded by #11667

mbasmanova closed this Oct 9, 2018