Update the batch resizing heuristic in PageProcessor #11901

Merged
merged 1 commit into prestodb:master from nezihyigitbasi:projection-experiment-batch-resizing Nov 27, 2018

Conversation

4 participants
@nezihyigitbasi
Contributor

nezihyigitbasi commented Nov 12, 2018

The yield check in the generated projection tight loop may have a negative
impact on performance. The yield check was introduced for expensive
projections so that the engine can yield them. However, that check is
not really necessary for cheap projections, and having a branch in the
tight for loop hurts their performance.

This change removes the yield check from the generated projection loop
and, to handle expensive expressions properly, adds the cost of the
evaluated expression as another variable to the batch resizing heuristic
in PageProcessor. With that, Presto now uses small batch sizes for
expensive expressions and larger batch sizes for cheaper ones.

Removing the yield check from the projection loop improves performance
by as much as ~5% in microbenchmarks and on various production queries.

Benchmark                                   (expressionProfilerEnabled)  Mode  Cnt   Score   Error  Units
ExpressionProfilerBenchmark.pageProjection                        false  avgt   15  37.672 ± 4.401  us/op
ExpressionProfilerBenchmark.pageProjection                         true  avgt   15  35.336 ± 4.724  us/op

Alternative to #11816.
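The batch-resizing idea described above can be sketched roughly as follows. This is an illustrative toy, not Presto's actual PageProcessor code: the class name, constants, and halving/doubling policy are assumptions for the example.

```java
// Toy sketch of the batch-resizing idea: shrink the batch when the measured
// cost of the last batch exceeds a yield quanta, grow it otherwise.
// BatchSizer and its constants are hypothetical, not Presto code.
public class BatchSizer
{
    static final int MIN_BATCH = 1;
    static final int MAX_BATCH = 8 * 1024;

    private int batchSize = MIN_BATCH;

    void update(long batchNanos, long quantaNanos)
    {
        if (batchNanos > quantaNanos) {
            // Expensive expression: halve so a batch stays within a quanta
            batchSize = Math.max(MIN_BATCH, batchSize / 2);
        }
        else {
            // Cheap expression: double to amortize per-batch overhead
            batchSize = Math.min(MAX_BATCH, batchSize * 2);
        }
    }

    int batchSize()
    {
        return batchSize;
    }

    public static void main(String[] args)
    {
        BatchSizer sizer = new BatchSizer();
        long quanta = 1_000_000_000L; // 1s quanta
        for (int i = 0; i < 20; i++) {
            sizer.update(1_000, quanta); // cheap batches: size doubles
        }
        System.out.println(sizer.batchSize()); // prints 8192
        sizer.update(2_000_000_000L, quanta); // one expensive batch
        System.out.println(sizer.batchSize()); // prints 4096
    }
}
```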

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 12, 2018

Looks like test failures are related. Please don't review for now. I will update & let you know.

@nezihyigitbasi nezihyigitbasi force-pushed the nezihyigitbasi:projection-experiment-batch-resizing branch from beb8473 to 3b19409 Nov 13, 2018

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 13, 2018

Some tests were sensitive to the initial batch size in PageProcessor, so I exposed that as a knob those tests can configure. Other tests relied on the yield behavior in the generated code and were failing since we removed it; I fixed those too.

@nezihyigitbasi nezihyigitbasi force-pushed the nezihyigitbasi:projection-experiment-batch-resizing branch from 3b19409 to df1a52a Nov 13, 2018

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 14, 2018

@dain I fixed the tests, please take a look when you have a chance.

@sopel39

Member

sopel39 commented Nov 15, 2018

@nezihyigitbasi are your results stable across multiple executions of ExpressionProfilerBenchmark.pageProjection?

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 16, 2018

I ran it a few times before opening this PR. Now that you've asked, I ran it a few more times, also with different JMH settings.

With 3 forks, 5 warmup iterations of 10s, and 5 measurement iterations of 10s (the same config used for the results I shared above), I get these two results (~15-25% better):

# Run complete. Total time: 00:10:29

Benchmark                                   (expressionProfilerEnabled)  Mode  Cnt   Score    Error  Units
ExpressionProfilerBenchmark.pageProjection                        false  avgt   15  53.148 ± 39.445  us/op
ExpressionProfilerBenchmark.pageProjection                         true  avgt   15  40.399 ±  8.933  us/op


# Run complete. Total time: 00:10:25

Benchmark                                   (expressionProfilerEnabled)  Mode  Cnt   Score    Error  Units
ExpressionProfilerBenchmark.pageProjection                        false  avgt   15  51.358 ± 11.412  us/op
ExpressionProfilerBenchmark.pageProjection                         true  avgt   15  43.896 ± 11.130  us/op

Then, to further make sure the results are stable and to get more confidence in the numbers, I increased the fork count to 5 and increased both the warmup and measurement iterations to 30s, and got the following result (~8% better):

# Run complete. Total time: 00:50:35

Benchmark                                   (expressionProfilerEnabled)  Mode  Cnt   Score   Error  Units
ExpressionProfilerBenchmark.pageProjection                        false  avgt   25  39.318 ± 3.682  us/op
ExpressionProfilerBenchmark.pageProjection                         true  avgt   25  36.051 ± 3.441  us/op

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 16, 2018

@dain any comments?

@dain

I have a bunch of comments, but this is looking good. I think we're also going to want a similar change on the filter side, but that can be done later.

}
@VisibleForTesting
public ExpressionProfiler(int rowsToProfile, int expensiveFunctionThresholdMillis)

@dain

dain Nov 16, 2018

Contributor

I'd replace function with expression in all of these names

long now = System.nanoTime();
long delta = NANOSECONDS.toMillis(now - previousTimestamp);
meanExecutionTime = (meanExecutionTime * samples + delta) / (samples + 1);

@dain

dain Nov 16, 2018

Contributor

I don't think there is any reason to keep a running mean, since this is only evaluated once. Instead, I'd just keep sum of time and rows.
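The sum-of-time-and-rows bookkeeping dain suggests could look roughly like this. The names are illustrative, not the actual patch:

```java
// Illustrative sketch of the suggestion: accumulate totals and divide once,
// instead of maintaining a running mean on every sample.
public class ProfileTotals
{
    private long totalNanos;
    private long totalRows;

    void record(long elapsedNanos, int rows)
    {
        totalNanos += elapsedNanos;
        totalRows += rows;
    }

    // Mean cost per row, computed only when it is actually needed
    double nanosPerRow()
    {
        return totalRows == 0 ? 0 : (double) totalNanos / totalRows;
    }

    public static void main(String[] args)
    {
        ProfileTotals totals = new ProfileTotals();
        totals.record(1_000, 100);
        totals.record(3_000, 100);
        System.out.println(totals.nanosPerRow()); // prints 20.0
    }
}
```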

meanExecutionTime = (meanExecutionTime * samples + delta) / (samples + 1);
samples += batchSize;
if (samples >= rowsToProfile) {
isProfiling = false;

@dain

dain Nov 16, 2018

Contributor

I'm not sure the benefits of disabling profiling outweigh the risks. You can have organized data where the first N rows are super cheap (e.g., they are null), and then things get expensive. I'd consider always profiling, since the overhead is pretty small (two time calls and some math).

public class ExpressionProfiler
{
private static final int NUMBER_OF_ROWS_TO_PROFILE = 1024;

@dain

dain Nov 16, 2018

Contributor

I'm concerned that the ramp-up is going to be too slow, and then grow too large. My understanding is we will do 1-row pages until we hit the profile size, and then we allow the batch size to increase based only on the output size. This means if the operation takes 0.5 seconds per row, we will allow the size to grow to max, but then the max runtime is 8 * 1024 * 0.5 seconds ~= 1.14 hours. Instead, I suggest we measure each batch and then simply report whether the last batch was too large, since this fits the existing scaling model in PageProjection (we could say the exact size we expect to be correct, but I don't think we need that). This means we will grow by doubling quickly, but still stop at an appropriate range, and if the data changes we can continue to adjust.
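A quick back-of-the-envelope check of the worst-case arithmetic in the comment above:

```java
// Verify the worst case described: a max batch of 8 * 1024 rows at
// 0.5 seconds per row.
public class WorstCase
{
    public static void main(String[] args)
    {
        int maxBatch = 8 * 1024;
        double secondsPerRow = 0.5;
        double hours = maxBatch * secondsPerRow / 3600.0;
        System.out.println(hours); // ≈1.1378, i.e. the ~1.14 hours estimate
    }
}
```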

public class ExpressionProfiler
{
private static final int NUMBER_OF_ROWS_TO_PROFILE = 1024;
private static final int EXPENSIVE_FUNCTION_THRESHOLD_MILLIS = 1_000;

@dain

dain Nov 16, 2018

Contributor

Can we link this to the quanta variable in the thread scheduler?

PageProjection projection = projectionSupplier.get();
Page page = new Page(createLongSequenceBlock(1, 10));
ExpressionProfiler profiler = new ExpressionProfiler(10, 10_000);
for (int i = 0; i < 100; i++) {

@dain

dain Nov 16, 2018

Contributor

I'd like to see tighter bounds on the test. The first 10 invocations should say it is expensive, and then it should switch back.

Supplier<PageProjection> projectionSupplier = functionCompiler.compileProjection(add10Expression, Optional.empty());
PageProjection projection = projectionSupplier.get();
Page page = new Page(createLongSequenceBlock(1, 10));
ExpressionProfiler profiler = new ExpressionProfiler(10, 10_000);

@dain

dain Nov 16, 2018

Contributor

I think you can create better tests if you change this class to use a Ticker; then you can control exactly how time moves in the test.
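One way to make the profiler's timing deterministic in tests is a hand-rolled fake ticker. TestTicker below is hypothetical and sketched without any library dependency; it mirrors the single read() method of the Ticker interface the profiler ends up using later in this review:

```java
// Hypothetical fake ticker for deterministic timing tests; the test advances
// time explicitly instead of relying on the system clock.
public class TestTicker
{
    private long nanos;

    public long read()
    {
        return nanos;
    }

    public void advance(long deltaNanos)
    {
        nanos += deltaNanos;
    }

    public static void main(String[] args)
    {
        TestTicker ticker = new TestTicker();
        long start = ticker.read();
        ticker.advance(5_000_000); // pretend the batch took exactly 5ms
        System.out.println(ticker.read() - start); // prints 5000000
    }
}
```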

@@ -22,6 +22,9 @@
import com.facebook.presto.spi.block.LazyBlock;
import com.facebook.presto.spi.block.VariableWidthBlock;
import com.facebook.presto.spi.type.Type;
import com.facebook.presto.sql.gen.ExpressionProfiler;

@dain

dain Nov 16, 2018

Contributor

I'd like to see a test where this affects the output scaling, and if possible with and without the size scaling, so we can verify the integrated feature works how we expected.

@dain

Contributor

dain commented Nov 16, 2018

Also, it would be nice if we can get a benchmark result where the results don't have overlapping error bars.

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 16, 2018

Here is the latest benchmark source.

@sopel39

Member

sopel39 commented Nov 19, 2018

@nezihyigitbasi

Then to further make sure that results are stable and get more confidence in the numbers, I increased the fork count to 5, and increased the measurement and warmup runs to both 30s and I got the following result (~8% better):

Were the results (after increased warmup time):

Benchmark                                   (expressionProfilerEnabled)  Mode  Cnt   Score   Error  Units
ExpressionProfilerBenchmark.pageProjection                        false  avgt   25  39.318 ± 3.682  us/op
ExpressionProfilerBenchmark.pageProjection                         true  avgt   25  36.051 ± 3.441  us/op

with your optimization?
What are the unmodified code results after increased timeouts?

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 19, 2018

@sopel39 the expressionProfilerEnabled=true case is the current PR and expressionProfilerEnabled=false is what we have right now.

@nezihyigitbasi nezihyigitbasi force-pushed the nezihyigitbasi:projection-experiment-batch-resizing branch from df1a52a to 6c2f7ba Nov 19, 2018

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 19, 2018

@dain comments addressed in a separate commit (will be squashed before merge).

Here are the new microbenchmark results:

# Run complete. Total time: 00:33:55

Benchmark                                   (expressionProfilerEnabled)  Mode  Cnt   Score   Error  Units
ExpressionProfilerBenchmark.pageProjection                        false  avgt   50  42.334 ± 0.471  us/op
ExpressionProfilerBenchmark.pageProjection                         true  avgt   50  38.683 ± 1.801  us/op

@nezihyigitbasi nezihyigitbasi force-pushed the nezihyigitbasi:projection-experiment-batch-resizing branch from 6c2f7ba to 02087fa Nov 26, 2018

@nezihyigitbasi

Contributor

nezihyigitbasi commented Nov 26, 2018

@dain Please take a look when you have a chance.

@dain

dain approved these changes Nov 27, 2018

Other than the part where timings are converted to millis, this looks good.

this.rowsToProfile = NUMBER_OF_ROWS_TO_PROFILE;
this.expensiveFunctionThresholdMillis = EXPENSIVE_FUNCTION_THRESHOLD_MILLIS;
this.expensiveExpressionThresholdMillis = EXPENSIVE_EXPRESSION_THRESHOLD_MILLIS;
this.ticker = systemTicker();

@dain

dain Nov 27, 2018

Contributor

You can use a passthrough constructor call here this(systemTicker(), EXPENSIVE_EXPRESSION_THRESHOLD_MILLIS)

verify(expensiveFunctionThresholdMillis >= 0);
this.rowsToProfile = rowsToProfile;
this.expensiveFunctionThresholdMillis = expensiveFunctionThresholdMillis;
verify(expensiveExpressionThresholdMillis >= 0);

@dain

dain Nov 27, 2018

Contributor

maybe move this down a line so they are in the same order as the parameters

}
@VisibleForTesting
public ExpressionProfiler(int rowsToProfile, int expensiveFunctionThresholdMillis)
public ExpressionProfiler(Ticker ticker, long expensiveExpressionThresholdMillis)

@dain

dain Nov 27, 2018

Contributor

I'd change expensiveExpressionThresholdMillis to duration and then do the conversion in the constructor.

verify(previousTimestamp != NOT_INITALIZED, "start() is not called");
verify(batchSize > 0, "batchSize must be positive");
long now = System.nanoTime();
long now = ticker.read();
long delta = NANOSECONDS.toMillis(now - previousTimestamp);

@dain

dain Nov 27, 2018

Contributor

Why convert the delta to millis instead of just leaving it in nanos?

if (positionCount > 1 && previousPositionCount != 1) {
assertEquals(positionCount, previousPositionCount / 2);
}
previousPositionCount = positionCount;

@dain

dain Nov 27, 2018

Contributor

I'd advance the testing ticker in this loop just to be safe.

@nezihyigitbasi nezihyigitbasi force-pushed the nezihyigitbasi:projection-experiment-batch-resizing branch from 02087fa to ef55657 Nov 27, 2018

@nezihyigitbasi nezihyigitbasi merged commit a99670e into prestodb:master Nov 27, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@nezihyigitbasi nezihyigitbasi deleted the nezihyigitbasi:projection-experiment-batch-resizing branch Nov 27, 2018
