Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement key-based sampling in the planner #16766

Merged
merged 1 commit into from Sep 29, 2021

Conversation

@kaikalur
Copy link
Contributor

@kaikalur kaikalur commented Sep 20, 2021

We have many ad hoc queries working on large datasets that keep failing/running long time/timing out. So for cases when the user just wants to get a sense of the results, we do smart sampling based on mapping hashes of keys to a percent. When the feature is enabled, we traverse the plan for the query and sample the first integer or string key found (in that order)

  • both sides of an equi-join
  • group by
  • partition by of a window/rownum/topnrownum
  • left-side (source) of semijoin

We apply the sampling predicate only once in every branch of graph so that eventually all qualifying scans will be sampled.

Test plan - added tests

Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines. Don't forget to follow our attribution guidelines for any code copied from other projects.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

General Changes
* Added Smart Sampling feature that samples the tables at query (planning) time. This helps in cases when the user is just exploring the data, testing a pipeline or looking for trends. We sample all the tables (and only tables) in such a way that all "key-based" operations like JOIN/GROUP BY/window functions etc will be complete for the sampled keys. 

@kaikalur kaikalur requested a review from rongrong Sep 20, 2021
@kaikalur kaikalur changed the title Implement key-based sampling in the oplanner Implement key-based sampling in the planner Sep 20, 2021
@kaikalur kaikalur requested a review from highker Sep 20, 2021
@kaikalur
Copy link
Contributor Author

@kaikalur kaikalur commented Sep 21, 2021

Fixed it to not create new plan nodes, instead use the idiomatic replaceChildren call.

@kaikalur kaikalur force-pushed the smart-sampling branch 2 times, most recently from 4ccdedb to 22c9322 Sep 21, 2021
@kaikalur kaikalur force-pushed the smart-sampling branch 2 times, most recently from cc04fa5 to 2113935 Sep 23, 2021
@kaikalur kaikalur requested a review from rongrong Sep 23, 2021
@kaikalur kaikalur force-pushed the smart-sampling branch 4 times, most recently from de4a83e to e546149 Sep 24, 2021
@kaikalur
Copy link
Contributor Author

@kaikalur kaikalur commented Sep 24, 2021

Cleaned up more - renamed the function to key_sampling_percent so it's generally useful and also documented it.

@kaikalur kaikalur force-pushed the smart-sampling branch 2 times, most recently from 76a4d49 to e959496 Sep 28, 2021
Copy link
Contributor

@rongrong rongrong left a comment

There's merge conflict. Please rebase.

@@ -238,3 +238,9 @@ Unicode Functions
are replaced with `replace`. The replacement string `replace` must either
be a single character or empty (in which case invalid characters are
removed).

.. function:: key_sampling_percent(varchar) -> double
Copy link
Contributor

@rongrong rongrong Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also want to introduce a version of this function for varbinary?

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm interesting. I will open an issue for later. For now are using only varchar version.

true),
booleanProperty(
KEY_BASED_SAMPLING_ENABLED,
"Key based sampling of tables enabled",
Copy link
Contributor

@rongrong rongrong Sep 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not really a description of what this feature does. If you think there's no way to describe it clearly here, add proper documentation of this feature and refer it so people can learn more about it.

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - where do I add it? Add a chapter to the user doc?

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the PR message has pretty much the documentation.

Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users won't read PR messages though. This is ok for now. But we probably want to document recommended usage, behavior and limitations once we polish the feature off.

functionAndTypeManager.resolveOperator(OperatorType.LESS_THAN_OR_EQUAL, fromTypes(DOUBLE, DOUBLE)),
BOOLEAN,
sampledArg,
new ConstantExpression(SystemSessionProperties.getKeyBasedSamplingPercentage(session), DOUBLE));
Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Irritating that Intellij is inconsistent with this.


RowExpression sampledArg = call(
functionAndTypeManager,
SystemSessionProperties.getKeyBasedSamplingFunction(session),
Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


private Optional<VariableReferenceExpression> findSuitableKey(List<VariableReferenceExpression> keys)
{
Optional<VariableReferenceExpression> variableReferenceExpression = keys.stream().filter(x -> TypeUtils.isIntegralType(x.getType().getTypeSignature(), functionAndTypeManager)).findFirst();
Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

keys.stream()
.filter(x -> TypeUtils.isIntegralType(x.getType().getTypeSignature(), functionAndTypeManager))
.findFirst();

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// Find the best equijoin clause so we sample both sides the same way optimally

// First see if there is a int/bigint key
Optional<JoinNode.EquiJoinClause> equiJoinClause = node.getCriteria().stream().filter(x -> TypeUtils.isIntegralType(x.getLeft().getType().getTypeSignature(), functionAndTypeManager)).findFirst();
Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: break into multiple lines.

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

arg = rowExpression;
}

RowExpression sampledArg = call(
Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can throw if the function doesn't exist. You probably want to handle that.

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - doesn't it throw PrestoException? Then we can just propagate it?

Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message might not be clear to users. We'd probably throw a function not found error, but the function is not used in user query. Ideally we'd probably want to catch here and rewrite the error message to something like sampling method not found.

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

requireNonNull(plan, "plan is null");
requireNonNull(session, "session is null");
requireNonNull(types, "types is null");
requireNonNull(variableAllocator, "variableAllocator is null");
requireNonNull(idAllocator, "idAllocator is null");
Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are necessary.

Copy link
Contributor Author

@kaikalur kaikalur Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked idiomatic - all planner classes use it. So copied lol. Removed now.

Copy link
Contributor

@rongrong rongrong Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all. Many didn't I think. 😛

@kaikalur kaikalur requested a review from rongrong Sep 29, 2021
@kaikalur kaikalur force-pushed the smart-sampling branch 3 times, most recently from 86968fa to d4afcc9 Sep 29, 2021
@rongrong rongrong merged commit ae77100 into prestodb:master Sep 29, 2021
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

2 participants