-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: multi-measurement query optimization #22301
Conversation
I have quickly looked through this change. While I think it is good, I think it might be worth spending a bit more time understanding why the pre-optimization case was so bad - there might be better places to put this logic than right at the top level. Discussing with @wbaker85 offline about it. |
e5842f4
to
542ee7c
Compare
tsdb/shard.go
Outdated
// subtree containing the OR'd measurements accessible from root of the tree | ||
// either directly (tree contains nothing but OR'd measurements) or by | ||
// traversing AND binary expression nodes. | ||
func (v measurementEvaluator) Visit(node influxql.Node) influxql.Visitor { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make the logic better to have two visitors? You could have one visitor responsible for finding a potential head ( the LHS or RHS of an AND node that was not itself an AND node) and a second visitor responsible for finding if a given head is valid (composed of only measurement EQ conditions) or fatal (some measurement conditions and some non-measurements).
Separating the complicated algorithm into two simple algorithms would probably make the whole thing more clear when we read this code in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I split it up and I think it did make it easier to understand. Check it out and see what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments.
19f4ce5
to
ca4dee2
Compare
tsdb/shard.go
Outdated
return v | ||
} | ||
|
||
if name, ok := measurementNameFromBinary(n, v.measurementKey); ok { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have a test for something this misses? e.g.
(r._name = "foo" || r._name = "bar") && (r._value = r._name)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some updates that we discussed which should alleviate this concern. Since an expression like (_name = 'm0' OR _name = 'm1' OR _name = 'm2') AND (tag1 != 'foo' OR _name = 'm1')
should be able to apply the optimization (we know that the values must come from m0
, m1
, or m2
, even though m1
occurs on the left and right side of the AND), I updated the code to only look for a single group of exclusive OR'd measurements, and not invalidate the optimization if a measurement occurs elsewhere.
This simplifies looking for measurements anywhere else in the query, since it basically doesn't matter. I renamed this funciton to measurementNameFromEqBinary
and added clarification that it only returns measurement names for EQ operations in the from of _measurement = name
. For an EQ that has anything else, the tree will be invalidated.
ca4dee2
to
bbb3e67
Compare
tsdb/shard.go
Outdated
// A BinaryExpr must have an operation of OR or EQ in a valid tree | ||
if n.Op == influxql.OR { | ||
// The children of ORs must be either BinaryExprs themselves, or Parens | ||
if binaryOrParen(n.LHS) && binaryOrParen(n.RHS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need this anymore - you can just return v
now, if not binary or paren then the visitor will return false (hence invalid) anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two minor comments and I think this is good to go.
This looks good to me now, but a flux test failed... |
* feat: multi-measurement query optimization (cherry picked from commit 3e275a1)
Closes #22156
This replaces the existing single measurement query optimization with an expanded version that applies to certain queries containing multiple measurements.
Background
The optimization applies for for queries where:
All measurements only use equality for comparison, for example
r["_measurement"] == "measName"
.One or more measurement expressions constituting a "group" of measurements are evaluated using only
OR
.For example, a query like this applies the optimization:
...but this does not:
There are measurements at no other place in the query than in the single group ofOR
'd measurements.The group of
OR
'd measurements is applied to the query by a "top-level"AND
. This ensures that all results from the query must belong to series containing one of those measurements. Put another way, there are no operators other thanAND
separating the group ofOR
'd measurements from the other conditions of the query.As an
influxql.Expr
string, such a query could look like this:...but not this:
Note that a query consisting only of a group of
OR
'd measurements will apply the optimization as well:The code for determining if the optimization applies and what measurement names are contained in the query uses traversal of the
influxql.Expr
AST, as did the previous single measurement optimization. It locates subtrees consisting entirely of binary expressions withOR
operators (or parens) where the leaf nodes areEQ
comparisons between_measurement
and the string value of the measurement. These subtrees must be accessible from the head of the tree by exclusively traveling through binary expressions withAND
operators (or parens). If there is more than one such subtree, the optimization cannot be applied.Performance Improvement
For high-cardinality data spanning across several shards, the performance increase from this optimization is substantial. I tested locally using a dataset spanning 1 year (~52 shards) with 60,000 unique measurements containing 6,000,000 lines with each line having 20 fields (comparable to the data reported in #22156).
The following flux query was used, with the time range encompassing the entire dataset:
Pre-optimization, the average time for a single query was around ~20 seconds:
With the optimization in place, the query time was reduced to an average of ~55 milliseconds:
The optimized query time is comparable to an equivalent InfluxQL query, which averages around 35 milliseconds: