New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arithmetic Expressions in Projections #30
Conversation
Very nice! I'll have a look over this. |
tests/lib.rs
Outdated
let mut g = distributary::Blender::new(); | ||
let sql = " | ||
CREATE TABLE Bread (id int, price int, PRIMARY KEY(id)); | ||
Price: SELECT 2 * COUNT(price) FROM Bread WHERE id = ?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted this query to be SELECT 2 * MAX(price) FROM Bread
but I couldn't figure out how to retrieve the result without having a key connected to the query. I saw some of the other tests/benchmarks used a bogus key of 0, but getter.lookup(&0.into(), true)
didn't seem to do the trick either. What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got it working by using the result from 2 * MAX(price)
as the key for now, but would definitely prefer a cleaner way of doing it.
Commit: 688e85f
let left = match expression.left { | ||
ProjectExpressionBase::Column(i) => &record[i], | ||
ProjectExpressionBase::Literal(ref data) => data, | ||
}.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the .clone()
here (or below) should be necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any way to do this without overloading the arithmetic operators on &DataType
instead of DataType
? Pushed a commit here for how that would look.
Tried playing around with ownership stuff, but couldn't figure out how to do it without having to .clone()
the ProjectExpression
passed into eval_expression
and the DataType
at record[i]
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doing arithmetic on &DataType
is fine. That said, the problem here is less about cloning the record but about the fact that the whole ProjectExpressions
is being cloned (which is larger and more complex). Strictly speaking, only record[i]
needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType
and Literal
, without having to overload them for ProjectExpression
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strictly speaking, only
record[i]
needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.
I might be misunderstanding, but I think that's what I'm doing? left
and right
are DataType
objects here, and the operators are overloaded directly on DataType
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, yes, you're right, I misread.
src/ops/project.rs
Outdated
for i in a { | ||
new_r.push(i.clone()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@malte this is a similar case to aggregations where columns appear in a pre-defined location (i.e., first all projected columns, then all expressions, then all literals). I know that affected MIR for aggregations during the deadline, so maybe be on the lookout for that here too?
@@ -720,7 +720,7 @@ mod tests { | |||
// Should have two nodes: source and "users" base table | |||
let ncount = mig.graph().node_count(); | |||
assert_eq!(ncount, 2); | |||
assert_eq!(get_node(&inc, &*mig, "users").name(), "users"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the &*
make a difference in these tests? They seem to be passing fine without it as well, so I removed it - let me know if it should be there though and I'll revert the commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is probably a leftover from when migrations worked somewhat differently, and can indeed be removed.
Thanks, I'll review this tonight or tomorrow at the latest! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! I've left some comments inline, although I think almost all of them are probably best addressed in separate PRs rather than by changing this one.
} => match *other { | ||
MirNodeType::Project { | ||
ref emit, | ||
ref literals, | ||
} => our_emit == emit && our_literals == literals, | ||
ref arithmetic, | ||
} => our_emit == emit && our_literals == literals && our_arithmetic == arithmetic, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is quite conservative (though probably fine for v0 of arithmetic expression support): it will only allow reuse of projections that have exactly identical arithmetic expressions. For example, with multiple elements in the arithmetic
vector, it would only reuse the projection if the vectors have the same elements and they're all identical. However, it is also acceptable to reuse the projection if the new our_arithmetic
is a strict subset of arithmetic
.
Now, you might observe that we are similarly unnecessarily strict about emit
, and you'd be right ;-)
src/mir/node.rs
Outdated
emit.iter() | ||
.map(|c| c.name.as_str()) | ||
.collect::<Vec<_>>() | ||
.join(", "), | ||
if arithmetic.len() > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: is_empty()
right: right, | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I suspect we only need these (as opposed to using nom-sql
's almost-identical ArithmeticExpression
) because of the Literal(DataType)
in the base case, which in nom-sql
is Scalar(Literal)
. That's a bit suboptimal -- could we use the nom-sql
structure here since we implement Into<DataType>
for nom_sql::Literal
? (I guess we'd potentially be doing the conversion every time we access the literal if we're not careful, but maybe that can be solved without duplicating the data structure.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further thought, I guess another difference is that the Column
variant of the ProjectExpressionBase
enum holds a numeric column index, not a Column
object. Hrm, maybe we need the duplicate structures after all.
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result { | ||
match *self { | ||
ProjectExpressionBase::Column(u) => write!(f, "{}", u), | ||
ProjectExpressionBase::Literal(ref l) => write!(f, "(lit: {})", l), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this just be the value? If I parse this right, we would currently end up with funny-looking output like salary * (lit: 100)
rather than salary * 100
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the lit:
since it outputs the emit index, and not the actual column name. I thought 0 * 100
might look slightly confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, fair point. I keep forgetting that the low-level operators only know about column indices...
let left = match expression.left { | ||
ProjectExpressionBase::Column(i) => &record[i], | ||
ProjectExpressionBase::Literal(ref data) => data, | ||
}.clone(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doing arithmetic on &DataType
is fine. That said, the problem here is less about cloning the record but about the fact that the whole ProjectExpressions
is being cloned (which is larger and more complex). Strictly speaking, only record[i]
needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType
and Literal
, without having to overload them for ProjectExpression
.
@@ -720,7 +720,7 @@ mod tests { | |||
// Should have two nodes: source and "users" base table | |||
let ncount = mig.graph().node_count(); | |||
assert_eq!(ncount, 2); | |||
assert_eq!(get_node(&inc, &*mig, "users").name(), "users"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is probably a leftover from when migrations worked somewhat differently, and can indeed be removed.
FieldExpression::Arithmetic(ref a) => { | ||
if let ArithmeticBase::Column(ref c) = a.left { | ||
add_computed_column(&mut qg, c); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention (and if I read this right, the implementation) here is that computed columns (e.g., SUM(col)
) can occur within arithmetic expressions and that we still need to generate the right query graphs for these, correct?
(As opposed to each column defined via an arithmetic expression being considered "computed", which it wouldn't really be from the query graph perspective.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactly - before I added this it wasn't possible to refer to computed columns inside arithmetic expressions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
} | ||
|
||
qg.columns.push(OutputColumn::Arithmetic(ArithmeticColumn { | ||
name: String::from("arithmetic"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this deal with aliases for arithmetic columns? Seems like it might hardcode the column name always, which would be problematic if we had more than one arithmetic column.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, that's a good point. I implemented this by looking at how it was being done for literals, which also seems to use the same name for the computed columns:
https://github.com/ekmartin/distributary/blob/85044bf28c9bb83bbe2b907bfdfaf7cf0b6d0c84/src/sql/query_graph.rs#L696-L702
Both literals and arithmetic expression names are handled here though:
https://github.com/ekmartin/distributary/blob/85044bf28c9bb83bbe2b907bfdfaf7cf0b6d0c84/src/sql/mir.rs#L737-L763
Where column objects are created with only the name and the table. It doesn't seem to look at aliases at all for literals though, which I guess might mean that aliases doesn't work for literals either?
I added a test for multiple arithmetic expressions in one query here - I'll see how it behaves with aliases as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right that aliases for literals probably don't work.
Maybe it's best to add alias support to both ArithmeticColumn
and LiteralColumn
for now (either within this PR or separately, I don't mind).
A more general thought on this (certainly outside the scope of this PR): I wonder if we need to move to a model where the
The downsides of this plan are the increase in complexity of code handling |
// Retrieve the result of the count query: | ||
let result = getter.lookup(&id, true).unwrap(); | ||
assert_eq!(result.len(), 1); | ||
assert_eq!(result[0][1], 246.into()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might have gotten something wrong with the output order here. For other queries it looks like the query key is outputted as the last element in the result - whereas with arithmetic expressions it seems to be placed first, which I think might be wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind - this seems to match the output order for literals (and I think it makes sense since the columns are just outputted in groups like mentioned in #30 (comment))
@ms705: Sorry about the late answer. Storing the information on Maybe it would be possible to move the translation from columns to column IDs from to_flow#make_project_node to the |
I still think it's sad that we |
@jonhoo Agreed; do you see a way to do this without a clone? As far as I can see, arithmetic on |
@ekmartin We had an offline discussion here and concluded that arithmetic on |
This adds support for e.g.
SELECT 2 * price FROM ...
andSELECT price * MAX(column) FROM ...
by handling nom-sql'sArithmeticExpression
.Arithmetic between different numeric data types is converted implicitly:
BigInt
+Int
=BigInt
Real
+Int
=Real
BigInt
+Real
(sinceReal
's integer part is represented using ani32
)Similar to how nom-sql handles arithmetic this currently only deals with simple, non-nested, expressions, but it should definitely be possible to add support for nested expressions in the future.
I'm not completely sure if the way I've handled
ArithmeticColumns
in the query graph is correct (especially when it comes to re-use), but it seems to work for the tests I've written so far.The output formatting of arithmetic nodes could probably also need some tuning.