Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arithmetic Expressions in Projections #30

Closed
wants to merge 19 commits into from

Conversation

ekmartin
Copy link
Member

@ekmartin ekmartin commented Oct 1, 2017

This adds support for e.g. SELECT 2 * price FROM ... and SELECT price * MAX(column) FROM ... by handling nom-sql's ArithmeticExpression.

Arithmetic between different numeric data types is converted implicitly:

  • BigInt + Int = BigInt
  • Real + Int = Real
  • Not supported: BigInt + Real (since Real's integer part is represented using an i32)

Similar to how nom-sql handles arithmetic this currently only deals with simple, non-nested, expressions, but it should definitely be possible to add support for nested expressions in the future.

I'm not completely sure if the way I've handled ArithmeticColumns in the query graph is correct (especially when it comes to re-use), but it seems to work for the tests I've written so far.
The output formatting of arithmetic nodes could probably also need some tuning.

@ms705 ms705 self-requested a review October 1, 2017 20:32
@ms705
Copy link
Member

ms705 commented Oct 1, 2017

Very nice! I'll have a look over this.

tests/lib.rs Outdated
let mut g = distributary::Blender::new();
let sql = "
CREATE TABLE Bread (id int, price int, PRIMARY KEY(id));
Price: SELECT 2 * COUNT(price) FROM Bread WHERE id = ?;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted this query to be SELECT 2 * MAX(price) FROM Bread but I couldn't figure out how to retrieve the result without having a key connected to the query. I saw some of the other tests/benchmarks used a bogus key of 0, but getter.lookup(&0.into(), true) didn't seem to do the trick either. What am I missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got it working by using the result from 2 * MAX(price) as the key for now, but would definitely prefer a cleaner way of doing it.

Commit: 688e85f

let left = match expression.left {
ProjectExpressionBase::Column(i) => &record[i],
ProjectExpressionBase::Literal(ref data) => data,
}.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the .clone() here (or below) should be necessary?

Copy link
Member Author

@ekmartin ekmartin Oct 1, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to do this without overloading the arithmetic operators on &DataType instead of DataType? Pushed a commit here for how that would look.

Tried playing around with ownership stuff, but couldn't figure out how to do it without having to .clone() the ProjectExpression passed into eval_expression and the DataType at record[i].

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doing arithmetic on &DataType is fine. That said, the problem here is less about cloning the record but about the fact that the whole ProjectExpressions is being cloned (which is larger and more complex). Strictly speaking, only record[i] needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Strictly speaking, only record[i] needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.

I might be misunderstanding, but I think that's what I'm doing? left and right are DataType objects here, and the operators are overloaded directly on DataType.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, you're right, I misread.

for i in a {
new_r.push(i.clone());
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malte this is a similar case to aggregations where columns appear in a pre-defined location (i.e., first all projected columns, then all expressions, then all literals). I know that affected MIR for aggregations during the deadline, so maybe be on the lookout for that here too?

@@ -720,7 +720,7 @@ mod tests {
// Should have two nodes: source and "users" base table
let ncount = mig.graph().node_count();
assert_eq!(ncount, 2);
assert_eq!(get_node(&inc, &*mig, "users").name(), "users");
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the &* make a difference in these tests? They seem to be passing fine without it as well, so I removed it - let me know if it should be there though and I'll revert the commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably a leftover from when migrations worked somewhat differently, and can indeed be removed.

@ekmartin
Copy link
Member Author

ekmartin commented Oct 2, 2017

@ms705: I split out the changes from mir/mod.rs into the different files you created in 8590bff.

@ms705
Copy link
Member

ms705 commented Oct 2, 2017

Thanks, I'll review this tonight or tomorrow at the latest!

@ekmartin ekmartin closed this Oct 3, 2017
@ekmartin ekmartin deleted the arithmetic branch October 3, 2017 00:50
@ekmartin ekmartin restored the arithmetic branch October 3, 2017 00:50
@ekmartin ekmartin reopened this Oct 3, 2017
Copy link
Member

@ms705 ms705 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! I've left some comments inline, although I think almost all of them are probably best addressed in separate PRs rather than by changing this one.

} => match *other {
MirNodeType::Project {
ref emit,
ref literals,
} => our_emit == emit && our_literals == literals,
ref arithmetic,
} => our_emit == emit && our_literals == literals && our_arithmetic == arithmetic,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite conservative (though probably fine for v0 of arithmetic expression support): it will only allow reuse of projections that have exactly identical arithmetic expressions. For example, with multiple elements in the arithmetic vector, it would only reuse the projection if the vectors have the same elements and they're all identical. However, it is also acceptable to reuse the projection if the new our_arithmetic is a strict subset of arithmetic.

Now, you might observe that we are similarly unnecessarily strict about emit, and you'd be right ;-)

src/mir/node.rs Outdated
emit.iter()
.map(|c| c.name.as_str())
.collect::<Vec<_>>()
.join(", "),
if arithmetic.len() > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is_empty()

right: right,
}
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I suspect we only need these (as opposed to using nom-sql's almost-identical ArithmeticExpression) because of the Literal(DataType) in the base case, which in nom-sql is Scalar(Literal). That's a bit suboptimal -- could we use the nom-sql structure here since we implement Into<DataType> for nom_sql::Literal? (I guess we'd potentially be doing the conversion every time we access the literal if we're not careful, but maybe that can be solved without duplicating the data structure.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On further thought, I guess another difference is that the Column variant of the ProjectExpressionBase enum holds a numeric column index, not a Column object. Hrm, maybe we need the duplicate structures after all.

fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
match *self {
ProjectExpressionBase::Column(u) => write!(f, "{}", u),
ProjectExpressionBase::Literal(ref l) => write!(f, "(lit: {})", l),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this just be the value? If I parse this right, we would currently end up with funny-looking output like salary * (lit: 100) rather than salary * 100.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the lit: since it outputs the emit index, and not the actual column name. I thought 0 * 100 might look slightly confusing.

https://github.com/ekmartin/distributary/blob/436333cce567b3c50e253fa44ba9f621c23a576a/src/ops/project.rs#L300

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair point. I keep forgetting that the low-level operators only know about column indices...

let left = match expression.left {
ProjectExpressionBase::Column(i) => &record[i],
ProjectExpressionBase::Literal(ref data) => data,
}.clone();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doing arithmetic on &DataType is fine. That said, the problem here is less about cloning the record but about the fact that the whole ProjectExpressions is being cloned (which is larger and more complex). Strictly speaking, only record[i] needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.

@@ -720,7 +720,7 @@ mod tests {
// Should have two nodes: source and "users" base table
let ncount = mig.graph().node_count();
assert_eq!(ncount, 2);
assert_eq!(get_node(&inc, &*mig, "users").name(), "users");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably a leftover from when migrations worked somewhat differently, and can indeed be removed.

FieldExpression::Arithmetic(ref a) => {
if let ArithmeticBase::Column(ref c) = a.left {
add_computed_column(&mut qg, c);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention (and if I read this right, the implementation) here is that computed columns (e.g., SUM(col)) can occur within arithmetic expressions and that we still need to generate the right query graphs for these, correct?

(As opposed to each column defined via an arithmetic expression being considered "computed", which it wouldn't really be from the query graph perspective.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly - before I added this it wasn't possible to refer to computed columns inside arithmetic expressions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

qg.columns.push(OutputColumn::Arithmetic(ArithmeticColumn {
name: String::from("arithmetic"),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this deal with aliases for arithmetic columns? Seems like it might hardcode the column name always, which would be problematic if we had more than one arithmetic column.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, that's a good point. I implemented this by looking at how it was being done for literals, which also seems to use the same name for the computed columns:
https://github.com/ekmartin/distributary/blob/85044bf28c9bb83bbe2b907bfdfaf7cf0b6d0c84/src/sql/query_graph.rs#L696-L702

Both literals and arithmetic expression names are handled here though:
https://github.com/ekmartin/distributary/blob/85044bf28c9bb83bbe2b907bfdfaf7cf0b6d0c84/src/sql/mir.rs#L737-L763

Where column objects are created with only the name and the table. It doesn't seem to look at aliases at all for literals though, which I guess might mean that aliases doesn't work for literals either?

I added a test for multiple arithmetic expressions in one query here - I'll see how it behaves with aliases as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right that aliases for literals probably don't work.

Maybe it's best to add alias support to both ArithmeticColumn and LiteralColumn for now (either within this PR or separately, I don't mind).

@ms705
Copy link
Member

ms705 commented Oct 4, 2017

A more general thought on this (certainly outside the scope of this PR): I wonder if we need to move to a model where the Column object carries information about e.g., arithmetic on projections, or indeed about the column being a literal value. There are two reasons for this:

  1. We already associate information about functions (e.g., MAX, SUM etc.) with the Column object, and it seems inconsistent to treat other similar notions differently.
  2. By carrying information about arithmetic expressions and literals outside the Column object, our interfaces keep gaining additional vector of "special" values (most notably the MIR Project node and the Project operator), and we are forced into a static column emit order -- namely ordinary columns, literal columns, and arithmetic columns in separate groups. This is a little unnecessary, and requires extra projections to permute the columns back into the order the query expects them in; carrying information about arithmetic definitions and literals inside Column would permit us to order them any way we like.

The downsides of this plan are the increase in complexity of code handling Column objects (and we have plenty of that!) and having to be careful to rewrite columns with arithmetic or literal definitions to "ordinary" columns after their first occurrence (which we're not even doing consistently for function columns currently, but should).

// Retrieve the result of the count query:
let result = getter.lookup(&id, true).unwrap();
assert_eq!(result.len(), 1);
assert_eq!(result[0][1], 246.into());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I might have gotten something wrong with the output order here. For other queries it looks like the query key is outputted as the last element in the result - whereas with arithmetic expressions it seems to be placed first, which I think might be wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind - this seems to match the output order for literals (and I think it makes sense since the columns are just outputted in groups like mentioned in #30 (comment))

@ekmartin
Copy link
Member Author

@ms705: Sorry about the late answer. Storing the information on Column seems more reasonable indeed. That would also make it easier to add support for aliases in arithmetic columns in nom-sql, as it wouldn't need to be tacked onto e.g. both Literal and ArithmeticExpression.

Maybe it would be possible to move the translation from columns to column IDs from to_flow#make_project_node to the Project operator itself? That way the full list of columns in the right order could be attached to the operator, and something like parent.column_id_for_column() could be done from there.

@ms705
Copy link
Member

ms705 commented Oct 16, 2017

Are we good to merge this? @jonhoo @ekmartin

@jonhoo
Copy link
Contributor

jonhoo commented Oct 16, 2017

I still think it's sad that we clone the DataType, but I guess it's fine since it'll always be numbers. So yeah, I'm fine with merging.

@ms705
Copy link
Member

ms705 commented Oct 16, 2017

@jonhoo Agreed; do you see a way to do this without a clone? As far as I can see, arithmetic on &DataType is the only option, right?

@ms705
Copy link
Member

ms705 commented Oct 16, 2017

@ekmartin We had an offline discussion here and concluded that arithmetic on &DataType is the best solution. Can you push that commit into this PR?

@ekmartin
Copy link
Member Author

@ms705: Sounds good. Had to make a new PR, since the last commit wouldn't show up in this one for some reason (still doesn't).

Might be a good idea to push the branch to mit-pdos/distributary as well before merging, so that taster runs the benchmarks?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants