Arithmetic Expressions in Projections #30

ekmartin · 2017-10-01T20:03:38Z

This adds support for e.g. SELECT 2 * price FROM ... and SELECT price * MAX(column) FROM ... by handling nom-sql's ArithmeticExpression.

Arithmetic between different numeric data types is converted implicitly:

BigInt + Int = BigInt
Real + Int = Real
Not supported: BigInt + Real (since Real's integer part is represented using an i32)
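
To make the conversion rules above concrete, here is a minimal sketch. `Num` and `add_numeric` are hypothetical names used only for illustration, `Real` is simplified to an `f64` rather than the integer/fraction pair, and returning `None` for the unsupported combination is an assumption, not necessarily what the PR does:

```rust
/// Hypothetical stand-in for the numeric variants of DataType.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Num {
    Int(i32),
    BigInt(i64),
    /// Simplified to an f64 for this sketch.
    Real(f64),
}

/// Adds two values, widening the narrower operand where the rules allow it.
fn add_numeric(left: Num, right: Num) -> Option<Num> {
    match (left, right) {
        // Same type on both sides: no conversion needed.
        (Num::Int(a), Num::Int(b)) => Some(Num::Int(a + b)),
        (Num::BigInt(a), Num::BigInt(b)) => Some(Num::BigInt(a + b)),
        (Num::Real(a), Num::Real(b)) => Some(Num::Real(a + b)),
        // BigInt + Int = BigInt
        (Num::BigInt(a), Num::Int(b)) | (Num::Int(b), Num::BigInt(a)) => {
            Some(Num::BigInt(a + i64::from(b)))
        }
        // Real + Int = Real
        (Num::Real(a), Num::Int(b)) | (Num::Int(b), Num::Real(a)) => {
            Some(Num::Real(a + f64::from(b)))
        }
        // BigInt + Real is not supported.
        (Num::BigInt(_), Num::Real(_)) | (Num::Real(_), Num::BigInt(_)) => None,
    }
}
```

With these definitions, `add_numeric(Num::BigInt(5), Num::Int(2))` yields a `BigInt`, while mixing `BigInt` and `Real` yields `None`.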

Similar to how nom-sql handles arithmetic, this currently only deals with simple, non-nested expressions, but it should definitely be possible to add support for nested expressions in the future.

I'm not completely sure if the way I've handled ArithmeticColumns in the query graph is correct (especially when it comes to re-use), but it seems to work for the tests I've written so far.
The output formatting of arithmetic nodes could probably also use some tuning.

ms705 · 2017-10-01T20:32:30Z

Very nice! I'll have a look over this.

ekmartin · 2017-10-01T21:09:58Z

tests/lib.rs

+    let mut g = distributary::Blender::new();
+    let sql = "
+        CREATE TABLE Bread (id int, price int, PRIMARY KEY(id));
+        Price: SELECT 2 * COUNT(price) FROM Bread WHERE id = ?;


I wanted this query to be SELECT 2 * MAX(price) FROM Bread but I couldn't figure out how to retrieve the result without having a key connected to the query. I saw some of the other tests/benchmarks used a bogus key of 0, but getter.lookup(&0.into(), true) didn't seem to do the trick either. What am I missing?

I got it working by using the result from 2 * MAX(price) as the key for now, but would definitely prefer a cleaner way of doing it.

Commit: 688e85f

jonhoo · 2017-10-01T23:11:46Z

src/ops/project.rs

+        let left = match expression.left {
+            ProjectExpressionBase::Column(i) => &record[i],
+            ProjectExpressionBase::Literal(ref data) => data,
+        }.clone();


I don't think the .clone() here (or below) should be necessary?

Is there any way to do this without overloading the arithmetic operators on &DataType instead of DataType? Pushed a commit here for how that would look.

Tried playing around with ownership stuff, but couldn't figure out how to do it without having to .clone() the ProjectExpression passed into eval_expression and the DataType at record[i].

I think doing arithmetic on &DataType is fine. That said, the problem here is less about cloning the record and more about the fact that the whole ProjectExpression is being cloned (which is larger and more complex). Strictly speaking, only record[i] needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.

Strictly speaking, only record[i] needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.

I might be misunderstanding, but I think that's what I'm doing? left and right are DataType objects here, and the operators are overloaded directly on DataType.

Oh, yes, you're right, I misread.
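
For readers following the clone discussion, a rough sketch of what overloading an operator on references could look like. `Value`, `Base`, `Expression` and `eval` are hypothetical, simplified stand-ins for DataType, ProjectExpressionBase, ProjectExpression and eval_expression, and only multiplication is shown:

```rust
use std::ops::Mul;

/// Hypothetical, simplified stand-in for DataType.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Text(String),
}

// Implementing the operator on references means neither operand is consumed,
// so the caller does not need to clone the record entry or the literal.
impl<'a> Mul<&'a Value> for &'a Value {
    type Output = Value;

    fn mul(self, rhs: &'a Value) -> Value {
        match (self, rhs) {
            (Value::Int(a), Value::Int(b)) => Value::Int(a * b),
            _ => panic!("arithmetic on non-numeric values"),
        }
    }
}

/// Either a column index into the record or a literal value.
enum Base {
    Column(usize),
    Literal(Value),
}

struct Expression {
    left: Base,
    right: Base,
}

/// Evaluates `left * right` by borrowing both operands.
fn eval(expression: &Expression, record: &[Value]) -> Value {
    let left = match expression.left {
        Base::Column(i) => &record[i],
        Base::Literal(ref data) => data,
    };
    let right = match expression.right {
        Base::Column(i) => &record[i],
        Base::Literal(ref data) => data,
    };
    left * right
}
```

The only allocation left is the result itself; the record and the expression are both borrowed for the duration of the call.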

jonhoo · 2017-10-01T23:12:49Z

src/ops/project.rs

+                        for i in a {
+                            new_r.push(i.clone());
+                        }
+                    }


@malte this is a similar case to aggregations where columns appear in a pre-defined location (i.e., first all projected columns, then all expressions, then all literals). I know that affected MIR for aggregations during the deadline, so maybe be on the lookout for that here too?
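
A simplified illustration of the group-wise output order described here. The `project` function is hypothetical, rows are plain `i64` values, and each arithmetic expression is reduced to a (column index, multiplier) pair purely to keep the sketch short:

```rust
/// Builds an output row in fixed groups: projected columns first,
/// then arithmetic results, then literals.
fn project(
    record: &[i64],
    emit: &[usize],
    arithmetic: &[(usize, i64)],
    literals: &[i64],
) -> Vec<i64> {
    let mut out = Vec::with_capacity(emit.len() + arithmetic.len() + literals.len());
    // Group 1: plain projected columns, in emit order.
    out.extend(emit.iter().map(|&i| record[i]));
    // Group 2: results of the arithmetic expressions.
    out.extend(arithmetic.iter().map(|&(i, k)| record[i] * k));
    // Group 3: literal values.
    out.extend_from_slice(literals);
    out
}
```

Because the groups are fixed, a query that lists an expression before a plain column would need a later step to permute the columns back into the order it asked for.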

ekmartin · 2017-10-02T12:59:36Z

src/sql/mod.rs

@@ -720,7 +720,7 @@ mod tests {
            // Should have two nodes: source and "users" base table
            let ncount = mig.graph().node_count();
            assert_eq!(ncount, 2);
-            assert_eq!(get_node(&inc, &*mig, "users").name(), "users");


Does the &* make a difference in these tests? They seem to be passing fine without it as well, so I removed it - let me know if it should be there though and I'll revert the commit.

I think this is probably a leftover from when migrations worked somewhat differently, and can indeed be removed.

ekmartin · 2017-10-02T13:09:51Z

@ms705: I split out the changes from mir/mod.rs into the different files you created in 8590bff.

ms705 · 2017-10-02T23:01:33Z

Thanks, I'll review this tonight or tomorrow at the latest!

ms705

Great work! I've left some comments inline, although I think almost all of them are probably best addressed in separate PRs rather than by changing this one.

ms705 · 2017-10-04T14:14:23Z

src/mir/node.rs

            } => match *other {
                MirNodeType::Project {
                    ref emit,
                    ref literals,
-                } => our_emit == emit && our_literals == literals,
+                    ref arithmetic,
+                } => our_emit == emit && our_literals == literals && our_arithmetic == arithmetic,


This is quite conservative (though probably fine for v0 of arithmetic expression support): it will only allow reuse of projections that have exactly identical arithmetic expressions. For example, with multiple elements in the arithmetic vector, it would only reuse the projection if the vectors have the same elements and they're all identical. However, it is also acceptable to reuse the projection if the new our_arithmetic is a strict subset of arithmetic.

Now, you might observe that we are similarly unnecessarily strict about emit, and you'd be right ;-)
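
As a rough sketch of the more permissive check (not what this PR implements): `can_reuse` is a hypothetical helper, and it ignores the remapping of output positions that a real reuse would also need. It accepts any subset, including an identical set:

```rust
/// Returns true if every expression we need is already computed
/// by the existing projection.
fn can_reuse<T: PartialEq>(ours: &[T], existing: &[T]) -> bool {
    ours.iter().all(|expression| existing.contains(expression))
}
```

Expressions are compared with `PartialEq`, so this is linear in both vectors, which should be fine for the handful of expressions a projection typically carries.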

ms705 · 2017-10-04T14:15:15Z

src/mir/node.rs

                emit.iter()
                    .map(|c| c.name.as_str())
                    .collect::<Vec<_>>()
                    .join(", "),
+                if arithmetic.len() > 0 {


nit: is_empty()

ms705 · 2017-10-04T14:23:21Z

src/ops/project.rs

+            right: right,
+        }
+    }
+}


Hmm, I suspect we only need these (as opposed to using nom-sql's almost-identical ArithmeticExpression) because of the Literal(DataType) in the base case, which in nom-sql is Scalar(Literal). That's a bit suboptimal -- could we use the nom-sql structure here since we implement Into<DataType> for nom_sql::Literal? (I guess we'd potentially be doing the conversion every time we access the literal if we're not careful, but maybe that can be solved without duplicating the data structure.)

On further thought, I guess another difference is that the Column variant of the ProjectExpressionBase enum holds a numeric column index, not a Column object. Hrm, maybe we need the duplicate structures after all.

ms705 · 2017-10-04T14:34:40Z

src/ops/project.rs

+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        match *self {
+            ProjectExpressionBase::Column(u) => write!(f, "{}", u),
+            ProjectExpressionBase::Literal(ref l) => write!(f, "(lit: {})", l),


Should this just be the value? If I parse this right, we would currently end up with funny-looking output like salary * (lit: 100) rather than salary * 100.

I added the lit: since it outputs the emit index, and not the actual column name. I thought 0 * 100 might look slightly confusing.

https://github.com/ekmartin/distributary/blob/436333cce567b3c50e253fa44ba9f621c23a576a/src/ops/project.rs#L300

Ah, fair point. I keep forgetting that the low-level operators only know about column indices...
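
To illustrate the output being discussed, a self-contained sketch with a hypothetical `Base` enum standing in for ProjectExpressionBase (the literal is simplified to an `i64`):

```rust
use std::fmt;

/// Hypothetical stand-in for ProjectExpressionBase.
enum Base {
    /// Index of a column in the parent's output, not a column name.
    Column(usize),
    Literal(i64),
}

impl fmt::Display for Base {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match *self {
            // Columns print as a bare index.
            Base::Column(u) => write!(f, "{}", u),
            // Literals are tagged so they cannot be mistaken for an index.
            Base::Literal(ref l) => write!(f, "(lit: {})", l),
        }
    }
}
```

Formatting column 0 multiplied by the literal 100 as `"{} * {}"` gives `0 * (lit: 100)`, which is less ambiguous than `0 * 100`.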

ms705 · 2017-10-04T14:47:22Z

src/ops/project.rs

+        let left = match expression.left {
+            ProjectExpressionBase::Column(i) => &record[i],
+            ProjectExpressionBase::Literal(ref data) => data,
+        }.clone();


I think doing arithmetic on &DataType is fine. That said, the problem here is less about cloning the record and more about the fact that the whole ProjectExpression is being cloned (which is larger and more complex). Strictly speaking, only record[i] needs to be cloned, correct? Then you could use the arithmetic operators directly on DataType and Literal, without having to overload them for ProjectExpression.

ms705 · 2017-10-04T14:52:16Z

src/sql/mod.rs

@@ -720,7 +720,7 @@ mod tests {
            // Should have two nodes: source and "users" base table
            let ncount = mig.graph().node_count();
            assert_eq!(ncount, 2);
-            assert_eq!(get_node(&inc, &*mig, "users").name(), "users");


I think this is probably a leftover from when migrations worked somewhat differently, and can indeed be removed.

ms705 · 2017-10-04T14:55:28Z

src/sql/query_graph.rs

+            FieldExpression::Arithmetic(ref a) => {
+                if let ArithmeticBase::Column(ref c) = a.left {
+                    add_computed_column(&mut qg, c);
+                }


The intention (and if I read this right, the implementation) here is that computed columns (e.g., SUM(col)) can occur within arithmetic expressions and that we still need to generate the right query graphs for these, correct?

(As opposed to each column defined via an arithmetic expression being considered "computed", which it wouldn't really be from the query graph perspective.)

Exactly - before I added this it wasn't possible to refer to computed columns inside arithmetic expressions.

ms705 · 2017-10-04T14:58:19Z

src/sql/query_graph.rs

                }
+
+                qg.columns.push(OutputColumn::Arithmetic(ArithmeticColumn {
+                    name: String::from("arithmetic"),


How does this deal with aliases for arithmetic columns? Seems like it might hardcode the column name always, which would be problematic if we had more than one arithmetic column.

Hm, that's a good point. I implemented this by looking at how it was being done for literals, which also seems to use the same name for the computed columns:
https://github.com/ekmartin/distributary/blob/85044bf28c9bb83bbe2b907bfdfaf7cf0b6d0c84/src/sql/query_graph.rs#L696-L702

Both literals and arithmetic expression names are handled here though:
https://github.com/ekmartin/distributary/blob/85044bf28c9bb83bbe2b907bfdfaf7cf0b6d0c84/src/sql/mir.rs#L737-L763

Where column objects are created with only the name and the table. It doesn't seem to look at aliases at all for literals though, which I guess might mean that aliases don't work for literals either?

I added a test for multiple arithmetic expressions in one query here - I'll see how it behaves with aliases as well.

Yeah, you're right that aliases for literals probably don't work.

Maybe it's best to add alias support to both ArithmeticColumn and LiteralColumn for now (either within this PR or separately, I don't mind).
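
One possible shape for this, purely as a sketch: `NamedOutput`, its `alias` field and the positional fallback name are all assumptions made for illustration, not part of this PR or of nom-sql:

```rust
/// Hypothetical output column that may carry a user-supplied alias.
struct NamedOutput {
    alias: Option<String>,
}

impl NamedOutput {
    /// Uses the alias if one was given; otherwise derives a name from the
    /// column's position so that two unnamed columns never collide.
    fn name(&self, kind: &str, position: usize) -> String {
        match self.alias {
            Some(ref alias) => alias.clone(),
            None => format!("{}{}", kind, position),
        }
    }
}
```

The positional fallback is just one way to avoid every arithmetic column ending up with the same hardcoded name.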

ms705 · 2017-10-04T15:06:19Z

A more general thought on this (certainly outside the scope of this PR): I wonder if we need to move to a model where the Column object carries information about e.g., arithmetic on projections, or indeed about the column being a literal value. There are two reasons for this:

1. We already associate information about functions (e.g., MAX, SUM etc.) with the Column object, and it seems inconsistent to treat other similar notions differently.
2. By carrying information about arithmetic expressions and literals outside the Column object, our interfaces keep gaining additional vectors of "special" values (most notably the MIR Project node and the Project operator), and we are forced into a static column emit order -- namely ordinary columns, literal columns, and arithmetic columns in separate groups. This is a little unnecessary, and requires extra projections to permute the columns back into the order the query expects them in; carrying information about arithmetic definitions and literals inside Column would permit us to order them any way we like.

The downsides of this plan are the increase in complexity of code handling Column objects (and we have plenty of that!) and having to be careful to rewrite columns with arithmetic or literal definitions to "ordinary" columns after their first occurrence (which we're not even doing consistently for function columns currently, but should).

ekmartin · 2017-10-04T15:31:03Z

tests/lib.rs

+    // Retrieve the result of the count query:
+    let result = getter.lookup(&id, true).unwrap();
+    assert_eq!(result.len(), 1);
+    assert_eq!(result[0][1], 246.into());


I think I might have gotten something wrong with the output order here. For other queries it looks like the query key is outputted as the last element in the result - whereas with arithmetic expressions it seems to be placed first, which I think might be wrong?

Never mind - this seems to match the output order for literals (and I think it makes sense since the columns are just outputted in groups like mentioned in #30 (comment))

ekmartin · 2017-10-11T00:29:28Z

@ms705: Sorry about the late answer. Storing the information on Column seems more reasonable indeed. That would also make it easier to add support for aliases in arithmetic columns in nom-sql, as it wouldn't need to be tacked onto e.g. both Literal and ArithmeticExpression.

Maybe it would be possible to move the translation from columns to column IDs from to_flow#make_project_node to the Project operator itself? That way the full list of columns in the right order could be attached to the operator, and something like parent.column_id_for_column() could be done from there.

ms705 · 2017-10-16T16:00:11Z

Are we good to merge this? @jonhoo @ekmartin

jonhoo · 2017-10-16T16:14:00Z

I still think it's sad that we clone the DataType, but I guess it's fine since it'll always be numbers. So yeah, I'm fine with merging.

ms705 · 2017-10-16T18:38:42Z

@jonhoo Agreed; do you see a way to do this without a clone? As far as I can see, arithmetic on &DataType is the only option, right?

ms705 · 2017-10-16T19:11:58Z

@ekmartin We had an offline discussion here and concluded that arithmetic on &DataType is the best solution. Can you push that commit into this PR?

ekmartin · 2017-10-17T15:07:50Z

@ms705: Sounds good. Had to make a new PR, since the last commit wouldn't show up in this one for some reason (still doesn't).

Might be a good idea to push the branch to mit-pdos/distributary as well before merging, so that taster runs the benchmarks?

ekmartin added 9 commits September 29, 2017 18:15

Initial arithmetic support in Project

c1bd87d

Use a macro for arithmetic DataType operations

a2f7140

Add tests for literal arithmetic

a9eeb92

Handle ArithmeticExpressions from nom-sql

af2832f

Add a FLOAT_PRECISION constant

ddb7043

Handle arithmetic and literal names together

8978a11

Add computed columns if they're used in arithmetic expressions

72f6dfc

Implement formatting for arithmetic expressions

059ffaf

Only re-use arithmetic expressions without functions

811ca80

ms705 self-requested a review October 1, 2017 20:32

ekmartin commented Oct 1, 2017

Use MAX in arithmetic test

688e85f

jonhoo reviewed Oct 1, 2017

ekmartin added 3 commits October 2, 2017 14:44

Merge branch 'master' into arithmetic

a0df5f5

Use g.migrate instead of start_migration

a492cb0

&*mig -> mig

292566e

ekmartin commented Oct 2, 2017

ekmartin added 2 commits October 2, 2017 22:56

Merge branch 'master' into arithmetic

475cb89

rustfmt

70340d3

ekmartin closed this Oct 3, 2017

ekmartin deleted the arithmetic branch October 3, 2017 00:50

ekmartin restored the arithmetic branch October 3, 2017 00:50

ekmartin reopened this Oct 3, 2017

ekmartin added 2 commits October 3, 2017 03:09

Merge branch 'master' into arithmetic

956c328

thread::sleep(...) -> sleep()

436333c

ms705 approved these changes Oct 4, 2017

ekmartin added 2 commits October 4, 2017 17:16

len() > 0 -> is_empty()

85044bf

Add a test for multiple arithmetic expressions

28a9dad

ekmartin commented Oct 4, 2017

ekmartin closed this Oct 17, 2017

ekmartin mentioned this pull request Oct 17, 2017

Arithmetic Expressions in Projections #35

Merged
