Authors: Hannes Voigt <hannes.voigt@neo4j.com>
This CIP defines the semantics of grouping and aggregation functionality in the WITH and the RETURN clause of Cypher.
In this CIP we assume two baseline semantics as given and well-defined. Each baseline semantics captures on capability of the WITH and the RETURN clause in isolation — similar to relational algebra operators.
- Baseline projection
-
Describes the projection of binding table
- Baseline aggregation
-
Describes grouping and aggregation of binding table
The WITH and the RETURN clause also allow ordering and truncating the driving table, but we ignore these aspects in the background considerations.
We define a Simple rewrite semantics for grouping and aggregation. As its name suggests, the simple rewrite semantics is defined as a syntax transformation to a linear composition of the baseline semantics. The simple rewrite semantics lays the foundation for the rewrite semantics that is proposed in the Proposal section of this CIP.
We briefly demonstrate the Limits of the simple rewrite semantics to motivate the proposed semantics.
Further, we use the following driving table in most of the CIP’s examples:
a | b | c |
---|---|---|
1 |
2 |
3 |
1 |
3 |
4 |
2 |
3 |
5 |
The WITH and the RETURN clause allow projecting the driving table including the computation of new columns (in database theory, this is called extended projection).
The projection of a driving table D can be described formally as πP(D) where
-
P is a nonempty set of pairs (ex, al) where
-
ex is a tree of operations of the expression sub-language where each of the leaves is either
-
a constant (such as a value literal or a label),
-
a parameter, or
-
a variable in the driving table D; and
-
-
al is a variable name — called an alias — that is different from all other aliases x with (·, x) ∈ P.
-
Assuming P = {(ex1, al1), (ex2, al2), …, (exN, alN)}, the Cypher equivalent of πP(D) is
WITH ex1 AS al1, ex2 AS al2, ..., exN AS alN
and
RETURN ex1 AS al1, ex2 AS al2, ..., exN AS alN
Cypher allows grouping and aggregating (often simply referred to as aggregation) the driving table in the WITH and in the RETURN clause.
Formally, the aggregation of a driving table D can be described as γK, A(D) where
-
K is a (possibly empty) set of pairs (k, al) where
-
k is a variable — called a grouping key — in the driving table D,
-
al is an alias that is different from all other aliases x with (·, x) ∈ K., and
-
-
A is a non-empty set of triple (agg, x, al) where
-
agg is an aggregation function and
-
x is a variable in the driving table D
-
al is an alias that is different from all other aliases x with (·, ·, x) ∈ A and x with (·, x) ∈ K.
-
Assuming K = {(k1, ka1), (k2, ka2), …, (kN, kaN)} and A = {(agg1, x1, al1), (agg2, x2, al2), …, (aggM, xM, alM)}, the Cypher equivalent of γK, A(D) is
WITH k1 AS ka1, k2 AS ka2, ..., kN AS kaN,
agg1(x1) AS al1, agg2(x2) AS al2, ..., aggM(xM) AS alM
and
RETURN k1 AS ka1, k2 AS ka2, ..., kN AS kaN,
agg1(x1) AS al1, agg2(x2) AS al2, ..., aggM(xM) AS alM
Cypher’s WITH and RETURN clause are syntactically more flexible than the two baseline semantics. In particular, they allow mixing aggregation and projection rather freely.
Specifically, the WITH and the RETURN clause denote the parameters for projection (P) and aggregation (K and A) with a single nonempty duplicate-free list L of <ProjectionItem>s ex AS al
where
-
ex is an expression that is allowed to contain aggregation functions and
-
al is an alias.
The semantics of such RETURN and WITH clauses can be described as a rewrite to the two baseline semantics combined by Cypher’s linear composition.
For this purpose, the <ProjectionItem>s in L can be spilt into aggregates and grouping keys:
-
A <ProjectionItem> p is an aggregate if it is of the form
agg(ex) AS al
, where-
agg is an aggregation function,
-
ex is an expression, and
-
al is an alias; and
-
-
A <ProjectionItem> p is a grouping key if is not an aggregate
For a <ProjectionItem> p,
-
If p is a grouping key of the form
ex AS al
-
Let
PROJ(p)
beex AS al
and -
Let
AGGR(p)
beal AS al
-
-
If p is an aggregate of the form
agg(ex) AS al
-
Let
PROJ(p)
beex AS al
and -
Let
AGGR(p)
beagg(al) AS al
-
With this, RETURN p1, p2, …, pN
can be defined as effectively equivalent to
WITH PROJ(p1), PROJ(p2), ..., PROJ(pN)
RETURN AGGR(p1), AGGR(p2), ..., AGGR(pN)
Analogously, WITH p1, p2, …, pN
can be defined as effectively equivalent to
WITH PROJ(p1), PROJ(p2), ..., PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
We call this the simple rewrite semantics for the WITH and RETURN clause.
With the simple rewrite semantics, the RETURN clause in [Q3]
RETURN b-a AS x, SUM(b*c) AS sumBC
is effectively equivalent to
WITH b-a AS x, b*c AS sumBC
RETURN x AS x, SUM(sumBC) AS sumBC
While the simple rewrite semantics works nicely for the considered examples, it is limited.
Specifically, it only supports aggregation expressions of the form agg(ex)
.
Cypher, however, also allows aggregation functions to appear as sub-expression of <ProjectionItem>s, i.e. Cypher allows <ProjectionItem>s with expressions of forms, such as
-
ex1 + agg(ex2)
, -
agg(ex1) + ex2
, and -
agg1(ex1) + ex2 * agg2(ex3)
Such expressions can still be sensible and useful.
The RETURN clause
should produce a result that is grouped by a
and foo
should be computed for each group as the value a
plus the sum of all products of b
and c
minus the smallest value of c
multiplied by two, so that result is:
a | foo |
---|---|
1 |
32 |
2 |
24 |
Note
|
A less artificial example is calculating the total gross of an order as the discounted sum of the net values –– product price multiplied by amount –– of the order’s line items in a query such as: MATCH
(c:Customer)-[:LOCATED_IN]->(s:State),
(c)-[:ORDERED]->(o:Order),
(o)-[:INCLUDES]->(li:LineItem)-->(p:Product)
RETURN s AS state, c AS customer, o AS order,
SUM(li.amount * p.price) * c.discount * s.vat AS totalGross |
It has been documented on multiple occasions (e.g. cf. First oCIG Meeting) that the existing semantics of Cypher is imprecise on such queries.
A precise semantics on such queries has to provide
-
A clear definition of which <ProjectionItem>s constitute the grouping keys
-
Clear rules of which sub-expressions are allowed in <ProjectionItem>s containing aggregation functions
This proposal provides such a precise semantics.
The proposed grouping and aggregation semantics is defined as a rewrite to the baseline semantics (similar to Simple rewrite semantics discussed above). The proposed semantics does not cover all syntactically possible queries and hence requires a syntax restriction to prohibit queries that are not covered. We discuss the Rewrite and the Syntax restriction in the following two subsections. We simplify this discussion by ignoring row ordering and pagination as well as omitted aliases. Subsequently, we give a separate and brief consideration of how to these aspects fit into the proposal, cf. Row ordering and pagination and Omitted aliases, respectively.
For an expression ex, let AGG(ex) be the set of aggregating (sub-)expressions, i.e. (sub-)expressions of the form agg(preEx).
For a <ProjectionItem> p = postEx AS al
, let AGG(p) be the set of aggregating (sub-)expressions, i.e. AGG(p) = AGG(postEx).
AGG( (a + SUM(b*c) - MIN(c)) * 2 AS foo
)
= AGG( (a + SUM(b*c) - MIN(c)) * 2
)
= { SUM(b*c)
, MIN(c)
}
The <ProjectionItem>s in L are split according to AGG(p) in two cases
-
A <ProjectionItem> p in L is an aggregate if AGG(p) is non-empty
-
A <ProjectionItem> p in L is a grouping key if AGG(p) is empty
For a <ProjectionItem> p = ex AS al
,
-
If AGG(p) = ∅ (i.e. p is a grouping key)
-
Let
PRE_PROJ(p)
beex AS al
, -
Let
AGGR(p)
beal AS al
, and -
Let
POST_PROJ(p)
beal AS al
-
-
If AGG(p) = {
agg1(preEx1)
,agg2(preEx2)
, …,aggN(preExN)
} with N > 0 (i.e. p is an aggregate)-
Let
PRE_PROJ(p)
bepreEx1 AS al_1, preEx2 AS al_2, …, preExN AS al_N
, -
Let
AGGR(p)
beagg1(al_1) AS al_1, agg2(al_2) AS al_2, …, aggN(al_N) AS al_N
, and -
Let
POST_PROJ(p)
bepostEx AS al
where postEx is ex with eachaggi(preExi)
in AGG(p) being replaced byal_i
for 1 ≤ i ≤ N.
-
Important
|
Rewrite semantics
WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
RETURN POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN) Analogously, WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
WITH POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN) |
The RETURN clause in [Q4]
RETURN a AS a, (a + SUM(b*c) - MIN(c)) * 2 AS foo
is effectively equivalent to
WITH a AS a, b*c AS foo_1, c AS foo_2
WITH a AS a, SUM(foo_1) AS foo_1, MIN(foo_2) AS foo_2
RETURN a AS a, (a + foo_1 - foo_2) * 2 AS foo
Note that the grouping and aggregation semantics also provides for the mixing of projection and aggregation that the Simple rewrite semantics covers, i.e. it is a generalization of the simple rewrite semantics.
The RETURN clause in [Q3]
RETURN b-a AS x, SUM(b*c) AS sumBC
is effectively equivalent to
WITH b-a AS x, b*c AS sumBC_1
WITH x AS x, SUM(sumBC_1) AS sumBC_1
RETURN x AS x, sumBC_1 AS sumBC
The rewrite does not cover all syntactically possible queries. Specifically, any <ProjectItem> containing
-
an aggregation function and
-
a sub-expression that is
-
outside any of the contained aggregation functions and
-
not constant under the grouping keys
-
is not rewritten to valid query.
By the grouping and aggregation semantics, the RETURN clause
is effectively equivalent to
WITH a AS a, c AS foo_1
WITH a AS a, SUM(foo_1) AS foo_1
RETURN a AS a, b + foo_1 * 2 AS foo
Note that variable b
appears in the <ProjectionItem> b + foo_1 * 2 AS foo
in the RETURN clause.
However, variable b
has already been removed from the driving table by the previous projections.
In other words, the proposed rewrite produces an invalid query for [Q5].
To prevent such invalid rewrites, this proposal includes a syntax restriction to be imposed on RETURN and WITH clauses. The definition of the syntax restriction happens in three steps:
-
Definition of grouping keys that are recognized as constant sub-expressions
-
Definition of constant sub-expressions
-
Definition of the syntax restriction
Given a list of <ProjectionItem>s L = {p1, p2, …, pN}, let RECOGNIZED_GROUPING_KEYS(L) be the set of all expressions ex
-
For which there is a <ProjectionItem>s p =
ex AS al
in L where AGG(p) is empty and -
Which are either
-
A variable
-
Element property access on a variable
-
Static map access on a variable
-
For the RETURN clause
RECOGNIZED_GROUPING_KEYS( b-a AS x, c AS c, d.prop AS d, c + SUM(b) AS sum
)
= { c
, d.prop
}.
Note that b-a
is grouping key since AGG(b-a
) is empty. However, it is not a grouping key that needs to be recognized as such for propose of identifying constant sub-expressions. Hence, b-a
is not in RECOGNIZED_GROUPING_KEYS( b-a AS x, c AS c, d.prop AS d, c + SUM(b) AS sum
).
For an expression ex and a projection list L, let CONSTANT(ex, L) hold
-
If ex is either
-
A constant,
-
A parameter,
-
An aggregation function, i.e. of the form
agg(subEx)
, or -
A grouping key, i.e. ex ∈ RECOGNIZED_GROUPING_KEYS(L),
-
-
or if ex comprises sub-expressions, it only comprises sub-expressions subEx for which CONSTANT(subEx, L) holds.
For the RETURN clause
CONSTANT( (2*c + SUM(b) + d.prop) / $pi
, L) = {
2
, // a constant
c
, // a grouping key
2*c
, // only comprises constant sub-expressions
SUM(b)
, // an aggregation function
d.prop
, // a grouping key
(2*c + SUM(b) + d.prop)
, // only comprises constant sub-expressions
$pi
, // a parameter
(2*c + SUM(b) + d.prop) / $pi
// only comprises constant sub-expressions
}.
Based on the notion of constant expression, the syntax restriction is defined as:
Important
|
Syntax restriction
For clauses |
Under this restriction, [Q5] is invalid.
For sub-expression b
-
in <ProjectionItem>
b + foo_1 * 2 AS foo
-
in L =
a AS a, b + foo_1 * 2 AS foo
,
CONSTANT(b
, L) does not hold.
b
is neither an aggregation function, a grouping key, a constant, a parameter, nor does it have any sub-expressions.
The WITH and the RETURN clause allow to
-
Order the rows of the result table with the ORDER BY sub-clause and
-
Paginate the result table with the SKIP and LIMIT sub-clauses.
Assuming the baseline semantics includes ORDER BY, SKIP, and LIMIT capabilities, the grouping and aggregation semantics extends as follows:
Important
|
WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
RETURN POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN) ORDER-SKIP-LIMIT Analogously, WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
WITH POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN) ORDER-SKIP-LIMIT |
Every <SortItem> listed in the ORDER BY clause contains an expression. Since these expressions are effectively evaluate after all POST_PROJ(pi) expressions, a similar syntax restrictions applies to the <SortItem>s.
However, <SortItem>s can refer to aliases introduced by POST_PROJ(pi).
Given a list of <ProjectionItem>s L = {p1, p2, …, pN}, let RECOGNIZED_ALIASES(L) be the set of all aliases al for which there is a <ProjectionItem>s p = ex AS al
in L.
For an expression ex and a projection list L, let AVAILABLE_TO_ORDER(ex, L) hold
-
If ex is either
-
A constant,
-
A parameter,
-
A grouping key, i.e. ex ∈ RECOGNIZED_GROUPING_KEYS(L), or
-
An alias, i.e. ex ∈ RECOGNIZED_ALIASES(L),
-
-
or if ex comprises sub-expressions, it only comprises sub-expressions subEx for which AVAILABLE_TO_ORDER(subEx, L) holds,
-
or if ex is aggregation expression in L, i.e. there is a <ProjectionItem>s p =
ex AS al
in L.
Important
|
For |
This proposal considers all <ProjectionItem>s have user-given alias. Cypher allows to omit the aliases, particularly in the RETURN clause, though. However, the alias omission rules are based on the assumption that an implementation will infer a more or less reasonable alias if an alias is omitted by the user. Hence, it is safe for this proposal to assume that all <ProjectionItem>s have an alias.
Cypher support *
as projection wildcard.
The wildcard can appear before the list of <ProjectionItem>s L, i.e. RETURN *
, WITH *
, RETURN *, L
and WITH *, L
valid clauses, cf. <ProjectionItems>.
The semantics of the wildcard is the projection of all columns of the driving table.
This semantics can be defined by a rewrite to the baseline semantics, which removes the wildcard before the actual projection and aggregation semantics, including the semantics discussed in Section "Rewrite". Likewise, the Syntax restriction applies after the wildcard effectively been removed.
Given a driving table D, let {VAR1(D), VAR2(D), …, VARN(D)} be the set of all variables in D.
Important
|
RETURN DISTINCT VAR1(D), VAR2(D), ..., VARN(D) X Analogously, WITH DISTINCT VAR1(D), VAR2(D), ..., VARN(D) X |
Note that X is anything valid syntax in a <Return> or <With> after *
specified by the query, respectively.
DISTINCT is the optional keyword DISTINCT
if specified by the query or otherwise empty.
*
Assume the following driving table:
a | b |
---|---|
1 |
2 |
1 |
2 |
2 |
3 |
RETURN *, b * SUM(a) AS x ORDER BY x
is effectively equivalent to
RETURN a, b, b * SUM(a) AS x ORDER BY x
and, hence, valid under the Syntax restriction. Further, it is effectively equivalent to
WITH a, b, a AS x_1
WITH a, b, SUM(x_1) AS x_1
RETURN a, b, b * x_1 AS x ORDER BY x
by the Rewrite semantics, so that it results in:
a | b | x |
---|---|---|
1 |
2 |
4 |
2 |
3 |
6 |
The following clauses exhibit valid aggregations under the grouping and aggregation semantics and the syntax restriction it includes. For each example we list why it is valid.
-
RETURN 1 + count(*)
-
The sub-expression
1
is a constant.
-
-
RETURN 1, 1 + count(*)
-
The sub-expression
1
is a constant.
-
-
RETURN $x + count($x)
-
The sub-expression
$x
is a parameter.
-
-
RETURN count($x) + $x
-
The sub-expression
$x
is a parameter.
-
-
RETURN 1 + count($x) + $x * 2 + sum($x) + 'cake'
-
The sub-expressions
1
,2
, and'cake'
are constants. -
The sub-expression
$x
is a parameter.
-
-
MATCH (a) RETURN a.x, 1 + count(a.x)
-
The sub-expression
1
is a constant.
-
-
MATCH (a) RETURN a.x, a.x + count(a.x)
-
The sub-expression
a.x
is a recognized grouping key.
-
-
MATCH (a) WITH a.x AS ax RETURN ax, ax + count(ax)
-
The sub-expression
ax
is a recognized grouping key.
-
-
MATCH (a)-[]→(b) RETURN a, a.x + count(b.y)
-
The sub-expression
a
is a recognized grouping key.
-
-
MATCH (a)-[]→(b) RETURN a, size(keys(a)) + count(b.y)
-
The sub-expression
a
is a recognized grouping key.
-
-
MATCH (x) RETURN x.a, x.b, x.c, x.a + x.b + count(x) + x.c
-
The sub-expressions
x.a
,x.b
, andx.c
are recognized grouping keys.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax)
-
The sub-expression
ax
is a recognized grouping key. -
The sub-expression
1
is a constant.
-
-
WITH {a:1, b:2} AS map RETURN map.a, map.a + count(map.b)
-
The sub-expression
map.a
is a recognized grouping key.
-
-
MATCH (x) WITH x.a + x.b + x.c AS sum RETURN sum, sum + count(*) + sum
-
The sub-expression
sum
is a recognized grouping key.
-
-
MATCH (x)-[]→(y) WITH x.a AS a, collect(y) AS b RETURN a, b, a.x[2] + sum(a.c) + -(b[3].x)*3
-
The sub-expressions
a
andb
are a recognized grouping key. -
The sub-expressions
2
and3
are a constant.
-
-
MATCH (a)-[b]→(c) WITH x.a AS a, collect(y) AS b RETURN a, b.y, a.x + sum(c.z) + b.y*3
-
The sub-expressions
a
andb.y
are a recognized grouping key. -
The sub-expression
3
is a constant.
-
The following clauses exhibit invalid aggregations under the grouping and aggregation semantics and the syntax restriction it includes. For each example we list why it is invalid.
-
MATCH (a) RETURN a.x + count(*)
-
The sub-expression
a.x
is not a grouping key.
-
-
MATCH (a) RETURN a.x + a.y + count(*) + a.z
-
The sub-expressions
a.x + a.y
anda.z
are not grouping keys.
-
-
MATCH (a) WITH a.x AS ax, a.y AS ay RETURN ax, count(ax) + ay
-
The sub-expression
ay
is not a grouping key.
-
-
MATCH path=(a)-[*]-() RETURN length(path) + count(a)
-
The sub-expression
length(path)
is not a grouping key.
-
-
WITH {a:1, b:2} AS map RETURN map.a, map.b + count(map.b)
-
The sub-expression
map.b
is not a grouping key.
-
-
MATCH (a) RETURN a.x + a.y, a.x + collect(a.x)
-
The sub-expression
a.x
is not a grouping key.
-
-
MATCH (a) RETURN a.x * a.x, a.x + collect(a.x)
-
The sub-expression
a.x
is not a grouping key.
-
-
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x)
-
The sub-expression
a.x + 1
is not a recognized grouping key.
-
-
MATCH (x) RETURN x.a + x.b + x.c, x.a + x.b + x.c + count(x)
-
The sub-expression
x.a + x.b + x.c
is not a recognized grouping key.
-
-
MATCH (a)-[]→(b) RETURN a AS x, x.x + count(b.y)
-
The sub-expression
x
is not a grouping key; it is the alias of a grouping key, which are not visible to <ProjectionItem>s within the same clause.
-
-
MATCH (a)-[]→(b) RETURN a AS x, size(keys(x)) + count(b.y)
-
The sub-expression
x
is not a grouping key; it is the alias of a grouping key, which are not visible to <ProjectionItem>s within the same clause.
-
-
MATCH (a)-[c]→(b) WITH a, c, {dim: properties(b)} AS b RETURN a, b.dim.x, sum(c.p) + b.dim.x
-
The sub-expression
b.dim.x
is not a recognized grouping key.
-
The following clauses exhibit valid row orderings. For each example we list why it is valid.
-
RETURN 1 + count(*) AS x ORDER BY x
-
The expression
x
is a recognized alias.
-
-
RETURN 1, 1 + count(*) ORDER BY 2
-
The expression
1
is a constant.
-
-
RETURN $x + count($x) ORDER BY $x
-
The expression
$x
is a parameter.
-
-
RETURN 1 + count(*) ORDER BY 1 + count(*)
-
The expression
1
is a constant. -
The expression
count(*)
is an aggregation function.
-
-
MATCH (a) RETURN a.x, 1 + count(a.x) ORDER BY a.x % 2
-
The sub-expression
a.x
is a recognized grouping key. -
The sub-expression
2
is a constant.
-
-
MATCH (a) WITH a.x AS ax RETURN ax, ax + count(ax) ORDER BY ax
-
The expression
ax
is a recognized grouping key.
-
-
MATCH (a)-[]→(b) RETURN a, a.x + count(b.y) ORDER BY a.y
-
The sub-expression
a
is a recognized grouping key.
-
-
MATCH (a)-[]→(b) RETURN a, count(b.y) ORDER BY size(keys(a))
-
The sub-expression
a
is a recognized grouping key.
-
-
MATCH (x) RETURN x.a, x.b, x.c, x.a + x.b + count(x) + x.c ORDER BY x.a + x.c
-
The sub-expressions
x.a
andx.c
are recognized grouping keys.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax) ORDER BY ax - 1
-
The sub-expression
ax
is a recognized grouping key. -
The sub-expression
1
is a constant.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax) ORDER BY ax + 2
-
The sub-expression
ax
is a recognized grouping key. -
The sub-expression
2
is a constant.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax) ORDER BY ax + 2 - count(ax)
-
The sub-expression
ax
is a recognized grouping key. -
The sub-expression
2
is a constant. -
The sub-expression
count(ax)
is an aggregation function.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax AS x, ax - 1 + count(ax) ORDER BY x + 2
-
The sub-expression
x
is an recognized alias. -
The sub-expression
2
is a constant.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax AS x, ax - 1 + count(ax) ORDER BY x + 2 - count(ax)
-
The sub-expression
x
is an recognized alias. -
The sub-expression
2
is a constant. -
The sub-expression
count(ax)
is an aggregation function.
-
-
MATCH (a) WITH a.x + 1 as ax RETURN ax AS x, ax - 1 + count(ax) AS y ORDER BY x + 2 - y
-
The sub-expressions
x
andy
are recognized aliases. -
The sub-expression
2
is a constant.
-
-
WITH {a:1, b:2} AS map RETURN map.a, map.a + count(map.b) ORDER BY map.a
-
The expression
map.a
is a recognized grouping key.
-
The following clauses exhibit invalid row orderings. For each example we list why it is invalid.
-
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x) ORDER BY a.x + 1
-
The sub-expression
a.x + 1
is not a recognized grouping key.
-
-
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x) ORDER BY a.x + 1 + count(a.x)
-
The sub-expression
a.x + 1
is not a recognized grouping key.
-
-
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x) ORDER BY a.x + 2
-
The expression
a.x + 2
is not a grouping key. -
The sub-expression
2
is a constant, but sub-expressiona.x
is not a grouping key.
-
-
WITH {a:1, b:2} AS map RETURN map.a, map.a + count(map.b) ORDER BY map.b
-
The sub-expression
map.b
is not a grouping key.
-
-
MATCH (x) RETURN x.a + x.b + x.c, x.a + x.b + x.c + count(x) ORDER BY x.a + x.c
-
The expression
x.a + x.c
is not a grouping key. -
The sub-expressions
x.a
andx.c
andx
are not grouping keys, either.
-
-
MATCH (x) RETURN x.a + x.b + x.c, x.a + x.b + x.c + count(x) ORDER BY x.a + x.b + x.c
-
The expression
x.a + x.b + x.c
is not a recognized grouping key.
-
All other major query languages explicitly delineate grouping key expressions.
For instance, SQL does so by requiring users to list all grouping key expressions in the GROUP BY clause. If the GROUP BY clause is present in a query, the projection in the SELECT clause have to fulfill a similar syntax restriction as defined by this CIP. The SQL-equivalent of [Q5]
SELECT a AS a, b + SUM(c) * 2 AS foo
FROM A
GROUP BY a
is invalid in SQL as well. For instance, PostgreSQL v13 rejects this query with
error: column "a.b" must appear in the GROUP BY clause or be used in an aggregate function
The main advantage of this proposal is, that is clarifies the semantics of grouping and aggregation in the WITH and the RETURN clause and removes imprecision of the previously existing semantics (cf. First oCIG Meeting).
From a pure logical standpoint, the syntax restriction only has to rule out sub-expressions of aggregating projection items, which are not constant under the grouping keys. However, statically inferring all possible constant sub-expressions is not necessarily easy. To this effect, the proposed rules of the syntax restriction are a heuristic, which safely identifies sub-expression that are constant under the grouping keys, but can not identify all such expression theoretically possible.
The RETURN clause
is ruled out by the syntax restriction, although sub-expression (b - b)
is effectively constant.
It is imaginable that a semantic analyser may figure that (b - b)
can be simplified to 0
if b
is know to be numeric, so that the clause effectively is equivalent to
RETURN a AS a, SUM(c) AS foo
which is perfectly valid.
The proposal tries to strike a balance between allowing good number of useful queries while keeping the rules of the syntax restrict reasonable simple.
Also note: For queries that are logically possible but rejected by the syntax restriction, users can always manually rewrite the query with additional explicit projections to make the query syntactically valid while it still produces the desired result.
The openCypher grammar does not encode left- or right-deep precedence for chainable operations, cf.
The rules of this proposal just refer to "contained sub-expressions". Currently, openCypher lacks a clear reference point of what this precisely means.
Most parser technologies result in left- or right-deep parse trees.
For instance an expression a+b+c
is typically parsed as (a+b)c` or `a(b+c)
.
Typically, an implementation will decide containment according to its parse tree.
Hence, one implementation may find a+b
be contained in a+b+c
while it finds b+c
not to be contained in a+b+c
.
Another implementation may reach the opposite conclusion w.r.t. containment.
SQL’s and GQL’s definition of containment does not define containment within repetition, too.
However, their grammar does not encode chainable operations grammatically with repetition.
Rather, SQL and GQL use head-recursive grammar productions, which result in a left-deep containment, i.e. a+b+c
is considered as (a+b)+c
in these standards.