CIP2021-07-07 Grouping key and aggregation expressions

Authors: Hannes Voigt <hannes.voigt@neo4j.com>

Abstract

This CIP defines the semantics of grouping and aggregation functionality in the WITH and the RETURN clause of Cypher.

Table of Contents

1. Background
2. Proposal
3. What others do
4. Benefits to this proposal
5. Caveats to this proposal
- 5.1. Rules are heuristic
- 5.2. Containment is imprecise

1. Background

In this CIP we assume two baseline semantics as given and well-defined. Each baseline semantics captures on capability of the WITH and the RETURN clause in isolation — similar to relational algebra operators.

Baseline projection: Describes the projection of binding table
Baseline aggregation: Describes grouping and aggregation of binding table

The WITH and the RETURN clause also allow ordering and truncating the driving table, but we ignore these aspects in the background considerations.

We define a Simple rewrite semantics for grouping and aggregation. As its name suggests, the simple rewrite semantics is defined as a syntax transformation to a linear composition of the baseline semantics. The simple rewrite semantics lays the foundation for the rewrite semantics that is proposed in the Proposal section of this CIP.

We briefly demonstrate the Limits of the simple rewrite semantics to motivate the proposed semantics.

Further, we use the following driving table in most of the CIP’s examples:

Table 1. Example driving table

a	b	c
1	2	3
1	3	4
2	3	5

1.1. Baseline projection

The WITH and the RETURN clause allow projecting the driving table including the computation of new columns (in database theory, this is called extended projection).

The projection of a driving table D can be described formally as π_P(D) where

P is a nonempty set of pairs (ex, al) where
- ex is a tree of operations of the expression sub-language where each of the leaves is either
  - a constant (such as a value literal or a label),
  - a parameter, or
  - a variable in the driving table D; and
- al is a variable name — called an alias — that is different from all other aliases x with (·, x) ∈ P.

Assuming P = {(ex₁, al₁), (ex₂, al₂), …, (ex_N, al_N)}, the Cypher equivalent of π_P(D) is

WITH ex1 AS al1, ex2 AS al2, ..., exN AS alN

and

RETURN ex1 AS al1, ex2 AS al2, ..., exN AS alN

Example 1. Baseline projection

The RETURN clause

Q1

RETURN b-a AS x, b*c AS y

projects each row in the driving table to b minus a (as x) and the product of b and c (as y), so that the result is:

Table 2. Result Q1

x	y
1	6
2	12
1	15

1.2. Baseline aggregation

Cypher allows grouping and aggregating (often simply referred to as aggregation) the driving table in the WITH and in the RETURN clause.

Formally, the aggregation of a driving table D can be described as γ_{K, A}(D) where

K is a (possibly empty) set of pairs (k, al) where
- k is a variable — called a grouping key — in the driving table D,
- al is an alias that is different from all other aliases x with (·, x) ∈ K., and
A is a non-empty set of triple (agg, x, al) where
- agg is an aggregation function and
- x is a variable in the driving table D
- al is an alias that is different from all other aliases x with (·, ·, x) ∈ A and x with (·, x) ∈ K.

Assuming K = {(k₁, ka₁), (k₂, ka₂), …, (k_N, ka_N)} and A = {(agg₁, x₁, al₁), (agg₂, x₂, al₂), …, (agg_M, x_M, al_M)}, the Cypher equivalent of γ_{K, A}(D) is

WITH k1 AS ka1, k2 AS ka2, ..., kN AS kaN,
     agg1(x1) AS al1, agg2(x2) AS al2, ..., aggM(xM) AS alM

and

RETURN k1 AS ka1, k2 AS ka2, ..., kN AS kaN,
       agg1(x1) AS al1, agg2(x2) AS al2, ..., aggM(xM) AS alM

Example 2. Baseline aggregation

The RETURN clause

Q2

RETURN a AS a, SUM(c) AS sumC

groups the driving table by a and computes the sum of all c as sumC for each group, so that result is:

Table 3. Result Q2

a	sumC
1	7
2	5

1.3. Simple rewrite semantics

Cypher’s WITH and RETURN clause are syntactically more flexible than the two baseline semantics. In particular, they allow mixing aggregation and projection rather freely.

Specifically, the WITH and the RETURN clause denote the parameters for projection (P) and aggregation (K and A) with a single nonempty duplicate-free list L of <ProjectionItem>s ex AS al where

ex is an expression that is allowed to contain aggregation functions and
al is an alias.

Example 3. Mixing aggregation and projection

The RETURN clause

Q3

RETURN b-a AS x, SUM(b*c) AS sumBC

should produce a result that is grouped by b minus a (as x) and the sum of all products of b and c should be computed as sumBC for each group, so that result is:

Table 4. Result Q3

x	sumBC
1	21
2	12

The semantics of such RETURN and WITH clauses can be described as a rewrite to the two baseline semantics combined by Cypher’s linear composition.

For this purpose, the <ProjectionItem>s in L can be spilt into aggregates and grouping keys:

A <ProjectionItem> p is an aggregate if it is of the form agg(ex) AS al, where
- agg is an aggregation function,
- ex is an expression, and
- al is an alias; and
A <ProjectionItem> p is a grouping key if is not an aggregate

For a <ProjectionItem> p,

If p is a grouping key of the form ex AS al
- Let PROJ(p) be ex AS al and
- Let AGGR(p) be al AS al
If p is an aggregate of the form agg(ex) AS al
- Let PROJ(p) be ex AS al and
- Let AGGR(p) be agg(al) AS al

With this, RETURN p₁, p₂, …, p_N can be defined as effectively equivalent to

WITH PROJ(p1), PROJ(p2), ..., PROJ(pN)
RETURN AGGR(p1), AGGR(p2), ..., AGGR(pN)

Analogously, WITH p₁, p₂, …, p_N can be defined as effectively equivalent to

WITH PROJ(p1), PROJ(p2), ..., PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)

We call this the simple rewrite semantics for the WITH and RETURN clause.

Example 4. Simple rewrite semantics

With the simple rewrite semantics, the RETURN clause in [Q3]

RETURN b-a AS x, SUM(b*c) AS sumBC

is effectively equivalent to

WITH b-a AS x, b*c AS sumBC
RETURN x AS x, SUM(sumBC) AS sumBC

1.4. Limits of the simple rewrite semantics

While the simple rewrite semantics works nicely for the considered examples, it is limited. Specifically, it only supports aggregation expressions of the form agg(ex).

Cypher, however, also allows aggregation functions to appear as sub-expression of <ProjectionItem>s, i.e. Cypher allows <ProjectionItem>s with expressions of forms, such as

ex₁ + agg(ex₂),
agg(ex₁) + ex₂, and
agg₁(ex₁) + ex₂ * agg₂(ex₃)

Such expressions can still be sensible and useful.

Example 5. Aggregation functions as sub-expressions

The RETURN clause

Q4

RETURN a AS a, (a + SUM(b*c) - MIN(c)) * 2 AS foo

should produce a result that is grouped by a and foo should be computed for each group as the value a plus the sum of all products of b and c minus the smallest value of c multiplied by two, so that result is:

Table 5. Result Q4

a	foo
1	32
2	24

Note

A less artificial example is calculating the total gross of an order as the discounted sum of the net values –– product price multiplied by amount –– of the order’s line items in a query such as:

MATCH
(c:Customer)-[:LOCATED_IN]->(s:State),
(c)-[:ORDERED]->(o:Order),
(o)-[:INCLUDES]->(li:LineItem)-->(p:Product)
RETURN s AS state, c AS customer, o AS order,
       SUM(li.amount * p.price) * c.discount * s.vat AS totalGross

It has been documented on multiple occasions (e.g. cf. First oCIG Meeting) that the existing semantics of Cypher is imprecise on such queries.

A precise semantics on such queries has to provide

A clear definition of which <ProjectionItem>s constitute the grouping keys
Clear rules of which sub-expressions are allowed in <ProjectionItem>s containing aggregation functions

This proposal provides such a precise semantics.

2. Proposal

2.1. Syntax

This proposal does not propose any net-new syntax.

2.2. Semantics

The proposed grouping and aggregation semantics is defined as a rewrite to the baseline semantics (similar to Simple rewrite semantics discussed above). The proposed semantics does not cover all syntactically possible queries and hence requires a syntax restriction to prohibit queries that are not covered. We discuss the Rewrite and the Syntax restriction in the following two subsections. We simplify this discussion by ignoring row ordering and pagination as well as omitted aliases. Subsequently, we give a separate and brief consideration of how to these aspects fit into the proposal, cf. Row ordering and pagination and Omitted aliases, respectively.

2.2.1. Rewrite

For an expression ex, let AGG(ex) be the set of aggregating (sub-)expressions, i.e. (sub-)expressions of the form agg(preEx).

For a <ProjectionItem> p = postEx AS al, let AGG(p) be the set of aggregating (sub-)expressions, i.e. AGG(p) = AGG(postEx).

Example 6. Set of aggregating (sub-)expressions

AGG( (a + SUM(b*c) - MIN(c)) * 2 AS foo )
= AGG( (a + SUM(b*c) - MIN(c)) * 2 )
= { SUM(b*c), MIN(c) }

The <ProjectionItem>s in L are split according to AGG(p) in two cases

A <ProjectionItem> p in L is an aggregate if AGG(p) is non-empty
A <ProjectionItem> p in L is a grouping key if AGG(p) is empty

For a <ProjectionItem> p = ex AS al,

If AGG(p) = ∅ (i.e. p is a grouping key)
- Let PRE_PROJ(p) be ex AS al,
- Let AGGR(p) be al AS al, and
- Let POST_PROJ(p) be al AS al
If AGG(p) = {agg₁(preEx₁), agg₂(preEx₂), …, agg_N(preEx_N)} with N > 0 (i.e. p is an aggregate)
- Let PRE_PROJ(p) be preEx₁ AS al_1, preEx₂ AS al_2, …, preEx_N AS al_N,
- Let AGGR(p) be agg₁(al_1) AS al_1, agg₂(al_2) AS al_2, …, agg_N(al_N) AS al_N, and
- Let POST_PROJ(p) be postEx AS al where postEx is ex with each agg_i(preEx_i) in AGG(p) being replaced by al_i for 1 ≤ i ≤ N.

Important

Rewrite semantics

RETURN p₁, p₂, …, p_N is effectively equivalent to

WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
RETURN POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN)

Analogously, WITH p₁, p₂, …, p_N is effectively equivalent to

WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
WITH POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN)

Example 7. Rewrite semantics

The RETURN clause in [Q4]

RETURN a AS a, (a + SUM(b*c) - MIN(c)) * 2 AS foo

is effectively equivalent to

WITH a AS a, b*c AS foo_1, c AS foo_2
WITH a AS a, SUM(foo_1) AS foo_1, MIN(foo_2) AS foo_2
RETURN a AS a, (a + foo_1 - foo_2) * 2 AS foo

Note that the grouping and aggregation semantics also provides for the mixing of projection and aggregation that the Simple rewrite semantics covers, i.e. it is a generalization of the simple rewrite semantics.

Example 8. Rewrite semantics on simple mixing of projection and aggregation

The RETURN clause in [Q3]

RETURN b-a AS x, SUM(b*c) AS sumBC

is effectively equivalent to

WITH b-a AS x, b*c AS sumBC_1
WITH x AS x, SUM(sumBC_1) AS sumBC_1
RETURN x AS x, sumBC_1 AS sumBC

2.2.2. Syntax restriction

The rewrite does not cover all syntactically possible queries. Specifically, any <ProjectItem> containing

an aggregation function and
a sub-expression that is
- outside any of the contained aggregation functions and
- not constant under the grouping keys

is not rewritten to valid query.

Example 9. Aggregation not covered by the rewrite

By the grouping and aggregation semantics, the RETURN clause

Q5

RETURN a AS a, b + SUM(c) * 2 AS foo

is effectively equivalent to

WITH a AS a, c AS foo_1
WITH a AS a, SUM(foo_1) AS foo_1
RETURN a AS a, b + foo_1 * 2 AS foo

Note that variable b appears in the <ProjectionItem> b + foo_1 * 2 AS foo in the RETURN clause. However, variable b has already been removed from the driving table by the previous projections. In other words, the proposed rewrite produces an invalid query for [Q5].

To prevent such invalid rewrites, this proposal includes a syntax restriction to be imposed on RETURN and WITH clauses. The definition of the syntax restriction happens in three steps:

Definition of grouping keys that are recognized as constant sub-expressions
Definition of constant sub-expressions
Definition of the syntax restriction

Given a list of <ProjectionItem>s L = {p₁, p₂, …, p_N}, let RECOGNIZED_GROUPING_KEYS(L) be the set of all expressions ex

For which there is a <ProjectionItem>s p = ex AS al in L where AGG(p) is empty and
Which are either
- A variable
- Element property access on a variable
- Static map access on a variable

Example 10. Grouping keys

For the RETURN clause

Q6

RETURN b-a AS x, c AS c, d.prop AS d, c + SUM(b) AS sum

RECOGNIZED_GROUPING_KEYS( b-a AS x, c AS c, d.prop AS d, c + SUM(b) AS sum )
= { c, d.prop }.

Note that b-a is grouping key since AGG(b-a) is empty. However, it is not a grouping key that needs to be recognized as such for propose of identifying constant sub-expressions. Hence, b-a is not in RECOGNIZED_GROUPING_KEYS( b-a AS x, c AS c, d.prop AS d, c + SUM(b) AS sum ).

For an expression ex and a projection list L, let CONSTANT(ex, L) hold

If ex is either
- A constant,
- A parameter,
- An aggregation function, i.e. of the form agg(subEx), or
- A grouping key, i.e. ex ∈ RECOGNIZED_GROUPING_KEYS(L),
or if ex comprises sub-expressions, it only comprises sub-expressions subEx for which CONSTANT(subEx, L) holds.

Example 11. Constant expressions

For the RETURN clause

Q7

RETURN b-a AS x, c AS c, d.prop AS d, (2*c + SUM(b) + d.prop) / $pi AS val

CONSTANT( (2*c + SUM(b) + d.prop) / $pi, L) = {
  2, // a constant
  c, // a grouping key
  2*c, // only comprises constant sub-expressions
  SUM(b), // an aggregation function
  d.prop, // a grouping key
  (2*c + SUM(b) + d.prop), // only comprises constant sub-expressions
  $pi, // a parameter
  (2*c + SUM(b) + d.prop) / $pi // only comprises constant sub-expressions
}.

Based on the notion of constant expression, the syntax restriction is defined as:

Important

Syntax restriction

For clauses WITH L and RETURN L and every p = ex AS a in L where AGG(p) is not empty, CONSTANT(ex, L) shall hold.

Example 12. Effect of the syntax restriction

Under this restriction, [Q5] is invalid. For sub-expression b

in <ProjectionItem> b + foo_1 * 2 AS foo
in L = a AS a, b + foo_1 * 2 AS foo,

CONSTANT(b, L) does not hold. b is neither an aggregation function, a grouping key, a constant, a parameter, nor does it have any sub-expressions.

2.2.3. Row ordering and pagination

The WITH and the RETURN clause allow to

Order the rows of the result table with the ORDER BY sub-clause and
Paginate the result table with the SKIP and LIMIT sub-clauses.

Assuming the baseline semantics includes ORDER BY, SKIP, and LIMIT capabilities, the grouping and aggregation semantics extends as follows:

Important

RETURN p₁, p₂, …, p_N ORDER-SKIP-LIMIT is effectively equivalent to

WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
RETURN POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN) ORDER-SKIP-LIMIT

Analogously, WITH p₁, p₂, …, p_N ORDER-SKIP-LIMIT is effectively equivalent to

WITH PRE_PROJ(p1), PRE_PROJ(p2), ..., PRE_PROJ(pN)
WITH AGGR(p1), AGGR(p2), ..., AGGR(pN)
WITH POST_PROJ(p1), POST_PROJ(p2), ..., POST_PROJ(pN) ORDER-SKIP-LIMIT

Every <SortItem> listed in the ORDER BY clause contains an expression. Since these expressions are effectively evaluate after all POST_PROJ(p_i) expressions, a similar syntax restrictions applies to the <SortItem>s.

However, <SortItem>s can refer to aliases introduced by POST_PROJ(p_i).

Given a list of <ProjectionItem>s L = {p₁, p₂, …, p_N}, let RECOGNIZED_ALIASES(L) be the set of all aliases al for which there is a <ProjectionItem>s p = ex AS al in L.

For an expression ex and a projection list L, let AVAILABLE_TO_ORDER(ex, L) hold

If ex is either
- A constant,
- A parameter,
- A grouping key, i.e. ex ∈ RECOGNIZED_GROUPING_KEYS(L), or
- An alias, i.e. ex ∈ RECOGNIZED_ALIASES(L),
or if ex comprises sub-expressions, it only comprises sub-expressions subEx for which AVAILABLE_TO_ORDER(subEx, L) holds,
or if ex is aggregation expression in L, i.e. there is a <ProjectionItem>s p = ex AS al in L.

Important

For WITH L ORDER BY SI and RETURN L ORDER BY SI and every ex contained in a <SortItem> in SI, AVAILABLE_TO_ORDER(ex, L) shall hold.

2.2.4. Omitted aliases

This proposal considers all <ProjectionItem>s have user-given alias. Cypher allows to omit the aliases, particularly in the RETURN clause, though. However, the alias omission rules are based on the assumption that an implementation will infer a more or less reasonable alias if an alias is omitted by the user. Hence, it is safe for this proposal to assume that all <ProjectionItem>s have an alias.

2.2.5. Project wildcard `*`

Cypher support * as projection wildcard. The wildcard can appear before the list of <ProjectionItem>s L, i.e. RETURN *, WITH *, RETURN *, L and WITH *, L valid clauses, cf. <ProjectionItems>. The semantics of the wildcard is the projection of all columns of the driving table.

This semantics can be defined by a rewrite to the baseline semantics, which removes the wildcard before the actual projection and aggregation semantics, including the semantics discussed in Section "Rewrite". Likewise, the Syntax restriction applies after the wildcard effectively been removed.

Given a driving table D, let {VAR₁(D), VAR₂(D), …, VAR_N(D)} be the set of all variables in D.

Important

RETURN DISTINCT * X with an incoming driving table D is effectively equivalent to

RETURN DISTINCT VAR1(D), VAR2(D), ..., VARN(D) X

Analogously, WITH DISTINCT * X is effectively equivalent to

WITH DISTINCT VAR1(D), VAR2(D), ..., VARN(D) X

Note that X is anything valid syntax in a <Return> or <With> after * specified by the query, respectively. DISTINCT is the optional keyword DISTINCT if specified by the query or otherwise empty.

Example 13. Semantics of projection wildcard *

Assume the following driving table:

a	b
1	2
1	2
2	3

RETURN *, b * SUM(a) AS x ORDER BY x

is effectively equivalent to

RETURN a, b, b * SUM(a) AS x ORDER BY x

and, hence, valid under the Syntax restriction. Further, it is effectively equivalent to

WITH a, b, a AS x_1
WITH a, b, SUM(x_1) AS x_1
RETURN a, b, b * x_1 AS x ORDER BY x

by the Rewrite semantics, so that it results in:

a	b	x
1	2	4
2	3	6

2.3. Examples

2.3.1. Valid aggregations

The following clauses exhibit valid aggregations under the grouping and aggregation semantics and the syntax restriction it includes. For each example we list why it is valid.

RETURN 1 + count(*)
- The sub-expression 1 is a constant.
RETURN 1, 1 + count(*)
- The sub-expression 1 is a constant.
RETURN $x + count($x)
- The sub-expression $x is a parameter.
RETURN count($x) + $x
- The sub-expression $x is a parameter.
RETURN 1 + count($x) + $x * 2 + sum($x) + 'cake'
- The sub-expressions 1, 2, and 'cake' are constants.
- The sub-expression $x is a parameter.
MATCH (a) RETURN a.x, 1 + count(a.x)
- The sub-expression 1 is a constant.
MATCH (a) RETURN a.x, a.x + count(a.x)
- The sub-expression a.x is a recognized grouping key.
MATCH (a) WITH a.x AS ax RETURN ax, ax + count(ax)
- The sub-expression ax is a recognized grouping key.
MATCH (a)-[]→(b) RETURN a, a.x + count(b.y)
- The sub-expression a is a recognized grouping key.
MATCH (a)-[]→(b) RETURN a, size(keys(a)) + count(b.y)
- The sub-expression a is a recognized grouping key.
MATCH (x) RETURN x.a, x.b, x.c, x.a + x.b + count(x) + x.c
- The sub-expressions x.a, x.b, and x.c are recognized grouping keys.
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax)
- The sub-expression ax is a recognized grouping key.
- The sub-expression 1 is a constant.
WITH {a:1, b:2} AS map RETURN map.a, map.a + count(map.b)
- The sub-expression map.a is a recognized grouping key.
MATCH (x) WITH x.a + x.b + x.c AS sum RETURN sum, sum + count(*) + sum
- The sub-expression sum is a recognized grouping key.
MATCH (x)-[]→(y) WITH x.a AS a, collect(y) AS b RETURN a, b, a.x[2] + sum(a.c) + -(b[3].x)*3
- The sub-expressions a and b are a recognized grouping key.
- The sub-expressions 2 and 3 are a constant.
MATCH (a)-[b]→(c) WITH x.a AS a, collect(y) AS b RETURN a, b.y, a.x + sum(c.z) + b.y*3
- The sub-expressions a and b.y are a recognized grouping key.
- The sub-expression 3 is a constant.

2.3.2. Invalid aggregations

The following clauses exhibit invalid aggregations under the grouping and aggregation semantics and the syntax restriction it includes. For each example we list why it is invalid.

MATCH (a) RETURN a.x + count(*)
- The sub-expression a.x is not a grouping key.
MATCH (a) RETURN a.x + a.y + count(*) + a.z
- The sub-expressions a.x + a.y and a.z are not grouping keys.
MATCH (a) WITH a.x AS ax, a.y AS ay RETURN ax, count(ax) + ay
- The sub-expression ay is not a grouping key.
MATCH path=(a)-[*]-() RETURN length(path) + count(a)
- The sub-expression length(path) is not a grouping key.
WITH {a:1, b:2} AS map RETURN map.a, map.b + count(map.b)
- The sub-expression map.b is not a grouping key.
MATCH (a) RETURN a.x + a.y, a.x + collect(a.x)
- The sub-expression a.x is not a grouping key.
MATCH (a) RETURN a.x * a.x, a.x + collect(a.x)
- The sub-expression a.x is not a grouping key.
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x)
- The sub-expression a.x + 1 is not a recognized grouping key.
MATCH (x) RETURN x.a + x.b + x.c, x.a + x.b + x.c + count(x)
- The sub-expression x.a + x.b + x.c is not a recognized grouping key.
MATCH (a)-[]→(b) RETURN a AS x, x.x + count(b.y)
- The sub-expression x is not a grouping key; it is the alias of a grouping key, which are not visible to <ProjectionItem>s within the same clause.
MATCH (a)-[]→(b) RETURN a AS x, size(keys(x)) + count(b.y)
- The sub-expression x is not a grouping key; it is the alias of a grouping key, which are not visible to <ProjectionItem>s within the same clause.
MATCH (a)-[c]→(b) WITH a, c, {dim: properties(b)} AS b RETURN a, b.dim.x, sum(c.p) + b.dim.x
- The sub-expression b.dim.x is not a recognized grouping key.

2.3.3. Valid orderings

The following clauses exhibit valid row orderings. For each example we list why it is valid.

RETURN 1 + count(*) AS x ORDER BY x
- The expression x is a recognized alias.
RETURN 1, 1 + count(*) ORDER BY 2
- The expression 1 is a constant.
RETURN $x + count($x) ORDER BY $x
- The expression $x is a parameter.
RETURN 1 + count(*) ORDER BY 1 + count(*)
- The expression 1 is a constant.
- The expression count(*) is an aggregation function.
MATCH (a) RETURN a.x, 1 + count(a.x) ORDER BY a.x % 2
- The sub-expression a.x is a recognized grouping key.
- The sub-expression 2 is a constant.
MATCH (a) WITH a.x AS ax RETURN ax, ax + count(ax) ORDER BY ax
- The expression ax is a recognized grouping key.
MATCH (a)-[]→(b) RETURN a, a.x + count(b.y) ORDER BY a.y
- The sub-expression a is a recognized grouping key.
MATCH (a)-[]→(b) RETURN a, count(b.y) ORDER BY size(keys(a))
- The sub-expression a is a recognized grouping key.
MATCH (x) RETURN x.a, x.b, x.c, x.a + x.b + count(x) + x.c ORDER BY x.a + x.c
- The sub-expressions x.a and x.c are recognized grouping keys.
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax) ORDER BY ax - 1
- The sub-expression ax is a recognized grouping key.
- The sub-expression 1 is a constant.
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax) ORDER BY ax + 2
- The sub-expression ax is a recognized grouping key.
- The sub-expression 2 is a constant.
MATCH (a) WITH a.x + 1 as ax RETURN ax, ax - 1 + count(ax) ORDER BY ax + 2 - count(ax)
- The sub-expression ax is a recognized grouping key.
- The sub-expression 2 is a constant.
- The sub-expression count(ax) is an aggregation function.
MATCH (a) WITH a.x + 1 as ax RETURN ax AS x, ax - 1 + count(ax) ORDER BY x + 2
- The sub-expression x is an recognized alias.
- The sub-expression 2 is a constant.
MATCH (a) WITH a.x + 1 as ax RETURN ax AS x, ax - 1 + count(ax) ORDER BY x + 2 - count(ax)
- The sub-expression x is an recognized alias.
- The sub-expression 2 is a constant.
- The sub-expression count(ax) is an aggregation function.
MATCH (a) WITH a.x + 1 as ax RETURN ax AS x, ax - 1 + count(ax) AS y ORDER BY x + 2 - y
- The sub-expressions x and y are recognized aliases.
- The sub-expression 2 is a constant.
WITH {a:1, b:2} AS map RETURN map.a, map.a + count(map.b) ORDER BY map.a
- The expression map.a is a recognized grouping key.

2.3.4. Invalid orderings

The following clauses exhibit invalid row orderings. For each example we list why it is invalid.

MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x) ORDER BY a.x + 1
- The sub-expression a.x + 1 is not a recognized grouping key.
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x) ORDER BY a.x + 1 + count(a.x)
- The sub-expression a.x + 1 is not a recognized grouping key.
MATCH (a) RETURN a.x + 1, a.x + 1 + count(a.x) ORDER BY a.x + 2
- The expression a.x + 2 is not a grouping key.
- The sub-expression 2 is a constant, but sub-expression a.x is not a grouping key.
WITH {a:1, b:2} AS map RETURN map.a, map.a + count(map.b) ORDER BY map.b
- The sub-expression map.b is not a grouping key.
MATCH (x) RETURN x.a + x.b + x.c, x.a + x.b + x.c + count(x) ORDER BY x.a + x.c
- The expression x.a + x.c is not a grouping key.
- The sub-expressions x.a and x.c and x are not grouping keys, either.
MATCH (x) RETURN x.a + x.b + x.c, x.a + x.b + x.c + count(x) ORDER BY x.a + x.b + x.c
- The expression x.a + x.b + x.c is not a recognized grouping key.

3. What others do

All other major query languages explicitly delineate grouping key expressions.

For instance, SQL does so by requiring users to list all grouping key expressions in the GROUP BY clause. If the GROUP BY clause is present in a query, the projection in the SELECT clause have to fulfill a similar syntax restriction as defined by this CIP. The SQL-equivalent of [Q5]

SELECT a AS a, b + SUM(c) * 2 AS foo
FROM A
GROUP BY a

is invalid in SQL as well. For instance, PostgreSQL v13 rejects this query with

error: column "a.b" must appear in the GROUP BY clause or be used in an aggregate function

4. Benefits to this proposal

The main advantage of this proposal is, that is clarifies the semantics of grouping and aggregation in the WITH and the RETURN clause and removes imprecision of the previously existing semantics (cf. First oCIG Meeting).

5. Caveats to this proposal

5.1. Rules are heuristic

From a pure logical standpoint, the syntax restriction only has to rule out sub-expressions of aggregating projection items, which are not constant under the grouping keys. However, statically inferring all possible constant sub-expressions is not necessarily easy. To this effect, the proposed rules of the syntax restriction are a heuristic, which safely identifies sub-expression that are constant under the grouping keys, but can not identify all such expression theoretically possible.

Example 14. Logically correct aggregation ruled out by the syntax restriction

The RETURN clause

Q8

RETURN a AS a, (b - b) + SUM(c) AS foo

is ruled out by the syntax restriction, although sub-expression (b - b) is effectively constant. It is imaginable that a semantic analyser may figure that (b - b) can be simplified to 0 if b is know to be numeric, so that the clause effectively is equivalent to

RETURN a AS a, SUM(c) AS foo

which is perfectly valid.

The proposal tries to strike a balance between allowing good number of useful queries while keeping the rules of the syntax restrict reasonable simple.

Also note: For queries that are logically possible but rejected by the syntax restriction, users can always manually rewrite the query with additional explicit projections to make the query syntactically valid while it still produces the desired result.

5.2. Containment is imprecise

The openCypher grammar does not encode left- or right-deep precedence for chainable operations, cf.

<OrExpression>,
<XorExpression>,
<AndExpression>,
<ComparisonExpression>,
<AddOrSubtractExpression>,
<MultiplyDivideModuloExpression>,
etc.

The rules of this proposal just refer to "contained sub-expressions". Currently, openCypher lacks a clear reference point of what this precisely means.

Most parser technologies result in left- or right-deep parse trees. For instance an expression a+b+c is typically parsed as (a+b)c` or `a(b+c).

Typically, an implementation will decide containment according to its parse tree. Hence, one implementation may find a+b be contained in a+b+c while it finds b+c not to be contained in a+b+c. Another implementation may reach the opposite conclusion w.r.t. containment.

SQL’s and GQL’s definition of containment does not define containment within repetition, too. However, their grammar does not encode chainable operations grammatically with repetition. Rather, SQL and GQL use head-recursive grammar productions, which result in a left-deep containment, i.e. a+b+c is considered as (a+b)+c in these standards.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CIP2021-07-07-Grouping-keys-and-aggregation-expressions.adoc

CIP2021-07-07-Grouping-keys-and-aggregation-expressions.adoc

CIP2021-07-07 Grouping key and aggregation expressions

1. Background

1.1. Baseline projection

1.2. Baseline aggregation

1.3. Simple rewrite semantics

1.4. Limits of the simple rewrite semantics

2. Proposal

2.1. Syntax

2.2. Semantics

2.2.1. Rewrite

2.2.2. Syntax restriction

2.2.4. Omitted aliases

2.2.5. Project wildcard `*`

2.3. Examples

2.3.1. Valid aggregations

2.3.2. Invalid aggregations

2.3.3. Valid orderings

2.3.4. Invalid orderings

3. What others do

4. Benefits to this proposal

5. Caveats to this proposal

5.1. Rules are heuristic

5.2. Containment is imprecise

Files

CIP2021-07-07-Grouping-keys-and-aggregation-expressions.adoc

Latest commit

History

CIP2021-07-07-Grouping-keys-and-aggregation-expressions.adoc

File metadata and controls

CIP2021-07-07 Grouping key and aggregation expressions

1. Background

1.1. Baseline projection

1.2. Baseline aggregation

1.3. Simple rewrite semantics

1.4. Limits of the simple rewrite semantics

2. Proposal

2.1. Syntax

2.2. Semantics

2.2.1. Rewrite

2.2.2. Syntax restriction

2.2.3. Row ordering and pagination

2.2.4. Omitted aliases

2.2.5. Project wildcard *

2.3. Examples

2.3.1. Valid aggregations

2.3.2. Invalid aggregations

2.3.3. Valid orderings

2.3.4. Invalid orderings

3. What others do

4. Benefits to this proposal

5. Caveats to this proposal

5.1. Rules are heuristic

5.2. Containment is imprecise

2.2.5. Project wildcard `*`