Specifying Cypher constraints #166

Open
wants to merge 22 commits into
base: master
from

5 participants
@Mats-SX
Member

Mats-SX commented Dec 15, 2016

Specifies syntax for constraints and specifies three concrete constraints: node property uniqueness, node property existence, and relationship property existence.

CIP

@petraselmer

Looks good! Some comments...

== Motivation
Constraints provide utility for shaping the data graph.

@petraselmer

petraselmer Feb 1, 2017

Contributor

"utility for shaping the data graph" -> "the means by which various facets of the data graph may be controlled"

== Background
Cypher has a loose notion of schema, in which nodes and relationships may take very heterogeneous forms, both in terms of properties and in graph patterns.

@petraselmer

petraselmer Feb 1, 2017

Contributor

"of schema" -> "of a schema"

== Background
Cypher has a loose notion of schema, in which nodes and relationships may take very heterogeneous forms, both in terms of properties and in graph patterns.
Constraints allows us to bound the heterogeneous nature of the property graph into a more regular form.

@petraselmer

petraselmer Feb 1, 2017

Contributor

"Constraints allows us to bound" -> "Constraints allow us to bind"

* <<existence, Node property existence constraint>>
* <<existence, Relationship property existence constraint>>
Each constraint is detailed in its own below section.

@petraselmer

petraselmer Feb 1, 2017

Contributor

"its own below section" -> "its own section below"

The constraint expressions vary depending on the actual constraint (see the detailed sections).
.Example of dropping a constraint with name foo:

@petraselmer

petraselmer Feb 1, 2017

Contributor

Please quote / emphasize "foo" in some way

* `NOT NULL` - Indicates that a column cannot store NULL value
* `UNIQUE` - Ensures that each row for a column must have a unique value
* `PRIMARY KEY` - A combination of a `NOT NULL` and `UNIQUE`. Ensures that a column (or combination of two or more columns) have a unique identity which helps to find a particular record in a table more easily and quickly
* `FOREIGN KEY` - Ensure the referential integrity of the data in one table to match values in another table

@petraselmer

petraselmer Feb 1, 2017

Contributor

"Ensure the referential integrity of the data in one table to match values in another table"
->
"Ensures the referential integrity of the data in one table matches values in another table"

SQL constraints may be introduced at table creation time (in a `CREATE TABLE` statement), or in an `ALTER TABLE` statement:
.Creating a persons table in SQL Server / Oracle / MS Access:

@petraselmer

petraselmer Feb 1, 2017

Contributor

Format 'persons'

FirstName varchar(255))
----
.Creating a persons table in MySQL:

@petraselmer

petraselmer Feb 1, 2017

Contributor

Format 'persons'

== Benefits to this proposal
Constraints make Cypher's notion of schema more well-defined, and allows users to keep graphs in a more regular, easier to manage form.

@petraselmer

petraselmer Feb 1, 2017

Contributor

"keep graphs in a more regular, easier to manage form."
->
"maintain graphs in a more regular, easier-to-manage form."

== Caveats to this proposal
For an implementing system, some constraints may prove challenging to enforce, as they generally require scanning through large parts of the graph to look for conflicting entities.

@petraselmer

petraselmer Feb 1, 2017

Contributor

"For an implementing system"
->
"For a system seeking to implement the contents of this CIP"

@petraselmer

petraselmer Feb 1, 2017

Contributor

"look for" -> "locate"

@thobe

Contributor

thobe commented Feb 28, 2017

I thought the idea was that we were going to specify a syntax for constraints without specifying which particular constraints an implementation should support. The syntax definition here explicitly talks only about uniqueness-constraint and existence-constraint.

@Mats-SX

Member

Mats-SX commented Mar 1, 2017

CIP has now been reworked to try and fit the model discussed.

constraint command = create-constraint | drop-constraint ;
create-constraint = "CREATE", "CONSTRAINT", constraint-name, "FOR", constraint-pattern, "REQUIRE", constraint-expr, { "REQUIRE", constraint-expr } ;
constraint-name = symbolic-name
constraint-pattern = node-pattern | simple-pattern ;

@boggle

boggle Mar 1, 2017

Contributor

Couldn't this just be any pattern, same way as we allow in MATCH? Implementations would still be free to support only some forms in their constraints.

@boggle

boggle Mar 1, 2017

Contributor

Or even a comma-separated sequence?

@Mats-SX

Mats-SX Mar 1, 2017

Member

You're right, I don't see why not.

All constraints require the user to specify a nonempty _name_ at constraint creation time.
This name is subsequently the handle with which a user may refer to the constraint, for example when dropping it.
// TODO: Should we impose restrictions on the domain of constraint names, or are all Unicode characters allowed?

@boggle

boggle Mar 1, 2017

Contributor

I think the correct thing here is to align with Cypher identifier syntax for now.

@boggle

boggle Mar 1, 2017

Contributor

We should also support escaping of names.

@Mats-SX

Mats-SX Mar 1, 2017

Member

Escaping is captured by referring to the symbolicString rule.

CREATE CONSTRAINT color_schema
FOR (c:Color)
REQUIRE UNIQUE c.rgb, c.name
REQUIRE exists(c.rgb)

@boggle

boggle Mar 1, 2017

Contributor

I'm missing an example for an existence constraint that requires multiple properties to exist on a node; this can be expressed using `exists(n.foo) AND exists(n.bar)`.

@Mats-SX

Mats-SX Mar 1, 2017

Member

Right, or using multiple REQUIRE. I can add both.

* `UNIQUE` - Ensures that each row for a column must have a unique value.
* `PRIMARY KEY` - A combination of a `NOT NULL` and `UNIQUE`. Ensures that a column (or a combination of two or more columns) has a unique identity, reducing the resources required to locate a specific record in a table.
* `FOREIGN KEY` - Ensures the referential integrity of the data in one table matches values in another table.
* `CHECK` - Ensures that the value in a column meets a specific condition

@boggle

boggle Mar 1, 2017

Contributor

Maybe we should have a leading keyword like that for non-UNIQUE constraints as well?

@thobe

thobe Mar 7, 2017

Contributor

if so, I would propose the word THAT, since it reads nicely...

CREATE CONSTRAINT FOR (x:Foo) REQUIRE THAT ...

create-constraint = "CREATE", "CONSTRAINT", [ constraint-name ], "FOR", pattern, "REQUIRE", constraint-predicate, { "REQUIRE", constraint-predicate } ;
constraint-name = symbolic-name
constraint-predicate = expression | unique | primary-key ;
uniqune = "UNIQUE", property-expression

@IanRogers

IanRogers Mar 4, 2017

typo? uniqune -> unique

@Mats-SX

Mats-SX Mar 6, 2017

Member

Thanks!

CREATE CONSTRAINT enforce_dag_unique_for_R_links
FOR p = (a)-[:R]-(b)
REQUIRE length(p) < 2
----

@IanRogers

IanRogers Mar 4, 2017

Sorry, this isn't what #173 is talking about. The "unique" constraint on a DAG would be `require size(p) < 2`, though this is a generalisation of `require unique p` (i.e. to enforce the behaviour of `create unique`: http://neo4j.com/docs/developer-manual/current/cypher/clauses/create-unique/#_create_unique_relationships)

@Mats-SX

Mats-SX Mar 6, 2017

Member

You're right; length() actually doesn't make much sense here (it will always be 1 with this pattern). Although size() doesn't either, as it's not defined over paths at all.

Perhaps

REQUIRE size(collect(DISTINCT p)) < 2

would work, but it's not very pretty. We probably need some other construct here. I'll remove this example for now.

@thobe

Contributor

thobe commented Mar 7, 2017

PRIMARY KEY is not a helpful name for the concept it is used for describing.

On the high level there are two reasons for this:

  1. The word "primary" does not mean anything that is helpful in this context.
  2. The concept of primary keys carries with it a lot of associations from relational databases, many of which do not apply to the property graph model.
    • it should be noted that for the relational database case the word "primary" does have relevant meaning.

Diving further into these reasons, starting with the word "primary":

  • It implies that there is such a thing as a "secondary" key as well. In a relational database a secondary key is any index on a table other than the primary key.
  • It implies that this key has higher importance than any other key. While this might be true in many actual domain models, it is not always the case - in some cases there are other keys of equal importance.
  • It carries with it the association that there can be only one primary key. In many implementations - Neo4j being one of them - there is absolutely no need to enforce such a constraint on the ability to model your data.
  • It implies that the key needs to be defined first - before any data is inserted. In many implementations - Neo4j being one of them - this is not the case.

As for the aspects of primary keys in relational databases that do not apply to the property graph model:

  • The notion of there being a primary key implies that there might also be a foreign key - the idea of having foreign keys in a graph is quite silly, since we have direct relationships.
  • Coming from relational databases, I would expect preferential treatment for primary keys over any other (secondary) keys. I would expect lookup based on the primary key to be faster than any other key, since I would expect the data in that table to be structured by the primary key. In essence, I would expect the leaf nodes in the index for the primary key to be the actual rows, with all of their data, whereas for a secondary key the leaf node would only point to the actual row in the primary structure - there would be indirection when accessing the full data by a secondary key, thus penalizing access by secondary key.
    • Again this would not be true in for example Neo4j, where every key is actually secondary.
  • Relational databases equate the primary key with identity. The property graph (at least in some implementations - for example Neo4j) has a separate notion of identity. While the type of key we are proposing to add to the model would allow you to uniquely identify an entity, it would not necessarily identify that same entity forever - the same entity might change some of the values of the key and thus be identified by a different key, but still have the same identity.

The only reason I can think of for introducing the concept of a primary key in Cypher is for being able to map Cypher onto a relational database model. If that is the case I would much rather see this proposed from a vendor working on such a mapping, since they would have the insight into what needs to be modeled.

I do think that the notion of a unique indexed key of mandatory properties is helpful, and I see the benefit of elevating such a concept to the status of receiving its own syntax, but I don't think PRIMARY KEY is a good name for it.

REQUIRE UNIQUE p.name
REQUIRE UNIQUE p.email
REQUIRE UNIQUE p.address
REQUIRE exists(p.name) AND exists(p.email) AND exists(p.address)

@thobe

thobe Mar 7, 2017

Contributor

This decomposition seems wrong to me, we would want the compound key to require that the combination of p.name, p.email, and p.address is unique, not that each of them is unique on their own. But we do require that they each exist.

@Mats-SX

Mats-SX Mar 7, 2017

Member

Right, without UNIQUE on multiple properties, this is not expressible in another way.
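
The difference discussed in this thread (each property unique on its own versus the combination being unique) can be illustrated with a minimal Python sketch. This is not Cypher and not part of the CIP: nodes are modelled as plain dictionaries, the helper names `each_unique` and `compound_unique` are hypothetical, and the data is made up. Both helpers assume every listed property exists on every node, which is the existence half of the proposed key.

```python
def each_unique(nodes, keys):
    """True if every property in `keys` is unique on its own across `nodes`."""
    return all(
        len({node[key] for node in nodes}) == len(nodes)
        for key in keys
    )


def compound_unique(nodes, keys):
    """True if the combination of the `keys` values is unique across `nodes`."""
    combos = [tuple(node[key] for key in keys) for node in nodes]
    return len(set(combos)) == len(combos)


# Hypothetical data: two people share an address but differ in name and email.
people = [
    {"name": "Ann", "email": "ann@example.com", "address": "1 Main St"},
    {"name": "Bob", "email": "bob@example.com", "address": "1 Main St"},
]
keys = ("name", "email", "address")

# The compound key accepts this data; three separate UNIQUE requirements
# would reject it because the address repeats.
compound_ok = compound_unique(people, keys)
separate_ok = each_unique(people, keys)
```

With this data the compound check passes while the per-property check fails, which is why three separate `REQUIRE UNIQUE` lines are stricter than the intended key.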

[source, ebnf]
----
constraint command = create-constraint | drop-constraint ;
create-constraint = "CREATE", "CONSTRAINT", [ constraint-name ], "FOR", pattern, "REQUIRE", constraint-predicate, { "REQUIRE", constraint-predicate } ;

@thobe

thobe Mar 7, 2017

Contributor

I know there has been conversation about how `CREATE` is an overloaded word if used for constraints as well, since it is already used for creating nodes and edges. I just wanted to plant the idea of using the word `DEFINE` for constraints instead, i.e. `DEFINE CONSTRAINT FOR (p:Person) REQUIRE UNIQUE p.firstName, p.lastName`

create-constraint = "CREATE", "CONSTRAINT", [ constraint-name ], "FOR", pattern, "REQUIRE", constraint-predicate, { "REQUIRE", constraint-predicate } ;
constraint-name = symbolic-name
constraint-predicate = expression | unique | primary-key ;
unique = "UNIQUE", property-expression

@thobe

thobe Mar 7, 2017

Contributor

I think we want this to be a list of expressions in order to allow compound keys to be unique.

@Mats-SX

Mats-SX Mar 7, 2017

Member

I view that as out of scope for the first version of this CIP.

==== Mutability
Once a constraint has been created, it may not be amended.
Should a user wish to change its definition, it has to be dropped and recreated with an updated structure.

@thobe

thobe Mar 7, 2017

Contributor

Should we have a note here that transactional implementations could do both the dropping and recreation in the same transaction so that the constraint is atomically mutated? This would of course allow leaving the old constraint in place should the creation of the new constraint fail.
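
One way to picture the atomic drop-and-recreate suggested above is the following Python sketch. It is an illustration under stated assumptions, not how any implementation does it: the `Graph` class, `ConstraintViolation`, and `replace_constraint` are hypothetical names, constraints are modelled as per-node predicates, and the "transaction" is simulated by restoring a snapshot of the constraint table on failure.

```python
class ConstraintViolation(Exception):
    """Raised when existing data does not satisfy a new constraint."""


class Graph:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.constraints = {}  # constraint name -> predicate over one node

    def add_constraint(self, name, predicate):
        if name in self.constraints:
            raise ValueError(f"constraint {name!r} already exists")
        if not all(predicate(node) for node in self.nodes):
            raise ConstraintViolation(name)
        self.constraints[name] = predicate

    def drop_constraint(self, name):
        del self.constraints[name]  # KeyError if the name is unknown

    def replace_constraint(self, name, predicate):
        """Drop and recreate as one all-or-nothing step."""
        snapshot = dict(self.constraints)
        try:
            self.drop_constraint(name)
            self.add_constraint(name, predicate)
        except Exception:
            # Simulated rollback: the old constraint stays in place.
            self.constraints = snapshot
            raise
```

If creating the new constraint fails, the old definition is still registered afterwards, which matches the behaviour described in the comment.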

* Attempting to create a constraint on a graph where the data does not comply with the constraint criterion.
* Attempting to create a constraint with a name that already exists.
* Attempting to drop a constraint referencing a non-existent name.
* Attempting to modify the graph in such a way that it would violate a constraint.

@thobe

thobe Mar 7, 2017

Contributor

+ attempting to create a constraint that the underlying engine does not support enforcing.
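
The four listed error cases, plus the unsupported-constraint case proposed above, could be sketched as follows. This is a minimal Python model and not the CIP's specification: the `Store` class, the exception names, and the two constraint kinds `"exists"` and `"unique"` are assumptions made for illustration, and only single-property constraints on nodes are modelled.

```python
class ConstraintError(Exception):
    """Base class for the error cases in this sketch."""

class DataViolation(ConstraintError): pass      # data breaks a constraint
class DuplicateName(ConstraintError): pass      # name already in use
class UnknownName(ConstraintError): pass        # dropping a missing name
class Unsupported(ConstraintError): pass        # engine cannot enforce it


class Store:
    SUPPORTED = {"exists", "unique"}  # hypothetical engine capabilities

    def __init__(self):
        self.nodes = []
        self.constraints = {}  # name -> (kind, property key)

    def _holds(self, kind, key, nodes):
        if kind == "exists":
            return all(key in node for node in nodes)
        # "unique": values that are present must not repeat
        values = [node[key] for node in nodes if key in node]
        return len(values) == len(set(values))

    def add_constraint(self, name, kind, key):
        if kind not in self.SUPPORTED:
            raise Unsupported(kind)
        if name in self.constraints:
            raise DuplicateName(name)
        if not self._holds(kind, key, self.nodes):
            raise DataViolation(name)
        self.constraints[name] = (kind, key)

    def drop_constraint(self, name):
        if name not in self.constraints:
            raise UnknownName(name)
        del self.constraints[name]

    def add_node(self, node):
        candidate = self.nodes + [node]
        for name, (kind, key) in self.constraints.items():
            if not self._holds(kind, key, candidate):
                raise DataViolation(name)
        self.nodes = candidate  # commit only if every constraint holds
```

A rejected `add_node` leaves the stored nodes unchanged, so a failed modification has no partial effect in this model.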

@thobe

Contributor

thobe commented Mar 7, 2017

I wonder if the notion of a unique key (which at the moment is erroneously called a primary key) should really be a constraint, or if it should have its own syntax, something like:

CREATE KEY FOR (p:Person) AS p.name, p.address

The reason for this being that it actually implies multiple constraints, and typically also an index. Since it is a composed concept like that, perhaps it would be sensible to elevate it to being a syntactical concept of its own.

In the syntax for this, if accepted, we should allow an optional name for the key as well, just like we do for constraints.

@Mats-SX

Member

Mats-SX commented Mar 7, 2017

The CIP now uses ADD, NODE KEY and details a return record. I also took several review comments into account (thanks!).

@thobe

thobe approved these changes Mar 30, 2017

petraselmer and others added some commits Feb 17, 2017

Rework CIP
- Specify general constraint language
- Specify `UNIQUE` operator
- Clearly define semantics for domain and expressions
- List all concrete constraints in example section
- Add several more examples

Move Mutability and Name sections
They are more appropriate under the Semantics and Syntax sections, respectively.

Add cross-links

Move Errors section

Support arbitrary patterns
- Remove TODO
- Add example using larger pattern
- Add example using multiple `exists()`

Introduce PRIMARY KEY constraint predicate
- Rename `constraint-expr` to `constraint-predicate`
- Limit scope of `UNIQUE` to single properties only
- Update examples to reflect `PRIMARY KEY`

Rename constraint operator to NODE KEY
- Remove erroneous example for composing `NODE KEY` with `UNIQUE` and `exists()`
- Rephrase example section to describe `NODE KEY` more accurately.

Use ADD for constraint creation
- Add missing case for when an error should be raised