CIP2015-09-15 STARTS WITH and ENDS WITH

Author: Stefan Plantikow stefan.plantikow@neotechnology.com

Abstract

String search is a feature that customers frequently request. Being able to describe and execute string searches efficiently (e.g. using index lookups) adds considerable value to the language. To better support string search, this CIP proposes introducing new operators for prefix search, suffix search, and inclusion search.

Case-insensitive string search (which requires specifying the correct locale or some default case folding rules) is out of scope for this CIP.

1. Motivation & Background

A previous (now rejected) CIP on LIKE showed that supporting efficient, indexed string search with LIKE patterns leads to complications around escaping parameters and literals within those patterns. A further CIP was drafted that proposed introducing string interpolation to address this issue, but the overall solution was considered too complex and too difficult for users to understand.

This CIP follows a more focused approach by adding explicit operators for commonly requested types of string searches.

2. Proposal

  • Add the operator STARTS WITH

  • Add the operator ENDS WITH

  • Add the operator CONTAINS

2.1. Examples

Find all persons whose name starts with 'And':
MATCH (a:Person)
WHERE a.name STARTS WITH 'And'
RETURN a

Find all persons whose name starts with the parameter prefix:
MATCH (a:Person)
WHERE a.name STARTS WITH $prefix
RETURN a

Find all persons whose name ends with 'fan':
MATCH (a:Person)
WHERE a.name ENDS WITH 'fan'
RETURN a

Find all books whose isbn in string form contains '007':
MATCH (b:Book)
WHERE toString(b.isbn) CONTAINS '007'
RETURN b

2.2. Syntax

Syntactically, this CIP proposes adding STARTS WITH, ENDS WITH, and CONTAINS as new keywords and adding new production rules to the expression nonterminal.

expression    = current definition of expression
               | string-search
               ;

string-search = starts-with | ends-with | contains ;
starts-with   = expression, STARTS, WITH, expression ;
ends-with     = expression, ENDS, WITH, expression ;
contains      = expression, CONTAINS, expression ;

2.3. Semantics

All three proposed operators are defined as comparison operators on string operands. As such, behavioural semantics need to be in line with what is detailed by the Comparability CIP. This has the following consequence:

  • If any argument to STARTS WITH, ENDS WITH, or CONTAINS is null or not a string, then the result of evaluating the whole predicate is null.
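The consequence above can be illustrated with a small query over mixed values. This is a sketch of the behaviour this CIP proposes, not output captured from an implementation; the variable name v and the column alias are chosen purely for illustration.

```cypher
// Under the rule above, the expected results are:
//   v = null      -> null  (null operand)
//   v = 42        -> null  (operand is not a string)
//   v = 'Andres'  -> true
UNWIND [null, 42, 'Andres'] AS v
RETURN v, v STARTS WITH 'And' AS startsWithAnd
```

Because WHERE only keeps rows for which its predicate evaluates to true, using such a predicate in a WHERE clause filters out the null cases rather than raising an error.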

2.3.1. STARTS WITH

Using lhs STARTS WITH rhs requires both lhs and rhs to be strings. This new expression evaluates to true if lhs textually starts with rhs. Otherwise, it is false.

2.3.2. ENDS WITH

Using lhs ENDS WITH rhs requires both lhs and rhs to be strings. This new expression evaluates to true if lhs textually ends with rhs. Otherwise, it is false.

2.3.3. CONTAINS

Using lhs CONTAINS rhs requires both lhs and rhs to be strings. This new expression evaluates to true if lhs textually contains rhs. Otherwise, it is false.
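As a sketch of the three definitions above, the following query evaluates each operator on string literals. The column aliases are illustrative only, and the last line assumes that textual comparison is case-sensitive, as implied by case-insensitive search being out of scope for this CIP.

```cypher
RETURN
  'Stefan' STARTS WITH 'Ste' AS prefixMatch,   // true
  'Stefan' ENDS WITH 'fan'   AS suffixMatch,   // true
  'Stefan' CONTAINS 'tef'    AS infixMatch,    // true
  'Stefan' STARTS WITH 'fan' AS wrongPosition, // false: 'fan' is a suffix, not a prefix
  'Stefan' STARTS WITH 'ste' AS wrongCase      // false: 's' differs from 'S'
```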

2.4. Alternatives

3. What others do

3.1. SQL

SQL uses LIKE; please refer to the LIKE CIP.

3.2. SPARQL

SPARQL uses regular expressions only.

3.3. MongoDB

  • If a field is indexed with a text index, a search can be made for documents containing a given term in the field; see the MongoDB documentation on text indexes for more details.

  • Regex searching is also provided; see the MongoDB documentation on the $regex operator for more details.

3.4. Elastic (formerly known as Elasticsearch)

  • Simple prefix query: this is a low-level query that works at the document term level and is not optimised. More details may be found in the Elasticsearch documentation on the prefix query.

  • Wildcard term query: this is a low-level, term-based query similar to the prefix query, but it allows a pattern to be specified. It uses ? to match any single character and * to match zero or more characters, and is also not optimised. More details may be found in the Elasticsearch documentation on the wildcard query.

  • Regex searching is also available and is also not optimised. Details may be found in the Elasticsearch documentation on the regexp query.

4. Benefits to this proposal

Efficient string search is an important and very frequently used operation, and implementing this CIP would support it directly.

5. Caveats to this proposal

  • More complex string searches must still use the regular expression search.

  • Differs from SQL’s approach.

  • More keywords added to the language.
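To illustrate the first caveat, a pattern with a single-character wildcard in the middle cannot be written with one of the proposed operators alone. The sketch below assumes Cypher's existing regular expression operator =~, which matches against the whole string; the pattern itself is made up for illustration.

```cypher
// Names that begin with 'An', followed by any one character, then 'e',
// then any remaining characters.
MATCH (a:Person)
WHERE a.name =~ 'An.e.*'
RETURN a
```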

6. Appendix

Case-insensitive string search requires specifying a case conversion function for the string operands as well as a suitable equality predicate for comparing them. Unicode itself defines three possible cases: lower case, upper case, and title case. Converting to these cases is inherently locale-specific, though in practice this is often ignored by using the default ("C" or "en") locale. To achieve good results, case conversion is often combined with a further, locale-independent normalization step. Notably, Java defines a special equality predicate, "equalsIgnoreCase", for case-insensitive comparison; it treats two strings as identical if they have the same length and all of their characters are pairwise equal either directly, after converting both to upper case, or after converting both to lower case.
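A common approximation of case-insensitive prefix search is to convert both operands to the same case before applying the operator. The sketch below assumes a toLower function is available (this CIP does not define it) and reuses the $prefix parameter from the examples in section 2.1. It ignores the locale-specific rules discussed above, and a predicate on a converted value may not be answerable from an index on a.name.

```cypher
// Compare lower-cased forms of the property and the parameter.
MATCH (a:Person)
WHERE toLower(a.name) STARTS WITH toLower($prefix)
RETURN a
```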

Further reading

6.2. Adding title case

To be on par with Unicode, it may be desirable in the future to add a toTitle function that converts a string to title case.