Improve document about the absence operator #87

Open
k-takata opened this Issue Mar 27, 2017 · 8 comments

Comments

Projects
None yet
3 participants
@k-takata
Owner

k-takata commented Mar 27, 2017

After the release of Ruby 2.4.1, some blog posts about the absent operator were written, and they caused some discussions.

Blog posts:

Discussions:

Unfortunately, some people might not understand the advantage of the absent operator. Maybe one of the reason is that the original paper by Tanaka Akira is written in Japanese. Another reason would be that the document of the operator in Onigmo is not enough.

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Mar 27, 2017

Owner

First, I'd like to translate the important points of his paper very very roughly.


  1. Introduction

Complement sets are useful to express C-style comments, CR LF terminated lines,
Python's """...""" strings, HTTP header terminated by CR LF CR LF, SMTP body
terminated by CR LF . CR LF, etc.
These can be expressed with the complement set of (any string)(terminator)(any string). However, it it not easy to construct a regex which matches such set.
This proposes the absent operator which is easy to write and fits well to regular language theory.

  1. Regular expressions

Regular expressions consist of:
e: empty string
c: character
rr: concatenation
r|r: alternation
r*: Kleene closure

  1. Implementation of regular expressions engine

A regular expressions engine can be implemented with DFA or backtracking.
Script languages nowadays like Perl, Python and Ruby uses backtracking,
because it is not easy to implement capturing with DFA.
He implemented a basic backtracking engine using Ruby.

3.1. Abstract Syntax Tree of regular expressions

3.2. A basic regular expressions engine

A basic implementation by Ruby.

# re: AST of a regex
# str: An array of characters
# pos: Start position
# block: A block executed when it is matched
#        (Callback)
def try(re, str, pos, &block)
  1. Proposal for the absent operator

If a regex engine uses DFA, complement sets can be handled easily.
However it is not easy for backtracking engine. (See 7.3 for detail.)

  1. The absent operator !r

(Note: Onigmo actually uses (?~r) instead of !r, because !r is not compatible
with current syntax.)
The language generated by !r can be defined as the following:

L(!r) = \Sigma * - L(.*r.*)

L(!r) is a complement set of L(.*r.*).

  1. Implementation of the absent operator

This is also described in pp.26-27 of his slide:
https://staff.aist.go.jp/tanaka-akira/pub/prosym49-akr-presen.pdf
See also the next comment.

  1. Advantage of the absent operator

7.1. Easy to write

C-style comments can be expressed with the following:

ab[^b]*b+(([^ba][^b]*)b+)*a

("a" and "b" are used here instead of "/" and "*" to reduce the complexity of
escaping.)

If the absent operator is used, it can be:

ab!(ba)ba

7.2. Fit with regular language theory

7.2.1. Repetition of lazy match

ab.*?ba

This doesn't work well when concatenated with other regex. E.g.

ab.*?ba\n

This wrongly matches "/*2*/3/*4*/\n".
However,

ab!(ba)ba\n

this correctly matches "/*4*/\n".

7.2.2. No backtracking

ab(?>.*?ba)

This works well when concatenated with "\n":

ab(?>.*?ba)\n

However this hardly depends on the strategy of backtracking, and it doesn't
have theoretical backgrounds. E.g. it cannot be converted to a DFA.

7.2.3. Negative lookahead

ab((?!ba).)*ba

This works well.
However, it is not possible to define a language L((?!r)) which is generated
by (?!r).

Note: I'm not sure that the negative lookahead is regular or not.
There is a paper (in Japanese) which says that it is regular.

7.3. Inefficiency of complement sets

It is possible to implement a negation operator which matches the complement
set of r. This is, however, inefficient to implement on a backtracking engine.
The absent operator is much efficient than the negation operator.

  1. Related researches

8.1. Ragel

Ragel has the following operators:

  • negation operator !r
    This is the same as the complement set operator discussed in 7.3.
  • difference operator r1-r2
    This is the same as the complement set operator when r1 generates all the possible strings.
  • strong difference operator r1--r2
    This is the same as the absent operator when r1 generates all the possible strings.

8.2. Perl

8.3. Grail

  1. Conclusions
Owner

k-takata commented Mar 27, 2017

First, I'd like to translate the important points of his paper very very roughly.


  1. Introduction

Complement sets are useful to express C-style comments, CR LF terminated lines,
Python's """...""" strings, HTTP header terminated by CR LF CR LF, SMTP body
terminated by CR LF . CR LF, etc.
These can be expressed with the complement set of (any string)(terminator)(any string). However, it it not easy to construct a regex which matches such set.
This proposes the absent operator which is easy to write and fits well to regular language theory.

  1. Regular expressions

Regular expressions consist of:
e: empty string
c: character
rr: concatenation
r|r: alternation
r*: Kleene closure

  1. Implementation of regular expressions engine

A regular expressions engine can be implemented with DFA or backtracking.
Script languages nowadays like Perl, Python and Ruby uses backtracking,
because it is not easy to implement capturing with DFA.
He implemented a basic backtracking engine using Ruby.

3.1. Abstract Syntax Tree of regular expressions

3.2. A basic regular expressions engine

A basic implementation by Ruby.

# re: AST of a regex
# str: An array of characters
# pos: Start position
# block: A block executed when it is matched
#        (Callback)
def try(re, str, pos, &block)
  1. Proposal for the absent operator

If a regex engine uses DFA, complement sets can be handled easily.
However it is not easy for backtracking engine. (See 7.3 for detail.)

  1. The absent operator !r

(Note: Onigmo actually uses (?~r) instead of !r, because !r is not compatible
with current syntax.)
The language generated by !r can be defined as the following:

L(!r) = \Sigma * - L(.*r.*)

L(!r) is a complement set of L(.*r.*).

  1. Implementation of the absent operator

This is also described in pp.26-27 of his slide:
https://staff.aist.go.jp/tanaka-akira/pub/prosym49-akr-presen.pdf
See also the next comment.

  1. Advantage of the absent operator

7.1. Easy to write

C-style comments can be expressed with the following:

ab[^b]*b+(([^ba][^b]*)b+)*a

("a" and "b" are used here instead of "/" and "*" to reduce the complexity of
escaping.)

If the absent operator is used, it can be:

ab!(ba)ba

7.2. Fit with regular language theory

7.2.1. Repetition of lazy match

ab.*?ba

This doesn't work well when concatenated with other regex. E.g.

ab.*?ba\n

This wrongly matches "/*2*/3/*4*/\n".
However,

ab!(ba)ba\n

this correctly matches "/*4*/\n".

7.2.2. No backtracking

ab(?>.*?ba)

This works well when concatenated with "\n":

ab(?>.*?ba)\n

However this hardly depends on the strategy of backtracking, and it doesn't
have theoretical backgrounds. E.g. it cannot be converted to a DFA.

7.2.3. Negative lookahead

ab((?!ba).)*ba

This works well.
However, it is not possible to define a language L((?!r)) which is generated
by (?!r).

Note: I'm not sure that the negative lookahead is regular or not.
There is a paper (in Japanese) which says that it is regular.

7.3. Inefficiency of complement sets

It is possible to implement a negation operator which matches the complement
set of r. This is, however, inefficient to implement on a backtracking engine.
The absent operator is much efficient than the negation operator.

  1. Related researches

8.1. Ragel

Ragel has the following operators:

  • negation operator !r
    This is the same as the complement set operator discussed in 7.3.
  • difference operator r1-r2
    This is the same as the complement set operator when r1 generates all the possible strings.
  • strong difference operator r1--r2
    This is the same as the absent operator when r1 generates all the possible strings.

8.2. Perl

8.3. Grail

  1. Conclusions
@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Mar 27, 2017

Owner

And partial translation of his slide.


p.26
Behavior of the absent operator

Search - not match
  Move start pos and search
    Move again and search
      Found
      Move again and search
        Found another one
	  Move and search
	    No need to search after this

The range which matches are not included.

p.27
Behavior of the absent operator

  • Search r in the right part of the given start position pos.
  • Search it starting from pos and go to right one by one character.
  • If nothing found, all part of the right of pos are the possible points which the absent operator can match.
  • If one or more found, select one which the right end is most left.
  • Just one character left before the right end of the selected one is the last possible point which the absent operator can match.
  • Right part of the found one is not needed to be searched.
Owner

k-takata commented Mar 27, 2017

And partial translation of his slide.


p.26
Behavior of the absent operator

Search - not match
  Move start pos and search
    Move again and search
      Found
      Move again and search
        Found another one
	  Move and search
	    No need to search after this

The range which matches are not included.

p.27
Behavior of the absent operator

  • Search r in the right part of the given start position pos.
  • Search it starting from pos and go to right one by one character.
  • If nothing found, all part of the right of pos are the possible points which the absent operator can match.
  • If one or more found, select one which the right end is most left.
  • Just one character left before the right end of the selected one is the last possible point which the absent operator can match.
  • Right part of the found one is not needed to be searched.
@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Mar 27, 2017

Owner

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

Owner

k-takata commented Mar 27, 2017

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

@hachi8833

This comment has been minimized.

Show comment
Hide comment
@hachi8833

hachi8833 Mar 27, 2017

Good work! I'd help translate them if needed.

Good work! I'd help translate them if needed.

@janosch-x

This comment has been minimized.

Show comment
Hide comment
@janosch-x

janosch-x Apr 3, 2017

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

I think so. "absent operator" is a confusing name, because it could just as well refer to an operator that is not there. "absence operator" does not have this problem.

On the other hand, would you say that it is also a group? In the docs you have put it under "7. Extended groups". If so, it might be more consistent to use the adjective form, as with "passive" or "atomic". I'd still choose clarity over consistency, though.

BTW, should the name of the operator be "absence operator" instead of "absent operator"?

I think so. "absent operator" is a confusing name, because it could just as well refer to an operator that is not there. "absence operator" does not have this problem.

On the other hand, would you say that it is also a group? In the docs you have put it under "7. Extended groups". If so, it might be more consistent to use the adjective form, as with "passive" or "atomic". I'd still choose clarity over consistency, though.

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Apr 4, 2017

Owner

Thank you for the explanation.

would you say that it is also a group?

In Akira's paper, ! is used for the operator. However, I decided to use (?~...) in Onigmo, and yes, it is also a group (unlike !).
So "absence operator" or "absence group" seems better than "absent operator".

Owner

k-takata commented Apr 4, 2017

Thank you for the explanation.

would you say that it is also a group?

In Akira's paper, ! is used for the operator. However, I decided to use (?~...) in Onigmo, and yes, it is also a group (unlike !).
So "absence operator" or "absence group" seems better than "absent operator".

@janosch-x janosch-x referenced this issue in ammar/regexp_parser Apr 4, 2017

Merged

Add support for new absence operator (#33) #34

k-takata added a commit that referenced this issue Apr 5, 2017

doc: Improve document about absence operator (Issue #87)
Rename absent operator to absence operator.
Add more description.
@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Apr 5, 2017

Owner

I have slightly updated the document: 7911409
I changed the name of the operator to "absence operator".
(PR for fixing document (including grammar or typo) is always welcome.)

Owner

k-takata commented Apr 5, 2017

I have slightly updated the document: 7911409
I changed the name of the operator to "absence operator".
(PR for fixing document (including grammar or typo) is always welcome.)

@k-takata

This comment has been minimized.

Show comment
Hide comment
@k-takata

k-takata Apr 5, 2017

Owner

(Upvoting all comments is meaningless, I think...)

Owner

k-takata commented Apr 5, 2017

(Upvoting all comments is meaningless, I think...)

@k-takata k-takata changed the title from Improve document about the absent operator to Improve document about the absence operator Apr 5, 2017

@znz znz referenced this issue in rurema/doctree May 25, 2017

Closed

Absence Operator #395

@znz znz referenced this issue in ruby/www.ruby-lang.org Sep 25, 2017

Merged

Ruby 2.5.0-preview1 Released #1641

2 of 2 tasks complete

@JuanitoFatas JuanitoFatas referenced this issue in ruby/www.ruby-lang.org Dec 20, 2017

Merged

absent operator → absence operator #1699

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment