Rework RegExp engine and add support for proper unicode matching #3746

dbatyai · 2020-05-15T14:45:48Z

This change includes several bugfixes, general improvements, and support
for additional features.

Added full support for web compatibility syntax defined in Annex B
Implemented parsing and matching patterns in unicode mode
Fixed capture results when iterating with nested capturing groups
Significantly reduced regexp bytecode size
Reduced stack usage during regexp execution
Improved matching performance

dbatyai · 2020-05-15T14:46:33Z

Benchmark	Heap (bytes)	Stack (bytes)	Perf (sec)
regexp-dna.js	1307536 -> 1307368 : +0.013%	1648 -> 1600 : +2.913%	1.379 -> 1.262 : +8.469%
string-tagcloud.js	1001432 -> 993352 : +0.807%	2200 -> 2144 : +2.545%	2.265 -> 2.203 : +2.751%
string-unpack-code.js	645344 -> 645288 : +0.009%	1944 -> 1896 : +2.469%	0.977 -> 0.919 : +5.987%
string-validate-input.js	927816 -> 927688 : +0.014%	1824 -> 1768 : +3.070%	8.243 -> 7.392 : +10.331%
regexp.js	2140568 -> 2128008 : +0.587%	3916 -> 3364 : +14.096%	5.770 -> 4.411 : +23.545%

ES5.1 binary size	master	patch	Diff
.text	121656	121984	+328 bytes
.rodata	14015	13999	-16 bytes

ES6 binary size	master	patch	Diff
.text	191544	192424	+880 bytes
.rodata	19289	19273	-16 bytes

lauriro · 2020-05-17T19:33:44Z

| after non-capturing group and word boundary does not work

print("name.lower()".replace(/(?:a|b)\b|\.\w+/g, ""))
// jerry returns: name.lower()
// node returns: name()

dbatyai · 2020-05-18T09:35:47Z

Thanks @lauriro, should be fixed now.

lauriro · 2020-05-18T18:16:22Z

one more

var re = /([[])\w+(?:=(("|')(?:\\?.)*?\3))?]?/g
console.log("a[href='#a>b']".replace(re, ""))
// jerry returns: a='#a>b']
// node returns: a

strange thing is, when [ is escaped in first character set, then it seems to work.
simpler re /([[])/ without escape also works, some combination will break.

dbatyai · 2020-05-18T19:14:20Z

Thanks, fixed as well.

jerry-core/ecma/operations/ecma-regexp-object.c

jerry-core/parser/regexp/re-parser.c

jerry-core/ecma/operations/ecma-regexp-object.c

jerry-core/ecma/operations/ecma-regexp-object.h

jerry-core/parser/regexp/re-bytecode.c

jerry-core/parser/regexp/re-compiler-context.h

jerry-core/parser/regexp/re-parser.c

rerobika

LGTM, nice work!

zherczeg

The capturing group matching seems too complex to me. In a recursive implementation, it is enough to store 3 values on the stack (prev_start, prev_end, current_start), and do an update when the end is reached. I saw local array allocation and similar things in the code. Are they necessary?

jerry-core/ecma/builtin-objects/ecma-builtin-global.c

zherczeg · 2020-05-25T11:07:14Z

jerry-core/ecma/operations/ecma-regexp-object.c

+ */
+static JERRY_ATTR_NOINLINE const lit_utf8_byte_t *
+ecma_regexp_step_back (ecma_regexp_ctx_t *re_ctx_p,
+                       const lit_utf8_byte_t *str_p) /**< reference to string pointer */


When this function is exactly used? Word boundary?

If there is a surrogate pair, and match starts from the position between the two characters, ecma_regexp_step_back may go back the start of the surrogate pair instead of stopping at the middle.

It is used in the greedy atom iterator during backtracking.

In unicode mode surrogate pairs are required to be handled as a single code point, matching can never start from the middle of a surrogate pair, and it is required to read the entire surrogate pair when backtracking.

It seems you are right, The engine overwrites the start offset for surrogate pairs. Is there a test for this?

There is one now.

zherczeg · 2020-05-25T11:09:40Z

jerry-core/ecma/operations/ecma-regexp-object.c

+
+        /* If zero iterations are allowed, then execute the end opcode which will handle further iterations,
+         * otherwise run the 1st iteration immediately by executing group bytecode. */
+        if (qmin == 0)


Is this likely? Can it be handled at compile time in some way?

The likelihood would be roughly 50/50 in my opinion, it could be handled compile time, however that would increase code size, and performance impact would be negligible.

zherczeg · 2020-05-25T11:11:34Z

jerry-core/ecma/operations/ecma-regexp-object.c

-          JERRY_TRACE_MSG ("fail\n");
-          return NULL; /* fail */
+          /* Save and clear all nested capturing groups, and try to iterate. */
+          JERRY_VLA (const lit_utf8_byte_t *, saved_captures_p, group_p->subcapture_count);


Is this efficient?

It's required.

jerry-core/ecma/operations/ecma-regexp-object.c

jerry-core/ecma/operations/ecma-regexp-object.h

zherczeg · 2020-05-25T11:19:04Z

jerry-core/lit/lit-char-helpers.c

+    const uint32_t digit = (uint32_t) (*current_p++ - LIT_CHAR_0);
+    uint32_t new_value = value * 10 + digit;
+
+    if (JERRY_UNLIKELY (value > UINT32_MAX / 10) || JERRY_UNLIKELY (new_value < value))


Is this comes from the standard or an engine limit?

Engine limit, the value would overflow otherwise.

dbatyai · 2020-05-25T12:40:38Z

Technically 2 values would be enough since prev_end always equals to current_start, but only in very simple cases. When there are nested capturing groups in the current group, it's no longer enough to save only these values, and we need to save/clear all the nested captures at the start of each iteration.

Take for example this regex: \((a)|(b))+\.exec("ab"), the resulting captures would be 0:"ab", 1:"ab", 2: undefined, 3:"b", since in the final iteration the (a) group did not match. If we do not clear the nested captures, the second capture would still contain "a". Similarly, there are cases where we need to backtrack, and restore the captures of all nested groups after a failed iteration, which is why they also need to be saved.

We currently get away with only saving the starting pointer of the captures by not clobbering the end pointer unnecessarily, so in my opinion this is as efficient as it can get, while still conforming to the standard.

zherczeg · 2020-05-26T08:57:15Z

It looks like this is working differently from other engines, but this type of regexp has far less features so they can get away with it. If a regexp has 1000 captures inside an iterator, can it consume the stack easily?

By the way, would it be enough to use a capture reset bitset? So instead of storing the previous values (two uint32_t values), just store one bit which says the value is currently undefined (even though the capture vector contains its previous value). This can be a represented by a flag, or a negative number in the capture vector. When the engine enter to a group, just save the flags, and flag all numbers in the ovector. When restore the previous iteration, reset the flags only, not the values. Even using a byte vector is a lot of save compared to 2*4 bytes.

dbatyai · 2020-05-26T09:35:55Z

I don't think a reset bitset would be enough. /((x)|(.))*def/.exec("abcdef") In this case the iterator would consume the whole string, the last capture would be 'f', and then we would backtrack three times until the tail 'def' matches, and in each step we have to restore the previous matched capture.

Currently when there are no nested captures we are using less stack than the previous implementation. For every nested capture there is, we have to save one extra value per iteration, which is unfortunately a necessary evil. We could save these on the heap, but that would be horrible performance wise.

If nested captures don't have quantifiers then stack usage is still reasonable, and if they do, that would be bad stack-wise regardless of whether we save captures or not. Also IMO nested captures aren't extremely common, there are usually one or two per regexp at most, so I don't consider them to be major concern.

rerobika · 2020-05-26T09:54:43Z

It looks like this is working differently from other engines, but this type of regexp has far less features so they can get away with it. If a regexp has 1000 captures inside an iterator, can it consume the stack easily?

By the way, would it be enough to use a capture reset bitset? So instead of storing the previous values (two uint32_t values), just store one bit which says the value is currently undefined (even though the capture vector contains its previous value). This can be a represented by a flag, or a negative number in the capture vector. When the engine enter to a group, just save the flags, and flag all numbers in the ovector. When restore the previous iteration, reset the flags only, not the values. Even using a byte vector is a lot of save compared to 2*4 bytes.

I think these ideas are beyond the PR's scope. The rework currently has significant stack/memory consumption improvement however the main reason was to add the proper unicode matching support for the engine so that's a double benefit. So I'd suggest to keep these ideas as the institution for opening a follow up optimization PR.

zherczeg

LGTM

Nice job! I am sure I missed things because the patch is huge, and there are optimization oppportunities, but lets use this as a first step.

jerry-core/ecma/operations/ecma-regexp-object.c

This change includes several bugfixes, general improvements, and support for additional features. - Added full support for web compatibility syntax defined in Annex B - Implemented parsing and matching patterns in unicode mode - Fixed capture results when iterating with nested capturing groups - Significantly reduced regexp bytecode size - Reduced stack usage during regexp execution - Improved matching performance JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai@inf.u-szeged.hu

rerobika · 2020-05-26T11:04:51Z

Just to mention for the credits:
During the development of the PR the Fuzzinator found 10 unique issues with using grammarinator.

dbatyai added enhancement An improvement performance Affects performance ES2015 Related to ES2015 features labels May 15, 2020

dbatyai force-pushed the regexp_unicode branch from ce079b4 to 9ddb1bf Compare May 15, 2020 14:53

dbatyai mentioned this pull request May 15, 2020

String#replace with regex provides previous capture group when group not found #3743

Closed

dbatyai force-pushed the regexp_unicode branch from 9ddb1bf to a7c6b62 Compare May 18, 2020 09:35

dbatyai mentioned this pull request May 18, 2020

stack-overflow in ecma_regexp_match #3753

Closed

dbatyai force-pushed the regexp_unicode branch 7 times, most recently from 801c385 to a8d39aa Compare May 19, 2020 15:50

galpeter reviewed May 19, 2020

View reviewed changes

jerry-core/ecma/operations/ecma-regexp-object.c Outdated Show resolved Hide resolved

jerry-core/parser/regexp/re-parser.c Outdated Show resolved Hide resolved

dbatyai force-pushed the regexp_unicode branch 4 times, most recently from 4ea579f to caa0edf Compare May 21, 2020 18:31

dbatyai requested review from rerobika, zherczeg and galpeter May 22, 2020 12:33

rerobika reviewed May 22, 2020

View reviewed changes

dbatyai force-pushed the regexp_unicode branch from caa0edf to 644a515 Compare May 25, 2020 10:46

dbatyai requested a review from rerobika May 25, 2020 10:47

rerobika approved these changes May 25, 2020

View reviewed changes

zherczeg reviewed May 25, 2020

View reviewed changes

dbatyai force-pushed the regexp_unicode branch 2 times, most recently from 2b13455 to ae1e96f Compare May 25, 2020 13:38

dbatyai force-pushed the regexp_unicode branch from ae1e96f to 8ca1e05 Compare May 26, 2020 09:57

zherczeg approved these changes May 26, 2020

View reviewed changes

jerry-core/ecma/operations/ecma-regexp-object.c Outdated Show resolved Hide resolved

dbatyai force-pushed the regexp_unicode branch from 8ca1e05 to 8b2e231 Compare May 26, 2020 10:57

rerobika merged commit 8f76a1f into jerryscript-project:master May 26, 2020

dbatyai deleted the regexp_unicode branch June 10, 2020 13:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rework RegExp engine and add support for proper unicode matching #3746

Rework RegExp engine and add support for proper unicode matching #3746

dbatyai commented May 15, 2020

dbatyai commented May 15, 2020

lauriro commented May 17, 2020 •

edited

dbatyai commented May 18, 2020

lauriro commented May 18, 2020

dbatyai commented May 18, 2020

rerobika left a comment •

edited

zherczeg left a comment

zherczeg May 25, 2020

dbatyai May 25, 2020

zherczeg May 26, 2020

dbatyai May 26, 2020

zherczeg May 25, 2020

dbatyai May 25, 2020

zherczeg May 25, 2020

dbatyai May 25, 2020

zherczeg May 25, 2020

dbatyai May 25, 2020

dbatyai commented May 25, 2020

zherczeg commented May 26, 2020

dbatyai commented May 26, 2020 •

edited

rerobika commented May 26, 2020

zherczeg left a comment

rerobika commented May 26, 2020

Rework RegExp engine and add support for proper unicode matching #3746

Rework RegExp engine and add support for proper unicode matching #3746

Conversation

dbatyai commented May 15, 2020

dbatyai commented May 15, 2020

lauriro commented May 17, 2020 • edited

dbatyai commented May 18, 2020

lauriro commented May 18, 2020

dbatyai commented May 18, 2020

rerobika left a comment • edited

Choose a reason for hiding this comment

zherczeg left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dbatyai commented May 25, 2020

zherczeg commented May 26, 2020

dbatyai commented May 26, 2020 • edited

rerobika commented May 26, 2020

zherczeg left a comment

Choose a reason for hiding this comment

rerobika commented May 26, 2020

lauriro commented May 17, 2020 •

edited

rerobika left a comment •

edited

dbatyai commented May 26, 2020 •

edited