Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework RegExp engine and add support for proper unicode matching #3746

Merged
merged 1 commit into from May 26, 2020

Conversation

dbatyai
Copy link
Member

@dbatyai dbatyai commented May 15, 2020

This change includes several bugfixes, general improvements, and support
for additional features.

  • Added full support for web compatibility syntax defined in Annex B
  • Implemented parsing and matching patterns in unicode mode
  • Fixed capture results when iterating with nested capturing groups
  • Significantly reduced regexp bytecode size
  • Reduced stack usage during regexp execution
  • Improved matching performance

@dbatyai dbatyai added enhancement An improvement performance Affects performance ES2015 Related to ES2015 features labels May 15, 2020
@dbatyai
Copy link
Member Author

dbatyai commented May 15, 2020

Benchmark Heap (bytes) Stack (bytes) Perf (sec)
regexp-dna.js 1307536 -> 1307368 : +0.013% 1648 -> 1600 : +2.913% 1.379 -> 1.262 : +8.469%
string-tagcloud.js 1001432 -> 993352 : +0.807% 2200 -> 2144 : +2.545% 2.265 -> 2.203 : +2.751%
string-unpack-code.js 645344 -> 645288 : +0.009% 1944 -> 1896 : +2.469% 0.977 -> 0.919 : +5.987%
string-validate-input.js 927816 -> 927688 : +0.014% 1824 -> 1768 : +3.070% 8.243 -> 7.392 : +10.331%
regexp.js 2140568 -> 2128008 : +0.587% 3916 -> 3364 : +14.096% 5.770 -> 4.411 : +23.545%
ES5.1 binary size master patch Diff
.text 121656 121984 +328 bytes
.rodata 14015 13999 -16 bytes
ES6 binary size master patch Diff
.text 191544 192424 +880 bytes
.rodata 19289 19273 -16 bytes

@lauriro
Copy link

lauriro commented May 17, 2020

| after non-capturing group and word boundary does not work

print("name.lower()".replace(/(?:a|b)\b|\.\w+/g, ""))
// jerry returns: name.lower()
// node returns: name()

@dbatyai
Copy link
Member Author

dbatyai commented May 18, 2020

Thanks @lauriro, should be fixed now.

@lauriro
Copy link

lauriro commented May 18, 2020

one more

var re = /([[])\w+(?:=(("|')(?:\\?.)*?\3))?]?/g
console.log("a[href='#a>b']".replace(re, ""))
// jerry returns: a='#a>b']
// node returns: a

strange thing is, when [ is escaped in first character set, then it seems to work.
simpler re /([[])/ without escape also works, some combination will break.

@dbatyai
Copy link
Member Author

dbatyai commented May 18, 2020

Thanks, fixed as well.

@dbatyai dbatyai force-pushed the regexp_unicode branch 7 times, most recently from 801c385 to a8d39aa Compare May 19, 2020 15:50
@dbatyai dbatyai force-pushed the regexp_unicode branch 4 times, most recently from 4ea579f to caa0edf Compare May 21, 2020 18:31
jerry-core/ecma/operations/ecma-regexp-object.c Outdated Show resolved Hide resolved
jerry-core/ecma/operations/ecma-regexp-object.c Outdated Show resolved Hide resolved
jerry-core/ecma/operations/ecma-regexp-object.h Outdated Show resolved Hide resolved
jerry-core/parser/regexp/re-bytecode.c Outdated Show resolved Hide resolved
jerry-core/parser/regexp/re-bytecode.c Outdated Show resolved Hide resolved
jerry-core/parser/regexp/re-compiler-context.h Outdated Show resolved Hide resolved
jerry-core/parser/regexp/re-parser.c Show resolved Hide resolved
Copy link
Member

@rerobika rerobika left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, nice work!

Copy link
Member

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The capturing group matching seems too complex to me. In a recursive implementation, it is enough to store 3 values on the stack (prev_start, prev_end, current_start), and do an update when the end is reached. I saw local array allocation and similar things in the code. Are they necessary?

jerry-core/ecma/builtin-objects/ecma-builtin-global.c Outdated Show resolved Hide resolved
*/
static JERRY_ATTR_NOINLINE const lit_utf8_byte_t *
ecma_regexp_step_back (ecma_regexp_ctx_t *re_ctx_p,
const lit_utf8_byte_t *str_p) /**< reference to string pointer */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When this function is exactly used? Word boundary?

If there is a surrogate pair, and match starts from the position between the two characters, ecma_regexp_step_back may go back the start of the surrogate pair instead of stopping at the middle.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in the greedy atom iterator during backtracking.

In unicode mode surrogate pairs are required to be handled as a single code point, matching can never start from the middle of a surrogate pair, and it is required to read the entire surrogate pair when backtracking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you are right, The engine overwrites the start offset for surrogate pairs. Is there a test for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one now.


/* If zero iterations are allowed, then execute the end opcode which will handle further iterations,
* otherwise run the 1st iteration immediately by executing group bytecode. */
if (qmin == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this likely? Can it be handled at compile time in some way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The likelihood would be roughly 50/50 in my opinion, it could be handled compile time, however that would increase code size, and performance impact would be negligible.

JERRY_TRACE_MSG ("fail\n");
return NULL; /* fail */
/* Save and clear all nested capturing groups, and try to iterate. */
JERRY_VLA (const lit_utf8_byte_t *, saved_captures_p, group_p->subcapture_count);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this efficient?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's required.

jerry-core/ecma/operations/ecma-regexp-object.c Outdated Show resolved Hide resolved
const uint32_t digit = (uint32_t) (*current_p++ - LIT_CHAR_0);
uint32_t new_value = value * 10 + digit;

if (JERRY_UNLIKELY (value > UINT32_MAX / 10) || JERRY_UNLIKELY (new_value < value))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comes from the standard or an engine limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Engine limit, the value would overflow otherwise.

@dbatyai
Copy link
Member Author

dbatyai commented May 25, 2020

Technically 2 values would be enough since prev_end always equals to current_start, but only in very simple cases. When there are nested capturing groups in the current group, it's no longer enough to save only these values, and we need to save/clear all the nested captures at the start of each iteration.

Take for example this regex: \((a)|(b))+\.exec("ab"), the resulting captures would be 0:"ab", 1:"ab", 2: undefined, 3:"b", since in the final iteration the (a) group did not match. If we do not clear the nested captures, the second capture would still contain "a". Similarly, there are cases where we need to backtrack, and restore the captures of all nested groups after a failed iteration, which is why they also need to be saved.

We currently get away with only saving the starting pointer of the captures by not clobbering the end pointer unnecessarily, so in my opinion this is as efficient as it can get, while still conforming to the standard.

@dbatyai dbatyai force-pushed the regexp_unicode branch 2 times, most recently from 2b13455 to ae1e96f Compare May 25, 2020 13:38
@zherczeg
Copy link
Member

It looks like this is working differently from other engines, but this type of regexp has far less features so they can get away with it. If a regexp has 1000 captures inside an iterator, can it consume the stack easily?

By the way, would it be enough to use a capture reset bitset? So instead of storing the previous values (two uint32_t values), just store one bit which says the value is currently undefined (even though the capture vector contains its previous value). This can be a represented by a flag, or a negative number in the capture vector. When the engine enter to a group, just save the flags, and flag all numbers in the ovector. When restore the previous iteration, reset the flags only, not the values. Even using a byte vector is a lot of save compared to 2*4 bytes.

@dbatyai
Copy link
Member Author

dbatyai commented May 26, 2020

I don't think a reset bitset would be enough. /((x)|(.))*def/.exec("abcdef") In this case the iterator would consume the whole string, the last capture would be 'f', and then we would backtrack three times until the tail 'def' matches, and in each step we have to restore the previous matched capture.

Currently when there are no nested captures we are using less stack than the previous implementation. For every nested capture there is, we have to save one extra value per iteration, which is unfortunately a necessary evil. We could save these on the heap, but that would be horrible performance wise.

If nested captures don't have quantifiers then stack usage is still reasonable, and if they do, that would be bad stack-wise regardless of whether we save captures or not. Also IMO nested captures aren't extremely common, there are usually one or two per regexp at most, so I don't consider them to be major concern.

@rerobika
Copy link
Member

It looks like this is working differently from other engines, but this type of regexp has far less features so they can get away with it. If a regexp has 1000 captures inside an iterator, can it consume the stack easily?

By the way, would it be enough to use a capture reset bitset? So instead of storing the previous values (two uint32_t values), just store one bit which says the value is currently undefined (even though the capture vector contains its previous value). This can be a represented by a flag, or a negative number in the capture vector. When the engine enter to a group, just save the flags, and flag all numbers in the ovector. When restore the previous iteration, reset the flags only, not the values. Even using a byte vector is a lot of save compared to 2*4 bytes.

I think these ideas are beyond the PR's scope. The rework currently has significant stack/memory consumption improvement however the main reason was to add the proper unicode matching support for the engine so that's a double benefit. So I'd suggest to keep these ideas as the institution for opening a follow up optimization PR.

Copy link
Member

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Nice job! I am sure I missed things because the patch is huge, and there are optimization oppportunities, but lets use this as a first step.

jerry-core/ecma/operations/ecma-regexp-object.c Outdated Show resolved Hide resolved
This change includes several bugfixes, general improvements, and support
for additional features.
- Added full support for web compatibility syntax defined in Annex B
- Implemented parsing and matching patterns in unicode mode
- Fixed capture results when iterating with nested capturing groups
- Significantly reduced regexp bytecode size
- Reduced stack usage during regexp execution
- Improved matching performance

JerryScript-DCO-1.0-Signed-off-by: Dániel Bátyai dbatyai@inf.u-szeged.hu
@rerobika
Copy link
Member

Just to mention for the credits:
During the development of the PR the Fuzzinator found 10 unique issues with using grammarinator.

@rerobika rerobika merged commit 8f76a1f into jerryscript-project:master May 26, 2020
@dbatyai dbatyai deleted the regexp_unicode branch June 10, 2020 13:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement ES2015 Related to ES2015 features performance Affects performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants