
Some existing rules are unrepeatable and should be corrected #2140

Closed
joshgoebel opened this issue Sep 25, 2019 · 18 comments · Fixed by #2379
Labels
help welcome Could use help from community

Comments

@joshgoebel
Member

joshgoebel commented Sep 25, 2019

When I say "no matches" I mean unrepeatable by processLexeme when it tries to identify the rule that brought the regex engine to a halt. This could be for several reasons:

  • rules that match 0 width strings
  • rules with look-behinds/look-aheads that no longer match the smaller string processLexeme has access to

Possibly others. The 2nd case may not be easily fixable (though I'm working on that in a separate PR), but I have a feeling there are more rules like the two examples below that COULD be fixed.


The two I noticed (same rule) are in fortran.js and irpf90.js:

className: 'number',
(?=\\b|\\+|\\-|\\.)(?=\\.\\d|\\d)(?:\\d+)?(?:\\.?\\d*)(?:[de][+-]?\\d+)?\\b\\.?

Every part of this pattern is either a lookahead or optional, so it can actually match zero-width spans of text around digits (and does so plenty). The existing system thankfully filters these out, but there must be a cost (especially when doing auto-detection) to firing this rule and then ignoring it over and over.
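To make the failure mode concrete, here is a minimal Node.js sketch (not highlight.js itself) showing the rule matching a zero-width span. The pattern is the issue's regex, written as a regex literal instead of a double-escaped string:

```javascript
// The fortran/irpf90 number rule from the issue, as a regex literal.
const re = /(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/;

// Against "1a" the lookaheads pass at index 0, but the trailing \b fails
// after the digit ("1" and "a" are both word characters), so the engine
// backtracks until every optional part is empty and the match succeeds
// with zero width.
const m = re.exec("1a");
console.log(m.index, JSON.stringify(m[0]));  // 0 ""

// Against well-formed input the same rule matches normally.
console.log(re.exec("3.14")[0]);  // 3.14
```

So the rule is correct on real numbers; the zero-width hits come only from the backtracking path where every consuming piece ends up empty.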

I'd recommend we consider changing these rules so it's impossible for them to match a zero-width piece of text (and, in general, discourage rules like that)... or maybe this is the only way to match such numbers? I dunno. If so, though, perhaps the filter should be tweaked a bit and not labeled "Parser should not reach this point" when filtering out rules that don't ALWAYS match content is just normal daily operation.


Here is the current code that handles this (console logging is mine):

      /*
      Parser should not reach this point as all types of lexemes should be caught
      earlier, but if it does due to some bug make sure it advances at least one
      character forward to prevent infinite looping.
      */
      console.log("should never reach this point")
      mode_buffer += lexeme;
      return lexeme.length || 1;

Running the full test suite (all green) results in 1,911 occurrences of reaching the point we should never reach.

Removing both the number rules mentioned above still results in 1,693 occurrences.

Thoughts?

@egor-rogov
Collaborator

Hmm. Okay, maybe it's possible to get rid of existing "ill-behaving" rules. (At least I'm pretty sure fortran's number regexp can be rewritten.)
But why should we? Just to save some CPU cycles? Or are these rules somehow related to #2135, making your life harder?
And also, we have no way to prevent new rules like this from appearing, right?

@joshgoebel
Member Author

joshgoebel commented Sep 25, 2019

Or are these rules somehow related to #2135, making your life harder?

They were being annoying, but I've already solved it with a filter... so shrugs.

But why should we? Just to save some CPU cycles?

Well, yeah, and also the idea that rules that can match nothing are a little flaky as-is, aren't they? :-) Getting this to 0 over time seems like a good metric. I'm a big fan of "correctness". :-) Though I'm not saying it should be a SUPER high priority or anything.

It's also possible this is the only way to write some tricky rules, but without anyone arguing that side, I'm not convinced. :-) It seems logical that a "real" match (with a class, etc.) should be required to match SOMETHING. I could see using lookaheads to START blocks and such, but that would be different... Actually, lookaheads to start certain blocks could avoid the whole "rewind" issue you have now with returnEnd and infinite loops. :-) So maybe this needs to marinate more. :-) But I still feel these two number rules could be improved. :-)

And also, we have no way to prevent new rules like this from appearing, right?

Sure you do. You could add tests to the suite that would fail if someone added broken patterns. Just "tally" the occurrences (a very inexpensive operation) and, after running the "should match" tests (which are a torture test for bad rules), expect the tally to be 0.

If you decided it mattered, you could also IMMEDIATELY add a test asserting tally <= 1911, disallowing any future regressions, and slowly work the number down over time.

I'm not sure the rules are labeled well enough for the tests to tell someone exactly WHERE the bad rule is, but they could definitely detect bad rules and show the offending regex, etc.
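The tally idea above can be sketched in isolation. This is a simulation, not the real highlight.js scan loop: `scan`, `zeroWidthTally`, and the two sample rules are stand-ins invented for illustration.

```javascript
// Count zero-width rule matches while scanning, so a test can later assert
// the count stays at (or ratchets down from) a known ceiling.
let zeroWidthTally = 0;

function scan(re, text) {
  const g = new RegExp(re.source, "g");
  let m;
  while ((m = g.exec(text)) !== null) {
    if (m[0].length === 0) {
      zeroWidthTally++;
      g.lastIndex++;  // advance one character, mirroring the parser's escape hatch
    }
  }
}

scan(/\d+/, "abc 123 def");           // a well-behaved rule: never zero-width
scan(/(?=\d)\d*|x?/, "abc 123 def");  // an ill-behaved rule: matches "" freely

console.log("zero-width matches:", zeroWidthTally);
```

In a test suite (Mocha or otherwise) you would then assert `zeroWidthTally === 0`, or `<= 1911` as a regression ceiling to work down over time.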

@egor-rogov
Collaborator

I think it would be nice to get rid of such rules, and I agree with your reasoning. But it requires a lot of work, which will take forever with our limited resources.
OK, the problem is stated, and maybe we can do something about it in the future.

@joshgoebel
Member Author

I might be able to generate a list. Maybe we could tag this for Hacktoberfest or something.

@egor-rogov
Collaborator

A list will be helpful, of course. Even better would be the ability of the parser to complain about such rules (dunno if that's doable).

@joshgoebel
Member Author

the ability of the parser to complain about such rules

How would it complain exactly? Like when running tests? I thought a "warn" mode would be neat if the test suite allowed it... tally up the list and then add a "WARNING" to the test output... so the tests stay green, but you'd still have visibility into this metric over time and could work to reduce it without making it a hard error.

@egor-rogov
Collaborator

"How to complain" is not as important as "how to produce helpful diagnostics", I guess. Warn mode looks fine, yes.

@joshgoebel
Member Author

joshgoebel commented Sep 26, 2019

Well, with the current parser it's hard to get more than a count, since once you're inside processLexeme you have no context as to WHICH rule misfired. I suppose you could get the LOCATION where the match happened, but that's not super useful either.

My new parser should open up a lot of possibilities here if I can get it working, since then you could just capture a list of which language/regex was at fault (perhaps its className, etc.)... and work backwards from there.

I'll add that to my list after I get the thing working. :-)

Warn mode looks fine, yes.

Does Mocha have some inbuilt facilities for "warnings" or would we just add our own custom message to serve that purpose?

@egor-rogov
Collaborator

Does Mocha have some inbuilt facilities for "warnings" or would we just add our own custom message to serve that purpose?

Unfortunately I have no idea. Anyway it's not urgent.

I'm a little scared by the words "new parser". Didn't it start out as a small change? (:

@joshgoebel
Member Author

joshgoebel commented Sep 26, 2019

I'm a little scared by the words "new parser". Didn't it start out as a small change? (:

Yeah, but then I found a few termites in the walls, then noticed the paint was fading in a few spots, then... 😆

Don't be scared.

I promise I'll do my best to make it nice and clean, and you'll have plenty of time to review it. And I think if it turns out to be faster (fingers crossed), that'll speak for itself. The fact that the concept version passes 100% of tests without really any syntax file changes (excluding the markdown bug) says a lot too, I think. I was worried there would be more hidden "gotchas".

I'm really just attempting a PURER expression of the original intent. Perhaps "new parser" is saying a bit too much. I think the size of the change will be surprisingly small once I have a clean version.

@joshgoebel joshgoebel self-assigned this Oct 7, 2019
@joshgoebel joshgoebel added the help welcome Could use help from community label Oct 7, 2019
@joshgoebel
Member Author

@egor-rogov Turns out this isn't just bad but also destroys data:

#1566

:-)

@egor-rogov
Collaborator

Turns out this isn't just bad but also destroys data:

OMG!..

@joshgoebel
Member Author

joshgoebel commented Oct 9, 2019

And never believe the source. :-) This piece of code can also be reached easily during NORMAL functioning of the parser: when you have a beginSameAsEnd rule where multiple pieces of content inside the block match the begin regex but aren't the exact same string (i.e., they aren't the end). Since the "termination" regex is only compiled once, it doesn't automagically pick up the update to endRe... so those pieces just keep matching and matching, triggering this code to "skip" them, until the parser finds the TRUE end it's searching for.

Example:

def foo()
  msg = <<-HTML
  <div>
    <h4>#{bar}</h4>
  </div>
  HTML
end

HTML will trigger a beginSameAsEnd match, but the parser will still match "div", "h4", "h4" (again), and "div" (again) before it finds HTML (it's searching for \w+)... and all of these hit the "this should never be reached" checkpoint. :-)
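That scan can be simulated in a few lines. This is a stand-in, not the actual highlight.js matcher (and note the simulation also picks up "bar" from `#{bar}`, which real Ruby highlighting would hand to a sub-mode):

```javascript
// Simulate a beginSameAsEnd terminator: the regex is compiled once as \w+,
// so every word in the heredoc body "matches" and must be skipped until the
// word equals the remembered opener.
const opener = "HTML";
const body = "  <div>\n    <h4>#{bar}</h4>\n  </div>\n  HTML";
const terminator = /\w+/g;

const skipped = [];
let m;
while ((m = terminator.exec(body)) !== null) {
  if (m[0] === opener) break;  // the TRUE end
  skipped.push(m[0]);          // "should never be reached" territory
}
console.log(skipped);  // words matched and discarded before the real end
```

Every entry in `skipped` corresponds to one trip through the "Parser should not reach this point" branch.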

@egor-rogov
Collaborator

Hmm... Not sure I'm following. Do we have a problem, Houston? Is it something you can show in jsfiddle?

@joshgoebel
Member Author

Not sure it's a problem, just interesting. I'm just saying the comment is entirely wrong. :-) It still works, since eventually the matcher finds a match that passes both the original "begin" matcher and the new dynamic end matcher. :-)

@joshgoebel
Member Author

joshgoebel commented Jan 31, 2020

Marking this as resolved with #2379. A repeating 0-width match is now a thrown error (outside of safe mode) and should (most likely) result in failing auto-detect tests. So any frequently hit bad rules in the core library will be caught by the auto-detect tests. Obscure (deeply nested) rules could still sneak through, but in production (safe mode) they will just be skipped (as before).

Someone building a new grammar in debug mode should be notified of the problem, though, so they can fix it, versus it just "silently working" as it did before.
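The behavior described can be sketched roughly like this. This is an illustration of the two modes, not the actual #2379 source; `SAFE_MODE` and `advanceAfterMatch` are names invented here:

```javascript
// Assumed flag for illustration: true in production, false while developing
// a grammar ("debug mode").
const SAFE_MODE = true;

// Decide how far to advance the cursor after a rule fires.
function advanceAfterMatch(lexeme) {
  if (lexeme.length > 0) return lexeme.length;
  if (!SAFE_MODE) {
    // Debug mode: surface the misbehaved rule to the grammar author.
    throw new Error("0 width match regex");
  }
  return 1;  // safe mode: quietly skip one character, as before
}

console.log(advanceAfterMatch("3.14"));  // 4
console.log(advanceAfterMatch(""));      // 1 (would throw with SAFE_MODE off)
```

The key change versus the old code is that the zero-width case is no longer assumed unreachable; it is either an explicit error or an explicit, intentional skip.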

joshgoebel added a commit to joshgoebel/highlight.js that referenced this issue Jan 31, 2020
Closes highlightjs#2140.

- In safe mode 0 width matches will be safely and quietly ignored
  and advance the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back
into the library in the future.
@egor-rogov
Collaborator

(moved here from #2379)
I realized that debug and safe confuse me. They don't look like opposites.
debug and production?
unsafe and safe?
strict and relaxed?
I can't find a good pair right now.

joshgoebel added a commit that referenced this issue Feb 6, 2020
…r detection (#2379)

* enh(fortran) support intrinsic data types

Closes #1723.

* (parser) throw "0 width match" error for bad regex

Closes #2140.

- In safe mode 0 width matches will be safely and quietly ignored
  and advance the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back
into the library in the future.
@joshgoebel
Member Author

joshgoebel commented Feb 6, 2020

I agree this could be better. I think we can still change it later. I'll make a PR declaring these a private API intended for developer use only. I don't really see why you'd want to use them in general... so renaming them later should affect a very small number of people (and I'd feel fine doing that outside of a major release cycle).

I was probably too creative originally by calling it "safe mode"... :-/

taufik-nurrohman pushed a commit to taufik-nurrohman/highlight.js that referenced this issue Feb 18, 2020
…r detection (highlightjs#2379)

* enh(fortran) support intrinsic data types

Closes highlightjs#1723.

* (parser) throw "0 width match" error for bad regex

Closes highlightjs#2140.

- In safe mode 0 width matches will be safely and quietly ignored
  and advance the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back
into the library in the future.