
Some existing rules are unrepeatable and should be corrected #2140

Closed
joshgoebel opened this issue Sep 25, 2019 · 18 comments · Fixed by #2379
Labels
help welcome Could use help from community

Comments

@joshgoebel
Member

joshgoebel commented Sep 25, 2019

When I say "no matches" I mean unrepeatable by processLexeme when it tries to identify the rule that brought the regex engine to a halt. This could be for several reasons:

  • rules that match 0 width strings
  • rules with look-behinds/look-aheads that no longer match the smaller string processLexeme has access to

Possibly others. The 2nd case may not be easily fixable (though I'm working on that in a separate PR), but I have a feeling there are more rules like the two examples below that COULD be fixed.


The two I noticed (same rule) are in fortran.js and irpf90.js:

className: 'number',
(?=\\b|\\+|\\-|\\.)(?=\\.\\d|\\d)(?:\\d+)?(?:\\.?\\d*)(?:[de][+-]?\\d+)?\\b\\.?

Every part of this pattern is either a lookahead or optional, so it can actually match zero-width spans of text around digits (and does so plenty). The existing system thankfully filters these out, but there must be a cost (especially when doing auto-detection) to firing this rule and then ignoring it over and over.
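To make the failure mode concrete, here is a minimal Node.js sketch (not highlight.js itself) showing the rule matching a zero-width span. The pattern is the issue's regex, written as a regex literal instead of a double-escaped string:

```javascript
// The fortran/irpf90 number rule from the issue, as a regex literal.
const re = /(?=\b|\+|\-|\.)(?=\.\d|\d)(?:\d+)?(?:\.?\d*)(?:[de][+-]?\d+)?\b\.?/;

// Against "1a" the lookaheads pass at index 0, but the trailing \b fails
// after the digit ("1" and "a" are both word characters), so the engine
// backtracks until every optional part is empty and the match succeeds
// with zero width.
const m = re.exec("1a");
console.log(m.index, JSON.stringify(m[0]));  // 0 ""

// Against well-formed input the same rule matches normally.
console.log(re.exec("3.14")[0]);  // 3.14
```

So the rule is correct on real numbers; the zero-width hits come only from the backtracking path where every consuming piece ends up empty.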

I'd recommend we consider changing these rules so it's impossible for them to match a zero-width piece of text (and, in general, discourage rules like that)... or maybe this is the only way to match such numbers? I dunno. If so, though, perhaps the filter should be tweaked a bit and not labeled "Parser should not reach this point" when filtering out rules that don't ALWAYS match content is just normal daily operation.


Here is the current code that handles this (console logging is mine):

      /*
      Parser should not reach this point as all types of lexemes should be caught
      earlier, but if it does due to some bug make sure it advances at least one
      character forward to prevent infinite looping.
      */
      console.log("should never reach this point")
      mode_buffer += lexeme;
      return lexeme.length || 1;

Running the full test suite (all green) results in 1,911 occurrences of reaching the point we should never reach.

Removing both the number rules mentioned above still results in 1,693 occurrences.

Thoughts?

@egor-rogov
Collaborator

Hmm. Okay, maybe it's possible to get rid of existing "ill-behaving" rules. (At least I'm pretty sure fortran's number regexp can be rewritten.)
But why should we? Just to save some CPU cycles? Or are these rules somehow related to #2135, making your life harder?
And also, we have no way to prevent new rules like this from appearing, right?

@joshgoebel
Member Author

joshgoebel commented Sep 25, 2019

Or are these rules somehow related to #2135, making your life harder?

They were being annoying, but I've already solved it with a filter... so shrugs.

But why should we? Just to save some CPU cycles?

Well, yeah, and also the idea that rules that can match nothing are a little flaky as-is, aren't they? :-) Getting this to 0 over time seems like a good metric. I'm a big fan of "correctness". :-) Though I'm not saying it should be a SUPER high priority or anything.

It's also possible this is the only way to write some tricky rules, but without anyone arguing that side, I'm not convinced. :-) It seems logical that a "real" match (with a class, etc.) should be required to match SOMETHING. I could see using lookaheads to START blocks and such, but that would be different... Actually, lookaheads to start certain blocks could avoid the whole "rewind" issue you have now with returnEnd and infinite loops. :-) So maybe this needs to marinate more. :-) But I still feel these two number rules could be improved. :-)

And also, we have no way to prevent new rules like this from appearing, right?

Sure you do. You could add tests to the suite that would fail if someone added broken patterns. Just "tally" the occurrences (a very inexpensive operation) and, after running the "should match" tests (which are a torture test for bad rules), expect the tally to be 0.

If you decided it mattered, you could also IMMEDIATELY add a test asserting tally <= 1911, disallowing any future regressions, and slowly work the number down over time.

I'm not sure the rules are labeled well enough for the tests to tell someone exactly WHERE the bad rule is, but they could definitely detect bad rules and show the offending regex, etc.
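The tally idea above can be sketched in isolation. This is a simulation, not the real highlight.js scan loop: `scan`, `zeroWidthTally`, and the two sample rules are stand-ins invented for illustration.

```javascript
// Count zero-width rule matches while scanning, so a test can later assert
// the count stays at (or ratchets down from) a known ceiling.
let zeroWidthTally = 0;

function scan(re, text) {
  const g = new RegExp(re.source, "g");
  let m;
  while ((m = g.exec(text)) !== null) {
    if (m[0].length === 0) {
      zeroWidthTally++;
      g.lastIndex++;  // advance one character, mirroring the parser's escape hatch
    }
  }
}

scan(/\d+/, "abc 123 def");           // a well-behaved rule: never zero-width
scan(/(?=\d)\d*|x?/, "abc 123 def");  // an ill-behaved rule: matches "" freely

console.log("zero-width matches:", zeroWidthTally);
```

In a test suite (Mocha or otherwise) you would then assert `zeroWidthTally === 0`, or `<= 1911` as a regression ceiling to work down over time.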

@egor-rogov
Collaborator

I think it would be nice to get rid of such rules, and I agree with your reasoning. But it requires a lot of work, which will take forever with our limited resources.
OK, the problem is stated, and maybe we can do something about it in the future.

@joshgoebel
Member Author

I might be able to generate a list. Maybe we could tag this for Hacktoberfest or something.

@egor-rogov
Collaborator

A list will be helpful, of course. Even better would be the ability of the parser to complain about such rules (dunno if that's doable).

@joshgoebel
Member Author

the ability of the parser to complain about such rules

How would it complain exactly? Like when running tests? I thought a "warn" mode would be neat if the test suite allowed it... tally up the list and then add a "WARNING" to the test output... so the tests stay green, but you'd still have visibility into this metric over time and could work to reduce it without making it a hard error.

@egor-rogov
Collaborator

"How to complain" is not as important as "how to produce helpful diagnostics", I guess. Warn mode looks fine, yes.

@joshgoebel
Member Author

joshgoebel commented Sep 26, 2019

Well, with the current parser it's hard to get more than a count, since once you're inside processLexeme you have no context as to WHICH rule misfired. I suppose you could get the LOCATION where the match happened, but that's not super useful either.

My new parser should open up a lot of possibilities here if I can get it working, since then you could just capture a list of which language/regex was at fault (perhaps its className, etc.)... and work backwards from there.

I'll add that to my list after I get the thing working. :-)

Warn mode looks fine, yes.

Does Mocha have some inbuilt facilities for "warnings" or would we just add our own custom message to serve that purpose?

@egor-rogov
Collaborator

Does Mocha have some inbuilt facilities for "warnings" or would we just add our own custom message to serve that purpose?

Unfortunately I have no idea. Anyway it's not urgent.

I'm a little scared by the words "new parser". Didn't it start out as a small change? (:

@joshgoebel
Member Author

joshgoebel commented Sep 26, 2019

I'm a little scared by the words "new parser". Didn't it start out as a small change? (:

Yeah, but then I found a few termites in the walls, then noticed the paint was fading in a few spots, then... 😆

Don't be scared.

I promise I'll do my best to make it nice and clean, and you'll have plenty of time to review it. And I think if it turns out to be faster (fingers crossed), that'll speak for itself. The fact that the concept version passes 100% of tests without really any syntax file changes (excluding the markdown bug) says a lot too, I think. I was worried there would be more hidden "gotchas".

I'm really just attempting a PURER expression of the original intent. Perhaps "new parser" is saying a bit too much. I think the size of the change will be surprisingly small once I have a clean version.

@joshgoebel joshgoebel self-assigned this Oct 7, 2019
@joshgoebel joshgoebel added the help welcome Could use help from community label Oct 7, 2019
@joshgoebel
Member Author

@egor-rogov Turns out this isn't just bad but also destroys data:

#1566

:-)

@egor-rogov
Collaborator

Turns out this isn't just bad but also destroys data:

OMG!..

@joshgoebel
Member Author

joshgoebel commented Oct 9, 2019

And never believe the source. :-) This piece of code can also be reached easily during NORMAL functioning of the parser: when you have a beginSameAsEnd rule where multiple pieces of content inside the block match the begin regex but aren't the exact same string (i.e., they aren't the end). Since the "termination" regex is only compiled once, it doesn't automagically pick up the update to endRe... so those pieces just keep matching and matching, triggering this code to "skip" them, until the parser finds the TRUE end it's searching for.

Example:

def foo()
  msg = <<-HTML
  <div>
    <h4>#{bar}</h4>
  </div>
  HTML
end

HTML will trigger a beginSameAsEnd match, but the parser will still match "div", "h4", "h4" (again), and "div" (again) before it finds HTML (it's searching for \w+)... and all of these hit the "this should never be reached" checkpoint. :-)
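That scan can be simulated in a few lines. This is a stand-in, not the actual highlight.js matcher (and note the simulation also picks up "bar" from `#{bar}`, which real Ruby highlighting would hand to a sub-mode):

```javascript
// Simulate a beginSameAsEnd terminator: the regex is compiled once as \w+,
// so every word in the heredoc body "matches" and must be skipped until the
// word equals the remembered opener.
const opener = "HTML";
const body = "  <div>\n    <h4>#{bar}</h4>\n  </div>\n  HTML";
const terminator = /\w+/g;

const skipped = [];
let m;
while ((m = terminator.exec(body)) !== null) {
  if (m[0] === opener) break;  // the TRUE end
  skipped.push(m[0]);          // "should never be reached" territory
}
console.log(skipped);  // words matched and discarded before the real end
```

Every entry in `skipped` corresponds to one trip through the "Parser should not reach this point" branch.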

@egor-rogov
Collaborator

Hmm... Not sure I'm following. Do we have a problem, Houston? Is it something you can show in jsfiddle?

@joshgoebel
Member Author

Not sure it's a problem, just interesting. I'm just saying the comment is entirely wrong. :-) It still works, since eventually the matcher finds a match that passes both the original "begin" matcher and the new dynamic end matcher. :-)

@joshgoebel
Member Author

joshgoebel commented Jan 31, 2020

Marking this as resolved with #2379. A repeating 0-width match is now a thrown error (outside of safe mode) and should (most likely) result in failing auto-detect tests. So any frequently hit bad rules in the core library will be caught by the auto-detect tests. Obscure (deeply nested) rules could still sneak through, but in production (safe mode) they will just be skipped (as before).

Someone building a new grammar in debug mode should be notified of the problem, though, so they can fix it, versus it just "silently working" as it did before.
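The behavior described can be sketched roughly like this. This is an illustration of the two modes, not the actual #2379 source; `SAFE_MODE` and `advanceAfterMatch` are names invented here:

```javascript
// Assumed flag for illustration: true in production, false while developing
// a grammar ("debug mode").
const SAFE_MODE = true;

// Decide how far to advance the cursor after a rule fires.
function advanceAfterMatch(lexeme) {
  if (lexeme.length > 0) return lexeme.length;
  if (!SAFE_MODE) {
    // Debug mode: surface the misbehaved rule to the grammar author.
    throw new Error("0 width match regex");
  }
  return 1;  // safe mode: quietly skip one character, as before
}

console.log(advanceAfterMatch("3.14"));  // 4
console.log(advanceAfterMatch(""));      // 1 (would throw with SAFE_MODE off)
```

The key change versus the old code is that the zero-width case is no longer assumed unreachable; it is either an explicit error or an explicit, intentional skip.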

joshgoebel added a commit to joshgoebel/highlight.js that referenced this issue Jan 31, 2020
Closes highlightjs#2140.

- In safe mode 0 width matches will be safely and quietly ignored
  and advance the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back
into the library in the future.
@egor-rogov
Collaborator

(moved here from #2379)
I realized that debug and safe confuse me. They don't look like opposites.
debug and production?
unsafe and safe?
strict and relaxed?
I can't find a good pair right now.

joshgoebel added a commit that referenced this issue Feb 6, 2020
…r detection (#2379)

* enh(fortran) support intrinsic data types

Closes #1723.

* (parser) throw "0 width match" error for bad regex

Closes #2140.

- In safe mode 0 width matches will be safely and quietly ignored
  and advance the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back
into the library in the future.
@joshgoebel
Member Author

joshgoebel commented Feb 6, 2020

I agree this could be better. I think we can still change it later. I'll make a PR declaring these a private API intended for developer use only. I don't really see why you'd want to use them in general... so renaming them later should affect a very small number of people (and I'd feel fine doing that outside of a major release cycle).

I was probably too creative originally by calling it "safe mode"... :-/

taufik-nurrohman pushed a commit to taufik-nurrohman/highlight.js that referenced this issue Feb 18, 2020
…r detection (highlightjs#2379)

* enh(fortran) support intrinsic data types

Closes highlightjs#1723.

* (parser) throw "0 width match" error for bad regex

Closes highlightjs#2140.

- In safe mode 0 width matches will be safely and quietly ignored
  and advance the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back
into the library in the future.