Some existing rules are unrepeatable and should be corrected #2140
Comments
Hmm. Okay, maybe it's possible to get rid of the existing "ill-behaving" rules. (At least I'm pretty sure Fortran's number regexp can be rewritten.)
They were being annoying, but I've already solved it with a filter... so *shrugs*.
Well, yeah. Just the idea that rules which can match nothing are a little flaky as-is, aren't they? :-) It seems like a good metric to drive to 0 over time. I'm a big fan of "correctness". :-) Though I'm not saying it should be a SUPER high priority or anything. It's also possible this is the only way to write some tricky rules, but without anyone arguing that side, I'm not convinced. :-) It seems logical that a "real" match (with a class, etc.) should be required to match SOMETHING. I could see using look-aheads or something to START blocks and such, though, but that would be different... Actually, lookaheads to start certain blocks could avoid the whole "rewind" issue you have now.
Sure you do. It seems like you could add tests to the suite that would fail if someone added broken patterns. You could just "tally them" (a very inexpensive operation) and, after running the "should match" tests (which are a torture test for bad rules), assert the tally hasn't grown. If you decided it mattered, you could also IMMEDIATELY add such a test. I'm not sure the rules are labeled well enough that it'd be super easy for the tests to tell someone exactly WHERE the bad rule was, but it could definitely detect the bad rules and show them the regex, etc.
I think it would be nice to get rid of such rules, and I agree with your reasoning. But it requires a lot of work, which will take forever with our limited resources.
I might be able to generate a list. Maybe we could tag this Hacktoberfest or something.
A list will be helpful, of course. Even better: an ability of the parser to complain about such rules (dunno if this is doable).
How would it complain exactly? Like when running tests? I thought a "warn" mode would be neat if the test suite allowed that... so tallying up the list and then adding a "WARNING" to the tests... so the tests are green, but you'd still have visibility into this metric over time and could work to reduce it without making it a hard error.
"How to complain" is not so important as "how to produce helpful diagnostics", I guess. Warn mode looks fine, yes.
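The warn-mode idea could look roughly like this (a sketch only; `recordZeroWidth` and `warnSummary` are invented names, not an existing hook in the library):

```javascript
// Hypothetical collector for zero-width matches seen during a test run.
const zeroWidthHits = [];

// Imagined hook the parser could call when it filters out a bad match.
function recordZeroWidth(language, regexSource) {
  zeroWidthHits.push({ language, regexSource });
}

// After the suite: build a warning message (tests stay green), so the
// metric stays visible and can be driven toward 0 over time.
function warnSummary() {
  if (zeroWidthHits.length === 0) return "OK: no zero-width matches";
  return `WARNING: ${zeroWidthHits.length} zero-width match(es): ` +
    zeroWidthHits.map(h => `${h.language}: /${h.regexSource}/`).join(", ");
}
```

In Mocha this could live in a root-level `after()` hook that just `console.warn`s the summary, so nothing fails but the count is printed at the end of every run.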
Well, with the current parser it's hard to get more than a count, since the problem is that once you get inside the match handling, the context is lost. My new parser should open up a lot of possibilities here if I can get it working, since then you could just capture a list of which language/regex was at fault (perhaps its className, etc.)... and work backwards from there. I'll add that to my list after I get the thing working. :-)
Does Mocha have some built-in facilities for "warnings", or would we just add our own custom message to serve that purpose?
Unfortunately I have no idea. Anyway, it's not urgent. I'm a little scared by the words "new parser". Didn't it start out as a small change? (:
Yeah, but then I found a few termites in the walls, then noticed the paint was fading in a few spots, then... 😆 Don't be scared. I promise I'll do my best to make it nice and clean, and you'll have plenty of time to review it. And I think if it turns out to be faster (fingers crossed) that'll speak for itself. The fact that the concept version passes 100% of tests without really any syntax file changes (excluding the markdown bug) says a lot also, I think. I was worried there would be more hidden "gotchas". I'm really just attempting a PURER expression of the original intent. Perhaps "new parser" is saying a bit too much. I think the size of the change will be surprisingly small once I have a clean version.
@egor-rogov Turns out this isn't just bad but also destroys data: :-) |
OMG!.. |
And never believe the source. :-) This piece of code can be reached easily during NORMAL functioning of the parser: also when you have a beginSameAsEnd where multiple pieces of content inside match the begin regex but aren't exactly the same (i.e., they aren't the end)... since the "termination" regex is only compiled once, it doesn't auto-magically detect an update to the begin match. Example:

```ruby
def foo()
  msg = <<-HTML
    <div>
      <h4>#{bar}</h4>
    </div>
  HTML
end
```

`HTML` will trigger a beginSameAsEnd match, but the parser will still match "div", "h4", "h4" (again), and "div" (again) before it finds `HTML` (it's searching for the dynamically generated end match).
Hmm... Not sure I'm following. Do we have a problem, Houston? Is it something you can show in jsfiddle?
Not sure it's a problem, just interesting. And just saying the comment is entirely wrong. :-) It still works since eventually the matcher will find a match that both passes the original "begin" matcher and the new dynamic end matcher. :-) |
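Roughly what the dynamic end matcher is doing (a simplified sketch; `makeEndMatcher` is invented for illustration and is not highlight.js's actual API):

```javascript
// Sketch: beginSameAsEnd builds the end regex from whatever text the
// begin rule captured (here, the heredoc tag "HTML").
function makeEndMatcher(beginMatchText) {
  // Escape the captured text so it is matched literally.
  const escaped = beginMatchText.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp("^\\s*" + escaped + "\\s*$");
}

const isEnd = makeEndMatcher("HTML");
isEnd.test("  HTML");   // the real terminator matches
isEnd.test("<div>");    // inner content does not, so scanning continues
```

Inner lines that merely resemble the begin match fail this test, which is why the parser keeps matching `div` and `h4` until the literal `HTML` terminator appears.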
Marking this as resolved with #2379. A repeating 0 width match is now a thrown error (outside of safe mode) and should (most likely) result in failing tests during auto-detect. So any often-hit bad rules in the core lib will be found by the auto-detect tests. Obscure rules (deeply nested) could still sneak through, but in production (safe mode) they will just be skipped (as before). Someone building a new grammar in debug mode will be notified of the problem, though, so they can fix it rather than it just "silently working" as it did before.
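The guard described above can be sketched roughly like this (names and shape are assumptions for illustration, not highlight.js's actual internals):

```javascript
// Sketch of a lexer loop guarding against zero-width matches.
// `pattern` must be a global (/g) regex so lastIndex is honored.
function scan(text, pattern, safeMode = true) {
  let index = 0;
  const tokens = [];
  while (index < text.length) {
    pattern.lastIndex = index;
    const match = pattern.exec(text);
    if (!match) break;
    if (match[0].length === 0) {
      // A rule matched nothing: in debug mode, fail loudly.
      if (!safeMode) throw new Error("0 width match regex");
      index += 1; // safe mode: skip quietly, advance the cursor 1 step
      continue;
    }
    tokens.push(match[0]);
    index = match.index + match[0].length;
  }
  return tokens;
}
```

In safe mode a misbehaved rule just costs wasted work; in debug mode the thrown error surfaces it immediately during grammar development.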
Closes highlightjs#2140.

- In safe mode 0 width matches will be safely and quietly ignored, advancing the cursor 1 step, as before.
- In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back into the library in the future.
(moved here from #2379) |
…r detection (#2379)

* enh(fortran) support intrinsic data types. Closes #1723.
* (parser) throw "0 width match" error for bad regex. Closes #2140.
  - In safe mode 0 width matches will be safely and quietly ignored, advancing the cursor 1 step, as before.
  - In debug mode a "0 width match" error will be thrown.

This should help prevent such misbehaved rules from sneaking back into the library in the future.
I agree this could be better. I think we can still change it later. I'll make a PR declaring these as a private API intended for developer use only. I don't really see why you'd want to use them in general... so renaming them later should affect a very small number of people (and I'd feel fine doing this outside of a major release cycle). I was probably too creative originally by calling it "safe mode"... :-/
When I say "no matches" I mean unrepeatable by `processLexeme` when it tries to identify the rule that brought the regex engine to a halt. This could be for several reasons, among them what `processLexeme` has access to. Possibly others. The 2nd one may not easily be fixable (but I'm working on that in a separate PR). But I have a feeling there may be more like the two examples below that COULD be fixed.
The two I noticed (same rule) are in fortran.js and irpf90.js:
All the matches here are either look-ahead or optional, so the rule can actually match 0-width spans of text around digits (and does so plenty). The existing system thankfully filters these out, but there must be a cost (esp. when doing auto-analysis) to firing this rule and then ignoring it over and over.
I'd recommend we consider changing these rules so it's impossible to match a 0-width piece of text (and in general discourage rules like that)... Or maybe this is the only way to match such numbers? I dunno. If so, though, it seems the filter should be tweaked a bit and not labeled "Parser should not reach this point", if filtering out rules that sometimes match nothing is just normal daily operation.
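To illustrate the failure mode (the regexes here are made-up stand-ins, not the actual fortran.js rule): when every piece of a pattern is optional or look-ahead, the whole thing happily matches the empty string, while requiring at least one digit rules that out.

```javascript
// Hypothetical stand-in for a number rule whose parts are all optional.
const LOOSE_NUMBER = /\d*(\.\d*)?/;      // every part optional: can match ""
// A stricter variant that must consume at least one digit.
const STRICT_NUMBER = /\d+(\.\d*)?|\.\d+/;

const loose = LOOSE_NUMBER.exec("abc");
console.log(loose[0].length);            // zero-width match at index 0
console.log(STRICT_NUMBER.exec("abc"));  // no bogus match at all
```

The strict variant still matches real numbers like `3.14` or `.5`; it only loses the ability to "match" nothing.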
Here is the current code that handles this (console logging is mine):
Running the full test suite (all green) results in 1,911 occurrences of reaching the point we should never reach.
Removing both the number rules mentioned above still results in 1,693 occurrences.
Thoughts?