Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

regex for linguist grammar compiler #92

Closed
jesslin02 opened this issue Jan 21, 2023 · 6 comments
Closed

regex for linguist grammar compiler #92

jesslin02 opened this issue Jan 21, 2023 · 6 comments

Comments

@jesslin02
Copy link
Contributor

when attempting to add support for lflang to linguist, the grammar compiler detects these issues with the lflang.tmLanguage.json file:

- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`;|(?<=}.*)$|(?<=target[^\{]*)$`": lookbehind assertion is not fixed length (at offset 9))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(^|,|\{)(\s)*)\b([\w\-]+)\b(`...": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\s*=\s*new\s*(\[[^\]]*\])?\s`...": lookbehind assertion is not fixed length (at offset 33))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\w+((\.)|(::)))(\b\w+\b)`": lookbehind assertion is not fixed length (at offset 18))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\breactor\b\s*)(?!at)(?!exte`...": lookbehind assertion is not fixed length (at offset 18))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=\bmethod\b\s+)\w+\b`": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 34))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 39))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(\\.|[^"\\])*)("|$)`": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(\\.|[^'\\])*)('|$)`": lookbehind assertion is not fixed length (at offset 17))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 39))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 39))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`(?<=(input|output|action|state).`...": lookbehind assertion is not fixed length (at offset 34))
- Invalid regex in grammar: `source.lf` (in `syntaxes/lflang.tmLanguage.json`) contains a malformed regex (regex "`;|(?<=}.*)$|(?<=import[^\{]*)$`": lookbehind assertion is not fixed length (at offset 9))
@jesslin02 jesslin02 mentioned this issue Jan 21, 2023
4 tasks
@lhstrh
Copy link
Member

lhstrh commented Jan 21, 2023

The problems that Linguist reports on have to do with the fact that the grammar is trying to do a bit more than just tokenization, which it probably shouldn't do. Essentially, all that this grammar is supposed to do is recognize, based on Oniguruma regular expressions, tokens in a stream of text. In our context-free grammar (https://grammar.lf-lang.org/) you can find these regular expressions also. While TextMate files are often are referred to as a "grammar", they really are not grammars, because most programming languages are context-free, meaning they can't be parsed merely using regular expressions.

In other words, a simplification of the lflang.tmLanguage.json is in order, meaning that some of the more general things that happen in it might have to move into the TypeScript code. Would you be interested in taking a stab at this, @jesslin02?

@petervdonovan
Copy link
Contributor

I started writing at the same time as Marten. I'll still just post this anyway to try to fill in some additional details:

Certain regex features are not universally supported, probably because they are hard to implement efficiently. Apparently non-fixed-width lookbehind is one such feature.

Code highlighting in VS Code is divided between syntactic highlighting and semantic highlighting. Both syntactic and semantic highlighting are performed in a way that is a little bit wrong using code that is completely separate from the parser that is used in the compiler -- this is acceptable because it almost always works properly, and because highlighting is not a very hard requirement. The syntactic highlighting is handled by TextMate; the semantic highlighting is handled by highlight.ts. There isn't a very sharp distinction between syntactic and semantic highlighting, but Marten says that the job of the TextMate grammar should be limited to tokenization.

We have a few options then:

  • Remove the offending rules (which use non-fixed-width lookbehind) from the TextMate grammar and put them in highlight.ts instead. Note that this will make the highlighting on our website and by linguist somewhat less rich because neither use highlight.ts.
  • Patch up the offending rules, but keep them in the TextMate grammar. This will likely make the TextMate grammar a little bit more wrong than it already is; we get to decide what level of wrongness is acceptable to us.

In any case, changes that we make to work around this issue must be tested by visually assessing the highlighting of example LF programs (we don't have regression tests for this). It is wise to use the scope inspector to make sure things work as you think.. Tokens that are highlighted by the TextMate grammar will have scope names that end in ".lflang". We need to avoid significant regressions not only in overall highlighting, but also in highlighting that is just performed by the TextMate grammar. You can also disable semantic highlighting by commenting out this, just to make sure that things still look OK without it.

@lhstrh
Copy link
Member

lhstrh commented Jan 23, 2023

These are great suggestions. Thanks, @petervdonovan!

@jesslin02
Copy link
Contributor Author

thanks @petervdonovan and @lhstrh! do you have any recommendations for example LF programs for visual tests? i've been finding some specific example lines that match the TextMate scope that i would like to modify by trial and error, but i'm not sure if there's a more comprehensive way to go about it.

@lhstrh
Copy link
Member

lhstrh commented Jan 27, 2023

I think that if you focus on unit tests you can get away with using very small code fragments that hit those lines. You don't need complete programs that are able to execute or do anything meaningful.

@jesslin02
Copy link
Contributor Author

resolved through changes in #104

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants