enh(latex) complete ground up rewrite of the LaTeX grammar #2726

Merged
merged 61 commits into from Oct 17, 2020

Conversation

schtandard
Contributor

@schtandard schtandard commented Oct 3, 2020

Resolves #2708. Resolves #2709.


This is a complete redo of the LaTeX language parser. It aims to solve #2708 and #2709 as well as some other problems not reported in the issue tracker but collected on tex.sx. Since TeX is a macro expansion language, its syntax has some peculiarities compared to other programming languages that require special handling by a syntax highlighter. I will try to outline the main issues and my approach to tackling them here.

The Problem: Category Codes

Every character in a TeX program has a so-called category code (catcode for short). Characters with special catcodes like %, {, }, #, &, $, ^ or _ can be recognized by a syntax parser easily enough. It gets a bit difficult when it comes to control sequences, though.

Recognizing control sequences (tokens representing macros or primitive commands) is the main task of any (La)TeX syntax highlighter. They consist of a backslash \ followed by either an arbitrary number of letters or a single non-letter. Whether a character is a letter is decided by its catcode. Usually, letters are what one would expect: a-zA-Z. However, catcodes can differ between files and even change in the middle of a program. It is impossible to determine the correct catcodes of every character in a program without executing it (using TeX). However, there are two very common and standardized ways of changing which characters are considered letters:

  1. The character @ is usually not a letter, but is commonly made one in order to use it in internal macro names. This is done using \makeatletter. \makeatother makes it a non-letter again. An example:

    \@gobble \foo@bar
    \makeatletter
      \@gobble \foo@bar
    \makeatother

    Assuming that the default LaTeX catcodes are in effect, the first line contains the control sequences \@ and \foo. (The rest of the line consists of normal text.) After making @ a letter, the third line contains the control sequences \@gobble and \foo@bar.
    (As you can see, the GitHub syntax highlighter always treats @ as a letter, which is not always correct but a better choice than never treating it as one.)

  2. The long-term future of LaTeX is called LaTeX3 (the current version is called LaTeX2e). It provides a programming layer known as expl3 ("experimental LaTeX3"), a collection of macros that greatly simplifies writing TeX programs. In order to use this programming layer, one has to enter a category code regime where (among some other changes) _ and : are letters. It is entered using \ExplSyntaxOn and left using \ExplSyntaxOff. An example:

    \tl_set_eq:NN \l_tmpa_tl \l_tmpb_tl
    \ExplSyntaxOn
      \tl_set_eq:NN \l_tmpa_tl \l_tmpb_tl
    \ExplSyntaxOff

    With the default LaTeX catcodes, the first line contains the control sequences \tl, \l and \l. It is also utter nonsense and would in fact cause an error when compiling (due to _ being used outside math mode). After an \ExplSyntaxOn however, it consists of the control sequences \tl_set_eq:NN, \l_tmpa_tl and \l_tmpb_tl.
    (As you can see, the GitHub syntax highlighter does not know about this.)

    Importantly, while it is an okay choice to just always treat @ as a letter when highlighting syntax, this is not true for : and _. Unlike @, these characters commonly occur in normal LaTeX code, often directly after control sequences.
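
To make the two examples above concrete, here is a minimal sketch (not code from this PR) of how the set of "letters" changes what a scanner sees. The function name controlSequences and its extraLetters parameter are made up for illustration; a-zA-Z are always treated as letters.

```javascript
// Hypothetical helper: list the control sequences in `source`, treating
// a-zA-Z plus every character in `extraLetters` (e.g. "@" or "_:") as letters.
function controlSequences(source, extraLetters = "") {
  // Escape characters that are special inside a regex character class.
  const escaped = extraLetters.replace(/[\\\]^-]/g, "\\$&");
  // A backslash followed by one or more letters, or by a single non-letter.
  const pattern = new RegExp(
    `\\\\(?:[a-zA-Z${escaped}]+|[^a-zA-Z${escaped}])`,
    "g"
  );
  return source.match(pattern) || [];
}

// Default catcodes: @, _ and : end a control word.
console.log(controlSequences("\\@gobble \\foo@bar"));
// After \makeatletter: @ is a letter.
console.log(controlSequences("\\@gobble \\foo@bar", "@"));
// After \ExplSyntaxOn: _ and : are letters.
console.log(controlSequences("\\tl_set_eq:NN \\l_tmpa_tl \\l_tmpb_tl", "_:"));
```

The same input yields different tokens depending only on the letter set, which is why a single fixed pattern cannot be right for both regimes.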

Both of these cases (and in fact their combination) are already present in many LaTeX packages and online posts, where \makeatletter and \ExplSyntaxOn may not be present.

I write all of this to try and convince you of the following:

  • highlight.js should support both the "normal" LaTeX and LaTeX3.
  • In order to support LaTeX3, we need a separate language file latex3.js and we need to support switching to and from that language using \ExplSyntaxOn and \ExplSyntaxOff.
  • If we are already at it, we can do the same for \makeatletter and \makeatother. Apart from providing correct highlighting, this would also help beginners finding code snippets online (for example on tex.sx) by correctly pointing out where control sequences end. (Not understanding \makeatletter and forgetting to use it are common problems among LaTeX newbies.)

What should be highlighted?

I decided the parser should be able to recognize the following elements:

  • Control sequences (as described above)

  • Macro parameters: These consist of one or more # usually followed by a digit.

    (Actually, the number of # should be 2^n for some n, but the parser doesn't have to know this, in my opinion.)

  • Comments: These start with % and contain the rest of the line.

  • Characters with special catcodes (that have not yet been covered): Namely {, }, $, &, ^ and _.
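
As a rough illustration of the four element kinds listed above, the following sketch tags them in a single line using sticky regexes. The kind names and the function tagLine are placeholders chosen for this example, not the classes or code used in the PR, and the control sequence pattern assumes default catcodes.

```javascript
// Illustrative patterns only; each uses the sticky flag so it can match
// solely at the current scan position.
const ELEMENTS = [
  ["comment", /%.*/y],                               // % up to the end of the line
  ["control_sequence", /\\(?:[a-zA-Z]+|[^a-zA-Z])/y], // default catcodes assumed
  ["parameter", /#+\d?/y],                           // one or more #, optional digit
  ["special", /[{}$&^_]/y],                          // remaining special catcodes
];

// Hypothetical helper: return [kind, text] pairs for one line of input.
function tagLine(line) {
  const tokens = [];
  let pos = 0;
  while (pos < line.length) {
    let matched = false;
    for (const [kind, re] of ELEMENTS) {
      re.lastIndex = pos;
      const m = re.exec(line);
      if (m) {
        tokens.push([kind, m[0]]);
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) pos += 1; // ordinary text, nothing to tag
  }
  return tokens;
}

console.log(tagLine("\\def\\foo#1{a_#1} % x"));
```

Because control sequences are tried before anything else at a backslash, an escaped percent sign (\%) is consumed as a control sequence and does not start a comment.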

I decided not to include (for now):

  • The ^^ syntax: A double ^ followed by any character should be recognized as a single character.

  • Tagging the different elements of LaTeX macro names: The LaTeX3 programming layer follows a strict naming scheme. For example \hljs_do_something:Ncx is a function in the module hljs with the name do_something and the signature Ncx, whereas \l_hljs_some_value_tl is a local (l) variable of the module hljs of type tl with the name some_value.

    I'm not sure if this should be inside or outside the scope of highlight.js.

  • Units, stretch values and the like: These are impossible to differentiate from normal text and the value added by highlighting them is very limited.

  • Math mode: The same is true. While it would be feasible to cover the most common macros that contain / switch to math mode, there will always be non-edge cases that are not recognized. It is clearer to not differentiate at all.

How to parse control sequences

I settled on the following method for detecting control sequences and switching languages ("changing catcodes"):

  • Upon finding a backslash \, enter the mode control_sequence.
    • Now, first check if the control sequence is one that indicates that this is in fact LaTeX (with special catcodes)
    • If so, give it a high relevance.
    • Otherwise, scan for any other control sequence and give it the relevance 0.
  • After the control sequence is complete, check if it was one that changes catcodes.
    • If it was, switch to the corresponding language for the rest of the code.
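
The steps above can be sketched roughly as follows. This is a simplified model, not the grammar code itself: the TELLTALE list is a made-up sample (the actual list of known control sequences is not reproduced here), the value 10 merely stands in for "high relevance", and classify is a hypothetical helper. The language names latex, latex@, latex3 and latex3@ are the four files described in this PR.

```javascript
// Hypothetical sample of control sequences that strongly suggest LaTeX.
const TELLTALE = new Set([
  "\\documentclass", "\\usepackage",
  "\\makeatletter", "\\makeatother", "\\ExplSyntaxOn", "\\ExplSyntaxOff",
]);

// Given a complete control sequence and the current language, return its
// relevance and the language to switch to (or null if nothing changes).
function classify(controlSequence, language) {
  const relevance = TELLTALE.has(controlSequence) ? 10 : 0;
  let atIsLetter = language.endsWith("@");
  let expl = language.startsWith("latex3");
  if (controlSequence === "\\makeatletter") atIsLetter = true;
  if (controlSequence === "\\makeatother") atIsLetter = false;
  if (controlSequence === "\\ExplSyntaxOn") expl = true;
  if (controlSequence === "\\ExplSyntaxOff") expl = false;
  const next = (expl ? "latex3" : "latex") + (atIsLetter ? "@" : "");
  return { relevance, switchTo: next === language ? null : next };
}

console.log(classify("\\ExplSyntaxOn", "latex@"));
console.log(classify("\\foo", "latex"));
```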

Which classes to use

In this first draft, I have not yet decided which CSS classes to use for highlighting. Since LaTeX does not have the same syntax elements as most other languages, there is quite some room for interpretation here. I guess, one should survey the existing color schemes and select keywords that make LaTeX look "right" in most of those.

Any notes on this?

What I'm unclear on

As I am new to highlight.js, there are some things I am not totally clear on:

Does the circular dependency of the languages work?

When I remove the Requires: line from the header of all four language files and compile them with

node ./tools/build.js -n latex latex@ latex3 latex3@

everything works fine.

Adding the Requires: line leads to an error as described in #2725.

However, when I try to compile everything

node ./tools/build.js

the script complains about a circular reference.

Are circular references not allowed at all? That would greatly complicate the task at hand, since LaTeX2e can occur inside LaTeX3 and vice versa.

Could nesting the languages lead to problems?

In my current approach, the parser does not switch back from a sublanguage before the end of the code. After saying \ExplSyntaxOn\ExplSyntaxOff, there is a latex sublanguage inside a latex3 sublanguage.

Avoiding this is not trivial, as \ExplSyntaxOn\makeatletter\ExplSyntaxOff is possible: \ExplSyntaxOff reverts the changes done by \ExplSyntaxOn but not those done by \makeatletter.

Could we instead change the grammar dynamically?

Having an array containing all "letters" that is used when scanning for control sequences and that can be changed when one of the category code changing macros is found would be somewhat more natural. (I guess we would still need several languages for the state at the start of parsing to differ.) Is this possible?
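
To illustrate what I mean by a changeable set of letters, here is a standalone sketch outside of highlight.js (whether the library can express this is exactly the open question). It only models the four switching macros discussed here and ignores comments and all other catcode changes; scanControlSequences and SWITCHES are names invented for this example.

```javascript
// Simplified model: each switching macro edits the set of extra letters.
const SWITCHES = new Map([
  ["\\makeatletter", (letters) => letters.add("@")],
  ["\\makeatother", (letters) => letters.delete("@")],
  ["\\ExplSyntaxOn", (letters) => { letters.add("_"); letters.add(":"); }],
  ["\\ExplSyntaxOff", (letters) => { letters.delete("_"); letters.delete(":"); }],
]);

// Scan `source` for control sequences, updating the letter set on the fly.
// `initialLetters` lets a snippet start in another regime, e.g. "_:".
function scanControlSequences(source, initialLetters = "") {
  const letters = new Set(initialLetters);
  const isLetter = (ch) => /[a-zA-Z]/.test(ch) || letters.has(ch);
  const found = [];
  let i = 0;
  while (i < source.length) {
    if (source[i] !== "\\") { i += 1; continue; }
    let j = i + 1;
    if (j < source.length && isLetter(source[j])) {
      while (j < source.length && isLetter(source[j])) j += 1; // control word
    } else {
      j = Math.min(j + 1, source.length); // single non-letter
    }
    const name = source.slice(i, j);
    found.push(name);
    const apply = SWITCHES.get(name);
    if (apply) apply(letters);
    i = j;
  }
  return found;
}

console.log(scanControlSequences("\\ExplSyntaxOn\\makeatletter\\ExplSyntaxOff \\a_b \\c@d"));
```

In the last call, @ is still a letter after \ExplSyntaxOff while _ is not, which is the mixed state that nested sublanguages struggle to represent.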

Is @ in the language name legal?

The language names latex and latex3 are fairly obvious. I decided to append an @ to the name for those languages where it is a letter. Is this safe or is that character not allowed in language names? (I did not experience any problems in my tests.)

Can control sequences have a relevance of 1 (instead of 0)?

In the existing latex.js placeholder*, control sequences were given a relevance of 0. I left it that way for now.

I would consider it more natural to give them a relevance of 1 in order to also support auto-detection in short code snippets that do not contain any of the more obvious macros that are given high relevance. This is not very pressing, though, I believe, since sites where LaTeX is to be expected can set it as the default language.

I guess the decision was made because the backslash can also appear in paths or as an escape character. I don't feel I have a good sense of how often other code snippets would be falsely identified as LaTeX if it got a relevance of 1 for every backslash.

Should I worry about efficiency?

Every time a backslash is found, latex.js checks for 34 known control sequences for language detection. Is this a lot? Should this be reduced in order not to slow down the parser? I have no idea how fast this thing runs / should run.


I look forward to any feedback or advice!

Four language files are provided:
- latex.js for LaTeX with normal category codes
- latex@.js for LaTeX with @ considered a letter
- latex3.js for LaTeX3 (: and _ are also letters)
- latex3@.js for LaTeX3 with @ considered a letter

The macros that switch between these category code regimes
(\makeatletter, \makeatother, \ExplSyntaxOn and \ExplSyntaxOff) are
recognized and lead to the parser switching languages.
@joshgoebel
Member

joshgoebel commented Oct 3, 2020

Let me try and take a quick very high level pass, and then I'll tackle the questions. I'm not sure this first approach is what we'd want (too specific, too complex)... our goal is to highlight well "most of the time" not be an entirely accurate parser for complex languages (I'm starting to think LaTex might fall under this category). I definitely do not think we want 4 variants and all this interconnectivity. If that's truly the best way to go this might be better as a 3rd party module (where you're free to do things however you see fit). But first let's tackle the questions and see if we can simplify all this.

There are also auto-detect issues with this approach where now all 4 latex languages are fighting over every single latex file...
rather than a single grammar going "yes this looks like latex" then making a "best effort" to highlight it.

First one high level question of my own:

  • What are the real differences between Latex 2 and 3? What is the harm in just assuming everything is version 3 and highlighting accordingly?

Which classes to use... Any notes on this?

We need to try and map things semantically as well as possible... that might mean \blah becomes a keyword, etc... ie, map to the existing CSS classes. You're welcome to look at the styles when doing this, but I still think it's best to try for the best semantic match vs visual matching. It might be possible LaTeX requires custom CSS to get the most benefit if it's so very different from most languages. You're welcome to make a list of which classes are "missing" and we can add that to the larger discussion (same discussion is taking place with Mathematica).

I'm open to new classes in the future but the problem with adding new classes is we have 100+ themes that would then be "broken"... so we're stuck with the existing classes (or aliasing to them) for the most part until a great solution is found for this. So short-term having a grammar that only works with one "blessed theme" is a no-no. But always open to ideas here.

Does the circular dependency of the languages work? When I remove the Requires: line from the header of all four language files and compile them with

True dependencies cannot be circular. Dependencies spell out "this language CANNOT load without this other language". But "sublanguage" is not a hard dependency... if a sublanguage is not loaded, it is simply ignored and will not be highlighted... so you would see no errors until you tried to highlight and did not get the expected result. Of course if you manually build ALL the variants then this would "just work".

If you end up publishing 4 separate modules I think you'd want one master named "latex" that simply "required" all the others... for "convenience" so one could just "build latex"...

Are circular references not allowed at all? That would greatly complicate the task at hand, since LaTeX2e can occur inside LaTeX3 and vice versa.

They are not, but again hopefully we can simplify all this. But again if the dependencies are not "hard" then this may not be a HUGE issue... you just have to make sure someone always builds all the variants.

Could nesting the languages lead to problems? In my current approach, the parser does not switch back from a sublanguage before the end of the code. After saying \ExplSyntaxOn\ExplSyntaxOff, there is a latex sublanguage inside a latex3 sublanguage.

I don't follow this... All tags should be properly closed... that's the very last thing that should happen... But if a sublanguage is still running when the end of content is hit then it should process everything (I think) and then the final closing tags will be added to terminate any nested modes.

Avoiding this is not trivial, as \ExplSyntaxOn\makeatletter\ExplSyntaxOff is possible: \ExplSyntaxOff reverts the changes done by \ExplSyntaxOn but not those done by \makeatletter.

I think you may be trying to make the grammar more complex than is possible/recommended. Generally we're just a pattern highlighter... we do not make huge efforts to "understand" mode and "parse" the languages as you're trying to do here.

Could we instead change the grammar dynamically? Having an array containing all "letters" that is used when scanning for control sequences and that can be changed when one of the category code changing macros is found would be somewhat more natural. (I guess we would still need several languages for the state at the start of parsing to differ.) Is this possible?

The only dynamism within the parser currently is our on:begin and on:end callbacks... they can dynamically choose to skip a match... so you could, say, match ALL identifiers (including "@") and then choose to "skip" matches that are found to have a "@" if you aren't in that "mode". You may want to explore this. Again I worry the complexity might be more than we want, but I'd have to see it in order to judge - and I think if it worked well it'd be preferable to this 4 variant version.

Is @ in the language name legal? The language names latex and latex3 are fairly obvious. I decided to append an @ to the name for those languages where it is a letter. Is this safe or is that character not allowed in language names? (I did not experience any problems in my tests.)

I think so, but again I'm not really comfortable with all these variants, so we need to try and figure something else out here.

Can control sequences have a relevance of 1 (instead of 0)? In the existing latex.js placeholder*, control sequences were given a relevance of 0. I left it that way for now.

Not opposed. Most matches are allowed a relevance of 1. It might be worth exploring why it was defaulted to 0 in the past... often this is done when there are a lot of other languages that do something similar (or it's a pattern that occurs often in the wild making it a poor "heuristic").

sites where LaTeX is to be expected can set it as the default language.

Setting language manually always produces the best results.

I guess the decision was made because the backslash can also appear in paths or as an escape character.

Yeah, perhaps... often you can just test this (to some degree) by changing something and running the test suites and seeing if suddenly there are auto-detect problems with our own set of other languages and sample files..

Should I worry about efficiency? Every time a backslash is found, latex.js checks for 34 known control sequences for language detection. Is this a lot? Should this be reduced in order not to slow down the parser? I have no idea how fast this thing runs / should run.

There is nothing wrong with a fairly large | regex expression, many of our grammars do that...
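
For illustration, such an alternation could be built from a list like this. The names below are just a hypothetical sample, not the 34 from the PR, and the lookahead assumes that @, : and _ may all continue a control word.

```javascript
// Hypothetical sample of well-known macro names (without the backslash).
const NAMES = ["documentclass", "usepackage", "begin", "end", "item"];

// One regex: backslash, any of the names, and then no further "letter",
// so that \begin matches but \beginning does not.
const KNOWN_CONTROL_SEQUENCE = new RegExp(
  `\\\\(?:${NAMES.join("|")})(?![a-zA-Z@:_])`
);

console.log(KNOWN_CONTROL_SEQUENCE.test("\\begin{document}"));
console.log(KNOWN_CONTROL_SEQUENCE.test("\\beginning"));
```

The regex engine compiles the whole alternation once, so checking it at every backslash costs a single match attempt rather than one per name.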

@joshgoebel
Member

So first... let's try and kill the @ and non-@ variants... What is the harm in ALWAYS assuming @ COULD possibly be a character? Shouldn't non-@ LaTeX simply not use the @ in its control sequences... making it a non-issue? Ie, @ would be a superset of non-@? Or am I missing something?

@joshgoebel
Member

joshgoebel commented Oct 3, 2020

(As you can see, the GitHub syntax highlighter always treats @ as a letter, which is not always correct but a better choice than never treating it as one.)

I'd suggest we simply do this as I worry about the complexity of doing anything more complicated.

...or perhaps play with a stateful toggle set in an on:begin rule... so you always match with @ but then depending on the mode you skip matches that don't match the mode. You could try that and we could see what it looks like... the problem is we don't currently provide any way to "reset" state to a clean slate for subsequent highlighting so you could quickly get in a state where you have buggy/inconsistent behavior when highlighting multiple things.

That's probably something we could fix if we're going to encourage/allow this type of usage of rules (currently we don't and there is no precedent)... but I'm about to consider something similar to support JSX in JS/TS... so this may be worth exploring
if you're game.

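// shared flag toggled by the first rule
let at_mode = false;

{
  // zero-width match directly after \makeatletter
  begin: /(?<=\\makeatletter)/,
  end: /0^/, // can never match, so the mode stays open until the end of input
  "on:begin": () => { at_mode = true; }
}
// some other rule
{
  begin: CONTROL_SEQUENCE_RE, // placeholder for a pattern that might include an @
  "on:begin": (match, response) => {
    // skip matches containing @ unless \makeatletter has been seen
    if (match[0].includes("@") && !at_mode) { response.ignoreMatch(); }
  }
}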

This state is going to be "global" though so may not work well with nested sublanguages (so you couldn't mix the two approaches).

@schtandard
Contributor Author

schtandard commented Oct 3, 2020

Ie, @ would be a superset of non-@? Or am I missing something?

A control word in TeX (i.e. a control sequence consisting of letters) is terminated by the first non-letter, regardless of what that is. That is, in the string \foo@bar, the @ actually tells TeX that the control word is over, because it is not a letter. Any non-letter terminates the control word.

Is this a large issue? No. The @ character does not occur regularly in most texts (which is what is typeset with LaTeX), so it is reasonable to just assume it is a letter. This could lead to wrong results when an email address is typeset like \name@domain.com (here the highlighter would highlight \name@domain as a control sequence even though only \name would be right), but that is very uncommon.

In short, always treating @ as a letter is fine. This would bring the number of languages down to 2.


This does not work at all for : and _, though. Let's consider _: This character usually denotes a subscript in math mode. Code like

\theta_a + \xi_a = \int_\infty^0 \partial_x f(x, y) \mathrm{d} y

is very common. (Here, the underscores are not part of the macro names.)

However, in the LaTeX3 syntax, every macro contains at least two underscores. These two contexts need different highlighting.


Having two different languages for LaTeX and LaTeX3 is not unreasonable, I'd say. While both run on TeX, they really are completely different languages, at least in terms of how they should be highlighted.

@schtandard
Contributor Author

Selecting classes semantically is a bit tricky, as most of the classes don't map nicely onto LaTeX. For example, a macro could both be a function (i.e. a macro containing code) or a variable (i.e. a macro containing data) and would look exactly the same. This is different in LaTeX3, though. I will think about this some more.

@joshgoebel
Member

joshgoebel commented Oct 3, 2020

I'm still a little hazy on Latex vs 3...

\theta_a + \xi_a = \int_\infty^0 \partial_x f(x, y) \mathrm{d} y

Is this supposed to be LaTex or LaTex 3? Which mode is this in? I kind of get the "could be an exponent" but is the _ just silently dropped then and it's serving as a space?

Having two different languages for LaTeX and LaTeX3 is not unreasonable

I'm not 100% opposed here, but want to understand this thoroughly... how does the parser detect one from another? Switching in and out inside a single file probably really isn't something we want to get into... ie this isn't something we'd want to handle with context and sublanguage... if we had two grammars then a file would either be "ALL latex" or "ALL latex3"... so we're not going to understand something like:

\tl_set_eq:NN \l_tmpa_tl \l_tmpb_tl
\ExplSyntaxOn
  \tl_set_eq:NN \l_tmpa_tl \l_tmpb_tl
\ExplSyntaxOff

We might see "\ExplSyntaxOn" (give it a high relevancy) and then decide that the file is "latex3" and the WHOLE file would be highlighted as such...


If you're willing to explore a bit more I wonder if this couldn't be solved with callbacks as I was suggesting for the @ issue earlier? IE, a single grammar with a "latex3" state machine... or simply moving the whole "latex3" ruleset inside a mode in latex? Then all the commonalities should be shared... and there is a single "latex"...

Right now it seems you have "basic latex" until you hit the \ExplSyntaxOn directive then it's LaTex3, is that correct?

@joshgoebel
Member

Is there such a thing as "LaTeX3" if said file does not include a \ExplSyntaxOn directive?

@schtandard
Contributor Author

Is this supposed to be LaTex or LaTex 3?

Sorry for not being clear. This is LaTeX in math mode. (LaTeX is for typesetting documents and has two different modes: text mode and math mode. The typesetting rules are quite different in both, but not the language rules, i.e. a control sequence looks the same in either. A code highlighter does not need to know the difference.) Here, a ^ means what comes next is superscript and _ means what comes next is subscript.

@joshgoebel
Member

joshgoebel commented Oct 3, 2020

            {
              begin: /(?<=\\ExplSyntaxOn)/,
              end: /0^/,
              subLanguage: 'latex3'
            }

What is end trying to match?

We may be closer than you think... I'm suggesting latex3 may be a submode instead of a sublanguage. sublanguage may change drastically in the future and I'd like to avoid encouraging its use for complex stuff if there are other, better ways. So if latex3 only exists inside of latex then it really shouldn't be a sublanguage... like JSX only exists inside JS so there is no "JSX" language rather it's a submode of JS/TS.

IE something like:

// keep the separation of concerns if it's nicer for organization.
// latex3 = require lib/latex3
{
  begin: /(?<=\\ExplSyntaxOn)/,
  end: /0^/,
  contains: latex3.contains
}

@schtandard
Contributor Author

how does the parser detect one from another?

Right now it seems you have "basic latex" until you hit the \ExplSyntaxOn directive then it's LaTex3, is that correct?

Exactly. In a complete document, the parser always starts in LaTeX2e mode and only switches over when it executes an \ExplSyntaxOn: The TeX engine does not load the entire file, parse it and then execute it. Instead, it only reads the first thing in the file, executes that (which may include a lot of steps of macro expansion) and only then moves on to the next thing. Crucially, the rules for how the parser should scan ahead for "the next thing" or what that thing means can be changed in the process (via catcodes). Keeping track of all the possible ways to do this is way outside the scope of any syntax highlighter. Switching over to LaTeX3 catcodes is rather common though (and will only become more common in the future) and profoundly changes the appearance of the code (which is what highlighters care about). This is why I believe this case should be supported.

I think it's getting a bit confusing to talk about LaTeX and LaTeX3 since the names are so similar. Let's instead call them typesetting mode and programming mode. Their use somewhat resembles the use of php in html. (Don't take this too literally, but only looking at the syntax it is close enough to true.)

Now, while the parser always starts out in typesetting mode when compiling a document, that does not mean that it is in typesetting mode at the start of every file. Files can \input other files, which means that the parser will read that file in whatever state it currently is. A file may be completely in programming mode, with no \ExplSyntaxOn present. This is even more true for code snippets online.

I think it would be doable to put both typesetting and programming mode in one file, but unless there is a way for the language to look at the whole file in order to decide in which mode to start, this won't work. Could users decide which mode to start in, like they can decide which language to use?

@schtandard
Contributor Author

What is end trying to match?

It is trying to match the end of the file (which is why it has a regex that matches nothing). I did this in order to accommodate for arbitrary movement between the four sublanguages. If we kick two of them, this won't be necessary anymore (as the only place to go from the second language is back to the first).
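
With only two regimes left, the submode suggested above could get a real end pattern. A rough sketch of what I have in mind follows; it has not been tried against highlight.js, and the function name, the class name keyword and the exact patterns are placeholders.

```javascript
// Hypothetical grammar fragment: programming mode as a submode of latex.
function latexGrammarSketch() {
  // Control sequences with @ as a letter (typesetting mode).
  const TYPESETTING_CS = {
    className: "keyword",
    begin: /\\(?:[a-zA-Z@]+|[^a-zA-Z@])/,
    relevance: 0,
  };
  // Control sequences with @, _ and : as letters (programming mode).
  const PROGRAMMING_CS = {
    className: "keyword",
    begin: /\\(?:[a-zA-Z@_:]+|[^a-zA-Z@_:])/,
    relevance: 0,
  };
  const PROGRAMMING_MODE = {
    begin: /(?<=\\ExplSyntaxOn)/, // zero-width: directly after \ExplSyntaxOn
    end: /(?=\\ExplSyntaxOff)/,   // zero-width: directly before \ExplSyntaxOff
    contains: [PROGRAMMING_CS],
  };
  return { name: "LaTeX", contains: [PROGRAMMING_MODE, TYPESETTING_CS] };
}

const sketch = latexGrammarSketch();
console.log(sketch.contains[0].contains[0].begin.exec("\\tl_set_eq:NN x")[0]);
```

Because the end pattern is a lookahead, \ExplSyntaxOff itself would be left for the outer rules to highlight.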

@schtandard
Contributor Author

schtandard commented Oct 3, 2020

Is there such a thing as "LaTeX3" if said file does not include a \ExplSyntaxOn directive?

Yes. The whole file might be LaTeX3.

@joshgoebel
Member

joshgoebel commented Oct 3, 2020

Exactly. In a complete document, the parser always starts in LaTeX2e mode and only switches over when it executes an \ExplSyntaxOn:

Perfect. So I understand that correctly.

Now, while the parser always starts out in typesetting mode when compiling a document, that does not mean that it is in typesetting mode at the start of every file. Files can \input other files ... [also] code snippets online.

I think it would be doable to put both typesetting and programming mode in one file, but unless there is a way for the language to look at the whole file in order to decide in which mode to start, this won't work. ...

And now we have the crux of the problem. This is why we have php and php-template (xml with embedded PHP) though I HATE the necessity of it - and not sure we have the right tools to handle this sort of thing. :-) One of our 3rd party grammar modules has a similar issue and solves it with two grammars (although they are both built from a single JS function)...

Can we make this a bit less abstract? Can you pick a small (but telling) snippet of LaTex3 and highlight it with both "Latex" and "Latex3" (and show snaps) so I can visualize what differences we're talking about? IE, how does "latex" misunderstand latex3? Is it just the _ and : stuff? Still I'm a visualizer...

Or perhaps two side by side examples would help... "this is latex"... "this is latex3"...

@joshgoebel
Member

For auto-detect if a file does not include "\ExplSyntaxOn" isn't deciding whether it's latex or latex3 a lost cause? That's a consideration here also.. php and php-template work with auto-detect because XML is very different than PHP and can pick up a lot of relevancy from HTML content... So php-template will score much higher than php against a template..

@egor-rogov
Collaborator

egor-rogov commented Oct 3, 2020

Why not treat any macro name with at least two underscores in it as LaTeX3? I believe it's a simple and satisfactory solution (given that commands like \theta_a_b are incorrect for LaTeX2e).
In any case there is no 100% accurate way to handle switches between 2e/3 modes - we need a LaTeX interpreter for that (e. g. I can create my own macro to invoke \ExplSyntaxOn in it).

I'm +1 to always treating @ as a letter. It is simple and covers the absolute majority of cases.

Units, stretch values and the like: These are impossible to differentiate from normal text and the value added by highlighting them is very limited.

I think we'd better highlight them. I can hardly imagine normal text with, say, 3em in it. Even so, it would do no harm to highlight it.

Can control sequences have a relevance of 1 (instead of 0) ... to also support auto-detection in short code snippets

We can give high relevance to some widely used command like \begin, \end, \item, and so on. But it's a minor issue.

The @ variants of the languages are dropped in favor of always treating
@ as a letter.

The Requires line from the header is also dropped in order to avoid
circular references.
@schtandard
Contributor Author

Can you pick a small (but telling) snippet of LaTex3 and highlight it with both "Latex" and "Latex3" (and show snaps) so I can visualize what differences we're talking about?

Sure. Here's a snippet of both with just the control sequences highlighted. (Of course, the LaTeX renderer would additionally misinterpret the underscores as math-subscripts if that were highlighted.) The first image is normal LaTeX highlighting, the second one is LaTeX3 highlighting.

latex-highlighting
latex3-highlighting

For auto-detect if a file does not include "\ExplSyntaxOn" isn't deciding whether it's latex or latex3 a lost cause?

No, we can still look for macro names that look like LaTeX3, which actually already worked pretty well in this draft.

@schtandard
Contributor Author

Why not treat any macro name with at least two underscores in it as LaTeX3?

After reconsidering I discovered that my claim that all LaTeX3 macros have at least two underscores is simply wrong (the screenshots above show some examples). This is still feasible, however. In fact, I really like this idea.

We can look for macros that would adhere to the LaTeX3 naming convention if LaTeX3 catcodes were in effect. This is basically what I already did for language detection. If people don't follow the naming convention, this will of course fail, but that's fine I'd say.

@egor-rogov
Collaborator

Pattern can be e. g. like this: "letters, followed by underscore, followed by at least two letters" (in LaTeX2e you'd write \foo_{bar}, not \foo_bar - although the latter is formally correct), or "letters and underscores, followed by colon, followed by one or more letters", or something like this.
So the idea is to put everything in one grammar. This should be much simpler.
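
A possible reading of these two patterns as regexes, purely as a sketch (the names UNDERSCORE_FORM, COLON_FORM and looksLikeLatex3 are made up, and @ is assumed to be a letter throughout):

```javascript
// "letters, followed by underscore, followed by at least two letters"
const UNDERSCORE_FORM = /^\\[a-zA-Z@]+_[a-zA-Z]{2,}/;
// "letters and underscores, followed by colon, followed by one or more letters"
const COLON_FORM = /^\\[a-zA-Z_@]+:[a-zA-Z]+/;

// Hypothetical helper: does the text starting at a backslash look like a
// LaTeX3-style macro name?
function looksLikeLatex3(text) {
  return UNDERSCORE_FORM.test(text) || COLON_FORM.test(text);
}

console.log(looksLikeLatex3("\\l_tmpa_tl"));    // variable-style name
console.log(looksLikeLatex3("\\tl_set_eq:NN")); // function-style name
console.log(looksLikeLatex3("\\theta_a"));      // math subscript
console.log(looksLikeLatex3("\\foo_{bar}"));    // braced subscript
```

As noted, something like \foo_bar in math mode would be misread as LaTeX3 by this heuristic; that trade-off is accepted here.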

@schtandard
Contributor Author

I think we'd better highlight them. I can hardly imagine normal text with, say, 3em in it.

Hmm, that's true. I'm still not sure about this, though. For consistency I would want to highlight counter values as well. After all, assigning a value to a counter and assigning a value to a dimension are pretty similar. But I'd find highlighting numbers in the document text rather confusing as they're not different from other characters.

I guess I just don't see the value of highlighting dimensions: When they appear in an assignment, they are surrounded by other stuff that is highlighted, namely macros and braces, so they stand out anyway. From a language standpoint they are nothing special: They are not literals as in other languages, they are just characters. What makes them special is the context, which is highlighted.


Highlighting stretch values (i.e. plus and minus) is not feasible in any case, I think.

@schtandard
Contributor Author

> So the idea is to put everything in one grammar. This should be much simpler.

Agreed. I will commit some changes tomorrow.

Anything that looks like a LaTeX3 macro is now treated as such,
otherwise LaTeX2e rules apply (with @ being a letter).
While TeX only knows the double caret, LuaTeX and XeTeX accept up to
six carets followed by the same number of lower-case hex digits. This
has also been included.
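
A rough sketch of how the caret notation described in this commit message could be matched, taking "up to six carets" to mean two through six. The names `CARET_HEX`, `CARET_CHAR` and `findCaretChars` are made up for illustration, and the form `^^` followed by a single ASCII character is added from classic TeX. This is not necessarily the expression used in the grammar.

```javascript
// Try the longest form first so six carets are not read as a shorter form.
const CARET_HEX = [6, 5, 4, 3, 2]
  .map(n => `\\^{${n}}[0-9a-f]{${n}}`)
  .join('|');

// Classic TeX also accepts ^^ followed by one ASCII character (e.g. ^^M).
const CARET_CHAR = new RegExp(`(?:${CARET_HEX}|\\^{2}[\\u0000-\\u007f])`, 'g');

function findCaretChars(source) {
  // String.prototype.match returns null when nothing matches.
  return source.match(CARET_CHAR) || [];
}

console.log(findCaretChars('a^^41b'));    // [ '^^41' ]
console.log(findCaretChars('^^^^20ac'));  // [ '^^^^20ac' ]
console.log(findCaretChars('x^2'));       // []
```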
@joshgoebel
Member

Ok, that was a bear to review. I think that's all I got though. :-) I can smell the finish line. 👃

@joshgoebel
Member

This is one of only two blockers for 10.3 now! :-)

schtandard and others added 2 commits October 16, 2020 20:42
Co-authored-by: Josh Goebel <me@joshgoebel.com>
@joshgoebel
Member

@egor-rogov I'm going to push this home once all the tiny little final changes are made here (so I can release 10.3). I think at this point we've dropped most of the "controversial" things discussed earlier that we had diverging feelings on. And of course this isn't the last word on the topic; hopefully the grammar will continue to evolve, as all our grammars do.

@schtandard Thanks for all your hard work on this and for letting us put you through the wringer to polish this into something that fits nicely in the scope of the core library. It's a great contribution.

@joshgoebel
Member

What's your @name on Meta Stack Overflow, if you're there? I'll definitely credit you there when I mention the new release (I was planning to respond to a LaTeX thread anyway).

@schtandard
Contributor Author

@joshgoebel Thanks for coaching me through this, that was really helpful!

I hope to do some more work on this in the future, but it may take a little more time now that we've gotten this far. :-)

@schtandard
Contributor Author

> What's your @name on Meta Stackoverflow if you're there... I'll definitely credit you there when I mention the new release (I was planning to respond to a latex thread anyways)...

My name there is schtandard, thanks for the thought. I hope the powers that be will update to 10.3 in a timely manner.

@joshgoebel
Member

I think I'm mostly good. Let me know when you've gotten any last comments added, etc...

@schtandard
Contributor Author

@joshgoebel I think I'm done. Maybe have a look at the comments to check for typos, etc.

I also reordered one of the regexes to make it reflect how I describe the rule more closely (it also got shorter). Did not change the matching behavior, though.

@joshgoebel merged commit 5d8e97b into highlightjs:master on Oct 17, 2020

@JamesTheAwesomeDude
Contributor

Is the highlighting bug in the attached screenshot an instance of this issue, or should it be opened separately?

image

@joshgoebel
Member

This has been merged. Please open a new issue.

@PhelypeOleinik

@JamesTheAwesomeDude That looks like a screenshot from Stack Exchange: they haven't updated yet, so the highlighting you see is from the old version (the one that kickstarted the rewrite in this PR). That should be highlighted correctly now; see https://tex.meta.stackexchange.com/q/8665/134574 and https://tex.meta.stackexchange.com/q/8688/134574

{
className: 'formula',
contains: [COMMAND],
begin: '\\\\begin(?=\\s*\\r?\\n?\\s*\\{' + envname + '\\})',
Member

We'll need to change this; it can result in polynomial backtracking, because the two \s* runs will duel with each other: there is no guarantee that there will be any content between them.

Can this be replaced with a single character class, [\s\r\n]? If not, we may need two different paths here to avoid backtracking.

Contributor Author

Yes, I think it can. The real rule is that there may only be one newline, but that's really not so important after \begin. I'll try to make a PR shortly.

(Sorry for being so unresponsive lately. It's been difficult to find time to work on this besides my other obligations. Times will change again, though.)

Member

I took a quick pass:

begin: '\\\\begin(?=[ \t]*(\\r?\\n[ \t]*)?\\{' + envname + '\\})',

Thoughts?

Passes all your tests.
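
To make the difference concrete, here is a small sketch comparing the submitted lookahead with the revised one from this comment. The value of `envname` is a stand-in chosen for illustration, and the variable names `original` and `revised` are not from the grammar.

```javascript
const envname = 'verbatim'; // hypothetical environment name

// As submitted: two adjacent \s* runs can split the same whitespace in many
// ways, which is where the backtracking on failed matches comes from.
const original = new RegExp('\\\\begin(?=\\s*\\r?\\n?\\s*\\{' + envname + '\\})');

// Revised: horizontal whitespace, then at most one newline, then horizontal
// whitespace again. The newline separates the two runs, so there is only
// one way to divide the input between them.
const revised = new RegExp('\\\\begin(?=[ \t]*(\\r?\\n[ \t]*)?\\{' + envname + '\\})');

console.log(revised.test('\\begin{verbatim}'));       // true
console.log(revised.test('\\begin \n  {verbatim}'));  // true
console.log(original.test('\\begin\n\n{verbatim}'));  // true
console.log(revised.test('\\begin\n\n{verbatim}'));   // false
```

The last two lines show a side effect of the change: the revised pattern also enforces the "only one newline" rule, which the original did not.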

Contributor Author

Yes, that's what it should be. (I think I thought \s didn't match newlines, which also seems to be wrong.) I don't know if there are any (dis)advantages to making the group non-capturing, performance-wise?
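
As a quick check of the point about \s, the following sketch tests a few sample characters against \s and against the narrower class [ \t]. The variable names are illustrative only.

```javascript
// \s covers space, tab, carriage return and line feed (among others),
// so a class like [\s\r\n] matches nothing that \s alone does not.
const samples = [' ', '\t', '\r', '\n'];
const whitespaceClass = samples.map(ch => /\s/.test(ch));
const horizontalOnly = samples.map(ch => /[ \t]/.test(ch));

console.log(whitespaceClass); // [ true, true, true, true ]
console.log(horizontalOnly);  // [ true, true, false, false ]
```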

Member

> Don't know if there are any (dis)advantages to making the group non-capturing performance-wise..?

I've seen mixed results on this, so typically I don't worry about captures unless I'm using references or one of the few rules that use references.
