
(swift) Improved highlighting for functions, initializers, and subscripts #2930

Merged: 6 commits, merged on Dec 28, 2020

Conversation

svanimpe (Contributor)

This PR improves highlighting for functions, initializers, and subscripts. This includes detailed highlighting for generic parameters as well as function parameters.

To differentiate between tuple elements and function params, I've also added a mode to match tuples, whose element names should not be highlighted as params.
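
For illustration, in a declaration like the following (a hypothetical example, not taken from this PR's tests), label and point should be highlighted as parameter names, while the tuple element names x and y should not:

  // `label` and `point` are function parameter names;
  // `x` and `y` are tuple element names and should not get the params scope.
  func plot(label: String, at point: (x: Int, y: Int)) -> String {
      return "\(label) at (\(point.x), \(point.y))"
  }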

Fixes #2895
Fixes #2857

One open question: Do we still need the illegal attributes?

@svanimpe (Contributor, Author)

As explained on #2908, I had to revert some of your keyword changes and change keywords with parentheses back to regexes, so they're included in the KEYWORD_MODES and not matched by TUPLE first.

@joshgoebel (Member) left a comment

Feels a bit like you're still asking yourself "how do I parse the language" vs "what's the simplest rule that could solve this problem"... that tends to result in very large and complex rulesets... so a lot of my thoughts are on that and on simplifying.

@svanimpe (Contributor, Author)

I totally understand, but in this case I don't think we can simplify it much. My first attempt didn't include a rule for tuples, and as a result, every blah: in a function declaration would match as param, even if it was part of a tuple or function type, and not a parameter name. I added the tuple rule to differentiate between params in the function declaration, and every other use of tuples/function types.

@joshgoebel (Member) commented Dec 19, 2020

I added the tuple rule to differentiate between params in the function declaration, and every other use of tuples/function types.

I might be missing something here... isn't that what FUNCTION_PARAMETERS is for? You already have a scope for treating params differently if need be, no?

Update:

subscript<X: A>(_ p: @attribute inout (x: Int, var: Int) = (0, 0))  { }

Ok I think I see now, you're trying to differentiate between:

  • function parameter names (p)
  • tuple element names (x, var)

So first... what are the x and var here exactly (is this whole thing a type spec)? GitHub seems to take the opposite approach and color the tuple names but not the param names... (not that we should follow them or anything, but it's interesting)

@joshgoebel (Member) commented Dec 19, 2020

Let's try to simplify the other things first and then we can circle back to tuple at the very end. If we insist on function params being treated differently than the tuple stuff then we may indeed need a separate rule. Though then I think I might at least consider something like (roughly):

  const TUPLE = {
    // make sure it's really a tuple
    begin: /\(/ ... lookahead(/\s*/, concat(Swift.identifier, /\s*:/)),
    end: /\)/,

This prevents false positives on blah(blah) style keywords and also prevents TUPLE from firing on non-tuple like expressions.
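
Filled out, that rough sketch could look something like this (a hypothetical completion only, reusing the concat/lookahead helpers and the Swift.identifier pattern mentioned above):

  const TUPLE = {
    // only enter this mode when an element label (identifier followed by a colon) is ahead,
    // so plain parenthesized expressions and blah(blah)-style keywords don't trigger it
    begin: concat(/\(/, lookahead(concat(/\s*/, Swift.identifier, /\s*:/))),
    end: /\)/
  };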

@joshgoebel (Member) commented Dec 19, 2020

I did have an idea the other day about throwing errors if keywords don't match $pattern (to help grammar creators) but you could extend the idea to simply auto-handle common situations... like:

  • default to \w+
  • add any OTHER characters found in keywords to the group

So given:

keywords: "booger? what!"

You'd end up with $pattern as /[\w?!]+/ auto-generated without needing any other effort. And then you'd only need to manually adjust $pattern if you needed to REDUCE the scope (if \w was too wide) or target the pattern more precisely.
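
A hypothetical sketch of that auto-derivation (derivePattern is an invented name; this is not an existing highlight.js API):

  // Default to \w and add any other characters found in the keyword list.
  function derivePattern(keywords) {
    const extras = new Set();
    for (const ch of keywords.replace(/\s+/g, "")) {
      if (!/\w/.test(ch)) extras.add(ch.replace(/[\\\]^-]/g, "\\$&")); // escape for a character class
    }
    return new RegExp(`[\\w${[...extras].join("")}]+`);
  }

  derivePattern("booger? what!"); // => /[\w?!]+/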

Would that help simplify some of your issues with keywords? Look-behind will come along eventually and solve a lot of the . guard issues... (though we may need multiple $patterns which I'm not opposed to) then we're just left with the sequencing issues (modes first vs keywords last).

Once #2834 is merged there is all sorts of room for exploration regarding extending keywords to add sugar, I think the hard part is the syntax and the docs (making sure it's understandable). For example what syntax would you prefer for specifying sequencing?

It goes without saying that I think there is much value in specifying keywords in a single place/constant and having them "just work" rather than writing a ton of individual keyword rules. This simplicity is amazing and elegant in the many simpler languages where it already "just works".

keywords: {
  blah: "blah(boo) ...",
  keywords: "begin end"
}

How do we say that blah should be run "early" vs keywords running "late"?

@joshgoebel (Member) commented Dec 19, 2020

I just realized that a lot of the complexity you are adding comes from not embracing the keywords system, or not fully understanding how it works. When you force keywords to always be a mode, then they must consume content (vs. being found inside the current buffer). This leads to complex, non-typical patterns like:

{
  begin: lookahead("func"),
  contains: [ FUNC_KEYWORD, ...

Where you force begin not to consume because you need FUNC_KEYWORD to consume... If begin consumed then FUNC_KEYWORD would have nothing to consume, etc... keywords and beginKeywords avoid all this complexity because they highlight within the match/current buffer. There are times that works against you, but many times it can work for you.
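
For contrast, a minimal sketch of the beginKeywords style being described (illustrative only, not the actual rule from this PR): the func keyword is highlighted from the matched buffer itself, with no separate consuming FUNC_KEYWORD mode.

  const FUNCTION_DECL = {
    beginKeywords: "func", // highlighted within the match, nothing left to "consume"
    end: /\{/,
    excludeEnd: true,
    contains: [
      // generic-parameter and parameter rules would go here
    ]
  };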

If real-time chat would ever be more helpful you're welcome to join our Slack.

@svanimpe (Contributor, Author)

I pushed a new commit with some simplifications regarding keywords, and the spaces in the function declaration.

subscript<X: A>(_ p: @attribute inout (x: Int, var: Int) = (0, 0)) { }
Ok I think I see now, you're trying to differentiate between:
function parameter names (p)
tuple element names (x, var)
So first... what are the x and var here exactly (is this whole thing a type spec)?

x and var are tuple element names here, and (x: Int, var: Int) is a tuple type.

Let's try to simplify the other things first and then we can circle back to tuple at the very end. If we insist on function params being treated differently than the tuple stuff then we may indeed need a separate rule. Though then I think I might at least consider something like (roughly):

  const TUPLE = {
    // make sure it's really a tuple
    begin: /\(/ ... lookahead(/\s*/, concat(Swift.identifier, /\s*:/)),
    end: /\)/,

This prevents false positives on blah(blah) style keywords and also prevents TUPLE from firing on non-tuple like expressions.

This would make sense in the global scope, but would cause problems in function declarations. There, I need TUPLE to match wide, so it consumes all nested parentheses: both regular tuples and the parameter lists of function types, neither of which is required to use identifiers (but may do so).

You'd end up with $pattern as /[\w?!]+/ auto-generated without needing any other effort. And then you'd only need to manually adjust $pattern if you needed to REDUCE the scope (if \w was too wide) or target the pattern more precisely.

This could cause many other problems. A keyword is a very specific sequence of characters in a specific order. Simply allowing additional characters would match too broadly. For example, open is currently a contextual keyword, but we can't match open! or !open as keywords, because those are valid expressions using operators. So if init! is a keyword, we can only match init!, and nothing else that happens to contain an exclamation point.

It goes without saying that I think there is much value in specifying keywords in a single place/constant and having them "just work" rather than writing a ton of individual keyword rules. This simplicity is amazing and elegant in the many simpler languages where it already "just works".

The problem here is that Swift is no longer a "simple" language like those from earlier generations. There is no such thing as a list of old-school, globally reserved keywords in Swift. A lot of keywords are contextual and can be valid identifiers in other contexts. For example, the keyword open is contextual because it only applies to class declarations. So in open class X { } it's a keyword, but in var open = true it's an identifier.
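
A short snippet illustrating that contextual-keyword behavior (hypothetical example):

  open class Window { }      // `open` is an access-level modifier here (keyword)

  var open = true            // `open` is an ordinary identifier here
  if open { print("open") }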

The only way to highlight these properly is to do as much parsing as possible. That's how you end up with a grammar like TextMate's 😕 Right now, I'm just trying to get HLJS to a point where it's usable for me, but long-term, I may have to move to a semantic highlighter that isn't based on regular expressions.

@joshgoebel (Member) commented Dec 22, 2020

The only way to highlight these properly is do as much parsing as possible. That's how you end up with a grammar like TextMate

Yes, that is my concern. While we don't have any strict "size requirements" per se, Swift has rapidly gone from just 5kb to 20kb... and it's included in our common bundle, so that makes the size slightly relevant. It has surpassed JS/TS, which is one of our other more complex/larger grammars that isn't just pure keywords, and which is only 14kb (if you count the ECMAScript keywords as well).

I'm starting to wonder what the upper limit on fidelity is... because I have a feeling you could easily end up with a very large grammar if you just kept cramming things in... let's take just one example:

  const AVAILABLE_ATTRIBUTE = {
    begin: /(@|#)available\(/,
    end: /\)/,
    keywords: {
      $pattern: /[@#]?\w+/,
      keyword: Swift.availabilityKeywords
        .concat([
          "@available",
          "#available"
        ])
        .join(' ')
    },
    contains: [
      ...OPERATORS,
      NUMBER,
      STRING
    ]
  };

I think all this does is provide scope for the "unique" availabilityKeywords (which arguably could be moved to the global scope, though perhaps it's less "correct")... and this could be simplified to two lines of code:

          "@available",
          "#available"

...just added to the appropriate list of keywords already matched by existing rules.

@allejo @egor-rogov Any thoughts on where to draw the limits on raw size, or should there simply be no limits? Does it matter at all that Swift is in the "common" set... I know we have some other huge grammars (mostly keywords) but we don't ship them in the common set. Is there perhaps any point where we say "built-in grammars cover the basics" and if someone wants a 100kb grammar that does it all, they ship it as a 3rd party grammar?

Any thoughts on complexity vs "perfection"? Is it worth an extra 18 lines of code and additional rules just to avoid highlighting a few keywords that might be used as identifiers?

@svanimpe That's what we're talking about in this case, right? All of this is to avoid mis-highlighting those things as keywords in other contexts, yes?

@joshgoebel (Member) commented Dec 22, 2020

Another example, operators:

// Valid first characters for operators.
export const operatorHead = either(
  /[/=\-+!*%<>&|^~?]/,
  /[\u00A1-\u00A7]/,
  /[\u00A9\u00AB]/,
  /[\u00AC\u00AE]/,
  /[\u00B0\u00B1]/,
  /[\u00B6\u00BB\u00BF\u00D7\u00F7]/,
  /[\u2016-\u2017]/,
  /[\u2020-\u2027]/,
  /[\u2030-\u203E]/,
  /[\u2041-\u2053]/,
  /[\u2055-\u205E]/,
  /[\u2190-\u23FF]/,
  /[\u2500-\u2775]/,
  /[\u2794-\u2BFF]/,
  /[\u2E00-\u2E7F]/,
  /[\u3001-\u3003]/,
  /[\u3008-\u3020]/,
  /[\u3030]/
);

// Valid characters for operators.
export const operatorCharacter = either(
  operatorHead,
  /[\u0300-\u036F]/,
  /[\u1DC0-\u1DFF]/,
  /[\u20D0-\u20FF]/,
  /[\uFE00-\uFE0F]/,
  /[\uFE20-\uFE2F]/
  // TODO: The following characters are also allowed, but the regex isn't supported yet.
  // /[\u{E0100}-\u{E01EF}]/u
);

This is just the keyword lists, not counting the rules in the grammar (another 25 lines)... if I had been writing the grammar I probably would have supported only the common operators:

  /[/=\-+!*%<>&|^~?]/,

And then said "we simply don't support every possible operator under the sun, too complex" - or revisited the topic if it came up later. I'm not suggesting we undo this work now that it's done, but I think we should have some big-picture idea for the next time this comes up... I feel like in the past the goal has often been a 'minimum ruleset that provides a good highlighting experience', and this perhaps pushes far past that.

I'm not 100% opposed to this level of fidelity... I just think it would be setting a different direction than what we've tried to do in the past... so if that's the decision, I want it to be decided rather than stumbled into. :)

I feel like here we're trying to build a grammar to cover the WHOLE spec (again speaking strictly of operators) rather than just "what might be most helpful in a highlighting context"... and those are VERY different goals which lead to very different outcomes.

Is the goal to cover 90% of cases reasonably well or to cover 100%?

@@ -291,8 +295,7 @@ export default function(hljs) {
     // https://docs.swift.org/swift-book/ReferenceManual/Expressions.html#ID552
     // Prevents element names from being highlighted as keywords.
     const TUPLE_ELEMENT_NAME = {
-      begin: lookahead(concat(Swift.identifier, /\s*:/)),
-      end: lookahead(/:/),
+      begin: concat(Swift.identifier, /\s*:/),
Member:

I didn't understand why we needed lookahead at all here.

Contributor Author:

This is probably a leftover from when I used contains. The goal here was to check that the upcoming code is indeed an identifier, but leave matching to the contained rules.

Will the keywords: "_|0" rule still work with this change?

Member:

Yes, keywords match the buffer, they do not consume the input.

  • begin matches
  • end matches
  • buffer = [begin, text in middle, end].concat()

That whole buffer is fed into the keyword engine for processing. Of course child rules that match can flush the buffer... but it's easy enough to reason about in a simple case with no children.

That's why keywords can get difficult in edge cases, because they are happening absolutely last and at the mercy of the whole ruleset.

@@ -351,7 +351,7 @@ export default function(hljs) {
       lookahead(concat(Swift.identifier, /\s*:/)),
       lookahead(concat(Swift.identifier, /\s+/, Swift.identifier, /\s*:/))
     ),
-    end: lookahead(/:/),
+    end: /:/,
Member:

No good reason not to consume this that I can see, so we just do that, since it's simpler.

Contributor Author:

Yes, that should be fine. I didn't consume it just in case I needed it to detect a type annotation (which always starts with :), but I managed to get proper highlighting without having to do that.

Member:

I didn't consume it just in case I needed it to detect a type annotation

We really need look-behind. :( Lots of complexity dealing with these kinds of things without it.

@joshgoebel (Member) commented Dec 22, 2020

@svanimpe I think we're closing in on this. I simplified a few small things. I see why we need TUPLE since it's serving dual-purposes (and also if we really want tuples highlighted differently than param names)... and I now understand why you're writing the function rules as you are. I added a few comments on my commit.

Any questions about what I changed?

@joshgoebel (Member)

And yes, if illegal isn't hurting anything we should keep it - it's a HUGE help for auto-detect because of how fast it can rule languages out.

Comment on lines 285 to 289

    contains: [
      ...COMMENTS,
      ...KEYWORD_MODES,
      ...ATTRIBUTES,
      OPERATOR_GUARD,
Member:

Can you think of any good reason the compiler shouldn't just flatten contains for you, avoiding the need to keep track of which items are single rules vs. arrays of rules?

Contributor Author:

I'm not sure what you mean here. Am I not already flattening the nested arrays with the leading ... operator?

Member:

Yes, but it's error-prone and you have to constantly remember which is a mode and which is a list of modes... vs. just not requiring the ... and letting the compiler auto-detect arrays vs. single items.

Member:

Obviously ... is more explicit, but is there value in that explicitness or does it just get in the way?

Contributor Author:

What would the alternative code be?

This may be just a case of me not being a JS developer.

Member:

You have to keep track of it manually.

    contains: [
      ...COMMENTS,
      ...KEYWORD_MODES,
      ...ATTRIBUTES,
      OPERATOR_GUARD,

vs

The compiler figures it out for you.

    contains: [
      COMMENTS,
      KEYWORD_MODES,
      ATTRIBUTES,
      OPERATOR_GUARD,

The question is whether forcing the reader/writer to keep track of what needs ... and what doesn't adds any value... I'm not sure it does.
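
A hypothetical sketch of what such compiler-side flattening could look like (this is not an existing highlight.js feature; the mode objects are illustrative stand-ins):

  // Flatten nested arrays so single modes and arrays of modes can be mixed freely.
  function flattenContains(contains) {
    return contains.flat(Infinity);
  }

  flattenContains([{ className: "comment" }, [{ className: "keyword" }, { className: "string" }]]);
  // => [{ className: "comment" }, { className: "keyword" }, { className: "string" }]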

Member:

I'm not interested in an array here ever having any special type of semantic meaning... so that means technically it would be easy to move this type of thing into the compiler and make the type passed (array or item) irrelevant.

Contributor Author:

By "compiler", do you mean some part of HLJS that processes these language files? Or is this a JS "feature" ? In the first case, I'd still prefer the explicit version I think. It's less surprising/confusing.

For example, in #2938, I have the following code:

keywords: [
  ...Swift.precedencegroupKeywords,
  ...Swift.literals
].join(' '),

I tried leaving off the leading ... here, but that didn't work. Of course, this being JS and all, I didn't get any errors. It just didn't work 🤷‍♂️
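
The silent misbehavior comes from Array.prototype.join stringifying nested arrays with commas; for example (illustrative values, not the real keyword lists):

  const a = ["precedencegroup", "higherThan"];
  const b = ["true", "false"];

  [...a, ...b].join(" "); // "precedencegroup higherThan true false"  <- what the grammar expects
  [a, b].join(" ");       // "precedencegroup,higherThan true,false"  <- commas sneak in, no error thrown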

@svanimpe (Contributor, Author)

@svanimpe That's what we're talking about in this case, right? All of this is to avoid mis-highlighting those things as keywords in other contexts, yes?

In the case of @/#available, my goal was to properly highlight the arguments for that keyword. Those arguments are very specific and not like regular code. They also contain some specific keywords. Those keywords should never be added as global keywords, as they aren't used anywhere else, and will only cause confusion if we mis-highlight them.

Additionally, @available is the only attribute that has an argument list, so by handling it separately, we don't have to account for argument lists in the general ATTRIBUTE rules.
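
For reference, the kind of argument list being described (an illustrative snippet, not from the PR's tests):

  @available(iOS 13.0, macOS 10.15, *)
  func modernOnly() { }

  func callIfPossible() {
      if #available(iOS 13.0, macOS 10.15, *) {
          modernOnly()
      }
  }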

And then said "we simply don't support every possibly operator under the sun, too complex"

If code size is becoming an issue, then reducing the list of identifier/operator characters would make sense. If we reduce them to only the ones that are easy to type, we can probably remove all the Unicode characters and shrink the code size.

Note that we still have to support custom operators using these characters, so we don't break functions (with operator titles) again.

Also, if we narrow down the list of characters we support, we may be excluding users who don't write their code in English, and for whom the characters we remove may be normal everyday characters ...

Is the goal to cover 90% of cases reasonably well or to cover 100%?

The goal should be correctness without completeness. I'm using HLJS to highlight teaching examples, and I cannot afford to have a single mis-highlight in my course. Those are really confusing to new learners.

I'm only doing this work because it's easy for me to integrate HLJS into my publishing setup, and fixing it seemed easier than integrating a different solution (which will likely be the long-term plan).

@joshgoebel (Member) commented Dec 26, 2020

If code size is becoming an issue, then reducing the list of identifier/operator characters would make sense. If we reduce them to only the ones that are easy to type, we can probably remove all the Unicode characters and shrink the code size.

Well, as I said the problem is we have no hard and fast rules, we're just playing it by ear. :-) And complexity also plays a role, though your code tends to be very clean and clear, which does lower the fear of the complexity. :)

Note that we still have to support custom operators using these characters, so we don't break functions (with operator titles) again.

Well there is always a side chance we could make that work with \S (not space)... or some such rather than an explicit rule. (see my thoughts below on "close enough")

Also, if we narrow down the list of characters we support, we may be excluding users who don't write their code in English, and for whom the characters we remove may be normal everyday characters ...

I think that 98% of the library already skews English, all our docs are English (though there is talk of changing that, but it requires contributors who know those languages), etc... so that seems a somewhat natural limitation for the time being (even if suboptimal). Proper multi-lingual support (and related docs) is a pretty big undertaking... Plus I'm not really even sure we can honestly have great multi-lingual support until someone digs into #2756

The goal should be correctness without completeness. I'm using HLJS to highlight teaching examples, and I cannot afford to have a single mis-highlight in my course. Those are really confusing to new learners.

Our goals may be slightly misaligned then. While I generally agree I also don't think 100% correctness is possible (see our regex support in like every language). Sometimes it's impossible (because we're not a full parser). Other times it's simply too complex/hard to achieve (lack of look-behind, engine limitations, etc). So we try to get "close enough" - to highlight 98% of reasonable code well - even if 2% of edge cases might be broken and unfixable.

Generally, if the entire highlighting goes off the rails (gets stuck in a child mode), that's an issue we'd like to fix... but if a few tokens or a really problematic section glitches for just a moment... When a strange edge case is found, of course we'd first consider fixing it - but if it proved to be hard or impossible to fix, that does not necessarily mean we would cripple the highlighter by refusing to highlight a thing that works fine for 99% of people in 99% of circumstances.

I'm only doing this work because it's easy for me to integrate HLJS into my publishing setup, and fixing it seemed easier than integrating a different solution (which will likely be the long-term plan).

At some point you may want to consider forking the internal Swift grammar into your own 3rd party grammar. Full control without any review - you're limited only by what the engine makes possible, not by what we think is a good idea or good for the core library. :)

I'm not telling you to go do that... we're very happy to have your great contributions. I'm only mentioning the possibility that we may eventually find ourselves between a rock and a hard place, where practicality is at odds with the ideal of absolute correctness.

Of course even on your own you may still eventually find some things are impossible/difficult. :-)

@svanimpe (Contributor, Author) commented Dec 27, 2020

So we try to get "close enough" - to highlight 98% of reasonable code well - even if 2% of edge cases might be broken and unfixable.

The text I'm currently highlighting is a programming fundamentals course, so definitely no edge cases in there, only basics :D

At some point you may want to consider forking the internal Swift grammar into your own 3rd party grammar.

Long term, I may have to move to pre-rendering, as that isn't limited by speed or code size. The Swift compiler actually includes a library that does most of the hard work: https://github.com/apple/swift-syntax
That should be able to categorize all the tokens in the source code, without having to worry about regular expressions anymore. But I still have to look into exactly how much information it provides, how well it can handle incomplete code snippets (that don't compile on their own), how I can hook it into my setup, etc... For now, HLJS is still easier to use.

@joshgoebel (Member) commented Dec 27, 2020

The text I'm currently highlighting is a programming fundamentals course, so definitely no edge cases in there, only basics :D

That doesn't mean there are no edge cases (from a highlighting perspective), though it probably helps. :) Often if one can rewrite the source easily, they can avoid many of these kinds of pitfalls (by tweaking the source to work around them).

The Swift compiler actually includes a library that does most of the hard work: https://github.com/apple/swift-syntax

I was wondering if a Swift solution for parsing Swift might not be a better fit for perfect parsing.

For now, HLJS is still easier to use.

That is one advantage for sure. :)


I'm also (slightly) interested in 3rd party parsing engines... (allowing someone to replace our parsing engine entirely) so if you wanted to write a full JS Swift parser that spit out tokens I'd be open to a discussion about what kind of plug-in architecture we'd need to support something like that.

@joshgoebel (Member)

@svanimpe I will take your merging as a tacit agreement that all is well and this is good to go. :)

@joshgoebel joshgoebel merged commit 35cd463 into highlightjs:master Dec 28, 2020
@joshgoebel (Member)

@svanimpe Thanks for all the hard work on this. :) Will ship in Jan with 10.6.


Successfully merging this pull request may close these issues:

  • (Swift) Incorrect highlighting in operator functions
  • (Swift) params does not handled nested parens properly
2 participants