Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ts/js) use identifier to match potential keywords #2519

Merged
merged 10 commits into from May 7, 2020

Conversation

@joshgoebel
Copy link
Member

@joshgoebel joshgoebel commented Apr 29, 2020

  • fix(typescript) use identifier regex to match potential keywords
  • fix(javascript) use identifier regex to match potential keywords

The treats expressions like $class as a whole identifier and therefore
avoids the class portion of that identifier incorretly being highlighted
as a keyword.

Resolve #2516.

@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Apr 29, 2020

This adds $lexemes:

  var KEYWORDS = {
     $lexemes: ECMAScript.IDENT_RE,
     keyword: ECMAScript.KEYWORDS.join(" "),
     literal: ECMAScript.LITERALS.join(" "),
     built_in: ECMAScript.BUILT_INS.join(" ")

I can document this syntax, but I think this makes a lot of sense as the whole purpose of lexemes is to process keywords - it's not used for anything else. And many grammars use a KEYWORD constant that they embed over and over in many places... making this a very useful thing to have to avoid repeating yourself.

I would suggest either:

  • mode level lexemes (if present) would supersede keywords/$lexemes
  • having both is a hard error and the language does not compile (ie, grammar designers would not be allowed to do this when designing grammars)
  • Although perhaps no special behavior is needed here... we could let $lexemes always win and it can always be destroyed with inherit if they REALLY need an exception case for some deep mode - the same as you would do for any other key.

Long term I think perhaps we should deprecate MODE.lexemes in favor or MODE.keywords.$lexemes.

@joshgoebel joshgoebel requested review from allejo and egor-rogov Apr 29, 2020
@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Apr 29, 2020

Also the name "lexemes" drives me nuts and we could use this as an opportunity to start renaming it... $pattern?

@allejo
Copy link
Member

@allejo allejo commented Apr 30, 2020

  • mode level lexemes (if present) would supersede keywords/$lexemes

Mode level lexemes currently exist and are potentially in use right now. Wouldn't we want keywords/$lexemes to supersede that? i.e. we're introducing new behavior and the old behavior will potentially be deprecated, so shouldn't that new behavior be more important?

  • having both is a hard error and the language does not compile (ie, grammar designers would not be allowed to do this when designing grammars)

Do you know if there's a use case of when anyone would need to use both? If not, then I would think that discouraging this behavior would be ideal so a hard error.

Unless there's a use case, this has my vote 👍

  • Although perhaps no special behavior is needed here... we could let $lexemes always win and it can always be destroyed with inherit if they REALLY need an exception case for some deep mode - the same as you would do for any other key.

If supporting both methods at the same time is important, then I think this idea is the same as my thoughts on the first option.


Long term I think perhaps we should deprecate MODE.lexemes in favor or MODE.keywords.$lexemes.

If my understanding is correct, lexemes are currently defined per mode (as a sibling to keywords). However, a common practice in writing grammars is that keywords is a single object/string and the lexemes for said keywords are outside that object, correct? If so, I'm in favor of deprecating MODE.lexemes.


Also the name "lexemes" drives me nuts and we could use this as an opportunity to start renaming it... $pattern?

👍 for $pattern. What this feature is defining is the pattern for a lexeme or token, not the lexemes themselves.

@egor-rogov
Copy link
Collaborator

@egor-rogov egor-rogov commented Apr 30, 2020

This adds $lexemes

It looks reasonable, but given we already have one method (`MODE.lexemes)), do we really need another one? I'm not strongly against it, just in a doubt.

Long term I think perhaps we should deprecate MODE.lexemes in favor or MODE.keywords.$lexemes.

We'll have troubles doing so, because a lot of grammars use it.

Also the name "lexemes" drives me nuts and we could use this as an opportunity to start renaming it... $pattern?

I agree, "lexemes" always seemed strange to me. keywords.$pattern looks better. But why do you want to use this dollar?

@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Apr 30, 2020

keywords.$pattern looks better. But why do you want to use this dollar?

Because it's a reserved namespace. Someone might be writing a grammar for a pattern matching library and want to have a "pattern" type of keyword and have "pattern" as the className... so we have to have some type of "reserved" namespace if we are going to allow configuration options right next to class names.

but given we already have one method (`MODE.lexemes)), do we really need another one? I'm not strongly against it, just in a doubt.

I was cutting and pasting the code 4-5 times for Javascript and then about to have to do it gain for TypeScript - and this is literally why we program, to eliminate this type of manual cutting and pasting - and I'm about ready for grammars to get a bit easier to write. Plus cutting and pasting is error prone, and it's too easy to miss a spot, etc...

This exact "cut and paste" problem is the ONLY reason that's stopped me from fixing the __keywords issue with C++, it's just so yucky a solution, and C++ is even worse (number of times you'd need to copy/paste).

Do you know if there's a use case of when anyone would need to use both? If not, then I would think that discouraging this behavior would be ideal so a hard error.

We could always discourage it UNTIL someone finds a use case. :-)


The other idea here would be some type of inheritance setting for keywords/lexemes... but I'm not sure what that would look like. You'd want it at a single level (top) to prevent hunt and find and cut and paste over and over...

keywords: { ... },
lexemes: /[a-z]+/,
inheritKeywords: true,
inheritLexemes: true,
// or
keywords: { 
  $inherit: true,
  $lexemes: /[a-z_]+/,
  keywords: " ... ",
  built_ins: " ... "
}

I'm already liking this idea of keeping the configuration with the keywords themselves though, as honestly it seems the most natural place to me.

So once inherit was turned on the keywords would be inherited down... and then (say inside a string or comment) you'd have to say keywords: null to stop the inheritance chain.

My worry is there we might start getting people accidentally highlighting keywords inside strings/comments just because they fail to turn off inheritance... but I'm not sure that's much worse than the current behavior of not highlighting keywords at all because people forgot to do the keyword inheritance manually?

I've kind of felt for a while that keywords should perhaps even "auto-inherit"... and you've have to manually intervene to disable that behavior...


And the more I think about it we don't need both at all:

lexemes: /[a-z+]/,
keywords: {
  $pattern: /[a-z]+/

Should throw a hard error ("Don't use both lexemes and keywords.$pattern in the same block.").

@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Apr 30, 2020

We'll have troubles doing so, because a lot of grammars use it.

We'd deprecate it like you do anything... with a major version, first to deprecate it - add warnings... and then in the next major version to turn it off completely. But I'm fine with it living for a long while if need be...

If the new location becomes the blessed one it's one line of code to just copy lexemes into keywords.pattern when compiling modes... and I think we can maintain that one line of code many years into the future if we MUST. :-)

@egor-rogov
Copy link
Collaborator

@egor-rogov egor-rogov commented May 1, 2020

Okay, I'm fine with a hard error.
So we agreed on $pattern, I think? And it should be documented, of course.

@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented May 2, 2020

No worries I will document it. :)

@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented May 2, 2020

How is that?

@joshgoebel joshgoebel force-pushed the joshgoebel:ts_js_identifier_vs_keyword branch from 6653b3e to 1606b83 May 2, 2020
@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented May 6, 2020

Better?

Copy link
Collaborator

@egor-rogov egor-rogov left a comment

Oh yeah, that's awesome.

docs/language-guide.rst Outdated Show resolved Hide resolved
src/lib/mode_compiler.js Show resolved Hide resolved
@joshgoebel joshgoebel merged commit 9e4f2dc into highlightjs:master May 7, 2020
1 check was pending
1 check was pending
continuous-integration/travis-ci/pr The Travis CI build is in progress
Details
0xflotus added a commit to 0xflotus/highlight.js that referenced this pull request Jun 12, 2020
* (docs) Add Chaos to supported languages (highlightjs#2510)

* fix(parser) fixes sublanguage with no rule matches (highlightjs#2506)

* fix(parser) fixes sublanguage with no rule matches

Resolves highlightjs#2504.

* (chore) Add ESLint config and clean up the major stuff (highlightjs#2503)

* (chore) eslint:recommended
* (chore): eslint_standard
* relax eslint rules for language grammars (to discourage rewriting them in one fell swoop; I'd rather have the blame history intact)
* remove extra escaping
* clean up variables
* more camelcase

* (docs) Add Visual Basic for Applications (VBA) to supported languages (highlightjs#2512)

* (yaml) improve tag support; add verbatim tags (highlightjs#2487)

* YAML parse non-word characters as part of tags
* adds support for verbatim tags

Co-authored-by: Josh Goebel <me@joshgoebel.com>

* fix(javascript/typescript): lambda with parens in parameters fails (highlightjs#2502)

* fix(javascript/typescript): lambda with parens in parameters fails

- Fixes both JavaScript and TypeScript grammars

Fixes samples like:

    const bad = ((a, b) => [...a, b]);
    sides.every((length,width=(3+2+(4/5))) => length > 0 );

This is done by counting parens in the regex that finds arrows
functions. Currently we can only handle 2 levels of nesting as
shown in the second example above.

* allow much richer highlighting inside params
* improve highlighting inside arguments on typescript

* enh(cpp): Improve highlighting of unterminated raw strings

PR highlightjs#1897 switched C++ raw strings to use backreferences, however this
breaks souce files where raw strings are truncated. Like comments, it
would be preferable to highlight them.

- Add `on:begin` and `on:end` to allow more granular matching when
  then end match is dynamic and based on a part of the begin match
- This deprecates the `endSameAsBegin` attribute. That attribute was
  a very specific way to solve this problem, but now we have a much
  more general solution in these added callbacks.

Also related: highlightjs#2259.

Co-authored-by: Josh Goebel <me@joshgoebel.com>

* (chore) C-like uses the new END_SAME_AS_BEGIN mode

* (chore) Ruby uses END_SAME_AS_BEGIN mode/rule

* (parser) make END_SAME_AS_BEGIN a function helper

Adds a mode helper to replace the deprecated `endSameAsBegin` attribute.

The first match group from the begin regex will be compared to the first
match group from the end regex and the end regex will only match if both
strings are identical.

Note this is more advanced functionality than before since now you can
match a larger selection of text yet only use a small portion of it for
the actual "end must match begin" portion.

* (pgsql) add test for $$ quoting existing behavior

- even if that existing behavior is questionable
- the ending span should really close before the $$, not after

Fixing this would involve delving into the sublanguage behavior and I'm
not sure we have time to tackle that right this moment.

* (chore) pgsql uses END_SAME_AS_BEGIN mode/rule now also

* (docs) rename to `mode_reference`; docs for callbacks

- I can never find this file because it's name didn't fully match.
- rename callbacks to `on:begin` and `on:end`

* prevented setter keyword conflicting with setTimeout|setInterval and highlighted them (highlightjs#2514) (highlightjs#2515)

* fix(javascript) prevent setter keyword 'set' conflicting with setTimeout|setInterval (highlightjs#2514)
* enh(javascript) setTimeout|setInterval now highlighted (highlightjs#2514)
* enh (javascript) clearInterval and clearTimeout now highlighted
* add keywords to TypeScript also

* (docs) add TLDR instructions for building and testing

* (dev) improve developer tool UI

* (parser) Build common EMCAscript foundation

Builds a common keyword foundation for any grammar that is
building on top of JavaScript:

- LiveScript
- CoffeeScript
- TypeScript

Also uses this common foundation for JS itself.

* (parser) Adds SHEBANG mode

* (yaml) Add support for inline sequences and mappings (highlightjs#2513)

* Use containers to match inline sequences and mappings
* Add string type for inside inline elements
* Handle nested inline sequences and mappings
* Disallow all braces brackets and commas from strings inside inline mappings or sequences
* clean up implementation
* feed the linter

Co-authored-by: Josh Goebel <me@joshgoebel.com>

* [enh] Add `OPTIMIZE:` and `HACK:` to comment doctags

* (build) browser build is CommonJS and IIFE, no more AMD (highlightjs#2511)

* (build) browser build is CommonJS and IIFE (global) now
* (build) dropping support for AMD, which we never truly supported
  properly in the first place
* (build) add test to make sure browser build works as commonJS module

  Resolves highlightjs#2505

* fix(parser) Fix freezing issue with illegal 0 width matches (highlightjs#2524)

* fix[parser] add edge case handle for illegal 0 width matches
* add last ditch catch all that tries to detect other uncaught freezes

* (docs) added unicorn-rails-log as an 3rd-party language (highlightjs#2528)

- (docs) Add syntax highlighting for Rails Unicorn logging to supported languages.

* (docs) fix supported languages link: it moved again! (highlightjs#2533)

* fix(ts/js) use identifier to match potential keywords (highlightjs#2519)

- (parser) Adds `keywords.$pattern` key to grammar definitions
- `lexemes` is now deprecated in favor of `keywords.$pattern` key
- enh(typescript) use identifier to match potential keywords, preventing false positives 
- enh(javascript) use identifier to match potential keywords, preventing false positives

* fix(javascript) fix regex inside parens after a non-regex (highlightjs#2531)

* make the object attr container smarter
* deal with multi-line comments also
* comments in any order, spanning multiple lines

Essentially makes the object attr container much more sensitive by allowing it to look-ahead thru comments to find object keys - and therefore prevent them from being incorrectly matched by the "value container" rule.

* (parser) Add hljs.registerAlias() public API (highlightjs#2540)

* Add hljs.registerAlias(alias, languageName) public API
* Add .registerAlias() test

* enh(cpp) add `pair`, `make_pair`, `priority_queue` as built-ins (highlightjs#2538)

* (fix) `fixMarkup` would rarely destroy markup when `useBR` was enabled (highlightjs#2532)

* enh(cpp) Recognize `priority_queue`, `pair` as containers (highlightjs#2541)

* chore: rename `registerAlias` to `registerAliases`

Plural form is clearly better as it's not surprising to the reader
to see it being passed an array - where as the singular form might
have been.  Meanwhile it's also easy to assume that it also supports
arrays of any size - including an array with a singular alias.

The fact that it can magically accept a string as the first argument
might not be obvious, but we document it and even if one didn't know
this they could still use the array form of the API without any issue
by passing a one item array.

* (swift) @objcMembers not completely highlighted (highlightjs#2543)

* Fixed @objcMembers in Swift

Would match `@objc` first, and the `Members` part would be unhighlighted

* Update CHANGES.md

* Update swift.js

* (docs) add OCL to list of supported languages (highlightjs#2547)

* (docs) Add Svelte to list of supported languages (highlightjs#2549)

* enh(dart) Add `late` and `required` keywords, and `Never` built-in type (Dart 2.9) (highlightjs#2551)

* Add new Dart 2.9 keywords for Null Safety language feature

* enh(erlang) add support for underscore separators in numeric literals (highlightjs#2554)

* (erlang) add support for underscore separators in numeric literals
* (erlang) add tests

* (docs) add Jolie to Supported Languages (highlightjs#2556)

* (parser/docs) Add jsdoc annotations and TypeScript type file (highlightjs#2517)

Adds JSDoc annotations and a .tsconfig that allows TypeScript to be run in it's "allowJS" mode and apply type and sanity checking to JavaScript code also. See Type Checking JavaScript Files.

I've been using TypeScript a lot lately and finding it very beneficial and wanted to get those same benefits here but without converting the whole project to TypeScript. It was rough at the beginning but now that this is finished I think it's about 80%-90% of the benefits without any of the TS compilation pipeline. The big difference in being JSDoc for adding typing information vs inline types with TypeScript.

Should be super helpful for maintainers using an editor with tight TypeScript integration and the improved docs/comments should help everyone else.

- Adds types/index.d.ts to NPM build (should be useful for TypeScript peeps)
- Improves documentation of many functions
- Adds JSDoc annotations to almost all functions
- Adds JSDoc type annotations to variables that can't be inferred
- Refactors a few smaller things to allow the TypeScript compiler to better infer what 
   is happening (and usually also made the code clearer)

* (parser) highlightBlock result key `re` => `relevance` (highlightjs#2553)

* enh(handlebars) Support for sub-expressions, path-expressions, hashes, block-parameters and literals (highlightjs#2344)

- `htmlbars` grammar is now deprecated. Use `handlebars` instead.

A stub is included so that anyone literally referencing the old `htmlbars` file (say manually requiring it in Node.js, etc) is still covered, but everyone should transition to `handlebars` now.

* fix(typescript) Add missing `readonly` keyword (highlightjs#2562)

* (docs) Mention `c` is a possible class for C (highlightjs#2577)

* fix(groovy) strings are not allowed inside ternary clauses (highlightjs#2565)

* fix(groovy) strings are not allowed inside ternary clauses
* whitespace can also include tabs

* Update @typescript-eslint/parser to the latest version 🚀 (highlightjs#2575)

* chore(package): update @typescript-eslint/parser to version 3.0.0
* chore(package): update lockfile package-lock.json

Co-authored-by: greenkeeper[bot] <23040076+greenkeeper[bot]@users.noreply.github.com>
Co-authored-by: Josh Goebel <me@joshgoebel.com>

* Update @typescript-eslint/eslint-plugin to the latest version 🚀 (highlightjs#2576)

* chore(package): update @typescript-eslint/eslint-plugin to version 3.0.0
* chore(package): update lockfile package-lock.json

Co-authored-by: greenkeeper[bot] <23040076+greenkeeper[bot]@users.noreply.github.com>
Co-authored-by: Josh Goebel <me@joshgoebel.com>

* (parser) properly escape ' and " in HTML output (highlightjs#2564)

* escape quotes also in final HTML output
* [style] update test coding style
* update markup tests with new escaping

This shouldn't be a security issue -- we've always escaped double quotes inside of HTML attribute values (where they could be used to break out of context) - and we've always used double quotes for enclosing attribute values. 

This just goes all the way and now properly escapes quotes everywhere.  Better safe than sorry.

* (docs) add changelog entry for last PR

* add nnfx theme (highlightjs#2571)

* (themes) Add new lioshi theme (highlightjs#2581)

* Added Cisco Command Line to SUPPORTED_LANGUAGES.md (highlightjs#2583)

* (themes) add `nnfx-dark` theme (highlightjs#2584)

* enh(protobuf) Support multiline comments  (highlightjs#2597)

* enh(java) added support for hexadecimal floating point literals (highlightjs#2509)

- Added support for many additional types of floating point literals
- Added related tests

There still may be a few gaps, but this is a pretty large improvement.

Co-authored-by: Josh Goebel <me@joshgoebel.com>

* (chore) Update issue templates (highlightjs#2574)

Co-authored-by: Vladimir Jimenez <allejo@me.com>

* enh(toml)(ini) Improve parsing of complex keys (highlightjs#2595)

Fixes: highlightjs#2594

* (chore) add `.js` extension to import statements (highlightjs#2601)

Adds file extensions to all import specifiers in ./src/ files.  This is useful to run the files straight from source with a web browser , Node.js ESM or Deno.

- Also add eslint rules regarding extensions for imports

* enh(dart) highlight built-in nullable types (highlightjs#2598)

* Dart: allow built-in nullable types with trailing ? to be highlighted

* enh(csharp) highlight generics in more cases (highlightjs#2599)

* (chore) fix tiny style issues, add linting npm task

- fixes tiny style issues
- adds `npm run lint` for linting the main library source
  (not languages which are still much messier)

* (chore) bump dev dependencies

* (chore) upgrade some dev stuff to newer versions

* bump v10.1.0

* (chore) bump copyright

* (chore) more import below metadata comment

Co-authored-by: M. Mert Yıldıran <mehmetmertyildiran@gmail.com>
Co-authored-by: Josh Goebel <me@joshgoebel.com>
Co-authored-by: Hugo Leblanc <dullin@hololink.org>
Co-authored-by: Peter Massey-Plantinga <plantinga.peter@gmail.com>
Co-authored-by: David Benjamin <davidben@google.com>
Co-authored-by: Vania Kucher <dev.kucher@gmail.com>
Co-authored-by: SweetPPro <sweetppro@users.noreply.github.com>
Co-authored-by: Alexandre ZANNI <16578570+noraj@users.noreply.github.com>
Co-authored-by: Taufik Nurrohman <t.nurrohman77@gmail.com>
Co-authored-by: Lin <50829219+Linhk1606@users.noreply.github.com>
Co-authored-by: nicked <nicked@gmail.com>
Co-authored-by: Nicolas Homble <nhomble@terpmail.umd.edu>
Co-authored-by: Ryandi Tjia <ryandi.tjia@me.com>
Co-authored-by: Sam Rawlins <srawlins@google.com>
Co-authored-by: Sergey Prokhorov <seriy.pr@gmail.com>
Co-authored-by: Brian Alberg <brian@alberg.org>
Co-authored-by: Nils Knappmeier <github@knappi.org>
Co-authored-by: Martin <7252614+Lhoerion@users.noreply.github.com>
Co-authored-by: Derek Lewis <DerekNonGeneric@inf.is>
Co-authored-by: greenkeeper[bot] <23040076+greenkeeper[bot]@users.noreply.github.com>
Co-authored-by: Jim Mason <jmason@ibinx.com>
Co-authored-by: lioshi <lionel.fenneteau@gmail.com>
Co-authored-by: BMatheas <65114274+BMatheas@users.noreply.github.com>
Co-authored-by: Pavel Evstigneev <pavel.evst@gmail.com>
Co-authored-by: Vladimir Jimenez <allejo@me.com>
Co-authored-by: Antoine du Hamel <duhamelantoine1995@gmail.com>
Co-authored-by: TupikovVladimir <vladimir.tupikov@devexpress.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

3 participants