Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bracket Highlight does not detect word boundary for language using word as bracket #132162

Closed
mrvkino opened this issue Sep 2, 2021 · 14 comments
Closed
Assignees
Labels
bracket-pair-colorization bug Issue identified by VS Code Team member as probable bug insiders-released Patch has been released in VS Code Insiders verified Verification succeeded
Milestone

Comments

@mrvkino
Copy link

mrvkino commented Sep 2, 2021

Issue Type: Bug

When using systemverilog language the bracket colorization is highlighting part of a word that match the bracket work.

The language uses begin..end as bracket and highlight all instance of "end".

image

the end in the word "extends" should not be highlighted.

Here is a small sample that trigger the issue for the word "class" and "end"

class test_class extends parent_class;
  function new();
  endfunction
endclass

You can find the list of bracket word used in the language.

VS Code version: Code 1.60.0 (e7d7e9a, 2021-09-01T10:54:53.442Z)
OS version: Darwin x64 20.6.0
Restricted Mode: No
Remote OS version: Linux x64 3.10.0-1160.41.1.el7.x86_64

System Info
Item Value
CPUs Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz (12 x 2600)
GPU Status 2d_canvas: enabled
gpu_compositing: enabled
metal: disabled_off
multiple_raster_threads: enabled_on
oop_rasterization: enabled
opengl: enabled_on
rasterization: enabled
skia_renderer: disabled_off_ok
video_decode: enabled
webgl: enabled
webgl2: enabled
Load (avg) 3, 4, 3
Memory (System) 32.00GB (0.63GB free)
Process Argv --crash-reporter-id ed42b361-9981-44df-8b87-d02ddcab40e1
Screen Reader no
VM 0%
Item Value
Remote SSH: atl-s-asic37
OS Linux x64 3.10.0-1160.41.1.el7.x86_64
CPUs Intel(R) Xeon(R) Gold 5122 CPU @ 3.60GHz (8 x 3686)
Memory (System) 251.41GB (245.88GB free)
VM 0%
Extensions (46)
Extension Author (truncated) Version
output-colorizer IBM 0.1.2
Bookmarks ale 13.1.0
docs-view bie 0.0.9
vscode-markdownlint Dav 0.43.2
LogFileHighlighter emi 2.11.0
file-icons fil 1.0.29
CppSnippets har 0.0.15
vscode-drawio hed 1.6.2
vscode-todo-highlight jgc 2.0.3
vscode-peacock joh 3.9.1
AWK lug 0.0.2
VS-code-vagrantfile mar 0.0.7
vscode-rdef Mic 0.1.3
vscode-edge-devtools ms- 1.3.0
python ms- 2021.9.1191016588
vscode-pylance ms- 2021.9.0
remote-containers ms- 0.194.0
remote-ssh ms- 0.65.7
remote-ssh-edit ms- 0.65.7
remote-wsl ms- 0.58.2
vscode-remote-extensionpack ms- 0.21.0
vsliveshare-pack ms- 0.4.0
material-icon-theme PKi 4.10.0
jinjahtml sam 0.16.0
vscode-nginx sha 0.6.0
tcl sle 0.2.0
errorlens use 3.4.0
vscode-icons vsc 11.6.0
gistfs vsl 0.3.0
jinja2-snippet-kit Wya 2.0.0
material-theme zhu 3.11.4
read-only-indicator ale 3.6.0
blamer-vs bea 0.5.2
path-intellisense chr 2.4.0
chmod dle 1.1.1
xml Dot 2.5.1
filter-lines ear 1.0.0
systemverilog eir 0.11.3
svn-scm joh 2.13.6
vsliveshare ms- 1.0.4761
vsliveshare-audio ms- 0.1.91
vsliveshare-pack ms- 0.4.0
copy-file-name nem 1.2.0
vscode-fileutils sle 3.4.5
comment-divider sta 0.4.0
vscode-jumpy wma 0.3.1
A/B Experiments
vsliv368cf:30146710
vsreu685:30147344
python383:30185418
pythonvspyt602:30300191
vspor879:30202332
vspor708:30202333
vspor363:30204092
pythonvspyt639:30300192
pythontb:30283811
pythonptprofiler:30281270
vsdfh931:30280409
vshan820:30294714
vstes263:30335439
vscoreces:30358480
pythondataviewer:30285071
pythonvsuse255:30340121
vscod805cf:30301675
pythonvspyt200:30340761
vscextlangct:30333562
binariesv615:30325510
pythonvssor306:30344512
bridge0708:30335490
pygetstartedc2:30360494
bridge0723:30353136
vsdyn321cf:30356811
vscus224:30358039

@mrvkino mrvkino changed the title Detect word boundary for language using word as bracket Bracket Highlight does not detect word boundary for language using word as bracket Sep 2, 2021
@hediet hediet added bracket-pair-colorization bug Issue identified by VS Code Team member as probable bug labels Sep 5, 2021
@hediet hediet modified the milestones: On Deck, September 2021 Sep 5, 2021
@hediet
Copy link
Member

hediet commented Sep 7, 2021

It is difficult to not detect _class_, but class and _(_.
Basically, the bracket pair class/endclass works different from {/} in the following example:

_X_
_Y_

In case of X/Y = class/endclass, no bracket pair should be detected, but in case of X/Y = {/}, it should.

Suggestions?

@mrvkino
Copy link
Author

mrvkino commented Sep 7, 2021

@hediet
I know RTL language such as VHDL and System Verilog require spaces around the word pair or can only have indexing/calling symbol such as "(", "{" and "[" next to them. Looking at other language such as ADA it seems to be the same.

An option would be for word base block delimitation to use non word delimiter (in regex it's \W or \b, which exclude a-zA-Z0-9 and _). This would limit detection of the keyword by themselves or next to other non-word symbol (such as "(").
https://regex101.com/r/OthCMI/2

For "(", detecting the symbol as done now should be fine. I haven't seen any weird issue for now, except that the highlighting should be disabled in the comments.
"(" in a variable is illegal in pretty much any language I have heard of. This means seeing "something_(other" should be illegal or flagged as missing a terminating ")". Since it can't be a valid variable, the best option would be to consider "something" a function call and "_other" a variable and we are missing a closing ")".
https://regex101.com/r/aMXWwW/2

This does require to have slightly different matching rules about surrounding boundary to the bracket pair. I'm not sure if this is easily feasible with the current design of the bracket highlighter. But, this seems reasonable since it is two completely different way of delimiting block in programming languages with a pair of symbol. Expecting to have the same matching logic for both seems to be difficult as you showed.

@hediet
Copy link
Member

hediet commented Sep 7, 2021

How would you decide which matching logic to use? Theoretically, there could even be brackets like begin{ end }end.

We could use word-delimiters if and only if the bracket matches ID_Continue+ as defined by the unicode standard.

@mrvkino
Copy link
Author

mrvkino commented Sep 14, 2021

In the case of begin{, since there is a { this would prevent the issue of matching a substring of a variable name or function name, since { is not valid as a naming character.

So, anytime there is a bracket character such as ( [ { we can use the behaviour in place and if it is only valid word character a-zA-Z0-9_ (in regex parlance) we add the word boundary requirement.

I am not familiar with unicode standard, so I cannot comment on the ID_Continue+ solution.

@maxnordlund
Copy link

maxnordlund commented Sep 15, 2021

ID_Start and ID_Continue are part of a unicode standard that defines a set of characters to be used as identifiers, e.g. variables, in programming languages. At least Python, JavaScript and Erlang uses it as part of their respective syntax.

If you look at #132504 (comment), which I think is related to this, you can see a list of brackets with both word and non-word symbols in them taken from LaTeX, for example \left( and \right).

I would say you need three different strategies for brackets depending if the consists of:

  1. Only word symbols, matching the JavaScript RegExp \p{ID_Start}\p{ID_Continue}*
  2. Only non-word symbols, the inverse of the above?
  3. Mixed word and non word

For the first \b should suffice to differentiate it from surrounding context. For the second I guess you want to just match it anywhere. Finally for the mixed case I think you could prepend/append \b to the symbol if it starts/ends with a word symbol.

What do you think?

@mrvkino
Copy link
Author

mrvkino commented Sep 15, 2021

I like the three way split. The \b to gate the word part of the bracket should allow all 3 to behave similarly. The comment you have linked to will also solve an issue I didn't think about, shared bracket in multiple pair.

How is ID_Start and ID_Continue populated? By the VSCode language extension bracket definition? Sorry if the question seems stupid, but it is the first time I have heard that unicode has this capability. Granted, I am more on the hardware side of development.

@jecassis
Copy link

jecassis commented Oct 9, 2021

I am a minor contributor to the VSCode-SystemVerilog extension and have been thinking about this issue since updating to the super speed brace matching. Mostly on whether there is a fix/workaround we could apply or whether we should remove the words as brackets altogether.

For SystemVerilog, I believe @maxnordlund's solution would work for the words (i.e. 1. + \b), but I'll leave it to you to decide on the implementation.

To add a bit to the discussion here for SystemVerilog, per the standard there are two ways to specify an identifier:

  • [a-zA-Z_][a-zA-Z0-9_$]*
  • \\[!-~]+(?=$|[ \t\r\n])

But also note from the second regular expression that you could have a legal identifier in SystemVerilog using any printable ASCII character as long as it begins with \ and ends with a whitespace character. For example this \$endmodule]tes\t(}[3:1] in a file might break matching assumptions and highlighting. So, just as you're skipping comments/strings/regexp or other regions it might be a good idea to consider being able to configure language-specific nuances such as this.

@hediet hediet closed this as completed in e9e861f Oct 27, 2021
@hediet
Copy link
Member

hediet commented Oct 27, 2021

Verification steps:

"[plaintext]": {
    "editor.language.colorizedBracketPairs": [
        ["class", "end"]
    ]
},

Use plain text mode and enable bracket pair colorization (editor.bracketPairColorization.enabled: true):

class test_class extends parent_class;
  function new();
  endfunction
endclass

Notice that it looks like this:
image

@jecassis
Copy link

It works well with the "brackets" key already in language-configuration.json for SystemVerilog (notice the example below includes combinations of different kinds of traditional bracket pairs, preprocessor directives, and identifiers that embed keywords):

image

The scenario I outlined above does cause it to slightly misidentify the keyword specifically for SystemVerilog:

image

However, I suspect the use of escaped identifiers is minimal and this update covers the majority of scenarios for SystemVerilog.

@hediet
Copy link
Member

hediet commented Oct 28, 2021

So, just as you're skipping comments/strings/regexp or other regions it might be a good idea to consider being able to configure language-specific nuances such as this.

Sorry, I missed that comment. Unfortunately, our current design of the text-mate tokenizer only tracks comments/strings/regexp tokens semantically, which aren't really extendable.

It is unlikely that we will introduce a mechanism to detect that { in \$endmodule]tes\t(}[3:1] is not a bracket, if the text mate grammar does not mark the identifier as string, comment or regexp, especially if this is only an issue for SystemVerilog and no other language (if such cases even occur frequently in SystemVerilog).

However, feel free to open a new issue.

@jecassis
Copy link

Thank you for the explanation, Henning. That gave me an idea and I've made it work:

image

@jrieken jrieken added the verified Verification succeeded label Oct 28, 2021
@mrvkino
Copy link
Author

mrvkino commented Oct 29, 2021

Just to help decide if the complexity that \ brings might be useful to support.

I am a hardware designer and I used SystemVerilog for many years (for multiple clients).

The only time I saw the \ used in the language was with synthesized code from tools decomposing all indexes of a signal in multiple one bit signal. They try to keep the original name with the index included.

All coding style guide forbid this style of naming, since it may cause havoc with synthesis tools that are not expecting this kind of naming scheme.

Unless someone comes in with a compelling reason, supporting the \ might cause more bug than it solves.

@jecassis
Copy link

Yep Michael. That is almost exactly where I've seen it before. In tool output where they express the physical netlist itself in Verilog. Most of the use of escaped identifiers looked to me as a way to maintain Verilog 1995/2001 backward compatibility where structs/enums/etc... are not supported. My suspicion is that this is necessary to keep the underlying engines happy as most of the core engine code in them dates back to the 80's. I agree that this style of destructuring is not something a hardware designer entering HDL into an editor would be using much of.

The way I've solved/hacked it in the extension is to classify the escaped identifier as a regular expression as that appears to be a fenced off region already for the bracket matching and SystemVerilog has no regular expressions. I then recolor the token either in the settings or through semantic highlighting to make it appear "technically correct" as an identifier.

@mrvkino
Copy link
Author

mrvkino commented Oct 30, 2021

@jecassis If you solution seems to work on the full identifier naming rule I will gladly take it. I was just warning against putting to much focus on a fringe case.

I am a happy user of VSCode-SystemVerilog, so if the colouring works for tool generated code that is great. It will help me when debugging synthesis issue.

Thank you for your work

@github-actions github-actions bot locked and limited conversation to collaborators Dec 11, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bracket-pair-colorization bug Issue identified by VS Code Team member as probable bug insiders-released Patch has been released in VS Code Insiders verified Verification succeeded
Projects
None yet
Development

No branches or pull requests

6 participants
@jecassis @mrvkino @maxnordlund @jrieken @hediet and others