
Semantic Tokens API #86415

Open
alexdima opened this issue Dec 5, 2019 · 3 comments

@alexdima (Member) commented Dec 5, 2019

This issue tracks the API proposal for semantic tokens.


Sample: ---> https://github.com/microsoft/vscode-extension-samples/tree/master/semantic-tokens-sample


In-depth API explanation

A file can contain many tokens, perhaps even hundreds of thousands. Therefore, to reduce the memory consumption of describing semantic tokens, we have decided to avoid allocating an object for each token and instead represent the tokens of a file as an array of integers. Furthermore, the position of each token is expressed relative to the token before it, because most tokens remain stable relative to each other when edits are made in a file.


In short, each token is represented by 5 integers, so a specific token i in the file consists of the following fields (a small decoding sketch follows this list):

  • at index 5*i - deltaLine: token line number, relative to the previous token
  • at index 5*i+1 - deltaStart: token start character, relative to the previous token's start character if they are on the same line, otherwise relative to 0 (i.e. an absolute start character)
  • at index 5*i+2 - length: the length of the token. A token cannot be multiline.
  • at index 5*i+3 - tokenType: will be looked up in SemanticTokensLegend.tokenTypes
  • at index 5*i+4 - tokenModifiers: each set bit will be looked up in SemanticTokensLegend.tokenModifiers
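
For illustration, here is a minimal decoding sketch (not the actual editor implementation) that walks the flat array, reconstructs absolute positions, and resolves legend indices; the interface and function names are ours, not part of the proposed API:

```ts
interface DecodedToken {
  line: number;
  startChar: number;
  length: number;
  tokenType: string;
  tokenModifiers: string[];
}

function decodeTokens(
  data: number[],
  legend: { tokenTypes: string[]; tokenModifiers: string[] }
): DecodedToken[] {
  const tokens: DecodedToken[] = [];
  let line = 0;
  let char = 0;
  for (let i = 0; i < data.length; i += 5) {
    const [deltaLine, deltaStart, length, typeIdx, modifierBits] = data.slice(i, i + 5);
    line += deltaLine;
    char = deltaLine === 0 ? char + deltaStart : deltaStart; // deltaStart is absolute on a new line
    // Each set bit in modifierBits selects an entry of the modifiers legend.
    const tokenModifiers = legend.tokenModifiers.filter((_, bit) => modifierBits & (1 << bit));
    tokens.push({ line, startChar: char, length, tokenType: legend.tokenTypes[typeIdx], tokenModifiers });
  }
  return tokens;
}
```

Applied to the example array in the next section, this recovers the original { line, startChar, length, tokenType, tokenModifiers } objects.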

How to encode tokens

Here is an example of encoding a file with 3 tokens:

   { line: 2, startChar:  5, length: 3, tokenType: "properties", tokenModifiers: ["private", "static"] },
   { line: 2, startChar: 10, length: 4, tokenType: "types",      tokenModifiers: [] },
   { line: 5, startChar:  2, length: 7, tokenType: "classes",    tokenModifiers: [] }
  1. First of all, a legend must be devised. This legend must be provided up-front and capture all possible token types. For this example, we will choose the following legend which must be passed in when registering the provider:
   tokenTypes: ['properties', 'types', 'classes'],
   tokenModifiers: ['private', 'static']
  2. The first transformation step is to encode tokenType and tokenModifiers as integers using the legend. Token types are looked up by index, so a tokenType value of 1 means tokenTypes[1]. Multiple token modifiers can be set by using bit flags, so a tokenModifiers value of 3 is first viewed as binary 0b00000011, which means [tokenModifiers[0], tokenModifiers[1]] because bits 0 and 1 are set. Using this legend, the tokens now are:
   { line: 2, startChar:  5, length: 3, tokenType: 0, tokenModifiers: 3 },
   { line: 2, startChar: 10, length: 4, tokenType: 1, tokenModifiers: 0 },
   { line: 5, startChar:  2, length: 7, tokenType: 2, tokenModifiers: 0 }
  3. The next step is to encode each token relative to the previous token in the file. In this case, the second token is on the same line as the first token, so the startChar of the second token is made relative to the startChar of the first token, so it will be 10 - 5. The third token is on a different line than the second token, so the startChar of the third token will not be altered:
   { deltaLine: 2, deltaStartChar: 5, length: 3, tokenType: 0, tokenModifiers: 3 },
   { deltaLine: 0, deltaStartChar: 5, length: 4, tokenType: 1, tokenModifiers: 0 },
   { deltaLine: 3, deltaStartChar: 2, length: 7, tokenType: 2, tokenModifiers: 0 }
  4. Finally, the last step is to inline each of the 5 fields of a token into a single array, which is a memory-friendly representation (a complete encoding sketch follows after this list):
   // 1st token,  2nd token,  3rd token
   [  2,5,3,0,3,  0,5,4,1,0,  3,2,7,2,0 ]
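
As an illustration of steps 1–4, here is a minimal encoding sketch; the Token interface and the function name are ours, not part of the proposed API:

```ts
interface Token {
  line: number;
  startChar: number;
  length: number;
  tokenType: string;
  tokenModifiers: string[];
}

// Encodes tokens (assumed to be sorted by position, with every type and
// modifier present in the legend) into the flat 5-integers-per-token layout.
function encodeTokens(
  tokens: Token[],
  legend: { tokenTypes: string[]; tokenModifiers: string[] }
): Uint32Array {
  const data = new Uint32Array(tokens.length * 5);
  let prevLine = 0;
  let prevChar = 0;
  tokens.forEach((t, i) => {
    const deltaLine = t.line - prevLine;
    const deltaStart = deltaLine === 0 ? t.startChar - prevChar : t.startChar;
    let modifierBits = 0;
    for (const m of t.tokenModifiers) {
      modifierBits |= 1 << legend.tokenModifiers.indexOf(m);
    }
    data.set([deltaLine, deltaStart, t.length, legend.tokenTypes.indexOf(t.tokenType), modifierBits], i * 5);
    prevLine = t.line;
    prevChar = t.startChar;
  });
  return data;
}
```

Called with the three example tokens and the legend above, this produces exactly [ 2,5,3,0,3,  0,5,4,1,0,  3,2,7,2,0 ].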

How tokens change when the document changes

Let's look at how tokens might change.

Continuing with the above example, suppose a new line was inserted at the top of the file. That would make all the tokens move down by one line (notice how the line has changed for each one):

   { line: 3, startChar:  5, length: 3, tokenType: "properties", tokenModifiers: ["private", "static"] },
   { line: 3, startChar: 10, length: 4, tokenType: "types",      tokenModifiers: [] },
   { line: 6, startChar:  2, length: 7, tokenType: "classes",    tokenModifiers: [] }

The integer encoding of the tokens does not change substantially because of the delta-encoding of positions:

   // 1st token,  2nd token,  3rd token
   [  3,5,3,0,3,  0,5,4,1,0,  3,2,7,2,0 ]

It is possible to express these new tokens in terms of an edit applied to the previous tokens:

   [  2,5,3,0,3,  0,5,4,1,0,  3,2,7,2,0 ]
   [  3,5,3,0,3,  0,5,4,1,0,  3,2,7,2,0 ]

   edit: { start:  0, deleteCount: 1, data: [3] } // replace integer at offset 0 with 3

Furthermore, let's assume that a new token has appeared on line 4:

   { line: 3, startChar:  5, length: 3, tokenType: "properties", tokenModifiers: ["private", "static"] },
   { line: 3, startChar: 10, length: 4, tokenType: "types",      tokenModifiers: [] },
   { line: 4, startChar:  3, length: 5, tokenType: "properties", tokenModifiers: ["static"] },
   { line: 6, startChar:  2, length: 7, tokenType: "classes",    tokenModifiers: [] }

The integer encoding of the tokens is:

   // 1st token,  2nd token,  3rd token,  4th token
   [  3,5,3,0,3,  0,5,4,1,0,  1,3,5,0,2,  2,2,7,2,0, ]

Again, it is possible to express these new tokens in terms of an edit applied to the previous tokens:

   [  3,5,3,0,3,  0,5,4,1,0,  3,2,7,2,0 ]
   [  3,5,3,0,3,  0,5,4,1,0,  1,3,5,0,2,  2,2,7,2,0, ]

   edit: { start: 10, deleteCount: 1, data: [1,3,5,0,2,2] } // replace integer at offset 10 with [1,3,5,0,2,2]
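
For illustration, applying such an edit on the receiving side could look like the following sketch; the edit shape { start, deleteCount, data } follows the examples above, and the helper name is hypothetical:

```ts
interface TokensEdit {
  start: number;        // offset into the previous integer array
  deleteCount: number;  // number of integers to remove at that offset
  data?: number[];      // integers to insert in their place
}

// Applies edits to the previous flat token array and returns the new array.
// All edit offsets refer to the *previous* state, so apply them back to front.
function applyEdits(previous: number[], edits: TokensEdit[]): number[] {
  const result = previous.slice();
  const sorted = edits.slice().sort((a, b) => b.start - a.start);
  for (const edit of sorted) {
    result.splice(edit.start, edit.deleteCount, ...(edit.data ?? []));
  }
  return result;
}

// Example from above: replace the single integer at offset 10 with [1,3,5,0,2,2].
const previous = [3,5,3,0,3, 0,5,4,1,0, 3,2,7,2,0];
const next = applyEdits(previous, [{ start: 10, deleteCount: 1, data: [1,3,5,0,2,2] }]);
// next === [3,5,3,0,3, 0,5,4,1,0, 1,3,5,0,2, 2,2,7,2,0]
```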

When to return SemanticTokensEdits

When the user edits a document, multiple edits can occur before VS Code decides to invoke the semantic tokens provider. In principle, each call to provideSemanticTokens can return a full representation of the semantic tokens, and that would be a perfectly reasonable semantic tokens provider implementation.

However, when a language server runs in a separate process, transferring all the tokens between processes might be slow, so VS Code allows the provider to return the new tokens expressed as multiple edits applied to the previous tokens.

To clearly define what "previous tokens" means, a semantic tokens provider can return a resultId together with the tokens. If the editor still has the previous result in memory, it will pass the previous resultId in the options, at SemanticTokensRequestOptions.previousResultId. Only when the editor passes in a previous resultId may the provider return the new tokens expressed as edits to be applied to that previous result. Even in this case, the provider needs to return a new resultId that identifies the new tokens as the basis for the next request.

NOTE 1: It is illegal to return SemanticTokensEdits if options.previousResultId is not set.
NOTE 2: All edits in SemanticTokensEdits contain indices in the old integers array, so they all refer to the previous result state.
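
Putting this together, here is a hedged sketch of the decision logic; the shapes below (resultId, data, edits, previousResultId) mirror the description in this issue, not the exact proposed API surface, and the diff helper is ours:

```ts
interface TokensEdit { start: number; deleteCount: number; data?: number[]; }

let nextResultId = 1;
const previousResults = new Map<string, number[]>();

// Minimal single-edit diff: keep the common prefix and suffix, replace the middle.
function diffAsEdits(prev: number[], next: number[]): TokensEdit[] {
  let start = 0;
  while (start < prev.length && start < next.length && prev[start] === next[start]) start++;
  let endPrev = prev.length, endNext = next.length;
  while (endPrev > start && endNext > start && prev[endPrev - 1] === next[endNext - 1]) { endPrev--; endNext--; }
  if (start === endPrev && start === endNext) return []; // nothing changed
  return [{ start, deleteCount: endPrev - start, data: next.slice(start, endNext) }];
}

// `data` is the freshly computed flat integer array for the whole document.
function respond(data: number[], previousResultId?: string) {
  const resultId = String(nextResultId++);
  previousResults.set(resultId, data);

  const previous = previousResultId ? previousResults.get(previousResultId) : undefined;
  if (previous) {
    // The editor still has the previous result in memory, so edits against it are allowed.
    return { resultId, edits: diffAsEdits(previous, data) };
  }
  // No previousResultId was passed: returning SemanticTokensEdits would be illegal (NOTE 1),
  // so return the full representation.
  return { resultId, data };
}
```

On the arrays from the example above, diffAsEdits produces exactly the edit { start: 10, deleteCount: 1, data: [1,3,5,0,2,2] } shown earlier.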

@alexdima alexdima assigned alexdima and unassigned jrieken Dec 5, 2019
@alexdima alexdima added this to the November 2019 milestone Dec 5, 2019
@alexdima alexdima added the plan-item label Dec 5, 2019
@alexdima alexdima modified the milestones: November 2019, December 2019 Dec 5, 2019
@Vigilans (Member) commented Dec 19, 2019

Since microsoft/vscode-languageserver-node#367 proposes that the server push a notification to the client, would it be fine for SemanticTokensProvider to provide an event that can be registered, just like TreeDataProvider does? e.g.:

treeDataProvider.onDidChangeTreeDataEvent.fire(item)

vs

semanticTokenProvider.onDidChangeSemanticTokensEvent.fire(document)

In that case, the client would be able to request that VS Code re-render the semantic highlighting once a notification from the server is received.
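
For illustration, such a provider-side event might be wired up roughly as below; this is a hedged sketch, and the event and provider member names follow the suggestion above rather than the current proposal:

```ts
import * as vscode from 'vscode';

// Hypothetical shape, mirroring TreeDataProvider.onDidChangeTreeData.
const didChangeSemanticTokens = new vscode.EventEmitter<vscode.TextDocument>();

const provider = {
  onDidChangeSemanticTokens: didChangeSemanticTokens.event,
  provideSemanticTokens(document: vscode.TextDocument) {
    // ...compute and return the tokens for `document`...
  }
};

// When the language server pushes a "semantics changed" notification,
// the extension fires the event so the editor re-requests tokens:
// didChangeSemanticTokens.fire(changedDocument);
```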

@Vigilans (Member) commented Jan 13, 2020

Hi, I've implemented a provider prototype combining the Java language server and the semantic tokens API in vscode-java. Just clone it and run the extension in the Insiders version, and the provider is ready for testing.

Demo: (screen recordings attached in the original comment)

@matklad commented Jan 14, 2020

Am I correct that SemanticTokensRequestOptions.ranges allows the client to specify the "viewport" / subset of the document, for which highlights are requested?

So, for example, if I "goto definition" to a new file, the editor will first ask to color only the small visible part of the document (for a faster response), and then it'll issue a second query to color the whole file (so that scrolling doesn't show boring colorless text)?

This is a super-important optimization, 👍

Also, I'd like to suggest that maybe, if we have ranges, we don't need SemanticTokensEdits at all. I imagine this could work like this:

  • an editor always asks to highlight the visible range of the document (which is O(1) worth of data)
  • when a file is opened for the first time, the editor asks to highlight the whole file, in background, such that scrolling shows colored code
  • on the editor side we maintain this global highlighting by adjusting the ranges of the highlights. I.e., if the user types at the start of the document, we only re-ask about the first screenful of lines of the document, and just shift the ranges for the rest of the document (see the sketch after this list). If the user goes to the end of the document, we immediately show them the cached, shifted highlighting (which might be slightly off), and also ask the server to re-highlight the end region.
  • if an edit happens which invalidates the whole cached highlighting map, we re-ask the server to highlight the whole file.
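
A rough sketch of the range-shifting bookkeeping described above; all names here are illustrative and not part of any API:

```ts
// Illustrative only: cached per-token highlights are shifted when lines are
// inserted or removed, and only the visible range is re-requested afterwards.
interface CachedHighlight { line: number; startChar: number; length: number; tokenType: number; }

function shiftCacheOnEdit(cache: CachedHighlight[], editStartLine: number, lineDelta: number): CachedHighlight[] {
  // Highlights below the edited region keep their relative layout; just move them.
  return cache.map(h => (h.line >= editStartLine ? { ...h, line: h.line + lineDelta } : h));
}

// The editor would then re-request tokens only for the visible range, e.g.:
//   requestTokens(document, visibleRange).then(fresh => mergeIntoCache(cache, fresh));
```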

I don't like the presence of SemanticTokensEdits for two reasons:

  • this is a stateful bit, which would be annoying to synchronize between the client and the server,
  • I fear that folks would focus on the "incrementally highlight the whole file" use case, which I believe is fundamentally less efficient than "from-scratch highlight of the visible range". It is less efficient because, although you can send the final highlighting as edits, you typically still have to do O(file len) processing to compute those edits, even in many happy cases (for example, if you declare a new global variable, you need to re-highlight all function bodies). In some pathological cases, even the diff itself would be O(file len) (adding or removing a single " in a language with multiline strings). In contrast, the viewport approach is always roughly O(1) processing, because you can only fit so many letters on the screen.