[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

Peefy · 2023-02-21T11:25:06Z

Background

At present, the KCL compiler front-end can only better meet the role and information reserve of the forward compilation process. It needs a more modern IDE-oriented compiler front-end, which mainly includes the lexer, parser, and resolver parts, to provide more information about the compilation process, and achieve the ability of syntax error recovery and incremental compilation.

Goals & Principles

Support syntax error recovery, a better AST design with information pull mode and memory cache, and incremental compilation to provide a basic guarantee for IDE performance and stability.
The front end of the IDE or the editor plays the role of knowing nothing. All information is obtained through LSP and kcl-language-server.
The IDE and KCL CLI users are treated equally. All the information needed can be obtained during the compilation process, avoiding the need for the CLI and IDE to maintain two sets of front-end just like rustc and rust analyzer.
The kcl-language-server acts as the middle layer of the compiler and IDE through the LSP protocol, only does a small amount of information conversion, and does not maintain too heavy logic itself. At the same time, the kcl-language-server should be stable and can automatically recover from various internal panics.
Build a set of an overall test plans for the IDE and provide the basic guarantee for the stability of the KCL compiler.
In a KCL code library of a certain scale, such as Konfig, the response time of the basic functions of the IDE, such as completion and jump, is within 100ms.

Overview

"Semantic code model" is basically an object-oriented representation of modules, functions, and types that appear in the source code. This representation is completely "resolved": all expressions have types (Note that there may be expression types that cannot be deduced in KCL, and they will be defined as the any type), and all references are bound to declarations, etc.
"Input Data" is basically a set of file contents Vec<(Pathbuf, String)>.
The client can submit a small amount of input data (usually changes to a single file) and obtain a new code model to explain the changes.
The underlying engine ensures that the model is lazy (on-demand) and incrementally calculated, and can be quickly updated for small changes.

Design

Incremental Lexer: Modify TokenStream according to the change of input, and use SouceMap to cache source code files in memory. Calculate the diff of Souce each time, and then update the corresponding Token according to the diff.
Parallel Partial Parser with Error Recovery: Usually parser returns parser generation (ast::Program, Vec<Error>) instead of Result<ast::Program, Error>. The most important feature of a handwriting parser is its strong support for error recovery and partial parsing (In KCL, we have written LL (1) parsers with recursive descent instead of LR parsers generated according to syntax, so the difficulty of syntax error recovery has been reduced).
Resolver: The KCL resolver currently generates scope information including the symbol table. However, IDE requires a richer Semantic Code Model, more accurate location information, Typing Recovery strategy, reverse type derivation, and type guard (to help users write fewer type annotations).

Incremental Lexer

This feature can be ignored in the early stage because the performance improvement is not as good as incremental and parallel parsing.

Parallel Partial Parser with Error Recovery

Error Recovery

The program may have different levels of errors:

Lexical errors: including misspelling of identifiers, keywords, and operators, as well as incorrect quotation marks on the string text: for example, the identifier person is wrongly written as preson
Syntax error: redundant or missing curly brackets
Semantic error: the type does not match, for example, the type declared for the schema attribute does not match the type assigned to the default value
Logic error: It can be any error caused by the wrong reasoning of the programmer, such as the existence of unreachable code, the existence of a dead cycle, etc. Such a program may be well structured, but it is not consistent with the actual intention of the programmer.

The recovery strategy of KCL is as follows:

If you are parsing a homogeneous sequence of things (i.e, you are inside the loop), and the current token does not look like it can begin a new element, you just skip over it and start the next iteration of the loop. Here’s an example from Kotlin. At this line, we’ll get null if current token could not begin a class member declaration. Here we just skip over it.
If you are parsing a particular thing T, and you expect token foo, but see bar, then, roughly:
- if bar is not in the FOLLOW(T), you skip over it and emit error,
- if bar is in FOLLOW(T), you emit an error, but don’t skip the token.
- where FOLLOW(T) denotes the token set that you want to recover from errors, e.g., https://github.com/JetBrains/kotlin/blob/9891f562cc0acb505ee5ff2f30626253ace0201a/compiler/psi/src/org/jetbrains/kotlin/parsing/KotlinParsing.java#L1387 for more documents

    /// Create an error node and consume the next token.
    pub(crate) fn err_recover(&mut self, message: &str, recovery: TokenSet) {
        match self.current() {
            T!['{'] | T!['}'] => {
                self.error(message);
                return;
            }
            _ => (),
        }

        if self.at_ts(recovery) {
            self.error(message);
            return;
        }

        let m = self.start();
        self.error(message);
        self.bump_any();
        m.complete(self, ERROR);
    }

In the lexical phase, it is necessary to recover (, ), [, ], {, }, and indentation symbols from errors. The specific method is to complete the missing brackets and indentation in the bracket matching and indentation phase of the lexer. For example, the (1 + 2 expression returns AST ParenExpr(BinaryExpr) and errors [ParenMismatchError].
For the parser, the parse function of each syntax node, Option<ast::Stmt> or Option<ast::Expr> is returned.

For example

dot recovery

missing expression

Parallel Parsing

In the simplest way, we can perform parallel parsing according to the granularity of the KCL package, just like KCL parallel codegen.

Partial Parsing

For local parsing, it means that the parser can start parsing from anywhere between the root node and the leaf node of the AST. We can design different parse entries to meet this point. The definition of the usual parse entry may be as follows

#[derive(Debug)]
pub enum ParseEntryPoint {
    TopLevel,
    Stmt,
    Ty,
    Expr,
    Block,
    Schema,
    // Omit more entries.
}

The specific form of the analytic entry function is

trait Parse {
     type Input;
     type Output;
     fn parse(&self, input: &Self::Input) -> Self::Output;
}

impl Parse for ParseEntryPoint {
    type Input = String;
    type Output = Tree;
    fn parse(&self, input: &Self::Input) -> Self::Output {
        let entry_point: fn(&'_ mut kclvm_parse::Parser<'_>) = match self {
            ParseEntryPoint::TopLevel => kclvm_parse::parse,
            ParseEntryPoint::Stmt => kclvm_parse::parse_stmt,
            ParseEntryPoint::Ty => kclvm_parse::parse_ty,
            ParseEntryPoint::Expr => kclvm_parse::parse_expr,
            ParseEntryPoint::Block => kclvm_parse::parse_block,
            ParseEntryPoint::Schema => kclvm_parse::parse_schema,
        };
        // Omit more code
    }
}

Abstract Syntax Tree & (Lossless/Describing/Concrete) Syntax Tree

When writing an IDE, one of the core data structures is the lossless (describing) syntax tree. It is a full-fidelity tree that represents the original source code in detail, including parenthesis, comments, and whitespace. CSTs are used for the initial analysis of the language. They are also a vocabulary type for refactors. Although the ultimate result of a refactor is a text diff, tree modification is a more convenient internal representation.

For example, we want to display different highlight colors and prompt information for different attribute operators of KCL. At this time, AST does not have the token position information of attribute operators, and AST cannot calculate this information. At this time, a lossless syntax tree is required.

Another example, we want to show some information after the right bracket.

Therefore, we have two ways to complete AST information:

Provide more span location information of tokens. An AST Module node contains token information to store and avoid redundant calculations based on AST information. Use AST information directly as DB storage of incremental computing engine.

struct Module {}
impl Module {
    pub fn mod_token(&self)       -> Option<SyntaxToken> { ... }
    pub fn item_list(&self)       -> Option<ItemList>    { ... }
    pub fn semicolon_token(&self) -> Option<SyntaxToken> { ... }
}

Token stream can be returned on AST node.

/// A trait for AST nodes having (or not having) collected tokens.
pub trait HasTokens {
    fn tokens(&self) -> Option<&LazyAttrTokenStream>;
    fn tokens_mut(&mut self) -> Option<&mut Option<LazyAttrTokenStream>>;
}

Resolver

Semantic Model

All AST node has types after type inferencing and checking.

type ExprId = usize;

pub struct SemanticModel {
    program: Arc<kclvm_ast::Program>,  // Add more lossless information for AST nodes by adding AstToken trait.
    ast_expr_mapping: Arc<IndexMap<ExprId, IndexSet<Option<ExprId>>>>,  // Store AST parents and childrens
    scope: Arc<kclvm_sema::Scope>,  // Scope contains builtin functions.
    type_of_expr: Arc<IndexMap<ExprId, kclvm_sema::Ty>>,  // All AST types.
    parse_errors: Vec<Error>,  // Parse errors
    resolve_errors: Vec<Error>,  // Resolve errors
    resolver_warnings: Vec<Warning>,  // Resolve warnings
    // Other fields related to the language deisgn such as session, target and compiler options.
}

Walking AST from leaf nodes to root nodes

Incremental Building

AST as DB (using salsa as the incremental computation engine). For example

type FileId = usize;
type AstId = usize;
type AstIdMap = IndexMap<AstId, kclvm_ast::Module>;
type ParseResult = (kclvm_ast::Module>, Vec<Error>)

/// SourceMap Database which stores all significant input facts: source code and project model.
#[salsa::query_group(SourceDatabaseStorage)]
pub trait SourceDatabase: salsa::Database {
    /// Text of the file. `salsa::input` denotes that we can call `db.set_file_text(file_id, Arc::new(file_text))` and `file_id` is the database index.
    #[salsa::input]
    fn file_text(&self, file_id: FileId) -> Arc<String>;

    /// Returns the relative path of a file
    fn file_relative_path(&self, file_id: FileId) -> PathBuf;

    /// Returns the relative path of a file
    fn file_absolute_path(&self, file_id: FileId) -> PathBuf;
}

/// Syntax Database
#[salsa::query_group(AstDatabaseStorage)]
pub trait AstDatabase: SourceDatabase {
    /// Parses the file into AST
    #[salsa::invoke(parse_query)]
    fn parse(&self, file_id: FileId) -> ParseResult;

    /// Returns the top-level AST mapping
    #[salsa::invoke(crate::source_id::AstIdMap::ast_id_map_query)]
    fn ast_id_map(&self, file_id: FileId) -> Arc<AstIdMap>;
}

/// A Parse function based on the DB querying
fn parse_query(db: &dyn AstDatabase, file_id: FileId) -> ParseResult {
    let text = db.file_text(file_id);
    let file = db.file_absolute_path(file_id);
    kclvm_parse::parse_file(file, Some(text))
}

Semantic Model as Database: We will also salsa as the incremental calculation and query engine, build the storage and query of the semantic model. At this stage, the main tasks are type derivation and type checking.

pub trait SemaDatabaseStorage: AstDatabase + SourceDatabase {
    // Omit methods.
}

Codegen Database: If some semantic information is checked by IR in a lower dimension, such as control flow and dead code, you can also build a database to store this information at this stage.
Monitor in the Demon Compiler with VFS: Breaking with the Ctrl+C signal on file write, create, remove, and rename events.

#[salsa::database(
    SourceDatabaseStorage,
    AstDatabaseStorage,
    SemaDatabaseStorage,
    CodeGenDatabaseStorage
)]
pub struct CompilerDatabase {
    storage: salsa::Storage<Self>,
}

pub struct CompilerConfig {
    db: CompilerDatabase,
    // Omit other configs
}

pub struct Compiler {
    config: CompilerConfig,
    // Omit other fields
}

/// A sample compile function with the compiler database
fn compile<T: AsRef<Path>>(compiler: &mut Compiler, file: T) {
    let file_id = alloc_file_id(file);
    let db = &compiler.config.db;
    db.set_file_text(file_id, Arc::new(std::fs::read_text(file)))
    let parse_result = db.parse(file_id)
    // Omit the resolve and codegen process.
}

CLI

kclvm_cli build --watch --incremental

Watch Mode: Whether to enable monitoring mode, monitor file changes, and compile. (In the language server, the feature will be enabled by default).
Incremental Mode: The feature flag decides whether to incrementally build in the compiler.

Language Server Workspace

Similar to rust cargo.toml as a crate, a workspace can contain multiple crates. KCL projects, and a KCL project contains the kcl.yaml compile the manifest file.

use kclvm_config::Config;

/// The configuration used by the language server.
#[derive(Debug, Clone)]
pub struct Config {
    /// The root directory of the workspace
    pub root: AbsPathBuf,
    /// A collection of projects discovered within the workspace
    pub discovered_projects: Option<Vec<Config>>,
}

Language Server

See #297

Note: For the incremental compilation feature, the language server obtains semantic information containing errors from various databases. Different KCL projects in the same workspace share the AST DB. The ability of language-server will not be implemented in the compiler, such as document_ Symbol, completion, jump, etc., which are implemented by language-server and can be implemented by adding caches such as AnalysisDatabase.

Tasks

Reference

Anders Hejlsberg on Modern Compiler Construction: https://learn.microsoft.com/en-us/shows/seth-juarez/anders-hejlsberg-on-modern-compiler-construction
Rust Analyzer https://github.com/rust-lang/rust-analyzer
Rust Compiler: https://github.com/rust-lang/rust
Rust Union Find Impl: https://github.com/rust-lang/ena
Rust Snippets: https://github.com/rust-lang/annotate-snippets-rs
Rust Compiler-RT binding: https://github.com/rust-lang/compiler-builtins
Typescript: https://github.com/microsoft/TypeScript
Pyright: https://github.com/microsoft/pyright
Onflow Cadence Language Server: https://github.com/onflow/cadence-tools/tree/master/languageserver
Roslyn: https://github.com/dotnet/roslyn
Salsa (A generic framework for on-demand, incremental computation inspired by Rustc query system): https://github.com/salsa-rs/salsa
IntelliJ: https://rust-analyzer.github.io/blog/2020/07/20/three-architectures-for-responsive-ide.html
Clangd: https://clangd.llvm.org/design/
Kotlin Parsers: https://github.com/JetBrains/kotlin/blob/9891f562cc0acb505ee5ff2f30626253ace0201a/compiler/psi/src/org/jetbrains/kotlin/parsing/KotlinParsing.java
Merlin: A Language Server for OCaml (Experience Report): https://arxiv.org/pdf/1807.06702.pdf
The Mun Programming Language: https://github.com/mun-lang/mun
Jsonnet Language Server: https://github.com/grafana/jsonnet-language-server
Go LSP Server Statement Completion: https://github.com/golang/tools/blob/master/gopls/internal/lsp/source/completion/statements.go
Farm: https://github.com/farm-fe/farm

The text was updated successfully, but these errors were encountered:

He1pa · 2023-02-24T09:21:48Z

Difference between AST and lossless syntax tree for IDE and LSP.

Peefy · 2023-03-02T06:30:38Z

Difference between AST and lossless syntax tree for IDE and LSP.

For example, we want to display different highlight colors and prompt information for different attribute operators of KCL. At this time, AST does not have the token position information of attribute operators, and AST cannot calculate this information. At this time, a lossless syntax tree is required.

Peefy added ide Issues or PRs related to kcl LSP and IDE plugins long-term arch labels Feb 21, 2023

Peefy added this to the v0.4.6 Release milestone Feb 21, 2023

Peefy self-assigned this Feb 21, 2023

Peefy changed the title ~~[WIP] [Enhancement] A better KCL compiler technology architecture for IDE plugins~~ [WIP] [Enhancement] A better KCL compiler frontend technology architecture for IDE plugins Feb 21, 2023

Peefy assigned amyXia1994 and He1pa Feb 21, 2023

Peefy changed the title ~~[WIP] [Enhancement] A better KCL compiler frontend technology architecture for IDE plugins~~ [WIP] [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins Feb 22, 2023

Peefy added the important label Feb 23, 2023

Peefy assigned chai2010 and unassigned chai2010, amyXia1994 and He1pa Feb 23, 2023

Peefy changed the title ~~[WIP] [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins~~ [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins Feb 28, 2023

Peefy mentioned this issue Mar 2, 2023

[WIP] Feat(LSP): Semamtic token provider (#297) #432

Closed

16 tasks

Peefy added parser Issues or PRs related to kcl parser ast resolver labels Mar 3, 2023

Peefy assigned amyXia1994, He1pa and zong-zhe Mar 9, 2023

amyXia1994 mentioned this issue Mar 10, 2023

[Enhancement] Support KCL workspace compilation unit identification and recognition #446

Closed

Peefy mentioned this issue Mar 13, 2023

feat: mismatch delimiter error recovery in parser #447

Merged

16 tasks

Peefy mentioned this issue Mar 27, 2023

feat: parse stmt error recovery #467

Merged

16 tasks

This was referenced Apr 4, 2023

[WIP] feat: add more position information for the ast identifier node #485

Closed

feat: invalid comp clause parse and sematic error recovery. #497

Merged

Peefy removed this from the v0.4.6 Release milestone Apr 10, 2023

Peefy changed the title ~~[Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins~~ [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions Apr 23, 2023

Peefy mentioned this issue Apr 23, 2023

[Enhancement][Track] More Editor or IDE extension support based on the kcl language server #513

Open

5 tasks

Peefy assigned i-zhen and NeverRaR and unassigned amyXia1994, He1pa and zong-zhe Sep 12, 2023

Peefy mentioned this issue Sep 23, 2023

[Enhancement][Track] Basic Technology Architecture and Major Issues Governance #673

Closed

Peefy changed the title ~~[Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions~~ [Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions Oct 9, 2023

Peefy mentioned this issue Oct 9, 2023

[Feature] KCL Namer Develop for better sematic model and APIs #742

Closed

amyXia1994 mentioned this issue Oct 18, 2023

[Enhancement][Track] Formatting tools with KCL Code style specifications #627

Open

7 tasks

Peefy mentioned this issue Nov 1, 2023

Feat: add lambda type annotation #826

Merged

16 tasks

Peefy closed this as completed Jan 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

Peefy commented Feb 21, 2023 •

edited

He1pa commented Feb 24, 2023

Peefy commented Mar 2, 2023

[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

Comments

Peefy commented Feb 21, 2023 • edited

Background

Goals & Principles

Overview

Design

Incremental Lexer

Parallel Partial Parser with Error Recovery

Error Recovery

Parallel Parsing

Partial Parsing

Abstract Syntax Tree & (Lossless/Describing/Concrete) Syntax Tree

Resolver

Semantic Model

Incremental Building

CLI

Language Server Workspace

Language Server

Tasks

Reference

He1pa commented Feb 24, 2023

Peefy commented Mar 2, 2023

Peefy commented Feb 21, 2023 •

edited