Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions #420

Closed
13 tasks done
Peefy opened this issue Feb 21, 2023 · 2 comments
Assignees
Labels
arch ast ide Issues or PRs related to kcl LSP and IDE plugins important long-term parser Issues or PRs related to kcl parser resolver

Comments

@Peefy
Copy link
Contributor

Peefy commented Feb 21, 2023

Background

At present, the KCL compiler front-end can only better meet the role and information reserve of the forward compilation process. It needs a more modern IDE-oriented compiler front-end, which mainly includes the lexer, parser, and resolver parts, to provide more information about the compilation process, and achieve the ability of syntax error recovery and incremental compilation.

Goals & Principles

  • Support syntax error recovery, a better AST design with information pull mode and memory cache, and incremental compilation to provide a basic guarantee for IDE performance and stability.
  • The front end of the IDE or the editor plays the role of knowing nothing. All information is obtained through LSP and kcl-language-server.
  • The IDE and KCL CLI users are treated equally. All the information needed can be obtained during the compilation process, avoiding the need for the CLI and IDE to maintain two sets of front-end just like rustc and rust analyzer.
  • The kcl-language-server acts as the middle layer of the compiler and IDE through the LSP protocol, only does a small amount of information conversion, and does not maintain too heavy logic itself. At the same time, the kcl-language-server should be stable and can automatically recover from various internal panics.
  • Build a set of an overall test plans for the IDE and provide the basic guarantee for the stability of the KCL compiler.
  • In a KCL code library of a certain scale, such as Konfig, the response time of the basic functions of the IDE, such as completion and jump, is within 100ms.

Overview

image

  • "Semantic code model" is basically an object-oriented representation of modules, functions, and types that appear in the source code. This representation is completely "resolved": all expressions have types (Note that there may be expression types that cannot be deduced in KCL, and they will be defined as the any type), and all references are bound to declarations, etc.
  • "Input Data" is basically a set of file contents Vec<(Pathbuf, String)>.
  • The client can submit a small amount of input data (usually changes to a single file) and obtain a new code model to explain the changes.
  • The underlying engine ensures that the model is lazy (on-demand) and incrementally calculated, and can be quickly updated for small changes.

Design

image

  • Incremental Lexer: Modify TokenStream according to the change of input, and use SouceMap to cache source code files in memory. Calculate the diff of Souce each time, and then update the corresponding Token according to the diff.
  • Parallel Partial Parser with Error Recovery: Usually parser returns parser generation (ast::Program, Vec<Error>) instead of Result<ast::Program, Error>. The most important feature of a handwriting parser is its strong support for error recovery and partial parsing (In KCL, we have written LL (1) parsers with recursive descent instead of LR parsers generated according to syntax, so the difficulty of syntax error recovery has been reduced).
  • Resolver: The KCL resolver currently generates scope information including the symbol table. However, IDE requires a richer Semantic Code Model, more accurate location information, Typing Recovery strategy, reverse type derivation, and type guard (to help users write fewer type annotations).

Incremental Lexer

This feature can be ignored in the early stage because the performance improvement is not as good as incremental and parallel parsing.

Parallel Partial Parser with Error Recovery

image

Error Recovery

The program may have different levels of errors:

  • Lexical errors: including misspelling of identifiers, keywords, and operators, as well as incorrect quotation marks on the string text: for example, the identifier person is wrongly written as preson
  • Syntax error: redundant or missing curly brackets
  • Semantic error: the type does not match, for example, the type declared for the schema attribute does not match the type assigned to the default value
  • Logic error: It can be any error caused by the wrong reasoning of the programmer, such as the existence of unreachable code, the existence of a dead cycle, etc. Such a program may be well structured, but it is not consistent with the actual intention of the programmer.

The recovery strategy of KCL is as follows:

image

    /// Create an error node and consume the next token.
    pub(crate) fn err_recover(&mut self, message: &str, recovery: TokenSet) {
        match self.current() {
            T!['{'] | T!['}'] => {
                self.error(message);
                return;
            }
            _ => (),
        }

        if self.at_ts(recovery) {
            self.error(message);
            return;
        }

        let m = self.start();
        self.error(message);
        self.bump_any();
        m.complete(self, ERROR);
    }
  • In the lexical phase, it is necessary to recover (, ), [, ], {, }, and indentation symbols from errors. The specific method is to complete the missing brackets and indentation in the bracket matching and indentation phase of the lexer. For example, the (1 + 2 expression returns AST ParenExpr(BinaryExpr) and errors [ParenMismatchError].
  • For the parser, the parse function of each syntax node, Option<ast::Stmt> or Option<ast::Expr> is returned.

For example

  • dot recovery

image

  • missing expression

image

Parallel Parsing

In the simplest way, we can perform parallel parsing according to the granularity of the KCL package, just like KCL parallel codegen.

Partial Parsing

For local parsing, it means that the parser can start parsing from anywhere between the root node and the leaf node of the AST. We can design different parse entries to meet this point. The definition of the usual parse entry may be as follows

#[derive(Debug)]
pub enum ParseEntryPoint {
    TopLevel,
    Stmt,
    Ty,
    Expr,
    Block,
    Schema,
    // Omit more entries.
}

The specific form of the analytic entry function is

trait Parse {
     type Input;
     type Output;
     fn parse(&self, input: &Self::Input) -> Self::Output;
}

impl Parse for ParseEntryPoint {
    type Input = String;
    type Output = Tree;
    fn parse(&self, input: &Self::Input) -> Self::Output {
        let entry_point: fn(&'_ mut kclvm_parse::Parser<'_>) = match self {
            ParseEntryPoint::TopLevel => kclvm_parse::parse,
            ParseEntryPoint::Stmt => kclvm_parse::parse_stmt,
            ParseEntryPoint::Ty => kclvm_parse::parse_ty,
            ParseEntryPoint::Expr => kclvm_parse::parse_expr,
            ParseEntryPoint::Block => kclvm_parse::parse_block,
            ParseEntryPoint::Schema => kclvm_parse::parse_schema,
        };
        // Omit more code
    }
}

Abstract Syntax Tree & (Lossless/Describing/Concrete) Syntax Tree

When writing an IDE, one of the core data structures is the lossless (describing) syntax tree. It is a full-fidelity tree that represents the original source code in detail, including parenthesis, comments, and whitespace. CSTs are used for the initial analysis of the language. They are also a vocabulary type for refactors. Although the ultimate result of a refactor is a text diff, tree modification is a more convenient internal representation.

For example, we want to display different highlight colors and prompt information for different attribute operators of KCL. At this time, AST does not have the token position information of attribute operators, and AST cannot calculate this information. At this time, a lossless syntax tree is required.

Another example, we want to show some information after the right bracket.

image

Therefore, we have two ways to complete AST information:

  • Provide more span location information of tokens. An AST Module node contains token information to store and avoid redundant calculations based on AST information. Use AST information directly as DB storage of incremental computing engine.
struct Module {}
impl Module {
    pub fn mod_token(&self)       -> Option<SyntaxToken> { ... }
    pub fn item_list(&self)       -> Option<ItemList>    { ... }
    pub fn semicolon_token(&self) -> Option<SyntaxToken> { ... }
}
  • Token stream can be returned on AST node.
/// A trait for AST nodes having (or not having) collected tokens.
pub trait HasTokens {
    fn tokens(&self) -> Option<&LazyAttrTokenStream>;
    fn tokens_mut(&mut self) -> Option<&mut Option<LazyAttrTokenStream>>;
}

Resolver

Semantic Model

  • All AST node has types after type inferencing and checking.
type ExprId = usize;

pub struct SemanticModel {
    program: Arc<kclvm_ast::Program>,  // Add more lossless information for AST nodes by adding AstToken trait.
    ast_expr_mapping: Arc<IndexMap<ExprId, IndexSet<Option<ExprId>>>>,  // Store AST parents and childrens
    scope: Arc<kclvm_sema::Scope>,  // Scope contains builtin functions.
    type_of_expr: Arc<IndexMap<ExprId, kclvm_sema::Ty>>,  // All AST types.
    parse_errors: Vec<Error>,  // Parse errors
    resolve_errors: Vec<Error>,  // Resolve errors
    resolver_warnings: Vec<Warning>,  // Resolve warnings
    // Other fields related to the language deisgn such as session, target and compiler options.
}
  • Walking AST from leaf nodes to root nodes

Incremental Building

  • AST as DB (using salsa as the incremental computation engine). For example
type FileId = usize;
type AstId = usize;
type AstIdMap = IndexMap<AstId, kclvm_ast::Module>;
type ParseResult = (kclvm_ast::Module>, Vec<Error>)

/// SourceMap Database which stores all significant input facts: source code and project model.
#[salsa::query_group(SourceDatabaseStorage)]
pub trait SourceDatabase: salsa::Database {
    /// Text of the file. `salsa::input` denotes that we can call `db.set_file_text(file_id, Arc::new(file_text))` and `file_id` is the database index.
    #[salsa::input]
    fn file_text(&self, file_id: FileId) -> Arc<String>;

    /// Returns the relative path of a file
    fn file_relative_path(&self, file_id: FileId) -> PathBuf;

    /// Returns the relative path of a file
    fn file_absolute_path(&self, file_id: FileId) -> PathBuf;
}

/// Syntax Database
#[salsa::query_group(AstDatabaseStorage)]
pub trait AstDatabase: SourceDatabase {
    /// Parses the file into AST
    #[salsa::invoke(parse_query)]
    fn parse(&self, file_id: FileId) -> ParseResult;

    /// Returns the top-level AST mapping
    #[salsa::invoke(crate::source_id::AstIdMap::ast_id_map_query)]
    fn ast_id_map(&self, file_id: FileId) -> Arc<AstIdMap>;
}

/// A Parse function based on the DB querying
fn parse_query(db: &dyn AstDatabase, file_id: FileId) -> ParseResult {
    let text = db.file_text(file_id);
    let file = db.file_absolute_path(file_id);
    kclvm_parse::parse_file(file, Some(text))
}
  • Semantic Model as Database: We will also salsa as the incremental calculation and query engine, build the storage and query of the semantic model. At this stage, the main tasks are type derivation and type checking.
pub trait SemaDatabaseStorage: AstDatabase + SourceDatabase {
    // Omit methods.
}
  • Codegen Database: If some semantic information is checked by IR in a lower dimension, such as control flow and dead code, you can also build a database to store this information at this stage.
  • Monitor in the Demon Compiler with VFS: Breaking with the Ctrl+C signal on file write, create, remove, and rename events.
#[salsa::database(
    SourceDatabaseStorage,
    AstDatabaseStorage,
    SemaDatabaseStorage,
    CodeGenDatabaseStorage
)]
pub struct CompilerDatabase {
    storage: salsa::Storage<Self>,
}

pub struct CompilerConfig {
    db: CompilerDatabase,
    // Omit other configs
}

pub struct Compiler {
    config: CompilerConfig,
    // Omit other fields
}

/// A sample compile function with the compiler database
fn compile<T: AsRef<Path>>(compiler: &mut Compiler, file: T) {
    let file_id = alloc_file_id(file);
    let db = &compiler.config.db;
    db.set_file_text(file_id, Arc::new(std::fs::read_text(file)))
    let parse_result = db.parse(file_id)
    // Omit the resolve and codegen process.
}

CLI

kclvm_cli build --watch --incremental
  • Watch Mode: Whether to enable monitoring mode, monitor file changes, and compile. (In the language server, the feature will be enabled by default).
  • Incremental Mode: The feature flag decides whether to incrementally build in the compiler.

Language Server Workspace

Similar to rust cargo.toml as a crate, a workspace can contain multiple crates. KCL projects, and a KCL project contains the kcl.yaml compile the manifest file.

use kclvm_config::Config;

/// The configuration used by the language server.
#[derive(Debug, Clone)]
pub struct Config {
    /// The root directory of the workspace
    pub root: AbsPathBuf,
    /// A collection of projects discovered within the workspace
    pub discovered_projects: Option<Vec<Config>>,
}

Language Server

See #297

Note: For the incremental compilation feature, the language server obtains semantic information containing errors from various databases. Different KCL projects in the same workspace share the AST DB. The ability of language-server will not be implemented in the compiler, such as document_ Symbol, completion, jump, etc., which are implemented by language-server and can be implemented by adding caches such as AnalysisDatabase.

Tasks

Reference

@Peefy Peefy added ide Issues or PRs related to kcl LSP and IDE plugins long-term arch labels Feb 21, 2023
@Peefy Peefy added this to the v0.4.6 Release milestone Feb 21, 2023
@Peefy Peefy self-assigned this Feb 21, 2023
@Peefy Peefy changed the title [WIP] [Enhancement] A better KCL compiler technology architecture for IDE plugins [WIP] [Enhancement] A better KCL compiler frontend technology architecture for IDE plugins Feb 21, 2023
@Peefy Peefy changed the title [WIP] [Enhancement] A better KCL compiler frontend technology architecture for IDE plugins [WIP] [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins Feb 22, 2023
@Peefy Peefy assigned chai2010 and unassigned chai2010, amyXia1994 and He1pa Feb 23, 2023
@He1pa
Copy link
Contributor

He1pa commented Feb 24, 2023

Difference between AST and lossless syntax tree for IDE and LSP.

@Peefy Peefy changed the title [WIP] [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins Feb 28, 2023
@Peefy
Copy link
Contributor Author

Peefy commented Mar 2, 2023

Difference between AST and lossless syntax tree for IDE and LSP.

For example, we want to display different highlight colors and prompt information for different attribute operators of KCL. At this time, AST does not have the token position information of attribute operators, and AST cannot calculate this information. At this time, a lossless syntax tree is required.

@Peefy Peefy removed this from the v0.4.6 Release milestone Apr 10, 2023
@Peefy Peefy changed the title [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE plugins [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions Apr 23, 2023
@Peefy Peefy assigned i-zhen and NeverRaR and unassigned amyXia1994, He1pa and zong-zhe Sep 12, 2023
@Peefy Peefy changed the title [Enhancement] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions [Enhancement][Track] A better KCL compiler frontend technology architecture for the LSP tool and IDE extensions Oct 9, 2023
@Peefy Peefy closed this as completed Jan 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arch ast ide Issues or PRs related to kcl LSP and IDE plugins important long-term parser Issues or PRs related to kcl parser resolver
Projects
No open projects
Status: Track Issues
Development

No branches or pull requests

7 participants