Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make token parser customizable #2293

Draft
wants to merge 5 commits into
base: master
Choose a base branch
from
Draft

Conversation

Kha
Copy link
Member

@Kha Kha commented Jun 27, 2023

In theory, users can implement arbitrary lexical grammars as the built-in token parser is merely another combinator whose use can be avoided; in practice, avoiding it also means forgoing the existing infrastructure around token caching and further parsers built on top of tokenFn, such as symbol implicitly used by syntax.

There are other nice-to-haves we have talked about before like per-category token tables, but this PR implements the most simple change to address the issues mentioned above: it makes the implementation of tokenFn customizable such that people can implement their own tokenization in it without losing access to all the infrastructure built on top of it.

Implementation-wise, the change mostly consist of further splitting ParserContextCore/ParserFn to prevent non-wellfounded type recursion and moving the Lean language's tokenizer into a new module Lean.Parser.LeanToken to distinguish it from language-generic parsers in Basic (which is too big anyway). Making it easier to adapt these Lean token functions for implementing a custom token parser is left as future work.

The change is validated by a simple test that changes the whitespace/comment syntax of the existing Lean language as well as an incomplete implementation of the venerable Whitespace language.

@Kha Kha force-pushed the custom-tokens branch 2 times, most recently from 0e83ad0 to 1650f94 Compare June 27, 2023 10:59
@leodemoura leodemoura self-assigned this Jun 27, 2023
@Kha
Copy link
Member Author

Kha commented Jun 28, 2023

!bench

@leanprover-bot
Copy link
Collaborator

Here are the benchmark results for commit 1650f94.
There were significant changes against commit bb8cc08:

  Benchmark          Metric             Change
  =======================================================
+ stdlib             tactic execution    -2.8% (-483.8 σ)
+ workspaceSymbols   branch-misses       -3.7%  (-16.7 σ)
- workspaceSymbols   task-clock           3.7%   (20.3 σ)
- workspaceSymbols   wall-clock           3.7%   (20.3 σ)

@Kha
Copy link
Member Author

Kha commented Jun 29, 2023

I have now benchmarked the PR on https://github.com/Kha/C-parsing-for-Lean4/tree/custom-tokens as well, before applying any changes to the actual parsing. Performance slightly increases, which doesn't really make sense and as such is probably a linker artifact, but it is at least further evidence that performance of parsing is not negatively affected

$ lake build; nix run nixpkgs#hyperfine ./build/bin/cParser -- --warmup 1
Benchmark 1: ./build/bin/cParser
  Time (mean ± σ):      1.233 s ±  0.008 s    [User: 1.139 s, System: 0.095 s]
  Range (min … max):    1.224 s …  1.251 s    10 runs
$ git co custom-tokens; lake build; nix run nixpkgs#hyperfine ./build/bin/cParser -- --warmup 1
Benchmark 1: ./build/bin/cParser
  Time (mean ± σ):      1.187 s ±  0.007 s    [User: 1.102 s, System: 0.087 s]
  Range (min … max):    1.178 s …  1.198 s    10 runs

@leodemoura leodemoura added the awaiting-review Waiting for someone to review the PR label Jul 10, 2023
@Kha
Copy link
Member Author

Kha commented Jul 19, 2023

Just to note some alternatives: instead of making the token parser function dynamically replaceable, we could also "simply" demand embedded languages to call their own token parser function directly. This would involve duplication around all those little functions that build on top of the token parser though, and syntax as well would somehow have to be taught to use a different symbol function. Making the switch to a different token parser explicit via the parser context also means we can reliably enforce that the token cache is cleared at that point like done in this PR.
Furthermore, we could tie token parsers to syntax categories such that they are automatically activated when the category is entered (much like we might want to do with per-category token tables?). This is something that can be done on top of this PR however.

structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where
tokens : TokenTable

def TokenParserFn := TokenParserContext → ParserState → ParserState
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment.

@@ -290,11 +283,29 @@ def mkUnexpectedErrorAt (s : ParserState) (msg : String) (pos : String.Pos) : Pa

end ParserState

structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the comment. It does help. Could you please add a Note to this file explaining the different context objects? We now have: InputContext, ParserModuleContext, CacheableParserContext, TokenParserContext, ParserContextCore, and ParserContext. It would be great if the note also included explanations for TokenParserFn, ParserFn.

default := fun _ s => s

/-- Parser context updateable in `adaptUncacheableContextFn`. -/
structure ParserContextCore extends TokenParserContext where
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please expand comment and describe the purpose of the new structure.

@@ -0,0 +1,326 @@
import Lean.Parser.Basic

/-! Tokenization functions used for the implementation of the Lean language. -/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you just moved the code? Or, have you made modifications too?

Copy link
Member Author

@Kha Kha Jul 31, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only superficial changes for moving functions from ParserFn to TokenParserFn, plus Kha@bd44113

@leodemoura
Copy link
Member

Let's discuss this PR in the next meeting.

/-- Function to be called directly after parsing a token. When not in an error state, parses following whitespace,
sets up `SourceInfo`, and pushes result of calling `f` with token substring and info onto stack. -/
@[specialize]
def pushToken (f : Substring → SourceInfo → Syntax) (startPos : String.Pos) (whitespaceFn : TokenParserFn) :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to discuss this one. I see the motivations for the change, but the parser is starting to accumulate "patchwork."

@leodemoura leodemoura added the dev meeting It will be discussed at the (next) dev meeting label Jul 28, 2023
@Kha Kha marked this pull request as draft August 3, 2023 14:34
@Kha
Copy link
Member Author

Kha commented Aug 8, 2023

!bench

@leanprover-bot
Copy link
Collaborator

Here are the benchmark results for commit a1138ef.
There were significant changes against commit bb73879:

  Benchmark   Metric         Change
  ===========================================
- parser      branches         1.8% (197.6 σ)
- parser      instructions     1.9% (275.7 σ)

@leodemoura leodemoura added awaiting-author Waiting for PR author to address issues and removed awaiting-review Waiting for someone to review the PR labels Aug 9, 2023
@tydeu
Copy link
Member

tydeu commented Oct 13, 2023

@Kha I noticed that you removed the whitespace customization. While it may be really used in the Lean core, I think it will be very useful for DSLs. Particularly, for those that are meant to function in both an embedded context and standalone. For example, in Alloy, when parsing C code in a Lean file, using Lean-style comments is ideal. However, when parsing a C file itself using C-style comments is necessary. Being able to use the same grammar for both would be amazing and would be enabled by the whitespace configuration.

EDIT: Oh, upon further examination of the changes, I noticed the Lean tokens never made use of the context whitespaceFn. Well, it would be nice if they did, 😁

@Kha
Copy link
Member Author

Kha commented Oct 13, 2023

Existing token functions can simply be parameterized over the whitespace parser if desired, the parser framework doesn't have to be changed for that

@tydeu
Copy link
Member

tydeu commented Oct 13, 2023

@Kha

Existing token functions can simply be parameterized over the whitespace parser if desired,

In my case, that would still require a separate grammar for C w/ Lean comments and C w/ C comments. Right? (Since the whitespace function has to be determine a syntax level in such a setup rather than at parser runtime. The only other solution I can see would be to parameterize the entire grammar which would prohibit the use of syntax.)

@github-actions github-actions bot added the stale label Dec 21, 2023
@github-actions github-actions bot removed the stale label Feb 10, 2024
@github-actions github-actions bot added the stale label May 5, 2024
@github-actions github-actions bot removed the stale label May 31, 2024
@github-actions github-actions bot added stale and removed stale labels Jun 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-author Waiting for PR author to address issues dev meeting It will be discussed at the (next) dev meeting
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants