Make token parser customizable #2293

Kha · 2023-06-27T10:39:04Z

In theory, users can implement arbitrary lexical grammars as the built-in token parser is merely another combinator whose use can be avoided; in practice, avoiding it also means forgoing the existing infrastructure around token caching and further parsers built on top of tokenFn, such as symbol implicitly used by syntax.

There are other nice-to-haves we have talked about before like per-category token tables, but this PR implements the most simple change to address the issues mentioned above: it makes the implementation of tokenFn customizable such that people can implement their own tokenization in it without losing access to all the infrastructure built on top of it.

Implementation-wise, the change mostly consist of further splitting ParserContextCore/ParserFn to prevent non-wellfounded type recursion and moving the Lean language's tokenizer into a new module Lean.Parser.LeanToken to distinguish it from language-generic parsers in Basic (which is too big anyway). Making it easier to adapt these Lean token functions for implementing a custom token parser is left as future work.

The change is validated by a simple test that changes the whitespace/comment syntax of the existing Lean language as well as an incomplete implementation of the venerable Whitespace language.

Kha · 2023-06-28T08:57:22Z

!bench

leanprover-bot · 2023-06-28T09:02:45Z

Here are the benchmark results for commit 1650f94.
There were significant changes against commit bb8cc08:

  Benchmark          Metric             Change
  =======================================================
+ stdlib             tactic execution    -2.8% (-483.8 σ)
+ workspaceSymbols   branch-misses       -3.7%  (-16.7 σ)
- workspaceSymbols   task-clock           3.7%   (20.3 σ)
- workspaceSymbols   wall-clock           3.7%   (20.3 σ)

Kha · 2023-06-29T10:22:21Z

I have now benchmarked the PR on https://github.com/Kha/C-parsing-for-Lean4/tree/custom-tokens as well, before applying any changes to the actual parsing. Performance slightly increases, which doesn't really make sense and as such is probably a linker artifact, but it is at least further evidence that performance of parsing is not negatively affected

$ lake build; nix run nixpkgs#hyperfine ./build/bin/cParser -- --warmup 1
Benchmark 1: ./build/bin/cParser
  Time (mean ± σ):      1.233 s ±  0.008 s    [User: 1.139 s, System: 0.095 s]
  Range (min … max):    1.224 s …  1.251 s    10 runs
$ git co custom-tokens; lake build; nix run nixpkgs#hyperfine ./build/bin/cParser -- --warmup 1
Benchmark 1: ./build/bin/cParser
  Time (mean ± σ):      1.187 s ±  0.007 s    [User: 1.102 s, System: 0.087 s]
  Range (min … max):    1.178 s …  1.198 s    10 runs

Kha · 2023-07-19T06:59:18Z

Just to note some alternatives: instead of making the token parser function dynamically replaceable, we could also "simply" demand embedded languages to call their own token parser function directly. This would involve duplication around all those little functions that build on top of the token parser though, and syntax as well would somehow have to be taught to use a different symbol function. Making the switch to a different token parser explicit via the parser context also means we can reliably enforce that the token cache is cleared at that point like done in this PR.
Furthermore, we could tie token parsers to syntax categories such that they are automatically activated when the category is entered (much like we might want to do with per-category token tables?). This is something that can be done on top of this PR however.

leodemoura · 2023-07-28T15:10:57Z

src/Lean/Parser/Types.lean

+structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where
+  tokens  : TokenTable
+
+def TokenParserFn := TokenParserContext → ParserState → ParserState


Please add comment.

leodemoura · 2023-07-28T15:11:18Z

src/Lean/Parser/Types.lean

@@ -290,11 +283,29 @@ def mkUnexpectedErrorAt (s : ParserState) (msg : String) (pos : String.Pos) : Pa

 end ParserState

+structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where


Please add comment.

Thanks for adding the comment. It does help. Could you please add a Note to this file explaining the different context objects? We now have: InputContext, ParserModuleContext, CacheableParserContext, TokenParserContext, ParserContextCore, and ParserContext. It would be great if the note also included explanations for TokenParserFn, ParserFn.

leodemoura · 2023-07-28T15:12:07Z

src/Lean/Parser/Types.lean

+  default := fun _ s => s
+
+/-- Parser context updateable in `adaptUncacheableContextFn`. -/
+structure ParserContextCore extends TokenParserContext where


Please expand comment and describe the purpose of the new structure.

leodemoura · 2023-07-28T15:14:20Z

src/Lean/Parser/LeanToken.lean

@@ -0,0 +1,326 @@
+import Lean.Parser.Basic
+
+/-! Tokenization functions used for the implementation of the Lean language. -/


Have you just moved the code? Or, have you made modifications too?

Only superficial changes for moving functions from ParserFn to TokenParserFn, plus Kha@bd44113

leodemoura · 2023-07-28T15:24:56Z

Let's discuss this PR in the next meeting.

leodemoura · 2023-07-28T15:28:15Z

src/Lean/Parser/Basic.lean

+/-- Function to be called directly after parsing a token. When not in an error state, parses following whitespace,
+    sets up `SourceInfo`, and pushes result of calling `f` with token substring and info onto stack. -/
+@[specialize]
+def pushToken (f : Substring → SourceInfo → Syntax) (startPos : String.Pos) (whitespaceFn : TokenParserFn) :


I want to discuss this one. I see the motivations for the change, but the parser is starting to accumulate "patchwork."

It was used only in the rarely used `rawFn` parser

Kha · 2023-08-08T16:44:04Z

!bench

leanprover-bot · 2023-08-08T23:23:18Z

Here are the benchmark results for commit a1138ef.
There were significant changes against commit bb73879:

  Benchmark   Metric         Change
  ===========================================
- parser      branches         1.8% (197.6 σ)
- parser      instructions     1.9% (275.7 σ)

tydeu · 2023-10-13T18:02:29Z

@Kha I noticed that you removed the whitespace customization. While it may be really used in the Lean core, I think it will be very useful for DSLs. Particularly, for those that are meant to function in both an embedded context and standalone. For example, in Alloy, when parsing C code in a Lean file, using Lean-style comments is ideal. However, when parsing a C file itself using C-style comments is necessary. Being able to use the same grammar for both would be amazing and would be enabled by the whitespace configuration.

EDIT: Oh, upon further examination of the changes, I noticed the Lean tokens never made use of the context whitespaceFn. Well, it would be nice if they did, 😁

Kha · 2023-10-13T20:51:59Z

Existing token functions can simply be parameterized over the whitespace parser if desired, the parser framework doesn't have to be changed for that

tydeu · 2023-10-13T20:56:28Z

@Kha

Existing token functions can simply be parameterized over the whitespace parser if desired,

In my case, that would still require a separate grammar for C w/ Lean comments and C w/ C comments. Right? (Since the whitespace function has to be determine a syntax level in such a setup rather than at parser runtime. The only other solution I can see would be to parameterize the entire grammar which would prohibit the use of syntax.)

Kha force-pushed the custom-tokens branch 2 times, most recently from 0e83ad0 to 1650f94 Compare June 27, 2023 10:59

leodemoura self-assigned this Jun 27, 2023

Kha force-pushed the custom-tokens branch from 1650f94 to 579bd17 Compare June 28, 2023 08:55

leodemoura added the awaiting-review Waiting for someone to review the PR label Jul 10, 2023

leodemoura reviewed Jul 28, 2023

View reviewed changes

leodemoura added the dev meeting It will be discussed at the (next) dev meeting label Jul 28, 2023

Kha marked this pull request as draft August 3, 2023 14:34

Kha added 5 commits August 8, 2023 18:43

feat: make tokenFn and whitespace customizable

b0cf86d

refactor: extract pushToken

62d25bf

test: implement minimal Whitespace

38694c5

refactor: remove TokenparserConfig.whitespaceFn again

4d13838

It was used only in the rarely used `rawFn` parser

doc: extend token parser docstrings

a1138ef

Kha force-pushed the custom-tokens branch from 23c2262 to a1138ef Compare August 8, 2023 16:43

leodemoura added awaiting-author Waiting for PR author to address issues and removed awaiting-review Waiting for someone to review the PR labels Aug 9, 2023

github-actions bot added the stale label Dec 21, 2023

github-actions bot removed the stale label Feb 10, 2024

github-actions bot added the stale label May 5, 2024

github-actions bot removed the stale label May 31, 2024

github-actions bot added stale and removed stale labels Jun 30, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make token parser customizable #2293

Make token parser customizable #2293

Kha commented Jun 27, 2023

Kha commented Jun 28, 2023

leanprover-bot commented Jun 28, 2023

Kha commented Jun 29, 2023 •

edited

Loading

Kha commented Jul 19, 2023 •

edited

Loading

leodemoura Jul 28, 2023

leodemoura Jul 28, 2023

leodemoura Aug 9, 2023

leodemoura Jul 28, 2023

leodemoura Jul 28, 2023

Kha Jul 31, 2023 •

edited

Loading

leodemoura commented Jul 28, 2023

leodemoura Jul 28, 2023

Kha commented Aug 8, 2023

leanprover-bot commented Aug 8, 2023

tydeu commented Oct 13, 2023 •

edited

Loading

Kha commented Oct 13, 2023

tydeu commented Oct 13, 2023

		@@ -290,11 +283,29 @@ def mkUnexpectedErrorAt (s : ParserState) (msg : String) (pos : String.Pos) : Pa

		end ParserState

		structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where

		@@ -0,0 +1,326 @@
		import Lean.Parser.Basic

		/-! Tokenization functions used for the implementation of the Lean language. -/

Make token parser customizable #2293

Are you sure you want to change the base?

Make token parser customizable #2293

Conversation

Kha commented Jun 27, 2023

Kha commented Jun 28, 2023

leanprover-bot commented Jun 28, 2023

Kha commented Jun 29, 2023 • edited Loading

Kha commented Jul 19, 2023 • edited Loading

leodemoura Jul 28, 2023

Choose a reason for hiding this comment

leodemoura Jul 28, 2023

Choose a reason for hiding this comment

leodemoura Aug 9, 2023

Choose a reason for hiding this comment

leodemoura Jul 28, 2023

Choose a reason for hiding this comment

leodemoura Jul 28, 2023

Choose a reason for hiding this comment

Kha Jul 31, 2023 • edited Loading

Choose a reason for hiding this comment

leodemoura commented Jul 28, 2023

leodemoura Jul 28, 2023

Choose a reason for hiding this comment

Kha commented Aug 8, 2023

leanprover-bot commented Aug 8, 2023

tydeu commented Oct 13, 2023 • edited Loading

Kha commented Oct 13, 2023

tydeu commented Oct 13, 2023

Kha commented Jun 29, 2023 •

edited

Loading

Kha commented Jul 19, 2023 •

edited

Loading

Kha Jul 31, 2023 •

edited

Loading

tydeu commented Oct 13, 2023 •

edited

Loading