-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make token parser customizable #2293
base: master
Are you sure you want to change the base?
Conversation
0e83ad0
to
1650f94
Compare
!bench |
Here are the benchmark results for commit 1650f94. Benchmark Metric Change
=======================================================
+ stdlib tactic execution -2.8% (-483.8 σ)
+ workspaceSymbols branch-misses -3.7% (-16.7 σ)
- workspaceSymbols task-clock 3.7% (20.3 σ)
- workspaceSymbols wall-clock 3.7% (20.3 σ) |
I have now benchmarked the PR on https://github.com/Kha/C-parsing-for-Lean4/tree/custom-tokens as well, before applying any changes to the actual parsing. Performance slightly increases, which doesn't really make sense and as such is probably a linker artifact, but it is at least further evidence that performance of parsing is not negatively affected
|
Just to note some alternatives: instead of making the token parser function dynamically replaceable, we could also "simply" demand embedded languages to call their own token parser function directly. This would involve duplication around all those little functions that build on top of the token parser though, and |
structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where | ||
tokens : TokenTable | ||
|
||
def TokenParserFn := TokenParserContext → ParserState → ParserState |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment.
@@ -290,11 +283,29 @@ def mkUnexpectedErrorAt (s : ParserState) (msg : String) (pos : String.Pos) : Pa | |||
|
|||
end ParserState | |||
|
|||
structure TokenParserContext extends InputContext, ParserModuleContext, CacheableParserContext where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adding the comment. It does help. Could you please add a Note to this file explaining the different context objects? We now have: InputContext
, ParserModuleContext
, CacheableParserContext
, TokenParserContext
, ParserContextCore
, and ParserContext
. It would be great if the note also included explanations for TokenParserFn
, ParserFn
.
default := fun _ s => s | ||
|
||
/-- Parser context updateable in `adaptUncacheableContextFn`. -/ | ||
structure ParserContextCore extends TokenParserContext where |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please expand comment and describe the purpose of the new structure.
@@ -0,0 +1,326 @@ | |||
import Lean.Parser.Basic | |||
|
|||
/-! Tokenization functions used for the implementation of the Lean language. -/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you just moved the code? Or, have you made modifications too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only superficial changes for moving functions from ParserFn
to TokenParserFn
, plus Kha@bd44113
Let's discuss this PR in the next meeting. |
/-- Function to be called directly after parsing a token. When not in an error state, parses following whitespace, | ||
sets up `SourceInfo`, and pushes result of calling `f` with token substring and info onto stack. -/ | ||
@[specialize] | ||
def pushToken (f : Substring → SourceInfo → Syntax) (startPos : String.Pos) (whitespaceFn : TokenParserFn) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to discuss this one. I see the motivations for the change, but the parser is starting to accumulate "patchwork."
It was used only in the rarely used `rawFn` parser
!bench |
Here are the benchmark results for commit a1138ef. Benchmark Metric Change
===========================================
- parser branches 1.8% (197.6 σ)
- parser instructions 1.9% (275.7 σ) |
@Kha I noticed that you removed the whitespace customization. While it may be really used in the Lean core, I think it will be very useful for DSLs. Particularly, for those that are meant to function in both an embedded context and standalone. For example, in Alloy, when parsing C code in a Lean file, using Lean-style comments is ideal. However, when parsing a C file itself using C-style comments is necessary. Being able to use the same grammar for both would be amazing and would be enabled by the whitespace configuration. EDIT: Oh, upon further examination of the changes, I noticed the Lean tokens never made use of the context |
Existing token functions can simply be parameterized over the whitespace parser if desired, the parser framework doesn't have to be changed for that |
In my case, that would still require a separate grammar for C w/ Lean comments and C w/ C comments. Right? (Since the whitespace function has to be determine a syntax level in such a setup rather than at parser runtime. The only other solution I can see would be to parameterize the entire grammar which would prohibit the use of |
In theory, users can implement arbitrary lexical grammars as the built-in token parser is merely another combinator whose use can be avoided; in practice, avoiding it also means forgoing the existing infrastructure around token caching and further parsers built on top of
tokenFn
, such assymbol
implicitly used bysyntax
.There are other nice-to-haves we have talked about before like per-category token tables, but this PR implements the most simple change to address the issues mentioned above: it makes the implementation of
tokenFn
customizable such that people can implement their own tokenization in it without losing access to all the infrastructure built on top of it.Implementation-wise, the change mostly consist of further splitting
ParserContextCore/ParserFn
to prevent non-wellfounded type recursion and moving the Lean language's tokenizer into a new moduleLean.Parser.LeanToken
to distinguish it from language-generic parsers inBasic
(which is too big anyway). Making it easier to adapt these Lean token functions for implementing a custom token parser is left as future work.The change is validated by a simple test that changes the whitespace/comment syntax of the existing Lean language as well as an incomplete implementation of the venerable Whitespace language.