Trade-off between extensibility and performance #9
This sounds related to #8
If possible, let's avoid backtracking entirely: http://marijnhaverbeke.nl/blog/lezer.html
This may be more complex, but will likely be a better approach for performance. 👍
Am I correct in assuming that with lezer / tree-sitter you don’t backtrack, but instead parse multiple syntax trees? Or, more practically, how would it work? I think those two are very interesting, but they have very different roles/goals:
Sort of: GLR parsers fork whenever an ambiguity is encountered. For example, when parsing `## Hello World ##`, it isn’t known until the end of the line whether the trailing `##` is heading text or a closing sequence, so both interpretations are parsed until one can be discarded.
More generally, GLR parsers work by implementing the GLR algorithm: on a conflict, the parser forks and runs the alternative parses in parallel over a shared (graph-structured) stack, dropping the branches that turn out to be invalid.
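To make the forking idea concrete, here’s a toy sketch in TypeScript (my own illustration, not lezer’s or any real GLR engine’s code): every ambiguity turns one parse state into several, all branches advance in parallel, and branches that stop matching simply die, so nothing is ever re-scanned.

```ts
// A parse state: a position in the input plus whatever has been built so far.
type State = {pos: number; stack: string[]}

// A step maps one live state to its successors: zero (branch dies),
// one (unambiguous), or several (ambiguity, so we fork).
type Step = (input: string, state: State) => State[]

function run(input: string, steps: Step[], start: State): State[] {
  let states = [start]
  for (const step of steps) {
    // Advance every live branch in lockstep; no backtracking is needed,
    // because a failed branch just produces no successors.
    states = states.flatMap((state) => step(input, state))
    if (states.length === 0) break
  }
  return states
}
```

Real GLR implementations share work between branches through the graph-structured stack rather than copying states wholesale, but the control flow is the same: fork, advance, discard.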
Micromark is intended as the base for remark and, by extension, MDX, correct?
In what sense "complete" and "perfect"? The way I look at it, Micromark is a lexer/tokenizer.
Right, not on HTML, but MDX does depend on valid JSX.
Blocks, in particular, has an intermediary schema (via Slate) for editing, so parsing really only occurs when serializing/deserializing the document. Deserialization would really only occur once, and serialization whenever one wants to output the MD, which can be trivially debounced.
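For what it’s worth, the debouncing bit is tiny. A sketch, where `serializeToMarkdown` and `editorDocument` are hypothetical stand-ins for whatever the editor actually exposes:

```ts
// Hypothetical stand-ins for the editor's real API.
declare function serializeToMarkdown(doc: unknown): string
declare const editorDocument: unknown

function debounce<T extends unknown[]>(fn: (...args: T) => void, ms: number) {
  let timer: ReturnType<typeof setTimeout> | undefined
  return (...args: T) => {
    clearTimeout(timer)
    timer = setTimeout(() => fn(...args), ms)
  }
}

// Call this on every edit; serialization runs at most once per quiet period.
const emitMarkdown = debounce(() => serializeToMarkdown(editorDocument), 250)
```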
I'm not sure I 100% understand what you mean by a "simplified black box", @wooorm. As in, the HTML string isn't parsed itself and is instead put into an HTML node with its raw contents?

As @ChristianMurphy states, MDX depends on valid JSX, which essentially hard-fails when not properly written (because Babel or JS evaluation will go 💥). It doesn't have the built-in "recovery" that browsers have for HTML documents. For MDX, this will require knowledge of the JSX language/grammar, which would be pretty complex to handle. Would the approach be that MDX extends/replaces micromark's HTML tokenizer/parser/compiler with its own (likely using Babel)?

Personally, I'd lean towards being more robust/correct than being the fastest thing ever for the first release. Once it's "correct" we can profile bottlenecks and optimize. Not to mention some of these considerations are pretty edge-casey. Also, a lot of end users of unified (like Gatsby) can, and do, implement their own layers of caching.
@ChristianMurphy Thanks for explaining; even though I write a lot of parsing things, I really don’t know these basics!
That’s true. One example I can think of is that CM requires entities to be valid:
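As I read the CommonMark spec, only references to known HTML5 entity names are recognised; anything else stays literal text rather than causing a parse error:

```
input:   &copy; &foo;
output:  <p>© &amp;foo;</p>
```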
Typically, languages have other languages inside them. The two examples here are HTML in Markdown and JSX in MDX. The difference is that Markdown doesn’t parse HTML: it parses some XML-like structures. MDX, on the other hand, indeed seems to parse (in the future?) only valid JSX. This may be a bit of a problem: HTML/MD don’t have “invalid” content that throws a parse error and crashes; JSX/MDX do.
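As a tiny illustration of that difference, take this input:

```markdown
Hello <b world
```

CommonMark doesn’t treat `<b ` as a tag (there’s no closing `>`), so it falls back to literal text and outputs `<p>Hello &lt;b world</p>`. A JSX parser, presumably, would have to throw a syntax error on the same input instead of falling back.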
Correct! I think there are two crucial examples of extensions for micromark: 1) GFM, 2) MDX. The first is probably easier and also very ubiquitous, so it can function as a proof of concept; the second is necessary but can take a bit longer, as it’s probably a bit hard.
CMSM does not define backtracking. This therefore removes the possibility of extensions in favour of performance. I see two possibilities for extensions: a) define useful extensions in CMSM and enable them with flags, or b) allow some form of hooks. I’d like to table this for now: the first priority is to get micromark working (keeping extensions in mind); actually supporting extensions comes after that.
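For option b), a rough sketch of what hooks could look like (a hypothetical shape in TypeScript, not a committed design): extensions contribute tokenisers keyed by the character that can start their construct, so the core only consults them when that character shows up.

```ts
// Hypothetical shapes; none of this is micromark’s actual API.
type Tokenizer = (buffer: string, pos: number) => number | null // chars consumed, or null

interface Extension {
  flow?: Record<string, Tokenizer[]> // block-level constructs, keyed by start character
  text?: Record<string, Tokenizer[]> // inline constructs, keyed by start character
}

function mergeFlowHooks(extensions: Extension[]): Map<string, Tokenizer[]> {
  const hooks = new Map<string, Tokenizer[]>()
  for (const extension of extensions) {
    for (const [char, tokenizers] of Object.entries(extension.flow ?? {})) {
      hooks.set(char, [...(hooks.get(char) ?? []), ...tokenizers])
    }
  }
  // The core would check `hooks.get(buffer[pos])` before its built-ins.
  return hooks
}
```

That keeps the hot path cheap (one map lookup per position) while still letting extensions register constructs the core knows nothing about.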
Say we take:
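For illustration, a made-up sample (the concrete content matters less than the run of blank lines between constructs):

```markdown
A paragraph.



Another paragraph.
```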
Do we backtrack to before the blank lines and check all the tokenisers again (blank line probably being the last one tried)? Or is there knowledge of which other tokenisers are enabled, so we can “eat” every blank line directly?

The trade-off here: with knowledge of the other tokenisers, we can be more performant and scan the buffer fewer times; without that knowledge, we are more extensible, allowing blank lines to be turned off or handled by alternative tokenisers from extensions.
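In code, the two options might look roughly like this (assumed shapes, nothing here is micromark’s actual API):

```ts
type Tokenizer = (buffer: string, pos: number) => number | null // chars consumed, or null

// Extensible: after each construct, retry every registered tokeniser
// from the top. Blank lines are just one tokeniser among many, so
// extensions can replace or disable them, at the cost of rescanning.
function nextExtensible(tokenizers: Tokenizer[], buffer: string, pos: number): number | null {
  for (const tokenizer of tokenizers) {
    const consumed = tokenizer(buffer, pos)
    if (consumed !== null) return pos + consumed
  }
  return null
}

// Performant: the core knows blank lines are enabled and eats them all
// in a single pass, never re-consulting the tokeniser list; but that
// knowledge is baked in and can’t be swapped out by an extension.
function eatBlankLines(buffer: string, pos: number): number {
  let lineStart = pos
  let i = pos
  while (i < buffer.length) {
    const ch = buffer[i]
    if (ch === '\n') {
      lineStart = i + 1 // the line so far was blank: commit past it
      i++
    } else if (ch === ' ' || ch === '\t') {
      i++ // still only whitespace on this line
    } else {
      break // non-whitespace: this line is not blank
    }
  }
  return lineStart
}
```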