
Static lexer #1909

Closed · wants to merge 4 commits

Conversation

calculuschild (Contributor)

Description

PR #1872 is running into some optimization issues, and part of that is due to the way the benchmark is run. Each individual call to marked(markdown) creates a brand-new Lexer, Tokenizer, Parser, Renderer, etc., and doing this for each of the thousands of calls in the benchmark makes the setup code take up a significant portion of the execution time. For example, bind()ing functions in a constructor, which should be faster than call() because the binding only needs to happen once, instead happens on every Marked execution.
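The amortization argument can be sketched with a toy class (hypothetical names, not marked's real code): binding in the constructor is a one-time cost per instance, so it only pays off if the instance is reused across calls.

```javascript
// Toy illustration: bind cost is per-construction, so reusing the
// instance amortizes it while reconstruction repeats it every call.
class Tokenizer {
  constructor() {
    // One bind() per method at construction time.
    this.heading = this.heading.bind(this);
  }
  heading(src) {
    return src.startsWith('#');
  }
}

// One instance per call: the constructor (and its bind calls) runs N times.
function lexPerCall(src) {
  return new Tokenizer().heading(src);
}

// Reused instance: the constructor runs once; the bind cost is amortized.
const sharedTokenizer = new Tokenizer();
function lexShared(src) {
  return sharedTokenizer.heading(src);
}
```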

I added a static property staticLexer to Lexer.js. Instead of creating a new Lexer in each call to the static Lexer.lex(), it now returns the existing staticLexer instance and just applies any changed options. I did a similar thing with Tokenizer.js. I don't see any reason to build up an entire separate tree of objects every time someone wants to compile a line of markdown; we only use one at a time, so let's treat it as a singleton.
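A minimal sketch of that singleton idea (the class shape and the stand-in lex body are my assumptions, not the PR's actual diff):

```javascript
// Sketch: Lexer.lex() lazily creates one shared instance and reuses it,
// only re-applying options on subsequent calls.
class Lexer {
  constructor(options = {}) {
    this.options = options;
  }
  setOptions(options = {}) {
    this.options = options; // re-apply any changed options on reuse
    return this;
  }
  lex(src) {
    return src.trim().split('\n'); // stand-in for real tokenization
  }
  static lex(src, options) {
    if (!Lexer.staticLexer) {
      Lexer.staticLexer = new Lexer(options);
    } else {
      Lexer.staticLexer.setOptions(options);
    }
    return Lexer.staticLexer.lex(src);
  }
}
```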

The only issue is that the linter rejects it, because static class properties aren't formally part of JavaScript yet, even though Babel can handle them for us. We might need to use something like https://github.com/babel/babel/tree/main/eslint/babel-eslint-parser

It seems to give a bit of a speed boost over the current master, but it's hard to tell with the current benchmarking setup since results vary wildly. I'd appreciate a second opinion.

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.


UziTech (Member) commented Jan 23, 2021

It doesn't seem to speed up the current benchmarks but I see how it could if #1872 was merged and bind was used.

src/Lexer.js Outdated
@@ -75,6 +78,8 @@ module.exports = class Lexer {
this.tokenizer.rules = rules;
}

static staticLexer
Member

I don't think this static property is necessary since Lexer.staticLexer will be undefined without this line until it is set anyways.

Contributor Author

Fixed.

lib/marked.esm.js Outdated (resolved)
UziTech (Member) commented Jan 24, 2021

I wonder if this will actually be any faster since we will still have to run setOptions every time anyways to reset everything. I'm thinking we should just pass this into the functions instead of binding if it is faster.
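The two alternatives being weighed here can be sketched with toy functions (hypothetical shapes, not marked's actual tokenizer signatures):

```javascript
// Option A: bind in the constructor, paying setup cost whenever a new
// instance is created.
class BoundTokenizer {
  constructor() {
    this.heading = this.heading.bind(this);
  }
  heading(src) {
    return src.startsWith('#');
  }
}

// Option B: skip binding entirely and pass the context in as a parameter,
// trading constructor work for one extra argument on every call.
function heading(tokenizer, src) {
  return tokenizer.rules.heading.test(src);
}

const tokenizer = { rules: { heading: /^#/ } };
heading(tokenizer, '# test'); // true
```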

calculuschild (Contributor, author) commented Jan 24, 2021

The current version of #1872 is passing this as a parameter, but it doesn't seem to make a difference on my machine. I have some other ideas that seemed to speed up parsing, but repeated calls got quite slow with the setup overhead. Maybe I will post them in a separate PR so you can judge.

calculuschild (Contributor, author)

Made the requested fixes, plus a small tweak so this.tokenizer doesn't have to be rewritten from this.options.tokenizer each time. Again, I think it's a small speedup, but I've been wrong before.

Also removed unneeded Babel extension, and it now passes the Linter.

Even if performance isn't improved here, I think it will streamline any future updates that include heavier setup steps in the constructor.

this.tokens = [];
this.tokens.links = Object.create(null);
this.options = options || defaults;
this.options.tokenizer = this.options.tokenizer || new Tokenizer();
this.tokenizer = this.options.tokenizer;
if (!this.tokenizer) {
Member

If they pass a different tokenizer in the options wouldn't we want to use that one here?

Contributor Author

That still happens, if my logic is correct.

Instead of this with every loop:

  1. Check for this.options.tokenizer
  2. If it doesn't exist, assign it the default value new Tokenizer
  3. Assign the result to this.tokenizer

Do this:

  1. Check for this.tokenizer
  2. If it doesn't exist, assign it a value, preferably this.options.tokenizer, but if not, default to new Tokenizer

The end result is the same: this.tokenizer == this.options.tokenizer || new Tokenizer(), but we have saved an object write on every loop in the second case (this.tokenizer = this.options.tokenizer).

Instead of "saving" the optional Tokenizer to this.options.tokenizer to be reapplied every loop, we "save" it where it actually belongs, in this.tokenizer, and only have to apply it the first time around.
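The two orderings described above can be sketched as plain setup functions (simplified; the real constructor does more):

```javascript
// Before: both writes happen on every construction.
function setupEveryTime(lexer, options, Tokenizer) {
  lexer.options = options;
  lexer.options.tokenizer = lexer.options.tokenizer || new Tokenizer();
  lexer.tokenizer = lexer.options.tokenizer;
}

// After: once lexer.tokenizer is set, reuse skips both writes, while a
// tokenizer passed in options is still honored the first time through.
function setupOnce(lexer, options, Tokenizer) {
  lexer.options = options;
  if (!lexer.tokenizer) {
    lexer.tokenizer = options.tokenizer || new Tokenizer();
  }
}
```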

Contributor Author

In fact... I wonder if we even need this separate "setOptions" function run every time at all. I just put it there to appease the unit tests. Once the tokenizer is defined and its options are set the first time, normal use would generally not be swapping options on the fly, right? Wouldn't 99% of use cases just set options one time and then run marked(string) with those same options every time?

The only real reason I see the need to assign options on the fly is for unit testing in development, because it needs to update options as it steps through the spec tests for GFM, Pedantic, etc. in a random order. We could instead modify the unit test script to create a new marked object with each test, and thereby have the correct options as it goes, but I don't see a need for random option switching in production code, and especially not repeated re-assignment of the same options over and over. (I could be missing some reason we need to allow hot-swapping options in production, just can't think of any.)

UziTech (Member) commented Jan 28, 2021

What happens in this situation?

marked("# test");

marked.use({
  tokenizer: {
    heading() {
      // no heading
    }
  }
});

marked("# test");

with this PR this.tokenizer wouldn't be updated and the new heading tokenizer is ignored.

This PR:

<h1 id="test">test</h1>

<h1 id="test">test</h1>

latest release:

<h1 id="test">test</h1>

<p># test</p>

UziTech (Member) commented Jan 28, 2021

I see what you are doing here but I don't see any benefits. It doesn't seem any faster especially since we still need to set the options.

If we want it to be faster we should make marked a class that you can use like const marked = new Marked(options) then we can reuse everything on every run. And users can still set all options by creating a new instance.

Currently the way marked sets options makes them global, so with this PR the only way to update the tokenizer would be to restart the node process.

calculuschild (Contributor, author) commented Jan 28, 2021

with this PR the only way to update the tokenizer would be to restart the node process.

Ok, so we do need to support hot-swapping options in production. This is good to know.

...I don't see any benefits.

This is the key point I was trying to make in the OP. In its current state, this doesn't do a whole lot, especially if we have to re-apply the same options over and over whether they are changed or not. However, if we want to make future changes that add a lot of overhead to the constructor, such as .bind(), we can avoid that slowdown by doing something like this. The benefit isn't for the current master, but to enable future fixes and optimizations.

It doesn't seem any faster especially since we still need to set the options.

This is a good point, and re-emphasises to me that this PR is a good place to optimize. We don't really see speed gains against the current master because we haven't gone far enough :P. If we are needlessly resetting the same options over and over with each call to marked(src), then that is an area we can speed up. And now that I know we do need to allow changing options, I'm thinking we should check whether the options have changed, and only update them if needed. Or, we can go the route you mentioned with making Marked an instantiable class, and just fully rebuild a new instance for each new set of options.
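One hedged sketch of "only update the options when they changed", using a shallow key comparison (my own assumption of an approach; this is not code from this PR):

```javascript
// Shallow comparison: same keys, same (reference-equal) values.
function shallowEqual(a, b) {
  const aKeys = Object.keys(a);
  if (aKeys.length !== Object.keys(b).length) return false;
  return aKeys.every((key) => a[key] === b[key]);
}

class Lexer {
  setOptions(options) {
    if (this.options && shallowEqual(this.options, options)) {
      return false; // unchanged: skip the reset work
    }
    this.options = options;
    // ...re-derive rules, tokenizer bindings, etc. here...
    return true;
  }
}
```

A shallow check like this would miss deep mutations inside a custom tokenizer object, which is exactly the hot-swapping case discussed above, so it is only a starting point.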

UziTech (Member) commented Jan 28, 2021

Maybe we can add a way to do new Marked(options) in v2 and tell people to start using it (essentially just returning marked), remove marked.setOptions, and implement #1872 in v3.

We could change marked to:

const {Marked} = require("marked");
const marked = new Marked(options);

marked("# markdown");

Or we could add this way of doing it and keep the old way, but tell people that if they want speed they need to use the new one. It seems like this would have all of these benefits (plus some ESM benefits) and not alienate users who don't want to change and don't care about speed.
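A rough sketch of what that instantiable, still-callable API could look like (illustrative only; the names and the constructor trick are my assumptions, not the eventual marked design):

```javascript
class Marked {
  constructor(options = {}) {
    this.options = options;
    // Lexer/Parser would be built once here and reused on every call,
    // so per-call setup cost (binds, rule wiring) is paid a single time.
    const instance = (src) => this.parse(src);
    Object.setPrototypeOf(instance, Marked.prototype);
    instance.options = options;
    return instance; // callable like marked("# markdown")
  }
  parse(src) {
    return `<p>${src}</p>`; // stand-in for real lexing + parsing
  }
}

const marked = new Marked({ gfm: true });
marked('# markdown'); // each instance carries its own options
```

Each instance owns its options, so "changing options" just means constructing another instance instead of mutating global state.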

calculuschild (Contributor, author)

Sure, I would not be opposed to that. If it opens up better options for #1872 or similar changes down the line I'm all for it.

FYI, I just tried the approach of using JSON.stringify(options) === JSON.stringify(this.options) and only applying the options if they have changed. It worked, but the added overhead of stringifying the objects was more than the benefit from not applying the options.
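Reconstructed from that description, the attempted guard looked roughly like this (a sketch, not the PR's actual code):

```javascript
class Lexer {
  setOptions(options) {
    // Serializing both objects on every marked() call turned out to
    // cost more than just re-applying the options unconditionally.
    if (JSON.stringify(options) === JSON.stringify(this.options)) {
      return;
    }
    this.options = options;
  }
}
```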
