
Static lexer #1909

Closed · wants to merge 4 commits

Conversation

calculuschild (Contributor)

Description

PR #1872 is running into some optimization issues, and part of that is due to the way the benchmark is run. Each individual call to marked(markdown) creates a brand-new Lexer, Tokenizer, Parser, Renderer, etc., and doing this for each of the thousands of calls in the benchmark makes the setup code take up a significant portion of the execution time. For example, bind()ing functions in a constructor, which should be faster than call() because the binding only needs to happen once, instead happens on every Marked execution.
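The amortization argument can be sketched with a toy class (hypothetical names, not marked's real code): binding in the constructor is a one-time cost per instance, so it only pays off if the instance is reused across calls.

```javascript
// Toy illustration: bind cost is per-construction, so reusing the
// instance amortizes it while reconstruction repeats it every call.
class Tokenizer {
  constructor() {
    // One bind() per method at construction time.
    this.heading = this.heading.bind(this);
  }
  heading(src) {
    return src.startsWith('#');
  }
}

// One instance per call: the constructor (and its bind calls) runs N times.
function lexPerCall(src) {
  return new Tokenizer().heading(src);
}

// Reused instance: the constructor runs once; the bind cost is amortized.
const sharedTokenizer = new Tokenizer();
function lexShared(src) {
  return sharedTokenizer.heading(src);
}
```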

I added a static property staticLexer to Lexer.js. Instead of creating a new Lexer in each call to the static Lexer.lex(), it now returns the existing staticLexer instance and just applies any changed options. I did a similar thing with Tokenizer.js. I don't see any reason to build up an entire separate tree of objects every time someone wants to compile a line of markdown; we only use one at a time, so let's treat it as a singleton.
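A minimal sketch of that singleton idea (the class shape and the stand-in lex body are my assumptions, not the PR's actual diff):

```javascript
// Sketch: Lexer.lex() lazily creates one shared instance and reuses it,
// only re-applying options on subsequent calls.
class Lexer {
  constructor(options = {}) {
    this.options = options;
  }
  setOptions(options = {}) {
    this.options = options; // re-apply any changed options on reuse
    return this;
  }
  lex(src) {
    return src.trim().split('\n'); // stand-in for real tokenization
  }
  static lex(src, options) {
    if (!Lexer.staticLexer) {
      Lexer.staticLexer = new Lexer(options);
    } else {
      Lexer.staticLexer.setOptions(options);
    }
    return Lexer.staticLexer.lex(src);
  }
}
```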

The only issue is that the linter rejects it, because static class properties aren't formally part of JavaScript yet, even though Babel can handle them for us. We might need to use something like https://github.com/babel/babel/tree/main/eslint/babel-eslint-parser

It seems to give a bit of a speed boost over the current master, but it's hard to tell with the current benchmarking setup since results vary wildly. I'd appreciate a second opinion.

Contributor

  • Test(s) exist to ensure functionality and minimize regression (if no tests added, list tests covering this PR); or,
  • no tests required for this PR.
  • If submitting new feature, it has been documented in the appropriate places.

Committer

In most cases, this should be a different person than the contributor.


UziTech (Member) commented Jan 23, 2021

It doesn't seem to speed up the current benchmarks but I see how it could if #1872 was merged and bind was used.

src/Lexer.js Outdated
@@ -75,6 +78,8 @@ module.exports = class Lexer {
this.tokenizer.rules = rules;
}

static staticLexer
Member

I don't think this static property is necessary since Lexer.staticLexer will be undefined without this line until it is set anyways.

Contributor Author

Fixed.

lib/marked.esm.js Outdated (resolved)
UziTech (Member) commented Jan 24, 2021

I wonder if this will actually be any faster since we will still have to run setOptions every time anyways to reset everything. I'm thinking we should just pass this into the functions instead of binding if it is faster.
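The two alternatives being weighed here can be sketched with toy functions (hypothetical shapes, not marked's actual tokenizer signatures):

```javascript
// Option A: bind in the constructor, paying setup cost whenever a new
// instance is created.
class BoundTokenizer {
  constructor() {
    this.heading = this.heading.bind(this);
  }
  heading(src) {
    return src.startsWith('#');
  }
}

// Option B: skip binding entirely and pass the context in as a parameter,
// trading constructor work for one extra argument on every call.
function heading(tokenizer, src) {
  return tokenizer.rules.heading.test(src);
}

const tokenizer = { rules: { heading: /^#/ } };
heading(tokenizer, '# test'); // true
```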

calculuschild (Contributor, author) commented Jan 24, 2021

The current version of #1872 is passing this as a parameter, but it doesn't seem to make a difference on my machine. I have some other ideas that seemed to speed up parsing, but repeated calls got quite slow with the setup overhead. Maybe I will post them in a separate PR so you can judge.

calculuschild (Contributor, author)

Made the requested fixes, plus a small tweak so this.tokenizer doesn't have to be rewritten from this.options.tokenizer each time. Again, I think it's a small speedup, but I've been wrong before.

Also removed unneeded Babel extension, and it now passes the Linter.

Even if performance isn't improved here, I think it will streamline any future updates that include heavier setup steps in the constructor.

this.tokens = [];
this.tokens.links = Object.create(null);
this.options = options || defaults;
this.options.tokenizer = this.options.tokenizer || new Tokenizer();
this.tokenizer = this.options.tokenizer;
if (!this.tokenizer) {
Member

If they pass a different tokenizer in the options wouldn't we want to use that one here?

Contributor Author

That still happens, if my logic is correct.

Instead of this with every loop:

  1. Check for this.options.tokenizer
  2. If it doesn't exist, assign it the default value new Tokenizer
  3. Assign the result to this.tokenizer

Do this:

  1. Check for this.tokenizer
  2. If it doesn't exist, assign it a value, preferably this.options.tokenizer, but if not, default to new Tokenizer

The end result is the same: this.tokenizer == this.options.tokenizer || new Tokenizer(), but we have saved an object write on every loop in the second case (this.tokenizer = this.options.tokenizer).

Instead of "saving" the optional Tokenizer to this.options.tokenizer to be reapplied every loop, we "save" it where it actually belongs, in this.tokenizer, and only have to apply it the first time around.
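The two orderings described above can be sketched as plain setup functions (simplified; the real constructor does more):

```javascript
// Before: both writes happen on every construction.
function setupEveryTime(lexer, options, Tokenizer) {
  lexer.options = options;
  lexer.options.tokenizer = lexer.options.tokenizer || new Tokenizer();
  lexer.tokenizer = lexer.options.tokenizer;
}

// After: once lexer.tokenizer is set, reuse skips both writes, while a
// tokenizer passed in options is still honored the first time through.
function setupOnce(lexer, options, Tokenizer) {
  lexer.options = options;
  if (!lexer.tokenizer) {
    lexer.tokenizer = options.tokenizer || new Tokenizer();
  }
}
```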

Contributor Author

In fact... I wonder if we even need this separate "setOptions" function run every time at all. I just put it there to appease the unit tests. Once the tokenizer is defined and its options are set the first time, normal use would generally not be swapping options on the fly, right? Wouldn't 99% of use cases just set options one time and then run marked(string) with those same options every time?

The only real reason I see the need to assign options on the fly is for unit testing in development, because it needs to update options as it steps through the spec tests for GFM, Pedantic, etc. in a random order. We could instead modify the unit test script to create a new marked object with each test, and thereby have the correct options as it goes, but I don't see a need for random option switching in production code, and especially not repeated re-assignment of the same options over and over. (I could be missing some reason we need to allow hot-swapping options in production, just can't think of any.)

UziTech (Member) commented Jan 28, 2021

What happens in this situation?

marked("# test");

marked.use({
  tokenizer: {
    heading() {
      // no heading
    }
  }
});

marked("# test");

with this PR this.tokenizer wouldn't be updated and the new heading tokenizer is ignored.

This PR:

<h1 id="test">test</h1>

<h1 id="test">test</h1>

latest release:

<h1 id="test">test</h1>

<p># test</p>

UziTech (Member) commented Jan 28, 2021

I see what you are doing here but I don't see any benefits. It doesn't seem any faster especially since we still need to set the options.

If we want it to be faster we should make marked a class that you can use like const marked = new Marked(options) then we can reuse everything on every run. And users can still set all options by creating a new instance.

Currently the way marked sets options makes them global, so with this PR the only way to update the tokenizer would be to restart the node process.

calculuschild (Contributor, author) commented Jan 28, 2021

with this PR the only way to update the tokenizer would be to restart the node process.

Ok, so we do need to support hot-swapping options in production. This is good to know.

...I don't see any benefits.

This is the key point I was trying to make in the OP. In its current state, this doesn't do a whole lot, especially if we have to re-apply the same options over and over whether they are changed or not. However, if we want to make future changes that add a lot of overhead to the constructor, such as .bind(), we can avoid that slowdown by doing something like this. The benefit isn't for the current master, but to enable future fixes and optimizations.

It doesn't seem any faster especially since we still need to set the options.

This is a good point, and re-emphasises to me that this PR is a good place to optimize. We don't really see speed gains against the current master because we haven't gone far enough :P. If we are needlessly resetting the same options over and over with each call to marked(src), then that is an area we can speed up. And now that I know we do need to allow changing options, I'm thinking we should check whether the options have changed, and only update them if needed. Or, we can go the route you mentioned with making Marked an instantiable class, and just fully rebuild a new instance for each new set of options.
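One hedged sketch of "only update the options when they changed", using a shallow key comparison (my own assumption of an approach; this is not code from this PR):

```javascript
// Shallow comparison: same keys, same (reference-equal) values.
function shallowEqual(a, b) {
  const aKeys = Object.keys(a);
  if (aKeys.length !== Object.keys(b).length) return false;
  return aKeys.every((key) => a[key] === b[key]);
}

class Lexer {
  setOptions(options) {
    if (this.options && shallowEqual(this.options, options)) {
      return false; // unchanged: skip the reset work
    }
    this.options = options;
    // ...re-derive rules, tokenizer bindings, etc. here...
    return true;
  }
}
```

A shallow check like this would miss deep mutations inside a custom tokenizer object, which is exactly the hot-swapping case discussed above, so it is only a starting point.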

UziTech (Member) commented Jan 28, 2021

Maybe we can add a way to do new Marked(options) in v2 and tell people to start using it (essentially just returning marked), remove marked.setOptions, and implement #1872 in v3.

We could change marked to:

const {Marked} = require("marked");
const marked = new Marked(options);

marked("# markdown");

Or we could add this way of doing it and keep the old way, but tell people that if they want speed they need to use the new one. It seems like this would have all of these benefits (plus some ESM benefits) and not alienate users who don't want to change and don't care about speed.
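A rough sketch of what that instantiable, still-callable API could look like (illustrative only; the names and the constructor trick are my assumptions, not the eventual marked design):

```javascript
class Marked {
  constructor(options = {}) {
    this.options = options;
    // Lexer/Parser would be built once here and reused on every call,
    // so per-call setup cost (binds, rule wiring) is paid a single time.
    const instance = (src) => this.parse(src);
    Object.setPrototypeOf(instance, Marked.prototype);
    instance.options = options;
    return instance; // callable like marked("# markdown")
  }
  parse(src) {
    return `<p>${src}</p>`; // stand-in for real lexing + parsing
  }
}

const marked = new Marked({ gfm: true });
marked('# markdown'); // each instance carries its own options
```

Each instance owns its options, so "changing options" just means constructing another instance instead of mutating global state.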

calculuschild (Contributor, author)

Sure, I would not be opposed to that. If it opens up better options for #1872 or similar changes down the line I'm all for it.

FYI, I just tried the approach of using JSON.stringify(options) === JSON.stringify(this.options) and only applying the options if they have changed. It worked, but the added overhead of stringifying the objects was more than the benefit from not applying the options.
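Reconstructed from that description, the attempted guard looked roughly like this (a sketch, not the PR's actual code):

```javascript
class Lexer {
  setOptions(options) {
    // Serializing both objects on every marked() call turned out to
    // cost more than just re-applying the options unconditionally.
    if (JSON.stringify(options) === JSON.stringify(this.options)) {
      return;
    }
    this.options = options;
  }
}
```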
