
Question: Custom extension to support new syntax #2125

Closed
haayhappen opened this issue Jul 2, 2021 · 11 comments
@haayhappen

haayhappen commented Jul 2, 2021

Marked version:

Describe the bug

I'm trying to use the custom extension of a tokenizer in order to add support for a custom syntax: {{variable}}
In the end this should render as some custom html.

The tokenizer seems to run over the text far too often, so the result is multiple parsed HTML blocks with the same content.

To Reproduce
Here is my code for the extension:
https://gist.github.com/haayhappen/57b7c77151165584e810f036d536c3aa


Input:

Hey {{1}}

Marked Output:

<p>Hey</p><span class="variable">1</span><span class="variable">1</span><span class="variable">1</span><span class="variable">1</span><span class="variable">1</span><span class="variable">1</span>

Expected Output:

<p>Hey</p><span class="variable">1</span>

Expected behavior
I expect the custom syntax to be evaluated only once.

Did I misuse the extension? Any help highly appreciated!
Thanks

@calculuschild
Contributor

calculuschild commented Jul 2, 2021

On my initial skim through, your extension looks fine, but I'd have to dig deeper to see what's going on.

What's the output from the Lexer if you do this?

const tokens = marked.lexer(md);
console.log(tokens);

@calculuschild
Contributor

calculuschild commented Jul 2, 2021

The issue may be in your start function. Remove the /g flag: it can return invalid values when there is more than one instance of {{something}} in your document, since it tries to match all of them instead of just the next one in the text.

Second, I would recommend making your rule begin with a ^, as in ^{{(.*?)}}, to make sure it only matches at the current start of the string. A tokenizer needs to check whether the immediately following text starts a token; without ^ you may get strange results, since text can be consumed in the wrong order.
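Putting those two fixes together, a minimal sketch of the extension (the class name and renderer markup are assumed from the expected output above; I haven't run this against your gist):

```javascript
// Sketch of the {{variable}} extension with both fixes applied:
// no /g flag in start(), and a ^-anchored rule in tokenizer().
const variableExtension = {
  name: 'variable',
  level: 'inline',
  start(src) {
    // Without /g, match() returns the first occurrence along with its index.
    return src.match(/{{/)?.index;
  },
  tokenizer(src) {
    // ^ ensures we only match a token beginning at the lexer's current position.
    const rule = /^{{(.*?)}}/;
    const match = rule.exec(src);
    if (match) {
      return {
        type: 'variable',
        raw: match[0],   // the consumed characters, so the lexer advances correctly
        text: match[1]
      };
    }
  },
  renderer(token) {
    return `<span class="variable">${token.text}</span>`;
  }
};
```

With this, marked.use({ extensions: [variableExtension] }) should emit the token exactly once for Hey {{1}} instead of repeating it.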

@calculuschild
Contributor

@haayhappen Were you able to resolve this issue?

@haayhappen
Author

Sorry about my absence @calculuschild

I wasn't able to resolve it and had to move on to a different approach. However, I highly appreciate your help and think this could also be helpful to others.

@qubyte

qubyte commented Aug 22, 2021

I'm implementing a custom extension to do highlighting with the mark element and I'm seeing the same issue. The markdown will look like: this ==doesn't== work and should render to this <mark>doesn't</mark> work. Unfortunately the match seems to apply multiple times.

const extension = {
  name: 'mark',
  level: 'inline',
  start(src) {
    return src.match(/==(?!\s)/)?.index;
  },
  tokenizer(src) {
    const rule = /==(?!\s)([^\n]+)(?!\s)==/;
    const match = rule.exec(src);

    if (match) {
      console.log(src, match) // This shows me multiple matches.
      return {
        type: 'mark',
        raw: match[0],
        inner: this.lexer.inlineTokens(match[1].trim())
      };
    }
  },
  renderer(token) {
    return `<mark>${this.parser.parseInline(token.inner)}</mark>`;
  }
};

Example logs:

I'm (John) Smith ==of== ABC (corp). [
  '==of==',
  'of',
  index: 17,
  input: "I'm (John) Smith ==of== ABC (corp).",
  groups: undefined
]
ohn) Smith ==of== ABC (corp). [
  '==of==',
  'of',
  index: 11,
  input: 'ohn) Smith ==of== ABC (corp).',
  groups: undefined
]
mith ==of== ABC (corp). [
  '==of==',
  'of',
  index: 5,
  input: 'mith ==of== ABC (corp).',
  groups: undefined
]

So it looks like the start index of the source string is shifting, but feeding some of the same characters to the extension multiple times.

@calculuschild
Contributor

@qubyte I would suggest changing const rule by inserting the "start of string" marker ^ so that your token only matches when the lexer is actually at that position.

const rule = /^==(?!\s)([^\n]+)(?!\s)==/;

@qubyte

qubyte commented Aug 22, 2021

Nice. That works! I'd have never figured that out on my own. Thanks! 😅

@qubyte

qubyte commented Aug 22, 2021

It does make me a little confused about what the start function is used for. I was assuming that it was feeding the start index forward.

@calculuschild
Contributor

calculuschild commented Aug 22, 2021

The start function does feed that index forward: it acts as a hint telling the lexer to pause lexing of standard Markdown at that point and check whether a valid extension token begins there, so the text doesn't get consumed by another token. But it is only a hint that a token may be there; it is not the only way your extension might be triggered.

There are many cases where the string will not be aligned with the start of your token. By default, all user extensions are given priority over standard Markdown syntax, so the lexer will first run your tokenizer against I'm (John) Smith ==of== ABC (corp). before attempting to parse it as a paragraph. Without ^, your tokenizer sees that there is indeed a match somewhere in the string and treats it as a valid token, even though it isn't at the right location. Then, because you returned a token to the lexer, it consumes characters as normal (from the wrong location) and repeats the search for the next token, giving your extension priority again.
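To see the difference concretely with your example string:

```javascript
// The lexer calls the tokenizer with the remaining source, starting at its
// current position. Only the anchored rule answers the question it is asking:
// "does a token begin right here?"
const src = "I'm (John) Smith ==of== ABC (corp).";

const unanchored = /==(?!\s)([^\n]+)(?!\s)==/;
const anchored = /^==(?!\s)([^\n]+)(?!\s)==/;

unanchored.exec(src); // matches '==of==' at index 17, even though the lexer is at 0
anchored.exec(src);   // null — no token starts at the current position
anchored.exec('==of== ABC (corp).'); // matches once the lexer actually reaches the ==
```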

I don't know if I've explained it well but feel free to ask if you have questions.

@qubyte

qubyte commented Aug 22, 2021

Thank you for the detailed explanation! You've also answered a couple of other questions I had before I'd formed them (mainly about priority). My little extension has other issues (the regex needs a bit of tweaking), but you've given me everything I need to finish building it. Thanks again!

@qubyte

qubyte commented Aug 22, 2021

In case it's helpful to anyone else, the solution I ended up with eschews a regular expression in the tokenizer, because the regex was getting too complex for me to reliably maintain. I still use one in the start function, though:

const extension = {
  name: 'mark',
  level: 'inline',
  start(src) {
    return src.match(/==(?!\s)/)?.index;
  },
  tokenizer(src) {
    if (!src.startsWith('==')) {
      return;
    }

    const nextIndex = src.indexOf('==', 2);

    if (nextIndex !== -1) {
      return {
        type: 'mark',
        raw: src.slice(0, nextIndex + 2),
        inner: this.lexer.inlineTokens(src.slice(2, nextIndex))
      };
    }
  },
  renderer(token) {
    return `<mark>${this.parser.parseInline(token.inner)}</mark>`;
  }
};
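A quick way to sanity-check the tokenizer logic in isolation, with marked's inline lexer replaced by a hypothetical identity stub (this isn't the real lexer API, just enough to exercise the function):

```javascript
// Identity stub standing in for this.lexer.inlineTokens (an assumption for
// testing only; the real lexer returns an array of inline tokens).
const stubLexer = { inlineTokens: (s) => s };

// The same tokenizer body as in the extension above, restated as a plain function.
function markTokenizer(src) {
  if (!src.startsWith('==')) {
    return; // not at a token boundary
  }
  const nextIndex = src.indexOf('==', 2);
  if (nextIndex !== -1) {
    return {
      type: 'mark',
      raw: src.slice(0, nextIndex + 2),
      inner: stubLexer.inlineTokens(src.slice(2, nextIndex))
    };
  }
}

markTokenizer('==of== ABC');     // → { type: 'mark', raw: '==of==', inner: 'of' }
markTokenizer('no marks');       // → undefined (not at a token boundary)
markTokenizer('==unterminated'); // → undefined (no closing ==)
```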
