Nested HTML not parsed correctly #322

wereHamster · 2020-07-07T04:02:59Z

BEFORE

<div>
  <div>
    <div></div>
  </div>
</div>

AFTER

ricardo-quinones · 2020-08-13T21:42:47Z

@probablyup I've verified this in a project I'm working on that uses this library. I traced the issue to this regex, which can only match nested tags with the same name as the parent one level deep. Any deeper and this rendering issue occurs. You can confirm by running the snippet below in any JS REPL:

const regex = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i;
console.log(regex.exec('<div><div><div></div></div></div>')[0])
// outputs unbalanced HTML '<div><div><div></div></div>'
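
To make the imbalance explicit, here is a minimal sketch that counts the opening and closing tags in the matched text. `countTags` is a hypothetical helper written for this illustration, not part of the library, and it assumes simple tags with no `>` inside attribute values.

```javascript
// Hypothetical helper (not part of the library): count opening and closing
// tags of a given name. Assumes attribute values never contain '>'.
function countTags(html, tagName) {
  const open = new RegExp(`<${tagName}(?=[\\s>])[^>]*>`, "gi");
  const close = new RegExp(`</${tagName}\\s*>`, "gi");
  return {
    open: (html.match(open) || []).length,
    close: (html.match(close) || []).length,
  };
}

// Same regex as in the snippet above.
const htmlBlockRegex = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i;
const matched = htmlBlockRegex.exec('<div><div><div></div></div></div>')[0];

console.log(countTags(matched, 'div'));
// { open: 3, close: 2 }
```

The match stops one closing tag short, so the third `</div>` is left over for the next parsing step.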

I tried modifying the regex to the following

/^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*((?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)*)<\/\1>\n*/i

to simulate a recursive match but performance took a steep hit (not surprisingly).

I looked through the codebase and it's using regex parsing heavily, but I wanted to ask if there have been thoughts of parsing via tokenization instead, at least for slightly better parsing of HTML such as the example in this issue? I feel that parsing HTML via regex will always have its limitations.
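
As a rough illustration of what I mean, here is a minimal sketch of depth counting instead of a single regex. `matchBalancedElement` is a hypothetical name, not anything in the library, and the sketch assumes the input starts with an opening tag and ignores comments, `<script>` contents, and `>` inside attribute values.

```javascript
// Hypothetical sketch: find the end of an element by tracking nesting depth.
// Returns the substring from the opening tag through its matching closing
// tag, or null if the input is not balanced.
function matchBalancedElement(source) {
  const opening = /^<([a-z][^\s>/]*)[^>]*>/i.exec(source);
  if (!opening) return null;
  if (opening[0].endsWith("/>")) return opening[0]; // self-closing element

  // Escape the tag name before embedding it in a new regex.
  const name = opening[1].replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Matches both <name ...> and </name>; group 1 is "/" for closing tags.
  const tag = new RegExp(`<(/?)${name}(?=[\\s/>])[^>]*>`, "gi");
  tag.lastIndex = opening[0].length; // start scanning after the opening tag

  let depth = 1;
  let m;
  while ((m = tag.exec(source)) !== null) {
    if (m[0].endsWith("/>")) continue; // self-closing tags do not nest
    depth += m[1] === "/" ? -1 : 1;
    if (depth === 0) return source.slice(0, tag.lastIndex);
  }
  return null; // ran out of input before the element was closed
}

console.log(matchBalancedElement('<div><div><div></div></div></div>'));
// prints <div><div><div></div></div></div>
```

This is a single linear scan over the tags of one name, so it handles any nesting depth without backtracking, at the cost of a little more code than one regex.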

wereHamster · 2020-08-14T09:02:28Z

We gave up on this library after discovering this issue, and switched to mdx-js. Even though mdx-js is larger, it's more robust and that's what was important for us. Regex is just not reliable enough to parse even slightly more complex HTML. Good luck.

quantizor · 2020-10-08T14:52:11Z

> I looked through the codebase and it's using regex parsing heavily, but I wanted to ask if there have been thoughts of parsing via tokenization instead, at least for slightly better parsing of HTML such as the example in this issue? I feel that parsing HTML via regex will always have its limitations.

It'd vastly inflate the bundle size, and a tiny bundle is a big part of the point of this library. All libraries have tradeoffs, and this one's is imperfect but serviceable HTML parsing at a tiny bundle size.

neopostmodern · 2021-02-18T21:13:29Z

It would be good if this were pointed out in the README.

quantizor closed this as completed Oct 8, 2020
