Nested HTML not parsed correctly #322

Closed
wereHamster opened this issue Jul 7, 2020 · 4 comments

Comments

@wereHamster

BEFORE

<div>
  <div>
    <div></div>
  </div>
</div>

AFTER

image

@ricardo-quinones

@probablyup I've verified this in a project I'm working on that uses this library. I traced the issue to this regex, which can only match nested tags with the same name as the parent tag one level deep. Any deeper and this rendering issue occurs. You can confirm by running the snippet below in any JS REPL:

const regex = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i;
console.log(regex.exec('<div><div><div></div></div></div>')[0])
// outputs unbalanced HTML '<div><div><div></div></div>'
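
As a quick sanity check of the claim above, the sketch below counts opening and closing `div` tags in the matched string. The regex is copied from the snippet; the variable names (`nestedTagRegex`, `matched`, `opens`, `closes`) are only illustrative and do not come from the library.

```javascript
// Regex copied verbatim from the snippet above.
const nestedTagRegex = /^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*(?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)<\/\1>\n*/i;

const matched = nestedTagRegex.exec('<div><div><div></div></div></div>')[0];

// Count opening and closing <div> tags inside the matched text.
const opens = (matched.match(/<div\b/g) || []).length;
const closes = (matched.match(/<\/div>/g) || []).length;

console.log({ opens, closes }); // three opening tags but only two closing tags
```

The mismatch arises because the inner alternative consumes lazily up to the first `</div>` it finds, so the final closing tag of the input is left outside the match.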

I tried modifying the regex to the following:

/^ *(?!<[a-z][^ >/]* ?\/>)<([a-z][^ >/]*) ?([^>]*)\/{0}>\n?(\s*((?:<\1[^>]*?>[\s\S]*?<\/\1>|(?!<\1)[\s\S])*?)*)<\/\1>\n*/i

to simulate a recursive match, but performance took a steep hit (not surprisingly).

I looked through the codebase and it relies heavily on regex parsing, but I wanted to ask whether there have been thoughts of parsing via tokenization instead, at least for slightly better handling of HTML such as the example in this issue? I feel that parsing HTML via regex will always have its limitations.
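
To make the tokenization idea concrete, here is a minimal sketch of depth counting, assuming the input starts with an opening tag and that attribute values never contain `>`. `findBalancedEnd` is a hypothetical helper written for illustration; it is not part of this library and is not a proposal for its actual implementation.

```javascript
// Hypothetical helper: returns the index just past the closing tag that
// balances the first opening `tagName` tag in `html`, or -1 if none is found.
// Assumes `tagName` contains only characters that are safe inside a regex.
function findBalancedEnd(html, tagName) {
  // Matches any opening, closing, or self-closing tag with the given name.
  const tagPattern = new RegExp(`<(/?)${tagName}\\b[^>]*>`, 'gi');
  let depth = 0;
  let match;
  while ((match = tagPattern.exec(html)) !== null) {
    const isClosing = match[1] === '/';
    const isSelfClosing = match[0].endsWith('/>');
    if (isClosing) {
      depth -= 1;
      // Depth back at zero means the outermost tag has just been closed.
      if (depth === 0) return tagPattern.lastIndex;
    } else if (!isSelfClosing) {
      depth += 1;
    }
  }
  return -1; // no balanced closing tag found
}

const input = '<div><div><div></div></div></div>';
const end = findBalancedEnd(input, 'div');
console.log(input.slice(0, end)); // the full, balanced input
```

This still uses a small regex to find tags, but the nesting itself is tracked by a counter, so it works at any depth in a single linear pass. A real tokenizer would also need to handle comments, quoted attribute values, and malformed markup.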

@wereHamster
Author

We gave up on this library after discovering this issue, and switched to mdx-js. Even though mdx-js is larger, it's more robust and that's what was important for us. Regex is just not reliable enough to parse even slightly more complex HTML. Good luck.

@quantizor
Owner

> I looked through the codebase and it relies heavily on regex parsing, but I wanted to ask whether there have been thoughts of parsing via tokenization instead, at least for slightly better handling of HTML such as the example in this issue? I feel that parsing HTML via regex will always have its limitations.

It'd vastly inflate the bundle size, and a small bundle is a big part of the point of this library. All libraries have tradeoffs; this one's is imperfect but serviceable HTML parsing at a tiny bundle size.

@neopostmodern

It would be good if this were pointed out in the README.
