loadHtml incorrectly parsing HTML #21609

Open
theAkito opened this issue Apr 2, 2023 · 3 comments

Comments

@theAkito
Contributor

theAkito commented Apr 2, 2023

Description

test.html

<li class="list-entry" id="sb">
<span class="list-item item-type">
  <span class="item-icons">
  <span class="list-item item-size">2&nbsp;GiB<input type="hidden" name="size" value="24653"></span>
  <span class="list-item item-count1">0</span>
  <span class="list-item item-count2">0&nbsp;</span>
</li>

testhtml.nim

##[
  Test HTML Loading
]##

import
  std/[
    htmlparser,
    xmltree,
  ]

let
  xContent = "test.html".loadHtml

echo $xContent

Run

nim c -r testhtml.nim

Result

<li class="list-entry" id="sb">
<span class="list-item item-type">
  <span class="item-icons">
  <span class="list-item item-size">2 GiB<input type="hidden" value="24653" name="size" />0</span>
  <span class="list-item item-count2">0 </span>
</span></span></li>

Nim Version

Nim Compiler Version 1.6.12 [Linux: amd64]
Compiled at 2023-03-10
Copyright (c) 2006-2023 by Andreas Rumpf

git hash: 1aa9273640c0c51486cf3a7b67282fe58f360e91
active boot switches: -d:release

Current Output

<li class="list-entry" id="sb">
<span class="list-item item-type">
  <span class="item-icons">
  <span class="list-item item-size">2 GiB<input type="hidden" value="24653" name="size" />0</span>
  <span class="list-item item-count2">0 </span>
</span></span></li>

Expected Output

<li class="list-entry" id="sb">
<span class="list-item item-type">
  <span class="item-icons">
  <span class="list-item item-size">2&nbsp;GiB<input type="hidden" name="size" value="24653"></span>
  <span class="list-item item-count1">0</span>
  <span class="list-item item-count2">0&nbsp;</span>
</li>

Possible Solution

No response

Additional Information

  • Works fine in literally every browser.

I just noticed the input tag is missing its closing slash. Okay, fine, but the result still makes no sense. If the browser renders it properly, the parser should be able to handle it properly too, either by parsing it or by raising an error, so the content doesn't just get swallowed.

The problem is that the result is even more wrong than the input. It shouldn't work that way.

@metagn
Collaborator

metagn commented Apr 2, 2023

The first 2 spans don't have a closing tag, how is that supposed to be handled?

@theAkito
Contributor Author

theAkito commented Apr 2, 2023

> The first 2 spans don't have a closing tag, how is that supposed to be handled?

That's the part that actually works. 😄


Obviously it's hard to handle every kind of "wild" HTML. I get it.
But the other side of the truth is that the browser can handle it. That is proof that this is already handled in real life, so there should be a Nim solution for this problem.

I'm not saying that all HTML should be parsed perfectly without issues. I'm saying: either parse it or tell me what you couldn't understand. Just throw an exception or whatever. But don't leave me standing in the cold rain, pretending the parsing went fine. 🙂

@metagn
Collaborator

metagn commented Apr 3, 2023

It would have helped if you had pointed out what was wrong and maybe even minimized the example, because I did not catch the issue the first time. In fact, the first 2 spans are misleading because they are not needed to reproduce the issue. From what I understand, the following reproduces the issue you are encountering, and it is fixed when a closing / is added to input:

<span class="1">a<input></span>
<span class="2">b</span>
<span class="3">c</span>
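
To try this minimized case without a separate file, something like the following sketch may help. It parses the input once with `<input>` and once with `<input />` and prints both trees for comparison. It assumes the `parseHtml(html: string)` overload from `std/htmlparser`; the constant and variable names are only illustrative, and no output is shown because it may differ between Nim versions.

```nim
import std/[htmlparser, xmltree]

# Minimized input from the comment above: <input> has no closing slash.
const unclosed = """<span class="1">a<input></span>
<span class="2">b</span>
<span class="3">c</span>"""

# Same input with the slash added, which reportedly avoids the issue.
const closed = """<span class="1">a<input /></span>
<span class="2">b</span>
<span class="3">c</span>"""

# parseHtml(string) silently ignores parse errors.
let unclosedTree = parseHtml(unclosed)
let closedTree = parseHtml(closed)

echo "--- with <input> ---"
echo unclosedTree
echo "--- with <input /> ---"
echo closedTree
```

Comparing the two printed trees should show whether the later spans end up nested or dropped in the first case.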

For the record, a somewhat similar issue, #14073, was closed with a note that fusion htmlparser might not have the same issue. So this might work with fusion htmlparser as well, but I can't test that as I don't have fusion. I think it's possible, though, since input is in the SingleTags constant in fusion but not in the standard library.

There is also a way to get errors from loadHtml by passing a var seq[string] argument
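
A minimal sketch of collecting those errors, assuming the `parseHtml(s: Stream, filename: string, errors: var seq[string])` overload from `std/htmlparser` (the `loadHtml(path, errors)` overload works the same way for files). The file name `minimal.html` is just a label used in error messages, and the variable names are illustrative. Whether the parser actually reports anything for this particular input has not been verified here.

```nim
import std/[htmlparser, streams, xmltree]

# Minimized input from the comment above.
const html = """<span class="1">a<input></span>
<span class="2">b</span>
<span class="3">c</span>"""

# The parser appends one message per problem it notices to this seq.
var errors: seq[string] = @[]

# "minimal.html" is only used as a label inside error messages.
let tree = parseHtml(newStringStream(html), "minimal.html", errors)

echo tree

if errors.len == 0:
  echo "no errors reported"
else:
  for msg in errors:
    echo "parse error: ", msg
```

If the seq stays empty for input that is visibly mangled, that would support the complaint that the parser swallows content without saying so.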
