Doctype after <?xml> or with internal subsets results in parse errors #50

Open
EmilGedda opened this issue Oct 17, 2021 · 0 comments

Currently, Xeno.DOM.Robust does not properly handle XML doctypes.

Doctypes are removed if they appear at the start of the document. Usually, however, the doctype is placed after the XML declaration: <?xml ...?><!DOCTYPE html>.

For example, this test fails:

describe "skipDoctype" $ do
  it "strips doctype after xml declaration" $ do
    skipDoctype "<?xml version=\"1.0\" encoding=\"UTF-8\"?><!DOCTYPE html>Hello" `shouldBe` "<?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello"

One idea is for skipDoctype to check whether either of the first two < characters is followed by !DOCTYPE and, if so, remove the matching node.
I don't think supporting a doctype at the end of a document is worth bothering with.
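
A minimal sketch of a simpler variant of this idea, assuming the input is a strict ByteString: instead of inspecting the first two < characters, it keeps a leading <?xml ...?> declaration as-is and only then looks for the doctype. The names skipDoctypeAfterDecl and dropSimpleDoctype are hypothetical and are not part of Xeno; internal subsets are deliberately not handled here.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S8
import Data.ByteString (ByteString)
import Data.Char (isSpace)

-- Hypothetical helper (not Xeno's code): drop a simple "<!DOCTYPE ...>"
-- at the start of the input. Assumes there is no internal subset.
dropSimpleDoctype :: ByteString -> ByteString
dropSimpleDoctype bs
  | "<!DOCTYPE" `S8.isPrefixOf` trimmed =
      -- Skip up to the first '>' and then drop that '>' as well.
      S8.drop 1 (S8.dropWhile (/= '>') trimmed)
  | otherwise = bs
  where
    trimmed = S8.dropWhile isSpace bs

-- Hypothetical variant of skipDoctype: keep a leading "<?xml ...?>"
-- declaration and strip a doctype that directly follows it.
skipDoctypeAfterDecl :: ByteString -> ByteString
skipDoctypeAfterDecl bs
  | "<?xml" `S8.isPrefixOf` bs =
      let (decl, rest) = S8.breakSubstring "?>" bs
      in if S8.null rest
           then bs -- unterminated declaration: leave the input untouched
           else decl <> "?>" <> dropSimpleDoctype (S8.drop 2 rest)
  | otherwise = dropSimpleDoctype bs
```

With this sketch, the input from the failing test above comes back with the declaration intact and the doctype removed.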

On top of that, skipDoctype also does not handle doctypes with internal subsets, such as

<!DOCTYPE html [
  <!-- an internal subset can be embedded here -->
]>

Appropriate test:

describe "skipDoctype" $ do
  it "strips doctype with internal subsets" $ do
    skipDoctype "<!DOCTYPE html [ <!-- --> ]><?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello" `shouldBe` "<?xml version=\"1.0\" encoding=\"UTF-8\"?>Hello"

In this case, skipDoctype will return a ByteString which starts with ]>.
Ideally, skipDoctype should drop until [ or >, and if a [ was matched, then continue to drop until ]>.
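
The drop-until rule above could be sketched roughly as follows. The name skipDoctypeWithSubset is hypothetical, and the sketch makes simplifying assumptions: the subset is closed by a literal ]> with no whitespace in between, and no ]> or quoted > appears earlier inside the doctype.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as S8
import Data.ByteString (ByteString)
import Data.Char (isSpace)

-- Hypothetical variant of skipDoctype (not Xeno's code) that also
-- skips an internal subset enclosed in "[ ... ]>".
skipDoctypeWithSubset :: ByteString -> ByteString
skipDoctypeWithSubset bs
  | "<!DOCTYPE" `S8.isPrefixOf` trimmed =
      -- Drop until the first '[' or '>', whichever comes first.
      let rest = S8.dropWhile (\c -> c /= '[' && c /= '>') trimmed
      in case S8.uncons rest of
           -- Internal subset: keep dropping until the closing "]>".
           Just ('[', afterOpen) ->
             let (_, end) = S8.breakSubstring "]>" afterOpen
             in S8.drop 2 end
           -- Plain doctype (or unterminated input): drop the '>' itself.
           _ -> S8.drop 1 rest
  | otherwise = bs
  where
    trimmed = S8.dropWhile isSpace bs
```

Because the scan stops at [ before it can reach the > that closes the embedded comment, the result for the test input above starts with <?xml rather than ]>.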
