Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XML Declaration ignored by DOMParser #2615

Open
kenlyon opened this issue Jun 19, 2019 · 12 comments
Open

XML Declaration ignored by DOMParser #2615

kenlyon opened this issue Jun 19, 2019 · 12 comments

Comments

@kenlyon
Copy link

kenlyon commented Jun 19, 2019

Basic info:

  • Node.js version: v10.15.3
  • jsdom version: 15.1.1

Minimal reproduction case

const { JSDOM } = require("jsdom");
const { XMLSerializer } = require("w3c-xmlserializer");

const inputXml = `<?xml version="1.0" encoding="ASCII"?><test>hello</test>`;
const options = { contentType: "application/xml" };
const dom = new JSDOM(inputXml, options);
const XMLSerializer_ctor = XMLSerializer.interface;
const serializer = new XMLSerializer_ctor();
const outputXml = serializer.serializeToString(dom.window.document);

console.log("inputXml:");
console.log(inputXml);
console.log("outputXml:");
console.log(outputXml); // Expected this to match inputXml.
  • I realize this includes w3c-xmlserializer, but I don't see any other way to demonstrate the full process without it, since the deserialization is not done by jsdom itself.
  • I had initially logged a bug about this in the saxes project, but they seemed to think DOMParser needs to retrieve the XML Declaration details from xmlDecl. (saxes issue #16)
  • It sounds like I'm re-stating Is there a way to access the XML declaration in XHTML documents #415, but it did not actually address the original problem as described. It allowed Processing Instructions to be parsed, but not the actual XML Declaration, which was what the bug was about. Crucially, saxes does not emit the onprocessinginstruction event for the XML Declaration, just other Processing Instructions.

How does similar code behave in browsers?

Example in jsbin

  • Firefox produces an XML Declaration, but the encoding is changed to "UTF-8".
  • Chrome produces an XML Declaration that matches the original.
  • IE does not produce an XML Declaration.
  • Edge does not produce an XML Declaration.
@Sebmaster
Copy link
Member

In your example you can get the XMLSerializer interface from dom.window.XMLSerializer to drop the explicit w3c-xmlserializer dependency, but yeah, that should give you the result.

I think the issue here is that the w3c XML serialization algorithm doesn't describe that a XML declaration should be prepended to the output which is kinda interesting. I'm not sure if that's intended or a w3c oversight though.

@kenlyon
Copy link
Author

kenlyon commented Jun 20, 2019

Hmm, for some reason, I don't see XMLSerializer on dom.window. The only way I see to get at w3c-xmlserializer via jsdom is to use dom.serialize(). I'm using typescript with @types/jsdom v12.2.3. Maybe it's out of date? I didn't see it in jsdom either, though.

I think that including the XML Declaration does sound like it's optional in the XML spec as well. Maybe this issue is more of a feature request than a bug report.

My use case is that I'm writing middleware that deserializes XML, makes some changes and then re-serializes it and passes it on. I'd prefer not to lose the XML Declaration along the way.

From the other thread I mentioned, it sounds like the necessary values could be preserved via a small change, allowing a consumer like me to construct an XML Declaration from the constituent parts if desired.

One part I'm still not 100% clear on is the encoding value, though. If a document is serialized, will it be serialized with the encoding that was supplied, or the encoding of the doc. (usually UTF-8 from what I've seen.)

@Sebmaster
Copy link
Member

I think since you're reserialising, the xml declaration of the original document doesn't carry any weight anyways. What I mean is: due to the reserialization, you'll always have a normalised XML document compatible with version 1.0 and encoded as UTF-8. It may actually be wrong to keep the original document's declaration (I'm somewhat more confident this is the case with the encoding than the version, but should apply to both).

So I think you should be fine just generating your own declaration and prefixing it on the serialised doc.

@kenlyon
Copy link
Author

kenlyon commented Jun 20, 2019

On the subject of XMLSerializer, I had a closer look. The only way I can see to reference it directly is to do this:

const { XMLSerializer } = require("jsdom/lib/jsdom/living");
const serializer = new XMLSerializer();

I'm not sure if this will be safe going forward, but it appears to work for now.


On the subject of XML Declaration, I see two possible courses of action that a consumer might want to take:

  1. Attempt to match the original xml as much as possible. Serialize a document using the same encoding that it was deserialized from to begin with, re-creating any XML Declaration that present, but not adding one if there was none present.
  2. Have the option to add an XML Declaration that is true to the new representation of the document. version would be "1.0", encoding would be the encoding of the document and I don't know about standalone.

My use case is more the former. I don't know how difficult it would be to serialize using a different encoding than the document.

I do know, though, that in parseFragment() and parseIntoDocument() you could inspect parser.xmlDecl prior to calling parser.close() in order to find out if any version, encoding or standalone were provided.

As it currently stands, if I wanted to do the latter option myself, would it be safe to use dom.window.document.inputEncoding to detect the encoding that is in use?

@domenic
Copy link
Member

domenic commented Jun 21, 2019

We have discussions about this in the context of HTML parsing/serialization all the time. In short, serialization/parsing are not meant to preserve the original form of the document. They are only preserving of the original information (i.e., the abstract stuff that survives into the parsed form). (And sometimes, not even that; see the warning and examples below the algorithm at https://html.spec.whatwg.org/#serialising-html-fragments). See inikulin/parse5#261 (comment) for more.

As it currently stands, if I wanted to do the latter option myself, would it be safe to use dom.window.document.inputEncoding to detect the encoding that is in use?

I'm not sure, as I'm not sure what definition of "safety" you're using. But see https://dom.spec.whatwg.org/#dom-document-inputencoding for the definition of inputEncoding.

@kenlyon
Copy link
Author

kenlyon commented Jun 21, 2019

I see your point. The original format of the document shouldn't matter, since parsing the new version should still produce the same result as parsing the original. I think I was being a bit too focused on trying to preserve the original literal xml as much as possible to limit changes introduced by my middleware. However, I don't think the end consumer of the file should really be making any decisions based on the encoding used, so I don't think it will matter.

By "safe to use" I had meant "Does dom.window.document.inputEncoding always contain the encoding that will be used in the serialized output?" Thanks for the link. I see that it contains the encoding of the document, but does the serializer always use the encoding of the document?

My plan at this point is to:

  1. When parsing, determine if there was an XML Declaration present. Ideally, this would be possible via something being set in the document by parseIntoDocument(), although failing that I can inspect the start of the string myself.
  2. When re-serializing, include an XML Declaration if one had been present initially. I would like to ensure that the encoding that I specify is correct.

I had a dig through the serialization code in w3c-xmlserializer and for text nodes, it appears to just read node.data and escape a few important characters. It doesn't seem to change any encoding, so I think this means the document's encoding is preserved.

@Sebmaster
Copy link
Member

I would like to ensure that the encoding that I specify is correct.

I've been thinking about this. When you serialise the doc, you just get back a JS string, which (I think) means it should be encoding agnostic.

I think what you write into the XML declaration depends on how you write the file. If you use fs.writeFile and specify the string without an encoding, the file will always end up with utf-8 encoding.

However, it seems like jsdom always does the HTML encoding sniffing, even for XML docs. I'm not sure if that's intended or if there's a bug lurking there, but that could lead to double decodes 🤷‍♂ Definitely room to test that better there.

@kenlyon
Copy link
Author

kenlyon commented Jun 21, 2019

I think what you write into the XML declaration depends on how you write the file. If you use fs.writeFile and specify the string without an encoding, the file will always end up with utf-8 encoding.

Interesting point. In my case, I don't even deal with files at either end, since my code is dealing with a response in the form of a string or buffer. Maybe everything is implicitly utf-8 in my case. Perhaps it is the responsibility of the browser to respect the encoding when writing the xml to a file.

@kenlyon
Copy link
Author

kenlyon commented Jun 21, 2019

OK, so here's a brief summary of what I have just now. This may be adequate for what I need:

// Declare a custom symbol for storing the result of my check
const xmlDeclarationPresentSymbol = Symbol("XmlDeclarationPresent");

// Parsing:

const xmlString = Buffer.isBuffer(xml) ? xml.toString() : xml;
const dom = new JSDOM(xmlString, { contentType: "application/xml"});
// Detect if there was an XML Declaration present
// This requires me to convert from Buffer to string prior to passing the xml to `JSDOM()`
// and the `RegExp` is not doing any validation of the XML Declaration beyond the fact
// that "xml" is lowercase and in the right place.
dom.window.document[xmlDeclarationPresentSymbol] = /^<\?xml\s[^?]+\?>/.test(xmlString);

// Serializing:

const serializer = new XMLSerializer();
const xmlString = serializer.serializeToString(document);
// Refer to the symbol we set when parsing to see if
// we should add the XML Declaration.
if (document[xmlDeclarationPresentSymbol]) {
    // I'm using `characterSet` rather than `inputEncoding` but the value should be the same.
    // I also have some logic to guess what to use for a line break.
    // I've not included it here.
    xmlString = `<?xml version="1.0" encoding="${document.characterSet}"?>${inferLineBreakStr(xmlString)}${xmlString}`;
}

@domenic
Copy link
Member

domenic commented Jan 5, 2020

This appears to be an issue where we're implementing the spec but the spec does not match browsers. Hopefully the spec will get fixed at some point. w3c/DOM-Parsing#50

@frank-dspeed
Copy link

@domenic i did read the issue and did not find the exact action to take here if there is anything that should be fixed in the parser please comment below me and ping me with simply keywords that tell me what to offer.

@amaster507
Copy link

I see this is still open. Any plan to serialize the XML declaration by default, or maybe provide an option to include the declaration in the serialized output?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants