-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
XML Declaration ignored by DOMParser #2615
Comments
In your example you can get the I think the issue here is that the w3c XML serialization algorithm doesn't describe that a XML declaration should be prepended to the output which is kinda interesting. I'm not sure if that's intended or a w3c oversight though. |
Hmm, for some reason, I don't see I think that including the XML Declaration does sound like it's optional in the XML spec as well. Maybe this issue is more of a feature request than a bug report. My use case is that I'm writing middleware that deserializes XML, makes some changes and then re-serializes it and passes it on. I'd prefer not to lose the XML Declaration along the way. From the other thread I mentioned, it sounds like the necessary values could be preserved via a small change, allowing a consumer like me to construct an XML Declaration from the constituent parts if desired. One part I'm still not 100% clear on is the encoding value, though. If a document is serialized, will it be serialized with the encoding that was supplied, or the encoding of the doc. (usually UTF-8 from what I've seen.) |
I think since you're reserialising, the xml declaration of the original document doesn't carry any weight anyways. What I mean is: due to the reserialization, you'll always have a normalised XML document compatible with version 1.0 and encoded as UTF-8. It may actually be wrong to keep the original document's declaration (I'm somewhat more confident this is the case with the encoding than the version, but should apply to both). So I think you should be fine just generating your own declaration and prefixing it on the serialised doc. |
On the subject of const { XMLSerializer } = require("jsdom/lib/jsdom/living");
const serializer = new XMLSerializer(); I'm not sure if this will be safe going forward, but it appears to work for now. On the subject of XML Declaration, I see two possible courses of action that a consumer might want to take:
My use case is more the former. I don't know how difficult it would be to serialize using a different encoding than the document. I do know, though, that in As it currently stands, if I wanted to do the latter option myself, would it be safe to use |
We have discussions about this in the context of HTML parsing/serialization all the time. In short, serialization/parsing are not meant to preserve the original form of the document. They are only preserving of the original information (i.e., the abstract stuff that survives into the parsed form). (And sometimes, not even that; see the warning and examples below the algorithm at https://html.spec.whatwg.org/#serialising-html-fragments). See inikulin/parse5#261 (comment) for more.
I'm not sure, as I'm not sure what definition of "safety" you're using. But see https://dom.spec.whatwg.org/#dom-document-inputencoding for the definition of inputEncoding. |
I see your point. The original format of the document shouldn't matter, since parsing the new version should still produce the same result as parsing the original. I think I was being a bit too focused on trying to preserve the original literal xml as much as possible to limit changes introduced by my middleware. However, I don't think the end consumer of the file should really be making any decisions based on the encoding used, so I don't think it will matter. By "safe to use" I had meant "Does My plan at this point is to:
I had a dig through the serialization code in w3c-xmlserializer and for text nodes, it appears to just read |
I've been thinking about this. When you serialise the doc, you just get back a JS string, which (I think) means it should be encoding agnostic. I think what you write into the XML declaration depends on how you write the file. If you use However, it seems like jsdom always does the HTML encoding sniffing, even for XML docs. I'm not sure if that's intended or if there's a bug lurking there, but that could lead to double decodes 🤷♂ Definitely room to test that better there. |
Interesting point. In my case, I don't even deal with files at either end, since my code is dealing with a response in the form of a string or buffer. Maybe everything is implicitly utf-8 in my case. Perhaps it is the responsibility of the browser to respect the encoding when writing the xml to a file. |
OK, so here's a brief summary of what I have just now. This may be adequate for what I need: // Declare a custom symbol for storing the result of my check
const xmlDeclarationPresentSymbol = Symbol("XmlDeclarationPresent");
// Parsing:
const xmlString = Buffer.isBuffer(xml) ? xml.toString() : xml;
const dom = new JSDOM(xmlString, { contentType: "application/xml"});
// Detect if there was an XML Declaration present
// This requires me to convert from Buffer to string prior to passing the xml to `JSDOM()`
// and the `RegExp` is not doing any validation of the XML Declaration beyond the fact
// that "xml" is lowercase and in the right place.
dom.window.document[xmlDeclarationPresentSymbol] = /^<\?xml\s[^?]+\?>/.test(xmlString);
// Serializing:
const serializer = new XMLSerializer();
const xmlString = serializer.serializeToString(document);
// Refer to the symbol we set when parsing to see if
// we should add the XML Declaration.
if (document[xmlDeclarationPresentSymbol]) {
// I'm using `characterSet` rather than `inputEncoding` but the value should be the same.
// I also have some logic to guess what to use for a line break.
// I've not included it here.
xmlString = `<?xml version="1.0" encoding="${document.characterSet}"?>${inferLineBreakStr(xmlString)}${xmlString}`;
} |
This appears to be an issue where we're implementing the spec but the spec does not match browsers. Hopefully the spec will get fixed at some point. w3c/DOM-Parsing#50 |
@domenic i did read the issue and did not find the exact action to take here if there is anything that should be fixed in the parser please comment below me and ping me with simply keywords that tell me what to offer. |
I see this is still open. Any plan to serialize the XML declaration by default, or maybe provide an option to include the declaration in the serialized output? |
Basic info:
Minimal reproduction case
DOMParser
needs to retrieve the XML Declaration details fromxmlDecl
. (saxes issue #16)onprocessinginstruction
event for the XML Declaration, just other Processing Instructions.How does similar code behave in browsers?
Example in jsbin
"UTF-8"
.The text was updated successfully, but these errors were encountered: