Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Character causes invalid character error #147

Closed
willm opened this issue Dec 16, 2016 · 20 comments
Closed

Character causes invalid character error #147

willm opened this issue Dec 16, 2016 · 20 comments
Labels

Comments

@willm
Copy link

willm commented Dec 16, 2016

I had an issue with a module consuming this one, and I found that the problem boils down to this test whch throws the following error. Is this a bug or is it intentional?

var xmlbuilder = require('xmlbuilder');
var rootXml = xmlbuilder.create('root');
rootXml.ele('test', '\u0000');
console.log(rootXml.end({}));

/lib/XMLStringifier.js:148
throw new Error("Invalid character (" + chr + ") in string: " + str + " at index " + chr.index);
^

Error: Invalid character (

@oozcitak
Copy link
Owner

oozcitak commented Dec 21, 2016

Unfortunately the null char is not valid in XML.

@willm
Copy link
Author

willm commented Dec 22, 2016

In that case, would it make sense to strip it out of the string before attempting to serialise it ?

@oozcitak
Copy link
Owner

oozcitak commented Dec 23, 2016

Yes, probably. If you are serializing binary data there are more control characters that need to be stripped. You may want to take at a look at:
http://stackoverflow.com/questions/730133/invalid-characters-in-xml
http://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java

@willm
Copy link
Author

willm commented Dec 30, 2016

Thanks I've had a read and didn't realise that some characters were simply invalid in xml! However, shouldn't this library handle this? According to the stack overflow posts you shared, the allowed characters are:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Do you think it makes sense for this package to strip any other characters when building an xml? Or do you think this is up to each consumer to handle?

@oozcitak
Copy link
Owner

oozcitak commented Jan 2, 2017

I may add a utility function -to be be explicitly called or configured by the user- to remove illegal characters since I don't think it would be a good idea to silently destroy people's hard-earned characters.

Woodstox XML API for example throws on invalid characters by default but it also has a replacer function that can be configured by the user.

@ebickle
Copy link

ebickle commented Mar 27, 2017

I ran into this issue with a production app; we discovered our users were entering emoji characters into description fields and we would end up getting an invalid character error from the xmlbuilder library as we converted the input to XML to be routed into an enterprise messaging/eventing system.

From the XML Specification: https://www.w3.org/TR/xml11/#charsets

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

And above that:

New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]

XMLBuilder throws an exception for valid characters within the range specified above, in particular, #x10000 and above; this includes Emoji.

Providing a utility to strip the characters out would be an even worse option, as this would change the meaning of the data. I can't determine whether characters codepoints #x10000 and above are valid in an XML markup, but at minimum they should be encoded; for example:
😀

Did you accidentally implement the XML 1.0 specification instead of XML 1.1? In the changelist for XML 1.1, they state:

Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents.

@ebickle
Copy link

ebickle commented Mar 27, 2017

A bit more info on this. There are two separate issues:

  • The character ranges listed in the library's regular expression block valid XML 1.1 characters, such as control characters, that should be escaped.
  • Javascript, prior to ES6, did not have support for supplementary Unicode planes. This causes an issue with emoji characters, as described above. These are becoming very common in modern usage, particular on mobile devices.

[\u0000-\u0008\u000B-\u000C\u000E-\u001F\uD800-\uDFFF\uFFFE-\uFFFF] in XMLStringifier.coffee is matching Unicode characters >= #X10000 as a "one surrogate character, then one other character". A lone surrogate character is prohibited in XML, but valid surrogate pairs (supplementary plane characters) are permitted.

ES6 fixes this issue by adding Unicode support to regular expressions. The /u flag enables Unicode support for regular expressions; without it, Javascript's Regex functionality incorrectly breaks with any supplementary plane character from Unicode.

https://mathiasbynens.be/notes/javascript-unicode

If the library only supports ES6, the solution is easy - add the "/u" flag at the end of the regular expressions in XMLStringifier.coffee and you're good to go. To adhere to the XML 1.1 specification, you'll also need to escape the valid control characters that are also being incorrectly prohibited by the library.

If you need backwards compatibility, it's trickier. One option might be to use a 'legacy' regular expression that detects valid surrogate characters instead of the "allowSurrogateChars" boolean (remember, an XML encoder MUST support the character specification by default). This would require some tricky regular expressions that detect valid surrogate pairs but block invalid long surrogates.

@pietersv
Copy link

pietersv commented Apr 7, 2017

Ran into this today, our pptx export was crashed by a poop emoji, actual error log below. I suppose it's "cleaner" to remove these particular characters, but they're also arguably relevant content as it was a text open-end response to a survey of gastroenterologists.

Error: Invalid character (�) in string: "1. We never want to scope anyone. 💩💩💩💩" at index 34

@TylerShin
Copy link

colud you deploy npm library with 1f9b41a applied version?

oozcitak added a commit that referenced this issue Jun 6, 2017
Use an ES5 compatible Regexp to assert legal characters. Closes #147.
mad-gooze added a commit to mad-gooze/karma-html-detailed-reporter that referenced this issue Dec 4, 2018
xmlbuilder has fixed some bugs which my crash karma-html-detailed-reporter like oozcitak/xmlbuilder-js#147 (if there are invalid characters in report)
@leafac
Copy link

leafac commented Mar 25, 2020

I may add a utility function -to be be explicitly called or configured by the user- to remove illegal characters since I don't think it would be a good idea to silently destroy people's hard-earned characters.

Woodstox XML API for example throws on invalid characters by default but it also has a replacer function that can be configured by the user.

@oozcitak, did you ever add that utility function? I ran into a problem in which I’m trying to generate XML based on user input, and I just want to ignore characters that aren’t valid, for example, control characters.

(For what it’s worth, I’m using xmlbuild-js through node-xml2js, so I don’t think xmlbuilder2 is an option).

@oozcitak
Copy link
Owner

oozcitak commented Mar 26, 2020

Just added the invalidCharReplacement option. When set; xmlbuilder will replace all invalid characters (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) with this replacement character. Example:

const obj = {
  'node\x00': 'text\x08content'
}

const xmlStr = builder.create(obj, { invalidCharReplacement: '' }).end({ pretty: true });
// <?xml version="1.0"?>
// <node>textcontent</node>

@oozcitak
Copy link
Owner

invalidCharReplacement is passed directly as the second parameter to string.replace so you can also use a function:

const obj = {
  'node\x00': 'text\x08content'
}
const options = { 
  invalidCharReplacement: (c) => c === '\x00' ? '' : '_'
};
const xmlStr = builder.create(obj, options).end({ pretty: true });
// <?xml version="1.0"?>
// <node>text_content</node>

@leafac
Copy link

leafac commented Mar 26, 2020

Awesome. Thank you very much 😃

@leafac
Copy link

leafac commented Mar 26, 2020

I guess I won’t be able to use this in xml2js, which is several versions behind on xmlbuilder (11). Should I just migrate to use xmlbuilder2 directly? It comes with a builder & a parser, so it should work fine right? Is invalidCharReplacement a thing in xmlbuilder2?

@leafac
Copy link

leafac commented Mar 26, 2020

Upon further investigation, the situation on xmlbuilder2 seems to be:

  1. There’s no invalidCharReplacement option (or equivalent) that I can find.

  2. Unlike xmlbuilder, xmlbuilder2 doesn’t check for invalid characters and may produce an invalid XML. For example, I wrote the following program:

const xmlbuilder2 = require("xmlbuilder2");
const fs = require("fs");

fs.writeFileSync(
  "example.xml",
  xmlbuilder2.convert(
    { root: "A control character (backspace): \b" },
    { format: "xml", prettyPrint: true }
  )
);

Then I put the generated example.xml through https://www.xmlvalidation.com and it says that the XML is invalid.

Should I open an issue on https://github.com/oozcitak/xmlbuilder2?

@leafac
Copy link

leafac commented Mar 26, 2020

In summary, I don’t know of a good solution for folks like me who need to parse & generate XML and support user input with strange characters:

  • xml2js uses a version of xmlbuilder that’s too old and doesn’t support invalidCharReplacement.
  • xmlbuilder doesn’t parse XML.
  • xmlbuilder2 doesn’t seem to handle strange characters well.

@oozcitak
Copy link
Owner

@leafac I will be adding invalidCharReplacement to xmlbuilder2.

Unlike xmlbuilder, xmlbuilder2 doesn’t check for invalid characters and may produce an invalid XML.

To make sure you get well-formed XML on serialization, xmlbuilder2 provides the wellFormed flag:

const xmlbuilder2 = require("xmlbuilder2");
const fs = require("fs");

fs.writeFileSync(
  "example.xml",
  xmlbuilder2.convert(
    { root: "A control character (backspace): \b" },
    { format: "xml", prettyPrint: true, wellFormed: true }
  )
);

wellFormed defaults to false to match common browser implementations. For example, try copy/pasting this in to your browser's console and you'll see there will be no errors:

new XMLSerializer().serializeToString(document.createTextNode("A control character (backspace): \b"));

@leafac
Copy link

leafac commented Mar 27, 2020

Thanks for all your help in navigating this space. I’m trying to use xmlbuilder2, but I’m having the following issues:

  1. Conversion from JavaScript object to XML string requires namespaces to be declared as part of the name of the root element, for example:

    xmlbuilder2.convert(
      { "feed@http://www.w3.org/2005/Atom": "" },
      { wellFormed: true }
    );
    // <?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"/>

    If I declare it as a regular attribute, xmlbuilder2 rejects it:

    xmlbuilder2.convert(
      { feed: { "@xmlns": "http://www.w3.org/2005/Atom" } },
      { wellFormed: true }
    );
    // InvalidStateError: The object is in an invalid state.

    But the conversion from XML string to JavaScript object produces an object like the latter, not the former:

    xmlbuilder2.convert(
      `<?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"/>`,
      { format: "object", wellFormed: true }
    );
    // { feed: { '@xmlns': 'http://www.w3.org/2005/Atom' } }

    This means that a roundtrip from XML string to JavaScript object and back may fail. I guess I could work around this by writing code to detect an @xmlns attribute and merge it into the root element, but I feel like I’m going against the grain.

  2. The types in TypeScript are weird to use. For example, the convert() function always returns an XMLSerializedValue, which may be a string or a XMLSerializedObject, among other things. This means that I have to cast the result of convert() every time: convert(___, { format: "object" }) as any. Would it be possible for the return type of convert() to be dependent on the format, so that, for example, the return type of convert(___, { format: "object" }) is a regular JavaScript object?

@oozcitak
Copy link
Owner

Thanks for taking the time to investigate. I opened these two issues in xmlbuilder2.

@oozcitak
Copy link
Owner

Same invalidCharReplacement option is now available in xmlbuilder2 with version 2.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants