Character causes invalid character error #147

willm · 2016-12-16T00:04:43Z

I had an issue with a module consuming this one, and I found that the problem boils down to this test whch throws the following error. Is this a bug or is it intentional?

var xmlbuilder = require('xmlbuilder');
var rootXml = xmlbuilder.create('root');
rootXml.ele('test', '\u0000');
console.log(rootXml.end({}));

/lib/XMLStringifier.js:148
throw new Error("Invalid character (" + chr + ") in string: " + str + " at index " + chr.index);
^

Error: Invalid character (

oozcitak · 2016-12-21T21:24:01Z

Unfortunately the null char is not valid in XML.

willm · 2016-12-22T06:52:21Z

In that case, would it make sense to strip it out of the string before attempting to serialise it ?

oozcitak · 2016-12-23T06:34:36Z

Yes, probably. If you are serializing binary data there are more control characters that need to be stripped. You may want to take at a look at:
http://stackoverflow.com/questions/730133/invalid-characters-in-xml
http://stackoverflow.com/questions/4237625/removing-invalid-xml-characters-from-a-string-in-java

willm · 2016-12-30T12:03:45Z

Thanks I've had a read and didn't realise that some characters were simply invalid in xml! However, shouldn't this library handle this? According to the stack overflow posts you shared, the allowed characters are:

#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Do you think it makes sense for this package to strip any other characters when building an xml? Or do you think this is up to each consumer to handle?

oozcitak · 2017-01-02T19:44:56Z

I may add a utility function -to be be explicitly called or configured by the user- to remove illegal characters since I don't think it would be a good idea to silently destroy people's hard-earned characters.

Woodstox XML API for example throws on invalid characters by default but it also has a replacer function that can be configured by the user.

ebickle · 2017-03-27T15:53:28Z

I ran into this issue with a production app; we discovered our users were entering emoji characters into description fields and we would end up getting an invalid character error from the xmlbuilder library as we converted the input to XML to be routed into an enterprise messaging/eventing system.

From the XML Specification: https://www.w3.org/TR/xml11/#charsets

Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

And above that:

New characters may be added to these standards by amendments or new editions. Consequently, XML processors MUST accept any character in the range specified for Char.]

XMLBuilder throws an exception for valid characters within the range specified above, in particular, #x10000 and above; this includes Emoji.

Providing a utility to strip the characters out would be an even worse option, as this would change the meaning of the data. I can't determine whether characters codepoints #x10000 and above are valid in an XML markup, but at minimum they should be encoded; for example:
😀

Did you accidentally implement the XML 1.0 specification instead of XML 1.1? In the changelist for XML 1.1, they state:

Finally, there is considerable demand to define a standard representation of arbitrary Unicode characters in XML documents. Therefore, XML 1.1 allows the use of character references to the control characters #x1 through #x1F, most of which are forbidden in XML 1.0. For reasons of robustness, however, these characters still cannot be used directly in documents.

ebickle · 2017-03-27T17:28:32Z

A bit more info on this. There are two separate issues:

The character ranges listed in the library's regular expression block valid XML 1.1 characters, such as control characters, that should be escaped.
Javascript, prior to ES6, did not have support for supplementary Unicode planes. This causes an issue with emoji characters, as described above. These are becoming very common in modern usage, particular on mobile devices.

[\u0000-\u0008\u000B-\u000C\u000E-\u001F\uD800-\uDFFF\uFFFE-\uFFFF] in XMLStringifier.coffee is matching Unicode characters >= #X10000 as a "one surrogate character, then one other character". A lone surrogate character is prohibited in XML, but valid surrogate pairs (supplementary plane characters) are permitted.

ES6 fixes this issue by adding Unicode support to regular expressions. The /u flag enables Unicode support for regular expressions; without it, Javascript's Regex functionality incorrectly breaks with any supplementary plane character from Unicode.

https://mathiasbynens.be/notes/javascript-unicode

If the library only supports ES6, the solution is easy - add the "/u" flag at the end of the regular expressions in XMLStringifier.coffee and you're good to go. To adhere to the XML 1.1 specification, you'll also need to escape the valid control characters that are also being incorrectly prohibited by the library.

If you need backwards compatibility, it's trickier. One option might be to use a 'legacy' regular expression that detects valid surrogate characters instead of the "allowSurrogateChars" boolean (remember, an XML encoder MUST support the character specification by default). This would require some tricky regular expressions that detect valid surrogate pairs but block invalid long surrogates.

pietersv · 2017-04-07T20:24:05Z

Ran into this today, our pptx export was crashed by a poop emoji, actual error log below. I suppose it's "cleaner" to remove these particular characters, but they're also arguably relevant content as it was a text open-end response to a survey of gastroenterologists.

Error: Invalid character (�) in string: "1. We never want to scope anyone. 💩💩💩💩" at index 34

TylerShin · 2017-05-04T01:16:27Z

colud you deploy npm library with 1f9b41a applied version?

Use an ES5 compatible Regexp to assert legal characters. Closes #147.

xmlbuilder has fixed some bugs which my crash karma-html-detailed-reporter like oozcitak/xmlbuilder-js#147 (if there are invalid characters in report)

leafac · 2020-03-25T19:10:47Z

I may add a utility function -to be be explicitly called or configured by the user- to remove illegal characters since I don't think it would be a good idea to silently destroy people's hard-earned characters.

Woodstox XML API for example throws on invalid characters by default but it also has a replacer function that can be configured by the user.

@oozcitak, did you ever add that utility function? I ran into a problem in which I’m trying to generate XML based on user input, and I just want to ignore characters that aren’t valid, for example, control characters.

(For what it’s worth, I’m using xmlbuild-js through node-xml2js, so I don’t think xmlbuilder2 is an option).

oozcitak · 2020-03-26T11:10:40Z

const obj = {
  'node\x00': 'text\x08content'
}

const xmlStr = builder.create(obj, { invalidCharReplacement: '' }).end({ pretty: true });
// <?xml version="1.0"?>
// <node>textcontent</node>

oozcitak · 2020-03-26T11:24:12Z

invalidCharReplacement is passed directly as the second parameter to string.replace so you can also use a function:

const obj = {
  'node\x00': 'text\x08content'
}
const options = { 
  invalidCharReplacement: (c) => c === '\x00' ? '' : '_'
};
const xmlStr = builder.create(obj, options).end({ pretty: true });
// <?xml version="1.0"?>
// <node>text_content</node>

leafac · 2020-03-26T17:27:50Z

Awesome. Thank you very much 😃

leafac · 2020-03-26T17:34:27Z

I guess I won’t be able to use this in xml2js, which is several versions behind on xmlbuilder (11). Should I just migrate to use xmlbuilder2 directly? It comes with a builder & a parser, so it should work fine right? Is invalidCharReplacement a thing in xmlbuilder2?

leafac · 2020-03-26T20:48:31Z

Upon further investigation, the situation on xmlbuilder2 seems to be:

There’s no invalidCharReplacement option (or equivalent) that I can find.
Unlike xmlbuilder, xmlbuilder2 doesn’t check for invalid characters and may produce an invalid XML. For example, I wrote the following program:

const xmlbuilder2 = require("xmlbuilder2");
const fs = require("fs");

fs.writeFileSync(
  "example.xml",
  xmlbuilder2.convert(
    { root: "A control character (backspace): \b" },
    { format: "xml", prettyPrint: true }
  )
);

Then I put the generated example.xml through https://www.xmlvalidation.com and it says that the XML is invalid.

Should I open an issue on https://github.com/oozcitak/xmlbuilder2?

leafac · 2020-03-26T20:51:31Z

In summary, I don’t know of a good solution for folks like me who need to parse & generate XML and support user input with strange characters:

xml2js uses a version of xmlbuilder that’s too old and doesn’t support invalidCharReplacement.
xmlbuilder doesn’t parse XML.
xmlbuilder2 doesn’t seem to handle strange characters well.

oozcitak · 2020-03-26T21:22:49Z

@leafac I will be adding invalidCharReplacement to xmlbuilder2.

Unlike xmlbuilder, xmlbuilder2 doesn’t check for invalid characters and may produce an invalid XML.

To make sure you get well-formed XML on serialization, xmlbuilder2 provides the wellFormed flag:

const xmlbuilder2 = require("xmlbuilder2");
const fs = require("fs");

fs.writeFileSync(
  "example.xml",
  xmlbuilder2.convert(
    { root: "A control character (backspace): \b" },
    { format: "xml", prettyPrint: true, wellFormed: true }
  )
);

wellFormed defaults to false to match common browser implementations. For example, try copy/pasting this in to your browser's console and you'll see there will be no errors:

new XMLSerializer().serializeToString(document.createTextNode("A control character (backspace): \b"));

leafac · 2020-03-27T19:14:16Z

Thanks for all your help in navigating this space. I’m trying to use xmlbuilder2, but I’m having the following issues:

Conversion from JavaScript object to XML string requires namespaces to be declared as part of the name of the root element, for example:

xmlbuilder2.convert(
  { "feed@http://www.w3.org/2005/Atom": "" },
  { wellFormed: true }
);
// <?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"/>

If I declare it as a regular attribute, xmlbuilder2 rejects it:

xmlbuilder2.convert(
  { feed: { "@xmlns": "http://www.w3.org/2005/Atom" } },
  { wellFormed: true }
);
// InvalidStateError: The object is in an invalid state.

But the conversion from XML string to JavaScript object produces an object like the latter, not the former:

xmlbuilder2.convert(
  `<?xml version="1.0"?><feed xmlns="http://www.w3.org/2005/Atom"/>`,
  { format: "object", wellFormed: true }
);
// { feed: { '@xmlns': 'http://www.w3.org/2005/Atom' } }

This means that a roundtrip from XML string to JavaScript object and back may fail. I guess I could work around this by writing code to detect an @xmlns attribute and merge it into the root element, but I feel like I’m going against the grain.

The types in TypeScript are weird to use. For example, the convert() function always returns an XMLSerializedValue, which may be a string or a XMLSerializedObject, among other things. This means that I have to cast the result of convert() every time: convert(___, { format: "object" }) as any. Would it be possible for the return type of convert() to be dependent on the format, so that, for example, the return type of convert(___, { format: "object" }) is a regular JavaScript object?

oozcitak · 2020-03-27T23:44:34Z

Thanks for taking the time to investigate. I opened these two issues in xmlbuilder2.

oozcitak · 2020-03-30T16:21:25Z

Same invalidCharReplacement option is now available in xmlbuilder2 with version 2.0.0.

willm mentioned this issue Dec 30, 2016

Strip invalid xml 1.0 chars aghassemi/tap-xunit#13

Merged

oozcitak closed this as completed in 1f9b41a Apr 30, 2017

oozcitak added the Bug label Apr 30, 2017

oozcitak mentioned this issue Apr 30, 2017

I've continuously got a problem with unicode #154

Closed

chefhoobajoob mentioned this issue May 6, 2017

Does not generate index.html when there are failing test cases and includeFailureMsg = true Hargne/jest-html-reporter#1

Closed

tregagnon mentioned this issue Jun 5, 2017

Use an ES5 compatible Regexp to assert legal characters #161

Merged

oozcitak added a commit that referenced this issue Jun 6, 2017

Merge pull request #161 from Dashlane/es5-compatible-unicode-regexp

7b96de9

Use an ES5 compatible Regexp to assert legal characters. Closes #147.

mad-gooze mentioned this issue Dec 4, 2018

upgrade xmlbuilder to latest version a11smiles/karma-html-detailed-reporter#22

Open

leafac mentioned this issue Mar 25, 2020

Error: Invalid character in string: at index 0 Leonidas-from-XIV/node-xml2js#547

Open

oozcitak reopened this Mar 25, 2020

oozcitak closed this as completed in 0431ea1 Mar 26, 2020

oozcitak mentioned this issue Mar 26, 2020

Add an option to replace invalid characters on input oozcitak/xmlbuilder2#12

Closed

leafac mentioned this issue Mar 27, 2020

Doesn’t handle invalid characters nashwaan/xml-js#139

Open

This was referenced Mar 27, 2020

Parsing JS object with default namespace declaration attribute throws error oozcitak/xmlbuilder2#13

Closed

Return type of convert requires casting when used in TypeScript oozcitak/xmlbuilder2#14

Closed

stoeffel mentioned this issue Sep 15, 2022

Junit reporter breaks with new error messages of elm-program-test rtfeldman/node-test-runner#615

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Character causes invalid character error #147

Character causes invalid character error #147

willm commented Dec 16, 2016

oozcitak commented Dec 21, 2016 •

edited

Loading

willm commented Dec 22, 2016

oozcitak commented Dec 23, 2016 •

edited

Loading

willm commented Dec 30, 2016

oozcitak commented Jan 2, 2017 •

edited

Loading

ebickle commented Mar 27, 2017 •

edited

Loading

ebickle commented Mar 27, 2017

pietersv commented Apr 7, 2017

TylerShin commented May 4, 2017

leafac commented Mar 25, 2020 •

edited

Loading

oozcitak commented Mar 26, 2020 •

edited

Loading

oozcitak commented Mar 26, 2020

leafac commented Mar 26, 2020

leafac commented Mar 26, 2020 •

edited

Loading

leafac commented Mar 26, 2020

leafac commented Mar 26, 2020

oozcitak commented Mar 26, 2020

leafac commented Mar 27, 2020 •

edited

Loading

oozcitak commented Mar 27, 2020

oozcitak commented Mar 30, 2020

Character causes invalid character error #147

Character causes invalid character error #147

Comments

willm commented Dec 16, 2016

oozcitak commented Dec 21, 2016 • edited Loading

willm commented Dec 22, 2016

oozcitak commented Dec 23, 2016 • edited Loading

willm commented Dec 30, 2016

oozcitak commented Jan 2, 2017 • edited Loading

ebickle commented Mar 27, 2017 • edited Loading

ebickle commented Mar 27, 2017

pietersv commented Apr 7, 2017

TylerShin commented May 4, 2017

leafac commented Mar 25, 2020 • edited Loading

oozcitak commented Mar 26, 2020 • edited Loading

oozcitak commented Mar 26, 2020

leafac commented Mar 26, 2020

leafac commented Mar 26, 2020 • edited Loading

leafac commented Mar 26, 2020

leafac commented Mar 26, 2020

oozcitak commented Mar 26, 2020

leafac commented Mar 27, 2020 • edited Loading

oozcitak commented Mar 27, 2020

oozcitak commented Mar 30, 2020

oozcitak commented Dec 21, 2016 •

edited

Loading

oozcitak commented Dec 23, 2016 •

edited

Loading

oozcitak commented Jan 2, 2017 •

edited

Loading

ebickle commented Mar 27, 2017 •

edited

Loading

leafac commented Mar 25, 2020 •

edited

Loading

oozcitak commented Mar 26, 2020 •

edited

Loading

leafac commented Mar 26, 2020 •

edited

Loading

leafac commented Mar 27, 2020 •

edited

Loading