Pedantic white space preservation not supported. #242

jodyp12 · 2014-12-11T21:35:47Z

In the following xml the whitespace in text, which happens to be a space character, is stripped.
<tspan font-weight="bold"> </tspan>

This happens when XMLNode::ParseDeep() calls XMLDocument::Identify(), which in turn calls XMLUtil::SkipWhiteSpace().

If a character comes anywhere after the whitespace Identify() correctly creates a text element and backs up to the 1st character, correctly keeping the space character along with the following character.
<tspan font-weight="bold"> a</tspan>

The result is that whitespace is not fully preserved in text - which doesn't match the documentation. This example isn't just an exercise, it's an actual shipstopper when reading legacy files in a well known application that has been migrated to tinyxml2.
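To make the control flow described above concrete, here is a minimal, self-contained sketch. It is a simplified model and not tinyxml2's actual source: the function `identifyText` and its `skipFirst` flag are hypothetical names, and `content` stands for the characters between the start tag and the next `<`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical model (not tinyxml2 code): returns the text node a parser
// would create for `content`, or "" when no text node is created.
// `skipFirst` mimics skipping leading whitespace before deciding what the
// next node is.
std::string identifyText(const std::string& content, bool skipFirst) {
    std::size_t pos = 0;
    if (skipFirst) {
        while (pos < content.size() &&
               std::isspace(static_cast<unsigned char>(content[pos]))) {
            ++pos;
        }
    }
    // Nothing is left before the closing tag, so no text node is created
    // and whitespace-only content is lost.
    if (pos == content.size()) {
        return "";
    }
    // A character follows: back up to the first character so the leading
    // whitespace stays part of the text.
    return content;
}
```

In this model `identifyText(" ", true)` yields an empty string while `identifyText(" a", true)` yields `" a"`, which mirrors the two `<tspan>` cases above. Passing `false` for the assumed `skipFirst` flag keeps the lone space.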


leethomason · 2014-12-11T23:35:07Z

I'm sorry this isn't what you want. The behavior is documented in the README, although the specific case you have isn't clear. (The rule also is applied within an element, and the formatting example could be clearer that it includes both end of line normalization and whitespace.)

The best approach for TinyXML-2 has been discussed before, and this is a case where TinyXML-2 is intentionally choosing the generally more useful yet non-compliant behavior.

If you want to submit a pull request for a new behavior (PEDANTIC_WHITESPACE maybe?) it would be a worthwhile integration if it doesn't add too much code complexity.

peterbiglr · 2015-03-09T19:24:00Z

The following example shows that whitespace-only text is not preserved:

#include "tinyxml2.h"

using namespace tinyxml2;

int main( int argc, const char ** argv )
{
    // leading and trailing whitespace is preserved
    static const char* test1 = "<element>  leading and trailing whitespace   </element>";
    XMLDocument doc;
    doc.Parse( test1 );
    doc.Print();

    // whitespace only is not preserved !!
    static const char* test2 = "<element2>      </element2>";
    XMLDocument doc2;
    doc2.Parse( test2 );
    doc2.Print();

    return 0;
}

Gives output:

<element>  leading and trailing whitespace   </element>
<element2/>

leethomason · 2015-03-15T23:28:35Z

Leaving open in case someone wants to submit a patch for this. TinyXML2 is working as intended; it would need a new whitespace mode to fix.

petko · 2015-04-01T10:14:00Z

I agree that a new whitespace preservation option is needed, because currently legitimate HTML like this fails to be parsed as expected. This:
formatted text with <a href=\"\">link</a>

is printed as:
formatted text with<a href=\"\">link</a>
which is loss of meaningful information.

I am trying to patch it myself, but so far I can't manage to do it, because to work properly such a PEDANTIC_WHITESPACE option requires context knowledge of the surrounding nodes (whitespace should be interpreted as text only if it is inside the <body> tag, not in the <head>).

TPS · 2015-12-27T23:40:49Z

At minimum, tinyxml2 should support xml:space="preserve", as mentioned in JayXon/Leanify#3.
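For context, xml:space is scoped in the XML 1.0 specification: a declaration applies to the element and everything inside it unless an inner element overrides it. A minimal sketch of that lookup follows. The function name `preserveWhitespace` and the ancestor-list representation are assumptions made for illustration and are not part of tinyxml2.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not a tinyxml2 API): each entry is the xml:space
// value declared on one ancestor element, outermost first. An empty string
// means the attribute is absent on that element.
bool preserveWhitespace(const std::vector<std::string>& ancestors) {
    // The nearest declaration wins, so scan from the innermost element out.
    for (auto it = ancestors.rbegin(); it != ancestors.rend(); ++it) {
        if (!it->empty()) {
            return *it == "preserve";
        }
    }
    // No declaration on any ancestor: fall back to default handling.
    return false;
}
```

An inner xml:space="default" overrides an outer "preserve", so the decision cannot be made from the text node alone. This is the same kind of context dependency that petko describes above.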

TPS · 2016-01-10T14:29:22Z

@leethomason @jodyp12 @peterbiglr @petko zeux/pugixml#74 shows how https://github.com/zeux/pugixml has a mode that might be helpful to y'all, though it's not precisely xml:space="preserve" support.

TangataRereke · 2023-05-11T02:05:35Z

I've looked at this and created a few supporting unit tests. Latest pull request: #938

IMHO it is a problem just for some rare legacy systems such as ours. It is essential for some use cases, but only rare ones. As a result, rather than relying on the current whitespace options, I've created one called PRESERVERRAW_WHITESPACE. Here, white space means just the space character at present, which seems to be the only use case for legacy systems.

So <element> </element> becomes " " if PRESERVERRAW_WHITESPACE is used. Otherwise, it will be "".

"<element>
</element>" is obviously still "".

I didn't worry about <element> \r\n</element> because I haven't seen a need for this. My guess is it'll still show as space, but can't imagine a legacy system that is lazy enough to not put quotes would bother to put a CrLf.

leethomason changed the title ~~WhiteSpace preservation broken in some cases~~ Pedantic white space preservation not supported. Mar 15, 2015

TPS mentioned this issue Dec 27, 2015

some spaces in docx is removed JayXon/Leanify#3

Closed

TPS mentioned this issue Dec 30, 2015

Should explicitly support xml:space="preserve" zeux/pugixml#74

Closed

gdbentley mentioned this issue Dec 2, 2022

Added pedantic whitespace preservation. #928

Closed

kcsaul mentioned this issue May 28, 2023

Pedantic Whitespace Mode #941

Merged

leethomason closed this as completed in #941 Nov 21, 2023
