Pedantic white space preservation not supported. #242

Closed
jodyp12 opened this issue Dec 11, 2014 · 7 comments · Fixed by #941

jodyp12 commented Dec 11, 2014

In the following XML, the element's text content, which is a single space character, is stripped:
<tspan font-weight="bold"> </tspan>

This happens when XMLNode::ParseDeep() calls XMLDocument::Identify(), which in turn calls XMLUtil::SkipWhiteSpace().

If any non-whitespace character follows the whitespace, Identify() creates a text node and backs up to the first character, so the leading space is kept along with the character that follows it:
<tspan font-weight="bold"> a</tspan>

The result is that whitespace is not fully preserved in text - which doesn't match the documentation. This example isn't just an exercise, it's an actual shipstopper when reading legacy files in a well known application that has been migrated to tinyxml2.
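
To make the call chain described above concrete, here is a minimal, self-contained sketch of the logic. It is not tinyxml2's actual code: `identifyText` is a hypothetical stand-in for what Identify() does with element content, written only to illustrate why whitespace-only text disappears while whitespace followed by text survives.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical stand-in for the Identify() step; NOT tinyxml2's real code.
// Returns the text node that would be created at the start of `content`,
// or an empty string when no text node is created.
std::string identifyText(const std::string& content) {
    std::size_t i = 0;
    // Step 1: skip leading whitespace (the role SkipWhiteSpace() plays).
    while (i < content.size() &&
           std::isspace(static_cast<unsigned char>(content[i]))) {
        ++i;
    }
    // Step 2: if markup or end of input comes next, no text node is created,
    // so the whitespace that was skipped is lost.
    if (i == content.size() || content[i] == '<') {
        return "";
    }
    // Step 3: otherwise back up to the first character, so the leading
    // whitespace is kept together with the text that follows it.
    return content.substr(0, content.find('<'));
}

// Small self-check that doubles as a usage example.
void selfCheck() {
    assert(identifyText(" </tspan>").empty());   // whitespace only: dropped
    assert(identifyText(" a</tspan>") == " a");  // whitespace + text: kept
}
```

Under this sketch, the only way to keep the single space would be to decide whether to create a text node before skipping whitespace, which is the step a new whitespace mode would have to change.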

@leethomason
Owner

I'm sorry this isn't what you want. The behavior is documented in the README, although the specific case you have isn't clear. (The rule also is applied within an element, and the formatting example could be clearer that it includes both end of line normalization and whitespace.)

The best approach for TinyXML-2 has been discussed before, and this is a case where TinyXML-2 is intentionally choosing the generally more useful yet non-compliant behavior.

If you want to submit a pull request for a new behavior (PEDANTIC_WHITESPACE maybe?) it would be a worthwhile integration if it doesn't add too much code complexity.

@peterbiglr

The following example shows that whitespace-only text is not preserved:

#include "tinyxml2.h"

using namespace tinyxml2;

int main( int argc, const char ** argv )
{
    // leading and trailing whitespace is preserved
    static const char* test1 = "<element>  leading and trailing whitespace   </element>";
    XMLDocument doc;
    doc.Parse( test1 );
    doc.Print();

    // whitespace only is not preserved !!
    static const char* test2 = "<element2>      </element2>";
    XMLDocument doc2;
    doc2.Parse( test2 );
    doc2.Print();

    return 0;
}

Gives output:

<element>  leading and trailing whitespace   </element>
<element2/>

@leethomason
Owner

Leaving open in case someone wants to submit a patch for this. TinyXML2 is working as intended; it would need a new whitespace mode to fix.

leethomason changed the title from "WhiteSpace preservation broken in some cases" to "Pedantic white space preservation not supported." on Mar 15, 2015

petko commented Apr 1, 2015

I agree that a new whitespace preservation option is needed, because legitimate HTML like this currently fails to parse as expected. This:
<p><span class=\"class1\">formatted text with</span> <a href=\"\">link</a></p>

is printed as:
<p><span class=\"class1\">formatted text with</span><a href=\"\">link</a></p>
which is a loss of meaningful information.

I am trying to patch it myself, but so far I haven't managed to, because to work properly such a PEDANTIC_WHITESPACE option requires knowledge of the surrounding nodes (whitespace should be interpreted as text only if it is inside the <body> tag, not in the <head>).

TPS commented Dec 27, 2015

At a minimum, it should support xml:space="preserve", as mentioned at JayXon/Leanify#3.
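
For readers unfamiliar with the attribute: in XML 1.0, xml:space takes the values "default" or "preserve", and the declared value applies to the element and everything inside it unless a nested element overrides it. The sketch below shows only that inheritance rule. `preservesWhitespace` is a hypothetical helper, not a tinyxml2 API, and it assumes the caller has already collected the attribute value for each element on the path from the root (an empty string meaning the attribute is absent).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, NOT part of tinyxml2. Resolves the effective
// xml:space mode for the innermost element, given the attribute value
// declared on each element from the root down to that element.
// An empty string means the attribute is absent on that element.
bool preservesWhitespace(const std::vector<std::string>& xmlSpaceFromRoot) {
    // Assumption: with no declaration at all, the application default
    // is to not preserve whitespace-only text.
    bool preserve = false;
    for (const std::string& value : xmlSpaceFromRoot) {
        if (value == "preserve") {
            preserve = true;    // applies to this element and its descendants
        } else if (value == "default") {
            preserve = false;   // hands control back to the application default
        }
        // Absent attribute: inherit from the parent. Any other value is an
        // error under the XML spec; this sketch simply ignores it.
    }
    return preserve;
}

// Small self-check that doubles as a usage example.
void selfCheck() {
    assert(preservesWhitespace({"", "preserve", ""}));     // inherited from parent
    assert(!preservesWhitespace({"preserve", "default"})); // overridden by child
    assert(!preservesWhitespace({"", ""}));                // never declared
}
```

A parser honoring this would keep a whitespace-only text node whenever the function returns true for the enclosing element, which covers the <tspan> case only if the document actually declares xml:space="preserve".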

TPS commented Jan 10, 2016

@leethomason @jodyp12 @peterbiglr @petko zeux/pugixml#74 shows how https://github.com/zeux/pugixml has a mode that might be helpful to y'all, though it's not precisely xml:space="preserve" support.

TangataRereke commented May 11, 2023

I've looked at this and created a few supporting unit tests. Latest pull request: #938

IMHO this is a problem only for rare legacy systems such as ours: essential for those, but an uncommon use case. So rather than relying on the current whitespace options, I've created a new one called PRESERVERRAW_WHITESPACE. At present, "white space" here means only the space character, which seems to be the only case legacy systems need.

So <element> </element> becomes " " if PRESERVERRAW_WHITESPACE is used. Otherwise, it will be "".

"<element>
</element>" is obviously still "". 

I didn't worry about <element> \r\n</element> because I haven't seen a need for it. My guess is that it will still show as a space, but I can't imagine that a legacy system lazy enough to omit quotes would bother to emit a CRLF.
