encoding/xml: add whitespace normalization from spec #20614
The spec seems very clear that this is required, and we're not doing it. Seems OK to do. Note the oddity that
are not equivalent. The literal newline turns into a space but the escaped newline remains a newline.
It also looks like we are not handling end-of-line \r correctly (https://www.w3.org/TR/REC-xml/#sec-line-ends).
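As a rough illustration of the end-of-line rule in that section of the spec, here is a minimal sketch. `normalizeLineEnds` is a hypothetical helper, not something that exists in encoding/xml, and it assumes the whole input is already available as a string. The rule only covers literal characters in the input; a carriage return written as a character reference such as `&#13;` is resolved later and is not affected.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeLineEnds is a hypothetical helper (not part of encoding/xml).
// It applies the end-of-line handling described in the XML spec:
// every "\r\n" pair and every lone "\r" becomes a single "\n".
func normalizeLineEnds(s string) string {
	// Replace the two-character sequence first so it yields one "\n", not two.
	s = strings.ReplaceAll(s, "\r\n", "\n")
	return strings.ReplaceAll(s, "\r", "\n")
}

func main() {
	fmt.Printf("%q\n", normalizeLineEnds("a\r\nb\rc\n")) // prints "a\nb\nc\n"
}
```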
CLs welcome for Go 1.10 (but please keep it simple).
changed the title
encoding/xml: whitespace characters in attribute value should be normalized to a space character
Jun 15, 2017
I am trying to verify my diff.
The result with
Do I misunderstand the spec?
The spec says:
But for our purposes the attribute type is always CDATA, since we don't parse the related entity declarations. So that sentence does not apply.
Hi, I've been looking at this issue and the patch that has been submitted (https://go-review.googlesource.com/c/46433/), and there is a problem. The patch modifies the text function in xml.go, which is called both for character data nodes and for attribute values. The normalization rules for character data and for attribute values are different, and the patch applies the attribute value normalization to character data.
This is an error, since for character data the spec states (https://www.w3.org/TR/REC-xml/#sec-white-space):
"In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.
An XML processor MUST always pass all characters in a document that are not markup through to the application..."
The only normalization that must be done on Character Data nodes is the end of line handling:
On the other hand for attribute values both the end of line handling and the attribute value normalization (https://www.w3.org/TR/REC-xml/#AVNormalize) must be done.
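To make the difference concrete, here is a minimal sketch of attribute value normalization for CDATA-typed attributes, as I read the spec. `normalizeAttr` and `resolveRef` are hypothetical names, not functions in encoding/xml, and the sketch assumes only numeric character references and the five predefined entities need resolving (no DTD-declared entities). Literal whitespace becomes a space, while a character produced by a reference is appended unchanged, which is why a literal newline and `&#10;` give different results.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// The five predefined entities; DTD-declared entities are out of scope here.
var predefined = map[string]string{"lt": "<", "gt": ">", "amp": "&", "apos": "'", "quot": "\""}

// resolveRef (hypothetical) resolves the name between '&' and ';'.
func resolveRef(name string) (string, bool) {
	if s, ok := predefined[name]; ok {
		return s, true
	}
	if !strings.HasPrefix(name, "#") {
		return "", false
	}
	digits, base := name[1:], 10
	if strings.HasPrefix(digits, "x") {
		digits, base = digits[1:], 16
	}
	n, err := strconv.ParseUint(digits, base, 32)
	if err != nil {
		return "", false
	}
	return string(rune(n)), true
}

// normalizeAttr (hypothetical) normalizes a raw CDATA attribute value.
func normalizeAttr(raw string) string {
	// End-of-line handling comes first: "\r\n" and lone "\r" become "\n".
	raw = strings.ReplaceAll(raw, "\r\n", "\n")
	raw = strings.ReplaceAll(raw, "\r", "\n")

	var b strings.Builder
	for i := 0; i < len(raw); {
		c := raw[i]
		if c == '&' {
			if end := strings.IndexByte(raw[i:], ';'); end > 0 {
				if s, ok := resolveRef(raw[i+1 : i+end]); ok {
					b.WriteString(s) // referenced character is appended as-is
					i += end + 1
					continue
				}
			}
			b.WriteByte(c) // unrecognized reference: kept verbatim in this sketch
		} else if c == ' ' || c == '\t' || c == '\n' || c == '\r' {
			b.WriteByte(' ') // literal whitespace becomes a single space
		} else {
			b.WriteByte(c)
		}
		i++
	}
	return b.String()
}

func main() {
	fmt.Printf("%q\n", normalizeAttr("a\nb"))    // prints "a b"
	fmt.Printf("%q\n", normalizeAttr("a&#10;b")) // prints "a\nb"
}
```

Character data would go through only the first step (end-of-line handling), never the whitespace-to-space replacement.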
But applying the attribute value normalization to text nodes is incorrect and will cause all kinds of problems in client applications. This is evident from the changes the patch makes to the tests, for example the atom feed test here: https://go-review.googlesource.com/c/46433/4/src/encoding/xml/read_test.go#b119
Anyhow, I have read the spec and the code extensively, and I think I know how to fix both issues (end-of-line handling for both character data and attribute values, and attribute value normalization for attribute values) simply and quickly. If this is still open, I would love to give it a try.
referenced this issue
Mar 16, 2018
The attribute value was read as text. The existing attribute reader logic is fixed as an attribute may have a namespace or only a prefix. Other possibilities have been removed.
To preserve the behavior of RawToken, which tolerates many faults in the attribute list, error handling relies heavily on the Strict field of the decoder. Specific tests have been added, including tests for lists of attributes.
To preserve the behavior of Unmarshal, handling of escaped characters has been added. For quotes it is not symmetrical with Marshal, but it follows the XML specification. Testing has been extended.