New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: add whitespace normalization from spec #20614

Open
santhosh-tekuri opened this Issue Jun 8, 2017 · 12 comments

Comments

Projects
None yet
7 participants
@santhosh-tekuri
Contributor

santhosh-tekuri commented Jun 8, 2017

What version of Go are you using (go version)?

go version go1.8.3 darwin/amd64

What operating system and processor architecture are you using (go env)?

darwin/amd64

What did you do?

for attribute values, all whitespace characters (#x20, #xD, #xA, #x9), should be normalized to a space character (see https://www.w3.org/TR/REC-xml/#AVNormalize)

go program used to test: (https://play.golang.org/p/mCRrxvyh25)

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

func main() {
	decoder := xml.NewDecoder(strings.NewReader(`<a p1="v1

v1"/>`))
	t, _ := decoder.Token()
	if e, ok := t.(xml.StartElement); ok {
		fmt.Printf("%q", e.Attr[0].Value)
	}
}

What did you expect to see?

"v1  v1"

What did you see instead?

"v1\n\nv1"

Tested the same with xmllint program:

$ cat test.xml
<x p1="v1

v2"/>
$ xmllint test.xml
<?xml version="1.0"?>
<x p1="v1  v2"/>

@bradfitz bradfitz added this to the Go1.10 milestone Jun 8, 2017

@bradfitz

This comment has been minimized.

Member

bradfitz commented Jun 8, 2017

/cc @rsc for XML.

@rsc

This comment has been minimized.

Contributor

rsc commented Jun 14, 2017

The spec seems very clear that this is required, and we're not doing it. Seems OK to do. Note the oddity that

<x a="
v">
<x a="&#x0Av">

are not equivalent. The literal newline turns into a space but the escaped newline remains a newline.

It also looks like we are not handling end-of-line \r correctly (https://www.w3.org/TR/REC-xml/#sec-line-ends).

CLs welcome for Go 1.10 (but please keep it simple).

@rsc rsc added NeedsFix and removed NeedsDecision labels Jun 14, 2017

@Kinghack

This comment has been minimized.

Contributor

Kinghack commented Jun 15, 2017

could I try this?

@bradfitz

This comment has been minimized.

Member

bradfitz commented Jun 15, 2017

Go for it.

@rsc rsc changed the title from encoding/xml: whitespace characters in attribute value should be normalized to a space character to encoding/xml: implement spec rules for whitespace normalization Jun 15, 2017

@rsc rsc changed the title from encoding/xml: implement spec rules for whitespace normalization to encoding/xml: add whitespace normalization from spec Jun 15, 2017

@Kinghack

This comment has been minimized.

Contributor

Kinghack commented Jun 21, 2017

I am trying to verify my diff.
I found that with xmllint, they are not dealing with leading and trailing space (#x20).
Example is

<x p1=" v1

\nv2

" p2=" v2 v1 "/>

The result with
xmllint test2.xml
is
<?xml version="1.0"?> <x p1=" v1 \nv2 " p2=" v2 v1 "/>

Do I misunderstand the spec?

@rsc

This comment has been minimized.

Contributor

rsc commented Jun 21, 2017

The spec says:

If the attribute type is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character.

But for our purposes the attribute type is always CDATA, since we don't parse the related entity declarations. So that sentence does not apply.

@gopherbot

This comment has been minimized.

gopherbot commented Jun 22, 2017

CL https://golang.org/cl/46433 mentions this issue.

@emluque

This comment has been minimized.

emluque commented Jul 20, 2017

Hi I've been looking at this issue and the patch that has been submitted ( https://go-review.googlesource.com/c/46433/ ) and there is a problem. The patch is modifying the text function of xml.go, this function is called both for character data nodes and for attribute values. The problem is that the normalization for character data and attribute values is different and the patch is applying the attribute value normalization to character data.

This is an error, since for character data the spec states (https://www.w3.org/TR/REC-xml/#sec-white-space):

"In editing XML documents, it is often convenient to use "white space" (spaces, tabs, and blank lines) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, "significant" white space that should be preserved in the delivered version is common, for example in poetry and source code.

An XML processor MUST always pass all characters in a document that are not markup through to the application..."

The only normalization that must be done on Character Data nodes is the end of line handling:

https://www.w3.org/TR/REC-xml/#sec-line-ends

On the other hand for attribute values both the end of line handling and the attribute value normalization (https://www.w3.org/TR/REC-xml/#AVNormalize) must be done.

But applying the attribute values normalization to text nodes is incorrect and will create all kinds of problems on client applications. This is evident by the changes that the patch does to tests, for example the atom feed test here: https://go-review.googlesource.com/c/46433/4/src/encoding/xml/read_test.go#b119

Anyhow, I have read the spec and the code extensively, I think I know how to fix both issues (end of line on both character data and attribute values and attribute value normalization on attribute values) simply and fast, so if this is still open I would love to give it a try.

@rsc rsc modified the milestones: Go1.10, Go1.11 Nov 22, 2017

@santhosh-tekuri

This comment has been minimized.

Contributor

santhosh-tekuri commented Feb 5, 2018

can I try this ?

@gopherbot

This comment has been minimized.

gopherbot commented Apr 4, 2018

Change https://golang.org/cl/104655 mentions this issue: encoding/xml : fix normalization of attributes values

@iWdGo

This comment has been minimized.

Contributor

iWdGo commented Apr 4, 2018

The attribute value was read as text. The existing attribute reader logic is fixed as an attribute may have a namespace or only a prefix. Other possibilities have been removed.

To keep the behavior of raw token which allows many faults in attributes list, error handling is heavily using the Strict parameter of the decoder. Specific tests have been added including list of attributes.

To keep the behavior of unmarshal, escaped characters handling has been added but it is not symmetrical to Marshal for quotes but follows XML specification. Testing has been extended.

@gopherbot

This comment has been minimized.

gopherbot commented Apr 27, 2018

Change https://golang.org/cl/109855 mentions this issue: encoding/xml : Fixes to enforce XML namespace standard

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 13, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment