New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encoding/xml: add generic representation of XML data #13504

Open
fluhus opened this Issue Dec 5, 2015 · 24 comments

Comments

Projects
None yet
8 participants
@fluhus

fluhus commented Dec 5, 2015

It would be helpful to have a more 'natural' data-object for XML data, so that all the information is preserved. Something like the DOM nodes in javascript.

Here is an implementation of what I would expect an XML parser to return:
https://github.com/fluhus/gostuff/tree/master/xmlnode

What do you think? Can we add such a feature to the standard XML parser?

@fluhus

This comment has been minimized.

fluhus commented Dec 6, 2015

Just found something similar here:
https://godoc.org/golang.org/x/net/html#Node

Would be great to have something of that kind for XMLs as well.

@rsc rsc changed the title from proposal: encoding/xml: DOM-node representation of XML data to proposal: encoding/xml: node representation of XML data Dec 28, 2015

@rsc

This comment has been minimized.

Contributor

rsc commented Dec 28, 2015

I agree: package xml could use something like package json's ability to unmarshal into interface{}. What that is I'm not sure. I think a reasonable starting point would be something minimal like:

// An Element represents the complete parse of a single XML element.
type Element struct {
    StartElement
    Child []Child
}

// A Child is an interface holding one of the element child types:
// *Element, CharData, or Comment.
type Child interface{}

I'm not sure it makes sense to add all kinds of navigation help on top of this. The more specialized it is, the more likely it is to not work for a significant number of users.

Edit: fixed s/*Node/*Element/ in the comment on Child.

@rsc rsc added the Proposal label Dec 28, 2015

@rsc rsc added this to the Proposal milestone Dec 28, 2015

@fluhus

This comment has been minimized.

fluhus commented Dec 29, 2015

I agree about simplification.

Maybe node types can follow what the tokenizer currently returns.

You can see my implementation here (that's what I use for my work). It offers the information that the tokenizer can handle plus children and parent, with only a single exported type (Node). That's the simplest interface I could come up with for representing the different types of data.

Would be happy to hear your opinion.

@rsc

This comment has been minimized.

Contributor

rsc commented Dec 29, 2015

I saw your implementation, and I think it's too complex. There's no need for an interface. What I wrote above is significantly simpler and it does follow what the tokenizer currently returns.

@fluhus

This comment has been minimized.

fluhus commented Dec 30, 2015

Thanks for your comment!

I trust your solution so I won't argue too much, but can you explain why your approach is simpler?

As far as I understand it exposes more types (involves users with token types while my representation encapsulates them), and requires type-checks in code that uses it. How is that making things easier for users?

To demonstrate, here is what traversal looks like using my library:

// Only one type involved - Node - encapsulates the rest of the XML package.
func traverse(n Node) {
  doTextSearch(n.Text())
  for _, child := n.Children() {
    traverse(child)
  }
}

And with your solution I guess it would look something like (correct me if I am wrong):

func traverse(c Child) {
  // Need to understand the different types and their structure.
  switch c := c.(type) {
  case CharData:
    doTextSearch([]byte(c))
  case *Element:
    for _, child := range c.Child {
      traverse(child)
    }
  }
}

Can you specify use cases that would be simpler to solve using your approach?

My library is always available so I won't argue :) but I am happy to learn from your insights.

@rsc

This comment has been minimized.

Contributor

rsc commented Jan 5, 2016

@fluhus

This comment has been minimized.

fluhus commented Jan 14, 2016

Understood. Thanks for sharing your thoughts!

@cannona

This comment has been minimized.

cannona commented Jan 29, 2016

If we truly want a way to losslessly represent XML data, don't we need a way to represent empty-element tags? I.E. we need to have a way to distinguish between and . Or perhaps that level of losslessness isn't a requirement for what you are proposing.

@cannona

This comment has been minimized.

cannona commented Jan 29, 2016

Oops. Let's try that again.

I.E. we need to have a way to distinguish
between <tag></tag> and <tag/>.

@rsc

This comment has been minimized.

Contributor

rsc commented Jan 29, 2016

I don't think we're talking about lossless. If you needed lossless you'd also have to keep track of how every literal character was written: A vs A and so on.

@cannona

This comment has been minimized.

cannona commented Feb 1, 2016

Makes sense. I just assumed that when @fluhus said "It would be helpful to have a more 'natural' data-object for XML data, so that all the information is preserved," he really meant all

@adg

This comment has been minimized.

Contributor

adg commented Aug 15, 2016

Is anyone interested in moving this forward? If so, please write a proposal doc.

@adg

This comment has been minimized.

Contributor

adg commented Sep 26, 2016

Ping?

@SamWhited

This comment has been minimized.

Member

SamWhited commented Sep 27, 2016

I will begin writing a design doc since I use the XML package heavily for XMPP which I think could benefit from a more tree-like API.

EDIT: Removed the "if no one else is interested" disclaimer; I will start writing a design doc either way, if someone else already has something or wants to handle it themselves, feel free to ping me.

EDIT 2: Quickly threw together a draft: https://go-review.googlesource.com/c/30364/ happy to continue working on it or turn over work to one of the original issue participants.

It currently does exactly what was proposed by rsc above. Current usage would look something like this:

// A simple XMPP message; don't worry about the syntax, what's important is that it's XML.
const msg = `<message type="chat" to="notviola@chat.shakespeare.lit" from="feste@shakespeare.lit">
<body>Foolery, sir, does walk about the orb, like the sun; it shines everywhere.</body>
<thread>0297358d-df91-4741-9435-c3783ec456ba</thread>
</message>`

d := xml.NewDecoder(strings.NewReader(msg))

tok, _ := d.Token()

// Decode full element:
// el, err := d.Element(tok.(StartElement))

// Decode partial element (we only care about the body)
el := xml.Element{StartElement: tok.(StartElement)}
for ; err == nil; tok, err = d.Token() {
    if start, ok := tok.(StartElement); ok && start.Name.Local == "body" {
        child := xml.Child{}
        _ = xml.DecodeElement(&child, start)
        el.Child = append(el.Child, child)
        return
    }
}

The only real problem I see with this API (which may not be a problem at all) is that it doesn't really simplify things over working with the raw token stream all that much when we only want to partially decode an element, although this may be outside of the scope of this proposal and not something we care to solve right now (if at all).

@gopherbot

This comment has been minimized.

gopherbot commented Sep 27, 2016

CL https://golang.org/cl/29861 mentions this issue.

@SamWhited

This comment has been minimized.

Member

SamWhited commented Oct 5, 2016

Not sure how, but I appear to have overwritten the proposal with a completely different one and didn't notice. Correct CL is now submitted.

@gopherbot

This comment has been minimized.

gopherbot commented Oct 5, 2016

CL https://golang.org/cl/30364 mentions this issue.

gopherbot pushed a commit to golang/proposal that referenced this issue Mar 7, 2017

proposal: add natural XML proposal
See golang/go#13504

Change-Id: Ie9877b10ae3eed8ad5e5763d35e48d94c6f8f584
Reviewed-on: https://go-review.googlesource.com/30364
Reviewed-by: Russ Cox <rsc@golang.org>
@rsc

This comment has been minimized.

Contributor

rsc commented Mar 7, 2017

Submitted the proposal CL by @SamWhited, now at https://golang.org/design/13504-natural-xml. I think it seems like a reasonable start (unsurprisingly). I'd also like to make sure we can marshal those back; I assume that's straightforward.

@SamWhited, if you're still interested and want to sketch an implementation, that seems like a reasonable next step. Now that the proposal is viewable online I'll wait to see if there are more comments here. My guess is that we're targeting Go 1.10 for this but it's fine to have CLs out for review during this cycle. If it's super-easy and uncontroversial we could think about Go 1.9. Thanks.

@rsc

This comment has been minimized.

Contributor

rsc commented Mar 7, 2017

Also, apologies for the long delay here. Catching up with some proposals that fell to the bottom of the "I need to read and think about this" stack.

@SamWhited

This comment has been minimized.

Member

SamWhited commented Mar 7, 2017

Oh wow, I'd forgotten about this one again; thanks Russ. I'll see if I can't dig up some old code or knock together a CL in the next few weeks.

@gopherbot

This comment has been minimized.

gopherbot commented Mar 9, 2017

CL https://golang.org/cl/37945 mentions this issue.

@SamWhited

This comment has been minimized.

Member

SamWhited commented Mar 10, 2017

Done ⤴ I've also submitted a related, but ultimately orthogonal proposal to futher improve the experience while using the XML library:

XML stream tracking issue: https://golang.org/issue/19480 (#19480)
XML stream design doc: https://golang.org/design/19480-xml-stream

@rsc

This comment has been minimized.

Contributor

rsc commented Apr 10, 2017

I'm sufficiently happy with where the CL is headed, and there's been no objection here, that I think we can accept this proposal.

@rsc rsc modified the milestones: Go1.9, Proposal Apr 10, 2017

@rsc rsc changed the title from proposal: encoding/xml: node representation of XML data to encoding/xml: add generic representation of XML data Apr 10, 2017

@bradfitz

This comment has been minimized.

Member

bradfitz commented May 24, 2017

@rsc, can you review the open CL? https://go-review.googlesource.com/c/37945/

@bradfitz bradfitz modified the milestones: Go1.10, Go1.9 Jul 24, 2017

@rsc rsc modified the milestones: Go1.10, Gccgo, Go1.11 Dec 4, 2017

@bradfitz bradfitz modified the milestones: Go1.11, Go1.12 Jun 13, 2018

@ianlancetaylor ianlancetaylor modified the milestones: Go1.12, Go1.13 Dec 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment