Skip to content

Commit

Permalink
proposal: add natural XML proposal
Browse files Browse the repository at this point in the history
See golang/go#13504

Change-Id: Ie9877b10ae3eed8ad5e5763d35e48d94c6f8f584
Reviewed-on: https://go-review.googlesource.com/30364
Reviewed-by: Russ Cox <rsc@golang.org>
  • Loading branch information
SamWhited authored and rsc committed Mar 7, 2017
1 parent d7b1d9e commit f2f2bb9
Showing 1 changed file with 134 additions and 0 deletions.
134 changes: 134 additions & 0 deletions design/13504-natural-xml.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
# Proposal: Natural XML

Author(s): Sam Whited <sam@samwhited.com>

Last updated: 2016-09-27

Discussion at https://golang.org/issue/13504.


## Abstract

The `encoding/xml` API is arguably difficult to work with.
In order to fix these issues, a more natural API is needed that acts on nodes in
a tree like structure instead of directly on the token stream.


## Background

XML parsers generally operate in one of two modes of operation, a "DOM style"
mode in which entire documents are parsed into a tree-like data structure, the
"Document Object Model" (DOM), and an event-driven "SAX style" mode (Simple API
for XML) in which tokens are streamed one at a time and only handled if they
would trigger a callback or event.
The benefit of a DOM style node is that all information contained in the XML is
rapidly accessible and can be accessed at will, whereas in a SAX style mode
only information at the current parse location is readily available and other
arrangements have to be made to store previously visible information.
However, the SAX style mode generally provides a relatively small and stable
memory footprint, while the DOM style mode requires parsers to load an entire
document into memory.

Go currently supports a hybrid approach to this situation: entire documents or
elements may be read into native data structures, or individual tokens may be
read off the wire and handled directly by the application.
This works well for simple elements where the entire structure is known, but for
XML with an arbitrary format it forces use of the low-level token stream APIs
directly which is error prone and cumbersome.


## Proposal

Having a higher level tree-like API will allow users to manipulate arbitrary XML
in a more natural way that is compatible with Go's hybrid SAX and DOM style
approach to parsing XML.


### Implementation

An interface originally [suggested][167632824] by RSC is proposed:

[167632824]: https://github.com/golang/go/issues/13504#issuecomment-167632824


```go
// An Element represents the complete parse of a single XML element.
type Element struct {
StartElement
Child []Child
}

// A Child is an interface holding one of the element child types:
// *Element, CharData, or Comment.
type Child interface{}
```

The `*Element` type will implement `xml.Marshaler` and `xml.Unmarshaler` to make
it compatible with the existing `(*xml.Encoder) Encode` and `(*xml.Decoder)
Decode` methods for situations where entire XML elements should be consumed.
This makes it compatible with both styles of XML parsing in Go.
For example, an entire element could be unmarshaled simply:

```go
el := xml.Element{}
err := d.Decode(&el)
```

Or specific children could be unmarshaled:

```go
tok, _ := d.Token()
el := xml.Element{StartElement: tok.(StartElement)}

// Only unmarshal the child named "body"
for ; err == nil; tok, err = d.Token() {
if start, ok := tok.(StartElement); ok && start.Name.Local == "body" {
child := xml.Child{}
_ = xml.DecodeElement(&child, start)
el.Child = append(el.Child, child)
}
}
```

The author volunteers to complete this work in the next release cycle with
enough time left after this proposal is accepted and conservatively estimates
that a week of work would be required to complete the changes, including tests.
The changes themselves are relatively easy and this lengthy estimate is mostly
because the authors time is limited to evenings and weekends.
If someone who's job permitted them to work on Go were to accept the task, the
work could almost certainly be completed much quicker.


## Rationale

For large XML documents or streams that cannot be parsed all at once, the given
approach does make parsing less complicated since we still have to iterate over
the token stream.
It may be possible to fix this by adding new methods to the `*xml.Encode` and
`*xml.Decode` types specifically for dealing with elements, but the author
deems that the benefit is not worth the added complexity to the XML package.
The current solution is simple and does not preclude adding a more robust
Element based API at a later date.


## Compatibility

This proposal does not introduce any changes that would break compatibility
with existing code.
It adds two types which would need to be covered under the compatibility
promise in the future.


## Open issues (if applicable)

* For elements with large numbers of children, accessing a specific child via
a slice may be slow.
Using a map would be a simple fix, but this makes accessing arrays with few
elements slower (the crossover is somewhere around 10 elements in a very
informal benchmark).
Using a trie or some other appropriate tree-like structure can give us the
best of both worlds, but adds a great deal of complexity that is almost
certainly not worth it.
It may, however, be worth not making the children slice public (and using
accessor methods instead) so that the implementation could easily be switched
out at a later date.

0 comments on commit f2f2bb9

Please sign in to comment.