See golang/go#13504 Change-Id: Ie9877b10ae3eed8ad5e5763d35e48d94c6f8f584 Reviewed-on: https://go-review.googlesource.com/30364 Reviewed-by: Russ Cox <rsc@golang.org>
# Proposal: Natural XML

Author(s): Sam Whited <sam@samwhited.com>

Last updated: 2016-09-27

Discussion at https://golang.org/issue/13504.


## Abstract

The `encoding/xml` API is arguably difficult to work with when the structure of
a document is not known in advance.
To address this, a more natural API is needed that acts on nodes in a tree-like
structure instead of directly on the token stream.


## Background

XML parsers generally operate in one of two modes: a "DOM style" mode, in which
entire documents are parsed into a tree-like data structure (the "Document
Object Model", or DOM), and an event-driven "SAX style" mode (Simple API for
XML), in which tokens are streamed one at a time and only handled if they
trigger a callback or event.
The benefit of the DOM style mode is that all information contained in the XML
can be accessed at will, whereas in the SAX style mode only information at the
current parse location is readily available and other arrangements have to be
made to store previously seen information.
However, the SAX style mode generally provides a relatively small and stable
memory footprint, while the DOM style mode requires parsers to load an entire
document into memory.

Go currently supports a hybrid approach: entire documents or elements may be
read into native data structures, or individual tokens may be read off the wire
and handled directly by the application.
This works well for simple elements where the entire structure is known, but
XML with an arbitrary format forces direct use of the low-level token stream
APIs, which is error-prone and cumbersome.


## Proposal

A higher-level tree-like API will allow users to manipulate arbitrary XML in a
more natural way that is compatible with Go's hybrid SAX and DOM style approach
to parsing XML.


### Implementation

The following API, originally [suggested][167632824] by Russ Cox (rsc), is
proposed:

[167632824]: https://github.com/golang/go/issues/13504#issuecomment-167632824


```go
// An Element represents the complete parse of a single XML element.
type Element struct {
	StartElement
	Child []Child
}

// A Child is an interface holding one of the element child types:
// *Element, CharData, or Comment.
type Child interface{}
```

The `*Element` type will implement `xml.Marshaler` and `xml.Unmarshaler`,
making it compatible with the existing `(*xml.Encoder).Encode` and
`(*xml.Decoder).Decode` methods for situations where entire XML elements should
be consumed.
Because an `Element` can also be built up from individual tokens, it is usable
with both styles of XML parsing in Go.
For example, an entire element could be unmarshaled simply:

```go
// d is an existing *xml.Decoder.
el := xml.Element{}
err := d.Decode(&el)
```

Or specific children could be unmarshaled:

```go
// d is an existing *xml.Decoder positioned at the parent's start element.
tok, err := d.Token()
if err != nil {
	return err
}
el := xml.Element{StartElement: tok.(xml.StartElement)}

// Only unmarshal the children named "body".
for tok, err = d.Token(); err == nil; tok, err = d.Token() {
	if start, ok := tok.(xml.StartElement); ok && start.Name.Local == "body" {
		child := &xml.Element{}
		if err = d.DecodeElement(child, &start); err != nil {
			break
		}
		el.Child = append(el.Child, child)
	}
}
```

The author volunteers to complete this work in the next release cycle that has
enough time remaining after this proposal is accepted, and conservatively
estimates that a week of work would be required to complete the changes,
including tests.
The changes themselves are relatively easy; the lengthy estimate is mostly
because the author's time is limited to evenings and weekends.
If someone whose job permitted them to work on Go were to accept the task, the
work could almost certainly be completed much more quickly.


## Rationale

For large XML documents or streams that cannot be parsed all at once, the given
approach does not make parsing any less complicated, since we still have to
iterate over the token stream.
It may be possible to fix this by adding new methods to the `*xml.Encoder` and
`*xml.Decoder` types specifically for dealing with elements, but the author
deems that the benefit is not worth the added complexity to the XML package.
The current solution is simple and does not preclude adding a more robust
Element-based API at a later date.


## Compatibility

This proposal does not introduce any changes that would break compatibility
with existing code.
It adds two types (`Element` and `Child`), which would need to be covered under
the compatibility promise in the future.


## Open issues (if applicable)

* For elements with large numbers of children, accessing a specific child via
  a slice may be slow.
  Using a map would be a simple fix, but this makes access slower for elements
  with few children (the crossover is somewhere around 10 elements in a very
  informal benchmark).
  Using a trie or some other appropriate tree-like structure could give us the
  best of both worlds, but adds a great deal of complexity that is almost
  certainly not worth it.
  It may, however, be worth not making the children slice public (and using
  accessor methods instead) so that the implementation could easily be switched
  out at a later date.