encoding/xml: add decode from TokenReader, to enable stream transformers #19480
Other considerations: if you use this sort of API, people will want to write and use consumers. Since consumers would be guided by the Tokenizer API, what would they look like (and can we do better)? Right now I think they'd have to be functions like the following:

// Encode consumes a tokenizer and re-encodes its tokens.
// If an error is returned on the Decode or Encode side, it is returned
// immediately.
// Since Encode is defined as consuming the stream until the end, io.EOF is
// not returned.
// If no error would be returned, Encode flushes the underlying encoder when
// it is done.
func Encode(e *xml.Encoder, t Tokenizer) error

Example usage:

func ExampleEncode() {
	removequote := xmlstream.Remove(func(t xml.Token) bool {
		switch tok := t.(type) {
		case xml.StartElement:
			return tok.Name.Local == "quote"
		case xml.EndElement:
			return tok.Name.Local == "quote"
		}
		return false
	})
	e := xml.NewEncoder(os.Stdout)
	xmlstream.Encode(e, removequote(xml.NewDecoder(strings.NewReader(`
<quote>
<p>Foolery, sir, does walk about the orb, like the sun; it shines everywhere.</p>
</quote>`))))
	// Output:
	// <p>Foolery, sir, does walk about the orb, like the sun; it shines everywhere.</p>
}

I vaguely feel that this API could be improved upon, but I'll have to think about it.
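A minimal sketch of what the body of the Encode helper above might look like, assuming the Tokenizer interface from this proposal (the loop structure and error handling here are illustrative, not taken from the CL):

func Encode(e *xml.Encoder, t Tokenizer) error {
	for {
		tok, err := t.Token()
		if err == io.EOF {
			// End of the stream: flush anything the encoder has buffered.
			return e.Flush()
		}
		if err != nil {
			return err
		}
		if err := e.EncodeToken(tok); err != nil {
			return err
		}
	}
}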
More thoughts: I reread this proposal today and I don't think I made this clear, but one reason it might be good to have this in the encoding/xml package itself is that it would let us add a constructor like:

// NewTokenDecoder creates a new Decoder that unmarshals based on the token
// stream returned by t.
func NewTokenDecoder(d *Decoder, t Tokenizer) *Decoder

which means that something like the following would work:

// Unmarshals "foo" into a "bar" struct.
fooMap := xml.Map(func(t xml.Token) xml.Token {
	switch tok := t.(type) {
	case xml.StartElement:
		if tok.Name.Local == "foo" {
			tok.Name.Local = "bar"
			return tok
		}
	case xml.EndElement:
		if tok.Name.Local == "foo" {
			tok.Name.Local = "bar"
			return tok
		}
	}
	return t
})
const somexml = "<foo>Test</foo>"
d := xml.NewDecoder(strings.NewReader(somexml))
d = xml.NewTokenDecoder(d, fooMap(d))
s := struct {
	XMLName xml.Name `xml:"bar"`
	Text    string   `xml:",cdata"`
}{}
d.Decode(&s)
// s.Text == "Test"

Wrapping a Tokenizer in a concrete Decoder struct is trivial to implement, but if we ever wanted to do this, what's not exactly clear to me is what RawToken would become. Would it bypass the transformation entirely and continue to return exactly what it would have for the underlying tokenizer, or would it continue to skip namespace substitution but still apply whatever the transformation is (and if so, how does the transformation know what to do with it? Do we have to add RawToken to the tokenizer API too?)
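The Map helper used above is assumed rather than defined anywhere in this thread. A minimal sketch, assuming a Tokenizer interface with just a Token method (the names here are illustrative):

// mapper applies f to every token read from the underlying Tokenizer.
type mapper struct {
	t Tokenizer
	f func(xml.Token) xml.Token
}

func (m mapper) Token() (xml.Token, error) {
	tok, err := m.t.Token()
	if tok != nil {
		tok = m.f(tok)
	}
	return tok, err
}

// Map returns a transformer that applies f to each token in the stream.
func Map(f func(xml.Token) xml.Token) func(Tokenizer) Tokenizer {
	return func(t Tokenizer) Tokenizer {
		return mapper{t: t, f: f}
	}
}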
Can this be done outside the package? If not, what is fundamentally required in encoding/xml to make this possible to develop outside?
It depends on whether we want some sort of concrete-Decoder-that-wraps-a-tokenizer compatibility API like I mentioned in my previous comment (which I think would be very useful; I'd also love to be able to use methods like Unmarshal without reimplementing all the struct tag reading logic). If we do want that, we would have to do this in the standard library. Alternatively, I've been thinking for a while now that it might make sense to think about deprecating encoding/xml.
I'm not interested in deprecating encoding/xml, but I'm also not interested in having it balloon into bigger things. Let's keep doing what we're doing, but let's leave transforms to third-party packages.
Alrighty; it would be nice to have some way to use these in existing Decoders, though (so that if you want to Unmarshal you don't have to reimplement all that logic for a transformer).
Adding NewTokenDecoder seems OK, provided it's not much code.
That makes sense to me (and is minimal). I've got a branch ready on my other machine that has an implementation of NewTokenDecoder; I'll submit a CL later today.
What behavior RawToken should have in this case is still unclear to me; otherwise, I've pushed up a CL that demonstrates the change (I will write more tests as soon as the behavior is settled). EDIT: Gobot's slacking off today: https://go-review.googlesource.com/c/38791/
Another interesting side effect of this proposal which I hadn't considered originally is that it allows the XML package to decode from any tokenizer that outputs XML tokens, even if the original input was not XML. It can be used to easily write "codecs" which translate other formats into XML at the token level instead of having to deal with wrapping an underlying io.Reader. For instance, one could feed an XMPP library a special decoder that took CBOR (which would be sent over the wire) as its input and converted it to XML, or which decoded the base profile of EXI (a binary compression format for XML), without the underlying XML library (encoding/xml) having to know anything about the wire format.
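To illustrate, here is a sketch of a token source backed by an in-memory slice rather than XML text; fed to a constructor like NewTokenDecoder, it would let Unmarshal run over tokens that never existed as XML (the type and names are illustrative):

// sliceTokenizer yields a fixed sequence of XML tokens; the tokens could
// just as easily come from a CBOR or EXI parser.
type sliceTokenizer struct {
	toks []xml.Token
}

func (s *sliceTokenizer) Token() (xml.Token, error) {
	if len(s.toks) == 0 {
		return nil, io.EOF
	}
	tok := s.toks[0]
	s.toks = s.toks[1:]
	return tok, nil
}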
Another thought: maybe this should have a Skip method as well.

EDIT: CL updated while I play with and think about this API.
@SamWhited, what's the current status here? I see that CL 38791 defines:
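The quoted definition didn't survive here; reconstructed from the discussion that follows, the interface in that CL was roughly (the exact names and comments may have differed):

// A Tokenizer is anything that can decode a stream of XML tokens.
type Tokenizer interface {
	Token() (Token, error)
	RawToken() (Token, error)
	Skip() error
}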
I don't see why Skip is needed. The Decoder implementation of Skip just calls Token/RawToken repeatedly until it finds the closing tag it wants. Why can't any consumer of a TokenReader do the same?

As for Token vs RawToken, it seems like there are two main differences: (1) RawToken doesn't guarantee well-formed (properly matched) input, which I hope is not a concern here, and (2) RawToken doesn't have to do namespace parsing and expansion. That's the big question in my mind: if you're doing transformation you probably do want to see the full name space info, right? So then the input back into the Decoder is going to have expanded name space info, so it should be Token not RawToken. RawToken can just return an error for a Decoder constructed by NewTokenDecoder.
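To make the Token/RawToken distinction concrete, this small runnable example shows the namespace expansion that Token performs and RawToken skips:

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

func main() {
	const doc = `<x:a xmlns:x="urn:example"/>`

	d := xml.NewDecoder(strings.NewReader(doc))
	tok, _ := d.Token()
	fmt.Printf("%+v\n", tok.(xml.StartElement).Name) // {Space:urn:example Local:a}

	d = xml.NewDecoder(strings.NewReader(doc))
	tok, _ = d.RawToken()
	fmt.Printf("%+v\n", tok.(xml.StartElement).Name) // {Space:x Local:a}
}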
That's fair; I suppose it's not that much extra stuff to reimplement, so it's probably worth leaving it off just to keep the interface small.

It does make things a little easier in some cases (no searching for the xmlns or xmlns:prefix attributes). I could go either way; the only reason I lean towards RawToken is the issue of what to do with the RawToken method on decoders that wrap a TokenReader. As you said, "RawToken can just return an error for a Decoder constructed by NewTokenDecoder."

While I suppose that's true, it feels poor to never be sure whether RawToken will work on any given Decoder (since once you've wrapped a tokenizer, there would be no way to tell a normal decoder apart from a special transformer decoder). It doesn't technically break the Decoder API, since RawToken could always return an error, but before I would have assumed that an error from RawToken was a read error; now I have to assume that it might not actually be an "error" per se, just a signal that RawToken isn't a method I should be using (even though it is a method on the struct). This doesn't feel great to me.
What do the transformers you've written assume about the form of the name space data? That seems like the key consideration.
I've written them both ways (using Token and RawToken) to see what the API "felt like"; there wasn't much difference. The only exception is
CL https://golang.org/cl/38791 mentions this issue.
Apologies for ignoring this for a while. CL 38791 PS 6 is so very simple that, as long as you can confirm that it suffices for the things you want to build outside the standard library, I'm certainly happy with it. We should leave the submit until Go 1.10 at this point. Please do build some real transformers with it to exercise that it's the right level, but it looks good to me. Will mark this approved.
Thanks @rsc; I've sort of been ignoring this one too and was focusing on the more complex DOM-like API. I rebased the CL and pushed a copy of my transformers that use this patch here: https://bitbucket.org/mellium/xmlstream/branch/encoding_xml_decode_wrapper (my CI on Bitbucket is failing, but only because it doesn't have the patch; the transformers all appear to work the same locally). The only odd one was the

EDIT: I pushed another branch that works the same way, but uses the API as if the

This is in fact a bit easier in that one case, since you can still use Skip. I'd still worry that it would feel broken for people who have to use RawToken heavily, though. The correct behavior here is eluding me.

EDIT: Pushed the change to use Token instead of RawToken again. I suspect I'm overthinking this. Token makes things marginally easier in many cases, and having RawToken just ignore the transformation if you still need to use it sort of makes sense. It's just a vague feeling that ignoring RawToken will cause problems anyway; I thought I had a concrete example of why it was bad at one point, but I can't think of it anymore, so I've changed it back.
This is a tracking issue for the design document linked below.
Abstract
The encoding/xml package contains an API for tokenizing an XML stream, but no API exists for processing or manipulating the resulting token stream. This proposal describes such an API.
Proposed API
Please see the formal proposal for a list of open questions and justification.
Example implementation: https://godoc.org/mellium.im/xmlstream
Design doc: https://golang.org/design/19480-xml-stream
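For reference, the API that ultimately shipped in Go 1.10 is the one-method xml.TokenReader interface plus xml.NewTokenDecoder. A minimal end-to-end transformer using it (the upperCharData type here is just an example, not part of the proposal):

package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// upperCharData uppercases all character data in the stream and passes
// every other token through unchanged.
type upperCharData struct {
	r xml.TokenReader
}

func (u upperCharData) Token() (xml.Token, error) {
	tok, err := u.r.Token()
	if cd, ok := tok.(xml.CharData); ok {
		tok = xml.CharData(strings.ToUpper(string(cd)))
	}
	return tok, err
}

func main() {
	d := xml.NewTokenDecoder(upperCharData{xml.NewDecoder(strings.NewReader("<a>hi</a>"))})
	var v struct {
		XMLName xml.Name `xml:"a"`
		Text    string   `xml:",chardata"`
	}
	if err := d.Decode(&v); err != nil {
		panic(err)
	}
	fmt.Println(v.Text) // HI
}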