encoding/xml: Support for alternate encodings #8937

Closed
gopherbot opened this Issue Oct 15, 2014 · 4 comments

gopherbot commented Oct 15, 2014

by pico303:

In Go 1.3.3, the XML parser for Go is locked into UTF-8 encodings.  In
encoding/xml/xml.go (around line 576), there's the line:

    enc := procInstEncoding(string(data))
    if enc != "" && enc != "utf-8" && enc != "UTF-8" {

For documents with:

    <?xml version="1.0" encoding="ISO-8859-1"?>

you get this error message:

    Invalid body content: xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil
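
A minimal sketch that reproduces the `xml:` part of this error with `xml.Unmarshal`. The `Message` type and the `unmarshalLatin1` helper are hypothetical names used only for illustration; the "Invalid body content:" prefix presumably comes from the calling application.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// Message is a hypothetical target type for this illustration.
type Message struct {
	Body string `xml:"body"`
}

// unmarshalLatin1 feeds xml.Unmarshal a document that declares ISO-8859-1.
// xml.Unmarshal uses a Decoder whose CharsetReader is nil, so it fails.
func unmarshalLatin1() error {
	doc := []byte(`<?xml version="1.0" encoding="ISO-8859-1"?><message><body>hi</body></message>`)
	var msg Message
	return xml.Unmarshal(doc, &msg)
}

func main() {
	fmt.Println(unmarshalLatin1())
}
```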

You can override the reader to support alternative encodings, but this means pre-parsing
the XML []byte yourself to find the declared encoding, setting up the reader, and then parsing the XML.

Could the package be adapted somehow so you could provide alternate readers ahead of
time, based on the encoding value?  Something like this (pseudocode):

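    func init() {
        // Proposed API (hypothetical; not part of encoding/xml).
        xml.AddCharsetReader("iso-8859-1", ISO8859Reader)
    }

    func Parse(doc []byte) (SomeStruct, error) {
        var myobj SomeStruct
        if err := xml.Unmarshal(doc, &myobj); err != nil {
            // A struct value cannot be nil, so return the zero value.
            return SomeStruct{}, err
        }
        return myobj, nil
    }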

ianlancetaylor commented Oct 15, 2014

Comment 1:

Labels changed: added repo-main, release-none.

bradfitz commented Oct 16, 2014

Comment 2:

This hook already exists.
Use xml.Decoder, not xml.Unmarshal, and set Decoder.CharsetReader, as the error message
says.

Labels changed: added performance.

Status changed to WorkingAsIntended.

gopherbot commented Oct 16, 2014

Comment 3 by pico303:

Except that to do that, you have to know the encoding ahead of time. Our servers get
messages in either UTF-8 or ISO-8859-1. So we basically have to parse the incoming
stream for the encoding parameter, load the correct reader, and unmarshal.  Feels clunky.

bradfitz commented Oct 16, 2014

Comment 4:

Look at the docs:
        // CharsetReader, if non-nil, defines a function to generate
        // charset-conversion readers, converting from the provided
        // non-UTF-8 charset into UTF-8. If CharsetReader is nil or
        // returns an error, parsing stops with an error. One of
        // the CharsetReader's result values must be non-nil.
        CharsetReader func(charset string, input io.Reader) (io.Reader, error)
Your hook gets passed in the charset. You don't need to parse it yourself.
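
A minimal sketch of that approach, using only the standard library. The `Message` type and the `charsetReader`, `latin1ToUTF8`, and `Parse` functions are hypothetical names for illustration. The ISO-8859-1 conversion relies on each byte mapping to the Unicode code point with the same value, and for simplicity it buffers the rest of the document in memory rather than streaming it.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Message is a hypothetical target type for this illustration.
type Message struct {
	Body string `xml:"body"`
}

// latin1ToUTF8 reads all remaining input and re-encodes it as UTF-8.
// Every ISO-8859-1 byte has the same value as its Unicode code point.
func latin1ToUTF8(input io.Reader) (io.Reader, error) {
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return strings.NewReader(string(runes)), nil
}

// charsetReader is the hook: the decoder passes in the declared charset,
// so the caller never has to pre-parse the XML declaration.
func charsetReader(charset string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(charset) {
	case "utf-8":
		return input, nil
	case "iso-8859-1", "latin1":
		return latin1ToUTF8(input)
	}
	return nil, fmt.Errorf("unsupported charset %q", charset)
}

// Parse decodes doc whether it declares UTF-8 or ISO-8859-1.
func Parse(doc []byte) (Message, error) {
	var msg Message
	dec := xml.NewDecoder(bytes.NewReader(doc))
	dec.CharsetReader = charsetReader
	if err := dec.Decode(&msg); err != nil {
		return Message{}, err
	}
	return msg, nil
}

func main() {
	// \xe9 is "é" encoded in ISO-8859-1.
	doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><message><body>caf\xe9</body></message>")
	msg, err := Parse(doc)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(msg.Body) // prints "café"
}
```

For more encodings than this sketch covers, `charset.NewReaderLabel` from `golang.org/x/net/html/charset` has the same signature as the hook and can be assigned to `Decoder.CharsetReader` directly, assuming a dependency outside the standard library is acceptable.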

golang locked and limited conversation to collaborators Jun 25, 2016

This issue was closed.
