encoding/xml: unable to handle utf-16-encoded file without manual manipulation of source bytes #38335

Open
mccolljr opened this issue Apr 9, 2020 · 1 comment

@mccolljr commented Apr 9, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14 darwin/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/mccolljr/Library/Caches/go-build"
GOENV="/Users/mccolljr/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/mccolljr/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/mccolljr/go/src/github.com/golang/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/mccolljr/go/src/github.com/golang/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/Users/mccolljr/go/src/github.com/orthly/3oxz/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4g/0y_btbcj46v3x478swzt64140000gn/T/go-build229162746=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

I have a file like the following, encoded in utf-16 (BOM is little endian) on disk:

<?xml version="1.0" encoding="utf-16"?>
<SomeValidXML></SomeValidXML>

When I read the file from disk, all of the bytes are encoded in UTF-16, including the line containing <?xml version="1.0" encoding="utf-16"?>.

I wanted to parse the full file, with no modification, using the encoding/xml package.

What did you expect to see?

I expected to be able to do one of the following: (A) transform the file's bytes to UTF-8 and pass that reader to xml.NewDecoder to parse the UTF-8 data as XML, or (B) pass the UTF-16-encoded bytes to xml.NewDecoder and provide a CharsetReader to parse the UTF-16 data as XML.

What did you see instead?

There were a couple of error cases.

  1. When I pass resultOfOsOpen directly to xml.NewDecoder, with or without setting CharsetReader to charset.NewReaderLabel: XML syntax error on line 1: invalid UTF-8
  2. When I pass the UTF-8 reader returned by charset.NewReader(resultOfOsOpen, "text/xml") to xml.NewDecoder: xml: encoding "utf-16" declared but Decoder.CharsetReader is nil
  3. When I pass the UTF-8 reader returned by charset.NewReader(resultOfOsOpen, "text/xml") to xml.NewDecoder, AND set CharsetReader to charset.NewReaderLabel: the (now UTF-8-encoded) data is interpreted as UTF-16 and the decoder reads the file as gibberish.

It seems to me that the encoding/xml package expects the line containing <?xml version="1.0" encoding="utf-16"?> to be in an encoding that reads as valid UTF-8, so that it can read the declared encoding and then parse the rest of the file. Otherwise, if the input is transformed manually beforehand, the line has to be removed. UTF-16 falls into the second case, because it cannot be read as valid UTF-8 text.

Am I missing something? Is there a way to do this without modifying the input bytes?
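
One possible approach to option A, offered as a sketch rather than something confirmed in this thread: transcode the bytes to UTF-8 first, then set CharsetReader to a function that returns its input unchanged, so the decoder accepts the utf-16 declaration without decoding the data a second time. The helpers toUTF16LE, utf16LEToUTF8 and parseUTF16LE are hypothetical names, and the example assumes little-endian input and uses only the standard library instead of the charset package.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/xml"
	"fmt"
	"io"
	"unicode/utf16"
)

// result is an assumed target type for the example document.
type result struct {
	XMLName xml.Name `xml:"SomeValidXML"`
	Value   string   `xml:",chardata"`
}

// toUTF16LE (hypothetical) builds test input: BOM plus UTF-16LE code units.
func toUTF16LE(s string) []byte {
	out := []byte{0xFF, 0xFE}
	for _, u := range utf16.Encode([]rune(s)) {
		out = append(out, byte(u), byte(u>>8))
	}
	return out
}

// utf16LEToUTF8 (hypothetical) converts UTF-16LE bytes to UTF-8 and drops
// a leading byte order mark if one is present.
func utf16LEToUTF8(b []byte) ([]byte, error) {
	if len(b)%2 != 0 {
		return nil, fmt.Errorf("odd number of bytes: %d", len(b))
	}
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i < len(b); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(b[i:]))
	}
	if len(units) > 0 && units[0] == 0xFEFF {
		units = units[1:]
	}
	return []byte(string(utf16.Decode(units))), nil
}

// parseUTF16LE transcodes raw and decodes it, leaving the XML declaration
// in place.
func parseUTF16LE(raw []byte) (result, error) {
	var r result
	utf8Data, err := utf16LEToUTF8(raw)
	if err != nil {
		return r, err
	}
	dec := xml.NewDecoder(bytes.NewReader(utf8Data))
	// The data is already UTF-8, so hand the reader back untouched
	// whatever encoding label the declaration carries.
	dec.CharsetReader = func(label string, input io.Reader) (io.Reader, error) {
		return input, nil
	}
	err = dec.Decode(&r)
	return r, err
}

func main() {
	raw := toUTF16LE(`<?xml version="1.0" encoding="utf-16"?>` + "\n" + `<SomeValidXML></SomeValidXML>`)
	r, err := parseUTF16LE(raw)
	fmt.Println(r.XMLName.Local, err)
}
```

The pass-through function is deliberately blind to the label, so it is only safe when the caller has already converted the data to UTF-8. It avoids the gibberish of case 3, where a real charset reader decodes the converted text a second time.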

@andybons (Member) commented Apr 10, 2020

@andybons andybons added this to the Unplanned milestone Apr 10, 2020